I originally ran into these after learning about something called a "sympathy crunch" from someone at BioWare, who claimed it was common there. A sympathy crunch is when you end up "crunching" even though you don't have any work to do: basically idling in the office with extreme overtime hours because other people have work to do.
> We find that LLMs can pinpoint many more issues than traditional static analysis tools, outperforming traditional tools in terms of recall and F1 scores.
Here we fucking go folks. ROLLS EYES SO HARD THEY BURN THROUGH BACK OF SKULL
> The results are compared with two traditional static analysis tools, CodeQL [13] and SpotBugs [14]. The comparison includes the best-performing configurations of these tools. Similar studies have either not used traditional tools as a comparison point [18, 19] or have not elaborated on the exact configurations and optimisations of the tools used [6].
Well, they aren't wrong there: most of these studies use super-old SAST tools.
> When the focus is on the quality of the output from LLMs, it is found that LLMs can struggle to provide correct, understandable, concise, consistent and compliant responses [32].
Oh jeez, they use Juliet, but DON'T WORRY, THEY REMOVED THE COMMENTS THAT POINT TO THE VULNERABILITY!!! Except there's a 99.99% chance this dataset has absolutely been included in the original training data that GPT-4 Turbo and Claude Opus were trained on.
> Prior research has shown leaking some relevant keywords in the code, like variables named "secure", could influence the output of the LLMs [39]. To avoid introducing this bias, these types of hints are removed from the dataset. The original dataset contains comments explaining the vulnerabilities, so all comments are removed.
That's not how it fuckin' works, mate. Not with public datasets: if the model already saw the original, commented Juliet code during training, stripping the comments at evaluation time doesn't undo the memorization.
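If future authors want a cheap sanity check, a contamination smoke test is easy to sketch. A minimal, hypothetical illustration (the snippets and "corpus" below are made up, not from the paper): measure token n-gram overlap between a benchmark sample and a candidate training corpus, and note that stripping comments barely moves the number when the code itself was scraped.

```python
def ngrams(tokens, n=8):
    """Set of token n-grams; 8-grams are a common contamination heuristic."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(sample: str, corpus: str, n: int = 8) -> float:
    """Fraction of the sample's n-grams that appear verbatim in the corpus."""
    s, c = ngrams(sample.split(), n), ngrams(corpus.split(), n)
    return len(s & c) / max(len(s), 1)

# Hypothetical Juliet-style test case, with and without its flaw comment.
commented = """
/* FLAW: data is used in an SQL query without sanitization */
String query = "SELECT * FROM users WHERE name = '" + data + "'" ;
stmt.executeQuery ( query ) ;
"""
stripped = """
String query = "SELECT * FROM users WHERE name = '" + data + "'" ;
stmt.executeQuery ( query ) ;
"""

# Pretend the public (commented) dataset was scraped into the training corpus.
training_corpus = commented

# Every n-gram of the comment-stripped code still appears verbatim in the
# corpus, so the overlap stays at 1.0: removing comments de-contaminates nothing.
print(overlap_ratio(stripped, training_corpus))  # 1.0
```

Real contamination studies use fancier probes (memorized-completion tests, canary strings), but even this crude overlap check would flag Juliet instantly.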
> By default, the Juliet dataset is configured for function or file-level vulnerability detection. Similarly to previous research, the non-standard test cases spanning multiple files or only containing vulnerable examples are removed [6].
Annnnndd there it is folks. Once again, only looking at a tiny context / scope of code is not how vulnerabilities work in the real world.
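To illustrate the gripe with an entirely made-up example (not from the paper): a tainted source in one file and an injection sink in another, each of which looks harmless when analysed alone at function or file scope.

```python
# handlers.py (hypothetical file 1): looks harmless in isolation.
def get_username(request: dict) -> str:
    # Function-level view: just reading a field, nothing to flag here.
    return request["username"]

# db.py (hypothetical file 2): the analyser can't tell in isolation
# whether `name` is attacker-controlled.
def build_query(name: str) -> str:
    # The actual flaw: string-formatted SQL, i.e. injection if `name` is tainted.
    return f"SELECT * FROM users WHERE name = '{name}'"

# The vulnerability only exists in the cross-file composition:
request = {"username": "x' OR '1'='1"}
query = build_query(get_username(request))
print(query)  # the injected predicate survives into the SQL string
```

By tossing out the multi-file test cases, the benchmark filters out exactly the class of bugs where tools (and LLM context windows) actually struggle.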
OK, the abstract is a bit misleading: yes, they got GPT-4 to a better recall/F1 score (on a dataset that was most likely included in GPT-4's training set), but you get more FPs, and it's non-deterministic, so have fun getting different results every time. Anyway, that's enough of this paper. I applaud the authors for including the cost and using newer tools, but maybe use a custom/private dataset next time.
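For anyone fuzzy on why "better recall and F1" can coexist with "way more false positives", here's the arithmetic with invented numbers (purely illustrative, not the paper's figures):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented numbers over the same 100 real vulnerabilities:
llm = precision_recall_f1(tp=90, fp=60, fn=10)   # flags aggressively, noisy
sast = precision_recall_f1(tp=50, fp=5, fn=50)   # conservative, misses half

print(f"LLM:  P={llm[0]:.2f} R={llm[1]:.2f} F1={llm[2]:.2f}")
print(f"SAST: P={sast[0]:.2f} R={sast[1]:.2f} F1={sast[2]:.2f}")
# The noisy tool wins on recall AND F1 despite 12x the false positives,
# because F1 weights precision and recall equally and the quiet tool's
# missed findings hurt more than the noisy tool's spam.
```

Which is exactly why triage cost matters: F1 is indifferent to whether an analyst has to wade through 5 or 60 bogus findings.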
Also, it's not very clear what their call graph or "vuln-related dependency prediction" task is all about; it almost looks like they're just pulling out symbols and then 'guessing' whether the symbols are calling functions. Why are they using Jaccard similarity at all??
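My best guess at what a Jaccard score over predicted dependencies would look like (this is my reconstruction, not the paper's code): compare the set of symbols the LLM claims a function depends on against a ground-truth call set.

```python
def jaccard(predicted: set, actual: set) -> float:
    """|intersection| / |union|; 1.0 = identical sets, 0.0 = disjoint."""
    if not predicted and not actual:
        return 1.0  # convention: two empty sets count as a perfect match
    return len(predicted & actual) / len(predicted | actual)

# Hypothetical example: symbols the model says `parse_config` calls...
predicted = {"open", "json.loads", "validate_schema"}
# ...vs. the call graph's ground truth.
actual = {"open", "json.loads", "log_error"}

print(jaccard(predicted, actual))  # 2 shared / 4 in the union = 0.5
```

Note that a set metric like this throws away call order, multiplicity, and edge direction entirely, which is probably why the task reads like "pull out symbols, then guess".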