Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities
Security vulnerabilities in modern software are prevalent and harmful. While
automated vulnerability detection tools have made promising progress, their
scalability and applicability remain challenging. Recently, Large Language
Models (LLMs), such as GPT-4 and CodeLlama, have demonstrated remarkable
performance on code-related tasks. However, it remains unclear whether such
LLMs can perform the complex reasoning over code that vulnerability
detection requires. In this work, we explore whether pre-trained LLMs can
detect security vulnerabilities and address the limitations of existing
tools. We evaluate pre-trained LLMs on five diverse security benchmarks
spanning two languages, Java and C/C++, and covering code samples from both
synthetic and real-world projects. We assess the LLMs along three
dimensions: performance, explainability, and robustness.
By designing a series of effective prompting strategies, we obtain the best
results on the synthetic datasets with GPT-4: F1 scores of 0.79 on OWASP, 0.86
on Juliet Java, and 0.89 on Juliet C/C++. As expected, the performance of
LLMs drops on the more challenging real-world datasets, CVEFixes Java and
CVEFixes C/C++, on which GPT-4 reports F1 scores of 0.48 and 0.62,
respectively. We show that LLMs can often outperform existing static
analysis and deep learning-based vulnerability detection tools, especially
for certain classes of vulnerabilities. Moreover, LLMs often provide
reliable explanations, identifying the vulnerable data flows in code. We
find that fine-tuned smaller LLMs can outperform the larger pre-trained
LLMs on synthetic datasets but provide limited gains on real-world
datasets. When subjected to adversarial attacks on code, LLMs show mild
degradation, with an average accuracy reduction of up to 12.67%. Finally,
we share our insights and recommendations for future work on leveraging
LLMs for vulnerability detection.
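
The abstract summarizes the prompting strategies and F1 results without
giving implementation details; the Python sketch below illustrates how such
an evaluation could be wired up: a binary classification prompt, a model
query, and F1 scoring over a labeled benchmark. The prompt wording and all
helper names (query_llm, predict, f1_score) are assumptions for
illustration, not the paper's actual prompts or code.

    # Minimal sketch: binary vulnerability-classification prompt plus F1
    # scoring over a labeled benchmark. The prompt wording and helper names
    # are illustrative assumptions, not the paper's prompting strategies.

    PROMPT_TEMPLATE = (
        "You are a security analyst. Review the following {lang} code and "
        "decide whether it contains a security vulnerability. Briefly "
        "describe the vulnerable data flow if one exists, then answer on "
        "the last line with exactly VULNERABLE or SAFE.\n\n{code}\n"
    )

    def query_llm(prompt: str) -> str:
        """Hypothetical stand-in for a chat-completion call (e.g., GPT-4)."""
        raise NotImplementedError

    def predict(code: str, lang: str) -> bool:
        """Return True if the model labels the sample as vulnerable."""
        answer = query_llm(PROMPT_TEMPLATE.format(lang=lang, code=code))
        return answer.strip().splitlines()[-1].strip().upper() == "VULNERABLE"

    def f1_score(preds: list[bool], labels: list[bool]) -> float:
        """F1 = 2TP / (2TP + FP + FN), with 'vulnerable' as positive class."""
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum(not p and l for p, l in zip(preds, labels))
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

In this framing, an F1 of 0.79 on OWASP means the model's VULNERABLE/SAFE
labels balance precision and recall on that benchmark's labeled samples.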
Automata Learning with an Incomplete Teacher (Artifact)
We provide an implementation of the automata learning software described in the associated ECOOP article. In particular, the artifact is a Docker image containing the source code for nerode and nerode-learn, along with the scripts and benchmark inputs needed to reproduce the experiments described in the paper.