4 research outputs found

Automata Learning with an Incomplete Teacher (Artifact)

Authors: Foster, Nate; Koch, Caleb; Moeller, Mark; Silva, Alexandra; Solko-Breslin, Alaia; Wiener, Thomas
Publication venue: DARTS - Dagstuhl Artifacts Series. DARTS, Volume 9, Issue 2, Special Issue of the 37th European Conference on Object-Oriented Programming (ECOOP 2023)
Publication date: 01/01/2023

We provide an implementation of the automata learning software described in the associated ECOOP article. In particular, the artifact is a Docker image with the source code for nerode and nerode-learn, along with the scripts and benchmark inputs needed to reproduce the experiments described in the paper.

Dagstuhl Research Online Publication Server

Automata Learning with an Incomplete Teacher

Authors: Foster, Nate; Koch, Caleb; Moeller, Mark; Silva, Alexandra; Solko-Breslin, Alaia; Wiener, Thomas
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 37th European Conference on Object-Oriented Programming (ECOOP 2023)
Publication date: 01/01/2023

Dagstuhl Research Online Publication Server

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

Authors: Alur, Rajeev; Dutta, Saikat; Khare, Avishree; Li, Ziyang; Naik, Mayur; Solko-Breslin, Alaia
Publication date: 16/11/2023

Security vulnerabilities in modern software are prevalent and harmful. While automated vulnerability detection tools have made promising progress, their scalability and applicability remain challenging. Recently, Large Language Models (LLMs), such as GPT-4 and CodeLlama, have demonstrated remarkable performance on code-related tasks. However, it is unknown whether such LLMs can perform complex reasoning over code. In this work, we explore whether pre-trained LLMs can detect security vulnerabilities and address the limitations of existing tools. We evaluate pre-trained LLMs on a set of five diverse security benchmarks spanning two languages, Java and C/C++, and including code samples from synthetic and real-world projects. We assess their effectiveness in terms of performance, explainability, and robustness. By designing a series of effective prompting strategies, we obtain the best results on the synthetic datasets with GPT-4: F1 scores of 0.79 on OWASP, 0.86 on Juliet Java, and 0.89 on Juliet C/C++. As expected, the performance of LLMs drops on the more challenging real-world datasets, CVEFixes Java and CVEFixes C/C++, where GPT-4 achieves F1 scores of 0.48 and 0.62, respectively. We show that LLMs can often perform better than existing static analysis and deep learning-based vulnerability detection tools, especially for certain classes of vulnerabilities. Moreover, LLMs often provide reliable explanations, identifying the vulnerable data flows in code. We find that fine-tuned smaller LLMs can outperform larger LLMs on synthetic datasets, but fine-tuning provides limited gains on real-world datasets. When subjected to adversarial attacks on code, LLMs show mild degradation, with an average accuracy reduction of up to 12.67%. Finally, we share our insights and recommendations for future work on leveraging LLMs for vulnerability detection.

arXiv.org e-Print Archive