This paper presents the FormAI dataset, a large collection of 112, 000
AI-generated compilable and independent C programs with vulnerability
classification. We introduce a dynamic zero-shot prompting technique
constructed to spawn diverse programs utilizing Large Language Models (LLMs).
The dataset is generated by GPT-3.5-turbo and comprises programs with varying
levels of complexity. Some programs handle complicated tasks like network
management, table games, or encryption, while others deal with simpler tasks
like string manipulation. Every program is labeled with the vulnerabilities
found within the source code, indicating the type, line number, and vulnerable
function name. This is accomplished by employing a formal verification method
using the Efficient SMT-based Bounded Model Checker (ESBMC), which uses model
checking, abstract interpretation, constraint programming, and satisfiability
modulo theories to reason over safety/security properties in programs. This
approach definitively detects vulnerabilities and offers a formal model known
as a counterexample, thus eliminating the possibility of generating false
positive reports. We have associated the identified vulnerabilities with Common
Weakness Enumeration (CWE) numbers. We make the source code available for the
112, 000 programs, accompanied by a separate file containing the
vulnerabilities detected in each program, making the dataset ideal for training
LLMs and machine learning algorithms. Our study unveiled that according to
ESBMC, 51.24% of the programs generated by GPT-3.5 contained vulnerabilities,
thereby presenting considerable risks to software safety and security.Comment: https://github.com/FormAI-Datase