Defectors: A Large, Diverse Python Dataset for Defect Prediction
Defect prediction has been a popular research topic where machine learning
(ML) and deep learning (DL) have found numerous applications. However, these
ML/DL-based defect prediction models are often limited by the quality and size
of their datasets. In this paper, we present Defectors, a large dataset for
just-in-time and line-level defect prediction. Defectors consists of
213K source code files (93K defective and 120K defect-free)
that span 24 popular Python projects. These projects come from 18
different domains, including machine learning, automation, and
internet-of-things. Such a scale and diversity make Defectors a suitable
dataset for training ML/DL models, especially transformer models that require
large and diverse datasets. We also foresee several application areas for our
dataset, including defect prediction and defect explanation.
Dataset link: https://doi.org/10.5281/zenodo.770898
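
The class balance implied by the file counts above can be checked with a short calculation. This is a minimal sketch that uses only the rounded figures quoted in the abstract (93K defective, 120K defect-free); the constant and function names are illustrative and are not part of the dataset or its tooling.

```python
# Rounded counts quoted in the abstract, in thousands of files.
DEFECTIVE_K = 93
DEFECT_FREE_K = 120


def class_shares(defective: int, defect_free: int) -> tuple[float, float]:
    """Return the (defective, defect-free) fractions of the total."""
    total = defective + defect_free
    return defective / total, defect_free / total


defective_share, defect_free_share = class_shares(DEFECTIVE_K, DEFECT_FREE_K)
print(f"defective: {defective_share:.1%}, defect-free: {defect_free_share:.1%}")
# prints "defective: 43.7%, defect-free: 56.3%"
```

On these rounded figures the two classes are fairly close to balanced, which is relevant when choosing evaluation metrics for a defect prediction model.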