Multi-Granularity Detector for Vulnerability Fixes
With the increasing reliance on Open Source Software, users are exposed to
third-party library vulnerabilities. Software Composition Analysis (SCA) tools
have been created to alert users of such vulnerabilities. SCA requires the
identification of vulnerability-fixing commits. Prior works have proposed
methods that can automatically identify such vulnerability-fixing commits.
However, identifying such commits is highly challenging, as only a very small
minority of commits are vulnerability fixing. Moreover, code changes can be
noisy and difficult to analyze. We observe that noise can occur at different
levels of detail, making it challenging to detect vulnerability fixes
accurately.
To address these challenges and boost the effectiveness of prior works, we
propose MiDas (Multi-Granularity Detector for Vulnerability Fixes). Unlike
prior works, MiDas constructs a separate neural network for each level of code
change granularity, corresponding to commit-level, file-level, hunk-level, and
line-level, following their natural organization. It then utilizes an ensemble
model that combines all base models to generate the final prediction. This
design allows MiDas to better handle the noisy and highly imbalanced nature of
vulnerability-fixing commit data. Additionally, to reduce the human effort
required to inspect code changes, we have designed an effort-aware adjustment
for MiDas's outputs based on commit length. The evaluation results demonstrate
that MiDas outperforms the current state-of-the-art baseline in terms of AUC by
4.9% and 13.7% on Java and Python-based datasets, respectively. Furthermore, in
terms of two effort-aware metrics, EffortCost@L and Popt@L, MiDas also
outperforms the state-of-the-art baseline, achieving improvements of up to
28.2% and 15.9% on Java, and 60% and 51.4% on Python, respectively.
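The ensemble and the effort-aware adjustment described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weighted average, the logarithmic damping by commit length, and the names `ensemble_score` and `effort_aware_score` are all assumptions made for illustration, since the abstract does not give the actual formulas.

```python
from math import log2

# The four granularities named in the abstract.
LEVELS = ("commit", "file", "hunk", "line")


def ensemble_score(scores, weights=None):
    """Combine per-granularity probabilities into one prediction.

    Assumption: a weighted average stands in for the ensemble model;
    the abstract does not specify how the base models are combined.
    """
    if weights is None:
        weights = {level: 1.0 for level in LEVELS}
    total = sum(weights[level] for level in LEVELS)
    return sum(weights[level] * scores[level] for level in LEVELS) / total


def effort_aware_score(probability, changed_lines):
    """Damp a prediction by commit length (hypothetical formula).

    Longer commits cost more inspection effort, so their score is reduced.
    A commit with zero changed lines keeps its score, because log2(2) == 1.
    """
    return probability / log2(changed_lines + 2)
```

Under these assumptions, two commits with the same ensemble probability are ranked so that the shorter one is inspected first, which is the intuition behind effort-aware metrics.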
CCBERT: Self-Supervised Code Change Representation Learning
Numerous code changes are made by developers in their daily work, and a superior representation of code changes is desired for effective code change analysis. Recently, Hoang et al. proposed CC2Vec, a neural network-based approach that learns a distributed representation of code changes to capture the semantic intent of the changes. Despite demonstrated effectiveness in multiple tasks, CC2Vec has several limitations: 1) it considers only coarse-grained information about code changes, and 2) it relies on log messages rather than the self-contained content of the code changes. In this work, we propose CCBERT (Code Change BERT), a new Transformer-based pre-trained model that learns a generic representation of code changes based on a large-scale dataset containing massive unlabeled code changes. CCBERT is pre-trained on four proposed self-supervised objectives that are specialized for learning code change representations based on the contents of code changes. CCBERT perceives fine-grained code changes at the token level by learning from the old and new versions of the content, along with the edit actions. Our experiments demonstrate that CCBERT significantly outperforms CC2Vec or the state-of-the-art approaches of the downstream tasks by 7.7% to 14.0% in terms of different metrics and tasks. CCBERT consistently outperforms large pre-trained code models, such as CodeBERT, while requiring 6 to 10 times less training time, 5 to 30 times less inference time, and 7.9 times less GPU memory.
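To make the idea of token-level edit actions concrete, the sketch below aligns the old and new versions of a snippet and labels each token span with an action. It is only an illustration under stated assumptions: whitespace tokenization and Python's standard `difflib.SequenceMatcher` stand in for whatever tokenizer and alignment CCBERT actually uses, and `token_edit_actions` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def token_edit_actions(old_code, new_code):
    """Return (action, old_tokens, new_tokens) triples for a code change.

    Assumption: tokens are split on whitespace. The action labels are
    difflib's opcode tags: 'equal', 'replace', 'delete', or 'insert'.
    """
    old_tokens = old_code.split()
    new_tokens = new_code.split()
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (tag, old_tokens[i1:i2], new_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def changed_spans(old_code, new_code):
    """Keep only the spans that were actually edited."""
    return [a for a in token_edit_actions(old_code, new_code) if a[0] != "equal"]
```

For example, `changed_spans("if x > 0 : return x", "if x >= 0 : return x")` isolates the single replaced comparison operator, which is the kind of fine-grained signal a commit-level or line-level view would blur.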
Mining Fix Patterns for FindBugs Violations
In this paper, we first collect and track a large number of fixed and unfixed
violations across revisions of software.
The empirical analyses reveal that there are discrepancies in the
distributions of violations that are detected and those that are fixed, in
terms of occurrences, spread and categories, which can provide insights into
prioritizing violations.
To automatically identify patterns in violations and their fixes, we propose
an approach that utilizes convolutional neural networks to learn features and
clustering to regroup similar instances. We then evaluate the usefulness of the
identified fix patterns by applying them to unfixed violations.
The results show that developers will accept and merge a majority (69/116) of
fixes generated from the inferred fix patterns. It is also noteworthy that the
yielded patterns are applicable to four real bugs in the Defects4J major
benchmark for software testing and automated repair.
Comment: Accepted for IEEE Transactions on Software Engineering
Learning representations for effective and explainable software bug detection and fixing
Software has an integral role in modern life; hence software bugs, which undermine software quality and reliability, have substantial societal and economic implications. The advent of machine learning and deep learning in software engineering has led to major advances in bug detection and fixing approaches, yet they fall short of desired precision and recall. This shortfall arises from the absence of a 'bridge,' known as learning code representations, that can transform information from source code into a suitable representation for effective processing via machine and deep learning.
This dissertation builds such a bridge. Specifically, it presents solutions for effectively learning code representations using four distinct methods (context-based, testing results-based, tree-based, and graph-based), thus improving bug detection and fixing approaches, as well as providing developers insight into the foundational reasoning. The experimental results demonstrate that using learning code representations can significantly enhance explainable bug detection and fixing, showcasing the practicability and meaningfulness of the approaches formulated in this dissertation toward improving software quality and reliability.
CC2Vec: Distributed representations of code changes
National Research Foundation (NRF) Singapore; ANR ITrans project
Changeset-based Retrieval of Source Code Artifacts for Bug Localization
Modern software development is extremely collaborative and agile, with unprecedented speed and scale of activity. Popular trends like continuous delivery and continuous deployment aim at building, fixing, and releasing software with greater speed and frequency. Bug localization, which aims to automatically localize bug reports to relevant software artifacts, has the potential to improve software developer efficiency by reducing the time spent on debugging and examining code. To date, this problem has been primarily addressed by applying information retrieval techniques based on static code elements, which are intrinsically unable to reflect how software evolves over time. Furthermore, as prior approaches frequently rely on exact term matching to measure relatedness between a bug report and a software artifact, they are prone to be affected by the lexical gap that exists between natural and programming language.
This thesis explores using software changes (i.e., changesets), instead of static code elements, as the primary data unit to construct an information retrieval model toward bug localization. Changesets, which represent the differences between two consecutive versions of the source code, provide a natural representation of a software change, and make it possible to capture both the semantics of the source code and the semantics of the code modification. To bridge the lexical gap between source code and natural language, this thesis investigates using topic modeling and deep learning architectures that enable creating semantically rich data representations with the goal of identifying latent connections between bug reports and source code. To show the feasibility of the proposed approaches, this thesis also investigates practical aspects related to using a bug localization tool, such as retrieval delay and training data availability.
The results indicate that the proposed techniques effectively leverage historical data about bugs and their related source code components to improve retrieval accuracy, especially for bug reports that are expressed in natural language, with little to no explicit code references. Further improvement in accuracy is observed when the size of the training dataset is increased through data augmentation and data balancing strategies proposed in this thesis, although the magnitude of the improvement varies depending on the model architecture. In terms of retrieval delay, the results indicate that the proposed deep learning architecture significantly outperforms prior work, and scales up with respect to search space size.
Too Few Bug Reports? Exploring Data Augmentation for Improved Changeset-based Bug Localization
Modern Deep Learning (DL) architectures based on transformers (e.g., BERT,
RoBERTa) are exhibiting performance improvements across a number of natural
language tasks. While such DL models have shown tremendous potential for use in
software engineering applications, they are often hampered by insufficient
training data. Particularly constrained are applications that require
project-specific data, such as bug localization, which aims at recommending
code to fix a newly submitted bug report. Deep learning models for bug
localization require a substantial training set of fixed bug reports, which are
available only in limited quantity even in popular and actively developed software projects.
In this paper, we examine the effect of using synthetic training data on
transformer-based DL models that perform a more complex variant of bug
localization, which has the goal of retrieving bug-inducing changesets for each
bug report. To generate high-quality synthetic data, we propose novel data
augmentation operators that act on different constituent components of bug
reports. We also describe a data balancing strategy that aims to create a
corpus of augmented bug reports that better reflects the entire source code
base, because existing bug reports used as training data usually reference a
small part of the code base.