Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit
Code intelligence leverages machine learning techniques to extract knowledge
from extensive code corpora, with the aim of developing intelligent tools to
improve the quality and productivity of computer programming. Currently, there
is already a thriving research community focusing on code intelligence, with
efforts ranging from software engineering, machine learning, data mining,
natural language processing, and programming languages. In this paper, we
conduct a comprehensive literature review on deep learning for code
intelligence, from the aspects of code representation learning, deep learning
techniques, and application tasks. We also benchmark several state-of-the-art
neural models for code intelligence, and provide an open-source toolkit
tailored for the rapid prototyping of deep-learning-based code intelligence
models. In particular, we examine existing code intelligence models through
the lens of code representation learning, and provide a comprehensive overview
of the current state of the field.
Furthermore, we publicly release the source code and data resources to provide
the community with a ready-to-use benchmark, which can facilitate the
evaluation and comparison of existing and future code intelligence models
(https://xcodemind.github.io). Finally, we point out several challenging and
promising directions for future research.
CONCORD: Towards a DSL for Configurable Graph Code Representation
Deep learning is widely used to uncover hidden patterns in large code
corpora. To achieve this, constructing a format that captures the relevant
characteristics and features of source code is essential. Graph-based
representations have gained attention for their ability to model structural and
semantic information. However, existing tools lack flexibility in constructing
graphs across different programming languages, limiting their use.
Additionally, the output of these tools often lacks interoperability and
results in excessively large graphs, making the training of graph-based neural
networks slower and less scalable.
We introduce CONCORD, a domain-specific language to build customizable graph
representations. It implements reduction heuristics that shrink the generated
graphs. We demonstrate its effectiveness in code smell detection as an
illustrative use case and show that: first, CONCORD can produce code
representations automatically per the specified configuration, and second, our
heuristics can achieve comparable performance with significantly reduced size.
CONCORD will help researchers a) create and experiment with customizable
graph-based code representations for different software engineering tasks
involving DL, b) reduce the engineering work to generate graph representations,
c) address the issue of scalability in GNN models, and d) enhance the
reproducibility of experiments in research through a standardized approach to
code representation and analysis.
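As a rough illustration of the kind of reduction heuristic described above, the sketch below collapses straight-line chains in a directed code graph: every node with exactly one predecessor and one successor is spliced out and its neighbours reconnected, shrinking the graph while preserving reachability. The edge-set encoding and node names are assumptions chosen for illustration, not CONCORD's actual format or heuristics.

```python
# Illustrative chain-collapse heuristic on a directed graph given as a set of
# (src, dst) edges. Not CONCORD's implementation; a minimal sketch only.

def reduce_chains(edges):
    """Splice out every node with exactly one predecessor and one successor."""
    succs, preds = {}, {}
    for s, d in edges:
        succs.setdefault(s, set()).add(d)
        preds.setdefault(d, set()).add(s)

    reduced = set(edges)
    for node in list(preds):
        if len(preds.get(node, ())) == 1 and len(succs.get(node, ())) == 1:
            (p,) = preds[node]
            (s,) = succs[node]
            if p != node and s != node:
                # reconnect p -> s and drop the intermediate node's edges
                reduced.discard((p, node))
                reduced.discard((node, s))
                reduced.add((p, s))
                # keep the bookkeeping consistent for later iterations
                succs[p].discard(node); succs[p].add(s)
                preds[s].discard(node); preds[s].add(p)
    return reduced

# A tiny CFG: entry -> a -> b -> exit, plus a branch edge entry -> exit.
g = {("entry", "a"), ("a", "b"), ("b", "exit"), ("entry", "exit")}
print(reduce_chains(g))  # the a -> b chain collapses into a single edge
```

Branch points (nodes with several successors or predecessors) are left intact, so control-flow structure survives while straight-line statement sequences become single edges.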
TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation
Large-scale language models have made great progress in the field of software
engineering in recent years. They can be used for many code-related tasks such
as code clone detection, code-to-code search, and method name prediction.
However, these large-scale language models, which operate on individual code
tokens, have several drawbacks: they are usually large, heavily dependent on
labeled data, and require substantial computing power and time to fine-tune on
new datasets. Furthermore, code embedding should be performed on the entire
code snippet rather than on each code token, since token-level encoding
inflates the model's parameter count, with many parameters storing information
of little relevance. In
this paper, we propose a novel framework, called TransformCode, that learns
code embeddings in a contrastive manner. The framework uses the
Transformer encoder as an integral part of the model. We also introduce a novel
data augmentation technique called abstract syntax tree transformation: This
technique applies syntactic and semantic transformations to the original code
snippets to generate more diverse and robust anchor samples. Our proposed
framework is both flexible and adaptable: It can be easily extended to other
downstream tasks that require code representation such as code clone detection
and classification. The framework is also efficient and scalable: It does not
require a large model or a large amount of training data, and can support any
programming language. Finally, our framework is not limited to unsupervised
learning; it can also be applied to supervised tasks by incorporating
task-specific labels or objectives. To explore the effectiveness of our
framework, we conducted extensive experiments on different software engineering
tasks using multiple programming languages and datasets.
Deep Learning Software Repositories
Bridging the abstraction gap between artifacts and concepts is the essence of software engineering (SE) research problems. SE researchers regularly use machine learning to bridge this gap, but there are three fundamental issues with traditional applications of machine learning in SE research: traditional applications are too reliant on labeled data, too reliant on human intuition, and not capable of learning expressive yet efficient internal representations. Ultimately, SE research needs approaches that can automatically learn representations of massive, heterogeneous datasets in situ, apply the learned features to a particular task, and possibly transfer knowledge from task to task.

Improvements in both computational power and the amount of memory in modern computer architectures have enabled new approaches to canonical machine learning tasks. Specifically, these architectural advances have enabled machines that are capable of learning deep, compositional representations of massive data depots. The rise of deep learning has ushered in tremendous advances in several fields. Given the complexity of software repositories, we presume deep learning has the potential to usher in new analytical frameworks and methodologies for SE research and the practical applications it reaches.

This dissertation examines and enables deep learning algorithms in different SE contexts. We demonstrate that deep learners significantly outperform state-of-the-practice software language models at code suggestion on a Java corpus. Further, these deep learners for code suggestion automatically learn how to represent lexical elements. We use these representations to transmute source code into structures for detecting similar code fragments at different levels of granularity, without declaring features for how the source code is to be represented.
Then we use our learning-based framework for encoding fragments to intelligently select and adapt statements in a codebase for automated program repair. In our work on code suggestion, code clone detection, and automated program repair, everything for representing lexical elements and code fragments is mined from the source code repository. Indeed, our work aims to move SE research from the art of feature engineering to the science of automated discovery.
A Mocktail of Source Code Representations
Efficient representation of source code is essential for various software
engineering tasks such as code search and code clone detection. One such
technique for representing source code involves extracting paths from the AST
and using a learning model to capture program properties. Code2vec is a
commonly used path-based approach that uses an attention-based neural network
to learn code embeddings which can then be used for various software
engineering tasks. However, this approach uses only ASTs and does not leverage
other graph structures such as Control Flow Graphs (CFG) and Program Dependency
Graphs (PDG). Similarly, most recent approaches for representing source code
still use AST and do not leverage semantic graph structures. Even though there
exists an integrated graph approach (Code Property Graph) for representing
source code, it has only been explored in the domain of software security.
Moreover, it does not leverage the paths from the individual graphs. In our
work, we extend the path-based approach code2vec to include semantic graphs,
CFG, and PDG, along with AST, which is still largely unexplored in the domain
of software engineering. We evaluate our approach on the task of MethodNaming
using a custom C dataset of 730K methods collected from 16 C projects from
GitHub. In comparison to code2vec, our approach improves the F1 Score by 11% on
the full dataset and up to 100% with individual projects. We show that semantic
features from the CFG and PDG paths are indeed helpful. We envision that
looking at a mocktail of source code representations for various software
engineering tasks can lay the foundation for a new line of research and an
overhaul of existing research.
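To make the path-based representation concrete, here is a small sketch (an illustrative assumption, not the authors' tooling, which targets C and combines AST, CFG, and PDG paths) that extracts code2vec-style path contexts from a Python AST: each context pairs two leaf tokens with the chain of node types running through their lowest common ancestor.

```python
# Illustrative code2vec-style AST path-context extraction; a simplified
# sketch, not the paper's pipeline. Requires Python 3.8+.
import ast
import itertools

def leaf_paths(tree):
    """Return (token, root-to-leaf ancestor list) for Name and Constant leaves."""
    out = []
    def walk(node, ancestors):
        ancestors = ancestors + [node]
        if isinstance(node, ast.Name):
            out.append((node.id, ancestors))
        elif isinstance(node, ast.Constant):
            out.append((repr(node.value), ancestors))
        for child in ast.iter_child_nodes(node):
            walk(child, ancestors)
    walk(tree, [])
    return out

def path_contexts(snippet):
    """Contexts (leaf1, AST path through the lowest common ancestor, leaf2)."""
    leaves = leaf_paths(ast.parse(snippet))
    contexts = []
    for (t1, p1), (t2, p2) in itertools.combinations(leaves, 2):
        i = 0  # length of the shared ancestor prefix (by node identity)
        while i < min(len(p1), len(p2)) and p1[i] is p2[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(p1[i:])]  # leaf1 up to the LCA
        down = [type(n).__name__ for n in p2[i - 1:]]      # LCA down to leaf2
        contexts.append((t1, tuple(up + down), t2))
    return contexts

for ctx in path_contexts("x = y + 1"):
    print(ctx)
```

Extending this idea to a "mocktail" of representations would mean harvesting analogous paths from the CFG and PDG as well and feeding all three families of contexts to the attention-based encoder.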