    Deep Learning Software Repositories

    Bridging the abstraction gap between artifacts and concepts is the essence of software engineering (SE) research problems. SE researchers regularly use machine learning to bridge this gap, but there are three fundamental issues with traditional applications of machine learning in SE research: traditional applications are too reliant on labeled data, too reliant on human intuition, and incapable of learning expressive yet efficient internal representations. Ultimately, SE research needs approaches that can automatically learn representations of massive, heterogeneous datasets in situ, apply the learned features to a particular task, and possibly transfer knowledge from task to task. Improvements in both computational power and the amount of memory in modern computer architectures have enabled new approaches to canonical machine learning tasks. Specifically, these architectural advances have enabled machines that are capable of learning deep, compositional representations of massive data depots. The rise of deep learning has ushered in tremendous advances in several fields, and given the complexity of software repositories, we presume deep learning has the potential to usher in new analytical frameworks and methodologies for SE research and the practical applications it reaches. This dissertation examines and enables deep learning algorithms in different SE contexts. We demonstrate that deep learners significantly outperform state-of-the-practice software language models at code suggestion on a Java corpus. Further, these deep learners for code suggestion automatically learn how to represent lexical elements. We use these representations to transmute source code into structures for detecting similar code fragments at different levels of granularity, without declaring features for how the source code is to be represented. Then we use our learning-based framework for encoding fragments to intelligently select and adapt statements in a codebase for automated program repair. In our work on code suggestion, code clone detection, and automated program repair, everything for representing lexical elements and code fragments is mined from the source code repository. Indeed, our work aims to move SE research from the art of feature engineering to the science of automated discovery.
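    As a concrete illustration of the kind of deep learner the dissertation evaluates for code suggestion, the sketch below shows a token-level LSTM language model in PyTorch. The architecture, layer sizes, and vocabulary size are illustrative assumptions, not the dissertation's exact configuration; the key point is that the embedding table, i.e., the learned representation of lexical elements, is trained jointly with the predictor rather than hand-engineered.

```python
# A minimal sketch of a token-level neural language model for code
# suggestion, in the spirit of the dissertation's deep learners.
# Layer sizes and vocabulary size are illustrative assumptions.
import torch
import torch.nn as nn

class CodeSuggestionLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # The embedding table is the learned representation of lexical
        # elements; it is trained jointly with the rest of the model
        # instead of being hand-engineered.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) indices of lexed source-code tokens
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.proj(hidden)  # next-token logits at every position

# A suggestion is the most probable next token given the prefix.
model = CodeSuggestionLM(vocab_size=10_000)
prefix = torch.randint(0, 10_000, (1, 8))  # stand-in for a lexed Java prefix
suggestion = model(prefix)[0, -1].argmax().item()
```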

    Static detection of control-flow-related vulnerabilities using graph embedding

    © 2019 IEEE. Static vulnerability detection has shown its effectiveness in detecting well-defined low-level memory errors. However, high-level control-flow-related (CFR) vulnerabilities, such as insufficient control flow management (CWE-691), business logic errors (CWE-840), and program behavioral problems (CWE-438), are often caused by a wide variety of bad programming practices and pose a great challenge for existing general static analysis solutions. This paper presents a new deep-learning-based graph embedding approach for accurately detecting CFR vulnerabilities. Our approach makes a new attempt at this problem by applying a recent graph convolutional network to embed code fragments in a compact, low-dimensional representation that preserves the high-level control-flow information of a vulnerable program. We conducted experiments on 8,368 real-world vulnerable programs, comparing our approach with several traditional static vulnerability detectors and state-of-the-art machine-learning-based approaches. The experimental results show the effectiveness of our approach in terms of both accuracy and recall. Our research sheds light on the promising direction of combining program analysis with deep learning techniques to address general static analysis challenges.
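    The abstract names its core mechanism, a graph convolutional network over control-flow structure, concretely enough to sketch. Below is a minimal, generic GCN layer following the standard Kipf-and-Welling propagation rule, applied to a toy control-flow graph; it illustrates the general technique only, not the paper's exact model or features.

```python
# A generic graph-convolutional layer (Kipf & Welling propagation rule)
# applied to a toy control-flow graph. This illustrates the general
# technique the abstract names, not the paper's exact model.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # Add self-loops and symmetrically normalize, so each basic
        # block aggregates its own features and its CFG neighbors'.
        a_hat = adj + torch.eye(adj.size(0))
        norm = a_hat.sum(dim=1).rsqrt()
        a_norm = norm[:, None] * a_hat * norm[None, :]
        return torch.relu(self.lin(a_norm @ x))

# Toy CFG: 4 basic blocks; block 0 branches to 1 and 2, which join at 3.
adj = torch.tensor([[0., 1., 1., 0.],
                    [0., 0., 0., 1.],
                    [0., 0., 0., 1.],
                    [0., 0., 0., 0.]])
x = torch.randn(4, 16)                   # per-block feature vectors
embedding = GCNLayer(16, 32)(x, adj).mean(dim=0)  # pooled program embedding
```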

    FFCV: Accelerating Training by Removing Data Bottlenecks

    We present FFCV, a library for easy and fast machine learning model training. FFCV speeds up model training by eliminating (often subtle) data bottlenecks from the training process. In particular, we combine techniques such as an efficient file storage format, caching, data pre-loading, asynchronous data transfer, and just-in-time compilation to (a) make data loading and transfer significantly more efficient, ensuring that GPUs can reach full utilization; and (b) offload as much data processing as possible to the CPU asynchronously, freeing GPU cycles for training. Using FFCV, we train ResNet-18 and ResNet-50 on the ImageNet dataset with a competitive tradeoff between accuracy and training time. For example, we are able to train an ImageNet ResNet-50 model to 75% accuracy in only 20 minutes on a single machine. We demonstrate FFCV's performance, ease of use, extensibility, and ability to adapt to resource constraints through several case studies. Detailed installation instructions, documentation, and a Slack support channel are available at https://ffcv.io/.
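    To make the asynchronous-transfer idea concrete without guessing at FFCV's own API (which is documented at https://ffcv.io/), here is a generic PyTorch sketch of one technique the abstract names: double-buffered, non-blocking host-to-device copies issued on a side CUDA stream so data transfer overlaps with the training step. The function name and loop structure are illustrative assumptions, not FFCV code.

```python
# Double-buffered host-to-device prefetching in plain PyTorch: copies
# for the next batch run on a side CUDA stream while the current step
# computes. This is a generic sketch of the overlap technique, not
# FFCV code; `step_fn` is a hypothetical training-step callback.
import torch

def prefetching_loop(loader, step_fn, device='cuda'):
    copy_stream = torch.cuda.Stream()

    def stage(batch):
        # Issue non-blocking copies on the side stream; pinned memory
        # is required for truly asynchronous transfers.
        with torch.cuda.stream(copy_stream):
            return tuple(t.pin_memory().to(device, non_blocking=True)
                         for t in batch)

    it = iter(loader)
    staged = stage(next(it))
    for batch in it:
        # The default stream must wait for the staged copy before using it.
        torch.cuda.current_stream().wait_stream(copy_stream)
        current, staged = staged, stage(batch)
        step_fn(*current)   # compute here overlaps with the next copy
    torch.cuda.current_stream().wait_stream(copy_stream)
    step_fn(*staged)
```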

    Source-code Summarization of Java Methods Using Control-Flow Graphs

    Source-code summarization aims to generate natural-language summaries for software artifacts (e.g., methods and classes), and it has become an established research area in software engineering. Prior work has applied text-retrieval-based, heuristic-based, and data-driven techniques to the problem. In data-driven techniques, researchers have used sequences of source-code tokens and other representations of source code (e.g., application programming interface (API) sequences and abstract syntax trees (ASTs)) as inputs to source-code summarization models. According to the published literature, however, researchers have not explored sequences extracted from the control-flow graph, which captures the contextual relationships between program instructions, as an input to these models. In this work, we employ control-flow graph representations to increase the prediction accuracy of a bidirectional long short-term memory (LSTM) source-code summarization model in terms of describing the functionality of Java methods. We use an attention-based bidirectional LSTM sequence-to-sequence model that consumes linearized control-flow graph sequences alongside a sequence of source-code tokens. We compared our model, with and without the linearized control-flow graph, against the current state of the art. We created a source-code summarization dataset to train and evaluate our approach and conducted expert and automatic evaluations. In the expert evaluation, participants rated the summaries generated by each model on how correctly they describe the functionality of a Java method. Our models outperformed the state of the art in terms of mean average rating, and the expert evaluation showed that the model benefits from the structural information. In the automatic evaluation, we found that the use of control-flow graphs does not increase the prediction accuracy of a bidirectional LSTM model in terms of BLEU score compared to one that does not use them. However, we found that our approach, which uses the control-flow graph as an additional representation, performs better than encoding the AST in graph neural networks. Overall, we improved the state of the art for method summarization with models that take a sequence of method tokens with and without a control-flow graph.
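    The step the abstract hinges on, turning a control-flow graph into a sequence a seq2seq model can consume, is easy to sketch. Below is one plausible linearization: a breadth-first walk over basic blocks that emits each block's tokens followed by markers for its successors. The traversal order and the '<succ>'/'<end>' markers are assumptions for illustration; the abstract does not fix these details.

```python
# One plausible way to linearize a control-flow graph into a token
# sequence for a sequence-to-sequence summarizer. The traversal order
# and the '<succ>'/'<end>' markers are assumptions for illustration.
from collections import deque

def linearize_cfg(blocks, edges, entry):
    """blocks: {block_id: [tokens]}; edges: {block_id: [successor ids]}."""
    order, seen, queue = [], {entry}, deque([entry])
    while queue:                        # breadth-first over basic blocks
        node = queue.popleft()
        order.append(node)
        for succ in edges.get(node, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    seq = []
    for node in order:                  # emit tokens plus successor markers
        seq += blocks[node] + ['<succ>'] + \
               [str(s) for s in edges.get(node, [])] + ['<end>']
    return seq

# Toy Java-like method: an if/else whose branches join at the return.
blocks = {0: ['if', 'x', '>', '0'], 1: ['y', '=', '1'],
          2: ['y', '=', '-1'], 3: ['return', 'y']}
edges = {0: [1, 2], 1: [3], 2: [3]}
print(linearize_cfg(blocks, edges, entry=0))
```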

    Refining Decompiled C Code with Large Language Models

    A C decompiler converts an executable into source code. The recovered C source code, once re-compiled, is expected to produce an executable with the same functionality as the original. With over twenty years of development, C decompilers have been widely used in production to support reverse engineering applications. Despite this prosperous development, it is widely acknowledged that decompiler outputs are mainly intended for human consumption and are not suitable for automatic recompilation; a substantial amount of manual effort is often required to fix them before they can be recompiled and executed properly. This paper is motivated by the recent success of large language models (LLMs) in comprehending dense corpora of natural language. To alleviate the tedious, costly, and often error-prone manual effort of fixing decompiler outputs, we investigate the feasibility of using LLMs to augment decompiler outputs, thus delivering recompilable decompilation. Note that, unlike previous efforts that focus on augmenting decompiler outputs with higher readability (e.g., recovering type/variable names), we focus on augmenting them with recompilability, meaning the generated code can be recompiled into an executable with the same functionality as the original. We conduct a pilot study to characterize the obstacles to recompiling the outputs of the de facto commercial C decompiler, IDA-Pro. We then propose a two-step, hybrid approach to augmenting decompiler outputs with LLMs. We evaluate our approach on a set of popular C test cases and show that it can deliver a recompilation success rate of over 75% with moderate effort, whereas none of IDA-Pro's original outputs can be recompiled. We conclude with a discussion of the limitations of our approach and promising future research directions.
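    A natural skeleton for this kind of pipeline is a compile-check loop that feeds compiler diagnostics back to the model until the code recompiles. The sketch below shows that loop; `ask_llm` is a hypothetical stand-in for whatever LLM completion API is used, and the prompt wording and retry budget are assumptions, not the paper's exact two-step method.

```python
# A minimal fix-and-recompile loop of the kind the paper's approach
# implies. `ask_llm` is a hypothetical stand-in for an LLM completion
# call; the prompt wording and retry budget are assumptions.
import os
import subprocess
import tempfile

def try_compile(c_source):
    """Return (ok, compiler diagnostics) for a C translation unit."""
    with tempfile.NamedTemporaryFile('w', suffix='.c', delete=False) as f:
        f.write(c_source)
        path = f.name
    try:
        result = subprocess.run(['gcc', '-c', path, '-o', os.devnull],
                                capture_output=True, text=True)
        return result.returncode == 0, result.stderr
    finally:
        os.unlink(path)

def refine_decompiled(decompiled, ask_llm, max_rounds=3):
    code = decompiled
    for _ in range(max_rounds):
        ok, errors = try_compile(code)
        if ok:
            return code
        # Feed compiler diagnostics back so the model can patch missing
        # declarations, undefined types, and similar recompilation blockers.
        code = ask_llm('Fix this decompiled C so it recompiles, preserving '
                       f'behavior.\nCompiler errors:\n{errors}\n\nCode:\n{code}')
    return None  # still not recompilable within the retry budget
```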