8 research outputs found
Learning to Answer Semantic Queries over Code
During software development, developers need answers to queries about
semantic aspects of code. Even though extractive question-answering using
neural approaches has been studied widely in natural languages, the problem of
answering semantic queries over code using neural networks has not yet been
explored. This is mainly because there is no existing dataset with extractive
question and answer pairs over code involving complex concepts and long chains
of reasoning. We bridge this gap by building a new, curated dataset called
CodeQueries, and proposing a neural question-answering methodology over code.
We build upon state-of-the-art pre-trained models of code to predict answer
and supporting-fact spans. Given a query and code, only some of the code may be
relevant to answer the query. We first experiment under an ideal setting where
only the relevant code is given to the model and show that our models do well.
We then experiment under three pragmatic considerations: (1) scaling to
large-size code, (2) learning from a limited number of examples and (3)
robustness to minor syntax errors in code. Our results show that while a neural
model can be resilient to minor syntax errors in code, increasing size of code,
presence of code that is not relevant to the query, and reduced number of
training examples limit the model performance. We are releasing our data and
models to facilitate future work on the proposed problem of answering semantic
queries over code
Recommended from our members
Methods and Tools for Practical Software Testing and Maintenance
As software continues to envelop traditional industries the need for increased attention to cybersecurity is higher than ever. Software security helps protect businesses and governments from financial losses due to cyberattacks and data breaches, as well as reputational damage. In theory, securing software is relatively straightforward—it involves following certain best practices and guidelines to ensure that the software is secure. In practice, however, software security is often much more complicated. It requires a deep understanding of the underlying system and code (including potentially legacy code), as well as a comprehensive understanding of the threats and vulnerabilities that could be present. Additionally, software security also involves the implementation of strategies to protect against those threats and vulnerabilities, which may involve a combination of technologies, processes, and procedures. In fact many real cyber attacks are caused not from zero day vulnerabilities but from known issues that haven't been addressed so real software security also requires ongoing monitoring and maintenance to ensure critical systems remain secure.
This thesis presents a series of novel techniques that together form an enhanced software maintenance methodology from initial bug reporting all the way through patch deployment. We begin by introducing Ad Hoc Test Generation, a novel testing technique that handles when a security vulnerability or other critical bugis not detected by the developers’ test suite, and is discovered post-deployment, developers must quickly devise a new test that reproduces the buggy behavior. Then the developers need to test whether their candidate patch indeed fixes the bug, without breaking other functionality, while racing to deploy before attackers pounce on exposed user installations. This work builds on record-replay and binary rewriting to automatically generate and run targeted tests for candidate patches significantly faster and more efficiently than traditional test suite generation techniques like symbolic execution.
Our prototype of this concept is called ATTUNE.
To construct patches in some instances developers maintaining software may be forced to deal directly with the binary since source code is no longer available. In these instances this work presents a transformer based model called DIRECT that provides semantics related names for variables and function names that have been lost giving developers the opportunity to work with a facsimile of the source code that would otherwise be unavailable. In the event developers need even more support deciphering the decompiled code we provide another tool called REINFOREST that allows developers to search for similar code which they can use to further understand the code in question and use as a reference when developing a patch.
After patches have been written, deployment remains a challenge. In some instances deploying a patch for the buggy behavior may require supporting legacy systems where software cannot be upgraded without causing compatibility issues. To support these updates this work introduces the concept of binary patch decomposition which breaks a software release down into its component parts and allows software administrators to apply only the critical portions without breaking functionality.
We present a novel software patching methodology that we can recreate bugs, develop patches, and deploy updates in the presence of the typical challenges that come when patching production software including deficient test suites, lack of source code, lack of documentation, compatibility issues, and the difficulties associated with patching binaries directly
Effficient Graph-based Computation and Analytics
With data explosion in many domains, such as social media, big code repository, Internet of Things (IoT), and inertial sensors, only 32% of data available to academic and industry is put to work, and the remaining 68% goes unleveraged. Moreover, people are facing an increasing number of obstacles concerning complex analytics on the sheer size of data, which include 1) how to perform dynamic graph analytics in a parallel and robust manner within a reasonable time? 2) How to conduct performance optimizations on a property graph representing and consisting of the semantics of code, data, and runtime systems for big data applications? 3) How to innovate neural graph approaches (ie, Transformer) to solve realistic research problems, such as automated program repair and inertial navigation? To tackle these problems, I present two efforts along this road: efficient graph-based computation and intelligent graph analytics. Specifically, I firstly propose two theory-based dynamic graph models to characterize temporal trends in large social media networks, then implement and optimize them atop Apache Spark GraphX to improve their performances. In addition, I investigate a semantics-aware optimization framework consisting of offline static analysis and online dynamic analysis on a property graph representing the skeleton of a data-intensive application, to interactively and semi-automatically assist programmers to scrutinize the performance problems camouflaged in the source code. In the design of intelligent graph-based algorithms, I innovate novel neural graph-based approaches with multi-task learning techniques to repair a broad range of programming bugs automatically, and also improve the accuracy of pedestrian navigation systems in only consideration of sensor data of Inertial Measurement Units (IMU, ie accelerometer, gyroscope, and magnetometer). In this dissertation, I elaborate on the definitions of these research problems and leverage the knowledge of graph computation, program analysis, and deep learning techniques to seek solutions to them, followed by comprehensive comparisons with the state-of-the-art baselines and discussions on future research