8 research outputs found

    A Mocktail of Source Code Representations

    Full text link
    Efficient representation of source code is essential for various software engineering tasks such as code search and code clone detection. One such technique for representing source code involves extracting paths from the AST and using a learning model to capture program properties. Code2vec is a commonly used path-based approach that uses an attention-based neural network to learn code embeddings which can then be used for various software engineering tasks. However, this approach uses only ASTs and does not leverage other graph structures such as Control Flow Graphs (CFG) and Program Dependency Graphs (PDG). Similarly, most recent approaches for representing source code still use AST and do not leverage semantic graph structures. Even though there exists an integrated graph approach (Code Property Graph) for representing source code, it has only been explored in the domain of software security. Moreover, it does not leverage the paths from the individual graphs. In our work, we extend the path-based approach code2vec to include semantic graphs, CFG, and PDG, along with AST, which is still largely unexplored in the domain of software engineering. We evaluate our approach on the task of MethodNaming using a custom C dataset of 730K methods collected from 16 C projects from GitHub. In comparison to code2vec, our approach improves the F1 Score by 11% on the full dataset and up to 100% with individual projects. We show that semantic features from the CFG and PDG paths are indeed helpful. We envision that looking at a mocktail of source code representations for various software engineering tasks can lay the foundation for a new line of research and a re-haul of existing research

    Effficient Graph-based Computation and Analytics

    Get PDF
    With data explosion in many domains, such as social media, big code repository, Internet of Things (IoT), and inertial sensors, only 32% of data available to academic and industry is put to work, and the remaining 68% goes unleveraged. Moreover, people are facing an increasing number of obstacles concerning complex analytics on the sheer size of data, which include 1) how to perform dynamic graph analytics in a parallel and robust manner within a reasonable time? 2) How to conduct performance optimizations on a property graph representing and consisting of the semantics of code, data, and runtime systems for big data applications? 3) How to innovate neural graph approaches (ie, Transformer) to solve realistic research problems, such as automated program repair and inertial navigation? To tackle these problems, I present two efforts along this road: efficient graph-based computation and intelligent graph analytics. Specifically, I firstly propose two theory-based dynamic graph models to characterize temporal trends in large social media networks, then implement and optimize them atop Apache Spark GraphX to improve their performances. In addition, I investigate a semantics-aware optimization framework consisting of offline static analysis and online dynamic analysis on a property graph representing the skeleton of a data-intensive application, to interactively and semi-automatically assist programmers to scrutinize the performance problems camouflaged in the source code. In the design of intelligent graph-based algorithms, I innovate novel neural graph-based approaches with multi-task learning techniques to repair a broad range of programming bugs automatically, and also improve the accuracy of pedestrian navigation systems in only consideration of sensor data of Inertial Measurement Units (IMU, ie accelerometer, gyroscope, and magnetometer). In this dissertation, I elaborate on the definitions of these research problems and leverage the knowledge of graph computation, program analysis, and deep learning techniques to seek solutions to them, followed by comprehensive comparisons with the state-of-the-art baselines and discussions on future research
    corecore