
    Split, Encode and Aggregate for Long Code Search

    Code search with natural language plays a crucial role in reusing existing code snippets and accelerating software development. Thanks to Transformer-based pretraining models, the performance of code search has improved significantly compared to traditional information retrieval (IR) based models. However, due to the quadratic complexity of multi-head self-attention, there is a limit on the input token length. For efficient training on standard GPUs like the V100, existing pretrained code models, including GraphCodeBERT, CodeBERT, and RoBERTa (code), take the first 256 tokens by default, which makes them unable to represent the complete information of code longer than 256 tokens. Unlike a long text paragraph, which can be regarded as a whole with complete semantics, the semantics of long code is discontinuous, as a piece of long code may contain different code modules. Therefore, it is unreasonable to directly apply long text processing methods to long code. To tackle the long code problem, we propose SEA (Split, Encode and Aggregate for Long Code Search), which splits long code into code blocks, encodes these blocks into embeddings, and aggregates them to obtain a comprehensive long code representation. With SEA, we can directly use Transformer-based pretraining models to model long code without changing their internal structure or re-pretraining them. Leveraging abstract syntax tree (AST) based splitting and attention-based aggregation methods, SEA achieves significant improvements in long code search performance. We also compare SEA with two sparse Transformer methods. With GraphCodeBERT as the encoder, SEA achieves an overall mean reciprocal ranking score of 0.785, which is 10.1% higher than GraphCodeBERT on the CodeSearchNet benchmark. Comment: 9 pages
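
The split, encode and aggregate steps described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: fixed-size token windows stand in for the AST-based splitting, `toy_encode` is a hypothetical stand-in for a pretrained encoder such as GraphCodeBERT, and the `scorer` vector stands in for learned attention parameters. All function names are made up for this example.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_blocks(tokens: Sequence[str], block_size: int) -> List[List[str]]:
    """Split tokens into consecutive blocks of at most block_size tokens.

    Assumption: fixed-size windows replace the AST-based splitting.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [list(tokens[i:i + block_size]) for i in range(0, len(tokens), block_size)]


def toy_encode(block: Sequence[str], dim: int = 8) -> Vector:
    """Hypothetical encoder: an L2-normalized bag-of-tokens vector."""
    vec = [0.0] * dim
    for tok in block:
        # Deterministic bucket: sum of the token's code points modulo dim.
        vec[sum(map(ord, tok)) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def attention_aggregate(embeddings: Sequence[Vector], scorer: Vector) -> Vector:
    """Combine block embeddings with softmax weights from dot-product scores."""
    if not embeddings:
        raise ValueError("need at least one block embedding")
    scores = [sum(e * s for e, s in zip(emb, scorer)) for emb in embeddings]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(embeddings[0])
    return [sum(w * emb[j] for w, emb in zip(weights, embeddings)) for j in range(dim)]


def encode_long_code(
    tokens: Sequence[str],
    block_size: int,
    scorer: Vector,
    encoder: Callable[[Sequence[str]], Vector] = toy_encode,
) -> Vector:
    """Split, encode each block, then aggregate into one representation."""
    blocks = split_into_blocks(tokens, block_size)
    return attention_aggregate([encoder(b) for b in blocks], scorer)
```

Because each block is encoded independently, no block exceeds the encoder's input limit, and the encoder itself is left unchanged; only the aggregation step would carry new parameters in a trained version.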

    Semantic code search and analysis

    Title from PDF of title page, viewed on July 28, 2014. Thesis advisor: Yugyung Lee. Vita. Includes bibliographical references (pages 33-35). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2014.
    As open source software repositories have grown enormously, high-quality source code has become widely available. Greater access to open source software also increases software quality and reduces the overhead of software development. However, most of the available search engines are limited to lexical or code-based searches and do not take into account the semantics that underlie the source code. Thus, object-oriented (OO) principles, such as inheritance and composition, cannot be efficiently utilized for code search or analysis. This thesis proposes a novel approach for searching source code using semantics and structure. This approach allows users to analyze software systems in terms of code similarity. For this purpose, a semantic measurement, called CoSim, was designed based on OO programming models including Package, Class, Method and Interface. We accessed and extracted source code from open source repositories like GitHub and converted it into a Resource Description Framework (RDF) model. Using the measurement, we queried the source code with the SPARQL query language and analyzed the systems. We carried out a pilot study for a preliminary evaluation of seven different versions of Apache Hadoop systems in terms of their similarities. In addition, we compared the search outputs from our system with those from GitHub Code Search. The comparison showed that our search engine provided more comprehensive and relevant information than GitHub does. In addition, the proposed CoSim measurement precisely reflected the significant and evolutionary properties of the systems in the similarity comparison of Hadoop software systems.
    Abstract -- Illustrations -- Tables -- Introduction -- Background and related work -- Semantic code search and analysis model -- Semantic code search and analysis implementation -- Results and evaluation -- Conclusion and future work -- Reference

    Genetic Programming + Proof Search = Automatic Improvement

    Search Based Software Engineering techniques are emerging as important tools for software maintenance. Foremost among these is Genetic Improvement, which has historically applied the stochastic techniques of Genetic Programming to optimize pre-existing program code. Previous work in this area has not generally preserved program semantics. This article describes an alternative to the traditional mutation operators, employing deterministic proof search in the sequent calculus to yield semantics-preserving transformations on algebraic data types. Two case studies are described, both of which are applicable to the recently introduced 'grow and graft' technique of Genetic Improvement: the first extends the expressiveness of the 'grafting' phase, and the second transforms the representation of a list data type to yield an asymptotic efficiency improvement.