Code search with natural language plays a crucial role in reusing existing
code snippets and accelerating software development. Thanks to
Transformer-based pretrained models, the performance of code search has
improved significantly compared to traditional information retrieval (IR) based
models. However, due to the quadratic complexity of multi-head self-attention,
there is a limit on the input token length. For efficient training on standard
GPUs such as the V100, existing pretrained code models, including GraphCodeBERT,
CodeBERT, and RoBERTa (code), take only the first 256 tokens by default, which
makes them unable to represent the complete information of code longer than 256
tokens. Unlike a long text paragraph, which can be regarded as a whole with
complete semantics, the semantics of long code are discontinuous, as a piece of
long code may contain different code modules. Therefore, it is unreasonable
to directly apply long-text processing methods to long code. To tackle the
long code problem, we propose SEA (Split, Encode and Aggregate for Long Code
Search), which splits long code into code blocks, encodes these blocks into
embeddings, and aggregates them to obtain a comprehensive long code
representation. With SEA, we can directly use Transformer-based pretrained
models to model long code without changing their internal structure or
re-pretraining them. Leveraging abstract syntax tree (AST) based splitting and
attention-based aggregation methods, SEA achieves significant improvements in
long code search performance. We also compare SEA with two sparse Transformer
methods. With GraphCodeBERT as the encoder, SEA achieves an overall mean
reciprocal rank (MRR) score of 0.785, which is 10.1% higher than GraphCodeBERT on
the CodeSearchNet benchmark.
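
To make the split-encode-aggregate pipeline concrete, the following is a
minimal sketch in PyTorch. It is an illustration under stated assumptions, not
the paper's implementation: split_into_blocks uses naive fixed-size splitting
in place of the AST-based splitting described above, encoder stands in for any
pretrained code encoder such as GraphCodeBERT, and the attention pooling shown
is one plausible form of attention-based aggregation.

```python
# Minimal sketch of the split-encode-aggregate idea (illustrative only).
# Hypothetical names: split_into_blocks, encode_blocks, aggregate.
import torch
import torch.nn.functional as F

MAX_BLOCK_TOKENS = 256  # matches the per-input token budget of the encoder

def split_into_blocks(token_ids, max_len=MAX_BLOCK_TOKENS):
    # Naive fixed-size splitting; the paper splits along AST boundaries.
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids), max_len)]

def encode_blocks(blocks, encoder):
    # Encode each block independently; encoder maps a token list to a
    # (hidden_dim,) embedding. Result: (num_blocks, hidden_dim).
    return torch.stack([encoder(block) for block in blocks])

def aggregate(block_embeddings, attn_vector):
    # Attention pooling: score each block against a learned vector,
    # softmax over blocks, and return the weighted sum as the
    # representation of the whole long code snippet.
    scores = block_embeddings @ attn_vector   # (num_blocks,)
    weights = F.softmax(scores, dim=0)        # attention weights over blocks
    return weights @ block_embeddings         # (hidden_dim,)
```

The resulting long-code embedding can then be compared with a natural-language
query embedding, for example by cosine similarity, exactly as in standard
embedding-based code search.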