Code search with natural language plays a crucial role in reusing existing
code snippets and accelerating software development. Thanks to
Transformer-based pretrained models, the performance of code search has
improved significantly compared to traditional information retrieval (IR) based
models. However, due to the quadratic complexity of multi-head self-attention,
there is a limit on the input token length. For efficient training on standard
GPUs such as the V100, existing pretrained code models, including GraphCodeBERT,
CodeBERT, and RoBERTa (code), take only the first 256 tokens by default, which
makes them unable to represent the complete information of code longer than 256
tokens. Unlike a long text paragraph, which can be regarded as a whole with
complete semantics, the semantics of long code are discontinuous, as a piece of
long code may contain different code modules. Therefore, it is unreasonable
to directly apply long-text processing methods to long code. To tackle the
long code problem, we propose SEA (Split, Encode and Aggregate for Long Code
Search), which splits long code into code blocks, encodes these blocks into
embeddings, and aggregates them to obtain a comprehensive long code
representation. With SEA, we can directly use Transformer-based pretrained
models to model long code without changing their internal structure or
re-pretraining them. Leveraging abstract syntax tree (AST) based splitting and
attention-based aggregation methods, SEA achieves significant improvements in
long code search performance. We also compare SEA with two sparse Transformer
methods. With GraphCodeBERT as the encoder, SEA achieves an overall mean
reciprocal rank (MRR) score of 0.785, which is 10.1% higher than GraphCodeBERT on
the CodeSearchNet benchmark.
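
To make the split-encode-aggregate pipeline concrete, the following is a
minimal sketch in PyTorch. It is an illustration under stated assumptions, not
the paper's implementation: split_into_blocks uses naive fixed-size splitting
in place of the AST-based splitting described above, encoder stands in for any
pretrained code encoder such as GraphCodeBERT, and the attention pooling shown
is one plausible form of attention-based aggregation.

```python
# Minimal sketch of the split-encode-aggregate idea (illustrative only).
# Hypothetical names: split_into_blocks, encode_blocks, aggregate.
import torch
import torch.nn.functional as F

MAX_BLOCK_TOKENS = 256  # matches the per-input token budget of the encoder

def split_into_blocks(token_ids, max_len=MAX_BLOCK_TOKENS):
    # Naive fixed-size splitting; the paper splits along AST boundaries.
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids), max_len)]

def encode_blocks(blocks, encoder):
    # Encode each block independently; encoder maps a token list to a
    # (hidden_dim,) embedding. Result: (num_blocks, hidden_dim).
    return torch.stack([encoder(block) for block in blocks])

def aggregate(block_embeddings, attn_vector):
    # Attention pooling: score each block against a learned vector,
    # softmax over blocks, and return the weighted sum as the
    # representation of the whole long code snippet.
    scores = block_embeddings @ attn_vector   # (num_blocks,)
    weights = F.softmax(scores, dim=0)        # attention weights over blocks
    return weights @ block_embeddings         # (hidden_dim,)
```

The resulting long-code embedding can then be compared with a natural-language
query embedding, for example by cosine similarity, exactly as in standard
embedding-based code search.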