Code clones can detrimentally impact software maintenance and manually
detecting them in very large codebases is impractical. Additionally, automated
approaches find detection of Type 3 and Type 4 (inexact) clones very
challenging. While the most recent artificial deep neural networks (for example
BERT-based artificial neural networks) seem to be highly effective in detecting
such clones, their pairwise comparison of every code pair in the target
system(s) is inefficient and scales poorly on large codebases.
We therefore introduce SSCD, a BERT-based clone detection approach that
targets high recall of Type 3 and Type 4 clones at scale (in line with our
industrial partner's requirements). It does so by computing a representative
embedding for each code fragment and finding similar fragments using a nearest
neighbour search. SSCD thus avoids the pairwise-comparison bottleneck of other
Neural Network approaches while also using parallel, GPU-accelerated search to
tackle scalability.
This paper details the approach and an empirical assessment towards
configuring and evaluating that approach in industrial setting. The
configuration analysis suggests that shorter input lengths and text-only based
neural network models demonstrate better efficiency in SSCD, while only
slightly decreasing effectiveness. The evaluation results suggest that SSCD is
more effective than state-of-the-art approaches like SAGA and SourcererCC. It
is also highly efficient: in its optimal setting, SSCD effectively locates
clones in the entire 320 million LOC BigCloneBench (a standard clone detection
benchmark) in just under three hours.Comment: 10 pages, 2 figures, 38th IEEE International Conference on Software
Maintenance and Evolutio