SourcererCC: Scaling Code Clone Detection to Big Code
Despite a decade of active research, there is a marked lack of clone
detectors that scale to very large repositories of source code, in particular
for detecting near-miss clones where significant editing activities may take
place in the cloned code. We present SourcererCC, a token-based clone detector
that targets three clone types, and exploits an index to achieve scalability to
large inter-project repositories using a standard workstation. SourcererCC uses
an optimized inverted index to quickly query the potential clones of a given
code block. Filtering heuristics based on token ordering significantly reduce
the size of the index, the number of code-block comparisons needed to detect
clones, and the number of token comparisons needed to judge a potential clone.
We evaluate the scalability, execution time, recall and precision of
SourcererCC, and compare it to four publicly available and state-of-the-art
tools. To measure recall, we use two recent benchmarks, (1) a large benchmark
of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of
thousands of fine-grained artificial clones. We find SourcererCC has both high
recall and precision, and is able to scale to a large inter-project repository
(250 MLOC) using a standard workstation. Comment: Accepted for publication at ICSE'16 (preprint, unrevised).
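A minimal sketch (in Python) of the general idea behind such index-based candidate retrieval is shown below. It is not SourcererCC's implementation: the overlap-similarity threshold, the alphabetical token ordering (standing in for a global frequency-based ordering), and the treatment of code blocks as token sets rather than bags are all simplifying assumptions.

# Sketch only: inverted-index lookup with a prefix filter for candidate clones.
import math
from collections import defaultdict

THETA = 0.8  # assumed overlap-similarity threshold

def prefix_length(num_tokens):
    # Two blocks can only reach the threshold if they share at least one
    # token from this prefix of their ordered token lists.
    return num_tokens - int(math.ceil(THETA * num_tokens)) + 1

def build_index(blocks):
    index = defaultdict(list)  # token -> ids of blocks whose prefix contains it
    for bid, tokens in blocks.items():
        ordered = sorted(set(tokens))  # stand-in for a global token ordering
        for tok in ordered[:prefix_length(len(ordered))]:
            index[tok].append(bid)
    return index

def query(tokens, index, blocks):
    ordered = sorted(set(tokens))
    candidates = set()
    for tok in ordered[:prefix_length(len(ordered))]:
        candidates.update(index.get(tok, []))
    # Verify each surviving candidate with a full token-overlap comparison.
    results = []
    for bid in candidates:
        a, b = set(tokens), set(blocks[bid])
        if len(a & b) >= math.ceil(THETA * max(len(a), len(b))):
            results.append(bid)
    return results

Indexing only each block's prefix keeps the index small, and most non-clones are rejected without any token comparison, which is the property the abstract credits for the tool's scalability.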
Efficiently Measuring an Accurate and Generalized Clone Detection Precision using Clone Clustering
An important measure of clone detection performance is precision. However, there has been a marked lack of research into methods for efficiently and accurately measuring the precision of a clone detection tool. Instead, tool authors simply validate a small random sample of the clones their tools detected in a subject software system. Since there can be many thousands of clones reported by the tool, such a small random sample cannot guarantee an accurate and generalized measure of the tool's precision for all the varieties of clones that can occur in any arbitrary software system. In this paper, we propose a machine-learning-based approach that clusters similar clones together and can be used to maximize the variety of clones examined when measuring precision, while significantly reducing the biases a specific subject system has on the generality of the measured precision. Our technique reduces the effort of measuring precision, while doubling the variety of clones validated and reducing biases that harm the generality of the measure by up to an order of magnitude. Our case study with the NiCad clone detector and the Java class library shows that our approach is effective in efficiently measuring an accurate and generalized precision for a subject clone detection tool.
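As an illustration of the cluster-then-sample idea, the hedged sketch below groups reported clone pairs by a numeric feature vector and draws a validation sample from every cluster instead of a flat random sample. The feature representation, the use of k-means, and the cluster count are assumptions for the sketch, not the paper's exact pipeline.

# Hedged sketch: cluster clone pairs, then sample across clusters for validation.
import random
import numpy as np
from sklearn.cluster import KMeans

def sample_for_validation(pair_features, k=20, per_cluster=1, seed=0):
    # pair_features: one numeric feature vector per reported clone pair
    # (e.g. size, similarity score, token-frequency profile) -- assumed here.
    X = np.asarray(pair_features, dtype=float)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    rng = random.Random(seed)
    chosen = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen  # indices of clone pairs to inspect manually

def estimate_precision(manual_verdicts):
    # manual_verdicts: True/False judgements for the sampled pairs.
    return sum(manual_verdicts) / len(manual_verdicts)

Because every cluster contributes to the sample, the validated pairs cover a wider variety of clone kinds than a random sample of the same size, which is the effect the abstract reports.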
ForkSim: Generating software forks for evaluating cross-project similarity analysis tools
Software project forking, that is, copying an existing project and developing a new, independent project from the copy, occurs frequently in software development. Analysing the code similarities between such software projects is useful, as developers can use similarity information to merge the forked systems or migrate them towards a reuse approach. Several techniques for detecting cross-project similarities have been proposed; however, no good benchmark for measuring their performance is available. We developed ForkSim, a tool for generating datasets of synthetic software forks with known similarities and differences. This allows the performance of cross-project similarity tools to be measured in terms of recall and precision by comparing their output to the known properties of the generated dataset. These datasets can also be used in controlled experiments to evaluate further aspects of the tools, such as usability or visualization concepts. As a demonstration of our tool, we evaluated the performance of the clone detector NiCad for similarity detection across software forks, which showed the high potential of ForkSim.
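The sketch below illustrates, in spirit only, how synthetic forks with a recorded ground truth might be generated: each fork starts as a copy of a seed project and receives logged edits, so that a similarity tool's output can later be scored against the log. The directory layout, the edit operations, and the file selection are hypothetical and far simpler than ForkSim's actual injection of reused, modified, and independently developed code.

# Illustrative only: generate forks of a seed project and log the known changes.
import random
import shutil
from pathlib import Path

def make_forks(seed_project, out_dir, n_forks=3, seed=0):
    rng = random.Random(seed)
    ground_truth = []                        # (fork, original file, operation)
    for i in range(n_forks):
        fork = Path(out_dir) / f"fork_{i}"
        shutil.copytree(seed_project, fork)  # every fork starts as an identical copy
        for src in list(fork.rglob("*.java")):
            op = rng.choice(["keep", "rename", "delete"])
            if op == "rename":
                src.rename(src.with_name(f"Fork{i}_" + src.name))
            elif op == "delete":
                src.unlink()
            ground_truth.append((fork.name, src.name, op))
    return ground_truth  # a tool's reported similarities can be scored against this

Because every difference between the forks is recorded at generation time, recall and precision can be computed exactly rather than estimated by manual inspection.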