Clone Detection: How accurate is your data set?

Cory J. Kapser; Michael W. Godfrey

Abstract

Duplication of code in software systems is considered to be a serious problem that can affect a system's maintainability and extendability. It is reported that 10-15% of the code in a software system is involved in cloning. However, because of the difficulty of objectively measuring the number of false positives in a clone result set, the accuracy of these reports is hard to evaluate. Although this is an important topic, little work has been done on evaluating the accuracy of clone detection methods. In this paper we propose a study to objectively estimate the number of false positives likely to be in a data set by measuring the number of clones found in a large body of unrelated code. We also propose a method to measure the impact of external factors, such as programming idioms and API protocols, on the detected result set. The results of this work will provide tools and knowledge to better evaluate the current state of the art of clone detection research.

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.60.49...

Last time updated on 22/10/2014