Clone Detection: How accurate is your data set?

Abstract

Duplication of code in software systems is considered a serious problem that can affect a system's maintainability and extensibility. It is reported that 10-15% of the code in a software system is involved in cloning. However, because it is difficult to measure the number of false positives in a clone result set objectively, the accuracy of these reports is hard to evaluate. Although the topic is important, little work has been done on evaluating the accuracy of clone detection methods. In this paper we propose a study that objectively estimates the number of false positives likely to be present in a data set by measuring the number of clones found in a large body of unrelated code. We also propose a method to measure the impact of external factors, such as programming idioms and API protocols, on the detected result set. The results of this work will provide tools and knowledge to better evaluate the current state of the art in clone detection research.
