Search CORE

13,625 research outputs found

Attack of the clones: an investigation into removing redundant source code

Author: Bailey John Oliver
Publication venue
Publication date: 01/01/2002
Field of study

Long-term maintenance of code will often lead to the introduction of duplicated or 'cloned' code. Legacy systems riddled with these clones have large amounts of redundant code and are more difficult to understand and maintain. One option available to improve maintainability and to increase software reuse, is to re-engineer code clones into reusable components. However, before this can be achieved detection and removal of this redundant code is necessary. There are several established clone detection tools for software maintenance and this thesis aims to investigate the similarities between their output. It also looks at how maintainers may best use them to reduce the amount of redundant code in a software system. This will be achieved by running clone detection tools on several different case studies. Included in these case studies will be a novel tool called Covet inspired by research of Mayrand [May96b] which attempted to identify cloned routines through a comparison of software metrics generated from each routine. It was found that none of the clone detection tools achieved either 100% precision or 100% recall. Each tool identified very different sets of clones. Overall MOSS achieved the greatest precision and CCFinder the greatest recall. Also observed was that the use of automatically generated code increased the proportion of clones found in a software system

Durham e-Theses

Code similarity and clone search in large-scale source code data

Author: Ragkhitwetsagul Chaiyong
Publication venue: UCL (University College London)
Publication date: 28/10/2018
Field of study

Software development is tremendously benefited from the Internet by having online code corpora that enable instant sharing of source code and online developer's guides and documentation. Nowadays, duplicated code (i.e., code clones) not only exists within or across software projects but also between online code repositories and websites. We call them "online code clones."' They can lead to license violations, bug propagation, and re-use of outdated code similar to classic code clones between software systems. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them become outdated or violate their original license and are possibly harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also found that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique via multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse

UCL Discovery

SLACC: Simion-based Language Agnostic Code Clones

Author: Beit-Aharon Jonathan
Bowles Stephen W
Elva Rochelle
Kessel Marcus
Khan Mohd Ehmer
Li Zhenmin
Su Fang-Hsiang
Walenstein Andrew
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 07/02/2020
Field of study

Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. However, identifying cross-language clones presents special challenges to the clone detection problem. A lack of common underlying representation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis framework replicated across each targeted language with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior. In this work, we demonstrate the feasibility of the latter solution, a dynamic analysis approach called SLACC for cross-language clone detection. Like prior clone detection techniques, we use input/output behavior to match clones, though we overcome limitations of prior work by amplifying the number of inputs and covering more data types; and as a result, achieve better clusters than prior attempts. Since clusters are generated based on input/output behavior, SLACC supports cross-language clone detection. As an added challenge, we target a static typed language, Java, and a dynamic typed language, Python. Compared to HitoshiIO, a recent clone detection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%). This is the first work to perform clone detection for dynamic typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying representation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools.Comment: 11 Pages, 3 Figures, Accepted at ICSE 2020 technical trac

arXiv.org e-Print Archive

Crossref

SourcererCC: Scaling Code Clone Detection to Big Code

Author: Lopes Cristina V.
Roy Chanchal K.
Saini Vaibhav
Sajnani Hitesh
Svajlenko Jeffrey
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/12/2015
Field of study

Despite a decade of active research, there is a marked lack in clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks, (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250MLOC) using a standard workstation.Comment: Accepted for publication at ICSE'16 (preprint, unrevised

arXiv.org e-Print Archive

Crossref

Recommended from our members

Code Relatives: Detecting Similar Software Behavior

Author: Harvey Kenneth
Jebara Tony
Kaiser Gail E.
Sethumadhavan Simha
Su Fang-Hsiang
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2015
Field of study

Detecting "similar code" is fundamental to many software engineering tasks. Current tools can help detect code with statically similar syntactic features (code clones). Unfortunately, some code fragments that behave alike without similar syntax may be missed. In this paper, we propose the term "code relatives" to refer to code with dynamically similar execution features. Code relatives can be used for such tasks as implementation-agnostic code search and classification of code with similar behavior for human understanding, which code clone detection cannot achieve. To detect code relatives, we present DyCLINK, which constructs an approximate runtime representation of code using a dynamic instruction graph. With our link analysis based subgraph matching algorithm, DyCLINK detects fine-grained code relatives efficiently. In our experiments, DyCLINK analyzed 290+ million prospective subgraph matches. The results show that DyCLINK detects not only code relatives, but also code clones that the state-of-the-art system is unable to identify. In a code classification problem, DyCLINK achieved 96% precision on average compared with the competitor's 61%

Columbia University Academic Commons

An Extended Stable Marriage Problem Algorithm for Clone Detection

Author: AlHakami Hosam
Chen Feng
Janicke Helge
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 01/01/2014
Field of study

Code cloning negatively affects industrial software and threatens intellectual property. This paper presents a novel approach to detecting cloned software by using a bijective matching technique. The proposed approach focuses on increasing the range of similarity measures and thus enhancing the precision of the detection. This is achieved by extending a well-known stable-marriage problem (SMP) and demonstrating how matches between code fragments of different files can be expressed. A prototype of the proposed approach is provided using a proper scenario, which shows a noticeable improvement in several features of clone detection such as scalability and accuracy.Comment: 20 pages, 10 figures, 6 table

arXiv.org e-Print Archive

CiteSeerX