10 research outputs found

    Similarity of Source Code in the Presence of Pervasive Modifications

    Get PDF
    Source code analysis to detect code cloning, code plagiarism, and code reuse suffers from the problem of pervasive code modifications, i.e. transformations that may have a global effect. We compare 30 similarity detection techniques and tools against pervasive code modifications. We evaluate the tools using two experimental scenarios for Java source code. These are (1) pervasive modifications created with tools for source code and bytecode obfuscation and (2) source code normalisation through compilation and decompilation using different decompilers. Our experimental results show that highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for six of the tools. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code

    A comparison of code similarity analysers

    Get PDF
    Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code

    Structural analysis of source code plagiarism using graphs

    Get PDF
    A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg in fulfillment of the requirements for the degree of Master of Science. May 2017Plagiarism is a serious problem in academia. It is prevalent in the computing discipline where students are expected to submit source code assignments as part of their assessment; hence, there is every likelihood of copying. Ideally, students can collaborate with each other to perform a programming task, but it is expected that each student submit his/her own solution for the programming task. More so, one might conclude that the interaction would make them learn programming. Unfortunately, that may not always be the case. In undergraduate courses, especially in the computer sciences, if a given class is large, it would be unfeasible for an instructor to manually check each and every assignment for probable plagiarism. Even if the class size were smaller, it is still impractical to inspect every assignment for likely plagiarism because some potentially plagiarised content could still be missed by humans. Therefore, automatically checking the source code programs for likely plagiarism is essential. There have been many proposed methods that attempt to detect source code plagiarism in undergraduate source code assignments but, an ideal system should be able to differentiate actual cases of plagiarism from coincidental similarities that usually occur in source code plagiarism. Some of the existing source code plagiarism detection systems are either not scalable, or performed better when programs are modified with a number of insertions and deletions to obfuscate plagiarism. To address this issue, a graph-based model which considers structural similarities of programs is introduced to address cases of plagiarism in programming assignments. This research study proposes an approach to measuring cases of similarities in programming assignments using an existing plagiarism detection system to find similarities in programs, and a graph-based model to annotate the programs. We describe experiments with data sets of undergraduate Java programs to inspect the programs for plagiarism and evaluate the graph-model with good precision. An evaluation of the graph-based model reveals a high rate of plagiarism in the programs and resilience to many obfuscation techniques, while false detection (coincident similarity) rarely occurred. If this detection method is adopted into use, it will aid an instructor to carry out the detection process conscientiously.MT 201

    Code similarity and clone search in large-scale source code data

    Get PDF
    Software development is tremendously benefited from the Internet by having online code corpora that enable instant sharing of source code and online developer's guides and documentation. Nowadays, duplicated code (i.e., code clones) not only exists within or across software projects but also between online code repositories and websites. We call them "online code clones."' They can lead to license violations, bug propagation, and re-use of outdated code similar to classic code clones between software systems. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them become outdated or violate their original license and are possibly harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also found that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique via multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse
    corecore