3 research outputs found

    A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

    Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper presents a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight were publicly accessible. The main challenges in the field are the lack of reliable datasets, empirical evaluations, hybrid methods, and attention to multi-paradigm languages. Emerging applications of code similarity measurement concentrate on the development phase in addition to maintenance. Comment: 49 pages, 10 figures, 6 tables

    Mining and Analysis of Control Structure Variant Clones

    Code duplication (software clones) is a very common phenomenon in existing software systems, and is also considered to be an indication of poor software maintainability. In recent years, the detection of clones has drawn considerable attention. The majority of existing clone detection techniques focus on the syntactic similarity of code fragments, and more specifically, they support the detection of Type-1 clones (i.e., identical code fragments except for variations in whitespace, layout, and comments), Type-2 clones (i.e., structurally/syntactically identical fragments except for variations in identifiers, literals, types, layout, and comments), and Type-3 clones (i.e., copied fragments with statements changed, added, or removed in addition to variations in identifiers, literals, types, layout, and comments). However, recent studies have shown that when developers implement the same functionality, their code solutions may differ substantially in terms of their syntactic structure. This is because developers follow different programming styles or use different language features when implementing, for instance, control structures such as loops and conditionals. From the perspective of clone management, different strategies are required to detect and refactor these control structure variant clones. Thus, there is a clear need for functionality-aware clone mining approaches that are capable of distinguishing functional clones from syntactic clones. In this thesis, we propose a method for mining control structure variant clones. More specifically, the proposed approach can mine clones that use different but functionally equivalent control structures to implement functionally similar iterations and conditionals. Our method is evaluated on six open-source systems by manually inspecting the mined clones and computing the precision and recall of our technique. Moreover, we create a publicly available benchmark of control structure variant clones. Based on the clones we found, we also propose some improvements to address the limitations of JDeodorant in the refactoring of control structure variant clones.
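To make the clone categories in the abstract above concrete, here is a minimal sketch in Python. It is not the thesis's mining method or any tool's implementation. The snippets, the `normalized_tokens` helper, and the `behaves_alike` helper are all hypothetical names introduced for illustration. The sketch assumes that comparing raw token streams (ignoring comments and layout) approximates Type-1 matching, that abstracting identifiers and literals approximates Type-2 matching, and that running two fragments on sample inputs is a rough stand-in for "functionally equivalent".

```python
import io
import keyword
import tokenize

# Hypothetical fragments: a baseline and three variants of it.
ORIGINAL = '''
def total(values):
    result = 0
    for v in values:
        result += v
    return result
'''

TYPE1 = '''
def total(values):
  result = 0   # running sum
  for v in values:
    result += v
  return result
'''

TYPE2 = '''
def sum_all(items):
    acc = 0
    for x in items:
        acc += x
    return acc
'''

# Same behaviour, but the loop is written with `while` instead of `for`.
VARIANT = '''
def total(values):
    result = 0
    i = 0
    while i < len(values):
        result += values[i]
        i += 1
    return result
'''


def normalized_tokens(source: str, abstract_names: bool) -> list[str]:
    """Return the token stream with comments and layout details removed.

    With abstract_names=True, identifiers become "ID" and number/string
    literals become "LIT", so renamed copies produce the same stream.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines carry no structure
        if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            # Keep block structure but discard the exact whitespace used.
            out.append(tokenize.tok_name[tok.type])
        elif abstract_names and tok.type == tokenize.NAME \
                and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif abstract_names and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out


def behaves_alike(src_a: str, src_b: str, inputs) -> bool:
    """Run the single function defined in each snippet on the same inputs."""
    def load(src):
        namespace = {}
        exec(src, namespace)  # acceptable here: the snippets are our own
        return next(v for k, v in namespace.items()
                    if not k.startswith("__") and callable(v))
    fa, fb = load(src_a), load(src_b)
    return all(fa(x) == fb(x) for x in inputs)


if __name__ == "__main__":
    exact = lambda s: normalized_tokens(s, abstract_names=False)
    abstracted = lambda s: normalized_tokens(s, abstract_names=True)
    print("Type-1 match:", exact(ORIGINAL) == exact(TYPE1))
    print("Type-2 match:", abstracted(ORIGINAL) == abstracted(TYPE2))
    print("Variant matches syntactically:",
          abstracted(ORIGINAL) == abstracted(VARIANT))
    print("Variant behaves alike:",
          behaves_alike(ORIGINAL, VARIANT, [[], [5], [1, 2, 3]]))
```

In this sketch the `while` variant fails both token-level comparisons yet agrees with the original on every sample input, which is the kind of gap between syntactic and functional similarity that the abstract describes. Agreement on a few inputs is only evidence of equivalence, not proof.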