3 research outputs found

    Detection and analysis of near-miss clone genealogies

    Get PDF
    It is believed that identical or similar code fragments in source code, also known as code clones, have an impact on software maintenance. A clone genealogy shows how a group of clone fragments evolve with the evolution of the associated software system, and thus may provide important insights on the maintenance implications of those clone fragments. Considering the importance of studying the evolution of code clones, many studies have been conducted on this topic. However, after a decade of active research, there has been a marked lack of progress in understanding the evolution of near-miss software clones, especially where statements have been added, deleted, or modified in the copied fragments. Given that there are a significant amount of near-miss clones in the software systems, we believe that without studying the evolution of near-miss clones, one cannot have a complete picture of the clone evolution. In this thesis, we have advanced the state-of-the-art in the evolution of clone research in the context of both exact and near-miss software clones. First, we performed a large-scale empirical study to extend the existing knowledge about the evolution of exact and renamed clones where identifiers have been modified in the copied fragments. Second, we have developed a framework, gCad that can automatically extract both exact and near-miss clone genealogies across multiple versions of a program and identify their change patterns reasonably fast while maintaining high precision and recall. Third, in order to gain a broader perspective of clone evolution, we extended gCad to calculate various evolutionary metrics, and performed an in-depth empirical study on the evolution of both exact and near-miss clones in six open source software systems of two different programming languages with respect to five research questions. We discovered several interesting evolutionary phenomena of near-miss clones which either contradict with previous findings or are new. Finally, we further improved gCad, and investigated a wide range of attributes and metrics derived from both the clones themselves and their evolution histories to identify certain attributes, which developers often use to remove clones in the real world. We believe that our new insights in the evolution of near-miss clones, and about how developers approach and remove duplication, will play an important role in understanding the maintenance implications of clones and will help design better clone management systems

    Codeklonerkennung mit Dominatorinformationen

    Get PDF
    If an existing function in a software project is copied and reused (in a slightly modified version), the result is a code clone. If there was an error or vulnerability in the original function, this error or vulnerability is now contained in several places in the software project. This is one of the reasons why research is being done to develop powerful and scalable clone detection techniques. In this thesis, a new clone detection method is presented that uses paths and path sets derived from the dominator trees of the functions to detect the code clones. A dominator tree is a special form of the control flow graph, which does not contain cycles. The dominator tree based method has been implemented in the StoneDetector tool and can detect code clones in Java source code as well as in Java bytecode. It has equally good or better recall and precision results than previously published code clone detection methods. The evaluation was performed using the BigCloneBench. Scalability measurements showed that even source code with several 100 million lines of code can be searched in a reasonable time. In order to evaluate the bytecode based StoneDetector variant, the BigCloneBench files had to be compiled. For this purpose, the Stubber tool was developed, which can compile Java source code files without the required libraries. Finally, it could be shown that using the register code generated from the Java bytecode, similar recall and precision values could be achieved compared to the source code based variant. Since some machine learning studies specify that very good recall and precision values can be achieved for all clone types, a machine learning method was trained with dominator trees. It could be shown that the results published by the studies are not reproducible on unseen data.Wird eine bestehende Funktion in einem Softwareprojekt kopiert und (in leicht angepasster Form) erneut genutzt, entsteht ein Codeklon. War in der ursprünglichen Funktion jedoch ein Fehler oder eine Schwachstelle, so ist dieser Fehler beziehungsweise diese Schwachstelle jetzt an mehreren Stellen im Softwareprojekt enthalten. Dies ist einer der Gründe, weshalb an der Entwicklung von leistungsstarken und skalierbaren Klonerkennungsverfahren geforscht wird. In der hier vorliegenden Arbeit wird ein neues Klonerkennungsverfahren vorgestellt, das zum Detektieren der Codeklone Pfade und Pfadmengen nutzt, die aus den Dominatorbäumen der Funktionen abgeleitet werden. Ein Dominatorbaum wird aus dem Kontrollflussgraphen abgeleitet und enthält keine Zyklen. Das Dominatorbaum-basierte Verfahren wurde in dem Werkzeug StoneDetector umgesetzt und kann Codeklone sowohl im Java-Quelltext als auch im Java-Bytecode detektieren. Dabei hat es gleich gute oder bessere Recall- und Precision-Werte als bisher veröffentlichte Codeklonerkennungsverfahren. Die Wert-Evaluierungen wurden dabei unter Verwendung des BigClone-Benchs durchgeführt. Skalierbarkeitsmessungen zeigten, dass sogar Quellcodedateien mit mehreren 100-Millionen Codezeilen in angemessener Zeit durchsucht werden können. Damit die Bytecode-basierte StoneDetector-Variante auch evaluiert werden konnte, mussten die Dateien des BigCloneBench kompiliert werden. Dazu wurde das Stubber-Tool entwickelt, welches Java-Quelltextdateien ohne die benötigten Abhängigkeiten kompilieren kann. Schlussendlich konnte somit gezeigt werden, dass mithilfe des aus dem Java-Bytecode generierten Registercodes ähnliche Recall- und Precision-Werte im Vergleich zu der Quelltext-basierten Variante erreicht werden können. Da einige Arbeiten mit maschinellen Lernverfahren angeben, bei allen Klontypen sehr gute Recall- und Precision-Werte zu erreichen, wurde ein maschinelles Lernverfahren mit Dominatoräumen trainiert. Es konnte gezeigt werden, dass die von den Arbeiten veröffentlichten Ergebnisse nicht auf ungesehenen Daten reproduzierbar sind

    Code similarity and clone search in large-scale source code data

    Get PDF
    Software development is tremendously benefited from the Internet by having online code corpora that enable instant sharing of source code and online developer's guides and documentation. Nowadays, duplicated code (i.e., code clones) not only exists within or across software projects but also between online code repositories and websites. We call them "online code clones."' They can lead to license violations, bug propagation, and re-use of outdated code similar to classic code clones between software systems. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them become outdated or violate their original license and are possibly harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also found that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique via multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse
    corecore