Survey of Research on Software Clones
This report summarizes my overview talk on software clone detection
research. It first discusses the notion of software redundancy, cloning, duplication,
and similarity. Then, it describes various categorizations of clone types, empirical
studies on the root causes for cloning, current opinions and wisdom of consequences
of cloning, empirical studies on the evolution of clones, ways to remove, to avoid,
and to detect them, empirical evaluations of existing automatic clone detector performance
(such as recall, precision, time and space consumption) and their fitness
for a particular purpose, benchmarks for clone detector evaluations, presentation
issues, and last but not least application of clone detection in other related fields.
After each summary of a subarea, I list open research questions.
Comparison and Evaluation of Clone Detection Tools
Many techniques for detecting duplicated source code (software clones) have been proposed in the past. However, it is not yet clear how these techniques compare in terms of recall and precision as well as space and time requirements. This paper presents an experiment that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC). Their clone candidates were evaluated by one of the authors as an independent third party. The selected techniques cover the whole spectrum of the state-of-the-art in clone detection. The techniques work on text, lexical and syntactic information, software metrics, and program dependency graphs.
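As a rough illustration of the two quality measures named above, the following sketch computes precision and recall for a detector's clone candidates against a reference set. It assumes exact matching of clone pairs, which is a simplification and not necessarily the matching criterion used in the experiment; the function names, file names, and line ranges are invented for the example.

```python
def make_pair(fragment_a, fragment_b):
    """A clone pair is unordered, so store it as a frozenset of two fragments."""
    return frozenset((fragment_a, fragment_b))


def precision_recall(candidates, reference):
    """Return (precision, recall) of reported clone pairs against a reference set.

    Assumption: a candidate counts as correct only if it matches a reference
    pair exactly. Real evaluations often accept partial overlap instead.
    """
    candidates, reference = set(candidates), set(reference)
    hits = candidates & reference
    precision = len(hits) / len(candidates) if candidates else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical fragments: (file, first line, last line).
f1 = ("a.c", 10, 20)
f2 = ("b.c", 30, 40)
f3 = ("c.c", 5, 15)
f4 = ("d.c", 50, 60)
f5 = ("e.c", 70, 80)

# Pairs confirmed as clones by a human judge (invented data).
reference = [make_pair(f1, f2), make_pair(f1, f3), make_pair(f2, f3), make_pair(f4, f5)]

# Pairs reported by a detector (invented data): three correct, two spurious.
candidates = [
    make_pair(f1, f2),
    make_pair(f3, f1),  # same pair as (f1, f3), order does not matter
    make_pair(f4, f5),
    make_pair(f1, f4),
    make_pair(f2, f5),
]

precision, recall = precision_recall(candidates, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the five reported pairs are in the reference set, and three of the four reference pairs are found, so precision is 3/5 and recall is 3/4.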
Types of Redundancy in the Context of Code Clones (Arten der Redundanz im Zusammenhang mit Code-Clones)
Redundancy in source code impairs important qualities such as the readability and maintainability of the code. It can also lead to faulty program behavior when code fragments are deliberately duplicated instead of being reused. To detect such problems early, it is therefore necessary to break redundancy down into its various forms.
The goal of this thesis was to investigate how these forms, or types, of redundancy differ, how they relate to one another, and how redundancy can be brought together with the concept of the code clone. To this end, a literature study was conducted to capture the current state of research. In addition to redundancy, the topics of code clones and similarity were considered. The results of the literature study were organized by type of redundancy and illustrated with code clone examples.
The literature study showed that redundancy arises predominantly from the duplication of code fragments, so that a large part of the redundancy can be represented by means of code clones. Furthermore, the types of redundancy are not disjoint, so a complete partitioning is not possible.
Outline:
List of Figures
Source Code Listings
1. Introduction
1.1 Motivation
1.2 Objectives
1.3 Structure of the Thesis
2. Definitions
3. Approach
3.1 Methodological Approach
3.2 Planning
3.3 Selection
3.4 Extraction
3.5 Execution
4. Results
4.1 Negative Software Redundancy
4.2 Textual Redundancies
4.3 Functional Redundancy
4.4 Boilerplate Code
4.5 Redundancies Classified by Cause
4.5.1 Forced Redundancy
4.5.2 Accidental Redundancy
4.6 Distinguishing the Types of Redundancy
5. Conclusion
6. Outlook
Sources
Detecting Java software similarities by using different clustering techniques
Background: Research on empirical software engineering has increasingly been conducted by analysing and measuring vast amounts of software systems. Hundreds, thousands and even millions of systems have been (and are) considered by researchers, and often within the same study, in order to test theories, demonstrate approaches or run prediction models. A much less investigated aspect is whether the collected metrics might be context-specific, or whether systems should be better analysed in clusters.
Objective: The objectives of this study are (i) to define a set of clustering techniques that might be used to group similar software systems, and (ii) to evaluate whether a suite of well-known object-oriented metrics is context-specific, and its values differ along the defined clusters.
Method: We group software systems based on three different clustering techniques, and we collect the values of the metrics suite in each cluster. We then test whether the clusters are statistically different from each other, using the Kolmogorov-Smirnov (KS) hypothesis test.
Results: Our results show that, for two of the techniques used, the KS null hypothesis (i.e., the clusters come from the same population) is rejected for most of the metrics chosen: the clusters that we extracted, based on application domains, show statistically different structural properties.
Conclusions: The implications for researchers can be profound: metrics and their interpretation might be more sensitive to context than acknowledged so far, and application domains represent a promising filter to cluster similar systems.
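The KS comparison between two clusters can be sketched as follows. This is an illustration only: the metric values are invented, the function names `ks_statistic` and `ks_critical_value` are hypothetical, and the decision rule uses the standard large-sample approximation of the critical value, which may differ from the procedure applied in the study.

```python
import math


def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Advance past every observation equal to v in both samples (handles ties).
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the critical value at significance level alpha."""
    c = math.sqrt(-math.log(alpha / 2) / 2)
    return c * math.sqrt((n + m) / (n * m))


# Invented values of one object-oriented metric for the systems in two clusters.
cluster_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cluster_b = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

d = ks_statistic(cluster_a, cluster_b)
threshold = ks_critical_value(len(cluster_a), len(cluster_b))

# The null hypothesis (both clusters come from the same population)
# is rejected when the statistic exceeds the critical value.
reject_null = d > threshold
print(f"D={d:.3f} critical={threshold:.3f} reject={reject_null}")
```

In this made-up case the two samples do not overlap at all, so the statistic reaches its maximum of 1 and the null hypothesis is rejected. For samples as small as these, an exact test would be preferable to the asymptotic threshold.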
Development and Evaluation of a Method for Measuring Redundancy Based on Tokenization (Erstellung und Evaluation eines Verfahrens zur Messung von Redundanz anhand von Tokenzerlegung)
This thesis addresses the question of how different types of redundancy can be measured and how removing these redundancies affects the size of the source code. It further examines the extent to which different programming concepts and features influence the reduction of redundancy. To this end, a literature study is conducted first. A definition of intrinsic redundancy is then introduced. In addition, various measurement methods are analyzed, and on that basis a method for measuring redundancy by means of tokenization is developed. To examine the redundancy of source code and the effect that programming concepts and features have on it, an experiment is carried out on a multi-part code corpus of functionally identical code pairs. First, the cleaned token count is measured for the code fragments that do not use a particular construct. These measurements are then compared with those of the code fragments in which the construct is applied. The results show that certain language constructs reduce redundancy. For several constructs, it also turned out that their use reduces redundancy only once the code fragments to be replaced reach a certain number or size.
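The comparison of token counts for a functionally identical code pair might look like the sketch below. It rests on several assumptions that the abstract does not confirm: the fragments are Python, tokens are produced by Python's standard `tokenize` module, and "cleaned" is taken to mean that comments and pure layout tokens are ignored. The helper name `cleaned_token_count` and both fragments are invented for illustration.

```python
import io
import tokenize

# Token types treated as layout or noise (an assumption about what "cleaned" means).
IGNORED = {
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def cleaned_token_count(source):
    """Count the tokens of a Python fragment, skipping comments and layout tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in IGNORED)


# Functionally identical pair: explicit loop versus list comprehension.
without_construct = (
    "squares = []\n"
    "for x in range(10):\n"
    "    squares.append(x * x)  # build the list step by step\n"
)
with_construct = "squares = [x * x for x in range(10)]\n"

before = cleaned_token_count(without_construct)
after = cleaned_token_count(with_construct)
print(f"without construct: {before} tokens, with construct: {after} tokens")
```

A lower count for the fragment that uses the construct would be read as a redundancy-reducing effect of that construct, in the sense described above.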
Understanding the Evolution of Code Clones in Software Systems
Code cloning is a common practice in software development. However, code cloning has both positive aspects such as accelerating the development process and negative aspects such as causing code bloat. After a decade of active research, it is clear that removing all of the clones from a software system is not desirable. Therefore, it is better to manage clones than to remove them. A software system can have thousands of clones in it, which may serve multiple purposes. However, some of the clones may cause unwanted management difficulties and clones like these should be refactored. Failure to manage clones may cause inconsistencies in the code, which is prone to error. Managing thousands of clones manually would be a difficult task. A clone management system can help manage clones and find patterns of how clones evolve during the evolution of a software system. In this research, we propose a framework for constructing and visualizing clone genealogies with change patterns (e.g., inconsistent changes), bug information, developer information and several other important metrics in a software system. Based on the framework we design and build an interactive prototype for a multi-touch surface (e.g., an iPad). The prototype uses a variety of techniques to support understanding clone genealogies, including: identifying and providing a compact overview of the clone genealogies along with their key characteristics; providing interactive navigation of genealogies, cloned source code and the differences between clone fragments; providing the ability to filter and organize genealogies based on their properties; providing a feature for annotating clone fragments with comments to aid future review; and providing the ability to contact developers from within the system to find out more information about specific clones. 
To investigate the suitability of the framework and prototype for investigating and managing cloned code, we elicit feedback from practicing researchers and developers, and we conduct two empirical studies: a detailed investigation into the evolution of function clones and a detailed investigation into how clones contribute to bugs. In both empirical studies we are able to use the prototype to quickly investigate the cloned source code to gain insights into clone use. We believe that the clone management system and the findings will play an important role in future studies and in managing code clones in software systems.
Visualization and analysis of software clones
Code clones are identical or similar fragments of code in a software system. Simple copy-paste programming practices, the reuse of existing code fragments instead of implementing from scratch, and limitations of both programming languages and developers are the primary reasons behind code cloning. Despite the maintenance implications of clones, it is not possible to conclude that cloning is harmful because there are also benefits in using them (e.g. faster and independent development). As a result, researchers at least agree that clones need to be analyzed before aggressively refactoring them. Although a large number of state-of-the-art clone detectors are available today, handling raw clone data is challenging due to its textual nature and large volume. To address this issue, we propose a framework for large-scale clone analysis and develop a maintenance support environment based on the framework called VisCad. To manage the large volume of clone data, VisCad employs the Visual Information Seeking Mantra: overview first, zoom and filter, then provide details-on-demand. With VisCad users can analyze and identify distinctive code clones through a set of visualization techniques, metrics covering different clone relations and data filtering operations. The loosely coupled architecture of VisCad allows users to work with any clone detection tool that reports source-coordinates of the found clones. This yields the opportunity to work with the clone detectors of choice, which is important because each clone detector has its own strengths and weaknesses. In addition, we extend the support for clone evolution analysis, which is important to understand the cause and effect of changes at the clone level during the evolution of a software system. Such information can be used to make software maintenance decisions like when to refactor clones.
We propose and implement a set of visualizations that can allow users to analyze the evolution of clones from a coarse grain to a fine grain level. Finally, we use VisCad to extract both spatial and temporal clone data to predict changes to clones in a future release/revision of the software, which can be used to rank clone classes as another means of handling a large volume of clone data. We believe that VisCad makes clone comprehension easier and that it can be used as a test-bed to further explore code cloning, which is necessary in building a successful clone management system.