202 research outputs found

    An Empirical Assessment of Bellon's Clone Benchmark

    Context: Clone benchmarks are essential to the assessment and improvement of clone detection tools and algorithms. Among existing benchmarks, Bellon’s benchmark is widely used by the research community. However, a serious threat to the validity of this benchmark is that the reference clones it contains have been manually validated by Bellon alone. Other persons may disagree with Bellon’s judgment. Objective: In this paper, we perform an empirical assessment of Bellon’s benchmark. Method: We seek the opinion of eighteen participants on a subset of Bellon’s benchmark to determine if researchers should trust the reference clones it contains. Results: Our experiment shows that a significant number of the reference clones are debatable, and this phenomenon can introduce noise in results obtained using this benchmark.

    A novel approach for Software Clone detection using Data Mining in Software

    Code clones are similar program structures that recur in variant forms in software systems. Many techniques have been proposed to detect similar code fragments in software. Software maintenance is generally helped by the identification and subsequent unification of such fragments. When patterns of simple clones recur, this indicates the presence of interesting higher-level similarities, which are called structural clones. Compared to simple clones, structural clones show a bigger picture of the similarities. Because structural clones form logical groups of simple clones, they alleviate the problem of having a huge number of clones to inspect. Detecting structural clones is essential for understanding the design of a system, both for better maintenance and for reengineering for reuse. In this paper, a technique for detecting some useful types of structural clones is proposed. The novelty of the approach lies in the formulation of the structural clone concept and in the application of data mining techniques. A novel approach to implementing the proposed technique is described.

    How Accurate Is Coarse-grained Clone Detection?: Comparision with Fine-grained Detectors

    Research on clone detection has been quite successful over the past two decades and has produced a number of state-of-the-art clone detectors. However, even with such successful detectors, it is still challenging to detect clones across multiple projects or over thousands of revisions of code in limited time. A simple, coarse-grained detector can be an alternative to detectors that use fine-grained analysis. It drastically reduces the time required for detection, although it may miss some of the clones that fine-grained detectors can detect. Hence, it should be adequate for a tentative analysis of clones if it has acceptable accuracy. However, it is not clear how accurate such a coarse-grained approach is. This paper evaluates the accuracy of a coarse-grained clone detector compared with some fine-grained clone detectors. Our experiment provides empirical evidence of the acceptable accuracy of such a coarse-grained approach. Thus, we conclude that coarse-grained detection is adequate for summarizing a clone analysis and as a starting point for detailed analysis, including manual inspection and bug detection.

    Automatic Refactoring for Renamed Clones in Test Code

    Unit testing plays an essential role in software development and maintenance, especially in Test-Driven Development. Conventional unit tests, which have no input parameters, often exercise similar scenarios with small variations to achieve acceptable coverage, which often results in duplicated code in test suites. Test code duplication hinders comprehension of test cases and maintenance of test suites. Test refactoring is a potential tool for developers to use to control technical debt arising due to test cloning. In this thesis, we present a novel tool, JTestParametrizer, for automatically refactoring method-scope renamed clones in test suites. We propose three levels of refactoring to parameterize type, data, and behaviour differences in clone pairs. Our technique works at the Abstract Syntax Tree level by extracting a parameterized template utility method and instantiating it with appropriate parameter values. We applied our technique to 5 open-source Java benchmark projects and conducted an empirical study on our results. Our technique examined 14,431 test methods in our benchmark projects and identified 415 renamed clone pairs as effective candidates for refactoring. On average, 65% of the effective candidates (268 clone pairs) in our test suites are refactorable using our technique. All of the refactored test methods are compilable, and 94% of them pass when executed as tests. We believe that our proposed refactorings generally improve code conciseness, reduce the amount of duplication, and make test suites easier to maintain and extend
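
A minimal sketch of the refactoring idea in the abstract above, written in Python rather than Java and not taken from JTestParametrizer itself. It assumes two hypothetical clone tests that differ only in their data and in the behaviour under test, and shows them rewritten as calls to one parameterized template. The names `check_aggregate`, `test_sum_of_ints`, and `test_max_of_ints` are illustrative assumptions.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # stands in for the "type" difference between clones


# Before refactoring: two renamed clones that differ only in the
# literal data and in the function being exercised.
def test_sum_of_ints() -> None:
    values = [1, 2, 3]
    result = sum(values)
    assert result == 6


def test_max_of_ints() -> None:
    values = [4, 9, 2]
    result = max(values)
    assert result == 9


# After refactoring: the shared structure becomes one template whose
# parameters capture the data and behaviour differences.
def check_aggregate(
    values: Sequence[T],
    aggregate: Callable[[Sequence[T]], T],
    expected: T,
) -> None:
    result = aggregate(values)
    assert result == expected


# Each original test is reduced to a single instantiation of the template.
def test_sum_of_ints_refactored() -> None:
    check_aggregate([1, 2, 3], sum, 6)


def test_max_of_ints_refactored() -> None:
    check_aggregate([4, 9, 2], max, 9)
```

The refactored tests keep their original names and intent, so a test runner still reports them separately, while the duplicated body lives in one place.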

    Koodikloonien hyödyntäminen asiakaskohtaisten erojen havaitsemiseksi tuotteistusprosessissa (Utilizing code clones to detect customer-specific differences in the productization process)

    The topic for this thesis was inspired by two case studies. The case studies are applications that are conceptually, but not technically, products. Their code bases contain customer-specific branches. The development strategy in the case studies has been to fork an existing branch and customize it to the needs of the new client. Code reuse and forking can be an efficient or even a necessary development strategy under time pressure. However, code duplication may make the code base harder to maintain, which in turn increases maintenance costs. Finding similar code fragments is researched in the field of code clone detection. Code clones are code fragments that are either the same or similar. The similarity can be categorized into four types. Type I clones are exact matches that differ only in layout, whitespace or comments. In addition to type I changes, type II clones can differ in identifier names and types or in literal values. Furthermore, type III clones can have statements added, deleted or modified within the code fragments under comparison. Type IV clones are functionally similar clones. There are different kinds of techniques and tools for both detecting and visualizing clones. Different techniques find different sets of clone types. Code clone visualizations present both an overview of the cloning situation and the details at the source code level. The branches of the same product in the case studies can be considered clones of each other. They are expected to resemble type III clones. They essentially originate from the same code base, but each one has added, deleted and modified statements in the corresponding files relative to the other branches. Identifying these changes facilitates forming an overall picture of how much the branches truly differ. The transformation process from development of customer-specific software to product software is called productization. 
In order to productize, the differences between the branches must be determined. Each customization needs to be considered in the productization process to avoid reducing the value of the product. We defined a process for using code clone visualizations to explore differences between customer-specific branches. The conclusion of this thesis is that using code clones clearly expedites the productization process. The visualizations help locate the differences much faster than manual comparison. Code clone detection is applied to fade out the uninteresting differences between the branches. Hence, the method helps navigate to the truly interesting customizations that require manual inspection. The method also provides a general view of the cloning situation, which eases the task of estimating the workload. The process is applicable in situations where the diverged code bases are expected to resemble each other structurally, yet contain so many changes that a manual comparison of the branches with file comparison tools would be too time-consuming.
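
The type I and type II definitions in the abstract above can be illustrated with a small token-based comparison. This is a hedged sketch, not the method used in the thesis: it assumes the fragments are Python source, uses the standard `tokenize` module, and the helper names `normalized_tokens` and `clone_type` are hypothetical. Type III and type IV clones need fuzzier or semantic matching and are out of scope here.

```python
import io
import keyword
import tokenize

# Tokens that only carry layout or comments are dropped entirely.
_IGNORED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Structural tokens are kept as fixed markers so indentation width is ignored.
_MARKERS = {
    tokenize.NEWLINE: "NEWLINE",
    tokenize.INDENT: "INDENT",
    tokenize.DEDENT: "DEDENT",
}


def normalized_tokens(source: str, abstract: bool = False) -> list:
    """Return the token strings of `source` without layout and comments.

    With abstract=True, identifiers and literals are replaced by
    placeholders, which is the extra tolerance that type II allows.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _IGNORED:
            continue
        if tok.type in _MARKERS:
            out.append(_MARKERS[tok.type])
        elif abstract and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif abstract and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out


def clone_type(a: str, b: str) -> str:
    """Classify two fragments as 'Type I', 'Type II', or 'other'."""
    if normalized_tokens(a) == normalized_tokens(b):
        return "Type I"
    # Note: this does not check that identifiers are renamed consistently.
    if normalized_tokens(a, abstract=True) == normalized_tokens(b, abstract=True):
        return "Type II"
    return "other"
```

For example, two copies of a function that differ only in indentation width and a trailing comment would be reported as `Type I`, while a copy with renamed variables and a changed literal would be reported as `Type II`.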

    A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

    Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate codes, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and a focus on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance phase. Comment: 49 pages, 10 figures, 6 tables

    Identifying Refactoring Opportunities for Removing Code Clones with A Metrics-based Approach

    Java in Academia and Research (2nd Edition). 978-1-922227-010 (Hardcover), 978-1-481261-609 (Paperback)

    Dealing with clones in software : a practical approach from detection towards management

    Although duplicated fragments of code, also called code clones, are considered one of the prominent code smells that may exist in software, cloning is widely practiced in industrial development. The larger the system, the more people are involved in its development and the more parts are developed by different teams, which increases the possibility of cloned code in the system. While code cloning has particular benefits in software development, research shows that it might be a source of various troubles in evolving software. Therefore, investigating and understanding clones in a software system is important for managing them efficiently. However, when the system is fairly large, it is challenging to identify and manage those clones properly. Among the various types of clones that may exist in software, research shows that detecting near-miss clones, where there might be minor to significant differences among the cloned fragments (e.g., renaming of identifiers and additions, deletions or modifications of statements), is costly in terms of time and memory. Thus, there is great demand for state-of-the-art technologies for dealing with clones in software. Over the years, several tools have been developed to detect and visualize exact and similar clones. However, the tools are usually standalone and do not integrate well with a software developer's workflow. In this thesis, first, a study is presented on the effectiveness of a fingerprint-based data similarity measurement technique named 'simhash' in detecting clones in a large-scale code base. Based on the positive outcome of the study, a time-efficient detection approach is proposed to find exact and near-miss clones in software, especially in large-scale software systems. 
The novel detection approach has been made available as a highly configurable and fully fledged standalone clone detection tool named 'SimCad', which can be configured to detect clones in both source code and non-source-code data. Second, we show a robust use of the clone detection approach studied earlier by assembling its detection service as a portable library named 'SimLib'. This library can provide tightly coupled (integrated) clone detection functionality to other applications, as opposed to the loosely coupled service provided by a typical standalone tool. Because it is highly configurable and easily extensible, this library allows the user to customize its clone detection process for detecting clones in data with diverse characteristics. We performed a user study to get feedback on the installation and use of the 'SimLib' API (Application Programming Interface) and to uncover its potential use as a third-party clone detection library. Third, we investigated which tools and techniques are currently in use to detect and manage clones and to understand their evolution. The goal was to find out how those tools and techniques can be made available in a developer's own software development platform for convenient identification, tracking and management of clones in the software. Based on that, we developed a clone-aware software development platform named 'SimEclipse' to promote the practical use of code clone research and to provide better support for clone management in software. Finally, we evaluated 'SimEclipse' by conducting a user study on its effectiveness, usability and information management. We believe that both researchers and developers would benefit from using these tools in different aspects of code clone research and in managing cloned code in software systems.
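
The abstract above names simhash as the fingerprinting technique but does not describe how SimCad applies it. The following is a generic textbook-style simhash sketch, given only as an illustration: the tokenization by regular expression, the use of SHA-256 as the per-token hash, and the function names `simhash` and `hamming` are all assumptions, not details from the thesis.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint of `text`.

    `bits` is assumed to be a multiple of 8 and at most 256, because the
    per-token hash is taken from the leading bytes of a SHA-256 digest.
    """
    counts = [0] * bits
    # Assumed tokenization: words/numbers, or single punctuation characters.
    for token in re.findall(r"\w+|[^\w\s]", text):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where the positive votes are in the majority.
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Fragments whose token multisets are identical receive the same fingerprint, and fragments that share most tokens tend to differ in only a few bits, so a detector built this way would compare fingerprints by Hamming distance against a chosen threshold instead of comparing the fragments themselves.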