
    A comparison of code similarity analysers

    Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied verbatim; it may be modified for various purposes, e.g. refactoring, bug fixing, or even software plagiarism. These code modifications can affect the performance of code similarity analysers, including code clone and plagiarism detectors, to a certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using rank-based measures, and (5) combined local and global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique: its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set; after directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. The code similarity analysers are thoroughly evaluated not only on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.
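
    As a rough illustration of the general textual similarity measures the study compares against specialised tools, the sketch below computes token-set Jaccard similarity between two Java snippets. The regex tokeniser and the example snippets are illustrative assumptions, not artefacts from the paper.

        import re

        # Token-set Jaccard similarity over lexed Java-like source (a sketch).
        def tokens(source: str) -> set[str]:
            # Split into identifiers, numbers, and single punctuation characters.
            return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source))

        def jaccard(a: str, b: str) -> float:
            ta, tb = tokens(a), tokens(b)
            return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

        original = "int sum(int a, int b) { return a + b; }"
        renamed  = "int add(int x, int y) { return x + y; }"  # pervasive renaming
        print(f"similarity = {jaccard(original, renamed):.2f}")  # 0.60

    Even under wholesale identifier renaming, the shared syntactic structure keeps the score well above zero, which suggests why general textual measures remain competitive under pervasive modification.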

    Similarity of Source Code in the Presence of Pervasive Modifications

    Source code analysis to detect code cloning, code plagiarism, and code reuse suffers from the problem of pervasive code modifications, i.e. transformations that may have a global effect. We compare 30 similarity detection techniques and tools against pervasive code modifications. We evaluate the tools using two experimental scenarios for Java source code. These are (1) pervasive modifications created with tools for source code and bytecode obfuscation and (2) source code normalisation through compilation and decompilation using different decompilers. Our experimental results show that highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique: its use reduced false classifications to zero for six of the tools. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.
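
    A minimal sketch of compilation/decompilation as a normalisation step, assuming javac is on the PATH; "decompiler.jar" and its "-o" flag are hypothetical placeholders for one of the decompilers the study evaluates.

        import pathlib
        import subprocess

        def normalise(java_file: str, workdir: str = "norm") -> str:
            """Compile then decompile a Java file, returning the normalised source path."""
            out = pathlib.Path(workdir)
            out.mkdir(exist_ok=True)
            # 1. Compile to bytecode, erasing layout, comments, and local formatting.
            subprocess.run(["javac", "-d", str(out), java_file], check=True)
            cls = next(out.glob("**/*.class"))
            # 2. Decompile back to source. The jar name and flag are placeholders:
            #    substitute a real decompiler and its actual command line.
            subprocess.run(["java", "-jar", "decompiler.jar", str(cls), "-o", str(out)],
                           check=True)
            return str(next(out.glob("**/*.java")))

    Comparing normalised rather than raw sources removes many pervasive surface differences before any similarity measure is applied.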

    An Empirical Study on Dependence Clusters for Effort-Aware Fault-Proneness Prediction

    A dependence cluster is a set of mutually inter-dependent program elements. Prior studies have found that large dependence clusters are prevalent in software systems. It has been suggested that dependence clusters have potentially harmful effects on software quality, but little empirical evidence has been provided to support this claim. The study presented in this paper investigates the relationship between dependence clusters and software quality at the function level, with a focus on effort-aware fault-proneness prediction. The investigation first analyzes whether larger dependence clusters tend to be more fault-prone. Second, it investigates whether the proportion of faulty functions inside dependence clusters is significantly different from the proportion of faulty functions outside them. Third, it examines whether functions inside dependence clusters that play more important roles than others are more fault-prone. Finally, based on two groups of functions (i.e., functions inside and outside dependence clusters), the investigation considers a segmented fault-proneness prediction model. Our experimental results, based on five well-known open-source systems, show that (1) larger dependence clusters tend to be more fault-prone; (2) the proportion of faulty functions inside dependence clusters is significantly larger than the proportion outside; (3) functions inside dependence clusters that play more important roles are more fault-prone; and (4) our segmented prediction model can significantly improve the effectiveness of effort-aware fault-proneness prediction in both ranking and classification scenarios. These findings help us better understand how dependence clusters influence software quality.
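
    One common formalisation treats a dependence cluster as a strongly connected component of a dependence graph. The sketch below applies that idea at the function level using the networkx library on a toy graph; the graph is illustrative, not the paper's data.

        import networkx as nx

        g = nx.DiGraph()
        g.add_edges_from([
            ("f", "g"), ("g", "h"), ("h", "f"),  # mutually dependent: a cluster
            ("h", "util"), ("main", "f"),        # one-way dependence: outside
        ])

        # Dependence clusters = strongly connected components with > 1 element.
        clusters = [c for c in nx.strongly_connected_components(g) if len(c) > 1]
        print(clusters)  # [{'f', 'g', 'h'}]

    Splitting functions into "inside a cluster" and "outside a cluster" in this way is what makes a segmented prediction model possible.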

    MOBS: Multi-operator Observation-Based Slicing using lexical approximation of program dependence

    Observation-Based Slicing (ORBS) is a recently-introduced program slicing technique based on direct observation of program semantics. Previous ORBS implementations slice the program by iteratively deleting adjacent lines of code. This paper introduces two new deletion operators based on lexical similarity. Furthermore, it presents a generalization of ORBS that can exploit multiple deletion operators: Multi-operator Observation-Based Slicing (MOBS). Empirical evaluation of MOBS using three real-world Java projects finds that the use of lexical information improves the efficiency of ORBS: MOBS can delete up to 87% of lines while taking only about 33% of the execution time of the original ORBS implementation.
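
    A minimal sketch of the ORBS-style deletion loop with a pluggable deletion operator, which is the hook MOBS generalises. The oracle compiles_and_preserves (compile, execute, compare the value observed at the slicing criterion) is assumed rather than implemented here.

        from typing import Callable, Sequence

        def orbs(lines: list[str],
                 propose: Callable[[list[str], int], Sequence[int]],
                 compiles_and_preserves: Callable[[list[str]], bool]) -> list[str]:
            """Iteratively delete line sets proposed by `propose` while the oracle holds."""
            changed = True
            while changed:  # iterate to a fixed point
                changed = False
                i = 0
                while i < len(lines):
                    doomed = set(propose(lines, i))  # e.g. lexically similar lines
                    candidate = [l for j, l in enumerate(lines) if j not in doomed]
                    if doomed and compiles_and_preserves(candidate):
                        lines, changed = candidate, True  # deletion becomes permanent
                    else:
                        i += 1  # deletion rejected; move the window forward
            return lines

    MOBS would supply several `propose` operators (the original window operator plus the lexical ones) and interleave them.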

    The Impact of Code Review on Architectural Changes

    Although architectural decisions are considered among the most important in the software development lifecycle, empirical evidence on how developers perform and perceive architectural changes remains scarce. Architectural decisions have far-reaching consequences, yet we know relatively little about the level of developers' awareness of their changes' impact on the software's architecture. We also know little about whether architecture-related discussions between developers lead to better architectural changes. To provide a better understanding of these questions, we use code review data from 7 open-source systems to investigate developers' intent and awareness when performing changes, alongside the evolution of the changes during the reviewing process. We extracted the code base of 18,400 reviews and 51,889 revisions. 4,171 of the reviews have changes in their computed architectural metrics, and 731 present significant changes to the architecture. We manually inspected all reviews that caused significant changes and found that developers discuss the impact of their changes on the architectural structure in only 31% of the cases, suggesting a lack of awareness. Moreover, we noticed that in 73% of the cases in which developers provided architectural feedback during code review, the comments were addressed, and the final merged revision tended to exhibit greater architectural improvement than in reviews where the system's structure was not discussed.
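
    Purely to illustrate the flavour of such an analysis (these are not the paper's architectural metrics), the hypothetical sketch below compares a crude structural proxy, import coupling, between the first and final revision of a review; the directory names are placeholders.

        import pathlib
        import re

        def coupling(src_dir: str) -> int:
            # Count import statements as a rough proxy for inter-module coupling.
            return sum(len(re.findall(r"^import\s", p.read_text(), re.MULTILINE))
                       for p in pathlib.Path(src_dir).rglob("*.java"))

        # Placeholder snapshot directories for the first and merged revisions.
        delta = coupling("revision_final") - coupling("revision_1")
        print("coupling change across the review:", delta)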

    A comparison of tree- and line-oriented observational slicing

    Observation-based slicing and its generalization, observational slicing, are recently-introduced, language-independent dynamic slicing techniques. They both construct slices based on the dependencies observed during program execution, rather than on static or dynamic dependence analysis. The original implementation of the observation-based slicing algorithm used lines of source code as its program representation. A recent variation, developed to slice modelling languages (such as Simulink), used an XML representation of an executable model. We ported the XML slicer to source code by constructing a tree representation of traditional source code through the use of srcML. This work compares the tree- and line-based slicers using four experiments involving twenty different programs, ranging from classic benchmarks to million-line production systems. The resulting slices are essentially the same size for the majority of the programs and are often identical. However, structural constraints imposed by the tree representation sometimes force the slicer to retain enclosing control structures. It can also “bog down” trying to delete single-token subtrees. This occasionally makes the tree-based slices larger and the tree-based slicer slower than a parallelised version of the line-based slicer. In addition, a Java versus C comparison finds that the two languages lead to similar slices, but Java code takes noticeably longer to slice. The initial experiments suggest two improvements to the tree-based slicer: the addition of a size threshold for ignoring small subtrees, and subtree replacement. The former enables the slicer to run 3.4 times faster while producing slices that are only about 9% larger, while subtree replacement reduces slice size by about 8–12% and allows the tree-based slicer to produce more natural slices.
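
    The size-threshold improvement could look roughly like the sketch below, which walks a srcML parse tree and only proposes subtrees of at least `threshold` tokens for deletion; the input file name and the threshold value are illustrative assumptions.

        import xml.etree.ElementTree as ET

        def candidate_subtrees(root: ET.Element, threshold: int = 3):
            """Yield subtrees big enough to be worth a deletion attempt."""
            for node in root.iter():
                size = len("".join(node.itertext()).split())
                if size >= threshold:  # skip tiny subtrees that "bog down" the slicer
                    yield node, size

        tree = ET.parse("program.xml")  # srcML output (placeholder path)
        for node, size in candidate_subtrees(tree.getroot()):
            print(node.tag, size)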

    Real-Time Noninvasive Analysis of Biocatalytic PET Degradation

    The Earth has entered the Anthropocene, an epoch marked by ubiquitous and devastating environmental pollution from plastics such as polyethylene terephthalate (PET). Eco-friendly and at the same time economical solutions for plastic recycling and reuse are being sought more urgently now than ever. With the possibility of recovering its building blocks, the hydrolysis of PET waste through selective biodegradation with polyester hydrolases is an appealing solution. We demonstrate how changes in the dielectric properties of PET films can be used to evaluate the performance of polyester hydrolases. For this purpose, a PET film separates two reaction chambers in an impedimetric setup to quantify the film thickness- and surface area-dependent change in capacitance caused by the enzyme. The derived degradation rates determined for the polyester hydrolases PHL7 and LCC were similar to those obtained by gravimetric and vertical scanning interferometry measurements. Compared to optical methods, this technique is also insensitive to changes in the solution composition. AFM measurements and FEM simulations further supported that impedance spectroscopy is a powerful tool for the detailed analysis of the enzymatic hydrolysis of PET films. The developed monitoring system enabled both high temporal resolution and parallel processing, making it suitable for analysing the enzymatic degradability of polyester films and the properties of the biocatalysts. Version 2.0 is updated to include an acknowledgement of funding from the ENZYCLE project.
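
    Treating the film as an ideal parallel-plate dielectric (a simplifying assumption; the paper's FEM simulations account for the real geometry), the link between the measured capacitance C(t) and the film thickness d(t) would be

        C(t) = \frac{\varepsilon_0 \varepsilon_r A}{d(t)}
        \quad\Rightarrow\quad
        d(t) = \frac{\varepsilon_0 \varepsilon_r A}{C(t)},
        \qquad
        \frac{\mathrm{d}d}{\mathrm{d}t} = -\varepsilon_0 \varepsilon_r A \,\frac{\dot{C}(t)}{C(t)^{2}},

    so enzymatic thinning of the film appears as a rising capacitance, and the degradation rate follows from its time derivative.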

    Evaluating Lexical Approximation of Program Dependence

    Complex dependence analysis typically provides an underpinning approximation of true program dependence. We investigate the effectiveness of using lexical information to approximate such dependence, introducing two new deletion operators to Observation-Based Slicing (ORBS). ORBS provides direct observation of program dependence, computing a slice using iterative, speculative deletion of program parts. Deletions become permanent if they do not affect the slicing criterion. The original ORBS uses a bounded deletion window operator that attempts to delete consecutive lines together. Our new deletion operators attempt to delete multiple, non-contiguous lines that are lexically similar to each other. We evaluate the lexical approximation of dependence by exploring the trade-off between the precision and the speed of dependence analysis performed with the new deletion operators. The deletion operators are evaluated independently, as well as collectively via a novel generalization of ORBS that exploits multiple deletion operators: Multi-operator Observation-Based Slicing (MOBS). An empirical evaluation using three Java projects, six C projects, and one multi-lingual project written in Python and C finds that lexical information provides a useful approximation to the underlying dependence. On average, MOBS can delete 69% of the lines deleted by the original ORBS while taking only 36% of the wall-clock time required by ORBS.
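
    A sketch of the lexical grouping behind such operators: select the lines whose token-level cosine similarity to a seed line exceeds a cutoff, yielding a non-contiguous deletion candidate set. The tokenisation and the 0.7 cutoff are illustrative choices, not the paper's exact ones.

        import math
        import re
        from collections import Counter

        def vec(line: str) -> Counter:
            return Counter(re.findall(r"\w+", line))

        def cosine(a: Counter, b: Counter) -> float:
            dot = sum(a[t] * b[t] for t in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def similar_lines(lines: list[str], seed: int, cutoff: float = 0.7) -> list[int]:
            """Indices of lines lexically similar to `lines[seed]`."""
            sv = vec(lines[seed])
            return [i for i, l in enumerate(lines)
                    if i != seed and cosine(sv, vec(l)) >= cutoff]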

    Behind the Intents: An In-depth Empirical Study on Software Refactoring in Modern Code Review

    Code refactorings are of pivotal importance in modern code review. Developers may preserve, revisit, add, or undo refactorings through a change's revisions. Their goal is to certify that the driving intent of a code change is properly achieved. Developers' intents behind refactorings may vary from pure structural improvement to facilitating feature additions and bug fixes. However, there is little understanding of the refactoring practices performed by developers during the code review process. It is also unclear whether developers' intents influence the selection, composition, and evolution of refactorings during the review of a code change. Through mining 1,780 reviewed code changes from 6 systems pertaining to two large open-source communities, we report the first in-depth empirical study of software refactoring during code review. We inspected and classified the developers' intents behind each code change into 7 distinct categories. By analyzing data generated during the complete reviewing process, we observe: (i) how refactorings are selected, composed, and evolved throughout each code change, and (ii) how developers' intents relate to these decisions. For instance, our analysis shows that developers regularly apply non-trivial sequences of refactorings that crosscut multiple code elements (i.e., are widely scattered in the program) to support a single feature addition. Moreover, we observed that new developer intents commonly emerge during the code review process, influencing how developers select and compose their refactorings to achieve the new and adapted goals. Finally, we provide an enriched dataset that allows researchers to investigate the context and motivations behind refactoring operations during the code review process.
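
    The paper's intent classification was performed by manual inspection; purely as an illustration of a possible automated first pass (not the authors' method), a keyword heuristic over change descriptions could pre-sort reviews into candidate categories. The category names and keywords below are hypothetical.

        # Hypothetical intent categories and trigger keywords for a first-pass triage.
        INTENT_KEYWORDS = {
            "feature addition": ("add", "introduce", "support"),
            "bug fix": ("fix", "bug", "crash"),
            "structural improvement": ("refactor", "cleanup", "simplify"),
        }

        def guess_intent(description: str) -> str:
            # Naive substring matching; a real pass would need word boundaries
            # and manual validation, as in the paper.
            text = description.lower()
            for intent, words in INTENT_KEYWORDS.items():
                if any(w in text for w in words):
                    return intent
            return "unclassified"

        print(guess_intent("Refactor parser to simplify error handling"))
        # -> 'structural improvement'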

    CROP: linking code reviews to source code changes

    Code review has been widely adopted by both industrial and open-source software development communities. Research in code review is highly dependent on real-world data, and although researchers have attempted to provide code review datasets, there is still no dataset that links code reviews with complete versions of the system's code base, mainly because reviewed versions are not kept in the system's version control repository. Thus, we present CROP, the Code Review Open Platform, the first curated code review repository that links review data with isolated complete versions (snapshots) of the source code at the time of review. CROP currently provides data for 8 software systems, 48,975 reviews, and 112,617 patches, including versions of the systems that are inaccessible in the systems' original repositories. Moreover, CROP is extensible, and it will be continuously curated and extended.
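
    A hedged sketch of consuming a CROP-style dataset: walk a metadata table that maps each review to its before/after code snapshots and check each snapshot out for analysis. The file name, column names, and repository layout are hypothetical; consult the CROP documentation for the real structure.

        import csv
        import subprocess

        with open("reviews.csv", newline="") as fh:  # hypothetical metadata file
            for row in csv.DictReader(fh):
                for snapshot in (row["before_commit"], row["after_commit"]):
                    subprocess.run(["git", "checkout", snapshot],
                                   cwd=row["repo_path"], check=True)
                    # ... run metrics or similarity analysis on this snapshot here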