266 research outputs found

    Management Aspects of Software Clone Detection and Analysis

    Copying a code fragment and reusing it by pasting, with or without minor modifications, is a common practice in software development for improved productivity. As a result, software systems often contain similar segments of code, called software clones or code clones. Unintentional clones may also appear in the source code without the developer being aware of them. Studies report that significant fractions (5% to 50%) of the code in typical software systems are cloned. Although code cloning may increase initial productivity, it may cause fault propagation, inflate the code base and increase maintenance overhead. Thus, it is believed that code clones should be identified and carefully managed. This Ph.D. thesis contributes to clone management with techniques realized in tools, and with large-scale in-depth analyses of clones that inform the design of effective clone management techniques and strategies. To support proactive clone management, we have developed a clone detector as a plug-in to the Eclipse IDE. For clone detection, we used a hybrid approach that combines the strengths of parser-based and text-based techniques. To capture clones that are similar but not exact duplicates, we adopted a novel approach that applies a suffix-tree-based k-difference hybrid algorithm, borrowed from the area of computational biology. Instead of targeting all clones in the entire code base, our tool aids clone-aware development by allowing a focused search for clones of any code fragment of interest to the developer. A good understanding of the code cloning phenomenon is a prerequisite for devising efficient clone management strategies. The second phase of the thesis includes large-scale empirical studies on the characteristics (e.g., proportion, types of similarity, change patterns) of code clones in evolving software systems. Applying statistical techniques, we also made fairly accurate forecasts of the proportion of code clones in future versions of software projects.
The outcomes of these studies expose useful insights into the characteristics of evolving clones and their management implications. Once code clones have been identified, their management often necessitates careful refactoring, which is dealt with in the third phase of the thesis. Given a large number of clones, it is difficult to decide optimally what to refactor and what not, especially when there are dependencies among clones and the objective is to minimize refactoring effort and risk while maximizing benefit. In this regard, we developed a novel clone refactoring scheduler that applies a constraint programming approach. We also introduced a novel effort model for estimating the effort needed to refactor clones in source code. We evaluated our clone detector, scheduler and effort model through comparative empirical studies and user studies. Finally, based on our experience and in-depth analysis of the present state of the art, we expose avenues for further research and development towards the versatile clone management system that we envision.
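
The k-difference idea mentioned in the abstract above (finding fragments that match a query within at most k edits) can be illustrated with a minimal sketch. This is not the thesis's suffix-tree-based algorithm: it substitutes the classic dynamic-programming formulation of approximate matching, applied to token sequences, and the function name `k_difference_matches` is hypothetical.

```python
from typing import List, Sequence


def k_difference_matches(pattern: Sequence[str], text: Sequence[str], k: int) -> List[int]:
    """Return 1-based end positions in `text` where some subsequence of
    consecutive tokens matches `pattern` with at most k edits
    (insertions, deletions, substitutions).

    Plain O(len(pattern) * len(text)) dynamic programming; a suffix-tree
    based method, as named in the abstract, would avoid rescanning the text.
    """
    m = len(pattern)
    # prev[i] = edit distance between pattern[:i] and the empty text prefix.
    prev = list(range(m + 1))
    ends = []
    for j, tok in enumerate(text, start=1):
        # cur[0] = 0 lets a match start at any position in the text.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tok else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra token in text
                         cur[i - 1] + 1)      # token missing from text
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


if __name__ == "__main__":
    query = ["a", "b", "c"]
    code = ["x", "a", "q", "c", "y"]   # one token differs from the query
    print(k_difference_matches(query, code, 0))
    print(k_difference_matches(query, code, 1))
```

With k = 0 the function reports only exact occurrences; raising k admits near-miss occurrences such as a fragment with one renamed token.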

    Automatically assessing and improving code readability and understandability


    A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

    Measuring and evaluating source code similarity is a fundamental software engineering activity with a broad range of applications, including but not limited to code recommendation and the detection of duplicate code, plagiarism, malware, and code smells. This paper presents a systematic literature review and meta-analysis of code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while many programming languages have no support at all. Notably, we found 12 datasets related to source code similarity measurement and duplicate code, of which only eight were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focus on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to maintenance. Comment: 49 pages, 10 figures, 6 tables

    Efficient Algorithms for Prokaryotic Whole Genome Assembly and Finishing

    De novo genome assembly from DNA fragments is primarily based on sequence overlap information. In addition, mate-pair reads or paired-end reads provide linking information for joining gaps and bridging repeat regions. Genome assemblers in general assemble long contiguous sequences (contigs) using both overlapping reads and linked reads until the assembly runs into an ambiguous repeat region. These contigs are further bridged into scaffolds using linked read information. However, errors can be made in both phases of assembly due to a high error threshold for overlap acceptance and linking based on too few mate reads. Identical as well as similar repeat regions can often cause errors in overlap and mate-pair evidence. In addition, the problem of setting the correct threshold to minimize errors and optimize the assembly of reads is not trivial and often requires a time-consuming trial-and-error process to obtain optimal results. The typical trial-and-error process with multiple assemblers can be computationally intensive and is very inefficient, especially when users must learn how to use a wide variety of assemblers, many of which may be serial, require long execution times, and fail to return usable or accurate results. Further, we show that the comparison of assembly results may not provide users with a clear winner under all circumstances. Therefore, we propose a novel scaffolding tool, Correlative Algorithm for Repeat Placement (CARP), capable of joining short low-error contigs using mate-pair reads, computationally resolved repeat structures and synteny with one or more reference organisms. The CARP tool requires a set of repeat sequences, such as insertion sequences (IS), that can be found computationally without assembling the genome. Development of methods to identify such repeating regions directly from raw sequence reads or draft genomes led to the development of the ISQuest software package.
ISQuest identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS and repeat elements. The CARP tool matches very low-error contigs with strong overlap using the ambiguous partial repeat sequence at the ends of each contig, annotated with the repeat sequences discovered by ISQuest. These matches are verified by synteny with the genomes of one or more reference organisms. We show that the CARP tool can be used to verify regions with low mate-pair evidence, independently find new joins and significantly reduce the number of scaffolds. Finally, we demonstrate a novel viewer that presents to the user the computationally derived joins along with the evidence used to make them. The viewer allows the user to independently assess their confidence in the joins made by the finishing tools and make an informed decision on whether to invest the resources necessary to confirm a particular portion of the assembly. Further, we allow users to manually record join evidence, re-order contigs, and track the assembly finishing process.

    The 6th Conference of PhD Students in Computer Science


    Automated Improvement of Software Design by Search-Based Refactoring

    Software maintenance cost is estimated to be more than 70% of the total cost of a system, because of many factors, including new user requirements, the adoption of new technologies and the quality of software systems. Of these factors, quality is the one that we can control and continually improve to prevent degradation of performance and reduction of effectiveness (e.g., design decay). Moreover, to stay competitive, the software industry has shortened its release cycles to deliver new products and features faster, which results in more pressure on developer teams and accelerates the evolution of the system's design. One way to prevent design decay is the identification and correction of anti-patterns, which are indicators of poor design quality. To improve design quality and remove anti-patterns, developers perform small behavior-preserving transformations (a.k.a. refactoring). Manual refactoring is expensive, as it requires (1) identifying the code entities that need to be refactored; (2) generating refactoring operations for the classes identified in the previous step; and (3) finding the correct order in which to apply the generated refactorings, to maximize the benefit to code quality and minimize conflicts. Hence, researchers and practitioners have formulated refactoring as an optimization problem and use search-based techniques to propose (semi-)automated approaches to solve it.
In this dissertation, we propose several approaches to tackle some of the major issues in existing refactoring tools, to assist developers in their maintenance and quality assurance activities. The thesis is that automated refactoring can be improved by considering new dimensions: (1) the developer's task context, to prioritize the refactoring of relevant classes; (2) testing effort, to reduce the cost of testing after refactoring; (3) the identification of conflicts between refactoring operations, to reduce the cost of refactoring; and (4) energy efficiency, to improve the energy consumption of mobile applications after refactoring.

    Dealing with clones in software : a practical approach from detection towards management

    Although duplicated fragments of code, also called code clones, are considered one of the prominent code smells that may exist in software, cloning is widely practiced in industrial development. The larger the system, the more people are involved in its development and the more parts are developed by different teams, which increases the likelihood of cloned code in the system. While code cloning has particular benefits in software development, research shows that it might be a source of various troubles in evolving software. Therefore, investigating and understanding the clones in a software system is important for managing them efficiently. However, when the system is fairly large, it is challenging to identify and manage those clones properly. Among the various types of clones that may exist in software, research shows that detection of near-miss clones, where there might be minor to significant differences (e.g., renaming of identifiers and additions/deletions/modifications of statements) among the cloned fragments, is costly in terms of time and memory. Thus, there is great demand for state-of-the-art technologies for dealing with clones in software. Over the years, several tools have been developed to detect and visualize exact and similar clones. However, these tools are usually standalone and do not integrate well with a software developer's workflow. In this thesis, first, a study is presented on the effectiveness of a fingerprint-based data similarity measurement technique named 'simhash' in detecting clones in large-scale code bases. Based on the positive outcome of the study, a time-efficient detection approach is proposed to find exact and near-miss clones in software, especially in large-scale software systems.
The novel detection approach has been made available as a highly configurable and fully fledged standalone clone detection tool named 'SimCad', which can be configured to detect clones in both source code and non-source-code data. Second, we show a robust use of the clone detection approach studied earlier by packaging its detection service as a portable library named 'SimLib'. This library can provide tightly coupled (integrated) clone detection functionality to other applications, as opposed to the loosely coupled service provided by a typical standalone tool. Because it is highly configurable and easily extensible, this library allows the user to customize its clone detection process for detecting clones in data with diverse characteristics. We performed a user study to gather feedback on the installation and use of the 'SimLib' API (Application Programming Interface) and to uncover its potential use as a third-party clone detection library. Third, we investigated which tools and techniques are currently used to detect and manage clones and to understand their evolution. The goal was to find how those tools and techniques can be made available within a developer's own software development platform for convenient identification, tracking and management of clones in the software. Based on that, we developed a clone-aware software development platform named 'SimEclipse' to promote the practical use of code clone research and to provide better support for clone management in software. Finally, we evaluated 'SimEclipse' by conducting a user study on its effectiveness, usability and information management. We believe that both researchers and developers would benefit from using these tools in different aspects of code clone research and in managing cloned code in software systems.
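
To make the fingerprinting idea behind 'simhash' concrete, here is a minimal sketch of the general simhash technique applied to code tokens. It is an illustration under stated assumptions, not SimCad's or SimLib's implementation: the tokenizer, the choice of MD5 as the per-token hash, the 64-bit fingerprint width and the function names are all hypothetical choices made for this example.

```python
import hashlib
import re
from typing import List


def tokens(code: str) -> List[str]:
    """Very rough tokenizer (assumption): identifiers, numbers, single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def simhash(code: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint of a code fragment.

    Each token is hashed; every bit position collects +1 or -1 votes from the
    token hashes, and the sign of the total decides the fingerprint bit.
    """
    votes = [0] * bits
    for tok in tokens(code):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # use the first 64 bits of the hash
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = "int total = price * count + tax;"
    renamed = "int sum = price * count + tax;"
    unrelated = "while (queue.isEmpty()) { wait(); }"
    base = simhash(original)
    print(hamming(base, simhash(renamed)))
    print(hamming(base, simhash(unrelated)))
```

Fragments whose fingerprints differ in only a few bits are candidate near-miss clones, so a detector can compare compact fingerprints instead of full token sequences. The distance threshold that separates clones from non-clones is a tuning parameter and is not specified here.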