353 research outputs found

    Do the Fix Ingredients Already Exist? An Empirical Inquiry into the Redundancy Assumptions of Program Repair Approaches

    Get PDF
    Much initial research on automatic program repair has focused on experimental results to probe their potential to find patches and reduce development effort. Relatively less effort has been put into understanding the hows and whys of such approaches. For example, a critical assumption of the GenProg technique is that certain bugs can be fixed by copying and re-arranging existing code. In other words, GenProg assumes that the fix ingredients already exist elsewhere in the code. In this paper, we formalize these assumptions around the concept of ''temporal redundancy''. A temporally redundant commit is only composed of what has already existed in previous commits. Our experiments show that a large proportion of commits that add existing code are temporally redundant. This validates the fundamental redundancy assumption of GenProg.Comment: ICSE - 36th IEEE International Conference on Software Engineering (2014

    Normalizer: Augmenting Code Clone Detectors using Source Code Normalization

    Get PDF
    Code clones are duplicate fragments of code that perform the same task. As software code bases increase in size, the number of code clones also tends to increase. These code clones, possibly created through copy-and-paste methods or unintentional duplication of effort, increase maintenance cost over the lifespan of the software. Code clone detection tools exist to identify clones where a human search would prove unfeasible, however the quality of the clones found may vary. I demonstrate that the performance of such tools can be improved by normalizing the source code before usage. I developed Normalizer, a tool to transform C source code to normalized source code where the code is written as consistently as possible. By maintaining the code\u27s function while enforcing a strict format, the variability of the programmer\u27s style will be taken out. Thus, code clones may be easier to detect by tools regardless of how it was written. Reordering statements, removing useless code, and renaming identifiers are used to achieve normalized code. Normalizer was used to show that more clones can be found in Introduction to Computer Networks assignments by normalizing the source code versus the original source code using a small variety of code clone detection tools

    Deep Learning Software Repositories

    Get PDF
    Bridging the abstraction gap between artifacts and concepts is the essence of software engineering (SE) research problems. SE researchers regularly use machine learning to bridge this gap, but there are three fundamental issues with traditional applications of machine learning in SE research. Traditional applications are too reliant on labeled data. They are too reliant on human intuition, and they are not capable of learning expressive yet efficient internal representations. Ultimately, SE research needs approaches that can automatically learn representations of massive, heterogeneous, datasets in situ, apply the learned features to a particular task and possibly transfer knowledge from task to task. Improvements in both computational power and the amount of memory in modern computer architectures have enabled new approaches to canonical machine learning tasks. Specifically, these architectural advances have enabled machines that are capable of learning deep, compositional representations of massive data depots. The rise of deep learning has ushered in tremendous advances in several fields. Given the complexity of software repositories, we presume deep learning has the potential to usher in new analytical frameworks and methodologies for SE research and the practical applications it reaches. This dissertation examines and enables deep learning algorithms in different SE contexts. We demonstrate that deep learners significantly outperform state-of-the-practice software language models at code suggestion on a Java corpus. Further, these deep learners for code suggestion automatically learn how to represent lexical elements. We use these representations to transmute source code into structures for detecting similar code fragments at different levels of granularity—without declaring features for how the source code is to be represented. Then we use our learning-based framework for encoding fragments to intelligently select and adapt statements in a codebase for automated program repair. In our work on code suggestion, code clone detection, and automated program repair, everything for representing lexical elements and code fragments is mined from the source code repository. Indeed, our work aims to move SE research from the art of feature engineering to the science of automated discovery

    Animating the evolution of software

    Get PDF
    The use and development of open source software has increased significantly in the last decade. The high frequency of changes and releases across a distributed environment requires good project management tools in order to control the process adequately. However, even with these tools in place, the nature of the development and the fact that developers will often work on many other projects simultaneously, means that the developers are unlikely to have a clear picture of the current state of the project at any time. Furthermore, the poor documentation associated with many projects has a detrimental effect when encouraging new developers to contribute to the software. A typical version control repository contains a mine of information that is not always obvious and not easy to comprehend in its raw form. However, presenting this historical data in a suitable format by using software visualisation techniques allows the evolution of the software over a number of releases to be shown. This allows the changes that have been made to the software to be identified clearly, thus ensuring that the effect of those changes will also be emphasised. This then enables both managers and developers to gain a more detailed view of the current state of the project. The visualisation of evolving software introduces a number of new issues. This thesis investigates some of these issues in detail, and recommends a number of solutions in order to alleviate the problems that may otherwise arise. The solutions are then demonstrated in the definition of two new visualisations. These use historical data contained within version control repositories to show the evolution of the software at a number of levels of granularity. Additionally, animation is used as an integral part of both visualisations - not only to show the evolution by representing the progression of time, but also to highlight the changes that have occurred. Previously, the use of animation within software visualisation has been primarily restricted to small-scale, hand generated visualisations. However, this thesis shows the viability of using animation within software visualisation with automated visualisations on a large scale. In addition, evaluation of the visualisations has shown that they are suitable for showing the changes that have occurred in the software over a period of time, and subsequently how the software has evolved. These visualisations are therefore suitable for use by developers and managers involved with open source software. In addition, they also provide a basis for future research in evolutionary visualisations, software evolution and open source development

    A comparison of code similarity analysers

    Get PDF
    Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code

    Semi-Automatische Deduktion von Feature-Lokalisierung während der Softwareentwicklung: Masterarbeit

    Get PDF
    Despite extensive research on software product lines in the last decades, ad-hoc clone-and-own development is still the dominant way for introducing variability to software systems. Therefore, the same issues for which software product lines were developed in the first place are still imminent in clone-and-own development: Fixing bugs consistently throughout clones and avoiding duplicate implementation effort is extremely diffcult as similarities and differences between variants are unknown. In order to remedy this, we enhance clone-and-own development with techniques from product-line engineering for targeted variant synchronisation such that domain knowledge can be integrated stepwise and without obligation. Contrary to retroactive feature mapping recovery (e.g., mining) techniques, we infer feature-to-code mappings directly during software development when concrete domain knowledge is present. In this thesis, we focus on the first step towards targeted synchronisation between variants: the recording of feature mappings. By letting developers specify on which feature they are working on, we derive feature mappings directly during software development. We ensure syntactic validity of feature mappings and variant synchronisation by implementing disciplined annotations through abstract syntax trees. To bridge the mismatch between change classification in the implementation and abstract layer, we synthesise semantic edits on abstract syntax trees. We show that our derivation can be used to reproduce variability-related real-world code changes and compare it to the feature mapping derivation of the projectional variation control system VTS by Stanciulescu et al.Trotz umfangreicher Forschung zu Software-Produktlinien in den letzten Jahrzehnten ist Clone-and-Own immer noch der dominierende Ansatz zur Einführung von Variabilität in Softwaresystemen. Daher stehen bei Clone-and-Own immer noch die gleichen Probleme im Vordergrund, für die Software-Produktlinien überhaupt erst entwickelt wurden: Die konsistente Behebung von Fehlern in allen Klonen und die Vermeidung von doppeltem Implementierungsaufwand sind äußerst schwierig, da Ähnlichkeiten und Unterschiede zwischen den Varianten unbekannt sind. Um hier Abhilfe zu schaffen, erweitern wir die Clone-and-Own-Entwicklung mit Techniken aus der Produktlinien-Entwicklung zur gezielten Synchronisierung von Varianten, sodass Entwickler ihr Domänenwissen schrittweise und unverbindlich integrieren können. Im Gegensatz zu nachträglich arbeitenden Feature-Mapping-Recovery- oder auch Mining-Techniken, leiten wir Zuordungen von Features zu Quellcode direkt während der Softwareentwicklung ab, wenn konkretes Domänenwissen vorhanden ist. In dieser Arbeit entwickeln wir den ersten Schritt zur gezielten Synchronisation von Varianten: die Aufzeichnung von Feature-Mappings. Indem Entwickler spezifizieren an welchem Feature sie arbeiten, leiten wir Feature-Mappings direkt während der Softwareentwicklung ab. Wir stellen die syntaktische Korrektheit von Feature-Mappings und der Synchronisation von Varianten sicher, indem wir disziplinierte Annotationen mithilfe von abstrakten Syntaxbäumen implementieren. Um die Diskrepanz der Klassifizierung von Änderungen zwischen der Implementierungs- und der Abstraktionsschicht zu überbrücken, synthetisieren wir Semantic Edits auf abstrakten Syntaxbäumen. Wir zeigen, dass unsere Ableitung von Feature-Mappings in der Lage ist reale Codeänderungen zu reproduzieren und vergleichen sie mit der Feature-Mapping-Ableitung des Variationskontrollsystems VTS von Stanciulescu et al

    Detecting Vulnerable JavaScript Libraries in Hybrid Android Applications

    Get PDF
    Smartphone devices are very popular. There are a lot of devices being sold, a lot of applications that are created and a lot of people using those applications. However, mobile applications could only be created in the native language of the mobile platform, for example, Java for Android, or Objective-C for iOS. The concept of hybrid mobile applications was introduced, in order to help developers create mobile cross-platform applications. These hybrid application platforms allow mobile application developers to use other development languages and their ecosystems. In particular, we are focusing on Android hybrid applications created using the Apache Cordova framework, which uses JavaScript (JS) as its development language. We develop an automated method that can detect libraries used by hybrid applications that are known to be vulnerable. This search is performed by applying methods similar to Java code clone detection methods to the dynamic JavaScript language. We derive a signature from the reference JS file in a library and a signature from the unknown JS file in an application. Further, we compare those two signatures to produce a numerical similarity value indicating how close two files are. From this, we conclude whether the unknown file is identical or similar to the known reference file. From the npm repository, we collect JS libraries that are known to be vulnerable based on the vulnerability data provided by Snyk, and end up with 10 686 distinct versions across 698 distinct libraries. We also have access to roughly 100 000 carefully collected Android applications, from which we find that 5652 are Cordova based hybrid applications. We find with manual verification for ten random apps that we can match 71% of library names and 80% of library names and versions. From the analysis of the entire application set, we find that 2557 (45.24%) hybrid applications from our reference set have at least one vulnerable library. Our results show that it is possible to create a tool that conducts code clone detection for the dynamic JS language. Our approach still requires some refinement and improvements for minified JS files. However, it could be used as a stepping stone towards a very precise code clone detection tool based on JS source code analysis
    corecore