
    Clone Detection via Structural Abstraction

    This paper describes the design, implementation, and application of a new algorithm to detect cloned code. It operates on the abstract syntax trees formed by many compilers as an intermediate representation. It extends prior work by identifying clones even when arbitrary subtrees have been changed. On a 440,000-line code corpus, 20-50% of the clones it detected were missed by previous methods. The method also identifies cloning in declarations, so it is somewhat more general than conventional procedural abstraction.

    Structured Review of the Evidence for Effects of Code Duplication on Software Quality

    This report presents the detailed steps and results of a structured review of the code clone literature. The aim of the review is to investigate the evidence for the claim that code duplication has a negative effect on code changeability. This report contains only those details of the review for which there was not enough room in the companion conference paper (Hordijk, Ponisio et al. 2009 - Harmfulness of Code Duplication - A Structured Review of the Evidence).

    An Extended Stable Marriage Problem Algorithm for Clone Detection

    Code cloning negatively affects industrial software and threatens intellectual property. This paper presents a novel approach to detecting cloned software by using a bijective matching technique. The proposed approach focuses on increasing the range of similarity measures and thus enhancing the precision of the detection. This is achieved by extending the well-known stable marriage problem (SMP) and demonstrating how matches between code fragments of different files can be expressed. A prototype of the proposed approach is provided using a proper scenario, which shows a noticeable improvement in several features of clone detection such as scalability and accuracy. Comment: 20 pages, 10 figures, 6 tables
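    As a rough illustration of the matching idea, the sketch below runs the classic Gale-Shapley algorithm on a square matrix of similarity scores between code fragments of two files. It is not the paper's extended SMP formulation; the function name `stable_matching`, the `similarity` matrix, and the sample scores are assumptions made purely for illustration.

```python
def stable_matching(similarity):
    """Classic Gale-Shapley on a square similarity matrix (illustrative only).

    similarity[i][j] is a hypothetical score between fragment i of file A
    and fragment j of file B; higher means more alike.
    Returns a dict mapping each A-fragment index to one B-fragment index.
    """
    n = len(similarity)
    # Each A-fragment ranks B-fragments by descending similarity.
    prefs = [sorted(range(n), key=lambda j: -similarity[i][j]) for i in range(n)]
    next_choice = [0] * n        # next position in prefs[a] to propose to
    partner_of_b = [None] * n    # current A-partner of each B-fragment
    free = list(range(n))        # A-fragments still unmatched

    while free:
        a = free.pop()
        b = prefs[a][next_choice[a]]
        next_choice[a] += 1
        current = partner_of_b[b]
        if current is None:
            partner_of_b[b] = a
        elif similarity[a][b] > similarity[current][b]:
            # b prefers the new proposer; the old partner becomes free again.
            partner_of_b[b] = a
            free.append(current)
        else:
            free.append(a)

    return {a: b for b, a in enumerate(partner_of_b)}


# Made-up scores: fragment 0 of A resembles fragment 0 of B most strongly.
matches = stable_matching([[0.9, 0.2], [0.8, 0.7]])
```

    The result is a one-to-one (bijective) pairing in which no two unpaired fragments would both prefer each other over their assigned partners.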

    apk2vec: Semi-supervised multi-view representation learning for profiling Android applications

    Building behavior profiles of Android applications (apps) with holistic, rich and multi-view information (e.g., incorporating several semantic views of an app such as API sequences, system calls, etc.) would serve downstream analytics tasks such as app categorization, recommendation and malware analysis significantly better. Towards this goal, we design a semi-supervised Representation Learning (RL) framework named apk2vec to automatically generate a compact representation (aka profile/embedding) for a given app. More specifically, apk2vec has the following three unique characteristics which make it an excellent choice for large-scale app profiling: (1) it encompasses information from multiple semantic views such as API sequences, permissions, etc., (2) being a semi-supervised embedding technique, it can make use of labels associated with apps (e.g., malware family or app category labels) to build high quality app profiles, and (3) it combines RL and feature hashing, which allows it to efficiently build profiles of apps that stream over time (i.e., online learning). The resulting semi-supervised multi-view hash embeddings of apps could then be used for a wide variety of downstream tasks such as the ones mentioned above. Our extensive evaluations with more than 42,000 apps demonstrate that apk2vec's app profiles could significantly outperform state-of-the-art techniques in four app analytics tasks, namely malware detection, familial clustering, app clone detection and app recommendation. Comment: International Conference on Data Mining, 201

    Syntax tree fingerprinting: a foundation for source code similarity detection

    Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t. source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.
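    A minimal sketch of the general idea of syntax tree fingerprinting, using Python's standard `ast` and `hashlib` modules. It is not the paper's architecture or hashing strategy: here each subtree is hashed bottom-up from node type names only (so identifiers and literal values are abstracted away), and an in-memory dictionary stands in for the database index. The names `fingerprint`, `find_exact_clones`, `min_size`, and the sample source are assumptions for illustration.

```python
import ast
import hashlib
from collections import defaultdict


def fingerprint(node, index, min_size):
    """Hash a subtree bottom-up; return (digest, node count)."""
    child_digests = []
    size = 1
    for child in ast.iter_child_nodes(node):
        digest, child_size = fingerprint(child, index, min_size)
        child_digests.append(digest)
        size += child_size
    # Only the node type and the child digests are hashed, so two subtrees
    # that differ solely in names or constants get the same fingerprint.
    label = type(node).__name__ + "(" + ",".join(child_digests) + ")"
    digest = hashlib.sha1(label.encode("utf-8")).hexdigest()
    # Index only subtrees that are large enough and carry a source position.
    if size >= min_size and hasattr(node, "lineno"):
        index[digest].append((node.lineno, size))
    return digest, size


def find_exact_clones(source, min_size=5):
    """Return {digest: [(line, size), ...]} for fingerprints seen more than once."""
    index = defaultdict(list)
    fingerprint(ast.parse(source), index, min_size)
    return {d: locs for d, locs in index.items() if len(locs) > 1}


SAMPLE = """
def total(xs):
    acc = 0
    for x in xs:
        acc = acc + x
    return acc

def summed(values):
    result = 0
    for v in values:
        result = result + v
    return result
"""

clones = find_exact_clones(SAMPLE)
```

    In this sketch the two functions in `SAMPLE` fall into the same clone cluster despite their different identifiers, and so do their nested loops; grouping such neighboring exact matches into larger approximate matches is left out.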

    Recommending Stack Overflow Posts for Fixing Runtime Exceptions using Failure Scenario Matching

    Using online Q&A forums, such as Stack Overflow (SO), for guidance to resolve program bugs, among other development issues, is commonplace in modern software development practice. Runtime exceptions (REs) are one such important class of bugs that is actively discussed on SO. In this work we present a technique and prototype tool called MAESTRO that can automatically recommend an SO post that is most relevant to a given Java RE in a developer's code. MAESTRO compares the exception-generating program scenario in the developer's code with that discussed in an SO post and returns the post with the closest match. To extract and compare the exception scenario effectively, MAESTRO first uses the answer code snippets in a post to implicate a subset of lines in the post's question code snippet as responsible for the exception and then compares these lines with the developer's code in terms of their respective Abstract Program Graph (APG) representations. The APG is a simplified and abstracted derivative of an abstract syntax tree, proposed in this work, that allows an effective comparison of the functionality embodied in the high-level program structure, while discarding many of the low-level syntactic or semantic differences. We evaluate MAESTRO on a benchmark of 78 instances of Java REs extracted from the top 500 Java projects on GitHub and show that MAESTRO can return either a highly relevant or somewhat relevant SO post corresponding to the exception instance in 71% of the cases, compared to relevant posts returned in only 8%-44% of instances by four competitor tools based on state-of-the-art techniques. We also conduct a user experience study of MAESTRO with 10 Java developers, where the participants judge MAESTRO as reporting a highly relevant or somewhat relevant post in 80% of the instances. In some cases the post is judged to be even better than the one manually found by the participant.

    On Using UML Diagrams to Identify and Assess Software Design Smells

    Deficiencies in software design or architecture can severely impede and slow down software development and maintenance. Bad smells and anti-patterns can be indicators of poor software design and suggest refactoring the affected source code fragment. In recent years, multiple techniques and tools have been proposed to assist software engineers in identifying smells and guiding them through corresponding refactoring steps. However, these detection tools only cover a modest number of smells so far and also tend to produce false positives, which represent conscious constructs with symptoms similar or identical to actual bad smells (e.g., design patterns). These and other issues in the detection process call for a code or design review in order to identify (missed) design smells and/or re-assess detected smell candidates. UML diagrams are the quasi-standard for documenting software design and are often available in software projects. In this position paper, we investigate whether (and to what extent) UML diagrams can be used for identifying and assessing design smells. Based on a description of difficulties in the smell detection process, we discuss the importance of design reviews. We then investigate to what extent design documentation in terms of UML2 diagrams allows for representing and identifying software design smells. In particular, 14 kinds of design smells and their representability in UML class and sequence diagrams are analyzed. In addition, we discuss further challenges for UML-based identification and assessment of bad smells.

    Primary Structure and Catalytic Mechanism of the Epoxide Hydrolase from Agrobacterium radiobacter AD1

    The epoxide hydrolase gene from Agrobacterium radiobacter AD1, a bacterium that is able to grow on epichlorohydrin as the sole carbon source, was cloned by means of the polymerase chain reaction with two degenerate primers based on the N-terminal and C-terminal sequences of the enzyme. The epoxide hydrolase gene coded for a protein of 294 amino acids with a molecular mass of 34 kDa. An identical epoxide hydrolase gene was cloned from chromosomal DNA of the closely related strain A. radiobacter CFZ11. The recombinant epoxide hydrolase was expressed up to 40% of the total cellular protein content in Escherichia coli BL21(DE3) and the purified enzyme had a kcat of 21 s⁻¹ with epichlorohydrin. Amino acid sequence similarity of the epoxide hydrolase with eukaryotic epoxide hydrolases, haloalkane dehalogenase from Xanthobacter autotrophicus GJ10, and bromoperoxidase A2 from Streptomyces aureofaciens indicated that it belonged to the α/β-hydrolase fold family. This conclusion was supported by secondary structure predictions and analysis of the secondary structure with circular dichroism spectroscopy. The catalytic triad residues of epoxide hydrolase are proposed to be Asp107, His275, and Asp246. Replacement of these residues with Ala/Glu, Arg/Gln, and Ala, respectively, resulted in a dramatic loss of activity for epichlorohydrin. The reaction mechanism of epoxide hydrolase proceeds via a covalently bound ester intermediate, as was shown by single turnover experiments with the His275 → Arg mutant of epoxide hydrolase in which the ester intermediate could be trapped.