Search CORE

2,217 research outputs found

Building Program Vector Representations for Deep Learning

Author: Jin Zhi
Li Ge
Liu Yuxuan
Mou Lili
Peng Hao
Xu Yan
Zhang Lu
Publication venue
Publication date: 11/09/2014
Field of study

Deep learning has made significant breakthroughs in various fields of artificial intelligence. Advantages of deep learning include the ability to capture highly complicated features, weak involvement of human engineering, etc. However, it is still virtually impossible to use deep learning to analyze programs since deep architectures cannot be trained effectively with pure back propagation. In this pioneering paper, we propose the "coding criterion" to build program vector representations, which are the premise of deep learning for program analysis. Our representation learning approach directly makes deep learning a reality in this new field. We evaluate the learned vector representations both qualitatively and quantitatively. We conclude, based on the experiments, the coding criterion is successful in building program representations. To evaluate whether deep learning is beneficial for program analysis, we feed the representations to deep neural networks, and achieve higher accuracy in the program classification task than "shallow" methods, such as logistic regression and the support vector machine. This result confirms the feasibility of deep learning to analyze programs. It also gives primary evidence of its success in this new field. We believe deep learning will become an outstanding technique for program analysis in the near future.Comment: This paper was submitted to ICSE'1

arXiv.org e-Print Archive

CiteSeerX

Crossref

DSpot: Test Amplification for Automatic Assessment of Computational Diversity

Author: Allier Simon
Baudry Benoit
Monperrus Martin
Rodriguez-Cancio Marcelino
Publication venue
Publication date: 09/06/2015
Field of study

Context: Computational diversity, i.e., the presence of a set of programs that all perform compatible services but that exhibit behavioral differences under certain conditions, is essential for fault tolerance and security. Objective: We aim at proposing an approach for automatically assessing the presence of computational diversity. In this work, computationally diverse variants are defined as (i) sharing the same API, (ii) behaving the same according to an input-output based specification (a test-suite) and (iii) exhibiting observable differences when they run outside the specified input space. Method: Our technique relies on test amplification. We propose source code transformations on test cases to explore the input domain and systematically sense the observation domain. We quantify computational diversity as the dissimilarity between observations on inputs that are outside the specified domain. Results: We run our experiments on 472 variants of 7 classes from open-source, large and thoroughly tested Java classes. Our test amplification multiplies by ten the number of input points in the test suite and is effective at detecting software diversity. Conclusion: The key insights of this study are: the systematic exploration of the observable output space of a class provides new insights about its degree of encapsulation; the behavioral diversity that we observe originates from areas of the code that are characterized by their flexibility (caching, checking, formatting, etc.).Comment: 12 page

arXiv.org e-Print Archive

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

HAL-Rennes 1

Generic modelling of code clones

Author: Giesecke Simon
Publication venue: Dagstuhl Seminar Proceedings. 06301 - Duplication, Redundancy, and Similarity in Software
Publication date: 01/01/2006
Field of study

Code clones, i.e. instances of duplicated code, can be found in many software systems. They adversely affect the software systems ’ quality, in particular their maintainability and comprehensibility. Thus, this as-pect is particularly important to consider in software maintenance and re-engineering. Many different algorithms detecting code clones have been developed. For various reasons, it is difficult to compare the results of different algorithms. Most notable among these reasons is that there is no conceptual model allowing description of code clones determined by different algorithms. Much more, each algorithm uses its specific concept of code clones, which is rarely made explicit. To overcome these problems, we have developed a generic model for describing clones. The model is generic in that it is independent of the pro-gramming language examined and of the clone detection algorithm used. It is flexible enough to facilitate various granularities of artifacts employed for selection and comparison, including inexact clones. The model allows separation of concerns between clone detection, description and manage-ment, which reduces the effort for the implementation of tools supporting these activities. On the basis of the model, we have implemented a pro-totype tool supporting these activities, tightly integrated into the Eclipse environment.

CiteSeerX

Dagstuhl Research Online Publication Server

Comparison and Evaluation of Clone Detection Tools

Author: Ettore Merlo
Giuliano Antoniol
Jens Krinke
Rainer Koschke
Stefan Bellon
Publication venue
Publication date: 01/01/2007
Field of study

Many techniques for detecting duplicated source code (software clones) have been proposed in the past. However, it is not yet clear how these techniques compare in terms of recall and precision as well as space and time requirements. This paper presents an experiment that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC). Their clone candidates were evaluated by one of the authors as an independent third party. The selected techniques cover the whole spectrum of the state-of-the-art in clone detection. The techniques work on text, lexical and syntactic information, software metrics, and program dependency graphs

CiteSeerX

UCL Discovery

PolyPublie

Attack of the clones: an investigation into removing redundant source code

Author: Bailey John Oliver
Publication venue
Publication date: 01/01/2002
Field of study

Long-term maintenance of code will often lead to the introduction of duplicated or 'cloned' code. Legacy systems riddled with these clones have large amounts of redundant code and are more difficult to understand and maintain. One option available to improve maintainability and to increase software reuse, is to re-engineer code clones into reusable components. However, before this can be achieved detection and removal of this redundant code is necessary. There are several established clone detection tools for software maintenance and this thesis aims to investigate the similarities between their output. It also looks at how maintainers may best use them to reduce the amount of redundant code in a software system. This will be achieved by running clone detection tools on several different case studies. Included in these case studies will be a novel tool called Covet inspired by research of Mayrand [May96b] which attempted to identify cloned routines through a comparison of software metrics generated from each routine. It was found that none of the clone detection tools achieved either 100% precision or 100% recall. Each tool identified very different sets of clones. Overall MOSS achieved the greatest precision and CCFinder the greatest recall. Also observed was that the use of automatically generated code increased the proportion of clones found in a software system

Durham e-Theses

Implant Global and Local Hierarchy Information to Sequence based Code Representation Models

Author: Jin Zhi
Li Ge
Li Zhuo
Zhang Kechi
Publication venue
Publication date: 14/03/2023
Field of study

Source code representation with deep learning techniques is an important research field. There have been many studies that learn sequential or structural information for code representation. But sequence-based models and non-sequence-models both have their limitations. Researchers attempt to incorporate structural information to sequence-based models, but they only mine part of token-level hierarchical structure information. In this paper, we analyze how the complete hierarchical structure influences the tokens in code sequences and abstract this influence as a property of code tokens called hierarchical embedding. The hierarchical embedding is further divided into statement-level global hierarchy and token-level local hierarchy. Furthermore, we propose the Hierarchy Transformer (HiT), a simple but effective sequence model to incorporate the complete hierarchical embeddings of source code into a Transformer model. We demonstrate the effectiveness of hierarchical embedding on learning code structure with an experiment on variable scope detection task. Further evaluation shows that HiT outperforms SOTA baseline models and show stable training efficiency on three source code-related tasks involving classification and generation tasks across 8 different datasets.Comment: Accepted by ICPC 202

arXiv.org e-Print Archive