Search CORE

10,231 research outputs found

An Extended Stable Marriage Problem Algorithm for Clone Detection

Author: AlHakami Hosam
Chen Feng
Janicke Helge
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 01/01/2014
Field of study

Code cloning negatively affects industrial software and threatens intellectual property. This paper presents a novel approach to detecting cloned software by using a bijective matching technique. The proposed approach focuses on increasing the range of similarity measures and thus enhancing the precision of the detection. This is achieved by extending a well-known stable-marriage problem (SMP) and demonstrating how matches between code fragments of different files can be expressed. A prototype of the proposed approach is provided using a proper scenario, which shows a noticeable improvement in several features of clone detection such as scalability and accuracy.Comment: 20 pages, 10 figures, 6 table

arXiv.org e-Print Archive

CiteSeerX

Structured Review of Code Clone Literature

Author: Hordijk Wiebe
Ponisio María Laura
Wieringa Roel
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2008
Field of study

This report presents the results of a structured review of code clone literature. The aim of the review is to assemble a conceptual model of clone-related concepts which helps us to reason about clones. This conceptual model unifies clone concepts from a wide range of literature, so that findings about clones can be compared with each other

University of Twente Research Information

Clone Detection and Elimination for Haskell

Author: Brown Christopher
Thompson Simon
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2010
Field of study

Duplicated code is a well known problem in software maintenance and refactoring. Code clones tend to increase program size and several studies have shown that duplicated code makes maintenance and code understanding more complex and time consuming. This paper presents a new technique for the detection and removal of duplicated Haskell code. The system is implemented within the refactoring framework of the Haskell Refactorer (HaRe), and uses an Abstract Syntax Tree (AST) based approach. Detection of duplicate code is automatic, while elimination is semi-automatic, with the user managing the clone removal. After presenting the system, an example is given to show how it works in practice

Kent Academic Repository

University of St. Andrews - Pure

Building Program Vector Representations for Deep Learning

Author: Jin Zhi
Li Ge
Liu Yuxuan
Mou Lili
Peng Hao
Xu Yan
Zhang Lu
Publication venue
Publication date: 11/09/2014
Field of study

Deep learning has made significant breakthroughs in various fields of artificial intelligence. Advantages of deep learning include the ability to capture highly complicated features, weak involvement of human engineering, etc. However, it is still virtually impossible to use deep learning to analyze programs since deep architectures cannot be trained effectively with pure back propagation. In this pioneering paper, we propose the "coding criterion" to build program vector representations, which are the premise of deep learning for program analysis. Our representation learning approach directly makes deep learning a reality in this new field. We evaluate the learned vector representations both qualitatively and quantitatively. We conclude, based on the experiments, the coding criterion is successful in building program representations. To evaluate whether deep learning is beneficial for program analysis, we feed the representations to deep neural networks, and achieve higher accuracy in the program classification task than "shallow" methods, such as logistic regression and the support vector machine. This result confirms the feasibility of deep learning to analyze programs. It also gives primary evidence of its success in this new field. We believe deep learning will become an outstanding technique for program analysis in the near future.Comment: This paper was submitted to ICSE'1

arXiv.org e-Print Archive

CiteSeerX

Crossref

SourcererCC: Scaling Code Clone Detection to Big Code

Author: Lopes Cristina V.
Roy Chanchal K.
Saini Vaibhav
Sajnani Hitesh
Svajlenko Jeffrey
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/12/2015
Field of study

Despite a decade of active research, there is a marked lack in clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks, (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250MLOC) using a standard workstation.Comment: Accepted for publication at ICSE'16 (preprint, unrevised

arXiv.org e-Print Archive

Crossref

Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

Author: Li Xiaopeng
Luo Lannan
Young Patrick
Zeng Qiang
Zhang Zhexin
Zuo Fei
Publication venue: 'Internet Society'
Publication date: 16/12/2018
Field of study

Binary code analysis allows analyzing binary code without having access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text of various natural languages. We notice that binary code analysis and NLP share a lot of analogical topics, such as semantics extraction, summarization, and classification. This work utilizes these ideas to address two important code similarity comparison problems. (I) Given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics is similar or not; and (II) given a piece of code of interest, determining if it is contained in another piece of assembly code for a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system INNEREYE and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability. And the case studies utilizing the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis.Comment: Accepted by Network and Distributed Systems Security (NDSS) Symposium 201

arXiv.org e-Print Archive

Crossref

apk2vec: Semi-supervised multi-view representation learning for profiling Android applications

Author: Chen Lihui
Liu Yang
Narayanan Annamalai
Soh Charlie
Wang Lipo
Publication venue
Publication date: 01/01/2018
Field of study

Building behavior profiles of Android applications (apps) with holistic, rich and multi-view information (e.g., incorporating several semantic views of an app such as API sequences, system calls, etc.) would help catering downstream analytics tasks such as app categorization, recommendation and malware analysis significantly better. Towards this goal, we design a semi-supervised Representation Learning (RL) framework named apk2vec to automatically generate a compact representation (aka profile/embedding) for a given app. More specifically, apk2vec has the three following unique characteristics which make it an excellent choice for largescale app profiling: (1) it encompasses information from multiple semantic views such as API sequences, permissions, etc., (2) being a semi-supervised embedding technique, it can make use of labels associated with apps (e.g., malware family or app category labels) to build high quality app profiles, and (3) it combines RL and feature hashing which allows it to efficiently build profiles of apps that stream over time (i.e., online learning). The resulting semi-supervised multi-view hash embeddings of apps could then be used for a wide variety of downstream tasks such as the ones mentioned above. Our extensive evaluations with more than 42,000 apps demonstrate that apk2vec's app profiles could significantly outperform state-of-the-art techniques in four app analytics tasks namely, malware detection, familial clustering, app clone detection and app recommendation.Comment: International Conference on Data Mining, 201

arXiv.org e-Print Archive

Crossref

DR-NTU (Digital Repository of NTU)