Uncovering Features in Behaviorally Similar Programs
The detection of similar code can support many software engineering tasks such as program understanding and program classification. Many excellent approaches have been proposed to detect programs having similar syntactic features. However, these approaches are unable to identify programs dynamically or statistically close to each other, which we call behaviorally similar programs. We believe the detection of behaviorally similar programs can enhance or even automate the tasks relevant to program classification. In this thesis, we will discuss our current approaches to identify programs having similar behavioral features from multiple perspectives.
We first discuss how to detect programs having similar functionality. While the definition of a program’s functionality is undecidable, we use inputs and outputs (I/Os) of programs as the proxy of their functionality. We then use I/Os of programs as a behavioral feature to detect which programs are functionally similar: two programs are functionally similar if they share similar inputs and outputs. This approach has been studied and developed in the C language to detect functionally equivalent programs having equivalent I/Os. Nevertheless, some natural problems in object-oriented languages, such as input generation and comparisons between application-specific data types, hinder the development of this approach. We propose a new technique, in-vivo detection, which uses existing and meaningful inputs to drive applications systematically and then applies a novel similarity model considering both inputs and outputs of programs, to detect functionally similar programs. We develop the tool, HitoshiIO, based on our in-vivo detection. In the subjects that we study, HitoshiIO correctly detects 68.4% of functionally similar programs, with a false positive rate of only 16.6%.
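The idea of using I/Os as a proxy for functionality can be illustrated with a minimal sketch. It is not HitoshiIO's similarity model: the function names, the fixed input set, and the use of Jaccard similarity over recorded (input, output) pairs are all assumptions chosen for illustration.

```python
def io_profile(func, inputs):
    """Record the (input, output) pairs observed when driving func."""
    return {(x, func(x)) for x in inputs}


def io_similarity(f, g, inputs):
    """Jaccard similarity between the I/O profiles of f and g (assumed metric)."""
    a, b = io_profile(f, inputs), io_profile(g, inputs)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical subjects: two syntactically different sums and one unrelated function.
def sum_loop(n):
    total = 0
    for i in range(n + 1):
        total += i
    return total


def sum_formula(n):
    return n * (n + 1) // 2


def square(n):
    return n * n


if __name__ == "__main__":
    inputs = range(10)
    print(io_similarity(sum_loop, sum_formula, inputs))
    print(io_similarity(sum_loop, square, inputs))
```

Under these assumptions, the two summation functions share every observed I/O pair, while the unrelated function agrees only on the inputs 0 and 1.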
In addition to functional I/Os of programs, we attempt to discover programs having similar execution behavior. Again, the execution behavior of a program can be undecidable, so we use instructions executed at run-time as a behavioral feature of a program. We create DyCLINK, which observes program executions and encodes them in dynamic instruction graphs. A vertex in a dynamic instruction graph is an instruction and an edge is a type of dependency between two instructions. The problem of detecting which programs have similar executions can then be reduced to a problem of inexact graph isomorphism. We propose a link analysis based algorithm, LinkSub, which vectorizes each dynamic instruction graph by the importance of every instruction, to solve this graph isomorphism problem efficiently. In a K Nearest Neighbor (KNN) based program classification experiment, DyCLINK achieves 90+% precision.
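Vectorizing an instruction graph by the importance of its instructions can be sketched as follows. This is not LinkSub itself: the sketch assumes a plain PageRank power iteration as the link analysis, sums the ranks per opcode to form a vector, and compares vectors by cosine similarity. The graph encoding and all function names are hypothetical.

```python
import math
from collections import defaultdict


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling nodes spread rank evenly."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


def opcode_vector(graph):
    """Sum the importance of every instruction per opcode (assumed encoding)."""
    rank = pagerank(list(graph["nodes"]), graph["edges"])
    vec = defaultdict(float)
    for node, opcode in graph["nodes"].items():
        vec[opcode] += rank[node]
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse opcode vectors."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical graphs: vertices are instructions, edges are dependencies.
G1 = {"nodes": {0: "iload", 1: "iload", 2: "iadd", 3: "ireturn"},
      "edges": [(0, 2), (1, 2), (2, 3)]}
G2 = {"nodes": {10: "iload", 11: "iload", 12: "iadd", 13: "ireturn"},
      "edges": [(10, 12), (11, 12), (12, 13)]}
G3 = {"nodes": {0: "iload", 1: "iload", 2: "imul", 3: "ireturn"},
      "edges": [(0, 2), (1, 2), (2, 3)]}

if __name__ == "__main__":
    print(cosine(opcode_vector(G1), opcode_vector(G2)))
    print(cosine(opcode_vector(G1), opcode_vector(G3)))
```

Comparing fixed-length vectors avoids an explicit subgraph search, which is the motivation the abstract gives for vectorizing; how LinkSub actually orders and matches instructions is not specified here.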
Because HitoshiIO and DyCLINK both rely on dynamic analysis to expose program behavior, they have better capability to locate and search for behaviorally similar programs than traditional static analysis tools. However, they suffer from some common problems of dynamic analysis, such as input generation and run-time overhead. These problems may make our approaches challenging to scale. Thus, we create the system, Macneto, which integrates static analysis with machine topic modeling and deep learning to approximate program behaviors from their binaries without truly executing programs. In our deobfuscation experiments considering two commercial obfuscators that alter lexical information and syntax in programs, Macneto achieves 90+% precision, where the ground truth is that the behavior of a program before and after obfuscation should be the same.
In this thesis, we offer a more extensive view of similar programs than the traditional definitions. While the traditional definitions of similar programs mostly use static features, such as syntax and lexical information, we propose to leverage the power of dynamic analysis and machine learning models to trace/collect behavioral features of programs. These behavioral features can then be applied to detect behaviorally similar programs. We believe the techniques we invented in this thesis to detect behaviorally similar programs can improve the development of software engineering and security applications, such as code search and deobfuscation.
DSpot: Test Amplification for Automatic Assessment of Computational Diversity
Context: Computational diversity, i.e., the presence of a set of programs
that all perform compatible services but that exhibit behavioral differences
under certain conditions, is essential for fault tolerance and security.
Objective: We aim at proposing an approach for automatically assessing the
presence of computational diversity. In this work, computationally diverse
variants are defined as (i) sharing the same API, (ii) behaving the same
according to an input-output based specification (a test-suite) and (iii)
exhibiting observable differences when they run outside the specified input
space. Method: Our technique relies on test amplification. We propose source
code transformations on test cases to explore the input domain and
systematically sense the observation domain. We quantify computational
diversity as the dissimilarity between observations on inputs that are outside
the specified domain. Results: We run our experiments on 472 variants of 7
classes from open-source, large and thoroughly tested Java classes. Our test
amplification multiplies by ten the number of input points in the test suite
and is effective at detecting software diversity. Conclusion: The key insights
of this study are: the systematic exploration of the observable output space of
a class provides new insights about its degree of encapsulation; the behavioral
diversity that we observe originates from areas of the code that are
characterized by their flexibility (caching, checking, formatting, etc.).
Comment: 12 pages
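The notion of variants that agree on the specified inputs but differ outside them can be sketched at the value level. DSpot itself transforms Java test source code; the snippet below only mimics the idea, and the two variants, the amplification operators, and the diversity measure are assumptions made for illustration.

```python
# Two hypothetical variants sharing one API; they agree on inputs in [0, 100].
def variant_a(percent):
    """Clamp out-of-range values into [0, 100]."""
    return max(0, min(100, percent))


def variant_b(percent):
    """Signal out-of-range values by returning None."""
    return percent if 0 <= percent <= 100 else None


# Inputs exercised by the (assumed) existing test suite.
SPECIFIED_INPUTS = [0, 25, 50, 100]


def amplify(inputs):
    """Derive new input points from existing ones with simple numeric operators."""
    amplified = set()
    for x in inputs:
        amplified.update({x + 1, x - 1, 2 * x, -x})
    return sorted(amplified - set(inputs))


def diversity(f, g, inputs):
    """Fraction of inputs on which the two variants are observably different."""
    if not inputs:
        return 0.0
    return sum(1 for x in inputs if f(x) != g(x)) / len(inputs)


if __name__ == "__main__":
    print(diversity(variant_a, variant_b, SPECIFIED_INPUTS))
    print(diversity(variant_a, variant_b, amplify(SPECIFIED_INPUTS)))
```

In this toy setting the original inputs reveal no difference between the variants, while the amplified inputs that fall outside [0, 100] do.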
Behavioral Analysis for Detecting Code Clones
Software developers often copy and paste code fragments from one source file into another because it is easier than writing the code by hand. This practice increases the effort needed to maintain the code. One approach to detecting semantic clones is based on the behavior of the code, which is observed through the input, output, and effects of a method. Methods with the same input, output, and effect values are considered semantically identical. However, detection based on input, output, and effect cannot be applied directly to a void method or a method without parameters, even though comprehensive detection is required. The challenge is to determine which variables in a method act as input, output, and effect. In a void method, these variables are identified using a Program Dependence Graph. Behavior-based semantic clone detection can increase the agreement value
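Comparing methods by input, output, and effect can be sketched for the void, parameterless case. The abstract identifies the relevant variables statically with a Program Dependence Graph; this sketch instead assumes that the state a method touches is passed in as a dictionary and observes the effect dynamically. All names are hypothetical.

```python
import copy


def behavior(method, state, *args):
    """Return (inputs, output, effect) observed for one call on a copy of state."""
    before = copy.deepcopy(state)
    working = copy.deepcopy(state)
    result = method(working, *args)
    # The effect is every state entry whose value changed during the call.
    effect = {k: v for k, v in working.items() if before.get(k) != v}
    return (args, result, tuple(sorted(effect.items())))


# Hypothetical void methods without parameters, acting only on shared state.
def reset_counter(state):
    state["count"] = 0


def clear_counter(state):
    state["count"] = state["count"] - state["count"]


def bump_counter(state):
    state["count"] += 1


if __name__ == "__main__":
    initial = {"count": 5}
    print(behavior(reset_counter, initial) == behavior(clear_counter, initial))
    print(behavior(reset_counter, initial) == behavior(bump_counter, initial))
```

Because neither method returns a value or takes arguments, only the recorded effect distinguishes them, which is why the effect component is needed for void methods.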
Semantic Clone Detection via Probabilistic Software Modeling
Semantic clone detection is the process of finding program elements with
similar or equal runtime behavior. For example, detecting the semantic equality
between the recursive and iterative implementation of the factorial
computation. Semantic clone detection is the de facto technical boundary of
clone detectors. This boundary was tested over the last years with interesting
new approaches. This work contributes a semantic clone detection approach that
detects clones with 0% syntactic similarity. We present Semantic Clone
Detection via Probabilistic Software Modeling (SCD-PSM) as a stable and precise
solution to semantic clone detection. PSM builds a probabilistic model of a
program that is capable of evaluating and generating runtime data. SCD-PSM
leverages this model and its model elements to find behaviorally equal model
elements. This behavioral equality is then generalized to semantic equality of
the original program elements. It uses the likelihood between model elements as
a distance metric. Then, it employs the likelihood ratio significance test to
decide whether this distance is significant, given a pre-specified and
controllable false-positive rate. The output of SCD-PSM is a set of pairs of program
elements (i.e., methods), their distance, and a decision whether they are
clones or not. SCD-PSM yields excellent results with a Matthews Correlation
Coefficient greater than 0.9. These results are obtained on classical semantic clone
detection problems such as detecting recursive and iterative versions of an
algorithm, but also on complex problems used in coding competitions.
Comment: 12 pages, 2 pages of references, 5 listings, 2 figures, 4 tables