6 research outputs found
Malware Detection Using Dynamic Analysis
In this research, we explore the field of dynamic analysis which has shown promis- ing results in the field of malware detection. Here, we extract dynamic software birth- marks during malware execution and apply machine learning based detection tech- niques to the resulting feature set. Specifically, we consider Hidden Markov Models and Profile Hidden Markov Models. To determine the effectiveness of this dynamic analysis approach, we compare our detection results to the results obtained by using static analysis. We show that in some cases, significantly stronger results can be obtained using our dynamic approach
Software similarity and classification
This thesis analyses software programs in the context of their similarity to other software programs. Applications proposed and implemented include detecting malicious software and discovering security vulnerabilities
a framework for automated similarity analysis of malware
Malware, a category of software including viruses, worms, and other malicious programs, is developed by hackers to damage, disrupt, or perform other harmful actions on data, computer systems and networks. Malware analysis, as an indispensable part of the work of IT security specialists, aims to gain an in-depth understanding of malware code. Manual analysis of malware is a very costly and time-consuming process. As more malware variants are evolved by hackers who occasionally use a copy-paste-modify programming style to accelerate the generation of large number of malware, the effort spent in analyzing similar pieces of malicious code has dramatically grown. One approach to remedy this situation is to automatically perform similarity analysis on malware samples and identify the functions they share in order to minimize duplicated effort in analyzing similar codes of malware variants.
In this thesis, we present a framework to match cloned functions in a large chunk of malware samples. Firstly, the instructions of the functions to be analyzed are extracted from the disassembled malware binary code and then normalized. We propose a new similarity metric and use it to determine the pair-wise similarity among malware samples based on the calculated similarity of their functions. The developed tool also includes an API class recognizer designed to determine probable malicious operations that can be performed by malware functions. Furthermore, it allows us to visualize the relationship among functions inside malware codes and locate similar functions importing the same API class. We evaluate this framework on three malware datasets including metamorphic viruses created by malware generation tools, real-life malware variants in the wild, and two well-known botnet trojans. The obtained experimental results confirm that the proposed framework is effective in detecting similar malware code
Automated Failure Explanation Through Execution Comparison
When fixing a bug in software, developers must build an understanding or explanation of the bug and how the bug flows through a program. The effort that developers must put into building this explanation is costly and laborious. Thus, developers need tools that can assist them in explaining the behavior of bugs. Dynamic slicing is one technique that can effectively show how a bug propagates through an execution up to the point where a program fails. However, dynamic slices are large because they do not just explain the bug itself; they include extra information that explains any observed behavior that might be connected to the bug. Thus, the explanation of the bug is hidden within this other tangentially related information. This dissertation addresses the problem and shows how a failing execution and a correct execution may be compared in order to construct explanations that include only information about what caused the bug. As a result, these automated explanations are significantly more concise than those explanations produced by existing dynamic slicing techniques.
To enable the comparison of executions, we develop new techniques for dynamic analyses that identify the commonalities and differences between executions. First, we devise and implement the notion of a point within an execution that may exist across multiple executions. We also note that comparing executions involves comparing the state or variables and their values that exist within the executions at different execution points. Thus, we design an approach for identifying the locations of variables in different executions so that their values may be compared. Leveraging these tools, we design a system for identifying the behaviors within an execution that can be blamed for a bug and that together compose an explanation for the bug. These explanations are up to two orders of magnitude smaller than those produced by existing state of the art techniques. We also examine how different choices of a correct execution for comparison can impact the practicality or potential quality of the explanations produced via our system
THE SCALABLE AND ACCOUNTABLE BINARY CODE SEARCH AND ITS APPLICATIONS
The past decade has been witnessing an explosion of various applications and devices.
This big-data era challenges the existing security technologies: new analysis techniques
should be scalable to handle “big data” scale codebase; They should be become smart
and proactive by using the data to understand what the vulnerable points are and where
they locate; effective protection will be provided for dissemination and analysis of the data
involving sensitive information on an unprecedented scale.
In this dissertation, I argue that the code search techniques can boost existing security
analysis techniques (vulnerability identification and memory analysis) in terms of scalability and accuracy. In order to demonstrate its benefits, I address two issues of code search by using the code analysis: scalability and accountability. I further demonstrate the benefit of code search by applying it for the scalable vulnerability identification [57] and the
cross-version memory analysis problems [55, 56].
Firstly, I address the scalability problem of code search by learning “higher-level” semantic
features from code [57]. Instead of conducting fine-grained testing on a single device
or program, it becomes much more crucial to achieve the quick vulnerability scanning
in devices or programs at a “big data” scale. However, discovering vulnerabilities in “big
code” is like finding a needle in the haystack, even when dealing with known vulnerabilities. This new challenge demands a scalable code search approach. To this end, I leverage successful techniques from the image search in computer vision community and propose a novel code encoding method for scalable vulnerability search in binary code. The evaluation results show that this approach can achieve comparable or even better accuracy and efficiency than the baseline techniques.
Secondly, I tackle the accountability issues left in the vulnerability searching problem
by designing vulnerability-oriented raw features [58]. The similar code does not always
represent the similar vulnerability, so it requires that the feature engineering for the code
search should focus on semantic level features rather than syntactic ones. I propose to
extract conditional formulas as higher-level semantic features from the raw binary code to
conduct the code search. A conditional formula explicitly captures two cardinal factors
of a vulnerability: 1) erroneous data dependencies and 2) missing or invalid condition
checks. As a result, the binary code search on conditional formulas produces significantly
higher accuracy and provides meaningful evidence for human analysts to further examine
the search results. The evaluation results show that this approach can further improve
the search accuracy of existing bug search techniques with very reasonable performance
overhead.
Finally, I demonstrate the potential of the code search technique in the memory analysis
field, and apply it to address their across-version issue in the memory forensic problem
[55, 56]. The memory analysis techniques for COTS software usually rely on the
so-called “data structure profiles” for their binaries. Construction of such profiles requires
the expert knowledge about the internal working of a specified software version. However,
it is still a cumbersome manual effort most of time. I propose to leverage the code search
technique to enable a notion named “cross-version memory analysis”, which can update a
profile for new versions of a software by transferring the knowledge from the model that
has already been trained on its old version. The evaluation results show that the code search based approach advances the existing memory analysis methods by reducing the
manual efforts while maintaining the reasonable accuracy. With the help of collaborators, I
further developed two plugins to the Volatility memory forensic framework [2], and show
that each of the two plugins can construct a localized profile to perform specified memory
forensic tasks on the same memory dump, without the need of manual effort in creating the corresponding profile