53 research outputs found

    Malware Detection Using Dynamic Analysis

    Get PDF
    In this research, we explore the field of dynamic analysis which has shown promis- ing results in the field of malware detection. Here, we extract dynamic software birth- marks during malware execution and apply machine learning based detection tech- niques to the resulting feature set. Specifically, we consider Hidden Markov Models and Profile Hidden Markov Models. To determine the effectiveness of this dynamic analysis approach, we compare our detection results to the results obtained by using static analysis. We show that in some cases, significantly stronger results can be obtained using our dynamic approach

    Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

    Full text link
    Binary code analysis allows analyzing binary code without having access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text of various natural languages. We notice that binary code analysis and NLP share a lot of analogical topics, such as semantics extraction, summarization, and classification. This work utilizes these ideas to address two important code similarity comparison problems. (I) Given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics is similar or not; and (II) given a piece of code of interest, determining if it is contained in another piece of assembly code for a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system INNEREYE and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability. And the case studies utilizing the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis.Comment: Accepted by Network and Distributed Systems Security (NDSS) Symposium 201

    Software similarity and classification

    Full text link
    This thesis analyses software programs in the context of their similarity to other software programs. Applications proposed and implemented include detecting malicious software and discovering security vulnerabilities

    Automatic mining of functionally equivalent code fragments via random testing

    Get PDF
    Similar code may exist in large software projects due to some com-mon software engineering practices, such as copying and pasting code and n-version programming. Although previous work has studied syntactic equivalence and small-scale, coarse-grained pro-gram-level and function-level semantic equivalence, it is not known whether significant fine-grained, code-level semantic duplications exist. Detecting such semantic equivalence is also desirable be-cause it can enable many applications such as code understanding, maintenance, and optimization. In this paper, we introduce the first algorithm to automatically mine functionally equivalent code fragments of arbitrary size— down to an executable statement. Our notion of functional equiva-lence is based on input and output behavior. Inspired by Schwartz’s randomized polynomial identity testing, we develop our core algo-rithm using automated random testing: (1) candidate code frag-ments are automatically extracted from the input program; and (2) random inputs are generated to partition the code fragments based on their output values on the generated inputs. We implemented the algorithm and conducted a large-scale empirical evaluation of it on the Linux kernel 2.6.24. Our results show that there exist many functionally equivalent code fragments that are syntactically different (i.e., they are unlikely due to copying and pasting code). The algorithm also scales to million-line programs; it was able to analyze the Linux kernel with several days of parallel processing

    a framework for automated similarity analysis of malware

    Get PDF
    Malware, a category of software including viruses, worms, and other malicious programs, is developed by hackers to damage, disrupt, or perform other harmful actions on data, computer systems and networks. Malware analysis, as an indispensable part of the work of IT security specialists, aims to gain an in-depth understanding of malware code. Manual analysis of malware is a very costly and time-consuming process. As more malware variants are evolved by hackers who occasionally use a copy-paste-modify programming style to accelerate the generation of large number of malware, the effort spent in analyzing similar pieces of malicious code has dramatically grown. One approach to remedy this situation is to automatically perform similarity analysis on malware samples and identify the functions they share in order to minimize duplicated effort in analyzing similar codes of malware variants. In this thesis, we present a framework to match cloned functions in a large chunk of malware samples. Firstly, the instructions of the functions to be analyzed are extracted from the disassembled malware binary code and then normalized. We propose a new similarity metric and use it to determine the pair-wise similarity among malware samples based on the calculated similarity of their functions. The developed tool also includes an API class recognizer designed to determine probable malicious operations that can be performed by malware functions. Furthermore, it allows us to visualize the relationship among functions inside malware codes and locate similar functions importing the same API class. We evaluate this framework on three malware datasets including metamorphic viruses created by malware generation tools, real-life malware variants in the wild, and two well-known botnet trojans. The obtained experimental results confirm that the proposed framework is effective in detecting similar malware code

    A Method for Detecting Abnormal Program Behavior on Embedded Devices

    Get PDF
    A potential threat to embedded systems is the execution of unknown or malicious software capable of triggering harmful system behavior, aimed at theft of sensitive data or causing damage to the system. Commercial off-the-shelf embedded devices, such as embedded medical equipment, are more vulnerable as these type of products cannot be amended conventionally or have limited resources to implement protection mechanisms. In this paper, we present a self-organizing map (SOM)-based approach to enhance embedded system security by detecting abnormal program behavior. The proposed method extracts features derived from processor's program counter and cycles per instruction, and then utilises the features to identify abnormal behavior using the SOM. Results achieved in our experiment show that the proposed method can identify unknown program behaviors not included in the training set with over 98.4% accuracy

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    Get PDF
    In many forensic investigations, questions linger regarding the identity of the authors of the software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done around analyzing obfuscated code for attribution. In part, the reason for this gap in the research is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features from the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult, time consuming, and in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input into a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and provide predictions on who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods to compare prediction accuracies with prediction accuracies gathered from this new, dynamic analysis based method. Experiments indicate that this new method can provide better accuracy of author attribution for files of unknown provenance, especially in the case where the specimen file has been obfuscated

    Malware variant detection

    Get PDF
    Malware programs (e.g., viruses, worms, Trojans, etc.) are a worldwide epidemic. Studies and statistics show that the impact of malware is getting worse. Malware detectors are the primary tools in the defence against malware. Most commercial anti-malware scanners maintain a database of malware patterns and heuristic signatures for detecting malicious programs within a computer system. Malware writers use semantic-preserving code transformation (obfuscation) techniques to produce new stealth variants of their malware programs. Malware variants are hard to detect with today's detection technologies as these tools rely mostly on syntactic properties and ignore the semantics of malicious executable programs. A robust malware detection technique is required to handle this emerging security threat. In this thesis, we propose a new methodology that overcomes the drawback of existing malware detection methods by analysing the semantics of known malicious code. The methodology consists of three major analysis techniques: the development of a semantic signature, slicing analysis and test data generation analysis. The core element in this approach is to specify an approximation for malware code semantics and to produce signatures for identifying, possibly obfuscated but semantically equivalent, variants of a sample of malware. A semantic signature consists of a program test input and semantic traces of a known malware code. The key challenge in developing our semantics-based approach to malware variant detection is to achieve a balance between improving the detection rate (i.e. matching semantic traces) and performance, with or without the e ects of obfuscation on malware variants. We develop slicing analysis to improve the construction of semantic signatures. We back our trace-slicing method with a theoretical result that shows the notion of correctness of the slicer. A proof-of-concept implementation of our malware detector demonstrates that the semantics-based analysis approach could improve current detection tools and make the task more di cult for malware authors. Another important part of this thesis is exploring program semantics for the selection of a suitable part of the semantic signature, for which we provide two new theoretical results. In particular, this dissertation includes a test data generation method that works for binary executables and the notion of correctness of the method

    Automated Failure Explanation Through Execution Comparison

    Get PDF
    When fixing a bug in software, developers must build an understanding or explanation of the bug and how the bug flows through a program. The effort that developers must put into building this explanation is costly and laborious. Thus, developers need tools that can assist them in explaining the behavior of bugs. Dynamic slicing is one technique that can effectively show how a bug propagates through an execution up to the point where a program fails. However, dynamic slices are large because they do not just explain the bug itself; they include extra information that explains any observed behavior that might be connected to the bug. Thus, the explanation of the bug is hidden within this other tangentially related information. This dissertation addresses the problem and shows how a failing execution and a correct execution may be compared in order to construct explanations that include only information about what caused the bug. As a result, these automated explanations are significantly more concise than those explanations produced by existing dynamic slicing techniques. To enable the comparison of executions, we develop new techniques for dynamic analyses that identify the commonalities and differences between executions. First, we devise and implement the notion of a point within an execution that may exist across multiple executions. We also note that comparing executions involves comparing the state or variables and their values that exist within the executions at different execution points. Thus, we design an approach for identifying the locations of variables in different executions so that their values may be compared. Leveraging these tools, we design a system for identifying the behaviors within an execution that can be blamed for a bug and that together compose an explanation for the bug. These explanations are up to two orders of magnitude smaller than those produced by existing state of the art techniques. We also examine how different choices of a correct execution for comparison can impact the practicality or potential quality of the explanations produced via our system
    • …
    corecore