56 research outputs found

    Identifying Compiler and Optimization Options from Binary Code using Deep Learning Approaches

    D. Pizzolotto and K. Inoue, "Identifying Compiler and Optimization Options from Binary Code using Deep Learning Approaches," 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), Adelaide, Australia, 2020, pp. 232-242, doi: 10.1109/ICSME46990.2020.00031

    BinComp: A Stratified Approach to Compiler Provenance Attribution

    Compiler provenance encompasses numerous pieces of information, such as the compiler family, compiler version, optimization level, and compiler-related functions. The extraction of such information is imperative for various binary analysis applications, such as function fingerprinting, clone detection, and authorship attribution. It is thus important to develop an efficient and automated approach for extracting compiler provenance. In this study, we present BinComp, a practical approach that analyzes the syntax, structure, and semantics of disassembled functions to extract compiler provenance. BinComp has a stratified architecture with three layers. The first layer applies a supervised compilation process to a set of known programs to model the default code transformation of compilers. The second layer employs an intersection process that disassembles functions across compiled binaries to extract statistical features (e.g., numerical values) from common compiler/linker-inserted functions. This layer labels the compiler-related functions. The third layer extracts semantic features from the labeled compiler-related functions to identify the compiler version and the optimization level. Our experimental results demonstrate that BinComp is efficient in terms of both computational resources and time.
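
The intersection idea in the second layer can be illustrated with a minimal sketch. It assumes each binary has already been reduced to a set of per-function fingerprints; the fingerprint strings, the `common_functions` helper, and the sample data are hypothetical and are not taken from BinComp, which extracts statistical features rather than opaque labels.

```python
def common_functions(binaries):
    """Return the fingerprints that appear in every binary.

    `binaries` maps a binary name to a set of function fingerprints
    (hypothetical strings here). Functions shared by all binaries built
    with the same toolchain are candidates for compiler/linker-inserted code.
    """
    fingerprint_sets = [set(funcs) for funcs in binaries.values()]
    if not fingerprint_sets:
        return set()
    return set.intersection(*fingerprint_sets)


# Made-up fingerprints for three programs built with the same compiler.
binaries = {
    "hello": {"crt_start", "frame_dummy", "main_hello"},
    "sort": {"crt_start", "frame_dummy", "main_sort", "cmp"},
    "cat": {"crt_start", "frame_dummy", "main_cat"},
}

compiler_related = common_functions(binaries)
print(sorted(compiler_related))
```

User-written functions differ from program to program and drop out of the intersection, while startup and helper routines inserted by the toolchain remain.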

    Cross-compiler bipartite vulnerability search

    Open-source libraries are widely used in software development, and the functions from these libraries may contain security vulnerabilities that can provide gateways for attackers. This paper provides a function similarity technique to identify vulnerable functions in compiled programs and proposes a new technique called Cross-Compiler Bipartite Vulnerability Search (CCBVS). CCBVS uses a novel training process and bipartite matching to filter the SVM model's false positives, improving the quality of similar-function identification. This research uses debug symbols in programs compiled from open-source software products to generate the ground truth. This automatic extraction of ground truth allows experimentation with a wide range of programs. The results presented in the paper show that an SVM model trained on a wide variety of programs compiled for Windows and Linux, x86 and Intel 64 architectures can be used to predict function similarity, and that the use of bipartite matching substantially improves the function similarity matching performance. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
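
As a rough illustration of how bipartite matching can prune classifier false positives, the sketch below keeps only candidate pairs whose score clears a threshold and then computes a maximum-cardinality one-to-one matching with an augmenting-path search (Kuhn's algorithm). This is an assumption-laden sketch, not the CCBVS implementation: the scores, the threshold, the function names, and the choice of cardinality rather than weight as the matching objective are all hypothetical.

```python
def bipartite_filter(scores, threshold=0.5):
    """Reduce many-to-many classifier hits to a one-to-one matching.

    `scores` maps (query_function, library_function) to a similarity
    score, e.g. from an SVM. Returns {query_function: library_function}.
    """
    # Keep only pairs the classifier considers similar enough.
    candidates = {}
    for (query, target), score in scores.items():
        if score >= threshold:
            candidates.setdefault(query, []).append(target)
    # Try higher-scoring targets first so ties favor stronger evidence.
    for query, targets in candidates.items():
        targets.sort(key=lambda t: -scores[(query, t)])

    match_of_target = {}  # library function -> query function

    def try_assign(query, visited):
        # Augmenting-path step: claim a free target or displace an owner
        # that can be moved to another of its own candidates.
        for target in candidates.get(query, []):
            if target in visited:
                continue
            visited.add(target)
            owner = match_of_target.get(target)
            if owner is None or try_assign(owner, visited):
                match_of_target[target] = query
                return True
        return False

    for query in sorted(candidates):
        try_assign(query, set())
    return {query: target for target, query in match_of_target.items()}


# Hypothetical scores: f1 resembles two library functions, f2 only one.
scores = {
    ("f1", "memcpy"): 0.90,
    ("f1", "strcpy"): 0.80,
    ("f2", "memcpy"): 0.85,
    ("f3", "strlen"): 0.30,
}
matches = bipartite_filter(scores)
print(matches)
```

The thresholded classifier alone would report three positive pairs here. The matching keeps two, one per function on each side, and reassigns `f1` to `strcpy` so that `f2` can take `memcpy`.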

    Improving Precision for x86 Binary Analysis Techniques

    Static binary analysis is used extensively for detecting security flaws in binary programs. Multiple solutions have been proposed to tackle the challenges presented by static binary analysis. We propose two methods to improve these solutions for better precision on x86-64 binaries. First, we propose a machine-learning-based approach to detect the compiler and optimization level of a binary program, with the aim of augmenting existing heuristic-based solutions by fine-tuning their heuristics. We detect the compiler and optimization level with 83% precision on coreutils, binutils and SPECCPU2006 binaries. Second, we propose an analysis to detect memory layout from a binary program’s perspective. This analysis aims to enhance existing solutions by allowing them to track values across loads and stores in fine-grained memory locations. We detect the layout of stack objects with 56.3% accuracy for coreutils, binutils and SPECCPU2006 C binaries.

    Compiler Provenance Recovery for Multi-CPU Architectures Using a Centrifuge Mechanism

    Bit-stream recognition (BSR) has many applications, such as forensic investigations, detection of copyright infringement, and malware analysis. We propose the first BSR that takes a bare input bit-stream and outputs a class label without any preprocessing. To achieve our goal, we propose a centrifuge mechanism, where the upstream layers (sub-net) capture global features and tell the downstream layers (main-net) to switch the focus, even if a part of the input bit-stream has the same value. We applied the centrifuge mechanism to compiler provenance recovery, a type of BSR, and achieved excellent classification. Additionally, downstream transfer learning (DTL), one of the learning methods we propose for the centrifuge mechanism, pre-trains the main-net using the sub-net's ground truth instead of the sub-net's output. We found that sub-predictions made by DTL tend to be highly accurate when the sub-label classification contributes to the essence of the main prediction. Comment: 8 pages, 4 figures, 5 tables

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    In many forensic investigations, questions linger regarding the identity of the authors of the software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done on analyzing obfuscated code for attribution. In part, the reason for this gap in the research is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features from the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult and time-consuming and, in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software-emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input to a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and to predict who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods, so that their prediction accuracies could be compared with those of the new dynamic-analysis-based method. Experiments indicate that this new method can provide better accuracy of author attribution for files of unknown provenance, especially when the specimen file has been obfuscated.

    An Efficient Platform for the Automatic Extraction of Patterns in Native Code

    Different software tools, such as decompilers, code quality analyzers, recognizers of packed executable files, authorship analyzers, and malware detectors, search for patterns in binary code. The use of machine learning algorithms, trained with programs taken from the huge number of applications in existing open-source code repositories, allows finding patterns not detected with the manual approach. To this end, we have created a versatile platform for the automatic extraction of patterns from native code, capable of processing big binary files. Its implementation has been parallelized, providing important runtime performance benefits for multicore architectures. Compared with single-processor execution, the best configuration achieves an average speedup of 3.5x, against a maximum theoretical gain of 4x.
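
To put the reported speedup in context, the short sketch below computes the parallel efficiency and, under Amdahl's law, the fraction of the workload that would have to be parallelizable to reach that speedup. Two assumptions are not stated in the abstract: that the maximum theoretical gain of 4 corresponds to four workers, and that Amdahl's law is an adequate model of the platform.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / workers


def amdahl_parallel_fraction(speedup, workers):
    """Solve Amdahl's law, S = 1 / ((1 - p) + p / n), for the fraction p."""
    return (1 - 1 / speedup) / (1 - 1 / workers)


efficiency = parallel_efficiency(3.5, 4)
fraction = amdahl_parallel_fraction(3.5, 4)
print(f"efficiency: {efficiency:.1%}")
print(f"implied parallel fraction: {fraction:.1%}")
```

Under these assumptions, a 3.5x speedup on four workers is 87.5% efficiency and implies that roughly 95% of the work runs in parallel.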