    Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

    Binary code analysis allows analyzing binary code without having access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text in various natural languages. We notice that binary code analysis and NLP share many analogous topics, such as semantics extraction, summarization, and classification. This work uses these ideas to address two important code similarity comparison problems: (I) given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics are similar; and (II) given a piece of code of interest, determining whether it is contained in another piece of assembly code for a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system, INNEREYE, and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in accuracy, efficiency, and scalability. Case studies using the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis. Comment: Accepted by Network and Distributed Systems Security (NDSS) Symposium 201

    Intelligent Match Merging to Prevent Obfuscation Attacks on Software Plagiarism Detectors

    Due to the rising number of computer science students, instructors rely on current tools for detecting source code plagiarism to prevent students from submitting plagiarized programming assignments. While these token-based plagiarism detectors are inherently resilient to simple obfuscation, recently published obfuscation tools allow students to alter their submissions effortlessly in order to evade detection. The rise of ChatGPT has raised additional concerns about its obfuscation capabilities and the need for effective countermeasures. Existing defense mechanisms against obfuscation are often limited by their specificity to particular attacks or by their dependence on programming languages, which requires tedious and error-prone reimplementation. In response to this challenge, this work introduces a novel defense mechanism against automated obfuscation attacks called match merging. It exploits the fact that obfuscation attacks alter the token sequence to split up matches between two submissions, so that the broken matches are discarded by the plagiarism detector. Match merging reverts the effects of these attacks by intelligently merging neighboring matches based on a heuristic, in order to minimize false positives. The resilience of our method against classical obfuscation attacks is demonstrated through evaluations on various real-world datasets, including student assignments and programming contests, in six different attack scenarios. Moreover, it significantly improves detection performance against AI-based obfuscation.
What distinguishes this mechanism is its independence from language and attack, while its minimal runtime overhead makes it seamlessly compatible with other defense mechanisms.
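
    The merging idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Match` tuple, the function name, and the `max_gap` and `min_length` parameters are hypothetical, and the simple gap threshold stands in for whatever heuristic the work actually uses. The sketch assumes each match records where it starts and how many tokens it covers in each of the two submissions.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """A matched token run between submissions A and B (hypothetical layout)."""
    start_a: int
    start_b: int
    len_a: int
    len_b: int


def merge_neighboring_matches(matches: List[Match],
                              max_gap: int = 2,
                              min_length: int = 5) -> List[Match]:
    """Merge matches separated by small gaps, then drop short matches.

    max_gap and min_length are illustrative assumptions, not values
    taken from the abstract.
    """
    merged: List[Match] = []
    for m in sorted(matches):  # ordered by start_a, then start_b
        if merged:
            prev = merged[-1]
            gap_a = m.start_a - (prev.start_a + prev.len_a)
            gap_b = m.start_b - (prev.start_b + prev.len_b)
            # Merge only if the gap is small in BOTH submissions.
            if 0 <= gap_a <= max_gap and 0 <= gap_b <= max_gap:
                merged[-1] = Match(
                    prev.start_a,
                    prev.start_b,
                    m.start_a + m.len_a - prev.start_a,
                    m.start_b + m.len_b - prev.start_b,
                )
                continue
        merged.append(m)
    # Mimic a detector that discards matches below a minimum length.
    return [m for m in merged if min(m.len_a, m.len_b) >= min_length]


fragments = [Match(0, 0, 3, 3), Match(4, 5, 3, 3), Match(20, 30, 2, 2)]
result = merge_neighboring_matches(fragments)
```

    In this toy input, the first two fragments are each too short to survive the length filter on their own, but they are close enough in both submissions to be merged into one match that does survive. The isolated third fragment is still discarded.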

    Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned

    Binary code similarity analysis (BCSA) is widely used for diverse security applications such as plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they use their own benchmarks and share neither the source code nor the entire dataset. Finally, researchers often use different terminologies or even use the same technique without citing the previous literature properly, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA: why does a certain technique or feature show better results than the others? Specifically, we conduct the first systematic study of the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights on BCSA. For example, we show that a simple interpretable model with a few basic features can achieve a result comparable to that of recent deep learning-based approaches. Furthermore, we show that the way we compile binaries or the correctness of underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research. Comment: 22 pages, under revision to Transactions on Software Engineering (July 2021)

    Similarity of Source Code in the Presence of Pervasive Modifications

    Source code analysis to detect code cloning, code plagiarism, and code reuse suffers from the problem of pervasive code modifications, i.e. transformations that may have a global effect. We compare 30 similarity detection techniques and tools against pervasive code modifications. We evaluate the tools using two experimental scenarios for Java source code. These are (1) pervasive modifications created with tools for source code and bytecode obfuscation and (2) source code normalisation through compilation and decompilation using different decompilers. Our experimental results show that highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for six of the tools. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.

    A comparison of code similarity analysers

    Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is but is modified for various purposes, e.g. refactoring, bug fixing, or even software plagiarism. These code modifications can affect the performance of code similarity analysers, including code clone and plagiarism detectors, to some degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using rank-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to the specific data set: when optimal configurations derived from one data set are applied directly to another, the tools perform poorly on the new data set.
The code similarity analysers are thoroughly evaluated not only on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    In many forensic investigations, questions linger regarding the identity of the authors of a software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done on analyzing obfuscated code for attribution. In part, the reason for this gap in the research is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features of the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult and time-consuming, and in some cases may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software-emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input to a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and to predict who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods, so that their prediction accuracies could be compared with those obtained from the new, dynamic-analysis-based method. Experiments indicate that this new method can provide better accuracy of author attribution for files of unknown provenance, especially when the specimen file has been obfuscated.

    An Inclusive Report on Robust Malware Detection and Analysis for Cross-Version Binary Code Optimizations

    Numerous practices exist around binary code similarity detection (BCSD), such as control flow graph analysis, semantic analysis, code obfuscation, malware detection and analysis, and vulnerability search. Relying on expert knowledge, existing solutions often compare particular syntactic features extracted from binary code. They either incur substantial performance overhead or detect inaccurately. Furthermore, few tools are available for comparing cross-version binaries, which may differ not only in syntax but also marginally in semantics. Binary code similarity detection has existed for the past 10 years, but the research area has not yet been systematically analysed. The paper presents a comprehensive analysis of existing cross-version binary code optimization techniques along four characteristics: (1) structural analysis, (2) semantic analysis, (3) syntactic analysis, and (4) validation metrics. It helps researchers select the most suitable tool for their binary code analysis needs. Furthermore, the paper presents the scope of the area along with future directions for research.