535 research outputs found
An Expert System for Automatic Software Protection
The abstract is in the attachment.
Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned
Binary code similarity analysis (BCSA) is widely used for diverse security
applications such as plagiarism detection, software license violation
detection, and vulnerability discovery. Despite the surging research interest
in BCSA, it is significantly challenging to perform new research in this field
for several reasons. First, most existing approaches focus only on the end
results, namely, increasing the success rate of BCSA, by adopting
uninterpretable machine learning. Moreover, they utilize their own benchmark
sharing neither the source code nor the entire dataset. Finally, researchers
often use different terminologies or even use the same technique without citing
the previous literature properly, which makes it difficult to reproduce or
extend previous work. To address these problems, we take a step back from the
mainstream and contemplate fundamental research questions for BCSA: why does a
certain technique or feature show better results than others?
Specifically, we conduct the first systematic study on the basic features used
in BCSA by leveraging interpretable feature engineering on a large-scale
benchmark. Our study reveals various useful insights on BCSA. For example, we
show that a simple interpretable model with a few basic features can achieve a
comparable result to that of recent deep learning-based approaches.
Furthermore, we show that the way we compile binaries or the correctness of
underlying binary analysis tools can significantly affect the performance of
BCSA. Lastly, we make all our source code and benchmark public and suggest
future directions in this field to help further research.
Comment: 22 pages, under revision for Transactions on Software Engineering (July 2021).
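The abstract's central finding, that a simple interpretable model over a few basic features can rival deep learning approaches, can be sketched as follows. This is a toy illustration only: the feature names and the similarity measure are invented for demonstration and are not the features or model used in the paper.

```python
# Toy sketch of interpretable binary code similarity: represent each
# function by a few basic numeric features and compare feature vectors.
# Feature names and the similarity measure are illustrative assumptions.

def extract_features(func):
    """Map a (hypothetical) disassembled function to basic counts."""
    return [
        func["num_instructions"],
        func["num_basic_blocks"],
        func["num_calls"],
        func["num_string_refs"],
    ]

def similarity(f1, f2):
    """Relative-difference similarity in [0, 1]; 1.0 means identical features."""
    score = 0.0
    for a, b in zip(f1, f2):
        denom = max(a, b, 1)
        score += 1.0 - abs(a - b) / denom
    return score / len(f1)

# Two compilations of the same source function often keep similar counts
# even when the exact instruction sequences differ.
gcc_o2 = {"num_instructions": 120, "num_basic_blocks": 14,
          "num_calls": 5, "num_string_refs": 2}
clang_o2 = {"num_instructions": 131, "num_basic_blocks": 15,
            "num_calls": 5, "num_string_refs": 2}

print(similarity(extract_features(gcc_o2), extract_features(clang_o2)))
```

Because each feature is directly countable from the disassembly, a low or high score can be traced back to the specific feature responsible, which is the interpretability argument the abstract makes.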
An Inclusive Report on Robust Malware Detection and Analysis for Cross-Version Binary Code Optimizations
Numerous practices exist for binary code similarity detection (BCSD), such as control flow graph analysis, semantics scrutiny, code obfuscation handling, malware detection and analysis, vulnerability search, etc. On the basis of expert knowledge, existing solutions often compare particular syntactic features retrieved from binary code; they either carry substantial performance overheads or detect inaccurately. Furthermore, few tools are available for comparing cross-version binaries, which may differ not only in syntax but also marginally in semantics. Although BCSD has existed for the past ten years, the research area has not yet been systematically analysed. This paper presents a comprehensive analysis of existing cross-version binary code optimization techniques along four characteristics: (1) structural analysis, (2) semantic analysis, (3) syntactic analysis, and (4) validation metrics. It helps researchers select the most suitable tool for their intended binary code analysis. Furthermore, the paper presents the scope of the area along with future directions for research.
Automated Analysis of ARM Binaries using the Low-Level Virtual Machine Compiler Framework
Binary program analysis is a critical capability for offensive and defensive operations in Cyberspace. However, many current techniques are ineffective or time-consuming and few tools can analyze code compiled for embedded processors such as those used in network interface cards, control systems and mobile phones. This research designs and implements a binary analysis system, called the Architecture-independent Binary Abstracting Code Analysis System (ABACAS), which reverses the normal program compilation process, lifting binary machine code to the Low-Level Virtual Machine (LLVM) compiler's intermediate representation, thereby enabling existing security-related analyses to be applied to binary programs. The prototype targets ARM binaries but can be extended to support other architectures. Several programs are translated from ARM binaries and analyzed with existing analysis tools. Programs lifted from ARM binaries are an average of 3.73 times larger than the same programs compiled from a high-level language (HLL). Analysis results are equivalent regardless of whether the HLL source or ARM binary version of the program is submitted to the system, confirming the hypothesis that LLVM is effective for binary analysis.
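The lifting step the abstract describes, translating machine instructions into a compiler IR so that source-level analyses can run on binaries, can be sketched in miniature. The instruction subset and the IR text below are simplified assumptions for illustration, not ABACAS's actual translation rules.

```python
# Toy "lifter": map a few pre-parsed ARM-like instructions to
# LLVM-flavoured IR text. Real lifters handle full instruction sets,
# condition flags, and memory; this sketch shows only the shape of the idea.

def lift_arm(insn):
    """Lift one (pre-parsed) ARM-like instruction tuple to IR text."""
    op = insn[0]
    if op == "mov":        # mov rd, #imm  ->  register gets a constant
        _, rd, imm = insn
        return f"%{rd} = add i32 0, {imm}"
    if op == "add":        # add rd, rn, rm  ->  three-address add
        _, rd, rn, rm = insn
        return f"%{rd} = add i32 %{rn}, %{rm}"
    if op == "bx":         # bx lr  ->  function return (result in r0)
        return "ret i32 %r0"
    raise ValueError(f"unsupported instruction: {op}")

program = [("mov", "r1", 5), ("mov", "r2", 7),
           ("add", "r0", "r1", "r2"), ("bx", "lr")]
for insn in program:
    print(lift_arm(insn))
```

Once every instruction is expressed in the IR, any analysis written against that IR (dead-code detection, taint tracking, and so on) applies to the binary for free, which is the core benefit the abstract claims.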
Looking for Criminal Intents in JavaScript Obfuscated Code
The majority of websites incorporate JavaScript for client-side execution in a supposedly protected environment. Unfortunately, JavaScript has also proven to be a critical attack vector for both independent and state-sponsored groups of hackers. On the one hand, defenders need to analyze scripts to ensure that no threat is delivered and to respond to potential security incidents. On the other, attackers aim to obfuscate the source code in order to disorient the defenders or even to make code analysis practically impossible. Since code obfuscation may also be adopted by companies for legitimate intellectual-property protection, a dilemma remains on whether a script is harmless or malignant, if not criminal. To help analysts deal with such a dilemma, a methodology is proposed, called JACOB, which is based on five steps, namely: (1) source code parsing, (2) control flow graph recovery, (3) region identification, (4) code structuring, and (5) partial evaluation. These steps implement a sort of decompilation for control flow flattened code, which is progressively transformed into something that is close to the original JavaScript source, thereby making eventual code analysis possible. Most relevantly, JACOB has been successfully applied to uncover unwanted user tracking and fingerprinting in e-commerce websites operated by a well-known Chinese company.
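The key transformation JACOB targets, undoing control-flow flattening, can be illustrated with a toy model. Flattened code replaces straight-line control flow with a dispatcher loop over a state variable; when the next-state values are constants, partially evaluating the dispatcher recovers the original statement order. This sketch is an invented simplification, not JACOB's implementation (which operates on real JavaScript).

```python
# Toy sketch of deobfuscating control-flow-flattened code by partial
# evaluation. Each flattened region maps a dispatcher state to a
# (statement, next_state) pair; None marks the end of the loop.

flattened = {
    3: ("x = 1",     7),
    7: ("y = x + 1", 1),
    1: ("return y",  None),
}

def unflatten(regions, entry_state):
    """Follow constant state transitions to recover the linear code."""
    recovered, state = [], entry_state
    while state is not None:
        stmt, state = regions[state]
        recovered.append(stmt)
    return recovered

print(unflatten(flattened, 3))
# -> ['x = 1', 'y = x + 1', 'return y']
```

In real obfuscated scripts the next-state values are themselves computed, which is why JACOB needs the earlier steps (CFG recovery, region identification, code structuring) before partial evaluation can resolve them to constants.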
Automated Failure Explanation Through Execution Comparison
When fixing a bug in software, developers must build an understanding or explanation of the bug and how the bug flows through a program. The effort that developers must put into building this explanation is costly and laborious. Thus, developers need tools that can assist them in explaining the behavior of bugs. Dynamic slicing is one technique that can effectively show how a bug propagates through an execution up to the point where a program fails. However, dynamic slices are large because they do not just explain the bug itself; they include extra information that explains any observed behavior that might be connected to the bug. Thus, the explanation of the bug is hidden within this other tangentially related information. This dissertation addresses the problem and shows how a failing execution and a correct execution may be compared in order to construct explanations that include only information about what caused the bug. As a result, these automated explanations are significantly more concise than those explanations produced by existing dynamic slicing techniques.
To enable the comparison of executions, we develop new techniques for dynamic analyses that identify the commonalities and differences between executions. First, we devise and implement the notion of a point within an execution that may exist across multiple executions. We also note that comparing executions involves comparing the state, that is, the variables and their values, that exists within the executions at different execution points. Thus, we design an approach for identifying the locations of variables in different executions so that their values may be compared. Leveraging these tools, we design a system for identifying the behaviors within an execution that can be blamed for a bug and that together compose an explanation for the bug. These explanations are up to two orders of magnitude smaller than those produced by existing state-of-the-art techniques. We also examine how different choices of a correct execution for comparison can impact the practicality or potential quality of the explanations produced via our system.
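The comparison the dissertation describes, aligning a failing execution with a passing one at matching execution points and reporting only the state that differs, can be sketched as follows. The trace format here is an invented simplification for illustration, not the dissertation's actual dynamic analyses.

```python
# Toy sketch of failure explanation by execution comparison: each trace is
# a list of (execution point, variables) snapshots; the explanation keeps
# only the variables whose values diverge between the two runs.

passing = [("line 10", {"n": 4}), ("line 12", {"n": 4, "total": 10})]
failing = [("line 10", {"n": 4}), ("line 12", {"n": 4, "total": 6})]

def explain(pass_trace, fail_trace):
    """Return (point, var, good_value, bad_value) for each divergence."""
    diffs = []
    for (pt_p, vars_p), (pt_f, vars_f) in zip(pass_trace, fail_trace):
        if pt_p != pt_f:
            break                      # traces diverge in control flow
        for var in vars_p:
            if vars_f.get(var) != vars_p[var]:
                diffs.append((pt_p, var, vars_p[var], vars_f[var]))
    return diffs

print(explain(passing, failing))
# -> [('line 12', 'total', 10, 6)]
```

A dynamic slice of the failing run would include everything `total` depends on; the comparison instead discards state that both runs agree on, which is why the resulting explanations can be far smaller.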
A Dynamically Scheduled HLS Flow in MLIR
In High-Level Synthesis (HLS), we consider abstractions that span from software to hardware and target heterogeneous architectures. Therefore, managing the complexity introduced by this is key to implementing good, maintainable, and extendible HLS compilers. Traditionally, HLS flows have been built on top of software compilation infrastructure such as LLVM, with hardware aspects of the flow existing peripherally to the core of the compiler. Through this work, we aim to show that MLIR, a compiler infrastructure with a focus on domain-specific intermediate representations (IR), is a better infrastructure for HLS compilers. Using MLIR, we define HLS and hardware abstractions as first-class citizens of the compiler, simplifying analysis, transformations, and optimization. To demonstrate this, we present a C-to-RTL, dynamically scheduled HLS flow. We find that our flow generates circuits comparable to those of an equivalent LLVM-based HLS compiler. Notably, we achieve this while lacking key optimization passes typically found in HLS compilers and through the use of an experimental front-end. To this end, we show that significant improvements in the generated RTL are low-hanging fruit, requiring only engineering effort to attain. We believe that our flow is more modular and more extendible than comparable open-source HLS compilers and is thus a good candidate as a basis for future research. Apart from the core HLS flow, we provide MLIR-based tooling for C-to-RTL cosimulation and visual debugging, with the ultimate goal of building an MLIR-based HLS infrastructure that will drive innovation in the field.