9 research outputs found
UniASM: Binary Code Similarity Detection without Fine-tuning
Binary code similarity detection (BCSD) is widely used in various binary
analysis tasks such as vulnerability search, malware detection, clone
detection, and patch analysis. Recent studies have shown that the
learning-based binary code embedding models perform better than the traditional
feature-based approaches. In this paper, we proposed a novel transformer-based
binary code embedding model, named UniASM, to learn representations of the
binary functions. We designed two new training tasks to make the spatial
distribution of the generated vectors more uniform, which can be used directly
in BCSD without any fine-tuning. In addition, we proposed a new tokenization
approach for binary functions, increasing the token's semantic information
while mitigating the out-of-vocabulary (OOV) problem. The experimental results
show that UniASM outperforms state-of-the-art (SOTA) approaches on the
evaluation dataset. We achieved the average scores of recall@1 on
cross-compilers, cross-optimization-levels and cross-obfuscations are 0.72,
0.63, and 0.77, which is higher than existing SOTA baselines. In a real-world
task of known vulnerability searching, UniASM outperforms all the current
baselines.Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessibl
A deep learning approach to program similarity.
In this work we tackle the problem of binary code similarity by using deep learning applied to binary code visualization techniques. Our idea is to represent binaries as images and then to investigate whether it is possible to recognize similar binaries by applying deep learning algorithms for image classification. In particular, we apply the proposed deep learning framework to a dataset of binary code variants obtained through code obfuscation. These binary variants exhibit similar behaviours while being syntactically different. Our results show that the problem of binary code recognition is strictly separated from simple image recognition problems. Moreover, the analysis of the results of the experiments conducted in this work lead us to the identification of interesting research challenges. For example, in order to use image recognition approaches to recognize similar binary code samples it is important to further investigate how to build a suitable mapping from executables to images
Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned
Binary code similarity analysis (BCSA) is widely used for diverse security
applications such as plagiarism detection, software license violation
detection, and vulnerability discovery. Despite the surging research interest
in BCSA, it is significantly challenging to perform new research in this field
for several reasons. First, most existing approaches focus only on the end
results, namely, increasing the success rate of BCSA, by adopting
uninterpretable machine learning. Moreover, they utilize their own benchmark
sharing neither the source code nor the entire dataset. Finally, researchers
often use different terminologies or even use the same technique without citing
the previous literature properly, which makes it difficult to reproduce or
extend previous work. To address these problems, we take a step back from the
mainstream and contemplate fundamental research questions for BCSA. Why does a
certain technique or a feature show better results than the others?
Specifically, we conduct the first systematic study on the basic features used
in BCSA by leveraging interpretable feature engineering on a large-scale
benchmark. Our study reveals various useful insights on BCSA. For example, we
show that a simple interpretable model with a few basic features can achieve a
comparable result to that of recent deep learning-based approaches.
Furthermore, we show that the way we compile binaries or the correctness of
underlying binary analysis tools can significantly affect the performance of
BCSA. Lastly, we make all our source code and benchmark public and suggest
future directions in this field to help further research.Comment: 22 pages, under revision to Transactions on Software Engineering
(July 2021