5 research outputs found
Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned
Binary code similarity analysis (BCSA) is widely used for diverse security
applications such as plagiarism detection, software license violation
detection, and vulnerability discovery. Despite the surging research interest
in BCSA, it is significantly challenging to perform new research in this field
for several reasons. First, most existing approaches focus only on the end
results, namely, increasing the success rate of BCSA, by adopting
uninterpretable machine learning. Moreover, they utilize their own benchmark
sharing neither the source code nor the entire dataset. Finally, researchers
often use different terminologies or even use the same technique without citing
the previous literature properly, which makes it difficult to reproduce or
extend previous work. To address these problems, we take a step back from the
mainstream and contemplate fundamental research questions for BCSA. Why does a
certain technique or a feature show better results than the others?
Specifically, we conduct the first systematic study on the basic features used
in BCSA by leveraging interpretable feature engineering on a large-scale
benchmark. Our study reveals various useful insights on BCSA. For example, we
show that a simple interpretable model with a few basic features can achieve a
comparable result to that of recent deep learning-based approaches.
Furthermore, we show that the way we compile binaries or the correctness of
underlying binary analysis tools can significantly affect the performance of
BCSA. Lastly, we make all our source code and benchmark public and suggest
future directions in this field to help further research.Comment: 22 pages, under revision to Transactions on Software Engineering
(July 2021
Cross-Inlining Binary Function Similarity Detection
Binary function similarity detection plays an important role in a wide range
of security applications. Existing works usually assume that the query function
and target function share equal semantics and compare their full semantics to
obtain the similarity. However, we find that the function mapping is more
complex, especially when function inlining happens.
In this paper, we will systematically investigate cross-inlining binary
function similarity detection. We first construct a cross-inlining dataset by
compiling 51 projects using 9 compilers, with 4 optimizations, to 6
architectures, with 2 inlining flags, which results in two datasets both with
216 combinations. Then we construct the cross-inlining function mappings by
linking the common source functions in these two datasets. Through analysis of
this dataset, we find that three cross-inlining patterns widely exist while
existing work suffers when detecting cross-inlining binary function similarity.
Next, we propose a pattern-based model named CI-Detector for cross-inlining
matching. CI-Detector uses the attributed CFG to represent the semantics of
binary functions and GNN to embed binary functions into vectors. CI-Detector
respectively trains a model for these three cross-inlining patterns. Finally,
the testing pairs are input to these three models and all the produced
similarities are aggregated to produce the final similarity. We conduct several
experiments to evaluate CI-Detector. Results show that CI-Detector can detect
cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds
all state-of-the-art works.Comment: Accepted at ICSE 2024 (Second Cycle). Camera-ready versio
On Leveraging Next-Generation Deep Learning Techniques for IoT Malware Classification, Family Attribution and Lineage Analysis
Recent years have witnessed the emergence of new and more sophisticated malware targeting insecure Internet of Things (IoT) devices, as part of orchestrated large-scale botnets. Moreover, the
public release of the source code of popular malware families such as Mirai [1] has spawned diverse variants, making it harder to disambiguate their ownership, lineage, and correct label. Such a rapidly
evolving landscape makes it also harder to deploy and generalize effective learning models against retired, updated, and/or new threat campaigns. To mitigate such threat, there is an utmost need for effective IoT malware detection, classification and family attribution, which provide essential steps towards initiating attack mitigation/prevention countermeasures, as well as understanding the evolutionary trajectories and tangled relationships of IoT malware. This is particularly challenging
due to the lack of fine-grained empirical data about IoT malware, the diverse architectures of IoT-targeted devices, and the massive code reuse between IoT malware families.
To address these challenges, in this thesis, we leverage the general lack of obfuscation in IoT malware to extract and combine static features from multi-modal views of the executable binaries (e.g., images, strings, assembly instructions), along with Deep Learning (DL) architectures for effective IoT malware classification and family attribution. Additionally, we aim to address concept drift and the limitations of inter-family classification due to the evolutionary nature of IoT malware, by detecting in-class evolving IoT malware variants and interpreting the meaning behind their mutations. To this end, we perform the following to achieve our objectives:
First, we analyze 70,000 IoT malware samples collected by a specialized IoT honeypot and popular malware repositories in the past 3 years. Consequently, we utilize features extracted from strings- and image-based representations of IoT malware to implement a multi-level DL architecture that fuses the learned features from each sub-component (i.e, images, strings) through a neural network classifier. Our in-depth experiments with four prominent IoT malware families highlight
the significant accuracy of the proposed approach (99.78%), which outperforms conventional single-level classifiers, by relying on different representations of the target IoT malware binaries that do not
require expensive feature extraction. Additionally, we utilize our IoT-tailored approach for labeling unknown malware samples, while identifying new malware strains.
Second, we seek to identify when the classifier shows signs of aging, by which it fails to effectively recognize new variants and adapt to potential changes in the data. Thus, we introduce a robust and effective method that uses contrastive learning and attentive Transformer models to learn and compare semantically meaningful representations of IoT malware binaries and codes without the need for expensive target labels. We find that the evolution of IoT binaries can be used as an augmentation strategy to learn effective representations to contrast (dis)similar variant pairs. We discuss the impact and findings of our analysis and present several evaluation studies to highlight the tangled relationships of IoT malware, as well as the efficiency of our contrastively learned fine-grained feature vectors in preserving semantics and reducing out-of-vocabulary size in cross-architecture IoT malware binaries.
We conclude this thesis by summarizing our findings and discussing research gaps that lay the way for future work