XFL: Naming Functions in Binaries with Extreme Multi-label Learning
Reverse engineers benefit from the presence of identifiers such as function
names in a binary, but usually these are removed for release. Training a
machine learning model to predict function names automatically is promising but
fundamentally hard: unlike words in natural language, most function names occur
only once. In this paper, we address this problem by introducing eXtreme
Function Labeling (XFL), an extreme multi-label learning approach to selecting
appropriate labels for binary functions. XFL splits function names into tokens,
treating each as an informative label akin to the problem of tagging texts in
natural language. We relate the semantics of binary code to labels through
DEXTER, a novel function embedding that combines static analysis-based features
with local context from the call graph and global context from the entire
binary. We demonstrate that XFL/DEXTER outperforms the state of the art in
function labeling on a dataset of 10,047 binaries from the Debian project,
achieving a precision of 83.5%. We also study combinations of XFL with
alternative binary embeddings from the literature and show that DEXTER
consistently performs best for this task. As a result, we demonstrate that
binary function labeling can be effectively phrased in terms of multi-label
learning, and that binary function embeddings benefit from including explicit
semantic features.
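
The token-splitting idea in this abstract can be illustrated with a minimal sketch. It is not the XFL tokenizer: the splitting rules (underscores, camelCase, letter/digit boundaries) and the helper names `name_to_labels` and `build_label_matrix` are assumptions made only for illustration.

```python
import re

def name_to_labels(name: str) -> list:
    """Split a function name into lowercase tokens used as labels (assumed rules)."""
    tokens = []
    # Split on underscores and any other non-alphanumeric separator first.
    for part in re.split(r"[^A-Za-z0-9]+", name):
        # Then split acronyms, camelCase words, and digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [t.lower() for t in tokens]

def build_label_matrix(names):
    """Return (vocabulary, rows), where each row is a 0/1 indicator vector."""
    vocab = sorted({t for n in names for t in name_to_labels(n)})
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for n in names:
        row = [0] * len(vocab)
        for t in name_to_labels(n):
            row[index[t]] = 1
        rows.append(row)
    return vocab, rows

if __name__ == "__main__":
    vocab, rows = build_label_matrix(["read_config", "readFile"])
    print(vocab)
    print(rows)
```

Two names that never repeat as whole strings, such as `read_config` and `readFile`, still share the label `read`, which is the property that makes a multi-label formulation workable when most full names occur only once.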
Probabilistic Naming of Functions in Stripped Binaries
Debugging symbols in binary executables carry the names of functions and global variables. When present, they greatly simplify the process of reverse engineering, but they are almost always removed (stripped) for deployment. We present the design and implementation of punstrip, a tool which combines a probabilistic fingerprint of binary code based on high-level features with a probabilistic graphical model to learn the relationship between function names and program structure. As there are many naming conventions and developer styles, functions from different applications do not necessarily have the exact same name, even if they implement the exact same functionality. We therefore evaluate punstrip across three levels of name matching: exact; an approach based on natural language processing of name components; and using Symbol2Vec, a new embedding of function names based on random walks of function call graphs. We show that our approach is able to recognize functions compiled across different compilers and optimization levels and then demonstrate that punstrip can predict semantically similar function names based on code structure. We evaluate our approach over open source C binaries from the Debian Linux distribution and compare against the state of the art.
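
As a rough illustration of the random-walk step mentioned for Symbol2Vec, the sketch below generates walks over a toy call graph. It is not punstrip's implementation: the graph, the function name `random_walks`, and the parameters `walks_per_node` and `walk_length` are all hypothetical. In a word2vec-style pipeline, each walk would play the role of a sentence whose words are function names.

```python
import random

def random_walks(call_graph, walks_per_node=2, walk_length=4, seed=0):
    """Generate caller-to-callee random walks from every node of a call graph.

    call_graph maps a function name to the list of functions it calls.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    walks = []
    for start in sorted(call_graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                callees = call_graph.get(walk[-1], [])
                if not callees:
                    break  # leaf function: the walk ends early
                walk.append(rng.choice(callees))
            walks.append(walk)
    return walks

if __name__ == "__main__":
    # Hypothetical call graph, for illustration only.
    graph = {
        "main": ["parse_args", "run"],
        "run": ["read_file"],
        "parse_args": [],
        "read_file": [],
    }
    for walk in random_walks(graph):
        print(" -> ".join(walk))
```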
Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining
Binary2source function matching is a fundamental task for many security
applications, including Software Component Analysis (SCA). The "1-to-1"
mechanism has been applied in existing binary2source matching works, in which
one binary function is matched against one source function. However, we
discovered that such a mapping can be "1-to-n" (one query binary function maps
to multiple source functions) because of function inlining.
To help conduct binary2source function matching under function inlining, we
propose a method named O2NMatcher to generate Source Function Sets (SFSs) as
the matching target for binary functions with inlining. We first propose a
model named ECOCCJ48 for inlined call site prediction. To train this model, we
leverage the compilable OSS to generate a dataset with labeled call sites
(inlined or not), extract several features from the call sites, and design a
compiler-opt-based multi-label classifier by inspecting the inlining
correlations between different compilations. Then, we use this model to predict
the labels of call sites in the uncompilable OSS projects without compilation
and obtain the labeled function call graphs of these projects. Next, we regard
the construction of SFSs as a sub-tree generation problem and design root node
selection and edge extension rules to construct SFSs automatically. Finally,
these SFSs will be added to the corpus of source functions and compared with
binary functions with inlining. We conduct several experiments to evaluate the
effectiveness of O2NMatcher, and the results show that our method increases the
performance of existing works by 6% and exceeds all the state-of-the-art works.
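
The abstract does not spell out the root node selection and edge extension rules, so the following is only a simplified sketch of the idea of growing a Source Function Set from a labeled call graph. It assumes one rule: starting from a chosen root, follow every call edge that is predicted as inlined. The name `source_function_set` and the example graph are hypothetical.

```python
def source_function_set(root, inlined_calls):
    """Collect the root plus every function reachable through inlined call edges.

    inlined_calls maps a caller to the callees whose call sites are
    predicted to be inlined (an assumed input format).
    """
    sfs = {root}
    stack = [root]
    while stack:
        caller = stack.pop()
        for callee in inlined_calls.get(caller, []):
            if callee not in sfs:  # also guards against cycles
                sfs.add(callee)
                stack.append(callee)
    return sfs

if __name__ == "__main__":
    # Hypothetical labeled call graph: only inlined edges are listed.
    inlined = {"f": ["g", "h"], "g": ["k"], "k": ["f"]}
    print(sorted(source_function_set("f", inlined)))
```

Under this assumption, a binary function compiled from `f` would be compared against the whole set rather than against the source of `f` alone, which is the "1-to-n" matching target the abstract describes.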
BinComp: A Stratified Approach to Compiler Provenance Attribution
Compiler provenance encompasses numerous pieces of information, such as the compiler family, compiler version, optimization level, and compiler-related functions. The extraction of such information is imperative for various binary analysis applications, such as function fingerprinting, clone detection, and authorship attribution. It is thus important to develop an efficient and automated approach for extracting compiler provenance. In this study, we present BinComp, a practical approach which analyzes the syntax, structure, and semantics of disassembled functions to extract compiler provenance. BinComp has a stratified architecture with three layers. The first layer applies a supervised compilation process to a set of known programs to model the default code transformation of compilers. The second layer employs an intersection process that disassembles functions across compiled binaries to extract statistical features (e.g., numerical values) from common compiler/linker-inserted functions. This layer labels the compiler-related functions. The third layer extracts semantic features from the labeled compiler-related functions to identify the compiler version and the optimization level. Our experimental results demonstrate that BinComp is efficient in terms of both computational resources and time.