720 research outputs found
From MinX to MinC: Semantics-Driven Decompilation of Recursive Datatypes
Reconstructing the meaning of a program from its binary executable is known as
reverse engineering; it has a wide range of applications in software security, exposing piracy, legacy systems, etc. Since reversing is ultimately a search for meaning, there is much interest in inferring a type (a meaning) for the elements of a binary in a consistent way. Unfortunately existing approaches do not guarantee any semantic relevance for their reconstructed types. This paper presents a new and semantically-founded approach that provides strong guarantees for the reconstructed types. Key to our approach is the derivation of a witness program in a high-level language alongside the reconstructed types. This witness has the same semantics as the binary, is type correct by construction, and it induces a (justifiable) type assignment on the binary. Moreover, the approach effectively yields a type-directed decompiler. We formalise and implement the approach for reversing Minx, an abstraction of x86, to MinC, a type-safe dialect of C with recursive datatypes. Our evaluation compiles a range of textbook C algorithms to MinX and then recovers the original structures
The Effectiveness Of Bytecode Decompilation
High-level bytecodes used by object-oriented managed execution environments make it easy to decompile them. This paper studies the reasons that make bytecode decompilers such efficient and presents basic obfuscation techniques as an efficient protection against binary code reverse engineering.basic obfuscation, reverse engineering, object-oriented execution
Test Data Generation of Bytecode by CLP Partial Evaluation
We employ existing partial evaluation (PE) techniques developed for Constraint Logic Programming (CLP) in order to automatically generate test-case generators for glass-box testing of bytecode. Our approach consists of two independent CLP PE phases. (1) First, the bytecode is transformed into an equivalent (decompiled) CLP program. This is already a well studied transformation which can be done either by using an ad-hoc decompiler or by specialising a bytecode interpreter by means of existing PE techniques. (2) A second PE is performed in order to supervise the generation of test-cases by execution of the CLP decompiled program. Interestingly, we employ control strategies previously defined in the context of CLP PE in order to capture coverage criteria for glass-box testing of bytecode. A unique feature of our approach is that, this second PE phase allows generating not only test-cases but also test-case generators. To the best of our knowledge, this is the first time that (CLP) PE techniques are applied for test-case generation as well as to generate test-case generators
SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly
Decompilation is a well-studied area with numerous high-quality tools
available. These are frequently used for security tasks and to port legacy
code. However, they regularly generate difficult-to-read programs and require a
large amount of engineering effort to support new programming languages and
ISAs. Recent interest in neural approaches has produced portable tools that
generate readable code. However, to-date such techniques are usually restricted
to synthetic programs without optimization, and no models have evaluated their
portability. Furthermore, while the code generated may be more readable, it is
usually incorrect. This paper presents SLaDe, a Small Language model Decompiler
based on a sequence-to-sequence transformer trained over real-world code. We
develop a novel tokenizer and exploit no-dropout training to produce
high-quality code. We utilize type-inference to generate programs that are more
readable and accurate than standard analytic and recent neural approaches.
Unlike standard approaches, SLaDe can infer out-of-context types and unlike
neural approaches, it generates correct code. We evaluate SLaDe on over 4,000
functions from ExeBench on two ISAs and at two optimizations levels. SLaDe is
up to 6 times more accurate than Ghidra, a state-of-the-art,
industrial-strength decompiler and up to 4 times more accurate than the large
language model ChatGPT and generates significantly more readable code than
both
An Efficient Platform for the Automatic Extraction of Patterns in Native Code
Different software tools, such as decompilers, code quality analyzers, recognizers of packed executable files, authorship analyzers, and malware detectors, search for patterns in binary code. The use of machine learning algorithms, trained with programs taken from the huge number of applications in the existing open source code repositories, allows finding patterns not detected with the manual approach. To this end, we have created a versatile platform for the automatic extraction of patterns from native code, capable of processing big binary files. Its implementation has been parallelized, providing important runtime performance benefits for multicore architectures. Compared to the single-processor execution, the average performance improvement obtained with the best configuration is 3.5 factors over the maximum theoretical gain of 4 factors
- …