A Human-Centric Approach For Binary Code Decompilation
Many security techniques have been developed both in academia and industry to analyze source code, including methods to discover bugs, apply taint tracking, or find vulnerabilities. These source-based techniques leverage the wealth of high-level abstractions available in the source code to achieve good precision and efficiency. Unfortunately, these methods cannot be applied directly to binary code, which lacks such abstractions. In security, there are many scenarios where analysts only have access to the compiled version of a program. During compilation, all high-level abstractions, such as variables, types, and functions, are removed from the final program that security analysts have access to. This dissertation investigates novel methods to recover abstractions from binary code. First, a novel pattern-independent control-flow structuring algorithm is presented to recover high-level control-flow abstractions from binary code. Unlike existing structural analysis algorithms, which produce unstructured code with many goto statements, our algorithm produces fully structured, goto-free decompiled code. We implemented this algorithm in a decompiler called DREAM. Second, we develop three categories of code optimizations to simplify the decompiled code and increase its readability. These categories are expression simplification, control-flow simplification, and semantics-aware naming. We have implemented our usability extensions on top of DREAM and call this extended version DREAM++. We conducted the first user study to evaluate the quality of decompilers for malware analysis. We chose malware since it represents one of the most challenging cases for binary code analysis. The study included six reverse engineering tasks on real malware samples that we obtained from independent malware experts. We evaluated three decompilers: the leading industry decompiler Hex-Rays and both versions of our decompiler, DREAM and DREAM++.
The results of our study show that our improved decompiler DREAM++ produced significantly more understandable code than both Hex-Rays and DREAM. Using DREAM++, participants solved three times as many tasks as with Hex-Rays and twice as many as with DREAM. Moreover, participants rated DREAM++ significantly higher than the competition.
A Framework for Assessing Decompiler Inference Accuracy of Source-Level Program Constructs
Decompilation is the process of reverse engineering a binary program into an equivalent source code representation, with the objective of recovering high-level program constructs such as functions, variables, data types, and control flow mechanisms. Decompilation is applicable in many contexts, particularly for security analysts attempting to decipher the construction and behavior of malware samples. However, due to the loss of information during compilation, this process is naturally speculative and thus prone to inaccuracy. This inherent speculation motivates the idea of an evaluation framework for decompilers. In this work, we present a novel framework to quantitatively evaluate the inference accuracy of decompilers regarding functions, variables, and data types. Within our framework, we develop a domain-specific language (DSL) for representing such program information from any "ground truth" or decompiler source. Using our DSL, we implement a strategy for comparing ground truth and decompiler representations of the same program. Subsequently, we extract and present insightful metrics illustrating the accuracy of decompiler inference regarding functions, variables, and data types over a given set of benchmark programs. We leverage our framework to assess the correctness of the Ghidra decompiler when compared to ground truth information scraped from DWARF debugging information. We perform this assessment over a subset of the GNU Core Utilities (Coreutils) programs and discuss our findings.
Recommended from our members
The ZARF Architecture for Recursive Functions
For highly critical workloads, the legitimate fear of catastrophic failure leads to both highly conservative design practices and excessive assurance costs. One important part of the problem is that modern machines, while providing impressive performance and efficiency, are difficult to reason about formally. We explore the microarchitectural support needed to create a machine with a compact and well-defined semantics, lowering the difficulty of sound and compositional reasoning across the hardware/software interface. Specifically, we explore implementation options for a machine organization devoid of programmer-visible memory, registers, or state update, built instead around function primitives. The resulting machine can be precisely and mathematically described in a brief set of semantics, which we quantitatively and qualitatively demonstrate is amenable to software proofs at the binary level.
As time continues, we become increasingly dependent on computational devices for all facets of our lives, including our health, well-being, and safety. Many of these devices live “in the wild,” in resource-constrained and/or embedded environments, without access to large software stacks and heavy language run-times. At the same time, increasing heterogeneity in computer architecture gives the opportunity for new cores in systems-on-chip (SoCs) that provide support for increasingly critical workloads. We propose an implementation and provide an evaluation of such a device, the Zarf Architecture for Recursive Functions (Zarf), providing an interface of reduced semantic complexity at the ISA level and giving designers a platform amenable to reasoning and static analysis. The described prototype is comparable to normal embedded systems in size and resource usage, but programs written for it are far easier to reason about.
This can serve both resource-constrained devices, providing a new hardware platform, and resource-rich SoCs, acting as a small, trusted co-processor that can handle critical workloads in the larger ecosystem.