Abstract: Disassembly is fundamental to binary analysis and rewriting. We present a novel disassembly technique that takes a stripped binary and produces reassembleable assembly code. The resulting assembly code has accurate symbolic information, providing cross-references for analysis and enabling adjustment of code and data pointers to accommodate rewriting. Our technique features multiple static analyses and heuristics in a combined Datalog implementation. We argue that Datalog's inference process is particularly well suited for disassembly and the required analyses. Our implementation and experiments support this claim. We have implemented our approach in an open-source tool called Ddisasm. In extensive experiments in which we rewrite thousands of x64 binaries, we find that Ddisasm is both faster and more accurate than the current state-of-the-art binary reassembling tool, Ramblr.
I. INTRODUCTION
Software is increasingly ubiquitous and the identification and mitigation of software vulnerabilities are increasingly essential to the functioning of modern society. In many cases, e.g., COTS or legacy binaries, libraries, and drivers, source code is not available, requiring binary analysis and rewriting. Many disassemblers [18], [48], [49], [7], [51], [31], [8], [26], analysis frameworks [16], [3], [21], [5], [40], [10], [33], [20], [22], [24], rewriting frameworks [50], [7], [28], [15], [44], [55], [47], [27], [14], and reassembling tools [48], [49], [31] have been developed to support this need. Many applications depend on these tools, including binary hardening with control flow protection [56], [46], [17], [32], [54], [34], memory protections [13], [41], [35] and diversity [25], [12], binary refactoring [45], binary instrumentation [38], and binary optimization [47], [39], [38].
Modifying a binary is not easy. Machine code is not designed to be modified and the compilation and assembly process discards essential information. In general, reversing assembly is undecidable. The information required to produce reassembleable disassembly includes:
Instruction boundaries: Binaries do not contain information specifying where instructions start and end. Recovering this information can be challenging, especially in architectures such as x86 that have variable-length instructions and dense instruction sets (footnote 1). This problem is sometimes referred to as content classification.
Symbolization information: In binaries there is no distinction between a number that represents a literal and should remain constant and a reference that points to a location in the code or in the data. If we modify a binary, for example by moving a block of code, all of the references that point to that block, and to all of the subsequently shifted blocks, have to be updated. On the other hand, literals, even if they coincide with the address of a block, have to remain unchanged. This problem is also referred to as Literal Reference Disambiguation.
In this work we have developed a disassembler that infers precise information for each of these categories and thus generates reassembleable disassembly code for a large variety of programs. These problems are not solvable in general so our approach leverages a combination of static program analysis and heuristics derived from empirical analysis of common compiler and assembler idioms. The static analysis, the heuristics, and their combination are implemented in Datalog. Datalog is a declarative language that can be used to express dataflow analyses very concisely [42] and it has recently gained attention with the appearance of engines such as Souffle [23] that generate highly efficient parallel C++ code from a Datalog program. We argue that Datalog is so well suited to the implementation of a disassembler that it represents a qualitative change in what is possible in terms of accuracy and efficiency.
We can conceptualize disassembly as taking a series of decisions. Instruction boundary identification amounts to deciding, for each address x in the code section, whether x represents the beginning of an instruction or not. Symbolization information amounts to deciding for each number that appears inside an instruction operand or data section whether it corresponds to a literal or to a symbolic expression and what kind of symbolic expression it is.
Footnote 2: Function boundary identification is also often considered as part of the disassembly process. Though not strictly required to produce reassembleable assembly, it is useful for performing analyses on the generated assembly code. Ddisasm also includes heuristics for identifying functions but we do not discuss them in this paper.
The high level approach for each of these decisions is the same. A variety of static analyses are performed that gather evidence for possible interpretations. Then, Datalog rules assign weights to the evidence and aggregate the results for each interpretation. Finally, a decision is taken according to the aggregate weight of each possible interpretation. In our implementation, we perform code location first (described in Section IV). Then we perform several static analyses to support the symbolization procedure: the computation of def-use chains, a novel register value analysis, and a data access pattern analysis described in Sections V-A, V-C, and V-D respectively. Finally, we combine the results of the static analyses and other heuristics to inform the symbolization procedure. All these steps are implemented in a single Datalog program. It is worth noting that, because Datalog is a purely declarative language, the sequence in which each of the disassembly steps is computed stems solely from the logical dependencies among the different Datalog rules.
We have tested Ddisasm and compared it to Ramblr [48] (the current best published disassembler that produces reassembleable assembly) on 200 benchmark programs including 106 coreutils, 25 real-world applications, and 69 binaries from DARPA's Cyber Grand Challenge (CGC) [1]. We compile each benchmark using 4 compilers and 5 or 6 optimization flags (depending on the benchmark) yielding a total of 4376 unique binaries (399 MB of binaries). We compare the precision of the disassemblers by making semantics-preserving modifications to the assembly code, reassembling the modified assembly code, and then running the test suites distributed with the binaries to check that they retain functionality. We also check the precision of the symbolization information step by comparing the results of the disassembler to the ground truth extracted from binaries generated with all relocation information. Finally, we compare the disassemblers in terms of the time taken by the disassembly process. The Datalog disassembler Ddisasm is faster and more accurate than Ramblr.
Our contributions are:
1) We present a new disassembly framework based on combining static analysis and heuristics expressed in Datalog. This framework is highly flexible, enabling much faster development and empirical evaluation of new heuristics and analyses.
2) We present multiple static analyses implemented in our Datalog framework which support the production of reassembleable assembly.
3) We present multiple empirically motivated heuristics that are effective in inferring the necessary information to produce reassembleable assembly.
4) We share an implementation of this populated framework in a tool called Ddisasm which is open source and publicly available (footnote 3). Ddisasm produces assembly text as well as an intermediate representation tailored for binary analysis and rewriting (footnote 4).
5) We demonstrate the effectiveness of our approach through an extensive experimental evaluation over 4376 binaries in which we compare Ddisasm to Ramblr, the state-of-the-art tool in reassembleable disassembly.
II. RELATED WORK

A. Disassemblers
Bin-CFI [56] is an early work in reassembleable disassembly. This work requires relocation information (avoiding the need for symbolization). With this information, disassembly is reduced to the problem of identifying code location. Bin-CFI used a simple process:
1) perform linear disassembly
2) identify errors as invalid opcodes (rare in x64) or invalid jumps (i.e., jumps outside of a module or to the middle of an instruction)
3) identify as non-code the region from the error to the preceding unconditional jump
4) identify as non-code the region from the error to the following jump target.
Footnote 3: https://github.com/GrammaTech/ddisasm
Footnote 4: https://github.com/grammatech/gtirb
Our code location also propagates invalid opcodes and invalid jumps backwards, but it does not perform linear disassembly. Instead we use a combination of linear and recursive traversal.
There are many other works that focus solely on instruction boundary identification [31] , [51] , [8] , [26] . None of these address symbolization. In general, these approaches try to obtain a superset of all possible instructions or basic blocks in the binary and then determine which ones are real and which ones are not using heuristics. This idea is also present in our approach. Both Miller et al. [31] and Wartell et al. [51] use probabilistic methods to determine which addresses contain instructions. In the former, probabilistic techniques with weighted heuristics are used to estimate the probability that each offset in the code address is the start of an instruction. In the latter, a probabilistic finite state machine is trained on a large corpus of disassembled programs to learn common opcode operand pairs. These pairs are used to select among possible program disassemblies.
Despite all the work on disassembly, there are disagreements on how often challenging features for instruction boundary identification such as overlapping instructions, data in code sections and multi-entry functions are present in real code [30], [4]. Our experience so far matches that of [4]. We did not find data in code sections (we found only padding) nor overlapping instructions in compiled ELF binaries.
There are only a few systems that address the symbolization problem directly. Uroboros [49] uses linear disassembly as introduced by Bin-CFI [56] and adds heuristics for symbolization. The authors distinguish four classes of symbolization depending on if the source and target of the reference are present in code or data. The difficulty of each class is assessed and partial solutions are proposed for each class.
Ramblr [48] is the closest related work. It improves upon Uroboros with increasingly sophisticated static analyses. Ramblr operates in two modes: a fast mode which avoids expensive analyses and a slow mode which improves accuracy. Ramblr is part of the Angr framework for binary analysis [40] which provides a complete analysis infrastructure on the disassembled binaries. Our system also uses static analyses in combination with heuristics. Our static analyses (Sec. V) are specially tailored to obtain the necessary information for symbolization while remaining efficient. Moreover, the fact that our disassembler is implemented in Datalog means that the results of analyses and heuristics can be easily combined and it is straightforward to add new heuristics or refine the existing ones.
B. Rewriting Systems
REINS [50] rewrites binaries in such a way as to avoid making difficult decisions about symbolization. REINS partitions the memory of rewritten programs into untrusted low memory, which includes rewritten code, and trusted high memory (divided at a power of two for efficient guarding). They implement a lightweight binary lookup table to rewrite each old jump target with a tagged pointer to its new location in the rewritten code. REINS targets Windows binaries and its main goal is to rewrite untrusted code to execute it safely. REINS uses IDA Pro [2] to perform instruction boundary identification and to locate indirect jump targets.
SecondWrite [44] also avoids making symbolization decisions by translating jump targets at their point of usage. They do a conservative content classification (distinguishing code and data) by performing speculative disassembly and keeping the original code section intact. If there is data in the code section, it can still be accessed but jumps and call targets will be translated to a rewritten code section. SecondWrite translates binaries into LLVM IR.
The MULTIVERSE [7] goes a step further than SecondWrite and also avoids making code location determinations by treating every possible instruction offset as a valid instruction. Similarly to SecondWrite, it avoids making symbolization determinations by generating rewritten executables in which every indirect control flow is mediated by additional machinery to determine where the control flow would have gone in the original program and redirecting it to the appropriate portion of the rewritten program.
The approaches of REINS, SecondWrite and MULTIVERSE increasingly avoid making decisions about code location and symbolization and thus offer more guarantees to work for arbitrary binaries. However, these approaches also have disadvantages. They introduce overhead in the rewritten binaries both in terms of speed and size. Moreover, the additional translation process for indirect jumps or calls is likely to make later analyses on the disassembled code more challenging. On the other hand, our approach, although not guaranteed to work, generates assembly code with symbolic references. This enables performing advanced static analyses on the assembly code that can be used to support certain more sophisticated rewriting techniques. It also enables rewriting a binary multiple times without introducing a new layer of indirection in every rewrite.
C. Static Analysis Using Datalog
Datalog has a long history of being used to specify and implement static analyses. In 1995, Reps [37] presented an approach to obtain demand-driven dataflow analyses from the exhaustive counterparts specified in Datalog using the magic sets transformation. Much of the subsequent effort has been in scaling Datalog implementations. In that vein, Whaley et al. [53], [52] achieved significant pointer analysis scalability improvements using an implementation based on binary decision diagrams. More recently, Datalog-based program analysis has received new impetus with the development of Souffle [23], a highly efficient Datalog engine. The most prominent application of Datalog to program analysis to date has been Doop [9], [43], [42], a context-sensitive pointer analysis for Java bytecode that scales to large applications. Doop is currently one of the most comprehensive and efficient pointer analyses for Java. In the context of binary analysis, we are only aware of the work of Brumley et al. [11], which uses Datalog to specify an alias analysis for assembly code.
Very recently, Grech et al. [19] have implemented a decompiler, named Gigahorse, for Ethereum virtual machine (EVM) byte code using Datalog. Gigahorse shares some high level ideas with our approach, i.e. the inference of high level information from low-level code using Datalog. However, both the target and the inferred information differ considerably. In EVM byte code, the main challenge is to obtain a register-based IR (EVM byte code is stack based), resolve jump targets and identify function boundaries. On the other hand, Ddisasm focuses on obtaining instruction boundaries and symbolization information for x64 binaries. Additionally, although Gigahorse also implements heuristics using Datalog rules, it does not use our approach of assigning weights to heuristics and aggregating them to make final decisions.
III. PRELIMINARIES
A. Introduction to Datalog
A Datalog program is a collection of Datalog rules. A Datalog rule is a restricted kind of Horn clause with the following format: $h \mathrel{:-} t_1, t_2, \ldots, t_n$, where $h, t_1, t_2, \ldots, t_n$ are predicates. Rules represent a logical entailment: $t_1 \wedge t_2 \wedge \ldots \wedge t_n \rightarrow h$. Predicates in Datalog are limited to flat terms of the form $t(s_1, s_2, \ldots, s_n)$, where $s_1, s_2, \ldots, s_n$ are variables, integers or strings. Given a Datalog rule $h \mathrel{:-} t_1, t_2, \ldots, t_n$, we say $h$ is the head of the rule and $t_1, t_2, \ldots, t_n$ is its body.
Datalog rules are often recursive, and they can contain negated predicates, represented as $!t$. However, negated predicates need to be stratified (there cannot be circular dependencies that involve negated predicates, e.g. $p(X) \mathrel{:-} !q(X)$ and $q(A) \mathrel{:-} !p(A)$). This restriction guarantees that the semantics are well defined. Additionally, all variables in a Datalog rule need to be grounded, i.e. they need to appear in a non-negated predicate in the body of the rule. Datalog also admits disjunctive rules denoted with a semicolon, e.g. $h \mathrel{:-} t_1 ; t_2$, that are equivalent to several regular rules $h \mathrel{:-} t_1$ and $h \mathrel{:-} t_2$.
The Datalog dialect that we adopt (Souffle's dialect) supports additional constructs such as arithmetic operations, string operations and aggregates. Aggregates compute operations over a complete set of predicates such as computing a summation, a maximum or a minimum. For example, we use aggregates to integrate the results of our heuristics.
A Datalog engine takes as input a set of facts, which are predicates known to be true, and a Datalog program (a set of rules). The initial set of facts defines our initial knowledge, and it is commonly known as the extensional database (EDB). The set of Datalog rules is commonly known as the intensional database (IDB). The Datalog engine generates new predicates by repeatedly applying the inference rules until a fixpoint is reached. One of the appeals of Datalog is that it is fully declarative. That means that the result of a computation does not depend on the order in which rules are considered or the order in which predicates within a rule's body are evaluated. This makes it easy to define multiple analyses that depend and collaborate with each other.
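To make the fixpoint evaluation concrete, the following is a minimal Python sketch of naive bottom-up evaluation. It is an illustration only: the helper names (`fixpoint`, `path_base`, `path_step`) and the `edge`/`path` predicates are assumptions introduced here, and engines such as Souffle use far more efficient evaluation strategies.

```python
def fixpoint(facts, rules):
    """Naive bottom-up evaluation: apply every rule until nothing new is derived."""
    known = set(facts)  # starts as the EDB
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(known)
        if derived <= known:  # fixpoint reached
            return known
        known |= derived


def path_base(known):
    # path(X,Y) :- edge(X,Y).
    return {("path", x, y) for (name, x, y) in known if name == "edge"}


def path_step(known):
    # path(X,Z) :- path(X,Y), edge(Y,Z).
    return {
        ("path", x, z)
        for (n1, x, y) in known if n1 == "path"
        for (n2, y2, z) in known if n2 == "edge" and y2 == y
    }


# Hypothetical EDB: a chain of three edges.
EDGES = {("edge", 1, 2), ("edge", 2, 3), ("edge", 3, 4)}
```

Calling `fixpoint(EDGES, [path_base, path_step])` and `fixpoint(EDGES, [path_step, path_base])` yields the same set of facts, which illustrates that the result does not depend on the order in which rules are considered.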
In our case, the initial set of facts encodes all the information present in the binary, the disassembly procedure (with all its auxiliary analyses) is specified as a set of Datalog rules, and the results of the disassembly are the new set of predicates that are generated by the Datalog engine. These predicates are then used to build an Intermediate Representation (IR) for binaries that can be reassembled. 
B. Encoding Binaries in Datalog
The first step in our analysis is to encode all the information present in the binary into Datalog facts (i.e. we populate the EDB). We consider two basic domains: strings, denoted as S, and 64 bit machine numbers, denoted as Z64. We consider also the following sub-domains: addresses A ⊆ Z64, register names R ⊆ S and operand identifiers O ⊆ Z64. We adopt the convention of having Datalog variables start with a capital letter and predicates with lower case. We represent addresses in hexadecimal and all other numbers in decimal. We only use the prefix 0x for hexadecimal numbers if there is ambiguity.
Fig. 1 declares the predicates used to represent the initial set of raw instruction facts in the EDB. Predicate fields are annotated with their type. To generate these initial facts we apply a decoder (Capstone [36]) to attempt to decode every address x in the executable sections of a binary. If the decoder succeeds, we generate an instruction fact with A = x. If the decoder fails, the fact invalid(x) is generated instead. In each instruction predicate, the field Size represents the size of the instruction, Prefix is a string representation of the instruction's prefix, and Opcode is a string representation of the instruction code. Instruction operands are stored as independent facts op_regdirect, op_immediate and op_indirect, whose first field Op contains a unique identifier. This identifier is used to match the operands to their instructions. The fields Op1 to Op4 in predicate instruction contain the operands' unique identifiers or 0 if the instruction does not have as many operands. We place source operands first and the destination operand last. Note that the operand identifiers have no particular meaning. They are assigned to operands sequentially as these are encountered during the decoding.
The encoding of the data sections is simpler. For each address A in a data section, a fact data_byte(A,Val) is generated where Val is the value of the byte at address A. We also generate the facts address_in_data(A,Addr) for each address A in a data section such that the values of the bytes from A to A+7 (8 bytes, see footnote 5) correspond to an address Addr that falls in the range of a code or data section of the binary. These facts will be our initial candidates for symbolization.
At the moment, we do not generate data_byte and address_in_data from the code sections. This is because our main analysis target has been Unix ELF binaries that generally do not contain data in code sections [4]. However, this could be easily changed. Note that ELF binaries do often contain padding that has to be properly handled.
Finally, additional facts are generated from the section, relocation and symbol tables of the executable as well as a special fact entry_point(ea:A) with the entry point of the executable. Note that for libraries, function symbol predicates are generated for all exported functions.
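The data-section encoding can be sketched in Python as follows. This is an illustrative sketch rather than the tool's implementation: the function name `encode_data_section`, the tuple representation of facts, and the `section_ranges` parameter are assumptions, and 8-byte values are read as little-endian, as on x64.

```python
import struct


def encode_data_section(base, data, section_ranges):
    """Return (data_byte, address_in_data) facts for one data section.

    base           -- address of the first byte of the section
    data           -- raw section contents (bytes)
    section_ranges -- list of (start, end) half-open ranges of code/data sections
    """
    # data_byte(A, Val): one fact per byte in the section.
    data_byte = [(base + i, value) for i, value in enumerate(data)]

    # address_in_data(A, Addr): the 8 bytes starting at A, read as a
    # little-endian 64-bit number, fall inside some code or data section.
    address_in_data = []
    for i in range(len(data) - 7):
        (value,) = struct.unpack_from("<Q", data, i)
        if any(start <= value < end for start, end in section_ranges):
            address_in_data.append((base + i, value))
    return data_byte, address_in_data
```

Note that the sketch considers every byte offset, aligned or not, since the candidates are only filtered later by the symbolization heuristics.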
IV. INSTRUCTION BOUNDARY IDENTIFICATION
The predicate instruction contains all the possible instructions that might be in the executable. Instruction boundary identification amounts to deciding which of these are real instructions.
Our instruction boundary identification is based on two traversals:
1) A backward traversal starting from addresses that are invalid.
2) A forward traversal that combines elements of linear-sweep and recursive-traversal.
Footnote 5: Our analysis considers the x64 architecture.
Both traversals rely on the predicates may_fallthrough(From:A,To:A) and must_fallthrough(From:A,To:A) to represent instructions at address From that may fall through or must fall through to an address To. Fig. 3 contains the rules that define both predicates (footnote 6). An instruction at address From may fall through to the next one at address From+Size as long as it is not a return, halt, or an unconditional jump instruction. Rule 1 depends in turn on other auxiliary predicates that abstract away specific aspects of concrete assembler instructions, e.g. return_operation is simply defined as return_operation('ret') for x64. The predicate must_fallthrough restricts may_fallthrough further by discarding instructions that might not continue to the next instruction, i.e. calls, jumps or interrupt operations (we consider instructions with a loop prefix as having a jump to themselves).
The traversals also depend on other auxiliary predicates whose definitions we omit.
Example 4.1: Consider the instructions in Fig. 2. The mov instruction at address 416C4E generates the predicates must_fallthrough(416C4E,416C53) and may_fallthrough(416C4E,416C53), whereas the call instruction only generates may_fallthrough(416C53,416C58). This is because the function at address 413050 (the target of the call) might not return. The call instruction also generates the predicate direct_call(416C53,413050).
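The two fall-through predicates can be approximated in Python as shown below. This is a sketch under stated assumptions: the opcode sets, the treatment of any mnemonic starting with "j" as a jump, and the treatment of `rep`-style prefixes as loop prefixes are simplifications chosen here, not the definitions used by the Datalog rules.

```python
# Assumed opcode classes (hypothetical, x64 mnemonics in lower case).
NO_FALLTHROUGH = {"ret", "hlt", "jmp"}   # return, halt, unconditional jump
CALLS_AND_INTERRUPTS = {"call", "int"}


def fallthrough(instructions):
    """instructions: iterable of (address, size, prefix, opcode) tuples.

    Returns (may_fallthrough, must_fallthrough) as sets of (From, To) pairs.
    """
    may, must = set(), set()
    for addr, size, prefix, opcode in instructions:
        if opcode in NO_FALLTHROUGH:
            continue
        nxt = addr + size
        may.add((addr, nxt))
        might_leave = (
            opcode in CALLS_AND_INTERRUPTS
            or opcode.startswith("j")    # conditional jumps
            or prefix.startswith("rep")  # loop prefix: jump to itself
        )
        if not might_leave:
            must.add((addr, nxt))
    return may, must
```

With the instructions of Example 4.1 (a 5-byte `mov` at 416C4E followed by a 5-byte `call` at 416C53), the sketch places both pairs in `may_fallthrough` but only the `mov` pair in `must_fallthrough`.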
A. Backward Traversal
Our backward traversal simply expands the set of invalid predicates through the implication that any instruction leading unconditionally to an invalid instruction must itself be invalid.
Footnote 6: Some of the rules have been slightly adapted for presentation purposes.
Rule 3 specifies that an instruction at address From that jumps, calls or must fall through to an address To that does not contain a potential instruction or to an address To that contains an invalid instruction is also invalid. The predicate possible_ea(A:A) (as in possible effective address) contains the addresses of the remaining instructions not discarded by invalid (Rule 4).
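The backward propagation can be sketched in Python as a small fixpoint loop. The function name `propagate_invalid` and the input representation are assumptions made for illustration; `edges` stands for the union of direct jumps, direct calls and must_fallthrough pairs.

```python
def propagate_invalid(instructions, invalid, edges):
    """Sketch of the backward traversal described by Rules 3 and 4.

    instructions -- set of addresses that decoded successfully
    invalid      -- set of addresses where decoding failed
    edges        -- set of (From, To) pairs: direct jumps, direct calls
                    and must_fallthrough
    """
    invalid = set(invalid)
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if src in invalid:
                continue
            # To holds no candidate instruction, or an invalid one.
            if dst in invalid or dst not in instructions:
                invalid.add(src)
                changed = True
    possible_ea = set(instructions) - invalid
    return invalid, possible_ea
```

In a Datalog engine the explicit loop disappears: the recursive rule is simply re-applied until no new invalid facts are derived.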
B. Forward Traversal
The forward traversal follows an approach that falls between the two classical approaches linear-sweep and recursive-traversal. Linear sweep starts from the beginning of the code section and traverses the whole section sequentially. On the other hand, recursive-traversal starts from a set of entry points and traverses the assembly code following the control flow graph, i.e. following jumps and calls. Linear-sweep can present problems in programs that contain data in the code sections. It can also be misled by padding introduced by compilers to ensure functions are aligned. Recursive-traversal can fail to discover parts of the code that are only reachable through indirect jumps or calls.
Our traversal traverses the code recursively but is much more aggressive than typical traversals in terms of the targets that it considers. Instead of starting the traversal only on the targets of direct jumps or calls, every address that appears in one of the operands of the already traversed code is considered a possible target. For example, in Fig. 2 , as soon as the analysis traverses instruction mov EDX, OFFSET 0x45CB23, it will consider the address 45CB23 as a potential target that it needs to explore. Additionally, addresses appearing in the data section (instances of predicate address_in_data) are also considered potential targets.
The traversal is defined with two mutually recursive predicates: possible_target(A:A) specifies addresses where we start traversing the code and code_in_block_candidate(A:A,Block:A) takes care of the traversing and assigning instructions to basic blocks. A predicate code_in_block_candidate(A:A,Block:A) denotes that the instruction at address A belongs to the candidate code block that starts at address Block.
The definition of these predicates can be found in Fig. 4. The traversal starts with the initial_target (Rule 8) that contains the addresses of: entry points, any existing function symbols, the start address of the code sections and all addresses in address_in_data. This last component implies that all the targets of jump tables (footnote 7) or function pointers present in the data sections will be traversed.
A possible target marks the beginning of a new basic block candidate (Rule 5). The candidate block is then extended as long as the instructions are guaranteed to fall through and we do not reach a block_limit (Rule 6). The predicate block_limit over-approximates possible_target (it is computed the same way but without requiring the predicate code_in_block_candidate in Rule 9). Rule 7 starts a new block if the instruction is not guaranteed to fall through or if there might be a block limit. That is where the previous block ends. Any addresses or jump/call targets that appear in a block candidate are considered new possible targets (Rule 9). For example, instruction mov EDX, OFFSET 45CB23 generates may_have_symbolic_immediate(416C4E,45CB23). Note that this is much more aggressive than a typical recursive traversal that would only consider the targets of jumps or calls. Finally, Rule 10 adds a linear-sweep component to the traversal. after_block_end(End:A,A:A) contains addresses A after blocks that end with an instruction that cannot fall through at End (e.g. an unconditional jump or a return). This predicate skips any padding that might be found after the end of the previous block.
It is worth noting that in our Datalog specification we do not have to worry about many issues that would be important in lower level implementations of equivalent binary traversals. For instance, we do not need to keep track of which instructions and blocks have already been traversed nor do we specify the order in which different paths are explored.
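For contrast, the sketch below shows what an imperative version of a simplified forward traversal might look like in Python, including the bookkeeping that the Datalog specification avoids. It is a loose approximation under several assumptions: instructions are classified by a hypothetical `kind` field, a block started by Rule 7 is simply treated as another target, and the linear-sweep component (Rule 10) with its padding handling is omitted.

```python
def forward_traversal(insns, initial_targets, references):
    """insns      -- dict: address -> (size, kind), with kind one of
                     'plain', 'call', 'jump' (unconditional) or 'ret'
    initial_targets -- iterable of starting addresses
    references -- dict: address -> set of addresses mentioned by that
                  instruction (immediates and jump/call targets)
    """
    # block_limit over-approximates possible_target: every address
    # referenced by any candidate instruction, traversed or not.
    block_limit = set(initial_targets)
    for refs in references.values():
        block_limit |= refs

    possible_target, code_in_block = set(), set()
    worklist = list(initial_targets)
    while worklist:
        target = worklist.pop()
        if target in possible_target or target not in insns:
            continue  # explicit "already visited" bookkeeping
        possible_target.add(target)
        addr = block = target
        while True:
            code_in_block.add((addr, block))
            worklist.extend(references.get(addr, ()))  # new targets
            size, kind = insns[addr]
            nxt = addr + size
            if kind in ("ret", "jump") or nxt not in insns:
                break  # cannot fall through
            if kind == "plain" and nxt not in block_limit:
                addr = nxt  # extend the current block
            else:
                worklist.append(nxt)  # next instruction starts a new block
                break
    return possible_target, code_in_block
```

The visited-set check and the explicit worklist are exactly the kind of detail that the declarative rules leave to the Datalog engine.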
C. Block Overlap Resolution
Once the second traversal is over, we have a set of candidate blocks, each one with a set of instructions (encoded in the predicate code_in_block_candidate). These blocks represent our best effort to obtain an over-approximation of the basic blocks in the original program. In principle, it is possible to miss code blocks. However, such a code block would have to be reachable only through a computed jump/call and be preceded by data that derails the linear-sweep component of the traversal (Rule 10). We have not found any instance of this situation. We remark that if the address of a block appears anywhere in the code or in the data, it will be considered. This is similar to the idea of "Binary characterization" presented in [44] to compute a superset of possible indirect control flow targets.
The next step in our instruction boundary identification is to detect the blocks that overlap with each other. Overlapping blocks are extremely uncommon in compiled code. When they appear they tend to follow very specific patterns such as having a block start with or without a lock prefix [30]. We can deal with those patterns with ad-hoc rules. Once those patterns have been taken into account, we consider that the remaining overlapping blocks should not overlap. Thus, if two blocks overlap, we assume one of them is spurious and needs to be discarded. This assumption could be relaxed if we wanted to disassemble malware but it is generally useful for regular compiled binaries.
We decide which block to discard using heuristics. Predicate block_points(Block:A,Source:A,Points:Z64,Why:S) assigns Points points to the block starting at address Block. Source is an optional reference to another block that is the cause of the points or zero for heuristics that are not based on other blocks. The field Why is a string that describes the heuristic for debugging purposes (and to distinguish the predicate from others generated from different heuristics).
Given two overlapping blocks (block_overlap), we discard the one with the fewest points:
This rule exemplifies how aggregates are used to compute the total number of points for each candidate block. This idea of obtaining a superset of possible basic blocks and then resolving conflicts between blocks is also present in other disassemblers [8], [26].
Our heuristics are mainly based on how the conflicting blocks are connected to other blocks. For example, the rule below adds 6 points to a block Block for each other block BlockPred that has no conflicts and has a direct jump to Block.
Another example is the following rule which adds two points to the block Block if its address appears in one of the data sections and the address is aligned. 
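The point-based resolution described above can be sketched in Python as follows. The function name `resolve_overlaps`, the example addresses and the `Why` strings are hypothetical; the handling of ties (neither block is discarded) is an assumption, since the text does not specify it.

```python
from collections import defaultdict


def resolve_overlaps(block_points, block_overlap):
    """block_points  -- iterable of (Block, Source, Points, Why) facts
    block_overlap -- iterable of (Block1, Block2) pairs of overlapping blocks

    Returns (totals, discarded): the summed points per block and the set
    of blocks that lose against an overlapping block.
    """
    # Aggregate: sum the points contributed by every heuristic.
    totals = defaultdict(int)
    for block, _source, points, _why in block_points:
        totals[block] += points

    discarded = set()
    for a, b in block_overlap:
        if totals[a] < totals[b]:
            discarded.add(a)
        elif totals[b] < totals[a]:
            discarded.add(b)
        # Assumption: on a tie, neither block is discarded here.
    return dict(totals), discarded


# Hypothetical evidence: one direct jump from a conflict-free block
# (6 points) and one aligned address found in a data section (2 points).
POINTS = [
    (0x401000, 0x400F00, 6, "direct jump"),
    (0x401000, 0, 2, "aligned address in data"),
]
OVERLAPS = [(0x401000, 0x401001)]
```

Here the block at 0x401000 accumulates 8 points while its overlapping rival at 0x401001 has none, so the latter is discarded.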
V. AUXILIARY ANALYSES
Once instruction boundaries have been computed, we have the location of all basic blocks in the binary. The next step in our disassembly procedure is to perform symbolization. However, in order to gather evidence for the symbolization, we first perform several static analyses. The goal of these analyses is to infer how data is accessed and used to deduce its layout.
A. Register Def-Use Analysis
First, we compute register definition-use chains. The analysis produces predicates stating that the register Reg is defined at address Adef and used at address Aused in the operand with index Index.
The analysis first infers definitions def(Adef:A,Reg:R) and uses use(Aused:A,Reg:R,Index:Z64). Then, it propagates definitions through the code and matches them to uses. The analysis is intra-procedural in the sense that it does not traverse calls but only direct jumps. This makes the analysis incomplete but improves scalability. During the propagation of definitions, the analysis assumes that certain registers keep their values through calls following the Linux x64 calling convention [29]. One important detail is that the analysis considers the 32-bit and 64-bit registers as one, given that the x64 architecture zeroes the upper part of a 64-bit register whenever the corresponding 32-bit register is written. That means that for instruction mov EDX, OFFSET 0x45CB23 at address 416C4E, the analysis generates a definition def(416C4E,RDX).
B. Register Definitions Used for Address
Once we have computed def-use chains, we want to know which register definitions are potentially used to compute addresses that are used to access memory. For that purpose, the disassembler computes a new predicate: def_used_for_address(Adef:A,Reg:R) that denotes that the register Reg defined at address Adef might be used to compute a memory access. This predicate is computed by traversing the def-use chains backwards starting from instructions that access memory. This traversal is transitive: if a register R is used in an instruction that defines another register R' and that register is used to compute an address, then we consider that R is also used to compute an address. This is elegantly captured in a single Datalog rule.
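As a rough illustration of this backward, transitive traversal, the following Python sketch computes a set playing the role of def_used_for_address. The function name, the three-field representation of def-use chains (the operand index is dropped) and the `address_uses` input are assumptions made for the example.

```python
def def_used_for_address(def_used, defs, address_uses):
    """def_used     -- set of (Adef, Reg, Aused): Reg defined at Adef, used at Aused
    defs         -- set of (A, Reg): the instruction at A defines Reg
    address_uses -- set of (Aused, Reg): Reg is used at Aused to compute
                    the address of a memory access

    Returns the set of (Adef, Reg) definitions that may reach an address.
    """
    result = set()
    changed = True
    while changed:
        changed = False
        for adef, reg, aused in def_used:
            if (adef, reg) in result:
                continue
            # Base case: the use itself computes a memory address.
            direct = (aused, reg) in address_uses
            # Transitive case: the using instruction defines a register
            # that is already known to be used for an address.
            indirect = any(
                a == aused and (a, reg2) in result for (a, reg2) in defs
            )
            if direct or indirect:
                result.add((adef, reg))
                changed = True
    return result
```

For instance, if RAX defined at one instruction feeds an `add RBX, RAX` whose result RBX is later dereferenced, both definitions end up in the result, while an unrelated definition that never reaches a memory access does not.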
C. Register Value Analysis
In contrast to instructions that refer to code, where direct references (direct jumps or calls) predominate, memory accesses are usually computed. Rather than accessing a fixed address, instructions typically access addresses computed with a combination of register values and constants. This address computation is often done over several instructions. Such is the case in the example code in Fig. 2.
In order to approximate this behavior, we developed an analysis that computes the value held in a register at an address. There are many ways of approximating the values of the registers, ranging from simple constant propagation to complex abstract domains that take memory locations into account, e.g., [6]. Generally, the more complex the analysis domain, the more expensive it is. Therefore, we have chosen a minimal representation that captures the kind of register values that are typically used for accessing memory. Our value analysis representation is based on the idea that typical memory accesses follow a particular pattern where the memory address that is accessed is computed using a base address plus an index multiplied by a multiplier. Consequently, the value analysis produces predicates of the form:
value_reg(A:A,Reg:R,A2:A,Reg2:R,Mult:Z64,Offset:Z64)
which represents that the value of a register Reg at address A is equal to the value of another register Reg2 at address A2 multiplied by a number Mult plus an offset Offset (or displacement).
The analysis proceeds in two phases. The first phase produces predicates of the form value_reg_edge, which share their signature with value_reg. We generate one value_reg_edge per instruction and def-use predicate, for the instructions whose behavior can be modeled in this domain and that are used to compute an address (def_used_for_address). For example, Rule 15 generates value_reg_edge predicates for add instructions that add a constant to a register.
In the running example, three of the generated predicates concern RBX. The first captures that RBX has a constant value after executing the instruction at address 416C35 (note that the multiplier is 0 and the register has the special value 'NONE'). The second, generated from Rule 15, specifies that the value of RBX defined at address 416C58 corresponds to the value of RBX defined at 416C35 plus 24. The third predicate denotes that the value of RBX at 416C58 can be the result of incrementing the value of RBX defined at the same address by 24.
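The edge-generation phase can be sketched in Python for two instruction kinds only: loading a constant and adding a constant. The instruction tuples, the operation names mov_imm and add_imm, and the choice of the instruction's own address as the second address field of a constant edge are assumptions for illustration. The sketch also skips the def_used_for_address filter.

```python
def value_reg_edges(instructions, def_used):
    """Generate (A, Reg, A2, Reg2, Mult, Offset) edges.

    instructions: list of (address, op, register, immediate)
    def_used: set of (Adef, Reg, Aused)
    """
    edges = set()
    for addr, op, reg, imm in instructions:
        if op == "mov_imm":
            # Constant value: multiplier 0 and the special register 'NONE'.
            edges.add((addr, reg, addr, "NONE", 0, imm))
        elif op == "add_imm":
            # One edge per definition of reg that reaches this instruction.
            for adef, dreg, aused in def_used:
                if aused == addr and dreg == reg:
                    edges.add((addr, reg, adef, reg, 1, imm))
    return edges

# RBX is set to -624 at 416C35 and incremented by 24 at 416C58 inside a loop,
# so the add is reached both by the initial definition and by itself.
instructions = [(0x416C35, "mov_imm", "RBX", -624),
                (0x416C58, "add_imm", "RBX", 24)]
def_used = {(0x416C35, "RBX", 0x416C58), (0x416C58, "RBX", 0x416C58)}
edges = value_reg_edges(instructions, def_used)
```

The three resulting edges correspond to the three predicates discussed in the text: a constant edge, an edge from the initial definition, and a self-edge for the loop increment.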
The set of predicates value_reg_edge can be seen as a directed relational graph. The nodes in the graph are pairs of address and register (A, Reg), and the edges express relations between their values, i.e., they are labeled with a multiplier and an offset.
Once this graph is computed, we perform a second propagation phase akin to a transitive closure. This propagation phase chains together value_reg_edge predicates. The chaining starts from the leaves of the graph (nodes with no incoming edges). Leaves in the value_reg_edge graph can be instructions that load a constant into a register, such as mov RBX, -624 in Fig. 2, or instructions where a register is defined by an operation not supported by the domain, for example, loading a value from memory: 416C40: mov RDI, [RIP + 0x673E80] in Fig. 2. In that case, the generated predicate would be the tautological predicate value_reg(416C40,RDI,416C40,RDI,1,0).
In order to ensure termination, and for efficiency reasons, we limit the number of propagation steps by a constant step_limit, tracked with an additional field S:Z64 in the value_reg predicates. The main rule for combining value_reg_edge predicates is Rule 16.
This rule can chain edges linearly by combining the multipliers and the offsets. It can keep track of operations that involve one source register and one destination register. However, we also want to detect certain situations where multiple edges converge into one instruction. Specifically, we want to detect two kinds of situations: loops and diamonds.
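Linear chaining with a step bound can be sketched as follows. The composition formula is an assumption derived from the meaning of the predicates: if Reg1 = Reg2 * M1 + O1 and Reg2 = Reg3 * M2 + O2, then Reg1 = Reg3 * (M1 * M2) + (O2 * M1 + O1). The seed facts, the default limit, and the example addresses are hypothetical, and Ddisasm may normalize constants differently.

```python
def propagate(edges, seeds, step_limit=3):
    """Chain edges onto known value_reg facts for at most step_limit steps.

    edges and seeds are sets of (A, Reg, A2, Reg2, Mult, Offset) tuples.
    """
    facts = {seed + (0,) for seed in seeds}  # append the step counter S
    frontier = set(facts)
    while frontier:
        new = set()
        for (a2, r2, a3, r3, m2, o2, s) in frontier:
            if s >= step_limit:
                continue  # bound the number of chaining steps
            for (a1, r1, ea2, er2, m1, o1) in edges:
                if (ea2, er2) == (a2, r2):
                    fact = (a1, r1, a3, r3, m1 * m2, o2 * m1 + o1, s + 1)
                    if fact not in facts:
                        new.add(fact)
        facts |= new
        frontier = new
    return {fact[:6] for fact in facts}

# Hypothetical code: RCX is loaded from memory at 0x10 (a leaf),
# RAX = RCX * 4 at 0x14, and RAX = RAX + 16 at 0x18.
seeds = {(0x10, "RCX", 0x10, "RCX", 1, 0)}
edges = {(0x14, "RAX", 0x10, "RCX", 4, 0), (0x18, "RAX", 0x14, "RAX", 1, 16)}
values = propagate(edges, seeds)
```

Because the step counter is part of each fact, cyclic edges (such as a loop increment) cannot make the propagation run forever; they only produce facts up to the chosen limit.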
Detecting Simple Loops. Rule 17 detects situations where a register R is initialized to a constant O1 and then incremented or decremented inside a loop by a constant O2.
This pattern can be interpreted as O1 being the base for a memory address and O2 being the multiplier used to access different elements of a data structure. Our new multiplier O2 does not actually multiply any real register, so we set the register field to a special value 'Unknown'.
In the running example, chaining yields value_reg(416C58,'RBX',0,'NONE',1,−600) using Rule 16. Finally, Rule 17 is applied, generating value_reg(416C58,'RBX',0,'Unknown',24,−600), which denotes that the register 'RBX' takes values that start at −600 and are incremented in steps of 24 bytes.
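The loop pattern can be sketched in Python as a join between a constant-valued fact and a self-edge on the same (address, register) node. The function name and the exact matching conditions are assumptions; only the two input facts and the resulting fact are taken from the text.

```python
def detect_simple_loops(value_reg, edges):
    """Combine a constant-valued register with a self-incrementing edge.

    value_reg and edges are sets of (A, Reg, A2, Reg2, Mult, Offset) tuples.
    """
    derived = set()
    for (a, reg, _a2, reg2, _mult, base) in value_reg:
        if reg2 != "NONE":
            continue  # only registers known to hold a constant qualify
        for (ea, ereg, ea2, ereg2, emult, step) in edges:
            is_self_edge = (ea, ereg) == (a, reg) == (ea2, ereg2)
            if is_self_edge and emult == 1 and step != 0:
                # The step becomes the multiplier of an unknown index.
                derived.add((a, reg, 0, "Unknown", step, base))
    return derived

value_reg = {(0x416C58, "RBX", 0, "NONE", 1, -600)}
edges = {(0x416C58, "RBX", 0x416C35, "RBX", 1, 24),
         (0x416C58, "RBX", 0x416C58, "RBX", 1, 24)}
loops = detect_simple_loops(value_reg, edges)
```

Only the self-edge at 416C58 triggers the pattern; the edge coming from 416C35 is ignored because it does not return to the same node.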
Detecting Diamond Patterns. We call diamond patterns situations where an instruction uses two different registers but these two registers are defined in terms of each other or in terms of a common third register. For example, consider a code sequence whose last instruction (at address 3) adds the registers RAX and RBX, and in which the value of RAX at that point is two times the value of RBX. This is reflected in the predicates value_reg(2,RAX,0,RBX,2,0) and value_reg(0,RBX,0,RBX,1,0). Therefore, we can generate a predicate value_reg(3,RAX,0,RBX,3,0).
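The arithmetic behind the diamond case is a sum of two linear expressions over the same base: (Base * M1 + O1) + (Base * M2 + O2) = Base * (M1 + M2) + (O1 + O2). A small Python sketch, with a hypothetical helper name and restricted to add instructions:

```python
def combine_add(addr, dst_fact, src_fact):
    """Combine the operands of 'add dst, src' executed at address addr.

    Both facts are (A, Reg, A2, Reg2, Mult, Offset) tuples. They can only be
    combined when they are expressed over the same base node (A2, Reg2).
    """
    _, dst, a2, r2, m1, o1 = dst_fact
    _, _src, b2, s2, m2, o2 = src_fact
    if (a2, r2) != (b2, s2):
        return None
    return (addr, dst, a2, r2, m1 + m2, o1 + o2)

# RAX holds 2 * RBX and RBX holds 1 * RBX, both relative to RBX at address 0.
result = combine_add(3, (2, "RAX", 0, "RBX", 2, 0), (0, "RBX", 0, "RBX", 1, 0))
```

When the two operands do not share a base node, the helper returns None and no new fact is derived.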
Note that the register value analysis intends to capture some of the relations between register values, but it makes no attempt to capture all of them. The goal of this analysis is not to obtain a sound over-approximation of the register values but to provide as much information as possible about how memory is accessed. The analysis is also not strictly an under-approximation, as it is based on def-use chains, which are over-approximating.
D. Data Access Pattern Analysis
The data access pattern analysis takes the results of the register value analysis and the results of the def-use analysis to infer the values of registers at each of the data accesses and thus compute which addresses are accessed and which pattern is used to access them. The data access pattern analysis generates predicates of the form: data_access_pattern(A:A,Size:Z64,Mult:Z64,Origin:A) which specifies that address A is accessed from an instruction at address Origin and Size bytes are read or written. Moreover, the access uses a multiplier Mult.
Example 5.5: The code in Fig. 2 generates several data accesses. The instruction at address 416C40 produces data_access_pattern(673E80,8,0,416C40). This represents an access to a fixed address that reads 8 bytes. Conversely, the instruction at address 416C47 yields more than one data access.
This is because register RBX can have multiple values at address 416C47. In general, if we have multiple data accesses to the same address, we choose the one with the highest multiplier.
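The selection among competing accesses to one address can be sketched in a few lines of Python. The tuple layout follows data_access_pattern(A, Size, Mult, Origin); the concrete patterns in the example are hypothetical.

```python
def best_access_per_address(patterns):
    """Keep, for each accessed address, the pattern with the highest multiplier."""
    best = {}
    for addr, size, mult, origin in patterns:
        if addr not in best or mult > best[addr][2]:
            best[addr] = (addr, size, mult, origin)
    return best

# Hypothetical patterns: the same address seen with and without a multiplier.
patterns = [(0x5000, 8, 0, 0x1000), (0x5000, 8, 24, 0x1000), (0x6000, 4, 0, 0x1010)]
best = best_access_per_address(patterns)
```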
These data access patterns provide very sparse information, but if an address x is accessed with a multiplier m, it is likely that x + m, x + 2m, etc., are also accessed the same way. Thus we extend data access patterns based on their multiplier. The analysis produces a predicate propagated_data_access with the same format as data_access_pattern. Our auxiliary analyses provide no information on what is the upper limit of an index in a data access. Thus, we simply propagate a data access pattern until it reaches the next data access pattern that coincides on the same address or that has a different multiplier.
The idea behind this criterion is that the next data structure in the data section is probably accessed from somewhere in the code. So rather than trying to determine the size of the data structure being accessed, we assume that such data structure ends where the next one starts.
Example 5.6: In our running example (Fig. 2), the data access pattern data_access_pattern(45D0D0,8,24,416C40) is propagated from address 45D0D0 up to address 45D310 in 24-byte intervals. The generated predicates are:
propagated_data_access(45D0D0,8,24,416C40)
propagated_data_access(45D0E8,8,24,416C40)
...
propagated_data_access(45D310,8,24,416C40)
The data access pattern is not propagated to the next address, 45D328, because that address contains another data access pattern generated from a different part of the code.
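The propagation criterion can be sketched in Python as follows. The function name, the dictionary of other patterns (address to multiplier), and the section end are assumptions for illustration; the multiplier of the pattern at 45D328 is not given in the text, so the value 8 used below is hypothetical.

```python
def propagate_access(pattern, others, section_end):
    """Extend one data access pattern in steps of its multiplier.

    pattern: (A, Size, Mult, Origin)
    others: dict mapping the address of every other pattern to its multiplier
    Propagation stops at the first later pattern that either lands on a
    propagated address or has a different multiplier.
    """
    addr, size, mult, origin = pattern
    if mult <= 0:
        return [pattern]  # fixed-address accesses are not propagated
    limit = section_end
    for other_addr, other_mult in others.items():
        blocks = (other_addr - addr) % mult == 0 or other_mult != mult
        if other_addr > addr and blocks:
            limit = min(limit, other_addr)
    return [(a, size, mult, origin) for a in range(addr, limit, mult)]

pattern = (0x45D0D0, 8, 24, 0x416C40)
others = {0x45D328: 8}       # hypothetical multiplier for the next pattern
section_end = 0x45E000       # hypothetical end of the data section
propagated = propagate_access(pattern, others, section_end)
```

With these inputs the propagation covers 25 addresses, from 45D0D0 to 45D310, and stops before 45D328, as in Example 5.6.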
These propagated data access patterns will inform heuristics in the symbolization phase of the disassembly.
VI. SYMBOLIZATION
The next step to make the disassembled code reassembleable is to perform symbolization, also known as Literal-Reference Disambiguation. It consists of deciding, for each constant in the code or data element in the data sections, whether it is a literal or a symbol. A first approximation can be achieved by considering as symbols all the numbers that fall within the range of the address space. However, as reported by Wang et al. [48], this leads to both false positives and false negatives.
Next, we explain our approach to reduce the presence of false positives and negatives.
A. False Positives: Value Collisions
False positives are due to value collisions: literals that happen to coincide with the range of possible addresses. In order to reduce the false positive rate, we require additional evidence in order to classify a number as a symbol.
Numbers in Data Sections. For symbols in the data section, similarly to the approach used for blocks, we start by defining a set of candidates:
data_object_candidate(A,PtSize,'symbol'):-
    pointer_size(PtSize),
    address_in_data(A,_).
We define candidates of three kinds: symbol, whenever the number falls into the right range; string, whenever we have a sequence of printable characters terminated by 0; and other, if we detect that an address is accessed with a size different from the pointer size (8 bytes in the x64 architecture). Recall that propagated_data_access predicates are the result of the data access pattern analysis described in Sec. V-D.
We also assign points to each of the candidates based on heuristics and analyses, and detect whether candidates overlap. If they do, we discard the candidate with fewer points. However, even for data object candidates with no overlap, we require a minimum number of points to consider them to be data objects.
The main heuristics we use to symbolize data objects are:
Pointer to instruction beginning. Whenever we have a symbol candidate pointing to the code section, we assign more points if it is pointing to the beginning of an instruction. This heuristic relies on the results of the already computed instruction boundary identification.
Symbol arrays. We assign more points to contiguous or evenly spaced symbol candidates. This is because these usually correspond to jump tables or function tables. Also, it is significantly less likely to have several consecutive value collisions.
Aligned symbols. We assign extra points if the symbol candidate is located at an address with 8-byte alignment.
Long strings. Longer string candidates receive more points.
Access conflict. If there is some access to data in the middle of a symbol candidate, we subtract points.
Data access match. If a data object candidate is accessed from the code with the right size, it receives points. This heuristic checks the existence of a propagated_data_access that matches the data object candidate.
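The point-based selection can be sketched in Python as a greedy pass over candidates sorted by score. The weights, the threshold, the evidence labels, and the candidate records below are all hypothetical, since the text does not list concrete point values; the sketch only illustrates the two stated criteria (overlap resolution and a minimum score).

```python
# Hypothetical weights; the actual point values are not given in the text.
WEIGHTS = {"points_to_instruction_start": 2, "in_symbol_array": 2, "aligned": 1,
           "long_string": 1, "access_conflict": -2, "data_access_match": 2}
THRESHOLD = 2  # hypothetical minimum number of points

def score(candidate):
    """Sum the points of all pieces of evidence attached to a candidate."""
    return sum(WEIGHTS[name] for name in candidate["evidence"])

def overlaps(c1, c2):
    """True when the byte ranges of two candidates intersect."""
    return (c1["addr"] < c2["addr"] + c2["size"]
            and c2["addr"] < c1["addr"] + c1["size"])

def select(candidates):
    """Keep candidates above the threshold; on overlap, the higher score wins."""
    chosen = []
    for cand in sorted(candidates, key=score, reverse=True):
        if score(cand) < THRESHOLD:
            continue
        if any(overlaps(cand, kept) for kept in chosen):
            continue
        chosen.append(cand)
    return chosen

candidates = [
    {"addr": 0x1000, "size": 8, "kind": "symbol",
     "evidence": ["aligned", "in_symbol_array", "data_access_match"]},
    {"addr": 0x1004, "size": 8, "kind": "symbol",
     "evidence": ["points_to_instruction_start", "data_access_match", "access_conflict"]},
    {"addr": 0x1010, "size": 8, "kind": "symbol", "evidence": ["aligned"]},
]
selected = select(candidates)
```

Here the candidate at 0x1004 reaches the threshold but overlaps a stronger candidate, and the one at 0x1010 has no overlap but too few points, so only the candidate at 0x1000 survives.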
Numbers in Code.
We follow the same approach to disambiguate numbers in instruction operands. However, only the first heuristic, "Pointer to instruction beginning," of the ones listed above is applicable to numbers in code. We distinguish two cases: numbers that represent immediate operands and numbers that represent a displacement in an indirect operand.
Once the Pointer to instruction beginning heuristic is taken into account, we have not found false positives in displacements. For immediate operands we consider the following additional heuristics:
Uncommon pointer operation. We subtract points if the immediate is used in an operation that is uncommon for pointers, such as AND or XOR.
Used for address. We add points if the immediate is stored in a register that is used to compute an address (we use the predicate def_used_for_address from Sec. V-B).
Compared to non-address. We subtract points if the immediate is compared or moved to a register that in turn is compared to another immediate that is not an address candidate.
These heuristics are tailored to the inference of how the immediate is used. They use the def-use chains and the results of the register value analysis for that purpose.
In contrast to numbers in data, where we require additional evidence to classify a number as a symbol, for numbers in code we default to considering them symbols unless there is enough evidence to the contrary.
B. False Negatives: Symbol+Constant
False negatives can occur in situations where the original code contains an expression of the form symbol+constant. In such cases, the binary under analysis contains the result of computing that expression.
There is no general way to recover the original expression in the code, as that information is simply not present in the binary. Having a new symbol pointing to the result of the symbol+constant instead of the original expression is not a problem for rewrites which leave the data sections unmodified (even if the sections are moved) or rewrites that only add data to the end of the data sections. However, sometimes the address that results from a symbol+constant expression falls outside the data section ranges or falls into the wrong data section. In such cases, a naive symbolization approach can result in false negatives.
We detect and correct these cases by detecting common patterns where compilers generate symbol+constant, using the results of our def-use analysis and the value analysis. We distinguish two cases: displacements in indirect operands and immediate operands.
Displacements in indirect operands. Typically, in a data access of this kind, one of the addends represents a valid base address that points to the beginning of a data structure and the rest of the addends represent an offset into the data structure. In our generic [...]
Knowing that a displacement should be symbolic is not enough; we need to infer the right data section to which the symbolic expression should refer. If the data access generates a data_access_pattern, we use the address of the data access pattern as a reference for creating the symbolic expression. Otherwise, we choose the closest boundary of a data section as a reference.
Immediate operands. Having a symbolic immediate that falls outside the data sections is uncommon. The main pattern that we have identified is when the immediate is used as an initial value for a loop counter or as a loop bound to which a loop counter is compared.
Example 6.1: Consider the example in Fig. 5 . The number loaded at address 4010A2 represents a loop bound and it is used in instruction 4010C9 to check if the end of the data structure has been reached. Address 402d40 belongs to section .rodata but address 402D48+160 is the beginning of section .eh_frame_hdr (in fact it coincides with the symbol __GNU_EH_FRAME_HDR).
We detect this and similar patterns by combining the information of the def-use analysis and the value analysis. We note that in these situations, the address that falls outside the section or in a different section and the valid range of the right section are within the distance of one multiplier. That is, let x be a candidate address that might represent the result of a symbol+constant expression, and let [s_i, s_f) be the range of addresses of the original symbol section. Then x ∈ [s_i − M, s_f + M), where M is the increment of the loop counter. Therefore, our detection mechanism generates a section range as above for every register that we identify as a loop counter. In Example 6.1, the resulting symbolic expression is written relative to .L_402D48, a new symbol pointing to the address 402D48.
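The extended range test is a one-line check, sketched here in Python. The helper name, the start address assumed for .rodata, and the loop increment of 8 are hypothetical; only the address 402D48+160 comes from Example 6.1.

```python
def in_extended_range(x, section_start, section_end, increment):
    """Check x ∈ [s_i − M, s_f + M) for a loop counter with increment M."""
    m = abs(increment)
    return section_start - m <= x < section_end + m

RODATA_START = 0x402000        # hypothetical start of .rodata
RODATA_END = 0x402D48 + 160    # first address past .rodata in Example 6.1
loop_bound = 0x402D48 + 160    # immediate used as the loop bound

strictly_inside = RODATA_START <= loop_bound < RODATA_END
accepted = in_extended_range(loop_bound, RODATA_START, RODATA_END, 8)
```

The loop bound lies one past the end of the section, so the plain range test rejects it while the extended test accepts it.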
VII. IMPLEMENTATION
We implemented our disassembly technique in a tool called Ddisasm (Datalog disassembler). Ddisasm takes a binary and produces an internal representation called GTIRB (GrammaTech Intermediate Representation for Binaries). This representation contains among other things a control flow graph and the symbolic information necessary for reassembly. GTIRB can be printed to assembly code that can be directly reassembled. Currently Ddisasm only supports x64 Linux ELF binaries but we plan to extend it to support other architectures and binary formats.
Ddisasm is predominantly implemented in Datalog which is compiled into highly efficient parallel C++ code using Souffle [23] . The Datalog code contains 3690 non-empty lines of code. Table III contains the lines of code of each of the components of the disassembly. The category "Other" represents auxiliary predicates used in multiple modules as well as specialized rules to deal with specific features such as PLT tables.
The remainder of Ddisasm, i.e., the encoding of a binary into Datalog facts and the use of the results of the Datalog analyses to generate GTIRB, is written in C++ (2799 LOC).
VIII. EXPERIMENTAL EVALUATION
We performed several experiments against a variety of benchmarks, compilers, and optimization flags.
Benchmarks. We selected 3 benchmarks. The first one is Coreutils 8.25 which is composed of 106 binaries and has been used in the experimental evaluations of Ramblr [48] and Uroboros [49] .
The second benchmark is a subset of the programs from the DARPA Cyber Grand Challenge (CGC). We adopt a modified version of these binaries that can be compiled for Linux systems in x64. We exclude programs that fail to compile or fail to pass all their tests. We keep the programs that pass some of the tests and consider the passing tests as a baseline. That leaves 69 different CGC programs.
Finally, the third benchmark is a collection of 25 real world open source applications, listed in Table IV.
In summary, we test 2120 different binaries for Coreutils, 1656 binaries for the CGC benchmark, and 600 binaries from our real world binaries selection. All benchmarks together represent a total of 399 MB of binaries. Note that even though the number of real world examples is smaller, they represent a significant portion of the total disassembled binary data (106 MB).
A. Symbolization Experiments
In a first experiment we run the disassembler on all the benchmarks and collect the number of false positives and false negatives in the symbolization procedure. We also detect an additional kind of error, i.e., when we create a symbolic expression but the symbol belongs to the wrong section. This can happen if the techniques applied in Section VI-B fail.
For comparison we run the same experiments using Ramblr, the tool with the best published symbolization results. Table I contains the results of this experiment. Ddisasm presents a very low rate of false positives, false negatives, or references pointing to the wrong section. This shows the effectiveness of the approach. Ddisasm builds on many of the ideas implemented in Ramblr [48], but makes significant improvements. Ramblr has a significantly higher number of both false positives and false negatives. Additionally, at the moment, we do not detect references pointing to the wrong section in Ramblr, as this information is not available from their disassembler. This means that the numbers in the 'Broken' column are biased against Ddisasm, as there might be binaries that are broken by Ramblr by having references pointing to the wrong section that are not counted.
TABLE IV. REAL WORLD EXAMPLE BENCHMARK. Each program is annotated with its size in KB when compiled with GCC 7.1.0 and optimization flag -O0.
Ramblr performs quite well on Coreutils (in line with their experiments), but its precision drops greatly against the real world examples. It is worth noting that even though the real world benchmark has fewer programs, these are considerably bigger. This can be appreciated in the number of references, which is higher in the real world examples benchmark.
B. Functionality Experiments
Using the same benchmarks we check how many of the disassembled binaries can be reassembled and how many of those pass their original test suites without errors. This experiment demonstrates that symbolization as well as other aspects of the disassembly such as instruction boundaries are correct.
The results of this experiment are in Table II. We disassemble the binaries with Ddisasm and Ramblr, reassemble the resulting assembly code with gcc, and run the original tests on the new binaries. We also perform the experiment with stripped versions of the binaries. In that case, we strip the binaries before running the disassemblers.
Ddisasm is able to produce reassembleable assembly code for all the binaries, and only 3 in the CGC benchmark fail some of the tests. Note that the number of binaries that fail some tests is smaller than the number of broken binaries according to our previous experiment (Table I). This is because the test suites of the binaries are not exhaustive. The results also show that Ddisasm does not depend on the information present in symbol tables and can perform equally well with stripped binaries.
On the other hand, there are many binaries that fail to reassemble with Ramblr and the results of the tests are worse than those of the symbolization information. We have found and reported several bugs to the Ramblr authors which they have promptly fixed but there might be others that cause additional failures. Ramblr fails to produce reassembleable assembly for the stripped versions of most programs in Coreutils and the real world benchmarks. Many of the failures are because Ramblr does not find the main function or generates assembly with undefined labels. We believe that these are not fundamental issues and should be easy to fix in most cases.
C. Performance Evaluation
Finally, we measure and compare the performance of both Ramblr and Ddisasm. We measure the time that it takes to disassemble each of the binaries in the three benchmarks.
Fig. 6. Runtime Performance Evaluation. The three graphs show the disassembly times (in seconds) for the binaries from each class of benchmark: Real world, Coreutils, and CGC. The disassembly time for Ddisasm is plotted (vertically) against the disassembly time of Ramblr (horizontally). In all graphs, points below the diagonal represent binaries for which Ddisasm is faster than Ramblr.
The results can be found in Fig. 6 . Ddisasm is faster than Ramblr in all but 49 binaries in the CGC benchmarks. In particular, Ddisasm is on average 5.9 times faster than Ramblr .
IX. CONCLUSION
We have developed a new disassembler called Ddisasm that produces reassembleable assembly. In order to produce reassembleable assembly, Ddisasm combines novel static analyses and heuristics to determine how data is accessed and used. Ddisasm is implemented in Datalog. We show that Datalog is well suited to this task as it enables the specification of static analyses and heuristics in a compositional and declarative manner and it compiles them into a unified, parallel, and efficient executable.
Ddisasm is, to the best of our knowledge, the first disassembler for machine code implemented in Datalog. Our experiments show that Ddisasm is both more precise and faster than the state-of-the-art tools for reassembleable disassembly, and better handles large complex real-world programs. Ddisasm makes binary rewriting practical by enabling binary rewriting of real world programs compiled with a range of compilers and optimization levels with unprecedented speed and accuracy.
X. ACKNOWLEDGMENTS
This material is based upon work supported by the Office of Naval Research under contract No. N68335-17-C-0700. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Office of Naval Research.
