Parallel Binary Code Analysis by Meng, Xiaozhu et al.
Parallel Binary Code Analysis
XIAOZHU MENG, Rice University, USA
JONATHON M. ANDERSON, Rice University, USA
JOHN MELLOR-CRUMMEY, Rice University, USA
MARK W. KRENTEL, Rice University, USA
BARTON P. MILLER, University of Wisconsin-Madison, USA
SRÐAN MILAKOVIĆ, Rice University, USA
Binary code analysis is widely used to assess a program’s correctness, performance, and provenance. Binary
analysis applications often construct control flow graphs, analyze data flow, and use debugging information
to understand how machine code relates to source lines, inlined functions, and data types. To date, binary
analysis has been single-threaded, which is too slow for applications such as performance analysis and software
forensics, where it is becoming common to analyze binaries that are gigabytes in size and in large batches
that contain thousands of binaries.
This paper describes our design and implementation for accelerating the task of constructing control flow
graphs (CFGs) from binaries with multithreading. Existing research focuses on addressing challenging code
constructs encountered when constructing CFGs, including functions sharing code, jump table analysis,
non-returning functions, and tail calls. However, existing analyses do not consider the complex interactions
between concurrent analysis of shared code, making it difficult to extend existing serial algorithms to be
parallel. A systematic methodology to guide the design of parallel algorithms is essential. We abstract the task
of constructing CFGs as repeated applications of several core CFG operations for creating functions,
basic blocks, and edges. We then derive properties among CFG operations, including operation dependency,
commutativity, and monotonicity. These operation properties guide our design of a new parallel analysis for
constructing CFGs. We achieved as much as 25× speedup for constructing CFGs on 64 hardware threads.
Binary analysis applications are significantly accelerated with the new parallel analysis: we achieve 8× for a
performance analysis tool and 7× for a software forensic tool with 16 hardware threads.
1 INTRODUCTION
Binary code analysis is a foundational technique for a variety of applications, including performance
analysis [Adhianto et al. 2010; Ţăpuş et al. 2002; Miller et al. 1995], software correctness [Arnold
et al. 2007; Gu and Mellor-Crummey 2018], software security [Jacobson et al. 2014; v. d. Veen
et al. 2016; van der Veen et al. 2015], and software forensics [Meng et al. 2017; Rosenblum et al.
2011b]. Important binary code analysis capabilities include constructing control flow graphs (CFGs),
analyzing control flow and data flow properties, and extracting source line mappings and data types
from debugging information, when it is available. Traditionally, binary analysis applications are
single-threaded. However, recent trends in these applications call for improving the performance
of binary analysis applications.
In the field of performance analysis, it is becoming common to optimize the performance of
large software systems that compile into multi-gigabyte binaries. We have witnessed this trend
within software developed by national laboratories and popular machine learning frameworks
such as TensorFlow [Abadi et al. 2016]. The developers of these large software systems use the following
performance analysis workflow to optimize their code: (1) compile the source code to generate
the binary program, (2) measure the performance of the binary during execution, (3) attribute
measurements to the corresponding source code (via binary analysis), and (4) optimize the source
based on the performance results. These four steps are repeated until developers are satisfied with
their software’s performance.
Authors’ addresses: Xiaozhu Meng, Department of Computer Science, Rice University, Houston, TX, 77005, USA,
Xiaozhu.Meng@rice.edu; Jonathon M. Anderson, Department of Computer Science, Rice University, Houston, TX, 77005,
USA, jma14@rice.edu; John Mellor-Crummey, Department of Computer Science, Rice University, Houston, TX, 77005,
USA, johnmc@rice.edu; Mark W. Krentel, Department of Computer Science, Rice University, Houston, TX, 77005, USA,
krentel@rice.edu; Barton P. Miller, Computer Sciences Department, University of Wisconsin-Madison, Madison, WI, 53706,
USA, bart@cs.wisc.edu; Srđan Milaković, Department of Computer Science, Rice University, Houston, TX, 77005, USA,
sm108@rice.edu.
arXiv:2001.10621v3 [cs.PF] 16 May 2020
In this performance analysis cycle, binary analysis must be repeated after any source code change
because even small code changes can lead to dramatically different binaries, especially with C++
templates and aggressive compiler optimizations. In such a workflow, if the binary analysis in
step (3) is slow, it will reduce the throughput and effectiveness of performance analysis. Current
single-threaded binary code analysis takes too long to analyze such large binaries. It takes more
than 20 minutes to analyze a 7.7GiB shared library from TensorFlow, which would interrupt the
workflow of developers tuning the code for production.
In the field of software forensics, researchers have achieved great success in applying machine
learning to tasks including compiler identification [Rosenblum et al. 2011a] and authorship at-
tribution [Caliskan-Islam et al. 2018; Meng et al. 2017; Rosenblum et al. 2011b]. These machine
learning-based software forensics applications require large training sets to be effective, containing
hundreds to thousands of binaries. During the development of these forensic applications, the
developers typically repeat the following workflow: (1) design a set of binary code features, (2)
extract the features with binary analysis to construct a training set, (3) validate the accuracy of
a model trained using the new features. These three steps are repeated until the developers are
satisfied with the effectiveness of the binary code features.
While the training and tuning of machine learning models have traditionally been regarded as
the bottleneck of these software forensics applications, modern machine learning packages often
provide support for parallel training and tuning, using multithreading and even GPUs. In such
a scenario, a serial feature extraction step can be a limiting factor of the development cycle: for
example, the feature extraction step in compiler identification [Rosenblum et al. 2011a] may take
over 24 hours.
In this paper, we present our design and implementation of parallel binary code analysis to address
the speed requirements imposed by these applications. The core of this work is a new parallel
analysis for constructing control flow graphs (CFG construction), which constructs functions, basic
blocks, and edges between basic blocks. CFG construction is used in nearly every binary analysis
application, directly or indirectly.
Modern serial CFG construction algorithms focus on understanding complex machine code
generated by compilers [Di Federico et al. 2017; Meng and Miller 2016; Shoshitaishvili et al. 2016].
Complex code constructs such as non-returning functions, tail calls and jump tables play key roles in
understanding high-level programming constructs, making their analysis important for applications.
While function level parallelism is a natural starting point for parallel CFG construction, we must
address a range of complex issues:
• Functions may share code. Threads analyzing different functions may end up concurrently
analyzing shared code and require synchronization.
• Control flow graphs evolve during analysis. As a result, a parallel algorithm for CFG con-
struction needs to consider concurrent changes by others.
• Current binary code analysis is neither designed nor implemented with parallelism in mind.
Parallelization exposes the flaws in existing serial analysis for jump tables and tail call
identification.
• While protecting intricate data structures with mutual exclusion is a tempting way to guar-
antee correctness, the serialization this induces must be carefully evaluated for its impact on
parallelism and performance.
To systematically address these issues, we abstract CFG construction as repeated applications of
several primitive CFG construction operations. These operations include creation of CFG elements
such as functions, basic blocks, edges, modification to basic block ranges, and removing blocks and
edges. We derive operation properties, including dependencies, commutativity, and monotonicity,
and use this theoretical framework to reason about the correctness and performance of CFG construction
algorithms. This abstraction allows us to identify flaws in existing serial CFG construction algo-
rithms. Many of the flaws are caused by not considering the interactions between complex code
constructs. This methodology enables us to express parallelism as commutative operations and
focus our attention on addressing operation dependencies to improve performance. We then design
new algorithms and data structures for parallel CFG construction to address both correctness and
performance issues.
We implemented our new parallel CFG construction in the Dyninst binary analysis and instru-
mentation toolkit [Paradyn Project [n.d.]], a library widely used by researchers in performance
analysis, security, and software forensics, and evaluated the performance characteristics of our
parallel binary analysis with a number of large binaries, including a 7.7GiB shared library from
TensorFlow. We achieved as much as 25× speedup for constructing control flow graphs on 64
hardware threads, which significantly accelerates client tools that employ binary analysis. We then
showcase the benefits of parallel binary analysis with two applications. The hpcstruct utility in
HPCToolkit [Adhianto et al. 2010] is used to relate performance measurement back to source code;
we achieved 8× speedup for hpcstruct. BinFeat [Meng [n.d.]] is a tool for extracting binary code
features for software forensics, for which we achieved 7× speedup.
In summary, this work makes the following contributions:
(1) A set of CFG operations and operation properties that enable us to reason about the correctness and
performance of CFG construction algorithms.
(2) A new algorithm for parallel CFG construction that is derived from the requirements and
properties of CFG operations.
(3) An implementation of the new algorithm in Dyninst that can be used by other binary analysis
application developers.
(4) A demonstration of the effectiveness of our new parallel binary analysis with two binary analysis
applications: hpcstruct, where it significantly accelerates program structure recovery for
performance analysis, and BinFeat, where it significantly speeds up binary code feature extraction
for software forensics.
2 RELATED WORK
There is rich literature about constructing CFGs from binaries [Bardin et al. 2011; Di Federico
et al. 2017; Kinder and Kravchenko 2012; Kinder and Veith 2008; Schwarz et al. 2002]. A commonly
used approach is control flow traversal [Schwarz et al. 2002; Theiling 2000]. Starting from known
function entry points such as the ones found in the symbol table, it follows control flow transfers in
the program to discover code and identify additional function entry points for further analysis. We
discuss several challenging code constructs that must be addressed during control flow traversal
and representative binary analysis tools that implement control flow traversal.
2.1 Challenging Code Constructs
Functions sharing code: A common compiler optimization is to share binary code between
functions with common functionality, such as error handling code and stack tear-down code. This
construct is common in compiled code. We have observed this construct in glibc-2.29 (Released Jan.
2019) where common error handling code is shared by multiple syscall wrappers, and within code
compiled by the Intel Compiler Suite (ICC). This can also occur logically in functions with multiple
entry points: binary analysis tools typically represent such functions as multiple single-entry
functions that share code. Thus, Fortran functions with multiple programmer-specified entry points
(using the entry keyword), and binaries on Power 8 or newer (the ABI specifies that each function
has at least two entry points) lead to functions sharing code. To support code shared by multiple
functions, one can define a function as the basic blocks that are reachable from the function entry
by traversing only intra-procedural edges [Bernat and Miller 2012; Meng and Miller 2016].
Non-returning functions: Binary analysis tools often define a call fall-through edge, which
is a summary edge representing that the control flow at a function call will return to the call site.
However, a function call to a non-returning function will never return to its call site, so there should
be no call fall-through edge at such call sites. A wrongly created call fall-through edge can lead to
confusing control flow and cascading impacts on binary analysis applications.
The general idea of identifying non-returning functions is to match function names against
known non-returning functions such as exit and abort and to use an iterative analysis to identify
functions that always end in calls to non-returning functions. One example is a fixed point analysis
presented by Meng and Miller [Meng and Miller 2016]. Each function has a return status, with three
different values: UNSET, RETURN, and NORETURN. A function’s return status is initialized with
NORETURN if it is known to be a non-returning function; otherwise, the function’s return status is
UNSET. The three main components of the non-returning function analysis are: (1) a function’s return
status is set to RETURN if we find a return instruction; (2) if we encounter a call site calling to a
function with UNSET return status, we do not parse the call fall-through edge until the callee’s
return status is set to RETURN; (3) if there is a cyclic dependency between functions’ return statuses,
all functions in the cycle are non-returning.
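The three components above can be sketched as a small fixed-point computation. This is a minimal illustration under stated assumptions, not Dyninst's implementation: the input format (`paths_to_ret`, a per-function list of paths to a return instruction, each path naming the callees whose call fall-through edges must be parsed first) and the names `Status`, `KNOWN_NORETURN`, and `return_statuses` are hypothetical. The sketch also treats every function still UNSET at the fixed point as non-returning, which covers cyclic dependencies as well as functions that always end in non-returning calls.

```python
from enum import Enum


class Status(Enum):
    UNSET = "unset"
    RETURN = "return"
    NORETURN = "noreturn"


# Assumed seed list of known non-returning functions (matched by name).
KNOWN_NORETURN = {"exit", "abort"}


def return_statuses(paths_to_ret):
    """Compute return statuses for a hypothetical function summary.

    paths_to_ret maps a function name to a list of paths. Each path lists the
    callees whose call fall-through edges must be parsed before a return
    instruction is reached; an empty path is a directly reachable return.
    """
    status = {f: Status.NORETURN if f in KNOWN_NORETURN else Status.UNSET
              for f in paths_to_ret}
    changed = True
    while changed:
        changed = False
        for f, paths in paths_to_ret.items():
            if status[f] is not Status.UNSET:
                continue
            # (1) and (2): a return instruction is found once every call on
            # some path targets a callee already known to return.
            if any(all(status.get(c) is Status.RETURN for c in path)
                   for path in paths):
                status[f] = Status.RETURN
                changed = True
    # (3): anything still UNSET is blocked on callees that never became
    # RETURN, for example a cyclic dependency; mark it non-returning.
    return {f: Status.NORETURN if s is Status.UNSET else s
            for f, s in status.items()}


summary = {
    "exit": [],              # known non-returning
    "leaf": [[]],            # has a directly reachable return
    "wrapper": [["leaf"]],   # returns only after the call to leaf returns
    "die": [["exit"]],       # its only path to a return goes through exit
    "ping": [["pong"]],      # ping and pong wait on each other
    "pong": [["ping"]],
}
statuses = return_statuses(summary)
```

In this toy input, `leaf` and `wrapper` end up RETURN, while `die`, `ping`, and `pong` end up NORETURN.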
Jump table analysis: Compilers often emit indirect jumps for switch statements in the source
code. The targets of these indirect jumps are calculated based on jump table data in the binary.
Being able to statically resolve the control flow targets calculated through jump tables is critical
for complete control flow traversal. A common approach for resolving jump table targets is to
use backward slicing to identify the instructions that are involved in the target calculation and
construct a symbolic expression of the jump target to identify the actual jump targets [Di Federico
et al. 2017; Meng and Miller 2016; Shoshitaishvili et al. 2016; Williams-King et al. 2020].
Tail calls: A tail call [Clinger 1998] is a compiler optimization that uses a jump instruction at
the end of a function to target the entry point of another function; thus, not every branch should be
labeled as intra-procedural. Tail calls are often recognized through heuristics [Di Federico et al.
2017; Meng and Miller 2016], including (1) a branch to a known function entry is a tail call; (2)
a branch to a basic block that is reachable through only intra-procedural edges of the current
function is not a tail call; (3) if there is stack frame tear down before the branch, it is a tail call.
2.2 Binary Analysis Tools
Recent binary analysis tools address these challenging code constructs, including angr [Shoshi-
taishvili et al. 2016], Dyninst [Meng and Miller 2016], and rev.ng [Di Federico et al. 2017]. While
these tools are similar in how they address challenging code constructs, their software infrastructures
have distinct characteristics with regard to analysis speed.
Both angr and rev.ng first lift machine instructions to IR and then perform analysis on the
resulting IR. The first advantage of this approach is that the binary analysis is not architecture
specific and can be readily ported to a new architecture after the IR is supported on the new
architecture. For example, rev.ng uses QEMU to lift binary code to LLVM IR. Therefore, rev.ng supports
every architecture where QEMU is supported (more than 16 different architectures). Similarly,
angr uses Valgrind’s VEX IR. The second advantage is that lifting to IR facilitates the development
of complex data flow analysis such as points-to analysis and value set analysis. However, this
approach leads to significant performance slowdown for two reasons. First, the lifting process itself
is slow. Second, the number of assignments in the IR is significantly larger than the number of
machine instructions as one instruction may be lifted to multiple IR assignments, especially on
CISC architectures such as x86-64.
On the other hand, Dyninst directly operates with the binary. Dyninst’s instructionAPI provides
an architecture independent interface for querying instruction opcodes, instruction operands,
registers, and memory addressing modes. The CFG construction code inside Dyninst works with
this “bare-metal” instruction interface. The only exception is that when Dyninst resolves jump
tables, Dyninst lifts machine instructions to ROSE IR [Quinlan and Liao 2011]. However, since
lifting is applied to instructions that are involved in the jump table calculation found by backward
slicing, typically only a small portion of the binary is lifted.
As we will describe in Section 7, complex data flow analyses are not needed in our target
applications. Therefore, we implement our new parallel CFG construction algorithms in Dyninst to
achieve better performance.
3 NOTATION
While existing literature comprehensively describes how to address each challenging code construct,
a critical problem we encountered when designing a parallel CFG construction algorithm is the
complex interactions between these code constructs. Serial algorithms are designed with the
assumption that no concurrent modification is made to the CFG. To address this problem, we present
an abstraction of control flow graphs and a series of core operations on them. This abstraction
enables us to reason about interactions between different CFG construction operations.
Our abstraction builds upon the CFG definitions and operations designed for binary modifica-
tion [Bernat and Miller 2012], which mainly works with fully constructed CFGs. Our abstraction
instead focuses on the process of constructing CFGs.
Definitions: We define a CFG $G = \langle B, C, E, F \rangle$ to be a tuple of the following:
• $B$ is a set of address ranges $[s, e)$, representing basic blocks within the binary. Each of these
contains at most one control flow instruction, which if present is the final instruction within
the range, and has incoming control flow at only address $s$.
• $C$ is a set of candidate blocks $[t]$, representing addresses which are known to start basic
blocks but do not have known ending addresses yet.
• $E \subseteq \{(a \to b) : a \in B, b \in B \cup C\}$ is a set of directed edges between basic blocks, representing
possible control flow executions between blocks.
• $F \subseteq B \cup C$ is the set of function entry blocks.
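As a rough illustration of this tuple, the following Python sketch stores the four components as sets and checks the two subset constraints on E and F. The class name `CFG`, the encodings `(s, e)` for a basic block and `(t,)` for a candidate block, and the helper `initial_graph` are assumptions made for this example; they do not correspond to any tool's API.

```python
from dataclasses import dataclass, field


@dataclass
class CFG:
    blocks: set = field(default_factory=set)      # B: (start, end) half-open ranges
    candidates: set = field(default_factory=set)  # C: (start,) with unknown end
    edges: set = field(default_factory=set)       # E: (source, target) pairs
    entries: set = field(default_factory=set)     # F: function entry blocks

    def is_well_formed(self):
        """Check E ⊆ {(a → b) : a ∈ B, b ∈ B ∪ C} and F ⊆ B ∪ C."""
        nodes = self.blocks | self.candidates
        edges_ok = all(a in self.blocks and b in nodes for a, b in self.edges)
        return edges_ok and self.entries <= nodes


def initial_graph(entry_addresses):
    """A graph in which only candidate function entries are known."""
    f0 = {(t,) for t in entry_addresses}
    return CFG(candidates=set(f0), entries=set(f0))


g = initial_graph([0x400, 0x500])
ok_initially = g.is_well_formed()

# An edge whose source is not yet a basic block violates the constraint on E.
g.edges.add(((0x100, 0x104), (0x400,)))
ok_with_dangling_edge = g.is_well_formed()

# Once the source block exists, the graph is well formed again.
g.blocks.add((0x100, 0x104))
ok_after_adding_block = g.is_well_formed()
```

Keeping candidates separate from blocks mirrors the definition: an edge may target an address whose block end is not yet known, but it can only originate from a resolved block.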
Partial order: We utilize a partial order between control flow graphs, designed such that a larger
graph includes more control flow elements. We define $G_1 \preccurlyeq G_2$ if all of the following are true:
• The address ranges contained in $G_1$ are also contained by $G_2$. Formally, let $A_1$ and $A_2$ be the
addresses contained by the blocks in $B_1$ and $B_2$ respectively. Then we require $A_1 \subseteq A_2$.
• The explicit control flow present in $G_1$ is also present in $G_2$, regardless of adjustments to block
ranges. Formally, for every edge $(a = [s_a, e_a) \to b = [s_b, e_b))$ or $(a = [s_a, e_a) \to b = [s_b])$ in $E_1$,
one of the similar edges $([s'_a, e_a) \to [s_b, e'_b))$ and $([s'_a, e_a) \to [s_b])$ must be present in $E_2$.
Intuitively, $G_2$ may contain additional control flow edges that target addresses inside $a$ or $b$,
causing them to be split. The requirement here is that the end address of the source block $e_a$
and the start address of the target block $s_b$ are preserved under the partial order.
• The implicit control flow through a basic block in $G_1$ is preserved in $G_2$. Formally, for
every block $b = [s_0, e) \in B_1$ there is a sequence of blocks $[s_0, s_1), \ldots, [s_{n-1}, s_n) \in B_2$ with $s_n = e$ such that
$([s_i, s_{i+1}) \to [s_{i+1}, s_{i+2})) \in E_2$ for $i = 0, \ldots, n - 2$. This means that a block $b$ in $G_1$ can be split
into multiple smaller blocks in $G_2$ to incorporate other incoming control flow.
• Function entry labels in $G_1$ are preserved in $G_2$, regardless of range adjustments. Formally,
for every block $[s, e)$ or $[s]$ in $F_1$, there is a block starting at the same address, $[s, e')$ or $[s]$, in $F_2$.
CFG operations: To construct a CFG based on the underlying binary, we define a number of
core operations:
• Block End Resolution: Given a graph $G$ containing a candidate block $[t] \in C$, we define
$O_{BER}(G, [t])$ as $G$ with the candidate block $[t]$ replaced by an actual basic block starting at $t$
with a determined end address. There are three possible cases:
– Block splitting. If there is an existing block $b = [s, e) \in B$ such that $s < t < e$, then we have
to split $b$ into the basic blocks $[s, t)$ and $[t, e)$. Any incoming edges on $b$ are redirected to
$[s, t)$, while outgoing edges on $b$ and incoming edges on $[t]$ are moved to $[t, e)$.
– Early block ending. If there is an existing block $b = [s, e) \in B$ such that $t < s$ and the range
$[t, s)$ contains no control flow instructions, we replace $[t]$ with $[t, s)$ as in the first case and
append the edge $([t, s) \to [s, e))$.
– Linear parsing. If neither of the previous cases applies, let $e$ be the address directly after the
first control flow instruction following $t$. We replace $[t]$ with $[t, e)$ as in the first case.
• Direct Edge Creation: Given a block $a$ in a graph $G$, which ends with a direct control flow
instruction, we define $O_{DEC}(G, a)$ as $G$ with outgoing edges appended to $a$, based on the
control flow instruction within $a$ (if one exists). There are three cases:
– If $a$ terminates with an unconditional jump to address $t$, we append the edge $(a \to [t])$.
– If $a = [s, e)$ terminates with a conditional jump to address $t$, we append edges for the cases
where the condition is true $(a \to [t])$ and false $(a \to [e])$.
– If $a$ terminates with a function call instruction to address $t$, we append the edge $(a \to [t])$.
• Call Fall-Through Edge Creation: Given an edge $e = ([s, e) \to f)$ in a graph $G$ where $[s, e)$
contains a function call instruction and $f \in F$, we define $O_{CFEC}(G, e)$ as $G$ potentially with
the additional edge $([s, e) \to [e])$ summarizing the execution of the callee function. Correct
application of this operation depends on the non-returning function analysis used to identify
whether the target function can return or not.
• Indirect Edge Creation: Given a block $a$ in a graph $G$ which contains a jump to a dynamic
address, we define $O_{IEC}(G, a)$ as $G$ with the additional edges $(a \to [t_1]), \ldots, (a \to [t_n])$,
where $t_1, \ldots, t_n$ are target addresses determined statically. It is possible for this operation to
add no edges if the analysis used is insufficient to statically determine the possible targets.
• Function Entry Identification: Given an edge $e = (a \to b)$ in a graph $G$, we define $O_{FEI}(G, e)$
as $G$ with the block $b$ potentially labeled as a function entry. This operation is trivial if $e$ was
created by an explicit call instruction, but further heuristics are required to identify functions
that are reached only through optimized tail calls.
• Edge Removal: Given an edge $e = (a \to b)$ within a graph $G$, we define $O_{ER}(G, e)$ as $G$ with
the edge $e$ removed along with any blocks and edges that are no longer reachable from any
function entry point. Formally, let $B' \subseteq B$ and $C' \subseteq C$ be the sets of blocks and candidate
blocks in $G$ reachable from any block in $F$ without traversing $e$. We can then define
$$O_{ER}(G, e) = \langle B', C', E \cap \{(a' \to b') : a' \in B', b' \in B' \cup C'\} \setminus \{e\}, F \rangle.$$
Starting with the initial graph $G_0 = \langle \emptyset, F_0, \emptyset, F_0 \rangle$, where $F_0$ is the set of candidate function entry
blocks discovered via the binary’s symbol table and unwind information, the task of CFG construction
can be abstracted as repeated application of these operations. We denote $G_1, G_2, \ldots, G_{n-1}$ as
the intermediate results and $G_n$ as the final CFG.
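To make the three cases of Block End Resolution concrete, here is a minimal Python sketch on a toy model. It rests on several assumptions that are not part of the definitions above: the binary is reduced to `cf_ends`, a list of addresses directly after each control flow instruction; blocks are `(start, end)` tuples and candidates are `(start,)` tuples; function entry labels are ignored; and the splitting case adds the fall-through edge between the two halves explicitly, which the partial order's third condition requires but the case description leaves implicit. The function name `resolve_block_end` is hypothetical.

```python
def resolve_block_end(blocks, candidates, edges, t, cf_ends):
    """Toy Block End Resolution: replace candidate (t,) with a block starting at t.

    blocks: set of (start, end); candidates: set of (start,);
    edges: set of (source, target); cf_ends: addresses directly after each
    control flow instruction in the (hypothetical) binary.
    """
    blocks, edges = set(blocks), set(edges)
    candidates = set(candidates) - {(t,)}

    def redirect(old, new, incoming=True, outgoing=True):
        # Rewrite edge endpoints that refer to `old` so they refer to `new`.
        nonlocal edges
        edges = {(new if outgoing and a == old else a,
                  new if incoming and b == old else b) for a, b in edges}

    containing = [b for b in blocks if b[0] < t < b[1]]
    if containing:
        # Block splitting: [s, e) becomes [s, t) and [t, e).
        s, e = containing[0]
        blocks -= {(s, e)}
        blocks |= {(s, t), (t, e)}
        redirect((s, e), (s, t), outgoing=False)  # incoming edges stay on the head
        redirect((s, e), (t, e), incoming=False)  # outgoing edges move to the tail
        edges.add(((s, t), (t, e)))               # assumed fall-through edge
        new = (t, e)
    else:
        end = min(c for c in cf_ends if c > t)    # where linear parsing would stop
        following = [b for b in blocks if t < b[0] < end]
        if following:
            # Early block ending: stop at the next existing block.
            nxt = min(following)
            new = (t, nxt[0])
            edges.add((new, nxt))
        else:
            # Linear parsing: run to the end of the first control flow instruction.
            new = (t, end)
        blocks.add(new)
    redirect((t,), new, outgoing=False)           # edges into [t] now reach the block
    return blocks, candidates, edges


# Two candidates at 0 and 4, with the next control flow instruction ending at 8.
pred = (100, 104)
g = ({pred}, {(0,), (4,)}, {(pred, (0,)), (pred, (4,))})
cf_ends = [8, 104]
b1, c1, e1 = resolve_block_end(*g, 0, cf_ends)          # linear parsing
b2, c2, e2 = resolve_block_end(b1, c1, e1, 4, cf_ends)  # block splitting
```

In this example the first call creates the block `(0, 8)` by linear parsing, and the second call splits it into `(0, 4)` and `(4, 8)`. Resolving the candidate at 4 first and the candidate at 0 second reaches the same graph through the early block ending case.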
4 CFG OPERATION PROPERTIES
We present several important properties of the defined operations, assess existing serial algorithms
with these properties, and use these properties to define critical correctness and performance issues
for parallel CFG construction.
4.1 Properties
Operation dependencies: To correctly build the CFG, operations should be applied in an order
that satisfies the dependencies among them. We identify two types of dependencies:
• Applicability Dependency. We cannot apply operations to a graph element that has not
been discovered. For example, we must create an edge before we can resolve the target block
candidate of this edge.
• Non-returning Function Dependency. The correctness of $O_{CFEC}$ for creating call fall-through
edges depends on the operations applied to the callee functions to determine whether
the callee would return or not. If $O_{CFEC}$ is applied when the callee does not return, an
erroneous call fall-through edge would be added, leading to an incorrect CFG.
Operations subject to either type of dependency above must be applied in order. We classify
operations that are not constrained by any dependency into three categories:
• Commutative operations: The operations $O_{BER}$ and $O_{DEC}$ commute with themselves and
with each other, allowing us to choose an order convenient for processing. To establish this,
we discuss the following three cases:
– Given two candidate blocks $[a]$ and $[b]$ where $a < b$, we have $O_{BER}(O_{BER}(G, [a]), [b]) =
O_{BER}(O_{BER}(G, [b]), [a])$. First, if there is a control flow instruction ending at address $c$
where $a < c < b$, candidate block $[a]$ will end before $c$ while candidate block $[b]$ will
end after $c$. These two operations will act on non-overlapping address ranges and be
independent, which gives us commutativity. Second, if a control flow instruction ends at $c$
where $a < b < c$ and it is the first control flow instruction following $b$, we have
\begin{align*}
O_{BER}(O_{BER}(G, [a]), [b]) &= O_{BER}(G \cup \{[a, c)\}, [b]) && \text{(Linear parsing)} \\
&= G \cup \{[a, b), [b, c)\} && \text{(Block splitting)} \\
&= O_{BER}(G \cup \{[b, c)\}, [a]) && \text{(Early block ending)} \\
&= O_{BER}(O_{BER}(G, [b]), [a]). && \text{(Linear parsing)}
\end{align*}
Thus we also have commutativity in this case.
– Given two blocks $a$ and $b$, we have $O_{DEC}(O_{DEC}(G, a), b) = O_{DEC}(O_{DEC}(G, b), a)$. This is
because $O_{DEC}(G, a)$ only considers the terminating control flow instructions within the
block $a$.
– We have $O_{BER}(O_{DEC}(G, [s, e)), [t]) = O_{DEC}(O_{BER}(G, [t]), [s, e))$, when given a candidate
block $[t]$ and a block $[s, e)$. We observe that $O_{DEC}(G, [s, e))$ depends on only the terminating
control flow instruction ending at $e$ and will generate only new candidate blocks while
$O_{BER}(G, [t])$ does not depend on candidate blocks. Therefore these two operations are
independent and thus commutative.
The operation $O_{ER}$ also commutes with itself, allowing us to choose an order convenient
for processing. The graph $O_{ER}(O_{ER}(G, e_1), e_2) = O_{ER}(O_{ER}(G, e_2), e_1)$ will contain no blocks
reachable only through $e_1$ and $e_2$, which gives us the commutativity property.
• Monotonic ordering property: While $O_{IEC}$ does not commute trivially with any other
operation, we can still establish a weaker property. Let $O_{IEC}(G, a)$ be an indirect edge creation
operation and $O_x$ be an $O_{BER}$ or $O_{DEC}$ operation. We have $O_x(O_{IEC}(G, a)) \preccurlyeq O_{IEC}(O_x(G), a)$,
since the edges added by $O_x$ can at most add control flow paths preceding $a$ and thus increase
the set of target addresses. When our goal is to achieve a maximal CFG, this allows us to
reorder $O_{IEC}$ after any $O_{BER}$ and $O_{DEC}$ operations without decreasing the final result.
• Non-reorderable operations: The operations $O_{CFEC}$ and $O_{FEI}$ do not always commute,
nor do they satisfy the ordering property above in all cases. Both of these operations use
implementation-specific analyses: non-returning function analysis for $O_{CFEC}$ and tail call
identification heuristics for $O_{FEI}$, both of which at times require inspection of large portions
of the graph. Because of the sensitivity of these operations, we are cautious to apply these
operations only after the considered subgraph has stabilized.
    A:                B:
      ...               ...
      leaveq            mov %rsi, 1
      jmp 0x400         jmp 0x400
Listing 1. An example showing inconsistent results in the tail call heuristics used by Dyninst.
4.2 Serial Algorithm Assessment
We compare the serial algorithms used by angr [Shoshitaishvili et al. 2016], Dyninst [Meng and
Miller 2016], and rev.ng [Di Federico et al. 2017] using the defined notation and operations.
• Dyninst and angr’s CFG construction can be characterized with an increasing expression:
$G_0 \preccurlyeq G_1 \preccurlyeq G_2 \cdots \preccurlyeq G_n$. This increasing pattern has the advantage of not performing
redundant work of adding and then removing graph elements. On the other hand, rev.ng has
an additional step to clean candidate function entries after this increasing phase.
We observe that this cleaning step can address issues caused by non-commuting operations
such as tail call identification. Listing 1 is an example where Dyninst will give inconsistent
results depending on the analysis order. In this example, functions A and B branch to the
same address. If A is analyzed first, because leaveq tears down the stack frame, Dyninst will
treat the branch in A as a tail call and create a new function at the branch target; later, when
Dyninst analyzes B, it finds that B branches to a known function entry, so the branch in B is
also a tail call. In this case, function B will not include the block at 0x400. On the other hand, if B
is analyzed first, because there is no stack frame tear down before the branch in B, Dyninst
will not treat the branch as a tail call, and the block at 0x400 will be part of B. Therefore,
the function boundary of B is determined by the order of analysis. Note that without other
context, it is equally valid to conclude either “A and B both tail call to 0x400” or “A and B
share the block at 0x400”. With a cleaning step at the end, we have the opportunity to generate a
consistent answer.
• The jump table analysis implemented in tools does not necessarily satisfy the monotonicity
property we defined for $O_{IEC}$. The root cause of this issue is imperfect jump table analysis
where jump table targets can be over-approximated. Suppose we have $O_{IEC}(G, b_1)$
and $O_{IEC}(G, b_2)$. Due to imperfect jump table analysis, $O_{IEC}(G, b_1)$ generates an
over-approximated set of jump targets, resulting in invalid outgoing edges. Such additional but
confusing control flow may cause $O_{IEC}(G, b_2)$ to fail, leading to an empty set of targets. However,
if $O_{IEC}(G, b_2)$ is performed first, then we may get the correct non-empty set of jump targets
for $b_2$. We have observed this problem in Dyninst’s jump table analysis. While rev.ng and
angr both provide detailed descriptions on how they resolve jump tables, neither is able to
guarantee no over-approximation of the jump targets.
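The order dependence described for Listing 1 can be reproduced with a few lines of Python. The sketch below is a deliberately reduced model, not Dyninst's code: each function is summarized by a hypothetical pair (branch target, whether the stack frame is torn down before the branch), and only two of the tail call heuristics from Section 2.1 are modeled, namely a branch to a known function entry and a stack frame tear down before the branch.

```python
def analyze(order, funcs, known_entries=()):
    """Classify each function's final branch as a tail call or not.

    funcs maps a function name to (branch_target, tears_down_frame).
    The set of known function entries grows as tail calls are identified,
    which is what makes the result depend on the analysis order.
    """
    entries = set(known_entries)
    tail_call = {}
    for name in order:
        target, tears_down_frame = funcs[name]
        # Branch to a known function entry, or frame tear down before the branch.
        is_tail = target in entries or tears_down_frame
        if is_tail:
            entries.add(target)  # the branch target becomes a function entry
        tail_call[name] = is_tail
    return tail_call


# Listing 1: A runs leaveq before jmp 0x400, B does not tear down its frame.
funcs = {"A": (0x400, True), "B": (0x400, False)}
a_first = analyze(["A", "B"], funcs)
b_first = analyze(["B", "A"], funcs)
```

With A analyzed first, both branches are classified as tail calls; with B analyzed first, the branch in B is not, so the block at 0x400 is attributed to B. A final cleaning pass that revisits such decisions once all function entries are known is one way to make the two orders agree.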
4.3 Challenges Towards Parallelism
Besides the need for a cleaning step after the increasing CFG construction and the need to address
jump table over-approximation, we further identify three issues that must be addressed to achieve
effective parallel analysis.
• Commutative operations still need careful synchronization. Suppose we have two direct edges
$e_1$ and $e_2$ that have the same target address and we perform $O_{DEC}(G, e_1)$ and $O_{DEC}(G, e_2)$
concurrently. Due to commutativity, $O_{DEC}(O_{DEC}(G, e_1), e_2) = O_{DEC}(O_{DEC}(G, e_2), e_1)$. However,
the operation performed first will create the candidate block, making the second operation
effectively the identity function. This is trivial to maintain for serial algorithms, but careful
synchronization is necessary to maintain this uniqueness property in a parallel setting.
• Non-returning function dependencies between operations can also lead to ineffective parallelism.
In a call chain where $F_1$ calls $F_2$, $F_2$ calls $F_3$, $\ldots$, and $F_{n-1}$ calls $F_n$, an $O_{CFEC}$ operation
in $F_1$ may need to wait for operations in $F_2$ to complete, which may need to wait for operations
in $F_3$, and so on. This effect causes undesirable serialization during the analysis.
• The monotonic ordering property for operation $O_{IEC}(G, a)$ indicates that we might be able to
find more control flow targets if it is applied after other edge creation operations. However,
deferring $O_{IEC}(G, a)$ can exacerbate the issue of non-returning function dependencies, as
this will delay the discovery of returns that are only reachable through the indirect jump.
5 PARALLEL CFG CONSTRUCTION
In Section 4, we have established that commutative operations can be performed in any order
without impacting the final results. This is the foundation for parallel CFG construction. Section 5.1
describes the stages of our parallel analysis. Section 5.2 presents five invariants for supporting
concurrent CFG operations. Section 5.3 discusses parallel control flow traversal. Finally, Section 5.4
discusses the parallel finalization step that is needed to obtain a correct CFG.
5.1 Stages in Parallel CFG Construction
Our parallel analysis can be characterized with the expression G0 ≼ G1 ≼ G2 · · · ≼ Gm ≽ Gm+1 ≽
· · · ≽ Gn . It contains a CFG expansion phase to discover and add new graph elements, followed by
a CFG correction phase to remove incorrect graph components.
Listing 2 describes the three main stages of our parallel analysis. The first stage initializes data
structures for analyzing functions defined in the symbol table (Line 1). It is necessary to parallelize
this step as we have seen large binaries containing millions of functions. The second stage is
the increasing construction phase. We perform control flow traversal for initialized functions in
parallel, during which we may discover more functions. We repeatedly apply control flow traversal
until there are no more functions to analyze (Lines 2-6). The details of control flow traversal are
presented in Section 5.3. The final stage finalizes the CFG (Line 7). This stage includes cleaning
control flow edges and blocks created by over-approximated jump tables, cleaning inconsistent
tail call identification results, and determining which basic blocks belong to which function by
traversing intra-procedural edges from function entry blocks.
5.2 Control Flow Traversal Invariants
We use Figure 1 to illustrate how our five invariants ensure that threads correctly perform concurrent
CFG operations.
Invariant 1: Block Creation. There is at most one basic block starting at any given address.
This invariant means that if threads branch to the same target concurrently, one and only one thread
should create the block and make the block visible to other threads. Maintaining this invariant
requires efficient concurrent data structures that synchronize threads branching to the same target,
while allowing threads branching to different targets to proceed independently.
1 funcs = InitFunctions() -- Done in parallel
2 while funcs != ∅
3   more_funcs = ∅
4   parallel for f in funcs
5     more_funcs = more_funcs ∪ ControlFlowTraversal(f)
6   funcs = more_funcs
7 CFGFinalization() -- Done in parallel
Listing 2. Stages of our parallel binary analysis.
[Figure 1, panels (a) to (d): images not reproduced; panel captions follow.]
(a) T1 and T2 branch to two different addresses. T3, T4, and T5 branch to the same target.
(b) T1, T2, and T4 create new basic blocks. T2 first reaches the block end and creates new control
flow edges. T1 then reaches the block end and is going to split blocks.
(c) T1 finishes splitting blocks. T4 reaches the block end and is going to split blocks.
(d) T4 finishes splitting blocks, which involves moving edges.
Fig. 1. An example of five threads working on a common area of code. Solid edges represent the progress of
threads. Bold solid edges represent actions about to take place. Dashed edges represent control flow edges in the
CFG.
In Figure 1a, threads T3, T4 and T5 branch to the same address. According to this invariant, only
one thread should create a new basic block. As shown in Figure 1b, T4 creates a new basic block B3.
T3 and T5 do not create any new basic blocks and leave the common code area to work with other
code. Independently, T1 creates basic block B1 and T2 creates basic block B2.
Invariant 2: Block End. There is at most one basic block ending at any given address. A naïve
implementation of this invariant is to let a thread check whether a block exists at its current working
address. If one exists, the working thread can end its block. However, this implementation
means that there will be a block start lookup after decoding each instruction. This would create
a performance hotspot on the concurrent data structure used for Invariant 1. Our design defers
this check until the working thread reaches a control flow instruction. In this way, we reduce the
frequency of global concurrent data structure lookups from once per instruction to once per control
flow instruction.
As shown in Figure 1b, threads T1, T2, and T4 independently parse their blocks until they reach
the indirect jump instruction. Based on this invariant, only one thread should register the block
end address, which is T2 in this example.
Invariant 3: Edge Creation. The thread that registers a block’s end is responsible for creating
out-going control flow edges from that block. This invariant ensures that no redundant control
flow edges are created and jump table analysis for a particular indirect jump is always performed
by one thread. This also reduces unnecessary block start lookups by avoiding redundant edges.
As shown in Figure 1b, because thread T2 registers the block end, T2 continues to perform control
flow analysis to resolve the indirect jump targets and create control flow edges. T2 then leaves the
common code and continues to work with other code.
Invariant 4: Block Split. The threads that reach a block end but do not register the block
end need to split blocks. Suppose we have blocks B1[x1, y), B2[x2, y), ..., Bn[xn, y) created
by n threads, where x1 < x2 < ... < xn < y. The result of block splitting should be
B1[x1, x2), B2[x2, x3), ..., Bn[xn, y), with a fall-through edge between each pair of adjacent basic
blocks. It is inefficient to wait for all relevant blocks before performing splitting, so we present the
following eager block split algorithm.
Based on Invariant 2 (block end), only one block Bi[xi, y) will register its end at y. When another
block Bj[xj, y) reaches y, the working thread can look up Bi as the registered block. Depending on
the relationship between xi and xj, we have two cases:
• If xi > xj, Bj is split into [xj, xi) while Bi is untouched. We then register Bj at block end
address xi, which triggers a new iteration of block split if another block has already
registered its block end at xi. As shown in Figure 1c, T1 splits block B1, registers B1 as ending at
0xA, and then leaves the common code.
• If xi < xj, Bi is split into [xi, xj) while Bj is untouched. We then replace Bi with Bj for block
end address y, register Bi for block end address xj, and move out-going edges from Bi to Bj.
Similar to the first case, registering Bi at xj may recursively require another block split. As
shown in Figure 1d, T4 splits B2 and moves control flow edges from B2 to B3.
For both cases, each iteration of the block split algorithm ends with a smaller block end address.
Therefore, our block split algorithm is guaranteed to converge.
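The two cases of the eager block split can be illustrated with a small serial model. This is only a sketch under stated assumptions, not Dyninst's code: the names Block, EndMap, and registerEnd are hypothetical, edge movement is reduced to a boolean flag, and the per-entry locking that a concurrent implementation needs is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified model of a basic block covering [start, end).
struct Block {
    std::uint64_t start;
    std::uint64_t end;
    bool ownsEdges;  // true if the out-going edges at the original end y hang off this block
};

// Maps a block end address to the single block registered there (Invariant 2).
typedef std::map<std::uint64_t, Block*> EndMap;

// Registers b at its end address. On a conflict, splits eagerly and retries with
// whichever block now ends at a strictly smaller address.
// Returns the number of iterations, to make the convergence argument visible.
int registerEnd(EndMap& ends, Block* b) {
    int iterations = 0;
    while (true) {
        ++iterations;
        EndMap::iterator it = ends.find(b->end);
        if (it == ends.end()) {          // first block to arrive at this end address
            ends[b->end] = b;
            return iterations;
        }
        Block* reg = it->second;         // Bi, the block already registered here
        if (reg == b || reg->start == b->start) {
            return iterations;           // same block; excluded by Invariant 1 otherwise
        }
        if (reg->start > b->start) {
            // Case xi > xj: shrink the arriving block Bj to [xj, xi); Bi is untouched.
            b->end = reg->start;
        } else {
            // Case xi < xj: Bj takes over the end address and the edges,
            // and Bi shrinks to [xi, xj).
            it->second = b;
            b->ownsEdges = reg->ownsEdges;
            reg->ownsEdges = false;
            reg->end = b->start;
            b = reg;                     // re-register Bi at its new, smaller end
        }
    }
}
```

Replaying the order of Figure 1 (T2 registers B2 first, then T1 arrives with B1, then T4 with B3) with a hypothetical end address of 0x10 leaves B1 = [0x4, 0xA), B2 = [0xA, 0xD), and B3 = [0xD, 0x10), with the edge flag moved from B2 to B3. Every retry works on a block whose end address is strictly smaller than before, which is the convergence argument given above.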
Invariant 5: Function Creation. There is at most one function starting at any given address.
This invariant has similar properties and requirements to Invariant 1 for creating blocks.
These five invariants ensure that commutative operations can be safely reordered and performed
concurrently, and the relative speed of threads will not impact the final results.
5.3 Parallel Control Flow Traversal
Listing 3 presents the algorithm for control flow traversal. Coupled with the invariants defined in
Section 5.2, control flow traversal can be performed in parallel.
The traversal is repeated until there are no more unanalyzed basic blocks (Line 2). For each
unanalyzed block b, we use routine linearParsing to decode instructions until a control flow
transfer instruction is encountered (Line 4). Modifications to Dyninst’s instruction decoding code
add thread-safety to support this.
Routine registerBlockEnd follows invariant 2 (block end) to register the block end (Line 5).
Only the thread that successfully registers the block end will see a non-empty set of control flow
edges returned, following invariant 3 (edge creation). All other threads reaching the same block end
will see an empty set of edges and will follow invariant 4 (block split) to split the blocks (Line 7).
The thread that creates the control flow edges will proceed to traverse the edges (Lines 8-12). If
we encounter a function call, we may need to create a new function, following invariant 5 (function
creation) at Line 10. If we encounter a call fall-through summary edge or return edge, we process
non-returning functions (Line 11). If we encounter other types of edges, such as indirect, direct, or
conditional branches, we create new basic blocks based on invariant 1 (block creation) at Line 12.
Handling non-returning functions: In procedure processCall, we use the non-returning
function analysis presented by Meng and Miller [Meng and Miller 2016]. To address the non-
returning function dependency between CFG operations, we improve the analysis to eagerly notify
its callers once a function’s return status is set to RETURN. This improvement works well in practice
because large functions may contain multiple return instructions. As soon as we encounter one of
a function’s return instructions, we know this function is RETURN and we can enable the OCFEC
operation in its caller without waiting for analysis of the callee to finish.
Jump table analysis. We address two issues raised in Section 4 about jump table analysis. First,
jump table analysis (OIEC) in Dyninst does not satisfy the monotonic ordering property. We identify
that when Dyninst encounters instructions or path conditions that it cannot analyze, Dyninst will
fail to analyze the jump table and generate an empty set of control flow targets. This issue can be
addressed by taking the union of the targets discovered along different paths, essentially ignoring
instructions or path conditions that fail analysis. In this way, jump table targets identified along
valid control flow paths can still be propagated to the indirect jump, and the analysis can generate
a non-empty set of control flow targets. While this strategy makes the jump table analysis in Dyninst
satisfy the monotonic ordering property, it can over-approximate jump table sizes and lead to
bogus control flow edges. We will introduce a cleaning strategy in the CFG finalization stage to
remove bogus control flow edges.
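The difference between failing on any unanalyzable path and taking the union over analyzable paths can be shown with a deliberately simplified model. The types and function names below (PathResult, strictTargets, unionTargets) are hypothetical and do not correspond to Dyninst's interfaces; the sketch assumes each path either yields a set of targets or is flagged as unanalyzable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical per-path outcome of jump table analysis.
struct PathResult {
    bool analyzable;                   // false if the path hit an unsupported instruction or condition
    std::set<std::uint64_t> targets;   // jump targets discovered along this path
};

// Strict variant: one unanalyzable path makes the whole analysis fail (empty result).
std::set<std::uint64_t> strictTargets(const std::vector<PathResult>& paths) {
    std::set<std::uint64_t> out;
    for (std::size_t i = 0; i < paths.size(); ++i) {
        if (!paths[i].analyzable) return std::set<std::uint64_t>();
        out.insert(paths[i].targets.begin(), paths[i].targets.end());
    }
    return out;
}

// Union variant: unanalyzable paths are skipped, so targets from valid paths survive.
std::set<std::uint64_t> unionTargets(const std::vector<PathResult>& paths) {
    std::set<std::uint64_t> out;
    for (std::size_t i = 0; i < paths.size(); ++i) {
        if (paths[i].analyzable) {
            out.insert(paths[i].targets.begin(), paths[i].targets.end());
        }
    }
    return out;
}
```

In this model, adding a path can only grow the result of unionTargets, which is the monotonic behavior the text asks for, whereas adding an unanalyzable path collapses the result of strictTargets to the empty set.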
Second, the monotonic ordering property specifies that we can get a larger graph if we delay
jump table analysis as much as possible, but this might delay the finding of return instructions
and hurt parallelism due to non-returning function dependencies. We balance these two factors by
ordering jump table analysis after the analysis of direct control flow edges in this function, but
before call fall-through edges when the callee does not have a known return status. In addition,
we repeat the analysis of a jump table after more control flow paths are created within the same
function. This fixed-point analysis of jump tables allows us to find most of the targets early in the
analysis and gradually converge to a complete set of targets.
5.4 CFG Finalization
The goal of the CFG finalization stage is to remove wrong CFG elements and determine function
boundaries. No new CFG elements are added.
The first step is jump table finalization, where we remove wrong indirect control flow edges.
We find that over-approximation of jump targets is caused primarily by over-approximation of
jump table sizes. We can mitigate this problem by leveraging an observation that compilers do
not emit overlapping jump tables [Williams-King et al. 2020]. Therefore, if the analysis of a jump
table overflows into another jump table, we can detect over-approximation and apply edge removal
operations OER to remove wrong edges and cascading dangling blocks. We make two observations
about this strategy. First, we have established in Section 4 that edge removal operations commute.
Therefore, it is safe to perform this mitigation strategy in parallel. Second, this strategy cannot be
used during the parallel control flow traversal step. This is because when we analyze a jump table,
we do not know the exact locations of all jump tables in the binary. For this reason, we delay this
mitigation of over-approximation until the CFG finalization phase.
1 more_func = ∅
2 while f.hasMoreBlocks()
3   b = f.nextBlock()
4   linearParsing(b)
5   edges = registerBlockEnd()
6   if edges == ∅
7     splitBlock(b)
8   for e in edges
9     switch e.type()
10      when 'call', 'tailcall' then more_func ∪= processCall()
11      when 'call-fallthrough', 'ret' then more_func ∪= processNonRetFunc()
12      when 'other' then createNewBasicBlock(f)
13 return more_func
Listing 3. The algorithm of control flow traversal.
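The overlap check behind jump table finalization can be sketched as follows. This is an illustrative model only: the JumpTable record and trimOverlappingTables are hypothetical names, and the sketch assumes that every discovered table has a distinct start address, a known non-zero entry size, and an entry count that may be too large. Each dropped entry stands for one edge that an OER operation would remove.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one discovered jump table.
struct JumpTable {
    std::uint64_t start;      // address of the first table entry
    std::uint64_t entrySize;  // bytes per entry, assumed non-zero
    std::size_t entries;      // entry count, possibly over-approximated
};

static bool byStart(const JumpTable& a, const JumpTable& b) {
    return a.start < b.start;
}

// Trims every table that runs into the next table's start address.
// Returns the total number of entries dropped.
std::size_t trimOverlappingTables(std::vector<JumpTable>& tables) {
    std::sort(tables.begin(), tables.end(), byStart);
    std::size_t dropped = 0;
    for (std::size_t i = 0; i + 1 < tables.size(); ++i) {
        JumpTable& t = tables[i];
        // Number of whole entries that fit before the next table begins.
        std::uint64_t room = (tables[i + 1].start - t.start) / t.entrySize;
        if (t.entries > room) {
            dropped += t.entries - static_cast<std::size_t>(room);
            t.entries = static_cast<std::size_t>(room);
        }
    }
    return dropped;
}
```

After sorting, each iteration reads only the start address of the following table, which trimming never changes, so the per-table work is independent. This matches the observation above that the removals can be performed in parallel once all table locations are known.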
We then address wrong tail call edges and determine function boundaries. We handle this with
an iterative parallel control flow graph search. Starting from function entries, we add blocks to the
boundary of a function by traversing intra-procedural edges. After obtaining the temporary function
boundaries, we apply three rules, in order, to correct tail call results:
(1) If a branch is marked as not a tail call, but the edge target has a CALL incoming edge, we
correct this edge to be a tail call.
(2) If a branch is marked as a tail call, but the branch target is within the current function
boundary, we correct this edge to be not a tail call.
(3) If a branch is currently a tail call, but the current edge is the only incoming edge of the
edge target, we treat this as not a tail call. This is generally caused by outlined code blocks.
After correcting tail calls, we re-perform the function boundary graph search and the tail call
correction procedure. We flip the determination of tail call at most once for each edge, ensuring
convergence.
Finally, we remove functions that do not have incoming inter-procedural edges.
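The three tail call correction rules can be written down as a small decision function. The sketch below is an interpretation rather than the paper's implementation: BranchInfo and correctTailCall are hypothetical names, the facts about each branch are assumed to be precomputed from the temporary function boundaries, and "in order" is read as "the first matching rule decides".

```cpp
#include <cassert>

// Hypothetical summary of one branch edge after temporary boundaries are known.
struct BranchInfo {
    bool isTailCall;         // current classification of the edge
    bool targetHasCallEdge;  // the target also has an incoming CALL edge
    bool targetInFunction;   // the target lies within the current function boundary
    bool soleIncomingEdge;   // this edge is the target's only incoming edge
    bool flipped;            // the classification has already been changed once
};

// Applies the three rules in order (first match wins).
// Returns true if the classification was changed.
bool correctTailCall(BranchInfo& e) {
    if (e.flipped) return false;         // at most one flip per edge ensures convergence
    bool corrected = e.isTailCall;
    if (!e.isTailCall && e.targetHasCallEdge) {
        corrected = true;                // rule 1
    } else if (e.isTailCall && e.targetInFunction) {
        corrected = false;               // rule 2
    } else if (e.isTailCall && e.soleIncomingEdge) {
        corrected = false;               // rule 3, typically outlined code
    }
    if (corrected == e.isTailCall) return false;
    e.isTailCall = corrected;
    e.flipped = true;
    return true;
}
```

A caller would rerun the function boundary search whenever any edge reports a change. Because each edge can flip only once, the number of such rounds is bounded by the number of branch edges.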
6 IMPLEMENTATION EXPERIENCES
We implemented our new parallel CFG construction algorithms in Dyninst. Careful implementation
that follows our design is crucial for correctness and high performance. We present several code
examples and lessons we learned in our work.
6.1 Sample Implementations of Invariants
In Section 5.2 we presented five invariants for parallel control flow traversal. An efficient imple-
mentation of these is the foundation for scalable parallel binary code analysis.
Listing 4 is a code example of our implementation for invariant 1 (block creation). This code
example can be easily adapted to implement invariant 5 (function creation). Recall that two re-
quirements for invariant 1 are (1) threads that branch to the same address should be synchronized
and only one thread should create a new block, and (2) threads that branch to different addresses
can make progress independently. Our implementation uses the concurrent hash map provided
by Intel’s Threading Building Blocks library [Intel [n.d.]] to fulfill these two requirements, which
provides entry-level reader-writer locks. The insert method of concurrent_hash_map ensures
that only one of the concurrent insertions with the same key will succeed (Line 4). Therefore,
we can use the return value of insert to determine whether the current thread has successfully
created a block and should continue analysis of the block (Line 6). Threads that see a false return
value know that another thread has created the block and can move on to other work (Lines 8-9).
1 tbb::concurrent_hash_map<Address, Block*> blocks;
2 bool attemptToCreateBlock(Address a) {
3   Block* b = new Block(a);
4   if (blocks.insert({a, b})) {
5     // Successfully registered the new block.
6     return true;
7   } else { // Block already exists.
8     delete b;
9     return false;
10   }
11 }
Listing 4. Example implementation of invariant 1 (block creation).
1 tbb::concurrent_hash_map<Address, Block*> blocks_end;
2 bool blockEnd(Block* b) {
3   tbb::concurrent_hash_map<Address, Block*>::accessor a;
4   if (blocks_end.insert(a, {b->end(), b})) {
5     // Block end registered with b as its block; continue to create edges.
6     AnalyzeCFEdges(b);
7     return true;
8   } else {
9     // a->second references the block already registered at this end address.
10     a->second = splitBlock(b, a->second);
11     return false;
12   }
13 }
Listing 5. Example implementation for invariant 2 (block end), invariant 3 (edge creation), and invariant 4
(block split).
Listing 5 is a code example showing how invariant 2 (block end), invariant 3 (edge creation), and
invariant 4 (block split) fit together. concurrent_hash_map exposes the entry-level reader-writer
locks via an “accessor” semantic. We can obtain an “accessor” for the existing entry in the table
(inserting one if requested and not already present). The accessor acts as a read or write lock on the
entry, and other threads that are trying to obtain a conflicting accessor will wait until the holding
thread releases its own accessor. Line 4 ensures only one block is registered for a block end address,
enforcing invariant 2. The accessor ensures that edge creation (Line 6) and block splitting (Line
10) are mutually exclusive. This mutual exclusion guarantees that control flow edges will not be
created while being moved. Note that invariants 3 and 4 do not require mutual exclusion, and it is
possible to implement them with finer grained synchronization. Our implementation uses mutual
exclusion for simplicity and our performance profiling has not shown this mutual exclusion to be a
performance bottleneck.
6.2 Multi-Keyed Parallel Symbol Table
In Section 5.1, we described that it is necessary to have a parallel symbol table as a large binary may
contain millions of symbols. Dyninst’s symbol table supports lookups by any of its four properties:
byte offset, mangled name, “pretty” human-readable name and demangled “typed” name. The
original implementation used a template class from Boost [Boost Project [n.d.]] to implement this,
a very customizable structure called a multi_index_container. Since the Boost implementation
is not thread-safe and contention for its mutex lock became a notable bottleneck, we redesigned
the structure for concurrency as shown in Listing 6.
class Symtab::indexed_symbols {
  concurrent_hash_map<Symbol*, Offset> master;
  concurrent_hash_map<Offset, vector<Symbol*>> byOffset;
  concurrent_hash_map<std::string, vector<Symbol*>> byMangledName;
  concurrent_hash_map<std::string, vector<Symbol*>> byPrettyName;
  concurrent_hash_map<std::string, vector<Symbol*>> byTypedName;
  bool insert(Symbol* s) {
    accessor a;
    if (!master.insert(a, {s, s->getOffset()}))
      return false; // Already inserted, no need to continue
    { accessor a1;
      byOffset.insert(a1, s->getOffset());
      a1->second.push_back(s); }
    { accessor a2;
      byMangledName.insert(a2, s->getMangledName());
      a2->second.push_back(s); }
    // ... etc. ...
    return true;
  }
};
Listing 6. Implementation example for a thread-safe efficient map with multiple keys, discussed in Section 6.2.
The key insight is that no lookups occur in parallel with an insertion or modification, so syn-
chronization is only needed during writes. Two writes only conflict if the Symbol they are working
with is the same, so we use the entry-level lock on the master table to mediate between threads.
The thread which inserts on the master table proceeds to update the corresponding entries in the
by* tables, retaining its lock to ensure that any other modifications to the collective entries occur
in a total order. Once all modifications are complete, later lookups are able to use the by* tables
directly, giving the same semantics as the original structure.
6.3 Performance Improvements
We summarize two implementation lessons that are beneficial for improving performance, starting
with replacing parallel loops with task parallelism. As described in Section 5, we use a parallel for
loop to perform parallel control flow traversal and collect new functions to analyze. The problem
with this implementation is that analysis of newly collected functions will not start until all existing
functions have been analyzed. This can cause significant idleness when the analysis of functions is
imbalanced. To address this issue, our improved implementation uses OpenMP tasks as the parallel
programming model and we launch a new task as soon as we discover a new function to analyze.
The second lesson is to use a thread-local cache to reduce redundant calculations without
incurring thread synchronization overhead. For invariant 2 (block end) discussed in Section 5.2,
we let each thread parse its blocks without any synchronization until reaching a control flow
instruction. This design causes redundant instruction decoding between overlapping blocks analyzed
by different threads. However, while functions sharing code is common, most of the code blocks in
a binary are still not shared. This means that most of the time, a thread is going to branch into a
block that it created itself, not one created by another thread. Therefore, we implemented a
thread-local cache that records the addresses that have been analyzed by the thread and use this cache
to reduce redundant decoding.
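A thread-local cache of this kind needs very little code in C++11 and later. The following is a minimal sketch, not Dyninst's implementation: the names decodedByThisThread and needsDecoding are hypothetical, and the cache is assumed to be keyed by instruction address only.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// One independent set per thread, so lookups and inserts need no locking.
static thread_local std::unordered_set<std::uint64_t> decodedByThisThread;

// Returns true if the calling thread has not yet decoded the instruction at addr
// (and records it), false if this thread has already decoded it.
bool needsDecoding(std::uint64_t addr) {
    return decodedByThisThread.insert(addr).second;
}
```

Because each thread sees only its own set, the cache cannot detect work done by other threads. That is acceptable here: as argued above, most blocks are not shared, and the global check at control flow instructions still guarantees a consistent result.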
6.4 Profiling and Debugging Tools
The implementation of our new parallel CFG construction algorithms in Dyninst involves a large
amount of code, over 120K lines of code related to reading ELF sections, decoding machine
instructions, and data flow analysis. We find that the following workflow focuses our attention on what
needs it most and helps us identify numerous thread-safety issues across the code base.
(1) Use a performance analysis tool such as HPCToolkit [Adhianto et al. 2010] to gather perfor-
mance traces of the code with the aim of identifying code regions whose computational cost
justifies an overhaul to add parallelism.
(2) Inspect the identified code and choose a candidate parallelization strategy, such as loop-level
parallelism, a fork-join parallel programming model, or task-level parallelism. The right
choice for a particular code region depends on its code structure as well as the algorithms
and data structures that it employs.
(3) Use data race detectors to identify data structures and sharing patterns that require syn-
chronization. Candidate tools include logical data race detectors such as cilkscreen and
happens-before race detectors such as helgrind, the latter available as part of the Valgrind
binary instrumentation system [Nethercote and Seward 2007].
(4) Use the performance analysis tool to identify excessive mutual exclusion, unbalanced work-
loads, and excessive phase-based synchronization that form major bottlenecks.
(5) Test the design and implementation with a large suite of test cases and benchmarks.
The above steps are repeated until the overall performance improves satisfactorily.
7 APPLICATION CASE STUDIES
We present two application case studies that utilize our new parallel CFG construction algorithms:
hpcstruct, a utility in HPCToolkit for performance analysis, and BinFeat, a feature extraction
tool for software forensics.
7.1 Application Background
Besides constructing CFGs (AC1), other analysis capabilities commonly used by binary analysis
applications include (AC2) identifying loops, (AC3) building a mapping between source lines and
machine instructions, (AC4) understanding function inlining for templates and inlined functions,
(AC5) iterating over functions, basic blocks, edges, and machine instructions, and (AC6) performing
data flow analysis such as register liveness analysis. We use two application examples to illustrate
how these analysis capabilities are used.
Performance Analysis with HPCToolkit: HPCToolkit is an integrated suite of tools for
measurement and analysis of application performance on computers ranging from desktops to su-
percomputers. To relate performance measurements to an application’s source code, the hpcstruct
utility in HPCToolkit relates each machine instruction address to the static calling context in which
it occurs. In particular, hpcstruct is able to relate instructions to their original function (AC1) or
loop construct (AC2) by inspection of the binary’s final CFG (AC5), and to an inlined function or
template (AC4) and source lines (AC3) if DWARF debugging information is available.
Feature Extraction with BinFeat: BinFeat is a tool for extracting binary code features for soft-
ware forensics tasks, including function entry identification, compiler identification and authorship
attribution. Commonly used features include machine instruction sequences, subgraphs of CFGs
(AC1), loop nesting levels (AC2), and live register counts (AC6). BinFeat iterates over all functions
and blocks to extract these features (AC5).
1 ParseAPI::CodeObject *co = getCodeObject();
2 co->parse(); // Perform CFG construction in parallel
3 std::vector<ParseAPI::Function*> funcs = co->funcs();
4 SortFuncs(funcs); // Sort functions to address load balancing
5 // Parallel for loop to analyze each function in parallel
6 #pragma omp parallel for schedule(dynamic)
7 for (size_t i = 0; i < funcs.size(); ++i) {
8   ParseAPI::Function* f = funcs[i];
9   ParseAPI::LoopAnalyzer la(f); // Analyze loops
10   DataflowAPI::LivenessAnalyzer live(f); // Register liveness analysis
11   DataflowAPI::StackAnalysis sa(f); // Stack height analysis
12   // Other thread-safe intra-procedural analysis
13 }
Listing 7. Code example of utilizing Dyninst for parallel binary analysis applications.
7.2 Application Parallelization
Even if Dyninst’s CFG construction algorithms are parallelized, binary analysis applications need to
reduce serial execution to achieve good speedup. The basic idea here is that after the CFG has been
fully constructed, binary analysis will typically no longer make modifications to the CFG. Therefore,
the CFG becomes read-only and different threads can safely perform analysis independently as
long as the analysis itself is thread-safe. Based on this idea, we summarize a design pattern for
parallelizing binary analysis applications.
Listing 7 shows an example code snippet for writing parallel binary analysis applications. Line
2 uses the parallel CFG construction algorithm described in Section 5 to construct a CFG. Lines
3 and 4 get the list of functions in the binary and sort the functions to address load balancing
between threads. Sorting is important because functions have different sizes, which can cause notable
imbalance if a large function is scheduled last in a work queue. Therefore, we sort the functions in
decreasing order of size so that large functions are processed first. Within the parallel loop, the user can
apply intra-procedural analysis in parallel to different functions.
To complete the parallelization of a binary analysis application, an application developer will
also need to parallelize application-specific logic.
For hpcstruct, we parallelize the parsing of DWARF debugging information in a binary. A
binary’s DWARF information is organized in a forest-like structure with a tree for each compilation
unit (CU). Since source files are typically of similar sizes across a project we simply used an
OpenMP parallel for loop to process each of the CUs in parallel, accumulating their information
in structures allocated in parallel by a previous phase. This resulted in thousands of race reports,
which we handled first by mutex locks and then later by using concurrent data structures such as
those discussed in Section 6.1. Some races were caused by code within Libdw, a utility library from
Red Hat for parsing DWARF, and in cases where the performance would suffer from full mutual
exclusion we applied more significant modifications by implementing a resizeable hash table [Click
2007; Michael 2002; Triplett et al. 2011] in Libdw.
BinFeat needs to build a global feature index after extracting features from every function in a
binary. This operation can be parallelized with a reduction operation, which is a generic parallel
computing primitive.
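A reduction of per-function feature sets into one global index can be sketched as a pairwise (tree) merge. The code below is an assumption-laden illustration, not BinFeat's implementation: it assumes features are identified by strings and that the index records occurrence counts, and the names FeatureCounts, mergeInto, and reduceFeatures are hypothetical. The rounds are written serially; the merges inside one round touch disjoint pairs, so a parallel runtime could execute them concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical feature index: feature name to occurrence count.
typedef std::map<std::string, long> FeatureCounts;

// Adds every count in part to the accumulator acc.
void mergeInto(FeatureCounts& acc, const FeatureCounts& part) {
    for (FeatureCounts::const_iterator it = part.begin(); it != part.end(); ++it) {
        acc[it->first] += it->second;
    }
}

// Tree reduction: in each round, slot i absorbs slot i + stride.
// Takes parts by value so the caller's per-function results stay intact.
FeatureCounts reduceFeatures(std::vector<FeatureCounts> parts) {
    if (parts.empty()) return FeatureCounts();
    for (std::size_t stride = 1; stride < parts.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < parts.size(); i += 2 * stride) {
            mergeInto(parts[i], parts[i + stride]);  // pairs within a round are disjoint
        }
    }
    return parts[0];
}
```

Since adding counts is associative and commutative, the merge order does not affect the final index, which is what makes the operation suitable for a reduction.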
8 EVALUATION
Since it is a challenging task to generate accurate ground truth for a binary’s CFG, we evaluate the
correctness of our parallel CFG construction algorithm and implementation by approximating the
ground truth with debug information and RTL intermediates. We then evaluate the performance of
our work using hpcstruct and BinFeat.
8.1 Correctness
To illustrate the correctness of our approach, we verified our algorithm and implementation using
113 binaries obtained by compiling the coreutils and tar projects. These binaries are compiled
with GCC 9.3.0 for x86-64, with link-time optimization disabled and other optimizations enabled
as specified by the package. In addition, we compiled these binaries with debug information and
injected the flag -fdump-rtl-dfinish to generate RTL intermediates for individual source files.
The debug information and RTL are used only for generating the ground truth.
The ground truth of this data set consists of three parts:
• We represent the boundary of a function with address ranges, essentially projecting the CFG
of a function to the virtual address space. The DWARF .debug_info section encodes function
ranges. In particular, it supports multiple non-contiguous ranges for one function and supports
one range corresponding to multiple functions. Therefore, we can evaluate the handling of
functions sharing code and non-contiguous functions.
• We include the size of a jump table as part of the ground truth, which can be extracted by
scanning the RTL files. Unfortunately, we cannot derive jump table locations or the actual
targets from the RTL files. As existing jump table analysis has focused on bounding the size
of jump tables, we believe jump table sizes provide significant evaluation value.
• RTL encodes the ground truth for calls to non-returning functions, where a non-returning
call has REG_NORETURN as one of its arguments.
We then write a checker program that uses our parallel CFG construction implementation to
get the CFG, print out function ranges, jump table sizes, and non-returning calls, and match these
items with the ground truth.
We identified four distinct differences between our implementation and the ground truth by
manual inspection of the automatically identified differences:
• Failing to identify non-returning calls to ‘error’, causing functions to include additional
ranges. ‘error’ is non-returning when the first argument is non-zero, but returning when
the first argument is zero. Existing non-returning function analysis performs name matches
for external functions. This approach does not work for ‘error’.
• For a function foo, the compiler may emit another function symbol (“foo.cold”) for outlined
cold blocks from foo. However, the debugging information does not encode “foo.cold” and
lists the address ranges of “foo.cold” blocks as part of foo.
• Failing to resolve a jump table whose calculation uses the stack to store intermediate values.
• An extra indirect jump target caused by failing to identify a non-returning call to ‘error’,
leading bogus wrong control flow edge to the indirect jump.
In all cases above, the differences are caused by either incorrectness in the individual CFG
operations (O_CFEG and O_IEC) or mismatches between the symbol table and the DWARF information.
In other words, the errors are not caused by incorrect parallelism and can be fixed by improving the
implementations of O_CFEG and O_IEC.
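The ‘error’ case shows why name matching alone is insufficient: whether the call returns depends on an argument value at the call site. The sketch below contrasts the two approaches. It is illustrative only; the name list, the function names, and the assumption that the first argument is available as a known constant are ours, not part of the implementation evaluated here.

```python
# Illustrative only: the name list and the call-site model are assumptions.
KNOWN_NORETURN = {"exit", "abort", "_exit", "__stack_chk_fail"}


def returns_by_name(callee):
    """Name matching: one answer per callee, regardless of the call site."""
    return callee not in KNOWN_NORETURN


def call_returns(callee, first_arg=None):
    """Call-site-aware variant that special-cases error(status, ...).

    first_arg is the constant value of the first argument when it is known,
    otherwise None. error() returns only when its first argument is zero.
    """
    if callee == "error":
        if first_arg is None:
            return True  # unknown argument: assume the call may return
        return first_arg == 0
    return returns_by_name(callee)


if __name__ == "__main__":
    print(returns_by_name("error"))   # name matching treats every call alike
    print(call_returns("error", 1))   # non-zero status: does not return
    print(call_returns("error", 0))   # zero status: returns
```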
8.2 HPCToolkit’s hpcstruct
We use four large binaries to illustrate the effectiveness of our parallelization for speeding up
performance analysis: two binaries from Lawrence Livermore National Laboratory (LLNL1
and LLNL2¹), one large binary from Argonne National Laboratory (Camellia), and one shared library
from TensorFlow [Abadi et al. 2016].
Sizes of relevant sections of the four binaries are given in Table 1. LLNL1 is a Power little-endian
64-bit binary, LLNL2 and TensorFlow are x86-64, and Camellia is a Power big-endian 32-bit binary.
¹ Due to export control, we are unable to disclose the names of these binaries until approved by LLNL.
              Section sizes (MiB)
Binary        Total      .text     .debug_*
LLNL1          363.40     77.01     243.16
LLNL2         1913.50    149.13    1612.20
Camellia       299.08     40.81     232.43
TensorFlow    7844.81    112.21    7622.46
Table 1. Relevant statistics of the binaries used as input for the various benchmarks.
Fig. 2. Trace from a run of hpcstruct on TensorFlow. Descriptions of the labeled sections are given in Section 8.2.
                                           Time taken (s)
Binary                           Cores     DWARF (2)   CFG (4)    hpcstruct
LLNL1                            1           30.44      101.57     237.97 ± 3.79
                                 16           2.65       11.21      30.44 ± 0.28
                                 Speedup    ×11.47       ×9.06      ×7.82
LLNL2²                           1           83.95      176.79     690.86
                                 16           6.07       19.66     112.55
                                 Speedup    ×13.83       ×8.99      ×6.14
Camellia [Roberts 2014]          1           22.36       46.10     118.39 ± 2.24
                                 16           3.34        5.38      20.21 ± 0.17
                                 Speedup     ×7.86      ×11.42      ×5.86
TensorFlow [Abadi et al. 2016]   1          702.81      112.55    1252.88 ± 19.67
                                 16          64.29        9.56     160.82 ± 3.08
                                 32          49.63        5.49     146.12 ± 1.70
                                 64          48.67        4.46     154.61 ± 2.86
                                 Speedup    ×14.44      ×25.22      ×8.103
Table 2. Performance results, averages of 10 runs unless otherwise noted. Times for DWARF and CFG represent
parallel DWARF parsing and parallel CFG construction, corresponding to sections 2 and 4 in Figure 2.
LLNL1, LLNL2, and Camellia were compiled by their corresponding software development teams;
we compiled the TensorFlow binary with GCC 8.3.0. Experiments on the LLNL binaries were run
on a node with 16 threads (8 cores), Camellia on one with 36 threads (18 cores), and TensorFlow on
a two-socket machine with 36 cores each (72 threads total).
² Results for LLNL2 are based on one run for each thread count. We have limited access to the binary and cannot repeat the
experiment.
[Plot: speedup (vertical axis, 1 to 32) versus threads (horizontal axis, 1 to 64) for the hpcstruct, DWARF, and CFG series.]
Fig. 3. Average speedup (geometric mean) of hpcstruct on the four binaries, as described in Section 8.2.
The results are presented in Table 2 and in Figure 3. Overall, hpcstruct achieves an end-to-end
speedup of only 6× to 8× because several phases in the application code remain serial. We achieved a
speedup of 9× to 25× for constructing CFGs and a speedup of 8× to 14× for DWARF parsing.
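Each speedup in Table 2 is the single-thread time divided by the multithreaded time, and Figure 3 averages speedups across binaries with a geometric mean. As a worked check, the sketch below recomputes the 16-thread hpcstruct speedups from the times in Table 2. The function names are ours, and the geometric mean it prints is derived from those times rather than quoted from the text.

```python
import math

# Single-thread and 16-thread hpcstruct times in seconds, copied from Table 2.
HPCSTRUCT_TIMES = {
    "LLNL1": (237.97, 30.44),
    "LLNL2": (690.86, 112.55),
    "Camellia": (118.39, 20.21),
    "TensorFlow": (1252.88, 160.82),
}


def speedup(serial_seconds, parallel_seconds):
    """Speedup is the serial time divided by the parallel time."""
    return serial_seconds / parallel_seconds


def geometric_mean(values):
    """Geometric mean computed as the exponential of the mean logarithm."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    speedups = {
        name: speedup(t1, t16) for name, (t1, t16) in HPCSTRUCT_TIMES.items()
    }
    for name, value in speedups.items():
        print(f"{name}: {value:.2f}x")
    print(f"geometric mean: {geometric_mean(speedups.values()):.2f}x")
```

The geometric mean is the usual choice for averaging ratios because it keeps a single large speedup from dominating the average.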
To better understand the end-to-end performance impact of our work, we break down the
main phases of execution within hpcstruct in Figure 2, which presents a performance trace of
hpcstruct running on TensorFlow with 64 threads. The contents of each phase are as follows:
(1) Read data from disk into an internal buffer.
(2) Parse DWARF type information in parallel and store in appropriate data structures. Imbalance
in the sizes of compilation units can cause some idling.
(3) Parse address-to-function and line mappings from DWARF and store in a serial structure
optimized for accelerated lookup.³
(4) Parse text regions in parallel to identify functions and construct the final CFG.
(5) Convert line map and parsing results into “skeleton” objects inside hpcstruct, which are
suitable for export.
(6) Query Dyninst structures in parallel to fill the “skeleton” with the final data to be serialized.
(7) Serialize data and write to disk in parallel with queries to mitigate the effects of serial
processing.
Although our parallelization (2 and 4) scales well, the overall execution of hpcstruct has
difficulties scaling. As per Amdahl’s Law, the serialization in application code (1, 5-7) and remaining
difficulties (3) prevent our speedup from scaling past 13×. Applications with less serialization will
see larger speedups.
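Amdahl's Law makes this bound concrete: if a fraction p of the single-thread run time is parallelizable, the speedup on n workers is 1 / ((1 - p) + p / n), which can never exceed 1 / (1 - p). The sketch below evaluates that formula. The parallel fraction 12/13 is a hypothetical value chosen only because its asymptotic bound is 13×; the measured serial fraction is not stated in the text.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's Law for a given number of workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the number of workers grows without bound."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    p = 12.0 / 13.0  # hypothetical parallel fraction, not a measured value
    for workers in (1, 16, 32, 64):
        print(f"{workers:3d} workers: {amdahl_speedup(p, workers):.2f}x")
    print(f"limit: {amdahl_limit(p):.2f}x")
```

Under this model, adding threads yields diminishing returns once the serial phases dominate, which is consistent with the flat hpcstruct times between 32 and 64 threads in Table 2.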
8.3 BinFeat
Software forensic researchers typically use real world software to construct their training sets.
We follow their practice and construct a set of binaries to analyze. We compiled Apache HTTP
Server [The Apache HTTP Server Project [n.d.]], Redis [The Redis Project [n.d.]], Mysqlslap [The
MariaDB Project [n.d.]], and Nginx [The Nginx Project [n.d.]] with GCC-6.4.0 and -O2 optimization.
³ The design of the data structure used here makes this region difficult to parallelize.
                              Time taken (s)
Cores        CFG        IF        CF        DF    BinFeat
1         231.90    246.33    108.46    307.88     915.36
2         142.15    125.29     56.06    173.16     518.06
4          96.95     66.56     29.54     99.02     312.48
8          75.92     36.64     16.41     62.91     211.77
16         64.40     22.35      9.76     44.88     160.76
32         58.47     14.37      6.27     34.62     130.43
64         60.40     13.80      6.93     34.23     131.90
Speedup    ×3.84    ×17.85    ×15.66     ×9.00      ×6.94
Table 3. Performance results for BinFeat. CFG, IF, CF, DF represent the stages of CFG construction, extracting
instruction features, control flow features, and data flow features, respectively.
Our data set contains 504 binaries. This experiment was run on an x86-64 machine with 18 cores, 72
threads, and 48 MB of L3 cache.
Table 3 shows the performance results for BinFeat. We achieved 7× overall speedup using 32
hardware threads, but did not gain any further improvement with 64 threads. Extracting instruction
features (18×) and control flow features (16×) scale well to 64 threads.
The extraction of data flow features achieves only a 9× maximum speedup; we find that its
performance is hurt by workload imbalance between threads. Note that we extract features from
each function in parallel. Data flow analysis typically has a higher time complexity compared to
analyzing instructions and traversing control flow graphs. Therefore, the analysis of large functions
will dominate the whole execution.
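The effect of one dominant function can be seen with a small scheduling model: no matter how many threads are available, the run cannot finish before its most expensive task does, so speedup is capped at the total cost divided by the largest task cost. The sketch below is a minimal model under stated assumptions. The per-function costs are hypothetical, and the greedy longest-task-first assignment is our choice for illustration, not a description of how BinFeat schedules work.

```python
import heapq


def makespan(task_costs, workers):
    """Finish time of a greedy schedule that places the longest tasks first.

    Each task goes to the worker that currently has the least work.
    """
    loads = [0.0] * workers
    heapq.heapify(loads)
    for cost in sorted(task_costs, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)


def speedup(task_costs, workers):
    """Serial time (the sum of all costs) divided by the parallel finish time."""
    return sum(task_costs) / makespan(task_costs, workers)


if __name__ == "__main__":
    balanced = [2] * 100        # 100 functions of equal, hypothetical cost
    skewed = [100] + [1] * 100  # same total cost, one very large function
    for name, costs in (("balanced", balanced), ("skewed", skewed)):
        print(f"{name}: {speedup(costs, 64):.1f}x on 64 workers")
```

Both workloads have the same total cost, but the skewed one is limited by its single large function.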
CFG construction achieves only a 4× speedup; we identify two factors that limit its performance. First,
the issue of imbalanced workload also applies to CFG construction as the jump table analysis in CFG
construction takes significantly longer to run compared to other CFG operations such as creating
direct edges. Second, as described in Section 4, the non-returning function dependencies between
CFG operations can hurt parallelism. While we mitigate this problem with an eager approach
discussed in Section 5.3, this problem still persists. Note that these two issues do not show up for
large binaries such as those used in the hpcstruct experiments. We find that large binaries contain
sufficient numbers of functions to keep threads busy and hide these two issues.
9 DISCUSSION
Benefiting other applications: Our work provides a general framework for researchers to parallelize
their binary analysis applications. For example, software vulnerability searching calculates
binary code similarity [Chandramohan et al. 2016; David et al. 2016] to match known vulnerable
code. The calculation of binary code similarity utilizes binary analysis capabilities of analyzing
machine instruction characteristics, control flow, and data flow. Our work has parallelized several
common analysis capabilities and it will be interesting to see how our work benefits other binary
analysis applications.
Compiler assisted analysis: Our work opportunistically uses information from the compiler
(such as providing correct and detailed labels in the code and DWARF). However, this is not a
complete solution and we cannot rely on sufficient or even accurate compiler support. Surprisingly
often for even the most widely-used compilers, the compiler-provided information is incomplete
or inaccurate. One key issue is that binary analysis applications do not typically control which
compiler is used to generate the input binaries. For performance analysis, software developers
often use the compiler and optimization flags that lead to greatest performance, which often leads
to less accurate debugging information. Software forensic analysts deal with binaries collected
from the wild, whose compiler generated information is often intentionally removed to defend
against analysis. Therefore, while we use compiler assistance when available, we cannot rely
on its presence.
Other forms of parallelism: We focus on multi-threading as the mechanism for parallelization.
Other forms of parallelism can be used to further improve the performance of binary analysis
applications. For example, BinFeat can benefit from node level parallelism by distributing the
analysis of different binaries to different machines. We believe this type of parallelism is possible for
certain specific applications, and is orthogonal to our work. Binary analysis application developers
can benefit from our work and seek additional parallelization opportunities if necessary.
Stripped binaries: Stripped binaries do not have the static symbol table (.symtab) any more,
but still have the dynamic symbol table (.dynsym) and the exception unwinding frame information
(.eh_frame). In addition, our algorithms can be augmented with orthogonal research for identifying
stripped function entry points [Bao et al. 2014; Rosenblum et al. 2008; Shin et al. 2015].
Source code CFG construction: The challenges of binary code CFG construction are largely
distinct from those of source code CFG construction. First, binary code functions can share code,
which is the main reason that we must derive operation properties to guide our design invariants
to support analysis of multiple functions in a binary in parallel. In contrast, source code functions
cannot overlap unless functions are nested. As a result, CFG construction for source code does
not require rigorous synchronization. For example, a source code basic block parsed by one thread
is not going to be split by another thread. Second, jump tables in binary code are often used to
implement switch statements in the source code. Jump tables are encoded as indirect control flow
in the binary code, whose targets must be identified through data flow analysis. However, in source
code, the body of a switch statement is naturally grouped together, and it is straightforward to
identify every case clause for the switch statement. Third, the body of a source code function is
contiguous. However, basic blocks of binary code functions can be outlined to improve instruction
cache performance. As a result, binary analysis needs to address non-contiguous functions. Fourth,
tail calls in binary code are just normal function calls in source code.
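To make the block-splitting hazard concrete: in binary code, a newly discovered control flow target can land in the middle of an already parsed block, which must then be split at that address. The sketch below is a serial illustration under simplifying assumptions. Blocks are modeled as half-open address ranges, the target is assumed to lie on an instruction boundary, and no synchronization is shown. `split_block_at` is a hypothetical helper, not part of Dyninst.

```python
import bisect


def split_block_at(blocks, target):
    """Return a new block list in which `target` starts a block.

    `blocks` is a sorted list of non-overlapping half-open (start, end)
    address ranges. If `target` falls strictly inside a block, that block
    is split in two; otherwise an unchanged copy of the list is returned.
    """
    starts = [start for start, _ in blocks]
    index = bisect.bisect_right(starts, target) - 1  # last block starting <= target
    if index < 0:
        return list(blocks)
    start, end = blocks[index]
    if start < target < end:
        return (
            blocks[:index]
            + [(start, target), (target, end)]
            + blocks[index + 1:]
        )
    return list(blocks)


if __name__ == "__main__":
    parsed = [(0x10, 0x20), (0x20, 0x30)]
    # A jump to 0x18 arrives after (0x10, 0x20) has already been parsed.
    print(split_block_at(parsed, 0x18))
```

In a parallel setting, the thread that parsed the original block and the thread that discovers the new target may be different, which is why the split needs synchronization that source code CFG construction can avoid.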
10 CONCLUSION
With the increasing size of software and the need for analyzing large batches of binaries, adding
multithreaded parallelism speeds up binary analysis, but doing so requires principled algorithm and
data structure redesign and careful attention to implementation. Our work centers on a theoretical
abstraction that expresses CFG construction as applications of individual CFG operations. We
derived operation dependencies, commutativity, and monotonic ordering properties, which enable
us to assess the strengths and weaknesses of existing serial CFG constructions, and guided us
towards a new design for our parallel CFG construction algorithm. We evaluated our parallel
binary analysis with a performance analysis tool, hpcstruct, and a software forensics tool, BinFeat,
achieving 25× speedup for parallel CFG construction, 14× for ingesting DWARF, 8× overall for
hpcstruct, and 7× overall for BinFeat using 64 hardware threads. Our results show that our
parallel binary analysis can significantly speed up binary analysis applications, cutting the wait
times for their users and developers.
REFERENCES
Martín Abadi et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA, 265–283.
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCTOOLKIT: Tools
for Performance Analysis of Optimized Parallel Programs. Concurrency and Computation: Practice and Experience 22, 6
(April 2010), 685–701.
Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Gregory L. Lee, Barton P. Miller, and Martin Schulz. 2007. Stack
Trace Analysis for Large Scale Debugging. In 21st IEEE International Parallel and Distributed Processing Symposium
(IPDPS). Long Beach, California, USA, 1–10.
Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, and David Brumley. 2014. BYTEWEIGHT: Learning to Recognize
Functions in Binary Code. In 23rd USENIX Conference on Security Symposium (SEC). San Diego, CA, 845–860.
Sébastien Bardin, Philippe Herrmann, and Franck Védrine. 2011. Refinement-based CFG Reconstruction from Unstructured
Programs. In 12th International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI). Austin,
TX, USA, 54–69.
Andrew R. Bernat and Barton P. Miller. 2012. Structured Binary Editing with a CFG Transformation Algebra. In 2012 19th
Working Conference on Reverse Engineering (WCRE). Kingston, Ontario, Canada, 10.
Boost Project. [n.d.]. Boost, https://www.boost.org/.
Aylin Caliskan-Islam, Fabian Yamaguchi, Edwin Dauber, Richard Harang, Konrad Rieck, Rachel Greenstadt, and Arvind
Narayanan. 2018. When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries.
In 2018 Network and Distributed System Security Symposium (NDSS). San Diego, CA, USA.
Mahinthan Chandramohan, Yinxing Xue, Zhengzi Xu, Yang Liu, Chia Yuan Cho, and Hee Beng Kuan Tan. 2016. BinGo:
Cross-Architecture Cross-OS Binary Search. In 2016 24th ACM SIGSOFT International Symposium on Foundations of
Software Engineering (FSE). Seattle, WA, USA.
Cliff Click. 2007. A lock-free hash table. In JavaOne Conference.
William D. Clinger. 1998. Proper Tail Recursion and Space Efficiency. In 1998 ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI). ACM Press, Montreal, Canada, 174–185.
Cristian Ţăpuş, I-Hsin Chung, and Jeffrey K. Hollingsworth. 2002. Active Harmony: Towards Automated Performance
Tuning. In 2002 ACM/IEEE Conference on Supercomputing (SC). Baltimore, Maryland, 1–11.
Yaniv David, Nimrod Partush, and Eran Yahav. 2016. Statistical similarity of binaries. In 37th ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI). Santa Barbara, California, USA, 266–280.
Alessandro Di Federico, Mathias Payer, and Giovanni Agosta. 2017. Rev.Ng: A Unified Binary Analysis Framework to
Recover CFGs and Function Boundaries. In 26th International Conference on Compiler Construction (CC). Austin, TX,
USA.
Yizi Gu and John Mellor-Crummey. 2018. Dynamic Data Race Detection for OpenMP Programs. In International Conference
for High Performance Computing, Networking, Storage, and Analysis (SC). Dallas, Texas.
Intel. [n.d.]. Threaded Building Blocks, https://www.threadingbuildingblocks.org/.
Emily R. Jacobson, Andrew R. Bernat, William R. Williams, and Barton P. Miller. 2014. Detecting Code Reuse Attacks with
a Model of Conformant Program Execution. In International Symposium on Engineering Secure Software and Systems
(ESSoS). Munich, Germany, 18.
Johannes Kinder and Dmitry Kravchenko. 2012. Alternating Control Flow Reconstruction. In 13th International Conference
on Verification, Model Checking, and Abstract Interpretation (VMCAI). Philadelphia, PA.
Johannes Kinder and Helmut Veith. 2008. Jakstab: A Static Analysis Platform for Binaries. In 20th International Conference
on Computer Aided Verification (CAV). Princeton, NJ, USA, 423–427.
Xiaozhu Meng. [n.d.]. A tool base on Dyninst to extract binary code features for software forensics,
https://github.com/mxz297/BinFeat.
Xiaozhu Meng and Barton P. Miller. 2016. Binary Code Is Not Easy. In The International Symposium on Software Testing and
Analysis (ISSTA). Saarbrücken, Germany.
Xiaozhu Meng, Barton P. Miller, and Kwang-Sung Jun. 2017. Identifying Multiple Authors in a Binary Program. In 22nd
European Conference on Research in Computer Security (ESORICS). Oslo, Norway.
Maged M Michael. 2002. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the fourteenth
annual ACM symposium on Parallel algorithms and architectures. ACM, 73–82.
Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic,
Krishna Kunchithapadam, and Tia Newhall. 1995. The Paradyn Parallel Performance Measurement Tool. IEEE Computer
28, 11 (Nov. 1995), 37–46.
Nicholas Nethercote and Julian Seward. 2007. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation.
In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (San Diego,
California, USA) (PLDI ’07). ACM, New York, NY, USA, 89–100. https://doi.org/10.1145/1250734.1250746
Paradyn Project. [n.d.]. Dyninst: Putting the Performance in High Performance Computing, http://www.dyninst.org.
Dan Quinlan and Chunhua Liao. 2011. The ROSE source-to-source compiler infrastructure. In Cetus users and compiler
infrastructure workshop, in conjunction with PACT, Vol. 2011. Citeseer, 1.
Nathan V. Roberts. 2014. Camellia: A software framework for discontinuous Petrov-Galerkin methods. Computers &
Mathematics with Applications 68, 11 (2014), 1581–1604. https://doi.org/10.1016/j.camwa.2014.08.010 Minimum
Residual and Least Squares Finite Element Methods.
Nathan Rosenblum, Barton P. Miller, and Xiaojin Zhu. 2011a. Recovering the Toolchain Provenance of Binary Code. In 2011
International Symposium on Software Testing and Analysis (ISSTA). Toronto, Ontario, Canada.
Nathan Rosenblum, Xiaojin Zhu, and Barton P. Miller. 2011b. Who wrote this code? identifying the authors of program
binaries. In 16th European Conference on Research in Computer Security (ESORICS). Leuven, Belgium, 18.
Nathan Rosenblum, Xiaojin Zhu, Barton P. Miller, and Karen Hunt. 2008. Learning to Analyze Binary Computer Code. In
23rd National Conference on Artificial Intelligence (AAAI). AAAI Press, Chicago, Illinois, 798–804.
B. Schwarz, S. Debray, and G. Andrews. 2002. Disassembly of Executable Code Revisited. In Ninth Working Conference on
Reverse Engineering (WCRE). Richmond, VA, USA.
Eui Chul Richard Shin, Dawn Song, and Reza Moazzezi. 2015. Recognizing Functions in Binaries with Neural Networks. In
24th USENIX Security Symposium (USENIX Security 15). Austin, TX, USA.
Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and
G. Vigna. 2016. SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In 2016 IEEE Symposium on
Security and Privacy (SP). San Jose, CA, USA.
The Apache HTTP Server Project. [n.d.]. The Apache HTTP Server Project, https://httpd.apache.org/.
The MariaDB Project. [n.d.]. mysqlslap is a tool for load-testing MariaDB, https://mariadb.com/kb/en/mysqlslap/.
The Nginx Project. [n.d.]. High Performance Load Balancer, Web Server and Reverse Proxy, https://www.nginx.com/.
The Redis Project. [n.d.]. An open source, in-memory data structure store, https://redis.io/.
H. Theiling. 2000. Extracting Safe and Precise Control Flow from Binaries. In the Seventh International Conference on
Real-Time Systems and Applications (RTCSA). Cheju Island, South Korea, 23–30.
Josh Triplett, Paul E McKenney, and Jonathan Walpole. 2011. Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming. In USENIX Annual Technical Conference, Vol. 11.
V. v. d. Veen, E. Göktas, M. Contag, A. Pawlowski, X. Chen, S. Rawat, H. Bos, T. Holz, E. Athanasopoulos, and C. Giuffrida.
2016. A Tough Call: Mitigating Advanced Code-Reuse Attacks at the Binary Level. In 2016 IEEE Symposium on Security
and Privacy (SP). San Jose, CA, USA.
Victor van der Veen, Dennis Andriesse, Enes Göktaş, Ben Gras, Lionel Sambuc, Asia Slowinska, Herbert Bos, and Cristiano
Giuffrida. 2015. Practical Context-Sensitive CFI. In 22nd ACM SIGSAC Conference on Computer and Communications
Security (CCS). Denver, Colorado, USA.
David Williams-King, Hidenori Kobayashi, Kent Williams-King, Graham Patterson, Frank Spano, Yu Jian Wu, Junfeng Yang,
and Vasileios P. Kemerlis. 2020. Egalito: Layout-Agnostic Binary Recompilation. In Twenty-Fifth International Conference
on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Lausanne, Switzerland.
