Erlangen, den 27.03.2006 by Diplomarbeit Im Fach Informatik et al.
Procedural Abstraction for
ARM-Architectures
Diploma thesis in Computer Science
submitted by
Alexander Dreweke
born 8 January 1980 in Ingolstadt
prepared at the
Institut für Informatik
Lehrstuhl für Informatik 2
Programmiersysteme
Friedrich-Alexander-Universität Erlangen–Nürnberg
(Prof. Dr. M. Philippsen)
Advisors: Dipl.-Inf. Dominic Schell
Dr.-Ing. Ingrid Fischer
Prof. Dr. Michael Philippsen
Start of work: 2 November 2005
Submission of work: 31 March 2006

I declare that I have prepared this thesis without outside help and without using any
sources other than those stated, that the thesis has not been submitted in the same or a
similar form to any other examination authority, and that it has not been accepted by
such an authority as part of an examination. All passages that have been adopted verbatim
or in substance are marked as such.
The Universität Erlangen-Nürnberg, represented by Informatik 2 (Programmiersysteme),
is granted a simple, free-of-charge right of use, unlimited in time and place, to the
results of this diploma thesis, including any industrial property rights and copyrights,
for the purposes of research and teaching.
Erlangen, 27 March 2006
Alexander Dreweke

Diploma Thesis
Topic: Procedural Abstraction for ARM Architectures
Background: Since embedded systems chronically lack memory, it is important to generate
program code that is as compact as possible. In theory, assembly code allows highly
optimized and compact code to be written, but this comes at the cost of the well-known
problems in maintaining and porting the code. For this reason, high-level languages such
as C/C++ are to be used. Compilers, however, can only generate suboptimal code because of
their architecture. The division of the source code into many small modules, which is
especially common in OOP languages, and the limited optimization possibilities (most of
the underlying problems are NP-complete) mean that the finished program often contains
similar code fragments that differ, for example, only in the registers used.
Procedural abstraction is one approach to solving this problem: common code sequences
from the entire program are moved into a separate function and replaced by a call to this
function. The extraction is performed only if the costs incurred, e.g. by moving
registers before and after the function call, are amortized.
Task: In this thesis, a new method for procedural abstraction based on "graph data
mining" is to be evaluated. ARM is to be used as the underlying architecture. The
following milestones are to be reached:
• The binaries are first to be converted into a control flow graph and then into a
data flow graph. The resulting graph is the input for the data mining on graphs.
• The result of the data mining is a set of code pieces (graph fragments) that occur
frequently in the graph. It is to be tested with which settings the miner delivers
the best result.
• The fragments found are to be analyzed; heuristics must be developed to decide which
fragments are finally extracted.
• The best fragments are to be extracted from the code. A function call is to be
inserted at their original location, and registers are to be adjusted. The newly
generated assembly program must be executable.
• The developed method is to be evaluated, e.g. using the MI-Bench. The developed tool
is to be compared with aipop from absint.
Literature:
• http://www.arm.com
• http://www.absint.de/aipopARM/
• http://www.eecs.umich.edu/mibench/
• Wörlein, Marc; Meinl, Thorsten; Fischer, Ingrid; Philippsen, Michael: A quantitative
comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In: Jorge, Alipio;
Torgo, Luis; Brazdil, Pavel (eds.): Knowledge Discovery in Databases: PKDD 2005
(9th European Conference on Principles and Practice of Knowledge Discovery in
Databases, Porto, Portugal, 2005-10-03 to 2005-10-07). Berlin: Springer, 2005,
pp. 392-403
• Saumya Debray, William Evans, Robert Muth: Compiler Techniques for Code
Compression, 1999
Supervision: Dominic Schell, Ingrid Fischer, Michael Philippsen
Author: Alexander Dreweke

Summary (Zusammenfassung)
The goal of this diploma thesis was to generate program code that is as compact as
possible for embedded systems, since memory is one of the limiting factors on embedded
systems.
With respect to code size, compilers produce only suboptimal results, because different
functions often contain identical or similar code fragments that frequently differ only
in the order of the individual instructions or in the registers used. These fragments
can be extracted into separate functions and then replaced by calls to these new
functions. However, since modern compilers optimize only per module or per function,
they lack the overview of the existing code that certain optimizations require. In
addition, some optimization potential arises only after the final register allocation
and the concluding linking. For this reason, this diploma thesis applies procedural
abstraction to statically linked programs.
In contrast to existing tools, which search for identical code fragments by purely
textual comparison using so-called suffix trees, this thesis uses a new approach. For
each basic block, the data flow dependencies among the instructions are computed first
and the corresponding data flow graph is constructed. The graph mining tool ParMol can
then identify frequent fragments, which can be used as candidates for procedural
abstraction.
Compared with conventional approaches, the new graph-based approach saves on average
more than twice as many instructions.
Abstract
The goal of this diploma thesis was to create program code for embedded systems that is
as compact as possible, because memory is one of the limiting factors on these systems.
Because different procedures often contain the same or very similar code fragments,
compilers do not generate optimal code in terms of code size. Most of these fragments
differ only in the order of the instructions or in the registers used. Such fragments
can be abstracted into procedures of their own, which are then called instead. Because
modern compilers optimize only one procedure or one module at a time, they lack the
information needed for these optimizations. Some optimization opportunities arise only
after the final register allocation and the linking of all modules and libraries. For
this reason, this thesis applies procedural abstraction to statically linked binaries.
In contrast to existing tools, which search for identical code fragments only by
textual comparison using so-called suffix trees, this thesis uses a new approach.
First, the data dependencies among the assembler instructions are computed and a data
flow graph is constructed for each basic block. The graph mining tool ParMol is then
applied to these data flow graphs to identify code fragments that can be extracted by
procedural abstraction.
Compared with the traditional approaches, our new graph-based approach saved twice as
many instructions on average.
Contents
1 Introduction
1.1 Code Compression
1.2 Code Compaction
1.3 Traditional Approach for Procedural Abstraction
1.4 Graph based Data Mining
1.4.1 Graph Miners
1.4.2 gSpan
1.4.3 Embedding Based Mining
1.5 Overview
2 ARM Machine Code Analyses
2.1 ARM Instruction Set
2.1.1 ARM Registers
2.1.2 Instruction Set
2.1.3 Instruction Syntax
2.2 Weaved Data
2.3 Indirect Branch Instructions
2.4 Program Analysis
2.4.1 Basic Block Analysis
2.4.2 Control Flow Graph
2.4.3 Data Flow Graph
2.5 Summary
3 Graph based Procedural Abstraction for ARM
3.1 Preparing the Mining of the DFG
3.2 Weighting Function
3.3 Extraction
3.3.1 Cross Jump
3.3.2 Subroutine call
3.4 Algorithm
4 Evaluation
4.1 Benchmark Suite
4.2 Measurements
4.3 Summary
5 Future Work
6 Summary
A Command-line Options
B Algorithm Overview
List of Figures
1.1 Memory architecture for executing compressed binaries.
1.2 Procedural Abstraction: identical instructions are replaced by a procedure call.
1.3 Suffix trie for the string ABABC$.
1.4 Compact suffix tree representation for the string ABABC$.
1.5 Complete subgraph lattice for an example molecule.
1.6 DFS-code for an example molecule.
1.7 Canonical ordered DFS-codes.
1.8 Complete subgraphs vs. closed subgraphs.
1.9 DFG Code fragment that can not be extracted.
1.10 Maximal independent set of embeddings.
2.1 Basic layout of a CPSR on ARM architectures.
2.2 Syntax of ARM instructions.
2.3 Syntax of conditional ARM instructions.
2.4 Layout of ARM instructions capable to use immediate operands.
2.5 Data and address pools interweaved with instructions in the code section of an ARM
binary.
2.6 Load macros on ARM architectures.
2.7 Code generated by compiler and linker for a generic procedure call.
2.8 Def-/Use-vector of an ARM instruction add r1, r8, r4 (set bits are depicted in
gray).
2.9 Constructed DFG for an exemplary basic block.
3.1 ARM assembler code, that can be extracted through a cross jump because each block
ends with an unconditional branch.
3.2 ARM assembler code, that can be extracted through a cross jump because each block
ends with a return.
3.3 ARM assembler code, that can be extracted through a procedure call, LR is saved in
spare registers.
3.4 ARM assembler code, that can be extracted through a procedure call, PC is saved to
the stack.
3.5 ARM assembler code, that can be extracted through a procedure call, PC is saved to
the stack. All instructions that modify the stack have been adjusted.
4.1 Saving increase between sequential and DFG representation with and without maximal
clique activated during search.
4.2 Used extraction mechanisms with and without maxClique activated during search.
5.1 ARM assembler code and their canonical representation.
5.2 Enlargement of a search area over basic block boundaries.
5.3 Distinction of disjunctive memory locations.
5.4 Frequent instructions (marked gray) in sequential and DFG representation.
List of Tables
3.1 Instructions distribution in basic blocks.
4.1 Saved instructions in the benchmark suite with maximal clique activated during
search.
4.2 Saved instructions in the benchmark suite with maximal clique deactivated during
search.
4.3 Indegree and outdegree of all instructions.
4.4 Number of instructions with (degreeIN∨degreeOUT) > 1 in all DFGs that are used for
mining.
1 Introduction
Embedded systems are nowadays used in many fields such as automotive, cellphones,
and various consumer electronics. As consumers demand more and more functionality,
manufacturers have to further increase the size of the programs that run on embedded
systems without sacrificing memory, space, and energy efficiency.
As embedded systems are built in huge quantities, even a small reduction in the
per-piece cost results in a large overall saving. Smaller code reduces the need for
larger and therefore more expensive hardware components, so manufacturers of embedded
systems put great effort into reducing the amount of code. Techniques for code size
reduction are presented in the following sections.
1.1 Code Compression
The main idea behind code compression [1] is to store the binary in a compressed
representation and to decompress it as needed at run-time. The simplest case would be
to store the compressed binary in ROM and decompress it into RAM just before
execution. As only one binary is executed on a specific chip in an embedded system,
this approach is of little use: memory is saved only in ROM but not in RAM. The RAM
would even have to be increased, because the decompression code must be loaded in
addition to the binary itself.
Other approaches such as [2] describe a system architecture that uses a standard RISC
processor and an instruction cache that decompresses the code on the fly. In this case
the program code is decompressed in pieces as small as cache lines, which are then
stored in the instruction cache (see Figure 1.1). During sequential program execution
the code blocks are decompressed one after another. But as programs are executed in a
non-linear way (branch instructions and procedure calls are used quite often), the
problem is to identify the target address in the compressed program code. Code that has
already been decompressed and is stored in the instruction cache can be referenced with
the addresses of the uncompressed program, because after decompression its addresses
are equal to the addresses in the original, uncompressed binary. If the target of a
branch is not found in the instruction cache, a cache miss is generated and the target
block of the branch must be found in the compressed binary. This can be done through a
lookup in an address-line table. This table is generated by the compression tool and
stored in the processor's cache. It maps instruction addresses of the original binary
to cache lines in the compressed one. After the corresponding cache line has been
identified, it can be decompressed and executed. Compression can be applied either to
the entire binary [3] or at a finer granularity, e.g. to procedures [4].
Figure 1.1: Memory architecture for executing compressed binaries.
Different compression algorithms can be used. Traditional sequential data compression
algorithms, like the widely used Lempel-Ziv-Welch (LZW) algorithm [5], offer a
potentially high compression rate. These algorithms do not include the decompression
dictionary as part of the compressed data; instead they are able to reconstruct it on
the fly during decompression. But LZW has two major drawbacks. Firstly, a compression
gain can only be achieved over a large amount of data; secondly, the data must be
decompressed as a whole. Other compression algorithms like the Burrows-Wheeler
algorithm [6] analyze and reorder the data to achieve significantly higher compression
rates than LZW. But these algorithms achieve the high compression only for large block
sizes, which makes them unsuitable for procedure-based compression.
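The way LZW rebuilds its dictionary during decompression can be illustrated with a minimal sketch. The following Python code is not taken from [5] or from any tool discussed in this thesis; the function names and the unbounded dictionary are simplifying assumptions made only for illustration.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Encode data as a list of dictionary codes (illustrative, unbounded dictionary)."""
    dictionary = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)  # learn a new phrase
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Decode the codes; the dictionary is rebuilt on the fly, it is never transmitted."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return b"".join(output)


codes = lzw_compress(b"ABABABABAB")
restored = lzw_decompress(codes)
```

The sketch also hints at the first drawback named above: for short inputs only few phrases have been learned, so little is gained.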
The advantage of the decompressing instruction-cache approach is that it is completely
transparent to the processor. Only a decompression unit must be added to the chip, and
the address table that is needed to find the target addresses of branches in the
compressed data has to be stored in additional memory. Measurements show that
compression rates as high as 75% can be achieved with this technique. The execution
time penalty is 10% on average.
The algorithms discussed so far all compress the complete code of the binary. Another
approach is to split the individual instructions into mnemonics (e.g. add, mov, etc.)
and operands (e.g. the registers used, immediate values). These two parts are then
compressed separately. [1] shows that this technique significantly increases the
compression rate, as both parts show a higher redundancy than whole instructions do.
The compression approaches that were investigated achieve reasonably good results.
They have the disadvantage that additional time is needed to decompress the code. As
embedded systems often run time-critical programs, this can lead to problems. Another
problem is that the embedded system has to be equipped with a special decompression
unit. The need for additional hardware units runs counter to the goal of reducing the
costs of embedded systems.
1.2 Code Compaction
Code compaction is a technique that tries to reduce the number of instructions in a
binary while keeping the binary executable. No additional hardware for decompressing
the code is needed. Code compaction can be achieved with different approaches. A brief
overview of common techniques that are also used during program compilation [7]
follows:
• Unused variables: If a variable is never used, it can be removed from the code.
For example, loops can be transformed in such a way that most loop counters, such as
higher-order induction variables, can be removed. These variables need not be
calculated explicitly but are derived from other variables, e.g. by adding a
constant offset.
• Dead code elimination: If code in a program can never be executed, it can safely
be removed. The identification of dead code is not always simple. To decide whether
code (e.g. in the body of an if clause) will never be executed, the condition of the
if clause has to be analyzed. If it can be decided that the condition is always met,
the else part can safely be removed. A conservative approach has to be chosen,
meaning that a lot of the code cannot be eliminated at compile time.
• Common subexpression elimination: Often values are explicitly calculated more
than once in a procedure. Common subexpression elimination is a technique that
removes the redundant code, thus improving both cache usage and run-time. This can
be done explicitly at the source level of a program (this is especially true for
array indexes that are recalculated for each array). Common subexpression
elimination is also performed by the compiler in its optimization phase, as
temporary results are not reused during code generation.
Figure 1.2: Procedural Abstraction: identical instructions are replaced by a procedure
call.
Most of these techniques are performed during compilation, but modern compilers
optimize only per procedure or per module. Even inter-procedural optimizations cannot
remove unneeded or duplicated code completely, because some code is added at the very
last stages of the compilation process, after the optimizer phase has been completed.
For example, spill code is introduced during register allocation. Some optimizations
cannot be done by the compiler at all, regardless of how late the optimization phase
is run. For example, the compiler will always lack knowledge about the addresses of
procedures and data in the final binary. These addresses are assigned during linking,
i.e. after the compiler has generated code. A post-pass compaction phase therefore has
far more information and an overview of the complete program code and all its
libraries. In this phase, procedural abstraction can be performed as another
optimization in order to reduce the amount of code [8].
The traditional approaches to reduce the amount of code in binaries are described in
the next section.
1.3 Traditional Approach for Procedural Abstraction
The basic idea of procedural abstraction is to identify common code sequences in a
program and to extract this code into a single procedure. A call to this procedure
then replaces all occurrences of that code in the original program (see Figure 1.2).
In this way code compaction is achieved. Procedural abstraction works like the inverse
of the optimization called inlining [8]. Inlining is used to reduce the overhead of
calling a procedure (removing the need for parameter passing and jumps). It produces
far more efficient and faster program code, but it often leads to significantly larger
binaries.
Figure 1.3: Suﬃx trie for the string ABABC$.
Program code can be interpreted as a linear instruction sequence. To find recurring
code sequences, suffix tries can be used, as described in [9]. Normally suffix tries
are used to search for substrings in character strings; nevertheless they can be used
for procedural abstraction to search for common code sequences. In a suffix trie each
edge represents a substring of the string. Each path from the root node to a leaf node
corresponds to a suffix, i.e. a substring that extends to the end of the string. The
numbers in the leaf nodes of the suffix trie correspond to the position of that suffix
in the original string. The number of leaf nodes below an inner node equals the number
of times the substring (from the root node to that inner node) occurs in the original
string. To identify the end of the string, it must be explicitly terminated by $.
Figure 1.3 shows a suffix trie for the string ABABC$. After the suffix trie has been
created, it can be stored in a more compact way by merging inner nodes in the trie
that do not branch. The label of a merged inner node must then be stored in the label
of the next node (see Figure 1.4).
Figure 1.4: Compact suffix tree representation for the string ABABC$.
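The suffix trie lookup described above can be sketched in a few lines of Python. This is a minimal, uncompressed trie with quadratic space, not the implementation of [9] or of any existing tool; the class and function names and the instruction mnemonics in the usage lines are assumptions made for illustration. Positions are counted from 0.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # symbol -> TrieNode
        self.suffix_start = None  # set on leaves: start position of the suffix


def build_suffix_trie(symbols):
    """Insert every suffix of symbols; the last symbol must be a unique terminator."""
    root = TrieNode()
    for start in range(len(symbols)):
        node = root
        for symbol in symbols[start:]:
            node = node.children.setdefault(symbol, TrieNode())
        node.suffix_start = start
    return root


def occurrences(root, pattern):
    """Return the sorted start positions of pattern, read from the leaves below it."""
    node = root
    for symbol in pattern:
        node = node.children.get(symbol)
        if node is None:
            return []
    starts, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.suffix_start is not None:
            starts.append(current.suffix_start)
        stack.extend(current.children.values())
    return sorted(starts)


# The string from Figure 1.3: AB occurs at positions 0 and 2.
positions = occurrences(build_suffix_trie("ABABC$"), "AB")

# The same search on a hypothetical instruction sequence.
code = ["mov", "add", "ldr", "mov", "add", "$"]
repeated = occurrences(build_suffix_trie(code), ["mov", "add"])
```

Because the comparison is purely sequential, two fragments that differ only in the order of independent instructions are not recognized as equal, which is the limitation the graph based approach addresses.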
1.4 Graph based Data Mining
This section gives a short overview of the miner that is used for the graph based
approach to procedural abstraction. First the functionality of miners in general is
described, followed by the extensions that were added in order to adapt the mining
algorithm to the problem of finding recurring code sequences.
1.4.1 Graph Miners
Data mining is the process of automatically finding new information in large datasets.
Recently there has been great interest in developing data mining algorithms that
operate on graphs, because a number of different problem domains, like link analysis,
chemical compound classification etc., are naturally modeled by graphs. These graphs
can differ in various aspects, e.g. directed or undirected, labeled or unlabeled.
Data mining algorithms like MoFa [10], Gaston [11], gSpan [12, 13], AGM [14], or FSG
[15] find frequently occurring patterns in graph datasets. In the context of graphs,
we define a pattern as a connected subgraph [16]. Interesting subgraphs for frequent
subgraph miners are those that are part of at least a given number of graphs; the
number of graphs a fragment is part of is its so-called support or frequency. The
mapping of each occurrence of a fragment in the datasets is the so-called embedding
of the subgraph.
For frequent subgraph mining in graph datasets there are two distinct problem
formulations: the graph transaction problem and the single graph problem. In the graph
transaction problem the mining input consists of a set of graphs (called
transactions), whereas the input for the single graph problem consists of only a
single graph. The difference in the input data affects the way the frequency of the
various subgraphs is determined. For the graph transaction problem the subgraph
frequency is determined by the number of transactions the subgraph occurs in,
irrespective of how many times the pattern occurs in each transaction, whereas in the
single graph problem the frequency is based on the number of embeddings in the graph.
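The difference between the two frequency definitions can be made concrete with a small sketch. It is deliberately restricted to patterns that consist of a single labeled edge, so that no subgraph isomorphism test is needed; the graph representation (a pair of node labels and an edge list) and all names are assumptions chosen for illustration and do not reflect the data structures of any miner named above.

```python
def embeddings(graph, pattern):
    """All edges (u, v) whose end point labels match the one-edge pattern."""
    labels, edges = graph
    return [(u, v) for (u, v) in edges if (labels[u], labels[v]) == pattern]


def transaction_support(graphs, pattern):
    """Graph transaction problem: count the graphs that contain the pattern."""
    return sum(1 for graph in graphs if embeddings(graph, pattern))


def embedding_count(graph, pattern):
    """Single graph problem: count every embedding inside one graph."""
    return len(embeddings(graph, pattern))


# Three hypothetical data flow graphs; g1 contains the pattern twice.
g1 = ({0: "mov", 1: "add", 2: "mov", 3: "add"}, [(0, 1), (2, 3)])
g2 = ({0: "mov", 1: "add"}, [(0, 1)])
g3 = ({0: "ldr", 1: "add"}, [(0, 1)])

support = transaction_support([g1, g2, g3], ("mov", "add"))  # g1 and g2 count once each
in_g1 = embedding_count(g1, ("mov", "add"))                  # both embeddings count
```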
Besides the distinction by their problem domain, the algorithms can also be
distinguished by the way they go through the datasets. There are two main strategies
for traversing the subgraph lattice: the miner can use a depth first search (DFS) or a
breadth first search (BFS). The lattice is constructed by starting with a root node
and connecting it to all graphs that consist of only a single node. Afterwards each of
these graphs is extended by another edge and, if needed, a node. In BFS, it is
possible to use information collected for smaller graphs to prune the larger graphs
later on. So if an edge is discovered that leads to an infrequent subgraph, this edge
need not be added, and the search can be pruned for the resulting graphs. DFS has the
advantage that its memory requirement grows only with the depth of the search space.
This leads to significantly lower memory usage compared with the BFS approach.
Because there are different ways to create the same subgraph, duplicates must be
detected. As each subgraph is only of interest once, graph isomorphism tests or
canonical graph representations are needed. Because subgraph isomorphism tests are
NP-complete, they have a huge impact on the runtime performance of the graph miners.
In recent years a number of algorithms have been developed that find patterns in
graph transaction problems, among them MoFa, gSpan, FFSM and Gaston. These algorithms
are complete in the sense that they are guaranteed to discover all frequent subgraphs
in a graph dataset. But as for procedural abstraction we are interested in all
embeddings and not only in the number of transactions a fragment occurs in, these
algorithms cannot be used unmodified for our problem, which is a single graph problem.
1.4.2 gSpan
gSpan is a graph-based substructure pattern mining algorithm. Its search tree is
derived from the DFS-code, a special canonical representation for graphs. The gSpan
algorithm traverses its search tree in a depth first search. The children of each
subgraph result from edge extensions. This means that in each step just one new edge
and, if necessary, a new node is added.
Figure 1.5: Complete subgraph lattice for an example molecule.
DFS-Code
The DFS-code is based on a spanning tree. All nodes are indexed incrementally during
the search, thus each edge in the graph can be written as a 5-tuple, as can be seen in
Figure 1.6. The first two entries are the indices of the connected nodes. The node
indices are followed by the label of the first node, the label of the edge and the
label of the second node. The ordering of the node indices determines the type of the
edge. If the first index is smaller than the second, the edge is a forward edge (these
edges are part of the spanning tree). If, on the other hand, the first index is
greater than the second, the edge is a backward edge.
Figure 1.6: DFS-code for an example molecule.
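The 5-tuple and the distinction between forward and backward edges can be written down as a short sketch. The type and function names are assumptions, and the three-node ring in the usage lines is a made-up example, not the molecule of Figure 1.6.

```python
from typing import NamedTuple


class DfsEdge(NamedTuple):
    """One entry of a DFS-code: two node indices followed by three labels."""
    i: int            # index of the first node
    j: int            # index of the second node
    label_i: str      # label of the first node
    edge_label: str   # label of the edge
    label_j: str      # label of the second node


def edge_kind(edge: DfsEdge) -> str:
    """Forward edges lead to a node with a higher index, backward edges close a cycle."""
    return "forward" if edge.i < edge.j else "backward"


def node_count(code: list) -> int:
    """Each forward edge discovers one new node; the start node is added once."""
    return 1 + sum(1 for edge in code if edge_kind(edge) == "forward")


# Hypothetical ring C-C-O: two spanning tree edges and one edge back to node 0.
code = [
    DfsEdge(0, 1, "C", "-", "C"),
    DfsEdge(1, 2, "C", "-", "O"),
    DfsEdge(2, 0, "O", "-", "C"),
]
kinds = [edge_kind(edge) for edge in code]
```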
Canonical Form
Because more than one spanning tree can be constructed for a graph, there is also more
than one DFS-code for the graph. To be able to detect duplicates, a unique ordering of
all DFS-codes must be defined. Therefore, the DFS lists are ordered lexicographically
from the first edge to the last. By using the same sorting order for the edges as for
the nodes, the lowest DFS-code is the canonical representation used by gSpan (see
Figure 1.7).
Figure 1.7: Canonical ordered DFS-codes.
Canonical Pruning
Using the DFS-code and the canonical form has a few benefits during the search of the
lattice. If an edge is added to a subgraph, the DFS-code for the resulting subgraph
can easily be created by extending the DFS-code of the old subgraph with a tuple for
the new edge. Each prefix of a canonical DFS-code is always a canonical DFS-code.
Therefore, if a non-canonical DFS-code is reached during the search, the lattice can
be pruned just like for the frequency pruning. This speeds up the search, because
unnecessary work is avoided.
Closed Mining
Data mining algorithms aim to detect useful patterns in data. However, even the
relatively small set of frequent subgraphs that is mined by gSpan consists of a huge
number of subgraphs. For this reason a technique called closed mining is used to
reduce the number of subgraphs that are found by mining algorithms (see Figure 1.8).
Figure 1.8: Complete subgraphs vs. closed subgraphs.
For each frequent graph a set of frequent subgraphs can be constructed by removing
one node after another. Therefore the complete set of frequent subgraphs contains many
redundant subgraphs. We define a set of graphs to be closed if all graphs that meet
both of the following criteria are removed from the set:
• The graph is a subgraph of another graph in the set
• The frequency of the graph is equal to the frequency of that supergraph
As only the biggest frequent subgraphs are interesting for procedural abstraction,
this extension reduces the result set without removing any of the interesting frequent
graphs.
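The two criteria can be sketched as a simple filter. To keep the example self-contained, each fragment is abstracted as a frozenset of hypothetical edge identifiers, so that the subgraph test reduces to a proper-subset test; a real miner needs a subgraph isomorphism test instead. The function name and the frequencies are assumptions for illustration.

```python
def closed_fragments(frequencies):
    """Keep only fragments that have no supergraph with the same frequency.

    frequencies maps a fragment (frozenset of edge identifiers) to its frequency.
    """
    closed = {}
    for fragment, frequency in frequencies.items():
        redundant = any(
            fragment < other and frequency == other_frequency
            for other, other_frequency in frequencies.items()
        )
        if not redundant:
            closed[fragment] = frequency
    return closed


frequencies = {
    frozenset({"e1"}): 3,        # removed: contained in {e1, e2} with equal frequency
    frozenset({"e1", "e2"}): 3,  # kept: no supergraph in the set
    frozenset({"e2"}): 5,        # kept: it occurs more often than its supergraph
}
closed = closed_fragments(frequencies)
```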
1.4.3 Embedding Based Mining
As described earlier, for the single graph problem the fragment frequency must be
calculated by counting the number of embeddings [15]. The method of counting
frequencies is a fundamental issue that must be considered by any frequent subgraph
mining algorithm. In general, there are three possible methods of frequency counting.
In the first method, two embeddings are considered different as long as they differ by
at least one edge (i.e. non-identical). This allows arbitrary overlapping of
embeddings of the same subgraph. The second method considers two embeddings as
different only if they do not share edges (i.e. edge-disjoint), and the third
considers two embeddings as different only if they do not share nodes (i.e.
node-disjoint). As the nodes in our application field represent assembler
instructions, we must use the node-disjoint approach.
Figure 1.9: DFG Code fragment that can not be extracted.
For procedural abstraction we must count the number of non-identical embeddings and
thus must deal with the problem of overlapping embeddings. As can be seen in
Figure 1.9, the fragment mov → mov → add can be embedded twice in the graph. During
the actual extraction, however, one must stop after the first embedding is extracted,
because instructions of the second embedding have already been extracted with the first
embedding.
To solve this problem the maximal independent set [15] must be calculated for all
embeddings of a fragment. As this problem is NP-complete, it has a huge impact on
the runtime performance. To compute the maximal independent set an undirected graph
must be built in which each node represents one embedding. Two nodes are connected
if the two embeddings they represent overlap. For such
a graph, the maximal independent set represents the maximal non-overlapping set of
embeddings of a fragment. These independent embeddings are then counted for the
frequency calculations. Figure 1.10 shows the maximal independent set of a given graph
for the pattern mov → add.
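The construction of the conflict graph can be sketched as follows. This Python sketch is not the algorithm used in the thesis: it assumes that each embedding is given as a set of node ids, and it selects embeddings greedily, which yields an independent set that cannot be extended but is not guaranteed to be the largest one. The function name and the example data are hypothetical.

```python
from itertools import combinations


def independent_embeddings(embeddings):
    """Greedily select pairwise node-disjoint embeddings.

    Each embedding is a set of node ids (instructions). The result is a
    maximal independent set of the conflict graph, not necessarily the
    largest possible one.
    """
    # Conflict graph: one node per embedding, an edge for every overlap.
    conflicts = {i: set() for i in range(len(embeddings))}
    for i, j in combinations(range(len(embeddings)), 2):
        if embeddings[i] & embeddings[j]:
            conflicts[i].add(j)
            conflicts[j].add(i)

    chosen, blocked = [], set()
    # Heuristic: consider embeddings with few conflicts first.
    for i in sorted(conflicts, key=lambda k: len(conflicts[k])):
        if i not in blocked:
            chosen.append(i)
            blocked |= conflicts[i]
    return sorted(chosen)


# Hypothetical embeddings of one fragment: 0 and 1 share node 3.
embeddings = [{1, 2, 3}, {3, 4, 5}, {6, 7, 8}]
frequency = len(independent_embeddings(embeddings))
```

Here only two of the three embeddings can be extracted together, so the fragment is counted with a frequency of 2.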
Figure 1.10: Maximal independent set of embeddings.
1.5 Overview
In chapter 2 the ARM instruction set will be explained. In addition to the ARM
assembler syntax, special problems that arise from the architecture design, like interweaved
data and indirect branch instructions, will be discussed. After that the basic code analyses
that must be performed are explained. Chapter 3 will then describe the actual procedural
abstraction process. This includes the supported extraction methods and their costs
together with the weighting function that is used to determine the code fragments that
must be extracted in order to reduce the code size of the binary.
After the analyses have been examined, the new graph based approach is evaluated on
a set of programs. The results of this evaluation are discussed in chapter 4. Approaches
to further improve the graph based procedural abstraction are described in chapter 5.
Chapter 6 will conclude the thesis.
2 ARM Machine Code Analyses
As the ARM architecture is widespread among embedded systems, this architecture was
chosen as the target platform for procedural abstraction in this thesis.
This chapter will give an introduction to the ARM instruction set and the assembly
syntax that is used (see section 2.1). Besides this, some architecture specialties like
conditionally executed instructions, interweaved data (see section 2.2) and indirect branch
instructions (see section 2.3) will be explained. Afterwards the various steps of the
program analysis, like building basic blocks and creating a control and data flow graph,
will be explained in section 2.4.
2.1 ARM Instruction Set
The ARM (Advanced RISC Machine) architecture is a RISC architecture. A RISC design aims
to deliver simple instructions that execute within a single cycle at a high clock speed.
The design concentrates on reducing the complexity of instructions performed by the
hardware. Therefore more complexity and intelligence must be put in software rather
than in the hardware. As a result, a RISC architecture places greater demands on the
compiler. Complex instruction set computers (CISC), in contrast, rely more on the
hardware for instruction functionality.
2.1.1 ARM Registers
On an ARM architecture, there are fifteen 32-bit wide general purpose registers (r0 to
r14). All of these registers contain either data or an address. In addition to these
registers, there is the dedicated register r15, the program counter (PC), and the current
program status register (CPSR). The PC register contains the address of the next instruction to
be fetched and decoded. The CPSR is used to monitor and control internal operations,
see Figure 2.1. The CPSR is divided into four fields, each eight bits wide:
• Flags: This ﬁeld encodes the conditional status for conditional instructions and
the carry bit.
• Status: reserved for future usage.
• Extension: reserved for future usage.
• Control: This ﬁeld contains the mode, state and the interrupt bit mask of the
processor.
Figure 2.1: Basic layout of a CPSR on ARM architectures.
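The division of the CPSR into four fields of eight bits can be illustrated with a few shift and mask operations. In the following Python sketch the function names are hypothetical, and the bit positions of the N, Z, C and V condition flags (bits 31 to 28) are taken from the ARM architecture documentation rather than from the text above.

```python
def split_cpsr(cpsr):
    """Split a 32-bit CPSR value into its four 8-bit fields."""
    return {
        "flags": (cpsr >> 24) & 0xFF,
        "status": (cpsr >> 16) & 0xFF,
        "extension": (cpsr >> 8) & 0xFF,
        "control": cpsr & 0xFF,
    }


def condition_flags(cpsr):
    """Extract the N, Z, C and V condition flags (bits 31 to 28)."""
    return {name: (cpsr >> bit) & 1
            for name, bit in (("N", 31), ("Z", 30), ("C", 29), ("V", 28))}


# Example value: Z and C set, control field 0x13.
fields = split_cpsr(0x60000013)
flags = condition_flags(0x60000013)
```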
For ﬂoating point operations there are eight 32-bit wide general ﬂoating point regis-
ters (f0 to f7) and the ﬂoating point status register (FPSR). The design of the FPSR
is equivalent to the design of the CPSR.
2.1.2 Instruction Set
The ARM instruction set can be split into three major categories: general instructions,
floating-point instructions, and coprocessor instructions. These categories can
be split further into groups of instructions.
Floating-point instructions include instructions that perform arithmetic operations on
the floating-point registers and instructions that transfer data between general purpose
registers, memory and floating-point registers. As the ARM architecture is very modular,
each main computing core can be extended by coprocessors that fulfill
specialized tasks (like I/O, vector floating-point (VFP), etc.). Each of these coprocessors
has its own specialized instructions in addition to instructions that transfer data from
the main core to the coprocessor and vice versa. The general instructions, which represent
the majority of the ARM instructions, are partitioned as follows:
• Data processing instructions: Data processing instructions manipulate data
within registers. There are move instructions (mov, mvn, etc.), arithmetic
instructions (add, sub, mul, etc.), logical instructions (and, orr, etc.) and comparison
instructions (cmp, tst, etc.).
• Branch instructions: A branch instruction changes the flow of execution or is
used to call a routine. This type of instruction allows programs to have subroutines,
if-then-else structures and loops. The change of execution flow forces the PC to
point to a new address. As the instructions to branch inside of a subroutine (b)
and to call another procedure (bl) are very similar, we classify branch instructions
in this thesis into two more distinct groups. First, all instructions that change
the program flow within a subroutine are called jump instructions. Second, all
instructions that save the current program position and execute another subroutine
are called call instructions. As in a binary all call and jump instructions only use
addresses of other instructions, we have established a layer of abstraction. In the
internal representation call and jump instructions do not address other instructions
directly. Instead they address a pseudo instruction, the so-called label. This approach is
similar to the procedure that is used when writing or generating assembler code.
This way code can be relocated without the need to adjust all call and jump instructions
that point to instructions in the relocated code sequence.
• Load and Store instructions: Load-store instructions transfer data between
memory and processor registers. There are three types of load-store instruc-
tions: single-register transfer, multiple-register transfer and swap instructions. The
single-register transfer instruction is used for moving a single data item in or out
of a register. Multiple-register transfer instructions can transfer multiple registers
between memory and the processor in a single instruction. These instructions are
more eﬃcient compared to single-register transfer for moving blocks of data around
memory and saving and restoring context and stacks.
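The label abstraction described for branch instructions can be sketched in a few lines. The following Python sketch is an assumption about how such a rewriting step might look, not the implementation of the thesis: instructions are modelled as hypothetical (address, mnemonic, operand) tuples, and the label names L0, L1, ... are invented for illustration.

```python
def introduce_labels(instructions):
    """Replace absolute branch targets by symbolic labels.

    `instructions` is a list of (address, mnemonic, operand) tuples; for
    `b` and `bl` the operand is the target address.
    """
    targets = sorted({op for _, mnem, op in instructions if mnem in ("b", "bl")})
    labels = {addr: f"L{n}" for n, addr in enumerate(targets)}

    result = []
    for addr, mnem, op in instructions:
        if addr in labels:
            result.append(("label", labels[addr]))  # pseudo instruction
        if mnem in ("b", "bl"):
            op = labels[op]
        result.append((mnem, op))
    return result


code = [
    (0x8000, "cmp", "r0, #0"),
    (0x8004, "b", 0x800C),
    (0x8008, "add", "r0, r0, #1"),
    (0x800C, "mov", "pc, lr"),
]
relocatable = introduce_labels(code)
```

After this step the sequence no longer contains absolute addresses, so it can be moved without adjusting the jump.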
2.1.3 Instruction Syntax
The syntax that is used for ARM instructions in this thesis is common to all
assemblers that generate code for ARM architectures. The first operand after the mnemonic
is the target register of the instruction; this register will store the result of the operation.
This is true for all but the store instructions. For store instructions this register denotes
the source register. All general purpose registers are named r0 to r15; in addition there
are special aliases like pc and lr for the registers r15 and r14. All immediate operands
are marked by the # prefix. Memory addresses used by load (ldr) and store (str)
operations are marked by [ ...]. The register or immediate that is surrounded by the
brackets is interpreted as an address that will be read or written. Figure 2.2 gives an
overview of the syntax that is used by ARM assemblers.
add r4, r5, #6
sub r3, r6, r7
ldr r2, [r8, #8]
str r1, [r9]
Figure 2.2: Syntax of ARM instructions.
A special feature of most instructions on ARM architectures is that they are
predicated with condition codes. For every instruction the CPU checks the flags in the
CPSR (see Figure 2.1) or the FPSR, respectively, and executes the instruction only if the
corresponding flags are set. In the assembler syntax the condition codes (GT, LT, etc.)
are appended to the instruction mnemonic, as can be seen in Figure 2.3.
...
addGT r4, r5, #6
subLT r3, r6, r7
ldrEQ r2, [r8, #8]
strNE r1, [r9]
...
Figure 2.3: Syntax of conditional ARM instructions.
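How a condition code decides whether an instruction is executed can be sketched as a lookup on the N, Z, C and V flags. The following Python sketch covers only a subset of the condition codes; the flag combinations are taken from the ARM architecture documentation, not from the text above, and the function name is hypothetical.

```python
def condition_holds(cond, n, z, c, v):
    """Return True if an instruction with suffix `cond` would execute."""
    checks = {
        "EQ": z == 1,             # equal
        "NE": z == 0,             # not equal
        "GT": z == 0 and n == v,  # signed greater than
        "LT": n != v,             # signed less than
        "AL": True,               # always
    }
    return checks[cond]


# Flags as they would be after comparing 9 with 6: N=0, Z=0, C=1, V=0.
executes_add_gt = condition_holds("GT", 0, 0, 1, 0)
executes_sub_lt = condition_holds("LT", 0, 0, 1, 0)
```

With these flags addGT would be executed while subLT would be skipped.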
Figure 2.4 shows the schematic of all ARM instructions that can operate on immediate
values:
• cond ﬁeld encodes whether the instruction is executed conditionally (always, never,
etc.).
• opcode ﬁeld encodes the type of instruction (ldr, str, add, etc.).
• Rx fields encode a register (the purpose of the register depends on the type of
instruction). RD, for example, encodes the destination register, that is the register
the result of the instruction is written to. RN encodes an operand register that is
used by the instruction.
• immediate fields store an immediate value that is used by the instruction.
• Besides these fields there may be other fields that are encoded in the instruction. For
example, there can be a rotate field that rotates the immediate value before it is
used.
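The fields listed above can be extracted from a 32-bit instruction word with shifts and masks. The following Python sketch assumes the bit positions of the ARM data processing format with an immediate operand (cond in bits 31 to 28, opcode in bits 24 to 21, RN in bits 19 to 16, RD in bits 15 to 12, a 4-bit rotate field and an 8-bit immediate); these positions come from the ARM architecture documentation, and the function name is hypothetical.

```python
def decode_immediate_instruction(word):
    """Decode a 32-bit ARM data processing instruction with an immediate."""
    rotate = (word >> 8) & 0xF
    imm8 = word & 0xFF
    shift = 2 * rotate
    # Rotate the 8-bit value right by twice the rotate field (32 bits wide).
    value = ((imm8 >> shift) | (imm8 << (32 - shift))) & 0xFFFFFFFF
    return {
        "cond": (word >> 28) & 0xF,
        "opcode": (word >> 21) & 0xF,
        "rn": (word >> 16) & 0xF,
        "rd": (word >> 12) & 0xF,
        "immediate": value,
    }


# 0xE2854006 encodes `add r4, r5, #6` (cond 0xE = always, opcode 0x4 = add).
decoded = decode_immediate_instruction(0xE2854006)
```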
Figure 2.4: Layout of ARM instructions capable of using immediate operands.
2.2 Weaved Data
On most architectures with a fixed instruction width, immediate operands of instructions
cannot be as wide as the machine word. This is because some bits in an instruction are
needed to encode the instruction functionality. Therefore absolute addresses cannot be
encoded as an immediate operand of a single instruction. In ARM instructions, which are
mostly predicated with condition codes that take up 4 of the 32 bits of the instruction,
there are at most 12 bits left to specify immediate operands.
In the following we distinguish two types of interweaved data (according to [17, 18]).
If the interweaved data is used to address memory locations or jump targets we will
call it an address pool, otherwise a data pool. The most effective and common solution for
the limited operand width is to load the addresses from memory, but now a similar
problem arises: where can the addresses be loaded from? A common solution for this problem
is the use of a Global Offset Table (GOT) [19]. This table stores the addresses of all global
data and code. If such data or code needs to be accessed, its address is loaded from the GOT
by a load instruction that is indexed by the Global Pointer (GP). The GP is a dedicated
register that always holds the address of the GOT. This approach has the drawback that one of
the general-purpose registers has to be sacrificed to become the GP. This reduces the
number of available registers and therefore increases register pressure and thus the need
for spill code (code that stores values from registers to memory in order to make the
registers usable by other instructions).
On architectures with an explicit Program Counter (PC), like the ARM architecture,
the GOT does not need to be a contiguous table. Instead, the table can be split into
small pieces that are interwoven with the program code and are accessed through PC relative
indexed load operations. This way, no register has to be used as a GP. Figure 2.5 shows
how data and address pools can be interwoven between the assembler instructions in
the code section. The instruction at address 0x401c in the code section of the program
needs to access the data stored at address 0x8058 in the data section. Because the
compiler only knows that the instruction will need access to that data but not where the
instruction will be placed in the final program, the compiler will add an address pool
(depicted in gray) to the program code. This address pool will be inserted in between
the instructions. To access the data at address 0x8058 the compiler first has to insert
another load instruction (at address 0x4018) that loads the address of the data (in the
data section) from the address pool.
Figure 2.5: Data and address pools interweaved with instructions in the code section of
an ARM binary.
Because these data and address pools are mixed with the normal instructions in the
code section, the data and the addresses in them are interpreted as instructions during
program analysis. As it is important for the procedural abstraction to strictly
distinguish between instructions and data, all instructions that could potentially make use of
such interweaved data pools must be analyzed. If interweaved data has been identified,
it must be removed from the instruction stream. In order to create a fully functional
program after the optimization phase, all data and address pools must be inserted into
the program again. As instructions can be moved from their original position in the code
through procedural abstraction, it is important to associate the instructions with their data.
To achieve this, the original load instructions are transformed. The ARM assembler
offers a macro representation for load instructions (macro: LDR register, =value) that
combines a load instruction with a label or data value. During assembling and linking this
macro (see Figure 2.6) is transformed. If the data is small enough to be encoded as an
immediate operand, the macro is transformed to a load or move instruction with the immediate
operand. The immediate value 5 can be encoded into 12 bits and therefore the macro
can be expanded into a simple mov instruction. If the operand is too big to fit into the 12
bits, the macro is transformed to a load instruction that uses an address pool to access
the data. The immediate 0x0c48 cannot be encoded into the 12 bits, which hold an 8-bit
value together with a 4-bit rotation, and therefore a PC relative ldr instruction has to
be generated.
...
LDR r4, =0x5
LDR r5, =printf
LDR r6, =0x0c48
...
(a) before
...
mov r4, #5
ldr r5, [pc, #76]
ldr r6, [pc, #80]
...
(b) after
Figure 2.6: Load macros on ARM architectures.
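The decision between a mov instruction and a PC relative load can be sketched as follows. The Python sketch assumes the ARM immediate scheme of an 8-bit value rotated right by an even number of bits; it handles numeric values only, leaves the PC relative offset symbolic because it is known only after layout, and ignores further alternatives a real assembler may use. The function names and the list that stands in for the address pool are hypothetical.

```python
def is_encodable_immediate(value):
    """True if `value` is an 8-bit constant rotated right by an even amount."""
    value &= 0xFFFFFFFF
    for shift in range(0, 32, 2):
        # Rotating left by `shift` undoes a right rotation by `shift`.
        rotated = ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF
        if rotated < 0x100:
            return True
    return False


def expand_load_macro(register, value, pool):
    """Expand `LDR register, =value` for a numeric value.

    Values that cannot be encoded are appended to `pool`, a list that
    stands in for the address/data pool.
    """
    if is_encodable_immediate(value):
        return f"mov {register}, #{value}"
    pool.append(value)
    return f"ldr {register}, [pc, #<offset of pool entry {len(pool) - 1}>]"


pool = []
first = expand_load_macro("r4", 0x5, pool)
second = expand_load_macro("r6", 0x0C48, pool)
```

The value 5 is expanded into a mov instruction, whereas 0x0c48 spans nine significant bits and is therefore placed in the pool.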
2.3 Indirect Branch Instructions
After all the interweaved data has been extracted from the code, all instructions that
modify the PC register must be analyzed. As the PC register stores the address of
the instruction that must be executed next by the central processing unit (CPU), these
instructions alter the control flow of a program and must thus be treated as branch
instructions. Although many processors offer explicit branch instructions, compilers often use
instructions that directly modify the PC register for performance reasons or to translate
control structures like switch case constructs. On ARM architectures instructions that
directly modify the PC register are the only way to implement a return from a
subroutine to its calling procedure. As on ARM all instructions have a fixed width of 4 bytes and
8 bits of these are needed to encode the call opcode and its condition field, this leaves only
24 bits for the target of the call. As only the linker knows the exact address of a
procedure in the code, the compiler must generate code that is capable of
branching to every memory address in the code (see Figure 2.7). Therefore the compiler often
generates code patterns that combine interweaved data and an instruction that directly
moves the address into the PC register. So in order to analyze the code precisely, at
least all the return instructions and the most common indirect procedure-call compiler
patterns must be identified. For this reason every instruction that directly modifies the
PC register is analyzed and compared to the most common compiler patterns. This way
about 80 % of all indirect branch instructions can be identified.
...
LDR r5, =printf
...
(a) before
...
0x3410: ldr r4, [pc, #4]
0x3414: b 0x341c
0x3418: 0x4510
...
(b) after
Figure 2.7: Code generated by compiler and linker for a generic procedure call.
2.4 Program Analysis
In order to perform any optimizations, the binaries must be analyzed first. To simplify
the task of reading the ARM binaries and transferring them into an internal
representation, the binaries themselves are not read directly. Instead the output generated by
the Linux tool objdump is parsed. Objdump displays the assembly code of a binary.
Procedural abstraction is used in this thesis as a post link-time optimization. All
code used must be visible to the optimizing program; therefore only statically linked,
non-stripped binaries are supported at the moment. The following sections describe the
program analyses in the order in which they are performed by the optimizer.
2.4.1 Basic Block Analysis
As all following analyses require the identification of basic blocks to construct the control
flow graph of a procedure or the data flow graph, a quick explanation of basic blocks and
their identification is given in this section. Some ARM-specific aspects have already been
described in section 2.1. Formally, a basic block is a sequence of instructions in which
the instruction in each position dominates all instructions in later positions and no other
instruction is executed between two instructions in the sequence [20]. Thus, the first
instruction of a basic block, the so-called leader instruction, may be:
• The entry point of a procedure. This is the ﬁrst instruction in the sequence of
instructions that will be executed if the procedure is called in the program.
• A target of a branch instruction. These are often pseudo instructions, so called
labels that are only used to mark places in the instruction sequence as targets of
various branch and goto instructions.
• An instruction immediately following a branch or return instruction.
To determine the basic blocks that compose a procedure, first all the leader
instructions have to be identified by analyzing all instructions in the code sequence. Then all
instructions between a leader instruction and the next one are included in the same
basic block [21]. In almost all cases, the above algorithm is sufficient to
determine the basic block structure of a procedure.
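The leader-based partitioning can be sketched as follows. The Python sketch is an illustration only: instructions are modelled as hypothetical (mnemonic, operand) tuples, branch targets are assumed to be marked by ("label", name) pseudo instructions, and the mnemonic "ret" is a placeholder for any return instruction (on ARM a return is an instruction that writes to the PC).

```python
def basic_blocks(instructions, branch_mnemonics=("b", "beq", "bne", "ret")):
    """Partition one procedure into basic blocks.

    `instructions` is a list of (mnemonic, operand) tuples in which branch
    targets are already marked by ("label", name) pseudo instructions.
    """
    leaders = {0}  # entry point of the procedure
    for i, (mnemonic, _) in enumerate(instructions):
        if mnemonic == "label":
            leaders.add(i)  # target of a branch
        if mnemonic in branch_mnemonics and i + 1 < len(instructions):
            leaders.add(i + 1)  # instruction following a branch or return
    starts = sorted(leaders)
    ends = starts[1:] + [len(instructions)]
    return [instructions[s:e] for s, e in zip(starts, ends)]


procedure = [
    ("cmp", "r0, #0"),
    ("beq", "L0"),
    ("add", "r0, r0, #1"),
    ("label", "L0"),
    ("ret", ""),
]
blocks = basic_blocks(procedure)
```

The example procedure is split into three blocks: the comparison with its conditional branch, the single add instruction, and the labelled return.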
Call instructions pose a special problem. They also alter the control flow, but they
have not been considered as branches when determining the leaders in a procedure. In most
cases calls do not need to be treated as branches, resulting in longer and fewer basic
blocks, which is desirable for optimization purposes. This is possible because after the
called procedure has ended, the program will continue the execution right after the call
instruction. Thus a call instruction can be seen as a special instruction that
summarizes all instructions that are executed in the called procedure. However, if a procedure
call has alternate return addresses, as may be the case in various constructs used to perform
procedure calls on ARM architectures, or the return address is completely unknown,
then it must be considered as a basic block boundary.
2.4.2 Control Flow Graph
After having identified the basic blocks in a procedure, the procedure's control flow
graph (CFG) can be constructed. The CFG is a rooted directed graph. In this graph,
the nodes are the basic blocks. In addition to the basic block nodes there are two special
nodes, the so-called entry node and exit node. The edges in the control flow graph point
from one basic block node to another in the same way as the control flow runs through
the program during execution. The entry node is connected to the initial basic block
of the procedure and thus is the graph's root. The initial basic block starts with the
first instruction of a procedure. Each final basic block (a basic block is called final if
its last instruction terminates the execution of the current procedure and returns the
control flow to the calling procedure) is connected with the exit node. Entry
and exit node of each procedure are also connected with each other. Entry and exit
node are not essential and are only added for technical reasons, as they make many of
the analysis algorithms much simpler by removing the need to treat some basic block
nodes differently. Every procedure has its own unique entry and exit node.
In order to construct the control flow graph of a procedure, the algorithm described
in [20] is used. In this algorithm a set of successors and a set of predecessors
are created for each basic block. To construct the set of successors for each basic block,
first a mapping between labels and the basic blocks they occur in has to be created. As
labels are leader instructions, this can be accomplished by scanning the first instruction
of each basic block in the procedure. Second, a mapping between all branch instructions and
the basic blocks they occur in must be created. Just as label instructions are the first
instructions of basic blocks, branch instructions are the last instructions of basic blocks.
Simultaneously with the creation of the branch mapping the return instructions can be
analyzed, as they also have to be the last instruction of a basic block. All basic blocks with
a return instruction at the end can be directly connected with the exit node. If the branch
or return instruction is conditional, an edge to the following basic block has to be added
to the list of successors of this basic block. After all these edges have been added to the
CFG, the mapping between the labels and the branches has to be resolved, and edges between
the basic blocks that end with branch instructions and the corresponding basic blocks with
labels are added to the CFG.
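The construction of the successor sets can be sketched as follows. This Python sketch is a simplified illustration and not the algorithm of [20]: basic blocks are lists of hypothetical (mnemonic, operand) tuples, "b" stands for an unconditional jump, "beq" and "bne" for conditional jumps, and "ret" is a placeholder for an unconditional return; conditional returns are not modelled.

```python
def build_cfg(blocks):
    """Return the successor sets of the basic blocks of one procedure."""
    # Mapping from label name to the block that starts with this label.
    label_block = {blk[0][1]: i for i, blk in enumerate(blocks)
                   if blk[0][0] == "label"}
    # Entry is connected to the initial block and, as in the text, to exit.
    succ = {"entry": {0, "exit"}, "exit": set()}
    for i, blk in enumerate(blocks):
        mnemonic, operand = blk[-1]
        succ[i] = set()
        if mnemonic == "ret":
            succ[i].add("exit")
        elif mnemonic in ("b", "beq", "bne"):
            succ[i].add(label_block[operand])
        if mnemonic not in ("b", "ret") and i + 1 < len(blocks):
            succ[i].add(i + 1)  # fall through, also after conditional jumps
    return succ


cfg_blocks = [
    [("cmp", "r0, #0"), ("beq", "L0")],
    [("add", "r0, r0, #1")],
    [("label", "L0"), ("ret", "")],
]
cfg = build_cfg(cfg_blocks)
```

The predecessor sets can be derived from the successor sets by inverting the edges.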
Two problems remain unsolved. The first problem when creating a CFG from assembler
instructions is that compilers can generate indirect branch instructions, for example mov
pc, r4. For these instructions the target of the branch cannot be easily determined.
As indirect branch instructions jump to addresses that are given to the instruction by
a register or a variable, the values stored in these registers or variables must be
determined. This problem can only be resolved by using advanced analysis techniques to
determine the values that are stored in the registers. The second problem is that jumps
can leave the current procedure and enter the control flow of another procedure.
These jumps between procedures cannot be modeled by a CFG. In order to
model these branch instructions correctly an inter-procedural CFG (ICFG) is necessary.
An ICFG models the whole program in one directed graph by merging the CFGs of all
procedures in the program and adding edges between them. The edges between
the different CFGs then model procedure calls and branches between the various control
flows in the CFGs [22].
2.4.3 Data Flow Graph
In order to determine which instructions have data dependencies on other instructions
in the same basic block, all instructions must be analyzed. Data dependencies between
two instructions I1 and I2 can be classiﬁed as follows [20]:
• Flow dependency: I1 writes to a register or memory location that is used by I2.
I1: add r4, r2, #2 (r4 ← r2 + 2)
I2: mul r5, r2, r4 (r5 ← r2 · r4)
• Anti-dependency: I1 uses a register or memory location that is written by I2.
I1: sub r3, r7, r8 (r3 ← r7 − r8)
I2: div r8, r1, r5 (r8 ← r1/r5)
• Output dependency: I1 writes to a register or memory location that is also
written by I2.
I1: mov r9, r11 (r9 ← r11)
I2: mov r9, #7 (r9 ← 7)
• Unknown dependency: It cannot be determined whether two instructions I1
and I2 are independent or not. This situation can occur if, for example, a load is
followed by a store instruction that uses different registers to address a memory
location. As long as it cannot be determined whether the two addressed locations
overlap, a dependency between the two instructions must be assumed.
I1: ldr r12, [r4, #6] (r12 ← memory)
I2: str r15, [r13] (memory ← r15)
In the given example I1 calculates the memory address by adding 6 to the value
in register r4; the value stored at this memory address is then loaded into register
r12. I2 stores the value in register r15 to the memory address that is stored in
register r13.
Dependencies between instructions indicate the order in which the instructions must
be executed. If instruction I2 depends on I1, then I1 is a predecessor of I2 and therefore
must be executed before I2. In order to model the dependencies between
instructions correctly a directed acyclic graph (DAG) [23] can be constructed. The
instructions are the nodes of the graph, and each dependency is represented by an edge that
connects the two nodes depending on each other. Condition codes and other implicit resources
can be treated as if they were registers or memory. An example of an implicit resource on ARM
architectures is the carry flag that is set in case of an overflow during an addition.
The ordering that is defined by the dependencies between the nodes of the DAG is the
only ordering that must be preserved during optimizations. Optimizations may change
the ordering of the instructions only within these limits, so that the code can still be
executed correctly afterwards.
To precisely calculate the dependencies between the various instructions on ARM
architectures, two bit vectors are created for each instruction. They indicate which registers,
memory locations and other resources are read (USE-vector) or written (DEF-vector)
by the instruction. Each resource is associated with a dedicated bit in the vector [7]. For
example, Figure 2.8 shows the DEF-/USE-vectors for the add r1, r8, r4 instruction (all
set bits are depicted in gray). Because it is difficult to distinguish between various
memory locations (as memory accesses are often register indirect, and the register value
can only be determined at runtime), all memory accesses to both heap memory and stack
memory are merged into one single bit. Distinguishing memory locations is an important
subject in order to build precise but not too conservative data flow graphs, which allow
various optimizations. But this task is beyond the scope of this thesis.
Figure 2.8: Def-/Use-vector of an ARM instruction add r1, r8, r4 (set bits are
depicted in gray).
Many instructions can have side effects. For example, as most instructions on ARM
can be executed conditionally, all of these instructions have a dependency on the
current program status register (CPSR). These side effects are not visible by analyzing the
operands of an instruction but must be looked up in the ARM developer books [24, 25, 26]
or the ARM instruction set reference cards [27, 28, 29].
After the DEF- and USE-vectors have been constructed for each instruction in a basic
block, it has to be checked whether the vectors of two instructions I1 and I2, where I1
precedes I2, overlap. If the DEF-vector of I1 overlaps with the USE-vector of I2, there is
a flow dependency between the instructions. If both DEF-vectors overlap, there is an output
dependency. If the USE-vector of I1 and the DEF-vector of I2 overlap, there is an
anti-dependency. The overlapping of the two USE-vectors is of no interest for the
construction of the data flow graph and can therefore be ignored.
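The overlap tests on the DEF- and USE-vectors map directly to bitwise AND operations. The following Python sketch assumes a hypothetical resource layout with one bit per register r0 to r15, one bit for the CPSR and one bit for all of memory; the names `RESOURCES`, `vector` and `dependencies` are invented for illustration.

```python
RESOURCES = [f"r{i}" for i in range(16)] + ["cpsr", "memory"]
BIT = {name: 1 << i for i, name in enumerate(RESOURCES)}


def vector(*names):
    """Build a bit vector with one bit set per named resource."""
    bits = 0
    for name in names:
        bits |= BIT[name]
    return bits


def dependencies(def1, use1, def2, use2):
    """Classify the dependencies of instruction I2 on an earlier I1."""
    kinds = set()
    if def1 & use2:
        kinds.add("flow")
    if use1 & def2:
        kinds.add("anti")
    if def1 & def2:
        kinds.add("output")
    return kinds


# I1: add r4, r2, #2   I2: mul r5, r2, r4
flow = dependencies(vector("r4"), vector("r2"), vector("r5"), vector("r2", "r4"))
# I1: sub r3, r7, r8   I2: div r8, r1, r5
anti = dependencies(vector("r3"), vector("r7", "r8"), vector("r8"), vector("r1", "r5"))
# I1: mov r9, r11      I2: mov r9, #7
output = dependencies(vector("r9"), vector("r11"), vector("r9"), vector())
```

The three calls reproduce the flow, anti and output dependency examples given earlier in this section.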
The data flow DAG is now constructed by the following algorithm. For every instruction
in the basic block it has to be examined whether its DEF-/USE-vectors overlap with the
vectors of any preceding instruction, in the way described above. If the vectors
overlap, an edge is inserted between the nodes representing the two instructions, but only if
there is not already an edge and there is no path between the two nodes. If the
current instruction does not depend on any of the preceding instructions, it
is added to the set of root nodes of the DFG. Figure 2.9 shows the complete
DFG for an exemplary basic block.
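The construction can be sketched as follows. The Python sketch makes two assumptions that are not stated in the text: the DEF and USE information is given as sets of resource names instead of bit vectors, and the preceding instructions are visited from the nearest to the farthest, so that a dependency that is already covered by a path is skipped. The function name is hypothetical.

```python
def build_dfg(instructions):
    """Build the data flow DAG of one basic block.

    `instructions` is a list of (defs, uses) pairs of resource-name sets.
    Returns (edges, roots): edges[i] holds the direct predecessors of i.
    """
    edges = {i: set() for i in range(len(instructions))}
    ancestors = {i: set() for i in range(len(instructions))}
    roots = []
    for i, (defs_i, uses_i) in enumerate(instructions):
        # Visit nearer instructions first so that existing paths are found.
        for j in range(i - 1, -1, -1):
            defs_j, uses_j = instructions[j]
            depends = (defs_j & uses_i) or (uses_j & defs_i) or (defs_j & defs_i)
            if depends and j not in ancestors[i]:
                edges[i].add(j)
                ancestors[i] |= {j} | ancestors[j]
        if not edges[i]:
            roots.append(i)
    return edges, roots


block = [
    ({"r0"}, {"r0"}),        # 0: sub r0, r0, #5
    ({"r1"}, {"r0"}),        # 1: mov r1, r0
    ({"r3"}, {"r1"}),        # 2: sub r3, r1, #10
    ({"r2"}, {"r1"}),        # 3: add r2, r1, #81
    ({"r4"}, {"r3", "r2"}),  # 4: sub r4, r3, r2
    ({"r0"}, {"r0"}),        # 5: sub r0, r0, #5
]
edges, roots = build_dfg(block)
```

Instruction 5 depends on both instruction 0 and instruction 1, but only the edge to instruction 1 is inserted, because a path from instruction 0 already exists through it.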
2.5 Summary
In this chapter an introduction to the ARM instruction set and the architectural
specialties was given. Furthermore, the preliminary analyses that are necessary for the
actual procedural abstraction process in chapter 3 have been explained.
sub r0, r0, #5
mov r1, r0
sub r3, r1, #10
add r2, r1, #81
sub r4, r3, r2
sub r0, r0, #5
add r6, r4, r4
mul r7, r5, r6
mov pc, lr
(a) Basic block. (b) DFG.
Figure 2.9: Constructed DFG for an exemplary basic block.
3 Graph based Procedural Abstraction for ARM
After we have discussed the various preliminary analyses for the DFG based procedural
abstraction in chapter 2, this chapter will now explain the optimization itself. We will
introduce a method to determine the fragments with the highest benefit among all code
fragments that were found by gSpan during the mining process. Afterwards different
extraction methods will be discussed.
3.1 Preparing the Mining of the DFG
As already described in section 1.4, we use a modified gSpan to mine for common code
sequences on the DFGs. In order for gSpan to mine on the DFGs of all the basic
blocks in the program, they must be transformed. gSpan works on graphs of integers. To
convert the object based DFG data structure into an integer representation, a mapping
between the instruction objects and their integer representations must be created. As
equal instructions must have the same integer value, the hashCode() procedure every
Java object offers [30] cannot be used. The default implementation of the hashCode()
procedure typically derives its result from the memory location of each object. Because of
this, equal objects do not get the same integer representation. Although the hashCode()
procedure could be overridden in a way that it returns the same integer value for equal
instructions, this would be quite difficult to achieve because of the constraints that have
to be met [30, 31]. The Java specification demands that objects that are equal according to
their equals(...) methods return the same hash code through the hashCode() procedure;
for the mining, different instructions would in addition have to return different hash
codes. Because of these problems an external mapping between instructions and their
integer representation was created. An instruction can be passed to this mapping facility,
and the integer associated with this instruction is returned. If the instruction has not
been seen before, a new integer value is returned.
This approach is very simple and flexible. At the moment only identical instructions
(mnemonic and all operands are equal) are mined, but the mapping facility can easily
be modified to take only the mnemonic, or the mnemonic and a canonical form of the
operands, into account. This way the accuracy of the mining process can be modified
without the need to change the algorithms used in gSpan.
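Such a mapping facility can be sketched with a dictionary that hands out consecutive integers. The following sketch is written in Python for illustration and is not the code of the thesis; the class name `InstructionMapping` and the `key` parameter, which selects how much of an instruction is compared, are hypothetical.

```python
class InstructionMapping:
    """Assign one integer to every distinct instruction key."""

    def __init__(self, key=lambda mnemonic, operands: (mnemonic, operands)):
        self._key = key
        self._ids = {}

    def id_of(self, mnemonic, operands):
        k = self._key(mnemonic, tuple(operands))
        # setdefault hands out the next free integer for an unseen key.
        return self._ids.setdefault(k, len(self._ids))


# Exact matching: mnemonic and all operands must be equal.
exact = InstructionMapping()
# Coarser matching: only the mnemonic is taken into account.
by_mnemonic = InstructionMapping(key=lambda mnemonic, operands: mnemonic)
```

Exchanging the key function changes the granularity of the mining without touching the mining algorithm itself.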
As at the moment only individual basic blocks can be mined, not all of the basic blocks
of a program need to be analyzed. Table 3.1 shows the distribution of the size (number
of instructions) of the basic blocks and the number of their occurrences. Code
fragments need a certain minimal length to be abstracted in a way that
reduces the program code. There have to be at least five instructions in a basic block.
All basic blocks that have fewer than five instructions therefore need not be transformed
into the gSpan data structures and are not analyzed. They are too short to contribute
useful code fragments for the extraction process. Such code fragments would also be
filtered out by the weighting procedure described in section 3.2. Now that the DFG is
set up, gSpan is started to mine for frequent fragments, i.e. common code sequences.
Table 3.1: Instructions distribution in basic blocks.
Instructions   Number of blocks
per block      bitcnts  crc  dijkstra  patricia  qsort  rijndael  search  sha
1 ≤ x < 5          367  344       420       456    380       406     335  367
5 ≤ x < 10         172  162       214       236    289       189     176  168
10 ≤ x < 50         77   66        83        90    103        85      69   77
50 ≤ x               1    1         2         2      2        11       1    1
3.2 Weighting Function
After gSpan has mined for frequent fragments and has eliminated duplicate fragments,
the result has to be rated by a weighting function. Some fragments could technically
be extracted, but this would result in a bigger binary instead of a smaller one. This
is the case if the overhead to procedurally abstract a found fragment is higher
than the saving. In order to determine which of the identified code fragments shall be
abstracted a weighting function is used. This function takes a fragment and determines
the amount of code that can be saved if this fragment is extracted by one of
the extraction methods described in section 3.3. The formula that is used to calculate
the saving is:
savingByte = sizeByte · (numberEmbeddings − 1) − (PrologueByte + EpilogueByte) (3.1)
Although the ARM architecture only has fixed-width instructions (4 bytes), support
for the ARM Thumb instruction set [24, 25] (2 bytes) will be added in the future.
Therefore the saving is not given in the number of instructions but in the number of
bytes. The size of the fragment is multiplied by the number of embeddings (number
of occurrences of the fragment in all basic blocks). One occurrence has to be
subtracted from the number of embeddings, as at least one instance of the code has to
remain in the program in order for the program to work correctly after the procedural
abstraction. We call prologue the instructions that are necessary in order to jump from
the original location of the code to the abstracted procedure. Depending on the method
of extraction this includes a call or jump instruction, the exchange of registers,
storage of the program counter (in order to be able to return from the abstracted
procedure to the original location in the program), etc. Analogously to the prologue,
epilogue names the instructions that are necessary to return from the abstracted code
to the original location in the program. Depending on the type of fragment, a return
instruction is added to the code that has been extracted into the procedure, and
instructions that change the registers back to their original context, restore the
program counter, etc. are added to the original occurrences of the abstracted code. In
the following section we will have a look at the types of fragments and the different
extraction methods that are used to procedurally abstract them.
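Equation (3.1) translates directly into a small function. The following is only a sketch; the function names and the sample byte counts are illustrative assumptions, not values taken from the framework.

```python
def saving_bytes(size_bytes: int, num_embeddings: int,
                 prologue_bytes: int, epilogue_bytes: int) -> int:
    """Equation (3.1): bytes saved by abstracting one fragment."""
    return size_bytes * (num_embeddings - 1) - (prologue_bytes + epilogue_bytes)


def worth_extracting(size_bytes: int, num_embeddings: int,
                     prologue_bytes: int, epilogue_bytes: int) -> bool:
    """A fragment is only a candidate if extracting it shrinks the binary."""
    return saving_bytes(size_bytes, num_embeddings,
                        prologue_bytes, epilogue_bytes) > 0


# Hypothetical fragment: 6 instructions (24 bytes), 3 embeddings,
# 24 bytes of prologue and 16 bytes of epilogue in total.
print(saving_bytes(24, 3, 24, 16))      # 24 * 2 - 40 = 8
# Hypothetical fragment that is too small to pay off.
print(worth_extracting(20, 2, 16, 12))  # 20 * 1 - 28 < 0
```

A negative or zero result corresponds to the case mentioned above in which the overhead is at least as high as the saving.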
3.3 Extraction
In this section the supported extraction methods are described. For each method all the
constraints that must be fulfilled in order to use it, as well as the prologue and
epilogue costs, will be given. The prologue and epilogue costs cover the calling of the
procedurally abstracted code fragments and general instructions like storing the PC to
a free register or onto the stack.
3.3.1 Cross Jump
In case all embeddings of a fragment end with a return or with an unconditional branch
to the same label (as in Figure 3.2 and Figure 3.1) a special case of procedural
abstraction can be used. This special case is called cross jump or tail merge [17, 18].
As all the embeddings continue with the same basic block after the code that can be
abstracted, the overhead of creating and calling a procedure can be avoided. On ARM
architectures there is only one way to return from a procedure to its calling
procedure, namely by writing the address of the instruction that shall follow into the
program counter (PC) register. The return address must be saved by the calling
procedure before entering the called procedure either into a special register, the
so-called link register (LR), or onto the stack. Either way the return address has been
set before the abstracted fragment code is called, and only the restoration of the PC
from the LR or the stack takes place in the abstracted code. The cost of a cross jump
extraction is very low. As we need only one unconditional jump instruction for each
embedding whose code is removed, the prologue cost is:
PrologueByte = (numberEmbeddings − 1) · jumpByte (3.2)
As the code that we want to extract must remain once in the program, one embedding
is not changed at all; for all other embeddings (numberEmbeddings − 1) a jump to the
unchanged embedding must be inserted. Because of the special constraint of a cross
jump extractable fragment, there are no further costs for the prologue. The abstracted
code fragment is able to return to the original code position by itself.
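As a worked instance of equations (3.1) and (3.2), consider the fragment of Figure 3.1: five instructions of 4 bytes each with two embeddings. Assuming that the jump is a single 4-byte branch and that no epilogue is needed (the text names no further costs for this method), the saving evaluates to:

```latex
\begin{align*}
\mathit{Prologue}_{Byte} &= (2 - 1) \cdot 4 = 4 \\
\mathit{saving}_{Byte}   &= 20 \cdot (2 - 1) - (4 + 0) = 16
\end{align*}
```

This agrees with the listing, which shrinks from ten to six instructions.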
...
1404: mov r1, r0
1408: add r2, r0, #81
140c: mul r7, r5, r6
1410: sub r6, r5, #4
1414: b 0x54c4
...
260c: mov r1, r0
2610: add r2, r0, #81
2614: mul r7, r5, r6
2618: sub r6, r5, #4
261c: b 0x54c4
...
(a) before procedural abstraction
...
1404: b 0x260c
...
260c: mov r1, r0
2610: add r2, r0, #81
2614: mul r7, r5, r6
2618: sub r6, r5, #4
261c: b 0x54c4
...
(b) after procedural abstraction
Figure 3.1: ARM assembler code that can be extracted through a cross jump because
each block ends with an unconditional branch.
3.3.2 Subroutine call
As the constraints for cross jump extractions (embeddings must end with an
unconditional branch to the same label or with a return) are very strict, a more
general approach is used to extract most code fragments. For this approach, no special
constraints must be met by the code to be extracted. A new procedure is created for
this code sequence. This procedure comprises the instructions to be extracted, and all
occurrences of this code are replaced by a call to the newly created procedure. As the
new procedure does not return to the correct position in the program sequence by
itself, the return address must
...
2004: mov r1, r0
2008: add r2, r0, #81
200c: mul r7, r5, r6
2010: sub r6, r5, #4
2014: mov pc, lr
...
340c: mov r1, r0
3410: add r2, r0, #81
3414: mul r7, r5, r6
3418: sub r6, r5, #4
341c: mov pc, lr
...
(a) before procedural abstraction
...
2004: b 0x340c
...
340c: mov r1, r0
3410: add r2, r0, #81
3414: mul r7, r5, r6
3418: sub r6, r5, #4
341c: mov pc, lr
...
(b) after procedural abstraction
Figure 3.2: ARM assembler code that can be extracted through a cross jump because
each block ends with a return.
be saved for each embedding before the procedure can be called. This can be achieved
by finding a register that is not used (a so-called dead register) in the new procedure.
In this dead register the value of the LR is saved. Then the current PC value is saved
to the LR and the new procedure is called. For this approach there must be
one dead register for each embedding that will call the abstracted procedure. These
registers do not necessarily have to be identical, as saving and restoring the LR must
be done separately for each embedding directly before and after the call of the
procedure. To identify spare registers a sophisticated live register analysis is
needed [25, 26].
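To make the notion of a dead register concrete, the sketch below runs a standard backward liveness pass over a single basic block and derives the registers that could hold the LR across one embedding. It is a strong simplification of the analysis referred to above: the set of registers live at the end of the block is assumed to be given (a real analysis has to propagate it over the whole control flow graph), only r0 to r12 are considered, and all names are hypothetical.

```python
from typing import List, Set, Tuple

# One instruction is modelled as (registers it defines, registers it uses).
Instruction = Tuple[Set[str], Set[str]]

# Assumption: r0..r12 are candidates; sp, lr and pc are never used as spare.
CANDIDATES: Set[str] = {f"r{i}" for i in range(13)}


def live_before_each(block: List[Instruction], live_out: Set[str]) -> List[Set[str]]:
    """Backward pass over one block: live_in = uses | (live_out - defs)."""
    live = set(live_out)
    result: List[Set[str]] = []
    for defs, uses in reversed(block):
        live = uses | (live - defs)
        result.append(set(live))
    result.reverse()
    return result


def spare_registers(block: List[Instruction], live_out: Set[str],
                    start: int, end: int) -> Set[str]:
    """Registers that may hold LR while block[start:end] runs as a procedure.

    A candidate must not be read or written inside the embedding and its
    value must not be needed after the embedding.
    """
    live = live_before_each(block, live_out)
    live_after = live[end] if end < len(block) else set(live_out)
    touched: Set[str] = set()
    for defs, uses in block[start:end]:
        touched |= defs | uses
    return CANDIDATES - live_after - touched


# mov r1, r0 / add r2, r0, #81 / mul r7, r5, r6 / add r3, r1, r2
block: List[Instruction] = [
    ({"r1"}, {"r0"}),
    ({"r2"}, {"r0"}),
    ({"r7"}, {"r5", "r6"}),
    ({"r3"}, {"r1", "r2"}),
]
spare = spare_registers(block, live_out={"r3", "r7"}, start=0, end=3)
```

In this example the first three instructions form the embedding; r3 qualifies as spare because it is overwritten right after the embedding and is not touched inside it.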
The costs for this method would be:
PrologueByte = numberEmbeddings · (SaveLR + call) (3.3)
EpilogueByte = (numberEmbeddings · RestoreLR) + return (3.4)
The cost of this extraction approach is an overhead of 12 bytes per
embedding (prologue and epilogue together), and another 4 bytes must be added for
the return statement that must be inserted as the last instruction into the abstracted
procedure. Figure 3.3 shows an example for this approach.
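Combining equations (3.1), (3.3) and (3.4) shows how long a fragment has to be before this method pays off. The sketch below assumes that saving the LR, the call, restoring the LR and the return each take one 4-byte ARM instruction, as the 12 + 4 bytes quoted above suggest; the function names are hypothetical.

```python
INSTRUCTION_BYTES = 4  # fixed-width ARM instruction


def subroutine_call_saving(fragment_instructions: int, num_embeddings: int) -> int:
    """Saving in bytes when LR is kept in a spare register."""
    save_lr = call = restore_lr = ret = INSTRUCTION_BYTES
    size = fragment_instructions * INSTRUCTION_BYTES
    prologue = num_embeddings * (save_lr + call)   # equation (3.3)
    epilogue = num_embeddings * restore_lr + ret   # equation (3.4)
    return size * (num_embeddings - 1) - (prologue + epilogue)  # equation (3.1)


def min_profitable_length(num_embeddings: int) -> int:
    """Smallest fragment length (in instructions) with a positive saving."""
    if num_embeddings < 2:
        raise ValueError("a fragment needs at least two embeddings")
    length = 1
    while subroutine_call_saving(length, num_embeddings) <= 0:
        length += 1
    return length


print(subroutine_call_saving(9, 2))  # fragment of Figure 3.3: 36 - 28 = 8
print(min_profitable_length(2))      # 8 instructions
print(min_profitable_length(3))      # 6 instructions
```

For the nine-instruction fragment of Figure 3.3 with two embeddings this yields 8 bytes, which matches the listing shrinking from 18 to 16 instructions.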
If not all embeddings have a spare register the LR can be stored into, another approach
must be used. If the code fragment to be extracted does not have any instructions that
modify the stack, the return address for the procedure call can be stored onto the stack.
...
2004: mov r1, r0
2008: add r2, r0, #81
200c: mul r7, r5, r6
2010: sub r6, r5, #4
2014: mov r8, r9
2018: add r1, r1, r8
201c: sub r9, r2, #41
2020: sub r6, r5, r9
2024: mov r5, r6
...
340c: mov r1, r0
3410: add r2, r0, #81
3414: mul r7, r5, r6
3418: sub r6, r5, #4
341c: mov r8, r9
3420: add r1, r1, r8
3424: sub r9, r2, #41
3428: sub r6, r5, r9
342c: mov r5, r6
...
(a) before procedural abstraction
...
2004: mov r10, lr
2008: bl 0x5000
200c: mov lr, r10
...
340c: mov r11, lr
3410: bl 0x5000
3414: mov lr, r11
...
5000: mov r1, r0
5004: add r2, r0, #81
5008: mul r7, r5, r6
500c: sub r6, r5, #4
5010: mov r8, r9
5014: add r1, r1, r8
5018: sub r9, r2, #41
501c: sub r6, r5, r9
5020: mov r5, r6
5024: mov pc, lr
...
(b) after procedural abstraction
Figure 3.3: ARM assembler code that can be extracted through a procedure call; LR is
saved in spare registers.
If the fragment contains a call instruction, the called procedures (and any procedure
that is used by them) are not allowed to modify the stack either in order for this
extraction method to work. Because we do not want to adjust all stack-relative addresses
in the abstracted code fragment, we write the return address of the calling procedure
onto the stack without modifying the stack pointer itself. This way no adjustment has
to be made to any instructions that read from it. As we write onto the stack without
adjusting the stack pointer, the return address would be overwritten by any instruction
that writes to the stack. But because we only apply this abstraction method to
fragments without such instructions, the return address will not be overwritten.
Instructions that read memory positions beyond the stack pointer are not critical at
all: as the values are only well defined up to the memory address the stack pointer
points to, instructions must assume that they read random data from addresses beyond
the stack pointer. Therefore it does not matter if these instructions read the return
address that was placed there by this abstraction method. The costs for this method
would be:
PrologueByte = numberEmbeddings · (PushPC + call) (3.5)
EpilogueByte = PopPC (3.6)
As the saved return address can be popped from the stack directly into the PC
register, this implicitly performs a return to the original position in the program
stream and therefore reduces the costs to 8 bytes per embedding and 4 bytes for the
return that has to be added to the abstracted procedure. Figure 3.4 gives an example
for this version. One problem with this approach could be that stack and heap grow
towards each other. If both meet, storing the return address overwrites data in the
heap space. As this problem is a rather theoretical one, it was ignored for this thesis.
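For the fragment of Figure 3.4 (nine instructions, i.e. 36 bytes, with two embeddings) equations (3.1), (3.5) and (3.6) can be evaluated as follows. The calculation assumes that PushPC, call and PopPC are one 4-byte instruction each, in line with the 8 bytes per embedding and 4 bytes for the return stated above:

```latex
\begin{align*}
\mathit{Prologue}_{Byte} &= 2 \cdot (4 + 4) = 16 \\
\mathit{Epilogue}_{Byte} &= 4 \\
\mathit{saving}_{Byte}   &= 36 \cdot (2 - 1) - (16 + 4) = 16
\end{align*}
```

This corresponds to the listing, which shrinks from 18 to 14 instructions.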
...
2004: mov r1, r0
2008: add r2, r0, #81
200c: mul r7, r5, r6
2010: sub r6, r5, #4
2014: mov r8, r9
2018: add r1, r1, r8
201c: sub r9, r2, #41
2020: sub r6, r5, r9
2024: mov r5, r6
...
340c: mov r1, r0
3410: add r2, r0, #81
3414: mul r7, r5, r6
3418: sub r6, r5, #4
341c: mov r8, r9
3420: add r1, r1, r8
3424: sub r9, r2, #41
3428: sub r6, r5, r9
342c: mov r5, r6
...
(a) before procedural abstraction
...
2004: str pc, [sp]
2008: b 0x5000
...
340c: str pc, [sp]
3410: b 0x5000
...
5000: mov r1, r0
5004: add r2, r0, #81
5008: mul r7, r5, r6
500c: sub r6, r5, #4
5010: mov r8, r9
5014: add r1, r1, r8
5018: sub r9, r2, #41
501c: sub r6, r5, r9
5020: mov r5, r6
5024: ldr pc, [sp] #4
...
(b) after procedural abstraction
Figure 3.4: ARM assembler code that can be extracted through a procedure call; PC is
saved to the stack.
If the code fragment does include instructions that modify the stack, we cannot
simply write the return address onto the stack. In this case one possibility is to
identify all instructions that access memory locations on the stack. These must then be
modified to point to adjusted locations if we still want to save the return address
onto the stack (this time with modifying the stack pointer). Another possibility is to
create and maintain a shadow stack that is only used to store the return addresses of
the embeddings. As not only the designated stack pointer register can be used to access
memory locations on the stack, every instruction that can potentially access memory
must be analyzed. Only if all memory accesses can be clearly identified can this
optimization be used. The costs in terms of code size for this approach would be:
PrologueByte = numberEmbeddings · (PushPC + call) (3.7)
EpilogueByte = PopPC (3.8)
Adjusting all stack operations so that they still point to the correct stack addresses
(the stack pointer is also modified by pushing the return address onto the stack)
is difficult, but it does not require more instructions in the optimized program code
(see Figure 3.5).
If not all memory addresses that are used by instructions in the abstracted fragment
can be identified without doubt, the stack must not be modified, otherwise the
correctness of the program cannot be guaranteed anymore. In this case the return
address may be pushed to a shadow stack that was created only for this purpose. The
cost of this approach would be equal to that of the stack-modifying extraction method,
but the space for the shadow stack must also be taken into account.
3.4 Algorithm
Now that we have discussed all phases of the procedural abstraction, this section
describes the whole algorithm (see Algorithm B.1) and explains how all phases work
together. The algorithm can be divided into three phases: the analysis phase, the
optimization phase and the post-processing phase.
In the analysis phase the binary is read into the internal program representation
(see section 2.4). Then all interwoven address and data pools must be removed
from the instruction stream in order to ensure that all following steps only work on
real instructions. The next step is to go over all instructions that modify the PC
register and search for known modification patterns (see section 2.4). All instructions
that can be matched to a known pattern (call or return pattern) are replaced by special
instructions so that other analyses can rely on these results afterwards. Finally we
start with the optimization phase.
...
2004: mov r1, r0
2008: add r2, r0, #81
200c: mul r7, r5, r6
2010: sub r6, r5, #4
2014: str r8, [sp, #8]!
2018: add r1, r1, r8
201c: sub r9, r2, #41
2020: sub r6, r5, r9
2024: mov r5, r6
...
340c: mov r1, r0
3410: add r2, r0, #81
3414: mul r7, r5, r6
3418: sub r6, r5, #4
341c: str r8, [sp, #8]!
3420: add r1, r1, r8
3424: sub r9, r2, #41
3428: sub r6, r5, r9
342c: mov r5, r6
...
(a) before procedural abstraction
...
2004: str pc, [sp]!
2008: b 0x5000
...
340c: str pc, [sp]!
3410: b 0x5000
...
5000: mov r1, r0
5004: add r2, r0, #81
5008: mul r7, r5, r6
500c: sub r6, r5, #4
5010: str r8, [sp, #4]!
5014: add r1, r1, r8
5018: sub r9, r2, #41
501c: sub r6, r5, r9
5020: mov r5, r6
5024: ldr pc, [sp]! #4
...
(b) after procedural abstraction
Figure 3.5: ARM assembler code that can be extracted through a procedure call; PC is
saved to the stack. All instructions that modify the stack have been adjusted.
In the optimization phase, the processed instruction stream is divided into basic
blocks. For each basic block that has at least five instructions, a DFG is created
(see section 3.1). All generated DFGs are transformed into a representation that can be
used by gSpan. gSpan searches for code fragments that have at least five instructions
and occur at least twice. After all fragments that fulfill these criteria are found,
they are sorted according to their weight (see section 3.2). The fragment with the
highest weight is extracted from the code. As the procedural abstraction modifies some
of the basic blocks and therefore also the DFGs, one can only extract one fragment at a
time¹. The DFGs for the modified basic blocks are created anew and gSpan is started
again to mine for new common code sequences. The optimization phase therefore lasts as
long as fragments that can be extracted are found.
¹ The global optimum over all embeddings cannot be calculated, as the number of
embeddings is far too high even for small programs.
If no more fragments can be extracted, the optimization phase has ended and the
post-processing phase follows. In this phase the program is written into an assembly
file that can afterwards be assembled into a binary again.
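The control flow of the optimization phase can be summarized in a few lines. The sketch below is not Algorithm B.1; it only mirrors the greedy loop described above, with mining, weighting and extraction passed in as placeholder callables (`mine`, `weight` and `extract` are hypothetical names).

```python
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")  # program representation
F = TypeVar("F")  # mined fragment


def abstract_program(program: P,
                     mine: Callable[[P], Sequence[F]],
                     weight: Callable[[F], int],
                     extract: Callable[[P, F], P]) -> P:
    """Repeat: mine, keep fragments with a positive saving, extract the best one."""
    while True:
        candidates = [f for f in mine(program) if weight(f) > 0]
        if not candidates:
            return program  # nothing left to extract: post-processing follows
        best = max(candidates, key=weight)
        # Only one fragment per round, since extraction changes the DFGs.
        program = extract(program, best)
```

The loop terminates only if every extraction actually changes the program so that the extracted fragment is not mined again; with a strictly positive saving per round this holds because the program shrinks each time.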
4 Evaluation
In this chapter we evaluate the instruction saving that can be achieved by applying
the graph-based procedural abstraction that was introduced in the previous chapter. We
evaluated several different programs and measured the instruction saving with our new
approach. These results are then compared to the results the traditional procedural
abstraction approach could achieve.
4.1 Benchmark Suite
To evaluate the procedural abstraction framework implemented in this thesis, a set
of benchmark programs is used. In contrast to other fields of computer science like
computer graphics, database systems, etc. there is no commonly used benchmark suite
for embedded systems. This is because the range of applications in the field of
embedded systems is far too broad. Also, embedded CPUs are often specifically optimized
for the task they were built to do. For this reason those parts of the MiBench
benchmark suite [32] whose tasks are rather typical for embedded systems have been
taken. The following benchmark programs have been used:
• bitcnts: calculates the number of bits that are needed to encode a given input.
Furthermore it generates a few statistics about the usage of the bits.
• crc: calculates the cyclic redundancy check (CRC). CRC is a type of hash function
used to produce a checksum. These checksums are used to detect and correct errors
after transmission or storage. They are calculated before transmission or storage
and are checked afterwards by the recipient to confirm that the data has not changed.
• dijkstra: determines the shortest path from a single source to all other nodes
in a graph. This algorithm is often used in mobile applications such as
navigation systems.
• patricia: calculates the patricia tree. This is a specialized set data structure
based on the trie that is widely used in the area of IP routing, where the ability
to contain large ranges of values with a few exceptions is particularly suited to the
hierarchical organization of IP addresses.
• qsort: reads an unsorted list from standard input and uses the quicksort sorting
algorithm to sort the input data. The sorted list is written to standard output.
• rijndael: encrypts the given input data by using the advanced encryption standard
(AES), also known as Rijndael. This cipher is a so-called block cipher that works on
a fixed-length group of bits (block) and encrypts that block with a symmetric key.
• search: searches for a given string in the given input data. This is done by using
the Boyer-Moore-Horspool pattern matching algorithm.
• sha: calculates a secure hash algorithm (SHA) value for a given input. SHA is a
cryptographic hash function that is most commonly used in security applications
and protocols such as PGP, SSH, SSL, etc.
As most embedded systems only run one specific application, there is no need for
dynamic libraries. Therefore all of the test programs were compiled with gcc to a
static binary. All binaries were compiled with the -Os switch that tells the compiler
to optimize for size. This means all optimizations that would result in fast but large
code, like loop unrolling [7], are omitted, and additional optimizations that try to
further reduce the size of a binary are added instead. To further decrease the size
of the binaries they were not linked against glibc but against a special embedded
library, the dietlibc. The dietlibc [33] is a standard cross-platform C run-time
library (at the moment x86, arm, sparc, alpha, ppc, mips and s390 architectures are
supported) that aims to be as small as possible and to be SUSv2 and POSIX compatible.
4.2 Measurements
In this section we discuss the results that were achieved with the procedural
abstraction implemented in this thesis. We compare our new DFG approach to the
traditional suffix trie based approach. Table 4.1 and Table 4.2 show the results that
could be achieved with and without maxClique enabled. For each program the total number
of saved instructions for both the sequential chain representation (SEQ) and the DFG
representation is given. Additionally the number of saved instructions has been put in
relation to the number of instructions in the unoptimized binary.
Comparing the maxClique-enabled results in Table 4.1 with the results in Table 4.2
(without maxClique), one sees that the number of saved instructions with maxClique is
always ≥ the number of saved instructions without maxClique. Without maxClique an
average of 1.29% of the instructions in the sequential and 1.91% in the DFG
representation can be saved, whereas with maxClique 2.33% and 3.38% of all
instructions, respectively, can be saved. The overall benefit with maxClique is
consequently almost twice as high as without. This therefore
Table 4.1: Saved instructions in the benchmark suite with maximal clique activated
during search.
Program    # Instructions    Saving #        Saving %
                             SEQ    DFG      SEQ    DFG
bitcnts         3946          53     79      1.34   2.00
crc             3584          84    117      2.34   3.26
dijkstra        4632         113    168      2.44   3.63
patricia        5039         151    193      3.00   3.83
qsort           4770         147    199      3.08   4.17
rijndael        7113         128    257      1.80   3.61
search          3717          84    108      2.26   2.91
sha             3897          95    120      2.44   3.08
total          36698         855   1241      2.33   3.38
justifies the additional time that must be spent for the maxClique approach.
The greatest difference between SEQ and DFG can be seen for the rijndael program
with maxClique enabled. Because of the operating mode of the encryption algorithm,
many very similar code sequences were generated by the compiler. In order to speed up
the execution of the program the compiler has reordered the instructions in these
sequences in order to overlap load operations with computation [7]. Because of the
instruction reordering or rescheduling, the old sequential approaches cannot identify
most of the identical code, and therefore only 128 instructions or 1.80% can be saved.
Because the DFG approach is insensitive to the instruction order, our new approach is
able to identify many more identical code sequences and therefore 257 instructions or
3.61% of all instructions can be saved.
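The percentages quoted above can be recomputed from the raw counts in Table 4.1. The short script below does this for rijndael and for the totals; the data is copied from the table, while the variable and function names are only illustrative.

```python
# program: (instructions, saved with SEQ, saved with DFG), copied from Table 4.1
TABLE_4_1 = {
    "bitcnts": (3946, 53, 79),
    "crc": (3584, 84, 117),
    "dijkstra": (4632, 113, 168),
    "patricia": (5039, 151, 193),
    "qsort": (4770, 147, 199),
    "rijndael": (7113, 128, 257),
    "search": (3717, 84, 108),
    "sha": (3897, 95, 120),
}


def saving_percent(saved: int, total: int) -> float:
    """Saved instructions as a percentage of the unoptimized binary."""
    return round(100.0 * saved / total, 2)


total_instr = sum(row[0] for row in TABLE_4_1.values())
total_seq = sum(row[1] for row in TABLE_4_1.values())
total_dfg = sum(row[2] for row in TABLE_4_1.values())

instr, seq, dfg = TABLE_4_1["rijndael"]
print(saving_percent(seq, instr), saving_percent(dfg, instr))   # 1.8 3.61
print(saving_percent(total_seq, total_instr),
      saving_percent(total_dfg, total_instr))                    # 2.33 3.38
print(round(dfg / seq, 2))                                       # 2.01
```

For rijndael the DFG representation thus saves about twice as many instructions as the sequential one.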
The bitcnts program shows the smallest optimization results. As this program only
processes the given input and calculates the number of bits that are needed to
represent it, it does not offer as much optimization potential as other test programs
in the benchmark suite. Nevertheless the DFG approach can reduce the size of the binary
by 79 instructions whereas the traditional approach is only able to reduce the size by
53 instructions. This is an improvement of about 49%.
Figure 4.1 shows the increased saving between the traditional sequential chain repre-
sentation and our new approach in percent. As one can see the optimization gain that
can be achieved by our new approach goes from 31% for the qsort program up to 120%
Table 4.2: Saved instructions in the benchmark suite with maximal clique deactivated
during search.
Program    # Instructions    Saving #        Saving %
                             SEQ    DFG      SEQ    DFG
bitcnts         3946          53     72      1.34   1.82
crc             3584          46     64      1.28   1.79
dijkstra        4632          68    103      1.47   2.22
patricia        5039          70     98      1.39   1.94
qsort           4770          70    103      1.47   2.16
rijndael        7113          67    123      0.94   1.73
search          3717          46     65      1.24   1.75
sha             3897          55     74      1.41   1.90
total          36698         475    702      1.29   1.91
for the rijndael benchmark. On average an increase of 57% can be seen for all programs
in the benchmark suite.
The relatively small saving gain of about 35% (with maxClique) that can be seen for
the qsort program is due to the algorithm itself. Quicksort sorts by using a divide
and conquer strategy. It first picks an element, the so-called pivot element, from the
list. After that the list is reordered so that all elements that are smaller than the
pivot element come before it and all elements that are greater come after it. After the
partitioning, the pivot element is in its final position and quicksort recursively
sorts the sub-lists. As most of the time the program must compare the values at
different memory locations and swap them, there are many data dependencies between the
instructions and therefore the compiler cannot perform optimizations like instruction
reordering. Because of the great number of data dependencies the DFG and sequential
representation are mostly identical. That leaves only a few additional sequences that
can only be extracted by our approach.
In Figure 4.2 a comparison between the used extraction methods in the sequential and
DFG representation is given. As one can see, cross jump extraction occurs seldom
in all test constellations. In order to be cross jump extractable, an instruction
sequence must end with a return or an unconditional jump instruction (as described in
section 3.3). Such sequences are not very common.
Figure 4.1: Saving increase between sequential and DFG representation with and
without maximal clique activated during search.
To document the reasons for the higher instruction saving that is achieved by the DFG
representation compared to the sequential representation, we have analyzed the DFG
representation in detail. Table 4.3 gives an overview of the number of instructions
in the DFG representation and their connection to other instructions. Instructions
that have no incoming or no outgoing edges in a DFG are either root nodes (no
incoming edges) or leaf nodes (no outgoing edges). In the sequential representation all
instructions but the first and last instruction of each basic block have exactly one
predecessor and one successor and therefore an in- and outdegree of one. The reasons
for the better instruction saving can only be explained if we look at the number of
instructions whose indegree or outdegree is greater than one (see Table 4.4). At these
instructions the data flow either splits into several concurrent flows or merges
independent data flows into a single one. These are the instructions that enable our
new approach to find more instruction sequences that can be extracted. As these
instructions make up 30% of all instructions in the programs, one can see why the DFG
approach is far more promising than the traditional one.
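The degree statistics of Tables 4.3 and 4.4 can be gathered with a few lines once a DFG is available as an edge list. The sketch below assumes nodes numbered from 0 and edges given as (source, destination) pairs; this representation and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (producer instruction, consumer instruction)


def degrees(num_nodes: int, edges: List[Edge]) -> Dict[int, Tuple[int, int]]:
    """Map every node to its (indegree, outdegree)."""
    indeg = Counter(dst for _, dst in edges)
    outdeg = Counter(src for src, _ in edges)
    return {n: (indeg[n], outdeg[n]) for n in range(num_nodes)}


def count_branching(num_nodes: int, edges: List[Edge]) -> int:
    """Number of nodes whose indegree or outdegree is greater than one."""
    return sum(1 for i, o in degrees(num_nodes, edges).values() if i > 1 or o > 1)


# A sequential chain has no such node ...
chain = [(0, 1), (1, 2), (2, 3)]
# ... whereas here node 2 merges two flows and splits into two.
dfg = [(0, 2), (1, 2), (2, 3), (2, 4)]
print(count_branching(4, chain), count_branching(5, dfg))  # 0 1
```

Nodes counted by `count_branching` are exactly those at which the DFG differs structurally from a sequential chain.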
4.3 Summary
In this chapter an overview of the various benchmark programs was given. We showed
how the instruction saving could be increased by up to a factor of 2 compared to the
suffix trie approach. In chapter 5 we will discuss various approaches to further
increase the instruction saving.
Figure 4.2: Used extraction mechanisms with and without maxClique activated during
search. Panels: (a) Sequential, (b) DFG.
Table 4.3: Indegree and outdegree of all instructions.
Program    Type   Degree
                     0      1     2    3    4    5   6   7   8
bitcnts    In      447   2048   404   61   34   12   2   0   1
           Out     355   2208   350   64   15    9   6   2   0
crc        In      418   1836   341   54   32   11   3   0   1
           Out     327   1992   291   55   15    8   6   2   0
dijkstra   In      548   2429   444   77   35   12   2   0   1
           Out     430   2634   382   66   18   10   6   2   0
patricia   In      590   2659   465   91   38   14   3   0   1
           Out     464   2877   400   77   23   12   6   2   0
qsort      In      574   2511   458   81   42   13   2   0   1
           Out     449   2734   380   75   26   10   6   2   0
rijndael   In      538   3839  1410  186   82   24   3   0   1
           Out     408   4120  1256  205   53   24  12   2   3
search     In      435   1955   367   57   29   11   2   0   1
           Out     343   2118   311   56   12    9   6   2   0
sha        In      451   2004   378   71   36   11   3   0   1
           Out     355   2174   322   66   19   10   7   2   0
total      In     4001  19281  4267  678  328  108  20   0   8
           Out    3131  20857  3692  664  181   92  55  16   3
Table 4.4: Number of instructions with (degreeIN ∨ degreeOUT) > 1 in all DFGs that
are used for mining.
Program degree > 1 degree ≤ 1
bitcnts 859 2150
crc 730 1966
dijkstra 932 2616
patricia 1010 2851
qsort 980 2702
rijndael 2542 3541
search 776 2081
sha 834 2121
total 8663 20028
5 Future Work
This chapter gives an overview of possible extensions of the current procedural
abstraction framework. We will discuss how these extensions affect the process of code
extraction and why they improve the instruction saving.
Fuzzy Instruction Match
At the moment the instructions in the embeddings of a found fragment must be
completely identical (see section 3.1). This means that code sequences that only differ
in registers are not identified as identical and can therefore not be extracted at the
moment. Thus, instead of mining for identical instructions, one can mine for
instructions that are canonically equal. In a canonical representation two instructions
are equal if they have the same mnemonic and the same number and type of operands.
Figure 5.1 shows ARM instructions and their canonical representation, where R denotes a
register and I an immediate value. To further increase the number of fragments that are
found by the miner, one can also define two instructions as equal if they have the same
mnemonic, regardless of number and type of operands. These approaches will increase the
number of fragments that can be found by the miner but need greater effort in the
actual extraction phase. In the first case (canonical representation) registers must be
swapped in order to make the embeddings actually extractable. In the second case
(mnemonic representation) immediate values must be transformed in addition to the
register swapping, for which a semantic analysis must be implemented. Both approaches
add additional cost that must be considered in the weighting function (see section 3.2).
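A canonical form in the sense described above can be derived from the textual instruction with a small amount of pattern matching. The following sketch only covers plain register and immediate operands as they appear in Figure 5.1; memory operands, shifts and register lists are left untouched, and the function name `canonical` is hypothetical.

```python
import re

_REGISTER = re.compile(r"^(r\d+|sp|lr|pc)$", re.IGNORECASE)
_IMMEDIATE = re.compile(r"^(#-?\w+|0x[0-9a-f]+|-?\d+)$", re.IGNORECASE)


def canonical(instruction: str) -> str:
    """Replace registers by R and immediates by I, keep the mnemonic."""
    parts = instruction.split(None, 1)
    mnemonic = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    out = []
    for op in (o.strip() for o in rest.split(",")):
        if not op:
            continue
        if _REGISTER.match(op):
            out.append("R")
        elif _IMMEDIATE.match(op):
            out.append("I")
        else:
            out.append(op)  # not handled in this sketch: addressing modes etc.
    return f"{mnemonic} {', '.join(out)}" if out else mnemonic


for line in ["mov r1, r0", "add r2, r0, #81", "mul r7, r5, r6",
             "sub r6, r5, #4", "b 0x54c4"]:
    print(canonical(line))
```

With this form, `add r2, r0, #81` and `add r4, r1, #7` map to the same string `add R, R, I` and would therefore be treated as equal by the miner.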
Enlargement of the Search Area
Another approach to increase the number of found fragments is to widen the search
area. By increasing the search area the miner is enabled to find more and
larger fragments that can be extracted. At the moment the search area is restricted to
the size of basic blocks. The search area can be increased incrementally. In a first
step all the instructions of the basic blocks that are too small (see Table 3.1) will
be merged with their preceding basic blocks.
mov r1, r0
add r2, r0, #81
mul r7, r5, r6
sub r6, r5, #4
b 0x54c4
(a) ARM assembler instructions.
mov R, R
add R, R, I
mul R, R, R
sub R, R, I
b I
(b) Canonical representation.
Figure 5.1: ARM assembler instructions and their canonical representation.
Figure 5.2 shows three distinct DFGs that are merged across basic block boundaries
into one large DFG. The three basic blocks follow each other sequentially in the
control flow. This decreases the number of basic blocks that are not considered because
they are too small and increases the number and size of the basic blocks that are mined.
In a second step, the information of the CFGs is added to the generation of DFGs.
This leads to an even larger search area, as the CFG models the control flow through a
procedure. The DFGs of all basic blocks in a procedure will be merged into one single
DFG that models the data flow for the whole procedure.
The last step is to create a call graph for the program. A call graph models the
relation between procedures; each procedure in the program is represented by a node in
the graph. Each time a procedure calls another one, an edge is added to the call graph
between the two nodes. With the help of the call graph and the CFGs of the individual
procedures an ICFG (interprocedural CFG) is built for the whole program (see chapter
2). As in the second step, the information of the ICFG is then used to create only
one DFG for the whole program. This DFG then represents the maximum search area,
because all instructions in all basic blocks are combined in one graph. This enables
gSpan to find the maximal number of fragments.
Distinction of Disjunctive Memory Locations
To further increase the number of found fragments, the data dependencies between
single instructions should be analyzed more precisely. At the moment two instructions
have a data dependency between them if both instructions write to a memory location,
regardless of whether one instruction writes to the stack and the other to the heap.
The method must be improved by adding an analysis that distinguishes between
stack and heap memory addresses. This analysis must then be further improved
to also distinguish between different memory locations inside the stack or heap. This
will remove data dependencies between instructions, which leads to more branched out
Figure 5.2: Enlargement of a search area over basic block boundaries. Panels:
(a) distinct DFGs in three basic blocks, (b) DFGs merged over basic block boundaries.
DFGs. Branched out DFGs contain more independent instructions and therefor enable
the miner to ﬁnd more potential code fragments.Figure 5.3 shows load instructions that
access various memory locations. These load instructions have at the moment data de-
pendencies among them, also the accessed memery locations are mostly distinct. The
ﬁrst three marked load instructions access distinct memory locations in the stack and
heap memory and must therefore not have any data dependency among them. On the
other hand the last 2 load instructions access the same memory location on the stack
although the instructions themself seem to be diﬀerent.
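The proposed distinction can be sketched as a small may-alias test. The Python example below is a minimal sketch under strong assumptions that are not part of this thesis: an access is described only by a base register and a constant offset, `sp`- and `fp`-relative accesses are taken to hit the stack, all other accesses are taken to hit the heap, accesses are word-sized and aligned, and the base register is not modified between the two accesses. In real binaries, other registers may also point into the stack, so a production analysis would have to be more careful.

```python
STACK_BASES = {"sp", "fp"}  # assumption: only these registers address the stack


def region(base):
    """Classify an access by its base register (assumed heuristic)."""
    return "stack" if base in STACK_BASES else "heap"


def may_alias(a, b):
    """Return True if two (base, offset) accesses may touch the same location."""
    (base_a, off_a), (base_b, off_b) = a, b
    if region(base_a) != region(base_b):
        return False            # stack and heap are treated as disjoint
    if base_a == base_b:
        return off_a == off_b   # same base: only equal offsets collide
    return True                 # different bases in one region: stay conservative


# Accesses that look different but may denote the same stack slot
# (for example [sp, #4] and [fp, #-8]) keep their dependency.
same_slot_possible = may_alias(("sp", 4), ("fp", -8))
```

Only pairs for which `may_alias` returns `True` would keep a data dependency edge in the DFG; all other pairs become independent and branch out the graph.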
Figure 5.3: Distinction of disjoint memory locations.
More Exact Data Dependency Analysis
During DFG composition, conservative estimations are used for some instructions. A call,
for example, is treated as an alias for the whole procedure it calls. As we do not know
which resources are used by that procedure, a call has DEF and USE dependencies to
all registers and memory locations. By analyzing the registers that are used or modified
in a procedure (and in all procedures that are called from within that procedure) without
being saved to the stack, the conservative data dependency estimation can be replaced by
a more exact one. Similar to the load and store operations, this removes unnecessary
data dependencies between instructions and therefore branches out the DFG, which enables
the miner to produce better results.
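One possible way to obtain such per-procedure register summaries is a fixed-point iteration over the call graph. The Python sketch below assumes two hypothetical inputs that would have to be computed beforehand: `touched`, the registers each procedure itself uses or modifies without saving them, and `calls`, the direct callees of each procedure. It is meant as an illustration of the idea, not as the analysis that would actually be implemented.

```python
def summarize_registers(touched, calls):
    """Return, per procedure, all registers it may touch directly or via callees.

    touched: {procedure: set of registers used/modified without save/restore}
    calls:   {procedure: set of directly called procedures}
    """
    summary = {proc: set(regs) for proc, regs in touched.items()}
    changed = True
    while changed:                      # iterate until nothing grows any more
        changed = False
        for proc, callees in calls.items():
            for callee in callees:
                before = len(summary[proc])
                summary[proc] |= summary[callee]   # inherit the callee's registers
                if len(summary[proc]) != before:
                    changed = True
    return summary


touched = {"main": {"r0"}, "foo": {"r1"}, "bar": {"r2", "r3"}}
calls = {"main": {"foo"}, "foo": {"bar"}, "bar": set()}
summary = summarize_registers(touched, calls)
```

A call to `foo` would then receive DEF and USE dependencies only to the registers in `summary["foo"]` instead of to all registers. Because the sets only grow and are bounded by the register file, the iteration terminates even for recursive procedures.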
Finding Distinct Instructions in the DFG Representation
In contrast to the sequential representation, the DFG representation of the instructions
has the advantage that it is independent of the instruction order. The DFG approach
therefore also finds fragments that cannot be found with the traditional approach. On
the other hand, because the instruction order is broken up, the miner is not able to find
some frequent fragments in the DFG that can be found in the sequential representation.
Figure 5.4 shows the sequential and the DFG representation of a basic block. Frequent
instructions are marked gray. In the sequential representation, the instructions are
connected by their instruction order and can therefore be found (see Figure 5.4(a)).
Because there are no data dependencies between these instructions, however, they are not
connected in the DFG representation and cannot be found by the miner (see Figure 5.4(b)).
The search on the DFG representation must therefore either be improved by adding a
special glue node to the DFGs that connects all otherwise unconnected nodes, or both
representations must be combined in order to find the maximal set of instructions for
extraction.
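The glue node idea can be sketched as follows. The Python example assumes that a DFG is given as a list of node identifiers and a set of directed edges, and it interprets the nodes to be connected as those without any incident data dependency edge; both the representation and the label `<glue>` are assumptions made for this sketch.

```python
GLUE = "<glue>"  # hypothetical label for the artificial node


def add_glue_node(nodes, edges):
    """Connect every isolated node of a DFG to one artificial glue node.

    nodes: list of node identifiers
    edges: set of (source, target) pairs
    Returns a new (nodes, edges) pair; the inputs are left unchanged.
    """
    connected = {node for edge in edges for node in edge}
    isolated = [node for node in nodes if node not in connected]
    if not isolated:
        return list(nodes), set(edges)          # nothing to glue together
    glue_edges = {(GLUE, node) for node in isolated}
    return list(nodes) + [GLUE], set(edges) | glue_edges


nodes = ["i0", "i1", "i2", "i3"]
edges = {("i0", "i1")}                          # i2 and i3 have no dependencies
glued_nodes, glued_edges = add_glue_node(nodes, edges)
```

After this step the formerly isolated instructions `i2` and `i3` belong to one connected subgraph, so a miner that only reports connected fragments can at least consider them together.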
Figure 5.4: Frequent instructions (marked gray) in sequential and DFG representation.
(a) Sequential representation. (b) DFG representation.
DAG Miner
The DFGs that are used for our new procedural abstraction approach are directed and
acyclic and are therefore represented as DAGs. The gSpan algorithm as it is used in this
thesis is a general graph miner that does not take the special structure and attributes
of DAGs into account during the search. It must be further analyzed whether there are
better ways to search through a lattice of only directed and acyclic graphs. Even the
detection of duplicates by means of a canonical representation might be optimized by
designing a specialized canonical form for DAGs. Furthermore, researching the overlaps
between DAG subgraphs could improve the search for the required optimal independent
subset of all embeddings. This would eventually eliminate the need to extract the found
fragments iteratively; instead, the optimal set of embeddings for procedural abstraction
could be calculated directly.
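To make the notion of an independent subset of embeddings concrete, the following Python sketch selects embeddings that do not share any instruction. It is deliberately a simple greedy heuristic and not an optimal algorithm, and it is not the maxClique test used by the implementation; embeddings are assumed to be given as sets of instruction identifiers.

```python
def select_disjoint(embeddings):
    """Greedily pick embeddings that share no instruction with earlier picks."""
    chosen, used = [], set()
    for embedding in embeddings:
        if used.isdisjoint(embedding):   # no instruction is already taken
            chosen.append(embedding)
            used |= embedding
    return chosen


# The second embedding overlaps the first one in instruction 3 and is skipped.
chosen = select_disjoint([{1, 2, 3}, {3, 4, 5}, {6, 7, 8}])
```

The result depends on the order of the input list, which is exactly the weakness an optimal selection based on the overlap structure of DAG subgraphs would avoid.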
6 Summary
Our new procedural abstraction approach is based on graph mining. We construct a DFG
representation for all basic blocks and use the DFGs as input for the mining process.
The DFG representation has the advantage over the traditional sequential representation
that it is independent of the instruction ordering. The DFG approach therefore also
finds fragments that cannot be found with the traditional approach.
As one of our goals was to perform procedural abstraction as a post-link-time
optimization for any compiler, we do not make assumptions about special code templates
that are used by specific compilers. As we also impose no special requirements on the
compiler and extract all information that is used to perform the procedural abstraction
from the binary itself, finding and removing data interwoven with the instruction stream
raises a few problems. These problems must be addressed in order to also be able to
optimize hand-written code.
Our evaluation shows that by using a DFG instead of the traditional sequential
instruction representation, we are able to achieve up to 5 times higher instruction
savings on certain programs such as rijndael. On average, the instruction savings
increased by a factor of 3.
In conclusion, procedural abstraction based on DFGs is superior to the traditional
sequential approach.
A Command-line Options
This appendix gives an overview of all supported command-line options that can
be used to configure the program.
[(-a|--arch) <architecture>]
Architecture to analyse. Supported architectures: arm.
(default: arm)
(-b|--binary) <binaryFile>
Binary to analyse.
[--dumpAll]
Activate all dump switches.
[--dumpBB]
Dump all basic blocks.
[--dumpCFG]
Dump the CFGs of the program.
[--dumpDFG]
Dump the DFGs of the program.
[--dumpFunction]
Dump all functions (in an objdump format).
[--dumpParMol]
Dump data for the ParMol miners.
[--dumpStatistic]
Write statistics ...
[--ec]
Enable all checks.
[(-f|--frequency) <parmol.minimumFrequency>]
Minimal frequency of embeddings to be added to the result set.
(default: 2)
[--gg <pa.graphGenerator>]
GraphGenerator to use. Supported generators: [DFG, Sequential,
SuffixTrie]. (default: DFG)
[--gmc]
Run maxClique test over all found embeddings to determine the
maximal extraction set.
[-h|--help]
Print this help text.
[--lg <pa.graphLabelGenerator>]
LabelGenerator to use. Supported generators: [Assembler, Canonic,
Mnemonic]. (default: Assembler)
[--log4j <log4j>]
Configfile for log4j. (default: log4j.xml)
[--minSize <parmol.minimumNodeCount>]
Minimal number of instructions for embeddings to be added to the
result set. (default: 5)
[--noMaxClique]
No maxClique test between embeddings of a fragment during search.
[-O]
Activate optimizations.
B Algorithm Overview
Algorithm B.1 Algorithm for procedural abstraction.
convert binary into internal representation
filter interwoven data
analyze indirect branches
repeat
  create basic blocks
  for all basic blocks b in set of basic blocks do
    if b has more than 5 instructions then
      create dfg out of b
    end if
  end for
  for all dfgs d in created set of dfgs do
    create gSpan representation of d
  end for
  mine for frequent code fragments
  for all fragments f in found fragments do
    calculate weight of f
  end for
  for all fragments f in set of weighted fragments do
    if weight of f > 0 then
      extract f
      break
    end if
  end for
until no extractable fragments can be found
write assembler file of the optimized program
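The repeat/until structure of the algorithm can be illustrated with a small, self-contained Python toy. It is a sketch under several assumptions that differ from the thesis: a contiguous-window search stands in for gSpan on DFGs, the best-weighted fragment is extracted in each round, the weight is an assumed cost model (each occurrence shrinks to one call, the new procedure costs the fragment plus one return instruction), and register and link-register constraints are ignored. All function names are hypothetical.

```python
def find_occurrences(block, frag):
    """Start indices of non-overlapping occurrences of frag in block."""
    hits, i, n = [], 0, len(frag)
    while i + n <= len(block):
        if tuple(block[i:i + n]) == frag:
            hits.append(i)
            i += n
        else:
            i += 1
    return hits


def candidates(blocks, size):
    """All contiguous windows of `size` instructions (toy stand-in for mining)."""
    return {tuple(b[i:i + size]) for b in blocks for i in range(len(b) - size + 1)}


def weight(blocks, frag):
    """Assumed cost model: instructions saved by extracting frag."""
    count = sum(len(find_occurrences(b, frag)) for b in blocks)
    return count * len(frag) - (count + len(frag) + 1)


def extract(blocks, frag, name):
    """Replace each occurrence by a call and append the new procedure."""
    result, n = [], len(frag)
    for block in blocks:
        out, i = [], 0
        while i < len(block):
            if tuple(block[i:i + n]) == frag:
                out.append("bl " + name)
                i += n
            else:
                out.append(block[i])
                i += 1
        result.append(out)
    result.append(list(frag) + ["mov pc, lr"])
    return result


def abstract(blocks, size=3):
    """Repeat mining and extraction until no fragment has a positive weight."""
    round_no = 0
    while True:
        found = sorted(candidates(blocks, size))       # sorted for determinism
        best = max(found, key=lambda f: weight(blocks, f), default=None)
        if best is None or weight(blocks, best) <= 0:
            return blocks
        blocks = extract(blocks, best, f"frag{round_no}")
        round_no += 1


common = ["ldr r0, [sp]", "add r0, r0, #1", "str r0, [sp]"]
blocks = [common + ["mov r1, #0"], common + ["mov r2, #0"], common + ["mov r3, #0"]]
optimized = abstract(blocks)
```

Every extraction reduces the total instruction count by exactly the weight of the extracted fragment, so the loop terminates once no fragment with a positive weight remains.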
5556Bibliography
[1] Saumya Debray, William Evans, Robert Muth, and Bjorn De Sutter. Compiler
techniques for code compaction. ACM Trans. Program. Lang. Syst., 22(2):378–415,
2000.
[2] Andrew Wolfe and Alex Chanin. Executing compressed programs on an embedded
risc architecture. In Proceedings of the 25th annual international symposium on
Microarchitecture, pages 81–91, Los Alamitos, CA, USA, 1992. IEEE Computer
Society Press.
[3] Haris Lekatsas, J¨ org Henkel, and Wayne Wolf. Code compression for low power
embedded system design. In Proceedings of the ACM 37th Conference on Design
Automation, pages 294–299, New York, NY, USA, 2000. ACM Press.
[4] Darko Kirovski, Johnson Kin, and William Mangione-Smith. Procedure based pro-
gram compression. International Journal of Parallel Programming, 27(6):457–475,
December 1999.
[5] Terry Welch. A technique for high performance data compression. IEEE Computer,
17(6):8–18, June 1984.
[6] Michael Burrows and David Wheeler. A block-sorting lossless data compression
algorithm. Number 124, Palo Alto, CA, USA, 1994.
[7] Andrew Apple. Modern Compiler Implementation in Java. Cambridge University
Press, Cambridge, UK, second edition, 2002.
[8] Bjorn De Sutter, Bruno De Bus, Ludo Van Put, Dominique Chanet, and Koen
De Bosschere. Link-time optimization of arm binaries. In Proceedings of the 2004
ACM SIGPLAN/SIGBED Conference on Languages, compilers, and tools for em-
bedded systems, pages 211–220, Washington, DC, USA, 2004. ACM Press.
[9] Esko Ukkonen. On-line construction of suﬃx-trees. Algorithmica, 14(3):249–260,
September 1995.
[10] Christian Borgelt and Michael Berthold. Mining Molecular Fragments: Finding
Relevant Substructures of Molecules. In Proc. IEEE Int’l Conf. on Data Mining
ICDM, pages 51–58, Maebashi City, Japan, 2002. IEEE Computer Society Press.
57[11] Jun Huan, Wei Wang, and Jan Prins. Eﬃcient mining of frequent subgraphs in
the presence of isomorphism. In Proceedings of the 3rd IEEE Intl. Conf. on Data
Mining ICDM, pages 549–552, Piscataway, NJ, USA, 2003. IEEE Computer Society
Press.
[12] Xifeng Yan and Jiawei Han. gSpan: Graph–based substructure pattern mining. In
Proceedings IEEE Int’l Conference on Data Mining ICDM, pages 721–723, Mae-
bashi City, Japan, 2002. IEEE Computer Society Press.
[13] Xifeng Yan and Jiawei Han. Closegraph: Mining closed frequent graph patterns.
In Proceedings of the 9th ACM SIGKDD Int’l Conference on Knowledge Discovery
and Data Mining, pages 286–295, Washington, DC, USA, 2003. ACM Press.
[14] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm
for mining frequent substructures from graph data. In PKDD ’00: Proceedings of the
4th European Conference on Principles of Data Mining and Knowledge Discovery,
pages 13–23, London, UK, 2000. Springer.
[15] Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In Pro-
ceedings of the IEEE Intl. Conf. on Data Mining ICDM, pages 313–320, San Jose,
CA, USA, 2001. IEEE Computer Society Press.
[16] Christian Borgelt and Michael Berthold. Mining molecular fragments: Finding
relevant substructures of molecules. In Proceedings of the 2002 IEEE International
Conference on Data Mining (ICDM 2002), pages 51–58, Maebashi City, Japan,
2002. IEEE Computer Society Press.
[17] Bjorn De Sutter, Bruno De Bus, and Koen De Bosschere. Sifting out the mud: Low
level c++ code reuse. In Proceedings of the 17th ACM SIGPLAN Conference on
Object-oriented programming, systems, languages, and applications, pages 275–291,
New York, NY, USA, 2002. ACM Press.
[18] Bjorn De Sutter, Hans Vandierendonck, Bruno De Bus, and Koen De Bosschere.
On the side-eﬀects of code abstraction. In Proceedings of the 2003 ACM SIGPLAN
Conference on Languages, Compilers, and Tools for Embedded Systems, pages 244–
253, New York, NY, USA, 2003. ACM Press.
[19] John Levine. Linkers and Loaders. Morgan Kaufmann, San Francisco, CA, USA,
1997.
[20] Steven Muchnick. Advanced Compiler Design and Implementation. Morgan Kauf-
mann, San Francisco, CA, USA, 1997.
[21] Alfred Aho, Ravi Sethi, and Jeﬀrey Ullman. Compilers: principles, techniques, and
tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.
58[22] Henrik Theiling. Extracting safe and precise control ﬂow from binaries. In Pro-
ceedings of the 7th Conference on Real-Time Computing Systems and Applications,
pages 23–30, Washington, DC, USA, 2000. IEEE Computer Society.
[23] Robert Sedgewick. Algorithms in Java, Part 5: Graph Algorithms. Addison-Wesley
Longman Publishing Co., Inc., Boston, MA, USA, 2003.
[24] Steve Furber. ARM System Architecture. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 1996.
[25] David Seal. ARM Architecture Reference Manual. Addison-Wesley Longman Pub-
lishing Co., Inc., Boston, MA, USA, 2000.
[26] Andrew Sloss, Dominic Symes, and Chris Wright. ARM System Developer’s Guide:
Designing and Optimizing System Software. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA, 2004.
[27] Advanced RISC Machines Ltd. (ARM), Los Gatos, CA, USA. Procedure Call Stan-
dard for the ARM Architecture, 2005.
[28] Advanced RISC Machines Ltd. (ARM), Los Gatos, CA, USA. ARM Instruction
Set Quick Reference Card, 2003.
[29] Advanced RISC Machines Ltd. (ARM), Los Gatos, CA, USA. ARM7500FE Data
Sheet, 1996.
[30] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Speci-
ﬁcation. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[31] Bruce Eckel. Thinking in Java. Prentice Hall, Saddle River, NJ, USA, second
edition, 2000.
[32] Matthew Guthaus, Jeﬀrey Ringenberg, Dan Ernst, Todd Austin, Trevor Mudge,
and Richard Brown. Mibench: A free, commercially representative embedded
benchmark suite. In Proceedings of the 4th IEEE Annual Workshop on Workload
Characterization, pages 3–14, Austin, TX, USA, 2001. IEEE Computer Society.
[33] dietlibc - a libc optimized for small size. http://www.fefe.de/dietlibc/.
59