Instrumenting and analyzing platform-independent communication in applications by Nilakantan, Siddharth
Instrumenting and analyzing platform-independent communication in applications
A Thesis
Submitted to the Faculty
of
Drexel University
by
Siddharth Nilakantan
in partial fulfillment of the
requirements for the degree
of
Doctor of Philosophy
March 2015
© Copyright 2015
Siddharth Nilakantan.
This work is licensed under the terms of the Creative Commons Attribution-ShareAlike
license Version 3.0. The license is available at
http://creativecommons.org/licenses/by-sa/3.0/.
Dedications
To Anandi, M.S. Sivaramakrishna, Visalam and Harsha.
Their support, belief, and unconditional love made this work possible.
Acknowledgments
This PhD has been quite the adventure and I would like to take this opportunity to thank all of
the individuals who have helped me find my way. During my journey, I have often faced difficulties
that seemed to hold me back. Without the support and guidance of all these individuals, I would
not have overcome the barriers in my path.
I would like to first thank my advisor, Dr. Mark Hempstead, for giving me this opportunity,
for believing in my abilities, for his guidance and for tireless support. He provided an environment
where I was free to pursue research in the directions I chose. I appreciate all the time he spent
providing timely feedback and pushing me to think deeply. I also appreciate the motivation to use
my time as efficiently as possible.
I would like to thank my committee members Dr. Jeremy Johnson, Dr. Timothy Kurzweg,
Dr. Baris Taskin and Dr. Joseph Devietti for their encouragement, feedback and direction. All
the current and former members of the Power Aware Computing Lab and VLSI Lab at Drexel
have provided productive collaborations, advice and kept the atmosphere positive. They include
Dr. Ankit More, Rizwana Begum, Karthik Sangaiah, Jonathan Stokes, Srikanth Annangi, Giordano
Salvador, and Scott Lerner. I would like to extend special gratitude to Dr. Steven Battle, a friend,
lab mate, and collaborator. I have learnt a lot from him, and have thoroughly enjoyed working with
him over the years. Our technical, personal, general and humorous conversations made the time and
deadlines that much more palatable.
I also thank all my friends in Philadelphia, who kept me sane and provided several opportunities
to forget my stress and laugh like I had no burdens. I would like to extend special thanks to Sai Hema
Venkataramanan and Dr. Nikhil Gulati for engaging me intellectually, listening and reciprocating
to all my thoughts, and just being great friends.
I would like to thank my family, to whom this thesis is also dedicated. Their faith and immeasur-
able love gave me a solid foundation on which I could stand and reach for my dreams. My mother,
Anandi, singlehandedly raised my brother and me with a good standard of living, while still pursuing
a successful career. To this day, she serves as an awe-inspiring example and I continue to try and
mimic her spirit. My grandmother Visalam’s patience and unconditional affection have soothed my
tumultuous mind on more than one occasion. My grandfather M.S. Sivaramakrishna’s example of
nobility, principle and integrity has been instrumental in making me a better person. My brother
Harsha has seen me through this entire journey in Philadelphia. He has been a constant source of
smiles, laughter and emotional support. My aunt Radhika, uncle Shashidhara, and cousins Nandan
and Anandita have provided all the affection and love that usually comes from immediate family.
Theirs was a home away from home, where I have built many personal memories over the journey
to the PhD.
Table of Contents
List of Tables
List of Figures
Abstract
1. Introduction & Background
1.1 Capturing software-level communication
1.2 Communication classification used to generate task graph equivalents
1.3 Communication classification used for trace-based design space exploration
1.4 Thesis Statement
1.5 Dissertation Contributions
1.6 Published Work
1.7 Dissertation Organization
2. Sigil: Capturing, classifying and representing communication automatically
2.1 Communication classification
2.1.1 Formal definition of categories for classification
2.1.2 Identifying categories in software applications
2.2 Automating the capture and classification of communication with Sigil
2.2.1 Overview of Sigil’s capture methodology
2.2.2 Shadow memory based implementation
2.2.3 Auxiliary data structures employed
2.3 Sigil infrastructure & characterization
2.4 Sigil output representations
2.4.1 Aggregates representation
2.4.2 Event Trace representation
2.5 Background and Related work in communication-aware profiling tools
2.6 Summary
2.7 Acknowledgments
3. Using Sigil for function-level analysis
3.1 Using Sigil’s aggregates representation for partitioning
3.1.1 System Architecture for partitioning IDFGs
3.1.2 Important considerations for partitioning IDFGs
3.1.3 Metric for partitioning IDFGs
3.2 Results from IDFG partitioning and reuse analysis of PARSEC benchmarks
3.2.1 Accelerator candidate functions in PARSEC benchmarks
3.2.2 Data reuse analysis
3.3 Performing constrained allocation of resources for accelerator candidates
3.3.1 Sample application
3.3.2 Performance estimation and resource allocation models
3.3.3 Area-constrained allocation
3.4 Using Sigil’s fine-grained task graphs to identify parallel execution paths
3.5 Background and related work in partitioning and parallelism
3.6 Summary
3.7 Acknowledgments
4. Communication classification applied to multithreaded programs
4.1 Issues with traditional flat traces of multi-threaded applications
4.1.1 Exploring impact of Non-determinism: Experimental setup
4.1.2 Exploring impact of Non-determinism: Pthread synchronization mismatches
4.1.3 Exploring impact of Non-determinism: User Space synchronization
4.1.4 SynchroTrace: Tackling non-determinism through Synchronization-aware Trace and Replay
4.2 Synchronization- and Dependency-aware Traces
4.2.1 Capturing synchronization events
4.2.2 Capturing Operating System traffic
4.3 Event-Trace Replay Framework
4.3.1 Event Queue Manager and Memory Request Manager
4.3.2 Thread Scheduler
4.4 Design Space Exploration with Trace-based Simulation
4.4.1 Experimental Methodology
4.4.2 Area and Power Constraints
4.4.3 Performance Results and Design Choices Under Constraints
4.5 Achieving Fast Design Exploration with Multi-Threaded Traces
4.5.1 Speedup using Multi-Threaded Trace Techniques
4.5.2 Trace Compression
4.5.3 Trace Filtering
4.5.4 Scalability
4.6 Background and Related Work in simulation-based design space exploration of CMPs
4.6.1 Comparison to Pinplay
4.6.2 Other Trace-Driven Simulation Solutions
4.7 Summary
4.8 Acknowledgments
5. Conclusions & Future Directions
5.1 Future Directions
Bibliography
Vita
List of Tables
2.1 Shadow Object Contents
3.1 Breakeven speedup for top 5 functions for PARSEC-2.1 benchmarks with simsmall input
3.2 Breakeven speedup for worst 5 functions for PARSEC-2.1 benchmarks with simsmall input
3.3 Function execution
3.4 Accelerator characteristics
4.1 List of benchmarks used. All runs use 8 threads with simsmall inputs for Parsec and the base input for Splash-2.
4.2 Cache Design Parameters
4.3 NoC Design Parameters
List of Figures
1.1 Sort functions offloaded to GPUs [36]
1.2 Convolution functions offloaded to GPUs [36]
1.3 Performance variation in 240 random thread mappings of FFT-32 [91]
1.4 An example of a task graph adapted from prior work in HW/SW partitioning for MPSoCs [114]
1.5 A partitioning example of the task graph adapted from prior work in HW/SW partitioning for MPSoCs [114]
1.6 Non-Determinism in Thread Execution. Uneven thread progress and indeterminate wait times at synchronization points cause non-determinism that potentially causes different thread interleaving for different runs.
2.1 Applying categorizations in an accelerator design example
2.2 Categorization of dataflow into I/O vs. local and unique vs. non-unique. Functions X, Y and Z are called in order from left to right
2.3 Capturing communication using Shadow Memory
2.4 How the shadow memory tracks producers and consumers; and distinguishes reuse
2.5 Data structures for Function-level profiling
2.6 Slowdown of Sigil and Callgrind relative to native for baseline function-level profiling
2.7 Slowdown of Sigil relative to Callgrind for baseline function-level profiling
2.8 Memory usage for baseline function-level profiling
2.9 Data and control flow between functions
2.10 Two functions with Loads and Stores. Computation chunks between each communication edge is independent in the two functions
2.11 Communication between parallel paths
3.2 Architecture model: an array of heterogeneous tiles
3.4 The normalized coverage of the leaf nodes of the calltree for all benchmarks
3.5 Breakdown of data bytes based on reuse counts for PARSEC benchmarks (simsmall input)
3.6 Average reuse lifetimes of the top vips functions by number of data bytes reused
3.7 Data reuse distribution of “conv gen” in vips
3.8 Data reuse distribution of “imb XYZ2lab” in vips
3.9 Breakdown of lines in memory based on reuse counts for benchmarks in the PARSEC Benchmark Suite (simsmall input)
3.10 Computation and communication overlap: When the initiation interval of a single instance of an accelerator is greater than communication for a function call, replication is possible (a) $t_{comm:accel:ip} < t_{init}$, (b) $2 * t_{comm:accel:ip} > t_{init}$
3.12 Maximum speedup based on function-level parallelism
4.1 Framework for testing our Traces and DBI flow contrasted alongside the gem5 flow
4.2 Comparison of total Read and Written bytes.
4.3 Breakdown by function of the difference in Read/Written bytes for the FMM benchmark
4.4 Comparison of total Read and Written bytes without Pthreads functions
4.5 Amount of read mismatch by function for the FMM benchmark without Pthread functions
4.6 Intercepting Pthread API Calls
4.7 Multi-Threaded Event-Trace Replay Framework
4.8 Design Choices Under Area and Power Constraints
4.10 Total Uncore Power (NoC and Caches) vs. Performance (CPI)
4.11 SynchroTrace Speedup in Simulation using our Multi-Threaded Trace Techniques over Gem5
Abstract
Instrumenting and analyzing platform-independent communication in applications
Siddharth Nilakantan
Mark Hempstead, Ph.D.
The performance of microprocessors is limited by communication. This limitation, sometimes
referred to as the memory wall, reflects the hardware-level cost of communicating with memory.
Recent studies have found that the promise of speedup from transistor scaling, or from employing
heterogeneous processors such as GPUs, is diminished when such hardware communication costs are
included [102, 16, 36, 75].
Based on the insight that hardware communication at run-time is a manifestation of communi-
cation in software, this dissertation proposes that automatically capturing and classifying software-
level communication is the first step in performing fast, early-stage design space exploration of future
multicore systems. Software-level communication refers to the exchange of data between software
entities such as functions, threads or basic blocks. Communication classification helps differentiate
the first-time use from the reuse of communicated data, and distinguishes between communication
external to a software entity and local communication within a software entity. We present Sigil,
a novel tool that automatically captures and classifies software-level communication in an efficient
way.
Due to its platform-independent nature, software-level communication can be useful during the
early-stage design of future multicore systems. Using the two different representations of output data
that Sigil produces, we show that the measurement of software-level communication can be used
to analyze i) function-level interaction in single-threaded programs to determine which specialized
logic can be included in future heterogeneous multicore systems, and ii) thread-level interaction in
multi-threaded programs to aid in chip multi-processor (CMP) design space exploration.

Chapter 1: Introduction & Background
The computer architecture community is finding that the performance of modern computing
systems is limited by communication [16, 36, 75, 81, 11, 115, 80]. This limitation usually refers to
the cost of communicating with memory (hardware-level communication). Contemporary processing
chips, for example Intel’s Sandy Bridge and Tilera’s TilePro36, employ large numbers of processing
cores with communication fabrics that interconnect the cores [62, 3]. As the number of processing
cores, and with it the number of transistors on chip, scales up, communication will become
increasingly important and will need to be performed effectively in order to extract performance
from future multicore systems.
[Figure 1.1 (image omitted): two bar charts reproduced from [36], “Sort Application Run Times (small data set)” and “Sort Application Run Times (large data set)”. x-axis: Number of Sorted Elements (1024 to 65536, and 1M to 64M); y-axis: Time (ms). One bar per GPU (Tesla C2050, GTX 480, 9800 GT, 330M), split into Sort Function and Data Transfer to GPU.]
Figure 1.1: Sort functions offloaded to GPUs [36]
[Figure 1.2 (image omitted): two bar charts reproduced from [36], “Convolution Run Times (small data set)” and “Convolution Run Times (large data set)”. x-axis: Convolution Matrix Size (n x n); y-axis: Time (ms). One bar per GPU, split into Data Transfer to GPU, Convolution Function, and Data Transfer from GPU.]
Figure 1.2: Convolution functions offloaded to GPUs [36]
Recent studies have found that the promise of speedup from technology scaling [102] or from
heterogeneous processors, such as GPUs, is diminished when hardware communication costs are
included [16, 36, 75]. Gregg et al. showed that the benefit of offloading functions to a GPU is
often limited by data transfer costs [36]. Figures 1.1 and 1.2 show illustrative results from that
publication: the breakdown of completion time when offloading a sort function and a convolution
function, respectively, to a GPU. Input size increases from left to right, and the overall completion
time is broken down into data transfer costs and time spent performing useful work. The sort
function becomes increasingly dominated by transfer costs as input size grows, while the convolution
function is mostly dominated by data transfer costs across all input sizes.
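As a rough illustration of why transfer costs erode offload speedup, the following sketch charges the data transfer time to the accelerator when computing speedup over the CPU. The function name and all timings are hypothetical values chosen for illustration; they are not measurements from Gregg et al. [36].

```python
def effective_speedup(t_cpu_ms, t_kernel_ms, t_to_dev_ms=0.0, t_from_dev_ms=0.0):
    """Speedup over the CPU once transfer time is counted against the accelerator."""
    return t_cpu_ms / (t_kernel_ms + t_to_dev_ms + t_from_dev_ms)


# Hypothetical timings (ms): the kernel alone is 10x faster than the CPU version.
kernel_only = effective_speedup(100.0, 10.0)
# Counting 30 ms to copy inputs and 10 ms to copy results back, the gain shrinks.
with_transfer = effective_speedup(100.0, 10.0, 30.0, 10.0)

print(kernel_only, with_transfer)
```

Under these assumed numbers, a kernel-only view overstates the benefit of offloading by a factor of five, which is the kind of gap the breakdowns in Figures 1.1 and 1.2 make visible.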
[Figure 1.3 (image omitted): histogram of thread mappings. x-axis: Aggregate IPC (12 to 22); y-axis: Count (0 to 50).]
Figure 1.3: Performance variation in 240 random thread mappings of FFT-32 [91]
Salvador et al. use profiles of measured thread-level communication to study the thread mapping
problem [90, 91]. The thread mapping problem refers to the arrangement of threads on cores in a
multicore system. Communication in a multicore system is made possible by the interconnect, or
Network-on-Chip (NoC), that connects all the cores and the cache memory slices. Any core can
fetch data from any cache slice using network messages that traverse the links of the interconnect
over multiple hops. Salvador et al. show that close to the theoretical maximum performance can
be achieved by arranging threads on cores such that the total number of hops required to perform
inter-thread communication is minimized. Thus the efficacy of a thread mapping depends on the
communication profiles of the threads. Salvador et al. also show how strongly different thread
mappings affect performance. As the number of threads and cores increases, the number of possible
arrangements of threads on cores grows too large to explore exhaustively.
Figure 1.3, adapted from their publications, shows the variation in performance for 240 random
thread mappings of a 32-threaded version of the FFT benchmark on a simulated 36-core
multiprocessor chip. With just 240 mappings, out of more than 500 million possible
mappings, a performance variation of 22% is observed. These data-driven examples strongly
indicate that communication costs will severely impact the performance of future
multicore systems.
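The hop-minimization objective described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of Salvador et al.: it assumes a 2D mesh NoC in which the hop count between two cores equals their Manhattan distance, one thread per core, direct thread-to-thread traffic (cache slices are not modeled), and a hypothetical per-pair message count used as a weight. All names and traffic values are illustrative.

```python
import random


def hops(core_a, core_b, width):
    """Manhattan distance between two cores on a mesh that is `width` cores wide."""
    ax, ay = core_a % width, core_a // width
    bx, by = core_b % width, core_b // width
    return abs(ax - bx) + abs(ay - by)


def weighted_hops(mapping, traffic, width):
    """Total hops for a mapping; mapping[t] is the core that runs thread t."""
    return sum(volume * hops(mapping[src], mapping[dst], width)
               for (src, dst), volume in traffic.items())


def best_of_random(traffic, n_threads, n_cores, width, trials, seed=0):
    """Sample random one-thread-per-core mappings and keep the cheapest one."""
    rng = random.Random(seed)
    best_cost, best_map = None, None
    for _ in range(trials):
        mapping = rng.sample(range(n_cores), n_threads)
        cost = weighted_hops(mapping, traffic, width)
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, mapping
    return best_cost, best_map


# Hypothetical ring of four communicating threads on a 6x6 (36-core) mesh.
traffic = {(0, 1): 100, (1, 2): 50, (2, 3): 100, (3, 0): 50}
cost_adjacent = weighted_hops([0, 1, 7, 6], traffic, 6)     # neighbouring cores
cost_scattered = weighted_hops([0, 35, 5, 30], traffic, 6)  # the four corners
best_cost, best_map = best_of_random(traffic, 4, 36, 6, trials=240)
print(cost_adjacent, cost_scattered, best_cost)
```

Random sampling is used here only because it mirrors the 240 random mappings of Figure 1.3; any search strategy could be substituted once the communication profile (`traffic`) is known.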
Communication at the hardware level is a run-time manifestation of communication at the software
level. Software-level communication refers to messages between software entities such as functions,
threads, basic blocks, or even instructions; any self-contained fragment of code can be a producer
or consumer of data. Due to its platform-independent nature, software-level communication can be
useful during the early-stage design of Chip Multi-processors (CMPs). A range of hardware and
software tasks, including software development, parallel programming, hardware-software partitioning,
and the design of networks-on-chip, can be improved with a detailed understanding of software-level
communication within a workload [95, 31, 116]. This dissertation shows how capturing and
classifying communication is the first step in a methodology for early-stage design of future CMPs. It
also shows how analyzing the sources and patterns of software-level communication in
a workload can be used in such a methodology.
Core counts in contemporary CMP designs have increased to tens of cores on a chip, with industrial
and academic projections predicting hundreds to thousands of cores on a chip for future CMPs [66, 16].
Some researchers have shown that power density problems will affect multicore processors due to the
existence of a utilization wall, where not all transistors can afford to be active at the same time. This
condition has been termed Dark Silicon [26, 43, 102, 45]. Researchers have proposed overcoming
the utilization wall with the use of hardware specialization in future many-core architectures [16,
102, 19], leading to future CMPs potentially being heterogeneous in nature. We distinguish CMPs
that use specialized logic with the “Heterogeneous” prefix, resulting in the term “Heterogeneous
CMP”. Heterogeneous CMP designs have begun appearing in industry as well, with some examples
being IBM’s PowerEN chip and NVIDIA’s Tegra chips [13, 77]. This dissertation illustrates that
the measurement of platform-independent software-level communication can be used to analyze i)
function-level interaction in single-threaded programs to determine which specialized logic to include
in Heterogeneous CMPs, and ii) thread-level interaction in multi-threaded programs to aid in CMP
design space exploration of both Homogeneous and Heterogeneous CMPs.
1.1 Capturing software-level communication
The need to capture software-level communication is acknowledged by recently published
work [44, 69, 37, 55]. At the software level, communication is caused by load/store instructions
when a load instruction depends on a store instruction through a memory
address. While dependencies exist between individual instructions, it is impractical to retain the
chain of dependencies through individual instructions and registers, as the required storage becomes
prohibitively expensive [69, 98, 64]. Thus, most prior work on dependence profiling tracks
dependencies at a coarser granularity (between collections of instructions), such as whole loops and
functions [87, 98, 60, 38]. However, prior attempts at capturing communication have several
drawbacks: they do not classify communication, are less efficient, have met with limited
success, and often require programmer intervention [87, 98, 60, 38, 44, 69, 55]. Classification refers
to the process of distinguishing first-time use of data from reuse of data, and separating local data
from input/output data. The primary contribution of this dissertation is a formal methodology to capture
and classify platform-independent communication.
In the context of this dissertation, software-level communication is defined to exist between
software entities. We define a software entity as a collection of instructions in a software application
that act as a source or sink of data. Thus, entities are considered a single unit for the purpose of
measuring communication. For example, basic blocks, functions, and threads can be entities. The
definition of a software entity allows for there to be multiple loads/stores in an entity, making it
act as a potential source and sink of data. We generally do not refer to communication edges as
dependency edges in this dissertation, since entities can potentially have multiple dependency edges,
and it is not possible to establish precedence between entities with multiple dependency edges.
In the Load/Store example above, the entity that contains the store is the “producer” of a data
value, while the entity that contains the load for that data is the “consumer” of the data value.
Thus a system design methodology must measure the volume of communication between entities
by tracking the producers and all consumers of every single data byte generated by a program. The
ability to dynamically trace communication in applications through memory addresses will allow
tracking dependencies through pointer indirection, linked lists and through control flow as well.
Communication can be captured automatically, by identifying software entities and tracking
the load and store instructions in those entities. When a store to an address occurs within an
entity, we must record both the store and the producing entity under which the store occurred.
When a load of the same address occurs within a consuming entity, we must look up the producing
entity under which the store had been recorded and register a communication edge between the
two entities. This communication edge can be recorded individually, or aggregated, and can also be
classified into several categories as will be discussed in the next chapter. In Chapter 2, we show how
communication can be captured and classified automatically between software entities; specifically
functions and threads. We also show how communication can be represented in many ways for
purposes such as HW/SW partitioning problems and parallelism discovery.
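The capture procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not Sigil's implementation: the class name CommunicationTracker, its methods, and the entity names "f", "g" and "h" are hypothetical, entities are identified by plain strings, and edges are aggregated as byte counts per producer-consumer pair.

```python
from collections import defaultdict


class CommunicationTracker:
    """Byte-granularity producer tracking (illustrative sketch only)."""

    def __init__(self):
        self.last_writer = {}          # address -> entity that last stored to it
        self.edges = defaultdict(int)  # (producer, consumer) -> bytes communicated

    def store(self, entity, address, size=1):
        # Record the producing entity for every byte written.
        for a in range(address, address + size):
            self.last_writer[a] = entity

    def load(self, entity, address, size=1):
        # Look up the producer of every byte read and aggregate the edge.
        for a in range(address, address + size):
            producer = self.last_writer.get(a)
            if producer is not None:
                self.edges[(producer, entity)] += 1


tracker = CommunicationTracker()
tracker.store("f", 0x1000, size=4)  # f produces 4 bytes
tracker.load("g", 0x1000, size=4)   # g consumes all 4 bytes from f
tracker.load("h", 0x1002, size=2)   # h consumes 2 of the bytes from f
tracker.store("g", 0x1000, size=1)  # overwrite: g now produces byte 0x1000
tracker.load("h", 0x1000, size=1)   # h consumes that byte from g
print(dict(tracker.edges))
```

A dictionary keyed by address stands in here for the lookup structure; how the lookup is made efficient in practice is a separate question that this sketch does not address.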
Communication between threads is also protected by synchronization constructs. As threads
are entities that can execute in parallel, synchronization ensures that a producing thread finishes
producing its data before the consuming thread consumes the same data, thereby imposing or-
der between the producer and consumer. Synchronization is usually mandatory in multi-threaded
programs where producing and consuming entities use a shared memory model to exchange data.
Synchronization constructs such as barriers and mutexes help mediate access to shared memory to
ensure every entity has a consistent view of data in shared memory. In a shared memory model, a
producing entity puts data into a bounded buffer and waits when the buffer is full, and the consuming
entity retrieves data from the buffer and waits when the buffer is empty. Both producer and
consumer waits are modeled using software synchronization constructs when the buffer is full and
empty, respectively.
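The bounded-buffer behavior described above can be illustrated with a small Python sketch. It assumes a single producer thread and a single consumer thread and uses a condition variable from the standard threading module as the synchronization construct; the names BoundedBuffer and run_demo are hypothetical and chosen only for this example.

```python
import threading
from collections import deque


class BoundedBuffer:
    """Fixed-capacity buffer; put() waits when full, get() waits when empty."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.cond = threading.Condition()

    def put(self, item):
        with self.cond:
            while len(self.items) >= self.capacity:  # producer wait: buffer full
                self.cond.wait()
            self.items.append(item)
            self.cond.notify_all()

    def get(self):
        with self.cond:
            while not self.items:                    # consumer wait: buffer empty
                self.cond.wait()
            item = self.items.popleft()
            self.cond.notify_all()
            return item


def run_demo(n_items=20, capacity=4):
    buf = BoundedBuffer(capacity)
    received = []

    def producer():
        for i in range(n_items):
            buf.put(i)

    def consumer():
        for _ in range(n_items):
            received.append(buf.get())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(run_demo())
```

The order in which the two threads wait differs from run to run, but the synchronization guarantees that every item is consumed only after it has been produced.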
The design of future multicore systems requires the capture of synchronization in order to accurately
model the behavior of multi-threaded programs. In this dissertation, we found that studying
the performance of an application in response to the configured hardware resources during design
space exploration requires modeling the effect of synchronization on the behavior of the application.
Many approaches that study multicore CMPs model multi-threaded programs without synchronization,
and those that do model it suffer from drawbacks that we tackle in this dissertation. In Chapter 4, we show
how we overcome the drawbacks of previous approaches by modeling synchronization accurately and
efficiently using our automated communication capture and classification methodology.
1.2 Communication classification used to generate task graph equivalents
Figure 1.4: An example of a task graph adapted from prior work in HW/SW partitioning for
MPSoCs [114]
Communication classification can be used to automatically generate data flow graphs of applications,
which can be used as a proxy for task graphs in HW/SW partitioning problems. Prior research
from the HW/SW codesign and reconfigurable computing communities has employed task graph
representations of applications to partition the applications between specialized hardware and general
purpose CPUs in embedded systems. HW/SW partitioning algorithms are used to discover
the optimal way to minimize communication between the specialized hardware partitions and the
software partition.
Task graphs have proved useful in a variety of ways, including schedule optimization and HW/SW
partitioning [51, 31, 67, 113]. Figure 1.4 shows the example of a task graph adapted from a research
study by Youness et al. on HW/SW partitioning of generic task graphs [114]. The task graph
is a generic example and shows a sequence of dependent tasks in the application, connected by
arrows that are annotated with communication costs. Each task at the receiving end of an arrow
must wait for the sending task to complete before it can begin. Once a task begins, it need not
be interrupted and needs no more data from any other task; it is self-contained and must execute
completely before any dependent task can begin. The cost for a task to finish completely is indicated
within each task in the figure as well. Currently, task graphs used in partitioning studies are
usually constructed manually by the programmer/system designer or generated pseudo-randomly
using research tools such as Task Graphs for Free [24, 101]. Thus there is a need to automatically
capture task graph representations of applications. In this dissertation, we present a tool, Sigil, that
implements a communication classification methodology and automatically produces a function-based
graph representation equivalent to a task graph, which can be used as a substitute for task graphs in
partitioning problems.
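To make the task graph semantics described above concrete, the following Python sketch stores a small task graph as two dictionaries and computes the earliest finish time of each task. The task names and all costs are hypothetical (they are not taken from Figure 1.4), and the sketch assumes an acyclic graph with unlimited execution resources, so that a task starts as soon as every sending task has completed and its data has been transferred.

```python
# Hypothetical task graph: node costs and edge communication costs.
task_cost = {"T1": 4, "T2": 10, "T3": 3, "T4": 2}
comm_cost = {("T1", "T3"): 2, ("T2", "T3"): 1, ("T3", "T4"): 5}


def earliest_finish(task_cost, comm_cost):
    """Earliest finish time per task, assuming an acyclic graph."""
    preds = {t: [] for t in task_cost}
    for (src, dst) in comm_cost:
        preds[dst].append(src)

    finish = {}

    def visit(t):
        if t not in finish:
            # A task may start only after every sender has finished and
            # the communication on the connecting edge has completed.
            start = max((visit(p) + comm_cost[(p, t)] for p in preds[t]), default=0)
            finish[t] = start + task_cost[t]
        return finish[t]

    for t in task_cost:
        visit(t)
    return finish


finish = earliest_finish(task_cost, comm_cost)
print(finish)
```

With these example costs, T3 cannot start before time 11 because it waits for T2, and T4 finishes at time 21.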
HW/SW partitioning algorithms have been previously studied by the HW/SW codesign and
reconfigurable computing community [100]. Algorithms that have been employed range from genetic
algorithms and simulated annealing based algorithms to various custom algorithms [51, 31, 67,
113]. Besides the task graph, the algorithms usually require a system architecture description with
platform-specific information such as the bandwidth and latency of on-chip communication and
the frequency of operation of the CPU and specialized hardware. Figure 1.5 shows the partitions
Figure 1.5: A partitioning example of the task graph adapted from prior work in HW/SW
partitioning for MPSoCs [114]
for the task graph example shown in Figure 1.4, once again adapted from Youness et al. [114].
Their algorithm partitions the tasks between an FPGA and a CPU for the organization shown
on the right. The algorithm maximizes computational work on the FPGA, while absorbing all
the costly communication within partitions so as to keep communication between partitions at a
minimum. Task 2, which performs most of the computational work in the application, is run on
the FPGA hardware partition (which could also be a specialized hardware ASIC), while Task 1
is run in parallel on the CPU software partition. From there on, tasks are absorbed into each
partition to minimize communication, thereby assigning Task 3 and Task 6 to the FPGA partition,
with the remaining tasks assigned to the CPU partition. Sophisticated algorithms such as genetic
algorithms and simulated annealing have been employed for partitioning [67, 85]. However, they are
often limited to work on abstract task graph representations, generated using the manual method
described earlier or on pseudo-random task graphs. In this dissertation, we show a demonstrative
partitioning algorithm unique to the graph representation generated by our profiling tool, Sigil,
that produces a list of candidate functions for which to build specialized hardware and run on a
hardware partition. More sophisticated algorithms from prior work will also be applicable with
some modifications. Such an in-depth study of partitioning algorithms on our graph representation
is out of the scope of this dissertation. The graph representation generated by Sigil is discussed in
Chapter 2, while the partitioning algorithm and results are discussed in Chapter 3. Both the Sigil
tool and our partitioning algorithm are intended to be part of an end-to-end solution that goes from the
unmodified application binary to candidates for hardware specialization.
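The objective of keeping communication between partitions at a minimum can be illustrated with a short Python sketch that evaluates the communication crossing the HW/SW boundary for a given assignment. This is not the partitioning algorithm of Chapter 3; the function names ("main", "fft", "scale"), the byte counts, and the helper cut_communication are all hypothetical and serve only to show why absorbing a heavily communicating neighbor into the same partition reduces the cut.

```python
# Hypothetical aggregated communication: (producer, consumer) -> bytes.
comm_bytes = {
    ("main", "fft"): 1024,
    ("fft", "scale"): 4096,
    ("scale", "main"): 512,
}


def cut_communication(comm_bytes, partition):
    """Total bytes on edges whose endpoints lie in different partitions."""
    return sum(
        nbytes
        for (src, dst), nbytes in comm_bytes.items()
        if partition[src] != partition[dst]
    )


# Candidate A: only fft is placed in the hardware partition.
only_fft = {"main": "SW", "fft": "HW", "scale": "SW"}
# Candidate B: scale is absorbed into the hardware partition as well.
fft_and_scale = {"main": "SW", "fft": "HW", "scale": "HW"}

cost_a = cut_communication(comm_bytes, only_fft)
cost_b = cut_communication(comm_bytes, fft_and_scale)
print(cost_a, cost_b)
```

In this example the heavy fft-to-scale edge becomes internal to the hardware partition in candidate B, so the traffic between partitions drops from 5120 bytes to 1536 bytes.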
1.3 Communication classification used for trace-based design space exploration
Automatically captured and classified communication can also be used to generate traces of multi-
threaded applications to assist fast and scalable design space exploration. Traces of multi-threaded
applications need to record meta-information such as producer-consumer communication and syn-
chronization constructs in order to be useful for design space exploration. Our communication
classification methodology that uses our tool, Sigil, can be leveraged to tackle this problem, with ex-
tensions to capture synchronization constructs. We motivate the need for capturing synchronization
constructs in this section.
Traces are convenient and portable for simulation, but due to the non-deterministic execution
of multi-threaded applications, simulation using traces of multi-threaded applications has proven
difficult and been attempted only a few times [79, 86]. The non-determinism manifests as uneven
thread progress between synchronization points and indeterminate wait time at synchronization
points. Design time factors, such as CMP design configuration and static thread mapping, as well
as run-time factors, such as OS load on the cores or dynamic thread mapping, can play a role
in impacting thread progress differently. A particular order of relative progress between different
threads is sometimes termed thread interleaving [79, 86]. The examples in Figure 1.6 are the result
of two different thread interleavings. The non-determinism arising from the possibility of different
thread interleavings can subsequently affect performance metrics, such as cycle time, core utilization,
memory bandwidth, peak traffic, and energy footprint of a multi-threaded application.
[Figure 1.6 timeline diagram: two scenarios (top and bottom), each showing Thread 0 and Thread 1 executing Section A, Critical Section B, and Section C between two barriers, with wait periods at the mutex synchronization point and at the release barrier; horizontal axis: time.]
Figure 1.6: Non-Determinism in Thread Execution. Uneven thread progress and
indeterminate wait times at synchronization points cause non-determinism that potentially
causes different thread interleaving for different runs.
Thread non-determinism arising from different thread interleavings is illustrated with an
example in Figure 1.6. This figure depicts a portion of execution for an application containing
two synchronizing threads, between two barriers, i.e. a barrier region. Each thread must complete
Sections A, B, and C in sequence, and both threads must go through the Critical Section B in a
mutually exclusive manner (enforced by mutex synchronization). A mutex synchronization point
allows the first arriving thread to progress while the other has to wait, and a barrier only allows
progress when all registered threads have arrived. Two scenarios of thread progress are shown on
the top and bottom with slightly different execution times for Section A across the scenarios. This
minor difference in the timing of Section A has a big effect on the wall-clock time.
The wall-clock times vary between the two scenarios as explained below. In the scenario shown on
top of Figure 1.6, Thread 0 arrives at Critical Section B first due to the relative timing of Section A,
and vice versa for the bottom scenario. As Thread 0 has a longer Section C to complete, it would
benefit from completing Critical Section B first. In the top scenario, Thread 1 waits at the critical
section for Thread 0 to finish first. In the bottom scenario, Thread 0 waits at the critical section
for Thread 1 to finish first. Since Thread 0 is allowed to progress through the critical section first
in the top scenario, both threads reach the barrier after Section C sooner. As the change in wall-clock
time between the two scenarios is quite large, the effects of uneven thread progress for Section A,
coupled with the subsequent wait times at the synchronization points need to be modeled accurately
in order to compare performance of designs that cause these different types of thread interleavings.
Thus, when exploring the design space of a CMP, we need to model the small changes in the relative
progress of threads, and the corresponding wait times that can only be determined dynamically.
In the above example, the minor difference in the execution time of Section A represents un-
even progress of execution between synchronization points in multi-threaded programs. This is one
manifestation of non-determinism. The different wait time at the synchronization points in both
threads is another manifestation of non-determinism. It is thus important to model the impact of
non-determinism during simulation.
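The effect described in this example can be reproduced with a small Python model of one barrier region. This is a sketch under stated assumptions: the function name barrier_region_time and all section durations are hypothetical (they are not read from Figure 1.6), every thread leaves the first barrier at time 0, and the mutex is granted in order of arrival, with ties broken by thread index.

```python
def barrier_region_time(section_a, section_b, section_c):
    """Wall-clock time of a barrier region in which each thread runs
    Section A, then Critical Section B under a mutex, then Section C."""
    arrival = list(section_a)  # time each thread reaches the mutex
    order = sorted(range(len(arrival)), key=lambda t: (arrival[t], t))

    mutex_free_at = 0
    finish = [0] * len(arrival)
    for t in order:
        start_b = max(arrival[t], mutex_free_at)  # wait if the mutex is held
        mutex_free_at = start_b + section_b[t]
        finish[t] = mutex_free_at + section_c[t]

    # The closing barrier releases when the slowest thread arrives.
    return max(finish)


# Thread 0 has the longer Section C. Only Section A timings differ by one unit.
top = barrier_region_time((10, 11), (5, 5), (30, 10))     # Thread 0 wins the mutex
bottom = barrier_region_time((11, 10), (5, 5), (30, 10))  # Thread 1 wins the mutex
print(top, bottom)
```

With these durations, a one-unit change in Section A changes which thread enters the critical section first and moves the wall-clock time of the region from 45 to 50 units.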
In Sections 4.1.1 to 4.1.4, we present an in-depth study that confirms the sources of non-determinism
and quantifies the impact of non-determinism in multi-threaded programs.
In this dissertation, we discuss how our tool, Sigil, can be extended to intercept and model
synchronization constructs so as to model non-determinism accurately. We also discuss the modi-
fications we make to Sigil’s representations of profiled communication in order to adapt efficiently
for multi-threaded programs. The impact of non-determinism and our corresponding solutions are
discussed and evaluated in detail in Chapter 4.
1.4 Thesis Statement
Platform-independent, software-level communication has multiple applications in system design. A
methodology for capturing and classifying this communication is the first step in performing fast,
early-stage design space exploration of future multicore systems.
1.5 Dissertation Contributions
The work in this dissertation covers two important steps toward the design of heterogeneous CMPs.
It provides a profiling methodology that gathers platform-independent data from applications for
early stage design of CMPs. It shows multiple in-depth case studies that use the data for various
CMP design goals such as analysis for specialized hardware, analysis and simulation for CMP design
space exploration. The concrete list of contributions is as follows:
• Motivating the need to capture and classify communication. We discuss the necessity
for classifying captured communication at a software-level. We show how formal categories of
classification applied on platform-independent software communication can be used to estimate
run-time traffic between different hardware structures. We show how this aids in different
aspects of the design of future multicore systems.
• Unique profiling methodology to automatically capture and classify communication
efficiently. We describe a profiling methodology that instruments computation and communication
costs for software entities such as functions and threads. We discuss the detailed
implementation of the tool, Sigil, that captures and classifies communication efficiently and
the corresponding tradeoffs.
• Method for interpreting profiling results for partitioning problems and parallelism
discovery. The described method answers questions such as which functions
in an application are worth building accelerators for. We show how the HW/SW partitioning
problem can be applied to the aggregate representation of captured and classified
communication data obtained from Sigil. Partitioning function-based task graphs has unique
considerations, which we describe. We also
present a profiling data representation, named the “dependency tree”, that captures more
fine-grained detail: a trace of data dependencies between calls. We show how this allows for
detection of critical paths and pipeline parallelism.
• A methodology for early-stage modeling of heterogeneous many-accelerator CMPs.
We evaluated a full methodology for early-stage modeling and design of CMPs that will contain
many accelerators. We describe execution and performance estimation models, and with the
help of profiling results, we partition a sample workload to evaluate the models. We also
studied how resources must be allocated to accelerators in an accelerator-rich CMP, in order
to optimize for performance.
• A study on the impact of non-determinism in traces of multi-threaded applications.
We explain the need to model non-determinism in traces of multi-threaded applications in order
to use them for design space exploration of multicore systems. We also quantify the impact
of non-determinism due to the existence of thread synchronization constructs that protect thread
communication in multi-threaded applications.
• Using platform-independent communication analysis on multi-threaded programs
to enable analysis and trace-based simulation of CMPs. This dissertation also extends
the profiling methodology with intercepts for synchronization events in shared memory multi-threaded
programs. The intercepts allow us to capture synchronization events in addition
to communication, which allows us to model non-determinism correctly for accurate simulation.
The data generated for the event trace representation of Sigil is used as traces to
perform fast and accurate trace-based simulation for CMP design space exploration.
1.6 Published Work
This dissertation contains work from the following publications where the dissertation author was
the primary investigator and author.
• Platform-independent analysis of function-level communication in workloads. This
work appeared in the 2013 IEEE International Symposium on Workload Characterization
(IISWC). It describes the framework used to automatically generate function-based task graphs.
It also describes the metrics captured, and the representations of profiled data when targeting
partitioning problems, reuse profiling and parallelism discovery.
• Metrics for Early-Stage Modeling of Many-Accelerator Architectures. This work appeared
in Computer Architecture Letters (CAL) in July of 2012. It describes a methodology for
early-stage modeling of many-accelerator chips, including task graph generation, partitioning,
execution models and a validation with a sample workload and RTL.
• Can you trust your memory trace?: A comparison of memory traces from binary
instrumentation and simulation. This work was accepted at the International Conference on
VLSI Design (VLSID) in January 2015. Some of the work in this paper motivates the need
to capture synchronization in traces of multi-threaded applications.
• SynchroTrace: Synchronization-aware Architecture-agnostic Traces for Light-Weight
Multicore Simulation
This dissertation differs from the previous work in the literature in the following ways:
• While the need for capturing communication has been established, we show how classifying
communication is important for the design of future CMPs.
• Prior attempts at automating the capture of communication have addressed a very specific problem,
or they are inefficient in terms of speed and memory, or they do not classify communication.
We address these issues in this dissertation.
• Our case study demonstrating the use of communication classification to generate representa-
tions equivalent to task graphs to be used for HW/SW partitioning problems is unprecedented,
to the best of our knowledge. We discuss the unique considerations for partitioning problems
on our representations.
• We present a methodology for early-stage modeling of many-accelerator chips, starting with task graphs
and a novel execution model, and validated with RTL.
• We generate accurate traces for multi-threaded applications to be used for simulation. This
has not been previously demonstrated in a complete and validated manner.
1.7 Dissertation Organization
The work presented in this dissertation is organized as follows. Chapter 2 describes communication
capture and classification, and our tool Sigil, which automatically generates data to compose task
graph equivalents and another output representation named the event trace. Chapter 3 shows how we
can apply the coarse-grained task graph for partitioning and also briefly discusses how the event trace
representation can be used for critical path analysis and discovery of parallelism. The chapter also
covers an early-stage design study of many-accelerator CMPs with a constrained accelerator selection
problem. Chapter 4 further motivates and shows how we extend Sigil’s trace representation to apply
to multi-threaded workloads for design space exploration. Chapter 5 concludes.
Chapter 2: Sigil: Capturing, classifying and representing communication
automatically
As explained in the introduction, due to its platform-independent nature, understanding software-
level communication can be useful in a variety of ways. A range of hardware and software tasks—
including software development, parallel programming, hardware-software partitioning, and the de-
sign of network-on-chip—can be improved with a detailed characterization of software-level commu-
nication within a workload [95, 31, 116]. This chapter addresses the challenge of characterizing the
sources and patterns of software-level communication in a workload; it proposes techniques and a
methodology to do so in an automated way with low overhead.
The communication-aware methodology presented in this chapter extracts architecture-agnostic
properties of the workload such as interprocedural control data flow graphs, data reuse lifetimes
and dynamic dependency chains. The interprocedural control data flow graphs can be used for
finding regions of code that are sensible to accelerate with specialized hardware, so as to improve
overall application performance. Data reuse lifetimes for regions of code provide hints into a code
region’s impact on a memory system, especially in the context of specialized hardware. Dynamic
dependency chains are used to discover the limits of parallelism in an application using critical
path analysis. These properties can help the hardware design process at an early stage, while also
providing useful insights for software optimization. While methodologies and models exist that
characterize communication with memory, many of the profiles recorded by these methodologies are
platform-dependent [57, 107, 23, 51]. For example, the measured communication might depend on
cache-size, cache configurations or other details of the platform’s memory hierarchy and intercon-
nection network.
Our methodology runs the application and dynamically captures communication, as static anal-
ysis does not provide adequate visibility into dynamic structures such as pointers and linked lists.
Like most software profilers (e.g. gprof), we aggregate costs on a per-function or per-thread basis, for
two reasons: i) Functions and threads define clear logical boundaries that are understandable to the
software developer and ii) capturing and storing communication data at very fine granularities such
as instructions has been shown to be prohibitively expensive and unscalable [69, 98, 64]. Hereafter,
in this chapter we discuss our methodology in the context of functions, as capturing communication
between threads requires only a subset of the features in our methodology.
When a profiler analyzes function-level communication, not all of the total bytes read and written
by a function should be treated equally. Our methodology tracks the data produced and consumed
by each function call and differentiates the first-time use from the reuse of bytes. We also distinguish
between communication external to the function and local communication within the function by
tracking the producer and consumer of each unique data byte in the program; as there can be
multiple consumers of a data byte, every producer-consumer pair for a single byte of data represents
a communication edge. We categorize every communication edge encountered in the program, and
store the number of bytes associated with edges in every category. We refer to this process as
communication classification.
This chapter also presents a custom tool we named “Sigil” that implements our profiling methodology.
Using the technique of shadow memory, Sigil is able to index our table of functions efficiently
without exploding state [70]. Sigil leverages Dynamic Binary Instrumentation (DBI) technology and
is implemented on top of Valgrind’s Callgrind framework [72, 106]. Sigil can represent its profiling
results in one of two ways: it can dump aggregates on a per-function basis or list the execution as
a sequence of dependent events in an event trace. The latter representation allows a system designer
to view a workload as a list of function calls connected by data transfer edges. Viewing the results
using the event trace representation is more conducive to solving problems such as scheduling using
critical path analysis [89, 88]. In the rest of this chapter, we discuss our communication capture and
classification methodology in the context of the tool, Sigil.
This chapter is organized as follows: Section 2.1 discusses the importance of communication
classification and how communication classification can be used. Section 2.2 discusses the Shadow
Memory technique used by Sigil to capture and classify communication. Section 2.4 discusses
the output representations that Sigil can produce, and their potential uses. Section 2.3 reports the
slowdown and memory overhead of Sigil, for capturing its data.
2.1 Communication classification
[Figure 2.1 diagram: a CPU connected to an accelerator containing processing elements (PEs), an I/O buffer, and a scratchpad; arrows between the structures are labeled Input Unique, Input Non-Unique, Output Unique, Local Unique, and Local Non-Unique.]
Figure 2.1: Applying categorizations in an accelerator design example
A modern system design methodology that tracks software-level communication must first mea-
sure the volume of communication between producing and consuming software entities. Designers
will find that for some tasks, such as hardware/software partitioning, the classification of commu-
nication into categories is more useful than just recording volumes of total communication. This
section discusses the various categories of classification in Sigil. We illustrate the reasoning behind
communication classification with a system design example, shown in figure 2.1. The example shows
a scenario where we want to evaluate a particular function for acceleration (i.e. build an accelerator
for it). A PE refers to a processing element that performs the work of the accelerator itself, while the
I/O buffer and scratchpad are random access memory structures that house data. In this example,
the CPU explicitly communicates with the accelerator. A well designed accelerator (ASIC, GPU, or
FPGA) for a function will include an internal buffer (the I/O buffer) and will not repeatedly fetch
the same data from outside the accelerator. The I/O buffer holds the input data set and output
data set that may be reused and the scratchpad holds intermediate data generated locally within
the accelerator that may be reused as well.
The categories of classification and their equivalent run-time manifestation in hardware are also
shown in figure 2.1 as arrows between different hardware structures. Each arrow indicates the
expected category of software-level communication edges that manifests as communication between
hardware structures. We label the communication edges that represent first-time reads from the input data
set as input unique communication edges. An input unique communication edge
represents the first-time read of a byte of data from the input data set. As there will be an input
unique communication edge for every data byte in the input set, the total bytes for all the edges in
this category is the size of the input data set. We label the communication edges that represent first-time reads
from the local data set (the locally generated bytes that are not communicated
to any other function) as local unique communication edges. A local unique communication edge
represents the first-time read of a byte of data from the local data set. As there will be a local
unique communication edge for every data byte in the local set, the total bytes for all the edges
in this category is the size of the local data set. The bytes for the input unique and local unique
communication edge categories will manifest as writes to the I/O buffer and scratchpad, respectively.
Repeated reads of data bytes from the input data set or local data set, represent reuse of the data,
and will manifest at run-time in hardware as reads from the I/O buffer or the scratchpad. We
label the bytes for these reuse communication edges as input non-unique and local non-unique,
respectively. The figure also shows the output unique category, which represents the output data
set, read by other functions in the program. The output data set for a function will be part of the
input data set for other functions which read from this function.
The distinction between unique and non-unique communication is particularly important for
HW/SW partitioning. Unique communication is the true amount of data an accelerator needs to
complete its task. In a HW/SW partitioning context, local data bytes will either be consumed within
the pipeline of the accelerator or stored in local memory depending on the data reuse characteristics
and the accelerator pipeline implementation.
Prior work has analyzed communication between functions [38], but does not distinguish total
communication from unique communication. In their work, first time accesses to a byte of data are
aggregated along with subsequent accesses to the same byte, not allowing us to isolate the true read
and write set of a function. In contrast, the unique byte counts from Sigil’s profile determine the
true inputs needed by a function.
2.1 Communication classification
2.1.1 Formal definition of categories for classification
As explained previously, each category of classification of software-level communication reflects di-
rected hardware-level communication at run-time between specific hardware structures. In this
subsection we cover those categories of classification with formal definitions. Sigil classifies every
communication edge, and hence every communicated byte, along two dimensions: 1) input/output/local
and 2) unique/non-unique. In the first dimension, local indicates that the byte was
generated and read by the same function. The input/output identifier indicates that the byte was
generated by one function and read by another. The unique/non-unique dimension
is used to distinguish between the first-time use of a byte by a function and subsequent reuse of it.
Unique indicates that the consumer is reading this byte for the first time, while non-unique indicates
that the consumer has read this same byte before.
At the software-level, communication is caused by load/store instructions when there is a depen-
dency of the load instruction on the store instruction, through a memory address. As mentioned
earlier, tracking this type of communication allows us to handle dependencies across basic blocks
(across branch instructions) and through data structures that use pointers, such as linked lists and
hash tables that cannot be analyzed statically. Sigil monitors memory addresses to identify new
data bytes when they are written by a Store instruction in a producer function, and establishes
communication edges, along with the appropriate categories of classification, when a Load instruction in
a consumer function reads the data bytes. When a memory address is overwritten by another Store
instruction, the new data bytes from the write are attributed to the new producer function that
contains the Store instruction. This gives Sigil the ability to track producer-consumer relationships
for every unique data byte produced by the program. Sigil also captures computation operations
such as floating point and integer operations, as HW/SW partitioning problems require a notion
of the amount of work done in each function. Here we formally define the different categories under
which communication is classified. For each function the following information is collected:
1. INSTRS: The number of instructions executed by the function.
2. IOPS: The number of integer operations executed by the function.
3. FLOPS: The number of floating point operations executed by the function.
4. IPCOMM TOTAL: The total data bytes for all the communication edges where the con-
sumer was the function, and the producer was some other function. There is a separate
IPCOMM entry for each producer that the function reads data from. Each IPCOMM entry
thus represents the sum of all bytes from Load instructions in the function that has the entry,
where the corresponding Store instructions were from some other particular producer function.
5. OPCOMM TOTAL: The total data bytes for all the communication edges where the pro-
ducer was the function, and the consumer was some other function. There is a separate
OPCOMM entry for each consumer that reads data from the function. Each OPCOMM entry
thus represents the sum of all bytes from Load instructions in some other particular consumer
function, where the corresponding Store instructions were from the function that has the entry.
6. LOCAL: The total data bytes for all the communication edges where the producer and consumer
were both the function. The LOCAL entry thus represents the sum of all bytes from Load
Instructions in the function that has the entry, where the corresponding Store instructions also
belong to the function itself.
7. IPCOMM UNIQUE: The data bytes for all the unique communication edges where the
consumer was the function, and the producer was some other function. Recall that a commu-
nication edge connects a producer and consumer function and is weighted in bytes. A unique
communication edge from the perspective of the consumer function represents the read of a
data byte for the first time, where the writer of the data byte was the corresponding producer
function. There is a separate IPCOMM UNIQUE entry for each producer that the function
reads data from. Each IPCOMM UNIQUE entry thus represents the sum of all bytes from
Store Instructions in a particular producer function, where the corresponding Load instructions
were from the function that has the entry.
8. OPCOMM UNIQUE: The unique data bytes for all the communication edges where the
producer was the function, and the consumer was some other function. A unique communi-
cation edge from the perspective of the producer function represents the write of a data byte
that was read for the first time by the corresponding consumer function. There is a separate
OPCOMM UNIQUE entry for each consumer that reads data from the function. Each OPCOMM
UNIQUE entry thus represents the sum of all bytes from Store instructions in the
function that has the entry, where the corresponding Load instructions were from some other
particular consumer function.
9. LOCAL UNIQUE: The unique data bytes for all the communication edges where the producer
and consumer were both the function. The LOCAL UNIQUE entry thus represents the sum
of all bytes from Store Instructions in the function that has the entry, where the corresponding
Load instructions belong to the function itself.
10. Number of Calls: Used to determine the average costs of a single call.
We can infer the non-unique categorization (which represents reuse) mentioned earlier by subtracting
the unique communication bytes from the total communication bytes: (IPCOMM TOTAL
- IPCOMM UNIQUE) for input non-unique bytes, or (OPCOMM TOTAL - OPCOMM UNIQUE)
for output non-unique bytes. Hereafter, we refer to the total, unique and non-unique categorizations
where applicable. The IOPS and FLOPS represent the actual work done in each function in terms
of computation.
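The per-function record and the subtraction for non-unique bytes can be sketched as follows. This is only an illustration in Python: the class and field names (CommCounts, FunctionProfile and so on) and the byte counts are hypothetical and are not Sigil's actual data layout. Applying the same subtraction to the LOCAL counters is an assumption that follows the pattern stated for the input and output counters.

```python
from dataclasses import dataclass, field


@dataclass
class CommCounts:
    """Total and unique bytes for one producer-consumer pair (or for local)."""
    total: int = 0
    unique: int = 0

    @property
    def non_unique(self) -> int:
        # Reuse is whatever remains after removing first-time reads.
        return self.total - self.unique


@dataclass
class FunctionProfile:
    """Hypothetical per-function record mirroring the categories listed above."""
    name: str
    instrs: int = 0
    iops: int = 0
    flops: int = 0
    calls: int = 0
    ipcomm: dict = field(default_factory=dict)   # producer name -> CommCounts
    opcomm: dict = field(default_factory=dict)   # consumer name -> CommCounts
    local: CommCounts = field(default_factory=CommCounts)

    def input_non_unique(self) -> int:
        return sum(c.non_unique for c in self.ipcomm.values())

    def output_non_unique(self) -> int:
        return sum(c.non_unique for c in self.opcomm.values())

    def avg_instrs_per_call(self) -> float:
        # "Number of Calls" is used to average costs over a single call.
        return self.instrs / self.calls if self.calls else 0.0


# Made-up numbers, for illustration only.
f = FunctionProfile("f", instrs=1000, calls=4)
f.ipcomm["g"] = CommCounts(total=48, unique=16)
f.ipcomm["h"] = CommCounts(total=8, unique=8)
f.local = CommCounts(total=20, unique=12)
print(f.input_non_unique(), f.local.non_unique, f.avg_instrs_per_call())
```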
2.1.2 Identifying categories in software applications
We have shown the categories of classification and their implications in hardware-level communica-
tion. In this subsection, we highlight how we identify, in software applications, the categories that
the bytes of a communication edge belong to. For the purposes of this subsection, assume that the
Sigil infrastructure observes execution of a program at run-time and monitors memory addresses
to determine producers and consumers of data. We explain the implementation in further detail in
later sections.
Figure 2.2: Categorization of dataflow into I/O vs. local and unique vs. non-unique. Functions
X, Y and Z are called in order from left to right. (Diagram: Functions X, Y and Z each perform
Compute, Wr and Rd operations on Addr A; the two reads in Function X are labelled Local Unique
and Local Non-Unique.)
Figure 2.2 shows an example of categorization of function-level communication. Functions X,
Y and Z in the application are executed in the order they appear going from left to right in the
figure. All communication in this example occurs through the Address A, where functions X and
Y both read and write to A, while function Z only reads from A. We see that function X writes
to address A twice, after which it reads address A twice. When function X overwrites address A,
the previous data at address A has not been communicated and is not counted as part of local or
input/output communication. In general, only the last write to an address within a function will be
used to register input/output communication. The first time function X reads Addr A, we call it
a unique local communication edge and mark the bytes as being part of the local data set. The
second time function X reads Addr A, we label it as a non-unique local communication edge and the
bytes are marked as reuse of the local set. The first time function Y reads Addr A, we identify it as
a unique input communication edge for function Y (and a unique output communication edge for
function X), as this is the first time the data transfer has occurred for this producer-consumer pair.
The second time function Y reads Addr A, we label the edge as non-unique and the bytes
of that edge are marked as reuse of the input/output set (output set for X, input set for Y). Finally,
when Function Y overwrites Addr A, we identify the new producer for Addr A as Function Y, so
that when function Z reads Addr A for the first and second times, we are able to correctly identify
the corresponding communication edges for this producer-consumer pair as unique and non-unique,
respectively.
In this methodology, uniqueness is defined to be over the duration of an entire function call
or for as long as the byte is not overwritten in memory. We mark the bytes of a communication
edge as unique the first time they are read by a particular call to a consuming function, and
subsequent reads of those bytes within the same call of the consuming function are labelled as
non-unique. Thus, if the consuming function is called again and reads the same byte once more, we
treat the first read as unique again. Theoretically, uniqueness can be defined over any period of
time in the application, such as over the duration of the program, or within a fixed number of
instructions. We model the costs for each call to a function separately, as each call would represent
the invocation of an accelerator for that function. We can also study non-unique communication
in a function to understand its data reuse pattern over some proxy for time. To facilitate this,
upon specifying a command line option, Sigil also records statistics for each data byte: the number
of non-unique accesses for each call (reuse count) and the instruction count between first and last
non-unique accesses (reuse lifetime). This can give us hints into a function’s impact on a memory
system; that is, how often a function is accessing the same data and the liveness of that data.
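As a minimal sketch of the two reuse statistics just described, the following Python fragment keeps a reuse count and a reuse lifetime for one data byte within one call of a consumer. The class name ReuseStats, its method names and the instruction counts are hypothetical; using the global instruction count as the timestamp is an assumption based on the description above.

```python
class ReuseStats:
    """Hypothetical per-byte reuse record for one call of a consuming function."""

    def __init__(self):
        self.reuse_count = 0
        self.lifetime_start = None    # instruction count at first non-unique read
        self.lifetime_finish = None   # instruction count at last non-unique read

    def record_non_unique_read(self, instr_count):
        self.reuse_count += 1
        if self.lifetime_start is None:
            self.lifetime_start = instr_count
        self.lifetime_finish = instr_count

    @property
    def reuse_lifetime(self):
        # Instructions elapsed between the first and last non-unique access.
        if self.lifetime_start is None:
            return 0
        return self.lifetime_finish - self.lifetime_start


stats = ReuseStats()
for t in (120, 450, 900):   # made-up instruction counts of repeated reads
    stats.record_non_unique_read(t)
print(stats.reuse_count, stats.reuse_lifetime)
```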
2.2 Automating the capture and classification of communication with
Sigil
Traditionally, understanding software-level communication and using it to improve the performance of
a system has been left to the programmer or system designer. As mentioned before, software-level
communication can be used for a multitude of purposes, ranging from HW/SW partitioning
and parallelism discovery to design of the interconnect [95, 31, 116, 53]. Thus, there is a need
to automate the capture of software-level communication, as acknowledged by recently published
work [44, 69, 37, 55]. Prior attempts at capturing communication have several drawbacks; they do
not classify communication, are less efficient, met with limited success and often require programmer
intervention [87, 98, 60, 38, 44, 69, 55]. In this section, we show how our tool Sigil overcomes
many of these drawbacks, by being fully automatic and efficient. Sigil also incorporates additional
features such as communication classification and the ability to generate traces of communication,
which enable the utilities mentioned above.
Sigil uses memory shadowing to efficiently keep track of producers and consumers of data
in memory [71]. We explain how memory shadowing helps us achieve efficiency in section 2.2.2. To
the best of our knowledge, shadow memory is employed by only two other tools to observe communication,
Redux and Kremlin [69, 33]. While the sole purpose of Kremlin is to discover parallelism, we employ
Sigil to perform communication characterization, similar to Redux. However, as explained earlier,
the Redux developers acknowledge that Redux is not scalable, as its goals were to capture very
detailed communication flow between individual instructions. Development on Redux has been
discontinued. In this section, we also explain the data structures and methodology Sigil uses to
capture and classify communication, and profile reuse behavior efficiently.
While the baseline usage of Sigil is to capture, classify and aggregate communication, it is also
capable of generating traces of communication and profiling the reuse behavior based on the non-
unique communication. The additional features are implemented as command line options. There
is one command line option to generate traces of communication that is mutually exclusive with the
command line option to profile reuse behavior. The implementation and structures for these options
are also discussed in this section. Case studies with these options activated are presented in the next
chapter.
2.2.1 Overview of Sigil’s capture methodology
Sigil executes the program natively and monitors the program as it runs to generate its profiling
data. Sigil monitors and captures communication through memory addresses, with the help of the
Shadow Memory [71]. Memory shadowing is an efficient way of holding an object of data (known as
a shadow object) for every byte-level address touched by the program. We use the shadow object
for each address to hold the last writer and reader of its corresponding address. The tool establishes
communication edges anytime a function reads data written by one or more functions.
Figure 2.3: Capturing communication using Shadow Memory. (Diagram: Function X writes Addr. A (1),
updating the last writer in the Shadow Memory; Function Y reads Addr. A (2); the Monitor fetches
the last writer and updates the last reader (3, 4); the classified bytes go to the data object for
Function Y (5).)
Figure 2.3 presents an example of this process. The numbered circles indicate the sequence
of dynamic steps performed in capturing the trace from an application as it runs natively. In step 1,
a store to address A occurs in function X and the address is emitted to the shadow memory, which
stores function X as the last writer for address A. Subsequently, in step 2, a read of address A
occurs in function Y; this implies a communication edge. The address is sent to a monitor which
checks against the Shadow Memory to determine the function call which last wrote to address A
and which last read address A in step 3. The last writer information is sent back to the monitor
in step 4, which decides if this was an inter-function communication edge or a local communication
edge. Furthermore, if the last reader matches with the current function call which read this value,
we mark the bytes for the communication edge as non-unique. In step 5, the appropriately classified
bytes for the communication edge are sent to the data objects that hold aggregated communication
information for Function Y. Not shown in the diagram, the bytes are also sent to the data object
for Function X that tracks its output bytes.
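The steps above can be sketched in Python as a small monitor that replays the access sequence of Figure 2.2. This is a simplified illustration rather than Sigil's implementation: the names Monitor and ShadowObject are hypothetical, a plain dictionary stands in for the shadow memory, every access is assumed to touch a single byte, and clearing the last reader on a store is an assumption based on the statement that uniqueness holds only until a byte is overwritten.

```python
from collections import defaultdict


class ShadowObject:
    def __init__(self):
        self.last_writer = None        # function that last stored to the byte
        self.last_reader = None        # function that last loaded the byte
        self.last_reader_call = None   # call number of that reader


class Monitor:
    def __init__(self):
        self.shadow = defaultdict(ShadowObject)   # address -> ShadowObject
        self.bytes = defaultdict(int)             # (consumer, producer, label) -> bytes

    def store(self, func, addr):
        obj = self.shadow[addr]
        obj.last_writer = func
        # Assumption: an overwrite creates new data, so the next read is a first read.
        obj.last_reader = None
        obj.last_reader_call = None

    def load(self, func, call, addr):
        obj = self.shadow[addr]
        # Reads with no recorded writer are filed under a "No producer" label.
        producer = obj.last_writer or "No producer"
        scope = "local" if producer == func else "input"
        seen = obj.last_reader == func and obj.last_reader_call == call
        label = f"{scope} {'non-unique' if seen else 'unique'}"
        obj.last_reader, obj.last_reader_call = func, call
        self.bytes[(func, producer, label)] += 1
        return label


# Replay of Figure 2.2: X writes twice and reads twice, Y reads twice and
# overwrites, then Z reads twice.
m = Monitor()
A = 0x1000
m.store("X", A)
m.store("X", A)
labels = [m.load("X", 1, A), m.load("X", 1, A),
          m.load("Y", 1, A), m.load("Y", 1, A)]
m.store("Y", A)
labels += [m.load("Z", 1, A), m.load("Z", 1, A)]
print(labels)
```

Each input byte recorded for a consumer is also an output byte for its producer, so a fuller sketch would update the producer's record at the same point.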
2.2.2 Shadow memory based implementation
Memory shadowing is an efficient way of holding data for every single byte written by the pro-
gram [71, 117]. Prior works that classify communication use linked lists and array based tables
which are not necessarily scalable [44, 37, 87, 98]. While memory shadowing is not new, Sigil is
amongst the first tools to employ it in capturing and classifying communication.
Figure 2.4: How the shadow memory tracks producers and consumers; and distinguishes reuse.
(Diagram: a Primary Map indexed by Addr[34:16] points to Secondary Maps indexed by Addr[15:0],
each holding Shadow Objs. For "ST Addr, Register" in Function A followed by "LD Register, Addr"
in Function B, the shadow object holds Last Writer = Func A and Last Reader = Func B.)
Sigil’s shadow memory structure is derived from Nethercote and Seward’s description [70]. It is
a two-level table, similar to an operating system page-table, where each level is indexed by a portion
of the data’s byte-address. Figure 2.4 shows how this structure is organized. The first-level structure
is known as the primary map, while the second-level structures are called secondary maps. The primary
map is statically declared to cover a large portion of the address space defined by the upper address
bits. In order to save space, the secondary maps are created only when the corresponding portions
of the address space in the primary map are accessed. Each second-level structure is a chunk of
shadow objects which are initialized to “invalid” until the data bytes corresponding to those addresses
are used by the binary.
Accessing shadow objects in Sigil
This shadow memory guarantees constant time lookups, inserts and deletes for tracking producers
and consumers. Figure 2.4 shows how the entire lookup process works with Shadow Memory. Assume a
scenario where function A writes an address and function B reads the same address. Sigil needs to
establish a communication edge between A & B and classify the edge. When the store in A occurs,
the shadow memory object is looked up, declaring secondary maps in the process, if necessary. The
upper 20 bits of the address are used to index into the Primary map. If the map entry is empty, then
a new secondary map is allocated. The secondary map is looked up using the lower 18 bits of the
address and the shadow object is fetched. As Sigil uses Valgrind’s memory allocator, we found the
most efficient memory usage arose from using secondary maps that had 256k entries or more. Sizing
the secondary map too large could cause wastage of space if all the shadow objects in the secondary
map are not used. We chose the minimum size for secondary maps in order to keep the memory
footprint as low as possible. The shadow object contains the last writer and last reader variables.
The last writer is set to A. When the load occurs, the object is looked up again using the same
path and the last writer is determined to be A. We then record the last reader as B and the call
number of B, so that during this call to B, if B reads Addr again, we can mark those corresponding
bytes as non-unique. Thus, we have only two array accesses to reach a shadow object, barring the
time taken to declare a secondary map occasionally. This guarantees O(1) time for lookup, insert
and delete in the worst case. Also, since the shadow memory’s primary map is sparsely populated,
the implementation is memory efficient. This implementation conserves space and speeds up the
lookups, providing an order of magnitude improvement over a naive approach that uses linked lists
or dynamic arrays to track addresses. The only limitation of our current implementation is that it
cannot handle addressing beyond 38 bits.
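The lookup path described above can be sketched in Python as a two-level table with lazily allocated secondary maps. The bit widths (20 upper bits, 18 lower bits, 38 bits in total) come from the text; the class name ShadowMemory, the use of Python lists and dictionaries, and the example address are assumptions made for illustration, and Sigil itself allocates through Valgrind's memory allocator.

```python
PRIMARY_BITS = 20
SECONDARY_BITS = 18
ADDR_BITS = PRIMARY_BITS + SECONDARY_BITS    # 38-bit addressing limit
SECONDARY_SIZE = 1 << SECONDARY_BITS         # 256k shadow objects per secondary map
SECONDARY_MASK = SECONDARY_SIZE - 1


class ShadowMemory:
    def __init__(self):
        # The primary map is declared up front; entries stay empty until touched.
        self.primary = [None] * (1 << PRIMARY_BITS)
        self.secondary_maps_allocated = 0

    def lookup(self, addr):
        if addr < 0 or addr >> ADDR_BITS:
            raise ValueError("addresses beyond 38 bits are not supported")
        hi = addr >> SECONDARY_BITS          # index into the primary map
        lo = addr & SECONDARY_MASK           # index into the secondary map
        secondary = self.primary[hi]
        if secondary is None:
            # Allocate the secondary map on first use; None stands for "invalid".
            secondary = [None] * SECONDARY_SIZE
            self.primary[hi] = secondary
            self.secondary_maps_allocated += 1
        obj = secondary[lo]
        if obj is None:
            obj = secondary[lo] = {"last_writer": None,
                                   "last_reader": None,
                                   "last_reader_call": None}
        return obj


sm = ShadowMemory()
addr = 0x7FFF1234                              # made-up address
sm.lookup(addr)["last_writer"] = "A"           # store in function A
obj = sm.lookup(addr)                          # load in function B
producer = obj["last_writer"]
obj["last_reader"], obj["last_reader_call"] = "B", 1
print(producer, sm.secondary_maps_allocated)
```

Both the store and the load reach the shadow object with two array indexing operations, which is what gives the constant-time behavior once the secondary map exists.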
Table 2.1: Shadow Object Contents
Variable                Size   Description
Baseline
last writer             8B     pointer to function
last reader             8B     pointer to function
last reader call        8B     call number
Additional variables for Reuse mode
reuse count             8B     # of times byte was accessed
reuse lifetime start    8B     first access timestamp
reuse lifetime finish   8B     final access timestamp
The full contents of a shadow object for Sigil are shown in Table 2.1. The baseline variables
(without profiling data reuse) collected for all workloads allow Sigil to determine producer/consumer
relationships. When also profiling reuse behavior using Sigil (command line option), the shadow
memory object is extended with additional variables used to derive data liveness and reuse. The last
writer field corresponds to the variable that is updated and accessed in Steps 1 and 4, respectively,
of figure 2.3. Sigil also uses a pointer to the last reader and the corresponding call number to
distinguish between unique reads and non-unique reads, which are both updated and accessed in
Step 4.
2.2.3 Auxiliary data structures employed
Besides the Shadow Memory, which is used to track producers and consumers, Sigil needs auxiliary
data structures to record the aggregated communication information over the program’s execution.
Sigil dumps recorded data to a file upon termination of the program being profiled. These auxiliary
structures help us hold data and are constructed efficiently to reduce the lookup time to store
profiled data. Figure 2.5 shows all the structures we used to record communication information for
the baseline usage of Sigil (without reuse profiling and event trace generation options). For every
function in the program, we declare a data object that holds all its information. These data objects
link to one another and are arranged in the form of a calltree. For example, figure 2.5 shows that
Function A, which calls Function B and C, is also linked to B and C. In essence, each Function can
have only one parent that links to it; for functions which are called in more than one context, we
create separate nodes to account for their costs separately. This is useful for HW/SW partitioning
problems as we show later.
Figure 2.5: Data structures for Function-level profiling. (Diagram: data objects for Func A, Func B
and Func C, each with Local and Local-Unique counts, an Input List and an Output List; the expanded
Input List for Func A has entries for B, C, D and E, with Unique and Non-Unique byte categories.)
The data object for each function records both communication and computation variables. We
capture information related to computation such as instruction counts, integer operations and
floating point operations for each function. This is not shown in figure 2.5 for lack of space. The data
object contains the counts of local total and local unique bytes and pointers to tables for the input
and output list. The input list holds an object for every function that this function consumes from,
while the output list holds an object for every function that this function produces data for. The
expanded view of the input list for function A in figure 2.5 shows that function A reads from B,
C, D and E and for each of them we hold the input unique and input total bytes read from those
particular functions over the run of the program. The output list is constructed in a similar manner.
The input and output lists are implemented as dynamic arrays that are re-sized in chunks of 100
entries, to stay memory efficient. While these lookups take O(n) time, we have not encountered an
input or output list with more than 100 functions, so the lookups remain
fairly quick. Also note that since the bytes for one producer-consumer pair represent input for the
consumer and output for the producer, the data objects are cross linked to allow for quicker updates
while profiling.
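A minimal sketch of these auxiliary structures, assuming hypothetical names (FunctionNode, PairCounts, record_read) and made-up byte counts, could look like the following in Python. Python lists grow automatically, so the resizing in chunks of 100 entries is not modelled; the linear scan over the input list and the cross link between the consumer's input list and the producer's output list follow the description above.

```python
class PairCounts:
    """Bytes for one producer-consumer pair, shared by both endpoints."""

    def __init__(self):
        self.unique = 0
        self.total = 0


class FunctionNode:
    """Hypothetical calltree node; one node per calling context."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.local_unique = 0
        self.local_total = 0
        self.input_list = []     # [(producer node, PairCounts)]
        self.output_list = []    # [(consumer node, PairCounts)]

    def child(self, name):
        # The same function called from a different parent gets a separate node.
        if name not in self.children:
            self.children[name] = FunctionNode(name, parent=self)
        return self.children[name]

    def _pair(self, producer):
        for node, counts in self.input_list:     # O(n) scan over a short list
            if node is producer:
                return counts
        counts = PairCounts()
        self.input_list.append((producer, counts))
        producer.output_list.append((self, counts))   # cross link for quick updates
        return counts

    def record_read(self, producer, nbytes, unique):
        if producer is self:
            self.local_total += nbytes
            self.local_unique += nbytes if unique else 0
            return
        counts = self._pair(producer)
        counts.total += nbytes
        counts.unique += nbytes if unique else 0


main = FunctionNode("main")
a = main.child("A")
b = a.child("B")
c = a.child("C")
a.record_read(b, 8, unique=True)
a.record_read(b, 8, unique=False)
a.record_read(c, 4, unique=True)
a.record_read(a, 2, unique=True)
print(a.input_list[0][1].total, b.output_list[0][1].unique, a.local_total)
```

Sharing one PairCounts object between the two lists is one way to realize the cross linking: a single update is visible as input bytes for the consumer and output bytes for the producer.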
In addition, when reuse profiling is activated, we hold two 10000-entry histograms in the object
for each function, to profile reuse of the local and input data sets of a function. We can reduce
the size of the histogram based on the required fidelity and maximum allowable memory footprint,
though the maximum we currently support is 10000 entries per function. As mentioned earlier,
Sigil can also produce an event trace representation as output that additionally prints out every
communication edge encountered in the order that it was encountered. This is also activated by
a command line option, and can use a fixed-length buffer of a configurable size (default: 2GB) to
hold the edges before periodically dumping them to file. This is done to reduce the
overhead time for printing.
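The buffering idea can be illustrated with a short Python sketch. The class name TraceBuffer and the trace line format are hypothetical, and the capacity here is counted in edges rather than bytes to keep the example small; it only shows that edges are held in memory and written out in order once the buffer fills.

```python
import io


class TraceBuffer:
    """Hypothetical fixed-capacity buffer that batches trace output."""

    def __init__(self, out, capacity):
        self.out = out
        self.capacity = capacity    # maximum number of buffered edges
        self.pending = []
        self.flushes = 0

    def record(self, producer, consumer, nbytes, kind):
        # The line format below is made up for illustration.
        self.pending.append(f"{producer} -> {consumer} {nbytes} {kind}\n")
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        if self.pending:
            self.out.writelines(self.pending)
            self.pending.clear()
            self.flushes += 1


out = io.StringIO()
buf = TraceBuffer(out, capacity=2)
buf.record("X", "Y", 4, "unique")
buf.record("X", "Y", 4, "non-unique")
buf.record("Y", "Z", 4, "unique")
buf.flush()                          # final dump at program termination
print(out.getvalue(), end="")
```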
Using the Shadow Memory based implementation and the auxiliary data structures to record
communication and computation information, Sigil is able to efficiently capture and classify com-
munication. We characterize Sigil to understand its slowdown and memory usage. We present this
in the next section.
2.3 Sigil infrastructure & characterization
We chose a run-time instrumentation framework to develop Sigil, as static analysis tools have limited
visibility into the run-time behavior of the program (especially through pointer-indirection) [44, 56].
Sigil can be implemented in any run-time instrumentation framework such as Pin, Valgrind or
DynamoRIO, so long as the appropriate hooks are present, such as function calltree information,
streams of memory addresses and compute operations. As of the writing of this dissertation, Sigil
is written on top of Callgrind (part of the Valgrind instrumentation framework) [72], as Callgrind
has all the necessary hooks for functions and threads and needed only very minor, non-intrusive
modifications. Valgrind is a Dynamic Binary Instrumentation (DBI) framework that is capable of
intercepting a user program at run time and provides mechanisms to perform heavyweight analysis
of the program [72]. Valgrind translates assembly into an intermediate representation. This
representation reduces the program to a collection of RISC-like primitives such as Load/Store instructions
and compute operations that allow profiling tools to be written easily.
Callgrind is a tool that is built over the Valgrind framework [106] and provides call tree structures
which form the basis for linking the auxiliary data structures in Sigil. Callgrind captures a calltree
of the running program and also performs on-the-fly cache simulations to determine the behavior
of the program. It maintains costs for each function in the calltree of the running program. A
programmer can identify performance bottlenecks in a software application by using a breakdown
from Callgrind of parameters such as cache misses and branch mispredictions. Sigil intercepts calls
both in the cache simulation and where compute operations are identified, and forwards information
such as address ranges and operations performed (read/write/compute) to its own profiling functions,
along with function/thread IDs. It also examines Callgrind’s variables to determine the current
active entity (functions or threads for this implementation) and how it was entered/exited. Thus,
for function-level analysis, Sigil monitors Call and Return instructions with the help of Callgrind.
For thread-level analysis, Sigil monitors thread API calls such as create and join, which requires
more powerful Valgrind-specific features, which are discussed in Chapter 4.
In general, Sigil can use any framework that identifies communicating entities and exposes addresses
and operations to the tool. Callgrind was minimally modified to insert calls to Sigil and
allow it to compile along with Callgrind. The biggest change adds the functionality to log
floating point and integer operations within Callgrind. As with any Valgrind or debugging tool such
as gdb, the Sigil output’s human readability is drastically reduced when the binary does not have
debugging symbols. The application binary, otherwise, can remain unmodified.
Since Sigil is built on Valgrind which operates entirely in user space, Sigil shares Valgrind’s
inability to observe any compute or memory operations within the kernel. However, Valgrind can
intercept system calls and report an aggregate of the memory addresses read and written within a
system call. Since system calls are performed on behalf of library functions, we attribute costs to
library functions instead. For a system call, we record the library function as the writer for the list
of memory addresses written, and treat the reads as coming from the library function. The current
limitation with this approach is that some reads performed in library calls are not matched with
visible writes (addresses modified inside the kernel and not reported by Valgrind). We record these
under a label “No producer” indicating the ambiguity of the producer of that information, and store
it along with the function’s data structures.
Figure 2.6: Slowdown of Sigil and Callgrind relative to native for baseline function-level
profiling. (Plot: y-axis Slowdown, 0 to 1400; series Sigil and Callgrind.)
Figure 2.7: Slowdown of Sigil relative to Callgrind for baseline function-level profiling. (Plot:
y-axis Slowdown, 0 to 25; series simsmall and simmedium.)
Figure 2.8: Memory usage for baseline function-level profiling. (Plot: y-axis Memory (MB), 0 to
6000; series simsmall and simmedium.)
We measured the cost of running Sigil on an Intel Xeon E5620 platform with 24GB of DRAM for
the baseline operation described in section 2.1.1, and with the data reuse profiling feature described
in the same section. With data-reuse profiling activated, Sigil’s memory usage is up to 2 times
larger when the instrumented program touches a large range of addresses. To alleviate large memory
footprints in general, we added a simple FIFO mechanism in Sigil, activated with a command line
option, to free up space from shadow bytes of addresses that have been least recently touched by
the program. Figure 2.6 shows the baseline function-level profiling slowdown of Sigil and Callgrind
relative to native runs, without any instrumentation, of the serial versions of PARSEC workloads with
the “simsmall” input. Sigil’s slowdown is much larger than Callgrind’s, with the average slowdown
being 580x for simsmall inputs and 720x for simmedium inputs. Figure 2.7 shows the slowdown
of Sigil relative to Callgrind; we observe an average slowdown of 8-9x, which remains fairly consistent
given Sigil’s ambitious goals. dedup is an outlier which incurred more slowdown as we enabled the
memory limiting command line option to keep Sigil’s memory usage manageable. blackscholes and
swaptions with simsmall inputs take very little time in both frameworks (less than 5 minutes), so the
slowdowns of 10x over Callgrind are not noticeable. Figure 2.8 shows the memory usage of Sigil for
workloads as we increase the data size. The memory increase also remains consistent for increased
data size. facesim and raytrace are intensive benchmarks that use larger amounts of memory but
incur constant overhead over a native run.
Using the memory limiting command line option
improves performance when instrumenting programs with large memory usage. dedup is the only
benchmark amongst the PARSEC benchmarks, for which we have needed to enable this memory
limit parameter. We found the corresponding loss of accuracy to be negligible, when we compared
the aggregated data.
Sigil incurs a larger slowdown than Callgrind over native runs of benchmarks. Sigil uses more
memory and incurs more memory lookups than Callgrind as it shadows the entire program state.
We believe this overhead is justified as Sigil captures platform-independent data and only needs to
be run once.
2.4 Sigil output representations
So far, we have described how Sigil can capture and classify every communicated byte in a
program. Sigil can represent profiled computation and communication data in two ways: (1) by re-
porting the aggregates of measured communication for each function in the program; (2) by recording
a list of all of the communication edges that occur. In the latter representation, a program’s essence
can be reconstructed as a trace of dependent events. The event trace representation essentially
records fragments of computation separated by communication edges. Note: the producing and
consuming entities are still functions, and this representation helps designers understand the order
imposed on function calls due to the algorithm implemented by the program. In each representa-
tion, we report the categories of unique and non-unique bytes of communication. The next two
subsections explain the two representations in detail.
2.4.1 Aggregates representation
HW/SW partitioning algorithms seek to discover the optimal way to split applications to determine
which pieces of the application run on specialized hardware. They seek to minimize communication
between partitions and maximize work done in the hardware partition. Partitioning algorithms are
usually applied on task graphs constructed manually by the programmer/system designer or generated
pseudo-randomly using research tools such as Task Graphs for Free [24]. Sigil’s aggregates
representation allows the construction of compact function-level control data flow graphs for real
applications. These types of control data flow graphs, where all nodes are functions and the only
control flow instructions monitored are call, return and jump instructions, are often called
interprocedural data flow graphs [37, 87, 98]. We show that these interprocedural data flow graphs
(IDFGs) can be substituted for coarse-grained task graphs in HW/SW partitioning problems, with
some special considerations described in the next chapter. Thus the aggregates representation has
the potential to remove the manual work performed in constructing task graphs, and provides a
reasonable real application substitute for randomly generated task graphs.
Figure 2.9: Data and control flow between functions
In this subsection, we describe the representation and how Sigil’s output data is used to con-
struct it. Figure 2.9 shows the sample interprocedural data flow graph (IDFG) for a toy program,
constructed from profiling data that Sigil provides. Each function is represented by a node and
nodes are connected by two types of edges: call edges (or control flow edges) in bold arrows and
communication edges (or data flow edges) in dashed arrows. The nodes are arranged in the form
of a calltree obtained from Callgrind, where parent calling functions point to children with a bold
arrow. The calltree represents a hierarchical form of control flow where a child function is called
within its parent function and returns control to the parent function after completing its task. As
we explained in Section 2.2.3, to ensure that a child has only one parent, if any function is called
through two different paths traced from the root (different calling contexts), we replicate the node
and identify it differently for each path to maintain separate costs. The nodes in the calltree are
all unique, with no repetition. In the example in the figure, our final representation would generate
nodes D1 and D2 for the two different calling contexts of function D.
Since each node is not repeated, multiple communication edges can exist between all functions
as seen in the figure. This is due to the fact that although a node can have only one parent in the
calltree, data can be passed between any set of functions using pointers. The directed communication
edges are weighted by the number of bytes needed by the receiving function to do its work; i.e.
“unique” communication. If the total differs from the unique, a field for total communication can
also be included, as shown on the dashed arrow between functions main and A. Function A reads 4 bytes
from function “main” uniquely and 16 bytes in total, while all the remaining edges have only unique
communication. The dashed edges essentially represent volumes of unique or total communication
between functions across all calls and the entire run of the program.
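To make the structure of this graph concrete, the following is a minimal Python sketch of an IDFG container with separate call edges and weighted communication edges. The class and method names are our own illustrative assumptions and do not reflect Sigil's output format. Only the edge from main to A (4 unique bytes, 16 total) and the call from A to C are taken from the text; the weight on the A-to-C edge is made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CommEdge:
    """Bytes flowing from a producer function to a consumer function."""
    unique: int = 0  # bytes the consumer needs for the first time
    total: int = 0   # all bytes read, including re-reads


@dataclass
class IDFG:
    """Toy interprocedural data flow graph (all names are illustrative)."""
    children: dict = field(default_factory=lambda: defaultdict(list))
    comm: dict = field(default_factory=lambda: defaultdict(CommEdge))

    def add_call(self, caller, callee):
        # Call edge (bold arrow in the figure).
        self.children[caller].append(callee)

    def add_comm(self, producer, consumer, unique, total=None):
        # Communication edge (dashed arrow), accumulated over the whole run.
        edge = self.comm[(producer, consumer)]
        edge.unique += unique
        edge.total += unique if total is None else total

    def unique_inputs(self, consumer):
        # Unique bytes read by `consumer` from every other function.
        return sum(e.unique for (p, c), e in self.comm.items()
                   if c == consumer and p != consumer)


g = IDFG()
g.add_call("main", "A")
g.add_call("A", "C")
g.add_comm("main", "A", unique=4, total=16)  # edge described in the text
g.add_comm("A", "C", unique=8)               # assumed weight, for illustration
print(g.comm[("main", "A")])
```

Keeping call edges and communication edges in separate maps mirrors the two edge types of the figure, so the calltree can be walked without touching the data flow weights.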
The aggregates representation has the advantage of being compact and can be easily understood
and post-processed, similar to representations in previous attempts to capture communication [87,
38, 55]. The data for this representation can also be captured quickly, with a small memory footprint
during profiling. If we wish to analyze an application by breaking it up into recognizable entities,
this representation fits the requirement. Since this representation includes control flow, a software
application is divided into entities based on control flow boundaries. Thus, the entities are fixed
to individual instructions, basic blocks, functions, or threads. The current version of the Sigil tool
is restricted to using functions and threads as entities, but can be easily extended to include finer
granularity entities such as basic blocks. In the next chapter, we show how the IDFGs
constructed using this representation can be used to perform HW/SW partitioning to find candidate functions
for hardware acceleration in future heterogeneous CMPs.
Figure 2.10: Two functions with loads and stores. The computation chunks between communication edges are independent in the two functions
2.4.2 Event Trace representation
We produce the event trace representation to help identify parallelism by separating the costs of
computation chunks of each function call. We are able to use this data in conjunction with the ag-
gregates data to construct dependency chains of function calls and reveal critical paths and identify
parallelism. The aggregates representation is a compact representation that is useful for partitioning
problems and not expensive to profile. However, it does not record the individual communication
edges observed in the program and instead aggregates the bytes in those edges under different cat-
egories. Thus the aggregates representation alone would be unable to track fine-grained parallelism
between load/store instructions. Figure 2.10 shows an example of how fine-grained dependencies
allow us to discover parallelism between the load/store dependency edges. Functions 1 and 2 each
read data written by the other through addresses A and B, with two computation chunks in between.
In the example, chunks of computation from the two functions between the dependent loads/stores
are essentially independent of each other. Therefore, computation chunks 1 and 2 can be executed
in parallel, and similarly computation chunks 3 and 4 can be executed in parallel.
The output of the event trace representation can be post-processed to discover fine-grained par-
allelism in a program, and establish the limits of parallelism. The event trace representation is
invoked as a command line option in Sigil and produces a second file in addition to the file created
by the aggregate profile. This file contains a list of computation chunks and communication edges
in the form of abstract computation and communication “events”, recorded in the order they were
encountered. The producer and consumer entities are still functions and each computation chunk
represents a variable size set of instructions that form a part of a function. These chunks repre-
sent full self-contained fragments of work that do not have any intervening communication. They
are not demarcated by control flow boundaries (branches, call/returns etc.) like in the aggregates
representation.
Dependency Tree construction
Figure 2.11: Communication between parallel paths
We can post-process an event trace file to separate the dependent chains of computation chunks
of functions in the program, to produce a dependency tree. The dependency tree reveals fine-grained
hierarchical parallelism in the application and essentially represents the maximum theoretical
parallelism in the application. Prior work has extracted more coarse-grained hierarchical parallelism
using dependence structures, but does not indicate the theoretical limits of parallelism in an
application [33, 56, 116]. In our current use case of the event traces, we do not study the extraction
of hierarchical parallelism, but instead use the dependent chains to discover the critical path of an
application and the theoretical limits of scheduling computation chunks in parallel.
Here we show a simple example of how a dependency tree can be constructed from the results of
the event trace representation, and how the critical path is discovered. Figure 2.11 illustrates
how we construct dependency chains of events for the same toy program discussed under the
aggregates representation in Figure 2.9. We read the event trace file and process each computation
and communication event within the file. We generate nodes for each chunk of a function and the
communication edges are used to connect the nodes to form chains. As nodes get updated or added
to each chain, we must re-calculate the critical path. Each node in the figure represents part of a
single function call. The self-cost of each node, shown inside the box, is the number of operations
performed within the call. The inclusive cost, shown outside the box of a node, represents the sum
of the self-costs of the longest chain from “main” to that node. The longest chain in the entire tree
is the critical path. The critical path is highlighted with nodes shaded in gray and edges in bold.
In the example, A and C are encountered first with A preceding C. Both are attached to main and
the path through C is the critical path. In the calltree for the toy program shown in Figure 2.9, A
calls C and when C returns, we encounter A again. We model functions as non-blocking, so that
they can potentially run in parallel and start consuming data. To include the effect of this, we add
the second occurrence of A as a separate node although it belongs to the same call, so as to not
affect the inclusive cost of C. We also add a dependency link to the previous occurrence of A to
conservatively enforce order between regions within A. Node D is then added when it consumes data
from that particular call of A. The path to C through A is the updated critical path. Finally, when
a link is established between C and D, the critical path is now updated to include D as the leaf node.
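The construction walked through above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not Sigil's post-processing code: the class name, the node labels (A.0 and A.1 for the two occurrences of A) and the self-costs are chosen for the example, and the graph is assumed to be acyclic. The inclusive cost of a node is its self-cost plus the largest inclusive cost among its predecessors, and the critical path is recovered by walking back from the most expensive node.

```python
from collections import defaultdict


class DependencyGraph:
    """Chains of computation chunks linked by communication edges."""

    def __init__(self):
        self.self_cost = {}
        self.preds = defaultdict(set)

    def add_node(self, node, self_cost):
        self.self_cost[node] = self_cost

    def add_edge(self, producer, consumer):
        # The consumer chunk cannot start before the producer chunk ends.
        self.preds[consumer].add(producer)

    def inclusive_cost(self, node, _memo=None):
        # Self-cost plus the most expensive chain leading to this node.
        memo = {} if _memo is None else _memo
        if node not in memo:
            longest = max((self.inclusive_cost(p, memo)
                           for p in self.preds[node]), default=0)
            memo[node] = self.self_cost[node] + longest
        return memo[node]

    def critical_path(self):
        # Walk back from the costliest node along its costliest predecessor.
        memo = {}
        node = max(self.self_cost, key=lambda n: self.inclusive_cost(n, memo))
        path = [node]
        while self.preds[node]:
            node = max(self.preds[node],
                       key=lambda n: self.inclusive_cost(n, memo))
            path.append(node)
        return path[::-1]


g = DependencyGraph()
for name, cost in [("main", 10), ("A.0", 18), ("C", 24), ("A.1", 5), ("D", 13)]:
    g.add_node(name, cost)
g.add_edge("main", "A.0")
g.add_edge("main", "C")
g.add_edge("A.0", "A.1")  # order between regions of the same call of A
g.add_edge("A.1", "D")    # D consumes data from A
g.add_edge("A.0", "C")    # C consumes data from A
g.add_edge("C", "D")      # D consumes data from C
print(g.inclusive_cost("D"), g.critical_path())
```

With these illustrative costs, the longest chain runs from main through the first occurrence of A and through C to D, which matches the order of critical path updates described in the text. This sketch recomputes costs on demand rather than updating them incrementally as each event arrives.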
Computation and Communication event formats
Unlike the aggregates representation, the event trace representation needs to record chunks of func-
tion calls, and hence will need extra fields for identifiers of each event. Events (computation or
communication) are numbered in the sequence they are encountered. Instead of a global numbering
system, event numbering starts at 0 for each function instance called from a particular context and
the event number is incremented for each encountered event for that function context. A computa-
tion event is as follows:
Event Number, Function Number, Function Instance Number, Call Number, Integer Op Count,
Floating Point Op Count, Total Local Memory Read byte count, Unique Local Memory Read byte count
Listing 2.1: Computation Event
Computation events record the identifiers for each function call such as the function number, an
instance number (a unique instance number is generated for each context as described earlier), and
the value of n for the nth call to the function. A computation event also holds the number of
integer/floating point operations, and the locally produced and read bytes, so that each node in a
dependency chain is associated with a weight that represents the work it does. The instance number
represents the instantiation number due to the function being called through multiple contexts. This
has bearing on deciding the critical path, as demonstrated in the previous example. Locally read
bytes are bytes that have been written by the currently active function call, and then read within
the same function call. Locally read bytes have both the unique and total fields as shown. These
fields can also be interpreted as having a bearing on the critical path, as each locally written byte
represents the cost of intermediate data storage. A communication event has the following fields:
Event Number, Consuming Function Number, Consuming Function Instance Number,
Consuming Function Call Number, Producing Function Number, Producing Function Instance Number,
Producing Function Call Number, Total bytes communicated, Unique bytes communicated
Listing 2.2: Communication Event
Similar to the computation event, a communication event holds the identifiers for both the producing
and consuming functions. Also similar to the computation event, it holds the bytes communicated
between the producer and consumer functions with the same unique and total categories. The
identifier categories help connect the nodes in the dependency chains, while the unique and total
fields can be used to establish communication costs through the dependency chains and hence the
critical path. An important optimization we implement is to merge consecutive computation
events and communication events that have identical field values for identifiers. We add up the
costs associated with the other fields in the merged event. This makes trace storage more tractable.
Because large traces are periodically written out, this representation takes slightly longer to
capture than the aggregates representation. However, a trace needs to be captured only once per input and
is powerful in characterizing the intrinsic parallelism of a program.
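The merging optimization described above can be illustrated with a short Python sketch for computation events. Several points here are assumptions made for the example: that each event is stored as one line of comma-separated decimal integers in the field order of Listing 2.1, that the identifier fields are the function number, instance number and call number (the event number is left out of the comparison), and the names CompEvent, parse_comp and merge_consecutive, which are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CompEvent:
    event_no: int
    func: int
    instance: int
    call: int
    int_ops: int
    fp_ops: int
    local_total: int
    local_unique: int

    def ident(self):
        # Identifier fields; the event number is assumed not to take part.
        return (self.func, self.instance, self.call)


def parse_comp(line):
    # One comma-separated record, in the field order of Listing 2.1.
    return CompEvent(*(int(tok) for tok in line.split(",")))


def merge_consecutive(events):
    """Fold runs of adjacent events with equal identifiers into one event."""
    merged = []
    for ev in events:
        last = merged[-1] if merged else None
        if last is not None and last.ident() == ev.ident():
            # Same function call: add up the cost fields.
            last.int_ops += ev.int_ops
            last.fp_ops += ev.fp_ops
            last.local_total += ev.local_total
            last.local_unique += ev.local_unique
        else:
            merged.append(ev)
    return merged


# Made-up records: two chunks of the same call, then a different function.
trace = ["0,7,1,1,10,0,4,4", "1,7,1,1,5,2,8,4", "0,9,1,1,3,0,0,0"]
events = merge_consecutive(parse_comp(line) for line in trace)
print(events)
```

Only adjacent events are merged, so the order of events in the trace, which the dependency analysis relies on, is preserved.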
2.5 Background and Related work in communication-aware profiling tools
As mentioned before, software-level communication can be used for a multitude of purposes,
ranging from HW/SW partitioning and parallelism discovery to the design of interconnect [95, 31,
116, 53]. There is a need to automate the capture of software-level communication, as acknowledged
by recently published work [44, 69, 37, 55]. A recent workshop study by Wu and Kim proposes
that tools such as Sigil and Conservation Cores are important to understand the limits of
accelerator hardware in processing chips such as MPSoCs and heterogeneous CMPs [111, 102]. Prior
attempts at capturing communication for many purposes have several drawbacks: they do not
classify communication, were less efficient, met with limited success and often required programmer
intervention [87, 98, 60, 38, 44, 69, 55].
Kim et al. propose a Pin-based tool similar to ours that captures dataflow between functions for
the purposes of HW/SW partitioning targeting accelerators. The tool does not classify
communication and no results are publicly available to compare our data against the tool’s output. Rul et
al. describe a profiling framework that captures dataflow between functions, solely for the purpose
of discovering function-level parallelism. They use tables for every address to attempt to classify
communication, though they do not characterize the costs in terms of memory or performance,
especially since they gather traces using a simulator before post processing. They also do not classify
communication in as formal and complete a manner as this dissertation does.
Methodologies and models exist that characterize hardware and task-level communication
patterns [57, 23, 51]. However, many of these profiles are very specific, and the bytes of data transfer measured
depend heavily on the characteristics of the platform’s memory hierarchy and run-time
behavior. Curreri et al., in particular, propose an automated methodology for capturing communication
between application processes, but it does not distinguish between the first use and reuse of data [23].
Prior work in the hardware-software co-design field specifically uses instructions, data flow
analysis, and communication in the design process [67, 113]. These methodologies do consider the impact
of communication on performance, but they do not extract data flow patterns from existing binaries
automatically, which makes it difficult to apply the methodology to all workloads. Gremzow et al.
employ dynamic instrumentation both to determine data flow between functions and to reconstruct
source/high level information to assist high level synthesis [38]. Galanis et al. [31] derive data
flow graphs using static analysis and dynamic profiling of a given workload. However, neither work
classifies communication or accounts for unique data transfers.
Work in the reconfigurable computing field also explores the automated capture of communication
to assist hardware/software partitioning. Smith and Peterson [95] propose a model that includes
communication costs to estimate speedup of FPGA-accelerated cores for multi-threaded applications.
The RC Amenability Test (RAT) from Holland et al. [48] describes models to quickly estimate the
performance of applications targeted at FPGAs, while Huang et al. [51] propose splitting task graphs
such that overall communication in the system is kept to a minimum. Although these models and
techniques capture the impact of communication, they assume prior knowledge of the application or
existing data flow graphs. In this work, we discuss an automated way of extracting communication
and applying it on arbitrary workloads, using similar models.
Recently, tools have been released that use dependence analysis to highlight parallelism in
loops [116, 56]. There has also been work that uses compressed traces to do dependence analy-
sis and extract parallelism [83]. These works use traces or access histories for their dependence
analysis, with the sole purpose of extracting parallelism. Sigil uses memory shadowing, which allows
it to accurately see dependencies across the workload and also classify them as unique or non-unique.
Kremlin [33] identifies potential parallel regions of a given serial workload using hierarchical critical
path analysis. These abstract regions do not have to be at function boundaries. Kremlin also does
not classify communication as unique and non-unique. Gupta et al. [40] propose models to paral-
lelize statically-sequential programs written in a suitable data flow fashion. However, their parallel
executable functions are identified by programmers.
Another important consideration in Sigil was the choice to study communication using dynamic
profiling rather than static analysis. Static analysis is closely related to compiler technology and
analysis can often help easily detect control flow boundaries [33]. It can also help classify data
bytes by their utility; for example it can detect induction and reduction variables, which gives
clues as to how the data byte will be used and shared amongst software entities. While static
analysis has some merits, its chief drawback is that it cannot profile dynamically determined code
behavior [60, 56, 44]. It is incapable of tracing through structures such as linked lists and loops with
dynamically determined trip counts. Hence, static analysis cannot track communication between
objects allocated at run-time. Kim et al. present some interesting data on the drawbacks of static
analysis vs. dynamic analysis in the context of parallelization discovery [56].
Our novel methodology, implemented in the tool Sigil, dynamically tracks the data produced and
consumed by each function call and differentiates the first-time use from the reuse of bytes. We also
distinguish between communication external to the function and local communication within the
function by tracking the producer and consumer of each unique data byte in the program. Using the
technique of shadow memory, Sigil is able to index our table of functions efficiently without exploding
state, unlike most tools that capture communication. Sigil can represent its profiling results in one
of two ways: it can dump aggregates on a per-function basis or list the execution as a sequence of
dependent events in an event trace for use cases described in the next few chapters.
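As a rough illustration of how shadow memory supports this classification, the Python sketch below keeps, for every byte address, the function that last wrote it and the function that last read it. It is a simplification under our own assumptions and not Sigil's implementation: the class and method names are hypothetical, state is kept per function rather than per function call, and a read is assumed to count as unique only when the reading function was not already the last reader of that byte.

```python
from collections import defaultdict


class ShadowMemory:
    """Per-byte shadow state: last producer and last consumer (a sketch)."""

    def __init__(self):
        self.producer = {}      # address -> function that last wrote the byte
        self.last_reader = {}   # address -> function that last read the byte
        # (producer, consumer) -> [unique_bytes, total_bytes]
        self.edges = defaultdict(lambda: [0, 0])

    def write(self, func, addr, size=1):
        for a in range(addr, addr + size):
            self.producer[a] = func
            # New data: the next read of this byte is a first use again.
            self.last_reader.pop(a, None)

    def read(self, func, addr, size=1):
        for a in range(addr, addr + size):
            src = self.producer.get(a, "<uninitialized>")  # placeholder name
            edge = self.edges[(src, func)]
            edge[1] += 1                      # every read counts toward total
            if self.last_reader.get(a) != func:
                edge[0] += 1                  # first use by this consumer
            self.last_reader[a] = func


sm = ShadowMemory()
sm.write("main", 0x1000, 4)
for _ in range(4):
    sm.read("A", 0x1000, 4)   # A reads the same 4 bytes four times
print(sm.edges[("main", "A")])
```

In this run, A reads 4 bytes produced by main four times, giving 4 unique bytes and 16 bytes in total on the edge from main to A. An edge whose producer and consumer are the same function would correspond to local communication.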
2.6 Summary
In this chapter, we presented Sigil, a tool for capturing and classifying communication automatically.
We demonstrated, with an example, the utility of classifying communication into various categories
and formally defined all the categories we use. We explained the methodology the Sigil tool uses
for efficiently automating the capture and classification of communication, and characterized Sigil
in terms of memory usage and performance. Finally, we discussed the output representations that
Sigil can provide using the profiled computation and communication. These output representations
can be used for several purposes. We use Sigil’s data for analysis towards task graph partitioning,
modeling accelerators, and performing simulation and design space exploration studies of the
uncore of CMPs. The next few chapters are dedicated to exploring those purposes.
2.7 Acknowledgments
Most work in this chapter is adapted and extended from a paper entitled “Platform-independent
analysis of function-level communication in workloads” by Siddharth Nilakantan and Mark
Hempstead. The dissertation author was the primary investigator and author of this paper. Some material
is also drawn from currently unpublished work that will be submitted to a longer journal version of
the work mentioned previously. This material is based on work supported by the National Science
Foundation grants where the Primary Investigator is Mark Hempstead.
Chapter 3: Using Sigil for function-level analysis
Figure 3.1: Example of Heterogeneous CMP and MPSoC tiles: (a) PowerEN die with accelerators; (b) Tegra K1 32-bit architecture with accelerators. Adapted from papers on existing dies [34, 77]
In this chapter, we show how Sigil’s data can be used to assist accelerator selection. As mentioned
in the introduction, modern microprocessors have hit a power wall that is constraining performance.
With the end of Dennard scaling [29], power density (i.e. W/mm²) is increasing with each transistor
process technology generation. Thus, to deal with the emerging power budgets fixed by cooling costs
and battery life requirements, designers must shrink the size of the microprocessor or leave sections
of the die un-powered, a condition recently named Dark Silicon [103, 27].
Hardware specialization increases energy efficiency and performance through the use of spe-
cialized circuits that complete more computations for every transistor switch than general-purpose
processors. Figure 3.1 shows a contemporary CMP and SoC that employ accelerators, IBM’s Pow-
erEN and NVIDIA’s Tegra K1 [13, 77]. Employing this solution requires new design decisions, such
as identifying candidate functions for which to build specialized hardware, and determining which of
them can be included alongside the general purpose processor [58]. We term these design decisions
the accelerator selection problem. In this chapter, we show how Sigil’s data can be used to assist
accelerator selection, by producing a list of accelerator candidate functions.
Selecting the best ensemble of accelerators in a methodical and data-driven manner is essential for
hardware specialization to be an effective solution to Dark Silicon. More importantly, this decision
must be made by architects at the early-stage of the design process before hardware design teams
have been tasked or accelerator IP purchased. Performing an exhaustive evaluation of all the points
in this new and complex design space will be expensive and time-consuming. Hence, a methodology
is needed to allow early stage evaluation of systems that employ accelerators. In this chapter, we
use communication costs obtained from Sigil, to drive the accelerator selection methodology, by
studying how accelerator communication impacts the overall system performance.
Prior research in the HW/SW codesign and reconfigurable computing communities has targeted
the inclusion of specialized/reconfigurable hardware on chip, but has limited exploration to
embedded systems and FPGAs [51, 108]. Given the adoption of accelerators in multiprocessor chip
design today in the form of CMPs and MPSoCs, these techniques are relevant and applicable in the
contemporary design space as well. Research in the HW/SW codesign and reconfigurable
computing communities usually employs task graph representations of applications to perform HW/SW
partitioning and scheduling [51, 31, 67, 113]. Some contemporary literature also alludes to the use
of task graph partitioning for heterogeneous CMPs with generic computation units such as GPUs
or FPGAs [39, 94]. Thus, task graphs and HW/SW partitioning are relevant and central to the
adoption of accelerators in modern CMPs.
While the definition of task graphs varies from author to author, in the context of this work
we assume they are fundamental hardware/source-code independent representations of software
applications, and represent a sequence of dependent tasks in the application [108]. Each task is
self-contained and must complete before a dependent task can begin. The goal of partitioning a
task graph is to select a subset of tasks to be offloaded to ASICs or FPGAs. For the partitioning
algorithm to find the optimal partitions, it will also require specifications of a particular platform
and its results will vary from platform to platform. This is due to the fact that the relative costs of
computation and communication will determine the efficacy of using specialized hardware to improve
overall application performance.
In this chapter, we explore the special considerations of applying HW/SW partitioning to Sigil’s
interprocedural data flow graphs (IDFGs) to produce accelerator candidates. These candidates are
considered part of the Hardware partition. We also discuss our demonstrative partitioning algorithm
and show accelerator candidates for the popular benchmark suite PARSEC [7]. Finally, we also
show how we can perform a constraint-based selection if we have RTL for the accelerator candidates.
3.1 Using Sigil’s aggregates representation for partitioning
In this section, we show how Sigil’s aggregates representation can be used as a replacement for task
graphs in HW/SW partitioning problems. Figure 1.4 from the introduction shows the example of
a task graph adapted from a research study on HW/SW partitioning of generic task graphs [114].
Such task graphs have also been generated pseudo-randomly using Task Graphs for Free, and do
not represent real applications, unlike Sigil’s aggregates representation [24, 101].
As tasks are abstract entities that represent fully self-contained units of work, there are no specific
rules that govern what constitutes a task. As Sigil can track communication between any software
entities, we can pick entities that reflect tasks best. We also discussed, in the previous chapter,
how granularities as fine as instructions can cause representations to be prohibitively unscalable and
expensive [69]. The notion of a task is best represented by a function in a software implementation as
functions are written to be frequently reused tasks [40] and represent clear and understandable logical
boundaries for developers and system designers. Functions also behave similarly to accelerators. They
are called multiple times which is analogous to multiple calls to an accelerator. They are also called
from multiple contexts and in the case of library calls, from multiple applications. This implies that
they are naturally shared and accelerators representing functions can be shared across applications.
Figure 3.2: Architecture model: an array of heterogeneous tiles
This served as motivation to use an application’s function calltree as the basis for the aggregates
representation produced by Sigil; the corresponding software entities used in the representation
are functions and the graphs generated by the representation are IDFGs. The function calltree is
annotated with information such as communication edges, and computation operations for each
function, similar to the generic task graph described earlier. With per-function costs obtained from
Sigil and a proposed system architecture, we show how to perform partitioning using a simple,
demonstrative algorithm that determines accelerator candidates at an early stage.
3.1.1 System Architecture for partitioning IDFGs
We assume a simple system architecture for the partitioning algorithm as we are targeting early-stage
accelerator selection in this study. Our system architecture assumes that the basic building block of
the system consists of a CPU and a mix of hardware accelerators closely coupled to the CPU. The
architecture model is informed by Hill and Marty’s analysis of parallel architectures, which shows
that as more parallelism is extracted for an application, serial execution delay will limit the effective
performance of a processor [47].
As the number of cores and accelerators grows, the communication patterns across the chip will
become complex and depend on the physical location of the accelerators and how the application has
been mapped to accelerators. We simplify the analysis by assuming a tiled architecture. Figure 3.2
shows the architecture we assume: a processor that consists of an array of tiles connected with a mesh
interconnect. Each tile will contain some of the following elements: a general-purpose processor core,
a memory controller, and an ensemble of accelerators connected via a shared bus. Hierarchical tiled
architectures have been proposed for applications such as software defined radio; they can reduce
global communication and provide significant bandwidth within a tile [61]. For the demonstrative
partitioning study in this chapter, we assume a 16GB/s bus bandwidth in a tile and a 2.5GHz
operating frequency for the system.
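To give a feel for what these two parameters imply, the Python sketch below converts a byte count into bus cycles: 16 GB/s at 2.5 GHz corresponds to 6.4 bytes per cycle if 1 GB is taken as 10^9 bytes, which is our assumption. The break-even helper is a hypothetical, deliberately simple model that is not taken from this dissertation: it assumes that offloaded time equals software cycles divided by the accelerator speedup plus the transfer cycles, with no overlap between computation and communication.

```python
import math

BUS_BYTES_PER_SEC = 16 * 10**9  # 16 GB/s, taking 1 GB = 10**9 bytes (assumption)
CLOCK_HZ = 25 * 10**8           # 2.5 GHz system clock


def transfer_cycles(num_bytes):
    """Clock cycles needed to move num_bytes over the tile bus (rounded up)."""
    # Integer ceiling division avoids floating point rounding of 6.4 bytes/cycle.
    return -((-num_bytes * CLOCK_HZ) // BUS_BYTES_PER_SEC)


def breakeven_speedup(sw_cycles, unique_bytes):
    """Smallest accelerator speedup at which offloading stops being a loss.

    Hypothetical model: offloaded time = sw_cycles / speedup + transfer cycles.
    """
    budget = sw_cycles - transfer_cycles(unique_bytes)
    return math.inf if budget <= 0 else sw_cycles / budget


print(transfer_cycles(64))
print(breakeven_speedup(1000, 3200))
```

Under these assumptions, a function that spends fewer cycles computing than its unique bytes take to transfer can never benefit from offloading, which is why communication volume matters as much as computation when ranking candidates.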
3.1.2 Important considerations for partitioning IDFGs
The goal of partitioning a task graph is to select a subset of tasks to be offloaded to ASICs or FPGAs.
In the context of this dissertation, accelerator candidate functions are obtained from the hardware
partition when HW/SW partitioning is applied on Sigil’s IDFG. A partitioning algorithm will seek
to minimize communication between the HW and SW partitions, while maximizing coverage of the
application in custom HW. The tree is sliced into collections of nodes, such that communication
between the different collections is minimal. Partitioning abstract task graphs is easier, and
existing HW/SW partitioning algorithms can be applied directly as tasks are self-contained. However,
functions are not self-contained as they make calls to other functions before returning. Thus, when
exploring partitioning problems on Sigil’s IDFGs, we cannot simply view functions as abstract tasks.
Recall from Figure 2.9, which shows Sigil’s aggregates representation, that individual nodes are
functions connected by bold call edges and dashed communication edges. A partitioning algorithm needs
to account for the inherent hierarchy in the calltree portion of the IDFG (which can be traced out
using just the bold edges). A naive approach considers only the leaf nodes of the tree as
accelerator candidates because they are fully self-contained. However, this is too limiting, as the
leaf nodes of the calltree may cover too little of the work done in the application. If we
wish to consider non-leaf nodes in the upper portion of the hierarchy as candidates,
we need to include the functionality of their sub-calltrees. That is, a node in the graph is assumed
to contain the functionality of its callees, its callees’ callees and so on; it absorbs the functionality
of its entire sub-calltree. This is justified for three reasons. If a non-leaf node is accelerated without
merging its sub-calltree into the node, i) the accelerator would incur the cost of going back and
forth between SW and HW for the upper portion and the node’s sub-calltree, ii) coverage of the workload
would not necessarily be improved and iii) we would have to split the function into several parts,
before and after calls to the sub-calltree. Thus, to consider each node in Sigil’s IDFG during the
partitioning process, we determine the costs for each node, assuming the entire sub-calltree for that
node is merged into the node. We term these the inclusive costs for each node.

Figure 3.3: Partitioning interprocedural data flow graphs where nodes represent functions. (a) Base granularity. (b) Merged granularity.

The inclusive costs for each node simulate the effect of merging the entire sub-calltree of that
node, into the node. We illustrate the process of discovering inclusive costs for each node in figure 3.3.
Figure 3.3 shows the interprocedural data flow graph of the same toy program used in figure 2.9
with separate nodes for function D under each different calling context (D1 and D2). We determine
inclusive costs for a node by drawing a box around the node and its sub-calltree to represent the
functionality for the entire set of functions. Any dashed edges within the box are assumed to be
absorbed and edges flowing in/out of the box (crossing the bounds of the box) are considered as part
of the inclusive communication cost of the parent node. We obtain the inclusive computation costs
of the node by summing measurements such as computing operations and CPU memory traffic over the
nodes within the box. For simplicity, figure 3.3 does not
show the computation and local communication costs of each function, but these metrics are also
captured by Sigil.
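As a minimal sketch of the box-drawing procedure described above, the following Python fragment computes inclusive costs for a node of a small calltree. The node names, operation counts and byte counts are hypothetical and are not taken from figure 3.3; the `Node` class and the `inclusive_costs` helper are illustrative names, not part of Sigil.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    ops: int                                      # exclusive operation count
    children: list = field(default_factory=list)  # callees (bold call edges)

def subtree(node):
    """Names of all functions inside the box drawn around node."""
    names = {node.name}
    for child in node.children:
        names |= subtree(child)
    return names

def inclusive_costs(node, nodes, comm_edges):
    """comm_edges maps (producer, consumer) -> unique bytes (dashed edges)."""
    box = subtree(node)
    ops = sum(nodes[n].ops for n in box)
    # Edges crossing the box boundary count; edges inside the box are absorbed.
    bytes_in = sum(b for (src, dst), b in comm_edges.items()
                   if src not in box and dst in box)
    bytes_out = sum(b for (src, dst), b in comm_edges.items()
                    if src in box and dst not in box)
    return ops, bytes_in, bytes_out

# Hypothetical toy graph: main calls A and B; A calls C and D.
c, d, b = Node("C", 40), Node("D", 50), Node("B", 30)
a = Node("A", 20, [c, d])
main = Node("main", 10, [a, b])
nodes = {n.name: n for n in (main, a, b, c, d)}
comm_edges = {("main", "A"): 4, ("A", "C"): 8, ("C", "D"): 8,
              ("D", "B"): 16, ("main", "B"): 4}

cost_a = inclusive_costs(a, nodes, comm_edges)   # (ops, bytes in, bytes out)
```

Here the edges between A, C and D fall inside the box around A and are absorbed, so only the edge from main into A and the edge from D out to B contribute to the inclusive communication cost of A.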
3.1.3 Metric for partitioning IDFGs
Partitioning the IDFG relies on merging nodes with their sub-calltrees when the nodes are considered
as part of the HW partition; i.e. accelerator candidates. Due to the existence of the hierarchy, if a
non-leaf node is deemed to be a better accelerator candidate than all the nodes in its sub-calltree,
then the algorithm must remove the sub-calltree from the IDFG as its functionality is considered
as part of the non-leaf node. This results in a merged IDFG, where the new leaf nodes are the
accelerator candidates, while the remaining non-leaf nodes of the IDFG are part of the software
partition.
We propose a two-step partitioning process that employs merging and ranking based on inclusive
costs. For the purposes of demonstration, we use a simple heuristic-based metric, derived
from the inclusive costs, to compare nodes against one another during the partitioning process. In
the first step of the partitioning process, we use the metric to compare nodes against every node in
their sub-calltree, and merge the sub-calltree only if the metric predicts that the node is a better
accelerator candidate than all nodes in the sub-calltree. Thus, when evaluating a particular node, we
recurse down the sub-calltree and compare every sub-calltree node’s metric value against the parent
node before merging. The result is a merged IDFG with leaf nodes as accelerator candidates. In
the second step of the partitioning process, we compare the metric values of the leaf nodes of the
merged IDFG and produce a rank for each node. The resulting list of leaf nodes sorted by rank
represents the accelerator candidate list in priority order.
Figure 3.3 shows the calltree before (a) and after (b) the merging step. If node A is selected for merging,
then we draw a box that encompasses its entire sub-tree, as shown in figure 3.3b. We attribute only
communication outside the box to A. We represent the computation of the merged sub-tree by
accumulating the total number of operations for the entire sub-tree, resulting in a trimmed IDFG
with five nodes, where the leaf nodes are accelerator candidates.
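The two-step process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the calltree, the metric values and the function names are hypothetical, a lower metric value is taken to mean a better candidate, the traversal is performed top-down, and the root is given an infinite metric so that it always stays in software.

```python
def descendants(tree, name):
    """All nodes in the sub-calltree of name, excluding name itself."""
    out = []
    for child in tree.get(name, []):
        out.append(child)
        out.extend(descendants(tree, child))
    return out

def select_candidates(tree, metric, name):
    """Step 1: return the leaf nodes of the merged calltree below name."""
    below = descendants(tree, name)
    # Merge only if this node beats every node in its sub-calltree.
    if not below or all(metric[name] < metric[d] for d in below):
        return [name]
    found = []
    for child in tree[name]:
        found.extend(select_candidates(tree, metric, child))
    return found

# Hypothetical calltree and inclusive metric values (lower is better).
tree = {"main": ["A", "B"], "A": ["C", "D"], "B": ["E"]}
metric = {"main": float("inf"),   # assumption: the root is never a candidate
          "A": 1.02, "B": 1.50, "C": 1.10, "D": 1.30, "E": 1.05}

leaves = select_candidates(tree, metric, "main")
ranked = sorted(leaves, key=lambda n: metric[n])   # Step 2: rank the leaves
```

In this toy case A absorbs C and D because its metric is better than both, whereas B is not better than E, so E remains a separate leaf and B stays in the software partition.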
Our heuristic-based metric for a node, named the breakeven-speedup, is derived from 1) the node’s
inclusive coverage of software execution time of the application and 2) the hardware offload time
calculated from the node’s inclusive communication costs. As the software execution time and the
hardware offload time for a node depend on system assumptions such as the organization of
accelerators, cores and the interconnect, and their relative frequencies of operation, the metric value
for each node in the IDFG can differ based on the assumptions. We discuss these derived costs
before formulating the breakeven-speedup.
Estimating hardware offload time
To estimate the communication cost of offloading computation to accelerators we assume, as in the
PowerEN [13], that direct memory access (DMA) transfers are used to inject data into the L2 cache
to transfer data and synchronization information between the CPU and accelerators. We do not
model situations where the accelerator requests data (demand-driven). Our model represents the
limiting case for acceleration, and more elegant methods of data transfer and memory latency-hiding
mechanisms will add to any bottom-line benefit.
Part of the communication cost includes the data transfer of input data and output data over the
bus. We label the time for transfer of input data as the input communication time, $t_{comm:accel:ip}$.
To calculate this for a node, we first sum up the bytes on all the input edges to the node from
the IDFG as $Bytes_{ip\,uniq}$. Recall that these edges represent IPCOMM UNIQUE from section 2.1,
as we assume that either data can be streamed in or sufficient buffer space is available so that
communication for each input byte of data for a particular function is incurred only once from
all sources. Thus, the input communication cost per function call is

$$t_{comm:accel:ip} = \frac{Bytes_{ip\,uniq}}{BW}$$

A similar set of assumptions and mechanisms must also be considered for output data transfer,
resulting in the output communication time, $t_{comm:accel:op}$.
The above calculation is independent of the CPU memory hierarchy or architecture. However,
additional cache miss penalties could occur depending on the location of the data in the memory
hierarchy when the transfer is initiated by the CPU. We capture the number of L2 misses for each
function using Callgrind and its configurable cache simulation model. From these assumptions, the
total cost of communication for offloading computation to and from accelerators is:

$$t_{offl:accel:ip} = L2_{rd\,miss} \times t_{L2\,miss} + t_{comm:accel:ip} \tag{3.1}$$

$$t_{offl:accel:op} = L2_{wr\,miss} \times t_{L2\,miss} + t_{comm:accel:op} \tag{3.2}$$
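To make the units concrete, the sketch below evaluates the communication and offload times in CPU cycles, using the 16GB/s bus bandwidth and 2.5GHz operating frequency assumed for the demonstrative study. The conversion to cycles, the L2 miss penalty of 100 cycles, and the byte and miss counts are assumptions made for this example only.

```python
BUS_BW_BYTES_PER_S = 16e9   # bus bandwidth assumed in the demonstrative study
CLOCK_HZ = 2.5e9            # system operating frequency assumed in the study
L2_MISS_CYCLES = 100        # assumption: penalty per L2 miss, in cycles

def comm_cycles(unique_bytes):
    """Transfer time for the unique bytes over the bus, expressed in cycles."""
    return unique_bytes * CLOCK_HZ / BUS_BW_BYTES_PER_S

def offload_cycles(unique_bytes, l2_misses):
    """Same form as equations 3.1 and 3.2: miss penalty plus transfer time."""
    return l2_misses * L2_MISS_CYCLES + comm_cycles(unique_bytes)

# Hypothetical function call: 6400 unique input bytes and 5 L2 read misses.
t_comm_ip = comm_cycles(6400)
t_offl_ip = offload_cycles(6400, 5)
```

With these numbers the bus moves 6.4 bytes per cycle, so the 6400 input bytes take about 1000 cycles to transfer and the five misses add a further 500 cycles.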
Estimating software execution time
To calculate software time for this study we used the Callgrind tool’s own Cycles Estimation formula.
The formula incorporates simulated cache misses, branch mispredictions and instruction counts. The
general form of the formula is as follows, where CPI denotes cycles per instruction:

$$t_{sw} = \text{Num. instructions} \times \text{CPI} + \text{L1:Misses} \times \text{L1 miss latency} + \text{LL:Misses} \times \text{LL miss latency} + \text{Branch:Misses} \times \text{Branch misprediction penalty} \tag{3.3}$$

The Cycles Estimation formula we use gives a rough estimate of software time assuming the
following latencies: 1 cycle per instruction, 10 cycles for L1I and L1D misses, 100 cycles for unified
LL misses and 10 cycles for branch mispredictions. These values reflect latencies in modern Intel
CPUs with a running frequency between 2.5 and 3GHz, 32kB L1I and L1D caches, and a unified 12MB
last-level cache. The formula we use for calculation of software time is as follows:

$$t_{sw} = \text{Num. instructions} + 10 \times \text{L1:Misses} + 100 \times \text{LL:Misses} + 10 \times \text{Branch:Misses} \tag{3.4}$$
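The software time estimate is simple enough to express directly. The following sketch mirrors the general form of the formula, with the latencies stated above as default arguments; the event counts in the example are hypothetical, and the function name is illustrative rather than part of Callgrind.

```python
def software_cycles(instructions, l1_misses, ll_misses, branch_misses,
                    cpi=1, l1_latency=10, ll_latency=100, branch_penalty=10):
    """Estimated software time in cycles from simulated event counts."""
    return (instructions * cpi
            + l1_misses * l1_latency
            + ll_misses * ll_latency
            + branch_misses * branch_penalty)

# Hypothetical inclusive event counts for one function.
t_sw = software_cycles(instructions=1_000_000, l1_misses=2_000,
                       ll_misses=100, branch_misses=5_000)
```

For these counts the instruction term dominates: the estimate is 1,080,000 cycles, of which 80,000 come from misses and mispredictions.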
Breakeven-speedup
Our breakeven-speedup metric is shown in equation 3.5. Breakeven-speedup is defined as the
computational speedup that an accelerator for a particular function would require in order to offset the
data-offload costs for input, $t_{comm:accel:ip}$, and output, $t_{comm:accel:op}$.

$$S_{breakeven} = \frac{t_{sw}}{t_{sw} - (t_{comm:accel:ip} + t_{comm:accel:op})} \tag{3.5}$$
Any computational speedup obtained in excess of the breakeven-speedup will result in an overall
improvement in the execution time of the application. Whether the breakeven-speedup or a greater
speedup can be achieved for a function depends on the amenability of the function logic to a
hardware implementation. Studying the amenability of functions to hardware acceleration is the next step in
early-stage design of heterogeneous CMPs, after accelerator selection. Amenability is dictated by
the design space of the accelerator itself, and depends on several factors including the amount
of available parallelism in the candidate functions, the maximum allowable pipeline depth for the
accelerated function, and the accelerator’s initiation interval. Essentially, studying amenability will
require design space exploration for each candidate function with power/performance targets. We
leave the investigation of mapping candidate functions to specific hardware implementations to
future work.
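A short numerical sketch may help in reading equation 3.5. The cycle counts below are hypothetical, and returning infinity when the offload cost meets or exceeds the software time is a choice made for this example, reflecting that no finite speedup can break even in that case under the serial model.

```python
def breakeven_speedup(t_sw, t_comm_ip, t_comm_op):
    """Computational speedup needed to offset input and output offload costs."""
    overhead = t_comm_ip + t_comm_op
    if overhead >= t_sw:
        return float("inf")   # assumption: offload can never pay off here
    return t_sw / (t_sw - overhead)

# Hypothetical function: 1,080,000 software cycles, 1000 cycles of input
# transfer and 500 cycles of output transfer.
s = breakeven_speedup(1_080_000, 1000, 500)

# At exactly the breakeven speedup, accelerated time equals software time.
accelerated = 1_080_000 / s + 1000 + 500
```

A function with high coverage and little communication, as in this example, yields a value only slightly above 1, which is the pattern seen for the best-ranked candidates.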
The goal of the heuristic is to minimize the breakeven-speedup of all the leaf nodes of a trimmed
calltree to result in a merged IDFG. The leaf node of each branch of the merged IDFG should
have the least breakeven-speedup of that entire branch based on the first step of our partitioning
process. The heuristic is thus optimized for maximum application coverage with useful functions
(coverage in the sense of Amdahl’s law: the ratio of execution time in the candidate function to the
total execution time of the workload) and for minimal communication.
Limitations of breakeven-speedup
More sophisticated metrics are possible that incorporate computation costs and estimates for
accelerator storage, based on Sigil’s captured data. The breakeven-speedup metric is a heuristic and does
not take into account the computation costs of an accelerator. While breakeven-speedup is used to
rank leaf nodes of a merged tree, it does not merge the resulting leaf nodes with one another based on
communication, even though such merging can also affect overall application performance.
Breakeven-speedup also does not consider the reuse characteristics of the data and assumes a serial
execution schedule; i.e. it does not incorporate the effect of parallel execution of software partitions
and hardware partitions. Furthermore, it assumes that offload time cannot be overlapped with software
time. Finally, it does not model streaming data to offloaded functions, where a function can begin
execution with only a chunk of its input data set, thereby overlapping execution time with hardware
offload time.

Figure 3.4: The normalized coverage of the leaf nodes of the calltree for all benchmarks

3.2 Results from IDFG partitioning and reuse analysis of PARSEC benchmarks
We ran Sigil on a number of PARSEC benchmarks and used the heuristic-based metric to produce
merged IDFGs for each benchmark. Recall that the heuristic naturally tries to merge sub-trees to
maximize software time coverage while minimizing communication to a merged node. Figure 3.4
shows the breakdown of an application’s native execution time by fraction of candidate functions.
The coverage represented by the leaf nodes of the trimmed calltree is the lower bar and the rest of
the application is the upper bar.
From the graph, we see that many applications spend over 50% of their execution in the leaf
nodes of the trimmed calltree. The exceptions are Canneal, Ferret and Swaptions, whose candidate
functions show low “coverage” of the overall application in terms of execution time. Functions with
low coverage indicate fewer “hot code” regions.
3.2.1 Accelerator candidate functions in PARSEC benchmarks
Table 3.1: Breakeven speedup for top 5 functions for PARSEC-2.1 benchmarks with simsmall input

Blackscholes                  S(breakeven)   Bodytrack                 S(breakeven)
strtof                        1.006          FlexImage::Set            1.000
ieee754 exp                   1.011          ieee754 log               1.007
ieee754 expf                  1.019          ieee754 log               1.007
ieee754 logf                  1.021          IM::ImageErrorInside      1.007
mpn mul                       1.039          IM::ImageErrorInside      1.007

Canneal                       S(breakeven)   Dedup                     S(breakeven)
mul                           1.008          sha1 block data order     1.008
memchr                        1.028          sha1 block data order     1.013
netlist::swap locations       1.040          tr flush block            1.013
memmove                       1.057          write file                1.033
std::string::compare          1.089          adler32                   1.041

Freqmine                      S(breakeven)   Ferret                    S(breakeven)
sort                          1.00625524     memset                    1.001249801
FPArray scan2 DB              1.035051339    printf fp                 1.00704764
sort                          1.037218337    printf fp                 1.007179626
FPArray scan2 DB              1.039737893    ieee754 log               1.008068646
FPArray conditional pattern   1.073152924    ptb qsort help            1.012960706

Table 3.2: Breakeven speedup for worst 5 functions for PARSEC-2.1 benchmarks with simsmall input

Blackscholes                  S(breakeven)   Bodytrack                 S(breakeven)
dl addr                       1.961          std::vector               1.278
mpn rshift                    1.631          IO file xsgetn            1.266
IO sputbackc                  1.421          DMatrix                   1.143
free                          1.238          DMatrix                   1.143
mpn lshift                    1.206          isnan                     1.098

Canneal                       S(breakeven)   Dedup                     S(breakeven)
gnu cxx                       7.466          memcpy                    6.119
std::locale::locale           3.136          memcpy                    1.811
std::string::assign           2.645          hashtable search          1.441
std::basic string             1.893          hashtable search          1.433
operator new                  1.609          free                      1.156

For a designer to evaluate the best functions for acceleration first, we must sort the functions by
their breakeven-speedup. Table 3.1 shows the top functions picked by our proposed max-coverage,
min-communication heuristic from a few PARSEC-2.1 benchmarks. These functions are listed from
top to bottom in order of increasing breakeven-speedup. A low breakeven-speedup indicates a
small communication cost to offload computation. We find that the breakeven-speedup for the top
few functions is in most cases close to 1. Table 3.2 shows the breakeven-speedups for the bottom few
functions. It can be seen that these are mostly utility functions such as constructors (e.g.
std::vector), destructors (e.g. free) and initializers (e.g. std::string::assign). These same functions
also exhibit less computational intensity. To illustrate the usefulness of functions picked by our
heuristic, we describe a subset of them here:
1. ieee754 (operation): These functions are part of the IEEE ‘math’ library. These are usually
very fast code implementations with existing hardware support.
2. mul/mpn mul : These are multiplication calls to the math library. While direct hardware
support exists in contemporary processors, these calls are made for compatibility purposes.
3. ImageMeasurements::ImageErrorInside: In the bodytrack benchmark, a human body is tracked
with multiple cameras through an image sequence. This function measures the “Silhouette”
error of a complete body on all camera images.
4. FlexImage::Set : This bodytrack function initializes an image and is mostly composed of
memcpy calls.
5. memchr : This is a library call which searches for a character in a block of memory.
6. std::string::compare: This call compares two strings.
7. adler32 : A checksum algorithm optimized for speed over accuracy.
8. tr flush block : Part of the zlib algorithm implementing the flushing mechanism.
9. sha1 block data order : This call is the core of the SHA1 calculation.
10. netlist::swap locations: This call swaps two vectors.
11. ptb qsort help: This call is the main quicksort functionality.
12. sort : Basic sort functionality
13. FPArray scan2 DB : Builds prefix-tree for frequent pattern mining [6].
There are a few functions in the list that will benefit from accelerated communication rather
than computation. FlexImage::Set from the bodytrack benchmark is one such example and it is
composed of “memcpy” calls. Since breakeven-speedup focuses on minimizing communication, it
flags FlexImage::Set as having very low communication as all the communication with memcpy is
absorbed when calculating inclusive costs. For example, FlexImage::Set can potentially be sped up
by using memcpy accelerators [109].
This study shows that, with preliminary knowledge of a target platform and a little workload
analysis on a collection of workloads, we can determine a reasonable list of functions to target for
acceleration. Note that this methodology is more effective when the profiled code is modular and
its behavior does not deviate significantly between calls to the same function; that is, the results
are somewhat input dependent. The next natural step for a system designer would be to traverse the
list, apply system constraints and perform an amenability test of these functions to determine if
they can be accelerated in hardware and at what cost.
3.2.2 Data reuse analysis
When the command line option for reuse-profiling is invoked, we characterize a data byte by its
reuse lifetime in the program and the number of times it is reused. Researchers have shown that
taking advantage of data reuse behavior can enhance performance in areas ranging from FPGA
implementations to memory systems and loops in scientific applications [63, 30]. It is also possible
to factor the data reuse patterns into the partitioning problem, as the reuse patterns will determine
potential memory requirements of accelerated functions. Sophisticated partitioning algorithms that
take reuse patterns into account are left for future work and are outside the scope of this
dissertation.
In this section we study the data reuse patterns of PARSEC benchmarks in an architecture-agnostic
manner. Sigil provides an automated way of capturing and analyzing data reuse at the
function level with no prior knowledge of the application. We define reuse lifetime as the time
between the first and last read of a single data byte within a function call. In order to remain
architecture independent, we use the number of retired instructions as a proxy for execution time.

Figure 3.5: Breakdown of data bytes based on reuse counts for PARSEC benchmarks (simsmall input). [y-axis: number of unique data bytes; reuse-count categories: 0, 1-9, >9]

Figure 3.6: Average reuse lifetimes of the top vips functions by number of data bytes reused. [y-axis: average reuse lifetime (instructions)]

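The definitions of reuse count and reuse lifetime can be illustrated with a small sketch. It assumes a trace of (retired instruction count, operation, byte address) tuples for a single function call and uses plain dictionaries in place of shadow memory, so it shows the definitions only and is not a description of how Sigil is implemented. The trace itself is hypothetical.

```python
from collections import defaultdict

def reuse_stats(trace):
    """Map each read byte to (reuse count, reuse lifetime in instructions).

    trace holds (instruction_count, op, address) tuples with op "R" or "W".
    A byte read exactly once has zero reuse and a lifetime of zero.
    """
    first_read, last_read = {}, {}
    reads = defaultdict(int)
    for instr, op, addr in trace:
        if op != "R":
            continue
        first_read.setdefault(addr, instr)   # keep the earliest read only
        last_read[addr] = instr              # overwritten by later reads
        reads[addr] += 1
    return {a: (reads[a] - 1, last_read[a] - first_read[a]) for a in reads}

# Hypothetical trace for one function call.
trace = [(0, "W", 0x10), (5, "R", 0x10), (9, "R", 0x14),
         (40, "R", 0x10), (1200, "R", 0x10)]
stats = reuse_stats(trace)
```

Byte 0x10 is read three times, giving a reuse count of 2 and a lifetime of 1195 instructions, while byte 0x14 is read once and therefore has zero reuse.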
Data Reuse Within a Benchmark
We use Sigil to study the data reuse of PARSEC benchmarks, first in aggregate and then zooming
in to specific functions of interest. We discuss the implications of data reuse in the context of a CPU
cache. While we used a memory system with a cache as an example of gaining insight into memory
behavior, the platform-independent nature of our data allows us to investigate the behavior of any
arbitrary memory system. For instance, the data patterns are equally applicable in scenarios such
as HW/SW codesign and accelerator design.
Figure 3.5 shows the breakdown of repeat accesses to data for several PARSEC benchmarks with
simsmall inputs. The accesses are categorized based on the number of times each byte is reused.
The bottom-most section of each bar indicates zero reuse (the object is written once and read only
once within each function it is accessed in), while the remaining stacked bars represent two ranges
of reuse: between 1 and 9 accesses, and greater than 9. We see that for most benchmarks a very
small percentage of data elements are used more than 9 times. As a significant percentage of data
is created and consumed without ever being read again, most intermediate data generated by these
benchmarks are consumed quickly and need not be cached in a CPU or stored in a scratchpad for
an accelerator. Functions with limited reuse, such as those in the blackscholes and streamcluster
benchmarks, take very little advantage of a local memory (CPU cache or accelerator scratchpad) in
general. However, if the accessed data is not too sparse, such functions can still benefit from the
spatial locality extracting properties of a cache-based hierarchy, such as large cache line sizes and
prefetching. We hypothesize that applications with limited reuse could benefit from custom memory
systems, incorporating temporary buffers with explicit eviction of data when it is dead.
Re-use lifetime is an indicator of the time for which data needs to reside in memory during
program execution. This analysis is important to SoC hardware designers who need to size buffers
and scratch pad memories for accelerated functions. Using data from Sigil we can trace the source of
reuse in a benchmark of interest, e.g. vips. We sort the functions in vips based on their contribution
to the total amount of data reuse. Next, to understand the implication of large reuse, we look at the
top list of functions and examine the average lifetime of a reused data byte (reused at least once) in
those functions. This is shown in figure 3.6. Since Sigil keeps separate accounting of functions called
for different contexts, some functions occur more than once in the figure and are distinguished by
the number in parentheses. Functions with large average data reuse lifetimes may not need to be
cached as their data will be evicted before they are reused anyway.

Figure 3.7: Data reuse distribution of “conv gen” in vips. [x-axis: reuse lifetime (bin size: 1000); y-axis: number of data elements]

In vips, the “conv gen(1)” function has the highest and “imb XYZ2Lab” has the smallest average
reuse lifetime. These two functions and the “affine gen” functions are the three biggest contributors
to the total unique data bytes processed by the benchmark (the total includes the input data and
locally generated data), with each of their individual contributions being close to 10%. The
remaining unique data bytes are distributed across numerous functions, with most of their
contributions being close to 2-3%. Since “conv gen” and “imb XYZ2Lab” are such large contributors to
the overall data and incur such varying reuse lifetimes, we investigate them further.
Data Reuse Within A Function
Sigil can also capture a histogram of data-reuse during a function call. Each bin in the histogram
corresponds to a range of reuse lifetimes and the value of that bin is the count of data bytes whose
reuse lifetimes fell in that range. This information can help designers understand cache behavior
and potentially design custom memory systems. Figures 3.7 and 3.8 show the histograms for the
“conv gen” and “imb XYZ2Lab” functions in vips respectively, with the y-axis in logarithmic scale.
In “conv gen”, the distribution has a long tail and a central peak, while “imb XYZ2Lab” has a peak
at 0 reuse and a short tail. The peak in “conv gen” signifies that there are plenty of data elements
that have large reuse lifetimes and hence bad temporal locality. For such functions, the cache size
will heavily determine the performance of the function, and indeed, of the program.

Figure 3.8: Data reuse distribution of “imb XYZ2Lab” in vips. [x-axis: reuse lifetime (bin size: 1000); y-axis: number of data elements]

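The per-function histogram described in this subsection amounts to bucketing reuse lifetimes into fixed-width bins. The sketch below uses a bin size of 1000 instructions, matching the figures; the list of lifetimes is hypothetical and the function name is illustrative.

```python
from collections import Counter

def lifetime_histogram(lifetimes, bin_size=1000):
    """Count data bytes per reuse-lifetime bin, keyed by the bin's lower edge."""
    return Counter((lt // bin_size) * bin_size for lt in lifetimes)

# Hypothetical reuse lifetimes (in retired instructions) for one function call.
lifetimes = [0, 0, 12, 999, 1000, 4400, 4999]
hist = lifetime_histogram(lifetimes)
```

A distribution concentrated in the first bin, as for most of the bytes here, corresponds to the short-tailed shape, whereas mass in the higher bins indicates data that must stay resident for longer.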
Designers can explore dynamic methods of partitioning the cache into a scratch area and cache
area to help such functions with large reuse lifetimes. In this case, a clever memory system would
keep the data for this function in a scratchpad so as to not evict it until the function returns.
Alternatively, designers can partition the cache into regions with different eviction rates, i.e. lazy
eviction vs. fast eviction. A compiler hint or a runtime monitor could easily embed this information
to ease memory partitioning decisions at run time. The “imb XYZ2Lab” function reuses data at a
higher frequency, which indicates increased temporal locality.
Figure 3.9: Breakdown of lines in memory based on reuse counts for benchmarks in the PARSEC Benchmark Suite (simsmall input). [y-axis: number of cache lines, 0% to 100%; reuse-count categories: < 10, < 100, < 1000, < 10000, > 10000]

In the context of accelerator design, the reuse data captured by Sigil shows how many data bytes
need to stay in an accelerator’s local buffer after being consumed once. This will help determine
buffer sizes based on an execution schedule for the function. For example, Cong et al. use the concept
of BB-curves that indicate tradeoffs in increasing local buffer area for an accelerated function against
external bandwidth pressure [18]. Such curves are a function of numerous variables besides data
reuse, including the amount of parallelism available in the program and exploited in the accelerator
implementation, the pipeline depth, and the initiation interval of the accelerator.
Data Reuse at Cache-line granularity
Byte-level reuse analysis is useful in understanding memory behavior on arbitrary memory systems,
but needs to be used with a detailed model of execution and a hardware description. Sigil can also
capture line-level reuse when configured with the cache line size. In this mode, Sigil shadows every
line in memory rather than every byte. By contrast, our byte-level reuse characterization shadows
every unique byte and accumulates costs at function-level granularity. In line-level mode we print
reuse counts and lifetimes for every block touched by the program, instead of aggregating costs by function.
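The difference between byte-level and line-level accounting can be shown with a minimal sketch in which each address is mapped to its line by integer division. The 64-byte line size and the addresses are assumptions for the example, and the code is not a description of Sigil's shadow memory.

```python
from collections import Counter

def line_reuse_counts(read_addresses, line_size=64):
    """Reuse count per memory line: reads of the line beyond the first."""
    reads_per_line = Counter(addr // line_size for addr in read_addresses)
    return {line: n - 1 for line, n in reads_per_line.items()}

# Hypothetical read addresses; 0, 8 and 63 fall in the same 64-byte line.
reads = [0, 8, 63, 64, 130, 128]
line_reuse = line_reuse_counts(reads)
```

Reads of different bytes within one line count as reuse of that line, which is why line-level results depend on the configured line size while byte-level results do not.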
The reuse behavior of cache lines is architecture-dependent but it can show a software developer or
system designer how to optimize cache use or improve cache design. Figure 3.9 shows the breakdown
of lines in memory by reuse count. While almost all benchmarks have lines reused more than 10,000
times, Dedup, Bodytrack and Streamcluster have a significant number of lines that are reused fewer
times. Lines with low reuse counts across different data sizes can be marked as dead after they are
fully reused. This information can be used for reuse distance analysis and to inform compile time
techniques that assist cache-replacement. There has been prior work exploring these techniques [105,
30] in detail, using information from the compiler or profiles collected from architectural simulation.
3.3 Performing constrained allocation of resources for accelerator candidates
The communication classification data produced by Sigil can be used with actual hardware
implementations of accelerator candidates to arrive at the optimum resources for each accelerator
candidate. Recall that the partitioning process gives us a merged IDFG, where the leaf nodes
represent accelerator candidates. The process also ranked the leaf nodes based on the value of
the partitioning metric (in our demonstrative case, the breakeven-speedup metric). However, the
amount of resources to allocate to each accelerator will depend on how much they are limited by
communication relative to the time they take to finish their work. In this section, with the help of
a sample application composed of accelerator candidates, and performance models for the system,
we show how to allocate area for each accelerator candidate in an area-constrained selection where
the goal is maximizing overall application performance. We use real hardware descriptions to model
accelerator computation costs in the system.
Employing accelerators in a system increases energy efficiency and performance as accelerators
complete more computations for every transistor switch than general purpose processors [42].
Processors with specialization dedicate an amount of chip area for executing a fraction of the workload
(f) with a high amount of speedup (S) [16].
Employing specialization requires new design decisions, such as selecting which hardware units
will be included alongside a general purpose processor (accelerator selection problem), how many
accelerators to instantiate, and the communication and coordination of accelerators in the system.
We have already studied the accelerator selection problem in previous sections, and in this section
we focus on how many accelerators to instantiate for a particular class of workloads. Performing an
exhaustive evaluation of all the points in this design space of the entire system, at the RTL level
will be expensive and time-consuming. The early-stage model proposed in this section is the first
step toward helping prune the design space to a manageable size and provide insight into how to
allocate area for specialization.
We first establish the characteristics of the class of workload we study with the help of a sample
application composed of distinct tasks. We then propose a detailed, but simple performance model
to maximize performance of the application with the accelerator candidates. The model takes into
account the effect of Initiation Intervals in accelerator pipelines and overlaps in communication and
computation due to repeated calls to the same function. To populate the model, we use data from
synthesis of accelerator RTL and employ results from Sigil’s profiles to capture platform independent
information from the binary using a merged IDFG. With the sample application, we show how our
communication-aware model can be used to perform allocation and estimate speedup and compare
results with techniques based on communication-agnostic metrics [82, 16].
3.3.1 Sample application
Table 3.3: Function execution

Function      % Cycles
des3          78.0%
des3key       1.1%
spiral-fft    2.8%
rand          17.9%
other         0.3%

We chose to construct a sample application instead of using PARSEC workloads, as synthesizing
and producing quality RTL for arbitrary functions such as those seen in PARSEC would take a
prohibitive number of research hours for the scope of this work. Instead, our sample application is
constructed with functions based on available RTL. The sample application contains the following
accelerator candidate functions: rand generates random data, des3key receives a 24 byte key value
and generates a subkey schedule used for 3DES encryption, des3 encrypts or decrypts 8 bytes of
data using the key schedule from des3key, and spiral-fft performs a 16-point DFT on the decrypted
data. The application represents a stream of tasks where one task feeds into the next and multiple
iterations of the stream operate on independent data. Table 3.3 lists the percentage of execution
cycles for each function.
Table 3.4: Accelerator characteristics
Parameter Measure
Area mm2
Energy / task nJ / task
Cycle Time s
Task Latency Cycles
Task Throughput Tasks / cycle
Initiation Interval Cycles
Input BW Bytes / cycle
Output BW Bytes / cycle
For each of our functions, we also characterize actual IP that implements the function in hard-
ware, as characterized accelerators used in a performance model will more accurately reflect execution
under real-world constraints. Our decision to use real RTL is also based on the assumption that most
system designers will have a library of IP blocks from which they can select. Any accelerator block
can be characterized by the parameters in table 3.4. An accelerator can be called multiple times with
new input data for each call. The initiation interval represents the amount of time that must elapse
before an accelerator can be called again. We characterize several open-source IP blocks for our
functions including encryption (3DES, AES), DSP (FFT, DCT), and error coding (Reed-Solomon)
[96, 78] by synthesizing RTL through a Synopsys CAD flow. Scaling factors [28] can be applied on
IP to perform design space exploration at different technologies. Characterized accelerators give a
real cost for performance rather than an assumed approximate relationship between performance,
bandwidth, and area. Indeed, the work from which this section of the thesis has been adapted supports
this conclusion by showing that Pollack’s rule does not apply to the estimation of accelerator
resources [73].
[Figure 3.10 timing diagram: panels (a) and (b) plot bus communication (call1 to call4) and accelerator execution (ex1 to ex4) against time, with the labels tcomm:accel:ip, tcomm:stall, tinit, tinit (effective), II, Accel 1, and Accel 2.]
Figure 3.10: Computation and communication overlap: When the initiation interval of
a single instance of an accelerator is greater than communication for a function call, replication
is possible (a) tcomm:accel:ip < tinit, (b) 2 ∗ tcomm:accel:ip > tinit
3.3.2 Performance estimation and resource allocation models
To evaluate our resource allocation for the accelerator candidates, we need to be able to estimate
application performance on the target system. As prior work has shown, communication costs will
be critical to evaluating the performance of the application on the target system [49, 51]. However,
to the best of our knowledge, simulation infrastructure that models heterogeneous CMP systems with
application-specific accelerators and their interaction with general-purpose CPUs is currently unavailable,
as it requires modeling sophisticated and flexible interfaces. Further exacerbating the problem,
there are no standard interfaces in current systems that employ hardware specialization.
In lieu of a simulation setup, we present an analytical performance estimation model in this section
that takes into account the cost of communication and computation on accelerators and the CPU.
Communication-naïve performance estimation models do not differentiate between computation
and communication, and assume both components of CPU execution time, $t_{ex:CPU}$, are sped up
equally:
\[ t_{ex:CPU} = t_{comp:CPU} + t_{comm:CPU} \tag{3.6} \]
\[ t_{ex:accel} = \frac{t_{ex:CPU}}{S_{accel}} \tag{3.7} \]
where $t_{ex:accel}$ is the execution time on an accelerator.
This would lead to optimistic results of overall application performance as the communication
portion of the CPU software time is not necessarily accelerated. It is imperative to study the
communication costs of an accelerator as well, given the same number of bytes that consumed $t_{comm:CPU}$.
We propose a heterogeneous performance model that acknowledges that data must be transferred to an
accelerator over an accelerator bus. Our models incorporate functions, multiple calls to functions,
and details such as initiation intervals, as will be shown shortly. Communication cost needs to be
measured as the amount of communication between functions and modeled with respect to interconnect
bandwidth, $BW$. An accelerator implementation will include implicit local memory access and thus,
we can ignore duplicate memory reads by a function. We will thus use the unique communication
edges from the aggregates representation. The communication time incurred per call on the bus is given
by:
\[ t_{comm:accel:ip} = \frac{\text{Bytes}_{input}/\text{fxn-call}}{BW} \tag{3.8} \]
\[ t_{comm:accel:op} = \frac{\text{Bytes}_{output}/\text{fxn-call}}{BW} \tag{3.9} \]
In this early work, we assume pipelined accelerators characterized by an initiation interval,
$t_{init}$, a measure of the required time between successive function calls to an accelerator. Typically,
this is only one accelerator cycle in a pipelined accelerator. We also assume that data transfers occur
at the maximum transfer rate of the bus. Future extensions of our work should model latency due
to contention on the interconnect. We do not take into account the latency between function calls
or circular dependencies in the workload; we assume that all function calls occur serially, one
immediately after another, and are independent. Note that the assumptions for the performance estimation model are valid
for workloads that behave like our sample application, as they possess no circular dependencies and
represent a serial stream of tasks.
Given our assumptions, a function is a good candidate for acceleration when the time
spent transferring data for a function call, $t_{comm:accel:ip} + t_{comm:accel:op}$, is less than the execution
time on the CPU, $t_{ex:CPU}$. The opportunity to exploit parallelism arises when both the input and
output communication times for a call, $t_{comm:accel:\{ip,op\}}$, are less than $t_{init}$. Thus, the end-to-end time
taken for a task to complete on the accelerator, termed the accelerator computation time $t_{comp:accel}$,
will not be the limiting factor in pipelined accelerators.
Four factors determine the performance estimation model and our communication-aware resource
allocation model: $t_{comm:accel:ip}$, input communication time per function call; $t_{comm:accel:op}$, output
communication time per function call; $t_{comp:accel}$, accelerator computation time; and $t_{init}$, initiation
interval. Figure 3.10 shows the execution schedule assumed for our performance estimation model,
and also shows how performance can be improved by allocating more resources, that is, by replicating the
accelerator when $t_{init}$ of a particular accelerator is greater than $t_{comm:accel:\{ip,op\}}$. We model both
input and output communication, but only show input communication in the figure. A single
pipelined accelerator, (a), will be unable to utilize the full bus bandwidth as communication is
stalled while the pipeline is busy. This is the case where $t_{comm:accel:ip} < t_{init}$. To mitigate this, the
designer must reduce the effective initiation interval. Reducing the cycle time can achieve this, or,
as we show in (b), the accelerator can be replicated so that subsequent function calls to the same accelerator
type are issued to another instance before the intrinsic $t_{init}$ of the first has elapsed, thereby reducing
$t_{init}$(effective) (double buffering is assumed for this
purpose). This is the basis for equation (3.11), which describes the optimal number of accelerators
to reduce stalls.
The timing associated with figure 3.10(a), including stalls determines our accelerator execution
time shown in equation 3.10.
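\[
\begin{aligned}
t_{ex:accel} ={} & (\#calls - 2) \times \Big[ \max\left(t_{comm:accel:ip},\, t_{comm:accel:op}\right) \\
& \quad + \max\big(0,\; t_{init} - \max\left(t_{comm:accel:ip},\, t_{comm:accel:op}\right)\big) \Big] \\
& + 2 \times t_{comp:accel} + t_{comm:accel:ip} + t_{comm:accel:op}
\end{aligned}
\tag{3.10}
\]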
For resource allocation itself, we need to determine which accelerators provide most benefit
to the application. Based on our assumptions, the following equation is a communication-aware
allocation model that lists the number of instantiations of a particular accelerator required to support
overlapping communication and execution without stalls:
\[ num_{accels} = \min\left( \frac{t_{init}}{t_{comm:accel:ip}},\; \frac{t_{init}}{t_{comm:accel:op}} \right) \tag{3.11} \]
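The model in equations (3.8) through (3.11) can be sketched in a few lines of Python. This is a minimal illustration rather than the tooling used in this work: the function names and the numeric inputs are hypothetical, and all times are in arbitrary but consistent units.

```python
def comm_time(bytes_per_call, bw):
    """Equations (3.8)/(3.9): bus transfer time for one function call."""
    return bytes_per_call / bw


def accel_exec_time(n_calls, t_ip, t_op, t_init, t_comp):
    """Equation (3.10): execution time for n_calls on a single accelerator."""
    t_comm = max(t_ip, t_op)
    # Each steady-state call costs its transfer time plus any stall spent
    # waiting for the initiation interval to elapse.
    per_call = t_comm + max(0.0, t_init - t_comm)
    return (n_calls - 2) * per_call + 2 * t_comp + t_ip + t_op


def num_accels(t_ip, t_op, t_init):
    """Equation (3.11): instances needed to overlap communication and execution.

    The raw ratio is returned; rounding to a whole number of instances
    is an assumption left to the caller.
    """
    return min(t_init / t_ip, t_init / t_op)


# Hypothetical inputs: 8 input bytes and 16 output bytes per call on a bus that
# moves 8 bytes per time unit, an initiation interval of 4, a task latency of 10.
t_ip = comm_time(8, 8)
t_op = comm_time(16, 8)
t_single = accel_exec_time(n_calls=5, t_ip=t_ip, t_op=t_op, t_init=4.0, t_comp=10.0)
n_inst = num_accels(t_ip, t_op, t_init=4.0)
print(t_single, n_inst)
```

With these inputs the initiation interval, not the bus, bounds each call, which corresponds to the stalled schedule of Figure 3.10(a).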
3.3.3 Area-constrained allocation
[Figure 3.11(a) plot: number of accelerator instances (left axis, 0 to 30) and accelerator area in mm² (right axis, 0.00 to 2.00) versus bus BW (4, 8, 16, 32 GB/s) for des3, rand, spiral_fft, and des3key.]
(a) Num. accelerators, accelerator area vs. bus BW
[Figure 3.11(b) plot: accelerator area (mm², 0.00 to 1.20) at 16 GB/s bus BW for the pollack, frac_ex, and comm-aware allocations, broken down by des3, rand, spiral_fft, and des3key, with speedup annotations (S=211x, S=18x, S=174x) and per-function instance counts.]
(b) Area allocation for “Pollack’s rule”, fraction of execution, and communication aware
models at 16 GB/s bus BW.
Figure 3.11: Model evaluation results
In this subsection, we compare our resource allocation model to communication-agnostic models
for allocation described in the previous subsection, using an area-constrained allocation problem for
our sample application. We assume the system architecture shown in section 3.1.1, where each tile
is assumed to operate on a single application such as our sample application and the frequency and
latencies are as indicated in that section. Using measurements for communication costs collected
by Sigil, we also sweep bus bandwidth to evaluate how communication latency affects performance.
The 2.5GHz CPU is assumed to be an Intel XEON E5620. In order to compare against the XEON
E5620 baseline, we synthesize accelerators in 90 nm and scale to 32 nm. Given our architecture
assumptions and contention-free bus model, our model reflects an upper performance limit.
Figure 3.11a shows the optimum number of accelerators required to process the unique data
communicated to each accelerator (left axis), and total area (right axis) as bus BW increases. At
lower bandwidths, a single accelerator is sufficient for most functions. As bus speed increases, data
is passed to the accelerator more quickly than can be pushed into the pipeline, requiring additional
accelerator instances.
Figure 3.11b shows the accelerator area allocation at the 16 GB/s BW point, the speedup over
baseline CPU execution (calculated by our model, equation (3.10)), and the number of accelerators
instantiated by each allocation (adjacent to each column). We compare our communication-aware allocation
(comm-aware), governed by the model presented so far (right column), against allocations governed by Pollack
scaling, pollack (left column), and a naïve allocation proportional to the fraction of execution cycles,
frac_ex (middle column), keeping the total area constant with the comm-aware model.
The Pollack scaling model is based on a cost function known as “Pollack’s rule”, which says that
performance scales with the square root of area, as described in equation (3.12) [82]. In prior research,
accelerators have been modeled via “Pollack’s rule” [16]. This equation models an allocation of
resources, $A_{accel}$, with an effectiveness over CPU execution, $\alpha$, yielding a speedup, $S$.
\[ S = \frac{1}{\alpha}\sqrt{A_{accel}} \tag{3.12} \]
For the pollack allocation, Pollack’s rule (equation (3.12)) is used as the cost function for an
optimization problem minimizing execution time, evaluated using the MATLAB fmincon constrained
non-linear optimization function. The Pollack model is communication “unaware,” using the fraction
of execution time as input into the cost function. Both rand and des3 receive more area than our
model (right column) as they represent the largest number of computation cycles, and the Pollack
model calculates a 165x speedup over the CPU. However, plugging this allocation into our communication-aware
model (equation (3.10)) yields only an 18x speedup (shown in the figure), as communication stalls on
des3 and spiral-fft limit actual speedup. We can conclude that not only does Pollack’s rule incorrectly
allocate area, but it also does not correctly estimate overall speedup because it is communication-naïve.
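To make the pollack allocation concrete, the following sketch distributes a fixed area budget among the accelerated functions so as to minimize modeled execution time under Pollack’s rule. It is a simplified stand-in and not the fmincon setup used in this work: it assumes the speedup takes the form S = sqrt(A)/α with a single α shared by all functions, no minimum-area constraints, and continuous areas. Under those assumptions the Lagrangian condition gives each function an area proportional to its execution fraction raised to the 2/3 power. The 1.0 mm² budget and α = 1 are hypothetical; the execution fractions come from Table 3.3.

```python
import math


def pollack_speedup(area, alpha=1.0):
    """Pollack's rule in the assumed form S = sqrt(A_accel) / alpha."""
    return math.sqrt(area) / alpha


def pollack_allocate(fractions, total_area):
    """Split total_area to minimize sum(f_i / S_i) under Pollack's rule.

    Setting the derivative of sum(f_i * alpha / sqrt(A_i)) + lam * sum(A_i)
    to zero gives A_i proportional to f_i ** (2/3) when alpha is shared.
    """
    weights = {name: f ** (2.0 / 3.0) for name, f in fractions.items()}
    scale = total_area / sum(weights.values())
    return {name: w * scale for name, w in weights.items()}


def modeled_time(fractions, areas, other=0.003, alpha=1.0):
    """Normalized execution time: accelerated functions plus unaccelerated code."""
    return other + sum(
        f / pollack_speedup(areas[name], alpha) for name, f in fractions.items()
    )


# Execution fractions from Table 3.3; the area budget is hypothetical.
fractions = {"des3": 0.780, "des3key": 0.011, "spiral-fft": 0.028, "rand": 0.179}
areas = pollack_allocate(fractions, total_area=1.0)
for name, area in sorted(areas.items(), key=lambda item: -item[1]):
    print(f"{name}: {area:.3f} mm^2")
```

Because the cost function sees only execution fractions, des3 and rand receive most of the area regardless of how much data each function must move over the bus, which is the communication-unaware behavior discussed above.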
The frac_ex allocation (middle) is limited to a discrete number of characterized accelerators
similar to our communication-aware model. Area is allocated based on a function’s fraction of
execution, shown in table 3.3, constrained by the minimum area for each accelerator. This allocation
experiences communication stalls in spiral-fft while extra area is unnecessarily allocated to des3 and
rand. Our communication-aware model eliminates these stalls by creating the optimal number of
accelerators, yielding a 211x speedup over the CPU, vs. 18x for the pollack and 174x for the frac_ex
allocations.
Thus, using execution time as a metric overestimates speedup as it does not account for bus
bandwidth and speedup limited by communication. This is evidenced by the fact that it does not
allocate area in a way that maximizes overall speedup. In the future, system memory models and
detailed interconnect simulations need to be incorporated into the selection process. We showed
results for a sample application with a few functions. More sophisticated and derived constraints
that combine area, power and performance considerations can be used to perform resource allocation
as well. We leave such investigations to future work, as the scope of this section was to demonstrate
the capability of modeling systems with Sigil’s captured data. Future work must also consider a
multitude of functions operating under a system area and energy budget, in which case the presented
metrics and models will be the basis for an optimization problem.
3.4 Using Sigil’s fine-grained task graphs to identify parallel execution
paths
In this section, we show how Sigil’s event trace representation is used to discover parallelism in serial
versions of PARSEC and SPEC benchmarks. We generate event traces for several benchmarks in
PARSEC and SPEC by running Sigil on the benchmarks with the event trace command line option
activated. We then parse the event traces using a script to construct a dependency tree as described
in section 2.4.2.
There have been several approaches to automatically extracting parallelism from nested loops
and from functions [87, 33, 56]. Our work is orthogonal to these approaches, as we instead highlight the limits of
parallelism based on the critical path. Critical path analysis has been applied to a range of
domains, from ASIC design to scheduling, distributed systems, and networked systems [59, 89, 88].
By measuring the critical path, programmers and system designers can focus their design
efforts on reducing it and thus improving the functional parallelism of the workload.
As explained before, the dependency tracking features of Sigil allow it to examine the dependencies
between functions and discover the longest path of dependent functions within a program.
Using the information collected by Sigil, we construct dependency chains from the beginning of
the program, following the methodology described in section 2.4.2. The longest of these chains is the
critical path. These paths could also represent the ideal execution schedule of computation events.
As explained earlier, we distinguish between individual calls to a function by creating new nodes in
the chain for every individual call. We also assume calls to child functions can be non-blocking and
are only limited by their data dependencies. The maximum theoretical function-level parallelism is
the ratio of overall serial length of the program to the critical path length. This ratio represents
the limit to the extractable function-level parallelism in the program. We analyze the serial versions
of a few PARSEC benchmarks and the libquantum benchmark from SPEC to establish their limit.
The results are plotted in Figure 3.12.
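The ratio described above can be computed with a longest-path pass over the dependency graph. The sketch below is illustrative only and is not the post-processing script used in this work: the node names, costs, and edges are hypothetical, each node stands for one call to a function, and a node's cost stands for the operations executed in that call.

```python
def critical_path_length(cost, deps):
    """Length of the longest cost-weighted dependency chain in a DAG.

    cost: {node: cost of that call}
    deps: {node: [nodes whose data it consumes]}
    """
    longest = {}

    def finish(node):
        # Longest chain that ends at this node, memoized per node.
        if node not in longest:
            longest[node] = cost[node] + max(
                (finish(p) for p in deps.get(node, [])), default=0
            )
        return longest[node]

    return max(finish(n) for n in cost)


def max_function_level_parallelism(cost, deps):
    """Serial length of the program divided by its critical path length."""
    return sum(cost.values()) / critical_path_length(cost, deps)


# Hypothetical trace: two independent calls to f feed separate calls to g,
# and a final call to h consumes both results.
cost = {"f#1": 4, "f#2": 4, "g#1": 6, "g#2": 6, "h#1": 2}
deps = {"g#1": ["f#1"], "g#2": ["f#2"], "h#1": ["g#1", "g#2"]}
print(critical_path_length(cost, deps))            # 12
print(max_function_level_parallelism(cost, deps))  # 22 / 12
```

The recursive form is adequate for small graphs; a trace with millions of nodes would need an iterative traversal in topological order to avoid deep recursion.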
To investigate further, we examine the functions in the critical path for the streamcluster and
fluidanimate benchmarks. We found the following functions in the critical path for streamcluster
(from leaf to main):
drand48_iterate → nrand48_r → lrand48 →
pkmedian → localSearch → streamCluster → main
Streamcluster is characterized by many short paths, where functions closer to the leaf-end of the
critical path are of small consequence, e.g. rand. While the theoretical parallel limit is high due to
the shortness of the individual paths, the overhead may not allow a programmer to extract all the
function-level parallelism. We find a similar situation for libquantum as well.
The functions for fluidanimate are as follows:
ComputeForces → main
Fluidanimate’s path is composed of a single function, ComputeForces. This function does the
bulk of the work in fluidanimate, contributing close to 90% of the operations in the entire workload.
As a result, a designer can speed up a program by accelerating/optimizing such a function with a
goal of matching the other path lengths.
For the sake of simplicity, we do not employ the more sophisticated critical path analyses from
the literature, which also take communication edges into account [88]. Besides highlighting the
theoretical parallelism, we can use critical path information to build an optimal schedule for the
program. The functions in parallel paths in a program can be mapped onto multiple cores such that
dependencies are respected. A software developer may have a fixed number of scheduling slots based
on the number of available cores. The developer can map dependency chains onto these slots so as
to minimize communication between slots and balance the load among them.
[Figure 3.12 plot: maximum function-level parallelism per benchmark (y-axis 0 to 20, with two off-scale bars labeled 260 and 230).]
Figure 3.12: Maximum speedup based on function-level parallelism
Current Limitations Recall from the earlier discussion of the event trace representation that it
is a compromise between coarse-grained representations such as the aggregates representation and
instruction-level data dependence graphs. However, our event trace representation does possess some
limitations that prevent it from extracting the best theoretical parallel representation. Listings 2.1
and 2.2 show that there are 3 identifiers in both computation events and communication events that
are used to refer to entities (computation chunks). The identifiers are Function numbers, Function
instance numbers, and Call numbers; a chunk is not, however, referred to by its event number.
Hence, specific values for the 3 identifiers can still map to more than one computation event. A
communication event that refers to a producing entity with the 3 identifiers cannot be traced back
to one particular computation event; it identifies only the specific function call the data came from.
Thus, when constructing a
dependency based on a communication event parsed from this file, we have to conservatively
connect it to the latest computation event that possesses the correct values for the 3 identifiers. This
introduces some serialization in the construction of dependency chains of computation events. The
main reason for the existence of this limitation is similar to the reason we cannot have instruction-level
data dependence graphs: keeping track of that many producing and consuming entities is too
expensive, as we would potentially need to identify millions of producing and consuming entities during
the profiling step. This becomes prohibitively expensive even though chunks are more coarse-grained
than instructions.
Another current limitation is that, while post-processing the event file, our tree gets too large to
hold in memory if we allow every computation event to be a node. Hence we allow accumulation
of multiple successive events into one node if the entity identifiers are the same. This will also result in
some serialization, as some nodes can have multiple incoming dependency edges.
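The two limitations above can be illustrated with a small sketch. It is not Sigil's post-processing code: the tuple layout, the field names, and the event list are hypothetical. A communication event carries only the producer's (function, instance, call) identifiers, so it is linked to the latest matching computation node, and successive computation events with the same identifiers are accumulated into one node.

```python
def build_nodes(events):
    """Return (nodes, edges) built from a list of trace events.

    Computation event:   ("comp", ident, ops)
    Communication event: ("comm", producer_ident, consumer_ident)
    where ident is a (function, instance, call) tuple.
    """
    nodes = []     # each node: {"ident": ident, "ops": accumulated ops}
    latest = {}    # ident -> index of the latest node with that ident
    edges = set()  # (producer node index, consumer node index)

    for event in events:
        if event[0] == "comp":
            _, ident, ops = event
            if nodes and nodes[-1]["ident"] == ident:
                # Accumulate successive events of the same entity into one node.
                nodes[-1]["ops"] += ops
            else:
                nodes.append({"ident": ident, "ops": ops})
                latest[ident] = len(nodes) - 1
        else:
            _, producer, consumer = event
            # Conservative choice: link to the latest node matching the producer.
            # Assumes both identifiers have already appeared in the trace.
            edges.add((latest[producer], latest[consumer]))
    return nodes, edges


# Hypothetical trace: f and g alternate, and g reads from f twice.
F, G = ("f", 1, 1), ("g", 1, 1)
events = [
    ("comp", F, 10), ("comp", F, 5), ("comp", G, 7), ("comm", F, G),
    ("comp", F, 3), ("comp", G, 2), ("comm", F, G),
]
nodes, edges = build_nodes(events)
print([n["ops"] for n in nodes], sorted(edges))
```

In this example the first two computation events of f collapse into one node, and each communication event attaches to whichever f node is most recent, even if the data was produced by an earlier one; both effects add serialization to the resulting chains.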
3.5 Background and related work in partitioning and parallelism
Several research efforts have explored partitioning problems for embedded systems
and FPGAs. Most of that work is covered in the HW/SW codesign and reconfigurable computing
fields. More recent efforts at design of accelerator-based heterogeneous CMPs have not explored
accelerator selection formally with HW/SW partitioning approaches, but have proceeded assuming
that the accelerators have already been selected [21, 20, 18]. Cong et al. [20] present a scalable and
flexible paradigm, named CHARM, to reuse and restructure accelerator-rich CMPs. They have also
introduced AXR-CMP [21], a scheme that allows scaling and sharing of a set of multiple accelerators
among multiple cores. These efforts have sophisticated models for the system architecture and
interaction of accelerators with the CPU. In this section we attempt to cover the relevant related
work and distinguish our work in this chapter.
Gregg and Hazelwood argue in the context of GPUs that memory transfer time must be taken
into account when discussing speedup [36]. This chapter looks at this argument in the context of
fixed-function accelerators improving serial performance. Hou et al. analyze the memory access
patterns of several streaming workloads to investigate common characteristics of data streaming
accelerators included in the IBM POWER-EN processor [50]. They note that streaming accelerators
typically step through an address-space in a well formed manner. We take this one step further,
categorizing memory accesses in terms of the address, binning the access into “unique” and “non-
unique” accesses and identifying reuse patterns that can potentially define the utility of a function as
an accelerator candidate. As mentioned in section 3.2.2, we can potentially use this information to
improve the determination of candidacy of functions as accelerators. Cong et al.’s work assumes that
the accelerators have already been selected. Finally, Blagovich et al. present a model of multi-grain parallelism that attempts
to model execution on a heterogeneous platform [10]. We model accelerators in a similar manner,
including the cost to offload computation, and use this to make informed decisions.
Prior work in the HW/SW co-design field attempts to optimize the partitioning of tasks between
hardware and software at design time [113, 67, 32]. Youness et al. describe a generic partitioning
algorithm that can be used to partition applications between multiple CPUs or between CPUs and
accelerators. Mudry et al. employ a genetic algorithm to partition applications toward the same
goal. These techniques perform an optimization on a cost function to select the best design from
the design space. Galanis et al. operate on program binaries, but simply use instruction counts to
determine candidates (kernels in their nomenclature) [32]. They use a custom partitioning algorithm
to achieve their partitioning goals. Other work specifically uses instructions/operations, dataflow
analysis and communication to determine hardware candidates [41, 25], but these works are quite
outdated and have been replaced with high-level synthesis tools that perform similar operations.
Prior work usually uses profiling to determine performance of sub-tasks in software and estimation of
performance and area in hardware, similar to our work. We find that more sophisticated partitioning
algorithms as seen in these works can be applied on Sigil’s aggregate representations to determine
accelerator candidates, so long as the considerations presented in section 3.1.2 are incorporated.
Work in the reconfigurable computing field also explores the hardware/software partitioning
problem. Some work uses the compiler to select regions of “hot code” for hardware/software
partitioning [17]. Cordone et al. use a graph-theoretic approach for optimal partitioning and also explore
scheduling problems [22]. Huang et al. [51] propose splitting task graphs such that overall
communication in the system is kept to a minimum. As with the algorithms encountered in the HW/SW
codesign field, we can incorporate these sophisticated algorithms into our partitioning approach.
Work in both the HW/SW codesign and reconfigurable computing fields quite often assumes designer
expertise, while we attempt to provide a more general approach that does not
require software or algorithms expertise.
Conservation Cores [103] presents a framework where applications are profiled statically, yielding
“hot regions” of code selected for high-level synthesis. The result is a chip with specialized hardware
units targeted at reducing energy consumption of applications. However, the metrics used to identify
such “hot regions” are not described. Our work presents a communication-centric approach to
identifying candidate regions inspired by work in HW/SW partitioning. The Roofline model from
Williams et al. [107] indicates the CPU performance bounds on a platform as a function of the
operational intensity. The model relies on external (off-chip) memory bandwidth to determine the
bounds. The Roofline model is similar in spirit to ours, but its goal is code optimization, specifically for homogeneous
platforms.
3.6 Summary
In this chapter, we have shown how Sigil’s aggregate representation can be used for the acceler-
ator selection problem. The aggregates representation essentially represents interprocedural data
flow graphs (IDFGs) with unique communication edges (representing input sets). Our accelerator
selection methodology is inspired by prior HW/SW partitioning approaches, and we describe the
special considerations required to adapt partitioning algorithms to Sigil’s novel IDFGs. We discuss
and propose a novel way to calculate inclusive costs that can be used to fairly compare nodes in the
IDFG to determine candidacy for acceleration. We highlight our novel demonstrative partitioning
metric (breakeven-speedup) used in a case study to determine accelerator candidates from PARSEC
benchmarks. We show how Sigil’s novel data reuse profiling results can be interpreted to under-
stand memory system behavior for CPUs or accelerators. We have also explored a constraint-based
resource allocation problem on accelerator candidates, targeting improvement of overall application
performance, using a proposed performance estimation and novel communication-aware resource
allocation model. Finally, we also present the theoretical limits of parallelism in some SPEC and
PARSEC benchmarks using Sigil’s novel event trace representation, using critical path analysis.
3.7 Acknowledgments
Most work in this chapter is adapted and extended from a paper entitled “Platform-independent
analysis of function-level communication in workloads” by Siddharth Nilakantan, and Mark Hemp-
stead. The dissertation author was the primary investigator and author of this paper. The remaining
sections are adapted from a paper entitled “Metrics for Early-Stage Modeling of Many-Accelerator
Architectures” by Siddharth Nilakantan, Steven Battle, and Mark Hempstead. The dissertation
author was the primary investigator and author of this paper. Some material is also drawn from
currently unpublished work that was submitted to the International Symposium on Computer Ar-
chitecture, 2014. The adapted portions were well received and will make it to the refined version
of the submission. This material is based on work supported by the National Science Foundation
grants where the Primary Investigator is Mark Hempstead.
Chapter 4: Communication classification applied to multithreaded
programs
As chip multi-processors (CMPs) are the predominant type of architecture employed in modern
systems, system designers require dependable simulation methodologies for parallel CMP-based sys-
tems. Trace-driven simulation of CMP-based systems has significant benefits over execution-driven
simulation, such as reducing simulation complexity and simulation time, allowing portability, and
scalability. However, execution-driven simulation is still typically used to evaluate CMPs due to the
difficulty of reliably generating and accurately replaying multi-threaded traces [35]. In this chapter,
we investigate how our workload characterization methodology based on communication classifica-
tion can be applied to multi-threaded programs to overcome this difficulty and enable trace-driven
simulation methodologies for CMPs. For the purposes of this investigation, we adapt the approaches
presented in Chapter 2 for multi-threaded programs.
As part of our adaptations, we show how and why the event trace representation of Sigil is
extended to provide additional information unique to threads. With these extensions enabled, we
are able to use the event trace representation to perform accurate trace-based CMP design space
exploration. The extensions modify Sigil’s computation and communication events to also store all
the unique addresses encountered. In addition, the extensions model a new type of event called a
synchronization event. Synchronization events are added to the trace to allow for modeling thread
synchronization constructs correctly when replaying traces.
We first motivate the extension of the event trace representation by studying the major
limitation of traces for multi-threaded applications: non-determinism. To isolate the sources of
non-determinism and their impact on trace-based simulation, we compare the event traces to traces
obtained from an existing full system simulation framework, Gem5. For this comparison, we use
a simpler version of the event trace representation from the Sigil tool as a substitute for the event
traces, as the simpler version is easier to parse. This simpler version takes
the form of a flat memory trace that is annotated with information such as Thread IDs and Function
names.
Based on the insight from the motivation study, we propose modifications to the event trace
representation and detail the required changes. Finally, based on the modified event trace represen-
tation, we propose and evaluate a trace-based simulation methodology named SynchroTrace. We
show how a simulation flow based on SynchroTrace is valid for design space exploration and provides
simulation speedup by comparing against Gem5’s full system simulation flow.
This chapter is organized as follows: Section 4.1 presents the study that highlights the current
issues with traces of multi-threaded applications and proposes solutions. Based on the insight from
this study, section 4.2 proposes a modified event trace representation to generate traces for accurate
CMP simulation. Section 4.3 presents a dynamic replay mechanism that plays back the traces
described in section 4.2. The trace generation and replay mechanism together form a trace-based
CMP design space exploration methodology named SynchroTrace. We validate SynchroTrace by
comparing our trace-based simulation results for a CMP design space exploration against the Gem5
Full-System simulator results in section 4.4. In section 4.5, we describe the performance improvement
of SynchroTrace over full-system simulation and present trace-based optimizations for speedup of
CMP architecture simulations. Finally, we compare SynchroTrace with related work in section 4.6
and conclude in section 4.7.
4.1 Issues with traditional flat traces of multi-threaded applications
In the context of architecture simulation, traces refer to a record of the chronological sequence of
events that occur in a program. A trace-driven simulation flow takes two passes: trace generation and
trace replay. Traces can be recorded at different levels of the system depending on which subsystem
is being designed. For example, an instruction trace records all the instructions in the dynamic
stream in chronological order. It can be used when detailed CPU models are required. Similarly,
memory traces record only the LD/ST instructions from the dynamic stream [9]. Memory traces
can be used in conjunction with very simple CPU models in order to do more detailed simulation of
just the “uncore” [65]. However, traditional instruction and memory traces cannot accurately model
multi-threaded applications in simulation due to the non-determinism in thread execution.
We first present the potential sources of non-determinism, and then confirm and quantify the
impact of non-determinism. To quantify the degree of mismatch due to non-determinism in thread
execution, we employ flat traces of multi-threaded applications captured in different simulation
frameworks. Large mismatches arising from non-determinism would result in inconsistency of mod-
eling multi-threaded applications for trace-based CMP design space exploration. By comparing
traces directly, we are able to remove the contribution from different microarchitecture models or
timing models employed by each simulation environment.
Subsection 1.3 explained non-determinism in the thread execution of multi-threaded applications.
Subsection 4.1.1 presents the framework used to compare traces from different simulation
frameworks, along with the metrics used for the comparison.
4.1.1 Exploring impact of Non-determinism: Experimental setup
Here we describe the setup for a study that compares traces generated from different frameworks.
The goal of the study is to explore the impact of non-determinism on traces due to the interleaving
encountered at capture by different frameworks. If the traces deviate significantly between the
frameworks, the result is a large mismatch in operations that could also cause simulation results
to deviate significantly. Thus it becomes necessary to identify the sources of non-determinism that
need to be modeled correctly when using traces for simulation of multi-threaded applications.
We compare the event traces from Sigil that we plan on using for simulation, against traces
generated from an existing full system simulation methodology, Gem5. The study uses flat traces
obtained from Sigil, as a simpler equivalent to event traces in order to reduce the representation to
as close a form as possible to the format of traces obtained from Gem5.
We choose to study memory traces instead of full instruction traces, as instruction traces can
become very large. A memory trace is also sufficiently representative of the flow of a program, in
that it captures data from all the basic blocks and the modification of all variables that is necessary
for our comparison objective. Figure 4.1 shows our comparison framework, which captures traces
at the front-end of the gem5 simulator and the Sigil tool. Both frameworks operate on the same
Figure 4.1: Framework for testing our traces and DBI flow, contrasted alongside the gem5 flow. (Diagram: the same binary is run through a Valgrind tool for binary instrumentation and through gem5's Core/M5 and memory model; each path produces a memory trace.)
binary and produce memory traces of virtual addresses for each thread. The memory trace is a list
of load and store instructions in the order they are observed.
We annotate the traces obtained from gem5 with Thread IDs and functions for every memory
operation. We modified the stock gem5 to print load and store instructions sent by the Simple Timing
CPU core model into the Ruby Memory model. The Simple Timing CPU model is the simplest core
model supported by gem5 and guarantees that the loads and stores come out in program order for
each thread. We extract information such as Thread ID and Function name from the symbol table
and print that information for each load and store instruction.
The simpler event trace from Sigil does not model Integer or Floating point operations and allows
each event to have only one memory operation. It keeps the function annotations in the original
event trace and adds Thread ID annotations as well. This makes the trace from Sigil equivalent to
the flat memory trace we obtain from Gem5.
We also wrote Python scripts to compare the two traces and find the flat difference in read and
written bytes for each function in the application. Our methodology also aggregates the costs of
functions across threads, as the sources of non-determinism in thread execution arise from specific
Table 4.1: List of benchmarks used. All runs use 8 threads with simsmall inputs for Parsec
and the base input for Splash-2.
Benchmarks evaluated    Benchmark Suite    Domain
Blackscholes            Parsec             Financial Analysis
Swaptions               Parsec             Financial Analysis
Canneal                 Parsec             Chip Design
FMM                     Splash2            N-body simulation
Water-nSquared          Splash2            Molecular Dynamics
LU                      Splash2            Linear Algebra
FFT                     Splash2            Signal processing
functions, such as the threading API calls for barriers and mutexes. The identification of
the same function occurring in both traces is an important feature of our methodology, because the
function names in the frameworks are not guaranteed to be the same. As a solution, function names
were parsed from the symbol table in the assembly and aliases were used so that all equivalent
functions could be compared correctly. We identify functions that match between the two
frameworks within a programmable tolerance, chosen from the bimodal distribution of differences
between mismatched functions. We found the lower peak in the distribution at around a 5% difference.
We use a percentage tolerance in read and written bytes to determine whether a function is matched. Based on the
distribution, we assume a 5% tolerance is acceptable for our comparison. This methodology allows
the designers to discover the parts of the application and system context that behave differently
between the two frameworks.
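The per-function comparison described above can be illustrated with a minimal Python sketch. It is not the script used in this work: the record layout `(thread_id, function, op, bytes)`, the alias entry `__pthread_barrier_wait`, and the choice to measure the tolerance relative to the larger of the two byte counts are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical alias table: maps a framework-specific symbol name to one
# canonical name so that equivalent functions are compared with each other.
ALIASES = {"__pthread_barrier_wait": "pthread_barrier_wait"}


def aggregate(records):
    """Sum read and written bytes per canonical function, across all threads.

    Each record is (thread_id, function_name, op, nbytes), op in {"R", "W"}.
    """
    totals = defaultdict(lambda: {"R": 0, "W": 0})
    for _thread_id, func, op, nbytes in records:
        totals[ALIASES.get(func, func)][op] += nbytes
    return totals


def compare(gem5_records, dbi_records, tolerance=0.05):
    """Return {function: (read_diff, write_diff, matched)}; diffs are DBI - gem5.

    A function is "matched" when both its read and written byte counts differ
    by at most `tolerance`, taken relative to the larger count (an assumption).
    """
    gem5, dbi = aggregate(gem5_records), aggregate(dbi_records)
    empty = {"R": 0, "W": 0}
    report = {}
    for func in sorted(set(gem5) | set(dbi)):
        g, d = gem5.get(func, empty), dbi.get(func, empty)
        matched = all(
            abs(d[op] - g[op]) <= tolerance * max(g[op], d[op])
            for op in ("R", "W")
        )
        report[func] = (d["R"] - g["R"], d["W"] - g["W"], matched)
    return report


# Made-up records, for illustration only.
gem5_trace = [(0, "DownwardPass", "R", 1000), (1, "DownwardPass", "R", 1000),
              (0, "pthread_barrier_wait", "R", 400)]
dbi_trace = [(0, "DownwardPass", "R", 1020), (1, "DownwardPass", "R", 1000),
             (0, "__pthread_barrier_wait", "R", 40000)]
for func, row in compare(gem5_trace, dbi_trace).items():
    print(func, row)
```

Aggregating across threads before comparing keeps the result independent of which thread happened to execute a function in each framework, which is the property the methodology relies on.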
Our traces were gathered over multiple runs to identify the impact of non-determinism within
frameworks. We found the gem5 data consistent across multiple runs and the Binary instrumentation
data showed a variability of up to 1%, indicating results are relatively stable within a particular
framework.
We analyze multi-threaded benchmarks from the PARSEC and Splash-2 benchmark suites in
detail, as the non-determinism of multi-threading adds more sources of mismatch. The benchmarks
we use are listed in table 4.1. These benchmarks were picked as they have been shown to exhibit
significant communication and synchronization [4].
The Linux kernel used by gem5’s Full System mode is different from that of the host machine we
employed, so a Full System run would not be a fair comparison. Hence, we run gem5 in syscall emulation (SE) mode,
as it emulates system calls on the same host on which we capture our DBI trace [9]. gem5’s
SE mode does not support dynamic linking or libpthread’s functionality. Instead,
M5threads provides custom implementations of libpthread calls. For example, pthread_mutex_lock is
implemented with spinlocks in M5threads, unlike in libpthread. To this end, we statically compiled
the benchmarks and linked them with M5threads instead of libpthread. Static compilation guarantees
that the same assembly code runs through both frameworks even for library calls. We verified that
this does not affect the functional results of our program through gem5, Binary instrumentation or
native execution.
4.1.2 Exploring impact of Non-determinism: Pthread synchronization
mismatches
We are able to isolate the sources of mismatch using function-level and thread-level information
available in the comparison framework detailed in subsection 4.1.1. We begin this analysis with a simple
comparison of the total read and written bytes in the respective memory traces of gem5 and the
DBI framework.
We find that the total number of bytes read and written are not identical between the two
frameworks and we investigate the differences here. We classified the total read and written bytes
for the entire application across all threads and functions into 3 distinct categories. The first category,
titled Pthread, includes any reads or writes that occur from within Pthread API calls. These functions
are used for creating and destroying threads, for synchronization among threads, critical sections and
other threading related constructs. As thread synchronization constructs are a subset of Pthread API
calls, the API calls can potentially be associated with non-determinism arising from indeterminate
wait time at synchronization points, as explained in subsection 1.3. We classify the remaining read
and written bytes as coming from the user space (titled UserSpace) and from shared libraries and
their system calls (titled Library calls). We also discuss the UserSpace mismatch as it is relevant to
non-determinism.
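As a rough illustration of this classification, the following Python sketch assigns functions to the three categories and computes, per category, the ratio of DBI bytes to gem5 bytes. The rule that Pthread functions are recognized by a `pthread_` name prefix, the `library_functions` set, and the byte counts are assumptions for this example, not details of our scripts.

```python
def categorize(function_name, library_functions):
    """Assign a function to Pthread, Library calls, or UserSpace."""
    if function_name.startswith("pthread_"):  # assumed naming rule
        return "Pthread"
    if function_name in library_functions:
        return "Library calls"
    return "UserSpace"


def category_totals(bytes_by_function, library_functions):
    """Sum bytes per category from a {function: bytes} mapping."""
    totals = {"Pthread": 0, "Library calls": 0, "UserSpace": 0}
    for function_name, nbytes in bytes_by_function.items():
        totals[categorize(function_name, library_functions)] += nbytes
    return totals


def normalize_to_gem5(dbi_bytes, gem5_bytes, library_functions):
    """Per-category DBI bytes divided by gem5 bytes (empty gem5 categories are skipped)."""
    dbi = category_totals(dbi_bytes, library_functions)
    gem5 = category_totals(gem5_bytes, library_functions)
    return {c: dbi[c] / gem5[c] for c in dbi if gem5[c] > 0}


# Made-up byte counts, for illustration only.
library = {"memcpy"}
dbi_reads = {"pthread_barrier_wait": 5000, "memcpy": 900, "DownwardPass": 2200}
gem5_reads = {"pthread_barrier_wait": 1000, "memcpy": 1000, "DownwardPass": 2000}
ratios = normalize_to_gem5(dbi_reads, gem5_reads, library)
print(ratios)
```

A ratio of 1 means the two frameworks agree for that category; values far from 1 point to the category worth examining at function granularity.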
Figure 4.2 shows the total bytes for the DBI framework in each category, normalized to the
Figure 4.2: Comparison of total Read and Written bytes. (Bar charts: (a) Read bytes and (b) Written bytes for the DBI framework, normalized to gem5, for FMM8, WaterN8, LU8, Blackscholes8, FFT8, Swaptions8 and Canneal8, broken down into Pthread, Library Call, User Space and Total. Off-scale values annotated in the plots: 5.03, 4.71 and 13.92 in (a); 4.98, 4.6 and 13.94 in (b).)
corresponding bytes from the gem5 framework for that category. We see that the Pthread category
shows anywhere from 80% less to 104% more read and written bytes than gem5, confirming that
our source of mismatch and hence non-determinism between the frameworks is mostly related to
threading API calls. We further investigate the specific functions within the threading API to
confirm whether synchronization functions cause non-determinism.
We discovered that for some Pthread functions, the difference in read and written bytes was
several orders of magnitude. A function-level breakdown for the FMM benchmark is shown in
Figure 4.3, with the y-axis showing logarithmic mismatch in bytes of Sigil’s trace normalized to
Gem5. We discovered that most of the difference comes from a small number of functions. The
few functions that cause the largest differences are plotted separately. We label the
smaller contributors collectively under Others. Neither gem5 nor DBI consistently overestimated
read and written bytes when all functions in the trace are aggregated, as shown in Figure 4.2. With
the exception of the FMM and FFT benchmarks, all benchmarks demonstrate similar read/written
bytes. However, we found that the sign of the difference varied from function to function. This
indicates that mismatch of total bytes over the application could be a conservative estimate, as
negative and positive mismatch in functions could offset one another in the total. Take the FMM
benchmark for example. In Figure 4.3 the black bars, in the graphs for both read and written bytes,
represent functions whose bytes in gem5 are more than the bytes from the DBI framework and the
white bars represent the converse.
For the FMM benchmark (Figure 4.3), the biggest contributor to the differences is pthread_barrier_wait.
In general, the Pthread class of functions is at the top of the list of functions with the most mismatch.
When pthread_barrier_wait is called by a thread, that thread blocks in the call and waits until the
required number of threads have reached that barrier. We examined the assembly trace for the
function in both frameworks and found that the mismatch comes from a while loop that executes a
different number of times in the pthread functions that cause mismatch, namely pthread_barrier_wait
and pthread_join. We confirm and conclude that the thread interleaving encountered during the capture
of the trace resulted in a random, indeterminate wait period at these synchronization points,
Figure 4.3: Breakdown by function of the difference in Read/Written bytes for the FMM benchmark. (Bar charts: (a) Read Bytes and (b) Written Bytes; the y-axis is abs(gem5 bytes − DBI bytes) on a logarithmic scale, the x-axis lists the most mismatched functions, led by pthread_barrier_wait, with the remainder grouped as Others; bars are marked Negative or Positive.)
and hence an indeterminate amount of memory traffic was generated. In fact, recall that the DBI
trace has up to 1% variability in these functions across multiple runs with the same configuration.
From figure 4.3 we confirm that the large mismatch in Pthread API calls is indeed due to the
non-determinism when a thread waits at a threading synchronization call such as a barrier or join.
The amount of reads or writes while using the implementation of these synchronization calls is non-
deterministic. For example, if a lock is implemented as a spinlock, then the number of reads will
vary between runs as the spinlock checks the memory address continuously until it is freed. Since we
linked our benchmarks to the M5threads library, all the pthread API calls use spinlocking wherever
possible.
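A toy model makes this dependence on the interleaving concrete. The Python sketch below is not the M5threads implementation; it simply assumes that a waiting thread re-reads one barrier word every `poll_period` time units until the last thread arrives. The arrival times, `poll_period` and `bytes_per_read` are illustrative assumptions.

```python
def barrier_spin_reads(arrival_times, poll_period=1, bytes_per_read=4):
    """Bytes each thread reads while spinning at a barrier.

    arrival_times[i] is the time at which thread i reaches the barrier.
    All threads are released when the last one arrives; until then, a
    waiting thread polls the barrier word once per poll_period.
    """
    release = max(arrival_times)
    return [((release - t) // poll_period) * bytes_per_read
            for t in arrival_times]


# Same program, two different interleavings at capture time.
run_a = barrier_spin_reads([10, 12, 50, 50])
run_b = barrier_spin_reads([10, 12, 13, 14])
print(run_a, sum(run_a))
print(run_b, sum(run_b))
```

The two runs execute identical code, yet the recorded read traffic differs by an order of magnitude because the slowest thread arrives at a different time. A flat trace freezes one of these outcomes into the recorded memory operations.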
This data makes a strong case for traces that record threading-related calls at a higher level
of abstraction than raw memory operations. Capturing raw memory operations could result in
an inaccurate trace-based simulation, as it would force the system to reflect the behavior that was
encountered during capture. The simulation of these portions of the application would also fail to
respond correctly to changes in the design. For trace-based simulation, we therefore
advocate for the interception and embedding of threading calls in the trace.
Assuming mismatches due to Pthread functions could be addressed with future simulation mech-
anisms, it is important to analyze the remaining sources of mismatch that were dwarfed by mis-
matches in pthread related functions. Figure 4.4 shows the updated totals of the read and written
bytes excluding the Pthread category. The revised scale of the plot in Figure 4.4 shows that there
is still significant mismatch (up to +/-40%) for some benchmarks in the remaining categories. In
the next subsection, we explore which functions cause UserSpace differences as they are related to
synchronization.
4.1.3 Exploring impact of Non-determinism: User Space synchronization
Another important source of potential mismatch due to non-determinism is condition
synchronization in the UserSpace [93]. Condition synchronization, more informally known as the
“bounded buffer problem”, refers to the producer-consumer relationship between threads, in which one
thread waits to consume data produced by another thread, while the container for the data has a
Figure 4.4: Comparison of total Read and Written bytes without Pthreads functions. (Bar charts: (a) Read bytes and (b) Written bytes for the DBI framework, normalized to gem5, for FMM8, WaterN8, LU8, Blackscholes8, FFT8, Swaptions8 and Canneal8, broken down into Library Call, User Space and Total.)
Figure 4.5: Amount of read mismatch by function for the FMM benchmark without Pthread functions. (Bar chart: the y-axis is abs(gem5 bytes − DBI bytes) on a logarithmic scale, the x-axis lists the most mismatched functions with the remainder grouped as Others; bars are marked Negative or Positive.)
limited size. This producer-consumer relationship can cause an indeterminate amount of waiting in
either thread [93].
Some of the benchmarks we studied showed non-determinism in the UserSpace from condition
synchronization use without calls to a threading library. A flag variable declared on the stack, for
instance, can be used inside a loop to enforce ordering between certain sections of a thread. These
sources of mismatch are the most difficult sources to identify and separate with our comparison
framework.
We found a few mismatched UserSpace functions across all benchmarks, but the FMM benchmark
showed a significant amount of mismatch stemming from this area. From Figure 4.4, mismatch in
many UserSpace functions of the FMM benchmark contributes to the 44% extra read bytes in the DBI
trace. For example, the breakdown of read bytes for the FMM benchmark in Figure 4.5 shows the top
mismatched functions. We found that most of these functions contained condition synchronization.
InitExp is an exception: it has a loop whose trip count is tied to the supported precision
of floating point operations on the machine.
As condition synchronization can cause an indeterminate number of reads, we recommend that
trace-based simulation model these events by embedding information about these synchronization
conditions in the trace, without actually recording the memory traffic. A trace collection mechanism
should intercept and embed these synchronization events, if it is possible to detect them automatically.
We advocate for programmers to use standard calls to threading APIs as far as possible, to facilitate
automation. This will allow the framework that replays the trace to employ a mechanism to
handle the correct behavior during simulation.
4.1.4 SynchroTrace: Tackling non-determinism through Synchronization-
aware Trace and Replay
In summary, this section shows that non-determinism in the execution of multi-threaded applications
will affect trace-based simulation of these applications on CMPs. The non-determinism manifests as
uneven thread progress between synchronization points and indeterminate waits at synchronization
points. We confirm the sources of this problem and quantify its importance by capturing flat
traces of multi-threaded applications from different frameworks and noting their large differences
in memory traffic (from 40% to 100% deviation) arising specifically from synchronization-related
mechanisms. For the quantification study, we use a simplified version of Sigil’s event traces in
the form of a flat memory trace with function and Thread ID information. In order to model the
non-determinism correctly for simulation, we advocate for traces of multi-threaded applications to
capture high-level synchronization events and embed them in the trace.
We found that existing methodologies that capture traces for multi-threaded applications do not
succeed in capturing high-level synchronization constructs and are currently inadequate for CMP
design space exploration. PinPlay is one such methodology that captures multi-threaded traces
in the form of pinballs. PinPlay is used for deterministic and reproducible replay, and it supports
multi-threaded applications [79]. However, as the study in this section shows, the presence of
synchronization mechanisms significantly affects the timing associated with the execution of
multi-threaded applications in an unpredictable way during trace capture. PinPlay’s traces and replay do
not model this non-determinism accurately during simulation as they do not capture synchronization
events in their traces. As a result, design space exploration of a CMP with PinPlay may lead to
sub-optimal design choices. Additionally, there are currently no publicly available simulators that
support pinballs of multi-threaded applications. Another trace-based solution, proposed by Rico et
al., is a hybrid execution-driven and trace-driven methodology for simulation [86]. However, their
methodology requires source to source transformation to interface their synchronization calls with
their simulation framework. Furthermore, their simulation framework is not fully validated with a
known CMP or CMP system simulator.
The work presented in the next few sections overcomes the shortcomings of previous approaches
by using synchronization- and dependency-aware, architecture-agnostic multi-threaded traces. Based
on the insights from this section, we describe the modifications and resulting format of the event
traces obtained from Sigil, to make them thread-dependency- and thread-synchronization-aware.
As our final goal is to use the traces for simulation, we use the modified event traces in a simulation
methodology named SynchroTrace. SynchroTrace is a two-step methodology for trace-based simulation
of multi-threaded applications: i) the generation of synchronization- and dependency-aware
architecture-agnostic traces using Sigil’s event trace representation, discussed in section 4.2,
and ii) a lightweight replay mechanism that respects those dependencies, simulates synchronization
actions, and handles simple scheduling for threads for playback on any target hardware platform,
discussed in section 4.3. To the best of our knowledge, SynchroTrace is the most robust
multi-threaded trace-based simulation methodology proposed so far and serves to highlight the
power of platform-independent analysis and communication classification. Therefore, SynchroTrace
forms an important contribution in this dissertation.
4.2 Synchronization- and Dependency-aware Traces
To tackle the problem of non-determinism, we presented the idea that a trace-driven simulation
flow for multi-threaded applications should not record traffic due to a specific thread interleaving
encountered at capture. Our trace capture and replay based simulation flow, SynchroTrace,
implements this idea by allowing thread interleaving to be determined by the hardware architecture
and by run-time factors during replay, such as OS load on the cores and dynamic thread mapping.
Thus, SynchroTrace records architecture-independent information for each thread, including
synchronization constructs, which allows for correct modeling of wait times at synchronization points
and uneven progress during simulation.
The tracing methodology utilizes the Sigil tool’s event trace representation to trace through the
multi-threaded program. Recall that within the traces, events of different types are identified to
separate computation, and communication between entities in programs. To allow correct modeling
of non-determinism in shared memory multi-threaded programs, we introduce the synchronization
event as well. The replay mechanism presented in section 4.3 parses these traces and the correspond-
ing events and inserts them appropriately into the computation/memory stream during playback. In
later sections, we show how SynchroTrace can be used to achieve accurate design space exploration
and its flexibility in terms of speed and accuracy trade-offs.
Traditional traces that are used for simulation usually contain detailed and full instruction de-
scriptions, making them large and slow to use. For traces to remain i) ISA- and microarchitecture-
agnostic, ii) fast, and iii) easily compressible, we use abstract events instead of detailed instructions.
This allows us to use abstract workload characterization based on Sigil, including the communica-
tion classification technique, to capture the traces. Due to the microarchitecture-agnostic nature of
the traces, they can be used for heterogeneous CMP design space exploration as well. The event
trace representation presented in Chapter 2 generated a single trace file with computation and
communication events for all functions. Generating a separate trace for each function would
be inconvenient, as even simple programs contain many functions of varying sizes, and tracing
communication between functions across so many files would be inconvenient as well. However, multi-threaded
programs are usually balanced in the amount of work per thread, and each thread has just one identifier, making
it simpler and more convenient to hold a trace for each thread. Thus, the first modification we
make to the original representation is to generate a separate event trace for each thread of a
multi-threaded application, and we will show how this helps each trace become simpler. Second,
we change the format of each event to refer to threads instead of functions as producing and consuming
entities. Each of the 3 types of events present in the trace, including modifications from the original
format, is described below with the fields of each event outlined in Listings 4.1–4.3.
Computation Events represent local processing performed by a thread, completely indepen-
dent of other threads. We retain abstract computation events from the original format presented in
Listing 2.1. The format presented in Listing 2.1 in Chapter 2 contains function identifiers, but in
this representation we remove identifiers from computation events as they reside in the appropriate
thread’s trace file. Computation events retain counts of Integer Operations (IOPS) and Floating
Point Operations (FLOPS).
Recall that in the original representation format, we held the unique and total communication bytes for
computation and communication events. In this modified representation, we hold the corresponding
Load and Store addresses for all communicated bytes as well. In a computation event, we hold the
unique (virtual memory) addresses read and written in lieu of unique local bytes, and we also store
the number of byte-addressable local memory addresses written, which is equivalent to total local
communication bytes in the original format. For an input/output communication edge (represented
by a communication event), we hold the unique address in both the producer computation event
and the consumer communication event, to allow the producer thread to proceed independently
as dictated by multi-threaded program execution. We store the set of unique read and written
(virtual memory) addresses in an event, with special symbols such as $ and * delimiting the lists.
In summary, a computation event holds all unique memory addresses written (input/output and
local), and local memory addresses read, while a communication event holds all the unique memory
addresses read. Computation events can be used to simulate timing models for compute operations,
memory writes, and reads from non-shared memory. Essentially, computation events in a thread are
abstractions for work done by the thread that is independent of action by any other thread.
Event Number, Integer Op Count, Floating Point Op Count, Memory Read Count, Memory Write Count $ Unique Addresses Written * Unique Addresses Read
Listing 4.1: Computation Event
Thread Synchronization Events contain the type of pthread API call and the address of
the data structure used, so that a particular synchronization object can be recognized. We recognize
and intercept the following pthread API calls within the trace: pthread create/join,
pthread mutex lock/unlock, pthread barrier signal/wait, and pthread condition wait. Thread
synchronization events are interpreted during simulation, and the action appropriate for the
synchronization type is applied for participating threads. When the traces are replayed, the appropriate
waiting time for each thread at this synchronization point is determined on-the-fly by the Replay
framework described in section 4.3.
Event Number, pth_ty: Pthread Call Type ^ Address of Synchronization Structure
Listing 4.2: Synchronization Event
Communication Events represent communication edges between threads. Communication
events in trace-based simulation are used solely for the purpose of catching synchronization con-
structs that cannot be intercepted as the previously described synchronization events by our Sigil
tool. In essence, a communication event is necessary for modeling synchronization events that
manifest as communication occurring between threads. This sort of synchronization event may not be
fully transparent to the capture framework; examples are the user-level synchronization explained in the
previous section and memory traffic within the kernel. A communication event in the consuming thread
is associated with a particular computation event in the producing thread. Compare Listing 4.3 with
Listing 2.2 to note the significant modifications. We drop all consumer identifiers since the
communication event is present in the trace of the consuming thread. We identify the producer by thread
number and event number. Also, instead of unique and total communicated bytes, our modified format
holds the actual addresses used by the communication, restricted to those that are unique within the event.
The # symbol in Listing 4.3 separates 3 fields which hold the producing thread, the event number
within that thread and the specific addresses touched. Communication events can potentially hold
references to data from multiple producer threads by using the # delimiter to separate fields for
different producers. When replaying the trace, we enforce the dependency between the consuming
and producing thread; we make the consuming thread wait for the producing thread to proceed past
the corresponding computation event.
Event Number # Producer Thread, Producer Event, Address Range
Listing 4.3: Communication Event
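The dependency rule above, together with barrier-style synchronization events, can be illustrated with a small replay loop. This Python sketch is only a conceptual illustration and not the replay mechanism of section 4.3: the tuple encoding of events, the round-robin scheduling, the assumption that every thread participates in every barrier, and the stall counters are all simplifications introduced here.

```python
def replay(traces):
    """Round-robin replay of per-thread event lists.

    traces maps thread_id -> list of events, where an event is one of
      ("comp", n)                      computation event number n
      ("comm", n, producer, prod_n)    proceed once `producer` finished prod_n
      ("barrier", n, addr)             proceed once every thread reached addr
    Event numbers are assumed non-negative.
    Returns (order, stalls): replay order and per-thread stall counts.
    """
    pc = {tid: 0 for tid in traces}          # index of each thread's next event
    finished = {tid: -1 for tid in traces}   # last completed event number
    parked = {}                              # barrier address -> waiting threads
    order, stalls = [], {tid: 0 for tid in traces}

    while any(pc[t] < len(traces[t]) for t in traces):
        progressed = False
        for tid in sorted(traces):
            if pc[tid] >= len(traces[tid]):
                continue
            event = traces[tid][pc[tid]]
            kind, number = event[0], event[1]
            if kind == "comm" and finished[event[2]] < event[3]:
                stalls[tid] += 1             # producer has not got there yet
                continue
            if kind == "barrier":
                arrived = parked.setdefault(event[2], set())
                arrived.add(tid)
                if len(arrived) < len(traces):
                    stalls[tid] += 1         # wait time decided at replay
                    continue
                for other in sorted(arrived):  # last arrival releases everyone
                    finished[other] = traces[other][pc[other]][1]
                    order.append((other, finished[other]))
                    pc[other] += 1
                del parked[event[2]]
                progressed = True
                continue
            finished[tid] = number
            order.append((tid, number))
            pc[tid] += 1
            progressed = True
        if not progressed:
            raise RuntimeError("replay deadlocked: unsatisfied dependency")
    return order, stalls


# Thread 2 consumes data produced by thread 1's event 1; both then meet at a barrier.
example = {
    1: [("comp", 0), ("comp", 1), ("barrier", 2, 0xB0)],
    2: [("comm", 0, 1, 1), ("comp", 1), ("barrier", 2, 0xB0)],
}
order, stalls = replay(example)
print(order)
print(stalls)
```

In this sketch the stall counts are produced by the replay schedule rather than read from the trace, which mirrors the intent that waiting time is a property of the simulated system and not of the capture run.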
An excerpt of a single thread’s trace using fields from Listings 4.1–4.3 follows:
1774522,1,0,0,1 $ 132941440 132941447
1774523,1,0,0,1 $ 132941448 132941455
1774524 # 1 4534 7048536 7048543
1774525,1,0,1,0 * 132941388 132941391
1774526,1,0,0,0
1774527,pth_ty: 5 ^ 67113320
1774528,114,0,0,1 $ 132941456 132941463
1774529,3,0,1,0 * 132941560 132941567
1774530 # 1 5870 7048472 7048479
Listing 4.4: Single Thread’s Trace Example
The example in Listing 4.4, of events 1774522 to 1774530, shows the uncompressed version of
the trace where we allow at most one memory read or write per event; the event representation also
allows for multiple consecutive operations which fall under the computation or communication
categories to be merged together (detailed further in section 4.5). The first two lines show computation
events 1774522 and 1774523. It can be observed that each event records one memory write and one
integer operation, with the addresses for the memory writes shown after the $ symbol. Event
1774524 is a communication event with this thread reading from Thread 1’s event 4534 through
the addresses 7048536-43. Event 1774525 is a computation event that recorded one memory read
and one integer operation, with the addresses read shown after the * symbol. The next event, 1774526, does
not contain any memory operations, as a synchronization operation intervened before it could record
any, necessitating the synchronization event 1774527. The synchronization event
is of type 5, which represents a barrier, with the barrier address being 67113320. We next describe
how we capture synchronization events as that is the most novel addition to the format prior to
modification.
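To make the event layout concrete, the following is a minimal Python sketch of a parser for the uncompressed lines of Listing 4.4. It is an illustration only, not part of Sigil or SynchroTrace: the function name parse_line and the dictionary keys are hypothetical, the field order of a computation event (event number, integer ops, floating-point ops, memory reads, memory writes) is inferred from the walkthrough above, and the layout of multiple producers in one communication event (one "#" group per producer) is an assumption.

```python
def parse_line(line):
    """Classify one uncompressed trace line and split it into named fields."""
    line = line.strip()
    if "#" in line:
        # Communication event: event # producer_thread producer_event addr_lo addr_hi
        # (assumed: each additional producer adds another '#'-separated group).
        head, *producers = line.split("#")
        deps = []
        for group in producers:
            thread, event, lo, hi = (int(x) for x in group.split())
            deps.append({"thread": thread, "event": event, "range": (lo, hi)})
        return {"kind": "comm", "event": int(head), "deps": deps}
    if "pth_ty" in line:
        # Synchronization event: event , pth_ty: type ^ address
        head, rest = line.split(",", 1)
        sync_type, addr = rest.split(":", 1)[1].split("^")
        return {"kind": "sync", "event": int(head),
                "type": int(sync_type), "addr": int(addr)}
    # Computation event: five counters, then at most one address range,
    # introduced by '$' for a write or '*' for a read (uncompressed traces only).
    counts, parsed = line, {"kind": "comp"}
    for marker, key in (("$", "write_range"), ("*", "read_range")):
        if marker in counts:
            counts, addrs = counts.split(marker, 1)
            lo, hi = (int(x) for x in addrs.split())
            parsed[key] = (lo, hi)
    event, iops, flops, reads, writes = (int(x) for x in counts.split(","))
    parsed.update(event=event, iops=iops, flops=flops, reads=reads, writes=writes)
    return parsed
```

Applied to the third line of Listing 4.4, this sketch yields a communication event whose single dependency names Thread 1, event 4534, which is the information the replay mechanism needs to enforce the consumer's wait.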
4.2.1 Capturing synchronization events
Sigil was designed to capture communication between functions and threads, and we added the
ability to register threading API calls. This capture tool monitors the execution of a program and
builds sequences of computation, synchronization, and communication events for each application
thread. We periodically dump the trace to a file so as to efficiently manage the amount of state held
in memory during the trace gathering process. This keeps the Sigil capture framework lightweight
and fast.
[Diagram: the capture tool process tracks synchronization primitives (Barrier 1, Mutex Lock and
Mutex Unlock around Critical Section A) between Read/Write/Compute (R/W/C) phases of Thread 1 and
Thread 2 of the captured application execution, and logs a sync event for each thread]
Figure 4.6: Intercepting Pthread API Calls
Figure 4.6 illustrates an example of how we intercept pthread API calls to generate synchroniza-
tion events. The trace capture mechanism uses Valgrind’s function wrapping feature to intercept
pthread API calls [2]. Depending on the type of synchronization encountered, a synchronization
event is logged in the trace for one or more threads.
SynchroTrace cannot capture threading activity when standard threading API calls are not
used in the traced program, as it is not possible to infer synchronization at the assembly level in
Valgrind. This can occur in cases where condition variables are explicitly written in user code, or
critical sections using low-level locks are encountered in the kernel [74]. As explained earlier,
this is why we capture communication events: they cover these cases, and we interpret them as
dependencies between the threads.
4.2.2 Capturing Operating System traffic
SynchroTrace’s capture framework intercepts information related to Operating System (OS) actions,
albeit currently in a limited fashion. Since our capture framework, Sigil, is built on Valgrind,
SynchroTrace shares Valgrind’s inability to capture any computation, communication, or synchronization
events within the kernel. However, Valgrind can intercept system calls and report an aggregate of the
memory addresses read and written within a system call. Thus, SynchroTrace embeds the aggregate
information into computation and communication events in the trace for each thread, though the
sequence of memory traffic within the kernel will not be preserved. We currently measure memory
traffic related to producer-consumer communication with the kernel but not synchronization within
the kernel. We thus conservatively treat reads that consume from memory writes within the kernel
as dependencies that a thread will be required to wait on through communication events.
Our platform-independent methodology based on Sigil captures traces more quickly than a full-system
simulation-based trace capture, as our traces are derived from native runs of the program. The event
representation allows for more size-efficient traces by only holding detailed information for the most
important events. As the traces have synchronization and dependencies embedded in them, they
can be used for architecture simulations and can also be post-processed to infer useful information
about the workload and to apply techniques to speed up simulation. We will demonstrate the latter
in section 4.5.
[Block diagram: Event Traces (event types Computation, Communication, Synchronization; threads T0,
T1, …, TN); Replay (Trace Translator, Event Queue Manager, Memory Request Manager, Thread
Scheduler); Cache Simulator & NoC Simulator (Mem. Req., Mem. Resp.)]
Figure 4.7: Multi-Threaded Event-Trace Replay Framework
4.3 Event-Trace Replay Framework
For architecture simulations, a replay mechanism is required to process the trace and generate
architectural events. The replay mechanism dynamically generates the appropriate actions for all
events during simulation while providing light-weight thread scheduling and management. As shown
in Figure 4.7, the captured event-trace sends computation, communication, and synchronization
events for each thread into the Replay framework. Within Replay, the individual events are processed
via the Trace Translator into individual Replay event data structures and passed into the Event
Queue Manager (EQM). The EQM also interfaces with the Memory Request Manager (MRM)
to send memory requests. The MRM interfaces with an external cache simulator and returns
responses to the EQM. The Thread Scheduler handles thread creation, deletion, scheduling,
and synchronization.
Despite the theoretical capability, some simplifying decisions have been made in the SynchroTrace
framework implementation as follows: the current playback mechanism with the multi-threaded
traces uses simple timing models for in-order cores. SynchroTrace currently accounts for the pro-
gression of the modeled core’s cycle time using a 1-CPI timing model with detailed timing models for
the uncore. The core Replay infrastructure can be connected to more detailed timing models such as
out-of-order cores. Our current Replay framework processes memory requests for the Ruby/Garnet
simulators. However, the multi-threaded traces and replay mechanism are portable to any system
simulator.
4.3.1 Event Queue Manager and Memory Request Manager
As detailed in Algorithm 1, the EQM handles the progression of the events for each of the threads
within the EventQueue. During each cycle, the EQM checks if there are events ready to be processed
from threads for the current cycle. If there are no available events for the current cycle across all of
the threads, the CurrentTime progresses to the next available event’s scheduled wakeup time. Events
scheduled to wake up at the current time are handled by ProcessEvent, which is described in detail
in Algorithm 2. For computation events, the EQM schedules
the thread to wake up after the cycle time required to complete the computation event based on
the number of IOPS and FLOPS. When this thread wakes up at its scheduled clock time, the EQM
will send a read or write memory request to the MRM and block the thread until the MRM triggers
a memory response to the EQM. As shown in Figure 4.7, the MRM interfaces with the Cache and
NoC simulators to obtain the correct timing for the memory request. As described by Lines 11–13
in Algorithm 1, after receiving a memory response from the MRM, the EQM will then queue the
next event for the thread.
For synchronization events, the EQM sends create and join events to the Thread Scheduler.
The EQM handles mutex lock and barrier events similarly to the thread dependencies of the
communication events; if a thread is unable to acquire a mutex lock or is waiting at a barrier,
the thread will be rescheduled by the EQM to attempt again during the next cycle. If
the synchronization event is successful, the thread will proceed to the next event. Synchronization
events in the Replay framework generate memory requests, but these are omitted in Algorithm 1 to
simplify the pseudocode.
For communication events, the EQM maintains the dependencies between consumer threads and
the corresponding computation events of producer threads. While processing the communication
event of a consumer thread, the EQM will check on the progress of the corresponding computation
event of the producer thread. If the corresponding computation event has not been completed, the
EQM will block the consumer thread from progressing. Once the corresponding computation event
has been completed, the EQM will immediately issue the memory read of the communication event
and block the consumer thread until the MRM triggers a memory response to the EQM.
Algorithm 1 Event Queue Manager
1: for all ThreadIDs in EventQueue[ThreadID] do
2:   for all Events in EventQueue[ThreadID] do
3:     if Event.TimeReady = CurrentTime then
4:       ProcessEvent    ▷ Algorithm 2
5:     end if
6:   end for
7: end for
8: if all Events in EventQueue ≥ CurrentTime then
9:   Progress CurrentTime to NextEventTime
10: end if
11: if MemoryResponseTriggeredForThread then
12:   QueueNextEvent
13: end if
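The interplay between time progression, computation events, and communication dependencies can be illustrated with a small Python sketch. It is not the SynchroTrace implementation: the event dictionaries, the function replay, and the constant MEM_LATENCY are hypothetical, and a fixed memory latency stands in for the timing that the Cache and NoC simulators would provide. The sketch also assumes a well-formed trace in which every dependency is eventually satisfied.

```python
MEM_LATENCY = 10  # assumed fixed latency; the real flow queries Ruby/Garnet


def replay(traces):
    """traces: {thread_id: [event, ...]}; returns the finish cycle of each thread.

    A computation event is {"kind": "comp", "cycles": n}; a communication event
    is {"kind": "comm", "dep": (producer_thread, producer_event_index)}.
    """
    current_time = 0
    next_idx = {t: 0 for t in traces}   # index of the next event per thread
    ready_at = {t: 0 for t in traces}   # TimeReady of that event
    done_at = {t: {} for t in traces}   # event index -> completion cycle

    while any(next_idx[t] < len(traces[t]) for t in traces):
        for t, events in traces.items():
            i = next_idx[t]
            if i >= len(events) or ready_at[t] > current_time:
                continue
            event = events[i]
            if event["kind"] == "comm":
                producer, producer_idx = event["dep"]
                if done_at[producer].get(producer_idx, float("inf")) > current_time:
                    ready_at[t] = current_time + 1  # attempt again next cycle
                    continue
                finish = current_time + MEM_LATENCY
            else:
                # Compute for `cycles` (1-CPI), then wait for the memory response.
                finish = current_time + event["cycles"] + MEM_LATENCY
            done_at[t][i] = finish
            next_idx[t] = i + 1
            ready_at[t] = finish            # queue the next event
        pending = [ready_at[t] for t in traces if next_idx[t] < len(traces[t])]
        if pending:
            current_time = min(pending)     # jump to the next event's wakeup time
    return ready_at


traces = {
    0: [{"kind": "comp", "cycles": 3}, {"kind": "comp", "cycles": 2}],
    1: [{"kind": "comm", "dep": (0, 1)}, {"kind": "comp", "cycles": 1}],
}
print(replay(traces))  # {0: 25, 1: 46}
```

In this example, thread 1 cannot issue its read until thread 0's second computation event completes at cycle 25, which is the dependency-based wait that the communication events encode.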
4.3.2 Thread Scheduler
SynchroTrace can be integrated with any simulator that contains CMP architecture models. In
a simulation flow that employs the SynchroTrace methodology, the Replay framework accepts the
simulation parameters/configuration and configures the simulation back-end accordingly. This con-
figuration process is independent of trace generation, so the number of threads being simulated
does not necessarily correspond to the number of cores. This necessitates a thread scheduler in the
absence of the OS in trace-driven simulation. The Thread Scheduler handles the creation, deletion,
scheduling, and synchronization of threads across any number of cores, including multiple threads
per core. Currently, SynchroTrace’s Thread Scheduler opportunistically swaps out stalled threads
for threads ready to progress. Threads can be stalled on synchronization events, dependencies, or
memory requests. Our thread scheduler performs a simple round-robin approach when multiple
threads are ready to progress. While we do not currently model a cost for the scheduling actions,
we intend to add that cost to the simulation as well.
Algorithm 2 ProcessEvent
1: if COMPEVENT then MemReq@(Comp.Time + CurrentTime) and Wait for Resp.
2: else if COMMEVENT then
3:   if Dep.EventCompleted then
4:     MemReq@(CurrentTime) and Wait for Resp.
5:   else Schedule Thread to Attempt Again Next Cycle
6:   end if
7: else if SYNCHEVENT then
8:   if Event = Create or Join then
9:     Send Event to Thread Scheduler
10:   else if MutexLockRequest then
11:     if MutexLockObtained then QueueNextEvent
12:     else Schedule Thread to Attempt Again Next Cycle
13:     end if
14:   else if MutexUnlockEvent then QueueNextEvent
15:   else if BarrierEvent then
16:     if LastThreadforBarrier then
17:       QueueNextEvent
18:     else Schedule Thread to Attempt Again Next Cycle
19:     end if
20:   end if
21: end if
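The round-robin policy described for the Thread Scheduler in Section 4.3.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class CoreScheduler and the is_ready callback are hypothetical names, one scheduler instance is assumed per simulated core, and, as in the current framework, no cost is charged for a scheduling action.

```python
from collections import deque


class CoreScheduler:
    """Round-robin selection among the threads mapped to one simulated core."""

    def __init__(self, thread_ids):
        self.queue = deque(thread_ids)

    def pick(self, is_ready):
        """Return the next ready thread in round-robin order, or None if all are stalled.

        is_ready(thread_id) reports whether a thread is not stalled on a
        synchronization event, a dependency, or an outstanding memory request.
        """
        for _ in range(len(self.queue)):
            candidate = self.queue[0]
            self.queue.rotate(-1)      # the candidate moves to the back of the queue
            if is_ready(candidate):
                return candidate       # stalled threads are skipped (swapped out)
        return None


scheduler = CoreScheduler([0, 1, 2])
stalled = {1}                          # thread 1 is waiting, e.g. at a barrier
order = [scheduler.pick(lambda t: t not in stalled) for _ in range(4)]
print(order)  # [0, 2, 0, 2]
```

Because the queue is rotated on every probe, a thread that becomes ready again re-enters the rotation at its original position without any extra bookkeeping.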
4.4 Design Space Exploration with Trace-based Simulation
SynchroTrace provides the means for accurate and efficient design space explorations ranging from
low-power to highly-scaled CMPs. In this section, we demonstrate how the light-weight SynchroTrace
simulation flow can be used to select optimal CMP uncore design choices for a fixed in-order core,
given uncore area and power constraints targeting CMPs. The uncore we are evaluating includes
the L1 cache, L2 cache, NoC routers, and NoC links. We also show that our light-weight simulation
flow yields the same result as the cycle-accurate Gem5 Full-System simulator.
4.4.1 Experimental Methodology
Our experimental methodology consists of two sets of experiments. The focus of the first experiment
is to use SynchroTrace to analyze the design space across cache sizes and network parameters for
a given set of uncore area and power constraints with a fixed in-order core model. Specifically, we
vary the L1 and L2 cache sizes, NoC virtual channels, NoC buffer depth, and NoC link bandwidth.
The goal of this first experiment is to accurately select the best performing design configuration,
using the metric Cycles Per Instruction (CPI), under uncore power and area constraints. Although
we capture cycles, we chose CPI as our performance metric in lieu of execution cycles so that all
benchmarks simulated are weighted equally when assessed for design space exploration. To calculate
CPI for both frameworks, we used the number of instructions obtained from SynchroTrace’s capture
tool. This kept the relative trends of CPI and cycles consistent for each individual benchmark.
The focus of the second experiment is to perform an equivalent design space exploration using the
cycle-accurate Gem5 Full-System simulator [9]. The goal of the second experiment is to compare
the cycle-accurate full-system simulator results against SynchroTrace’s light-weight simulation flow
for accuracy and speedup.
The base of the design configurations consists of a single 8-core chip, 2-level cache, and directory-
based MESI protocol. The cache and network design parameters are detailed in Tables 4.2 and 4.3,
respectively. The CMP contains private L1 caches with an associativity of 4, a shared distributed
L2 cache with an associativity of 8, and 64-byte blocks. The cores and NoC both operate at 1 GHz.
The caches and NoC are designed for the 65nm technology with area and power given by Cacti
6.5 [68] for the caches and Orion 2.0 [54] for the NoC. The traces were captured on the Linux Kernel
2.6 in Red Hat Enterprise Linux 5 (RHEL5) with the standard POSIX Thread API. We benchmark
the design configurations using applications from the PARSEC-2.1 [8] and Splash-2 [110] benchmark
suites.
We use the SynchroTrace simulation flow illustrated in Figure 4.7. The traces are generated only
once per benchmark and used for simulation of all 16 design points. For this section, we used
uncompressed traces as described in Section 4.2. To additionally show that the SynchroTrace simulation
flow yields the same results as the Gem5 Full-System simulator, the SynchroTrace simulation flow
uses the same cache and NoC simulators (Ruby and Garnet) that are used by the Gem5 framework.
For the remainder of this paper, we use the term “SynchroTrace simulation flow” to represent the
integration of our traces and replay mechanism specifically with the Ruby cache simulator and the
Garnet NoC simulator. In the next section, we show the power of our tracing format in allowing
large compression and simulation speedup with minimal change in the accuracy measurements
presented here. For our comparisons, we use Gem5’s TimingSimpleCPU core model, which is a
1-CPI in-order model.
[Plot: Total Area (mm²) and Total Power (W) for each design point, with constraint lines at 33% of
Max and 75% of Max]
Figure 4.8: Design Choices Under Area and Power Constraints
Table 4.2: Cache Design Parameters
Cache Configs.      Cache Sizes
SS (Small)          L1I/D = 4kB; L2 Slice = 256kB
MM (Medium)         L1I/D = 16kB; L2 Slice = 1024kB
LL (Large)          L1I/D = 32kB; L2 Slice = 2048kB
vLvL (veryLarge)    L1I/D = 64kB; L2 Slice = 4096kB
Table 4.3: NoC Design Parameters
Network Configs.    Network Parameters
VC_2                Virtual Channels = 2, Buffer Depth = 4
VC_4                Virtual Channels = 4, Buffer Depth = 4
BW_4                Link Bandwidth = 4 Bytes
BW_16               Link Bandwidth = 16 Bytes
4.4.2 Area and Power Constraints
The constraints for the pruning of the uncore design space are based on 1) 75% and 2) 33% of the
area and total power of the most resource-intensive design point (vLvL_VC_4_BW_16). Figure 4.8
illustrates the total uncore area and power for each design point and the corresponding constraints.
Design points satisfying each of the design constraints (under the respective dashed lines) are
considered for further evaluation in this design space exploration.
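The pruning step can be sketched as follows in Python. The sketch enumerates the 16 design points from the parameters in Tables 4.2 and 4.3 and keeps those whose area and power both fall within a fraction of the largest design point. The function names are hypothetical, and the area and power values in the usage example are placeholders chosen for illustration; they are not the Cacti 6.5 or Orion 2.0 results.

```python
from itertools import product

CACHE_CONFIGS = ["SS", "MM", "LL", "vLvL"]
VIRTUAL_CHANNELS = [2, 4]
LINK_BANDWIDTHS = [4, 16]
LARGEST = "vLvL_VC_4_BW_16"  # most resource-intensive design point


def design_points():
    """Enumerate every cache/NoC combination, e.g. 'LL_VC_4_BW_16'."""
    return [f"{c}_VC_{v}_BW_{b}"
            for c, v, b in product(CACHE_CONFIGS, VIRTUAL_CHANNELS, LINK_BANDWIDTHS)]


def prune(costs, fraction):
    """Keep design points within `fraction` of the largest point's area and power.

    costs maps a design-point name to an (area, power) tuple.
    """
    max_area, max_power = costs[LARGEST]
    return [name for name, (area, power) in costs.items()
            if area <= fraction * max_area and power <= fraction * max_power]


# Placeholder (area, power) values, for illustration only.
toy_costs = {
    "vLvL_VC_4_BW_16": (100.0, 10.0),
    "LL_VC_4_BW_16": (70.0, 7.0),
    "SS_VC_2_BW_4": (20.0, 3.0),
}
print(len(design_points()))     # 16
print(prune(toy_costs, 0.75))   # ['LL_VC_4_BW_16', 'SS_VC_2_BW_4']
print(prune(toy_costs, 0.33))   # ['SS_VC_2_BW_4']
```

Since area and power depend only on the design parameters, this filter can run once before any simulation, and only the surviving points need to be replayed.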
[Bar charts: CPI (Normalized to LL_VC_4_BW_16) for SynchroTrace and Gem5; design points
LL_VC_4_BW_16, LL_VC_2_BW_16, MM_VC_4_BW_16, LL_VC_4_BW_4, MM_VC_2_BW_16]
(a) CPI for first half of the benchmarks: BlackScholes, Canneal, Water-Nsquared, Water-Spatial, Ocean
(b) CPI for second half of the benchmarks and the overall average: LU, Barnes, Cholesky, FMM, FFT,
Average
Figure 4.9: CPI of Top 5 Design Points of SynchroTrace and Gem5 Under Uncore Constraints:
650 mm², 45 W
It should be noted that the total area values calculated using Cacti 6.5 and Orion 2.0 are
equivalent in both the SynchroTrace simulation flow and Gem5, as this computation is performed
externally to the simulation solely using the design parameters.
[Scatter plots: Power (W) vs. Performance (CPI); in each panel the optimal-CPI design point is
LL_VC_4_BW_16 under the 75% constraint and MM_VC_4_BW_4 under the 33% constraint]
(a) SynchroTrace Power vs. Performance
(b) Gem5 Power vs. Performance
(c) SynchroTrace with Trace Filtering Power vs. Performance
Figure 4.10: Total Uncore Power (NoC and Caches) vs. Performance (CPI)
In these experiments, detailed in section 4.4.3, the SynchroTrace simulation flow selected the
same design points under the constraints as Gem5. The consistency of the total power of the design
points with both simulators is expected as the total power is largely dominated by the leakage power,
which is application independent. The average difference in total power between the two simulators
is roughly 1%.
4.4.3 Performance Results and Design Choices Under Constraints
Given the constraints in section 4.4.2, our goal is to find the uncore hardware configuration that
yields the highest performance, i.e., the lowest CPI. Additionally, we investigate
the accuracy of the design point selection by comparing the result against the selection of the
cycle-accurate Gem5 Full-System simulator.
Constraint 1: 75% of Max Area and Power
The design points allowed under Constraint 1 are compared for relative performance. Figure 4.9
summarizes the top 5 best performing design points of SynchroTrace and Gem5 across all tested
benchmarks, normalized to the CPI of LL_VC_4_BW_16. Note that we have split the graph into two
parts to make the presentation clearer in this dissertation format. The first graph contains results
for the first 5 benchmarks while the second graph contains the remaining 5 benchmarks and the
average as well. Observing the average CPI of the design points in SynchroTrace and Gem5, it is
evident that the LL_VC_4_BW_16 design point is the highest performing design point. The average
normalized CPI per design point of SynchroTrace is slightly skewed by up to 1.6% in comparison to
Gem5. However, SynchroTrace preserves the same number of design points under the constraints
with the equivalent ranking of design points by average CPI.
Furthermore, as detailed in Figure 4.9, SynchroTrace captures the CPI trends in all benchmarks
except for stray cases where additional NoC provisioning causes slight increases in execution time (up
to 4.6% difference in normalized CPI between SynchroTrace and Gem5) for Gem5. In particular,
doubling the virtual channels (LL_VC_2_BW_16 to LL_VC_4_BW_16) reduces the performance of
BlackScholes and Cholesky simulations with Gem5. In the case of Ocean, SynchroTrace and Gem5
both match in terms of the overall trend in CPI, but the ranges are greatly skewed between the two
simulators: the overall normalized CPI range of SynchroTrace is roughly 10.6%, while the normalized
CPI range of Gem5 stretches to 31.6%. This deviation is caused by the large amount of user-level
synchronization within the execution of Ocean; SynchroTrace introduces dependency-based waits
for communication events representing this inter-thread communication, while Gem5 executes the
user-space synchronization construct specified in the program as expected. We leave to future work
the investigation of how to increase the cycle-level fidelity of benchmarks that implement a large
amount of user-level synchronization, though it is noteworthy that SynchroTrace is already able to
maintain the normalized trends in power and performance of these benchmarks.
Constraint 2: 33% of Max Area and Power
With strict area and power constraints of 33%, the design space converges to only 4 design points.
The MM_VC_4_BW_4 design point is the highest performing design point under the strict constraints
for both SynchroTrace and Gem5. However, when comparing the smaller design points, the difference in
average normalized CPI per design point between the two frameworks is up to 9.7%. The overall CPI
trends are maintained between the two simulators, but as we show in the comparison below,
SynchroTrace is slightly skewed towards underestimating cycles for less resource-intensive design points.
Design Exploration with SynchroTrace Comparison to Gem5
As shown in the design exploration above, the SynchroTrace simulation flow obtains the equivalent
optimal design point under sets of constraints. Additionally, from Figures 4.10a and 4.10b, we
deduce that 1) the power estimates (as well as the area, not shown) of SynchroTrace and
Gem5 are the same, and 2) the SynchroTrace simulator skews towards underestimating the execution
time in comparison to Gem5, and the skew increases for less resource-intensive designs. This skew
in absolute CPI ranges from 6.9% in vLvL_VC_4_BW_16 to 17.8% in SS_VC_2_BW_4. However, and
more importantly, the ratio of CPI between any two design points in SynchroTrace (effectively the
ratio of cycles) is within 97% of the ratio of CPI for the same two design points in Gem5. Thus,
the overall trend for SynchroTrace is maintained within 97% of Gem5.
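One way to compute such a pairwise trend comparison is sketched below in Python. The exact definition used for the 97% figure is not spelled out here, so the sketch assumes one plausible reading: for every pair of design points, the smaller of the two simulators' CPI ratios is divided by the larger, and the worst case over all pairs is reported. The function name and the CPI values in the example are hypothetical placeholders, not measured results.

```python
from itertools import combinations


def ratio_agreement(cpi_a, cpi_b):
    """Worst-case agreement of CPI ratios between two simulators.

    cpi_a and cpi_b map the same design-point names to CPI values.
    A result of 1.0 means every pairwise CPI ratio is identical.
    """
    worst = 1.0
    for x, y in combinations(cpi_a, 2):
        ratio_a = cpi_a[x] / cpi_a[y]
        ratio_b = cpi_b[x] / cpi_b[y]
        worst = min(worst, min(ratio_a, ratio_b) / max(ratio_a, ratio_b))
    return worst


# Placeholder CPI values: simulator B reads uniformly 10% higher than A.
cpi_a = {"LL_VC_4_BW_16": 4.0, "MM_VC_4_BW_4": 5.0}
cpi_b = {name: 1.1 * cpi for name, cpi in cpi_a.items()}
print(round(ratio_agreement(cpi_a, cpi_b), 6))  # 1.0
```

The example shows why a skew in absolute CPI does not by itself harm design selection: a uniform skew cancels in the ratio, so only design-dependent skew lowers the agreement.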
We have shown that the accuracy of SynchroTrace in uncore design space exploration and design
selection experiments is 100% in comparison to the selections of full-system simulation. Furthermore,
as we show in section 4.5, each design point is simulated up to 13.4× faster with SynchroTrace
than with Gem5.
4.5 Achieving Fast Design Exploration with Multi-Threaded Traces
Although our SynchroTrace simulation flow is up to 13.4x faster than Gem5 (4.6x on average), our
multi-threaded traces can be used to further speed up simulation by trading off accuracy for speed.
To this end, we propose techniques including event compression (“lumped events”), “lumped events”
with hit prediction, and trace filtering. Figure 4.11 illustrates the speedup of the SynchroTrace
simulation flow for all the trace techniques over Gem5, when simulating a modern CMP configuration
most closely represented by the largest design point in Tables 4.2 and 4.3 (i.e. from [1]) for
applications with 8 threads. We show up to 18.4x gains compared to Gem5 in simulation performance.
We also evaluate the accuracy in terms of design space exploration for the technique that showed
the most promise: trace filtering.
4.5.1 Speedup using Multi-Threaded Trace Techniques
Exploring design spaces using architecture simulation can take a significant amount of time, from
days to months. Our event-traces offer a significant advantage by reducing simulation time. The first
bar in Figure 4.11 shows the speedup in simulation from using our baseline uncompressed trace ex-
ecution through the SynchroTrace simulation flow versus the Gem5 Full-System TimingSimpleCPU
based model. We executed multiple benchmarks with “simsmall” data sizes from the PARSEC 2.1
and Splash-2 benchmark suites for both the multi-threaded trace-based simulation flow and the
Gem5 Full-System simulation flow as a comparison for simulation speed. Across the benchmark
simulation executions, the results show that the multi-threaded trace-based simulation flow has up
to a 13.4x speedup with an average of 4.6x speedup over Gem5.
4.5.2 Trace Compression
Our traces are generated by abstracting and aggregating the different classes of behavior in a program
as explained in section 4.2; we produce computation, synchronization, and communication events
for multi-threaded programs that use the pthread API. This provides an opportunity to perform
compression within the trace by lumping together multiple consecutive operations which fall under
the computation or communication categories. When consecutive computation events are merged
together, the fields that represent counts, i.e. Integer Op Count, Floating Point Op Count, Memory
Read Count and Memory Write Count are all added together. Recall the fields in each event type
as shown in Listing 2.1 and 2.2. The fields that represent address ranges are merged together to
keep only the unique address ranges. Consecutive communication events can be merged by simply
merging the address ranges as described above. Synchronization events cannot be merged.
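The merging rule for consecutive computation events can be sketched as follows in Python. The event dictionaries and the function names merge_ranges and lump are hypothetical. Treating "unique address ranges" as the union of inclusive ranges, with overlapping or adjacent ranges coalesced, is an assumption of this sketch, as is keeping the first event's number for the lumped event.

```python
def merge_ranges(ranges):
    """Union of inclusive (lo, hi) address ranges, coalescing overlap and adjacency."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def lump(first, second):
    """Merge two consecutive computation events into one lumped event."""
    lumped = {"event": first["event"]}  # assumed: the lumped event keeps the first number
    for count in ("iops", "flops", "reads", "writes"):
        lumped[count] = first[count] + second[count]      # counts are added together
    for ranges in ("read_ranges", "write_ranges"):
        lumped[ranges] = merge_ranges(first[ranges] + second[ranges])
    return lumped


# The first two computation events of Listing 4.4.
e1 = {"event": 1774522, "iops": 1, "flops": 0, "reads": 0, "writes": 1,
      "read_ranges": [], "write_ranges": [(132941440, 132941447)]}
e2 = {"event": 1774523, "iops": 1, "flops": 0, "reads": 0, "writes": 1,
      "read_ranges": [], "write_ranges": [(132941448, 132941455)]}
print(lump(e1, e2)["write_ranges"])  # [(132941440, 132941455)]
```

Under these assumptions the two adjacent 8-byte writes collapse into a single 16-byte range while the write count stays at 2, which illustrates both the size reduction and the loss of ordering information among the merged operations.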
When parsing a lumped event, the Replay mechanism also optimizes playback by attributing
cycles for hits in a lumped-event. Lumping events together will lose some ordering information
amongst operations for the benefit of compression. We can set a limit on the number of events that
can be lumped together in the trace, so as to maintain accuracy. The trace example in Listing 4.4
shown in section 4.2 is an example of an uncompressed trace where we allowed only up to one
memory operation per line; i.e. no lumping. This was the setting we used for the traces in the
previous section as well. For the PARSEC 2.1 and Splash-2 benchmarks tested, we found the optimal
trace compression limit was 100 events per line, which produces around 10% difference in execution
cycles, but shows large improvement in compression and simulation time. This compression reduces
zipped file sizes by up to 74% for some benchmarks and 63% on average, while the simulation flow
has up to an 18.4x speedup with an average of 5.64x speedup over Gem5 Full-System as shown in
Figure 4.11.
4.5.3 Trace Filtering
We also studied the reduction in simulation time using a trace filtering approach inspired by prior
work in the context of traces for single-threaded applications [84, 112]. Puzak’s work used a direct
mapped cache to filter out hits from a trace. The resulting trace only contains misses. In a
multi-processor system, this will not work without modification as memory reads and writes could also
potentially cause coherence actions, compromising accuracy. While Wu et al. attempt to apply the
technique to multi-processor scenarios, they use a multi-pass approach which was not evaluated for
accuracy or the effect on coherence. Here we demonstrate the promise of this technique by filtering
hits only to non-shared data (local accesses) from computation events, as filtering hits to shared
data can become complex due to non-determinism.
[Bar chart: Speedup of Simulation Time (Normalized to Gem5); series: Normal Trace Execution,
Compression + Simulated Cache Hit Fast-Forwarding, TraceFiltering]
Figure 4.11: SynchroTrace Speedup in Simulation using our Multi-Threaded Trace Techniques
over Gem5
The filtering technique we implement post-processes the trace and uses a filter cache structure
to remove address ranges from computation events if they hit in the filter cache. The technique
also adds a field to the trace to record the hit count, which can be used to estimate cycles by the
Replay mechanism. The configuration parameters of this filter cache determine the speedup and
accuracy associated with simulating filtered traces for design space exploration. We use an 8kB,
fully associative structure with a line size of 8 bytes. Prior work has shown that stack distance in a
fully associative structure is sufficiently representative of set-associative caches employed in modern
architectures [5, 12]. Hits in the 8kB structure are very likely to hit in caches larger than 8kB during
simulation, making it an effective predictor of hits. We use an 8-byte line size to conservatively allow
for line size changes in the simulated configuration and to account for accesses that straddle cache
line boundaries.
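A filter cache of this shape can be sketched in Python as shown below. The sketch uses the stated 8kB capacity and 8-byte line size, which gives 1024 lines. The LRU replacement policy, the class and function names, and the event layout are assumptions made for illustration, and the sketch presumes that the ranges passed in are already known to be non-shared (local) accesses.

```python
from collections import OrderedDict


class FilterCache:
    """Fully associative filter cache; LRU replacement is assumed."""

    def __init__(self, size_bytes=8 * 1024, line_bytes=8):
        self.line_bytes = line_bytes
        self.capacity = size_bytes // line_bytes   # 1024 lines by default
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit; on a miss, insert the line and evict the LRU line."""
        line = addr // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)
        return False


def filter_event(event, cache):
    """Remove address ranges that hit in the filter cache and record the hit count."""
    kept, hits = [], 0
    for lo, hi in event["ranges"]:
        first_line_addr = lo - lo % cache.line_bytes
        # Probe every line the inclusive range touches, since it may straddle lines.
        results = [cache.access(addr)
                   for addr in range(first_line_addr, hi + 1, cache.line_bytes)]
        if all(results):
            hits += 1
        else:
            kept.append((lo, hi))
    return {**event, "ranges": kept, "hit_count": hits}


cache = FilterCache()
event = {"event": 1774522, "ranges": [(132941440, 132941447)]}
print(filter_event(event, cache)["hit_count"])  # 0 (cold miss, range kept)
print(filter_event(event, cache)["hit_count"])  # 1 (hit, range removed)
```

The recorded hit count is what lets the Replay mechanism estimate cycles for the removed accesses without issuing them to the cache simulator.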
The speedup obtained over Gem5, shown in Figure 4.11, ranges from 2x to 18.4x with an average
of 7.5x. Ocean has limited speedup due to the user-level synchronization that is enforced with
dependency waits in SynchroTrace. Both Canneal and LU traces are relatively large and would
benefit from more aggressive compression and filtering techniques.
We also ran the same design space exploration experiment of the previous subsection and arrived
at the same subset of designs ranked in the same order. The accuracy is shown in Figure 4.10c, where
we plot the CPI vs. Power of all 16 of the design points as in the previous subsection. We find that
the design points with filtered traces overlap with the points from unfiltered traces in most cases,
including the optimal designs. At the smaller design points, the effect of the high associativity of
the filter cache causes aggressive filtering to underestimate cycles by around 2%, though the relative
trends are still preserved as before.
4.5.4 Scalability
SynchroTrace is designed to be scalable and can generate and run traces for applications with more
than 128 threads. While we leave in-depth investigation of the framework’s accuracy and speedup
for such a large number of threads to future work, we performed a single demonstrative
experiment using Splash-2’s FFT benchmark. In our measured 32-thread run of FFT, we show a 17x
speedup of SynchroTrace over Gem5 using the trace filtering technique discussed in Subsection 4.5.3.
4.6 Background and Related Work in simulation-based design space exploration of CMPs
The most accurate solution for a simulation-based design space exploration can be obtained through
execution-driven full-system simulators such as Gem5 [9] that execute entire applications. Recently,
a number of scalable simulators that use parallel simulation have been released [14, 65, 92]. They
allow different levels of slack in the ordering of memory accesses for multi-threaded applications and
enforce synchronization between simulation threads at quanta ranging from a few thousand cycles to
arbitrary barrier synchronizations [14, 65, 92].
The biggest challenge in achieving fast parallel simulation is determining how often to synchro-
nize simulation threads; determining the granularity of synchronization. Synchronization at fine
granularities allows more accuracy in modeling contention of shared resources, while coarse granu-
larities result in better parallel speedup at the expense of accuracy. Graphite and ZSim both achieve
speedup by allowing varying levels of slack in the ordering of memory accesses during parallel execu-
tion [65, 92]. However, as emphasized by Srinivasan et al [97], Graphite’s modeling of contention via
queuing theory is inaccurate for microarchitecture analysis. Additionally, Graphite requires source
code to be compiled against special libraries for simulation. Sniper (based on Graphite) provides
a more accurate estimate for the core CPI based on the interval core model, but has up to 25%
absolute error against real hardware [14]. These parallel simulators have not been fully validated for
relative errors and design space exploration capabilities. Recall that, when the ratio
of CPI between any two design points is considered, SynchroTrace integrated with Gem5 maintains
a relative accuracy to within 97%. Note that prior work in the area of parallel simulators is
orthogonal to our work in this chapter, as the SynchroTrace methodology can be integrated into any of
these simulators. SynchroTrace can also aid those frameworks in identifying synchronization points
and in finding potential performance improvements using trace filtering. Integration with SynchroTrace
in its current form implies using SynchroTrace’s 1-IPC core models and bypassing the similarly weak
core models in all these simulation frameworks, but has the advantage of potential analysis of the
trace to dynamically determine the synchronization granularity. This is an investigation for future
work.
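The comparison of CPI ratios between design points mentioned above can be made concrete with a small Python sketch. The definition used here (one minus the normalized difference between the reference ratio and the estimated ratio) is an assumption made for illustration, and the function names and sample CPI values are hypothetical; they are not measurements from this work.

```python
def cpi_ratio(cpi_a, cpi_b):
    """Ratio of CPI between two design points A and B."""
    return cpi_a / cpi_b


def relative_accuracy(ref_a, ref_b, est_a, est_b):
    """Agreement of the estimated CPI ratio with the reference CPI ratio.

    ref_*: CPI from the reference simulator at design points A and B.
    est_*: CPI from the faster, approximate simulator at the same points.
    Assumed definition: 1 - |est_ratio - ref_ratio| / ref_ratio.
    """
    ref = cpi_ratio(ref_a, ref_b)
    est = cpi_ratio(est_a, est_b)
    return 1.0 - abs(est - ref) / ref


if __name__ == "__main__":
    # The estimate is 25% low in absolute terms at both points, yet the
    # trend between the two design points is preserved exactly.
    print(relative_accuracy(ref_a=2.0, ref_b=1.0, est_a=1.5, est_b=0.75))
    # The estimate distorts the trend between the two design points.
    print(relative_accuracy(ref_a=2.0, ref_b=1.0, est_a=1.5, est_b=1.0))
```

The first call shows why relative accuracy is the metric of interest for design space exploration: a constant absolute bias cancels in the ratio, so the ranking of design points is unaffected.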
Traces used in trace-based simulations are simply a chronological log of the various events (mes-
sages sent over the NoC or cache access or instructions etc.) taking place in a system. Prior
trace-based simulation approaches have encountered difficulty capturing and accurately replaying
multi-threaded traces due to the inherent non-determinism in the execution of multi-threaded pro-
grams [35]. As shown in this Chapter, SynchroTrace is able to model non-determinism by capturing
and embedding synchronization events in the trace and tracking dependencies between traces during
capture.
4.6.1 Comparison to PinPlay
PinPlay provides a framework, based upon dynamic instrumentation, to capture execution into
traces (Pinballs) and replay the captured execution, deterministically [79]. There are clear benefits to
deterministic replay, such as debugging or reduced complexity in CMP simulators for single-threaded
applications. However, deterministic replay can fundamentally cause inaccuracies for design space
exploration with multi-threaded benchmarks.
In the context of multi-threaded applications, Pinballs are generated for each individual thread’s
execution. Included in multi-threaded Pinballs is a thread dependency file that captures shared
memory reads and writes in order, along with instruction dependencies among threads, to deterministically
replay the traces in the captured order. Deterministic replay of multi-threaded traces is useful for
debugging multi-threaded applications in frameworks such as DrDebug [104]. However, deterministic
replay does not allow for timing behavior to affect the critical path of multi-threaded applications
and produces the same thread interleaving for every run. An example of this timing behavior is the
influence of the memory system on the ordering of thread synchronization events. This potential
inaccuracy in the context of design space exploration with multi-threaded benchmarks is noted by
T.E. Carlson et al. [15], whose authors include the developers of PinPlay and of a PinPlay-integrated multi-core
simulator, Sniper. PinPlay’s enforcement of thread event ordering can cause cycle-time inaccuracy
when replaying multi-threaded Pinballs into a CMP simulator, as the imposed thread ordering may
differ from the native execution of multi-threaded programs on different types of CMPs. In contrast
to PinPlay, SynchroTrace allows thread timing behavior to affect the critical path of multi-threaded
applications with a more accurate, non-deterministic playback.
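The difference between the two replay styles can be illustrated with a minimal Python sketch of non-deterministic playback, in which the order of lock acquisitions is decided by simulated time rather than fixed at capture. The event kinds ("compute", "mem", "lock", "unlock"), the per-thread memory latency parameter, and the function name are hypothetical simplifications and do not reflect the actual SynchroTrace trace format or replay engine.

```python
def replay(traces, mem_latency):
    """Replay per-thread event traces; locks are granted in simulated-time order.

    traces:      {thread_id: [(kind, arg), ...]} with kind in
                 "compute" (arg = cycles), "mem" (arg = number of accesses),
                 "lock" / "unlock" (arg = lock id).
    mem_latency: {thread_id: cycles per memory access} on the target platform.
    Returns (acquired, clock): lock acquisition order and final thread clocks.
    """
    clock = {t: 0 for t in traces}
    pc = {t: 0 for t in traces}
    holder = {}    # lock id -> thread currently holding it
    waiters = {}   # lock id -> [(arrival_time, thread_id), ...]
    blocked = set()
    acquired = []  # (lock id, thread_id) in acquisition order
    while True:
        runnable = [t for t in traces
                    if t not in blocked and pc[t] < len(traces[t])]
        if not runnable:
            break
        # Always process the thread that is earliest in simulated time.
        t = min(runnable, key=lambda x: (clock[x], x))
        kind, arg = traces[t][pc[t]]
        if kind == "compute":
            clock[t] += arg
        elif kind == "mem":
            clock[t] += arg * mem_latency[t]
        elif kind == "lock":
            if arg in holder:
                # Lock is taken: wait; the event completes when it is granted.
                waiters.setdefault(arg, []).append((clock[t], t))
                blocked.add(t)
                continue
            holder[arg] = t
            acquired.append((arg, t))
        elif kind == "unlock":
            del holder[arg]
            queue = waiters.get(arg, [])
            if queue:
                queue.sort()
                arrival, nxt = queue.pop(0)
                clock[nxt] = max(arrival, clock[t])
                holder[arg] = nxt
                acquired.append((arg, nxt))
                blocked.discard(nxt)
                pc[nxt] += 1
        pc[t] += 1
    return acquired, clock


# Two threads contend for lock "L"; thread 0 is memory-bound.
TRACES = {
    0: [("mem", 10), ("lock", "L"), ("compute", 5), ("unlock", "L")],
    1: [("compute", 30), ("lock", "L"), ("compute", 5), ("unlock", "L")],
}

if __name__ == "__main__":
    print(replay(TRACES, {0: 2, 1: 2})[0])  # fast memory system
    print(replay(TRACES, {0: 5, 1: 5})[0])  # slow memory system
```

With the fast memory system thread 0 reaches the lock first, while with the slow memory system thread 1 does. A deterministic replay would impose the captured order on both platforms, which is the source of inaccuracy discussed above.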
To the best of our knowledge, no Pinball-based solution has been developed for the more accurate,
non-deterministic playback of multi-threaded Pinballs in the context of design space exploration.
Currently, the Sniper simulator [14], which can interface with single-threaded Pinballs, is unable to
play back multi-threaded Pinballs for design-space exploration.
4.6.2 Other Trace-Driven Simulation Solutions
Rico et al. [86] present a hybrid simulation methodology that uses an execution-driven component
to handle threading API calls (parops, in their nomenclature) in multi-threaded applications, while
a trace-driven engine handles the non-parallel portions of the application. These traces capture
sequential flow of execution for each thread, somewhat similar to our methodology [86]. However, this
methodology requires source to source transformations to interface the parops with their simulation
framework, while SynchroTrace does not require source code changes. Also, the authors propose a
simulation framework with complex interfaces that are not fully validated against hardware or full-
system simulation. They have also not characterized simulator performance and only demonstrate
the methodology on a single custom application. This motivated us to develop a methodology with a
simple interface that works with unmodified benchmarks using standard threading libraries.
Trace-based approaches have also been employed to specifically explore the NoC design space [46,
52, 76, 99]. Most work in this space has recognized the need to establish causation between network
messages in order to model the associated delays correctly. Thus, most of them attempt to annotate
dependencies in their traces. Raw traces are collected, and dependencies are extracted, mostly
through post-processing approaches [46, 52, 76]. Y. S.-C. Huang et al. use a Bloom-filter-inspired
approach for message passing interface (MPI) based applications but cannot handle shared-memory
applications [52]. Nitta et al.’s methodology and Netrace suffer from the need for multiple full-
system runs to infer true dependencies [46, 76]. In general, collecting traces through full-system
simulation is not scalable to a large number of threads. To the best of our knowledge, we are the
first to generate reliable synchronization and dependency-aware multi-threaded traces that require
no changes to application code for architecture simulation.
4.7 Summary
In this chapter, we show that non-determinism in the execution of multi-threaded applications will
affect trace-based simulation of these applications on CMPs. The non-determinism manifests as
uneven thread progress between synchronization points and indeterminate waits at synchronization
points. We quantify how important this problem is, by capturing flat traces of multi-threaded
applications from different frameworks and note their large differences in memory traffic (from 40% to
100% deviation) arising specifically from synchronization related mechanisms. In order to model the
non-determinism correctly for simulation, we modify the event trace representation of Sigil to provide
traces of multi-threaded applications that capture high-level synchronization events and embed them
in the trace. Based on the modified infrastructure, we presented SynchroTrace: Synchronization- and
Dependency-Aware architecture-agnostic traces, played through an intelligent Replay mechanism for
accurate, flexible, scalable, and fast design space exploration for multi-threaded applications. We
have shown how the traces can be integrated into a simulator easily with the help of our Replay
mechanism. We validate the SynchroTrace simulation flow by successfully achieving the equivalent
results of a constraint-based design space exploration with the Gem5 Full-System simulator. We
show that our methodology is flexible: we can trade off a minimal loss of accuracy for large gains
in speed by compressing and filtering traces. The results from simulating benchmarks from PARSEC
2.1 and Splash-2 show that our trace-based approach with trace filtering has a peak speedup of up
to 18.4x over simulation in Gem5 Full-System with an average of about 7.5x speedup.
4.8 Acknowledgments
Most of the work in Section 4.1 is adapted from a paper entitled “Can you trust your memory trace?:
A comparison of memory traces from binary instrumentation and simulation” by Siddharth Nilakan-
tan, Scott Lerner and Mark Hempstead. The dissertation author was the primary investigator and
author of this paper. The remaining sections are adapted from a paper entitled “SynchroTrace:
Synchronization-aware Architecture-agnostic Traces for Light-Weight Multicore Simulation” by Sid-
dharth Nilakantan, Karthik Sangaiah, Ankit More, Giordano Salvador, Baris Taskin, and Mark
Hempstead. The dissertation author was the primary investigator and author of this paper. This
material is based on work supported by the National Science Foundation including a CAREER
award CCF-1350624 and grant ECCS-1232164. Karthik Sangaiah is supported by the NSF Graduate
Research Fellowship under Grant No. 1002809. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the authors and do not necessarily reflect
the views of the National Science Foundation.
Chapter 5: Conclusions & Future Directions
The performance of modern microprocessor-based systems is limited by communication. Recent
studies have found that the promise of speedup from technology scaling, or using heterogeneous pro-
cessors, is diminished when hardware communication costs are included. We discussed some results
from those studies that motivate the need to consider the impact of hardware-level communication.
We showed that there are strong indications to the research community that communication costs
will severely impact the performance of future multicore systems.
Based on the insight that hardware-level communication is a run-time manifestation of software-
level communication, we elected to study the impact of software-level communication on the design
of future multicore processors. This dissertation showed how a methodology for capturing and
classifying communication is the first step for early stage design of future Chip-multiprocessors
(CMPs). We discussed the methodology in detail and outlined the categories of classification and
the use cases enabled by profiling software-level communication.
We introduced the concept of software entities and showed how all communication in a program
is essentially visible through Load and Store instructions that access memory. The ability to
dynamically trace communication in applications through memory addresses allows tracking of
dependencies through pointer indirection, linked lists, and control flow as well. We introduced
the novel concept of communication classification, which to the best of our knowledge has never
been implemented prior to the work presented in this dissertation. We showed how the categories
of communication classification represent various run-time manifestations of software-level
communication. For example, data transfer of the input set over a bus to an accelerator is represented by
unique communication, while reuse of the same data is represented by non-unique communication.
We introduced a novel tool named Sigil that is capable of automatically capturing and classifying
communication efficiently through the use of Shadow Memory and auxiliary data structures. The
Sigil tool is the biggest contribution of this dissertation as we show its representations are powerful
in enabling and assisting many hardware design tasks, ranging from HW/SW partitioning and
parallelism discovery to CMP simulation. We exhaustively showed, with the help of case studies, how
platform-independent software-level communication can be used to analyze i) function-level
interaction in single-threaded programs to determine which specialized logic to include in Heterogeneous
CMPs, and ii) thread-level interaction in multi-threaded programs to aid in CMP design space
exploration of both Homogeneous and Heterogeneous CMPs.
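To make the shadow-memory idea above more tangible, here is a minimal Python sketch that classifies byte-level communication between software entities as unique or non-unique. It assumes that a read counts as unique the first time a given consumer reads a byte after it was last written, and as non-unique on every re-read. The class, its fields, and the "<input>" placeholder producer are hypothetical and greatly simplified relative to Sigil's actual data structures.

```python
from collections import defaultdict


class ShadowMemory:
    """Per-byte shadow state: last writer plus readers since that write."""

    def __init__(self):
        self.last_writer = {}             # address -> producing entity
        self.readers = defaultdict(set)   # address -> consumers since last write
        self.unique = defaultdict(int)    # (producer, consumer) -> bytes
        self.non_unique = defaultdict(int)

    def store(self, entity, addr, size=1):
        for a in range(addr, addr + size):
            self.last_writer[a] = entity
            self.readers[a].clear()       # a new value invalidates old readers

    def load(self, entity, addr, size=1):
        for a in range(addr, addr + size):
            # Bytes never written by the program are attributed to "<input>".
            producer = self.last_writer.get(a, "<input>")
            edge = (producer, entity)
            if entity in self.readers[a]:
                self.non_unique[edge] += 1    # same data read again: reuse
            else:
                self.readers[a].add(entity)
                self.unique[edge] += 1        # first transfer of this byte


if __name__ == "__main__":
    shadow = ShadowMemory()
    shadow.store("f", 0x100, size=4)   # function f produces 4 bytes
    shadow.load("g", 0x100, size=4)    # g consumes them: unique
    shadow.load("g", 0x100, size=4)    # g re-reads them: non-unique
    print(dict(shadow.unique), dict(shadow.non_unique))
```

In this simplified view, the unique byte count on an edge approximates the data that would have to be transferred between the two entities, while the non-unique count reflects reuse that local storage could absorb.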
In this dissertation, we made the following contributions: i) We motivated the need to capture and
classify communication. ii) We described a unique profiling methodology to efficiently capture and
classify communication. iii) We proposed a method for interpreting function-based profiling results for
partitioning problems. iv) We proposed a method to discover the limits of fine-grained parallelism.
v) We described early-stage modeling and resource allocation of heterogeneous many-accelerator
CMPs. vi) We studied the impact of non-determinism in traces of multi-threaded applications. vii)
We analyzed platform-independent communication in multi-threaded programs to enable analysis
and trace-based simulation of CMPs. We summarize our contributions in detail
in the following paragraphs:
Motivating the need to capture and classify communication. We discussed the necessity
for classifying captured communication at the software level and showed how formal categories of
classification applied to platform-independent software communication can be used to estimate run-
time traffic between different hardware structures. We listed and defined all the formal categories
of classification and described how to identify them in software applications.
Unique profiling methodology to automatically capture and classify communication
efficiently. We described a profiling methodology that instruments computation and communication
costs for software entities such as functions and threads. We discussed the detailed implementation of
the tool, Sigil, that captures and classifies communication efficiently and the corresponding tradeoffs.
We employed a shadow-memory-based implementation to achieve efficiency and discussed the
slowdown (below 10x on average over Callgrind) and memory usage (within 2GB overhead for baseline
usage) of such an implementation. We also showed how multiple representations of output data can
be constructed from Sigil’s classified data for purposes ranging from assisting partitioning problems
to using traces to explore multicore design spaces. The aggregates representation produced by Sigil
can be used for a demonstrative partitioning to identify candidate functions for acceleration. The
event trace representation captures more fine-grained detail: a trace of data dependencies between
calls. With a simple example, we described how dependency trees can be constructed from the event
trace representation that allow for detection of critical paths and pipeline parallelism.
Method for interpreting function-based profiling results for partitioning problems.
We motivated the accelerator selection problem and discussed similarities to earlier HW/SW
partitioning problems in the HW/SW codesign and reconfigurable computing communities. We described
how Sigil’s aggregate representation can be used for the accelerator selection problem by producing
interprocedural data flow graphs (IDFGs) on which HW/SW partitioning problems can be applied.
The IDFGs are constructed using the unique communication edges obtained from function-level pro-
filing in Sigil, and can be used as substitutes for data flow graphs in HW/SW partitioning problems.
We described the unique considerations for partitioning the aggregates representation of captured
and classified communication data obtained from Sigil. To perform the partitioning we also intro-
duced a novel, demonstrative metric termed breakeven-speedup that is able to produce a reasonable
list of accelerator candidates without requiring accelerator implementations. Specifically, we showed
how our demonstrative partitioning approach is able to detect interesting functions for acceleration
such as sort, compression and encryption. These functions can be analyzed to estimate storage based
on their patterns of data reuse. These novel studies on Sigil’s IDFG show the power of platform-
independent analysis toward the design of accelerator-based heterogeneous CMP architectures. We
have discussed in the background, how industry’s adoption of such architectures hinges upon tools
and analysis such as ours.
Discovering the limits of fine-grained parallelism. We demonstrated a critical path analysis
case study of Sigil’s event trace representation results for PARSEC benchmarks, establishing a
theoretical limit of parallelism for those applications. We discussed how our approach, based on
critical path analysis, is different from prior approaches, and uses a different representation of the
workload for analysis.
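The kind of critical path analysis summarized above can be sketched in a few lines of Python. The sketch assumes that the event trace has already been reduced to a dependency graph with a cost per node; the node names, costs, and function names are hypothetical, and the standard-library graphlib module (Python 3.9 or later) is used for the topological ordering. It illustrates the general technique rather than the exact analysis performed on Sigil's output.

```python
from graphlib import TopologicalSorter


def critical_path(cost, deps):
    """Longest dependency chain through a DAG of events.

    cost: {node: work}, for example operations executed by that event.
    deps: {node: [predecessor, ...]}; every predecessor must appear in cost.
    Returns (length, path) of the critical path.
    """
    graph = {n: tuple(deps.get(n, ())) for n in cost}
    finish, pred = {}, {}
    # static_order() yields every node after all of its predecessors.
    for n in TopologicalSorter(graph).static_order():
        start, pred[n] = 0, None
        for p in graph[n]:
            if finish[p] > start:
                start, pred[n] = finish[p], p
        finish[n] = start + cost[n]
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = pred[node]
    path.reverse()
    return length, path


def parallelism_limit(cost, deps):
    """Total work divided by critical path length: an upper bound on speedup."""
    length, _ = critical_path(cost, deps)
    return sum(cost.values()) / length


if __name__ == "__main__":
    cost = {"A": 4, "B": 3, "C": 5, "D": 2}
    deps = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
    print(critical_path(cost, deps))
    print(parallelism_limit(cost, deps))
```

Dividing the total work by the critical path length gives the theoretical limit of parallelism for the modeled workload, since no schedule can finish faster than its longest chain of dependent events.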
Early-stage modeling and resource allocation of heterogeneous many-accelerator
CMPs. We evaluated a methodology for early-stage modeling and design of CMPs that will contain
many accelerators. We described an execution model and a performance estimation model. With the
help of profiling results, we partition a sample workload and allocate resources in a communication-
aware manner to maximize performance of accelerator-rich CMPs. We showed, in the context of
the sample workload, how a resource allocation that does not consider communication negatively
impacts the performance of the accelerator-rich CMP.
A study on the impact of non-determinism in traces of multi-threaded applications.
As shared memory multi-threaded programs protect communication with synchronization constructs,
they are subject to non-determinism. With an example, we showed that non-determinism will need
to be modeled to perform accurate design space exploration of multicore systems that run multi-
threaded applications. To model non-determinism, capturing communication alone is insufficient
and capturing synchronization will be required as well. The non-determinism manifests as uneven
thread progress between synchronization points and indeterminate waits at synchronization points.
In a study, we quantified the impact of non-determinism due to existence of thread synchronization
constructs in multi-threaded applications.
Using platform-independent communication analysis on multi-threaded programs
to enable analysis and trace-based simulation of CMPs. This dissertation also extends the
communication classification methodology with intercepts for synchronization events. In order to
model the non-determinism correctly for simulation, we modified the event trace representation of
Sigil to provide traces of multi-threaded applications that capture high-level synchronization events
and embed them in the trace. We discussed the necessary extensions of the event trace representation
of Sigil. With the help of a replay mechanism that models non-determinism at run-time, we showed
how the traces can be used to perform trace-based simulation for CMP design space exploration,
in a methodology named SynchroTrace. The relative accuracy of exploring the design space is
maintained to within 97%, while the speedup of simulation using SynchroTrace is shown to be 4.6x
on average over Gem5. We showed that SynchroTrace is flexible: we can trade off a minimal loss
of accuracy for large gains in simulation speed by applying optimizations such as compression and
filtering on the traces. The results from simulating benchmarks from PARSEC 2.1 and Splash-2
showed that our trace-based approach with these optimizations has a peak speedup of
18.4x over simulation in Gem5 Full-System, with an average speedup of about 7.5x.
Hence with platform-independent analysis of communication in workloads, we are able to explore
diverse use cases in system design. We also showed that this analysis can be performed in an efficient
manner by using our methodology for communication classification.
5.1 Future Directions
We believe the contributions in this dissertation can be extended in the following ways:
We can extend the categories of classification to allow analysis of induction variables, reduction
variables and constants. This will necessitate a hybrid dynamic/static approach to communication
classification. With the help of such categories we will be able to refine the estimation of run-time
behavior for multicore systems. Sigil can also be extended to work on basic blocks to allow for
loop-level analysis, as loops have been identified as software entities that communicate both locally
and externally. In lieu of basic block level analysis, loop detection will also work. Nested loops form
a hierarchy that is similar to the calltree hierarchy as well.
Accelerator selections can be extensively validated with the help of accelerator performance/power
estimators and simulators that model accelerators. This necessitates the study of accelerated
functions and high-level synthesis. Sophisticated partitioning algorithms can also be applied on Sigil’s data
to obtain a refined list of accelerator candidates. Indeed, a full study of applying HW/SW parti-
tioning algorithms on unique graph representations such as interprocedural data flow graphs could
seed an entirely new set of contributions.
The event trace representation, a fine-grained representation, has been employed for several
purposes in this dissertation and promises to be useful for the discovery of and modeling of parallelism
in applications. It can be used to discover both intra- and inter-function parallelism, and with some
extensions, programs can be modeled as self-contained collections of instructions that communicate;
this representation has the potential to extract parallelism with dynamic profiling.
The SynchroTrace methodology presented in this dissertation currently uses simple simulation
models for the CPU cores. The core models can potentially be made more sophisticated by allowing
each event to also contain an estimate of ILP present in the event. Similar to the unique addresses
held in an event, the unique registers and number of dependencies through registers could help model
out-of-order cores and thereby improve the capability of SynchroTrace.
Bibliography
[1] Intel Xeon E5-2667. http://ark.intel.com/products/83361.
[2] Valgrind function-wrapping. http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.wrapping.
[3] AnandTech. The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core i3-2100 Tested.
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-2500k-core-i3-2100-tested.
[4] N. Barrow-Williams, C. Fensch, and S. Moore. A communication characterisation of splash-2
and parsec. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium
on, pages 86–97, 2009.
[5] K. Beyls and E.H. D’Hollander. Reuse distance as a metric for cache behavior. In Proceedings of
the IASTED Conference on Parallel and Distributed Computing and Systems, pages 617–662,
2001.
[6] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The parsec benchmark
suite: Characterization and architectural implications. In Proceedings of the 17th International
Conference on Parallel Architectures and Compilation Techniques, PACT ’08, pages 72–81,
New York, NY, USA, 2008. ACM.
[7] Christian Bienia and Kai Li. Parsec 2.0: A new benchmark suite for chip-multiprocessors. In
Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June
2009.
[8] Christian Bienia and Kai Li. Parsec 2.0: A new benchmark suite for chip-multiprocessors. In
Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June
2009.
[9] N. Binkert et al. The gem5 simulator. In The ACM SIGARCH Computer Architecture Newslet-
ter, pages 1–7, August 2011.
[10] Filip Blagojevic, Xizhou Feng, Kirk W. Cameron, and Dimitrios S. Nikolopoulos. Modeling
multigrain parallelism on heterogeneous multi-core processors: A case study of the cell be. In
Per Stenström, Michel Dubois, Manolis Katevenis, Rajiv Gupta, and Theo Ungerer, editors,
HiPEAC, volume 4917 of Lecture Notes in Computer Science, pages 38–52. Springer, 2008.
[11] C. Bobda, P. Mahr, B. Andres, and H. Ishebabi. Application-driven architecture synthesis
of on-chip multiprocessor systems. In High Performance Computing and Simulation (HPCS),
2010 International Conference on, pages 591–598, June 2010.
[12] M. Brehob and R. Enbody. An analytical model of locality and caching. Michigan State
University, Department of Computer Science and Engineering MSU-CSE-99-31, 1999.
[13] J.D. Brown, S. Woodward, B.M. Bass, and C.L. Johnson. Ibm power edge of network processor:
A wire-speed system on a chip. Micro, IEEE, 31(2):76 –85, march-april 2011.
[14] T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the level of abstraction for
scalable and accurate parallel multi-core simulation. In International Conference for High
Performance Computing, Networking, Storage and Analysis (SC), November 2011.
[15] T.E. Carlson, W. Heirman, H Patil, and L. Eeckhout. Efficient, accurate and reproducible sim-
ulation of multi-threaded workloads. In REPRODUCE: Workshop on Reproducible Research
Methodologies. IEEE, 2014.
[16] E.S. Chung, P.A. Milder, J.C. Hoe, and Ken Mai. Single-chip heterogeneous computing: Does
the future include custom logic, fpgas, and gpgpus? In Microarchitecture (MICRO), 2010 43rd
Annual IEEE/ACM International Symposium on, pages 225–236, Dec 2010.
[17] Katherine Compton and Scott Hauck. Reconfigurable computing: a survey of systems and
software. ACM Comput. Surv., 34(2), June 2002.
[18] Jason Cong, Mohammad Ali Ghodrat, et al. Bin: a buffer-in-nuca scheme for accelerator-rich
cmps. In ISLPED, ISLPED ’12, 2012.
[19] Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Karthik Gururaj, and
Glenn Reinman. Accelerator-rich architectures: Opportunities and progresses. In Proceedings
of the 51st Annual Design Automation Conference, DAC ’14, pages 180:1–180:6, New York,
NY, USA, 2014. ACM.
[20] Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, and Glenn Reinman.
Charm: A composable heterogeneous accelerator-rich microprocessor. In Proceedings of the
2012 ACM/IEEE international symposium on Low power electronics and design, pages 379–
384. ACM, 2012.
[21] Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Chunyue Liu, Glenn Reinman, and Yi Zou.
Axr-cmp: Architecture support in accelerator-rich cmps. In 2nd Workshop on SoC Architec-
ture, Accelerators and Workloads, 2011.
[22] R. Cordone, F. Redaelli, M.A. Redaelli, M.D. Santambrogio, and D. Sciuto. Partitioning
and scheduling of task graphs on partially dynamically reconfigurable fpgas. Computer-Aided
Design of Integrated Circuits and Systems, IEEE Transactions on, 28(5):662–675, May 2009.
[23] John Curreri, Greg Stitt, and Alan George. Communication visualization for bottleneck de-
tection of high-level synthesis applications. In FPGA, 2012.
[24] R.P. Dick, D.L. Rhodes, and W. Wolf. Tgff: task graphs for free. In Hardware/Software
Codesign, 1998. (CODES/CASHE ’98) Proceedings of the Sixth International Workshop on,
pages 97–101, Mar 1998.
[25] Rolf Ernst, Jorg Henkel, and Thomas Benner. Hardware-software cosynthesis for microcon-
trollers. IEEE Des. Test, 10(4), October 1993.
[26] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug
Burger. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual
International Symposium on Computer Architecture, ISCA ’11, pages 365–376, New York,
NY, USA, 2011. ACM.
[27] Hadi Esmaeilzadeh et al. Dark Silicon and the End of Multicore Scaling. In ISCA 38, 2011,
June 2011.
[28] Hadi Esmaeilzadeh et al. Dark Silicon and the End of Multicore Scaling. In Proceedings of the
38th International Symposium on Computer Architecture (ISCA), June 2011.
[29] R H Dennard et al. Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions.
IEEE Journal of Solid-State Circuits, SC(9):256–268, October 1974.
[30] Min Feng, Chen Tian, Changhui Lin, and Rajiv Gupta. Dynamic access distance driven cache
replacement. TACO, 2011.
[31] M.D. Galanis, G. Dimitroulakos, and C.E. Goutis. Speedups from partitioning critical software
parts to coarse-grain reconfigurable hardware. In ASAP, 2005.
[32] M.D. Galanis et al. Speedups from partitioning critical software parts to coarse-grain reconfig-
urable hardware. In Application-Specific Systems, Architecture Processors, 2005. ASAP 2005.
16th IEEE International Conference on, July 2005.
[33] Saturnino Garcia, Donghwan Jeon, Christopher M. Louie, and Michael Bedford Taylor. Krem-
lin: Rethinking and rebooting gprof for the multicore age. In Proceedings of the 32nd ACM
SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’11,
pages 458–469, New York, NY, USA, 2011. ACM.
[34] A. Golander, N. Greco, J. Xenidis, M. Hyland, B. Purcell, and D. Bernstein. Ibm’s poweren
developer cloud: Fertile ground for academic research. In Electrical and Electronics Engineers
in Israel (IEEEI), 2010 IEEE 26th Convention of, pages 000803–000807, Nov 2010.
[35] S. R. Goldschmidt and J. L. Hennessy. The accuracy of trace-driven simulations of multipro-
cessors. ACM SIGMETRICS, pages 146–157, June 1993.
[36] C. Gregg and K. Hazelwood. Where is the data? why you cannot debate cpu vs. gpu perfor-
mance without the answer. In ISPASS, 2011.
[37] C. Gremzow. Quantitative global dataflow analysis on virtual instruction set simulators for
hardware/software co-design. In Computer Design, 2008. ICCD 2008. IEEE International
Conference on, pages 377–383, Oct 2008.
[38] Carsten Gremzow. Compiled low-level virtual instruction set simulation and profiling for code
partitioning and asip-synthesis in hardware/software co-design. In SCSC, 2007.
[39] Dominik Grewe and Michael F. P. O’Boyle. A static task partitioning approach for heteroge-
neous systems using opencl. In Proceedings of the 20th International Conference on Compiler
Construction: Part of the Joint European Conferences on Theory and Practice of Software,
CC’11/ETAPS’11, pages 286–305, Berlin, Heidelberg, 2011. Springer-Verlag.
[40] Gagan Gupta and Gurindar S Sohi. Dataflow execution of sequential imperative programs on
multicore architectures. In MICRO, 2011.
[41] Rajesh K. Gupta and Giovanni De Micheli. Hardware-software cosynthesis for digital systems.
IEEE Des. Test, 10(3), July 1993.
[42] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C.
Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of
inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Sympo-
sium on Computer Architecture, ISCA ’10, pages 37–47, New York, NY, USA, 2010. ACM.
[43] Nikos Hardavellas, Michael Ferdman, Anastasia Ailamaki, and Babak Falsafi. Power scaling:
The ultimate obstacle to 1k-core chips. tech. report NWU-EECS-10-05, 2010.
[44] W. Heirman, D. Stroobandt, N.R. Miniskar, R. Wuyts, and F. Catthoor. Pincomm: Charac-
terizing intra-application communication for the many-core era. In Parallel and Distributed
Systems (ICPADS), 2010 IEEE 16th International Conference on, pages 500–507, Dec 2010.
[45] Mark Hempstead, Gu-Yeon Wei, and David Brooks. Navigo: An early-stage model to study
power-constrained architectures and specialization. In ISCA Workshop on Modeling, Bench-
marking, and Simulations (MoBS), Austin, Texas, June 2009.
[46] J. Hestness, B. Grot, and S. W. Keckler. Netrace: dependency-driven trace-based network-
on-chip simulation. In Proceedings of the International Workshop on Network on Chip Archi-
tectures (NoCArc), pages 31–36, 2010.
[47] Mark Hill and Michael Marty. Amdahl’s Law in the Multicore Era. IEEE Computer, July
2008.
[48] Brian Holland et al. Rat: Rc amenability test for rapid performance prediction. TRETS, 2009.
[49] Brian Holland, Karthik Nagarajan, Chris Conger, Adam Jacobs, and Alan D. George. RAT: a
methodology for predicting performance in application design migration to fpgas. In Proceed-
ings of the 1st international workshop on High-performance reconfigurable computing technol-
ogy and applications: held in conjunction with SC07, HPRCTA ’07, pages 1–10, New York,
NY, USA, 2007. ACM.
[50] Rui Hou, Lixin Zhang, M.C. Huang, Kun Wang, H. Franke, Yi Ge, and Xiaotao Chang.
Efficient data streaming with on-chip accelerators: Opportunities and challenges. In High
Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on,
pages 312–320, Feb. 2011.
[51] Miaoqing Huang, Vikram K. Narayana, Harald Simmler, Olivier Serres, and Tarek El-Ghazawi.
Reconfiguration and communication-aware task scheduling for high-performance reconfig-
urable computing. ACM Trans. Reconfigurable Technol. Syst., 3(4), November 2010.
[52] Y. S.-C. Huang, Y.-C. Chang, T.-C Tsai, Y.-Y Chang, and C.-T King. Attackboard: A novel
dependency-aware traffic generator for exploring NoC design space. In Proceedings of the
ACM/IEEE Design Automation Conference (DAC), pages 376–381, 2012.
[53] A. Jalabert, S. Murali, L. Benini, and G. De Micheli. ×pipesCompiler: a tool for
instantiating application specific networks on chip. In Design, Automation and Test in Europe
Conference and Exhibition, 2004. Proceedings, volume 2, pages 884–889, Feb 2004.
[54] A. B. Kahng, B. Li, L. Peh, and K. Samadi. ORION 2.0: A fast and accurate NoC power and
area model for early-stage design space exploration. In Proceedings of the Design, Automation
Test in Europe (DATE), pages 423–428, April 2009.
[55] Martha Kim and Stephen Edwards. Computation vs. Memory Systems: Pinning Down Accel-
erator Bottlenecks. In Workshop on Architectural and Microarchitectural Support for Binary
Translation at ISCA (AMT-BT), June 2010.
[56] Minjang Kim, Hyesoon Kim, and Chi-Keung Luk. Sd3: A scalable approach to dynamic
data-dependence profiling. In MICRO, 2010.
[57] Yooseong Kim and Aviral Shrivastava. Cumapz: a tool to analyze memory access patterns in
cuda. In DAC, 2011.
[58] Anil Krishna, Timothy Heil, Nicholas Lindberg, Farnaz Toussi, and Steven VanderWiel. Hard-
ware acceleration in the ibm poweren processor: Architecture and performance. In Proceedings
of the 21st International Conference on Parallel Architectures and Compilation Techniques,
PACT ’12, pages 389–400, New York, NY, USA, 2012. ACM.
[59] Yu-Kwong Kwok and I. Ahmad. Dynamic critical-path scheduling: an effective technique for
allocating task graphs to multiprocessors. TPDS, 1996.
[60] Zhen Li, A. Jannesari, and F. Wolf. Discovery of potential parallelism in sequential programs.
In Parallel Processing (ICPP), 2013 42nd International Conference on, pages 1004–1013, Oct
2013.
[61] Yuan Lin, Hyunseok Lee, Mark Woh, Yoav Harel, Scott Mahlke, and Trevor Mudge. Soda: A
low-power architecture for software radio. In International Symposium on Computer Architec-
ture (ISCA), June 2006.
[62] LinuxGizmos.com. MIPS-like 32-core SoC runs Linux.
http://archive.linuxgizmos.com/mips-like-32-core-soc-runs-linux/.
[63] Qiang Liu, G.A. Constantinides, K. Masselos, and P. Cheung. Combining data reuse with
data-level parallelization for fpga-targeted hardware compilation: A geometric programming
framework. IEEE TCAD, 2009.
[64] Jonathan Mak. Facilitating program parallelisation: a profiling-based
approach, 2011.
[65] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and
A. Agarwal. Graphite: A distributed parallel simulator for multicores. In Proceedings of the
IEEE International Symposium on High Performance Computer Architecture (HPCA), pages
1–12, 2010.
[66] Matteo Monchiero, Jung Ho Ahn, Ayose Falcón, Daniel Ortega, and Paolo Faraboschi. How
to simulate 1000 cores. SIGARCH Comput. Archit. News, 37(2):10–19, July 2009.
[67] Pierre-André Mudry, Guillaume Zufferey, and Gianluca Tempesti. A dynamically constrained
genetic algorithm for hardware-software partitioning. In Proceedings of the 8th annual confer-
ence on Genetic and evolutionary computation, GECCO ’06, pages 769–776, New York, NY,
USA, 2006. ACM.
[68] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing nuca organizations
and wiring alternatives for large caches with cacti 6.0. In Proceedings of the 40th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007.
[69] Nicholas Nethercote and Alan Mycroft. Redux: A dynamic dataflow tracer. In Electronic
Notes in Theoretical Computer Science, page 2003. Elsevier, 2003.
[70] Nicholas Nethercote and Julian Seward. How to shadow every byte of memory used by a
program. In VEE, 2007.
[71] Nicholas Nethercote and Julian Seward. How to shadow every byte of memory used by a
program. In Proceedings of the 3rd international conference on Virtual execution environments,
VEE ’07, pages 65–74, 2007.
[72] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic
binary instrumentation. In PLDI, 2007.
[73] S. Nilakantan, S. Battle, and M. Hempstead. Metrics for early-stage modeling of many-
accelerator architectures. Computer Architecture Letters, PP(99):1, 2012.
[74] S. Nilakantan, S. Lerner, M. Hempstead, and B. Taskin. Can you trust your memory trace?:
A comparison of memory traces from binary instrumentation and simulation. In Interna-
tional Conference on VLSI Design and 14th International Conference on Embedded System
Design (VLSID ES), Jan 2015.
[75] Siddharth Nilakantan, Srikanth Annangi, Nikhil Gulati, Karthik Sangaiah, and Mark Hemp-
stead. Evaluation of an accelerator architecture for speckle reducing anisotropic diffusion. In
CASES, 2011.
[76] C. Nitta, K. Macdonald, M. Farrens, and V. Akella. Inferring packet dependencies to improve
trace based simulation of on-chip networks. In Proceedings of the IEEE/ACM International
Symposium on Networks on Chip (NoCS), pages 153–160, 2011.
[77] NVIDIA Corp. NVIDIA Tegra K1 Whitepaper. pages 1–26, 2014.
[78] OpenCores. OpenCores. http://www.opencores.org.
[79] Harish Patil, Cristiano Pereira, Mack Stallcup, Gregory Lueck, and James Cownie. Pinplay:
A framework for deterministic replay and reproducible analysis of parallel programs. In Pro-
ceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and
Optimization, CGO ’10, pages 2–11, 2010.
[80] Cuong Pham-Quoc, Z. Al-Ars, and K. Bertels. Automated hybrid interconnect design for fpga
accelerators using data communication profiling. In Parallel Distributed Processing Symposium
Workshops (IPDPSW), 2014 IEEE International, pages 151–160, May 2014.
[81] B. Phanibhushana, K. Ganeshpure, and S. Kundu. Task model for on-chip communication
infrastructure design for multicore systems. In Computer Design (ICCD), 2011 IEEE 29th
International Conference on, pages 360–365, Oct 2011.
[82] Fred J. Pollack. New microarchitecture challenges in the coming generations of CMOS process
technologies (keynote). In MICRO 32, 1999.
[83] Graham D. Price, John Giacomoni, and Manish Vachharajani. Visualizing potential paral-
lelism in sequential programs. In PACT, 2008.
[84] T.R. Puzak. Analysis of cache replacement-algorithms. PhD thesis, University of Mas-
sachusetts Amherst, 1985.
[85] J. Javier Resano, M. Elena Pérez, Daniel Mozos, Hortensia Mecha, and Julio Septién. Analyzing
communication overheads during hardware/software partitioning. Microelectronics Journal,
34(11):1001–1007, November 2003.
[86] A. Rico, A. Duran, F. Cabarcas, Y. Etsion, A. Ramirez, and M. Valero. Trace-driven simulation
of multithreaded applications. In IEEE International Symposium on Performance Analysis of
Systems and Software (ISPASS), pages 87–96, 2011.
[87] Sean Rul, Hans Vandierendonck, and Koen De Bosschere. Function level parallelism driven
by data dependencies. SIGARCH Comput. Archit. News, 35(1):55–62, March 2007.
[88] A.G. Saidi, N.L. Binkert, S.K. Reinhardt, and T. Mudge. Full-system critical path analysis.
In ISPASS, 2008.
[89] Ali G. Saidi, Nathan L. Binkert, Steven K. Reinhardt, and Trevor Mudge. End-to-end perfor-
mance forecasting: finding bottlenecks before they happen. In ISCA, 2009.
[90] G. Salvador, S. Nilakantan, B. Taskin, M. Hempstead, and A. More. Static thread mapping
for nocs via binary instrumentation traces. In Computer Design (ICCD), 2014 32nd IEEE
International Conference on, pages 517–520, Oct 2014.
[91] G. Salvador, S. Nilakantan, B. Taskin, M. Hempstead, and A. More. Effects of nondeterminism
in hardware and software simulation with thread mapping. In VLSI Design (VLSID), 2015
28th International Conference on, pages 129–134, Jan 2015.
[92] Daniel Sanchez and Christos Kozyrakis. Zsim: Fast and accurate microarchitectural simulation
of thousand-core systems. In Proceedings of the 40th Annual International Symposium on
Computer Architecture, ISCA ’13, pages 475–486, 2013.
[93] Michael L. Scott. Shared-Memory Synchronization. Morgan & Claypool Publishers, San Rafael,
California, 2013.
[94] A.K. Singh, M. Shafique, A. Kumar, and J. Henkel. Mapping on multi/many-core systems:
Survey of current and emerging trends. In Design Automation Conference (DAC), 2013 50th
ACM / EDAC / IEEE, pages 1–10, May 2013.
[95] Melissa C. Smith and Gregory D. Peterson. Parallel application performance on shared high
performance reconfigurable computing resources. Perform. Eval., May 2005.
[96] Spiral Project. Software/Hardware generation for DSP algorithms. http://www.spiral.net.
[97] S. Srinivasan, L. Zhao, B. Ganesh, B. Jacob, M. Espig, and R. Iyer. Cmp memory modeling:
how much does accuracy matter?, June 2009.
[98] William Thies, Vikram Chandrasekhar, and Saman Amarasinghe. A practical approach to
exploiting coarse-grained pipeline parallelism in c programs. In Proceedings of the 40th An-
nual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 356–369,
Washington, DC, USA, 2007. IEEE Computer Society.
[99] F. Trivino, F. J. Andujar, F. J. Alfaro, and J. L. Sanchez. Self-related traces: An alternative to
full-system simulation for NoCs. In International Conference on High Performance Computing
and Simulation (HPCS), pages 819–824, 2011.
[100] Frank Vahid. What is hardware/software partitioning? SIGDA Newsl., 39(6):1–1, June 2009.
[101] Keith Vallerio. Task Graphs for Free 3.5. http://ziyang.eecs.umich.edu/~dickrp/tgff/.
[102] Ganesh Venkatesh et al. Conservation cores: reducing the energy of mature computations.
In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming
languages and operating systems, ASPLOS ’10, pages 205–218, New York, NY, USA, 2010.
ACM.
[103] Ganesh Venkatesh et al. Conservation cores: reducing the energy of mature computations. In
Proceedings of ASPLOS 15, 2010.
[104] Y. Wang, H. Patil, C. Pereira, G. Lueck, R. Gupta, and I. Neamtiu. Drdebug: deterministic
replay based cyclic debugging with dynamic slicing. In Proceedings of IEEE/ACM Interna-
tional Symposium on Code Generation and Optimization, CGO’14, New York, NY, USA, 2014.
ACM.
[105] Zhenlin Wang, Kathryn S. McKinley, Arnold L. Rosenberg, and Charles C. Weems. Using the
compiler to improve cache replacement decisions. In PACT, 2002.
[106] J. Weidendorfer et al. A tool suite for simulation based analysis of memory access behavior.
In ICCS, 2004.
[107] Samuel Williams et al. Roofline: an insightful visual performance model for multicore archi-
tectures. Commun. ACM, 2009.
[108] Wayne Wolf. Computers As Components: Principles of Embedded Computing System Design.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.
[109] Stephan Wong, Filipa Duarte, and Stamatis Vassiliadis. A hardware cache memcpy accelerator.
In FPT, 2006.
[110] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: charac-
terization and methodological considerations. In Proceedings of the International Symposium
on Computer Architecture (ISCA), pages 24–36, 1995.
[111] Lisa Wu and Martha A. Kim. Acceleration targets: A study of popular benchmark suites.
[112] Z. Wu and W. Wolf. Iterative cache simulation of embedded cpus with trace stripping. In
Proceedings of the International Workshop on Hardware/Software Codesign, CODES, 1999.
[113] H. Youness, M. Hassan, K. Sakanushi, et al. A high performance algorithm for scheduling and
hardware-software partitioning on mpsocs. In DTIS, 2009.
[114] H. Youness, M. Hassan, K. Sakanushi, Y. Takeuchi, M. Imai, A. Salem, A.-M. Wahdan, and
M. Moness. A high performance algorithm for scheduling and hardware-software partitioning
on mpsocs. In Design Technology of Integrated Systems in Nanoscale Era, 2009. DTIS ’09. 4th
International Conference on, pages 71–76, 2009.
[115] Heng Yu, Yajun Ha, and B. Veeravalli. Communication-aware application mapping and
scheduling for noc-based mpsocs. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE
International Symposium on, pages 3232–3235, May 2010.
[116] Xiangyu Zhang, Armand Navabi, and Suresh Jagannathan. Alchemist: A transparent depen-
dence distance profiling infrastructure. In CGO, 2009.
[117] Qin Zhao, Derek Bruening, and Saman Amarasinghe. Umbra: Efficient and scalable memory
shadowing. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code
Generation and Optimization, CGO ’10, pages 22–31, New York, NY, USA, 2010. ACM.
Vita
Siddharth Nilakantan was born in Chennai, Tamil Nadu, India in November 1984 and moved to
Bangalore, India during the 8th grade. He attended and finished schooling in Bangalore in 2002,
studying at the Bishop Cotton Boys’ school followed by Seshadripuram Composite Pre-university
college. After completing schoolwork, he obtained a Bachelor of Engineering degree in Electronics
& Communication at MS Ramaiah Institute of Technology, Bangalore, India in 2006, where the
syllabus was prescribed by Visvesvaraya Technological University. In 2007 and 2008, he
attended the University of Southern California, Los Angeles, CA, USA, where he received a Master
of Science degree in Electrical Engineering. During the following two years he was employed, first as
a design engineer at Novelics, a startup company that designed and licensed embedded memories,
and then at embedUR systems, where he performed Firmware QA.
In August 2010, he entered the Graduate School of Drexel University to pursue a PhD in Electrical
Engineering. During the course of the PhD, he also served as a Teaching Assistant for three courses.
Siddharth’s research interests include workload analysis, offloading tasks to hardware accelerators,
and the design of large-scale multi-core architectures, latency-tolerant out-of-order architectures,
caches, and prefetchers. He is a member of IEEE and volunteered with Drexel’s IEEE Graduate
Forum while at Drexel.

