Washington University in St. Louis

Washington University Open Scholarship
McKelvey School of Engineering Theses &
Dissertations

McKelvey School of Engineering

Winter 12-15-2017

Locality-Aware Concurrency Platforms
Jordyn Chrystopher Raymond Maglalang
Washington University in St. Louis

Follow this and additional works at: https://openscholarship.wustl.edu/eng_etds
Part of the Computer Sciences Commons

Recommended Citation
Maglalang, Jordyn Chrystopher Raymond, "Locality-Aware Concurrency Platforms" (2017). McKelvey
School of Engineering Theses & Dissertations. 290.
https://openscholarship.wustl.edu/eng_etds/290

This Dissertation is brought to you for free and open access by the McKelvey School of Engineering at Washington
University Open Scholarship. It has been accepted for inclusion in McKelvey School of Engineering Theses &
Dissertations by an authorized administrator of Washington University Open Scholarship. For more information,
please contact digital@wumail.wustl.edu.

WASHINGTON UNIVERSITY IN ST. LOUIS
School of Engineering and Applied Science
Department of Computer Science and Engineering

Dissertation Examination Committee:
Kunal Agrawal, Chair
Roger Chamberlain
Chris Gill
Sriram Krishnamoorthy
I-Ting Angelina Lee

Locality-Aware Concurrency Platforms
by
Jordyn Chrystopher Raymond Maglalang

A dissertation presented to
The Graduate School
of Washington University in
partial fulfillment of the
requirements for the degree
of Doctor of Philosophy

December 2017
Saint Louis, Missouri

c 2017, Jordyn Chrystopher Raymond Maglalang

Table of Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

v

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

viii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ix

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

xi

Chapter 1: Introduction . . . . . . . . . .
1.1 Streaming Applications . . . . . . .
1.1.1 The Streaming Model . . . .
1.1.2 Assumptions and Definitions
1.2 Dynamic Multithreading . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

1
3
4
5
6

Chapter 2: Cache-Concious Scheduling of Streaming Pipelines
2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.1 Lower Bound on Total Misses . . . . . . . . . . .
2.2.2 Lower Bound on Time . . . . . . . . . . . . . . .
2.2.3 The seg_cache Algorithm . . . . . . . . . . . . . .
2.2.4 Bounding seg_cache’s Performance . . . . . . . .
2.3 Conclusions and Future Work . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

8
9
10
10
12
15
17
19

Chapter 3: Executing Streaming Pipelines . . . . . . . . . . . . . .
3.1 Scheduling Policies . . . . . . . . . . . . . . . . . . . . . . .
3.2 Pipeline Execution . . . . . . . . . . . . . . . . . . . . . . .
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.1 Pipeline Generation . . . . . . . . . . . . . . . . . . .
3.3.2 Experimental Settings . . . . . . . . . . . . . . . . . .
3.3.3 Disabled Computation . . . . . . . . . . . . . . . . .
3.3.4 Randomly Generated Computation Times . . . . . . .
3.3.5 Correlated Computation Time . . . . . . . . . . . . .
3.3.6 Segmentation Based on Computation Time and Cache
3.3.7 Fixed State and Computation . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

20
20
22
23
23
24
25
26
28
29
32

ii

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

3.4
3.5

Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . .

33
34

Chapter 4: AMPipe: Automatically Mapping Streaming Pipelines
4.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1.1 Contributions . . . . . . . . . . . . . . . . . . . . .
4.2 Design and Implementation . . . . . . . . . . . . . . . . . .
4.2.1 AMP IPE front-end . . . . . . . . . . . . . . . . . .
4.2.2 Mapping and Code Generation . . . . . . . . . . . .
4.2.3 Runtime . . . . . . . . . . . . . . . . . . . . . . . .
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . .
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

35
35
36
38
39
42
46
49
53
55

Chapter 5: Locality-Aware Dynamic Task-Graph Scheduling .
5.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . .
5.3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.4 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . .
5.5.1 Overall performance . . . . . . . . . . . . . . . .
5.5.2 Locality impact of NABBIT C’s scheduling strategy
5.5.3 Overheads due to colored steals . . . . . . . . . .
5.5.4 Importance of good coloring . . . . . . . . . . . .
5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . .
5.7 Conclusions and Future Work . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

64
64
66
70
71
78
81
82
83
84
85
87

Chapter 6: Locality-Friendly Parallel For Loops . . . . . . . .
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . .
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . .
6.3 Background . . . . . . . . . . . . . . . . . . . . . . . .
6.4 Design . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.4.1 Assigning Iterations to Workers . . . . . . . . .
6.4.2 Elevating CILK _ PARFOR beyond syntactic sugar .
6.4.3 Stealing iterations towards load balance . . . . .
6.5 Implementation . . . . . . . . . . . . . . . . . . . . . .
6.5.1 Programming Interface . . . . . . . . . . . . . .
6.5.2 CILK _ PARFOR loop execution . . . . . . . . . .
6.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . .
6.6.1 Experimental Setup . . . . . . . . . . . . . . . .
6.6.2 Time-Bound Applications . . . . . . . . . . . .
6.6.3 Data-Bound Applications . . . . . . . . . . . . .
6.6.4 Investigating Our Overhead . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

95
95
97
98
99
100
101
101
102
102
103
111
111
112
114
115

iii

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

6.7

Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . .

116

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

118

iv

List of Figures
Figure 1.1:

An example pipeline with annotated state size, input and output rates and
gains for both modules and edges. . . . . . . . . . . . . . . . . . . . . .

4

Normalized results for seg_rand and rand_assign against seg_cache with
computation simulation disabled. We see that seg_cache works better with
zipf gain than with uniform gain. . . . . . . . . . . . . . . . . . . . . .

25

Normalized results for seg_runtime, seg_random, bin_fullest and bin_emptiest against seg_cache with computation time enabled. We see that
seg_cache performs the best when both the computation time and the gains
have a zipf’s distribution. This is due to the fact that (1) zipf’s distribution
on computation time reduces the total computation time of the pipeline,
making it more likely that the runtime is dominated by cache misses and
(2) zipf’s distribution on gains allows seg_cache more leeway to pick better cross edges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

27

seg_runtime, seg_random, bin_fullest and bin_emptiest against seg_cache
for correlated state size and computation time. . . . . . . . . . . . . . .

29

Segmentation based on both computation time and cache. We see that this
form of segmentation does a better job of load-balancing computations
when computation-based strategies dominate, but also gets most of the
advantage of cache-based segmentation when seg_cache dominates. . .

30

Normalized results seg_runtime, seg_random, bin_fullest and bin_emptiest against seg_cache when state size and computation times have been
fixed and gains are selected with a zipf distribution. . . . . . . . . . . .

31

Figure 4.1:

A pipeline example with various mappings. . . . . . . . . . . . . . . . .

39

Figure 4.2:

Architecture of the AMP IPE platform. . . . . . . . . . . . . . . . . . .

40

Figure 3.1:

Figure 3.2:

Figure 3.3:
Figure 3.4:

Figure 3.5:

v

Figure 4.3:

A full “Hello, World” pipeline programming demonstrating kernel,pipeline
and application definitions including the newly required calls to pass profiling and system information to the front-end. . . . . . . . . . . . . . .

56

AMP IPE’s abstract Mapper class which provides the interface for defining new Mappers which at a minimum requires users to implement the
create_segments and assign_segments_to_cores methods. . . . . . . . . .

57

The S EG B OTH segmentation algorithm. schedule_maxload is a helper function which determines if a segmentation can be found within a given load
and if so, stores it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

58

Kernel scheduling function within a segment. Each firing results in the
first kernel of the segment executing once before propataing it’s output as
far as possible through the segment. . . . . . . . . . . . . . . . . . . . .

59

AMP IPE’s single producer single consumer thread safe ring buffer used
for cross edges between non-replicated modules. . . . . . . . . . . . . .

60

AMP IPE’s Split and Join edges, which are responsible for trafficking data
between replicated and non-replicated kernels. Only one producer will be
connected to a split edge, while only one consumer will be connected to
a join edge, making the read and write methods unnecessary in those two
classes respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

61

The Interchange and ITSPQ classes responsible for trafficking data between two replicated kernels . . . . . . . . . . . . . . . . . . . . . . . .

62

NABBIT scheduling examples presented in two stages. A thread surrounding a
node denotes that thread is processing that node. A solid line from a to b denotes
that b is a predecessor of a while a dashed line from b to a denotes that a is in b’s
successor list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

Figure 5.2:

NABBIT C abstract class interface . . . . . . . . . . . . . . . . . . . . .

72

Figure 5.4:

Key routines to spawn predecessors and successors in NABBIT C. . . . .

90

Figure 5.5:

Extensions to the Cilk Plus RTS API to inform the runtime of the worker’s
preferred color and the colors available in a continuation. . . . . . . . .

90

Speedup for all benchmarks. x-axis: number of threads (processor cores);
y-axis: speedup over serial O PEN MP STATIC. . . . . . . . . . . . . . . .

92

Figure 4.4:

Figure 4.5:

Figure 4.6:

Figure 4.7:
Figure 4.8:

Figure 4.9:

Figure 5.1:

Figure 5.6:

vi

Figure 5.7:

Figure 5.8:
Figure 5.9:

Percentage of accesses that are to data in remote NUMA domains. We
show percentages for 20–80 cores (1–10 cores fit in one NUMA domain
and do not incur remote accesses). x-axis: core count; y-axis: Percentage
of accesses that are remote. . . . . . . . . . . . . . . . . . . . . . . . .

92

Average number of successful steals for (a) NABBIT C and (b) NABBIT.
x-axis: number of cores; y-axis: Average number of successful steals . .

93

Average idle time per processor core (across all processor cores and runs)
due to forcing the first colored steal for the heat benchmark. Error bars
show standard deviation across five runs among all processor cores. We
observed this time was the same for all benchmarks. . . . . . . . . . . .

93

Figure 6.1:

CILK _ PARFOR

user interface . . . . . . . . . . . . . . . . . . . . . . .

Figure 6.2:

The expansion of the CILK_PARFOR_BEGIN and CILK_PARFOR_END macros 104

Figure 6.3:

The main CILK _ PARFOR execution body . . . . . . . . . . . . . . . . .

105

Figure 6.4:

Iteration steal example . . . . . . . . . . . . . . . . . . . . . . . . . . .

109

Figure 6.5:

Smith-Waterman results . . . . . . . . . . . . . . . . . . . . . . . . . .

113

Figure 6.6:

Conway’s Game of Life results . . . . . . . . . . . . . . . . . . . . . .

114

vii

102

List of Tables
Table 4.1:

Runtimes (with speedup over serial in parenthesis) for DES, LZ77, Ferret,
and DNN. sr is S EG RUNTIME without replication, srr is S EG RUNTIME
with replication, sb is S EG B OTH without replication and sbr is S EG B OTH
with replication. For DES and DNN, mappers with and without replication
find the same mapping. . . . . . . . . . . . . . . . . . . . . . . . . . .

50

Table 4.2:

L2 Cache miss counts for DNN with for S EG RUNTIME and S EG B OTH .

53

Table 5.1:

Benchmark configurations and serial O PEN MP STATIC execution time. The
PageRank benchmarks use the same code with three different web crawl
datasets [18, 19, 46]. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

91

Speedup of NABBIT C over NABBIT when all tasks are assigned bad colors resulting in preferential execution of non-local tasks. S.D. denotes
standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

93

Speedup of NABBIT C over NABBIT when all tasks are assigned invalid
colors resulting in failure of all colored steal attempts. S.D. denotes standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

94

Table 5.2:

Table 5.3:

viii

Acknowledgments
This work would not have been possible without the help and support of all of my colleagues,
friends and family. I could fill the entirety of this document with thanks to everyone who’s played
a role in getting me here, indeed there could never be enough space. Although I only mention a
few of you, I am forever indebted to you all.
Thank you to my advisor, Kunal Agrawal, for taking me under your wing and providing guidance
over the years. To Sriram Krishnamoorthy, for your supervision during my time at PNNL and your
help towards securing my first first-author paper. Thank you to the rest of my committee members:
Roger Chamberlain, Chris Gill and Angelina Lee. Your encouragement and help has gotten me
through each of the milestones leading to this.
I want to thank my fellow lab members in the Parallel Computing Technology Group for all that
I’ve learned from you. Thank you Rob for all the helpful conversations that got me unstuck and
reminded me I wasn’t always alone in my work. To Adam, thanks for aways popping in with
questions or curiosity. You always kept me on my feet or pulled me away to the Kings when what
I really needed was a break.
To Michelle, who was a constant voice of support throughout every inch of this roller-coaster effort,
I could never thank you enough. We might not have become the the best dart players in the world
like we planned, but we helped each other keep our heads above water over these years. Thank
you for never letting me give up.

ix

Thank to all my friends all over the place, who’ve watched me struggle and succeed during this
whole ordeal. I relied on you all a lot more than I might have mentioned. This accomplishment is
ours together. Thank you Carl, for helping keep my head straight with everything thats happened
over the years. Josh, thanks for always having my back and giving me refuge when I needed to
escape, you guys remind me that I’m not in this on my own. To Amy, who could see me when so
many others couldn’t, thank you.
To Janine, who’s never let me drift too far out of arms length, thank you. We chose to always
be there for each other, despite the challenge of the physical distance. This accomplishment is as
much for you, and for us, as it is for me.
To my family, who’ve all been with me for every step, thank you. You believed in me well before
I did, I couldn’t have gotten here without you.
This research was supported by National Science Foundation grants CCF-1527692, CCF-143906,
and CCF-1150036, and by the U.S. Department of Energy, Office of Science, Office of Advanced
Scientific Computing Research under award number 63823.

Jordyn Chrystopher Raymond Maglalang
Washington University in Saint Louis
December 2017

x

ABSTRACT OF THE DISSERTATION
Locality-Aware Concurrency Platforms
by
Jordyn Chrystopher Raymond Maglalang
Doctor of Philosophy in Computer Science
Washington University in St. Louis, 2017
Professor Kunal Agrawal, Chair
Modern computing systems from all domains are becoming increasingly more parallel. Manufacturers are taking advantage of the increasing number of available transistors by packaging more and
more computing resources together on a single chip or within a single system. These platforms generally contain many levels of private and shared caches in addition to physically distributed main
memory. Therefore, some memory is more expensive to access than other and high-performance
software must consider memory locality as one of the first level considerations.
Memory locality is often difficult for application developers to consider directly, however, since
many of these NUMA affects are invisible to the application programmer and only show up in low
performance. Moreover, on parallel platforms, the performance depends on both locality and load
balance and these two metrics are often at odds with each other. Therefore, directly considering
locality and load balance at the application level may make the application much more complex to
program.
In this work, we develop locality-conscious concurrency platforms for multiple different structured
parallel programming models, including streaming applications, task-graphs and PARALLEL _ FOR
loops. In all of this work, the idea is to minimally disrupt the application programming model
so that the application developer is either unimpacted or must only provide high-level hints to the
runtime system. The runtime system then schedules the application to provide good locality of
access while, at the same time also providing good load balance. In particular, we address cache
xi

locality for streaming applications through static partitioning and developed an extensible platform
to execute partitioned streaming applications. For task-graphs, we extend a task-graph scheduling
library to guide scheduling decisions towards better NUMA locality with the help of user-provided
locality hints. C ILK P LUS PARALLEL _ FOR loops utilize a randomized dynamic scheduler to distribute work which, in many loop based applications, results in poor locality at all levels of the
memory hierarchy. We address this issue with a novel PARALLEL _ FOR loop implementation that
can get good cache and NUMA locality while providing support to maintain good load balance
dynamically.

xii

Chapter 1
Introduction
In line with Moore’s law, transistor counts have been steadily increasing over the years which, for
a long time, allowed chip manufacturers to make faster processors. In recent years this increase in
raw performance has not managed to keep up – despite having access to more transistors individual
processors are no longer getting any faster. To address this, manufacturers began including multiple
processing units (cores) on a single chip, ushering in an era of parallelism. Although a single core
could not complete an application faster, if that application’s workload could be divided up or
shared, developers could continue to take advantage of new hardware to run their software faster.
The recent emergence of commercial multi-processor systems, which combine multiple multi-core
processors within the same machine, demonstrates the industry’s intention to continue increasing
the parallelism a system can support.
While some applications can easily take full advantage of the increasing number of cores when
their workloads can naturally map to multiple cores, many can be challenging to implement correctly. Two of these programming challenges are particularly important determining factors when
it comes to producing high-performance applications.

1

1. Load-balance: To get good load-balance, programmers need to ensure that all cores are
performing roughly the same amount of work. Balancing the workload across the entire
system ensures that no individual core is wasting time idling.
2. Locality: To get good locality, cores should do work which uses data that is close by. This
means taking advantage of caches, re-using data when possible, and in the context of a multisocket machine which employs non-uniform memory accesses, using data which is on the
same socket. In either case, the goal is to mitigate the cost of accessing memory by ensuring
the memory accesses can be as fast as possible.

It is often the case, however, that these two challenges are at odds with one another – a good
solution for one may in fact be a bad solution for the other. When dealing with an application
who’s subtask workloads are not known ahead of time, often the best strategy is to dynamically
assign subtasks to cores at runtime. Indeed many concurrency platforms employ randomized workstealing, a provably efficient version of this dynamic scheduling. Dynamically scheduling subtasks
might be great for getting good load-balance, but can be particularly poor at getting good locality.
With no guarantee of what executes where, programmers are unable to ensure their application’s
subtasks use data that is close by, leading to a larger number of expensive memory accesses. To
avoid this, programmers could instead distribute their application’s data across the entire machine’s
memory and perform static scheduling to ensure the data needed by an individual subtask is close.
This strategy can lead to optimal locality but at the cost of being highly sensitive to workload
imbalances. Making scheduling decisions before-hand can result in a single core getting assigned
more work than others.
The overall aim of this work is to bridge the gap between these two programming challenges
and produce multi-core ready applications which get good-load balance while making decisions
to improve locality. Solving these problems generally can be difficult as there are many different
2

application classes which require different strategies to be effective. To that end, we narrow our
focus to two programming models and develop solutions for each individually.
1. Streaming Applications, where programmers connect subtasks with explicit communication channels and stream data between them. We investigate cache-conscious scheduling
of streaming pipelines ?? in Chapters ?? and provide AMP IPE, a platform for developing
parallel streaming applications in Chapter 4.
2. Dynamic Multithreading, where programmers denote opportunities for parallelism and allow the runtime system to make all assignment and scheduling decisions. In Chapter 5 we
address locality-aware dynamic task-graph scheduling ?? and in Chapter 6 we look at a novel
locality-conscious PARALLEL _ FOR loop.

1.1

Streaming Applications

Streaming is an effective paradigm for parallelizing complex computations on large datasets. At
a high level, streaming application can be described by a directed graph where nodes correspond
to computation modules, and edges represent directed FIFO channels between modules. The
modules send data in the form of messages (also called tokens or items) via these channels. In
this work, we only consider pipeline topologies where modules are connected in a linear chain via
channels. Scheduling individual computation modules on different cores allows these applications
to execute in parallel.
For streaming applications, a schedule describes two things: where (on which core) does each
iteration of each module execute; and the order of the execution of these iterations. In this work,
we will primarily concern ourselves with static schedules — schedules where a module is placed
on a particular processor and all iterations of that module execute on the same processor.
3

s

g(a):4

g(b):4

g(c):4

g(d):2

g(e):1

g(f):1

g(g):1

g(h):1

g(h):2

a

b

c

d

e

f

g

h

i

o:4

i:1 o:1
i:1 o:1
i:1 o:2
i:4 o:1
i:2 o:1
i:1 o:2
i:2 o:1
i:1 o:2
i:1 o:2
i:1
s:
s:
s:
s:
s:
s:
s:
s:
s:
g(s,a):4
g(a,b):4
g(b,c):4
g(c,d):8
g(d,e):2
g(e,f):1
g(f,g):2
g(g,h):1
g(h,i):2
g(i,t):4

t

Figure 1.1: An example pipeline with annotated state size, input and output rates and gains for
both modules and edges.
Arriving at a load-balanced static schedule for streaming applications involves finding an assignment of computation modules to cores such that the total workload across all cores is roughly
equal. Such a scheduling decision can be modeled as a bin-packing problem.
Producing a streaming application that gets good locality is not as straight-forward. In this work,
we focus on finding static schedules with good cache locality which reuse cached data as much as
possible while reducing the number of accesses to data which does not reside in cache.
The streaming paradigm has been applied to diverse application domains such as media [42], signal
processing [65], computational science [53] and data mining [34]. Several languages explicitly
support streaming semantics, including Brook [22], StreamC/KernelC [41], StreamIt [71] and more
recently RaftLib [11].

1.1.1 The Streaming Model
A streaming pipeline consists of a sequence of n computational modules, and each module u
has exactly one incoming channel (from the previous module in the sequence) and one outgoing
channel (to the subsequent module in the sequence). The modules send data in the form of messages to each other via these channels. Channels may have buffers (implementing FIFO queues)
to store messages that have not yet been consumed by the receiving module. We say that module
u precedes module v, denoted by u ≺ v, if u is before v in the sequence. We assume that the

4

incoming channel into source module s (the first module) streams an infinite amount of data into
the pipeline and the outgoing channel from sink module t (the last module) streams it out.
Each module u has an associated state; we denote the size of this state by s(u). In order to execute,
or fire a module u on a processor, the entire state of that module must be loaded into that processor’s
cache. When the module fires, it consumes in(u) data items from its incoming channel, performs
some computation, and then produces out(u) data items on its outgoing channel, where in(u) and
out(u) are static parameters of the module. A module is ready to fire its input buffer contains at
least in(u) messages.

1.1.2 Assumptions and Definitions
Throughout this work, we make the several assumptions about the streaming pipeline — these
assumptions are either necessary to admit any reasonable solution or are without loss of generality
(made only to simplify the exposition). We assume that all messages are unit size and that the
state of size each module is at most the size of the cache. The former assumption is without loss
of generality given the arbitrary input and output rates. The latter is necessary to allow a module
to be fully resident in cache when fired. In addition, we assume that the state size of each module
also counts the minimum buffer required on its incoming and outgoing channels.
Finally, we define some terms and notation used throughout the work. We use the term gain to describe the rate of amplification of messages along the pipeline. In particular, The gain of a module
is the number of times that module fires, on average, each time the source module consumes an
Q
1
input item. Therefore, the gain of the module is gain(w) = u≺w (out(u)/in(u)) × in(w)
. The
gain of edge (v, w) is the number of items produced on that edge, on average, each time an item
is consumed by the source module, and is given by gain(v, w) = gain(v) × out(v). Figure 1.1
represents an example pipeline with module and edge gains computed.
5

We call a set of consecutive modules a segment, and we denote the segment comprising modules
between u and v (inclusive) by hu,v,. i In each segment s, we call the edge with minimum gain a
gain-minimizing edge (if there is more than one, choose arbitrarily), denoted by gainM in(s).

1.2

Dynamic Multithreading

In contrast with typical thread based applications, where the user explicitly defines parallel operations which are mapped to different threads, dynamic multithreading allows users to describe
opportunities for parallelism and the decisions of what executes where is left up to runtime system. In this work we focus on Intel’s C ILK P LUS, an extension to C and C++ which offers such an
environment. C ILK P LUS adds to these languages the cilk_spawn and cilk_sync keywords which are
handled by the provided runtime system. When the user issues cilk_spawn, it informs the runtime
that the spawned work can execute in parallel with all of the code until the next cilk_sync. Applications of this type can be represented as a direct acyclic graph, where nodes represent instructions
and edges represent the dependencies between them. Figure ?? demonstrates the type of graph a
user could create with these new keywords.
Apart from offering an easy-to-use programming interface, C ILK P LUS utilizes a provably good
randomized runtime system which makes dynamic scheduling decisions to parallelize the user’s
application for them. By making scheduling decisions at runtime C ILK P LUS is able to effectively
load-balance these parallel applications resulting in good runtime performance.
Randomized scheduling decisions can, unfortunately, lead to applications with poor data-locality,
especially in the context of systems with non-uniform memory accesses (NUMA). The user is incapable of structuring their application to ensure that the data necessary for an operation is within
cache, or even within the same NUMA-domain. To resolve this, we augment these randomized
6

scheduling decisions to be locality-conscious, allowing the applications to still get good load balance and better data-locality.
We first accomplish this in a task-graph scheduling library built on top of C ILK P LUS as well as
within the runtime itself by a novel implementation We accomplish this balance, between loadbalance and data-locality, in the domain of dynamic multithreading through two avenues:

1. a task-graph scheduling library built on top of the C ILK P LUS runtime.
2. a novel implementation of PARALLEL _ FOR loops to be provided by the runtime.

7

Chapter 2
Cache-Concious Scheduling of Streaming
Pipelines
Cache efficiency is an important determinant in the performance of algorithms, and there has been
extensive theoretical and practical work in designing cache-efficient algorithms for both sequential [1, 12] and parallel [16] machines. On a shared-memory machine, the communication cost
roughly corresponds to a subset of the cache misses, but additional cache misses occur locally on
each processor and are not reflected by communication cost. We are not aware of much prior work,
particularly with a theoretical foundation, that considers the impact of cache effects on throughput
in streaming applications. One example is from Agrawal et al. [6], who focus on designing cacheefficient scheduling of streaming applications on single-processor machines. Specifically, they
show that for both pipelines and general acyclic graphs, a simple partitioning strategy generates a
schedule that is asymptotically optimal for a single-level cache.
In this chapter, we consider the question of cache-conscious scheduling of streaming applications
on parallel machines with private caches and find a static scheduling algorithm which minimizes
cache misses.

8

2.1

Model

Our theoretical analysis is with respect to the parallel external memory (PEM) model [8], which is
an extension of the external memory model [1] and similar to many other private-cache models in
the literature (e.g., in [12]). The PEM model is a computational model consisting of P processors,
each with a private cache of size M , and a global shared memory. The caches and shared memory
are organized into blocks of B consecutive addresses. A processor can only read or write data in
its own cache; when accessing data not in cache, a cache miss or I/O occurs, whereby the block
must be loaded from shared memory to the processor’s cache. The model allows a cache to store
any M/B blocks simultaneously (i.e., caches are fully associative). All communication between
processors occurs through the shared memory in the form of I/Os.
Processors may perform I/Os concurrently, which is called a parallel I/O. The complexity measure
is the number of parallel I/Os. Interpreted as time, accessing data in cache is free, but each I/O
takes unit time. Thus in a single timestep, each processor may load up to 1 block or B elements,
for a total of P blocks or P B elements. A “linear” I/O bound for, e.g., touching n elements in an
array is O(n/(P B)).
Variants of the model specify when the same data may be resident in multiple caches, but these
are not important for the present paper. We require exclusivity with respect to modules only —
the same module may not be resident in two caches simultaneously. Otherwise, our lower bound
applies to all model variants, and our upper bound does not have any data simultaneously loaded
on any caches.

9

2.2

Theory

This section gives lower bounds on cache misses when scheduling a streaming pipeline on multiple processors, which we leverage later to prove that a partitioned schedule is optimal. The
two bounds are analogous to lower bounds for other load-balancing problems. Specifically, we
first lower-bound the total number of misses, which implies a lower bound on parallel I/Os (or
time) since ≤ P occur concurrently. Our second bound addresses the case where parallelism P
is not achievable due to local imbalance, i.e., if some small part of the pipeline dominates the
running time. Combining the two gives a lower bound on running time that matches the upper
bound achievable by a partitioned scheduler. Unfortunately, and surprisingly to us, the second
lower bound is restricted to the case of a static scheduler and does not hold in general. We discuss
this limitation at the end of the section, highlighting what makes proving bounds on (non-static)
parallel streaming schedulers difficult.

2.2.1 Lower Bound on Total Misses
We first bound the total number of cache misses — the proof is inspired by the approach of Agrawal
et al. [6], but with some changes to cope with the multiprocessor setting. The basic intuition is that
any schedule (static or not) must “pay” for the messages crossing certain edges in the pipeline. The
first lemma bounds the cache-miss cost for a single processor with respect to a single edge.
Lemma 2.2.1. In the pipeline under consideration, let s =hu,v,b ie any segment with total state
size at least 2M , and let e = gainM in(s) be its gain minimizing edge. Any subschedule that
fires module v at least 2M gain(v)/gain(e) times on a particular processor p incurs at least M/B
cache misses on p. Moreover, these M/B cache misses are all due to either loading state from
modules or messages on edges within the segment hu,v,. i
10

Proof. The proof consists of two cases.
Case 1: Suppose processor p loads the entire segment hu,v,d iuring the subschedule. At most M
of that state can already be resident in p’s cache at the start of the subschedule, so p must incur at
least M/B cache misses to complete the load.
Case 2: Suppose that some module in hu,v,i is not fired by p during the subschedule. We define a
message m to be a crossing ancestor if m is consumed by a module running on processor p during
the subschedule, but m is not generated by a module on processor p during the subschedule. (If
m was generated on processor p but before the subschedule began, it is still considered a crossing
ancestor.) Since p does not fire the entire segment during the subschedule, all inputs to v during
the subschedule are the progeny of some crossing ancestor. The number of crossing ancestors is
minimized if they all occur at the gain minimizing edge e, and hence there must be at least 2M
crossing ancestors in order to fire v a total of 2M gain(v)/gain(e) times. By definition, crossing
ancestors are read by p, so each crossing ancestor must either already be resident in p’s cache
before the subschedule, or it must be loaded into p’s cache during the subschedule. Since at most
M crossing ancestors can be resident at the start of the subschedule, the remaining M crossing
ancestors incur at least M/B cache misses.

The following corollary combines 2.2.1 across all processors. The nuance here that necessitates
the new lower bound, as opposed to applying Agrawal et al.’s bound [6] as a black box, is that P
processors have P M cache in total instead of the M cache for the uniprocessor case.
Corollary 2.2.1. Consider a streaming pipeline. Let s =hu,v,b ie any segment with total size at
least 2M , and let e = gainM in(s) be its gain-minimizing edge. Any subschedule that fires v
at least 6P M gain(v)/gain(e) times in total across P processors must incur Ω(P M/B) cache
misses. In other words, Ω((1/B)gain(e)/gain(v)) is a lower bound on the amortized cost of
firing v.
11

Proof. From 2.2.1, if a particular processor p fires v a total of fp ⌈2M gain(v)/gain(e)⌉ times, then
P
it incurs at least ⌊fp ⌋M/B cache misses. It remains to prove that p ⌊fp ⌋ ≥ P , and hence the total
P
number of cache miss is p ⌊fp ⌋M/B ≥ P M/B.
Since e is the gain-minimizing edge and in(v) ≤ M , we have gain(e) ≤ gain(v) · in(v) and
thus gain(v)/gain(e) ≥ 1/in(v) ≥ 1/M . It follows that 2M gain(v)/gain(e) ≥ 2, and hence
⌈2M gain(v)/gain(e)⌉ ≤ 3M gain(v)/gain(e). Since there are at least 6P M gain(v)/gain(e)
P
P
firings in total, we have p fp ≥ 2P and hence p ⌊fp ⌋ ≥ P .
The following theorem combines the preceding corollary across the entire pipeline. Note that the
theorem identifies which edges must be paid for with respect to an arbitrary segmentation of the
pipeline (defined as “bandwidth” in [6]). Nevertheless, the bound holds for any schedule, even a
non-partitioned and non-static schedule.
Theorem 2.2.2. Consider a pipeline graph in which S = {hui , vi i} is any collection of disjoint
segments such that each segment has total size at least 2M . Then for sufficiently large T , any
parallel schedule of the pipeline that fires the sink node t at least T · gain(t) times must incur at
P
least Ω((T /B) s∈S gain(gainM in(s))) cache misses in total.

Proof. Observe that if t fires T gain(t) times, then vi fires T gain(vi ) times. Thus if T ≥ 6P M/gain(gainM in(s))
for s ∈ S, then we can apply 2.2.1 to s to get a cache-miss bound of Ω((T /B)gain(gainM in(s))).
In addition, each application of 2.2.1 and 2.2.1 only counts misses for messages/state within each
segment. Therefore, there is no double counting of cache misses.

2.2.2 Lower Bound on Time

12

2.2.2 implies that the number of parallel I/Os is at least Ω((T /(P B))

P

s∈S

gain(gainM in(s)),

since at most P misses occur in any parallel I/O. This provides a lower bound on the running time
of any schedule. We now argue that for static schedules, the gain-minimizing edge with the largest
gain also provides a lower bound.
Theorem 2.2.3. Consider a pipeline graph in which S = {hui , vi i} is any collection of disjoint
segments such that each segment has total size at least 2M , and let t be the sink node. After t is
fired at least T · gain(t) times for sufficiently large T , the running time is,
• Ω((T /(P B))

P

s∈S

gain(gainM in(s))) parallel I/Os for any schedule, and

• Ω((T /B) maxs∈S gain(gainM in(s))) parallel I/Os for any static schedule.
Proof. The first statement follows from 2.2.2. For the second statement, consider segment s =hu,v,w
ihich has the largest gain-minimizing edge. 2.2.1 shows that for this segment, if the module v stays
on one processor p, then p must incur an amortized Ω((1/B)gain(gainM in(s′ ))/gain(v)) cache
misses each time v is fired, each of which must be part of a different parallel I/O.

Somewhat surprisingly, the second lower bound deriving from the maximum does not hold for
general schedulers, only for static ones. We demonstrate by example: Suppose that the pipeline
consists of 2kM/B modules, each of size B, with all gains equal to 1. Let P = kM 2 /B be the
number of processors. For conciseness only, the following description assumes each processor has
one extra block, i.e., M + B cache size. Consider the following schedule: For each module u,
assign a M/B processors, denoted Pu , to module u. Place cyclic buffers of size M 2 on each edge.
Warm-up the schedule to get M 2 /2 messages in each buffer, which is invariant between rounds in
the remaining schedule. The schedule proceeds in rounds, in parallel on each module, consisting
of a load step followed by an execute step.
Load step for u: foreach processor from Pu in parallel, load a distinct M/(2B) full blocks from
13

u’s input buffer and also preload the corresponding M/(2B) empty blocks from u’s output buffer.
These loads occur in parallel, so the total number of parallel I/Os is M/B.
Execute step for u: foreach processor from Pu but now sequentially (since u can not be fired on
multiple processors at once), load u’s state and fire it M/2 times. In the execute step, loading
u incurs one cache miss on each processor for M/B parallel I/Os total, but the input and output
buffers are already in cache, so firing u incurs no other I/Os.
Putting it together, a single round takes 2M/B parallel I/Os but fires each module M 2 /(2B) times,
for a total of O(1/M ) per firing. Thus if the sink fires T times, then the total time is O(T /M ).
This upper bound gives an example where the maximum-based lower bound (second condition in
2.2.3) does not hold. That bound would say that the total time is Ω(T /B) which is much larger
than O(T /M ). This discrepancy arises from the fact that the example schedule moves modules
across processors. The first condition of 2.2.3 still holds, since it only states that the lower bound
on time is Ω(T k/(P B)) = Ω(T /M 2 ). Note that, setting k = Θ(B), this example only achieves a
√
runtime of O(T k/( P B) parallel I/Os, not O(T k/(P B)). There is a tradeoff here; the example
is able to beat the Ω(T /B) lower bound by increasing the total number of misses by a factor of M ,
which in turn increases running time arising from the total-work bound. Therefore, we have shown
that a non-static schedule can do better than a static schedule, however, at the cost of increasing
the total number of cache misses.
This section describes our cache-based partitioning algorithm, called seg_cache, for scheduling
pipelines on multiple processors. For correctness, this algorithm assumes that the maximum state
size of any module is at most M/6.

14

s

g(a):4

g(b):4

g(c):4

g(d):2

g(e):1

g(f):1

g(g):1

g(h):1

g(h):2

a

b

c

d

e

f

g

h

i

o:4

i:1 o:1
i:1 o:1
i:1 o:2
i:4 o:1
i:2 o:1
i:1 o:2
i:2 o:1
i:1 o:2
i:1 o:2
i:1
s:
s:
s:
s:
s:
s:
s:
s:
s:
g(s,a):4
g(a,b):4
g(b,c):4
g(c,d):8
g(d,e):2
g(e,f):1
g(f,g):2
g(g,h):1
g(h,i):2
g(i,t):4

(a) A temporary segmentation found by greedily building segments up to size
lighted.

s
o:4

M
2

t

with the minimum gain edges high-

a

b

c

d

e

f

g

i:1 o:1
s:

i:1 o:1
s:

i:1 o:2
s:

i:4 o:1
s:

i:2 o:1
s:

i:1 o:2
s:

i:2 o:1
s:

h

i:1 o:2
s:

i

i:1 o:2
s:

i:1

t

(b) The final segmentation using the cross edges found from the temporary segmentation. Note that each segment has
size at most M .

2.2.3 The seg_cache Algorithm
The algorithm produces a partitioning by first dividing the pipeline into temporary segments S =
{s1 , s2 , . . . , sk }, where each segment si = hui , vi i for i < k has total state between M/3 and M/2
— the last segment sk may be smaller. This temporary segmentation can be selected greedily:
Initially create one segment s1 as the current segment. Iterate over the modules in order. Add
the current module to the current segment. If the total state of the current segment si exceeds
M/3, create a new segment si+1 and set that to be the current segment. Under the assumption
that modules have state at most M/6, this process produces segments with total state at most
M/3+M/6 = M/2. Figure 2.1a shows this temporary segmentation for the pipeline in Figure 1.1.
Then select the minimum gain edge gainM in(si ) within each segment s1 , . . . , sk−1 , with the exception of the last segment sk . We call these edges the cross edges, as they are the edges crossing
between our final segments. That is, our final segmentation R = {hxi , yi i} is the one induced by
cutting the selected cross edges. Since each final segment spans at most two temporary segments,
each of size at most M/2, each segment in R has total state at most M . Figure 2.1b shows this final
segmentation — note that we cut the gain minimizing edge in each of the temporary segments. If
there are two edges with the same gain, we can choose arbitrarily.

15

We next load balance the set of segments R across processors as follows. We define the load of
a segment r = hxi , yi i as load(r) = gain(xi )in(xi ) + gain(yi )out(yi ), i.e., the sum of the gains
of the incoming edge and the outgoing edge. Our load-balancing also employs a simple greedy
P
strategy. Specifically, calculate the total load L =
r∈R load(r). The average load across P
processors would thus be L/P . In pipeline order, greedily place segments from R on the current
processor until the total load on the processors exceeds L/P , and then move to the next processor.
The in-order aspect of this greedy load balancing guarantees that each processor is assigned a contiguous set of segments. Therefore, for the example shown in Figure 2.1b, if we had 2 processors,
the first segment on the first processor and the last three segments on the second processor.
It remains to define buffers on cross edges and the actual schedule. To simplify the description
and analysis, we describe a version that schedules in synchronized rounds with a large period and
correspondingly large buffers. To calculate the buffers on cross edges and the period, we choose
Q
the value X = M modules i in(i) · out(i). In this way, for every edge e (and in particular for cross
edges), we have that Xgain(e) is an integer larger than M . We add buffers of size Xgain(c)
to every cross edge c. Our actual implementation (which is also synchronous) does not employ
buffers this large and the buffer sizes can be further reduced by scheduling asynchronously.
The schedule itself is periodic and divided into rounds, with synchronization between rounds.
In each round, each processor loads each of its segments exactly once in pipeline order; once a
segment is loaded, the contained modules are fired many times. Specifically, for hxi , yi i, if the
input buffer is full — that is, there are Xgain(xi )in(xi ) messages available on the incoming edge
— the segment is ready and is run.1 To run the segment hxi , yi i, execute the latest module within
the segment with enough inputs available to fire. Repeat until yi has fired Xgain(yi ) times. By
1

Otherwise, wait until the next round. This waiting occurs only in the earliest rounds until a steady state is reached
— the ith processor first becomes ready in the ith round.

16

careful construction of X, all modules j ∈ hxi , yi i have fired Xgain(j) times, and hence there are
no messages on any internal edges, and the subsequent segment becomes ready.

2.2.4 Bounding seg_cache’s Performance
We now prove an upper bound on the number of cache misses incurred on each processor with
respect to its load, which is defined as follows. Let Rp be the set of segments assigned to p by
P
seg_cache. Then the load of processor p is defined as procLoad(p) = r∈Rp load(r).
Lemma 2.2.4. In each round of seg_cache, the total number of cache misses incurred by p is
O((X/B)procLoad(p)).

Proof. When executing a segment r = hx, yi, its entire state is loaded just once, since it fits in
cache. (By assumption, the state includes a small buffers on internal edges to accommodate varying
input/output rates.) Thus when running r, the only cache misses are from loading the state initially,
and reading/writing messages from/to the incoming/outgoing cross edges. The state load costs
O(M/B). The cost on cross edges is O(Xgain(x)in(x)/B) and O(Xgain(y)out(y)/B), respectively, which sum to O((X/B)load(r)). Since Xgain(e) ≥ M for all edges e, O((X/B)load(r))
dominates. Summing over all segments r assigned to p completes the proof.

We obtain the following corollary on time by taking the max of Lemma 2.2.4 across all processors.
Corollary 2.2.2. The duration of each round is O((X/B) maxproc. p procLoad(p)) parallel I/Os.

We now upper-bound the running time of seg_cache.
Lemma 2.2.5. For sufficiently large T , the time required for seg_cache to fire the sink node t a
total of T gain(t) times is O((T /B) maxproc. p procLoad(p)) parallel I/Os.
17

Proof. If a segment fires X times its gain times each time it becomes ready, then a simple inductive argument shows that all the segments on processor i first become ready in round i. Moreover, once ready, they continue to become ready again in each subsequent round. Thus, after
Z rounds, the last processor’s segments run during at least Z − P of them, and hence t fires
at least (Z − P )Xgain(t) times. From Corollary 2.2.2, the total time required for Z rounds is
O(Z · (X/B) maxproc. p procLoad(p)). As long as T ≥ P X, choosing Z = ⌈2T /X⌉ completes
the proof.

The following theorem states that seg_cache is asymptotically optimal, when given a constant
factor memory augmentation. Note that in the theorem statement, procLoad(p) is a metric of the
pipeline — we are not just saying that seg_cache is optimal for this particular static schedule, but
rather that no other static schedule is much better.
Theorem 2.2.6. Let procLoad(p) denote the processor loads arising from an execution of seg_cache on the pipeline. Then for sufficiently large T , every static schedule on a machine with cache
size of M/6 requires a runtime of Ω((T /B) maxproc. p procLoad(p)) parallel I/Os to fire the sink
node t a total of T gain(t) times.

Proof. Consider the temporary segmentation S chosen by seg_cache, and let C be the set of gainminimizing cross edges selected. Each of the segments (except the last which contributes no
cross edge) has size at least 2(M/6), and thus applying 2.2.3 for a machine with M/6 allows
P
us to conclude that Ω((T /(P B)) c∈C gain(c) + (T /B) maxc∈C gain(c)) is a lower bound on the
runtime of any static scheduler.
We next show (1/P )

P

c∈C

gain(c) + maxc∈C gain(c)) ≥ (1/2) maxproc. p procLoad(p), which

completes the proof. Cross edges contributes to the load of two segments, so the total load is
P
L = 2 c∈C gain(c). The seg_cache algorithm adds at most one segment to a processor after
it reaches L/P load. Each segment borders two cross edges, so the load of any segment is at
18

most 2 maxc∈C gain(c). Thus the maximum load on any processor is maxproc. p procLoad(p) ≤
P
(2/P ) c∈C gain(c) + 2 maxc∈C gain(c).

2.3

Conclusions and Future Work

In this chapter, we studied cache-conscious scheduling of streaming pipelines on machines with
private caches. We’ve presented seg_cache which finds static schedules for streaming pipelines by
performing a gain-minimizing partitioning.
Although we focused on reducing misses in the last-level private cache, we could extend our algorithm to minimize misses at all levels of the memory hierarchy. A natural progression for future
work would be to deploy streaming pipelines on NUMA machines while reducing the number of
remote memory accesses.
Another avenue for future work would be expanding our algorithms to support different classes
of streaming applications. Allowing for more expressive structures like series-parallel or general
directed acyclic graphs would enable us to handle a larger range of application types.

19

Chapter 3
Executing Streaming Pipelines
In this Chapter, we evaluate the performance of our cache-based scheduling algorithm seg_cache
against a number of other viable static scheduling algorithms for streaming pipelines. Although
the direct aim of seg_cache is to reduce cache misses, our final goal is to produce applications with
higher throughput which processes a fixed set of input as fast as possible.
To perform this evaluation, we constructed an executing environment to emulate the execution
of randomly generated streaming pipelines. The environment takes in: 1) a pipeline definition
2) a scheduling strategy, and 3) the amount of input to process. Our environment emulates this
execution on real hardware. By comparing multiple scheduling strategies on the same pipeline,
we’re able to directly compare the impact of each strategy on the resulting performance.

3.1

Scheduling Policies

We evaluated several static schedulers. Each schedule is characterized by 3 features: (1) The
segmentation. We describe all of our schedules in terms of segments — a set of consecutive
modules restricted to the same processor. Note that a segment may contain anything between
one module and all the modules. How segments are chosen varies by scheduling policy. (2)
20

The processor assignment. Each schedule defines how the segments are (statically) assigned to
processors. Multiple segments may be assigned to the same processor. (3) Cross-edge buffer
allocation. Each scheduling policy also defines the size of buffers to place on cross edges — the
edges that go between segments.
All the other details for all policies are similar. For any internal edge e from module u to module v
of the same segment the buffer of size 2lcm(out(u), in(v)) — that is the least common multiple of
the number of items produced and consumed on the edge — is allocated.2 For all policies except
seg_cache, the cross edge buffers are also assigned in the same manner. In all the computationbased policies (seg_runtime, bin_full and bin_empty), before load-balancing, we normalize the
modules computation cost based on the gain of the module. The scheduling policies compared in
this section are defined as follows:

1. seg_cache: Segmentation based on cache (our policy) as described in 2.2.3. Each cross edge
e has a buffer of (M/2)gain(e).
2. random_assignment: Segments consists of single modules and are assigned to processors at
random.
3. seg_random: Random segmentation. A single segment is created per processor by randomly
picking p − 1 edges to be cross edges.
4. seg_runtime: Segmentation based on computation cost. In this policy a single segment is created per processor minimizing the maximum number of compute steps on a single segment
by using a greedy load-balancing policy analogous to that of seg_cache.
5. bin_full: Fullest first binpacking on computation. Assign modules to a segment until their
cumulative computation is just smaller than 1/P fraction of the total computation, creating
2

This internal buffer could be sized out(i) + in(i + 1), but the larger quantity allows for easier scheduling and
does not affect the performance.

21

P large segments, one for each processor. The remaining modules are in their own segments
and assigned greedily to the processor with the smallest overall compute cost. This policy
often results in a segmentation nearly identical to seg_runtime.
6. bin_emptiest: Emptiest first binpacking on computation. Each segment consists of a single
module. We greedily assign segments to the processors with the fewest total compute steps.

3.2

Pipeline Execution

We built an execution engine that takes in a pipeline and the static-schedule description and executes the pipeline according to that schedule. The execution engine runs exactly one thread on
each processor. The master thread begins execution and is responsible for parsing the schedule
and spawning all the remaining worker threads and pinning them on their designated processors.
These worker threads are responsible for executing the segments assigned to respective processors. The master thread executes the source node, generating all the subsequent messages to be
processed, and the sink node, consuming all the messages after processing.
In a loop, each worker thread finds a segment that is ready — the buffer on the input cross edge
is at least half-full and the buffer and the output cross edge is at least half-empty. This segment
is then loaded into memory. At this point, this segment is executed until either the buffer on the
input cross edge is empty or the buffer on the output cross edge is full. If there are multiple ready
segments, the worker thread always executes the last (in pipeline order) ready segment. Once a
ready segment is picked for execution, it is executed until either the buffer on its incoming cross
edge is empty or the buffer on its outgoing cross edge is full. When a segment is executed, its
internal modules are fired one at a time. Any schedule of firing its internal modules is valid — in
our execution engine, the last ready module (in order of the pipeline) is always fired.

22

The internal-edge buffers between modules within a segment are implemented as simple FIFO
queues. No synchronization is required on internal edges because all modules within a segment
are consecutive and are scheduled on the same processor. Neighboring segments can, however, be
scheduled on different processors, and we thus use thread-safe single-producer single-consumer
lock free ring buffers for communication through cross edges between segments

3.3

Experiments

We now experimentally compare the running time of our strategy against several other static policies when executing randomly generated pipelines drawn from distributions with varying characteristics. The experiments indicate that under many conditions, cache-based segmentation is able
to improve the overall running time, while under other conditions, computation time based partitioning performs better. We also investigate a heuristic that combines computation and cache
misses as a basis of partitioning, and find that it often provides the best of both worlds.

3.3.1 Pipeline Generation
Our experimental evaluation consists of running each schedule on randomly generated pipelines.
We fixed the total number of modules to 140 in order to allow for sufficient differences in the
schedules generated. For each pipeline, we randomly generate three sets of parameters: (1) gain
of each edge, (2) state size of each module, and (3) computation requirement of each module. The
gain of each edge is picked randomly between [1, M/16], where M is the size of the cache. The
upper limit is picked so that the maximum gain is pretty large, but still allows all non-cross edge
buffers to fit in memory. The state size of each module is picked randomly between [1KB, 32KB],

23

again in order to allow the each module to individually fit in cache. Finally, the module computation times are selected from [0.5µs, 50µs]. We used two kinds of distributions for generating our
parameters — namely uniform distribution and zipf’s distribution, [76]. Zipf’s is a heavy tailed
distribution with a probability density is given by P (x) =

x−p
.
ζ(p)

Here ζ is the Riemann zeta func-

tion and p is a positive parameter which controls how heavy the tail is. For our experiments we use
p = 1.5

3.3.2 Experimental Settings
All of our experiments were run on Intel Eight Core Xeon E5-4620 2.2Ghz processors with 8-way
set associative L2 caches of M = 256KB and block size B = 64 bytes. Each message is the size
of a word (4 bytes) and the source node generates 16384M (16384 × 1024 = 16777216) messages
in total during each execution. This large number ensures that the overhead of loading and clearing
the pipeline is low compared to the total execution time — therefore the total execution time is a
reasonable approximation of the steady-state throughput.
Each chart consists of a set of trials using the same parameters to randomly generate the pipelines.
In all charts, the y-axis is running time — lower is better. A vertically aligned set of points plot
the running times of each scheduler on the same pipeline. The x-axis is the running time of seg_cache. By construction, the running time of seg_cache fits a line, which is shown. If most of the
points for a particular strategy fall above the line, then that strategy is generally worse than cachebased segmentation; if the points generally fall below the line, then it is better than cache-based
segmentation.

24

seg_cache

900

seg_rand

rand_assign

seg_cache

seg_rand

rand_assign

2500

800
2000

600

Runtime

Runtime (s)

700

500
400

1500

1000

300
200

500

100
0
120

140

160

180

200

220

240

260

280

0
650

300

700

750

800

850

Seg_Cache Runtime (s)

Seg_Cache Runtime (s)

(a) Zipf gain, uniform state, uniform computation

(b) Uniform gain, uniform state, uniform computation

Figure 3.1: Normalized results for seg_rand and rand_assign against seg_cache with computation
simulation disabled. We see that seg_cache works better with zipf gain than with uniform gain.

3.3.3 Disabled Computation
We first conduct a simple experiment to see if cache performance has any effect on runtime. We set
the computation time for all modules to 0 and compare seg_random and rand_assign scheduling
policies against seg_cache (the other policies depend on computation time).
Figure 3.1a shows the results for pipelines that have been generated using a zipf distribution for
the gains and a uniform distribution for state size. We see that seg_random generally requires at
least twice as much time to complete as seg_cache. While seg_random is able to take advantage of
the temporal locality arising from segmentation in general, a random selection does not properly
reduce the cost on cross edges, resulting in some bottlenecked processor. Since rand_assign lacks
this temporal locality, it performs even worse. In Figure 3.1b we see that when a uniform distribution is used for gain selection seg_random’s performance is much closer to seg_cache’s. This is
due to the fact that seg_cache concentrates on cutting low gain edges; with uniform distribution,
there is little difference between a “good edge” (low gain edge) and a random edge. Therefore, the
primary advantage of seg_cache is reduced compared to seg_random. Due to the lack of temporal

25

locality, random_assign still performs much worse. In subsequent experiments, we will omit the
random_assign policy since it almost always much worse than the others.
This experiment provides evidence that cache performance of scheduling policy can play a role
in determining the running time. In addition, a uniform distribution on the gains decreases the
effectiveness of a cache-based segmentation due to the increased number of expensive edges which
makes it difficult to pick “cheap” cross edges. Zipf’s like heavy-tailed distributions commonly
occur in practice in many circumstances and some studies [39, 40, 48] claim that the sizes of Unix
jobs follow a heavy-tailed distribution. Therefore, it seemed reasonable to try these distributions
in our experiments.

3.3.4 Randomly Generated Computation Times
The remainder of the experiments will consider computation times. Figure 3.2 shows the results
when state sizes are selected uniformly at random, but the distribution of the gain and computation
varies. As in the case where there is no computation, we again see, in Figures 3.2a and 3.2b,
that uniform gains are not particularly good for seg_cache and computation-based segmentation
seg_runt performs as well or better. The reason is similar as in the case of disabled computation
— when the gains are uniformly distributed, seg_cache has very little ability to pick better cross
edges (smaller gain cross edges) than other policies.
We see another interesting effect here — seg_runt is better with respect to seg_cache when computation is distributed uniformly (Figure 3.2a), while it is comparable to seg_cache when computation
is generated using the zipf’s distribution (Figure 3.2b). This is due to the difference in distributions: Our uniform distribution has a higher mean than the zipf distribution; therefore, the overall
runtime is larger with the uniform distribution than it is with the zipf’s distribution. Therefore,

26

seg_cache

seg_runt

seg_rand

bin_full

bin_empty

seg_cache

seg_runt

seg_rand

bin_full

bin_empty

3500

2500

3000
2000

Runtime (s)

Runtime (s)

2500
1500

1000

2000
1500
1000

500
500
0

0
900

1000

1200

1300

1400

750

800

850

900

950

1000

1050

1100

1150

1200

Seg_Cache Runtime (s)

Seg_Cache Runtime (s)

(a) Uniform Gain, Uniform State, Uniform Computation

(b) Uniform Gain, Uniform State, Zipf’s Computation

seg_cache

seg_runt

seg_rand

bin_full

bin_empty

seg_cache

1200

1200

1000

1000

800

800

Runtime (s)

Runtime (s)

1100

600

400

200

200

200

250

300

0
120

350

seg_rand

bin_full

bin_empty

600

400

0
150

seg_runt

140

160

180

200

220

240

Seg_Cache Runtime (s)

Seg_Cache Runtime (s)

(c) Zipf’s Gain, Uniform State, Uniform Computation

(d) Zipf’s Gain, Uniform State, Zipf’s Computation

Figure 3.2: Normalized results for seg_runtime, seg_random, bin_fullest and bin_emptiest against
seg_cache with computation time enabled. We see that seg_cache performs the best when both
the computation time and the gains have a zipf’s distribution. This is due to the fact that (1)
zipf’s distribution on computation time reduces the total computation time of the pipeline, making
it more likely that the runtime is dominated by cache misses and (2) zipf’s distribution on gains
allows seg_cache more leeway to pick better cross edges.

27

computation cost is more likely to dominate the performance with uniform distribution on computation and hence, a policy that load-balances computation performs better. On the other hand,
under the zipf’s distribution, the computation of most modules is small; therefore, the overall computation is small and the cache effects can become more prominent and seg_cache can start to have
some advantage.
In comparison, (just as with disable computation) when the gains are zipf’s distributed, in Figures 3.2c and 3.2d, we see that seg_cache generally outperforms the other policies. In particular,
even when the computation is uniformly distributed (Figure 3.2c) the ability to pick good cross
edges allows seg_cache to perform better than other policies. Again we see in Figure 3.2d that
when computation time is selected with a zipf distribution, cache performance plays an even higher
role and seg_cache performs the best.
We ran other combinations of distributions (omitted due to space constraints) and these general
trends seem to hold. The state-size distribution generally does not seem to have an appreciable
effect on the relative performance of policies.

3.3.5 Correlated Computation Time
Thus far our experiments have selected all gains, state sizes and computation times independently.
It is, however, often the case in practice that state size and computation time are correlated. Figure 3.3 shows the results when we pick the state s using the uniform distribution, and then the
computation time of the module is its (s/maxState) × maxComp where maxState = 32KB and
maxComp = 50µs. Again we see that seg_cache outperforms all scheduling policies in the majority of cases. We see that the results are quite similar to Figure 3.2c, which is not surprising since
again state size and computation are uniformly distributed. In fact, the distribution on computation times is not identical; the range of computation times for the un-correlated experiment shown
28

seg_cache

seg_runt

seg_rand

bin_full

bin_empty

1200

Runtime (s)

1000
800
600
400
200
0
140

160

180

200

220

240

260

280

300

Seg_Cache Runtime (s)

Figure 3.3: seg_runtime, seg_random, bin_fullest and bin_emptiest against seg_cache for correlated state size and computation time.
(Figure 3.2c) is [0.5µs, 50µs] while for the correlated experiment it is [1.56µs, 50µs]. Therefore,
the overall computation time of these correlated pipelines generally larger and therefore, we would
expect seg_cache to perform worse. However, since seg_cache balances the state sizes in its segmentation, due to the correlation, in this case, it appears to also manage to balance the computation
times to a certain extent, leading it to perform better than other strategies.

3.3.6 Segmentation Based on Computation Time and Cache
These experiments indicate that segmentation is good, since either cache-based segmentation (seg_cache) or computation-based segmentation (seg_runt) generally dominates the other policies. Therefore, we experimented with a new algorithm, seg_both: a segmentation strategy that considers both
computation and cache misses. For this policy, we first estimate the time for a cache miss (we found
it to be about 3 nanoseconds). We then calculate the normalized load for all possible segments hx,
y,t ihat fit in cache by including both the time spent incurring cache misses and the time spent on

29

seg_cache

seg_runt

seg_rand

bin_full

bin_empty

seg_both

seg_cache

seg_runt

seg_rand

bin_full

bin_empty

seg_both

200

220

240

1200

2500

1000

Runtime (s)

Runtime (s)

2000

1500

1000

800
600
400

500

0
800

200

900

1000

1100

1200

1300

0
120

1400

140

160

180

Seg_Cache Runtime (s)

Seg_Cache Runtime (s)

(a) Uniform Gain, Uniform State, Uniform Computation

(b) Zipf’s Gain, Uniform State, Zipf’s Computation

Figure 3.4: Segmentation based on both computation time and cache. We see that this form of
segmentation does a better job of load-balancing computations when computation-based strategies dominate, but also gets most of the advantage of cache-based segmentation when seg_cache
dominates.
modules’ computation. We then greedily assign contiguous segments to processors so as to minimize the maximum load per core via a binary search algorithm. Since this algorithm considers
cache misses, the buffers on cross edges are assigned in the same manner as seg_cache.
Figure 3.4 shows the results for this policy for both the case of uniform gain, uniform computation
and for zipf gain, zipf’s computation. If we compare Figure 3.4a to Figure 3.2a, we see that
in the case where computation-based segmentation is better than cache-based segmentation, this
combined segmentation policy beats both seg_cache and seg_runt. On the other hand, comparing
Figure 3.4b to Figure 3.2d indicates that when cache-based segmentation is better, this combined
segmentation does almost as well as seg_cache. Therefore, this policy seems to provide the best of
both worlds.

30

seg_cache

seg_runt

seg_rand

bin_full

bin_empty

900
800

Runtime (s)

700
600
500
400
300
200
100
0
140

145

150

155

160

165

170

175

Seg_Cache Runtime (s)

(a) Uniform Gain, Small State and Computation

seg_cache

seg_runt

seg_rand

bin_full

bin_empty

seg_cache

seg_runt

seg_rand

bin_full

bin_empty

4000

1000
900

3500

800
3000

Runtime (s)

Runtime (s)

700
600
500
400

2500
2000
1500

300
1000
200
500

100
0

0
120

140

160

180

200

220

240

400

500

600

700

800

900

1000

Seg_Cache Runtime (s)

Seg_Cache Runtime (s)

(b) Uniform Gain, Med. State and Computation

(c) Uniform Gain, Large State and Computation

Figure 3.5: Normalized results seg_runtime, seg_random, bin_fullest and bin_emptiest against
seg_cache when state size and computation times have been fixed and gains are selected with a
zipf distribution.

31

3.3.7 Fixed State and Computation
In order to better understand these scheduling policies, we conducted experiments where the module state sizes and computation times have been fixed and only the gains vary. Figure 3.5 shows
M
three experiments, using ‘small’,‘medium’ or ‘large’ values, where state ∈ { 1024
,M
, M } and
16 2

computation ∈ {1, 10, 50}µs. In all experiments, seg_cache performs as well as or better than the
other policies, since balancing the state sizes automatically balances the computation as well.
In Figure 3.5a both bin_fullest and seg_runtime are able to perform nearly as well as seg_cache.
Note that the total state of the entire pipeline is small. Since both these policies create approximately 1 segment per processor and these segments fit in cache. Therefore, these policies get most
of the advantage of seg_cache since they never have to pay to reload the state and only pay for 2
cross edges per processor. bin_empty is significantly worse since it has too many cross edges. As
we increase the state sizes, the modules assigned to each processor no longer fit entirely in cache
and the cost of loading segments starts mattering.
Another interesting trend is that bin_empty starts performing better as the computation and state
size increase. The reason for this is subtle. Since seg_runt and bin_full essentially perform segmentation based on computation cost, they put large contiguous segments on each processor as the
state size and computation increases. At the largest size, each processor’s cache can only fit 1 or
2 modules at a time. Therefore, both these policies load a single module, execute it, and then load
the next module, and so on, paying a large cache cost for each firing. On the other hand, bin_empty
distributes contiguous modules across processors with each segment consisting of a single module
— therefore, for large state sizes, it is essentially doing what seg_cache does, but without the advantage of large buffers on cross edges. This policy potentially allows it to execute each module
many times before loading the next module.

32

3.4

Related Work

Most existing work on scheduling streaming computations computation costs of the modules only
and tries to balance to computation across processors in order to maximize throughput. Bokhari
solved the throughput optimization problem for pipeline mapping by finding a minimum bottleneck path in a layered graph that contains all information about application modules [17]. Hansen
et al. later improved Bokhari’s solution using dynamic programming [38]. Subsequently, many
efficient algorithms have been proposed, both for homogenous processors [29, 59, 62] and heterogenous processors [15]. Researchers have also considered distributed memory communication
models of various kinds to understand the effect of communication on throughput of pipeline computations [2, 3]. Researcher have also considered the problem of maximizing throughput and minimizing latency as a bi-criteria scheduling problem [13, 14]. In addition, replication of state-less
modules in order to improve throughput has also been studied [30,44,50,68]. None of this research
explicitly addresses the problem of minimizing the number of cache misses.
Heuristic cache-aware scheduling of streaming programs on both single processors and multiprocessors has been studied by several research groups [43, 56, 66]. Since all of these algorithms are
based on heuristics, they do not guarantee optimality with respect to the number of cache misses.
However, their empirical results support our claim that optimizing the number of cache misses can
lead to significant improvement in performance. The only prior theoretical work that considers the
number of cache misses for streaming applications considers only single processor schedules in
both the cache-aware [6] and cache-oblivious [5] setting.

33

3.5

Conclusions and Future Work

In this Chapter, we evaluated the impact of scheduling policies on the performance of streaming pipeline applications. In most of the experimental benchmarks, scheduling based solely on the
modeled cache effects is indeed significantly more effective than the more common load-balancing
of computation costs. In the case that computation cost dominates the workload, a mixed strategy
that takes into account both cache misses and computation cost outperforms one designed to balance the computation. We conclude that the cost of cache misses should not be ignored when
designing scheduling strategies for streaming pipelines.
In Chapter 4 we generalize our execution engine and investigate how our scheduling policies impact real-world applications.

34

Chapter 4
AMPipe: Automatically Mapping
Streaming Pipelines

4.1

Intro

In Chapter 2 we explored static scheduling strategies for streaming pipeline applications on parallel machines with private caches. We presented a prototype system in Section ?? which utilized
generic configurable kernels which allowed us to test on a range of pipeline applications types
with varying state, runtime and edge gain properties. While these experiments effectively captured
a number of classes of applications, they were nevertheless synthetic and could not perfectly represent the real-world. To that end, our aim in this chapter is to integrate the mapping and scheduling
systems into a full streaming application platform allowing us to implement real-world applications
for experimentation.
We thus present an extensible framework, AMP IPE, for developing streaming pipeline applications. AMP IPE evolved as an extension of S CALA P IPE which allows users to easily develop
streaming-pipeline applications for heterogeneous systems where different kinds of hardware may
be used for individual computation kernels. AMP IPE’s focus is instead on applications targeting
35

chip-multi-processors and it therefore has a different set of design goals. The programmers simply
provide all the modules of the pipeline and their connecting edges but they do not have to specify
the details of the schedule such as which module(s) run on which processor and how the modules
communicate with each other. The framework provides a number of the mapping algorithms described in Chapter 2 which automatically decide which module to schedule on which processor.
In addition, AMP IPE provides the runtime scheduler used in our prototype that automates the firing of modules and orchestrates the communication between them. AMP IPE is designed to be an
easy-to-use, extensible streaming-pipeline system which produces parallel applications without requiring users to be concerned with traditionally difficult tasks like synchronization, communication
or deadlock.

4.1.1 Contributions
AMP IPE has the following features:
Programming interface and profiling: AMP IPE provides a simple programming interface where
the user can simply specify each module by specifying its computation; the number of elements it
reads and writes in its input and output edges each time it fires; and the types of these elements. In
addition, the programmer separately specifies the pipeline topology by connecting these modules.
Therefore, the same module may be used in various pipelines easily. In order to create good
mappings, AMP IPE requires some additional information about kernels. AMP IPE also provides
profilers to allow users to collect this information about each module easily.
Automatic Mapping and Replication: Given a pipeline, AMP IPE provides automatic mapping
algorithms that try to maximize the throughput of the computations. In addition, users can also
write their own mapping algorithms easily in AMP IPE. We now describe a couple of automatic

36

mappers AMP IPE provides with an example; while there are other mapping algorithms provided
with AMP IPE, we find that these perform the best.
Consider the pipeline in Figure 4.1a with 6 modules that we want to schedule on 4 processors.
The number in each kernel shows the normalized computation requirement of that kernel and the
numbers on the edges represent the normalized data volume that crosses this edge on average.
The first, and simplest, mapping algorithm, called S EG RUNTIME, simply tries to load-balance the
modules across all processors based on their computation requirements — the mapping is shown
on the Figure 4.1b for 4 processors. Each shaded region executes on a single processor. In this case,
The first, second and third modules execute on their own processor while the last three modules
execute on the last processor. However, this mapper does not take the cost of cache misses into
consideration at all. Note that the second edge carries more data than all the other edges. With the
mapping created by S EG RUNTIME, all the data on this edge causes cache misses on private caches
since it must be sent from the first processor to the second processor. The next mapping algorithm,
S EG B OTH, takes both computation and cache misses into consideration and for this example it
creates the mapping shown in 4.1c — note that it puts the first and second module on the same
processor in order to ensure that the high-volume edge is “not cut” and the data on this edge stays
in cache.
Note that both of these mappings are quite load-imbalanced since module C is the bottleneck module and they can only use one processor to work on this module; therefore, all the other processor
will spend a bulk of their time idling waiting for the processor running this module to produce/consume data. In order to solve this, we incorporate replication in both S EG RUNTIME and S EG B OTH.
Figure 4.1d shows the mapping we add replication to S EG B OTH. Here the bottleneck module has
two copies, and each runs on different processor. Half the data is processed on each copy making the average load of each copy 6. Not all modules are replicable — AMP IPE allows users to

37

specify which modules are replicable and then automatically finds the best mapping considering
replication.
Runtime Scheduling, Coordination and Communication: One the mapping is created, AMP IPE
generates code for all the communication and synchronization between modules. For instance, the
streaming abstraction assumes that each edge has an infinite buffer; however on a real platform,
we must use finite buffer sizes. AMP IPE automatically uses appropriate buffer sized based on
the situation. The edges between modules that execute on the same processor have small buffer
sizes to both conserve space and ensure that all the data on these edges remains in cache. On the
other hand, the edges that go between processors have larger buffer sizes (based on the mapping
algorithm) to reduce idle time. In addition, AMP IPE also has 5 types of edges and picks the
correct type of edge based on the situation. In particular, if we have an edge where the source and
the destination modules are on different processors, then AMP IPE uses a thread safe ring buffer.
More interestingly, AMP IPE automatically orchestrates communication for replication. Note that
in the presence of replication, the non-replicated modules should still see data in the correct order
— that is, the FIFO order in the absence of replication. AMP IPE automatically distributes and
collects data to ensure this correct ordering. For instance, in mapping shown in Figure 4.1d, it
automatically ensures that half the data goes to one copy and the other half to the other copy of
module C. In addition, it also puts the data back together in the right order so that module D sees
data in the same order as it would have if C hadn’t been replicated.

4.2

Design and Implementation

In this section we present a design and implementation of AMP IPE, a streaming-pipeline application development system that is easy to use and doesn’t require the users to deal with mapping,
communication and synchronization. In Figure 4.2 we show the overall structure of AMP IPE
38

(a)

A
3

B
16

C
1

D
1

12

3

E

F
1

1
1

1

1

(b)

16

1

1

1

1

(c)

16

1

1

1

1

(d)

16

1

1

1

1

Figure 4.1: A pipeline example with various mappings.
which comprises of three separate parts : 1) a front-end which parses the user’s pipeline, and allows the user to profile the pipeline, and generates the code for the pipeline (given a mapping); (2)
a mapping pass that calculates the good mapping and this mapping is used in code generation; and
(3) a runtime system which schedules and executes the application using the mapping.

4.2.1

AMP IPE front-end

AMP IPE’s front-end is an extension of Scalapipe. It provides the programming interface, the
profiler and code generation.

Programming interface The aim of AMP IPE’s programming interface is to provide users the
ability to easily create streaming pipeline applications at a high level. We adopt most of the programming interface from S CALA P IPE [74]. However, unlike S CALA P IPE, users do not have to
provide a mapping and do not have to indicate which edge implementation to use for communication between modules; AMP IPE automatically calculates these. In order to calculate these,
AMP IPE does require some information about each kernel, namely its runtime and its memory
footprint each time it fires.

39

Figure 4.2: Architecture of the AMP IPE platform.
In Figure 4.3 we show a full “Hello, world!” application. The users responsibilities are to 1)
define each individual computation kernel 2) describe the pipeline structure by chaining kernels
together and 3) provide information about the available computing resources. This last piece allows
AMP IPE to compute a good mapping and includes the number of cores available and the size of
the highest-level private cache (Lines 28 and 29). The cache size is only needed if the user wants
to use a mapping algorithm that takes cache misses into consideration.
Each kernel’s definition includes the data type and size of their input and output as well as the
number of data items read and written each time the kernel fires. (Source and sink kernels are
exceptions since they lack output and input edges respectively.) The pipeline in Figure 4.3 contains
two kernel definitions BarKernel and BazKernel respectively. Since BarKernel is a source
kernel, it only provides the output edge, and indicates that it writes one item of type STR to its
output edge each time it fires (Lines 6 and 7). Once declared, the output is simply used as a
variable within the kernel; sending data along the edge is accomplished by assigning a value to

40

the variable (Line 9). In addition, it indicates that its running time is 1ns and that its state size
is 64 Bytes (Lines 4 and 5). Finally, the kernel specifies its computation which can be arbitrary,
but in this case it must contain one instruction that says out = .. to ensure that some string
gets written to the output edge. Similarly BazKernel is a sink kernel; Lines 17 and 18 show
its computation and memory requirements, Lines 19 and 20 shows how it reads from its incoming
edge, and Lines 21—23 are its computation.
The actual pipeline is created in Lines 31 and 32 which indicate that BarKernel is the first
kernel and sends its input into BazKernel. Line 34 initializes the pipeline itself. Users may
also utilize external functions within their kernel allowing users to take advantage of available
C/C++ programming libraries. A simple example is shown in Lines 11—15. This functionality
is must more flexible and powerful than shown in this example and dramatically simplifies the
development process; in fact most of the applications used in this paper take advantage of external
C++ functions.

Profiling the Pipeline As mentioned above, to do automatic mapping, AMP IPE needs information about each kernel’s runtime and memory usage. If the user does not already have the
necessary profiling information, they can use AMP IPE’s provided utilities to collect it by running
a profiling pass which runs the pipeline and produces runtimes and memory footprint for each kernel and saves intermediate object files. AMP IPE includes two python utilities that will read this
saved information and display it in a readable format so that the user may plug it into their kernel
definitions.

Code Generation

The final component of AMP IPE’s front end involves generating the code for

the user’s application. AMP IPE adopts utilities from S CALA P IPE for producing the code for each
individual kernel. AMP IPE has two main additions to the code generated: 1) the edge library,
41

which the different kinds of communication channels that may be used, and 2) the underlying
runtime system which handles the scheduling and execution of the pipeline. The details for these
additions are presented in Section 4.2.3.

4.2.2 Mapping and Code Generation
Mapping and scheduling algorithms are an important component of AMP IPE. AMP IPE provides a
few default mappers that are very good at providing good load balance. The automatic mapping of
pipelines allows users to quickly develop and deploy applications by removing the task of making
assignment decisions from their workload. In addition, AMP IPE mapping framework utilizes a
simple interface allowing users to easily define their own mapping algorithm.
AMP IPE’s mapping framework is based on the idea of segmentation. A segment is a set of contiguous modules restricted to the same processor. Note that a segment may contain anything between one module and all the modules — therefore, this policy can be quite general. How segments
are chosen varies by scheduling policy. Grouping kernels in this way allows us to create a distinction between the expensive synchronized cross edges connecting neighboring segments and the
cheap internal edges connecting kernels within a segment. AMP IPE automatically calculates the
buffer that must be allocated to internal edges using the neighboring kernels input and output rates;
it basically calculates the smallest buffer size necessary to prevent deadlock. The mappers decide
how to size buffers for cross edges. Here we first briefly describe two mapping default mapping
strategies provided by AMP IPE (previously discussed in Section 2.2.3). We then discuss how
AMP IPE provides replication of modules. Finally, we explain briefly how users can write their
own mappers in AMP IPE.

42

S EG RUNTIME: Evenly distributing work In many compute bound applications it is sufficient
to find a mapping of kernels to cores that minimizes the total work assigned to any one computing
resource. In particular, S EG RUNTIME finds exactly P segments such that the maximum load of any
segment is minimized over all such segmentations. It uses dynamic programming to calculate such
segments. Note that we can not simply use the computation time given by the user directly. Since
each kernel may have different input and output rates, different kernels may fire different number
of times over the course of the entire computation. Therefore, we first compute the normalized
load and use this in the dynamic program. It is quite easy to see that using this strategy is within a
factor of 2 of optimal assignment if runtime is the only factor. In addition, S EG RUNTIME makes
the cross edge buffer large enough constant (in our case, it is 100 times the size of the data on the
edge) that modules don’t block due to lack of data on the cross edges.

S EG B OTH: Minimizing total load

In Chapter 2 we show that for computations which are not

compute-bound – those which employ a large amount data movement – relying solely on kernel
workload may not be sufficient for finding mapping with good throughput. In particular, how we
map an application has an impact on how many cache misses the application incurs and if we
only consider runtime, then we can massively increase the number of cache misses incurred by the
application, causing a reduction in throughput. To see this, consider again the example application
shown in Figure 4.1. If we only consider runtime, then we may put modules A and B in different
segments and therefore they will execute on different processors. This means that the edge from A
to B is a cross edge and all the data going from A to B has to be written out of provate caches and
read back in. Since the edge from A to B carries a lot of data, this causes many cache misses. On
the other hand, if we keep A and B in the same segment and carefully schedule the segment then the
data going from A to B can remain within the private cache of this processor avoiding these cache
misses. Therefore, intuitively, we want to avoid (if possible) segment boundaries (cross edges)
at expensive high-volume edges. S EG B OTH, like S EG RUNTIME, also tries to create segments
43

while minimizing the maximum load across all segments; however, it considers both runtime and
cache misses when calculating the load. In particular, it first profiles the machine to calculate the
cost of each cache miss. Then, at a high-level, for each possible segment, it calculates the total
load by summing the runtime of the segment with the time it takes due to cache misses when it
reads data from incoming cross edge and writes to the outgoing cross edge. It then uses dynamic
programming to find the best possible segmentation. It does so by guessing at the load, and then
trying to find a segmentation with that or smaller load using dynamic programming. It wraps this
in a binary search for the best load. In addition, it allocates size M buffers on each cross edge
(where M is the size of the last level cache as indicated by the programmer or by profiling).

Exploiting data-parallelism

Distributing kernels to different cores exploits the explicit task-

parallelism described by the pipeline. In many cases, applications do not have sufficient taskparallelism to fully utilize the provided computing resources. For example, consider one of our
benchmarks, LZ77, whose pipeline only has 3 kernels. Without replicating one or more of the kernels the application would not be able to take advantage of the available resources. Alternatively,
it may be that one kernel makes up over a majority of the computation. In this case, rather than
distributing kernels evenly across cores, the bottleneck kernel should instead be replicated enough
to allow for an even distribution of load.
At a high-level, the replication mappers find a minimal load mapping in two stages. It first computes the minimum and maximum load of all possible groups of contiguous kernels, where the
first and last kernel of each group may be replicated. Next it finds the smallest total load any
computing resource can be assigned, which yields a valid mapping, via binary search and dynamic
programming. Why do we only consider replicating the first and the last kernel of any segment?
Note that we only want to replicate bottleneck kernels. If a kernel in the middle of a segment is a

44

bottleneck, then we would just split this segment into multiple segments. The replication algorithm
also decides what fraction of the data goes to each of the replicated modules.
AMP IPE supports replication for both S EG RUNTIME and S EG B OTH; the sole difference between
the two provided replication mappers is the specific load calculation. S EG RUNTIME with replication computes load as a function of work, whereas S EG RUNTIME with replication includes both
the work and an approximation of the cost of the memory accesses required from performing input
and output.
Replication in AMP IPE is done as needed. If a mapping of the pipeline can be found that evenly
distributes the total load across cores without the use of replication there is no benefit to replicating
kernels. Indeed replication is only beneficial when there exist bottleneck kernels in the pipeline or
when there are not enough kernels assign to all available resources.

An extensible mapping framework At a high-level, the responsibilities of the mappers in AMP IPE
are to 1) assign kernels to cores in an informed manner and 2) to determine the buffer sizes on cross
edges between each kernel. To write their own mapper, the user first implements a partitioning
function based on the kernel information provided by the user to create all the segments. It then
decides how to assign segments to cores (more than one segment can be assigned to a single core).
Finally, it must decide how to allocate buffers on cross edges by using either one of the default
strategies or by defining their own. Mappers are hooked into the front-end via a single map routine
which processes the pipeline and performs the three key operations.

45

4.2.3 Runtime
AMP IPE’s runtime has a few different functions: First, it must schedule each segment on one
processor and decide when each segment executes. Second, it must decide which kernel within
each segment to execute at any time. Finally, it must manage communication between kernels via
edges in the pipeline. Alongside the code generated for the user’s kernels, AMP IPE produces a
library which includes the necessary scheduling routines for threads, segments and kernels as well
as various communication channels to support the different kinds of edges that the mapping may
generate.

Thread scheduling The thread scheduling in AMP IPE is straightforward. Each segment is
assigned to a single thread and each thread is pinned to a single processor. However multiple
segments may be assigned to the same thread. The resulting application created by AMP IPE is
a multi-threaded process where each thread manages it’s set of assigned segments individually.
When assigned multiple segments, AMP IPE’s threads prioritize executing those which are found
the farthest downstream and are ready for execution. A segment is ready for execution when the
incoming cross edge into the segment is at least “half full” and the outgoing cross edge is at least
“half empty.” Recall that AMP IPE assigns large buffers to cross edges and this half full/half empty
criterion makes sure that once a segment is scheduled, it runs for a while before another segment
needs to be scheduled. This reduces overheads of warming up the cache, etc. If multiple segments
on the thread are ready, AMP IPE schedules the one that is farthest downstream since this allows us
to finish emptying the pipeline faster. Once a segment is fired, it continues executing until either
its incoming cross edge buffer is empty or its outgoing cross edge buffer is full.

Segment scheduling Segments are lists of consecutive kernels which are fired as a single unit. A
kernel is ready when there is enough data in its input edge buffer to satisfy its input requirements
46

and enough space on its output edge buffer to satisfy its output requirements. Within a segment, we
use the same strategy as we use across segments — that is, we always fire the furthest downstream
kernel that is ready to fire within a segment. The buffers on edges within the segment are large
enough to guarantee that we can always find some ready kernel in a segment as long as the segment
itself is fireable. When one of a segment’s kernels is replicated, the segment scheduling may be
slightly more complicated since instead of just one incoming (and outgoing) edge, the segment
may have more than one incoming (or outgoing edge). In this case, the segment keeps internal
counters to keep track of the elements on each edge and appropriately decides which kernel to fire.

Communication Library One of the main goals of AMP IPE is to remove the challenge of handling communication from the user; to that end we include a number of different edge classes, each
designed for a different style of communication while using the same interface. When designing
their computation kernels, the user need not worry about which type of edge is used to communicate — they simply use the interface described earlier. The type of communication necessary
for any given edge is determined not by the user, but during the mapping stage of AMP IPE. The
runtime will automatically use the correct type of edge depending on the type of communication
necessary. AMP IPE includes 5 communication types.
Internal Edges: Simple non-synchronized edges which are used between kernels within the same
segment are implemented as simple buffers since the same thread is both the producer and consumer.
Cross Edges: Thread-safe, single-producer single-consumer edges which can be accessed by at
most two threads to communicate between two segments when neither of the two end-points of the
edge are replicated modules. A simple thread safe ring buffer is used in this case. This keeps the

47

buffer from overflowing and also prevents readers from reading from an empty queue by synchronizing on the number of items in the queue. Figure 4.7 shows the read and write methods of our
thread safe queues.
Split Edges: Thread-safe, single producer multiple-consumer edges which splits data amongst
multiple threads. These edges are used during replication when the consumer kernel exists in more
than one segment and the producer only exists on one. For instance, in Figure 4.1, in the replicated
mapping, this edge connects module B to multiple copies of module C. The mapping algorithm
decides which fraction of the data must go to each of the replicated copies of a module and split
edge implementation is responsible for enforcing these decisions. In our current implementation,
split edges first sends its entire portion of the data to the first copy of the module, then to the second
copy and so on. The alternative would be distribute data in a round-robin manner, but we find that
this contiguous partitioning works better.
Join Edges: These are counterpart to split edges and are thread-safe, multiple-producer singleconsumer edges which collect data from multiple threads. These edges used during replication
when the producer kernel exists in more than one segment. These edges must make sure that data
ordering is preserved — that is, to the consumer kernel, it should look like the preceding kernel
was not replicated and executed the elements in the correct order. Therefore, AMP IPE edge implementation carefully reads data from multiple producers in the correct order. We’ve designed as our
split and joine edges as collections of thread-safe queues. Where only the producer or consumer
interacts with the split or join edge interfaces respectively. Figure 4.8 shows our Split/Join edge
implementations. Each edge keeps a set of TSPQ edges which the replicated kernels will interact
with and keeps track of when the edge must switch amongst them.
Interchange Edges: Thread-safe multiple-producer multiple-consumer edges which bridge communication between two neighboring kernels which have both been replicated. An Interchange

48

edge connects multiple ITSPQ edges which serve as the endpoints to each of the replicated procuders and consumers. Given the order of the incoming producer and consumer ITSPQ edges and their
sizes, the Interchange determines the start and end pointers for each ITSPQ within its own buffer.
With these endpoints, the Interchange breaks the buffer into regions which can be assigned to ITSPQs such that for each region there is exactly one producer and exactly one consumer, which an
ITSPQ may be assigned multiple neighboring regions. To prevent races, ITSPQs grab a lock on
region whenever performing a read or write. Figure 4.9 shows this implementation.
From the perspective of a kernel, the actual type of edge does not matter, it utilizes the same
interface to pull or push data from any of them. Each edge type also preserves the original ordering
of data, a task especially important in the context of replication.

4.3

Experiments

In this section we explore how AMP IPE handles a number of real-world streaming-pipeline applications. To demonstrate the features included in AMP IPE we implemented DES, LZ77, Ferret and
a deep neural-network (DNN) testing application. We used AMP IPE interface to program these applications and ran then using S EG RUNTIME and S EG B OTH with and without replication. We find
in all cases that AMP IPE’s automatic mapping produces applications who’s performance is able to
scale with increasing computational resources. In particular, we find that when we have pipelines
which are compute bound or which have similar data volume across all edges, S EG RUNTIME and
S EG B OTH find the same mapping and perform comparably. When these are not true (in DNN),
S EG B OTH performs better. In addition, replication is really useful in getting good performance for
pipelines with bottlenecks (ferret and LZ77).

49

App-Mapp

1

2

4

8

des-sr
des-sb
lz77-sr
lz77-sb
lz77-srr
lz77-sbr
ferret-sr
ferret-sb
ferret-srr
ferret-sbr
dnn-sr
dnn-sb

404.3 (1.0)
404.4 (1.0)
148.6 (1.0)
148.6 (1.0)
148.6 (1.0)
148.6 (1.0)
550.1 (1.0)
562.0 (1.0)
550.1 (1.0)
562.0 (1.0)
269.6 (1.0)
269.6 (1.0)

230.4 (1.8)
213.5 (1.9)
150.9 (1.0)
153.2 (1.0)
113.6 (1.3)
115.7 (1.3)
521.4 (1.1)
379.3 (1.5)
333.2 (1.7)
326.9 (1.7)
144.0 (1.9)
139.2 (1.9)

142.0 (2.8)
127.8 (3.2)
154.4 (1.0)
154.9 (1.0)
54.3 (2.7)
54.9 (2.7)
367.2 (1.5)
382.0 (1.4)
184.8 (3.0)
179.1 (3.1)
84.4 (3.2)
71.2 (3.8)

96.5 (4.2)
78.2 (5.2)
154.3 (1.0)
154.4 (1.0)
34.1 (4.4)
34.9 (4.3)
366.3 (1.5)
379.3 (1.5)
118.7 (4.6)
113.5 (4.8)
51.9 (5.2)
37.6 (7.2)

Table 4.1: Runtimes (with speedup over serial in parenthesis) for DES, LZ77, Ferret, and DNN.
sr is S EG RUNTIME without replication, srr is S EG RUNTIME with replication, sb is S EG B OTH
without replication and sbr is S EG B OTH with replication. For DES and DNN, mappers with and
without replication find the same mapping.
Experimental Setup All experiments were run on an 8-core Intel Xeon E5-2687W with a 20MB
SmartCache, running at 3.1GHz with hyperthreading enabled. Though the streaming model assumes an infitite input stream, most real streaming applicaitons work on a finite dataset. We selected inputs large enough to amortize the startup and cleanup phases of the pipeline execution.

DES

Data Encryption Standard was a federal encryption standard in the United States until 2002

when it was superseded by the Advanced Encryption Standard. Figure 4.10a shows the DES
pipeline consisting of a series of Feistel function and permutation kernels that process 64 bit blocks
of data. Since the data volume of every edge is the same (each kernel consumes and produces 64
bits of text on each firing) both S EG B OTH and S EG RUNTIME produce the same mapping, dividing
up the kernels evenly into 8 segments; since this pipeline has enough kernels, no replication is
necessary.

50

Although the segmentations found by both S EG B OTH and S EG RUNTIME are the same, Table 4.1
shows that S EG B OTH slightly outperforms S EG RUNTIME on DES. This difference is due to the
buffer allocation strategies used by each mapper. S EG B OTH, in this case, incidentally creates large
buffers (the size of cache). This allows each segment to execute for longer when it is scheduled,
thereby reducing idle time.

LZ77

LZ77 is a lossless, dictionary file compression algorithm which maintains a sliding win-

dow view of the input and removes duplications by replacing them with forward references. Like
DES, LZ77 streams input files through a series of computations and thus fits naturally into the
streaming model. Figure 4.10b shows the LZ77 pipeline which is made up of Reader and Writer
kernels for performing file i/o and a central compress/decompress kernel which is responsible for
the majority of the work.
With such few kernels, LZ77 does not have enough task parallelism. The original LZ77 algorithm
can not directly use replication in the middle kernel since the sliding window requires the order of
data to remain unchanged. Instead, we implemented a variation of LZ77 which resets the window
at set intervals, removing dependencies and exposing data-parallelism. This variation splits the
input into blocks of 16384 characters which can be compressed individually, allowing us to mark
the compression kernel as replicable.
Again, S EG RUNTIME R and S EG B OTH R create the same mapping, shown in Figure 4.10c. For P
cores, both mappers create P copies of the Compress kernel and assigns one to each core. The
Reader and Writer kernels are both paired with one copy of the Compress kernel each. Since
Compress operates on larges blocks of data at a time, the difference between the buffers allocated
S EG RUNTIME and S EG B OTH is minimal. Therefore, the difference between the running time of
the two mappers is minimal. In contrast, if we do not use replication, LZ77 slows down as we
increase the number of cores. This is due to the fact that most of the work is still done by one
51

module and we have simply added the overheads of data transfer between modules. Therefore,
replication is essential to get performance for this pipeline.

Ferret Ferret, which has been included in the parsec benchmark suite using a pipeline implementation, is a content based similarity search application. We show the parsec pipeline in Figure 4.10d
which is broken up into 5 compute kernels. Profiling the pipeline reveals an uneven distribution of
work amongst the kernels. As we can see, the Rank kernel is the bottleneck. Figure 4.10e shows
the ferret pipeline after replication on 8 cores. As we can see, it is not always sufficient to just
replicate the largest bottleneck. After creating enough replications of the Rank kernel, vec kernel emerges as the bottleneck. Rather than blindly replicate Rank accross all cores, our mappers
automatically identify when a new kernel has become the bottleneck and replicate it in order to
continue to reduce the maximum load. Again, S EG B OTH and S EG RUNTIME find the same mapping and provide the same performance. However, the mapper without replication provides very
little speedup due to the bottleneck module, while replication provides considerable performance
advantage and scalability as we increase the number of cores.

DNN

We’ve implemented a neural-net testing computation which we call DNN. For this ex-

periment, our goal is not to produce a meaningful interpretation of the input data, but rather to
demonstrate how a neural network might be tested as a parallel streaming application. Our DNN
pipeline is made up of a series of hidden layers which we combine into individual kernels. Layers
are made up of a collection of perceptrons which check if the weighted sum of it’s inputs is greater
than it’s assigned threshold. Perceptrons are connected to the previous layer, observing the outputs
of a fixed set of perceptrons from that layer. For our application, each layer has the same number
of perceptrons and is characterized by how connected it is to the previous layer. The DNN pipeline
can be found in Figure 4.10f. It is made up of 64 hidden layers which have varying connectivities.

52

P S EG RUNTIME S EG B OTH
1
2
4
8

237566435
164066083
175842671
232859976

237566435
136108168
83401537
42227065

Table 4.2: L2 Cache miss counts for DNN with for S EG RUNTIME and S EG B OTH
We created a network where 1/4 of the edges are low volume edges between layers with low connectivity, while the remaining edges all have high volume. DNN is a memory bound application
and the data volume across different edges varies — therefore, as can be expected, S EG RUNTIME
and S EG B OTH find different segmentations.
Figure 4.10fa shows the segmentation found by S EG RUNTIME, while Figure 4.10fb shows the
segmentation found by S EG B OTH. S EG RUNTIME is strictly concerned with distributing the work
of individual kernels evenly causing it to select high-volume edges as cross edges. On the other
hand, S EG B OTH avoids this and picks low-volume edges as cross edges. In Table 4.1 we see that
the mapping found by S EG B OTH does indeed outperform S EG RUNTIME and gets a speedup of 7.2
on 8 cores. A closer inspection of the cache misses in Table 4.2 indicates that this reduction in
runtime is due to a significant reduction in the number of cache misses.

4.4

Related Work

How to mapping of pipelines has been studied extensively, mostly without replication for both
homogenous and heterogenous platforms and here we do not review this work and just provide a
few citations [4, 15, 17, 29, 38, 59, 62].

53

Pipeline Scheduling Systems A number of pipeline scheduling systems already exist which aim
to provide a high-level pipeline programming interface with efficient runtime scheduling. The closest related works are the X language, AutoPipe and ScalaPipe [32, 74] which all serve as ancestors
of AMP IPE. These were both designed for easily creating pipelines for heterogenous platforms.
Users develop kernels in C or VHDL (in X) and in Scala (in ScalaPipe) and then can run them
on CPUs, FPGAs, GPUs, etc. In these systems mapping is always the responsibility of the user.
Should a set of kernels be mapped to a CPU resource, scheduling is managed dynamically by the
OS rather than directly within a runtime as in AMP IPE. Similarly, a recent work on RafLib [11],
a C++ template library that aims to enable high-performance stream processing while utilizing
a natural C++ interface that provides a minimal learning-curve, also does not provide automatic
mapping, especially if we consider cache misses.
StreaMIT [71] offers a stream-based programming language that supports general streaming applications beyond pipelines. The StreaMIT compiler uses a phased scheduling strategy to minimize
latency, as oppose to minimizing throughput, which is the focus of AMP IPE. Gordon et al. [36]
extends this work to multiple cores by evenly distributing load using a greedy assignment while
only considering work and not cache misses for replication. Lee et al. [47] explore beyond these
construct-and-run systems and proposed the piper algorithm which tackles on-the-fly pipeline parallelism. Cilk-P, their prototype implemetation, discovers the pipeline structure at during execution
and utilizes the Cilk runtime system to dynamically schedule and balance load accross multiple
cores.

Replication

Replication has been considered in some limited cases. Subhlok et al. considered

a model where every task can be perfectly parallelized [68] as well as solutions for latencythroughput trade-offs [69]. Kudlur et al. and Cordes et al. used integer linear programming to
extract data parallelism from streaming pipelines [30, 44]. While they considered replication, they

54

all assumed that the work of replicated stages should be evenly divided and assigned to replicas.
Li et al. [?] provided an algorithm for replication on heterogenous platforms and with uneven division. Wang et al. proposed a machine learning-based approach to mapping streaming applications
on multi-cores [73]. However, all these replication approaches only consider runtime and do not
consider cache misses. AMP IPE is unique in being able to add replication even when we consider
cache misses to the load balance criterion.

4.5

Conclusions

In this chapter we have presented AMP IPE, an extensible, easy-to-use parallel streaming-pipeline
application development system that finds good static mappings automatically using profiling information. The application generated is a multi-threaded C++ program built ontop of AMP IPE’s
underlying runtime system which utilizes a communication library to facilitate data movement and
synchronization. Experimental results show with automatic replication and mapping AMP IPE can
continue to achieve speedup on real-world applications as you increase the available computing
resources.

55

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35

object FOO extends App {
val BarKernel = new Kernel("Bar") {
/∗ Kernel setup ∗/
config(state, 64)
config(runtime, 1)
config(outrate,1)
val out = output(STR)
/∗ What to execute ∗/
out = "Hello World!"
}
/∗ An external C function to be used within a kernel ∗/
val printf = new Func("printf") {
include("stdio.h")
external("C")
}
val BazKernel = new Kernel("Baz") {
config(state,64)
config(runtime,1)
config(inrate,1)
val in = input(STR)
val instr = local(STR)
instr = in
printf("%s\n",instr)
}
/∗ The application ∗/
val app = new Application {
/∗ Application paramters and system info∗/
param(cache,262144)
param(cores,2)
/∗ pipline creation ∗/
val bar = BarKernel()
BazKernel(bar)
}
app.emit("foo")
}

Figure 4.3: A full “Hello, World” pipeline programming demonstrating kernel,pipeline and application definitions including the newly required calls to pass profiling and system information to
the front-end.

56

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

/∗ Abstract Mapper class definition ∗/
private[scalapipe] abstract class Mapper(val p: pipe) {
/∗ Compute the edge gains∗/
def compute_gains()
/∗ Assign minimum buffers for all edges∗/
def assign_min_buffers)()
/∗ Abstract methods ∗/
def create_segments(): Unit
def assign_segments_to_cores(): Unit
def map() {
compute_gains()
assign_min_buffers()
create_segments()
assign_segments_to_cores()
}
}

Figure 4.4: AMP IPE’s abstract Mapper class which provides the interface for defining
new Mappers which at a minimum requires users to implement the create_segments and
assign_segments_to_cores methods.

57

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

bool schedule_maxload(long load,segments ∗ segs);
/∗ The SegBoth algorithm∗/
void create_segments() {
compute_all_loads()
/∗ Find load budget ∗/
L_min = L_max = −1;
L_guess = 1;
segments ∗ segs;
while(L_max == −1) {
if (!schedule_maxload(L_guess,segs))
L_min = L_guess ; L_guess ∗= 2;
else
L_max = L_guess;
}
/∗ Search between L_min and L_max ∗/
done = false
L_next = mid(L_min,L_max) /∗midpoint∗/
while(!done){
L_guess = L_next;
if (!schedule_maxload(L_guess,segs))
L_min = L_guess ;
else
L_max = L_guess
L_next = mid(L_min,L_max)
if (L_guess == L_next)
done = true
}
/∗ segs now holds the segmentation∗/
}

Figure 4.5: The S EG B OTH segmentation algorithm. schedule_maxload is a helper function which
determines if a segmentation can be found within a given load and if so, stores it.

58

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

/∗ Determine start and end kernels∗/
int start,end;
int fireKernelNum = start
for(;;) {
if (kernelList[fireKernelNum]−>firable()) {
kernelList[fireKernelNum]−>run();
update_read_count();
update_write_count();
if (fireKernelNum != end)
fireKernelNum++;
}
else {
if (fireKernelNum == start+1)
break;
else
fireKernelNum−−;
}
}

Figure 4.6: Kernel scheduling function within a segment. Each firing results in the first kernel of
the segment executing once before propataing it’s output as far as possible through the segment.

59

1
2
3
4
5

template <typename T>
void TSPQ<T>::read(void ∗ l) {
/∗ Cast the templated type ∗/
T∗ loc (T∗)l;
T∗ buff = (T∗)this−>m_buff;
/∗ Spin while the buffer is empty∗/
while(empty())
;
auto curr_pos = m_read_pos.load();
memcpy(loc,&buff[curr_pos],sizeof(T));
m_read_pos.store((curr_pos+1) %this−>m_size,std::memory_order_release);

7
8
9
10
11
12
13

}

15
16
17

template <typename T>
bool TSPQ<T>::empty() {
/∗ Check if the read and write pointers are the same ∗/
return m_write_pos.load(std::memory_order_acquire) == m_read_pos.load(std::
memory_order_acquire);
}

18
19
21
22
23
24
25

template <typename T>
void TSPQ<T>::write(void ∗ v) {
/∗ Cast the templated type ∗/
T∗ val = (T∗)v;
T∗ buff = (T∗)this−>m_buff;
/∗ Spin while the buffer is full ∗/
while(full())
;
auto curr_pos = m_write_pos.load();
memcpy(&buff[curr_pos],val,sizeof(T));
m_write_pos.store((curr_pos+1)%this−>m_size,std::memory_order_release);

27
28
29
30
31
32
33
35
36
37
38
39
40

}
template <typename T>
void TSPQ<T>::full() {
/∗ Check that the write pointer is not right behind the read pointer ∗/
auto curr_pos = m_write_pos.load(std::memory_order_acquire);
return (curr_pos+1)%this−>m_size == m_read_pos.load(std::memory_order_acquire);
}

Figure 4.7: AMP IPE’s single producer single consumer thread safe ring buffer used for cross edges
between non-replicated modules.

60

1
2
3
4

/∗ Split edge write method ∗/
void STSPQ::write(void ∗ val) {
/∗ Get the queue ∗/
Edge ∗ queue = m_queues[m_write_pos];
while(queue−>full())
;
/∗ Call the underlying queue’s write ∗/
m_queues[m_write_pos]−>write(val);
m_cnt++;
/∗ Check if we need to change queues ∗/
if (m_cnt == m_counts[m_write_pos]) {
m_write_pos = m_write_pos < m_splits−1 ? m_write_pos+1 : 0;
m_cnt = 0;
}

6
7
8
9
10
11
12
13
14
15
16
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

}
/∗ Join edge read method ∗/
void JTSPQ::read(void ∗ loc)
{
/∗ get the queue ∗/
Edge ∗ queue = m_queues[m_read_pos];
while(queue−>empty())
;
/∗ Call the underlying queue’s read ∗/
m_queues[m_read_pos]−>read(loc);
m_cnt++;
/ ∗ Chekc if we need to change queues ∗/
if(m_cnt == m_counts[m_read_pos]) {
m_read_pos = m_read_pos < m_splits−1 ? m_read_pos+1 : 0;
m_cnt = 0;
}
}

Figure 4.8: AMP IPE’s Split and Join edges, which are responsible for trafficking data between
replicated and non-replicated kernels. Only one producer will be connected to a split edge, while
only one consumer will be connected to a join edge, making the read and write methods unnecessary in those two classes respectively.

61

1
2
3
4
5
6

template <typename T>
Interchange<T>::Interchange(uint64_t _size, int _writers, uint64_t ∗_wlens,
int _readers, uint64_t ∗_rlens) :
size(_size),writers(_writers),write_lens(_wlens),
readers(_readers),read_lens(_rlens) {
find_regions();
/∗ create per−region flags / locks ∗/
full_flags = new bool[nregions];
counts = new uint64_t[nregions];
muts = new std::mutex[nregions];

8
9
10
11

/∗ Assign regions to the ITSPQs ∗/
assign_regions();

13
14
15
17
18
19
20
21

}
template <typename T>
void ISPQ<T>::read(void ∗ l)
{
T∗ loc = (T∗) l;
T∗ buff = (T∗) it−>buff;

23
24

while(empty())
;

26
27

/∗ Get this reader’s region∗/
int curr_pos = (∗it).read_pos[id];
uint64_t region = it−>read_regions[id∗it−>nregions+region_pos];

28

31
32

{
/∗ lock the region ∗/
std::lock_guard<std::mutex> lk(it−>muts[region]);

34

memcpy(loc,&buff[curr_pos],sizeof(T));

36

it−>counts[region]−−;

38
39

/∗ Check if we’ve finished a region and change if so ∗/
if (curr_pos == it−>regions[region]−1)
{
it−>full_flags[region] = false;
int end = it−>read_regions[id∗it−>nregions];
region_pos = (region_pos == end) ? 1 : region_pos + 1;
}
} /∗ lock released ∗/

30

40
41
42
43
44
45

/∗ Update the position read position ∗/
if (id == 0)
it−>read_pos[id] = (curr_pos == it−>read_ends[id]−1) ? 0 : curr_pos+1;
else
it−>read_pos[id] = (curr_pos == it−>read_ends[id]−1) ? it−>read_ends[id−1] : curr_pos + 1;

47
48
49
50
51
52

}

Figure 4.9: The Interchange and ITSPQ classes 62
responsible for trafficking data between two replicated kernels

(a) The pipeline configuration for
DES. It has two initialization and
two finalization kernels and in the
middle there are 16 functional kernels.

(d) Pipeline configuration for
ferret. This application has
unbalanced loads and the rank
module is initially a bottleneck. However, once that
module is replicated, Vec
module becomes a bottleneck.

(b)
The
pipeline
for
LZ77;
it
contains only
3
kernels,
therefore,
it
is
essential
that we take
advantage of
data
parallelism using
replication.

(c) LZ77 pipeline after
replication for both
S EG RUNTIME
and
S EG B OTH. At 8 cores,
{r, c0 } are assigned
together, {c7 , w} are
assigned together and
the remaining c copies
are assigned alone.

(e) Ferret pipeline after replication.
Two modules are replicated — therefore, this application uses all kinds of
edges — split edges between e and
v, interchange between v and r and
join between r and o. The mappers put together {l, s, c, v0 }, {v3 , v4},
{v0 , 0},and assign the remaining v and
r copies alone.

(f) Pipeline configuration for DNN, we show only the first 17 kernels for space. This applicaiton
has many more kernels than cores providing plenty of options for how a mapper might divide
them up.

63

Chapter 5
Locality-Aware Dynamic Task-Graph
Scheduling

5.1

Intro

In recent years a number of high-level programming languages and libraries have emerged, such
as OpenMP, Cilk Plus, Nabbit, etc. These languages and libraries allow programmers to express
the logical parallelism in their programs while the runtime scheduler schedules the work on the
available cores automatically. For multicores with few cores and uniform access to the memory
hierarchy, these languages and runtime systems provide both good performance and a relatively
simple programming model.
On large multicores with non-uniform memory access (NUMA), however, locality is an important
consideration since a remote memory access—access to data reachable from a memory controller
that is further away via the on-chip network—can cost much more than a local memory access.
Regular applications can be structured to implicitly ensure locality between initialization and subsequent use when using static schedulers such as in OpenMP. However, irregular applications need
dynamic load balancing which dynamic schedulers, such as those in OpenMP tasks, TBB, and Cilk
64

Plus provide. However, they have no notion of the location of data and often fail to provide good
performance for regular memory-intensive applications.
Ideally, one would like to have a high-level and easy-to-use programming model which incorporates dynamic scheduling and locality. We present NABBIT C, a locality-aware extension of a task
graph library NABBIT. In the NABBIT programming model, the programmer expresses computations as a task graph where each node is a task and edges represent dependences between tasks.
NABBIT is a library built on top of a Cilk Plus3 and therefore, NABBIT programs are scheduled
using a provably good work-stealing scheduler.
In this chapter, we make NABBIT locality aware by allowing the programmer to give locality hints
to the scheduler using a simple coloring scheme. In particular, we make the following contributions.

1. We extend the interface so that the programmer can provide a color to each task; if a task is
colored a color c, then the data used by this task is local to processor with color c. Multiple
nearby cores can have the same color.
2. We modify both the NABBIT library and the Cilk Plus runtime system to allow processors
to preferentially execute tasks that share the color with them. Therefore, if the user provides
“correct coloring”, then workers preferentially execute tasks that access local data, thereby
reducing the expensive remote accesses.
3. NABBIT C tries to strike a balance between improving locality and preserving the guarantees
of low overhead and good load balance provided by NABBIT. We prove that NABBIT C,
by and large, preserves the asymptotic guarantees provided by NABBIT. In particular, for
3

It was originally designed to be built on top of Cilk++, but it is trivial to port to Cilk Plus. Indeed, it was designed
so that it can be ported to any programming language that supports fork-join parallelism.

65

Thread 1
a)
b)
a
d
e
b

c
a

b

f
d

c

Thread 2
c)
h
i

e
f

j
h

i

j

Figure 5.1: NABBIT scheduling examples presented in two stages. A thread surrounding a node denotes
that thread is processing that node. A solid line from a to b denotes that b is a predecessor of a while a
dashed line from b to a denotes that a is in b’s successor list.

reasonable task graphs—those with enough parallelism and where tasks of all colors appear
near the root of the graph—NABBIT C provides nearly asymptotically optimal speedup.
4. We evaluated the performance of NABBIT C on a suite of memory intensive applications and
find that it succeeds in providing both good load balance and good locality. It consistently
out performs vanilla NABBIT due to improvements in locality. In addition, on PageRank, an
exemplar irregular benchmark, NABBIT C outperforms OpenMP static and guided scheduling strategies by combining dynamic load balancing and locality awareness.

5.2

Background

In this section, we describe NABBIT, a high-level task-graph scheduling library built on top of the
Cilk Plus runtime system. We outline the NABBIT programming model and show how NABBIT
recursively executes task graphs in parallel. We also provide a brief overview of the GCC Cilk
Plus implementation upon which NABBIT is built.

66

N ABBIT task-graph scheduling. NABBIT schedules task graphs through static and dynamic exploration of the task graph. A NABBIT task graph is a directed acyclic graph with a set of explicit
nodes that represent tasks and edges represent dependences between tasks. Each node u in the task
graph specifies its predecessors—tasks that have edges to u and therefore must be executed before
u can be processed. For the paper, we will use the terms node and task interchangeably.
We summarize the key aspects of the NABBIT dynamic task graph scheduler (more details in
Agrawal et al. [7]). One interesting property of NABBIT is that it computes nodes on demand.
The scheduler takes an input specified in the form of a sink node, whose execution completes the
execution of the task graph. Upon creation, this node has a list of predecessors and no successors.
The sink node together with the predecessor specification transitively identifies all vertices that
need to be executed to compute the sink node. The scheduler actions:

1. To process a node, a thread initializes the node and its list of predecessors and proceeds to
execute them in a recursive parallel depth-first fashion. Consider the example in Figure 1a.
When thread 1 wants to process a and finds that b and c are its predecessors that have not
been initialized, it goes ahead and tries to process one (b in this case). While b is being
processed, another thread can steal c.
2. When processing a node’s predecessor, if a thread finds that some predecessor has already
been initialized by some other thread but has not finished executing, the thread adds the
current node to the predecessor’s successor list and moves on. In Figure1b, thread 1 is
processing d while thread 2 is processing e. f is a predecessor of both d and e. Each thread
will try to initialize the predecessor f but only one will succeed, in this case thread 1. Thread
2 is attaches e to f ’s successor list and tries to find other work to do.
3. After a node is computed, the thread checks if there are any enqueued successors and if so,
determines if those successors are ready to execute (i.e., have no other predecessors on which
67

they are waiting). In the event that a successor is ready, the thread will recursively execute
that node. In Figure 1c, both nodes i and j have h enqueued in their successor lists. Thread
1 computes i and checks if h is ready to execute. Since h still depends on j, thread 1 moves
on. Thread 2, after computing j checks h, sees that it is ready and proceeds to execute it.

This procedure ensures that a node is computed only after all its (transitive) predecessors have
been computed, ensuring correctness. In addition until an initialized node u is computed, it is (a)
either in a thread’s stack, (b) in a successor list, or (c) is a predecessor of an initialized node. This
guarantees that every node u will be inspected and executed eventually. Also, this ensures that the
sink node, and thus whole task graph, is executed to completion.
Atomicity choices ensure the absence of data races. The predecessor and successor lists allow
threads to execute without blocking/waiting for any action by another thread. The recursive parallel
design allows for the implementation of the NABBIT’s scheduler as a Cilk Plus program. All
vertices in either the predecessor or successor lists can be executed in parallel. In addition, NABBIT
ensures that no ordering constraints other than those implied by the predecessor relationships is
imposed on the execution: a node u is ready to execute immediately after all its predecessors
have been computed and unless every processor is busy doing other work, some processor will
find and execute u. This ensures that NABBIT does not alter the task graph’s critical path length,
enabling the scheduler to guarantee asymptotic optimality. Essentially, if the task graph itself has
a parallelism of at least P , then NABBIT guarantees that it gets Ω(P ) speedup on P processors for
most reasonable task graphs. In addition, since it leverages the Cilk Plus work-stealing scheduler
and uses distributed processing, NABBIT provides low overheads. These properties of asymptotic
optimality and low overheads are not normally achieved by other task graph schedulers, such as
scheduler currently used in OpenMP’s SMPSs [61], since they do not process nodes on demand.

68

GCC Cilk Plus We compile NABBIT using the GCC implementation of Cilk Plus, an extension
to C++. Cilk Plus is a processor oblivious language—the programmer expresses the logical parallelism of the program using three keywords without any reference to how many threads must
execute the program and how. The cilk_spawn keyword indicates that the succeeding function can
execute in parallel with its continuation. The cilk_sync keyword is a local barrier; all previously
spawned functions by current function must complete before the program execution can move this
statement. Cilk Plus also provides a parallel_for keyword, which indicates that all iterations can be
executes in parallel. This keyword is essentially syntactic sugar and is implemented using spawns
and syncs.
The Cilk Plus runtime system uses randomized work stealing to schedule these fork-join programs
on P available cores. The program executes on P worker threads, one for each core in the target
machine. Each worker has a local deque of work. When a worker p executing function foo spawns
a function bar, the frame corresponding to the caller foo is placed at the bottom of the p’s deque
and p starts executing bar. When p returns from a function, it pops the function at the bottom of
its deque and continues executing. (If executing on one thread, the program follows the normal
depth-first execution followed by C or C++.) If worker q’s deque is empty, it becomes a thief, picks
a random victim worker, say p, steals the top frame from p’s deque, and starts executing it. If a
steal attempt is unsuccessful, meaning that the victim had an empty deque, then the thief continues
to steal until it finds work.The Cilk Plus compiler inserts code at spawns and syncs to ensure that
deques are managed correctly. In addition, when a worker’s deque is empty, it makes calls into the
runtime to make sure that steals occur correctly.

69

5.3

Theory

We now present a simple analysis showing that the modifications made to NABBIT do not negatively effect the asymptotic runtime—this implies that NABBIT C also provides almost asymptotically optimal load balancing for programs that have enough parallelism.
Just as in the NABBIT paper [7], say, we are given a task graph G = (V, E), where each node u has
work W (u). Also, say that s is the unique node with zero in-degree and t is the unique node with
zero out-degree. If these nodes are not unique, we can trivially add dummy root and final nodes.
Define M as the number of nodes on the longest path in V from the source s to the sink t.
We can define the work T1 as the time it takes to execute the task graph on a single processor and
span T∞ as the time it takes to execute it on an infinite number of processors. Therefore, the work is
P
T1 = u∈V W (u) + O(|E|). The second term is due to the fact that each edge needs to be checked
P
to make sure that it is satisfied. Similarly, we the span is T∞ = maxp∈paths(s,t) { u∈p W (u) +
O(M )} since nodes along any path through V can not execute in parallel. By the work and span
laws [31, p. 780], the completion time on P processors for a task graph is at least max{T1 /P, T∞ }.
We will prove the following theorem—the analysis is a small extension to the analysis of runtime
for NABBIT.
Theorem 5.3.1. For task graph G = (V, E) with maximum degree d, NABBIT C executes G in time
O(T1 /P + T∞ + M lg d + lg(P/ǫ) + C) time on P processors with probability at least 1 − ǫ where
C is the amount of time each worker spends at startup trying to find a node of its own color.
This theorem is similar to the theorem proved for NABBIT [7] apart from the last term C. The
main difference between NABBIT and NABBIT C is the fact of colored steals. In particular, when a
worker runs out of work in NABBIT, it performs a random steal. On the other hand, when a worker
runs out of work in NABBIT C, it first checks a constant number of deques to see if it can find work
70

of its own color and only performs a random steal if all these checks fail. In addition, at the start of
the computation, NABBIT C forces a colored steal and each processor may make C checks to find
a node of its own color where C may not be a constant.
Lemma 5.3.2. The total number of colored steals performed by NABBIT C is O(W + S + P C)
where S is the number of random steal attempts, W is the number of steps the processors spend
working on computation nodes, and C is the number of checks each processor performs at the
beginning of the computation. Consequently, the total number of colored steals is bounded by
O(T1 + P T∞ + P M lg d + P lg(P/ǫ) + P C)

Proof. Trivially, the number of checks at the beginning of the computation is P C since each
processor performs at most C of them. After this, after a constant number of checks, a processor
has either found work (therefore, these checks are bounded by O(W )) or the processor performs
a random steal (these checks are bounded by O(S)). Summing these up gives us the result. The
NABBIT analysis proves that the total number of work steps in the computation is at most T1 and
the total number of steal steps is at most O(P T∞ +P M lg d+P lg(P/ǫ)). This gives us the desired
bound.

At any step, a worker is either working, doing a random steal, or doing a colored steal. Therefore,
the total number of processor steps is bounded by O(T1 + P T∞ + P M lg d + P lg(P/ǫ) + P C)
Since there are a total of P workers, we can divide by P to get the desired running time.

5.4

Design

In this section, we describe our extensions to NABBIT to specify colors, propagate this information
through the runtime, and extend the scheduler to be take into account task colors. Throughout
71

6
7

class DynamicNabbitNode:
Key key //this node’s key
List<Key> predecessors //List of predecessors’ keys
List<DynamicNabbitNode ∗> successors //List of successor nodes
virtual void init() //initialize this code (user−defined)
virtual void compute() //compute this task representing this node (user−defined)
int color() = color(node.key) //helper function: this node’s color

9

int color(Key key) //user−defined function mapping a key to its color

1
2
3
4
5

Figure 5.2: NABBIT C abstract class interface
this section, we will present the relevant NABBIT interface and the NABBIT C extensions for coloraware task graph scheduling.

N ABBIT C interface
Recall that in NABBIT, users model their computation as a task graph, where nodes are tasks to
be computed and edges represent data dependencies between computations. Algorithm 5.2 shows
the abstract class interface for defining nodes and their data dependencies. All nodes in a task
graph dynamically scheduled by NABBIT inherit from DynamicNabbitNode class and implement the
member functions shown in Algorithm 5.2. The init() and compute() functions serve to initialize
node parameters and perform the computation represented by the node, respectively. Each task
(node in the task graph) is associated with a unique key. The user also specifies the list of predecessors, identified by their keys, this node depends on in the predecessors array. In addition to
the information on tasks and their dependencies needed by original NABBIT, NABBIT C requires
the user to define a color() function that returns a node’s color. This function definition serves as
the mechanism for the user to provide locality information to NABBIT C and is the only additional
piece of information the user must provide.

72

We now present extensions to NABBIT and the underlying Cilk Plus runtime that together constitute the NABBIT C infrastructure to exploit this color information to optimize locality.

Designing a locality-guided task-graph scheduler
NABBIT C attempts to achieve multiple goals during scheduling: (1) improve data locality by executing nodes of the same color as the executing processor; (2) achieve good load-balance for the
computation as a whole; and (3) introduce minimal overhead into the original NABBIT scheduling
pathway. Extending the NABBIT task-graph scheduling library to make use of user-provided locality information involves altering how Cilk Plus workers find and determine what to work on. We
introduce two specific changes to NABBIT in order to implement this change in policy: (1) coloraware scheduling using morphing continuations allow workers to reorganize work so that they
may preferentially execute nodes that have the same color as theirs and (2) colored steals allow
Cilk Plus workers to find work of their color from the current set of stealable frames. In order to
implement these policies, we must also change the Cilk Plus runtime system. We now describe
these changes.
Color-aware execution order using morphing continuations. The primary source of concurrency in NABBIT is the concurrent processing of all predecessors or successors4 of a given node.
As explained in Section 5.2, NABBIT enables this by spawning the execution of all predecessors
(or successors) in parallel using a parallel for loop. NABBIT is oblivious to the order in which
these nodes are processed. NABBIT C, however, extracts colors from this list of nodes in order to
preferentially process nodes with colors that would improve locality.
4

As described in Section 5.2, NABBIT implicitly maintains successors array for each node u and NABBIT may
push successor nodes into it when they must wait for u to complete.

73

The crucial function for this purpose is the function spawn_colors shown in Algorithm 5.3. This
function is called on a list of colors colors (and implicitly, a set of nodes which have these colors).
At a high-level, when a processor with color c_p is executing this function, it tries to execute the
nodes with color c_p by recursively calling spawn_colors on the half of the list that contains c_p.
Once it reaches the base case (the set colors contains only one color), it then spawns all the nodes of
color c_p using the function spawn_nodes. This function spawn_nodes is essentially a parallel-for
loop over the nodes of this color.
The function spawn_colors essentially re-organizes the order in which nodes are spawned so that
the nodes of the preferred color c_p are spawned first. Therefore, it implements what we call a
morphing continuation. The particular strand that is spawned and the continuation of the strand
depends on the color of the processor which is doing the spawn. Another important thing to note
about this code is that if the preferred color c_p is not present in the list, the function will spawn
the nodes in the original ordering of the list — therefore, a worker does not stall even if it can not
find the work of its color.
The function spawn_colors is called in three places in the NABBIT C library. Algorithm 5.4 shows
the actions to initialize and execute a node. init_and_compute() acquires the colors of the current node’s predecessors and invokes spawn_colors() if there exist more than one. Similarly, when
spawning the list of successors, compute_and_notify() collects the set of colors for the list of successors and invokes spawn_colors() if there are more than one. Finally, spawn_colors is a recursive
function which is also called by itself.
This morphing continuation design allows us to use the same mechanism in two scenarios. First,
when a processor spawns the predecessors (or successors) if the node it is currently working on,
it uses spawn_nodes to preferentially execute the predecessor(s) (or successor(s)) of its own color.
Second, and the more subtle point, is as follows. Note that in Cilk Plus, when a thief worker steals

74

a task after the spawn of a function5 , it executes the function’s continuation. Since spawn_colors
is recursive, when a worker steals a continuation, the first statement is executes is spawn_colors.
Therefore, the thief also preferentially executes the nodes of its color using the same mechanism.
Colored Steals: When a worker has no assigned work, either because it has run out of local work
or is at the start of execution, we want that worker to acquire work of its preferred color if possible.
In order to do so, we change the stealing policy of Cilk Plus to allow colored steal where a worker
checks a deque and only steals the work (continuation) at the top of the deque if that continuation
contains some node of this worker’s preferred color. We will describe the implementation below
— we first describe our policy details about when we do colored steals vs. random steals.
One of the goals of NABBIT C is to strike a reasonable balance between locality — workers preferentially execute work of their color — and load balance — workers are not idle for too long.
In order to do so, we make two changes to the standard Cilk Plus policy of random steals. First,
when a worker p with color cp runs out of work, it does a constant number of colored steal attempts
before attempting a random steal. That is, it randomly picks a victim worker q and checks if the
frame on the top of q’s deque has any tasks of color cp — if so, it steals this frame making this
a successful colored steal. If not, it tries again. If it fails on a constant number of colored steal
attempt, it makes a random attempt where it steals whatever is on the top of the victim worker’s
deque regardless of whether it has a task of color cp or not. This policy makes sure that p tries to
find work of its own color, but then also maintains provable load balance guarantees (as shown in
Section 5.3) by greedily doing any work available if it can not easily find work of its color.
There is an exception to this policy, however, at the beginning of the computation. At the beginning
of the computation, one worker starts out with executing the root node and all other workers are
stealing. At this time, if a worker begins execution in a region of a task graph with no tasks of
5

A task is represented at runtime by the task’s stack or activation frame

75

its preferred color, it will continue executing the available non-preferred tasks until all work is
exhausted (as explained in the morphing continuations section). In addition, often, the first steal
represents a significant amount of work (conceptually corresponding to nodes higher up in the task
graph or computation tree) and a random first steal can potential to lead poor locality. Therefore,
we enforce that the first steal a worker performs is a successful colored steal. After the first steal,
the worker follows the policy explained above. This enforcement does affect Cilk’s time bound,
which we explore in Section 5.3. In our experiments, we found that if all colors are available at
the root of the task graph, this time to first work (successful steal) is agnostic to the application, is
strictly determined by the number of processors, and, in general, has a small impact on the overall
execution time.
We now describe the changes made to both NABBIT and Cilk Plus in order to implement the
colored steal policy.

Color-aware GCC Cilk Plus runtime
We make the GCC Cilk Plus runtime color aware by making the following changes. First, we
add two additional functions to the Cilk Plus API, shown in Algorithm 5.5, that allows NABBIT C
to provide color information to the runtime system. The first function is straight forward and is
simply used by each worker to set the color of this worker. We pin worker threads and assign them
a unique color based on their thread id. The second one is used to implement colored steals and
requires more explanation. Recall that in order to do colored steals, a thief worker must be able to
tell which color nodes are available in the frame that is on the top of victim worker’s deque. This
API allows NABBIT C to pass this information to Cilk Plus runtime. In particular, before every
cilk_spawn, NABBIT C calls cilkrts_set_next_colors() to inform the Cilk Plus runtime about which

colors are available in the continuation.

76

The Cilk Plus runtime is also changed with respect to what it does on spawns. At each spawn, the
vanilla Cilk Plus pushes the frame of the currently executing function into a worker’s deque —
allowing some other worker to steal the continuation of the spawn. To enable colored steals, we
maintain a color deque alongside the work deque to hold the colors available in each continuation.
When NABBIT C calls cilkrts_set_next_colors with a set of colors before the spawn statement, this
set of colors is pushed at the bottom of the color deque — therefore, each continuation on the work
deque has a corresponding set of colors on the color deque.
Now it is easy to see how one can implement colored steals. When a worker p wants to do a colored
steal, it simply checks to see of color cp (p’s preferred color) is in the set of colors on the top of
victim’s color deque. If so, it pops the top of both the color deque and the work deque and puts
them on the top of its corresponding deques making it a successful colored steal. Since the number
of colors is determined by the number of workers, we make each entry in the colored deque a fixed
length array of boolean flags indicating colors contained in the corresponding continuation. This
makes the thief’s check a constant time operation.

Setting continuation colors in N ABBIT C

As mentioned above, NABBIT C must set colors of

continuation at each spawn using cilkrts_set_next_colors function. This is done on Lines 12, 29,
etc within the code in Figure 5.3. Note that this fits in seamlessly with the design of morphing
continuations. At each spawn, we know exactly which colors are available within the spawn and
which are available within the continuation. Therefore, NABBIT C can easily notify Cilk Plus of
the colors available in the continuation by simply telling it which colors are available in the second
call to spawn_colors.

77

Optimizing locality through coloring
NABBIT C requires that the user intentionally distribute data across their system and provides a
coloring that captures computation locality. We rely on the user knowing how best to distribute
data (but not partitioning work among threads), although in many cases an even distribution is
sufficient. The coloring the user provides to NABBIT C is intended to capture the locality of work
performed, based on their data initialization. For this we make two assumptions about color: (1)
that data initialized by each individual worker thread is given a unique color and (2) that each node
of the computation task-graph is assigned a single color. Requiring the user to describe each node
with a single color can lead to some information loss about a node’s locality. For example, a node
(corresponding to a task) can require data from multiple colored regions and a single color cannot
comprehensively describe the node’s locality. In these scenarios, the user specifies the node’s color
to be the one that maximizes locality for that node.

5.5

Experiments

In this section, we evaluate NABBIT C by comparing it’s performance against original NABBIT and
O PEN MP. In particular, we try to answer the following questions:

• How well does NABBIT C address locality deficiencies in NABBIT? We answer this question
using benchmarks in which locality-optimized and load balanced schedules can be created
using static scheduling of O PEN MP and find that NABBIT C provides much better performance than NABBIT and performance comparable to O PEN MP.
• How well can NABBIT C improve data locality while preserving the dynamic load balancing
benefits from NABBIT? We answer this question using the PageRank benchmark, which
78

cannot be easily statically scheduled, using different data sets. In this case, NABBIT C really
shines and performs better than both O PEN MP and NABBIT.
• To what extent does NABBIT C improve data locality? We find that NABBIT has significantly
fewer remote accesses compared to NABBIT.
• Does the use of colored steals increase the overall cost to find work as compared to random stealing? We find that while the cost of enforcing the first colored steal is significant,
NABBIT C makes up for this overhead by having fewer steal attempts later.
• What is the impact of the choice of colors by the user? We consider the behavior of NAB BIT C

using two particularly bad color choices and compare its behavior with NABBIT.

In general, NABBIT C shines on benchmarks with irregular memory access patterns, remains competitive with O PEN MP when memory accesses are more regular, and almost always outperforms
original NABBIT. We observe that our modifications to NABBIT and Cilk Plus introduce minimal
overheads, affording performance gains due to a reduction in remote memory accesses when a
good coloring is provided.

Experimental Setup

All our experiments were performed on an 80-core NUMA machine with

8 Intel Xeon E7-8860 2.27GHz 10-core processors and 1TB of collective DRAM. The machine
uses Red Hat Linux 4.4.7-9 configured with 4KB pages. We use a stable GCC 4.9.0 build from the
gcc-cilkplus branch for compiling our O PEN MP and NABBIT benchmarks and extend this build
for NABBIT C.

Benchmarks and Baselines We will compare NABBIT C performance to NABBIT and O PEN MP.
O PEN MP offers multiple scheduling strategies for parallel for loops. The O PEN MP STATIC policy

79

simply divides up the iteration space evenly among workers while O PEN MP GUIDED dynamically
load balances using adaptive block sizes.
Table 5.1 details the benchmarks and input configurations used. We selected various memorybound applications to demonstrate the importance of achieving good locality when scheduling
task-graph computations. The first five benchmarks exhibit regular memory access patterns. We
consider these benchmarks to demonstrate the limitations of a dynamic task graph scheduler such
as NABBIT that does not account for locality, and evaluate the potential for NABBIT C to address
these limitations. For these benchmarks, O PEN MP STATIC performs very well if we match the
initialization and computation loops; as explained later, this strategy provides optimal locality to
regular applications even without locality hints. Therefore, we only compare against this O PEN MP
strategy since it always performs better than O PEN MP GUIDED.
PageRank iteratively computes the PageRank using the power method [60]. This benchmark exhibits access patterns dependent on the graph structure, with varying amounts of work per vertex.
We consider three data sets from web crawls [46] that vary in size and graph structure. Specifically,
twitter-2010 shows wider variation in its connectivity (e.g., much larger maximum out-degree) than
the other data sets considered. On this benchmark, we compare against both O PEN MP STATIC and
O PEN MP GUIDED strategies for this benchmark.
The Smith-Waterman dynamic program [67] benchmarks exhibit highly regular memory access
patterns. We have implemented the wavefront computation in O PEN MP, which must synchronize
at each diagonal step. In NABBIT and NABBIT C, we model the entire computation as a task-graph,
exposing more parallelism.

Coloring strategy

In all benchmarks, we used O PEN MP to distribute data evenly across the

machine, with each processor core initializing a unique region of the data. Each thread is pinned to

80

a processor core and given a unique color. During initialization, each data region is colored based
on the color of the thread that initializes it. For regular benchmarks, we group the data accessed
by each node based on their color, and pick the color corresponding to the largest fraction of
data as the node’s color. This color function, provided by the user, can be implemented efficiently
for regular benchmarks. Computing the largest color is expensive for irregular benchmarks such as
PageRank, where the accesses are data-dependent and involve a large number of irregular accesses.
In PageRank, each task takes a block of pages as input, which are accessed regularly, and updates
the ranks of pages linked to them, which are accessed in an irregular fashion. The irregular accesses
while traversing the links are not avoidable. Therefore, we color each task based on the block of
pages it takes as input.

5.5.1 Overall performance
We now demonstrate the effect of locality-guided scheduling on the overall performance. In Figure 5.6, we present the speedup achieved by O PEN MP, NABBIT, and NABBIT C over serial execution. Error bars show standard deviation across five runs. In general, NABBIT C outperforms
NABBIT when the problem is sufficiently large. NABBIT C shines best with larger irregular PageRank benchmarks, where the impact of locality is more prominent, while remaining competitive
with O PEN MP on the stencils and NAS benchmarks and outperforming O PEN MP for the SmithWaterman dynamic programs.
We see that in cg, when there are very few nodes in the task graph, NABBIT C’s benefit over original NABBIT becomes negligible because processor cores have few nodes to work with. With
mg, heat, fdtd, and life, when there are many nodes in the task graph, NABBIT C is able to continue getting good performance while original NABBIT suffers due to its locality-obliviousness.
For these benchmarks, we see that O PEN MP consistently performs best. When threads are pinned

81

and the computation loops are scheduled in the same way as the data initialization loops, O PEN MP
achieves the maximum locality possible despite not having received any explicit locality hints from
the programmer. In addition, it also achieves good load balance, since each iteration does approximately equal amount of work. For these benchmarks, NABBIT C’s performance approaches that
of O PEN MP, whereas NABBIT’s scalability suffers with increase in core count. For PageRank,
O PEN MP is not able to maintain its consistency in performance because it is no longer able to
achieve locality and load balance simultaneously due to the irregular nature of this application. We
see that for larger problems (indicated by the problem size and the larger serial execution time),
NABBIT C scales better than original NABBIT, O PEN MP STATIC, or O PEN MP GUIDED. For SmithWaterman we see that with the unavoidable remote accesses inherent in the algorithms, NABBIT C
and NABBIT perform comparably. Both, however, are able to exploit more parallelism than the
wavefront O PEN MP implementation and edge out ahead.

5.5.2 Locality impact of N ABBIT C’s scheduling strategy
We now look closer at the locality achieved by NABBIT C during the execution of these benchmarks. Because counting each memory reference might be expensive6 , we perform this check at
the node level in the task graph. This consists of two parts. Note that each of our evaluation system consists of eight NUMA domains, each with 10 cores. First, for each thread, we count the
number of nodes it executes that are not the same color as any thread in the same NUMA node.
Second, for each thread, we check all predecessors of executed nodes, and count those that are not
the same color as any thread in the same NUMA node. Sum of these counts across all threads is
reported as the number of remote accesses. For the regular benchmarks, we can compute this as
the benchmarks execute without perturbing the execution. For PageRank, this instrumentation can
6

We were limited by OS version and available hardware counters and were unable to measure remote accesses,
stall cycles, etc.

82

significantly perturb the execution time. Therefore, we track the nodes executed by each thread to
record the schedule used in the timing runs. This schedule is replayed to compute the percentage
of remote accesses.
Figure 5.7 shows the percentage of accesses that are remote for NABBIT, NABBIT C, and O PEN MP STATIC, on 20 or more processor cores (smaller core counts fit in one NUMA domain and do
not incur remote accesses). Because NABBIT relies on the random steals in Cilk Plus to disseminate work, the percentage of remote accesses increases with scale, ranging from 45% to 88%,
exhibiting a consistent trend across all benchmarks. The introduction of colored steals significantly decreases the percentage of remote accesses. For all benchmarks except twitter-2010 and
the Smith-Waterman benchmarks, NABBIT C incurs 0% to 9% remote accesses. Importantly, unlike in the case of NABBIT, this percentage does not strictly increase with scale for the regular
benchmarks. All strategies incur a high percentage of remote accesses for twitter-2010 and SmithWaterman.
For regular applications, O PEN MP STATIC incurs almost no remote accesses, as we expect from
how the data is initialized. For PageRank, O PEN MP STATIC still has fewer remote accesses than
NABBIT C; however, as we saw above, it does not have good performance since it is unable to provide good load balance. This result indicates the importance of both locality and load balance—
while NABBIT provides great load balance and O PEN MP STATIC provides great locality, NABBIT C
performs better than both on this irregular benchmark since it simultaneously considers both metrics.

5.5.3 Overheads due to colored steals
The two sources of overhead for NABBIT C arise from requiring a constant number of colored
steals before performing random steals and forcing the first steal to be a colored steal.
83

Effect on total steals We now look at the comparison of NABBIT C and NABBIT at a more
fine-grained level. In Figure 5.8 we see that NABBIT C, perhaps counter-intuitively, performs far
fewer total successful steals than NABBIT. The introduction of colored steals, and specifically
enforcing the first colored steal, helps to significantly reduce the total number steals by ensuring
that thieves acquire nodes higher up in the task graph to start with. Due to the depth-first nature of
the scheduler, nodes higher up in the task graph have more potential work. Therefore, by ensuring
thieves begin with nodes connected to the root, NABBIT C is able to effectively increase the amount
of work each worker begins with, reducing the total number of steals required.

Overhead due to enforcing first colored steal To calculate the overhead of ensuring that the
first steal is a colored steal, Fig. 5.9 shows the average amount of time processor cores spent
waiting to acquire work for the heat benchmark. We observed that the times were very similar for
all other benchmarks and do not present them here due to space limitations. While this overhead
can be substantial, it is agnostic to the application, provided there is at last one node from each
color connected to the root. This startup cost can be amortized out with larger, longer running
benchmarks. Additionally, recall that we observed that, in practice, enforcing the first colored
steal results in far fewer total number of steals which makes up for this overhead.

5.5.4 Importance of good coloring
Overheads with invalid coloring To evaluate this worst case overhead from attempted colored
steals, we assigned all nodes an invalid color (no worker has this color), ensuring that all colored
steals fail. Therefore, this version of NABBIT C behaves like original NABBIT apart from incurring
the overheads of colored steals. In Table. 5.3, we see that NABBIT C with this alternative coloring
performs comparable to original NABBIT indicating that the additional work performed by colored

84

steals introduces minimal overhead. Specifically, we observe that the mean speedups are within
one or two standard deviations, indicating that, for the benchmarks considered, colored steals incur
no statistically significant overhead7 .

Behavior under bad coloring

The performance of NABBIT C is directly tied to the coloring

provided by the user. NABBIT C assumes the user has constructed a “good” coloring and makes
decisions based on this assumption. In the event that the user has provided a “bad” coloring,
NABBIT C can perform as badly, or worse, than original NABBIT. To test this, we create a coloring
where all nodes are given valid incorrect colors. Therefore, in this implementation, all workers will
preferentially do non-local work. In Table. 5.2, we see that NABBIT C with a bad coloring loses all
the performance benefits achieved due to coloring and performs similar to NABBIT. Interestingly,
we observe that the mean speedups are within two standard deviations, indicating that NABBIT’s
locality behavior under random stealing is statistically no better than that of NABBIT C under an
intentionally bad coloring.

5.6

Related Work

Static task graph schedulers [45, 52, 75] minimize completion time while maximizing locality [72]
by completing expanding and analyzing a task graph, together with accurate information on computation and communication costs associated with each task. We consider task graphs that are
dynamically explored and do not require prior knowledge of task and communication times.
Cilk’s random work stealing is agnostic of locality considerations [33]. Several efforts have incorporated locality considerations by altering the work stealing strategy [37, 51, 55, 63]. These
7

We are working to understand the characteristics of the uk-2007-05 dataset that make it particularly vulnerable
to invalid or bad coloring.

85

approaches do not naturally extend to scheduling data-flow graphs while preserving provably efficiency in terms of scheduling overheads and effectiveness of load balancing.
Event-driven scheduling strategies map tasks to locality domains together with efficient identification and tracking of ready tasks that can be scheduled [20, 26]. In these systems, data distribution
implies a computation partitioning with no further migration of tasks to tackle load imbalance.
SuperMatrix [25], a runtime scheduling system for algorithms operating on blocks as observed in
linear algebra programs, mimics a superscalar microarchitecture’s scheduling strategy in software.
StarPU [9] is a task-graph scheduler for heterogeneous multi-core systems. Neither approach
accounts for data locality. Dague [21], a distributed DAG engine, improves locality by working on
the local queue when possible. XKaapi [35] is a work-stealing-based scheduler for task graphs that
pushes tasks to processors that have better locality for those tasks. It does not preserve the critical
path length or provide provable parallel efficiency.
SMPSs [61] schedules dependent tasks together to improve locality. Legion [10] exploits userspecified locality information and coherence properties to perform locality-aware scheduling using a software out-of-order processor. CnC [23] allows the specification of task graphs that are
scheduling using a variety of strategies. Legion and CnC also allow user-specification to control
task mapping and scheduling (using mappers in Legion and tuners in CnC). Olivier et al. developed various strategies to schedule OpenMP tasks including hierarchical scheduling, and work
stealing by one thread on behalf of others in the same chip [58]. None of these schedulers in these
systems attempt to preserve optimality guarantees. However, the scheduling strategy developed in
this paper can be used to develop provably efficient and locality-aware scheduling algorithms for
these task-graph frameworks.
Bugnion et al. [24] developed compiler-directed page coloring techniques to minimize conflict
misses. Chilimbi and Shaham [28] identified hot data streams and then colocated them to improve
86

spatial locality. Chen at al. studied scheduling threads for constructive cache sharing [27]. Various
approaches have studied the partitioning of shared caches among threads (e.g., [57, 64, 70]). These
approaches cannot be applied to optimize NUMA locality considered in this paper.

5.7

Conclusions and Future Work

In this chapter we presented NABBIT C, a flexible and easy-to-use task graph library that allows the
user to provide locality hints via the use of coloring and provides good load balance via dynamic
scheduling. NABBIT C is geared towards scheduling on NUMA hardware, where remote accesses
may be considerably more expensive than local accesses, but one must strike a balance between
locality and load balance to get good performance. Experimental results indicate that this approach
is promising, especially for memory intensive irregular applications running on NUMA machines,
where static scheduling can compromise load balancing and locality-unaware dynamic scheduling
has too many remote accesses. While NABBIT C uses Cilk Plus as the underlying language and
runtime, we believe this approach can be implemented on other systems such as Intel’s Threading
Building Blocks.
There are a number of potential directions for future work. Although coloring functions are not
often difficult to produce, user can still provide bad, or invalid, colorings which prevent NABBIT C
from getting any benefit over NABBIT. Providing the user containers which could automate data
distribution would allow for the discovering colors automatically, reducing the user’s programming
effort and ensuring a good coloring is always found.
Allowing for more complex colorings, like hierarchical colors, would allow NABBIT C to model
the system’s memory hierarchy more accurately. Processors could be assigned unique colors to
model their cache while those grouped on the same NUMA domain would share a parent color.
87

Remote memory accesses can also have differing access times depending on the machine’s memory
network which could be captured with further depth within a coloring hierarchy.

88

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32

//Recursively spawn colors using morphing continuations
void spawn_colors(colors):
if len(colors)==1:
spawn_nodes(colors[0])
else
c_p = /∗this worker’s color∗/
/∗split available colors into two halves∗/
first_half = colors[0:len(colors)/2]
second_half = colors[len(colors)/2:]
if c_p in second_half.keys():
swap(first_half, second_half)
cilkrts_set_next_colors(second_half.keys())
cilk_spawn spawn_colors(first_half)
spawn_colors(second_half)
cilk_sync
//Recursively spawn nodes of the same color
void spawn_nodes(nodes):
if len(nodes)==1:
if nodes[0] is a successor:
if nodes[0] is ready:
nodes[0].compute_and_notify()
else: /∗predecessor key∗/
try_init_compute(this,nodes[0])
else:
color = nodes[0].color() /∗all nodes have same color∗/
first_half = nodes[0:len(nodes)/2]
second_half = nodes[len(nodes)/2:len(nodes)]
cilkrts_set_next_colors(color)
cilk_spawn spawn_nodes(first_half)
spawn_nodes(second_half)
cilk_sync

Figure 5.3: Pseudo-code for color-aware spawning of a set of nodes in NABBIT C using morphing
continuations. We use a hybrid C++/Python syntax to enhance readability.

89

1
2
3
4
5
7
8
9
10
11
12
13
14
15
17
18
19
20
21
22
23
25
26
27
28
29
30
31
32
34
35
36
37
38
39

/∗Helper functions to obtain colors∗/
int color(DynamicNabbitNode node):
return node.color()
int color(DynamicNabbitNode ∗node):
return node−>color()
/∗Gather list of spawns based on their color.
T = Key (for predecessor list) or
T = DynamicNabbitNode (for successor list)∗/
auto gather_colors(T nodes):
//group nodes based on their colors
Map<int,List<T>> colors
for n in nodes:
colors[color(n)].add(n)
return colors
/∗Initialize this (already created) node and compute∗/
void init_node_and_compute():
this.init()
colors = gather_colors(this.predecessors)
spawn_colors(colors)
if all this.predecessors have been computed:
this.compute_and_notify()
/∗Try to initialize node’s predecessor with key pkey ∗/
void try_init_compute(node, pkey):
//atomically attempt to create a predecessor with key pkey
pred = /∗reference to node for key pkey∗/
if /∗creation succeeded∗/:
pred.init_node_and_compute()
else: /∗already created by this or some other thread∗/
atomic pred.successors.add(node) //enqueue
/∗compute a node and notify its successors∗/
void compute_and_notify():
this.compute()
while /∗there are new successors in this.successors∗/:
colors = gather_colors(this.successors)
spawn_colors(colors)

Figure 5.4: Key routines to spawn predecessors and successors in NABBIT C.

1
2

void cilkrts_set_worker_color(int color)
void cilkrts_set_next_colors(List<int> colors)

Figure 5.5: Extensions to the Cilk Plus RTS API to inform the runtime of the worker’s preferred
color and the colors available in a continuation.
90

Benchmark
cg

Description Problem size

NAS conjugate gradient
mg
NAS multigrid
heat
Heat diffusion stencil
fdtd
Finite
difference time
domain
life
Conway’s
game
of
life
page-uk-2002
PageRank
uk-2002
dataset
page-twitter-2010 PageRank
twitter2010
dataset
page-uk-2007-05 PageRank
uk-2007-05
dataset
sw
SmithWaterman
(n3 )
swn2
SmithWaterman
(n2 )

Iterations Task graph
Serial
nodes time (seconds)

N A = 900000,
N N Z = 26

1

300

309

n{x, y, z}
=
2048, LM = 11
n = 16384, m =
655360
n = 16384, m =
655360

1

16384

690

5

102400

377

5

102400

970

n = 16384, m =
655360

5

102400

275

nv = 18M, ne =
298M

10

1800

198

nv = 41M, ne =
1468M

10

4100

1025

nv = 105M,
ne = 3738M

10

10500

900

{n, m} = 5120,
B = 32x32

1

25600

450

{n, m}
=
131072,
B = 1024x1024

1

16384

179

Table 5.1: Benchmark configurations and serial O PEN MP STATIC execution time. The PageRank
benchmarks use the same code with three different web crawl datasets [18, 19, 46].

91

O PEN MP GUIDED

N ABBIT

O PEN MP STATIC

N ABBIT C

60
40
45
40
35
50
25
20
35
30
40
30
20
25
15
25
30
20
15
20
10
15
20
15
10
10
10
5
10
5
5
5
0
0
0
0
0
0 1020304050607080 0 1020304050607080 0 1020304050607080 0 1020304050607080 0 1020304050607080
30

Speedup

25

cg

mg

life
40
35
40
20
15
30
15
25
30
15
10
20
10
20
10
15
10
5
5
5
10
5
0
0
0
0
0
0 1020304050607080 0 1020304050607080 0 1020304050607080 0 1020304050607080 0 1020304050607080
20

25

heat

fdtd

50

Speedup

20

page-uk-2002

page-twitter-2010

page-uk-2007-05

sw

swn2

Figure 5.6: Speedup for all benchmarks. x-axis: number of threads (processor cores); y-axis:
speedup over serial O PEN MP STATIC.

N ABBIT C

20

20

40

40
60
80 20
40
60
80 20
40
60
80 20
page-uk-2002
page-twitter-2010
page-uk-2007-05

40

mg

60

80 20

40

cg

60

40
60
heat

O PEN MP STATIC

80 20

40

80 20

N ABBIT

life

sw

60

80 20

40

60
fdtd

80

60

80 20

40
60
swn2

80

Figure 5.7: Percentage of accesses that are to data in remote NUMA domains. We show percentages for 20–80 cores (1–10 cores fit in one NUMA domain and do not incur remote accesses).
x-axis: core count; y-axis: Percentage of accesses that are remote.

92

cg
mg
fdtd

heat
life
uk02

twitter
uk07

sw
swn2

4000
3500
3000
2500
2000
1500
1000
500
0
40000
35000
30000
25000
20000
15000
10000
5000
0

a) N ABBIT C

0

10

20

30

40

50

60

70

80

b) N ABBIT

Time (s)

Figure 5.8: Average number of successful steals for (a) NABBIT C and (b) NABBIT. x-axis: number
of cores; y-axis: Average number of successful steals
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.0

1

2

4

10

20

40

60

80

Processor cores

Figure 5.9: Average idle time per processor core (across all processor cores and runs) due to forcing
the first colored steal for the heat benchmark. Error bars show standard deviation across five runs
among all processor cores. We observed this time was the same for all benchmarks.
P

cg

20
40
60
80

0.96
0.97
0.96
1.02

mg heat fdtd life uk-02 twitter uk-07 sw swn2
0.76
0.88
0.97
0.96

0.66
0.78
0.95
0.93

1.08
1.10
1.08
0.94

0.65
0.77
0.94
0.97

0.75
0.83
0.89
0.91

0.83
0.55
0.49
0.41

0.76
0.94
1.02
1.02

0.76
0.88
0.97
0.96

0.66
0.78
0.95
0.93

0.98 0.89 0.83 1.05 0.83 0.85
0.02 0.08 0.12 0.06 0.13 0.06

0.57
0.16

0.93 0.89 0.83
0.11 0.08 0.12

Table 5.2: Speedup of NABBIT C over NABBIT when all tasks are assigned bad colors resulting in
preferential execution of non-local tasks. S.D. denotes standard deviation.
93

P

cg

20
40
60
80

1.03
1.02
0.99
1.06

mg heat fdtd life uk02 twitter uk07 sw swn2
0.99
0.99
0.98
0.97

0.94
0.99
0.94
0.88

1.06
1.04
1.03
0.91

0.93
0.92
0.92
0.98

1.03
0.97
0.98
0.94

1.12
1.09
1.01
1.07

1.12
1.10
1.08
1.07

0.99
0.99
0.98
0.97

0.94
0.99
0.94
0.88

1.03 0.99 0.94 1.01 0.94 0.98
0.02 0.01 0.04 0.06 0.03 0.03

1.07
0.04

1.09 0.99 0.94
0.02 0.01 0.04

Table 5.3: Speedup of NABBIT C over NABBIT when all tasks are assigned invalid colors resulting
in failure of all colored steal attempts. S.D. denotes standard deviation.

94

Chapter 6
Locality-Friendly Parallel For Loops

6.1

Introduction

PARALLEL _ FOR loops are a simple mechanism for programmers to understand:

rather than execute

sequentially, loop iterations can instead be executed in parallel. Programming languages that offer
a PARALLEL _ FOR loop typically make use of the same syntax as a regular loop, making it easy for
programmers to parallelize their applications with minimal changes to their code.
There exist many applications where the core of the work done is within the iterations of a for loop
all of which can often benefit directly from PARALLEL _ FOR loops, provided there does not exist
any data dependencies between iterations.
In many cases where an application utilizes a series of PARALLEL _ FOR loops, such as performing
the same computation over many time-steps, if the same cores execute the same iterations each
invocation of the loop, they end up reducing cache misses as well as remote memory accesses
when executing on a machine with non-uniform memory accesses. Figure 6.1a demonstrates how
data might get used during a statically scheduled PARALLEL _ FOR loop. By assigning iterations

95

directly to cores, the data needed by an individual core will be the same each time the loop is
executed.
Although C ILK P LUS, a popular parallel programming language, offers PARALLEL _ FOR loops,
they are executed dynamically, with no guarantee of which core executes which iteration, preventing them from being able to take advantage of these potential locality benefits. In C ILK P LUS,
PARALLEL _ FOR

loops are syntactic sugar for recursively split loop iterations that are scheduled

dynamically at runtime. Figure 6.1b demonstrates one iteration of a C ILK P LUS PARALLEL _ FOR.
The iterations acquired by P2 depend on it acquiring half of the total iterations from P1 in the
beginning. If another core was able to acquire this instead in in a successive execution, P2 would
eventually find a different set of iterations. Although this PARALLEL _ FOR implementation does
not readily benefit from locality it ensures load balance.
In this chapter we present locality-friendly PARALLEL _ FOR loops for the C ILK P LUS runtime. Our
goals are to:

1. Elevate PARALLEL _ FOR loops beyond syntactic sugar into first class citizens within the
C ILK P LUS runtime.
2. Minimize the changes required of the user to when parallelizing their loops or switching
from the current PARALLEL _ FOR implementation.
3. Implement statically assigned iterations so that cores may execute the same iterations each
time a loop is invoked.
4. Support dynamic load-balancing after initial assignments have been completed.
5. Have as minimal overhead as possible.

96

(a) Static assignment of loop iterations across processors. (b) Dynamic discovery of loop iteration assignment across
Each execution of this loop results in the same assignment processors. Here P2 acquires the right half from P1, however
in subsequent executions of the loop if another processor acquires that half first, P2 will get some other set of iterations

While we find that our design for CILK _ PARFOR is indeed able to accomplish most of our goals, it
introduces additional overhead which can become problematic for applications with small workloads.

6.2

Related Work

PARALLEL _ FOR

loops have been implemented in some form or another for many commonly used

programming languages. These implementations, do not, however, combine good load-balance
while addressing locality.
Two related PARALLEL _ FOR scheduling algorithms address locality directly. Markatos and Leblanc [54]
proposed the affinity scheduling algorithm which divides the iteration space evenly across cores,
has them work on decreasingly smaller subsets of iterations until completed and then allows them
to assist the core with the largest remaining set of iterations. Li et al. [49] expand upon AFS by
first distributing data across NUMA domains and having workers begin with local work first. They
improve temporal locality by considering the data distribution used when ordering iterations execution and reduce the number of synchronization operations by unifying their subtask sizing policy.

97

While both of these algorithms consider locality, they rely on global synchronization which can
introduce significant overhead.

6.3

Background

In this section we provide a brief overview of the C ILK P LUS runtime system including the programming interface it provides, the basic data structures used to implement randomized workstealing and its implementation of PARALLEL _ FOR as syntactic sugar.
The C ILK P LUS runtime utilizes a randomized work-stealing scheduler to efficiently execute applications in parallel. It houses a number of worker threads which, in distributed fashion, acquire
work from one another to dynamically. Programmers use the cilk_spawn and cilk_sync keywords
which allow to mark opportunities for parallelism and enforce synchronization respectively.
At a high-level, a cilk_spawn designates that the spawned routine is capable of running in parallel with everything after the cilk_spawn and until the next cilk_sync. It is important to not that a
cilk_spawn does not guarantee that the routine will in fact execute in parallel with the following

code. Users specify logical parallelism while workers determine at runtime what will be executed
where.
When a worker spawns a function, it begins executing that function right away and makes the
remaining work available for some other worker to steal. This stealable work takes the form of a
frame stored inside the worker’s double-ended queue deque. During normal execution, workers
push and pop frames from the bottom of their deque, in similar fashion to the C call stack. When
a worker runs out of work they become a thief and attempt to randomly steal a frame from the
top of another worker’s deque. Frames at the top of a worker’s deque are from spawns performed

98

earlier in the application and have a higher potential for capturing a larger amount of the workload,
a feature which becomes important for proving CILK’s time bound.
C ILK P LUS additionally offers a cilk_for loop which automatically parallelizes the of a PARAL LEL _ FOR

loop. These loops are implemented as syntactic sugar for recursive spawns, making the

runtime completely unaware of the cilk_for loops. cilk_for loops are recursively split in half, either
to sets of size n/P , where n is the number of iterations and P is the number of workers, or to a
user-specified size. At each split, half of the iterations are spawned, allowing a thief to acquire
them, while the other is taken by the current active worker.
Due to the fact that this implementation is built with cilk_spawn and cilk_sync, it is able to maintain
CILK ’s

optimal time bound. However, in memory bound applications, where locality can play

the main role in determining performance, a dynamically scheduled PARALLEL _ FOR loop can
perform poorly. cilk_for cannot ensure loop iterations are executed by the same worker over multiple
invocations, forcing workers to continually require new memory from random locations which can
severely impact performance.

6.4

Design

In this section we detail the design of our CILK _ PARFOR loops which are no longer just syntactic
sugar for a series of spawns but instead first class citizens handled within the C ILK P LUS runtime.
The main objective of our design is to address the inherent locality issues of successive, dynamically scheduled PARALLEL _ FOR loops by introducing static assignments – workers can get better
locality by executing the same set of iterations each time a PARALLEL _ FOR loop is invoked. Upon
completion of their initial assignment, workers are then able to steal iterations from one another
sacrificing locality for load-balance.
99

6.4.1 Assigning Iterations to Workers
Our CILK _ PARFOR loops utilize a static assignment policy, directing workers to execute the same
subset of iterations each time the CILK _ PARFOR is executed. Working on the same subset of iterations for each CILK _ PARFOR loop invocation can grant both cache and NUMA locality benefits.
Additionally, a naive assignment of iterations can also result in an even load distribution when
individual iterations have similar workloads.
To distribute iterations across workers, CILK _ PARFOR divides up the iteration space into evenly
sized contiguous subsets and passes each to a unique worker. Each worker is given an equal number
of iterations and no iteration is given to more than one worker. In the event that there are more
workers than iterations, CILK _ PARFOR will assign a single iteration to as many workers as there
are iterations.
With such an assignment policy, there are a number of ways CILK _ PARFOR can get locality benefits. Consider the case where the data required by the subset of iterations assigned to a particular
worker is able to fit in that worker’s private cache. If the CILK _ PARFOR loop is executed again, the
worker will be assigned the same iterations as before and therefore the data require will already be
in cache. Dynamic scheduling policies can not make the guarantee that workers execute the same
iterations during each CILK _ PARFOR and as a result, incur far more cache misses.
On NUMA machines, static assignment policies can be particularly good at reducing the number
of expensive remote memory accesses. To achieve this, the data used by a particular worker must
live in that worker’s NUMA domain. User’s can infer the iteration assignments that our policy will
find and distribute the necessary data accordingly. Assuming first-touch this can also be done with
the use of a starting initialization CILK _ PARFOR loop at the beginning of the application.

100

6.4.2 Elevating CILK _ PARFOR beyond syntactic sugar
Vanilla C ILK P LUS executes cilk_for loops by recursively splitting and spawning iterations, making
it simply syntactic sugar for a number of spawns. To elevate CILK _ PARFOR loops into first class
citizens, we introduce the notion of CILK _ PARFOR frames into the runtime system, add a new meta
data structure into the runtime to hold the loop information and update the scheduling routines to
handle CILK _ PARFOR loops accordingly.
We’ve introduced two special frame types for CILK _ PARFOR loops, PARALLEL _ FOR parent frames
and iteration frames. The PARALLEL _ FOR parent frame is marked upon creation of the CILK _PARFOR

loop and will collect child iteration frames as workers arrive and execute their assigned

iterations. Iteration frames are not instantiated during the creation of the CILK _ PARFOR loop but
instead created when a worker begins executing their assignment. Workers discover the CILK _PARFOR

loop by attempting to steal from a victim who’s working on the loop. Once discovered,

rather than taking work off of the victim’s deque, the thief creates a new iteration frame and begins
executing their assignment.
The CILK _ PARFOR meta data structure holds all of the information pertinent to the CILK _ PARFOR
loop, including the iteration assignment for each worker as well as a pointer to the loop’s body. The
iterations within a worker’s assignment are the remaining iterations that the worker will execute.
Upon creation of the CILK _ PARFOR loop, this will be the worker’s initial static assignment and as
the worker completes iterations, it updates it’s current assignment to reflect it’s progression.

6.4.3 Stealing iterations towards load balance
Workers may not always complete their iteration assignments simultaneously, either due to imbalance in the workloads of individual iterations or differences in when they start executing their
101

1
2
3

CILK_PARFOR_BEGIN(it,start,end,incr) {
...
} CILK_PARFOR_END(it,start,end,incr);

Figure 6.1: CILK _ PARFOR user interface
iterations. To address this problem, we include support to allow workers to steal iterations from
one another. Stealing iterations from a victim requires updating their current assignment. Recall
that a worker’s assignment is it’s remaining iterations to execute. As a worker progresses through
its assignment, it takes iterations from the “front” while a thief will steal iterations from the “back”.
To prevent races on iterations we ensure only a single thief can attempt to steal by requiring it to
hold the victim’s steal lock. The victim is not required to take this lock in order to take iterations,
instead we utilize the Dekker protocol [?] which ensures that the victim and the single thief do not
take the same iterations.

6.5

Implementation

In this section we describe the implementation details of our CILK _ PARFOR loops. Our aim is
to adopt the classic PARALLEL _ FOR programming interface and include runtime support within
C ILK P LUS for executing PARALLEL _ FOR loops natively. Our implementation works alongside
the normal C ILK P LUS scheduling body, relying on an augmented steal policy to both find assigned
work and continue work-stealing when all assigned iterations are completed.

6.5.1 Programming Interface
CILK _ PARFOR

uses a familiar programming interface, requiring from the user an iterator, an initial

assignment, a final assignment and an increment to iterate with. We demonstrate this interface in
102

Figure 6.1. The programmer utilizes a pair of macros wrapped around their iteration code to specify
their CILK _ PARFOR. In this signature, it is a variable name which may or may not be instantiated
already, while start,end and incr are integers8 for the initial assignment, the final assignment and the
increment value .
Although this interface does not directly mimic the traditional PARALLEL _ FOR programming interface, it makes use of the same input parameters. We utilize macro definitions over a novel
keyword for ease of implementation as doing so only requires changes to the library and not the
compiler itself.

6.5.2

CILK _ PARFOR

loop execution

CILK _ PARFOR loops are executed in three stages:

1) creation, 2) iteration execution and eventually

3) destruction. The user-thread, which we refer to as the creator, is responsible for storing the
iteration code block, as well as creating and populating a metadata structure to hold assignments
and book-keeping utilities. Once finished, the creator begins executing their assigned iterations
while other workers begin arrive to do the same. Upon completion of all iterations, control is
returned to the creator thread to cleanup and end the CILK _ PARFOR.

CILK _ PARFOR

creation

The creation of a CILK _ PARFOR involves two main tasks: 1) storing the user’s iteration code block
and 2) instantiating a metadata structure to store the CILK _ PARFOR parameters and hold relevant
book keeping information.
8

Although we currently only support integers, our design could readily accept arbitrary data types via templating.

103

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

#define CILK_PARFOR_BEGIN(it_n,st,ed,inc) \
do { \
auto iter_body_spawn_helper = \
[&](__cilkrts_parfor_data ∗__cilk_pfd,__cilkrts_stack_frame ∗__cilk_psf) { \
__cilkrts_worker ∗w = __cilkrts_get_tls_worker(); \
CILK_PARFOR_ITER_PROLOG(__cilk_psf,w) \
auto iter_body_func = \
[&](__cilkrts_parfor_data ∗__cilk_pfd) \
{\
int done = 0; \
__cilkrts_parfor_iter_data __cilk_itd; \
__cilkrts_parfor_cnt_t it_n; \
while(!done) { \
done = !__cilkrts_check_for_iteration(__cilk_pfd,w,&__cilk_itd); \
it_n = __cilk_itd.start ∗ inc; \
while (it_n < __cilk_itd.end∗inc) { \
do
/∗ Iteration code goes here ∗/
#define CILK_PARFOR_END(it_n,st,ed,inc) \
while(0); \
it_n += inc; \
}\
}\
CILK_PARFOR_ITER_EPILOG() \
__cilkrts_end_iteration(w,__cilk_pfd,&__cilk_iter_sf); \
}; \
iter_body_func(__cilk_pfd); \
}; \
__cilk_parfor_execute_loop(st,ed,inc,iter_body_spawn_helper); \
} while (0);

Figure 6.2: The expansion of the CILK_PARFOR_BEGIN and CILK_PARFOR_END macros

104

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

template <typename IterHelperFunc>
CILK_PARFOR_NO_INLINE
void __cilk_parfor_execute_loop(__cilkrts_parfor_cnt_t _start,
__cilkrts_parfor_cnt_t _end,
__cilkrts_parfor_cnt_t _incr,
const IterHelperFunc& iter_helper_func)
{
// We’re CALLING P4,
__cilkrts_stack_frame __cilk_psf;
__cilkrts_parfor_data ∗ pfd;
CILK_PARFOR_PROLOG(__cilk_psf);
__cilkrts_worker ∗w = __cilkrts_get_tls_worker();
// Get ready to set the jump point
CILK_PARFOR_FAKE_SAVE_FP(__cilk_psf);
int ret = CILK_SETJMP(__cilk_psf.ctx);
if (ret == 0) { // Setup
__cilkrts_parfor_params pfp = {_start,_end,_incr};
__cilkrts_setup_parfor(&pfd,&(__cilk_psf),&pfp);
}
__cilkrts_worker ∗wk = __cilkrts_get_tls_worker();
if (pfd−>jump_flag[wk−>self] <= 1) { // Execute iterations
iter_helper_func(pfd,&__cilk_psf);
// Worker will never return here
}
else { // Clean up
__cilkrts_destroy_parfor(pfd);
}
CILK_PARFOR_EPILOG();
return;
}

Figure 6.3: The main CILK _ PARFOR execution body
The storing of the iteration code block is handled within the expansion of the provided macros, as
shown in Figure 6.2. The user’s iteration code is wrapped within a helper lambda function which
the creator takes into __cilk_parfor_execute_loop, the main CILK _ PARFOR execution body. This
helper function will be used by each worker to iterate over their assigned iterations and execute the
user’s code block with the appropriate iterator setting.
In Figure 6.3 we show __cilk_parfor_execute_loop, the main execution body for CILK _ PARFOR.
When the creator first executes this function it creates an instance of __cilkrts_parfor_data which

105

we refer to as the CILK _ PARFOR’s pfd. The pfd is a shared data structure holding the CILK _PARFOR

parameters, worker’s assignments as well as all bookkeeping information. Following this

instantiation, the creator calls setjmp to store a jump location which all other workers will utilize to
enter the execution body and begin their iterations. A call to longjmp with this jump location results
in that worker returning from the setjmp call with a return value of 1 – The creator ,who first calls
setjmp, returns with a value of 0. Checking the return value of setjmp allows the creator to complete

the CILK _ PARFOR setup.
We utilize this jump location for workers to begin executing a set of iterations and for the creator
to return and end the CILK _ PARFOR. To differentiate between which jump is occurring, we check
a jump flag which is always set immediately before exiting the runtime.

Executing iterations
Although the creator is able to begin executing iterations immediately after creating the CILK _PARFOR

loop, there are two tasks which must be completed before any other worker is able to

begin their own iterations. Before the first non-creator worker can begin executing iterations, they
must first 1) clear out any work from the creator’s deque that is above the CILK _ PARFOR frame and
2) detach and suspend the CILK _ PARFOR frame. Once completed, all workers are able to begin
executing their assigned iterations.
The creator’s deque must first be cleared before any other worker can start their iterations. When
the creator begins executing their assignment, they push a frame associated with the CILK _ PARFOR
onto the bottom of their deque and proceed to immediately execute their iterations with a new
iteration frame. To discover the CILK _ PARFOR loop, workers must perform a steal attempt that
finds either the CILK _ PARFOR frame (in the case that no other have done so yet) or an iteration

106

frame. Since steal attempts only look at the top of a victim’s deque, thieves must first pop all other
frames to access the CILK _ PARFOR frame.
Once the first worker encounters the CILK _ PARFOR frame rather than take the frame for themselves, as is done during a normal steal attempt, they suspend that frame and attach to it a new
iteration frame which they create for themselves. Recall that a suspended frame is a frame not currently owned by a worker to which child frames have been attached. A suspended frame is meant
to become ready for execution when all of it’s children have completed. The parent CILK _ PARFOR
frame keeps track of the active child iteration frames with it’s join counter, a value which will be
important for cleanup.
When a worker is ready to begin executing a set of iterations, they set their jump flag within the
pfd and make a call to longjmp to return to the location stored by the creator’s call to setjmp (6.3,

Line 15) and begin executing their assignment. Whenever a worker has jumped to this point with
a jump flag of 1, they have an assigned set of iterations ready execute. Each worker’s assignment
is stored as a set of start and end indices held within the pfd. Upon the creation of the CILK _PARFOR

these indices represent the worker’s initial assignment, however, if the worker completes

their initial assignment and steals more iterations, the start and end indices will be set to reflect
this new assignment – these indices effectively represent the iterations that the worker has left to
execute when they next as they arrive and proceed with the CILK _ PARFOR execution body. To
execute these iterations, the worker calls the helper function holding the user’s iteration code.
Lines 13-28 from Figure 6.2 show the loop responsible for executing assigned iterations. While iterations remain, the worker queries the pfd for a new iteration to execute. __cilkrts_check_for_iteration
increments the worker’s start index by one,

9

indicating that the iteration at that index will now

9

At the moment we only allow for iterations to be divided into grains size of 1, however, supporting larger grain
simply requires changing the amount with which __cilkrts_check_for_iteration increments the start index.

107

be executed. This continues until there are no more iterations left within the worker’s current assignment. To allow workers to steal iterations from each other, __cilkrts_check_for_iteration uses the
Dekker protocol, a lock-free resource sharing algorithm, to prevent races on iterations. Thieves
are responsible for acquiring a steal lock before attempting to interrupt the victim and acquire
iterations through Dekker.

Steal policy
CILK _ PARFOR

integrates within C ILK P LUS’s runtime scheduling through an augmented steal pol-

icy which allows workers to discover assignments or find iterations to steal. Continuing to utilize
steals as the mechanism for acquiring work affords us a natural integration into the scheduling
while readily providing support for nested parallelism within iterations. Although such support
comes with our implementation we leave the exploration of this feature to future work.
Rather than instrumenting a novel mechanism to pass iterations directly, , workers discover the
CILK _ PARFOR

and the iterations they should execute through random steal attempts. During a

normal steal attempt, a worker acquires a frame from the top of the victim’s deque and proceeds
to execute it. When a thief finds a victim working on CILK _ PARFOR iterations, by finding either a
CILK _ PARFOR

frame or an iteration frame, they instead find their own set of iterations to execute.

The following priority is observed when a thief searches for iterations upon encountering a CILK _PARFOR

during a random steal:

1. The thief’s initially assigned iterations
2. An initial assignment which some other worker has yet to start
3. The iterations of a randomly chosen victim

108

Figure 6.4: Iteration steal example
Prioritizing the initial static assignment ensures the main goal of our implementation, to allow
workers to acquire the same iterations over multiple invocations of a CILK _ PARFOR. Our secondary goal, to employ dynamic load-balance when initial assignments have finished, is accomplished by allowing workers to steal iterations from one another. When stealing iterations, we
prioritize an initial assignment which has yet to start over the iterations of a random victim, ensuring that the theif acquires as many iterations as possible. Allowing workers to steal iterations
from themselves sacrifices the benefits of always executing the same iterations to reduce idle time
and improve load-balance. In some cases, however, the cost of a worker computing an iteration
that was not initially assigned to them could out-weigh the idle time saved from stealing it, we
therefore allow users to disable the stealing of iterations to ensure a purely static distribution.

109

Once a victim has been found, either by discovering they have not begun their initial assignment
or via random selection, the thief steals half of the victim’s current assignment for themselves.
Figure 6.4 demonstrates this.

a) Worker 1 is assigned 5 iterations where S1 and E1 represent it’s start and end indices respectively.
b) Worker 1 takes the gets the first iteration to execute when it calls __cilkrts_check_for_iteration
(Figure 6.2: Line 14). ls1 and le1 are the local start and end values that are used by the grain
execution loop Figure 6.2: Lines 16-22.
c) Worker 2 steals iterations from Worker 1. The theif accomplishes this by setting E2 ← E1,
).
E1, S2 ← S1 + ( E1−S1
2
d) Worker 2 begins executing the newly stolen iterations.

Once a worker has updated their starting index, the iterations they have taken are no longer available for stealing – thieves are only capable of stealing from a victim’s set of assigned iterations.

Cleaning up
Upon completion of all the CILK _ PARFOR iterations, the creator is responsible for destroying the
__cilkrts_parfor_data instance and exiting the CILK _ PARFOR.

When a worker finishes an assigned set of iterations, they return to the runtime with a call to
__cilkrts_end_iteration. There, they unlink their current frame from the parent CILK _ PARFOR frame

decrementing the parent’s join counter. When the join counter is brought to zero there are no
more active workers executing iterations, however, it may be the case that some worker has yet
110

to arrive and begin executing their assigned iterations. To ensure all iterations are completed, a
worker decrementing the parent’s join counter to zero will also check if there exist any workers
who have yet to arrive, and if so, steal iterations from them. If this is not the case, the parent
CILK _ PARFOR

frame is then pushed to the creator, who’s jump lag is set so that after returning to

the stored jump location, Line 00 in Figure ??, they finalize the CILK _ PARFOR by destroying the
__cilkrts_parfor_data and continuing with the remainder of the application.

6.6

Experiments

In this section we present an evaluation of our CILK _ PARFOR loops which address key points from
our initial goals. Our aim is to show our implementation:
1. Performs no worse than vanilla cilk_for loops on time-bound applications
2. Can improve speedup over cilk_for loops on data-bound applications
3. Introduces minimal overhead
The following experiments will show our CILK _ PARFOR loops can achieve goals 1) and 2) but
falls short on 3). Our implementation introduce a non-trivial overhead which can result in poor
performance in some cases.

6.6.1 Experimental Setup
Our experiments were run on a 32-core machine with four Intel Eight Core Xeon E5-4620 2.2Ghz
processors which have 16M shared L3 and 256K private L2 caches. Our experiments were compiled using gcc 5.2.0 using a modified C ILK P LUS library for our CILK _ PARFOR loops.
111

We focus on the average overall speedup of the applications over 10 runs and compare the following:
• cilk_for: C ILK P LUS’ vanilla cilk_for loops
• O PEN MP STATIC: O PEN MP’s static PARALLEL _ FOR loops
• CILK _ PARFOR - HYBRID: Our CILK _ PARFOR loops with static iteration assignments and iteration stealing enabled
• CILK _ PARFOR - STATIC: Our CILK _ PARFOR loops with static iteration assignments and iteration stealing disabled
We include O PEN MP loops in our evaluation as they provide the type of statically assigned iterations that we want without support for load-balance. To compute speedup, we compare against
O PEN MP’s serial execution time.

6.6.2 Time-Bound Applications
Although time-bound applications are unable to benefit from improvements in locality it is important to ensure our CILK _ PARFOR can perform as well as vanilla cilk_for loops. To explore this
application type, we implemented the wavefront version of the n3 Smith-Waterman algorithm. For
this experiment we use a problem size of 2048x2048 with blocks of size 16x16.
Figure 6.5 shows speedup over serial O PEN MP as we increase the number of cores. All implementations achieve roughly the same speedup over the baseline. In this case, C ILK P LUS’s dynamically
scheduled cilk_for is as good as O PEN MP’s purely statically scheduled PARALLEL _ FOR loops and
our CILK _ PARFOR loops are no worse than either regardless of whether or not iteration stealing is
enabled.
112

20
O PEN MP

CILK PARFOR -S TATIC

CILK PARFOR -H YBRID

CILK FOR

Speedup

15

10

5

0

1 2

4

8

16
Processors

Figure 6.5: Smith-Waterman results

113

32

30
O PEN MP

CILK PARFOR -S TATIC

CILK PARFOR -H YBRID

CILK FOR

25

Speedup

20

15

10

5

0

1 2

4

8

16
Processors

32

Figure 6.6: Conway’s Game of Life results

6.6.3 Data-Bound Applications
Data-bound applications rely more heavily on locality as the majority of their workload comes
from performing data accesses. To evaluate our CILK _ PARFOR’s effect on locality we look a
stencil application with very regular memory access patterns. Conway’s Game of Life is a cellular
automaton which computes whether a cell lives or dies over time depending on it’s immediate
neighborhood. For this experiment we use a problem size of 16K x 16K and use blocks of size
16x256.

114

Figure 6.6 shows speedup over serial O PEN MP as we increase the number of cores. From these
results notice two key observations: 1) O PEN MP performs best for this application, 2) CILK _PARFOR - STATIC performs worse than cilk_for, however enabling iteration stealing in CILK _ PARFOR HYBRID

allows our CILK _ PARFOR loops to compete with O PEN MP.

O PEN MP is able to outperform all other implementations on such an application because a strictly
static assignment of iterations leads to perfect locality. Furthermore, with the application itself
being highly regular, an even distribution of iterations results in an even distribution of work.
O PEN MP has the added benefit of introducing very minimal overhead, with it’s threads being able
to begin executing iterations immediately.
Surprisingly CILK _ PARFOR - STATIC performs poorly, despite being purely static like O PEN MP.
The reason for this difference being that CILK _ PARFOR - STATIC workers are not able to execute
iterations immediately and must first perform random steal attempts in order to discover the CILK _PARFOR

loop. This delay does, however, not explain why CILK _ PARFOR - STATIC performs worse

than cilk_for. With iteration stealing disabled, workers in CILK _ PARFOR - STATIC go idle once they
have completed their assignment. When coupled with delayed starts, this results in an work imbalance at runtime, with many workers staying idle. Although cilk_for also suffers from the same
delayed start, workers continue performing random steals and assisting each other after completing a subset of iterations and can achieve better load balance. Enabling iteration stealing allows
our CILK _ PARFOR loops to edge out over cilk_for since they are able to get good load balance to
counter act the delayed starts but get better locality with initial static assignments.

6.6.4 Investigating Our Overhead
The block sizes selected for the previous experiment was result of an observation of the overhead
introduced in our CILK _ PARFOR loops. We found that when individual iteration workloads were
115

small, the added call to __check_for_iteration would dominate the execution. Recall this function
called by a worker while executing an assigned set of iterations and is responsible for preventing
races on individual iterations.
When we run Conway’s Game of Life with smaller a block size of 16x16, we spent 82% of the total
time executing __check_for_iteration, prohibiting our CILK _ PARFOR loops from being competitive
in this case.
Using larger block sizes allows us to avoid the issues associated with this overhead, however this
limits our CILK _ PARFOR loop’s performance on smaller applications as larger blocking factors
result in less overall parallelism.

6.7

Conclusions and Future Work

PARALLEL _ FOR

loops offer a convenient way for programmers to quickly parallelize their appli-

cations. Many of these loop based applications can achieve perfect locality with PARALLEL _ FOR
loops when iterations are evenly distributed and executed on the same processor over and over. If
an even distribution of iterations translates to an even work distribution, this static assignment results in good load-balance. It may be the case, however, that loop iterations have differing amounts
of work and a simple static assignment strategy will be incapable of finding a load-balanced schedule.
C ILK P LUS’ cilk_for, on the other hand, naively achieve load balance at the cost of providing no
guarantees about what executes where. As a result, their cilk_for loops cannot take advantage of the
locality inherent in many loop-based applications.

116

Our CILK _ PARFOR loops are able to statically assign iterations to workers, enabling them to exploit the locality benefits of continually executing the same iterations. On top of this, we support
dynamic load-balancing by allowing workers to steal iterations after they have completed their initial assignment. We find that in applications which are not sensitive to locality, our implementation
does not perform significantly worse than vanilla cilk_for. On data bound applications with regular
memory access patterns, which are ideal for O PEN MP’s static scheduling, CILK _ PARFOR is able
to edge out above cilk_for, getting much closer to O PEN MP’s performance.
For future work, it would be ideal to implement CILK _ PARFOR as a language keyword, which
would help simplify the process of porting existing applications to use it. Doing so would require
adding the ability to select grain sizes (the smallest schedulable unit), use arbitrary iteration types
and test conditions for completion.
Additionally we would like to explore how to enable parallelism within an iteration as well as
supporting dynamic discovery of iteration assignments that can be saved and replayed.

117

References
[1] Alok Aggarwal and Jeffrey Scott Vitter. The input/output complexity of sorting and related
problems. Communications of the ACM, 31(9):1116–1127, September 1988.
[2] Kunal Agrawal, Anne Benoit, Fanny Dufossé, and Yves Robert. Mapping filtering streaming
applications with communication costs. In Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, pages 19–28, 2009.
[3] Kunal Agrawal, Anne Benoit, Fanny DufossÃl’, and Yves Robert. Mapping filtering streaming applications. Algorithmica, pages 1–51, September 2010.
[4] Kunal Agrawal, Anne Benoit, and Yves Robert. Mapping linear workflows with computation/communication overlap. In ICPADS ’08: Proceedings of the 2008 14th IEEE International Conference on Parallel and Distributed Systems, pages 195–202, Washington, DC,
USA, 2008. IEEE Computer Society.
[5] Kunal Agrawal and Jeremy Fineman. Brief announcement: Cache-oblivious scheduling of
streaming pipelines. In Proceedings of the ACM Symposium on Parallelism in Algorithms
and Architectures, pages 79–81, 2014.
[6] Kunal Agrawal, Jeremy T. Fineman, Jordan Krage, Charles E. Leiserson, and Sivan Toledo.
Cache-conscious scheduling of streaming applications. In Proceedings of the 24th ACM
Symposium on Parallelism in algorithms and architectures, pages 236–245, 2012.
[7] Kunal Agrawal, Charles E Leiserson, and Jim Sukha. Executing task graphs using workstealing. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium
on, pages 1–12. IEEE, 2010.
[8] Lars Arge, Michael T. Goodrich, Michael Nelson, and Nodari Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In Proceedings of the Twentieth Annual
Symposium on Parallelism in Algorithms and Architectures, pages 197–206, 2008.
[9] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. StarPU:
a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency
and Computation: Practice and Experience, 23(2):187–198, 2011.
[10] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: expressing locality
and independence with logical regions. In Proceedings of the international conference on
high performance computing, networking, storage and analysis, page 66. IEEE Computer
Society Press, 2012.
118

[11] Jonathan C Beard, Peng Li, and Roger D Chamberlain. Raftlib: a c++ template library
for high performance stream parallel processing. In Proceedings of the Sixth International
Workshop on Programming Models and Applications for Multicores and Manycores, pages
96–105. ACM, 2015.
[12] Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Bradley C. Kuszmaul. Concurrent cache-oblivious B-trees. In Proceedings of the Seventeenth Annual ACM Symposium on
Parallelism in Algorithms and Architectures, pages 228–237, July 2005.
[13] A Benoit, M. Hakem, and Y. Robert. Optimizing the latency of streaming applications under
throughput and reliability constraints. In International Conference on Parallel Processing,
pages 325–332, Sept 2009.
[14] A Benoit and Y. Robert. Complexity results for throughput and latency optimization of replicated and data-parallel workflows. In Proceedings of the 2007 IEEE International Conference
on Cluster Computing, pages 497–506, Sept 2007.
[15] Anne Benoit and Yves Robert. Mapping pipeline skeletons onto heterogeneous platforms.
Journal of Parallel and Distributed Computing, 68(6):790–808, 2008.
[16] Guy E. Blelloch, Rezaul A. Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin
Chen, and Michael Kozuch. Provably good multicore cache performance for divide-andconquer algorithms. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on
Discrete Algorithms, pages 501–510, 2008.
[17] S. H. Bokhari. Partitioning problems in parallel, pipeline, and distributed computing. IEEE
Trans. on Computers, 37(1):48–57, January 1988.
[18] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In
Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino,
and Ravi Kumar, editors, Proceedings of the 20th international conference on World Wide
Web, pages 587–596. ACM Press, 2011.
[19] Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression techniques.
In Proc. of the Thirteenth International World Wide Web Conference (WWW 2004), pages
595–601, Manhattan, USA, 2004. ACM Press.
[20] George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Hérault,
and Jack J Dongarra. PaRSEC: Exploiting heterogeneity to enhance scalability. Computing
in Science & Engineering, 15(6):36–45, 2013.
[21] George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier,
and Jack Dongarra. Dague: A generic distributed dag engine for high performance computing. Parallel Computing, 38(1):37–51, 2012.

119

[22] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook
for GPUs: Stream computing on graphics hardware. ACM Trans. on Graphics, 23(3):777–
786, August 2004.
[23] Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, et al. Concurrent collections. Scientific Programming, 18(3-4):203–217, 2010.
[24] Edouard Bugnion, Jennifer M Anderson, Todd C Mowry, Mendel Rosenblum, and Monica S
Lam. Compiler-directed page coloring for multiprocessors. In ACM SIGPLAN Notices,
volume 31, pages 244–255. ACM, 1996.
[25] Ernie Chan, Enrique S Quintana-Orti, Gregorio Quintana-Orti, and Robert Van De Geijn.
SuperMatrix out-of-order scheduling of matrix operations for smp and multi-core architectures. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and
architectures, pages 116–125. ACM, 2007.
[26] Sanjay Chatterjee, Sagnak Tasirlar, Zoran Budimlic, Vincent Cavé, Milind Chabbi, Max
Grossman, Vivek Sarkar, and Yonghong Yan. Integrating asynchronous task parallelism with
MPI. In 27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS
2013, Cambridge, MA, USA, May 20-24, 2013, pages 712–725, 2013.
[27] Shimin Chen, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Todd C. Mowry, and
Chris Wilkerson. Scheduling threads for constructive cache sharing on cmps. In Proceedings
of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA
’07, pages 105–115, New York, NY, USA, 2007. ACM.
[28] Trishul M. Chilimbi and Ran Shaham. Cache-conscious coallocation of hot data streams. In
Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and
Implementation, Ottawa, Ontario, Canada, June 11-14, 2006, pages 252–262, 2006.
[29] Hyeong-Ah Choi and Bhagirath Narahari. Algorithms for mapping and partitioning chain
structured parallel computations. In Proceedings of the International Conference on Parallel
Processing, pages 625–628, 1991.
[30] Daniel Cordes, Andreas Heinig, Peter Marwedel, and Arindam Mallik. Automatic extraction
of pipeline parallelism for embedded software using linear programming. In Proceedings of
IEEE 17th Interenational Conference on Parallel and Distributed Systems, pages 699–706,
2011.
[31] T Cormen, C Leiserson, R Rivest, and C Stein. Introduction to Algorithms. MIT Press, 2009.
[32] Mark A. Franklin, Eric J. Tyson, James H. Buckley, Patrick Crowley, and John Maschmeyer.
Auto-pipe and the X language: A pipeline design tool and description language. In Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium, April
2006.
120

[33] Matteo Frigo, Charles E Leiserson, and Keith H Randall. The implementation of the Cilk-5
multithreaded language. In ACM Sigplan Notices, volume 33, pages 212–223. ACM, 1998.
[34] Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy.
streams: a review. SIGMOD Rec., 34(2):18–26, June 2005.

Mining data

[35] Thierry Gautier, Joao VF Lima, Nicolas Maillard, and Bruno Raffin. XKaapi: A runtime
system for data-flow task programming on heterogeneous architectures. In Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 1299–
1308. IEEE, 2013.
[36] Michael I Gordon, William Thies, and Saman Amarasinghe. Exploiting coarse-grained task,
data, and pipeline parallelism in stream programs. ACM SIGOPS Operating Systems Review,
40(5):151–162, 2006.
[37] Yi Guo, Jisheng Zhao, Vincent Cave, and Vivek Sarkar. SLAW: a scalable locality-aware
adaptive work-stealing scheduler for multi-core systems. In ACM Sigplan Notices, volume 45, pages 341–342. ACM, 2010.
[38] P. Hansen and K.-W. Lih. Improved algorithms for partitioning problems in parallel,
pipelined, and distributed computing. IEEE Trans. on Computers, 41(6):769 –771, June
1992.
[39] Mor Harchol-Balter. The effect of heavy-tailed job size. distributions on computer system
design. In Proceedings of ASA-IMS Conference on Applications of Heavy Tailed Distributions
in Economics, 1999.
[40] Mor Harchol-Balter and Allen B. Downey. Exploiting process lifetime distributions for dynamic load balancing. ACM Transactions on Computer Systems, 15(3):253–285, 1997.
[41] Ujval J. Kapasi, Scott Rixner, William J. Dally, Brucek Khailany, Jung Ho Ahn, Peter Mattson, and John D. Owens. Programmable stream processors. Computer, 36(8):54–62, August
2003.
[42] B. Khailany, W.J. Dally, S. Rixner, U.J. Kapasi, P. Mattson, J. Namkoong, J.D. Owens,
B. Towles, and A. Chang. Imagine: Media processing with streams. IEEE Micro, pages
35–46, March/April 2001.
[43] Sanjeev Kohli. Cache aware scheduling for synchronous dataflow programs. Technical Report UCB/ERL M04/3, EECS Department, University of California, Berkeley, 2004.
[44] Manjunath Kudlur and Scott Mahlke. Orchestrating the execution of stream programs on
multicore platforms. In Proceedings of the ACM SIGPLAN Conference on Programming
Language Design and Implementation, pages 114–124, 2008.
[45] Yu-Kwong Kwok and Ishfaq Ahmad. Static scheduling algorithms for allocating directed
task graphs to multiprocessors. ACM Comput. Surv., 31(4):406–471, December 1999.
121

[46] Laboratory for web algorithmics. http://law.di.unimi.it/datasets.php.
[47] I Lee, Ting Angelina, Charles E Leiserson, Tao B Schardl, Zhunping Zhang, and Jim Sukha.
On-the-fly pipeline parallelism. ACM Transactions on Parallel Computing, 2(3):17, 2015.
[48] Will Leland and Teunis J. Ott. Load-balancing heuristics and process behavior. In Proceedings of the 1986 ACM SIGMETRICS Joint International Conference on Computer Performance Modelling, Measurement and Evaluation, pages 54–69, New York, NY, USA, 1986.
ACM Press.
[49] Hui Li, Sudarsan Tandri, Michael Stumm, and Kenneth C Sevcik. Locality and loop scheduling on numa multiprocessors. In Proceedings of the International Conference on Parallel
Processing, pages II–140. Citeseer, 1993.
[50] Peng Li, Kunal Agrawal, Jeremy Buhler, and Roger D. Chamberlain. Adding data parallelism
to streaming pipelines for throughput optimization. In Proceedings of the 20th International
Conference on High Performance Computing, pages 20–29, 2013.
[51] Jonathan Lifflander, Sriram Krishnamoorthy, and Laxmikant V Kale. Optimizing data locality for fork/join programs using constrained work stealing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis,
pages 857–868. IEEE Press, 2014.
[52] GQ Liu, Kim-Leng Poh, and Min Xie. Iterative list scheduling for heterogeneous computing.
Journal of Parallel and Distributed Computing, 65(5):654–665, 2005.
[53] Ying Liu, Nithya Vijayakumar, and Beth Plale. Stream processing in data-driven computational science. In Proceedings of the 7th IEEE/ACM International Conference on Grid
Computing, pages 160–167, 2006.
[54] Evangelos P Markatos and Thomas J LeBlanc. Using processor affinity in loop scheduling
on shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed systems,
5(4):379–400, 1994.
[55] Seung-Jai Min, Costin Iancu, and Katherine Yelick. Hierarchical work stealing on manycore
clusters. In 5th Conf. on Partitioned Global Address Space Prog. Models, 2011.
[56] Arno Moonen, Marco Bekooij, Rene Van Den Berg, and Jef Van Meerbergen. Cache aware
mapping of streaming applications on a multiprocessor system-on-chip. In Proceedings of
the Conference on Design, Automation and Test in Europe, pages 300–305, 2008.
[57] Frank Mueller. Compiler support for software-based cache partitioning. In ACM Sigplan
Notices, volume 30, pages 125–133. ACM, 1995.
[58] Stephen L. Olivier, Allan K. Porterfield, Kyle B. Wheeler, and Jan F. Prins. Scheduling
task parallelism on multi-socket multicore systems. In Proceedings of the 1st International
Workshop on Runtime and Operating Systems for Supercomputers, ROSS ’11, pages 49–56,
New York, NY, USA, 2011. ACM.
122

[59] Bjørn Olstad and Fredrik Manne. Efficient partitioning of sequences. IEEE Transactions on
Computers, 44(11):1322–1326, 1995.
[60] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation
ranking: bringing order to the web. 1999.
[61] Josep M Perez, Rosa M Badia, and Jesus Labarta. A dependency-aware task-based programming environment for multi-core architectures. In Cluster Computing, 2008 IEEE International Conference on, pages 142–151. IEEE, 2008.
[62] Ali Pınar and Cevdet Aykanat. Fast optimal load balancing algorithms for 1d partitioning.
Journal of Parallel and Distributed Computing, 64(8), 2004.
[63] Jean-Noël Quintin and Frédéric Wagner. Hierarchical work-stealing. In Euro-Par 2010Parallel Processing, pages 217–229. Springer, 2010.
[64] Moinuddin K Qureshi and Yale N Patt. Utility-based cache partitioning: A low-overhead,
high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th
Annual IEEE/ACM International Symposium on Microarchitecture, pages 423–432. IEEE
Computer Society, 2006.
[65] John W. Romein, P. Chris Broekema, Ellen van Meijeren, Kjeld van der Schaaf, and
Walther H. Zwart. Astronomical real-time streaming signal processing on a Blue Gene/L
supercomputer. In Proceedings of the Eighteenth Annual ACM Symposium on Parallelism in
Algorithms and Architectures, pages 59–66, 2006.
[66] Janis Sermulins, William Thies, Rodric Rabbah, and Saman Amarasinghe. Cache aware
optimization of stream programs. ACM SIGPLAN Notices, 40(7):115–126, 2005.
[67] Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195–197, 1981.
[68] Jaspal Subhlok and Gary Vondran. Optimal mapping of sequences of data parallel tasks.
In Proceedings of 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, pages 134–143, 1995.
[69] Jaspar Subhlok and Gary Vondran. Optimal latency-throughput tradeoffs for data parallel
pipelines. In Proceedings of the eighth annual ACM symposium on Parallel algorithms and
architectures, pages 62–71. ACM, 1996.
[70] G Edward Suh, Larry Rudolph, and Srinivas Devadas. Dynamic partitioning of shared cache
memory. The Journal of Supercomputing, 28(1):7–26, 2004.
[71] W. Thies, M. Karczmarek, and S.P. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction,
pages 179–196, 2002.
123

[72] Nagavijayalakshmi Vydyanathan, Sriram Krishnamoorthy, Gerald M. Sabin, Ümit V.
Çatalyürek, Tahsin M. Kurç, P. Sadayappan, and Joel H. Saltz. An integrated approach to
locality-conscious processor allocation and scheduling of mixed-parallel applications. IEEE
Trans. Parallel Distrib. Syst., 20(8):1158–1172, 2009.
[73] Zheng Wang and Michael FP O’Boyle. Partitioning streaming parallelism for multi-cores:
a machine learning based approach. In Proceedings of the 19th international conference on
Parallel architectures and compilation techniques, pages 307–318. ACM, 2010.
[74] Joseph G Wingbermuehle, Roger D Chamberlain, and Ron K Cytron. Scalapipe: A streaming application generator. In Application Accelerators in High Performance Computing
(SAAHPC), 2012 Symposium on, pages 44–53. IEEE, 2012.
[75] Tao Yang and Apostolos Gerasoulis. PYRROS: static task scheduling and code generation
for message passing multiprocessors. In Proceedings of the 6th international conference on
Supercomputing, pages 428–437. ACM, 1992.
[76] Geore Kingsley Zipf. Selected studies of the principle of relative frequency in language.
Cambridge, Mass. : Harvard University Press, 1932.

124

