An event-triggered programmable prefetcher for irregular workloads by Ainsworth, Sam & Jones, Timothy M.
ASPLOS Submission #3– Confidential Draft – Do Not Distribute!!
An Event-Triggered Programmable Prefetcher
for Irregular Workloads
Abstract
Many modern workloads compute on large amounts of
data, often with irregular memory accesses. Current ar-
chitectures perform poorly for these workloads, as existing
prefetching techniques cannot capture the memory access pat-
terns; these applications end up heavily memory-bound as
a result. Although a number of techniques exist to explicitly
configure a prefetcher with traversal patterns, gaining sig-
nificant speedups, they do not generalize beyond their target
data structures. Instead, we propose an event-triggered pro-
grammable prefetcher combining the flexibility of a general-
purpose computational unit with an event-based programming
model, along with compiler techniques to automatically gen-
erate events from the original source code with annotations.
This allows more complex fetching decisions to be made, with-
out needing to stall when intermediate results are required.
Using our programmable prefetching system, combined with
small prefetch kernels extracted from applications, we achieve
an average 3.0× speedup for a variety of graph, database and
HPC workloads.
1. Introduction
Many modern and emerging workloads perform computation
on large amounts of data, which cannot fit in the caches of
current systems. Many of these accesses are irregular and
difficult to predict in advance, resulting in heavily memory-
bound execution with frequent, long stalls from high DRAM
latency [1–3].
There are several techniques available to address these chal-
lenges. One option is to utilize thread-level parallelism within
an application, to cope with latency using aggressive multi-
threading [4], effectively parallelizing loads by having many
threads stalled at once. This is typical of workloads running on
graphics cards, for instance [5]. However, this technique only
works if the application exhibits a great deal of thread-level
parallelism. This is often not the case in big data workloads,
due to complex and unpredictable reads and writes to the same
data [6], and the difficulty of creating effective partitions for
the parallel cores to work on.
Another option is prefetching, either through hardware
prefetch units or software instructions. However, traditional
stride-based prefetchers [7, 8] only work for very regular com-
putations, typically involving either dense matrices or entirely
sequential memory accesses. History-based prefetchers [9]
only work for highly repeated computation. Neither of these
applies to many big-data applications, such as database
operations, graph workloads and many high-performance
computing (HPC) kernels, which exhibit much more complicated,
irregular traversals of data, involving pointer chasing and in-
direct array lookups [4]. Techniques have been proposed
specifically for irregular accesses, such as pointer fetching
prefetchers [10], which fetch plausible pointers from observed
memory loads. However, these lack the ability to look ahead
in arrays, cannot fetch commonly-used index-based data struc-
tures (as the loaded memory doesn’t include pointers), and
suffer from severe over-fetching from memory, due to a lack of
ability to have fine-grained control over prefetches. Software
prefetching [11, 12], on the other hand, fills the main CPU’s
pipeline with many extra instructions, and is unable to deal
with accesses involving multiple loads without stalling.
Nevertheless, despite the lack of success for traditional,
implicit prefetching techniques on these workloads, it is still
possible to mitigate the cost of latency associated with mem-
ory accesses. Techniques to extract memory-level parallelism
for a variety of memory-bound applications exist [1–3, 13, 14]
through explicit configuration of traversal patterns, gaining
significant performance improvements for the targeted work-
loads. However, currently such architectural techniques are
highly specialized to the target computation, so adding them
to general-purpose systems may be infeasible due to the lack
of wide applicability. Further, they are unable to deal with
the rapid evolution of algorithms within the field due to their
fixed-function nature.
To this end, we have designed an event-based programmable
prefetching system for general-purpose workloads in a variety
of domains including graphs, databases and HPC. We couple
a conventional high-performance out-of-order computation
core with a specialized prefetching structure for the L1 cache,
attached to several in-order programmable prefetch units. The
event-based programming model allows each prefetch unit
to issue and react to multiple loads at once without stalling.
This enables the system to prefetch based on the results of
earlier prefetches, in addition to prefetching from multiple
data structures concurrently.
We further provide compiler techniques to generate event
programs for these cores based on the original source code
and thus alleviate manual effort for simpler access patterns,
using annotations to specify what needs to be prefetched.
On a wide set of memory-bound benchmarks, we achieve a
3.0× average speedup, with high utilization of the prefetches
brought into the cache, and negligible additional memory
accesses for most workloads.
for(x = 0; x < in.size; x++) {
  SWPF(htab[hash(in.key[x+dist])]); // Software prefetch
  Key k = in.key[x];
  Hash h = hash(k);
  Bucket b = htab[h];
  ListElement l = b.listStart;
  while(l != NULL) {
    if(l->key == k) {
      wait_til_oldest(); // Multithreading
      out.match[out.size] = k;
      out.size++;
    }
    l = l->next;
  }
  signal_iter_done(x); // Multithreading
}
Figure 1: Hash join kernel with two latency-hiding techniques.
2. Existing Work
There is an abundance of work in the literature concerning
prefetching, and we describe the most relevant works here,
highlighting the elements that are beneficial for workloads
with irregular memory accesses.
Fetcher Units Much of the research into efficient execu-
tion of irregular workloads has focused on highly specialized
fetcher units. These systems take control of memory accesses
for a particular access pattern, extracting performance through
parallel loads of data, often with large performance improve-
ments. SQRL [2] and DASX [15] are fetcher systems designed
for iterative accesses of B-tree, vector and hash table structures.
Similarly, Kocberber et al. [1, 13] focus on the optimization of
database inner joins by parallel hash table walking. Fetcher
units can realize energy savings through removal of the origi-
nal load instructions. However, they are highly specialized to
specific data accesses, and the original code has to be signifi-
cantly modified, yielding code incompatible with devices not
featuring fetcher units.
Configurable Prefetchers This paper develops a config-
urable prefetcher exposed at the architectural level, and ideas
showing the benefits of this have been proposed in the past.
Al-Sukhni et al. [16] use explicit Harbinger instructions at the
program level to control linked-list pointer fetching. Yang
and Lebeck [17] develop a programmable prefetching scheme
for linked data structures. The programmable fetchers are al-
lowed to stall, and so cannot deal with patterns which require
overlapping of memory accesses to achieve high performance.
Ainsworth and Jones [14] design a configurable prefetcher
specifically for graph workloads, gaining large speedups, but
only targeting specific traversals for a particular graph format.
Implicit Irregular Prefetchers Many attempts have been
made at prefetching irregular structures using more traditional,
implicit schemes without configuration. This is desirable, as
it reduces manual effort, and does not require recompilation.
However, although significant progress has been made, none
have been implemented in any commercial system [18].
Pointer-fetching prefetchers [10], which fetch all plausible
virtual addresses from cache lines read by a core, have been
proposed in several schemes. The main downside to these
approaches is the large over-fetch rate. In addition, these
schemes are unable to deal with the array-indirect patterns
seen in many workloads.
Attempts to extract dependence-graph streams at runtime,
by detecting dependent loads, have been made [19–21]. These
run dynamically-detected streams on programmable units
when the start of a set of loads is identified, to prefetch the
data. Mutlu et al. propose a runahead scheme [22], which
utilizes idle chip resources on a cache miss to dynamically
prefetch loads. These are limited by being tightly bound to
the instruction stream, thus are unable to exploit significant
lookahead, or prefetch from other prefetched loads.
Yu et al. [23] pick up stride-indirect patterns using runtime
analysis of executed code on the CPU, to find the base array
and size of each data element. This achieves prefetching of
this single pattern, at the expense of complicated analysis
hardware in the core, which may affect the critical path of
execution.
Helper Threads One solution for prefetching irregular ap-
plications has been to use separate CPU threads to prefetch
data in software. Kim and Yeung [24] use auto-generated
“pre-execution threads” from compiler analysis. These have
the desirable property that no extra hardware is required. How-
ever, they use an additional thread on a high-performance core,
which could consume significant amounts of energy. They
are further unable to deal with prefetches based on prefetches
without stalling [25]. Further, the lack of a hardware event
queue makes synchronization on loads difficult and expensive.
Lau et al. [26] propose a similar scheme, but with architectural
support: a single small helper core is attached to a main core
to assist with processing tasks. This tight coupling somewhat
helps alleviate the synchronization problem, but still exhibits
the same stalls as above. Given this, a single core is rarely
able to meet the processing needs of complex access patterns.
Summary While there are elements of techniques from the
literature that can help with efficient and timely prefetch of
data into the cache for irregular workloads, there is currently
no complete solution. We next consider how several existing
schemes perform on a complex benchmark kernel, motivat-
ing the need for event-based and decoupled programmable
prefetching hardware that we develop in section 4.
3. Motivation
Figure 1 gives an example of a typical hash join kernel, as
used in databases. We have an indirect access to a hash table
array via a hash on a sequential access to a key array, followed
by linked-list traversals.
There are several challenges here for existing prefetchers.
First, as a result of the hash function, accesses to the hash
table array are unpredictable and scattered throughout memory,
[Figure 2 panels: (a) Original; (b) Software; (c) Multithreaded; (d) Helper thread; (e) Desired]
Figure 2: Execution of hash join codes. Software prefetch can only reduce stalls to the hash table buckets. Multithreading over-
laps parallel sections, but must synchronize on dependences. Ideally we would prefetch hash table buckets and list items sepa-
rately from the main computation and allow the prefetcher to issue further prefetches based on the results of earlier prefetches.
with no spatial or temporal locality among them. Without
knowing the hash function, there is no chance of being able
to accurately prefetch entries. Second, the linked-list traversal
does not perform a significant amount of work on each element.
Although pointer prefetchers could identify l->next as the
address for the next element to process, the lack of work
performed on each iteration of the while loop means that a
prefetcher cannot hide the memory access latency of bringing
in the next list item.
Figure 2a shows how this unmodified code would execute.¹
Light green boxes denote the calculation of the hash and load
of the hash-table bucket. Darker green boxes show a load of
a linked-list item. Diagonal lines in the boxes show a stall,
waiting for the data to arrive from a lower level cache or main
memory. As can be seen, each load causes a stall due to the
lack of temporal and spatial locality in the code.
Software Prefetching In this example, software prefetch-
ing [11] can be more beneficial than using a hardware
prefetcher, since we can encode the hash function inside the
prefetch instruction. Figure 1 shows this instruction and its
position within the code. We prefetch at a fixed number of
for-loop iterations into the future (dist) to bring hash-table
elements into the cache in advance of them being used.
However, we cannot help with the linked-list traversal because
the software does not get notified about the results of this
hash-table item prefetch. We are restricted to prefetching the
linked list for the current hash-table item, which suffers the
same memory latency hiding challenges as in hardware.
¹ The nature of this code, where the linked list is dependent
on the hash-table bucket load, means that an out-of-order core
would not be able to exploit memory-level parallelism through
multiple outstanding loads.
Figure 2b shows how the software prefetch improves per-
formance. Yellow boxes denote the calculation of the prefetch
address and corresponding prefetch instruction. We assume
a prefetch distance of 1 iteration in this example, meaning
that the first iteration prefetches the hash-table bucket for the
second iteration, and so on. As can be seen, for the second and
subsequent iterations, there is no stall for loading the bucket
(although the prefetch instruction itself incurs an overhead).
After four iterations, execution finishes slightly earlier than in
the original code, but the inability to prefetch the linked-list
items limits the performance increase.
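The software-prefetching variant can be made concrete with a small, compilable sketch of the probe loop. It assumes a GCC- or Clang-compatible compiler for `__builtin_prefetch`. The types, the `hash_key` function, `TABLE_SIZE` and `DIST` are hypothetical stand-ins for the names in figure 1, and the bounds check on the prefetch index is an addition that the figure omits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types standing in for those of figure 1. */
typedef struct ListElement {
    uint64_t key;
    struct ListElement *next;
} ListElement;

typedef struct { ListElement *listStart; } Bucket;

enum { TABLE_SIZE = 1024, DIST = 8 };  /* assumed table size and prefetch distance */

/* Assumed hash function: a multiplicative hash reduced to the table size. */
static size_t hash_key(uint64_t k) {
    return (size_t)((k * 0x9E3779B97F4A7C15ull) >> 32) % TABLE_SIZE;
}

/* Probes htab with each key, appends matching keys to out in input order,
   and returns the number of matches. */
size_t probe(const uint64_t *keys, size_t n, const Bucket *htab, uint64_t *out) {
    size_t matches = 0;
    for (size_t x = 0; x < n; x++) {
        /* Prefetch the bucket needed DIST iterations from now. */
        if (x + DIST < n)
            __builtin_prefetch(&htab[hash_key(keys[x + DIST])]);

        /* The list walk is not covered by the prefetch: each l->next is
           only known once the previous element has been loaded. */
        for (const ListElement *l = htab[hash_key(keys[x])].listStart;
             l != NULL; l = l->next) {
            if (l->key == keys[x])
                out[matches++] = keys[x];
        }
    }
    return matches;
}
```

The prefetch hides only the bucket load. The dependent loads inside the inner loop still stall, which is the limitation described above.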
Multithreading A third option is to exploit thread-level par-
allelism. Each of the for-loop iterations can be executed as a
separate thread to hide the memory latencies. However, the
algorithm is not embarrassingly parallel, and the order of the
output keys could change by executing iterations out of order,
so synchronization is required to prevent this.
Code for this option is shown in figure 1, and its execution
on two threads is shown in figure 2c. When a matching key is
found, the thread waits until it is executing the oldest iteration
before writing to the output array, to preserve ordering. This is
performed by calling wait_til_oldest(); the companion
signal_iter_done() signals at the end of each iteration to
keep track of the oldest iteration currently executing.
In the example (figure 2c), there is a match on the key in
the first list item in the second iteration. However, since the
first iteration on core 0 is still running, this second iteration
must wait until that is finished before writing to the output
array. Despite this idle time, the multithreaded version in this
example completes faster than with software prefetching by
overlapping execution and stalls where possible.
Figure 3: Structure of the programmable prefetcher.
Helper Thread A fourth type of prefetching is to duplicate
the memory accessing part of the loop into a separate, helper
thread. This thread can run in a different context on the same
core as the main thread, if simultaneous multithreading support
is available, to prefetch into the main L1 cache. Execution
for this technique is shown in figure 2d. The fundamental
limitation of this approach is that the helper thread cannot
load data in fast enough to stay ahead of the main thread. The
helper thread cannot use prefetches but must stall on each load
to be able to use results from it.
Desired Behavior However, in the ideal case we would have
no stalls at all. The workload actually contains a significant
amount of memory-level parallelism that existing techniques
are unable to exploit: we can parallelize over the array in.key,
allowing us to prefetch multiple linked lists at once, by over-
lapping the sequential linked-list fetches. If we could decouple
the calculation of prefetch addresses from the main execution
in a way that prevents stalling on each load, we would be able
to take advantage of this parallelism and bring data into the
cache shortly before it is used. This would lead to an execu-
tion similar to that in figure 2e where, after a warm-up period,
computation can proceed without stalls, since data is immedi-
ately available in the first level cache. To realize this we must
allow the prefetcher to react to data coming back from its own
prefetches, and give it knowledge of the computation being
performed, so that it can calculate the next set of prefetches
based on the data structures being traversed.
4. Programmable Prefetcher
This section develops a programmable prefetcher, suitable for
a wide variety of applications, but especially targeted towards
workloads with irregular, yet calculable memory accesses.
Figure 3 shows the overall structure. We add programmable
prefetch units and supporting hardware to generate prefetches
based on an application’s current and future working set. The
prefetcher is event-based, to avoid stalling, yet enable further
fetches to be made from the results of an earlier prefetch.
All snooped reads from the main core, and prefetched data
reaching the L1 cache, initially go into an address filter.
Filtered addresses move into the observation queue, to be
removed by the scheduler when it detects a free programmable
prefetch unit (PPU). These programmable units are low
frequency, in-order cores that execute a small computation for
each address received from the scheduler, and generate zero
or more prefetches as a result. These are placed into the FIFO
prefetch request queue. When the L1 cache has an available
MSHR, it removes the first prefetch request and issues it to the
L2 cache. The following subsections describe each structure
in more detail.
Start Addr | End Addr | Load Ptr | PF Ptr | Obs EWMA | PF EWMA Start | PF EWMA End
Figure 4: An example address filter table entry.
4.1. Address Filter
The address filter snoops all loads coming from the main
core, and prefetched data brought into the L1 cache from the
L2. This filter holds multiple address ranges that we wish to
monitor and use to create new prefetches, for example the hash
table (htab) in the kernel from figure 1. The address filter
is configured through explicit address bounds configuration
instructions running on the main core. These instructions are
generated by the compiler or programmer when creating the
code that executes on the PPUs.
The configuration is stored in the filter table, as shown in
figure 4. It stores virtual address ranges for each important
data structure, along with two function pointers to small com-
putation kernels: Load Ptr, to be run when a load is observed
to that range, and PF Ptr, to be run when a prefetch to that
range is completed. Some ranges are also used for scheduling
purposes (see section 4.4), and these are marked in the table.
Filtered addresses (observations) are placed in the
observation queue along with their function pointers and, in the
case of a prefetch observation, the prefetched cache line.
Address ranges can overlap; an address that falls within several
ranges produces one queue entry for each matching range.
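The filter lookup described above can be sketched in software as follows. This is only an illustrative model of the table in figure 4, not the hardware design. The type and function names are hypothetical, kernel entry points are modeled as plain addresses with 0 meaning "no kernel registered", and treating the end address as exclusive is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One filter table entry, mirroring the fields of figure 4. */
typedef struct {
    uintptr_t start, end;   /* virtual address range, assumed to be [start, end) */
    uintptr_t load_ptr;     /* kernel to run on an observed load, 0 if none */
    uintptr_t pf_ptr;       /* kernel to run on a completed prefetch, 0 if none */
    bool obs_ewma, pf_ewma_start, pf_ewma_end;  /* scheduling flags (section 4.4) */
} FilterEntry;

typedef enum { EV_LOAD, EV_PREFETCH } EventKind;

/* An observation as placed in the queue. The prefetched cache line that
   accompanies a prefetch observation is omitted for brevity. */
typedef struct {
    uintptr_t vaddr;
    uintptr_t kernel;
} Observation;

/* Writes one observation per matching range into out (capacity cap) and
   returns how many were written. Overlapping ranges each produce an entry. */
size_t filter_address(const FilterEntry *table, size_t n, uintptr_t vaddr,
                      EventKind kind, Observation *out, size_t cap) {
    size_t written = 0;
    for (size_t i = 0; i < n && written < cap; i++) {
        const FilterEntry *e = &table[i];
        if (vaddr < e->start || vaddr >= e->end)
            continue;
        uintptr_t kernel = (kind == EV_LOAD) ? e->load_ptr : e->pf_ptr;
        if (kernel != 0)
            out[written++] = (Observation){ vaddr, kernel };
    }
    return written;
}
```

A range with only `pf_ptr` set reacts to returning prefetches but ignores loads from the main core, which is how a structure reached only through earlier prefetches would be registered in this model.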
4.2. Observation Queue and Scheduler
Filtered addresses are placed in a small observation queue
before being assigned to a core. The queue is simply a FIFO
buffer to hold observations until a PPU becomes free. As
prefetches are only performance enhancements, in the event of
this queue filling up, old observations can be safely dropped
with no impact on correctness of the main program.
Once a PPU becomes free, the scheduler writes the cache
line and virtual address of the data into the PPU’s registers,
then sets the PPU’s program counter to the registered prefetch
kernel for that observation, starting the core. The scheduler’s
job is simply to monitor the PPUs and assign them work from
the FIFO observation queue when required.
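A minimal software model of the observation queue is sketched below. The text states only that old observations can be dropped when the queue fills; dropping the single oldest entry on overflow is an assumption made here, as are the capacity and all names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { OBS_QUEUE_CAP = 4 };  /* assumed capacity, kept small for illustration */

typedef struct {
    uintptr_t vaddr;    /* address that triggered the observation */
    uintptr_t kernel;   /* address of the prefetch kernel to run */
} Observation;

typedef struct {
    Observation slots[OBS_QUEUE_CAP];
    size_t head;        /* index of the oldest entry */
    size_t count;       /* number of valid entries */
} ObsQueue;

/* Filter side: enqueues an observation. If the queue is full, the oldest
   entry is discarded first. Returns true if an entry was dropped. */
bool obs_push(ObsQueue *q, Observation o) {
    bool dropped = false;
    if (q->count == OBS_QUEUE_CAP) {
        q->head = (q->head + 1) % OBS_QUEUE_CAP;
        q->count--;
        dropped = true;
    }
    q->slots[(q->head + q->count) % OBS_QUEUE_CAP] = o;
    q->count++;
    return dropped;
}

/* Scheduler side: removes the oldest observation for a free PPU.
   Returns false if the queue is empty. */
bool obs_pop(ObsQueue *q, Observation *out) {
    if (q->count == 0)
        return false;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % OBS_QUEUE_CAP;
    q->count--;
    return true;
}
```

Because a dropped observation only costs a missed prefetch, the queue needs no back-pressure path to the main core.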
4.3. Programmable Prefetch Units (PPUs)
The PPUs are a set of in-order, low power, programmable
RISC cores attached to the scheduler of the prefetcher, and are
responsible for generating new prefetch requests. The PPUs
operate on the same word size as the main core so that they
can perform address arithmetic in one instruction.
Each prefetcher unit is paused by default. When there is data
in the observation queue, and there is a free PPU, the scheduler
sends the oldest observation to that PPU for execution. The
PPU runs until completion of the kernel, which is typically
only a few lines of code. During execution it generates a
number of prefetches, which are placed in the prefetch request
queue, then sleeps until being reawakened by the scheduler.
Attached to the PPUs is a single, shared, multi-ported in-
struction cache. PPUs share an instruction cache between
themselves, but not with the main core; PPU code is distinct
from the main application, but any observation can be run on
any PPU. The amount of programmable prefetch code required
for most applications is extremely small, so the instruction
cache size requirements are minor: in the benchmarks de-
scribed in section 7 a maximum of 1KB is fetched from main
memory by the PPUs for the entirety of each application.
The PPUs do not have a load or store unit, and therefore
have no need for a data cache. This means they are limited
to reading individual cache lines that have been forwarded to
them, local register storage, and global prefetcher registers.
Removing the ability to access any other memory reduces
both the complexity of the PPUs and the need for them to
stall. Although this limits the data that can be used in prefetch
calculations, we have not found a scenario where any addi-
tional data is required. Typically the prefetch code will simply
take some data from the cache line, perform simple arithmetic
operations, then combine it with global prefetcher state, such
as the base address of an array, to create a new prefetch ad-
dress. Having no additional memory also means that each
PPU has no stack space for intermediate values, but registers
are available and provide ample storage for temporary values.
In practice we have not found this to be an issue.
4.4. EWMA Calculators
For some applications, the lookahead distance for issuing
prefetches cannot be set using a fixed value. It may be input
dependent, and may vary depending on the timing statistics of
the particular system. In some workloads, notably breadth-first
searches on graphs, the prefetch distance may vary within the
computation as phases access differently-sized elements.
Prior research has dealt with this challenge by considering
the ratio between computation and memory access times. For
example, Mowry et al. [27] divide the prefetch latency by the
number of instructions in the shortest path through a loop to
determine the number of iterations ahead to prefetch.
We generalize this idea and perform the calculation
dynamically in hardware, using exponentially weighted moving
average (EWMA) calculators to generate times for a variety of
observed events. EWMAs can be implemented very efficiently
in hardware with minor amounts of state [28], which means that
PPUs do not need to perform timing calculations.
The computed EWMAs are specified in the address filter
table, as shown in figure 4. When an observed read occurs
to a particular data structure, if Obs EWMA is set, the time
between this event and the previous event on the same address
bound is recorded. This can give us, for example, the time
between FIFO accesses for breadth-first search. To time how
long loads take, when PF EWMA Start is set, we signify the
start of a timed prefetch EWMA, and attach the current time to
the event generated. We propagate this to resulting prefetches
until we reach an address range with PF EWMA End set, when
we use the time between the two events as input into a load
time EWMA.
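The two timing averages can be illustrated with a shift-based EWMA, one common way to keep the state small. The smoothing factor of 1/8, the names, and the final step of dividing the load-time average by the inter-access average to obtain a lookahead distance are all assumptions; the last follows the ratio idea attributed to Mowry et al. above rather than a formula given in this section.

```c
#include <stdint.h>

enum { EWMA_SHIFT = 3 };  /* assumed smoothing factor alpha = 1/8 */

/* The accumulator holds the average scaled by 2^EWMA_SHIFT, so an update
   needs only a shift, a subtraction and an addition. */
typedef struct { uint64_t acc; } Ewma;

/* Folds a new time sample (in cycles) into the average. */
void ewma_update(Ewma *e, uint64_t sample) {
    e->acc = e->acc - (e->acc >> EWMA_SHIFT) + sample;
}

/* Returns the current average in cycles. */
uint64_t ewma_value(const Ewma *e) {
    return e->acc >> EWMA_SHIFT;
}

/* Assumed lookahead rule: the number of accesses that fit into one load
   latency, rounded up, and never less than one. */
uint64_t lookahead(const Ewma *load_time, const Ewma *obs_interval) {
    uint64_t interval = ewma_value(obs_interval);
    if (interval == 0)
        return 1;
    uint64_t d = (ewma_value(load_time) + interval - 1) / interval;
    return d == 0 ? 1 : d;
}
```

With a steady load time of 200 cycles and 30 cycles between observed accesses, this rule settles on a distance of 7 elements.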
4.5. Prefetch Request Queue
The prefetch request queue is a FIFO queue containing the
virtual addresses that have been calculated by the PPUs for
prefetching, that have not yet been processed. Once the L1
data cache has a free MSHR, it takes the oldest item out of
this queue, translates it to a physical address using the shared
TLB, then issues the prefetch to this address. As with the
observation queue, older requests can be dropped if the queue
becomes full, without impacting application correctness.
4.6. Memory Request Tags
While array ranges, which can be captured by virtual address
bounds, can be identified easily by the configuration steps
discussed in section 4.1, these aren’t the only structures a
prefetcher needs to react to. Linked structures (e.g. trees,
graphs, lists) can be allocated element-by-element in non-
contiguous memory regions and require identification when
their prefetched data arrives into the cache. To deal with
these we store a single tag in the MSHR that identifies the
data structure that the prefetch targets, such as a hash-table
bucket’s linked list. When a prefetch request returns data, and
has a registered tag, the cache line is sent to a PPU loaded
with the function pointer for that structure.
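A software sketch of the tagging mechanism follows. It models an MSHR that carries one small tag, and a table mapping each tag to a kernel address. The structure layout, the sizes and the names are hypothetical, and kernel addresses are plain integers with 0 meaning that no event is triggered.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_MSHRS = 8, NUM_TAGS = 4, NO_TAG = -1 };  /* assumed sizes */

typedef struct {
    bool valid;
    uintptr_t line_addr;  /* cache line being fetched */
    int tag;              /* data structure the prefetch targets, or NO_TAG */
} Mshr;

/* Allocates an MSHR for a prefetch and records its tag.
   Returns the MSHR index, or -1 if none is free. */
int mshr_alloc(Mshr *mshrs, uintptr_t line_addr, int tag) {
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshrs[i].valid) {
            mshrs[i] = (Mshr){ true, line_addr, tag };
            return i;
        }
    }
    return -1;
}

/* Called when the data for MSHR idx returns. Frees the MSHR and returns
   the kernel registered for its tag, or 0 if the request was untagged. */
uintptr_t mshr_fill(Mshr *mshrs, int idx, const uintptr_t *tag_kernels) {
    int tag = mshrs[idx].tag;
    mshrs[idx].valid = false;
    return (tag >= 0 && tag < NUM_TAGS) ? tag_kernels[tag] : 0;
}
```

In this model a list-walking kernel would issue the prefetch for the next element with the same tag it was triggered by, so the returned line is routed back to that kernel wherever the element happens to live in memory.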
4.7. Hardware Requirements
Though the prefetcher features many programmable units,
each one of these is intended to be a very small,
microcontroller-style unit, such as the ARM Cortex M0, which
contains fewer than 12,000 gates [29] (approximately 50,000
transistors). Compared to a typical out-of-order superscalar
CPU, for example a first generation Intel i7 processor at over
700 million transistors [30] (or 66 million per core, excluding
caches), associated silicon area for our prefetcher is minimal.
4.8. Summary
We have developed a programmable prefetcher that responds
to filtered load and prefetch observation events. These feed
into a set of programmable prefetch units, which run kernels
based on the events to issue prefetches into the main core’s
cache. The following sections describe how these units are
programmed.
int64_t acc = 0;
for(x=0; x<N; x++) {
  acc += C[B[A[x]]];
}
return acc;
(a) Main program
// Runs when the main core loads from A.
void on_A_load() {
  Addr a = get_vaddr();
  a += 128; // two cache lines ahead
  prefetch(a);
}
// Runs when a prefetch to A returns data.
void on_A_prefetch() {
  int64_t dat = get_data();
  Addr fetch = get_base(1) + dat * 8; // address of B[dat]
  prefetch(fetch);
}
// Runs when a prefetch to B returns data.
void on_B_prefetch() {
  int64_t dat = get_data();
  Addr fetch = get_base(2) + dat * 8; // address of C[dat]
  prefetch(fetch);
}
(b) PPU code
Figure 5: A loop with irregular memory accesses to arrays B & C, but significant memory-level parallelism for accesses to A.
Also shown are the functions executed by the PPUs to exploit this MLP.
5. OS and Application Support
To target the prefetcher, custom code must be generated for
each application. This section describes the event-based
programming model used for this, which is suited to
latency-tolerant fetches on multiple PPUs. It also considers the
interaction with the operating system and context switches. In
this section we assume prefetch code is written by hand;
section 6 then considers compiler assistance.
5.1. Event Programming Model
The PPU programming model is event-based, which fits
naturally with the characteristics of prefetch instructions that
have variable latency before returning their data. Events issue
prefetches rather than loads; when the prefetched data arrives
in the cache, it triggers new events that can react to it.
Prefetches are issued to the memory hierarchy when resources
become available, as described in section 4. This is naturally
latency-tolerant, avoiding PPU stalls while waiting for
prefetched data.
Events run on the PPUs are determined from the addresses
loaded or prefetched into the cache. If and when prefetches
return data, the scheduler can select any PPU to execute the
corresponding event, rather than being constrained to the origi-
nating unit. This makes the architecture suitable for prefetches
requiring loads for intermediate values, which would other-
wise stall the prefetcher. A benefit of this style of programming
is that the PPUs do not need to keep state between computa-
tions on each event.
The code for each event resembles a standard C procedure
for a more traditional processor, with a few limitations. There
are no data loads from main memory, stores or stack storage,
because the PPUs do not have the ability to access memory
(apart from issuing prefetches). The only data available to the
PPUs is the address that triggered the event, any cache line
which has been observed (stored in local registers), and global
prefetcher state (stored in global registers, such as address
bounds or configured values such as a hash mask).
We add special prefetch instructions, which are different
from software prefetches because they trigger subsequent
events for the PPUs to handle once they return with data.
Function calls cannot be made, since there is no stack, and
system calls are unsupported.
The prefetch events can be terminated at any time, since
they are not required for correct execution of the application
running on the main core. This happens, for example, on a
context switch when the current application is taken off the
main core. At this time, all PPUs are paused and their prefetch
events aborted. In addition, any operation that would usually
cause a trap or exception (e.g., divide by zero) immediately
causes termination of the prefetch event.
5.2. Example
Consider the program in figure 5(a). Its data accesses are
highly irregular, featuring indirect accesses to arrays B and
C. However, the sequential access of array A means there is a
large amount of memory level parallelism we can exploit to
load in each iteration over x in parallel.
This can be prefetched by loading the PPUs with the code
in figure 5(b). We assume that A, B and C are all arrays of
8-byte values. The address bounds of arrays A, B and C are
configured with the prefetcher as address bounds 0, 1 and
2 respectively, by placing instructions in the original code.
Similarly, the addresses of the kernels in figure 5(b) are taken,
and configured to the relevant load events for the prefetcher.
On observation of a main program read to A, a prefetch event
is triggered which fetches the address two cache lines ahead of
the current read. On prefetch of this, the fetched data is used as
an index into B (get_base(1)), then into C (get_base(2)).
Note that the prefetcher code is a transformation from a
set of blocking loads to a set of non-blocking prefetch events.
The core code for the main program remains sequential and
unchanged save for the configuration instructions, but the
majority of cache misses should be avoided by virtue of the
PPUs issuing load requests in advance of the core program
reaching them.
The special prefetcher functions (e.g. get_vaddr(),
get_base() and get_fetched_data()) are compiler intrin-
sics, which get converted into either register reads or loads
from the attached small, shared, prefetcher-state memory, as
appropriate.
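The event chain described above can be sketched as a host-side model in C. This is an illustration under stated assumptions, not the PPU code of figure 5(b): the intrinsics get_vaddr(), get_base() and get_fetched_data() are named in the text, but here they are mocked with plain variables, and prefetch(), issued_log and the 64-byte cache line size are hypothetical stand-ins chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u   /* assumed cache line size in bytes */
#define MAX_ISSUED 8     /* capacity of the illustrative log */

typedef void (*event_fn)(void);

/* Mocked prefetcher state; on the PPUs this would live in registers. */
uintptr_t cur_vaddr;     /* address of the observed load */
uint64_t  cur_data;      /* data returned by the last prefetch */
uintptr_t bases[3];      /* address bounds 0, 1, 2 hold A, B, C */

/* Log of issued prefetches, standing in for the fetch queue. */
struct issued { uintptr_t addr; event_fn callback; };
struct issued issued_log[MAX_ISSUED];
size_t n_issued;

uintptr_t get_vaddr(void)        { return cur_vaddr; }
uint64_t  get_fetched_data(void) { return cur_data; }
uintptr_t get_base(int i)        { return bases[i]; }

/* Hypothetical stand-in for the special prefetch instruction: it records
 * the address and the event to run once the data returns. */
void prefetch(uintptr_t addr, event_fn callback)
{
    if (n_issued < MAX_ISSUED) {
        issued_log[n_issued].addr = addr;
        issued_log[n_issued].callback = callback;
        n_issued++;
    }
}

/* Prefetch of B returned: use the data as an index into C (8-byte values). */
void on_B_prefetch(void)
{
    prefetch(get_base(2) + get_fetched_data() * 8, NULL);
}

/* Prefetch of A returned: use the data as an index into B. */
void on_A_prefetch(void)
{
    prefetch(get_base(1) + get_fetched_data() * 8, on_B_prefetch);
}

/* Main-core load to A observed: fetch two cache lines ahead. */
void on_A_load(void)
{
    prefetch(get_vaddr() + 2 * CACHE_LINE, on_A_prefetch);
}
```

Each event ends by issuing a prefetch and naming the next event, so no event ever waits for data: the blocking load chain of the main program becomes three short, independent handlers.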
5.3. Operating System Visibility
Although they have many capabilities of regular cores, PPUs
are not visible to the operating system as separate cores, and
so the OS cannot schedule processes onto them. Instead,
the OS can only see the state necessary to be saved across
context switches. Although there may be situations where it is
useful for the OS to see the PPUs as regular cores, avoiding
    int64_t acc = 0;
    for(x=0; x<N; x++) {
      swpf(&C[B[A[x+n]]]);
      acc += C[B[A[x]]];
    }
    return acc;
(a) Software prefetch
    int64_t acc = 0;
    #pragma prefetch
    for(x=0; x<N; x++) {
      acc += C[B[A[x]]];
    }
    return acc;
(b) Pragma
Figure 6: Source code for auto-generation of PPU code.
interactions with the OS simplifies their design (for example,
it does not require privileged instructions). As a result, while
the prefetcher initiates page table walks, it cannot handle page
faults; in such a case we discard the prefetch.
The prefetch units are used only to improve performance
and cannot affect the correctness of the main program. There-
fore, the amount of state that needs to be preserved over con-
text switches is small. For example, we do not need to preserve
internal PPU registers, but simply discard them on a context
switch. For the same reason, we can also throw away all
events in the observation queue and addresses in the fetch
queue. Provided context switches are infrequent, this will result
in little performance drop. EWMA values need not be preserved
across context switches either, as they can be recalculated.
As a result, all that is required to be saved on a context
switch is the prefetcher configuration: the global registers and
the address table.
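The split between saved and discarded state can be sketched as follows. This is a minimal model under assumptions: the struct layout, the register and table sizes, and the function names are hypothetical, with only the queue lengths taken from table 1.

```c
#include <stddef.h>
#include <stdint.h>

#define N_GLOBAL_REGS   16   /* assumed number of global registers */
#define N_ADDR_ENTRIES   8   /* assumed number of address table entries */
#define OBS_QUEUE_LEN   40   /* observation queue size from table 1 */
#define FETCH_QUEUE_LEN 200  /* prefetch queue size from table 1 */

/* One address table entry: array bounds plus the events they trigger. */
struct addr_entry {
    uintptr_t start, end;
    uint32_t  load_event, prefetch_event;
};

/* The configuration: the only state preserved over a context switch. */
struct pf_config {
    uint64_t          global_regs[N_GLOBAL_REGS];
    struct addr_entry addr_table[N_ADDR_ENTRIES];
};

struct prefetcher {
    struct pf_config config;                  /* saved */
    uintptr_t obs_queue[OBS_QUEUE_LEN];       /* discarded */
    size_t    obs_len;
    uintptr_t fetch_queue[FETCH_QUEUE_LEN];   /* discarded */
    size_t    fetch_len;
};

/* Switch out: copy the configuration and drop all in-flight work. */
void context_save(struct prefetcher *p, struct pf_config *out)
{
    *out = p->config;
    p->obs_len = 0;
    p->fetch_len = 0;
}

/* Switch in: reinstall the configuration; the queues start empty. */
void context_restore(struct prefetcher *p, const struct pf_config *in)
{
    p->config = *in;
    p->obs_len = 0;
    p->fetch_len = 0;
}
```

Because dropped prefetches affect only performance, emptying the queues is safe; the model simply resets their lengths instead of copying them.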
6. Compiler Assistance
Hand-coding events requires considerable manual effort. A
way of generating these events from the original code within
the compiler is more desirable from an end-user point of view.
Software prefetching [11] is a commonly supported technique
whereby a processor can load data into the cache system
without waiting for the result. Software prefetches present a
high-level abstraction for the end user, but have many
disadvantages when
executed directly, as discussed in section 3. However, we can
use the address generation code for these software prefetches
to generate hardware events by working backwards through
the loop in which they appear to generate programmable
prefetcher code. This allows us to perform the prefetching
without slowing down the main computation thread.
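For reference, the loop of figure 6(a) can be written as a compilable function. This is a sketch under assumptions: the function name sum_indirect and the bounds guard are additions for the example, and __builtin_prefetch, the GCC and Clang builtin, stands in for the swpf() call used in the figure.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums C[B[A[x]]] over x, prefetching the element needed n iterations
 * ahead. A holds valid indices into B, and B valid indices into C. */
int64_t sum_indirect(const int64_t *A, const int64_t *B, const int64_t *C,
                     size_t N, size_t n)
{
    int64_t acc = 0;
    for (size_t x = 0; x < N; x++) {
        /* Guard added for this sketch: do not read A past its end. */
        if (x + n < N) {
            /* Only the access to C is a non-blocking prefetch; the reads
             * of A[x + n] and B[...] are ordinary loads on the main core. */
            __builtin_prefetch(&C[B[A[x + n]]]);
        }
        acc += C[B[A[x]]];
    }
    return acc;
}
```

The two intermediate loads in the prefetch address are the extra main-core work that conversion to PPU events is intended to remove.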
6.1. Analysis
Our analysis pass over the compiler’s IR starts from the soft-
ware prefetch instructions and works backwards as a depth-
first analysis of the data-dependence graph for each input.
We terminate upon reaching a constant, loop-invariant value,
non-loop-invariant load, or phi node. The goal is to split the
prefetch address generation into sequences of nodes ending in
a single load, which will be turned into the PPU events in a
later pass.
To attain an appropriate level of look-ahead for the PPU
code, the software prefetch instruction must be in a loop with
an identifiable induction variable. We also need a data structure
which is accessed using the induction variable, so that we
can infer its value from loads observed in the cache.
Figure 7: An overview of our software prefetch conversion
algorithm on the control flow graph from code in figure 6.
Phi nodes identify either the loop’s induction variable, or
another control-flow dependent value. In the former case,
provided no loads have been found in this iteration of the
depth-first search, we can replace the induction variable with
code to infer it from an address, and use the set of found
instructions as the first event for a set of prefetches. The latter
case requires more complex analysis, and in practice is rare,
so we do not discuss it further.
If multiple different non-loop-invariant loads are found in a
search, then more than one loaded value is used to create an
address and the event cannot be triggered by the arrival of a
single data value. In this case the conversion fails. However,
if only one load is found, we package the instructions into an
event, and repeat the analysis again starting from this load.
Figure 7 shows the control-flow graph for the code in fig-
ure 6(a). Analysis starts from the prefetch instruction (line
14), performing a depth-first search on its single input, v5, and
terminating upon reaching the load at line 12. Since this is a
non-loop-invariant load, the three instructions are packaged
together into an event, and analysis restarted with the load.
This terminates with the load at line 10, and again an event
is created. Finally, the third analysis pass terminates with the
phi node, which is for the loop induction variable, so a new
event is created and no further analysis is required.
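The search described in this section can be sketched on a toy data-dependence graph. This is an illustration only, not the LLVM pass: the node representation, the two-operand limit and the names dfs and count_events are assumptions, and a phi node is always treated as the loop induction variable. A search that finds both a load and a phi node is treated here as a failure, which is one reading of the condition that no loads have been found.

```c
#include <stddef.h>

enum kind { K_CONST, K_INVARIANT, K_LOAD, K_PHI, K_OP };

/* Toy IR node: a LOAD keeps its address in in[0]; an OP has up to two
 * operands; unused operands are NULL. */
struct node {
    enum kind kind;
    const struct node *in[2];
};

struct search {
    const struct node *load;  /* the single non-invariant load found */
    int phi_seen;             /* reached the induction variable */
    int failed;               /* found two different loads */
};

/* Depth-first walk that stops at constants, loop invariants, loads and
 * phi nodes. */
void dfs(const struct node *n, struct search *s)
{
    if (n == NULL || s->failed)
        return;
    switch (n->kind) {
    case K_CONST:
    case K_INVARIANT:
        return;
    case K_PHI:
        s->phi_seen = 1;
        return;
    case K_LOAD:
        if (s->load != NULL && s->load != n)
            s->failed = 1;    /* address depends on two loaded values */
        else
            s->load = n;
        return;
    case K_OP:
        dfs(n->in[0], s);
        dfs(n->in[1], s);
        return;
    }
}

/* Splits the address computation into events, restarting the search from
 * each load found. Returns the event count, or -1 if conversion fails. */
int count_events(const struct node *addr)
{
    int events = 0;
    for (;;) {
        struct search s = { NULL, 0, 0 };
        dfs(addr, &s);
        if (s.failed || (s.load != NULL && s.phi_seen) ||
            (s.load == NULL && !s.phi_seen))
            return -1;
        events++;
        if (s.phi_seen)
            return events;    /* this is the first event of the chain */
        addr = s.load->in[0]; /* repeat the analysis from the load */
    }
}
```

For the address of C[B[A[x]]] the sketch finds three events, one ending at each of the loads of B and A and one ending at the induction variable, matching the walk-through above.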
6.2. Array Bounds Detection
The prefetcher requires the address bounds for each array
accessed through an induction variable, storing them in its address
filter so as to trigger the correct event when snooping a load
or prefetch. For example, in figure 7 code for event A must be
executed when observing a load to array A by the main core.
Returned prefetches are handled using the memory request
tags, described in section 4.6.
The start of each array is trivially obtained from address
generation instructions and, in the case of a typed array, the
end address is also simple because the size of the array is stated
explicitly. However, in languages such as C, where arrays can
be represented as pointers, this becomes more challenging.
One option is to pattern match for common cases, for example,
searching backwards for allocation instructions. Another is
to identify the loop termination condition, provided that it is
loop invariant.
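As an illustration of how the bounds are used, the sketch below derives bounds from a loop-invariant trip count and matches an observed address against them. The struct layout, the half-open range convention and the function names are assumptions made for the example, not the hardware's address filter format.

```c
#include <stddef.h>
#include <stdint.h>

/* One address filter entry covering the half-open range [start, end). */
struct bounds {
    uintptr_t start;
    uintptr_t end;
    int       event_id;   /* event to run when a load hits this range */
};

/* Bounds of an array indexed by a loop counter running from 0 to n - 1,
 * where n is the loop-invariant termination value. */
struct bounds bounds_from_loop(uintptr_t base, size_t n, size_t elem_size,
                               int event_id)
{
    struct bounds b = { base, base + n * elem_size, event_id };
    return b;
}

/* Returns the event for a snooped address, or -1 if no range matches. */
int match_event(const struct bounds *filter, size_t entries, uintptr_t addr)
{
    for (size_t i = 0; i < entries; i++) {
        if (addr >= filter[i].start && addr < filter[i].end)
            return filter[i].event_id;
    }
    return -1;
}
```

A load to an address outside every configured range matches nothing and triggers no event.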
6.3. Code Generation
The tasks of the code generation pass are to insert prefetcher
configuration instructions, generate PPU code and remove the
original software prefetch instructions. Using the analysis
described in section 6.2, array bounds are known and so con-
figuration instructions for each array are placed immediately
before the loop. Configuration instructions are also added for
any loop invariant values that are required by the PPU code,
assigning them to unique prefetcher global registers.
To generate prefetcher code, we take sets of instructions
identified using the analysis in section 6.1, and turn them into
event functions. In the first event, we replace the induction
variable phi node with the base array address subtracted from
the current address observation (accessible from PPU registers),
divided by the size of the array’s elements (which
is typically converted to a shift by later optimizations). We
replace the final instruction in each event, which will either
be a load or software prefetch, with a hardware prefetch in-
struction. If a load, we add a callback so that the next event in
the sequence is called once this prefetch returns. We replace
all loop invariants with global register accesses to values that
have been configured in the main code. The only remaining
load must be to the data observed from the current prefetch or
load event, so can be converted into a register access.
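The replacement of the induction variable in the first event amounts to the following arithmetic. This is a sketch assuming power-of-two element sizes; the function names and the lookahead parameter (the n of figure 6(a)) are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Recover the loop index x from an observed load address: subtract the
 * array base and divide by the element size, written as a shift. */
uint64_t infer_index(uintptr_t vaddr, uintptr_t base, unsigned log2_elem)
{
    assert(vaddr >= base);   /* the address filter guarantees this */
    return (uint64_t)(vaddr - base) >> log2_elem;
}

/* Address the first event prefetches: element x + lookahead of the same
 * array, where x is inferred from the observed load. */
uintptr_t first_event_target(uintptr_t vaddr, uintptr_t base,
                             unsigned log2_elem, uint64_t lookahead)
{
    uint64_t x = infer_index(vaddr, base, log2_elem);
    return base + (uintptr_t)((x + lookahead) << log2_elem);
}
```

With 8-byte elements and a base of 0x1000, a load observed at 0x1028 gives x = 5, so a lookahead of 16 targets element 21 at 0x10a8.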
Finally, we remove the now-unnecessary software prefetch
instructions. Dead-code elimination is then used to remove any
instructions that were only required for the software prefetch,
but leaves those that are common subexpressions for other,
still-required instructions.
6.4. Pragma Prefetching
While software prefetches are a relatively descriptive mech-
anism for converting to hardware events, an easier option is
to simply indicate the loop that requires prefetching within it
and let the compiler generate the prefetch events from scratch.
We support this through a custom prefetch pragma (as in fig-
ure 6(b)) using a similar depth-first search approach as in
section 6.1. We start the analysis with loads that feature indi-
rection (so are likely to miss), and that have look-ahead based
on a discovered induction variable.
Main Core
  Core                     3-Wide, out-of-order, 3.2GHz
  Pipeline                 40-Entry ROB, 32-entry IQ, 16-entry LQ,
                           32-entry SQ, 128 Int / 128 FP registers,
                           3 Int ALUs, 2 FP ALUs, 1 Mult/Div ALU
  Tournament Branch Pred.  2048-Entry local, 8192-entry global,
                           2048-entry chooser, 2048-entry BTB,
                           16-entry RAS
Memory & OS
  L1 Cache                 32KB, 2-way, 2-cycle hit lat, 12 MSHRs
  L2 Cache                 1MB, 16-way, 12-cycle hit lat, 16 MSHRs
  L1 TLB                   64-Entry, fully associative
  L2 TLB                   4096-Entry, 8-way assoc, 8-cycle hit lat
  Table Walker             3 Active walks
  Memory                   DDR3-1600 11-11-11-28 800MHz
  OS                       Ubuntu 14.04 LTS
Prefetcher
  Prefetcher               40-Entry observation queue, 200-entry
                           prefetch queue, 12 PPUs
  PPUs                     In-order, 4 stage pipeline, 1GHz
Table 1: Core and memory experimental setup.
Generating code in this manner means we have slightly less
information to work on than with the software prefetch pass,
since software prefetches can encode runtime information on
what data will miss and be accessed, which a simple pragma
over a loop can miss (e.g., an array access stride pattern).
Further, it isn’t possible to decide at compile time, without more
information, which loads are likely to access data that is
already in the L1 cache, making prefetches to that data structure
unnecessary (though these could be disabled at runtime
with analysis hardware). However, for simple patterns, this
descriptor is as powerful as software prefetch conversion.
7. Evaluation
To evaluate our prefetcher we modeled a high performance sys-
tem using the gem5 simulator [31] in full system mode running
Linux with the ARMv8 64-bit instruction set and configuration
given in table 1. To evaluate the compiler techniques presented
in section 6, we implemented them as LLVM passes [32]. We
chose a variety of memory-bound benchmarks to demonstrate
our scheme, representing a wide range of workloads from
different fields: graphs, databases and HPC, described in ta-
ble 2. We skipped initialization, then ran each benchmark to
completion using detailed, cycle-accurate simulation.
7.1. Performance
Figure 8 shows that our programmable prefetcher achieves
speedups of up to 4.3× with manual programming, compared
to no prefetching, for the memory-bound workloads described
in section 7, whereas stride and software prefetchers realize
speedups of no more than 1.4× and 2.2× respectively.
Our compiler-assisted software prefetch conversion pass
(converted) achieves similar speedups to manual events for
Benchmark  Source          Pattern                              Input
G500-CSR   Graph500 [33]   Breadth-first search (arrays)        -s 21 -e 10
G500-List  Graph500 [33]   Breadth-first search (linked lists)  -s 16 -e 10
PageRank   BGL [34]        Stride-indirect                      web-Google
HJ-2       Hash Join [35]  Stride-hash-indirect                 -r 12800000 -s 12800000
HJ-8       Hash Join [35]  Stride-hash-indirect, followed by    -r 12800000 -s 12800000
                           3 linked list walks per iteration
RandAcc    HPCC [36]       Stride-hash-indirect                 100000000
IntSort    NAS [37]        Stride-indirect                      B
ConjGrad   NAS [37]        Stride-indirect                      B
Table 2: Summary of the benchmarks evaluated.
Figure 8: The programmable prefetcher realizes speedups of
up to 4.4×. Stride and software prefetchers cannot effectively
prefetch highly irregular memory accesses. (Bar chart of speedup
per benchmark for the Stride, Software, Pragma Generated,
Converted and Manual schemes. One annotation reads: "No direct
memory address access so software prefetch not possible".)
benchmarks other than the Graph500 workloads, and
our automatic event generation technique based on pragmas
(pragma generated) is able to speed up simpler access patterns
as much as manual events, but isn’t able to achieve the full
potential for four of our eight benchmarks.
Speedups Three benchmarks gain significant improvement
from software prefetching. These are RandAcc, IntSort and
HJ-2, all highly amenable to software prefetching due to their
access pattern, which involves an array indirect based on a
single strided load. The spatial locality means that they don’t
incur significant numbers of pipeline stalls for the prefetch ad-
dress calculation. However, in the extreme (IntSort), software
prefetching causes a 113% increase in dynamic instructions
(with 83% extra for RandAcc and 56% for HJ-2).
In contrast, moving the prefetch address calculations to the
PPUs in our scheme results in larger speedups: from 2.0×
with software prefetch up to 2.8× with PPUs for IntSort, from
2.2× to 3.0× for RandAcc and from 1.4× to 3.9× for HJ-2.
In other workloads, where stride and software prefetch pro-
vide few benefits, our prefetcher is able to unlock significant
memory-level parallelism and realize substantial speedups.
For example, in HJ-8 stride and software prefetching achieve
negligible speedups, yet our PPUs attain 3.8×.
The only significant outlier is G500-List, which achieves
1.7×, the lowest speedup attained by our prefetcher.
The reason for this is that there is no fine-grained
parallelism available within the application, since each ver-
tex in the graph contains a linked list of out-going edges.
Therefore, when prefetching a vertex, each edge can only
be identified through a pointer from the previous, essentially
sequentializing the processing of edges.
There is no bar for software prefetching or conversion for
PageRank in figure 8; the Boost Graph Library code uses tem-
plated iterators which only give access to edge pairs, meaning
it isn’t possible to get the addresses of individual elements to
issue software prefetches to them.
Compiler assistance, both from pragmas and software
prefetch conversion, works well for IntSort, ConjGrad and
HJ-2. While PageRank’s code doesn’t allow software prefetch
insertion due to working on high level iterators, this is not
a problem for the pragma pass, which works on LLVM IR,
and thus can discover the access pattern and generate events
automatically. IntSort, ConjGrad and PageRank have slightly
reduced performance from pragma generated prefetching, as a
result of useless prefetches being generated, as opposed to the
patterns not being discoverable.
RandAcc gains less performance from pragma conversion
than from manual software prefetching. This is because the
benchmark repeatedly iterates over a small 128-entry array,
and thus we can encode wrap-around prefetches in a software
prefetch. As this is a property of multiple control flow loops,
it is difficult to discover in an automated pass, and thus our
scheme leaves the first few entries of the array unprefetched.
Still, our pragma scheme requires less effort from the program-
mer than a software prefetch, in that they only need to identify
target loops, rather than come up with specific prefetches and
look-ahead distances.
HJ-8 gains significant performance improvement from soft-
ware prefetch conversion, because we can specify to prefetch
the first N hash buckets. This differs from software prefetch-
ing, where we cannot do this in a latency tolerant manner, as
it requires reads of prefetched data, and also from pragma
generation, as N cannot easily be discovered from the code.
More generally, we can say that hash tables tend to have few
elements per hash bucket, so even for the case where there
are varying numbers of elements, a conservative "first N" ap-
proach should work well. Still, with manual prefetching, we
can introduce control flow loops, to walk every bucket until
we try to prefetch a null pointer.
G500-CSR gains progressively more performance as more
Figure 9: While most applications see high prefetch utilization and L1 hit rates, G500-List has to prefetch data too early to attain
memory-level parallelism, so benefits are obtained from having the data in the L2 cache.
(a) L1 utilization rate per benchmark. Annotation: "Data prefetched too early, but next opportunity would make it too late".
(b) L1 cache read hit rate per benchmark, with no prefetching (No PF) and with the programmable prefetcher (Programmable PF).
Annotation: "Most loads miss in L1 but L2 hit rate increases from 0.20 to 0.58".
programmer effort is expended in prefetching. As neither
of the compiler passes deals with control flow (as software
prefetches fundamentally can’t express loops), it isn’t possible
to prefetch a data-dependent range of edges, and thus we must
instead fetch the first N for fixed N. Further, we can’t use the
knowledge that the start and end value for each vertex in an
edge list will be in the same cacheline in our compiler passes,
as they assume access to only one loaded value at a time. The
pragma pass is unable to identify the need to fetch edge or
visited values from vertex data, due to the complicated control
flow involved, so instead only achieves two stride-indirect
patterns from FIFO queue to vertices, and edges to visited
information, limiting the prefetching achievable.
As G500-List relies heavily on walking long edge lists in a
linked list, it requires loop control flow to prefetch effectively.
Therefore, we cannot express it as a software prefetch, and our
compiler passes have limited impact on performance.
Impact on L1 Cache Figure 9 explores this in more de-
tail. Figure 9(a) shows that while L1 cache utilization is high
for most benchmarks when using our prefetcher, it is com-
paratively low for G500-List. In this application, for larger
vertices, the linked list of edges may be larger than the L1
cache. Traversing this list may result in prefetched data be-
ing evicted from the cache before being used due to capacity
misses from either a) later prefetches to the same edge list,
or b) prefetches or loads to other data. The underlying issue
is that the prefetches occur too early, but there is no
mechanism to delay them. Instead of starting the edge-list
prefetches after a vertex has been prefetched, the only other
point that the list prefetches can start is when the actual ap-
plication thread starts processing the vertex. By this point it
is too late because the main thread will need to follow the
edges, and so prefetches will execute in lock-step with the
main application’s loads (much like figure 2(d)).
The L1 cache read hit rate does increase for G500-List, as
shown in figure 9(b), but only from 0.34 to 0.42. Despite
this, the application does gain some benefit from the
early edge-list prefetches by virtue of these edges being placed
in the L2 cache: the L2 cache hit rate increases
from 0.20 to 0.57.
7.2. Analysis
Our existing programmable prefetcher configuration contains
12 PPUs, each running at 1GHz, compared to 3.2GHz for
the main core. We now show that this realizes most of the
benefits and that scaling continues with increasing numbers
of PPUs and their frequencies, since the prefetch kernels are
embarrassingly parallel.
Clock Speed Figure 10 shows how PPU clock speed affects
each benchmark and the impact of reducing the number of
PPUs. Figure 10(a) demonstrates that approximately half the
workloads gain little benefit from increasing the frequency
of the PPUs. On the other hand, HJ-2 requires a 500MHz
frequency to realize its maximum speedup whereas ConjGrad
and G500-CSR achieve speedups that continue scaling with
the PPU frequency. Overall, the majority of the benefits are
obtained at 1GHz where the geometric mean of speedups is
3×, increasing to 3.1× at 2GHz.
Number of PPUs We explore the relationship between PPU
frequency and the number of PPUs in figure 10(b) for G500-
CSR, chosen as an example of an application that continues
scaling with frequency increases. We show PPU frequencies
up to 4GHz as a study only, to assess this relationship; we do
not expect PPUs to be clocked at this frequency.
The figure shows that speedups are maintained by doubling
the number of PPUs and halving the frequency. Using 3 PPUs
at 2GHz, 6 PPUs at 1GHz or 12 PPUs at 500MHz all achieve
1.9×. The prefetch kernels running on the PPUs are embar-
rassingly parallel, since each invocation is independent of all
others, meaning that scaling can be achieved by increasing the
number of PPUs or their frequencies. It also shows that per-
formance for this workload saturates with 12 PPUs at 2GHz,
as no more is gained by increasing frequency.
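One way to read the equal-speedup configurations is that, for independent prefetch events, throughput depends on the aggregate number of PPU cycles available per second. This is an interpretation of the reported result rather than a model given by the measurements themselves. The three configurations quoted above supply the same aggregate:

```latex
\[
3 \times 2\,\mathrm{GHz}
\;=\; 6 \times 1\,\mathrm{GHz}
\;=\; 12 \times 0.5\,\mathrm{GHz}
\;=\; 6 \times 10^{9}\ \text{PPU cycles per second}
\]
```

Under this reading, the default configuration of 12 PPUs at 1GHz provides twice that aggregate, 12 × 10^9 PPU cycles per second.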
PPU Activity Figure 11 further explores the amount of work
performed by the 12 PPUs at 1GHz. This figure shows the
proportions of time that each PPU is awake during computation.
Figure 10: Some applications see little performance loss with slower PPUs, whereas others continue gaining as clock speeds
increase. Doubling the number of PPUs while halving their frequency results in the same speedup.
(a) Clock frequency impact: speedup against PPU clock speed (250MHz to 2GHz) for each benchmark.
(b) Effect of number of cores on G500-CSR: speedup against PPU clock speed (125MHz to 4GHz) for 3, 6 and 12 PPUs.
Our scheduling policy is to pick the PPU with the lowest
ID from those available when assigning prefetch work. This
means that the low-ID PPUs are active more of the time than
the high-ID PPUs. Other scheduling policies (such as round-
robin) would spread the work out more evenly, but would not
change the overall performance and would not allow us to
perform this analysis.
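The lowest-ID policy can be sketched with a bitmask of busy PPUs. This is a minimal illustration: the mask representation and the function name are assumptions, with only the count of 12 PPUs taken from table 1.

```c
#include <stdint.h>

#define N_PPUS 12   /* number of PPUs in the evaluated configuration */

/* Returns the lowest-numbered idle PPU, or -1 if every PPU is busy.
 * Bit i of busy_mask is set while PPU i is handling an event. */
int pick_lowest_free(uint32_t busy_mask)
{
    for (int id = 0; id < N_PPUS; id++) {
        if ((busy_mask & (UINT32_C(1) << id)) == 0)
            return id;
    }
    return -1;
}
```

Under this policy PPU 0 takes every event that arrives while it is idle, so when load is light the work concentrates on the low-ID units and the spread of activity factors shows how many PPUs a workload needs.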
When the workload is prefetch-compute bound, adding
more PPUs or using a higher clock speed would improve
performance (as in G500-CSR); work is evenly split between
PPUs and all are kept busy. In contrast, benchmarks such as
PageRank, RandAcc and IntSort cannot fully utilize all PPUs:
all of these workloads contain at least one PPU that is never
awoken. This is mainly due to them requiring only simple
calculations to identify future prefetch targets. As a result,
these applications would achieve similar performance with
slower PPUs (as shown in figure 10(a)) or fewer of them.
ConjGrad is an outlier in that some PPUs do little work, yet
it scales with increasing frequency (figure 10(a)). The reason
for this behavior is that at 1GHz there is not enough work
available for all PPUs to need to be active, but the prefetches
are slightly latency-bound. Therefore minor additional bene-
fits are gained when the clock speed increases and the prefetch
calculations finish earlier. This is in contrast to G500-CSR,
which also scales with the clock speed, where boosting fre-
quency increases the number of prefetches that can be carried
out, resulting in higher performance.
No applications have PPUs that run continuously: the max-
imum activity factor is 0.82. This reflects the fact that the
PPUs only react to events from the main core, and so are not
required during phases where no data needs to be prefetched.
Extra Memory Accesses For efficient execution, it is desir-
able to minimize the total extra traffic we add onto the memory
bus. In general, a programmable solution should prefetch very
efficiently, only targeting addresses that will be required by
the computation. For all but the two Graph500 benchmarks,
the value is negligible: prefetches are very accurate and timely,
and therefore do not fetch unused data.
Figure 11: Range, quartiles and median for the proportion of
time each PPU is awake and calculating prefetches at 1GHz.
(Annotations: "Little prefetch computation and not bursty so the
first PPU takes the lion’s share"; "At least one PPU unused";
"PPUs all perform a similar amount of work".)
G500-List adds 40% extra accesses due to the lack of fine-grained
parallelism available. This is down to a fundamental constraint
on the linked
list that limits timely prefetching, as discussed in section 7.1.
G500-CSR also has variable work per vertex, meaning the
prefetch distance must be set conservatively based on the EW-
MAs. This results in 16% extra memory accesses.
8. Conclusion
We have presented a programmable prefetcher, which uses
an event-based programming model capable of extracting
memory-level parallelism and improving performance for a
variety of irregular memory-intensive workloads. On a selec-
tion of graph, database and HPC workloads, our prefetcher
achieves an average 3.0× speedup without significantly in-
creasing the number of memory accesses. We have further
provided compiler techniques to reduce the amount of manual
effort for the programmer to utilize the performance benefits
of our scheme, with average 1.9× and 2.5× speedup for the
two schemes we present.
References
[1] O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, and P. Ran-
ganathan, “Meet the walkers: Accelerating index traversals for in-
memory databases,” in Proceedings of the 46th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-46, 2013.
[2] S. Kumar, A. Shriraman, V. Srinivasan, D. Lin, and J. Phillips, “Sqrl:
Hardware accelerator for collecting software data structures,” in
Proceedings of the 23rd International Conference on Parallel Architectures
and Compilation, PACT ’14, 2014.
[3] K. Nilakant, V. Dalibard, A. Roy, and E. Yoneki, “Prefedge: Ssd
prefetcher for large-scale graph traversal,” in Proceedings of Interna-
tional Conference on Systems and Storage, SYSTOR 2014, 2014.
[4] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry, “Challenges in
parallel graph processing,” Parallel Processing Letters, vol. 17, no. 01,
2007.
[5] D. Merrill, M. Garland, and A. Grimshaw, “Scalable gpu graph traver-
sal,” in Proceedings of the 17th ACM SIGPLAN Symposium on Princi-
ples and Practice of Parallel Programming, PPoPP ’12, 2012.
[6] F. McSherry, M. Isard, and D. G. Murray, “Scalability! but at what
cost?,” in 15th Workshop on Hot Topics in Operating Systems (HotOS
XV), May 2015.
[7] T.-F. Chen and J.-L. Baer, “Reducing memory latency via non-blocking
and prefetching caches,” in Proceedings of the Fifth International
Conference on Architectural Support for Programming Languages and
Operating Systems, ASPLOS V, 1992.
[8] V. Viswanathan, “Disclosure of h/w prefetcher control on some intel
processors.” https://software.intel.com/en-us/articles/
disclosure-of-hw-prefetcher-control-on-some-intel-
processors, Sept. 2014.
[9] D. Joseph and D. Grunwald, “Prefetching using markov predictors,” in
Proceedings of the 24th Annual International Symposium on Computer
Architecture, ISCA ’97, 1997.
[10] R. Cooksey, S. Jourdan, and D. Grunwald, “A stateless, content-
directed data prefetching mechanism,” in Proceedings of the 10th
International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS X, 2002.
[11] D. Callahan, K. Kennedy, and A. Porterfield, “Software prefetching,”
in Proceedings of the Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS
IV, (New York, NY, USA), 1991.
[12] V. Malhotra and C. Kozyrakis, “Library-based prefetching for pointer-
intensive applications,” tech. rep., 2006.
[13] O. Kocberber, B. Falsafi, K. Lim, P. Ranganathan, and S. Harizopoulos,
“Dark silicon accelerators for database indexing,” in 1st Dark Silicon
Workshop (DaSi), 2012.
[14] S. Ainsworth and T. M. Jones, “Graph prefetching using data structure
knowledge,” in Proceedings of the 2016 International Conference on
Supercomputing, ICS ’16, 2016.
[15] S. Kumar, N. Vedula, A. Shriraman, and V. Srinivasan, “Dasx: Hard-
ware accelerator for software data structures,” in Proceedings of the
29th ACM on International Conference on Supercomputing, ICS ’15,
(New York, NY, USA), ACM, 2015.
[16] H. Al-Sukhni, I. Bratt, and D. A. Connors, “Compiler-directed content-
aware prefetching for dynamic data structures,” in Proceedings of the
12th International Conference on Parallel Architectures and Compila-
tion Techniques, PACT ’03, 2003.
[17] C.-L. Yang and A. Lebeck, “A programmable memory hierarchy for
prefetching linked data structures,” in High Performance Computing
(H. Zima, K. Joe, M. Sato, Y. Seo, and M. Shimasaki, eds.), vol. 2327
of Lecture Notes in Computer Science, 2002.
[18] B. Falsafi and T. F. Wenisch, “A primer on hardware prefetching,”
Synthesis Lectures on Computer Architecture, vol. 9, no. 1, 2014.
[19] A. Roth, A. Moshovos, and G. S. Sohi, “Dependence based prefetching
for linked data structures,” in Proceedings of the Eighth International
Conference on Architectural Support for Programming Languages and
Operating Systems, ASPLOS VIII, 1998.
[20] M. Annavaram, J. M. Patel, and E. S. Davidson, “Data prefetching by
dependence graph precomputation,” in Proceedings of the 28th Annual
International Symposium on Computer Architecture, ISCA ’01, 2001.
[21] A. Moshovos, D. N. Pnevmatikatos, and A. Baniasadi, “Slice-
processors: An implementation of operation-based prediction,” in
Proceedings of the 15th International Conference on Supercomputing,
ICS ’01, 2001.
[22] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead execu-
tion: An alternative to very large instruction windows for out-of-order
processors,” in Proceedings of the 9th International Symposium on
High-Performance Computer Architecture, HPCA ’03, (Washington,
DC, USA), pp. 129–, IEEE Computer Society, 2003.
[23] X. Yu, C. J. Hughes, N. Satish, and S. Devadas, “IMP: Indirect memory
prefetcher,” in Proceedings of the 48th International Symposium on
Microarchitecture, MICRO-48, 2015.
[24] D. Kim and D. Yeung, “Design and evaluation of compiler algorithms
for pre-execution,” SIGPLAN Not., vol. 37, Oct. 2002.
[25] D. Kim and D. Yeung, “A study of source-level compiler algorithms for
automatic construction of pre-execution code,” ACM Trans. Comput.
Syst., vol. 22, Aug. 2004.
[26] E. Lau, J. E. Miller, I. Choi, D. Yeung, S. Amarasinghe, and A. Agar-
wal, “Multicore performance optimization using partner cores,” in
Proceedings of the 3rd USENIX Conference on Hot Topic in Paral-
lelism, HotPar’11, 2011.
[27] T. C. Mowry, M. S. Lam, and A. Gupta, “Design and evaluation of
a compiler algorithm for prefetching,” in Proceedings of the Fifth
International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS V, 1992.
[28] P. Demosthenous, N. Nicolaou, and J. Georgiou, “A hardware-efficient
lowpass filter design for biomedical applications,” in Biomedical Cir-
cuits and Systems Conference (BioCAS), 2010 IEEE, Nov 2010.
[29] http://www.arm.com/products/processors/cortex-m/
cortex-m0.php.
[30] “Intel core i7-960 processor (8m cache, 3.20 ghz, 4.80 gt/s intel
qpi) specifications.” http://ark.intel.com/products/37151/
Intel-Core-i7-960-Processor-8M-Cache-3_20-GHz-4_80-
GTs-Intel-QPI.
[31] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,
J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell,
M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simula-
tor,” SIGARCH Comput. Archit. News, vol. 39, Aug. 2011.
[32] C. Lattner and V. Adve, “Llvm: A compilation framework for lifelong
program analysis & transformation,” in Proceedings of the Interna-
tional Symposium on Code Generation and Optimization: Feedback-
directed and Runtime Optimization, CGO ’04, 2004.
[33] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, “Introduc-
ing the graph 500,” Cray User’s Group (CUG), May 5, 2010.
[34] J. Siek, L.-Q. Lee, and A. Lumsdaine, The Boost Graph Library: User
Guide and Reference Manual. Boston, MA, USA: Addison-Wesley
Longman Publishing Co., Inc., 2002.
[35] S. Blanas, Y. Li, and J. M. Patel, “Design and evaluation of main
memory hash join algorithms for multi-core cpus,” in Proceedings of
the 2011 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’11, 2011.
[36] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas,
R. Rabenseifner, and D. Takahashi, “The hpc challenge (hpcc) bench-
mark suite,” in Proceedings of the 2006 ACM/IEEE Conference on
Supercomputing, SC ’06, 2006.
[37] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter,
L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S.
Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga,
“The NAS Parallel benchmarks – summary and preliminary results,” in
Proceedings of the 1991 ACM/IEEE Conference on Supercomputing,
Supercomputing ’91, 1991.
