Gene-Patterns: Should Architecture be Customized for Each Application? by Liu, Yuhang et al.
ICT Technical Report #20180310
Gene-Patterns: Should Architecture be Customized for Each Application?
Yuhang Liu*, Luming Wang*, Xiang Li*, Yang Wang+, Mingyu Chen*, and Yungang Bao*
*Institute of Computing Technology, Chinese Academy of Sciences
+Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Abstract
Providing architectural support is crucial for newly arising
applications to achieve high performance and high system
efficiency. Currently there is a trend in designing acceler-
ators for special applications, while arguably a debate is
sparked whether we should customize architecture for each
application. In this study, we introduce what we refer to as
Gene-Patterns, which are the base patterns of diverse applica-
tions. We present a Recursive Reduce methodology to identify
the hotspots, and a HOtspot Trace Suite (HOTS) is provided
for the research community. We first extract the hotspot pat-
terns, and then, remove the redundancy to obtain the base
patterns. We find that although the number of applications
is huge and ever-increasing, the amount of base patterns is
relatively small, due to the similarity among the patterns of
diverse applications. The similarity stems not only from the
algorithms but also from the data structures. We build the
Periodic Table of Memory Access Patterns (PT-MAP), where
the indifference curves are analogous to the energy levels in
physics, and memory performance optimization is essentially
an energy level transition. We find that inefficiency results
from the mismatch between some of the base patterns and the
micro-architecture of modern processors. We have identified
the key micro-architecture demands of the base patterns. The
Gene-Pattern concept, methodology, and toolkit will facilitate
the design of both hardware and software for the matching
between architectures and applications.
1. Introduction
Seventy years into its history, computer science is at
a turning point, which is largely manifested in three aspects.
First, Moore’s law is deemed to be approaching its end, as
the growth in the density of chip components is gradually
slowing down and deviating significantly from the forecast of
Moore’s law [1]. It seems that the post-Moore era is
coming. Second, brand new applications are emerging daily,
which not only aggravates the memory wall problem but also
requires new underlying micro-architectures, especially for
the memory systems [2]. Third, transistor scaling and voltage
scaling are no longer in line with each other, which results
in the utilization wall and the failure of Dennard scaling [3],
and thus calls for an architecture providing high hardware
utilization and high energy efficiency.
This paper, published on arXiv, is the exact version we submitted to
ASPLOS’18 two years ago.
Applications specify the demands for processor architecture.
However, there exists a mismatch between the existing pro-
cessor architecture and the increasingly diverse applications.
General-purpose CPUs and GPUs are facing difficulties adapt-
ing to the application diversity. As a result, for a specified
application, a large fraction of transistors could be wasted, and
thus only a small fraction of peak performance is achieved.
Although FPGAs provide elasticity when they are used to
customize the architecture for applications, the customization
process is very time consuming. As a result, FPGAs have not
been used as widely as CPUs and GPUs.
To take advantage of the three components (i.e., CPUs,
GPUs and FPGAs), architectures have evolved towards hetero-
geneous chip multi-processors (CMPs) that comprise a mix
of cores and accelerators, and thus provide some architecture
diversity for applications. In recent years, the heterogeneous
CMPs have been customized for special domains such as those
in machine learning related applications [4]. These applica-
tions are pervasive but they only represent a small portion of
all the applications available today and in future. Compared to
the increasing application diversity, the diversity in heteroge-
neous CMPs lags far behind. Therefore, a prominent issue is
whether we should customize the CMP architectures for each
application. In this study, our answer is: not necessarily.
Our solution in this study is inspired by biology, where
it is well known that the properties of living organisms
are determined by the combination of a
series of genes. If we think of applications as “lives”, then
all we need to do is to find the corresponding “genes”. Once
we find the “genes” of applications, we only need to focus on
the genes, the types of which are very limited. In that sense,
the design process of computing systems can benefit from
“genetic engineering”.
This paper provides the first (to the best of our knowl-
edge) comprehensive study on the hotspot characteristics of
quite diverse applications. Specifically, we have examined
the patterns of 106 hotspots of 62 benchmarks from eight
representative suites, including SPECint and SPECfp from
SPEC CPU2006 [5] [6], PARSEC [7], BigDataBench [8], ML-
Pack [9], HPCC [10], HPCG [11] and Graph500 [12]. The
applications are from different domains from big data ana-
lytics to high performance computing, from single-thread to
multi-thread. Our study reveals lots of interesting findings and
provides useful guidance for memory access pattern detection,
classification, and optimization.
arXiv:1909.09765v2 [cs.DC] 1 Nov 2019
In this study, for the first time, memory access patterns have
been experimentally identified, mathematically formalized,
and graphically visualized simultaneously for diverse appli-
cations. This pattern study reveals the essence of memory
performance bottleneck and it is useful for code optimization
and architecture design. The base patterns give incentives to
computer system designers to invest in capabilities that will
impact the collective performance of these essential patterns.
In this study, our main contributions are the following:
• We propose a Recursive Reduce (ReRe) methodology for
identifying and representing the memory access pattern of
diverse applications. Accompanying this, we also provide a
HOtspot Trace Suite (HOTS) for the research community.
• We compare the similarities and differences among the
programs in different domains. We propose a method using
five metrics, Reuse aware Locality (RaL) [13], pipeline stall
degree, L2 and beyond active degree, prefetch/request ratio
and latency non-hidden degree. We consider metrics, codes
and pattern figures simultaneously to explore insightful
results.
• We build the Periodic Table of Memory Access Patterns
(PT-MAP), where memory performance of a pattern is de-
termined by the pattern’s location in the table. The indiffer-
ence curves in PT-MAP are analogous to the energy levels
in physics, and memory performance optimization is essen-
tially an energy level transition.
• We extract the “genes” of applications, which are called
base patterns, gene-patterns or meta-patterns, defined as the
minimal complete set (MCS) of building blocks of memory
access patterns. We find that today’s predominant micro-
architecture cannot match all the base patterns, which thus
calls for a series of targeted changes to micro-architecture
in the fetching granularity, caching and prefetching policy.
The remainder of this paper is organized as follows. Sec-
tion 2 describes the recursive reduce methodology. Section 3
characterizes the hotspots with metrics, and Section 4 visual-
izes the patterns and presents the base patterns. We discuss
Gene-Pattern aware accelerators and programming in Sec-
tion 5 and Section 6, respectively. We summarize related work
in Section 7 and conclude in Section 8.
2. Recursive Reduce Methodology
2.1. Overview
The significance of understanding the characteristics of applications
has been well recognized by the research community. To
design a desirable architecture, we first need a good under-
standing of application behaviors. Specifically, it is necessary
for architects to explore the application space, which is the
collection of all possible applications. However, the number
of applications that currently exist and are likely to exist in
future is countless. Furthermore, each application consists
of many lines of code. What makes things worse is the extremely
low speed of cycle-accurate simulation, which is 5
orders of magnitude slower than physical execution [14]. As a
result, exhaustive consideration of application-to-architecture
mappings is infeasible.
Figure 1: The Recursive Reduce methodology for building the
base patterns of applications (a × b means that there exist a
different elements and each element is of size b). Levels:
Applications (~10^6 × 10^3; diverse applications in different
domains); Benchmarks (~10^2 × 10^3; PARSEC, HPCC, HPCG,
SPECint, SPECfp, MLPack, BigDataBench, Graph500); Hotspots
(~10^2 × 10; very small code segments); MAP (~10^2 × 10; all
the hotspot patterns); Gene (~10; base patterns).
In this study, we present a Recursive Reduce method (abbre-
viated as ReRe) for identifying and representing the patterns
of diverse applications. As shown in Fig. 1, we use various
benchmark suites in the first step as the representative of all the
real applications. Previous studies have characterized applica-
tions from the perspective of the whole application; this seems
a natural choice [7] [8]. However, the benchmark programs
are still very large and difficult to analyze. We observed the
fact that, in an application program, although there are many
lines of code, usually only a very small portion consumes most
of the execution time. Moreover, an application may comprise
several hotspots, and one hotspot usually corresponds to a
pattern, thus an application may have more than one pattern
during its execution. If we measured the average value of
metrics of the whole application, the diversity of the patterns
in an application is hidden by the average. Therefore, in this
study we characterize applications from the perspective of the
identified hotspots rather than the whole application.
Modern commodity processors provide a number of hard-
ware performance counters to support micro-architecture level
profiling. We use PAPI [15] to collect about 30 events, whose
numbers and unit masks can be found in the Intel Developer’s
Manual [16]. Based on the events, we use HPCToolkit [17]
and PerfExpert [18] to identify the hotspots.
To recognize hotspots, debug information produced by the
compiler is required. The debug information includes the
mapping between source code and the executable binary. An
application’s executable binary (embedded with debug information)
is executed and profiled by HPCToolkit. Then, a
performance database is produced, which contains the application’s
source code information and performance metrics that
are calculated from the performance counter values. Finally,
PerfExpert is used to analyze the data in the performance
database in order to identify the code segments whose importance
is greater than a threshold (e.g., 5%). A hotspot’s
importance is computed in Eq. (1) as follows,
\[
\text{Importance} = \frac{t \times n}{t_s + n \times t_p} > \Delta\%, \tag{1}
\]
where t is the time taken by the code segment, n is the number
of threads, and t_s and t_p are the time taken by the sequential
and parallel sections of the total benchmark, respectively. In
this manner, the hotspots of an application can be identified.
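The importance criterion of Eq. (1) can be illustrated with a minimal Python sketch. The function names and the timings below are hypothetical and chosen only for illustration; the 5% default threshold is the example value mentioned in the text.

```python
def hotspot_importance(t, n, t_s, t_p):
    """Importance of a code segment, following Eq. (1).

    t   : time taken by the code segment
    n   : number of threads
    t_s : time taken by the sequential sections of the benchmark
    t_p : time taken by the parallel sections of the benchmark
    """
    denominator = t_s + n * t_p
    if denominator <= 0:
        raise ValueError("total benchmark time must be positive")
    return (t * n) / denominator


def is_hotspot(t, n, t_s, t_p, threshold=0.05):
    """Return True when the importance exceeds the threshold (Delta%)."""
    return hotspot_importance(t, n, t_s, t_p) > threshold


# Hypothetical timings in seconds, not measured values.
importance = hotspot_importance(t=2.0, n=4, t_s=10.0, t_p=7.5)
print(f"importance = {importance:.2f}")
print("hotspot" if is_hotspot(2.0, 4, 10.0, 7.5) else "not a hotspot")
```

With these made-up numbers the segment accounts for 8 of 40 weighted time units, so it clears the 5% threshold comfortably.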
For each hotspot, we conduct detailed analysis using met-
rics, which are the attributes of the hotspot’s pattern. This
fine-grained analysis enables us to gain more insights, com-
pared to a coarse-grained one. We then plot the access patterns
and analyze the corresponding codes. Finally, and most
interestingly, with the ReRe method we hierarchically mine the
essential patterns, based on which we remove redundancy and
obtain the base patterns.
2.2. Benchmark and Platform
We use benchmarks from eight suites, each of which
is designed to exercise computational and memory access
patterns that closely match a different and broad set of
important applications. Specifically,
• BigDataBench [8] is for large footprint workloads, mod-
elling typical big data application domains: search engine,
social networks, e-commerce, multimedia analytics, and
bioinformatics;
• MLPack [9] is a collection of artificial and real-world ma-
chine learning workloads;
• PARSEC [7] is a benchmark suite composed of mul-
tithreaded programs. The suite focuses on emerg-
ing workloads and was designed to be representative
of next-generation shared-memory programs for chip-
multiprocessors;
• SPECint [5] [6] comprises single thread integer benchmarks
from SPEC CPU2006, stressing a system’s processor, mem-
ory subsystem and compiler;
• SPECfp [5] [6] is similar to SPECint and is also a subset of
SPEC CPU2006, but focuses on floating-point processing;
• HPCC [10] includes LINPACK and RandomAccess and
tests a number of independent attributes of the performance
of high-performance computer (HPC) systems;
• HPCG [11] is intended as a complement to the LINPACK
benchmark, currently used to rank the Top500 computing
systems;
• Graph500 [12] is to augment the LINPACK with data-
intensive applications. Because graph algorithms are a core
part of most analytics workloads, Graph500 offers a fo-
rum for the community and provides a rallying point for
data-intensive supercomputing problems.
We conduct our study on a typical commodity server with
two Intel Xeon E5-2630 v4 and 64GB of DRAM in each blade.
Each Intel E5-2630 processor includes ten aggressive out-of-
order processor cores with a deep (three-level) on-chip cache
hierarchy. Table 1 summarizes the architectural parameters of
the experimental system.
Table 1: The experimental system parameters
Processor                       14nm Intel Xeon E5-2630 v4 operating at 2.2 GHz
Chip number                     2 LGA sockets
Cores per chip                  10 OoO cores
Threads per core                2
L1 dcache and icache latency    4 cycles
L2 latency                      12 cycles
L3 latency                      40 cycles
Memory latency                  150±50 cycles
FP latency                      2 cycles
FP slow latency                 18 cycles
TLB latency                     45 cycles
Memory                          4 DDR3 channels, 16GB per channel
We run all 62 benchmarks from the eight
suites to completion. Using the performance counters with the criterion
in Eq. (1), we have identified 106 hotspots, each having a
small number of lines of code. On average, the importance
of the 106 hotspots is 23%. In other words, although each
hotspot only includes a small number of lines of code, it takes
more than 20 percent of the total execution time on average. We also use
PIN [19] to collect the traces of each hotspot. The hotspot
locations in the programs and the traces of the hotspots are
formed into a HOtspot Trace Suite called HOTS, which will
be released for the research community.
3. Characterizing the Hotspots with Metrics
In this section, we use metrics to characterize the hotspots.
Specifically, for each hotspot, we evaluate the effectiveness
of the following micro-architecture components: pipelines,
on-chip cache hierarchy, prefetchers and memory controllers.
A modern CMP comprises many processing cores in a single
chip, and each core can issue several instructions in a cycle and
execute instructions out of order (OoO). We propose pipeline
stall degree to quantify the pipeline efficiency. The CMP has a deep
(three-level) cache hierarchy, where each core is equipped with
split L1 instruction and data caches and a unified L2 cache,
and the LLC is shared among all the on-chip cores. We propose
an evaluation method using Reuse aware Locality (RaL) [13],
L2 and beyond active degree and latency non-hidden degree
to quantify the cache hierarchy effectiveness. To mitigate
the latency gap between L2 and LLC, the L2 controller of
each core is built with prefetchers that can issue prefetches.
We propose prefetch/request ratio to quantify the prefetcher
utilization. We use L3 APC (Accesses Per memory active
Cycle) [20] to denote off-core access performance. All the
metrics used can be measured by performance counters on
commodity processors.
Figure 2: The IPC of the hotspots in diverse benchmark suites
(x-axis: BigDataBench, Graph500, HPCC, HPCG, MLPack, PARSEC,
SPECfp, SPECint; y-axis: IPC from 0.0 to 3.5)
3.1. Pipeline Efficiency
• Most hotspots have low pipeline efficiency in terms of IPC.
• The hotspots in emerging workloads including Graph500,
HPCC, and HPCG have the lowest IPC.
• A static instruction of a Bigdata application runs many
times to process different data elements.
• Most hotspots in HPCC, HPCG, SPEC CPU2006 and PAR-
SEC have very large pipeline stall degrees.
We use three metrics to quantify the pipeline efficiency.
Apart from IPC, the number of instructions executed per clock
cycle, we define a new metric, pipeline stall degree. When the
number of instructions and clock frequency are fixed, IPC in-
dicates the final performance. To provide more details behind
the IPC value, we define pipeline stall degree as the ratio of
pipeline stall cycles to the total execution cycles, and define
latency non-hidden degree as the ratio of pipeline stall cycles
to the total memory active cycles.
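The three pipeline metrics defined above are simple ratios of counter values. The following Python sketch shows one way to compute them; the class and field names are hypothetical placeholders rather than real performance-counter event names, and the sample values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class HotspotCounters:
    """Raw counts for one hotspot (hypothetical field names)."""
    instructions: int          # dynamic instructions retired
    total_cycles: int          # total execution cycles
    stall_cycles: int          # pipeline stall cycles
    memory_active_cycles: int  # cycles in which the memory system is active


def ipc(c: HotspotCounters) -> float:
    """Instructions executed per clock cycle."""
    return c.instructions / c.total_cycles


def pipeline_stall_degree(c: HotspotCounters) -> float:
    """Ratio of pipeline stall cycles to total execution cycles."""
    return c.stall_cycles / c.total_cycles


def latency_non_hidden_degree(c: HotspotCounters) -> float:
    """Ratio of pipeline stall cycles to total memory active cycles."""
    return c.stall_cycles / c.memory_active_cycles


# Invented sample values, not measurements from the paper.
sample = HotspotCounters(
    instructions=1_200_000,
    total_cycles=1_000_000,
    stall_cycles=400_000,
    memory_active_cycles=800_000,
)
print(ipc(sample), pipeline_stall_degree(sample), latency_non_hidden_degree(sample))
```

Keeping the raw counts in one record makes it easy to derive all three ratios from a single profiling run of a hotspot.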
Fig. 2 shows the IPC of the hotspots. On average, the
hotspots in Graph500, HPCC, and HPCG have low IPC, i.e.,
low efficiency. Therefore, we can infer that there exists a
mismatch between the demands of these emerging applications
and the micro-architecture of predominant modern processors.
For instance, the data structures of Graph500 incur many
indirect memory accesses which are challenging for current
mainstream architectures.
Fig. 3 shows the workload size of the hotspots in terms of
the number of dynamic instructions (IC), while Fig. 4 presents
the hotspots’ pipeline stall degrees.
We find that the number of static instructions of all the
hotspots is in the same order of magnitude, which is defined by
only a few lines of code. The hotspots in BigDataBench have
executed the largest number of dynamic instructions. However,
the pipeline stall degree of BigDataBench is the smallest. We
can infer that a static instruction of BigDataBench has been
run many times to process different data elements.
Pipeline stall directly results in the low efficiency of a computing
system. On average, the hotspots in HPCC, HPCG,
SPEC CPU2006 and PARSEC have very large pipeline stall
degrees.
Figure 3: The dynamic instruction count (IC) of the hotspots
in diverse benchmark suites (y-axis: IC, log scale from 10^9
to 10^13)
Figure 4: The pipeline stall degree of the hotspots in diverse
benchmark suites (y-axis: 0% to 100%)
3.2. On-chip Cache Effectiveness
• The hotspots in emerging applications, HPCC, HPCG and
Graph500 have low on-chip cache effectiveness in terms
of RaL [13], since a cache line has been reused few times
before being evicted from on-chip caches.
• The differences among the on-chip cache effectiveness of
different hotspots are quite significant; some can be more
than 255 thousand times more effective than others.
Locality of data accesses is the fundamental principle that
drives the hierarchical memory system design. Given the
significance of the locality principle, previous works have
attempted to quantify the locality for deep understanding of
reference patterns and guiding compiler and architecture de-
sign to exploit program locality. For temporal locality, the
histogram of reuse distances [21] or LRU (least recently used)
stack distances [22] can be computed based on a sequential
address trace. For spatial locality, however, there is a lack of
consensus for such a quantitative measure and several ad-hoc
metrics [23, 24] are proposed based on intuitive notions. It is
not easy to represent locality with only a single score. Moreover,
considering what can be measured on real platforms, we found
that reuse-distance-like metrics cannot be measured directly
via performance counters in commercial processors.
Figure 5: The Reuse aware Locality (RaL) of the hotspots in
diverse benchmark suites (y-axis: RaL, log scale from 10^0
to 10^5)
It is known that the memory stall time stems from the large
access time ratio among the memory levels (i.e., the latency
effect [25]) and the limitation of the pin bandwidth (i.e. the
bandwidth effect [26] [27]). Prior work uses MR (Miss Rate)
or MPKI (Misses Per Kilo Instructions) to characterize lo-
cality [28] [29], which makes sense theoretically. However,
commercial processors include multiple cache levels, so it is
not convenient to use the metrics of different levels simultane-
ously and using only one or two of them would be misleading.
We use Reuse-aware Locality (RaL) [13], which refers to
the average amount of L1 cache hits that can be met by one
off-chip data movement. The larger the value of RaL, the
more reuses of the data, and the lower the requirement for
off-chip bandwidth, since each off-chip data movement car-
ries one cache line, the size of which is fixed in predominant
commercial processors, i.e., 64B. With only one score value,
RaL quantifies the effectiveness of the on-chip memory hier-
archy. On-chip cache hierarchy takes most of the chip area
and transistors. However, not all the types of applications can
obtain the same large benefit as that in Bigdata applications.
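As a minimal sketch of the RaL definition above, the Python snippet below divides the number of L1 cache hits by the number of off-chip data movements, and derives the off-chip traffic implied by a given RaL from the 64B line size stated in the text. The function names and the sample counts are hypothetical.

```python
CACHE_LINE_BYTES = 64  # cache line size stated in the text


def reuse_aware_locality(l1_hits, off_chip_transfers):
    """Average number of L1 hits served per off-chip data movement."""
    if off_chip_transfers <= 0:
        raise ValueError("at least one off-chip transfer is required")
    return l1_hits / off_chip_transfers


def off_chip_bytes(l1_hits, ral):
    """Off-chip traffic implied by a RaL value, one cache line per transfer."""
    if ral <= 0:
        raise ValueError("RaL must be positive")
    return l1_hits / ral * CACHE_LINE_BYTES


# Invented counts: one million L1 hits served by ten thousand transfers.
ral = reuse_aware_locality(l1_hits=1_000_000, off_chip_transfers=10_000)
print(f"RaL = {ral}, off-chip traffic = {off_chip_bytes(1_000_000, ral)} bytes")
```

For a fixed number of L1 hits, doubling RaL halves the implied off-chip traffic, which is why a larger RaL lowers the off-chip bandwidth requirement.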
In Fig. 5, we can see that a significant portion of the Bigdata
benchmarks have high RaL values. For these hotspots, once
the data has been loaded into the on-chip cache hierarchy,
they can be reused many times until being moved out of the
processor chip. The hotspots in emerging applications, HPCC,
HPCG and Graph500 have low RaLs.
A hotspot from RandomAccess in HPCC suite has the least
RaL of 1.98, while a hotspot from wordcount in the Big-
DataBench has the largest RaL of 506,081. The difference is
quite significant, with wordcount being 255 thousand times
more effective than RandomAccess. On the other hand, we
found that 20 out of 34 hotspots of SPECint and SPECfp
have small RaLs (< 100). The small RaL implies that even
for single-thread programs on-chip caches have not been efficiently
utilized, i.e., many cache lines are dead blocks during
most of their lifetime. This fact calls for a change in the
trajectory of processors.
Figure 6: The L2 and beyond active degree of the hotspots in
diverse benchmark suites (y-axis: 0% to 100%)
Figure 7: The latency non-hidden degree of the hotspots in
diverse benchmark suites (y-axis: 0% to 100%)
Besides the RaL, we define two additional metrics,
L2 and beyond active degree and latency non-hidden degree.
L2 and beyond active degree is the ratio of L1 pending cycles
to total memory (including L1) active cycles, thus it shows
how busy the memory system is. Latency non-hidden degree
is the ratio of pipeline stall cycles to total memory (including
L1) active cycles, thus it quantifies to what degree the memory
access latency has not been hidden.
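Both metrics share the same denominator, the total memory active cycles, so they can be read side by side for a hotspot. The sketch below is only an illustration under that reading; the function names and cycle counts are hypothetical.

```python
def _cycle_ratio(part_cycles, memory_active_cycles):
    """Share of the memory active cycles taken by part_cycles."""
    if memory_active_cycles <= 0:
        raise ValueError("memory active cycles must be positive")
    return part_cycles / memory_active_cycles


def l2_and_beyond_active_degree(l1_pending_cycles, memory_active_cycles):
    """Ratio of L1 pending cycles to total memory active cycles."""
    return _cycle_ratio(l1_pending_cycles, memory_active_cycles)


def latency_non_hidden_degree(stall_cycles, memory_active_cycles):
    """Ratio of pipeline stall cycles to total memory active cycles."""
    return _cycle_ratio(stall_cycles, memory_active_cycles)


# Invented cycle counts for a single hotspot.
active = l2_and_beyond_active_degree(600_000, 750_000)
exposed = latency_non_hidden_degree(450_000, 750_000)
print(f"L2 and beyond active degree = {active:.2f}")
print(f"latency non-hidden degree   = {exposed:.2f}")
```

In this made-up case the levels beyond L1 are busy for most of the memory active time and a large share of that time shows up as pipeline stalls.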
As shown in Fig. 6, on average the hotspots in HPCC and
HPCG have large L2 and beyond active degrees, which im-
plies that the memory hierarchy is active during most of the
execution time, and thus causes large energy consumption. In
comparison, from Fig. 2, it is seen that the IPC of HPCC and
HPCG are very low. Therefore, we can infer that the busy
activities in the memory hierarchy are very inefficient, which
implies a significant mismatch between the patterns of HPCC
and HPCG and the micro-architecture in modern commodity
processors.
As shown in Fig. 7, on average the hotspots in HPCC and
HPCG have large latency non-hidden degrees. The diversity of
latency non-hidden degrees of hotspots in PARSEC, SPECint
and SPECfp is high. The mcf in SPECint, milc in SPECfp,
and canneal in PARSEC have hotspots that correspond
to the high latency non-hidden degrees.
Figure 8: The L2 prefetch/request ratio of the hotspots in diverse
benchmark suites (y-axis: 0.0 to 1.0)
3.3. Prefetch/Request Ratio
• The RandomAccess in HPCC has the smallest prefetch per-
centage, while the HPCG has the highest.
• Prefetchers are effective for most hotspots except those in
RandomAccess from HPCC.
• Among the multithread benchmark suites except HPCC,
Graph500’s prefetcher has the lowest effectiveness.
In Intel Xeon processors [16], automatic hardware prefetch
can bring cache lines into the unified last-level cache based
on prior data misses. We find that there are two limitations
of the Intel prefetcher. First, it attempts to prefetch only two
cache lines ahead of the prefetch stream. Moreover, it requires
some regularity in the data access patterns, i.e., a data access
pattern has a constant and short stride. If the access stride
is not constant, the automatic hardware prefetcher can partly
mask memory latency if the strides of two successive cache
misses are less than the trigger threshold distance.
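The regularity requirement described above can be illustrated by inspecting the strides between successive cache-miss addresses. This Python sketch is not a model of the actual Intel prefetcher: the function name is hypothetical and the trigger distance of 256 bytes is an assumed placeholder, not a documented value.

```python
def stride_profile(miss_addresses, trigger_distance=256):
    """Summarize how regular a stream of cache-miss addresses is.

    trigger_distance is an assumed threshold in bytes, used only for
    illustration. The result reports whether the stride is constant and
    which fraction of successive misses lie closer than the threshold.
    """
    strides = [b - a for a, b in zip(miss_addresses, miss_addresses[1:])]
    if not strides:
        return {"constant_stride": False, "within_trigger": 0.0}
    constant = len(set(strides)) == 1
    near = sum(1 for s in strides if abs(s) < trigger_distance)
    return {"constant_stride": constant, "within_trigger": near / len(strides)}


sequential = [0, 64, 128, 192]         # constant, short stride
irregular = [0, 4096, 64, 100_000]     # scattered addresses
print(stride_profile(sequential))
print(stride_profile(irregular))
```

A stream with a constant stride and a high within-trigger fraction is the kind the text describes as suitable for hardware prefetching, while a low fraction suggests the prefetcher can mask little of the latency.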
The bar chart in Fig. 8 illustrates the ratio of L2 prefetches
to L2 requests. The prefetches are triggered by L2 prefetchers
according to the patterns of demand requests. On average,
for most suites except for HPCC, prefetches occupy a large
portion of requests. As shown in Fig. 8, for each of the eight
suites except HPCC, there exists at least one hotspot that
has a high L2 prefetch/request ratio. The RandomAccess in
HPCC has the smallest prefetch percentage (2.6% on average),
while the HPCG has the highest prefetch percentage (88% on
average). According to the prefetch percentage, the hotspots
are grouped into two categories, prefetch-friendly and prefetch
non-friendly. We found that the prefetch-friendly hotspots
have similar patterns that satisfy the trigger condition of Intel
prefetch hardware. We visualize the patterns in Section 4.
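The grouping by prefetch percentage can be sketched as follows. The text does not state the cut-off between the two categories, so the 0.5 threshold below is purely an assumption, as are the function names; the sketch also assumes that the L2 request count includes the prefetches.

```python
def prefetch_request_ratio(l2_prefetches, l2_requests):
    """Ratio of L2 prefetches to L2 requests (requests assumed to include prefetches)."""
    if l2_requests <= 0:
        raise ValueError("L2 request count must be positive")
    return l2_prefetches / l2_requests


def prefetch_category(ratio, threshold=0.5):
    """Group a hotspot by its prefetch/request ratio (threshold is an assumption)."""
    return "prefetch-friendly" if ratio >= threshold else "prefetch non-friendly"


# Average percentages reported in the text for RandomAccess (HPCC) and HPCG.
for name, ratio in [("RandomAccess", 0.026), ("HPCG", 0.88)]:
    print(name, prefetch_category(ratio))
```

Any threshold between the two reported averages would separate these two cases in the same way; a different cut-off may be appropriate for hotspots with intermediate ratios.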
3.4. Off-core Access Throughput
• On average, single-thread applications have higher L3 APC
than multithread applications.
• HPCC has the lowest L3 APC.
• Among the multithread benchmark suites, MLPack has the
highest L3 APC.
Figure 9: The L3 Accesses Per memory active Cycle (APC) of
the hotspots in diverse benchmark suites (y-axis: L3 APC
from 0.000 to 0.175)
RaL is an important lever for memory performance
optimization. If its value before optimization is already
high, then to remove the hotspot we should turn to other levers,
such as increasing concurrency. If the RaL value before
optimization is small, one can focus on improving
RaL before turning to other levers.
To measure the off-core access performance, we have at
least four choices: MLP, AMAT, Bandwidth, and APC. MLP
refers to the memory level parallelism. AMAT is the average
off-core access time. Bandwidth here means the off-core
bandwidth consumption.
APC [20] refers to the number of accesses completed per
memory active cycle, and is a measurement that considers both
AMAT and MLP [30]. Thus, APC model is comprehensive
and reflects the quality of service (QoS) of memory system.
Specifically, no matter how long cache hit latency is or how
high cache hit rate is, the concern is the final service quality,
i.e., how many accesses are finished in each memory active
cycle. Due to these merits, we use APC in our study.
Fig. 9 shows the APC value of L3 cache, which reflects how
quickly the off-core accesses can be completed by L3. The
hotspots in PARSEC, Graph500, HPCC and HPCG have low
L3 APC, which implies low off-core access throughput.
By Little’s law, L3 APC equals the value of MLP divided
by memory access latency. Given the same access intensity
of each thread, the more concurrent threads, the higher the
bandwidth utilization. We measure memory access latency
using Intel MLC (memory latency checker) [31]. As shown in
Fig. 10, the memory access latency increases with the band-
width utilization. The hotspots in SPECint and SPECfp have
high L3 APC, implying that single-thread applications have
shorter contention delay than multithread applications. Thus,
it calls for enhancing NoC and memory controllers of CMP
for boosting the L3 APC of multithread applications.
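The relation between APC, MLP and latency stated above follows from Little's law (concurrency equals throughput times latency). A minimal Python sketch, with hypothetical function names and invented values:

```python
def apc(accesses, memory_active_cycles):
    """Accesses completed per memory active cycle."""
    if memory_active_cycles <= 0:
        raise ValueError("memory active cycles must be positive")
    return accesses / memory_active_cycles


def apc_from_littles_law(mlp, latency_cycles):
    """APC implied by Little's law: MLP = APC * latency."""
    if latency_cycles <= 0:
        raise ValueError("latency must be positive")
    return mlp / latency_cycles


# Invented values: 100,000 accesses over 1,000,000 memory active cycles,
# and an average of 4 outstanding accesses with a 40-cycle latency.
measured = apc(accesses=100_000, memory_active_cycles=1_000_000)
modelled = apc_from_littles_law(mlp=4, latency_cycles=40)
print(f"measured APC = {measured:.3f}, modelled APC = {modelled:.3f}")
```

The second function makes the trade-off explicit: at a fixed MLP, any increase in access latency, for example from contention at higher bandwidth utilization, lowers the APC proportionally.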
Figure 10: The memory access latency increases with bandwidth
(x-axis: bandwidth in MB/sec, 0 to 60000; y-axis: latency in ns,
100 to 300)
4. Gene-Patterns and Periodic Table
• (P1,...,P6) is a minimal complete set of base patterns, which
is taken as Gene-Patterns.
• The Periodic Table of patterns can be built based on RaL
and L3 APC, and is similar to the energy level diagram in
physics, where the memory performance optimization is a
transition between two energy levels.
• For the average pipeline stall degree, HPCC > HPCG >
Graph500 > SPECfp > MLPack > SPECint > PARSEC >
BigDataBench as evidenced by the order of the indifference
curves in the Periodic Table.
• High L3 APC and high RaL rarely occur simultaneously for
the hotspots of the benchmarks from diverse domains.
• The location in Periodic Table quantifies the matching de-
gree between a pattern and an architecture.
4.1. Visualizing the Patterns and Finding the Gene-Patterns
The flow chart in Fig. 11 illustrates how we conduct the recur-
sive reduce process. We run the benchmarks and identify their
hotspots. Each hotspot has a few pattern figures. We have
collected 2420 pattern figures, which constitute the library of
patterns. Although the size of the library is very large, we find
that there exist significant similarities between the patterns
of hotspots. Based on the similarities, the base patterns are
extracted from the pattern library by clustering.
Following the workflow in Fig. 11, we find six different
base patterns, which account for only 0.25% (6/2420) of the
pattern library. We plot the base patterns in Fig. 12, where the
horizontal axis specifies the sequence of the accesses, and the
vertical axis indicates the accessed logical address range.
Patterns P1 and P3 are aperiodic straight lines whose slope
indicates how quickly the touched address space is
extended. The larger the slope (k), the larger the working set.
Specifically, when the slope (k) is zero, every access after
the first one reuses the same data. On the other
hand, when the slope (k) is infinite, there is no opportunity for
data reuse.
Figure 11: The two phases of pattern analysis for any application:
training and inference. Training: benchmarks in diverse
domains, importance test (T > Δ%?), hotspot, plot the pattern
figure (MAP), clustering, base patterns. Inference: new
application, importance test (T > Δ%?), hotspot, plot the
pattern figure (MAP), base patterns.
Patterns P2, P5 and P6 are straight lines and periodic. Compared
to patterns P1 and P3, patterns P2 and P6 have more opportunities
for data reuse due to their periods. Compared to
P2 and P6, pattern P5 has a variable period. The variability of
the period impacts the reuse opportunities.
The straight line pattern figures are friendly for prefetchers.
Pattern P4 has no straight line but randomly distributed points,
so prefetchers cannot work well on it.
Providing the minimal complete set (MCS) of patterns is of
great significance for architecture and software design. When
pattern space is huge, there could exist multiple MCSs. We
can prove that (P1, P2, ..., P6) is an MCS.
Theorem-1: (P1, P2, ..., P6) is a minimal complete set of
data access patterns.
Simplified Proof: We can prove (omitted here) that any proper subset
of (P1, P2, ..., P6) is not a complete set. Then we only need
to prove that (P1, P2, ..., P6) is a partition of pattern space S.
First, we have that any two different patterns are exclusive,
i.e.,
$P_i \cap P_j = \emptyset \quad (1 \le i, j \le 6,\ i \ne j)$ (2)
Second, the union of P1 and P3, P1 ∪ P3, represents all the
patterns whose pattern figures are straight lines and aperiodic,
while the union of P2 and P6, P2 ∪ P6, represents all the
patterns whose figures are straight lines and have fixed periods.
P5 represents the patterns whose figures are straight
lines with low slope and variable periods. Patterns
whose figures are straight lines with high slope and
variable periods can be taken as a special case of the Random
Access pattern, P4. Patterns whose pattern figures are
not straight lines can also be classified into the Random Access
pattern, P4. Therefore, we obtain that
$\bigcup_{k=1}^{6} P_k = S.$
By combining the above two aspects, we conclude that (P1,
P2, ..., P6) is a partition of pattern space S.
Figure 12: A minimal complete set of base patterns for the huge pattern space. Each base pattern figure is extracted from the
pattern library. [Panels: P1: k ∈ [0, 64], no period; P2: k ∈ [0, 64], fixed period; P3: k ∈ (64, ∞), no period; P4: random access; P5: k ∈ [0, 64], variable period; P6: k ∈ (64, ∞), fixed period.]
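The case analysis in the proof can be sketched as a small classifier. This is only an illustration: the function name, the boolean `is_straight_line` flag, and the string encoding of the period are hypothetical, and the slope boundary of 64 is read from the panel labels of Fig. 12.

```python
import math

SLOPE_THRESHOLD = 64  # low/high slope boundary taken from the panel labels of Fig. 12


def classify_pattern(is_straight_line, slope, period):
    """Map a pattern figure to one of the six base patterns.

    `period` is "none", "fixed", or "variable" (a hypothetical encoding).
    """
    if period not in ("none", "fixed", "variable"):
        raise ValueError(f"unknown period kind: {period!r}")
    if slope < 0:
        raise ValueError("slope is expected to be non-negative")
    if not is_straight_line:
        return "P4"  # figures that are not straight lines count as Random Access
    low_slope = slope <= SLOPE_THRESHOLD
    if period == "none":
        return "P1" if low_slope else "P3"
    if period == "fixed":
        return "P2" if low_slope else "P6"
    # Variable period: low slope is P5; high slope is a special case of P4.
    return "P5" if low_slope else "P4"


for args in [(True, 8, "none"), (True, math.inf, "none"), (True, 1000, "variable")]:
    print(args, "->", classify_pattern(*args))
```

Every input falls into exactly one branch, which mirrors the two properties used in the proof: the classes are mutually exclusive and together cover the pattern space.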
4.2. Periodic Table of Memory Access Patterns
Depending on the matching degree between the microarchitec-
ture and the Gene-Patterns, the memory access performance
of hotspots is different.
In our experiments, among many metrics, the combination
of RaL and L3 APC is found to be most effective. Fig. 13
shows both the RaL and L3 APC of each hotspot. The upper-
right corner is blank, implying that high L3 APC and high RaL
rarely occur simultaneously for the hotspots of benchmarks
from diverse domains. The hotspots in HPCC and HPCG
are close to the lower-left corner, which explains why they
have a large latency non-hidden degree and pipeline stall degree.
Fig. 13 is the Periodic Table of Memory Access Patterns (PT-
MAP), which can explain the results of latency non-hidden
degree in Fig. 7 and pipeline stall degree in Fig. 4.
For memory access performance, the on-chip effectiveness
(RaL) can substitute off-core memory access concurrency (L3
APC) and vice versa. An indifference curve is a period, which
depicts that the RaL is substitutable for L3 APC. Fig. 13
shows eight indifference curves. Each curve corresponds to a
benchmark suite, since the average value of RaL and L3 APC
of the benchmark suite (marked as a cross) is on the curve.
Following the indifference curves in Fig. 13, we find that
the eight benchmark suites can be ordered according to the
location of their indifference curves from the bottom left to
the upper right: HPCC, HPCG, Graph500, SPECfp, MLPack,
SPECint, PARSEC, and BigDataBench. This order is in agree-
ment with the average pipeline stall degree from large to small
shown in Fig. 4. The agreement demonstrates that the loca-
tion in the PT-MAP quantifies the matching degree between a
pattern and an architecture.
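The ordering of suites by indifference curve can be illustrated with a short sketch. It rests on two assumptions that are not stated in this section: the curves are taken to have a Cobb-Douglas form RaL^alpha × APC^(1-alpha) = const (suggested only by the comparison with REF in Section 7), and alpha = 0.5 is arbitrary. The suite names and the (RaL, L3 APC) values are hypothetical, not measurements.

```python
def energy_level(ral, l3_apc, alpha=0.5):
    """Cobb-Douglas style level; points with equal value share one indifference curve."""
    if ral <= 0 or l3_apc <= 0:
        raise ValueError("both metrics must be positive")
    return (ral ** alpha) * (l3_apc ** (1.0 - alpha))


def centroid(points):
    """Geometrical center (mean RaL, mean L3 APC) of a suite's hotspots."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def l3_apc_on_curve(level, ral, alpha=0.5):
    """L3 APC needed to stay on the curve `level` when RaL changes."""
    return (level / ral ** alpha) ** (1.0 / (1.0 - alpha))


# Hypothetical hotspots as (RaL, L3 APC) pairs; NOT data from the paper.
suites = {
    "suite_low": [(2.0, 0.05), (4.0, 0.15)],
    "suite_high": [(40.0, 0.2), (60.0, 0.4)],
}
levels = {name: energy_level(*centroid(pts)) for name, pts in suites.items()}
ordered = sorted(levels, key=levels.get)  # lower-left curves first
print(ordered)
```

Under this assumed form, `l3_apc_on_curve` expresses the substitution effect: a higher RaL allows a lower L3 APC on the same curve.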
The PT-MAP is similar to the energy level diagram in
physics, where a memory performance optimization is essentially
a transition between two energy levels. In Fig. 13, as
the dashed lines move from the lower-left quarter to the upper-right
quarter, they correspond to increasingly higher energy levels,
where the memory-bound effect is lower. Of practical significance,
in PT-MAP, both L3 APC and RaL can be measured
in commodity processors. Such a simple, quantitative, and
effective periodic table is timely and useful as applications are
increasingly data intensive and diverse.
To enable an energy level transition, we can conduct Gene-
Pattern aware accelerator design and software development.
Figure 13: The periodic table of the hotspots in diverse benchmark suites. The dashed lines are the Indifference Curves, which are
analogous to energy levels (EL) in physics. Each cross marks the geometrical center of a benchmark suite. The EL transition is
due to an optimization for K-Means which will be introduced in Section 6. [Legend: HPCC EL, HPCG EL, SPECfp EL, SPECInt EL, MLPack EL, BigDataBench EL, PARSEC EL, Graph500 EL, An EL Transition.]
5. Gene-Pattern aware Accelerators
In mathematics, the linear space has a series of basic vectors
called “bases”, and any vector in a linear space can be represented
by a linear combination of the bases. If we think of
applications as “vectors” in a linear space, then all we need to
do is find the corresponding “bases”. Once we find the “bases”
of applications, we need only focus on the bases, the types of
which are very limited. In that sense, the computing system
design process can benefit from “linear algebra”.
The base of the pattern space V is (P1, P2, ..., P6). For any
pattern P in V, $P = \alpha_1 \times P_1 + \alpha_2 \times P_2 + \dots + \alpha_n \times P_n$, where
“+” denotes “∪”. Table 2 shows the pattern distribution of different
benchmark suites. When a hotspot has a mixed pattern, we
use weights in the statistical analysis. Table 2 motivates us to
build Gene-Pattern aware accelerators.
Both pattern P1 and P3 can use prefetching, since their
pattern figures are straight lines. However, they also have
significant differences. For pattern P1, the slope of the straight
line is small, so the spatial locality is high. Therefore, for
pattern P1, the open-page policy of DRAM brings more benefits
than the closed-page policy, and the cache line
size can be coarse-grained for effective prefetching.
On the other hand, for pattern P3, the slope of the straight
line is large, so the spatial locality is low and the working set
size increases quickly. As a result, for pattern P3, the closed-page
policy of DRAM brings more benefits than the open-page policy, and
a fine-grained cache line size rather than a coarse-grained one will
reduce bandwidth consumption and mitigate contention, thus
improving performance and reducing power consumption.
Pattern P4 is characterized by Random Access and has low
spatial locality, so it is preferable to use a fine fetching granularity
and the closed-page policy. Pattern P4 has a limited degree
of temporal locality: the same data is read and then written
immediately, and is rarely reused in the future. Therefore,
for Pattern P4, we only need to cache the corresponding data
in L1 rather than L2 and L3. In fact, if we allocate space
for the data of pattern P4 in L2 and L3, the working set in-
creases exponentially and thus consumes the limited cache
space quickly, which would flush the data of peer programs
and severely impact their performance.
For Pattern P6, the slope of the straight line is very large but
the period is fixed. We can use fine-grained prefetch and fetch.
Because the slope is very large, the working set increases
sharply, so caching the data in L1 and L2 is of no use for
performance. The data can be allocated in the LLC, especially in
a die-stacked DRAM cache of several GB. Similar to P3,
P6 prefers the closed-page policy.
Both pattern P2 and P5 have low slope, but their periods
differ: one is fixed and the other is variable. Both can
use caching and coarse-grained fetch. However, their prefetch
policies are different.
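The hints above can be collected into a lookup table, sketched below. The `Hint` record, its field names, and the string labels are hypothetical; a field is `None` wherever the text does not state a preference, and the "ineffective" prefetch entry for P4 follows the earlier remark that prefetchers cannot work well for random access.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hint:
    prefetch: Optional[str]           # None means "not stated in the text"
    page_policy: Optional[str]
    fetch_granularity: Optional[str]
    cache_placement: Optional[str]


HINTS = {
    "P1": Hint("prefetch", "open-page", "coarse", None),
    "P2": Hint(None, None, "coarse", "cache"),
    "P3": Hint("prefetch", "closed-page", "fine", None),
    "P4": Hint("ineffective", "closed-page", "fine", "L1 only"),
    "P5": Hint(None, None, "coarse", "cache"),
    "P6": Hint("fine-grained prefetch", "closed-page", "fine",
               "LLC / die-stacked DRAM cache"),
}


def hints_for(detected):
    """Return the hints for the detected base patterns, in P1..P6 order."""
    return {p: HINTS[p] for p in sorted(set(detected))}


for pattern, hint in hints_for(["P4", "P1"]).items():
    print(pattern, hint)
```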
Following the above hints, we can build hardware accelera-
tors for each of the Gene-Patterns. As the “genes” of applica-
tions have already been found, highly customized accelerators
for each “gene” can be designed. When an application is ex-
ecuting, hardware can detect the “genes” of the application.
Then, the accelerators of the used genes are enabled and the
accelerators of the unused genes are disabled. In this manner,
the computer architecture adapts to the software diversity auto-
matically to harvest both performance and energy-efficiency.
Table 2: The pattern distribution of diverse benchmark suites
Suites         P1     P2     P3     P4     P5     P6
HPCC           0      0.25   0      0.75   0      0
HPCG           0      0.25   0.75   0      0      0
Graph500       1      0      0      0      0      0
SPECfp         0.47   0.42   0.02   0      0.09   0
MLPack         0      0.43   0      0      0      0.57
SPECint        0.04   0.56   0      0.03   0.11   0.26
PARSEC         0.17   0.64   0      0.18   0      0
BigDataBench   0.14   0.81   0.02   0.03   0      0
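As a sketch of how the enable/disable step described above could use such a distribution, the snippet below treats each row of Table 2 as a weight vector over the six base patterns. The weights are copied from Table 2; the function names and the optional cut-off `threshold` are hypothetical and not part of the proposed hardware.

```python
PATTERNS = ("P1", "P2", "P3", "P4", "P5", "P6")

# Weights copied from Table 2 (one row per benchmark suite).
TABLE2 = {
    "HPCC":         (0, 0.25, 0, 0.75, 0, 0),
    "HPCG":         (0, 0.25, 0.75, 0, 0, 0),
    "Graph500":     (1, 0, 0, 0, 0, 0),
    "SPECfp":       (0.47, 0.42, 0.02, 0, 0.09, 0),
    "MLPack":       (0, 0.43, 0, 0, 0, 0.57),
    "SPECint":      (0.04, 0.56, 0, 0.03, 0.11, 0.26),
    "PARSEC":       (0.17, 0.64, 0, 0.18, 0, 0),
    "BigDataBench": (0.14, 0.81, 0.02, 0.03, 0, 0),
}


def enabled_accelerators(weights, threshold=0.0):
    """Patterns whose weight exceeds `threshold`; the rest would be disabled."""
    return [p for p, w in zip(PATTERNS, weights) if w > threshold]


def dominant_pattern(weights):
    """Base pattern with the largest weight."""
    return PATTERNS[max(range(len(PATTERNS)), key=lambda i: weights[i])]


for suite, weights in TABLE2.items():
    print(suite, enabled_accelerators(weights), dominant_pattern(weights))
```

A non-zero `threshold` would trade coverage for energy by leaving rarely used accelerators off; whether that is worthwhile is not addressed in the text.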
6. Gene-Pattern aware Programming
• Dynamic data structures can change the data reuse period
and the working set.
• Abuse of pointers is harmful for memory access perfor-
mance, similar to the “go to” statement.
• The compound data structures usually have low memory
access locality.
• Accessing an array in column order rather than row order
lowers the access locality.
• Random access pattern is responsible for the inefficiency of
many hotspots.
As with hardware design, the development of software in-
cluding compilers and applications can also be inspired by
the Gene-Patterns, since the Gene-Patterns have deepened our
understanding about the impact of data structure and algorithm
on memory access patterns.
Table 3 shows abstract code examples of the base patterns.
Data structures indicate how the data is represented.
There are three fundamental types of static data structures:
array, record, and set. They constitute the building blocks
out of which more complex structures are formed. For the
dynamic structures, not only the values but also the structures
of variables are changed during the computation. Dynamic
structures include linked list, tree, and graph, where pointers
play an important role in accessing the data elements.
Dynamic data structures can change the data reuse period
and the working set of the memory accesses. For instance,
a hotspot in mcf from SPECint has pattern P5. The hotspot
traverses a linked list and inserts new elements into the list
in each iteration of the loop, so the linked list gradually
grows, which results in pattern P5.
Pointers are flexible for programming but hurt data
access locality. We find that another hotspot of mcf traverses
a heap, and its pattern is a combination of P4 and P5. A heap
is a special case of a complete binary tree. When traversing
a branch of the complete binary tree, the stride between two
consecutive accesses increases exponentially, which results
in pattern P4. The accessed addresses tend to
increase over time, which indicates an upward
trend in the number of elements in the heap. That trend
results in pattern P5.
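The exponential growth of the stride can be seen in a minimal sketch, assuming the heap is stored in an array with the children of element i at indices 2i+1 and 2i+2. That layout is an assumption for illustration; the text does not say how the heap in mcf is laid out, and the helper names are hypothetical.

```python
def root_to_leaf_path(n, go_right):
    """Indices visited when descending an array-based binary heap of n elements.

    `go_right` is a sequence of booleans, one per level (True = right child).
    """
    path = [0]
    i = 0
    for right in go_right:
        child = 2 * i + (2 if right else 1)
        if child >= n:
            break  # reached a leaf
        path.append(child)
        i = child
    return path


def strides(path):
    """Index distance between consecutive accesses."""
    return [b - a for a, b in zip(path, path[1:])]


path = root_to_leaf_path(1023, [False] * 9)  # always take the left child
print(path)
print(strides(path))
```

Along the leftmost branch the stride doubles at every level, so after a few levels consecutive accesses land in different cache lines and pages.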
We found that the compound data structures usually have
low memory access locality. For instance, when an array’s
index is an element of another array, the memory access would
have low locality. HPCG has pattern P3, where the correspond-
ing code is Loop A[B[j]]. The A[B[j]] is an indirect access of
array A, since the index of A is the element of array B.
Table 3: The abstract code examples of the base patterns
Pattern  Example Code                                          Note
P1       Loop i { Operate A[i] }                               stride of A[i] > c
P2       Loop i { Loop j { Operate A[j] } }                    stride of A[j] < d
P3       Loop i { Operate A[B[i]] }                            stride of B[i] > d
P4       Loop i { Operate A[random()] }                        A is large
P5       For i = 1 to n { For j = 1 to i { Operate A[j] } }    stride of A[j] > c
P6       Loop i { Loop j { Operate A[B[j]] } }                 stride of B[j] > d
*c is the size of a cache line (64B)
*d is the trigger threshold distance of the prefetch hardware
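The abstract loops of Table 3 can be turned into small index-trace generators, sketched below. Each function returns the sequence of indices of A that the loop would touch. The function names, the 0-based indexing, and the fixed random seed are choices made for this illustration, and the stride conditions in the Note column are not enforced.

```python
import random


def trace_p1(n, stride=1):
    """P1: a single loop over A[i]."""
    return [i * stride for i in range(n)]


def trace_p2(outer, inner, stride=1):
    """P2: the inner loop over A[j] repeats with a fixed period."""
    return [j * stride for _ in range(outer) for j in range(inner)]


def trace_p3(index_array):
    """P3: indirect access A[B[i]]; the trace is the content of B."""
    return list(index_array)


def trace_p4(n, size, seed=0):
    """P4: random indices into a large array A."""
    rng = random.Random(seed)
    return [rng.randrange(size) for _ in range(n)]


def trace_p5(n):
    """P5: the inner loop bound grows with i, so the period is variable."""
    return [j for i in range(1, n + 1) for j in range(i)]


def trace_p6(outer, index_array):
    """P6: indirect access A[B[j]] repeated with a fixed period."""
    return [b for _ in range(outer) for b in index_array]


print(trace_p2(2, 3))
print(trace_p5(3))
print(trace_p6(2, [0, 100, 200]))
```

Plotting such a trace with the access sequence on the horizontal axis and the index on the vertical axis gives a pattern figure of the kind shown in Fig. 12.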
We found that accessing an array in column order rather than row
order lowers the access locality. For data structures
whose implementations are based on arrays, the stride between
two consecutive accesses significantly influences the final
pattern. Different strides indicate different values of the slope,
k. Accessing arrays in column order usually occurs for the
hotspots in Graph500, which results in pattern P6. Since the
stride is large, k is also large.
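The effect of traversal order on the stride can be checked with a short sketch. It assumes a row-major layout, 8-byte elements, and a 4 × 1024 array; these sizes and the helper name are hypothetical. The 64-byte cache line matches the value of c given with Table 3.

```python
CACHE_LINE = 64  # bytes, the value of c given with Table 3


def byte_offsets(rows, cols, elem_size=8, order="row"):
    """Byte offsets touched when scanning a row-major rows x cols array."""
    if order == "row":
        return [(i * cols + j) * elem_size for i in range(rows) for j in range(cols)]
    if order == "column":
        return [(i * cols + j) * elem_size for j in range(cols) for i in range(rows)]
    raise ValueError(f"unknown order: {order!r}")


row = byte_offsets(4, 1024, order="row")
col = byte_offsets(4, 1024, order="column")

print("row-order stride:   ", row[1] - row[0], "bytes")
print("column-order stride:", col[1] - col[0], "bytes")
print("lines touched by first 8 row-order accesses:   ",
      len({a // CACHE_LINE for a in row[:8]}))
print("lines touched by first 4 column-order accesses:",
      len({a // CACHE_LINE for a in col[:4]}))
```

In row order, eight consecutive elements share one cache line; in column order, every access within a column lands in a different line, which corresponds to a large slope k.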
When random numbers are used as the index of an array, pattern
P4 occurs, which has the lowest locality among all the
patterns. We found that not only HPCC but also canneal from
PARSEC has the random access pattern. In canneal, a contiguous
memory region is allocated instead of many small memory regions. Thus, although
the memory access pattern is random access (pattern P4), it
still concentrates within a range of about 20MB.
Although the focus of our study is not on optimizing a
special application, we conducted software optimizations for
many benchmarks following the above analysis of Gene-
Patterns. For instance, we found that the importance of a
hotspot in K-means from BigDataBench is 71.37%. After our
optimization with prefetching and splitting, the location of
the hotspot in PT-MAP moves to the upper right corner. The
energy level transition is shown in Fig. 13. The total running
time of the benchmark is reduced from 523.99 s to 336.98 s.
That is, with a small number of code modifications, the total
application performance has been improved by 55% (a 1.55x speedup, i.e., a 35.7% reduction in running time).
7. Related Work
Venkatesh et al. [32] proposed Conservation Cores (c-cores),
which are specialized processors that focus on reducing energy
instead of increasing performance. The focus on energy makes
c-cores an excellent match for many applications that would
be poor candidates for hardware acceleration (e.g., irregular
integer codes). In contrast, our study is not only for reducing
energy, but also for improving performance, because even the
codes with irregular patterns can be accelerated.
In the work of PuDianNao [4], one could observe that the
variables in k-NN distance calculations are naturally clustered
into three classes according to the average reuse distances
(similar behaviors can be observed from K-Means). Therefore,
PuDianNao uses three separate on-chip buffers in its accelera-
tor, where each buffer stores the variables having similar reuse
distances. Our results corroborate these findings, and more
importantly, reveal the reasons behind the observations from
the Gene-Pattern perspective.
Simulation is an alternative way to conduct the characteriza-
tion [33]. In contrast, we run the applications in real machines
and use performance counters to obtain accurate metrics. It
is difficult to use software simulation methods to run each
benchmark completely, especially those with large footprints.
The speed of cycle-accurate simulators such as GEM5 is quite
low [14], with a typical slowdown relative to real execution
time on the order of $10^5$ to $10^6$. For instance, we have measured
the speed of GEM5, which is 50 to 500 KIPS (Kilo
Instructions Per Second), much lower than the peak speed
of Intel Xeon processors (about 8 GIPS). Even simulating a relatively
small program that takes one minute to execute requires
approximately one month to a year [34]. Our proposed
HOTS will help architects conduct simulation with
representative and concise inputs.
As opposed to the previous research which tends to charac-
terize only a special benchmark suite [7] [8], our study broadly
analyzes diverse applications from eight different suites, and
compares them in depth. Our focus is not on a special bench-
mark suite, but the base patterns of all the different suites,
which are the essential components of applications. To the
best of our knowledge, our study is the first work that covers
such a wide range of benchmark suites and extracts the base
patterns of applications.
Some existing research measures the average value of met-
rics on real machines to characterize the average behavior of
applications [7] [8]. For example, CloudSuite [35] performs
a 3-minute measurement after a warmup phase. Although the
average is a single value that is easy to present, the diversity of
the patterns in an application is hidden by the average. In our
work, we identify the application hotspots, which have a small
number of source lines but consume significant running time.
Running a whole application completely but characterizing it
only at the granularity of hotspots renders our work different
from most other studies.
Unlike CloudSuite [35] and BigDataBench [8], we propose
an evaluation method using Reuse aware Locality
(RaL) [13], pipeline stall degree, L2 and beyond active degree,
prefetch/request ratio and latency non-hidden degree.
These metrics present new perspectives for characterizing the
utilization of the micro-architectures.
The work in BigDataBench [8] finds that L3 caches of a
typical state-of-practice processor (Intel Xeon E5645) are effi-
cient for big data workloads. Our results not only corroborate
this finding, but also reveal other things with respect to the met-
ric, RaL. The on-chip cache hierarchy is highly effective for
most hotspots in BigDataBench, MLPack, PARSEC, SPECint,
and SPECfp. However, for some emerging workloads, like
Graph500, HPCC, HPCG, the on-chip cache effectiveness is
very low.
The RaL metric [13] we used is similar to the Roofline
model [36]. In the Roofline model, operational (arithmetic) intensity
is the number of floating-point operations per byte of
memory accesses. Roofline demonstrates that, if an applica-
tion is bounded by the peak computing speed, it is compute-
bound, while if it is bounded by the product of peak mem-
ory bandwidth and operational intensity, it is memory-bound.
Compared to operational intensity, the RaL metric is the num-
ber of L1 cache hits per off-chip memory access, which is
more focused on memory system performance.
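The two quantities compared above can be written down in a few lines. The Roofline bound follows the description in this paragraph, and RaL uses the definition given here (L1 cache hits per off-chip memory access). The machine parameters and counter values are hypothetical.

```python
def roofline_gflops(peak_gflops, peak_bw_gbs, operational_intensity):
    """Attainable performance: min(peak compute, peak bandwidth x intensity)."""
    return min(peak_gflops, peak_bw_gbs * operational_intensity)


def is_memory_bound(peak_gflops, peak_bw_gbs, operational_intensity):
    """True when the bandwidth term, not the compute peak, sets the bound."""
    return peak_bw_gbs * operational_intensity < peak_gflops


def ral(l1_hits, off_chip_accesses):
    """RaL as described in the text: L1 cache hits per off-chip memory access."""
    if off_chip_accesses == 0:
        return float("inf")
    return l1_hits / off_chip_accesses


# Hypothetical machine: 100 GFLOP/s peak compute, 50 GB/s peak memory bandwidth.
for oi in (0.5, 4.0):
    print(oi, roofline_gflops(100.0, 50.0, oi), is_memory_bound(100.0, 50.0, oi))
print(ral(9000, 100))
```

Operational intensity counts floating-point work per byte, whereas RaL counts on-chip hits per off-chip access, so RaL can also be computed for integer or pointer-chasing hotspots.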
The substitution effect between cache size and memory
bandwidth has been discussed in the REF work [37]. Our PT-
MAP is similar to the Cobb-Douglas Indifference Curves of
REF. In contrast, PT-MAP is closer to the essence of memory
access patterns. We replace cache size with RaL, and prefer L3
APC to memory bandwidth. PT-MAP is simple and effective,
and the energy level concept has been derived. The order of
the eight benchmark suites in terms of the pipeline stall degree
matches well with the order of their energy level curves.
8. Conclusions
As computers are widely applied in increasingly more areas,
fixed computer architecture contradicts the increasing diver-
sity of applications, resulting in the utilization wall and the
memory wall. In this study, we proposed a Recursive Reduce
methodology to obtain the Gene-Patterns of diverse applica-
tions. This methodology can be conducted in commercial pro-
cessors based on performance counters, thus it is much faster
than the methods based on simulators. Our study showed
that there exist significant differences among the applications
in different domains, but there also exist many similarities
from a pattern perspective, which makes the amount of base
patterns very limited. We have identified a set of six base
patterns. We found that inefficiency results from the mismatch
between some of the base patterns and the micro-architecture
of modern processors. We built a periodic table, where the indifference
curves are analogous to the energy levels in physics.
The location in the table quantifies the matching degree be-
tween a pattern and an architecture, and memory performance
optimization is essentially an energy level transition.
As we have already found the “genes” of applications (when
thinking of applications as “lives”), we could take these genes
as building blocks for the design of both architecture and soft-
ware. It is not necessary to develop customized architecture
for each application. The findings in our study will facilitate
accelerator design, and hybrid or heterogeneous system design.
These Gene-Patterns will straddle the divide between general-purpose
and special-purpose, providing a methodology for
matching application diversity with architecture fixity.
References
[1] Mitchell Waldrop. More than Moore. Nature, 530(1):144–147, 2016.
[2] Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton,
and Deborah K. Gracio. The changing paradigm of data-intensive
computing. Computer, 42(1):26–34, 2009.
[3] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia
Ailamaki. Toward dark silicon in servers. IEEE Micro, 31(4):6–15,
2011.
[4] Dao-Fu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan
Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen.
PuDianNao: A polyvalent machine learning accelerator. In ASPLOS,
pages 369–381. ACM, 2015.
[5] Cloyce D Spradling. SPEC CPU2006 benchmark tools. ACM SIGARCH
Computer Architecture News, 35(1):130–134, 2007.
[6] John L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCH
Computer Architecture News, 34(4):1–17, 2006.
[7] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li.
The PARSEC benchmark suite: Characterization and architectural
implications. In International Conference on Parallel Architectures and
Compilation Techniques, pages 72–81, 2008.
[8] Lei Wang, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona
Li, Bizhu Qiu, Jianfeng Zhan, Chunjie Luo, and Yuqing Zhu.
BigDataBench: A big data benchmark suite from internet services. pages
488–499, 2014.
[9] Ryan R Curtin, James R Cline, N. P Slagle, William B March, Parikshit
Ram, Nishant A Mehta, and Alexander G Gray. MLPACK: A scalable
C++ machine learning library. Journal of Machine Learning Research,
14(1):801–805, 2012.
[10] Piotr R Luszczek, David H Bailey, Jack J Dongarra, Jeremy Kepner,
Robert F Lucas, Rolf Rabenseifner, and Daisuke Takahashi. The HPC
Challenge (HPCC) benchmark suite. In ACM/IEEE SC 2006 Conference
on High Performance Networking and Computing, November
11-17, 2006, Tampa, FL, USA, page 213, 2006.
[11] Michael A Heroux, Jack Dongarra, and Piotr Luszczek. HPCG
benchmark technical specification. 2013.
[12] Graph500. http://www.graph500.org/. Accessed July 4, 2017.
[13] Yuhang Liu and Xian He Sun. Cal: Extending data locality to consider
concurrency for performance optimization. IEEE Transactions on Big
Data, 5(2):273–288, 2017.
[14] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Rein-
hardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R Hower,
Tushar Krishna, Somayeh Sardashti, et al. The gem5 simulator. ACM
SIGARCH Computer Architecture News, 39(2):1–7, 2011.
[15] Jack Dongarra, Kevin London, Shirley Moore, Phil Mucci, and Dan
Terpstra. Using PAPI for hardware performance monitoring on Linux
systems. 2001.
[16] Intel Corporation. Intel® 64 and IA-32 Architectures Optimization
Reference Manual. Number 248966-033. June 2016.
[17] L Adhianto, S Banerjee, M Fagan, M Krentel, G Marin, J Mellor-Crummey,
and N. R Tallent. HPCToolkit: Tools for performance analysis of
optimized parallel programs. Concurrency & Computation Practice &
Experience, 22(6):685–701, 2010.
[18] Martin Burtscher, Byoung Do Kim, Jeff Diamond, John McCalpin,
Lars Koesterke, and James Browne. PerfExpert: An easy-to-use performance
diagnosis tool for HPC applications. In High Performance
Computing, Networking, Storage and Analysis, pages 1–11, 2010.
[19] Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi
Devor, Kim Hazelwood, Aamer Jaleel, Chi Keung Luk, Gail Lyons,
and Harish Patil. Analyzing parallel programs with Pin. Computer,
43(3):34–41, 2010.
[20] Dawei Wang and Xianhe Sun. APC: A novel memory metric and
measurement methodology for modern memory systems. IEEE
Transactions on Computers, 63(7):1626–1639, 2014.
[21] Yutao Zhong, Xipeng Shen, and Chen Ding. Program locality analysis
using reuse distance. ACM Transactions on Programming Languages
and Systems (TOPLAS), 31(6):20, 2009.
[22] Michael Jason Cade and Apan Qasem. Balancing locality and parallelism
on shared-cache multi-core systems. In IEEE International
Conference on High Performance Computing and Communications,
pages 188–195, 2009.
[23] Yingmin Li, B. Lee, D. Brooks, Zhigang Hu, and K. Skadron.
CMP design space exploration subject to physical constraints. In
The Twelfth International Symposium on High-Performance Computer
Architecture, pages 17–28, 2006.
[24] Yunlian Jiang, Eddy Z Zhang, Kai Tian, and Xipeng Shen. Is reuse
distance applicable to data locality analysis on chip multiprocessors?
In Compiler Construction, pages 264–282. Springer, 2010.
[25] Wm A Wulf and Sally A McKee. Hitting the memory wall: Implications
of the obvious. ACM SIGARCH Computer Architecture News,
23(1):20–24, 1995.
[26] Alain Kägi, James R Goodman, and Doug Burger. Memory bandwidth
limitations of future microprocessors. In 23rd Annual International
Symposium on Computer Architecture, pages 78–78. IEEE, 1996.
[27] Brian M Rogers, Anil Krishna, Gordon B Bell, Ken Vu, Xiaowei
Jiang, and Yan Solihin. Scaling the bandwidth wall: Challenges in and
avenues for CMP scaling. In ACM SIGARCH Computer Architecture
News, volume 37, pages 371–382. ACM, 2009.
[28] Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut
Kandemir, and Thomas Moscibroda. Reducing memory interference
in multicore systems via application-aware memory channel
partitioning. In IEEE/ACM International Symposium on Microarchitecture,
pages 374–385, 2011.
[29] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely,
and Joel Emer. Adaptive insertion policies for high performance
caching. In International Symposium on Computer Architecture, pages
381–391, 2007.
[30] Yuan Chou, Brian Fahs, and Santosh Abraham. Microarchitecture opti-
mizations for exploiting memory-level parallelism. In ACM SIGARCH
Computer Architecture News, volume 32, page 76. IEEE Computer
Society, 2004.
[31] Intel Corporation. Memory latency checker.
https://software.intel.com/en-us/articles/intelr-memory-latency-checker.
Accessed July 4, 2017.
[32] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia,
Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and
Michael Bedford Taylor. Conservation cores: Reducing the energy of
mature computations. ACM SIGARCH Computer Architecture News,
38(1):205–218, 2010.
[33] Nick Barrow-Williams, Christian Fensch, and Simon Moore. A
communication characterisation of SPLASH-2 and PARSEC. In IEEE
International Symposium on Workload Characterization, pages 86–97, 2009.
[34] E. Ipek, S. A. McKee, B. R. Supinski, M. Schulz, and R. Caruana.
Efficiently exploring architectural design spaces via predictive
modeling. In ACM International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), pages
195–206, 2006.
[35] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos,
Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel
Popescu, Anastasia Ailamaki, and Babak Falsafi. Clearing the clouds:
a study of emerging scale-out workloads on modern hardware. In
Seventeenth International Conference on Architectural Support for
Programming Languages and Operating Systems, pages 37–48, 2012.
[36] Samuel Williams, Andrew Waterman, and David Patterson. Roofline:
an insightful visual performance model for multicore architectures.
Communications of the ACM, 52(4):65–76, 2009.
[37] Seyed Majid Zahedi and Benjamin C. Lee. REF: Resource elasticity
fairness with sharing incentives for multiprocessors. In International
Conference on Architectural Support for Programming Languages and
Operating Systems, pages 145–160, 2014.
