University of Texas at El Paso

DigitalCommons@UTEP
Open Access Theses & Dissertations

2014-01-01

Platform-Independent Data Locality Analysis To
Predict Cache Performance On Abstract Hardware
Platforms
Sonish Shrestha
University of Texas at El Paso, sonish.shrestha@gmail.com

Recommended Citation
Shrestha, Sonish, "Platform-Independent Data Locality Analysis To Predict Cache Performance On Abstract Hardware Platforms"
(2014). Open Access Theses & Dissertations. 1733.
https://digitalcommons.utep.edu/open_etd/1733


PLATFORM-INDEPENDENT DATA LOCALITY ANALYSIS TO PREDICT
CACHE PERFORMANCE ON ABSTRACT HARDWARE PLATFORMS

SONISH SHRESTHA
Department of Computer Science

APPROVED:

Shirley V Moore, Ph.D., Chair

Patricia J. Teller, Ph.D.

Michael McGarry, Ph.D.

Bess Sirmon-Taylor, Ph.D.
Interim Dean of the Graduate School

Copyright ©

by
Sonish Shrestha
2014

To my
FAMILY and FRIENDS
with love

PLATFORM-INDEPENDENT DATA LOCALITY ANALYSIS TO PREDICT
CACHE PERFORMANCE ON ABSTRACT HARDWARE PLATFORMS
by

SONISH SHRESTHA

THESIS

Presented to the Faculty of the Graduate School of
The University of Texas at El Paso
in Partial Fulfillment
of the Requirements
for the Degree of

MASTER OF SCIENCE

Department of Computer Science
THE UNIVERSITY OF TEXAS AT EL PASO
May 2014

Acknowledgements

My thesis, "PLATFORM-INDEPENDENT DATA LOCALITY ANALYSIS TO PREDICT
CACHE PERFORMANCE ON ABSTRACT HARDWARE PLATFORMS", would have been
incomplete without the supervision and utmost support of my respected mentor, Dr. Shirley V.
Moore. I would therefore like to express my sincere gratitude to Dr. Moore for her
expert suggestions, rigorous support, and availability in spite of her busy schedule.
I would like to thank Dr. Patricia J. Teller and Dr. Michael McGarry for serving as
my committee members.
I would like to thank Ananta Tiwari, a member of the CoDAASH team, who helped me
understand the reuse distance analysis tool from SDSC. Similarly, I would like to thank
Gabriel Marin, Ashay Rane, and Leonardo Fialho, who helped me understand their tools,
MACPO and PerfExpert, and helped me use them throughout my project. I
would also like to thank my friend Vairavan Mani, who helped me set up the simulators.
I would like to thank the United States Air Force Office of Scientific Research for funding
this project under AFOSR award number FA9550-12-1-0476.
I would like to extend my special appreciation to UTEP for giving me the opportunity
to attend the Master's program in Computer Science. I will always be indebted to UTEP for
providing me such a great platform on which to build my career.
Last but not least, I would like to thank all my friends and family for their
continuous support, understanding, and encouragement.

Abstract
This research is part of a co-design project that has the goal of designing hardware
systems to match application requirements and efficiently mapping applications to hardware.
This thesis focuses on optimizing the configuration of the platform cache memory hierarchy. To
determine application requirements, we characterize the application using platform-independent
locality metrics. Next, we use the locality data and an analytical model to predict
the cache performance of sequential versions of application codes for various
cache configurations. After using the analytical model to select a candidate set of cache
memory hierarchy configurations, we use architectural simulation to test the selection for
the targeted systems.

Table of Contents
Acknowledgements
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
1.1 Context
1.2 Problem Description
1.3 Methodology
Chapter 2: Background
2.1 Overview
2.2 Memory Hierarchy
2.3 Locality of References
2.3.1 Temporal Locality
2.3.2 Spatial Locality
2.4 Cache Mapping
2.4.1 Direct Mapped Cache
2.4.2 Fully Associative Cache
2.4.3 Set Associative Cache
2.5 Effect of Cache Performance on Application Performance
2.6 Hardware Prefetching
Chapter 3: Reuse Distance Analysis
3.1 PerfExpert Version 4.1.1
3.2 Reuse Distance Analysis tool from PMaC (Performance Modeling and Characterization) labs (Pre-released version): PEBIL, PIN
3.3 MIAMI (Pre-released version): Machine Independent Application Models for Performance Insight
3.4 MACPO (Memory Access Characterization for Performance Optimization) (Pre-released version)
Chapter 4: Analytical Modeling [18]
Chapter 5: Setup and Analysis of Architectural Simulators to Verify the Predicted Cache Performance
5.1 ESESC (Enhanced Super ESCalar simulator)
5.2 MacSim
5.3 SST with gem5
5.4 GPUWattch + GPGPU-Sim
Chapter 6: Results
6.1 Experimental Results and Analysis
6.2 Roofline Model for LULESH Version 2
Chapter 7: Conclusion
References
Curriculum Vitae

List of Tables

Table 3.2.1: Cache description file from SDSC.
Table 5.2.1: MacSim architecture-specific files.
Table 6.1.1: LULESH v2.0 Reuse distance statistics from SDSC Pin version.
Table 6.1.2: LULESH v2.0 Reuse distance statistics from MIAMI.
Table 6.1.3: CoMD Reuse distance statistics from MIAMI.
Table 6.1.4: CoMD Reuse distance statistics from SDSC Pin version.
Table 6.1.5: Probabilistic model results for LULESH v2.0 for number of sets = 64.
Table 6.1.6: Probabilistic model results for LULESH v2.0 for number of sets = 512.
Table 6.1.7: Probabilistic model results for LULESH v2.0 for number of sets = 8192.
Table 6.1.8: Probabilistic model results for CoMD for number of sets = 64.
Table 6.1.9: Probabilistic model results for CoMD for number of sets = 512.
Table 6.1.10: Probabilistic model results for CoMD for number of sets = 8192.
Table 6.2.1: Memory bandwidth with different numbers of threads for Intel E5-2680.

List of Figures

Figure 1.3.1: Flowchart of the methodology.
Figure 2.2.1: Memory Hierarchy.
Figure 2.4.1: Single cache memory block.
Figure 2.4.2: Organization of cache memory.
Figure 2.4.1.1: Memory address for direct mapped cache.
Figure 2.4.2.1: Memory address for fully associative cache.
Figure 2.4.3.1: Memory address for set associative cache.
Figure 2.4.3.2: Organization of cache for set associative cache.
Figure 3.1: Reuse distance illustration.
Figure 3.1.1: Sample experimental output of PerfExpert.
Figure 3.1.2: Sample recommendation output of PerfExpert.
Figure 3.3.1: MIAMI Architecture.
Figure 3.3.2: Sample MIAMI tool output.
Figure 3.3.3: Sample memreuse output from MIAMI.
Figure 3.4.1: Sample output of MACPO.
Figure 5.1.1: Sample output of ESESC.
Figure 5.2.1: Memory statistics for the Sphinx benchmark with MacSim.
Figure 5.3.1: SST configuration.
Figure 5.3.2: M5 configuration.
Figure 5.3.3: Power XML.
Figure 5.3.4: gem5 with power modeling output.
Figure 6.1.1: Validation graph between MIAMI and SDSC for LULESH v2.
Figure 6.1.2: Validation graph between MIAMI and SDSC for CoMD.
Figure 6.1.3: Miss rate vs. associativity for LULESH v2 for number of sets = 64.
Figure 6.1.4: Miss rate vs. associativity for LULESH v2 for number of sets = 512.
Figure 6.1.5: Miss rate vs. associativity for LULESH v2 for number of sets = 8192.
Figure 6.1.6: Miss rate vs. associativity for CoMD for number of sets = 64.
Figure 6.1.7: Miss rate vs. associativity for CoMD for number of sets = 512.
Figure 6.1.8: Miss rate vs. associativity for CoMD for number of sets = 8192.
Figure 6.1.9: ESESC result with the original Sandy Bridge configuration except for L3 configuration.
Figure 6.1.8: ESESC result with our predicted configuration for LULESH version 2.
Figure 6.2.1: Maximum bandwidth graph from STREAM benchmark for Intel E5-2680.
Figure 6.2.2: Roofline Model for Intel Xeon E5-2680 marked with the LULESHv2 benchmark.

Chapter 1
Introduction
This thesis describes an effort to predict the optimal cache configurations for different
applications. This prediction is carried out using several tools. Different applications need
different hardware characteristics to perform at their best. It is our goal to give the guidance
needed to satisfy these requirements, especially for cache configuration.

1.1 Context
This thesis is part of an effort to design hardware that matches the requirements of the
computational chemistry and physics algorithms important to the materials science problems
of interest. The overall project is called CoDAASH (Co-design Approach for Advances in
Software and Hardware) [1]. Our specific contribution to this project is to identify the
optimal cache memory configuration for a given application, such as a physics or chemistry code.

1.2 Problem Description
Making computations faster remains one of the biggest challenges in computing. Some
applications take much longer than the optimal time to compute their results, which
may be due to sub-optimal coding or to a mismatch between the algorithm and the computer
hardware. The problem, therefore, is to design hardware on which an application can run
optimally.

1.3 Methodology
Our first step in solving this problem is to evaluate tools for obtaining platform-independent
memory locality metrics and to validate their results. We have evaluated the
PMaC locality measurement tool from SDSC (San Diego Supercomputer Center), the
MACPO (Memory Access Characterization for Performance Optimization) [2] data access
analysis tool from TACC (Texas Advanced Computing Center), and MIAMI (Machine
Independent Application Models for Performance Insight) from the University of Tennessee
at Knoxville [3]. We also use the PAPI (Performance Application Programming
Interface) v5.3.0.0 [4] hardware event counter library to sanity check the results returned by
these tools.
The PMaC locality measurement tool instruments the application using
PEBIL, a static binary instrumentation toolkit for x86/Linux [5], in order to measure reuse distances
and strides for data accesses. It computes reuse distances per basic block as well as per data
structure. The tool approximates the reuse distance measurement by measuring only
reuses that fall within a window, whose size is a user-settable
parameter. Working from the application source code, the MACPO data access analysis tool
reports the reuse distance per non-scalar variable and the strides with which these data structures
are accessed. Annotating the application source code with PAPI calls reports counts of different
events related to cache memory, e.g., cache hits and cache misses [6]. MIAMI works on x86-64
application binaries rather than source code to obtain the reuse distance metric. It works on top of
Pin to determine the relative importance of different parts of the program and to capture the
application’s dynamic data reuse [3].
First, to validate the results reported by each tool, we wrote simple matrix and blocked
matrix multiplication benchmark codes for which we know the expected reuse distances and
strides. In addition to collecting the locality data, we collected cache miss data using PAPI.
Although cache misses are a platform-dependent metric, the PMaC tool reports cold misses,
and we can also vary the window size of this tool to simulate a fully-associative cache.
We adjusted the results to compensate for the different ways in which the tools work; for
example, the PMaC and MIAMI tools report results in terms of memory words, while
MACPO reports results in terms of cache lines.
Our second step was to use locality data to predict the cache performance of
sequential versions of the LULESH v2 [7] and CoMD v1.1 [8] codes for various cache
configurations. The first application code we worked with was the LULESH benchmark,
which serves as a proxy code for full shock physics applications such as CTH and ALEGRA.
Once we collected the reuse distance data, we used a straightforward analytical model to
predict cache misses for a fully-associative cache of a given size, and we used a probabilistic
model to predict cache misses for a set-associative cache.
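As an illustration only, the following Python sketch shows one way a reuse distance histogram could be turned into miss predictions. The fully-associative rule (an access misses when its reuse distance is not smaller than the cache capacity) follows the description in this thesis. The set-associative part uses a common binomial formulation, in which each intervening block is assumed to land in the same set independently with probability 1/S; this is an assumption of the sketch and is not necessarily the probabilistic model used in this work. All function names, and the choice of measuring distances in cache blocks, are hypothetical.

```python
def fully_assoc_misses(hist, capacity_blocks, cold_misses=0):
    """Predicted misses for a fully-associative LRU cache.

    hist maps a reuse distance (in cache blocks) to the number of
    accesses observed with that distance; first-time accesses are
    passed separately as cold_misses.
    """
    return cold_misses + sum(n for d, n in hist.items() if d >= capacity_blocks)


def set_assoc_hit_prob(d, num_sets, assoc):
    """Probability that an access with reuse distance d hits.

    Assumption: each of the d intervening distinct blocks maps to the
    same set with probability 1/num_sets, independently of the others.
    The access hits if fewer than `assoc` of them land in that set.
    """
    if num_sets == 1:                      # degenerates to fully associative
        return 1.0 if d < assoc else 0.0
    p = 1.0 / num_sets
    term = (1.0 - p) ** d                  # probability that k = 0 blocks collide
    total = 0.0
    for k in range(min(assoc, d + 1)):
        total += term
        # Binomial recurrence: P(k+1) = P(k) * (d-k)/(k+1) * p/(1-p)
        term *= (d - k) / (k + 1) * p / (1.0 - p)
    return min(total, 1.0)


def set_assoc_expected_misses(hist, num_sets, assoc, cold_misses=0):
    """Expected number of misses for a set-associative cache."""
    return cold_misses + sum(
        n * (1.0 - set_assoc_hit_prob(d, num_sets, assoc)) for d, n in hist.items()
    )


# Made-up histogram, used only to exercise the functions.
example_hist = {0: 5, 2: 3, 8: 2}
print(fully_assoc_misses(example_hist, capacity_blocks=4, cold_misses=3))
print(set_assoc_expected_misses(example_hist, num_sets=64, assoc=2))
```

Sweeping num_sets and assoc over a grid of candidate values with such a function gives one miss estimate per cache configuration, which is the kind of output the analytical step is meant to produce.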
Our applications needed to run in parallel mode to scale to realistic problem sizes.
Predicting cache behavior for a thread-parallel program running on a multicore system is
complicated; thus our predictions are only approximate. To accurately evaluate and select
optimal cache memory and memory bandwidth configurations without building the
actual hardware, we used architectural simulators. Hence, we evaluated architectural
simulators to select the one(s) most suitable for use in our project. The simulators we
considered include MacSim, SST, gem5, GPGPU-Sim, and ESESC.
Other participants in the CoDAASH project are designing the abstract hardware
platforms that are to be co-designed along with the application software to be optimized for
these platforms. The initial abstract platform to be evaluated is a multicore cluster with
attached GPUs. For our part of the project, we focus on selecting the cache memory
configuration of the multicore system that best matches the data access patterns of the
application. We first use an analytical model based on the platform-independent locality
metrics described above to select a candidate set of cache configurations. We then use
architectural simulation to refine our selection.
We report the results of our evaluations of locality measurement tools and
architectural simulators. We also describe results from our characterization of the LULESH
benchmark and predictions of its cache memory performance for various cache memory
configurations. The novel contribution of our work is using platform-independent data access
analysis for systematic hardware-software co-design and co-optimization for a given class of
applications, rather than just for performance optimization for existing architectures.


Figure 1.3.1: Flowchart of the methodology. The steps, from Start to End, are:
1. Collect the reuse distance results produced by the tools and validate that they are correct.
2. Implement the probabilistic model and feed the reuse distance results into it; the model gives
the total number of cache misses for a given number of sets and a given associativity.
3. Validate the results produced by the probabilistic model against the results produced by PAPI
and ESESC.
4. Feed the predicted configuration into ESESC and check whether it yields a performance gain.
5. Analyze other possible hardware tuning, i.e., memory bandwidth.
The remainder of this thesis is organized as follows. Chapter 2 gives background on cache
memory and data locality. In Chapter 3, we discuss reuse distance and the various tools that
we used to obtain the reuse distance and other useful metrics. In Chapter 4, we show how we
built the analytical model and explain the underlying theory. Chapter 5 describes various
simulators and how we tuned them, and presents some sample outputs from these simulators.
In Chapter 6, we present and analyze our results from the different tools and the analytical
model. Lastly, Chapter 7 presents our conclusions.

Chapter 2
Background
2.1 Overview
This thesis uses the concepts of cache memory, cache levels, spatial and
temporal locality, cache configuration, and cache associativity. We describe all of them in
detail in this chapter.
Cache memory is a small, fast memory on or near the processor that keeps copies of
data from frequently used main memory locations. To explain how cache memory
works, we use a popular library example [23].
A librarian has a backpack that can hold up to 10 books (i.e., a 10-book cache). When
the day starts, the backpack is empty, so when a student asks for a book, say “The
Hunger Games”, the librarian will not find it in his backpack (this is called a compulsory
cache miss, since it is the first time the book has been requested). The librarian has to go to the
storeroom to collect that book. Later the student returns the book and the librarian puts it in his
backpack. When another student asks for the same book, the librarian does not have to go to
the storeroom to collect it because he has it in his backpack (this is called a cache hit). The
process is much faster when the librarian finds the book in his backpack. However, if ten or more other
books are requested between the two requests for “The Hunger Games”, then “The Hunger
Games” will no longer be in the backpack (assuming the librarian keeps the ten most recently
requested books in his backpack), and the librarian will have to collect it from the storeroom
again (this is called a capacity cache miss). This example illustrates how cache
memory works. The next section discusses how memory is organized.
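The librarian example can be written as a small Python sketch. The function name simulate_backpack is hypothetical, and the sketch assumes, as the example does, that the backpack always holds the most recently requested books (a least-recently-used replacement policy).

```python
from collections import OrderedDict


def simulate_backpack(requests, capacity=10):
    """Classify each book request as a hit, compulsory miss, or capacity miss."""
    backpack = OrderedDict()   # least recently requested book comes first
    seen = set()               # every book requested so far today
    counts = {"hit": 0, "compulsory_miss": 0, "capacity_miss": 0}
    for book in requests:
        if book in backpack:
            counts["hit"] += 1
            backpack.move_to_end(book)         # now the most recently requested
        else:
            # First request ever: compulsory miss; otherwise it was evicted earlier.
            kind = "capacity_miss" if book in seen else "compulsory_miss"
            counts[kind] += 1
            seen.add(book)
            backpack[book] = True
            if len(backpack) > capacity:
                backpack.popitem(last=False)   # drop the least recently requested book
    return counts


# Ten other books requested between the two requests for the same title.
others = [f"book {i}" for i in range(10)]
print(simulate_backpack(["The Hunger Games"] + others + ["The Hunger Games"]))
```

With only nine other books in between, the second request for the same title is a hit instead, because the title is still among the ten most recently requested books.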

2.2 Memory Hierarchy
The memory hierarchy in a modern computer is organized in terms of capacity, speed, and
cost, as shown below.

         CPU registers   L1 cache   L2 cache   L3 cache   Memory      Disk storage
Size     1000 bytes      64 KB      256 KB     2-4 MB     4-16 GB     4-16 TB
Speed    300 ps          1 ns       3-10 ns    10-20 ns   50-100 ns   5-10 ms
(Disk storage is connected to memory through the I/O bus.)
Figure 2.2.1: Memory Hierarchy [25]
Here the levels are arranged from left to right in order of decreasing speed and increasing
capacity. The cache closest to the processor is the L1 cache, and retrieving data from it takes
less than 2 ns. The L1 cache is therefore the fastest cache, and speed decreases as we go from left to
right in Figure 2.2.1. Similarly, L1 has the lowest capacity of the caches, and capacity
increases as we go from left to right in Figure 2.2.1. With inclusive caches, the
higher-level (larger, slower) caches maintain copies of the data held in the lower levels. Memory
hierarchies exploit locality by caching data that is likely to be used again. We discuss
locality of reference next.

2.3 Locality of References
There are two types of locality of reference:
• Temporal Locality
• Spatial Locality
2.3.1 Temporal Locality
Temporal locality is the tendency to access locations that have been recently referenced.
This means that a resource referenced at one point in time is likely to be referenced again
in the near future. Sources of temporal locality include code within a loop, the same instructions
being fetched repeatedly, and so on. Temporal locality is sensitive to cache size: with good temporal
locality, increasing the cache size usually decreases the miss rate.

2.3.2 Spatial Locality
Spatial locality is the tendency to reference locations near recently referenced
locations. This means that a resource is more likely to be referenced if a resource
near it was just referenced. Sources of spatial locality include data arrays accessed in a regular
pattern, local variables on the stack, data allocated in chunks, and so on. Spatial locality is sensitive to
cache line size and to prefetching effectiveness.
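To make the sensitivity to cache line size concrete, the following sketch counts how many distinct cache lines a strided sweep over an array touches. The function name and the default sizes (8-byte elements, 64-byte lines) are assumptions chosen for illustration, not measurements from this work.

```python
def lines_touched(num_accesses, stride_elems, elem_bytes=8, line_bytes=64):
    """Number of distinct cache lines touched by a constant-stride sweep.

    Access i reads the element at byte offset i * stride_elems * elem_bytes
    from the start of an array that is assumed to be line-aligned.
    """
    lines = {(i * stride_elems * elem_bytes) // line_bytes for i in range(num_accesses)}
    return len(lines)


# 64 accesses with unit stride share lines; 64 accesses with stride 8 do not.
print(lines_touched(64, stride_elems=1))   # 8 lines: eight elements per line
print(lines_touched(64, stride_elems=8))   # 64 lines: one element per line
```

Under these assumptions the same number of accesses needs eight times as many line fetches when the stride equals the number of elements per line, which is why good spatial locality depends on both the access pattern and the line size.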
2.4 Cache Mapping
There are three ways to map memory blocks to cache blocks; they are explained
below.
We first explain the organization of the cache. The cache is divided into some number of
blocks, each of which holds a tag field, the block data, and a valid bit, as shown below.
| Tag | Block data | Valid |
Figure 2.4.1 Single cache memory block
The cache contains many such blocks, as shown below.
| Tag | Block data | Valid |
| Tag | Block data | Valid |
| ... | ...        | ...   |
| Tag | Block data | Valid |
Figure 2.4.2 Organization of cache memory
2.4.1 Direct Mapped Cache
This is the simplest kind of cache. In a direct mapped cache, assume that the memory
address has d bits and is divided into three fields: tag, index, and offset.
| Tag: (d-b-c) bits | Index: c bits | Offset: b bits |
Figure 2.4.1.1 Memory address for direct mapped cache

Here the index bits determine which cache block to address, the offset bits select the
byte within that block, and the tag bits are compared with the tag field stored in that cache
block to check that it holds the data we are accessing. In direct mapping, each memory
block maps to exactly one cache block location. The main drawback of direct mapping is that
memory blocks mapping to the same cache block evict one another, while
cache blocks that are never mapped to are wasted.
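The address split for a direct mapped cache can be sketched as follows. The function name is hypothetical, and the example values (c = 7 index bits, b = 6 offset bits) are arbitrary choices for illustration.

```python
def split_address(addr, index_bits, offset_bits):
    """Split a memory address into (tag, index, offset) for a direct mapped cache."""
    offset = addr & ((1 << offset_bits) - 1)                # low b bits: byte within the block
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # next c bits: which cache block
    tag = addr >> (offset_bits + index_bits)                # remaining high bits
    return tag, index, offset


# Two addresses that are 2**(c + b) bytes apart share an index but differ in tag,
# so in a direct mapped cache they compete for the same cache block.
print(split_address(0x12345, index_bits=7, offset_bits=6))          # (9, 13, 5)
print(split_address(0x12345 + (1 << 13), index_bits=7, offset_bits=6))  # (10, 13, 5)
```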
2.4.2 Fully Associative Cache
In a fully associative cache, the memory address is divided into only two fields, the
tag and the offset, and a memory block can map to any cache block.
| Tag | Offset | (d bits in total)
Figure 2.4.2.1 Memory address for fully associative cache
This means that a lookup compares the tag of the address with the tags of all the blocks in the
cache. If a tag matches and its valid bit is 1, the lookup is a hit: the block holds the data the
processor is looking for. The advantage of a fully associative cache is that any memory block can
map to any cache block, which gives a better hit rate. The disadvantage is that, with no
index field, every tag must be compared, which makes the hardware more complex than direct
mapping.
2.4.3 Set Associative Cache
The set associative cache is a mixture of the direct mapped cache and the fully
associative cache. Instead of the index bits of a direct mapped
cache, a set associative cache uses set index bits, which select the one set to be searched.
| Tag: (d-b-s) bits | Set: s bits | Offset: b bits |
Figure 2.4.3.1 Memory address for set associative cache
The cache memory is organized as described above; the only
difference is that the blocks are grouped into sets.
Set 0:        | Tag | Block data | Valid |
              | ... | ...        | ...   |
              | Tag | Block data | Valid |
Set 1:        | Tag | Block data | Valid |
              | ... | ...        | ...   |
              | Tag | Block data | Valid |
...
Set 2^s - 1:  | Tag | Block data | Valid |
              | ... | ...        | ...   |
              | Tag | Block data | Valid |
Figure 2.4.3.2 Organization of cache for set associative cache
As shown in the figure, the set bits of the address determine which set to search,
so the tag needs to be compared only with the tags in that particular set.
A given memory block thus maps to a single set but can occupy any block (row) within that set.
In a 2-way set associative cache there are 2 blocks in each set, in a 4-way
set associative cache there are 4 blocks in each set, and so on.
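The following minimal simulator sketches the lookup described above. The class name, the 64-byte line size, and the least-recently-used replacement policy are assumptions of this sketch; real caches may use other policies. With ways = 1 it behaves as a direct mapped cache, and with num_sets = 1 as a fully associative cache.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set associative cache with LRU replacement inside each set."""

    def __init__(self, num_sets, ways, line_bytes=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]  # tag -> True, LRU first
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Access one byte address; return True on a hit, False on a miss."""
        block = addr // self.line_bytes      # drop the offset bits
        index = block % self.num_sets        # set index
        tag = block // self.num_sets         # remaining high bits
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)           # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        lines[tag] = True
        if len(lines) > self.ways:
            lines.popitem(last=False)        # evict the least recently used block
        return False


# Same capacity (4 blocks), different associativity, same conflicting access pattern.
direct = SetAssociativeCache(num_sets=4, ways=1)
two_way = SetAssociativeCache(num_sets=2, ways=2)
for addr in [0, 256] * 4:
    direct.access(addr)
    two_way.access(addr)
print(direct.hits, direct.misses)     # 0 8: the two blocks keep evicting each other
print(two_way.hits, two_way.misses)   # 6 2: both blocks fit in one 2-way set
```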
2.5 Effect of Cache Performance on Application Performance
An appropriate cache size and associativity for a particular application help to improve
its performance. Increasing the cache size often helps, but increasing it beyond the point
where it still yields a significant benefit is pointless and a waste of money. A larger cache
has fewer conflict and capacity misses, while a smaller cache has a
higher conflict and capacity miss rate. The number of blocks in each set, that is, the
associativity discussed above, also affects application performance.
With higher associativity, the tag must be compared with
more tag fields, so the lookup becomes more complex, as in a fully associative cache, and the cache’s latency increases. With
lower associativity, conflict misses are more likely. An appropriate size
and associativity are therefore needed for an application to perform at its best. This is what our work
concentrates on.
2.6 Hardware Prefetching [26]
We now briefly discuss prefetching, a technique for hiding memory latency.
There are two types of prefetching: software prefetching and hardware
prefetching. Prefetching means fetching data before it is needed, so that
the processor does not stall waiting for that data. Here we
mainly discuss hardware-supported prefetching.
A hardware prefetcher predicts which memory locations will be needed in the future by
continuously monitoring which memory addresses the processor accesses. A
prefetcher works best if accesses have constant-stride patterns, such as accessing an array in a
loop, and does not work well with random accesses. Using regular strides as much as possible
allows the prefetcher to work well. Prefetching also has a drawback: if it fetches
data that the processor does not need, it reduces performance because it wastes
memory bandwidth and occupies cache storage with useless data.
We have not included an analysis of prefetching in our analytical cache model. Since
prefetching helps hide memory latency and prefetched items may not count as cache misses, our
model may be somewhat inaccurate and overestimate cache misses. Simulation of prefetching is
not yet available in the processor and memory system simulator we used, so the simulation
results may likewise overestimate the effect of cache misses.

Chapter 3
Reuse Distance Analysis
Cache memory is an important component of a system for achieving good performance for most
applications. Its capacity is very small and its speed is high compared to other levels of
memory. It is fast because it is the memory closest to the processor (speed decreases as the
distance from the processor increases), and it stores the most frequently used data so that the
processor does not have to fetch them from slower memory again and again. For many scientific
applications, performance on any system is often determined by cache memory performance, which in
turn is determined by the data locality of the program. By analyzing this locality, we attempt to
predict the best possible cache configuration for a particular program.
Reuse distance is the concept we use to analyze the data locality of a program. It does not
depend on the architecture. Data reuse is a main determinant of cache performance because all
cache reuse comes from reuse of the same or adjacent data, regardless of the organization of the
cache [10]. “Reuse distance separates program-specific factors from machine-specific factors”
[10]. The reuse distance of a reference is the number of distinct data items accessed between
that reference and the previous reference to the same data item. If the reuse distance is smaller
than the cache size, the program is more likely to have good cache performance. For a
fully-associative cache, an access whose reuse distance is smaller than the cache size is a cache
hit; otherwise it is a capacity miss. We used several tools to calculate this metric.

Figure 3.1: Reuse distance Illustration[10]
Figure 3.1 illustrates the concept of reuse distance. Here ‘a’ is accessed again after ‘b’ and
‘c’, so the reuse distance for this access to ‘a’ is 2. Next, ‘a’ is accessed again immediately,
so the reuse distance in that case is 0, and so on. Reuse distance is counted in units of data
items or data blocks.
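The definition can be made concrete with a short sketch. The function name and the trace below are our own illustration (the trace reproduces only the two cases described in the text, not the full figure), and the implementation is a simple LRU-stack walk rather than the algorithm used by any of the tools discussed later.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return the reuse distance of each access in `trace`.

    The reuse distance is the number of distinct items referenced since
    the previous access to the same item; first-time accesses get None.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for item in trace:
        if item in stack:
            keys = list(stack.keys())
            # Distinct items touched after the previous access to `item`.
            distances.append(len(keys) - 1 - keys.index(item))
            stack.move_to_end(item)
        else:
            distances.append(None)
            stack[item] = True
    return distances


print(reuse_distances(["a", "b", "c", "a", "a"]))  # [None, None, None, 2, 0]
```

This walk costs time proportional to the number of distinct items per access, which is acceptable for illustration but far too slow for traces with billions of accesses.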
Before introducing the tools that we used to measure reuse distance, we describe another tool,
PerfExpert, which we used to check whether the code is well optimized and to obtain suggestions
for possible performance optimizations.

3.1 PerfExpert Version 4.1.1 [11]
PerfExpert is a performance diagnosis tool for HPC applications. It analyzes a program’s
performance and reports performance statistics along with recommendations to improve it. A
recommendation may be to change the code using some optimization strategy or to change compiler
options.
Sample output generated by PerfExpert is shown below.

Figure 3.1.1: Sample experimental
output of PerfExpert

Figure 3.1.2: Sample recommendation
output of PerfExpert

Figure 3.1.1 is a sample performance output from PerfExpert for the LULESH benchmark. This
output provides the total running time of the program and reports on the performance of the
routine that takes the largest percentage of the total running time. It reports the Local
Cycles-Per-Instruction (LCPI) for different categories of events, e.g., data accesses,
instruction accesses, data and instruction TLB, branch instructions and floating-point
instructions. Within each category there are individual events; for data accesses, for example,
there are Level 1 data cache (L1d) hits and Level 2 data cache (L2d) hits, as shown in Figure
3.1.1. The data access LCPI counts the cycles of the reported function that arise from accesses
to memory for program variables. The instruction access LCPI counts the cycles arising from
memory accesses for the function’s instructions. Data TLB provides an approximate measure of the
penalty arising from strides in accesses or regularity of accesses. Instruction TLB reflects the
cost of fetching instructions due to irregular accesses. Branch instructions counts the cost of
if statements, loop conditions, etc., and floating-point instructions counts the LCPI from
executing floating-point instructions [9].
Figure 3.1.2 shows a sample recommendation output from PerfExpert for the LULESH benchmark. As
shown in the figure, there are three recommendations for the code segments located at line
numbers 633 and 705. The first is to componentize important loops by factoring them into their
own subroutines; this may allow the compiler to optimize each loop independently. The second is
to move loop-invariant memory accesses out of the loop, and the last is to unroll the loop. We
followed these recommendations and obtained some performance improvement for LULESH version 1.
For LULESH version 2, however, the recommendations could not be applied because the code does not
match what the tool suggested.
This tool gave us a performance evaluation of the application and recommended optimizations
before we proceeded to study the data access patterns of the code and its major data structures.
The following sections describe the tools that we used to obtain the reuse distances of the
applications.
3.2 Reuse Distance Analysis tool from PMaC (Performance Modeling and Characterization) labs
(Pre-released version) – PEBIL, PIN
PEBIL Version 0.1.3228 (PMaC’s Efficient Binary Instrumentation Toolkit for x86/Linux) [21], a
binary instrumentation toolkit from PMaC Labs, instruments ELF (Executable and Linkable Format)
binaries on Linux for x86/x86_64 processors [12]. The toolkit provides a C++ API that can be used
to insert code and data into a binary file [12] and provides tools for basic block counting and
cache simulation for a set of memory hierarchies [12]. The toolkit includes a package called
ReuseDistance, which is used to calculate the temporal locality and spatial locality of an
address stream. It has two separate classes for reuse distance and spatial locality, but both
have similar interfaces. Addresses are passed in a struct named ReuseEntry that contains an id
field and an address field. The address is the memory address being examined, whereas the id is
an identifier associated with that address, such as a line number, a thread id or another index
of the structure that generated the address [13]. Distances are tracked in bins whose boundaries
are powers of 2. Each bin is given an id, and at the end of the analysis the tool reports the
total number of reuse distances that fall in each bin as well as the total number of accesses.
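The power-of-two binning can be sketched as follows. This is our own illustration, not the ReuseDistance package's code: the function names are hypothetical, and the bin boundaries are an assumption inferred from the ranges that appear in the tool's sample output later in this section (for example, ranges 3 : 4, 5 : 8 and 65537 : 131072 for bins 2, 3 and 17).

```python
from collections import Counter


def bin_id(distance):
    """Power-of-two bin index: bin b holds distances in (2**(b-1), 2**b]."""
    if distance < 1:
        raise ValueError("this sketch only bins distances >= 1")
    return (distance - 1).bit_length()


def bin_range(b):
    """Inclusive (low, high) distance range covered by bin b."""
    return (1, 1) if b == 0 else (2 ** (b - 1) + 1, 2 ** b)


def histogram(distances):
    """Count how many distances fall into each bin."""
    return Counter(bin_id(d) for d in distances)


hist = histogram([1, 3, 4, 5, 70000])
for b in sorted(hist):
    low, high = bin_range(b)
    print(f"bin {b}: {low} : {high} hits: {hist[b]}")
```

Binning keeps the histogram small (one counter per power of two) at the cost of resolution inside each bin, which matters later when a single representative distance has to be chosen for each bin.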
A ReuseDistance object can be constructed to keep a window of a finite or unlimited number of
addresses in its history when trying to find when an address was last used [13]. The tool
simulates results based on the cache description provided by the user. The parameters the user
needs to set are as follows:
METASIM_SAMPLE_ON
METASIM_REUSE_WINDOW
METASIM_SPATIAL_WINDOW

The cache description file, which describes the size and associativity arrangement of the
different levels of cache, is shown below:

Table 3.2.1: Cache Description file from SDSC
# [sysid] [lvl_count] L1[size assoc line repl] L2[size assoc line repl] L3[size assoc line repl]
## previous years
# 3 2 64KB 2 64 lru 1024KB 16 64 lru # 3 Theoretical AMD Opteron
# 4 2 256KB 8 128 lru 6144KB 12 128 lru # 4 Theoretical IT2 6MB L3 Eagle
#21 2 64KB 8 64 lru 2048KB 8 64 lru #21 Theoretical Intel Dempsey
#22 3 32KB 4 128 lru 960KB 10 128 lru 18432KB 12 256 lru #22 Theoretical babbage
#23 2 32KB 8 64 lru 2048KB 16 64 lru #23 Theoretical woodcrest
#44 3 32KB 4 128 lru 1920KB 10 128 lru 36864KB 12 256 lru #44 Theoretical bassi
#54 2 32KB 64 32 lru 2048KB 8 128 lru #54 Theoretical BGL/P and real BGL is same)
#78 3 64KB 8 128 lru 4096KB 8 128 lru 16384KB 16 128 lru #78 Theoretical IBM P6 for TI09
#75 3 64KB 8 128 lru 4096KB 8 128 lru 16384KB 16 256 lru #75 Theoretical IBM P6 256B L3 line
#64 3 64KB 2 64 lru 512KB 16 64 lru 512KB 32 64 lru #64 Theoretical AMD QUAD
#72 3 64KB 2 64 lru_vc 512KB 16 64 lru_vc 512KB 32 64 lru #72 AMD K10 with victim caches/non-inclusion policy
#73 3 64KB 2 64 lru 512KB 16 64 lru 1536KB 48 64 lru #73 AMD Shanghai
#81 3 64KB 2 64 lru_vc 512KB 16 64 lru_vc 1536KB 48 64 lru #73 AMD Shanghai as L1 with victim cache (see 73)
#74 3 64KB 2 64 lru 512KB 16 64 lru 1024KB 48 64 lru #74 AMD Istanbul
#82 3 64KB 2 64 lru_vc 512KB 16 64 lru_vc 1024KB 48 64 lru #74 AMD Istanbul as L1 with victim cache (see 74)
#77 3 32KB 8 64 lru 256KB 8 64 lru 2048KB 16 64 lru #77 Intel Nehalem-EP
#79 3 64KB 2 64 lru 512KB 8 64 lru 2048KB 16 64 lru #AMD x86
80 3 32KB 8 64 lru 256KB 8 64 lru 20480KB 20 64 lru #Sandy Bridge
## popular caches
#96 3 64KB 2 64 lru 512KB 16 64 lru 1536KB 16 64 lru # 96 AMD Shanghai L3 16-way
#97 3 64KB 2 64 lru 512KB 16 64 lru 1536KB 32 64 lru # 97 AMD Shanghai L3 32-way
#98 3 64KB 2 64 lru_vc 512KB 16 64 lru_vc 1536KB 16 64 lru # 98 AMD Shanghai L3 16-way victimized
#99 3 64KB 2 64 lru_vc 512KB 16 64 lru_vc 1536KB 32 64 lru # 97 AMD Shanghai L3 32-way victimized
#100 3 64KB 2 64 lru 512KB 16 64 lru 1024KB 16 64 lru # 100 AMD Istanbul L3 16-way
#101 3 64KB 2 64 lru 512KB 16 64 lru 1024KB 32 64 lru # 101 AMD Istanbul L3 32-way
#102 3 64KB 2 64 lru_vc 512KB 16 64 lru_vc 1024KB 16 64 lru # 102 AMD Istanbul L3 16-way victimized
#103 3 64KB 2 64 lru_vc 512KB 16 64 lru_vc 1024KB 32 64 lru # 103 AMD Istanbul L3 32-way victimized
#104 3 32KB 8 64 lru 256KB 8 64 lru 3072KB 16 64 lru # 104 Intel Nehalem-EP L3 3MB
#105 3 32KB 8 64 lru 256KB 8 64 lru 2304KB 16 64 lru # 105 Intel Nehalem-EP L3 2.25MB
## ti11 new AMD
#110 3 16KB 4 64 lru 1024KB 16 64 lru_vc 512KB 64 64 lru #AMD Interlagos small-L3 (C)
#111 3 16KB 4 64 lru 1024KB 16 64 lru_vc 384KB 64 64 lru #AMD Interlagos small-L3 +HTProbe
#112 3 16KB 4 64 lru 1024KB 16 64 lru_vc 1024KB 64 64 lru #AMD Interlagos
#113 3 16KB 4 64 lru 1024KB 16 64 lru_vc 768KB 64 64 lru #AMD Interlagos +HTProbe (AI)
#114 3 64KB 2 64 lru_vc 512KB 16 64 lru_vc 850KB 48 64 lru #AMD Magny Cours +HTprobe (AH)
## ti11 new intel
#130 3 32KB 8 64 lru 256KB 8 64 lru 2560KB 20 64 lru #Intel Xeon E5 (G,H,L,M,N,O,P,X,Y,Z,AB,AC,AE,AF,AG)
#131 3 32KB 8 64 lru 256KB 8 64 lru 2048KB 64 64 lru #Intel Sandy/Ivy Bridge (D,E)
## ti11 new ibm
#140 3 32KB 8 128 lru 256KB 8 128 lru 4096KB 8 128 lru #IBM Power7 (I,J)
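To show how one entry of the cache description file translates into cache geometry, the following sketch parses a single active line in the format given by the file's header comment (sysid, level count, then size, associativity, line size and replacement policy per level). The parser, its names and the one-line form of the Sandy Bridge entry (sysid 80) are our own reading of the listing, not code or text taken from the PMaC tool; commented-out entries (those with a leading #) are deliberately not handled.

```python
from dataclasses import dataclass


@dataclass
class CacheLevel:
    size_bytes: int
    assoc: int
    line_bytes: int
    policy: str

    @property
    def num_sets(self):
        # Number of sets = capacity / (ways * line size).
        return self.size_bytes // (self.assoc * self.line_bytes)


def _kb_to_bytes(text):
    if not text.endswith("KB"):
        raise ValueError(f"expected a size in KB, got {text!r}")
    return int(text[:-2]) * 1024


def parse_cache_line(line):
    """Parse 'sysid lvl_count (size assoc line repl)*' into (sysid, levels)."""
    fields = line.split("#", 1)[0].split()  # drop the trailing comment
    sysid, level_count = int(fields[0]), int(fields[1])
    levels = []
    for i in range(level_count):
        size, assoc, line_size, repl = fields[2 + 4 * i: 6 + 4 * i]
        levels.append(CacheLevel(_kb_to_bytes(size), int(assoc), int(line_size), repl))
    return sysid, levels


entry = "80 3 32KB 8 64 lru 256KB 8 64 lru 20480KB 20 64 lru #Sandy Bridge"
sysid, levels = parse_cache_line(entry)
for number, level in enumerate(levels, start=1):
    print(f"L{number}: {level.num_sets} sets x {level.assoc} ways x {level.line_bytes} B")
```

The number of sets and the associativity obtained this way are the s and k parameters that a set-associative miss model needs for each cache level.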

For this tool we use the Sandy Bridge architecture’s cache hierarchy. The tool produces two
output files, one with the reuse distances and the other with the spatial locality of the
application; samples are shown below. Here “Hits” is the number of accesses whose reuse distance
falls in a given bin, and a bin is the range between two reuse distance boundaries.
#ReuseDistance file
IMAGE 0x2dae528b0f2c27f3

THREAD

0

16

BB: 3940842953244672
I: 2 3 : 4 Hits: 960
I: 3 5 : 8 Hits: 900
I: 17 65537 : 131072 Hits: 61
BB: 3940696931368960
I: 0 1 : 1 Hits: 500
Bin: 0 Hits: 6702530317
Bin: 1 Hits: 7019334787
Bin: 2 Hits: 7741403300
Bin: 3 Hits: 8975026160
Bin: 4 Hits: 7641695636
Bin: 5 Hits: 1265607093
Bin: 6 Hits: 2908347082
Bin: 7 Hits: 183125798
Bin: 8 Hits: 69423556
Bin: 9 Hits: 23448786
Bin: 10 Hits: 11565712
Bin: 11 Hits: 47324599
Bin: 12 Hits: 114082105
Bin: 13 Hits: 232462426

Total hits: 44104601369

#Spatial locality file
IMAGE 0x2dae528b0f2c27f3
THREAD
0
SPATIALSTATS
32
64
512 280 44104948736
SPATIALID 3940679738916864 945000000 94500500
0 0 121499500
4 4 94500000
8 8 404999500
12 12 13500000
16 16 13500500
24 24 53999500
SPATIALID 3940855835131904 3
16 16 3
0
Bin: 0 Range: 0 Count: 12565604134
Bin: 4 Range: 2 Count: 1623422060
Bin: 8 Range: 6 Count: 22503105650
Bin: 12 Range: 10 Count: 54001344
Bin: 16 Range: 14 Count: 1392148665
Bin: 20 Range: 18 Count: 13513639
Bin: 24 Range: 22 Count: 792619829
Bin: 28 Range: 26 Count: 596

2226329915

SDSC developed a new Pin-based version of this tool that uses Pin version 2.11 [22] as the binary
instrumentation tool, similar to MIAMI, which we discuss in Section 3.3. The Pin version gives
different output than the PEBIL version. We compared both outputs with that of MIAMI, and the
results from MIAMI agree more closely with the SDSC Pin version than with the PEBIL version,
although the PEBIL results were not far from those of the Pin version. The output of the SDSC Pin
version for LULESH version 2 looks as follows:
**BIN 1 at 64B ** => 7409208716
**BIN 2 at 128B ** => 7675627040
**BIN 3 at 256B ** => 8172286605
**BIN 4 at 512B ** => 9471326769
**BIN 5 at 1KB ** => 7545808755
**BIN 6 at 2KB ** => 1365993881
**BIN 7 at 4KB ** => 2418166641
**BIN 8 at 8KB ** => 719824048
**BIN 9 at 16KB ** => 65223575
**BIN 10 at 32KB ** => 23730264
**BIN 11 at 64KB ** => 11658364
**BIN 12 at 128KB ** => 47377595
**BIN 13 at 256KB ** => 114257399
**BIN 14 at 512KB ** => 232994468
**BIN 15 at 1MB ** => 488646027
**BIN 16 at 2MB ** => 284920941
**BIN 17 at 4MB ** => 63706384
**BIN 18 at 8MB ** => 48459201
**BIN 19 at 16MB ** => 147316841
**BIN 20 at 32MB ** => 137202055

total memeory accesses: 46444080641
hash table size: 345072
scale tree size: 46503
total buffer processing times: 11074
total time: 7693.383990 sec
analysis time: 3534.538631 sec
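A histogram of this kind can be turned into a rough locality summary by accumulating the bins. The sketch below is our own illustration: it pairs the twenty bin labels with the twenty counts in order, treats each label (64B, 128B, ...) as the upper bound of the bin in bytes, and reads the cumulative fraction as an approximate hit ratio for an idealized fully-associative LRU cache of that capacity. All three are assumptions, not documented behavior of the tool. The bin counts sum to the reported total minus 345072, which equals the reported hash table size; this may correspond to first-time accesses, but that too is a guess.

```python
# (upper bound of bin in bytes, access count), copied from the sample output
PIN_BINS = [
    (64, 7409208716), (128, 7675627040), (256, 8172286605),
    (512, 9471326769), (1024, 7545808755), (2048, 1365993881),
    (4096, 2418166641), (8192, 719824048), (16384, 65223575),
    (32768, 23730264), (65536, 11658364), (131072, 47377595),
    (262144, 114257399), (524288, 232994468), (1048576, 488646027),
    (2097152, 284920941), (4194304, 63706384), (8388608, 48459201),
    (16777216, 147316841), (33554432, 137202055),
]
TOTAL_ACCESSES = 46444080641


def fraction_within(capacity_bytes, bins=PIN_BINS, total=TOTAL_ACCESSES):
    """Fraction of all accesses whose bin upper bound is <= capacity_bytes."""
    return sum(count for bound, count in bins if bound <= capacity_bytes) / total


# Capacities of the three Sandy Bridge cache levels used in this chapter.
for kb in (32, 256, 20480):
    print(f"{kb:>6} KB: {fraction_within(kb * 1024):.4f}")
```

Because each bin is attributed entirely to its upper bound, the estimate is conservative for capacities that fall inside a bin.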

3.3 MIAMI (Pre-released version) – Machine Independent Application Models for performance Insight
MIAMI is a collection of tools that report different performance metrics. The tools included in
MIAMI are memreuse, cfgstatic, streamsim, miami, miamicfg and cachesim. Their functionality is
described below.
cfgstatic: A CFG (Control Flow Graph) profiler that discovers and reports the CFG of all
executed routines.
memreuse: Discovers and reports data reuse patterns in an application, and collects and reports a
memory reuse distance histogram for each reuse pattern. Optionally, it collects the working set
footprint of each program scope.
streamsim: Discovers streaming data behavior in applications and measures and reports the number
of concurrent streams at any given point during execution.
miami: A post-processing tool that consumes the dynamic analysis profiles produced by the
previous tools and combines dynamic analysis and static analysis to infer additional insights
about an application's performance.
miamicfg (CFG profiler): Discovers the control flow dynamically as the program executes. It can
produce more accurate CFGs than cfgstatic, but it has other drawbacks that make cfgstatic more
desirable.
cachesim (LRU cache simulator): Given a user-defined line size, associativity and cache size, it
reports cache performance statistics. This is an older tool, not really connected with the rest
of the MIAMI toolkit.
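What an LRU cache simulator of this kind computes can be sketched in a few lines. The class below is our own minimal illustration, parameterized by cache size, associativity and line size as described for cachesim; it is not MIAMI's implementation and does not reproduce its interface or output.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, size_bytes, assoc, line_bytes):
        self.line_bytes = line_bytes
        self.assoc = assoc
        self.num_sets = size_bytes // (assoc * line_bytes)
        # One ordered dict per set, ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.line_bytes
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)
            self.hits += 1
        else:
            self.misses += 1
            if len(lines) >= self.assoc:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True


# Tiny 256-byte, 2-way cache with 64-byte lines (2 sets).
cache = LRUCache(size_bytes=256, assoc=2, line_bytes=64)
for addr in [0, 64, 128, 0, 256, 0, 128]:
    cache.access(addr)
print("hits:", cache.hits, "misses:", cache.misses)
```

Running the same address trace through such a simulator for several configurations gives exact miss counts against which predictions based on reuse distance can be compared.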
MIAMI works on x86-64 application binaries, not on source code; hence, it has some advantages
over tools that analyze source code. One advantage is that it is language independent, so it
does not matter which programming language the application is written in. Next, it has good code
coverage, meaning it “avoids running into blind spots when applications are linked against third
party or system libraries available only in binary formats” [3]. Additionally, it can capture
optimization effects, meaning “tools can analyze and model the performance of optimized code
without perturbing the optimization process” [3].
MIAMI does not depend on machine characteristics but works more like a simulator to ascertain
application characteristics. Taking measurements and doing the performance analysis on a
particular machine would limit the exploration that is possible, but this tool collects
“measurements of architecture-neutral application characteristics in a scalable way” [3]. It uses
an architecture-neutral model that predicts program execution behavior on different platforms.
The architecture of this tool is shown in Figure 3.3.1.

Figure 3.3.1: MIAMI Architecture[3]

Reading the MIAMI architecture diagram in Figure 3.3.1 from bottom to top, we can see that XED
(X86 Encoder Decoder) [23] is used to decode machine instructions from the native binary and
create an intermediate representation (IR) of the program's instructions, micro-ops and data
registers. The tool uses PIN for static analysis as well as dynamic analysis. In the static
analysis part, the tool first has to understand the features of the application that are
important for performance and statically characterize its memory access patterns. These are
features that are “independent of its execution characteristics, such as the control flow and the
loop nesting structure inside routines, the instruction mix and the instruction schedule
dependencies in loops” [3]. The dynamic analysis part is also “built on top of PIN, to determine
the relative importance of different parts of the program and to capture the application's
dynamic data reuse” [3]. After gathering the information from the static and dynamic analyses,
MIAMI combines them to construct an annotated dependence graph of the application, which helps in
understanding the interplay between a program's static structure and its execution behavior.
This information is applied to the target architecture using a hierarchical modulo instruction
scheduler, and “it estimates the execution time at loop level, exposing performance limits due to
instruction schedule dependencies, contention on machine resources, memory latency or memory
bandwidth” [3]. All these metrics are stored in XML performance database format and CSV files
that can be viewed in the hpcviewer interface. The MIAMI tool has pre-configured machine
configurations for Sandy Bridge and AMD. Below is a sample output from hpcviewer for the LULESH
benchmark.

Figure 3.3.2: Sample MIAMI tool output

The above sample is for the LULESH benchmark on the Sandy Bridge x86-64 architecture. It is only
a portion of the output from the tool. As shown in Figure 3.3.2, the tool displays different
metrics for each routine separately. Here the routine CalcHourglassControlForElems is
highlighted, and the output shows metrics such as L1D misses, L2D misses and L3D misses. In this
routine the number of L1D misses is 2.76e04, which is less than 0.1% of the total L1D misses.
Similarly, Figure 3.3.3 below is another sample output from this tool, showing the reuse metrics;
the different results are produced from different XML files. In this sample we can see that the
routine CalcHourglassControlForElems accounts for 14.6% of the total reuse that occurred in the
Level 1 Data Cache (L1D), and for 15.8% and 0.4% of the total reuse that occurred in the Level 2
Data Cache (L2D) and Level 3 Data Cache (L3D), respectively. This indicates that the reuse
distance for some variables is too long for them to remain in the L1D, so the reuse happened in
the L2D. The same concept applies to misses: if the same data is accessed again only after so
long that it has been evicted from the L1D, the access is an L1D miss.

Figure 3.3.3: Sample memreuse output from MIAMI

3.4 MACPO (Memory Access Characterization for Performance Optimization) (Pre-released version)
This tool is designed for performance tuning of C, C++ or Fortran applications and for analyzing
an application’s memory usage patterns. It generates metrics such as reuse distances, strides,
cache conflicts and cache latencies for the data structures used in the functions (subroutines)
that are performance bottlenecks. MACPO gives information about the entire function and the
important variables used in it. A distinguishing property of MACPO is that it gives
architecture-independent metrics. We used MACPO on Stampede, and the process of running it there
is as follows:


You can compile the application so that MACPO focuses only on one function or loop. MACPO then
produces metrics for the specified function or loop only. To do this we used the following
commands:

$ macpo.sh --macpo:instrument=<function_name or loop_name> -c lulesh.cc

$ macpo.sh --macpo:instrument=<function_name or loop_name> -o lulesh lulesh.o

You have to run the program once to generate the macpo.out file.

After that, to get the analysis report, we run $ macpo-analyze macpo.out

Figure 3.4.1 Sample output of MACPO

Chapter 4
Analytical Modeling [18]
The probabilistic model provides the number of cache misses, given the number of sets, the set
associativity, the reuse distances and the frequencies of the reuse distances. Its input is the
data produced by the tools discussed in Chapter 3, which we used to calculate the reuse distances
and their frequencies.
With this model, we can easily predict the cache misses for a fully-associative cache that uses
the LRU (Least Recently Used) replacement policy. If the cache holds n blocks and the program
accesses a block with a reuse distance of n or more, the access is certainly a miss. If the reuse
distance is less than n, the access is a hit because the accessed block has not yet been evicted.
A fully-associative cache of n blocks is the same as an n-way set-associative cache with a single
set.
For a set-associative cache, this model predicts the cache misses well. The model is based on the
assumption that blocks are uniformly distributed in memory, which means that the probability of
two blocks mapping to the same set is 1/s, where s is the number of sets, independent of where
other blocks map [18]. Under this assumption, the probability that i blocks out of n distinct
blocks map to a given set is

\[ P_{mapping}(s,n,i) = \begin{cases} \binom{n}{i} \left(\frac{1}{s}\right)^{i} \left(\frac{s-1}{s}\right)^{n-i} & \text{if } i \le n, \\ 0 & \text{if } i > n, \end{cases} \]

where $\left(\frac{1}{s}\right)^{i}$ represents the probability that i blocks map onto a specific
set (one block maps to a specific set with probability 1/s, but here there are i blocks to map),
$\left(\frac{s-1}{s}\right)^{n-i}$ represents the probability that all the other blocks, i.e., the
remaining n-i blocks, map to the other s-1 sets, and $\binom{n}{i}$ represents the number of ways
in which i blocks out of the total of n blocks can be chosen to map to the given set.
Now, we can write the probability of a hit as

\[ P_{hit}(s,k,m) = \sum_{i=0}^{k-1} P_{mapping}(s,m,i), \]

where m is the reuse distance, s is the number of sets and k is the associativity.
Given the probability of a hit, we can write the probability of a miss as

\[ P_{miss}(s,k,m) = 1 - P_{hit}(s,k,m). \]

Here the cache size in cache blocks is s*k.
For a fully-associative cache s = 1, and since s*k represents the cache size, the cache size is
equal to k. Accordingly, if the reuse distance is greater than k-1, the access is clearly a miss,
and the formula agrees: for m > k-1, $P_{mapping}(1,m,i) = 0$ for every i ≤ k-1. If m ≤ k-1, the
probability of a hit becomes one.
Now we can write the number of misses as follows:

\[ \text{Number of misses}(Hist,s,k) = \sum_{bin \in Hist} F_{bin} \cdot P_{miss}(s,k,D_{bin}), \]

where D_bin is the reuse distance of the bin, taken as the upper value of the bin because we
wanted to make full use of the reuse data, and F_bin is the frequency of the bin.
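The model can be sketched in Python as follows. This is a minimal sketch that assumes the binomial form implied by the verbal description above (each of the other blocks falls into a given set with probability 1/s) and a hit whenever fewer than k of the m intervening blocks map to the same set. The function names are ours, and the histogram in the example is made up for illustration.

```python
from math import comb


def p_mapping(s, n, i):
    """Probability that exactly i of n distinct blocks map to a given set (s sets)."""
    if i > n:
        return 0.0
    return comb(n, i) * (1 / s) ** i * ((s - 1) / s) ** (n - i)


def p_hit(s, k, m):
    """Hit probability for reuse distance m in a k-way cache with s sets."""
    # A hit needs fewer than k of the m intervening blocks in the same set.
    return sum(p_mapping(s, m, i) for i in range(k))


def p_miss(s, k, m):
    return 1.0 - p_hit(s, k, m)


def expected_misses(hist, s, k):
    """hist: iterable of (D_bin, F_bin) pairs, i.e. (reuse distance, frequency)."""
    return sum(freq * p_miss(s, k, dist) for dist, freq in hist)


# Hypothetical histogram: (upper reuse distance of the bin, number of accesses).
hist = [(4, 1000), (64, 500), (1024, 100)]
print(expected_misses(hist, s=64, k=8))   # 64 sets, 8-way
print(expected_misses(hist, s=1, k=512))  # fully associative, 512 blocks
```

For s = 1 the sketch reduces to the fully-associative case: every access with reuse distance at most k-1 is a hit and every other access is a miss. For very long reuse distances the floating-point powers can underflow to zero, which a production implementation would need to handle more carefully.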

Chapter 5
Setup and Analysis of Architectural Simulators to Verify the Predicted
Cache Performance

This chapter presents an analysis of different architectural simulators in terms of their
usefulness for evaluating different hardware configurations. The purpose is to create a virtual
environment in which to evaluate an application and determine the configuration that is best
suited for it.
Currently our research targets evaluating caches for various applications without building the
hardware; however, the simulators under study are not limited to the study of cache
configurations.
Reuse distance is the metric that is used to predict suitable cache configurations for a
particular application. The tools that we use to determine the reuse distance are the PMaC
locality measurement tool from SDSC and the MACPO data access analysis tool from TACC. These
tools provide data that can be used to select an optimal cache configuration for an application.
To verify this selection, we feed the cache configuration to the simulators.
The simulators examined for this purpose were ESESC (Enhanced Super ESCalar simulator), MacSim,
SST with gem5 and GPUWattch+GPGPU-Sim.
5.1 ESESC (Enhanced Super ESCalar simulator)
ESESC is a fast multiprocessor simulator with detailed power, thermal and performance models for
modern out-of-order multicores [20]. Here we made use of only the performance statistics provided
by this simulator. ESESC provides statistics for particular code sections only; it does not
provide statistics for the whole program. It takes ARM binaries as input and uses QEMU (Quick
EMUlator) for emulation. The memory hierarchy parameters can be set by the user, and an on-chip
memory controller allows the user to control the memory bandwidth and memory latencies.
The output of ESESC is shown in Figure 5.1.1.

Figure 5.1.1: Sample Output of ESESC
The output in the figure indicates that the simulation time was 66.229 seconds and the
execution time was 54.197 ms. It gives us the total number of cycles executed during that time as
well as the average memory latency, number of memory accesses and the miss rates and hit rates
for the different levels of the memory hierarchy.
5.2 MacSim
MacSim is an architectural simulator that was developed at the Georgia Institute of Technology.
It is a trace-driven, cycle-level simulator that simulates x86 and NVIDIA PTX instructions. It
models detailed micro-architectural behaviors, including those of the pipeline, multi-threading
and memory subsystems. MacSim is capable of simulating a variety of architectures, e.g., Intel’s
Sandy Bridge and NVIDIA’s Fermi GPU. It can simulate homogeneous ISA multicore processors as well
as heterogeneous ISA multicore processors [14].
As our research is focused on using locality data to predict the cache performance of sequential
and threaded versions of applications for various cache configurations, we discuss below how
MacSim can be used to verify our predicted cache configurations. We also discuss how this
simulator is set up and used to collect statistics for the cache parameters.
To run, MacSim requires two additional files. One is the file named params.in, which defines the
values of the architectural parameters, and the other is the file called trace_file_list, which
specifies the number of trace files to run and the location of the trace files.
Different versions of params.in (the architecture-specific files listed in Table 5.2.1 are
renamed to params.in before use) are provided in the MacSim repository. Each version, for a
different architecture, has a unique file format. The filenames and architectures of the
different configuration files are shown below.
Table 5.2.1: MacSim architecture-specific files
Filename                Architecture
params_8800gt           GeForce 8800 GT (G80)
params_gtx280           GeForce GTX 280 (GT200)
params_gtx465           NVIDIA GeForce GTX 465 (Fermi)
params_x86              Intel’s Sandy Bridge (CPU part only)
params_hetero_4c_4g     Intel’s Sandy Bridge (CPU + GPU)

These files can be modified and configured in different ways. Since we are interested in
collecting memory traces and quantifying the cache performance, we configured the sizes,
associativity and latency of the various levels of the cache memory hierarchy of the target
architecture. This allows us to simulate the application’s behavior with different cache
configurations.
Another file, trace_file_list, gives the simulator the number of traces that are to be simulated
and the paths to the trace configuration files in the trace directory of the application [11].
The trace files included in trace_file_list are generated by the trace generator (found in the
MacSim repository) for the given application on the specified architecture. A trace configuration
file includes the number of threads in the application, the type of application, the thread id
and the starting point of each thread in terms of the instruction count of the main thread [14].
Once the files are configured, MacSim is executed. If MacSim runs successfully, it generates a
file named params.out and several *.stat.out files for the different hardware components. The
file params.out contains the values of the parameters used by the simulation, and the *.stat.out
files give the statistics for memory, network and so on. A sample of the memory output is shown
below in Figure 5.2.1. The output includes cache metrics, e.g., cache hits, cache misses and
write-backs, for all levels of the cache hierarchy, from which we can check whether our predicted
cache parameters give the minimum miss rates.

Figure 5.2.1: Memory Statistics for the Sphinx benchmark with MacSim
5.3 SST with gem5
The Structural Simulation Toolkit (SST) is an open modular framework that is meant to facilitate
the design and optimization of HPC architectures and applications. It consists of a parallel
simulation core with a number of network, memory and processor models. The simulator core
provides simulation configuration and startup along with the parallel model of computation and a
common interface to the technology models [15].
gem5 is a simulator that is integrated with the SST framework. It simulates processors, caches,
busses and TCP/IP network components. The main objective of integrating gem5 with SST is high
parallel efficiency. To achieve this, gem5’s Python-based initialization was replaced with an
XML-based configuration to better support SST’s two-phase initialization process. Also, gem5 was
encapsulated as an SST component, and its internal event queue was modified so that it is driven
by SST’s event queue [15].
As stated earlier, our research focuses on using locality data to predict the cache performance
of an application executed on non-existent hardware. We use gem5 to obtain information about the
processors, caches and busses, which is our reason for using this simulator. To achieve this
goal, the SST framework first needs to be set up correctly, and to set up the simulation
environment we have to configure an XML file. gem5 needs two levels of configuration: one is the
SST configuration and the other is the M5 configuration. M5 is a modular platform for computer
system architecture research, encompassing system-level architecture as well as processor
microarchitecture [16].
In the SST configuration, only one M5 SST component per rank is allowed, and simulation variables
such as the stop time are configured there, whereas in the M5 configuration we can modify M5
subcomponents such as processors, caches and busses. Sample configuration files are shown below:

Figure 5.3.1: SST configuration

Figure 5.3.2: M5 configuration
With these configuration files we can set up the configuration as required, for example the data
cache size, its associativity, its latency and the bus configuration.
We have to run gem5 with power modeling in order to obtain output on application behavior
together with the cache parameters. The power model has to take the cache accesses into account,
because power increases as the cache size increases. During this process it calculates the total
data cache read accesses/misses, total data cache write accesses/misses and instruction cache
read accesses/misses. These values are useful for us, and to obtain them we have to configure the
power XML file, in which we can set the L1/L2 cache size, line size, associativity, latency, etc.
A sample power XML file is shown below in Figure 5.3.3.

Figure 5.3.3: Power XML
After successful setup and configuration, SST (with gem5 and power modeling) can report the
application behavior for the configured hardware parameters. The output gives information on
data cache read accesses/misses, total data cache write accesses/misses, instruction cache read
accesses/misses, etc.; a sample output is shown below in Figure 5.3.4. We can verify our
predicted cache parameters with these results.

Figure 5.3.4: gem5 with power modeling output
5.5 GPUWattch + GPGPU-Sim
GPGPU-Sim simulates a functional model for PTX/SASS and CUDA/OpenCL, a timing
model for the compute part of the GPU, and a power model named GPUWattch [17]. In our
research, GPUWattch can give some idea of cache behavior under configurable cache
parameters.
GPUWattch is integrated with GPGPU-Sim to simulate the power model of the
architecture. We, however, are interested in tracing the cache parameters of a particular
application on a configurable architecture rather than in the power model. A GPGPU has various
caches: instruction cache, L1/L2 data cache, texture cache, and constant cache [16]. To build the
power model, the simulator tracks performance counters such as instruction cache hits/misses,
data cache read/write hits/misses, texture cache hits/misses, constant cache hits/misses, and L2
data read/write hits/misses, which help us trace the cache parameters. This simulator has not yet
been used in our research, but we expect it to be helpful in future work.
After analyzing and evaluating all the simulators, we chose the ESESC simulator for our
work. First, it was easy to install and is oriented more toward memory performance than toward
other aspects such as power or networks. Its configuration settings are easy to manage and,
above all, it has a very responsive forum through which we can quickly troubleshoot problems
with the simulator. Another advantage of ESESC is that it exposes memory bandwidth and
hardware prefetching settings; unfortunately, the developers are still working on making these
parameters function correctly.

Chapter 6
Results
The tools discussed in Chapter 2 are used to collect reuse distances. We discussed
how these tools perform and what outputs they produce, which helped us obtain the results
that we needed. Our first step focused on getting the reuse distances for the applications.
MIAMI, the reuse distance tool from SDSC, and MACPO gave the reuse distance metrics that we
needed. Of these, MIAMI and the SDSC tool gave us reuse distance counts per basic
block, whereas MACPO gave us reuse distances per instruction.
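As a reminder of what these tools report, the sketch below computes reuse distances for a short address trace at cache-line granularity: the reuse distance of an access is the number of distinct cache lines touched since the previous access to the same line. This is a naive illustration only, not the algorithm used by MIAMI, the SDSC tool, or MACPO. The function name reuse_distances and the use of None for first-time accesses are choices made here for illustration; the default line size of 64 bytes matches the line size used for the results in this chapter.

```python
from collections import OrderedDict


def reuse_distances(addresses, line_size=64):
    """Return one reuse distance per access (None for a first-time access).

    Illustrative sketch: keeps an LRU-ordered set of cache lines and counts
    how many distinct lines were touched since the previous access to the
    same line.
    """
    stack = OrderedDict()  # cache lines, least recently used first
    distances = []
    for addr in addresses:
        line = addr // line_size
        if line in stack:
            lines = list(stack.keys())
            position = lines.index(line)
            # Lines after `position` were touched since the last access.
            distances.append(len(lines) - 1 - position)
            stack.move_to_end(line)
        else:
            distances.append(None)  # cold access, no previous use
            stack[line] = True
    return distances


trace = [0, 64, 128, 0]          # byte addresses: lines 0, 1, 2, 0
print(reuse_distances(trace))    # [None, None, None, 2]
```

Production tools use far more efficient data structures and accumulate the distances into histograms such as those in the tables of this chapter.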
First we determined whether the results of these tools agree with one another and whether
they are correct. To do this we ran a simple benchmark, which copies the elements of the same
small array into a bigger array, under all the tools and analyzed their outputs. Initially SDSC had
not released the Pin version of its tool, but once it did, MIAMI and the SDSC tool agreed with
each other and our validation was successful. Here the reuse distance is expressed in cache lines,
and the cache line size is 64B for the results below. The results for the benchmarks LULESH v2
and CoMD are shown below in Tables 6.1.1-6.1.4:
6.1 Experimental Results and Analysis
Table 6.1.1: LULESH v2.0 Reuse distance statistic from SDSC pin version
**BIN 1 at 64B ** => 14270170499
**BIN 2 at 128B ** => 13760894854
**BIN 3 at 256B ** => 15882177107
**BIN 4 at 512B ** => 16987158718
**BIN 5 at 1KB ** => 13318831362
**BIN 6 at 2KB ** => 2562379129
**BIN 7 at 4KB ** => 4500681556
**BIN 8 at 8KB ** => 1111761162
**BIN 9 at 16KB ** => 106697662
**BIN 10 at 32KB ** => 43910894
**BIN 11 at 64KB ** => 21590644
**BIN 12 at 128KB ** => 87346003
**BIN 13 at 256KB ** => 211279502
**BIN 14 at 512KB ** => 433650672
**BIN 15 at 1MB ** => 900247491
**BIN 16 at 2MB ** => 516121369
**BIN 17 at 4MB ** => 117689841
**BIN 18 at 8MB ** => 90003498
**BIN 19 at 16MB ** => 271601262
**BIN 20 at 32MB ** => 253963925
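To compare this output with a histogram expressed in cache lines, the byte-valued bin labels have to be converted. The sketch below parses lines in the format shown in Table 6.1.1 and converts each label to cache lines. It assumes that each label is the upper bound of its bin in bytes, that KB and MB mean 1024 and 1024^2 bytes, and that the line size is 64B as stated above; the function name parse_sdsc_bins is hypothetical and not part of the SDSC tool.

```python
import re

UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2}
BIN_RE = re.compile(r"BIN\s+(\d+)\s+at\s+(\d+)(B|KB|MB)\s*\*\*\s*=>\s*(\d+)")


def parse_sdsc_bins(text, line_size=64):
    """Return a list of (upper bound in cache lines, count) pairs.

    Assumption: the size in each label is the bin's upper bound in bytes.
    """
    bins = []
    for match in BIN_RE.finditer(text):
        _, size, unit, count = match.groups()
        upper_bytes = int(size) * UNITS[unit]
        bins.append((upper_bytes // line_size, int(count)))
    return bins


# Three rows copied from Table 6.1.1.
sample = """
**BIN 1 at 64B ** => 14270170499
**BIN 10 at 32KB ** => 43910894
**BIN 20 at 32MB ** => 253963925
"""
bins = parse_sdsc_bins(sample)
print(bins)  # [(1, 14270170499), (512, 43910894), (524288, 253963925)]
```

Under this reading, BIN 10 ends at 512 cache lines and holds 43910894 accesses, while the MIAMI bin ending at 512 in Table 6.1.2 holds 43894505, a difference of well under 0.1%.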

Table 6.1.2: LULESH v2.0 Reuse distance statistics from MIAMI
Lower Bin    Upper Bin    Count
0            0            13254523859
1            1            13254511072
2            2            9992733852
3            4            13931279180
5            8            10303330054
9            16           13087162149
17           32           2741152632
33           64           4069413935
65           128          743118706
129          256          100075856
257          512          43894505
511          1024         21728132
1025         2048         87920491
2049         4096         210483852
4097         8192         438337980
8193         16384        921674248
16385        32768        491029597
32769        65536        118078692
65537        131072       87806220
131073       262144       273172469
262145       524288       253514418

[Figure: Frequency versus Reuse Distance for MIAMI and SDSC]
Figure 6.1.1: Validation Graph between MIAMI and SDSC for LULESH v2.0
Similarly, for CoMD we have the following results from both tools.

Table 6.1.3: CoMD Reuse distance statistics from MIAMI
Lower Bin    Upper Bin    Count
0            0            3513640117
1            1            4645020153
2            2            741947302
3            4            2770736516
5            8            653466712
9            16           1743899742
17           32           561849833
33           64           92554662
65           128          17387223
129          256          21053632
257          512          11445854
511          1024         49988
1025         2048         209118
2049         4096         19205479
4097         8192         846591
8193         16384        12001479
16385        32768        7888547
32769        65536        8544922
65537        131072       7006725
131073       262144       11181934
262145       524288       1310

Table 6.1.4: CoMD Reuse distance statistic from SDSC pin version
**BIN 1 at 64B ** => 3509617353
**BIN 2 at 128B ** => 3775952035
**BIN 3 at 256B ** => 621524550
**BIN 4 at 512B ** => 4226968135
**BIN 5 at 1KB ** => 1710695540
**BIN 6 at 2KB ** => 621170833
**BIN 7 at 4KB ** => 92444085
**BIN 8 at 8KB ** => 19422338
**BIN 9 at 16KB ** => 20338300
**BIN 10 at 32KB ** => 12087798
**BIN 11 at 64KB ** => 52135
**BIN 12 at 128KB ** => 170908
**BIN 13 at 256KB ** => 19240549
**BIN 14 at 512KB ** => 841011
**BIN 15 at 1MB ** => 12006288
**BIN 16 at 2MB ** => 7861014
**BIN 17 at 4MB ** => 8543060
**BIN 18 at 8MB ** => 6876553
**BIN 19 at 16MB ** => 11335530
**BIN 20 at 32MB ** => 203

[Figure: Frequency versus Reuse Distance for MIAMI and SDSC]
Figure 6.1.2: Validation Graph between MIAMI and SDSC for CoMD

After validating the results from the tools, we are confident that they give correct
results. The next step was to feed the reuse distance data into the probabilistic model that we
implemented.
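To give a concrete picture of how a reuse distance histogram can be turned into a miss count for a set-associative cache, the sketch below uses a common binomial formulation: an access with reuse distance d hits if fewer than A of the d intervening lines fall into the same one of S sets, each with probability 1/S. This formulation is an assumption made for illustration and may differ in detail from the model implemented in this work. The names hit_probability and expected_misses are hypothetical, and representing each histogram bin by a single distance is a further simplification.

```python
def hit_probability(distance, num_sets, associativity):
    """P(fewer than `associativity` of `distance` lines map to one set).

    Assumed binomial model: each intervening line maps to the set of
    interest independently with probability 1 / num_sets.
    """
    if distance < associativity:
        return 1.0  # too few intervening lines to evict the line
    if num_sets == 1:
        return 0.0  # fully associative LRU: distance >= ways is a miss
    p = 1.0 / num_sets
    term = (1.0 - p) ** distance  # i = 0: no intervening line in the set
    total = term
    for i in range(associativity - 1):
        # Ratio of consecutive binomial terms: C(d, i+1) / C(d, i) * p / (1 - p)
        term *= (distance - i) / (i + 1) * p / (1.0 - p)
        total += term
    return min(total, 1.0)


def expected_misses(histogram, num_sets, associativity):
    """histogram: iterable of (representative distance, access count)."""
    return sum(
        count * (1.0 - hit_probability(distance, num_sets, associativity))
        for distance, count in histogram
    )


# Toy histogram (not thesis data): 10 accesses at distance 0, 4 at distance 2.
print(expected_misses([(0, 10), (2, 4)], num_sets=2, associativity=1))  # 3.0
```

Cold (first-touch) accesses are not covered by this sketch, since they have no reuse distance.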
Since reuse distance is an architecture-independent metric, we tested these results with
numerous cache configurations. The probabilistic model gives the total number of
misses for a given configuration. We produced the miss rates for different configurations of
the different cache levels for LULESH v2 and CoMD, which are shown below:
LULESH v2.0


L1 Cache
o Total number of memory accesses = 84424941899
Table 6.1.5: Probabilistic model results for LULESH v2.0 for number of sets = 64
Associativity    Number of misses    Miss rate (%)
2                5220943904          6.184124959
4                3186556302          3.774425224
8                2933322885          3.474474271
16               2893907978          3.427787941
32               2840178265          3.364145952

[Figure: Miss Rate versus Associativity]
Figure 6.1.3: Miss rate vs. associativity for LULESH v2.0 for number of sets = 64
As shown in Figure 6.1.3 and Table 6.1.5, we set the associativity of the L1 cache
to 8 with 64 sets, because beyond an associativity of 8 the miss rate settles at
~3.4%; the L2 miss rate is then computed with respect to the resulting L1 misses.
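The choice of the point where the L1 miss rate "settles" was made by inspecting the table and graph. One way to make it mechanical is sketched below: pick the smallest associativity for which doubling the associativity improves the miss rate by less than a threshold. The threshold of 0.1 percentage points and the function name smallest_settled_associativity are assumptions made here for illustration; the threshold was simply picked so that the rule reproduces the L1 choice above.

```python
def smallest_settled_associativity(rates, threshold=0.1):
    """rates: list of (associativity, miss rate in %) in increasing order.

    Returns the first associativity whose successor lowers the miss rate
    by less than `threshold` percentage points (assumed heuristic).
    """
    for (assoc, rate), (_, next_rate) in zip(rates, rates[1:]):
        if rate - next_rate < threshold:
            return assoc
    return rates[-1][0]  # never settled: fall back to the largest value


# L1 miss rates for LULESH v2.0 from Table 6.1.5 (64 sets).
lulesh_l1 = [
    (2, 6.184124959),
    (4, 3.774425224),
    (8, 3.474474271),
    (16, 3.427787941),
    (32, 3.364145952),
]
print(smallest_settled_associativity(lulesh_l1))  # 8
```

This rule is only meant for the L1 choice. The L2 and L3 choices in this chapter weigh other considerations, so the same rule should not be expected to reproduce them.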

L2 Cache
Table 6.1.6: Probabilistic model results for LULESH v2.0 for number of sets = 512
Associativity    Number of misses    Miss rate (%)
2                2967659115          ~100
4                2839224665          96.79
8                2698929196          92.00
16               2380181414          81.14
32               1706289490          58.17

[Figure: Miss Rate versus Associativity]
Figure 6.1.4: Miss rate vs. associativity for LULESH v2.0 for number of sets = 512

o The miss rate decreases as the associativity increases, and this behavior more or
less agrees with the results from the ESESC simulator.
o We set the associativity of L2 to 16 with 512 sets. Although associativity 32 gives
a lower miss rate, that choice would affect the L3 miss rate, because we compute
the L3 miss rate with respect to the L2 misses.
L3 Cache
o Here the miss rate is expressed relative to the L2 misses. Since we fixed the L2
associativity at 16, the L3 miss rate is computed relative to the number of L2
misses at associativity 16.
Table 6.1.7: Probabilistic model results for LULESH v2.0 for number of sets = 8192
Associativity    Number of misses    Miss rate (%)
2                1863042570          78.27313326
4                1146124897          48.15283786
8                704327111           29.59132051
16               574304213           24.12858993
32               396547885           16.66040591
64               130972513           5.50262733

[Figure: Miss Rate versus Associativity]
Figure 6.1.5: Miss rate vs. associativity for LULESH v2.0 for number of sets = 8192
o The graph here is nearly a straight line sloping downwards. This is because, with
this model, the probability of a hit for the short reuse distances is nearly 100%,
so the misses come from the reuse distances in the later (longer) part of the data.
No reuse distances had a hit probability of 0% at this point. This is why the
miss rate decreases nearly linearly in L3.
o Here we take the associativity with the minimum miss rate, which is 64 with
8192 sets.
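The per-level miss rates reported above are local miss rates: each level's misses are divided by the accesses that reach it, which are the total memory accesses for L1 and the misses of the previous level for L2 and L3. The sketch below recomputes the rates for the selected LULESH v2.0 configuration from the miss counts in Tables 6.1.5-6.1.7; the function name local_miss_rates is hypothetical.

```python
def local_miss_rates(total_accesses, misses_per_level):
    """Return the miss rate (%) of each level relative to its own accesses."""
    rates = []
    incoming = total_accesses
    for misses in misses_per_level:
        rates.append(100.0 * misses / incoming)
        incoming = misses  # only the misses reach the next level
    return rates


# LULESH v2.0: L1 (8-way), L2 (16-way), L3 (64-way) miss counts from the tables.
rates = local_miss_rates(84424941899, [2933322885, 2380181414, 130972513])
for level, rate in zip(("L1", "L2", "L3"), rates):
    print(f"{level}: {rate:.2f}%")
```

The computed values round to the 3.47%, 81.14%, and 5.50% listed in the tables.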
CoMD version 1.1


L1 Cache
o Number of sets=64, Total number of memory accesses = 14839936529

Table 6.1.8: Probabilistic model results for CoMD for number of sets = 64
Associativity    Number of misses    Miss rate (%)
2                234863432           1.582644451
4                95010621            0.640236033
8                74272729            0.500492228
16               67000061            0.451484822

[Figure: Miss Rate versus Associativity]
Figure 6.1.6: Miss rate vs. associativity for CoMD for number of sets = 64

o We set the associativity of L1 to 8 with 64 sets, because beyond an
associativity of 8 the miss rate settles at ~0.5%; the L2 miss rate is then
computed with respect to the resulting L1 misses.


L2 Cache
Table 6.1.9: Probabilistic model results for CoMD for number of sets = 512
Associativity    Number of misses    Miss rate (%)
2                74814921            100
4                66245524            89.1922579
8                57981370            78.06549023
16               47224188            63.58213659
32               40906070            55.07549076

[Figure: Miss Rate versus Associativity]
Figure 6.1.7: Miss rate vs. associativity for CoMD for number of sets = 512
Here the miss rate decreases as the associativity increases. We set the associativity of L2 to 8
with 512 sets. Although associativities 16 and 32 give lower miss rates, those choices would
affect the L3 miss rate, because we compute the L3 miss rate with respect to the L2 misses.



L3 Cache
Table 6.1.10: Probabilistic model results for CoMD for number of sets = 8192
Associativity    Number of misses    Miss rate (%)
2                43008932            74.17715725
4                32604366            56.2324864
8                23209538            40.02930252
16               14981312            25.83814767
32               5855854             10.09954404

[Figure: Miss Rate versus Associativity]
Figure 6.1.8: Miss rate vs. associativity for CoMD for number of sets = 8192

o The graph here is nearly a straight line sloping downwards. This is because, with
this model, the probability of a hit for the short reuse distances is nearly 100%,
so the misses come from the reuse distances in the later (longer) part of the data.
No reuse distances had a hit probability of 0% at this point. This is why the
miss rate decreases nearly linearly in L3.
o Here we take the associativity with the minimum miss rate, which is 32 with
8192 sets.
Here the L3 cache configuration is not the same as the original Sandy Bridge configuration,
because the ESESC simulator does not allow us to configure an associativity of 20 and a cache
size of 20MB, which is the Sandy Bridge configuration. ESESC accepts only power-of-2 values
for these parameters, so the cache size must also be a power of 2. We fed our
configuration for LULESH version 2 into the ESESC simulator and obtained a 19% improvement
in the number of cycles spent. The comparison is shown below.

Figure 6.1.9: ESESC result with the original Sandy Bridge configuration except for L3
configuration

Figure 6.1.10: ESESC result with our predicted configuration for LULESH version 2

Here we can see a marked reduction in the number of cycles spent, and the miss rate also
decreased with our configuration. We therefore conclude that our configuration is better than
the existing configuration for LULESH version 2.
For CoMD, the existing configuration is the same as the configuration that we predicted.
We also validated our results using PAPI. Our probabilistic model agreed with PAPI in
terms of the L1 miss rate but did not agree in terms of the L2 miss rate. We assume that the
difference may be due to hardware prefetching. However, the probabilistic model did agree with
ESESC in terms of the L2 miss rate. More analysis of these results is required.
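The power-of-2 restriction mentioned above can be checked directly from the number of sets, the associativity, and the line size, since the capacity of a set-associative cache is their product. The sketch below does this for the configuration selected for LULESH v2.0 (64 sets and 8 ways for L1, 512 sets and 16 ways for L2, 8192 sets and 64 ways for L3) with the 64B line size used in this chapter. The helper names are hypothetical, and reading 20MB as 20 x 1024^2 bytes is an assumption.

```python
def cache_size_bytes(num_sets, associativity, line_size=64):
    """Capacity of a set-associative cache: sets x ways x line size."""
    return num_sets * associativity * line_size


def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# Selected LULESH v2.0 configuration as (number of sets, associativity).
predicted = {"L1": (64, 8), "L2": (512, 16), "L3": (8192, 64)}
sizes = {
    level: cache_size_bytes(sets, ways)
    for level, (sets, ways) in predicted.items()
}
for level, size in sizes.items():
    print(level, size // 1024, "KiB", is_power_of_two(size))

# The Sandy Bridge L3 values that could not be configured.
print(is_power_of_two(20), is_power_of_two(20 * 1024 ** 2))  # False False
```

With these inputs the selected configuration corresponds to a 32 KiB L1, a 512 KiB L2, and a 32 MiB L3, all of which are powers of 2.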

6.2 Roofline Modeling for LULESH v2.0 [19]

The probabilistic model suggests a cache configuration for LULESH v2 that minimizes the L1,
L2, and L3 miss rates, and that configuration improved performance and reduced the miss rate
for the LULESH v2 benchmark. Our conjecture, however, is that the performance of LULESH v2
can be improved further by increasing the memory bandwidth. We therefore used the roofline
model for the Intel Xeon E5-2680, the processor used in Stampede, to understand how
performance could be improved by increasing the memory bandwidth. The peak performance of
the Intel Xeon E5-2680 is 172.8 GFLOPS and its theoretical maximum bandwidth is
51.2 GBytes/s (~12.8 GWords/s). However, running the STREAM benchmark on the Intel Xeon
E5-2680 gave different results.
Table 6.2.1: Memory bandwidth with different numbers of threads for Intel E5-2680
Number of Threads    Triad (MB/s)
1                    14737.5
2                    30422.0
3                    39558.0
4                    52736.4
5                    47857.4
6                    59957.9
7                    54336.2
8                    61696.1
9                    55880.6
10                   61855.3
11                   56444.6
12                   61458.8
13                   56820.6
14                   61100.6
15                   57007.2
16                   60651.5

[Figure: Bandwidth versus Number of Threads]
Figure 6.2.1: Maximum bandwidth graph from stream benchmark for Intel E5-2680

As shown, the maximum rate achieved is around 60 GB/s. We therefore take the maximum
bandwidth of the Intel E5-2680 to be 60 GB/s (~15 GWords/s). At this rate the machine balance
point is 172.8 / 15 = 11.52 FLOPS/word. LULESH version 2 falls far below this point in the
roofline model: it achieves 0.875 FLOPS/word. This metric was calculated with the help of
PAPI. First, we measured the total number of floating point operations (F = 73878301766) for
this application, and then the total number of memory accesses (M = 84424941899). Dividing F
by M gives 0.875 FLOPS/word. From this result we concluded that increasing the memory
bandwidth of the system, i.e., increasing the slope of the memory bandwidth roofline as shown
in Figure 6.2.2, can improve the performance of the LULESH benchmark.
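The arithmetic above can be reproduced with the standard roofline relation, in which attainable performance is the smaller of the peak rate and the product of operational intensity and bandwidth. The sketch below uses the numbers quoted in this section; it assumes 4-byte words, which is what makes 60 GB/s correspond to about 15 GWords/s, and the function name attainable_gflops is hypothetical.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gwords):
    """Roofline bound: min(peak, operational intensity x bandwidth)."""
    return min(peak_gflops, intensity * bandwidth_gwords)


PEAK_GFLOPS = 172.8            # Intel Xeon E5-2680 peak, from the text
BANDWIDTH_GWORDS = 60.0 / 4    # 60 GB/s measured, assuming 4-byte words

F = 73878301766                # floating point operations (PAPI)
M = 84424941899                # memory accesses (PAPI)

balance = PEAK_GFLOPS / BANDWIDTH_GWORDS   # machine balance, FLOPS/word
intensity = F / M                          # LULESH v2, FLOPS/word
bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GWORDS)

print(f"machine balance: {balance:.2f} FLOPS/word")
print(f"LULESH intensity: {intensity:.3f} FLOPS/word")
print(f"bandwidth-limited bound: {bound:.1f} GFLOPS")
```

Under these assumptions the bound for LULESH v2 comes out at roughly 13 GFLOPS, far below the 172.8 GFLOPS peak, and it scales linearly with bandwidth as long as the intensity stays below the balance point.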
We tried to feed this memory bandwidth to the ESESC simulator, but the simulator has a bug that
causes it to disregard this setting. We reported the bug to the developers of the simulator and
hope that it will be fixed soon.

[Figure: roofline plot of GFLOPS/sec versus FLOPS/word, showing the peak of 172.8 GFLOPS/sec,
the bandwidth roofline for 60 GBytes/sec (~15 Gwords/sec), the machine balance point at 11.52
FLOPS/word, and LULESH v2 at 0.875 FLOPS/word]
Figure 6.2.2: Roofline Model for Intel Xeon E5-2680 annotated with the LULESH v2
benchmark
Chapter 7
Conclusion
Our goal was to predict the cache hierarchy configuration that provides the best
performance for a given application. To achieve this goal we developed a methodology. The
first step is to collect the reuse distances of an application with the PMaC locality
measurement tool from SDSC and with MIAMI, which are described in Chapter 3. We then
validated the results of these tools and determined whether they were comparable and correct.
Given an application’s reuse distances and their frequencies, the probabilistic model outputs
the total number of cache misses for a given cache configuration, i.e., a given associativity
and number of sets. The miss rate for the L1 cache is expressed in terms of the total number of
memory accesses, whereas the miss rates for the L2 and L3 caches are expressed in terms of the
number of memory accesses that miss in L1 and L2, respectively. From these results we are able
to predict the cache configuration with the best performance for the given application. This
prediction is verified via simulation. With our configuration we were able to reduce the number
of cycles spent, as discussed in Chapter 6.
The miss rates provided by the probabilistic model allow us to predict the performance of
the application. The results from PerfExpert hinted that the LULESH benchmark is a
memory-bound application, i.e., it is limited by memory bandwidth. To support this prediction
we used the roofline model to show where this benchmark falls on the model and what can be
done to maximize its performance. The next step would have been to use simulation to verify
this suggestion, but due to the bug in the ESESC simulator we were not able to do so.

References
[1] Air Force Funds Research Grant for Computational Chemistry and Physics Applications.
Web. <http://engineering.utep.edu/docs/announcement012813.pdf>.
[2] Using MACPO. Ashay Rane, Web. <http://tacc.github.io/perfexpert/user_manualch4.html>.
[3] MIAMI: Machine Independent Application Models for Performance Insight. Gabriel Marin,
Web. <http://web.eecs.utk.edu/~gmarin/miami.html>.
[4] PAPI. Web. <http://icl.cs.utk.edu/papi/>.
[5] PEBIL: Static Binary Instrumentation for X86/Linux. Web.
<http://www.sdsc.edu/PMaC/projects/pebil.html>.
[6] PAPI. Web. <http://icl.cs.utk.edu/papi/overview/index.html>.
[7] Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). Web.
<https://codesign.llnl.gov/lulesh.php>.
[8] CoMD. Web. <https://github.com/exmatex/CoMD>.
[9] Using PerfExpert. Web.
<http://tacc.github.io/perfexpert/user_manualch3.html#x7-260003.3>.
[10] Ding, Chen, and Yutao Zhong. Reuse Distance Analysis. Tech., Feb. 2001. Web.
<http://www.cs.rochester.edu/~cding/Documents/Publications/TR741.pdf>.
[11] Burtscher, Martin, Byoung-Do Kim, Jeff Diamond, John McCalpin, Lars Koesterke, and
James Browne. Proc. of PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC
Application. N.p., Nov. 2010. Web
[12] Fast Static Binary Instrumentation for Linux/x86. Web.
<https://github.com/amoghavs/PEBIL/tree/development>.
[13] Web. <https://github.com/mlaurenzano/PEBIL/blob/master/external/ReuseDistance/README>.
[14] Kim, Hyesoon, Jaekyu Lee, Nagesh B. Lakshminarayana, Jaewoong Sim, Jieun Lim, and
Tri Pho. "MacSim: A CPU-GPU Heterogeneous Simulation Framework User Guide."
[15] Rodrigues, Arun, Keren Bergman, David Bunde, Elliot Cooper-Balis, Kurt Ferreira, and K.
Scott Hemmert. "Improvements to the Structural Simulation Toolkit."
[16] Web. <https://code.google.com/p/sst-simulator/w/list>.
[17] GPUWattch Energy Model Manual. Web. <http://www.gpgpu-sim.org/gpuwattch/>.

[18] G. Marin and J. Mellor-Crummey. Scalable cross-architecture predictions of memory
hierarchy response for scientific applications. In Proceedings of the Symposium of the Los
Alamos Computer Science Institute, Santa Fe, New Mexico, 2005.
[19] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful
visual performance model for multicore architectures. Commun. ACM 52, 4 (April 2009), 65-76.
DOI=10.1145/1498765.1498785 http://doi.acm.org/10.1145/1498765.1498785
[20] Ehsan K. Ardestani and Jose Renau. 2013. ESESC: A fast multicore simulator using
Time-Based Sampling. In Proceedings of the 2013 IEEE 19th International Symposium on High
Performance Computer Architecture (HPCA) (HPCA '13). IEEE Computer Society,
Washington, DC, USA, 448-459. DOI=10.1109/HPCA.2013.6522340
http://dx.doi.org/10.1109/HPCA.2013.6522340
[21] Michael Laurenzano, Mustafa M. Tikir, Laura Carrington, Allan Snavely: PEBIL: Efficient
static binary instrumentation for Linux. ISPASS 2010: 175-183
[22] Pin - A Dynamic Binary Instrumentation Tool. Intel. Web.
https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
[23] Intel Corporation. XED. Web.
http://software.intel.com/sites/landingpage/pintool/docs/61206/Xed/html/
[24] How Caching Works. Web. <http://computer.howstuffworks.com/cache2.htm>.
[25] Computer Architecture A Quantitative Approach, Fifth Edition. Web.
<http://www.cs.gmu.edu/~menasce/cs465/slides/CAQA5e_ch2-Complete.pdf>.
[26] Hardware Prefetching. Web.
<http://www.futurechips.org/chip-design-for-all/prefetching.html>.

Curriculum Vitae
Sonish Shrestha, from Nepal, was born on September 13, 1987. He received his bachelor’s
degree in Electronics and Communication Engineering in 2010 from the Advanced College of
Engineering and Management (affiliated with T.U.), Nepal. After graduating he taught MCSE
(Microsoft Certified System Engineering), and he then worked as a site engineer at
Mobicon Tele-Networks, Nepal.
He started his Master’s degree in Computer Science at the University of Texas at El Paso in
spring 2012. He worked as a Teaching Assistant for the course Intro Computer Science and was
responsible for conducting practical classes and grading. After working five months as a
Teaching Assistant, he worked as a Research Assistant on the project CoDAASH (Co-design
Approach for Advances in Software and Hardware), which is funded by the United States Air
Force Office of Scientific Research (AFOSR). While he was pursuing his master’s degree, his
abstract was selected for a poster presentation at the International Conference on
Supercomputing 2013, held in Eugene, OR; the poster’s title is “Using Platform-independent
Data Locality Analysis to Predict Cache Performance on Abstract Hardware Platforms”.

Permanent address:

3500 Sun Bowl Drive Apartment Number 16
El Paso, TX, 79902


