SMART: A Simulation Tool for Analyzing Cache Access Behavior on SMPs by Tianchao Li & Michael Gerndt
Institut für Informatik, Technische Universität München
Boltzmannstr. 3, D-85748 Garching bei München, Germany
Email: {gerndt, lit}@in.tum.de
Abstract
This paper presents SMART, a simulation tool for analyzing cache access behavior on SMP systems. SMART traps the memory access events of multi-threaded applications, simulates the accesses in multiple levels of caches of multiple processors and the shared memory, emulates a novel hardware monitor that records events within given address ranges of interest, and presents the results as event counts or histograms of arbitrary granularity. Used independently or together with the advanced tools developed in the EP-Cache project, SMART can help evaluate the performance of multi-threaded applications with different hardware configurations and facilitate the application of effective code transformations for optimization.
1 Introduction
With the dramatic widening of the gap between CPU performance and memory performance, the cache access behavior of applications is becoming an increasingly important issue that can significantly influence application efficiency. This is especially important for shared memory applications running on multi-processors, because cache coherence must be maintained between processors.
Monitoring helps understand the access behavior and facilitates the application of various types of code transformations for optimization. Fine-grained monitoring that targets particular code elements and specific data structures is especially useful for transformations based on data reorganization, such as padding.
In the German project EP-Cache (Tools for Efficient Parallel Programming of Cache Architectures) [1], a novel hardware monitor [2] has been designed to provide such fine-grained information. A complete software infrastructure has also been designed and implemented to monitor, analyze, visualize, and optimize the cache access behavior of OpenMP applications.
Because the hardware monitor has not actually been implemented, a simulation tool, SMART (Simulator for Monitoring Address Ranged Targets) [3], has been developed to facilitate the design and implementation of the higher-level tools and libraries. SMART simulates cache accesses in the multi-level cache hierarchy of a multi-processor shared memory system and functionally emulates the monitoring of access events by the associated hardware monitors, which target specific address ranges and report event counts or histograms of arbitrary granularity. With binary instrumentation that gathers memory reference data on the fly, and with programming interfaces, SMART offers tool and library developers much the same functionality and behavior as a real implementation of the designed hardware monitor, and it is applicable to realistic large-scale applications.
The remainder of this paper is organized as follows: Section 2 describes the architecture of SMART, with a focus on the functionality of each component. Section 3 discusses how to speed up cache monitoring in our simulator, which is critical for a tool that targets the analysis of realistic applications. Section 4 concludes the paper with a short summary.
2 SMART Architecture
SMART simulates symmetric multi-processor (SMP) systems. Such systems possess multiple CPUs that share common memory and disks, and are characterized by a global physical address space and uniform memory access (UMA), i.e. the same latency for accesses to the whole memory from any processor. SMPs are widespread as compute servers for high performance parallel computing, and are becoming more common on the desktop. An illustrative model of the simulated architecture is shown in Figure 1.
SMART includes a memory reference trapper module, which instruments the program and generates memory access events; a memory hierarchy simulation module, which simulates the access behavior in a full memory hierarchy on multiple processors with shared memory; and a memory reference monitor module, which simulates the hardware monitors associated with each processor that record accesses within given address ranges of interest and present the results as event counts or histograms. Figure 2 shows the schematic structure of, and the relationships among, the simulator modules.
Figure 1. The target SMP architecture
Figure 2. The schematic modular structure
2.1 Memory Reference Trappers
The memory reference trapper is the front-end of SMART. It detects the access events in the application binaries and feeds them to the back-end simulator.
SMART defines a uniform interface to memory reference trappers, and thus makes it possible to integrate different memory reference trapper front-ends. The currently supported front-ends are (1) Augmint [4], which is based on assembly instrumentation and supports parallel programs written with the specific macros used in the SPLASH benchmark applications. This trapper offers fast instrumentation but unfortunately lacks sufficient support for PThread; and (2) MemAccess [5], a trapper based on Valgrind [6] with runtime binary instrumentation capability and PThread support. Programs in any shared memory programming model (OpenMP, for example) that is based on this thread library are thus supported, regardless of the compiler used.
2.2 Memory Hierarchy Simulation Module
This module models the complete memory hierarchy in detail. It consists of a multi-level data cache simulator, a TLB simulator, and a shared memory simulator. For multi-processors, the cache simulator also models cache coherence protocols, including full invalidation and MESI.
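As a rough illustration of the coherence modelling mentioned above, the following sketch applies simplified MESI state transitions to a single cache line that is shared by several simulated processors. The function names and the list-based state representation are assumptions made for this example; they are not taken from the SMART sources, and bus transactions, write-backs and timing are not modelled.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def mesi_read(states, cpu):
    """Apply a read by `cpu` to one cache line; `states` holds one State per CPU."""
    if states[cpu] is not State.INVALID:
        return  # read hit: M, E and S all stay unchanged
    others_hold_line = False
    for other, state in enumerate(states):
        if other != cpu and state is not State.INVALID:
            others_hold_line = True
            # A remote M or E copy is downgraded to S (M would also write back).
            states[other] = State.SHARED
    states[cpu] = State.SHARED if others_hold_line else State.EXCLUSIVE


def mesi_write(states, cpu):
    """Apply a write by `cpu`: every other copy is invalidated, the writer holds M."""
    for other in range(len(states)):
        if other != cpu:
            states[other] = State.INVALID
    states[cpu] = State.MODIFIED


# Two CPUs touching the same line.
line = [State.INVALID, State.INVALID]
mesi_read(line, 0)   # CPU 0 is the only holder, so it gets E
mesi_read(line, 1)   # CPU 1 joins, so both copies become S
mesi_write(line, 1)  # CPU 1 writes: CPU 0 is invalidated, CPU 1 holds M
```

Each invalidation caused by a remote write turns the next local access to that line into a miss, which is why coherence matters for the miss counts that the monitor records.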
The memory hierarchy simulator is highly configurable. Parameters of the simulated system, including the number of cache levels, the cache size, cache-line size, and associativity of each cache level, the page size, and the TLB size, can all be configured through a System Parameter File. This gives users considerable flexibility.
2.3 Memory Reference Monitor Module
The memory reference monitor module simulates a set of hardware monitors attached to the processors, following the design of a conceptual hardware monitor [2]. It can be configured into two working modes: static mode and dynamic mode.
The static mode allows the monitor to be programmed explicitly to count predefined types of access events to specific address regions of interest. In combination with higher-level, language-specific tools, this can be applied to given data structures or parts of arrays.
The dynamic mode enables fine-grained monitoring of memory accesses to selected address ranges. Based on a predefined event type and address range, it provides a histogram of access events. The histogram's granularity can be configured in multiples of cache lines.
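The two working modes can be pictured with the small sketch below. The class names, the event-type strings, the 64-byte cache line and the event stream are assumptions chosen for illustration only; the actual monitor design is described in [2].

```python
class StaticCounter:
    """Static mode: count events of one type inside [start, end)."""

    def __init__(self, event_type, start, end):
        self.event_type = event_type
        self.start = start
        self.end = end
        self.count = 0

    def record(self, event_type, address):
        if event_type == self.event_type and self.start <= address < self.end:
            self.count += 1


class DynamicHistogram:
    """Dynamic mode: histogram of events of one type over [start, end)."""

    def __init__(self, event_type, start, end, line_size=64, lines_per_bin=1):
        self.event_type = event_type
        self.start = start
        self.end = end
        self.bin_bytes = line_size * lines_per_bin  # a multiple of cache lines
        n_bins = -(-(end - start) // self.bin_bytes)  # ceiling division
        self.bins = [0] * n_bins

    def record(self, event_type, address):
        if event_type == self.event_type and self.start <= address < self.end:
            self.bins[(address - self.start) // self.bin_bytes] += 1


# Hypothetical event stream: (event type, address).
events = [("L1_MISS", 0x1000), ("L1_MISS", 0x1040), ("L1_HIT", 0x1040),
          ("L1_MISS", 0x10C0), ("L1_MISS", 0x2000)]
counter = StaticCounter("L1_MISS", 0x1000, 0x1100)
histogram = DynamicHistogram("L1_MISS", 0x1000, 0x1100, lines_per_bin=2)
for event_type, address in events:
    counter.record(event_type, address)
    histogram.record(event_type, address)
```

With these inputs the static counter sees three misses inside the range, and the histogram splits them over two bins of two cache lines each.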
2.4 The ePAPI Interface
SMART defines and implements ePAPI, a PAPI-like [7] interface for controlling the underlying monitor and retrieving the monitoring results. In addition, ePAPI provides extended support for histograms and for user-defined memory areas to be monitored.
As ePAPI is intended to be used in individual threads of the shared memory application, special effort has been devoted to making it suitable for use in a multi-threaded environment. Like PAPI, the ePAPI interface assumes kernel-level threads.
With this interface, performance measurement can be refined to specific elements (threads, code regions) of a sequential or parallel application and to specified ranges of memory addresses. Moreover, it allows higher-level environments that enable automatic performance analysis to be built on top of SMART.
3 Speeding Up Cache Monitoring
The efficiency of cache monitoring with simulators is a critical issue. Cache simulation usually introduces a large amount of overhead for any application of even medium size, especially for fine-grained monitoring of multi-processors.
The simulation overhead comes from several sources, including runtime instrumentation, cache simulation, and event monitoring. The efficiency is largely determined by the frequency of memory accesses and the level of detail to be simulated. In order to increase the efficiency of SMART, we need to make improvements on both sides.
3.1 Optimization for Simulation
The efficiency and the accuracy of a simulation are conflicting goals. Fully detailed simulation of the hardware can be very slow. As SMART is intended primarily for the evaluation of the memory hierarchy, we have implemented a simplified processor model rather than simulating the cycle-accurate operations within the CPU. Other simplifications are also made in order to increase the efficiency. For example, the simulated caches do not actually store the accessed data, but only the address information, i.e. the cache tags.
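The tag-only simplification can be sketched as follows. This is an illustrative set-associative cache with LRU replacement written for this text; the replacement policy, the class name and the default parameters are assumptions and do not describe SMART's actual implementation.

```python
from collections import OrderedDict


class TagOnlyCache:
    """Set-associative cache that keeps only tags, never the cached data."""

    def __init__(self, size=1024, line_size=64, associativity=2):
        self.line_size = line_size
        self.associativity = associativity
        self.num_sets = size // (line_size * associativity)
        # One LRU-ordered tag store per set; the least recently used tag is first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Simulate one access and return True on a hit, False on a miss."""
        block = address // self.line_size
        tags = self.sets[block % self.num_sets]
        tag = block // self.num_sets
        if tag in tags:
            tags.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        if len(tags) >= self.associativity:
            tags.popitem(last=False)  # evict the least recently used tag
        tags[tag] = None
        self.misses += 1
        return False


cache = TagOnlyCache(size=256, line_size=64, associativity=2)  # two sets
results = [cache.access(a) for a in (0x000, 0x004, 0x080, 0x100, 0x000)]
```

Because only tags are stored, the memory footprint of the simulated cache depends on the number of lines rather than on the cache size in bytes.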
The monitor simulator largely follows the design of the hardware monitor. However, the hardware monitor has been designed to take advantage of the relatively faster hardware processing capabilities and to cope with limited hardware resources. Thus, aspects of the design that are suitable for hardware are not necessarily efficient in the software simulator. The ring buffer used by the dynamic mode to reduce interaction with software by buffering access events is one such example. For the sake of efficiency, simplifications and optimizations have been made to the monitor simulator. The simulator keeps all the functions of the hardware monitor but is tuned for software execution.
3.2 Selective Monitoring
With ePAPI, one can turn monitoring on and off for selected code regions and restrict measurement to selected address regions. This selective measurement greatly reduces the number of events to be recorded and processed in the monitor, and thus also improves the efficiency of the simulation.
Many hardware-based cache monitoring tools actually employ this approach; sampling is a similar approach. For hardware-based monitoring, this approach alone is sufficient to dramatically decrease the overhead of processing the monitoring data. However, this is not the case for simulation-based tools like SMART, where the simulation of cache accesses and of coherence constitutes the major part of the overhead.
3.3 “Lazy” Cache Simulation
Usually, every memory access of the application needs to be handled by the cache simulator, so that a correct cache context is maintained in the cache simulator at all times. For measurements that target a specific code region, some speedup of the simulation can be achieved by “starting” the simulator in a lazy way.
“Lazy” cache simulation means that cache simulation is started only when the specific code region is entered, and stopped when the region is left. Thus, at the cost of losing the cache context between successive starts of monitoring, the overhead of the simulation is greatly reduced.
In general, this affects the monitoring result. However, for our address-centric monitoring, especially in the dynamic mode, the loss of cache context does not usually cause a real problem. For a single run of a code region (enter and leave), the loss of context introduces at most one miscount for each monitored address bin (one cache block). Such a difference of one is statistically unimportant if the number of access events is large; otherwise, the code region does not deserve attention at all. This can be illustrated with the following example:
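do_something_first(A);   /* runs before monitoring starts */
ePAPI_start();           /* start monitoring */
do_something(A);         /* monitored code region */
ePAPI_stop();            /* stop monitoring */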
In the above example, the monitoring has been targeted to function do_something(), which is preceded by another function do_something_first() and wrapped within a loop. If a specific cache block has been accessed in function do_something_first(), it is possible (but only possible) that it is still in the cache, so that when it is accessed again in do_something(), a cache hit occurs. If the context is not kept, this results in only one false cache miss. Later accesses to the same address will be cache hits regardless of whether the cache context was preserved.
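The effect of the lost context can be reproduced with a toy experiment. The sketch below uses a small fully associative LRU cache (an assumption chosen for brevity, as are the address traces and the function name) and compares the misses counted inside the monitored region when the preceding accesses were simulated (warm start) with those counted when simulation starts only at the region boundary (cold start).

```python
from collections import Counter, OrderedDict

LINE_SIZE = 64


def misses_per_block(trace, warmup=(), capacity=4):
    """Return a Counter of misses per cache block for `trace`.

    Accesses in `warmup` are simulated first but not counted, which models
    the context left behind by code that ran before monitoring started.
    """
    cache = OrderedDict()  # fully associative, LRU order (oldest first)
    misses = Counter()
    for counted, addresses in ((False, warmup), (True, trace)):
        for address in addresses:
            block = address // LINE_SIZE
            if block in cache:
                cache.move_to_end(block)  # hit: refresh LRU position
                continue
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = None
            if counted:
                misses[block] += 1
    return misses


first = [0, 64, 128]            # stands in for do_something_first(A)
region = [0, 64, 0, 192, 64]    # stands in for do_something(A)

active = misses_per_block(region, warmup=first)  # context preserved
lazy = misses_per_block(region)                  # context lost
extra = {block: lazy[block] - active[block] for block in lazy}
```

In this trace, blocks 0 and 1 each show one additional miss under the cold start, while block 3 misses in both runs, so no block is miscounted by more than one.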
Of course, there are special cases in which the loss of cache context compromises the correctness of the monitoring result. For example, if the above example is part of a long loop and the number of accesses to A in do_something() is small, the result will be wrong. Currently, it is left to the user to judge this when applying the monitor to specific code regions, and a command-line option is offered to enable or disable the “lazy” simulation.
3.4 Experimental Result
In order to examine the simulation overhead of SMART, we have performed measurements for several real-world applications, including GeoFEM, a parallel finite element solid earth simulator from the Research Organization for Information Science and Technology of Japan; LM, the operational local model used by Deutscher Wetterdienst for numerical weather forecasts for the central European area; and Crash, a car crash simulation application based on the finite element method from the Fraunhofer-Institute for High-Speed Dynamics EMI. For both active cache simulation and lazy cache simulation, the monitor is configured in static mode and/or dynamic mode to monitor a single event, the L1 access miss. For each application, a small nested loop has been chosen for selective monitoring; it is identified in the table by its defining file name and its starting and ending line numbers.
Table 1. Simulation Overhead with Real-world Applications
Configuration               Mode     Time                                                        Overhead
direct execution                     1.250s              4.421s              17.997s             1
+ active cache simulation            11m46.570s          56m00.619s          109m21.809s         563.337
  + selective monitoring    static   13m57.800s          59m02.172s          134m43.208s         x 1.157
                            dynamic  13m45.190s          56m36.063s          134m39.320s         x 1.137
  + full monitoring         static   14m18.630s          56m21.923s          134m35.754s         x 1.151
                            dynamic  13m42.910s          56m32.416s          134m38.618s         x 1.135
+ lazy cache simulation              1m27.620s           5m24.586s           22m53.524s          73.278
  + selective monitoring    static   1m29.160s           5m24.897s           22m56.957s          x 1.007
                            dynamic  1m28.080s           5m25.345s           23m02.172s          x 1.005
  + full monitoring         static   14m35.290s          58m55.983s          135m58.464s         x 8.941
                            dynamic  14m58.120s          59m13.146s          136m14.339s         x 9.049
Note: monitoring target              static linear.F     kernel.f90          src relaxation.f90
                                     Line: 841-843       Line: 1416-1441     Line: 624-632
The execution times of each application, for direct execution and for the different combinations of active/lazy cache simulation and selective/full monitoring in static/dynamic mode, are shown in Table 1. The data clearly show how selective monitoring and “lazy” cache simulation together can greatly reduce the overhead, especially for large programs in which only selected parts of the code are monitored.
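The text does not state how the Overhead column of Table 1 is derived. The figures are, however, consistent with the arithmetic mean of the per-application slowdown ratios, taken relative to direct execution for the two cache simulation rows and relative to the enclosing cache simulation row for the "x" entries. The short check below, with helper names invented for this example, reproduces three of the entries under that assumption.

```python
def seconds(text):
    """Convert a time such as '11m46.570s' or '1.250s' to seconds."""
    minutes, _, rest = text.rstrip("s").rpartition("m")
    return (int(minutes) if minutes else 0) * 60 + float(rest)


def mean_ratio(times, baseline):
    """Arithmetic mean of the per-application ratios times[i] / baseline[i]."""
    ratios = [seconds(t) / seconds(b) for t, b in zip(times, baseline)]
    return sum(ratios) / len(ratios)


# Times copied from Table 1, one entry per application column.
direct = ["1.250s", "4.421s", "17.997s"]
active = ["11m46.570s", "56m00.619s", "109m21.809s"]
lazy = ["1m27.620s", "5m24.586s", "22m53.524s"]
lazy_full_static = ["14m35.290s", "58m55.983s", "135m58.464s"]

active_overhead = mean_ratio(active, direct)       # table entry: 563.337
lazy_overhead = mean_ratio(lazy, direct)           # table entry: 73.278
full_vs_lazy = mean_ratio(lazy_full_static, lazy)  # table entry: x 8.941
```

Read this way, lazy cache simulation with selective monitoring is roughly 7 to 8 times faster than active cache simulation on average, whereas full monitoring under lazy simulation brings the run time back to about the level of active simulation.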
4 Conclusion
This paper presents SMART, a flexible simulator for analyzing the cache access behavior of multi-threaded applications on SMPs. Based on software instrumentation, it traps the memory access events of applications and records detailed access behavior in a multi-level memory hierarchy, targeting user-defined address ranges.
This environment is capable of providing detailed information about the runtime cache access behavior. Depending on the configured working mode, the results of monitoring memory accesses are either event counts or histograms of arbitrary granularity. The distinguishing features of this tool include its support for programs in any shared memory programming model (OpenMP, for example) based on PThread, built with arbitrary compilers; its support for fine-grained monitoring of accesses within given address ranges of interest and/or specific code regions, with results presented as event counts or histograms of arbitrary granularity; and its capability of integration with higher-level, language-specific tools through the ePAPI interface.
SMART helps programmers of numerical applications understand the access behavior and identify appropriate code transformations for optimization. It can also be used by researchers to evaluate computer hardware designs, cache coherence schemes, etc.
References
[1] T. Brandes et al. Monitoring Cache Behavior on Parallel SMP Architectures and Related Programming Tools. Future Generation Computer Systems, to appear.
[2] M. Schulz, J. Tao, J. Jeitner, W. Karl. A Proposal for a New Hardware Cache Monitoring Architecture. Proceedings of SIGPLAN Workshop on Memory System Performance (MSP 2002), Berlin, Germany, June 2002.
[3] SMART, http://wwwbode.cs.tum.edu/~lit/smart/
[4] A-T. Nguyen, M. Michael, A. Sharma, J. Torrellas. The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures. Proceedings of 1996 International Conference on Computer Design, October 1996.
[5] MemAccess skin for Valgrind, http://kcachegrind.sourceforge.net/cgi-bin/show.cgi/KcacheGrindValgrind
[6] N. Nethercote and J. Seward. Valgrind: A Program Supervision Framework. Proceedings of the 3rd Workshop on Runtime Verification (RV'03), Boulder, Colorado, USA, July 2003.
[7] S. Browne, J. Dongarra, N. Garner, G. Ho, P. Mucci. A Portable Programming Interface for Performance Evaluation on Modern Processors. The International Journal of High Performance Computing Applications, 14(3):189–204, Fall 2000.