SmartApps, an Application Centric Approach to High Performance Computing:
Compiler-Assisted Software and Hardware Support for Reduction Operations
Francis Dang, María Jesús Garzarán, Milos Prvulovic, Ye Zhang, Alin Jula,
Hao Yu, Nancy Amato, Lawrence Rauchwerger, and Josep Torrellas
Abstract
State-of-the-art run-time systems are a poor match to di-
verse, dynamic distributed applications because they are
designed to provide support to a wide variety of applica-
tions, without much customization to individual speciﬁc re-
quirements. Little or no guiding information ﬂows directly
from the application to the run-time system to allow the lat-
ter to fully tailor its services to the application. As a result,
the performance is disappointing. To address this prob-
lem, we propose application-centric computing, or SMART
APPLICATIONS. In the executable of smart applications,
the compiler embeds most run-time system services, and
a performance-optimizing feedback loop that monitors the
application’s performance and adaptively reconﬁgures the
application and the OS/hardware platform. At run-time,
after incorporating the code’s input and the system’s re-
sources and state, the SMARTAPP performs a global op-
timization. This optimization is instance speciﬁc and thus
much more tractable than a global generic optimization be-
tween application, OS and hardware. The resulting code
and resource customization should lead to major speedups.
In this paper, we ﬁrst describe the overall architecture of
SMARTAPPS and then present some achievements to date,
focusing on compiler-assisted software and hardware techniques
for parallelizing reduction operations. These illustrate
SMARTAPPS’ use of adaptive algorithm selection and
moderately reconfigurable hardware.
1 Introduction
Many important applications are becoming large consumers of
computing power, data storage and communication bandwidth.
For example, applications such as ASCI multiphysics simulations,
real-time target acquisition systems, multimedia stream
processing and geographical information systems (GIS), all put
tremendous strains on the computational, storage and
communication capabilities of the most modern machines. There
are several reasons why the performance of current distributed,
heterogeneous systems is often disappointing. First, they are
difficult to fully utilize because of the heterogeneity of the
processing nodes (usually with different capabilities) which are
interconnected through a non-homogeneous network with different
inter-node latencies and bandwidths. Secondly, the system may
change dynamically while the application is running. For
example, nodes may fail or appear, network links may be severed,
and other links may be established with different latencies and
bandwidths. Finally, in order to obtain decent performance, the
work has to be partitioned in a balanced manner.
[Title-page footnotes]
Research supported in part by NSF CAREER Award CCR-9734471,
NSF Grant ACI-9872126, NSF-NGS EIA-9975018 and EIA-0103742,
NSF ITR ACI-0113971, DOE ASCI ASAP Level 2 Grant B347886, and
Hewlett-Packard Equipment Grants.
Universidad de Zaragoza, http://www.cps.unizar.es/deps/DIIS/gaz
University of Illinois at Urbana-Champaign, http://iacoma.cs.uiuc.edu
Texas A&M University, http://www.cs.tamu.edu/faculty/rwerger
Texas A&M University, http://www.cs.tamu.edu/faculty/amato
Current distributed systems have a fairly compartmentalized
approach to optimization: applications, compilers, operating
systems and even hardware configurations are designed and
optimized in isolation and without the knowledge of input data.
There is little information flow across these boundaries and no
global optimization is even attempted. For example, many
important activities managed by the operating system like paging
activity, virtual-to-physical page mapping, I/O activity or data
layout in disks are provided with little or no application
customization. Since the compiler’s analysis can discover much
about an application’s needs, performance could be boosted
significantly if the OS provided hooks for the compiler, and
possibly the user, to customize or tailor OS activities to the
needs of a particular application. Current hardware is built for
general purpose use to lower costs and has almost no tunable
parameters that allow the compiler or the OS to adjust it to
specific application characteristics.
In addition to this lack of compiler/OS/hardware cooperation, a
second important problem is that compilers do not necessarily
know fully at compile time how an application will behave at run
time. The reason is that the run-time behavior of an application
may partly depend on its input data. Consequently, compilers may
generate conservative code that does not take advantage of
characteristics of the program’s input data. This precludes many
aggressive optimizations related to code parallelization,
parallel algorithm substitution (when possible), and redundancy
elimination. Moreover, we can only use expensive, generic
methods for load balancing and memory latency hiding. If,
instead, the compiler inserted code that, after reading the
input data to the program at run-time, adaptively made
optimization decisions, performance could be boosted
significantly. Furthermore, at a higher level, the compiler may
have the possibility of selecting an algorithm or a specific
implementation of an algorithm from a library of functionally
equivalent modules. If this choice is made based on the specific
instance of an application then large-scale gains can be
obtained. For example, if the code calls for a sorting routine,
the compiler can specialize this call to a specific parallel
sort that matches both the input data to be sorted as well as
the architecture on which it will be executed.
Our ultimate goal is the overall minimization of exe-
cution time of dynamic applications in parallel systems.
Instead of building individual, generally optimized com-
ponents (compilers, run-time systems, operating systems,
hardware) that can work acceptably well with any applica-
tion, we will subordinate the whole optimization process
to the particular needs of a speciﬁc application. We will
drive the optimization with the requirements of an individ-
ual program and for a speciﬁc set of input data. Moreover,
the optimization will be carried out continuously to adapt to
the dynamic, time varying needs of the application. The ﬁ-
nal form of the executable of an application will take shape
only at run-time, after all input data has been analyzed. The
resulting Smart Application (SMARTAPP) will monitor its
performance and, when necessary, restructure itself and the
underlying OS and hardware to its new characteristics. Our
approach promises to drastically simplify the generally
intractable problem of global optimization because we optimize
only a particular instance of an application. While this method
may cost some additional overhead for every execution, the
resulting customized performance can more than pay off for
long running codes.
2 System Architecture
We now give a general overview of our system, which includes
components at various levels of development. Some features of
SMARTAPPS have been implemented, others have been studied but
not yet prototyped, and others are still in early stages. We
give this high-level architectural description, which includes
both accomplishments and work in progress, in order to put our
work in perspective.
The adaptive run-time system, shown in Figures 1 and 2,
consists of a nested multi-level adaptive feedback loop that
monitors the application’s performance and, based on the
magnitude of deviation from expected performance, compensates
with various actions. Such actions may be run-time software
adaptation, re-compilation, operating system and hardware
reconfiguration. The system shown in Figure 1 uses techniques
from a TOOLBOX shown in Figure 2. The TOOLBOX contains
application and system specific databases and algorithms for
performance evaluation, prediction and system reconfiguration.
The tools are supported by architectural and performance models.
[Figure 1. Smart Application. Diagram labels: Static Compiler
(partially compiled code with unknowns and runtime hooks); Get
Runtime Information (sample input, system information); Compute
Optimal Application and System Configuration; Recompile
Application and/or Reconfigure System; Execute Application
(continuously monitor performance and adapt as necessary);
Adaptive Software (augmented with runtime techniques); runtime
tuning (w/o recompile or reconfigure); small adaption (tuning);
large adaption (failure, phase change); information for rapid
simulation; Predictor & Optimizer; Evaluator; Configurer.]
The first stage of preparing a dynamic application for execution
occurs outside the proposed run-time system. It is a
pre-compilation in which all possible static compiler
optimizations are applied. However, for many of the more
aggressive and effective code transformations, the needed
information is not statically available. For example, if the
code solves a sparse adaptive mesh-refinement problem, the
initial mesh is read from an input file only at the beginning of
the execution and is therefore not available for static
compilation. In this case, the compiler may use speculative
transformations which will have to be validated at run-time. We
will generate an intermediate code that will contain all the
necessary compiler-internal information statically available,
which will be combined with execution-time information to finish
possible optimizations. This additional information will be
packaged so that the application could in fact be executed,
albeit sub-optimally, without passing through the second
run-time compilation stage (the current level of development).
Calls to generic algorithms or, when possible, parallel
algorithm recognition and substitutions will be either left in
their most general form or specialized to the extent permitted
by static compiler analysis, e.g., type analysis. For example,
when a reduction operation is recognized or specifically called
by the program, the compiler will possibly decide between the
‘standard’ parallel equivalent or ‘histogram reductions’ if
enough knowledge can be extracted from the code [20].
[Figure 2. ToolBox. Diagram labels: Performance Evaluator
(measure performance using HW & OS; compare with predicted
values; detect HW/SW failures); Configurer (configure
architecture, I/O, and OS systems: network, cache, directories);
Optimizer (compute "optimal" configuration: arch, OS, data
layout in I/O and memory, etc.); Predictor (predict
performance); Models (Architectural Models, Performance Models);
Database (Application-specific database, System-specific
database); Phase Transition info from this run; Statistical info
from previous runs; Statistical info regarding reliability,
availability, load characteristics.]
The second stage in an application’s life is driven by
the run-time system. It starts by reading in and/or sam-
pling the input data which are relevant to the ’unﬁnished’
optimizations. This ’relevant’ data is analyzed with fast,
approximate methods and essential characteristics are extracted.
The result of this analysis will place the instance
of this application in a certain ’functioning domain’ which
represents the possible universe of forms that an application
can take at run-time. Calls to routines that perform certain
standard functions will be specialized by selecting from a
linked library the algorithms and/or their implementations
that match the ’functioning domain’ (code and data) of this
particular instantiation of the program. In addition, the run-
time system provides information about the type and re-
source availability of the system on which the application
will be executed. Performance monitoring instrumentation
is added to the code based on its intrinsic structure as well
as that of the run-time environment. Different architectural
and operating system features will dictate which parameters
are important, and which can be measured.
Then, a fast RUN-TIME COMPILER, which will be developed from an
existing restructurer, will finish the compilation process and
generate a highly optimized and adaptable code, the SMART
APPLICATION. This executable will include code for adaptive
run-time techniques that allow the application to make
on-the-fly decisions about various optimizations. To this end,
we will use our techniques for detecting and exploiting loop
level parallelism in various cases encountered in irregular
applications [15, 18, 17]. Load balancing will be achieved
through feedback guided blocked scheduling [5] which allows
highly imbalanced loops to be block scheduled by predicting a
good work distribution from previous measured execution times of
iteration blocks.
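As an illustration only, the following C sketch shows one way block boundaries could be recomputed from the previous measured execution times of iteration blocks. It assumes that the cost per iteration is uniform inside each old block, which is a simplification; the function name fgbs_rebalance and its parameters are hypothetical and are not taken from [5].

```c
#include <assert.h>
#include <stddef.h>

/* Recompute block boundaries from the previous run's timings.
 * old_b[0..p] are the old boundaries (old_b[0] = first iteration,
 * old_b[p] = one past the last), t[k] is the measured time of old
 * block k, and new_b[0..p] receives the new boundaries.
 * Assumption: cost per iteration is constant within each old block. */
void fgbs_rebalance(size_t p, const long *old_b, const double *t, long *new_b)
{
    assert(p > 0 && old_b != NULL && t != NULL && new_b != NULL);

    double total = 0.0;
    for (size_t k = 0; k < p; k++)
        total += t[k];

    new_b[0] = old_b[0];
    new_b[p] = old_b[p];

    size_t k = 0;         /* old block currently being consumed */
    double before = 0.0;  /* predicted time of old blocks 0..k-1 */

    for (size_t j = 1; j < p; j++) {
        /* Each new block should receive an equal share of the time. */
        double goal = total * (double)j / (double)p;

        /* Advance to the old block in which the goal time is reached. */
        while (k + 1 < p && before + t[k] < goal) {
            before += t[k];
            k++;
        }

        long len = old_b[k + 1] - old_b[k];
        double frac = (t[k] > 0.0) ? (goal - before) / t[k] : 0.0;
        if (frac > 1.0)
            frac = 1.0;   /* guard against rounding error */
        long cut = old_b[k] + (long)(frac * (double)len + 0.5);

        if (cut < new_b[j - 1])
            cut = new_b[j - 1];   /* keep boundaries ordered */
        new_b[j] = cut;
    }
}
```

For example, with two blocks of 50 iterations each whose measured times were 1.0 and 3.0, the boundary moves from iteration 50 to iteration 67, so that both new blocks are predicted to take about 2.0 time units.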
For certain simple algorithms, which can be automati-
cally recognized, e.g., reductions, the compiler will insert
code that can substitute the sequential version with a par-
allel equivalent that best matches the data access pattern of
the application. This adaptive parallel algorithm substitu-
tion can be implemented either through multi-version code
(library calls) as is currently done, or through recompila-
tion.
The result of static and dynamic compiler analysis of the
application will also produce a set of requirements of de-
sirable features for the operating system. These requests
are embedded in the user code and can call upon a tun-
able, modular OS to change some of its parameters (e.g.,
page mapping) and to perform some simple modiﬁcation
of the underlying architecture (e.g., type and/or number of
system components). Furthermore, the compiler will gen-
erate (statically or at run-time) a list of speciﬁcations for
the run-time environment. These application-level speci-
ﬁcations are passed to the system conﬁguration optimizer.
The PREDICTOR and OPTIMIZER tools will use the appli-
cation requirements and characteristics to compute an ‘op-
timal’ architectural conﬁguration and tune the environment
accordingly. In addition to the OS tuning we will perform
architectural modiﬁcation when feasible. This may range
from the customization of communication protocols (e.g.,
specialized cache coherence protocols) to the specialization
of processors for computing or communication. In the latter
case the SMARTAPP will distribute the workload between
’classical’ processors and processors in memory (IRAM).
In the following sections, we ﬁrst brieﬂy review some
of the implemented compiler-generated run-time optimiza-
tions of the presented SMARTAPPS architecture, and then
describe in more detail compiler-assisted software [20] and
hardware [8] techniques for parallelizing reduction operations.
These illustrate SMARTAPPS’ use of adaptive algorithm selection
and moderately reconfigurable hardware.
3 Compiler Generated Run-Time Optimizations
Efficiently exploiting parallel machines in general and
heterogeneous machines in particular depends upon the degree to
which a program has been optimized to execute on a given
architecture. We believe that all optimization techniques,
whether performed by compiler or programmer, are derived from
three fundamental optimization principles: (i) maximizing
parallelism while minimizing overhead and redundant computation,
(ii) minimizing wait-time due to load imbalance, and (iii)
minimizing wait-time due to memory latency.
The SMART APPLICATION mainly consists of a run-time library that
is embedded by the compiler in the application and that can
dynamically select compiler optimizations based on the above
three principles (e.g., loop parallelization or scheduling for
load balance). Some architectural reconfiguration and operating
system level tuning may also be employed to obtain fast, low
overhead performance improvement. We plan to integrate such
adaptive techniques into the application by extending current
static and run-time technologies and by developing completely
new ones.
We have developed several techniques [16, 17, 15, 18] that can
detect and exploit loop level parallelism in various cases
encountered in irregular applications: (i) a speculative method
to detect fully parallel loops (the LRPD Test), (ii) an
inspector/executor technique to compute wavefronts (sequences of
mutually independent sets of iterations that can be executed in
parallel) and (iii) a technique for parallelizing while loops
(do loops with an unknown number of iterations and/or containing
linked list traversals). Details can be found in
[16, 17, 18, 5].
We have recently developed a new technique that can extract the
maximum available parallelism from a partially parallel loop and
that removes limitations of previous methods (for partially
parallel loops), i.e., it can be applied to any loop (even if no
proper inspector can be extracted) and incurs less memory
overhead. The main idea of the Recursive LRPD test [5] is that
in any block-scheduled loop executed under the processor-wise
LRPD test with copy-in, the chunks of iterations that are less
than or equal to the source of the first detected dependence arc
are always executed correctly. Only the processors executing
iterations greater than or equal to the earliest sink of any
dependence arc need to re-execute their portion of work. Thus
only the remainder of the work (of the loop) needs to be
re-executed. We have implemented the Recursive LRPD test and
applied it to the three most important loops in TRACK, a code
from the Perfect benchmarks. As detailed in [5], we obtained
very encouraging speedups; prior to this technique, TRACK was
considered sequential.
4 Software Support for Reductions: Adaptive Algorithm Selection
Memory accesses in irregular programs take a variety of
patterns and are dependent on the code itself as well as on
their input data. Moreover, some codes are of a dynamic
nature, i.e., they modify their behavior during their execu-
tion because they simulate position dependent interactions
between physical entities.
A special and very frequent case of loop dependence pattern
occurs in loops which implement reduction operations. In
particular, reductions (also known as updates) are at the core
of a very large number of algorithms and applications, both
scientific and otherwise, and there is a large body of
literature dealing with their parallelization (see footnote 1).
It is difficult to find a reduction parallelization algorithm
(or for that matter, other optimizations) that will work well in
all cases. We have designed an adaptive scheme that will detect
the type of reference pattern through static (compiler) and
dynamic (run-time) methods and choose the most appropriate
scheme from a library of already implemented choices [20]. To
find the best choice we establish a taxonomy of different access
patterns, devise simple, fast ways to recognize them, and model
the various old and newly developed reduction methods in order
to find the best match. The characterization of the access
pattern is performed at compile time whenever possible, and
otherwise, at run-time, during an inspector phase or during
speculative execution.
From the point of view of optimizing the parallelization of
reductions (i.e., selecting the best parallel reduction
algorithm) we recognize several characteristics of memory
references to reduction variables. CH is a histogram which shows
the number of elements referenced by a certain number of
iterations, and CHD is the CH distribution. CHR is the ratio of
the total number of references (or the sum of the CH histogram)
to the space needed for allocating replicated arrays across
processors, and the set of CHRs which have a high degree of
contention is referred to as HCHR. CON, the Connectivity of a
loop, is the ratio of the number of iterations of the loop to
the number of distinct memory elements referenced by the loop
[10]. The Mobility (MO) per iteration of a loop is directly
proportional to the number of distinct elements that an
iteration references. The Sparsity (SP) is the ratio of
referenced elements to the dimension of the array. The DIM
measure gives the ratio of the reduction array dimension to the
cache size. If the program is dynamic then changes in the access
pattern will be collected, as much as possible, in an
incremental manner. When the changes are significant enough (a
threshold that is tested at run-time) then a re-characterization
of the reference pattern is needed.
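A minimal C sketch of how some of these measures could be gathered in an inspector pass is given below. It follows the textual definitions of CON, SP, CHR and DIM given above; the exact formulas and scaling used in [20] may differ (for example, ratios could be reported as percentages), MO and the CH histogram are omitted, and the names RefPattern and characterize are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Access-pattern measures, computed as plain ratios (an assumption). */
typedef struct {
    double con;  /* iterations / distinct referenced elements */
    double sp;   /* distinct referenced elements / array dimension */
    double chr;  /* total references / (array dimension * processors) */
    double dim;  /* reduction array bytes / cache bytes */
} RefPattern;

/* idx holds iters * refs_per_iter indices into a reduction array of
 * n_elems elements. Returns 0 on success, -1 on invalid input or
 * allocation failure. */
int characterize(const int *idx, size_t iters, size_t refs_per_iter,
                 size_t n_elems, size_t procs, size_t elem_bytes,
                 size_t cache_bytes, RefPattern *out)
{
    assert(idx != NULL && out != NULL);
    if (n_elems == 0 || procs == 0 || cache_bytes == 0)
        return -1;

    /* One flag per reduction element: has it been referenced? */
    unsigned char *touched = calloc(n_elems, 1);
    if (touched == NULL)
        return -1;

    size_t total_refs = iters * refs_per_iter;
    size_t distinct = 0;
    for (size_t r = 0; r < total_refs; r++) {
        if (idx[r] < 0 || (size_t)idx[r] >= n_elems) {
            free(touched);
            return -1;
        }
        if (!touched[idx[r]]) {
            touched[idx[r]] = 1;
            distinct++;
        }
    }
    free(touched);

    out->con = distinct ? (double)iters / (double)distinct : 0.0;
    out->sp  = (double)distinct / (double)n_elems;
    out->chr = (double)total_refs / ((double)n_elems * (double)procs);
    out->dim = ((double)n_elems * (double)elem_bytes) / (double)cache_bytes;
    return 0;
}
```

The pass is linear in the number of references, which is in keeping with the requirement that the characterization be simple and fast.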
Our strategy is to identify the regular components of each
irregular pattern (including uniform distribution), isolate and
group them together in space and time, if this is not already
the case, and then apply the best reduction parallelization
method to each component. We have used the following novel and
previously known parallel reduction algorithms: local write (lw)
[10] (an ‘owner compute’ method), private accumulation and
global update in replicated private arrays (rep), replicated
buffer with links (ll), selective privatization (sel), sparse
reductions with privatization in hash tables (hash).
Footnote 1: A reduction variable is a variable whose value is
used in one associative and commutative operation of the form
$x = x \otimes exp$, where $\otimes$ is the operator and $x$
does not occur in $exp$ or anywhere else in the loop.
APP                   MO  DIM        SP     CON    CHR    Recom.  Experimental
                                                          Scheme  Result
Irreg                 2   100,000    25     100    0.92   rep     rep > ll > sel > lw
- DO 100                  500,000    5      20     0.71   lw      lw > rep > ll > sel
                          1,000,000  1.25   5      0.40   lw      lw > rep > ll > sel
                          2,000,000  0.25   1      0.26   sel     sel > lw > ll > rep
Nbf                   1   25,600     25     200    0.25   ll      sel > ll > rep > lw
- DO 50                   128,000    6.25   50     0.25   sel     sel > ll > rep > lw
                          256,000    0.625  5      0.25   sel     sel > ll > rep > lw
                          1,280,000  0.25   2      0.25   sel     sel > ll > rep > lw
Moldyn                2   16,384     23.94  95.75  0.41   rep     rep > ll > sel > lw
- ComputeForces loop      42,592     7.75   31     0.36   rep     rep > ll > sel > lw
                          70,304     1.69   6.75   0.33   ll      ll > rep > sel > lw
                          87,808     0.375  1.5    0.29   ll      ll > rep > sel > lw
Spark98               1   30,169     0.625  5      0.18   sel     sel > ll > rep > lw
- smvpthread() loop       7,294      0.6    4.8    0.2    sel     ll > sel > rep > lw
Charmm                2   332,288    35.88  17.9   0.14   sel     ll > sel > rep > lw
- DO 78                              17.94  8.97   0.15   sel     ll > sel > rep > lw
                          664,576    1.12   4.48   0.13   sel     ll > sel > rep > lw
Spice                 28  186,943    0.14   0.04   0.125  hash    hash > ll > rep
- bjt100                  99,190     0.20   0.06   0.125  hash    hash > ll > rep
                          89,925     0.16   0.05   0.125  hash    hash > ll > rep
                          33,725     0.16   0.05   0.126  hash    hash > ll > rep
Figure 3. The data has been obtained from the execution of the
applications on 8 processors. INPUT: number of reduction elements;
SP: sparsity; CON: connectivity; CHR: ratio of total number of
references to space needed for per processor replicated arrays;
MO: mobility.
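To make the replicated private arrays (rep) scheme concrete, the following C sketch performs the reduction w[x[i]] += v[i] in three phases: initialization of one private copy per processor, private accumulation, and a global merge. It is a simplified model under stated assumptions: the p processors are simulated by an ordinary loop rather than real threads, the operator is addition, and the function name rep_reduce is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Replicated-array ("rep") reduction for w[x[i]] += v[i], i in [0, n).
 * The p processors are simulated sequentially (an assumption made to
 * keep the sketch self-contained). Every x[i] must lie in [0, dim).
 * Returns 0 on success, -1 on invalid input or allocation failure. */
int rep_reduce(double *w, size_t dim, const int *x, const double *v,
               size_t n, size_t p)
{
    assert(w != NULL && x != NULL && v != NULL);
    if (p == 0 || dim == 0)
        return -1;

    double *priv = malloc(p * dim * sizeof *priv);
    if (priv == NULL)
        return -1;

    /* Initialization phase: fill every private copy with the neutral
     * element of "+". The cost is proportional to p * dim. */
    for (size_t k = 0; k < p * dim; k++)
        priv[k] = 0.0;

    /* Accumulation phase: processor q owns a contiguous block of
     * iterations and updates only its own private copy. */
    for (size_t q = 0; q < p; q++) {
        size_t lo = n * q / p, hi = n * (q + 1) / p;
        for (size_t i = lo; i < hi; i++)
            priv[q * dim + (size_t)x[i]] += v[i];
    }

    /* Merge phase: combine all private copies into the shared array.
     * The cost is again proportional to p * dim. */
    for (size_t q = 0; q < p; q++)
        for (size_t e = 0; e < dim; e++)
            w[e] += priv[q * dim + e];

    free(priv);
    return 0;
}
```

The initialization and merge phases touch every element of every private copy regardless of how many elements are actually referenced, which is why this scheme tends to be attractive mainly when the references are dense.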
Our main goal, once the type of pattern is established,
is to choose the appropriate reduction parallelization algo-
rithm, that is, the one which best matches these characteris-
tics. To make this choice we use a decision algorithm that
takes as input measured, real, code characteristics, and a li-
brary of available techniques, and selects an algorithm for
the given instance.
The table shown in Figure 3 illustrates the experimental
validation of our method. All memory reference parameters were
computed at run-time. The result of the decision process is
shown in the “Recom. Scheme” column. The final column lists the
reduction schemes in decreasing order of their measured
speedup. For example, for the Irreg inputs with 500,000 and
1,000,000 reduction elements, the model recommended the use of
Local Write. The experiments confirm this choice: lw is listed
as having the best measured speedup of all schemes.
In the experiment for the SPICE loop, the hash table reduces the
allocated and processed space to such an extent that, although
the setup cost of a hash table is large, the performance
improves dramatically. It is the only example where hash table
reductions represent the best solution, because of the very
sparse nature of the references. We believe that codes in C
would be of the same nature and thus benefit from it. For this
loop, there are no experiments with the Local Write method
because iteration replication is very difficult due to the
modification of shared arrays inside the loop body.
5 Hardware Support for Reductions: Private
Cache-Line Reduction (PCLR)
We have proposed a new scheme, Private Cache-Line Reduction
(PCLR), which adds architectural support for performing
reduction operations in scalable shared-memory multiprocessors.
The essence of PCLR is that each processor participating in the
reduction uses non-coherent lines in its cache as temporary
private storage to accumulate its partial results of the
reduction. Moreover, if these lines are displaced from the
cache, their value is automatically accumulated onto the shared
reduction variable in memory. Finally, since the cache lines are
non-coherent, cache misses are satisfied from within the local
node by returning a line filled with neutral elements. Figure 4
shows a representation of the scheme.
With this approach, the processors are relieved of the
initialization and merge-out work. Also, since the approach is
still based on computing partial results and combining them, the
reduction is performed with no critical sections.
[Figure 4. Representation of how PCLR works. Diagram labels:
CPU, Cache, Directory, Memory, Network, CPU Miss, Displace,
Combine, Neutral Element, Shared Reduction Line.]
The initialization phase is avoided by initializing the
reduction lines on demand, as they are brought into the cache on
cache misses. Since the cache is used as private storage to
accumulate the partial results, there is no need to allocate any
private array in memory. On a cache miss to a reduction line,
the local directory controller intercepts the request and
services it by supplying a line of neutral elements.
The merging phase is avoided by combining the reduc-
tion cache lines in the background as they are displaced
from the cache during parallel loop execution. As each dis-
placed reduction line reaches the home of the shared reduc-
tion variable, the directory controller combines its contents
with the shared reduction variable in memory. Meanwhile,
the processors execute the loop without interruption.
When the parallel loop ends, some partial results may re-
main in the caches. They must be explicitly ﬂushed so that
they are correctly combined with the shared data before any
further code is executed. This ﬂush step takes much less
time than an ordinary merging phase. This is because it
has less combining to perform, as most of it has already
been performed through displacements during the loop ex-
ecution. In fact, the work is at worst proportional to the size
of the cache, rather than to the size of the shared array. It is
also more efﬁcient because the processor issues no remote
loads. Instead, it simply sends all the partial results to their
homes, where the directory controller combines the data.
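The behavior described above (on-demand initialization with neutral elements, background combining on displacement, and a final flush) can be mimicked in software. The C sketch below is only a functional model under stated assumptions: a tiny direct-mapped private cache, an addition reduction on doubles, and a shared array whose length is a multiple of the line size. The sizes LINE_ELEMS and CACHE_LINES and all type and function names are hypothetical, and no timing or coherence traffic is modeled.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_ELEMS  4   /* elements per cache line (illustrative) */
#define CACHE_LINES 2   /* lines per private cache (illustrative) */

typedef struct {
    int    valid;
    size_t tag;                 /* which memory line is held */
    double data[LINE_ELEMS];    /* partial results for that line */
} RedLine;

typedef struct { RedLine line[CACHE_LINES]; } RedCache;

/* Home directory side: combine a displaced line with shared memory. */
static void combine_at_home(double *mem, const RedLine *l)
{
    for (size_t e = 0; e < LINE_ELEMS; e++)
        mem[l->tag * LINE_ELEMS + e] += l->data[e];
}

void cache_init(RedCache *c)
{
    for (size_t s = 0; s < CACHE_LINES; s++)
        c->line[s].valid = 0;
}

/* Reduction access mem[index] += value through the private cache. */
void reduction_add(RedCache *c, double *mem, size_t index, double value)
{
    size_t tag = index / LINE_ELEMS;
    RedLine *l = &c->line[tag % CACHE_LINES];

    if (l->valid && l->tag != tag) {  /* displacement: combine at home */
        combine_at_home(mem, l);
        l->valid = 0;
    }
    if (!l->valid) {                  /* miss: line of neutral elements */
        for (size_t e = 0; e < LINE_ELEMS; e++)
            l->data[e] = 0.0;
        l->tag = tag;
        l->valid = 1;
    }
    l->data[index % LINE_ELEMS] += value;
}

/* End of loop: send the remaining partial results to their homes. */
void cache_flush(RedCache *c, double *mem)
{
    for (size_t s = 0; s < CACHE_LINES; s++)
        if (c->line[s].valid) {
            combine_at_home(mem, &c->line[s]);
            c->line[s].valid = 0;
        }
}
```

In this model the flush visits at most CACHE_LINES lines per processor, which reflects the observation that the remaining work is bounded by the size of the cache rather than by the size of the shared array.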
With PCLR support, the parallelized reduction code is
shown in Figure 5. Note that we have added a call to a
function that conﬁgures the machine for PCLR before the
loop execution. This example is simpliﬁed by using static
scheduling and omitting the forking and joining code.
5.1 Implementation of PCLR
Any implementation of PCLR has to consider the following issues:
differentiation of reduction data (Sections 5.1.1 and 5.1.5),
support for on-demand initialization (Section 5.1.2) and
combining (Section 5.1.3) of lines and configuration of the
hardware (Section 5.1.4). We discuss these issues in this
section.
  1  ConfigHardware(arguments);
     // The range 0..Nodes is split among the processors
  2  for (i = MyNodesBegin; i < MyNodesEnd; i++)
  3      w[x[i]] += expression;
  4  CacheFlush();
  5  barrier();
Figure 5. PCLR Parallelized reduction code.
In the following discussion, we assume a CC-NUMA ar-
chitecture such as the one in Figure 4. Each node in the
machine has a directory controller that snoops and poten-
tially intervenes on all requests and write-backs issued by
the local cache, even if they are directed to remote nodes.
5.1.1 Differentiating Reduction Data
While the data used in reduction operations remain in the cache,
they are read and written just like regular, non-reduction data.
However, cache misses and displacements of reduction data
require special treatment. Consequently, any implementation of
PCLR has to provide a way to distinguish reduction data from
regular data.
A simple way of doing so is to use special load and
store instructions for “reduction” accesses. Cache lines ac-
cessed by these special instructions are marked as contain-
ing reduction data by putting them into a special “reduc-
tion” state. In this state, a processor can read and write the
line without sending invalidations, even though other pro-
cessors may be caching the same memory line. Misses by
reduction loads and displacements of lines in the reduction
state cause special transactions that are recognized by the
local and home directories, respectively.
Note that we assume that reduction and regular data
never share a cache line. Although it would be possible
to enhance our scheme to support line sharing, alignment of
reduction data on cache line boundaries is beneﬁcial even
without PCLR. Consequently, we assume that the compiler
guarantees no line sharing.
In the following, we explain the rest of PCLR assuming
this simple approach to differentiating reduction data. In
Section 5.1.5, we propose a more advanced scheme for re-
duction data differentiation that allows using unmodiﬁed or
slightly modiﬁed processors and caches.
5.1.2 On-Demand Initialization of Reduction Lines
When a reduction load misses in the cache, a specially-marked
cache line read transaction is issued to the memory system. The
local directory controller intercepts the request and satisfies
it by returning a line initialized with neutral elements for the
particular reduction operation. The line is loaded into the
cache in the reduction state.
A reduction load may hit in the cache on a line that is not
in the reduction state. This may occur if the line had been
accessed prior to the reduction loop with plain accesses and
happened to linger in the cache. In this case, if the line is
in state dirty, it is written back to memory in a plain write-
back. Irrespective of its state, the line is then invalidated.
Finally, the cache issues a reduction read miss as indicated
above.
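The cache-side handling of a reduction load described in this section can be summarized as a small state transition function. The C sketch below is a hedged illustration: the state names, the action flags and the function reduction_load are hypothetical, and a real cache controller would of course act on tags and data rather than on a single state variable.

```c
#include <assert.h>
#include <stddef.h>

/* State of the addressed line in the local cache before the access. */
typedef enum {
    LINE_INVALID,    /* not present */
    LINE_CLEAN,      /* present, plain (non-reduction), unmodified */
    LINE_DIRTY,      /* present, plain (non-reduction), modified */
    LINE_REDUCTION   /* present in the reduction state */
} LineState;

/* Actions the cache takes, returned as a bit mask. */
enum {
    ACT_NONE           = 0,
    ACT_WRITE_BACK     = 1 << 0,  /* plain write-back of a dirty line */
    ACT_INVALIDATE     = 1 << 1,  /* drop the plain copy */
    ACT_REDUCTION_MISS = 1 << 2   /* request a line of neutral elements */
};

/* Handle a reduction load; the line always ends in the reduction state. */
unsigned reduction_load(LineState *state)
{
    unsigned actions = ACT_NONE;
    assert(state != NULL);

    switch (*state) {
    case LINE_REDUCTION:              /* ordinary hit, nothing to do */
        break;
    case LINE_DIRTY:                  /* lingering plain data: save it */
        actions |= ACT_WRITE_BACK;
        /* fall through */
    case LINE_CLEAN:                  /* irrespective of state: drop */
        actions |= ACT_INVALIDATE;
        /* fall through */
    case LINE_INVALID:                /* then miss as a reduction load */
        actions |= ACT_REDUCTION_MISS;
        break;
    }
    *state = LINE_REDUCTION;
    return actions;
}
```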
5.1.3 On-Demand Combining of Partial Results
When a line in the reduction state is displaced from the cache,
a specially-marked write-back transaction is issued to the
memory system. Once the write-back arrives at its home, the
directory controller reads the previous contents of the line
from memory, combines it with the newly-arrived partial result,
and stores the updated line back to memory. The combining of the
lines is done according to the reduction operator in the code,
and is performed for every single element in the line. Note that
those elements of the displaced line that were not accessed by
the processor still contain the neutral element, so the effect
of merging them with memory content is that the memory content
is unchanged.
To combine the lines, the directory controller has to be
enhanced with execution units that support the required
reduction operators. Since a cache line contains several
individual data elements, such execution units may become a
bottleneck if their performance is too low. Luckily, all the
elements of a line can be processed in parallel or in a
pipelined fashion. Consequently, it is not too difficult to
improve the performance by pipelining these execution units or
adding more units.
These execution units should include an integer ALU
for integer operations. For ﬂoating-point operations, hav-
ing a full ﬂoating-point unit would be more general, but
would also increase the complexity of the directory con-
troller signiﬁcantly. Our experience with the applications in
Section 6.2 suggests that multiplication is rarely used as a
reduction operator. Thus, for ﬂoating-point operations, hav-
ing a ﬂoating-point adder and comparator is sufﬁcient.
Finally, it is possible that the reduction data had been
accessed prior to the reduction loop with plain accesses,
and still lingers in several caches when the reduction loop
starts. To handle this case, when the home directory con-
troller receives a write-back for the line, it always checks
the list of sharer processors for the line in the directory.
Note that misses due to the reduction accesses do not go
to the home. Thus, the home only has sharing informa-
tion about non-reduction sharers. If the line is in a (non-
reduction) dirty state in a cache, the controller recalls the
line and writes it back to shared memory before performing
any combining. The controller also sends invalidations to
all (non-reduction) sharer processors. After the ﬁrst reduc-
tion write-back of a line, the list of sharers at its home is
empty for the remainder of the reduction loop and causes
no further invalidation or recall messages.
5.1.4 Conﬁguring the Hardware
The conﬁguration of the hardware is controlled by the com-
piler which inserts the necessary code into the SMARTAPP.
Before executing a reduction loop, each processor issues
a system call to inform the directory controller in its node
about the data type and the operation of the reduction. This
is shown in line 1 of Figure 5. With this simple approach,
we can only support one type of reduction operation per
parallel section. In our example of Figure 5, the controller
must be conﬁgured to perform double-precision ﬂoating-
point addition when it receives a reduction write-back.
Any loop that performs several types of reduction oper-
ation must be distributed into multiple loops, so that each
loop performs only one type of reduction operation. Fortu-
nately, loops with multiple types of reduction operation are
rare.
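To make the loop distribution concrete, the Python sketch below splits a loop with two reduction types into two loops with one type each. The function names and the array size of 4 are hypothetical. The split is assumed to be legal because the two reductions do not depend on each other.

```python
def fused(values, index):
    """Original loop: a sum reduction and a max reduction in one body."""
    total = [0.0] * 4
    peak = [float("-inf")] * 4
    for v, i in zip(values, index):
        total[i] += v              # addition reduction
        peak[i] = max(peak[i], v)  # max reduction
    return total, peak


def distributed(values, index):
    """Same computation split so each loop has one reduction type."""
    total = [0.0] * 4
    for v, i in zip(values, index):  # loop 1: addition only
        total[i] += v
    peak = [float("-inf")] * 4
    for v, i in zip(values, index):  # loop 2: max only
        peak[i] = max(peak[i], v)
    return total, peak
```

With this form, the directory controller would be configured once for addition before the first loop and once for max before the second.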
Finally, the operating system knows if different, time-
shared processes want to use different types of reduction
operations. If this is the case, the operating system ﬂushes
the reduction data from the caches when a process is pre-
empted, and reprograms the directory controller when the
process is re-scheduled.
5.1.5 Advanced Differentiation of Reduction Data
In Section 5.1.1 we explained a simple mechanism to distinguish
reduction data from regular data and then explained
the rest of PCLR using that simple mechanism. Now we
propose a more advanced, but equivalent, mechanism that
eliminates the need to modify the processor, the caches, or
the coherence protocol.
In this scheme, instead of using special instructions,
cache states, and protocol transactions to identify reduc-
tion data, such data are identiﬁed by using Shadow Ad-
dresses [4]. The scheme works as follows. In the reduction
code, we use a Shadow Array instead of the original reduc-
tion array. For example, in Figure 5, we would use array
w redu instead of w. This shadow array is mapped to phys-
ical addresses that do not contain physical memory. How-
ever, such addresses differ from the corresponding physical
addresses of the original array in a known manner. For ex-
ample, they can have their most signiﬁcant bit ﬂipped. As
a result, when a directory controller sees an access that ad-
dresses nonexistent memory, it will know two things. First,
it will know that it is a reduction access. Second, from the
physical address, it will know what location of the original
array it refers to.
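A minimal sketch of the address relation, using the bit-flip example from the text. The 40-bit address width, the function names, and the assumption that installed memory occupies less than half of the address space are all illustrative choices, not details given in the paper.

```python
ADDR_BITS = 40                     # assumed physical address width
SHADOW_BIT = 1 << (ADDR_BITS - 1)  # most significant address bit


def to_shadow(addr):
    """Shadow address of an original-array address (flip the top bit)."""
    return addr ^ SHADOW_BIT


def is_reduction_access(addr, installed_bytes):
    """An access beyond installed memory is taken as a reduction access."""
    return addr >= installed_bytes


def to_original(shadow_addr):
    """Recover the original-array address from a shadow address."""
    return shadow_addr ^ SHADOW_BIT
```

Since flipping a bit is its own inverse, the same operation maps an original address to its shadow and back, so the directory controller needs no lookup table for the translation.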
With this approach, we do not need to modify the hard-
ware of the processor, caches, or coherence protocol. The
only requirement is that the machine must be able to address
more memory than physically installed. Then, when a di-
rectory controller sees a read miss from the local processor
to nonexistent memory, it simply returns a line of neutral
elements to the processor. Furthermore, when a directory
controller sees the write-back of a line from the local processor
to nonexistent memory, it will forward it to the home
of the corresponding element of the original array. Finally,
when a directory controller receives the write-back of a line
from a remote processor, it translates its address to the ad-
dress of the corresponding element in the original array and
combines the incoming data with the data in memory.
This approach requires modest compiler and operating
system support. The compiler modiﬁes the reduction code
to access a shadow array instead of the original array. It
also declares the shadow array and inserts a system call to
tell the operating system which array is shadow of which.
The operating system has to support the mapping of pages
for the shadow array. Speciﬁcally, on a page fault in the
shadow array, it assigns a nonexistent physical page whose
number bears the expected relation to the number assigned
to the corresponding original array page. Moreover, if the
latter does not exist yet, it is allocated at this time.
5.2 Summary
The PCLR scheme addresses many of the problems of
performing parallel reductions in scalable shared-memory
multiprocessors. PCLR has two main advantages. First, it
uses cache lines as the only private storage and initializes
them on demand. As a result, there is no need to allocate
private data structures or to perform a cache-sweeping ini-
tialization loop. Second, it performs the combining of the
partial results with their shared counterparts on demand, as
the reduction loop executes. As a result, there is no need
for a costly merging step that involves sweeping the cache
and many remote misses. All that is needed is to ﬂush the
reduction data from the caches at the end of the loop. These
two advantages are particularly important when the reduc-
tion access patterns are sparse.
Most PCLR modiﬁcations are in the directory con-
trollers, which perform special actions on read misses and
write backs. With the use of shadow addresses, the only
modiﬁcation to the processor and caches is the ability to
pin and unpin lines in the caches through the load&pin and
store&unpin instructions. It can be argued that these instructions
could also be useful for other functions in modern
processors.
6 Evaluation
We evaluate the PCLR scheme using simulations driven
by several applications.
6.1 Simulation Environment
We use an execution-driven simulation environment
based on an extension to MINT [19] that includes a dynamic
superscalar processor model [12]. The architecture mod-
eled is a CC-NUMA multiprocessor with up to 16 nodes.
Each node contains a fraction of the shared memory and the
directory, as well as a processor and a two-level cache hier-
archy with a write-back policy. The processor is a 4-issue
dynamic superscalar with register renaming, branch predic-
tion, and non-blocking memory operations. Table 1 lists the
main characteristics of the architecture. Contention is accu-
rately modeled in the entire system, except in the network,
where it is modeled only at the source and destination ports.
Processor Parameters          Memory Parameters
----------------------------  ----------------------------------
4-issue dynamic, 1 GHz        L1, L2 size: 32 KB, 512 KB
Int, fp, ld/st FU: 4, 2, 2    L1, L2 assoc: 2 way, 4 way
Inst. window: 64              L1, L2 line size: 64 B, 64 B
Pending ld, st: 8, 16         L1, L2 latency: 2, 10 cycles
Branch penalty: 4 cycles      Local memory latency: 104 cycles
Int, fp rename regs: 64, 64   2-hop memory latency: 297 cycles

Table 1. Architectural characteristics of the modeled CC-NUMA. The latencies shown measure contention-free round trips from the processor in processor cycles.
The system uses a directory-based cache coherence pro-
tocol along the lines of DASH [14]. Each directory con-
troller has been enhanced with a single double-precision
ﬂoating-point add unit. Both the directory controller and
the floating-point unit are clocked at 1/3 of the processor’s
frequency. The ﬂoating-point unit is fully pipelined, so it
can start a new addition every three processor cycles. Its
latency is 2 cycles (6 processor cycles). Floating-point ad-
dition is the only reduction operation that appears in our
applications (Section 6.2).
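As a rough worked example of these timing parameters, the sketch below estimates how long the single pipelined adder needs to combine one cache line. It assumes a 64-byte line (Table 1) of 8-byte double-precision elements and counts only the adder itself; the memory read and write of the line and any contention are ignored. The function and constant names are hypothetical.

```python
DIR_CLOCK_RATIO = 3  # directory/FPU clock is 1/3 of the processor clock
FPU_LATENCY_DIR = 2  # add latency in directory-controller cycles
LINE_BYTES = 64      # line size from Table 1
ELEM_BYTES = 8       # double-precision element


def combine_cycles(n_elems):
    """Processor cycles for one pipelined adder to finish n_elems adds.

    The first result appears after the full latency; each further
    result follows one directory cycle later.
    """
    dir_cycles = FPU_LATENCY_DIR + (n_elems - 1)
    return dir_cycles * DIR_CLOCK_RATIO


elems = LINE_BYTES // ELEM_BYTES
cycles = combine_cycles(elems)
```

Under these assumptions a line holds 8 elements and the adder finishes in (2 + 7) x 3 = 27 processor cycles, which is small compared with the 104-cycle local memory latency in Table 1.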
Private data are allocated locally. Pages of shared data
are allocated in the memory module of the ﬁrst processor
that accesses them. Our experiments show that this alloca-
tion policy for shared data achieves the best performance
results for both the baseline and the PCLR system.
6.2 Applications
To evaluate the PCLR system, we use a set of FORTRAN
and C scientific codes: the applications Euler from HPF-2 [7]
and Equake from SPECfp2000 [11], and the kernels Vml from
Sparse BLAS [6], Charmm from [3], and Nbf from the GROMOS
molecular dynamics benchmark [9].

Appl.    Names of Loops          % of   # of         Iters. per  Instruc.   Red. Ops.  Red. Array  Lines    Lines
                                 Tseq   Invocations  Invocation  per Iter.  per Iter.  Size (KB)   Flushed  Displaced
Euler    dflux do[100,200],      84.7   120          59863       118        14         686.6       3261     2117
         psmoo do20,
         eflux do[100,200,300]
Equake   smvp                    50.0   3855         30169       550        22         707.1       742      580
Vml      VecMult CAB             89.4   1            4929        135        6          40.0        168      0
Charmm   dynamc do               82.8   1            82944       420        54         1947.0      1849     330
Nbf      nbf do50                99.1   1            128000      1880       200        1000.0      238      1774
Average                          81.2   795          61181       620        59         871.0       1251     960

Table 2. Application characteristics. In Euler, we only simulate dflux do100, and all the numbers except Tseq correspond to this loop. The data in the last two columns correspond to a single loop, and are collected through simulation of a 16-processor system.
All of these codes have loops with reduction operations.
Table 2 lists the loops that we simulate in each application
and their weight relative to the total sequential execution
time of the application (%Tseq). This value is obtained by
proﬁling the applications on a single-processor Sun Ultra 5
workstation. The table also shows the number of loop in-
vocations during program execution, the average number of
iterations per invocation, the average number of instructions
per iteration, the average dynamic number of reduction op-
erations per iteration, and the size of the reduction array.
The last two columns will be discussed in the next section.
The loops in Table 2 are analyzed by the Polaris par-
allelizing compiler [2] or by hand to identify the reduction
statements. Then, we modify the code to implement the parallel
reduction code for the software and PCLR algorithms.
For PCLR, reduction accesses are also marked with special
load and store instructions to trigger special PCLR opera-
tions (Section 5.1.1) in our simulator.
Next we report data, including speedups, for only the
sections of code described in Table 2. Also, since there is a
signiﬁcant variation in speedup ﬁgures across applications,
we report average results using the harmonic mean.
Impact of PCLR. We evaluate two different implementations
of our PCLR scheme. The first one is an implementation
where the directory controller is hardwired. The second
one utilizes a programmable directory controller, similar
to the MAGIC micro-controller in the FLASH multiprocessor
[13]. A programmable controller can provide the
functionality required by PCLR without requiring hardware
changes. These two implementations of PCLR are com-
pared against a baseline system, i.e., a software-only reduc-
tion parallelization. The software-only approach accumu-
lates partial results in private arrays and merges the data out
when the loop is done.
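The software-only baseline can be sketched as follows. This Python model is only an illustration of the three phases (initialization, loop, merge); the function name, the round-robin assignment of iterations, and the use of (index, value) pairs to stand in for loop iterations are assumptions, not the code used in the experiments.

```python
def software_reduction(updates, n_procs, array_size):
    """Sum reduction with private arrays and a final merge.

    `updates` is a list of (index, value) pairs standing in for the
    loop iterations; iterations are dealt round-robin to processors.
    """
    # Init phase: each processor zeroes a full private copy.
    private = [[0.0] * array_size for _ in range(n_procs)]

    # Loop phase: each processor accumulates into its own copy.
    for it, (idx, val) in enumerate(updates):
        private[it % n_procs][idx] += val

    # Merge phase: every element of every private copy is visited.
    shared = [0.0] * array_size
    for p in range(n_procs):
        for i in range(array_size):
            shared[i] += private[p][i]
    return shared
```

Note that the initialization and merge phases touch every element of every private copy, regardless of how few elements the loop actually updated. This is the cost that PCLR avoids when the access pattern is sparse.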
Figure 6 compares the execution time of these three sys-
tems. The baseline software-only system is Sw. The PCLR
implementation with a hardwired directory controller is Hw,
and the implementation with a flexible programmable directory
controller is Flex. The simulated system is a 16-node
multiprocessor. For each application, the bars are normal-
ized to Sw, and broken down into time spent in the initial-
ization phase of the Sw scheme (Init), loop body execution
(Loop), and time spent merging the partial results at the end
of the loop in Sw or ﬂushing the caches in Hw and Flex
(Merge). The numbers above each bar show the speedup
relative to the sequential execution of the code. In the se-
quential execution, all data were placed on the local mem-
ory of the single active processor.
The figure shows that the speedups in Flex are, on average,
only 16% lower than in Hw and 136% higher than in
Sw. Therefore, implementing PCLR using a programmable
directory controller is a good trade-off. Overall, for a 16-
node multiprocessor, the Hw PCLR scheme achieves an av-
erage speedup of 7.6, while the software-only system de-
livers an average speedup of only 2.7. If PCLR is imple-
mented with a programmable directory controller the aver-
age speedup is 6.4.
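The averages can be cross-checked from the per-application speedups printed above the bars of Figure 6. The sketch below recomputes the harmonic means; the dictionary layout and names are illustrative, and the inputs are the rounded values from the figure, so small deviations from the reported averages are to be expected.

```python
# Per-application speedups read from Figure 6, in the order
# Euler, Equake, Vml, Charmm, Nbf.
SPEEDUPS = {
    "Sw":   [1.3, 7.3, 3.1, 1.9, 9.1],
    "Hw":   [4.0, 14.0, 6.1, 9.9, 15.6],
    "Flex": [3.5, 10.6, 5.0, 7.7, 14.2],
}


def harmonic_mean(xs):
    """Harmonic mean: n divided by the sum of reciprocals."""
    return len(xs) / sum(1.0 / x for x in xs)


means = {name: harmonic_mean(xs) for name, xs in SPEEDUPS.items()}
```

The recomputed means come out at roughly 2.7 for Sw, 7.7 for Hw, and 6.4 for Flex, in line with the reported 2.7, 7.6, and 6.4; the small difference for Hw is presumably due to rounding of the per-application values.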
Scalability of PCLR. To evaluate the scalability of PCLR,
we have simulated a multiprocessor system with 4, 8, and
16 processors. Figure 7 shows the harmonic mean of the
speedups delivered by the different mechanisms. It can be
seen that PCLR (both Hw and Flex) scales well. However,
the Sw scheme scales poorly. The time of the merging step
in Sw does not decrease when more processors are available.
If the main loop scales well, the merging step limits the
achievable speedups according to Amdahl’s law.
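The limit can be illustrated with a simplified model in which the loop body scales perfectly with the number of processors while the merge time stays constant. The merge fraction of 0.25 used below is a hypothetical value chosen for illustration, not a measurement from the experiments.

```python
def sw_speedup(n_procs, merge_fraction):
    """Speedup when the loop scales perfectly but the merge does not.

    `merge_fraction` is the merge time as a fraction of the sequential
    loop time (a hypothetical parameter, not a measured value).
    """
    return 1.0 / (1.0 / n_procs + merge_fraction)


curve = [round(sw_speedup(p, 0.25), 2) for p in (4, 8, 16)]
```

In this model the speedup can never exceed 1 / merge_fraction (4 in this example), no matter how many processors are added.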
Figure 6. Execution time under different schemes for a 16-node multiprocessor. The numbers above the bars are speedups relative to the sequential execution.
[Bar chart. Y-axis: Execution Time, normalized to Sw (0 to 1). Bars: Sw, Hw, Flex for each application, broken down into Init, Merge, and Loop. Speedups above the bars (Sw, Hw, Flex): Euler 1.3, 4.0, 3.5; Equake 7.3, 14.0, 10.6; Vml 3.1, 6.1, 5.0; Charmm 1.9, 9.9, 7.7; Nbf 9.1, 15.6, 14.2.]

Figure 7. Speedups delivered by the different mechanisms (harmonic mean).
[Line chart. X-axis: Number of Processors (1, 4, 8, 16). Y-axis: Speedup (0 to 8). Series: Hw, Flex, Sw.]

7 Conclusion
So far we have made good progress on the development
of many of the components of SMARTAPPS. We will further
develop these and combine them into an integrated system.
In this paper we have illustrated the capability of SMARTAPPS
by presenting two complementary techniques for adaptively
optimizing an important operation for parallel programs:
reductions. Our software-based approach describes
how a compiler can generate multi-version code that adapts
to the code behavior. We have also illustrated how hardware
support can be used to enable better speedups.
References
[1] N. Amato, L. Rauchwerger, and J. Torrellas. SmartApps: An
application centric approach to high performance comput-
ing. In Proc. 13th Workshop on Programming Languages
and Compilers for Parallel Computing, LNCS, 2000.
[2] W. Blume et al. Advanced Program Restructuring for High-
Performance Computers with Polaris. IEEE Computer,
29(12):78–82, Dec. 1996.
[3] B. R. Brooks et al. CHARMM: A program for macromolec-
ular energy, minimization, and dynamics calculations. J. of
Computational Chemistry, (4):187–217, 1983.
[4] J. B. Carter et al. Impulse: Building a Smarter Memory Con-
troller. In Proc. of the Fifth Int. Symp. on High Performance
Computer Architecture, pp. 70–79, January 1999.
[5] F. Dang, H. Yu, and L. Rauchwerger. The R-LRPD test:
Speculative parallelization of partially parallel loops. In
Proc. Int. Parallel and Distributed Processing Symp., April
2002.
[6] I. Duff, M. Marrone, G. Radicati, and C. Vittoli. A set of
Level 3 Basic Linear Algebra Subprograms for Sparse Ma-
trices. Tech. Rept. RAL-TR-95-049, Rutherford Appleton
Laboratory, 1995.
[7] I. Duff, R. Schreiber, and P. Havlak. HPF-2 Scope of Activ-
ities and Motivating Applications. Technical Report CRPC-
TR94492, Rice Univ., Nov. 1994.
[8] M. Garzaran, A. Jula, M. Prvulovic, H. Yu, L. Rauchwerger,
and J. Torrellas. Architectural support for parallel reductions
in scalable shared-memory multiprocessors. In Proc. of Int.
Conf. on Parallel Architectures and Compilation Techniques,
Sept. 2001.
[9] W. Gunsteren and H. Berendsen. GROMOS: GROningen
MOlecular Simulation software. Tech. Rept., Lab. of Physi-
cal Chemistry, Univ. of Groningen, 1988.
[10] H. Han and C.-W. Tseng. Improving compiler and run-time
support for adaptive irregular codes. In Int. Conf. on Parallel
Architectures and Compilation Techniques, Oct. 1998.
[11] J. L. Henning. SPEC CPU2000: Measuring CPU Performance
in the New Millennium. IEEE Computer, 33(7):28–35,
July 2000.
[12] V. Krishnan and J. Torrellas. An Execution-Driven Frame-
work for Fast and Accurate Simulation of Superscalar Pro-
cessors. In Int. Conf. on Parallel Architectures and Compi-
lation Techniques, Oct. 1998.
[13] J. Kuskin et al. The Stanford FLASH Multiprocessor. In
Proc. of the 21st Annual Int. Symp. on Computer Architec-
ture, pp. 302–313, April 1994.
[14] D. Lenoski et al. The Stanford Dash Multiprocessor. IEEE
Computer, pp. 63–79, March 1992.
[15] L. Rauchwerger, N. Amato, and D. Padua. A scalable
method for run-time loop parallelization. Int. J. Paral. Prog.,
26(6):537–576, July 1995.
[16] L. Rauchwerger. Run–time parallelization: A framework
for parallel computation. Technical Report UIUCDCS-R-95-1926,
Dept. of Computer Science, Univ. of Illinois, Urbana,
Illinois, Sept. 1995.
[17] L. Rauchwerger and D. Padua. The LRPD Test: Specu-
lative Run-Time Parallelization of Loops with Privatization
and Reduction Parallelization. IEEE Trans. on Parallel and
Distributed Systems, 10(2), 1999.
[18] L. Rauchwerger and D. Padua. Parallelizing WHILE Loops
for Multiprocessor Systems. In Proc. of 9th Int. Parallel Processing
Symp., April 1995.
[19] J. Veenstra and R. Fowler. MINT: A Front End for Efﬁcient
Simulation of Shared-Memory Multiprocessors. In Proc.
2nd Int. Workshop on Modeling, Analysis, and Simulation of
Computer and Telecomm. Systems, pp. 201–207, Jan. 1994.
[20] H. Yu and L. Rauchwerger. Adaptive Reduction Parallelization.
In Proc. of the 14th ACM Int. Conf. on Supercomputing,
Santa Fe, NM, May 2000.