Optimized replacement in the configuration layers of the grid Alu processor by Jahr, Ralf et al.
High-performance and
Hardware-aware Computing
Proceedings of the Second International Workshop on New Frontiers 
in High-performance and Hardware-aware Computing (HipHaC‘11)
San Antonio, Texas, USA, February 2011





Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe
www.ksp.kit.edu
KIT – University of the State of Baden-Württemberg and National
Research Center of the Helmholtz Association
This publication is available on the Internet under the following Creative Commons
license: http://creativecommons.org/licenses/by-nc-nd/3.0/de/
KIT Scientific Publishing 2011
Print on Demand
ISBN 978-3-86644-626-7
Optimized Replacement in the Configuration Layers
of the Grid ALU Processor
Ralf Jahr, Basher Shehan, Theo Ungerer
University of Augsburg
Institute of Computer Science
86135 Augsburg, Germany






Abstract—The Grid ALU Processor comprises a reconfigurable
two-dimensional array of ALUs. A conventional sequential
instruction stream is mapped dynamically to this array by a
special configuration unit within the front-end of the processor
pipeline. One of the features of the Grid ALU Processor is
its configuration layers, which work like a trace cache to store
instruction sequences that have recently been mapped to the ALU
array.
Originally, the least recently used (LRU) strategy was
implemented to evict older configurations from the layers. As
we show in this paper, the working set is frequently larger than
the number of configuration layers available in the processor,
which results in thrashing. Hence, there is quite a large gap between
the hit rate achieved by LRU and the hit rate achievable
with an optimal algorithm. We propose an approach called
qdLRU to enhance the performance of the configuration layers.
Using qdLRU closes the gap between LRU and an optimal
eviction strategy by 66% on average and achieves a maximum
performance improvement of 390% and 5.06% on average with
respect to the executed instructions per clock cycle (IPC).
Index Terms—Trace Cache, Replacement Strategy, Post-link
Optimization, Feedback-directed Optimization, Coarse-Grained
Reconfigurable Architecture
I. INTRODUCTION
In this paper, we present an optimization for the Grid
ALU Processor (GAP), which has been introduced by Uhrig
et al. [1]. It brings together a superscalar-like processor front-end
and a coarse-grained reconfigurable architecture, i.e. a
reconfigurable array of functional units (FUs). The front-end,
consisting of instruction fetch and decode units, is extended
with a new configuration unit. This unit maps the instructions
from the instruction stream dynamically at run-time onto
the array of FUs. Mapping of instructions and execution
of instructions in the array run in parallel until there is a
reason to flush the array and restart the mapping process. The
mapping which has been built up to this moment is called a
configuration.
These configurations can be buffered in so-called configu-
ration layers, which are formed by some memory cells very
close to all the FUs. The configuration layers are very similar
to trace caches. If a part of a program, i.e. a configuration, is
already stored in the configuration layers it can be executed
faster because it does not have to go through the front-end first,
so instruction cache misses cannot occur. The timing inside the
array is optimized, too. Because of this, it is a worthwhile goal
to increase the usage of the configuration layers. Analyzing
the execution of benchmarks, we came to the conclusion
that for some of them our default replacement strategy LRU
performs unexpectedly badly, even worse than replacing a random
configuration layer (we call this strategy RANDOM). So LRU
is in some cases a poor choice and reduces the execution
speed.
The main contributions of this paper are (1) the analysis
and comparison of the behavior of well-known replacement
algorithms when applied to the replacement in the configura-
tion layers and (2) the introduction and analysis of qdLRU.
QdLRU improves the hit rate of LRU by adding flags to the
program code based on a feedback-directed approximation of
the working sets.
After giving a short introduction of the target platform
in Section II, we discuss some basics about replacement
strategies in Section III. The extended version of LRU called
qdLRU is introduced in Section IV and evaluated in Section V.
Related work is presented in Section VI and Section VII
concludes the paper.
II. TARGET PLATFORM: THE GRID ALU PROCESSOR
The Grid ALU Processor (GAP) has been developed to
speed up the execution of conventional single-threaded instruc-
tion streams. To achieve this goal, it combines the advantages
of superscalar processor architectures, those of coarse-grained
reconfigurable systems, and asynchronous execution.
A superscalar-like processor front-end consisting of fetch
and decode units is used together with a novel configuration
unit (see Figure 1(a)) to load instructions and map them dynamically
onto an array of functional units (FUs) accompanied
by a branch control unit and several load/store units to handle
memory accesses (see Figure 1(b)).
The array of FUs is organized in columns and rows. Each
column is dynamically and per configuration assigned to one
architectural register. Instructions are assigned to the columns
whose registers match the instructions’ output registers. The
rows of the array are used to model dependencies between
instructions. If an instruction B depends on an instruction
A, it will be mapped to a row below the row of A. This way
[Figure 1, panel (b): General organization of the ALU array]
Fig. 1. Architecture of the Grid ALU Processor
dependent instructions without the need of complex out-of-
order logic. A bimodal branch predictor is used to effectively
map control dependencies onto the array.
Execution starts in the first row of the array. The dataflow
is performed asynchronously inside the array of FUs and it is
synchronized with the clock of the branch control unit and the
L/S units by so-called timing tokens [1].
Whenever either a branch is mispredicted or execution
reaches the last row of the array with configured FUs, the array
is cleared and the configuration unit maps new instructions
starting from the first row of the array. In order to save configurations
for repeated execution, all elements of the array are
equipped with some memory cells which form configuration
layers. Typically, 2, 4, 8, 16, 32, or 64 configuration layers
are available. The array is quasi three-dimensional and its size
can be written as columns × rows × layers.
With this extension, it has to be checked before mapping
new instructions whether the next instruction to execute is equal
to the first instruction in any of the layers. If a match is
found, the corresponding layer is set to active and execution
continues there. If no match is found, the least recently
used configuration layer is cleared and used to store the
new configuration. In all cases, the new values of registers
calculated in columns are copied into the registers at the top
of the columns.
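The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulator's implementation: the class name ConfigurationLayers, the method activate, and the callback build_configuration (which stands in for the front-end and configuration unit) are hypothetical names, and a configuration is identified solely by the address of its first instruction.

```python
from collections import OrderedDict


class ConfigurationLayers:
    """Hypothetical model of the layer subsystem with LRU replacement."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        # Maps start address -> stored configuration; least recently used first.
        self.layers = OrderedDict()

    def activate(self, start_address, build_configuration):
        """Return (configuration, hit) for the next instruction to execute."""
        if start_address in self.layers:
            # Match found: the layer becomes active and most recently used.
            self.layers.move_to_end(start_address)
            return self.layers[start_address], True
        if len(self.layers) >= self.num_layers:
            # No match: clear the least recently used layer.
            self.layers.popitem(last=False)
        # Assumption: the callback models mapping through the front-end.
        configuration = build_configuration(start_address)
        self.layers[start_address] = configuration
        return configuration, False
```

With two layers, accessing the start addresses 0x100, 0x200, 0x100, 0x300, 0x200 in this model yields a hit only for the third access, because storing 0x300 clears the layer holding 0x200.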
To evaluate the architecture, a cycle- and signal-accurate
simulator has been developed. It uses the Portable Instruction
Set Architecture (PISA); hence the simulator can execute the
same program files as the SimpleScalar simulation tool
set [2] (but it is not based on it). Detailed information about the
processor is given by Uhrig et al. [1] and Shehan et al. [3].
III. TOWARDS AN IMPROVED POLICY
Several basic terms of replacement strategies with respect
to the GAP architecture are discussed in this section.
A. Measuring the Performance of a Replacement Strategy
To analyze the performance of a replacement policy, we
suggest two measures. The first is the total hit rate $h_{total}$ of the layer
subsystem, i.e. the number of accesses that can be served from
the configuration layers, $a_{hit}$, divided by the
total number of accesses $a_{total}$. The total hit rate can
also be understood as the sum of the hit rate obtained by re-accessing
the identical configuration subsequently, $h_{loop} = a_{loop}/a_{total}$,
which is independent of the number of layers available,
and the hit rate contributed by the layer subsystem,
$h_{layer} = (a_{hit} - a_{loop})/a_{total}$:
\[ h_{total} = \frac{a_{hit}}{a_{total}} = \frac{a_{loop}}{a_{total}} + \frac{a_{hit} - a_{loop}}{a_{total}} = h_{loop} + h_{layer} \]
A replacement policy can influence only the hit rate of the
layer subsystem $h_{layer}$. For a given benchmark, $h_{loop}$ has the
same value for all replacement strategies.
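To make the decomposition concrete, the following Python sketch counts loop hits and layer hits for a trace of configuration identifiers under LRU. It is a simplified model, not the simulator: the function name hit_rates is hypothetical, a loop hit is assumed to be an access to the same configuration as the immediately preceding access, and a layer hit is any other access that finds its configuration in a layer. The trace is assumed to be non-empty.

```python
from collections import OrderedDict
from fractions import Fraction


def hit_rates(trace, num_layers):
    """Return (h_total, h_loop, h_layer) for a trace under LRU replacement."""
    layers = OrderedDict()  # least recently used configuration first
    a_loop = 0              # re-accesses of the currently active configuration
    a_layer = 0             # hits served by one of the other layers
    previous = None
    for config in trace:
        if config == previous:
            a_loop += 1
        elif config in layers:
            a_layer += 1
            layers.move_to_end(config)
        else:
            if len(layers) >= num_layers:
                layers.popitem(last=False)  # evict the LRU configuration
            layers[config] = True
        previous = config
    a_total = len(trace)
    h_loop = Fraction(a_loop, a_total)
    h_layer = Fraction(a_layer, a_total)
    return h_loop + h_layer, h_loop, h_layer
```

For the trace A A A B A B and two layers, two of the six accesses are loop hits and two are layer hits, so the model yields a total hit rate of 2/3 composed of 1/3 from loops and 1/3 from the layer subsystem. Only the second share would change with a different replacement strategy.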
An optimal offline replacement algorithm (named OPT in
the remainder) has been introduced by Belady [4] and it can be
used as an upper bound. In other words, no (online) replacement
policy can achieve a better hit rate than this offline policy,
which evicts the element whose next use lies farthest in
the future.
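As an illustration of how such an upper bound can be computed offline, the following Python sketch applies Belady's rule to a complete trace of configuration identifiers. The function name opt_hits is hypothetical; the sketch assumes that the trace is a list, that at least one layer exists, and that every missing configuration has to be stored in a layer.

```python
def opt_hits(trace, num_layers):
    """Count hits when the configuration reused farthest in the future is evicted."""
    hits = 0
    layers = set()
    for i, config in enumerate(trace):
        if config in layers:
            hits += 1
            continue
        if len(layers) >= num_layers:
            def next_use(candidate):
                # Position of the next access, or infinity if never reused.
                try:
                    return trace.index(candidate, i + 1)
                except ValueError:
                    return float("inf")
            layers.remove(max(layers, key=next_use))
        layers.add(config)
    return hits
```

For a loop over three configurations A, B, C executed four times with two layers, this model reports 5 hits out of 12 accesses, whereas LRU achieves none on the same trace.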
Another offline algorithm has been mentioned by Temam [5]
with the goal to maximize the number of instructions which
can be accessed without cache misses. As an upper bound for the
performance of a replacement policy, the algorithm OPT is a
much more suitable measure because in the GAP the penalty
caused by activating the front-end of the processor when a
new configuration must be built is much higher than
the time saved when some additional instructions
can be found in a layer.
The second measure to evaluate a replacement policy is the
performance of the whole system, which is e.g. described by




Listing 1. Algorithm to configuration lines
input: list<configuration> trace
#define line list<configuration>
set<line> all_lines
map<line, int> line_counters
line current_line = {}
configuration last_configuration
foreach (configuration item in trace)
    if (item == last_configuration)
        // Do nothing
    else if (item ∉ current_line)
        current_line += item
        last_configuration = item
    else
        all_lines += current_line
        line_counters[current_line]++
        current_line = {}
        last_configuration = item
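For experimentation, Listing 1 can be transcribed into Python roughly as follows. The sketch follows the listing as printed: the configuration that closes a line is not added to the next line, and a line that is still open at the end of the trace is discarded. The function name find_configuration_lines is hypothetical, and the set of all lines is represented implicitly by the keys of the returned counter.

```python
from collections import Counter


def find_configuration_lines(trace):
    """Return a Counter mapping each configuration line (a tuple) to its count."""
    line_counters = Counter()
    current_line = []
    last_configuration = None
    for item in trace:
        if item == last_configuration:
            continue  # repeated access to the active configuration
        if item not in current_line:
            current_line.append(item)
        else:
            # The item is already part of the line: close the line.
            line_counters[tuple(current_line)] += 1
            current_line = []
        last_configuration = item
    return line_counters
```

For the trace A A B C A B C A B, this transcription yields the two lines (A, B, C) and (B, C, A), each counted once.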
number of layers in the processor and the other group Clong
contains all the other configuration lines, i.e. those configuration
lines that are too long to fit into the layers without evictions.
With these groups prepared, the following algorithm
is performed:
1) Select a configuration line item from Clong.
2) Select from item the configuration with the least usage
in Cshort, mark its first instruction.
3) Select all configuration lines from Clong where the
number of all configurations minus the number of all
marked configurations in the line is smaller than the
number of layers of the processor. Move them to Cshort.
4) If Clong is not empty, restart the algorithm with step 1.
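The four steps above can be sketched in Python as follows. Several details are assumptions made for this illustration and are not taken from the paper: the function name select_drop_quickly, the choice of the first line of Clong in step 1, the restriction of step 2 to configurations that are not yet marked, measuring usage as the number of lines in Cshort that contain a configuration, and the initial rule that a line belongs to Cshort if it has at most as many configurations as there are layers. Lines are assumed to contain each configuration at most once, and at least one layer is assumed.

```python
def select_drop_quickly(lines, num_layers):
    """Return the set of configurations to be marked with the drop quickly flag."""
    marked = set()
    # Assumed initial grouping: lines that fit into the layers vs. the rest.
    c_short = [line for line in lines if len(line) <= num_layers]
    c_long = [line for line in lines if len(line) > num_layers]

    def unmarked(line):
        return [config for config in line if config not in marked]

    def usage_in_short(config):
        return sum(config in line for line in c_short)

    while c_long:                                  # step 4
        item = c_long[0]                           # step 1
        # Step 2: least-used configuration; ties go to the earliest in the line.
        marked.add(min(unmarked(item), key=usage_in_short))
        still_long = []
        for line in c_long:                        # step 3
            if len(unmarked(line)) < num_layers:
                c_short.append(line)
            else:
                still_long.append(line)
        c_long = still_long
    return marked
```

With three layers and the lines (A, B) and (A, B, C, D), the sketch marks C and D: both are unused in the short line, and after marking them the long line has fewer unmarked configurations than there are layers.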
With this heuristic, we select configurations in such a way
that they influence the execution of configuration lines
that fit into the layers as little as possible. If a configuration
line fits into the layers, but one of its configurations is marked,
then this can reduce the hit rate of this configuration line
severely.
In the last step, our post-link optimization tool GAPtimize
(introduced in [9]) is used to mark the first instruction of the
selected configurations with a special drop quickly flag. This
flag directs the configuration layer subsystem of the GAP to drop
the configuration starting with the flagged instruction quickly.
B. Executing the modified binary
When implementing qdLRU, changes are necessary both in
hardware and in software. The changes in hardware are very
simple. All that has to be done is to make sure that either a
configuration beginning with a marked instruction is inserted
at the least recently used position in the LRU access queue, or
that, when a layer is to be evicted, a configuration layer
starting with a marked instruction is searched for first and,
if found, replaced.
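The second hardware variant can be modeled with a small Python sketch that prefers layers holding a marked configuration when a victim is needed. The function name count_hits and its parameters are hypothetical; if several marked configurations are stored, the sketch assumes that the least recently used one of them is replaced. With an empty set of marked configurations the function behaves like plain LRU.

```python
from collections import OrderedDict


def count_hits(trace, num_layers, marked=frozenset()):
    """Count layer hits for a trace under qdLRU (plain LRU if nothing is marked)."""
    layers = OrderedDict()  # least recently used configuration first
    hits = 0
    for config in trace:
        if config in layers:
            hits += 1
            layers.move_to_end(config)
            continue
        if len(layers) >= num_layers:
            # Look for a layer holding a marked configuration first.
            victim = next((c for c in layers if c in marked), None)
            if victim is None:
                victim = next(iter(layers))  # fall back to the LRU layer
            del layers[victim]
        layers[config] = True
    return hits
```

For a loop over the configurations A, B, C executed four times with two layers, plain LRU produces no hit at all in this model, while marking C lets B stay in its layer and gives 3 hits out of 12 accesses.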
If a program is executed on the GAP which has not been
optimized (and is hence without flags), then qdLRU behaves
exactly like LRU, which still offers reasonable performance.
This graceful degradation is one of the requirements of all
techniques used for the GAP.
V. EVALUATION
For the practical evaluation we rely on the cycle-accurate
simulator which has been developed for the GAP and was
extended to support qdLRU. As the hardware complexity of
the GAP can vary considerably with the size of its
ALU array, we set it to a fixed size of 12 columns and 12
(a) Access plot for the first 1000 accesses of configuration layers for benchmark stringsearch
(b) Access plot for the first 1000 accesses of configuration layers for benchmark qsort




one. Together with OPT as upper bound, the performance
of LRU, FIFO, and RANDOM has been compared for our
situation in Section III-B.
The second class of algorithms comprises the Dynamic Insertion
Policy (DIP) proposed by Qureshi et al. [11] and the Shepherd
Cache proposed by Rajan et al. [12]. Both share the property
that they require additional hardware effort. In our experiments
for our particular situation, the Shepherd Cache achieved
performance numbers at most comparable to LRU.
The DIP is only applicable if it can select between LRU
and BIP with extreme parameters to prevent thrashing. The
suggested approach to divide the configuration layers into two
sets does not seem to be applicable due to the small number
of configuration layers. The small number of lines also prevents
using strategies like ARC [13], where the lines are split into
two sections and handled in different ways.
Some other techniques have also been proposed (see
e.g. [14]), but most of them either require large changes to the
hardware or are not expected to work well because the low
number of layers available in the GAP normally restricts the
eventual gain in performance caused by replacement strategies.
Trace caches as introduced by Rotenberg et al. [15] work for
superscalar processors very similarly to the configuration layers
because they are used to buffer parts of a program flow, too.
To our knowledge, nobody has yet addressed thrashing
situations in this context.
VII. CONCLUSION AND FUTURE WORK
We introduced a software-supported replacement strategy
for the configuration layers of the GAP processor, which are
used like a trace cache to buffer instruction sequences ready
for execution. So far, LRU has been used as replacement strategy,
which offers unsatisfactory performance for several benchmarks.
Strangely enough, LRU shows for some benchmarks
even worse performance than RANDOM, a strategy evicting a
random element. The main reason for this is thrashing, which
can happen if the elements of a working set are processed
repeatedly and sequentially, i.e. there is a huge degree of
locality, and the set contains more configurations than the GAP
provides configuration layers. In this case, the hit rate achieved
with LRU collapses.
To overcome this issue, we proposed a replacement strategy
called qdLRU and a heuristic to approximate the working
sets in software. Based on the working sets, we select some
configurations which are evicted immediately from the configuration
layers. With this, we can bring the behavior of
qdLRU closer to the optimal strategy OPT. The performance
measured by the IPC for qdLRU is on average 5.06% higher
than the performance achieved by LRU. A peak improvement
of 390% is gained for secu-rijndael-decode, caused by a peak
improvement of the hit rate of 0.5.
This approach could be adapted for all situations in which a
replacement strategy is needed for a small number of complex
elements with many situations prone to thrashing. The introduced
strategy requires only very few changes to the hardware when
LRU has already been implemented. It also supports graceful
degradation back to LRU.
As future work, we propose to work on the detection of
working sets. The rule which has been introduced is simple
and effective. Nevertheless, there are situations where this rule
cannot find a sufficient solution. Hence, to find better solutions,
the scope of the working sets should be reconsidered. From
our point of view, it is important that the configurations in a
working set are executed repeatedly in the same order.
If this restriction is weakened, the scope of working sets could
be enlarged, which must be handled carefully but might lead
to further improved results. Finally, it might be possible
to find better solutions with biologically inspired algorithms,
e.g. ant algorithms or genetic algorithms. Linear programming
should also be taken into consideration.
REFERENCES
[1] S. Uhrig, B. Shehan, R. Jahr, and T. Ungerer, “The two-dimensional
superscalar GAP processor architecture,” International Journal on Advances
in Systems and Measurements, 2010.
[2] D. Burger and T. Austin, “The SimpleScalar tool set, version 2.0,” ACM
SIGARCH Computer Architecture News, vol. 25, no. 3, pp. 13–25, June
1997.
[3] B. Shehan, R. Jahr, S. Uhrig, and T. Ungerer, “Reconfigurable Grid ALU
Processor: Optimization and design space exploration,” in Proceedings of
the 13th Euromicro Conference on Digital System Design (DSD) 2010,
Lille, France, 2010.
[4] L. A. Belady, “A study of replacement algorithms for a virtual-storage
computer,” IBM Systems Journal, vol. 5, no. 2, pp. 78–101, 1966.
[5] O. Temam, “Investigating optimal local memory performance,” SIGOPS
Oper. Syst. Rev., vol. 32, no. 5, pp. 218–227, 1998.
[6] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and T. Brown,
“MiBench: A free, commercially representative embedded benchmark
suite,” 4th IEEE International Workshop on Workload Characteristics,
pp. 3–14, December 2001.
[7] P. J. Denning, “The locality principle,” Commun. ACM, vol. 48, no. 7,
pp. 19–24, 2005.
[8] P. Denning, “Thrashing: its causes and prevention,” in AFIPS ’68 (Fall,
part I): Proceedings of the December 9-11, 1968, fall joint computer
conference, part I. New York, NY, USA: ACM, 1968, pp. 915–922.
[9] R. Jahr, B. Shehan, S. Uhrig, and T. Ungerer, “Static speculation as
post-link optimization for the Grid ALU Processor,” in Proceedings of the
4th Workshop on Highly Parallel Processing on a Chip (HPPC 2010),
2010.
[10] R. W. Carr and J. L. Hennessy, “WSCLOCK - a simple and effective
algorithm for virtual memory management,” in SOSP ’81: Proceedings
of the eighth ACM symposium on Operating systems principles. New
York, NY, USA: ACM Press, 1981, pp. 87–95.
[11] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer,
“Adaptive insertion policies for high performance caching,” in ISCA ’07:
Proceedings of the 34th annual international symposium on Computer
architecture. New York, NY, USA: ACM, 2007, pp. 381–391.
[12] K. Rajan and G. Ramaswamy, “Emulating optimal replacement with
a shepherd cache,” in MICRO 40: Proceedings of the 40th Annual
IEEE/ACM International Symposium on Microarchitecture. Washington,
DC, USA: IEEE Computer Society, 2007, pp. 445–454.
[13] N. Megiddo and D. S. Modha, “Outperforming LRU with an adaptive
replacement cache algorithm,” Computer, vol. 37, no. 4, pp. 58–65,
2004.
[14] G. Keramidas, P. Petoumenos, and S. Kaxiras, “Where replacement
algorithms fail: a thorough analysis,” in CF ’10: Proceedings of the
7th ACM international conference on Computing frontiers. New York,
NY, USA: ACM, 2010, pp. 141–150.
[15] E. Rotenberg, S. Bennett, and J. E. Smith, “Trace cache: a low
latency approach to high bandwidth instruction fetching,” in MICRO 29:
Proceedings of the 29th annual ACM/IEEE international symposium on
Microarchitecture. Washington, DC, USA: IEEE Computer Society,
1996, pp. 24–35.
