A Register Allocation Algorithm in the Presence of Scalar Replacement for
Fine-Grain Configurable Architectures

Nastaran Baradaran and Pedro C. Diniz
University of Southern California / Information Sciences Institute
Marina del Rey, California 90292, U.S.A.

{nastaran, pedro}@isi.edu
Abstract
The aggressive application of scalar replacement to ar-
ray references substantially reduces the number of mem-
ory operations at the expense of a possibly very large num-
ber of registers. In this paper we describe a register alloca-
tion algorithm that assigns registers to scalar replaced ar-
ray references along the critical paths of a computation, in
many cases exploiting the opportunity for concurrent mem-
ory accesses. Experimental results, for a set of image/signal
processing code kernels, reveal that the proposed algorithm
leads to a substantial reduction of the number of execution
cycles for the corresponding hardware implementation on
a contemporary Field-Programmable-Gate-Array (FPGA)
when compared to other greedy allocation algorithms, in
some cases using even fewer registers.
1. Introduction
Scalar replacement or register promotion is an effec-
tive technique for eliminating external memory accesses for
the data that is repeatedly accessed throughout a compu-
tation. This technique, geared towards array variables, en-
ables a compiler to replace the repeatedly accessed array
references by scalar references. Mapping these scalars to
hardware registers eliminates the memory operations as-
sociated with fetching/storing of the values, while making
them readily available for future use. This transformation is
particularly suited for loop-based memory-intensive com-
putations, such as those arising in common image and sig-
nal processing code kernels, where there are substantial op-
portunities for both input and output data reuse.
 This work is supported by the National Science Foundation (NSF) un-
der Grant No. 0209228. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the author(s)
and do not necessarily reflect the views of the NSF.
The aggressive application of scalar replacement, however,
may require a large number of registers, which limits the
applicability of this technique. Fine-grain configurable
architectures, such as Field-Programmable-Gate-Arrays
(FPGAs), therefore offer an ideal context for applying
scalar replacement to image and signal processing
applications. These architectures have a large yet limited
number of registers that can be organized freely, as well as
storage structures organized as RAM blocks with programmable
bit-widths and a flexible number of access ports. A compiler
can exploit scalar replaced array references by explicitly
mapping the corresponding scalars to a combination of
registers and RAM blocks and managing them there [2].
In this paper we describe several algorithms for the al-
location of registers to scalar variables resulting from the
application of scalar replacement to array references in per-
fectly nested loops. We describe and evaluate two greedy
allocation algorithms based on cost/benefit metrics and pro-
pose a novel critical-path-aware allocation algorithm. The
proposed algorithm allocates registers to references along
cuts of the critical path of the computation, ensuring that the
eliminated memory accesses lead to a reduction of the com-
putation’s execution cycles and wall-clock execution time.
We evaluate the performance for the various algorithms
using a small set of image/signal processing code kernels.
The results reveal that the proposed algorithm is effective in
allocating registers to the scalar replaced array references in
the code, therefore reducing the number of execution cycles
of each computation. In some cases the critical-path-aware
algorithm reduces the overall execution cycles, as well as
the overall execution time, using the same number of
registers as other greedy algorithms, or even fewer.
In the rest of this paper, section 2 describes background
and related work. Sections 3 and 4 formalize our register al-
location problem along with the description of the proposed
critical-path-aware algorithm. We present experimental re-
sults in section 5 and conclude in section 6.
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’05) 
1530-1591/05 $ 20.00 IEEE 
2. Background and Related Work
We now briefly describe the relevant features of our tar-
get configurable architecture as well as the compiler anal-
ysis concepts that support the application of scalar replace-
ment. We also survey related work in the context of map-
ping array variables to these architectures and contrast these
efforts with traditional register allocation approaches.
Configurable Architectures: Our work targets configurable
architectures with storage resources that can be configured
in an application-specific fashion. The target architecture
also has a large number of resources that can be organized
as either computing elements or discrete data registers. As
an example, the Xilinx Virtex-II [13] family of FPGAs has a
limited set of RAM blocks that can be configured as single-
or dual-ported RAM memories given a fixed bit capacity.
PipeRench [6] opts for a computation execution model based
on pipelining the data through a fixed set of stripes, each
with finite computational and storage elements. The XPP
Array [14] uses coarser-grained elements connected via a
programmable network and exports a low-level execution model
that resembles data-flow: for a given configuration of each
node, execution proceeds when the data inputs are available.
A significant difference between these architectures and
traditional processors is the absence of a unified address
space and underlying hardware mechanisms to enforce data
consistency across the various storage structures. Designers
must explicitly map high-level program variables to both
RAMs and registers and explicitly manage the flow of data
between them to enforce data consistency.
Data Reuse & Scalar Replacement: Data reuse analysis
for array variables in a loop nest relies on the concept of de-
pendence distance. The compiler observes the array refer-
ence index functions, in this context affine functions of the
enclosing loop index variables, and understands at which
loop iterations the same data element is reused.
for (i = 0; i < N_i; i++)
  for (j = 0; j < N_j; j++)
    for (k = 0; k < N_k; k++) {
      d[i][k] = a[k] * b[k][j];
      e[i][j][k] = c[j] * d[i][k];
    }

Figure 1. Example Code
In the code example in figure 1, where $N_i$, $N_j$ and $N_k$
denote the loop bounds, the reference $b[k][j]$ exhibits
reuse at the $i$ loop level, as for every iteration of the
$i$ loop the same location is accessed for the same values
of the $j$ and $k$ iterations. Scalar replacement converts
array references into scalar variables and then maps them to
registers. For $b[k][j]$, one can save the $N_k \times N_j$
accesses to $b[0][0], \ldots, b[N_k - 1][N_j - 1]$ in scalar
variables for the first iteration of $i$, and then reuse
these values for the subsequent $N_i - 1$ iterations of the
$i$ loop. By doing so the implementation eliminates
$N_k \times N_j \times (N_i - 1)$ memory accesses at the
expense of $N_k \times N_j$ scalar variables.
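As an illustration, the following C sketch applies this transformation to the first statement of figure 1. It is a minimal sketch, not the code generation scheme used in this work: the bounds NI, NJ and NK are small illustrative values, the array b_reg merely stands in for the $N_k \times N_j$ registers, and the saving step is written with a predicate on $i$ rather than with loop peeling. The function name is hypothetical.

```c
enum { NI = 4, NJ = 3, NK = 5 };  /* illustrative bounds, not from the paper */

/* Runs d[i][k] = a[k] * b[k][j] with b[k][j] scalar replaced.
 * Returns the number of reads issued to the array b. */
long run_scalar_replaced(const int a[NK], int b[NK][NJ], int d[NI][NK])
{
    int b_reg[NK][NJ];  /* NK * NJ scalars standing in for registers */
    long b_reads = 0;

    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            for (int k = 0; k < NK; k++) {
                if (i == 0) {
                    /* First iteration of i: fetch from memory and save. */
                    b_reg[k][j] = b[k][j];
                    b_reads++;
                }
                /* Every iteration of i reads the saved scalar. */
                d[i][k] = a[k] * b_reg[k][j];
            }
    return b_reads;
}
```

Without the transformation the loop nest reads b a total of NI * NJ * NK times; with it, only NK * NJ reads remain, which matches the $N_k \times N_j \times (N_i - 1)$ eliminated accesses stated above.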
Researchers have developed several compiler data de-
pendence analysis frameworks for uncovering data reuse for
affine references in loop nests [4, 8], and have analytically
computed the number of required registers to capture reuse
across the various loop levels in a nest [11]. As for code
generation, the application of scalar replacement and sub-
sequent mapping to registers can be accomplished by pre-
peeling the iterations of the loop where input data needs to
be saved in registers, or back-peeling the iterations of the
loop where the data needs to be restored to memory. The
complete code generation scheme, either using peeling or
predication, is beyond the scope of this paper.
Storage Resource Allocation: Minimizing the impact of
the access to memory has been a long standing problem.
Gokhale et al. [5] describe an algorithm for the mapping of
array variables to external memories in FPGA-based archi-
tectures. Weinhardt and Luk [12] describe a limited com-
piler approach for using RAM blocks to cache the data in
contemporary FPGAs. In our own work we have used the
same data reuse analysis framework outlined in this pa-
per to explore the area and space trade-offs of using RAM
blocks to store scalar replaced variables [2], whereas So and
Hall [11] exclusively use registers to cache the data. There
has also been extensive work in hierarchical data mapping
in order to improve overall performance metrics such as
time or power [1, 7, 10].
The classical register allocation problem focuses on the
assignment of a finite number of registers to scalar
variables only. Given the significance of this problem, and
its intractable worst-case complexity, researchers have
developed many algorithmic strategies. For example, Briggs
et al. [3] describe several graph coloring heuristics,
whereas Kolson et al. [9] propose a spill-minimizing
register allocation algorithm for embedded code generation.
Our register allocation approach differs from these ef-
forts in several aspects. First, we use scalar replacement
information to select the more profitable array references
in order to limit the number of required registers, without
limiting the reuse to innermost loop levels. Second, we ex-
ploit the data-flow information of the computation to coal-
locate registers to inputs of the same operation. Finally, and
as with other approaches for configurable architectures, our
approach and corresponding code generation must explic-
itly manage the flow of data between registers and RAMs.
3. Problem Formulation and Definitions
The register allocation for the scalars generated by an
aggressive application of scalar replacement can be
formulated as a Knapsack problem. In this formulation, an
object is an array reference represented by $ref_i$, whereas
the size of each object is the number of required registers
(footnote 1) for a full scalar replacement of that reference
and is represented by $R_i$. Furthermore, the value of each
object is the potential number of eliminated memory accesses
and the size of the register file is the knapsack size. A
simple objective function is to eliminate the most memory
accesses [4].
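Written out, this knapsack formulation reads as follows. The notation is assumed here for illustration only: $x_i$ is 1 if reference $i$ is fully scalar replaced and 0 otherwise, $V_i$ is the number of memory accesses eliminated by doing so, $R_i$ is the number of registers it requires, $N_A$ is the number of array references, and $N_R$ is the size of the register file.

```latex
\max_{x_1, \dots, x_{N_A}} \; \sum_{i=1}^{N_A} x_i V_i
\quad \text{subject to} \quad
\sum_{i=1}^{N_A} x_i R_i \le N_R,
\qquad x_i \in \{0, 1\}.
```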
This formulation however does not take into account the
dependences between references and the opportunities for
concurrent data accesses to RAM blocks. If references cor-
responding to distinct array variables are mapped to differ-
ent RAMs, accesses to them can proceed concurrently, only
incurring the latency of a single access. Considering this
concurrency opportunity, we formulate the register alloca-
tion problem for scalar replaced array references as finding
a register allocation that minimizes the completion time for
the computation in a loop nest.
To capture the notion of execution time, we abstract the
computation in a loop nest as a collection of data-flow
graphs (DFG). In this abstraction $ref_i$ and $op_i$
represent the various array references and operations in the
DFG, while $L(n_i)$ captures the latency of a node $n_i$,
i.e., of a specific numeric operation or a memory access. We
further assume the latencies of the numeric operations to be
known and the latency of a memory access for a specific
array reference to take one of two values, depending on
whether the array element is mapped to a register or to a
RAM block. Given a DFG, we define the latency of a path
$p_i$ as

$$L(p_i) = \sum_{n_j \in p_i} L(n_j)$$

and determine the Critical Path(s) (CP) of a DFG as the
path(s) with the highest latency. Finally, we define the
execution time $T_{exec}$ of a DFG as the maximum latency
across its paths, i.e.,
$T_{exec} = \max_{p_i \in paths} L(p_i)$, or simply
$T_{exec} = L(CP)$. Given these definitions, we wish to
determine a register allocation that minimizes the memory
access portion of $T_{exec}$ for the entire execution of the
loop, subject to the available number of registers $N_R$.
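The definition of the execution time can be made concrete with a short C sketch that computes the longest path latency of a DFG. This is a sketch under stated assumptions: the node layout (struct dfg_node), the requirement that nodes are stored in topological order, and the function name are all hypothetical and not part of the compiler infrastructure described here.

```c
#include <assert.h>

#define MAX_NODES 16

/* One DFG node: its latency and the indices of its predecessors.
 * Assumption: nodes are stored in topological order, so every
 * predecessor has a smaller index than the node itself. */
struct dfg_node {
    int latency;
    int num_preds;
    int preds[MAX_NODES];
};

/* Returns the largest sum of node latencies over all paths. */
int dfg_exec_time(const struct dfg_node *nodes, int n)
{
    int finish[MAX_NODES];
    int t_exec = 0;

    assert(n >= 0 && n <= MAX_NODES);
    for (int v = 0; v < n; v++) {
        int start = 0;
        for (int p = 0; p < nodes[v].num_preds; p++) {
            int u = nodes[v].preds[p];
            assert(u >= 0 && u < v);
            if (finish[u] > start)
                start = finish[u];
        }
        finish[v] = start + nodes[v].latency;
        if (finish[v] > t_exec)
            t_exec = finish[v];
    }
    return t_exec;
}
```

For the DFG of figure 1 with nodes a, b, op1, d, c, op2 and e, and assuming unit latency for every operation and RAM access, the function returns 5. Setting only the latency of a to zero leaves the result at 5, because the path through b is still critical; setting both a and b to zero reduces it to 4.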
In order to reduce the overall execution time, all the
critical paths in a DFG must be shortened. Improving only a
subset of the CPs would consume resources without having any
effect on the overall computation time. To address this
issue we introduce the Critical Graph (CG) as the subgraph
of the DFG that includes all of its CPs. We also define a
Cut of the Critical Graph as a minimal subset of its
reference nodes whose removal disconnects all the paths in
the CG (footnote 2). Therefore, in order to improve
$T_{exec}$, all the references in a Cut need to be stored in
registers.
Figures 2(a) and (b) depict the DFG and CG, along with the
set of possible cuts, for the example code in figure 1. In
terms of full scalar replacement, the references $a$, $b$,
$c$, $d$ and $e$ would require $R_a = 30$, $R_b = 600$,
$R_c = 20$, $R_d = 30$, and $R_e = 1$ registers
respectively. Given a limit of 64 registers, we cannot
possibly accommodate all scalar variables for all
references. If we assign the scalar variables generated by
$a[k]$ to 30 registers while keeping the scalar variables of
$b[k][j]$ in a RAM, then for each $a[k] * b[k][j]$ operation
we would need to read the data for $b[k][j]$ from the RAM.
The registers assigned to $a[k]$ would not be used
effectively, since the execution of the operation would
stall until the values corresponding to $b[k][j]$ are
retrieved from RAM. Instead we could allocate the available
30 registers to both $a$ and $b$ in order to improve the
data access. Even if we could not fully assign all the
scalar variables for $b[k][j]$, at least for a subset of the
operations in the loop both operands would be read from
registers.

[Figure 2: (a) Dataflow Graph; (b) Critical Graph with Cuts
{{a,b}, {d}, {e}}; (c) Computation with Scalar Replaced
References for 3 Algorithms, with the values listed below.]

Algorithm  βa  βb  βc  βd  βe  Tmem (cycles)
FR-RA      30   1  20   1   1  1,800
PR-RA      30   1  20  12   1  1,560
CPA-RA     16  16   1  30   1  1,184

Figure 2. Stages of Allocation Algorithm.

1 The techniques to calculate this number have been
extensively addressed in [11].
2 A simple algorithm to find a cut of a graph consists of
iteratively selecting a node of the graph and eliminating
all its ancestors and descendants until no more nodes are
left in the graph. In the worst case, finding all the cuts
of a graph is exponential.
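The simple cut-finding procedure outlined in footnote 2 can be sketched in C as follows. This is a minimal sketch under stated assumptions: the critical graph is reduced to its reference nodes and given as a reachability matrix, the order in which nodes are selected is fixed by a starting index, and the names find_cut and reach are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_REFS 8

/* reach[u][v] is true when reference v is a descendant of
 * reference u in the critical graph. Starting at node `first`,
 * repeatedly selects a remaining node and eliminates its
 * ancestors and descendants. The selected nodes are written to
 * cut[]; the return value is the size of the cut. */
int find_cut(bool reach[MAX_REFS][MAX_REFS], int n, int first,
             int cut[MAX_REFS])
{
    bool alive[MAX_REFS];
    int size = 0;

    assert(n > 0 && n <= MAX_REFS && first >= 0 && first < n);
    for (int v = 0; v < n; v++)
        alive[v] = true;

    for (int step = 0; step < n; step++) {
        int u = (first + step) % n;
        if (!alive[u])
            continue;
        cut[size++] = u;          /* select u */
        alive[u] = false;
        for (int v = 0; v < n; v++)
            if (reach[u][v] || reach[v][u])
                alive[v] = false; /* drop descendants and ancestors */
    }
    return size;
}
```

With the references of the critical graph in figure 2(b) numbered a = 0, b = 1, d = 2 and e = 3, starting at a yields the cut {a, b}, starting at d yields {d}, and starting at e yields {e}. Enumerating all cuts would require trying all selection orders, which is where the exponential worst case comes from.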
Inputs: ref_i, R_i, B/C(ref_i), N_R, N_A
Output: β_i : Number of registers assigned to array ref_i
Knapsack Solution Full and Partial Reuse {
  for each ref_i do set β_i = 1
  if (Σ_{i=1..N_A} R_i ≤ N_R) then
    for (i = 1 to N_A)
      β_i = R_i
  else
    // Variant 1. Full Reuse Allocation
    sort array references {ref_1, ..., ref_{N_A}}
      based on the descending value of B/C.
    for each ref_i in the sorted list do
      if (R_i ≤ N_R) then
        β_i = R_i
        N_R = N_R - R_i
      end if
    end for
    // Variant 2. Partial Reuse Allocation
    for the first ref_j where β_j < R_j do
      β_j = N_R
}
Figure 3. Full Reuse and Partial Reuse Register
Allocation Algorithms (FR-RA & PR-RA).
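The two variants of figure 3 can be sketched in C as shown below. This is a hedged sketch, not the implementation used in the experiments: the function names are hypothetical, the benefit/cost ordering is passed in as a precomputed index array, and the register accounting (the single register given to each reference up front counts against the budget, so upgrading a reference costs its requirement minus one) is an assumption chosen because it reproduces the allocations shown in figure 2(c).

```c
#include <assert.h>

/* FR-RA. required[i] is the register requirement of reference i,
 * order[] lists reference indices by descending benefit/cost, and
 * beta[i] receives the allocation. Returns the number of
 * registers left unallocated. */
int fr_ra(const int *required, const int *order, int n, int n_regs,
          int *beta)
{
    int left = n_regs - n;  /* one register per reference up front */

    assert(left >= 0);
    for (int i = 0; i < n; i++)
        beta[i] = 1;
    for (int s = 0; s < n; s++) {
        int r = order[s];
        int extra = required[r] - beta[r];
        if (extra <= left) {  /* full reuse fits */
            beta[r] = required[r];
            left -= extra;
        }
    }
    return left;
}

/* PR-RA: as FR-RA, then the leftover registers go to the first
 * reference in the sorted list that is not fully allocated. */
int pr_ra(const int *required, const int *order, int n, int n_regs,
          int *beta)
{
    int left = fr_ra(required, order, n, n_regs, beta);

    for (int s = 0; s < n && left > 0; s++) {
        int r = order[s];
        if (beta[r] < required[r]) {
            beta[r] += left;  /* partial reuse only */
            left = 0;
        }
    }
    return left;
}
```

For the example of figure 1, with requirements {30, 600, 20, 30, 1} for a, b, c, d, e, a budget of 64 registers, and an assumed ordering c, a, d, b, e, fr_ra produces (30, 1, 20, 1, 1) with 11 registers left over, and pr_ra produces (30, 1, 20, 12, 1).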
4. Allocation Algorithms
We now present several greedy algorithms to tackle the
knapsack problem as formulated in section 3. In the
description of the algorithms we denote by $B/C(ref_i)$ the
benefit/cost metric, defined as the ratio of saved memory
accesses over the number of required registers for reference
$ref_i$. We denote the maximum number of available registers
and the number of array references by $N_R$ and $N_A$
respectively. Finally, $\beta_i$ indicates the number of
registers that the algorithm assigns to reference $ref_i$.
The first variant, named Full Reuse Register Allocation
(FR-RA), starts by assigning one register to each array
reference to render the computation feasible. It then uses
the value of $B/C(ref_i)$, i.e., the number of memory
accesses saved for $ref_i$ divided by $R_i$, to greedily
assign the available registers to the data references that
yield the best benefit/cost ratio. For each reference
$ref_i$, if possible, the algorithm assigns $R_i$ registers,
corresponding to fully exploiting the data reuse for that
reference. This proceeds until the algorithm exhausts all
the available registers, leading to an assignment of either
$R_i$ or 1 to $\beta_i$.
This simple greedy algorithm might leave some registers
unallocated, as upon termination the remaining number of
registers might not be enough to satisfy a value of $R_i$.
In variant 2 of the algorithm, named Partial Reuse Register
Allocation (PR-RA), we allow the assignment of the extra
registers to the next reference in the sorted list. For this
reference the implementation exploits partial data reuse, as
it assigns $\beta_j$ registers with $1 \le \beta_j < R_j$.
Clearly, these algorithms do not attempt to reduce the
computation time. To address this issue, we use a greedy
approach named Critical Path Aware Register Allocation
(CPA-RA) algorithm. This last variant calculates the value
Input: ref_i, R_i, N_R, dfg: Data Flow Graph.
Output: β_i : Number of registers assigned to array ref_i
CriticalPathAware {
  for each ref_i do set β_i = 1
  while (N_R > 0) do
    cg = Make_CG(dfg);
    cg_cuts = Find_Cuts(cg);
    Find_Req_Reg(cg_cuts);
    best_cut : Element of the cg_cuts with the min Req_Reg
    R_cut = Σ_{ref_i ∈ best_cut} R_i
    if (R_cut ≤ N_R) then
      for (ref_i ∈ best_cut) do
        β_i = R_i
      N_R -= R_cut
      R_cut = 0
    end if
    if (R_cut > 0 && 0 < N_R) then
      for (ref_i ∈ best_cut) do
        β_i = N_R / (Num of references in best cut)
      N_R = 0
    end if
}
Figure 4. Critical-Path-Aware Register Allocation
Algorithm (CPA-RA).
of $\beta_i$ for each $ref_i$ by finding the most valuable
references on the Critical Path (CP) and distributing the
available registers among them, with the objective of
minimizing the memory access time along the CP. In order to
determine the critical path of a set of DFGs, the algorithm
assumes a specific scheduling implied by the mapping of
array variables to RAM blocks and the limited computing
resources.
CPA-RA starts by constructing the Data-Flow Graph (DFG) for
the computations in the loop body. It then extracts the
Critical Graph and finds all the possible cuts of the CG.
After calculating the number of registers required to fully
accommodate each cut
($R_{cut} = \sum_{ref_i \in cut} R_i$), the algorithm
selects the cut with the minimum $R_{cut}$. For the selected
cut, if possible, the algorithm assigns $R_{cut}$ registers,
corresponding to fully exploiting the data reuse for the
references of the cut. Otherwise it divides the available
registers equally between the references. The algorithm
repeats this process until it consumes all the available
registers. The complexity of the algorithm is a function of
the number of critical paths and their memory accesses, and
is therefore exponential in the worst case. In practice,
however, and in our experiments, the CG is generally so
small that this is not a concern.
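The allocation loop of CPA-RA can be sketched in C as follows. This is a simplified sketch under several assumptions that are not part of the algorithm as described: the cuts of the critical graph are supplied as a fixed list instead of being recomputed from the DFG in every iteration, cuts are ranked by the registers they still need (so that cuts whose references are already fully allocated are skipped), and the one register given to each reference up front counts against the budget. The names struct cut and cpa_ra are hypothetical.

```c
#include <assert.h>

#define MAX_CUT 4

struct cut {
    int num_refs;
    int refs[MAX_CUT];  /* indices of the references in the cut */
};

/* Extra registers a cut still needs for full reuse. */
static int cut_deficit(const struct cut *c, const int *required,
                       const int *beta)
{
    int need = 0;
    for (int i = 0; i < c->num_refs; i++)
        need += required[c->refs[i]] - beta[c->refs[i]];
    return need;
}

void cpa_ra(const int *required, int n, const struct cut *cuts,
            int num_cuts, int n_regs, int *beta)
{
    int left = n_regs - n;  /* one register per reference up front */

    assert(left >= 0);
    for (int i = 0; i < n; i++)
        beta[i] = 1;

    while (left > 0) {
        int best = -1, best_need = 0;
        for (int c = 0; c < num_cuts; c++) {
            int need = cut_deficit(&cuts[c], required, beta);
            if (need > 0 && (best < 0 || need < best_need)) {
                best = c;
                best_need = need;
            }
        }
        if (best < 0)
            break;  /* every cut is already fully in registers */

        const struct cut *bc = &cuts[best];
        if (best_need <= left) {
            /* Full reuse for every reference of the cut. */
            for (int i = 0; i < bc->num_refs; i++)
                beta[bc->refs[i]] = required[bc->refs[i]];
            left -= best_need;
        } else {
            /* Split what is left equally; any remainder of the
             * integer division stays unused. */
            int share = left / bc->num_refs;
            for (int i = 0; i < bc->num_refs; i++) {
                int r = bc->refs[i];
                beta[r] += share;
                if (beta[r] > required[r])
                    beta[r] = required[r];
            }
            left = 0;
        }
    }
}
```

For the example of figure 1, with requirements {30, 600, 20, 30, 1} for a, b, c, d, e, the cuts {a, b}, {d} and {e}, and 64 registers, the sketch first satisfies the cut {d} and then splits the remaining registers between a and b, giving (16, 16, 1, 30, 1) as in figure 2(c).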
We now illustrate the application of these three algorithms
to the example code in figure 1 using 64 available
registers. For this code, the benefit/cost metric ranks
references $a$ and $c$ ahead of $d$, with $b$ and $e$ last.
The FR-RA algorithm assigns the available 64 registers to
the references in this order, resulting in the assignment
$(\beta_a, \beta_b, \beta_c, \beta_d, \beta_e) = (30, 1, 20, 1, 1)$.
The PR-RA algorithm assigns registers in the same order, but
since there are 11 registers left, it assigns them to the
$d$ array reference, resulting in the assignment
$(30, 1, 20, 12, 1)$. Finally, the CPA-RA algorithm first
selects cut $\{d\}$ due to its minimum number of required
registers and assigns 30 registers to this reference,
thereby reducing the length of the CG by one node (memory
access). In a second iteration, the algorithm picks the cut
$\{a, b\}$ and assigns the remainder of the registers
equally to references $a$ and $b$.
Figure 2(c) illustrates the register distribution for the
various arrays of figure 1 as a result of applying the above
algorithms. Considering the loop bounds and the latency of
each RAM access, under a serial execution, the code
resulting from the application of the FR-RA algorithm would
exhibit 1,800 cycles for memory operations, whereas using
the PR-RA algorithm it would exhibit only 1,560 cycles,
since 12 out of the 30 iterations of $k$ have only 2 memory
accesses. For the CPA-RA algorithm, iterations have either 1
or 2 memory accesses, due to the full scalar replacement of
$d$ and a partial scalar replacement of $a$ and $b$. As a
result a total of 1,184 cycles are devoted to memory
operations. It is important to notice that, for this
example, CPA-RA substantially reduces the cycles devoted to
memory operations using the exact same register resources.
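The three cycle counts can be reproduced with a small C model. The model is an assumption made for illustration and is not taken from the experimental flow: it counts one cycle per RAM access over a single pass of the $j$ and $k$ loops with $N_j = 20$ and $N_k = 30$ (the bounds implied by the register requirements of $c$ and $a$), lets $a$ and $b$ be fetched concurrently, treats a reference with a single register as capturing no reuse, assumes that a reference with more registers holds its first elements in them, and does not count the loads of $c[j]$, which is invariant in the inner loop.

```c
#include <stdbool.h>

enum { N_J = 20, N_K = 30 };  /* bounds implied by R_c = 20, R_a = 30 */

/* A reference with beta registers keeps its first beta elements
 * in registers; a single register captures no reuse. */
static bool in_register(int element_index, int beta)
{
    return beta > 1 && element_index < beta;
}

/* beta[] holds the allocation for a, b, c, d, e in this order.
 * Returns the memory cycles for one pass over the j and k loops,
 * assuming one cycle per RAM access. */
int memory_cycles(const int beta[5])
{
    int cycles = 0;

    for (int j = 0; j < N_J; j++)
        for (int k = 0; k < N_K; k++) {
            bool a_reg = in_register(k, beta[0]);
            bool b_reg = in_register(j * N_K + k, beta[1]);
            bool d_reg = in_register(k, beta[3]);

            cycles += (a_reg && b_reg) ? 0 : 1; /* a, b concurrent */
            cycles += d_reg ? 0 : 1;            /* d[i][k] */
            cycles += 1;                        /* e always in RAM */
        }
    return cycles;
}
```

Under these assumptions the allocations (30, 1, 20, 1, 1), (30, 1, 20, 12, 1) and (16, 16, 1, 30, 1) yield 1,800, 1,560 and 1,184 cycles respectively, matching figure 2(c).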
5. Experimental Results
We validated the register allocation algorithm for
a set of six signal and image processing code ker-
nels. The Finite-Impulse-Response (FIR) and Decimation
FIR filter (Dec-FIR) code kernels compute a convolu-
tion of a 

 ! -long vector of   -bit values against a   and
   -long sequence of coefficients, with and without a dec-
imation factor of  respectively. The MAT kernel performs
a   #   matrix-matrix multiplication. The IMI ker-
nel computes the interpolation of two grey-scaled  # 
images for 


intermediate image values. The PAT ker-
nel finds the various occurrences of an 

-character long
string pattern in a 

 ! length string. Finally, BIC com-
putes a Binary-Image-Correlation between a  #  tem-
plate image and successively overlapping regions of a
larger  ! #  ! image. With the exception of MAT and BIC,
which are structured as 3- and 4-deep loop nests
respectively, all kernels are structured as 2-deep loop
nests with compile-time known bounds.
For each kernel, written in C, we applied scalar replace-
ment at the source C level and then converted the trans-
formed C codes to behavioral VHDL. To decouple the ex-
periment from the code generation complexity issues of
scalar replacement due to the use of loop peeling, we opted
to use the same structure of control (in terms of loops and
peeled sections) for all of the code versions. Next we
converted the behavioral descriptions of the codes into a
structural VHDL design using Mentor Graphics' Monet(TM)
high-level synthesis tool. We then used the Synplify Pro 6.2
and Xilinx ISE 4.1i tool sets for logic synthesis and
Place-and-Route (P&R) targeting a Xilinx Virtex(TM) XCV 1K
BG560
device. After P&R we extracted the real area and clock rate
for each design and used the number of cycles derived from
the simulation to calculate wall-clock execution time.
In these experiments we imposed a limit of 64 registers that
each implementation may use to capture data reuse. In
practice this limit must be imposed by the compiler as part
of a global resource allocation policy, orthogonal to these
experiments. For each code kernel we derived three designs,
respectively v1, v2 and v3, reflecting the three register al-
location algorithm variants FR-RA, PR-RA and CPA-RA
described in section 4.
Table 1 depicts the results for the register allocation
and corresponding hardware designs. The third and fourth
columns indicate the number of registers required by each
array reference for a full scalar replacement, and the regis-
ters allocated by the algorithms, respectively. The fifth col-
umn presents the number of execution cycles, indicating the
percentage reduction with respect to the base code version
v1. The sixth column presents the attained clock period
for the hardware design in nano-seconds, as extracted after
P&R. The seventh column presents the wall-clock time for
the execution of the computation which takes into account
the attained clock rate. The execution time data is used to
calculate the speedup of the implementations with respect
to the base version. Finally the last two columns present the
resources used by each design in terms of slices (out of a
maximum of 12,288) and number of RAM blocks.
In terms of the register allocation algorithms, code ver-
sions v2 use substantially more registers than the corre-
sponding versions v1, in an attempt to exploit partial data
reuse. Versions v3 use almost all the available registers as
they evenly distribute the number of registers among the op-
erations on the critical path.
As expected, using more registers leads to a reduction of
the number of RAM accesses and hence to a reduction in the
number of execution cycles. The figures in column  show
consistently positive gains with an average percentage im-
provement of + , + . and   , 
 . for versions v2 and v3 re-
spectively. In some cases, such as Dec-FIR and PAT, us-
ing more registers in v2 does not lead to a reduction in the
number of cycles as the inputs to the same operations are lo-
cated in distinct types of storage. In fact because the control
for these designs is more complex than the base version v1
there is an increase in the clock period leading to an over-
all performance degradation as revealed by column + .
The CPA-RA algorithm mitigates this problem by allocating
registers to references that always decrease the number of
clock cycles. The results reveal that, even though there is
a noticeable clock degradation for the more complex v3
designs, the reduction in the number of clock cycles
compensates for this clock rate degradation, improving on
the results of v2 for all cases but MAT and BIC. For
configurable architectures where the clock rate is fixed
regardless of the design complexity, the results would yield
performance improvements for all code variants, as derived
from the reduction of the number of clock cycles.

[Table 1 columns: Kernel | Code Version | Required S.R.
Registers | Number of Registers (Distribution, Total) | Total
Execution Cycle Count | Clock Period (ns) | Execution Time
(µs) | FPGA Slices (Count, Occupancy) | RAMs Used. Rows: FIR,
Dec-FIR, IMI, MAT, PAT, BIC.]
Table 1. Analysis and experimental results.
Overall, code versions v2 exhibit an unimpressive aver-
age wall-clock time gain of - / 0 2 4 whereas the code ver-
sions v3 yield a respectable 5 2 / 5 4 gain even for an aver-
age clock rate loss of 7 / : 4 . It is worth to notice that CPA-
RA improves the performance of versions v3 over v2, for
an average 5 2 4 and 5 - / 2 4 for clock cycles and wall-clock
time respectively. This improvement is achieved in many
cases with little or no additional registers and no significant
increase in the used number of slices, making the proposed
CPA-RA a very effective register allocation algorithm for
this class of configurable computing architectures.
6. Conclusion
Emerging configurable architectures exhibit a rich set of
storage and computing resources which must be explicitly
managed by compilers for maximum efficiency. In this pa-
per we have described a register allocation algorithm for
scalar variables resulting from the aggressive application
of scalar replacement. We proposed a critical-path-aware
allocation strategy that exploits internal registers and
parallel accesses to RAM blocks. We showed that this
algorithm leads to substantial performance gains over common
register allocation strategies.
References
[1] F. Balasa, F. Catthoor, and H. DeMan. Dataflow-driven
Memory Allocation for Multi-dimensional Signal Process-
ing systems. In Intl. Conf. on Computer Aided Design, 1994.
[2] N. Baradaran, J. Park, and P. C. Diniz. Compiler Reuse Anal-
ysis for the Mapping of Data in FPGAs with RAM Blocks.
In IEEE Conf. on Field-Programmable Technology, 2004.
[3] P. Briggs, K. Cooper, K. Kennedy, and L. Torczon. Coloring
Heuristics for Register Allocation. In ACM Conf. on Pro-
gramming Language Design and Implementation, 1989.
[4] D. Callahan, S. Carr, and K. Kennedy. Improving Register
Allocation for Subscripted Variables. In ACM Conf. on Pro-
gramming Language Design and Implementation, 1990.
[5] M. Gokhale and J. Stone. Automatic Allocation of Arrays
to Memories in FPGA Processors with Multiple Memory
Banks. In IEEE Symp. on FPGAs for Custom Computing
Machines, IEEE Computer Society Press, pp. 63-69, 1999.
[6] S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi,
R. Taylor, and R. Laufer. PipeRench: A Coprocessor for
Streaming Multimedia Acceleration. In 26th Intl. Symp. on
Comp. Architecture, ACM Press, pp. 28-39, 1999.
[7] P. Jha and N. Dutt. High-level Library Mapping for Memo-
ries. ACM Trans. on Design Automation of Electronic Sys-
tems, 5(3):566–603, January 1999.
[8] M. Kandemir and A. Choudhary. Compiler-directed Scratch
Pad Memory Hierarchy Design and Management. In Design
Automation Conference, 2002.
[9] D. Kolson, A. Nicolau, N. Dutt, and K. Kennedy. Optimal
Register Assignment to Loops for Embedded Code Genera-
tion. ACM Trans. on Design Automation of Electronic Sys-
tems, 1(2):251-279, 1996.
[10] I. Ouaiss and R. Vemuri. Hierarchical Memory Mapping
During Synthesis in FPGA-based Reconfigurable Comput-
ers. In Design Automation and Test in Europe, 2001.
[11] B. So and M. Hall. Increasing the Applicability of Scalar Re-
placement. In ACM Symp. on Compiler Construction, 2004.
[12] M. Weinhardt and W. Luk. Memory Access Optimization for
Reconfigurable Systems. IEE Proc.-Comput. Digit. Tech.,
148(3):105–112, 2001.
[13] Xilinx Inc. Virtex 2.5v FPGA Product Spec. (v2.4), 2000.
[14] XPP Technologies, Inc. The XPP White Paper, 2002.
