From Loop Transformation to Hardware Generation by Devos, Harald et al.
From Loop Transformation to Hardware Generation
Harald Devos, Kristof Beyls, Mark Christiaens, Jan Van Campenhout and Dirk Stroobandt
Ghent University, ELIS - PARIS
Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium
{hmdevos,kbeyls,mchristi,jan.vancampenhout,dstrooba}@elis.ugent.be
Abstract—Multimedia applications are examples of a class
of algorithms that are both calculation and data intensive
and have real-time requirements. As a result dedicated
hardware acceleration is often needed.
Usually the on-chip memory is not sufficient to store all
data and has to be extended with external memory. The
bandwidth to this memory often becomes a bottleneck.
Loop transformations are needed to reduce this bandwidth
by improving the temporal and spatial data locality. They
can also unveil the parallelism present in the algorithm. The
polyhedral model offers a flexible program representation
that allows this kind of transformation to be automated. The
class of applications that can be transformed with the
polyhedral model fits very well into the class of applications that
can benefit from hardware acceleration.
This paper describes how the existing tools that generate
software from a polyhedral program representation have
been extended to generate a VHDL description of a
hardware controller. The corresponding data path is generated
semi-automatically. Combining the generation of controller
and data path creates a fast path to hardware. Our
techniques enable easy exploration of the design space by
generating many implementation variants.
The techniques are demonstrated on an inverse discrete
wavelet transform, resulting in several synthesizable designs,
one of which has been hand-optimized towards an FPGA
implementation. The results outperform those of a commercial
C-to-VHDL compiler. The generated variants run 5 to 10
times faster while consuming fewer resources.
Keywords— Polyhedral Model, Loop Transformation,
Design Space Exploration, Hardware Synthesis, Discrete
Wavelet Transform.
I. INTRODUCTION
There exists a class of applications for which a software
implementation cannot offer the required performance.
For example, real-time video decoding needs dedicated
hardware. FPGAs (Field Programmable Gate Arrays) offer
a high computational power combined with a huge internal
bandwidth. Although the clock frequency is much lower
than on a processor, many applications can be accelerated
thanks to the high parallelism available. All calculation
and memory blocks on an FPGA can work in parallel.
The design path from a high-level algorithm description
to a low-level synthesizable hardware description is
a long, manual and error-prone process. Iterating this
process, e.g., to examine the influence of design choices,
requires a lot of effort and is often not possible. Therefore
automation techniques have to be developed.
Exploiting the computational power of an FPGA is only
possible when the data access rate can keep up with the data
processing rate. Modern FPGAs contain many memories and
registers that are accessible in parallel and offer a huge
on-chip bandwidth. However, for many applications these
memories are not large enough to store all the data. The
required storage can be offered by external memory, but the
bandwidth to this memory will be lower and the latency
higher.
This is very similar to the memory bottleneck known
in processor architectures with a hierarchy of main
memory, caches and registers. To reach a high performance the
accesses to the slower external memory have to be
minimized. Many techniques have been developed to
improve the cache behaviour of algorithms in software [19].
They try to bring the reuses of data elements closer
together (temporal locality), or group the uses of elements
that are located close to each other (spatial locality). This
is done by loop transformations that change the execution
order of statement instances in a loop nest. Although these
techniques were mainly developed for software
implementations, they are also useful for designing hardware.
This paper shows how a polyhedral program represen-
tation can be used to automate loop transformations and
automate part of the hardware design process.
Section II explains how programs can be represented
and transformed in the polyhedral model. Hardware gener-
ation from this representation is described in Sect. III. As a
case study implementation variants of the Inverse Discrete
Wavelet Transform (IDWT) are built in Sect. IV. Some re-
lated work is listed in Sect. V. Section VI ends the paper
with conclusions and future work.
II. THE POLYHEDRAL MODEL
Compilers and refactoring tools that apply loop transformations
need a way to represent loop nests and their corresponding
boundaries. Usually, abstract syntax trees (AST)
are used. In an AST, each loop corresponds to a node in a
tree, and inner loops are represented as child nodes of an
outer loop node. In contrast, our framework is based on
a representation of loops in the so-called polyhedral program
model [5].

Depth                                   Statement     Ordering    Iteration  Scheduling
0  1  2                                 abstraction   vector      vector     vector
0: for i = 0, N
   0: for j = 0, N − i
         if i = 0 then
      0:    T[i + j] = 1                S1(i, j)      (0, 0, 0)   (i, j)     (0, i, 0, j, 0)
         end if
      1: T[i + j] = T[i + j]I[i][j]     S2(i, j)      (0, 0, 1)   (i, j)     (0, i, 0, j, 1)
1: for k = 0, N − 1
   0: O[k] = T[k] + T[k + 1]            S3(k)         (1, 0)      (k)        (1, k, 0)
2: PRINT("End of program.")             S4()          (2)         ()         (2)
(a) Program code with ordering, iteration and scheduling vector of the statements.
(b) Polyhedral representation of the iteration domain of S2 (axes i and j, each running from 0 to N). The arrows indicate the execution order. [Plot not reproduced.]
Fig. 1. Program example to illustrate the polyhedral program representation
A. Program Representation
A statement is a line of a program without control, typi-
cally an assignment with operations at the right hand side.
The (lexical) depth of a loop or a statement is the number
of loops that surround it. The program top-level (depth 0)
is a sequence of loops and statements which are numbered
from 0 onwards (Fig. 1(a)). Each loop in its turn also con-
tains a sequence of statements and loops that are numbered
from 0 onwards. Each statement is uniquely identified by
the vector composed of the numbers telling the position in
each of the surrounding loops and the top-level. This vec-
tor is called the ordering vector. It has dimension DS + 1
with DS the depth of statement S.
A statement is executed for a set of values of the itera-
tion vector, the vector containing the iterators of the sur-
rounding loops (dimension DS). A single execution of a
statement is called a statement instance. The iteration do-
main is the set of values of the iteration vector for which
the statement is executed (Fig. 1(b)). The scheduling vec-
tor of a statement is the vector with the elements of the
ordering vector as odd elements and the iterators as even
elements. The dimension is 2DS + 1. The execution order
of statement instances follows the lexicographical order of
their scheduling vectors (Fig. 1(b)). The uniqueness of the
ordering vectors ensures that the statement instances are
strictly ordered.
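
The construction of the scheduling vector and the resulting execution order can be illustrated with a short sketch. The Python below is only an illustration and not part of the tool flow described in this paper; the helper names scheduling_vector and instances and the choice N = 2 are assumptions made for the example. It interleaves the ordering and iteration vectors of the statements in Fig. 1(a) and sorts the statement instances lexicographically.

```python
def scheduling_vector(ordering, iteration):
    """Interleave an ordering vector (length D+1) with an iteration vector (length D)."""
    assert len(ordering) == len(iteration) + 1
    vec = []
    for o, it in zip(ordering, iteration):
        vec += [o, it]
    vec.append(ordering[-1])
    return tuple(vec)


def instances(n):
    """All statement instances of the program in Fig. 1(a) for parameter N = n."""
    out = []
    for j in range(n + 1):                       # D_S1: i = 0, 0 <= j <= N
        out.append(("S1", scheduling_vector((0, 0, 0), (0, j))))
    for i in range(n + 1):                       # D_S2: 0 <= i <= N, 0 <= j <= N - i
        for j in range(n - i + 1):
            out.append(("S2", scheduling_vector((0, 0, 1), (i, j))))
    for k in range(n):                           # D_S3: 0 <= k <= N - 1
        out.append(("S3", scheduling_vector((1, 0), (k,))))
    out.append(("S4", scheduling_vector((2,), ())))
    return out


# Python compares tuples lexicographically, so sorting on the
# scheduling vector reproduces the execution order.
schedule = sorted(instances(2), key=lambda inst: inst[1])
```

Because no two statements share an ordering vector, no two instances share a scheduling vector, so the sort never has to break a tie.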
We restrict ourselves to programs where the loop bounds
are linear expressions (affine functions) of some parame-
ters and the iterators of the surrounding loops. In this case
the iteration domains can be represented as parameterized
integer polyhedra, hence the name polyhedral model.
A parameterized integer polyhedron $P_p$ is defined as
\[ P_p = \{ x \in \mathbb{Z}^n \mid A x \ge B p + b \}, \quad p \in \mathbb{Z}^m \]
where A and B are constant integer matrices, b is a con-
stant integer vector, and p is a vector of parameters. Con-
sider the program in Fig. 1. The iteration domains for the
statements can be represented by polyhedra as follows
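\[ D_{S1} = \left\{ \begin{bmatrix} i \\ j \end{bmatrix} \in \mathbb{Z}^2 \;\middle|\; \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ -1 \end{bmatrix} \begin{bmatrix} N \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \right\} \tag{1} \]
\[ D_{S2} = \left\{ \begin{bmatrix} i \\ j \end{bmatrix} \in \mathbb{Z}^2 \;\middle|\; \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ -1 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \ge \begin{bmatrix} 0 \\ -1 \\ 0 \\ -1 \end{bmatrix} \begin{bmatrix} N \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \right\} \tag{2} \]
\[ D_{S3} = \left\{ \begin{bmatrix} k \end{bmatrix} \in \mathbb{Z}^1 \;\middle|\; \begin{bmatrix} 1 \\ -1 \end{bmatrix} \begin{bmatrix} k \end{bmatrix} \ge \begin{bmatrix} 0 \\ -1 \end{bmatrix} \begin{bmatrix} N \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\} \tag{3} \]
or, using a more compact notation:
\[ D_{S1} = \{ (i, j) \in \mathbb{Z}^2 \mid i = 0 \wedge 0 \le j \le N \} \]
\[ D_{S2} = \{ (i, j) \in \mathbb{Z}^2 \mid 0 \le i \le N \wedge 0 \le j \le N - i \} \]
\[ D_{S3} = \{ (k) \in \mathbb{Z}^1 \mid 0 \le k \le N - 1 \}. \]

[Figure: CLooG takes loop.cloog and produces loop.c; the figure also shows defines.h and the resulting software (SW).]
Fig. 2. The Chunky Loop Generator (CLooG) generates (control) code from a polyhedral representation (*.cloog). Together with the statement definitions (*.h) this results in executable software.

[Figure: block diagram with a loop control block, a statements block (datapath + control), the control and data connections between them, and the memories mem A, mem B and mem C.]
Fig. 3. Architecture with separate memory for each array.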
If the array indices are linear expressions of the iterators
and parameters, the dependences between the statement in-
stances can also be represented as polyhedra or unions of
polyhedra, which are called dependence domains [6].
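
The matrix form $Ax \ge Bp + b$ of the iteration domains can be checked against the compact notation by brute-force enumeration. The sketch below is only meant as a sanity check for small parameter values; the function names in_polyhedron and points, the bounding box and the value N = 2 are assumptions introduced for this example, and real polyhedral tools do not enumerate points this way.

```python
from itertools import product


def in_polyhedron(A, B, b, x, p):
    """True if A x >= B p + b holds row by row (integer arithmetic only)."""
    for row_a, row_b, b_k in zip(A, B, b):
        lhs = sum(a * xi for a, xi in zip(row_a, x))
        rhs = sum(c * pi for c, pi in zip(row_b, p)) + b_k
        if lhs < rhs:
            return False
    return True


def points(A, B, b, p, box):
    """Integer points of the polyhedron that lie inside a bounding box."""
    dim = len(A[0])
    return [x for x in product(box, repeat=dim) if in_polyhedron(A, B, b, x, p)]


# Matrices of D_S2 as given in (2)
A_S2 = [[1, 0], [-1, 0], [0, 1], [-1, -1]]
B_S2 = [[0], [-1], [0], [-1]]
b_S2 = [0, 0, 0, 0]

N = 2
domain_S2 = points(A_S2, B_S2, b_S2, (N,), range(-1, N + 2))

# The same domain written in the compact notation
compact_S2 = [(i, j) for i in range(N + 1) for j in range(N - i + 1)]
```

For N = 2 both lists contain the same six points of the triangle 0 ≤ i ≤ N, 0 ≤ j ≤ N − i.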
B. Describing Loop Transformations
In the polyhedral model, a program is represented as a
set of statements. For each statement, the values of the
surrounding loop iterators for which it is executed are
represented by a number of matrices, e.g., (1)-(3).
All loop transformations are performed by transforming
the matrices and scheduling vectors of the statements
that are contained in the transformed loops. The statement
definitions, i.e., the operations that are executed by a
statement as a function of the iterators, remain untouched. After
a transformation has been applied, the code is still represented
by a set of statements with associated matrices. As
a result, loop transformations can easily be combined by
combining the corresponding matrix operations.
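
As a minimal illustration of a transformation expressed as a matrix operation, the sketch below applies a loop interchange to the domain matrix of S2. It uses the textbook unimodular view, in which new iterators x' = Tx turn the constraint Ax ≥ Bp + b into (A T⁻¹)x' ≥ Bp + b; this is an assumption made for the example, since the representation actually used by URUK is the one described in [9] and [12]. The helper matmul is hypothetical, and the legality of the transformation with respect to dependences is not checked here.

```python
def matmul(X, Y):
    """Product of two integer matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Domain matrix A of D_S2 from (2), acting on x = (i, j)
A = [[1, 0], [-1, 0], [0, 1], [-1, -1]]

# Loop interchange: new iterators (i', j') = (j, i).
# The permutation matrix is its own inverse.
T_interchange = [[0, 1], [1, 0]]
T_inverse = T_interchange

# Constraints on the new iterators: (A T^-1) x' >= B [N] + b
A_interchanged = matmul(A, T_inverse)

# Combining transformations amounts to multiplying their matrices.
# Interchanging twice gives the identity, i.e. the original loop nest.
T_twice = matmul(T_interchange, T_interchange)
```

The right-hand side Bp + b is unchanged by the interchange; only the columns of A are permuted, which swaps the roles of the two iterators in every constraint.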
The polyhedral representation can then be translated
back into an AST to obtain an executable specification,
e.g., by CLooG (Chunky Loop Generator) [4] (Fig. 2). The
exact representation of transformations, and their practical
implementation in a tool called URUK, is presented in de-
tail in [9] and [12].
III. HARDWARE GENERATION
As explained in Sect. II, a program definition can be
split into statement definitions and a representation of the
iteration domains and ordering. This separation can be
used to create a hardware architecture composed of two
parts, a loop control entity that drives the iterators and
statement instances and a statements entity that implements
the statement operations for the iterator values received
from the control entity (Fig. 3 and 4).

[Figure: abstract syntax tree with the root node Program, identifier nodes ID0, ID1 and ID2, loop nodes for i, for j and for k, and leaves S1 to S4, shown next to the loop control entity built from identifier blocks (IDx), loop counter blocks (for 0, for 1), control signals and the statement implementation S1..4.]
Fig. 4. Abstract syntax tree of the program in Fig. 1(a) and corresponding hardware architecture of the loop controller. A single hardware block implements all statements S1..4. The block for 0 implements the loops at depth 0: for i and for k. Which of the two is executed depends on the value received from ID0.
This section describes the (semi-)automatically gener-
ated architecture of these blocks. Until now only a sequen-
tial execution of the statement instances is supported. This
means that only the parallelism within one statement can
be exploited.
A. Loop Control
The loop controller consists of a set of communicating
automata, a so-called factorized implementation [2].
Experiments showed that the proposed factorized implementation
consumes less area and reaches a higher clock
frequency than a monolithic control block.
In the abstract syntax tree (AST) on Fig. 4, the numbers
and iterators on the path between the top-node program
and a statement node correspond to the elements of the
scheduling vector of that statement in Fig. 1(a).
We propose a controller composed of automata, each
corresponding to one dimension of the scheduling vector.
This results in two types of automata. The first type,
the loop counters, is responsible for the iterators, e.g., for
0 in Fig. 4 drives i and k. The other type, the identifier
blocks, corresponds to the elements of the ordering vectors,
e.g., ID 0 in Fig. 4. The loop counter blocks calculate
the loop bounds and stride as a function of the parameters,
the iterators of surrounding loops and the more significant
elements of the ordering vector (see footnote 1). The identifier
blocks count from zero onwards to enumerate the different
statements and loops at the level below.
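
The division of work between identifier blocks and loop counters can be mimicked in software. The sketch below is a behavioural analogy only, not the VHDL produced by CLooGVHDL: each nesting level of the hypothetical generator loop_controller stands for one automaton of Fig. 4, and the value N = 2 is chosen arbitrarily.

```python
def loop_controller(n):
    """Yield (statement, iterators) in the order the controller of Fig. 4 issues them."""
    for id0 in range(3):                          # identifier block ID0: three items at depth 0
        if id0 == 0:
            for i in range(0, n + 1):             # loop counter "for 0" drives i
                # identifier block ID1 has a single item here: the j loop
                for j in range(0, n - i + 1):     # loop counter "for 1", bound depends on i
                    for id2 in range(2):          # identifier block ID2: S1, then S2
                        if id2 == 0:
                            if i == 0:            # S1 only exists for i = 0 (see D_S1)
                                yield ("S1", (i, j))
                        else:
                            yield ("S2", (i, j))
        elif id0 == 1:
            for k in range(0, n):                 # loop counter "for 0" now drives k
                yield ("S3", (k,))
        else:
            yield ("S4", ())


trace = list(loop_controller(2))
```

The loop counter at depth 0 selects its bounds from the value of ID0, which corresponds to the dependence of the loop bounds on the more significant elements of the ordering vector.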
The tool CLooGVHDL was built on top of the CLooG
library to generate VHDL code for this hardware structure,
starting from a polyhedral description in a *.cloog file. The
execution time of a statement instance does not have to be
known at compile time and may vary during execution or
between variants of the statement implementations. A simple
handshake between controller and statements makes
this possible without losing clock cycles.
B. Statement Control and Data Path
All statements are implemented as a single VHDL pro-
cess. This is only possible since no two statements operate
at the same time. It allows the synthesis tools to “see”
all statements during optimization and results in hardware
being shared between statements. The generation of the
statements entity is not fully automated yet. This is not a
large problem since a loop transformation only influences
the controller entity and therefore the statements entity can
be reused for several loop transformation variants.
The statement definitions are split into operations and
array accesses. The former are translated into VHDL
syntax and the latter are translated into memory transactions,
assuming one memory for each array and an access
time of two cycles (Fig. 3). An intermediate file, steps.vhd
in Fig. 5, contains the actions per cycle for each statement.
In this file it is possible to do some scheduling optimizations
by hand. (Scheduling techniques such as in [3] could
automate this step.) From here the path to synthesizable
VHDL is automated (Steps2process).
IV. CASE STUDY: THE 2-D IDWT
A. The Basic Algorithm and Variants
The Discrete Wavelet Transform (DWT) and its inverse
(IDWT) have become popular for image processing and
compression, e.g., JPEG-2000. Many optimized implementations
have been published [15]. Fig. 6 shows the
algorithm, using a 9/7 bi-orthogonal filter pair [16]. Each
level of transformation (iterator l) consists of a vertical
and a horizontal filtering step. This results in bad data
locality because the input is first scanned column by column
and then row by row, and this for each level. This
explains the name of this variant: Row-Column-based
level-by-level.

for l = K-1, 0
  S = R / 2^(l+1)
  for j = 0, (C / 2^l) − 1            // Vertical filtering.
    for i = 0, S − 1
      B[2i, j]   = A[i−1..i+1, j] · Ho + A[S+i−2..S+i+1, j] · Go
      B[2i+1, j] = A[i−1..i+2, j] · He + A[S+i−2..S+i+2, j] · Ge
  S = C / 2^(l+1)
  for i = 0, (R / 2^l) − 1            // Horizontal filtering.
    for j = 0, S − 1
      A[i, 2j]   = B[i, j−1..j+1] · Ho + B[i, S+j−2..S+j+1] · Go
      A[i, 2j+1] = B[i, j−1..j+2] · He + B[i, S+j−2..S+j+2] · Ge
Fig. 6. Simplified representation of the IDWT basic algorithm (Row-Column-based). R and C are parameters representing the number of rows and columns of the image that is transformed over K levels. A and B are two-dimensional arrays and Go, Ge, Ho and He are vectors containing the odd and even elements of the wavelet filters G and H with lengths 9 and 7 (a 9/7 bi-orthogonal filter pair).

Footnote 1: The generated VHDL expressions for the loop bounds are equivalent to the expressions generated by CLooG. As a result, operators such as mod and div are synthesizable only for powers of 2. Techniques such as those presented in [20] could extend this synthesizable subset.
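
Footnote 1 restricts mod and div in the loop bounds to powers of 2. A common reason for such a restriction, assumed here rather than stated in the paper, is that for non-negative operands these two cases reduce to selecting bits, which costs no arithmetic hardware. The sketch below shows the equivalence in Python; the function names div_pow2 and mod_pow2 are hypothetical.

```python
def div_pow2(x, k):
    """x div 2**k for a non-negative integer x: drop the k least significant bits."""
    return x >> k


def mod_pow2(x, k):
    """x mod 2**k for a non-negative integer x: keep the k least significant bits."""
    return x & ((1 << k) - 1)
```

For example, the bound C / 2^l in Fig. 6 becomes a right shift of C by l positions once l is a known constant.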
Several loop transformations, found by manual analysis
or suggested by tools like SLO [7], can improve the
locality. After interchanging the i and j iterators of the vertical
filtering step, the two filtering steps can be fused, which
brings the production and consumption of the elements of
array B close to each other. The input is now scanned line
by line, hence the name line-based level-by-level. Tiling
[19] allows the operations of the different transformation
levels to be interleaved. More details on the construction of
the transformation sequences can be found in [10].
As can be seen in Fig. 6, some indices get values outside
the array (image) boundaries. Therefore the image has to
be extended by mirroring the pixels near the border. This
makes the code more complex than shown here.
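
One way to picture the mirroring is as an index mapping that reflects out-of-range indices back into the image. The sketch below assumes whole-sample symmetric extension, in which the border pixel itself is not repeated; the paper does not state which mirroring convention is used, so this choice and the helper name mirror_index are assumptions.

```python
def mirror_index(idx, n):
    """Map any integer index onto 0..n-1 by reflecting at the borders.

    Whole-sample symmetric extension: the border pixel is not repeated,
    so index -1 maps to 1 and index n maps to n - 2.
    """
    if n == 1:
        return 0
    period = 2 * (n - 1)
    idx %= period              # Python's % is non-negative for a positive modulus
    return idx if idx < n else period - idx


row = [10, 20, 30, 40, 50]
extended = [row[mirror_index(i, len(row))] for i in range(-3, len(row) + 3)]
```

In a predicated statement, the distinction between center and border then comes down to whether any accessed index falls outside 0..n-1.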
B. Generating Variants without Memory Hierarchy
Some preprocessing was needed before the loop
transformations were possible. First, the code in Fig. 6 was
converted to a dynamic single assignment form to
eliminate false dependences. Secondly, the outer loop was
unrolled to remove the exponential expressions in the loop
boundaries of the i and j loops so that the program could
be represented in the polyhedral model.
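
The effect of unrolling the outer loop can be seen by listing the loop bounds per level: once l is a constant, each bound of Fig. 6 is a constant fraction of R or C and therefore affine in the parameters. The sketch below tabulates the trip counts for a CIF frame (288 × 352) and K = 3, values taken from Tables I and II; the function name unrolled_bounds is hypothetical and R and C are assumed to be divisible by 2^K.

```python
def unrolled_bounds(rows, cols, levels):
    """Trip counts (outer, inner) of the loops in Fig. 6 for each level l = K-1 .. 0."""
    table = []
    for l in range(levels - 1, -1, -1):
        table.append({
            "l": l,
            # vertical filtering: for j = 0, C/2^l - 1 ; for i = 0, R/2^(l+1) - 1
            "vertical": (cols >> l, rows >> (l + 1)),
            # horizontal filtering: for i = 0, R/2^l - 1 ; for j = 0, C/2^(l+1) - 1
            "horizontal": (rows >> l, cols >> (l + 1)),
        })
    return table


bounds = unrolled_bounds(288, 352, 3)
```

Each entry of the table corresponds to one copy of the loop body after unrolling, with no exponential expression left in its bounds.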
To investigate the influence of the complexity due to
the mirroring at the borders, two versions of the code were
made: one that does not calculate the pixels near the borders
and one that does, by using predicated statements. The
predicates are used to distinguish between the execution at
the center and near the borders of the image. This distinction
is made within the statement and not within the
loop control structure. This makes the loop transformations
and control complexity similar for both cases and will only
have an impact on the statements' implementation.

[Figure: tool flow with the blocks CLooGVHDL, VIM-scripts, scheduling optimizations, Steps2process and Build Memory Hierarchy, and the files loop.cloog, defines.h, steps.vhd, loop.vhd and statements.vhd, ending in hardware (HW).]
Fig. 5. CLooGVHDL extends the CLooG software generation process (Fig. 2) to create hardware.
The WRaP-IT/URUK tool set [9], [12] performs a
sequence of loop transformations specified in a script.
Several scripts were written to generate variants: fused
(line-based), fused-and-tiled, with and without borders.
For each variant a hardware controller was generated
by CLooGVHDL and combined with a statement block,
with or without manual optimizations. Table II gives an
overview of all the different implementations. They are
compared with three designs generated by Impulse C (ver-
sion 1.22) [1], a commercial tool for automatic synthesis
of stream-based applications, and a manual design [11],
that aimed at minimal area while transforming 45 CIF
frames/s.
The inclusion of the calculations near the borders adds
a lot of complexity to the design. If a processor is avail-
able in the system it might be beneficial to put the border
calculations on the processor and run only the regular non-
border operations on the FPGA.
Impulse C has the worst synthesis results. It aims
at compiling a larger class of C programs than those
representable in the polyhedral model. Therefore it is
more conservative in its approach to hardware generation,
while CLooGVHDL has a more specialized implementation
strategy. Impulse C generates a single large automaton instead
of a factorization into smaller automata. In some places 32-bit
data types are used where smaller word lengths suffice,
even if shorter data types are used in the source code.
Newer versions of this tool are likely to give better results.
The frame rate printed in the last column of Table II assumes a
calculation-limited design. In practice the accesses to the
external memory can slow down the design. Table I shows
the data flow and burst usage for transforming one frame,
assuming the data locality is exploited, e.g., by a memory
hierarchy as in Fig. 7. This illustrates the effectiveness of
the loop transformations.
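
The closed forms in Table I are consistent with a geometric series over the K levels, where each level works on an image four times smaller than the previous one. The sketch below checks this with exact fractions. The per-level factors used here (4 for the row-column variant, 2 for the fused variant) are an interpretation of the table and are not given explicitly in the text; the function names are hypothetical.

```python
from fractions import Fraction


def data_flow_factor(per_level, levels):
    """Sum of per_level * RC / 4**l for l = 0 .. levels-1, in units of RC."""
    return sum(Fraction(per_level, 4 ** l) for l in range(levels))


def closed_form(coefficient, levels):
    """coefficient * (1 - 1/4**K), in units of RC."""
    return coefficient * (1 - Fraction(1, 4 ** levels))


K = 3
# Assumed: two filtering passes per level, each reading and writing the level's image
row_column = data_flow_factor(4, K)
# Assumed: one read and one write of the level's image per level
fused = data_flow_factor(2, K)
```

For K = 3 the two sums evaluate to 21/4 and 21/8, which are the factors 5.25 and 2.625 listed in Table I.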
C. Building a Memory Hierarchy
The architecture constructed until now (Fig. 3) uses one
memory for each array and does not yet exploit the data
locality by using on-chip buffers. One promising variant,
fused with borders and manual optimizations, was selected
to be extended with a memory hierarchy as in Fig. 7.

TABLE I
DATA FLOW (IN PIXELS) TO/FROM THE MAIN MEMORY (LEFT SIDE OF FIG. 7) FOR DIFFERENT IDWT VARIANTS.

Variant         Data flow                 Data flow (K = 3)   Burst usage
RC              (16/3) RC (1 − 1/4^K)     5.25 RC             50%
Fused           (8/3) RC (1 − 1/4^K)      2.625 RC            100%
Fused + Tiled   2 RC                      2 RC                100%

[Figure: block diagram with main memory and a fetch/store control block in the clk 1 domain, and local memories (mem), loop control and the statements block (datapath + control) in the clk 2 domain, linked by control and data connections.]
Fig. 7. Architecture with memory hierarchy. Fetching and storing data is done in parallel with the operations of the IDWT. Execution speed is determined by the slowest of these two processes. Multiple local buffers allow parallel memory accesses.
The large memories were replaced by small buffers
large enough to contain the working data set. By using
2-port memories the fetching and storing can be done in
parallel with the statement execution. Since the bus con-
nected to the main memory is likely to be shared with other
cores the timing of data transfers is not deterministic. Syn-
chronization points delay operations if data is not available
in the buffers on time.
A queue of prefetch and store requests is kept in the
fetch/store control block. A new request is added each time
space becomes available in the buffers, and not just before
the data is needed. Therefore, in a system that is not
bandwidth limited, time is wasted waiting for data only at
the beginning.
Some buffers are split into parallel accessible line
buffers to increase the on-chip bandwidth. As a result the
frame rate increases to more than 45 frames/s and is lower
than that of the manual design only because of the lower clock
frequency. The number of cycles is roughly the same for both
designs.
These two designs were tested on an Altera PCI Development
Board with a Stratix EP1S60F1020C6 FPGA
and DDR SDRAM memory. An Avalon fabric is used to
connect the hardware blocks to the external memory,
running at 50 MHz. The generated design transforms up to 53
CIF frames/s (synthesis with Quartus II v5.1). The manually
made design reached only 11.3 CIF frames/s due to
the large number of transactions that could not be done in
bursts.
A line-based software implementation reached 43.5 CIF
frames/s on an AMD Athlon XP 2500+ running at 1830
MHz. This corresponds to 42 million cycles for one CIF
frame, 50 times more than the optimized FPGA implemen-
tations.
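
The cycle counts quoted here follow from dividing the clock frequency by the frame rate, and the same relation links the fmax, Cycles and Frames/s columns of Table II for the manual design. The sketch below reproduces the arithmetic; the function names are hypothetical and all numbers are taken from the text and Table II.

```python
def frames_per_second(clock_hz, cycles):
    """Frame rate of a calculation-limited design."""
    return clock_hz / cycles


def cycles_per_frame(clock_hz, frame_rate):
    """Clock cycles spent on one frame."""
    return clock_hz / frame_rate


# Software on the Athlon XP: 1830 MHz at 43.5 CIF frames/s
software_cycles = cycles_per_frame(1830e6, 43.5)

# Manual design in Table II: 68.91 MHz and 869530 cycles per CIF frame
manual_fps = frames_per_second(68.91e6, 869530)

# Software cycles relative to the generated design with memory hierarchy (868917 cycles)
ratio = software_cycles / 868917
```

The software needs about 42.07 million cycles per frame, and the ratio comes out at roughly 48, which the text rounds to 50.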
V. RELATED WORK
Hardware generation from high-level languages is the
target of many projects. In MMAlpha [13], loop nests
are represented in a functional, dynamic single assign-
ment language. The code is mapped onto a systolic ar-
ray. The PICO project [17] also translates loop nests into
systolic arrays and runs the left-over code on a specialized
EPIC (Explicitly Parallel Instruction Computing) or VLIW
(Very Long Instruction Word) processor. PARO [14] maps
Piecewise Regular Algorithms (PRA) onto a configurable
processor array. These projects all handle a subset of the
programs we can handle.
The Compaan/Laura tool suite [18], [20] translates poly-
hedral loop nests into Kahn Process Networks (KPN),
by eliminating global memory and global control. Laura
translates the KPNs into VHDL.
The Cameron Project has created a high-level algorith-
mic language, named SA-C [8], for expressing image pro-
cessing applications. Compilation to FPGAs is done using
data flow graphs. Impulse C [1] uses a subset of C ex-
tended with IO-macros. Translation is done by construct-
ing one large finite automaton where the states relate to the
control flow graph of the program.
These projects all focus more on exploiting parallelism
than on improving bandwidth aspects. The inputs for the
algorithms range from PRAs (MMAlpha, PICO, PARO),
over polyhedral programs (Compaan, SA-C, this work), to
more general constructs (Impulse C). This results in differ-
ent trade-offs between the set of algorithms handled and
the efficiency of the resulting hardware.
VI. CONCLUSIONS AND FUTURE WORK
Hardware acceleration on an FPGA requires not only
enough parallelism in an application but also
good data locality. Loop transformations can be used to
improve the data locality and are easily performed using
the polyhedral model. Automating the hardware generation
from this model makes it possible to compare implementation
variants. Although a manual implementation can be more
efficient, generated designs may outperform it due to bandwidth
requirements.
Future work may include the automation of the schedul-
ing operations and the construction of a memory hierarchy.
The hardware architecture will have to be adapted to allow
parallel execution of statement instances. Isolating com-
plex code, e.g., due to border extension, and running it on
a processor might improve the performance.
VII. ACKNOWLEDGEMENTS
The authors would like to thank A. Cohen, S. Girbal and
N. Vasilache for providing access to the URUK tool and
giving support. We would like to thank Altera for donat-
ing FPGA boards and tools. This research is supported by
I.W.T. grant 020174, F.W.O. grant G.0021.03 and by GOA
project 12.51B.02 of Ghent University. Harald Devos is
supported by the F.W.O. (Research Foundation - Flanders).
REFERENCES
[1] http://www.impulsec.com/.
[2] P. Ashar, S. Devadas, and A. R. Newton. Sequential logic synthe-
sis. Kluwer Academic Publishers, 1992.
[3] I. Augé, F. Pétrot, F. Donnet, and P. Gomez. Platform-based
design from parallel C specifications. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems,
24(12):1811–1826, December 2005.
[4] C. Bastoul. Efficient code generation for automatic parallelization
and optimization. In ISPDC’2 IEEE International Symposium
on Parallel and Distributed Computing, pages 23–30, Ljubljana,
October 2003.
[5] C. Bastoul, A. Cohen, S. Girbal, S. Sharma, and O. Temam.
Putting polyhedral loop transformations to work. In LCPC’16
International Workshop on Languages and Compilers for Parallel
Computing, LNCS 2958, pages 209–225, College Station, October
2003.
[6] C. Bastoul and P. Feautrier. More legal transformations for local-
ity. In EURO-PAR Parallel Processing, Lecture Notes in Com-
puter Science, volume 3149, pages 272–283. Springer-Verlag
Berlin, 2004.
[7] K. Beyls and E. H. D’Hollander. Intermediately executed code
is the key to find refactorings that improve temporal data local-
ity. In CF ’06: Proceedings of the 3rd conference on Computing
Frontiers, pages 373–382, New York, NY, USA, May 2006. ACM
Press.
[8] W. Bohm, J. Hammes, B. Draper, M. Chawathe, C. Ross,
R. Rinker, and W. Najjar. Mapping a single assignment program-
ming language to reconfigurable systems. Journal of Supercom-
puting, 21(2):117–130, February 2002.
TABLE II
COMPARISON OF DIFFERENT IMPLEMENTATIONS OF THE IDWT. SYNTHESIS RESULTS WITHOUT MEMORIES (FIG. 3 OR THE
RIGHT SIDE (CLK 2) OF FIG. 7) WERE OBTAINED USING ALTERA QUARTUS II V4.2 FOR THE STRATIX FPGA FAMILY. THE
FRAME RATE IS NORMALIZED TO CIF FRAMES (288 × 352) WITH BORDERS. (* = NUMBER OF CYCLES FOR CIF
RESOLUTION INSTEAD OF 72 × 88 PIXELS)

Tool                                   Transform     Borders   LE      DSP blocks   fmax    Cycles      Frames/s
                                                                       (#Mul)       (MHz)   (72×88)     (288×352)
CLooGVHDL                              none          yes       3561    18(9)        46.72   214269      16.99
CLooGVHDL                              none          no        1691    18(9)        52.16   187350      20.15
CLooGVHDL                              fuse          yes       3336    18(9)        49.24   215848      17.78
CLooGVHDL                              fuse          no        1712    18(9)        53.66   200138      19.41
CLooGVHDL                              fuse + tile   yes       3895    18(9)        39.68   215881      14.33
CLooGVHDL                              fuse + tile   no        2575    18(9)        46.20   200497      16.68
CLooGVHDL + Manual opt.                none          no        1495    18(9)        58.16   129821      32.43
CLooGVHDL + Manual opt.                fuse          yes       4525    18(9)        40.47   161037      19.59
CLooGVHDL + Manual opt.                fuse          no        1622    18(9)        57.32   142570      29.11
CLooGVHDL + Manual opt.                fuse + tile   no        2420    18(9)        49.45   142929      25.05
CLooGVHDL + Manual opt. + Mem hier.    fuse          yes       17645   18(9)        47.55   * 868917    45.72
Impulse C                              none          yes       37127   144(18)      24.13   697431      2.70
Impulse C                              none          no        13146   80(10)       30.27   605588      3.38
Impulse C                              fuse          yes       23283   144(18)      34.70   508116      5.32
Manually                               none          yes       1738    10(5)        68.91   * 869530    79.25
Manually + Mem hier.                                           2184    13(8)
[9] A. Cohen, S. Girbal, D. Parello, M. Sigler, O. Temam, and
N. Vasilache. Facilitating the search for compositions of program
transformations. In ACM Int. Conf. on Supercomputing (ICS’05),
Boston, Massachusetts., June 2005.
[10] H. Devos, K. Beyls, M. Christiaens, J. Van Campenhout, E. H.
D’Hollander, and D. Stroobandt. Finding and Applying Loop
Transformations for Generating Optimized FPGA Implementa-
tions. Transactions on High Performance Embedded Architec-
tures and Compilers, 2006. To be published.
[11] H. Devos, H. Eeckhaut, B. Schrauwen, M. Christiaens, and
D. Stroobandt. Ever considered SystemC? In Proceedings of the
15th ProRISC Workshop, pages 358–363, Veldhoven, November
2004.
[12] S. Girbal. Optimisations d’applications - Composition de
transformations de programme: modèle et utils. PhD thesis, Université
de Paris-Sud, 2005.
[13] A. C. Guillou, P. Quinton, and T. Risset. Hardware synthesis for
systems of recurrence equations with multi-dimensional sched-
ule. International Journal of Embedded Systems, 2005. To be
published.
[14] F. Hannig, H. Dutta, and J. Teich. Mapping a Class of Depen-
dence Algorithms to Coarse-grained Reconfigurable Arrays: Ar-
chitectural Parameters and Methodology. International Journal
of Embedded Systems, 2(1/2):114–127, 2006.
[15] C.-T. Huang, P.-C. Tseng, and L.-G. Chen. Analysis and
VLSI architecture for 1-D and 2-D discrete wavelet transform.
IEEE Transactions on Signal Processing, 53(4):1575–1586, April
2005.
[16] S. G. Mallat. A theory for multiresolution signal decomposition:
the wavelet representation. IEEE Transactions on Pattern Analy-
sis and Machine Intelligence, 11(7):674–693, July 1989.
[17] B. R. Rau and M. S. Schlansker. Embedded computer architecture
and automation. IEEE Computer, 34(4):75–83, April 2001.
[18] A. Turjan, B. Kienhuis, and E. Deprettere. Translating affine
nested-loop programs to process networks. In CASES ’04: Pro-
ceedings of the 2004 international conference on Compilers, ar-
chitecture, and synthesis for embedded systems, pages 220–229,
New York, NY, USA, 2004. ACM Press.
[19] M. Wolfe. High performance compilers for parallel computing.
Addison-Wesley, 1996.
[20] C. Zissulescu, B. Kienhuis, and E. Deprettere. Expression syn-
thesis in process networks generated by LAURA. In 16th IEEE
International Conference on Application-Specific Systems, Archi-
tecture Processors (ASAP 2005), pages 15–21, July 2005.
