Hybrid data/configuration caching for striped FPGAs by Deshpande, Deepali et al.
Electrical and Computer Engineering 
Conference Papers, Posters and Presentations Electrical and Computer Engineering 
1999 
Hybrid data/configuration caching for striped FPGAs 
Deepali Deshpande 
Iowa State University 
Arun K. Somani 
Iowa State University, arun@iastate.edu 
Akhilesh Tyagi 
Iowa State University, tyagi@iastate.edu 
Follow this and additional works at: https://lib.dr.iastate.edu/ece_conf 
 Part of the Electrical and Computer Engineering Commons 
Recommended Citation 
Deshpande, Deepali; Somani, Arun K.; and Tyagi, Akhilesh, "Hybrid data/configuration caching for striped 
FPGAs" (1999). Electrical and Computer Engineering Conference Papers, Posters and Presentations. 137. 
https://lib.dr.iastate.edu/ece_conf/137 
This Conference Proceeding is brought to you for free and open access by the Electrical and Computer Engineering 
at Iowa State University Digital Repository. It has been accepted for inclusion in Electrical and Computer 
Engineering Conference Papers, Posters and Presentations by an authorized administrator of Iowa State University 
Digital Repository. For more information, please contact digirep@iastate.edu. 
Hybrid data/configuration caching for striped FPGAs 
Abstract 
Most custom computing machine (CCM) design has centered around field-programmable gate array 
(FPGA) technology and rapid prototyping applications. FPGAs are reconfigured to map parts of the 
application. The performance of an FPGA when used as a virtual hardware engine depends on its 
reconfiguration granularity. We study the striped FPGA and propose a hybrid mechanism to process a 
large amount of data using a combination of data and configuration caching. 
Keywords 
field programmable gate arrays, cache storage, reconfigurable architectures, special purpose computers, 
performance evaluation 
Disciplines 
Electrical and Computer Engineering 
Comments 
This is a manuscript of a proceeding published as Deshpande, Deepali, Arun K. Somani, and Akhilesh 
Tyagi. "Hybrid data/configuration caching for striped FPGAs." In Seventh Annual IEEE Symposium on 
Field-Programmable Custom Computing Machines (1999): 294-295. DOI: 10.1109/FPGA.1999.803703. 
Posted with permission. 
This conference proceeding is available at Iowa State University Digital Repository: https://lib.dr.iastate.edu/
ece_conf/137 
Hybrid DataConguration Caching for Striped FPGAs 
Deepali Deshpande Arun K Somani
Electrical and Computer Engineering Department




Iowa State University Ames 
EMail tyagiiastateedu
Abstract
In recent years, interest in custom computing machines (CCMs) has been
steadily increasing. Much of the activity surrounding CCMs has centered
on Field-Programmable Gate Array (FPGA) technology and rapid prototyping
applications. To support applications, FPGAs are reconfigured so that
pieces of the application are mapped onto them temporally. The
performance of an FPGA used as a virtual hardware engine depends on its
reconfiguration granularity. We study the striped FPGA and propose a
hybrid mechanism to process a large amount of data using a combination
of data and configuration caching. We also present an analysis
quantifying the total execution time and I/O overhead of the scheme to
determine its applicability domain.
1 Introduction
Originally introduced for prototyping digital circuits, Field-Programmable
Gate Arrays (FPGAs) are now considered a new and potentially better means
of performing computation. A host processor acts as the controlling unit:
it determines which problem (an application or part of one) is to be
solved on the hardware and configures the FPGA accordingly to create an
application-specific circuit for that problem. For many regular
applications, such as DSP, image processing, data encryption/decryption,
and DNA sequence matching, the performance gain obtained with CCMs is at
least an order of magnitude greater than that of processor-based
approaches.
Many CCMs, for example SPLASH, PAM, and RACE, use commercially available
FPGAs belonging to families such as Xilinx and Altera Flex. The
reconfigurable arrays are rarely large enough to encode entire
interesting programs all at once. To overcome this hardware restriction
and to make the platform application independent, smaller configurations
handling different pieces of a program must be swapped in over time.
Thus the performance gain obtainable from these machines is partly
determined by the reconfiguration time and the reconfiguration
granularity of the FPGA. To overcome these problems, various researchers
have proposed building a machine that tightly couples reconfigurable
hardware with a conventional microprocessor. This has led to new
architectures for reconfigurable hardware suited to fast run-time and
partial reconfiguration. New FPGA architectures that have evolved over
the last five years include the PipeRench striped FPGA, Garp, and Colt.
This work was funded by Carver Trust Grants, Iowa State University.
Reconfigurability provides hardware virtualization, which is important
for CCMs because they have to execute applications with varying hardware
requirements. Hardware virtualization involves defining a complete set
of configurations required by the application, but scheduling parts of
it in a sequence on the hardware to complete the application. The
configuration size is usually large, and reconfiguration is costly in
terms of time as well as power. The hardware should not be reconfigured
often just because it is reconfigurable; the decision should be based on
the application requirements and the available hardware resources.
PipeRench pipelinerecon
gurable FPGA  is
suitable for implementing pipelined applications	 As
shown in Figure  it consists of a fabric which is a set of
hardware stripes connected in a pipeline fashion	 The
basic unit of recon
guration is a stripe	 It is possible to
implement the application with the number of pipeline
stages greater than the number of stripes by hardware
virtualization or by recon
guring the same stripe to
perform the function of dierent pipeline stages at dif
ferent times	 Figure  shows a large and wide cache
that stores either the con
gurations required to imple
ment pipeline stages of the application or the interme
diate data produced during processing	 This architec
ture allows incremental pipeline recon
guration since
the time required to con
gure a stripe is the same as

















Figure  Striped FPGA Architecture
Such an architecture can be used with both data and configuration
scheduling schemes. We first review two scheduling and caching options
for the striped FPGA: data caching and configuration caching. Next, we
propose a new scheduling scheme for the striped FPGA, hybrid caching,
study its performance, and compare it with that of data and
configuration caching. We close with concluding remarks and suggestions
for future work.
 Conguration vs Data Caching
A pipelined application can be described using the
number of pipeline stages S and the number of data
elements X to be processed by the pipeline	 When
the number of pipeline stages to be implemented S
is greater than the number of stripes k in the FPGA
fabric the application is scheduled temporally on the
FPGA as de
ned by the recon
guration pattern to
assist the FPGA fabric to store con
gurations input
data and intermediate results	
Conguration Caching As the name suggests the
con
guration caching approach for scheduling stores all
the con
gurations required by the application in the
cache	 This approach is used in PipeRench 	 We
number k stripes in the FPGA from  to k and the
pipeline stages of the application from  to S	 If the
execution starts at time step or cycle number t then at
cycle number t i i    i mod S
th con
guration
enters in the fabric	 Generally k  S	 This way once
a data element enters the fabric it passes through all
pipeline stages	 Thus when k  S one element enters
the pipeline every cycle but when k  S k   data
elements enter the pipeline every S cycles	 When k 
S the throughput reduces since one stripe is always
under recon
guration	
To illustrate this scheduling scheme when k  S
consider a simple application consisting of six pipeline
stages	 Let the number of stripes in the fabric be
three	 The application needs to process  data ele
ments x    x	 The operation to be performed on n
th
data element is ffffffxn	 Figure 


















































Figure  Execution using Con
guration Caching
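The configuration caching rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, configurations are indexed from 0 so that `i mod S` can be used directly, and the case k >= S is treated as needing no reconfiguration, which is our reading of the text.

```python
from fractions import Fraction


def configuration_entering(i: int, num_stages: int) -> int:
    """0-based index (an assumption) of the configuration loaded i cycles after the start."""
    return i % num_stages


def steady_state_throughput(num_stripes: int, num_stages: int) -> Fraction:
    """Data elements accepted per cycle, following the rule stated in the text."""
    if num_stripes >= num_stages:
        # All stages fit in the fabric: one element enters every cycle.
        return Fraction(1)
    # One stripe is always being reconfigured: k - 1 elements every S cycles.
    return Fraction(num_stripes - 1, num_stages)


if __name__ == "__main__":
    S, k = 6, 3  # the six-stage, three-stripe example from the text
    print([configuration_entering(i, S) for i in range(8)])
    print(steady_state_throughput(k, S))
```

For the six-stage, three-stripe example this gives two data elements every six cycles.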
Data Caching Data caching technique also con
g
ures one stripe at a time but uses the cache to store
data or intermediate results	 However after con
gur
ing the k strips each of these k con
gurations is kept
in the fabric until all X data elements pass through it	
After that the next k con
gurations are loaded	 The
intermediate results produced at the end of every k
pipeline stages are stored in the cache to be accessed
later for execution by next set of pipeline stages	 When
k  S one data element enters the fabric every exe
cution cycle	 The result is produced at the rate of one
per cycle during the last round when the last pipeline
stage is present in the fabric	
Considering the same example execution of data
caching is shown in Figure 	 Similar to con
gura
tion caching data caching fetches con
gurations and
data once from the external memory	 Cached inter




f1(x1) f1(x2) f1(x3) f1(x4) f1(x5) f1(x6)
f2(x1) f2(x2) f2(x3) f2(x4) f2(x5) f2(x6)





f4(x3) f4(x4) f4(x5) f4(x6)
f5(x1)
f5(x2) f5(x3) f5(x4) f5(x5) f5(x6)
f6(x1) f6(x2) f6(x3) f6(x4) f6(x5) f6(x6)
Figure  Execution using Data Caching
application processes X data elements then there is
one bubble in the pipeline after X cycles	
Both configuration and data caching require a wide global configuration
bus for loading the configurations into the stripes. The results are
produced whenever a data element has been processed by all the stages.
Thus the latency depends on the number of pipeline stages and also on
the number of data elements to be processed by the application. In data
caching, the latency for the first result increases with the number of
data elements to be processed.
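The data caching schedule described above can be sketched as follows. This is a minimal illustration, assuming stages are numbered 1 to S and that each round holds the next group of at most k consecutive stages; the function names are hypothetical.

```python
import math


def data_caching_rounds(num_stages: int, num_stripes: int) -> list:
    """Group stages 1..S into consecutive sets of at most k stages, one set per round."""
    return [
        list(range(first, min(first + num_stripes, num_stages + 1)))
        for first in range(1, num_stages + 1, num_stripes)
    ]


def first_result_round(num_stages: int, num_stripes: int) -> int:
    """1-based round in which results first appear: the round that holds the last stage."""
    return math.ceil(num_stages / num_stripes)


if __name__ == "__main__":
    print(data_caching_rounds(6, 3))
    print(first_result_round(6, 3))
```

Because every one of the X data elements passes through a round before the next round starts, the first result only appears in the final round, which is why the first-result latency grows with X.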
Model of Execution Time The 
rst part of the
execution time is spent initialization and then the ac
tual execution starts	 The execution involves fetch
ing con
gurations and data and feeding them into the
pipeline	 Whenever the con
guration or the data is not
available pipeline stalls	 The notation used are given
in Table 	
Table 1: Parameters Used
Symbol  Description
k       Number of stripes in the FPGA fabric
M       Cache size (bytes)
Wd      Data element size (bytes)
Wc      Size of a stripe configuration word (bytes)
nd      Number of cycles required to fetch a data element from the
        external memory
nc      Number of cycles required to fetch a configuration from the
        external memory
Xmax    Maximum number of data elements, floor(M/Wd), that can be
        stored in the cache
Cmax    Maximum number of configurations, floor(M/Wc), that can be
        stored in the cache
S       Number of pipeline stages in the application
N       Total distinct configurations required by the application to
        get S pipeline stages
X       Number of data elements to be processed
f(n)    Coverage function (see text)
We make the following assumptions for our study.
1. There are no stalls in the pipeline due to data writes, as they are
   supported by a write buffer.
2. The online prefetch controller with the FPGA initiates prefetch of
   configurations and data of regular applications whenever it finds the
   I/O bus free.
3. The prefetch buffer has the capacity to store k configurations and
   k - 1 data elements.
4. Both the prefetch buffer and the cache are dual ported.
5. In one execution cycle, a stripe can process one data element or load
   one configuration, and the cache can read one configuration or data
   element.
6. The application has a one-to-one relationship between the input and
   the output; that is, the number of inputs to the application and the
   number of outputs it produces are equal. If the application requires
   S stages f1, f2, ..., fS, then the operation performed on a data
   element xi is fS(fS-1(...f1(xi)...)). Examples of such applications
   include data encryption/decryption using IDEA and DES.
7. The size of the intermediate and output data elements is the same as
   that of the input data element. This is not always true, but it holds
   for the data encryption algorithms.
8. The number of pipeline stages, S, is greater than the number of
   stripes in the fabric, k.
9. The S pipeline stages required are distinct.
10. There is no bus contention, and the bus is always available to fetch
    configurations and data.
Cache Hit Ratio and the Coverage Function. For configuration caching, a
cache of size M can store at most Cmax configurations, as given in
Table 1. Similarly, for data caching it can store at most Xmax data
elements. At this point it is important to note that in any application
some of the operations may be repeated. Therefore the number of distinct
configurations, N, can be less than or equal to the number of actual
pipeline stages, S, of the application. Also, some configurations can be
required more frequently than others. Clearly, out of N configurations,
the most frequently used C configurations should be cached. The coverage
function f(n) denotes the number of pipeline stages represented by the n
most frequently used distinct configurations. By caching C
configurations, the cache can provide for f(C) pipeline stages. Hence
the coverage provided by the cache, or the cache hit ratio, is
h = f(C)/S. For configuration caching, when N <= Cmax the cache hit
ratio is 1. For data caching, the configuration hit ratio is zero, as
none of the configurations is cached. The cumulative function f(n),
where n = 1 corresponds to the most frequently used configuration and
n = N to all distinct configurations, is convex in nature. The worst
case to consider is a linear function, f(n) = n, which is obtained when
all the configurations are distinct. f(n) = n, or N = S, is the worst
case because it has the worst-case memory and I/O bandwidth
requirements.
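To make the coverage function concrete, the sketch below computes f(n) and the hit ratio h = f(C)/S from a list naming the configuration used by each pipeline stage. The list-of-labels representation, the function names, and the sample stage labels are all assumptions made for illustration.

```python
from collections import Counter


def coverage(stage_configs: list, n: int) -> int:
    """f(n): pipeline stages covered by the n most frequently used configurations."""
    counts = sorted(Counter(stage_configs).values(), reverse=True)
    return sum(counts[:n])


def hit_ratio(stage_configs: list, cached: int) -> float:
    """h = f(C) / S for a cache holding the `cached` most frequent configurations."""
    return coverage(stage_configs, cached) / len(stage_configs)


if __name__ == "__main__":
    # Hypothetical six-stage application with three distinct configurations.
    stages = ["add", "mul", "add", "xor", "add", "mul"]
    print([coverage(stages, n) for n in range(1, 4)])
    print(hit_ratio(stages, 2))
```

When every stage uses a different configuration (N = S), `coverage` returns n for every n, which is the linear worst case described above.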
Execution Cycles without IO Overhead Based
on the model described in the previous section we have
derived the following expressions for the number of exe
cution cycles without IO overhead for the con
guration
and data caching scheduling schemes in 	 Because
the IO overhead is not considered we assume that all
data elements and con
gurations are available without
any stalls	 The execution proceeds in rounds	 For con

guration caching we de
ne a round of con
guration
caching as a sweep of the application or a sweep of S
pipeline stages through the FPGA fabric for k data
elements	 The round gets over when k  th data el
ement is operated by the last pipeline stage	 The 
rst
round takes more cycles because of the pipeline latency
while all other rounds take the same S cycles except
the last one	 The number of execution cycles without
IO overhead EXc is






For data caching, we define one round as a sweep of all X data elements
through the k pipeline stages configured in the FPGA fabric. The round
is over when the last data element has been processed by the last
pipeline stage of the round. The number of execution cycles without I/O
overhead for data caching is






It is easy to see that the data caching approach takes fewer execution
cycles when the number of data elements is greater than k - 1, the
number of elements processed by a pipeline stage per round in
configuration caching. However, data caching has a higher latency for
the first output to be produced.
In reality these execution cycles will be increased
due to IO stall cycles  which may in some cases be
the dominant factor	 In our study we observed that the
number of stalls caused by IO in data caching increases
as the number of data elements and the number of cy
cles required to fetch a data element increase	 There
fore a better approach is blocked data caching	 This
can also be combined with some con
guration caching	
This hybrid caching is studied next	
3 Hybrid Caching
We showed earlier that data caching performs well under certain
conditions and improves the utilization of the fabric. However, data
caching is sometimes impractical because the number of data elements
processed by the application may be very large while the on-chip memory
is relatively small. It is also possible that the data is generated, and
must be processed, in real time. In such a case, if the number of data
elements is large, the application cannot tolerate the latency of data
caching. These problems can be solved by operating on a block of data at
a time. With only data cached, all the configurations are read from the
external memory for every block of data. Since the configurations are
large and are fetched from the external memory, this increases I/O
traffic. To keep the I/O requirement under control, we can cache some of
the most frequently used configurations in the on-chip cache. Because
the coverage function of the application is convex, caching even a few
configurations should lead to considerable savings in I/O traffic as
well as in the total execution time.
3.1 Execution using Hybrid Caching
In hybrid caching, the on-chip cache is partitioned to store C
configurations and B data elements, where B is the number of data
elements in a block. A block of data refers to the number of data
elements processed at a time by hybrid caching, not to a cache block or
a cache line. Allocating cache storage for B data elements allows the
fetching of B data elements from the external memory to be overlapped
with the processing of B data elements by the application. This is
important for reducing the stalls due to data, as will be seen in
Section 3.3. As in configuration caching and data caching, a prefetch
buffer is available in addition to the cache to store k configurations.
Since the total on-chip memory is M bytes, M, B and C are related by
M = C Wc + B Wd
In hybrid caching a combined approach is adopted
to schedule data and con
gurations through the FPGA
fabric	 Similar to con
guration caching all the con
g
urations operate on a block of data consisting of B
B  X data elements and similar to data caching
a set of k con
gurations are kept in the fabric until
they 
nish processing all the data elements in a block	
To illustrate the scheduling using hybrid caching con
sider the same example as that used to explain data
and con
guration caching in Section 	 Let the block
of data consists of three elements or B  	 The hy
brid caching execution takes place as shown in Figure
	 We notice from Figures   and  that the number
of execution cycles of hybrid caching is in between that































Figure  Execution using Hybrid Caching approach
In the following sections we compute the number of execution cycles
without and with I/O overhead for hybrid caching.
3.2 Execution Time without I/O Overhead
Figure 5 depicts the rounds and the subrounds in hybrid caching. A round
for hybrid caching is one data caching execution on B data elements, or
one sweep of the application for B data elements. A round is over when
the last data element in the block has been processed by the Sth
pipeline stage. Since there are X data elements, there are ceil(X/B)
rounds. Similar to data caching, the behavior of the first and the last
rounds differs from that of the other rounds. Therefore we consider
these two rounds separately from the remaining rounds. The first round
is a data caching execution on B data elements. From the expression for
data caching, the number of cycles in the first round is





For computing the number of execution cycles of the other rounds, we
consider the first and the last subrounds separately. The first subround
is shown in the accompanying figure. When S is not an integer multiple
of k, the number of configurations in






This reduces the latency of the 
rst subround of the





  cycles	 The number of
cycles in the 
rst subround for the remaining rounds











 S B  	
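Figure 5: Rounds and Subrounds of Hybrid Caching. The execution consists
of ceil(X/B) rounds, numbered 1, 2, ..., ceil(X/B) - 1, ceil(X/B); each
round consists of ceil(S/k) subrounds, numbered 1, 2, ...,
ceil(S/k) - 1, ceil(S/k).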
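The round and subround structure can be enumerated with a short Python sketch. It assumes, for illustration only, that data elements and stages are numbered from 1, that blocks are taken in order, and that each subround holds the next group of at most k stages; `hybrid_schedule` is a hypothetical helper, not part of the original analysis.

```python
import math


def hybrid_schedule(X: int, B: int, S: int, k: int) -> list:
    """Return (round, subround, stages, data elements) tuples in execution order."""
    rounds = math.ceil(X / B)      # one round per block of at most B data elements
    subrounds = math.ceil(S / k)   # one subround per group of at most k stages
    plan = []
    for r in range(rounds):
        data = list(range(r * B + 1, min((r + 1) * B, X) + 1))
        for s in range(subrounds):
            stages = list(range(s * k + 1, min((s + 1) * k, S) + 1))
            plan.append((r + 1, s + 1, stages, data))
    return plan


if __name__ == "__main__":
    # The running example: six data elements, blocks of three, six stages, three stripes.
    for entry in hybrid_schedule(X=6, B=3, S=6, k=3):
        print(entry)
```

For the running example this yields two rounds of two subrounds each; the last round holds fewer than B elements whenever X is not a multiple of B.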
[Figure: first subround of a round. Labels: last data element; part of
the last subround; first subround; Z = S-k floor(S/k)-2]






  subrounds in a round are similar to
the rounds of data caching and they take B cycles	
Because the last subround has fewer number of con






















The last round is also similar to other rounds except
that the number of data elements in the last round can
be less than B	 The number of data elements operated





	 Hence the total num











Combining the number of execution cycles for the first and the last
subrounds with that of the remaining rounds, the total number of
execution cycles of hybrid caching without I/O overhead, EXh, after
simplification is given by











Comparing Eqs	  and  it is obvious that the
number of execution cycles of hybrid caching is higher









it reduces to data caching. Hence, as long as the number of blocks of
data is limited, or in other words ceil(X/B) is small, hybrid caching
will follow the performance trends of data caching. At the same time,
the tradeoff between EXh and latency can be made by selecting the value
of B according to the application requirements.
3.3 Total Execution Cycles
As stated in Section 3.1, for hybrid caching the cache stores C
configurations and B data elements. Hybrid caching executes the complete
application on one block of data at a time, and while one block is
processed, the next block of B elements is fetched from the external
memory. We assume that appropriate hardware is available to allow all
the required accesses. With C configurations present in the cache, the
cache hit ratio is h = f(C)/S. We assume that the cached configurations
are used uniformly across the subrounds; in other words, in each
subround k(1 - h) configurations must be fetched from the external
memory. We assume that the first k(1 - h) configurations in every
subround are not in the cache. The total number of stalls incurred
during hybrid caching execution depends on the number of rounds and on
the number of stalls in each round, which is determined by its round
number. To compute the total number of stalls, we consider the following
four rounds.
 First Round At the start of the 
rst round
none of the con
gurations is in the cache	 The con

gurations are fetched from the external memory as
they are required and out of N distinct con
gurations
C con
gurations that are selected by the compiler are
cached	 Since in the 
rst round the caching of con
g
urations is incremental and application dependent we
ignore the eect of con
guration caching in the 
rst
round	 Thus the 
rst round of hybrid caching is sim
ilar to that of data caching except for the dierence
in the number of data elements	 Hence the number of
stalls incurred during the 









Wrs  knc  Bnd  B  k  
Wrs  maxf knc  B  g
 Second Round The second round contributes
to the total number of stalls when the number of
rounds Rh is greater than 	 From the second round
onwards C con
gurations are always in the cache and
the cache hitratio is h  C
S
 where all the pipeline
stage con
gurations are assumed to be distinct	 We
need to consider the following factors to compute stalls
in the 
rst subround of the second round	
 Fetching of k  h uncached con
gurations re
quired in the 
rst subround of the second round	
 Fetching of the new data block of B elements re
quired in the second round	
 Prefetching of kh uncached con
gurations for
the second subround of the second round	
Figure  shows the last subround of the 
rst round
and the 
rst subround of the second round	 In contrast
to data caching the last subround of hybrid caching
requires to fetch k  h con
gurations in order to
avoid con
guration stalls at the start of the second
round	 The number of con
gurations in the last round





	 As shown in Figure  after the last
con
guration of the last subround is loaded there are
B k  cycles before kh con
gurations of the

rst subround of the second round are required	 The
number of cycles available to hide con
guration fetch
ing is Bkkh or Bkh	 Fetching
k h con
guration takes k  hnc cycles	 Hence
the number of stall cycles caused by the uncached con

guration in the 
rst subround of the second round is
given by














Figure  Counting stalls for Hybrid Caching
IfWrsc   then Bkhkhnc cycles are
available in the last subround of the 
rst round to fetch
the data of the second round	 In addition if Wrs  
then maxf B    kncg cycles are available to




  of the 
rst round	 Hence now total cycles
available to fetch data is TCa  maxf B   





  maxf B kncg	
The number of data elements that can be prefetched
during the 





From Figure  we notice that in the last k  
cycles of the 
rst subround the con
gurations of the
next subround are required	 Thus in order to avoid
stalls the 
rst round still requires to fetch B  rd
data elements and k  h con
gurations within  
kh  B  k    k  h   or B   cycles in
order to avoid stalls	 The number of stalls caused by
the insucient overlapping is TCs  maxf B 
rdnd  k  hnc  B  g	 The total number
of stalls in the 
rst subround of the second round is
given by
Wrs Wrsc  TCs





  in the
second round. As in data caching, in every subround there is
configuration prefetching for the next subround. From the figure, there
are B - 1 cycles available for hiding the configuration fetching. The
number of stalls due to configuration fetching in these subrounds is
given by
Wrs = max{0, k(1 - h)nc - (B - 1)}







Rounds 3 and Onwards. Rounds 3 to Rh - 1 contribute to the total number
of stalls in hybrid caching when Rh > 3. All these rounds behave
similarly, and hence we consider them together.
From the third round onwards the number of cycles






   the cycles from the last sub
round	 For these rounds the number of con
gurations
stalls in the last subround is also given by the expres
sion of Wrs	 Hence total cycles available for fetch






	 We do not need to consider the cycles
from the 
rst subround of the previous round because
if there are cycles available in the 
rst subround then
it means there were no stalls which will be true for this
round also	 The number of data elements prefetched
for every round in this category is given by rd 
minfB TCa
nd
g	 Hence for round  and onwards to
tal stalls in the 
rst subround of round  is given by
Wrs  maxf Brdnd khnc Bg	
The total number of stalls in every round from three







 Last Round Since B can be large and the
number of elements in the last round can be relatively
small we compute the number of stalls for the last
round separately	 The number of data elements pro






of stalls in the last round depends on Rh and hence to
compute the stalls in the last round and the total stalls
of hybrid caching we consider the following two cases	
Case I Rh   In this case the second round is
the last round	 The number of stalls due to the con

gurations at the start of the round is still given by
Wrsc	 The number of cycles available to prefetch the
data is still the same as that for the second round but
the number of data elements that are required to be






this factor into account the number of data elements of
the last or second round that can be prefetched during
the 
rst round is given by















maxf Bkncgg	 In this case the stalls
in the 
rst subround of the last round are Wlasts 
































Since there are just two rounds, the total stalls are
Wht = Whr + Whlast
and the total execution cycles are
Th = EXh + Wht
Case II Rh   In this case the last round is similar
to the Round  category except for the dierent number
of data elements	 By modifying rd the number of
data elements of the last round that can be prefetched
is given by















g	 Following the expression for Wrs total stalls
in the 
rst subround of the last round is Wlasts 











	 The number of





  subrounds of
the last round is the same as given by Wlasts	 Hence







In this case total stalls are given by





  Whlast 
and the total number of execution cycles is given by
Th = EXh + Wht
4 Results
To evaluate the performance of hybrid caching, we consider the same
parameters used to evaluate data and configuration caching. There, with
some approximations, we showed that EXd < EXc when X > k. Also, the
performance difference EXd - EXc is proportional to X. When the number
of cycles required to fetch a data element, nd, is greater than one, the
number of stalls in the first round of data caching increases as X
increases. Therefore, when the total number of execution cycles is
considered, the comparison between Td and Tc depends on X. To keep the
number of stalls in the first round low, X must be small. However, X is
an application parameter that we cannot control. In hybrid caching, by
contrast, the number of data elements processed in every data caching
round equals B, and we can select B to give the best overall
performance. Note that for hybrid caching we require X >= B and B >= k.
We provide the results of hybrid caching for applications with all
distinct configurations (N = S) and with non-distinct configurations
(N < S) in the next two subsections.
 Distinct Congurations N  S
For an application with the number of pipeline
stages S   the variation in total number of ex
ecution cycles Th with the number of con
gurations
cached C for dierent values of X is shown in Fig
ure 	 From Eq	  for a given M as C increases
B is decreased	 This increases the number of rounds
of hybrid caching Rh and hence EXh	 For S  
nd   Figure a shows that Th does not vary much	
Almost for all values of C B is large enough to hide
con
guration fetching	 The data can always be fetched
without stalls because nd  	 Thus there are no stalls
due to data or con
gurations in the rounds except the

rst one	 The stalls in the 
rst round are independent
of B as nd  	 The only factor aecting Th is EXh
























Th is minimum when B - last is minimum, where last is the number of
elements in the last round. When last is very small, it does not provide
enough cycles to overlap the prefetching of configurations. This adds
stalls in the last round of hybrid caching, which increases Th, as shown
by the peaks in the Th vs. C curves.
When nd   the nature of the graph is dierent	
For this case as shown in Figure b as C increases Th
reduces	 The reason is that with nd   the number of
stalls proportional to B are added in the hybrid caching
execution	 Hence as B decreases the number of stalls
decrease	 The reduction in the number of stalls exceeds
the increase in EXh and hence Th is reduced	 There
are peaks in the curve again for the reason mentioned
in nd   case	
To compare hybrid caching with the previous two schemes, consider
Table [...], which shows the total number of execution Kcycles,
rounded to one decimal digit, for the three schemes for S = [...].
For hybrid caching we take the best value encountered over the values
of C we considered. For nd = 1, Table [...] shows that the execution
cycle count of hybrid caching is slightly higher than that of data
caching, but hybrid caching still performs on average [...] times
better than configuration caching. For nd = 2, Table [...] shows that
hybrid caching performs better than the other two approaches. In this
case hybrid caching performs [...] times better than data caching and
[...] times better than configuration caching. Also note that in this
case, for the last two values of X, data caching does not perform
well, but by blocking the data and caching some of the
configurations, hybrid caching provides better performance.

Figure [...]: Th vs. C for S = [...]. Panels: (a) nd = 1; (b) nd = 2.

Table [...]: Th, Tc and Td for S = [...], nd = 1 and nd = 2
              nd = 1                          nd = 2
  X      Th     Tc     Td         X      Th     Tc     Td
  [...]  [...]  [...]  [...]      [...]  [...]  [...]  [...]
  [...]  [...]  [...]  [...]      [...]  [...]  [...]  [...]
  [...]  [...]  [...]  [...]      [...]  [...]  [...]  [...]

Table [...]: Th, Tc and Td for S = [...], nd = 1 and nd = 2
              nd = 1                          nd = 2
  X      Th     Tc     Td         X      Th     Tc     Td
  [...]  [...]  [...]  [...]      [...]  [...]  [...]  [...]
  [...]  [...]  [...]  [...]      [...]  [...]  [...]  [...]
  [...]  [...]  [...]  [...]      [...]  [...]  [...]  [...]

Table [...]: Th, Tc and Td for S = [...], nd = 1 and nd = 2
              nd = 1                          nd = 2
  X      Th     Tc     Td         X      Th     Tc     Td
  [...]  [...]  [...]  [...]      [...]  [...]  [...]  [...]
  [...]  [...]  [...]  [...]      [...]  [...]  [...]  [...]
  [...]  [...]  [...]  [...]      [...]  [...]  [...]  [...]

Next we consider an application with the number of pipeline stages
S = [...]. The variation in Th with C in this case is shown in
Figure [...]. The nature of the curves for nd = 1, shown in
Figure [...](a), and for nd = 2, shown in Figure [...](b), is similar
to that of the respective curves obtained for S = [...], except that
in the nd = 2 case there is a knee in the graph of Th vs. C. At the
knee, the value of C gives the minimum Th, i.e. the optimum operating
point. For nd = 2, beyond the optimum value of C, the value of B is
very small and the fetching of configurations cannot be overlapped
with the execution. Hence stalls occur in the execution due to
configurations as well as data. For nd = 1, the optimum C is near
[...] because in this case data stalls are not present and we can
make B as large as possible to hide the configuration fetching. For
nd = 2, the optimum value of C is Copt = [...]. The graphs for
S = [...], shown in Figure [...], are of a similar nature. The
optimum number of cached configurations for nd = 2 is not constant
and varies with X; the values are [...] for the three values of X
considered in Figure [...](b), respectively.

Figure [...]: Th vs. C for S = [...]. Panels: (a) nd = 1; (b) nd = 2.

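
The optimum operating point at the knee is simply the C with the smallest Th among the values considered. A minimal sketch follows; the helper `optimum_c` and the sample points are hypothetical and are not data read from the figures.

```python
def optimum_c(samples):
    """Return (Copt, Th_min) from an iterable of (C, Th) pairs.

    Ties are resolved in favour of the smaller C, on the assumption
    that fewer cached configurations leave more memory for data.
    """
    return min(samples, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical (C, Th in Kcycles) samples with a knee at C = 16.
    curve = [(0, 90.0), (8, 75.5), (16, 62.0), (24, 70.2), (32, 95.8)]
    c_opt, th_min = optimum_c(curve)
    print(f"Copt = {c_opt}, Th = {th_min} Kcycles")
```

In practice the (C, Th) pairs would come from evaluating the execution-cycle model at each candidate C for the given X, S and nd.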
As noted above, hybrid caching performs best at the optimum value of
C. In Tables [...] and [...] we list the best value of Th obtained
for the parameters given with each table; Tc and Td are also listed
for comparison. The values of Tc and Td are taken from the tables in
[...].
Tables [...] and [...] show that when nd = 1 the performance of
hybrid caching is comparable to that of data caching, while it is
[...] and [...] times that of configuration caching for S = [...] and
S = [...], respectively. For nd = 2, Tables [...] and [...] show that
hybrid caching performs better in all cases. Hybrid caching
performance is [...] times that of data caching, while it is [...]
and [...] times that of configuration caching for S = [...] and
S = [...], respectively.

Figure [...]: Th vs. C for S = [...]. Panels: (a) nd = 1; (b) nd = 2.

Non-distinct Configurations (N < S)
In this section we consider an application with the number of
pipeline stages S equal to 128. For this application some of the
pipeline stages require the same configuration, and hence the number
of distinct configurations required to provide all the pipeline
stages is less than the actual number of pipeline stages, i.e. N < S.
We consider a specific application with the number of distinct
configurations N = 64. The number of pipeline stages represented by
each distinct configuration is given in Table [...]. The variation in
the cache hit ratio, h = f(C)/S, is shown in Figure [...]. As the
coverage function is convex, the hit ratio initially increases at a
higher rate, providing a good hit ratio even with few cached
configurations.
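
The shape of the hit-ratio curve can be sketched as follows, assuming the hit ratio is h = f(C)/S, that f(C) counts the pipeline stages covered by the C cached configurations, and that the configurations covering the most stages are cached first. The per-configuration stage counts are hypothetical, chosen only so that N = 64 configurations cover S = 128 stages; they are not the values of the table.

```python
def hit_ratio_curve(stages_per_config):
    """Return [h(0), h(1), ..., h(N)] with h(C) = f(C) / S.

    f(C) is taken to be the number of pipeline stages covered by the C
    configurations that cover the most stages (an assumed caching order).
    """
    S = sum(stages_per_config)
    covered = 0
    curve = [0.0]
    for count in sorted(stages_per_config, reverse=True):
        covered += count
        curve.append(covered / S)
    return curve


if __name__ == "__main__":
    # Hypothetical distribution: N = 64 configurations, S = 128 stages.
    stages = [8] * 4 + [4] * 8 + [2] * 12 + [1] * 40
    assert len(stages) == 64 and sum(stages) == 128
    h = hit_ratio_curve(stages)
    for C in (0, 4, 12, 24, 64):
        print(f"C={C:2d}  h={h[C]:.3f}")
```

With this made-up distribution, caching 12 of the 64 configurations already covers half of the stages, which mirrors the steep initial rise of the hit ratio described above.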
We still assume that the cached configurations are required uniformly
in the subrounds and that uncached configurations cannot be reused.
Hence the analysis in Sections [...] and [...] is valid for this case
as well. The performance of hybrid caching for this application is
evaluated for two values of X and nd and is compared with the N = S
case in Figures [...] and [...].

Table [...]: Number of stages covered by a configuration
  C        [...]
  Stages   [...]

Figure [...]: Hit ratio vs. C for N < S (S = 128, N = 64). X-axis: number of configurations cached, C.

Figure [...]: Th vs. C for S = 128, X = 512 KB, comparing N = S and N < S. Panels: (a) nd = 1; (b) nd = 2.

Figure [...]: Th vs. C for S = 128, X = [...] KB. Panels: (a) nd = 1; (b) nd = 2.

For the given application the number of distinct configurations N is
64. If all the distinct configurations are stored in the cache, the
memory available to store data is [...] bytes, which allows a block
size of [...] elements for Wd = [...] and [...] for Wd = [...].
Notice that even the minimum value of B is large enough to hide
configuration fetching (B [...] knc) irrespective of the cache hit
ratio. When C = [...], the N = S and N < S cases perform the same. As
C increases, B is reduced in both cases. In the N = S case the
configuration hit ratio is lower, and hence in each subround more
configurations must be fetched from the external memory. This results
in configuration stalls if B is not large enough to hide
configuration fetching. This factor is even more important in the
last round, where the size of the block can be less than B. Thus some
of the configuration stalls that are present in the N = S case are
not present in the N < S case. For small C, if there are
configuration stalls in the last round, then N < S performs better
than N = S. This can be seen in Figure [...](b). However, if there
are no stalls in the last round, both N = S and N < S give
essentially the same performance, as shown in Figure [...](b). At
large values of C, B is small and additional data and configuration
stalls appear. Again, as there are fewer configuration stalls for
N < S than for N = S, N < S performs better. This can be seen for
C [...] in Figures [...](b) and [...](b). Also, for a given C the hit
ratio for N < S is higher, and hence the peaks in the N < S curves
are smaller than those in the N = S curves.
From Figures [...](a) and [...](a), for nd = 1, as observed before,
the performance does not vary much, and since data stalls are absent,
N < S performs the same as N = S as long as there are no
configuration stalls in the last round. The peaks in the N < S curve
are smaller because it has to fetch fewer configurations and hence
needs fewer fetch cycles.
Conclusions
We have reviewed configuration and data caching schemes and have
developed and analyzed hybrid caching for the striped FPGA
architecture. We evaluated the performance by computing the total
execution cycles, taking into account the effect of I/O with the
external memory. For an application with fewer pipeline stages than
there are stripes in the FPGA fabric, the three schemes are the same.
When the application has more pipeline stages than there are stripes
in the FPGA fabric, a more likely case, the performance is determined
by the scheduling scheme used for the reconfiguration of the fabric.
As seen from the results, there exists a value of C, and hence of B,
for which hybrid caching gives its optimum performance. The
application with N < S gains more performance than the application
with N = S. The number of execution cycles of hybrid caching without
I/O overhead, EXh, lies between that of data caching and that of
configuration caching. The total number of execution cycles of hybrid
caching was found to be better than that of configuration caching in
all the cases we considered, and the comparison with data caching
shows that hybrid caching provides better results when nd = 2.
The selection of a particular scheduling scheme depends on whether
the application requires smaller latency or smaller total execution
time. The latency of configuration caching is minimal provided all
the configurations corresponding to the application are present in
the cache. But when the number of configurations exceeds the cache
capacity, hybrid caching will guarantee the results. In data caching,
latency depends directly on the number of data elements processed by
the application, X. Hence it is not suitable for applications with
low-latency requirements. Hybrid caching can adapt to the required
latency by selecting an appropriate value for the size of the block
to be processed. Thus the hybrid caching scheme eliminates the
practical problems in data caching execution and provides a [...]
times performance improvement over data caching. The performance of
hybrid caching is [...], and sometimes even [...], times that of
configuration caching.
With total execution time as the main parameter, we need to consider
the application parameters in order to evaluate all the schemes for
the given system. When the number of cycles required to fetch a data
element is nd = 1, data caching can be selected irrespective of the
number of data elements. When nd = 2, hybrid caching will provide
better results.
Notice that the analysis in this paper is based on the assumption
that the cached configurations are used uniformly in the subrounds,
which need not be the case. The cached configurations can have any
random distribution within the given S pipeline stages. We also
assumed that the size of the data remains the same as it passes
through the pipeline. It is possible that intermediate results are
larger than the actual inputs. This parameter will change depending
on the application. For the data and hybrid caching schemes this
factor, when greater than one, reduces the number of data elements
cached. It is necessary to take this factor into account when
deciding Xmax and B for data and hybrid caching, respectively.
The correct results can be obtained by simulating the three schemes
and by modeling different applications based on the nature of the
pipeline. This simulator is currently under development. Future work
can look into the issue of modeling applications for the reusability
of different distinct configurations.
References
H. Schmit, "Incremental Reconfiguration for Pipelined Applications," in Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines, pp. [...].
Peter M. Athanas and Harvey F. Silverman, "Processor reconfiguration through instruction-set metamorphosis," Computer, vol. [...], no. [...], pp. [...].
Andre DeHon, "DPGA-coupled microprocessors: Commodity ICs for the early 21st century," in Proc. of the IEEE [...].
Eric Lemoine and David Merceron, "Run time reconfiguration of FPGA for scanning genomic databases," in Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines, pp. [...].
Bernard Gunther, George Milne, and Lakshmi Narasimhan, "Assessing document relevance with run-time reconfigurable machines," in Proc. of the IEEE Workshop on FPGAs for Custom Computing Machines, pp. [...].
N. Shirazi, P. Athanas, and L. Abbott, "Implementation of a 2-D Fast Fourier Transform on a FPGA-based Custom Computing Machine," The [...]th International Workshop on Field Programmable Logic and Applications, September [...].
Jean E. Vuillemin, Patrice Bertin, Didier Roncin, Mark Shand, Herve H. Touati, and Philippe Boucard, "Programmable Active Memories: Reconfigurable systems come of age," IEEE Transactions on VLSI, vol. [...], no. [...], March [...].
Doug Smith and Dinesh Bhatia, "RACE: Reconfigurable and Adaptive Computing Environment," Lecture Notes in Computer Science, vol. [...], pp. [...].
Rahul Razdan and Michael D. Smith, "A high-performance microarchitecture with hardware-programmable functional units," in Proc. of the [...]th Annual International Symposium on Microarchitecture, pp. [...].
Ralph D. Wittig and Paul Chow, "OneChip: An FPGA processor with reconfigurable logic," in Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines, pp. [...].
S. Cadambi, J. Weener, S. C. Goldstein, H. Schmit, and D. E. Thomas, "Managing Pipeline-Reconfigurable FPGAs," in Proc. ACM/SIGDA Sixth International Symposium on FPGAs, pp. [...].
J. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," in Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines, pp. [...].
R. Bittner, P. Athanas, and M. Musgrove, "Colt: An Experiment in Wormhole Run-time Reconfiguration," in SPIE Photonics East [...].
S. Hauck, Z. Li, and E. J. Schwabe, "Configuration Compression for the Xilinx XC6200 FPGA," to appear in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
D. D. Deshpande, A. K. Somani, and A. Tyagi, "Configuration Caching Vs Data Caching for Striped FPGAs," to appear in FPGA, February [...].