Configuration caching vs data caching for striped FPGAs by Deshpande, Deepali et al.
Electrical and Computer Engineering 
Conference Papers, Posters and Presentations Electrical and Computer Engineering 
1999 
Configuration caching vs data caching for striped FPGAs 
Deepali Deshpande 
Iowa State University 
Arun K. Somani 
Iowa State University, arun@iastate.edu 
Akhilish Tyagi 
Iowa State University 
Recommended Citation 
Deshpande, Deepali; Somani, Arun K.; and Tyagi, Akhilish, "Configuration caching vs data caching for 
striped FPGAs" (1999). Electrical and Computer Engineering Conference Papers, Posters and 
Presentations. 162. 
https://lib.dr.iastate.edu/ece_conf/162 
Configuration caching vs data caching for striped FPGAs 
Abstract 
Striped FPGA [1], or pipeline-reconfigurable FPGA provides hardware virtualization by supporting fast run-
time reconfiguration. In this paper we show that the performance of striped FPGA depends on the 
reconfiguration pattern, the run time scheduling of configurations through the FPGA. We study two main 
configuration scheduling approaches: Configuration Caching and Data Caching. We present the 
quantitative analysis of these scheduling techniques to compute their total execution cycles taking into 
account the overhead caused by the IO with the external memory. Based on the analysis we can 
determine which scheduling technique works better for the given application and for the given hardware 
parameters. 
Disciplines 
Data Storage Systems | Systems and Communications 
Comments 
This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not 
for redistribution. The definitive version was published in Deshpande, Deepali, Arun K. Somani, and 
Akhilish Tyagi. "Configuration caching vs data caching for striped FPGAs." In Proceedings of the 1999 
ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays, pp. 206-214. 1999. 
DOI: 10.1145/296399.296461. 
Conguration Caching Vs Data Caching for Striped FPGAs
Deepali Deshpande Arun K Somani
Electrical and Computer Engineering Department




Iowa State University Ames 
EMail tyagiiastateedu
Abstract
Striped FPGA [1], or pipeline-reconfigurable FPGA, provides hardware virtualization by supporting fast run-time reconfiguration. In this paper we show that the performance of a striped FPGA depends on the reconfiguration pattern, that is, the run-time scheduling of configurations through the FPGA. We study two main configuration scheduling approaches: Configuration Caching and Data Caching. We present a quantitative analysis of these scheduling techniques to compute their total execution cycles, taking into account the overhead caused by the IO with the external memory. Based on the analysis we can determine which scheduling technique works better for a given application and for given hardware parameters.
1 Introduction
Originally introduced for prototyping digital circuits, FPGAs are now used as hardware accelerators in configurable computing machines (CCMs). CCMs consist of configurable hardware, such as FPGAs, and programmable processors. For regular and deeply pipelined applications such as DSP, image processing, and data encryption/decryption, the performance gain obtained with CCMs is at least an order of magnitude greater than that of processor-based approaches. Many CCMs, for example SPLASH and PAM, use commercially available FPGAs from Xilinx families or the Altera Flex family. Along with other factors, the reconfiguration time and the reconfiguration granularity of the FPGA limit the performance gain obtainable from these machines. These Xilinx and Altera Flex family FPGAs have exclusive modes of execution and configuration, and the basic unit of reconfiguration is the whole FPGA. This results in a long reconfiguration time. The new generation of FPGAs from Xilinx, the XC6200 family, allows simultaneous reconfiguration and execution. It also supports partial reconfiguration. However, its basic unit of reconfiguration, a functional block, is fairly small. To overcome these problems, researchers are proposing new FPGA architectures suitable for fast run-time reconfiguration, providing support for concurrent execution and configuration and for partial reconfiguration. Some of the new architectures that have evolved over the last five years include PipeRench (Striped FPGA) [1], Garp, and Colt.

This work was funded by Carver Trust Grants, Iowa State University.

The recongurability provides hardware virtualization which
is important for CCMs in order to execute applications
with varying hardware requirements The hardware vir
tualization involves dening a complete set of congura
tions required by the application but scheduling parts of
it in a sequence on the hardware to complete the appli
cation The conguration size is usually large and recon
guration is costly in terms of time as well as power re
quirement The hardware should not be recongured often
just because it is recongurable but the decision should be
taken based on the application requirements and the avail
able hardware resources The reconguration pattern which
denes the scheduling of congurations on the hardware
changes the data scheduling and the caching requirements
In this paper we analyze and study the performance of two
schedulingcaching approaches for striped FPGA architec
ture The next subsection gives the overview of the striped
FPGA architecture
1.1 Overview of Striped FPGA Architecture
Striped, or pipeline-reconfigurable, FPGA [1] is suitable for implementing pipelined applications. As shown in Figure 1, it consists of a fabric, which is a set of hardware stripes connected in a pipeline fashion. The basic unit of reconfiguration is a stripe. In the ideal case, one stripe of the FPGA can implement one pipeline stage of the application. An application with more pipeline stages than there are stripes can still be implemented by hardware virtualization, that is, by reconfiguring the same stripe to perform the function of different pipeline stages at different times. Figure 1 shows a large cache that stores either the configurations required to implement the pipeline stages of the application or the intermediate data produced during processing. The purpose of the cache is to allow faster memory access. The architecture also provides a wide bus interface between the cache and the fabric for loading the cached contents into the fabric.

Figure 1: Striped FPGA Architecture

In Section 2 we define two scheduling/caching schemes for striped FPGA. In Section 3 we present our analytical model and, based on it, derive expressions for the execution time of the two scheduling schemes. Based on our model, we study the performance of the two scheduling approaches and compare them in Section 4. Section 5 provides the concluding remarks.

 Conguration Scheduling
A pipelined application can be described using the number
of pipeline stages S and the number of data elements X to
be processed by the application When S is greater than
the number of stripes k in the FPGA fabric the applica
tion is scheduled parts at a time on the FPGA as dened
by the reconguration pattern In this section we catego
rize dierent scheduling approaches based on their recon
guration patterns Since these scheduling schemes have
dierent caching requirements we name them based on the
cache contents In this paper we study two approaches
Conguration Caching and Data Caching
 Conguration Caching
As the name suggests the conguration caching approach
for scheduling stores all the congurations required by the
application in the cache This approach is used in PipeRench
 We number k stripes in the FPGA from  to k  and
the pipeline stages of the application from  to S   If
the execution starts at time step or cycle number t then
at cycle number t  i i  	  i mod S	
th conguration
enters in the fabric If k  S the conguration process
stops after S cycles when all the pipeline stages are in the
fabric When k  S in the ith cycle from the start of the
application i mod k	th stripe is recongured to execute
i mod S	th pipeline stage in the application This way
once a data element enters the fabric it passes through all
pipeline stages Thus when k  S one element enters the
pipeline every cycle but when k  S k   data elements
enter the pipeline every S cycles When k  S through
put reduces since one stripe is always under reconguration
To illustrate the scheduling when k  S consider a simple
application consisting of six pipeline stages Let the num
ber of stripes in the fabric be three The application has to
process six data elements x    x The operation to be per
formed on nth data element is ffffffxn						
Figure  shows the execution of conguration caching
The conguration scheduling approach fetches data from
the external memory On the rst reference congurations
are fetched from the external memory and are cached to
provide further references from the cache Note that one
stripe is recongured every cycle This requires that the
cache be accessed and a large conguration be loaded in
the fabric every cycle The fabric is underutilized because
of the presence of one bubble in it It is possible that data
elements may enter and exit at any stripe in the fabric
This forces the global data bus to run all over the fabric
When the pipeline is folded across the fabric data is to be
passed from the last stripe to the rst stripe To do this a
global interconnection bus covering the complete fabric is
required In addition global wide conguration bus covering
all stripes is required for loading congurations in stripes
On the plus side note that conguration caching does not
need any supplementary storage for intermediate results as
they remain stored in the appropriate pipeline stage La
tency is independent of the number of data elements to be
processed by the application
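
The reconfiguration pattern described above can be sketched in a few lines of Python. This is only an illustration, assuming the 0-based stripe and stage numbering of the text; the function name `config_caching_schedule` and its `cycles` argument are hypothetical and do not come from the paper.

```python
def config_caching_schedule(k, S, cycles):
    """List (cycle, stripe, stage) triples for configuration caching.

    In cycle i (counted from the start, t + i), stripe (i mod k) is
    reconfigured to execute pipeline stage (i mod S). Only the
    virtualized case 0 < k < S is covered by this sketch.
    """
    if not 0 < k < S:
        raise ValueError("sketch covers only the case 0 < k < S")
    return [(i, i % k, i % S) for i in range(cycles)]


# The example of the text: six pipeline stages on three stripes.
schedule = config_caching_schedule(k=3, S=6, cycles=8)
for cycle, stripe, stage in schedule:
    print(f"cycle t+{cycle}: stripe {stripe} <- stage {stage}")
```

Because k and S differ, a given stage generally lands on a different stripe in successive sweeps, which is why data may enter and leave the fabric at any stripe.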
2.2 Data Caching
The data caching technique is similar to the scheduling used for component-level reconfiguration, except for the difference in the atomic unit of reconfiguration. Component-level reconfiguration is used for the run-time reconfiguration of FPGAs, such as the XC-series devices, that do not support partial reconfiguration and have exclusive execution and configuration modes. It has to configure the whole component. In contrast, data caching configures only a stripe at a time and allows the reconfiguration to be overlapped with the execution. The data caching scheme uses the cache to store data or intermediate results. As in configuration caching, if the execution starts at time step (cycle number) t, then at cycle number $t + i$ ($0 \le i < k$) the ith pipeline stage configuration enters the FPGA fabric. Each of these k configurations is kept in the fabric until it has processed all X data elements. At cycle number $t + X$ the first pipeline stage (numbered 0) finishes operating on all data elements, so during step $(t + X + 1)$ it is reconfigured to execute the pipeline stage numbered k. Thus at $t + X + k$ all stripes in the FPGA fabric represent the next k pipeline stages (numbered k to $2k - 1$) of the application. The intermediate results produced at the end of every k pipeline stages are stored in the cache, to be accessed later for execution by the next set of pipeline stages. If $k \ge S$, the execution is the same as that of configuration caching. When $k < S$, one data element enters the fabric in every execution cycle. Results are produced at the rate of one per cycle during the last round, when the last pipeline stage is present in the fabric.

Taking the example from Section 2.1, the execution of data caching is shown in Figure 3. Similar to configuration caching, data caching fetches configurations and data once from the external memory. Cached intermediate data are circulated through the fabric. If the application processes X data elements, then there is one bubble in the pipeline after X cycles. Data caching also requires a wide global configuration bus for loading the configurations into the stripes. It does not, however, require the global data bus, as the first configuration will always be loaded in the first stripe. Also, the global interconnect between the last and the first stripe may not be required, as the data is circulated through the intermediate storage. As all the results are produced when the last pipeline stage is configured in the fabric, the latency depends on the number of pipeline stages and also on the number of data elements to be processed by the application. Latency increases with the number of data elements to be processed.

Figure 3: Execution using Data Caching approach
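
As a rough illustration of the data caching pattern, the Python sketch below groups the stages into rounds of k and computes when each stage is configured. The closed form `round_index * (X + 1) + stripe` is an extrapolation of the timing given in the text for the first two rounds, and it assumes no I/O stalls; the function names are hypothetical.

```python
def data_caching_rounds(k, S):
    """Split the 0-based stage numbers 0..S-1 into rounds of at most k stages."""
    return [list(range(start, min(start + k, S))) for start in range(0, S, k)]


def load_cycle(stage, k, X):
    """Cycle offset from t at which `stage` is configured (no stalls assumed).

    Assumption: every round after the first starts X + 1 cycles after
    the previous one, as the text describes for stage k at t + X + 1.
    """
    round_index, stripe = divmod(stage, k)
    return round_index * (X + 1) + stripe


rounds = data_caching_rounds(k=3, S=6)
offsets = [load_cycle(stage, k=3, X=6) for stage in range(6)]
print(rounds)   # [[0, 1, 2], [3, 4, 5]]
print(offsets)  # [0, 1, 2, 7, 8, 9]
```

With k = 3 and X = 6, stage 3 is configured at t + 7 and stage 5 at t + 9, matching the statement that all stripes hold the next k stages at t + X + k.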
3 Model of Execution Time
In this section we describe the architectural features of the striped FPGA, the FPGA external memory bus interface, and the characteristics of the application. Based on these parameters, we derive expressions for the total execution time of the scheduling schemes described in Section 2.

3.1 Parameters and Assumptions
The striped FPGA is a coprocessor attached to the host processor. The host processor initiates the operation on the FPGA. Different schemes may require different initialization sequences. However, each sequence involves specifying the starting address of the configurations and data elements, and the number of iterations to be performed. After the initialization is completed, the actual execution starts. The execution involves fetching configurations and data and feeding them into the pipeline. Whenever a configuration or the data is not available, the pipeline stalls. The notation for the parameters is described in Table 1.

Table 1: Notation

Symbol                                   Description
k                                        Number of stripes in the FPGA fabric
M                                        Cache size in bytes
$W_d$                                    Data element size (bytes)
$W_c$                                    Size of a stripe configuration word (bytes)
$n_d$                                    Number of cycles required to fetch a data element from the external memory
$n_c$                                    Number of cycles required to fetch a configuration from the external memory
$X_{max} = \lfloor M / W_d \rfloor$      Maximum number of data elements that can be stored in the cache
$C_{max} = \lfloor M / W_c \rfloor$      Maximum number of configurations that can be stored in the cache
S                                        Number of pipeline stages in the application
N                                        Total distinct configurations required by the application to get S pipeline stages
X                                        Number of data elements to be processed
f(n)                                     Coverage function (see Section 3.2)

Figure 4: First Two Rounds of Configuration Caching

We make the following assumptions for our study:
1. There are no stalls in the pipeline due to data writes. A write buffer is provided to store output data.
2. The prefetch controller is present on-chip with the FPGA. Since the applications implemented on the striped FPGA are regular, it is possible to prefetch configurations and data accurately. The prefetch controller initiates a prefetch when the IO bus is free.
3. The prefetch buffer has the capacity to store k configurations and $k - 1$ data elements.
4. The prefetch buffer as well as the cache are dual ported.
5. In one execution cycle, a stripe can complete reading data from the cache and processing it. Loading a configuration into a stripe from the cache takes one clock cycle.
6. The application has a one-to-one relationship between input and output, meaning that the number of inputs and the number of outputs produced are equal. With S pipeline stages, $f_S(\cdots f_2(f_1(x_n)) \cdots)$ is the operation performed on a data element $x_n$, where $f_i$ represents the ith pipeline stage. Examples of such applications include data encryption/decryption using IDEA and DES.
7. Intermediate and output data elements have the same size as the input data element. This is true for the data encryption algorithms mentioned above. However, if the intermediate data size is larger than that of the inputs, the worst-case data element size is determined by the datapath width allowed by the architecture. This also determines the maximum number of data elements that can be cached for the data caching scheme. The analysis can easily be modified to take this factor into account.
8. The number of pipeline stages (S) is greater than the number of stripes in the fabric (k), because when $k \ge S$ the two scheduling schemes are the same.
9. The S pipeline stages of the application are distinct, which is the worst case, as explained in the next section.
10. There is no bus contention. The FPGA external memory bus is always available for fetching configurations and data. To model the bus overhead, we just need to scale $n_d$ and $n_c$ by the bus overhead factor.

3.2 Cache Hit Ratio and the Coverage Function
For configuration caching, the cache of size M can store at most $C_{max}$ configurations, as given in Table 1. Similarly, for data caching it can store at most $X_{max}$ data elements. At this point it is important to note that in any application some of the operations may be repeated. Therefore the number of distinct configurations N can be less than or equal to the number of actual pipeline stages S of the application. Also, some of the configurations can be required more frequently than others. Out of the N configurations, the C most frequently used ones should be cached. The coverage function f(n) denotes the number of pipeline stages represented by the n most frequently used distinct configurations. By caching C configurations, the cache can provide for f(C) pipeline stages. Hence the coverage provided by the cache, or the cache hit ratio, is $h = f(C)/S$. For configuration caching, when $N \le C_{max}$ the cache hit ratio is 1. For data caching the configuration hit ratio is zero, as none of the configurations is cached. The cumulative function f(n), where $n = 1$ corresponds to the most frequently used configuration and $n = N$ to all distinct configurations, is concave in nature, because each additional configuration covers no more stages than the one before it. The worst case to consider is the linear function $f(n) = n$, which is obtained when all the configurations are distinct. $f(n) = n$, or $N = S$, is the worst case because it has the worst-case memory and IO bandwidth requirement.
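
The coverage function and the hit ratio can be illustrated with a short Python sketch. The stage labels below are made up for the example, and the helper names `coverage` and `hit_ratio` are hypothetical; the sketch only assumes that each pipeline stage is tagged with the configuration it uses.

```python
from collections import Counter


def coverage(stage_configs, n):
    """f(n): stages covered by the n most frequently used distinct configurations."""
    counts = Counter(stage_configs)
    return sum(count for _, count in counts.most_common(n))


def hit_ratio(stage_configs, cached):
    """h = f(C) / S for a cache holding `cached` configurations."""
    return coverage(stage_configs, cached) / len(stage_configs)


# Hypothetical application: S = 6 stages, N = 3 distinct configurations.
stages = ["add", "mul", "add", "xor", "add", "mul"]
print([coverage(stages, n) for n in (1, 2, 3)])  # [3, 5, 6]
print(hit_ratio(stages, 2))                      # 5/6

# Worst case: all stages distinct, so f(n) = n.
distinct = ["s0", "s1", "s2", "s3"]
print([coverage(distinct, n) for n in (1, 2, 3, 4)])  # [1, 2, 3, 4]
```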
3.3 Execution Cycles without IO Overhead
In this section we compute the number of execution cycles of the two scheduling schemes, assuming that all data elements and configurations are available without any stalls.

 Conguration Caching
Refer to Figure  that shows rst two rounds of congu
ration caching In Figure  E stands for execution Ci	
stands for conguration of ith pipeline stage and E Ci	
denotes one stripe being congured to perform ith pipeline
stage of the application in parallel with the execution in
other stripes The gure shows dierent spans in terms of
number of cycles
We dene a round of conguration caching as a sweep of
the application or a sweep of S pipeline stages through the
FPGA fabric for k   data elements The round gets over
when k	th data element is operated by the last pipeline


















First Round Second Round
E+
Figure  First Two Rounds of Data Caching
pipeline latency while all other rounds as they overlap with
the previous round take S cycles except the last one We
observed in Section  that one round of the application





rounds are required The rst round takes one cycle to load
the rst pipeline stage conguration and S  k   cycles
to execute S pipeline stages on k   data elements The
number of cycles in the rst round is thus S  k  
We note from Figure  that for the subsequent rounds the
pipeline latency is reduced by k   cycles Hence each
of the subsequent rounds takes S  k  	  k    S


















Therefore the number of total execution cycles without IO
overhead EXc is given by
EXc  S  k    Rc  	S	  S 
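
A minimal Python sketch of the round-by-round count for configuration caching follows. It assumes that X is an integral multiple of k − 1, so that every round carries exactly k − 1 data elements and no shorter last round has to be modeled; the function name is hypothetical.

```python
def config_caching_round_cycles(S, k, X):
    """Cycles taken by each round of configuration caching, without I/O stalls.

    Assumes k >= 2 and X is a positive multiple of k - 1 (simplification).
    """
    if k < 2 or X <= 0 or X % (k - 1) != 0:
        raise ValueError("sketch assumes k >= 2 and X a positive multiple of k - 1")
    rounds = X // (k - 1)   # R_c
    first = S + k - 1       # one load cycle plus S + k - 2 execution cycles
    return [first] + [S] * (rounds - 1)


# Example of Section 2.1: S = 6 stages, k = 3 stripes, X = 6 data elements.
per_round = config_caching_round_cycles(S=6, k=3, X=6)
ex_c = sum(per_round)
print(per_round, ex_c)  # [8, 6, 6] 20
```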












3.3.2 Data Caching
Figure 5 shows the first two rounds of data caching. We define one round of data caching as a sweep of all data elements (X) through the k pipeline stages configured in the FPGA fabric. The round is over when the last data element has been processed by the last pipeline stage of the round. Since each round covers k pipeline stages, the application needs $R_d = \lceil S / k \rceil$ rounds. Similar to configuration caching, the first round takes more cycles than the other rounds. The first round takes one cycle to load the first configuration and $k + X - 1$ cycles to process X data elements by k pipeline stages. Thus the number of cycles in the first round is $k + X$. Each of the remaining rounds except the last takes $X + 1$ cycles: one cycle to configure the kth stage of the round and X cycles to get the results of that round. The last round may hold fewer than k stages, and its latency is $(k R_d - S)$ less compared to the other rounds. Therefore the number of cycles in the last round is $(X + 1) - (k R_d - S)$, and the number of execution cycles without IO overhead for data caching is given by

$$EX_d = (X + k) + (R_d - 2)(X + 1) + (X + 1) - (k R_d - S)$$

To compare $EX_c$ and $EX_d$, we approximate X to be an integral multiple of $k - 1$ and S to be an integral multiple of k. The two expressions then reduce to $EX_c = SX/(k - 1) + k - 1$ and $EX_d = SX/k + S/k + k - 1$, and we obtain $EX_d < EX_c$ when $X > k - 1$. The data caching approach has fewer execution cycles when the number of data elements processed by a pipeline stage after every configuration is greater than $k - 1$, the number of elements processed by a pipeline stage in configuration caching. This indicates that once a pipeline stage is loaded in the fabric, it should be used as long as possible before it is replaced. The same observation holds for the standard FPGA, as pointed out by many other authors. Thus, even though the hardware can be configured very rapidly, it should not be reconfigured often unless required by the application.
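
The crossover at X = k − 1 can be checked numerically. The Python sketch below uses the simplified closed forms that hold when X is a multiple of k − 1 and S is a multiple of k (the same approximation the text makes for this comparison); the function names are illustrative.

```python
def ex_config(S, k, X):
    """Stall-free cycles for configuration caching, X a multiple of k - 1."""
    rounds = X // (k - 1)                 # R_c
    return (S + k - 1) + (rounds - 1) * S


def ex_data(S, k, X):
    """Stall-free cycles for data caching, S a multiple of k."""
    rounds = S // k                       # R_d
    return (X + k) + (rounds - 1) * (X + 1)


S, k = 6, 3
table = [(X, ex_config(S, k, X), ex_data(S, k, X)) for X in (2, 4, 6)]
for X, cycles_config, cycles_data in table:
    print(f"X={X}: EX_c={cycles_config}, EX_d={cycles_data}")
```

For S = 6 and k = 3 the two schemes tie at X = 2 (that is, X = k − 1), and data caching needs fewer cycles for every larger X in the table.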
3.4 Total Execution Cycles
In the previous section we computed the execution cycles without IO overhead. In this section we consider the overlap between the execution and the IO to determine the stalls introduced in the execution. The total number of execution cycles is obtained by adding the stalls to the number of execution cycles without IO overhead. In the analysis we assume S to be an integral multiple of k and X to be an integral multiple of $k - 1$.

 Conguration Caching
The cache along with the conguration prefetch buer can
store up to Cmax  k	 congurations From round  on
wards conguration caching do not need to fetch congu
rations from the external memory when S  Cmax  k	
When S exceeds Cmax  k	 uncached congurations are
required to be prefetched to hide stalls In the second case
we assume that the contents of the cache remain the same
while that of prefetch buer changes We consider two cases
separately to compute the number of total execution cycles
Case I S  Cmax  k	
S(1-h) S(1-h)
uncached Configurations




(k-1) data elements Round(i) Round(i+1)
1 1
s(1-h)-1
required to be ready before
the end of the round
Figure  Round	i
 and 	i
 of Conguration Caching Case II i  
The rst round as shown in Figure  needs to fetch S con
gurations since they are not in the cache and k	 data
elements that are used in the rst S  k cycles of the exe
cution The number of IO cycles required in the rst round
is Snc  k 	nd The actual IO latency as seen by the
application is less because the IO overlaps with S  k  
cycles of the execution Hence the number of stalls in the
rst round Wcr is given by
Wcr  Snc  k  	nd  S  k  	
All rounds from  to Rc   are similar to the round 
shown in Figure  We do not need to consider the last
round because it does not involve any data fetching The
remaining Rc  	 rounds need k  data elements one in
every cycle starting from cycle number S  k  	 of that
round We note that in the second round there are Sk
	 cycles where IO is available for prefetching k   data
elements If k  data element are fetched within S k
	k	 or S cycles data prefetching completely overlaps
the execution and there are no stalls in the third round The
nostall condition for every round from  onwards has to
satisfy the relation k  	nd  S Therefore the number
of stalls in each of the remaining rounds is given by
Wcr  maxf k  	nd  Sg
CombiningWcr andWcr total number of stall cyclesWct
is given in Eq  The total execution cycles for S  Cmax
k case Tc is obtained by adding Eqs  and  in Eq 






Tc  EXc Wct 	
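
The Case I bookkeeping for configuration caching can be sketched in Python as follows. The sketch assumes X is a multiple of k − 1 and that stalls in rounds 2 to R_c − 1 are all equal, as in the text; the parameter values in the example are made up, and `c_max` stands for the cache capacity in configurations.

```python
def config_caching_cycles_case1(S, k, X, n_c, n_d, c_max):
    """Return (EX_c, W_ct, T_c) for configuration caching when S <= C_max + k.

    Assumes k >= 2 and X a positive multiple of k - 1 (simplification).
    """
    if S > c_max + k:
        raise ValueError("Case I requires S <= C_max + k")
    if k < 2 or X <= 0 or X % (k - 1) != 0:
        raise ValueError("sketch assumes X is a positive multiple of k - 1")
    rounds = X // (k - 1)                              # R_c
    ex_c = (S + k - 1) + (rounds - 1) * S              # stall-free cycles
    w_first = S * n_c + (k - 1) * n_d - (S + k - 1)    # first-round stalls
    w_later = max(0, (k - 1) * n_d - S)                # stalls per middle round
    w_total = w_first + max(rounds - 2, 0) * w_later
    return ex_c, w_total, ex_c + w_total


# Hypothetical parameters: fast data fetch, then slow data fetch.
fast = config_caching_cycles_case1(S=6, k=3, X=6, n_c=4, n_d=1, c_max=8)
slow = config_caching_cycles_case1(S=6, k=3, X=6, n_c=4, n_d=5, c_max=8)
print(fast)  # (20, 18, 38)
print(slow)  # (20, 30, 50)
```

In the first call the no-stall condition (k − 1) n_d ≤ S holds, so only the first round contributes stalls; in the second call it is violated and the middle round adds stalls as well.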
Case II S  Cmax  k	
As mentioned in Section  some Cmax most frequently
used when all are not distinct	 congurations are cached




 Each of the un
cached S h	 pipeline stage congurations is required in
every round Since these congurations are not in the cache
they are fetched from the external memory The number of
stalls resulting from the uncached congurations depends
on the relative positions of the pipeline stages using them
and on their reusability We consider the worst case where
all uncached congurations are used in successive pipeline
stages and they cannot be reused These successive stages
can occur anywhere in the pipeline Without loss of gener
ality assume that these congurations are required at the
start of rounds  to Rc for pipeline stages numbered k  
to k    S  h	 The rst round for this case is the
same as the previous case Hence the number of stalls in
the rst round is given by Wcr In the second round the
uncached congurations are required from the beginning of
the round As there are no cycles available from the rst
round there are only Sh	  cycles available for over
lapping at the beginning of the second round where the un
cached congurations are loaded in the fabric The number
of conguration stalls in the second round is given by
W crc  S h	nc  	  
To compute the stalls due to data in the second round and
the number of stalls in the remaining rounds consider Fig
ure 
 that shows two successive rounds From the S cycles
marked in the gure hS cycles are shared for data and
conguration fetching while remaining S  h	  cycles
are used only for congurations Data takes k  	nd cy
cles If k   data elements can not be fetched in hS  
cycles then the stalls due to data are given by
W crd  maxf k  	nd  hS  g






Before the beginning of Sh	 interval shown in Figure 

the number of cycles available for conguration fetching is
maxf hS k 	ndg Either this interval or number
of prefetch buers determines the number of congurations
prefetched as




If   Sh	 Sh	 cycles are present for overlap
ping the fetching of the remaining congurations  There
fore the number of stall cycles due to congurations in a
round is
W crc  maxf S h	 nc  S  h	  g
There are W crd W

crc stalls in each of the rounds from















New set of configurations
reuired
Figure  A round of Data Scheduling when X  Xmax for i  
S  Cmax  k	 stated in Eq  and the number of total
execution cycles Tc is given by Eq 











 	 W crc 	





3.4.2 Data Caching
The total data memory available can store $(X_{max} + k)$ data elements. Similar to configuration caching, we consider two cases for computing the total execution cycles.

Case I: $X \le (X_{max} + k)$

Figure 5 shows that the first round of data caching needs to fetch k configurations and X data elements. The first round requires a total of $k n_c + X n_d$ IO cycles. These IO cycles are overlapped with $X + k - 1$ cycles of the execution. The number of resulting stalls in the first round is given by

$$W_{dr1} = k n_c + X n_d - (X + k - 1)$$

From the second round onwards, all the data elements are accessed from the cache. All rounds from 2 to $R_d - 1$ are similar, and they need to prefetch the k configurations that are loaded at the end of the round. There are $X + 1$ execution cycles for overlapping the $k n_c$ IO cycles of fetching configurations. The number of stalls resulting from insufficient overlap is given by

$$W_{dr2} = \max\{0,\ k n_c - (X + 1)\}$$

There are no stalls in the last round, because it does not need to prefetch any configurations. When $X \le (X_{max} + k)$, the total number of stalls and the total number of execution cycles are given by

$$W_{dt} = W_{dr1} + W_{dr2} \left( \left\lceil \frac{S}{k} \right\rceil - 2 \right)$$

$$T_d = EX_d + W_{dt}$$
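
A corresponding Python sketch for data caching, Case I, is given below. It assumes S is a multiple of k, uses the simplified stall-free count, and takes `x_max` as the cache capacity in data elements; all parameter values in the example are hypothetical.

```python
def data_caching_cycles_case1(S, k, X, n_c, n_d, x_max):
    """Return (EX_d, W_dt, T_d) for data caching when X <= X_max + k.

    Assumes S is a positive multiple of k (simplification).
    """
    if X > x_max + k:
        raise ValueError("Case I requires X <= X_max + k")
    if S <= 0 or S % k != 0:
        raise ValueError("sketch assumes S is a positive multiple of k")
    rounds = S // k                                  # R_d
    ex_d = (X + k) + (rounds - 1) * (X + 1)          # stall-free cycles
    w_first = k * n_c + X * n_d - (X + k - 1)        # first-round stalls
    w_middle = max(0, k * n_c - (X + 1))             # stalls per middle round
    w_total = w_first + max(rounds - 2, 0) * w_middle
    return ex_d, w_total, ex_d + w_total


two_rounds = data_caching_cycles_case1(S=6, k=3, X=6, n_c=4, n_d=1, x_max=16)
three_rounds = data_caching_cycles_case1(S=9, k=3, X=6, n_c=4, n_d=1, x_max=16)
print(two_rounds)    # (16, 10, 26)
print(three_rounds)  # (23, 15, 38)
```

With n_d = 1 the data term X n_d cancels against the X execution cycles, so the first-round stalls come from configuration fetching alone.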
Case II X  Xmax  k	
In this case Xmax data elements are cached The number of
stalls in the rst round remains the same as in Case I and
is given byWdr From second round onwards XXmax	
data elements are to be fetched once in every round Figure
 shows roundi	 of data caching where i   The interval
Xmax is shared to prefetch data and congurations The
number of data elements that can be prefetched during this
interval is given by





The number of cycles remaining in the Xmax interval is
maxfXmax  dndg The number of conguration that






In the remaining interval X Xmax cycles are shared for
fetching congurations and data The number of congu
ration to be fetched within this interval is k  c and the
number of data elements to be fetched is X  Xmax  d
Therefore the number of stalls in a round is given by
W

dr  maxf k  c	nc 
X Xmax  d	nd  X Xmax	g
The last round does not have to fetch any congurations
hence the number of stalls in the last round is given by
W dl  maxf X Xmax  d	nd  X Xmax	g
Total stalls are obtained by adding rst and the last round
stalls with the stalls in the remaining rounds as shown in
Eq  The total number of execution cycles are obtained
as given in Eq 







e  	 	
Td  EXd Wtd 	
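
The Case II stall count for data caching can be sketched in Python as below. Because the defining expressions for d and c are not reproduced here, the sketch simply takes them as inputs; that choice, the function name, and the example values are all assumptions made for illustration. S is assumed to be a multiple of k.

```python
def data_caching_stalls_case2(S, k, X, n_c, n_d, x_max, d, c):
    """Total stalls for data caching when X > X_max + k.

    d: data elements prefetched during the X_max interval (input, assumed known)
    c: configurations prefetched during the X_max interval (input, assumed known)
    """
    if X <= x_max + k:
        raise ValueError("Case II requires X > X_max + k")
    if S <= 0 or S % k != 0:
        raise ValueError("sketch assumes S is a positive multiple of k")
    rounds = S // k                                   # R_d
    uncached = X - x_max                              # elements fetched per round
    w_first = k * n_c + X * n_d - (X + k - 1)
    w_middle = max(0, (k - c) * n_c + (uncached - d) * n_d - uncached)
    w_last = max(0, (uncached - d) * n_d - uncached)
    return w_first + max(rounds - 2, 0) * w_middle + w_last


stalls = data_caching_stalls_case2(
    S=9, k=3, X=24, n_c=4, n_d=2, x_max=16, d=2, c=1
)
print(stalls)  # 34 + 12 + 4 = 50
```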
4 Results
In this section we compare the number of execution cycles for configuration caching and data caching with the following parameters:
1. The number of stripes in the FPGA fabric, k.
2. The configuration word size $W_c$, in bytes.
3. The cache size M, which fixes the maximum number of configurations that can be stored ($C_{max}$).
4. The width of the bus interface between the FPGA and the external memory, which fixes $n_c$.
5. The number of pipeline stages S, for which three values are considered.
6. The amount of data processed by the application, in bytes, for which three sizes are considered. The number of data elements (X) depends on the data element size $W_d$.
7. Two data element sizes $W_d$, with their respective values of $n_d$ (values normally encountered for block encryption).

Figure 8: Execution cycles versus total data in bytes for data caching and configuration caching, in panels (a) and (b), one per value of $n_d$; panel (a) is for $n_d = 1$.

Figure  shows the variation in execution time with the
amount of data processed for two values of S and nd The
maximumamount of data that can be cached is KB which
corresponds to X  
 in Figure a	 and X  
 in
Figure b	 For conguration caching with S  
 and
 all congurations t in the cache and for both the
values of nd the nostall condition from round  onwards
is met The stalls are added only in the rst round and
the number of stalls is determined by S For data caching
with KB and KB of data all the data elements can t
in the cache and for 
KB data only KB is cached and
the remaining data is fetched from the external memory
whenever required For given S and nd when X  Xmax
the nostall condition for congurations from round  on
wards is met Wdr	 The stalls are mainly caused by
the fetching of k congurations andX data elements in the
rst round When nd   there are no stalls because of
data fetching When nd   the number of stalls is mainly
determined by the number of cycles required to fetch k
congurations Since for both values of S we have S  k
the stalls for data caching are fewer than that for cong
uration caching Hence data caching performs better than
conguration caching even when X  Xmax The per
formance gain of data caching is approximately  when
S  
 and it is  when S   In Figure b	 when
nd   the number of stalls in the rst round is directly
proportional to X Since stalls are present only in the rst
round for X  Xmax data caching performs comparably
or better than conguration caching depending on S We
observe that when total data size is KB there is a sudden
increase in the number of execution cycles This is because
after this point X is greater than Xmax and stalls propor
tional to X Xmax start appearing in subsequent rounds
of data caching which increase number of total execution
cycles It shows that when nd is large and X  Xmax
it is better to execute data caching on blocks of data at
a time This is another scheduling scheme called blocked
data scheduling which performs better than conguration
caching when the size of data block is large enough to pro
vide execution cycles to overlap conguration fetching
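The blocked data scheduling idea can be sketched as a simple partitioning step. This is a minimal illustration, not the scheduler evaluated in this paper: the helper names `plan_blocks` and `valid_block_size`, and the parameter `overlap_elements` (a stand-in for the number of elements whose execution is long enough to hide configuration fetching), are assumptions introduced here.

```python
from typing import List, Tuple


def valid_block_size(block_size: int, x_max: int, overlap_elements: int) -> bool:
    """Check the two constraints on a block: it must fit in the cache
    (block_size <= Xmax) and be large enough to overlap configuration
    fetching (block_size >= overlap_elements, a hypothetical threshold)."""
    return overlap_elements <= block_size <= x_max


def plan_blocks(x: int, block_size: int) -> List[Tuple[int, int]]:
    """Split data elements 0..x-1 into consecutive [start, end) blocks.

    Each block is meant to be cached and passed through all pipeline
    stages before the next block is loaded."""
    if block_size < 1:
        raise ValueError("block size must be at least 1")
    return [(start, min(start + block_size, x)) for start in range(0, x, block_size)]


# Illustrative values only: 10 elements in blocks of 4.
print(plan_blocks(10, 4))
```

The last block may be shorter than `block_size`; whether such a tail block still hides configuration fetching depends on the application and is not modelled in this sketch.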
Figure : Execution cycles for S = , comparing data caching and configuration caching.
Figure  shows the comparison of two executions for S 

 The cache can t only  congurations and the re
maining  congurations are required to be fetched from
the external memory The large size of conguration word
along with low cache hit ratio 	 causes large number of
stalls in the conguration caching execution In this case
data caching still provides better performance as large X
is available for hiding conguration fetching and the time
required to fetch uncached data is relatively less as nd  nc
In this case data caching performance is  better than
that of conguration caching
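The effect of a cache that holds only part of the configurations can be made concrete with a small calculation. The sketch below assumes that Cmax configurations stay resident in the cache and the remaining S - Cmax are fetched once per round, with nc cycles per configuration fetch; this residency policy, the function name `config_cache_stats` and the numbers in the usage line are assumptions for illustration, not values or definitions taken from this paper.

```python
from typing import Tuple


def config_cache_stats(s: int, c_max: int, nc: int) -> Tuple[float, int, int]:
    """Return (hit ratio, uncached configurations, fetch cycles per round).

    Assumes min(S, Cmax) configurations are held in the cache and every
    other configuration costs nc cycles to fetch in each round. The result
    counts raw fetch cycles, not stalls: fetches that overlap execution
    do not stall the pipeline."""
    if s < 1:
        raise ValueError("the number of pipeline stages S must be at least 1")
    cached = min(s, c_max)
    uncached = s - cached
    return cached / s, uncached, uncached * nc


# Illustrative values only: 64 stages, room for 32 configurations, 16 cycles per fetch.
print(config_cache_stats(64, 32, 16))
```

When S ≤ Cmax the hit ratio is 1 and no configuration is refetched, which matches the case where all configurations fit in the cache.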
 Conclusion
When the number of data elements processed by the application, X, is large ( k), as mentioned in Section , the number of execution cycles without IO overhead is lower for data caching than for configuration caching. Without considering IO overhead, the performance gain of data caching over configuration caching increases as the amount of data processed by the application and the number of pipeline stages in the application increase. This performance gain also applies to total execution cycles (with stalls) as long as the IO overhead of data caching is less than that of configuration caching. The number of stalls caused by IO in data caching increases as the number of data elements and the number of cycles required to fetch a data element increase. When the number of stalls caused by IO in data caching is high, there exists a better approach called block data caching, as mentioned in Section . The block data approach provides better performance than data or configuration caching by selecting the block size to be small enough to fit in the cache and large enough to overlap configuration prefetching during its execution. This improves performance but requires a higher number of IO cycles than the other two schemes. The total amount of IO can be kept under control by keeping a few of the most frequently used configurations in the cache along with the block of data. This is a hybrid caching scheme. Currently we are evaluating the hybrid scheme, whose performance will be determined by the coverage function of the application. By processing a block of data at a time, the hybrid caching scheme can solve some of the practical problems in the data caching scheme. We also plan to evaluate these three schemes (configuration, data and hybrid caching) for real applications.
References
 H Schmit Incremental Reconguration for Pipelined
Applications in Proceedings of the IEEE Symposium
on FPGAs for Custom Computing Machines pp 
 
 J D Hadley and B L Hutchings Design Method
ologies for Partially Recongured Systems in Pro
ceedings of the IEEE Workshop on FPGAs for Custom
Computing Machines pp  April 
 J G Eldredge and BL Hutchings Density Enhance
ment of a Neural Network Using FPGAs and Run
Time Reconguration in Proceedings of the IEEE
Workshop on FPGAs for Custom Computing Ma
chines pp  April 
 J E Vuillemin Patrice Bertin Didier Roncin Mark
Shand Herve H Touati and Philippe Boucard Pro
grammable Active Memories Recongurable Systems
Come of Age IEEE Transactions on VLSI Vol  No
 March 

 J Hauser and J Wawrzynek Garp A MIPS Proces
sor with a Recongurable Coprocessor in Proceedings
of the IEEE Symposium on FieldProgrammable Cus
tom Computing Machines April 

 N Shirazi P Athanas and L Abbott Implementa
tion of a D Fast Fourier Transform on a FPGAbased
Custom Computing Machine The th International
Workshop on Field Programmable Logic and Applica
tions September 
 R Bittner P Athanas and M Musgrove Colt An
Experiment in Wormhole Runtime Reconguration
presented at SPIE Photonics East  November 

 S Cadambi J Weener S C Goldstein H
Schmit and D E Thomas Managing Pipeline
Recongurable FPGAs in Proceedings ACMSIGDA
Sixth International Symposium on Field Programmable
Gate Arrays February 
