On reconfiguring cache for computing by Kim, Hue-Sung et al.
Electrical and Computer Engineering 
Conference Papers, Posters and Presentations Electrical and Computer Engineering 
1999 
On reconfiguring cache for computing 
Hue-Sung Kim 
Iowa State University 
Arun K. Somani 
Iowa State University, arun@iastate.edu 
Akhilesh Tyagi 
Iowa State University, tyagi@iastate.edu 
Recommended Citation 
Kim, Hue-Sung; Somani, Arun K.; and Tyagi, Akhilesh, "On reconfiguring cache for computing" (1999). 
Electrical and Computer Engineering Conference Papers, Posters and Presentations. 163. 
https://lib.dr.iastate.edu/ece_conf/163 
Abstract 
The number of transistors on chip has dramatically increased within the last decade. A considerable 
portion of a chip is dedicated to a cache memory in a modern microprocessor chip. However, some 
applications may not need all the caches for storage. In addition, some applications have embedded 
computations with a regular structure. The behavior of the applications is static, which implies that a 
specialized function unit could be beneficial for the application. This presents an opportunity to explore 
the use of a part of a cache for performing these regular computations. In this paper, we show one such 
design to convert a cache into a function unit to improve the performance of an application. A 
reconfigurable cache takes less area than the area of a cache and a function unit together and imposes 
no time overhead. In order to convert a cache memory to a function unit, we mapped multi-bit output look-up
tables (LUTs) into the cache structure. Therefore, the cache can perform computations when it is
reconfigured as a function unit.
Keywords 
cache storage, microprocessor chips, reconfigurable architectures, table lookup, performance evaluation, 
embedded systems 
Disciplines 
Data Storage Systems 
Comments 
This is a manuscript of a proceeding published as Kim, Hue-Sung, Arun K. Somani, and Akhilesh Tyagi. 
"On reconfiguring cache for computing." In Seventh Annual IEEE Symposium on Field-Programmable 
Custom Computing Machines, pp. 296-297. IEEE, 1999. DOI: 10.1109/FPGA.1999.803704. Posted with 
permission. 
On Reconfiguring Cache for Computing
Hue-Sung Kim, Arun K. Somani
Department of Electrical and Computer Engineering
Iowa State University, Ames
E-Mail: {huesung, arun}@iastate.edu
Akhilesh Tyagi
Department of Computer Science
Iowa State University, Ames
E-Mail: tyagi@iastate.edu
Abstract
The number of transistors on chip has dramatically increased within the last decade. A considerable portion of a chip is dedicated to a cache memory in a modern microprocessor chip. However, some applications may not need all the caches for storage. In addition, some applications have embedded computations with a regular structure. The behavior of the applications is static, which implies that a specialized function unit could be beneficial for the application. This presents an opportunity to explore the use of a part of a cache for performing these regular computations. In this paper, we show one such design to convert a cache into a function unit to improve the performance of an application. A reconfigurable cache takes less area than the area of a cache and a function unit and imposes no time overhead. In order to convert a cache memory to a function unit, we mapped multi-bit output LUTs into the cache structure. Therefore, the cache can perform computations when it is reconfigured as a function unit. The experimental results show that the reconfigurable module improves the execution time of an application with a large number of data elements by a large factor, as high as [...]. The cache, of course, also works as a normal cache, with an access time similar to that of a normal cache, when the function unit is not required.
1 Introduction
The number of transistors on a chip has increased dramatically in the last decade. Within the next five years, we will have billions of transistors on a chip. In a modern microprocessor, more than half of the transistors are used for cache memories. This trend is likely to continue. However, not all applications will use the full capacity of a cache at a time. This results in low utilization of the cache memory when running applications that do not require the entire cache capacity.
The availability of a large number of transistors on a next-generation processor chip has motivated several researchers to study the use of reconfigurable logic for on-chip coprocessors. Such logic can accelerate the execution of applications and then provide results to the host processor or store them into a cache. Thus, it improves the performance of the applications and reduces the bottleneck of off-chip communications. For example, if the processor employs an FPGA or a coprocessor to accelerate applications, the data transfer between the processor and those coprocessors increases the requirement for communication bandwidth and eventually results in a bottleneck.
This work was funded by Carver Trust Grants, Iowa State University.
In the Garp architecture, programmable logic resides on the processor chip to accelerate computations in a conventional processor. The computations are expected to be used frequently in the architecture. If an application does not need the logic, these functions remain idle. PipeRench tries to reconfigure the hardware every cycle to overcome the limitation of hardware resources.
XC[...] from Xilinx has a different architecture from the previous FPGA series to allow on-line reconfiguration. An advantage of this architecture is that a number of smaller configuration memory blocks can be combined to obtain a larger memory. However, the fine-grained memory cannot be synthesized efficiently in terms of area and time. In particular, providing a decoder for each small chunk of memory requires a large number of decoders, which is expensive.
This observation motivates us to design a reconfigurable module that works as a function unit as well as a cache memory and that reduces the extra logic. A general-purpose processor may not perform well for compute-intensive applications such as image processing, encryption, and signal processing. We therefore propose to create special-purpose logic for these applications on the processor chip using a reconfigurable memory/function unit module. For this purpose, we partition the cache into several smaller caches. Each cache can then be converted into a specialized, dedicated, compute-intensive function unit.
Our goal is to develop a mechanism to convert a cache into a function unit to improve the performance of an application by implementing a reconfigurable module with low area and time overhead. We first describe the concept of a multi-bit output LUT used in the proposed architecture in Section 2. Section 3 describes the architecture of a reconfigurable module, including its function and cache operations. The configuration and scheduling of the module are discussed in Section 4. Section 5 shows experimental results on the reconfigurable module. We conclude the paper in Section 6.
 Multibit output LUTs
In most FPGA architectures a Lookup table LUT
usually has four inputs and one output to keep the
overall operation and routing ecient  However a
SRAMbased single output LUT does not t well to a
cache memory architecture because of a large amount
of overhead for the decoders in the cache with a large
block size Instead of a single output LUT we propose
to use a structure with multibit output LUTs Such
LUTs produce multiple output bits for a single combi
nation of inputs and are better suited for a cache than
the single output LUT Since a multibit output LUT
has the same inputs for all output bits it is less exible
in implementing functions A bit adder and a bit
multiplier or a 
x constant coecient multiplier all
need the same size of LUT are depicted in Figure 

































































Figure  Multioutput LUTs  a A bit adder  b A
bit or a x constant coecient multiplier
If a multibit output LUT is large enough for a com
putation no interconnection for example to propagate
a carry for an adder may be required since all possible
outputs can be stored into the large memory In addi
tion unlike a single output LUT a large LUT requires
only one big decoder or a multiplexer with multiple
inputs Thus the area for decoders reduces How
ever the overall memory requirement increases The
required memory size of one large LUT increases expo
nentially when the number of inputs increases There
fore multibit LUTs may not be areaecient in all
situations The computing time in this case may also
not reduce much due to the complex memory block and
the increased capacitance on long bit lines for reading
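The exponential growth mentioned above can be made concrete with a small calculation. The following is a minimal sketch, not part of the original design: the helper names `lut_bits` and `build_adder_lut` are hypothetical, and the LUT is assumed to store one full sum per address, with the address formed by concatenating the two operands.

```python
def lut_bits(num_inputs: int, num_outputs: int) -> int:
    """Storage of one LUT: 2**num_inputs entries of num_outputs bits each."""
    return (2 ** num_inputs) * num_outputs


def build_adder_lut(width: int) -> list[int]:
    """Contents of a single LUT that adds two `width`-bit operands.

    Assumed addressing: address = (a << width) | b.
    Each entry holds the (width + 1)-bit sum, so no carry wiring is needed.
    """
    size = 1 << width
    return [a + b for a in range(size) for b in range(size)]


if __name__ == "__main__":
    # Memory needed when the whole adder is one LUT, for growing operand widths.
    for width in (2, 4, 8):
        print(width, lut_bits(2 * width, width + 1))
```

Doubling the operand width squares the number of entries, which is why a single large LUT quickly stops being area-efficient.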
Instead of using one large LUT, we show implementations of an 8-bit adder with a number of smaller multi-bit output LUTs, depicted in Figure 2. Figure 2(a) depicts an 8-bit adder consisting of two 9-input LUTs. Each LUT has two 4-bit inputs, one bit of carry-in, and 5-bit outputs for a 4-bit addition. Thus, the total memory requirement is $2 \times 2^9 \times 5 = 5120$ bits. The carry is propagated to the next LUT only after the previous 4-bit addition in one LUT is completed, i.e., a ripple carry. Since the LUTs must be read sequentially, this adder takes a longer time to finish an addition. Employing the concept of a carry-select adder, as depicted in Figure 2(b), results in a faster adder with 8-LUTs because the reading of the LUTs does not depend on the previous carry. In this case, the actual result of each 4-bit addition is selected using a carry propagation scheme, while all the LUTs are read in parallel. The total time for the modified adder is the sum of the reading time for one LUT and the propagation time for two multiplexers. If we have more input bits, the modified adder is much faster. However, this adder is not area-efficient because it still requires a large amount of memory.
Figure 2: 8-bit adder using (a) two 9-LUTs, (b) two 8-LUTs, (c) four 4-LUTs.
To make an area-efficient adder, a 4-LUT with multi-bit outputs can be exploited (Figure 2(c)). The same carry propagation scheme as in Figure 2(b) can be applied to the 4-LUTs to implement an 8-bit adder, but four 4-LUTs are used. The total time of the adder with the 4-LUTs might be larger than that with the 8-LUTs because the carry has to propagate through twice the number of multiplexers. However, the read time for a 4-LUT is much shorter than for an 8-LUT, since it has a smaller decoder and shorter lines for reading memory. We therefore propose to employ the 4-LUT with multiple outputs to implement such a function.
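The two LUT-based adder styles discussed in this section can be modelled in software. The sketch below is illustrative only and rests on assumptions: an 8-bit adder built either from two 9-input LUTs (ripple carry) or from four 4-input LUTs (carry select), and a carry-select LUT entry that packs both candidate sums (for carry-in 0 and carry-in 1) side by side. That packing and all function names are hypothetical choices, not taken from the original design.

```python
def make_ripple_lut(width):
    """One (2*width + 1)-input LUT: address = (cin, a, b), entry = a + b + cin."""
    size = 1 << width
    return [a + b + cin for cin in (0, 1) for a in range(size) for b in range(size)]


def make_select_lut(width):
    """One (2*width)-input LUT holding both candidate sums for address (a, b).

    Assumed packing: the sum for carry-in 1 sits above the sum for carry-in 0.
    """
    size = 1 << width
    out_bits = width + 1
    return [((a + b + 1) << out_bits) | (a + b) for a in range(size) for b in range(size)]


def ripple_add(x, y, width=4, blocks=2):
    """Ripple carry: each LUT read needs the carry produced by the previous read."""
    lut = make_ripple_lut(width)
    mask = (1 << width) - 1
    carry, result = 0, 0
    for i in range(blocks):
        a = (x >> (i * width)) & mask
        b = (y >> (i * width)) & mask
        entry = lut[(carry << (2 * width)) | (a << width) | b]
        result |= (entry & mask) << (i * width)
        carry = entry >> width
    return result | (carry << (blocks * width))


def carry_select_add(x, y, width=2, blocks=4):
    """Carry select: LUT reads do not depend on the carry; multiplexers pick results."""
    lut = make_select_lut(width)
    mask = (1 << width) - 1
    out_bits = width + 1
    entries = []
    for i in range(blocks):  # these reads could all happen in parallel
        a = (x >> (i * width)) & mask
        b = (y >> (i * width)) & mask
        entries.append(lut[(a << width) | b])
    carry, result = 0, 0
    for i, entry in enumerate(entries):  # multiplexer chain driven by the carry
        chosen = (entry >> out_bits) if carry else (entry & ((1 << out_bits) - 1))
        result |= (chosen & mask) << (i * width)
        carry = chosen >> width
    return result | (carry << (blocks * width))
```

In this model only the short multiplexer chain is sequential in `carry_select_add`, whereas `ripple_add` serializes the LUT reads themselves.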
3 Reconfigurable Cache Module
In the design of reconfigurable cache modules, we assume that multiple cache modules are attached to a processor on a chip. Each of these cache modules can be programmed to cache a particular part of the address space [3]. They can also be programmed to act as function units. We also assume that the number of cache units is sufficiently large so that the use of a few cache units as function units does not adversely affect the cache operation. Alternatively, if the cache modules are organized with k-way set associativity, removing one cache module will convert the cache into a (k-1)-way set-associative cache. In either case, it is possible to pull a cache module out of operation to convert it into a function unit.
Processor with reconfigurable caches
Figure 3 shows an overview of the processor with reconfigurable caches (RCs). RC1, RC2, RC3, ..., RCn in Figure 3 can be converted to function units, for example, to carry out functions such as an FIR filter, an IDCT/MPEG decoding function, encryption, and Gen[...], respectively. All the applications to be embedded into caches are compute-intensive. We assume that each cache module is designed to carry out one specific dedicated function. Other caches, not shown in the figure, may or may not be reconfigurable. When one cache is used as a function unit (one of the above functions), the other caches continue to operate as storage units as usual. It is also possible to reconfigure some cache modules to become data input and output units for a function unit, to provide input and receive output at the speed of the function unit. For example, if cache 1 performs FIR filtering, some other cache, cache 2, may initially contain input data, and cache 3 may be empty to receive output data. To adapt to the various sizes of data for different applications, we may also use reconfigurable sizes of cache memory modules. Each cache can have a different depth and width and appropriate decoding according to the demand of applications. The configuration of memory is completed by the host processor because the processor has the information about cache mappings to take care of the dynamic nature of cache functionality.
Figure 3: Overview of a processor with multiple reconfigurable cache modules.
When a processor employs several such modules that can perform as caches as well as function units, the data transfer between the processor and these modules increases the requirement for communication bandwidth. This may eventually result in another kind of bottleneck. We therefore adopt an efficient communication unit to satisfy the bandwidth needs. The modules can communicate with each other, as well as with the main memory and the host processor, using Reconfigurable Multiple Bus (RMB) networks. RMB makes the communications among cache modules efficient by providing all the required paths within the available resources of RMB. RMB may be well suited to the communication needs between the reconfigurable cache and the data caches or the host processor. We do not discuss the actual operation of RMB here.
 SingleMultiple functions in a module
Although a recongurable cache module can be re
congured to realize more than one type of function
unit we propose that each cache module be designed
to recongure to one specic function unit Since each
recongurable module can be converted into only one
specic function unit interconnections for each func
tion unit in the module are xed By making the in
terconnections and the function xed in a module we
have several advantages and some disadvantages as de
scribed below
Advantages
 The xed interconnection is less complex takes
less area and allows faster communications than a
recongurable interconnection Unlike in FPGAs
it also reduces the routing overhead and commu
nication time between LUTs
 With a xed function in each cache module we
know what the conguration data is and how they
must be programmed when the module is con
verted Therefore it is possible to make the recon
guration faster by implementing a special logic
which allows multiple writes to multiple recong
urable LUTs for identical functionalities or direct
connections to LUTs
Disadvantages
 Only a few specic applications can be imple
mented during the fabrication stage We can
therefore choose only the most common functions
to be implemented Since the functions might be
applicationspecic and thus such a set cannot be
determined for general purpose computing
 The proposed recongurable module would take
less area than a general purpose recongurable
module However if we implement a number of
functions into several recongurable modules the
total area of all the modules may be higher than
a single general purpose recongurable module
In this paper each recongurable module has a ded
icated and xed interconnection for its own dedicated
function implemented using the LUTs described in Sec
tion  The interconnections between LUT rows do
not aect the operations when the module is used as
a cache memory since data is readwritten through
global bit lines described in the next section
Organization and operation of module
Since we target applications that are compute-intensive and have a regular structure, we first partition them at a coarse level. A function in each level can be implemented using the multi-bit output LUTs described in Section 2. We only add pipeline registers to each coarse-level stage, which contains a number of LUTs, to make the entire function unit efficient. All these registers are enabled by the same global clock. Therefore, a number of computations can be performed in a pipelined fashion. Each basic computation in one pipeline stage, such as an addition or a multiplication, can also be pipelined. This increases the number of pipeline stages but improves the throughput of the pipeline by shortening the cycle time of one pipeline stage. However, it requires more pipeline registers, which increases the area overhead. In this paper, to reduce the area overhead, we add registers to each stage but not to each LUT.
Figure 4 depicts a possible cache organization of the module. The cache can be viewed as a two-dimensional matrix of LUTs. Each LUT has 16 rows to support a 4-LUT function and as many bits in each row as required to implement a particular function. In the function unit mode, the output of each row of LUTs is manipulated to become the input for the next row of LUTs in a pipelined fashion. In the cache mode, the least significant 4 bits of the address lines are connected to the row decoders dedicated to each LUT. The rest of the address lines are connected to a decoder for the entire cache, shown in the figure. In the cache memory mode, the LUTs take the 4-bit address as their inputs, selected by the enable signal from the big decoder. Therefore, no matter what the value of the upper bits in the address is, the dedicated row decoder selects a word line in each row of LUTs. This means one word is selected in each row of LUTs.
Cache operation in the reconfigurable module
Each LUT thus produces as many bits as the width of the LUT. These are the local outputs of the LUTs, and they are available on the local bit lines of each LUT row. For a normal cache operation, one of the local outputs needs to become the global output of the cache. This selection is made by the big decoder, which decodes the remaining n - 4 address bits. The local outputs of the selected row of LUTs are connected to the global bit lines. The cache output is carried on the global bit lines, as shown in Figure 4. Thus, the output of any row of LUTs can be read/written as a memory block through the global lines. We propose that these global lines be implemented using an additional metal layer. The global bit lines are the same as the bit lines in a normal cache.
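The two-level selection described above can be sketched as follows. This is an illustrative model only: it assumes 16-row LUTs addressed by the least significant 4 address bits, and the names `split_address` and `read_row` as well as the list-of-lists memory layout are hypothetical.

```python
LUT_ADDR_BITS = 4  # assumed: each LUT has 2**4 = 16 rows


def split_address(row_address, total_bits):
    """Split an n-bit row address into (LUT-row select, local word line)."""
    assert 0 <= row_address < (1 << total_bits)
    local = row_address & ((1 << LUT_ADDR_BITS) - 1)  # fed to every local row decoder
    upper = row_address >> LUT_ADDR_BITS               # fed to the big decoder
    return upper, local


def read_row(lut_rows, row_address, total_bits):
    """Model a cache-mode read.

    lut_rows[r][w] is the content of word line w in LUT row r.
    """
    upper, local = split_address(row_address, total_bits)
    # Every LUT row drives its local bit lines, whatever the upper bits are.
    local_outputs = [row[local] for row in lut_rows]
    # The big decoder connects exactly one local output to the global bit lines.
    return local_outputs[upper]


if __name__ == "__main__":
    memory = [[r * 16 + w for w in range(16)] for r in range(8)]  # 8 LUT rows
    print(read_row(memory, 0b1010011, total_bits=7))
```

Because the local selection does not wait for the upper bits, the two decoding steps in this model are independent, which mirrors the parallel decoding discussed in the text.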
Both decodings can be done in parallel. After a row is selected by both decoders, one word is selected through a column decoder at the end of the global bit lines, as in a normal cache operation. In the figure, the tag part of a cache is not shown, and a direct-mapped cache is assumed for the module. However, the concept of a reconfigurable cache can easily be extended to a cache of any set associativity because the tag logic is independent of the function unit's operations. In summary, cache operations are possibly slowed by the additional logic for a function unit. We estimate this time penalty in the next section.
Access time for cache operations
Each bit line in a normal cache is replaced by a global line in the proposed architecture. Since there are no obstructive switches or logic on the global line path, except the connections (contacts) with the switches to the LUTs, the reconfigurable module does not have a higher delay due to the global line. However, the signal may be degraded by the increased line capacitance due to the contacts for switches and to global lines that are longer than the bit lines of a normal cache. It may therefore require additional amplification.
On the other hand, the cache with the reconfigurable structure may have a faster address decoder than a normal cache. Since each LUT, with its own row decoder for addressing in the reconfigurable module, is much smaller than a large synthesized memory block in a conventional cache, the decoding time of a LUT is much shorter than the decoding time of a large cache. As mentioned earlier, since the two decoders can decode in parallel, the possible word lines in a cache, selected by the least significant 4 bits, are ready to be read or written before the main row decoder even finishes decoding an address. The assumption here is that the main decoder has a larger number of address bits. Since the two decoding operations are independent, the decoder delay in the reconfigurable module is the maximum of the two decoding times. If there are many LUTs that take the same lower 4 bits in the module, we have to consider the increased capacitance due to the fan-out of the lower address bits. If the decoding delay is higher, we may need a special driver for the least significant 4 bits to reduce the delay. However, the drivers will not affect the size of the reconfigurable module much, as we can put a driver into the space saved by the reduced-size decoder for the higher-order bits. A conventional cache may also employ such a hierarchical decoding mechanism.
The word and bit lines in the reconfigurable cache are longer than those in a normal cache due to the dedicated row decoders and the interconnections between LUTs. Therefore, the delays of propagating a signal from the big decoder through the word lines and a data signal from a memory cell through the bit lines are higher in the module than in a normal cache. Other delays are similar in both cases.
To estimate the access time, we used a cache simulator, CACTI, with [...] µm technology, which was developed by Steven J. E. Wilton and Norman P. Jouppi. The access time consists of five parts: decoder, word line, bit line, sense amplifier, and data-out driver, for the tag and data parts. We modified the simulator slightly to fit our structure, modeling a different decoding scheme without the driver and longer lines, only for the data part. We computed the access time of the data part for an 8KB cache with a block size of [...] bits, which is used to implement our function example in Section 5. Table 1 shows the access time of a normal cache and of the reconfigurable module. As described above, the overall decoding time is decreased and the delays of the word and bit lines are increased. Since we reduce the decoding time significantly, the total cache access time (the sum of those five factors) in the reconfigurable module is less than in a normal cache.
Table 1: Comparison of access time for an 8KB cache
Component          Normal cache (ns)   Reconfigurable module (ns)   Comments
Sense amplifier    [...]               [...]                        same
Data out driver    [...]               [...]                        same
Total time         [...]               [...]                        decreased
Area overhead
In the new architecture, additional area is required for the dedicated interconnections, the row decoders for the LUTs, additional control units, and pipeline registers. Other resources, such as the row decoder for the upper bits and the memory cells, are required for both function and cache operations. In this section, we compare the areas of the components of a reconfigurable cache with those of a normal cache.
Figure 5: (a) A multi-output LUT with decoder and interconnection overhead; (b) an expanded-width LUT; (c) structure of a normal cache; (d) structure of a reconfigurable cache.
Figure 5 shows the areas of various components of a reconfigurable cache compared to normal cache blocks. In the figure, the size of a LUT in the reconfigurable cache is $(W + x)(L + y) = WL + Wy + Lx + xy$, while the size of the original cache block is $WL$. $W$ and $L$ are the width and the depth of a memory block to be used for a LUT, respectively; $x$ is the additional width due to the dedicated decoder, while $y$ is the additional depth due to multiplexers and interconnections for the reconfigurable cache. The area overhead is $(Wy + Lx + xy)/(WL)$. If $x = aW$ and $y = bL$, the overhead factor is $a + b + ab$. The area overhead depends on the fractions of additional length, horizontally and vertically, due to the 4-to-16 decoder and the fixed interconnection, respectively.
If we expand the size of LUTs with respect to the width of the memory cells ($W' = cW$ in Figure 5(b)), the fraction of the additional width decreases to $x/W' = aW/(cW) = a/c$ in the expanded LUT. The fraction of total area overhead for the new size of LUT is $a/c + b + ab/c$. By increasing the width of the LUT, the fraction of area overhead is reduced by $(a + b + ab) - (a/c + b + ab/c) = a(1 - 1/c) + ab(1 - 1/c)$, where $a, b > 0$ and $c > 1$. For example, if a = [...], b = [...], and c = [...], the overhead factor for a normal-size LUT is $a + b + ab$ = [...], while that for the expanded LUT is $a/c + b + ab/c$ = [...]. In addition, using the expanded LUT reduces the number of row decoders for LUTs in a given row. Section 4 (Number and size of LUTs) describes how we determine the number and the size of LUTs efficiently in terms of area and configuration time. In a complete reconfigurable cache, we have to add the area for the pipeline registers to the area overhead.
The area overhead of an entire reconfigurable cache without the registers is $(n \times \text{area of one LUT} - \text{area of the original cache}) / (\text{area of the original cache}) = (nWL \times \text{overhead for a LUT}) / (nWL) = a + b + ab$, where $n$ is the number of LUTs implemented in the cache. Figures 5(c) and 5(d) show the structures of a normal and a reconfigurable cache, respectively. The total area overhead is the sum of the above result and the area of the pipeline registers, which is $a + b + ab$ + (fraction of the area overhead due to pipeline registers).
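The overhead expressions of this section are easy to check numerically. The sketch below is a minimal illustration; the function names and the sample values of W, L, x, y and c are hypothetical and are not the values used in the paper.

```python
def lut_overhead(a, b, c=1.0):
    """Area overhead factor of one LUT.

    a = x / W (extra width fraction), b = y / L (extra depth fraction),
    c = W' / W (width expansion factor; c = 1 means a normal-size LUT).
    """
    return a / c + b + (a * b) / c


def overhead_from_dimensions(W, L, x, y):
    """Same overhead computed directly from the LUT dimensions."""
    return ((W + x) * (L + y) - W * L) / (W * L)


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to exercise the formulas.
    W, L, x, y = 10, 20, 2, 2
    a, b = x / W, y / L
    print(lut_overhead(a, b))        # normal-size LUT
    print(lut_overhead(a, b, c=2))   # LUT expanded to twice the width
```

For any positive a and b, a width expansion c > 1 lowers the overhead factor, since only the terms containing a are divided by c.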
4 Configuration and Scheduling
Configuration of a function unit
To reduce the complexity of the column decoding in a normal cache memory, data words are stored in an interleaved fashion in a block. Thus, the distance between two consecutive bits of a word is equal to the number of words in a block. However, for a LUT application, we need to use consecutive bits in a single LUT. This implies that we cannot store one entry of a multi-bit output LUT by writing one word in a cache. For example, if a 4-LUT produces an n-bit-wide output for a function, n words must be written to the LUT in the cache. However, since other LUTs in the same row can also be programmed simultaneously, no more than n words are required to fill up the contents of all LUTs in one row. This places a restriction that the width of a multi-bit output LUT be an integral multiple of the number of words in a cache block, to allow an efficient reconfiguration of all LUTs in a row. The number of LUTs in a column (placed vertically) for a pipeline stage may also be required to be a power of 2. Since all cache structures are based on a power of 2, it is more convenient to make all LUT parameters (length and width) a power of 2 to avoid a complicated controller and an arbitrary address generator. This may result in underutilization of memory. However, the idle memory blocks for LUTs are not likely to be a problem when the module is used as a function unit, due to the availability of sufficient memory blocks in a cache.
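The effect of interleaving on LUT programming can be illustrated with a small index calculation. This is a sketch under one assumption that follows from the description above: bit j of word w is stored at physical column j * words_per_block + w. The helper names are hypothetical.

```python
def column_of(word, bit, words_per_block):
    """Physical column of a given bit of a given word in an interleaved block."""
    return bit * words_per_block + word


def lut_entry_sources(lut_index, lut_width, words_per_block):
    """(word, bit) pairs that supply the consecutive columns of one LUT entry."""
    start = lut_index * lut_width
    return [(col % words_per_block, col // words_per_block)
            for col in range(start, start + lut_width)]


if __name__ == "__main__":
    # Hypothetical block: 4 words per block, LUTs that are 4 columns wide.
    for lut in range(3):
        print(lut, lut_entry_sources(lut, lut_width=4, words_per_block=4))
```

In this example every LUT entry draws one bit from each of the four words, so one word write touches all LUTs in the row at once, which is why the LUT width is tied to the number of words per block.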
Number and size of LUTs
The following conditions determine the number and size of LUTs, with the parameters described in Table 2. The size and number of LUTs must be a power of 2. After determining the most efficient number of LUTs in a row, we can reduce the area overhead by using dedicated decoders in the LUTs.
Condition 1: total number of bits per block ≥ number of LUTs × number of bits required per decoded entry in a LUT, i.e., $N_{bw} \times N_{wB} \ge x \times a$.
Condition 2: number of LUTs required for a function in a row [...] total number of bits per block [...].
These conditions imply that we can write $m \le N_{bw}/x$ configuration bits into each LUT in a row by writing one word in the cache. Also, the actual number of LUTs implemented in a row is equal to $N_{bw}/m$.
Table 2: Parameters for the number and size of LUTs
a      Number of bits required in a row of a LUT
x      Number of LUTs required in a LUT row
m      Number of bits to be written into a LUT by one word (power of 2)
N_bw   Number of bits per word
N_wB   Number of words per block
 InitialPartial reconguration
Initial conguration converts a cache into a specic
function unit by writing all the entries of LUTs in the
cache The conguration data to program a cache into
a function unit may be either available in an onchip
cache or an ochip memory Loading time for the con
guration data in the latter case will be much larger
than in the former case The conguration data may
be prefetched by the controller or the host processor to
reduce the loading time from ochip memory
Since we have a dedicated function in each mem
ory and each row of LUTs has its own local decoder
a recongurable decoder can also be constructed to
assist the reconguration for the specic function to
allow multiple writes to the LUTs one in each row
with the same contents The implementation of the re
congurable decoder which allows multiple writes into
memory bits in a column can be realized by adding
multiplexers Choosing all the parameters as a power
of  also results in an easier implementation of the re
congurable decoder
Parts of the cache unit used as a function unit can be reconfigured at run time using write operations to the cache. When a partial reconfiguration occurs, the function unit must wait for the reconfiguration to complete before feeding the next set of inputs. Since computation data (input and output) and reconfiguration data (contents of LUTs) for a function unit share the global lines as data buses, we cannot perform both computing and partial reconfiguration at the same time. Since the amount of computation data is usually much larger than the amount of reconfiguration data, reconfiguration may not occur frequently. Therefore, partial reconfiguration usually does not take much time. For example, in a convolution application with [...] taps, we need to reconfigure part of the module [...] times. It is possible to perform both computations and reconfigurations simultaneously if we have separate data lines for computation data and configuration data. This is expensive, and therefore we do not recommend it.
 Scheduling and controlling data ow
A cache module can also be used to implement a function that has more
stages than the reconfigurable cache can realize in one pass. In this
case, we divide the function into multiple steps. That is, the S stages
required for a function can be split into sets S1, S2, ..., Sk such that
each set Si can be realized by a cache module. If all Si's are similar,
then we can adapt data caching as described in  to store the partial
results of the previous stage as input for processing by the next
configuration. "Similar" here means that the LUT contents may change but
the interconnection between stages is the same. This happens, for example,
in a convolution application. By changing the contents of the LUTs, we can
convert a stage in the cache block to carry out the operation of a
different set of pipeline stages.

In a data caching scheme, we place all input data in a cache and process
them using the current configuration. At the end of that pass, the cache
module is configured for the following step. We have to store the
intermediate results from the sets of stages into another cache and then
reload them for the next set of computations. Therefore, we need two other
cache modules to store input and intermediate data, respectively. These
modules are address-mapped to provide efficient data caching for
intermediate results. The roles of the two caches can be swapped during
the next step, when a computation requires the intermediate results as
inputs and generates another set of intermediate results. If both an input
and an intermediate result must be fed for all the computations, we have
to keep the two caches as they are. The two caches must be large enough to
hold the input and intermediate results, respectively. Moreover, the
reconfigurable cache must be able to accept an input and an intermediate
result as its inputs.
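The multi-step scheme above can be sketched in software. The following is a minimal Python sketch, not the hardware design: it assumes a convolution whose taps are split into groups of at most `stages_per_pass` (a hypothetical name for the number of stages one configuration can realize). One list stands in for the input cache, and two lists stand in for the caches of intermediate results, whose roles are swapped after every pass.

```python
def multipass_convolution(x, w, stages_per_pass):
    """Convolve x with taps w using several passes of a fixed-size pipeline.

    x: input samples (kept in the "input cache" for every pass)
    w: tap coefficients
    stages_per_pass: taps that one configuration can realize (assumed)
    """
    if stages_per_pass < 1:
        raise ValueError("stages_per_pass must be positive")
    n_out = len(x) - len(w) + 1
    if n_out < 1:
        raise ValueError("need at least as many samples as taps")

    # Two buffers model the caches holding intermediate results.
    read_cache = [0] * n_out    # partial sums from the previous pass
    write_cache = [0] * n_out   # partial sums produced by this pass

    for first_tap in range(0, len(w), stages_per_pass):
        # "Reconfigure": load the coefficients for this set of stages.
        taps = w[first_tap:first_tap + stages_per_pass]

        # Stream every output position through the current configuration,
        # feeding both the input data and the intermediate result.
        for i in range(n_out):
            acc = read_cache[i]
            for k, coeff in enumerate(taps):
                acc += x[i + first_tap + k] * coeff
            write_cache[i] = acc

        # Swap roles: the results of this pass feed the next pass.
        read_cache, write_cache = write_cache, read_cache

    return read_cache
```

Because the convolution needs both the original input and the running partial sums in every pass, the input list is never overwritten; only the two intermediate buffers exchange roles.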
The host processor needs to set up all the initial configurations, which
includes writing configuration data into LUTs and configuring the
controller to convert a cache into a function unit. To do this, the host
processor passes the information about an application to the  of inputs,
data caches, and the reconfigurable cache. The data caches that hold the
input and the intermediate results are also allocated as resources by the
host processor. The controller establishes the connections between the
reconfigurable cache and the data caches using a bus architecture such as
an RMB. The addresses for input, intermediate, and output data are
produced by an address generator in the controller. These addresses are
sequential within the respective cache units. The controller also monitors
the computation and initiates the next step when the current step is
completed. Figure  shows some of these details.

Figure  : On-chip reconfigurable cache architecture
 Experimental Result
 Experimental setup
We experimented with a design of the reconfigurable cache converted to
perform a convolution function. The number of pipeline stages for the
convolution in a reconfigurable cache depends on the size of the cache to
be converted. Our simulation is based on an  KB cache with  bits per block
and  -bit wide words. A conventional convolution algorithm is shown in
Figure  .
for i = 1 to NumberOfIN {
    Y[i] = 0;
    for j = 1 to NumberOfTAPs {
        Y[i] += X[i+j-1] * W[j];
    }
}
Figure  Convolution algorithm
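For readers who want to run the algorithm of the figure above, a minimal Python transcription follows. It assumes 0-based lists instead of the 1-based arrays of the pseudocode, and the function and argument names are illustrative.

```python
def convolve(x, w, num_outputs):
    """Direct form of the convolution loop: y[i] = sum_j x[i + j] * w[j]."""
    if len(x) < num_outputs + len(w) - 1:
        raise ValueError("x must hold num_outputs + len(w) - 1 samples")
    y = [0] * num_outputs
    for i in range(num_outputs):       # i = 1 .. NumberOfIN in the pseudocode
        for j in range(len(w)):        # j = 1 .. NumberOfTAPs in the pseudocode
            y[i] += x[i + j] * w[j]    # X[i+j-1] * W[j] with 0-based indices
    return y
```

Each output element needs one multiply and one add per tap, which is the multiplier-plus-adder pair that one pipeline stage provides.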
One stage of convolution consists of a multiplier and an adder. In our
example, each stage is implemented by an  -bit constant coefficient
multiplier and a  -bit adder to accumulate up to  taps, as shown in
Figure  (a). The input data is double pipelined in one stage for the
appropriate computation. An  x  constant coefficient multiplier can be
implemented using two  x  constant coefficient multipliers and a  -bit
adder with appropriate connections. A  x  constant coefficient multiplier
is implemented using  LUTs with a single output from each LUT on FPGAs.
In our simulation, we split the  LUT of the conventional constant
coefficient multiplier into two  LUTs with multiple outputs, as shown in
Figure  (b).
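The decomposition of a constant coefficient multiplier into smaller LUT-based multipliers and an adder can be illustrated in software. The sketch below is an illustration under stated assumptions rather than the simulated design: it assumes an unsigned input split into two 4-bit halves and an unsigned constant coefficient, and the names `build_lut` and `const_multiply` are hypothetical.

```python
def build_lut(coefficient, index_bits=4):
    """Precompute coefficient * v for every index_bits-wide input value v."""
    return [coefficient * v for v in range(1 << index_bits)]


def const_multiply(x, lut, index_bits=4):
    """Multiply an unsigned (2 * index_bits)-bit x by the LUT's coefficient.

    The low and high halves of x each address the same table; the high
    partial product is shifted and added to the low one.
    """
    mask = (1 << index_bits) - 1
    if not 0 <= x <= (1 << (2 * index_bits)) - 1:
        raise ValueError("x out of range")
    low_product = lut[x & mask]                    # first small multiplier
    high_product = lut[(x >> index_bits) & mask]   # second small multiplier
    return low_product + (high_product << index_bits)  # combining adder
```

Changing the coefficient only requires rebuilding the table, which mirrors the way a tap is changed by rewriting LUT contents while the interconnection stays fixed.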
The concept of a carry select adder is employed for an addition using the
LUTs described in Section  . Therefore, we need a  -bit wide result for
a  -bit addition from a LUT: three bits when carry-in = 0 and three bits
when carry-in = 1. The structure of the LUT for the  -bit addition is the
same as that of the  x  constant coefficient multiplier, since that also
produces a  -bit output. An n-bit adder can be implemented using ⌈n/ ⌉
such LUTs and a carry propagation scheme. The out

Figure  : Multi-output LUTs: (a) a  -bit adder; (b) a  -bit or a  x 
constant coefficient multiplier
One stage of convolution can be implemented with  LUTs. To keep the number
of LUT rows a power of two, we put  LUTs in each LUT row and have  LUT
rows to use  out of  required LUTs. According to the conditions in
Section  , we can have m   m   Nbw/x   and the actual number of LUTs per
row turns out to be   NwB/m  . The final placement of LUTs is shown in
Figure  (b). A few LUTs in the figure are not used for the computation. In
Figure  (b), pipeline registers and decoders for LUTs are not shown. For
an  KB reconfigurable cache, we have  rows of LUT, which can be used to
implement  taps of the convolution algorithm.

Table  : Parameters for the RC

Parameter   Description                                          Actual value
Tcpu         cpu cycle time                                       ns
TAP         Number of taps
X           Number of data
S           Number of taps implemented in the RC
Tstage      The time to complete the computation in one stage     ns
Rmemcpu     Ratio of no. of cycles of  main memory access
            and  cpu cycle                                       n
Rcachecpu   Ratio of no. of cycles of  cache memory access
            and  cpu cycle                                       ns
a           Number of bits required to configure a content
            of LUT
m           Number of bits to be written by one word when
            configuring
Lcache      Number of rows (cache blocks) in the cache
LLUT        Number of rows (cache blocks) in a LUT
Wn          Number of words per cache block
 Area
To measure the actual area overhead, we experimented with a possible
layout of a reconfigurable cache. Figure  represents one stage of the
convolution unit in the reconfigurable cache described above. The pipeline
registers are not shown in the figure. According to this layout, we
obtained a =   and b =  , the parameters for area overhead in Section  .
The value for a takes into account the parameter c. In addition, the area
overhead of the pipeline registers is roughly  , including the
interconnections. Therefore, the total area overhead is a  b  ab     .
The total area of the reconfigurable module is therefore  times that of a
normal cache. Since the interconnections are fixed for the convolution,
they can be efficiently routed and do not take much area. Since the
placement and routing of the reconfigurable cache have been done manually
with CAD tools  , we can expect the area overhead to reduce further if an
automated algorithm realizing an optimal solution for the placement and
routing is used.
Figure  : A layout for one stage of the convolution

 Execution time
We achieved the following performance improvement results in our
simulation experiment for the convolution. We compare the execution time
of the function using a reconfigurable cache (RC) to that of a
conventional general purpose processor (GPP) using the algorithm in
Figure  . Since the reconfigurable cache may have to be flushed, we show
the results for two cases. In the first case, no data in the cache needs
to be written back to main memory before it is reconfigured as the
function unit. In the second case, the processor has to flush all the data
in the cache before configuring it. The extra time is denoted by the flush
time.
The total execution time of the convolution in the reconfigurable cache
consists of configuration and computation times. The configuration time
includes the times for adder and constant coefficient multiplier
configuration. In addition, in the second case, the cache flush time is
also added to the configuration time. The actual parameter values used to
compute the times are given in Table  . The expressions for the times are
presented below.
 Cong
Time for adder   Rmemcpu
a
m LLUT  
Rcachecpu
a
mLcache   S  Tcpu




LLUT TAP  Tcpu
 Cache Flush Time  RmemcpuWnLcache 
Tcpu
 Computation Time   TAPS  X  S   
T stage
In the computation time we add S instead of S for
the initial pipeline steps because we exploit the double
pipelined input data in each stage of the convolution
as shown in Figure a In addition we separate the
conguration time for adders and multipliers The rea
son for this is that only one set of data for a LUT is
necessary when reconguring the LUTs for adders be
cause the contents of all the LUTs are the same while
a dierent conguration data is necessary for multipli
ers Therefore we can store each conguration data of
a LUT for an adder into an onchip register and write
the data to all the LUTs for adders with cache access
time The time for storing and loading input and inter
mediate data can be overlapped with the computation
time Therefore data accessing time for the computa
tion is not added
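A rough timing model along these lines can be written down as a sketch. It rests on assumptions that should be checked against the expressions above: the flush time is taken to be the product of the memory-to-cpu cycle ratio, the words per cache block, the number of cache blocks, and the cpu cycle time; the computation is taken to make ceil(TAP / S) passes, each costing one stage time per data element plus a pipeline fill term. The fill term is passed in explicitly as the hypothetical parameter `fill_steps` rather than derived, and any values supplied are the caller's own, not the table's.

```python
import math


def cache_flush_time(r_mem_cpu, words_per_block, cache_rows, t_cpu):
    """Time to write every word of every cache block back to main memory."""
    return r_mem_cpu * words_per_block * cache_rows * t_cpu


def computation_time(num_taps, num_data, taps_in_rc, fill_steps, t_stage):
    """Passes needed to cover all taps, times the steps taken per pass."""
    passes = math.ceil(num_taps / taps_in_rc)   # assumed: one pass per S taps
    steps_per_pass = num_data + fill_steps      # stream the data, fill the pipe
    return passes * steps_per_pass * t_stage
```

Under this model the flush time is a fixed cost, so its share of the total shrinks as the number of taps or data elements grows.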
The ratios of execution times between RC and GPP are shown in Table  . We
assume that all the input

Figure  : Ratio of execution time of RM and GPP: (a) without memory flush;
(b) with memory flush before converting into the function unit
Table  : Comparison of execution time between SPARC and RM (usec)

No. of   No. of   SPARC      RM w/o memory flush   Ratio   RM w/ memory flush   Ratio
Taps     Data     (  MHz)    config    compute             config    compute
RC and GPP We traced the number of cache misses
in GPP for all the cases in Table 
 From the trace
we found that regardless of the number of taps and
data elements in the computation the number of cache
misses does not vary with the execution time There
fore we neglected the eect of the cache miss penalty
in the comparison
Our results show that the performance improvement of the reconfigurable
cache over the GPP grows as the number of data elements increases.
Figure  shows that, without memory flush (a), the performance improvement
is almost independent of the number of taps. With memory flush (b), the
ratio decreases for smaller numbers of taps, because the flush time
affects the ratio of the total execution time more as the number of taps
decreases.
 Conclusion
We have presented a reconfigurable module that can perform both as a
function unit and as a cache. This allows a processor to trade computing
bandwidth for I/O bandwidth. We have analyzed it for a convolution. The
reconfigurable cache for the computation of convolution improves the
performance by a large amount (a factor of up to  ). The area overhead for
this reconfiguration is about  , without any increase in the cache access
time. Since applications that have a regular structure may be implemented
in a reconfigurable module, we are currently developing similar mappings
for other functions. We are also studying whether more than one function
can be combined into one single reconfigurable module.
References
 YPatt S Patel M Evers D Friendly and J
Stark One billion Transistors One Uniprocessor
One Chip IEEE Computer pp  Sep 
 John R Hauser and John Wawrzynek Garp A
MIPS Processor with a Recongurable Coproces
sorin Proceedings of the IEEE Symposium on FP
GAs for Custom Computing Machines Apr 
	 Andre DeHon DPGAcoupled microprocessor
Commodity ICs for the early st century In
D a Buell and K L Pocek editors Proceedings
of IEEE workshop on FPGAs Custom Computing
Machines pp 		 Napa CA Apr 


 R Razdan and M D Smith A highperformance
microarchitecture with hardwareprogrammable
functional units in Proc of the th Annual
Intl Symp on microarchitecture pp 
IEEEACM Nov 

 A Tyagi  Recongurable memory queues as com
puting units architecture in Proc of the Recong
urable Architecture workshop at th International
Parallel Processing Symposium Apr 
 S Hauck T W Fry M M Hosler J P Kao The
Chimaera Recongurable Functional Unit IEEE
Symposium on FPGAs for Custom Computing Ma
chines pp  
 Hokie RISC processor
wwweevteducoursesee
rapidhtml
 Ralph D Wittig and Paul Chow OneChip An
FPGA Processor With Recongurable Logic in
IEEE Symposium on FPGAs for Custom Comput
ing Machines FCCM 
 National Semiconductors Adaptive Systems ona
Chip ASC
wwwnationalcomappinfomilaeronapa
 Srihari Cadambi Jerey Weener Seth Copen
Goldstein Herman Schmit Donald E Thomas
Managing PipelineRecongurable FPGAs in
Proceedings ACMSIGDA Sixth International
Symposium on FPGAs Feb 
 Xilinx Inc Introducing the XC FPGA Ar
chitecture wwwxilinxcomappshtm
 J Rose R Francis D Lewis and P Chow Ar
chitecture of programmable gate arrays The ef
fect of logic block functionality on area eciency
IEEE Journal of SolidState Circuits v  pp
 Oct 
	 C Wittenbrink and A K Somani Cache Tiling
for HighPerformance Morphological Image Pro
cessing in the Proc of Computer architecture for




 Steven JE Wilton Implementing Logic in
FPGA Embedded Memory Arrays Architectural
Implications in the IEEE Custom Integrated Cir
cuits Conference May 
 Tony Ngai Jonathan Rose and
Steven JE Wilton An SRAMProgrammable
FieldCongurable Memory IEEE Custom Inte
grated Circuits Conference  pp 
 
 Steven JE Wilton Jonathan Rose and Zvonko
G Vranesic Architecture of Centralized Field
Congurable Memory in ACMSIGDA Interna
tional Symposium on FPGAs pp 	 
 H ElGindy A K Somani H Schroder H
Schmeck and A Spray RMB  A Recongurable
Multiple Bus Network in Proc of Second HPCAS
Feb  pp
 Steven JE Wilton and Norman P Jouppi
CACTI an enhanced cache access and cycle time
model IEEE Journal of Solid State Circuits v 	
May  pp
 Deepali Deshpande Conguration Scheduling
Schemes for Striped FPGAs  in Proc of FPGA
Feb 
 Mathew Wojko and Hossam ElGindy Self Con
guring Binary multiplier for LUT addressable FP
GAs in the th Australasian Conference on Par
allel and RealTime Sep 
