Conﬁguration Bitstream Compression for Dynamically
Reconﬁgurable FPGAs
Ju Hwa Pan, Tulika Mitra, Weng-Fai Wong
School of Computing
National University of Singapore
Republic of Singapore 117543
{panjuhwa, tulika, wongwf}@comp.nus.edu.sg
Abstract
Field Programmable Gate Arrays (FPGAs) hold the possibility of
dynamic reconfiguration. The key advantages of dynamic
reconfiguration are the ability to rapidly adapt to dynamic
changes and better utilization of the programmable hardware
resources for multiple applications. However, with the advent of
multi-million gate equivalent FPGAs, configuration time is
increasingly becoming a concern. High reconfiguration cost can
potentially wipe out any gains from dynamic reconfiguration. One
solution to alleviate this problem is to exploit the high levels
of redundancy in the configuration bitstream by compression. In
this paper, we propose a
novel conﬁguration compression technique that exploits re-
dundancies both within a conﬁguration’s bitstream as well
as between bitstreams of multiple conﬁgurations. By maxi-
mizing reuse, our results show that the proposed technique
performs 26.5–75.8% better than the previously proposed
techniques. To the best of our knowledge, ours is the ﬁrst
work that performs inter-conﬁguration compression.
INTRODUCTION
Field Programmable Gate Arrays (FPGAs) offer a generic
platform for hardware realization of application speciﬁc al-
gorithms. There is an increasing interest in using them as
alternative computing platforms. FPGAs are particularly
suited for accelerating compute intensive algorithms that can
take advantage of massive hardware parallelism.
Today’s FPGAs are dynamically reconfigurable. At any time
after an FPGA has been powered up, it is possible to suspend
its operations, load in a completely new conﬁguration, and
restart its operation using the newly loaded conﬁguration.
This has the effect of virtualizing the hardware by allowing
one to context switch between hardware implementations
realized on the same set of reconﬁgurable resources. By vir-
tualizing the hardware, a better utilization can be achieved,
especially if the computation involves some form of multi-
tasking. An application that exhibits major phase transitions
in its execution may also be able to use dynamic reconﬁg-
uration to accelerate its operations over its entire execution.
Dynamic reconfiguration also opens up the possibility of having
a system that can actively adapt its hardware to errors,
failures, or changing operating environments.
Computing with FPGAs by means of dynamic reconﬁgu-
ration is not without its challenges. Multi-million gate-
equivalent FPGAs are now available off the shelf. However,
such FPGAs take a considerable amount of time to configure.
This can potentially wipe out any advantages of dynamic re-
conﬁguration. Compression of the conﬁguration bitstream
has been suggested as a means to alleviate conﬁguration la-
tency. A major component of conﬁguration time is the time
it takes to transfer a conﬁguration bitstream to the FPGA.
Reducing the size of the conﬁguration bitstream will reduce
the conﬁguration transfer time.
All techniques of data compression seek to minimize the
amount of data redundancies present. For a given applica-
tion, a subset of operations may be found consistently across
successive conﬁgurations. This translates into conﬁguration
data that may be repeated across multiple bitstreams. The
reconfiguration process of an FPGA provides opportunities to
exploit these redundancies through data reuse between successive
configuration bitstreams. Existing work in configuration
compression has demonstrated effective exploitation of
the high degree of redundancies present within a bitstream.
However, the exploitation of redundancies between succes-
sive conﬁgurations remains largely unexplored.
In this paper, we first propose a new intra-bitstream compression
technique. Our results show that this technique competes
favorably with the best known previous technique [5].
We then extend the technique to take advantage of inter-bitstream
redundancies as well as the partial reconfigurability
supported by modern FPGAs such as the Xilinx Virtex family.
To the best of our knowledge, ours is the first work to do so.
CONFIGURATION ARCHITECTURE
Virtex is a family of partially reconﬁgurable SRAM-based
FPGAs from Xilinx Inc. Each Virtex device comprises
an array of configurable logic blocks (CLBs). Also present
on each device are blocks of random access memory known
as SelectRAMs, input-output blocks (IOBs), clock resources,
programmable routing, and configuration circuitry. The device
is configured by loading a bitstream into configuration
memory comprising SRAM cells, whose outputs control
the FPGA logic and interconnect resources.
The Virtex configuration memory is organized into a series of
one-bit wide frames, each spanning from the top of the array
to the bottom [9]. A frame is the smallest unit of memory
that can be written to or read from the device. Frames are
aggregated to form columns, which are of several types. A
center column conﬁgures the four global clock pins, while
two IOB columns configure the IOBs present on the left and
right sides of the device. CLBs and SelectRAM are represented
by multiple CLB and RAM columns respectively.
Each CLB column is configured by 48 frames. The first and
last 18 bits of each of these frames configure the IOBs located
at the top and bottom of the device, respectively. Each
18-bit group found between the first and last 18 bits in turn
configures part of a CLB row found in that column.
Figure 1. Virtex Configuration and Readback (a frame to be written enters configuration memory through the FDRI; a frame that is read back leaves through the FDRO)
The device is conﬁgured by loading in a conﬁguration bit-
stream, consisting of a series of commands and frame data.
To conﬁgure a frame, the frame address must be provided,
followed by the frame data. The conﬁguration mechanism
consists of a serial-to-parallel register, the Frame Data Input
Register (FDRI) (see Figure 1). Every frame is ﬁrst shifted
into the FDRI before it is written to the conﬁguration mem-
ory. Data is shifted into the bottom of the FDRI. Once the
FDRI is ﬁlled up with a new frame, it is written into mem-
ory. In addition, a frame can be read back by loading it from
conﬁguration memory into a parallel-to-serial register called
the Frame Data Output Register (FDRO).
RELATED WORK
A number of compression techniques have been proposed for
FPGA architectures. The wildcard compression scheme [6]
targets the Xilinx XC6200 series FPGA architecture, which
provides wildcard address registers. By utilizing these
registers, logic cells sharing the same
configuration can be addressed and configured with a single
write operation, speeding up configuration. Despite demonstrating
a compression ratio as low as 25%, this architecture-specific
algorithm exhibits poor extensibility. Furthermore,
in the absence of repetitive structures that give rise to logic
cells sharing common configurations, the regularities that
the algorithm relies upon may not be present. Runlength
encoding techniques have also been proposed for this
type of architecture [7]. On the whole, however, this architecture
is clearly not appropriate for large FPGAs, bringing
into question this entire class of methods.
Conﬁguration cloning [11] exploits regularity and locality in
bitstreams by copying conﬁguration data from one region of
the FPGA to several other regions. Without loading the en-
tire bitstream, the whole FPGA can be conﬁgured, reducing
conﬁguration latency. However, for this method to work, a
large amount of complex circuitry is needed.
It was found that bitstream regularities can be effectively
encoded by dictionary-based methods. Dictionary-based
algorithms depend less on specific features of the FPGA
configuration architecture, providing for greater flexibility.
Configuration compression [5] based on the Lempel-Ziv-Storer-Szymanski
(LZSS) [1] compression scheme has been shown
to be effective for the Xilinx Virtex family of FPGAs. These
algorithms require that an extended version of the FDRI be
used as a sliding window, such that a frame present in the
upper half of the FDRI will act as the dictionary for the next
frame to be configured. By deriving a suitable sequence of
frame configuration, inter- and intra-frame regularities can be
leveraged. This is the best configuration compression method
known to date. Besides LZSS, other dictionary compression
schemes such as the Lempel-Ziv-Welch (LZW) method [4]
and LZ77 [10] were also reported in the literature. However,
they tend to fare less well.
DIFFERENCE VECTOR (DV) COMPRESSION
An analysis of conﬁguration bitstreams reveals a high de-
gree of regularity among the frames conﬁguring the CLB
array. Recall that each CLB column is conﬁgured by a series
of frames. Frames conﬁguring common structures among
different CLBs may share high regularity. Between such
frames, one frame may be converted into another simply by
flipping a few bits. LZSS compression produces good results
in highly regular bitstreams [5]. However, when the lengths
of the matches are small, LZSS compression proves to be less
effective as the savings are offset by the encoding overhead.
An analysis of our benchmark bitstreams reveals that bit-
stream regularities may be too ﬁne-grained to be effectively
captured by the LZSS method. We shall now describe an
algorithm that taps into these regularities thereby achieving
better compression ratio than the LZSS method.
Compression
By encoding the bit differences between frames, we can
avoid loading identical information repeatedly
for similar frames. We first assign a suitable reference
frame to the frame to be compressed, which we call the
beneficiary frame. We then construct a difference vector:
$DV(F_1, F_2) = F_1 \oplus F_2$
where $F_1$ and $F_2$ are the reference and beneficiary frames
respectively and $\oplus$ is the bitwise exclusive OR operator.
Given that the bit-ﬂips obtained between two frames tend to
be either few in number and scattered or clustered in bands,
we observe long sequences of 0s in the difference vector,
with 1s occurring in shorter sequences. Therefore, runlength
encoding (RLE) can be used to effectively compress running
sequences in the difference vectors. In our scheme, we
first determine the runlengths of 0s and 1s before employing
Huffman [8] encoding for the runlengths. Given that the
runlengths of sequences of 1s exhibit vastly different entropies
from those of sequences of 0s (with 1s occurring in short
sequences and 0s in very long sequences), we use separate sets
of encodings for sequences of 1s and 0s.
Decompression
Decompression occurs at runtime. Decoding the runlengths
reveals the bit locations where ﬂips are to be performed to
transform the reference frame into the beneﬁciary one. The
FDRI register in Xilinx Virtex devices can be effectively
used for decompression. Initially, it contains the reference
frame. A decoder circuit is added before the FDRI register
(see Figure 1). As runlength decoding proceeds, the relevant
bits in FDRI are ﬂipped. When all the code words have been
decoded, the FDRI holds the beneficiary
frame, which is then written to the configuration array.
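The compression and decompression steps described above can be sketched as follows. This is a minimal illustration, not the hardware decoder: frames are modelled as Python lists of bits, the function names are hypothetical, and the Huffman coding of the runlengths is omitted.

```python
def difference_vector(reference, beneficiary):
    """Bitwise XOR of two equal-length frames given as lists of 0/1."""
    if len(reference) != len(beneficiary):
        raise ValueError("frames must have the same length")
    return [a ^ b for a, b in zip(reference, beneficiary)]


def run_lengths(bits):
    """Return (first_bit, runs) where runs are the lengths of alternating runs."""
    runs = []
    current, count = bits[0], 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return bits[0], runs


def decompress(reference, first_bit, runs):
    """Flip the reference bits covered by runs of 1s (models the flips in the FDRI)."""
    frame = list(reference)
    pos, bit = 0, first_bit
    for length in runs:
        if bit == 1:
            for i in range(pos, pos + length):
                frame[i] ^= 1
        pos += length
        bit ^= 1  # runs alternate between 0s and 1s
    return frame


# Hypothetical 12-bit frames, far shorter than a real Virtex frame.
reference = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
beneficiary = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
dv = difference_vector(reference, beneficiary)
first_bit, runs = run_lengths(dv)
restored = decompress(reference, first_bit, runs)
```

In a full implementation, the runs of 0s and the runs of 1s would each be Huffman coded with their own table, as described in the text.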
INTRA-BITSTREAM COMPRESSION
The similarity between the reference frame and the beneﬁ-
ciary frame is the key to achieving good compression ratio.
Both LZSS and DV compression require a suitable configuration
sequence of frames such that similar frames are next to
each other, thereby improving the compression ratio. We use
the algorithms presented in [5] to generate the conﬁguration
sequence for LZSS compression and modify it suitably to
adapt it to DV compression.
Inter-frame Regularity Graph (IRG)
An inter-frame regularity graph (IRG) [5] is a directed graph
where each frame is represented by a node. The weight of a
directed edge $u \to v$ between two nodes $u$ and $v$ represents
the size of the compressed encoding of $v$ when $u$ is treated as
the reference frame. To discover inter-frame regularities, each
frame is used as a reference frame and every other frame is
treated as a beneficiary frame, i.e., the IRG is a complete graph.
In the case of LZSS compression, it is easy to find the size of the
compressed encoding for the beneficiary frame. However, for DV
compression, the Huffman codes for runlengths are designed
for the entire bitstream instead of individual frames. That is,
the encoding size depends on the configuration sequence.
Therefore, we use the number of $0 \to 1$ and $1 \to 0$ transitions
in the difference vector between frame $u$ and frame $v$ as the
edge weight of $u \to v$. We expect that minimizing the number of
transitions will minimize the length of the encoding. As an added
advantage, minimizing the number of runlengths reduces
decompression time by reducing the number of Huffman table
lookups.
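A minimal sketch of the transition-count edge weight for DV compression is given below. The function names and the three 8-bit frames are hypothetical, and frames are again modelled as lists of bits. Because XOR is symmetric, this particular weight is the same in both directions, unlike an LZSS encoding size.

```python
def transition_count(reference, beneficiary):
    """Number of 0->1 and 1->0 transitions in the difference vector."""
    dv = [a ^ b for a, b in zip(reference, beneficiary)]
    return sum(1 for prev, cur in zip(dv, dv[1:]) if prev != cur)


def build_irg(frames):
    """Complete directed graph: weights[(u, v)] is the cost of encoding v from u."""
    n = len(frames)
    return {
        (u, v): transition_count(frames[u], frames[v])
        for u in range(n)
        for v in range(n)
        if u != v
    }


# Hypothetical frames for illustration only.
frames = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 0],
]
irg = build_irg(frames)
```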
Given the IRG, we present two schemes for generating con-
ﬁguration sequence: one that does not require any readback
of frames, and one that does.
Conﬁguration WithOut Readback (CWOR)
The optimal configuration sequence that does not have readback
corresponds to the shortest Hamiltonian path (i.e., a path
that visits each vertex exactly once) in the IRG. A greedy
algorithm is presented in [5] that starts with the minimum cost
edge and expands both of its ends to cover all the vertices. This
algorithm generates close to optimal results. For LZSS
compression, this scheme requires extension of the FDRI register.
However, DV compression requires minimum modification
to the existing configuration architecture. Only a decompression
unit is added before the FDRI register. Let $\langle f_1, \ldots, f_n \rangle$
be the configuration sequence of frames. $f_{i-1}$ acts as the
reference frame for $f_i$. Before $f_i$ is decoded, its reference frame
$f_{i-1}$ is present in the FDRI register as it was the last configured
frame. Therefore, the FDRI register can be used as the reference
frame and its bits are flipped according to $f_i$'s encoding.
The modified FDRI register represents the beneficiary frame
$f_i$ and is written to the configuration array. The FDRI will now
be used as the reference frame to decompress the frame $f_{i+1}$.
Figure 2. Original and Transformed MST (nodes A to H; the original MST is annotated with readbacks R1, R2, and R3, the transformed MST with R1 and R2)
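The greedy construction of a CWOR sequence can be sketched as below. This is one plausible reading of "start with the minimum cost edge and expand both ends", not the exact procedure of [5]; the function name, the tie-breaking, and the edge weights are assumptions made for illustration.

```python
def greedy_sequence(weights, n):
    """Greedy Hamiltonian path over nodes 0..n-1.

    weights[(u, v)] is the cost of encoding frame v with frame u as reference.
    """
    if n == 1:
        return [0]
    u, v = min(weights, key=weights.get)  # cheapest edge seeds the path
    path = [u, v]
    unvisited = set(range(n)) - {u, v}
    while unvisited:
        head, tail = path[0], path[-1]
        best = None
        for w in sorted(unvisited):
            # Either append w after the tail or prepend it before the head.
            for cost, side in ((weights[(tail, w)], "tail"),
                               (weights[(w, head)], "head")):
                if best is None or cost < best[0]:
                    best = (cost, side, w)
        _, side, w = best
        if side == "tail":
            path.append(w)
        else:
            path.insert(0, w)
        unvisited.remove(w)
    return path


# Hypothetical symmetric weights on four frames.
sym = {(0, 1): 5, (0, 2): 9, (0, 3): 7, (1, 2): 2, (1, 3): 8, (2, 3): 3}
weights = {}
for (a, b), c in sym.items():
    weights[(a, b)] = c
    weights[(b, a)] = c

sequence = greedy_sequence(weights, 4)
total = sum(weights[(a, b)] for a, b in zip(sequence, sequence[1:]))
```

Each frame in the resulting sequence uses its predecessor as the reference frame, so no readback is needed.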
Conﬁguration with Readback (CWR)
The configuration sequence without readback is too restrictive
to take full advantage of the similarity among frames.
Therefore, Li and Hauck [5] proposed another scheme in which
already configured frames may be read back from the
FPGA and used as reference frames. This corresponds
to finding the directed minimum spanning tree (MST) [3] in
the IRG. An edge $u \to v$ in the tree indicates that $u$ should
be used as the reference frame for $v$. The configuration
sequence is generated by a pre-order traversal of the MST. Li
and Hauck [5] propose to read the frame back into the FDRI
register itself. Instead, we use the existing readback infrastructure
to read frames from the configuration array into the FDRO register
(see Figure 1). The contents of the FDRO register are
transferred to the FDRI register and then used as the reference
frame as before.
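The pre-order traversal can be sketched as follows. The tree below is hypothetical and is given directly as a child map rather than computed as a directed MST. The readback count uses a simplified model in which a readback is needed whenever the reference frame is not the frame configured immediately before; it does not model any reuse of a frame already held in the FDRO.

```python
def preorder_sequence(children, root):
    """Return [(frame, reference)] in pre-order; the root has no reference."""
    sequence = []
    stack = [(root, None)]
    while stack:
        node, reference = stack.pop()
        sequence.append((node, reference))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(children.get(node, [])):
            stack.append((child, node))
    return sequence


def count_readbacks(sequence):
    """Count frames whose reference is not the previously configured frame."""
    readbacks = 0
    previous = None
    for frame, reference in sequence:
        if reference is not None and reference != previous:
            readbacks += 1
        previous = frame
    return readbacks


# Hypothetical tree: an edge parent -> child means parent is the reference frame.
tree = {"A": ["B", "D"], "B": ["C"], "D": ["E", "F"]}
sequence = preorder_sequence(tree, "A")
readbacks = count_readbacks(sequence)
```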
Once we read back a frame into the FDRO register, it remains
there until another readback command is executed. Therefore,
it is possible to reuse the FDRO register. Let $v$ be a node in
the MST with more than one child $c_1, \ldots, c_n$ and suppose
that none of the subtrees rooted at the child nodes is a chain
(simple path). Then the configuration sequence generated by
pre-order traversal requires $n-1$ readbacks of $v$ to configure
$c_2, \ldots, c_n$. This is because frame $v$ in the FDRO is overwritten
due to readbacks during the subtree traversal of a child.
However, if the subtree at child $c_i$ is a chain, then the FDRO
register will still contain frame $v$ after the frames in the subtree
of $c_i$ have been configured. $c_{i+1}$ can simply use the FDRO
register, thereby avoiding the readback of $v$. We exploit this
fact by transforming the MST such that at any node, the left
children are those with “chain” subtrees as shown in Figure 2.
The original MST requires three readbacks (R1, R2,
R3) while the transformed one (obtained by moving the
subtree rooted at G to the left) requires only two readbacks.
INTER-BITSTREAM COMPRESSION
By utilizing intra-bitstream regularities, the algorithms presented
so far are effective for single bitstream compression. However,
they do not take into account the inter-bitstream regularities
inherent across multiple bitstreams in reconfigurable
systems. Regularities between configuration bitstreams may
arise for the following reasons, and can subsequently be
exploited.
Static Kernels. The implementation of an application based
on a reconfigurable computing model may require several
runtime reconfigurations of its FPGA component. Each
configuration may consist of multiple independent computational
kernels, each implementing a different part of the
application. Configurations may share static kernels (modular
circuits maintained in the same position across multiple
configurations). This occurs when the application requires
some kernels persistently, whereas it dynamically swaps
non-persistent kernels in and out of the remaining fabric.
Without partial reconfiguration, the whole FPGA has to
be reconfigured every time any part of the fabric changes.
Instead of loading in an entire bitstream, the partial
reconfigurability of the Virtex FPGA allows us to load in a partial
bitstream only for the part of the device that requires changes.
Chip Utilization. Due to the highly flexible nature of the
FPGA interconnect, only a small proportion of the configuration
bits in the bitstream for a given circuit may be important.
Hence, only these bits in the configuration present on-chip
may differ from those of the incoming bitstream.
In this section, we will ﬁrst explore the feasibility of using
partial reconﬁguration to speed up dynamic reconﬁguration,
before proposing algorithms to apply Difference Vector and
LZSS compression in an inter-bitstream paradigm.
Partial Reconﬁguration
With the partial reconfiguration capability present in Virtex
FPGAs, configuration time can be greatly reduced by transferring
only those portions of the next bitstream that differ
from the current bitstream already in the FPGA, with no
additional hardware overhead. We have implemented a tool
targeting the Xilinx Virtex XCV1000 FPGA on the Celoxica
RC1000 board that performs partial reconfiguration by
detecting the differences between two bitstreams and loading
only these differences onto the FPGA during reconfiguration.
Our experiments showed that reconfiguration time increases
nearly proportionally with the amount of data transferred,
with negligible latency in addressing the changed portions of
the bitstream.
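The difference detection at the heart of such a tool can be sketched as follows. This is not the tool described above: the function names are hypothetical, each frame is modelled as a `bytes` object, and the Virtex command and addressing overhead is ignored.

```python
def changed_frames(old_frames, new_frames):
    """Indices of frames whose contents differ between two configurations."""
    if len(old_frames) != len(new_frames):
        raise ValueError("both configurations must target the same device")
    return [i for i, (old, new) in enumerate(zip(old_frames, new_frames))
            if old != new]


def partial_bitstream(old_frames, new_frames):
    """(frame address, frame data) pairs that must be loaded onto the device."""
    return [(i, new_frames[i]) for i in changed_frames(old_frames, new_frames)]


# Hypothetical two-byte frames; a real frame is much longer.
old = [b"\x00\x0f", b"\xaa\xaa", b"\xff\x00"]
new = [b"\x00\x0f", b"\xab\xaa", b"\xff\x00"]
payload = partial_bitstream(old, new)
transferred = sum(len(data) for _, data in payload)
```

Since a frame is the smallest unit that can be written, a single changed bit forces the whole frame to be transferred.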
One major drawback of partial reconfiguration is its reliance
on the appropriate placement of kernels. As mentioned
previously, the smallest configuration unit is one frame, which
spans the configuration array vertically. The kernel layout
for Benchmark5 in Figure 7 allows for partial reconfiguration
of the static idct kernel, whereas the static smalldes kernel
for Benchmark2 in Figure 7 will not offer any opportunity
for partial reconfiguration as all the frames are modified.
The physical constraints and IO requirements for a kernel
may render a layout facilitating partial reconfiguration
difficult. On the other hand, by detecting inter-bitstream
regularities that lie within a section of a frame, the regularity
brought about by static circuitry spanning all the frames can
be captured by DV and LZSS compression.
Compression
DV and LZSS compression have been shown to be effective
on standalone bitstreams. Given a reconfigurable computing
model, these techniques may be extended to exploit
regularities spanning across bitstreams.
The configuration memory on the FPGA acts as a natural
cache for storing frames in the previous configuration, which
we shall call the old frames. These frames may be very
similar to the new frames, i.e., the frames of the incoming
configuration to be compressed. Just as intra-bitstream
compression exploits similarities among new frames,
inter-bitstream compression can read back old frames and use
them to compress the new frames. The complication here is that
the reference frames from the old bitstream cannot be
overwritten before they are read back for reuse. We have extended
both the intra-bitstream CWOR and CWR techniques to approximate
optimal configuration sequences that maximize the use of
reference frames within the same bitstream and across successive
bitstreams.
Inter-frame Regularity Graph (IRG)
We define a variant of the IRG previously utilized
in the LZSS and DV schemes for standalone bitstreams. The
IRG in this case is a multi-digraph, i.e., a directed graph
with multiple edges between vertices as well as self-loops.
Each node $u$ has a pair of directed edges Intra($u \to v$) and
Inter($u \to v$) to every other node $v$. The weight of the edge
Intra($u \to v$) is the cost of using $u_{new}$ as a reference frame
for $v_{new}$. Similarly, the weight of Inter($u \to v$) is the cost
of using $u_{old}$ as the reference frame for $v_{new}$. In addition,
each node $u$ has a self-loop Inter($u \to u$) annotated with
the cost of using $u_{old}$ as the reference frame for $u_{new}$. We
now define some special nodes in the IRG.
Retained Nodes. A node $u$ is a retained node if the content
of $u_{new}$ is the same as the content of $u_{old}$. These frames do
not need to be reconfigured.
Self-Referenced Nodes. A node $u$ is a self-referenced node
if it is not a retained node, but $u_{old}$ is used as the reference
frame for $u_{new}$.
Conﬁguration WithOut Readback (CWOR)
The CWOR scheme for intra-bitstream compression does
not require any readback. However, for inter-bitstream
compression, we allow a restricted form of readback. We allow
a frame $u_{old}$ to be read back only for the configuration of
$u_{new}$. This relaxation allows us to exploit scenarios such as
the one depicted in Benchmark2 in Figure 7 involving the static
kernel smalldes. Algorithm 1 is a greedy algorithm for constructing
the CWOR configuration sequence.
Input: Inter-Bitstream IRG $G$
Output: Configuration Sequence $CS$
$CS := \emptyset$;
Delete Retained Nodes from $G$;
Delete edges Inter($u \to v$) if $u \neq v$;
Mark all nodes in $G$ as unvisited;
for all nodes $v \in G$ do
    if Inter($v \to v$) $< \min_{u \in G, u \neq v}$ (Intra($u \to v$)) then
        /* v is a Self-Referenced Node */
        $CS := CS \cup \{\langle v \rangle\}$;
        Mark $v$ as visited;
    end
end
if $CS = \emptyset$ then
    /* There is no self-referenced node */
    Let ($u \to v$) be the minimum cost edge in $G$;
    $CS := \{\langle u, v \rangle\}$;
    Mark $u$ and $v$ as visited;
end
while $\exists$ unvisited nodes in $G$ do
    Let $u \to v$ be the shortest edge in $G$ s.t. $v$ is unvisited
        and $u$ is the tail node of a sequence $S \in CS$;
    if Inter($v \to v$) $<$ Intra($u \to v$) then
        /* v is a Self-Referenced Node */
        $CS := CS \cup \{\langle v \rangle\}$;
    else
        Append node $v$ to $S$;
    end
    Mark $v$ as visited;
end
Algorithm 1: Inter-Bitstream CWOR
The algorithm ﬁnds a conﬁguration sequence (CS) for the
non-retained frames. Note that the retained frames need not
be conﬁgured. The only “Inter” edges considered are of the
form Inter($v \to v$). At any point in time, CS is a set
of sequences (paths) and the algorithm tries to extend them
along the tails with the shortest edges. The final configuration
sequence consists of a set of directed paths. If a sequence
contains a self-referenced node, it will be at the head of the
sequence and its conﬁguration will require a readback.
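Algorithm 1 can be transcribed into a short executable sketch. The data layout is an assumption made for illustration: `intra[(u, v)]` holds the weight of Intra(u → v), `inter_self[v]` holds the weight of the self-loop Inter(v → v), and the function name and the example costs are hypothetical.

```python
def inter_bitstream_cwor(nodes, intra, inter_self, retained=()):
    """Greedy CWOR: returns a list of sequences (paths) of frame indices."""
    skip = set(retained)
    nodes = [v for v in nodes if v not in skip]  # retained frames are not configured
    visited = set()
    sequences = []
    # Nodes whose old frame beats every new frame start their own sequence.
    for v in nodes:
        best_intra = min((intra[(u, v)] for u in nodes if u != v),
                         default=float("inf"))
        if inter_self[v] < best_intra:
            sequences.append([v])
            visited.add(v)
    # No self-referenced node: seed with the cheapest Intra edge.
    if not sequences and nodes:
        u, v = min(((u, v) for u in nodes for v in nodes if u != v),
                   key=lambda e: intra[e])
        sequences.append([u, v])
        visited.update((u, v))
    # Extend the sequences along their tails with the shortest edges.
    while len(visited) < len(nodes):
        best = None
        for seq in sequences:
            tail = seq[-1]
            for v in nodes:
                if v not in visited:
                    cost = intra[(tail, v)]
                    if best is None or cost < best[0]:
                        best = (cost, seq, v)
        cost, seq, v = best
        if inter_self[v] < cost:
            sequences.append([v])  # v becomes a self-referenced node
        else:
            seq.append(v)
        visited.add(v)
    return sequences


# Hypothetical costs on four non-retained frames.
sym = {(0, 1): 4, (0, 2): 6, (0, 3): 9, (1, 2): 3, (1, 3): 7, (2, 3): 9}
intra = {}
for (a, b), c in sym.items():
    intra[(a, b)] = c
    intra[(b, a)] = c
inter_self = {0: 2, 1: 5, 2: 6, 3: 8}

result = inter_bitstream_cwor([0, 1, 2, 3], intra, inter_self)
```

In this example frame 0 is self-referenced from the start, while frame 3 becomes self-referenced only inside the loop, because the edge from the current tail (frame 2) costs more than reading back its own old frame.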
Conﬁguration With Readback (CWR)
The aim of the inter-bitstream CWR algorithm is to find a
configuration sequence that pairs up each new frame with the
best possible reference frame. The reference frames can be
both old and new frames. However, we need to make sure that
an old reference frame is not overwritten before it is used by
the corresponding beneficiary frame. The algorithm consists
of two phases. The first phase pairs up each non-retained
new frame with a reference frame. This phase generates a
set of directed trees. The second phase computes an efficient
traversal order of these trees to minimize the number of
readbacks.
Input: Rooted Tree $T$
Output: Configuration Sequence
mark all nodes as unvisited;
for all nodes $v \in T$ in reverse level traversal order do
    current := $v$;
    if current is visited then
        continue;
    end
    while (current has incident “Inter” edge or is self-referenced) do
        /* configure current */
        mark current as visited;
        if current is self-referenced then
            parent := $\emptyset$;
            compress current w.r.t. current$_{old}$;
        else
            parent := parent of current;
            compress current w.r.t. parent$_{old}$;
        end
        /* Traverse current's subtree */
        PREORDER(sub-tree rooted at current);
        if (parent $\neq \emptyset$ and parent does not have unvisited
                child with “Inter” edge) then
            current := parent;
            continue;
        else
            break;
        end
    end
end
PREORDER($T$);
Algorithm 2: Inter-Bitstream CWR
Generation of Trees. First, we mark the retained nodes and
remove all their incoming edges from the IRG. All the outgoing
“Intra” edges of the retained nodes are also removed.
Similarly, we find all the self-referenced nodes $v$ in the IRG $G$ s.t.
Inter($v \to v$) $< \min_{u \in G, u \neq v}$ (Intra($u \to v$)). Again, we
remove all the incoming edges to the self-referenced nodes.
For the remaining set of nodes $V$ in $G$, we derive a set of trees
such that each node in $V$ appears exactly once in the trees
with in-degree equal to 1. The goal is to minimize the total
cost of all the edges in the trees. This is achieved through a
modification of the directed minimum spanning tree (MST)
algorithm [3]. The basic idea is to start with the set of best
pairings of nodes and then eliminate cycles, if any, one by
one.
Benchmark   Source      Device    Utilization
rsa         OpenCores   XCV100    56%
idct        Xilinx      XCV150    93%
dctidct     Xilinx      XCV300    86%
des         Xilinx      XCV400    76%
tripdes     Xilinx      XCV800    82%
jpeg        OpenCores   XCV1000   96%
Table 1. Benchmarks for Intra-Bitstream Compression
1. For each node other than the retained and self-referenced
ones, select the incoming edge with the minimum cost. Let $S$
be the set of selected edges. If no retained or self-referenced
frames exist, then select an arbitrary node and remove its
incoming edge from $S$.
2. If the set $S$ does not contain any cycle, then $S$ is the set of
trees. Otherwise, continue.
3. For each cycle formed, the nodes in this cycle are collapsed
into a pseudo-node ($k$). For each node $j$ in the cycle, modify
the cost of all its incident edges except for the one that belongs
to the cycle as follows.
$c(i \to k) = c(i \to j) - \left[ c(x(j) \to j) - \min_{l \in G} c(x(l) \to l) \right]$
where $c(i \to j)$ is the cost of the incident edge and $c(x(j) \to j)$
is the cost of the edge in the cycle incident to $j$. Note that
we also consider self loops ($i = j$) as incident edges.
4. Among the modified edges, let $i \to k$ be the edge with
minimum modified cost and let $i \to k$ enter the cycle at node
$j$. If $i = j$, then mark $j$ as a self-referenced node and remove
$x(j) \to j$ from the set $S$. Otherwise, select $i \to k$ as the
incoming edge of the pseudo-node $k$ and replace $x(j) \to j$
with $i \to j$ in the set $S$.
5. Delete all the incident self-loop edges, i.e., delete $i \to k$
if $i$ is a node in the original cycle.
6. Go to step 2 with the contracted graph.
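Steps 1 and 2 of the procedure above can be sketched as follows. The sketch covers only edge selection and cycle detection; the contraction and cost modification of steps 3 to 6 are not implemented, self-loops are ignored, and a single cost map stands in for the separate Intra and Inter edges. The function names and the edge costs are hypothetical; the costs were chosen so that the selected edges match the set S shown on the left of Figure 3.

```python
def select_min_incoming(nodes, cost, roots):
    """Step 1: map each non-root node v to the source u of its cheapest incoming edge."""
    parent = {}
    for v in nodes:
        if v in roots:
            continue
        candidates = [(c, u) for (u, w), c in cost.items() if w == v and u != v]
        parent[v] = min(candidates)[1]
    return parent


def find_cycles(parent):
    """Step 2: return the cycles formed by the selected edges."""
    cycles = []
    state = {}
    for start in parent:
        path = []
        v = start
        # Follow selected edges backwards until a root or a seen node is reached.
        while v in parent and v not in state:
            state[v] = "active"
            path.append(v)
            v = parent[v]
        if state.get(v) == "active":  # walked back onto the current path
            cycles.append(path[path.index(v):])
        for p in path:
            state[p] = "done"
    return cycles


# Hypothetical edge costs; node 1 is chosen as the root.
cost = {(1, 2): 2, (3, 4): 1, (4, 5): 1, (5, 3): 1,
        (1, 3): 4, (1, 6): 3, (2, 3): 6, (6, 5): 5}
parent = select_min_incoming([1, 2, 3, 4, 5, 6], cost, roots={1})
selected = sorted((u, v) for v, u in parent.items())
cycles = find_cycles(parent)
```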
Figure 3 illustrates the tree generation algorithm.
As there is no self-referenced node, node 1 is chosen as the
root. The original set of edges generates a cycle, which is
eliminated in the figure on the right hand side.
The result of the first phase is a set of disjoint trees. Each tree
can have at most one retained or self-referenced node. This
is because these special nodes have in-degree = 0 and all the
other nodes have in-degree = 1. Similarly, by construction,
there can be at most one node with in-degree = 0 that is not
a retained or self-referenced node (see Step 1). The number
of trees in S corresponds to the number of nodes with in-
degree = 0 in S. Each such node is made the root of the
corresponding tree.
Traversal of Trees. The second phase of the algorithm traverses
each rooted tree so as to minimize the number of
readbacks and to ensure that a reference frame is not overwritten
before it is used. Each node $v$ is assigned a depth
value equal to the length of the path from the root to node $v$.
Algorithm 2 details the traversal. The basic idea of the algorithm
is to visit the nodes with incoming “Inter” edges first so
as to make sure that their reference frames are not overwritten.
The algorithm traverses the nodes in reverse level-order,
i.e., it starts with the nodes of highest depth and works its
way towards the root. However, it also attempts to exploit
a frame that has just been configured (present in FDRI) or
has just been read back (present in FDRO). The first case is
taken care of through a PREORDER traversal of the subtree
rooted at a configured node (current in Algorithm 2, which is
present in FDRI). The PREORDER traversal is similar to the
pre-order traversal described for the intra-bitstream CWR scheme.
However, it stops exploring a node further if
it has already been visited. The second case (a frame present
in FDRO) is exploited by the fact that for a parent node (possibly
present in FDRO), we visit all the children with “Inter”
edges before configuring parent (last condition in Algorithm
2). After all the “Inter” edges have been visited, the algorithm
performs another pre-order traversal starting from the root
node to configure the unvisited nodes. Figure 4 shows an
example of a traversal of two trees. The figure on the left
represents the configuration sequence of nodes before the
final PREORDER traversal of the whole tree for unvisited
nodes is performed. The lightly shaded nodes represent the
ones that are covered by PREORDER traversal of subtrees
rooted at configured nodes. The figure on the right shows
the final configuration sequence.
Algorithm 2 generates the order in which the frames should
be configured and the reference frames to be used for their
compression. Given this order, it is easy to see that a readback
will be necessary if the reference frame is not present in the FDRI
or FDRO register. Given the traversal order, a configuration
bitstream is generated that includes readback commands as
well as commands to use the FDRI or FDRO register as reference
frames.
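The generation of readback commands from a traversal order can be sketched as below. The register model is an assumption: the FDRI is taken to hold the most recently configured frame and the FDRO the most recently read-back frame. The command names and the example order are hypothetical and do not correspond to actual Virtex commands.

```python
def plan_commands(steps):
    """steps: list of (frame, reference); reference is ("old", u), ("new", u) or None."""
    fdri = None          # identity of the frame currently held in the FDRI
    fdro = None          # identity of the frame last read back into the FDRO
    configured = set()   # frames already overwritten with their new contents
    commands = []
    for frame, reference in steps:
        if reference is None:
            commands.append(("LOAD_UNCOMPRESSED", frame))
        elif reference == fdri:
            commands.append(("USE_FDRI", frame))
        elif reference == fdro:
            commands.append(("USE_FDRO", frame))
        else:
            kind, u = reference
            if kind == "old" and u in configured:
                raise ValueError(f"old frame {u} was overwritten before use")
            if kind == "new" and u not in configured:
                raise ValueError(f"new frame {u} is not configured yet")
            commands.append(("READBACK", reference))
            fdro = reference
            commands.append(("USE_FDRO", frame))
        configured.add(frame)
        fdri = ("new", frame)
    return commands


# Hypothetical traversal order over five frames.
steps = [
    (3, ("old", 3)),   # self-referenced: needs a readback
    (5, ("new", 3)),   # reference was just configured, so it is in the FDRI
    (1, ("old", 2)),   # needs a readback of old frame 2
    (4, ("old", 2)),   # old frame 2 is still in the FDRO
    (2, ("new", 4)),   # reference was just configured
]
commands = plan_commands(steps)
readbacks = sum(1 for c in commands if c[0] == "READBACK")
```

The check on overwritten old frames mirrors the constraint that an old reference frame must be used before its new contents are written.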
EXPERIMENTAL RESULTS
In this section, we present the experimental evaluation of our
compression. We perform compression using a selection of
circuits typically implemented on FPGAs in the encryption
and image processing application domain (see Table 1).
Intra-Bitstream Compression
Figure 5 shows the intra-bitstream compression ratios for the
DV and LZSS compression. The compression ratio is deﬁned
as the compressed conﬁguration size as a percentage of
the uncompressed conﬁguration size, so a lower ratio indicates
better compression. We included the size of
the Huffman tables in the compressed conﬁguration size for DV.
Figure 5 shows that DV compression performs better than
LZSS irrespective of readbacks. Note that conﬁguration with
readback (CWR) performs better than conﬁguration without
readback (CWOR) for DV compression. Reading back similar
data allows full exploitation of inter-frame regularities.
However, LZSS fails to exploit ﬁne-grained bitstream regularities
as effectively as DV compression. Therefore, the
slight reduction in frame encoding size is insufﬁcient to offset
the frame addressing overheads involved in readbacks.
This led to the CWOR scheme performing better than the
CWR scheme under the LZSS compression.
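The compression ratio deﬁnition and the readback trade-off described above can be written down as a short sketch. The function names and all numbers are hypothetical and serve only as an illustration; they are not measurements from the experiments.

```python
def compression_ratio(compressed_bits: int, uncompressed_bits: int,
                      table_bits: int = 0) -> float:
    """Compressed size (plus any Huffman table) as a percentage of the original."""
    return 100.0 * (compressed_bits + table_bits) / uncompressed_bits


def readback_pays_off(encoding_bits_saved: int, addressing_overhead_bits: int) -> bool:
    """A readback helps only if the encoding saving exceeds its addressing cost."""
    return encoding_bits_saved > addressing_overhead_bits


# Hypothetical sizes: 3000 bits of encoded frames plus a 1000-bit Huffman
# table against a 10000-bit uncompressed configuration.
print(compression_ratio(3000, 10000, table_bits=1000))  # 40.0

# Hypothetical case in which a small saving does not cover the addressing cost.
print(readback_pays_off(encoding_bits_saved=20, addressing_overhead_bits=32))  # False
```

The second function captures the reason given for LZSS: when the reduction in frame encoding size is small, the addressing overhead of the readback dominates and CWOR comes out ahead.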
Figure 3. Tree Generation for Inter-Bitstream CWR
(The two panels list the selected edges
S = [(1,2), (3,4), (4,5), (5,3), (1,6)] and
S = [(1,2), (1,3), (3,4), (4,5), (1,6)].)
Figure 4. Tree Traversal for Inter-Bitstream CWR
(Legend: Intra edge, Inter edge, Self-Referenced Node.)
Figure 5. Intra-bitstream Compression with DV and LZSS
(Y-axis: Compression Ratio (%), 0 to 100. Circuits: rsa, idct,
dct+idct, des, tripledes, jpeg. Series: DV CWR, LZSS CWR,
DV CWOR, LZSS CWOR.)
Although DV compression yields better results, some issues may arise
during its actual hardware implementation.
First, on-chip memory may not be sufﬁcient
for the Huffman look-up tables, although our experiments have
shown the largest of these tables to be less than 3 KB.
Second, traversing the Huffman trees to retrieve the Huffman-encoded
run lengths may be a potential bottleneck in the decompression
process. Speedups can be achieved by an N-level implementation
of the look-up tables, which trades off memory
requirements against speed of decompression, depending on
the number of levels used [12].
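To make the memory versus speed trade-off concrete, the following is a minimal sketch of the single-level extreme of table-driven Huffman decoding: one look-up per symbol, at the cost of a table with 2^w entries for a maximum code length of w bits. It is a generic illustration rather than the scheme of [12] or our decompressor; an N-level design would split this table into smaller sub-tables. The code table and the names `build_table` and `decode` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_table(codes: Dict[int, str]) -> Tuple[List[Tuple[int, int]], int]:
    """Build a flat look-up table indexed by the next `width` bits of input."""
    width = max(len(code) for code in codes.values())
    table: List[Tuple[int, int]] = [(-1, 0)] * (1 << width)
    for symbol, code in codes.items():
        pad = width - len(code)
        base = int(code, 2) << pad
        # Every bit pattern that starts with this code maps to the same symbol.
        for i in range(1 << pad):
            table[base + i] = (symbol, len(code))
    return table, width


def decode(bits: str, codes: Dict[int, str]) -> List[int]:
    """Decode a string of '0'/'1' characters with one table look-up per symbol."""
    table, width = build_table(codes)
    out: List[int] = []
    pos = 0
    while pos < len(bits):
        window = bits[pos:pos + width].ljust(width, "0")
        symbol, length = table[int(window, 2)]
        if length == 0 or pos + length > len(bits):
            raise ValueError("invalid or truncated code")
        out.append(symbol)
        pos += length
    return out


# Hypothetical prefix code mapping run lengths 0..3 to bit strings.
codes = {0: "0", 1: "10", 2: "110", 3: "111"}
print(decode("0101100111", codes))  # [0, 1, 2, 0, 3]
```

With a maximum code length of 3 bits the table has only 8 entries; for long codes the table grows exponentially, which is the memory cost that a multi-level table reduces in exchange for additional look-ups.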
Inter-Bitstream Compression
To investigate the inter-bitstream DV and LZSS compression,
we use benchmarks consisting of a pair of bitstreams (O, N).
For each new bitstream N, we obtain the intra-bitstream
compression as well as the inter-bitstream compression using
O as the old bitstream. The characteristics of the benchmarks
are shown in Figure 7. For each benchmark, the left one is the
old conﬁguration and the right one is the new conﬁguration.
The device used is mentioned next to the benchmark and
device utilization is given at the bottom of each benchmark.
Benchmarks 2 to 5 consist of bitstreams demonstrating the high
correlation typical of reconﬁgurable systems, with O and N sharing
static kernels. Each kernel was synthesized independently
and then hand mapped onto the assembled design to
ensure that the kernels do not overlap with each other before
being passed through place and route tools. Different
kernel placements were also experimented with to investigate the
effect of performing inter-bitstream compression with varying
degrees of partial reconﬁguration enabled.
Figure 6 indicates that inter-bitstream compression consis-
tently outperforms intra-bitstream compression. The best
compression ratio achieved with any inter-bitstream com-
pression is 26.5%-75.8% better than the best compression
ratio achieved with any intra-bitstream compression for our
benchmarks. Moreover, DV maintains better performance
than LZSS over both the inter- and intra-bitstream compression
models. However, note that the advantage of DV over
LZSS is slightly smaller in the inter-bitstream model than
in the intra-bitstream model. Static kernels offer longer
symbol matches between reference and beneﬁciary frames,
facilitating LZSS compression.
Figure 6. Compression Ratio for Inter-Bitstream DV and LZSS Schemes
(Y-axis: Compression Ratio (%), 0 to 100. X-axis: Benchmark1 to
Benchmark5. Series: Intra DV Readback, Inter DV Readback,
Intra DV w/o Readback, Inter DV w/o Readback, Intra LZSS Readback,
Inter LZSS Readback, Intra LZSS w/o Readback, Inter LZSS w/o Readback.)
Figure 7. Benchmarks for Inter-Bitstream Compression.
(Devices: Benchmark1 (XCV800), Benchmark2 (XCV100), Benchmark3
(XCV1000), Benchmark4 (XCV400), Benchmark5 (XCV600). Kernels shown:
des, tripledes, idct, dct, smallDes, fastDes_1, fastDes_2, rsa, mult,
mult_2, huffman, quantizer. Legend: static kernels, non-static kernels.
Device utilization U per conﬁguration ranges from 41% to 89%.)
Inter-bitstream CWOR performs better than the inter-bitstream
CWR for both DV and LZSS compression. This is because
inter-bitstream compression is enhanced by the regularities offered
by static kernels, giving rise to the predominance of
readbacks in which each frame f_new uses f_old as the reference
frame. Inter-bitstream CWOR achieves this without
having to mention the address of the readback frame. Inter-bitstream
CWR, on the other hand, caters for readback of
any frame, and in doing so it has to specify the address of the
readback frame, thereby incurring extra overhead.
The amount of inter-bitstream regularity present relies heavily
on the amount of static conﬁguration data and the placement
of static kernels [2]. For Benchmark1, which has no static
kernel, the inter-bitstream techniques perform only marginally better
than the corresponding intra-bitstream techniques, taking
advantage only of the random regularities. Benchmarks
2 to 4 each maintain static kernels aligned horizontally
across the device. A signiﬁcant reduction in compression ratios
is achieved with inter-bitstream compression over intra-bitstream
compression. The reduction is not proportional to
the static kernel size, as it depends on the compression
efﬁciency of the intra-bitstream compression. Using partial
reconﬁguration combined with DV or LZSS inter-bitstream
compression, the compression efﬁciency can be improved
signiﬁcantly, as demonstrated for Benchmark5.
CONCLUSION
In this paper, we proposed a class of conﬁguration compression
techniques targeted at Xilinx Virtex devices. Our
novel compression technique for a single conﬁguration bitstream
performs better than the best reported technique in the
literature. We then extended the scheme to inter-conﬁguration
compression by exploiting the potential for data reuse between
successive conﬁgurations. As far as we know, this is
the ﬁrst such algorithm proposed, and it outperforms
intra-conﬁguration compression by a large margin.
ACKNOWLEDGMENTS
This work was partially supported by A*STAR Project 022-
106-0043.
REFERENCES
[1] J. A. Storer and T. G. Szymanski. Data compression via textual
substitution. Journal of the ACM, 29, 1982.
[2] K. Bazargan, R. Kastner, and M. Sarrafzadeh. 3-D ﬂoorplanning:
Simulated annealing and greedy placement methods for
reconﬁgurable computing systems. Design Automation for
Embedded Systems, April 2000.
[3] Y. J. Chu and T. H. Liu. On the shortest arborescence of a directed
graph. Science Sinica, 14, 1965.
[4] A. Dandalis and V. Prasanna. Conﬁguration compression for
FPGA-based embedded systems. In FPGA, 2001.
[5] S. Hauck and Z. Li. Conﬁguration compression for Virtex FPGAs. In
FCCM, 2001.
[6] S. Hauck, Z. Li, and E. Schwabe. Conﬁguration compression for
Xilinx 6200 FPGA. IEEE TCAD, 18(8), 1999.
[7] S. Hauck and W. Wilson. Runlength compression techniques for
FPGA conﬁguration. In FCCM, 1999.
[8] D. A. Huffman. A method for the construction of minimum
redundancy codes. Proceedings of the Institute of Radio Engineers,
40, 1952.
[9] Xilinx Inc. Virtex Series Conﬁguration Architecture User Guide,
2000.
[10] A. Khu. Xilinx FPGA Conﬁguration Data Compression and
Decompression, 2001.
[11] S. R. Park and W. Burleson. Conﬁguration cloning: Exploiting
regularity in dynamic DSP architecture. In FPGA, 1999.
[12] S. D. Warhade. Implementation approaches to Huffman decoding.
Technical report, Wipro Technologies, 2002.