FPGAs are a promising technology for developing high-performance embedded systems. The density and performance of FPGAs have drastically improved over the past few years. Consequently, the size of the configuration bit-streams has also increased considerably. As a result, the cost-effectiveness of FPGA-based embedded systems is significantly affected by the memory required for storing various FPGA configurations. This paper proposes a novel compression technique that reduces the memory required for storing FPGA configurations and results in high decompression efficiency. Decompression efficiency corresponds to the decompression hardware cost as well as the decompression rate. The proposed technique is applicable to any SRAM-based FPGA device since configuration bit-streams are processed as raw data. The required decompression hardware is simple and is independent of the individual semantics of configuration bit-streams or specific features of the on-chip configuration mechanism. Moreover, the time to configure the device is not affected by our compression technique. Using our technique, we demonstrate up to 41% savings in memory for configuration bit-streams of several real-world applications.
INTRODUCTION
The enormous growth of embedded applications has made embedded systems an essential component in products that emerge in almost every aspect of our life: digital TVs, game consoles, network routers, cellular base-stations, digital communication devices, printers, digital copiers, multifunctional equipment, house appliances, etc. For example, from 200 million units shipped in the year 1997, the DSP embedded systems market has been forecast to grow to 1,200 million units in the year 2001 [13]. The goal of embedded systems is to perform a set of specific tasks to improve the functionality of larger systems. As a result, they are usually not visible to the end-user since they are embedded in larger systems. Embedded systems usually consist of a processing unit, memory to store data and programs, and an I/O interface to communicate with other components of the larger system. Their complexity depends on the complexity of the tasks they perform. The main characteristics of an embedded system are raw computational power and cost-effectiveness. The cost-effectiveness of an embedded system includes characteristics such as product lifetime, overall price, and power consumption, among others.

This work is supported by the DARPA Adaptive Computing Systems program under contract no. DABT63-99-1-0004 monitored by Fort Huachuca and in part by the National Science Foundation under grant no. CCR-9900613.
The unique combination of hardware-like performance with software-like flexibility makes FPGAs a highly promising solution for embedded systems. Typical FPGA-based embedded systems have FPGA devices as their processing unit, memory to store data and FPGA configurations, and an I/O interface to transmit and receive data. FPGA-based embedded systems can sustain high processing rates while providing the high degree of flexibility required in dynamically changing environments. FPGAs can be reconfigured on demand to support multiple algorithms and standards. Thus, the degree of system flexibility strongly depends on the amount of configuration data that can be stored in the field. However, the size of the configuration bit-stream has increased considerably over the past few years. For example, the size of the configuration bit-stream of the VIRTEX series FPGAs ranges from 0.6 Mbits to 16 Mbits [14]. As a result, storing configuration bit-streams in an FPGA-based embedded system becomes a critical problem that drastically affects the cost-effectiveness of the system.
In this paper, we propose a novel compression technique to reduce the memory requirements for storing configuration bit-streams in FPGA-based embedded systems. By compressing configuration bit-streams, significant savings in memory requirements can be achieved. The configuration compression occurs off-line. At runtime, decompression occurs and the decompressed data is fed to the on-chip configuration mechanism to configure the device. The major performance requirements of the compression problem are the decompression hardware cost and the decompression rate. The above requirements distinguish our compression problem from conventional software-based applications. We are not aware of any prior work that addresses the configuration compression problem of FPGA-based embedded systems with respect to the cost and speed requirements.
Our compression technique is applicable to any SRAM-based FPGA device since it does not depend on specific features of the configuration mechanism. The configuration bit-streams are processed as raw data without considering individual semantics. As a result, both complete and partial configuration schemes can be supported. The required decompression hardware is simple and independent of the configuration format or characteristics of the configuration mechanism. In addition, the achieved compression ratio is independent of the decompression hardware and depends only on the entropy of the configuration bit-stream. Finally, the time to configure an FPGA depends only on the data rate of the on-chip configuration mechanism, the speed of the memory that stores the configuration data, and the size of the configuration bit-stream. The decompression process does not add any overhead to the configuration time.
The proposed compression technique is based on the principles of dictionary-based compression algorithms. Even though statistical methods can achieve higher compression ratios, we propose a dictionary-based approach because statistical methods lead to high decompression hardware cost. The dictionary corresponds to configuration data that is stored in the memory. In our scheme, the dictionary is derived based on the well-known LZW compression algorithm [11]. However, a major deviation from LZW-based algorithms is the calculation of the compression ratio. Our compression technique proposes a novel way of constructing the dictionary to significantly improve the compression ratio. In addition, our technique delivers the decompressed data in order. In contrast, in conventional LZW-based algorithms, the decompressed data is delivered in reverse order and the original data is reconstructed by using a stack. The latter significantly affects the decompression rate of a hardware implementation.
Using our technique, we demonstrated 11-41% savings in memory for configuration bit-streams of several real-world applications. The configuration bit-streams corresponded to cryptographic and digital signal processing algorithms. Our target architecture was VIRTEX series FPGAs [14]. The size of the configuration bit-streams ranged from 1.7 Mbits to 6.1 Mbits.
An overview of the configuration of SRAM-based FPGAs is given in Section 2. In Section 3, various aspects of compression techniques and the constraints imposed by embedded systems are presented. Our novel compression technique is described in Section 4. Experimental results are demonstrated in Section 5 and related work is described in Section 6. Finally, in Section 7, possible extensions to our work are described.
FPGA CONFIGURATION
An FPGA configuration determines the functionality of the FPGA device. An FPGA device is configured by loading a configuration bit-stream into its internal configuration memory. An internal controller manages the configuration memory as well as the configuration data transfer via the I/O interface. Throughout this paper, we refer to both the configuration memory and its controller as the configuration mechanism. Based on the technology of the internal configuration memory, FPGAs can be permanently configured once or can be reconfigured in the field. For example, Anti-Fuse technology allows one-time programmability while SRAM technology allows reprogrammability.
In this paper, we focus on SRAM-based FPGAs. In SRAM-based FPGAs, the contents of the internal configuration memory are reset after power-up. As a result, the internal configuration memory cannot be used for storing configuration data permanently. Using partial configuration, only a part of the contents of the internal configuration memory is modified. As a result, the configuration time can be significantly reduced compared with the configuration time required for a complete reconfiguration. Moreover, partial configuration can occur at runtime without interrupting the computations that an FPGA performs. SRAM-based FPGAs require external devices to initiate and control the configuration process. Usually, the configuration data is stored in an external memory and an external controller supervises the configuration process.
The time required to configure an FPGA depends on the size of the configuration bit-stream, the clock rate and the operation mode of the configuration mechanism, and the throughput of the external memory that stores the configuration bit-stream. Typical sizes of configuration bit-streams range from 0.6 Mbits to 16 Mbits [1, 2, 14] depending on the density of the device. The clock rate of the configuration mechanism determines the rate at which the configuration data is delivered to the FPGA device. The configuration data can be transferred to the configuration mechanism serially or in parallel. Parallel modes of configuration result in faster configuration time. Typical values of data rates can be as high as 480 Mbits/sec [1, 2, 14]. Thus, the external memory that stores the configuration bit-stream should be able to sustain the data rate of the configuration mechanism. Otherwise, the memory becomes a performance bottleneck and the time to configure the device increases. The latter could be critical for applications where an FPGA is configured on-demand based on run-time parameters.
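As a rough illustration of how these factors combine, the following sketch estimates the configuration time under a simplifying assumption: the time is the bit-stream size divided by the slower of the two rates (configuration mechanism and external memory). The function name, the 240 Mbits/sec memory figure, and the convention that 1 Mbit is 10^6 bits are assumptions made for the example only.

```python
def configuration_time_s(bitstream_bits, config_rate_bps, memory_rate_bps):
    """Estimate the configuration time in seconds.

    Simplifying assumption: the slower of the configuration mechanism
    and the external memory limits the delivery rate.
    """
    effective_rate = min(config_rate_bps, memory_rate_bps)
    return bitstream_bits / effective_rate


MBIT = 1e6  # assumption: 1 Mbit = 10**6 bits

# A 16 Mbit bit-stream with memory that sustains the 480 Mbits/sec rate.
matched = configuration_time_s(16 * MBIT, 480 * MBIT, 480 * MBIT)

# The same bit-stream with a hypothetical memory limited to 240 Mbits/sec.
bottlenecked = configuration_time_s(16 * MBIT, 480 * MBIT, 240 * MBIT)

print(f"memory sustains the data rate: {matched * 1e3:.1f} ms")
print(f"memory is the bottleneck:      {bottlenecked * 1e3:.1f} ms")
```

Under these assumptions, halving the memory throughput doubles the configuration time, from roughly 33 ms to roughly 67 ms.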
Configuration bit-streams consist of data to be stored in the internal configuration memory as well as instructions to the configuration mechanism. The data configures the FPGA architecture, that is, the configurable logic blocks, the interconnection network, the I/O pins, etc. The instructions control the functionality of the configuration mechanism. Typically, instructions are used for initializing the configuration mechanism, synchronizing clock rates, and determining the memory addresses at which the data will be written. The format of a configuration bit-stream depends on the characteristics of the configuration mechanism as well as the characteristics of the FPGA architecture. As a result, the bit-stream format varies among different vendors or even among different FPGA families of the same vendor.
COMPRESSION TECHNIQUES: APPLICABILITY & IMPLEMENTATION COST
Data compression has been extensively studied in the past. Numerous compression algorithms have been proposed to reduce the size of data to be stored or transmitted over a network. The effectiveness of a compression technique is characterized by the achieved compression ratio, that is, the ratio of the size of the compressed data to the size of the original data. However, depending on the application, metrics such as processing rate, implementation cost, and adaptability may become critical performance issues. In this section, we will discuss compression techniques and the requirements to be met for compressing FPGA configurations in FPGA-based embedded systems.
In general, a compression technique can be either lossless or lossy. Lossless compression techniques reconstruct the exact original data after decompression. Lossless techniques are used in applications where any loss of information after decompression is critical. In contrast, lossy compression techniques discard certain information, so the original data is not exactly recovered after decompression. Lossy techniques are primarily used in image, video, and audio applications. For configuration compression, the configuration bit-stream should be reconstructed without loss of any information and thus, a lossless compression technique should be used. Otherwise, the functionality of the FPGA may be altered or, even worse, the FPGA may be damaged.
Lossless compression techniques are based on statistical methods or dictionary-based schemes. For any given data, statistical methods can result in better compression ratios than any dictionary-based scheme [11]. Using statistical methods, a symbol in the original data is encoded with a number of bits that decreases as the probability of its occurrence increases. By encoding the most frequently-occurring symbols with fewer bits than their binary representation requires, the data is compressed. The compression ratio depends on the entropy of the original data as well as the accuracy of the model that is utilized to derive the statistical information of the given data. However, the complexity of the decompression hardware can significantly increase the cost of such an approach. In the context of embedded systems, dedicated decompression hardware (e.g., content-addressable memory) is required to align codewords of different lengths as well as determine the output of a codeword.
In dictionary-based compression schemes, single codewords encode variable-length strings of symbols [11]. The codewords form an index to a phrase dictionary. Decompression occurs by parsing the dictionary with respect to its index. Compression is achieved if the codewords require a smaller number of bits than the strings of symbols that they replace. Contrary to statistical methods, dictionary-based schemes require significantly simpler decompression hardware. Only memory read operations are required during decompression and high decompression rates can be achieved. The latter suggests that, in the context of FPGA-based embedded systems, a dictionary-based scheme would result in fairly low implementation cost.
In Figure 1, a typical architecture of FPGA-based embedded systems is shown. These systems consist of one or more FPGA devices, memory to store data and FPGA configurations, a configuration controller to supervise the configuration process, and an I/O interface to send and receive data. The configurations are compressed off-line by a general-purpose computer and the compressed data is stored in the embedded system. Besides the memory requirements for the compressed data, additional memory may be required during decompression. For example, in LZ-based algorithms [11], the dictionary can be reconstructed on the fly based on the index. As a result, in software-based applications, only the index is stored or transmitted. Thus, only the index is considered in the calculation of the compression ratio. However, in the context of embedded systems, the memory requirements to store the dictionary should also be considered.
At runtime, decompression occurs and the original configuration bit-stream is delivered to the FPGA configuration mechanism. As a result, the decompression hardware cost and the decompression rate become major requirements of the compression problem. The decompression hardware cost may affect the cost of the system. In addition, if the decompression rate cannot sustain the data rate of the configuration mechanism, the time to configure the FPGA will increase.
OUR COMPRESSION TECHNIQUE
Our compression technique is based on the principles of dictionary-based compression algorithms. Even though statistical methods can achieve higher compression ratios [11], we propose a dictionary-based approach because dictionary-based schemes lead to simpler and faster decompression hardware. In our approach, the dictionary corresponds to configuration data and the index corresponds to the way the dictionary is read in order to reconstruct a configuration bit-stream. In Figure 2, an overview of our configuration compression technique is shown. The input configuration bit-stream is read sequentially in the reverse order. Then, the dictionary and the index are derived based on the principles of the well-known LZW compression algorithm [11].
In general, finding a dictionary that results in optimal compression has exponential complexity [11]. By deleting non-referenced nodes and by merging common prefix strings, a compact representation of the dictionary is achieved. Finally, a heuristic is applied that further enhances the dictionary representation and leads to savings in memory. The original configuration bit-stream can be reconstructed by parsing the dictionary with respect to the index in reverse order. The achieved compression ratio is the ratio of the total memory requirements (i.e., dictionary and index) to the size of the bit-stream. In the following, we describe in detail our compression technique as well as the decompression method.
In [6], we have demonstrated preliminary configuration compression results using a dictionary-based approach. However, the approach that is proposed in this paper is significantly different from the one proposed in [6]. In [6], the decompressed strings are delivered in the reverse order. In addition, the heuristic proposed in [6] only deletes all the leaf nodes from all the suffix trees without considering how the index size is affected. In this paper, our goal is instead to delete strings and individual nodes in a bottom-up approach considering the overall savings in memory, that is, both the dictionary and the index memory. As a result, by applying our technique to the configuration bit-streams in [6], the memory requirements for storing the dictionary and the index can be further improved by 6-13% (see Section 5).
Basic LZW Algorithm
The LZW algorithm is an adaptive dictionary encoder, that is, the coding technique of LZW is based on the input data already encoded. The input to the algorithm is a sequence of binary symbols. A symbol can be a single bit or a data word. Symbols are processed sequentially. By combining consecutive symbols, strings are formed. In our case, the input is the configuration bit-stream. Moreover, the bit-length of the symbol determines the way the bit-stream is processed (e.g., bit-by-bit, byte-by-byte). The main idea of LZW is to replace the longest possible string of symbols with a reference to an existing dictionary entry. As a result, the derived index consists of pointers to the dictionary. Initially, the dictionary is preloaded with entries for all the symbols of the input alphabet (Algorithm 1). For example, if the symbol is a byte, the dictionary is preloaded with entries for 0-255. One symbol s is read at a time.
A temporary string S is utilized during compression. If the string Ss is not found in the dictionary, the code for S is added to the index, Ss becomes a new entry to the dictionary, and S is reset to s. The dictionary contains all the previously seen strings. There is no restriction on the size of the dictionary, so more and more phrases are generated as encoding proceeds. If the string Ss is found in the dictionary, S becomes Ss and a new symbol is read.
The procedure terminates when all the input data has been read.
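The procedure described above can be sketched as follows. This is a minimal byte-oriented illustration of textbook LZW, not the authors' Algorithm 1; the function name `lzw_compress` and the use of a Python `dict` keyed by byte strings are assumptions made for the example.

```python
def lzw_compress(data: bytes):
    """Return (dictionary, index) for a byte-oriented LZW pass.

    dictionary maps each phrase (a bytes object) to its code;
    index is the list of codes emitted during compression.
    """
    # Preload the dictionary with one entry per symbol of the alphabet.
    dictionary = {bytes([b]): b for b in range(256)}
    index = []
    S = b""  # temporary string
    for b in data:
        s = bytes([b])
        if S + s in dictionary:
            # Ss is a known phrase: extend S and read the next symbol.
            S = S + s
        else:
            # Ss is new: emit the code for S and add Ss as a new entry.
            index.append(dictionary[S])
            dictionary[S + s] = len(dictionary)
            S = s
    if S:
        # Flush the phrase still held in S when the input is exhausted.
        index.append(dictionary[S])
    return dictionary, index


dictionary, index = lzw_compress(b"ABABABA")
print(index)            # [65, 66, 256, 258]
print(len(dictionary))  # 259
```

In this small run, three phrases ("AB", "BA", "ABA") are added to the 256 preloaded entries, but only two of them are ever referenced by the index.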
In software-based applications, only the index is considered in the calculation of the compression ratio. The main advantage of LZW (and any LZ-based algorithm) is that the dictionary can be reconstructed based on the index. As a result, only the index is stored in a secondary storage medium or transmitted. The dictionary is reconstructed on-line and the extra memory required is provided by the "host". However, in embedded systems, no secondary storage medium is available and the extra required memory has to be considered in the calculation of the compression ratio. Also, note that the dictionary includes phrases that are not referenced by its index. This happens because, as compression proceeds, LZW keeps all the strings that are seen for the first time. This is performed regardless of whether these strings will be referenced or not. This is not a problem in software-based applications since the size of the dictionary is not considered in the calculation of the compression ratio.
Compact Dictionary Construction
In our approach, we propose a compact memory representation for the dictionary. In general, the dictionary is a forest of suffix trees (i.e., one tree for each symbol of the input alphabet). Each string in a tree is stored in the memory as a singly-linked list. The root of a tree is the head of all the lists in that tree. Every entry in the memory consists of a symbol and an address to a prefix string and every string is associated with an entry. A string is read by traversing the corresponding list from the address of its associated memory entry to the head of the list. Furthermore, dictionary entries that are not referenced in the index are deleted and not stored in the memory. An example of our dictionary representation is shown in Figure 3. For illustrative purposes, we consider letters as symbols. The root of the tree is the symbol "C". Each one of the strings "COMPUTE", "COMPUTER", and "COMPUTATION" is associated with a node. Since the string "COMPUT" is a common prefix string, it is only represented once in the memory. In Figure 4, the memory organization for storing the dictionary and the index of the above example is shown. The memory requirements for the dictionary are $n_{dictionary} \times (data_{symbol} + \lceil \log_2 n_{dictionary} \rceil)$ bits, where $n_{dictionary}$ is the number of memory entries of the dictionary and $data_{symbol}$ is the number of bits required to represent a symbol. Similarly, the memory requirements for the index are $n_{index} \times \lceil \log_2 n_{dictionary} \rceil$ bits, where $n_{index}$ is the number of memory entries of the index. From the above example, we notice that during decompression, the decompressed strings are delivered in reverse order. In fact, in software-based implementations [11], a stack is used to deliver each decompressed string in the right order. However, in the considered embedded environment, additional hardware is required to implement the stack. In addition, the size of the stack should be as large as the length of the longest string in the dictionary.
Moreover, the time overhead to reverse the order of the decompressed strings would affect the time to configure the FPGA. In our scheme, to avoid the use of a stack, we derive the dictionary after reversing the order of the configuration bit-stream. During decompression, the configuration bit-stream is reconstructed by parsing the index in the reverse order. In this way, the decompressed strings are delivered in order and the exact original bit-stream is reconstructed. We have performed several experiments to examine the impact of compressing a reverse-ordered configuration bit-stream instead of the original one. Our experiments suggest that the memory requirements for both the dictionary and the index are very close to each other in both cases (i.e., variation less than 1%).
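The memory requirements stated above for the dictionary and the index can be evaluated with a short sketch. The function names and the entry counts in the example are hypothetical; only the two formulas come from the text.

```python
def address_bits(n_dictionary: int) -> int:
    """ceil(log2(n_dictionary)): bits needed to address the dictionary."""
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1
    # and avoids floating-point rounding for large n.
    return max(n_dictionary - 1, 0).bit_length()


def memory_bits(n_dictionary: int, n_index: int, data_symbol: int = 8):
    """Return (dictionary_bits, index_bits).

    Each dictionary entry holds a symbol plus the address of its prefix;
    each index entry holds one dictionary address.
    """
    a = address_bits(n_dictionary)
    dictionary_bits = n_dictionary * (data_symbol + a)
    index_bits = n_index * a
    return dictionary_bits, index_bits


# Hypothetical sizes: 1000 dictionary entries, 5000 codewords, 8-bit symbols.
print(memory_bits(1000, 5000))  # (18000, 50000)
```

Because the address width appears in both terms, shrinking the dictionary below a power of two reduces the word-length of every dictionary entry and every codeword at once.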
Enhancement of the Dictionary Representation
After deriving the dictionary and its index, we reduce the memory requirements of the dictionary by selectively decomposing strings in the dictionary. In the following, a prefix string corresponds to a path from any node up to the tree root. Similarly, a suffix string corresponds to a path from a leaf node up to any node. Finally, a substring corresponds to a path between any two arbitrary nodes.
The main idea is to replace frequently-occurring substrings by a new or an existing substring. As a result, while memory savings can be achieved for the dictionary, additional codewords are also introduced leading to index expansion. For example, consider the prefix strings "COMPUTER" and "QUALCOM" (see Figure 5). Again, for illustrative purposes, we consider letters as symbols. Since "COM" is a common substring, by storing it in the memory only once, the dictionary size can be reduced. However, one additional codeword is required for "COMPUTER" since it is decomposed into two substrings (i.e., "COM" and "PUTER"). In general, the problem of decomposing substrings that can result in maximum savings in memory has exponential complexity.
In the following, a 2-phase greedy heuristic is described that selectively decomposes substrings to achieve overall memory savings. A bottom-up approach is used that prunes the suffix trees starting from the leaf nodes and replaces deleted suffix strings by new (or existing) prefix strings. We concentrate only on suffix strings that include nodes pointed at by only one suffix string. Otherwise, the suffix string extends over a large number of prefix strings, resulting in lower possibility for potential savings in memory. Using our heuristic, 80-85% of the nodes in all suffix trees were examined for the bit-streams considered in our experiments (see Section 5).
In the first phase, we delete suffix strings that can lead to potential savings in memory (see Algorithm 2). Initially, we identify repeated suffix strings that appear across all the suffix trees of the dictionary. As mentioned earlier, the number of suffix trees in the dictionary equals the number of symbols of the input alphabet. For each distinct suffix string $s_i$, the potential savings in memory $cost(s_i)$ are computed. The $cost(s_i)$ depends on the potential savings in dictionary memory and the potential index expansion assuming that $s_i$ is deleted from all the suffix trees. Only suffix strings $s_i$ with non-negative $cost(s_i)$ are deleted. By reducing the dictionary size, the number of bits that is required to address the dictionary (i.e., $\lceil \log_2 n_{dictionary} \rceil$) can decrease too. As a result, the word-length of both the dictionary and index memories can decrease.

In the second phase, we selectively delete individual nodes of the suffix trees in order to decrease the number of bits required to address the dictionary (see Algorithm 3). The deletion of nodes results in index expansion. However, the memory requirements due to the increase of index size can be potentially amortized by the decrease of the word-length of both the dictionary and the index memories. The goal is to reduce the dictionary size while introducing a minimum number of new codewords. Initially, nodes $n_i$ at the same distance across all the suffix trees are sorted with respect to the number of codeword splits $cost(n_i)$ (i.e., the number of new codewords introduced if the node is deleted). Then, starting from the leaf nodes, we mark individual nodes according to their $cost(n_i)$. A marked node is eligible to be deleted. Nodes with a smaller number of codeword splits are marked first. We continue to mark nodes until we achieve a 1-bit savings in addressing the dictionary. If the index expansion results in increasing the total memory requirements, the marked nodes are not deleted and the procedure is terminated.
Otherwise, the marked nodes are deleted and the procedure is repeated. 
Configuration Decompression
Decompression occurs at power-up or at runtime. The original configuration bit-stream is reconstructed by parsing the dictionary with respect to the index. As shown in Figure 6(b), the contents of the index (i.e., codewords) are read sequentially. A codeword corresponds to an address to the dictionary memory. For each codeword, all the symbols of the associated string are read from the dictionary memory and then the next codeword is read. A comparator is used to decide if the output data of the dictionary memory corresponds to a root node, that is, all the symbols of a string have been read. Depending on the output of the comparator, a new codeword is read or the last-read pointer is used to address the dictionary memory.
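The read loop described above can be modeled in software as a sketch. The code below builds a linked-list dictionary with plain LZW over the reversed bit-stream and then decompresses it by following prefix pointers until a root is reached. It omits the deletion of non-referenced nodes and the heuristic; the names `build_dictionary`, `decompress`, and the `ROOT` marker are hypothetical, and the index is traversed in reverse here, which is equivalent to storing it reversed and reading it sequentially.

```python
ROOT = -1  # hypothetical marker stored in the pointer field of root entries


def build_dictionary(bitstream: bytes):
    """Plain LZW over the reversed stream.

    Returns (entries, index); entries[a] = (symbol, prefix_address).
    """
    entries = [(b, ROOT) for b in range(256)]  # one root per 8-bit symbol
    children = {}  # (prefix_address, symbol) -> address; compression only
    index = []
    data = bitstream[::-1]
    if not data:
        return entries, index
    current = data[0]  # a root's address equals its symbol value
    for symbol in data[1:]:
        nxt = children.get((current, symbol))
        if nxt is not None:
            current = nxt  # the extended phrase already exists
        else:
            index.append(current)
            children[(current, symbol)] = len(entries)
            entries.append((symbol, current))
            current = symbol
    index.append(current)
    return entries, index


def decompress(entries, index):
    """Follow prefix pointers from each codeword up to its root."""
    out = bytearray()
    for codeword in reversed(index):
        address = codeword
        while True:
            symbol, prefix = entries[address]  # one memory read
            out.append(symbol)
            if prefix == ROOT:  # comparator: all symbols have been read
                break
            address = prefix  # reuse the last-read pointer as the address
    return bytes(out)


original = b"COMPUTE COMPUTER COMPUTATION"
entries, index = build_dictionary(original)
assert decompress(entries, index) == original
```

Each traversal yields a phrase of the reversed stream back to front, so emitting the phrases last to first restores the original order without a stack.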
In Figure 6, both a typical scheme and our compression-based scheme for storing and reading the configuration bit-stream are shown. Typically, the configuration bit-stream is stored in memory. It is important to deliver the bit-stream sequentially; otherwise, the configuration mechanism will not be initialized correctly and the configuration process will fail. Depending on the configuration mode, data is delivered serially or in parallel. In our scheme, the only hardware overhead introduced is a comparator and a multiplexer. The output of the decompression process is identical to the data delivered by the conventional scheme. Moreover, the data rate for delivering the configuration data is the same for both schemes and depends only on the memory bandwidth. The decompression process does not add any time overhead to the configuration time.
EXPERIMENTS & COMPRESSION RESULTS
Our configuration compression technique was applied to configuration bit-streams of several real-world applications. The target architecture was the VIRTEX series FPGAs [14]. For mapping onto the VIRTEX devices, we used the Foundation Series v2.1i software development tool. Each application was mapped onto the smallest VIRTEX device that met the area requirements of the corresponding implementation. The size of the configuration bit-streams ranged from 1.7 Mbits to 6.1 Mbits. In Table 1, the configuration bit-stream sizes for each implementation are shown.
The considered configuration bit-streams corresponded to implementations of cryptographic and signal processing algorithms. The cryptographic algorithms were the final candidates of the Advanced Encryption Standard (AES): MARS, RC6, Rijndael, Serpent, and Twofish. Their implementations included a key-scheduling unit, a control unit, and one round of the cryptographic core that was used iteratively. Implementation details of the AES algorithms can be found in [5]. We have also implemented digital signal processing algorithms using the logic cores provided with the Foundation 2.1i software tool [14]. A 1024-point and a 512-point complex FFT were implemented that were able to perform the IFFT too. In addition, four 256-tap FIR filters were mapped onto the same device. In the latter implementation, all filters can process data concurrently. Finally, a 1024-tap FIR filter was also implemented.
The configuration bit-streams were processed byte-by-byte during compression, that is, the symbol for the dictionary entries was chosen to be an 8-bit word. As a result, the decompressed data is delivered as 8-bit words and, thus, parallel modes of configuration can be supported. Note that the maximum number of bits used in parallel modes of configuration is typically 8 bits [1, 2, 14]. If the configuration mode requires less than 8 bits (e.g., serial mode), an 8-to-n bit converter can be used, where n is the number of bits required by the configuration mode. In this work, for each configuration bit-stream, we do not attempt to find the optimal bit-length for the symbol that leads to the best compression results.
The compression results are shown in Tables 1 and 2. The results are organized with respect to the optimization stages of our technique (see Figure 2). The results shown for LZW correspond to the construction of the dictionary and the index using the LZW algorithm. The only difference compared to Figure 2 is that the LZW results include the optimization of merging common prefix strings in the dictionary. Hence, the results shown for Compact correspond to the deletion of the non-referenced nodes in the dictionary.
Finally, the results shown for Heuristic correspond to the optimizations performed by our heuristic and are also the overall results of our compression technique.
In Table 1, the achieved compression ratios are shown. The compression ratio is the ratio of the total memory requirements (i.e., memory to store dictionary and index) to the bit-stream size. In addition, in Table 1, lower bounds on the compression ratios are shown. For our compression technique, the lower bound for each bit-stream corresponds to the entropy of the bit-stream with respect to the LZW compression algorithm. As mentioned in Section 3, the compression ratio is affected by the entropy of the data to be compressed [11]. We have calculated the lower bound by dividing the index size derived using LZW by the bit-stream size. Therefore, the lower bound corresponds to the compression ratio that can be achieved by LZW for software-based applications (assuming 8-bit symbols).
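The two quantities defined above can be written down directly. The following sketch uses hypothetical sizes, not values from Table 1, and the function names are assumptions; reading the memory savings as one minus the compression ratio is also an assumption of the example.

```python
def compression_ratio(dictionary_bits, index_bits, bitstream_bits):
    """Total memory (dictionary + index) divided by the bit-stream size."""
    return (dictionary_bits + index_bits) / bitstream_bits


def lzw_lower_bound(lzw_index_bits, bitstream_bits):
    """LZW index size alone divided by the bit-stream size."""
    return lzw_index_bits / bitstream_bits


def memory_savings(ratio):
    """Fraction of memory saved, assuming savings = 1 - compression ratio."""
    return 1.0 - ratio


# Hypothetical sizes in bits, for illustration only.
bitstream_bits = 1_000_000
ratio = compression_ratio(300_000, 400_000, bitstream_bits)
bound = lzw_lower_bound(250_000, bitstream_bits)

print(f"compression ratio: {ratio:.2f}")
print(f"lower bound:       {bound:.2f}")
print(f"memory savings:    {memory_savings(ratio):.0%}")
```

The gap between the ratio and the lower bound reflects the cost of storing the dictionary, which software-based LZW does not pay.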
In Table 2, the compression results are shown in terms of the memory requirements. The memory requirements for the dictionary are $n_{dictionary} \times (8 + \lceil \log_2 n_{dictionary} \rceil)$ bits, where $n_{dictionary}$ is the number of memory entries of the dictionary. Similarly, the memory requirements for the index are $n_{index} \times \lceil \log_2 n_{dictionary} \rceil$ bits, where $n_{index}$ is the number of memory entries of the index and $\lceil \log_2 n_{dictionary} \rceil$ is the number of bits required to address the dictionary.

In software-based applications, only the index is considered in the calculation of the compression ratio. In addition, statistical encoding schemes are utilized for further compressing the index. As a result, in typical LZW applications, superior compression ratios (i.e., 10-20%) have been achieved by using commercially available software programs (e.g., compress, gzip). However, such commercial programs are not applicable to our compression problem. As discussed earlier, in the context of embedded environments, both the dictionary and the index are considered in the calculation of the compression ratio. The size of the derived dictionaries was comparable to the size of the original bit-streams. Therefore, negative compression occurred, [...]

[...] was in the order of hundreds of bits. No detailed information was provided regarding the algorithm used to build the dictionary. The authors mainly focused on tuning the dictionary parameters to achieve better compression results based on the specific set of programs. However, such an approach is unlikely to achieve the same results for FPGA configurations where the bit-stream is a data file and not an instruction-based program. In addition, Huffman encoding was used for compressing the codewords. As a result, dedicated hardware resources were needed for decompressing the codewords.
In [9], the dictionary was built by solving a set-covering problem. The dictionary representation was based on the External Pointer Macro compression model. According to this model, a phrase is called by pointing to a memory address and reading as many consecutive memory addresses as the phrase length. A heuristic was developed to merge dictionary entries by subsuming a dictionary entry i by another entry j if i = j. While this heuristic results in a minimal-size dictionary (i.e., minimal number of entries), it also results in entries of maximal length. This happens because the goal was to substitute the maximal number of entries by keeping the ones that "cover" the maximal number of them. Hence, the longest phrases were kept in the dictionary. Moreover, the size of the considered programs was 0.5-10 Kbits and the achieved compression ratios (i.e., size of the compressed program as a fraction of the original program) were approximately 85-95%. Since the technique in [9] was developed for code size minimization, it is not fair to make any compression ratio comparisons with our results.
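The External Pointer Macro model described above, in which a phrase is called by an address and a length, can be illustrated with a short decoder. This is a simplified sketch of the model as stated in the text, not the implementation from [9]; the function name `epm_decode` and the byte-oriented dictionary memory are assumptions made for the example.

```python
def epm_decode(memory: bytes, calls: list[tuple[int, int]]) -> bytes:
    """Expand a sequence of (address, length) phrase calls.

    Each call reads `length` consecutive symbols from the dictionary
    memory, starting at `address`.
    """
    out = bytearray()
    for address, length in calls:
        if address < 0 or length < 0 or address + length > len(memory):
            raise ValueError("phrase call reads outside the dictionary memory")
        out += memory[address:address + length]
    return bytes(out)


# Hypothetical dictionary memory holding one long phrase. Shorter phrases
# that occur inside it need no storage of their own.
memory = b"ABCDE"
calls = [(0, 3), (2, 3), (1, 2)]  # "ABC", "CDE", "BC"
decoded = epm_decode(memory, calls)
```

The example shows why keeping long phrases reduces the number of dictionary entries: "ABC", "CDE", and "BC" are all served by the single stored phrase "ABCDE".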
Work related to FPGA configuration compression has been reported in [7, 8]. In [7], the proposed technique took advantage of the characteristics of the configuration mechanism of the Xilinx XC6200 architecture. Therefore, the technique is applicable only to that architecture. In [8], run-length compression techniques for configurations have been described. Again, the techniques were developed specifically for the Xilinx XC6200 architecture. Addresses were compressed using run-length encoding while data was compressed using LZ compression (sliding-window method [11]). Dedicated on-chip hardware was required for both methods. A set of configuration bit-streams (2-88 Kbits) was used to fine-tune the parameters of the proposed methods. A 16-bit window was used in the LZ implementation. While this window size led to good results for these bit-streams, it is impractical for larger configuration bit-streams. Moreover, a fine-tuned scheme for larger configuration bit-streams would lead to larger windows. As stated in [8], larger windows impose a fairly high hardware penalty with respect to the buffer size as well as the supporting hardware.
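To make the window-size trade-off concrete, the following is a generic sliding-window LZ sketch in the style of LZ77. It is not the scheme of [8]: it works on bytes rather than bits, uses a greedy longest-match search, and emits (offset, length, literal) triples, all of which are assumptions chosen for brevity. The point it illustrates is that the decoder must keep the last `window` symbols available, so the buffer grows with the window.

```python
def lz77_encode(data: bytes, window: int = 16) -> list[tuple[int, int, int]]:
    """Greedy sliding-window encoder emitting (offset, length, literal)."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_offset, best_length = 0, 0
        # Only the last `window` symbols may be referenced.
        for candidate in range(max(0, pos - window), pos):
            length = 0
            # Stop one symbol early so a literal always follows the match.
            while (pos + length < len(data) - 1
                   and data[candidate + length] == data[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = pos - candidate, length
        tokens.append((best_offset, best_length, data[pos + best_length]))
        pos += best_length + 1
    return tokens


def lz77_decode(tokens: list[tuple[int, int, int]]) -> bytes:
    """Rebuild the data; copies may overlap the symbols being written."""
    out = bytearray()
    for offset, length, literal in tokens:
        for _ in range(length):
            out.append(out[-offset])  # copy from `offset` symbols back
        out.append(literal)
    return bytes(out)
```

Because every offset is bounded by `window`, a hardware decoder for this kind of scheme needs a history buffer of at least `window` symbols, which is the cost that grows when the window is tuned for larger bit-streams.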
CONCLUSIONS
In this paper, a novel configuration compression technique was proposed. Our goal was to reduce the memory required to store configurations in FPGA-based embedded systems and achieve high decompression efficiency. Decompression efficiency corresponds to the decompression hardware cost as well as the decompression rate. Although data compression has been extensively studied in the past, we are not aware of any prior work that addresses configuration compression for FPGA-based embedded systems with respect to the cost and speed requirements. Our compression technique is applicable to any SRAM-based FPGA device since it does not depend on specific features of the configuration mechanism. The configuration bit-streams are processed as raw data without considering individual semantics. Hence, both complete and partial configuration schemes can be supported. The required decompression hardware is simple and does not depend on the individual semantics of configuration bit-streams or specific features of the configuration mechanism. Moreover, the decompression process does not affect the time to configure the device. Using our technique, we have demonstrated 11-41% savings in memory for various configuration bit-streams of real-world applications. Considering the lower bounds derived for the compression ratios, the achieved compression ratios were higher than the lower bounds by 14.5% on the average.
Future work includes the enhancement of our technique by incorporating a unified dictionary, that is, deriving a single dictionary for a set of configurations. The latter will result in simplified memory organization since the word-length of the index memory will be the same across different algorithms. Possible solutions could be to process all configuration bit-streams as one entity or to process each configuration bit-stream individually and merge their dictionaries later. To update the embedded system with a new configuration, only the index needs to be derived based on the unified dictionary.
In addition, we plan to develop a skeleton-based approach for our compression technique. A skeleton is the "intersection" of a set of configuration bit-streams. By removing the data redundancy of the skeleton in the bit-streams, savings in memory can be achieved. The original configurations are reconstructed based on the skeleton. Given a set of configurations, we plan to address the problem of deriving a skeleton to maximize the savings in memory and/or to minimize the configuration time by using partial reconfiguration.
Related problems are also addressed by the USC MAARCII project (http://maarcII.usc.edu). This project is developing novel mapping techniques to exploit dynamic reconfiguration and facilitate run-time mapping using configurable computing devices and architectures [3, 4, 12].
