Abstract: In this paper, we present hardware decompression accelerators for bridging the gap between high speed FPGA configuration interfaces and slow configuration memories. We discuss different compression algorithms suitable for decompression on FPGAs as well as on CPLDs with respect to the achievable compression ratio, throughput, and hardware overhead. This leads to various decompressor implementations, one of which is capable of decompressing at data rates of up to 400 megabytes per second while requiring only slightly more than a hundred look-up tables. Furthermore, we present a sophisticated configuration bitstream benchmark.
I. INTRODUCTION
The progress in silicon technology has led to an enormous growth of resources found on current FPGAs. With larger FPGAs, the size of the configuration data has increased significantly over the last decade and has just passed the 10 megabyte barrier [1], [2] for the high end products of the leading FPGA vendors. In order to keep the configuration times within acceptable margins, most recent FPGA families offer high speed configuration with data rates of up to several hundreds of megabytes per second. Such high data rates cannot be delivered by today's low cost non-volatile memories. For instance, typical NAND-flash devices offer read data rates of just a few tens of megabytes per second. Consequently, configuration bitstream storage for high speed configuration is costly. Note that some systems require multiple different configurations, making the bitstream storage even more expensive. For example, a system that supports in-field updates will normally provide a fallback configuration bitstream for the case of an update failure. In such environments, bitstream compression may help to reduce the memory usage and the required memory bandwidth, thus saving cost.
Especially for systems using runtime reconfiguration, the configuration speed is an important issue that determines whether two or more modules are suitable to share the same FPGA resource area over time. If, for example, two different modules of a video processing system have an accumulated execution time smaller than the time between two video frames, these two modules may be executed time-multiplexed on an FPGA. But this is only feasible if the reconfiguration can also be performed within the given time budget. For typical multimedia applications, this implies reconfiguration times in the range of a few milliseconds.
We propose to use bitstream decompression for accelerating the configuration speed by enhancing the throughput that is typically constrained by slow configuration storage devices.
For instance, when bitstream compression reduces the size of a bitstream to 50%, the configuration speed may be doubled.
When bitstream compression is to save cost and enhance the configuration speed, the decompression has to be performed in hardware. Bitstreams may contain fractions of data that can be set to arbitrary values, thus allowing higher compression ratios [3]. Unfortunately, the bitstream formats of today's FPGAs are confidential, which prevents us from exploiting such techniques. Consequently, our proposed compression algorithm variations are lossless. For hardware decompression, we observe the following objectives:
* Compression ratio: In order to allow memory savings and configuration throughput enhancements, we require significant compression ratios.
* Throughput: In order to speed up the reconfiguration with the help of hardware decompression, the decompressor has to emit a high continuous data rate. As discussed in Section II, this is not only a question of the achievable clock frequency but also a question of possible burst throughputs on both sides of the decompression module.
* Resource overhead: In order to get an overall benefit due to reconfiguration, the decompression hardware should be as small as possible.

Contribution

In this paper, we analyze existing compression algorithms with respect to their suitability for hardware accelerated bitstream decompression. We are the first to discuss various modifications to standard algorithms in order to fulfill the objectives of compression ratio, throughput, and resource overhead at the same time and with high quality. Here, high quality means implementations that achieve compression ratios that can compete with state of the art software programs, such as gzip, implementations that allow throughputs of several hundreds of megabytes per second, and implementations with small resource requirements. We discuss issues for FPGA implementations as well as for CPLD designs, and some of our presented implementations also perform well for bitstreams of different FPGA families. Furthermore, we developed a benchmark including several dense and therefore hard-to-compress configuration bitstreams that can be accessed freely on our website [4].
The paper is structured as follows: In Section II, we present general aspects and a motivating example why bitstream decompression algorithms must be carefully designed in order to reduce configuration time. Next, in Section III, we introduce our configuration bitstream benchmark that is used in Section IV to rate our implementations.
II. ALGORITHMS FOR BITSTREAM COMPRESSION
Bitstream compression is the process of encoding FPGA configurations using fewer bits than the original unencoded bitstream. We define the compression ratio $\eta$ as

$$\eta = \frac{\text{size(compressed bitstream)}}{\text{size(original bitstream)}}$$

The main goal of bitstream compression is to hide a low bandwidth of a configuration memory through hardware efficient decompression. All of our presented accelerators for bitstream decompression are capable of emitting one word per clock cycle. Let $|w_{FPGA}|$ be the word width of the configuration interface and $|w_{deco}|$ be the decompressor output data width; then a decompression accelerator has to operate at no less than $\frac{|w_{FPGA}|}{|w_{deco}|}$ times the clock frequency of the configuration interface in order to allow full configuration speed. In order to accelerate the configuration process, it is not sufficient to focus on the best compression ratio; it is much more important to consider the configuration memory bandwidth. Let $d_{FPGA}$ be the configuration interface data rate and $d_{MEM}$ be the peak data rate of the configuration memory; then the configuration data rate is:

$$\min\left(d_{FPGA}, \frac{d_{MEM}}{\eta}\right) \quad (1)$$

If we design the bitstream compression algorithm such that the worst case compression ratio is limited at the cost of a reduced best case compression ratio, we can shorten the configuration time significantly. For instance, limiting the compression ratio of the 10% of the bitstream containing worst case configuration data to, let's say, $\hat{\eta} = 2$ while having a compression ratio of $\eta = 0.4$ for the remaining 90% results in an average compression ratio of $10\% \cdot \hat{\eta} + 90\% \cdot \eta = 56\%$.
If we now compute the configuration time for the adjusted algorithm, we achieve a much better configuration time of just

$$\frac{10\,\text{kB} \cdot \hat{\eta}}{d_{MEM}} + \frac{90\,\text{kB}}{d_{FPGA}} = 1.3\,\text{ms}.$$
These numerical examples point out that looking only at the overall compression ratio will not necessarily enhance the configuration speed. To the best of our knowledge, this problem has neither been identified nor tackled in related publications.
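The effect can be reproduced with a small model. The following is a minimal sketch, assuming that each bitstream segment is configured at the lower of the interface rate and the memory rate divided by that segment's compression ratio. The function name and the data rates of 50 MB/s for the memory and 100 MB/s for the configuration interface are illustrative assumptions; they are chosen because they reproduce the 1.3 ms of the example for a bitstream with 10 kB of worst case data and 90 kB of well compressible data.

```python
def configuration_time(segments, d_mem, d_fpga):
    """Return the configuration time in seconds.

    segments: list of (uncompressed_bytes, compression_ratio) pairs, where the
              ratio is compressed size / original size of that segment.
    d_mem:    peak data rate of the configuration memory in bytes/s.
    d_fpga:   data rate of the configuration interface in bytes/s.
    """
    total = 0.0
    for size, ratio in segments:
        # The memory delivers compressed data, so it sustains d_mem / ratio
        # of uncompressed data; the interface caps the rate at d_fpga.
        rate = min(d_fpga, d_mem / ratio)
        total += size / rate
    return total


# Assumed data rates (not stated in the text): 50 MB/s memory, 100 MB/s interface.
D_MEM = 50e6
D_FPGA = 100e6

# 10 kB at the bounded worst case ratio 2, 90 kB at ratio 0.4.
adjusted = [(10e3, 2.0), (90e3, 0.4)]
t_adjusted = configuration_time(adjusted, D_MEM, D_FPGA)
print(f"{t_adjusted * 1e3:.2f} ms")
```

In this model, the 90 kB segment is limited by the configuration interface, so a better compression ratio for that segment would not shorten the configuration time any further.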
In the following, we will examine compression ratio issues together with implementation aspects for variations of well known compression algorithms, such as run length encoding, Lempel-Ziv, and Huffman encoding.
A. Run Length Encoding
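Run length encoding [6] replaces a run of identical consecutive words by a tuple $(w, l)$ consisting of the word $w$ and the run length $l$. The cost to encode a run length tuple is always $|w| + |l|$ bits.
These bits code in the best case $|w| \cdot 2^{|l|}$ bits and in the worst case just $|w|$ bits. Note that in the latter case the compressed data will be larger than the original input data.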
The word width $|w|$ and the number of bits $|l|$ used to code the run length should be determined with care, because when random data is compressed using run length encoding, the compressed data will be larger than the original input data. This is illustrated in Figure 1 on the basis of a probabilistic state machine. The probability that the next word in a random data stream is identical to the present one is $p(1) = 2^{-|w|}$. The probability for longer recurrences $i$ decreases exponentially: $p(i) = 2^{-i \cdot |w|}$.
The compression ratio for an infinite random data stream is therefore:

$$\tilde{\eta}_{RLE} = \frac{|w| + |l|}{|w|} \cdot \lim_{n \to \infty} \frac{1}{1 + \sum_{i=1}^{n} p(i)} \quad (2)$$

Even for relatively small word widths, the probability that just two consecutive words are identical is about zero (e.g., < 1% for bytes). Therefore, $\tilde{\eta}_{RLE}$ converges towards the worst case compression ratio $\hat{\eta}_{RLE} = \frac{|w| + |l|}{|w|}$ for increasing $|w|$. The values of $\tilde{\eta}_{RLE}$ and $\hat{\eta}_{RLE}$ indicate the behavior when the bitstream contains fractions of high entropy data. This can happen especially for initialization data of internal RAM blocks. As the relative amount of RAM initialization data is not negligible¹, we demand that all compression algorithms have an $\tilde{\eta}$ close to 1 and an $\hat{\eta} \le 2$. Consequently, the run length must not be encoded with more than half the number of bits chosen for the word width: $|l| \le \frac{1}{2} |w|$.

As a starting point for parameterizing the run length algorithm, we analyzed a histogram of the probability distribution of the run lengths for different word widths $|w|$ of the complete benchmark corpus. The histogram in Figure 2 reveals that higher rates of consecutive sequences exist only for small word widths. For instance, for a word width of $|w| = 16$, the probability is 55% that the run length is 1, which means that most pairs of consecutive words differ. Furthermore, the histogram points out the potential for data compression through run length encoding. If we add up, for example, the probabilities of the first ten run lengths ($\sum_{i=1}^{10} p(i)$), we see that for all word widths except for the width $|w| = 16$, the sum of the probabilities is below 50%. Thus, for these word widths, the majority of run lengths is larger than ten, which indicates that run length compression is feasible. However, when using the described encoding, the run length 1 will lead to a relatively large portion of data $\sigma$ in the compressed data stream that is larger than the original one:

$$\sigma = \frac{|w| + |l|}{|w|} \cdot p(1).$$
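These bounds are easy to evaluate for candidate parameters. The following is a minimal sketch, assuming that a run length tuple always costs |w| + |l| bits and codes between one and 2^|l| words. The function names are illustrative; the example values are a 4-bit word with a 2-bit length field and a measured share of 27% for run length 1.

```python
def rle_worst_case_ratio(w_bits, l_bits):
    # Every tuple costs w_bits + l_bits and, in the worst case, codes one word.
    return (w_bits + l_bits) / w_bits


def rle_best_case_ratio(w_bits, l_bits):
    # In the best case, one tuple codes a run of 2**l_bits identical words.
    return (w_bits + l_bits) / (w_bits * 2 ** l_bits)


def single_word_portion(w_bits, l_bits, p1):
    # Size of the compressed data spent on runs of length 1,
    # relative to the size of the original data.
    return rle_worst_case_ratio(w_bits, l_bits) * p1


print(rle_worst_case_ratio(4, 2))        # 1.5
print(rle_best_case_ratio(4, 2))         # 0.375
print(single_word_portion(4, 2, 0.27))   # about 0.405
```

With a length field of at most half the word width, the worst case ratio never exceeds 1.5, which stays within the demanded bound of 2.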
For instance, if we set the word width to $|w| = 4$ bit, we find according to Figure 2 that in 27% of the cases two consecutive words differ. If we further set the length field to $|l| = 2$ bit, we get $\sigma = \frac{6}{4} \cdot 27\% = 40.5\%$. Note that the final compression ratio is always worse than $\sigma$. In order to overcome these issues, we propose two variations. The first one works on a word width of a single bit, while the second one uses a flag to distinguish between a recurrent sequence and differing consecutive words.

¹ According to the data sheets [1], [2] for the chosen FPGAs in the benchmark section, the relative amount of RAM preload data is on average 14% of the overall bitstream size.

1) Toggle RLE: If the word width is set to $|w| = 1$ bit, the run length value specifies the distance between two bit flips. If we further define a fixed start bit value, the complete bitstream data is compressible by encoding only run lengths while omitting the next word $w$. By reserving some codes for specifying run lengths without toggling at the end, it is possible to compress recurrent sequences of any length.
In a former implementation of such a decompressor [7] , we have demonstrated good compression ratios in combination with tiny hardware decompression modules. In this paper, we present an improved version that achieves better compression ratios. We examined different prefix free encodings in order to balance between the compression ratio and the hardware to decode the prefix free code words specifying the run lengths. In addition, the encodings ensure a bounded compression ratio for random data as well as for worst case examples. The toggle RLE hardware decompressor requires only a counter, a control state machine, and a decoder for the run length values. Therefore, this technique is suitable for FPGA as well as for CPLD implementations. One drawback of this technique is the sequential output of the bitstream data. Even for high speed designs running at 200 MHz, the throughput is limited to just 25 MB/s.
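The principle of toggle RLE can be modeled in a few lines. The following is a minimal sketch under simplifying assumptions: it uses fixed-width run length codes instead of the prefix free encodings examined for the actual decompressor, the start bit value is fixed to 0, and the highest code value is reserved for "emit a full run without toggling". These choices and all names are illustrative only.

```python
def toggle_rle_encode(bits, l_bits=4, start=0):
    """Encode a list of 0/1 values as run length codes (ints < 2**l_bits)."""
    cont = 2 ** l_bits - 1      # reserved code: emit `cont` bits, do not toggle
    codes, current, i = [], start, 0
    while i < len(bits):
        run = 0
        while i + run < len(bits) and bits[i + run] == current:
            run += 1
        i += run
        while run >= cont:      # split long runs using the reserved code
            codes.append(cont)
            run -= cont
        codes.append(run)       # emit `run` bits, then toggle the bit value
        current ^= 1
    return codes


def toggle_rle_decode(codes, l_bits=4, start=0):
    """Inverse of toggle_rle_encode; models the counter based decompressor."""
    cont = 2 ** l_bits - 1
    out, current = [], start
    for code in codes:
        if code == cont:
            out.extend([current] * cont)
        else:
            out.extend([current] * code)
            current ^= 1
    return out


data = [1] * 3 + [0] * 20 + [1]
codes = toggle_rle_encode(data)
print(codes)
```

The decoder needs only the current bit value and a counter, which mirrors the small hardware footprint described above. It also shows the throughput limit, since every output bit takes one step.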
2) Flag RLE: In order to enhance the throughput, we examined a variation of the traditional run length encoding scheme presented above that uses a flag to distinguish between the case that two consecutive words differ (run length equal to one) and the case of a recurrent sequence of identical words (run length larger than one). The corresponding encodings are ('0', w) and ('1', w, l). Thus, the worst case compression ratio is $\hat{\eta}_{RLE} = \frac{|w| + 1}{|w|}$. As shown in Section IV, the flag RLE has an inferior compression ratio compared to the toggle RLE approach, but it has a $|w|$ times higher throughput, because the toggle RLE decompressor emits only one bit per cycle. The flag RLE algorithm is also suitable for FPGA and CPLD implementations, but care has to be taken with the decompressor input interface. In the case of a typical 8-bit input interface, we found the most hardware efficient implementation to be one that uses an extra register for the flags in order to keep the alignment of the incoming compressed data constant. This extra register stores the flags for a block of eight code words and is loaded once at the beginning of a new block.
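The two encodings can be illustrated with a short software model. This is a minimal sketch, assuming that the length field stores the run length directly and that runs are capped at 2^|l| - 1 words; tokens are kept as tuples instead of packed bits, and all names are illustrative.

```python
def flag_rle_encode(words, l_bits=4):
    """Return tokens (0, w) for single words and (1, w, run) for runs."""
    max_run = 2 ** l_bits - 1
    tokens, i = [], 0
    while i < len(words):
        run = 1
        while (i + run < len(words) and words[i + run] == words[i]
               and run < max_run):
            run += 1
        if run == 1:
            tokens.append((0, words[i]))
        else:
            tokens.append((1, words[i], run))
        i += run
    return tokens


def flag_rle_decode(tokens):
    out = []
    for token in tokens:
        if token[0] == 0:
            out.append(token[1])
        else:
            out.extend([token[1]] * token[2])
    return out


def compressed_bits(tokens, w_bits=8, l_bits=4):
    # One flag bit and the word per token, plus the length field for runs.
    return sum(1 + w_bits + (l_bits if t[0] else 0) for t in tokens)


data = [7, 7, 7, 7, 1, 2, 3]
tokens = flag_rle_encode(data)
print(tokens, compressed_bits(tokens), "bits instead of", 8 * len(data))
```

For data without any repetition, every 8-bit word costs 9 bits, which is the worst case ratio of (|w| + 1) / |w|.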
B. LZSS
The LZ algorithms (named after their inventors A. Lempel and J. Ziv) are string substitution schemes. A well known string substitution scheme is LZ77 [8]. String substitution schemes replace uncompressed data by references to already compressed data. During compression, as shown in Figure 3, a search buffer keeps track of the last n already compressed words and a look-ahead buffer allows accessing the next m words of the uncompressed data. During the compression phase, the search buffer is scanned for matches of all prefixes of the look-ahead buffer. The longest prefix of the look-ahead buffer found in the search buffer is then replaced by its starting offset and its length. If no prefix was found, we emit the first word (nextWord) of the look-ahead buffer.
The LZSS approach [9], as an improvement of LZ77, uses a flag to distinguish between these two cases, similar to the methodology presented for the flag RLE in Section II-A. The upper bound is equal to the upper bound of the flag RLE algorithm (see Section II-A.2). Thus, LZSS does not lead to a drastic degradation of the compression ratio if fractions of the configuration bitstream contain data with a high entropy.

1) LZSS Hardware Decompression: During decompression, offset and length refer to already emitted data. Thus, the decompressor must keep track of the last n emitted words, with n being the search buffer size of the compression program. For the LZSS decompression, we enhanced an approach for an LZ77 hardware decompressor proposed in [10] (shown in Figure 4) with a state machine that decodes LZSS encodings. During decompression, the emitted data is fed into a shift register that represents the mentioned search buffer. If there is no match in the shift register, nextWord will be emitted. In case of a match, the offset field (o) specifies the position of the shift register containing the first word of an l word long sequence.
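The shift register based decompression can be modeled in software. The following is a minimal sketch, assuming tokens of the form (0, nextWord) and (1, offset, length) and a register whose position 0 holds the most recently emitted word. The token format and names are illustrative assumptions.

```python
from collections import deque


def lzss_decompress(tokens, n=32):
    """Decode LZSS tokens with an n word deep shift register."""
    reg = deque([0] * n, maxlen=n)   # reg[0] is the most recently emitted word
    out = []
    for token in tokens:
        if token[0] == 0:            # no match: emit nextWord
            _, word = token
            out.append(word)
            reg.appendleft(word)
        else:                        # match: replay `length` words
            _, offset, length = token
            for _ in range(length):
                # Every emitted word is shifted in, so the same tap
                # position delivers the next word of the sequence.
                word = reg[offset]
                out.append(word)
                reg.appendleft(word)
    return out


print(lzss_decompress([(0, 1), (0, 2), (0, 3), (1, 2, 6)]))
# [1, 2, 3, 1, 2, 3, 1, 2, 3]
```

Because the register shifts with every output word, the read position stays fixed for the whole match, so a match can be longer than its offset.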
As CPLDs contain only a relatively small amount of memory bits, LZ decompressors are inapplicable for CPLD implementation, because of the required shift register. For the implementation of the shift register on FPGAs, the following four options exist. FPGAs [12] The first solution consumes an unfavorable amount of hardware resources. The second one is an interesting alternative when implementing hardware decompression on the FPGA itself as a fixed part of the chip. Accessing the FDRI leads to a significant hardware overhead and the access is pretty slow. However, [13] and [14] present compression techniques that are based on the FDRI, but both papers omit a discussion of a possible hardware implementation of a decompression module. The third variant uses dedicated on-chip RAM resources that allow implementing a long shift register that was used in [15] for LZSS configuration data decompression in hardware. But huge shift registers require more bits in the compressed data in order to specify the offset o, thus leading to lower compression ratios especially for short matching sequences. By exploring various search buffer sizes and encoding schemes, we found no significant benefit of using huge shift registers, because configuration data for densely packed FPGA designs is highly irregular. Therefore, dedicated RAMs are oversized.
Instead, we examined the last option to implement the shift register with dedicated shift register primitives that are available on most Xilinx FPGA families [2]. We discovered that even small search buffers, as small as 32 words, lead to good compression results, as revealed by our experimental results (to be found in Section IV). In our approach, the shift register has been implemented by using SRLC16E primitives on Xilinx FPGAs. This primitive allows a 4-bit look-up table to be used alternatively as a 16 bit shift register with random access to the individual register contents. An $l$ deep shift register of width $n$ can be created by cascading $n \cdot \lceil \frac{l}{16} \rceil$ SRLC16E primitives.
For the sake of completeness, we want to mention an implementation [16] for decompressing configuration data in hardware that has some relationship with the LZ78 algorithm, which is based on a dictionary rather than a shift register. However, that work reports only moderate compression ratios and omits details on the resource consumption.
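The resource demand of such a search buffer follows directly from the primitive size. A minimal sketch, assuming that one SRLC16E primitive provides 16 stages for a single bit of width; the function name is illustrative.

```python
import math


def srl16_primitives(depth, width):
    """Number of 16-stage, 1 bit wide shift register primitives."""
    return width * math.ceil(depth / 16)


print(srl16_primitives(32, 8))    # 16
print(srl16_primitives(32, 16))   # 32
```

A search buffer of 32 words with 8 bit each therefore occupies only 16 look-up tables.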
2) LZSS Parameterization: Experimental results confirmed our assumption that, as for the run lengths (shown in Figure 2), the prefix lengths found during LZSS compression are not uniformly distributed. This observation lets us conclude that it is not suitable to code all LZSS prefix run lengths with a fixed code word size. If we use prefix free encodings with adjusted word widths, such as Huffman codes, for the length encoding, we will require an unreasonable amount of hardware resources, because of the different word widths that must be aligned to the input data width, as demonstrated in Section II-C. Therefore, we encoded only a subset containing $2^{|l'|}$ run lengths of all $2^{|l|}$ possible run lengths using fewer bits ($|l'| < |l|$). We explored manually how many bits are required to encode the length ($|l'|$) and which subset of run lengths leads to the best compression ratios. Note that the search space to find the optimal prefix run lengths for just one encoding of $|l'|$ is $\binom{2^{|l|}}{2^{|l'|}}$. For our best encodings for run length prefixes, we found that at least 50% of the subset is taken from the 10 smallest prefix run lengths.
Limiting the number of encoded prefix lengths will result in more code words in the compressed bitstream, because for a prefix length $l$ that is not encoded, the biggest encoded prefix length smaller than $l$ will be used instead. In this case, only a smaller prefix length is encoded and a new search for a prefix is executed. However, encoding more prefix lengths leads to only marginally better compression ratios. In addition, longer prefix lengths will not necessarily reduce the configuration time, because they represent high compression ratios that cannot be exploited, as the configuration interface has a limited input data rate. The best parameter set found in the above mentioned exploration for our LZSS derivation was one bit for the flag, $|l'| = 3$ bit, $|o| = 5$ bit, and $|n| = 8$ bit. This results in a nine bit long code word, which does not match a typical 8-bit interface. To simplify the interfacing, we used an extra register for the flags that stores the flags for a block of 8 consecutive code words. The remaining compressed code word is then 8 bit wide and thus ideal for connecting a memory.
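The fallback to the biggest encoded prefix length can be sketched as follows. This is a minimal model, assuming a greedy search over a 32 word search buffer and an arbitrarily chosen subset of eight prefix lengths; the subset, the token format, and all names are hypothetical and do not reproduce the parameter set found in the exploration.

```python
ENCODED_LENGTHS = (2, 3, 4, 5, 6, 8, 12, 16)   # hypothetical subset, 2**3 entries


def lzss_compress(words, n=32, lengths=ENCODED_LENGTHS):
    """Greedy LZSS compression that only emits prefix lengths from `lengths`."""
    tokens, i = [], 0
    while i < len(words):
        best_len, best_off = 0, 0
        for off in range(min(n, i)):            # off 0 = last emitted word
            start = i - 1 - off
            length = 0
            while (length < max(lengths) and i + length < len(words)
                   and words[start + length] == words[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, off
        usable = [l for l in lengths if l <= best_len]
        if usable:
            # Use the biggest encoded length; the rest is searched again.
            tokens.append((1, best_off, max(usable)))
            i += max(usable)
        else:
            tokens.append((0, words[i]))
            i += 1
    return tokens


def lzss_decompress(tokens):
    """Reference decoder used to check the round trip."""
    out = []
    for token in tokens:
        if token[0] == 0:
            out.append(token[1])
        else:
            _, off, length = token
            for _ in range(length):
                out.append(out[-1 - off])
    return out


print(lzss_compress([5] * 8))   # [(0, 5), (1, 0, 6), (0, 5)]
```

In the example, a match of seven words is found, but since seven is not part of the assumed subset, a length of six is encoded and the remaining word is handled by a new search.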
Recent FPGAs offer high speed configuration interfaces that are up to 32 bit wide and that operate at 100 MHz or more [1], [2]. In order to match these interfaces, we could extend the word width to 32 bits. This would influence the compression as follows: With an increased word size, long prefix lengths become rarer. Therefore, the subset of encoded prefix lengths has to be adapted. In addition, the worst case compression ratio improves with higher word widths. However, it is also possible to enhance the throughput by increasing the clock frequency of the decompressor in combination with a serial-to-parallel converter at the output. Our LZSS decompressor consists of a few simple basic elements and the whole design can be easily pipelined. Consequently, we can easily build decompression modules that run at 200 MHz or more (e.g., when implemented on a Xilinx Virtex-II device). This speed in combination with a word width of 16 bit allows a decompression module to emit data at a rate of 400 MB/s. Thus, we can utilize the full configuration interface bandwidth of today's FPGAs.
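The quoted data rates follow from the clock frequency and the output word width. A minimal sketch, assuming that one word is emitted per clock cycle; the function names are illustrative.

```python
def output_rate_mb_per_s(clock_mhz, word_bits):
    """Data rate in MB/s of a unit emitting one word per clock cycle."""
    return clock_mhz * word_bits / 8


def required_decompressor_clock_mhz(interface_clock_mhz, w_fpga_bits, w_deco_bits):
    """Clock needed so that a narrower decompressor keeps the interface busy."""
    return interface_clock_mhz * w_fpga_bits / w_deco_bits


print(output_rate_mb_per_s(200, 16))                  # 400.0
print(output_rate_mb_per_s(200, 1))                   # 25.0
print(required_decompressor_clock_mhz(100, 32, 16))   # 200.0
```

A 16 bit decompressor at 200 MHz thus matches a 32 bit configuration interface running at 100 MHz, while a 1 bit toggle RLE decompressor at the same clock reaches only 25 MB/s.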
C. Huffman Compression
As an entropy encoding scheme, Huffman encoding [17] assigns short code words to frequently occurring words and longer code words to infrequently occurring words. Huffman encoding is based on a Huffman tree, which in turn is created upon a probability distribution of words. Creating an individual Huffman tree for each bitstream file is called dynamic Huffman encoding. Such a Huffman tree is part of the compressed data and has to be reconstructed before decompression. Such decompressors turn out to be costly in terms of hardware usage [18] .
Applying the same Huffman tree for all bitstreams is called static Huffman encoding. The static Huffman tree is based on the probability distribution of all words in our benchmark corpus and will never change. The compression ratio for the overall corpus is hence still optimal, whereas we could achieve better compression ratios for single bitstream files using dynamic Huffman encoding. We found that static Huffman compression performs well on our corpus, as can be seen in Figure 5. The higher the variation in the probability distribution, the better the compression ratio achievable with Huffman encoding. Let $h_{min}$ and $h_{max}$ denote the smallest and the largest Huffman code words found in our static Huffman tree and $|w|$ the word width used for compression. Then, the compression ratio for Huffman encoding is:

$$\frac{|h_{min}|}{|w|} \le \eta_{Huff} \le \frac{|h_{max}|}{|w|}$$

As we demand that the compression ratio is bounded in such a way that fractions of high entropy data will not completely cut off the throughput, we have to restrict the Huffman code. If we want to ensure that the worst case compression ratio $\hat{\eta}_{Huff}$ will not be larger than, e.g., 2, we have to limit the largest code word width to $|h_{max}| \le 2 \cdot |w|$.
We can further check the compression ratio for random data, which is determined by the average code word length over all Huffman code words $h(i)$:

$$\tilde{\eta}_{Huff} = \frac{1}{2^{|w|} \cdot |w|} \sum_{i=1}^{2^{|w|}} |h(i)|$$

In the case that $\hat{\eta}_{Huff}$ is larger than 2 or $\tilde{\eta}_{Huff}$ is too far away from 1, we have to reduce the depth of the Huffman tree. Reducing the Huffman tree does not essentially reduce the maximum configuration speed. As can be seen in the histogram in Figure 5, there is one code word with a probability of more than 50%; thus, it will be encoded in a classical Huffman tree with one bit. However, if we assume that the configuration memory interface can deliver 50% of the maximum configuration data rate, then we can spend up to 4 bits to encode this word before the configuration speed drops for a Huffman tree with $|w| = 8$ bit wide output words. In other words, before this point, the maximum throughput of the configuration interface will limit the configuration speed. By applying this technique, we will reduce the average compression ratio on the one side. But on the other side, we will enhance the configuration speed massively, because longer code word sizes can now be reduced.
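Both checks can be carried out directly on a table of code word lengths. The following is a minimal sketch, assuming that the worst case ratio is the largest code word length divided by the word width and that random data uses all words with equal probability. The toy code for 2-bit words and the function names are hypothetical.

```python
def huffman_ratio_bounds(code_lengths, w_bits):
    """Return (best, worst, random_data) compression ratios of a static code.

    code_lengths maps every word of the alphabet to its code word length.
    """
    lengths = list(code_lengths.values())
    best = min(lengths) / w_bits
    worst = max(lengths) / w_bits
    # Random data: all words are equally likely, so the ratio is the
    # average code word length relative to the word width.
    random_data = sum(lengths) / (len(lengths) * w_bits)
    return best, worst, random_data


def max_code_bits_without_slowdown(w_bits, memory_fraction):
    """Longest code word that a memory delivering `memory_fraction` of the
    configuration data rate can supply at full configuration speed."""
    return w_bits * memory_fraction


toy_code = {0b00: 1, 0b01: 2, 0b10: 3, 0b11: 3}   # hypothetical 2-bit alphabet
print(huffman_ratio_bounds(toy_code, 2))           # (0.5, 1.5, 1.125)
print(max_code_bits_without_slowdown(8, 0.5))      # 4.0
```

The second function restates the argument from the text: with a memory that delivers half of the configuration data rate, an 8-bit word may cost up to 4 code bits before the memory becomes the bottleneck.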
1) Huffman Hardware Decompression: After computing a static Huffman tree that fulfills our requirements, we can automatically generate the hardware decoder module. An example of the data path of such an implementation is shown in Figure 6. When a new word has to be decoded in the example, the start input becomes one and, according to the values in the so-called alignment register (flip-flops r0, ..., r2), exactly one output of the decoder tree becomes active. As the words of the original bitstream data are encoded into code words of different lengths $|h(i)|$, the alignment register is used to arrange the incoming data d such that the Huffman tree decoder always gets its incoming code word at exactly the same position. The outputs of the Huffman tree decoder indicate the decoded word of the alphabet {00, 01, 10, 11} that can be emitted in the following. Next, the alignment register is shifted right by the length $|h(i)|$ of the encoded word. For instance, if the output for word 01 becomes active, the alignment register is shifted by two, which is selected by the multiplexer input C2.
A control state machine (not shown in Figure 6) keeps track of the fill level of the alignment register. If the alignment register has space for a new input value due to the shift operation after decoding a word h(i), the alignment register is loaded with the next value from the input. The bottom multiplexers shift the input word to the rightmost free position in the alignment register. For instance, let us assume that r3, ..., r5 are empty (not containing a code word) and that r0=0 and r1=1, thus storing the encoded word for the output word 01. Then, upon start=1, the output 01 of the Huffman tree decoder will become active. In the following, r0 and r1 will be consumed and the last valid fragment of the compressed bitstream stored in r2 will be shifted into r0. Then we will have five free entries in the alignment register and a new input word is stored in the registers r1, ..., r4, which is selected by the multiplexer input f5.
The example points out that we require more logic for the input data alignment than for the Huffman tree decoder itself. Note that we require just one 4-bit look-up table to decode one of the output words, because in this example, no word requires more than the inputs start and r0, ..., r2 for decoding. The size of the alignment register $|r|$ depends on the input data width $|d|$ and the Huffman tree depth, which is equal to $|h_{max}|$. If the alignment register is missing just one bit to decode the next word, it must still be possible to load the alignment register with a new input word; thus, the total required alignment register width is $|r| \ge |d| + |h_{max}| - 1$.
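The interplay of alignment register and decoder can be modeled in software. This is a minimal sketch, assuming a 4 bit input, a largest code word length of 3 bit, and a hypothetical prefix code in which only the code word for 01 (r0=0, r1=1) is taken from the example. The remaining code words and all names are assumptions.

```python
# Hypothetical prefix code: code word (as read from r0 onwards) -> output word.
CODE = {"1": "00", "01": "01", "000": "10", "001": "11"}
H_MAX = max(len(c) for c in CODE)     # Huffman tree depth, here 3
D = 4                                 # input data width in bits
REG_WIDTH = D + H_MAX - 1             # required alignment register width, here 6


def huffman_decode(input_words, n_symbols):
    """Decode n_symbols words from a list of D bit wide input strings."""
    reg = ""                          # valid bits; reg[0] models r0
    pending = list(input_words)
    out = []
    while len(out) < n_symbols:
        # Load the next input word as soon as it fits into the free positions.
        while pending and len(reg) + D <= REG_WIDTH:
            reg += pending.pop(0)
        for length in range(1, H_MAX + 1):
            if len(reg) >= length and reg[:length] in CODE:
                out.append(CODE[reg[:length]])
                reg = reg[length:]    # shift by the code word length
                break
        else:
            raise ValueError("incomplete or invalid code word")
    return out


# 01 | 1 | 000 | 001 | 01 plus one padding bit, split into 4 bit input words.
print(huffman_decode(["0110", "0000", "1010"], 5))
# ['01', '00', '10', '11', '01']
```

With a register of 6 bit, a new input word can always be loaded when fewer than three valid bits remain, so the decoder in this model never has to wait for a partially available code word.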
The logic overhead for the multiplexers is enormous. For each occurrence of a Huffman code word width, we have to spend one input for the top multiplexers as well as one input for the bottom ones adjusting the input data. In addition, the structure contains long combinatorial paths and is hard to pipeline, because only when a decode phase has determined the next output word, we know from which position the next decode phase has to start.
However, as shown in Section IV, static Huffman encoding reaches the best compression ratios.

III. BENCHMARK

All compression algorithms exploit some statistical characteristics within the uncompressed data, like non-uniform probability distributions of code words or some regularities found in the input data. In the case of FPGA configuration bitstreams, the exploited characteristics stem from the configuration facility itself, because FPGAs are designed as general purpose devices for implementing any kind of digital hardware. Therefore, only a subset of the logic and routing resources is used to implement a specific module and the used logic has an encoded counterpart inside the configuration bitstream.
However, if we want to build a suitable benchmark for investigating bitstream compression algorithms, we have to generate bitstreams that represent the statistical characteristics for a wide range of different FPGA designs. Most FPGA designs will have a high logic utilization, because unused resources produce significant overheads like monetary cost, power consumption, or configuration time. Therefore, we generated benchmark bitstreams having a logic utilization of at least 90% of the available look-up table resources². Only the SoC [19] bitstreams have lower utilization ratios. These bitstreams stem from a system-on-a-chip design where parts of the FPGA resources are reserved for runtime reconfiguration.
In order to detect influences on the compression ratio that are related to the implemented algorithm of a module and not only to the logic utilization, we chose examples from different application domains:
* In the crypto core section, we chose a DES module [20] and an RC5 core [21]. In these modules, we found a

² The FIR filter design for the Virtex V is slightly below the 90% border. Here, the routing resources limit higher logic utilization.

Figure 7 lists the measured compression ratios for all benchmark bitstreams and some decompressor variations. As a reference, the figure lists the compression ratios achieved with gzip (V 1.3.5 --best) and for two vendor-specific dedicated configuration memory devices [22], [23] that both perform bitstream decompression. The results for our techniques reveal the best compression ratios for the Huffman decompressor, which in some cases outperforms gzip. Unfortunately, as shown in Table II, our Huffman decompressors consume a significant amount of hardware resources and achieve only moderate clock frequencies.
With two exceptions, one of our run length decompressors always performs better than the dedicated configuration devices, both of which allow a maximum output data rate of up to 40 MB/s, which seems to be limited by the internal flash memory. However, except for the Huffman case and the CPLD implementation, all of our implementations allow high clock frequencies of up to 200 MHz, and in each cycle, a word may be emitted.
The T-RLE* compression algorithm is based on a prefix free encoding for the run length, which has not been synthesized because of its unsuitability for a hardware implementation (see Section II-C.1). The output word width of the toggle run length accelerators is 1 bit while all other decompressors have an 8-bit output. Only the LZSS16 module has a 16-bit output that allows, due to the high clock speed, a configuration data rate of up to 400 MB/s. As discussed in Section II, the maximum clock frequency and the achieved compression ratio will not automatically deliver the minimal reconfiguration time. As an example, we measured the impact of the memory bandwidth on the configuration time for a fixed compression ratio, using one FFT benchmark bitstream (m = 131 kB for a Xilinx Spartan-III FPGA) that is compressed by the LZSS8 algorithm to $\eta = 41\%$.
We used an RTL simulation in order to measure the exact decompression time. Furthermore, in order to examine the impact of a FIFO between the memory and the decompressor, we performed these measurements for different FIFO sizes. While the decompressor was working at the full system clock speed of 100 MHz ($d_{FPGA}$), we restricted the accesses to the memory to $\lambda$ times the system clock by allowing a read operation only every $\frac{1}{\lambda}$-th cycle. The complete data path is 8 bit wide; thus, under optimal conditions, the time to configure a bitstream of size $m$ is $t_{conf} = \frac{m \cdot \eta}{\lambda \cdot d_{FPGA}}$ when the configuration interface is not limiting the time to $\frac{m}{d_{FPGA}}$. For the case that the memory can deliver data at full system clock speed ($\lambda = 1$), the configuration time for the m = 131 kB bitstream is 1.13 ms and the compression brings no benefit with respect to the reconfiguration time. Table III lists the configuration times for slower memories, the optimal time based on the memory data rate and the compression ratio, and the time a reconfiguration would take without compression. The results are for a benchmark bitstream that is close to the average found over all bitstreams and all accelerators. As can be seen, we cannot reach the optimal reconfiguration times, but still we can enhance the configu- 
