As FPGA logic density continues to increase, new techniques are needed to store initial configuration data efficiently, maintain usability, and minimize cost. In this paper, a novel compression technique is presented for Xilinx Virtex partially reconfigurable FPGAs. The technique relies on constrained hardware design and layout combined with a few simple compression techniques. It uses partial reconfiguration to separate a hardware design into two regions: a static region and a partial region. A bitstream containing only the static region is then compressed by removing empty frames. This bitstream is stored in non-volatile memory and used for initialization. The remaining logic is configured through partial reconfiguration over a communication network. By applying this technique, a high level of compression was achieved (almost 90% for the V4 LX25). The technique requires no extra decompression circuitry, and compression levels improve as device size increases.
INTRODUCTION
One of the challenges of using large FPGAs is the need to reliably store configuration data. FPGAs must be configured with the correct configuration data or they will not operate as expected. Fig. 1 shows the configuration file, or bitstream, sizes for devices from several of the Virtex families, listed in order from smallest to largest. Device size continues to grow at an exponential rate and as such more memory will be needed to store these bitstreams.
Continuing bitstream growth is especially problematic for FPGAs used in harsh radiation environments. The non-volatile memory used to store the initial bitstream must be radiation-hardened (RAD-HARD). Since RAD-HARD memory has much lower density than conventional non-volatile memory, many RAD-HARD memories are needed, taking up valuable board space. Further, RAD-HARD memories are much more expensive than conventional memories, resulting in very high system costs just to store FPGA bitstreams. Because of these costs, system designers have great incentive to reduce the size of the configuration memory as much as possible.
There have been several studies investigating the effectiveness of compressing FPGA bitstreams. One study used a simple run-length encoding compression to achieve a 3× reduction in the bitstream [1]. Another study compared the use of Huffman coding, arithmetic coding, LZ coding, and the use of "don't cares" for configuration bitstream compression [2]. One study investigated the ability to compress bitstreams by exploiting the redundancy between multiple bitstreams [3]. These studies all suggest that configuration bitstreams can be compressed.
All of these techniques, however, require additional circuitry to perform the online decompression. This hardware takes additional board space, increasing the size of the system design. For harsh radiation environments, this decompression circuitry needs to be RAD-HARD and is thus expensive. In addition, though the compressed bitstream is smaller than the original, as FPGA devices continue to increase in size and designers take full advantage of their resources, the size of the compressed bitstream will continue to increase.
This paper will present a bitstream compression technique called "bitstream compression using partial reconfiguration" (BCuPR) that does not rely on dedicated hardware for decompression. This technique reduces the initial bitstream size needed to initialize the FPGA. Partial reconfiguration is then used to provide the remaining configuration data after initialization. This technique was successfully implemented and tested on an Avnet Virtex-4 LX25 evaluation board.
BITSTREAM STRUCTURE
To understand the compression techniques applied in BCuPR, one needs a basic understanding of the configuration bitstream. The configuration bitstream contains all data necessary to configure the FPGA. While the current discussion focuses on Virtex-4 devices, the bitstream information in this section is applicable to other Virtex partially reconfigurable devices.
The Xilinx bitstream can be divided into individual units called packets. Each packet consists of a 32-bit header that defines which register will be written to or read from, the number of 32-bit data words that will be written to that register, as well as other information related to the packet command. Directly after the header come the data words that are associated with that packet as defined within the header. Many of the packets contain the start-up sequence and set configuration registers. The majority of the bitstream is made up of packets that write the actual configuration to the FPGA [4].
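The header fields described above can be illustrated with a short decoder. This is only a sketch: the paper does not give the bit layout, so the field positions and the register addresses below are assumptions taken from the public Virtex-4 configuration documentation for so-called Type 1 packets, and the names `Type1Header` and `parse_type1_header` are hypothetical.

```python
from typing import NamedTuple

# Assumed Type 1 header layout (not stated in this paper):
#   bits [31:29] packet type (001), bits [28:27] opcode,
#   bits [26:13] register address, bits [10:0] data word count.
OPCODES = {0: "NOP", 1: "READ", 2: "WRITE"}
REGISTERS = {1: "FAR", 2: "FDRI", 4: "CMD"}  # assumed subset of the register map


class Type1Header(NamedTuple):
    opcode: str
    register: str
    word_count: int


def parse_type1_header(word: int) -> Type1Header:
    """Decode one 32-bit packet header under the assumed layout."""
    if (word >> 29) & 0x7 != 0b001:
        raise ValueError("not a Type 1 packet header")
    opcode = OPCODES.get((word >> 27) & 0x3, "RESERVED")
    address = (word >> 13) & 0x3FFF
    register = REGISTERS.get(address, f"REG_{address}")
    return Type1Header(opcode, register, word & 0x7FF)
```

Under these assumptions, `parse_type1_header(0x30008001)` decodes as a one-word write to the command register, and `parse_type1_header(0x30002001)` as a one-word write to the FAR.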
Data within the bitstream is broken up into 32-bit words, but the minimum unit of configuration data that can be written to the FPGA is 41 words. This unit of data is called a frame. Data associated with packets that write new configuration data must come in multiples of 41. There are two write sequences that are used to write data to the FPGA which, for this discussion, shall be called the sequential and multiple-frame write (MFW) sequences. Each sequence consists of many packets that perform tasks from setting configuration registers to sending the configuration data to the device.
The sequential write sequence is used when a long string of non-identical frames needs to be written. The sequence consists of issuing a write config command, writing the initial configuration address to the frame address register (FAR), and then writing to the frame data input register (FDRI) all the frame data that needs to be written. The FDRI is a pipeline input stage that is used to write configuration data directly to the FPGA [6]. A dummy frame of zeros must be included at the end of the FDRI write so that the last frame of useful data will actually be written to the FPGA.
The multiple-frame write (MFW) sequence is used when there are many frames that are identical. It consists of issuing a write config command, loading a frame of data via an FDRI write, issuing a multiple-frame write command, and then writing the FDRI data to multiple addresses via a series of FAR and multiple frame-write register (MFWR) writes.
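The sequential write sequence can be sketched symbolically as three register writes. The representation below (a list of `(register, data)` tuples, the `"WCFG"` label for the write config command, and the function name `sequential_write`) is a hypothetical illustration rather than an encoded bitstream; only the 41-word frame size and the trailing dummy frame come from the text.

```python
WORDS_PER_FRAME = 41  # frame size stated in the text


def sequential_write(start_address, frames):
    """Return a symbolic packet list for one sequential write sequence."""
    for frame in frames:
        if len(frame) != WORDS_PER_FRAME:
            raise ValueError("each frame must hold exactly 41 words")
    # Concatenate the frame data in address order.
    payload = [word for frame in frames for word in frame]
    # The trailing dummy frame of zeros flushes the last real frame.
    payload += [0] * WORDS_PER_FRAME
    return [
        ("CMD", ["WCFG"]),         # issue the write config command
        ("FAR", [start_address]),  # set the starting frame address
        ("FDRI", payload),         # stream the frame data plus the dummy frame
    ]
```

Writing N frames therefore always carries (N + 1) × 41 data words in the FDRI packet.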
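To see why the MFW sequence pays off for identical frames, a rough word-count model of both sequences may help. This is a sketch under stated assumptions, not a measurement: it assumes one header word per packet (`HEADER_WORDS`), one data word for each command and FAR write, and a hypothetical `mfwr_data_words` payload for each MFWR write. None of these overheads are given in the paper.

```python
WORDS_PER_FRAME = 41  # frame size stated in the text
HEADER_WORDS = 1      # assumption: every packet has a single header word


def packet_words(data_words):
    """Total words for one packet: header plus data."""
    return HEADER_WORDS + data_words


def sequential_cost(n_frames):
    """Words needed to write n_frames with one sequential write sequence."""
    command = packet_words(1)                            # write config command
    far = packet_words(1)                                # starting address
    fdri = packet_words((n_frames + 1) * WORDS_PER_FRAME)  # frames + dummy frame
    return command + far + fdri


def mfw_cost(n_frames, mfwr_data_words=2):
    """Words needed to write n_frames identical frames with an MFW sequence."""
    # Write config command, one frame loaded via FDRI, then the MFW command.
    setup = packet_words(1) + packet_words(WORDS_PER_FRAME) + packet_words(1)
    # One FAR write and one MFWR write per target address.
    per_address = packet_words(1) + packet_words(mfwr_data_words)
    return setup + n_frames * per_address
```

In this model the sequential cost grows by 41 words per frame, while the MFW cost grows by only a handful of words per frame, which is the effect bitgen compression exploits.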
There are two main types of bitstreams used for configuration: full and partial. A full bitstream contains all the necessary commands needed to initialize the FPGA. It also contains a single sequential frame write sequence that writes from the first to the last frame address within the FPGA. A normal initial configuration bitstream is a full bitstream.
A partial bitstream does not contain initialization instructions because it is used to configure the FPGA during execution and after initial configuration. A partial bitstream writes a new partial module and overwrites with zeros all logic that is now unused by the new partial module within the partial region. An uncompressed partial bitstream does this by a series of consecutive write sequences that write to the addresses specified by the partial region.
BCUPR OVERVIEW
The primary goal of this work is to configure a design on the FPGA with a minimal amount of configuration data at initialization (i.e., reducing the amount of onboard configuration memory) and without the use of extra decompression logic. The BCuPR technique achieves this goal by dividing the circuit into two parts: a static region and a partial reconfiguration region (see Fig. 2). The static region is a small circuit that supports the reconfiguration of the rest of the device through partial reconfiguration. The partial reconfiguration regions contain all the remaining logic of the intended user design.
The configuration process occurs in two steps. During the first step, the static region is configured from non-volatile memory. During the second step, the rest of the circuit is configured by partially reconfiguring the remaining FPGA resources. The static circuit contains communication circuits for receiving partial bitstreams from an external host and the configuration circuits necessary for configuring these partial bitstreams onto the FPGA (i.e., via self-reconfiguration).
Because the initial static circuit is very small (relative to the size of the full FPGA), it can be compressed using a technique described in Section 4. With compression, the size of the bitstream for this static design is only a fraction of a full non-compressed bitstream. This allows the use of a much smaller non-volatile memory. For space environments where RAD-HARD memories are used, this will result in significant savings in cost and board space.
Additional memory storage is needed to store the partial configuration bitstreams used to complete the configuration process. This storage can be on a remote host or somewhere on a local network. These bitstreams define the main functions of the FPGA and are loaded at run-time.
Implementing this form of bitstream compression requires careful floorplanning and the use of the Xilinx partial reconfiguration design flow. Large designs must be split into the static region and partially reconfigurable regions loaded at run time. This technique also requires an offline compression step (this compression will be described in Section 4).
BITSTREAM COMPRESSION
To obtain the maximum compression, it is important to make the static design as small as possible. Doing so produces a full bitstream that consists largely of zeros. With so many zeros, almost any compression technique should achieve a good compression ratio. However, a compression technique that does not require decompression is desirable. This section describes two techniques that are capable of compression without the need for decompression. The first, bitgen compression, is readily available for Xilinx FPGAs via the manufacturer tools. The second, compression via removal, is a technique derived specifically for this work. Both techniques require small, underutilized designs with lots of zeros. Though compression via removal is optimal for BCuPR, results of both have been provided for comparison.
Bitgen Compression
The manufacturer bitstream generation tools provide an option in bitgen called "-g compress". This bitgen compression uses MFW sequences to minimize the size of the bitstreams. The full and partial uncompressed bitstreams, which contain only sequential write sequences, are replaced by bitstreams that contain both MFW and sequential write sequences.
By default, bitgen compression is enabled for partial bitstreams. This is useful because clearing unneeded logic used by previous partial modules requires writing a large number of frames that contain only zeros. Enabling bitgen compression is a logical choice to minimize the size of these bitstreams.
Fig. 3. (a) The old write sequence and all zero data/unused logic are removed from the bitstream. (b) New write sequences that write only the used logic are inserted into the gaps.
Bitgen compression is not enabled by default for full bitstreams. When FPGA resources are fully utilized, there will not be many similar frames and the advantages of compression will be minimal. If the FPGA's resources are underutilized, enabling bitgen compression could become worthwhile. Because the BCuPR initial bitstream does not use the majority of the FPGA's resources, bitgen compression can be applied to the initial bitstream for high levels of compression.
Compression Via Removal
While bitgen compression is useful for BCuPR, there is an even simpler compression method that can be applied specifically to initial bitstreams. Because the configuration memory is cleared sequentially any time the device is initialized [4], any frames that will contain only zeros at startup do not need to be written to the FPGA. Since the bitstream to be stored in on-board memory is used for initial configuration, this property can be exploited by removing every zero frame from the initial bitstream. Fig. 3 shows how this is done. The frames that contain zeros within the FPGA are first identified. These frames, as well as the original write command sequence that was used to perform configuration, must then be removed. New sequential command sequences that write only the used logic to its specific locations in memory must be inserted in the gaps created. The result is a bitstream that is functionally equivalent to a full bitstream but does not redundantly write zero frames to the FPGA configuration memory.
To implement this technique, a bitstream compression tool was created called BitstreamManip. BitstreamManip takes a full bitstream and parses it into its more basic, modifiable components. It then identifies the data frames that contain logic. Those frames are then taken with new write sequences that write only to the necessary locations in memory.
Just as bitgen compression is only useful for designs that do not use the full resources of the FPGA, compression via removal is only useful when there is unused logic within the device. If every frame contains data, there is zero percent compression. However, compression via removal is superior to the "-g compress" method for initial bitstreams. Each MFW sequence that was used for writing frames of zeros, along with its associated overhead, can now be completely removed from the bitstream.
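The removal steps described above can be sketched as follows. This is a minimal illustration and not the BitstreamManip implementation, which the paper does not detail. It assumes the frame data has already been extracted as a list of 41-word frames in address order, and it uses the list index as a stand-in for the real frame address; the function names and the dictionary output format are hypothetical.

```python
WORDS_PER_FRAME = 41  # frame size stated in the text


def nonzero_runs(frames):
    """Group consecutive non-zero frames into (start_index, [frames]) runs."""
    runs = []
    current = None
    for index, frame in enumerate(frames):
        if any(frame):
            if current is None:
                # A new run starts at the first non-zero frame after a gap.
                current = (index, [])
                runs.append(current)
            current[1].append(frame)
        else:
            # An all-zero frame ends the current run and is dropped.
            current = None
    return runs


def compress_via_removal(frames):
    """Emit one symbolic sequential write per run of non-zero frames."""
    sequences = []
    for start, run in nonzero_runs(frames):
        payload = [word for frame in run for word in frame]
        payload += [0] * WORDS_PER_FRAME  # dummy frame closing each FDRI write
        sequences.append({"far": start, "fdri": payload})
    return sequences
```

Each run costs one extra dummy frame plus its command overhead, so this sketch suggests that a layout keeping the used frames contiguous yields fewer write sequences and a smaller result.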
DESIGN OVERVIEW

Hardware Design
To take advantage of the compression techniques above, a hardware design was created (see Fig. 4) that separates the static communication circuitry from the complex logic in the partial region. The static design created consists of a very basic communications circuit, made up of a UART and a PicoBlaze, and configuration circuitry made up of the Internal Configuration Access Port (ICAP). The ICAP is a configuration port that allows for self-reconfiguration and is used in our design to perform reconfiguration of our partial regions. The UART receives the partial bitstream from an outside source, while the PicoBlaze is used to perform a simple error-checking protocol. The PicoBlaze is also used to feed the bitstream to the ICAP which then reconfigures the partial region. These components together make up the "base" portion of the static design. Also in the static design are the bus macros which connect the partial region with the rest of the device, as well as any other static routing going through the FPGA.
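The receive, check, and forward loop performed by the base design can be modeled in software. The paper does not specify the error-checking protocol, so the 8-bit additive checksum, the block format, and the names `checksum`, `load_partial_bitstream`, and `icap_write` below are all hypothetical stand-ins chosen only to show the data flow from UART to ICAP.

```python
def checksum(block: bytes) -> int:
    """Hypothetical 8-bit additive checksum (the real protocol is unspecified)."""
    return sum(block) & 0xFF


def load_partial_bitstream(blocks, icap_write):
    """Forward verified blocks to the ICAP; return how many were accepted.

    blocks: iterable of (payload_bytes, received_checksum) pairs, as they
            might arrive over the UART.
    icap_write: callable taking one byte, standing in for the ICAP port.
    """
    accepted = 0
    for payload, received in blocks:
        if checksum(payload) != received:
            # Stop on the first corrupted block; a real protocol would
            # more likely request retransmission.
            return accepted
        for byte in payload:
            icap_write(byte)
        accepted += 1
    return accepted
```

For example, passing `written.append` as `icap_write` for an empty list `written` collects the bytes that would have reached the configuration port.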
While more advanced communication protocols could have been chosen, this simple design was chosen because it minimizes the amount of logic needed. This base design was very small, taking up only 2.4% of the available slices on the Virtex-4 LX25 which was used for implementation.
Several partial modules were then created that performed various tasks. The first module created was the "empty" module. It ties the bus macros to constant values within the partial region and occupies only one slice. This is how the partial region is able to be "removed" from the initial design. Several other partial modules were created, including an LED counter and a MicroBlaze that controls an OLED display. Because the function of a partial module is independent of the static module, anything from an ALU to a software-defined radio could be implemented in the partial region.
Logic Minimization
To minimize the footprint of the static module and thus the size of the static bitstream, the partial reconfiguration flow was modified (see Fig. 5) to constrain the location of the base module. Careful attention was given to minimize the amount of interconnect throughout the design. This is important because any frame through which routing passes is a frame that must be included in the initial bitstream even after compression.
Even though the base region is constrained, this does not stop necessary routing from adding to the size of the design. There are several signals such as the clock signals, bus macros, interconnect to the ICAP, and interconnect to I/O that still must be present and for which placement is hard to control.
RESULTS
To show the effectiveness of the empty module, two static designs were created: one containing the empty module and one containing the MicroBlaze module. Fig. 6 displays the layout of the two partially reconfigurable circuits. On the right is the "empty" design that contains the static design and the bus macros. On the left is the MicroBlaze design containing the MicroBlaze soft processor core. As seen in this figure, the MicroBlaze partial module has much more logic than the empty module. Even with the MicroBlaze, configuration file size can be reduced to 28.7% of the original by applying BCuPR. The base design and the MicroBlaze module together consume 1076 slices or 10% of the LX25's resources. Once the MicroBlaze has been removed, a bitstream is created that is 10.3% of the original size. This bitstream, which contains the base and empty modules, takes up 262 slices or 2.4% of device resources. This significantly reduces the amount of non-volatile memory needed for initial configuration.
A comparison of bitgen compression to compression via removal was also made and the full results from the tests are found in Table 1 . Results are provided for both unconstrained and constrained base modules. Even though the design only takes up 2.4% of available slices for the LX25, compression is about ninety percent. This is because data frames represent multiple slice locations; while logic may go through one slice and change only a few bits within the frame, the whole frame must still be included in the final bitstream.
Because our static design remains constant, the compressed file sizes remain almost constant with increased device size. This allows the LX200 bitstream to be compressed to only 2.4% of its original size.
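The scaling argument can be made concrete with a small calculation. The frame counts in the usage note are hypothetical round numbers chosen for illustration and are not measurements from this work; the function name `percent_of_original` is likewise an assumption.

```python
def percent_of_original(compressed_frames, device_frames):
    """Size of the compressed bitstream as a percentage of the full bitstream."""
    return 100.0 * compressed_frames / device_frames
```

If the static design needed 500 frames, it would be 10% of a hypothetical 5,000-frame device (`percent_of_original(500, 5000)`) but only 1.25% of a hypothetical 40,000-frame device (`percent_of_original(500, 40000)`), because the numerator stays roughly fixed while the denominator grows.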
CONCLUSIONS
In conclusion, this work demonstrates that it is possible to compress configuration bitstreams without the need for decompression circuitry by dividing a circuit into static and partial regions. This technique demonstrates a 90% compression of the static region's bitstream for the LX25 device. The communication and configuration circuits within the static region were successfully used to configure partial regions which then performed useful work.
There are a few design challenges faced in implementing this process. First, it is essential to create a small static circuit to be configured at initialization. This involves creating a small logic circuit as well as careful floorplanning to limit the number of frames used to configure the circuit. Second, the use of the partial reconfiguration flow increases the complexity and time of the design.
While these difficulties exist, the results suggest that this technique provides significant compression for initial bitstreams. By moving the majority of the design off-chip for initialization, the amount of non-volatile memory needed to store the initial bitstream is minimized. While this moves the configuration data somewhere else (i.e., the partial modules need to be stored somewhere), the initial bitstream has been effectively and efficiently compressed without the need for decompression circuitry.
