Abstract. Pipeline morphing is a simple but effective technique for reconfiguring pipelined FPGA designs at run time. By overlapping computation and reconfiguration, the latency associated with emptying and refilling a pipeline can be avoided. We show how morphing can be applied to linear and mesh pipelines at both word-level and bit-level, and explain how this method can be implemented using Xilinx 6200 FPGAs. We also present an approach using morphing to map a large virtual pipeline onto a small physical pipeline, and discuss the trade-offs involved.
Introduction
Pipeline architectures are commonly used in high-performance designs. This paper introduces morphing, a technique for enhancing the efficiency of reconfigurable pipelines at run time. We shall also describe the use of morphing in the emulation of large virtual pipelines by small physical pipelines, and explain how temporary storage can be used to improve performance.
Implementing pipeline architectures using reconfigurable devices is attractive for several reasons. Many FPGAs facilitate the realisation of pipelines, since they have a regular structure and an abundance of registers. Moreover, leading-edge FPGAs often provide built-in support for fast reconfiguration. An example is the wildcarding facility for Xilinx 6200 devices [1], which allows concurrent reconfiguration of a block of FPGA cells. With such facilities, it is possible to reconfigure each pipeline stage rapidly at run time to implement multiple functions. For a system operating in an unpredictable environment, this possibility enables the selection of functions adaptively.
Partial reconfiguration [4] is a powerful method of exploiting the flexibility of FPGAs such as the Xilinx 6200: one part of the FPGA can be reconfigured while other parts are continuing to function. Pipelines provide a simple but effective scheme for partial reconfiguration, since pipeline registers isolate one pipeline stage from another so that computation and reconfiguration can take place at the same time without interference. The regular structure of pipelines also simplifies the development of hardware operators which can be relocated to different regions of a pipeline, maximising the re-use of design effort.
An obvious method for reconfiguring an n-stage pipeline involves three steps. First, one needs to complete the current computation and clear the pipeline, which takes n cycles. Then reconfiguration can take place. Finally, one has to wait for n cycles for the result to flow through the newly-configured pipeline. This method of reconfiguring a pipeline leads to a latency of 2n cycles, in addition to the time for reconfiguring all the pipeline stages. In highly-pipelined systems where n is large, the pipeline latency cycles and the reconfiguration time will have a significant impact on performance.
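The cost of this obvious method can be written down directly. The following minimal sketch assumes that the reconfiguration time is expressed in clock cycles; the function names are illustrative and do not come from any tool.

```python
def flush_reconfigure_refill_cycles(n_stages: int, reconfig_cycles: int) -> int:
    """Cycles for the obvious method: flush (n) + reconfigure + refill (n)."""
    return n_stages + reconfig_cycles + n_stages


def latency_overhead_fraction(n_stages: int, reconfig_cycles: int) -> float:
    """Fraction of the total cost spent on the 2n flush and refill cycles."""
    total = flush_reconfigure_refill_cycles(n_stages, reconfig_cycles)
    return (2 * n_stages) / total
```

With n = 3 and four reconfiguration cycles, the values reported for the adder example later in this paper, the first function gives ten cycles, of which six are flush and refill latency.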
Pipeline Morphing
We present a method, called pipeline morphing, for reducing the latency involved in reconfiguring from one pipeline to another. The basic idea is to overlap computation and reconfiguration: the first few pipeline stages are reconfigured to implement new functions, so that data can start flowing into the newly-configured stages of the pipeline while the remaining stages complete the current computation. Instead of changing the entire pipeline at once, our method involves morphing one pipeline into another, just as one can morph two images by interpolating them. In our case, the pipeline registers isolate one pipeline stage from its neighbours, enabling computation and reconfiguration to take place concurrently in different stages. Figure 1 shows how a five-stage pipeline F can be morphed into a pipeline G in five steps. It should be clear from this example that during morphing, the flow of reconfiguration is synchronous with the flow of data, and hence the pipeline latency cycles are eliminated. If the time for reconfiguration is longer than the pipeline processing time, the pipeline will need to include flow control mechanisms to slow down the rate of data flow while morphing is taking place. We shall explain later how this can be achieved in Xilinx 6200 FPGAs. Whether morphing is used or not, a designer's task is to ensure that the slowing down due to run-time reconfiguration will not affect system performance. For instance, in video processing there may be sufficient time for reconfiguration between the end of one image frame and the beginning of the next frame.
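The morphing of a linear pipeline F into a pipeline G can be mimicked in software. The sketch below is a behavioural model only, under the assumption that each stage can be reconfigured within one clock cycle; the names clock and run_with_morphing are hypothetical. At morphing step k, stage k is switched to its G function just before the clock edge, so data that entered under F still meets only F stages and data that enters afterwards meets only G stages.

```python
from typing import Callable, List, Optional, Sequence

Stage = Callable[[int], int]


def clock(stages: Sequence[Stage], regs: List[Optional[int]],
          x: Optional[int]) -> List[Optional[int]]:
    """One clock edge: stage i reads register i-1 (stage 0 reads the input x).

    None models an empty pipeline slot (a bubble).
    """
    feed = [x] + regs[:-1]
    return [None if v is None else f(v) for f, v in zip(stages, feed)]


def run_with_morphing(f_stages: Sequence[Stage], g_stages: Sequence[Stage],
                      old_inputs: Sequence[int],
                      new_inputs: Sequence[int]) -> List[int]:
    """Stream old_inputs through F, then morph to G without flushing."""
    n = len(f_stages)
    stages = list(f_stages)
    regs: List[Optional[int]] = [None] * n
    outputs: List[int] = []

    for x in old_inputs:
        regs = clock(stages, regs, x)
        if regs[-1] is not None:
            outputs.append(regs[-1])

    # Trailing bubbles drain the pipeline once the new inputs are exhausted.
    stream: List[Optional[int]] = list(new_inputs) + [None] * n
    for step, x in enumerate(stream):
        if step < n:
            # Reconfiguration advances one stage per cycle, in step with the data.
            stages[step] = g_stages[step]
        regs = clock(stages, regs, x)
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs
```

No bubble is inserted between the last F input and the first G input, which is the sense in which the 2n latency cycles disappear in this model.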
Because of the elimination of latency cycles, pipeline morphing will improve the performance of systems that reconfigure at run time. It is particularly suitable for devices supporting rapid reconfiguration, and it works best when reconfiguration time is comparable to the pipeline computation time. To meet this condition, the user can build single-cycle reconfigurable structures in an FPGA [7]. FPGAs specially designed for supporting rapid reconfiguration [10], [11] will particularly benefit from pipeline morphing. Morphing can also be applied to systems with multiple FPGAs arranged as a pipeline.

Our method is not confined to linear pipelines. It can be applied to pipelines of other shapes, such as two-dimensional meshes or tree-shaped designs. Figure 2 shows the steps of morphing from one mesh to another, given that every component in the mesh has a pipeline register at each of its two outputs. This approach can also be applied to bit-level operators. In the next section, we shall explain how a pipelined adder can be morphed into a pipelined subtractor.
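For the mesh case, one plausible reconfiguration order is a diagonal wavefront. The sketch below is not taken from Figure 2. It assumes that data moves right and down through the mesh, one register per hop, so that the component at row r and column c first meets new data r + c cycles after the component at the corner. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mesh_morph_schedule(rows: int, cols: int) -> Dict[int, List[Tuple[int, int]]]:
    """Map each morphing step to the mesh components reconfigured in that step.

    Assumes data flows right and down, so component (r, c) is reconfigured
    at step r + c, just ahead of the first new data item reaching it.
    """
    schedule: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for r in range(rows):
        for c in range(cols):
            schedule[r + c].append((r, c))
    return dict(schedule)
```

Under this assumption a mesh with R rows and C columns is morphed in R + C - 1 steps, and all components on one anti-diagonal are reconfigured together.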
Morphing Pipelines on Xilinx 6200 Devices
We illustrate the morphing reconfiguration technique by showing how it can be applied to reconfigure a six-bit, three-stage pipelined adder into a pipelined subtractor of the same size on a Xilinx 6200 FPGA. The pipelined adder is shown in Figure 3a and the resulting pipelined subtractor is shown in Figure 3d. As explained above, if all n stages of a pipelined adder (n = 3 in this example) are reconfigured into a pipelined subtractor at once, an additional 2n clock cycles would be needed in order to flush the pipeline and to refill it. In order to avoid this delay, the reconfiguration is performed in three steps, where the reconfiguration of each stage of the pipelined adder is followed by one clock cycle of computation. These three reconfiguration steps are shown in Figures 3b, 3c and 3d. The partial configuration information involved in these steps was obtained using tools described in [8].
A dual clocking scheme is used in order not to clock invalid data values into the pipeline registers during reconfiguration. The two operand values for the adder are stored in two six-bit registers. When using the processor interface [1] on the Xilinx 6200 FPGA, a pulse is produced whenever a register is written with a value, and this pulse can be used as a clock for the registers within the design. The input registers are set up so that a clock pulse is generated when the second operand is written into the register. An additional configuration clock is used to control the reconfiguration of the logic in each pipeline stage. The reconfiguration sequence is therefore broken down into three steps. First, a stage of the pipeline is reconfigured; on completion the operands are written into the input registers; the write action then triggers the clock for the pipeline registers. This sequence is continued until all the stages are reconfigured.
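The reconfigure, write, clock sequence can be illustrated with a small behavioural model. This is a sketch under assumptions that may differ from Figure 3: each of the three stages handles two bits and passes its carry on through the pipeline register, and subtraction is performed as a + NOT(b) + 1 with the carry-in forced to 1 at the first stage. The class and method names are hypothetical and do not correspond to any Xilinx interface.

```python
from typing import List, Optional, Tuple

BITS_PER_STAGE = 2
N_STAGES = 3
MASK = (1 << BITS_PER_STAGE) - 1

# In-flight item: (a, b, partial result, carry into the next stage)
Item = Tuple[int, int, int, int]


def stage_logic(k: int, mode: str, item: Item) -> Item:
    """Combinational logic of stage k, configured as 'add' or 'sub'."""
    a, b, acc, carry = item
    a_k = (a >> (k * BITS_PER_STAGE)) & MASK
    b_k = (b >> (k * BITS_PER_STAGE)) & MASK
    if mode == "sub":
        b_k = ~b_k & MASK          # invert the b bits of this slice
        if k == 0:
            carry = 1              # carry-in of 1 completes the two's complement
    total = a_k + b_k + carry
    acc |= (total & MASK) << (k * BITS_PER_STAGE)
    return (a, b, acc, total >> BITS_PER_STAGE)


class MorphingAdder:
    def __init__(self) -> None:
        self.modes: List[str] = ["add"] * N_STAGES
        self.regs: List[Optional[Item]] = [None] * N_STAGES

    def reconfigure(self, k: int, mode: str) -> None:
        """Models the configuration clock: changes logic, never the registers."""
        self.modes[k] = mode

    def write_operands(self, a: int, b: int) -> Optional[int]:
        """Writing the second operand pulses the pipeline clock once."""
        feed = [(a, b, 0, 0)] + self.regs[:-1]
        self.regs = [None if it is None else stage_logic(k, self.modes[k], it)
                     for k, it in enumerate(feed)]
        out = self.regs[-1]
        return None if out is None else out[2]
```

A morph then consists of calling reconfigure(k, "sub") followed by write_operands(...) for k = 0, 1, 2. Because the pipeline registers are clocked only by the operand write, sums already in flight complete under the adder logic while new operands are subtracted.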
In the above example, it takes four cycles to reconfigure the pipelined adder to a pipelined subtractor. Without morphing it takes an additional three cycles to flush the pipeline and three cycles to refill it; hence a total of ten cycles are needed to perform the reconfiguration and to begin producing correct results. The morphing technique therefore reduces the reconfiguration time by a factor of 2.5; Table 1 summarises these results. Clearly the higher the degree of pipelining, the larger the improvement that can be obtained using the morphing technique, since it takes more cycles to evacuate and to refill the pipeline.

Table 1. Comparing morphing and non-morphing reconfiguration of a pipelined adder into a pipelined subtractor. N/A is short for "not applicable".

Virtual Pipelines
An advantage of adopting a pipeline structure is the ease of mapping a large virtual pipeline onto a small physical pipeline. Our approach involves feeding back partial results to the physical pipeline, which morphs between different sections of the virtual pipeline. The performance of such a system can often be enhanced by a temporary storage (Figure 4), as we shall discuss later. Let us begin with an example: the mapping of a six-stage virtual pipeline onto a three-stage physical pipeline (Figure 4). The first three stages of the virtual pipeline are time-multiplexed with the last three stages. Morphing is used to replace one of the two pipeline configurations by the other.
This design operates as shown in Figure 4. Note that the physical pipeline operates in two modes: the "fill" mode and the "feedback" mode. In the fill mode, the first pipeline stage is connected to the external input and data start filling up the pipeline. Once the pipeline is filled up, partial results will emerge and will be stored in the temporary storage. When all input data have been processed by the first three stages of the virtual pipeline or when the temporary storage is full, the pipeline will operate in the feedback mode.
When an n-stage physical pipeline first starts in the feedback mode, its first stage will be reconfigured to become the (n + 1)-th stage of the virtual pipeline (Figure 4b). After reconfiguration is completed in the first pipeline stage, it is provided with the partial results from the temporary storage. When the computation is completed, the second stage of the physical pipeline will be reconfigured to become the (n + 2)-th stage of the virtual pipeline (Figure 4c), and so on. For the example in Figure 4, eventually the partial results will flow through the physical pipeline configured to be the last three virtual pipeline stages (Figure 4d).
When the virtual pipeline has been emulated once, the result will start to emerge at the external output. When new data can be accepted, the physical pipeline will operate in fill mode again and will morph back to the first three virtual pipeline stages. Note that adequate flow control is necessary to stop the external input while the pipeline is operating in feedback mode. The next section will describe the use of the temporary storage to optimise pipeline performance.
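The fill and feedback modes can be illustrated with a behavioural model. The sketch below makes several simplifying assumptions: the temporary storage is an unbounded FIFO, so every section of the virtual pipeline is emulated once; the virtual pipeline length is a multiple of the physical length; and each stage is reconfigured in a single cycle. The function name and the tagging of each item with the number of virtual stages it has passed are illustrative devices, not part of the design in Figure 4.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence, Tuple

Stage = Callable[[int], int]
Item = Tuple[int, int]  # (value, number of virtual stages already applied)


def emulate_virtual_pipeline(virtual: Sequence[Stage], n_phys: int,
                             inputs: Sequence[int]) -> List[int]:
    if len(virtual) % n_phys != 0:
        raise ValueError("virtual length must be a multiple of n_phys")
    total = len(virtual)
    stages: List[Stage] = list(virtual[:n_phys])
    regs: List[Optional[Item]] = [None] * n_phys
    external: Deque[Item] = deque((x, 0) for x in inputs)
    storage: Deque[Item] = deque()   # temporary storage for partial results
    outputs: List[int] = []

    def clock(new_item: Optional[Item]) -> None:
        nonlocal regs
        feed = [new_item] + regs[:-1]
        regs = [None if it is None else (f(it[0]), it[1] + 1)
                for f, it in zip(stages, feed)]
        leaving = regs[-1]
        if leaving is not None:
            if leaving[1] == total:
                outputs.append(leaving[0])   # finished: external output
            else:
                storage.append(leaving)      # partial result: feed back later

    for first in range(0, total, n_phys):
        # Section 0 reads the external input (fill mode); later sections
        # read the temporary storage (feedback mode).
        source = external if first == 0 else storage
        fed, step = 0, 0
        while fed < len(inputs) or step < n_phys:
            if step < n_phys:
                stages[step] = virtual[first + step]   # morph one stage per cycle
            item = None
            if fed < len(inputs) and source and source[0][1] == first:
                item = source.popleft()
                fed += 1
            clock(item)
            step += 1

    for _ in range(n_phys):   # drain the final section
        clock(None)
    return outputs
```

In this model the tail of one section is still in the physical pipeline while the head of the next section enters, so the pipeline is never flushed between sections.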
Temporary Storage
First, note that if the physical pipeline only supports global reconfiguration, the temporary storage shown in Figure 4 is needed to hold partial results while the entire pipeline is being reconfigured. If the pipeline can be partially reconfigured at run time, then the temporary storage is not necessary, as the pipeline itself can provide storage of partial results.
However, a small temporary storage will result in frequent reconfiguration, since the physical pipeline has to operate in "feedback" mode (see previous section) once the temporary storage is full. It will remain in the "feedback" mode until outputs are ready, which will then free up space for further inputs. If the combined storage in the pipeline and the temporary storage is large enough to contain all input data, then each virtual pipeline stage will only need to be emulated once. Having sufficient temporary storage is particularly important for pipelines which require a long reconfiguration time, since it would be desirable to minimise the frequency of reconfiguration for these pipelines.

Let us now consider different ways of implementing the temporary storage. If a large amount of temporary storage is required, then external memory can be used; otherwise on-chip registers or embedded memories within the FPGA may be sufficient. As explained above, pipelines supporting rapid reconfiguration can afford a small temporary storage. When this happens, the feedback connections can be made entirely on-chip, possibly using global connections in the FPGA so that output data from the last stage can be fed back to the first stage in the feedback mode. Global connections are provided in most FPGAs; such connections can themselves be pipelined to ensure high performance.
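The relation between storage capacity and reconfiguration frequency described above can be estimated with a simple counting model. This sketch is an illustration rather than a measured result. It assumes that the input is processed in batches no larger than the combined storage capacity, that each batch passes through every section of the virtual pipeline in turn, and that the pipeline starts out configured as the first section. The function name is hypothetical.

```python
import math


def section_switches(n_items: int, capacity: int, n_sections: int) -> int:
    """Number of times the physical pipeline morphs to a different section.

    Each batch needs (n_sections - 1) forward morphs, plus one morph back
    to the first section between consecutive batches.
    """
    if n_items == 0:
        return 0
    batches = math.ceil(n_items / capacity)
    return batches * (n_sections - 1) + (batches - 1)
```

For a two-section virtual pipeline and 1000 input items, this model gives a single switch when the storage holds all 1000 items, and 19 switches when it holds only 100.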
Another way of implementing a physical pipeline with little or no temporary storage is to partition the physical pipeline into two halves, and "fold" one half of the pipeline onto the other half by interleaving the components (Figure 5). This method avoids global connections at the expense of requiring an efficient integration of non-neighbouring elements in a pipeline structure.

Virtual Pipelines on Xilinx 6200 PCI System
The viability of virtual pipelines has been demonstrated by a PCI board supplied by Xilinx Development Corporation, which contains a Xilinx 6216 or a Xilinx 6264 device and four 8-bit wide memories organised into two banks [7]. Each bank of memory can be accessed from either of the two separate address busses (Figure 6), and each of the four memories can be controlled individually. This memory architecture allows multiple modes of operation to be set up by selecting multiplexers and bus switches for flow control in the desired manner. This system provides a flexible platform for implementing virtual pipelines. One possibility is to use the two memory banks to provide the temporary storage (Figure 4) for a virtual pipeline. Partial results can be stored into one memory bank, and they can be used later when the FPGA has been reconfigured to implement a different region of the virtual pipeline. Another possibility is to use the on-chip registers of the Xilinx 6200 FPGA to implement the temporary storage. If registers or global connections are at a premium, the folded structure (Figure 5) may prove to be an appealing alternative in implementing a physical pipeline.

Fig. 6. Xilinx 6200 PCI system.
Summary
This paper introduces morphing, an effective technique for reconfiguring pipelines. We explain how morphing can be applied to linear and mesh pipelines at word-level and bit-level, and how it can be implemented in Xilinx 6200 FPGA technology. We also describe the mapping of large virtual pipelines onto small physical pipelines, and how the resulting implementations can benefit from morphing.
Current and future work includes extending the scope of morphing to cover various architectural templates; this extension will enable us to morph between pipelines with different numbers of pipeline stages, or to morph a linear pipeline into a tree-shaped architecture. Frequently there are multiple ways of morphing between designs, and it will be important to evaluate the trade-offs involved.
The use of languages such as Ruby [2] and VHDL [9] for modelling hardware morphing is also being explored; we expect such work to result in techniques and tools for automating the implementation of morphing and virtual pipelines.
