This paper presents the implementation of a MIMO V-BLAST (Vertical Bell Laboratories Layered Space-Time) square root decoder in an FPGA using dynamic partial reconfiguration. The decoder architecture is based on four CORDIC (COordinate Rotation DIgital Computer) units, three of which are used in rotation mode and one in vectoring mode. The implementation aims at power saving and area efficiency by dynamically changing, within the reconfigurable modules, the interconnections between the fixed modules. The design demonstrates the configuration time improvement, area efficiency and flexibility that dynamic partial reconfiguration brings to the decoder.
Introduction
Dynamically reconfigurable FPGAs open a new design space with two main benefits: flexibility and reusability at run time. Dynamic reconfiguration is closely related to the partial reconfigurability of the FPGA. Partial reconfigurability makes it possible to selectively change segments of the FPGA functionality without suspending the operation of the remaining parts. It also reduces the configuration time and saves memory, as the partial reconfiguration files (bitstreams) are smaller than full ones.
Reconfigurable computing [3], [6] has been proposed for a wide range of signal processing applications in order to improve performance, flexibility and adaptability. The development of wireless communication systems has shown the need to dynamically adapt system architectures at the hardware level, as in Software Radio systems [8].
One of the most promising technologies for enhancing the performance of wireless communications is Multiple-Input Multiple-Output (MIMO). MIMO is an attractive technology for future wireless systems because of its huge bandwidth capacity. It is well known that an extraordinary spectral efficiency can be achieved in a MIMO system [5]. Among the various MIMO detection algorithms, the square root decoder is an interesting tradeoff that obtains high performance with reasonable complexity.
In our previous work [9], we implemented a reconfigurable MIMO decoder architecture with various numbers of CORDIC operators. It adapts to different numbers of antennas, different signal constellations and different throughputs for wireless communications.
In this paper we introduce dynamic reconfiguration into the MIMO V-BLAST decoder architecture. In particular, we detail our design experiments on integrating the control of configuration into the processing algorithm during MIMO decoding. Dynamic partial reconfiguration is used to change the interconnections between the processing modules.
The rest of the paper is organized as follows. The square root algorithm and its block diagram are briefly described in section 2 and detailed further in [9]. The reconfigurable architecture for the square root decoder is detailed in section 3. Section 4 deals with the configuration management. Section 5 presents the design methodology. The experimental results are provided in section 6. The conclusions and a look at future research are given in section 7.
formations. The computational cost is reduced effectively from O(M^4) to O(M^3) without degradation in BER performance, where M is the number of transmit antennas. The whole algorithm is described in the following steps: [1]. It reduces the computational complexity significantly.
In the module M1, the elements of equation (1) [7], (for example, signal-to-noise ratio,
The task scheduling of both types of dynamic reconfiguration is illustrated in figure 3. These simple examples show that in the first case (figure 3.1), the reconfiguration task must be controlled by the processing to avoid data loss: T2 must wait for the end of the processing of T1 and for the end of the reconfiguration. In the second case, by contrast, the dynamic reconfiguration can be performed at any time except while the reconfigurable block is processing data. The first case occurs when the processing blocks of a large function are multiplexed during the processing, which can be useful to reduce the area of the function. In this paper we focus on this first type of dynamic reconfiguration. In particular, we use it to reconfigure point-to-point interconnections in order to change the datapath between the processing elements of a MIMO decoder.
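The ordering constraint of the first case can be sketched as a small timeline computation. This is only an illustration: the function name `schedule_first_type` and the durations (in arbitrary time units) are hypothetical, and only the serialization T1, then reconfiguration, then T2 is taken from the text.

```python
def schedule_first_type(t1_proc, reconf, t2_proc):
    """Return (start, end) times for T1, the reconfiguration and T2.

    In the first type of dynamic reconfiguration the reconfigurable block
    is shared, so the three activities are strictly serialized.
    """
    t1 = (0, t1_proc)
    rc = (t1[1], t1[1] + reconf)     # reconfiguration waits for the end of T1
    t2 = (rc[1], rc[1] + t2_proc)    # T2 waits for the end of the reconfiguration
    return {"T1": t1, "reconf": rc, "T2": t2}


# Hypothetical durations, chosen only to make the ordering visible.
timeline = schedule_first_type(t1_proc=20, reconf=5, t2_proc=20)
print(timeline)
```

With this ordering the total latency is the sum of the two processing times and the reconfiguration time, which is why the reconfiguration time directly affects the processing time in this first type.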
In the dynamic reconfigurable architecture, shown in figure 4, the configuration manager is an embedded processor (such as the Xilinx MicroBlaze) and the configuration interface is
Figure 5. Datapath of first type of dynamic reconfiguration
The datapath of the first type of dynamic reconfiguration is shown in figure 5. There is a data dependency between the two configurations T2 and T3. The processing time therefore depends on the reconfiguration time of the two contexts (T2 and T3) of the reconfigurable module RM. In the second type (figure 2.2), on the contrary, the processing is not interrupted by the reconfiguration for every data item. In this paper, the first type of dynamic reconfiguration is used in the MIMO decoder and the reconfiguration is executed as one part of the decoding process. Partial reconfiguration reduces the reconfiguration time and saves storage memory, as the partial reconfiguration files are smaller than the full ones.
Square Root Decoder Overview
In our previous work, we implemented a reconfigurable MIMO decoder architecture with various numbers of CORDIC operators [9]. It adapts to different numbers of antennas and different throughputs for wireless communication.
In figure 8) are performed by 3 CORDIC operators. The operators are the same as in the first cycle and are re-used in the second cycle, and so on for the following cycles. After each iteration, the produced data are used in a different way, so the interconnections between operations must change every cycle. The whole processing is performed by 3 CORDIC operators in 10 cycles.
figure 6). This fully parallel structure may waste computational capability, since the channel data change more slowly than the received symbol data. The iterative use of a few CORDIC operators therefore makes better use of the resources. We have accordingly implemented a decoder with an iterative structure that uses three parallel CORDIC operators to replace the 29 CORDIC operators of the fully parallel structure. All iterations of the CORDIC algorithm are performed in parallel, using a 20-step pipelined structure. The input data of the CORDIC operators change periodically, and a static implementation of the interconnection framework uses a great number of multiplexers to switch from one interconnection context to the next. These multiplexers occupy a large area of the FPGA and waste power. Nevertheless, they remain in the same state during the 20 steps of the CORDIC operations; the only difference between successive 20-step periods is the interconnections. This observation motivates the implementation on reconfigurable hardware, as shown in figure 9.
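To make the role of each CORDIC operator concrete, the following floating-point sketch models the two CORDIC modes (rotation and vectoring). It is a behavioural illustration under stated assumptions, not the fixed-point pipelined hardware of the decoder: the names `cordic`, `N_ITER`, `ANGLES` and `GAIN` are illustrative, and 20 iterations are used only to mirror the 20 pipeline steps mentioned above.

```python
import math

N_ITER = 20  # assumption: one iteration per pipeline step
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]

# Constant gain accumulated by the shift-and-add micro-rotations.
GAIN = 1.0
for i in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic(x, y, z, vectoring):
    """Run N_ITER CORDIC micro-rotations and return gain-corrected (x, y, z).

    Rotation mode  (vectoring=False): rotates (x, y) by the angle z.
    Vectoring mode (vectoring=True):  rotates (x, y) onto the x axis and
    accumulates the rotation angle in z (requires x > 0).
    """
    for i in range(N_ITER):
        if vectoring:
            d = -1.0 if y > 0 else 1.0    # drive y toward zero
        else:
            d = 1.0 if z >= 0 else -1.0   # drive z toward zero
        shift = 2.0 ** -i
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN, z


# Rotation mode: rotate the unit vector by 30 degrees.
c, s, _ = cordic(1.0, 0.0, math.pi / 6, vectoring=False)
# Vectoring mode: magnitude and angle of the vector (3, 4).
mag, resid, ang = cordic(3.0, 4.0, 0.0, vectoring=True)
print(c, s, mag, ang)
```

In hardware the multiplications by `2.0 ** -i` become wired shifts, which is why only the interconnections, and not the operators themselves, need to change between cycles.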
Our approach splits the processing into a static hardware skeleton, composed of the decoding processing elements, and a reconfigurable part that evolves at run time depending on the processing step to perform. The processing elements, three CORDIC operators, are implemented in the fixed part, and the interconnections between the processing elements are implemented in the reconfigurable module (shown in figure 9). The multiplexers are thus replaced by a certain number of reconfigurations, which are determined by the decoding process. Each reconfiguration represents one state of the multiplexers. It is suitable that [11]. The switch is implemented on an FPGA using partial configuration to modify routing resources during operation. In [4], the reconfigurable 3 x LUT in each logic cell of the FPGA, shown in figure 10.2, is used to implement a multiplexer.
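The idea that each reconfiguration stands for one multiplexer state can be sketched with routing tables. Everything named here is hypothetical (the port names, the signal names and the two contexts); the sketch only shows that a context fixes, for a whole processing period, which signal drives each input of the fixed processing elements.

```python
# Each context is a hypothetical routing table:
# input port of a fixed processing element -> signal driving it.
CONTEXTS = [
    {"cordic0.in": "reg_a", "cordic1.in": "reg_b", "cordic2.in": "reg_c"},
    {"cordic0.in": "cordic1.out", "cordic1.in": "cordic2.out", "cordic2.in": "reg_a"},
]


def route(context, signals):
    """Apply one interconnection context to the current signal values."""
    return {port: signals[source] for port, source in context.items()}


# Hypothetical signal values present at the start of a processing period.
signals = {
    "reg_a": 1.0, "reg_b": 2.0, "reg_c": 3.0,
    "cordic0.out": 10.0, "cordic1.out": 20.0, "cordic2.out": 30.0,
}

inputs_ctx0 = route(CONTEXTS[0], signals)
inputs_ctx1 = route(CONTEXTS[1], signals)
print(inputs_ctx0)
print(inputs_ctx1)
```

In a static design every entry of every table would need a multiplexer input; with partial reconfiguration only the table of the active context is present in the fabric at any time.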
Both of these approaches use an RTL description level. Here we use a system level description to build the reconfigurable interconnection switch, shown in figure 10.3. The interconnections of the processing elements are defined as a module which is implemented in the reconfigurable part. The input and output of interconnection module is con-
Figure 11. Placement of fixed modules and reconfigurable module (interconnections and registers)
Our architecture contains two parts (see figure 11) 
Configuration Management
The first stage of the FPGA configuration is its initialization. The full bitstreams and the partial bitstreams are stored in the Host memory (see figure 12). The initial configuration, a full bitstream, is downloaded by the Host controller. This FPGA configuration contains the MicroBlaze and the MIMO decoder (fixed part and initial context of the reconfigurable part), as shown in figure 12. The subsequent reconfigurations are only partial and are performed during the processing to change the interconnections between the fixed processing elements. If a reset occurs, it is treated as a soft reset: a partial bitstream composed of the basic CORDIC units and multiplier is automatically loaded into the FPGA by the reconfiguration controller.
Reconfiguration during decoding processing: During the partial reconfiguration process, all the modules can continue working except for the reconfigured module. In our design, the reconfiguration tasks are therefore considered as one part of the decoding process. The reconfiguration sequence is predefined by the decoding process (as shown in figure 13).
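The predefined reconfiguration sequence can be sketched as a simple control loop. This is a minimal model under stated assumptions: `run_decoder`, `load_partial_bitstream` and `process` are hypothetical names, and the callbacks stand in for the reconfiguration controller and the fixed processing elements. The only behaviour taken from the text is that the initial context arrives with the full bitstream and that each later context is loaded as a partial bitstream before the corresponding processing step.

```python
def run_decoder(contexts, load_partial_bitstream, process):
    """Interleave processing steps with the predefined reconfigurations."""
    log = []
    for step, ctx in enumerate(contexts):
        if step > 0:
            # The initial context is part of the full bitstream; every later
            # context is loaded as a partial bitstream before it is used.
            load_partial_bitstream(ctx)
            log.append(("reconf", ctx))
        process(ctx)
        log.append(("process", ctx))
    return log


# Hypothetical usage with three interconnection contexts.
loaded = []
log = run_decoder(["ctx0", "ctx1", "ctx2"], loaded.append, lambda ctx: None)
print(log)
```

Because the loop is sequential, a processing step never starts while its context is still being loaded, which is the condition needed to avoid data loss.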
The reconfiguration tasks are inserted in the decoding processing when a change of interconnection is required. Each reconfiguration, during decoding, is requested by the fixed part of the decoding. The fixed part sends the request for reconfiguration by the signal reconf req (figure 12)
Design Methodology
The partial bitstreams for the IP modules are generated following the methodology developed by Xilinx. The design flow is based on Early Access Partial Reconfiguration (EAPR) [10] and uses the new design tool PlanAhead, which provides a hierarchical, block-based, modular and incremental floorplanning and design methodology. It allows changing only part of the design while leaving the placement of the rest intact. The PlanAhead floorplanner handles LUT-based bus macros, which are placed during the floorplanning phase. All of this shortens design iterations, even when changes are frequent.
PlanAhead does not require the user to perform all the operations in ISE. The designer only needs to synthesize the top level and the modules in ISE. All other operations (floorplanning, P&R, bitstream generation) are performed directly in PlanAhead, which automatically launches the ISE tools.
Conclusion and Future work
In this paper, we have presented a reconfigurable MIMO square root decoder design using dynamic partial reconfiguration, which offers advantages in area efficiency, flexibility and configuration time. The proposed method reduces the hardware cost and relies on partial reconfiguration in which only the interconnections of the modules are reconfigured. It is attractive for future wireless applications that must support different antenna sizes, modulations and throughputs.
Currently the performance of the reconfigurable decoder is reduced by the reconfiguration time and the Host-to-device transmission time. However, these are mainly technological issues. This paper demonstrates the management of a reconfigurable decoder during data processing. The reconfiguration time is limited by the throughput of the serial UART interface. It could be reduced by implementing in the FPGA a direct memory access mechanism (DMA transfer) to transport the partial reconfiguration bitstreams from an internal or external memory to the ICAP primitive. Moreover, the performance of the reconfigurable decoder will be improved by using a Xilinx Virtex-4 FPGA. This is the next step of our design work. The reconfiguration of the Virtex-4 is indeed more efficient than that of the Virtex-II: the bitstream can be smaller thanks to the new structure of the Virtex-4 configuration memory, which is based on blocks of configuration frames rather than on the whole-column frames of the Virtex-II. The reconfiguration can also be sped up, as the bandwidth of the Virtex-4 ICAP is 8 times greater than the ICAP bandwidth of the Virtex-II.
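The impact of the serial link on the reconfiguration time can be estimated with simple arithmetic. The figures below are hypothetical and are not measurements from this work: a 40 000-byte partial bitstream, a 115 200 baud UART, and the usual 8N1 framing (10 transmitted bits per byte) are assumed only to show the order of magnitude.

```python
def uart_transfer_time(n_bytes, baud, bits_per_byte=10):
    """Time in seconds to send n_bytes over a UART.

    bits_per_byte=10 assumes 8N1 framing: 1 start bit, 8 data bits, 1 stop bit.
    """
    return n_bytes * bits_per_byte / baud


# Hypothetical bitstream size and baud rate, for illustration only.
t_uart = uart_transfer_time(n_bytes=40_000, baud=115_200)
print(f"UART transfer time: {t_uart:.3f} s")
```

Under these assumptions the transfer alone takes several seconds, which illustrates why moving the bitstream over a faster path to the ICAP primitive is the natural next step.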
