We discuss Dynamic Reconfiguration (or Run-Time Reconfiguration), a technique based on reusing the same device (an FPGA configured on the fly) by scheduling the execution of the different algorithms that make up an application. Our project joins the efforts of ten laboratories working on methods and tools for Adequation Algorithm Architecture. Designing a hardware template around this concept will help new application development methods emerge and will make it possible to estimate the benefits of the approach. In this paper, we present our architecture, ARDOISE, dedicated to real-time image processing. We then define an analytical model to compute the limits and the performance that can be expected from the dynamic reconfiguration scheme.
INTRODUCTION
Ten French research teams are studying and building a hardware architecture (ARDOISE) dedicated to real-time image processing. This architecture uses the fast, dynamic reconfiguration provided by new FPGA circuits. These devices can be totally or partially configured in less than 1 ms, and their processing speed is approximately 1/3 of ASIC speed. The basic idea is to chain the algorithms used in image segmentation on the same hardware structure by reconfiguring a few devices several times during the processing of a single image. This is equivalent to assigning the same hardware resource, an FPGA device, to execute a sequence of algorithms according to a defined schedule. Using this concept, the final system allows applications to be implemented on FPGA at speeds close to those of ASIC circuits, while offering a level of flexibility formerly reserved for microprocessors. In addition, this project will help estimate the contribution of the Dynamic Reconfiguration (DR) paradigm to the performance of hard-wired systems.
The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI:
In this paper, we briefly present the ARDOISE architecture, which can be reconfigured entirely or partially. Next (section 3), we discuss the performance limits of a Dynamically Reconfigured Architecture (DRA) and evaluate the silicon area gain expected compared to a classical solution based on a set of statically configured FPGA circuits. In section 5, we compare the performance of three possible implementations of a DRA: without masking the reconfiguration time, with masking, and with doubled reconfiguration speed.
THE ARDOISE ARCHITECTURE
The ARDOISE architecture [1, 2] is based on three identical modules (figure 1). Each one contains an FPGA of 45K gates connected to two local memories that store intermediate results.
The GTI1 and GTI2 modules provide an interface with the input and output systems. This allows desynchronization between the central computing module, BC, and the video acquisition system, which often runs at a lower frequency than the maximum the FPGA can sustain. The different treatments are swapped in the central module. Partial results are stored in the local memories, with different memory models, while computing or reconfiguring.
PERFORMANCES AND LIMITS OF DRA
The main objective of Dynamic Reconfiguration is to let a system react at run time and select the right algorithm at the right time, while offering performance close to that of hard-wired systems. Compared to a static solution, DR does not improve the execution speed of an application; it reduces and optimizes the use of silicon area in the FPGA. Bertin [3] suggested defining the power Pu needed by an application, in the case of a static architecture, as the product of the number of gates Gs required and the computing frequency. In the same way, the power of an architecture can be defined as the product of its number of equivalent gates and the maximum usable frequency Ft. The application, which uses Gs gates in a static architecture, is split into C parts and then computed on an image with C successive configurations of the FPGA. We have already shown [1, 2] that the useful application power is limited when DR is used:
Gd represents the equivalent gate count of the dynamically reconfigurable architecture and T the computing time of a block of data. For an image rate of 25 images/s, the block duration is T = 40 ms.
The limit depends on the configuration speed Vc (expressed in number of gates configured per second) and on the maximal computing frequency Ft provided by the technology.
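The two power definitions above can be sketched numerically. This is a minimal illustration only: the image size, the static gate count Gs and the clock Ft are assumed values, and reading the "computing frequency" of the application as the pixel rate Fe = N/T is also an assumption. Only the 45K gate count and the 40 ms block duration come from the text.

```python
# Sketch of the power definitions attributed to Bertin [3].
# Numeric values are illustrative assumptions, not ARDOISE measurements.

def application_power(gates_static: float, compute_freq_hz: float) -> float:
    """Pu: gates required by the static implementation times the computing frequency."""
    return gates_static * compute_freq_hz


def architecture_power(equivalent_gates: float, max_freq_hz: float) -> float:
    """Power offered by an architecture: equivalent gate count times Ft."""
    return equivalent_gates * max_freq_hz


T = 1 / 25        # block duration for 25 images/s, in seconds (40 ms, from the text)
N = 512 * 512     # assumed image size in pixels
Fe = N / T        # assumed reading: pixel rate the application must sustain, in Hz

Gs = 200_000      # assumed static gate count of the application
Gd = 45_000       # gate count of one ARDOISE FPGA (from the text)
Ft = 40e6         # assumed maximum FPGA clock frequency, in Hz

Pu = application_power(Gs, Fe)
Pa = architecture_power(Gd, Ft)
print(f"application power Pu = {Pu:.3e} gate.Hz")
print(f"architecture power   = {Pa:.3e} gate.Hz")
```

With these assumed figures the application needs more gates than the device holds but runs at a pixel rate well below Ft, which is the situation where reusing the same gates several times per block becomes attractive.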
Maximal power and complexity limits
For a given number of configurations, as Gd increases, Pu reaches a maximum value, obtained at:
$$G_d = \frac{V_c T}{2C}$$
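The optimum stated above can be checked numerically. The sketch below assumes a simple timing model that is not quoted from the paper: each of the C steps costs Gd/Vc to configure plus N/Ft to compute, the C steps exactly fill the block duration T, and the application complexity is Gs = C·Gd. The values of C, Vc and Ft are hypothetical.

```python
# Sketch: locate the Gd that maximises the useful power Pu for a fixed C.
# Assumed model (not the paper's printed equation):
#   T = C * (Gd/Vc + N/Ft)   and   Pu = Gs * Fe   with Gs = C * Gd, Fe = N/T.

def useful_power(Gd, C, Vc, Ft, T):
    # Pixel rate that still fits in T once the configuration time is paid.
    Fe = Ft * (1.0 / C - Gd / (Vc * T))
    return C * Gd * Fe


T = 0.040     # block duration, 40 ms (from the text)
C = 4         # assumed number of configurations per block
Vc = 50e6     # assumed configuration speed, gates per second
Ft = 40e6     # assumed maximum computing frequency, Hz

Gd_opt = Vc * T / (2 * C)   # closed form given in the text

# Brute-force scan of Gd between 0 and 2 * Gd_opt.
candidates = [Gd_opt * k / 1000 for k in range(1, 2000)]
Gd_best = max(candidates, key=lambda g: useful_power(g, C, Vc, Ft, T))

print(Gd_opt, Gd_best)           # the scan peaks at the closed-form value
print(C * Gd_opt, Vc * T / 2)    # complexity Gs = C * Gd at the optimum
```

Under this assumed model, the complexity at the optimum is C·Gd = Vc·T/2, which depends on neither C nor Ft.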
For each configuration p, βp data are processed in parallel (parallelism rate). The maximal complexity of an application implemented with DR is then defined as:
This equation shows that there is a limit on the hardware complexity of an application that can be implemented with DR. This limit is exclusively a function of the configuration speed allowed by the technology.
Maximum gate count of a DRA
If the image size is N pixels, the sampling frequency of the pixel is:
$$F_e = \frac{N}{T}$$
For a given application (Fe, T, Gs), the useful power is:
We obtain:
Implementation according to the DR concept gives the best silicon area reduction with a parallelism rate of βp = 1. In this part, we show that the silicon gain is better when the data acquisition frequency is low compared to the processing frequency. The rate is defined as:
A reduction in silicon area appears when dynamic reconfiguration is used at less than 20% of the limits.
Performances
To simplify the expressions and make the figures easier to analyze, we suppose that all configurations use the same data parallelism rate β. To respect the real-time constraints, the following relation has to be satisfied:
The following curves, drawn for two different technologies (XILINX 4000 family and ATMEL AT40K family), show the importance of the configuration speed Vc and of the data parallelism rate β for the complexity limit as well as for the silicon area gain.
For example, the configuration speed Vc of the ATMEL AT40K is 100 times that of the XILINX 4000, so the complexity of the application can be 100 times greater.
These curves have been drawn with T = 40 ms (block duration in image processing).
WHICH TYPE OF FPGA FOR DRA?
Dynamic reconfiguration is not reserved for FPGAs that offer a high configuration speed, such as the Xilinx XC6200 and Atmel AT40K series. These two families allow partial or complete dynamic reconfiguration at run time. This is not currently available in technologies such as the Xilinx 4000 family, which lack a fast configuration interface. According to the curves shown in §3.3, an architecture built with only one such FPGA suffers a significant loss of efficiency.
Performance can be degraded by the reconfiguration time of the FPGA hardware. However, this does not mean that classical FPGA circuits are excluded from the design of a DRA. There has been much interest in developing dynamically reconfigurable architectures that connect reconfigurable FPGAs in various topologies. In recent years, research teams have built many systems that integrate mechanisms to reduce the effect of configuration time: configuration data caching, bit-stream compression techniques, masking the configuration, ...
The configuration masking strategy was studied by H. Guermoud in his thesis [4]. In the rest of this article, we compare the performance of this approach with that of the classical DRA (such as ARDOISE).
DR SYSTEMS WITH OR WITHOUT MASKING?
DR can be beneficial even with a slow configuration speed. One can imagine solutions that use two FPGAs, each with a capacity of Gd/2 gates, instead of a single one with Gd gates.
Real-time image processing requires complex algorithms (Nagao filter, edge detector, boundary closing, region labeling, ...). Their implementations [5] cover a large variety of hardware models (pipeline, data flow, parallelism, ...) and data arrangements. They usually need very high speed memories and data address generators for effective data organization (data formatting). For example, some operators need to access many pixels at the same time. This is why sufficient memory size and high-speed access (bandwidth) are very important for image processing. Each processing step on an image requires large frame storage.
An ARDOISE architecture module includes 2 Mbytes of classic SRAM (for random memory access); the memory bandwidth is 400 Mbytes/s. We now calculate and discuss the performance and the silicon reduction of the following solutions:
1. Without masking the reconfiguration time.
2. Masking the reconfiguration time: one device is reconfigured while the other is computing.
3. Doubling the reconfiguration speed: the two devices are reconfigured at the same time, and then compute together.
The three solutions use the same quantity of hardware resources: total number of gates, memory size and memory bandwidth.
To simplify the expressions and the interpretation of the figures, we assume here that the same data parallelism rate β is used for each algorithm.
Solution without masking
In this case, a single FPGA with a capacity of Gd equivalent gates is used. To execute several algorithms, reconfiguration and computation phases are performed alternately.
Number of configurations
$$C = \frac{T}{\dfrac{N}{F_t} + \dfrac{G_d}{V_c}} = \frac{1}{\dfrac{F_e}{F_t} + \dfrac{G_d}{V_c T}}$$
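As a minimal sketch, the number of configurations for the non-masked case can be evaluated by reading the expression above as C = T / (N/Ft + Gd/Vc), that is, one computation pass over the N pixels plus one full reconfiguration per step. The image size, Ft and Vc below are assumed values. Gd and T come from the text.

```python
def configurations_without_masking(T, N, Ft, Gd, Vc):
    """C = T / (N/Ft + Gd/Vc): each step computes N pixels, then reloads Gd gates."""
    return T / (N / Ft + Gd / Vc)


def configurations_normalised(Fe, Ft, Gd, Vc, T):
    """Same quantity written with the pixel rate Fe = N/T."""
    return 1.0 / (Fe / Ft + Gd / (Vc * T))


T = 0.040         # block duration, 40 ms (from the text)
N = 512 * 512     # assumed image size in pixels
Ft = 40e6         # assumed maximum computing frequency, Hz
Gd = 45_000       # gates of one ARDOISE FPGA (from the text)
Vc = 50e6         # assumed configuration speed, gates per second

C_time = configurations_without_masking(T, N, Ft, Gd, Vc)
C_norm = configurations_normalised(N / T, Ft, Gd, Vc, T)

print(C_time, C_norm)   # both forms give the same value
print(int(C_time))      # whole configurations that fit in one block
```

Only the integer part of C is usable in practice, since a partially completed configuration step does not produce a result for the block.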
Complexity of the application
Because dynamic reconfiguration makes no sense if C < 2, the maximum value of the equivalent gate count of the dynamic architecture is:
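Taking the number of configurations of the non-masked case in the form C = 1 / (Fe/Ft + Gd/(Vc·T)), the requirement C ≥ 2 can be rearranged as follows. This is a worked rearrangement under that reading, offered as a check on the bound rather than as the authors' printed equation.

```latex
C \ge 2
\;\Longleftrightarrow\;
\frac{F_e}{F_t} + \frac{G_d}{V_c T} \le \frac{1}{2}
\;\Longleftrightarrow\;
G_d \le V_c T \left( \frac{1}{2} - \frac{F_e}{F_t} \right)
```

The bound is positive only when Fe < Ft/2, so under this reading the pixel rate must stay below half of the maximum computing frequency for two configurations to fit in a block.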
Gs can also be expressed as:
The silicon area reduction Gg according to the complexity of the application is:
DR with masking configuration
In this case, a series of reconfiguration / computation phases is applied alternately to each of the two FPGAs (each providing the equivalent of Gd/2 gates). At any step, the first FPGA is reconfigured while the second FPGA executes a treatment (figure 3a). In the following step, once reconfigured, the first FPGA starts to operate while a configuration is loaded into the second one (figure 3b), and so on.
Number of configurations
Two possible situations should be considered:
1. In the first situation, where the reconfiguration time is shorter than the computation time, the number of configurations can be written as:
2. In the second situation, where the reconfiguration time is longer than the computation time, the number of configurations can be expressed as:
In this situation the complexity of the application is: Gs = Vc T. In both cases, we can deduce the number of configurations by:
In practice, partial masking of the configuration time is clearly of no interest. For this reason, we study only total masking of the configuration time from here on.
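The two situations described above can be sketched with a single expression. The sketch assumes that, in steady state, each ping-pong step lasts as long as the slower of the two overlapped activities: one computation pass over N pixels (N/Ft) or the reload of one device of Gd/2 gates ((Gd/2)/Vc). This timing model, the image size, Ft and both Vc values are assumptions. Gd and T come from the text.

```python
def configurations_with_masking(T, N, Ft, Gd, Vc):
    """Steps per block when one Gd/2 device computes while the other reloads."""
    t_compute = N / Ft          # one pass over the N pixels of a block
    t_reconf = (Gd / 2) / Vc    # loading one device of Gd/2 gates
    return T / max(t_compute, t_reconf)


T = 0.040         # block duration, 40 ms (from the text)
N = 512 * 512     # assumed image size in pixels
Ft = 40e6         # assumed maximum computing frequency, Hz
Gd = 45_000       # total gates, split over two devices (45K from the text)

# Situation 1: reconfiguration is fully hidden behind computation.
C_fast = configurations_with_masking(T, N, Ft, Gd, Vc=50e6)

# Situation 2: reconfiguration dominates and sets the pace.
Vc_slow = 1e6     # assumed slow configuration speed, gates per second
C_slow = configurations_with_masking(T, N, Ft, Gd, Vc=Vc_slow)
Gs_slow = C_slow * Gd / 2   # each configuration brings Gd/2 gates

print(C_fast, Ft * T / N)        # situation 1: C depends only on Ft and the pixel rate
print(Gs_slow, Vc_slow * T)      # situation 2: complexity equals Vc * T
```

In the second situation the sketch reproduces the relation Gs = Vc T given in the text, which is a useful consistency check on the assumed timing model.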
Complexity of the application
Since dynamic reconfiguration is only of interest beyond 4 configurations, the maximum value of the equivalent gate count of the dynamic architecture is:
The silicon reduction Gg is:
DR with double configuration speed
Here, the two FPGA devices are reconfigured simultaneously with different configuration data [6]. After the reconfiguration time, each FPGA starts to execute its own treatment.
[Figure: FPGA 1 and FPGA 2, reconfiguration phases]
This proves that the solution is technological. FPGA circuits should be endowed with mechanisms that reduce reconfiguration time (better reconfiguration interface, configuration data caching, partial reconfiguration, ...). For a given technology, the solution with parallel reconfigurations offers better performance.
The solution based on configuration masking is consequently of little interest. Because the FPGA devices must be managed in hardware in a ping-pong manner, implementing the system also presents some difficulties (complexity of the PCB design).
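The comparison of the three solutions can be illustrated with a rough timing sketch. The three step-time models below are assumptions made for illustration, not the paper's equations: a step costs Gd/Vc + N/Ft without masking, max((Gd/2)/Vc, N/Ft) with masking (and brings only Gd/2 gates), and (Gd/2)/Vc + N/Ft when both halves are loaded in parallel. The number of steps is treated as a continuous quantity, and N, Ft and the Vc values are hypothetical.

```python
def complexity_no_masking(T, N, Ft, Gd, Vc):
    steps = T / (Gd / Vc + N / Ft)            # one device of Gd gates
    return steps * Gd


def complexity_masking(T, N, Ft, Gd, Vc):
    steps = T / max((Gd / 2) / Vc, N / Ft)    # two devices of Gd/2, ping-pong
    return steps * Gd / 2


def complexity_double_speed(T, N, Ft, Gd, Vc):
    steps = T / ((Gd / 2) / Vc + N / Ft)      # both halves load in parallel
    return steps * Gd


T = 0.040         # block duration, 40 ms (from the text)
N = 512 * 512     # assumed image size in pixels
Ft = 40e6         # assumed maximum computing frequency, Hz
Gd = 45_000       # total equivalent gates (45K from the text)

results = {}
for Vc in (1e6, 1e7, 1e8):   # assumed configuration speeds, gates per second
    results[Vc] = (
        complexity_no_masking(T, N, Ft, Gd, Vc),
        complexity_masking(T, N, Ft, Gd, Vc),
        complexity_double_speed(T, N, Ft, Gd, Vc),
    )
    plain, masked, doubled = results[Vc]
    print(f"Vc={Vc:.0e}: no masking={plain:.0f}, masking={masked:.0f}, "
          f"double speed={doubled:.0f} gates per block")
```

Under these assumed models, the parallel-reconfiguration option never falls below the other two for the speeds tried, which agrees with the ranking stated in the text. Masking only beats the plain solution when the configuration speed is low.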
CONCLUSION
In this paper, we presented a new paradigm for designing hardware architectures using dynamically reconfigurable FPGAs. We showed how important the FPGA reconfiguration time is for increasing the performance of dynamically reconfigurable systems. We studied and compared the performance of two alternative solutions (masking the reconfiguration delay, doubling the reconfiguration speed) using the same hardware resources.
This work indicates that a performance increase is possible by doubling the reconfiguration speed. However, the hardware realization of the system is more delicate and its flexibility is reduced.
The results of this study justify our architectural choices for ARDOISE. Indeed, the way to increase the performance of reconfigurable architectures significantly is to enhance the FPGA technology: efficient configuration interface, low configuration time, configuration data caching, possibility of partial configuration, ... Development software tools should also contribute by including more features such as configuration data compression, partial and complete configuration management, automatic application partitioning, ...
