Abstract-Multi-FPGA systems have tremendous potential, providing a high-performance computing substrate for many different applications. These systems harness multiple FPGAs, connected in a fixed pattern, to implement complex logic structures. To use such a system effectively, a well-performing hardware platform must be constructed, and the configuration scheme is an important part of that hardware design. This paper targets small-scale Multi-FPGA systems composed of SRAM-based FPGAs developed by Xilinx Corporation and proposes a novel configuration technique based on the Platform Flash PROM XCF32P. With this scheme, a single XCF32P and a single Complex Programmable Logic Device (CPLD) suffice to configure four FPGAs whose individual configuration data are each smaller than 8 Mbit. When there are more than four FPGAs, design revisioning allows the user to cascade additional XCF32P PROMs. Since the Xilinx Platform Flash PROM and Xilinx FPGA/CPLD devices form a single-vendor solution, the hardware and software design is simplified.
I. INTRODUCTION
There is currently tremendous interest in the development of computing platforms built from multiple standard FPGAs [1, 2, 3, 4]. One reason is that many digital systems are too large to be implemented in a single FPGA; another is that the growth rate of FPGA capacity lags far behind that of ASIC (Application Specific Integrated Circuit) chip scale [5, 6]. These systems harness multiple FPGAs [7], connected in a fixed pattern, to implement complex logic structures. To use such a system effectively, a well-performing hardware platform must be constructed. The configuration method plays an important role in the hardware platform for two major reasons. First, the configuration chips affect the layout and routing of the printed circuit board (PCB). Second, a multi-FPGA system usually needs to be initialized and reconfigured after the PCB has been developed, especially during system debugging. A good configuration design can simplify the PCB structure and also make the configuration and debugging processes more convenient and effective.
In this paper, we focus on SRAM-based FPGAs developed by Xilinx Corporation. Because the internal configuration memory of an SRAM-based FPGA is volatile and is reset at power-up, it cannot be used to store configuration data permanently. SRAM-based FPGAs therefore require external devices to initiate and control the configuration process.
For Multi-FPGA system configuration, when both the number of FPGA chips and the size of each FPGA's configuration file are very large, as in the DN9000K10 system [8] developed by the Dini Company, Xilinx offers a dedicated configuration solution, System ACE (System Advanced Configuration Environment). In this solution, a CF (Compact Flash) card and an ACE controller chip configure the multiple FPGAs automatically [9, 10], but the system is costly. For general application systems (for example, no more than four FPGAs with configuration files smaller than 8 Mbit), a self-made configuration scheme is usually adopted. For example, [11, 12, 13, 14] use a configuration scheme based on a CPLD and general-purpose FLASH. This requires a dedicated FLASH programming device to write the configuration files into the FLASH, and a group of output pins matching the FLASH capacity must be assigned as the address bus. Designers must also know the first and last FLASH addresses of each FPGA's configuration file, so that the counter in the CPLD can start the control signal for the next FPGA's configuration once the previous one has completed, which is troublesome in practice. In addition, the access speed of general-purpose FLASH is slow relative to the FPGA, which limits the system configuration speed. Reference [15] adopts a processor-based DSP + CPLD + general-purpose FLASH scheme. Designing and debugging the circuit and program takes considerable time, and because the processor usually carries a heavy workload in addition to FPGA configuration, bus contention occurs easily.
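To make the address bookkeeping of the CPLD plus general FLASH scheme concrete, the following Python sketch computes the first and last address of each FPGA's configuration file and the number of address lines required. It assumes the files are stored back to back from address 0 in a byte-wide FLASH; both function names are hypothetical and the layout is an illustration, not the method of the cited works.

```python
def address_map(sizes_bytes):
    """Return (first, last) address pairs for files stored back to back from 0."""
    ranges = []
    start = 0
    for size in sizes_bytes:
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def address_lines(total_bytes):
    """Number of address pins needed to reach every byte of total_bytes."""
    return max(1, (total_bytes - 1).bit_length())
```

Every pair returned by `address_map` is a value the CPLD counter logic would have to compare against, which is the manual step the scheme proposed in this paper avoids.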
In this paper, we propose a novel configuration scheme based on Xilinx Platform Flash PROM XCF32P to simplify the design of hardware and software.
II. XCF32P STRUCTURE CHARACTERISTICS
XCF32P is a high-capacity programmable Platform Flash PROM developed by Xilinx, with a storage capacity of 32 Mbit. Its structure is shown in Fig. 1. The chip supports serial or parallel FPGA configuration interfaces and has the following typical characteristics [16, 17, 18]:
The embedded data decompressor, which is compatible with Xilinx advanced compression technology, can decompress compressed PROM files with a data compression ratio of up to 50%; the compressed file is generated from the target FPGA bitstream file. When decompression is enabled, the FPGA must be in a slave configuration mode, and the PROM first decompresses the stored data and then drives the clock and data to the FPGA interface.
An optional internal oscillator can provide a 20 MHz or 40 MHz clock, which is output on the CLKOUT pin. Of the two, the 40 MHz clock is used to start the internal decompressor.
Design revisioning allows the user to create up to four unique design revisions on a single PROM or stored across multiple cascaded PROMs. Design revisioning can be used with compressed PROM files, and also when the CLKOUT feature is enabled. The 32 Mbit storage capacity of a single XCF32P can be divided into several independent spaces in units of 8 Mbit, and each independent space can store an independent configuration file, which is called a storage version. Storage versions can be arranged in several ways. As shown in Fig. 2, one XCF32P can hold, among other arrangements, a single 32 Mbit storage version; two independent 16 Mbit storage versions; one 8 Mbit and one 24 Mbit storage version; two 8 Mbit storage versions and one 16 Mbit storage version; or four independent 8 Mbit storage versions. During PROM file creation, each design revision is assigned a revision number: Revision 0 = '00', Revision 1 = '01', Revision 2 = '10', Revision 3 = '11'. After the Platform Flash PROM has been programmed with a set of design revisions, a particular design revision can be selected using the external REV_SEL[1:0] pins or the internal programmable design revision control bits. The EN_EXT_SEL pin determines whether the external pins or the internal bits are used to select the design revision. When EN_EXT_SEL is Low, design revision selection is controlled by the external revision select pins, REV_SEL[1:0]. When EN_EXT_SEL is High, design revision selection is controlled by the internal programmable revision select control bits. During power-up, the design revision selection inputs (pins or control bits) are sampled internally. After power-up, when CE is asserted (Low), enabling the PROM inputs, the design revision selection inputs are sampled again after the rising edge of the CF pulse. The data from the selected design revision are then presented on the FPGA configuration interface.
Xilinx developed the multiple design version function of the Platform Flash PROM to support dynamic reconfiguration of a system, or special applications in which the configuration changes each time the FPGA starts. The work in this paper instead uses the multiple independent design versions to configure multiple FPGAs.
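The partitioning rule and revision codes described above can be illustrated with a short Python sketch. It enumerates every ordered grouping of the four 8 Mbit units into storage versions and maps a revision number to its two-bit REV_SEL[1:0] code. The function names are hypothetical, and whether the device supports every ordering listed is an assumption of this sketch rather than a statement from the datasheet.

```python
BLOCK_MBIT = 8   # partition unit stated in the text
NUM_BLOCKS = 4   # 32 Mbit / 8 Mbit


def revision_layouts(blocks=NUM_BLOCKS):
    """Return every ordered split of `blocks` units into versions (sizes in Mbit)."""
    if blocks == 0:
        return [[]]
    layouts = []
    for first in range(1, blocks + 1):
        for rest in revision_layouts(blocks - first):
            layouts.append([first * BLOCK_MBIT] + rest)
    return layouts


def rev_sel_code(revision):
    """Two-bit REV_SEL[1:0] code for revision 0..3, as a string such as '10'."""
    if not 0 <= revision <= 3:
        raise ValueError("revision must be in the range 0..3")
    return format(revision, "02b")
```

The layout `[8, 8, 8, 8]` is the one used in this work: one 8 Mbit storage version per FPGA, selected in turn by the codes '00' to '11'.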
III. CONFIGURING FOUR VIRTEX XCV200 FPGAS
A. System components.
The system includes one Platform Flash PROM XCF32P, one CPLD XC9572 and the four XCV200 FPGAs to be configured. The system structure diagram is shown in Fig. 3a, and the circuit board is shown in Fig. 3b. The configuration interface circuit is shown in Fig. 4. The circuit was designed with the OrCAD software. Because the software cannot render the sign of the NOT operation, active-low signals are written with a "/" prefix (the same convention is used in the following text).
The Virtex XCV200 FPGA supports the following four configuration modes [19]: master serial mode, slave serial mode, slave parallel (Slave SelectMAP) mode and boundary scan mode. In this work, the high-speed slave parallel mode is used and the configuration clock CCLK is supplied externally. The frequency is determined by the following formula:
In equation (1) [...] When the configuration of one of the first three FPGAs (FPGA1, FPGA2 and FPGA3) is completed, that FPGA enters its start-up stage and releases its indication signal DONE; the version selection signal for the next configuration program is then set and configuration of the next FPGA starts. Configuration of the whole system is complete when the fourth FPGA (FPGA4) releases its DONE signal. This signal is connected to /CE, so XCF32P is then disabled and the configuration process ends. The configuration flow is shown in Fig. 5.
The configuration data timing diagram is shown in Fig. 6. When /PROGRAM is driven low, the four FPGAs begin to initialize synchronously. After initialization has completed, the DONE signals go low. Because the /CE signal of XCF32P is connected to the DONE signal of the fourth FPGA (DONE4), the chip enable of XCF32P becomes active. Meanwhile, /INIT goes low automatically and the FPGAs begin to clear their configuration memory. When the low level of /INIT reaches the OE/(/RESET) input of XCF32P, the XCF32P is reset and its address pointer points to the first address of the memory space. After the configuration memory has been cleared, /INIT returns high and each device samples its mode pins, which ensures that the configuration data are loaded in parallel mode.
When the multi-version design function is enabled, the internal logic of the configuration PROM samples the design version selection inputs (pins /SEL) at power-up. When /CE is driven low, the design version selection inputs are sampled again at the rising edge of the /CF pulse to determine which design version supplies configuration data to the FPGA. The version selection pins should be set at least 300 ns before the sampling is triggered.
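Once a clock is chosen for the delay logic, the 300 ns requirement translates into a minimum number of clock cycles. The following is a minimal sketch, assuming the delay is produced by counting cycles of a clock; the frequency argument `clk_hz` and the function name are hypothetical, since the text does not state which clock drives the delay circuit.

```python
def min_cycles(clk_hz, t_min_ns=300):
    """Smallest whole number of clock cycles lasting strictly longer than t_min_ns."""
    # Integer arithmetic avoids floating-point rounding at exact multiples.
    return (t_min_ns * clk_hz) // 1_000_000_000 + 1
```

For instance, with a 20 MHz clock (50 ns period) the sketch gives 7 cycles, that is, 350 ns, which leaves one full period of margin over the 300 ns limit.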
Fig. 5. Configuration flow. Start; clear the configuration memory and set DONE low; /PROGRAM is low?; clear the configuration memory again; /INIT is low?; begin to configure FPGA1: set the chip select /CS low, set the version code REV_SEL corresponding to this chip, and set the version initialization signal /CF low for longer than 300 ns; write data; BUSY is low?; DONE is high?; set the chip select /CS high and enter the start-up process; repeat the single-chip configuration process for FPGA2, FPGA3 and FPGA4; DONE4 is high?; set /CE high.
Therefore, /INIT is used as the initial trigger of /CF: /CF is triggered at the rising edge of /INIT, driven low, and held low for more than 300 ns. At the same time, the version selection input /SEL is set to "00", so configuration data are output from version 0. At the rising edge of /CF, the chip select signal /CS1 becomes active. The version 0 data of XCF32P are output to the first FPGA, and configuration of FPGA1 begins under the control of CCLK. When the first FPGA has been configured, it releases its DONE signal, so DONE1 goes high. The rising edge of DONE1 triggers /CF, which is driven low again, and at the same time /SEL is set to "01". When the rising edge of /CF arrives, configuration data are sent from version 1 of XCF32P. At this point /CS2 becomes active and /CS1 becomes inactive, so the second FPGA is selected and begins to receive configuration data. The third and fourth FPGAs are configured in the same way. After the fourth FPGA has been configured, the DONE4 signal it releases drives /CE of XCF32P high. The chip enable of XCF32P is then inactive and the whole configuration process ends.
C. The software design of CPLD
The internal control circuit of the CPLD is the key part of the system. Its main functions are to provide the timing sequence needed during configuration, to coordinate the configuration process, and to ensure that the multi-FPGA configuration completes in the predetermined order. The design combines a hardware description language with schematic entry. The control circuit consists of a delay module, a counter and a shift register, as shown in Fig. 7.
The delay module detects the rising edges of /INIT, DONE(1), DONE(2) and DONE(3) and triggers an internal delay circuit to produce the negative pulse, longer than 300 ns, that the /CF signal requires. Because it is difficult to detect the rising edges of four signals simultaneously, the delay module contains four independent delay circuits, one per trigger signal, which produce four negative pulses; ANDing these pulses yields the /CF signal. The shift register is clocked by the rising edge of /CF and produces the chip select signals /CS(4:1). The counter is clocked by the falling edge of /CF and produces the version selection signal /SEL(1:0).
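The interaction of the delay module, counter and shift register can be sketched as a behavioral model in Python. This is an illustration under stated assumptions, not the authors' CPLD design: the class name, the tick-based timing, the counter preset of "11" (so that the first falling edge of /CF yields "00"), and the one-cold ring used as the shift register are all hypothetical choices made to reproduce the sequence described in the text.

```python
class CpldControl:
    """Behavioral sketch: four delay circuits, a 2-bit counter, a shift register."""

    def __init__(self, pulse_ticks=4):
        self.pulse_ticks = pulse_ticks   # assumed /CF low time, in model ticks
        self.prev = [0, 0, 0, 0]         # last samples of /INIT, DONE1..DONE3
        self.timers = [0, 0, 0, 0]       # remaining low time of each delay circuit
        self.cf_n = 1                    # /CF idles high
        self.sel = 3                     # counter preset (assumption): next count is 0
        self.ring = [0, 1, 1, 1, 1]      # one-cold ring; ring[1:] is /CS1../CS4

    def tick(self, init_n, done1, done2, done3):
        """Advance one tick; return (/CF, /SEL value, [/CS1, /CS2, /CS3, /CS4])."""
        inputs = [init_n, done1, done2, done3]
        for i, (old, new) in enumerate(zip(self.prev, inputs)):
            if old == 0 and new == 1:    # a rising edge starts that delay circuit
                self.timers[i] = self.pulse_ticks
        self.prev = inputs
        pulses = [0 if t > 0 else 1 for t in self.timers]
        self.timers = [max(t - 1, 0) for t in self.timers]
        cf_next = min(pulses)            # AND of the four negative pulses
        if self.cf_n == 1 and cf_next == 0:   # falling edge clocks the counter
            self.sel = (self.sel + 1) % 4
        if self.cf_n == 0 and cf_next == 1:   # rising edge clocks the shift register
            self.ring = [self.ring[-1]] + self.ring[:-1]
        self.cf_n = cf_next
        return self.cf_n, self.sel, self.ring[1:]
```

Driving the model with /INIT rising and then DONE1, DONE2 and DONE3 rising in turn steps /SEL through 0, 1, 2, 3 on each falling edge of /CF and moves the active-low chip select from /CS1 to /CS4 on each rising edge. Reset through /PROGRAM and the /CE path from DONE4 are outside this sketch.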
The simulation results of control circuit are shown in Fig. 8 . 
IV. CONCLUSIONS
A new configuration scheme for small-scale multi-FPGA systems based on XCF32P has been presented. In this scheme, one XCF32P and one CPLD are used to configure four Virtex XCV200 FPGAs. The design is general and can be used to configure multiple Xilinx FPGAs whose individual configuration data are each smaller than 8 Mbit. When the internal decompressor of XCF32P is enabled, the configuration data of a single FPGA can reach 16 Mbit. When there are more than four FPGAs, design revisioning allows the user to cascade additional XCF32P PROMs.
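The 8 Mbit and 16 Mbit limits quoted above follow from simple arithmetic, sketched here in Python. The sketch assumes four equal storage versions and treats the 50% figure as the best-case ratio of stored size to original size; real compression depends on the bitstream, and both function names are hypothetical.

```python
PROM_MBIT = 32          # XCF32P capacity
REVISIONS = 4           # one storage version per FPGA
BEST_CASE_RATIO = 0.5   # stored size / original size at the highest quoted compression


def max_bitstream_mbit(compressed=False):
    """Largest per-FPGA configuration file (Mbit) that fits one storage version."""
    per_revision = PROM_MBIT / REVISIONS
    return per_revision / BEST_CASE_RATIO if compressed else per_revision


def fits_revision(bitstream_mbit, compressed=False):
    """True if a configuration file of the given size fits one storage version."""
    return bitstream_mbit <= max_bitstream_mbit(compressed)
```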
Because XCF32P is a dedicated configuration chip developed by Xilinx, its access time is short and the configuration speed is fast. Moreover, the Xilinx Platform Flash PROM and Xilinx FPGA/CPLD devices form a single-vendor solution, which simplifies the hardware and software design.
