ABSTRACT
INTRODUCTION
Platform FPGAs have emerged as a powerful new SOC (System-On-Chip) computing paradigm. They are usually customized to implement computationally expensive datapaths when coupled with a host [1] [2]. Typically, the host is a workstation, while the FPGA resources form a coprocessor that communicates with the host via an I/O interface. This arrangement yields greater parallelism at run time, but at the expense of higher communication overheads and a daunting algorithm mapping process.
The SIMD mode of computation is suitable for data-intensive applications. Assuming a host-FPGA system, we configure the FPGA chip as an application-specific SIMD coprocessor controlled by the host. Based on HISA, application tasks are partitioned into three layers: host, FPGA, and nanoprocessor layers, in decreasing order of task granularity. The host layer, at the top of the H-SIMD machine, is implemented on the workstation in a high-level language. More specifically, similar to the approach for PC clusters in [5], we suggest that an effective ISA be developed at the host layer for each application domain. Frequently used instructions in that domain should belong to this ISA. These instructions are invoked as function calls at the host layer. The functions can take advantage of available IP cores, so the design time for a complex platform FPGA system can be significantly reduced. Existing design methodologies focus on logic synthesis without considering the algorithm structure [2] [8]. Our H-SIMD machine approach addresses algorithm synthesis at the host layer and logic synthesis at the nanoprocessor layer. Another major advantage of the H-SIMD machine is its memory switching scheme for data loads and stores between the host and the FPGAs. Switching between pairs of data memory banks overlaps operand communications with computations, thus hiding communication overheads and improving performance.
This work was supported in part by the US Department of Energy under grant DE-FG02-03CH11171.
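The benefit of switching between a pair of memory banks can be illustrated with a simple timing model. The following is a minimal sketch, not the H-SIMD implementation: it assumes every block takes the same transfer time `t_comm` and the same compute time `t_comp`, ignores result write-back, and uses hypothetical function names chosen for illustration only.

```python
def total_time_serial(n_blocks, t_comm, t_comp):
    """Single bank (assumed baseline): each block is loaded, then computed."""
    return n_blocks * (t_comm + t_comp)


def total_time_switched(n_blocks, t_comm, t_comp):
    """Two banks: while one bank feeds the computation, the other is loaded.

    The first load cannot be hidden and the last computation has no
    transfer to overlap with; every step in between costs the slower
    of the two activities.
    """
    if n_blocks == 0:
        return 0.0
    return t_comm + (n_blocks - 1) * max(t_comm, t_comp) + t_comp


if __name__ == "__main__":
    # Illustrative numbers only (arbitrary time units).
    print(total_time_serial(4, 2.0, 3.0))
    print(total_time_switched(4, 2.0, 3.0))
```

Under this model, communication is fully hidden whenever `t_comm <= t_comp`, apart from the initial load; when transfers are slower than computation, the transfer time becomes the bottleneck instead.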
The remainder of this paper is organized as follows. Section II presents our H-SIMD machine architecture. Section III includes a detailed design of HISA and its workload balancing scheme for DCT2. Section IV contains implementation results on a Xilinx Virtex II 6000 FPGA and a comparative study with a 2.0 GHz Pentium processor. Section V concludes our work.
MULTI-LAYERED H-SIMD MACHINE
A. H-SIMD Architecture
The H-SIMD control hierarchy is composed of three layers, as shown in Fig. 1.
B. Memory Switching Schemes
The communication overhead between the host and the FPGA chip can be substantial, primarily due to the non-preemptive nature of the operating system on the host. Based on tests in our laboratory, the one-time interrupt latency for a Dell Precision 650 host workstation running Windows XP, with the PCI bus at 133 MHz, is no less than 1.5 ms. This penalty is intolerable in high-performance computing because, for example, the 64x64-point DCT2 takes about 350 µs on a single arithmetic unit running at 100 MHz (which is within the range of current FPGA technology) for the Feig & Winograd algorithm [3]. If the host frequently intervenes in FPGA operations, the speedup gained from the parallel FPGA implementation can be significantly reduced or even eliminated. Our data prefetching scheme, which relies on memory switching, is designed for the H-SIMD machine to overlap host communications with FPGA computations as much as possible. Data flowing from the HC is directed into the high-speed SRAM banks on the FPGA board. The HC-level memory switching scheme is shown in Fig. 2 [7].
If MMX and streaming SIMD instructions are enabled on the latter, H-SIMD yields a speedup of about 6% to 18%; otherwise, the speedup is about four. This shows the effectiveness of the H-SIMD architecture on data-intensive applications.
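As a rough check on the latency figures quoted in this section, the arithmetic below compares the measured interrupt latency (at least 1.5 ms) with the quoted DCT2 time (about 350 µs per 64x64-point block on one arithmetic unit). It is a back-of-the-envelope sketch: the constant and function names are hypothetical, and the model assumes one interrupt per batch of equally sized blocks.

```python
import math

INTERRUPT_LATENCY_US = 1500.0  # lower bound quoted in the text (1.5 ms)
DCT_BLOCK_TIME_US = 350.0      # approximate 64x64 DCT2 time, one unit at 100 MHz


def overhead_ratio(blocks_per_interrupt):
    """Interrupt latency relative to the compute time of one batch."""
    return INTERRUPT_LATENCY_US / (blocks_per_interrupt * DCT_BLOCK_TIME_US)


def min_blocks_to_hide_latency():
    """Smallest batch whose compute time is at least one interrupt latency."""
    return math.ceil(INTERRUPT_LATENCY_US / DCT_BLOCK_TIME_US)


if __name__ == "__main__":
    print(round(overhead_ratio(1), 2))   # latency vs. a single block
    print(min_blocks_to_hide_latency())  # blocks needed per host intervention
```

With these figures, a single interrupt costs more than four times the computation of one block, so the host would need to supply at least five blocks per intervention before computation could cover the latency. A parallel FPGA implementation finishes each block faster than a single arithmetic unit, which raises that number further.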
