Recent advances in multi-million-gate platform FPGAs have made it possible to design and implement complex parallel systems on a programmable chip (PSOPCs) that also incorporate hardware floating-point units (FPUs). These options take advantage of resource reconfiguration. In contrast to the majority of the FPGA community that still employs reconfigurable logic to develop algorithm-specific circuitry, our FPGA-based mixed-mode reconfigurable computing machine can implement simultaneously a variety of parallel execution modes and is also user programmable. Our HERA (HEterogeneous Reconfigurable Mixed-mode parallelism has the potential to best match the characteristics of all subtasks in applications, thus resulting in sustained high performance. We evaluate HERA's performance by two common computation-intensive testbenches: matrix-matrix multiplication (MMM) and LU factorization of sparse Doubly-Bordered-Block-Diagonal (DBBD) matrices.
Introduction
With the recent achievement of multi-million-gate platform FPGAs to contain richer embedded feature sets, such as plenty of on-chip memory, DSP blocks and embedded microprocessor IP cores, FPGA-based reconfigurable computing is going through a revolution. While the majority of the FPGA community still employs FPGAs to design and implement algorithm-specific circuitry, FPGAs have made their way into the highperformance computing world. It has become feasible to build high performance, low power At the same time, the promise of billion-transistor chips in a few years' time presents new opportunities in computing that bring up a main issue: what kind of (micro)architectures has the better promise for scalable performance under technology scaling? Conventional microarchitectures are approaching a performance limit due to the limited ILP (Instruction Level Parallelism) in real programs [5] ; their large power dissipation is a major problem as well. Also, wire delays decrease much slower than transistor switching times for deep submicron processes. As a result, a major shift from ILP to TLP (Thread Level Parallelism) is present in both the industry and research communities. On the other hand, ASIC-based single-chip multiprocessors have gained popularity [6] [7] . However, a high volume is required to amortize the high development and NRE (nonrecurring engineering) costs of ASIC-based approaches, especially for deep sub-micron designs. Also, the ever-shortening product cycles and the high design complexity of such solutions limit their viability [7] . In contrast, FPGA technologies provide a great opportunity to system designers to combine the highperformance of ASIC designs with the programming flexibility of microprocessors [8] . In fact, recently introduced reconfigurable arrays for ASIC Systems-on-a-Chip (SoCs) reduce their design complexity, and add programmability and flexibility [9] . More importantly, by taking advantage of the dynamic reconfiguration of FPGAs we could customize the hardware at runtime to match the characteristics of the applications [10] .
From the application's point of view, the performance of given computing systems is not optimal for most subtasks due to the system's expected unsuitability; different subtasks in an application normally require different architectures for high performance. SIMD and MIMD are the two fundamental and complementary parallel modes of execution. SIMD's superior ability for data parallelism, often enhanced with low inter-PE communication and synchronization overheads, make it superior to MIMD in performing fine-grain tasks [11] [12] .
Many numerical analysis algorithms, such as large-scale matrix multiplication and LU factorization, have a very high degree of structured, fine-grain parallelism and can benefit substantially from the SIMD mode. However, due to SIMD's implicit synchronization, SIMD machines are often under-utilized for applications involving dynamic parameters and an abundance of conditional statements. On the other hand, MIMD machines consisting of independent PEs are good at conditional branching. However, the PE independence property of MIMD makes programming cumbersome. For applications prone to SIMD execution, the need to explicitly synchronize the PEs in MIMD realizations produces substantial overheads.
Furthermore, every PE in MIMD requires its own program memory. Mixed-mode heterogeneous computing [13] , where the machine's operational mode (i.e., SIMD, MIMD or M-SIMD) changes dynamically as needed by individual subtasks in an application, integrates effectively most of the SIMD and MIMD advantages while alleviating their major drawbacks.
This paper presents our mixed-mode HERA computer, which is coarse-grain, dynamically reconfigurable and supports floating-point arithmetic. It is based on Xilinx Virtex II and Virtex II pro platform FPGAs. Every PE in HERA is built around an IEEE 754 singleprecision FPU. The system can be reconfigured dynamically at runtime to support a variety of independent or cooperating computing modes, such as SIMD, MIMD and M-SIMD, to best match in the time spectrum all subtask characteristics in a single application. Also, every PE is tightly coupled with its own local data and program memories. The low communication overheads and very low cost of our approach add much needed promise to the highperformance, low-cost computing field.
HERA's target domains are large-scale, computation-intensive, matrix-based scientific and engineering applications. Many common problems appearing in areas such as power engineering, bioinformatics, structural analysis, circuit simulation, traffic simulation, fluid dynamics and chemical engineering can be formulated as the recurring solution of systems of equations. The corresponding matrix-based algorithmic solutions are often computation intensive and present major challenges to current computing systems. General-purpose processors do not perform well in these problems since they are not optimized to meet their needs. Substantial gains achieved by specialized processors for these problems justify our research efforts. MMM and the LU factorization [14] of large matrices are two examples of computationally expensive primitives in these areas, requiring time O(N 3 ), where N x N is the matrix size; they play such an important role in many applications that their efficient implementation has long been the focus of many research efforts. In this paper, we show that by employing an innovative data partitioning scheme and mixed-mode scheduling on our 64-PE HERA machine, Cannon's MMM algorithm [15] can be efficiently adapted for matrices of arbitrary shape and size. In [16] [17] , we carried out the parallel LU factorization of sparse DBBD matrices on an in-house developed MIMD configurable multiprocessor employing the Altera Nios ® IP core; several tree networks interconnected the processors. In contrast, we demonstrate here the beneficial effect of mixed-mode scheduling in HERA for improved performance. Our preliminary HERA design was reported in [18] . We show here our improved HERA architecture and study its performance for parallel MMM and the LU factorization of large sparse matrices appearing in real-world power flow analysis. Many other application areas, such as signal and image processing, could also benefit from HERA provided that: (a) it employs a generalized inter-processor connection structure which is realized by a mesh network augmented with global buses and limited shared-memory capabilities; and (b) it is user programmable via a matrix-oriented optimized instruction set.
The paper is organized as follows. Section 2 briefly discusses related work. The architecture of HERA is presented in Section 3. Section 4 contains implementation details. A brief discussion on programming systems like HERA is presented in Section 5. In Section 6, we illustrate mixed-mode scheduling for MMM and LU factorization on HERA. Performance results for MMM on a 64-PE HERA are provided and compared to those of two Dell PCs.
Parallel LU factorization of sparse DBBD matrices and relevant SIMD, MIMD and mixed mode scheduling on a 36-PE HERA are also included. These experimental results and theoretical analysis are employed to prove the strengths of HERA with mixed-mode scheduling. Conclusions follow in Section 7.
Related work
Work related to HERA includes: FPGA-based custom designed machines, single-chip multiprocessor ASIC machines and mixed-mode machines. design. Every PE in HERA is equipped with its own control unit so that not only the whole system is dynamically partitionable, like previous mixed-mode systems, but also the role of each PE can be changed dynamically at runtime by using an HERA instruction, as needed.
Our purpose here is to show that HERA is a novel architecture unifying the flexibility of reconfigurable logic, mixed-mode parallel execution and the programmability of 
PE architecture
We adopted a RISC load-store architecture for our PE to save hardware resources. and 1-bit operating mode register (OMR). Similar to some other RISC processors, the R0
GPR is fixed at zero.
The operating mode of each PE is configured dynamically by the host processor through its OMR by using the Configure instruction: "0" indicates SIMD and "1" sets the PE into MIMD. All PEs operate in SIMD when powered up. To switch a PE to MIMD from SIMD, the Sequencer first distributes the instructions to the LPM of the PE through the Column Bus and Cbus, and then sends a JumpI instruction to the PE with the starting address in the MIMD code. OMR is set to 1. To switch back to SIMD, OMR is reset to "0" and the PE then listens for the broadcasting of a global instruction. The data in the registers and memories remain intact during switching. The instructions come from GPM in SIMD and from LPM in MIMD.
The masking in the SIMD mode can use the PE's ID number and/or LMR. 
Memory configuration
The sizes of LPM and LDM are determined by the number of memory blocks in the FPGA device. They are configured as dual-ported 32-bit memories. Fig. 3 shows the connections for the two memory ports of LPM and LDM. We try to make the data memory as large as possible in order to reduce the data I/O time. The A port of LPM is connected to one Cbus, and serves as the interface to the Sequencer and host processor. This way, the host processor has access to all LPMs. It sends application programs to every PE through this port from the Main Memory if the PE is to operate in MIMD. Port B of an LPM can be accessed by the local PE to get the instructions. An interesting feature of our design is the LDM interface.
Our matrix algorithms usually involve frequent accesses of intermediate results in nearest neighbors. The NEWS network can efficiently support single-word communication.
However, it may take many cycles to transfer a large amount of data when using NEWS connections. Based on our experience with matrix algorithms on our IP-based multiprocessor machine, where PEs often use results from their east and north neighbors, we employed a shared dual-ported memory to address this problem. The A port of LDM is accessed by the local PE, and the B port is shared with the neighbors to the south and east. A PE can directly write to or read from the LDMs of its west and north neighbors via their B ports. Thus, we eliminate block data transfers between nearest neighbors. Another important use of the shared port is to pipe the data assigned to each PE into its LDM at system initialization. The A port of the LDM attached to the first PE on every row can also be accessed via the data bus.
Implementation and floating-point performance
Our first implementation was carried out on the high-performance WILDSTAR II-PCI FPGA board from Annapolis Micro Systems [29] . The board is populated with two Xilinx XC2V6000-5 Virtex II FPGA devices and 24MB of DDRII SRAM memory (12 chips). The board communicates with the host computer via a PCI bus interface. Every PE was assigned 4KB for LPM and 8KB for LDM. The interface to the PCI bus operates at 133MHz and the datapath is 64 bits wide. For the sake of comparison, similar to [23] we synthesized and implemented our design using the Xilinx ISE 5.2i toolset for an XC2VP125-7 FPGA. The divider in [23] uses a look-up table based reciprocator and a multiplier that result in precision errors. Table 2 shows that our FPU components generally result in higher frequency of operation; the overall latency and resource consumption are always smaller for our design. parallel programming models based exclusively either on shared-memory or message-passing create a clear boundary between software and hardware. The seminal advantage provided by HERA is its flexibility to customize the architecture for a given application, so such a boundary is an obstacle. Moreover, low power has become a first-order priority that has to be taken into account in the customization process. We have not seen appropriate programming environments supporting mixed-mode parallel computing and power-aware adaptations.
Nevertheless, customizable multiprocessing programming environments targeting chip multiprocessors (e.g., [33] [34] ) could be modified to work with HERA. We have done some relevant work for configurable systems in general. To support the implementation of conventional programming environments for configurable parallel systems, our design in [37] implements directly in hardware MPI (Message-Passing Interface) primitives; this creates a framework for efficient parallel code development involving data exchanges independently of the underlying hardware implementation. This process can also support the portability of MPI-based code developed for more conventional platforms (e.g., PC clusters) to configurable multiprocessors. Our design takes advantage of the effectiveness and efficiency of one-sided RMA (Remote Memory Access) communications.
We have recently focused our attention on the development of efficient resource management tools to maximize the performance of HERA for a given FPGA-application pair.
An integrated framework has been created for HERA hardware optimization, and dataparallel application mapping, scheduling and execution. Preliminary results will appear in [35] . There are five major phases in this framework: task profiling where the behavioral description of the application is analyzed to construct a task flow graph, system synthesis, task coding using the target system's instruction set, system implementation on FPGAs and dynamic, simultaneous resource-efficient task decomposition, mapping, scheduling and execution. The implementation on FPGAs follows the same procedure as any VHDL-based design methodology.
In addition, for a single HERA PE we could borrow ideas related to reconfigurable instruction set processors (RISPs) that contain reconfigurable logic in some of their functional units [39] . Code generation tools for RISPs could be used to create and evaluate reconfigurable instructions and select those that minimize reconfiguration time [38] . Several RISP HLL compilers are presented in [38] . The next section goes directly into the mapping problem for two specific applications, namely MMM and LU factorization of sparse DBBD matrices.
Mapping Applications onto HERA

Generalized Cannon's MMM algorithm
Data partitioning and mapping
Cannon's MMM algorithm [15] is for a memory efficient parallel implementation on torusconnected processor arrays, where each processor communicates directly with its immediate neighbors in the four NEWS connections directions. The original algorithm assumes that the input matrices and the partitioned matrix blocks are all square. In our implementation, however, matrices A and B for A x B can be of any shape and size (still, the number of rows in A and the number of columns in B should be the same).
Assume PEs are organized in a q x q 2-D torus. Let A and B be matrices of size N1 x N2
and N2 x N3, respectively. We assume that the on-chip memory can store 3m 2 floating-point elements. To be able to store complete blocks from the input and output matrices, the maximum size of a matrix block should be m x m. Let 1 and A(2,1) * B(1,1) consume most of the execution time. In all the steps, except
Step 1, data locality has priority in job assignments.
Performance results and analysis
We first implemented on HERA our mixed-mode MMM scheduling for square matrices of size up to 1000 x 1000. In this experiment, the PEs form an 8 x 8 mesh. Our target application is data intensive and the size of the core computation code for the PEs is less than MKL [43] library is one of the best for matrix operations; it is highly tuned to the architecture of the Intel processors. We could not find related work on the performance of MKL on the Pentium III. We show instead in Fig. 5 the MKL-based performance of a 1.5GHz Pentium 4
processor on double-precision MMM [42] . If we assume that the best-case performance for single-precision MMM is double that for double-precision MMM and we also scale appropriately the results in [42] to target the 1GHz Pentium III processor, we can find out that HERA's performance is worse by about 30% in the worst case. However, this finding should be expected given the fact that MKL is a highly optimized library for the Intel architecture. Also, if we consider the fastest and largest FPGA device in the same family (i.e., the XC2V8000-5), the FPGA's expected performance is an improvement of about 30%. We expect dramatic performance improvement in the near future by employing more advanced [40] . Of course, a major drawback of DSP processors is their much higher energy consumption as compared to FPGAs [23] .
The speedup of the parallel implementation on the 64-PE HERA over the sequential one on the 1-PE HERA is shown in Fig. 6 . The speedup for the 100 x 100 case is much lower because the ratio of computation to communication times is much lower than for the other cases. We could improve the speedups further in all cases by using a bigger local memory since the complexity of multiplication on a single PE for a pair of blocks is O(N 3 ) and that of communication (i.e., shifting) is O(N 2 ), where N x N is the block size. With increases in the problem size, the speedup and, of course, the efficiency stabilizes in a very narrow range.
We also evaluated the performance of our mixed-mode scheduling for a variety of non-square matrices. The multiplication of irregular matrices is required in the parallel LU factorization of sparse DBBD matrices (described in the next section). In this algorithm, the factored border blocks in each 3-block group, which are irregular in size, are multiplied to generate an intermediate matrix block of the same size of the last block, which is accumulated and added to the last block before its factorization begins. SIMD mappings, where all the PEs work in the SIMD mode all the time were also implemented for these matrices; the results are shown in Table 3 . These real-world matrices are used to solve power flow equations. From this table, we can see that our dynamic mixed-mode scheduling can greatly boost performance when the multiplication of irregular matrices is needed. and U are formed, the unknown vector x can be identified by forward reduction and backward substitution, respectively, using the two equations Ly = b and Ux = y.
Since LU factorization is a computation-intensive procedure, its parallel solution has been a quite active research area. Unfortunately, our early research [16] has revealed that for the large sparse matrices, such as those appearing in power engineering most of the current parallel solvers perform poorly; these solvers soon make these matrices dense because of fillins (i.e., zero elements that receive new values) during the factorization. By taking advantage of the physical characteristics in power matrices, we can reorder the power matrices into the DBBD form shown in Fig. 7 by applying a heuristics-based partitioning scheme [30] .
In the DBBD form, the Aik's represent matrix sub-blocks and all the non-zero elements in the matrix appear only inside these sub-blocks. For every fixed i, the blocks Aii, Ain and Ani are said to form a 3-block group, where i ∈ [1, n-1] and n ≤ N. Ann is known as the last block.
The Aii's will be referred to as the diagonal blocks, and Ain and Ani will be called right border block and bottom border block, respectively, where i ∈ [1, n]. The sizes of all the blocks after ordering are determined by the physical characteristics of the matrices and the ordering parameters, such as the maximum number of nodes in a block. Since all non-border, off-diagonal blocks contain only 0's, if we apply Eqs.
(1) and (2) to a DBBD matrix, we can find out that there will be no fill-ins in these blocks during factorization. Thus, the resulting matrix keeps the same DBBD form, as shown in Eq. (3). 
Mixed-mode scheduling on HERA
The parallel LU factorization of sparse DBBD matrices involves irregular computation patterns and blocks of various sizes, as the result of the physical characteristics of the original matrices. However, many parts in the algorithm could still benefit from an SIMD implementation. As a natural consequence, a combination of appropriate parallel execution modes should give better results.
For this application, our HERA machine comprises 36 PEs, which are configured in a 6 x 6 mesh. To map an application algorithm onto a mixed-mode system, the main focus is on identifying the optimal mode of parallelism for each subtask. We should also take into account the costs incurred when switching between different pairs of modes: SIMD/MIMD, SIMD/M-SIMD and MIMD/M-SIMD. The following is the general scheduling procedure to carry out the parallel LU factorization of DBBD matrices on our mixed-mode machine.
Step 1 Identify 3-block groups of comparable size and put them into different task queues.
Divide and configure the system into M-SIMD based on the task information. Assign 3-block groups from each queue to the PEs working in the same SIMD group, and perform the FAC and MAC work on these groups until the number of remaining 3-block groups is less than the number of PEs (i.e., 36).
Step 2 Assign the remaining 3-block groups so that groups of comparable size go to the same column of PEs (see Fig. 1 ) and every PE has the largest possible number of idle nearest neighbors. This is an effort to facilitate the subsequent PAC work. If necessary, reconfigure the system into a different M-SIMD layout.
Step 3 A PE is reconfigured into MIMD as soon as it finishes its work and no more 3-block group is waiting in the task queue.
Step 4 Assign each PE in MIMD to the multiplication of a pair of (row and column) factored border blocks. Since the LDM has a shared port with its east and south neighbors, every idle PE will help its neighbors after it finishes its own work; no data transfer incurs in this process.
Step 5 After the factorization of all the 3-block groups and the multiplication of factored border blocks, reconfigure all the PEs again into the SIMD mode to carry out the PAC work.
Step 6 Factor the last block in the SIMD mode. Fig. 8 shows a typical PE mode assignment in the above procedure for large DBBD matrices. When the number of tasks in one or more task queues is larger than 36, we begin with one or more single SIMD configurations, which is a special case of M-SIMD in Step 1.
Performance results and analysis
Experiments implementing the SIMD, MIMD and mixed-mode scheduling schemes were performed on the 36-PE HERA machine. Table 4 shows the characteristics of the test matrices from the Harwell-Boeing Collection in the Matrix Market [31] . The last matrix corresponds to a power network in North America. The performance results under these parallel execution modes are presented in Fig. 9 . It is obvious that mixed-mode parallelism consumes less time for all the matrices and the advantage is more significant when the 3-block groups are highly irregular in size and shape, such as for the matrices of dimension Under the SIMD mode, some PEs are idle during the factorization of the 3-block groups and the multiplication of the border blocks. The total execution time is From the equations, it is easy to see why MIMD performs better than SIMD for the matrices of dimension 1723, 5300 and 7917, and worse than SIMD for the rest of matrices. A disadvantage of this algorithm in MIMD execution is that half of each PE's LPM space is used for instructions despite the fact that the PEs run identical instructions most of the time.
This space could be used instead for data storage to reduce data transfer times. Another factor causing performance degradation as compared to mixed-mode execution is the increased number of instructions for synchronization. The processing of the last block is more efficient under SIMD than MIMD. Nevertheless, it contributes substantially to the total execution time.
The last block requires a significant amount of data communications among PEs; the implicit synchronization in SIMD reduces the communication time. Fortunately, however, HERA's communication overhead in MIMD is not significantly higher than that for SIMD. However, MIMD tends to perform better than SIMD in this algorithm for large matrices where we have a good chance that matrix blocks are more irregular and sparse. Due to insufficient work, all modes perform comparably for the 494 x 494 matrix.
Previous relevant work has primarily focused on fixed-point implementations of LU factorization. However, the implementation in [23] involves FPUs as well and DSP processor comparisons. A circular linear array architecture is implemented, specifically for LU factorization. To resolve data dependencies, [23] assumes a stream of s "stacked" dense matrices; s has to be larger than the combined latency (in cycles) of the multiplier and subtractor units. The total latency is s * n + s * n 2 for n x n matrices and the throughput is one matrix per n + n 2 cycles. Also, shift registers are inserted in the datapaths that bypass the FPUs and the control logic must be able to delay the control signals. This design is inflexible since the number of processors and their storage space are specific to the matrix size.
We have to emphasize here that [23] attempts to maximize the throughput because it is an application-specific programmable circuit (ASPC) whereas HERA is a fully-programmable system. Results in [23] were reported for s = 10, 19 and 25. The latencies as shown in [23] for non-optimized code on the TMS320C6711 DSP processor are included in Table 5 ; matrices of various sizes were considered. Note that [23] only shows the "effective latency" which is actually the inverse of the throughput for processed matrices. It can be deduced that HERA's performance is much better than that of both systems. There are 35 PEs in HERA running at 147MHz (for the Xilinx ISE tools) and each PE can complete three floating-point operations per cycle. Therefore, HERA has a peak performance of 15.44GFLOPs whereas the performance of the TMS320C6711x family is 600-1500MFLOPs (for 100-250MHz frequencies) [32] . There are often more disadvantages to DSP processors that rely heavily on the clock frequency for high performance. FPGAs can offer much more flexibility in memory hierarchy and configuration, system architecture and processor microarchitecture, data formats, interconnection networks, etc.; an FPGA-based design can be optimized based on various metrics such as energy, throughput, latency, area, design time, budget, etc.
Conclusions
We have presented the design and implementation of our HERA architecture on platform MIMD execution modes. The results also show that the system efficiency stabilizes with increases in the problem size, which proves the good scalability of our architecture for these algorithms. Many applications can benefit from the high-performance and flexibility resulting from dynamic mixed-mode scheduling; this is especially true for programs rich in conditional and non-deterministic operations. Our work shows that new-generation FPGAs have made feasible the building of highly parallel complex systems supporting floating-point arithmetic.
Of course, another major advantage is that these systems are easily accessible and portable.
The ultimate goal of HERA is to maximize the performance by tuning the architecture to match the application through dynamic resource management and reconfiguration of FPGAs.
With the anticipated speed and density improvements for FPGAs, low cost reconfigurable computers can enter mainstream parallel computing because they have the potential to narrow the growing performance gap between applications and chips. 
