I. INTRODUCTION
Many signal processing algorithms found in a wide range of applications, such as wireless communications, medical imaging and multimedia, rely heavily on array data structures. The exponential growth of information and the real-time requirements of these applications constantly challenge the computing power of digital systems. Traditionally, the dominant platforms have been DSP processors, ASICs and FPGAs. FPGAs have emerged as a low-cost and low-risk alternative to ASICs due to their high flexibility and their capability to customize the hardware architecture to match the characteristics of algorithms [1]. Most often, these platforms support only fixed-point computations because of insufficient resources, even though the corresponding algorithms use floating-point (FP) operations. Hence, significant research has focused on how to transform floating-point into fixed-point representations and vice versa [2]. However, recent remarkable advances in FPGAs, such as millions of user-accessible logic gates and plenty of embedded DSP and memory blocks, provide new opportunities for FP digital signal processing. Research results show that the peak FP performance of FPGAs has surpassed that of modern microprocessors in the last few years and is growing much faster than the latter [3]. With state-of-the-art FPGAs, it is feasible to design and implement parallel FP reconfigurable computing machines [5, 9]. Another notable trend is that fine-grain, application-specific schemes face many challenges, whereas coarse-grain architectures are more effective on modern FPGAs [4].
HERA [5] is a multiprocessor on a programmable chip (MPoPC) that we have designed and implemented with in-house developed IEEE-754 single-precision FP units on the Annapolis WILDSTAR II-PCI board populated with two Xilinx Virtex II FPGAs [13]. In contrast to conventional FPGA-based designs, MPoPCs are user programmable instead of developer programmable. We have not seen major research effort in this direction. The key advantage of FPGAs is that we can customize the hardware architecture based on application characteristics and, hence, achieve a higher utilization of hardware resources compared to fixed hardware (such as microprocessors and ASICs). Based on this motivation, HERA can be dynamically reconfigured to support a variety of independent or cooperating computing modes, such as SIMD (Single-Instruction, Multiple-Data), Multiple-SIMD, and MIMD (Multiple-Instruction, Multiple-Data).
Taking advantage of HERA's mixed-mode parallelism and reconfigurability, we present an adaptive scheduling strategy for array-intensive applications. In our computing model, an array-intensive application is represented by a mixed-mode task flow graph consisting of two distinct categories of tasks: SIMD and MIMD. SIMD tasks are the focus of dynamic adaptive scheduling. In contrast to most previous loop scheduling strategies for conventional multiprocessor systems [6] [7], which consider just one loop at a time, the data dependences and priorities associated with multiple SIMD tasks (loops) are the main concern of our adaptive scheduling approach. In order to achieve a minimum execution time for the entire application, the number and type of PEs (Processing Elements) assigned to each SIMD task vary dynamically at runtime. Another major advantage of our dynamic scheduling (compared to related work) is its much lower scheduling overhead, which is due to the very low communication cost in MPoPCs. Moreover, our HERA system supports the run-time reconfiguration provided by FPGAs, which allows us to change PE functionality dynamically and to reconfigure the system in order to increase resource utilization and match task characteristics. Singular Value Decomposition (SVD) is investigated as a representative computation-intensive example suitable for evaluating our approach. An FPGA implementation of fixed-point SVD on a systolic array appeared in [8]. To the best of our knowledge, no FP solution has ever been published for FPGAs.
The remainder of this paper is organized as follows. Section II provides an overview of HERA. We briefly introduce application characterization in Section III. The adaptive and dynamic scheduling algorithm is presented in detail in Section IV. The implementation details of the SVD algorithm and performance results are shown in Section V. We conclude the paper in Section VI.
II. HERA: A MIXED-MODE MPOPC
The general organization of our HERA machine with m × n PEs interconnected via a 2-D mesh network is shown in Fig. 1. The key architectural features include: pipelined PEs supporting IEEE 754 single-precision FP operations; PEs that are programmable and customizable from a library of hardware components; a 2-D mesh layout with NEWS (north, east, west, south) connections; a general-purpose instruction set; a shared bus for each column of PEs; a dual-ported data memory in each PE that is also directly accessible by its immediate south and west neighbors; and selection of any PE by the sequencer using its ID or a mask. Because FP computing cores consume many resources, the system configuration is chosen dynamically for a given application, and PEs support only the FP operations that are required. The functionality of PEs can be changed via hardware reconfiguration at runtime as needed by the tasks. Relevant details can be found in [9].
IV. ADAPTIVE SCHEDULING STRATEGY
The objective of our adaptive scheduling technique is to minimize the execution time of the application by dynamically repartitioning and redistributing active SIMD tasks among available PEs in the system. We use a centrally controlled scheme based on our system architecture. PEs are required to record runtime statistics in specified registers and the system controller (SC) collects such information from individual PEs to make assignment decisions.
The target system is modeled as an undirected graph $G_T = (P, L)$, where a vertex $P_i \in P$ represents PE $i$ and an undirected link $L_{ij}$ represents a bi-directional communication channel between PE $i$ and PE $j$, for $i, j \in [1, p]$. Each PE $i$ is associated with a parameter, defined on $P_i$, that records its functionality. The weight $w(L_{ij})$ on each $L_{ij}$ denotes its minimum communication cost, which is calculated from the number of hops between PE $i$ and PE $j$. Whenever there is contention for a route, the task with the higher priority is routed first. Also, communication jobs always have higher priority than computation jobs (i.e., PEs are always forced to route their messages even when they are busy). A priority is assigned to each task in the task flow graph (TFG), and it changes dynamically as scheduling proceeds. This is because the dynamic assignment of PEs, the dynamic partitioning and migration of tasks to multiple PEs, and the communication pattern and cost in our policy result in changes in the critical path. If two tasks are assigned to the same PE, then the communication between them is removed. The following is our policy for assigning priorities.
Step 1. Find the first critical path ($CP_1$) in S. If there are several critical paths, choose one at random.
Step 2. Assign priority 1 to the entry node in $CP_1$.
Step 3. Proceed along $CP_1$ and assign each task the priority of its predecessor plus one.
Step 4. Find the other parents of each task in $CP_1$ and insert them before this task; determine the priority of multiple parents by their location in subsequent critical paths.
Step 5. Find the next CP in S and repeat Steps 3-4 until all tasks are processed.
In the scheduling procedure, a task is said to be READY when all its inputs are available. A QUALIFIED PE for a task $S_i$ is defined as a PE that supports all the operation types in $S_i$. We define the average execution time (AET) for each task $S_i$ distributed over $p_i$ PEs; its definition includes the communication overhead caused by distributing task $S_i$ to $p_i$ PEs as compared to scheduling it on one PE. A distinct advantage of our strategy is that the SC is fully aware of the performance of each PE and of the system structure. In order to simplify the scheduling algorithm and reduce the runtime scheduling overhead, the following policies are applied.
At any given time, tasks do not share the same PE. Pieces of an MIMD task can only be scheduled to immediate neighbors. SIMD tasks have a higher priority than MIMD tasks (the former also can be assigned to immediate neighbors).
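The system model and the READY and QUALIFIED definitions above can be illustrated with a minimal Python sketch. The class names `PE` and `Task`, the string labels for FP operations, and the use of the Manhattan distance as the hop count on a mesh without wraparound links are all assumptions made for illustration; they are not taken from the HERA implementation.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class PE:
    """Hypothetical PE record: mesh position plus the FP operations it supports."""
    row: int
    col: int
    ops: FrozenSet[str]


@dataclass
class Task:
    """Hypothetical TFG node: required operation types and predecessor task names."""
    name: str
    ops: FrozenSet[str]
    preds: List[str] = field(default_factory=list)


def hops(a: PE, b: PE) -> int:
    """Hop count between two PEs, assuming a 2-D mesh without wraparound."""
    return abs(a.row - b.row) + abs(a.col - b.col)


def is_ready(task: Task, finished: Set[str]) -> bool:
    """A task is READY when every predecessor has produced its output."""
    return all(p in finished for p in task.preds)


def qualified_pes(task: Task, pes: List[PE]) -> List[PE]:
    """Return the PEs that support every operation type used by the task."""
    return [pe for pe in pes if task.ops <= pe.ops]
```

For example, a task that needs `mul` and `div` is qualified only on PEs whose current configuration includes both units, which is the situation that run-time reconfiguration of PE functionality is meant to address.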
A. Loop Partitioning
Loop-based partitioning is the basis of our adaptive parallelization. In our current design, flow dependence is allowed inside an assigned iteration. We distinguish the following cases.
FOR loops without cross-iteration dependence
Assume that a number of PEs are available to the loop and that the total number of iterations is $L$. The loop space is split into groups whose size is $L$ divided by the number of available PEs, rounded either down or up. Each PE gets such a group and the corresponding data set. These loops map naturally to the SIMD mode and no communication is required. FOR loops with neither flow nor cross-iteration dependences are treated the same way.
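As an illustration of this partitioning, the following Python sketch splits an iteration space into contiguous groups whose sizes differ by at most one. The function name `block_partition`, the contiguous assignment, and the choice to give the larger groups to the lowest-numbered PEs are assumptions; the text only fixes the two possible group sizes.

```python
from typing import List


def block_partition(num_iterations: int, num_pes: int) -> List[range]:
    """Split iterations 0..L-1 into one contiguous group per PE.

    The first (L mod n) PEs receive ceil(L / n) iterations and the
    remaining PEs receive floor(L / n) iterations.
    """
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    base, extra = divmod(num_iterations, num_pes)
    groups = []
    start = 0
    for pe in range(num_pes):
        size = base + (1 if pe < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups
```

Because the iterations are independent, each group can run on its PE in SIMD mode without exchanging data with the other groups.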
FOR loops with cross-iteration dependence [10]
We assume that the distance between successive data-dependent iterations is $w$. Fig. 3 shows a simple example with $w = 2$. Let the total iteration space $L$ of the loop be a multiple of $w$; the loop can then be divided into $w$ partitions. The i iterations into a new loop with a constant number of operations [11]. Then the above partitioning techniques can be applied to these loops. Other loops that can be transformed into FOR loops (e.g., WHILE loops) are treated similarly.
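For the cross-iteration case with dependence distance $w$, one way to obtain the $w$ partitions is to group the iterations by their index modulo $w$, so that each partition is a dependence chain that runs in order while different partitions are independent of each other. The sketch below assumes this grouping and uses a toy recurrence `a[i] = a[i - w] + 1` purely for checking; neither is taken from the paper.

```python
from typing import List


def dependence_partitions(num_iterations: int, w: int) -> List[List[int]]:
    """Group iterations into w chains; chain r holds r, r + w, r + 2w, ..."""
    if w <= 0 or num_iterations % w != 0:
        raise ValueError("num_iterations must be a positive multiple of w")
    return [list(range(r, num_iterations, w)) for r in range(w)]


def run_sequential(n: int, w: int) -> List[int]:
    """Reference execution of the toy loop a[i] = a[i - w] + 1."""
    a = [0] * n
    for i in range(w, n):
        a[i] = a[i - w] + 1
    return a


def run_partitioned(n: int, w: int) -> List[int]:
    """Same loop executed one partition at a time.

    Partitions never read each other's elements, so they could be
    given to different PEs; within a partition the order is preserved.
    """
    a = [0] * n
    for part in dependence_partitions(n, w):
        for i in part:
            if i >= w:
                a[i] = a[i - w] + 1
    return a
```

With $w = 2$ and eight iterations, the two partitions are the even and the odd iterations, and the partitioned execution produces the same array as the sequential loop.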
B. PE Searching
The distance between PEs is one of the two critical parameters to the communication cost. Thus, the order in which we search for one or more candidate PEs is an important step affecting the overall mapping performance. Based on the HERA organization and interconnection network, we propose column-oriented PULSE searching as shown in Fig. 4 (a) . The motivations are:
The column buses in HERA can be used to broadcast instructions in the SIMD and M-SIMD modes. In this search pattern, the distance between two adjacent stops is always one. One port of the data memory of each PE in HERA is shared with its immediate neighbors to the west and south. By selecting candidate PEs in the PULSE pattern, there is a good chance of saving communication time. For example, consider the TFG shown in Fig. 4. Assume that the numbers of PEs assigned to tasks $S_i$ and $S_j$ are $p_i$ and $p_j$, respectively, and $x = \min\{p_i, p_j\}$. The objective function to be minimized in this step is the communication time among tasks. It depends on $D_{ij}$, the amount of data in bytes communicated between these two groups of PEs; the transfer speed in bytes/second between two immediate neighbors; $T_{ini}$, the overhead to initialize the transfer; $H(i, j)$, the number of hops between two communicating PEs; and $T_{cflict}$, the routing delay caused by data collisions. In order to reduce the collision and communication costs, data locality is taken into account when mapping tasks to PEs.
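The column-oriented search can be sketched in Python under a strong assumption about its shape. Fig. 4(a) is not reproduced here, so the code assumes that PULSE means a square-wave walk that goes down one column and up the next; this is one pattern that satisfies the stated property that adjacent stops are always one hop apart. The names `pulse_order` and `find_candidates` are hypothetical.

```python
from typing import List, Set, Tuple

Coord = Tuple[int, int]  # (row, col) position of a PE on the mesh


def pulse_order(rows: int, cols: int) -> List[Coord]:
    """Visit PEs column by column, reversing direction on every other column.

    Assumed square-wave shape: consecutive stops are always mesh neighbors.
    """
    order: List[Coord] = []
    for c in range(cols):
        row_seq = range(rows) if c % 2 == 0 else range(rows - 1, -1, -1)
        order.extend((r, c) for r in row_seq)
    return order


def find_candidates(rows: int, cols: int, busy: Set[Coord], k: int) -> List[Coord]:
    """Return the first k PEs along the search order that are not busy."""
    free = [pe for pe in pulse_order(rows, cols) if pe not in busy]
    return free[:k]
```

Walking the mesh in this order tends to return candidates that sit in the same or adjacent columns, which keeps hop counts low and lets the column buses and shared memory ports be used.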
C. Dynamic Adaptive Scheduling Algorithm
The scheduling procedure is as follows.
V. APPLICATION STUDY
Singular Value Decomposition (SVD) has been chosen as a computation-intensive algorithm to evaluate the performance. This algorithm is abundant in nested loops and, for square matrices, requires more than 20 times the number of FP operations in LU factorization. Given a matrix $A \in \mathbb{R}^{M \times N}$, its SVD is $A = U \Sigma V^T$, where $U \in \mathbb{R}^{M \times M}$ and $V \in \mathbb{R}^{N \times N}$ are orthogonal matrices, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is a diagonal matrix with $r = \min(M, N)$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r$. The $\sigma_i$ are known as the singular values of $A$, and the vectors $u_i$ and $v_i$ are the $i$-th left and right singular vectors, respectively. The algorithm studied here is based on Golub-Kahan [12], which is one of the most commonly employed implementations. In our experiment, a modified sequential description of the SVD algorithm at http://www.netlib.org/ was analyzed and then divided into the tasks summarized in Table 1.
In-house developed single-precision FP units were chosen and the Annapolis WILDSTAR II-PCI board with two XC2V6000-5 FPGAs [13] was used. The system was clocked at 125 MHz. We first evaluated the effect of runtime reconfiguration (RTR). The chosen numbers of PEs for the four steps in Table 1 were 36, 42, 42 and 42, respectively. Five randomly generated dense matrices were used; the execution times are shown in Fig. 5. The results show that partial dynamic RTR can improve the performance significantly; the speedup increases with the matrix size. Fig. 6 shows a performance comparison of our adaptive scheduling with an alternative dynamic scheduling in which all the available PEs are assigned to each task when it is ready (and remain fixed throughout the task lifetime). Our adaptive scheduling, supported by RTR, performs much better than this naive scheduling strategy. It was observed during the experiments that adaptive scheduling greatly reduces the effect of data dependences and the idle times of PEs. It effectively shortens the critical path, which largely determines the execution time of the entire application. An increase in the number of PEs dramatically improves the execution of the largest SIMD tasks (triply nested loops), and the overheads of loop partitioning and scheduling become less significant for larger matrix sizes.
VI. CONCLUSIONS
Our MPoPCs based on multimillion-gate FPGAs provide low-cost, low-risk and high-performance computing platforms for floating-point array-intensive applications. Our dynamic adaptive scheduling strategy takes advantage of HERA's mixed-mode parallelism. The good performance results with the SVD algorithm demonstrate the effectiveness of our dynamic approach and suggest greater benefits, compared to fixed PE allocations, as the matrix size increases.
