Abstract
Introduction
The computational complexity of many application algorithms increases at a higher rate than Moore's Law [2]. The von Neumann architectural paradigm for microprocessors, which struggles to follow Moore's Law, is not obviously the solution to bridging this computing gap [1]. Meanwhile, the supercomputer market has shrunk in recent years after more than 20 years of effort and tremendous investment. The demand for higher performance makes it necessary to build application-specific parallel architectures.
At the same time, FPGA-based configurable computing is becoming more and more appealing and has resulted in impressive achievements for many computation-intensive applications [3, 5]. Recently available multimillion-gate platform FPGAs with richer embedded feature sets, such as plentiful on-chip memory, DSP blocks and embedded hardware microprocessor IP cores, have made it feasible to build parallel systems on a programmable chip (PSOPC); a recent example is presented in [6]. We implemented a scalable shared-memory multiprocessor and mapped a parallel LU factorization algorithm onto an FPGA [7]. With its good performance, our machine shows the great potential of FPGA-based configurable technology for implementing parallel systems. Many relevant issues must still be addressed in this area.

* This work was supported in part by the U.S. Department of Energy under grant ER63384.
Moreover, as we all know from our experience with fixed parallel-computing architectures, the performance and efficiency of algorithms depend highly on a good match of the algorithm with the target hardware architecture. With FPGA-based platforms, we can adapt the architecture to match the communication and computation demands of the application. This new field demands better design methodologies to fully explore the hardware potential and achieve maximum system performance.
This paper focuses on hardware-software codesign perspectives and presents the techniques that we have applied to optimize the performance of our FPGA-based configurable multiprocessor system for LU factorization.
In order to further increase system efficiency, we propose a new dynamic scheduling algorithm that was tested with IEEE electric power 57-, 118- and 300-bus matrices using systems of up to five processors embedded into the Altera SOPC (System-On-a-Programmable-Chip) development board.
2. FPGA-Based Configurable Multiprocessors
The new FPGA-based computing platforms present great challenges to current hardware-software codesign methodologies, which are often based on a given CPU-ASIC hardware model [4]. Since the latter methodologies offer little choice in the hardware infrastructure, most of the efforts in optimizing the performance focus on the software. The boundary between hardware and software is decided upon before the separate software and hardware design teams begin their work using different languages, often C/C++ (and/or assembly code, primarily for the critical part) and HDL, respectively. In contrast, new FPGA-based configurable computing strategies provide the system designer with several dimensions along which to optimize the design for application-specific performance. Full control is possible over most of the resources, and enormous opportunities appear during the overall design process.
We illustrate our new hardware-software codesign techniques, which first optimize the algorithm and then analyze the application's communication-computation model for further optimization. Figure 1 shows our processor-based system model for the parallel bordered-diagonal-block (BDB) LU factorization algorithm [7]. The binary tree interconnection network matches well the data communication model in our algorithm.
The processor-based approach, which keeps the PEs more localized than a large across-the-chip computing strategy, is preferable for modern SOC designs. As the feature size of silicon processes enters the submicron range, the wire delay becomes even more significant compared to the logic delay, and it can even dominate the system's performance due to the reverse scaling of wires compared to transistors. Our binary tree network dramatically reduces global communications.
By using configurable RISC processor IP cores, we add another dimension of programmability and flexibility to the FPGA-based computing engines. We can tailor the processor to the specific requirements of the application and include only those features that the application needs. We can identify the critical instructions in the application code that affect performance the most and implement them as hardware logic. Table 1 shows the large difference between software (SW FP) and hardware (HW FP) floating-point operations for LU factorization on a variety of matrices. The configurable processor cores also give us more flexibility than fixed processor cores when integrating them with other IPs in an SOPC environment.

The PE lies at the core of any computing machine. When using configurable processors, we need to carry out trade-offs between the processing power and the resources being used. The processor datapath width, register file size, API functions, peripherals, data and [...] Figure 1. The operation of every PE is guided by the system controller (SC), which utilizes the boot code in the boot memory of every PE and the interrupt connection between every PE and this controller. The control channel is a star connection between the SC and every PE. There is also a direct communication channel between the controller and every PE through the dual-port on-chip memory.

Figure 1. Our multiprocessor architecture model
The memory hierarchy design and management are always major design issues for all computer systems. There have been many advanced memory management techniques for conventional microprocessors; they are directly supported by the operating systems and compilers. Due to the current lack of relevant software support in configurable machines, the memory design becomes a dominant factor in system performance. Moreover, while new silicon technology and computer architecture research facilitate faster processors, the performance gap between processors and memories tends to become larger. In our shared-memory multiprocessor, the overall speedups may be quickly diminished by severe memory contention and heavy system synchronization if we rely solely on the on-board SRAM memory as the main runtime memory. Fortunately, new-generation FPGAs make available more on-chip memory with wide communication channels.
Our FPGA-based multiprocessor architecture capitalizes on this advantage and forms several kinds of memories in order to maximize performance. For example, we implemented a controller to manage the system's operation and pre-fetch instructions from the main memory into the PEs; the latter use the on-chip memory to run the application code because of its much lower access latency compared to the on-board SRAM memories. Table 2 summarizes the memory hierarchy and related characteristics. The bootup program in the Boot ROM is implemented in assembly code and is smaller than 2 KB; it is used to communicate with the SC. The Shared RAM between two neighbors speeds up the system by eliminating the transfer of large blocks of data between memories. The sizes of the Shared RAM and Data Memory are determined based on the size of the largest 3-block matrix group that may appear in our algorithm [7] and the total available on-chip memory. We assign just enough space to the Boot ROM, Data Memory and Shared RAM in order to leave as much space as possible for the Program Memory. All the required interconnection between on-chip memories and/or processors is implemented based on a multimastering, fully connected Avalon bus. Thus, the communication bandwidth is quite large and the on-chip memory access time is only one clock cycle.
The cache can become a dominant architecture feature in the total execution time. For a given fixed processor, we focus on the efficient utilization of a cache of fixed size and configuration. For a configurable processor, we have choices in both hardware configuration and software optimization. Because it takes on average at least 4 clock cycles to access the on-board synchronous SRAM memory, on-chip data and instruction caches are employed to reduce the memory access latency. In order to find an optimal size for our application, we compared the performance of a single processor with different instruction and data cache sizes for the LU factorization algorithm applied to a 30 x 30 matrix. Our test results show that the cache configuration can make a difference of more than 20% in performance.
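The effect of the cache on memory latency can be illustrated with the standard average-memory-access-time estimate. The following is a minimal sketch, not a model of the measured system: it assumes a one-cycle cache hit and takes the 4-cycle on-board SRAM access quoted above as the miss penalty, and the hit rates are hypothetical placeholders rather than measurements.

```python
def average_access_cycles(hit_rate, hit_cycles=1.0, miss_penalty_cycles=4.0):
    """Average memory access time: hit time plus miss rate times miss penalty.

    hit_cycles and miss_penalty_cycles are assumptions for illustration:
    one cycle on a hit, four extra cycles to reach the on-board SRAM on a miss.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie in [0, 1]")
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles


if __name__ == "__main__":
    # Hypothetical hit rates, for illustration only.
    for hit_rate in (0.80, 0.90, 0.99):
        cycles = average_access_cycles(hit_rate)
        print(f"hit rate {hit_rate:.2f}: {cycles:.2f} cycles per access")
```

Under these assumptions, small changes in hit rate translate directly into changes in the average access time, which is why the cache configuration is worth tuning per application.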
Our multiprocessor targets complex matrix algorithms that require floating-point arithmetic to deal with dynamic data of wide range, as well as some trigonometric functions, which take considerable time if implemented in software. We implemented the floating-point arithmetic and these functions in hardware, and interfaced them to the application code as custom instructions [7]. Such hardware customization also releases many processor resources for other tasks.
We implemented serial and TCP connections between the multiprocessor and the host computer. The TCP network connection provides a flexible, quick and efficient communication channel for our FPGA-based parallel system, which can be accessed by all other hosts in the network. We may use the TCP port of the processor to send application code or to reconfigure the system in its entirety, or in part, at runtime.
Mapping Applications to the Multiprocessor
One of the most challenging tasks in programming parallel systems is load balancing. Our target application, i.e., parallel LU factorization of sparse matrices, is one of the hardest problems for dynamic load balancing due to its large number of data dependences and the occurrence of fill-ins (that is, zero elements that become non-zero during factorization). Since our system does not currently have operating system support, we built a dedicated system controller to take care of load balancing at runtime; in this approach, all processors must report their load information to the controller.
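Fill-ins can be seen on a very small example. The sketch below is illustrative only and is not the BDB algorithm of [7]: it runs plain LU elimination without pivoting (assuming non-zero pivots) on two hypothetical 4 x 4 matrices and counts the zero entries that become non-zero. The second matrix is the first one with its dense row and column moved to the border.

```python
def count_fill_ins(matrix):
    """Run LU elimination without pivoting and count zeros that become non-zero.

    Assumes every pivot is non-zero; the input matrix is not modified.
    """
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    fill_ins = 0
    for k in range(n):
        for i in range(k + 1, n):
            if a[i][k] == 0:
                continue  # nothing to eliminate in this row
            factor = a[i][k] / a[k][k]
            a[i][k] = factor  # store the multiplier (entry of L)
            for j in range(k + 1, n):
                if a[k][j] == 0:
                    continue  # no update, so no fill-in is possible here
                was_zero = a[i][j] == 0
                a[i][j] -= factor * a[k][j]
                if was_zero and a[i][j] != 0:
                    fill_ins += 1
    return fill_ins


# Hypothetical test matrices: dense first row/column versus dense last row/column.
dense_first = [
    [4.0, 1.0, 1.0, 1.0],
    [1.0, 4.0, 0.0, 0.0],
    [1.0, 0.0, 4.0, 0.0],
    [1.0, 0.0, 0.0, 4.0],
]
dense_last = [
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 4.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 1.0],
    [1.0, 1.0, 1.0, 4.0],
]

if __name__ == "__main__":
    print(count_fill_ins(dense_first))  # 6: every zero in the trailing block fills in
    print(count_fill_ins(dense_last))   # 0: the bordered ordering avoids fill-ins
```

The same matrix therefore needs very different amounts of work depending on its ordering, which is what makes the load hard to predict statically.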
Let us begin with the preprocessing phase, where we attempt to order the matrix into an optimal BDB matrix [7]. The best ordering is the one that keeps the 3-block groups as dense as possible while not making the last block too large. This way we can reduce the number of floating-point operations in the procedure that follows.
The best solution also depends on the detailed characteristics of the application matrix. Dynamic load balancing is carried out by the SC. The information that the host computer passes to the SC includes at least the size of the whole matrix, the number of diagonal blocks, and the size of every block and its memory addresses in the on-board memory. The SC always assigns the biggest available 3-block group in the task queue.
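The largest-first assignment rule can be sketched with a priority queue. This is a minimal illustration under stated assumptions, not the SC firmware: the class name `TaskQueue`, the group identifiers and the sizes are all hypothetical, and a group's size is reduced to a single integer.

```python
import heapq


class TaskQueue:
    """Hand out pending 3-block groups, largest first (hypothetical helper)."""

    def __init__(self, group_sizes):
        # heapq is a min-heap, so sizes are negated to pop the largest first.
        # Ties are broken by the smaller group id.
        self._heap = [(-size, group_id) for group_id, size in group_sizes.items()]
        heapq.heapify(self._heap)

    def next_group(self):
        """Return (group_id, size) of the biggest pending group, or None if empty."""
        if not self._heap:
            return None
        neg_size, group_id = heapq.heappop(self._heap)
        return group_id, -neg_size


if __name__ == "__main__":
    # Hypothetical group sizes keyed by group id.
    queue = TaskQueue({0: 12, 1: 30, 2: 18})
    print(queue.next_group())  # (1, 30)
    print(queue.next_group())  # (2, 18)
```

Assigning the largest groups first leaves the small groups for the end of the run, where they are easier to spread across processors that become idle.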
During the system configuration phase, the SC assigns the initial loads based on the information sent by the host computer. All the computing processors wait for the system information. Then, the SC copies the instructions and data stored in the on-board SSRAM into the local memories of all processors until they are filled up to the preset size. Then, every processor begins its work using its local data. If the next instruction is not in the local RAM, the processor generates a hardware interrupt to the SC and sends its status to the SC through the dual-port RAM. After receiving the interrupt, the SC checks the status of the corresponding processor, puts the interrupt in a queue, and continues processing the interrupts from this queue in FIFO order.

The SC keeps a load record for every processor in order to manage the unfinished jobs dynamically. The record entries include: the starting time of the working matrix group, the expected end time, the size of the working group, the possible next group for this processor, the phase the algorithm is in (that is, the factorization or multiplication phase [7]), the finished groups, etc. If one processor is idle (maybe because there is no work available or for some other reason), the SC first checks the status of the processors along the summation tree to find the nearest busy processor. It then decides whether it is worth asking the idle processor to help the working processor. The decision is based on the size of the working group, how much work has already been done (i.e., the used time divided by the expected time), and the current phase.
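The bookkeeping described above can be sketched as two small data structures. This is an illustrative sketch, not the SC implementation: the names `LoadRecord` and `InterruptQueue`, the field types, and the status strings are assumptions; only the list of record entries and the FIFO handling come from the text.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoadRecord:
    """Per-processor load record kept by the controller (field names are hypothetical)."""
    start_time: float             # starting time of the working matrix group
    expected_end_time: float      # expected end time of the working group
    group_size: int               # size of the working group
    next_group: Optional[int]     # possible next group for this processor
    phase: str                    # "factorization" or "multiplication"
    finished_groups: List[int] = field(default_factory=list)

    def progress(self, now):
        """Fraction of work done, estimated as used time over expected time."""
        expected = self.expected_end_time - self.start_time
        if expected <= 0:
            return 1.0
        return min(1.0, max(0.0, (now - self.start_time) / expected))


class InterruptQueue:
    """Pending interrupts, served strictly in arrival (FIFO) order."""

    def __init__(self):
        self._pending = deque()

    def raise_interrupt(self, pe_id, status):
        self._pending.append((pe_id, status))

    def next_interrupt(self):
        """Return the oldest (pe_id, status) pair, or None if nothing is pending."""
        return self._pending.popleft() if self._pending else None


if __name__ == "__main__":
    record = LoadRecord(0.0, 10.0, 30, None, "multiplication")
    print(record.progress(now=2.5))  # 0.25

    interrupts = InterruptQueue()
    interrupts.raise_interrupt(3, "idle")
    interrupts.raise_interrupt(1, "needs_code")
    print(interrupts.next_interrupt())  # (3, 'idle')
```

Estimating progress from elapsed time keeps the record cheap to maintain, at the cost of depending on how accurate the expected end time is.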
During factorization, the idle processor will multiply the border blocks following the factorization of the working processor. If the working processor is in the multiplication phase and more than one third of the work has not been done yet (this number is based on the computation/communication time ratio in our machine), then the SC will copy half of the remaining data to the idle processor and modify the working processor's load information. The multiplication results will be collected along the binary tree. Otherwise, the SC will look for another busy processor and will apply this check procedure again.

We tested our parallel BDB LU factorization with this dynamic scheduling policy for the IEEE electric power 57-, 118- and 300-bus test matrices for systems of up to 5 processors on the Altera SOPC development board. The performance comparison of the static and dynamic load balancing techniques for different numbers of PEs is shown in Table 3. Generally, the larger the input matrix, the better the performance of dynamic scheduling.
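The help decision for an idle processor can be summarized in a few lines. The following is a simplified sketch under stated assumptions, not the scheduler itself: the function name `plan_help`, the action labels, and the representation of the remaining work as an item count are hypothetical, and the dependence on the working group's size is omitted. The one-third threshold is the value quoted in the text.

```python
def plan_help(phase, progress, remaining_items, threshold=1.0 / 3.0):
    """Decide how an idle PE should help a busy one.

    phase            -- "factorization" or "multiplication" (phase of the busy PE)
    progress         -- fraction of the busy PE's work already done, in [0, 1]
    remaining_items  -- hypothetical count of data items still to be processed
    Returns (action, items_to_move).
    """
    if phase == "factorization":
        # The idle PE multiplies border blocks behind the ongoing factorization.
        return ("multiply_border_blocks", 0)
    remaining_fraction = 1.0 - progress
    if phase == "multiplication" and remaining_fraction > threshold:
        # Enough work is left to justify the copy: move half of it.
        return ("split_remaining", remaining_items // 2)
    # Not worth the communication cost; check another busy PE instead.
    return ("try_next_busy_pe", 0)


if __name__ == "__main__":
    print(plan_help("multiplication", progress=0.5, remaining_items=100))
    print(plan_help("multiplication", progress=0.8, remaining_items=40))
```

The threshold encodes the point at which copying data costs more than the computation it saves, so a machine with a different computation/communication ratio would need a different value.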
Conclusions
We have presented our hardware and software design effort for a shared-memory multiprocessor implemented within an FPGA. This computing paradigm provides tremendous opportunities along several dimensions: system, hardware, and software. Due to the lack of hardware-software codesign platforms, this procedure requires that the designer be proficient in algorithms, system-level design, software-hardware partitioning, architecture design, and software-hardware coding. Our multiprocessor targets LU factorization. Our proposed dynamic load balancing algorithm for this machine yields good performance and highlights our observation about the need for hardware-software codesign efforts.
