Description of the Algorithm
The successive over-relaxation (SOR) method is an iterative method used for finding the solution of elliptic differential equations. SOR was devised to accelerate the convergence of the Gauss-Seidel and Jacobi methods [14] by introducing a new parameter, $\omega$, referred to as the relaxation factor.
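Given the linear system of equations:
$$A x = b,$$
the matrix $A$ can be written as
$$A = D + U + L,$$
where $D$, $U$, and $L$ denote the diagonal, strictly upper triangular, and strictly lower triangular parts of matrix $A$ [34].
Using the successive over-relaxation technique, the solution of the PDE is obtained using:
$$x^{(k+1)} = (D + \omega L)^{-1} \left[ \omega b - \left( \omega U + (\omega - 1) D \right) x^{(k)} \right],$$
where $x^{(k)}$ represents the $k$-th iterate.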
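For illustration, the SOR update is commonly written component by component. The form below is the standard textbook statement, given under the assumption that $A = D + U + L$ with entries $a_{ij}$, right-hand side entries $b_i$, system size $n$, and relaxation factor $\omega$; the index notation is ours and is not taken from the original text.

```latex
x_i^{(k+1)} = (1 - \omega)\, x_i^{(k)}
  + \frac{\omega}{a_{ii}} \left( b_i
  - \sum_{j < i} a_{ij}\, x_j^{(k+1)}
  - \sum_{j > i} a_{ij}\, x_j^{(k)} \right),
  \qquad i = 1, \dots, n .
```

The first sum uses components already updated in the current sweep, which is what distinguishes SOR and Gauss-Seidel from Jacobi; setting $\omega = 1$ removes the $(1 - \omega)$ term and leaves the Gauss-Seidel update.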
The SOR rate of convergence strongly depends on the choice of the relaxation factor, $\omega$ [3]. Extensive work has been done on finding a good estimate of this factor in the (0, 2) interval [3, 23].
Studies have shown that, depending on the value of $\omega$:
• $\omega = 1$: SOR simplifies to the Gauss-Seidel method [24].
• $\omega \le 0$ or $\omega \ge 2$: SOR fails to converge [24].
• $\omega > 1$: SOR is used to speed up the convergence of a slow-converging process [34].
• $\omega < 1$: SOR helps to establish convergence of a diverging iterative process [23].
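The role of the relaxation factor can be sketched in software. The following is a minimal Python sketch of a dense SOR solver, not the implementation evaluated in this chapter; the function name `sor_solve`, the stopping rule (largest change per sweep below `tol`), and the default parameters are our assumptions.

```python
def sor_solve(A, b, omega, tol=1e-10, max_iter=10_000):
    """Solve A x = b by SOR, starting from x = 0.

    Returns (x, sweeps). Illustrative sketch: A is a dense list of rows
    with a nonzero diagonal, and convergence is declared when the largest
    component change in one sweep drops below tol (an assumed criterion).
    """
    n = len(b)
    x = [0.0] * n
    for sweep in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(n):
            # x[j] for j < i already holds this sweep's updated values.
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (1.0 - omega) * x[i] + omega * (b[i] - sigma) / A[i][i]
            max_change = max(max_change, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_change < tol:
            return x, sweep
    raise RuntimeError("SOR did not converge within max_iter sweeps")
```

Calling `sor_solve(A, b, 1.0)` performs Gauss-Seidel sweeps, while a value such as `omega=1.5` over-relaxes each update. Whether a given value speeds up convergence depends on the system being solved, so the factor should be treated as a tuning parameter.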
Reconfigurable Computing
Today, the RC paradigm makes it possible to benefit from the advantages of both software and hardware [18]. The idea of filling the gap between the two computing approaches, hardware and software, goes back to the 1960s, when Gerald Estrin proposed the concept of RC [23].
The basic idea of RC is the "ability to perform certain computations in hardware to increase the performance, while retaining much of the flexibility of a software solution" [18] .
RC systems can have either a fine-grained or a coarse-grained architecture. An FPGA is a fine-grained reconfigurable unit, while a reconfigurable array processor is a coarse-grained reconfigurable unit. In a fine-grained architecture, each bit can be configured, while in a coarse-grained architecture, the operations and the interconnection of each processor can be configured. An example of a coarse-grained system is MorphoSys, which is intended for accelerating datapath applications by combining a GPP and an array of coarse-grained reconfigurable cells [2].
The realization of the RC paradigm is made possible by the presence of programmable hardware such as large-scale Complex Programmable Logic Device (CPLD) and Field Programmable Gate Array (FPGA) chips [29]. RC involves the modification of the logic within the programmable device to suit the application at hand.
Hardware compilation
There are certain procedures to be followed before implementing a design on an FPGA. First, the user prepares the design using either a schematic editor or one of the Hardware Description Languages (HDLs), such as VHDL (Very High Speed Integrated Circuit Hardware Description Language) and Verilog. With a schematic editor, the designer draws the design by choosing from the variety of available components (multiplexers, adders, resistors, ...) and connecting them with wires. Several companies supply schematic editors in which the designer can drag and drop symbols into a design and clearly annotate each component [30]. Schematic design is considered simple and easy for relatively small designs. However, the emergence of large and complex designs has substantially decreased the popularity of schematic design while increasing the popularity of HDL design.
Using an HDL, the designer can describe either the structure or the behavior of the design. Both VHDL and Verilog support structural and behavioral descriptions at different levels of abstraction. In structural design, a detailed description of the system's components, sub-components, and their interconnects is specified, so the system appears as a collection of gates and interconnects [26]. Although it has the great advantage of yielding an optimized design, a structural description becomes hard to write as the complexity of the system increases. In behavioral design, the system is considered as a black box with inputs and outputs only, without regard to its internal structure. In other words, the system is described in terms of how it behaves rather than in terms of its components and the interconnections between them. Though it requires more effort, the structural representation is more advantageous than the behavioral representation in the sense that the designer can specify the information at the gate level, allowing optimal use of the chip area [27].
It is possible to have more than one structural representation for the same behavioral program.
Since modern chips are too complex to be designed using the schematic approach, we choose an HDL rather than the schematic approach to describe our designs.
Whether the designer uses a schematic editor or an HDL, the design is fed to an Electronic Design Automation (EDA) tool to be translated to a netlist. The netlist can then be fitted on the FPGA using a process called place & route, usually completed by the FPGA vendors' tools. Then the user has to validate the place and route results by timing analysis, simulation and other verification methodologies. Once the validation process is complete, the binary file generated is used to (re)configure the FPGA device. More about this process is found in the coming sections.
The process of implementing a logic design on an FPGA is depicted in Fig. 1.
The above process consumes a considerable amount of time, largely because the user must provide the design in an HDL, most probably VHDL or Verilog. The complexity of designing in an HDL, which has been compared to programming in assembly language, can be overcome by raising the abstraction level of the design. This move has been made by a number of companies such as Celoxica, Cadence, and Synopsys, which offer higher-level languages with concurrency models that allow faster design cycles for FPGAs than traditional HDLs. Examples of higher-level languages are Handel-C, SystemC, and Superlog [26].
Handel-C Language
Handel-C is a high-level language for the implementation of algorithms on hardware. It compiles programs written in a C-like syntax with additional constructs for exploiting parallelism [26]. The Handel-C compiler comes packaged with the Celoxica DK Design Suite, which also includes functions and a memory controller for accessing the external memory on the FPGA. A big advantage, compared to other C-to-FPGA tools, is that Handel-C targets hardware directly and provides a few hardware-optimizing features [8]. In contrast to other HDLs, such as VHDL, Handel-C does not support gate-level optimization. As a result, a Handel-C design typically uses more resources on an FPGA than a VHDL design and usually takes more time to execute. In the following subsections, we describe the Handel-C features that we have used in our design [28].
Types and type operator
Almost all ANSI-C types are supported in Handel-C except for float and double. Floating-point arithmetic can still be performed, however, using the floating-point library provided by Celoxica. Handel-C also supports all ANSI-C storage class specifiers and type qualifiers except volatile and register, which have no meaning in hardware. Handel-C offers additional types for creating hardware components such as memories, ports, buses, and wires. Handel-C variables can only be initialized if they are global or if they are declared as static or const. Handel-C types are not restricted to fixed widths, since a hardware target does not need to be tied to a particular word size. Variables can be of different widths, thus minimizing the hardware usage.
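To make the benefit of arbitrary widths concrete, the following Python sketch computes the smallest unsigned width able to hold a given maximum value, for example a loop index over one dimension of a grid. It is only an illustration of the sizing argument; the helper name `min_unsigned_width` is hypothetical and is not part of Handel-C or the DK Design Suite.

```python
def min_unsigned_width(max_value: int) -> int:
    """Smallest number of bits that can represent 0..max_value (unsigned)."""
    if max_value < 0:
        raise ValueError("max_value must be non-negative")
    # bit_length() is 0 for the value 0, but a register needs at least 1 bit.
    return max(1, max_value.bit_length())


# Index widths for one dimension of an N x N grid (indices 0..N-1).
index_widths = {n: min_unsigned_width(n - 1) for n in (8, 64, 2048)}
```

Under this sizing, an index into a row of 8 points needs 3 bits and an index into a row of 2048 points needs 11 bits, rather than the 32 bits a fixed-width integer would occupy.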
par statement
The notion of time in Handel-C is fundamental. Each assignment happens in exactly one clock cycle; everything else is free [28]. An essential feature of Handel-C is the "par" construct, which executes instructions in parallel.
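The timing rule can be illustrated with a toy cycle-count model in Python. This is a simplified sketch of the rule stated above (one cycle per assignment, parallel branches overlapping in time), not a Handel-C simulator; the helper names `seq` and `par` are our own, and the model assumes that a parallel block finishes when its longest branch finishes.

```python
ASSIGNMENT = 1  # one assignment costs one clock cycle


def seq(*costs: int) -> int:
    """Cycles for statements executed one after another."""
    return sum(costs)


def par(*costs: int) -> int:
    """Cycles for branches executed in parallel: the longest branch dominates."""
    return max(costs) if costs else 0


# Four independent assignments, written sequentially and inside a par block.
four_sequential = seq(ASSIGNMENT, ASSIGNMENT, ASSIGNMENT, ASSIGNMENT)
four_parallel = par(ASSIGNMENT, ASSIGNMENT, ASSIGNMENT, ASSIGNMENT)

# A par block whose branches differ in length waits for the longest one.
mixed = par(seq(ASSIGNMENT, ASSIGNMENT, ASSIGNMENT), ASSIGNMENT)
```

In this model the four sequential assignments take four cycles and the same assignments in a parallel block take one, which is the source of the cycle savings when independent statements are grouped.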
Handel-C targets
Handel-C supports two targets. The first is a simulator that allows development and testing of code without the need to use hardware (P1 in Fig. 2). The second is the synthesis of a netlist for input to the place and route tools that are provided by the FPGA vendors (P2 in Fig. 2).
Fig. 2. Handel-C targets
The remainder of this section describes the phases involved in P2, since P1 simply allows us to test and debug the design when it is compiled for simulation. The flow of the second target involves the following steps:
• Compile to netlist: The input to this phase is the source code. A synthesis engine, usually provided by the FPGA vendor, translates the original behavioral design into gates and flip-flops. The resultant file is called the netlist. Generally, the netlist is in the Electronic Design Interchange Format (EDIF). An estimate of the logic utilization can be obtained from this phase.
• Place and Route (PAR): The input to this phase is the EDIF file generated by the previous phase, i.e., after synthesis. All the gates and flip-flops in the netlist are physically placed and mapped to the FPGA resources. The FPGA vendor's tool should be used to place and route the design. All design information regarding timing, chip area, and resource utilization is generated and controlled for optimization at this phase.
• Programming and configuring the FPGA: After synthesis and place and route, a binary file will be ready to be downloaded into the FPGA chip [30, 31] . 
Hardware Implementation of SOR
The successive over-relaxation method was designed using Handel-C, a higher-level hardware design tool. Handel-C comes packaged with the DK Design Suite from Celoxica. It allows the designer to focus more on the specification of the algorithm rather than adopting a structural approach to coding [14]. Handel-C syntax is similar to ANSI-C, with additional extensions for expressing parallelism [14]. One of the most important Handel-C features used in our implementation is the 'par' construct, which allows the statements in a block to be executed in parallel and in the same clock cycle.
Our design was first tested using the Handel-C simulator. Afterwards, we targeted a Xilinx Virtex II Pro FPGA, an Altera Stratix FPGA, and a Spartan3L, which is embedded in an RC10 FPGA-based system from Celoxica. We used the proprietary software provided by the device vendors to synthesize, place and route, and analyze the design [28, 32, 33].
In Fig. 3 and Fig. 4, we present a parallel and a sequential version of SOR. In the first version, we used the 'par' construct wherever it was possible to execute more than one instruction in parallel and in the same clock cycle without affecting the logic of the source code. The dots in the combined flowchart/concurrent process model shown in Fig. 3 represent replicated instances. Fig. 4 shows the traditional way of sequentially executing instructions on a general-purpose processor. Executing instructions in parallel has shown a substantial improvement in the execution time of the algorithm.
To handle the floating-point arithmetic operations that are essential in finding the solution to PDEs using iterative methods, we used the Pipelined Floating Point Library provided by Celoxica [28]. However, an unresolved bug in the current version of the DK simulator limited the number of floating-point operations in the design to four. The only way to avoid this failure was to unpack the floating-point numbers into integers and perform integer arithmetic on the unpacked numbers. Though it causes more logic to be generated, performing integer operations on the unpacked floating-point numbers has only a minor effect on the total number of clock cycles of the design.
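The idea of unpacking a floating-point number into integer fields can be sketched in Python. The sketch assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits); the internal format of the Celoxica library is not described here, so this layout and the helper names `unpack_float32` and `pack_float32` are assumptions made for illustration only.

```python
import struct


def unpack_float32(value: float) -> tuple[int, int, int]:
    """Split a number into (sign, biased exponent, fraction) integer fields.

    Assumes IEEE 754 single precision; value is first rounded to 32 bits.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def pack_float32(sign: int, exponent: int, fraction: int) -> float:
    """Reassemble the three integer fields into a floating-point value."""
    bits = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

Once a value is split this way, operations such as comparing exponents or adding aligned fractions can be carried out with integer arithmetic alone, and the result is repacked at the end.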
Experimental Results
As mentioned before, the main objectives of this chapter are: i) studying the feasibility of implementing the SOR method in hardware and ii) realizing an accelerated version of the method. The first objective is met by targeting high-performance FPGAs: Virtex II Pro (2vp7ff672-7), Altera Stratix (ep1s10f484c5), and Spartan3L (3s1500lfg320-4), which is embedded on the RC10 board from Celoxica. The second objective is met by comparing the timing results obtained with those of a software version written in C++ and compiled using Microsoft Visual Studio .Net. All the test cases were carried out on a 2.0 GHz Pentium M processor with 1.99 GB of RAM. The relaxation factor was chosen to be 1.5 [22]. The obtained results are based on the following criteria:
• Speed of convergence: the time it takes the SOR method to find the solution to the PDE at hand. In other words, it is the time needed to execute the SOR algorithm. In the hardware implementation, the speed of convergence is measured as the number of clock cycles of the design divided by the frequency at which the design operates. The first parameter is found using the simulator, while the second is found in the timing analysis report generated by the FPGA vendor's tool.
• Chip-area: this performance criterion measures the number of occupied slices on the FPGA on which the design is implemented. The number of occupied slices is generated using the FPGA vendor's place and route tool.
Tables 1, 2
[Table captions and column headings were lost in extraction; the values are reproduced as found.]

Problem size    Column 1    Column 2
8x8             128         2918
16x16           136         3033
32x32           219         4807
64x64           265         5978
128x128         315         7125
256x256         610         14538
512x512         1098        23012
1024x1024       1601        31848
2048x2048       2289        53476

Problem size    Column 1    Column 2    Column 3
8x8             519         250         120
16x16           601         310         155
32x32           810         501         199
64x64           999         637         280
128x128         1274        720         347
256x256         1510        890         948
512x512         2286        1087        501
1024x1024       2901        1450        569
2048x2048       3286        1798        640

Fig. 5 shows the SOR execution time when targeting the Virtex II Pro FPGA versus the execution time of SOR in C++. We started with a problem size of 8x8 and reached 2048x2048. The results show a clear acceleration of the method when moving from the software implementation to the hardware implementation. The speedup of the design, for different problem sizes, is shown in Table 4 and is calculated as the ratio Execution Time (C++) / Execution Time (Handel-C).
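The two timing quantities used in this section can be written out as a short Python sketch. The function names are ours, and the numbers in the example are hypothetical placeholders chosen only to show the arithmetic; they are not measurements from this chapter.

```python
def hardware_execution_time(clock_cycles: int, frequency_hz: float) -> float:
    """Execution time in seconds: simulated clock cycles / operating frequency."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return clock_cycles / frequency_hz


def speedup(software_time_s: float, hardware_time_s: float) -> float:
    """Ratio Execution Time (C++) / Execution Time (Handel-C)."""
    if hardware_time_s <= 0:
        raise ValueError("hardware time must be positive")
    return software_time_s / hardware_time_s


# Hypothetical example: 1000 cycles at 100 MHz, compared with a 2 ms software run.
example_hw_time = hardware_execution_time(1000, 100e6)
example_speedup = speedup(2e-3, example_hw_time)
```

The cycle count comes from the simulator and the frequency from the vendor's timing report, so the hardware time is derived rather than measured directly on the board; a speedup greater than 1 means the hardware version is faster.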
Conclusion
In this chapter, we have studied the feasibility of implementing the SOR method on reconfigurable hardware. We used a hardware compiler, Handel-C, to code and implement our design, which we mapped onto high-performance FPGAs: Virtex II Pro, Altera Stratix, and Spartan3L, which is embedded in the RC10 board from Celoxica. We used the FPGA vendors' tools to analyze the performance of our hardware implementation. For testing purposes, we designed a software version of the algorithm and compiled it using Microsoft Visual Studio .Net. The software implementation results were compared to the hardware implementation results. The synthesis results show that SOR is suitable for FPGA implementation; the timing results show that SOR on hardware outperforms SOR on a GPP. In future work, we plan to improve a) the speedup of the algorithm by designing a pipelined version of SOR, and b) the efficiency of the algorithm by moving from Handel-C to a lower-level HDL such as VHDL.
In addition, we will consider mapping the algorithm onto a coarse-grained reconfigurable system (e.g., MorphoSys) [34] and benefiting from the advantages of formal modeling [35]. We can extend the benefit of SOR by implementing other versions of the algorithm, such as Modified SOR (MSOR), Symmetric SOR (SSOR), and Unsymmetrical SOR (USOR).
