The evolution of computers and the Internet has created demand for powerful, high-speed data processing, yet in such complex environments few methods provide a satisfactory solution. Parallel computing has been proposed to address this problem. This paper presents an effective design for matrix multiplication using a systolic architecture on Reconfigurable Systems (RS) such as Field Programmable Gate Arrays (FPGAs). The systolic architecture increases computing speed by combining parallel processing and pipelining. RTL code for matrix multiplication with and without the systolic architecture is written in Verilog HDL, compiled and simulated using ModelSim XE III 6.4b, synthesized using Xilinx ISE 9.2i and targeted to the device xc3s500e-5-ft256. The two designs are then compared to evaluate the performance of the proposed architecture. The proposed systolic design doubles the speed of matrix multiplication relative to the conventional method.
INTRODUCTION
In computer architecture, a systolic architecture is a pipelined network of Processing Elements (PEs) called cells. It is a specialized form of parallel computing in which each cell independently computes on its incoming data and stores the result. A systolic architecture is an array composed of matrix-like rows of cells. The PEs are similar to central processing units (CPUs), except that they usually lack a program counter, instruction register, control unit and so on, since operation is transport-triggered, i.e., triggered by the arrival of a data object. Each cell shares its information with its neighbors immediately after processing. The systolic array is often rectangular, with data flowing across the array between neighboring Data Processing Units (DPUs), often with different data flowing in different directions. Each DPU is connected to a small number of nearest-neighbor DPUs in a mesh-like topology and performs a sequence of operations on the data that flows between them. In this research, each DPU performs Multiplication and Accumulation (MAC), and the systolic array is used to multiply matrices with higher computation speed.
LITERATURE REVIEW
The various systolic architectures presented in [2, 3, 4, 5, 7] are described below.
AB1 architecture
The AB1 architecture, shown in Figure 1, is a 1-D systolic array whose size equals that of a block used in the block matching algorithm [3]. Consecutive computation of all (2p + 1)^2 candidate blocks per displacement vector takes N(2p + 1)^2 time instances, as can be seen from the input data indexes in Figure 1, where p is the maximum displacement assumed and N is the order of the matrix. Computing consecutive candidate blocks implies replacing one input data column by another. A regular data flow at the end of each candidate block line within a search area requires a continuous exchange of columns of input data, so that N-1 dummy time instances with invalid data occur at the output of AB1.
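To give a feel for the magnitudes involved, the counts above can be worked through for one set of values. The numbers p = 7 and N = 16 are assumptions chosen purely for illustration and do not come from [3]:

```latex
\begin{aligned}
(2p+1)^2 &= (2\cdot 7+1)^2 = 15^2 = 225 \quad \text{candidate blocks} \\
N(2p+1)^2 &= 16 \cdot 225 = 3600 \quad \text{time instances}
\end{aligned}
```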
AS2 architecture
An alternative procedure in [2] decomposes the algorithm into two subparts. The first is defined over a three-dimensional index space spanned by the indexes i, k, and m: the best matching candidate block is searched along a line of candidate blocks indexed by m within the search range. The second is defined over a one-dimensional index space along the index n: the previously determined minima of all search area lines are compared, and the smallest denotes the displacement vector component shown in Figure.
PROPOSED ARCHITECTURE
Parallel Matrix Multiplication [7] goes by many different names, but all share a similar implementation: they directly multiply pairs of matrix elements. Parallel Matrix Multiplication on Systolic Array (PMMSA) uses this approach. In [5], PMMSA is characterized as processing its input data in a pipeline and as being composed of regularly arrayed PEs.
Neighboring PEs are connected to each other by the shortest possible line, so bulk data does not need to be stored before processing. Reducing the distance between the PEs in an array greatly reduces the internal communication delay and improves the utilization of the processing units. It also removes the time spent controlling the establishment of the data stream. In this research, each PE is replaced with a Multiplication and Accumulation (MAC) unit to enhance the speed and reduce the complexity of the systolic architecture.
The algorithm for matrix multiplication of order N×N is shown below.
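As a reference point, the textbook sequential algorithm for an N×N product can be sketched as follows. This is a minimal Python model rather than the paper's Verilog RTL; the function name `matmul_conventional` and the operation counter `mac_ops` are hypothetical names introduced only for illustration.

```python
def matmul_conventional(a, b):
    """Multiply two N x N matrices (lists of lists) with the textbook triple loop.

    Illustrative sketch only: it models the arithmetic, not the hardware.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    mac_ops = 0  # number of multiply-accumulate operations performed
    for i in range(n):          # row of A
        for j in range(n):      # column of B
            for k in range(n):  # inner (summation) index
                c[i][j] += a[i][k] * b[k][j]
                mac_ops += 1
    return c, mac_ops


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product, ops = matmul_conventional(a, b)
print(product)  # [[19, 22], [43, 50]]
print(ops)      # 8
```

Executed one at a time, the N^3 multiply-accumulate operations are the work that a systolic array spreads across its PEs.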
IMPLEMENTATION SCHEME
In this paper, we aim to compute equation (1) with a two-dimensional systolic array.
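Assuming equation (1) is the standard matrix product, it can be written element-wise as below. The index symbols i, j, k and the inner dimension K are notation chosen here for illustration and may differ from the original.

```latex
C = A \times B, \qquad c_{ij} = \sum_{k=1}^{K} a_{ik}\, b_{kj}
```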
Here A, B and C are matrices of compatible orders. Each PE of the systolic array multiplies a pair of incoming elements, accumulates the product into the corresponding output element, and then passes the input elements on to its neighbor PEs in the systolic array. The elements in row i of matrix A are injected into the first PE as a pipeline, with successive inputs staggered by one time unit. Similarly, the elements in column j of matrix B are injected into the first PE as a pipeline, with successive inputs staggered by one time unit. The architecture of the PE in this approach, which performs Multiplication and Accumulation on the data, is shown in Figure 5.
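The behavior of a single PE described above can be sketched in a few lines of Python. This is a behavioral model under stated assumptions, not the RTL of Figure 5: the class name `ProcessingElement`, the attribute names, and the choice to forward the A element to the right and the B element downward are all illustrative.

```python
class ProcessingElement:
    """One MAC cell: accumulates a*b and forwards both operands unchanged.

    Assumption: the A operand arrives from the left and leaves to the right,
    and the B operand arrives from the top and leaves at the bottom.
    """

    def __init__(self):
        self.acc = 0    # running sum of products held inside the cell
        self.a_out = 0  # operand forwarded to the right-hand neighbor
        self.b_out = 0  # operand forwarded to the neighbor below

    def step(self, a_in, b_in):
        """Model one clock step: multiply, accumulate, then forward the inputs."""
        self.acc += a_in * b_in
        self.a_out = a_in
        self.b_out = b_in
        return self.a_out, self.b_out


pe = ProcessingElement()
for a_val, b_val in [(1, 5), (2, 7)]:
    pe.step(a_val, b_val)
print(pe.acc)  # 19
```

Keeping the accumulator inside the cell means only the operands travel between neighbors, which is what keeps the interconnect short.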
Systolic Array Architecture for Matrix Multiplication
A systolic architecture is an arrangement of processors, i.e. PEs, in an array (the AB2 architecture in [3]) in which data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, a PE takes input data from one or more neighbors (e.g. left and top), processes it and, in the next step, outputs the results in the opposite direction (right and bottom). The proposed two-dimensional systolic architecture is shown in Figure 6.
Figure 6. Two-dimensional Systolic Array
The array architecture shown above takes input data in parallel into the first PEs of the array, performs Multiplication and Accumulation on them, and then outputs the result to the next level of PEs in the array. Unlike other forms of parallelism, systolic arrays do not lose speed because of their interconnections. Each cell (PE) is an independent processor (CPU) with its own registers and Arithmetic and Logic Units (ALUs).
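The data flow through a two-dimensional array can be illustrated with a cycle-by-cycle Python model. It is a sketch under assumptions that the text does not confirm: square N×N inputs, accumulators that stay in place inside each PE, row i of A entering from the left delayed by i cycles, and column j of B entering from the top delayed by j cycles. The function name `systolic_matmul` is hypothetical, and the cycle count of 3N - 2 is a property of this particular schedule, not a figure reported for the proposed design.

```python
def systolic_matmul(a, b):
    """Simulate an N x N systolic array computing C = A x B, one cycle at a time."""
    n = len(a)
    acc = [[0] * n for _ in range(n)]    # accumulator held in each PE
    a_reg = [[0] * n for _ in range(n)]  # A operand each PE forwards to the right
    b_reg = [[0] * n for _ in range(n)]  # B operand each PE forwards downward
    cycles = 3 * n - 2                   # last PE finishes at cycle index 3n - 3
    for t in range(cycles):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    # Left edge: row i of A is injected with a delay of i cycles.
                    k = t - i
                    a_in = a[i][k] if 0 <= k < n else 0
                else:
                    a_in = a_reg[i][j - 1]  # value the left neighbor held last cycle
                if i == 0:
                    # Top edge: column j of B is injected with a delay of j cycles.
                    k = t - j
                    b_in = b[k][j] if 0 <= k < n else 0
                else:
                    b_in = b_reg[i - 1][j]  # value the upper neighbor held last cycle
                acc[i][j] += a_in * b_in    # multiply and accumulate
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b         # all registers update together
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)  # [[19, 22], [43, 50]]
print(cycles)  # 4
```

Because of the staggered injection, the operands a[i][k] and b[k][j] meet in PE (i, j) at cycle i + j + k, so every PE performs one multiply-accumulate per cycle on valid data without any global control.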
RESULTS & DISCUSSION
Matrix multiplication is implemented on FPGA using both methods, i.e. conventional and systolic architecture, as described above. The RTL code is written in Verilog HDL, and logic verification and simulation are done in ModelSim XE 6.4b. The simulation results, shown in Figure 8, indicate that the systolic architecture implementation requires fewer clock cycles than the conventional method. Figure 8 also shows the parallel processing and pipelining performed by the systolic array architecture, together with the input and output matrices, whose elements are 4 bits each. After simulation, the design is synthesized in Xilinx ISE 9.2i to convert the RTL logic into a gate-level netlist, and the schematic diagrams are captured. The schematic diagrams are shown in Figure 9 and Figure 10. Figure 9 represents the top-level hierarchy of the design and Figure 10 shows 
CONCLUSION
The systolic array architecture is designed for matrix multiplication and targeted to the Field Programmable Gate Array device xc3s500e-5-ft256. Parallel processing and pipelining are introduced into the proposed systolic architecture to enhance the speed and reduce the complexity of the matrix multiplier. The proposed design is simulated, synthesized and implemented on the FPGA device xc3s500e-5-ft256, where it achieves a core speed of 210.2 MHz.
