Strassen's algorithm is an efficient method for multiplying large matrices. We explore various ways of mapping Strassen's algorithm into reconfigurable hardware that contains one or more customisable instruction processors. Our approach has been implemented using Nios processors with custom instructions and with custom-designed co-processors, taking advantage of the additional logic and memory blocks available on a reconfigurable platform.
Introduction
Recent advances in reconfigurable hardware technology enable one or more instruction processors to be implemented in a single device [1]. Such reconfigurable implementations support customisation for specific applications in two main ways: an instruction set that can be extended with custom instructions, and a custom-designed co-processor.
The purpose of this paper is to explore the use of such customisable processor technology in implementing Strassen's algorithm [2] , an efficient method for multiplying large matrices. Other algorithms such as K-means clustering [3] have also been targeted for reconfigurable instruction processor implementation.
The remainder of the paper is organised as follows. Section 2 introduces Strassen's algorithm for matrix multiplication. Section 3 describes the custom instruction facility for the Nios processor [5], and its use in implementing Strassen's algorithm. Section 4 explains how a custom-designed co-processor can be developed to support matrix multiplication. Section 5 presents a framework for concurrent processing based on multiple copies of customisable instruction processors, and shows how it benefits Strassen's algorithm. Finally, Section 6 summarises our approach, and discusses opportunities for further research.
Strassen's Algorithm
Strassen's algorithm [2] and its variants are known to be among the most efficient matrix multiplication methods. It reduces the number of scalar multiplications involved in the computation of a matrix multiplication. This is achieved by replacing a large matrix multiplication with a combination of smaller matrix multiplications and matrix additions. These smaller multiplications can also be subdivided using the algorithm recursively.
Conventional matrix multiplication requires $n^3$ scalar multiplications for two n by n matrices. Strassen's algorithm requires fewer: one level of recursion uses seven multiplications of n/2 by n/2 blocks, that is $7(n/2)^3$ scalar multiplications, and full recursion brings the count down to the order of $n^{\log_2 7} \approx n^{2.81}$, together with extra additions. The procedure is summarised as follows. Let A and B be two n by n matrices, where n is an even integer. Partition the two input matrices A and B and the result matrix C into quadrants as follows [6]:
The symbol in the above equation represents matrix multiplication. We then compute the four quadrants of the result matrix as follows:
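For reference, the standard statement of the method is given below, with $\cdot$ denoting matrix multiplication. The ordering of the products $P_1$ to $P_7$ is assumed to follow the common textbook form; it agrees with the operands quoted for $P_1$ in Section 4, with the count of ten matrix additions and seven multiplications in Section 5, and with the remark there that $C_{12}$, $C_{21}$ and $C_{22}$ do not depend on $P_7$.

```latex
\[
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad
B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \quad
C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}, \quad
C = A \cdot B
\]
\begin{align*}
P_1 &= (A_{11} + A_{22}) \cdot (B_{11} + B_{22}) \\
P_2 &= (A_{21} + A_{22}) \cdot B_{11} \\
P_3 &= A_{11} \cdot (B_{12} - B_{22}) \\
P_4 &= A_{22} \cdot (B_{21} - B_{11}) \\
P_5 &= (A_{11} + A_{12}) \cdot B_{22} \\
P_6 &= (A_{21} - A_{11}) \cdot (B_{11} + B_{12}) \\
P_7 &= (A_{12} - A_{22}) \cdot (B_{21} + B_{22})
\end{align*}
\begin{align*}
C_{11} &= P_1 + P_4 - P_5 + P_7 \\
C_{12} &= P_3 + P_5 \\
C_{21} &= P_2 + P_4 \\
C_{22} &= P_1 - P_2 + P_3 + P_6
\end{align*}
```

Expanding any quadrant recovers the conventional result, for example $C_{12} = A_{11} B_{12} - A_{11} B_{22} + A_{11} B_{22} + A_{12} B_{22} = A_{11} B_{12} + A_{12} B_{22}$.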
Our tests show that Strassen's algorithm does improve a pure software implementation of a matrix multiplication.
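A single recursion level of the procedure can be sketched in software as follows. This is a minimal illustration in Python rather than the code used in our tests; the function names are hypothetical, and the products use the standard textbook ordering of P1 to P7, which is an assumption.

```python
def mat_add(X, Y, sign=1):
    """Element-wise X + sign * Y for equally sized square matrices."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_mul(X, Y):
    """Conventional multiplication: n^3 scalar multiplications."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split(M):
    """Partition an n x n matrix (n even) into quadrants M11, M12, M21, M22."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen_one_level(A, B):
    """One level of Strassen: 7 half-size products instead of 8."""
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven half-size products (standard ordering, assumed).
    P1 = mat_mul(mat_add(A11, A22), mat_add(B11, B22))
    P2 = mat_mul(mat_add(A21, A22), B11)
    P3 = mat_mul(A11, mat_add(B12, B22, -1))
    P4 = mat_mul(A22, mat_add(B21, B11, -1))
    P5 = mat_mul(mat_add(A11, A12), B22)
    P6 = mat_mul(mat_add(A21, A11, -1), mat_add(B11, B12))
    P7 = mat_mul(mat_add(A12, A22, -1), mat_add(B21, B22))

    # Recombine the products into the four result quadrants.
    C11 = mat_add(mat_add(mat_add(P1, P4), P5, -1), P7)
    C12 = mat_add(P3, P5)
    C21 = mat_add(P2, P4)
    C22 = mat_add(mat_add(mat_add(P1, P2, -1), P3), P6)

    # Stitch the quadrants back into one n x n matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Replacing the calls to `mat_mul` inside `strassen_one_level` with recursive calls gives the fully recursive form.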
Custom Instructions
Reconfigurable systems, by their very nature, are more flexible than fixed-function systems. Configurable instruction processors allow many forms of customisation. For instance, source code running on the processor can be optimised, special interfaces to custom designs can be included, and features not used by the processor can be removed. However, the implementation of such solutions may be time consuming and complex.
The Nios configurable processor [5] provides a simple solution to the problem of incorporating new hardware into a processor and of accessing that hardware from a software program. The Nios instruction set architecture can be extended to include custom instructions, which has several advantages:
1) Integrated into the instruction processor pipeline;
2) Direct register interaction;
3) State/data can persist across multiple instruction calls;
4) Hardware size limited only by chip size.
Figure 1 Nios custom instruction architecture
The Nios custom instruction takes two 32-bit words from registers A and B and writes one 32-bit result to register A (Figure 1). There are additional inputs that can be used to control other aspects of the custom instruction's functionality. The close integration through the CPU registers may limit data throughput to main memory, but if the required calculation is computationally complex and required at ad hoc intervals, the custom instruction provides a very good opportunity for speeding up software algorithms, as the current data may already be loaded in the registers.
To improve the performance of common multiply-accumulate based algorithms, we propose the use of a multiply-add routine in a custom instruction. An internal accumulator reduces memory bandwidth requirements and thus improves the overall performance. It is clear from Figure 2 that the custom instruction with the built-in accumulator performs significantly better than a software implementation. Performance increases of up to 77% are observed for large matrices. When we use Strassen's algorithm with the custom instruction, little performance increase is achieved. Strassen's algorithm only becomes consistently faster for matrices of size larger than 64.
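The effect of the internal accumulator can be illustrated with a small software model. The sketch below is hypothetical: the class `MacUnit`, its `clear` control input and the 32-bit wrap-around are assumptions made for illustration, not a description of the actual Nios custom instruction interface. It shows that the running sum stays inside the unit, so only the finished inner product is written back to memory.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath, results wrap modulo 2^32


class MacUnit:
    """Software model of a hypothetical multiply-accumulate custom instruction."""

    def __init__(self):
        self.acc = 0  # internal accumulator, persists between calls

    def mac(self, a, b, clear=False):
        """Add a * b to the accumulator and return its new value.

        `clear` is an assumed control input that resets the accumulator
        before the first term of a new inner product.
        """
        if clear:
            self.acc = 0
        self.acc = (self.acc + a * b) & MASK32
        return self.acc


def matmul_with_mac(A, B):
    """Multiply two n x n matrices, one accumulator run per result element."""
    n = len(A)
    unit = MacUnit()
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            result = 0
            for k in range(n):
                # Partial sums never leave the unit during this loop.
                result = unit.mac(A[i][k], B[k][j], clear=(k == 0))
            C[i][j] = result  # single write per result element
    return C
```

For example, `matmul_with_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`.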
Matrix Multiplication Co-processor
The deficiency of a general-purpose processor when processing computationally intensive scientific algorithms is frequently compensated by installing an additional specialized processing unit. In our experiment, we demonstrate the performance advantage of an additional hand-optimised co-processor in a system with a customisable processor.
To implement Strassen's algorithm, the co-processor is designed as a stand-alone matrix multiplier that can produce up to one row of the result matrix at a time. The matrix addition part of the algorithm is done by the main processor. The matrix multiplication co-processor is designed to stream matrix data from memory. Its sequence of operations starts with streaming a single row from the first matrix operand into its internal buffers. It next streams the whole second matrix operand, calculating the first row of the result matrix at the same time (Figure 3). Provided that data can be supplied continuously at every cycle, $n(n^2 + 2n + 6)$ cycles are required to calculate the product of two n by n operands, including co-processor setup. The matrix co-processor is efficient for a larger stream of data, as the overhead of initiating a DMA transfer is offset by more efficient memory access, compared with the custom instruction or software implementations. With a buffer size of 8, a performance increase of 45% has been measured. With larger buffer sizes, simulations show that performance increases can be up to 86% (Figure 4).
When we use Strassen's algorithm with the matrix co-processor, we find that the performance is lower than using the hardware alone. This is due to two factors. First, the additions are computed by the Nios, which takes a rather long time. Second, the Nios and the matrix co-processor compete for memory bandwidth. Strassen's algorithm does have the advantage that it allows matrices of twice the size to be computed using the same piece of hardware.
Further performance gains can be obtained with more co-processors, provided that the memory bandwidth is sufficient to keep the processing units busy. Here we propose a memory architecture suitable for use on a programmable platform to overcome the memory bandwidth bottleneck.
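The row-streaming order and the cycle count quoted above can be modelled in a few lines. This is a behavioural sketch only: the function names are hypothetical, the model is sequential Python rather than hardware, and it assumes the second operand is streamed in row-major order.

```python
def coprocessor_cycles(n):
    """Cycle count quoted in the text, n(n^2 + 2n + 6), for n x n operands."""
    return n * (n * n + 2 * n + 6)


def stream_multiply(A, B):
    """Behavioural model of the row-at-a-time streaming multiplier."""
    n = len(A)
    C = []
    for i in range(n):
        # Step 1: stream one row of the first operand into the internal buffer.
        row_buffer = list(A[i])
        # Step 2: stream the whole second operand (row-major order assumed),
        # accumulating one row of the result as the data arrives.
        out_row = [0] * n
        for k in range(n):
            a = row_buffer[k]
            for j in range(n):
                out_row[j] += a * B[k][j]
        C.append(out_row)
    return C
```

Under this formula an 8 by 8 product takes `coprocessor_cycles(8) = 688` cycles; the $n^3$ term dominates as n grows, so the setup overhead per row matters less for larger operands.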
An example with one processing unit is used to illustrate the idea. As shown in Figure 5, two additional blocks of memory are installed as buffers for the co-processor. The overall mechanism is like a swinging buffer commonly found in video processing architectures. Calculation starts with the general processor evaluating the partial matrix sums (equations for P1 to P7), with results stored in buffer1 ($A_{11} + A_{22}$ and $B_{11} + B_{22}$ for P1, say). The general processor then starts calculating the third partial sum, with results stored in buffer2. At the same time, the co-processing unit can start evaluating the matrix multiplication in P1 by streaming from buffer1. In this way, the main processor does not need to compete with the co-processing unit for memory resources. Our tests on Strassen's algorithm show that for processors with a load/store architecture, matrix addition requires more cycles than the predicted $n^2$ cycles. In the worst case, the adder in the general processor cannot keep up with the multiplier in the co-processor. Therefore, the workloads in the two processing units are fairly balanced.
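The alternation between the two buffers can be written down as a simple schedule. The sketch below is an idealised model under stated assumptions: each step lasts as long as the slower of the two units, products are numbered from 0, and buffer indices 0 and 1 stand for buffer1 and buffer2. The function name is hypothetical.

```python
def swinging_schedule(num_products=7):
    """Return a list of (adder, multiplier) activities, one entry per step.

    Each activity is a (product_index, buffer_index) pair, or None when the
    unit is idle. The adder fills one buffer with the operand sums of the
    next product while the multiplier streams the previous product's
    operands from the other buffer.
    """
    schedule = []
    for step in range(num_products + 1):
        # General processor: write the sums for product `step`.
        adder = (step, step % 2) if step < num_products else None
        # Co-processor: multiply the sums written in the previous step.
        multiplier = (step - 1, (step - 1) % 2) if step >= 1 else None
        schedule.append((adder, multiplier))
    return schedule
```

In every step where both units are active they touch different buffers, which is the property that removes the contention for memory.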
Multi-Customisable Processors
We show that it is feasible to implement a multi-processor system on an advanced programmable logic platform. Additional logic is available for system bus arbitration circuit and on-chip memory blocks. The customisability of a processor on a programmable chip facilitates the development of a synchronization framework for distributed programming on multiple processors. The system implemented consists of two general customisable processors; however, the principles could easily be extended to cover more processors provided that there are sufficient resources on the programmable chip.
For multiple processors sharing data structures, some synchronization techniques are often required to ensure that no more than one processor is working on the same data structure at any given time. Common techniques include locks, semaphores and monitors [4]. In our implementation, the processors share common registers to keep track of which processor is accessing which shared data block. Software running on the processors in the system has the responsibility of checking these registers, so that it does not run into occupied shared data. This is accomplished by an atomic "test and set" custom instruction operating on the shared registers. In this way, a processor can obtain control of a "free" data structure by writing to its "lock" flag in an uninterruptible way.
To demonstrate the use of our hardware-assisted synchronization framework, we recode the first recursion level of Strassen's matrix multiplication algorithm as a distributed program for our dual processor platform. The aim is to balance the computation load across the processors. At some points in the program, the two processors compete to run the next step of Strassen's algorithm. This is supported by the custom "test and set" instruction. With a limited memory bandwidth, the dual processor achieves a performance gain of almost 25% over a single processor (Figure 7).
Further performance gain can be achieved by introducing a new memory architecture that increases the memory bandwidth of the whole system. After closely inspecting Strassen's algorithm, we derive a novel memory architecture suitable for parallel processing with dual processors. The first seven equations (P1 to P7, Section 2) consist of ten matrix additions and seven matrix multiplications. We employ four memories (A, B, A', B') using on-chip resources (memory blocks and bus arbitration logic) or off-chip memory bank connections.
The first step is to generate the matrix sums for the multiplication stage. Initially operand matrix A ($A_{11}$, $A_{12}$, $A_{21}$, $A_{22}$) is stored in memory block A, and operand matrix B ($B_{11}$, $B_{12}$, $B_{21}$, $B_{22}$) is stored in memory block B. The aim is that at any instant in the sum and multiplication stages, no more than one processor is working on the same memory block. To perform the matrix multiplication of the partial sums in P6 and P5, one processor accesses only memory block B' (P6), while the other processor works on blocks A and B (P5) (Figure 11). Hence the two processors work on different memory blocks. With the results of P1 to P6 stored in A, B and B', the final matrix multiplication in P7 can be executed on block A' only. While the multiplication stage of P7 is evaluated, $C_{12}$, $C_{21}$ and $C_{22}$ can be calculated; no two processors work on the same block of memory at the same time. A more fine-grained approach is to split the multiplication in P7 and have two processors working on it. However, given the limited memory bandwidth, this approach is unlikely to improve performance by much.
In conclusion, with an increasing number of processors in the system and provided that there is sufficient spare hardware on the programmable platform to implement additional memory blocks and bus arbitration circuit, the performance of distributed software could benefit significantly from an on-chip multiprocessor system.
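The test-and-set synchronisation described in this section can be sketched with two software threads standing in for the two processors. Everything named here is hypothetical: `SharedLockRegisters` models the shared registers, a `threading.Lock` stands in for the atomicity that the custom instruction provides in hardware, and the seven tasks represent the seven products of one recursion level.

```python
import threading


class SharedLockRegisters:
    """Model of shared lock flags accessed by an atomic test-and-set."""

    def __init__(self, count):
        self._flags = [0] * count
        # Stand-in for hardware atomicity (assumption of this model).
        self._guard = threading.Lock()

    def test_and_set(self, index):
        """Atomically return the old flag value and set the flag to 1."""
        with self._guard:
            old = self._flags[index]
            self._flags[index] = 1
            return old


def run_dual_processor(num_tasks=7):
    """Let two workers compete for tasks; return the tasks each one claimed."""
    regs = SharedLockRegisters(num_tasks)
    claimed = {0: [], 1: []}

    def worker(cpu):
        for task in range(num_tasks):
            # An old value of 0 means the task was free and is now ours.
            if regs.test_and_set(task) == 0:
                claimed[cpu].append(task)  # the product would be computed here

    threads = [threading.Thread(target=worker, args=(cpu,)) for cpu in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Which worker claims which task depends on timing, but every task is claimed exactly once, which is the guarantee the lock flags are meant to give.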
Summary
We have presented various ways of mapping Strassen's algorithm into reconfigurable hardware containing one or more customisable instruction processors. Current and future work includes studying the scalability aspects of our approach, and generalising our techniques to cover other applications.
