Wireless communication and multimedia applications involve a large number of matrix operations on matrices of different sizes. These operations often require accessing a matrix in column order. This paper implements a Multi-Grained Matrix Register File (MMRF) that supports multi-grained parallel row-wise and column-wise access. We implement 4*4 MIMO decoding with the help of the MMRF to illustrate efficient matrix operations on SIMD processors. Experimental results show that, compared with the TMS320C64x+, our SIMD processor achieves a performance improvement of about 5.65x to 7.71x by employing the MMRF. Using a customized design technology, we reduce the area and critical-path delay of the MMRF by 17.9% and 39.1%, respectively.
(CVR) CVR 0 ~ CVR N-1. Thus, the MMRF logically provides 2N vector registers by employing N^2 registers, which is twice the number provided by the traditional VRF.
The multiple R/WPs support parallel accesses to the MMRF, which can increase the parallelism degree of FUs. The BMR, which is configurable dynamically, is used to record the parallel access mode of the MMRF. The index view of the register varies as the content of BMR changes.
We can apply the MMRF to a typical SIMD processor without modifying its ISA. Generally, an instruction word uses an m-bit field to encode a source or destination register index. The highest bit of this field distinguishes the VR from the CVR. The remaining m-1 bits select the indexed vector register within the VR or the CVR.
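The index encoding described above can be illustrated with a minimal Python sketch. The function name and the polarity of the highest bit (0 for VR, 1 for CVR) are assumptions made for illustration; the text only states that the highest bit separates the two banks.

```python
from typing import Tuple


def decode_register_field(field: int, m: int) -> Tuple[str, int]:
    """Split an m-bit register field into (bank, index).

    Assumption: a highest bit of 0 selects the VR bank and 1 selects the
    CVR bank. The paper does not state which polarity is used.
    """
    if m < 2:
        raise ValueError("need at least one bank bit and one index bit")
    if not 0 <= field < (1 << m):
        raise ValueError("field does not fit in m bits")
    top_bit = (field >> (m - 1)) & 1          # bank selector
    index = field & ((1 << (m - 1)) - 1)      # remaining m-1 bits
    bank = "CVR" if top_bit else "VR"
    return bank, index
```

For a 16*16 MMRF, m = 5 would be enough under this encoding: four bits select one of 16 registers and one bit selects the bank, so the same field width that addresses 32 ordinary vector registers addresses 16 VRs plus 16 CVRs.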
Multi-grained parallel access
To demonstrate the concept of the MMRF, we implement a 16*16 MMRF on a SIMD processor with 16 PEs. We investigate several typical applications from the wireless communication and multimedia areas, and find that the following three access modes are highly efficient:
A. One Way Access
In the one way access mode, the MMRF behaves as a traditional MRF. One VR or CVR, as shown in Fig. 1, can be accessed in parallel by 16 PEs. The one way access mode is mainly used to serve 16*16 matrices. Programmers can access all the elements of one matrix row or column with the help of the MMRF.
B. Two Ways Parallel Access
The two ways parallel access mode is mainly used to serve 8*8 matrices. As shown in Fig. 2 (a), each VR or CVR consists of two parts from two different matrices. Each part can be accessed by 8 PEs. A row vector register like VR i is the combination of VR i_a and VR i_b . A column vector register like CVR j is the combination of CVR j_a and CVR j_b . Thus, two sub-VRs or two sub-CVRs can be accessed in parallel.
C. Four Ways Parallel Access
The four ways parallel access mode is mainly used to serve 4*4 matrices. As shown in Fig. 2 (b), each VR or CVR consists of four parts from four different matrices. Each part can be accessed by 4 PEs. A row vector register VR j is the combination of VR j_a, VR j_b, VR j_c and VR j_d. A column vector register CVR i is the combination of CVR i_a, CVR i_b, CVR i_c and CVR i_d. Thus, four sub-VRs or four sub-CVRs can be accessed in parallel. Then four matrix multiplications can be performed in parallel with the help of the MMRF.
Fig. 3
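To make the three access modes concrete, the following Python sketch models the 16*16 register array as a list of rows and reads one VR or CVR at a chosen granularity. The placement of sub-matrices inside the array is an assumption: the sketch treats the array as a grid of b*b blocks (b = 16, 8 or 4), one sub-matrix per block, and builds CVR j from the same local column of every block in one block row. The actual mapping is defined by Fig. 2, which may differ, and the function names are hypothetical.

```python
N = 16  # the register array holds N x N elements


def read_vr(array, i):
    """Row-wise access: VR i is physical row i in every mode."""
    return list(array[i])


def read_cvr(array, j, ways):
    """Column-wise access at granularity b = N // ways.

    Assumed layout (not taken from the paper): the array is a grid of
    b x b blocks. CVR j selects block row j // b and local column j % b,
    and concatenates that column from each of the `ways` blocks.
    """
    if ways not in (1, 2, 4):
        raise ValueError("ways must be 1, 2 or 4")
    b = N // ways
    block_row, col = divmod(j, b)
    out = []
    for k in range(ways):            # k-th sub-matrix in this block row
        for r in range(b):           # walk down the local column
            out.append(array[block_row * b + r][k * b + col])
    return out
```

With `ways=1` the function returns an ordinary matrix column, which corresponds to the one way mode; with `ways=4` each group of four consecutive output lanes is one column of a different 4*4 sub-matrix, so each group of 4 PEs sees its own matrix.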
Instruments, Measurement, Electronics and Information Engineering
We configure the MMRF in four ways access mode to perform the matrix multiplications in Eq. (1) and Eq. (2). Fig. 4 shows part of the assembly code that performs the computation. Registers BaseR0 and BaseR1 hold the base addresses of matrices H^H and H, respectively. Register BaseR2 holds the base address of the result H^H*H. The CMUL instruction performs a complex multiplication. The VREDUC4 instruction performs an add reduction across the 4 PEs of a group. After four VREDUC4 instructions, we obtain the first row of H^H*H in VR12. This shows that the MMRF makes such kernels convenient to program.
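The data flow described above can be emulated with a short Python sketch. It is not the assembly code of Fig. 4; the semantics of CMUL and VREDUC4 are inferred from the description (lane-wise complex multiply, then a sum over each group of 4 lanes), and the function names and the way results are collected are assumptions.

```python
def cmul(a, b):
    """Lane-wise complex multiply of two 16-lane vectors (assumed CMUL)."""
    return [x * y for x, y in zip(a, b)]


def vreduc4(v):
    """Sum each group of 4 adjacent lanes (assumed VREDUC4).

    Returns one sum per group, i.e. one value per 4*4 sub-matrix.
    """
    return [sum(v[g:g + 4]) for g in range(0, len(v), 4)]


def gram_first_rows(mats):
    """First row of H^H * H for four 4*4 complex matrices at once.

    mats: list of four matrices, each a list of 4 rows of 4 complex values.
    """
    # Row 0 of each H^H is the conjugate of column 0 of H; the four
    # sub-rows are concatenated into one 16-lane vector (the VR operand).
    vr = [m[r][0].conjugate() for m in mats for r in range(4)]
    rows = [[] for _ in mats]
    for c in range(4):
        # Column c of each H, concatenated (the CVR operand).
        cvr = [m[r][c] for m in mats for r in range(4)]
        sums = vreduc4(cmul(vr, cvr))   # one dot product per matrix
        for k, s in enumerate(sums):
            rows[k].append(s)
    return rows
```

Each loop iteration corresponds to one multiply followed by one reduction, so four iterations yield the first row of H^H*H for all four matrices, which matches the four VREDUC4 instructions mentioned in the text.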
Experimental Evaluation
We implement the MMRF in FT-Matrix-Sim, a cycle-accurate simulator of the FT-Matrix processor, which is aimed at media applications. The FT-Matrix, which includes 16 PEs, is based on VLIW and SIMD technologies. It accommodates 5-way VLIW instructions with two slots for LD/ST operations and one slot for MAC operations that support 32-bit complex multiplication.
We also implement a traditional 16*16 VRF and a 16*16 MRF on the FT-Matrix-Sim. We compare the performance of the FT-Matrix with the VRF, the MRF, and the MMRF, each implemented separately, with that of the media processor core TMS320C64x+ (also referred to as C64x+). We select a set of typical algorithm kernels from wireless communication and multimedia applications as benchmarks.
A. Experimental results
Fig. 5 shows the speedup of the FT-Matrix with the VRF, the MRF, or the MMRF over the C64x+. Comparing the hardware resources of the two processors, the theoretical maximum speedup of the FT-Matrix over the C64x+ is 8. In Fig. 5, the FT-Matrix with the MMRF achieves a nearly optimal speedup of 5.65x ~ 7.71x over the C64x+.
Meanwhile, the FT-Matrix with the MMRF achieves an average speedup of 2.21x and a maximum speedup of 5.87x over the FT-Matrix with the VRF. Compared with the FT-Matrix with the MRF, it achieves an average speedup of 1.6x and a maximum speedup of 2.22x.
executed by Encounter. The hardware cost after placement and routing is shown in Table 1. The critical-path delay is 1.6ns. The register file always impacts the clock frequency of the full chip and takes up significant on-chip area. To reduce the area and critical-path delay of the MMRF, a hierarchical customized design technology is adopted for the design of the RA and the index decoder of the MMRF. The whole MMRF is divided into 16 2-bit sub-MMRF macros, and each macro, which supports row-wise and column-wise access, is composed of two 1-bit sub-RAs. In addition, the two 1-bit sub-RAs in a macro share one group of index decoders to reduce area and power.
The hierarchical strategy is also employed for designing the read ports and write ports of the MMRF to obtain a compact layout and optimal wire interconnect. Each sub-MMRF contains 6 read ports and 3 write ports: 2 write ports for row-wise access, 1 write port for column-wise access, 3 read ports for row-wise access, and 3 read ports for column-wise access. Each port, whether read or write, has a 1-bit sub-array of 16 word-lines by 16 bits. The 1-bit sub-array is partitioned into 4 4x4 small blocks. Each 4x4 block has its own independent local word-line and a shared global word-line. As a consequence, an extremely efficient layout and regular routing are achieved.
The layout of the 2-bit sub-MMRF is shown in Fig. 6. The layout area of the MMRF is 0.46 mm^2 in TSMC 65nm technology, a 17.9% reduction relative to the DC-synthesized circuit. Furthermore, the customized MMRF achieves better speed. The critical-path delay of the customized MMRF is 1.15ns. Compared with that of the DC-synthesized circuit, the critical-path delay is reduced by 39.1%.
Conclusions and Future Work
This paper introduces the MMRF, which targets efficient matrix operations on SIMD processors and can be applied to existing SIMD processors without modifying their ISA. With the MMRF, we obtain a speedup of about 5.65x to 7.71x over the TMS320C64x+ for the selected set of algorithm kernels from wireless communication and multimedia applications. With a customized design technology, the area and critical-path delay of the MMRF are reduced by 17.9% and 39.1%, respectively.
