Abstract
Introduction
Chip multiprocessing has received significant attention in both academia and industry in the last few years as a result of manifest and looming difficulties in further improving performance with traditional approaches [1]. Various such architectures have shown promising results [2] [3] [4]. There are two practical computing paradigms in parallel processing, namely SIMD and MIMD. SIMD's superior ability to exploit data parallelism, often enhanced by low communication and synchronization overheads between processing elements (PEs), makes it preferable to MIMD for fine-grain tasks. However, SIMD's implicit intra-instruction synchronization makes it difficult to accommodate application dynamics. On the other hand, MIMD machines consisting of independent PEs target coarse-grain parallelism. However, the PE independence property of MIMD makes programming cumbersome. For applications prone to SIMD execution, the need to explicitly synchronize the PEs in MIMD realizations produces substantial overheads.
Mixed-mode heterogeneous computing [5], where the machine's parallel operation mode (i.e., SIMD, MIMD, or multiple-SIMD (M-SIMD)) changes dynamically as needed by individual subtasks in an application, effectively integrates most of the SIMD and MIMD advantages while alleviating their major drawbacks.
HERA [6] is such a mixed-mode chip multiprocessor. We have designed and implemented it on Xilinx Virtex II FPGAs [14], and it can be easily retargeted to newer families. Floating-point (FP) computation-intensive applications are HERA's main target domain. State-of-the-art FPGAs have shown impressive FP performance and provide new opportunities for parallel processing [7]. Moreover, FPGA-based HERA realizations provide another dimension of flexibility: the multiprocessor can be customized to match the needs of a given application.
At the same time, energy has become an important design metric for all computing systems, especially when used in wireless and embedded environments where real-time constraints are combined with the requirement for long battery life. Compared to their ASIC counterparts, SRAM FPGAs are more energy hungry. Hence, performance-energy trade-offs are often desirable, even deemed necessary, for FPGA-based reconfigurable systems. However, in contrast to extensive research on energy efficiency for chip multiprocessors, very little work has been done on performance-energy trade-offs. Li et al. [10] explore power-performance optimizations targeting chip multiprocessors with Alpha-like processor cores by manipulating various interacting factors, including application granularity, voltage/frequency levels and number of processors. Ref. [11] studies energy-performance trade-offs for a shared-memory chip multiprocessor with a shared interconnection bus. In this paper we propose performance-energy trade-off techniques that exploit the flexibility of mixed-mode chip multiprocessors like HERA. Several studies have shown that communication channels in FPGA-based systems consume significant energy [8] [9]. For a given task, SIMD and MIMD execution require different types and amounts of communication among processors. Also, the mode that minimizes execution time is not necessarily the mode that minimizes energy. By carefully choosing where the SIMD and MIMD modes are used for given tasks, we can achieve various performance-energy objectives.
The multiplication of matrices with irregular shape and size is used as an example to illustrate our techniques, but the approach can also be applied to other applications with minor modifications. The parallel multiplication of square matrices has been studied extensively and numerous results targeting various computing platforms exist in the literature. With dramatic increases in the computing capability of current FPGAs, various highly specialized and customized FP implementations of matrix-matrix multiplication (MMM) on FPGAs are emerging (e.g., [12]). In comparison, HERA can be a semi-customized chip multiprocessor with general-purpose user instructions for matrix-oriented applications; it is friendly to software developers lacking hardware expertise. Moreover, our focus in this study is the efficient processing of matrices of any shape (i.e., not restricted to square matrices).
Mixed-mode chip multiprocessor
The general organization of our multiprocessor for generalized MMM is shown in Fig. 1.
Generalized matrix multiplication
Consider $A \times B = C$, where $A$, $B$, and $C$ are matrices of size $N_1 \times N_2$, $N_2 \times N_3$, and $N_1 \times N_3$, respectively. $N_1$, $N_2$, and $N_3$ generally differ for non-square matrices. If $A$ and $B$ are square, Cannon's algorithm [13] works best in the SIMD mode; all the PEs are then busy all the time except during the initial alignment. If $A$ and $B$ are not square, or cannot be partitioned in such a way that each $N_i$, $i = 1, 2, 3$, is a multiple of $q$, then the multiplication of border blocks is not efficient in the SIMD mode because the sizes and numbers of blocks are irregular. Because SIMD is implicitly synchronous, some PEs sit idle at times while other PEs are busy. We solve this problem by changing the computation mode of PEs in HERA.

To fully benefit from the flexibility of mixed-mode computing, an efficient partitioning scheme is required. Suppose that the LDM of each PE can store $3m^2$ floating-point numbers. To be able to store complete blocks from the two input matrices and the output matrix, the maximum size of a matrix block should be $m \times m$. Let [...] After we decompose the matrices in this way, the multiplication of $A$ and $B$ involves 8 major tasks, $TK_1$ to $TK_8$ [...]. $TK_1$ generally involves the most work and takes the longest time among the tasks.

In MIMD, PEs work independently and asynchronously on their own instructions and data blocks. In SIMD, all PEs execute the same instructions from the GCU on different matrix blocks. Both the LDM and LPM of a PE can then hold data, which reduces the total number of data I/O operations. However, SIMD execution requires the global broadcasting of instructions and may strain the energy budget. As discussed in the Introduction, global communication is a major contributing factor to the overall energy consumption. MIMD execution may reduce the energy consumption by working on local data. But only the LDM can be used to store matrix blocks in MIMD, which results in an increased number of data transfers.
Also, SIMD causes more idle PEs with irregular matrix blocks.
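As a rough illustration of the block-size constraint described above, the sketch below computes the largest $m$ for which $3m^2$ values fit in an LDM, then splits a matrix dimension into full blocks of size $m$ plus a leftover border. It assumes a 512-word LDM holding one floating-point value per word (the configuration reported in the experimental section). The function names are hypothetical, and this simple split is not necessarily the partitioning scheme used on HERA.

```python
from math import isqrt


def max_block_size(ldm_words):
    """Largest m such that 3 * m * m values fit in ldm_words.

    The factor 3 covers one block each from A, B, and C.
    """
    return isqrt(ldm_words // 3)


def split_dimension(n, m):
    """Return (number of full m-sized blocks, size of the leftover border)."""
    return divmod(n, m)


if __name__ == "__main__":
    m = max_block_size(512)  # assumed LDM capacity in words
    for n in (565, 767, 999):  # dimensions of the matrices used in Table 3
        full, border = split_dimension(n, m)
        print(f"N={n}: {full} full blocks of size {m}, border of size {border}")
```

A nonzero border means the last block row or column is smaller than $m \times m$, which is the irregular case that leaves PEs idle under purely synchronous execution.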
Performance-energy trade-off analysis
In this section we analyze the performance and energy consumption of MMM tasks in SIMD and MIMD. This analysis provides a quantitative basis for our mixed-mode performance-energy trade-off techniques. In contrast to most of the other performance and energy modeling approaches, our equations are based on implementation measurements on the FPGA.
Performance and energy characterization
In both SIMD and MIMD, the multiplication of a pair of matrix blocks of size $n_1 \times n_2$ and $n_2 \times n_3$ is an indivisible job, which is mapped to one PE only. We denote this type of job as $J_m(n_1, n_2, n_3)$. Each task $TK_i$ involves many such jobs. The number of clock cycles needed to finish a $J_m(n_1, n_2, n_3)$ job on HERA is:

$$t_m(n_1, n_2, n_3) = 15 n_1 n_2 n_3 + 6 n_1 n_2 + 4 n_1 + 15$$

There are several types of communication channels in HERA. Let $J_c(n_1, n_2)$ be the job of transferring a matrix block of size $n_1 \times n_2$ between the GDM and an LDM. Since the GCU can directly access every LDM and LPM, a $J_c(n_1, n_2)$ job takes $t_c(n_1, n_2) = n_1 n_2$ clock cycles. Another major type of communication job is the broadcasting of HERA instructions. The MMM code has around 40 assembly language instructions. A few control instructions are added depending on the computing mode. These instructions are broadcast via the column buses. The NEWS interconnect is used to transfer matrix blocks or register values between PEs. The clock cycles for instruction broadcasting and a NEWS transfer, denoted by $t_b$ and $t_{news}$, respectively, are proportional to the data volume.
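The two cycle-count expressions can be evaluated directly. The sketch below assumes the job-cycle expression reads $t_m = 15 n_1 n_2 n_3 + 6 n_1 n_2 + 4 n_1 + 15$ and the transfer cost is $t_c = n_1 n_2$. The function names are hypothetical, and `block_product_cycles` rests on an additional assumption not stated in the text: that a job loads both input blocks and stores the result block with no overlap between transfers and computation.

```python
def matmul_job_cycles(n1, n2, n3):
    """Cycles for one J_m(n1, n2, n3) job on a single PE."""
    return 15 * n1 * n2 * n3 + 6 * n1 * n2 + 4 * n1 + 15


def transfer_job_cycles(n1, n2):
    """Cycles for one J_c(n1, n2) transfer between the GDM and an LDM."""
    return n1 * n2


def block_product_cycles(n1, n2, n3):
    """Assumed end-to-end cost: load A and B blocks, multiply, store C block.

    This serial sum is an illustrative assumption, not a formula from the paper.
    """
    return (transfer_job_cycles(n1, n2)
            + transfer_job_cycles(n2, n3)
            + matmul_job_cycles(n1, n2, n3)
            + transfer_job_cycles(n1, n3))


if __name__ == "__main__":
    for shape in [(2, 5, 17), (10, 15, 11), (14, 25, 7)]:
        print(shape, matmul_job_cycles(*shape), block_product_cycles(*shape))
```

The cubic term dominates for all but the smallest blocks, so transfer cycles matter most for thin border blocks.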
The energy consumption of the basic jobs is calculated as follows. We distinguish between two power states for HERA components: active and idle. Only the dynamic power is considered since the static power is much less significant for our target device, i.e., a Virtex II FPGA [14]. The power data of HERA components in the two states are obtained after implementation on the specific FPGA device. The required power for a PE, a NEWS connection, the bus system, or an LDM in the active and idle states is represented by [...]

For the sake of simplicity, we do not take data locality into account in these equations. Also, the accumulation time of the partial products is not included.

Similarly, the other tasks ($TK_i$, $i = 2, …, 8$) can be treated in SIMD, M-SIMD, MIMD, or the mixed mode. A peculiarity of these tasks is that they involve non-square matrix blocks. The irregularity causes more idle PEs in SIMD than in MIMD. Hence, it is beneficial to execute these tasks in the mixed mode. For example, consider a task involving one $J_m(2, 5, 17)$, six $J_m(10, 15, 11)$, and eight $J_m(14, 25, 7)$ jobs. We can then construct one SIMD group consisting of six PEs working on the six $J_m(10, 15, 11)$ jobs and another SIMD group with eight PEs for the eight $J_m(14, 25, 7)$ jobs. An independent PE will work on the $J_m(2, 5, 17)$ job. This way we can avoid idle PEs and potentially save energy and time. We can derive similar equations for the clock cycles and energy consumption of these tasks. Let [...] The clock cycles and system energy consumption for all the tasks can be found by:
where $C_x^{active}$ and $C_x^{idle}$ are the clock cycles spent by system component $x$, i.e., a PE, NEWS connection, bus, or memory, in the active and idle states, respectively, for all the tasks. They are collected by hardware counters in the respective components at runtime.
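The grouping example and the counter-based energy accounting can be sketched together. The code below is only an illustration under stated assumptions: the job-cycle expression is taken to be $15 n_1 n_2 n_3 + 6 n_1 n_2 + 4 n_1 + 15$; a single lock-step SIMD group is modeled by letting every PE wait for the longest job; and energy is computed as power times time, with time equal to cycles divided by the clock frequency. All function names and the power values in the usage section are hypothetical, not measurements from HERA.

```python
from collections import Counter


def matmul_job_cycles(n1, n2, n3):
    """Cycles for one J_m(n1, n2, n3) job (assumed form of the expression)."""
    return 15 * n1 * n2 * n3 + 6 * n1 * n2 + 4 * n1 + 15


def group_jobs_by_shape(jobs):
    """Map each block shape to the number of PEs in its SIMD group."""
    return Counter(jobs)


def idle_cycles_lockstep(jobs):
    """Idle PE cycles if all jobs run as ONE synchronous group.

    Simplifying assumption: each PE finishes its own job and then
    waits until the longest job in the group is done.
    """
    cycles = [matmul_job_cycles(*job) for job in jobs]
    longest = max(cycles)
    return sum(longest - c for c in cycles)


def idle_cycles_mixed(jobs):
    """Idle PE cycles when every shape gets its own SIMD group."""
    groups = group_jobs_by_shape(jobs)
    return sum(idle_cycles_lockstep([shape] * count)
               for shape, count in groups.items())


def energy_joules(counters, power_watts, clock_hz):
    """Energy from per-component (active, idle) cycle counters.

    counters:    {component: (active_cycles, idle_cycles)}
    power_watts: {component: (active_power, idle_power)}
    """
    total = 0.0
    for name, (c_active, c_idle) in counters.items():
        p_active, p_idle = power_watts[name]
        total += (p_active * c_active + p_idle * c_idle) / clock_hz
    return total


if __name__ == "__main__":
    jobs = [(2, 5, 17)] + [(10, 15, 11)] * 6 + [(14, 25, 7)] * 8
    print(dict(group_jobs_by_shape(jobs)))
    print("idle cycles, one lock-step group:", idle_cycles_lockstep(jobs))
    print("idle cycles, one group per shape:", idle_cycles_mixed(jobs))
    # Hypothetical counter readings and power figures, for illustration only.
    counters = {"pe": (1000, 500)}
    power = {"pe": (0.2, 0.05)}
    print("energy (J):", energy_joules(counters, power, 125e6))
```

Under this simplified model, grouping identical shapes removes the waiting inside each group, which is the effect the mixed mode exploits; it does not model the cost of forming groups or of broadcasting instructions to each of them.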
Performance-energy tradeoffs
From the above analysis, we can see that the SIMD and MIMD executions of a task involve different amounts of execution time and energy consumption. By varying how often each mode is used, we can achieve various performance-energy objectives. In particular, we explore three performance-energy scenarios:

4.2.1. Optimize the performance without energy constraints. The focus is to reduce the communication time and also consider data locality when distributing matrix blocks to available PEs. This case also helps us to discover the best performance and corresponding energy consumption of the application on the specific architecture. The objective is to find a set of [...]

For an application-system pair, there is an optimal $\gamma_i$ for minimum execution time. Since the PEs consume different power in different states, this optimal $\gamma_i$ does not necessarily correspond to minimum energy consumption. Optimality involving both energy and performance depends on the task characteristics as well as the architecture. We aim to optimize across two dimensions for each task: energy and/or performance vs. $\gamma_i$. Moreover, a hardware technique, clock gating, is employed to save energy at runtime. The clock signal of idle PEs is disabled until they are assigned new jobs. Algorithm-1 is applied.
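One way to picture a performance-energy trade-off of this kind is as a constrained selection: among candidate mode configurations, pick the one with the lowest energy whose execution time stays within an allowed slowdown relative to the fastest candidate. The sketch below is a generic illustration of that idea, not Algorithm-1, which is not reproduced in this text. The function name, the candidate labels, and all numbers are made up for the example.

```python
def pick_min_energy(candidates, max_slowdown):
    """Choose the lowest-energy candidate within a time budget.

    candidates:   list of (label, cycles, energy) tuples
    max_slowdown: allowed fractional increase over the fastest
                  candidate, e.g. 0.10 for a 10% relaxation
    """
    best_cycles = min(cycles for _, cycles, _ in candidates)
    budget = best_cycles * (1.0 + max_slowdown)
    feasible = [c for c in candidates if c[1] <= budget]
    # Lowest energy wins; ties are broken in favor of fewer cycles.
    return min(feasible, key=lambda c: (c[2], c[1]))


if __name__ == "__main__":
    # Hypothetical (label, cycles, energy) points, not measured data.
    candidates = [
        ("mostly SIMD", 100, 10.0),
        ("balanced", 105, 9.0),
        ("mostly MIMD", 120, 8.0),
    ]
    for slowdown in (0.0, 0.06, 0.25):
        print(slowdown, pick_min_energy(candidates, slowdown))
```

Setting `max_slowdown` to zero corresponds to optimizing performance without energy constraints; larger values trade execution time for energy.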
Experimental results
The FPGA device used in our experiments is the Xilinx Virtex II XC2V6000-5 FPGA [14], which contains 33,792 slices and 144 BlockRAM blocks of 512 x 36 bits each. The performance of the single-precision FP adder and multiplier used to construct the HERA PEs is shown in Table 1. The HERA system runs at 125 MHz. Thirty-six PEs, each with a 512 x 36-bit LDM and LPM, were implemented for the experiments.
We first evaluated the accuracy of the performance and energy equations of Section 4 for the SIMD and MIMD modes. A variety of non-square matrices of different shapes were used. Cannon's algorithm was applied to $TK_1$ in all the matrix pairs for SIMD execution. The measured execution time and energy consumption of the tasks are listed in Table 2. These results were compared with those calculated with our time and energy equations. The energy results were measured with the Xilinx XPower tool. The average activity rates were extracted from ModelSim files. The average difference between the calculated and the measured time and energy is 2.1% and 4.5%, respectively. The difference in time mainly comes from the overheads of system administration and bus conflicts. Data locality during scheduling also adds to the dynamic effects on performance and energy. This energy error rate is acceptable for system-level estimation models. One source of the energy error is that HERA components consume a continuous range of power under various activity rates, while we assume only one state to represent any active behavior; the key is to obtain an accurate activity rate through extensive simulations with benchmark matrices. Another source is that the energy measurements for the bus system tend to be less accurate than those for PEs and memory blocks. However, our objective is to develop fast, yet useful models for exploring performance-energy optimizations without involving tedious and time-consuming low-level simulations.

Table 2 also shows that different execution modes require different execution times and energy consumptions, which provides room for performance-energy trade-offs. The exploration space grows with the matrix size.

Finally, we evaluated our optimization techniques. Table 3 shows results for matrices of size 565 x 767 and 767 x 999. Scenario-II evaluates the impact of clock gating on the energy consumption.
A reduction of 7.3% in energy consumption was observed by putting the idle PEs to sleep, without a major switching penalty on the execution time. A performance penalty of 5.7% was observed when reducing the energy consumption by 13%, as shown in Scenario-III. In Scenario-IV and Scenario-V, we relaxed the performance by 10.6% and 15% to reduce the energy consumption by 14.5% and 18.9%, respectively. The benefits of the approach could be greater for more closely coupled algorithms with more data dependences among tasks, which expose more flexibility for performance-energy trade-offs.
Conclusions
Continuous advances in silicon technology and increasing difficulties in realizing superscalar processors have brought about a significant shift in microprocessor design. Chip multiprocessing has recently emerged in general-purpose computing and will continue to develop further in many application scenarios, including embedded and wireless systems. While high performance is always desirable, trade-offs between performance and energy are necessary in many such systems. We have presented our performance-energy trade-off study for an in-house designed and implemented mixed-mode reconfigurable chip multiprocessor. The flexibility of mixed-mode parallel execution provides a large exploration space for achieving various performance-energy objectives. The experimental results demonstrate the effectiveness of our approach.
