Abstract-We consider memory subsystem optimizations for improving the performance of sparse scientific computation while reducing the power consumed by the CPU and memory. We first consider a sparse matrix vector multiplication kernel that is at the core of most sparse scientific codes, to evaluate the impact of prefetchers and power-saving modes of the CPU and caches. We show that performance can be improved at significantly lower power levels, leading to over a factor of five improvement in the operations/Joule metric of energy efficiency. We then indicate that these results extend to more complex codes such as a multigrid solver. We also determine a functional representation of the impacts of such optimizations and we indicate how it can be used toward further tuning. Our results thus indicate the potential for cross-layer tuning for multiobjective optimizations by considering both features of the application and the architecture.
I. INTRODUCTION
Sparse scientific algorithms and codes enable the linear scaling of the computational costs of modeling and simulation applications when the problem size is increased through refinements required to capture phenomena of interest [7], [12]. However, the performance of such codes depends to a large extent on the memory subsystem design of the computer. Unlike dense codes [13], which inherently have a large number of floating point operations per data access, sparse codes are typically dominated by data access operations [8].
In this paper, we consider in detail the interactions between sparse code features, as represented by the sparse-matrix vector multiplication kernel (SMV), and memory optimizations.
We discuss how memory optimizations that we have developed earlier [11], [15], [16] can affect the performance of tuned and un-tuned versions of sparse matrix vector multiplication. We consider the use of such optimizations with power-saving modes of the hardware, such as Dynamic Voltage and Frequency Scaling (DVFS) [5], to improve performance at significantly lower power levels. We next develop a functional representation of metrics, such as performance and power, in terms of parameters of the application and the hardware. We then indicate how this functional form could be used to select optimal feature sets for multiple objectives. This is particularly important because the impacts of multiple optimizations on multiple metrics are not independent of each other. Such analysis captures interactions between all parameters, including those representing code features and hardware optimizations, to enable the determination of minimal feature sets to maximize impact.
The work was supported in part by the National Science Foundation through the grant CCF-0444345.
Section II discusses our methodology, Section III contains our main results and we end with brief concluding remarks in Section IV.
II. METHODOLOGY
We use instruction-level simulation with SimpleScalar [3] and Wattch [2] to model our memory subsystem optimizations.
We use a standard sparse matrix vector kernel (SMV-U) and its tuned form (SMV-O) from Sparsity [8], and the multigrid code MG from the NAS benchmark [1]. Both SMV kernels compute y ← A × x, which requires one floating-point multiplication and addition per nonzero element in A; x, y are N-vectors and A is a sparse N × N matrix. In both cases, re-use of elements of x can be enhanced by reordering A, as shown in Figure 1. We use such reordered forms for our test matrices with both SMV-U and SMV-O; SMV-O includes optimizations that increase floating-point operations while decreasing loads from memory.
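As a minimal sketch of the kernel y ← A × x described above, the following assumes a compressed sparse row (CSR) layout; the storage format, the function name `spmv_csr`, and the 3 × 3 example matrix are illustrative assumptions and are not taken from the SMV-U source.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A * x for A stored in compressed sparse row (CSR) form.

    CSR is an assumed layout for illustration, not necessarily the one
    used by the kernels studied in the paper.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # One multiply and one add per stored nonzero. x is read
            # indirectly through col_idx, so its reuse depends on the
            # ordering of A.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical example: A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each stored nonzero costs one load of a value, one load of a column index, and one indirect load of x for only two floating-point operations, which is why the kernel is dominated by data access.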
The sparse kernels are simulated with SimpleScalar 3.0 [3] and Wattch 1.02d [2], with extensions to model memory subsystem enhancements. We use SimpleScalar configured to accept PISA-compiled programs to model a single-core processor (such as the one in BlueGene [18]), starting from a PowerPC 440 embedded core. We use Wattch [2] to calculate the power consumption with extrapolations for 0.13 µm technology [11], [15], [16]. We also developed a DDR2 type memory performance and power simulator for use with our modified versions of SimpleScalar and Wattch.
1-4244-0910-1/07/$20.00 ©2007 IEEE
Our base architecture has two floating-point units (FPUs) and two integer arithmetic-logic units (IALUs). Each FPU has a multiplication/division module and other arithmetic-logic modules. Thus, our base system can issue four floating-point instructions at each cycle. The data paths between memory and L3 cache are 64 bits wide with cache lines of 64 bytes, i.e., 8 double precision operands or 16 integer operands. We model a cache hierarchy with three levels on chip, including a 32KB data/32KB instruction level 1 cache (L1), a 2KB level 2 cache (L2), and a 4MB unified level 3 cache (L3). Wattch is configured to model only two levels of cache, but we added new functions to model our hierarchy. More details of our system can be found in [11], [15].
Starting with the base architecture (B), we consider the effects of (i) doubling the width of the data paths (W), (ii) the memory page policy: open (MO) or closed (default), (iii) memory prefetching at the memory controller (MP), and (iv) L2-cache prefetching (LP). Many of these optimizations have been considered in other contexts [9], [10], [14], [17], [19]-[21]. All prefetchers are stride-1, and we simulate the use of power control modes of caches by simply varying cache sizes.
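To illustrate what a stride-1 prefetcher does to a streaming access pattern, the following toy model counts demand misses in a small fully associative LRU cache with and without next-line prefetching. It is a sketch under stated assumptions and is unrelated to the modified SimpleScalar and Wattch simulators: the function name `count_misses`, the four-line capacity, and the access stream are hypothetical, and only the 64-byte line size comes from the text.

```python
from collections import OrderedDict


def count_misses(addresses, line_bytes=64, capacity_lines=4, prefetch=False):
    """Count demand misses in a tiny fully associative LRU cache.

    Toy model only: capacity and replacement policy are assumptions.
    """
    cache = OrderedDict()  # keys are line numbers, ordered from LRU to MRU

    def touch(line):
        # Insert or refresh a line, evicting the least recently used one
        # when the cache is full.
        if line in cache:
            cache.move_to_end(line)
        else:
            if len(cache) >= capacity_lines:
                cache.popitem(last=False)
            cache[line] = None

    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        if line not in cache:
            misses += 1
        touch(line)
        if prefetch:
            touch(line + 1)  # stride-1: also bring in the next line
    return misses


# Hypothetical stream: 64 consecutive double precision operands (8 bytes each).
stream = [8 * i for i in range(64)]
base = count_misses(stream)
with_prefetch = count_misses(stream, prefetch=True)
```

For this purely sequential stream the model gives 8 misses without prefetching (one per 64-byte line) and 1 with it. Irregular accesses, such as the indirect reads of x in a sparse kernel, would benefit less in such a model.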
III. EMPIRICAL RESULTS AND ANALYSIS
We now evaluate the impact on performance (time), power, and energy, of memory subsystem optimizations (W, MO, MP, LP) for SMV-U, SMV-O and MG as discussed in Section II. We first indicate how we can significantly improve the energy efficiency in terms of the number of floating-point operations/Joule by improving performance at reduced power levels. Next, we consider how we can use our observed data to derive functional representations which can be used for constrained multi-objective optimizations. Figure 2 indicates the benefits of code tuning when combined with a wider bus (W) for increased memory bandwidth.
Observe that SMV-O with a 2 by 1 blocking increases the floating point operations (useful work) by operating on known zeroes inserted into the matrix in order to reduce the number of loads. However, both SMV-U and SMV-O benefit from the increased memory bandwidth. Figure 3 indicates that L3 cache miss rates remain nearly unchanged as L3 size is decreased from 4MB to 256KB for a given set of optimizations at two different CPU frequencies (1GHz and 600MHz). Among the optimizations, the wider bus (W) and the L2-cache prefetcher (LP) result in the most dramatic decreases in L3-miss rates. In Figure 4, we illustrate the impact on average load store queue latencies, i.e., memory clock cycles per instruction (memory CPI), at 1GHz or 600MHz. We thus show the combined impacts of memory subsystem optimizations with power saving modes of the caches and DVFS. Observe that the memory CPI is lower for the base B at 600MHz, indicating a better balance between CPU and memory service times. Memory CPIs decrease dramatically when the optimizations are added, with greater benefits for the faster CPU at 1GHz. As indicated in Figure 5, these reductions in memory CPI translate to faster execution (time, in seconds). Furthermore, with even just a few of the optimizations, execution is faster at 600MHz compared to the base at 1GHz.
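The 2 by 1 blocking mentioned at the start of this discussion can be sketched as follows. This is an illustrative reconstruction, assuming blocks of two rows by one column with explicit zero fill. The helper names `to_bcsr_2x1` and `spmv_bcsr_2x1` and the example matrix are hypothetical and are not the Sparsity implementation.

```python
def to_bcsr_2x1(n_rows, triples):
    """Pack (row, col, value) triples into 2x1 blocks with explicit zero fill."""
    n_block_rows = (n_rows + 1) // 2
    block_rows = [dict() for _ in range(n_block_rows)]
    for i, j, v in triples:
        # A block covers rows 2b and 2b+1 of one column; a missing entry
        # stays at 0.0, which is the inserted "known zero".
        block = block_rows[i // 2].setdefault(j, [0.0, 0.0])
        block[i % 2] += v
    brow_ptr, bcol_idx, bvalues = [0], [], []
    for blocks in block_rows:
        for j in sorted(blocks):
            bcol_idx.append(j)
            bvalues.extend(blocks[j])
        brow_ptr.append(len(bcol_idx))
    return brow_ptr, bcol_idx, bvalues


def spmv_bcsr_2x1(n_rows, brow_ptr, bcol_idx, bvalues, x):
    """Compute y = A * x from the 2x1 blocked form built above."""
    n_block_rows = len(brow_ptr) - 1
    y = [0.0] * (2 * n_block_rows)
    for b in range(n_block_rows):
        y0 = y1 = 0.0
        for k in range(brow_ptr[b], brow_ptr[b + 1]):
            xj = x[bcol_idx[k]]  # one index load and one x load per block
            y0 += bvalues[2 * k] * xj      # may multiply a filled zero
            y1 += bvalues[2 * k + 1] * xj  # may multiply a filled zero
        y[2 * b], y[2 * b + 1] = y0, y1
    return y[:n_rows]  # drop the padding row when n_rows is odd


# Hypothetical example: A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
triples = [(0, 0, 4.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 2.0), (2, 2, 5.0)]
brow_ptr, bcol_idx, bvalues = to_bcsr_2x1(3, triples)
y = spmv_bcsr_2x1(3, brow_ptr, bcol_idx, bvalues, [1.0, 2.0, 3.0])
```

In this small example the blocked form stores 10 values for 5 true nonzeros, so the floating-point work doubles. The trade-off favors blocking only when vertically adjacent nonzeros are common, since each block then shares one column index and one element of x between two rows.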
We indicate the impact of optimizations (at 1GHz with 4MB L3, and at 600MHz with a 256KB cache) on: power in Figure 6, energy in Figure 7 [...] L3 to 6.7 × 10^7 with all optimizations at 600MHz, 256KB L3.
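The relationships among the metrics used in this section can be written down directly. The sketch below assumes average power over the run and counts two floating-point operations per nonzero, as stated for the SMV kernels in Section II. The function name `energy_metrics` and the input values in the example are placeholders, not measurements from our experiments.

```python
def energy_metrics(avg_power_watts, time_seconds, nonzeros):
    """Derive energy, energy delay product, and operations/Joule for one SMV run."""
    flops = 2 * nonzeros                     # one multiply and one add per nonzero
    energy = avg_power_watts * time_seconds  # Joules
    return {
        "energy_joules": energy,
        "edp": energy * time_seconds,        # energy x time
        "flops_per_joule": flops / energy,
    }


# Placeholder inputs for illustration only (not measured values).
m = energy_metrics(avg_power_watts=2.0, time_seconds=0.5, nonzeros=1_000_000)
```

Because energy is the product of power and time, an optimization that shortens execution while also permitting a lower-power operating point improves operations/Joule on both counts.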
We consider in summary (see Figure 9), the impact on performance and energy delay product (EDP, energy × time) for SMV-U, SMV-O (L3 cache size 256KB), and MG (with L3 cache size of 512KB, the smallest size without performance degradations) across the frequency range from 300 MHz to [...]
The functional representations can be used with a mixed integer program (for optimization), to select, for example, a set of exactly 3 optimizations that minimizes energy at execution times no slower than at the base B (at 600 MHz, 4MB L3). Our analysis indicated that such an optimal configuration is given by B+W+MO+MP for SMV-O and B+MO+MP+LP for MG. Figures 10 and 11 show relative time, power and energy for these configurations, the base B, and the configuration with all optimizations. Observe that these optimal configurations perform just as well as the configuration with all features, at equal or lower power levels. Such analysis indicates the potential of numeric techniques for multiobjective optimizations.
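The selection problem described above can be sketched with exhaustive enumeration standing in for the mixed integer program, which is practical here because only four optimizations are involved. Everything numeric below is a placeholder: the multiplicative model with pairwise interaction terms, the factor tables, and the names `predict` and `best_subset` are assumptions for illustration and do not reproduce our fitted functional representations.

```python
from itertools import combinations

OPTS = ("W", "MO", "MP", "LP")

# Placeholder factors relative to the base B (NOT measured or fitted data).
TIME_FACTOR = {"W": 0.80, "MO": 0.95, "MP": 0.85, "LP": 0.90}
ENERGY_FACTOR = {"W": 0.90, "MO": 0.97, "MP": 0.88, "LP": 0.95}
# Placeholder pairwise terms, modeling optimizations that are not independent.
TIME_PAIR = {frozenset(("MP", "LP")): 1.05}
ENERGY_PAIR = {frozenset(("MP", "LP")): 1.04}


def predict(chosen, single, pair):
    """Evaluate the assumed multiplicative model for a set of optimizations."""
    value = 1.0  # the base B is the reference point
    for opt in chosen:
        value *= single[opt]
    for a, b in combinations(chosen, 2):
        value *= pair.get(frozenset((a, b)), 1.0)
    return value


def best_subset(k=3, max_time=1.0):
    """Return (energy, subset) minimizing energy over all subsets of size k
    whose predicted relative time does not exceed max_time, or None."""
    feasible = []
    for chosen in combinations(OPTS, k):
        if predict(chosen, TIME_FACTOR, TIME_PAIR) <= max_time:
            energy = predict(chosen, ENERGY_FACTOR, ENERGY_PAIR)
            feasible.append((energy, chosen))
    return min(feasible) if feasible else None


best = best_subset()
```

With more parameters, such as cache sizes and CPU frequencies, the number of combinations grows quickly and enumeration would give way to an integer programming formulation of the same objective and constraint.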
IV. CONCLUSIONS
The results in this paper indicate the significance of memory subsystem optimizations for power-aware high performance of sparse codes. We conjecture that such codes will impose even greater demands on the memory subsystem of emerging chip multiprocessor (CMP) architectures, especially as they scale to larger numbers of CPUs. We plan to extend our work to evaluate performance and power trade-offs of sparse computations on such CMPs, with particular attention to developing accurate functional representations for efficient exploration of the high dimensional space of multiobjective optimizations. Such functional representations will necessarily be more complex than the ones indicated here. Additionally, they need to be incorporated into a numerical optimization framework to model the effect of uncertainties in the parameters and observed metrics [4].
