3220.8
GB/s ---23 GH 2. Gz 2. lG relatively slow. For digital optical tomography (DOT), a large 73 .6 GB/s ---set of linear equations have to be solved which currently takes in the order of hours on desktop computers. Our goal was to 147 .2 GB/s speed up the conjugate gradient solver. In this paper we present 10.5 GB/s -----------------l the results of applying multiple optimization techniques and '' 8 B / R B exploiting multi-core solutions offered by two recently introduced 21GB/s ---architectures: Intel's Woodcrest general purpose processor and NVIDIA's G80 graphical processing unit. Using these techniques for these architectures, a speedup of a factor three has been achieved. [1], sparse-matrix vector multiplication (SMVM) is an imIn addition to further optimizing single-core processors, portant kernel. Due to the sparseness of the matrix and the Intel chose to place multiple processors on a single chip. The lack of temporal and spatial locality, the SMVM kernel runs Woodcrest processor contains two cores based on Intel's Core inefficiently on general purpose processors. Fast execution of micro architecture. In Intel's Roadmap, quad-core processors the multiplication is needed, since the linear solvers require are announced and octa-core processors are the next logical multiple iterations.
step. The memory model of the Bensley architecture, including SMVM offers parallelism, which can be exploited using two Woodcrest processors, is depicted in Figure 1 . parallel computing. Currently there is a trend in the semiThe processing cores run at 2.3 GHz. In the ideal case, conductor industry to move from single-core designs in Gen-in each cycle 6 SSE instructions are executed in parallel at eral Purpose Processors (GPPs) to multi-core designs. Another each core. Each SSE instruction requires 16 bytes of data trend is using a Graphics Processing Unit (GPU) for general which results in a total data bandwidth of 220.8 GB/s to fully purpose programming [2] . Both architectures offer a lot of utilize a processing core (assuming the program fits in cache). processing power employing multiple processing units which Each core has 16 KB Level 1 data cache and 16 KB Level 1 can be used in parallel. This paper shows the results of running instruction cache. The Level 2 cache connected with two a CG algorithm on multi-core systems.
separate 256-bit wide buses also runs at 2.3 GHz, delivering We have implemented a conjugate gradient solver using 147.2 GB/s. two Intel 5140 Woodcrest GPPs and using NVIDIA's G80
As depicted in the Figure 1 , two cores share a total of 4 MB GPU. First, in Section II the necesarry information about of Level 2 cache. The Woodcrest architecture reduces memory the used architectures is presented. In Section III techniques access latency by using eight hardware prefetchers. In case for improving the performance are explained. Results are an application has limited spatial and temporal locality (such presented in Section IV. This paper is finalized with the as SMVM), the memory hierarchy does not offer sufficient conclusions in Section V.
bandwidth to constantly provide the processing cores with data, causing the cores to stall. II. ARCHITECTURES B. NVIDIA GeForce 8800 GTX
The increase in performance of single-core processors is The GeForce 8800 GTX architecture, as depicted in Figslowing down because of limited growth in clock frequen-ure 2, uses the G80 processor which has 128 stream processors cies [3] and limited architectural improvements for superscalar running at 1.35 GHz while the rest of the processor runs at designs [4] . This has caused a trend towards multi-core de-675 MHz. There is a 384-bit wide bus to the main memory. signs.
The GPU is viewed as co-processor to which data-intensive and We have experimented with multiple ways to parallelize the highly parallel tasks can be off-loaded. CUDA provides li-SMVMs by splitting up the matrix. braries with common mathematical functions. The GPU can When using even odd splitting (see Figure 3 (a)), matrix be programmed using (extended) C, without knowledge about elements which are direct neighbors are split up in two conventional GPU pipelines. Figure 3(b) . multiplications on modern superscalar processors [5] . First, The matrix is split up into an upper and lower part. A data structures that represent sparse-matrices have no temporal row always fits completely within one of the two sets. The locality (the matrix elements are used only once) and limited disadvantage is that we need the double amount of memory spatial locality (data accesses are in a stride loop). Second, the to store the system matrix since symmetry is lost. We have tendency of multiple load/store function units to miss on the measured that this method needs rzz.30% more clock cycles on same cache line, as a consequence of the data access pattern the Woodcrest processor than the method we describe next. of the loop. Finally, sparse matrix vector multiplication codes By splitting the matrix in small square submatrices (see typically perform a large number of load instructions relative Figure 3(c) ), we can retain the symmetry within the diagonal to the number of floating-point instructions. Since the memory submatrices. This method does not have numerical errors like is not fast enough to offer new data, the floating-point units the even-odd splitting. Further scaling to more-core processors are underutilized. is also possible since we can repeat the same splitting for the The sparse-matrix elements are usually reordered in such a diagonal submatrices. For these reasons we use the submatrix way that most non-zero elements will be close to the diagonal splitting method for implementing data parallelism within the of the matrix. This improves memory access during the CG algorithm on the Woodcrest architecture. multiplication of the matrix, since spatial locality is improved.
Our DOT algorithm requires multiple linear equations to be solved. Multiple threads are used to parallelize the execution using the Intel Compiler 8. Figure 1) . The system has 2 GB To obtain a reasonable execution time on the GeForce it is memory divided over two channels on a Bensley platform. important that the 300 cycle memory latency is hidden when Both Intel processors offer 64-bit precision. fetching data from memory. To hide this latency we require For the GPU measurements the NVIDIA GeForce 8800 that there are multiple arithmetic operations per data fetch. GTX board was placed in the Woodcrest PC. The GeForce Furthermore, the available streaming processors and multipro-board was connected via a 16 lanes PCI-express bus. All cessors are loaded with more threads than can actually run on timing was performed with the timing function provided by the device such that the GPU's thread processor can switch be-the CUDA environment. The GPU offers 32-bit precision. tween threads that stall on memory fetches. By hiding memory The precision influences the convergence rate, since numerical access latency we allow for a more efficient use of available errors lead to slower convergence. resources and thus lower the total execution time. This is also explained in [8] which describes the Parallel Random Access A. Intel Xeon 5140 Woodcrest Model (PRAM) model for programming architectures with a
The results of the conjugate gradient solver on the Woodlarge number of processors. The programming paradigm used crest is presented in Table I . The last column in the table by NVIDIA's CUDA C-compiler is based on this model. We shows the speedup compared to previously tested architecture used this PRAM model as well to design the custom kernel and the initially used architecture respectively. needed to implement the sparse matrix vector multiplication.
To 1.59 1.51 1.93 thread work on a row of the matrix. This streaming method is TABLE I also described in [9] .
RESULTS OF THE CONJUGATE GRADIENT SOLVER ON THE WOODCREST
Using CRF introduces memory redirections, since the locations of all elements in the matrix have to be calculated. By padding every column of the matrix with zeros such that every Although Intel's hand-optimized MKL library requires row has an equal number of elements, the number of memory slightly more iterations to achieve equal precision than the redirections is reduced. Since every row has the same amount initial compiler optimized deal.I1 implementation, a speed-up of elements, we can calculate the index of the row.
of a factor 1.20 is obtained. Analysis with VTune showed that Every streaming processor needs to perform the same oper-85% of all the required clock cycles of the CG solver are used ation every clock cycle, thus we reordered the matrix in such for sparse-matrix multiplication. a way that starting from the first row until the last row of A single single-core Woodcrest using deal.I1 is 1.06 times the matrix, the number of elements increases. Consequently faster than the Cranford using deal.I1, which is probably subsequent rows will have approximately the same number of caused by the bigger cache size and improvements of the non-zeros. We can use this to implement a branch when a zero Core micro-architecture processor. The hand-optimized MKL element, resulting from padding, is loaded from memory since clearly better exploits the processors hardware, since a speedup threads close to each other take the same branch. Generally factor of 1.20 is achieved. Enabling the second core added an the implementation of branches should be taken with great improvement of a factor 1.51. However, analysis with VTune caution on the GPU. If two thread processors take a different showed that cache-misses and synchronization caused by the branch the execution path needs to be serialized which is OpenMP library are becoming more important when adding very resource inefficient. By reordering the matrix in the way more cores/processors. described above this disadvantage is minimized.
Additional memory modules are used when using two dualcore Woodcrest processors to run multiple solvers in parallel.
IV. RESULTS
This increases the memory bandwidth to support the offered In this section we will show results of our CG solver for 21 GB/s total memory bandwidth capacity. Experiments with DOT. Our initial solver was running on an Intel Xeon Cranford multiple memory modules were done to see the effect of processor and is used as reference. The CG solver available on scaling memory bandwidth. Results are shown in On the GeForce architecture we see a similar memory wall. Memory access patterns on the GeForce architecture determine the obtained memory bandwidth and for the SMVM algorithm For the dual-socket Woodcrest architecture implementation this is the main performance limiting factor. Since the GeForce the speedup factor is not a factor 2 compared to the single-architecture has software controlled caches it is possible to socket Woodcrest architecture, but a factor 1.59. One reason relief the pressure on the memory bandwidth. Unfortunately for this is that logically separated solvers share physical this requires a change in the original CG algorithm, which has memory modules causing memory bank conflicts. not been implemented.
B. NVIDIA GeForce 8800 GTX The Cranford processor needs an average of 3.07 seconds, a single Woodcrest (containing two cores) 1.59 seconds and In Table III In Table IV , the results of the conjugate gradient solver on
