Implementing the conjugate gradient algorithm on multi-core systems by Wiggers, W.A. et al.
Implementing the conjugate gradient algorithm on
multi-core systems
W.A. Wiggers and V. Bakker and A.B.J. Kokkeler and G.J.M. Smit
University of Twente, Department of EEMCS
P.O. Box 217, 7500 AE Enschede, The Netherlands
Abstract—In linear solvers, like the conjugate gradient al-
gorithm, sparse-matrix vector multiplication is an important
kernel. Due to the sparseness of the matrices, the solver runs
relatively slow. For digital optical tomography (DOT), a large
set of linear equations have to be solved which currently takes
in the order of hours on desktop computers. Our goal was to
speed up the conjugate gradient solver. In this paper we present
the results of applying multiple optimization techniques and
exploiting multi-core solutions offered by two recently introduced
architectures: Intel’s Woodcrest general purpose processor and
NVIDIA’s G80 graphical processing unit. Using these techniques
for these architectures, a speedup of a factor three has been
achieved.
I. INTRODUCTION
In linear solvers, like the conjugate gradient (CG) algorithm
[1], sparse-matrix vector multiplication (SMVM) is an im-
portant kernel. Due to the sparseness of the matrix and the
lack of temporal and spatial locality, the SMVM kernel runs
inefficiently on general purpose processors. Fast execution of
the multiplication is needed, since the linear solvers require
multiple iterations.
SMVM offers parallelism, which can be exploited using
parallel computing. Currently there is a trend in the semi-
conductor industry to move from single-core designs in Gen-
eral Purpose Processors (GPPs) to multi-core designs. Another
trend is using a Graphics Processing Unit (GPU) for general
purpose programming [2]. Both architectures offer a lot of
processing power employing multiple processing units which
can be used in parallel. This paper shows the results of running
a CG algorithm on multi-core systems.
We have implemented a conjugate gradient solver using
two Intel 5140 Woodcrest GPPs and using NVIDIA’s G80
GPU. First, in Section II the necesarry information about
the used architectures is presented. In Section III techniques
for improving the performance are explained. Results are
presented in Section IV. This paper is finalized with the
conclusions in Section V.
II. ARCHITECTURES
The increase in performance of single-core processors is
slowing down because of limited growth in clock frequen-
cies [3] and limited architectural improvements for superscalar



























32 B 32 B 32 B 32 B
8 B 8 B
Fig. 1. Bensley architecture memory model
A. Intel Xeon 5140 Woodcrest
In addition to further optimizing single-core processors,
Intel chose to place multiple processors on a single chip. The
Woodcrest processor contains two cores based on Intel’s Core
micro architecture. In Intel’s Roadmap, quad-core processors
are announced and octa-core processors are the next logical
step. The memory model of the Bensley architecture, including
two Woodcrest processors, is depicted in Figure 1.
The processing cores run at 2.3 GHz. In the ideal case,
in each cycle 6 SSE instructions are executed in parallel at
each core. Each SSE instruction requires 16 bytes of data
which results in a total data bandwidth of 220.8 GB/s to fully
utilize a processing core (assuming the program fits in cache).
Each core has 16 KB Level 1 data cache and 16 KB Level 1
instruction cache. The Level 2 cache connected with two
separate 256-bit wide buses also runs at 2.3 GHz, delivering
147.2 GB/s.
As depicted in the Figure 1, two cores share a total of 4 MB
of Level 2 cache. The Woodcrest architecture reduces memory
access latency by using eight hardware prefetchers. In case
an application has limited spatial and temporal locality (such
as SMVM), the memory hierarchy does not offer sufficient
bandwidth to constantly provide the processing cores with
data, causing the cores to stall.
B. NVIDIA GeForce 8800 GTX
The GeForce 8800 GTX architecture, as depicted in Fig-
ure 2, uses the G80 processor which has 128 stream processors
running at 1.35 GHz while the rest of the processor runs at
675 MHz. There is a 384-bit wide bus to the main memory.
The 768 MB main memory uses Graphics Double Data Rate 3
memory running on 1.8 GHz. This leads to a peak memory
bandwidth of 86.4 GB/s.
Eight stream processing cores are grouped into one multi-
processor and share a 16 KB memory. Each stream processor
can process a 32-bit floating point number calculation every
cycle at a frequency of 1.35 GHz. This results in a total data
processing capacity of 43.2 GB/s. The 16 multiprocessors
are connected to main memory via a 48 byte bus clocked
at 1.8 GHz. The GPU offers a lot of processing power
compared to a GPP, caused by dedicating more raw transistors








































Fig. 2. GeForce architecture memory model
Traditional GPU programming differs from programming
a GPP since GPUs are initially designed for 3D rendering.
NVIDIA introduced a programming model, called CUDA
(Compute Unified Driver Architecture). Within CUDA the
GPU is viewed as co-processor to which data-intensive and
highly parallel tasks can be off-loaded. CUDA provides li-
braries with common mathematical functions. The GPU can
be programmed using (extended) C, without knowledge about
conventional GPU pipelines.
III. ALGORITHM TECHNIQUES
Different architectures call for different optimization tech-
niques to maximize throughput. For both architectures, the
chosen optimization techniques are shortly discussed.
A. Intel Xeon 5140 Woodcrest
Three main factors limit the performance of sparse-matrix
multiplications on modern superscalar processors [5]. First,
data structures that represent sparse-matrices have no temporal
locality (the matrix elements are used only once) and limited
spatial locality (data accesses are in a stride loop). Second, the
tendency of multiple load/store function units to miss on the
same cache line, as a consequence of the data access pattern
of the loop. Finally, sparse matrix vector multiplication codes
typically perform a large number of load instructions relative
to the number of floating-point instructions. Since the memory
is not fast enough to offer new data, the floating-point units
are underutilized.
The sparse-matrix elements are usually reordered in such a
way that most non-zero elements will be close to the diagonal
of the matrix. This improves memory access during the
multiplication of the matrix, since spatial locality is improved.
(a) Even odd (b) Block row (c) Submatrix
Fig. 3. Multiple ways to split up a 4× 4 matrix
As summarized in [5], multiple reordering algorithms exist.
In this paper, the Cuthill-McKee (CM) algorithm is used. The
Cuthill McKee [6] heuristic algorithm tries to minimize the
bandwidth of the matrix. The bandwidth B of a matrix A is
defined as B(A) = 1 +max(|i− j|)Ai,j 6=0.
By storing the sparse-matrix in the Compressed Row Format
(CRF) [7], only the non-zero elements are stored. CRF uses
three arrays; the first array is used to store all non-zero
elements. The second array stores all column indexes of the
non-zero elements. The last array stores the locations in the
arrays that start a row. Using the CRF leads to a smaller data
set and better spatial locality. The dataset is reduced from n2
elements to 2 × nnz + n + 1 (nnz = number of non-zeros).
Our test suite consisted of a matrix of 138, 324 × 138, 324
(n = 138, 324) elements, containing around 2.5 × 106 non-
zero elements.
By using Intel’s Math Kernel Library (MKL) for our CG
solver specific hardware features such as SSE instructions,
loop unrolling etc. are exploited. We used the OpenMP library
for implementing data parallelism for our dual-core processors.
We have experimented with multiple ways to parallelize the
SMVMs by splitting up the matrix.
When using even odd splitting (see Figure 3(a)), matrix
elements which are direct neighbors are split up in two
separate sets. The data sets are symmetric along the matrix
diagonal. When the results of two sets are combined, bigger
numerical errors occur since the results of different sets are
not kept in extra wide SSE accumulator registers. In effect,
the amount of CG iterations needed to converge increases. We
observed a doubling in the number of iterations thus making
this method practically useless.
Another way to balance the load can be seen in Figure 3(b).
The matrix is split up into an upper and lower part. A
row always fits completely within one of the two sets. The
disadvantage is that we need the double amount of memory
to store the system matrix since symmetry is lost. We have
measured that this method needs ≈30% more clock cycles on
the Woodcrest processor than the method we describe next.
By splitting the matrix in small square submatrices (see
Figure 3(c)), we can retain the symmetry within the diagonal
submatrices. This method does not have numerical errors like
the even-odd splitting. Further scaling to more-core processors
is also possible since we can repeat the same splitting for the
diagonal submatrices. For these reasons we use the submatrix
splitting method for implementing data parallelism within the
CG algorithm on the Woodcrest architecture.
Our DOT algorithm requires multiple linear equations to be
solved. Multiple threads are used to parallelize the execution
of multiple solvers over the two processors. Each CG solver
requires multiple SMVMs. The SMVM is parallelized over
the two cores of one processor using the submatrix splitting
technique.
B. NVIDIA GeForce 8800 GTX
To obtain a reasonable execution time on the GeForce it is
important that the 300 cycle memory latency is hidden when
fetching data from memory. To hide this latency we require
that there are multiple arithmetic operations per data fetch.
Furthermore, the available streaming processors and multipro-
cessors are loaded with more threads than can actually run on
the device such that the GPU’s thread processor can switch be-
tween threads that stall on memory fetches. By hiding memory
access latency we allow for a more efficient use of available
resources and thus lower the total execution time. This is also
explained in [8] which describes the Parallel Random Access
Model (PRAM) model for programming architectures with a
large number of processors. The programming paradigm used
by NVIDIA’s CUDA C-compiler is based on this model. We
used this PRAM model as well to design the custom kernel
needed to implement the sparse matrix vector multiplication.
To execute a SMVM, it is inefficient to distribute parts of a
row of the matrix over multiple processors. Each processor
can only calculate a part of the “row-vector” product and
the final combination stage will introduce additional inter-
processor communication overhead. We chose to let every
thread work on a row of the matrix. This streaming method is
also described in [9].
Using CRF introduces memory redirections, since the loca-
tions of all elements in the matrix have to be calculated. By
padding every column of the matrix with zeros such that every
row has an equal number of elements, the number of memory
redirections is reduced. Since every row has the same amount
of elements, we can calculate the index of the row.
Every streaming processor needs to perform the same oper-
ation every clock cycle, thus we reordered the matrix in such
a way that starting from the first row until the last row of
the matrix, the number of elements increases. Consequently
subsequent rows will have approximately the same number of
non-zeros. We can use this to implement a branch when a zero
element, resulting from padding, is loaded from memory since
threads close to each other take the same branch. Generally
the implementation of branches should be taken with great
caution on the GPU. If two thread processors take a different
branch the execution path needs to be serialized which is
very resource inefficient. By reordering the matrix in the way
described above this disadvantage is minimized.
IV. RESULTS
In this section we will show results of our CG solver for
DOT. Our initial solver was running on an Intel Xeon Cranford
processor and is used as reference. The CG solver available on
the Cranford system was implemented using the deal.II library
(version 5.2.0), which was compiled with all optimizations
using the Intel Compiler 8.1. For the Woodcrest architecture
we used the Intel MKL library version 9.0. All results are
measured with VTune 8.01. The Cranford PC is a single Intel
Xeon Cranford 3.2 GHz with 1 MB of level 2 cache and 1 GB
main memory. For the Woodcrest architecture we used two
Intel Xeon 5140 Woodcrest 2.3 Ghz processors with each a
shared 4 MB level 2 cache (see Figure 1). The system has 2 GB
memory divided over two channels on a Bensley platform.
Both Intel processors offer 64-bit precision.
For the GPU measurements the NVIDIA GeForce 8800
GTX board was placed in the Woodcrest PC. The GeForce
board was connected via a 16 lanes PCI-express bus. All
timing was performed with the timing function provided by
the CUDA environment. The GPU offers 32-bit precision.
The precision influences the convergence rate, since numerical
errors lead to slower convergence.
A. Intel Xeon 5140 Woodcrest
The results of the conjugate gradient solver on the Wood-
crest is presented in Table I. The last column in the table
shows the speedup compared to previously tested architecture
and the initially used architecture respectively.
Processor / Implementation #Iterations Time(s) Speedupper solve
Cranford / Deal.II 152 3.07 1.0 1.0
Single-core Woodcrest / deal.II 152 2.89 1.06 1.06
Single-core Woodcrest / MKL 154 2.40 1.20 1.28
Dual-core Woodcrest / MKL 154 1.59 1.51 1.93
TABLE I
RESULTS OF THE CONJUGATE GRADIENT SOLVER ON THE WOODCREST
Although Intel’s hand-optimized MKL library requires
slightly more iterations to achieve equal precision than the
initial compiler optimized deal.II implementation, a speed-up
of a factor 1.20 is obtained. Analysis with VTune showed that
85% of all the required clock cycles of the CG solver are used
for sparse-matrix multiplication.
A single single-core Woodcrest using deal.II is 1.06 times
faster than the Cranford using deal.II, which is probably
caused by the bigger cache size and improvements of the
Core micro-architecture processor. The hand-optimized MKL
clearly better exploits the processors hardware, since a speedup
factor of 1.20 is achieved. Enabling the second core added an
improvement of a factor 1.51. However, analysis with VTune
showed that cache-misses and synchronization caused by the
OpenMP library are becoming more important when adding
more cores/processors.
Additional memory modules are used when using two dual-
core Woodcrest processors to run multiple solvers in parallel.
This increases the memory bandwidth to support the offered
21 GB/s total memory bandwidth capacity. Experiments with
multiple memory modules were done to see the effect of
scaling memory bandwidth. Results are shown in Table II. The
speed-up is relative to the execution time of a single dual-core
processor.





MULTI-PROCESSOR SPEED-UP FOR DIFFERENT NUMBER OF MEMORY
MODULES
For the dual-socket Woodcrest architecture implementation
the speedup factor is not a factor 2 compared to the single-
socket Woodcrest architecture, but a factor 1.59. One reason
for this is that logically separated solvers share physical
memory modules causing memory bank conflicts.
B. NVIDIA GeForce 8800 GTX
In Table III the results of different SMVM implementations
on the GeForce architecture are shown. The first row shows
the execution time needed for the sparse matrix vector mul-
tiplication implementation described in the previous section.
We have also simulated an approach where every thread loads
the matrix values and indices in the 16 kB shared memory and
multiplies this with 32 different vectors. The execution time
of this simulated approach is given in the second row.
Function Time(ms)
SMVM 3.40
SMVM (multiple vectors) 1.22
TABLE III
SPARSE MATRIX VECTOR MULTIPLICATION RESULTS ON THE GEFORCE
In Table IV, the results of the conjugate gradient solver on
the GeForce architecture are shown. On average, around 4 ms
is spent per iteration of the CG algorithm. However, due to
the 32-bit precision, more iterations are required to converge
compared to a 64-bit implementation.
C. Precision
The results of solvers have been compared to a MatLab
solution. Although the GeForce is faster, the relative error of
the solution is bigger than of the 64-bit GPP implementation.
Since the MKL doesn’t support 32-bit operations, we changed
the Woodcrest implementation to allow a bigger residual (and
thus relative error). When the relative errors of the Woodcrest
and the GeForce are approximately equal, the execution time
of the two architecture are almost the same.
V. CONCLUSION
The execution time of the CG algorithm on the GeForce and
Woodcrest architectures is limited by the mismatch between
Processor #Iterations per solve Time(s)
NVIDIA GeForce 8800 GFX 177 0.71
TABLE IV
CONJUGATE GRADIENT SOLVER RESULTS ON THE GEFORCE
the bandwidth of the memory and the required memory
bandwidth for the processing (also called the memory wall).
On the Woodcrest architecture this results in stalling due
to level 2 cache misses. Together with load balancing of the
sparse matrix on the cores, these stalling problems will be the
performance limiting factors in the future.
On the GeForce architecture we see a similar memory wall.
Memory access patterns on the GeForce architecture determine
the obtained memory bandwidth and for the SMVM algorithm
this is the main performance limiting factor. Since the GeForce
architecture has software controlled caches it is possible to
relief the pressure on the memory bandwidth. Unfortunately
this requires a change in the original CG algorithm, which has
not been implemented.
The Cranford processor needs an average of 3.07 seconds,
a single Woodcrest (containing two cores) 1.59 seconds and
the GeForce 0.71 seconds per solve. After applying all op-
timization techniques, using two Woodcrest architectures and
multiple memory modules, a total speedup of a factor 3.07 on
a GPP has been achieved.
The GeForce architecture has the lowest execution time per
solve. Per iteration of the CG algorithm the GeForce is even a
factor 2.56 faster than a single dual-core Woodcrest processor.
This makes the GPU a good candidate for CG solvers.
However, the relative error of the results has to be taken
into account to make a fair comparison between the 64-bit
floating point and 32-bit floating point precision architectures.
For results with an equal approximated relative error, the
Woodcrest and GeForce architectures have approximately the
same execution time.
REFERENCES
[1] J. R. Shewchuk, “An introduction to the conjugate gradient method
without the agonizing pain,” Pittsburgh, PA, USA, Tech. Rep., 1994.
[2] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krger, A. E.
Lefohn, and T. J. Purcell, “A survey of general-purpose computation on
graphics hardware,” in Eurographics 2005, State of the Art Reports, Aug.
2005, pp. 21–51.
[3] D. Patterson, K. A. Arvind, D. Chiou, J. Hoe, C. Kozyrakis, S. Lu,
M. Oskin, J. Rabaey, and J. Wawrzynek, “RAMP: Research Accelera-
tor for Multiple Processors,” Technical Record of the 18th Hot Chips
Symposium, Palo Alto, CA, August 2006.
[4] J. L. Hennessy and D. A. Patterson, Computer Architecture, A Quantita-
tive Approach. Morgan Kaufmann Publischers, 2003.
[5] S. Toledo, “Improving the memory-system performance of sparse-
matrix vector multiplication,” IBM Journal of Research and
Development, vol. 41, no. 6, pp. 711–725, 1997. [Online]. Available:
citeseer.ist.psu.edu/toledo97improving.html
[6] E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric
matrices,” Proceedings of the 1969 24th national conference, pp. 157–
172, 1969.
[7] J. Dongarra, “Compressed Row Storage,” in Templates for the Solution of
Algebraic Eigenvalue Problems: A Practical Guide, Z. Bai, J. Demmel,
J. Dongarra, A. Ruhe, and H. van der Vorst, Eds. Philadelphia: SIAM,
2000.
[8] S. Chatterjee and J. Prins, “Parallel and Distributed computing PRAM Al-
gorithms,” www.cs.unc.edu/prins/Classes/203/Handouts/pram.pdf, 2005,
accessed 27-jan-2007, University of North Carolina.
[9] J. Gummaraju and M. Rosenblum, “Stream Programming on General-
Purpose Processors,” in MICRO 38: Proceedings of the 38th annual
ACM/IEEE international symposium on Microarchitecture, 2005.
