Exploiting Structure of Symmetric or Triangular Matrices on a GPU by Jung, Jin Hyuk & O'Leary, Dianne P.
Exploiting Structure of Symmetric or Triangular
Matrices on a GPU
Jin Hyuk Jung∗ Dianne P. O’Leary†
January 2008
Abstract
Matrix computations are expensive, and GPUs have the potential to
deliver results at reduced cost by exploiting parallel computation. We
focus on dense matrices of the form AD2AT , where A is an m×n matrix
(m ≤ n) and D is an n× n diagonal matrix. Many important numerical
problems require solving linear systems of equations involving matrices of
this form. These problems include normal equations approaches to solv-
ing linear least squares and weighted linear least squares problems, and
interior point algorithms for linear and nonlinear programming problems.
We develop in this work efficient GPU algorithms for forming and factor-
ing AD2AT by exploiting the triangular rastorization capabilities of the
GPU.
This report summarizes work from 2005 to 2007 and was sup-
ported in part by the US Department of Energy under Grant
DEFG0204ER25655 and by the National Science Foundation un-
der Grant CCF 05 14213.
Keywords: GPGPU, general purpose graphics processing units, symmetric
matrix, triangular matrix, rectangular packed format, matrix computation, fac-
torization, decomposition, weighted least squares.
∗Department of Computer Science, University of Maryland, College Park, MD 20742.
jjung@cs.umd.edu, salbang+csr@gmail.com
†Department of Computer Science and Institute for Advanced Computer Studies, Univer-
sity of Maryland, College Park, MD 20742. oleary@cs.umd.edu
1
1 Introduction
Given time series data, b0, . . . , bn−1 measured at times t0, . . . , tn−1, we may want
to fit a model to the data in order to reduce the effects of noise. We choose a
set of basis functions φ0(t), . . . , φm−1(t) that can describe the behavior of the





To find the coefficients x0, . . . , xm−1, we solve a weighted least squares prob-
lem involving a matrix A, with entries computed by evaluating the basis func-
tions at the measured times, and a diagonal matrix D, with entries determined
by the uncertainties in the measurements. We solve the problem by forming
the matrix AD2AT and then computing a factorization of this matrix [5]. The
cost is proportional to nm2. In data fitting problems, m may be on the order
of tens or hundreds, but n can be quite large.
Many other important computational problems also involve forming and
factoring a matrix of the form AD2AT . For example, interior point algorithms
for linear and nonlinear programming solve a sequence of weighted least squares
problems [12, 10]. In these applications, m and n can be thousands or more.
Because of the expense of solving these problems, the massive parallelism of
the GPU architecture is very attractive [8, 3, 1, 9], and we develop in this work
efficient GPU algorithms for forming and factoring AD2AT .
In section 2, we briefly review the weighted least squares problem and how
it is solved. Since the matrix AD2AT is symmetric, only its lower triangular
part is needed, and this reduces computation by 50%. We discuss in section
3 how the matrix can be assembled on a GPU using triangular rasterization.
Then we present a GPU algorithm based on a rectangular packed storage form.
For packed storage, we store a lower triangular matrix by moving the submatrix
at the bottom right, rotated by 180 degrees, to the unused upper right corner
of the array, thus reducing the required storage. Section 4 explains how factor-
ization can be performed in either full or packed storage. Results in section 5
demonstrate that by rasterizing two triangles simultaneously, we can assemble
and factor the packed matrix as fast as the non-packed matrix. Section 6 gives
our conclusions and implications for design of GPU languages.
For consistency with GPU notation, we number elements in matrices and
vectors starting with 0 instead of 1.
2
2 Weighted Least Squares
Given the data (tk, bk) (k = 0, . . . , n−1), and the model (1), we want the model





To find good coefficients, we could think of summing the squares of the differ-
ences between the values bk and the model prediction. Let A be the m × n
matrix with entries ajk = φj(tk). If we form a vector b from the values bk and






2 ≡ ∥∥∥b−AT x∥∥∥2 .
The terms in the summation should be weighted by how sure we are about
the data; we multiply the k-th term by d2k, where we assume that the error in













∥∥∥D(b−AT x)∥∥∥2 , (2)
where D is a diagonal matrix with entries dk.
To solve this problem, we can set the gradient of ‖D(b −AT x)‖2 to zero,
obtaining the linear system of equations
AD2AT x = AD2b. (3)
If A has full rank (i.e., the basis functions φj are linearly independent and the
data points tk are distinct), then this system has a unique solution.
Therefore, we can solve our weighted least squares problem (2) by forming
the matrix AD2AT and then solving the linear system of equations. The most
efficient method for solving this requires computing a lower triangular matrix L
so that LLT = AD2AT . The matrix L is called the Cholesky factor of AD2AT ,
and its computation requires approximately m3/3 operations.
We propose to form AD2AT and factor it on a GPU. This might limit us
to single precision. If we require a double precision answer x, then it might
be necessary to perform mixed precision iterative refinement, as explained in
Algorithm 1. Note that we form the initial x0sgl by forward and back substitu-
tion, solving Ly = AD2b from the first to the last equation, and then solving
LT x0sgl = y from the last to the first equation.
3






rk = AD2b−AD2AT xkdbl; // A,D and b in double precision




k = k + 1
until ‖rk‖2/‖xk+1dbl ‖ ≤ ε or k≥iteration limit
3 Assembling Symmetric Matrices
In this section we discuss how AD2AT can be formed efficiently.
3.1 Full format on a CPU
The full or non-packed format stores an m × m matrix in an m × m array
or texture as illustrated in Fig. 1. The figure also shows how we store other
matrices and vectors used for assembling AD2AT .
Denote the k-th column of A by ak. We observe that






Therefore, we can initialize the array C to 0 and, at step k, add in baTk , where
b = d2kak.
Assembling the matrix AD2AT without considering its structure wastes
computational resources, because its upper and lower triangular parts are ex-
actly the same. To avoid redundancy we may omit computing the upper tri-
angular part and copy the lower triangular part if necessary. The algorithm
consists of two main routines, column scaling and outer product addition. At
the k-th iteration, we scale ak by d2k, and store it in a temporary vector b. Then
we add the outer product of b and ak to C. Considering the redundancy, we
may write a CPU version that computes only the lower triangular matrix as in
Algorithm 2.
3.2 Full format on a GPU
We can match the two routines, column scaling and outer product addition, to
GPU kernels as described in Kernels 1 and 2 and Fig. 2. The key to saving
computational time lies in how we arrange the two nested for loops (iterating
with i and j) for the outer product addition. We design the loops so that they
iterate over the lower triangular matrix. We can implement this simple strategy

























































































d4d3d0 d2d1 d5 d6 d7
x
.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5
.5b4b3b0 b2b1 b5




Figure 1: This figure illustrates how we store matrices and vectors for m = 6
and n = 8. We use C for storing the result of AD2AT , b for temporary storage,
and d for storing the diagonal of D2. The white colored entries of C store 0.
Algorithm 2 A CPU algorithm to form AD2AT
// Indices are in (row, column) order
// Input d contains the diagonal elements of the scaling matrix D2
// We store a scaled column in b
C = zeros(m,m);
for k = 0 to n-1 do
// Column scaling
for i = 0 to m-1 do
b(i) = d(k)*A(i,k);
end for
// Outer product addition
for j = 0 to m-1 do
for i = j to m-1 do





can initiate the outer product addition kernel to work on the lower triangular
matrix as illustrated in Fig. 2(b).
Kernel 1 GPU kernel for column scaling
float main( uniform samplerRECT A : TEXUNIT0 ,
uniform samplerRECT d : TEXUNIT1 ,
float3 index : TEXCOORD0 ) : COLOR {
return texRECT(A, index.yx) * texRECT(d, index.yz);
}
Kernel 2 GPU kernel for outer product addition
float main( uniform samplerRECT C : TEXUNIT0 ,
uniform samplerRECT A : TEXUNIT1 ,
uniform samplerRECT b : TEXUNIT2 ,
float2 C_index : WPOS ,
float2 A_index : TEXCOORD0 ,
float2 b_index : TEXCOORD1 ) : COLOR {
return texRECT(C, C_index.xy)
+ texRECT(A, A_index.xy) * texRECT(b, b_index.xy);
}
To generate texture coordinates for fetching the input textures A and b in
the outer product addition kernel, we should attach texture coordinates to each
vertex of the triangle. The matrix A is stored in an n × m texture and the
temporary vector b, which stores a scaled column of A, is stored in an m × 1
texture with w idth × height ordering. At the k-th iteration, we need two sets
of texture coordinates, (k + 0.5, j) for fetching A and (i, 0.5) for b, to process
a fragment at (i, j) (with (x, y) coordinates ordering). The y coordinate of
(k + 0.5, j) and the x coordinate of (i, 0.5) are synchronized with the fragment
position. So for a vertex at (x, y), we attach two sets of texture coordinates
(k +0.5, y) for A and (x, 0.5) for b. Then the rasterizer will linearly interpolate
the required coordinates for fetching the inputs.
Implementing the column scaling operation is easier than the outer product
addition. We first set the viewport size as m × 1 so that the size of a pixel
matches the size of a texel. Of course, we should change the viewport size to
m×m when we perform the outer product addition. To make the column scaling
kernel operate on the temporary vector b, we draw a rectangle with vertices at
(0, 0), (0, 1), (m, 1) and (m, 0). In processing a fragment at (i, 0.5) of b, the
kernel needs to fetch the entries at (k + 0.5, i) of A and at (k + 0.5, 0.5) of d.
The y coordinate of the texture coordinates for fetching A is synchronized with
the x coordinate of the fragment position. So for a vertex at (x, y), we attach a
set of texture coordinates (x, k+0.5, 0.5), and use the y and x coordinates of the
interpolated coordinate triad for fetching A and y and z for d. By combining
the two routines, we can write an algorithm for assembling AD2AT on a GPU
as described in Algorithm 3.
6
Algorithm 3 Assembling AD2AT in the full format on a GPU
Create two textures, CS and CT , of size m×m;
Create a texture b of size m× 1;
Bind CS as the target;
Clear the target buffer;
Bind CT as the target;
Clear the target buffer;
for k = 0 to n-1 do
// Scale the k-th column of A
Load ‘column scaling’ kernel;
Bind b as the target;
Bind A and d as the input textures;
Set viewport size as m× 1;
Draw a rectangle covering the texture b;
// Add the k-th outer product to CT
Load ‘outer product addition’ kernel;
Bind CT as the target;
Bind CS , A and b as the input textures;
Set viewport size as m×m;
Draw a triangle covering the lower triangular part of CT ;














(a) Rasterizing a rectangle for











(b) Rasterizing an isosceles right triangle
for the k-th outer product addition.
Figure 2: k-th iteration of the matrix multiplication for computing AD2AT .
Coordinates are in (x, y, z) order. Ti represents the i-th texture coordinates
attached to a vertex V . For fixed texture coordinates, we use an offset of 0.5







(a) An example of packing a 6×6 ma-





(b) An example of packing a 7×7 ma-
trix in a 4× 7 texture.
Figure 3: In the packed format, we save storage by storing the right lower
triangular submatrix in the upper corner by rotating it.
3.3 Rectangular packed format
In addition to saving computational resources, we can save storage space by
exploiting the matrix structure. Since our GPU algorithm computes only the
lower triangular matrix, the upper triangular part is not used at all. Gunnels
and Gustavson explored a rectangular packed format requiring half the storage
space [6]. In our work, we store the rotated right lower triangular submatrix in
the upper left corner as illustrated in Fig. 3. Packing a symmetric or triangular
matrix results in a w×h texture, where w = dm/2e and h = (m+mod(m+1, 2))
represent width and height, respectively.
The key to assembling the matrix in the packed format is the triangular
rasterization. The column scaling is not different from section 3.2. For the
outer product addition, we need to draw two separate triangles, because the
input access pattern of the fragments in the lower trapezoid is different from the
upper triangle. So one triangle, with vertices 1 through 3, covers the trapezoid
and the other, with vertices 4 through 6, covers the upper triangular part.
We can use the same kernels by attaching different texture coordinates T0
and T1 to each vertex V of the triangle covering the upper part. We attach
texture coordinates for the vertices of the upper part as we would for their
original positions in the full format. The texture coordinates attached to vertices
1 through 3 are exactly the same as those used for the full format. We summarize
how we issue vertices with attached texture coordinates in Fig. 4.
4 Cholesky Decomposition
We can factor an m × m symmetric positive definite matrix C as C = LLT ,
















































(b) When m is odd, h = m and w =
m/2 + 0.5.
Figure 4: Drawing two isosceles right triangles is the key to assembling the
matrix in packed format. Notice that the texture coordinates for the 6th vertex
are same as those for the 3rd. Texture coordinates for the 5th vertex would be
same as those for a vertex at (w, h).
4.1 Full format on CPU
A CPU implementation of the Cholesky decomposition consists of three com-
ponents, described in Algorithm 4. We repeat square rooting the k-th diagonal
entry, normalizing the entries below the diagonal, and subtracting the outer
product of the normalized column from the entries in the remaining lower trian-
gular submatrix. In the k-th normalization process, we use the pivot to indicate
the k-th diagonal entry. In the k-th outer product subtraction process, we refer
to the entries at (j, k) and (i, k) as the pivot and neighbor of an active entry at
(i, j), respectively.
4.2 Full format on GPU
We can transform each component of Algorithm 4 to a GPU kernel as described
in Kernels 3, 4 and 5.
Kernel 3 GPU kernel for square rooting
float main( uniform samplerRECT L : TEXUNIT0 ,




Algorithm 4 A CPU version of Cholesky decomposition
// Indices are in (row, column) order
L = lower triangle(C);
for k = 0 to m-2 do
// Square rooting
L(k, k) = sqrt(L(k, k));
// Normalization
for i = k+1 to m-1 do
L(i, k) = L(i, k)/L(k, k);
end for
// Outer product subtraction
for i = k+1 to m-1 do
for j = k+1 to i do




L(m-1, m-1) = sqrt(L(m-1, m-1));
Kernel 4 GPU kernel for normalization
float main( uniform samplerRECT L : TEXUNIT0 ,
float2 index : WPOS ,




Kernel 5 GPU kernel for outer product subtraction
float main( uniform samplerRECT L : TEXUNIT0 ,
float2 index : WPOS ,
float2 neighbor_index : TEXCOORD0 ,
float2 pivot_index : TEXCOORD1 ) : COLOR {




For the outer product subtraction, we draw an oversized triangle whose
vertices are at (k + 1, k + 0.5), (k + 1,m) and (m + 0.5,m) to cover the lower
triangular matrix. As seen in Fig. 5, we attach two sets of texture coordinates
to each vertex to generate indices for the active neighbor and pivot. An active
fragment at (j, i) will have the interpolated texture coordinates T0 of (k+0.5, i)
and T1 of (j, k + 0.5). As described in Fig. 5(b), the pivot is at (k + 0.5, j). So
we use the GPU’s swizzle operator to rearrange indices for the pivot.
For the square rooting, we simply draw a square to make the kernel operate
on the diagonal entry. For the normalization kernel, we draw a rectangle cover-
ing the off-diagonal column with attached texture coordinates pointing to the
10
square rooted diagonal entry. By combining all three rasterization processes, we
can write a GPU version of Cholesky decomposition as described in Algorithm
5.
Algorithm 5 Cholesky decomposition on a GPU
Create two streams LS and LT of size equal to C;
Copy C to LS ;
for k = 0 to m− 2 do
Load square root kernel;
Bind LT as the target and LS as an input texture;
Draw a square covering the entry at (k, k);
Load copy kernel;
Bind LS as the target and LT as an input texture;
Draw a square covering the entry at (k, k);
Load normalization kernel;
Bind LT as the target and LS as an input texture;
Draw a rectangle covering the k-th column below the diagonal;
Load copy kernel;
Bind LS as the target and LT as an input texture;
Draw a rectangle covering the k-th column below the diagonal;
Load outer product subtraction kernel;
Bind LT as the target and LS as an input texture.
Draw an isosceles right triangle covering lower triangular submatrix;
Swap LS and LT ;
end for
Load square root kernel;
Bind LT as the target and LS as an input texture;
Draw an isosceles right triangle covering the entry at (m− 1, m− 1);
// Now LT is the Cholesky factor of C
To complete the factorization, we draw m squares, m − 1 rectangles, and
m− 1 triangles to initiate the square rooting, normalization, and outer product
subtraction kernel, respectively. Drawing a square covering a single pixel to
initiate the square root kernel does not utilize the parallel architecture [7]. As
a result our Cholesky algorithm utilizes less memory bandwidth for small ma-
trices than the GPU LU decomposition algorithm of Galoppo et al. [4]. Coping
with this trouble requires architectural changes supporting thread level paral-
lelism. With the support, square rooting could be done in conjunction with the
outer product subtraction once the outer product subtraction kernel finishes














(a) We draw an oversized triangle to
cover the sub-lower triangular part of
the matrix. We attach texture coor-
dinates T0 to the vertices to get the




(b) An active fragment at (j, i) has
its active neighbor at (k + 0.5, j) and
active pivot at (k + 0.5, i), whereas it
has interpolated texture coordinates
T0 of (k+0.5, i) and T1 of (j, k+0.5).
Figure 5: These figures illustrate how we rasterize an oversized triangle for the
k-th outer product subtraction of Cholesky decomposition.
4.3 Rectangular packed format
In order to factor a symmetric positive definite matrix in packed format, we
draw two triangles for the outer product subtraction in the first half of the
iterations as we did in section 3.3. In the second half of the iterations, we draw
a single triangle covering the upper triangular part. Since we draw triangles
differently, we need to divide the single for loop in Algorithm 5 into two, the
first of which iterates k from 0 to w − 1 and the second of which iterates k
from w − 1 to 1 (if m is even, as in Fig. 6) or to 2 (if m is odd, as in Fig. 7).
We can also use the same kernels that we used for the full format by attaching
different texture coordinates to each vertex of the triangles. We summarize how
to specify vertices and texture coordinates for the outer product subtraction in
Fig. 6 and Fig. 7.
5 Results
We tested our algorithms using a PC equipped with an Intel Xeon 3.0GHz CPU
and a GeForce 7800 GTX GPU attached to 16x PCIe slot, and running on 64bit
Red Hat Linux.
5.1 Assembling AD2AT
The packed format can save half of the storage space without sacrificing speed.
Fig. 8 compares our implementation for packed format with an algorithm made

























(a) For the first half of the iterations,
we iterate k from 0 to w−1. The tex-
ture coordinates attached to the 6th













(b) For the second half of the itera-
tions, we decrease k from w − 1 to 1.
Figure 6: These figures illustrate how we rasterize the triangles for the k-th outer























(a) For the first half of the iterations,
we iterate k from 0 to w−1. The tex-
ture coordinates attached to the 6th













(b) For the second half of the itera-
tions, we decrease k from w − 1 to 2.
Figure 7: These figures illustrate how we rasterize the triangles for the k-th outer
product subtraction of Cholesky decomposition for the packed format when m
is odd.
13
Figure 8: Timing result for assembling AD2AT . Transferring A and d to and
retrieving the result from the GPU are not considered.
Software) library. As the packed version performs almost the same as the non-
packed version, we do not provide the timing result for non-packed format. (For
m=2048, the non-packed version is only 1% faster than the packed version.)
Drawing two triangles only requires three more vertices and three more sets
of texture coordinates, and the time is negligible. The number of processed
fragments at every iteration determines the performance of our algorithm, and
it is exactly the same in both the non-packed and the packed version.
Our GPU implementations outperforms that of ATLAS for m larger than
768, but, due to latency caused by initiating kernels through drawing shapes,
GPU implementations are slower than ATLAS for small matrices. Problems of
size less than 1024 are considered small by today’s standards.
5.2 Cholesky Decomposition
Triangular rasterization is the key to implementing the algorithms for decom-
position. Without the triangular rasterization, we cannot achieve performance
gain over LU [7]. We compared our Cholesky algorithm with the LU algorithm
written by Galoppo et al. [4]. The LU has wider applicability (since it can
be used for nonsymmetric matrices, too) but requires twice as many arithmetic
operations. As seen in Fig. 9, our algorithm outperformed the LU algorithm
by 83.5% for m = 3584. We also compared our algorithms with spotrf, the
Cholesky decomposition algorithm of ATLAS.
14
Figure 9: Timing result for Cholesky decomposition.
As for the Cholesky decomposition algorithm, the number of generated frag-
ments at every iteration is exactly the same in both the non-packed and the
packed version. For this reason, the algorithm for the rectangular packed for-
mat performs as fast as that for the non-packed format. For m = 3328, the
non-packed version is only 1% faster than the packed. We cannot pack the ma-
trix for m = 4096, because the packed matrix would have height 4097 whereas
the maximum texture height that the GPU supports is 4096.
5.3 Weighted Least Squares
In our implementation, we used the MATLAB-C interface to make use of MAT-
LAB functions for operations other than the assembly and factorization. CPU
operations use double precision. MATLAB uses Intel Math Kernel Library for
the BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra
PACKage) routines. Table 1 and Table 2 show the timing results and perfor-
mance gain of our implementation using the GPU over the CPU implementa-
tion in solving randomly generated problems of various sizes, with n = 2m. We
measured the speedup by dividing the CPU timing result by that of GPU. We
measured the relative error of xkdbl by computing∥∥xkdbl − xCPU∥∥
‖xCPU‖
15









512 3.11E-04 3.37E-13 4 0.39 0.02 0.41 0.20 0.49
1024 3.10E-04 4.25E-13 4 0.49 0.04 0.53 1.03 1.94
1536 4.61E-04 6.96E-13 4 1.33 0.09 1.42 3.41 2.40
2048 1.01E-03 1.76E-12 5 2.99 0.22 3.21 7.95 2.48









512 2.92E-02 1.16E-10 7 0.36 0.02 0.38 0.20 0.53
1024 7.19E-02 2.01E-10 10 0.48 0.10 0.58 1.05 1.81
1536 9.35E-02 2.37E-10 13 1.33 0.30 1.63 3.41 2.09
2048 1.13E-01 3.41E-10 15 3.00 0.66 3.66 7.98 2.18
where xCPU is the solution to the system of equations (3) computed only by
the CPU in double precision. Timing results to get x0sgl include transferring
the data A and d, and retrieving the Cholesky factor L. We set the toler-
ance parameter ε to 10−8 and the iteration limit to 100 in Algorithm 1. We
generated the data matrix A, the diagonal entries of D2, and the vector b
using uniformly distributed random numbers between 0 and 1. To generate ill-
conditioned problems, we set the i-th diagonal element of D2 as 10−4+8i/(n−1)
for i = 0, . . . , n−1. Ill-conditioned problems require more refinement iterations.
Notice that we obtained more than 100% speedup for m larger than 1536 in both
types of problems.
6 Conclusions
We have presented efficient GPU algorithms for forming a dense positive definite
matrix AD2AT and then computing its Cholesky factor. The best algorithms
exploit triangular rasterization to minimize storage without increasing compu-
tation time. By comparing our implementations with conventional CPU algo-
rithms, we demonstrated the potential of the commodity parallel architecture
of a GPU for solving important numerical problems.
We demonstrated performance advantage in solving weighed least squares
problems by using a GPU. In such applications, assembling and factoring are
most demanding of computational resources. Thus support for triangular ras-
terization is essential in exploiting the structure of symmetric or triangular
16
matrices. This provides one reason for developers of streaming languages such
as BrookGPU [1] and Accelerator [11] to consider implementing this feature.
References
[1] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston,
P. Hanrahan, Brook for GPUs: stream computing on graphics hardware,
in: SIGGRAPH ’04: ACM SIGGRAPH 2004 Papers, ACM, New York,
NY, USA, 2004.
[2] A. Buttari, J. Dongarra, J. Langou, J. Langou, P. Luszczek, J. Kurzak,
Mixed precision iterative refinement techniques for the solution of dense
linear systems, International Journal of High Performance Computing Ap-
plications 21 (4) (2007) 457–466.
[3] K. Fatahalian, J. Sugerman, P. Hanrahan, Understanding the efficiency
of GPU algorithms for matrix-matrix multiplication, in: HWWS ’04:
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on
Graphics Hardware, ACM Press, New York, NY, USA, 2004.
[4] N. Galoppo, N. K. Govindaraju, M. Henson, D. Manocha, LU-GPU: Effi-
cient algorithms for solving dense linear systems on graphics hardware, in:
SC ’05: Proceedings of the 2005 ACM/IEEE Conference on Supercomput-
ing, IEEE Computer Society, Washington, DC, USA, 2005.
[5] G. H. Golub, C. F. Van Loan, Matrix Computations (3rd ed.), Johns Hop-
kins University Press, Baltimore, MD, USA, 1996.
[6] J. A. Gunnels, F. G. Gustavson, A new array format for symmetric and
triangular matrices., in: PARA, 2004.
[7] J. H. Jung, Cholesky decomposition and linear programming on a GPU,
Scholarly Paper (Jan. 2006).
[8] J. Krüger, R. Westermann, A GPU framework for solving systems of linear
equations, in: M. Pharr (ed.), GPUGems 2 : Programming Techniques for
High-Performance Graphics and General-Purpose Computation, chap. 44,
Addison-Wesley, 2005, pp. 703–718.
[9] W. R. Mark, R. S. Glanville, K. Akeley, M. J. Kilgard, Cg: a system for
programming graphics hardware in a C-like language, in: SIGGRAPH ’03:
ACM SIGGRAPH 2003 Papers, ACM, New York, NY, USA, 2003.
[10] Y. Nesterov, A. Nemirovski, Interior Point Polynomial Algorithms in Con-
vex Programming, SIAM, 1995.
[11] D. Tarditi, S. Puri, J. Oglesby, Accelerator: simplified programming of
graphics processing units for general-purpose uses via data-parallelism,
Tech. Rep. MSR-TR-2005-184, Microsoft (Dec. 2005).
17
[12] S. J. Wright, Primal-Dual Interior-Point Methods, Society for Industrial
and Applied Mathematics, Philadelphia, PA, USA, 1997.
18
