Performance Gains in Conjugate Gradient Computation with Linearly Connected GPU Multiprocessors by Tarsa, Stephen John et al.
 
Performance Gains in Conjugate Gradient Computation with
Linearly Connected GPU Multiprocessors
 
 
(Article begins on next page)
The Harvard community has made this article openly available.
Please share how this access benefits you. Your story matters.
Citation Stephen J. Tarsa, Tsung-Han Lin, and H.T. Kung. 2012.
Performance gains in conjugate gradient computation with linearly
connected GPU multiprocessors. Proceedings of the 4th USENIX
Workshop on Hot Topics in Parallelism (HotPar'12), June 7-8,
2012, Berkley, CA: 1-7.
Published Version https://www.usenix.org/conference/hotpar12/performance-gains-
conjugate-gradient-computation-linearly-connected-gpu
Accessed February 19, 2015 1:43:54 PM EST
Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:11859330
Terms of Use This article was downloaded from Harvard University's DASH
repository, and is made available under the terms and conditions
applicable to Open Access Policy Articles, as set forth at
http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-
use#OAPPerformance Gains in Conjugate Gradient Computation with Linearly
Connected GPU Multiprocessors
Stephen J. Tarsa
Harvard University
Tsung-Han Lin
Harvard University
H.T. Kung
Harvard University
Abstract
Conjugate gradient is an important iterative method used
for solving least squares problems. It is compute-bound
and generally involves only simple matrix computa-
tions. One would expect that we could fully parallelize
such computation on the GPU architecture with multiple
Stream Multiprocessors (SMs), each consisting of many
SIMD processing units. While implementing a conju-
gate gradient method for compressive sensing signal re-
construction, we have noticed that large speed-up due to
parallel processing is actually infeasible due to the high
I/O cost between SMs and GPU global memory. We
havefoundthatifSMswerelinearlyconnected, wecould
gain a 15x speedup by loop unrolling. We conclude that
adding these relatively inexpensive neighbor connections
for SMs can signiﬁcantly enhance the applicability of
GPUs to a large class of similar matrix computations.
1 Introduction
The recent ubiquity of cheap, powerful GPUs, featuring
hundreds of processing units (PUs) at a cost of less than
$1 per PU, has made them a primary building block of
modern high-performance computing. Built to acceler-
ate real-time graphics rendering, GPUs are comprised
of multiple independent Stream Processors (SMs), each
consisting of many SIMD PUs. In general, they are
amenable to embarassingly data parallel tasks that re-
quire high computational throughput. Active research in
recent years has focused on adapting algorithms in var-
ious application domains to the speciﬁcs of this archi-
This material is in part based on research sponsored by Air
Force Research Laboratory under agreement number FA8750-10-2-
0180. The U.S. Government is authorized to reproduce and distribute
reprints for Governmental purposes notwithstanding any copyright no-
tation thereon. The views and conclusions contained herein are those
of the authors and should not be interpreted as necessarily representing
the ofﬁcial policies or endorsements, either expressed or implied, of
Air Force Research Laboratory or the U.S. Government.
tecture, and general-purpose GPU computing has found
success in diverse ﬁelds including ﬁnance and biol-
ogy [1].
However, in the process of adapting a class of sig-
nal processing algorithms for the GPU, we notice that
even in a simple case of computation on linear systems
with small input data sets, popular iterative algorithms
like conjugate gradient methods cannot be efﬁciently im-
plemented across multiple SMs. Fundamentally, this is
due to synchronization and I/O costs that occur when-
ever such algorithms must aggregate operands computed
by different SMs, e.g. to calculate vector norms, or sort
component magnitudes. We show that synchronization
on the current GPU architecture, either using the host
CPU, or with a barrier implemented in GPU global mem-
ory, causes running time to be dominated by the high la-
tency penalty between SMs and global memory. In this
paper, we argue that one dimensional (1D) interconnec-
tions between neighboring SMs would substantially in-
crease parallel speedup gains on future generations of
GPUs by reducing synchronization and operand aggre-
gation time.
Wefocusonoptimizingtheconjugategradientcompu-
tation to demonstrate these limitations, and showcase the
possible speedups from such suggested interconnections.
Conjugategradientiswell-knownandwidely-appliedfor
solving least squares problems and ﬁnding solutions to
sparse linear systems. It is of interest to us due to its cen-
tral role in decoding compressively sensed signals [2]. It
is compute-bound, requires mostly simple matrix opera-
tions, and for many applications of interest in compres-
sive sensing, operates on a small amount of input data
that can be stored entirely in GPU global memory. Intu-
itively, this is an ideal case for GPU acceleration.
Our paper proceeds as follows: we brieﬂy review
the latest nVidia Fermi GPU architecture and CUDA,
its computing platform and programming model. We
present conjugate gradient, its applications, and our
strategy for acceleration using the GPU. We show howL2	 ﾠCache	 ﾠ
Global	 ﾠMemory	 ﾠ
PCIe	 ﾠ2.0	 ﾠBus	 ﾠto	 ﾠCPU	 ﾠ
…
Shared	 ﾠMem	 ﾠ
L1	 ﾠCache	 ﾠ
SM	 ﾠ1	 ﾠ
Shared	 ﾠMem	 ﾠ
L1	 ﾠCache	 ﾠ
SM	 ﾠ2	 ﾠ
Shared	 ﾠMem	 ﾠ
L1	 ﾠCache	 ﾠ
SM	 ﾠ3	 ﾠ
Shared	 ﾠMem	 ﾠ
L1	 ﾠCache	 ﾠ
Processing	 ﾠUnits	 ﾠ(PUs)	 ﾠ
Figure 1: A simpliﬁed depiction of the nVidia Fermi architec-
ture. SMs have limited on-chip memory shared among their
PUs. Global memory is visible to all SMs, but while its aggre-
gate bandwidth is high (152.0 GB/sec on the 570 GTX), round
trip access latency from PUs is also high, and can vary between
400-600 clock cycles [3].
the lack of support for direct inter-SM communica-
tion causes costs associated with synchronization and
operand aggregation to dominate any multi-SM adapta-
tion of conjugate gradient. We then demonstrate gains on
potential future GPUs with 1D SM interconnections. Fi-
nally, we survey the current state of compilers for paral-
lel architectures, arguing that standard optimization tech-
niques such as loop unrolling are readily amenable to
GPUs with such interconnected SMs.
2 Background
2.1 GPU Hardware Overview
Using the nVidia GTX 570 as an example, we ﬁrst brieﬂy
describe the Fermi GPU architecture, sketched in Fig-
ure 1. The GTX 570 is comprised of 15 independent
Stream Multiprocessors (SMs), each an SIMD multipro-
cessor with 32 PUs operating at 1.4 GHz. The primary
on-chip memory for an SM is a 64 KB shared memory
pool, visible to its PUs, and organized into 32 banks so
PUs may access memory in parallel.
All SMs are connected to a 1.2 GB global memory
store. Aggregate bandwidth between the SMs and global
memory is 152 GB/s, and the round trip access latency
varies between 400-600 clock cycles from PUs [3]. The
GPU also features texture and constant memory units
with lower latency to SMs. However, since their contents
aremanagedbythehostprogramontheCPU,weeschew
their use to minimize CPU intervention during program
execution, which is extremely costly due to high latency.
2.2 CUDA Overview
The above system is exposed to application programmers
throughnVidia’sCUDAlanguageandcompiler[3]. Data
parallel functions are coded in CUDA’s C-like syntax,
and are termed kernels. Upon launch, the GPU replicates
a kernel’s instructions into multiple data independent
blocks, and assigns them to SMs. At each SIMD SM,
kernel instructions execute simultaneously on the PUs,
forming parallel execution threads. SMs are generally
overscheduled with multiple batches of threads, called
warps; this allows I/O latencies to be hidden by swap-
ping out warps whose threads are waiting on operands.
In CUDA, execution synchronization is supported
only at the lowest level, for threads. When branching and
I/O serialization cause threads to fall out of step, paral-
lelism is reduced, so CUDA’s syncthreads() primi-
tive restores proper execution order. In order for the GPU
to schedule and load balance kernel execution over SMs,
blocks are assumed independent, and so no synchroniza-
tion is supported among SMs. However, the host-CPU
may launch multiple kernels at once to pipeline launch
overhead, and a stream identiﬁer is provided to enforce
kernel execution order; kernels launched into the same
stream will execute in serial, while those in different
streams may execute asynchronously.
Throughout, global memory is the only persistent re-
source that is visible to all threads and all blocks. A pro-
gram on the host CPU is responsible for managing data
transfer to and from global memory, as well as kernel
launches.
2.3 Conjugate Gradient Algorithm and
Applications
The conjugate gradient (CG) computation on normal
equations, our target for acceleration, is an efﬁcient and
numerically stable algorithm that can be used to solve
least squares problems [4]. Its steps are shown in Al-
gorithm 1. CG forms its least-squares estimation by it-
eratively reﬁning search direction vectors and updating
a residual vector. The most compute intensive steps are
matrix-vector multiplications (steps 3 and 7). The re-
maining steps are a series of simple vector additions and
scalar multiplications. However, the scalar multipliers a
and b are computed from vector norms of the results of
steps 3 and 7. We will see that, if we accelerate the ma-
trix vector multiplications by parallelizing across SMs,
aggregatingtheirresultstocalculatenormswillbecostly.
Conjugate gradient is a widely-applied optimization
algorithm in many communities, including scientiﬁc
computing, signal and image processing, and machine
learning. It is of particular interest to us due to its role
in reconstructing compressively sensed signals [2]. In
2Algorithm 1 Conjugate Gradient on Normal Equations
Input: vector y, matrix A, initial approximation vector x0
Output: vector xi
1: r0 = y Ax0; s0 = p0 = ATr0; g0 = ks0k2
2
2: while gi >tol do
3: qi = Api
4: a = gi=kqik2
2
5: xi+1 = xi+api
6: ri+1 = ri aqi
7: si+1 = ATri+1
8: gi+1 = ksi+1k2
2
9: b = gi+1=gi
10: pi+1 = si+1+bpi
11: end while
compressive sensing applications, such as those in med-
ical imaging, security, spectrum sensing, and wireless
sensor networks, faster signal reconstruction translates
directly into application-level gains [5][6][7]. For exam-
ple, in CS-based spectrum sensing, faster decoding leads
directly to the ability to scan a wider frequency band
within a given delay budget [8]. GPUs are a potentially
cost-effective, portable accelerator that would make an
immediate impact for these applications.
2.4 Dense Linear Algebra on GPUs
Currently, dense linear algebraic computation is sup-
ported on the GPU primarily by CUBLAS, a package
maintained by nVidia to incorporate improvements from
the research community. It includes routines for ma-
trix and vector operations that can be decomposed into
independent blocks for batched execution over multiple
SMs, butitreliesonCPUcoordinationwhenresultsmust
be aggregated [9]. The implementation of CG provided
with CUBLAS uses these routines to implement Algo-
rithm 1, and stages the iteration loop on the CPU [10].
CUBLAS’s routines are intended as generic build-
ing blocks for linear algebra-heavy algorithms, at the
expense of sub-optimal runtime. For example, since
CPU-to-GPU communication can exhibit up to 11 ms la-
tency, coordinating on the host machine provides pro-
gramming ﬂexibility, but introduces large delays. As
recently acknowledged by other authors, achieving best
performance on the GPU in general requires minimizing
CPU intervention [11].
3 Accelerating CG with Multiple SMs
As discussed in Section 2.3, solving a least squares prob-
lem using CG requires many iterations of the loop in Al-
gorithm 1, whose most compute-intensive steps (3 and
7) are matrix-vector multiplications. Fortunately, these
are parallelizable, and our strategy for acceleration on
the GPU is to decompose A and AT row-wise, and group
the resulting vector multiplications into different blocks.
The results must then be aggregated to compute scalar
multipliers in steps 4 and 8, as depicted in Figure 2.
In order to show how implementing CG on the GPU
leads to a signiﬁcant I/O bottleneck, we ﬁrst model the
idealized I/O cost of Algorithm 1. We examine the to-
tal number of operands that must be loaded from global
memory into SM shared memoryfor access by PUs. This
ignores the upload cost from the CPU, which is ﬁxed and
independent of GPU architecture. We also assume that
the cost of fetching operands into PUs from SM shared
memory is negligible. The I/O cost is then:
TI/O = [(2MN+N)+l(2N+2M)]=g (1)
Since A and AT are reused in successive iterations, we
load A, AT, and x0 once. This accounts for 2MN +N to-
tal operands when A is M N. Then, in each iteration,
we store computed vectors s and p to global memory and
subsequently refresh them. Over l iterations, this ac-
counts for l(2N+2M) operands. In Equation 1, g is the
pipeline rate between global and shared memories.
Matrix-ﾭ‐Vector	 ﾠ
Mult	 ﾠ
Block	 ﾠ1	 ﾠ
Matrix-ﾭ‐Vector	 ﾠ
Mult	 ﾠ
Block	 ﾠ2	 ﾠ
Matrix-ﾭ‐Vector	 ﾠ
Mult	 ﾠ
Block	 ﾠ15	 ﾠ
SM	 ﾠ	 ﾠ1	 ﾠ SM	 ﾠ	 ﾠ2	 ﾠ SM	 ﾠ	 ﾠ15	 ﾠ
Vector	 ﾠNorm	 ﾠ
&	 ﾠUpdate	 ﾠ
	 ﾠ
Matrix-ﾭ‐Vector	 ﾠ
Mult	 ﾠ
Block	 ﾠ1	 ﾠ
Matrix-ﾭ‐Vector	 ﾠ
Mult	 ﾠ
Block	 ﾠ2	 ﾠ
Matrix-ﾭ‐Vector	 ﾠ
Mult	 ﾠ
Block	 ﾠ15	 ﾠ
…	 ﾠ
…	 ﾠ
CG	 ﾠLine	 ﾠ3	 ﾠ
CG	 ﾠLine	 ﾠ7	 ﾠ
CG	 ﾠLines	 ﾠ	 ﾠ
4-ﾭ‐6	 ﾠ
…	 ﾠ
Figure 2: A ﬂow diagram showing the data dependency of
steps in CG. Matrix-vector multiplications are parallelizable,
but vector norms require aggregating results, leading to a need
for barrier synchronization and operand exchange.
3.1 CPU Staging
Since CUDA does not support inter-SM synchronization,
iterative algorithms must be implemented by launching
subsequent iterations as ordered kernels in a common
stream, the strategy taken by CUBLAS. This means that
the contents of caches and shared memory are ﬂushed
and reloaded for every kernel, and reusable operands are
not persistent across iterations. For CG, we incorporate
the cost of reloading A and AT into Equation 1, and the
penalty becomes apparent:
T0
I/O = [(2MN+N)+l(2MN+2N+2M)]=g (2)
3The I/O cost given by Equation 2 is 18x worse when
M = 128, N = 256, and l = 20. Emprically, l = 20
iterations are often required for CG convergence in this
setting, though at l = 10, the I/O cost is still 9x higher
than the ideal case.
3.2 Global Memory Barrier
For reusable operands to be persistent, we must unroll
CG into a single kernel and implement a barrier for syn-
chronization and operand exchange within the lifetime
of a block. We ﬁrst do this using global memory, the
only resource for communication among SMs without
CPU involvement in the current architecture. Using a
simple kernel that evalulates Alu over l iterations, we
then study kernel performance. Structurally, this kernel
is similar to CG in that elements of A may be reused, and
that the most recently updated u must be aggregated at
everyiteration. Fortheresultspresented, Aisa384384
element matrix of single-precision ﬂoats, the largest that
can be accomodated by the current shared memory size.
Access	 ﾠ
Delays	 ﾠ
Context	 ﾠ
Switch	 ﾠDelay	 ﾠ
val=atomicInc(lock)  val=atomicInc(lock) 
val = *lock 
val = *lock 
val = 0 
lock = 0 
lock = 1 
lock = 2 
val = 1 
val = 2 
val = 2 
SM	 ﾠ Global	 ﾠMem	 ﾠ
Warp	 ﾠ1	 ﾠ
Warp	 ﾠ2	 ﾠ
Exit	 ﾠand	 ﾠCon nue	 ﾠ
Execu on	 ﾠ
Barrier	 ﾠ
Delay	 ﾠ
Figure 3: The operation of our software barrier at a single
SM with two warps. The lock incurs two sources of delays:
context switching delay, and atomic access delay. Note that
CUDA’s atomicInc() function returns the lock’s value prior
to its value being changed.
Our barrier works as follows: after guaranteeing the
visibility of interim results to other SMs, we enter the
barrier and use CUDA’s atomicInc() function to atom-
ically increment a counter in global memory, busy wait-
ing while checking the counter value. Currently, CUDA
supports a threadfence() primitive that guarantees
the visibility of memory updates independently by warp,
therefore each warp must check in at the barrier. This
is important because as the number of warps is tuned to
optimize I/O throughput, we will show that relying on
global memory for synchronization will result in an ex-
orbitant performance penalty for large numbers of warps.
Figure 3 illustrates barrier operation at an SM scheduled
with two warps.
If we examine the operand I/O of our test kernel, we
see that it is sharply reduced when compared to a corre-
sponding CPU-staged implementation, as was our goal.
For both methods, A is loaded in 5888 cycles on aver-
age, close to the theoretical lower bound. The vector u is
loaded in an average of 688 cycles, which is dominated
by the 600 cycle latency penalty to global memory. The
total observed operand I/O is reduced from 131,520 to
19,648 cycles using the global memory barrier. These
results are summarized in Figure 6. Extrapolating to CG,
which requires two matrix loads, and two vector loads
per iteration, we project a 7x reduction in operand I/O
time, a reduction that can be further improved by reduc-
ing the latency of operand exchanges.
However, the overall runtime using the global memory
barrier is worse, underscoring the poor performance of
synchronizing via global memory. As illustrated in Fig-
ure 3, there are two sources of delay in the barrier that
affect runtime: context switching time between warps,
and access delay to the lock. These delays compound
as a function of the number of warps, as shown in Fig-
ure 4. For example, with a single warp at each SM, the
barrier delay averages 1250 cycles, or only two round-
trip times to global memory, as expected. However, with
eight warps, the barrier delay is already 10,000 cycles.
Since our test kernel requires 32 warps to achieve max-
imum I/O throughput, the resulting runtime using the
synchronization barrier is an order of magnitude worse:
1:4106 cycles vs. 2:8105 cycles for the CPU-staged
kernel, due to high barrier overhead.
 0
 10000
 20000
 30000
 40000
 50000
 60000
 0  5  10  15  20  25  30
B
a
r
r
i
e
r
 
D
e
l
a
y
 
(
C
y
c
l
e
s
)
Warps Per SM
Figure 4: The number of cycles spent in the lock at each SM
as a function of the number of warps. As warps are added to
pipeline I/O, the software barrier becomes more costly, over-
whelming I/O cost at only eight warps.
3.3 Hardware-Supported Neighbor Syn-
chronization
Low latency synchronization and operand exchange can
be achieved with relatively simple neighbor connections
between SMs, allowing us to realize runtime speedups
by reducing operand I/O. While fast all-to-all SM com-
munication is expensive and does not scale well with the
number of SMs, connections between neighbor SMs are
4Shared	 ﾠMem	 ﾠ
L1	 ﾠCache	 ﾠ
SM	 ﾠi	 ﾠ
Ingress	 ﾠ	 ﾠ
Egress	 ﾠ
Control	 ﾠBuﬀer	 ﾠ
Shared	 ﾠMem	 ﾠ
L1	 ﾠCache	 ﾠ
SM	 ﾠi+1	 ﾠ
Ingress	 ﾠ	 ﾠ
Egress	 ﾠ
Control	 ﾠBuﬀer	 ﾠ
Neighbor	 ﾠ	 ﾠ
Bus	 ﾠ
Figure 5: Our proposed 1D SM-interconnection architecture
adds only inexpensive connections between adjacent SMs and
the appropriate buffers to support operand exchange.
relatively cheap and scale easily. Furthermore, any all-
to-all communication using a centralized memory unit
must ultimately service one request at a time, causing
access delay to vary signﬁcantly, especially under high
load. This effect is apparent in our global memory bar-
rier. However, with simple neighbor communication, lo-
cal connections serve only two SMs, and many connec-
tions can operate in parallel.
Our proposed architecture is shown in Figure 5. In it,
we make the following assumptions: each pair of adja-
cent SMs is connected by a full duplex local bus with
152GB=s bandwidth, equal to that of the global mem-
ory bus; ingress and egress buffers are added for data
exchange with dedicated slots to store 32 single preci-
sion ﬂoating point operands from each SM; an additional
bufferisaddedtostorecontrolﬂags; andlatencybetween
SMs is on the order of 10 cycles, which is half of the ob-
served latency between PUs and the current L2 cache.
In this design, only operand buffers need to be scaled
when more SMs are added, since we assume they are
large enough to hold operands from all SMs, in order to
propagate operand updates with a single transfer.
To emulate the synchronization and exchange process
when only neighbor communication is possible, we simi-
larly implement this barrier using buffers in global mem-
ory, and augment the busy-wait loop to propagate control
ﬂags and data to neighbors. We then plug in the previ-
ous assumptions to estimate the expected performance.
Traversing the linear topology end-to-end requires 14
transfers, and transferring the contents of the data and
controlbufferstakes17cycles, leadingtoatotalsynchro-
nization and operand exchange time of only 240 cycles
per iteration. As shown in Figure 6, the resulting imple-
mentation of our test kernel would have a 15x reduction
in total I/O over the CPU-staged method. When we ex-
trapolate to CG, we achieve a similar reduction in I/O
of 15x. At the dimensions used in our test kernel, CG’s
lower bound for computation is roughly 26,000 cycles
meaning that the implementation is now compute bound
and could be sped up with additional computational re-
sources.
0	 ﾠ
20000	 ﾠ
40000	 ﾠ
60000	 ﾠ
80000	 ﾠ
100000	 ﾠ
120000	 ﾠ
140000	 ﾠ
CPU-ﾭ‐Staged	 ﾠ	 ﾠ
	 ﾠ
Glbl	 ﾠMem	 ﾠBarrier	 ﾠ
	 ﾠ
Neighbor	 ﾠ
Interconnec on	 ﾠ	 ﾠ
C
y
c
l
e
s
	 ﾠ
Barrier	 ﾠ
Delay	 ﾠ
	 ﾠ
	 ﾠ
	 ﾠ
Oprnd.	 ﾠExch	 ﾠ
Oprnd.	 ﾠExch	 ﾠand	 ﾠ
Sync	 ﾠDelay	 ﾠ
Total	 ﾠ
Operand	 ﾠI/O	 ﾠ
Init	 ﾠOprnd	 ﾠLd	 ﾠ Init	 ﾠOprnd	 ﾠLd	 ﾠ
1.2	 ﾠx	 ﾠ106	 ﾠ
Figure 6: The contribution of synchronization and data ex-
change costs to total runtime for our test kernel
4 Implications for Future GPUs
Wehaveshownthatfastsynchronizationandoperandex-
change between SMs are important in order to properly
leverage loop unrolling in iterative algorithms. With-
out such support, current GPU architectures from both
nVidia and ATI/AMD induce heavy operand I/O, reduc-
ing compute-to-I/O ratio, and therefore available paral-
lel processing opportunities. As future generations of
GPUs scale computational resources, inter-SM commu-
nicationwouldalleviatetheseI/Olimitationsatlowhard-
ware cost. For a large class of iterative parallelizable al-
gorithms, these steps are necessary to achieve the beneﬁt
of additional computational power.
With neighbor interconnections already a feature of
some other parallel processors (e.g. the Cell Broadband
Engine [12], and [13]), efﬁcient use of loop unrolling has
been a standard compiler optimization in the literature.
Discussions of this capability include [14] and [15]. It is
therefore reasonable to expect that this optimization can
be readily integrated into the CUDA compiler.
5 Conclusion
GPUs are an attractive many-core platform due to their
cost, availability, and scalable architecture; however
without low-latency hardware-supported intercommuni-
cation between SMs, they are ineffective for accelerat-
ing a large class of iterative parallelizable algorithms.
Given the capabilities of existing compilers and common
parallel processors, and the myriad waiting applications
such as those relying upon high-performance compres-
sive sensing decoding, the GPU architecture itself is cur-
rently the biggest bottleneck to these applications. We
have shown that simply adding cheap neighbor intercon-
nections will lead to signiﬁcant gains and allow gen-
eral purpose GPU computing to accelerate similar matrix
computations.
5References
[1] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and
J. Phillips, “GPU computing,” Proceedings of the IEEE, vol. 96,
no. 5, pp. 879–899, 2008.
[2] D. Needell and J. Tropp, “CoSaMP: Iterative signal recovery
from incomplete and inaccurate samples,” Applied and Compu-
tational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, 2009.
[3] “nVidia CUDA Programming Guide 4.0,,” 2011.
http://developer.nvidia.com/nvidia-gpu-computing-
documentation.
[4] ˚ A. Bj¨ orck, Numerical methods for least squares problems.
No. 51, Society for Industrial Mathematics, 1996.
[5] Y. Gwon, H. Kung, and D. Vlah, “DISTROY: detecting integrated
circuit Trojans with compressive measurements,” in 6th USENIX
Workshop on Hot Topics in Security (HotSec), 2011.
[6] M. Lustig, D. Donoho, and J. Pauly, “Sparse MRI: The appli-
cation of compressed sensing for rapid MR imaging,” Magnetic
Resonance in Medicine, vol. 58, no. 6, pp. 1182–1195, 2007.
[7] Z. Tian and G. Giannakis, “Compressed sensing for wideband
cognitive radios,” in Acoustics, Speech and Signal Processing,
2007. ICASSP 2007. IEEE International Conference on, vol. 4,
pp. IV–1357, Ieee, 2007.
[8] Y. Gwon and H. T. Kung and D. Vlah, “Compressive sensing
with directly recoverable optimal basis and applications in spec-
trum sensing,” in Harvard University Technical Report TR-08-11,
Harvard Univerisity School of Engineering and Applied Science,
2011.
[9] “nVidia CUBLAS Library User Guide,” 2012.
http://developer.nvidia.com/nvidia-gpu-computing-
documentation.
[10] “nVidia CUDA Libraries SDK Code Samples - Precondition
Conjugate Gradient,” 2012. http://developer.nvidia.com/cuda-
libraries-sdk-code-samples.
[11] V. Volkov and J. Demmel, “Benchmarking GPUs to tune dense
linear algebra,” in High Performance Computing, Networking,
Storage and Analysis, 2008. SC 2008. International Conference
for, pp. 1–11, IEEE, 2008.
[12] T. Chen, R. Raghavan, J. Dale, and E. Iwata, “Cell broadband
engine architecture and its ﬁrst implementationa performance
view,” IBM Journal of Research and Development, vol. 51, no. 5,
pp. 559–572, 2007.
[13] C. Cascaval, J. Castanos, L. Ceze, M. Denneau, M. Gupta,
D. Lieber, J. Moreira, K. Strauss, and H. Warren Jr, “Evaluation
of a multithreaded architecture for cellular computing,” in High-
Performance Computer Architecture, 2002. Proceedings. Eighth
International Symposium on, pp. 311–321, IEEE, 2002.
[14] M. Gordon, W. Thies, and S. Amarasinghe, “Exploiting coarse-
grained task, data, and pipeline parallelism in stream programs,”
in ACM SIGOPS Operating Systems Review, vol. 40, pp. 151–
162, ACM, 2006.
[15] M. P¨ uschel, F. Franchetti, and Y. Voronenko, Encyclopedia of
Parallel Computing, ch. Spiral. Springer, 2011.
6