Parallelization of Fast 1-Minimization for Face Recognition by Wagner, Andrew et al.
June 2011 UILU-ENG-11-2204 
DC-251
PARALLELIZATION OF FAST h -  
MINIMIZATION FOR FACE 
RECOGNITION
Andrew Wagner, Victor Shia, Allen Yang, Mark 
Murphy, and Yi Ma
Coordinated Science Laboratory
1308 West Main Street, Urbana, IL 61801
University o f  Illinois at Urbana-Champaign
-REPORT DOCUMENTATION PAGE- Form Approved OMB NO. 0704-0188
Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, 
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comment regarding this burden estimate or any other aspect of this 
collection of information, including suggestions for reduang this burden, to Washington Headquarters Services. Directorate for information Operations and Reports, 1215 Jefferson 
Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.
1. AGENCY USE ONLY (Leave blank) 2. REPORT DATE
June 2011
3. REPORT TYPE AND DATES COVERED
4. TITLE AND SUBTITLE
Parallelization of Fast li -Minimization for Face Recognition
6. AUTHOR(S)
Andrew Wagner, Victor Shia, Allen Yang, Mark Murphy, and Yi Ma
5. FUNDING NUMBERS
ONRN00014-09-1-0230
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 
Coordinated Science Laboratory 
University of Illinois at Urbana-Champaign 
1308 West Main Street 
Urbana, Illinois 61801-2307
8. PERFORMING RGANIZATION 
REPORT NUMBER
UILU-ENG-11-2204 
DC-251
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
ONR, Ballston Centre, Tower 1, 800 N. Quincy, Arlington, VA 22217-5660
10. SPONSORING/MONITORING 
AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES
The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as an official 
position, policy, or decision, unless so designated by other documentation
12a. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution unlimited.
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
Recently a family of promising face recognition algorithms based on sparse representation and ' 1-minimization ( '1-min) have been 
developed. These algorithms have not yet seen commercial application, largely due to higher computational cost compared to other 
traditional algorithms. This paper studies techniques for leveraging the massive parallelism available in GPU and CPU hardware to 
accelerate ' 1-min based on augmented Lagrangian method (ALM) solvers. For very large problems, the GPU is faster due to higher 
memory bandwidth, while for problems that fit in the larger CPU L3 cache, the CPU is faster. On both architectures, the proposed 
implementations significantly outperform naive library-based implementations, as well as previous systems. The source code of the 
algorithms will be made available for peer evaluation.
14. SUBJECT TERMS 
Face recognition
15. NUMBER OF PAGES 
8
16. PRICE CODE
17. SECURITY CLASSIFICATION 
OF REPORT
UNCLASSIFIED
18. SECURITY CLASSIFICATION 
OF THIS PAGE
UNCLASSIFIED
19. SECURITY CLASSIFICATION 
OF ABSTRACT
UNCLASSIFIED
20. LIMITATION OF ABSTRACT
UL
NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. 239-18 
298-102
Parallelization of Fast i \--Minimization for Face Recognition
Andrew Wagner Victor Shia
University of Illinois at Urbana-Champaign University of California Berkeley
awagner@illinois.edu
Allen Yang Mark Murphy
yang@eecs.berkeley.edu mjmurphy@eecs.berkeley.edu
Abstract
Recently a family o f  promising face recognition algo­
rithms based on sparse representation and i\-minimization 
(t\-min) have been developed. These algorithms have not 
yet seen commercial application, largely due to higher com­
putational cost compared to other traditional algorithms. 
This paper studies techniques fo r leveraging the massive 
parallelism available in GPU and CPU hardware to ac­
celerate i\-m in based on augmented Lagrangian method 
(ALM) solvers. For very large problems, the GPU is faster 
due to higher memory bandwidth, while fo r  problems thatfit 
in the larger CPU L3 cache, the CPU is faster. On both ar­
chitectures, the proposed implementations significantly out­
perform naive library-based implementations, as well as 
previous systems. The source code o f the algorithms will 
be made available fo r  peer evaluation.
1. Introduction
t \ -minimization (G-min) has received much attention in 
recent years due to important applications in compressive 
sensing [4] and sparse representation [21]. The problem 
refers to finding the minimum t i-norm solution to an under­
determined linear system b =  Ax:
m in ||x ||i  subj. to b = A x .  (1)
It is now well established that, under certain conditions 
[5, 8], the minimum G-norm solution is also the sparsest 
solution to the system (1).
Among its many applications, G-min has been re­
cently used to reformulate image-based face recognition 
as a sparse representation problem [22], If we stack the 
m  pixels of the rii training images of K  subject classes 
into the columns of matrices (A\ e  MTOXUl,-- - ,A k  € 
l mxnK), combine the matrices into a larger matrix A  =  
[Ai, • • • , A k \ € RmXn, and arrange the pixels of a new 
query image into a vector b € Mm, sparsity-based classi­
fication (SBC) solves the following minimization problem:
min ||x ||i +  ||e|!i subj to b — A x  +  e. (2)
x,e
vshia@eecs.berkeley.edu
YiMa
University of Illinois at Urbana-Champaign
yima@illinois.edu
If the sparsest solutions for x  and e are recovered, e pro­
vides a means to compensate for pixels that are corrupted 
due to occlusion of some part of the query image, and the 
dominant nonzero coefficients of x  reveal the membership 
of b based on the training image labels associated with A.
In this paper, we study parallelization of Gmin on many- 
core CPU and GPU architectures. Although G-min (1) is a 
convex program, conventional algorithms such as interior- 
point methods [6, 18] scale poorly for applications such as 
face recognition. Recently, a number of accelerated algo­
rithms have been proposed that explicitly take advantage of 
the special structure of G-min problems [13, 24]. We in­
vestigate parallelization of a state-of-the-art G-min solution 
based on augmented Lagrangian methods (ALM) [2, 24].
In addition, the accuracy of the solution given by (2) re­
lies on the assumption that the query image is well aligned 
with the training images. In [19], SBC was extended to 
align a query image to each subject class individually. The 
alignment algorithm solves the following problem:
fj =  arg min ||e ||i subj. to b o n  = AiX + e, (3)
x ,e ,T i
where r* G T is in a finite-dimensional group T  of trans­
formations (e.g., affine or homography) acting on the im­
age domain. The sequence r* converges to a transformation 
that aligns the test image b with the training images from 
the i-th class A t . In other words, the algorithm extends 
Lucas-Kanade iterative alignment [14] to use the G-norm 
as a robust error function, while simultaneously estimating 
the coefficients of the linear illumination model AjX.
Although ALM has been applied to improve the speed 
of a prototype face recognition system [20], due to the high 
per-class cost of the alignment step, the system still fails to 
achieve real-time performance against datasets of hundreds 
or thousands of subjects. In this paper, in addition to accel­
erating the generic G-min objectives (1) and (2), we also 
discuss how to efficiently accelerate the face alignment step 
(3) on multi-core CPUs and GPUs.
To clarify discussion, we clearly distinguish between 
parallelism, which is a property of an algorithm, and con­
currency, which is a property of a hardware architecture. 
Parallelism in an algorithm provides the opportunity to per­
form computations concurrently on hardware. Algorithms
1
often exhibit multiple levels of parallelism, and hardware 
often provides multiple levels of concurrency. In these 
terms, the primary contribution of this paper will be to 
determine the optimal mappings between the parallelism 
available in ¿'i-min and the concurrency available in multi­
core CPU and GPU architectures.
Finally, we present extensive benchmarks to compare the 
performance of the generic ALM ¿i-min algorithm and the 
corresponding face recognition routines on massively paral­
lel CPUs and GPUs. To this end, a face recognition system 
for security and access control applications has been im­
plemented, and is diagrammed in Figure 1. This pipeline 
follows the work of [20], but the fast implementations pre­
sented in this paper are novel. Since the parallelization of 
any algorithm is highly dependent on the target hardware, 
we will give a brief overview of the architecture of our 
test system in detail in Section 3. Finally, note that while 
this paper focuses on the parallelism available in a single 
(shared memory) machine, applications targeting thousands 
of subjects or more will likely require computation to be 
distributed over a cluster or network of computers. These 
systems are not addressed here. The source code of our im­
plementations, which target NVIDIA Fermi GPUs as well 
as Intel x86-64 CPUs, will be made available for peer eval­
uation.
1.1. Literature Review
Traditionally, ^i-min (a.k.a. basis pursuit (BP)) has been 
formulated as a linear program [6]. Several variations of the 
solution are also well known in optimization, including a 
noisy approximation via quadratic programming called the 
LASSO [18] and truncated Newton interior-point method 
(TNIPM) [12].
One of the drawbacks of most interior-point methods for 
¿i-min is that they require the solution sequence to fol­
low an interior path via gradient descent or conjugate gra­
dient methods, which are computationally expensive. To 
mitigate these issues, an approach called Homotopy has 
been recently studied to accelerate the speed of £i-min 
[17, 11, 15, 9]. Although Homotopy can be shown to ex­
actly estimate BP when the solution is sufficiently sparse 
[9], the algorithm still involves computationally expensive 
operations such as matrix-matrix multiplication and linear 
least-squares problems with varying A  matrices.
In Section 2, we contend that ALM is a better choice 
for implementation on many-core CPUs and GPUs. The 
ALM algorithm belongs to a category of approximate i\- 
min solutions called iterative shrinkage-thresholding (1ST) 
methods [23, 1]. 1ST algorithms mainly involve elementary 
operations such as vector algebra and matrix-vector multi­
plication. Therefore, when the dimension of the problem 
becomes high, IST-type algorithms are particularly suitable 
for hardware systems with a high degree of concurrency. In
Figure 1. The face recognition pipeline.
[24], the authors showed that ALM is able to significantly 
improve the solver speed, while achieving estimation ac­
curacy competitive with other ¿’i-min solutions. Therefore, 
we choose ALM as the core algorithm for implementation 
of t \-min in the parallel face recognition pipeline.
In terms of the past works in parallel ¿i-min, the liter­
ature has been limited, to the best of our knowledge. In 
[3], Borghi et al. developed a special proximal gradient i\- 
min algorithm based on Moreau-Yosida regularization. In 
[16], Murphy et al. presented parallel implementation of 
the 11-SPIRIT MRI reconstruction algorithm on the same 
GPU architecture addressed in this paper.
2. Augmented Lagrangian Method
In this section, we briefly describe the ALM algorithm 
for ¿'1-min (1) [24] and analyze its complexity. Lagrange 
multipliers have been frequently used in convex program­
ming to eliminate equality constraints via adding a penalty 
term to the cost function for infeasible points. ALM meth­
ods differ from other penalty-based approaches by simulta­
neously estimating the optimal solution and Lagrange mul­
tipliers in an iterative fashion. For ^-m in  (1), the aug-
merited Lagrange function is defined as:
L» {x ,y )  = Hxllx + y T ( b -  Ax )  + ^\ \b -  Ax\\%, (4)
where y  > 0 is a constant that penalizes infeasibility and y  
is a vector of Lagrange multipliers.
In Lagrange Multiplier Theory [2], if there exists a mul­
tiplier y* that satisfies the second-order sufficiency condi­
tions for optimality, then for a sufficiently large y,  the opti­
mal fi-min solution also minimizes
x* =  arg min LM(x, y*). (5)
In practice, the optimal values for the triplet (x* , y* ,y )  
are all unknown. Furthermore, it has been observed that 
solving (5) with a large initial value of y  tends to lead to 
slower convergence speed [23,24], In [2,25], an alternating 
procedure has been shown to iteratively update x  and y  :
i  x k+1 = arg min x L ^ k( x , y k)
1 Vk+1 =  y k + y k{ b -  A x k+l) ’
where y k —> oo increases monotonically. The iteration ter­
minates when the estimates ( x k, y k) converge.
Note that in the iterative procedure (6), the second step 
only involves vector algebra and matrix-vector multiplica­
tion. Therefore, the procedure is computationally efficient 
only if it is easier to minimize the augmented Lagrangian 
L^ k( x , y k) compared to solving the original problem (1) 
directly. In fact, this problem can be solved element-wise 
iteratively by a soft-thresholding algorithm [23, 1], whose 
time complexity is bounded by 0 ( n 2) and can be easily par­
allelized. Algorithm 1 summarizes the generic ALM £i-min 
algorithm. 1 The soft-thresholding algorithm involves the 
element-wise operator shrink(x) =  sign(x) max (M,0) .
Algorithm 1 Augmented Lagrangian Method (ALM)
INPUT: b £ Rm, A =  [AU - - - , A K] e Mmxn, r  
maxeig(A7 A), and constant p > 1.
1: while not converged (k = 1, 2 ,.. .)  do 
2: ii 4— 1, z i  <— x k, Mi x k
3: while not converged (/ =  1 ,2 ,...)  do
4: ui+i 4- shrink(z, -  ±AT{Azt - b  -  ¿ y fc),
5: 4—  ^(1 +  ^ /l +  4i )^
6: zi+i 4- u i+i + -  ui)
7: end while
8: 3J/.-+1 4— Wi-fi
9: y k+i < -y k + y k ( b - A x k+i)
10: y k+ i 4 - p - y k
11: end while 
OUTPUT: x * 4- x k.
1 For conciseness, we only present the ALM algorithm in the primal 
domain. There also exist implementations in the dual domain [25, 24].
3. Hardware Concurrency
In this section we discuss the levels of concurrency avail­
able in the hardware architectures considered in this paper, 
as well as other aspects of the hardware that are important 
for performance. In particular, since caches (regions of on- 
chip memory) are often orders of magnitude faster than off- 
chip memory, their sizes and speeds have a dramatic effect 
on performance. We give a brief overview of the caches 
that are available in our target architectures, and defer dis­
cussion of their performance effects to Sections 4.1 and 4.2.
Our discussion and experiments will address the most 
common hardware configuration for engineering worksta­
tions: a motherboard with two quad-core processors and a 
PCIe card with a single high-end GPU. Recognition involv­
ing more than a few hundred subjects with contemporary 
hardware would require a server (or cluster) configuration 
with an expandable number of processors. We will not ad­
dress efficient parallelization for these systems, which may 
have additional challenges associated with their non-shared 
memory model. Unless otherwise specified, all implemen­
tations utilize single precision floating point datatypes.
3.1. CPU Hardware Concurrency
The main defining characteristics of contemporary 
multi-core CPU architectures are that they have two lev­
els of concurrency, relatively large amounts of cache, and 
relatively high clock speeds. The baseline architecture for 
our experiments is a Linux workstation with two quad-core 
Intel Nehalem E5530 processors clocked at 2.4 GHz. Each 
processor has its own memory interface, and is directly in­
terfaced to half of the RAM installed in the machine. The 
amount of RAM exceeds the amount used by the algo­
rithms, and is not an important performance consideration.
Each core has a private 32 KiB LI data cache and a pri­
vate 256 KiB L2 cache. Each processor further has 8 MiB 
of L3 cache that is shared by the four cores. Overall, the 
algorithm has approximately 16 MiB of L3 cache available 
for a dual-processor configuration.
For floating-point instructions, each core also has a vec­
tor processing unit (SSE) capable of performing the same 
arithmetic operation on four single-precision floating point 
values simultaneously. There are thus two important levels 
of concurrency that need to be exploited to efficiently use 
a modem CPU: core-level concurrency and SSE-level con­
currency.
3.2. GPU Hardware Concurrency
The main defining characteristics of contemporary 
multi-core GPU architectures are that they have two (much 
wider) levels of concurrency, relatively small amounts of 
cache, and relatively low clock speeds. Whereas most 
of the transistors on a typical CPU are dedicated either 
to cache or to hardware that enables higher clock speeds
□
16KÌB: 
Resampled 
Face Image
(20 ii
(a) The larger algorithm data structures
(b) The caches on a E5530 CPU
64 KiB
LI +
64 KiB 
LI +
64 KiB 
LI +
64 KiB
LI +
64 KiB
LI +
64 KiB
LI +
64 KiB
LI +
64 KiB 
LI + 
shared
64 KiB 
LI +
64 KiB
LI +
64 KiB
LI +
64 KiB 
LI +•
64 KiB 64 KiB
LI +
64 KiB
LI +
768 KiB L2
(c) The caches on a GTX480 GPU
Figure 2. A visual comparison of the algorithm working set to the 
CPU and GPU caches. Note: Although the aspect ratio could not 
be preserved, all arrays and caches are drawn so that the area is 
proportional to the size of the data.
(such as branch prediction, out-of-order instruction execu­
tion, etc...), most of the transistors on a GPU are dedicated 
to arithmetic logic units. For our GPU implementations, 
we target NVIDIA Fermi GPU architecture (e.g., the GTX 
480 GPU used in this paper). An explanation of the GPU 
programming model (CUDA) at a useful level of complete­
ness would take more space than is available here, so we 
will instead frame our discussion in terms of hardware ca­
pabilities. CUDA programmable GPUs are comprised of 
several streaming multiprocessors (SM), each of which is 
roughly analogous to a CPU core. For Fermi architecture 
GPUs there are up to 16 SMs, and each SM is capable of 
executing up to 64 single precision floating point operations 
concurrently. 2
Each SM contains its own LI-cache, which is divided be­
tween hardware managed and software managed (“shared”) 
memory. Additionally, all SMs share a common L2-cache. 
For our system, each SM has 64 KiB of LI cache, and all 
SMs share 768KB of L2 cache. The relatively small amount 
of cache (1/23 as much as CPU L3) on the GPU is balanced 
by a significantly higher bandwidth between the processor 
chip and off-chip memory (DRAM) compared to the CPU. 
A scale drawing of the caches available on the GPU can be
384 KiB
A T
3.1 MiB Recognition A matrix 
(20 face images x 10 users)
ages plus 4 column Jacobian)
seen in Figure 2(c). The GPU has its own memory system, 
and any data the GPU uses must first be transferred from 
CPU DRAM to GPU DRAM over PCIe. For our applica­
tion, this transfer overhead can be amortized over a large 
amount of computation and is not a major concern.
While the programming model for the SIMD units of 
each SM is somewhat more flexible than on the CPU, lever­
aging the flexibility typically comes at the cost of reduced 
concurrency.3 Therefore, for the purposes of comparing 
hardware architectures, the SIMD units on the GPU are 
roughly analogous to the CPU SIMD units. Thus, in sum­
mary, the GPU hardware provides two levels of concur­
rency: SM-level concurrency and thread-level concurrency. 
Note that the GPU provides more concurrency than a CPU 
at both levels (14 SMs vs. 8 cores) and (64 wide vs. 4 wide 
SIMD units).
4. Parallelism in the Face Alignment Stage
We will next discuss the parallelism available in the face 
recognition pipeline, and propose techniques for exploiting 
this parallelism on the GPU and CPU hardware described 
in the previous section. In this section we first focus on the 
face alignment stage (see Figure 1).
Face alignment (3) estimates an image transformation r  
that rectifies the query image b with possible pose variation 
w.r.t. each training class Ai, which leads to the minimal 
sparsity in error e after the alignment. Note that directly 
solving (3) is inefficient since it is a non-convex problem 
and may exhibit local minima. However, given a good ini­
tial estimate of the transformation r  (e.g., provided by an 
accurate face detector), the optimal solution for r  can be 
sought iteratively by linearization at each jth  step:
min || e || i subj to bo  Tj + J jA r j  =  AiX +  e, (7)
x , e , A  Tj
where Jj  =  V Tj (b o tj) is the Jacobian, and A t is a step 
update to r . Denote bj = b o Tj, Bj  =  [.A i , —J j ] and 
w T = [xT, A rT], then the update A t can be computed by 
solving the following linear program:
min ||c ||i subj to bj =  B j w  -T e. (8)
The per-class alignment algorithm via ALM is summarized 
in Algorithm 2.
In summary, the alignment stage essentially contains two 
levels of available parallelism. At the higher level, there are 
per-class alignment problems that are solved independently, 
or problem-level parallelism. At a lower level, the first- 
order linear algebraic operations exhibit parallelism within
2In CUDA terms, the warp width is 32, and the floating pipeline can 
issue up to two warps simultaneously
3 In CUDA terms, code branches that cause warp divergence usually 
result in serialization o f  the different groups o f threads
Algorithm 2 (Face Alignment via ALM)
Input: 6, A i,x o =  0, r 0, and Jo.
1: while not converged (j =  1 ,2 ,...)  do 
2: Update bj  4— |^ °^ ~ ^ |; Bj — [Ai,—J j - 1] and corre­
sponding (J?j)7 
3: Initialize wo =  0, y 0 =  0
4: while not converged (k =  1 ,2 ,. . .) do
5: tto 4- w k- u  z 0 <- ek~i
6: while not converged (/ =  1, 2 ,.. .)  do
7: zi «-shrink (*>j -  BjUi-! +
8: u i  + - B ]  (b j  - z i  +
9: end while
10: Wk 4— up  efc 4— zi
11: Vk +~ V k - i  +  H k - i ( b j  — BjWk — efc)
12: Hk 4- pjLtfc-i
13: end while
14: Update ej, Tj, and Jj
15: end while
Output: Tj* 4— Tj, e? 4- ej
image operations, i.e., at the pixel level. We call this pixel- 
level parallelism. The following two sections discuss meth­
ods for exploiting these two levels of parallelism on CPU 
and GPU architectures.
4.1. CPU Implementation
Optimal implementation of Algorithm 2 on a multi-core 
CPU must take into account the properties of the cache hi­
erarchy. In general, cache that is closer to the core (i.e. LI 
cache) has higher bandwidth, but smaller size compared to 
cache that is farther from the core (L3 cache). For refer­
ence, Figure 2(b) shows the the sizes of L2 and L3 caches 
of the Intel E5530 CPU, which is a representative example 
of a modem multi-core CPU.
For the E5530, the L2 cache in each core is only able to 
store about 16 images. For image alignment problem, the 
number of training images per class is typically larger than 
16. In this paper, we compare two mappings of the paral­
lelism available in the alignment stage to the concurrency 
available on the CPU: a naive implementation that is purely 
based on stock libraries and compilers, and a manually op­
timized version we advocate.
The naive implementation disregards problem-level par­
allelism, and instead maps pixel-level parallelism onto both 
the core-level and the SSE-level concurrency provided by 
the CPU. In other words, alignment of the query image is 
performed against a single subject at a time using all avail­
able cores on the processors. The potential advantage of 
this implementation is that all of the variables for the inner 
loop, most notably Bj and B j ,  fit in L2 cache.
In contrast, our manually optimized implementation 
maps problem-level parallelism onto core-level concur­
rency, and maps pixel-level parallelism onto SSE-level con­
currency. In other words, eight instances of (8) are executed 
in parallel with each problem running on one of the eight 
cores. The advantage of this implementation is that since 
the cores are operating on different alignment problem in­
stances, no data is shared between cores, and therefore no 
synchronization is necessary between the cores. Further­
more, even with eight problem instances running concur­
rently, the local variables for the inner loop still fit in L3 
cache. Because of the sequential data access patterns of the 
solver, the CPU hardware is able to use the full L3 cache 
bandwidth, which is high enough to make the program CPU 
limited, rather than memory limited. For these reasons, this 
implementation outperforms the previous naive solution.
In both implementations, most of the operations take ad­
vantage of Intel Math Kernel Library (MKL), a commercial 
implementation of the standard Basic Linear Algebra Sub­
programs (BLAS). For the image resampling step, we make 
use of the Intel Integrated Performance Primitives (IPP) li­
brary. Both MKL and IPP are optimized for Intel multi-core 
CPUs, and are able to automatically utilize both core-level 
and vector-level concurrency (for the first implementation), 
but are also available in single-threaded versions (for the 
second implementation). For operations that are not opti­
mized by Intel in-house libraries, such as the soft threshold­
ing operator in step 7 of Algorithm 2, we achieve vector- 
level concurrency via the automatic vectorization facilities 
of the Intel C++ compiler (ICC) [10]. To achieve core-level 
concurrency we make use of the Open MP API. [7]
4.2. GPU Implementation
While on the CPU, algorithm performance is highly de­
pendent on effective cache usage, cache is relatively unim­
portant on the GPU for ALM based (’i-min. As can be seen 
in Figure 2(c), the GPU has a very small amount of cache 
compared to the CPU. While it might be possible to fit a 
single instance of the alignment problem (with a slightly 
reduced problem size) into L2 cache, this would not be an 
efficient use of the GPU’s resources. The strength of the 
GPU’s memory architecture for our purposes is its ability to 
sustain a very high bandwidth to DRAM. This bandwidth 
is achieved by having a very large number of threads is­
suing interleaved memory requests. Solving many align­
ment problem instances concurrently increases the number 
of threads that can be used.
Therefore, our recommended GPU implementation is 
strikingly similar in spirit to our recommended CPU im­
plementation: on the CPU we use the cores to solve multi­
ple instances concurrently, and on the GPU we use multiple 
SMs for the same purpose.4 The number of subject classes 
that are actually scheduled to run concurrently is chosen by
4In CUDA terms, our proposed alignment stage implementation con­
sists o f  a single kernel that performs alignment for all subject classes. Each 
instance o f  the alignment problem is assigned its own thread block, and the
the GPU hardware, but can be indirectly influenced by man­
ual tuning of the code (i.e., the kernel launch configuration 
in CUDA terms). We have empirically determined that per­
formance is highest with 5-7 subject classes aligned simul­
taneously on each SM.
Several other aspects of the implementation merit discus­
sion. First of all, we take advantage of the GPU’s special 
hardware dedicated to bilinear interpolation for the com­
putation of fr(r) and J ( r ) ,  which essentially consist of re­
sampled versions of the test image and its derivatives. Sec­
ond, since there are no standard CUDA libraries that work 
at the SM level, we use a custom routine for computing 
B] = ( B j B j )~1B T =  with G inverted using
Gaussian elimination with partial pivoting. To achieve pre­
cision comparable to the single precision LAPACK routines 
in Intel’s MKL with this simplistic algorithm, we use double 
precision. Since G is only n  x n  and Bj  is only computed 
once per optimization problem, the cost of the inversion is 
dwarfed by the cost of other steps. Similarly, sums and dot- 
products of large vectors are also performed in double pre­
cision.
5. Parallelization of the Face Recognition Stage
After the face alignment stage is complete, the 20 subject 
classes with lowest alignment error are selected for recogni­
tion, b and A t are re-sampled to a common alignment using 
Tj, and Ai  are concatenated into a new matrix A. A sparse 
representation of b w.r.t A  is then recovered by solving a 
single ¿i-min problem, as shown in (2), using Algorithm 3. 
The coefficients in x  are then used to compute error resid­
uals that are used for classification. In this stage of the al­
gorithm, there is no problem-level parallelism to exploit, 
so this section will discuss how to exploit pixel-level paral­
lelism on both CPU and GPU hardware. Then in Section
6, we will benchmark the performance of the two archi­
tectures and further demonstrate the speed gains achieved 
by our proposed implementations over previously published 
implementations.
5.1. Recognition Stage Implementation
Compared to problem-level parallelism, the exclusively 
pixel-level parallelism in Algorithm 3 is relatively straight­
forward to exploit: On the CPU, most of the operations 
map well onto MKL BLAS calls, and operations that do 
not can be easily parallelized using OpenMP and auto- 
vectorization.
On the GPU, most of the operations map well onto 
similar calls in NVIDIA’s GPU BLAS library (CUBLAS) 
which, like MKL, can take advantage of both levels of con­
currency available in the hardware architecture. Operations 
that do not map well onto the CUBLAS API were imple-
GPU hardware schedules as many thread blocks to run concurrently as the 
hardware will allow.
Algorithm 3 (Face Recognition via ALM)
1: Input: b E Mm, A E Rmxn, X\ =  0, ei =  b, y 1 =  0.
2: while not converged (k = 1,2, . . . )  do 
3: ek+i = shrink(6 -  A x k +  ¿ y fc, ¿);
4: t\ 4— 1, z \  4— x k, w \ 4— x k\
5: while not converged (l = 1,2, . . . )  do
6: wi+i 4- shrink(zi +  ^ A T(b -  Azi — ek+i +
¿2/fc), ^ ) ;
7: ti-yi 4— ^(1 4- y /l  +  4tf);
8: z i+x 4- wi+i +  -  wi);
9: end while
10: x fc+i  wi, y k+1 4- y k +  / / ( & -  A x k+i -  efc+i) ;
11: end while
12: Output: x* 4- x k,e* 4— ek.
mented directly in CUDA code. Additionally, some opera­
tions that could have been implemented via multiple BLAS 
calls, performance improvements were achieved by com­
bining multiple vector-vector operations into a single ker­
nel, due to reduced bandwidth and kernel call overhead.
In order to avoid expensive data transfer across the bus 
connecting the GPU card and the CPU motherboard (the 
PCI express bus), all of the data is transferred to GPU 
DRAM once, and all non-trivial tasks are performed on the 
GPU on data stored in GPU DRAM.
6. Experiments
In this section, we benchmark the performance of our 
parallel implementations of on CPU and GPU plat­
forms. In order to show how our ^-m in algorithms scale 
with problem size, in Section 6.1 we will begin with bench­
marks for the general ¿’i-min problem on synthetic data. 
We will then progress in Section 6.2 to demonstrating the 
speed and accuracy of our implementations as applied to 
the alignment and recognition stages of the face recognition 
problem.
6.1. Simulations on Random Data
The first experiment compares the performance of our 
proposed CPU and GPU implementations of the general t \ -  
min solver (Algorithm (1)). The m  x n  measurement ma­
trix A is a random Gaussian matrix, with each entry gener­
ated from the standard normal distribution and normalized 
to unit column norm. The ratio of m / n  is fixed at 1/2 with 
n  varying from 1000 to 8000. The ground truth signal, x 0 
has a sparsity rate of 10% of m  with elements sampled from 
a normal distribution and nonnalized to unit column norm. 
Because the ground truth signal Xo is known, the algorithm 
terminates when ||x — Xo|| < r  with r  =  10~3. The mea­
surement vector is generated by b — A x 0.
The results of this benchmark can be found in Figure 3. 
The x-axis represents the size of the A matrix and the y-
40
m (A is size 0.5m x m)
Figure 3. Comparison of G-min runtime vs. dimensions of A  on 
random data.
Table 1. Benchmark of implementations of alignment G-min.
Sequential solver using CUBLAS 302 ms
One problem per SM using CUDA streams 70 ms
Four problems per SM using single kernel 36 ms
'in 30 ■
LU 10 -
---------GPU
CPU, library threading 
--------- CPU, manual threading
0 U
0 100 150
Number of training users
200 250
Figure 4. Alignment stage runtime vs. size of training database.
axis represents the average amount of time to complete a 
single problem instance. The GPU implementation tends to 
be faster than the CPU implementation at solving a single 
large problem, whereas the CPU implementation is faster 
at solving a single small problem. The transition between 
the two regimes occurs where the size of A  reaches 2000 x 
1000 x 4 =  8 MB, i.e. the size of the CPU L3 cache.
6.2. Face Recognition Pipeline Benchmark
This section presents benchmarks of the CPU and GPU 
implementations of the alignment stage (i.e., Algorithm 2) 
as well as the recognition stage (i.e., Algorithm 3) of the 
face recognition pipeline.
First, in order to measure the impact of solving many 
G problems-per concurrently on the GPU, we benchmark 
three implementations of the alignment G-min solver with 
A  of size 5120 x 32, and a fixed 50 inner loop iterations for 
each of 50 outer loop iterations. The runtime on a GTX480 
GPU is averaged over a large number of trials, which are 
run sequentially or concurrently depending on the imple­
mentation. The results are shown in Table 1. Our proposed 
parallelization of the G-min used in the alignment stage is 
eight times faster than an implementation solving a single 
problem at a time using the stock BLAS libraries. 5
Using an implementation motivated by the previous re­
sult, we now benchmark our iterative alignment implemen­
tation on real face data. For experiments on face data we 
use subsets of the CMU Multi-PIE Face Database. For 
gallery images we use frontal images from session 1, which 
contains 20 images per subject taken under different illu­
minations. For test images we use images from session 2. 
The training images are prepared as follows: The iterative 
alignment stage seeks a similarity transformation between 
the coordinate frame of the full-resolution test image and 
a “canonical” coordinate frame in which images are com­
pared. The training images are aligned by applying a sim-
5 The streams implementation is limited significantly by a cap on the 
number o f  concurrent streams in the current version o f  CUDA
ilarity transormation that maps two manually clicked outer 
eye comers to fixed locations in the canonical frame. A 
64 x 64 pixel window in the canonical frame is used for 
resampling.
Figure 4 shows the average run time of the CPU and 
GPU implementations to align a query image against all 
the subject classes separately. We vary the total num­
ber of subject classes to show how the algorithms scale. 
The plateaus seen in the GPU curve result from the GPU 
hardware scheduling the computation of alignment prob­
lems in concurrent batches, but the overall trend is linear 
as expected. The manually threaded CPU implementation 
slightly outperforms the GPU implementation, and it sur­
passes the naive library threaded implementation by a wide 
margin. The new implementation can align the query image 
in ss 40 ms per subject, while the fastest previously pub­
lished result [20] required «  600 ms seconds per subject in 
a similar setting.
The next experiment compares the speed of the GPU and 
CPU implementations of the recognition stage. It was de­
termined that keeping 20 gallery subjects is sufficient to en­
sure that the correct subject is kept for the recognition stage 
with 95% probability. Since recognition failures are typi­
cally caused by a poor alignment, we find that keeping more 
subjects for the recognition stage does not necessarily im­
prove recognition rate.
Figure 5 shows the recognition stage runtime for canon­
ical images of size: 32 x 32,48 x 48, 64 x 64, 96 x 96, and 
128x 128. For all image resolutions, the problem size is suf­
ficiently small that the CPU is significantly faster than the 
GPU. Note also that for both implementations, the recog­
nition stage takes much less time than the alignment stage.
Finally, we perform an experiment verifying the recog­
nition accuracy of the overall pipeline. As shown in Fig­
ure 6, at the optimal resolution, which happens to match 
the alignment stage resolution, the GPU implementation 
reaches 95% recognition rate, the max achievable given the
Window width (pixels)
Figure 5. Recognition stage runtime vs. face window resolution
qS
100
£
CD
80
m
c
o
60
c
o>
40
8
CDa:
20
20 40 60 80 100 120 140
Window width (pixels)
Figure 6. Recognition rate of the full pipeline vs. face window 
resolution.
alignment selection process. For significantly lower resolu­
tions, the accuracy drops off significantly. The slight differ­
ence in CPU vs. GPU accuracy may be a result of numerical 
precision differences in our matrix inversion and vector re­
duction routines.
7. Conclusion
We have demonstrated that on both CPU and GPU al­
gorithms, parallelizations of ALM that solve multiple face 
alignment problems concurrently are significantly faster 
than implementations that rely purely on vendor-supplied 
BLAS libraries. Furthermore, thanks to a combination of 
faster hardware and more efficient use of the hardware, 
we have demonstrated dramatic improvements in recogni­
tion speed over previously reported implementations. As 
CPU manufacturers increase the number of cores and vec­
tor widths, and as GPU manufacturers increase the amount 
of cache, both architectures are rapidly converging towards 
an architectural balance that increasingly favorable for i \ - 
min based face recognition.
References
[1] A. Beck and M. Teboulle. A fast iterative shrinkage­
thresholding algorithm for linear inverse problems. SIAM 
Journal on Imaging Sciences, 2( 1): 183-202, 2009. 2. 3
[2] D. Bertsekas. Nonlinear Programmin. Ath. Sci., 2003. 1, 3
[3] A. Borghi, J. Darbon, S. Peyronnet, T. Chan, and S. Osher. A 
compressive sensing algorithm for many-core architectures. 
Adv. in Vis. Comp., 6454/2010:678-686, 2010. 2
[4] A. Bruckstein, D. Donoho, and M. Elad. From sparse solu­
tions of systems of equations to sparse modeling of signals 
and images. SIAM R., 51(1 ):34-81,2009. 1
[5] E. Candes and T. Tao. Decoding by linear programming. 
IEEE Transactions on Information Theory, 51(12), 2005. 1
[6] S. Chen, D. Donoho, and M. Saunders. Atomic decomposi­
tion by basis pursuit. SIAMR., 43:129-159, 2001. 1,2
[7] L. Dagum and R. Menon. OpenMP: an industry standard 
API for shared-memory programming. Comp. Sci. & Eng., 
5:46-55, 2002. 5
[8] D. Donoho. For most large underdetermined systems of lin­
ear equations the minimal U-norm near solution approxi­
mates the sparest solution. CPAM, 59:907-934, 2006. 1
[9] D. Donoho and Y. Tsaig. Fast solution of fUnorm minimiza­
tion problems when the solution may be sparse, preprint, 
2006. 2
[10] C. Dulong, R. Krishnaiyer, and D. Kulkami. An overview of 
the Intel IA-64 compiler, 1999. 5
[11] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least 
angle regression. Ann. Stat., 32:407-499, 2004. 2
[12] S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. 
An interior-point method for large-scale O--regularized least 
squares. JSTSP, 1:606-617, 2007. 2
[13] I. Loris. On the performance of algorithms for the minimiza­
tion of i i  -penalized functionals. Inv. Prob., 25:1-16, 2009. 
1
[14] B. Lucas and T. Kanade. An iterative image registration tech­
nique with an application to stereo vision. In IJCAI, pages 
674-679, 1981. 1
[15] D. Malioutov, M. Cetin, and A. Willsky. Homotopy contin­
uation for sparse signal representation. In ICASSP, 2005. 2
[16] M. Murphy, K. Keurzer, S. Vasanawala, and M. Lustig. Clin­
ically feasible reconstruction for Ll-SPIRiT parallel imaging 
and compressed sensing mri. In Joint Annual Meeting o f Int. 
Soc. for Mag. Res. in Med. ,2010. 2
[17] M. Osborne, B. Presnell, and B. Turlach. A new approach to 
variable selection in least squares problems. IMA J. o f Num. 
Ana., 20:389-404, 2000. 2
[18] R. Tibshirani. Regression shrinkage and selection via the 
LASSO. JRSSB, 58:267-288, 1996. 1, 2
[19] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, and Y. Ma. To­
ward a practical face recognition: Robust pose and illumina­
tion via sparse representation. In CVPR, 2009. 1
[20] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, and 
Y. Ma. Toward a practical face recognition system: Robust 
pose and illumination via sparse representation, (submitted) 
PA Ml, 2011. 1,2,7
[21] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. 
Sparse representation for computer vision and pattern recog­
nition. PIEEE, 98:1031-1044, 2010. 1
[22] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust 
face recognition via sparse representation. PAMI, 31:210 -  
227,2009. 1
[23] S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruc­
tion by separable approximation. In ICASSP, 2008. 2, 3
[24] A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Fast 11- 
minimization algorithms and an application in robust face 
recognition: a review. In ICIP, 2010. 1, 2, 3
[25] J. Yang and Y. Zhang. Alternating direction algorithms 
for Ul-problems in compressive sensing. (preprint) 
arXiv:0912.1185, 2009. 3
