Acceleration of real-time face recognition pipeline on heterogeneous hardware platforms by Zhuge, Chuanhao
© 2017 Chuanhao Zhuge
ACCELERATION OF REAL-TIME FACE RECOGNITION PIPELINE
ON HETEROGENEOUS HARDWARE PLATFORMS
BY
CHUANHAO ZHUGE
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Adviser:
Professor Deming Chen
ABSTRACT
In recent years, advances in machine learning, and specifically in deep
learning, have begun to have a great impact on the world. With the advent of
deep neural networks, we are able to achieve unprecedented results in
computer vision tasks that were previously unsolvable. Face recognition, one
of the critical computer vision tasks, has also seen breakthroughs in
accuracy. This thesis presents an accelerated and optimized end-to-end face
recognition pipeline. Such a pipeline consists of three stages: face
detection, alignment, and face recognition/verification. The algorithms for
these stages are extremely computation intensive, and thus real-time
operation was previously not attainable. To reach the goal of
high-definition, real-time, multi-face recognition, we leverage different
types of hardware to accelerate the detection and recognition stages, which
are the most time-consuming stages of the pipeline. We use an embedded
Graphics Processing Unit (GPU) platform as the front end to perform video
capture and face detection. For the back end, we employ a powerful server
equipped with Field-Programmable Gate Arrays (FPGAs), which runs a
state-of-the-art deep neural network to recognize faces streamed from the
front end with low latency. With the two acceleration schemes targeting GPUs
and FPGAs, respectively, we achieve real-time performance for the overall
task, and such a face recognition system can be widely adopted for various
applications.
To my parents, for their love and support.
ACKNOWLEDGMENTS
First, I would like to express my deepest gratitude to my advisor, Professor
Deming Chen, for his guidance, encouragement, and support. Without his
professional experience and insightful suggestions, this work would not have
been done. Professor Chen’s exemplary personality and passionate attitude
will continue to motivate me throughout my life and professional career.
I would also like to thank all members of the ESCAD research group.
Discussions within this community are always inspiring, thoughtful, and
engaging. I feel fortunate to be surrounded by these talented people.
Last, I would like to thank my parents and my girlfriend, Anqi Song, for
their continuous and unconditional support.
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
CHAPTER 2 FACE DETECTION ACCELERATION ON EMBEDDED GPU PLATFORM
2.1 Face Detection Algorithms
2.2 Embedded GPU Platform
2.3 Face Detection Optimization Schemes on Embedded GPU Platform
CHAPTER 3 FACE RECOGNITION ACCELERATION ON HIGH-PERFORMANCE FPGA
3.1 Traditional Methods and Deep Neural Networks for Face Recognition
3.2 High-Level Synthesis for FPGA
3.3 Design Challenges and Difficulties
3.4 Design Methodology
3.5 Design Space Exploration for Fast Convolution Algorithms
3.6 Overall System
CHAPTER 4 EVALUATION
CHAPTER 5 CONCLUSION
REFERENCES
LIST OF ABBREVIATIONS
ASIC Application-Specific Integrated Circuit
BRAM Block RAM
CNN Convolutional Neural Network
DNN Deep Neural Network
FFT Fast Fourier Transform
FLOP Floating-Point Operation
FPGA Field-Programmable Gate Array
GEMM General Matrix Multiplication
GPU Graphics Processing Unit
HLS High-Level Synthesis
LUT Look-Up Table
CHAPTER 1
INTRODUCTION
Advancements in machine learning research, especially in Deep Neural
Networks (DNNs), have created profound and revolutionary changes in the
field of artificial intelligence. Such techniques have achieved great success
in different application areas, delivering human-surpassing results in
computer vision [1, 2] and natural language processing [3], and even
demonstrating remarkable ability in content generation and art creation [4].
Seeing breakthroughs at this level, industry is moving fast to adopt DNNs,
deploying them at scale on various types of platforms, including data
centers, supercomputer clusters, and embedded or mobile systems, in an
attempt to accomplish previously unattainable tasks such as autonomous
driving, machine translation, and face recognition. Some of these tasks are
latency critical, yet DNN models are exceedingly computation and memory
intensive. In order to achieve cutting-edge accuracy, billions of parameters
are required, and data needs to propagate through the DNN’s deep topological
structure, incurring billions of neural computations. Due to this
complexity, it is challenging to achieve low latency and high throughput
while maintaining energy efficiency.
To meet the pressing need for low-latency inference, industry and
researchers have a growing interest in developing and adapting hardware
accelerators for DNNs to replace general-purpose processors. These hardware
accelerators include GPUs, FPGAs [5], and ASICs [6]. In this thesis, our goal
is to accelerate one of the crucial computer vision tasks, face recognition,
which has numerous practical applications including access control, crime
sensing and alert, security surveillance, and general identity verification.
A common face recognition pipeline consists of the following stages:
detection, alignment, and recognition (representation and classification),
as illustrated in Figure 1.1. We profile such a pipeline and find that
detection and recognition are the most computation-intensive stages,
together taking about 95% of the runtime, as shown in Table 1.1. Therefore,
these stages should be the primary acceleration candidates. In our vision,
the face detection stage is suited to run on embedded platforms, so that it
is more convenient to deploy at a large scale (such as in surveillance
systems). Meanwhile, the deep-learning-based face recognition phase, which
demands more computing power, is better suited to high-performance machines
with hardware accelerators, forming a cloud computing scheme. With this
scheme, we are able to leverage the advantages of distinct hardware
platforms to optimize and accelerate the pipeline, aiming to attain
real-time performance.
Table 1.1: Profiling of face recognition pipeline
Stage                                      Runtime percentage
Face Detection (OpenCV)                    41%
Face Alignment (using dlib library)        2%
Face Representation (FaceNet CNN model)    55%
SVM Classification                         2%
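
The profile in Table 1.1 also bounds what acceleration can achieve, by Amdahl's law. The following is a minimal sketch of that arithmetic; the per-stage speedup of 10× is a hypothetical value chosen only for illustration and is not a result of this work.

```python
# Runtime shares from Table 1.1, as fractions of total pipeline runtime.
PROFILE = {
    "detection": 0.41,
    "alignment": 0.02,
    "representation": 0.55,
    "classification": 0.02,
}


def overall_speedup(profile, stage_speedups):
    """Amdahl's law: stages missing from stage_speedups keep their runtime."""
    new_time = sum(share / stage_speedups.get(stage, 1.0)
                   for stage, share in profile.items())
    return sum(profile.values()) / new_time


# Hypothetical speedups for the two accelerated stages (illustration only).
print(overall_speedup(PROFILE, {"detection": 10.0, "representation": 10.0}))

# Upper bound on the overall speedup if the two accelerated stages took
# no time at all: the remaining stages then dominate.
remaining = PROFILE["alignment"] + PROFILE["classification"]
print(1.0 / remaining)
```

The bound shows why alignment and classification can be left unaccelerated at first: they only start to matter once detection and representation have been sped up by a large factor.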
Figure 1.1: Face recognition pipeline (from OpenFace)
CHAPTER 2
FACE DETECTION ACCELERATION ON
EMBEDDED GPU PLATFORM
2.1 Face Detection Algorithms
An efficient algorithm for object detection using Haar feature-based cascade
classifiers was presented by Paul Viola and Michael Jones [7]. It is a
machine learning approach in which a cascade function is trained from many
positive and negative images and is then used to detect objects in other
images. This method has become one of the most successful algorithms for
face detection.
Haar-based feature filters (kernels), shown in Figure 2.1, are used to
extract features from the input images. Each feature is a single value
obtained by subtracting the sum of pixels under the white rectangle from the
sum of pixels under the black rectangle. Every feature is applied to all the
training images, a series of weak classifiers is trained, and the final
classifier is a weighted sum of these weak classifiers. Viola and Jones
report that a classifier built from 200 features achieves a detection
accuracy of 95%.
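
The feature value described above can be sketched as follows. The sketch uses an integral image (summed-area table), the technique Viola and Jones use to evaluate rectangle sums in constant time. The function names and the particular two-rectangle layout (white left half, black right half) are assumptions made for illustration.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Edge feature: white left half, black right half; returns black - white."""
    half = w // 2
    white = rect_sum(ii, x, y, half, h)
    black = rect_sum(ii, x + half, y, half, h)
    return black - white


# Toy 4x4 image: dark left half (1s), bright right half (5s).
image = [[1, 1, 5, 5] for _ in range(4)]
ii = integral_image(image)
print(two_rect_feature(ii, 0, 0, 4, 4))
```

Each rectangle sum needs only four table lookups regardless of the rectangle size, which is what makes evaluating many features per window affordable.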
Figure 2.1: Haar features
For training, we use AdaBoost to select the best features from all possible
features extracted from each kernel and to find the best threshold for
classifying a window as a face (positive) or a non-face (negative). At
inference time, the trained Haar classifier scans the image with a window of
a given size to search for a set of (rectangular) features and saves the
regions with a high probability of containing the object. Most of the image
is non-face region, which suggests that we should employ a simple and
efficient method to check whether a window is a face region. If it is not, we
discard it and do not process it again. In this way, we save time and focus
on the possible face regions. To achieve this, the concept of a cascade of
classifiers is introduced. Instead of applying all the features to a window,
we group the features into different stages of classifiers and apply these
stages one by one. If a window fails an early stage, it is discarded and the
remaining features are not considered. If it passes, we apply the later
stages. A window that passes all the stages is declared a face region. The
scanning process is repeated with an increasing window size until the window
becomes larger than the image. Subsequently, the saved regions with
high-confidence detections are filtered to determine whether there is a face.
This process is illustrated in Figure 2.2.
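
The early-rejection scan described above can be sketched as follows. This is a simplified illustration, not the OpenCV implementation: the stage representation, the function names, and the defaults for minimum window size, scale factor, and step are all assumptions, and the toy stages test window coordinates rather than real Haar features.

```python
def passes_cascade(window, stages):
    """Return True only if the window passes every stage, rejecting early."""
    for features, threshold in stages:
        score = sum(feature(window) for feature in features)
        if score < threshold:
            return False  # later stages are never evaluated for this window
    return True


def scan(image_w, image_h, stages, min_size=24, scale=1.25, step=4):
    """Slide windows of growing size; return (x, y, size) of accepted windows."""
    hits = []
    size = min_size
    while size <= min(image_w, image_h):
        for y in range(0, image_h - size + 1, step):
            for x in range(0, image_w - size + 1, step):
                if passes_cascade((x, y, size), stages):
                    hits.append((x, y, size))
        size = max(size + 1, int(size * scale))  # grow the window
    return hits


# Toy stages for illustration only: real stages evaluate Haar features on pixels.
toy_stages = [
    ([lambda win: 1.0 if win[0] >= 8 else 0.0], 0.5),
    ([lambda win: 1.0 if win[1] == 0 else 0.0], 0.5),
]
print(scan(32, 24, toy_stages))
```

Because most windows fail the first stage, the cost per window is dominated by the few features of the early stages rather than by the full feature set.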
2.2 Embedded GPU Platform
Recently, embedded GPU platforms have evolved to close the performance gap
with traditional desktop GPUs while maintaining low power consumption. Thus,
they are more widely used to accelerate mobile-oriented applications.
Despite similar interfaces and programming environments between the two GPU
classes, developing and porting existing implementations onto embedded GPUs
remains challenging for several reasons. First, embedded GPUs typically have
fewer computation cores, reduced shared memory resources, and lower clock
frequencies than desktop GPUs. Second, domain-specific libraries such as
OpenCV are not optimized for embedded GPUs, and therefore they may not
consider qualitative performance aspects in algorithms such as face
detection, where detection accuracy and latency are equally important.
In our work, we implement the face detection algorithm on two platforms:
the NVIDIA Jetson TK1, which contains a quad-core ARM Cortex-A15 CPU and a
192-core Kepler GPU, and the NVIDIA Jetson TX1, which contains a quad-core
ARM Cortex-A57 CPU and a 256-core Maxwell GPU.
Figure 2.2: Inference procedure using Haar-based cascade classifier
2.3 Face Detection Optimization Schemes on
Embedded GPU Platform
We boost the performance of face detection by the following algorithmic and
platform-specific optimizations.
2.3.1 Improving Filtering by Tracking Faces
OpenCV provides a GPU implementation of the Haar-based classifier. However,
although the GPU implementation is theoretically identical to the CPU
version, we observe an increase in detection failures. In order to retain
detection accuracy without compromising runtime, we optimize the filtering
algorithm. We track the detected faces and ensure that they are detected
consistently. The tracking scheme is based on the assumption that a face with
a given identity does not move too fast between frames. We record all the
faces detected in the current frame and push them into an unordered map that
stores the coordinates and ID of each cropped face. In the next frame, we
compare the detected face locations with the locations stored in the map. If
a location is in close proximity to a previously detected face, we declare a
match, mark the face as still in the frame, and update its location. If a
coordinate in the map is not matched for 30 frames, we decide that the
identity has disappeared and remove it from the map. With face tracking, we
can give increased weight to previously detected faces so that they are more
likely to continue being detected. Also, when we receive the recognition
result from the server asynchronously, we are able to attach the label to
the correct face.
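
The bookkeeping behind this tracking scheme can be sketched as follows. Only the 30-frame timeout comes from the text. The class name, the Euclidean proximity test, and the 40-pixel threshold are hypothetical choices for illustration, and the re-weighting of previously detected faces inside the filter is not modeled.

```python
MAX_MISSED_FRAMES = 30  # from the text: drop an identity after 30 unmatched frames
MAX_DISTANCE = 40.0     # hypothetical proximity threshold, in pixels


class FaceTracker:
    def __init__(self):
        # face id -> {"pos": (x, y), "missed": int, "label": str or None}
        self.tracks = {}
        self.next_id = 0

    def update(self, detections):
        """Match this frame's face centers to known tracks; return their ids."""
        unmatched = set(self.tracks)
        ids = []
        for (x, y) in detections:
            best, best_dist = None, MAX_DISTANCE
            for face_id in unmatched:
                px, py = self.tracks[face_id]["pos"]
                dist = ((x - px) ** 2 + (y - py) ** 2) ** 0.5
                if dist <= best_dist:
                    best, best_dist = face_id, dist
            if best is None:  # no nearby track: register a new identity
                best = self.next_id
                self.next_id += 1
                self.tracks[best] = {"pos": (x, y), "missed": 0, "label": None}
            else:             # match: refresh location and reset the counter
                unmatched.discard(best)
                self.tracks[best]["pos"] = (x, y)
                self.tracks[best]["missed"] = 0
            ids.append(best)
        for face_id in unmatched:  # tracks with no detection in this frame
            self.tracks[face_id]["missed"] += 1
            if self.tracks[face_id]["missed"] >= MAX_MISSED_FRAMES:
                del self.tracks[face_id]
        return ids

    def set_label(self, face_id, label):
        """Attach an asynchronously received recognition result, if still tracked."""
        if face_id in self.tracks:
            self.tracks[face_id]["label"] = label
```

Keeping the track ID stable across frames is what allows a recognition result that arrives several frames late to be attached to the right face.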
2.3.2 Improving Resource Usage on Embedded GPU and
Reducing Computation Workload
Increasing parallelism (doing as many tasks in parallel as possible) is a
key factor in improving performance on embedded GPUs, because we want to
avoid computing resources that remain idle. In the OpenCV implementation,
resource occupancy on the NVIDIA Jetson TK1 is only 39% when the input
resolution is 160 × 120 and 57% when the input resolution is 320 × 240. We
identify that one source of low occupancy is that, as the window size
increases, there are fewer regions to distribute among threads, as shown in
Figure 2.3. We optimize the mapping between threadIds and workload data and
thus reduce the unused resources. This technique increases the resource usage
to 80% for 320 × 240 input.
Even with efficient resource usage, the computational workload remains
significantly high, limiting performance. The number of regions and the total
workload are highest for the smallest window size, but this window size is
required to detect faces farther away from the camera and meet our
qualitative goal. As the window size increases beyond a certain size,
however, there is less need to compute at the highest resolution, because at
larger window sizes we only look for faces closer to the camera. Thus, after
reaching a certain window size, we down-scale the input image by half to
reduce the workload without reducing accuracy. Furthermore, we find that a
relatively low resolution such as 320 × 240 is already sufficient to detect
faces with high accuracy. As a result, we can capture the input frame at a
higher resolution, down-scale it to a lower resolution, and send it to the
Haar classifier. After we acquire the coordinates, widths, and heights of the
faces, we multiply the results by the down-scale factor and crop the faces at
the original high resolution. In this way, the resolution is constrained only
by the video decoding performance of the board and by the web camera, not by
the detection computation cost. In practice, we achieve real-time detection
at 1280 × 720 on the Jetson TK1 and at 1920 × 1080 on the Jetson TX1.
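
The coordinate mapping from the down-scaled detection frame back to the captured frame can be sketched as follows. The function names and the example box are hypothetical. Separate horizontal and vertical factors are used here as an assumption, since 1280 × 720 and 320 × 240 have different aspect ratios; the text itself speaks of a single down-scale factor.

```python
def scale_factors(full_size, small_size):
    """Per-axis ratio between the captured frame and the detection frame."""
    (full_w, full_h), (small_w, small_h) = full_size, small_size
    return full_w / small_w, full_h / small_h


def to_full_resolution(box, factors):
    """Map an (x, y, w, h) box from the down-scaled frame to the original frame."""
    x, y, w, h = box
    fx, fy = factors
    return (round(x * fx), round(y * fy), round(w * fx), round(h * fy))


def crop(frame, box):
    """Crop a box from a frame stored as a list of pixel rows."""
    x, y, w, h = box
    return [row[x:x + w] for row in frame[y:y + h]]


factors = scale_factors((1280, 720), (320, 240))
# Hypothetical detection on the 320x240 frame.
print(to_full_resolution((80, 60, 40, 40), factors))
```

The detector only ever sees the small frame, so its cost is independent of the capture resolution, while the crop sent on for recognition keeps the full detail.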
Figure 2.3: GPU implementation of the face detection algorithm
CHAPTER 3
FACE RECOGNITION ACCELERATION
ON HIGH-PERFORMANCE FPGA
Face recognition is a specific and hard case of object recognition. The
difficulty of this problem stems from the fact that in the frontal view,
which is the most common form in which faces are presented, faces appear
roughly alike and the differences between two faces with different
identities are subtle. Furthermore, face pictures can be captured at
different angles or under different lighting conditions. These situations
pose grand challenges for improving face recognition accuracy.
3.1 Traditional Methods and Deep Neural Networks
for Face Recognition
Traditional methods for tackling face recognition include recognition with
“Eigenfaces” [8] and, building on that, an improved version of recognition
with “Fisher Linear Discriminants” [9]. Eigenfaces is a general
dimensionality reduction method. It computes the mean face µ and finds k
principal components u_1, ..., u_k (eigenvectors of the covariance matrix
Σ), and projects each training image x_i onto the subspace spanned by the
principal components:
$$(w_{i1}, \ldots, w_{ik}) = \left(u_1^T (x_i - \mu), \ldots, u_k^T (x_i - \mu)\right)$$
A new image is projected in the same way and classified as the closest
training face in the k-dimensional subspace. This method proves to be
accurate and fast, but it is not robust to misalignment, background
variation, etc. Also, the direction of maximum variance is not always good
for classification. To address this problem, Fisherface was proposed, which
finds a projection that maximizes scatter between classes and minimizes
scatter within classes, and achieves a substantial increase in accuracy.
After deep learning swept the field of visual recognition, researchers
quickly turned their attention to adapting deep convolutional neural
networks to the face recognition problem [10, 11, 12]. These attempts
deliver recognition performance that approaches, or even surpasses,
human-level face recognition performance. Often, the neural network is used
as a feature extractor to produce a low-dimensional representation that
characterizes a person’s face. One of the most successful schemes, known as
FaceNet [13], was proposed by Google researchers. FaceNet innovates by
training its output to be a compact 128-D embedding using a triplet-based
loss function, motivated by nearest-neighbor classification. The basic idea
is that the triplet loss minimizes the distance between an anchor and a
positive, both of which have the same identity, and maximizes the distance
between the anchor and a negative of a different identity, as shown in
Figure 3.1.
Figure 3.1: FaceNet’s triplet training procedure
Mathematically, our target is:
$$\|x_i^a - x_i^p\|_2^2 + \alpha < \|x_i^a - x_i^n\|_2^2, \quad \forall (x_i^a, x_i^p, x_i^n) \in \text{TrainingSet} \qquad (3.1)$$
where α is a margin that is enforced between positive and negative pairs.
The loss that is being minimized is then
$$\sum_{i}^{N} \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+ \qquad (3.2)$$
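
Equation 3.2 can be evaluated directly on embeddings that have already been computed, as in the following minimal sketch. The function names, the 2-D toy embeddings, and the margin value are illustrative assumptions; no network or training loop is included.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(triplets, alpha):
    """Sum over (anchor, positive, negative) of [d(a, p) - d(a, n) + alpha]_+."""
    total = 0.0
    for anchor, positive, negative in triplets:
        violation = (squared_l2(anchor, positive)
                     - squared_l2(anchor, negative) + alpha)
        total += max(0.0, violation)  # the [.]_+ hinge in Equation 3.2
    return total


# Toy 2-D embeddings (FaceNet itself uses 128-D); alpha is illustrative.
satisfied = ([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])  # negative is far: no loss
violated = ([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])   # negative closer than positive
print(triplet_loss([satisfied, violated], alpha=0.2))
```

Triplets that already satisfy the margin in Equation 3.1 contribute zero, so only violating triplets drive the embedding to change.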
Google researchers experimented with training FaceNet based on the
GoogLeNet [14] and ZFNet [15] deep convolutional network architectures with
different configurations. They observed that the GoogLeNet design, which
utilizes Inception modules, has 20× fewer parameters and 5× fewer
Floating-Point Operations (FLOPs) than the ZFNet-based design. In our work,
since our goal is to map the network onto resource-limited FPGAs, we choose
a further compressed configuration of GoogLeNet, which has 2.4 million
parameters and 300 million FLOPs. The configuration is shown in Table 3.1.
In the table, a dimensionality reduction to N dimensions after pooling is
denoted by “Np”. The normalization is local response normalization.
Table 3.1: FaceNet nn4.small2 network definition
type                 output size   #1×1  #3×3R  #3×3    #5×5R  #5×5    pool
conv1 (7×7×3, s2)    48×48×64      -     -      -       -      -       -
max pool + norm      24×24×64      -     -      -       -      -       m 3×3, 2
inception 2          24×24×192     -     64     192     -      -       -
norm + max pool      12×12×192     -     -      -       -      -       m 3×3, 2
inception 3a         12×12×256     64    96     128     16     32      m, 32p
inception 3b         12×12×320     64    96     128     16     32      l2, 32p
inception 3c         6×6×640       -     128    256, 2  32     64, 2   m 3×3, 2
inception 4a         6×6×640       256   96     192     32     64      l2, 128p
inception 4e         3×3×1024      -     160    256, 2  64     128, 2  m 3×3, 2
inception 5a         3×3×736       256   96     384     -      -       l2, 96p
inception 5b         3×3×736       256   96     384     -      -       m, 96p
avg pool             736           -     -      -       -      -       -
linear               736           -     -      -       -      -       -
l2 norm              128           -     -      -       -      -       -
3.2 High-Level Synthesis for FPGA
High-Level Synthesis (HLS) allows the use of high-level programming
languages such as C/C++ for the abstract description of hardware functions,
as displayed in Figure 3.2. During the HLS design flow, on-chip logic
resources are allocated and function units are bound to the desired
operations, forming different modules in hardware. In addition, input and
output interfaces are generated to connect modules, memories, and other
communication interfaces. The main advantages of HLS include reduced design
effort, easier design space exploration, the convenience of debugging in a
high-level language, and the automatic generation of test schemes. These
features have earned HLS wider acceptance in both industry and academia
[16]. HLS also provides an efficient development approach for neural
networks [5, 17]. However, optimization with limited on-chip resources is
still very challenging, especially when targeting large neural networks.
Since the performance of an HLS-based FPGA design can vary greatly before
and after HLS optimization, optimizing the design under the given resource
constraints is essential for high-speed solutions. We quantitatively analyze
the computational demand and resource allocation across the different layers
of the CNN and develop a theoretical guideline for the allocation scheme
that achieves high performance.
Figure 3.2: An example HLS design workflow (LegUp)
3.3 Design Challenges and Difficulties
3.3.1 Complexity Analysis
The memory and computational requirements of FaceNet are very high. Table
3.2 summarizes the detailed requirements. In total, 488 million
floating-point operations are necessary during inference, while 1.6 million
inputs are distributed to the different layers and 6 million outputs are
generated. About 3.72 million weights are needed, which occupy 126.24 MB
of memory.
Table 3.2: Complexity analysis for FaceNet CNN architecture
Layers                      # Parameters   # FLOPs
conv1                       9408           21676032
Inception2                  114688         66060288
Inception3a                 163328         59719680
Inception3b                 227328         110886912
Inception3c                 397312         46817280
Inception4a                 544768         88473600
Inception4e                 716800         32993280
Inception5a                 790528         35979264
Inception5b                 661504         25860096
Linear (fully-connected)    94208          94208
Total                       3719872        488560640
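
Several entries of Table 3.2 can be reproduced from the layer shapes in Table 3.1 with the short calculation below. The helper names are ours. The match suggests that the table counts one multiply-accumulate as one operation and ignores bias terms, which is an inference from the numbers rather than something stated in the text.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights in a kernel x kernel convolution (biases not counted)."""
    return kernel * kernel * in_ch * out_ch


def conv_macs(kernel, in_ch, out_ch, out_h, out_w):
    """Multiply-accumulates: every weight is used once per output position."""
    return conv_params(kernel, in_ch, out_ch) * out_h * out_w


# conv1: 7x7 kernel, 3 input channels, 64 output channels, 48x48 output.
print(conv_params(7, 3, 64), conv_macs(7, 3, 64, 48, 48))

# inception 2: a 1x1 reduction (64 -> 64) followed by a 3x3 convolution
# (64 -> 192), both producing 24x24 outputs.
inception2_params = conv_params(1, 64, 64) + conv_params(3, 64, 192)
inception2_macs = conv_macs(1, 64, 64, 24, 24) + conv_macs(3, 64, 192, 24, 24)
print(inception2_params, inception2_macs)

# Final linear layer: 736 inputs to a 128-D embedding.
print(conv_params(1, 736, 128))
```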
3.3.2 Inception Module Topology
FaceNet builds on a CNN building block proposed by Google researchers,
called the Inception module. The main idea of this architecture is to find
out how an optimal local sparse structure in a convolutional vision network
can be approximated and covered by readily available dense components. Based
on their experiments, a typical Inception module is designed to have 1 × 1,
3 × 3, and 5 × 5 convolution parts, as well as a pooling part. These
computations can be done in parallel, and the respective outputs are
concatenated to form the output of the entire module. The Google researchers
also propose a second idea: judiciously applying dimension reductions and
projections to avoid a blow-up of computational requirements. That is, 1 × 1
convolutions are used to compute reductions before the expensive 3 × 3 and
5 × 5 convolutions. The setup is shown in Figure 3.3.
Figure 3.3: Inception module with dimension reductions
In our implementation, the Inception module poses difficulties for
optimization because of the different sizes of its convolutional kernels.
Recently proposed acceleration algorithms generally trade memory consumption
for speed. With limited on-chip block memory (BRAM), however, we have to
analyze the different patterns mathematically and choose the optimization
scheme wisely. For example, one efficient implementation of convolution
reduces the problem to general matrix multiplication (GEMM). This is done by
laying out the convolutional kernels and input feature maps as 2-D matrices,
which creates opportunities to apply highly efficient matrix multiplication
algorithms to obtain an equivalent result. However, we find that the memory
overhead is significant when the convolutional kernel is large. For
instance, in conv1, where the input size is 48 × 48 and the kernel size is
7 × 7, the GEMM method consumes 6.8× the memory of conventional convolution.
When the kernel is small, such as 1 × 1, the memory overhead of the GEMM
method is only 1.6×. This means that, in order to allocate the limited FPGA
computation resources so as to achieve higher performance, careful
consideration and calculation are needed, and different acceleration schemes
for different convolution kernel sizes are imperative.
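
The reduction from convolution to matrix multiplication can be sketched on a toy example as follows. The sketch assumes a single channel, stride 1, and no padding, and the function names are ours. The element counts it prints cover only the unrolled input matrix of this toy case, so they are not expected to reproduce the 6.8× and 1.6× figures above, whose exact accounting is not specified in the text.

```python
def im2col(image, k):
    """Unroll every k x k patch of a single-channel image into one matrix row."""
    h, w = len(image), len(image[0])
    return [[image[y + dy][x + dx] for dy in range(k) for dx in range(k)]
            for y in range(h - k + 1) for x in range(w - k + 1)]


def conv_gemm(image, kernel):
    """Convolution (CNN-style cross-correlation) as patch matrix times kernel."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(image[0]) - k + 1
    flat = [sum(p * q for p, q in zip(patch, flat_kernel))
            for patch in im2col(image, k)]
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


def conv_direct(image, kernel):
    """Reference sliding-window implementation."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [[sum(image[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(k) for dx in range(k))
             for x in range(w - k + 1)]
            for y in range(h - k + 1)]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(conv_gemm(image, kernel) == conv_direct(image, kernel))

patches = im2col(image, 3)
print(len(patches) * len(patches[0]), "unrolled elements vs", 4 * 4, "input elements")
```

With one filter the product is a matrix-vector multiplication; stacking the flattened kernels of all output channels as columns turns it into a GEMM. Overlapping patches duplicate input pixels, which is the source of the memory overhead for larger kernels.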
3.3.3 Resource Allocation and Partitioning
Each layer in FaceNet is a nested multiply-add loop. Using HLS pragmas such
as loop unroll, the designer can request the instantiation of parallel
hardware instances for loop iterations, which provides an opportunity to
improve the performance of the loop. However, there are two difficulties in
using such pragmas freely. First, it is not straightforward to relate such
HLS pragmas directly to the performance of the loop because of possible
dependencies across loop iterations. Additional insight into and analysis of
the loop are required to use these HLS resources effectively. Second,
FaceNet consists of multiple such loop structures (corresponding to
different layers), and an optimal implementation of the overall design
requires careful resource allocation among the various loops. Since the loops
vary widely in their computational demands, a homogeneous resource
allocation method will likely fail to provide high performance. A good
resource allocation scheme must consider the resource-performance trade-offs
of each loop as well as a global performance model.
To tackle these issues, we develop two methods. First, we propose an HLS
Intellectual Property (IP) design that can be used to implement the
different loops in FaceNet efficiently. The IP vastly simplifies the
computation model of a loop, allowing the designer to directly assess the
impact of tuning the IP with various HLS pragmas. Second, we derive a simple
global performance model for end-to-end latency in terms of the resources
allocated to each instance of the HLS IP and the computation demand of the
loop it implements. We solve this model for minimum end-to-end latency to
obtain resource allocation recommendations for each loop/layer in FaceNet.
Details are provided in Section 3.4.
3.3.4 Limited On-Chip Memory and Memory Access Latency
The large size of the FaceNet weight data forces us to use external memory
storage. Weights are heavily used in the network computations, and external
memory is constantly accessed for these weights. If external memory becomes
a bottleneck, the performance of a critical loop is determined not so much
by its computation demand and resource allocation as by how frequently it
can access weights from memory. While it may be possible to work such memory
access considerations into the performance model of a loop, it is much
simpler to eliminate the memory bottleneck so as to reduce its impact on the
system’s performance. We explore optimization techniques on both the
hardware side and the algorithm side, including the Fast Fourier Transform
(FFT), Winograd’s minimal filtering algorithm for convolution, bit-width
quantization, memory organization, and efficient memory sub-system design,
in order to improve memory performance and protect our performance modeling
assumptions.
3.4 Design Methodology
3.4.1 Process Engine IP
In the critical loops representing the FaceNet layers, we move the loop
iterations with minimal dependency inward, so that the inner loops in the
transformed source code can be unrolled for maximum parallelization and
resource utilization. We find that the same inner-loop structure is repeated
in all the layers of FaceNet. We abstract this loop structure as an HLS IP
and use it to construct all the layers of the network. The HLS IP encloses
all the critical resources used in the implementation of a layer.
As shown in Figure 3.4, the IP consists of Coo multiply-accumulate units of
dimension Cii each. In other words, it represents a two-dimensional unrolled
loop tile of multiply-accumulate operations, where Cii and Coo are the unroll
factors along the two dimensions of the tile. To illustrate this, Figure 3.5
shows the HLS code for a convolutional layer, marking out the IP’s structure
in the code. The multiplications within a single multiply-accumulate unit in
the PE line up along the Z-axis of the input feature map. Different
multiply-accumulate units produce activations for different output channels.
Thus, the multiply-accumulate units themselves are lined up along the Z-axis
of the output channels. Twenty-four input channels are processed at a time to
produce sixteen output channel results, giving Cii = 24 and Coo = 16 for
this specific configuration.
Figure 3.4: An Individual HLS IP instance
Increasing Coo increases the number of operations executed by the HLS IP
per second, since there are no dependencies in the Coo dimension. At first
sight, however, increasing Cii does not seem to improve the performance of
the tile linearly, because there is a dependency along that dimension in the
form of an adder tree whose depth increases with Cii, thus worsening the
latency. We would like the performance to increase linearly with Cii as
well. Hence we use Vivado HLS’s #pragma HLS PIPELINE to introduce pipeline
stages into the adder tree. The depth of this pipeline is logarithmic in Cii
due to its tree structure. The latency of the adder tree is visible only
when the tile is used for the first time within the outer loop. Thereafter,
the pipeline stages in the adder tree are fully occupied until the layer
computation completes. We assume that the period for which the adder-tree
pipeline is fully occupied far exceeds the period needed to warm up the
pipeline. For example, in layer 3, Cii = 24, implying that there are at most
five pipeline stages in the adder tree. However, the tile is reused
128 × 2 × 8 times to obtain all the outputs of the layer, and all this while
the adder pipeline stages are fully occupied. Hence the impact of the
increased latency of the adder tree is negligible, and the tile’s
performance improves linearly with Cii as well. It must be noted that
external factors such as memory latency, rather than the computational
capacity of the hardware implementing the layer, can dominate the
performance of the layer. In that case, performance will not improve
linearly with the tile dimensions Cii and Coo.
Figure 3.5: HLS IP in C Code and HLS pragmas
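
The numbers in this argument can be checked with a small behavioral model. This is a software sketch of the tile, not the HLS code of Figure 3.5: the function names are ours, the cycle estimate assumes an initiation interval of one, and only the adder-tree stages are counted as pipeline latency.

```python
import math

CII, COO = 24, 16  # unroll factors quoted in the text


def adder_tree_depth(n_inputs):
    """Levels in a balanced binary adder tree that sums n_inputs values."""
    return math.ceil(math.log2(n_inputs))


def pe_tile(inputs, weights, acc):
    """One tile invocation: one dot product per output unit, added to acc."""
    return [acc[o] + sum(w * x for w, x in zip(weights[o], inputs))
            for o in range(len(weights))]


def pipelined_cycles(invocations, depth):
    """Cycles for `invocations` back-to-back uses of a pipeline of `depth` stages."""
    return invocations + depth - 1


depth = adder_tree_depth(CII)
reuse = 128 * 2 * 8  # tile reuse count for layer 3, from the text
total = pipelined_cycles(reuse, depth)
print(depth, total, (total - reuse) / total)
```

Under these assumptions the pipeline fill accounts for a fraction of a percent of the layer's cycles, which supports treating the tile's throughput as linear in Cii.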
3.4.2 Data Quantization
Extensive research has been conducted on compressing network architectures
to overcome the challenges of high compute and memory bandwidth
requirements. One simple solution, using a low-bit representation of data in
the DNN, reduces this overhead and ultimately increases power efficiency and
lowers the total power required. In addition to saving power during
computation, a lower bit-width also lowers the power needed for memory
transfers, because fewer bits are transferred with the same number of memory
transactions. It has been shown [18] that low-bit-width fixed-point numbers
are sufficient in deep learning inference to keep the same level of accuracy.
For FaceNet, we analyze the distribution of all the input weight data and
feature map data, and determine that a 16-bit format, with 5 bits for the
signed integer part and 11 bits for the fractional part, is ideal for
obtaining a performance improvement while preserving accuracy. In the naive
implementation, we simply change all data from floating-point to fixed-point
numbers, and we discover that error quickly accumulates. In Figure 3.6, the
red curve shows the error between the fixed-point output and the ground
truth, and the orange curve shows the ground-truth reference output of the
inception3a module. It can be seen that the error is not negligible.
Moreover, naive truncation even generates invalid values because some
computations are sensitive to rounding. To deal with this, we identify the
key computations that require higher precision, for example log and
exponent, and use floating-point instead of fixed-point numbers for them.
These computations are rare in the network and do not take much of the
computation resources, so the cost of using floating-point numbers can be
ignored. This small trick cuts down the error dramatically, as can be seen
in Figure 3.7.
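
The 16-bit format can be illustrated with the following sketch. It reads the format as 5 integer bits including the sign plus 11 fractional bits, giving a range of [-16, 16) with a step of 2^-11. This reading, the round-to-nearest and saturation behavior, and the function names are assumptions, and the hardware may behave differently (for example, by truncating).

```python
FRAC_BITS = 11
TOTAL_BITS = 16
SCALE = 1 << FRAC_BITS               # 2048 steps per unit
Q_MIN = -(1 << (TOTAL_BITS - 1))     # raw value for -16.0
Q_MAX = (1 << (TOTAL_BITS - 1)) - 1  # raw value for 16 - 2**-11


def saturate(q):
    """Clamp a raw integer to the representable 16-bit range."""
    return max(Q_MIN, min(Q_MAX, q))


def quantize(x):
    """Round to the nearest representable value; returns the raw integer."""
    return saturate(int(round(x * SCALE)))


def dequantize(q):
    """Convert a raw fixed-point integer back to a float."""
    return q / SCALE


def fixed_mul(a, b):
    """Multiply two raw values; the shift drops the extra fractional bits."""
    return saturate((a * b) >> FRAC_BITS)


x = 3.14159
print(dequantize(quantize(x)) - x)                        # rounding error
print(dequantize(fixed_mul(quantize(1.5), quantize(2.0))))
print(dequantize(quantize(100.0)))                        # saturates near 16
```

Values outside the range saturate, and every operation can lose up to one step of precision, which is consistent with the error accumulation reported for the naive conversion.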
Figure 3.6: Erroneous inception 3a output
3.4.3 FFT for Efficient Convolution Computation
To further improve the speed of convolution, it is essential to cut down the
cost of convolution algorithmically. The convolution theorem states that
circular convolutions in the spatial domain are equivalent to element-wise
products in the frequency (Fourier) domain. Thus convolution can be written
as Equation 3.3,
\[ k * g = \mathcal{F}^{-1}\big(\mathcal{F}(k) \odot \mathcal{F}(g)\big) \tag{3.3} \]
where k denotes the convolution kernel, g denotes the input tile, and
\odot denotes the element-wise product. Note that the asymptotic runtime
complexity of direct convolution is \Theta(n^2), whereas that of the
FFT-based method is \Theta(n \log n). Although the computational complexity
of FFT-based convolution is superior to that of traditional convolution, its
memory footprint is significantly larger, because the kernel must be padded
to the size of the input tile. This problem is more pronounced when the
sizes of the input feature map and the kernel differ a lot. Therefore, we
apply FFT-based convolution only to the 5 × 5 kernels, the largest kernels
in the inception modules. In our implementation, we use the HLS FFT IP
provided by Xilinx.
Figure 3.7: inception 3a output with insignificant error
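The convolution theorem behind Equation 3.3 can be checked numerically. The sketch below is only an illustration in NumPy, not the Xilinx HLS FFT IP: the 8 × 8 tile size, the random data, and the function names are assumptions, and the kernel is zero-padded to the tile size before the transform.

```python
import numpy as np


def circular_conv2d_direct(g, k):
    """Circular 2-D convolution computed directly from its definition."""
    n = g.shape[0]
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for a in range(n):
                for b in range(n):
                    out[i, j] += k[a, b] * g[(i - a) % n, (j - b) % n]
    return out


def circular_conv2d_fft(g, k):
    """Same result through the Fourier domain: F^-1(F(k) * F(g)), element-wise."""
    return np.real(np.fft.ifft2(np.fft.fft2(k) * np.fft.fft2(g)))


rng = np.random.default_rng(0)
n = 8                                      # power-of-two tile size (assumed)
g = rng.standard_normal((n, n))            # input tile
k = np.zeros((n, n))
k[:5, :5] = rng.standard_normal((5, 5))    # 5x5 kernel zero-padded to the tile size

direct = circular_conv2d_direct(g, k)
via_fft = circular_conv2d_fft(g, k)
max_abs_diff = float(np.max(np.abs(direct - via_fft)))
```

The zero-padded kernel occupies a full n × n array even though only 25 entries are non-zero, which illustrates the larger memory footprint of the frequency-domain approach.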
3.4.4 Winograd’s Minimal Filtering for Efficient Convolution
Computation
Previous work [19] reveals that FFT based convolution provides speedup
for large filters, but for small filters, due to memory and transformation
overhead, FFT does not yield expected results and sometimes decreases the
speed. However, recent state-of-the-art DNNs use smaller convolution filters
such as 3 × 3. To address this case, Winograd's minimal filtering algorithm
was introduced for convolutional networks [20]. In FaceNet's inception
module topology, 3 × 3 filters are widely used. Hence, FaceNet can take
advantage of the Winograd algorithm.
The algorithm improves the convolution speed by exploiting the algebraic
structure of the computation, which reduces the number of multiplications
needed. We denote the algorithm that takes a 3 × 3 filter and generates a
2 × 2 output as F(2 × 2, 3 × 3). The details are as follows.
The algorithm consists of three transformations. The first transformation
is applied to the convolution kernel, as shown in Equation 3.4, where U is
an (m + r − 1) × (m + r − 1) transformed filter, g is an r × r input filter,
and G is a transform matrix defined by the Winograd algorithm, shown in
Equation 3.5. In our work, m is 2 and r is 3.
\[ U = G g G^{T} \tag{3.4} \]
\[ G = \begin{bmatrix} 1 & 0 & 0 \\ 1/2 & 1/2 & 1/2 \\ 1/2 & -1/2 & 1/2 \\ 0 & 0 & 1 \end{bmatrix} \tag{3.5} \]
The transformed filter values can be pre-computed, so that we avoid
computing the filter transformation on the FPGA and save FPGA resources. In
exchange, the space needed for storing the transformed filters increases by
33%. The second transformation, shown in Equation 3.6, is applied to the
input image tiles, where V is an (m + r − 1) × (m + r − 1) transformed input
tile, d is an (m + r − 1) × (m + r − 1) input tile, and B is a transform
matrix, also defined by the Winograd algorithm, whose transpose is given in
Equation 3.7.
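\[ V = B^{T} d B \tag{3.6} \]
\[ B^{T} = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \tag{3.7} \]
\[ Y = A^{T} [U \odot V] A \tag{3.8} \]
\[ A^{T} = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix} \tag{3.9} \]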
Once we obtain the two transformed matrices U and V, we multiply them
element-wise. To get the final output, we apply the last transformation
using A, shown in Equation 3.9, generating a 2 × 2 non-overlapping output
tile, as shown in Equation 3.8. The output of a single-channel feature map
can be calculated iteratively, tile by tile. Since the algorithm is highly
parallelizable by nature, in our implementation we fetch and process
multiple tiles across dimensions in one iteration. This scheme is
demonstrated in Figure 3.8.
Figure 3.8: Winograd algorithm processing unit
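The three transformations can be verified with a short NumPy sketch. It uses the standard F(2 × 2, 3 × 3) transform matrices from [20] and compares the result against a direct sliding-window computation. The function names and the random test data are assumptions made for illustration, and the code is a software reference rather than the FPGA processing unit.

```python
import numpy as np

# Standard F(2x2, 3x3) transform matrices from Lavin and Gray [20].
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)


def winograd_2x2_3x3(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = G @ g @ G.T            # filter transform (can be pre-computed offline)
    V = B_T @ d @ B_T.T        # input tile transform
    return A_T @ (U * V) @ A_T.T   # 16 element-wise multiplications, then output transform


def direct_2x2_3x3(d, g):
    """Reference: slide the 3x3 filter over the 4x4 tile (36 multiplications)."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out


rng = np.random.default_rng(1)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
y_winograd = winograd_2x2_3x3(d, g)
y_direct = direct_2x2_3x3(d, g)
```

The element-wise product of the two 4 × 4 matrices is the only step with general multiplications, which is where the reduction from 36 to 16 multiplications per output tile comes from.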
3.5 Design Space Exploration for Fast Convolution
Algorithms
Although these two fast algorithms deliver remarkable speedup over
conventional convolution, theoretical analysis and previous experiments
[19, 20] have shown that they are best suited for different convolution
types. The FFT-based method in theory provides greater speedup when the
kernel size is larger. This is supported by the GPU implementation of
Vasilache et al. [19]. On the other hand, Lavin and Gray report that the
speed improvement of the Winograd algorithm diminishes quickly as the kernel
size grows, because the number of additions and constant multiplications
required by the transformations increases quadratically, offsetting the
savings in the multiplications [20].
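The trade-off can be made concrete by counting multiplications per output tile. For F(m × m, r × r), direct convolution needs m²r² multiplications, while the Winograd form needs (m + r − 1)² element-wise multiplications [20]. The sketch below only tabulates these two counts for the kernel sizes considered here with m = 2; it does not model the cost of the transformations themselves, and the function names are assumptions.

```python
def direct_mults(m, r):
    """Multiplications for an m x m output tile with an r x r filter, computed directly."""
    return (m * r) ** 2


def winograd_mults(m, r):
    """Element-wise multiplications in F(m x m, r x r); also the transformed tile size."""
    return (m + r - 1) ** 2


m = 2
counts = {}
for r in (3, 5, 7):
    d, w = direct_mults(m, r), winograd_mults(m, r)
    counts[r] = (d, w, d / w)
    print(f"r={r}: direct={d:3d}  winograd={w:3d}  reduction={d / w:.2f}x")
```

The transformed tile grows from 4 × 4 for a 3 × 3 filter to 8 × 8 for a 7 × 7 filter, so the transformation matrices and the work they imply grow with it, while the reduction in multiplications improves only modestly.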
Recently emerged deep CNN structures contain multiple parallel branches
with different kinds of convolution. Therefore, a single fast algorithm
cannot provide the best optimization for every branch. Consequently, we
propose a heuristic design: a hybrid accelerator that incorporates both
fast algorithms to cover different workloads and to deliver the best
performance. To find an ideal strategy for choosing between the algorithms,
we carry out design space explorations on the FPGA, with the configurations
shown in Table 3.3. For Winograd-based convolution with larger kernels, we
evaluate F(2 × 2, 5 × 5) and F(2 × 2, 7 × 7). For our implementation of
FFT-based convolution, the inputs are zero-padded, since the radix-2 FFT
requires input sizes that are powers of 2. One observation is that the
kernel size generally does not affect the FFT's performance, because the
zero-padding leads to input data of the same size. The sole exception is
when the input size is 6 × 6 and the kernel size is 7 × 7. For this
particular parameter combination, we pad the input to 16, leading to
performance similar to that of the 12 × 12 input.
Table 3.3: Design space explorations for FFT-based and Winograd-based
convolutions
Dimensions                 Sizes evaluated
Kernel sizes               3, 5, 7
Feature map sizes          6, 12, 24
Input/output dimensions    16, 32, 64, 128 (combinations)
The results are shown in Figures 3.9, 3.10, and 3.11. The baseline is
implemented using a conventional loop-optimization method. Across the three
figures, orange curves represent the speedup of Winograd-based convolution
over the baseline, and gray curves represent the speedup of the FFT-based
method over the baseline. From the figures we learn that for small kernels,
Winograd's algorithm dominates in performance. For larger kernel sizes,
FFT-based convolution starts to catch up in speed, and when the kernel size
is 7 and the input/output depth is large, the FFT method outperforms
Winograd's method by up to 3x.
We try to keep the same parallel factor for the different algorithms so that
they use a similar amount of resources. However, the HLS results show that
different algorithms favor different kinds of resources. For example, the
FFT-based algorithm uses 60% of the DSPs, but consumes as much as 2.3x the
LUTs. Also, its BRAM usage is affected by the number of output channels,
since it buffers the intermediate results to prevent unnecessary IFFTs.
3.6 Overall System
Our FaceNet design is implemented on the Xilinx UltraScale+ VCU118 board,
targeting a clock frequency of 200 MHz. The complete on-chip implementation
consists of the PCIe and DMA IPs, the FaceNet module, and an external memory
controller for interfacing with the off-chip DDR memory where the weights
are stored. The DMA, the HLS IP instances in the FaceNet module (through
FIFOs), and the external memory controller connect to a common AXI bus
system, and the PCIe IP interfaces directly with the DMA. A host program
transfers weights and images to the DDR memory through PCIe. Using this
setup, consisting of the host PC and the FPGA board for back-end processing,
we build an end-to-end, real-time face recognition system that can directly
process video frames from a commercial webcam. For the front end of the
system, we use the NVIDIA Tegra TK1, the low-cost, low-power embedded SoC
development board described previously, and a Logitech C920 full-HD webcam
to capture the video. We down-sample the captured frames to the size that
fits the FaceNet network, and stream the image frames over the Internet. The
transmission rate may be tuned to fit the available connection bandwidth. On
the back-end side, the host PC receives the frames and uses multi-processing
to pre-process the face images, including alignment using dlib functions,
reordering of pixels, and fixed-point conversion, before off-loading the
data to our low-latency FaceNet kernel implemented on the FPGA. After
computation, the face embedding vector produced by the kernel is fed to a
trained SVM classifier to perform face recognition/verification. This
prototype system may be expanded for use in applications such as crime
sensing, entry control, or biometric verification, and can be easily
deployed at large scale. The complete system is shown in Figures 3.13 and
3.14.
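The last two host-side pre-processing steps, pixel reordering and fixed-point conversion, might look like the following sketch. Everything specific in it is an assumption made for illustration: the function name preprocess_face, the 96 × 96 crop size, the channel-first (C × H × W) target layout, the normalization of pixels to [0, 1], and the use of 11 fractional bits to match the 16-bit format of Section 3.4.2. Alignment with dlib is assumed to have happened before this function is called.

```python
import numpy as np

FRAC_BITS = 11  # assumed to match the 16-bit fixed-point format of Section 3.4.2


def preprocess_face(face_hwc_uint8):
    """Reorder an aligned H x W x C face crop to C x H x W and convert it to int16."""
    chw = np.transpose(face_hwc_uint8, (2, 0, 1)).astype(np.float64)
    scaled = chw / 255.0                          # assumed normalization to [0, 1]
    fixed = np.round(scaled * (1 << FRAC_BITS))   # scale to 11 fractional bits
    return np.ascontiguousarray(fixed.astype(np.int16))


# Hypothetical 96 x 96 RGB crop with a single saturated red pixel.
crop = np.zeros((96, 96, 3), dtype=np.uint8)
crop[1, 2, 0] = 255
buffer_for_fpga = preprocess_face(crop)
```

Producing a contiguous int16 buffer on the host keeps the data in the form the kernel consumes, so no format conversion is needed on the FPGA side.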
Figure 3.9: Speedup comparison when kernel size is 3
Figure 3.10: Speedup comparison when kernel size is 5
Figure 3.11: Speedup comparison when kernel size is 7
Figure 3.12: Normalized resource usage for different experiment setup
Figure 3.13: Frontend: Tegra TK1 and web camera
Figure 3.14: Backend: Xilinx FPGA board installed in a Personal
Computer
CHAPTER 4
EVALUATION
We implement all the proposed convolution optimization schemes, including
GEMM on 1 × 1 layers, Winograd convolution on 3 × 3 layers, and FFT-based
convolution on 5 × 5 layers. During our extensive experiments, we find that
the HLS FFT IP from Xilinx is hard to optimize, since it is difficult to
configure for ideal performance. Some of the parallel computation methods
cannot be applied because of the constraints of the IP. We find that our
conventional, well-optimized convolution IP achieves better results.
Therefore, we decide to abandon the FFT-based convolution method for now,
and will implement our own FFT design in future work.
We apply the REALM equations and use the results as guidance for ideal
resource allocation between the different modules. Figure 4.1 displays the
recommended resource allocation. In our experiments, we find that LUTs are
the critical resource. Due to the time-consuming synthesis procedure, we
choose to optimize the conv1, inception2, and inception3a modules. These
modules contain all the different patterns of convolutions, and represent
30% of the entire workload. We estimate the overall latency by applying the
following equation:
$L_{overall} = L_{simulated} \times \frac{GOPS_{overall}}{GOPS_{simulated}}$.
We try to allocate resources following the REALM results, but differences
exist due to the coarse-grained control of resource allocation in the HLS
implementation. The results are shown in Table 4.1.
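The latency estimate is a simple proportional scaling, sketched below together with the conversion from cycle counts to milliseconds at the 200 MHz target clock. The function names are assumptions, and the numbers passed to estimate_overall_latency are hypothetical placeholders chosen only to show the arithmetic, not measurements from this work.

```python
CLOCK_HZ = 200e6  # target clock frequency of the design


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to runtime in milliseconds."""
    return cycles / clock_hz * 1e3


def estimate_overall_latency(l_simulated_ms, gops_overall, gops_simulated):
    """Scale the simulated latency by the ratio of total to simulated workload."""
    return l_simulated_ms * gops_overall / gops_simulated


# conv1 cycle count taken from Table 4.1.
conv1_ms = cycles_to_ms(99159)

# Hypothetical example: 6.0 ms simulated for modules covering 30% of the workload.
example_overall_ms = estimate_overall_latency(6.0, gops_overall=1.0, gops_simulated=0.3)
```

The estimate assumes that the modules which were not simulated reach the same throughput per operation as the simulated ones.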
We compare the simulated FPGA performance of FaceNet with that of a GPU,
one of the most popular accelerators for both training and inference. The
GPU card used in our experiment is an NVIDIA Tesla K80, running the OpenFace
implementation [21] of FaceNet. The OpenFace implementation uses Torch7, an
optimized framework for CNN workloads. The results are shown in Table 4.2.
It can be seen that our FPGA implementation achieves more than 5.3x speedup
in terms of single face image inference. Also, the FPGA consumes
significantly less power, making it a much more energy-efficient solution
when deployed at a large scale.
Figure 4.1: Optimized allocation scheme calculated from REALM equations
Table 4.1: FaceNet latency and implementation statistics
Layers         Cycles    Runtime (ms)   LUT usage (%)
conv1           99159    0.49            5.59
Inception2     578207    2.89            7.45
Inception3a    525426    2.62            8.55
Inception3b    960911    4.80           15.02
Inception3c    574931    2.87            4.66
Inception4a    986481    4.93           10.25
Inception4b    323773    1.62            6.86
Inception5a    348716    1.74           11.35
Inception5b    352742    1.76            9.32
Table 4.2: FaceNet latency comparison between FPGA and GPU
Platform            Latency (ms)   Speedup
This work            23.9          -
NVidia Tesla K80    129.0          5.37x
CHAPTER 5
CONCLUSION
In this thesis, we presented an end-to-end system design of a face
recognition pipeline, using multiple optimization schemes for different
hardware platforms, including optimization for embedded GPUs and an
HLS-based design flow for FPGAs. We introduced an HLS IP that, in addition
to providing optimal implementations of the FaceNet layers, also allows us
to formulate a simplified model for layer latency in terms of layer
computation demand and resource consumption. Using this model, we derived
theoretical guidelines for per-layer resource allocation for minimum overall
latency. We discussed issues that could lead to deviations from the
underlying assumptions of our theoretical results, and implemented methods
to reduce their impact, including network pruning, weight quantization, and
an efficient memory system. Using our resource allocation guidelines, we
tuned the parameters of the HLS IP instances to implement each layer in
FaceNet, obtaining a design whose power and latency performance surpasses
that of GPU and CPU implementations. The overall system can play a role in
important biometric verification scenarios and has a wide range of
applications.
Overall, we addressed certain key issues arising in an HLS design flow, re-
garding per-loop optimizations, and resource allocation across layers to meet
performance goals. We believe that an effective design strategy using HLS
such as ours can help quickly develop high-performance implementations of
complex DNNs.
REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for im-
age recognition,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016, pp. 770–778.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Advances in Neural Infor-
mation Processing Systems, 2012, pp. 1097–1105.
[3] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep be-
lief networks for natural language understanding,” IEEE/ACM Trans-
actions on Audio, Speech and Language Processing (TASLP), vol. 22,
no. 4, pp. 778–784, 2014.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
in Advances in Neural Information Processing Systems, 2014, pp. 2672–
2680.
[5] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural networks,”
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[6] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural net-
works,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138,
2017.
[7] P. Viola and M. J. Jones, “Robust real-time face detection,” Interna-
tional Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[8] M. A. Turk and A. P. Pentland, “Face recognition using eigen-
faces,” in Computer Vision and Pattern Recognition, 1991. Proceedings
CVPR’91., IEEE Computer Society Conference on. IEEE, 1991, pp.
586–591.
[9] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs.
fisherfaces: Recognition using class specific linear projection,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 19,
no. 7, pp. 711–720, 1997.
[10] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face rep-
resentation by joint identification-verification,” in Advances in Neural
Information Processing Systems, 2014, pp. 1988–1996.
[11] Y. Sun, X. Wang, and X. Tang, “Deeply learned face representations are
sparse, selective, and robust,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2015, pp. 2892–2900.
[12] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing
the gap to human-level performance in face verification,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
2014, pp. 1701–1708.
[13] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified
embedding for face recognition and clustering,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–
823.
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Er-
han, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolu-
tions,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2015, pp. 1–9.
[15] M. D. Zeiler and R. Fergus, “Visualizing and understanding con-
volutional networks,” in European Conference on Computer Vision.
Springer, 2014, pp. 818–833.
[16] R. Nane, V.-M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen,
H. Hsiao, S. Brown, F. Ferrandi et al., “A survey and evaluation of
FPGA high-level synthesis tools,” IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. 35, no. 10, pp. 1591–
1604, 2016.
[17] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s.
Seo, and Y. Cao, “Throughput-optimized OpenCL-based FPGA accelerator
for large-scale convolutional neural networks,” in Proceedings of the 2016
ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays. ACM, 2016, pp. 16–25.
[18] T. Dettmers, “8-bit approximations for parallelism in deep learning,”
arXiv preprint arXiv:1511.04561, 2015.
[19] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and
Y. LeCun, “Fast convolutional nets with fbfft: A GPU performance
evaluation,” arXiv preprint arXiv:1412.7580, 2014.
[20] A. Lavin and S. Gray, “Fast algorithms for convolutional neural net-
works,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 4013–4021.
[21] B. Amos, B. Ludwiczuk, and M. Satyanarayanan, “OpenFace: A
general-purpose face recognition library with mobile applications,”
CMU-CS-16-118, CMU School of Computer Science, Tech. Rep., 2016.
