FPGA acceleration of CNN training by Samal, Kruttidipta
	  



























In Partial Fulfillment 
of the Requirements for the Degree 
Master of Science in the 













Copyright © Kruttidipta Samal 2015 
	  









Dr. Marilyn Wolf, Advisor 
School of Electrical and Computer Engineering 
Georgia Institute of Technology 
 
Dr. Tom Conte 
School of Electrical and Computer Engineering 
Georgia Institute of Technology 
 
Dr. Saibal Mukhopadhyay 
School of Electrical and Computer Engineering 




Date Approved:  12.04.2015 
	  
	   iii	  
ACKNOWLEDGEMENTS 
 
I wish to thank my advisor Prof. Marilyn Wolf for her guidance and support, without
which this thesis could not have been completed. I am very grateful to my professors and
teachers, both here at Georgia Tech and in India, who have imparted to me the
knowledge and confidence to complete this work. I would also like to thank my friends
and colleagues at Georgia Tech for discussing and critiquing my ideas.
Finally I am most grateful to my parents, elder brother, family and friends for their moral 
















	   iv	  
TABLE OF CONTENTS 
Page 
ACKNOWLEDGEMENTS iii 
LIST OF TABLES v 
LIST OF FIGURES vi 
LIST OF CODE SNIPPETS vii 
SUMMARY viii 
CHAPTER 
1 INTRODUCTION 1 
2 VISION IS COMPLEX 4 
Learning Algorithms 6 
Functionality 11 
3 CONVOLUTIONAL NEURAL NETWORKS (CNN) 19 
4 DRAM SIMULATION 29 
Architecture 31 
Result 33 
5 FPGA DESIGN OF CNN 40 
Architecture 41 
Memory Storage Calculations 46 
Utilization 47 
Loop Iteration 48 
6 CONCLUSION 50 
APPENDIX A: Code Walkthrough 52 
REFERENCES 56 
	   v	  
LIST OF TABLES 
Page 
Table 1: Layer-wise algorithm distribution 22 
Table 2: Utilization Table in [34] 48 
Table 3: Utilization Table in Virtex 7 Ultrascale xcvu440 48 
Table 4: Loop Latency with single Processing Element 49 

















	   vi	  
LIST OF FIGURES 
Page 
Figure 1: Layered Architecture of Neo-cortex 4 
Figure 2: Artificial Neuron 8 
Figure 3: Stacked Autoencoder 8 
Figure 4: HMAX Model 15 
Figure 5: Transforming Autoencoder 15 
Figure 6: LCN in space 16 
Figure 7: Left: Biological neuron response; Right: Common artificial neuron response 17
Figure 8: Yann LeCun's CNN (first successful artificial CNN) 19
Figure 9: AlexNet Architecture 21 
Figure 10: Zeiler's CNN architecture 23 
Figure 11: Coupled Convolution and Deconvolution Network 23 
Figure 12: RCNN 24 
Figure 13: GoogLeNet 28
Figure 14: Flowchart of DRAM Simulator 31 
Figure 15: Sample Simulator output of single channel DRAM running on first 
convolution layer trace 36 
Figure 16: Sample Simulator output of double channel DRAM running on first 
convolution layer trace 37 
Figure 17: Sample simulator output of second convolution layer trace 38 
Figure 18: Conversion and storage of input image in BRAM blocks 41 
Figure 19: Architecture designed in [33] 42 
Figure 20: Designed FPGA Architecture 43 
 
	   vii	  
LIST OF CODE SNIPPETS 
Page 
Code Snippet 1: HLS code for FPGA with single Executing Unit 52 
Code Snippet 2: Sample code for BRAM port and block declaration 53 
Code Snippet 3: Sample Code containing data transfer and computation logic to generate 
2916 parallel processing units 54 
 
	   	  
	   viii	  
SUMMARY 
 
This thesis presents the results of an architectural study on the design of FPGA-based
architectures for convolutional neural networks (CNNs).
We have analyzed the memory access patterns of a Convolutional Neural
Network (one of the biggest networks in the family of deep learning algorithms) by
creating a trace of a well-known CNN architecture and by developing a trace-driven
DRAM simulator. The simulator uses the traces to analyze the effect that different
storage patterns, and the mismatch in speed between memory and processing element,
can have on the CNN system. This insight is then used to create an initial design for a
layer architecture for the CNN using an FPGA platform. The FPGA is designed to have
multiple parallel-executing units. We design a data layout for the on-chip memory of an
FPGA such that we can increase parallelism in the design. The number of these
parallel units (and hence the parallelism) depends on the memory layout of input and
output, particularly on whether parallel read and write accesses can be scheduled. The
on-chip memory layout minimizes access contention during the operation of parallel
units. The result is an SoC (System on Chip) that acts as an accelerator and can have
more parallel units than previous work. The improvement in design was also observed by
comparing post-synthesis loop latency tables between our design and a design with a
single unit. This initial design can help in designing FPGAs targeted for deep learning
algorithms that can compete with GPUs in terms of performance.




CHAPTER 1
INTRODUCTION

Recent developments in deep learning have led to its application in tasks that were
earlier possible only with traditional handcrafted methods such as SIFT (Scale Invariant
Feature Transform) [1], HOG (Histogram of Oriented Gradients), BoW (Bag of Words) [2]
and LHoP (Learning Hierarchy of Parts) [3]. Deep learning methods are reminiscent of
the developmental stages of cognition. As biological vision systems are very robust even
in degraded conditions, learning features in line with biology can lead to better
performance of artificial systems, especially in divergent conditions, which are pain
points in modern vision systems. These divergent conditions include (but are not limited
to) object recognition under partial occlusion, extraction of form from motion, and
inference of surface properties from reflectance. In the current review we are concerned
with how to relate two images. In particular, we are concerned with tracking a moving
object and/or inferring the transformation of an object in an image sequence. The current
state of the art in Computer Vision for tracking is a Bayesian method called the ‘Particle
Filter’. We will instead approach this topic by studying the visual pathways in the visual
cortex and juxtaposing them with the state of the art in deep learning. Extraction of
motion and extraction of transformation have traditionally been treated entirely
separately from one another, as will be evident from some of the networks that will be
presented, but the same is not true in biology, which is not as strictly functionally
structured. Of course, the best method would be to figure out the affine transform of an
image, which can take care of all kinds of transformation, but as the artificial system
receives only an intensity value per location, it is very difficult to associate an
intensity with an object and then calculate the transformation factors. Despite all these
challenges, a recent review of three DARPA-funded neuromorphic vision systems [4]
confirmed that biologically inspired systems are much more energy efficient and have
nearly equal, and sometimes better, performance.
All three projects model the human vision system from a systems perspective.
But as the systems themselves are not modeled according to biology, they lose
efficiency. Maintaining scale invariance is an issue; these systems mitigate it by
maintaining multiple scaled versions of the same object feature. Also, these features are
hand-crafted and not learned. Although the models from Penn State and USC have
efficiently incorporated the concept of ‘Attention’, it is based on saliency maps, which
do not take into account the ‘objectivity’ of a region; thus their attention region may
not have a rigid structure or form. Finally, their attended region is compared to an
already stored BoW or LHoP to detect an object, and a particle filter is used to track it.
Thus, even though these neuromorphic systems are good, they can be much better if
modeled from studies in neuroscience and deep learning.
Therefore, this report attempts to discern those biological systems that are
influential in tracking and/or object detection/recognition while comparing them with the
traditional and state-of-the-art technologies in Computer Vision, both at an algorithmic
level and at a system level. It starts with basic cortical neuronal connections vis-à-vis
sparse coding and ICA. It then moves into the neural pathways and systems associated
with them, followed by a brief study of deep learning architectures that documents their
evolution in tandem with biology. One of the popular system implementations is analyzed
with respect to its memory accesses, and an FPGA design is presented whose performance
is comparable to prevalent implementations. The report ends by delineating those aspects
of FPGAs that should be enhanced to get better performance.
  
	   4	  
CHAPTER 2 
VISION IS COMPLEX 
 
This section deals with the biological structures and phenomena underlying
human vision. Observing structure in terms of Connection and Functionality separately
presents a more coherent understanding of it. We will later observe both in
conjunction, when they are implemented in deep learning systems.
 
Figure 1: Layered Architecture of Neo-cortex 
Neocortex Layered Architecture: Being the youngest part of the brain in evolutionary
terms, and the biggest by volume, the neocortex is considered to be the center of all
higher cortical functions. Further evidence of its importance is the fact that humans
have almost twice as many neurons in their neocortex as chimpanzees, our closest living
relatives. One of the prevalent hypotheses for explaining cortical computation, the
single algorithm theory [5], suggests the presence of a single algorithm applied in each
layer of the cortex. Considering that the cortex is made up of stacked layers with
uniform structure, it is reasonable to assume that this architecture is layered and
hierarchical and uses repetitive application of the same algorithm. This general
structure of the neocortex is independent of function, i.e. the visual cortex is almost
similar in structure to the auditory cortex. It is well documented that there are roughly
six to seven layers in the neocortex. The input from the thalamus enters Layer 4, which
feeds into layers 2/3 and then 1, and there are feedback connections from these layers to
layers 5 and 6. The number of feedback connections in the cortex is almost seven times
that of feedforward connections. This kind of architecture is very effective for reducing
redundancy in the incoming signal. One of the models that explores this idea is the
predictive coding model [6], which uses a sparse-autoencoder-like algorithm to predict
the value of a pixel from the intensity values of surrounding pixels when applied to
natural images. Similarly, there is another model called HTM (Hierarchical Temporal
Memory) [7]–[9] by Jeff Hawkins, inspired by Lee and Mumford’s Bayesian model of
cognition, which does a similar kind of prediction in both space and time. While the aim
of Rao and Ballard’s predictive coding model was to increase coding efficiency by using
feedback, the aim of HTM is to improve prediction. One of the issues with HTM is that it
uses an SDR (Sparse Distributed Representation) to convert raw signals into encoded
ones, where each bit has a semantic meaning, before they can be fed into the learning
network. As our aim is to study visual processing in the cortex, we do not venture
further into HTM, owing to this limitation. From the connections described above, we can
infer what kinds of learning algorithms may be employed in our cortex.
	   6	  
Learning Algorithms 
As discussed, most of the popular learning algorithms reduce the dimension of the
representation and result in successively sparser representations of the given data.
Independent Component Analysis (ICA) is one such algorithm; it represents the data in
terms of weights of independent components, which are linearly independent of each other.
Thus the dimension of the internal representation ideally has to be equal to that of the
data. But as the basis vectors/functions are linearly independent, progressive
application of ICA on the data will weed out redundancies and thus induce sparseness. One
of the conditions for calculating ICA is $WW^T = I$, where $W$ is the weight matrix of
the basis vectors and $I$ is the identity matrix. This hard constraint makes the problem
difficult to optimize. One workaround is RICA (Reconstruction ICA) [10] by Andrew Ng’s
team at Stanford, which adds a reconstruction term to the cost in place of the constraint
and iteratively tries to minimize this cost function.
 
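ICA optimization:
$$\min_{W} \sum_{i=1}^{m} \sum_{j=1}^{k} g\left(W_j x^{(i)}\right), \quad \text{subject to } WW^T = I$$

where $g(\cdot)$ is a non-linear convex function like $\log(\cosh(\cdot))$, $W$ is the
weight matrix with $W \in \mathbb{R}^{k \times n}$, $n$ is the dimension of each data
point, $k$ is the number of features, $W_j$ is one feature (row) in $W$, and $m$ is the
total number of data points available.
ICA Reconstruction Optimization:
$$\min_{W} \frac{\lambda}{m} \sum_{i=1}^{m} \left\| W^T W x^{(i)} - x^{(i)} \right\|_2^2 + \sum_{i=1}^{m} \sum_{j=1}^{k} g\left(W_j x^{(i)}\right)$$
where $\lambda$ weights the reconstruction term against the sparsity term.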
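The two objectives above can be compared numerically. The following is a minimal sketch in plain Python, assuming $g(z) = \log(\cosh(z))$ as named in the text; the helper names (`matvec`, `transpose`, `ica_penalty`, `rica_cost`) and the list-of-rows matrix layout are illustrative choices, not taken from the thesis or from [10].

```python
import math


def g(z):
    """Smooth sparsity penalty g(z) = log(cosh(z)), the example named in the text."""
    return math.log(math.cosh(z))


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]


def ica_penalty(W, X):
    """Sum over data points i and features j of g(W_j x^(i))."""
    return sum(g(f) for x in X for f in matvec(W, x))


def rica_cost(W, X, lam):
    """(lam / m) * sum_i ||W^T W x^(i) - x^(i)||^2 plus the ICA penalty."""
    m = len(X)
    Wt = transpose(W)
    recon = 0.0
    for x in X:
        x_hat = matvec(Wt, matvec(W, x))  # reconstruction W^T W x
        recon += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return lam / m * recon + ica_penalty(W, X)
```

When the rows of `W` are orthonormal and `W` is square, the reconstruction term vanishes and `rica_cost` reduces to `ica_penalty`, which is one way to see how the soft reconstruction term stands in for the constraint $WW^T = I$.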
	   7	  
Another such algorithm is the Sparse Autoencoder (SA). Unlike ICA, the dimension of an
SA’s internal representation is larger than that of the data. Thus the basis vectors can
no longer be linearly independent, but the sparsity constraint pushes the basis vectors
to be as close to mutually orthogonal as possible. Similar to RICA, we reconstruct the
input from the output by applying the same weights to the output and calculate how far
our reconstructed input is from the original input. The weights of the layer are changed
according to the reconstruction error. This can be done by backpropagation [11] using
gradient descent, where we calculate the gradient of the error with respect to each
weight; this gradient controls the change for that weight, and the process is repeated
every iteration.
Both of the above algorithms are linear representations of the given data, but it is
conceivable that all data can be represented as a non-linear combination of basis vectors
[11], [12]. In order to introduce non-linearity and simultaneously maintain the ease of
operation of the above algorithms, they are implemented as stacked layers, each running
the same algorithm on the output of the lower layer, thus making the entire set-up
non-linear. One of the demerits of this set-up is that the gradient descent algorithm can
now get stuck in local minima. While this problem will always persist, we can use SGD
(Stochastic Gradient Descent), dropout [12], etc. to mitigate it.
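The reconstruct-and-correct loop described above can be sketched for a single layer. This is a simplified illustration, assuming a linear layer with tied weights (the same matrix `W` encodes and decodes) and no sparsity term; the function names and the learning rate are hypothetical, not the thesis implementation.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]


def reconstruction_loss(W, x):
    """0.5 * ||W^T (W x) - x||^2 for a tied-weight linear autoencoder."""
    h = matvec(W, x)                 # encode
    x_hat = matvec(transpose(W), h)  # decode with the same weights
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x_hat, x))


def reconstruction_gradient(W, x):
    """Gradient of the loss with respect to every weight W[a][b]."""
    h = matvec(W, x)
    r = [a - b for a, b in zip(matvec(transpose(W), h), x)]  # residual
    Wr = matvec(W, r)
    # Decoder path contributes h[a] * r[b]; encoder path contributes Wr[a] * x[b].
    return [[h[a] * r[b] + Wr[a] * x[b] for b in range(len(x))]
            for a in range(len(W))]


def gradient_step(W, x, learning_rate):
    """One gradient descent update of the weights on a single input x."""
    G = reconstruction_gradient(W, x)
    return [[w - learning_rate * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, G)]
```

Calling `gradient_step` repeatedly on inputs drawn one at a time from the training set gives the stochastic variant (SGD) mentioned in the text; a real sparse autoencoder would also pass `h` through a non-linearity and add a sparsity penalty.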
	   8	  
 













	   	  
Figure 3: Stacked Autoencoder 
Each neuron in an SA network has an activation function such as
$$h_{W,b}(x) = f(W^T x) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$$
where $W_i$ are the weights of a network, $x_i$ is the input and $b$ is a bias. A common
choice for $f$ is the sigmoid function $f(z) = 1/(1 + \exp(-z))$; the hyperbolic tangent
(tanh) is another.
The following equations are frequently used in SA.

Cost function to be optimized:
$$J(W,b) = \left[\frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2\right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(W_{ji}^{(l)}\right)^2$$
where $m$ is the number of training examples $(x^{(i)}, y^{(i)})$, $n_l$ is the number of
layers, $s_l$ is the number of units in layer $l$, and $W_{ji}^{(l)}$ is the weight
connecting unit $i$ in layer $l$ to unit $j$ in layer $l+1$.

Here, $\lambda$ is a regularization factor and the second term of the above equation
(also called regularization term) is used to avoid a condition where weights are tuned to
predict accurately for training data only and don’t perform well for unseen data, a
condition called overfitting.
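The regularized cost described above can be evaluated directly once the network outputs are known. The following is a small sketch under stated assumptions: predictions and targets are plain Python lists, each weight matrix is a list of rows, and the names `squared_error` and `regularized_cost` are illustrative only.

```python
def squared_error(prediction, target):
    """One-half squared error for a single training example."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(prediction, target))


def regularized_cost(predictions, targets, weight_matrices, lam):
    """Average squared error plus (lam / 2) times the sum of squared weights.

    Bias terms are deliberately left out of the penalty, so only the
    weight matrices are passed in.
    """
    m = len(predictions)
    data_term = sum(squared_error(p, t)
                    for p, t in zip(predictions, targets)) / m
    decay_term = 0.5 * lam * sum(w * w
                                 for W in weight_matrices
                                 for row in W
                                 for w in row)
    return data_term + decay_term
```

Raising `lam` makes large weights more expensive relative to the fit on the training data, which is the mechanism by which the regularization term discourages overfitting.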
Update rule for weights and biases, based on the value of the cost/error function,
performed on every iteration on a data element:
$$W_{ij}^{(l)} := W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b)$$
$$b_{i}^{(l)} := b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b)$$
where $\alpha$ is the learning rate.

Gradient of the cost function, which determines the rate of change of the network
parameters:
$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) = \left[\frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x^{(i)},y^{(i)})\right] + \lambda W_{ij}^{(l)}$$
$$\frac{\partial}{\partial b_{i}^{(l)}} J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial b_{i}^{(l)}} J(W,b;x^{(i)},y^{(i)})$$
(inside the sums, $i$ runs over the $m$ training examples, as in the cost function). The
two expressions differ slightly because weight decay is applied to $W$ but not to $b$.
The per-example partial derivatives are computed with the backpropagation algorithm.

Sparsity constraint:
$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} \left[a_j^{(2)}(x^{(i)})\right]$$
where $a_j^{(2)}(x^{(i)})$ is the activation of a particular hidden node $j$ for an input
$x^{(i)}$, so that $\hat{\rho}_j$ is the average activation of node $j$ over the training
set.
This term is equated to a sparsity parameter $\rho$, which is usually kept at a very low
value (for example 0.05). This is done to make sure that the total number of activations
in the network is kept at a minimum. Now, the divergence between these two terms has to
be kept at a minimum. This divergence is captured by the Kullback-Leibler (KL)
divergence, which is of the form:
$$\sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{s_2} \left[\rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}\right]$$
where $s_2$ is the number of neurons in the hidden layer. The penalty is 0 when
$\hat{\rho}_j = \rho$ and grows as $\hat{\rho}_j$ diverges from $\rho$.

This term is added to our earlier cost function to result in this final cost function:
$$J_{\text{sparse}}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$$
where $\beta$ controls the weight of the sparsity penalty term.

*Further implementation details are beyond the scope of this report; please refer to the
given citation for further details.
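The sparsity penalty can be sketched as follows. This is an illustrative example only, assuming sigmoid-style activations that lie strictly between 0 and 1 and are stored as one list per training example; `average_activations`, `kl_bernoulli` and `sparse_cost` are hypothetical names, and the base cost is passed in as a number rather than computed from a network.

```python
import math


def average_activations(activations):
    """Mean activation of each hidden unit over the training set.

    `activations` holds one list per training example, with one entry
    per hidden unit.
    """
    m = len(activations)
    return [sum(column) / m for column in zip(*activations)]


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli variables with means rho and rho_hat.

    Both arguments must lie strictly between 0 and 1.
    """
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparse_cost(base_cost, activations, rho, beta):
    """Base cost plus beta times the summed KL penalty over hidden units."""
    rho_hats = average_activations(activations)
    return base_cost + beta * sum(kl_bernoulli(rho, r) for r in rho_hats)
```

The penalty is zero only when every hidden unit's average activation equals the target `rho`, so minimizing `sparse_cost` pulls the average activations toward that small value.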
As the architecture discussed above is layered, it is very difficult to optimize the
network by gradient descent alone, i.e. by calculating the gradient for each of the
weights in the network and backpropagating the error at the top-most layer into each node
in every lower layer. One commonly used workaround is called ‘greedy learning’, i.e.
learning the layers one at a time in an unsupervised manner by calculating the
reconstruction error of a layer and adjusting the weights in that layer based on this
error only. This is called pre-training.
As the size of such networks, like the one used for generic object detection, can be very
large, there is a risk of overfitting, where the network performs well on training data
but not on general data, as the parameters are optimized only for the data seen during
training. One of the techniques used to avoid such a scenario is called drop-out,
developed by G. Hinton’s lab at the Univ. of Toronto. According to drop-out [11], [12],
the contribution of any node in the network is dropped with a probability of 50% during
training. This introduces a level of stochasticity to the network and forces the network
to learn a more generic form of representation.
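The drop-out step can be illustrated with a few lines of Python. This is a minimal sketch rather than the procedure from [11], [12] in full: the function names are hypothetical, and the inference-time scaling shown in `scale_for_inference` is a common companion step that the text above does not describe.

```python
import random


def dropout(activations, drop_probability=0.5, rng=random):
    """Zero each activation independently with probability drop_probability.

    Intended for training only; `rng` may be the random module or a
    seeded random.Random instance for reproducible runs.
    """
    return [0.0 if rng.random() < drop_probability else a
            for a in activations]


def scale_for_inference(activations, drop_probability=0.5):
    """Scale activations by the keep probability when no units are dropped.

    Assumption: this compensates for all units being active at test time.
    """
    keep = 1.0 - drop_probability
    return [a * keep for a in activations]
```

Because a different random subset of nodes is silenced on each pass, no node can rely on a specific partner being present, which is the source of the more generic representation mentioned above.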
Our argument above relied on the number of hidden units s2 being small.
But even when the number of hidden units is large (perhaps even greater
than the number of input pixels), we can still discover interesting structure,
by imposing other constraints on the network. In particular, if we impose
a sparsity constraint on the hidden units, then the autoencoder will still
discover interesting structure in the data, even if the number of hidden units
is large.
Informally, we will think of a neuron as being “active” (or as “firing”)
if its output value is close to 1, or as being “inactive” if its output value is
close to 0. We would like to constrain the neurons to be inactive most of the
time.3
Recall that a(2)j denotes the activation of hidden unit j in the autoencoder.
However, this notation doesn’t make explicit what was the input x that led
to that activation. Thus, we will write a(2)j (x) to denote the activation of this












be the average activation of hidden unit j (averaged over the training set).
We would like to (approximately) enforce the constraint
⇢̂j = ⇢,
where ⇢ is a sparsity parameter, typically a small value close to zero (say
⇢ = 0.05). In other words, we would like the average activation of each
hidden neuron j to be close to 0.05 (say). To satisfy this constraint, the
hidden unit’s activations must mostly be near 0.
To achieve this, we will add an extra penalty term to our optimization
objective that penalizes ⇢̂j deviating significantly from ⇢. Many choices of






+ (1  ⇢) log 1  ⇢
1  ⇢̂j
.
Here, s2 is the number of neurons in the hidden layer, and the index j is
summing over the hidden units in our network. If you are familiar with the
3
This discussion assumes a sigmoid activation function. If you are using a tanh activa-
tion function, then we think of a neuron as being inactive when it outputs values close to
-1.
14





where KL(⇢||⇢̂j) = ⇢ log ⇢⇢̂j + (1   ⇢) log
1 ⇢
1 ⇢̂j is the Kullback-Leibler (KL)
divergence between a Bernoulli random variable with mean ⇢ and a Bernoulli
random variable with mean ⇢̂j. KL-divergence is a standard function for
measuring how di↵erent two di↵erent distributions are. (If you’ve not seen
KL-divergence before, don’t worry about it; everything you need to know
about it is contained in these notes.)
This penalty function has the property that KL(⇢||⇢̂j) = 0 if ⇢̂j = ⇢, and
otherwise it increases monotonically as ⇢̂j diverges from ⇢. For example, in
the figure below, we have set ⇢ = 0.2, and plotted KL(⇢||⇢̂j) for a range of
values of ⇢̂j:
We see that the KL-divergence reaches its minimum of 0 at ⇢̂j = ⇢, and blows
up (it actually appr aches 1) as ⇢̂j approaches 0 or 1. Thus, minimizing
this penalty term has he e↵ect of causing ⇢̂j to be close ⇢.
Our overall cost function is now
$$J_{\mathrm{sparse}}(W, b) = J(W, b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j),$$
where $J(W, b)$ is as defined previously, and $\beta$ controls the weight of the
sparsity penalty term. The term $\hat{\rho}_j$ (implicitly) depends on $W, b$ also, because
it is the average activation of hidden unit $j$, and the activation of a hidden
unit depends on the parameters $W, b$.
To incorporate the KL-divergence term into your derivative calculation,
there is a simple-to-implement trick involving only a small change to your
Both ICA and autoencoders are deterministic methods of learning. There are also
families of learning algorithms that are probabilistic. The most popular of these is the
Restricted Boltzmann Machine (RBM) [13]. An RBM has two visible sub-layers (input and
output) and one hidden sub-layer. The state of each hidden sub-layer node is calculated
from a linear weighted sum of the input, which is equivalent to calculating the probability
of activation of a hidden node conditioned on a given input. The same is done for the output
sub-layer. An RBM is an energy model, i.e. the net energy of the network is calculated
from a particular state of all the nodes in the layer. Optimization is achieved by
changing the weights such that the energy of the network is minimized. This requires
the posterior probability of an input node's activation conditioned on the state
of the hidden layer. As this quantity is not available directly, it is approximated using
iterative Gibbs sampling, which takes a long time and whose accuracy depends on the number
of iterations. Furthermore, the change in weights that minimizes the energy function is
computed using Contrastive Divergence, which is similar to gradient descent in autoencoders.
For the rest of the report we will concentrate on autoencoders for learning. 
Functionality 
One of the theories explaining cortical computation, other than the single algorithm
theory, is functionally modular design [5]. This theory states that the brain is highly
modular and consists of specialized regions, not just at the level of whole cortices (such
as visual, auditory etc.) but also within a cortex. Evidence supporting this claim includes
the following: V1 is responsible for filtering edges (although about 30% of its cells are
thought to be direction selective), the IT region has cells that are selective for faces
(and also some non-face objects), and the hippocampus has 'place cells' that respond to
particular locations in space. These findings suggest a task-wise breakdown of processing.
Thus, a reconciliation of the single algorithm theory and modular design would be an
architecture that applies the same algorithm to different data (either pre-processed with
the same algorithm, raw data or data from a different processing area) and produces
functionally different results.
One of the traits required for the above design is a hierarchical and multi-modal
structure. Most of the work in deep learning has concentrated on hierarchical
structures. A prime example of multi-modality is the basal ganglia, which receive
input from the visual and auditory cortices, the limbic system, and the
amygdala, which is responsible for emotions. As this system is responsible for learning,
this type of multi-modal operation facilitates coherence and also supports the
long-standing belief that emotions can facilitate learning and recall.
Given the esoteric nature of generic multi-modality in the brain, it is better to
concentrate on hierarchy for now. The most prevalent explanation of the visual system is
the two-pathway theory [14]. The output from retina cells (bipolar and ganglion cells) goes
to the LGN (Lateral Geniculate Nucleus, also responsible for consolidating signals coming
from both eyes), forms a nerve bundle and is fed into V1. Operations such as local contrast
normalization and whitening, which reduce redundancy and shed unnecessary
information, are done in the LGN and retina cells. V1 cells are traditionally thought of as
edge detectors or Gabor wavelet detectors. These cells are distributed in space, such that
a neighborhood of cells is responsible for a patch in space. They are localized in
orientation and frequency space, i.e. cells responsible for the same patch in the image
respond to different Gabor orientations and spatial frequencies. The output of V1 is fed
into two different pathways: ventral and dorsal.
 
Ventral Pathway (V1-V2-V4-IT) - This is called the 'what' pathway. It is responsible for
object recognition and thus has to deal with invariances such as spatial invariance,
rotational invariance and partial occlusion. As V2 and V4 are not easily accessible, not
much has been documented on the inner workings of these layers. But this entire pathway
can be thought of as successively capturing more complex structures of objects such as
edges, shapes, textures etc. As there have been many psychophysical studies of this
pathway, especially of V1 and IT, the amount of literature available is huge, which is why
the deep learning community is most attracted to this pathway. Three models have been
inspired by this pathway: HMAX pooling, LCN and reLU.
HMAX – Developed by Riesenhuber and Poggio [15]. This is a simple method to tackle
spatial invariance. The same object occurring at different locations in the scene doesn't
have to be learnt explicitly for each location, which implies a learning technique that is
independent of location. HMAX (hierarchical max pooling) is therefore a layered
architecture, where each node of a layer pools over a predetermined region of the lower
layer. This pooling can be a mean, max or any other operation. Biologically, this is
similar to keeping the most active response within a population, which is a kind of
down-sampling. Pooling gradually decreases the amount of feature map/filter map
information to be stored and processed in the network. In addition, the most active set of
nodes in one corner of the image can move to the center after repeated application of
pooling, provided that they are active enough, so the final learning layer can learn this
pattern of activation regardless of where in the scene it occurred.
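The pooling step described above can be illustrated with a minimal sketch in plain Python. The function name `max_pool2d`, the 2x2 window and the stride of 2 are assumptions chosen for illustration; they are not taken from [15].

```python
def max_pool2d(fmap, size=2, stride=2):
    """Max-pool a 2D feature map (a list of rows) over square windows."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Keep only the most active response inside the window.
            window = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(max(window))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    # The same strong response at two nearby positions of a 4x4 map.
    a = [[9, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    b = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    print(max_pool2d(a))
    print(max_pool2d(b))
```

Both inputs pool to the same 2x2 map, which is the small-shift invariance the text refers to. The map is also four times smaller, which is the down-sampling effect.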
While HMAX has been very successful at translation invariance, it has drawn a lot of
criticism because it discards many details. Traits belonging to the same object can be
lost due to this winner-takes-all configuration. In the case of multiple objects, max
pooling can cause the details of one of the objects to be completely lost or, even worse,
the details of both objects can be skewed to form a high-level activation that is
characteristic of neither, leading to faulty classification. Because no low-level detail is
preserved, it becomes very hard for a high-level layer to detect multiple objects. There has
been some work in this regard, such as capsule theory [16], where the detail propagated to
the higher layer includes the absolute position of the pattern that activated it. One
problem with this idea is that the algorithm can get confused if the same pattern occurs
several times in the same scene, making the absolute location of the pattern ambiguous.
Also, this design recommends that each node have access to the entire image, thus removing
the concept of a local receptive field (RF), which is not in line with biology and is
computationally expensive.
While dealing with spatial invariance with local RFs, it is intuitive to assert that
patterns occurring in one part of the image can also occur in a different part of the image.
Thus the patterns learnt from one part can be reused all over the image. This led to tied
weights, where the filter parameters learnt from one part of the image are applied all over
the image to get feature maps. A different, intermediate approach is used in [17], where
only filters that lie within a small range of each other have tied weights. But when filter
parameters are shared, a larger number of filters is needed to incorporate all possible
variations.
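The saving from tied weights can be made concrete with a small parameter count. This is a sketch under assumed sizes (a 32x32 input, 5x5 filters, 6 filters, no biases); these numbers are hypothetical and are not taken from [17] or from this thesis.

```python
def untied_params(out_h, out_w, k, n_filters):
    """Weights needed if every output position has its own k x k filter."""
    return out_h * out_w * k * k * n_filters


def tied_params(k, n_filters):
    """Weights needed if one k x k filter is shared over all positions."""
    return k * k * n_filters


if __name__ == "__main__":
    image, k, n_filters = 32, 5, 6   # hypothetical sizes
    out = image - k + 1              # positions per axis without padding
    print("untied:", untied_params(out, out, k, n_filters))
    print("tied:  ", tied_params(k, n_filters))
```

With these sizes the untied layer needs 784 times as many weights as the tied one, one copy of each filter per output position.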
One alternative to HMAX is maintaining multiple copies of feature
maps/response activations at multiple scales and using the one that gives the best
detection. This was heavily used in traditional computer vision techniques such as
SIFT/HOG and also in some earlier implementations of HMAX, but as the number of
layers in NNs grew, it had to be dropped because storage became an issue.
           
Figure 4: HMAX Model      
 















Fig. 1. Three capsules of a transforming auto-encoder that models translations. Each
capsule in the figure has 3 recognition units and 4 generation units. The weights on the
connections are learned by backpropagating the discrepancy between the actual and
target outputs.
a pure translation of the retinal image and the cortex has non-visual access to
information about eye-movements.
2 Learning the First Level of Capsules
Once pixel intensities have been converted into the outputs of a set of active,
first-level capsules each of which produces an explicit representation of the pose
of its visual entity, it is relatively easy to see how larger and more complex visual
entities can be recognized by using agreements of the poses predicted by active,
lower-level capsules. But where do the first-level capsules come from? How can
an artificial neural network learn to convert the language of pixel intensities
to the language of pose parameters? That is the question addressed by this
paper and it turns out that there is a surprisingly simple answer which we call a
“transforming auto-encoder”. We explain the idea using simple 2-D images and
capsules whose only pose outputs are an x and a y position. We generalize to
more complicated poses later.
Consider the feedforward neural network shown in figure 1. The network is
deterministic and, once it has been learned, it takes as inputs an image and
desired shifts, ∆x and ∆y, and it outputs the shifted image. The network is
composed of a number of separate capsules that only interact at the final layer
when they cooperate to produce the desired shifted image. Each capsule has its
own logistic “recognition units” that act as a hidden layer for computing three
numbers, x, y, and p, that are the outputs that the capsule will send to higher
levels of the vision system. p is the probability that the capsule’s visual entity is
LCN – Local Contrast Normalization [18]. This is also called divisive normalization,
where the value at a point in a feature response map is divided by the mean response of all
other values of that feature map in a particular neighborhood. This decreases redundancy in
response maps. The normalization can also be subtractive, where the Gaussian-weighted
mean of the responses of its neighbors is subtracted from a pixel in a feature map. In
either case, this technique is similar to predictive coding, where reducing spatial
redundancy increases coding efficiency, because features usually extend over long distances
beyond local RFs. Redundancy can also be reduced across filters, as responses in a
neighborhood tend to capture similar features (the parameters of neighboring filters are
closely associated, as in MT columns). The normalization can also be applied across
channels, e.g. Red, Green and Blue channels. Most of the models of LCN in use are rigid and do not









= value of response map f at co-ordinate (x, y)
N = Neighborhood to be normalized.
f = output of divisive normalization.
Feature map normalization within one
feature map in a single channel.
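Divisive normalization within a single feature map, as described in the text, can be sketched as follows. This is an illustration under stated assumptions, not the formula from [18]: the neighborhood N is taken to be a square window of a given `radius`, the center value is excluded from the mean, responses are assumed non-negative, and a small `eps` is added to avoid division by zero. The name `divisive_normalize` is hypothetical.

```python
def divisive_normalize(fmap, radius=1, eps=1e-6):
    """Divide each response by the mean of the other responses around it.

    fmap is a list of rows with non-negative values and at least two entries.
    """
    rows, cols = len(fmap), len(fmap[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Neighborhood N: square window clipped at the map border,
            # excluding the center position itself.
            neighbours = [
                fmap[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
                if (i, j) != (r, c)
            ]
            mean = sum(neighbours) / len(neighbours)
            out[r][c] = fmap[r][c] / (mean + eps)
    return out


if __name__ == "__main__":
    flat = [[2.0] * 3 for _ in range(3)]          # uniform response
    peak = [[1.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 1.0]]
    print(divisive_normalize(flat)[1][1])
    print(divisive_normalize(peak)[1][1])
```

A uniform region normalizes to about 1 everywhere, while a response that stands out from its surroundings keeps a large value. This is how the operation suppresses redundant responses.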
reLU – Rectified Linear Unit [12], [19], [20]. Until now we have assumed that the output
of a model neuron is the weighted sum of the activations of its lower nodes, but in biology
a neuron fires only when the sum of all its weighted inputs is above a particular threshold.
This threshold is often a soft threshold, so the response function isn't a step function
but is quite smooth, such as a 'tanh' or logistic function. These are called activation
functions. When used in a network, such a function can create a problem during
back-propagation. In gradient descent, the updated weight is calculated by subtracting from
the current weight a value proportional to the weighted error from the layer above. Now, the
gradient of a tanh/logistic function is close to zero except in the neighborhood of zero.
This implies that if a neuron is saturated, i.e. its weighted input is far from zero (which
can occur during the initial phase of training, when error values are large), the correction
term is going to be close to zero, due to the near-zero value of the activation
gradient. Thus, the network may never optimize, or might take very long to do so. This is
called the decaying (vanishing) gradients problem. It is mitigated by changing the
activation function to one whose gradient is non-zero even at large positive values, such
as the max function (reLU(x) = max(0, x)) [20]. This function shows the kind of asymmetry
that is evident in cortical neurons, as shown in figure 7.
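The difference in gradient behaviour can be seen numerically. The following plain-Python sketch evaluates the derivative of tanh, the logistic function and reLU at a few inputs; the function names are illustrative.

```python
import math


def tanh_grad(x):
    """Derivative of tanh(x)."""
    return 1.0 - math.tanh(x) ** 2


def logistic_grad(x):
    """Derivative of the logistic function 1 / (1 + exp(-x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu(x):
    """reLU(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of reLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0


if __name__ == "__main__":
    for x in (0.0, 2.0, 5.0, 10.0):
        print(f"x={x:5.1f}  tanh'={tanh_grad(x):.2e}  "
              f"logistic'={logistic_grad(x):.2e}  reLU'={relu_grad(x):.1f}")
```

For large positive inputs the tanh and logistic gradients shrink toward zero, while the reLU gradient stays at 1, so the error signal is not attenuated by the activation function.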
 
Figure 7: Left: Biological neuron Response Right: Common Artificial neuron response 
Xavier Glorot, Antoine Bordes, Yoshua Bengio
Figure 1: Left: Common neural activation function motivated by biological data. Right: Commonly
used activation functions in neural networks literature: logistic sigmoid and hyperbolic tangent (tanh).
hyperbolic tangent (see Figure 1, right), which are
equivalent up to a linear transformation. The hy-
perbolic tangent has a steady state at 0, and is
therefore preferred from the optimization stand-
point (LeCun et al., 1998; Bengio and Glorot,
2010), but it forces an antisymmetry around 0
which is absent in biological neurons.
2.2 Advantages of Sparsity
Sparsity has become a concept of interest, not only in
computational neuroscience and machine learning but
also in statistics and signal processing (Candes and
Tao, 2005). It was first introduced in computational
neuroscience in the context of sparse coding in the vi-
sual system (Olshausen and Field, 1997). It has been
a key element of deep convolutional networks exploit-
ing a variant of auto-encoders (Ranzato et al., 2007,
2008; Mairal et al., 2009) with a sparse distributed
representation, and has also become a key ingredient
in Deep Belief Networks (Lee et al., 2008). A sparsity
penalty has been used in several computational neuro-
science (Olshausen and Field, 1997; Doi et al., 2006)
and machine learning models (Lee et al., 2007; Mairal
et al., 2009), in particular for deep architectures (Lee
et al., 2008; Ranzato et al., 2007, 2008). However, in
the latter, the neurons end up taking small but non-
zero activation or firing probability. We show here that
using a rectifying non-linearity gives rise to real zeros
of activations and thus truly sparse representations.
From a computational point of view, such representa-
tions are appealing for the following reasons:
• Information disentangling. One of the
claimed objectives of deep learning algo-
rithms (Bengio, 2009) is to disentangle the
factors explaining the variations in the data. A
dense representation is highly entangled because
almost any change in the input modifies most of
the entries in the representation vector. Instead,
if a representation is both sparse and robust to
small input changes, the set of non-zero features
is almost always roughly conserved by small
changes of the input.
• Efficient variable-size representation. Different inputs may contain different
amounts of information and would be more conveniently represented using a
variable-size data-structure, which is common in computer representations of
information. Varying the number of active neurons allows a model to control the
effective dimensionality of the representation for a given input and the required
precision.
• Linear separability. Sparse representations are also more likely to be linearly
separable, or more easily separable with less non-linear machinery, simply because the
information is represented in a high-dimensional space. Besides, this can reflect
the original data format. In text-related applications for instance, the original raw
data is already very sparse (see Section 4.2).
• Distributed but sparse. Dense distributed representations are the richest
representations, being potentially exponentially more efficient than purely local ones
(Bengio, 2009). Sparse representations' efficiency is still exponentially greater,
with the power of the exponent being the number of non-zero features. They may
represent a good trade-off with respect to the above criteria.
Nevertheless, forcing too much sparsity may hurt predictive performance for an equal
number of neurons, because it reduces the effective capacity of the model.
Now we can put all of the above techniques together to form a network that can learn
patterns. Of all the neural networks (NN), the one that has recently been most successful
is the Convolutional Neural Network.
  
CHAPTER 3 
CONVOLUTIONAL NEURAL NETWORK (CNN) 
 
CNNs (Convolutional Neural Networks) [21] are the state of the art in computer vision.
From signal processing we know that convolution can be interpreted as correlation with a
flipped kernel. Thus, convolving an image patch with a filter is similar to measuring how
similar that patch is to the feature that the filter is selective for. The neural basis of
convolution isn't well documented, but from psychophysical experiments we can
infer that convolution-like processing underlies at least the early stages of the ventral
pathway. Artificial CNNs were successfully applied to handwritten digit recognition in
[22], their first application to a real-world problem. Recent developments that
facilitated the building of large networks led to their revival for natural image
processing. The concept of a CNN is fairly generic: stacked layers of convolution
followed by stacked layers of unsupervised learning, which can then be fed into a
classifier.
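The relation between convolution and correlation can be shown with a small sketch in plain Python. The helper names `correlate2d`, `flip` and `convolve2d` are illustrative, and only the "valid" region (no padding) is computed.

```python
def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation of two lists of rows."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


def flip(kernel):
    """Flip a kernel vertically and horizontally."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image, kernel):
    """Valid-mode 2D convolution: correlation with the flipped kernel."""
    return correlate2d(image, flip(kernel))


if __name__ == "__main__":
    edge = [[1, -1], [1, -1]]                   # a simple vertical-edge filter
    image = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
    # The correlation response peaks where the patch looks like the filter.
    print(correlate2d(image, edge))
    print(convolve2d(image, edge))
```

For a filter that is symmetric under this flip the two operations coincide, which is why the distinction is often ignored when the filters are learnt rather than designed.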
 
Figure 8: Yann LeCun's CNN (first successful artificial CNN)
LeNet – As can be observed from figure 8 above, the feature maps become smaller
in successive stages of the network, and so does the receptive field of each node/neuron.
This is due to the intermittent subsampling/pooling layers, which use mean pooling rather
than max pooling. Even though the RF of neurons in higher layers is small, it actually
represents a large region in image space, which can be observed by up-sampling back through
all the pooling layers. The large effective receptive fields of higher neurons help to
achieve some spatial invariance. Here, the filters can be thought of as localized
autoencoders, and error propagation is also similar. Also, the convolutions are not
localized in 2D but are 3D, which results in an uneven spread of feature maps across
different layers and is not very intuitive. However, this network is not nearly as big as
the state of the art. This is because of all the reasons discussed previously, such as
gradient/error decay during backpropagation, risk of overfitting, overdependence on
initial conditions (due to the absence of reLU) and, most importantly, the lack of
computational power and the huge latency due to the lack of parallelism in the
implementation.
AlexNet – Developed by Alex Krizhevsky at the University of Toronto [23], this network
won the ImageNet object recognition challenge in 2012. The architecture, shown in figure 9,
can be considered a scaled-up version of LeNet with modifications such as max pooling,
reLU and dropout with pre-training. The implementation was run on GPUs, which
decreased execution time to a few days, depending on how much pre-training is done.
 
Figure 9: AlexNet Architecture 2012 
This network had about 60 million parameters and 650 thousand neurons. With this
many parameters, overfitting was avoided using dropout. It was trained on about 1 million
images from the ImageNet database. Each of these images was 256x256, and 10 crops of size
224x224 were taken from each image. Thus the network was effectively trained on 10 million
images. This model has five convolution layers, each followed by a reLU layer and, for
some of them, a max pooling layer. The final output is connected to two consecutive fully
connected layers. Regularization was used to avoid overfitting in these fully connected
layers. Regularization implies adding another term to the cost/error function that puts an
extra penalty on the weights, thus preventing them from reaching very high values. This is
similar to imposing a sparsity constraint on the weights. The pooling regions were
overlapped, thus capturing local translation invariance. The final output layer had 1000
nodes, and hence the layer can ideally discriminate between 1000 categories (equal to the
total number of classes in ImageNet 2012). The objective of the final layer was to maximize
the log probability of the correct prediction across all image classes. Images were
processed in mini-batches rather than one-by-one, and weights were updated after each
mini-batch was processed.
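The effect of a weight penalty on a mini-batch update can be sketched as follows. This is an illustration only: it assumes an L2 penalty of the form 0.5 * weight_decay * w^2 added to the cost, and the function name `sgd_step`, the learning rate and the decay value are hypothetical choices, not the settings of [23].

```python
def sgd_step(weights, batch_grads, lr=0.01, weight_decay=0.0005):
    """One mini-batch update with an L2 weight penalty.

    weights     : list of floats
    batch_grads : one gradient list per example in the mini-batch
    """
    n = len(batch_grads)
    updated = []
    for k, w in enumerate(weights):
        # Average the data gradient over the mini-batch.
        mean_grad = sum(g[k] for g in batch_grads) / n
        # The penalty adds weight_decay * w to the gradient, which pulls
        # large weights back toward zero.
        updated.append(w - lr * (mean_grad + weight_decay * w))
    return updated


if __name__ == "__main__":
    w = [1.0, -2.0]
    grads = [[0.5, 0.0], [1.5, 0.0]]   # gradients from two examples
    print(sgd_step(w, grads, lr=0.1, weight_decay=0.5))
```

The second weight receives no data gradient in this example, yet it still moves toward zero. That is the limiting effect on weight magnitude described in the paragraph above.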
Table 1: Layer-wise algorithm distribution 
 
This model set the state of the art in computer vision for object recognition, achieving a
top-5 error rate of 17% and a top-1 error rate of 37.5%. Subsequent models by others have
improved upon this model mainly through two kinds of changes:
1. Changing the size of filter kernels and the stride of pooling layers.
2. Changing hyperparameters, such as increasing the number of parameters (filters and
fully connected neurons).
Zeiler's CNN – This network [24] was presented at ECCV 2014 and is similar to AlexNet.
The only changes were that the size of the filters in the 1st layer was decreased from
11x11 to 7x7, and the stride of the filters was decreased from 4 to 2. The convolutions in
layers 3, 4 and 5 were densely connected, unlike in AlexNet, as the authors didn't
partition their model across GPUs. These changes improved the performance of their network
over AlexNet by 1.4%.
 
Figure 10: Zeiler's CNN architecture 
 
Figure 11: Coupled Convolution and Deconvolution Network 
The most interesting point of this paper wasn't the marginal change in architecture,
shown in figure 10, but the reason behind it. In 2010 the authors had published a paper
on deconvolutional NNs [25], shown in figure 11, which produce the low-level stimulus
(in layer 0, i.e. image space) that stimulates a particular trained filter at any higher
layer. This network was coupled with the AlexNet CNN, and the preferred stimuli of filters
in different layers were analyzed. It was observed that in the first layer, most of the filters were


















































Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as
the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y.
The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within
3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature
maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from
the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax
function, C being the number of classes. All filters and feature maps are square in shape.
Layer 1 Layer 2 Layer 3 Layer 4 Layer 5
Figure 4. Evolution of a randomly chosen subset of model features through training. Each layer's features are displayed
in a different block. Within each block, we show a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64].
The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to
pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic
form.









































































































































































































































Figure 5. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: 5
example images undergoing the transformations. Col 2 & 3: Euclidean distance between feature vectors from the original
and transformed images in layers 1 and 7 respectively. Col 4: the probability of the true label for each image, as the
image is transformed.
Visualizing and Understanding Convolutional Networks
invert this, the deconvnet uses transposed versions of
the same filters, but applied to the rectified maps, not
the output of the layer beneath. In practice this means
flipping each filter vertically and horizontally.
Projecting down from higher layers uses the switch
settings generated by the max pooling in the convnet
on the way up. As these switch settings are peculiar
to a given input image, the reconstruction obtained
from a single activation thus resembles a small piece
of the original input image, with structures weighted
according to their contribution toward to the feature
activation. Since the model is trained discriminatively,
they implicitly show which parts of the input image
are discriminative. Note that these projections are not
samples from the model, since there is no generative
process involved.
Layer Below Pooled Maps 
Feature Maps 






























Figure 1. Top: A deconvnet layer (left) attached to a con-
vnet layer (right). The deconvnet will reconstruct an ap-
proximate version of the convnet features from the layer
beneath. Bottom: An illustration of the unpooling oper-
ation in the deconvnet, using switches which record the
location of the local max in each pooling region (colored
zones) during pooling in the convnet.
3. Training Details
We now describe the large convnet model that will be visualized in Section 4. The
architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for
ImageNet classification. One difference is that the sparse connections used in
Krizhevsky's layers 3,4,5 (due to the model being split across 2 GPUs) are replaced with
dense connections in our model. Other important differences relating to layers 1 and 2
were made following inspection of the visualizations in Fig. 6, as described in
Section 4.1.
The model was trained on the ImageNet 2012 training set (1.3 million images, spread over
1000 different classes). Each RGB image was preprocessed by resizing the smallest
dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean
(across all images) and then using 10 different sub-crops of size 224x224 (corners +
center with(out) horizontal flips). Stochastic gradient descent with mini-batch size of
128 was used to update the parameters, starting with a learning rate of $10^{-2}$, in
conjunction with a momentum term of 0.9. We anneal the learning rate throughout
training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is
used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are
initialized to $10^{-2}$ and biases are set to 0.
Visualization of the first layer filters during training reveals that a few of them
dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the
convolutional layers whose RMS value exceeds a fixed radius of $10^{-1}$ to this fixed
radius. This is crucial, especially in the first layer of the model, where the input
images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce
multiple different crops and flips of each training example to boost training set size.
We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU,
using an implementation based on (Krizhevsky et al., 2012).
4. Convnet Visualization
Using the model described in Section 3, we now use the deconvnet to visualize the
feature activations on the ImageNet validation set.
Feature Visualization: Fig. 2 shows feature visualizations from our model once training
is complete. However, instead of showing the single strongest activation for a given
feature map, we show the top 9 activations. Projecting each separately down to pixel
space reveals the different structures that excite a given feature map, hence showing
its invariance to input deformations. Alongside these visualizations we show the
corresponding image patches. These have greater variation than visualizations as the
latter solely focus on the discriminant structure within each patch. For example, in
layer 5, row 1, col 2, the patches appear to have little in common, but the
visualizations reveal that this particular feature map focuses on the grass in the
background, not the foreground objects.
Every max pooling layer acts as
a switch map. The index of the
element having the max value is
stored. During deconvolution,
these indexes are used to
populate elements of lower
layers from values at higher
layers. Deconvolution is just
convolution with a transpose of
the filter under consideration.
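The switch mechanism in the note above can be sketched in plain Python. The function names `max_pool_with_switches` and `unpool`, and the non-overlapping 2x2 windows, are assumptions made for illustration. Only the pooling and unpooling steps are shown, not the transposed-filter convolution.

```python
def max_pool_with_switches(fmap, size=2):
    """Max-pool non-overlapping windows and record where each max came from."""
    pooled, switches = [], []
    for r in range(0, len(fmap) - size + 1, size):
        p_row, s_row = [], []
        for c in range(0, len(fmap[0]) - size + 1, size):
            # Candidate (value, position) pairs inside the window.
            cells = [(fmap[r + i][c + j], (r + i, c + j))
                     for i in range(size) for j in range(size)]
            value, position = max(cells, key=lambda cell: cell[0])
            p_row.append(value)
            s_row.append(position)   # the "switch" for this window
        pooled.append(p_row)
        switches.append(s_row)
    return pooled, switches


def unpool(pooled, switches, shape):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    rows, cols = shape
    out = [[0.0] * cols for _ in range(rows)]
    for p_row, s_row in zip(pooled, switches):
        for value, (r, c) in zip(p_row, s_row):
            out[r][c] = value
    return out


if __name__ == "__main__":
    fmap = [[1, 5, 2, 0], [3, 4, 1, 7], [0, 2, 9, 1], [6, 1, 3, 2]]
    pooled, switches = max_pool_with_switches(fmap)
    print(pooled)
    print(unpool(pooled, switches, (4, 4)))
```

Because the switches depend on the input, the unpooled map keeps the spatial layout of the strongest responses for that particular image.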
sensitive to either high or low spatial frequency, thus mandating a decrease in filter size
and stride so that they can capture middle frequencies too. This coupling with a
deconvolutional network also gave insight into the kind of filtering that goes on inside a
deep CNN (more details are beyond the scope of this report).
R-CNN – All the above networks are sensitive to variation between training and test
conditions. The biggest problem is a disconnect between 'what' and 'where', i.e. where in
image space is the region that our network has classified as 'x'? To solve this, R-CNN
(Regions with CNN features) [26] from Berkeley preprocesses a given image and generates
nearly 2000 candidate regions that may contain an object. This selection can be done by any
of a number of region proposal algorithms in the CV literature. Each of the regions is fed
into the CNN individually, resulting in classification of localized regions, as shown in
figure 12.
 
Figure 12: RCNN 
 
Parameter Server – The architecture of a CNN is fairly generic: stacked
convolution layers, each optionally followed by pooling and reLU layers, and finally a
couple of fully connected layers trained with RICA (the Reconstruction ICA algorithm).
When the size of this network is large (like AlexNet with ~60 million parameters), it is
called a deep neural network. All the above architectures were tightly coupled, i.e. these were run in
Rich feature hierarchies for accurate object detection and semantic segmentation




Object detection performance, as measured on the
canonical PASCAL VOC dataset, has plateaued in the last
few years. The best-performing methods are complex en-
semble systems that typically combine multiple low-level
image features with high-level context. In this paper, we
propose a simple and scalable detection algorithm that im-
proves mean average precision (mAP) by more than 30%
relative to the previous best result on VOC 2012—achieving
a mAP of 53.3%. Our approach combines two key insights:
(1) one can apply high-capacity convolutional neural net-
works (CNNs) to bottom-up region proposals in order to
localize and segment objects and (2) when labeled training
data is scarce, supervised pre-training for an auxiliary task,
followed by domain-specific fine-tuning, yields a significant
performance boost. Since we combine region proposals
with CNNs, we call our method R-CNN: Regions with CNN
features. Source code for the complete system is available at
http://www.cs.berkeley.edu/˜rbg/rcnn.
1. Introduction
Features matter. The last decade of progress on various
visual recognition tasks has been based considerably on the
use of SIFT [27] and HOG [7]. But if we look at perfor-
mance on the canonical visual recognition task, PASCAL
VOC object detection [13], it is generally acknowledged
that progress has been slow during 2010-2012, with small
gains obtained by building ensemble systems and employ-
ing minor variants of successful methods.
SIFT and HOG are blockwise orientation histograms,
a representation we could associate roughly with complex
cells in V1, the first cortical area in the primate visual path-
way. But we also know that recognition occurs several
stages downstream, which suggests that there might be hier-
archical, multi-stage processes for computing features that
are even more informative for visual recognition.
Fukushima’s “neocognitron” [17], a biologically-
inspired hierarchical and shift-invariant model for pattern
recognition, was an early attempt at just such a process.
The neocognitron, however, lacked a supervised training al-
1. Input 
image













R-CNN: Regions with CNN features
Figure 1: Object detection system overview. Our system (1)
takes an input image, (2) extracts around 2000 bottom-up region
proposals, (3) computes features for each proposal using a large
convolutional neural network (CNN), and then (4) classifies each
region using class-specific linear SVMs. R-CNN achieves a mean
average precision (mAP) of 53.7% on PASCAL VOC 2010. For
comparison, [34] reports 35.1% mAP using the same region pro-
posals, but with a spatial pyramid and bag-of-visual-words ap-
proach. The popular deformable part models perform at 33.4%.
gorithm. Building on Rumelhart et al. [30], LeCun et al.
[24] showed that stochastic gradient descent via backprop-
agation was effective for training convolutional neural net-
works (CNNs), a class of models hat extend the neocogni-
tron.
CNNs saw heavy use in the 1990s (e.g., [25]), but then
fell out of fashion with the rise of support vector machines.
In 2012, Krizhevsky et al. [23] rekindled interest in CNNs
by showing substantially higher image classification accu-
racy on the ImageNet Large Scale Visual Recognition Chal-
lenge (ILSVRC) [9, 10]. Their success resulted from train-
ing a large CNN on 1.2 million labeled images, together
with a few twists on LeCun’s CNN (e.g., max(x, 0) rectify-
ing non-linearities and “dropout” regularization).
The significance of the ImageNet result was vigorously
debated during the ILSVRC 2012 workshop. The central
issue can be distilled to the following: To what extent do
the CNN classification results on ImageNet generalize to
object detection results on the PASCAL VOC Challenge?
We answer this question by bridging the gap between
image classification and object detection. This paper is the
first to show that a CNN can lead to dramatically higher ob-
ject detection performance on PASCAL VOC as compared
to systems based on simpler HOG-like features. To achieve
this result, we focused on two problems: localizing objects
1
	   25	  
machines, even though they have multiple cores or GPUs. In [27], [28], Jeffrey Dean, 
Q. V. Le and Andrew Ng built a 1000-node cluster and ran a CNN at a massive scale of 
computation and data. The inspiration behind this work was that the accuracy of humans 
in object recognition comes not just from computation but also from the amount of data 
we experience during learning. In contrast, modern deep networks work on a scale of 
data that is an iota of what can be naturally experienced. They ran their massive 
unsupervised network on YouTube videos (scaled down to 256x256) for a few days and 
observed the features learnt by higher layer nodes/neurons using the deconvolutional 
network discussed previously. It was observed that these neurons were sensitive to 
high-level features such as human faces and figures of cats (which was not surprising 
considering the plethora of cat videos on YouTube). Biologically, these neurons acted the 
same way as neurons in the IT part of the brain (the highest functional part of the ventral 
pathway), which are sensitive to, among other things, human faces. 
Even though the structure of each layer of this network was the same as that of a 
generic CNN, it had about 1 billion parameters. This posed a problem from a 
computational standpoint: optimizing this many parameters could take months on a 
single machine. Therefore, the authors adopted a distributed computing approach. They 
used a cluster of 1000 machines with 16 cores each, i.e. 16000 cores, to train 5 instances 
of the entire network separately; each instance was in turn distributed among many 
machines. The data was distributed and the convolution operation tiled such that there 
was minimal inter-machine communication. Furthermore, each instance of the network 
ran a different mini-batch of training data. But as backpropagation is a serial process, 
weights adjusted using different data would diverge across instances. To avoid this, a 
parameter server was used, whose exclusive duty was maintaining and updating the 
weights. Every network instance had to send its gradients to the parameter server after 
processing each mini-batch, and receive the updated weights before training on the next 
mini-batch. This is called asynchronous SGD (Stochastic Gradient Descent). The salient 
features learnt by the high-level neurons were observed to be more resilient to rotation, 
scale, and vertical and horizontal offset than those of previous networks. 
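The exchange between network instances and the parameter server can be illustrated with a minimal single-process sketch. It is not the implementation used in [27], [28]: the class `ParameterServer`, the helper functions, the toy linear model, the round-robin ordering of replicas and all numeric settings are assumptions chosen only to show that each replica computes its gradient on a possibly stale copy of the weights, pushes it, and then pulls the updated weights.

```python
import random


class ParameterServer:
    """Holds the single authoritative copy of the weights (hypothetical class)."""

    def __init__(self, num_weights, learning_rate):
        self.weights = [0.0] * num_weights
        self.learning_rate = learning_rate

    def pull(self):
        # A replica fetches a snapshot of the current weights.
        return list(self.weights)

    def push(self, gradient):
        # Gradients are applied as they arrive; there is no barrier between replicas.
        for i, g in enumerate(gradient):
            self.weights[i] -= self.learning_rate * g


def make_batch(rng, true_weights, size):
    """Toy mini-batch for a noise-free linear model y = w . x (assumed task)."""
    batch = []
    for _ in range(size):
        x = [rng.uniform(-1.0, 1.0) for _ in true_weights]
        y = sum(w * xi for w, xi in zip(true_weights, x))
        batch.append((x, y))
    return batch


def gradient(weights, batch):
    """Gradient of the mean squared error over one mini-batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * error * xi / len(batch)
    return grad


def train(num_replicas=5, steps=2000, seed=0):
    rng = random.Random(seed)
    true_weights = [2.0, -1.0]
    server = ParameterServer(len(true_weights), learning_rate=0.05)
    # Every replica starts from its own snapshot of the weights.
    local = [server.pull() for _ in range(num_replicas)]
    for step in range(steps):
        r = step % num_replicas                 # replicas take turns (assumption)
        batch = make_batch(rng, true_weights, size=8)
        # The gradient is computed on weights that other replicas have
        # already changed on the server in the meantime (stale weights).
        server.push(gradient(local[r], batch))
        local[r] = server.pull()                # refresh before the next mini-batch
    return server.weights
```

Calling `train()` returns the weights held by the server after all replicas have contributed; in this toy setting they approach the assumed target `[2.0, -1.0]` despite the staleness.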
GoogLeNet – One of the issues faced by previous designs was the loss of information 
due to pooling layers, which were designed to alleviate multi-scale recognition problems. 
Also, having many large filters (most of the filters were of size 5x5) is computationally 
expensive, as these are full matrix operations. These problems were addressed by 
GoogLeNet [29], shown in figure 13, by introducing parallel pipelines with 1x1 
convolutions. Their use is twofold: 1. they preserve information which would otherwise 
have been lost due to large filter convolutions, and 2. they make use of the sparsity of 
layer activations at higher layers. This builds on [30], which concluded that in the case of 
a sparse deep network, a better network topology can be designed if the probability 
distribution of the previous layer’s activations is known. But as it is very difficult to 
analyze or encode the activation statistics of a layer, it is best if all the filter sizes are used.  
Thus multiple filter sizes are used at any given higher layer, not just for sparsity, but 
also to preserve details occurring at different scales. Every convolution layer is followed 
by a ReLU operation and occasionally preceded by a pooling layer. They also used 
softmax regression/classification layers at multiple sites. This was to avoid the problem 
of decaying gradients: as the height of the network increases, it becomes very difficult 
for the backpropagated error to trickle down to the lower layers. Multiple sites 
generating classification error (at different layers), added to the final error propagated 
from the top, make sure this gradient decay is not severe. As there are so many 
parameters in the network, it is susceptible to overfitting, as discussed; therefore the 
dropout rate was increased from 50% to 70%.  
Finally, there are about 22 layers in this network, but if all the 1x1 layers preceding the 
3x3 and 5x5 filters are taken into consideration, there are about 100 layers (depending on 
the implementation). 
This network performed best in ImageNet 2014, giving a top-5 error of 6.67%. It also 
used the same technique as RCNN (discussed above) for localization alongside 
classification. But as can be observed, the size of this network is a major bottleneck.  
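The cost of large filters, and the effect of placing a 1x1 convolution in front of them, can be made concrete with a small worked calculation. The feature map size and channel counts below are illustrative assumptions, not figures taken from this thesis; the sketch only counts multiplications for a 5x5 convolution applied directly, and for the same 5x5 convolution preceded by a 1x1 layer that first reduces the number of channels.

```python
def conv_multiplies(height, width, in_channels, out_channels, kernel):
    """Multiplications for a stride-1 convolution that keeps the spatial size."""
    return height * width * out_channels * in_channels * kernel * kernel


# Illustrative (assumed) dimensions of one module input.
HEIGHT = WIDTH = 28
IN_CHANNELS = 192
OUT_CHANNELS = 32
REDUCED_CHANNELS = 16   # channels left after the assumed 1x1 layer

# 5x5 filters applied directly to all input channels.
direct = conv_multiplies(HEIGHT, WIDTH, IN_CHANNELS, OUT_CHANNELS, 5)

# 1x1 layer first, then 5x5 filters on the reduced channels.
reduced = (conv_multiplies(HEIGHT, WIDTH, IN_CHANNELS, REDUCED_CHANNELS, 1)
           + conv_multiplies(HEIGHT, WIDTH, REDUCED_CHANNELS, OUT_CHANNELS, 5))

print("direct 5x5        :", direct)
print("1x1 followed by 5x5:", reduced)
print("ratio             : %.1f" % (direct / reduced))
```

Under these assumed dimensions the 1x1 layer cuts the multiplication count by roughly an order of magnitude, at the price of one extra layer in the pipeline, which is consistent with the larger layer count noted above.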
 
Figure 13: GoogLeNet 
 
CHAPTER 4 
DRAM SIMULATION 
 
As the first part of this thesis, we created a DRAM simulator for studying the 
memory access behavior of a CNN algorithm. One of the most popular open source 
platforms for development of CNNs is Caffe: Convolutional Architecture for Fast Feature 
Embedding [31] from BVLC, UC Berkeley. We used ‘darknet’, a C-based alternative to 
Caffe from the University of Washington. The source code was compiled in CPU-only 
mode, the network was loaded with the AlexNet configuration, and the weights of the 
neurons were set to the learned weights of AlexNet. While all images undergo a 
pre-processing operation before they can be fed into the CNN, we didn’t analyze it, as its 
access pattern is not pertinent to the algorithm. This pre-processing includes an image 
format conversion from jpeg/tiff, i.e. unpadded/padded formats, to a standard 
OpenCV-compliant image format such as IplImage. Images in jpeg format are stored in 
RGB format, which is a 3d matrix: pixels of the three channels are stored discretely 
across the third dimension. In the case of padded formats like tiff, a fourth, padding layer 
is added so that memory accesses are more structured. An OpenCV format such as 
IplImage, in contrast, converts the 3d matrix into a single-dimensional array, in which 
pixels of all channels having common 2d co-ordinates are stored in adjacent positions. 
While this format is beneficial for many image-processing operations, in deep learning 
systems, pixels in different channels are not processed independently. Furthermore, most 
of the current CNNs have untied weights, and the convolution is carried out in a strided 
format, i.e. the image is segmented into small windows. These windows have some 
overlap with one another. Thus if an operation were carried out across the entire image on 
all windows in parallel, there would be considerable read contention due to this overlap. 
Therefore, it is better to duplicate overlapping pixels and disassociate windows having 
overlapping regions. This leads to a roughly 7 times increase in input data in the first layer 
of AlexNet (from 224x224x3 = 150528 elements to 2916 x 363 = 1058508 elements). As 
the filters and the input image to a convolution layer are both 3 dimensional, 
all pixels belonging to the same 3d window are stored adjacently in an array. 
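The window expansion described above can be sketched as follows. This is a minimal illustration and not the darknet routine: the function names and the pure-Python list representation are assumptions, and the input is assumed to be a flat array in which the channels of one pixel are adjacent. Each window is copied out in full, so pixels shared by overlapping windows are duplicated.

```python
def windows_along(size, kernel, stride):
    """Number of window positions along one image dimension (no padding)."""
    return (size - kernel) // stride + 1


def expanded_size(height, width, channels, kernel, stride):
    """Number of elements in the expanded array, without building it."""
    count = windows_along(height, kernel, stride) * windows_along(width, kernel, stride)
    return count * kernel * kernel * channels


def expand_windows(image, height, width, channels, kernel, stride):
    """Copy every kernel x kernel x channels window into one flat array.

    Returns the expanded array and the number of windows it contains.
    """
    out_h = windows_along(height, kernel, stride)
    out_w = windows_along(width, kernel, stride)
    expanded = []
    for oy in range(out_h):
        for ox in range(out_w):
            # All elements of one 3d window are stored adjacently.
            for ky in range(kernel):
                for kx in range(kernel):
                    y = oy * stride + ky
                    x = ox * stride + kx
                    base = (y * width + x) * channels
                    expanded.extend(image[base:base + channels])
    return expanded, out_h * out_w


# Small example: a 4x4 single-channel image, 2x2 windows, stride 1.
small = list(range(16))
flat, num_windows = expand_windows(small, 4, 4, 1, 2, 1)
print(num_windows, flat[:8])

# First AlexNet layer as configured in this thesis: 11x11 windows, stride 4.
print(expanded_size(224, 224, 3, 11, 4))
```

In the small example the first two windows share the pixels 1 and 5, which therefore appear twice in the expanded array; the last line reproduces the 2916 x 363 element count used in the rest of this thesis.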
At the heart of any CNN layer, there is a kernel called gemm (general matrix-
matrix multiplication). This kernel is responsible for the convolution operation of a filter 
with each window of the input image. The inputs to this kernel are instrumented as 
suggested above. The trace for the DRAM simulator was obtained from this kernel. 
Unlike popular DRAM simulators such as dsim [32], which need a number of attributes in 
their memory trace, our trace contained only 3 attributes for each memory access: 1. 
operation type (read/write), 2. array name (A/B/C; convolution layers use 3 arrays: A = 
weights, B = input image, C = output) and 3. index of the element being accessed. We also 
tried using a standard memory trace generator, the Intel Pin tool [33], to generate a trace 
of the application while running AlexNet. But as the application was targeted at single-
core general purpose computers, it instruments and de-instruments the inputs and outputs 
after every layer’s processing. There are many such operations which are not pertinent to 
the algorithm and wouldn’t be needed in a dedicated FPGA system, and using a standard 
memory trace generator would have forced our simulator to analyze those operations too. 
Therefore, we created our own traces by distilling only those parts of the application 
which are essential to our analysis. A trace was generated for each layer while the 
application was running in training mode, i.e. for both forward and backward propagation. 
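A trace of the three-attribute form described above could be produced by a gemm kernel along the following lines. This is a sketch under assumptions, not the instrumented darknet code: the function name `traced_gemm`, the tuple encoding of a trace record and the loop order (one weight element held while a whole output row is updated) are all illustrative choices.

```python
def traced_gemm(M, N, K, A, B, C):
    """Compute C += A * B on flat row-major arrays and record every access.

    A is M x K (weights), B is K x N (expanded input), C is M x N (output).
    Each trace record is (operation, array name, element index).
    """
    trace = []
    for i in range(M):
        for k in range(K):
            trace.append(("R", "A", i * K + k))
            a = A[i * K + k]                  # weight element reused for a whole row
            for j in range(N):
                trace.append(("R", "B", k * N + j))
                trace.append(("R", "C", i * N + j))
                C[i * N + j] += a * B[k * N + j]
                trace.append(("W", "C", i * N + j))
    return trace


# Tiny example: 2 filters, 3 windows, 2 elements per window.
A = [1.0, 2.0,
     3.0, 4.0]
B = [1.0, 0.0, 1.0,
     0.0, 1.0, 1.0]
C = [0.0] * 6
trace = traced_gemm(2, 3, 2, A, B, C)
print(C)
print(trace[:4])
```

With this loop order the weight array A is read once for every N accesses to B and C, which matches the access ratio reported in the results of this chapter when N is the number of windows.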
Architecture 
 
Figure 14: Flowchart of DRAM Simulator 
As shown in figure 14, the simulator starts by detecting the current layer to be 
analyzed and loading the corresponding trace file. This file is then passed to the 
analysis program. This program reads each memory access, which is a triplet as stated 
above, and keeps sending them to the DRAM simulator program. The analysis program 
returns control to its caller only when there are no more memory operations in the given 
file. Before the analysis program can be started, the main program also has to set up the 
base addresses of each array that will be accessed by the simulator. Currently we 
hardcode the base addresses according to the layer that is being run. This also gives us 
the freedom to store different arrays in the same or different banks/channels. The mapping 
of array addresses to actual DRAM banks can also be customized before compilation. 
The simulator is an approximate simulator that has a few essential features of 
DRAM. It is an asynchronous, trace-driven simulator, i.e. there is no concept of a 
synchronous clock, and the virtual clock is incremented when a new memory address is 
fed into the simulator by the analysis program. There is a caveat: when all the banks are 
busy, the simulator may become unresponsive. In such a case, the simulator increments 
the virtual clock while blocking the pipeline, to prevent the analysis program from 
reading more memory operations. The simulator also has channel-specific queues; 
requests that cannot be processed because their corresponding bank is busy are moved 
into the queue. The analysis program can keep reading new requests as long as the 
DRAM queues aren’t full. Before a new memory request is sent into the simulator, all 
the queues are checked for pending requests, and any pending request whose bank has 
become available is removed from its queue and serviced. From the simulator’s 
perspective, processing a memory request is equivalent to just calculating the access time 
(which depends on which row is open in the corresponding bank) and then marking that 
bank busy for that much time. There is no actual data transfer anywhere. The timing 
calculations are similar to those of SDRAM. The timing parameters in the simulator are 
not absolute but relative to the virtual clock, thus we can also analyze the effects of 
different relative speeds between CPU and memory. 
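The behavior described in this paragraph can be condensed into a short sketch of a single-channel model. It is an illustration and not the simulator written for this thesis: the class name `ApproxDram`, its parameters, the two-valued timing (row hit versus row miss) and the address-to-bank mapping are assumptions. The sketch keeps only the essentials: a virtual clock that advances with each new request, per-bank open rows and busy times, a bounded queue, and stalling when the queue is full.

```python
from collections import deque


class ApproxDram:
    """Approximate, trace-driven model of one DRAM channel (hypothetical)."""

    def __init__(self, num_banks, row_size, queue_size, t_hit, t_miss):
        self.num_banks = num_banks
        self.row_size = row_size
        self.queue_size = queue_size
        self.t_hit = t_hit                    # access time when the row is already open
        self.t_miss = t_miss                  # access time when another row must be opened
        self.open_row = [None] * num_banks
        self.busy_until = [0] * num_banks
        self.queue = deque()
        self.clock = 0                        # virtual clock
        self.row_hits = 0
        self.row_misses = 0

    def _bank_and_row(self, address):
        bank = address % self.num_banks       # cyclic distribution over banks
        row = (address // self.num_banks) // self.row_size
        return bank, row

    def _try_issue(self, address):
        bank, row = self._bank_and_row(address)
        if self.busy_until[bank] > self.clock:
            return False                      # bank still busy
        if self.open_row[bank] == row:
            latency = self.t_hit
            self.row_hits += 1
        else:
            latency = self.t_miss
            self.row_misses += 1
        self.open_row[bank] = row
        # No data moves; the bank is only marked busy for the access time.
        self.busy_until[bank] = self.clock + latency
        return True

    def _drain_queue(self):
        waiting = deque()
        while self.queue:
            address = self.queue.popleft()
            if not self._try_issue(address):
                waiting.append(address)
        self.queue = waiting

    def request(self, address):
        self.clock += 1                       # clock advances with every new request
        self._drain_queue()                   # pending requests are served first
        while not self._try_issue(address):
            if len(self.queue) < self.queue_size:
                self.queue.append(address)
                return
            self.clock += 1                   # queue full: stall the caller
            self._drain_queue()


dram = ApproxDram(num_banks=2, row_size=4, queue_size=2, t_hit=1, t_miss=3)
for address in [0, 1, 2, 3]:
    dram.request(address)
print(dram.clock, list(dram.queue), dram.row_hits, dram.row_misses)
```

In this small run the third request finds its bank busy and is queued, then is served as a row hit when the fourth request arrives, which in turn has to wait in the queue.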
In an attempt to simulate multiple processing sites, there is a concept of a window 
in the analysis program. The total number of requests on the fly in the system cannot be 
more than the window size. For most of our analyses we kept this number equal to 500. 
There is also the concept of a dispatch rate, which represents the approximate delay 
between two successive requests to the memory. It is the ratio between the dispatch 
rate and the timing parameters that simulates the difference in speed between CPU and 
memory. Here, we didn’t take care of any Read after Write, Write after Read or Write 
after Write dependency, as there is no such dependency in the CNN algorithm unless 
both the input and output arrays are mapped to the same memory, which isn’t true in our 
case. Also, there is no concept of a hierarchical cache in this model. Even though adding 
a single layer of cache would give a hit rate of ~95% due to prefetching, this model 
concentrates on DRAM for now. 
Also, the array elements were stored in a cyclic manner across the banks. As the 
memory accesses are serial, such a distribution pattern ensured that every new request 
would go to a different bank than the previous request, thus ensuring more parallelism 
while catering to multiple requests on the fly. 
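The mapping from an array element to a channel and a bank can be sketched as below. The base addresses, the array-to-channel table and the function name `locate` are hypothetical values chosen for illustration, not the ones hardcoded in our simulator; only the 4-byte word size and the cyclic distribution over banks follow the text.

```python
WORD_BYTES = 4          # single precision float

# Hypothetical base addresses and channel assignment for the three arrays.
BASE_ADDRESS = {"A": 0x0000000, "B": 0x1000000, "C": 0x2000000}
CHANNEL = {"A": 0, "B": 0, "C": 1}


def locate(array, index, num_banks):
    """Return (channel, bank) for one element of an array.

    Consecutive elements fall into consecutive banks (cyclic distribution),
    so a serial access stream keeps moving to a different bank.
    """
    word = BASE_ADDRESS[array] // WORD_BYTES + index
    return CHANNEL[array], word % num_banks


print([locate("B", i, 64) for i in range(4)])
print(locate("C", 64, 64))
```

With 64 banks, the first four elements of B land in banks 0 to 3 of channel 0, and element 64 of C wraps around to bank 0 of channel 1.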
Result 
The DRAM simulator analyzed the traces under two different configurations, as 
shown in figures 15 and 16. The two configurations were single channel with a single 
controller, and double channel with double controllers. In the latter case, we assumed the 
set-up not only had two channels, but each of these two DRAMs had its own queue. As 
the trace analysis is very long, we discuss only a few interesting observations here.  
When the ratio between the dispatch rate and timing parameters is high enough, 
we start seeing contention in the DRAM. This can be observed by polling the DRAM 
for the status of its banks and queues once every ten million operations. Under single 
DRAM mode, with 64 banks and a queue size of 64, the number of requests on the fly is 
almost always nearly 120, with about 60 of them waiting in the DRAM queue at any 
given time. When the memory configuration was changed to 2 DRAM channels with 
independent controllers and the same timing constraints as earlier, the number of 
requests on the fly reduced to about 90, and there was almost no request in any of the 
DRAM queues. In the 2 channel mode, it is important to note that different arrays must 
be assigned to different DRAM controllers to ensure parallel servicing of requests. If this 
is not done, the result is the same as for the single controller configuration. 
Also, it was observed empirically that most of the memory accesses belong either to the 
input image array (B) or to the output array (C); the weight array (A) is accessed only 
once every 2916 iterations. Therefore, during set-up of the array base address to memory 
mapping, it was ensured that the B and C arrays are mapped to different channels and 
controllers. The word size of the memory was 4 bytes, as all of the accesses are to 
single precision floating point data. Under the existing timing conditions, there wouldn’t 
have been any new observation if we had increased the number of banks or channels: as 
the DRAM queue is almost always empty, the memory banks are never saturated in a 
2 channel configuration. The only condition under which increasing banks and channels 
would show some improvement in performance is if the ratio between the dispatch rate 
and timing parameters was increased, i.e. if memory was made to operate slower relative 
to the processing element(s). 
                                                           
 
total requests yet 10000001 
DRAM Queue Fill 64 
num requests on fly 128 
 
 total requests yet 20000001 
DRAM Queue Fill 64 
num requests on fly 128 
 
 total requests yet 30000001 
DRAM Queue Fill 64 
num requests on fly 121 
 
 total requests yet 40000001 
DRAM Queue Fill 57 
num requests on fly 127 
 
 total requests yet 50000001 
DRAM Queue Fill 63 
num requests on fly 128 
 
 total requests yet 60000001 
DRAM Queue Fill 64 
num requests on fly 128 
 
 total requests yet 70000001 
DRAM Queue Fill 64 
num requests on fly 128 
 
 total requests yet 80000001 
DRAM Queue Fill 64 
num requests on fly 122 
Figure 15: Sample Simulator output of single channel DRAM running on first 
convolution layer trace 
Annotations to figure 15: 
total requests yet: total number of requests that have been analyzed until the time of 
polling (the status of the DRAM is polled once every 10,000,000 requests) 
DRAM Queue Fill: number of requests in the DRAM queue because their corresponding 
banks are busy 
num requests on fly: number of requests being serviced by the DRAM banks or waiting in 
the queue 
 
filter range 0x3938700 - 0x11e1a300 
Base Addresses: BASE A : 0x3938700       BASE B :0x7997ee00      BASE C : 
0x32116200 
Number of req in bank of DRAM[0] 0 
Number of req in bank of DRAM[1] 1 
num requests on fly 1 
 total requests yet 1 
DRAM Queue Fill 0 
DRAM Queue Fill 0 
 
Number of req in bank of DRAM[0] 45 
Number of req in bank of DRAM[1] 45 
num requests on fly 90 
 total requests yet 10000001 
DRAM Queue Fill 0 
DRAM Queue Fill 0 
 
Number of req in bank of DRAM[0] 45 
Number of req in bank of DRAM[1] 45 
num requests on fly 90 
 total requests yet 20000001 
DRAM Queue Fill 0 
DRAM Queue Fill 0 
 
Number of req in bank of DRAM[0] 45 
Number of req in bank of DRAM[1] 45 
num requests on fly 90 
 total requests yet 30000001 
DRAM Queue Fill 0 
DRAM Queue Fill 0 
 
	  
Figure 16: Sample Simulator output of double channel DRAM running on first convolution 
layer trace 
Annotations to figure 16: 
filter range / Base Addresses: setting up the base addresses before starting a layer’s 
computation 
Number of req in bank of DRAM[0]/DRAM[1]: number of requests being serviced by the 
banks in each of the two channels 
num requests on fly: number of requests being serviced or stuck in queues across both 
DRAM channels/controllers 
DRAM Queue Fill: number of memory requests waiting in each of the two DRAM queues 
The status of the DRAM is polled once every 10,000,000 requests. 
 
filter range 0x3938700 - 0x11e1a300 
Base Addresses: BASE A : 0x7270e00       BASE B :0x59682f00      BASE C : 
0x35a4e900 
 
Number of req in bank of DRAM[0] 0 
Number of req in bank of DRAM[1] 63 
num requests on fly 127 
 total requests yet 140000001 
DRAM Queue Fill 0 
DRAM Queue Fill 64 
 
Number of req in bank of DRAM[0] 0 
Number of req in bank of DRAM[1] 61 
num requests on fly 122 
 total requests yet 150000001 
DRAM Queue Fill 0 
DRAM Queue Fill 61 
 
Number of req in bank of DRAM[0] 0 
Number of req in bank of DRAM[1] 63 
num requests on fly 127 
 total requests yet 160000001 
DRAM Queue Fill 0 
DRAM Queue Fill 64 
 
Number of req in bank of DRAM[0] 0 
Number of req in bank of DRAM[1] 63 
num requests on fly 127 
 total requests yet 170000001 
DRAM Queue Fill 0 
DRAM Queue Fill 64 
	  
Figure 17: Sample simulator output of second convolution layer trace 
Annotations to figure 17: 
filter range / Base Addresses: setting up the base addresses before starting a layer’s 
computation; here, the base addresses of B and C are mapped to the same DRAM channel 
Number of req in bank of DRAM[0]: no requests in this DRAM channel 
Number of req in bank of DRAM[1]: as all the requests are mapped into this DRAM 
channel, all of its banks are busy and its queue is full 
num requests on fly: the number of requests currently being serviced or pending is higher 
than in figure 16 
The status of the DRAM is polled once every 10,000,000 requests. 
It can be observed from the trace simulator that the second convolution layer, 
shown in figure 17, which takes its input from the previous max-pool layer, doesn’t reap 
the benefit of two channels. This is because all the layer outputs are put in a single DRAM 
channel, and hence both the input and output of this layer are stored in the same DRAM, 
leading to access contention. This effect is also observed in the last three convolution 
layers of the network, which are adjacent to each other, so that their inputs and outputs 
feed into each other in forward and backward propagation. This contention can be 
resolved by putting the outputs of adjacent layers in different DRAM channels, creating a 
ping-pong access pattern in which the input and output of each layer come from different 
DRAMs, thus avoiding any kind of access collision. 
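The ping-pong placement proposed above amounts to alternating the output channel from one layer to the next. The sketch below is illustrative only: the layer names, the function names and the simple rule that a layer’s input is the previous layer’s output are assumptions, and the layer list is not the exact AlexNet configuration.

```python
def ping_pong_assignment(layer_names, num_channels=2):
    """Place the output of consecutive layers in alternating DRAM channels."""
    return {name: i % num_channels for i, name in enumerate(layer_names)}


def single_channel_assignment(layer_names):
    """Place every layer output in the same DRAM channel."""
    return {name: 0 for name in layer_names}


def contended_layers(layer_names, output_channel):
    """Layers whose input (previous layer's output) shares a channel with their output."""
    conflicts = []
    for previous, current in zip(layer_names, layer_names[1:]):
        if output_channel[previous] == output_channel[current]:
            conflicts.append(current)
    return conflicts


# Hypothetical layer sequence used only for illustration.
layers = ["conv1", "pool1", "conv2", "pool2", "conv3", "conv4", "conv5"]

print(contended_layers(layers, single_channel_assignment(layers)))
print(contended_layers(layers, ping_pong_assignment(layers)))
```

With a single output channel every layer after the first reads and writes the same DRAM, whereas the alternating assignment leaves no layer with its input and output in the same channel.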
It might feel unintuitive to go to the DRAM for every access since, as mentioned 
earlier, even a single layer of cache can achieve a hit rate of ~95% due to 
pre-fetching. But as we are targeting FPGAs, the above analysis is helpful if the design 
is realized in a very small but fast FPGA that consumes very little power but has very 
little or no on-chip memory. 
  
CHAPTER 5 
FPGA DESIGN OF CNN 
 
In this chapter we analyze the use of FPGAs to compute CNNs. For simplicity, 
we target the design at the first convolution layer only. This is because this 
layer consumes 19.6% of execution time on a CPU and 16.9% of execution time on a 
GPU [34] during forward propagation, and also consumes the highest amount of memory 
among all layers. Besides, the design that works best for this layer will also work for the 
rest of the convolution layers, as the core operation is the same across the board. 
There have been a few works on realizing large scale CNNs such as AlexNet on an 
FPGA [35], [36], [37]. All these works concentrate on designing FPGAs for forward 
propagation only, as it is current practice in industry and academia to train the network 
once in the beginning and then deploy it, which uses only forward propagation. This 
simplifies the design, as the outputs needn’t be stored and can be overwritten, thus 
decreasing the memory footprint. Also, most of these designs don’t expand the input 
to duplicate overlapping windows, as prescribed by GEMM. We, however, created a 
design assuming the input image is expanded. Under the current configuration, the input 
image expands from a 224x224x3 element array to one having 1.05 million elements, as 
shown in figure 18. All the inputs and outputs of a layer are stored in DRAM, but they are 
transferred into the FPGA before starting the computation, as shown in figure 19. These 
accesses can be overlapped with the computations, as stated in [34].  
 
Figure 18: Conversion and storage of input image in BRAM blocks 
Our main objective in this design was to increase the number of parallel execution 
units such that we not only fully utilize all the resources in an FPGA, but also find the 
hardware limitations that become bottlenecks to increasing parallelism.  
Architecture 
The design in figure 20 is similar to [35], shown in figure 19, but instead of 
parallelizing across each filter, we parallelize computation across an entire output feature 
map. Convolution is the same as correlation with a reversed filter, and consists of 
multiplication and accumulation operations. Each 3d block in the input image is 
multiplied element-wise with each 3d filter, and the result is accumulated to produce one 
pixel of the feature map corresponding to that filter. Under the current configuration, each 
blob and each filter consists of 11x11x3 = 363 elements. As these multiplications don’t 
have any dependency on each other, they can be carried out in parallel and the results 
then accumulated. Observing the architectures of [35] and [36], it can be inferred that they 
follow a similar approach. This architecture can be further parallelized by operating on a 
batch of images rather than one image per iteration. Usually this batch size is kept at 128. 
 




Figure 19: Top-level architecture of the CNN accelerator of [35] 
 
Figure 20: Designed FPGA Architecture 
We designed the FPGA for a batch size of 1 image. Once this design is realized, 
we can expand it to a bigger batch size if there are resources left in the FPGA. There are 
three dimensions along which the design can be parallelized: along the filter/window 
elements (363), along the filters (64) or along the output feature map pixels (2916). The 
last option gives us the highest parallelization. Before starting any operation, the entire 
expanded image array is copied into the FPGA Block RAM (BRAM). The access latency 
of BRAM is 1 clock cycle. Each element of each filter is copied once and stored in a 
register. This value is then multiplied with the corresponding element of each window of 
the input image in parallel. As the image array is already expanded, there is no contention 
while scheduling the read operations of all these elements in parallel. As there are 2916 
windows and 2916 elements in each feature map, we have 2916 units operating in 
parallel per filter per image. In order to support this kind of parallelization, 
we have to distribute the input and output arrays in a particular manner. The output is 
stored in BRAM too, and can be transferred out of the FPGA in parallel with computation. 
The expanded input image, which is a 1 dimensional array, is stored such that all 
pixels of a window are adjacent to each other and each window is adjacent to its 
neighbor. Each window/blob is stored in one BRAM block, thus we need at least 2916 
blocks to store the entire image. Similarly, the outputs of all the parallel execution 
units have to go to different BRAM blocks if the parallelism is to be fully extracted. As 
the output of each unit is a partial sum of a feature map pixel, this operation is carried out 
363 times per filter to obtain the complete feature map. In each iteration all the elements 
of the feature map, containing partial accumulations, are read, added to the multiplier 
result and written back. Even though the number of memory accesses might seem 
very high, the latency is not. In order to ensure low latency, each pixel of a feature map is 
stored in a different BRAM block. But this might lead to inefficient use of memory, as 
each BRAM block has a capacity of 18K/36K bits and is capable of storing much more 
than one pixel. Thus, the design stores all pixels of all output feature maps having the 
same 2d co-ordinate/index in the same block. As we never process two filters at the 
same time, there is never an instance where we access more than one element of the 
same output block in parallel, thus avoiding read and write contention while 
simultaneously saving memory. 
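The resulting schedule can be written down as a software loop nest. This is a behavioral sketch under assumptions and not the Vivado HLS source: the function name `conv_layer_schedule` and the list-of-lists stand-ins for BRAM blocks are illustrative, and the innermost loop, which runs sequentially here, corresponds to the 2916 execution units that operate in parallel in hardware.

```python
def conv_layer_schedule(windows, filters):
    """Behavioral model of the parallelization across one output feature map.

    windows: one list of elements per window (one input BRAM block each).
    filters: one list of elements per filter.
    Returns one output block per window position, holding the pixel of every
    feature map at that position.
    """
    num_windows = len(windows)
    num_elements = len(windows[0])
    out_blocks = [[0.0] * len(filters) for _ in range(num_windows)]
    for f, filt in enumerate(filters):            # filters are processed one at a time
        for e in range(num_elements):             # 363 iterations per filter
            weight = filt[e]                      # held in a register, shared by all units
            for p in range(num_windows):          # parallel execution units in hardware
                # Read the partial sum, add the product, write it back.
                out_blocks[p][f] += weight * windows[p][e]
    return out_blocks


# Tiny example: 3 windows of 2 elements, 2 filters.
windows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
filters = [[1.0, 0.0], [1.0, 1.0]]
print(conv_layer_schedule(windows, filters))
```

Because only one filter index `f` is active at a time, each output block is touched at exactly one element per iteration, which is the property that lets all feature map pixels with the same 2d index share a block.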
As the input and output arrays are stored in on-chip memory, the connections 
between each BRAM block and the parallel execution units can be hard-wired; thus the 
internal fan-in and fan-out of the computing unit are very large. As the data type of the 
elements is float, the arithmetic operations are realized using the DSP blocks in the FPGA. 
Also, the storage of the filter/weight array is neglected in this analysis, as its access 
occurs once in every 2916 operations. But as we are doing all these operations in a single 
iteration, these accesses will occur with a period equal to the lowest iteration latency. 
This is avoided by storing the weights in BRAM, but the BRAM capacity is a problem. 
As these accesses are regular, the memory reads can be pipelined such that, barring the 
first iteration of the innermost loop, we receive a new element every cycle. 
The FPGA design was realized using Vivado HLS because of the large scale of the
network. Vivado HLS is responsible for loop scheduling and for streamlining the memory
accesses to extract the best utilization of the hardware. In [36] the design was
optimized using HLS directives such as loop unroll and loop pipeline. In our design,
however, loop unrolling could not be realized because of the large number of input/output
BRAM banks. Therefore, we had to hardcode the 2916 operations in the innermost loop
manually. The same paper also discussed overlapping computation and memory access by
dividing the internal memory into buffers: both the input and the output memory were
divided into two parts each. This allowed data to be loaded into one set of input and
output buffers while the other set was used for computation. Unfortunately, doing the
same in our design is not simple, as we have 2916 input and output buffers. Also, as
every BRAM block is used during each iteration, it is very difficult to parallelize
computation and memory operations. If, instead of single-port BRAM, we could use
dual-port BRAM in Read-Write/Read-Only mode, computation and memory access could be
parallelized. Unfortunately this could not be realized in the current design, as the
‘dataflow’ HLS directive (used for parallelizing memory access and computation) did not
succeed, due to the large number of banks. Solving this problem is left to future work.
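The two-buffer scheme of [36] discussed above can be illustrated with a small sequential software model. This is only a sketch under stated assumptions: TILE_LEN, load_tile, compute_tile and ping_pong are hypothetical names, the tile size is arbitrary, and in hardware the load and the compute of each step would run concurrently rather than one after the other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TILE_LEN = 4 };   /* illustrative tile size, not taken from the design */

/* Stand-in for the external-memory load phase. */
void load_tile(float *dst, const float *src, size_t tile)
{
    memcpy(dst, src + tile * TILE_LEN, TILE_LEN * sizeof(float));
}

/* Stand-in for the compute phase: reduce one tile to a single value. */
float compute_tile(const float *buf)
{
    float acc = 0.0f;
    for (size_t i = 0; i < TILE_LEN; ++i)
        acc += buf[i];
    return acc;
}

/* Process n_tiles tiles, alternating between two buffers. */
void ping_pong(const float *src, size_t n_tiles, float *out)
{
    float buf[2][TILE_LEN];
    assert(src != NULL && out != NULL);
    if (n_tiles == 0)
        return;
    load_tile(buf[0], src, 0);                 /* prologue: fill the first buffer */
    for (size_t t = 0; t < n_tiles; ++t) {
        size_t cur = t % 2;
        size_t nxt = 1 - cur;
        if (t + 1 < n_tiles)
            load_tile(buf[nxt], src, t + 1);   /* would overlap with the compute below */
        out[t] = compute_tile(buf[cur]);
    }
}
```

The model makes the storage cost visible: every buffer is duplicated, which is what makes the scheme hard to apply to 2916 input and 2916 output buffers.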
Memory Storage Calculations 
In this section, the capacity of the BRAM blocks is analyzed.
Ideal Scenario
Size of expanded image array B = 2916 x 363 float elements = 1058508 elements
Size of output image (including all feature maps) array C = 64 x 2916 elements = 186624 elements
Total number of elements in Weights/Filters array = 64 x 363 elements = 23232 elements
Total number of elements to be stored = 1268364 elements
Total size of all the elements = 1268364 x 32 bits (assuming single precision float) = 40587648 bits
Hence, the total size required for the first convolution layer is about 39 Mb. This is
within the resource budget of the Xilinx Virtex-7 Ultrascale FPGA [38]. In practice,
however, under the available BRAM block configurations, the memory requirement of the
design is much higher.
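The ideal-scenario arithmetic can be reproduced with a few lines of C. This is a minimal sketch; the function names ideal_elements and ideal_bits are hypothetical, and the constants are the layer dimensions quoted above.

```c
#include <assert.h>
#include <stdint.h>

enum {
    NUM_FILTERS = 64,        /* M */
    FEATURE_MAP_SIZE = 2916, /* N */
    WINDOW_SIZE = 363,       /* K = 11 x 11 x 3 */
    BITS_PER_FLOAT = 32
};

/* Total number of float elements in A (weights), B (expanded image) and C (output). */
uint64_t ideal_elements(void)
{
    uint64_t a = (uint64_t)NUM_FILTERS * WINDOW_SIZE;
    uint64_t b = (uint64_t)FEATURE_MAP_SIZE * WINDOW_SIZE;
    uint64_t c = (uint64_t)NUM_FILTERS * FEATURE_MAP_SIZE;
    return a + b + c;
}

/* Storage in bits, assuming single-precision floats and perfect packing. */
uint64_t ideal_bits(void)
{
    uint64_t n = ideal_elements();
    assert(n > 0);
    return n * BITS_PER_FLOAT;
}
```

Dividing the result of ideal_bits() by 2^20 gives roughly 38.7, which is the "about 39 Mb" figure.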
Practical Scenario 
Number of elements that can be stored in a single 18K BRAM block = 18 Kb / 32 b = 576
elements. According to our memory layout, each input BRAM block stores one input image
window/blob, which has 363 elements and easily fits into a BRAM block. Similarly, the
design stores all pixels of all output feature maps with the same index in the same BRAM
block. As there are 64 filters, each output BRAM block has to store 64 elements. Even
though the BRAM blocks are large enough to contain their respective workloads, a large
amount of memory is wasted. This can be a problem, as shown below.
Total number of BRAM blocks required for storing B = 2916
Total number of BRAM blocks required for storing C = 2916
Total number of BRAM blocks required = 5832
As the total number of BRAM blocks in a Virtex 7 Ultrascale FPGA is 5040, this design
cannot fit into the FPGA in its current form.
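The practical-scenario numbers, and the amount of each block that is actually filled, can be checked as follows. This is a sketch only; the function names are hypothetical, and the fill percentages are derived here from the figures above rather than reported by the tools.

```c
#include <assert.h>

enum {
    FEATURE_MAP_SIZE = 2916,
    WINDOW_SIZE = 363,
    NUM_FILTERS = 64,
    BRAM18_BITS = 18 * 1024,
    BITS_PER_FLOAT = 32,
    BRAM18_AVAILABLE = 5040
};

/* Floats that fit into one 18K block. */
unsigned elements_per_block(void)
{
    return BRAM18_BITS / BITS_PER_FLOAT;
}

/* One block per input window plus one block per output pixel index. */
unsigned blocks_required(void)
{
    return 2u * FEATURE_MAP_SIZE;
}

/* Number of blocks missing on the target device. */
int blocks_short(void)
{
    return (int)blocks_required() - BRAM18_AVAILABLE;
}

/* Percentage of one block that is filled when it stores `stored` floats. */
unsigned fill_percent(unsigned stored)
{
    assert(stored <= elements_per_block());
    return stored * 100u / elements_per_block();
}
```

With these figures an input block is about 63% full and an output block about 11% full, which quantifies the wasted memory.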
This memory starvation also implies that different filter operations cannot be
computed at the same time. As all the filters need to operate on the same image array
stored in the BRAM blocks, doing so would cause read contention at the BRAM memory
controller. Usually this is avoided by replicating the overlapping data, but as the
design is already memory starved, it is not practical to replicate the array.
Utilization 
Our primary design objective was to increase resource utilization. Comparing
the utilization table of [34] (Table 2) with that of our design (Table 3), it can be
observed that our design uses far more resources than [34]. However, our design also
requires about five times as many DSP blocks as are available. As we declared our BRAM
ports explicitly to ensure the requisite memory layout, the BRAM usage estimate does not
appear in the utilization report. Refer to the previous section for the BRAM estimate.
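The DSP figures reported in Table 3 can be related to the number of parallel units with a short calculation. This is a sketch under an assumption that is ours and not stated in the synthesis report: that the DSP blocks are split evenly across the 2916 processing elements. The function names are hypothetical.

```c
#include <assert.h>

enum {
    DSP_USED = 14580,       /* Table 3, "Used" row */
    DSP_AVAILABLE = 2880,   /* Table 3, "Available" row */
    PROCESSING_ELEMENTS = 2916
};

/* DSP blocks per processing element, assuming an even split. */
int dsp_per_pe(void)
{
    assert(DSP_USED % PROCESSING_ELEMENTS == 0);
    return DSP_USED / PROCESSING_ELEMENTS;
}

/* Processing elements that would fit within the available DSP blocks. */
int max_pes_that_fit(void)
{
    return DSP_AVAILABLE / dsp_per_pe();
}

/* Over-subscription of DSP blocks, in percent of what is available. */
int dsp_demand_percent(void)
{
    return DSP_USED * 100 / DSP_AVAILABLE;
}
```

Under that assumption each unit consumes five DSP blocks, so only a fraction of the 2916 units could be instantiated on this device without sharing DSP blocks.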
Table 2: Utilization Table in [34]
Resource      DSP     BRAM         LUT       FF
Used          2240    1024         186251    205704
Available     2800    2060         303600    607200
Utilization   80%     50%          61.3%     33.87%

Table 3: Utilization Table in Virtex 7 Ultrascale xcvu440
Resource      DSP     BRAM (18K)   LUT       FF
Used          14580   -            1046931   1239404
Available     2880    5040         2518560   5037120

Loop Iteration
According to the synthesis report, the total latency of all the computation of the
first layer for one image is 185984 cycles. This does not include the time taken to send
the input image into the FPGA or to take the output array out of it.
The 3D GEMM algorithm is a three-level nested loop; with a single processing
element the loop latency is very large, as shown in Table 4. With manual unrolling and
pipelining, it is reduced to a two-level nested loop. As shown in Table 5, this set-up
reduced the execution latency by a factor of 4148. Also evident from the table is the
fact that there are 2916 processing elements, all of which execute once every 8 cycles.
This parallelism is possible only because of the storage pattern of the arrays; we
reached this conclusion after the default HLS partial loop unrolling and pipeline
directives failed to show any considerable improvement in latency.
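The reported latency can be cross-checked against the loop structure. The sketch below assumes, as our reading of "once in every 8 cycles", that each of the 64 x 363 iterations of the two-level loop costs 8 cycles; the function names are hypothetical and the remainder is simply the part of the reported figure that this model does not account for.

```c
#include <assert.h>

enum { NUM_FILTERS = 64, WINDOW_SIZE = 363, CYCLES_PER_ITERATION = 8 };

/* Total first-layer latency from the synthesis report, in cycles. */
static const long REPORTED_LATENCY = 185984;

/* Cycles spent if every iteration of the two-level loop takes 8 cycles. */
long steady_state_cycles(void)
{
    return (long)NUM_FILTERS * WINDOW_SIZE * CYCLES_PER_ITERATION;
}

/* Cycles in the reported latency that the simple model does not explain. */
long unexplained_cycles(void)
{
    assert(REPORTED_LATENCY >= steady_state_cycles());
    return REPORTED_LATENCY - steady_state_cycles();
}
```

The model accounts for all but a small remainder of the reported cycles, which is consistent with loop entry and exit overhead, although the report itself is needed to confirm that.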
Table 4: Loop Latency with single Processing Element 
 
 
Table 5: Loop Latency with 2916 Processing Element 
 
  




CONCLUSION
This report explored biologically inspired learning algorithms such as auto-
encoders and the systems that are inspired by them. Current neural networks are not
remotely close to the scale and complexity of the neural circuits found in humans, and
growth in network size is the current trend in the AI and deep learning community. Most
of the work in both industry and academia uses GPUs to run these networks, which are
used not only for computer vision tasks such as object recognition but also for natural
language processing. A natural recourse for these applications is deployment in the
cloud. Current systems are trained in isolation and deployed in test-only mode, i.e.
with no backpropagation in deployment. In future, however, systems will require online
learning, which would require backpropagation in production systems. Datacenters should
therefore be capable of running both forward and backward propagation in real time.
While GPUs have performed remarkably well in decreasing the training and run time
of CNNs, their high power usage is a deterrent to deploying them in large-scale systems
such as datacenters. The reason behind the GPU’s success in running neural networks is
the presence of many thread execution units that can run in parallel. But as GPUs are
designed for graphics operations, each SIMD unit or thread-warp unit contains a bulk of
hardware that is not pertinent to running deep learning algorithms. Even if this
hardware inside a GPU core is not used, it still consumes power; if the hardware can be
configured to consist only of the pertinent units, power usage can be decreased. This
was the motivation behind this thesis. The design that has been realized and synthesized
has increased parallelism beyond the capacity of the state-of-the-art FPGA. From the
observations, we can conclude that DSP blocks and BRAM are more important for increasing
parallelism than flip-flops and look-up tables; this can guide the design of
future-generation FPGAs targeted at running deep learning systems.
  





APPENDIX A: Code Walkthrough
Code snippet 1 shows a sample of the code used for realizing the FPGA
accelerator with a single execution unit. The snippet also contains the code that loads
the inputs into the FPGA and transfers the output array out of it. The core logic block
of the snippet can be optimized using predefined directives such as ‘pipeline’ and
‘partial unroll’ with a factor of 500 but, as explained in the FPGA Design – Architecture
section, neither of these optimizations resulted in the desired performance improvement.
As there are about 15000 lines of code in the parallel design that was realized, only
code snippets of the major sections of the program can be shown here.

Code Snippet 1: HLS code for FPGA with single Executing Unit
[Code listing not reproduced. Its annotations mark three regions: loading the input
arrays into the FPGA; transferring the output out of the FPGA (the layer outputs have to
be stored for calculating deltas in ‘backpropagation’); and the block containing the
logic to be realized in the design.]
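For orientation, the computation performed by a single execution unit can be written as a plain three-level GEMM loop. This is a minimal sketch and not the code of snippet 1: the function name gemm_single_pe is hypothetical, the array layout (A as M x K, B as K x N, C as M x N, all row-major) is inferred from the indexing used in code snippet 3, and no HLS directives are shown.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B with A: M x K (weights), B: K x N (expanded image), C: M x N. */
void gemm_single_pe(size_t M, size_t N, size_t K,
                    const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < M; ++i) {            /* filter */
        for (size_t j = 0; j < N; ++j) {        /* output pixel */
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)      /* position inside the window */
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}
```

For the first convolution layer the call would use M = 64, N = 2916 and K = 363, giving 64 x 2916 x 363 multiply-accumulate operations on one unit.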
 
Code Snippet 2: Sample code for BRAM port and block declaration
#define NUM_FILTERS 64         // M = number of filters
#define FEATURE_MAP_SIZE 2916  // N = output image/feature map size
#define WINDOW_SIZE 363        // K = 11x11x3 input patch/filter size
void convolution_layer(… , float A[WEIGHT_SIZE], int lda, float B[INPUT_SIZE], int ldb,
        float C[OUTPUT_SIZE],
        float C_local0[OUTPUT_SIZE / 2916],
        .
        .
        float C_local2915[OUTPUT_SIZE / 2916], int ldc,
        float B_local0[INPUT_SIZE / 2916],
        float B_local1[INPUT_SIZE / 2916],
        .
        .
        float B_local2915[INPUT_SIZE / 2916]) {
    // each C_local<i>[] contains pixel i of all feature maps
    // each B_local<i>[] contains the ith blob of the input image

#pragma HLS INTERFACE bram depth=363 port=B_local0
#pragma HLS INTERFACE bram depth=363 port=B_local1
    .
    .
#pragma HLS INTERFACE bram depth=363 port=B_local2913
#pragma HLS INTERFACE bram depth=363 port=B_local2914
#pragma HLS INTERFACE bram depth=363 port=B_local2915

#pragma HLS INTERFACE bram depth=64 port=C_local0
#pragma HLS INTERFACE bram depth=64 port=C_local1
    .
    .
#pragma HLS INTERFACE bram depth=64 port=C_local2915

    // successive code in next snippet…
}
	  
As shown in code snippet 2, to explicitly store an array in a BRAM block, the
array has to be declared both as a function parameter and as an interface, with a
separate port for each array.
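Because each of the 2916 input and output arrays needs its own parameter and its own directive, the declarations are highly repetitive. One way to produce such lines without typing them, offered here only as a sketch and not as the method used for this design, is a small generator program. The names format_bram_pragma and emit_bram_pragmas are hypothetical; the generated text follows the directive form shown in code snippet 2.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes one interface directive into buf; returns 1 on success, 0 if buf is too small. */
int format_bram_pragma(char *buf, size_t len, const char *prefix, int index, int depth)
{
    assert(buf != NULL && prefix != NULL && len > 0);
    int n = snprintf(buf, len, "#pragma HLS INTERFACE bram depth=%d port=%s%d",
                     depth, prefix, index);
    return n >= 0 && (size_t)n < len;
}

/* Emits `count` directives, one per line, for arrays prefix0 .. prefix<count-1>. */
void emit_bram_pragmas(FILE *out, const char *prefix, int count, int depth)
{
    char line[128];
    for (int i = 0; i < count; ++i) {
        if (format_bram_pragma(line, sizeof line, prefix, i, depth))
            fprintf(out, "%s\n", line);
    }
}
```

Calling emit_bram_pragmas(stdout, "B_local", 2916, 363) followed by emit_bram_pragmas(stdout, "C_local", 2916, 64) would print the two directive groups of the snippet in full.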
 
Code Snippet 3: Sample Code containing data transfer and computation logic to
generate 2916 parallel processing units
//  … continued from code snippet 2 …
// This HLS directive is used to overlap memory access and computation, but
// doesn't work here, maybe because of the large number of BRAM blocks
//#pragma HLS DATAFLOW

// Transferring image array into BRAM blocks of FPGA
for (index = 0; index < WINDOW_SIZE; index++)
{
    B_local0[index] = B[index*FEATURE_MAP_SIZE + 0];
    B_local1[index] = B[index*FEATURE_MAP_SIZE + 1];
    .
    .
    B_local2915[index] = B[index*FEATURE_MAP_SIZE + 2915];
}

// This code generates the desired number of parallel processing units.
// Each C_local<i>[] is assumed to hold zeros on entry (initialization not shown).
register float A_PART;
for (i = 0; i < NUM_FILTERS; ++i) {
    for (k = 0; k < WINDOW_SIZE; ++k) {
        A_PART = A[i*WINDOW_SIZE + k];

        C_local0[i] += A_PART*B_local0[k];
        C_local1[i] += A_PART*B_local1[k];
        .
        .
        C_local2915[i] += A_PART*B_local2915[k];
    }
}

// Transferring output array from FPGA BRAM blocks to external DRAM storage
// transfer all filter outputs...
int index_C;
for (index_C = 0; index_C < NUM_FILTERS; index_C++) {
    C[index_C*FEATURE_MAP_SIZE + 0] = C_local0[index_C];
    C[index_C*FEATURE_MAP_SIZE + 1] = C_local1[index_C];
    C[index_C*FEATURE_MAP_SIZE + 2] = C_local2[index_C];
    .
    .
    .
    C[index_C*FEATURE_MAP_SIZE + 2915] = C_local2915[index_C];
}
Sample code responsible for data transfer into and out of the FPGA and for
generation of the parallel units is shown in code snippet 3. Ideally, we would like to
transfer the image array into the FPGA and the output array out of the FPGA in parallel
with the computation of the parallel units. This should be possible because we have
about 2916 input and output buffers (BRAM blocks) that are dual-ported. But the HLS
directive responsible for interleaving external memory access with internal access and
computation doesn’t seem to work here, possibly because of the large number of BRAM
blocks.
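The structure of code snippet 3 is easier to follow in a rolled form, where the separately named buffers become rows of two-dimensional arrays. The following is a functional model only, under stated assumptions: DEMO_FILTERS, DEMO_PIXELS and DEMO_WINDOW are tiny stand-ins for 64, 2916 and 363, the function name conv_layer_banked is hypothetical, and the innermost loop corresponds to the statements that are written out by hand in the actual design.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_FILTERS = 2, DEMO_PIXELS = 3, DEMO_WINDOW = 4 };

void conv_layer_banked(const float A[DEMO_FILTERS * DEMO_WINDOW],
                       const float B[DEMO_WINDOW * DEMO_PIXELS],
                       float C[DEMO_FILTERS * DEMO_PIXELS])
{
    float B_local[DEMO_PIXELS][DEMO_WINDOW];   /* one row per input block */
    float C_local[DEMO_PIXELS][DEMO_FILTERS];  /* one row per output block */

    assert(A != NULL && B != NULL && C != NULL);

    /* Load phase: window p of the image goes into block p. */
    for (size_t k = 0; k < DEMO_WINDOW; ++k)
        for (size_t p = 0; p < DEMO_PIXELS; ++p)
            B_local[p][k] = B[k * DEMO_PIXELS + p];

    for (size_t p = 0; p < DEMO_PIXELS; ++p)
        for (size_t i = 0; i < DEMO_FILTERS; ++i)
            C_local[p][i] = 0.0f;

    /* Compute phase: one weight is broadcast to all processing units. */
    for (size_t i = 0; i < DEMO_FILTERS; ++i) {
        for (size_t k = 0; k < DEMO_WINDOW; ++k) {
            float a_part = A[i * DEMO_WINDOW + k];
            for (size_t p = 0; p < DEMO_PIXELS; ++p)   /* parallel in hardware */
                C_local[p][i] += a_part * B_local[p][k];
        }
    }

    /* Off-load phase: gather the per-block results into the output array. */
    for (size_t i = 0; i < DEMO_FILTERS; ++i)
        for (size_t p = 0; p < DEMO_PIXELS; ++p)
            C[i * DEMO_PIXELS + p] = C_local[p][i];
}
```

The three phases are the ones that the ‘dataflow’ directive would have to overlap; in this model, as in the realized design, they run strictly one after the other.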












REFERENCES 
 
[1] D. G. Lowe, “Object recognition from local scale-invariant features,” Proc. 
Seventh IEEE Int. Conf. Comput. Vis., vol. 2, 1999. 
[2] J. Sivic and A. Zisserman, “Efficient visual search of videos cast as text retrieval,” 
IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 4, pp. 591–606, 2009. 
[3] S. Fidler and A. Leonardis, “Towards scalable representations of object categories: 
Learning a hierarchy of parts,” in Proceedings of the IEEE Computer Society 
Conference on Computer Vision and Pattern Recognition, 2007. 
[4] R. Kasturi, D. Goldgof, S. A. Street, S. College, M. Anderson, M. Peot, M. Aguilar, 
D. Khosla, Y. Chen, K. Kim, E. Krotkov, D. D. Hackett, G. Technologies, L. 
Elazary, R. C. Voorhies, and D. F. Parks, “Performance Evaluation of 
Neuromorphic-Vision Object Recognition Algorithms.” 
[5] T. Dean, G. Corrado, and J. Shlens, “Three Controversial Hypotheses Concerning 
Computation in the Primate Cortex,” Twenty-Sixth AAAI Conf. Artif. Intell., pp. 
1543–1549, 2012. 
[6] R. P. Rao and D. H. Ballard, “Predictive coding in the visual cortex: a functional 
interpretation of some extra-classical receptive-field effects,” Nat. Neurosci., vol. 
2, pp. 79–87, 1999. 
[7] D. George, “How the brain might work: A hierarchical and temporal model for 
learning and recognition,” Learning, no. June, p. 191, 2008. 
[8] J. Hawkins, S. Ahmad, and D. Dubinsky, “HTM Cortical Learning Algorithms,” 
pp. 1–68, 2011. 
[9] D. George and J. Hawkins, “Hierarchical Bayesian Model of Invariant Recognition 
in the Visual Cortex Pattern,” Proceedings. 2005 IEEE Int. Jt. Conf. Neural 
Networks, 2005., vol. 3, pp. 1812–1817, 2005. 
[10] Q. Le, A. Karpenko, J. Ngiam, and A. Ng, “ICA with reconstruction cost for 
efficient overcomplete feature learning,” Adv. Neural …, pp. 1–9, 2011. 
[11] Y. Bengio, “Learning Deep Architectures for AI,” Found. Trends® Mach. Learn., 
vol. 2, pp. 1–127, 2009. 
[12] L. Deng and D. Yu, “Deep Learning: Methods and Applications,” Found. Trends 
Signal Process., vol. 7, pp. 197–387, 2013. 
[13] G. Hinton, “Deep belief networks,” Scholarpedia, pp. 4–5, 2009. 
[14] P. Dayan and L. F. Abbott, “Theoretical Neuroscience: Computational and 
Mathematical Modeling of Neural Systems,” Comput. Math. Model. Neural …, p. 
480, 2001. 
[15] M. Riesenhuber and T. A. Poggio, “Hierarchical models of object recognition in 
cortex,” Nat. Neurosci., vol. 2, no. 11, pp. 1019–1025, 1999. 
[16] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” Lect. 
Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes 
Bioinformatics), vol. 6791 LNCS, no. PART 1, pp. 44–51, 2011. 
[17] Q. V Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng, “Tiled 
convolutional neural networks,” Adv. Neural Inf. Process. Syst. 23, pp. 1279–1287, 
2010. 
[18] N. Brady and D. J. Field, “Local contrast in natural images: Normalisation and 
coding efficiency,” Perception, vol. 29, no. 9, pp. 1041–1055, 2000. 
[19] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann 
Machines,” Proc. 27th Int. Conf. Mach. Learn., pp. 807–814, 2010. 
[20] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural Networks,” 
Proc. 14th Int. Conf. Artif. Intell. Statistics 2011, vol. 15, pp. 315–323, 2011. 
[21] K. Fukushima, “Neocognitron: A self-organizing neural network model for a 
mechanism of pattern recognition unaffected by shift in position,” Biol. Cybern., 
vol. 36, no. 4, pp. 193–202, 1980. 
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied 
to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2323, 1998. 
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with 
Deep Convolutional Neural Networks,” Adv. Neural Inf. Process. Syst., pp. 1–9, 
2012. 
[24] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional 
Networks,” arXiv Prepr. arXiv:1311.2901, 2013. 
[25] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional 
networks,” in Proceedings of the IEEE Computer Society Conference on Computer 
Vision and Pattern Recognition, 2010, pp. 2528–2535. 
[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature 
hierarchies for accurate object detection and semantic segmentation,” Cvpr’14, pp. 
2–9, 2014. 
[27] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V Le, M. Z. Mao, M. A. 
Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large Scale Distributed 
Deep Networks,” NIPS 2012 Neural Inf. Process. Syst., pp. 1–11, 2012. 
[28] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and 
A. Y. Ng, “Building high-level features using large scale unsupervised learning,” 
2011. 
[29] C. Szegedy, S. Reed, P. Sermanet, V. Vanhoucke, and A. Rabinovich, “Going 
deeper with convolutions,” pp. 1–12, 2014. 
[30] S. Arora, A. Bhaskara, R. Ge, and T. Ma, “Provable Bounds for Learning Some 
Deep Representations,” arXiv Prepr. arXiv:1310.6343, p. 18, 2013. 
[31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, 
and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature 
Embedding,” ACM Conf. Multimed., 2014. 
[32] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob, 
“DRAMsim,” ACM SIGARCH Comput. Archit. News, vol. 33, no. 4, p. 100, 2005. 
[33] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. 
Reddi, and K. Hazelwood, “Pin,” in Proceedings of the 2005 ACM SIGPLAN 
conference on Programming language design and implementation - PLDI ’05, 
2005, p. 190. 
[34] Y. Jia, “Learning Semantic Image Representations at a Large Scale,” 2014. 
[35] K. Ovtcharov, O. Ruwase, J. Kim, J. Fowers, K. Strauss, and E. S. Chung, 
“Accelerating Deep Convolutional Neural Networks Using Specialized Hardware,” 
pp. 3–6, 2015. 
[36] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based 
Accelerator Design for Deep Convolutional Neural Networks,” Proc. 2015 
ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays - FPGA ’15, pp. 161–
170, 2015. 
[37] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, “Memory-Centric 
Accelerator Design for Convolutional Neural Networks,” Comput. Des. (ICCD), 
2013 IEEE 31st Int. Conf., pp. 13–19, 2013. 
[38] P. P. Specification and D. Resources, “UltraScale Architecture and Product 
Overview Summary of Features Processing System,” vol. 890, pp. 1–31, 2015. 
 
 
 
 
 
 
 
 
