SPEC2: SPECtral SParsE CNN Accelerator on FPGAs by Niu, Yue et al.
Yue Niu1, Hanqing Zeng1, Ajitesh Srivastava1, Kartik Lakhotia1, Rajgopal Kannan2, Yanzhi Wang3, and Viktor Prasanna1
1University of Southern California, {yueniu,zengh,ajiteshs,klakhoti,prasanna}@usc.edu
2US Army Research Lab-West, rajgopal.kannan.civ@mail.mil
3Northeastern University, yanz.wang@northeastern.edu
Abstract—To accelerate inference of Convolutional Neural Net-
works (CNNs), various techniques have been proposed to reduce
computation redundancy. Converting convolutional layers into the
frequency domain significantly reduces the computational complexity
of the sliding-window operations in the spatial domain. On the
other hand, weight pruning techniques address the redundancy in
model parameters by converting dense convolutional kernels into
sparse ones. To obtain high-throughput FPGA implementation,
we propose SPEC2 – the first work to prune and accelerate spectral
CNNs. First, we propose a systematic pruning algorithm based
on the Alternating Direction Method of Multipliers (ADMM). The
offline pruning iteratively sets the majority of spectral weights
to zero, without using any handcrafted heuristics. Then, we
design an optimized pipeline architecture on FPGA that has
efficient random access into the sparse kernels and exploits
various dimensions of parallelism in convolutional layers. Overall,
SPEC2 achieves high inference throughput with extremely low
computation complexity and negligible accuracy degradation. We
demonstrate SPEC2 by pruning and implementing LeNet and
VGG16 on the Xilinx Virtex platform. After pruning 75% of
the spectral weights, SPEC2 achieves 0% accuracy loss for LeNet,
and < 1% accuracy loss for VGG16. The resulting accelerators
achieve up to 24× higher throughput, compared with the state-
of-the-art FPGA implementations for VGG16.
Index Terms—CNN accelerator, Model Compression, Parallel
Implementation, Spectral Convolution, Alternating Direction
Method of Multipliers, Sparse Computation Engine
I. INTRODUCTION
Convolutional Neural Networks (CNNs) are powerful deep
learning models, widely deployed for computer vision tasks
[1], [2]. The high accuracy of CNNs comes at the price
of high computation cost. For example, VGG16 [2] requires
more than 30 billion multiplications and additions to run inference
on a single image. Although light-weight CNN models [3], [4]
have been recently proposed to meet the hardware constraints
of embedded systems, large models [2], [5] are still irreplace-
able in many advanced computer vision tasks. Motivated by
the computation challenge, there have been many attempts
to speed up CNN inference. From the algorithm perspective,
spectral convolution [6] and spatial weight pruning [7] are
two successful methods to reduce the number of operations of
a CNN. From the architecture perspective, hardware designs
have been tailored on FPGAs to fit the computation and
communication characteristics of inference.
Although accelerators co-designed with the algorithmic and
architectural innovations have shown impressive throughput
and latency performance, challenges still exist in state-of-
the-art approaches. On the one hand, while spectral CNN
accelerators [8], [6] perform exact inference and achieve 3×
to 4× higher throughput than spatial accelerators, they have
significantly higher memory requirement due to enlarged spec-
tral kernels. On the other hand, weight pruning on CNNs has
been explored to reduce computation and storage requirements
simultaneously. However, no spectral domain pruning has been
performed in the literature. Further, many pruning approaches
(e.g., [7], [9], [10]), depend on heuristics, making the quality
of pruning (in terms of accuracy) sensitive to training data. In
addition, the irregular sparse structure of the pruned kernels
may result in low hardware efficiency when performing the
sliding window operation of spatial convolution. Taking the
hardware overhead into account, a heavily pruned spatial CNN
may not lead to significantly faster inference. Sparse operations
need particular optimization not only in CNNs but also in
other applications [11], [12].
To overcome the above challenges, we propose SPEC2: a
novel approach to speed up CNN inference by applying systematic
weight pruning to spectral convolutional layers. Inference
of SPEC2 has extremely low computation complexity since (i)
the spectral transformation reduces computation redundancy,
and (ii) the kernel pruning reduces parameter redundancy.
Memory requirement of SPEC2 is also significantly reduced,
since after pruning, we only store on chip the non-zeros
of the sparse kernels. Finally, unlike other implementations
based on spatial pruning, SPEC2 achieves high hardware
efficiency. SPEC2 replaces the sliding window operation (of
spatial convolution) with element-wise multiplications (of
spectral convolution), and enables us to develop a sparse tensor
computation engine achieving close to 100% DSP utilization.
Overall, the systematic pruning algorithm of SPEC2 ensures
both high accuracy and high pruning rate of spectral CNNs,
and the corresponding FPGA design achieves high inference
throughput. The main contributions of this paper are:
• We propose SPEC2 to achieve high throughput inference
of spectral sparse CNNs, by:
– Spectral weight pruning formulation using the Alternating
Direction Method of Multipliers (ADMM) algorithm.
It minimizes classification error while constraining the
number of non-zero weights based on target sparsity.
arXiv:1910.11103v1 [cs.CV] 16 Oct 2019
– Spectral retraining of the model to iteratively fine-tune
non-zero spectral weights and improve accuracy.
– Sparse convolution accelerator, which implements the
pruned spectral convolution on FPGA. The accelerator
exploits various dimensions of parallelism, and uses a
small number of data replications to significantly reduce
BRAM conflicts and pipeline stalls.
• We extensively evaluate SPEC2 on the following metrics:
– Inference accuracy: We show that pruning 75% of the
weights of the spectral kernels leads to negligible
accuracy loss (0% for LeNet on MNIST dataset and
< 1% for VGG16 on CIFAR-10 dataset).
– Inference throughput: We implement the accelerator
on Xilinx Virtex-7 XC7VX690T FPGA. The design
achieves 99% of the theoretically optimal throughput
performance and 24× higher throughput than state-of-
the-art.
II. BACKGROUND AND RELATED WORK
A. Spectral CNNs
The main building blocks of a CNN are convolutional
layers. A convolutional layer convolves the layer inputs with
the kernels to generate the layer outputs (i.e., activations).
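Let $X \in \mathbb{R}^{b \times c_{\mathrm{in}} \times h_{\mathrm{act}} \times h_{\mathrm{act}}}$ and $Y \in \mathbb{R}^{b \times c_{\mathrm{out}} \times h'_{\mathrm{act}} \times h'_{\mathrm{act}}}$ be the
input and output tensors of a convolutional layer, where $b$
denotes the batch size, $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ denote the number of input
and output channels respectively, and $h_{\mathrm{act}}$ and $h'_{\mathrm{act}}$ denote the
input and output spatial dimensions (width and height), respectively.
Let $W \in \mathbb{R}^{c_{\mathrm{out}} \times c_{\mathrm{in}} \times h_{\mathrm{krn}} \times h_{\mathrm{krn}}}$ be the tensor of spatial convolutional kernels,
where $h_{\mathrm{krn}}$ denotes the spatial dimension (width and height).
We use $\mathcal{F}_n(\cdot)$ and $\mathcal{F}_n^{-1}(\cdot)$ for the $n \times n$ 2D Fast Fourier
Transform (FFT) operation and its inverse. Let "~" denote
the corresponding spectral domain variable. For example, let
$\tilde{W}_{j,i} = \mathcal{F}_{h_{\mathrm{act}}+h_{\mathrm{krn}}-1}(W_{j,i})$ and $\tilde{X}_{k,i} = \mathcal{F}_{h_{\mathrm{act}}+h_{\mathrm{krn}}-1}(X_{k,i})$. A
convolutional layer operates as below:
$$
Y_{k,j} = \underbrace{\sum_{i=1}^{c_{\mathrm{in}}} X_{k,i} * W_{j,i}}_{\text{Spatial convolution}}
= \underbrace{\mathcal{F}^{-1}_{h_{\mathrm{act}}+h_{\mathrm{krn}}-1}\left( \sum_{i=1}^{c_{\mathrm{in}}} \tilde{X}_{k,i} \circ \tilde{W}_{j,i} \right)}_{\text{Spectral convolution}}
\tag{1}
$$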
where $1 \le k \le b$ and $1 \le j \le c_{\mathrm{out}}$; "∗" denotes the
sliding window operation of spatial convolution; "◦" denotes
the Hadamard product operation (element-wise multiplication)
of spectral convolution. Note that Equation 1 applies to convolution
layers of stride 1 and padding $h_{\mathrm{krn}} - 1$. For padding
less than $h_{\mathrm{krn}} - 1$, we further crop $Y_{k,j}$. For stride
larger than 1, we further slice $Y_{k,j}$ (see footnote 1).
Footnote 1: When stride is larger than 1, spectral convolution may not lead to reduced
computation complexity. However, most of the convolutional layers of modern
CNNs (e.g., [2], [5], [13]) have stride equal to 1.
We apply the Overlap-and-Add (OaA) technique [8], [6]. To
perform spectral convolution with $n \times n$ FFT (where $h_{\mathrm{krn}} \le n \le h_{\mathrm{act}} + h_{\mathrm{krn}} - 1$),
we follow the below steps for each batch
index $k$ and channel indices $i$, $j$:
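1) Partition $X_{k,i}$ into tiles $X^{t,s}_{k,i} \in \mathbb{R}^{m \times m}$, where $m = n - h_{\mathrm{krn}} + 1$ and $1 \le t, s \le \lceil h_{\mathrm{act}} / m \rceil$;
2) Transform the tiles $X^{t,s}_{k,i}$ and kernels $W_{j,i}$ into spectral representation (after proper padding), as $\tilde{X}^{t,s}_{k,i} = \mathcal{F}_n\left(X^{t,s}_{k,i}\right)$ and $\tilde{W}_{j,i} = \mathcal{F}_n\left(W_{j,i}\right)$;
3) Obtain output tiles $Y^{t,s}_{k,j} = \mathcal{F}_n^{-1}\left( \sum_{i=1}^{c_{\mathrm{in}}} \tilde{X}^{t,s}_{k,i} \circ \tilde{W}_{j,i} \right)$;
4) Combine $Y^{t,s}_{k,j}$ into $Y_{k,j}$ by OaA. For each tile $Y^{t,s}_{k,j}$,
place its adjacent tiles (i.e., $Y^{t\pm1,s}_{k,j}$, $Y^{t,s\pm1}_{k,j}$) with $h_{\mathrm{krn}} - 1$
horizontally or vertically overlapping pixels. We obtain
the final $Y_{k,j}$ by adding the overlapped pixels.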
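The four OaA steps above can be illustrated with a minimal NumPy sketch for a single input map and a single kernel (the sum over input channels and the batch dimension are omitted for brevity). The function names `oaa_spectral_conv` and `direct_full_conv` are hypothetical and not from the paper. The sketch computes a true (flipped-kernel) full convolution, which corresponds to padding $h_{\mathrm{krn}} - 1$ in Equation 1.

```python
import numpy as np

def oaa_spectral_conv(x, w, n):
    """Full 2D convolution of an (h, h) map with a (k, k) kernel using n x n FFT tiles."""
    h, k = x.shape[0], w.shape[0]
    m = n - k + 1                       # step 1: tile size m = n - h_krn + 1
    tiles = -(-h // m)                  # ceil(h / m) tiles per dimension
    w_f = np.fft.fft2(w, s=(n, n))      # step 2: kernel zero-padded to n x n
    out = np.zeros((tiles * m + k - 1, tiles * m + k - 1))
    for t in range(tiles):
        for s in range(tiles):
            tile = x[t * m:(t + 1) * m, s * m:(s + 1) * m]  # border tiles may be smaller
            tile_f = np.fft.fft2(tile, s=(n, n))            # step 2: tile zero-padded to n x n
            y = np.fft.ifft2(tile_f * w_f).real             # step 3: Hadamard product + IFFT
            out[t * m:t * m + n, s * m:s * m + n] += y      # step 4: overlap by k-1 and add
    return out[:h + k - 1, :h + k - 1]

def direct_full_conv(x, w):
    """Reference full convolution by shift-and-add in the spatial domain."""
    h, k = x.shape[0], w.shape[0]
    out = np.zeros((h + k - 1, h + k - 1))
    for a in range(k):
        for b in range(k):
            out[a:a + h, b:b + h] += w[a, b] * x
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 10))
w = rng.standard_normal((3, 3))
print(np.allclose(oaa_spectral_conv(x, w, n=8), direct_full_conv(x, w)))
```

Because each zero-padded tile has support $m + h_{\mathrm{krn}} - 1 = n$, the circular convolution computed by the $n \times n$ FFT equals the linear convolution of that tile, which is why the overlapped sums reproduce the spatial result.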
With OaA, we can select the FFT size n to balance the
computation and communication capacity of the target FPGA.
Therefore, inference throughput can be maximized. It has been
shown in [6] that, setting n = 8 or 16 reduces the computation
complexity of spatial convolution by 3× to 4×, and improves
the overall FPGA throughput by up to 5×.
While the state-of-the-art accelerator [6] already achieves high
throughput, its design applies only to unpruned spectral
CNNs. In this paper, we significantly improve the design of
[6] by reducing the redundancy in the model parameters $\tilde{W}$.
B. CNN Weight Pruning
Pruning algorithms leverage redundancy in CNN weight
parameters. For a CNN with convolutional layers and fully
connected layers, in general, convolutional layers contribute
to most (99% for VGG16) of the computation workload, but
contain only a small portion (10.6% for VGG16) of the model
weights. Thus, compared with fully connected layers, pruning
convolutional layers is more important, yet much more difficult
as well. For spatial CNNs, pruning rate on fully connected
layers can be as high as 30× [14], [15], while pruning rate on
convolutional layers is often less than 10× [14].
Below we summarize state-of-the-art works in pruning
spatial CNNs. The pioneering work [7] uses an iterative heuristic
to prune weights of small magnitudes and achieves 2.7× re-
duction in convolutional weights on AlexNet [1]. This method
has been extended in two directions. The first is to improve
compression rate by using more sophisticated heuristics. For
example, [9] incorporates both weight pruning and growing,
and [10] uses L1 regularization. Pruning of [16] is based on
a genetic algorithm. The second direction is to enhance the
hardware implementation efficiency by deriving an effective
tradeoff between accuracy and pruning rate, e.g., the energy-
aware pruning [17], and structure-aware pruning [18], [10].
FPGA hardware accelerators [19], [20] have also been in-
vestigated to accommodate pruned CNNs, by leveraging the
reconfigurability in on-chip resources. Recently, the authors of
[14] have developed a systematic weight pruning framework
based on the powerful optimization tool ADMM (Alternating
Direction Method of Multipliers) [21]. This framework consistently
achieves a higher pruning rate than prior art.
Besides the works on spatial CNN pruning, [22] proposes a
frequency-domain compression technique which first converts
inputs and kernels of all layers into frequency domain using
discrete cosine transform (DCT), then dynamically prunes
unimportant connections. Although this method reduces parameters
in convolutional layers by up to 70%, it is difficult
to leverage this technique on hardware platforms.
Given the performance gains of the spectral convolution algorithm [6],
combining it with pruning creates more potential for
high-throughput hardware implementation. To the best of our
knowledge, SPEC2 is the first work to exploit such an opportunity.
C. CNN Accelerators
By leveraging the computation and memory access patterns
of CNNs, many domain-specific architectures have been pro-
posed for accelerating inference. Among these works, [23]
shows a unified convolutional matrix-multiplication represen-
tation for both convolutional and fully-connected layers. This
micro-architecture is optimized for both computation speed
and resource utilization. In [24], a pipelined, DRAM-free
architecture is proposed to simplify data movement between
memory and processing elements. This design achieves very
high working frequency as well as DSP utilization. On the
other hand, to exploit sparsity in pruned CNNs, [25] introduces
a sparse computation architecture aware of zeros in both
weights and activations. Besides acceleration in the spatial domain,
[8] and [6] accelerate convolutional layers in the spectral domain,
where the computation primitives are Fast Fourier Transforms
and Hadamard products. Using the additional Overlap-and-
Add technique, computation complexity (of unpruned CNNs)
can be significantly reduced without hardware reconfiguration.
ASICs also demonstrate promising performance in processing
CNNs due to their customization ability and low power
consumption. Google TPUs [26] employ systolic array based
matrix multiplication units to execute convolutional and fully-
connected layers with 70× higher power efficiency than GPUs.
III. PRUNING SPECTRAL CNNS
A. Overview
The goal of weight pruning is to convert the dense spectral
kernels $\tilde{W}$ into sparse ones, while keeping the overall
classification accuracy of CNNs unaffected. ADMM is an
optimization framework that we can utilize to iteratively
achieve the above goal, without using handcrafted heuristics.
As summarized by Figure 1, the inputs to our pruning
algorithm are the $L$-layer spectral CNN with its unpruned
weights $\{ \tilde{W}^{\ell} \mid 1 \le \ell \le L \}$, and the target pruning factor $\alpha$.
The output is the CNN with sparse spectral weights $\{ \tilde{W}^{\ell}_{\alpha} \}$,
where the subscript $\alpha$ ($> 1$) means that at most $\frac{1}{\alpha}$ of the tensor
elements are non-zeros. Spectral pruning of SPEC2 consists of
three steps. The ADMM training step rescales the weights by
joint consideration of the classification accuracy and sparsity
requirements. The outputs $\{ \tilde{W}^{\ell}_{\mathrm{AD}} \}$ are still dense tensors,
where a $\left(1 - \frac{1}{\alpha}\right)$ portion of the elements are driven to close-to-zero
values. The pruning step simply removes the close-to-zero
values of $\{ \tilde{W}^{\ell}_{\mathrm{AD}} \}$ and returns sparse tensors $\{ \tilde{W}^{\ell}_{\mathrm{PR}} \}$. The
re-training step fine-tunes the non-zero values of $\{ \tilde{W}^{\ell}_{\mathrm{PR}} \}$ to
recover accuracy. The final outputs are the sparse spectral kernels
$\{ \tilde{W}^{\ell}_{\alpha} \}$. It is worth noting that the ADMM training can
be understood as solving a constrained optimization problem,
where the optimization objective is loss minimization to ensure
high classification accuracy, and the constraint is imposed by
the pruning factor $\alpha$. Due to the non-convexity in the objective
function and the non-differentiability in the constraint, it is
hard to identify analytical solutions. Thus, ADMM breaks the
optimization problem into two sub-problems, where one of
them has a closed-form solution and the other can be solved
iteratively by the stochastic gradient descent algorithm. The
ADMM training alternates between the two sub-problems, and
eventually they will both converge. The solution identified by
ADMM ensures high accuracy, and can be easily pruned.
Fig. 1: SPEC2 spectral pruning overview. Flow: $\{ \tilde{W}^{\ell} \}$ → ADMM training → $\{ \tilde{W}^{\ell}_{\mathrm{AD}} \}$ → Pruning → $\{ \tilde{W}^{\ell}_{\mathrm{PR}} \}$ → Re-training → $\{ \tilde{W}^{\ell}_{\alpha} \}$. Hyper-parameters shown: $\alpha$; $\rho$, $\eta$; $\eta'$.
Different from the existing ADMM based approach [14],
which performs pruning in the spatial domain, SPEC2 performs
end-to-end spectral pruning without transformation between
$W$ and $\tilde{W}$. Such spectral pruning enables us to exploit
redundancy in both the sliding window operation and
the spectral kernel weights. Therefore, together with our novel
FPGA architecture (Section IV), SPEC2 achieves remarkably
high inference throughput without accuracy loss.
B. ADMM Training
To prune an L-layer CNN with pruning factor α, we abstract
the problem as the below optimization problem:
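$$
\begin{aligned}
\underset{\{ \tilde{W}^{\ell}_{*} \}}{\text{minimize}} \quad & \mathrm{Loss}\left( \hat{\mathcal{L}}, \mathcal{L}; \{ \tilde{W}^{\ell}_{*} \} \right) \\
\text{subject to} \quad & \mathrm{NNZ}\left( \tilde{W}^{\ell}_{*} \right) \le \frac{1}{\alpha} \cdot \mathrm{Size}\left( \tilde{W}^{\ell}_{*} \right), \quad \forall\, 1 \le \ell \le L.
\end{aligned}
$$
where $\mathrm{Loss}(\cdot)$ gives the CNN loss calculated by the predicted
labels $\hat{\mathcal{L}}$ and the ground-truth $\mathcal{L}$, under the current model
parameters $\{ \tilde{W}^{\ell}_{*} \}$. Function $\mathrm{NNZ}(\cdot)$ measures the number of
non-zeros of the tensor, and $\mathrm{Size}(\cdot)$ returns the total number
of elements of the tensor (i.e., $\mathrm{Size}(\tilde{W}) = c_{\mathrm{in}} \cdot c_{\mathrm{out}} \cdot n^2$).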
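The above problem is hard to solve since the constraint
is non-differentiable. Therefore, we perform several steps of
transformation to make the problem tractable. Firstly, we
rewrite the above optimization problem with an auxiliary
variable $\tilde{Z}^{\ell}_{*}$ and an indicator function $\mathrm{Ind}(\cdot)$:
$$
\begin{aligned}
\underset{\{ \tilde{W}^{\ell}_{*} \}}{\text{minimize}} \quad & \mathrm{Loss}\left( \hat{\mathcal{L}}, \mathcal{L}; \{ \tilde{W}^{\ell}_{*} \} \right) + \sum_{\ell=1}^{L} \mathrm{Ind}\left( \tilde{Z}^{\ell}_{*}, \alpha \right) \\
\text{subject to} \quad & \tilde{Z}^{\ell}_{*} = \tilde{W}^{\ell}_{*}, \quad \forall\, 1 \le \ell \le L.
\end{aligned}
\tag{2}
$$
where
$$
\mathrm{Ind}\left( \tilde{Z}^{\ell}_{*}, \alpha \right) =
\begin{cases}
0, & \text{if } \mathrm{NNZ}\left( \tilde{Z}^{\ell}_{*} \right) \le \frac{1}{\alpha} \mathrm{Size}\left( \tilde{W}^{\ell}_{*} \right) \\
+\infty, & \text{otherwise}
\end{cases}
$$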
The indicator function, seen as enforcing a hard penalty on the
spectral kernel sparsity, still makes the optimization difficult.
Therefore, as the next step of transformation, we relax the
simple constraint of Equation 2. During the optimization
process, we do not enforce $\tilde{Z}^{\ell}_{*} = \tilde{W}^{\ell}_{*}$. Instead, we apply
to the objective function an additional penalty term $\mathrm{Pty}(\cdot)$
measuring the distance between $\tilde{Z}^{\ell}_{*}$ and $\tilde{W}^{\ell}_{*}$. Then the optimization
problem (on $\tilde{W}^{\ell}_{*}$ and $\tilde{Z}^{\ell}_{*}$) becomes unconstrained,
and thus can be solved iteratively. The values of $\tilde{W}^{\ell}_{*}$ and $\tilde{Z}^{\ell}_{*}$
are updated in each iteration, and ultimately, they converge to
the same value (i.e., $\tilde{W}^{\ell}_{*} \to \tilde{Z}^{\ell}_{*}$), which is regarded as a good
solution to the objective. Theoretically, penalty $\mathrm{Pty}(\cdot)$ should
grow with the number of iterations to ensure convergence.
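With the above intuitions, we formally describe the algorithm
to solve Equation 2 under ADMM. Let $\tilde{U}^{\ell}_{*}$ measure
the difference between $\tilde{W}^{\ell}_{*}$ and $\tilde{Z}^{\ell}_{*}$. Define function
$$
f_{\mathrm{AD}} := \mathrm{Loss}\left( \hat{\mathcal{L}}, \mathcal{L}; \{ \tilde{W}^{\ell}_{*} \} \right) + \sum_{\ell=1}^{L} \mathrm{Ind}\left( \tilde{Z}^{\ell}_{*}, \alpha \right) + \sum_{\ell=1}^{L} \mathrm{Pty}\left( \tilde{W}^{\ell}_{*}, \tilde{Z}^{\ell}_{*}, \tilde{U}^{\ell}_{*} \right).
$$
ADMM decomposes the optimization
on $f_{\mathrm{AD}}$ into two sub-problems, one on $\tilde{W}^{\ell}_{*}$ only and the
other on $\tilde{Z}^{\ell}_{*}$. ADMM finds a solution by "alternating the
optimization direction" along $\tilde{W}^{\ell}_{*}$ and $\tilde{Z}^{\ell}_{*}$. The update rule
for $\tilde{W}^{\ell}_{*}$ and $\tilde{Z}^{\ell}_{*}$ at iteration $i + 1$ is defined as Equation 3.
Now we use the Frobenius norm as the measure of matrix
distance. Thus,
$$
\mathrm{Pty}\left( \tilde{W}, \tilde{Z}, \tilde{U} \right) := \frac{\rho}{2} \left\| \tilde{W} - \tilde{Z} + \tilde{U} \right\|_F^2 - \frac{\rho}{2} \left\| \tilde{U} \right\|_F^2,
$$
where $\rho$ is a given constant coefficient. The first
sub-problem of Equation 3 often does not have an analytic
solution (due to non-convexity of $\mathrm{Loss}(\cdot)$), but can be solved
by stochastic gradient descent since both $\mathrm{Loss}(\cdot)$ and $\mathrm{Pty}(\cdot)$
are differentiable.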
In fact, the objective $\mathrm{Loss}\left( \{ \tilde{W}^{\ell} \} \right) + \sum \mathrm{Pty}\left( \tilde{W}^{\ell}, \tilde{Z}^{\ell}, \tilde{U}^{\ell} \right)$
can be understood as the standard
CNN loss plus an additional regularization term, and so
existing techniques on CNN training are all applicable here.
The second sub-problem admits a simple analytic solution.
Clearly, $\tilde{Z}^{\ell}_{i+1}$ should satisfy the sparsity constraint by $\alpha$ (so
that $\mathrm{Ind}\left( \tilde{Z}^{\ell}_{i+1}, \alpha \right) = 0$). Then, due to the special form of
the penalty (i.e., Frobenius norm), to get the optimal $\tilde{Z}^{\ell}_{i+1}$, we
take $\tilde{W}^{\ell}_{i+1} + \tilde{U}^{\ell}_{i}$ and set the elements of smallest magnitude
to zero, until $\frac{1}{\alpha} \cdot \mathrm{Size}\left( \tilde{Z}^{\ell}_{i+1} \right)$ non-zeros remain. As for the
variable $\tilde{U}$, note that it records the cumulative difference
between $\tilde{W}$ and $\tilde{Z}$, thus enforcing $\tilde{W}$ to eventually converge
to $\tilde{Z}$.
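$$
\{ \tilde{W}^{\ell}_{i+1} \} = \underset{\{ \tilde{W}^{\ell}_{*} \}}{\arg\min}\; f_{\mathrm{AD}} = \underset{\{ \tilde{W}^{\ell}_{*} \}}{\arg\min} \left( \mathrm{Loss}\left( \hat{\mathcal{L}}, \mathcal{L}; \{ \tilde{W}^{\ell}_{*} \} \right) + \sum_{\ell=1}^{L} \mathrm{Pty}\left( \tilde{W}^{\ell}_{*}, \tilde{Z}^{\ell}_{i}, \tilde{U}^{\ell}_{i} \right) \right)
\tag{3a}
$$
$$
\{ \tilde{Z}^{\ell}_{i+1} \} = \underset{\{ \tilde{Z}^{\ell}_{*} \}}{\arg\min}\; f_{\mathrm{AD}} = \underset{\{ \tilde{Z}^{\ell}_{*} \}}{\arg\min} \left( \sum_{\ell=1}^{L} \mathrm{Ind}\left( \tilde{Z}^{\ell}_{*}, \alpha \right) + \sum_{\ell=1}^{L} \mathrm{Pty}\left( \tilde{W}^{\ell}_{i+1}, \tilde{Z}^{\ell}_{*}, \tilde{U}^{\ell}_{i} \right) \right)
\tag{3b}
$$
$$
\{ \tilde{U}^{\ell}_{i+1} \} = \{ \tilde{U}^{\ell}_{i} + \tilde{W}^{\ell}_{i+1} - \tilde{Z}^{\ell}_{i+1} \}
\tag{3c}
$$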
In summary, we iteratively update $\tilde{W}$ and $\tilde{Z}$ in an "alternating
direction" manner. Throughout all iterations, $\mathrm{NNZ}(\tilde{Z}) \le \frac{1}{\alpha} \cdot \mathrm{Size}(\tilde{Z})$
is always ensured, while $\mathrm{NNZ}(\tilde{W})$ may not
satisfy the $\alpha$ constraint. At the end of the ADMM training, a
$\left(1 - \frac{1}{\alpha}\right)$ fraction of the elements in $\tilde{W}^{\ell}_{\mathrm{AD}}$ are close to (but not
exactly equal to) zero. We then prune those near-zero elements
and fine-tune the remaining $\frac{1}{\alpha}$ fraction of elements, by our
spectral re-training algorithm.
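The alternating updates of Equation 3 can be sketched on a toy problem as follows. This is a minimal illustration, not the paper's implementation: it assumes a single real-valued weight tensor (the spectral weights in the paper are complex), a user-supplied gradient function `loss_grad` standing in for back-propagation through the CNN loss, and illustrative values of `rho` and `lr` chosen for the toy quadratic loss rather than the values used in the experiments. The names `project_sparse` and `admm_prune` are hypothetical.

```python
import numpy as np

def project_sparse(t, alpha):
    """Z-update (Eq. 3b): keep the Size/alpha largest-magnitude entries, zero the rest."""
    keep = t.size // alpha
    flat = t.ravel()
    out = np.zeros_like(flat)
    if keep > 0:
        idx = np.argpartition(np.abs(flat), -keep)[-keep:]  # indices of the largest entries
        out[idx] = flat[idx]
    return out.reshape(t.shape)

def admm_prune(w0, loss_grad, alpha, rho=1.0, lr=0.25, outer=50, inner=20):
    """Toy ADMM training loop: returns the dense W and the sparse Z."""
    w = w0.copy()
    z = np.zeros_like(w)   # Z_0 = 0
    u = np.zeros_like(w)   # U_0 = 0
    for _ in range(outer):
        for _ in range(inner):
            # W-update (Eq. 3a): gradient of Loss + rho/2 * ||W - Z + U||_F^2 w.r.t. W
            w -= lr * (loss_grad(w) + rho * (w - z + u))
        z = project_sparse(w + u, alpha)   # Eq. 3b
        u = u + w - z                      # Eq. 3c
    return w, z

# Toy quadratic loss 0.5 * ||W - target||^2, whose gradient is W - target.
target = np.array([[4.0, -0.1, 0.3, -3.0], [0.2, 2.0, -0.05, 1.0]])
w_ad, z = admm_prune(target, lambda w: w - target, alpha=2)
print(np.count_nonzero(z), np.abs(w_ad - z).max())
```

In this toy setting the entries of `w_ad` outside the support of `z` are driven toward zero, which mirrors the statement that the dense ADMM output can be pruned with little change.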
C. Pruning and Re-Training
After the ADMM optimization, we remove the $\left(1 - \frac{1}{\alpha}\right)$
fraction of the near-zero elements of $\tilde{W}^{\ell}_{\mathrm{AD}}$, to get sparse
kernels $\tilde{W}^{\ell}_{\mathrm{PR}}$. Such a pruning operation results in slight accuracy
degradation (0.03% for LeNet on MNIST and 0.4% for
VGG16 on CIFAR-10). To recover accuracy, we further tune
the non-zero values of $\tilde{W}^{\ell}_{\mathrm{PR}}$, without changing the kernel
sparsity. We term the fine-tuning step spectral re-training.
Since re-training only aims at improving accuracy, we
define the loss function simply as $f_{\mathrm{RT}} = \mathrm{Loss}\left( \hat{\mathcal{L}}, \mathcal{L}; \{ \tilde{W}^{\ell}_{\mathrm{PR}} \} \right)$.
Thus, using stochastic gradient descent, re-training propagates
gradients from $f_{\mathrm{RT}}$ backwards and adjusts the values of $\tilde{W}^{\ell}_{\mathrm{PR}}$
accordingly. Note that since re-training does not change the
sparsity of $\tilde{W}^{\ell}_{\mathrm{PR}}$, the gradient $\frac{\partial f_{\mathrm{RT}}}{\partial \tilde{W}^{\ell}_{\mathrm{PR}}}$ is masked by the non-zero
positions of $\tilde{W}^{\ell}_{\mathrm{PR}}$. In summary, at the end of re-training,
we have $\mathrm{NNZ}\left( \tilde{W}^{\ell}_{\alpha} \right) \le \frac{1}{\alpha} \cdot \mathrm{Size}\left( \tilde{W}^{\ell}_{\alpha} \right)$, and the sparse spectral
CNN achieves high classification accuracy (99.1% for LeNet
on MNIST and 90.8% for VGG16 on CIFAR-10 when $\alpha = 4$).
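The masked update used in re-training can be sketched as below. As before, this is an illustration under assumptions: `loss_grad` is a hypothetical stand-in for the gradient of $f_{\mathrm{RT}}$, the weights are real-valued, and the function name `retrain` is not from the paper. The mask is computed once from the pruned kernel so that the sparsity pattern stays fixed even if a surviving weight passes through zero.

```python
import numpy as np

def retrain(w_pr, loss_grad, lr, steps):
    """Gradient descent that only updates the non-zero positions of w_pr."""
    mask = (w_pr != 0)                 # fixed sparsity pattern from the pruning step
    w = w_pr.copy()
    for _ in range(steps):
        w = w - lr * loss_grad(w) * mask   # masked gradient: pruned entries stay zero
    return w

# Toy quadratic loss pulling every weight toward 1.0.
w_pr = np.array([0.0, 2.0, 0.0, -1.0])
target = np.ones(4)
w_alpha = retrain(w_pr, lambda w: w - target, lr=0.5, steps=30)
print(w_alpha)
```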
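Algorithm 1 SPEC2 spectral pruning algorithm
Input: Unpruned spectral CNN $\{ \tilde{W}^{\ell} \}$; Pruning rate $\alpha$; Learning rate $\eta$, $\eta'$; Penalty coefficient $\rho$;
Output: Pruned spectral CNN with sparse kernels $\{ \tilde{W}^{\ell}_{\alpha} \}$
1: $\{ \tilde{W}^{\ell}_{0} \} \leftarrow \{ \tilde{W}^{\ell} \}$
2: $\{ \tilde{Z}^{\ell}_{0} \} \leftarrow \{ 0 \}$
3: $\{ \tilde{U}^{\ell}_{0} \} \leftarrow \{ 0 \}$
4: for iteration $i = 0$ to $i_{\max}$ do  ▷ ADMM training
5:     while not converged do
6:         $\{ \tilde{W}^{\ell}_{i} \} \leftarrow \{ \tilde{W}^{\ell}_{i} - \eta \cdot \frac{\partial f_{\mathrm{AD}}}{\partial \tilde{W}^{\ell}_{i}} \}$
7:     end while
8:     $\{ \tilde{W}^{\ell}_{i+1} \} \leftarrow \{ \tilde{W}^{\ell}_{i} \}$
9:     $\{ \tilde{Z}^{\ell}_{i+1} \} \leftarrow \{ \mathrm{Pruned}( \tilde{W}^{\ell}_{i+1} + \tilde{U}^{\ell}_{i} ) \}$
10:    $\{ \tilde{U}^{\ell}_{i+1} \} \leftarrow \{ \tilde{U}^{\ell}_{i} + \tilde{W}^{\ell}_{i+1} - \tilde{Z}^{\ell}_{i+1} \}$
11: end for
12: $\{ \tilde{W}^{\ell}_{\mathrm{AD}} \} \leftarrow \{ \tilde{W}^{\ell}_{i_{\max}} \}$
13: $\{ \tilde{W}^{\ell}_{\mathrm{PR}} \} \leftarrow \{ \mathrm{Pruned}( \tilde{W}^{\ell}_{\mathrm{AD}} ) \}$  ▷ Pruning
14: for iteration $i = 0$ to $i'_{\max}$ do  ▷ Re-training
15:     $\{ \tilde{W}^{\ell}_{\mathrm{PR}} \} \leftarrow \{ \tilde{W}^{\ell}_{\mathrm{PR}} - \eta' \cdot \mathrm{Mask}( \frac{\partial f_{\mathrm{RT}}}{\partial \tilde{W}^{\ell}_{\mathrm{PR}}} ) \}$
16: end for
17: $\{ \tilde{W}^{\ell}_{\alpha} \} \leftarrow \{ \tilde{W}^{\ell}_{\mathrm{PR}} \}$  ▷ End of SPEC2 pruning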
Algorithm 1 shows the overall SPEC2 pruning algorithm
(including ADMM training, pruning and re-training), as described
above. As a final remark, to support the stochastic
gradient descent algorithm for any FFT size $n$, we incorporate
the OaA technique into ADMM training and re-training.
During forward propagation, we partition the input activation
of each layer into $n \times n$ tiles and apply $\tilde{W}^{\ell}_{i}$ or $\tilde{W}^{\ell}_{\alpha}$ according to
the steps described in Section II-A. During backward propagation,
the gradients are derived by the automatic differentiation
algorithm of TensorFlow.
IV. ACCELERATOR DESIGN
A. Overview
By Equation 1, computation of sparse spectral convolution
can be decomposed into three steps: (1) 2D FFT on the
input activations; (2) Multiplication-Accumulation (MAC) for
Hadamard product computation and reduction along input
channel dimension; (3) 2D IFFT on output activations. Figure
2 shows the overall accelerator to compute the above steps.
The external DDR stores the input images, layer activations,
as well as the sparse spectral kernel weights of the CNN
model. Due to the limited on-chip BRAM, we perform tiling
of the activations and kernels. A tile of spectral kernel is
pre-loaded into the kernel buffer. Tiles of layer activations
communicate back and forth between DDR and FPGA to keep
the inference pipeline busy. Once a tile of previous-layer acti-
vations arrives on-chip, it goes through a 2D FFT module and
the spectral outputs are stored in the input buffer. The sparse
Hadamard module then computes Hadamard product of the
sparse spectral kernels and the dense spectral activations, by
reading from the kernel and input buffers. The partial results of
accumulation along the input channel dimension are stored in
the output buffer. After iterating through all data in the kernel
tile, the output activations go through a 2D IFFT module and
return to DDR. We implement 2D FFT and IFFT based on [6],
i.e., we decompose the 2D FFT into two phases of 1D FFTs.
We use the 1D FFT pipeline and perform matrix transpose
with Streaming Permutation Network proposed in [27].
The following is required to achieve high performance of
the inference engine: (i) Efficient random access into the
data buffers: The sparse Hadamard module fetches data from
the dense activation tensor based on the non-zero indices of
the pruned kernels. Since kernels can have random non-zero
indices, bank conflicts may happen if multiple MAC pipelines
of the Hadamard module request data from the same activation
BRAM block. Therefore, we replicate activation data to reduce
the pipeline stalls due to BRAM conflicts. (ii) Parallelism
across various tensor dimensions: Spectral convolution with
pruning dramatically reduces the CNN computation com-
plexity. We design a parallelization strategy that translates
the reduced amount of computation into higher throughput.
We parallelize along the batch and channel dimensions of
activations, while ensuring fast random access of on-chip data.
B. Sparse Spectral Convolution Engine
The sparse Hadamard module is the key component of the
inference engine. This module processes a kernel tile of shape
$c \times c \times n \times n$ and an activation tile of shape $b \times c \times n \times n$,
where $c$ is the tiling factor along the input and output channel
dimensions (i.e., the $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ channels are partitioned into
tiles of $c$ channels); $n$ is the width and height of each kernel
and activation map (i.e., 2D FFT size), and $b$ is the batch
size. Note that pruning is along the last two dimensions of
the kernel tile, so the pruning algorithm ensures exactly $\frac{1}{\alpha} \cdot n^2$
non-zeros in each one of the $c^2$ kernel maps.
Figure 3 illustrates the storage arrangement for sparse
kernels. As an example, Figure 3a shows four sparse spectral
kernels. The upper part of Figure 3b shows the data structure to
represent the data in Figure 3a. For each kernel matrix of input
channel $i$ and output channel $j$, we use a $\left( \frac{1}{\alpha} \cdot n^2 \right) \times 2$ table to
store the non-zero values and the corresponding indices. The
lower part of Figure 3b shows the memory layout. We split the
value-index table into two tables and populate them with some
pre-processed information to facilitate hardware access. The
index table is $\left( \lambda \cdot \frac{1}{\alpha} \cdot n^2 \right)$ by $R$, where $R \ge 1$ is the number
of activation replicas and $\lambda \ge 1$ is an overhead coefficient
related to $R$. This table stores pre-computed addresses into
the input buffer. The value table is $\left( \lambda \cdot \frac{1}{\alpha} \cdot n^2 \right)$ by 3. This table
stores the non-zero values and control signals required by the
Hadamard pipeline. More details of the tables and parameters
are discussed later in this section.
To compute a single $\tilde{X}_{k,i} \circ \tilde{W}_{j,i}$ in Equation 1, the sparse
Hadamard module sequentially iterates through the index and
value tables corresponding to indices $j$, $i$, and initiates random
accesses into the input buffer. A single multiplier finishes
the computation for a given $(j, i)$ pair in $\lambda \cdot \frac{1}{\alpha} \cdot n^2$ cycles.
All multipliers on-chip together exploit parallelism along the
dimensions of batch and output channels. Figure 4 shows the
design of the sparse Hadamard computation pipeline.
Fig. 2: Overview of SPEC2 inference engine
Fig. 3: Storage arrangement for sparse kernels. (a) Example sparse kernels; (b) Data structure and memory layout
Fig. 4: Sparse Hadamard pipeline
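A software model of the value-index storage and the sparse Hadamard accumulation described above might look like the following sketch. It models only the arithmetic (one activation tile, no replicas, no pipeline timing). The helper names `to_value_index` and `sparse_hadamard_mac` are hypothetical, and the flattened row-major index is an assumed encoding of the non-zero position.

```python
import numpy as np

def to_value_index(w_map):
    """Index and value arrays for the non-zeros of one n x n spectral kernel map."""
    flat = w_map.ravel()
    idx = np.flatnonzero(flat)          # flattened positions of the non-zeros
    return idx, flat[idx]

def sparse_hadamard_mac(x_f, tables, n):
    """Accumulate X~_i o W~_{j,i} over input channels i, touching only non-zeros.

    x_f:    (c_in, n, n) dense spectral activations
    tables: tables[j][i] = (idx, val) for output channel j and input channel i
    """
    c_out, c_in = len(tables), len(tables[0])
    x_flat = x_f.reshape(c_in, n * n)
    y_f = np.zeros((c_out, n * n), dtype=complex)
    for j in range(c_out):
        for i in range(c_in):
            idx, val = tables[j][i]
            # Random access into the activation map at the kernel's non-zero indices.
            y_f[j, idx] += x_flat[i, idx] * val
    return y_f.reshape(c_out, n, n)

rng = np.random.default_rng(1)
n, c_in, c_out = 4, 3, 2
w = rng.standard_normal((c_out, c_in, n, n)) + 1j * rng.standard_normal((c_out, c_in, n, n))
w *= rng.random((c_out, c_in, n, n)) < 0.25      # keep roughly a quarter of the weights
x_f = rng.standard_normal((c_in, n, n)) + 1j * rng.standard_normal((c_in, n, n))
tables = [[to_value_index(w[j, i]) for i in range(c_in)] for j in range(c_out)]
dense = (w * x_f[None]).sum(axis=1)              # dense reference result
print(np.allclose(sparse_hadamard_mac(x_f, tables, n), dense))
```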
Let $P = P_b \cdot P_o$ be the total number of multipliers in the
module, where $P_b$ and $P_o$ denote batch and output channel
parallelism, respectively ($P_b = 2$, $P_o = 4$ in Figure 4). The
$P_o$ multipliers corresponding to the same $P_b$ index form a
group, and they all access activations belonging to the same
input channel. Thus, each cycle, a group initiates $P_o$ memory
requests into the same map $\tilde{X}_{k,i}$. Ideally, all the $P_o$ requests
have to be served in one cycle to avoid pipeline stalls. In
our design, since we store each $\tilde{X}_{k,i}$ of size $n \times n$ in
a single BRAM block, we need $P_o$ activation replicas
for the worst-case scenario where all the memory
requests are distinct. In practice, however, we notice that the
spectral sparse kernels demonstrate strong correlation in non-zero
locations across channels. Thus, the number of unique
addresses in a group of $P_o$ requests is often much less than
$P_o$. For this reason, we introduce the number of replicas $R$
(where $1 \le R \le P_o$, and $R = 2$ in the example) as an
extra design parameter. Higher $R$ reduces the value of $\lambda$ and
thus the likelihood of pipeline stalls, but also increases BRAM
overhead due to replication.
To route the returned $R$ activation values to the $P_o$
multipliers, we implement a crossbar connecting BRAMs to
multipliers for each group. An $R$-to-1 MUX is placed in front
of each multiplier for data selection. We further utilize the
information in the index and value tables (Figure 3b) to avoid
runtime calculation of addresses and MUX control signals.
Since non-zero locations of spectral kernels are fixed during
inference, we can determine the unique addresses within each
group of requests in offline pre-processing. Specifically, during
pre-processing, we sort the group of $P_o$ requests and store the
$Q$ unique addresses in $\left\lceil \frac{Q}{R} \right\rceil$ rows of the index table.
As a result, the $P_o$ runtime requests can be served in $\left\lceil \frac{Q}{R} \right\rceil$
cycles. We also pre-compute the corresponding MUX selection
signals (i.e., replica ID) and store them in the "sel" column
of the value table (Figure 3b). The "valid" bit specifies if the
value in the corresponding row is valid. Among the $\lambda \cdot \frac{1}{\alpha} \cdot n^2$
rows in the value table, only $\frac{1}{\alpha} \cdot n^2$ of them are valid.
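The offline pre-processing can be sketched as follows. This reflects one reading of the text and is not the authors' tool: at each step, the $P_o$ multipliers of a group each request one address, the $Q$ unique addresses are packed into $\lceil Q / R \rceil$ index-table rows, and each request records which row and replica serves it. The function name `build_schedule`, the use of `-1` for unused slots, and the definition of `lam` as the ratio of index-table rows to steps are assumptions made for illustration.

```python
import numpy as np

def build_schedule(addr, R):
    """addr[p, q] is the input-buffer address requested by multiplier q at step p.

    Returns the index table (rows of R addresses, padded with -1), the
    per-request (row, replica) selections, and the overhead ratio lam.
    """
    rows, sel = [], []
    for p in range(addr.shape[0]):
        uniq = np.unique(addr[p])              # sorted unique addresses, Q = len(uniq)
        base = len(rows)
        for r0 in range(0, len(uniq), R):      # ceil(Q / R) rows for this step
            chunk = [int(a) for a in uniq[r0:r0 + R]]
            rows.append(chunk + [-1] * (R - len(chunk)))
        pos = np.searchsorted(uniq, addr[p])   # position of each request among the uniques
        sel.append([(base + int(q) // R, int(q) % R) for q in pos])
    lam = len(rows) / addr.shape[0]            # rows per step; 1.0 means no stalls
    return np.array(rows), sel, lam

# Three steps for a group of P_o = 4 multipliers, served by R = 2 replicas.
addr = np.array([[5, 5, 9, 5], [1, 2, 3, 4], [7, 7, 7, 7]])
rows, sel, lam = build_schedule(addr, R=2)
print(rows.shape, lam)
```

In this example the second step has four distinct addresses and needs two rows, while the other steps need one row each, so correlated non-zero locations keep the overhead low even with few replicas.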
C. Performance Analysis
We use the notation $S_{*}$ to denote the total amount of
resources available: $S_{\mathrm{BW}}$ for external bandwidth (number of
complex words per cycle); $S_{\mathrm{DSP}}$ for DSPs (number of complex
multiplier-adders); $S_{\mathrm{BRAM}}$ for BRAMs (number of 36 × 1K
BRAM blocks). Also, as a reasonable design choice, we set
$P_o = c$, $P_b = b$. Overall throughput (number of multiplications
/ additions per unit time) is calculated as:
$$
T_{\mathrm{sys}} = \left( \frac{1}{\lambda_R} \cdot P_o \cdot P_b \cdot 2 \right) \cdot \min\left\{ 1, \frac{1}{2} \cdot \frac{S_{\mathrm{BW}}}{S^{\mathrm{req}}_{\mathrm{BW}}} \right\} \cdot F
\tag{4}
$$
where $S^{\mathrm{req}}_{\mathrm{BW}}$ is the required bandwidth to support the DDR to
2D FFT / IFFT communication (since spectral kernels are
reused by a large enough number of input tiles, the amortized
cost of loading kernels from DDR to FPGA is zero, as
analyzed in [6]); $\frac{1}{\lambda_R}$ specifies the average DSP utilization
given $R$ replicas (with $\lambda_R$ denoting the overhead coefficient $\lambda$ under $R$ replicas);
the number "2" indicates multiplication and
addition; and $F$ is the working frequency. In the design, the
throughput of FFT, sparse Hadamard, and IFFT can be easily
matched by adjusting data parallelism of the 2D FFT and IFFT
modules. Hence, $S^{\mathrm{req}}_{\mathrm{BW}}$ is equivalent to the bandwidth required by
the FFT module, i.e., $S^{\mathrm{req}}_{\mathrm{BW}} = \frac{2 P_b \cdot n^2}{\Omega_H} \cdot F$, where $\Omega_H$ is the number of
clock cycles for finishing the corresponding Hadamard product on
the tile. Note that FFT and IFFT computations only constitute
a small portion of the total convolution layer workload,
as shown in [6]. So under a reasonable assumption that there
are sufficient logic resources, we implement the 2D FFT /
IFFT pipelines with the on-chip logic, and the Multiplication-Accumulation
units of the sparse Hadamard module with the
DSPs. Thus, the throughput optimized design is obtained by
solving the following:
$$
\begin{aligned}
\underset{R, P_b, P_o}{\text{maximize}} \quad & T_{\mathrm{sys}} \\
\text{subject to} \quad & P_b \cdot P_o \le S_{\mathrm{DSP}}, \\
& P_b \cdot (R + P_o) + 1.5 \cdot P_o \le S_{\mathrm{BRAM}}
\end{aligned}
\tag{5}
$$
where the BRAM constraint arises from the number of 1K
BRAM blocks for the input/output buffers and the kernel
buffer. The number 1.5 in the BRAM constraint is the
number of 36×1K BRAM blocks needed to store kernel values
and indices (see footnote 2).
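For small resource budgets, the design-space search of Equation 5 can be illustrated by brute-force enumeration. The sketch below is a simplification under stated assumptions: it maximizes only the compute-bound term $P_o \cdot P_b / \lambda$ of Equation 4 (the bandwidth term and the constant factors $2$ and $F$ are left out), and it takes the overhead coefficient as a user-supplied function `lam_of(r, po)`, since in practice $\lambda$ depends on the pruned kernels. The function name `best_design` and the toy model of $\lambda$ in the usage lines are hypothetical.

```python
def best_design(s_dsp, s_bram, lam_of, max_pb=64, max_po=64):
    """Enumerate (R, P_b, P_o) under the two constraints of Eq. 5.

    Returns (score, R, P_b, P_o) with score = P_b * P_o / lam_of(R, P_o),
    or None if no design is feasible.
    """
    best = None
    for pb in range(1, max_pb + 1):
        for po in range(1, max_po + 1):
            if pb * po > s_dsp:                       # DSP constraint
                break
            for r in range(1, po + 1):                # 1 <= R <= P_o
                if pb * (r + po) + 1.5 * po > s_bram: # BRAM constraint
                    break
                score = pb * po / lam_of(r, po)
                if best is None or score > best[0]:
                    best = (score, r, pb, po)
    return best

# Toy overhead model (an assumption): stalls vanish once R reaches P_o / 2.
toy_lam = lambda r, po: max(1.0, po / (2.0 * r))
print(best_design(s_dsp=16, s_bram=40, lam_of=toy_lam, max_pb=16, max_po=16))
```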
V. EXPERIMENTS
A. Experimental Setup
For pruning, we implement the ADMM and re-training
algorithms in TensorFlow on NVIDIA P100 GPUs. We use
LeNet (on MNIST dataset) and VGG16 (on CIFAR-10 and
Flowers-102 [28] datasets) for accuracy and throughput eval-
uation. As for the hardware implementation, we use Xilinx
Virtex-7 XC7VX690T to accelerate convolution layers of the
CNN. Other operations, such as ReLU, pooling, partitioning
and concatenation, are performed in the host CPU, which
communicates with FPGA using PCIe interface. Due to the
fact that operations in the host CPU account for a small
portion (< 1%) of total computations and most of these
operations can be parallelized, CPU execution will not be
the bottleneck during inference. We use 16-bit fix-point data
format during inference. We use Vivado2018.3 to synthesize
and implement the Verilog code. In the following section,
to make fair comparison with state-of-the-art spatial CNN
accelerators, “image frames per second” (FPS) is adopted to
measure inference throughput.
B. Evaluation on Classification Accuracy
In this section, we evaluate the effect of the pruning rate α and
the FFT size n on classification accuracy. We first obtain the
dense spectral kernels $\{\tilde{W}^{\ell}\}$ by converting the
pre-trained spatial CNNs. Then we perform pruning by ADMM
and re-training to get the sparse kernels $\{\tilde{W}^{\ell}_{\alpha}\}$.
Keeping α the same across different kernels works well in other
ADMM-based compression works [14]. Besides, using the same pruning
rate avoids load imbalance when multiple kernels are executed in
parallel. We try six pruning configurations, by setting the FFT
size n = 8 or 16 and the pruning rate α = 2, 4 or 8.
For each configuration, we perform hyper-parameter tuning
independently, by searching over the parameter space defined
by: ADMM penalty coefficient ρ ∈ { 0.001, 0.002, 0.005 },
initial learning rate η0 ∈ { 0.001, 0.0001 }, and learning rate
decay factor (per 20 epochs; see footnote 3) γ ∈ { 0.8 }. The ADMM
training converges after 200 epochs for LeNet and 300 epochs
for VGG16. The re-training step converges after 10 epochs for
LeNet and 30 epochs for VGG16.
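The search space described above is small enough to enumerate directly. The following sketch lists the six pruning configurations and the hyper-parameter grid, and evaluates the learning rate at a given epoch. It assumes a staircase schedule in which the rate is multiplied by γ once every 20 epochs; the text does not say whether the decay is stepwise or continuous, and all function and constant names are illustrative.

```python
from itertools import product

FFT_SIZES = (8, 16)                 # FFT size n
PRUNING_RATES = (2, 4, 8)           # pruning rate alpha
RHO_VALUES = (0.001, 0.002, 0.005)  # ADMM penalty coefficient rho
ETA0_VALUES = (0.001, 0.0001)       # initial learning rate eta_0
GAMMA_VALUES = (0.8,)               # learning rate decay factor gamma
DECAY_PERIOD = 20                   # epochs between two decays


def pruning_configurations():
    """All (n, alpha) pairs evaluated in the accuracy study."""
    return list(product(FFT_SIZES, PRUNING_RATES))


def hyperparameter_grid():
    """All (rho, eta0, gamma) combinations searched per configuration."""
    return [
        {"rho": rho, "eta0": eta0, "gamma": gamma}
        for rho, eta0, gamma in product(RHO_VALUES, ETA0_VALUES, GAMMA_VALUES)
    ]


def learning_rate(eta0, gamma, epoch, period=DECAY_PERIOD):
    """Staircase decay (assumed): eta0 * gamma ** floor(epoch / period)."""
    return eta0 * gamma ** (epoch // period)
```

Under this reading, each of the six pruning configurations is tuned over six hyper-parameter combinations.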
Accuracy under various pruning rates. Figure 5 shows the
accuracy of the three CNN models when the pruning rate
increases from 2 to 8. For LeNet on MNIST, accuracy
is fully recovered even when 75% of the spectral weights are
pruned. Pruning 87.5% of the spectral weights results in a small
accuracy degradation (0.2%). For VGG16, our
pruning algorithm incurs no more than 0.9% (for CIFAR-10)
and 0.3% (for Flowers-102) accuracy loss when the pruning rate
α ≤ 4. The accuracy loss is 1.8% (for CIFAR-10) and 1.7%
(for Flowers-102) when α = 8.

Fig. 5: Accuracy under various pruning rate α (x-axis: pruning rate α, Orig / x2 / x4 / x8; y-axis: accuracy (%); series: MNIST, CIFAR-10, Flowers-102)
Fig. 6: Accuracy under various FFT sizes n (CIFAR-10) (x-axis: pruning rate α, Orig / x2 / x4 / x8; y-axis: accuracy (%); series: 16×16 FFT, 8×8 FFT)

Footnote 2: 16-bit fixed-point complex kernels need one 36×1K BRAM; the corresponding indices need another 18×1K BRAM.
Footnote 3: An epoch means a full traversal of the training set.
Accuracy under various FFT sizes. It has been shown in
[6] that the FPGA computation and communication workload
may depend on the selected FFT size n. Hence, it is necessary
to evaluate the performance of the pruning algorithm under various
n. As shown in Figure 6, 8×8 spectral kernels lead to higher
accuracy under the same pruning rate. As the pruning rate
increases, the accuracy gap between the 8×8 configuration
and the 16×16 one becomes larger.
ADMM training statistics. As described in Section III-B,
the ADMM pruning step drives a $\left(1 - \frac{1}{\alpha}\right)$ portion
of the $\tilde{W}^{\ell}$ elements to close-to-zero values. The re-training
step sets these near-zero values to zero and fine-tunes the remaining
$\frac{1}{\alpha}$ portion.
Here we visualize the value distributions of the spectral kernels at
various stages of the pruning, as shown in Figure 7. The left
panel shows the distribution of the original kernels $\{\tilde{W}^{\ell}\}$
of VGG16, in which values spread out over the [0.0, 0.1] interval.
After ADMM training with α = 4, the weight distribution
changes significantly, as most values gather around zero (see
the middle panel). After further pruning the near-zero weights and
re-training, only a small fraction (25%) of the weights remain
non-zero. We consistently observe this change in distribution for
other CNNs and datasets.
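The hard-pruning step that follows ADMM training can be illustrated with a simple magnitude rule: keep the $\frac{1}{\alpha}$ fraction of entries with the largest magnitude and set the rest to zero. This is a minimal sketch under the assumption that "near-zero" is decided by complex magnitude within one kernel; the function name, the flat-list representation and the toy values are illustrative, and the ADMM optimization and re-training themselves are not shown.

```python
def prune_to_rate(weights, alpha):
    """Zero all but the len(weights) // alpha largest-magnitude entries.

    `weights` is a flat list of (possibly complex) spectral kernel values.
    """
    keep = len(weights) // alpha
    # Indices ordered from largest to smallest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    return [w if i in kept else 0 for i, w in enumerate(weights)]


# Toy kernel with 8 entries; alpha = 4 keeps the 2 largest magnitudes.
kernel = [0.09 + 0.01j, 0.001j, -0.05, 0.002, 0.03j, 0.0005, 0.07, -0.001]
pruned = prune_to_rate(kernel, 4)
nonzero = sum(1 for w in pruned if w != 0)
```

In the toy example, only the entries at positions 0 and 6 survive, which matches the 25% non-zero fraction expected for α = 4.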
Table I shows the change in classification accuracy after
the various steps of weight pruning (α = 4). After ADMM
training, a small accuracy drop is observed when testing under
$\{\tilde{W}^{\ell}_{AD}\}$. After obtaining $\{\tilde{W}^{\ell}_{PR}\}$ by
removing the near-zero values of $\{\tilde{W}^{\ell}_{AD}\}$, the accuracy
drops further by 0.03% to 0.4%. After the re-training step, the accuracy
is recovered, and the final accuracy degradation compared with the original
unpruned models is 0.0% for MNIST, 0.9% for CIFAR-10 and
0.3% for Flowers-102.

TABLE I: Accuracy (%) after various pruning steps (α = 4)

| Step | MNIST | CIFAR-10 | Flowers-102 |
|---|---|---|---|
| Original $\{\tilde{W}^{\ell}\}$ | 99.1 | 91.7 | 95.3 |
| ADMM $\{\tilde{W}^{\ell}_{AD}\}$ | 99.0 | 90.9 | 95.0 |
| Pruning $\{\tilde{W}^{\ell}_{PR}\}$ | 99.0 | 90.5 | 94.8 |
| Re-training $\{\tilde{W}^{\ell}_{\alpha}\}$ | 99.1 | 90.8 | 95.0 |
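The accuracy changes quoted above can be recomputed from the Table I entries. The snippet below does only that arithmetic; the dictionary layout and names are illustrative. Because the table reports one decimal place, a drop smaller than 0.05% (presumably the 0.03% lower end quoted for the pruning step) appears as 0.0 here.

```python
# Accuracy (%) from Table I, alpha = 4.
TABLE_I = {
    "MNIST":       {"original": 99.1, "admm": 99.0, "pruning": 99.0, "retraining": 99.1},
    "CIFAR-10":    {"original": 91.7, "admm": 90.9, "pruning": 90.5, "retraining": 90.8},
    "Flowers-102": {"original": 95.3, "admm": 95.0, "pruning": 94.8, "retraining": 95.0},
}


def drop(dataset, before, after):
    """Accuracy lost between two stages, rounded to the table's precision."""
    row = TABLE_I[dataset]
    return round(row[before] - row[after], 1)


# Loss of the final re-trained model relative to the unpruned model.
final_loss = {d: drop(d, "original", "retraining") for d in TABLE_I}
# Loss caused by zeroing the near-zero weights after ADMM training.
pruning_loss = {d: drop(d, "admm", "pruning") for d in TABLE_I}
```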
C. Evaluation on Inference Throughput
Overall inference throughput, bottlenecked by either computation
or communication speed on the FPGA, can be maximized
by solving Equation 5. Note that to obtain the optimal design
point $(R^*, P_b^*, P_o^*)$, we have to empirically determine the
relation between DSP utilization and the parameters R and Po.
Figure 8 shows the DSP utilization (average ratio of active DSPs
to total DSPs) under various numbers of replicas R and output
channel parallelism Po. In all cases, we do not require a large
number of replicas to achieve high DSP utilization. Even when
Po is as high as 128, it suffices to set R = 10 for over 80%
DSP utilization. In addition, with increasing pruning rate, DSP
utilization drops slightly under the same number of replicas.
This is because the sparser the spectral kernels, the more
irregular the non-zero locations.
Based on the statistics of DSP utilization, the overall throughput
can be obtained by Equation 4, and the optimal design point
$(R^*, P_b^*, P_o^*)$ can be determined. Figure 9 shows the throughput
under various configurations with α = 4, n = 8 and Pb = 10.
The solid lines indicate throughput on the target hardware
platform, and the dotted lines represent the performance bound
due to the limited number of on-chip BRAMs and DSPs. The solid
lines terminate at Po = 64 since we restrict Po to powers of
two. Because the target platform has enough DSP resources,
the design is bound by BRAMs. According to Figure 9, the optimal
configuration is $P_b^* = 10$, $P_o^* = 64$, $R^* = 16$, corresponding
to a throughput of 148 FPS (with an average DSP utilization of
99%). The total required communication bandwidth between
off-chip memory and the on-chip processing pipeline is 9 GB/s,
which is lower than the peak system bandwidth of 21 GB/s.
Given different pruning rates, we can also find the optimal
configurations using the design exploration algorithm, as
shown in Table II. Throughput here covers only the computation of
the convolutional layers. Although the kernels become sparser at a
higher α such as 8, throughput is still almost doubled compared
with α = 4.
Fig. 7: Value distribution of spectral kernels with α = 4 (panels: original spectral kernels, spectral kernels after ADMM, spectral kernel after re-training; x-axis: 0.00 to 0.10; y-axis: percentage (%))
Fig. 8: DSP utilization under various number of replicas R and output channel parallelism Po (panels: α = 2, α = 4, α = 8; x-axis: number of replicas R; y-axis: DSP utilization $\frac{1}{\lambda_R}$; curves: Po = 16, 32, 64, 128)
Fig. 9: Throughput under various configurations (x-axis: output channels Po; y-axis: replicas R; throughput contours with the BRAM bound, the DSP bound and the optimal design marked)
TABLE II: Throughput of VGG16 (224×224 input images)

| Pruning rate α | 2 | 4 | 8 |
|---|---|---|---|
| Throughput (FPS) | 74 | 148 | 284 |
| Replicas R | 16 | 16 | 16 |
| DSP utilization $\frac{1}{\lambda_R}$ | 100% | 99% | 96% |
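The Table II figures are consistent with a simple scaling model in which throughput is proportional to the pruning rate times the DSP utilization. This model is an assumption made here for illustration rather than a formula stated in the text. The sketch extrapolates from the α = 2 row and prints the result next to the reported values.

```python
# (throughput in FPS, DSP utilization) per pruning rate, from Table II.
TABLE_II = {2: (74, 1.00), 4: (148, 0.99), 8: (284, 0.96)}


def predicted_fps(alpha, base_alpha=2):
    """Extrapolate FPS assuming throughput ~ alpha * DSP utilization."""
    base_fps, base_util = TABLE_II[base_alpha]
    _, util = TABLE_II[alpha]
    return base_fps * (alpha * util) / (base_alpha * base_util)


# Print pruning rate, reported FPS and extrapolated FPS side by side.
for alpha, (reported, _) in TABLE_II.items():
    print(alpha, reported, round(predicted_fps(alpha), 1))
```

Under this assumption, the extrapolated values land within roughly 1% of the reported 148 and 284 FPS, which suggests that the slight loss in DSP utilization accounts for throughput falling just short of doubling at α = 8.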
Table III compares throughput with state-of-the-art
designs on VGG16. [6] and [8] accelerate unpruned spectral
CNNs. SPEC2 achieves 24× higher throughput, of which we
estimate that 8× is due to the larger device and
3× comes from our spectral pruning. Compared
with [23], which accelerates unpruned spatial CNNs on the
same FPGA as ours, SPEC2 achieves a 9× improvement, of which
around 3× is due to the spectral convolution algorithm and
the other 3× comes from spectral pruning. Compared with the
pruned spatial CNN implementation [31], SPEC2 achieves 30×
higher throughput on a device with 2816 more DSPs. For a
fairer comparison, assume that the throughput of [31] scales with the
number of DSPs and that [31] can be implemented on the same
device as ours at the same working frequency. In that case,
SPEC2 still achieves a 1.8× throughput improvement.
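The normalized comparison with [31] can be reproduced from the Table III entries. The calculation below follows the stated assumption that the throughput of [31] scales linearly with the DSP count and the working frequency. Using the absolute DSP counts given in parentheses in Table III as the scaling basis is an interpretation, supported by the "2816 more DSPs" figure (3200 − 384); the variable and function names are illustrative.

```python
# Table III entries for [31] and SPEC2.
REF31 = {"fps": 5, "dsps": 384, "mhz": 100}
SPEC2 = {"fps": 148, "dsps": 3200, "mhz": 200}


def scaled_fps(design, target):
    """Throughput of `design` scaled linearly to the target's DSPs and clock."""
    dsp_ratio = target["dsps"] / design["dsps"]
    clock_ratio = target["mhz"] / design["mhz"]
    return design["fps"] * dsp_ratio * clock_ratio


raw_speedup = SPEC2["fps"] / REF31["fps"]                     # about 30x
normalized_speedup = SPEC2["fps"] / scaled_fps(REF31, SPEC2)  # about 1.8x
extra_dsps = SPEC2["dsps"] - REF31["dsps"]
```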
TABLE III: Comparison with state-of-the-art designs on VGG16

| | [6] | [8] | [29] | [23] | [30] | [31] | SPEC2 |
|---|---|---|---|---|---|---|---|
| FPGA | Intel Stratix V | Intel QPI FPGA | Xilinx Zynq | Xilinx Virtex-7 | Intel Arria 10 | Xilinx Artix-7 | Xilinx Virtex-7 |
| Frequency (MHz) | 200 | 200 | 150 | 150 | 221 | 100 | 200 |
| Datatype | 16-bit FX | 32-bit FT | 16-bit FX | 16-bit FX | 16-bit FX | 16-bit FT | 16-bit FX |
| DSP Usage | 100% (256) | 100% (224) | 89% (780) | 78% (2833) | 49% (1500) | 52% (384) | 89% (3200) |
| LUT Usage | 46% (107K) | - | 84% (183K) | 81% (300K) | 73% (313K) | - | 55% (237K) |
| RAM Usage | 73% (1377) | - | 87% (486) | 42% (624) | 61% (1668) | 53% (388) | 85% (1244) |
| Throughput (FPS) | 6 | 4 | 8 | 16 | 19 | 5 | 148 |

* FX: Fixed-point data; FT: Floating-point data

VI. CONCLUSION
We have proposed SPEC2, an approach to prune and accelerate
spectral CNNs. SPEC2 performs systematic weight
pruning based on ADMM and deploys an efficient sparse
tensor computation pipeline on FPGA to achieve high accuracy
and high throughput inference simultaneously.
In the future, we will extend SPEC2 in the following
ways. First of all, we will explore model redundancy from
various aspects. We will incorporate channel pruning, and
weight quantization into the ADMM framework to obtain even
more compact spectral CNN models. We will also develop
a complete tool chain to automatically prune, quantize and
accelerate spectral CNNs.
VII. ACKNOWLEDGEMENTS
This work was supported in part by National Science
Foundation award number CNS-1643351.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in NIPS’12, 2012.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” CoRR, 2014.
[3] A. G. Howard, M. Zhu, et al., “Mobilenets: Efficient convolutional
neural networks for mobile vision applications,” CoRR, 2017.
[4] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely
efficient convolutional neural network for mobile devices,” CoRR, 2017.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016.
[6] H. Zeng, R. Chen, C. Zhang, and V. Prasanna, “A framework for gener-
ating high throughput cnn implementations on fpgas,” in Proceedings of
the 2018 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, ser. FPGA ’18. ACM, 2018.
[7] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and con-
nections for efficient neural network,” in Advances in Neural Information
Processing Systems, 2015.
[8] C. Zhang and V. Prasanna, “Frequency domain acceleration of con-
volutional neural networks on cpu-fpga shared memory system,” in
Proceedings of the 2017 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, ser. FPGA ’17. ACM, 2017.
[9] G. Yiwen, Y. Anbang, and C. Yurong, “Dynamic network surgery for
efficient dnns,” ser. NIPS’16, 2016.
[10] W. Wei, W. Chunpeng, W. Yandan, C. Yiran, and L. Hai, “Learning
structured sparsity in deep neural networks,” ser. NIPS’16, 2016.
[11] S. Zhou, K. Lakhotia, S. G. Singapura, et al., “Design and implementation
of parallel pagerank on multicore platforms,” in 2017 IEEE
High Performance Extreme Computing Conference (HPEC). IEEE,
2017.
[12] D. A. Bader, V. Agarwal, and K. Madduri, “On the design and analysis
of irregular algorithms on the cell processor: A case study of list rank-
ing,” in 2007 IEEE International Parallel and Distributed Processing
Symposium. IEEE, 2007.
[13] C. Szegedy, W. Liu, Y. Jia, et al., “Going deeper with convolutions,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015.
[14] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang,
“A systematic dnn weight pruning framework using alternating direction
method of multipliers,” in Proceedings of the European Conference on
Computer Vision (ECCV), 2018.
[15] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing
deep neural networks with pruning, trained quantization and huffman
coding,” arXiv preprint arXiv:1510.00149, 2015.
[16] D. Xiaoliang, Y. Hongxu, and N. K. Jha, “Nest: A neural network
synthesis tool based on a grow-and-prune paradigm,” arXiv, 2017.
[Online]. Available: https://arxiv.org/abs/1711.02017
[17] Y. Tien-Ju, C. Yu-Hsin, and S. Vivienne, “Designing energy-efficient
convolutional neural networks using energy-aware pruning,” ser.
CVPR’17. IEEE, 2017.
[18] H. Yihui, Z. Xiangyu, and S. Jian, “Channel pruning for accelerating
very deep neural networks,” ser. ICCV’17. IEEE, 2017.
[19] S. Wang, Z. Li, C. Ding, et al., “C-lstm: Enabling efficient lstm
using structured compression techniques on fpgas,” in Proceedings of
the 2018 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays. ACM, 2018, pp. 11–20.
[20] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang, “Building efficient
deep neural networks with unitary group convolutions,” arXiv preprint
arXiv:1811.07755, 2018.
[21] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed
optimization and statistical learning via the alternating direction method
of multipliers,” Foundations and Trends® in Machine learning, 2011.
[22] Z. Liu, J. Xu, X. Peng, and R. Xiong, “Frequency-domain dynamic
pruning for convolutional neural networks,” in Advances in Neural
Information Processing Systems, 2018.
[23] C. Zhang, G. Sun, Z. Fang, et al., “Caffeine: Towards uniformed
representation and acceleration for deep convolutional neural
networks,” TCAD, 2018.
[24] E. Wu, X. Zhang, D. Berman, I. Cho, and J. Thendean, “Compute-
efficient neural-network acceleration,” in Proceedings of the 2019
ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, ser. FPGA ’19. ACM, 2019.
[25] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan et al.,
“Scnn: An accelerator for compressed-sparse convolutional neural net-
works,” in 2017 ACM/IEEE 44th Annual International Symposium on
Computer Architecture (ISCA), June 2017.
[26] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al.,
“In-datacenter performance analysis of a tensor processing unit,” in
2017 ACM/IEEE 44th Annual International Symposium on Computer
Architecture (ISCA). IEEE.
[27] R. Chen, S. Siriyal, and V. Prasanna, “Energy and memory efficient
mapping of bitonic sorting on fpga,” in Proceedings of the 2015
ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, ser. FPGA ’15. ACM, 2015.
[28] M.-E. Nilsback and A. Zisserman, “Automated flower classification over
a large number of classes,” in Proceedings of the Indian Conference on
Computer Vision, Graphics and Image Processing, Dec 2008.
[29] Q. Jiantao, W. Jie, Y. Song, et al., “Going deeper with embedded
fpga platform for convolutional neural network,” in Proceedings of the
2016 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, ser. FPGA ’16. ACM, 2016.
[30] W. Xuechao, Y. Cody Hao, Z. Peng, et al., “Automated systolic
array architecture synthesis for high throughput cnn inference on fpgas,”
ser. DAC’17. ACM, 2017.
[31] A. Page, A. Jafari, C. Shea, and T. Mohsenin, “Sparcnet: A hardware
accelerator for efficient deployment of sparse convolutional networks,”
ACM Journal on Emerging Technologies in Computing Systems (JETC),
2017.
