An Application-Specific VLIW Processor with Vector Instruction Set for
  CNN Acceleration by Bytyn, Andreas et al.
Andreas Bytyn, Rainer Leupers and Gerd Ascheid
Institute for Communication Technologies and Embedded Systems, RWTH Aachen University
Email: bytyn@ice.rwth-aachen.de
©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or
reuse of any copyrighted component of this work in other works.
Abstract—In recent years, neural networks have surpassed
classical algorithms in areas such as object recognition, e.g. in the
well-known ImageNet challenge. As a result, great effort is being
put into developing fast and efficient accelerators, especially for
Convolutional Neural Networks (CNNs). In this work we present
ConvAix, a fully C-programmable processor, which contrary
to many existing architectures does not rely on a hard-wired
array of multiply-and-accumulate (MAC) units. Instead it maps
computations onto independent vector lanes making use of a
carefully designed vector instruction set.
The presented processor is targeted towards latency-sensitive
applications and is capable of executing up to 192 MAC oper-
ations per cycle. ConvAix operates at a target clock frequency
of 400 MHz in 28nm CMOS, thereby offering state-of-the-art
performance with proper flexibility within its target domain.
Simulation results for several 2D convolutional layers from
well-known CNNs (AlexNet, VGG-16) show an average ALU
utilization of 72.5% using vector instructions with 16 bit fixed-
point arithmetic. Compared to other well-known designs which
are less flexible, ConvAix offers competitive energy efficiency of
up to 497 GOP/s/W while even surpassing them in terms of area
efficiency and processing speed.
I. INTRODUCTION
Since their introduction to the broad public, Convolutional
Neural Networks (CNNs) have been adopted for many tasks
such as object classification and detection [1] [2]. Their ability
to extract meaningful features out of data has been the key
enabling factor for their superior performance compared to
other approaches. However, this comes at the price of
increased computational complexity: the number of
multiply-and-accumulate (MAC) operations can reach several
GMAC per frame, as summarized in [3].
Thankfully, there is a lot of explicit parallelism contained in
CNNs, thereby offering many options for accelerating them
with domain-specific hardware architectures.
The most time- and energy-consuming computational kernel
of every CNN is the 3D-convolution of so-called input feature
maps (e.g. RGB input for the first layer) with multiple sets of
filters that compute layer-wise output feature maps [3]. The
focus of this paper is therefore on the acceleration of these
convolutions. As described in [4] and [5], the number of off-
chip memory accesses and the management of the on-chip
memories both play a decisive role for energy consumption
and arithmetic utilization within the accelerator. There are
many parameters for the convolutions that affect the efficiency,
and it is the designer's task to find a suitable trade-off between
them.
Important parameters to consider are for example the order
in which convolutions are executed as well as the tiling of in-
put and output feature maps into e.g. column- and depth-slices.
The optimal choice of these parameters however is dependent
on the specific CNN model, thereby making it desirable to
have some degree of flexibility in the data flow. The authors
believe that an Application-Specific Instruction Set Processor
(ASIP), as presented in this paper, can offer a decent trade-
off between flexibility and efficiency. Some parameters, such
as unrolling-factors that result in hardware parallelism, must
be decided at design time. However, other parameters, such
as tiling factors and loop order, can be flexibly adjusted in
software. Since a fully featured C/C++-compiler is generated
for our ASIP automatically, computational kernels can easily
be re-used or adapted by means of software libraries.
In Section II a brief overview of existing hardware acceler-
ators for CNNs is given. Afterwards, the convolutional kernel
is introduced in Section III and an overview of the processor
architecture is given in Section IV. Section V presents post
place-and-route results of the ASIP implemented in a 28nm
CMOS technology, as well as some relevant benchmarks run-
ning the state-of-the-art CNNs AlexNet and VGG-16. Finally,
Section VI concludes the paper.
II. RELATED WORK
Many of the published accelerators are comprised of a
large array of processing entities (PEs) with some application-
specific interconnectivity between them. These accelerators
often offer immense performance in terms of throughput
(GOP/s) and energy efficiency (GOP/s/W), yet lack the desired
flexibility when it comes to the employable data flow patterns
and the on-chip data management. In [6], the authors present
a 12x14 MAC array that aims to maximize data reuse (and
therefore minimize off-chip accesses) by applying a specific
computation scheme called row stationary. Some data flow
flexibility is achieved by subdividing the 2D MAC array into
slices and distributing parallel computations amongst these
slices. This flexibility has its limits though, as only a pre-
defined set of parameters can be adapted at runtime. In [7],
a C-programmable processor is presented that makes use of
a 16x16 MAC array to accelerate convolutions. The MAC
array is supplied with data by a RISC processor, which gives
flexibility in terms of the on-chip data management, however
the MAC array itself does not offer any additional flexibility.
Both [6] and [7] apply voltage scaling for their circuit to
demonstrate the potential energy efficiency improvement when
operating at a lower voltage and clock frequency.
Furthermore, since many CNNs can be quantized down to
8 bit fixed-point [8] [9], existing architectures exploit this by
either designing their MAC units to be narrow to begin with
(e.g. 12 bit as in [10]), or by applying techniques such as
precision-gating or using subword parallelism at runtime [7].
Other architectures such as [11] and [12] attempt to alleviate
the memory bottleneck in CNNs by employing near-memory
computation and using 3D memory.
III. CNN DATA FLOW
In general, CNNs consist of a number of concatenated
layers, each executing a pre-defined operation on a 3- or 4-
dimensional tensor (depending on whether batch processing
is applied), thereby generating an output tensor that is used
as input to the following layer. The most common layers
include the convolutional layer, max-pooling layer for tensor-
downsampling, and recently also so-called depth-wise and
point-wise convolutional layers [13], which are special cases of
the regular convolutional layer. For the remainder of this paper
we will focus on the convolutional layers, as they constitute
most of the computational expenditure in modern CNNs.
Furthermore, batch processing is not considered since the
processor presented here is intended for real-time applications
that are latency-sensitive.
[Figure 1: IFMaps (ICh channels of size IH x IW), filters F_0..F_{OCh-1} (each of size ICh x FH x FW) and OFMaps (OCh channels of size OH x OW).]
Fig. 1: Convolutional layer example.
Fig. 1 illustrates the 3D-convolution operation used in
CNNs. A volume of input feature maps (IFMaps) consisting
of ICh separate channels (also called the depth), each of
dimension IH x IW, is convolved with OCh banks of
filters (F_0..F_{OCh-1}). In this process, each filter bank creates one
output feature map (OFMap) of dimension OH x OW by convolving each
IFMap with the corresponding filter of dimension FH x FW
and accumulating the results over all ICh channels.
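
The operation in Fig. 1 can be written as a plain loop nest. The following is a minimal reference sketch, not the ConvAix kernel: it assumes unit stride and no padding (so OH = IH - FH + 1 and OW = IW - FW + 1), uses nested Python lists as tensors, and does not flip the filter, as is common for CNNs. The function name `conv3d` is illustrative.

```python
def conv3d(ifmaps, filters):
    """Naive 3D convolution (unit stride, no padding; both are assumptions).

    ifmaps:  [ICh][IH][IW]       input feature maps
    filters: [OCh][ICh][FH][FW]  one filter bank per output channel
    returns: [OCh][OH][OW]       output feature maps
    """
    ich, ih, iw = len(ifmaps), len(ifmaps[0]), len(ifmaps[0][0])
    och, fh, fw = len(filters), len(filters[0][0]), len(filters[0][0][0])
    oh, ow = ih - fh + 1, iw - fw + 1
    ofmaps = [[[0] * ow for _ in range(oh)] for _ in range(och)]
    for o in range(och):                  # one OFMap per filter bank
        for y in range(oh):
            for x in range(ow):
                acc = 0
                for c in range(ich):      # accumulate over all input channels
                    for fy in range(fh):
                        for fx in range(fw):
                            acc += ifmaps[c][y + fy][x + fx] * filters[o][c][fy][fx]
                ofmaps[o][y][x] = acc
    return ofmaps


# Tiny example: 2 input channels of 3x3 and one filter bank of all-ones 2x2 filters.
ifm = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
       [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
flt = [[[[1, 1], [1, 1]], [[1, 1], [1, 1]]]]
out = conv3d(ifm, flt)
print(out)  # [[[16, 20], [28, 32]]]
```

Each output value costs ICh x FH x FW MAC operations, which is why the total MAC count grows quickly with the layer dimensions.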
As mentioned before, not all IFMaps, OFMaps and filters
can be kept in on-chip memory at the same time, so only
subsets of the data are available for processing. This can
be interpreted as slicing the input and/or output tensors into
smaller chunks of data for which parallel processing is possi-
ble. For more information on the different slicing options, the
interested reader is referred to [4] and [5].
Due to its software programmability, ConvAix supports a
variety of slicing options. Fig. 2 illustrates one option that
is particularly suitable for networks such as AlexNet and
VGG-16, which is also used for the benchmarks presented
in Section V.
[Figure 2: IFMap slices I_0..I_{M-1}, one filter slice, and OFMap slices O_0..O_{N-1} holding partial sums (PSums); arrows mark steps 1 to 3 of the data flow.]
Fig. 2: Exemplary data flow used for benchmarks in Sec. V.
Both IFMaps and OFMaps are sliced along their depth
dimension to build M input slices I_0..I_{M-1} as well as N output
slices O_0..O_{N-1}. Each output slice is then processed in a row-
wise fashion (step 1) in order to re-use existing IFMap rows
when shifting the filter window to the next row. For each slice,
filters are pre-loaded before processing starts, while IFMap
rows and OFMap rows are loaded and stored concurrently on
demand. Partial sums (PSums) of the incomplete OFMaps
are accumulated in local scratchpad memories and buffered in
off-chip memory only if necessary, which also happens
concurrently with processing. After all IFMaps of the current slice
I_m have been processed, the next slice I_{m+1} is loaded (step 2).
Finally, the current OFMap slice O_n is complete and the next
slice O_{n+1} can be processed (step 3). Note that if the IFMaps
are not sliced along their depth dimension, no intermediate
off-chip buffering of PSums is required.
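
The slicing scheme described above can be checked functionally with a short sketch. It is not the ConvAix software: it assumes unit stride, no padding and nested Python lists, and it models neither the scratchpad nor the off-chip PSum buffering. The names `conv_slice`, `sliced_conv`, `in_slice` and `out_slice` are hypothetical. The point is only that accumulating PSums over the input slices reproduces the unsliced result.

```python
def conv_slice(ifmaps, filters):
    """Unit-stride, unpadded convolution of one IFMap slice with one filter slice.

    ifmaps: [C][IH][IW], filters: [O][C][FH][FW], returns [O][OH][OW].
    """
    fh, fw = len(filters[0][0]), len(filters[0][0][0])
    oh, ow = len(ifmaps[0]) - fh + 1, len(ifmaps[0][0]) - fw + 1
    return [[[sum(ifmaps[c][y + fy][x + fx] * bank[c][fy][fx]
                  for c in range(len(ifmaps))
                  for fy in range(fh) for fx in range(fw))
              for x in range(ow)]
             for y in range(oh)]          # output rows are produced one by one
            for bank in filters]


def sliced_conv(ifmaps, filters, in_slice, out_slice):
    """Loop over OFMap slices (step 3) and, inside, over IFMap slices (step 2)."""
    ich, och = len(ifmaps), len(filters)
    ofmaps = []
    for o0 in range(0, och, out_slice):          # output slice O_n
        psums = None
        for i0 in range(0, ich, in_slice):       # input slice I_m
            f_slice = [bank[i0:i0 + in_slice]
                       for bank in filters[o0:o0 + out_slice]]
            part = conv_slice(ifmaps[i0:i0 + in_slice], f_slice)   # step 1
            if psums is None:
                psums = part
            else:                                # accumulate partial sums
                psums = [[[a + b for a, b in zip(row_a, row_b)]
                          for row_a, row_b in zip(map_a, map_b)]
                         for map_a, map_b in zip(psums, part)]
        ofmaps.extend(psums)                     # slice O_n is complete
    return ofmaps


# Deterministic toy data: 4 input channels of 3x3, 2 filter banks of 2x2 filters.
ifm = [[[(c + 3 * y + x) % 5 for x in range(3)] for y in range(3)]
       for c in range(4)]
flt = [[[[(o + c + fy + fx) % 3 - 1 for fx in range(2)] for fy in range(2)]
        for c in range(4)] for o in range(2)]

full = conv_slice(ifm, flt)                              # no slicing
sliced = sliced_conv(ifm, flt, in_slice=2, out_slice=1)  # M = 2, N = 2
print(sliced == full)  # True
```

With `in_slice` equal to the full depth, the inner loop runs once and no PSum accumulation across slices takes place, which corresponds to the case without intermediate off-chip buffering.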
IV. PROCESSOR ARCHITECTURE
The architecture overview of the proposed ASIP, called
ConvAix, is shown in Fig. 3a. It consists of 8 pipeline stages
(IF, ID, E1..E6) with 4 heterogeneous VLIW issue-slots to
exploit instruction-level parallelism. Slot 0 is reserved for
control instructions as well as memory operations both for
the on-chip as well as the off-chip memory. ConvAix also
offers a scalar ALU that is used for address calculations
and house-keeping computations, e.g. loop-counter updates. In
addition to the regular load/store-unit, an application-specific
line buffer is used to cache IFMap rows. Slots 1-3 each offer
a pipelined SIMD vector datapath, where each datapath
(vALU) in turn consists of 4 separate SIMD vector-slices
that can be programmed in C using specific vector primitives
added to the compiler. The vector parallelism is set to 16,
resulting in a total of 4 x 16 = 64 MAC operations per issue-
slot and 3 x 64 = 192 MAC operations in total for all 3
slots. Furthermore, slot 1 includes an application-specific unit
that operates on single vectors of size 16, which is used for
calculating activation functions and performing max-pooling.
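
The parallelism figures above can be reproduced with a few lines of arithmetic. The sketch below uses the 400 MHz target clock stated for ConvAix and assumes that one MAC counts as two operations (one multiply, one add), which is the usual convention and is consistent with the 153.6 GOP/s peak throughput listed in Table I. The constant names are illustrative.

```python
VECTOR_WIDTH = 16      # lanes per SIMD vector slice
SLICES_PER_VALU = 4    # vector slices per vALU
VALU_SLOTS = 3         # issue slots 1-3 each hold one vALU
CLOCK_HZ = 400e6       # target clock frequency
OPS_PER_MAC = 2        # assumption: 1 multiply + 1 add per MAC

macs_per_slot = SLICES_PER_VALU * VECTOR_WIDTH          # 4 x 16
macs_per_cycle = VALU_SLOTS * macs_per_slot             # 3 x 64
peak_gops = macs_per_cycle * OPS_PER_MAC * CLOCK_HZ / 1e9

print(macs_per_slot, macs_per_cycle, peak_gops)  # 64 192 153.6
```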
In general, the ASIP uses 16 bit fixed-point arithmetic for
both the scalar ALU as well as the vector units. However,
the ALU in slot 0 also offers a 32 bit datapath to perform
computations for addressing larger memories such as external
DRAM. Furthermore, the vector datapath supports precision
gating of its operands to reduce the effective word width and
therefore save energy as described in [9]. Settings such as the
rounding-scheme as well as the fractional-shift of the vector-
ALUs can be configured at runtime.
[Figure 3, panel (a): block diagram of the instruction pipeline (IF, ID, E1..E6) showing slot 0 (ALU, control, line buffer, (vector) load/store), the vector-ALUs of slots 1-3 (operand prepare stage, 4 accumulators, vectors of 16), the max-pooling/activation unit, the scalar register file R, the vector register files VR (4x4x256 bit, VR0..VR3) and VRl (3x4x512 bit), the 16 KByte PM, the DM (16 banks of 8 KByte), and the memory controller with DMA connecting to external memory.]
[Figure 3, panel (b): processor area breakdown (w/o SRAMs)]
  Decoder            1.3%
  Memory IF & DMA   10.9%
  Line Buffer        5.5%
  Register Files    20.2%
  vALUs             56.3%
  Misc               5.7%
[Figure 3, panel (c): power distribution]
  Decoder            1.7%
  Memory IF & DMA    6.8%
  Line Buffer        3.6%
  Register Files     8.6%
  vALUs             44.0%
  Misc               1.2%
  DM (SRAM)         31.9%
  PM (SRAM)          2.2%
Fig. 3: ConvAix instruction pipeline and storage overview (3a), processor area breakdown (w/o SRAMs) (3b) and exemplary
power distribution (3c) for AlexNet layer 3 (8 bit gated precision).
All slots have access to a 32-element wide scalar register
file R (16 bit per entry). Two large vector register files VR
and VRl of sizes 16 x 256 bit and 12 x 512 bit respectively
are used to provide data to the vector units, thereby acting
as an intermediate storage between the on-chip SRAM (DM)
and the processor pipeline. The second register file VRl, which
has double the width of VR, is used for vector accumulation.
To reduce the multiplexer depth required to access the vector
register files, both VR and VRl are sliced into 4 (VR0..VR3)
and 3 (VRl0..VRl2) sub-regions respectively. While slot 0 can
access the complete register files, which is required for data
movement and load/stores, the time-critical vector-ALUs only
have access to some of the aforementioned sub-regions. Each
vector-ALU has an operand fetch and prepare stage that can
either broadcast entire vectors to the 4 vector slices within its
ALU or generate a permuted version of the input according
to a pattern, which is set at runtime.
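
As a quick consistency check of the register file sizes given above, the following sketch adds up their capacity. It only uses the entry counts and widths from the text; the reading that each 512 bit VRl entry provides a 32 bit accumulator per lane is an inference from "double the width of VR" with 16 lanes, not a statement from the design documentation. All names are illustrative.

```python
LANES = 16   # vector parallelism

def bits_to_bytes(bits):
    return bits // 8

scalar_rf_bytes = bits_to_bytes(32 * 16)    # R:   32 entries x 16 bit
vr_bytes = bits_to_bytes(16 * 256)          # VR:  16 entries x 256 bit
vrl_bytes = bits_to_bytes(12 * 512)         # VRl: 12 entries x 512 bit
regfile_bytes = scalar_rf_bytes + vr_bytes + vrl_bytes

vr_entries_per_region = 16 // 4             # VR0..VR3
vrl_entries_per_region = 12 // 3            # VRl0..VRl2
vr_bits_per_lane = 256 // LANES             # operand width per lane
vrl_bits_per_lane = 512 // LANES            # accumulator width per lane (inferred)

print(scalar_rf_bytes, vr_bytes, vrl_bytes, regfile_bytes)  # 64 512 768 1344
print(vr_bits_per_lane, vrl_bits_per_lane)                  # 16 32
```

Both vector register files end up with 4 entries per sub-region, which matches the 4x4x256 bit and 3x4x512 bit labels in Fig. 3a.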
In addition to the 16 KByte program memory (PM) used to
fetch instructions from, ConvAix has access to 128 KByte of
dual-ported on-chip SRAM via a custom memory interface.
The aforementioned memory is called data memory (DM),
which is partitioned into 16 banks of 8 KByte each in order
to allow fetches of 2 vectors per cycle (2x256 bit). This is
required by the application, since at least one new filter vector
and input vector must be loaded each cycle to keep the vector-
ALUs busy. To allow seamless transfer of data to/from external
memory while the ASIP processes data slices as described in
Section III, there is a simple direct memory access (DMA)
engine included in the memory interface. In addition to the
DMA, the line buffer unit has direct access to the memory
interface. This allows for simultaneous loads of new IFMap
row chunks while providing (possibly strided) inputs to the
vector-ALUs. Using this approach, strided convolutions are
executed with minimal cycle overhead.
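
The on-chip memory figures translate into a peak DM bandwidth as sketched below. The calculation assumes that both 256 bit vector fetches are served in every cycle at the 400 MHz target clock and that accesses spread evenly over the 16 banks; the per-bank width is therefore an illustrative value, not a documented property of the SRAM macros.

```python
BANKS = 16
BANK_KBYTE = 8
VECTORS_PER_CYCLE = 2
VECTOR_BITS = 256
CLOCK_HZ = 400e6

dm_kbyte = BANKS * BANK_KBYTE                               # total data memory
bytes_per_cycle = VECTORS_PER_CYCLE * VECTOR_BITS // 8      # filter + input vector
peak_gbyte_per_s = bytes_per_cycle * CLOCK_HZ / 1e9
bits_per_bank = VECTORS_PER_CYCLE * VECTOR_BITS // BANKS    # assumes an even spread

print(dm_kbyte, bytes_per_cycle, peak_gbyte_per_s, bits_per_bank)  # 128 64 25.6 32
```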
V. RESULTS
ConvAix was synthesized and placed & routed using a
TSMC 28nm CMOS technology at 1V nominal supply voltage
and standard VT for typical conditions (25 °C). Table I
summarizes the implementation results, while Fig. 3b and Fig.
3c present a detailed breakdown of the ASIP’s area and power
distribution, respectively. The overall layout of the processor is
shown in Fig. 4. All presented power values were obtained by
simulating the netlist after place & route, thereby generating
detailed switching activity of the circuit.
TABLE I: PROCESSOR SPECIFICATION
Technology                         TSMC 28nm SVT 1P8M
Core voltage                       1.0 V
Clock frequency                    400 MHz
Gate count (logic)                 1293 kGE
On-Chip SRAM                       128 KByte (Data), 16 KByte (Instruction)
# MAC Units                        192 (3 x 4 x 16)
Register Files & Pipe Registers    3648 Byte
Peak throughput                    153.6 GOP/s
Arithmetic precision               16 bit fixed-point (+ precision-gating)
Out of the total chip area, the SRAM macro-cells occupy
the largest portion at 63%. As can be seen in Fig. 3b, the largest
area contributors among the logic cells are the vector-
ALUs, which make up 56% in total. Regarding the power
consumption it can be observed that SRAM data memories
together with the register files and the line buffer consume
roughly the same amount of power (44.1%) as the vector-
ALUs (44%). The latter power figure, however, also includes
the contribution of pipeline registers and multiplexers within
the vector-ALUs.
TABLE II: COMPARISON WITH STATE-OF-THE-ART ACCELERATORS
(Where a design was evaluated on two CNN models, values are given as AlexNet / VGG-16.)
Reference                                  | Envision [7]                 | Eyeriss [6]            | This work (ConvAix)
Technology                                 | 40nm LP (Silicon)            | 65nm LP (Silicon)      | 28nm LP (P&R)
Architecture                               | RISC + MAC Array             | ASIC                   | ASIP
Core Voltage [V]                           | 0.85-0.92                    | 1                      | 1
Gate Count (logic only) [kGE]              | 1600                         | 1176                   | 1293
On-Chip SRAM [KByte]                       | 148                          | 181.5                  | 144
Registers [KByte]                          | -                            | 11.8                   | 3.6
Clock Frequency [MHz]                      | 204                          | 200                    | 400
# MAC Units                                | 256                          | 168                    | 192
Peak Performance [GOP/s]                   | 104.5                        | 67.2                   | 153.6
Arithmetic Units                           | 1-16 bit fixed-pt (scalable) | 16 bit fixed-pt        | 1-16 bit fixed-pt (scalable)
CNN Model                                  | AlexNet                      | AlexNet / VGG-16       | AlexNet / VGG-16
Processing Time [ms]                       | 21.07                        | 25.88 / 1251.63        | 12.60 / 263.0
Power Consumption [mW]                     | 70.1                         | 116.8 / 104.8          | 228.8 / 223.9
Off-Chip I/O [MByte] (a)                   | 9.97 (b)                     | 7.19 (c) / 125.8 (c)   | 10.79 (d) / 208.14 (d)
MAC Utilization Rate (e)                   | 0.61                         | 0.77 / 0.36            | 0.69 / 0.76
Area Efficiency [GOP/s/MGE]                | 39.73                        | 44.01 / 20.85          | 82.23 / 90.26
Energy Efficiency [GOP/s/W]                | 815                          | 187 / 104              | - / -
Energy Efficiency @ 28nm/1V [GOP/s/W] (f)  | 955                          | 434 / 242              | 459 / 497
(a) Off-chip I/O for processing batches of size 1
(b) Compressed using Huffman coding
(c) Compressed using run-length coding
(d) Uncompressed
(e) Ratio of ideal to actual processing time, where the ideal time assumes 100% MAC utilization each cycle
(f) Power values were scaled according to the following formula: $P_{\mathrm{scaled}} = P_{\mathrm{old}} \cdot (L_{\mathrm{new}} / L_{\mathrm{old}}) \cdot (V_{DD,\mathrm{new}} / V_{DD,\mathrm{old}})^2$
[Figure 4: floorplan showing vALU0, vALU1, vALU2, vReg, Mem-Ctr, Controller, LBuf, DMA, Misc, DM[0..7], DM[8..15] and PM.]
Fig. 4: Layout view of ConvAix after place & route.
ConvAix was benchmarked using two widely used CNN
models: AlexNet [1] and VGG-16 [14]. Table II summarizes
our results and compares them with two well-known acceler-
ators (Envision [7] and Eyeriss [6]) targeting the same CNN
models. To allow for a fair comparison between all designs,
we scaled the energy efficiency values of all architectures
to a uniform 28nm technology operating at 1V. The values
presented in Table II show overall results across all layers
of the respective CNNs with optimized word width for the
architectures which provide scalable precision. Furthermore, to
increase the fairness of the comparison, the processing times
used in Table II do not include the time required for off-chip
I/O whenever possible. We hereby aim to eliminate the effect
that the I/O bandwidth of the external memories could have
on the presented figures.
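
The technology scaling can be illustrated with the formula from footnote f of Table II. The sketch below assumes that throughput is left unchanged by the scaling, so that the energy efficiency scales with the inverse of the power factor; the function names are illustrative. The example uses the Eyeriss AlexNet entry (65nm, 1 V, 187 GOP/s/W).

```python
def scale_power(p_old, l_old_nm, v_old, l_new_nm=28.0, v_new=1.0):
    """P_scaled = P_old * (L_new / L_old) * (V_new / V_old)^2 (footnote f)."""
    return p_old * (l_new_nm / l_old_nm) * (v_new / v_old) ** 2


def scale_efficiency(eff_old, l_old_nm, v_old, l_new_nm=28.0, v_new=1.0):
    """Energy efficiency in GOP/s/W after scaling.

    Assumption: throughput stays constant, so efficiency changes by the
    inverse of the power scaling factor.
    """
    return eff_old / scale_power(1.0, l_old_nm, v_old, l_new_nm, v_new)


eyeriss_alexnet = scale_efficiency(187.0, l_old_nm=65.0, v_old=1.0)
print(round(eyeriss_alexnet))  # 434
```

The result agrees with the scaled value listed for Eyeriss on AlexNet in Table II.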
Due to its comparatively high clock frequency, ConvAix
exceeds the other designs in terms of processing speed (1.6x
compared to the next fastest in AlexNet and 4.8x for VGG-16)
and area efficiency (1.9x for AlexNet and 4.3x for VGG-16).
At the same time it maintains a competitive energy efficiency
of 459 GOP/s/W on average for AlexNet and 497 GOP/s/W
for VGG-16. The average MAC utilization for AlexNet is
8 percentage points lower than that of Eyeriss. This is expected,
since the proposed design is software-programmed,
which always incurs a certain overhead for control
code. For VGG-16 however, ConvAix demonstrates a much
higher utilization of 76% vs. 36% for Eyeriss. According to
the authors of Eyeriss, this is because of added time required
for repeatedly ramping up the MAC array. The required off-
chip I/O is higher than that of Eyeriss, which can be explained
by the lack of a memory compression engine in our design.
Calculations using the sparsity values provided in [6] show
that our design would achieve similar total off-chip I/O figures
as Eyeriss if compression were added.
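
The ratios quoted in this comparison follow directly from Table II. The short sketch below recomputes them from the table values; the dictionary keys and variable names are illustrative, and the comparison is always made against the best competing design for the respective model.

```python
# Values copied from Table II.
time_ms = {"Envision/AlexNet": 21.07, "Eyeriss/AlexNet": 25.88,
           "Eyeriss/VGG-16": 1251.63,
           "ConvAix/AlexNet": 12.60, "ConvAix/VGG-16": 263.0}
area_eff = {"Envision/AlexNet": 39.73, "Eyeriss/AlexNet": 44.01,
            "Eyeriss/VGG-16": 20.85,
            "ConvAix/AlexNet": 82.23, "ConvAix/VGG-16": 90.26}
utilization = {"Eyeriss/AlexNet": 0.77, "ConvAix/AlexNet": 0.69}

# Speedup relative to the fastest other design.
speedup_alexnet = (min(time_ms["Envision/AlexNet"], time_ms["Eyeriss/AlexNet"])
                   / time_ms["ConvAix/AlexNet"])
speedup_vgg = time_ms["Eyeriss/VGG-16"] / time_ms["ConvAix/VGG-16"]

# Area efficiency relative to the most area-efficient other design.
area_ratio_alexnet = (area_eff["ConvAix/AlexNet"]
                      / max(area_eff["Envision/AlexNet"], area_eff["Eyeriss/AlexNet"]))
area_ratio_vgg = area_eff["ConvAix/VGG-16"] / area_eff["Eyeriss/VGG-16"]

# Utilization gap in absolute terms (percentage points when multiplied by 100).
util_gap = utilization["Eyeriss/AlexNet"] - utilization["ConvAix/AlexNet"]

print(round(speedup_alexnet, 2), round(speedup_vgg, 2))
print(round(area_ratio_alexnet, 2), round(area_ratio_vgg, 2))
print(round(util_gap, 2))
```

The AlexNet speedup evaluates to roughly 1.67, which the text reports as 1.6x; the remaining ratios round to the quoted 4.8x, 1.9x and 4.3x.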
VI. CONCLUSION
It was the goal of this work to demonstrate the practical
feasibility of a software-programmable architecture with an
instruction set that is targeted towards, but not limited to, CNN
acceleration. The envisioned architecture, called ConvAix,
was implemented in a modern 28nm CMOS technology and
evaluated using highly relevant benchmarks. Results show
that ConvAix can not only achieve competitive efficiency
compared to other less flexible designs, but even surpass
them in terms of area efficiency (1.9x for AlexNet, 4.3x for
VGG-16) and throughput. Especially for the larger VGG-16
model, ConvAix achieves significantly higher utilization (76%
compared to 36%) and a 4.8x higher processing speed. Incor-
porating techniques such as dynamic voltage and frequency
scaling or memory compression could further improve the
efficiency of the presented design. We leave it to future work
to investigate this further.
ACKNOWLEDGMENT
This work was supported by the German Federal Ministry
of Education and Research (BMBF) via the PARIS project
(16ES0602) aiming at autonomous driving.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classifica-
tion with Deep Convolutional Neural Networks,” Advances In Neural
Information Processing Systems, pp. 1–9, 2012.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look
Once: Unified, Real-Time Object Detection,” 2015. [Online]. Available:
http://arxiv.org/abs/1506.02640
[3] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer, “Efficient Processing
of Deep Neural Networks: A Tutorial and Survey,” pp. 1–31, 2017.
[Online]. Available: http://arxiv.org/abs/1703.09039
[4] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing Loop Operation
and Dataflow in FPGA Acceleration of Deep Convolutional Neural Net-
works,” Proceedings of the 2017 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays - FPGA ’17, pp. 45–54, 2017.
[5] Y.-h. Chen, J. Emer, and V. Sze, “Using Dataflow to Optimize Energy
Efficiency of Deep Neural Network Accelerators,” IEEE Micro, no. 3,
pp. 12–21.
[6] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural
Networks,” IEEE Journal of Solid-State Circuits, no. 1, pp. 127–138.
[7] B. Moons and M. Verhelst, “An Energy-Efficient Precision-Scalable
ConvNet Processor in 40-nm CMOS,” IEEE Journal of Solid-State
Circuits, vol. 52, no. 4, pp. 903–914, 2017.
[8] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, “Fixed Point
Quantization of Deep Convolutional Networks,” vol. 48, 2015.
[Online]. Available: http://arxiv.org/abs/1511.06393
[9] B. Moons, B. De Brabandere, L. Van Gool, and M. Verhelst, “Energy-
Efficient ConvNets Through Approximate Computing,” in 2016 IEEE
Winter Conference on Applications of Computer Vision (WACV). IEEE,
pp. 1–8.
[10] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W Convolutional
Network Accelerator,” IEEE Transactions on Circuits and Systems for
Video Technology, no. 11, pp. 2461–2475.
[11] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, “Neurostream: Scalable
and Energy Efficient Deep Learning with Smart Memory Cubes,” IEEE
Transactions on Parallel and Distributed Systems, no. 2, pp. 420–434.
[12] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS:
Scalable and Efficient Neural Network Acceleration with 3D Memory,”
Proceedings of the Twenty-Second International Conference on Archi-
tectural Support for Programming Languages and Operating Systems -
ASPLOS ’17, no. 2, pp. 751–764.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications,” 2017.
[Online]. Available: http://arxiv.org/abs/1704.04861
[14] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” International Conference on
Learning Representations (ICLR), pp. 1–14, 2014. [Online]. Available:
http://arxiv.org/abs/1409.1556
