Abstract-Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPU and GPGPU), which are usually not energy-efficient since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have been proposed recently to improve the energy-efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an NN (such as layers), or even an NN as a whole. Although straightforward and easy-to-implement for a limited set of similar NN techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency.
I. INTRODUCTION
Artificial Neural Networks (NNs for short) are a large family of machine learning techniques initially inspired by neuroscience, and have been evolving towards deeper and larger structures over the last decade. Though computationally expensive, NN techniques as exemplified by deep learning [22], [25], [26], [27] have become the state-of-the-art across a broad range of applications (such as pattern recognition [8] and web search [17]); some have even achieved human-level performance on specific tasks such as ImageNet recognition [23] and Atari 2600 video games [33].
Yunji Chen (cyj@ict.ac.cn) is the corresponding author of this paper.
Traditionally, NN techniques are executed on general-purpose platforms composed of CPUs and GPGPUs, which are usually not energy-efficient because both types of processors invest excessive hardware resources to flexibly support various workloads [7], [10], [45]. Hardware accelerators customized to NNs have been recently investigated as energy-efficient alternatives [3], [5], [11], [29], [32]. These accelerators often adopt high-level and informative instructions (control signals) that directly specify the high-level functional blocks (e.g., layer type: convolutional/pooling/classifier) or even an NN as a whole, instead of low-level computational operations (e.g., dot product), and their decoders can be fully optimized to each instruction.
Although straightforward and easy-to-implement for a small set of similar NN techniques (thus a small instruction set), the design/verification complexity and the area/power overhead of the instruction decoder for such accelerators will easily become unacceptably large, when the need of flexibly supporting a variety of different NN techniques results in a significant expansion of instruction set. Consequently, the design of such accelerators can only efficiently support a small subset of NN techniques sharing very similar computational patterns and data locality, but is incapable of handling the significant diversity among existing NN techniques. For example, the state-of-the-art NN accelerator DaDianNao [5] can efficiently support the Multi-Layer Perceptrons (MLPs) [50] , but cannot accommodate the Boltzmann Machines (BMs) [39] whose neurons are fully connected to each other. As a result, the ISA design is still a fundamental yet unresolved challenge that greatly limits both flexibility and efficiency of existing NN accelerators.
In this paper, we study the design of the ISA for NN accelerators, inspired by the success of RISC ISA design principles [37] : (a) First, decomposing complex and informative instructions describing high-level functional blocks of NNs (e.g., layers) into shorter instructions corresponding to low-level computational operations (e.g., dot product) allows an accelerator to have a broader application scope, as users can now use low-level operations to assemble new high-level functional blocks that are indispensable in new NN techniques; (b) Second, simple and short instructions significantly reduce design/verification complexity and power/area of the instruction decoder.
The result of our study is a novel ISA for NN accelerators, called Cambricon. Cambricon is a load-store architecture whose instructions are all 64-bit, and contains 64 32-bit General-Purpose Registers (GPRs) for scalars, mainly for control and addressing purposes. To support intensive, contiguous, variable-length accesses to vector/matrix data (which are common in NN techniques) with negligible area/power overhead, Cambricon does not use any vector register file, but keeps data in on-chip scratchpad memory, which is visible to programmers/compilers. There is no need to implement multiple ports in the on-chip memory (as in the register file), as simultaneous accesses to different banks decomposed with addresses' low-order bits are sufficient to support NN techniques (Section IV). Unlike an SIMD whose performance is restricted by the limited width of the register file, Cambricon efficiently supports larger and variable data width because the banks of on-chip scratchpad memory can easily be made wider than the register file.
We evaluate Cambricon over a total of ten representative yet distinct NN techniques (MLP [2], CNN [28], RNN [15], LSTM [15], Autoencoder [49], Sparse Autoencoder [49], BM [39], RBM [39], SOM [48], HNN [36]), and observe that Cambricon provides higher code density than general-purpose ISAs such as MIPS (13.38 times), x86 (9.86 times), and GPGPU (6.41 times). Compared to the latest state-of-the-art NN accelerator design DaDianNao [5] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency, power, and area overheads (4.5%/4.4%/1.6%, respectively), with a versatile coverage of 10 different NN benchmarks.
Our key contributions in this work are the following: 1) We propose a novel and lightweight ISA having strong descriptive capacity for NN techniques; 2) We conduct a comprehensive study on the computational patterns of existing NN techniques; 3) We evaluate the effectiveness of Cambricon with an implementation of the first Cambricon-based accelerator using TSMC 65nm technology.
The rest of the paper is organized as follows. Section II briefly discusses a few design guidelines followed by Cambricon and presents an overview of Cambricon. Section III introduces computational and logical instructions of Cambricon. Section IV presents a prototype Cambricon accelerator. Section V empirically evaluates Cambricon, and compares it against other ISAs. Section VI discusses the potential extension of Cambricon to broader techniques. Section VII presents the related work. Section VIII concludes the whole paper.
II. OVERVIEW OF THE PROPOSED ISA
In this section, we first describe the design guideline for our proposed ISA, and then a brief overview of the ISA.
A. Design Guidelines
To design a succinct, flexible, and efficient ISA for NNs, we analyze various NN techniques in terms of their computational operations and memory access patterns, based on which we propose a few design guidelines before making concrete design decisions.
• Data-level Parallelism. We observe that in most NN techniques, neuron and synapse data are organized as layers and then manipulated in a uniform/symmetric manner. When accommodating these operations, data-level parallelism enabled by vector/matrix instructions can be more efficient than instruction-level parallelism of traditional scalar instructions, and corresponds to higher code density. Therefore, the focus of Cambricon is data-level parallelism.
• Customized Vector/Matrix Instructions. Although there are many linear algebra libraries (e.g., the BLAS library [9]) successfully covering a broad range of scientific computing applications, for NN techniques, fundamental operations defined in those algebra libraries are not necessarily effective and efficient choices (some are even redundant). More importantly, there are many common operations of NN techniques that are not covered by traditional linear algebra libraries. For example, the BLAS library does not support element-wise exponential computation of a vector, nor does it support the random vector generation used in synapse initialization, dropout [8] and Restricted Boltzmann Machine (RBM) [39]. Therefore, we must comprehensively customize a small yet representative set of vector/matrix instructions for existing NN techniques, instead of simply re-implementing vector/matrix operations from an existing linear algebra library.
• Using On-chip Scratchpad Memory. We observe that NN techniques often require intensive, contiguous, and variable-length accesses to vector/matrix data, and therefore using fixed-width power-hungry vector register files is no longer the most cost-effective choice. In our design, we replace vector register files with on-chip scratchpad memory, providing flexible width for each data access. This is usually a highly-efficient choice for data-level parallelism in NNs, because synapse data in NNs are often large and rarely reused, diminishing the performance gain brought by vector register files.
B. An Overview to Cambricon
We design Cambricon following the guidelines presented in Section II-A, and provide an overview of Cambricon in Table I. Cambricon is a load-store architecture which only allows the main memory to be accessed with load/store instructions.
Figure 2. Vector Load (VLOAD) instruction.
On-chip Scratchpad Memory. Cambricon does not use any vector register file, but directly keeps data in on-chip scratchpad memory, which is made visible to programmers/compilers. In other words, the role of on-chip scratchpad memory in Cambricon is similar to that of the vector register file in traditional ISAs, and sizes of vector operands are no longer limited by fixed-width vector register files. Therefore, vector/matrix sizes are variable in Cambricon instructions, and the only notable restriction is that the vector/matrix operands in the same instruction cannot exceed the capacity of scratchpad memory. If they do exceed it, the compiler will decompose long vectors/matrices into short pieces/blocks and generate multiple instructions to process them.
Just as the 32x512b vector registers have been baked into Intel AVX-512 [18], capacities of on-chip memories for both vector and matrix instructions must be fixed in Cambricon. More specifically, Cambricon fixes the memory capacity to be 64KB for vector instructions and 768KB for matrix instructions. Yet, Cambricon does not impose any specific restriction on the number of banks of scratchpad memory, leaving significant freedom to microarchitecture-level implementations.
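The decomposition of over-long operands described above can be sketched in Python as follows. This is only an illustration of the blocking idea, not the authors' compiler: the helper name `split_vector`, the 16-bit element width (borrowed from the prototype's 16-bit arithmetic units), and the assumption that all operands of one instruction share the 64KB vector scratchpad equally are ours.

```python
VECTOR_SCRATCHPAD_BYTES = 64 * 1024  # vector scratchpad capacity fixed by Cambricon
ELEMENT_BYTES = 2                    # assumption: 16-bit elements


def split_vector(length, num_operands,
                 capacity=VECTOR_SCRATCHPAD_BYTES,
                 element_bytes=ELEMENT_BYTES):
    """Return (offset, size) blocks, counted in elements, such that
    `num_operands` vectors of `size` elements fit in the scratchpad together."""
    max_elems = capacity // (num_operands * element_bytes)
    if max_elems == 0:
        raise ValueError("scratchpad too small for one element per operand")
    blocks = []
    offset = 0
    while offset < length:
        size = min(max_elems, length - offset)
        blocks.append((offset, size))
        offset += size
    return blocks


# A 40,000-element vector addition has three operands (two inputs, one output),
# so each block may hold at most 65536 // (3 * 2) = 10922 elements per operand.
blocks = split_vector(40_000, num_operands=3)
```

Each returned block would correspond to one generated instruction operating on a slice of the original vectors.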
III. COMPUTATIONAL/LOGICAL INSTRUCTIONS
In neural networks, most arithmetic operations (e.g., additions, multiplications and activation functions) can be aggregated as vector operations [10], [45], and the ratio can be as high as 99.992% according to our quantitative observations on a state-of-the-art Convolutional Neural Network (GoogLeNet) winning the 2014 ImageNet competition (ILSVRC14) [43]. In the meantime, we also discover that 99.791% of the vector operations (such as dot product operation) in the GoogLeNet can be aggregated further as matrix operations (such as vector-matrix multiplication). In a nutshell, NNs can be naturally decomposed into scalar, vector, and matrix operations, and the ISA design must effectively take advantage of the potential data-level parallelism and data locality.
A. Matrix Instructions
We conduct a thorough and comprehensive review of existing NN techniques, and design a total of six matrix instructions for Cambricon. Here we take a Multi-Layer Perceptron (MLP) [50], a well-known and representative NN, as an example, and show how it is supported by the matrix instructions. Technically, an MLP usually has multiple layers, each of which computes values of some neurons (i.e., output neurons) according to some neurons whose values are known (i.e., input neurons). We illustrate the feedforward run of one such layer in Fig. 3. More specifically, the output neuron $y_i$ ($i = 1, 2, 3$) in Fig. 3 can be computed as
$$y_i = f\left(\sum_{j=1}^{3} w_{ij} x_j + b_i\right),$$
where $x_j$ is the $j$-th input neuron, $w_{ij}$ is the weight between the $i$-th output neuron and the $j$-th input neuron, $b_i$ is the bias of the $i$-th output neuron, and $f$ is the activation function. The output neurons can be computed as a vector $\mathbf{y} = (y_1, y_2, y_3)$:
$$\mathbf{y} = \mathbf{f}(W\mathbf{x} + \mathbf{b}), \qquad (1)$$
where $\mathbf{x} = (x_1, x_2, x_3)$ and $\mathbf{b} = (b_1, b_2, b_3)$ are vectors of input neurons and biases, respectively, $W = (w_{ij})$ is the weight matrix, and $\mathbf{f}$ is the element-wise version of the activation function $f$ (see Section III-B).
A critical step in Eq. 1 is to compute $W\mathbf{x}$, which will be performed by the Matrix-Mult-Vector (MMV) instruction in Cambricon. We illustrate this instruction in Fig. 4, where Reg0 specifies the base scratchpad memory address of the vector output (Vout_addr); Reg1 specifies the size of the vector output (Vout_size); Reg2, Reg3, and Reg4 specify the base address of the matrix input (Min_addr), the base address of the vector input (Vin_addr), and the size of the vector input (Vin_size, note that it is variable), respectively. The MMV instruction can support matrix-vector multiplication at arbitrary scales, as long as all the input and output data can be kept simultaneously in the scratchpad memory. We choose to compute $W\mathbf{x}$ with the dedicated MMV instruction instead of decomposing it into multiple vector dot products, because the latter approach requires additional effort (e.g., explicit synchronization, concurrent read/write requests to the same address) to reuse the input vector $\mathbf{x}$ among different row vectors of $W$, which is less efficient.
Unlike the feedforward case, however, the MMV instruction does not provide efficient support for the backward training process of an NN. More specifically, a critical step of the well-known Back-Propagation (BP) algorithm is to compute the gradient vector [20], which can be formulated as a vector multiplied by a matrix. If we implement it with the MMV instruction, we need an additional instruction implementing matrix transpose, which is rather expensive in data movements. To avoid that, Cambricon provides a Vector-Mult-Matrix (VMM) instruction which is directly applicable to the backward training process. The VMM instruction has the same fields as the MMV instruction, except for the opcode.
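To make the difference between MMV and VMM concrete, the following pure-Python sketch models only their arithmetic semantics (not the register fields or the scratchpad addressing). The function names `mmv`, `vmm`, and `transpose`, and the sample values, are illustrative assumptions rather than anything defined by Cambricon.

```python
def mmv(matrix, vector):
    """Matrix-Mult-Vector semantics: W x, one dot product per row of W."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("row length must match vector length")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def vmm(vector, matrix):
    """Vector-Mult-Matrix semantics: v W, computed without forming W^T."""
    if len(vector) != len(matrix):
        raise ValueError("vector length must match number of rows")
    out = [0.0] * len(matrix[0])
    for v, row in zip(vector, matrix):
        for j, w in enumerate(row):
            out[j] += v * w
    return out


def transpose(matrix):
    """Explicit transpose, the extra data movement that VMM avoids."""
    return [list(col) for col in zip(*matrix)]


W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]   # 2 output neurons, 3 input neurons
x = [1.0, 0.0, -1.0]    # hypothetical input vector
delta = [0.5, -1.0]     # hypothetical error vector for the backward pass

forward = mmv(W, x)                        # feedforward: W x
backward = vmm(delta, W)                   # backward: delta W
via_transpose = mmv(transpose(W), delta)   # same values, but needs W^T first
```

Here `backward` and `via_transpose` hold the same values; the second route simply pays for materializing the transposed matrix before multiplying.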
Moreover, in training an NN, the weight matrix $W$ often needs to be incrementally updated with $W = W + \eta \Delta W$, where $\eta$ is the learning rate and $\Delta W$ is estimated as the outer product of two vectors. Cambricon provides an Outer-Product (OP) instruction (the output is a matrix), a Matrix-Mult-Scalar (MMS) instruction, and a Matrix-Add-Matrix (MAM) instruction to collaboratively perform the weight updating. In addition, Cambricon also provides a Matrix-Subtract-Matrix (MSM) instruction to support the weight updating in Restricted Boltzmann Machine (RBM) [39].
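The weight update above can be read as a three-instruction sequence (OP, then MMS, then MAM). The sketch below mirrors that sequence in plain Python; the function names and the particular vectors used to form the outer product are hypothetical, since the text does not fix which two vectors are involved.

```python
def outer_product(u, v):
    """OP semantics: the matrix u v^T."""
    return [[ui * vj for vj in v] for ui in u]


def matrix_mult_scalar(m, s):
    """MMS semantics: every element of m scaled by s."""
    return [[s * e for e in row] for row in m]


def matrix_add_matrix(a, b):
    """MAM semantics: element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 2.0],
     [3.0, 4.0]]
u = [1.0, -2.0]   # hypothetical first vector (e.g., an error term)
v = [0.5, 0.25]   # hypothetical second vector (e.g., layer input)
eta = 0.5         # learning rate

delta_W = outer_product(u, v)               # step 1: OP
scaled = matrix_mult_scalar(delta_W, eta)   # step 2: MMS
W_new = matrix_add_matrix(W, scaled)        # step 3: MAM, W + eta * delta_W
```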
B. Vector Instructions
Using Eq. 1 as an example, one can observe that the matrix instructions defined in the prior subsection are still insufficient to perform all the computations. We still need to add up the vector output of W x and the bias vector b, and then perform an element-wise activation to W x + b.
While Cambricon directly provides a Vector-Add-Vector (VAV) instruction for vector additions, it requires multiple instructions to support the element-wise activation. Without loss of generality, here we take the widely-used sigmoid activation, $f(a) = e^{a}/(1 + e^{a})$, as an example. The element-wise sigmoid activation applied to each element of an input vector (say, $\mathbf{a}$) can be decomposed into 3 consecutive steps, which are supported by 3 instructions, respectively:
1. Computing the exponential $e^{a_i}$ for each element ($a_i$, $i = 1, \dots, n$) in the input vector $\mathbf{a}$. Cambricon provides a Vector-Exponential (VEXP) instruction for element-wise exponential of a vector.
2. Adding the constant 1 to each element of the vector $(e^{a_1}, \dots, e^{a_n})$. Cambricon provides a Vector-Add-Scalar (VAS) instruction, where the scalar can be an immediate or specified by a GPR.
3. Dividing $e^{a_i}$ by $1 + e^{a_i}$ for each vector index $i = 1, \dots, n$. Cambricon provides a Vector-Div-Vector (VDV) instruction for element-wise division between vectors.
However, the sigmoid is not the only activation function utilized by existing NNs. To implement element-wise versions of various activation functions, Cambricon provides a series of vector arithmetic instructions, such as Vector-Mult-Vector (VMV), Vector-Sub-Vector (VSV), and Vector-Logarithm (VLOG). During the design of a hardware accelerator, instructions related to different transcendental functions (e.g., logarithmic, trigonometric and anti-trigonometric functions) can efficiently reuse the same functional block (involving addition, shift, and table-lookup operations), using the CORDIC technique [24]. Moreover, there are activation functions (e.g., $\max(0, a)$ and $|a|$) that partially rely on logical operations (e.g., comparison), and we will present the related Cambricon instructions (e.g., vector compare instructions) in Section III-C.
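The three-step sigmoid decomposition (VEXP, VAS, VDV) can be mimicked in a few lines of Python. This is a sketch of the arithmetic only: the function names are ours, and floating-point `math.exp` stands in for whatever fixed-point transcendental unit an accelerator would actually use.

```python
import math


def vexp(a):
    """VEXP semantics: element-wise exponential."""
    return [math.exp(e) for e in a]


def vas(a, scalar):
    """VAS semantics: add one scalar to every element."""
    return [e + scalar for e in a]


def vdv(a, b):
    """VDV semantics: element-wise division a[i] / b[i]."""
    return [x / y for x, y in zip(a, b)]


def sigmoid_vector(a):
    exp_a = vexp(a)           # step 1: e^{a_i}
    denom = vas(exp_a, 1.0)   # step 2: 1 + e^{a_i}
    return vdv(exp_a, denom)  # step 3: e^{a_i} / (1 + e^{a_i})


result = sigmoid_vector([0.0, 2.0, -2.0])
```

Note that this literal form overflows for large positive inputs in floating point, so it is meant to illustrate the instruction sequence rather than to serve as a numerically robust sigmoid.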
Furthermore, random vector generation is an important operation common in many NN techniques (e.g., dropout [8] and random sampling [39]), but is not deemed a necessity in traditional linear algebra libraries designed for scientific computing (e.g., the BLAS library does not include this operation). Cambricon provides a dedicated instruction (Random-Vector, RV) that generates a vector of random numbers obeying the uniform distribution on the interval [0, 1]. Given uniform random vectors, we can further generate random vectors obeying other distributions (e.g., Gaussian distribution) using the Ziggurat algorithm [31], with the help of vector arithmetic instructions and vector compare instructions in Cambricon.
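As one illustration of how a uniform random vector combines with compare and arithmetic instructions, the sketch below builds a dropout mask. It is an assumption-laden example: the names are ours, the compare is assumed to produce a 0/1 vector, Python's generator samples from [0, 1) rather than the closed interval, and the surviving activations are not rescaled.

```python
import random


def random_vector(n, rng):
    """RV-like semantics: n uniform samples (Python draws from [0, 1))."""
    return [rng.random() for _ in range(n)]


def vector_greater_than(a, threshold):
    """Assumed compare semantics: 1.0 where a[i] > threshold, else 0.0."""
    return [1.0 if e > threshold else 0.0 for e in a]


def vector_mult_vector(a, b):
    """VMV semantics: element-wise product."""
    return [x * y for x, y in zip(a, b)]


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob."""
    r = random_vector(len(activations), rng)
    mask = vector_greater_than(r, drop_prob)  # keep with probability 1 - drop_prob
    return vector_mult_vector(activations, mask), mask


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [0.3, 1.2, -0.7, 2.0, 0.9, -1.5]
dropped, mask = dropout(activations, 0.5, rng)
```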
C. Logical Instructions
The state-of-the-art NN techniques leverage a few operations that incorporate comparisons or other logical manipulations. The max-pooling operation is one such operation (see Fig. 5a for an illustration), which seeks the neuron having the largest output among neurons within a pooling window, and repeats this action for corresponding pooling windows in different input feature maps (see Fig. 5b). Cambricon supports the max-pooling operation with
D. Scalar Instructions
Although we have observed that only 0.008% of the arithmetic operations of the GoogLeNet [43] cannot be supported with matrix and vector instructions in Cambricon, there are also scalar operations that are indispensable to NNs, such as elementary arithmetic operations and scalar transcendental functions. We summarize them in Table I; they are formally defined as Cambricon's scalar instructions.
E. Code Examples
To illustrate the usage of our proposed instruction set, we implement three simple yet representative components of NNs, an MLP feedforward layer [50], a pooling layer [22], and a Boltzmann Machine (BM) layer [39], using Cambricon instructions. For the sake of brevity, we omit scalar load/store instructions for all three layers, and only show the program fragment of a single pooling window (with multiple input and output feature maps) for the pooling layer. We illustrate the concrete Cambricon program fragments in Fig. 7, and we observe that the code density of Cambricon is significantly higher than that of x86 and MIPS (see Section V for a comprehensive evaluation).
IV. A PROTOTYPE ACCELERATOR
Figure 8. A prototype accelerator based on Cambricon.
In this section, we present a prototype accelerator of Cambricon. We illustrate the design in Fig. 8, which contains seven major instruction pipeline stages: fetching, decoding, issuing, register reading, execution, writing back, and committing. We use mature techniques such as scratchpad memory and DMA in this accelerator, since we found that these classic techniques have been sufficient to reflect the flexibility (Section V-B1), conciseness (Section V-B2) and efficiency (Section V-B3) of the ISA. We did not seek to explore emerging techniques (such as 3D stacking [51] and non-volatile memory [47], [46]) in our prototype design, but left such exploration as future work, because we believe that a promising ISA must be easy to implement and should not be tightly coupled with emerging techniques.
As illustrated in Fig. 8, after the fetching and decoding stages, an instruction is injected into an in-order issue queue. After successfully fetching the operands (scalar data, or address/size of vector/matrix data) from the scalar register file, an instruction will be sent to different units depending on the instruction type. Control instructions and scalar computational/logical instructions will be sent to the scalar functional unit for direct execution. After writing back to the scalar register file, such an instruction can be committed from the reorder buffer as long as it has become the oldest uncommitted yet executed instruction.
Data transfer instructions, vector/matrix computational instructions, and vector logical instructions, which may access the L1 cache or scratchpad memories, will be sent to the Address Generation Unit (AGU). Such an instruction needs to wait in an in-order memory queue to resolve potential memory dependencies with earlier instructions in the memory queue. After that, load/store requests of scalar data transfer instructions will be sent to the L1 cache, data transfer/computational/logical instructions for vectors will be sent to the vector functional unit, and data transfer/computational instructions for matrices will be sent to the matrix functional unit. After the execution, such an instruction can be retired from the memory queue, and then be committed from the reorder buffer as long as it has become the oldest uncommitted yet executed instruction.
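The text does not spell out how memory dependencies are detected in the memory queue. A common approach, assumed here purely for illustration, is to compare address ranges: a later access must wait if its range overlaps an earlier one and at least one of the two is a write. The `Access` record and helper names below are hypothetical, and each instruction is simplified to a single access.

```python
from collections import namedtuple

# Hypothetical record for one scratchpad access: [base, base + size).
Access = namedtuple("Access", ["base", "size", "is_write"])


def overlaps(a, b):
    """True if the half-open address ranges of a and b intersect."""
    return a.base < b.base + b.size and b.base < a.base + a.size


def depends_on(later, earlier):
    """A dependency exists when the ranges overlap and either access writes."""
    return overlaps(later, earlier) and (later.is_write or earlier.is_write)


def can_issue(candidate, memory_queue):
    """The candidate may leave the queue only if no earlier entry conflicts."""
    return not any(depends_on(candidate, earlier) for earlier in memory_queue)


queue = [Access(base=0, size=64, is_write=True)]        # earlier vector store
blocked = can_issue(Access(32, 16, False), queue)        # reads inside the store
independent = can_issue(Access(64, 16, False), queue)    # reads just past it
```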
The accelerator implements both vector and matrix functional units. The vector unit contains 32 16-bit adders and 32 16-bit multipliers, and is equipped with a 64KB scratchpad memory. The matrix unit contains 1024 multipliers and 1024 adders, which are divided into 32 separate computational blocks to avoid excessive wire congestion and power consumption on long-distance data movements. Each computational block is equipped with a separate 24KB scratchpad. The 32 computational blocks are connected through an h-tree bus that serves to broadcast input values to each block and to collect output values from each block.
A notable Cambricon feature is that it does not use any vector register file, but keeps data in on-chip scratchpad memories. To efficiently access scratchpad memories, the vector/matrix functional unit of the prototype accelerator integrates three DMAs, each of which corresponds to one vector/matrix input/output of an instruction. In addition, the scratchpad memory is equipped with an IO DMA. However, each scratchpad memory provides only a single port per bank, yet may need to serve up to four concurrent read/write requests. We design a specific structure for the scratchpad memory to tackle this issue (see Fig. 9). Concretely, we decompose the memory into four banks according to the low-order two bits of the addresses, and connect them with four read/write ports via a crossbar, which guarantees that no bank is accessed by more than one port at a time. Thanks to the dedicated hardware support, Cambricon does not need an expensive multi-port vector register file, and can flexibly and efficiently support different data widths using the on-chip scratchpad memory.
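The bank selection by low-order address bits can be sketched as follows. Only the four-bank split on the two low-order bits comes from the text; treating addresses as indices of bank-width words, and serializing conflicting requests into later cycles, are assumptions made for this illustration (the text does not say how conflicts are resolved).

```python
NUM_BANKS = 4  # four banks, selected by the two low-order address bits


def bank_of(address):
    """Bank index taken from the low-order two bits of the address."""
    return address & (NUM_BANKS - 1)


def schedule(addresses):
    """Greedily pack requests into cycles so that each cycle touches
    every bank at most once (assumed conflict-handling policy)."""
    cycles = []
    for addr in addresses:
        bank = bank_of(addr)
        for cycle in cycles:
            if bank not in cycle:
                cycle[bank] = addr
                break
        else:
            # Every existing cycle already uses this bank: open a new cycle.
            cycles.append({bank: addr})
    return cycles


conflict_free = schedule([0, 1, 2, 3])  # four consecutive words, four banks
conflicting = schedule([0, 4, 1, 2])    # addresses 0 and 4 both map to bank 0
```

Contiguous accesses, which the paper identifies as the common case for NN workloads, spread naturally across the banks, which is why single-ported banks can suffice.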
V. EXPERIMENTAL EVALUATION
In this section, we first describe the evaluation methodology, and then present the experimental results.
A. Methodology
Design evaluation. We synthesize the prototype accelerator of Cambricon (Cambricon-ACC, see Section IV) with Synopsys Design Compiler using the TSMC 65nm GP standard VT library, place and route the synthesized design with the Synopsys ICC compiler, simulate and verify it with Synopsys VCS, and estimate the power consumption with Synopsys PrimeTime PX according to the simulated Value Change Dump (VCD) file. We are planning an MPW tapeout of the prototype accelerator, with a small area budget of 60 mm² at a 65nm process and a targeted operating frequency of 1 GHz. Therefore, we adopt moderate functional unit sizes and scratchpad memory capacities in order to fit the area budget. Table II shows the details of design parameters.
Baselines. We compare the Cambricon-ACC with three baselines. The first two are based on general-purpose CPU and GPU, and the last one is a state-of-the-art NN hardware accelerator:
• CPU. The CPU baseline is an x86-CPU with 256-bit SIMD support (Intel Xeon E5-2620, 2.10GHz, 64 GB memory). We use the Intel MKL library [19] to implement vector and matrix primitives for the CPU baseline, and GCC v4.7.2 to compile all benchmarks with options "-O2 -lm -march=native" to enable SIMD instructions.
• GPU. The GPU baseline is a modern GPU card (NVIDIA K40M, 12GB GDDR5, 4.29 TFlops peak at a 28nm process); we implement all benchmarks (see below) with the NVIDIA cuBLAS library [35], a state-of-the-art linear algebra library for GPU.
• NN Accelerator. The baseline accelerator is DaDianNao, a state-of-the-art NN accelerator exhibiting remarkable energy-efficiency improvement over a GPU [5]. We re-implement the DaDianNao architecture at a 65nm process, but replace all eDRAMs with SRAMs because we do not have a 65nm eDRAM library. In addition, we re-size DaDianNao such that it has a comparable amount of arithmetic operators and on-chip SRAM capacity as our design, which enables a fair comparison of the two accelerators under our area budget (<60 mm²) mentioned in the previous paragraph. The re-implemented version of DaDianNao has a single central tile and a total of 32 leaf tiles. The central tile has 64KB SRAM, 32 16-bit adders and 32 16-bit multipliers; each leaf tile has 24KB SRAM, 32 16-bit adders and 32 16-bit multipliers. In other words, the total numbers of adders and multipliers, as well as the total SRAM capacity in the re-implemented DaDianNao, are the same as in our prototype accelerator. Although we are constrained to give up eDRAMs in both accelerators, this is still a fair and reasonable experimental setting, because the flexibility of an accelerator is mainly determined by its ISA, not the concrete devices it integrates. In this sense, the flexibility gained from Cambricon will still be there even when we resort to large eDRAMs to remove main memory accesses and improve the performance for both accelerators.
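The claim that the two designs are matched in resources can be checked directly from the figures quoted in this section and in Section IV. For the re-implemented DaDianNao (one central tile plus 32 leaf tiles):

$$\underbrace{32}_{\text{central tile}} + \underbrace{32 \times 32}_{\text{leaf tiles}} = 1056 \ \text{multipliers (and likewise 1056 adders)}, \qquad 64\,\text{KB} + 32 \times 24\,\text{KB} = 832\,\text{KB of SRAM}.$$

For Cambricon-ACC, the vector unit contributes 32 multipliers, 32 adders and 64KB, and the matrix unit contributes 1024 multipliers, 1024 adders and $32 \times 24\,\text{KB} = 768\,\text{KB}$, giving the same totals of 1056 operators of each kind and 832KB of on-chip memory.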
Benchmarks. We take 10 representative NN techniques as our benchmarks (see Table III). Each benchmark is manually translated into assembly code to execute on Cambricon-ACC and DaDianNao. We evaluate their cycle-level performance with Synopsys VCS.
B. Experimental Results
We compare Cambricon and Cambricon-ACC with the baselines in terms of metrics such as performance and energy. We also provide the detailed layout characteristics of the prototype accelerator.
1) Flexibility: In view of the apparent flexibility provided by general-purpose ISAs (e.g., x86, MIPS and GPU-ISA), here we restrict our discussion to ISAs of NN accelerators. DaDianNao [5] and DianNao [3] are the only two NN accelerators that have explicit ISAs (other ones are often hardwired). They share similar ISAs, and our discussion is exemplified by DaDianNao, the one with better performance and multicore scaling. To be specific, the ISA of this accelerator only contains four 512-bit VLIW instructions corresponding to four popular layer types of neural networks (fully-connected classifier layer, convolutional layer, pooling layer, and local response normalization layer), rendering it a rather incomplete ISA for the NN domain. Among the 10 representative benchmark networks listed in Table III, the DaDianNao ISA is only capable of expressing MLP, CNN, and RBM, but fails to implement the remaining 7 benchmarks (RNN, LSTM, AutoEncoder, Sparse AutoEncoder, BM, SOM and HNN). An observation that explains the failure of DaDianNao on these 7 networks is that they cannot be characterized as aggregations of the four types of layers (and thus of DaDianNao instructions). In contrast, Cambricon defines a total of 43 64-bit scalar/control/vector/matrix instructions, and is sufficiently flexible to express all 10 networks.
2) Code Density: Code density is a meaningful ISA metric only when the ISA is flexible enough to cover a broad range of applications in the target domain. Therefore, we only compare the code density of Cambricon with GPU, MIPS, and x86, with 10 benchmarks implemented with Cambricon, CUDA-C, and C, respectively. We manually write the Cambricon programs; we compile the CUDA-C programs with nvcc, and count the length of the generated ptx files after removing initialization and system-call instructions; we compile the C programs with x86 and MIPS compilers, respectively (with the option -O2), and then count the lengths of the two kinds of assembly code. We illustrate in Fig. 10 Cambricon's reduction in code length over other ISAs. On average, the code length of Cambricon is about 6.41x, 9.86x, and 13.38x shorter than GPU, x86, and MIPS, respectively. The observations are not surprising, because Cambricon aggregates many scalar operations into vector instructions, and further aggregates vector operations into matrix instructions, which significantly reduces the code length.
Specifically, on MLP, Cambricon can improve the code density by 13.62x, 22.62x, and 32.92x against GPU, x86, and MIPS, respectively. The main reason is that there are very few scalar instructions in the Cambricon code of MLP. However, on CNN, Cambricon achieves only 1.09x, 5.90x, and 8.27x reduction of code length against GPU, x86, and MIPS, respectively. This is because the main body of CNN is a deeply nested loop requiring many individual scalar operations to manipulate the loop variables. Hence, aggregating scalar operations into vector operations yields only a small gain in code density.
Moreover, we collect the percentage breakdown of Cambricon instruction types in the 10 benchmarks. On average, 38.0% of instructions are data transfer instructions, 4.8% are control instructions, 12.6% are matrix instructions, 33.8% are vector instructions, and 10.9% are scalar instructions. This observation clearly shows that vector/matrix instructions play a critical role in NN techniques, thus efficient implementations of these instructions are essential to the performance of a Cambricon-based accelerator.
3) Performance: We compare Cambricon-ACC against x86-CPU and GPU on all 10 benchmarks listed in Table III. Fig. 12 illustrates the speedup of Cambricon-ACC against x86-CPU, GPU, and DaDianNao. On average, Cambricon-ACC is about 91.72x and 3.09x faster than the x86-CPU and the GPU, respectively. This is not surprising because Cambricon-ACC integrates dedicated functional units and scratchpad memory optimized for NN techniques.
On the other hand, due to the incomplete and restricted ISA, DaDianNao can only accommodate 3 out of 10 benchmarks (i.e., MLP, CNN and RBM), thus its flexibility is significantly worse than that of Cambricon-ACC. In the meantime, the better flexibility of Cambricon-ACC does not lead to significant performance loss. We compare Cambricon-ACC against DaDianNao on the three benchmarks that DaDianNao can support, and observe that Cambricon-ACC is only 4.5% slower than DaDianNao on average. The reason for this small performance loss is that Cambricon decomposes complex high-level functional instructions of DaDianNao (e.g., an instruction for a convolutional layer) into shorter and low-level computational instructions (e.g., MMV and dot product), which may bring in additional pipeline bubbles between instructions. With the high code density provided by Cambricon, however, the amount of additional bubbles is moderate, and the corresponding performance loss is therefore negligible.
4) Energy Consumption:
We also compare the energy consumptions of Cambricon-ACC, GPU and DaDianNao, which can be estimated as the products of power consumptions (in watts) and execution times (in seconds). The power consumption of the GPU is reported by NVPROF, and the power consumptions of DaDianNao and Cambricon-ACC are estimated with Synopsys PrimeTime PX according to the simulated Value Change Dump (VCD) file. We do not include an energy comparison against the CPU baseline, because we lack hardware support for estimating the actual power of the CPU. Yet, it has recently been reported that an SIMD-CPU is an order of magnitude less energy-efficient than a GPU (NVIDIA K20M) on neural network applications [4], which well complements our experiments.
As shown in Fig. 13, the energy consumptions of GPU and DaDianNao are 130.53x and 0.916x that of Cambricon-ACC, respectively, where the energy of DaDianNao is averaged over 3 benchmarks because it can only accommodate 3 out of 10 benchmarks. Compared with Cambricon-ACC, the power consumption of the GPU is much higher, as the GPU spends excessive hardware resources to flexibly support various workloads. On the other hand, the energy consumption of Cambricon-ACC is only slightly higher than that of DaDianNao, because both accelerators integrate the same sizes of functional units and on-chip storage, and work at the same frequency. The additional energy consumed by Cambricon-ACC mainly comes from the instruction pipeline logic, the memory queue, and the vector transcendental functional unit. In contrast, DaDianNao uses a low-precision but lightweight lookup table instead of transcendental functional units.
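As a rough reading of the figures above, the relative energy overhead of Cambricon-ACC over DaDianNao can be worked out from the reported ratio. This assumes the averaged ratio of 0.916 can be inverted as if it were a single measurement, which is only an approximation because the reported value is an average over three benchmarks. Here $E$, $P$, and $t$ denote energy, power consumption, and execution time:

$$E = P \cdot t, \qquad \frac{E_{\text{Cambricon-ACC}}}{E_{\text{DaDianNao}}} \approx \frac{1}{0.916} \approx 1.092$$

Under that assumption, Cambricon-ACC consumes roughly 9% more energy than DaDianNao on the three shared benchmarks.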
5) Chip Layout:
We show the layout of Cambricon-ACC in Fig. 14, and list the area and power breakdowns in Table IV. The overall area of Cambricon-ACC is 56.24 mm², which is about 1.6% larger than that of DaDianNao (55.34 mm², re-implemented version). The combinational logic (mainly vector and matrix functional units) consumes 32.15% of the area of Cambricon-ACC, and the on-chip memory (mainly vector and matrix scratchpad memories) consumes about 15.05% of the area.
The matrix part (including the matrix functional unit and the matrix scratchpad memory) accounts for 62.69% of the area of Cambricon-ACC, while the core part (including the instruction pipeline logic, scalar functional unit, memory queue, and so on) and the vector part (including the vector functional unit and the vector scratchpad memory) together account for only 9.00% of the area. The remaining 28.31% of the area is consumed by the channel part, including wires connecting the core & vector part to the matrix part, and wires connecting the different blocks of the matrix part to each other.
We also estimate the power consumption of the prototype design with Synopsys PrimePower. The peak power consumption is 1.695 W (under 100% toggle rate), which is only about one percent of that of the K40M GPU. More specifically, the core & vector part and the matrix part consume 8.20% and 59.26% of the power, respectively. Moreover, data movements in the channel part consume 32.54% of the power, which is several times higher than the power of the core & vector part. It can be expected that the power consumption of the channel part would be much higher if we did not divide the matrix part into multiple blocks.
Figure 14. The layout of Cambricon-ACC, implemented in TSMC 65nm technology.
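To make the power breakdown concrete, the sketch below converts the reported percentages into approximate absolute figures. It assumes the percentages apply directly to the 1.695 W peak power figure; the variable names are illustrative and not taken from the paper.

```python
# Reported peak power of the prototype (watts).
PEAK_POWER_W = 1.695

# Reported power shares (percent) of the three parts of the design.
share_percent = {
    "core & vector": 8.20,
    "matrix": 59.26,
    "channel": 32.54,
}

# Assumption: each share is a fraction of the peak power figure.
power_w = {part: PEAK_POWER_W * pct / 100.0 for part, pct in share_percent.items()}

# Ratio behind the "several times higher" remark about the channel part.
channel_vs_core = share_percent["channel"] / share_percent["core & vector"]

for part, watts in power_w.items():
    print(f"{part}: {watts:.3f} W")
print(f"channel / core & vector: {channel_vs_core:.2f}x")
```

With these inputs, the channel part draws close to four times the power of the core & vector part, which is consistent with the "several times higher" statement.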
VI. POTENTIAL EXTENSION TO BROADER TECHNIQUES
Although Cambricon is designed for existing neural network techniques, it can also support future neural network techniques or even some classic statistical techniques, as long as they can be decomposed into scalar/vector/matrix instructions in Cambricon. Here we take logistic regression [21] as an example, and illustrate how it can be supported by Cambricon. Technically, logistic regression contains two phases: a training phase and a prediction phase. The training phase employs a gradient descent algorithm similar to that used in the training phase of the MLP technique, which can be supported by Cambricon. In the prediction phase, the output can be computed as $y = \frac{1}{1 + e^{-\theta^T x}}$ (where $x = (x_0, x_1, \ldots, x_n)^T$ is the input vector, $x_0$ always equals 1, and $\theta = (\theta_0, \theta_1, \ldots, \theta_n)^T$ is the vector of model parameters). We can leverage the dot product instruction, scalar elementary arithmetic instructions, and the scalar exponential instruction of Cambricon to perform the prediction phase of logistic regression. Moreover, given a batch of n different input vectors, the MMV instruction, vector elementary arithmetic instructions, and the vector exponential instruction in Cambricon collaboratively allow the prediction phases of the n inputs to be computed in parallel.
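The decomposition described above can be sketched in plain Python, with each helper standing in for one class of operation (dot product, matrix-mult-vector, element-wise arithmetic, exponential). This is an illustrative sketch, not Cambricon code: the function names and the sample values of `theta` and `X` are hypothetical, and the code runs sequentially, whereas an accelerator would execute the vector and matrix steps in parallel.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors (stands in for a dot product instruction)."""
    return sum(ai * bi for ai, bi in zip(a, b))

def predict_one(theta, x):
    """Single prediction: one dot product, then scalar negate, exp, add, divide."""
    z = dot(theta, x)
    return 1.0 / (1.0 + math.exp(-z))

def matrix_mult_vector(X, theta):
    """Matrix times vector, one dot product per row (stands in for MMV)."""
    return [dot(row, theta) for row in X]

def predict_batch(theta, X):
    """Batch prediction: MMV, then element-wise vector operations."""
    z = matrix_mult_vector(X, theta)          # one row of X per input vector
    neg_z = [-zi for zi in z]                 # vector elementary arithmetic
    exp_neg_z = [math.exp(v) for v in neg_z]  # vector exponential
    return [1.0 / (1.0 + e) for e in exp_neg_z]

# Hypothetical parameters and inputs; the first entry of each input is x_0 = 1.
theta = [0.0, 1.0, -1.0]
X = [
    [1.0, 2.0, 2.0],
    [1.0, 3.0, 1.0],
]

single = [predict_one(theta, x) for x in X]
batch = predict_batch(theta, X)
print(single)
print(batch)
```

Both paths compute the same values; the batch version simply replaces n separate dot products with one matrix-mult-vector step followed by element-wise vector operations.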
VII. RELATED WORK
In this section, we summarize prior work on NN techniques and NN accelerator designs.
Neural Networks. Existing NN techniques have exhibited significant diversity in their network topologies and learning algorithms. For example, Deep Belief Networks (DBNs) [41] consist of a sequence of layers, each of which is fully connected to its adjacent layers. In contrast, Convolutional Neural Networks (CNNs) [25] use convolutional/pooling windows to specify connections between neurons, thus the connection density is much lower than in DBNs. Interestingly, the connection densities of DBNs and CNNs are both lower than that of Boltzmann Machines (BMs) [39], which fully connect all neurons with each other. Learning algorithms for different NNs may also differ from each other, as exemplified by the remarkable discrepancy among the backpropagation algorithm for training Multi-Layer Perceptrons (MLPs) [50], the Gibbs sampling algorithm for training Restricted Boltzmann Machines (RBMs) [39], and the unsupervised learning algorithm for training Self-Organizing Maps (SOMs) [34].
In a nutshell, while adopting high-level, complex, and informative instructions could be a feasible choice for accelerators supporting a small set of similar NN techniques, the significant diversity and the large number of existing NN techniques make it infeasible to build a single accelerator that uses a considerable number of high-level instructions to cover a broad range of NNs. Moreover, without a certain degree of generality, even an existing successful accelerator design may easily become inapplicable simply because of the evolution of NN techniques.
NN Accelerators. NN techniques are computationally intensive, and are traditionally executed on general-purpose platforms composed of CPUs and GPGPUs, which are usually not energy-efficient for NN techniques [3], because they invest excessive hardware resources to flexibly support various workloads. Over the past decade, there have been many hardware accelerators customized to NNs, implemented on FPGAs [13], [38], [40], [42] or as ASICs [3], [12], [14], [44]. Farabet et al. proposed an accelerator named Neuflow with a systolic architecture [12], for the feed-forward paths of CNNs. Maashri et al. implemented another NN accelerator, which arranges several customized accelerators around a switch fabric [30]. Esmaeilzadeh et al. proposed a SIMD-like architecture (NnSP) for Multi-Layer Perceptrons (MLPs) [10]. Chakradhar et al. mapped the CNN to reconfigurable circuits [1]. Chi et al. proposed PRIME [6], a novel process-in-memory architecture that implements a reconfigurable NN accelerator in ReRAM-based main memory. Hashmi et al. proposed the Aivo framework to characterize their specific cortical network model and learning algorithms, which can generate execution code of their network model for general-purpose CPUs and GPUs rather than hardware accelerators [16]. Each of the above designs was customized for one specific NN technique (e.g., MLP or CNN), so their application scopes are limited. Chen et al. proposed a small-footprint NN accelerator called DianNao, whose instructions directly correspond to different layer types in CNN [3]. DaDianNao adopts a similar instruction set, but achieves even higher performance and energy-efficiency by keeping all network parameters on-chip, which is an innovation in accelerator architecture rather than in the ISA [5]. Therefore, the application scope of DaDianNao is still limited by its ISA, which is similar to the case of DianNao. Liu et al. designed the PuDianNao accelerator that accommodates seven classic machine learning techniques, whose control module only provides seven different opcodes (each corresponding to a specific machine learning technique) [29]. Therefore, PuDianNao only allows minor changes to the seven machine learning techniques. In summary, the lack of agility in instruction sets prevents previous accelerators from flexibly and efficiently supporting a variety of different NN techniques.
Comparison. Compared to prior work, we decompose traditional high-level and complex instructions describing high-level functional blocks of NNs (e.g., layers) into shorter instructions corresponding to low-level computational operations (e.g., scalar/vector/matrix operations), which allows a hardware accelerator to have a broader application scope. Furthermore, simple and short instructions may reduce the design and verification complexity of the accelerators.
VIII. CONCLUSION AND FUTURE WORK
In this paper, we propose a novel ISA for neural networks called Cambricon, which allows NN accelerators to flexibly support a broad range of different NN techniques. We compare Cambricon with x86 and MIPS across ten diverse yet representative NNs, and observe that the code density of Cambricon is significantly higher than that of x86 and MIPS. We implement a Cambricon-based prototype accelerator in TSMC 65nm technology; its area is 56.24 mm² and its power consumption is only 1.695 W. Thanks to Cambricon, this prototype accelerator can accommodate all ten benchmark NNs, while the state-of-the-art NN accelerator, DaDianNao, can only support 3 of them. Even when executing these 3 benchmark NNs, our prototype accelerator still achieves performance and energy-efficiency comparable to the state-of-the-art accelerator, with negligible overheads. Our future work includes the final chip tape-out of the prototype accelerator, an attempt to integrate Cambricon into a general-purpose processor, as well as an in-depth study that extends Cambricon to support broader applications.
