An Energy-Efficient Mixed-Signal Parallel Multiply-Accumulate (MAC)
  Engine Based on Stochastic Computing by Zhang, Xinyue et al.
 
Xinyue Zhang, Jiahao Song, Yuan Wang*, Yawen Zhang, Zuodong Zhang, 
Runsheng Wang* and Ru Huang 
 Institute of Microelectronics and Key Laboratory of Microelectronics Devices and Circuits (MoE) 
Peking University, Beijing 100871, P.R. China 
* Email: wangyuan@pku.edu.cn, r.wang@pku.edu.cn 
 
Abstract 
  Convolutional neural networks (CNNs) have achieved excellent performance on various tasks, but deploying CNNs at the 
edge is constrained by the high energy consumption of the convolution operation. Stochastic computing (SC) is an attractive 
paradigm that performs arithmetic operations with simple logic gates at low hardware cost. This paper presents 
an energy-efficient mixed-signal multiply-accumulate (MAC) engine based on SC. A parallel architecture is adopted 
to solve the latency problem of SC. Simulation results show that the overall energy consumption of our 
design is 5.03 pJ per 26-input MAC operation in 28 nm CMOS technology. 
 
1. Introduction  
  Methods based on convolutional neural networks (CNNs) achieve excellent performance on tasks such as image recognition 
and natural language processing, but at the cost of high computational complexity and power consumption [1]. Because 
edge applications impose strict limits on hardware resources and power consumption, a new computing paradigm is 
urgently needed. 
  Stochastic computing (SC) is a promising candidate that simplifies the implementation of arithmetic operations 
and reduces power consumption [2]. Mathematical operations on stochastic numbers can be realized with simple logic 
gates. For instance, multiplication can be performed with a single AND gate (Fig. 1(a)). SC has therefore been applied 
successfully to computation-intensive applications such as digital signal processing [3], artificial neural networks [4] 
and decoding of modern error-correcting codes [5]. Convolution is an essential but energy-intensive operation in CNNs 
that demands a large number of multiplications and additions. Since SC reduces arithmetic operations to simple logic 
gates, an SC-based CNN is highly attractive. 
  On the other hand, the high latency caused by long bit streams is the bottleneck of SC in real-time applications. 
It can be addressed with a parallel structure (Fig. 1(b)). Therefore, a deterministic coding method of SC is applied in this 
work to realize the parallel structure and to remove the inherent randomness. However, increasing the degree of 
parallelism leads to high power consumption when a high-fan-in digital adder tree is used [6]. To address this problem, 
we propose a mixed-signal multiply-accumulate (MAC) engine to reduce power consumption. 

Fig. 1. Stochastic computing. (a) Multiplication of serial structure; (b) Multiplication of parallel structure; (c) Deterministic code 

  The rest of the paper is organized as follows: Section 2 introduces the background and motivation of this work; Section 
3 describes the structure of the proposed MAC engine; Section 4 shows the simulation results; finally, conclusions are 
drawn in Section 5. 
 
2. Background and Motivation 
2.1 Background 
[Fig. 1 content. (a) Serial AND-gate multiplication: A = 01101010 (4/8), B = 10111011 (6/8), Y = 00101010 (3/8). 
(b) Parallel multiplication: bit pairs A0/B0 to A7/B7 drive eight AND gates that produce Y0 to Y7. 
(c) Deterministic code: N1 = 1100 (bits A0 to A3) repeated three times, N2 = 100 (bits B0 to B2) repeated four times.] 
  SC processes data in the form of bit streams, where the probability of observing a 1 in the stream is taken as the 
value of the stochastic number. For example, the bit stream A=01101010 contains four 1s and four 0s, so the value 
of A is 4/8 (Fig. 1(a)). Multiplication in SC can be realized with a single AND gate. Similarly, (scaled) addition can be 
implemented with a single multiplexer (MUX). SC is therefore a promising way to reduce logic complexity in 
computation-intensive applications. 
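
As a minimal sketch of the AND-gate multiplication described above, the following Python snippet multiplies the two bit streams of Fig. 1(a). The helper names (stream_value, sc_multiply) are illustrative and do not come from the paper. The result is exact for this particular pair of streams; in general, AND-based multiplication relies on the two streams being uncorrelated.

```python
from fractions import Fraction

def stream_value(bits):
    """Value of a stochastic number: the fraction of 1s in the stream."""
    return Fraction(sum(bits), len(bits))

def sc_multiply(a, b):
    """Bit-wise AND of two equal-length streams (one AND gate per bit)."""
    if len(a) != len(b):
        raise ValueError("streams must have equal length")
    return [x & y for x, y in zip(a, b)]

# Bit streams taken from Fig. 1(a)
A = [int(c) for c in "01101010"]   # value 4/8
B = [int(c) for c in "10111011"]   # value 6/8
Y = sc_multiply(A, B)

print("".join(map(str, Y)), stream_value(Y))   # 00101010 3/8
```

In the serial structure the AND gate sees one bit pair per clock cycle; in the parallel structure of Fig. 1(b) the same list comprehension corresponds to eight AND gates working at once.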
  However, conversion between stochastic bit streams and binary numbers consumes considerable power and introduces 
latency [7]. In addition, the random nature of SC causes inherent computation errors. Deterministic coding [8] has 
therefore been proposed to produce completely accurate results. Fig. 1(c) shows two stochastic numbers N1 and N2, where 
the effective bit length of N1 is 4 and that of N2 is 3. Both numbers are extended to 12 bits, the least common multiple 
of 3 and 4. Every bit of N1 then meets every bit of N2 exactly once, so the calculation result is completely accurate. 
Moreover, a binary number can be converted into deterministic code in parallel with a simple decoder [9], which greatly 
improves energy efficiency and reduces computing latency. 
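
The deterministic coding step can be illustrated with a short Python sketch. It assumes, as in Fig. 1(c), that the two pattern lengths are relatively prime, so that repeating both patterns up to the least common multiple pairs every bit of one with every bit of the other exactly once. The function names are hypothetical.

```python
from fractions import Fraction
from math import gcd

def extend(bits, total_len):
    """Repeat a bit pattern until it fills total_len bits."""
    reps, rem = divmod(total_len, len(bits))
    if rem:
        raise ValueError("total_len must be a multiple of the pattern length")
    return bits * reps

def deterministic_multiply(n1, n2):
    """Exact product of two patterns whose lengths are relatively prime."""
    if gcd(len(n1), len(n2)) != 1:
        raise ValueError("pattern lengths must be relatively prime")
    total = len(n1) * len(n2)          # equals the least common multiple
    a, b = extend(n1, total), extend(n2, total)
    ones = sum(x & y for x, y in zip(a, b))
    return Fraction(ones, total)

N1 = [1, 1, 0, 0]   # value 2/4, pattern from Fig. 1(c)
N2 = [1, 0, 0]      # value 1/3, pattern from Fig. 1(c)
print(deterministic_multiply(N1, N2))   # 1/6
```

The same construction with pattern lengths 11 and 4 gives the 44-bit streams used later in the paper.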
 
2.2 Motivation 
  MAC is an essential but energy-intensive part of a CNN, while SC can reduce complex computation units to 
simple logic gates. An SC-based MAC engine is therefore a promising implementation. 
  However, a long bit stream increases the computation time, and the bit stream length grows exponentially 
with the precision (in bits) of the stochastic number. As shown in Fig. 1(b), this problem can be solved by increasing 
the degree of parallelism [2]. As the degree of parallelism increases, high-fan-in addition becomes the main bottleneck 
when summing the outputs of all AND gates in SC-based convolutional networks. In this work, an analog addition 
structure is therefore proposed to increase energy efficiency. 
 
3. Circuit Description 
  Fig. 2 shows the operation diagram of the convolution calculation with the proposed mixed-signal MAC engine, which is 
composed of three parts: binary-to-stochastic decoders, the proposed MAC engine, and an ADC. First, the decoders 
convert the binary input codes to stochastic codes. The input numbers of the feature maps are converted to 11-bit 
stochastic numbers and the weights to 4-bit stochastic numbers (each with an extra sign bit). The input precision is 
selected according to the network to which our MAC engine is applied. Following the deterministic coding described in 
Section 2.1, both are extended to 4×11=44 bits. Then, the multiplications of the stochastic numbers are performed by the 
AND gate array. After that, the products are converted into an analog MAC result by the analog accumulate engine (ACE). 
Finally, the analog output is fed to a flash ADC and converted to binary numbers in preparation for the next cycle's 
calculation. 
Fig. 2. Operation diagram of convolution calculation with proposed mixed-signal MAC engine. 

  The ACE is composed of three parts: a stochastic-to-analog converter (SAC), a voltage-to-time converter (VTC) and an 
integrating circuit (INT). The SAC calculates the convolution sum of one input feature map and one kernel, and the VTC 
converts the sum into a pulse duration so that sums from different feature maps can be accumulated by the INT. In this 
work, we take a 5×5 kernel (with a bias) and 6 feature maps as an example to explain the working process of our design. 
According to the signs of the input numbers Ai and Bi (i = 0, 1, …, 25), the SAC converts the sum of the positive numbers 
and the sum of the negative numbers to the analog voltages VSACP and VSACN separately. VTCP and VTCN then convert 
VSACP and VSACN into pulse signals during the two conversion stages, with pulse widths proportional to VSACP and VSACN 
respectively. To calculate the difference between VSACP and VSACN, we employ a pulse processing (PP) circuit 
with a reference pulse input VPUL. The pulse width of VPUL represents an input of 0. The pulse width of VPULN is 
proportional to the absolute value of the negative input number, and the pulse width of VPULP represents the value of 
the positive input number. To calculate VSACP − VSACN, the PP circuit subtracts VPULN from VPUL and then adds VPULP 
to the result, which yields a signed sum. 
  Finally, VPP controls the charging time of the INT so that the output voltage is proportional to the pulse width of VPP. 
As the data of different feature maps enter the MAC engine one after another, each convolution sum is added 
to the output of the INT, which thereby performs the accumulation. The sequence diagram of the proposed mixed-signal MAC 
engine is shown in Fig. 3. 
  The charge redistribution (CR) technique has been applied to multiply-accumulate (MAC) circuits [10] [11] to improve 
energy efficiency. An SAC based on CR (Fig. 4) is therefore proposed in our MAC engine to perform the addition. 
  The SAC consists of a capacitor array switched by the outputs of the AND gate array, and all capacitors are of the same 
size. To distinguish positive inputs from negative ones, the operation of the SAC is divided into two stages: POS and NEG. 
Each stage consists of three phases: RESET, CHARGE and SHARE. In the RESET phase, RST is asserted, both the top and 
bottom plates of the capacitors are connected to GND, and all capacitors are discharged. In the CHARGE phase of the POS 
stage, the switches controlled by negative DINs are off, and the positive DINs turn the remaining switches on or off 
according to the value of Di[j]. ION is then asserted so that the top plates of all capacitors are connected to VDD, and 
the capacitors whose bottom plates are connected to GND are charged. In the SHARE phase, PUP is set high, all bottom 
plates are connected to GND, all top plates are connected to each other, and the charge is shared over all capacitors. 
The output voltage VSAC is then proportional to the sum of the positive DINs. The RESET and SHARE phases of the NEG 
stage are the same as those of the POS stage. The only difference is in the CHARGE phase, where the capacitors with 
negative DINs are charged so that VSAC is proportional to the sum of the absolute values of the negative DINs. After the 
two stages we thus obtain the sum of the positive numbers (VSACP) and the sum of the negative numbers (VSACN) separately. 
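
The two-stage operation can be mimicked by a purely behavioural Python sketch that counts how many unit capacitors are charged in the POS and NEG stages. It models counts only, not voltages, and rests on assumptions that are not stated in the text: magnitudes are encoded as thermometer patterns of length 11 and 4 repeated to 44 bits, and the sign of each product is the XOR of the two input sign bits. All names are hypothetical.

```python
BITS_A = 11                 # feature-map stream length (from the text)
BITS_B = 4                  # weight stream length (from the text)
TOTAL = BITS_A * BITS_B     # 44-bit extended streams

def encode(ones, length):
    """Assumed thermometer pattern with `ones` 1s, repeated up to 44 bits."""
    if not 0 <= ones <= length:
        raise ValueError("magnitude out of range")
    return ([1] * ones + [0] * (length - ones)) * (TOTAL // length)

def sac_two_stage(inputs):
    """inputs: iterable of (sign_a, ones_a, sign_b, ones_b), sign 1 = negative.

    Returns the number of capacitors charged in the POS and NEG stages.
    """
    pos = neg = 0
    for sign_a, ones_a, sign_b, ones_b in inputs:
        d = [x & y for x, y in zip(encode(ones_a, BITS_A), encode(ones_b, BITS_B))]
        if sign_a ^ sign_b:      # assumed: product sign is the XOR of the sign bits
            neg += sum(d)
        else:
            pos += sum(d)
    return pos, neg

example = [(0, 5, 0, 2), (1, 3, 0, 4), (0, 11, 1, 1)]
pos, neg = sac_two_stage(example)
print(pos, neg, pos - neg)   # 10 23 -13
```

Because 11 and 4 are relatively prime, each product contributes exactly ones_a × ones_b charged capacitors, so pos − neg equals the signed dot product of the magnitudes.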
Fig. 3. Sequence diagram of mixed-signal MAC engine. 
 
Fig. 4. Circuit of stochastic-to-analog converter (SAC). 
Fig. 5. (a) Circuit of voltage-to-time converter (VTC); (b) Circuit of pulse processing (PP) module; (c) Integrating circuit (INT). 

  The output of the SAC needs to be accumulated by the integrating circuit described later in this section, so we employ 
a VTC to convert the analog voltage into pulse signals. There are various VTC architectures with different characteristics. 
The design based on a basic current-starved inverter [12] shows low power consumption but poor linearity. Another kind 
of VTC, which uses a comparator and a slope generator [13], offers high precision but is more complicated and 
limited to low operating frequencies. The design of [14], which makes a tradeoff between energy efficiency and linearity, 
is therefore adopted in our design, as shown in Fig. 5(a) and (b). The VTC works in two phases. During the sampling 
phase, EN=1, ENB=0, I1=0 and I0=ICNS (ICNS is a constant current), and the capacitor is charged to VIN. During the 
discharging phase, EN=0, ENB=1, I0=0 and I1=ICNS; the discharging current I1 is equal to I0 when the frequency of EN is 
high enough. The constant current makes the discharging time well controlled, so that the output pulse width is 
proportional to the input voltage VIN. 
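
Under an idealised reading of the two phases (an assumption for illustration: the sampling capacitor has value C, a symbol not used in the text, and the pulse ends when the capacitor voltage V_C, discharged by the constant current I_CNS, falls to a fixed threshold V_TH), the relation between pulse width and input voltage can be sketched as:

```latex
\begin{aligned}
\frac{dV_C}{dt} &= -\frac{I_{CNS}}{C}, \qquad V_C(0) = V_{IN}, \\
V_C(t_{pulse}) &= V_{TH}
\;\Longrightarrow\;
t_{pulse} = \frac{C\,\left(V_{IN} - V_{TH}\right)}{I_{CNS}} .
\end{aligned}
```

In this simplified picture the pulse width is linear in V_IN, and strictly proportional to it when V_TH is negligible. Circuit non-idealities are not captured.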
  To calculate the difference between VSACP and VSACN, we employ a PP circuit with a reference pulse input VPUL. The PP 
circuit subtracts VPULN from VPUL with an XNOR gate and then adds VPULP to the result with an OR 
gate. The output of the PP circuit is then fed to the INT. 
  Fig. 5(c) shows the integrating circuit used in this work. VPP is the output of the VTC part, a pulse signal whose 
width is proportional to VSAC. The capacitor is charged while VPP is valid. When VPP goes low, the bipolar transistor is 
cut off and charging stops. The convolution sums of the 6 feature maps can be added together sequentially by this 
circuit. Compared with a conventional amplifier-based integrator, this structure consumes less power while maintaining 
acceptable precision. 
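
The pulse arithmetic of the PP circuit and the accumulation in the INT can be summarised by a small behavioural Python sketch. It works on pulse widths only and does not model the gates. The numeric values, the function names and the check that the negative pulse is no longer than the reference pulse are illustrative assumptions.

```python
def pp_width(w_pul, w_pulp, w_puln):
    """Signed pulse arithmetic: (reference - negative) + positive."""
    if w_puln > w_pul:
        # Assumed constraint: the reference pulse must cover the negative pulse.
        raise ValueError("negative pulse longer than the reference pulse")
    return (w_pul - w_puln) + w_pulp

def integrate(widths, volts_per_unit_width):
    """Accumulate one charging pulse per feature map on the INT capacitor."""
    v_int = 0
    trace = []
    for w in widths:
        v_int += volts_per_unit_width * w   # constant current gives a linear ramp
        trace.append(v_int)
    return trace

W_PUL = 100                                   # hypothetical reference width
maps = [(30, 10), (0, 40), (25, 25), (60, 0), (5, 15), (10, 10)]   # (W_PULP, W_PULN)
widths = [pp_width(W_PUL, p, n) for p, n in maps]
print(widths)
print(integrate(widths, 1))
```

With this reading, each feature map contributes the reference width plus its signed sum, so the final output carries a fixed offset of six reference pulses that later stages would have to account for.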
 
4. Simulation Results 
  The proposed mixed-signal MAC engine is designed in 28 nm CMOS technology and operates at a 1 V supply voltage 
with a 25 MHz clock. The overall power consumption is 20.12 µW.  
  To guarantee the linearity of the VTC, the output voltage of the SAC should be above 0.35 V. Fig. 6(a) shows the 
simulation result of the SAC, namely the transfer characteristic between output voltage and input number; the 
output range is from 0.41 V to 1.0 V. Fig. 6(b) displays the transfer characteristic between the output pulse width and 
the input voltage of the VTC. Good linearity is obtained when the input voltage is between 0.35 V and 1.0 V, which covers 
the SAC output range and therefore meets the system requirement. The output curve of the integrating circuit is shown in 
Fig. 6(c); the charge current remains stable, so the output increases linearly with the input voltage. Fig. 6(d) shows 
the transfer characteristic and nonlinearity of the MAC engine; the absolute value of the maximum error is below 0.2 mV, 
which is much smaller than the quantization step. 

Fig. 6. Simulation result. (a) Transfer characteristic of SAC; (b) Transfer characteristic of VTC; (c) Transfer characteristic of INT; (d) Transfer 
characteristic of MAC engine and error in VINT versus input number 

  Overall, the MAC engine consumes 5.03 pJ per 26-input MAC operation. The energy efficiency of the proposed MAC 
engine achieves 10.14 TOPS/W. 
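
The reported figures can be cross-checked with a few lines of Python. The operation count is not stated in the text; the sketch assumes that one 26-input MAC is counted as 26 multiplications plus 25 additions (51 operations), which reproduces the reported efficiency. The implied duration per MAC operation is a derived quantity, not a value given in the paper.

```python
POWER_W = 20.12e-6            # overall power consumption (from the text)
ENERGY_PER_MAC_J = 5.03e-12   # energy per 26-input MAC operation (from the text)
N_INPUTS = 26

# Assumption: 26 multiplications + 25 additions per 26-input MAC operation
ops = N_INPUTS + (N_INPUTS - 1)

tops_per_watt = ops / ENERGY_PER_MAC_J / 1e12
implied_time_ns = ENERGY_PER_MAC_J / POWER_W * 1e9

print(round(tops_per_watt, 2))     # 10.14
print(round(implied_time_ns, 1))   # 250.0
```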
 
5. Summary 
  A 44-bit, 26-input, mixed-signal SC-based MAC engine has been proposed in this paper to reduce the power 
consumption of convolution calculation. A parallel SC structure is employed to reduce the latency. The 
high-fan-in addition issue of the parallel SC MAC engine is addressed by performing addition and accumulation in the 
analog domain. Moreover, convolution sums from different feature maps are accumulated by the INT, which reduces the 
number of accesses to external storage. The simulated energy per MAC operation is 5.03 pJ, which is beneficial to 
low-power applications. 

Acknowledgments 
  This work was supported by the National Natural Science Foundation of China (Grant Nos. 61834001 and 61421005) 
and the 111 Project (B18001).  

References  
[1] V. Sze, Y. Chen, T. Yang and J. S. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," in 
Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, Dec. 2017. 
[2] Y. Zhang et al., "Design guidelines of stochastic computing based on FinFET: A technology-circuit perspective," IEEE 
International Electron Devices Meeting (IEDM), 2017, pp. 6.6.1-6.6.4. 
[3] B. Yuan, Y. Wang and Z. Wang, "Area-efficient error-resilient discrete fourier transformation design using stochastic 
computing," International Great Lakes Symposium on VLSI (GLSVLSI), 2016, pp. 33-38. 
[4] K. Kim et al., "Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks," 
ACM/EDAC/IEEE Design Automation Conference (DAC), 2016, pp. 1-6. 
[5] X-R. Lee, C-L. Chen, H-C. Chang and C-Y. Lee, "A 7.92 Gb/s 437.2 mW Stochastic LDPC Decoder Chip for IEEE 
802.15.3c Applications," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 62, pp. 507-516, 2015. 
[6] D. Bankman, L. Yang, B. Moons, M. Verhelst and B. Murmann, "An always-on 3.8µJ/86% CIFAR-10 mixed-signal 
binary CNN processor with all memory on chip in 28nm CMOS," IEEE International Solid - State Circuits Conference 
(ISSCC), 2018, pp. 222-224. 
[7] X. Zhang et al., "Memory System Designed for Multiply-Accumulate (MAC) Engine Based on Stochastic 
Computing," IEEE International Conference on IC Design and Technology (ICICDT), 2019, pp. 161-164.  
[8] D. Jenson and M. Riedel, "A deterministic approach to stochastic computation," IEEE/ACM International Conference 
on Computer-Aided Design (ICCAD), 2016, pp. 1-8. 
[Fig. 6 plot residue. Axes: (a) output voltage (V) versus input number; (b) pulse width (ns) versus input voltage (V); 
(c) output voltage (mV) and charge current (mA) versus time (ns); (d) output voltage (mV) and error (mV) versus input number.] 
[9] Y. Zhang, R. Wang, X. Zhang et al., "A parallel bitstream generator for stochastic computing," Silicon Nanoelectronics 
Workshop (SNW), 2019. 
[10] D. Bankman and B. Murmann, "Passive charge redistribution digital-to-analogue multiplier," Electron. Lett., vol. 
51, no. 5, pp. 386-388, Mar. 2015.  
[11] D. Bankman and B. Murmann, "An 8-bit, 16 input, 3.2 pJ/op switched-capacitor dot product circuit in 28-nm FDSOI 
CMOS," IEEE Asian Solid-State Circuits Conference (A-SSCC), 2016, pp. 21-24. 
[12] A. Djemouai, M. Sawan and M. Slamani, "New CMOS integrated pulse width modulator for voltage conversion 
applications," IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2000, pp. 116-119 vol.1. 
[13] H. Pekau, A. Yousif and J. W. Haslett, "A CMOS integrated linear voltage-to-pulse-delay-time converter for time 
based analog-to-digital converters," IEEE International Symposium on Circuits and Systems (ISCAS), 2006, pp. 4 
pp.-2376. 
[14] P. Osheroff et al., "A highly linear 4GS/s uncalibrated voltage-to-time converter with wide input range," IEEE 
International Symposium on Circuits and Systems (ISCAS), 2016, pp. 89-92. 
