Mixed-Signal Charge-Domain Acceleration of Deep Neural networks through
  Interleaved Bit-Partitioned Arithmetic by Ghodrati, Soroush et al.
Mixed-Signal Charge-Domain Acceleration of Deep Neural
Networks through Interleaved Bit-Partitioned Arithmetic
Soroush Ghodrati Hardik Sharma† Sean Kinzer Amir Yazdanbakhsh‡∗
Kambiz Samadi♭ Nam Sung Kim¶ Doug Burger§ Hadi Esmaeilzadeh
Alternative Computing Technologies (ACT) Lab
University of California, San Diego
†Georgia Institute of Technology ‡Google Research ♭Qualcomm Technologies ¶Samsung Electronics §Microsoft
soghodra@eng.ucsd.edu hsharma@gatech.edu skinzer@eng.ucsd.edu ayazdan@google.com
ksamadi@qti.qualcomm.com nam.sung.kim@gmail.com dburger@microsoft.com hadi@eng.ucsd.edu
ABSTRACT
Low-power potential of mixed-signal design makes it an al-
luring option to accelerate Deep Neural Networks (DNNs).
However, mixed-signal circuitry suffers from limited range
for information encoding, susceptibility to noise, and Analog
to Digital (A/D) conversion overheads. This paper aims to
address these challenges by offering and leveraging the insight
that a vector dot-product (the basic operation in DNNs) can be
bit-partitioned into groups of spatially parallel low-bitwidth
operations, and interleaved across multiple elements of the vec-
tors. As such, the building blocks of our accelerator become a
group of wide, yet low-bitwidth multiply-accumulate units that
operate in the analog domain and share a single A/D converter.
The low-bitwidth operation tackles the encoding range limita-
tion and facilitates noise mitigation. Moreover, we utilize the
switched-capacitor design for our bit-level reformulation of
DNN operations. The proposed switched-capacitor circuitry
performs the group multiplications in the charge domain and
accumulates the results of the group in its capacitors over multiple
cycles. The capacitive accumulation, combined with wide
bit-partitioned operations, alleviates the need for A/D conversion
per operation. With such mathematical reformulation
and its switched-capacitor implementation, we define a 3D-
stacked microarchitecture, dubbed BIHIWE 1 —pronounced
Bee Hive—that leverages clustering and hierarchical design
to best utilize power-efficiency of the mixed-signal domain
and 3D stacking. For ten DNN benchmarks, BIHIWE delivers a
4.9× speedup over a leading purely digital 3D-stacked accelerator,
TETRIS, with less than 0.5% accuracy loss,
achieved by careful treatment of noise, computation error, and
various forms of variation. Compared to RTX 2080 TI with
tensor cores and Titan Xp GPUs, all with 8-bit execution,
BIHIWE offers 33.1× and 66.5× higher Performance-per-Watt,
respectively. BIHIWE also outperforms other leading digital
and analog accelerators in power efficiency. The results
suggest that BIHIWE is an effective initial step in a road that
combines mathematics, circuits, and architecture.
1. INTRODUCTION
Deep Neural Networks (DNNs) are revolutionizing a
wide range of services and applications such as language
translation [1], transportation [2], intelligent search [3],
e-commerce [4], and medical diagnosis [5]. These benefits are
predicated upon delivery on performance and energy efficiency
from hardware platforms. With the diminishing benefits from
general-purpose processors [6–9], there is an explosion of
digital accelerators for DNNs [10–31]. Mixed-signal acceleration
[32–42] is also gaining traction. Albeit low-power, mixed-signal
circuitry suffers from a limited range of information
encoding, is susceptible to noise, imposes Analog to Digital
(A/D) and Digital to Analog (D/A) conversion overheads, and
lacks fine-grained control mechanisms. Realizing the full potential
of mixed-signal technology requires a balanced design
that brings mathematics, architecture, and circuits together.
∗This work was done when the author was a PhD student at
Georgia Institute of Technology.
1BIHIWE: Bit-Partitioned and Interleaved Hierarchy of Wide
Acceleration through Electrical Charge
This paper sets out to explore this conjunction of areas by in-
specting the mathematical foundation of deep neural networks.
Across a wide range of models, the large majority of DNN op-
erations belong to convolution and fully-connected layers [23,
28, 32]. Consequently, based on Amdahl’s Law, our archi-
tecture executes these two types of layers in the mixed-signal
domain. Nevertheless, to maintain generality for the ever-
expanding roster of other layers required by modern DNNs,
the architecture handles the other layers digitally. Normally,
the convolution and fully-connected layers are broken down
into a series of vector dot-products, that generate a scalar and
comprise a set of Multiply-Accumulate (MACC) operations.
State-of-the-art digital [10–31] and mixed-signal [32, 33, 35–
40, 34, 43, 41, 42] accelerators use a large array of stand-alone
MACC units to perform the necessary computations. When
moving to the mixed-signal domain, this stand-alone arrange-
ment of MACC operations imposes significant overhead in the
form of A/D and D/A conversions for each operation. The root
cause is the high cost of converting the operands and outputs
of each MACC to and from the analog domain, respectively.
This paper aims to address the aforementioned list of
challenges by making the following three contributions.
(1) This work offers and leverages the insight that the
set of MACC operations within a vector dot-product
can be partitioned, rearranged, and interleaved at the
bit level without affecting the mathematical integrity of
the vector dot-product. Unlike prior work [33, 42, 44],
this work does not rely on changing the mathematics of the
computation to enable mixed-signal acceleration. Instead, it
only rearranges the bit-wise arithmetic calculations to utilize
lower bitwidth analog units for higher bitwidth operations.
The key insight is that a binary value can be expressed as
a sum of products, similar to a dot-product, which is also a
sum of multiplications ($a = \vec{X} \cdot \vec{W} = \sum_i x_i \times w_i$). A value $b$ can
be expressed as $b = \sum_i (2^i \times b_i)$, where the $b_i$ are the individual
bits, or as $b = \sum_i (2^{4i} \times bp_i)$, where the $bp_i$ are, for instance,
4-bit partitions. Our interleaved bit-partitioned arithmetic
effectively utilizes the distributive and associative properties
of multiplication and addition at the bit granularity.
arXiv:1906.11915v3 [cs.AR] 12 Jul 2019
[Figure 1 panels: (a) Bit-Partitioning Multiply-Accumulation; (b) Bit-Partitioned Vector Rearrangement; (c) Wide Bit-Partitioned Vector Dot-Product, $a = \vec{X} \cdot \vec{W} = \sum_{i=1}^{k} x_i w_i$.]
Figure 1: Wide, interleaved, and bit-partitioned mathematical formulation.
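As a minimal sketch of the partitioned representation of a value, the following Python fragment splits an unsigned integer into fixed-width bit partitions and rebuilds it as a power-of-two weighted sum. The function names and the default 8-bit/4-bit widths are illustrative assumptions, not part of BIHIWE.

```python
def bit_partitions(value, bitwidth=8, part_bits=4):
    """Split an unsigned integer into part_bits-wide partitions, least significant first."""
    if bitwidth % part_bits != 0:
        raise ValueError("bitwidth must be a multiple of part_bits")
    if not 0 <= value < (1 << bitwidth):
        raise ValueError("value does not fit in bitwidth")
    mask = (1 << part_bits) - 1
    return [(value >> shift) & mask for shift in range(0, bitwidth, part_bits)]


def recombine(parts, part_bits=4):
    """Rebuild b = sum_i 2^(part_bits * i) * bp_i from its partitions."""
    return sum(bp << (part_bits * i) for i, bp in enumerate(parts))


parts = bit_partitions(0b10110101)      # two 4-bit partitions of an 8-bit value
assert recombine(parts) == 0b10110101   # the weighted sum restores the value
```

Setting `part_bits=1` reduces this to the ordinary binary expansion, which is the first of the two forms given in the text.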
The proposed model, first, bit-partitions all elements of
the two vectors, and then distributes the MACC operations
of the dot-product over the bit partitions. Therefore, the lower
bitwidth MACC becomes the basic operator that is applied
to each bit-partition. Then, our mathematical formulation
exploits the associative property of the multiply and add to
group bit-partitions that are at the same significance position.
This significance-based rearrangement enables factoring out
the power-of-two multiplicand that signifies the position of
the bit-partitions. The factoring enables performing the wide
group-based low-bitwidth MACC operations simultaneously
as a spatially parallel operation in the analog domain, while
the group shares a single A/D convertor. The power-of-two
multiplicand will be applied later in the digital domain
to the accumulated result of the group operation. To this
end, we rearchitect vector dot-product as a series of wide
(across multiple elements of the two vectors), interleaved and
bit-partitioned arithmetic and re-aggregation. Therefore, our
reformulation significantly reduces the rate of costly A/D con-
version by rearranging the bit-level operations across the ele-
ments of the vector dot-product. Using low-bitwidth operands
for analog MACCs provides a larger headroom between the
value encoding levels in the analog domain. The headroom
tackles the limited range of encoding and offers higher robustness
to noise, an inherent non-ideality in the analog mode.
Additionally, using lower bitwidth operands reduces the
energy/area overhead imposed by A/D and D/A converters, which
scales roughly exponentially with the bitwidth of operands.
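To make the headroom argument concrete, the sketch below computes the size of the integer range that an ideal group of unsigned low-bitwidth multiplications can accumulate into a single analog quantity, and how many bits an ideal converter would need to give each value in that range its own code. This is a simplified counting model under assumed parameters (a group of 8 MACCs, unsigned operands, no noise); it is not the noise or A/D model used in the evaluation.

```python
def accumulated_levels(group_width, part_bits):
    """Size of the integer range [0, max sum] spanned by the sum of group_width
    products of unsigned part_bits-bit operands."""
    max_operand = (1 << part_bits) - 1
    return group_width * max_operand * max_operand + 1


def ideal_adc_bits(group_width, part_bits):
    """Bits needed to give every integer in that range its own digital code."""
    return (accumulated_levels(group_width, part_bits) - 1).bit_length()


for p in (2, 4, 8):
    levels = accumulated_levels(8, p)
    # Spacing between adjacent levels when the full range is normalized to 1.
    spacing = 1.0 / (levels - 1)
    print(p, levels, ideal_adc_bits(8, p), spacing)
```

Under these assumptions the spacing between adjacent levels shrinks quickly as the operand bitwidth grows, which is the headroom effect the paragraph describes.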
(2) At the circuit level, the accelerator is designed using
switched-capacitor circuitry that stores the partial results
as electric charge over time without conversion to the dig-
ital domain at each cycle. The low-bitwidth MACCs are
performed in charge domain with a set of charge-sharing ca-
pacitors. This design choice lowers the rate of A/D conversion
as it implements accumulation as a gradual storage of charge in
a set of parallel capacitors. These capacitors not only aggregate
the result of a group of low-bitwidth MACCs, but also enable
accumulating results over time. As such, the architecture en-
ables dividing the longer vectors into shorter sub-vectors that
are multiply-accumulated over time with a single group of low-
bitwidth MACCs. The results are accumulated over multiple
cycles in the group’s capacitors. Because the capacitors can
hold the charge from cycle to cycle, the A/D conversion is not
necessary in each cycle. This reduction in rate of A/D con-
version is in addition to the amortized cost of A/D convertors
across the bit-partitioned analog MACCs of the group.
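The two sources of A/D amortization, sharing one converter across a wide group and holding charge across cycles, can be illustrated with a back-of-the-envelope count. The sketch assumes hypothetical parameters (a 1024-element vector, 8 parallel MACCs per group, 32 cycles of charge accumulation) and ignores practical limits such as the range of the capacitors; it only counts conversions for one pair of bit-partitioned sub-vectors.

```python
def adc_conversions(vector_len, group_width=1, cycles_accumulated=1):
    """Conversions needed for one pair of bit-partitioned sub-vectors.

    group_width:        low-bitwidth MACCs sharing one A/D converter (spatial amortization)
    cycles_accumulated: cycles whose results are held as charge before converting
                        (temporal amortization)
    """
    if min(vector_len, group_width, cycles_accumulated) < 1:
        raise ValueError("all arguments must be positive")
    per_conversion = group_width * cycles_accumulated
    return -(-vector_len // per_conversion)   # integer ceiling division


per_macc = adc_conversions(1024)                # one conversion per MACC
shared = adc_conversions(1024, group_width=8)   # one converter per wide group
held = adc_conversions(1024, 8, 32)             # plus accumulation over cycles
print(per_macc, shared, held)
```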
(3) Based on these insights, we devise a clustered
3D-stacked microarchitecture, dubbed BIHIWE, that
provides the capability to integrate a copious number
of low-bitwidth switched-capacitor MACC units that
enable the interleaved bit-partitioned arithmetic. The
lower energy of mixed-signal computations offers the possi-
bility of integrating a larger number of these units compared
to their digital counterpart. To efficiently utilize the more
sizable number of compute units, a higher bandwidth memory
subsystem is needed. Moreover, one of the large sources of
energy consumption in DNN acceleration is off-chip DRAM
accesses [30, 28, 23]. Based on these insights, we devise a
clustered architecture for BIHIWE that leverages 3D-stacking
for its higher bandwidth and lower data transfer energy.
Evaluating the carefully balanced design of BIHIWE
with ten DNN benchmarks shows that BIHIWE delivers a
4.9× speedup over the leading purely digital 3D-stacked DNN
accelerator, TETRIS [12], with only 0.5% loss in accuracy
achieved after mitigating noise, computation error, and
Process-Voltage-Temperature (PVT) variations. With 8-bit execution,
BIHIWE offers 33.1× and 66.5× higher Performance-per-Watt
compared to RTX 2080 TI and Titan Xp, respectively.
With these benefits, this paper marks an initial effort that
paves the way for a new shift in DNN acceleration.
2. WIDE, INTERLEAVED, AND BIT-
PARTITIONED ARITHMETIC
A key idea of this work is the mathematical insight that
enables utilizing low bitwidth mixed-signal units in spatially
parallel groups. This section demonstrates this insight.
Bit-Level partitioning and interleaving of MACCs. To
further detail the proposed mathematical reformulation,
Figure 1(a) delves into the bit-level operations of a dot-product
on 2-element vectors containing 4-bit values. As
illustrated with different colors, each 4-bit element can be
written in the form of sum of 2-bit partitions multiplied by
powers of 2 (shift). As discussed, vector dot-product is
also a sum of multiplications. Therefore, by utilizing the
distributive property of addition and multiplication, we can
rewrite the vector dot-product in terms of the bit partitions.
However, we also leverage the associativity of the addition and
multiplication to group the bit-partitions in the same positions
together. For instance, in Figure 1, the black partitions that
represent the Most Significant Bits (MSBs) of the ~W vector
are multiplied in parallel to the teal2 partitions, representing
the MSBs of the ~X . Because of the distributivity of multipli-
cation, the shift amount of (2+2) can be postponed after the
bit-partitions are multiply-accumulated. The different colors
of the boxes in Figure 1 illustrate the interleaved grouping
of the bit-partitions. Each group is a set of spatially parallel
bit-partitioned MACC operations that are drawn from different
elements of the two vectors. The low-bitwidth nature of these
operations enables execution in the analog domain without the
need for A/D conversion for each individual bit-partitioned
operation. As such, our proposed reformulation amortizes the
cost of A/D conversion across the bit-partitions of different
elements of the vectors as elaborated below.
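For 4-bit elements and 2-bit partitions, the reformulation can be written out step by step. Here the superscripts $M$ and $L$ denote the most and least significant 2-bit partitions of an element, and the operands are assumed to be unsigned.

```latex
\begin{align*}
\vec{X} \cdot \vec{W}
  &= \sum_{i} x_i w_i
   = \sum_{i} \left(2^{2} x_i^{M} + 2^{0} x_i^{L}\right)
              \left(2^{2} w_i^{M} + 2^{0} w_i^{L}\right) \\
  &= \sum_{i} \left(2^{2+2} x_i^{M} w_i^{M} + 2^{2+0} x_i^{M} w_i^{L}
                  + 2^{0+2} x_i^{L} w_i^{M} + 2^{0+0} x_i^{L} w_i^{L}\right) \\
  &= 2^{4} \sum_{i} x_i^{M} w_i^{M} + 2^{2} \sum_{i} x_i^{M} w_i^{L}
   + 2^{2} \sum_{i} x_i^{L} w_i^{M} + 2^{0} \sum_{i} x_i^{L} w_i^{L}
\end{align*}
```

Each of the four sums contains only 2-bit by 2-bit products, and each power of two is applied once per group rather than once per product.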
Wide, interleaved, and bit-partitioned vector dot-product.
Figure 1(b) illustrates the proposed vector dot-product operation
with 4-bit elements that are bit-partitioned into 2-bit
sub-elements. For instance, as illustrated, the elements of vector
$X$, denoted as $x_i$, are first bit-partitioned into $x_i^L$ and $x_i^M$. The
former represents the two Least Significant Bits (LSBs) and the
latter represents the Most Significant Bits (MSBs). Similarly,
the elements of vector $W$ are also bit-partitioned into the $w_i^L$ and
$w_i^M$ sub-elements. Then, each vector (e.g., $W$) is rearranged
into two bit-partitioned sub-vectors, $W^{LSBs}$ and $W^{MSBs}$. In the
current implementation of the BIHIWE architecture, the size of the
bit-partition is fixed across the entire architecture. Therefore,
the rearrangement is just a rewiring of the bits to the compute
units, which imposes minimal overhead (less than 1%).
Figure 1 is merely an illustration and there is no need for extra
storage or movement of elements. As depicted with color coding,
after the rewiring, $W^{LSBs}$ represents all the least significant
bit-partitions from different elements of vector $W$, while the
MSBs are rewired in $W^{MSBs}$. The same rewiring is repeated
for the vector $X$. This rearrangement puts all the bit-partitions
from all the elements of the vectors with the same significance
in one group, denoted as $W^{LSBs}$, $W^{MSBs}$, $X^{LSBs}$, $X^{MSBs}$.
Therefore, when a pair of the groups (e.g., $X^{MSBs}$ and $W^{MSBs}$
in Figure 1(c)) are multiplied to generate the partial products,
(1) the shift amount (“4” in this case) is the same for all the
bit-partitions and (2) the shift can be done after partial products
from different sub-elements are accumulated together.
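The rearrangement and the postponed shifts can be checked numerically. The following sketch, written for unsigned operands with assumed helper names, forms the significance-based sub-vectors, computes one low-bitwidth dot-product per pair of groups, and applies each power-of-two shift once to the accumulated group result. It models only the arithmetic, not the analog circuitry, and the example values are illustrative.

```python
def sub_vectors(values, bitwidth, part_bits):
    """Return sub-vectors grouped by significance: result[j][i] is the j-th
    partition (j = 0 is the least significant) of element i."""
    if bitwidth % part_bits != 0:
        raise ValueError("bitwidth must be a multiple of part_bits")
    if any(not 0 <= v < (1 << bitwidth) for v in values):
        raise ValueError("values must be unsigned and fit in bitwidth")
    mask = (1 << part_bits) - 1
    return [[(v >> (j * part_bits)) & mask for v in values]
            for j in range(bitwidth // part_bits)]


def bit_partitioned_dot(xs, ws, bitwidth=4, part_bits=2):
    """Dot-product computed as wide low-bitwidth group sums plus postponed shifts."""
    if len(xs) != len(ws):
        raise ValueError("vectors must have the same length")
    total = 0
    for jx, x_group in enumerate(sub_vectors(xs, bitwidth, part_bits)):
        for jw, w_group in enumerate(sub_vectors(ws, bitwidth, part_bits)):
            # Wide MACC over one pair of groups: only part_bits-wide operands.
            group_sum = sum(a * b for a, b in zip(x_group, w_group))
            # The shared shift is applied once, after accumulation.
            total += group_sum << ((jx + jw) * part_bits)
    return total


xs, ws = [0b1011, 0b0101], [0b1100, 0b0110]   # illustrative 4-bit values
assert bit_partitioned_dot(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```

With 4-bit elements and 2-bit partitions the inner loop body runs four times, with shift amounts 0, 2, 2, and 4.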
As shown in Figure 1(c), the low-bitwidth elements are
multiplied together and accumulated in the analog domain.
Accumulation in the digital domain would require an adder
tree which is costly compared to the analog accumulation that
merely requires connectivity between the multiplier outputs.
It is only after several analog multiply-accumulations that the
results are converted back to digital for shift and aggregation
with partial products from the other groups. The size of the
vectors usually exceeds the number of parallel low-bitwidth
MACCs, in which case the results need to be accumulated
over multiple iterations. As will be discussed in the next
section, the accumulations are performed in two steps. The
first step accumulates the results in the analog domain through
charge accumulation in capacitors before A/D convertors (see
Figure 1(c)). In the second step, these converted accumula-
tions will be added up in the digital domain using a register.
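The two accumulation steps can be sketched behaviorally as follows. A Python integer stands in for the charge held on the accumulation capacitors, and `n_parallel` and `cycles_per_conversion` are hypothetical names for the number of parallel low-bitwidth MACCs and the number of iterations accumulated before each A/D conversion. Charge limits, noise, and quantization are deliberately ignored, so this is an idealized model rather than a description of the circuit.

```python
def two_step_accumulate(x_parts, w_parts, n_parallel, cycles_per_conversion, shift):
    """Accumulate one pair of bit-partitioned sub-vectors in two steps.

    Returns (digital result, number of A/D conversions)."""
    if len(x_parts) != len(w_parts):
        raise ValueError("sub-vectors must have the same length")
    register = 0      # step 2: digital accumulation
    charge = 0        # step 1: idealized stand-in for capacitor charge
    pending = 0       # iterations accumulated since the last conversion
    conversions = 0
    for start in range(0, len(x_parts), n_parallel):
        xs = x_parts[start:start + n_parallel]
        ws = w_parts[start:start + n_parallel]
        charge += sum(x * w for x, w in zip(xs, ws))   # one wide analog MACC
        pending += 1
        if pending == cycles_per_conversion:
            register += charge << shift                # convert, shift, add
            conversions += 1
            charge, pending = 0, 0
    if pending:                                        # convert any leftover charge
        register += charge << shift
        conversions += 1
    return register, conversions


x_msbs = [3, 1, 2, 0, 1, 3, 2, 2, 1]   # illustrative 2-bit partitions
w_msbs = [1, 2, 3, 3, 0, 1, 2, 1, 3]
result, conversions = two_step_accumulate(x_msbs, w_msbs, 2, 2, shift=4)
assert result == sum(x * w for x, w in zip(x_msbs, w_msbs)) << 4
```

In this example, five iterations of two parallel MACCs are covered by three conversions instead of five.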
2Color teal in Figure 1 is the darkest gray in black and white prints.
[Figure 2 panels: (a) MS-BPMacc; (b) Core; (c) Clustered Architecture. Recoverable labels: CACC, ADC, Shifter, MS-WAgg, Register, Input Buffer, Weight Buffer, CTRL, Vault, Logic Die, DRAM Dies, Core, Cluster, Analog, Digital, Storage, Activation Unit, Pooling Unit, Normalization Unit, Output Buffer.]
Figure 2: Hierarchically clustered architecture of BIHIWE.
For this pattern of computation, we effectively use
the distributive and associative properties of multiplication
and addition for the dot-product, but at bit granularity. This
rearrangement and spatially parallel (i.e., wide) bit-partitioned
computation is in contrast with temporally bit-serial
digital [17, 13, 31, 45] and analog [32] DNN accelerators.
The next section describes the architecture of the
mixed-signal accelerator that leverages our mathematical
reformulation. This architecture is essentially a collection of
instances of the structure depicted in Figure 1(c). This structure is the
Mixed-Signal Wide Aggregator (MS-WAGG), which spatially
aggregates the results from its four units as illustrated. Each
of these four units, which are also wide, is a Mixed-Signal
Bit-Partitioned MACC (MS-BPMACC). Note that the number of
MS-BPMACCs in an MS-WAGG is a function of the bitwidth
of the vector elements and the size of the bit partitions.
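As a hedged illustration of this rearrangement, the identity below writes it out for unsigned elements. The symbols are introduced here for exposition only and are not taken from the paper's own equations: B is the element bitwidth, b is the partition size, and x_i^{(j)} denotes the j-th b-bit partition of element x_i (least significant first).

```latex
\vec{X}\cdot\vec{W}
  = \sum_{i} x_i\, w_i
  = \sum_{i} \Big(\sum_{j=0}^{B/b-1} 2^{bj}\, x_i^{(j)}\Big)
             \Big(\sum_{k=0}^{B/b-1} 2^{bk}\, w_i^{(k)}\Big)
  = \sum_{j=0}^{B/b-1} \sum_{k=0}^{B/b-1} 2^{b(j+k)}
      \underbrace{\sum_{i} x_i^{(j)}\, w_i^{(k)}}_{\text{one wide, low-bitwidth dot-product}}
```

Under these assumptions, each inner sum over i corresponds to the work of one MS-BPMACC, and an MS-WAGG needs (B/b)^2 of them when both operands share the same bitwidth.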
3. MIXED-SIGNAL ARCHITECTURE DESIGN FOR WIDE BIT-PARTITIONING
To exploit the aforementioned arithmetic, BIHIWE comes
with a mixed-signal building block that performs wide
bit-partitioned vector dot-product. BIHIWE then organizes these
building blocks in a clustered hierarchical design to efficiently
make use of its copious number of parallel low-bitwidth
mixed-signal MACC units. The clustered design is crucial
as the mixed-signal paradigm enables integrating a larger number
of parallel operators than its digital counterpart.
3.1 Wide Bit-Partitioned Mixed-Signal MACC
As Figure 2(a) shows, the building block of BIHIWE is a col-
lection of low-bitwidth analog MACCs that operate in parallel
on sub-elements from the two vectors under dot-product. This
wide structure is dubbed MS-BPMACC. We design the
low-bitwidth MACCs using switched-capacitor circuitry because
this design choice lowers the rate of A/D conversion:
it implements accumulation as a gradual storage of
charge in a set of parallel capacitors. These capacitors not only
aggregate the results of low-bitwidth MACCs, but also enable
accumulating results over time. As such, longer vectors are
divided into shorter sub-vectors that are multiply-accumulated
over time without the need to convert the intermediate results
back to the digital domain. It is only after processing multiple
sub-vectors that the accumulated result is converted to digi-
tal, significantly reducing the rate of costly A/D conversions.
As shown in Figure 2(a), each low-bitwidth MACC unit is
equipped with its own pair of local capacitors, which perform
the accumulation over time across multiple sub-vectors. As
will be discussed in Section 4, the pair is used to handle positive
and negative values by accumulating them separately on one or
the other capacitor. After a pre-determined number of private
accumulations in the analog domain, the partial results need to
be accumulated across the low-bitwidth MACCs. In that cycle,
the transmission gates between the capacitors (Figure 2(a))
connect them, and a simple charge sharing between the
capacitors yields the accumulated result for the MS-BPMACC.
That is when a single A/D conversion is performed, the cost of
which is not only amortized across the parallel MACC units
but also over time across multiple sub-vectors.
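The amortization described above can be sketched with a small, idealized, noise-free model. Everything in it is an illustrative assumption rather than the paper's circuit: the function name `ms_bpmacc`, the parameters `lanes` and `cycles_per_conversion`, and the way the positive and negative accumulators are combined before the single conversion.

```python
def ms_bpmacc(xs, ws, lanes=8, cycles_per_conversion=4):
    """Idealized model of one MS-BPMACC (hypothetical helper).

    xs, ws: equal-length lists of small signed integers (low-bitwidth operands).
    Returns (dot_product, number_of_ad_conversions).
    """
    assert len(xs) == len(ws)
    total = 0
    conversions = 0
    step = lanes * cycles_per_conversion  # elements covered by one A/D conversion
    for start in range(0, len(xs), step):
        pos = [0] * lanes  # per-lane capacitor holding positive products
        neg = [0] * lanes  # per-lane capacitor holding negative products
        chunk = zip(xs[start:start + step], ws[start:start + step])
        for i, (x, w) in enumerate(chunk):
            lane = i % lanes  # lane that handles this element in its cycle
            product = x * w
            if product >= 0:
                pos[lane] += product
            else:
                neg[lane] += -product
        # Transmission gates close: combine all lanes, then convert once.
        total += sum(pos) - sum(neg)
        conversions += 1
    return total, conversions


# 64 elements with values in [-3, 3], i.e. 3-bit sign-magnitude operands.
xs = [(i % 7) - 3 for i in range(64)]
ws = [((3 * i) % 7) - 3 for i in range(64)]
total, conversions = ms_bpmacc(xs, ws)
reference = sum(x * w for x, w in zip(xs, ws))
```

With these assumed settings, 64 multiply-accumulates need only two conversions instead of one per operation, while the result still matches the plain dot-product.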
3.2 Mixed-Signal Wide Aggregator
MS-BPMACCs only process low-bitwidth operands and, on their own,
cannot combine these operations to enable
higher-bitwidth dot-products. A collection of MS-BPMACCs can
provide this capability, as discussed with Figure 1 in Section 2.
This structure is named MS-WAGG as it is a Mixed Signal
Wide Aggregator. Figure 2(b) depicts a 2D array of a possible
MS-WAGG design, comprising 16 MS-BPMACCs that are
necessary to perform 8-bit by 8-bit vector dot-product with
2-bit partitioning. In this case, the number 16 comes from
the fact that each of the two 8-bit operands can be partitioned
to four 2-bit values. Each of the four 2-bit partitions of the
multiplicand needs to be multiply-accumulated with all four
2-bit partitions of the multiplier. As discussed in Section 2, each
MS-WAGG also performs the necessary shift operations to
combine the low-bitwidth results from its 16 MS-BPMACCs.
By aggregating the partial results of each MS-BPMACC, the
MS-WAGG unit generates a scalar output, which is stored in
its output register. As illustrated in Figure 2, a collection of
these MS-WAGGs constitutes an accelerator core from which
the clustered architecture of BIHIWE is designed.
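The 8-bit by 8-bit case with 2-bit partitioning can be checked numerically with a minimal sketch. It assumes unsigned operands for simplicity, and the helper names `partition` and `wide_aggregate` are hypothetical. Each pass of the inner loop plays the role of one MS-BPMACC, and the shift-and-add plays the role of the MS-WAGG.

```python
def partition(value, bits=8, part=2):
    """Split an unsigned `bits`-wide integer into `part`-bit chunks, LSB first."""
    mask = (1 << part) - 1
    return [(value >> shift) & mask for shift in range(0, bits, part)]


def wide_aggregate(xs, ws, bits=8, part=2):
    """Dot-product of two unsigned vectors built from low-bitwidth partial dot-products.

    Returns (dot_product, number_of_partial_dot_products).
    """
    n_parts = bits // part
    x_parts = [partition(x, bits, part) for x in xs]
    w_parts = [partition(w, bits, part) for w in ws]
    total = 0
    units = 0
    for j in range(n_parts):      # partition index of the input operand
        for k in range(n_parts):  # partition index of the weight operand
            # One wide, low-bitwidth dot-product across all vector elements.
            partial = sum(xp[j] * wp[k] for xp, wp in zip(x_parts, w_parts))
            # Shift by the combined significance of the two partitions.
            total += partial << (part * (j + k))
            units += 1
    return total, units


xs = [200, 17, 255, 3]
ws = [5, 129, 64, 250]
result, units = wide_aggregate(xs, ws)
reference = sum(x * w for x, w in zip(xs, ws))
```

Here `units` comes out as 16, matching the 4 x 4 pairing of 2-bit partitions, and `result` equals the full-precision dot-product `reference`.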
3.3 Hierarchically Clustered Architecture
As discussed in Section 4, the proposed MS-WAGG consumes
5.4× less energy for a single 8-bit MACC in comparison with
digital logic (1 pJ taken from the TETRIS simulator [46],
which is commensurate with other reports [47, 48]). As such,
it is possible to integrate a larger number of mixed-signal
compute units on a chip with a given power budget compared
to a digital architecture. To efficiently utilize the larger
number of available compute units, a high bandwidth memory
substrate is required. Moreover, one of the large sources
of energy consumption in DNN acceleration is off-chip
DRAM accesses [30, 28, 23]. To maximize the benefits of
the mixed-signal computation, 3D-stacked memory is an
attractive option since it reduces the cost of data accesses and
provides a higher bandwidth for data transfer between the
on-chip compute and off-chip memory [12, 25]. Based on
these insights, we devise a clustered architecture for BIHIWE
with a 3D-stacked memory substrate as shown in Figure 2(c).
The mixed-signal logic die of BIHIWE is stacked over the
DRAM dies with multiple vaults, each of which is connected
to the logic die with several through-silicon vias (TSVs). The
3D memory substrate of BIHIWE is modeled using Micron’s
Hybrid Memory Cube (HMC) [49, 50] which has been shown
to be a promising technology for DNN acceleration [12].
As the results in Section 8.2 (Figure 15) show, a flat systolic
design would result in significant underutilization of the
compute resources and bandwidth from 3D stacking.
Therefore, BIHIWE is a hierarchically clustered architec-
ture that allocates multiple accelerator cores as a cluster to
each vault. Figure 2(b) depicts a single core. As shown in
Figure 2(b), each core is self-sufficient and packs a mixed-
signal systolic array of MS-WAGGs as well as the digital
units that perform pooling, activation, normalization, etc.
The mixed-signal array is responsible for the convolutional
and fully connected layers. Generally, wide and interleaved
bit-partitioned execution within MS-WAGGs is orthogonal
to the organization of the accelerator architecture. This paper
explores how to embed them, along with the proposed compute model,
within a systolic design, enabling end-to-end programmable
mixed-signal acceleration for a variety of DNNs.
Accelerator core. As Figure 2(b) depicts, the first level of
hierarchy is the accelerator core and its 2D systolic array
that utilizes the MS-WAGGs. As depicted, the Input Buffers
and Output Buffers are shared across the columns and rows,
respectively. Each MS-WAGG has its own Weight Buffer.
This organization is commensurate with other designs and
reduces the cost of on-chip data accesses as inputs are
reused with multiple filters [26]. What makes our
design different is that each buffer needs to supply
a sub-vector, not a scalar, to the MS-WAGGs in each cycle.
Each MS-WAGG, in contrast, generates only a scalar, since
a dot-product produces a scalar output. The rewiring of the
inputs and weights is already done inside the MS-WAGGs
since the size of the bit-partitions is fixed. As such, there is no
need to reformat any of the inputs, activations, or weights. As
the outputs of the MS-WAGGs flow down the columns, they are
accumulated to generate the output activations that are fed
to each column's dedicated Normalization/Activation/Pooling Units.
To preserve the accuracy of the DNN model, the intermediate
results are stored as 32-bit digital values and intra-column
aggregations are performed in the digital mode.
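The dataflow of one accelerator core can be approximated by the following functional sketch. It is only a behavioral model under stated assumptions: `core_step`, `row_inputs`, and `weights` are hypothetical names, each MS-WAGG is reduced to an exact dot-product, and Python integers stand in for the 32-bit digital accumulators.

```python
def dot(a, b):
    """Scalar produced by one MS-WAGG for an input sub-vector and a weight sub-vector."""
    return sum(x * y for x, y in zip(a, b))


def core_step(row_inputs, weights):
    """Functional model of one pass through the 2D array.

    row_inputs[r]  : input sub-vector shared by every MS-WAGG in row r.
    weights[r][c]  : weight sub-vector private to the MS-WAGG at (r, c).
    Returns one digitally accumulated scalar per column.
    """
    n_rows = len(weights)
    n_cols = len(weights[0])
    column_sums = [0] * n_cols
    for r in range(n_rows):
        for c in range(n_cols):
            # Each MS-WAGG emits a scalar; scalars accumulate down the column.
            column_sums[c] += dot(row_inputs[r], weights[r][c])
    return column_sums


row_inputs = [[1, 2], [3, 4]]
weights = [
    [[1, 0], [0, 1]],   # row 0: weights for column 0 and column 1
    [[1, 1], [2, -1]],  # row 1: weights for column 0 and column 1
]
outputs = core_step(row_inputs, weights)  # [8, 4]
```

The point of the sketch is the shape of the data: sub-vectors go in, one scalar comes out of each unit, and only scalars travel down the columns.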
On-chip data delivery for accelerator cores. To minimize
data movement energy and maximally exploit the large
degrees of data-reuse offered by DNNs, BIHIWE uses a
statically-scheduled bus that is capable of multicasting/broadcasting
data across accelerator cores. Compared to complex
interconnections, a statically-scheduled bus significantly
simplifies the hardware by alleviating the need for
complicated arbitration logic and the FIFOs/buffers required for
dynamic routing. Moreover, the static schedule enables the BIHIWE
compiler stack to cut the DNN layers across cores while
maximizing inter- and intra-core data-reuse. The static schedule
is encoded in the form of data communication instructions
(Section 7) that are responsible for (1) fetching data tiles from
the 3D-stacked memory and distributing them across cores
or (2) writing output tiles back from the cores to the memory.
Parallelizing computations across accelerator cores. Data-
movement energy is a significant portion of the overall energy
consumption both for digital designs [12, 23, 24, 28, 30, 51]
and analog designs [33, 35]. As such, the BIHIWE clustered ar-
chitecture (1) divides the computations into tiles that fit within
the limited on-chip capacity of the scratchpads that are private
for each accelerator core, and (2) cuts the tiles of computations
across cores to minimize DRAM accesses by maximally utiliz-
ing the multicast/broadcasting capabilities of BIHIWE on-chip
data delivery network. To simplify the design of the accelera-
tor cores, the scratchpad buffers are private to each core and the
shared data is replicated across multiple cores. Thus, a single
tile of data can be read once from the 3D-stacked memory and
then be broadcast/multicast across cores to reduce DRAM
accesses. The cores use double-buffering to hide the latency
of memory accesses for subsequent tiles. The accelerator
cores use output-stationary dataflow that minimizes the num-
ber of ADC conversions by accumulating results in the charge-
domain. Section 6 discusses the BIHIWE compiler stack that
optimizes the cuts and tile sizes for individual DNN layers.
4. SWITCHED-CAPACITOR CIRCUIT DESIGN FOR BIT-PARTITIONING
BIHIWE exploits switched-capacitor circuitry [36, 34, 43, 42,
41] for MS-BPMACC by implementing MACC operations
in the charge domain rather than using resistive ladders to
compute in the current domain [32, 40, 44]. Compared to the
current-domain approach, switched capacitors (1) enable
result accumulation in the analog domain by storing the results
as electric charge, eliminating the need for A/D conversion
at every cycle, and (2) make multiplications depend only
on the ratio of the capacitor sizes rather than their absolute
capacitances. The second property enables reduction of
capacitor sizes, improving the energy and area of MACC units
as well as making them more resilient to process variation. The
following discusses the details of the MS-BPMACC circuitry.
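The ratio dependence in property (2) follows from charge conservation and can be illustrated with a generic charge-sharing step. This is a textbook idealization (no parasitics, no leakage), not the paper's exact circuit, and `share_charge` is a hypothetical helper name.

```python
import math


def share_charge(v_sampled, c_source, c_target):
    """Voltage after a capacitor charged to `v_sampled` is connected
    to an initially discharged capacitor `c_target` (ideal switches)."""
    charge = c_source * v_sampled          # Q = C * V stored before sharing
    return charge / (c_source + c_target)  # same Q spread over both capacitors


# Same 1:3 capacitor ratio at two different absolute sizes (farads).
v_small = share_charge(1.0, 2e-15, 6e-15)
v_large = share_charge(1.0, 20e-15, 60e-15)
same_result = math.isclose(v_small, v_large)
```

Both cases settle at a quarter of the sampled voltage, because only the ratio c_source / (c_source + c_target) enters the result. That is why the absolute capacitances can, in principle, be scaled down.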
4.1 Low-Bitwidth Switched-Capacitor MACC
Figure 3 depicts the design of a single 3-bit sign-magnitude
MACC. Here, x_s x_1 x_0 and w_s w_1 w_0 denote the bit-partitioned
operands. The result of each MACC operation is retained as
electric charge in the accumulating capacitor (CACC). In addition
to CACC, the MACC unit contains two capacitive Digital-to-Analog
Converters, one for inputs (C-DACX) and one for
weights (C-DACW). The C-DACX and C-DACW convert the
[Figure 3 labels: C-DACx, C-DACw, VDD, Clk, x[0], x[1], w[0], w[1], sign, Cx, 2Cx, Cw, 2Cw, CACC+, CACC-]
Figure 3: Low-bitwidth switched-capacitor MACC.
[Figure panels: (a) Converting digital input to analog charge; (b) Sharing sampled charge with the weight capacitive DAC; (c) Accumulation on capacitors and sampling new input. Labels: VDD, VS, Clk, |x|CX, |w|CW, CX, 2CX, CW, 2CW, CACC]
<latexit sha1_base64="pID5H2Ataz3LFuHdcpqWbejwBpw=">AAACCnicbVDLSsNAFJ3UV62vWJdugkWom5K0gi6L3bisYB/QhDCZTtqhM5MwMxFLyB/4DW517U7c+hMu/RMnbRbaeuDC4Zx7OZcTxJRI ZdtfRmljc2t7p7xb2ds/ODwyj6t9GSUC4R6KaCSGAZSYEo57iiiKh7HAkAUUD4JZJ/cHD1hIEvF7NY+xx+CEk5AgqLTkm1U3YGmHzvzUjaek3rrIMt+s2Q17AWudOAWpgQJd3/x2xxFKGOYKUSjlyLFj5aVQKIIozipuInEM0QxO8EhTDhmWXrr4PbPOtTK2wkjo4cpaqL8vUsiknLNAbzKopnLVy8X/vFGiwmsvJTxOFOZoGRQm1FKRlRdhjYnASNG5JhAJon+10BQKiJSu609KwPJOnNUG1km/2XBajebdZa19U7RTBqfg DNSBA65AG9yCLugBBB7BM3gBr8aT8Wa8Gx/L1ZJR3JyAPzA+fwCJXpqu</latexit>
Clk
<latexit sha1_base64="Xgr6yRqh9yDaM45I070YB7KsX9w=">AAAB/3icbVA9SwNBEJ3zM8avqKXNYhCswl0UtAymsYxgPiQ5wt5mL1myu3fs7gnhuMLfYKu1ndj6Uyz9J26SK0zig4HHezPMzAtizrRx3W9nbX1jc2u7sFPc3ds/OCwdHbd0lChCmyTikeoEWFPOJG0aZjjtxIpiEXDaDsb1qd9+okqzSD6YSUx9gYeShYxgY6X HXiDSOh9nxX6p7FbcGdAq8XJShhyNfumnN4hIIqg0hGOtu54bGz/FyjDCaVbsJZrGmIzxkHYtlVhQ7aezgzN0bpUBCiNlSxo0U/9OpFhoPRGB7RTYjPSyNxX/87qJCW/8lMk4MVSS+aIw4chEaPo9GjBFieETSzBRzN6KyAgrTIzNaGFLIDKbibecwCppVSveZaV6f1Wu3ebpFOAUzuACPLiGGtxBA5pAQMALvMKb8+y8Ox/O57x1zclnTmABztcvLsSWtQ==</latexit>
VDD
|x|newCX
<latexit sha1_base64="uA0L9nQ73ED0JC/8wIpZ7yr109M=">AAACbXicbVHLbhMxFHWGVwmvlIoVCFmkCDZEMwUJllW6YVMpCNJGyoxGnpub1qrtGdl3IJE738IWPomv4BfwDFmQli tZPjrnXNv3uKiUdBTHv3rRjZu3bt/Zudu/d//Bw0eD3ccnrqwt4BRKVdpZIRwqaXBKkhTOKotCFwpPi4ujVj/9itbJ0nyhdYWZFmdGLiUIClQ+2EsL7S9Xl7k3+K3hR/ms6eeDYTyKu+LXQbIBQ7apSb7bG6eLEmqNhkAJ5+ZJXFHmhSUJCpt+WjusBFyIM5wHaIRGl/nu9Q1/GZgFX5Y2LEO8Y//t8EI7t9ZFcGpB5+6q1pL/0+Y1LT9kXpqqJjSw/YrODTbzWIdNVhTkkACUWguz 8Kl2RaUFQDOPM7+ftnYQyh9/bvbfpIQrcuDHk+PWka5cODVM2U2yrBWnkrdZ84W0CKTWAYhwSQiDw7mwAij8yNYYhW5C6snVjK+Dk4NR8nZ08Ond8HC8yX+HPWUv2GuWsPfskH1kEzZlwNbsO/vBfvZ+R0+iZ9Hzv9aot+nZY1sVvfoDzxy+Zw==</latexit>
Figure 4: Charge-domain MACC; phase by phase.
2-bit magnitude of the input and weight to the analog domain
as an electric charge proportional to |x| and |w| respectively.
C-DACX and C-DACW are each composed of two capacitors
((CX, 2CX) and (CW, 2CW), respectively) that operate in parallel and are
combined to convert the operands to the analog domain. Each of
these capacitors is controlled by a pair of transmission gates
that determines whether the capacitor is active or inactive. Another
set of transmission gates connects the two C-DACs and shares
charge when partitions of x and w are multiplied. The resulting
shared charge is stored on either CACC+ or CACC- depending
on the “sign” control signal produced by xs ⊕ ws. During multiplication,
the transmission gates are coordinated by a pair of
complementary non-overlapping clock signals, Clk and $\overline{\mathrm{Clk}}$.
Charge-domain MACC. Figure 4 shows the phase-by-phase
process of a MACC and its corresponding active circuits, the
phases of which are described below.
Clkφ(1): The first phase (Figure 4(a)) consists of the input
capacitive DAC converting the digital input (x) to a charge proportional
to the magnitude of the input, |x|CX. As a result, the
sampled charge (Qsx) in C-DACX in the first phase is equal to:
$$Q_{sx} = V_{DD} \times (|x|\,C_X) \tag{1}$$
Clkφ(2): In the second phase (Figure 4(b)), the multiplication
happens via a charge-sharing process between C-DACX and
C-DACW. C-DACW converts |w| to the charge domain.
At the same time, C-DACX redistributes its sampled
charge (Qsx) over all of its capacitors (3×CX) as well as the
equivalent capacitor of C-DACW. The voltage (Vs) at the
junction of C-DACX and C-DACW is as follows:
$$V_s = \frac{Q_{sx}}{C_{eq}} = \frac{V_{DD} \times (|x|\,C_X)}{3C_X + |w|\,C_W} \tag{2}$$
Because the sampled charge is shared with the weight
capacitors, the stored charge (Qsw) on C-DACW is equal to:
$$Q_{sw} = V_s \times |w|\,C_W = |x| \times |w| \left( \frac{C_W C_X V_{DD}}{3C_X + |w|\,C_W} \right) \tag{3}$$
Equation 3 shows that the stored charge on C-DACW is
proportional to |x|×|w|, but includes a non-linearity due to
the |w| term in the denominator. To suppress this non-linearity,
CX and CW must be chosen such that 3CX ≫ |w|CW.
Although this design choice does not completely remove the
non-linearity, the residual error can be mitigated as discussed in Section 5.
[Figure 5 panels: (a) Mixed-Signal bit-partitioned MACC unit; (b) Control signals for wide accumulation, showing cycles (1-m) and the (m+1)th cycle.]
Figure 5: Mixed-Signal bit-partitioned MACC unit.
With this choice, Qsw becomes $Q_{sw} = |x| \times |w| \, \frac{C_W V_{DD}}{3}$.
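The effect of the capacitor ratio on the non-linearity in Equation 3 can be checked numerically. The following Python sketch is illustrative only: the function names, the 1 fF unit capacitor, and the 1 V supply are assumptions chosen for the example and are not taken from the design.

```python
def shared_charge(x_mag, w_mag, c_x, c_w, v_dd):
    """Charge left on C-DAC_W after phases 1 and 2 (Equations 1 to 3)."""
    q_sx = v_dd * x_mag * c_x                 # Eq. 1: charge sampled by C-DAC_X
    v_s = q_sx / (3 * c_x + w_mag * c_w)      # Eq. 2: voltage after charge sharing
    return v_s * w_mag * c_w                  # Eq. 3: charge on the weight capacitors


def ideal_charge(x_mag, w_mag, c_w, v_dd):
    """Linear limit of Eq. 3 when 3*C_X >> |w|*C_W."""
    return x_mag * w_mag * c_w * v_dd / 3


# Hypothetical values: 1 fF weight unit capacitor, 1 V supply.
C_W, V_DD = 1e-15, 1.0
for ratio in (1, 10, 100):                    # ratio = C_X / C_W
    q = shared_charge(3, 3, ratio * C_W, C_W, V_DD)
    q_ideal = ideal_charge(3, 3, C_W, V_DD)
    print(f"C_X/C_W = {ratio:3d}: relative error = {(q_ideal - q) / q_ideal:.3%}")
```

For the largest 2-bit weight magnitude (|w| = 3), the relative deviation from the linear product is |w|/(3·CX/CW + |w|), so it shrinks as CX grows relative to CW but never reaches zero.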
Clkφ(3): In the last phase (Figure 4(c)), the charge from multiplication
is shared with CACC for accumulation. The sign bits
(xs and ws) determine which of CACC+ or CACC- is selected
for accumulation. The charge sampled by |w|CW is then redistributed
over the selected CACC as well as all the capacitors
of C-DACW (=3CW). Theoretically, CACC must be infinitely
larger than 3CW to completely absorb the charge from multiplication.
In reality, some charge remains unabsorbed,
leading to a pattern of computational error, which is mitigated
as discussed in Section 5. Ideally, the voltage VACC on CACC is:
$$V_{ACC} = |x||w| \left( \frac{C_W V_{DD}}{3 \times C_{ACC}} \right) \tag{4}$$
While the charge sharing and accumulation happen on CACC,
a new input is fed into C-DACX, starting a new MACC process
in a pipelined fashion. This process repeats for all low-bitwidth
MACC units over multiple cycles before one A/D conversion.
4.2 Wide Mixed-Signal Bit-Partitioned MACC
Figure 5(a) depicts an array of n switched-capacitor MACCs,
constituting the MS-BPMACC unit, which perform operations
for m cycles in the analog domain and store the results locally
on their CACCs. Figure 5(b) depicts the control signals and cycles
of operation. For the BIHIWE microarchitecture, m and n
are set to 32 and 8, respectively, based on design space exploration (see
Figure 14). Over m cycles, the results of m×n low-bitwidth
MACC operations are accumulated in the CACCs, which are private to each
MACC unit. In cycle m+1, the private results are aggregated
across all the MACC units within the MS-BPMACC. The single
A/D converter in the MS-BPMACC converts
the aggregated result; this conversion also starts at cycle m+1.
In the first phase of cycle m+1, all n accumulating
capacitors that store the positive values (CACC+) are
connected together through a set of transmission gates to
share their charge. Simultaneously, the same process happens
for the CACC- capacitors. ClkACC in Figure 5 is the control signal that
connects the CACCs. The accumulating capacitors (CACCs)
are also connected to a Successive Approximation Register
(SAR) ADC and share their stored charge with the Sample
and Hold (S&H) block of the ADC. The S&H block has
differential inputs: it samples the positive and negative
results separately, subtracts them, and holds the difference for
A/D conversion. In the second phase of cycle
m+1, Clkrst connects all the CACCs to ground to clear them
for the next iteration of wide, bit-interleaved calculations.
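The accumulate-then-aggregate behavior described above can be summarized with a purely functional model. The Python sketch below is an assumption-laden illustration: it uses integers in place of charge, ignores the charge-sharing scale factors, noise, and ADC quantization, and the name `ms_bpmacc` is hypothetical.

```python
import random


def ms_bpmacc(xs, ws):
    """Functional model of one MS-BPMACC.

    xs and ws are m-by-n lists of signed partitions with 2-bit magnitudes
    (values in -3..3): m cycles, n MACC units.
    """
    n = len(xs[0])
    acc_pos = [0] * n   # stands in for the charge on each C_ACC+
    acc_neg = [0] * n   # stands in for the charge on each C_ACC-
    for x_row, w_row in zip(xs, ws):              # cycles 1..m
        for u, (x, w) in enumerate(zip(x_row, w_row)):
            product = abs(x) * abs(w)             # magnitudes multiply in the charge domain
            if (x < 0) != (w < 0):                # sign = x_s XOR w_s
                acc_neg[u] += product
            else:
                acc_pos[u] += product
    # Cycle m+1: aggregate across units, then the differential S&H subtracts.
    return sum(acc_pos) - sum(acc_neg)


random.seed(0)
M, N = 32, 8                                      # values of m and n used in the text
xs = [[random.randint(-3, 3) for _ in range(N)] for _ in range(M)]
ws = [[random.randint(-3, 3) for _ in range(N)] for _ in range(M)]
reference = sum(x * w for xr, wr in zip(xs, ws) for x, w in zip(xr, wr))
print(ms_bpmacc(xs, ws), reference)
```

Keeping positive and negative products on separate accumulators and subtracting only once, at conversion time, mirrors the role of CACC+, CACC-, and the differential S&H input.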
There is a tradeoff between the resolution and the sampling rate of
an ADC, which also determines its topology. A SAR ADC is a better
choice when it comes to medium resolution (8-12 bits) and
sampling rate (1-500 Mega-Samples/sec). We choose a 10-bit,
15 Mega-Samples/sec SAR ADC as it strikes a better balance
between speed and resolution for MS-BPMACCs. The design
space exploration in Figure 14 shows that this choice makes the
grouping of 8 low-bitwidth MACCs optimal for m=32 cycles
of operation. The A/D conversion takes m+1 cycles
and is pipelined with the sub-vector dot-product. Table 1 shows
the energy breakdown within an MS-BPMACC that uses 2-bit
partitioning. As shown, performing an 8-bit MACC using the
interleaved bit-partitioned arithmetic requires 5.4× less energy
than a digital MACC, which consumes around 1 pJ [12].

Table 1: Energy breakdown for MS-BPMACC.
Units                             Energy (femtojoules)
1 MACC                            5.1 fJ
256 MACCs                         1,305.6 fJ
SAR ADC (for 256 MACCs)           1,660.0 fJ
Total Energy                      2,965.6 fJ
Total Energy per 2b-2b MACC       11.6 fJ
Total Energy per 8b-8b MACC       185.3 fJ
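The per-operation figures in Table 1 follow from simple arithmetic on its component rows. The Python sketch below reproduces that arithmetic; the factor of 16 is an assumption inferred from 2-bit partitioning of 8-bit operands (4 × 4 partition pairs), and the variable names are illustrative.

```python
MACC_ENERGY_FJ = 5.1        # one 2b-2b MACC (Table 1)
NUM_MACCS = 256             # MACCs served by one A/D conversion (Table 1)
ADC_ENERGY_FJ = 1660.0      # SAR ADC energy for those MACCs (Table 1)
DIGITAL_MACC_FJ = 1000.0    # ~1 pJ digital 8-bit MACC cited in the text

macc_total = MACC_ENERGY_FJ * NUM_MACCS          # energy of all analog MACCs
total = macc_total + ADC_ENERGY_FJ               # analog MACCs plus one conversion
per_2b = total / NUM_MACCS                       # amortized cost of one 2b-2b MACC
per_8b = per_2b * 16                             # assumed 4 x 4 partition pairs per 8b-8b MACC
improvement = DIGITAL_MACC_FJ / per_8b           # ratio against the digital MACC

print(f"256 MACCs:          {macc_total:.1f} fJ")
print(f"total:              {total:.1f} fJ")
print(f"per 2b-2b MACC:     {per_2b:.2f} fJ")
print(f"per 8b-8b MACC:     {per_8b:.2f} fJ")
print(f"vs. digital MACC:   {improvement:.2f}x")
```

The calculation makes the amortization explicit: the A/D conversion costs more than all 256 analog MACCs combined, which is why sharing one converter across m×n operations matters.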
5. MIXED-SIGNAL NON-IDEALITIES AND
THEIR MITIGATION
Although analog circuitry offers a significant reduction in
energy, it can degrade accuracy. Thus,
its error needs to be properly modeled and accounted for.
Specifically, MS-BPMACCs, the main analog component, can
be susceptible to (1) thermal noise, (2) computational error
caused by incomplete charge transfer, and (3) PVT variations.
Traditionally, analog circuit designers mitigate sources of error
by configuring hardware parameters to values that are
robust to non-idealities. Such hardware parameter adjustments
incur significant energy/area overheads that scale linearly
with the number of modules. The overheads are acceptable
in conventional analog designs since modules are few in number.
However, due to the repetitive and scaled-up nature of
our design, we need to mitigate these non-idealities at a higher,
algorithmic level. We leverage the training algorithm’s
inherent mechanism to reduce error (loss) and use mathematical
models to represent these non-idealities. We then apply
these models during the forward pass to adjust and fine-tune
pre-trained neural models with just a few more epochs across
the chips within a technology node. The rest of this section
details the non-idealities and their modeling. It then elaborates
on how PVT variations are considered in the formulations.
5.1 Thermal Noise
Thermal noise is an inherent perturbation in analog circuits
caused by the thermal agitation of electrons, which distorts the
main signal. This noise can be modeled as a normal
distribution around the ideal voltage, whose standard deviation
depends on the working temperature (T), the Boltzmann constant
(k), and the capacitor size (C): $\sigma = \sqrt{kT/C}$. Within BIHIWE, switched-capacitor MACC
units are mainly affected by the combined thermal noise resulting
from the weight and accumulator capacitors (CW and CACC,
respectively). The noise from these capacitors accumulates
during the m cycles of computation for each individual
MACC unit and is then aggregated across the n MACC units
in the MS-BPMACC. By applying the thermal noise equation
used for similar MACC units [42] to an MS-BPMACC unit, the
standard deviation at the output is described by Equation 5:
$$\sigma_{ACC} = \sqrt{ \frac{kT\,(\alpha|W_{m-1}| + 3\alpha + 3)}{9\alpha(\alpha+1)^2 C_W} \left( \sum_{i=0}^{m-1} \left( \frac{\alpha}{1+\alpha} \right)^{2i} \right) \times n } \tag{5}$$
In the above equation, $\alpha = C_{ACC} / (3C_W)$. We apply the
effect of thermal noise in the forward propagation of the DNN by
adding an error tensor to the output of convolutional and fully
connected layers. Having computed the standard deviation of
noise for a single MS-BPMACC (σACC), each element of the
error tensor is sampled from a normal distribution as follows:
$$N\left(\mu = 0,\; \sigma^2 = (\sigma_{ACC} \times r \times 85)^2\right) \tag{6}$$
In the above equation, σACC is scaled by r, the number
of MS-BPMACC operations required to generate one element
in the output feature map, and by 85, the total
bit-shift factor applied to each result by the MS-WAGG unit.
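The noise-injection step can be sketched in a few lines of Python. This is a minimal illustration, not the fine-tuning code itself: it assumes the square root in Equation 5 spans the whole right-hand side, and the capacitor size, α, and r values below are hypothetical placeholders. The function names `sigma_acc` and `noise_tensor` are likewise introduced only for this example.

```python
import math
import random

K_BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K


def sigma_acc(w_last, alpha, c_w, temp_k, m, n):
    """Equation 5, reading the square root as covering the whole expression."""
    prefactor = (K_BOLTZMANN * temp_k * (alpha * abs(w_last) + 3 * alpha + 3)
                 / (9 * alpha * (alpha + 1) ** 2 * c_w))
    geometric = sum((alpha / (1 + alpha)) ** (2 * i) for i in range(m))
    return math.sqrt(prefactor * geometric * n)


def noise_tensor(num_elements, sigma, r, rng):
    """Equation 6: zero-mean Gaussian samples with std = sigma * r * 85."""
    std = sigma * r * 85
    return [rng.gauss(0.0, std) for _ in range(num_elements)]


# Hypothetical operating point: 1 fF C_W, alpha = 10, 300 K, m = 32, n = 8.
rng = random.Random(0)
sigma = sigma_acc(w_last=3, alpha=10.0, c_w=1e-15, temp_k=300.0, m=32, n=8)
errors = noise_tensor(4, sigma, r=9, rng=rng)   # r = 9 is a placeholder
print(f"sigma_ACC = {sigma:.3e} V")
```

In a fine-tuning flow, the sampled values would be added element-wise to the layer output during the forward pass, so that the loss observed by the training algorithm already includes the thermal perturbation.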
5.2 Computational Error
Another source of error in BIHIWE’s charge-domain computations
arises when charge is shared between capacitors during
multiplication and accumulation. Within each MACC
unit, the input capacitors (C-DACX) transfer a sampled
charge to the weight capacitors (C-DACW) to produce a charge
proportional to the multiplication result. However, the resulting
charge is subject to an error that depends on the ratio of the input
and weight capacitor sizes ($\beta = C_X / C_W$), as shown in Equation 3.
This shared charge in the weight capacitors introduces more
error when it is redistributed to the accumulating capacitor
(CACC), which cannot absorb all of the charge, leaving a small
portion on the weight capacitors in subsequent
cycles. The ideal voltage (VACC,Ideal) produced after m cycles
of multiplication can be derived from Equation 4 as follows:
$$V_{ACC,Ideal}[m] = \sum_{i=1}^{m} \frac{V_{DD}}{9\alpha} W_i X_i \tag{7}$$
By considering the computational error from incomplete
charge sharing, the actual voltage at the accumulating capacitor
after m cycles of MACC operations (VACC,R[m]) becomes:
$$V_{ACC,R}[m] = \frac{3\alpha}{3\alpha + |W_m|} V_{ACC,R}[m-1] + \frac{W_m X_m \beta}{(3\alpha + |W_m|)(3\beta + |W_m|)} V_{DD} \tag{8}$$
Computational error is accounted for in the fine-tuning pass
by folding the multiplicative factors shown in Equation 8
into the weights. During the forward pass, the fine-tuning algorithm
decomposes weight tensors in convolutional and fully-connected
layers into groups corresponding to the MS-WAGG
configuration and updates the individual weight values (Wi) to
new values (W′i) that include the computational error, as given in Equation 9:
$$W'_i = \frac{W_i}{3\alpha + |W_i|} \cdot \frac{\beta V_{DD}}{3\beta + |W_i|} \prod_{j=i+1}^{m-1} \frac{3\alpha}{3\alpha + |W_j|}, \qquad \forall\; 0 \le i \le m-1 \tag{9}$$
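Unrolling the recurrence of Equation 8 from a cleared accumulator gives exactly the weighted sum that Equation 9 folds into the weights. The Python sketch below checks this numerically; the function names and the values of α, β, and VDD are assumptions chosen for illustration, and indices run from 0 to m−1 as in Equation 9.

```python
def v_acc_real(ws, xs, alpha, beta, v_dd):
    """Equation 8 applied cycle by cycle, starting from a cleared C_ACC."""
    v = 0.0
    for w, x in zip(ws, xs):
        retained = 3 * alpha / (3 * alpha + abs(w))          # share kept from earlier cycles
        added = w * x * beta * v_dd / ((3 * alpha + abs(w)) * (3 * beta + abs(w)))
        v = retained * v + added
    return v


def fold_weights(ws, alpha, beta, v_dd):
    """Equation 9: W'_i absorbs the error factors of cycle i and all later cycles."""
    folded = []
    for i, w in enumerate(ws):
        w_prime = w / (3 * alpha + abs(w)) * beta * v_dd / (3 * beta + abs(w))
        for w_j in ws[i + 1:]:
            w_prime *= 3 * alpha / (3 * alpha + abs(w_j))
        folded.append(w_prime)
    return folded


def v_acc_ideal(ws, xs, alpha, v_dd):
    """Equation 7: error-free accumulation."""
    return sum(v_dd / (9 * alpha) * w * x for w, x in zip(ws, xs))


# Hypothetical 2-bit partitions and capacitor ratios.
ws = [3, -2, 1, 0, -3, 2]
xs = [1, 3, -2, 3, 1, -1]
ALPHA, BETA, V_DD = 20.0, 20.0, 1.0

real = v_acc_real(ws, xs, ALPHA, BETA, V_DD)
via_folded = sum(wp * x for wp, x in zip(fold_weights(ws, ALPHA, BETA, V_DD), xs))
ideal = v_acc_ideal(ws, xs, ALPHA, V_DD)
print(real, via_folded, ideal)
```

Because the error factors depend only on the weights and the capacitor ratios, they can be precomputed per weight group, which is what allows the fine-tuning pass to model the error without simulating the circuit cycle by cycle.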
5.3 Process-Voltage-Temperature Variations
Process variations. We use the sizing of the capacitors to
provision for and mitigate process variations, to which
switched-capacitor circuits are generally robust. The robustness
and the mitigation are effective because the capacitors are
implemented using a number of smaller unit capacitors with
the common-centroid layout technique [52]. Specifically, we
use metal-fringe capacitors for MACCs, with a mismatch
of just 1% standard deviation [53] and a maximum variation of
6% (6σ), which is well below the error margins considered
for the computational correctness of MS-BPMACCs.
Temperature variations. We model the temperature
variations by adding a perturbation term to T in Equation 5,
drawn from a Gaussian distribution $N_T(\mu, \sigma^2)$. We consider the maximum
value of the temperature to be 358 K, which is commensurate with
existing practices [54], and the minimum value to be 300 K (this
is the peak-to-peak range (6σ) of the Gaussian distribution).
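The temperature model above fixes the parameters of the Gaussian once a mean is chosen. The Python sketch below assumes the mean lies at the midpoint of the stated range, which the text does not state explicitly; the names are illustrative.

```python
import random

T_MIN_K, T_MAX_K = 300.0, 358.0
T_MEAN_K = (T_MIN_K + T_MAX_K) / 2          # assumed midpoint of the range
T_SIGMA_K = (T_MAX_K - T_MIN_K) / 6         # the range is stated to span 6 sigma


def sample_temperature(rng):
    """Draw one operating temperature for use as T in Equation 5."""
    return rng.gauss(T_MEAN_K, T_SIGMA_K)


rng = random.Random(0)
samples = [sample_temperature(rng) for _ in range(1000)]
inside = sum(T_MIN_K <= t <= T_MAX_K for t in samples)
print(f"mean = {T_MEAN_K} K, sigma = {T_SIGMA_K:.2f} K, {inside}/1000 samples in range")
```

With a 6σ peak-to-peak range, nearly all samples fall between the two bounds, so repeated sampling over many dot-products exercises both temperature extremes.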
Voltage variations. We also model the voltage variation by
adding a Gaussian-distributed perturbation to the VDD term in Equation 9. Our
[Figure 6 plots: top-1 and top-5 validation accuracy versus fine-tuning epoch (1 to 10) for VGG-16 and ResNet-50, with the ideal top-1 and top-5 accuracy marked.]
Figure 6: ResNet-50 and VGG-16 accuracy after fine-tuning.
experiments show that variations in voltage of up to 20% can be mitigated.
The extensive number of vector dot-product
operations in DNNs ensures that the minimum and maximum
values of the distributions are sampled a sufficient number
of times, leading to coverage of the corner cases.
Atop all these considerations, we use differential signaling
for ADCs, which attenuates common-mode fluctuations
such as PVT variations. To show the effectiveness of our
techniques, Figure 6 plots the result of the fine-tuning process for
two benchmarks, ResNet-50 and VGG-16, over ten epochs. Table 4
reports the summary of accuracy trends for all the benchmarks,
which achieve less than 0.5% loss. As Figure 6 shows, for ResNet-50 the
fine-tuning pass reduces the initial loss (0.73% for top-1 and
2.41% for top-5) to only 0.04% for top-1 and 0.02% for top-5.
VGG-16 is slightly different and reduces the initial loss (1.16%
for top-1 and 2.24% for top-5) to less than 0.18% for top-1 and
0.13% for top-5 validation accuracy. The trends are similar
for the other benchmarks and are omitted due to space constraints.
6. BIHIWE COMPILER STACK
As Figure 7 shows, DNNs are compiled to BIHIWE through
a multi-stage process beginning with a Caffe2 [55] DNN
specification file. The high-level specification provided in the
Caffe2 file is translated to a layer DataFlow Graph (DFG) that
preserves the structure of the network. The DFG goes through
an algorithm that cuts the DFG and tiles the data to map the
DNN computations to the accelerator clusters and cores. The
tiling also aims to minimize the transfer of model parameters
from the 3D-stacked DRAM to the limited on-chip scratchpads
on the logic die, while maximizing the utilization of the
compute resources. In addition to the DFG, the cutting/tiling
algorithm takes in the architectural specification of
BIHIWE. These specifications include the organizations and
configurations (# rows, # columns) of the clusters, vaults, and
cores as well as details of the MS-BPMACCs. To identify the
best cuts and tilings, the cutting/tiling algorithm exhaustively
searches the space of possibilities, which is enabled by
an estimation tool. The tool estimates the total energy consumption
and runtime for each cut/tiling pair, which determines
the data movement and resource utilization in BIHIWE.
Estimation is viable because the DFG does not change, there is no
hardware-managed cache, and the accelerator architecture is
[Figure 7 blocks: DNN specifications in Caffe2; translator; layer dataflow graph; cutting/tiling algorithm; accelerator specifications (# cores (rows, columns), # vaults (rows, columns), # MS-WAGG (rows, columns), MS-BPMACC width, # cycles before ADC); runtime/energy estimation tool; dataflow cuts for each cluster and core; tiling of activations and weights; binary generator; communication instruction blocks; compute instruction blocks.]
Figure 7: BIHIWE compilation stack.
fixed during execution. Thus, there are no irregularities that
can hinder estimation. Algorithm 1 depicts the cutting/tiling
procedure. Once the cuts and tiles are determined, the compiler
generates the binary code that contains the communication
and computation instruction blocks. As is common in
state-of-the-art accelerators [12, 28, 23, 25, 18], all the instructions
are statically scheduled. We extend the static scheduling
to cluster coordination, data communication, and transfer.
Initialize cut_opt[N] ← ∅
Initialize tiling_opt[N] ← ∅
for layer_i ∈ DFG_DNN do
  s_opt ← ∞
  for tiling_{i,j} ∈ layer_i do
    for cut_{i,j,k} ∈ tiling_{i,j} do
      (runtime_{i,j,k}, energy_{i,j,k}) ← EstimationTool(tiling_{i,j}, cut_{i,j,k})
      s_{i,j,k} ← runtime_{i,j,k} × energy_{i,j,k}
      if s_{i,j,k} < s_opt then
        s_opt ← s_{i,j,k}
        cut_opt[i] ← cut_{i,j,k}
        tiling_opt[i] ← tiling_{i,j}
      end
    end
  end
end
return cut_opt, tiling_opt
Algorithm 1: Cutting/tiling algorithm for clustered acceleration.
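The search in Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the tiling and cut labels, the `search_cuts_and_tilings` function, and the toy cost table are hypothetical stand-ins, and the real estimation tool is not reproduced here. The sketch keeps a running best score per layer so that the pair with the lowest runtime × energy product is retained.

```python
import math
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical types: tilings and cuts are opaque labels in this sketch.
Tiling = str
Cut = str
Estimator = Callable[[Tiling, Cut], Tuple[float, float]]  # -> (runtime, energy)


def search_cuts_and_tilings(
    layers: List[Dict[Tiling, Iterable[Cut]]],
    estimate: Estimator,
) -> Tuple[List[Cut], List[Tiling]]:
    """For each layer, keep the (tiling, cut) pair minimizing runtime * energy."""
    cut_opt: List[Cut] = []
    tiling_opt: List[Tiling] = []
    for candidates in layers:
        best_score = math.inf
        best_cut, best_tiling = None, None
        for tiling, cuts in candidates.items():
            for cut in cuts:
                runtime, energy = estimate(tiling, cut)
                score = runtime * energy
                if score < best_score:
                    best_score = score  # remember the best score seen so far
                    best_cut, best_tiling = cut, tiling
        cut_opt.append(best_cut)
        tiling_opt.append(best_tiling)
    return cut_opt, tiling_opt


# Made-up costs, used only to exercise the search.
TOY_COSTS = {
    ("tile_a", "cut_x"): (4.0, 2.0),
    ("tile_a", "cut_y"): (3.0, 2.0),
    ("tile_b", "cut_x"): (2.0, 2.5),
}


def toy_estimate(tiling: Tiling, cut: Cut) -> Tuple[float, float]:
    return TOY_COSTS[(tiling, cut)]


layers = [{"tile_a": ["cut_x", "cut_y"], "tile_b": ["cut_x"]}]
cuts, tilings = search_cuts_and_tilings(layers, toy_estimate)
```

With the toy costs above, the products are 8.0, 6.0, and 5.0, so the single layer selects `tile_b` with `cut_x`.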
7. BIHIWE INSTRUCTION SET
The BIHIWE ISA exposes the following unique properties of
its architecture to the software: (1) efficient mixed-signal
execution using bit-partitioned MS-WAGGs and capacitive
accumulation, and (2) a clustered architecture that takes advantage
of the power efficiency of mixed-signal acceleration to scale
up the number of MS-WAGGs in BIHIWE. As such, BIHIWE
uses a block-structured ISA that segregates the execution of
the DNN into (1) data communication instruction blocks that
access tiles of data from the 3D-stacked memory and populate
the on-chip scratchpads (Input Buffer/Weight Buffer/Output
Buffer in Figure 2), and (2) compute instruction blocks, each of
which consumes the tile of data produced by a corresponding
communication instruction block and produces an output tile.
The BIHIWE compiler stack statically assigns communication
and compute instruction blocks to accelerator clusters, shifting
the complexity from hardware to the compiler. By splitting
the data transfer and on-chip data processing into separate
instructions, the BIHIWE ISA enables software pipelining
between clusters and allows the memory accesses to run ahead
and fetch data for the next tile while processing the current tile.
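The overlap between fetching the next tile and processing the current one can be illustrated with a small static schedule. This is a minimal sketch under simplifying assumptions: one communication block and one compute block per tile, each taking one step; the function name `static_schedule` and the step model are hypothetical and not part of the BIHIWE ISA.

```python
from typing import List, Optional, Tuple


def static_schedule(num_tiles: int) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return per-step (communication_tile, compute_tile) pairs.

    Step t fetches tile t while computing on tile t-1, so the fetch for
    the next tile overlaps with processing of the current one.
    """
    steps = []
    for t in range(num_tiles + 1):
        comm = t if t < num_tiles else None  # nothing left to fetch in the last step
        comp = t - 1 if t >= 1 else None     # nothing has been fetched before the first step
        steps.append((comm, comp))
    return steps


schedule = static_schedule(3)
# schedule == [(0, None), (1, 0), (2, 1), (None, 2)]
```

In this simplified model, N tiles take N + 1 steps instead of the 2N steps a strictly serial fetch-then-compute order would need.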
Compute instruction block. A block of compute instruc-
tions expresses the entire computation to produce a single tile
in an accelerator core. Further, the compute block governs
how the input data for a DNN layer is bit-partitioned and
distributed across wide aggregators within a single core. As
such, the compiler has complete control over the read/write
accesses to on-chip scratchpads, A/D and D/A conversion,
and execution using the MS-WAGGs and digital blocks in
an accelerator core. The granularity of bit-partitioning and
charge-based accumulation is determined for each microar-
chitectural implementation based on the technology node
and circuit design paradigm. As such, to support different
technology nodes and design styles and allow extensions to
the architecture, the BIHIWE ISA encodes the bit-partitioning
and accumulation cycles. However, we need to explore
the design space to find the optimal design choice for each
combination of technology node and circuits (Section 8).
Table 2: Evaluated benchmarked DNNs
DNN                 Type  Domain                Dataset             Multiply-Adds  Model Weights
AlexNet [56]        CNN   Image Classification  Imagenet [57]       2,678 MOps     56.1 MBytes
CIFAR-10 [58, 59]   CNN   Image Classification  CIFAR-10 [60]       617 MOps       13.4 MBytes
GoogLeNet [61]      CNN   Image Classification  Imagenet            1,502 MOps     13.5 MBytes
ResNet-18 [62]      CNN   Image Classification  Imagenet            4,269 MOps     11.1 MBytes
ResNet-50 [62]      CNN   Image Classification  Imagenet            8,030 MOps     24.4 MBytes
VGG-16 [58]         CNN   Object Recognition    Imagenet            31 GOps        131.6 MBytes
VGG-19 [58]         CNN   Object Recognition    Imagenet            39 GOps        137.3 MBytes
YOLOv3 [63]         CNN   Object Recognition    Imagenet            19 GOps        39.8 MBytes
PTB-RNN [59]        RNN   Language Modeling     Penn TreeBank [64]  17 MOps        16 MBytes
PTB-LSTM [65]       RNN   Language Modeling     Penn TreeBank       13 MOps        12.3 MBytes
Communication instruction block. The key challenge when
scaling up the design is to minimize data-movement while
parallelizing the execution of the DNN across the on-chip
compute resources. To simplify the hardware, the BIHIWE
instruction set captures the static schedule of data movement
as a series of communication instruction blocks. Static scheduling
is possible as the topology of the DNN does not change during
inference and the order of layers and neurons is known stati-
cally. The BIHIWE compiler stack assigns the communication
blocks to the cores according to the order of the layers. This
static ordering enables BIHIWE to use a simple statically
scheduled bus instead of a more complex interconnection.
To maximize energy efficiency, it is imperative to exploit
the high degree of data-reuse offered by DNNs. To exploit
data-reuse when parallelizing computations across cores of the
BIHIWE architecture, the communication instructions support
broadcasting/multicasting to distribute the same data across
multiple cores, minimizing off-chip memory accesses. Once
a communication block writes a tile of data to the on-chip
scratchpads, it can be reused over multiple compute blocks to
exploit temporal data locality within a single accelerator core.
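The effect of broadcasting on off-chip traffic can be shown with a simple count. This is a deliberately simplified sketch: it assumes every core needs every weight tile and ignores on-chip capacity limits; the function `dram_tile_reads` and its parameters are hypothetical.

```python
def dram_tile_reads(num_weight_tiles: int, num_cores: int, broadcast: bool) -> int:
    """Count off-chip reads of weight tiles when every core needs every tile."""
    if broadcast:
        # One DRAM read per tile, delivered to all cores over the shared bus.
        return num_weight_tiles
    # Without broadcast, each core fetches its own copy of each tile.
    return num_weight_tiles * num_cores


unicast_reads = dram_tile_reads(num_weight_tiles=8, num_cores=4, broadcast=False)
shared_reads = dram_tile_reads(num_weight_tiles=8, num_cores=4, broadcast=True)
```

Under these assumptions, broadcasting reduces the number of tile reads by a factor equal to the number of cores that share the data.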
8. EVALUATION
8.1 Methodology
Table 3: BIHIWE and baseline platforms
ASIC platforms
Parameters               BIHIWE         TETRIS
MACCs                    16,384         3,136
On-chip Memory           9216 KB        3698 KB
Chip Area (mm2)          122.3          56
Frequency                500 MHz        500 MHz
Technology               45 nm          45 nm
GPU platforms
Parameters               RTX 2080 TI    Titan Xp
Tensor Cores             544            -
Memory                   11 GB (GDDR6)  12 GB (GDDR5X)
Chip Area (mm2)          754            471
Total Dissipation Power  250 W          250 W
Frequency                1545 MHz       1531 MHz
Technology               12 nm          16 nm
Benchmarks. We use ten diverse CNN and RNN models, described
in Table 2, to evaluate BIHIWE. They perform image
classification, real-time object detection (YOLOv3), and
character-level (PTB-RNN) and word-level (PTB-LSTM) language
modeling. This set of benchmarks includes medium to large scale
models (from 11.1 MBytes to 137.3 MBytes) and a wide range of
multiply-add operation counts (from 13 Million to 39 Billion).
Simulation infrastructure. We develop a cycle-accurate
simulator and a compiler for BIHIWE that takes in a Caffe2
specification of the DNN, finds the optimum tiling and cutting
for each layer, and maps it to the BIHIWE architecture. The
simulator executes each optimized network using the BIHIWE
architecture model and reports the total runtime and energy.
TETRIS comparison. We compare BIHIWE with TETRIS, a
state-of-the-art fully-digital 3D-stacked dataflow accelerator.
We match the on-chip power dissipation of BIHIWE and
TETRIS and compare the total runtime and energy, including
energy for DRAM accesses. We also perform an iso-area
comparison, scaling up the original TETRIS from 16 vaults to
36 vaults to match its area to BIHIWE’s. The baseline TETRIS
supports 16-bit execution while BIHIWE supports 8-bit. For
fairness, we modify the open-source TETRIS simulator [46]
and proportionally scale its runtime and energy. BIHIWE sup-
ports 8-bit operands since this representation has virtually no
impact by itself on the final accuracy of the DNNs [59, 66–69].
Figure 8: Speedup and energy improvement over TETRIS. (Bar
chart; y-axis: improvement over TETRIS; bars: speedup and energy
reduction for each benchmark and the geometric mean. Legible
values: PTB-LSTM 4.6× speedup and 1.6× energy reduction;
GEOMEAN 4.87× speedup and 2.38× energy reduction.)
GPU comparison. We also compare BIHIWE to two Nvidia
GPUs listed in Table 3, RTX 2080 TI and Titan Xp, based on the
Turing and Pascal architectures, respectively. RTX
2080 TI’s Turing architecture provides tensor cores, which
are specialized hardware for deep learning inference. We run
8-bit inference on the GPUs using Nvidia’s own TensorRT 5.1 [70]
library compiled with the optimized cuDNN 7.5 and CUDA 10.1. For
each DNN benchmark, we perform 1,000 warmup iterations
and report the average runtime across 10,000 iterations.
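The warmup-then-average measurement described above follows a common pattern, sketched below in plain Python. This is not the harness used for the GPU experiments: `average_runtime` and `fake_inference` are hypothetical names, and the stand-in workload does no real inference.

```python
import time
from typing import Callable, List


def average_runtime(run_once: Callable[[], None], warmup: int, iterations: int) -> float:
    """Average wall-clock seconds per call, measured after a warmup phase."""
    for _ in range(warmup):
        run_once()  # warmup calls are executed but not timed
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    return (time.perf_counter() - start) / iterations


calls: List[int] = []


def fake_inference() -> None:
    # Stand-in for one inference pass; it only records that it was called.
    calls.append(1)


avg_seconds = average_runtime(fake_inference, warmup=10, iterations=100)
```

On a GPU, kernel launches are asynchronous, so a real harness would also need to synchronize the device before reading the timer.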
Comparison with other recent accelerators. We also com-
pare BIHIWE to Google TPU [26], mixed-signal CMOS Red-
Eye [35], and two analog memristive accelerators. All the com-
parisons are in 8-bits. The original designs [32, 71] use 16-bits.
Scaling from 16-bit to 8-bit execution for memristive designs
would optimistically provide a 4× increase in efficiency.
Energy and area measurement. All hardware modeling
is performed using the FreePDK 45-nm standard cell library [72].
We implement the switched-capacitor MACCs in Cadence
Analog Design Environment V6.1.3 and use Spectre SPICE V6.1.3 to
model the system. We then use Layout XL of Cadence to lay out
the MACC units and extract the energy/area. The ADC’s
energy/area are obtained from [73]. Based on the MS-BPMACC
configuration, we use the ADC architecture from [74].
We implement all digital blocks of BIHIWE, including
adders, shifters, interconnection, and accumulators in Verilog
RTL and use Synopsys Design Compiler (L-2016.03-SP5) to
synthesize them and measure their energy and area. For on-chip
SRAM buffers, we use CACTI-P [75] to measure the energy and
area of the memory blocks. The 3D-stacked DRAM architecture
is based on the HMC stack [49, 50], the same as TETRIS, and
the bandwidth and access energy are adopted from that work.
Error modeling. For error modeling, we use Spectre SPICE
V6.1.3 to extract the noise behavior of MACCs via circuit
simulations. Thermal noise, computational error, and PVT
variations are considered based on details in Section 5. We
implement the extracted hardware error models and the
corresponding mathematical models using PyTorch v1.0.1 [76] and
integrate them into the Neural Network Distiller v0.3 framework [77]
for a fine-tuning pass over the evaluated benchmarks.
8.2 Experimental Results
8.2.1 Comparison with TETRIS
Iso-power performance and energy comparison. Figure 8
shows the speedup and energy reduction of BIHIWE over
TETRIS under the same on-chip power budget. On average,
BIHIWE delivers a 4.9× speedup over TETRIS. This significant
improvement is attributed to the use of wide mixed-signal
MS-BPMACCs in BIHIWE as opposed to PEs in TETRIS. The
wide bit-partitioned mixed-signal design of the MS-BPMACC in
BIHIWE lets us fit ≈ 5× more compute units within
the same power budget as TETRIS. The highest speedup
is observed in YOLOv3 and PTB-RNN, whose network
configurations favor the wide vectorized execution in BIHIWE
by better utilizing compute resources. The lowest speedup
is observed in ResNet-18, since its relatively small size leads
to under-utilization of compute resources in BIHIWE.
Figure 8 also shows the total energy reduction of
BIHIWE across the evaluated benchmarks as compared to
TETRIS. On average, BIHIWE yields a 2.4× energy reduction
over TETRIS, including energy for DRAM accesses, while
consuming the same on-chip power as TETRIS. CIFAR-10
enjoys the highest energy reduction, since BIHIWE is able
to take advantage of CIFAR-10’s smaller memory footprint to
maximize on-chip data reuse and reduce DRAM accesses. The
lowest energy reduction is observed in the RNN benchmarks,
PTB-RNN and PTB-LSTM, since the matrix-vector operations in these
benchmarks require a significant number of memory accesses,
diminishing the benefits from mixed-signal computations.
Figure 9: Energy breakdown of BIHIWE and TETRIS. (Stacked bar
chart; y-axis: normalized energy breakdown, 0% to 100%; one
BiHiwe bar and one TETRIS bar per benchmark; components:
Compute, On-chip Data Accesses, Interconnection, DRAM.)
Energy breakdown. Figure 9 shows the energy breakdown
normalized to TETRIS. Energy breakdown is reported across
four major architectural components: (1) on-chip compute
units, (2) on-chip memory (buffers and register file), (3)
interconnect, and (4) 3D-stacked DRAM. DRAM accesses
account for the highest portion of the energy in BIHIWE,
since BIHIWE significantly reduces the on-chip compute
energy. While BIHIWE has a significantly larger number
of compute resources compared to TETRIS, the number of
DRAM accesses remains almost the same. This is because the
statically-scheduled bus allows data to be multicast or broadcast
across multiple cores in BIHIWE without significantly
increasing the number of DRAM accesses. Furthermore, the
statically-scheduled bus offers the BIHIWE compiler stack
the freedom to optimize the partitioning of computations across
cores. Most layers in the benchmarks benefit from partitioning
the different inputs in a single batch (batch size is 16) across
BIHIWE cores and broadcasting weights, which is not explored
in TETRIS. As a result, these networks incur fewer DRAM
accesses. The breakdown of energy consumption varies with
the type of computations required by the DNN as well as the
degree of data-reuse. Benchmarks PTB-RNN and PTB-LSTM
are recurrent neural networks that perform large matrix-vector
operations and require significant DRAM accesses for
weights. Therefore, PTB-RNN and PTB-LSTM use more energy
for DRAM accesses compared to other benchmarks.
Figure 10: Iso-area comparison with TETRIS. (Bar chart; y-axis:
improvement over scaled-up TETRIS; bars: speedup and energy
reduction for each benchmark and the geometric mean. Legible
values as speedup, energy reduction: ResNet-50 3.5×, 2.7×;
VGG-16 3.2×, 2.9×; VGG-19 3.4×, 3.1×; YOLOv3 2.7×, 3.2×;
PTB-RNN 2.5×, 1.7×; PTB-LSTM 2.1×, 1.6×; GEOMEAN 3.02×, 2.62×.)
Figure 11: Performance comparison to GPUs. (Bar chart; y-axis:
speedup over Titan Xp-INT8; series: Titan Xp-INT8, RTX
2080TI-INT8, BiHiwe-INT8; one group per benchmark and the
geometric mean.)
Unlike the fully-digital PEs in TETRIS that perform a
single operation in a cycle, BIHIWE uses MS-WAGGs, which
perform wide vectorized operations that are crucial in BIHIWE
to amortize the high cost of ADCs. As shown in Table 1,
each MACC operation in BIHIWE consumes 5.4× less
energy compared to TETRIS. The output-stationary dataflow
enabled by capacitive accumulation, together with the systolic
organization of MS-WAGGs in each core of BIHIWE, which
eliminates the need for register files unlike TETRIS, leads to
a 4.4× reduction in on-chip data movement on average.
Iso-area comparison with TETRIS. We compare the total
runtime and energy of BIHIWE with a scaled-up version of
TETRIS that matches BIHIWE’s area. Figure 10 shows the
results for the workloads. Scaling up the compute resources
in TETRIS by 2.25× to match the chip area of BIHIWE results
in a sub-linear increase in performance of ≈ 60%. This
improvement in performance comes at a cost of reduced
energy-efficiency due to an increase in memory accesses to feed the
additional compute resources. The trends in speedup and
energy reduction remain the same as in the iso-power comparison, with the
exception of ResNet-18, which now sees resource underutilization
in TETRIS after scaling up the number of compute resources.
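The iso-area numbers can be cross-checked with a little arithmetic. This is a rough consistency sketch only: it treats the ≈ 60% TETRIS gain as a uniform average and reuses the 4.87× geometric-mean speedup legible in Figure 8; the variable names are illustrative.

```python
baseline_vaults = 16
scaled_vaults = 36

# Resource scaling needed to match BIHIWE's area.
resource_scale = scaled_vaults / baseline_vaults          # 2.25

# The scaled-up TETRIS is about 60% faster than the baseline TETRIS.
tetris_gain = 1.60

# Performance gained per unit of added resources (1.0 would be linear scaling).
scaling_efficiency = tetris_gain / resource_scale         # about 0.71

# Iso-power geometric-mean speedup of BIHIWE over baseline TETRIS (Figure 8).
iso_power_speedup = 4.87

# Expected speedup over the scaled-up TETRIS if the 60% gain applied uniformly.
expected_iso_area_speedup = iso_power_speedup / tetris_gain  # about 3.04
```

The result of roughly 3.04× is close to the 3.02× geometric mean legible in Figure 10, so the two comparisons are mutually consistent under this approximation.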
8.2.2 Comparison to GPUs
Figure 11 compares the performance of BIHIWE with Titan Xp
and RTX 2080 TI. RTX 2080 TI is based on Nvidia’s latest
architecture, Turing. For a fair comparison, we enable vectorized
8-bit operations and optimized GPU compilations. The results
are normalized to Titan Xp. BIHIWE, on average, yields a 70%
speedup over the Titan Xp GPU and performs 15% slower than
RTX 2080 TI. Convolutional networks require a large number
of matrix-matrix multiplications that are well-suited for tensor
cores, leading to RTX 2080 TI outperforming both
BIHIWE and Titan Xp. VGG-16 and VGG-19 see the maximum
benefits. However, BIHIWE outperforms the RTX 2080 TI GPU
in PTB-RNN and PTB-LSTM by 11.2× and 10.6×, respectively.
These RNN networks require matrix-vector multiplications,
which are particularly suitable for the wide vectorized
operations supported in BIHIWE’s MS-WAGGs but are not the best
case for tensor cores. In terms of performance-per-Watt,
BIHIWE outperforms both Titan Xp and RTX 2080 TI GPUs
by a large margin, 66.5× and 33.1×, respectively.
8.2.3 Comparison with Other Accelerators
We also compare the power efficiency (GOPS/s/Watt) and
area efficiency (GOPS/s/mm2) of BIHIWE with other recent
digital and analog accelerators. Because raw
performance/energy numbers for specific workloads are not available, we
use these metrics, consistent with the comparisons in
recent designs [21, 71, 78]. Figure 12 depicts the peak power
and area efficiency results. On average for the evaluated
benchmarks, BIHIWE achieves 72% of its peak efficiency. This
information is not available in the publications for the other designs.
Digital systolic: Google TPU [26]. In comparison with TPU,
which also uses a systolic design, BIHIWE delivers 4.5× higher
peak power efficiency and almost the same area efficiency.
Leveraging the wide, interleaved, and bit-partitioned
arithmetic with its switched-capacitor implementation, the
BIHIWE architecture reduces the cost of MACC operations
significantly compared with TPU, which uses 8-bit digital
logic, leading to a significant improvement in power efficiency.
Mixed-signal CMOS: RedEye [35]. RedEye is an in-sensor
CNN accelerator based on mixed-signal CMOS technology
which also uses switched-capacitor circuitry for MACC
operations. Compared to RedEye, BIHIWE offers 5.5×
better power efficiency and 167× better area efficiency.
Utilizing the proposed wide, interleaved, and bit-partitioned
arithmetic amortizes the cost of the ADC in BIHIWE by reducing
its required resolution and sampling rate, leading to a significant
reduction in ADC power and area, in contrast to RedEye.
Analog memristive designs [32, 71]. Prior work in
ISAAC [32] and PipeLayer [71] has explored analog
memristive technology for DNN acceleration, which integrates
both compute and storage within the same die, and offers
higher compute density compared to traditional analog CMOS
technology. However, this increase in compute density comes
at the cost of reduced power-efficiency. Generally, memristive
designs perform computations in the current domain,
requiring costly ADCs to sample the current-domain
signals at the same rate as the compute/storage for memristors.
PipeLayer significantly reduces this cost. Overall, compared
to ISAAC and PipeLayer, BIHIWE improves the power
efficiency by 3.6× and 9.6×, respectively.
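The quoted efficiency ratios can be recomputed from the coordinates annotated in Figure 12. The sketch below assumes the pairing of design names with (area efficiency, power efficiency) coordinates as read from the figure annotations; the helper `ratio_vs` is a hypothetical name.

```python
# (area efficiency in GOPS/s/mm2, power efficiency in GOPS/s/Watt), as annotated
# in Figure 12. The name-to-coordinate pairing is an assumption read off the figure.
points = {
    "RedEye": (1.6, 973.0),
    "TPU": (277.9, 1226.6),
    "BiHiwe": (267.9, 5461.3),
    "ISAAC": (1867.2, 1522.8),
    "PipeLayer": (5940.0, 571.6),
}


def ratio_vs(baseline: str, design: str = "BiHiwe"):
    """Return (area_ratio, power_ratio) of `design` relative to `baseline`."""
    d_area, d_power = points[design]
    b_area, b_power = points[baseline]
    return d_area / b_area, d_power / b_power


ratios = {name: ratio_vs(name) for name in points if name != "BiHiwe"}
for name, (area_ratio, power_ratio) in ratios.items():
    print(f"{name}: area x{area_ratio:.2f}, power x{power_ratio:.2f}")
```

With these coordinates the power-efficiency ratios come out at roughly 4.45 (TPU), 3.59 (ISAAC), and 9.55 (PipeLayer), and the area ratio over RedEye is about 167, in line with the figures quoted in the text. The RedEye power ratio evaluates to about 5.6, marginally above the quoted 5.5×.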
8.2.4 Design Space Explorations
Design space exploration for bit-partitioning. To evaluate
the effectiveness of bit-partitioning, we perform a design
space exploration with various bit-partitioned options.
Figure 13 shows the reduction in energy and area compared
to an 8-bit×8-bit design for a dot product of two
32-element vectors. The other design points also perform
8-bit×8-bit MACC operations while utilizing our wide and
interleaved bit-partitioned arithmetic. As depicted, the design
with 2-bit partitioning strikes the best balance in energy and
area with the switched-capacitor design of MACC units at the
45 nm CMOS node. The difference between 2-bit and 1-bit
is that single-bit partitioning quadratically increases the
number of low-bitwidth MACCs from 16 (2-bit partitioning)
to 64 (1-bit partitioning) to support 8-bit operations. This
imposes disproportionate overhead that outweighs the benefit
of decreasing each MACC unit’s area and energy.
Figure 12: Comparison with other accelerators. (Scatter plot;
x-axis: area efficiency (GOPS/s/mm2, log scale); y-axis: power
efficiency (GOPS/s/Watt, log scale). Annotated points as (area,
power): RedEye (1.6, 973.0), TPU (277.9, 1226.6), BiHiwe (267.9,
5461.3), ISAAC (1867.2, 1522.8), PipeLayer (5940, 571.6).)
Figure 13: Design space exploration for bit-partitioning.
(x-axis: reduction in area over 8-bit partitioning; y-axis:
reduction in energy over 8-bit partitioning; points: 8-bit, 4-bit,
2-bit (marked optimal), and 1-bit partitioning.)
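The quadratic growth in the number of low-bitwidth operations follows directly from the bit-partitioned form of the dot product. The sketch below assumes unsigned 8-bit operands and an operand width divisible by the partition width; `split_bits` and `bit_partitioned_dot` are hypothetical helper names. It computes a dot product as a shifted sum of wide, low-bitwidth dot products and counts how many such groups are needed.

```python
from typing import List, Sequence, Tuple


def split_bits(value: int, total_bits: int, part_bits: int) -> List[int]:
    """Split an unsigned integer into part_bits-wide chunks, least significant first."""
    mask = (1 << part_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, part_bits)]


def bit_partitioned_dot(
    xs: Sequence[int], ws: Sequence[int], total_bits: int = 8, part_bits: int = 2
) -> Tuple[int, int]:
    """Dot product as shifted sums of low-bitwidth dot products.

    Returns (result, number_of_low_bitwidth_groups).
    """
    parts = total_bits // part_bits
    x_parts = [split_bits(x, total_bits, part_bits) for x in xs]
    w_parts = [split_bits(w, total_bits, part_bits) for w in ws]
    total = 0
    groups = 0
    for j in range(parts):        # partition index of the inputs
        for k in range(parts):    # partition index of the weights
            # One wide, low-bitwidth dot product across all vector elements.
            partial = sum(xp[j] * wp[k] for xp, wp in zip(x_parts, w_parts))
            total += partial << (part_bits * (j + k))
            groups += 1
    return total, groups


# Two 32-element vectors of unsigned 8-bit values (arbitrary test data).
xs = [(13 * i + 5) % 256 for i in range(32)]
ws = [(7 * i + 3) % 256 for i in range(32)]
reference = sum(x * w for x, w in zip(xs, ws))
result_2bit, groups_2bit = bit_partitioned_dot(xs, ws, part_bits=2)
result_1bit, groups_1bit = bit_partitioned_dot(xs, ws, part_bits=1)
```

Both partitionings reproduce the reference result exactly; 2-bit partitioning needs (8/2)² = 16 groups, while 1-bit partitioning needs (8/1)² = 64.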
Design space exploration for MS-BPMACC configuration.
The number of accumulation cycles (m) before the
A/D conversion and the number of MACC units (n) are the two
main parameters of the MS-BPMACC. They define the resolution
and the sample rate of the ADC, which determine its power.
Figure 14 shows the design space exploration for different
configurations of the MS-BPMACC. Under a fixed power budget
of 2W for compute units, we measure the total runtime and
energy of BIHIWE over the evaluated workloads, normalized
to TETRIS. As shown in Figure 14, increasing the
number of MACCs limits the number of accumulation cycles,
which leads to ADCs with high sample rates.
Using high sample-rate ADCs significantly increases power,
making the design less efficient. On the other hand, increasing
the number of accumulation cycles limits the number of MACCs,
which restricts the number of MS-WAGGs that can be
integrated into the design under the given power budget. Overall,
the optimal design point that delivers the best performance and
energy is with eight MACC units and 32 accumulation cycles.
Figure 14: Design space exploration for MS-BPMACC. (x-axis:
energy reduction over TETRIS; y-axis: speedup over TETRIS;
points are labeled with (n, m) configurations, with the optimal
point marked.)
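The dependence of the ADC on n and m can be made concrete with a back-of-the-envelope calculation. This is only a sketch under stated assumptions: it assumes 2-bit × 2-bit unsigned products, one conversion every m cycles, the 500 MHz clock from Table 3, and an ADC that resolves the full worst-case accumulated value. The paper's actual ADC resolution may be lower, and `adc_requirements` is a hypothetical helper.

```python
from typing import Tuple


def adc_requirements(
    n: int, m: int, part_bits: int = 2, clock_hz: float = 500e6
) -> Tuple[int, float]:
    """Return (bits to represent the worst-case accumulated sum, conversions per second).

    n: number of low-bitwidth MACC units sharing one ADC.
    m: number of accumulation cycles before each A/D conversion.
    """
    max_product = ((1 << part_bits) - 1) ** 2      # 3 * 3 = 9 for 2-bit operands
    max_accumulated = n * m * max_product          # worst case over n units and m cycles
    bits_full_range = max_accumulated.bit_length() # bits needed to hold that value exactly
    conversions_per_second = clock_hz / m          # one conversion every m cycles
    return bits_full_range, conversions_per_second


bits_opt, rate_opt = adc_requirements(n=8, m=32)     # the reported optimal point
bits_wide, rate_wide = adc_requirements(n=512, m=1)  # many MACCs, no accumulation
```

Under these assumptions, the (n=8, m=32) point needs one conversion every 32 cycles (about 15.6 MS/s), whereas the (n=512, m=1) point converts every cycle (500 MS/s), which illustrates why wide configurations with few accumulation cycles push the ADC toward high sample rates.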
Design space exploration for clustered architecture.
BIHIWE uses a hierarchical architecture with multiple cores
in each vault. Having a larger number of small cores for each
vault yields increased utilization of compute resources, but
requires data transfer across cores. We explore the design space
with 1, 2, 4, and 8 cores per cluster. As Figure 15 shows,
BIHIWE with four cores per vault (the default configuration in
BIHIWE) strikes the best balance between speedup and energy
reduction. Performance increases as we increase the number
of cores per vault from 1 to 8. However, the 8-core
configuration results in a higher number of data accesses. Therefore,
the 4-core design point provides the optimal balance.
8.2.5 Evaluation of Circuitry Non-Idealities
Table 4 shows the Top-1 accuracy with non-idealities,
the accuracy after fine-tuning, the ideal accuracy, and the
final loss in accuracy. As shown in Table 4, some of the
networks, namely AlexNet and ResNet-18, are more sensitive
to the non-idealities, leading to a higher initial accuracy
degradation. To recover the accuracy loss due to the circuitry
non-idealities, we perform a fine-tuning step for a few
epochs. After this fine-tuning step, the accuracy
loss of the CIFAR-10, ResNet-18, and ResNet-50 networks
is recovered almost entirely (loss is at most 0.04%). Among
these networks, CIFAR-10 and ResNet-50 are more robust
to non-idealities. The accuracy loss for the other networks is
below 0.5%, with AlexNet showing the maximum
loss. The final two networks, namely PTB-RNN and PTB-LSTM,
perform character-level and word-level language modeling,
respectively. The accuracy for these two networks is measured
in Bits-Per-Character (BPC) and Perplexity-per-Word (PPW),
respectively. Both PTB-RNN and PTB-LSTM recover all the loss
after fine-tuning. The final results after the fine-tuning step show
the effectiveness of this approach in recovering the accuracy
loss due to the non-idealities pertinent to analog computation.
Figure 15: Design space exploration for # core per cluster.
(x-axis: energy reduction over TETRIS; y-axis: speedup over
TETRIS; points: 1, 2, 4, and 8 cores, with the optimal point
marked.)
9. RELATED WORK
There is a large body of work on digital accelerators for
DNNs [10–31]. Mixed-signal acceleration has also been
explored previously for neural networks [34, 40] and is gaining
traction again for deep models [32, 33, 35–39, 41, 42]. This
paper fundamentally differs from these inspiring efforts as it
delves into the mathematics of basic operations in DNNs and
reformulates them into the wide, interleaved, and bit-partitioned
approach to overcome the challenges of mixed-signal
acceleration. By partitioning and re-aggregating the low-bitwidth
MACC operations, this paper addresses the limited range of
encoding and reduces the cost of cross-domain conversions.
Additionally, it combines the proposed mathematical
reformulation with switched-capacitor circuitry to share and delay
A/D conversions, which amortizes their cost and reduces their
rate, respectively. Below, we discuss the most related works.
Switched-capacitor design. Switched-capacitor circuits [43]
have a long history, having been mainly used for designing
amplifiers [79], A/D and D/A converters [80], and filters [81].
Similar to resistive circuits, they have been used even for the
previous generation of neural networks [34]. More recently,
they have also been used for matrix multiplication [82, 42],
which can benefit DNNs. This work takes inspiration from
these efforts but differs from them in that it defines and
leverages a wide, interleaved, and bit-partitioned reformulation
of DNN operations. Additionally, it offers a comprehensive
architecture that can accelerate a wide variety of DNNs.
Programmable mixed-signal accelerators. PROMISE [33]
offers a mixed-signal architecture that integrates analog units
within the SRAM memory blocks. RedEye [35] is a low-power
near-sensor mixed-signal accelerator that uses charge-domain
computations. These works do not offer wide interleavings
of bit-partitioned basic operations as described in this paper.
Fixed-functional mixed-signal accelerators. These accelerators
are designed for a specific DNN. Some focus on handwritten digit
classification [82, 83] or binarized mixed-signal acceleration
of CIFAR-10 images [38]. Another work focuses on accelerating
spiking neural networks [39]. In contrast, our design
is programmable and supports interleaved bit-partitioning.
Table 4: Accuracy before and after fine-tuning.
DNN Model  Dataset        Top-1 Accuracy          Top-1 Accuracy       Top-1 Accuracy  Accuracy Loss
                          (With non-idealities)   (After fine-tuning)  (Ideal)
AlexNet    Imagenet       53.12%                  56.64%               57.11%          0.47%
CIFAR-10   CIFAR-10       90.82%                  91.01%               91.03%          0.02%
GoogLeNet  Imagenet       67.15%                  68.39%               68.72%          0.33%
ResNet-18  Imagenet       66.91%                  68.96%               68.98%          0.02%
ResNet-50  Imagenet       74.5%                   75.21%               75.25%          0.04%
VGG-16     Imagenet       70.31%                  71.28%               71.46%          0.18%
VGG-19     Imagenet       73.24%                  74.20%               74.52%          0.32%
YOLOv3     Imagenet       75.92%                  77.1%                77.22%          0.21%
PTB-RNN    Penn TreeBank  1.1 BPC                 1.6 BPC              1.1 BPC         0.0 BPC
PTB-LSTM   Penn TreeBank  97 PPW                  170 PPW              97 PPW          0.0 PPW
Resistive memory accelerators. There is a large body of
work using resistive memory [32, 71, 78, 84–88]. We provided
a direct comparison to ISAAC [32] and PipeLayer [71].
ISAAC [32] most notably introduces the concept of temporally
bit-serial operations, also explored in PRIME [44], which
is augmented with a spike-based data scheme
in PipeLayer [71]. BIHIWE, in contrast, formulates a
partitioning that spatially groups lower-bitwidth MACCs
across different vector elements and performs them in parallel.
PRIME does not provide absolute measurements and its
simulated baseline is not available for a head-to-head
comparison. PRIME also uses multiple truncations that
change the mathematics. Conversely, our formulation does
not induce truncation or mathematical changes.
10. CONCLUSION
This work proposes wide, interleaved, and bit-partitioned
arithmetic to overcome two key challenges in mixed-signal
acceleration of DNNs: limited encoding range, and costly A/D
conversions. This bit-partitioned arithmetic enables rearrang-
ing the highly parallel MACC operations in modern DNNs into
wide low-bitwidth computations that are mapped efficiently to
mixed-signal units. Further, these units operate in the charge
domain using switched-capacitor circuitry and reduce the rate of
A/D conversions by accumulating partial results in the charge
domain. The resulting microarchitecture, named BIHIWE,
offers significant benefits over its state-of-the-art analog and
digital counterparts. These encouraging results suggest that
the combination of mathematical insights with architectural
innovations can enable new avenues in DNN acceleration.
References
[1] J. Niehues, N.-Q. Pham, T.-L. Ha, M. Sperber, and
A. Waibel. Low-Latency Neural Speech Translation.
ArXiv e-prints, August 2018.
[2] J. Mo and J. Sattar. SafeDrive: Enhancing Lane
Appearance for Autonomous and Assisted Driving
Under Limited Visibility. ArXiv e-prints, July 2018.
[3] R. Li, Y. Shu, J. Su, H. Feng, and J. Wang. Using deep
Residual Network to search for galaxy-Lyα
emitter lens candidates based on spectroscopic-selection.
ArXiv e-prints, July 2018.
[4] D. Rohde, S. Bonner, T. Dunlop, F. Vasile, and
A. Karatzoglou. RecoGym: A Reinforcement Learning
Environment for the problem of Product Recommenda-
tion in Online Advertising. ArXiv e-prints, August 2018.
[5] I. Grabec, E. Švegl, and M. Sok. Development of a
sensory-neural network for medical diagnosing. ArXiv
e-prints, July 2018.
[6] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant,
Karthikeyan Sankaralingam, and Doug Burger. Dark
silicon and the end of multicore scaling. In ISCA, 2011.
[7] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki.
Toward dark silicon in servers. IEEE Micro, 31(4):6–15,
July–Aug. 2011.
[8] Ganesh Venkatesh, Jack Sampson, Nathan Goulding,
Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-
Martinez, Steven Swanson, and Michael Bedford Taylor.
Conservation cores: Reducing the energy of mature
computations. In ASPLOS, 2010.
[9] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan,
Bingjun Xiao, and Jason Cong. Optimizing fpga-based
accelerator design for deep convolutional neural
networks. In FPGA, 2015.
[10] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and
Doug Burger. Neural acceleration for general-purpose
approximate programs. To appear in Commun. ACM,
2013.
[11] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang
He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu,
Ninghui Sun, et al. Dadiannao: A machine-learning
supercomputer. In MICRO, 2014.
[12] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and
Christos Kozyrakis. Tetris: Scalable and efficient neural
network acceleration with 3d memory. In ASPLOS, 2017.
[13] Alberto Delmas, Sayeh Sharify, Patrick Judd, and An-
dreas Moshovos. Tartan: Accelerating fully-connected
and convolutional layers in deep learning networks by
exploiting numerical precision variability. arXiv, 2017.
[14] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik
Sharma, Amir Yazdanbakhsh, Joon Kim, and Hadi
Esmaeilzadeh. TABLA: A unified template-based
framework for accelerating statistical machine learning.
In HPCA, 2016.
[15] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan,
Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji
Chen. Cambricon-x: An accelerator for sparse neural
networks. In MICRO, 2016.
[16] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor
Aamodt, Natalie Enright Jerger, and Andreas Moshovos.
Cnvlutin: ineffectual-neuron-free deep neural network
computing. In ISCA, 2016.
[17] Patrick Judd, Jorge Albericio, Tayler Hetherington,
Tor M Aamodt, and Andreas Moshovos. Stripes: Bit-
serial deep neural network computing. In MICRO, 2016.
[18] Hardik Sharma, Jongse Park, Divya Mahajan, Em-
manuel Amaro, Joon Kim, Chenkai Shao, Asit Misra,
and Hadi Esmaeilzadeh. From high-level deep neural
models to fpgas. In MICRO, 2016.
[19] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael
Papamichael, Adrian Caulfield, Todd Massengil, Ming
Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman,
Christian Boehn, Oren Firestein, Alessandro Forin,
Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle
Holohan, Tamas Juhasz, Ratna Kumar Kovvuri, Sitaram
Lanka, Friedel van Megen, Dima Mukhortov, Prerak
Patel, Steve Reinhardt, Adam Sapek, Raja Seera, Balaji
Sridharan, Lisa Woods, Phillip Yi-Xiao, Ritchie Zhao,
and Doug Burger. Accelerating persistent neural
networks at datacenter scale. In HotChips, 2017.
[20] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara,
Antonio Puglielli, Rangharajan Venkatesan, Brucek
Khailany, Joel Emer, Stephen W Keckler, and William J
Dally. SCNN: An Accelerator for Compressed-sparse
Convolutional Neural Networks. In ISCA, 2017.
[21] Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca
Benini. Yodann: An ultra-low power convolutional
neural network accelerator based on binary weights.
arXiv, 2016.
[22] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan
Pedram, Mark A Horowitz, and William J Dally. Eie:
efficient inference engine on compressed deep neural
network. In ISCA, 2016.
[23] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss:
A spatial architecture for energy-efficient dataflow for
convolutional neural networks. In ISCA, 2016.
[24] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne
Sze. Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks.
JSSC, 2017.
[25] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar
Yalamanchili, and Saibal Mukhopadhyay. Neurocube:
A programmable digital neuromorphic architecture with
high-density 3d memory. In Computer Architecture
(ISCA), 2016 ACM/IEEE 43rd Annual International
Symposium on, pages 380–392. IEEE, 2016.
[26] Norman P Jouppi, Cliff Young, Nishant Patil, David
Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah
Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-
datacenter performance analysis of a tensor processing
unit. In ISCA, 2017.
[27] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang,
Chengyong Wu, Yunji Chen, and Olivier Temam.
Diannao: a small-footprint high-throughput accelerator
for ubiquitous machine-learning. In ASPLOS, 2014.
[28] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen
Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh.
Bit fusion: Bit-level dynamically composable architecture
for accelerating deep neural networks. In ISCA, 2018.
[29] Vahideh Akhlaghi, Amir Yazdanbakhsh, Kambiz Samadi,
Hadi Esmaeilzadeh, and Rajesh K. Gupta. Snapea:
Predictive early activation for reducing computation in
deep convolutional neural networks. In ISCA, 2018.
[30] Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan,
Michael Pellauer, and Christopher W Fletcher. Ucnn:
Exploiting computational reuse in deep neural networks
via weight repetition. arXiv preprint arXiv:1804.06508,
2018.
[31] Jinmook Lee, Changhyeon Kim, Sanghoon Kang,
Dongjoo Shin, Sangyeob Kim, and Hoi-Jun Yoo. Unpu:
A 50.6 tops/w unified deep neural network accelerator
with 1b-to-16b fully-variable weight bit-precision. In
ISSCC, 2018.
[32] Ali Shafiee, Anirban Nag, Naveen Muralimanohar,
Rajeev Balasubramonian, John Paul Strachan, Miao
Hu, R Stanley Williams, and Vivek Srikumar. Isaac: A
convolutional neural network accelerator with in-situ
analog arithmetic in crossbars. In ISCA, 2016.
[33] Prakalp Srivastava, Mingu Kang, Sujan K Gonugondla,
Sungmin Lim, Jungwook Choi, Vikram Adve,
Nam Sung Kim, and Naresh Shanbhag. Promise: An
end-to-end design of a programmable mixed-signal
accelerator for machine-learning algorithms. In 2018
ACM/IEEE 45th Annual International Symposium on
Computer Architecture (ISCA). IEEE, 2018.
[34] YP Tsividis and D Anastassiou. Switched-capacitor neural
networks. Electronics Letters, 23(18):958–959, 1987.
[35] Robert LiKamWa, Yunhui Hou, Julian Gao, Mia
Polansky, and Lin Zhong. Redeye: analog convnet
image sensor architecture for continuous mobile vision.
In ACM SIGARCH Computer Architecture News,
volume 44, pages 255–266. IEEE Press, 2016.
[36] Daniel Bankman and Boris Murmann. Passive charge
redistribution digital-to-analogue multiplier. Electronics
Letters, 51(5):386–388, 2015.
[37] E. H. Lee and S. S. Wong. Analysis and design of
a passive switched-capacitor matrix multiplier for
approximate computing. IEEE Journal of Solid-State
Circuits, 52(1):261–271, Jan 2017. ISSN 0018-9200.
doi: 10.1109/JSSC.2016.2599536.
[38] Daniel Bankman, Lita Yang, Bert Moons, Marian
Verhelst, and Boris Murmann. An always-on 3.8 µJ/86%
cifar-10 mixed-signal binary cnn processor with all
memory on chip in 28nm cmos. In Solid-State Circuits
Conference-(ISSCC), 2018 IEEE International, pages
222–224. IEEE, 2018.
[39] Fred N Buhler, Peter Brown, Jiabo Li, Thomas Chen,
Zhengya Zhang, and Michael P Flynn. A 3.43 tops/w
48.9 pj/pixel 50.1 nj/classification 512 analog neuron
sparse coding neural network with on-chip learning and
classification in 40nm cmos. In VLSI Circuits, 2017
Symposium on, pages C30–C31. IEEE, 2017.
[40] Renée St. Amant, Amir Yazdanbakhsh, Jongse Park,
Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi,
Luis Ceze, and Doug Burger. General-purpose code
acceleration with limited-precision analog computation.
In ISCA, 2014.
[41] Jintao Zhang, Zhuo Wang, and Naveen Verma. 18.4
a matrix-multiplying adc implementing a machine-
learning classifier directly with data conversion. In
Solid-State Circuits Conference-(ISSCC), 2015 IEEE
International, pages 1–3. IEEE, 2015.
[42] Edward H Lee and S Simon Wong. Analysis and Design
of a Passive Switched-Capacitor Matrix Multiplier for
Approximate Computing. IEEE Journal of Solid-State
Circuits, 52(1):261–271, 2017.
[43] Paul R Gray, Paul Hurst, Robert G Meyer, and Stephen
Lewis. Analysis and design of analog integrated circuits.
Wiley, 2001.
[44] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen
Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. Prime:
A novel processing-in-memory architecture for neural
network computation in reram-based main memory. In
ISCA, 2016.
[45] Sayeh Sharify, Alberto Delmas Lascorz, Patrick Judd,
and Andreas Moshovos. Loom: Exploiting weight and
activation precisions to accelerate convolutional neural
networks. arXiv, 2017.
[46] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz,
and Christos Kozyrakis. Tetris: Scalable and efficient
neural network acceleration with 3d memory.
https://github.com/stanford-mast/nn_dataflow, 2017.
[47] Yuanfang Li and Ardavan Pedram. Caterpillar: Coarse
grain reconfigurable architecture for accelerating the
training of deep neural networks. In Application-specific
Systems, Architectures and Processors (ASAP), 2017
IEEE 28th International Conference on, pages 1–10.
IEEE, 2017.
[48] Himani Upadhyay and Shubhajit Roy Chowdhury.
A high speed and low power 8 bit x 8 bit multiplier
design using novel two transistor (2t) xor gates.
Journal of Low Power Electronics, 01 2015. doi:
10.1166/jolpe.2015.1362.
[49] Hybrid Memory Cube Consortium et al. Hybrid memory
cube specification 1.0. Last Revision Jan, 2013.
[50] Joe Jeddeloh and Brent Keeth. Hybrid memory cube new
dram architecture increases density and performance.
In VLSI Technology (VLSIT), 2012 Symposium on, pages
87–88. IEEE, 2012.
[51] Amir Yazdanbakhsh, Hajar Falahati, Philip J. Wolfe,
Kambiz Samadi, Hadi Esmaeilzadeh, and Nam Sung
Kim. GANAX: A Unified SIMD-MIMD Acceleration
for Generative Adversarial Network. In ISCA, 2018.
[52] Mohammed Ismail and Terri Fiez. Analog VLSI: signal
and information processing, volume 166. McGraw-Hill
New York, 1994.
[53] Vaibhav Tripathi and Boris Murmann. Mismatch
characterization of small metal fringe capacitors. IEEE
Transactions on Circuits and Systems I: Regular Papers,
61(8):2236–2242, 2014.
[54] Yasuko Eckert, Nuwan Jayasena, and Gabriel H Loh.
Thermal feasibility of die-stacked processing in memory.
2014.
[55] Facebook AI Research. Caffe2. https://caffe2.ai/.
[56] Alex Krizhevsky. One weird trick for parallelizing
convolutional neural networks. arXiv, 2014.
[57] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and
L. Fei-Fei. Imagenet: A large-scale hierarchical image
database. In CVPR, 2009. URL http://image-net.org/.
[58] Karen Simonyan and Andrew Zisserman. Very deep
convolutional networks for large-scale image recognition.
arXiv, 2014.
[59] Itay Hubara, Matthieu Courbariaux, Daniel Soudry,
Ran El-Yaniv, and Yoshua Bengio. Quantized neural
networks: Training neural networks with low precision
weights and activations. arXiv, 2016.
[60] Alex Krizhevsky and Geoffrey Hinton. Learning multiple
layers of features from tiny images. Computer Science
Department, University of Toronto, Tech. Rep, 2009.
[61] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre
Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, and Andrew Rabinovich.
Going deeper with convolutions. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 1–9, 2015.
[62] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. Deep residual learning for image recognition. In
CVPR, 2016.
[63] Joseph Redmon and Ali Farhadi. Yolov3: An incremental
improvement. arXiv preprint arXiv:1804.02767,
2018.
[64] Mitchell P Marcus, Mary Ann Marcinkiewicz, and
Beatrice Santorini. Building a large annotated corpus of
english: The penn treebank. Computational linguistics,
1993.
[65] Sepp Hochreiter and Jürgen Schmidhuber. Long
short-term memory. Neural computation, 1997.
[66] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen,
Yuxin Wu, and Yuheng Zou. Dorefa-net: Training
low bitwidth convolutional neural networks with low
bitwidth gradients. arXiv, 2016.
[67] Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and
Debbie Marr. WRPN: wide reduced-precision networks.
arXiv, 2017.
[68] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight
networks. arXiv, 2016.
[69] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and
Gang Hua. Lq-nets: Learned quantization for highly
accurate and compact deep neural networks. arXiv
preprint arXiv:1807.10029, 2018.
[70] NVIDIA TensorRT 5.1. https://developer.nvidia.com/tensorrt.
[71] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen.
Pipelayer: A pipelined reram-based accelerator for deep
learning. In High Performance Computer Architecture
(HPCA), 2017 IEEE International Symposium on, pages
541–552. IEEE, 2017.
[72] NCSU. Freepdk45, 2018. URL
https://www.eda.ncsu.edu/wiki/FreePDK45.
[73] B. Murmann. ADC Performance Survey 1997-2016.
[Online]. Available. URL
http://web.stanford.edu/~murmann/adcsurvey.html.
[74] Pieter Harpe. A 0.0013 mm² 10b 10MS/s sar adc with
a 0.0048 mm² 42dB-rejection passive fir filter. In 2018
IEEE Custom Integrated Circuits Conference, CICC
2018. Institute of Electrical and Electronics Engineers
Inc., 2018.
[75] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P.
Jouppi. CACTI-P: Architecture-level Modeling for
SRAM-based Structures with Advanced Leakage
Reduction Techniques. In ICCAD, 2011.
[76] Adam Paszke, Sam Gross, Soumith Chintala, Gregory
Chanan, Edward Yang, Zachary DeVito, Zeming Lin,
Alban Desmaison, Luca Antiga, and Adam Lerer.
Automatic differentiation in pytorch. In NIPS-W, 2017.
[77] Neta Zmora, Guy Jacob, and Gal Novik.
Neural network distiller, June 2018. URL
https://doi.org/10.5281/zenodo.1297430.
[78] Yun Long, Taesik Na, and Saibal Mukhopadhyay.
Reram-based processing-in-memory architecture
for recurrent neural network acceleration. IEEE
Transactions on Very Large Scale Integration (VLSI)
Systems, (99):1–14, 2018.
[79] Jan Crols and Michel Steyaert. Switched-opamp: An
approach to realize full cmos switched-capacitor circuits
at very low power supply voltages. IEEE Journal of
Solid-State Circuits, 29(8):936–942, 1994.
[80] John K Fiorenza, Todd Sepke, Peter Holloway,
Charles G Sodini, and Hae-Seung Lee. Comparator-
based switched-capacitor circuits for scaled cmos
technologies. IEEE Journal of Solid-State Circuits, 41
(12):2658–2668, 2006.
[81] Robert W Brodersen, Paul R Gray, and David A Hodges.
Mos switched-capacitor filters. Proceedings of the IEEE,
67(1):61–75, 1979.
[82] Daniel Bankman and Boris Murmann. An 8-bit, 16
input, 3.2 pj/op switched-capacitor dot product circuit
in 28-nm fdsoi cmos. In Solid-State Circuits Conference
(A-SSCC), 2016 IEEE Asian, pages 21–24. IEEE, 2016.
[83] Daisuke Miyashita, Shouhei Kousai, Tomoya Suzuki,
and Jun Deguchi. A neuromorphic chip optimized for
deep learning and cmos technology with time-domain
analog and digital mixed-signal processing. IEEE Journal
of Solid-State Circuits, 52(10):2679–2689, 2017.
[84] Ximing Qiao, Xiong Cao, Huanrui Yang, Linghao
Song, and Hai Li. Atomlayer: a universal reram-based
cnn accelerator with atomic layer computation. In
Proceedings of the 55th Annual Design Automation
Conference, page 103. ACM, 2018.
[85] Houxiang Ji, Linghao Song, Li Jiang, Hai Helen Li, and
Yiran Chen. Recom: An efficient resistive accelerator
for compressed deep neural networks. In Design,
Automation & Test in Europe Conference & Exhibition
(DATE), 2018, pages 237–240. IEEE, 2018.
[86] Bing Li, Linghao Song, Fan Chen, Xuehai Qian, Yiran
Chen, and Hai Helen Li. Reram-based accelerator for
deep learning. In Design, Automation & Test in Europe
Conference & Exhibition (DATE), 2018, pages 815–820.
IEEE, 2018.
[87] Lerong Chen, Jiawen Li, Yiran Chen, Qiuping Deng,
Jiyuan Shen, Xiaoyao Liang, and Li Jiang. Accelerator-
friendly neural-network training: learning variations and
defects in rram crossbar. In Proceedings of the Conference
on Design, Automation & Test in Europe, pages 19–24.
European Design and Automation Association, 2017.
[88] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen
Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. Prime:
A novel processing-in-memory architecture for neural
network computation in reram-based main memory.
In ACM SIGARCH Computer Architecture News,
volume 44, pages 27–39. IEEE Press, 2016.
