An Energy-Efficient VCO-Based Matrix Multiplier Block to Support On-Chip
  Image Analysis by Banerjee, Imon & Sanyal, Arindam
arXiv:1612.03361v1 [cs.ET] 11 Dec 2016
An Energy-Efficient VCO-Based Matrix Multiplier
Block to Support On-Chip Image Analysis
Imon Banerjee and Arindam Sanyal
Abstract—Images are typically represented as uniformly sampled
data in the form of a matrix of pixels/voxels. Therefore, matrix
multiply-and-accumulate (MAC) forms the core of most state-of-the-art
image analysis algorithms. While digital implementation
of MAC has generally been the preferred approach, its high power
consumption is an impediment to adopting it for medical image
analysis. In this work, we present a time-domain signal processing
architecture which performs MAC operations with 7-bit accuracy
while consuming 400X lower energy than a digital implementation.
The proposed architecture performs analog computation using
mostly digital circuits and is suitable for scaled CMOS technolo-
gies. The proposed time-domain MAC architecture is expected
to play a central role in empowering the advancement of various
on-chip image analysis operations.
Index Terms—On-chip image analysis, voltage-controlled os-
cillator, time-domain, matrix multiplication
I. INTRODUCTION
In the digital data era, 2D/3D image analysis operations
(e.g. registration, feature calculation, interpolation, fusion) are
the core processing block of a wide range of automated
systems, including computer aided diagnosis (CAD) [1], [2].
For example, in the modern CAD systems, an important
image analysis operation is the co-registration of positron
emission tomography (PET) image and magnetic resonance
image (MRI) which combines functional information from
PET images with anatomical information in MR images. The
efficient co-registration of PET and MRI images can pave the
way for a better understanding of physiological and disease
mechanisms in pre-clinical and clinical settings.
The co-registration algorithm applies rigid registration [3]
to map each voxel (x1, y1, z1) in the MR image into the co-
ordinate space of PET image (x2, y2, z2), provided that the
images of both modalities are acquired from the same subject
and the scanning processes have not introduced nonrigid
spatial transformations. To illustrate the algorithmic facets,
we present a workflow diagram in Fig. 1 that takes MRI and
PET images as inputs and uses rigid registration to compute
the co-registered MRI/PET image. The rigid registration is
done through multiply-and-accumulate (MAC) operations performed
on the original voxel locations in 3D space (x, y, z) with a
4 × 4 translation matrix (U), which represents translation (t)
and scaling (S), and 4 × 4 rotation matrices (R), which represent
rotation (see (1)).
I. Banerjee is with the Laboratory of Quantitative Imaging, Stan-
ford University School of Medicine, Stanford 94305, CA, USA. (e-mail:
imonb@stanford.edu)
A. Sanyal is with the Department of Electrical Engineering, The State
University of New York at Buffalo, Buffalo 14260, NY, USA. (e-mail:
arindams@buffalo.edu)
The transformation matrix (M = translation + rotation) for
each voxel is derived automatically by analyzing the difference
between source and target data of the registration, which can
again be described as a sequence of MAC operations. Note
that MRI and PET images of reasonable accuracy contain
more than 512×512×50 voxels. Therefore, a
standard registration operation requires more than 13,107,200 MAC
operations, which can consume a significant amount of power
and computation time. Similarly, other image analysis operations,
such as object segmentation and feature extraction, can be
decomposed into a set of sequential MAC operations. Hence,
reduction in power consumption during MAC operations is
a significant challenge that has to be addressed in order to
develop on-chip image analysis blocks.
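To make the MAC count concrete, the following Python sketch applies a 4 × 4 homogeneous transform of the kind given in (1) to a single voxel and counts the multiply-and-accumulate steps. It is only an illustration under stated assumptions: the function names, the example voxel and the translation values are hypothetical, and the row-vector convention [x y z 1]·M is assumed because (1) places the translation terms in the bottom row of U.

```python
import math

def scale_translate(sx, sy, sz, tx, ty, tz):
    """U of (1): scaling on the diagonal, translation in the bottom row."""
    return [[sx, 0.0, 0.0, 0.0],
            [0.0, sy, 0.0, 0.0],
            [0.0, 0.0, sz, 0.0],
            [tx, ty, tz, 1.0]]

def rotate_z(theta):
    """Rotation about the z axis, the last matrix of (1)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def transform_voxel(voxel, m):
    """Multiply the row vector [x, y, z, 1] by a 4x4 matrix using explicit MACs.

    Returns the mapped (x, y, z) and the number of MAC operations used.
    """
    row = [voxel[0], voxel[1], voxel[2], 1.0]
    out, macs = [], 0
    for col in range(4):
        acc = 0.0
        for k in range(4):
            acc += row[k] * m[k][col]  # one multiply-and-accumulate
            macs += 1
        out.append(acc)
    return out[:3], macs

N_VOXELS = 512 * 512 * 50  # image size quoted in the text

# Hypothetical example: unit scaling with a translation of (10, 0, -5).
mapped, macs_per_voxel = transform_voxel(
    (1.0, 2.0, 3.0), scale_translate(1.0, 1.0, 1.0, 10.0, 0.0, -5.0))
total_macs = N_VOXELS * macs_per_voxel
```

In this sketch one transform costs 16 MACs per voxel, so the voxel count quoted above is a lower bound on the work for a full image; a hardware implementation could skip the multiplications by the constant 0 and 1 entries.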
Spurred on by Moore’s law and CMOS technology scaling,
the general approach towards CAD has been to perform
all mathematical operations digitally using computers. While
digital computation can provide high accuracy, its power
consumption has been too high to permit portable
solutions. With CMOS technology scaling slowing down, there
is an interest in analog signal processing (ASP) to perform
mathematical computations in a low power fashion [4]. A key
enabler for ASP methodology is that most image processing
algorithms require only 6-8 bits precision. ASP excels at
approximate computing while consuming lower power than
digital computing. Recent approaches to analog computation
have used switches and capacitors as signal processing
elements [5]–[8] in advanced CMOS technologies. The
matrix multiplier reported in [8] performs calculations with
6b linearity while having an energy consumption of only
13fJ/operation at 1GHz. The work in [5] presents switched
capacitor multipliers and dividers. However, it uses amplifiers,
which are not power efficient, especially in advanced CMOS
technologies. The works in [6]–[8] present power efficient
multipliers by doing away with amplifiers and relying on only
switched capacitors for signal processing. These multipliers
work in the charge domain and are good solutions for high-speed
operation, usually with bandwidths in the GHz range.
However, charge leakage presents a significant challenge to
using switched-capacitor multipliers for medical signal processing,
in which computations may require only a few MHz
of bandwidth. Thus, switched-capacitor multipliers are not
well suited to medical image processing.
In this letter, we present an alternate solution by performing
multiplication and addition operations in time domain, which
satisfies medical image processing requirements. The proposed
technique has significantly higher immunity to charge leakage
than switched capacitors and can trade off speed for power
without sacrificing computational accuracy. Time domain circuits
are uniquely suitable for scaled CMOS technology. They
are highly digital and hence can operate at low supply voltages
in advanced CMOS technologies. In addition, the quantization
noise in time domain circuits is essentially transistor delay,
which reduces with technology scaling. Thus, the proposed
architecture can be used for a power efficient hardware implementation
of the rigid registration block (see Fig. 1), which is a
core element in the PET/MRI modality fusion process. In addition,
the proposed time domain architecture holds the promise of
providing low power computational ability for a wide range
of portable CAD systems.
Fig. 1. A workflow diagram of MRI and PET brain image co-registration.
$$
U = \begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ t_x & t_y & t_z & 1 \end{bmatrix};\quad
R_x = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix};\quad
R_y = \begin{bmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix};\quad
R_z = \begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\tag{1}
$$
The rest of this letter is organized as follows: Section II
presents a brief review of existing analog-to-digital matrix
multipliers, Section III discusses the key idea behind the pro-
posed time-domain matrix multiply-and-add operator, Section
IV presents a CMOS circuit implementation and simulation
results, while the conclusion and future research direction are
brought up in Section V.
II. REVIEW OF ANALOG-TO-DIGITAL MATRIX
MULTIPLICATION AND ADDITION
A MAC operation is defined as
$$ Y = \sum_{j=1}^{N} W_j X_j \tag{2} $$
where Xj is the input and Wj is the weight.
Several analog signal processing techniques have been reported
which accept an input Xj, perform the multiplication in
(2) in analog domain and return a digital output Yj = WjXj.
A straightforward method is to use an operational amplifier
to perform voltage domain multiplication and addition as
described by (2). The operational amplifier needs to have a
high gain to perform the MAC operation accurately. However,
high gain operational amplifiers are power hungry and instead
four-quadrant CMOS multipliers [9] are often used for approx-
imate multiplication. To reach higher energy efficiency, [10]
uses charge coupling to implement approximate analog matrix
multipliers. [11] uses low performance, thin-film transistors
(TFT) to implement an approximate analog multiplier for
processing data from sensors.
More recently, there have been efforts to integrate analog-
digital matrix multiplication (AD-MM) inside analog-to-
digital converters (ADC) to lower power consumption. Capac-
itors and switches are used for passive multiplication in [7],
[8], [12]. Charge domain passive multipliers achieve very low
power consumption in advanced CMOS technology nodes as
they do not use any active amplifiers which are power hungry.
Addition can be done in current or charge domain in a low
power fashion.
III. TIME DOMAIN MATRIX OPERATIONS
While passive switched capacitor techniques are a good
solution for low energy, approximate multipliers, they suffer
from non-idealities due to charge leakage, especially in advanced
CMOS technologies, which suffer from increased leakage.
This problem is exacerbated at low speeds of operation,
which implies that further lowering of energy consumption
by reducing speed is challenging for these techniques. In
addition, often switched capacitor multipliers are used in
conjunction with voltage domain ADCs which suffer from
reduced dynamic range as the supply voltage is scaled down
in advanced CMOS technologies.
To counter these limitations of switched capacitor multipli-
ers, we propose a time domain multiplier. By shifting to time
domain, sensitivity to charge leakage is greatly diminished
and very low energy multipliers can be designed which are
suitable for low bandwidth medical image processing. An
added advantage is that the quantization noise of time-domain
multipliers comes from transistor delay, which reduces with
technology scaling. Thus, in advanced CMOS technologies,
time-domain multipliers are better suited for low-bandwidth
operations than switched capacitors.
A voltage-controlled oscillator (VCO) is a major building
block of time domain multiplier. The quantized phase of a
VCO can be written as
$$ \phi[k] = 2\pi K_v \int_{0}^{T_{int}} V_{in}(t)\,dt \tag{3} $$
where Vin is the input to the VCO, Kv is the VCO gain and
Tint is time over which the VCO phase is integrated. If Vin
changes slowly, φ[k] can be written as
$$ \phi[k] = 2\pi K_v \sum_{j=1}^{N} (t_j - t_{j-1})\,V_{in}[j] = 2\pi K_v \sum_{j=1}^{N} \Delta t_j\,V_{in}[j] \tag{4} $$
where $T_{int} = \sum_{j=1}^{N} \Delta t_j$ and $V_{in}[j] = V_{in}(t = t_j)$.
Equation (4) is analogous to (2) with Wj ≡ 2πKv∆tj and
Xj ≡ Vin[j]. Thus, a VCO can be used to perform a MAC
operation in phase domain. The equivalent digital output of the
MAC operation can be readily obtained by simply sampling
the output of the VCO, without requiring a separate ADC
as in the charge-domain matrix multipliers. The accumulation
operation is done in phase domain and is highly linear.
Unlike charge-domain architectures, accumulation in phase
domain comes without any additional power consumption.
Nonlinearity in phase domain MAC operation is primarily due
to nonlinearity in voltage-to-phase conversion. Increasing the
integration time, Tint, allows reduction of VCO gain, Kv,
which in turn increases VCO linearity. This is particularly
suitable for medical image processing which does not require
a high bandwidth, and hence, linearity of the VCO can be
increased by lowering Kv and increasing Tint. As long as the
VCO is oscillating, there is no leakage error in the phase value
held by the VCO which is of importance for low bandwidth
medical signal processing.
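The equivalence between the phase integral (3) and the MAC form (4) can be checked numerically. The Python sketch below is an idealized behavioral model only: it assumes a perfectly linear VCO and an input held constant during each interval, and the gain, input samples and interval durations are illustrative values that do not come from the design described here.

```python
import math

def phase_by_integration(v_in, dt, kv, steps=1000):
    """Numerically integrate (3) for an input held at v_in[j] for dt[j] seconds."""
    phase = 0.0
    for v, d in zip(v_in, dt):
        h = d / steps
        for _ in range(steps):
            phase += 2.0 * math.pi * kv * v * h  # d(phi) = 2*pi*Kv*Vin*dt
    return phase

def phase_by_mac(v_in, dt, kv):
    """Evaluate (4) directly: a MAC with weights W_j = 2*pi*Kv*dt[j]."""
    weights = [2.0 * math.pi * kv * d for d in dt]
    return sum(w * x for w, x in zip(weights, v_in))

KV = 1.0e6                  # assumed VCO gain in Hz/V (illustrative only)
V_IN = [0.10, -0.05, 0.20]  # assumed input samples in volts
DT = [10e-9, 20e-9, 5e-9]   # assumed interval durations in seconds

phi_int = phase_by_integration(V_IN, DT, KV)
phi_mac = phase_by_mac(V_IN, DT, KV)
```

Both functions return the same phase to within rounding error, which is the sense in which the VCO performs the multiplication (through the interval duration) and the accumulation (through phase integration) at once.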
IV. MATRIX MULTIPLIER ARCHITECTURE
Fig. 2 shows the conceptual block diagram of the proposed
time-domain matrix multiplier architecture along with its
timing diagram. A voltage-to-current (V/I) converter drives a
current-controlled oscillator (CCO) and the quantized phase
output of the CCO holds the result of multiplication of Vin
and 2πKvφ1. The duration of φ1 in the j-th sampling period
is ∆tj, which is digitally controllable. Addition comes without
any hardware cost as the CCO holds on to its phase which
keeps accumulating with time. Fig. 2 illustrates how the
proposed architecture performs MAC operation.
Fig. 2. Architecture and timing diagram of proposed time domain matrix multiplier.
Fig. 3 shows the circuit implementation of the proposed architecture.
The architecture is implemented in a differential
fashion to suppress common-mode noise on the inputs as well
as noise from supply and ground. The input signal is applied
differentially to a V/I converter. The V/I converter drives two
pseudo-differential CCOs during the phase φ1 and the two
CCOs are run with a low current supply IL during the phase
φ2. The two CCOs are not stopped during φ2 to ensure that the
accumulated phase held by them is not corrupted by leakage.
By running the two CCOs at the same frequency during φ2,
no phase is accumulated differentially at φ2. The output of
each CCO stage is latched by the sampling clock into a flip-
flop (FF). At any given time, only one of the CCO stages is
in a state of either a positive or negative transition. Thus, for
an N -stage CCO, the instantaneous phase can be quantized
into 2N levels between (0, 2π) corresponding to N positive
transitions and N negative transitions. The matrix multiplier
is highly digital and makes use of simple digital circuits to
perform time domain analog signal processing.
[Figure: circuit schematic showing the V/I converter (inputs Vin+ and Vin−, resistor R, Ibias), two N-stage CCOs connected to the V/I converter during φ1 and to the current IL during φ2, flip-flops (FF), encoders and counters; each output is 2N×Count + Phase (φ), sampled between nTs and (n+1)Ts.]
Fig. 3. Circuit schematic of proposed time domain matrix multiplier.
As shown in Fig. 3, CCO phase increases monotonically
with time and wraps over after it crosses 2π. A counter is
used to store the number of times the VCO phase overflows
over the period of integration. The total phase at any time
is given by (2N · Count + φ̂), where φ̂ is the instantaneous
quantized phase.
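The phase read-out can be illustrated with a short behavioral sketch in Python. It assumes an ideal N-stage ring whose period is divided into 2N equal levels; the function name and the example phase value are hypothetical and serve only to show how the counter value and the quantized residue combine.

```python
import math

def read_phase(phase_rad, n_stages):
    """Split a continuous phase into (count, phi_hat, total) for an N-stage ring.

    One oscillation period is divided into 2N levels. count is the number of
    complete periods and phi_hat the quantized residue, both as integers.
    """
    levels = 2 * n_stages
    cycles = phase_rad / (2.0 * math.pi)
    count = math.floor(cycles)  # value held by the overflow counter
    # Encoder output in 0 .. 2N-1 (clamped to guard against float rounding).
    phi_hat = min(math.floor((cycles - count) * levels), levels - 1)
    return count, phi_hat, levels * count + phi_hat

# Hypothetical example: 3.37 periods of accumulated phase in a 5-stage ring.
count, phi_hat, total = read_phase(2.0 * math.pi * 3.37, n_stages=5)
```

With 5 stages there are 10 levels per period, so 3.37 periods read out as Count = 3 and φ̂ = 3, giving a total of 33 quantization steps.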
Since VCO gain varies with process, voltage and tempera-
ture (PVT), the result of multiply-and-add will vary with PVT.
Hence, background tracking is needed to correct for VCO
gain variation. The V/I transconductance is given by 1/R for
gmR ≫ 1, where gm is the transconductance of the V/I input
transistors and R is their source degeneration resistance. Thus,
the V/I converter is relatively insensitive to PVT variations,
and hence, resistance trimming is not required for multipliers
with 6-8 bits accuracy. To track the CCO gain, width of the tail
current source can be changed depending on the output of a
counter which is clocked by the CCO as shown in Fig. 4. The
counter output is monitored by a comparator which is clocked
by a divided-down version of the sampling clock. The counter
is reset after every comparison. If the CCO is running too
fast, the comparator will reduce the tail current source width
and vice versa. This ensures that the counter output is held
equal to a preset value Fin which sets the CCO free-running
frequency.
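The tracking loop can be sketched behaviorally in Python as follows. This is a simplified model under stated assumptions, not the circuit itself: the CCO frequency is assumed to scale linearly with an integer tail-current code, the quantity `cycles_per_code` is a hypothetical stand-in for the PVT-dependent CCO gain, and the loop is assumed to hold the code when the count equals the preset value.

```python
def track_gain(cycles_per_code, target_count, code, steps=100, code_max=63):
    """Bang-bang tracking loop for the tail current code.

    cycles_per_code: CCO cycles counted in one comparison window per unit of
    tail current code (assumed linear, stands in for the CCO gain).
    Returns the final code and the counter value it produces.
    """
    for _ in range(steps):
        count = cycles_per_code * code  # counter value, reset after each comparison
        if count > target_count:        # CCO too fast: reduce tail current
            code = max(code - 1, 0)
        elif count < target_count:      # CCO too slow: increase tail current
            code = min(code + 1, code_max)
    return code, cycles_per_code * code

# Nominal corner, then a hypothetical PVT shift that lowers the CCO gain by 20%.
nominal = track_gain(cycles_per_code=10, target_count=400, code=32)
shifted = track_gain(cycles_per_code=8, target_count=400, code=nominal[0])
```

In this model the code settles at 40 in the nominal case and moves to 50 after the gain drop, so the counter value returns to the preset target in both cases. When the target is not an exact multiple of the gain, the code dithers by one step around the target.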
The background tracking technique can be applied to a
reference multiplier and the comparator digital output word
can be applied to the tail current source of all the CCOs.
Process related mismatch between the tail current sources
of the different CCOs will limit the accuracy to which the
PVT sensitivity can be corrected. Fortunately, the matching
accuracy is not very stringent as only 6-8 bits accuracy is
required. In addition, CCO tail current devices are usually
made large to reduce flicker noise. Large device size also
reduces mismatches. For a large design with many multiplier
cells, a few local copies of the reference multiplier can be
distributed across the chip to account for gradient mismatches.
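[Figure: block diagram showing the V/I converter, the CCO (×N), a counter with reset (RST), the preset value Fin, the comparison clock Fs/M and the bias Vbias.]
Fig. 4. VCO gain tracking architecture.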
A MAC cell was designed in a 40 nm CMOS process. With a
400 mV differential input running at 0.6 MHz, the MAC cell operates
with 7-bit linearity. To test the accuracy of the proposed
MAC cell, a row vector (1 × 512) is multiplied with a column
vector (512 × 1). The row vector is set to
$$ [\,x_0 \;\; x_1 \;\; x_2 \;\; \cdots \;\; x_{511}\,] = [\,0 \;\; \sin(\omega T_s) \;\; \sin(2\omega T_s) \;\; \cdots \;\; \sin(511\,\omega T_s)\,] $$
and all the elements in the column vector are set to Ts, where Ts
= 10 ns. Fig. 5 shows the simulated result for multiplication
of the row vector and the column vector as well as the error
between the output of the MAC operation and the desired
output. It can be seen from Fig. 5 that the proposed MAC cell
has a low quantization error and has 7 bit linearity.
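An ideal floating-point reference for this test can be written in a few lines of Python. The sketch below is not the circuit simulation reported in Fig. 5; it only computes the product of the two test vectors and checks it against the closed form of the sine sum. The tone frequency is an assumption, taken here from the 0.6 MHz input mentioned above.

```python
import math

TS = 10e-9    # sampling period from the text
F_IN = 0.6e6  # assumption: tone frequency taken from the 0.6 MHz input
OMEGA = 2.0 * math.pi * F_IN
N = 512

row = [math.sin(j * OMEGA * TS) for j in range(N)]  # [0, sin(wTs), ..., sin(511 wTs)]
col = [TS] * N                                       # every weight equals Ts

def mac(weights, inputs):
    """Reference multiply-and-accumulate in floating point."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc

y = mac(col, row)

# Closed form of the same sum:
# sum_{j=0}^{N-1} sin(j*a) = sin((N-1)*a/2) * sin(N*a/2) / sin(a/2)
a = OMEGA * TS
y_closed = TS * math.sin((N - 1) * a / 2.0) * math.sin(N * a / 2.0) / math.sin(a / 2.0)
```

A reference of this kind gives the desired output against which the error of a quantized MAC result can be measured.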
The MAC cell consumes 86 µW from a 1.1 V supply while
operating at 100 MHz. The energy efficiency of the proposed
time-domain MAC cell is 2 fJ/operation, compared to the
968 fJ/operation energy efficiency of highly optimized digital
matrix multipliers [13]. Thus, the proposed MAC cell has
more than 400X better energy efficiency than digital matrix
multipliers.
[Figure: transient plot versus time (µs) showing signal voltage (V) for the desired output and the MAC output, and error voltage (V) for the error signal.]
Fig. 5. Matrix multiply-and-add transient simulation.
V. CONCLUSION
A time-domain architecture for performing multiplication-and-addition
operations is presented in this letter. The proposed
architecture exploits the low bandwidth and not-too-stringent
accuracy requirements of medical image processing
algorithms to achieve a drastic increase in energy efficiency
compared to digital matrix multipliers. The architecture is
highly digital and suitable for advanced CMOS technologies.
The proposed architecture can act as an enabler for developing
portable hardware solutions for 2D/3D image analysis.
REFERENCES
[1] Banerjee, I., Agibetov, A., Catalano, C.E., Patanè, G., and Spagnuolo, M.:
‘Semantics-driven Annotation of Patient-Specific 3D Data: A Step to Assist
Diagnosis and Treatment of Rheumatoid Arthritis’, The Visual Computer
Journal, 2016, pp. 1–13.
[2] Slomka, P. J.: ‘Software approach to merging molecular with anatomic
information’, Journal of Nuclear Medicine, 2004, vol. 45, pp. 36S–45S.
[3] Goshtasby, A.A.: ‘2-D and 3-D image registration: for medical,
remote sensing, and industrial applications’, John Wiley & Sons, 2005.
[4] St. Amant, R., Yazdanbakhsh, A., Park, J., Thwaites, B., Esmaeilzadeh,
H., Hassibi, A., Ceze, L., and Burger, D.: ‘General-purpose code acceleration
with limited-precision analog computation’, ACM SIGARCH Comput.
Archit. News, 2014, vol. 42, no. 3, pp. 505–516.
[5] Watanabe, K., and Temes, G.: ‘A switched-capacitor multiplier/divider
with digital and analog output’, IEEE Trans. Circuits Syst., 1984, vol. 31,
no. 9, pp. 796–800.
[6] Sadhu, B., Sturm, M., Sadler, B.M., and Harjani, R.: ‘Analysis and design
of a 5GS/s analog charge-domain FFT for an SDR front-end in 65nm
CMOS’, IEEE Journal of Solid-State Circuits, 2013, vol. 48, no. 5, pp.
1199–1211.
[7] Bankman, D. and Murmann, B.: ‘Passive charge redistribution digital-to-
analogue multiplier’, Elec. Lett., 2015, vol. 51, no. 5, pp. 386–388.
[8] Lee, E.H., and Wong, S.S.: ‘A 2.5GHz 7.7TOPS/W switched-capacitor
matrix multiplier with co-designed local memory in 40nm’, IEEE Journal
of Solid-State Circuits, 2016, pp. 418–420.
[9] Bult, K., and Wallinga, H.: ‘A CMOS four-quadrant analog multiplier’,
IEEE Journal of Solid-State Circuits, 1986, SSC-21, no. 3, pp. 430–445.
[10] Genov, R., and Cauwenberghs, G.: ‘Kerneltron: support vector machine
in silicon’, IEEE Trans. Neural. Netw., 2003, vol. 14, no. 5, pp. 1426–1434.
[11] Rieutort-Louis, W.R., Moy, T., Wang, Z., Wagner, S., Sturm, J.C., and
Verma, N.: ‘A large-area image sensing and detection system based on
embedded thin-film classifiers’, IEEE Journal of Solid-State Circuits, 2016,
vol. 51, no. 1, pp. 281–290.
[12] Wang, Z., Zhang, J., and Verma, N.: ‘Realizing low-energy classification
systems by implementing matrix multiplication directly within an ADC’,
IEEE Trans. Biomed. Circ. and Syst., 2015, vol. 9, no. 6, pp. 825–837.
[13] Saha, P., Banerjee, A., Bhattacharyya, P., and Dandapat, A.: ‘Improved
matrix multiplier design for high-speed digital signal processing applica-
tions’, IET Circuits, Devices and Systems, 2014, vol. 8, no. 1, pp. 27–37.
