A 500 fW/bit 14 fJ/bit-access 4kb Standard-Cell
Based Sub-VT Memory in 65nm CMOS
Pascal Meinerzhagen∗, Oskar Andersson†, Babak Mohammadi†, Yasser Sherazi†,
Andreas Burg∗, and Joachim Neves Rodrigues†
∗Institute of Electrical Engineering, EPFL, Lausanne, VD, 1015 Switzerland
Email: pascal.meinerzhagen@epfl.ch, andreas.burg@epfl.ch
†Department of Electrical and Information Technology, Lund University, Lund, 22100 Sweden
Email: oskar.andersson@eit.lth.se, babak.mohammadi@eit.lth.se, yasser.sherazi@eit.lth.se, joachim.rodrigues@eit.lth.se
Abstract— Ultra-low power (ULP) biomedical implants and
sensor nodes typically require small memories of a few kb,
while previous work on reliable subthreshold (sub-VT) memories
targets several hundreds of kb. Standard-cell based memories
(SCMs) are a straightforward approach to realize robust sub-
VT storage arrays and fill the gap of missing sub-VT memory
compilers. This paper presents an ultra-low-leakage 4kb SCM
manufactured in 65nm CMOS technology. To minimize leakage
power during standby, a single custom-designed standard-cell (D-
latch with 3-state output buffer) addressing all major leakage
contributors of SCMs is seamlessly integrated into the fully
automated SCM compilation flow. Silicon measurements of a
4kb SCM indicate a leakage power of 500 fW per stored bit
(at a data-retention voltage of 220 mV) and a total energy of
14 fJ per accessed bit (at energy-minimum voltage of 500mV),
corresponding to the lowest values in 65 nm CMOS reported to
date.
I. INTRODUCTION
Biomedical implants and sensor nodes, whose power and
area budgets are often dominated by embedded memories,
require ultra-low power consumption at low operating fre-
quencies, and are therefore preferably operated in the sub-VT
domain. While logic circuits operate reliably in this domain,
it is more difficult to build robust sub-VT memories.
Commercial memory compilers are mainly oriented toward
above-VT operation and yield SRAM macros based on the con-
ventional 6-transistor (6T) bitcell, which cannot be operated in
the sub-VT domain. Several research groups have proposed 8-
transistor (8T) [1,2] or 10-transistor (10T) [3] SRAM bitcells
reliably operating in the sub-VT domain. However, such sub-
VT SRAM macros still have high leakage currents often dom-
inating the leakage power of ultra-low-power (ULP) systems.
To remedy excessive leakage currents, a 14-transistor (14T)
bitcell using high-threshold-voltage (high-VT) I/O transistors,
stack forcing, and channel length stretching is proposed in [4].
Designing dedicated 8T, 10T, or 14T SRAM macros for
each new system and for each memory configuration is asso-
ciated with a considerable design effort. Standard-cell based
memories (SCMs) are an interesting alternative to full-custom
sub-VT SRAM macros in order to significantly reduce the
design effort, ensure reliability, and even reduce the area cost
for storage capacities smaller than a few kb [5]. However,
like many SRAM macros [1–3], the SCMs presented in [5]
suffer from high leakage currents, as they are implemented
with latches from a commercial standard-cell library, which is
primarily optimized for speed, but not for leakage.
Contributions: In this work, a new approach to the efficient
design of embedded sub-VT ULP memories is proposed. Since
the bitcells (latches) together with the read multiplexers
account for almost all of the leakage power of SCMs, a single
custom-designed standard-cell, namely a low-leakage D-latch
with 3-state output buffer, is integrated into the automated
SCM compilation flow. As opposed to previous
work [6], the proposed SCM design flow does not restrict
the leakage minimization to the bitcells, but extends it to the
peripheral circuits by using a 3-state read logic, accepting a
speed degradation of the otherwise rather fast SCMs [5] for
the benefit of lower leakage.
II. CUSTOM LOW-LEAKAGE LATCH DESIGN
Approximately 66% of the leakage power of SCMs is
consumed by the latches, whereas the read multiplexers
dominate the remaining leakage power. This section addresses the
most dominant leakage contributors by a custom low-leakage
latch design. Latch topologies using 3-state buffers inherently
have transistor stacks and consequently low leakage currents,
while topologies using transmission-gates and static-CMOS
gates suffer from higher leakage currents [7]. The best latch
topology exhibiting the lowest leakage current has 1) the
lowest number of paths from VDD to ground, and 2) the highest
resistance on each such path, directly leading to a topology
with 3-state buffers only. Having identified the best latch
topology, transistor stacking (for parts of the latch which do
not yet have transistor stacks) and channel length stretching are
applied to further reduce leakage currents. The stacking factor
is strictly limited to 2 since higher factors give diminishing
returns in leakage reduction [7] and compromise reliability for
sub-VT operation. Moreover, the point of diminishing returns
of channel length stretching is found to be 1.5-2X minimum
channel length [7]. The right-hand side of Fig. 1 shows
the transistor-level schematic of the final custom-designed
standard-cell latch (with 3-state output buffer), while the left-
hand side shows the SCM architecture.
III. LOW-LEAKAGE 3-STATE READ LOGIC
[Figure 1. Left: SCM array of 128 rows by 32 columns with DataIn(31..0), DataOut(31..0), W-Addr and R-Addr inputs, WAD and RAD blocks, clock gates, write and read word-lines (WL), write and read bit-lines (BL), and the 3-state read logic of one column. Right: custom low-leakage standard-cell latch with D, Q, CK/CKB, and OE/OEB terminals, annotated W = Wmin, L = 1.5 Lmin.]
Fig. 1. Architecture of low-leakage 4kb standard-cell based memory (SCM):
the write logic uses clock-gates [8], while the 3-state inverters used for the
read functionality are integrated in the low-leakage latch design.
The read multiplexers, routing the selected word to the
data output, are an integral part of the read logic and can be
implemented with 3-state buffers [8], in order to address the
dominant leakage contributor of SCM peripheral circuits. The
already stacked output inverter of the custom-designed D latch
is easily converted into a 3-state inverter, thereby addressing all
major SCM leakage contributors by designing a single custom
standard-cell.
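The shared read bit-line described above can be sketched as a purely logical model. This is an illustrative assumption, not the authors' netlist: the function name `resolve_rbl` and its arguments are hypothetical, the inversion of the 3-state inverter is ignored, and leakage from disabled drivers is not modeled.

```python
from typing import Optional, Sequence


def resolve_rbl(stored_bits: Sequence[int], read_wl: Sequence[bool]) -> Optional[int]:
    """Return the logic value on one shared read bit-line (RBL).

    stored_bits[i] is the bit held by the latch in row i of this column;
    read_wl[i] is True when the 3-state output of row i is enabled.
    Returns None when no driver is enabled (RBL floating, high-Z).
    """
    driven = [bit for bit, enabled in zip(stored_bits, read_wl) if enabled]
    if not driven:
        return None  # every 3-state output is disabled
    if len(set(driven)) > 1:
        raise ValueError("bus contention: enabled drivers disagree")
    return driven[0]


# Hypothetical column of 128 rows with a one-hot read word-line on row 5.
column = [row % 2 for row in range(128)]
one_hot = [row == 5 for row in range(128)]
selected_bit = resolve_rbl(column, one_hot)
```

With a one-hot read word-line, exactly one latch drives the bit-line, so no separate multiplexer tree is needed in this idealized view.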
The remainder of this section aims at finding the optimum
transistor sizing of the 3-state drivers to simultaneously reduce
overall leakage and improve speed, two goals that do not conflict
in the sub-VT regime, as explained below. The presented
4kb SCM consists of 128 rows and 32 columns, as shown in
Fig. 1. Thus, 128 3-state buffers are connected to the same read
bit-line (RBL). During a read operation, the 3-state buffer in
the selected word has to drive the RBL against 127 unselected,
yet leaking 3-state buffers. To investigate the impact of the 3-
state drive strength on the RBL (dis-)charge delay, a strong
and a weak driver, defined in Table I, are considered. To obtain a
compact layout that fits the standard-cell grid, and because
symmetric rise and fall times are only a secondary goal for
the targeted low-speed ULP applications, the 3-state drivers are
non-symmetric with equal NMOS and PMOS transistor sizes.
As a result, RBL rise times are always longer than RBL fall
times. Table I shows the 50%-to-50% rising-RBL propagation
delay of the selected 3-state driver for the typical-typical
(TT) process corner at 27 °C, for both above-VT and sub-VT
supply voltages, and for both drive strengths. The considered
low-power (LP) high threshold-voltage (HVT) 65nm CMOS
technology has a nominal VDD and a threshold-voltage of
1.2 V and 650 mV, respectively. Thus, a VDD of 400 mV is
already deep in the sub-VT domain. Simulation results indicate
that the stronger 3-state driver is faster for operation at nominal
VDD where on-to-off current ratios (Ion/Ioff) are as high as
10^7 (for both NMOS and PMOS transistors), whereas the
weaker 3-state driver is faster for sub-VT operation, due to
much lower Ion/Ioff ratios of around 10^4 and the resulting
non-negligible impact of the leakage current of unselected 3-state
drivers.
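A back-of-the-envelope sketch makes the role of the Ion/Ioff ratio concrete. It assumes, purely for illustration, that all 128 drivers on a bit-line are identical, that each of the 127 unselected drivers leaks exactly Ioff, and that parametric variations are ignored; the function name is hypothetical.

```python
def drive_to_leakage_ratio(ion_over_ioff: float, words_per_rbl: int = 128) -> float:
    """Selected driver's on-current divided by the summed off-current
    of all unselected drivers on the same read bit-line."""
    unselected = words_per_rbl - 1
    return ion_over_ioff / unselected


# Ion/Ioff values quoted in the text for nominal VDD and for sub-VT operation.
nominal = drive_to_leakage_ratio(1e7)  # roughly 79,000
sub_vt = drive_to_leakage_ratio(1e4)   # roughly 79
```

Under these assumptions the drive current exceeds the aggregate leakage by about five orders of magnitude at nominal VDD, but by less than two in the sub-VT regime, which is why the leakage of unselected drivers can no longer be neglected there.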
TABLE I
READ BIT-LINE (RBL) DELAY, TT CORNER, 27 °C.
Drive strength        Strong       Weak
W/Wmin, L/Lmin        15, 1        1, 2
RBL delay at VDD:
  1.2 V               1.064 ns     2.126 ns
  400 mV              3.336 µs     2.688 µs
IV. RELIABILITY ANALYSIS
[Figure 2. Hold-failure probability [%] versus VDD (180 to 220 mV), simulated and measured curves. Inset: probability versus VDDhold (0.16 to 0.22 V), with simulated Max(VDDhold) = 210 mV.]
Fig. 2. Simulated and measured hold-failure probability versus VDD. Inset:
Simulated distribution of VDDhold.
While bitcell read-failures and write-failures are avoided
by using a read buffer and by disabling the bitcell-internal
keeper, respectively, hold-failures limit VDD down-scaling [5].
To assess the minimum VDD required to hold data (VDDhold),
the lowest VDD for which both static noise margin (SNM)
values (corresponding to data '1' and '0', or, in other words,
to the top and bottom eyes of the butterfly curve [9]) are still
positive is extracted for each sample of a 1k-point Monte Carlo (MC)
circuit simulation (accounting for within-die (WID) parametric
variations, in the TT corner, at 27 °C). Fig. 2 shows the hold-failure
probability as a function of VDD, while the inset shows
the corresponding distribution of VDDhold. The first hold-failure
occurs at 200 mV, corresponding to a worst-case (maximum)
value of VDDhold equal to 210 mV.
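The relation between the VDDhold distribution and the hold-failure probability can be sketched as follows. The sample values are hypothetical stand-ins for the Monte Carlo output (chosen only so that the maximum is 210 mV), and the failure criterion, a cell fails when VDD is below its VDDhold, is an assumption consistent with the definition above.

```python
from typing import Sequence


def hold_failure_probability(vddhold_mv: Sequence[float], vdd_mv: float) -> float:
    """Fraction of sampled bitcells whose VDDhold exceeds the applied VDD."""
    failing = sum(1 for v in vddhold_mv if v > vdd_mv)
    return failing / len(vddhold_mv)


# Hypothetical VDDhold samples in mV, not the simulated data of the paper.
samples_mv = [170, 180, 180, 190, 190, 190, 200, 200, 200, 210]

p_at_210 = hold_failure_probability(samples_mv, 210)  # no cell fails
p_at_200 = hold_failure_probability(samples_mv, 200)  # the 210 mV cell fails
```

In this toy data set the first failure appears when VDD is lowered to 200 mV, mirroring how a worst-case VDDhold of 210 mV leads to the first hold-failure at 200 mV.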
Due to the strong impact of parametric variations and low
Ion/Ioff ratios in the sub-VT regime, the total leakage current
from a large number of disabled 3-state buffers might become
high enough, compared to the active drive-current of a single
weak 3-state buffer, to compromise the reliability of the 3-
state read logic. However, 1k MC runs accounting for WID
parametric variations in the slow-slow (SS) process corner at
27 °C indicate that for up to 128 words per RBL, a single
3-state driver successfully drives the RBL at a VDD as low as
400 mV.
V. SILICON MEASUREMENTS
Fig. 3 shows the chip microphotograph and the layout of
the 4kb SCM based on 3-state-enabled low-leakage latches
and manufactured in 65nm CMOS with LP-HVT transistors.
The silicon area of the 4kb SCM block is 315 × 165 µm²,
corresponding to 12.7 µm² per bit. Functionality is verified
by writing and reading back checker-board and random data
patterns using a scan-chain test interface. Unless stated otherwise,
the temperature is carefully controlled to 27 °C for all
silicon measurements.
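The area-per-bit figure follows directly from the block dimensions and the 128 × 32 array organization; the short check below only reproduces that arithmetic (the variable names are illustrative).

```python
WIDTH_UM = 315.0    # block width in micrometers
HEIGHT_UM = 165.0   # block height in micrometers
BITS = 128 * 32     # 128 rows by 32 columns, i.e. 4 kb

area_um2 = WIDTH_UM * HEIGHT_UM        # total block area in square micrometers
area_per_bit_um2 = area_um2 / BITS     # about 12.7 square micrometers per bit
```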
Fig. 3. Chip microphotograph and zoomed-in layout.
[Figure 4. Two error maps of bit position versus address, both at f = 10 kHz and T = 27 °C. Top panel: column-wise read-failures at 380 mV. Bottom panel: no read-failures at 420 mV.]
Fig. 4. Measured error maps for VDD of 380 mV (top) and 420 mV (bottom).
A. Minimum VDD for Data Retention and Memory Access
The measured minimum required supply voltages to guar-
antee correct hold, write, and read functionality are 220, 300,
and 420 mV, respectively. The measured value of VDDhold
(220 mV) is in good agreement with the aforementioned
simulated value (210 mV), as shown in Fig. 2. It is apparent
that the low-leakage 3-state read logic limits the minimum
voltage for read/write access (VDDmin). For a closer inspection
of the onset of read failures, Fig. 4 shows error maps: a
green (bright) marker indicates correct access to a bitcell,
while a red (dark) marker indicates an access failure. For
VDD = 380mV, it is apparent that failures occur column-
wise, confirming that the 3-stated RBLs are the first point
of failure under VDD scaling. Completely error-free access is
measured at VDDmin = 420mV. Fig. 5 shows the number
of inoperative columns, i.e., columns containing at least one
bitcell with access failure, as a function of VDD, while the inset
shows the total number of bitcell read-failures versus VDD.
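The two quantities plotted in Fig. 5 can be derived from an error map as sketched below. The map layout (a list of rows, one boolean per bit position, True meaning access failure) and the example failures are hypothetical; only the definition of an inoperative column is taken from the text.

```python
from typing import List, Sequence


def inoperative_columns(error_map: Sequence[Sequence[bool]]) -> List[int]:
    """Indices of columns containing at least one failing bitcell."""
    n_cols = len(error_map[0])
    return [c for c in range(n_cols) if any(row[c] for row in error_map)]


def total_read_failures(error_map: Sequence[Sequence[bool]]) -> int:
    """Total number of failing bitcells in the array."""
    return sum(sum(1 for failed in row if failed) for row in error_map)


# Hypothetical 128 x 32 error map with failures in columns 3 and 17.
error_map = [[False] * 32 for _ in range(128)]
error_map[10][3] = True
error_map[90][3] = True
error_map[42][17] = True

bad_columns = inoperative_columns(error_map)
n_failures = total_read_failures(error_map)
```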
[Figure 5. Number of column failures versus VDD (0.34 to 0.46 V) at f = 10 kHz, T = 27 °C. Inset: number of bitcell read-failures versus VDD over the same range.]
Fig. 5. Measured number of inoperative columns versus VDD. Inset: Total
number of read-failures versus VDD.
B. Access Energy, Frequency, and Leakage Power
[Figure 6. Energy per bit [fJ] versus VDD (400 to 700 mV) for maximum-speed operation at T = 27 °C, with markers for the measured VDDmin (f = 10 kHz), the measured energy minimum (f = 110 kHz), and f = 1.5 MHz.]
Fig. 6. Measured energy per bit-access.
Fig. 6 shows the measured energy per bit-access performed
at maximum speed versus VDD. The measured energy-minimum
voltage is located at 500 mV, where the energy dissipation
per bit access reaches its minimum of 14 fJ. At 675, 500, and
420 mV (VDDmin), the maximum measured operating frequencies
are 1.5 MHz, 110 kHz, and 10 kHz, respectively. The 3-state
read logic limits VDDmin and the read-access time, but
meets the goal of ultra-low leakage power and access
energy. Because the energy-minimum voltage (500 mV) is still
higher than VDDmin (420 mV), the memory can be operated at its
energy minimum. At VDDhold = 220 mV, data is correctly held with a
leakage power of 425-500 fW per bit (best and worst die), as
shown in Fig. 7.
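To put the per-bit figures into system-level terms, the following sketch scales them to the full 32-bit word and the 4 kb array. It assumes one 32-bit access per clock cycle at the maximum frequency measured at 500 mV and uses the worst-die leakage at VDDhold; the resulting totals are derived here for illustration and are not reported in the measurements.

```python
ENERGY_PER_BIT_J = 14e-15   # measured minimum energy per accessed bit (500 mV)
WORD_WIDTH_BITS = 32        # data I/O bus width
F_MAX_HZ = 110e3            # maximum measured frequency at 500 mV
LEAK_PER_BIT_W = 500e-15    # worst-die leakage per stored bit at 220 mV
CAPACITY_BITS = 128 * 32    # 4 kb array

# Energy of one 32-bit access: 448 fJ.
energy_per_word_j = ENERGY_PER_BIT_J * WORD_WIDTH_BITS

# Assumed continuous access at maximum speed: about 49.28 nW.
access_power_w = energy_per_word_j * F_MAX_HZ

# Standby power of the whole array at VDDhold: about 2.048 nW.
standby_power_w = LEAK_PER_BIT_W * CAPACITY_BITS
```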
C. Measurements at Human-Body Temperature
Biomedical implants encounter a typical working temperature
of 37 °C. At 37 °C, the first completely error-free read
access to the entire array is already measured at 400 mV.
As a desirable effect of higher temperatures, the maximum
operating frequency doubles when heating the chips from 27
to 37 °C (measured at VDD = 420 mV). Unfortunately, the
leakage power also increases with temperature,
as shown in Fig. 7.
[Figure 7. Leakage power per bit [pW/bit] versus VDD (0.2 to 0.8 V) for 4 dies at 27 °C and 4 dies at 37 °C. Inset: VDD from 0.18 to 0.23 V, marking VDDhold = 0.22 V with 425-500 fW/bit.]
Fig. 7. Measured leakage power per bit, including overhead of peripheral
circuits, measured for 4 dies, at 27 and 37 °C. Inset: Zoom around VDDhold.
VI. COMPARISON WITH PRIOR-ART SUB-VT MEMORIES
Compared to a previous study on SCMs considering only
commercially available standard-cell libraries [5], designing
merely one custom standard-cell (3-state-enabled low-leakage
latch) cuts the leakage power in half while maintaining the
same silicon area.
Table II shows the best (in terms of access energy and
leakage power) memories in 65nm CMOS reported to date.
The energy figures (Etot/bit) correspond to the total (active
and leakage) energy per memory access performed at maxi-
mum speed, normalized to the size of the data I/O bus. Unless
stated in parentheses, Etot/bit is given for VDDmin. The power
figures (Pleak/bit) correspond to the leakage power of the
memory macro (including peripheral circuits) during standby,
normalized to the macro’s storage capacity. Unless stated in
parentheses, Pleak/bit is given for VDDhold.
In [6], the standby leakage of the SRAM macro is dom-
inated by the leakage of peripheral circuits, due to the ag-
gressive reduction of array leakage. In this work, not only the
bitcell (latch), but also the leakage-dominant peripheral cir-
cuits (read multiplexers) are leakage-optimized, which clearly
pays off compared to [6] (see Table II).
With a total energy dissipation of 14 fJ per accessed bit and
a leakage power of 500 fW per stored bit, the presented work
outperforms all previous work in 65nm CMOS nodes. The
reported clock frequencies are suitable for a wide range of
biomedical applications, while most previously reported sub-
VT SRAMs are overdesigned. Even the silicon area of SCMs
is smaller compared to sub-VT SRAM hardmacros for storage
capacities of up to several kb, due to less area for peripheral
circuits [5]. For several tens of kb, an area-increase of roughly
4X [5], stemming from the larger bitcell, is acceptable for the
benefit of the clearly lower leakage power and access energy.
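The margins over prior art can be read off Table II as simple ratios, as in the sketch below. The dictionary layout and key names are illustrative; for [6] the macro-level leakage figure (6.0 pW/bit) is used rather than the bitcell-only value. Because the entries are reported at different supply voltages, the ratios are indicative only.

```python
# Values copied from Table II; None marks an entry that is not reported.
PRIOR_ART = {
    "[3]": {"etot_fj": 54.0, "pleak_pw": 7.6},
    "[2]": {"etot_fj": 86.0, "pleak_pw": 6.1},
    "[6]": {"etot_fj": None, "pleak_pw": 6.0},
}
THIS_WORK = {"etot_fj": 14.0, "pleak_pw": 0.5}


def improvement(prior, ours):
    """Ratio of a prior-art figure to this work, or None if unreported."""
    return None if prior is None else prior / ours


ratios = {
    ref: {key: improvement(value, THIS_WORK[key]) for key, value in row.items()}
    for ref, row in PRIOR_ART.items()
}
```

For example, `ratios["[3]"]["pleak_pw"]` evaluates to about 15.2 and `ratios["[2]"]["etot_fj"]` to about 6.1.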
TABLE II
COMPARISON WITH PRIOR-ART SUB-VT MEMORIES IN 65NM CMOS
                      [3]           [2]           [6]            This work
VDDmin [mV]           380           250           700            420
VDDhold [mV]          230           250           500            220
Etot/bit [fJ/bit]     54 (0.4V)     86 (0.4V)     -              14 (0.5V)
Pleak/bit [pW/bit]    7.6 (0.3V)    6.1           6.0, 1.0(a)    0.5
(a) Leakage power of bitcell only
VII. CONCLUSIONS
This paper addresses the lack of ultra-low-power (ULP) sub-VT
memory compilers by utilizing a fully automated standard-cell
based memory (SCM) compilation flow, which is especially
interesting for ULP biomedical systems requiring only small
storage capacities of several kb. A single custom
standard-cell (D-latch with 3-state output buffer) is designed,
addressing all dominant SCM leakage contributors at once,
and integrated into the SCM compilation flow, cutting leakage
power in half compared to using only commercial standard-cell
libraries.
Silicon measurements show that a 3-state read logic with
up to 128 words per bit-line operates reliably in the sub-VT
regime down to 420 mV. Counter to intuition, weaker 3-state
buffers not only reduce leakage, but also shorten the bit-line
delay compared to stronger 3-state buffers. The 4kb SCM
manufactured in 65nm CMOS consumes a leakage power of
500 fW per stored bit (at data-retention voltage of 220 mV) and
dissipates a total energy of 14 fJ per accessed bit (at energy-
minimum voltage of 500 mV).
ACKNOWLEDGMENT
This work was kindly supported by the Swiss National Science
Foundation (PP002-119057), Swedish Vetenskapsrådet
(621-2011-4540), and Swedish VINNOVA Industrial Excellence
Centre (SOS).
REFERENCES
[1] Y.-W. Chiu, J.-Y. Lin, M.-H. Tu, S.-J. Jou, and C.-T. Chuang, “8T single-
ended sub-threshold SRAM with cross-point data-aware write operation,”
in Proc. IEEE ISLPED, Aug. 2011.
[2] M. E. Sinangil, N. Verma, and A. P. Chandrakasan, “A reconfigurable
8T ultra-dynamic voltage scalable (U-DVS) SRAM in 65 nm CMOS,” in
IEEE JSSC, Nov. 2009.
[3] B. H. Calhoun and A. P. Chandrakasan, “A 256-kb 65-nm sub-threshold
SRAM design for ultra-low-voltage operation,” in IEEE JSSC, March
2007.
[4] S. Hanson, M. Seok, Y.-S. Lin, Z. Y. Foo, D. Kim, Y. Lee, N. Liu,
D. Sylvester, and D. Blaauw, “A low-voltage processor for sensing
applications with picowatt standby mode,” in IEEE JSSC, April 2009.
[5] P. Meinerzhagen, S. M. Y. Sherazi, A. Burg, and J. N. Rodrigues,
“Benchmarking of standard-cell based memories in the sub-VT domain
in 65-nm CMOS technology,” in IEEE JETCAS, Aug. 2011.
[6] Y. Wang, H. J. Ahn, U. Bhattacharya, Z. Chen, T. Coan, F. Hamzaoglu,
W. Hafez, C.-H. Jan, P. Kolar, S. Kulkarni, J.-F. Lin, Y.-G. Ng, I. Post,
L. Wei, Y. Zhang, K. Zhang, and M. Bohr, “A 1.1 GHz 12 uA/Mb-
leakage SRAM design in 65 nm ultra-low-power CMOS technology with
integrated leakage reduction for mobile applications,” in IEEE JSSC,
2008.
[7] B. Mohammadi, P. Meinerzhagen, O. Andersson, Y. Sherazi, A. Burg,
and J. Rodrigues, “A 0.28-0.8V 320 fW D-latch for sub-VT memories in
65-nm CMOS,” in Proc. IEEE CICC, under review, Sept. 2012.
[8] P. Meinerzhagen, C. Roth, and A. Burg, “Towards generic low-power
area-efficient standard cell based memory architectures,” in Proc. IEEE
MWSCAS, Aug. 2010.
[9] B. Calhoun and A. Chandrakasan, “Static noise margin variation for sub-
threshold SRAM in 65-nm CMOS,” in IEEE JSSC, July 2006.
