TS Cache: A Fast Cache with Timing-speculation Mechanism Under Low
  Supply Voltages by Shen, Shan et al.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
 
1 
  
Abstract—To mitigate the ever-worsening “Power Wall” 
problem, more and more applications need to expand their power 
supply to the wide-voltage range including the near-threshold 
region. However, the read delay distribution of the SRAM cells 
under the near-threshold voltage shows a more serious long-tail 
characteristic than that under the nominal voltage due to the 
process fluctuation. Such degradation of SRAM delay makes the 
SRAM-based cache a performance bottleneck of systems as well. 
To avoid the unreliable data reading, circuit-level studies use 
larger/more transistors in a bitcell by scarifying chip area and the 
static power of cache arrays. Architectural studies propose the 
auxiliary error correction or block disabling/remapping methods 
in fault-tolerant caches, which worsen both the hit latency and 
energy efficiency due to the complex accessing logic. This paper 
proposes the Timing-Speculation (TS) cache to boost the cache 
frequency and improve energy efficiency under low supply 
voltages. In the TS cache, the voltage differences of bitlines are 
continuously evaluated twice by a sense amplifier (SA), and the 
access timing error can be detected much earlier than that in prior 
methods. According to the measurement results from the 
fabricated chips, the TS L1 cache aggressively increases its 
frequency to 1.62X and 1.92X compared with the conventional 
scheme at 0.5V and 0.6V supply voltages, respectively. 
Index Terms—Timing speculation, cache, static random access 
memory (SRAM), low voltage. 
 
I. INTRODUCTION 
N recent years, energy efficiency has become more important 
for the system on chip (SoC) as the demand of Internet of 
Things (IoT) and other mobile devices increases in the market. 
Scaling down the supply voltage is one of the most commonly 
used methods in the low-power design, which brings the energy 
efficiency near to the optimal point [1]. Operating at low supply 
voltages, however, SRAM is more prone to faults under the 
process variations due to its minimum-sized transistors. As a 
result, memories demand a bigger design margin than that of 
logic circuits [14]. There are two major types of failures in 
memory cells: (1) timing failures that increase the cell access 
time and (2) unstable read/write operation [2]. The later 
problems can be solved by using the dedicated read port in cells, 
 
This paragraph of the first footnote will contain the date on which you 
submitted your paper for review. It will also contain support information, 
including sponsor and financial support acknowledgment.  
This work has been supported by the Provincial Natural Science Foundation 
of Jiangsu Province under Grant No. BK20181141. 
such as 8T [3][4] and 10T [5]. This work focuses on the former 
that dramatically degrades the read performance of SRAM 
under the low-voltage region. A potential timing failure during 
both reads and writes is essentially caused by the global process 
variation that could weaken both P and N devices by increasing 
their Vth [14]. In an SRAM reading, discharging the bitlines 
(BL) with large capacitances through those weakened memory 
cells becomes slower, making the small voltage difference 
between BL and BLB difficult to be sensed by a sense amplifier 
(SA). Fig. 1 shows a 10K Monte-Carlo simulation of 
discharging time corresponding to the bitline swing of 150mV 
in a 28nm 256-row SRAM array at 0℃  SS corner. At the 
nominal VDD, it takes only 135ps to develop enough voltage 
swing. For the 0.5V supply voltage, by contrast, the mean value 
and standard deviation of the distribution of discharging time 
increase to 7.4ns and 2.36ns respectively, where the long tail 
(up to 30ns) is for reading the minor weak bits safely. Therefore, 
an extra timing margin must be applied, which significantly 
limits the throughput of the low-power SRAM [13].  
The increase of memory latency makes the SRAM-based 
cache become the main performance bottleneck of systems 
under low supply voltages as well. Fig. 2 shows the delay and 
energy break down of a 28nm 32KB L1 cache. As the voltage 
scales down, discharging the bitlines in a data array accounts 
for 85.4% of the latency and 70.8% of the energy consumption 
at 0.5V VDD since the data array is designed to have a larger size 
S. Shen, T. Shao, X Shang, Y. Guo, M. Ling, J. Yang, L. Shi are with the 
National ASIC Research Center, Southeast University, Nanjing 210096, China 
(e-mail: shanshen,txshao, billionshang@seu.edu.con, 18751959816@163.com, 
trio, dragon, lxshi@seu.edu.con). 
* Corresponding author 
TS Cache: A Fast Cache with Timing-
speculation Mechanism Under Low Supply 
Voltages 
Shan Shen, Tianxiang Shao, Xiaojing Shang, Yichen Guo, Ming Ling*, Member, IEEE, Jun Yang, 
Member, IEEE, Longxing Shi, Senior Member, IEEE 
I 
μ=7.4ns 
σ=2.36ns
μ=105.4ps 
σ=8.6ps
85 95 105 115 125 135
0
1k
2k
3k
4k
5k
#
 o
f 
s
a
m
p
le
s
(a) BL discharging time (ps)
0 5 10 15 20 25 30
0
1k
2k
3k
4k
5k
#
 o
f 
s
a
m
p
le
s
(b) BL discharging time (ns)
 
Fig. 1.  The 10K Monte-Carlo simulation of discharging time corresponding 
to the bitline swing of 150mV in a 28nm 256-row SRAM array operating at 
(a) 0.9V (b) 0.5V VDD 0℃ SS corner. 
  
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
 
2 
and longer BLs compared with those in a tag array. 
 Prior work in the circuit-level improved the reliability of 
bitcells by using more transistors, such as 8T [4], 10T [5], and 
7T/14T [6]. However, simply using larger or more transistors 
in bitcells [7][10] comes at the cost of significant increases in 
chip area (lower density) and leakage power without any 
performance profit. Architectural-level solutions that tolerant 
faulty bits in a cacheline include (1) correcting defective bits 
through error correction codes (ECC), such as SECDED and 
OLSC [15]; (2) disabling faulty resources (such as words, lines, 
ways) [8]; (3) remapping faulty resources to create functional 
cachelines [9] or (4) mixing the large- and standard-sized 
SRAM cells in a cache [10]. To some extent, they make the 
trade-off between the data error probability and large hardware 
overhead or capacity loss.  
Another perspective to improve the SRAM performance is 
the timing-speculation approaches [11][12][13] in the circuit 
level. Unfortunately, the method in [11] is only suitable for the 
SRAM with the logic dominant timing path, while the Razor 
SRAM [12] requires a complex roll-back mechanism in the 
processor pipeline to correct the error data. Moreover, they only 
provide a limited latency reduction due to the too-late error 
detections. The shared capacitors introduced in [13], on the 
other hand, are area hungry and need to be carefully designed 
to avoid failures in error detections. Furthermore, these studies 
target the SRAM arrays rather than caches. 
In this paper, an SRAM with a novel timing-speculation 
mechanism is proposed to mitigate the performance 
degradation of memories in the low-power scenarios. The 
voltage difference between BL and BLB in the SRAM array is 
sensed twice, called cross-sensing, far before the conservative 
sensing time such that the timing error can be detected much 
earlier than that in the work [11][12]. Meanwhile, the cross-
sensing mechanism is simpler and more area efficient than the 
shared capacitors scheme [13]. Based on such SRAM array, we 
propose a Timing-speculation cache (TS cache) that has a 
boosted frequency and high energy efficiency operating at near-
threshold supply voltages. The contributions of this paper are: 
(1) a timing-speculation mechanism that can aggressively 
reduce the read latency of the 6T SRAM under low voltages; (2) 
an L1 cache based on the proposed timing-speculation 
mechanism; (3) comprehensive investigations and comparisons 
of the TS caches and the previous solutions. 
The rest of this paper is organized as follows. Section II 
presents related work. Section III introduces the mechanism of 
cross-sensing and the architecture of TS cache. Moreover, the 
noise introduced by the cross-sensing scheme is also discussed. 
Section IV presents a comparison of both cross-sensing and 
other timing-speculation techniques. The previous low-power 
fault-tolerant caches and the TS cache are also investigated. 
Section V shows the measurement results from the fabricated 
chips. Section VI outlines our conclusions. 
II. RELATED WORK 
A. Circuit-level Solutions 
Regarding circuit-level solutions for the low-power SRAM, 
larger transistors in a memory cell average out the Vth 
variability caused by non-uniformities in the channel doping 
and result in more robust devices with a lower probability of 
failure [7]. Another approach is to use assist transistors in a 
bitcell to improve the noise margin when the supply voltage 
scales to the near-threshold region, such as 8T [3][4], 10T [5], 
and 7T/14T [6]. Wang et al. [3] observed that access-time faults 
occur only when a “0” bit is read on an 8T cell for a full RBL 
swing. Thus, they proposed the zero-counting and adaptive-
latency cache (i.e., ZCAL cache) based on an 8T SRAM to 
detect access-time faults dynamically using a lightweight zero-
bit counting error detection code. When a fault occurs, the 
ZCAL cache extends its access time. However, the large-sized 
or 8T/10T cells significantly consume the SRAM area and static 
power, which is unacceptable for the L2 and L3 caches. 
B. Architectural Solutions 
From an architectural perspective, ECCs are commonly used 
to protect against soft errors. Considering a high bit failure rate, 
the simple error corrections such as parity bits or SECDED 
cannot deal with multi-bit errors in a data chunk. Thus, a 
stronger ECC with larger latency, area, and energy overhead 
have to be applied. For example, the Orthogonal Latin Square 
Code (OLSC) proposed in [15] sacrificed half cache capacity to 
store ECC bits. A compromised method put forward by Khan 
et al. used the heterogeneous 6T cell architecture [10]. Only 
clean data is stored in the non-robust cache ways, which are 
protected by a simple ECC mechanism. In case of an error, the 
correct data can be obtained from a lower-level cache or 
memory. Dirty data is stored only in the robust ways 
constructed with larger-sized memory cells, which is 
guaranteed by a modified replacement policy. The replacement 
policy, however, would incur extra cache way swapping and 
energy consumption. Concertina [9] allocated the faulty 
subblocks to the null cache subblocks, enabling the use of 100% 
of the LLC capacity. But detecting the available blocks and 
rearranging them in the remapping mechanisms increase the 
access latency and the complexity of cache management. 
Moreover, methods in [8] and [9] introduce more cache misses. 
C. Timing-speculation 
The concept of timing-speculation is first proposed in the 
logic circuits to eliminate the over-design margins by in-situ 
timing error detection. Ernst et al. [19] used the flip-flop and 
the shadow latch to double sample input data at different clock 
edges. The scheme is often used in dynamic voltage scaling 
0.5 0.6 0.7 0.8 0.9
0
2
4
6
8
10
tag array
d
e
la
y
 (
n
s
)
(a) voltage (V)
data array
data tag
0
1
2
3
4
5
(b)
re
a
d
 e
n
e
rg
y
 (
p
J
)  other
 WL
 SA
 BL
 
Fig. 2.  (a) Read latency break down of a 28nm 32KB cache at different supply 
voltages and (b) energy break down at 0.5V VDD in 25℃ TT corner. 
  
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
 
3 
(DVS) system to reduce the voltage margin. Karl et al. [11] 
applied this idea to SRAM, which contains shadow SAs in 
addition to the main SAs. The main SA is triggered 
speculatively at the clock negative edge. After a while, the 
shadow SA re-samples the bitlines to confirm the result. The 
system detects the number of errors where the two samples are 
different during voltage scaling. When the number of errors 
exceeds the pre-set threshold, the supply voltage cannot be 
further reduced. Khayatzadeh et al. [12] proposed the Razor 
SRAM that reads memory twice with dual ports in a pipelined 
manner. In most cases, the read output is available after the first 
cycle and then confirmed by comparing with the second sample 
in the next cycle. For weak bits, the error flag will be triggered 
due to the two unequal samples. A common disadvantage of the 
schemes in [11] and [12] is the long-time duration between the 
speculative and the confirm readings. Consequently, the too-
late generation of error flags limits their applications in SoC 
systems. For example, a complex roll-back mechanism must be 
implemented in the processor pipeline to correct the error data 
read from the Razor SRAM, which can be extravagant in a low-
power processor.  To solve these challenges, Yang et al. [13] 
proposed a double sensing scheme with selective bitline voltage 
regulation (DS-SBVR), where the bitline voltage is 
dynamically regulated by charge sharing between two sensing 
steps. Different from other timing speculation SRAMs, its error 
flag is generated much earlier. Unfortunately, the shared 
capacitors with large capacitances in [13] are tremendously area 
hungry. Besides, their capacitances must be carefully designed 
to avoid failures in error detection, which could possibly 
corrupt the data. Furthermore, all these studies focus on SRAM 
arrays rather than caches. 
 
1 To simplify our discussion in this paper, we only introduce one cache 
configuration in this paper. However, the general concepts of TS cache can be 
easily extended to other implementations with different cache parameters.  
III. TIMING-SPECULATION CACHE 
In this section, the mechanism of cross-sensing is introduced 
and the overall architecture of TS cache is described. Moreover, 
the noise analysis is also discussed in details.  
A. Cross-sensing Mechanism 
In the cross-sensing phase, two successive SA enable (SAE) 
signals are triggered and the inputs of a given SA are switched 
at the second SAE. Fig. 3 (a) demonstrates the mechanism of 
cross-sensing. Assume the offset voltage (VOS) of the SA is 
positive and the bitcells in a column store ‘1’. The first SAE, 
which arrives far before the conservative sensing, activates the 
SA to evaluate the voltage difference between the 
corresponding BL and BLB (V∆1 = VBL - VBLB). Due to the 
process, voltage, temperature (PVT) variations, the distribution 
of the voltage differences (samples) consists of two groups, A 
and B. The samples in group A are correctly read (Q1 = 1) 
because the voltage swing on BLB is large enough to be 
evaluated by the SA (V∆1 > VOS). On the other hand, samples in 
group B are wrongly read as ‘0’s for the small BLB swing (0 < 
V∆1 < VOS). After the finish of the first sensing, the SA inputs 
are switched and the SAE is triggered again. Thus, the second 
input voltage becomes negative (V∆2 = VBLB – VBL < 0) and is 
re-evaluated, which makes the samples of group A’ and B’ 
symmetric to those of group A and B in Fig. 3 (a). Since V∆2 < 
0 < VOS, the sensing outcomes Q2 are all ‘0’s. The timing error 
can be identified if Q1 = Q2 (for samples in group B and B’ in 
this case), which means the TS cache has to extend another 
cycle, such that the voltage swing of BLB can be enlarged by 
continuously discharging, to obtain the correct result. 
Otherwise, if Q1≠ Q2, a reliable read is confirmed, the 
requested data can be sent out earlier than the conventional 
approach.  
As Fig. 3 (b) shows, the analogical analysis can be derived 
when VOS < 0. By using the proposed cross-sensing method, the 
read delay of SRAM can be aggressively improved. For 
example, targeting on a 3σ correct reading probability, it 
reduces 60% of BL discharging time at 0.5V 25℃ TT corner 
(Fig. 7). Furthermore, the simulation result shows it reduces 
energy dissipation as well (discussed in section IV-B). 
B. Overall Architecture 
 Fig. 4 is the overall architecture of an instance of the TS 
cache, which is organized as 32KB 2-way set-associativity with 
a 64-bit width read port1. Logically, each row of the tag array 
stores two 32-bit tags of the two cache ways and each row of 
data array stores a 64-byte cacheline. In the physical layout of 
the TS cache, the tag and the data arrays consist of multiple 
SRAM sub-arrays, which will be shown in section V. 
In each data column, a switch comprised of 4 PMOS (P1~P4) 
is controlled by the switch (SWT) signal. When performing a 
normal bitlines sensing, P1 and P4 are activated to connect BL 
and BLB to the input, IN and INB, of the SA. To swap the 
WL
WL
group B  
Q2=0
VOS
SA
BL BLB
..
.
 1 
 1 
SAE
IN
group A
Q1=1
group 
B 
Q1=0
+ -
INB
group A 
Q2=0
V  =VBL-VBLB V  =VBLB-VBL
Q1
0
1
V 1 > 0
V 1 > 0
1
1
VOS > 0
V  
0< V 1  < VOS
V 1 > VOS
VOS < 0
VOS < V 2 < 0
V 2 < VOS
V  
V 2 < 0
V 2 < 0
Q2
0
0
1
0
error
/
/
error
(false-positive)
(a)
(b)
error flag
V 
#
 o
f 
s
a
m
p
le
s
 
Fig. 3.  (a) Mechanism of cross-sensing. Suppose that all bitcells store ‘1’s. (b) 
The truth table of error detection for different VOS. 
  
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
 
4 
connections between the bitlines and the SA, P2/P3 are turned 
on and P1/P4 are turned off. The gates of N1 and N2 in the 
latch-typed SA are used as the input in case of the leakage 
current from bitlines (SA) to the SA (bitlines), which might 
disturb the error detections. The two PMOS transistors, P5 and 
P6, pull up the Q and QB of the sense amplifier before SAE 
arrives. 
An error detector includes a group of dynamic latches storing 
the first read outcome Q1, and an XOR+AND gate that 
compares the two outcomes, Q1 and Q2, of the cross-sensing. 
When detecting the timing errors, the node VVDD precharged 
by P7 will be pulled down to the ground if any two read 
outcomes of a bitcell are equal (a timing error occurs). At the 
same time, this low voltage of VVDD is latched and the error 
flag is set. A too-large XOR+AND gate that merges many data 
columns will introduce a larger capacitance and a longer gate 
delay, which should be avoided in the design. Thus, TS cache 
uses 8 error detectors in a data array to detect any timing failure 
in each 64-bit data segment (which matches the read port width). 
This design can also reduce the error correction penalty in a 
pipelined cache by overlapping the extra cycle of error data 
reading with the cycles of correct words transmitting. Moreover. 
to reduce the leakage current from VVDD to VVSS, the gate 
length of NMOS in the XOR+AND gate is 10nm larger than 
those in other modules. The dynamic latches and XOR+AND 
gates used by error detectors largely reduce the area overhead 
compared to static implementations. 
 The timing diagram is shown in Fig. 5. The SAs are 
activated by the first SAE signal, and the QLCH signal 
immediately enables the dynamic latches to store this sensing 
results. The SWT signal keeps high during the second SA 
enabling. When the second read outcome is stable, the DTC 
activates the XOR+AND gate. The ERR signal will be latched 
until the data is correctly read out. In most cases, the timing 
error can be corrected in an extra cycle by keeping the bitlines 
discharging through a second WLE signal, shown in Fig. 5. It is 
possible that some extraordinary weak bits need more cycles to 
obtain the correct results, which may cause destructive readings 
in a 6T SRAM cell. However, such weaker cells can be 
identified by build-in self-testing (BIST) and corrected through 
redundancy cells or removed by block disabling. Another 
situation is that error data reading may occur in the non-hit 
cache ways. However, it is not necessary to correct these unused 
error data. All timing signals are generated by a configurable 
timing control unit with automatic PVT tracking [13], which 
can be flexibly configured to multiple cycles of the clock period 
(coming from the replica bitline). 
ERR
PRE PRE
SAE
DTC
SWT
ro
w
 d
e
c
o
d
e
r
SAE
re
q
u
e
s
t 
(a
d
d
r 
&
 t
a
g
)
ti
m
in
g
 c
o
n
tr
o
l
R
/W
 a
s
s
is
ts
 &
 W
L
 d
ri
v
e
rs
WLE WLE
data arraytag array
ro
w
 d
e
c
o
d
e
r
R
/W
 a
s
s
is
ts
 &
 W
L
 d
ri
v
e
rs
MUX & drivers
4
...
XOR+AND
Q1[0]
Q2[0]
DTC
Q2[63]
Q1[0]
Q2[0]
Q1[63]
Q2[63]
ERR
latch
ELCHDTC
Q[0] QLCH
dyn. latch
Q[63] QLCH
QLCH
Q1 Q1
dyn. latch
Q
VVDD
VVSS
512
64
P7
N3
CK CK
tag comparator
..
.
SA ... SA
B
L
B
L
B
..
.
..
.
..
.
switch
..
.
..
. ..
.
switch... switch
B
L
B
L
B
SA ... SA SA
BC BC...
BC...
BC BC...
BC
BC
BC
BC
WL WL
BC BC...
BC BC...
BC BC...
ti
m
in
g
 
c
o
n
tr
o
l
WAYSEL
64 512
SAE
IN INB
P5
BL[0] BLB[0]
P1 P2 P3 P4
P6
SWT
SWT
N1 N2
Q
QB
error detectors
ELCH
QLCH
BL precharge & write buffers BL precharge & write buffers
dyn. latch
DTC
 
Fig. 4.  The overall architecture of an instance of the TS cache with 32KB capacity, 2-way set-associativity, and a 64-bit width read port. 
  
First Cycle Extra Cycle
CLK
WLE
SAE
QLCH
DTC
ERR
BL
BLB
extended reading 
PRE
cross-sensing
SWT
 
Fig. 5.  The timing diagram of the TS cache.  
 
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
 
5 
C. Noise Analysis 
 Unfortunately, a false-positive situation exists in the 
proposed scheme, in which case the error signal is triggered 
while the first read outcome is actually correct. Recalling Fig. 
3 (b), when a SA with a negative offset voltage senses a ‘1’, the 
first output is always correct since VOS < 0 < V∆1. After the SA 
input switching, the second read outcome can still be ‘1’ in the 
condition of VOS < V∆2 < 0, which is caused by a small 
amplitude of BLB swing. Furthermore, the charge sharing 
between the bitlines and SAs exacerbates these false-positives, 
shown in Fig. 6 (a). The equivalent capacitances of the SA input 
nodes IN and INB are CIN and CINB, respectively. As the first 
SAE raises, the voltage of IN is pulled up to VBL while the 
voltage of INB equals to VBLB. After swapping the inputs of SA, 
the charge (CBLB × VBLB + CIN × VBL)  at IN will be re-shared 
between CBLB and CIN, hence, VIN becomes  (CBLB × VBLB + CIN 
×  VBL) / (CBLB + CIN). The voltage difference V∆2 can be 
expressed by:  
 
V∆2 =
CBLB∗VBLB+CIN∗VBL
CBLB+CIN
−
CBL∗VBL+CINB∗VBLB
CBL+CINB
 (1) 
 
Assuming CBL = CBLB, CIN = CINB, the relation between V∆2 
and V∆1 is derived as 
 
V∆2 = −
CBL−CIN
CBL+CIN
(VBL − VBLB) = −
CBL−CIN
CBL+CIN
V∆1 (2) 
 
where the charge sharing shrinks the amplitude of V∆2 
compared to V∆1. The simulation results at 75℃ FF corner (Fig. 
6 (b)) with CBL = CBLB = 50fF, CIN = CINB = 0.5fF show that the 
absolute voltage difference is lowered by only 8% for a 256-
row SRAM in the second sensing, suggesting that the increase 
of false-positives caused by charge sharing is trivial. Fig. 7 
shows the bit error rate (BER) and the double-word (64 bits) 
error rate (DER) detected by the cross-sensing at different BL 
discharging time at 0.5V 25℃ TT corner. The cross-sensing 
(CS) method shorten the discharging latency from 6σ to the 3σ 
(1‰) region. For the discharging time of 3.2ns, the cross-
sensing only increases the BER and DER by merely 0.00023 
and 0.013 due to the existence of false-positives. 
 In addition, the false-negative situations, where the weak 
bits are wrongly recognized as the strong ones, will possibly 
happen in [13] when the amplitude of the regulated voltage is 
not sufficient. Because the false-negatives destruct the data 
reading, the shared capacitors of DS-SBVR SRAM must be 
designed carefully to avoid them. Contrastively, the cross-
sensing does not have this disadvantage. As Fig. 3 shows, all 
weak cells can be identified by the cross-sensing as long as their 
voltage differences on bitlines are smaller than the offset of SA 
(|V∆| < |VOS|). All in all, the cross-sensing scheme has more 
reliabilities and robustness in a circuit system.  
IV. COMPARISONS 
 In this section, comprehensive comparisons between the 
proposed scheme and prior approaches as well as discussions 
will be proceeded. To be fair, we first compare the cross-
sensing scheme with other timing-speculation SRAMs. 
Secondly, the energy-delay product (EDP), energy, and area 
overhead using TS cache and other fault-tolerant caches under 
the low-supply-voltage scenarios will be analyzed. 
A. Comparing with Other Timing-speculation SRAMs 
From a timing point of view, the speculative SRAM has two 
delay parameters: the TARRAY, defined as the delay of the first 
128 256 512
0.6
0.7
0.8
0.9
1.0
(b) number of rows
|V

2
 /
 V

1
|
(a)
CBL
CBLB
CBLB
CBL
CIN CINB
IN INB
charge sharing at 2nd SAE
SA
 1st SAE
 
Fig. 6.  (a) The charge sharing between BLs and the inputs of SA. (b) The 
influence of charge sharing on the second input voltage V∆2. 
  
60% reduction
bit error rate=1‰
0.8 2 3.2 4.4 5.6 6.8 8
10-9
10-7
10-5
10-3
10-1
101
e
rr
o
r 
ra
te
BL discharging time (ns)
 real BER
 CS BER
 real DER
 CS DER
 
Fig. 7.  Bit error rate (BER) and double-word error rate (DER) detected by the 
cross-sensing in a 28nm 256-row SRAM at 0.5V 25℃ TT corner.  
  
SAE
ERR
BL
BLB
Q
SAE
BL
BLB
Q
ERR
BL
BLB
SAE
Q
ERR
BL
BLB
SAE
Q
Conventinal
[11] & Razor SRAM [12]
This work
DS-SBVR [13]
TARRAY
TERROR
TERROR
TCONV
 
Fig. 8.  Timing diagrams of different timing-speculation techniques.  
  
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
 
6 
speculative output, and TERROR, defined as the delay of the final 
confirmation. Fig. 8 compares these timing parameters in 
different speculative SRAMs. In the conventional SRAM, SA 
is enabled until the voltage difference between BL and BLB is 
sufficiently large. The delay parameter TCONV is comprised of 
wordline driven, BL/BLB discharging, and SA sensing. The 
SRAM with the shadow SAs [11] releases the speculative 
outputs at the half of wordline enable time. The ideal TARRAY is 
only 50% of TCONV, while TERROR is equal to TCONV. However, 
this scheme requires that the TERROR must be smaller than a 
clock cycle to avoid the propagation of the wrong data. Thus, it 
is only suitable for a logic dominant path in which the logic 
delay occupies most of the clock period, but not suitable for an 
SRAM dominant path, such as caches in processors [13]. The 
principle of Razor SRAM [12] is similar to [11]. Since its 
TERROR is on a two-cycle timing path, it sends the risk data at the 
first cycle and detects errors in the next cycle. Therefore, it 
involves a roll-back mechanism when used in a processor 
pipeline and needs the stabilize registers to inhibit write-backs 
during error detection. The ideal TERROR of DS-SBVR [13] and 
this work is only a little larger than half of TCONV. The maximum 
throughput gain is defined as the ratio of the maximum 
throughput to that of the conventional SRAM. The theoretical 
maximum throughput gains of [11] and Razor SRAM are 1.5X 
and 2X, and those of DS-SBVR and this paper are 1.78X for the 
512-row array.  
Moreover, the capacitances of the shared capacitors in DS-
SBVR SRAM are a function of TARRAY, which means the 
capacitors must be elaborately designed according to the timing. 
Oppositely, in the cross-sensing mechanism, the TARRAY can be 
flexibly configured to achieve different frequency boosting 
without any error-detection failures (discussed in section III-C).  
The compared metrics are listed in Table I. A 128-row × 32-
column and 512-row × 32-column SRAM array layouts are 
presented using Cadence Virtuoso suit [18] in the same 28nm 
process technology to demonstrate the area overhead. The 
bitcells in the layouts including the push-rule 6T single port (SP) 
and 8T dual port (DP) are provided by TSMC foundry. The 
baseline SRAM includes the bitcell array, SAs, and the pre-
charge circuit without using error detection techniques. The 
SRAM in [11] including shadow SAs and the error detection 
circuit (XOR gates and MUX) consumes additional 17.6% chip 
area in the 128-row array. The Razor SRAM speculatively 
reads data through two independent ports and achieves great 
throughput gain at the cost of huge area overhead (45.1% for 
the 128-row array). The DS-SBVR SRAM has area cost of 
20.8%, which is mainly consumed by the shared capacitors. 
Thanks to the low-cost error detector, this work achieves the 
best area overhead of merely 6.4% for the 128 × 32 SRAM array 
and 1.8% for the 512 × 32 array. The data of energy and delay 
is collected from simulations same as that in section IV-B. The 
energy overhead refers to the ratio of the increase reading 
energy to that of the baseline SRAM array. In a read operation, 
energy is mainly consumed by the BL precharging, discharging, 
voltage sensing, and error detecting. The energy penalties of 
[11], [12], [13], and our work are 52.6%, 56.5%, 34.2%, and 
12.3% for the 128 × 32 sized arrays, respectively, which are 
mainly consumed by their error detection logic. It is worth to 
note that the cross-sensing scheme proposed in this paper even 
reduces the reading energy by 18.8% for the 512 × 32 sized 
array due to its lower BL swing to confirm the correct reading. 
The figure of merit (FoM) of power, performance, area (PPA) 
gain is defined as the maximum throughput / (area × energy). 
As we can see in Tab.1, the cross-sensing scheme achieves the 
best FoM, 1.34 and 2.15 for the two sized arrays among all 
speculation SRAMs. 
B. Comparisons with Other Fault-tolerant Caches 
1) Experimental setup 
In this work, all caches are implemented as 28nm single 
banks with 32KB capacity and 2-way set-associativity in the 
28nm process. The timing design of caches is according to the 
Monte-Carlo simulations using HSPICE at 0.5V 25℃  TT 
process corner to achieve the target yield. In the baseline 
version, the wordline enable time has a large margin to achieve 
6σ correct reading probability without using any error detection 
and correction techniques. Regarding the fault-tolerant caches, 
the WL enable time is configured to deliver the 1‰ BER (3σ 
correct reading probability). The energy dissipation is collected 
from the simulations of 8 data arrays and 4 tag arrays. The size 
of each data array is 256 × 128, while the tag array size is 64 × 
64.  
The TS cache is compared with other 4 fault-tolerant caches: 
the mixed-cell L1 [10], the ZCAL cache [4], the caches with 
SECDED and with OLSC ECC [15]. In the mixed-cell L1 cache 
[10], the robust cells are designed to have 2X size after our 
evaluation. One of the cache ways is constructed with the larger 
robust bitcells while another uses the standard cells. The ZCAL 
cache uses 8T cells with single-ended read port and 8 check bits 
for each 128-bit data segment. For the ECC caches, a segmented 
SECDED (21, 16) scheme is implemented, which can correct 1 
error out of the 16-bit data segment with 5 check bits (the 
probability of more than 2-bit error in a segment is P(error>2) 
TABLE I 
THE COMPARISON WITH OTHER TIMING-SPECULATION SRAMS 
 [11] Razor SRAM [12] DS-SBVR [13] This work 
Array Size 128×32 512×32 128×32 512×32 128×32 512×32 128×32 512×32 
Sensing Scheme 
Double-sensing with main 
and shadow SAs 
Double-sensing with dual 
ports in two consecutive 
cycles 
Double-sensing with selective 
bitline voltage regulation 
Cross-sensing 
Area Overhead 17.6% 4.8% 45.1% 50.1% 20.8% 7.6% 6.4% 1.8% 
Energy Overhead 52.6% 17.9% 56.5% 19.0% 34.2% 10.1% 12.3% -18.8% 
TCONV/TERROR 1X 1X 1X 1X 1.57X 1.78X 1.6X 1.78X 
Max. Throughput 1.5X 1.5X 2X 2X 1.57X 1.78X 1.6X 1.78X 
FoM in SRAM 0.83 1.21 0.88 1.13 0.97 1.50 1.34 2.15 
 
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
 
7 
= 1.8e-9). The check bits of SECDED are stored together with 
the normal data word forming a larger SRAM array. We also 
evaluate a more complex ECC solution, the segment OLSC 
(128, 64) ECC, which reduces the error probability to P(error>4) 
= 7.1e-11 in a 64-bit data segment. However, this ECC scheme 
scarifies area and power consumption since the check bits are 
stored in a dedicated 32KB memory. The overhead of ECC 
methods refers to the results in [16].  
2) Evaluation 
The energy delay product (EDP) is defined as the product of 
the access energy and the average access latency [4], which is a 
lower-is-better metric. Fig. 9 shows the normalized EDP of 
different fault-tolerant schemes. The TS cache has the best EDP 
of 0.31 compared to the baseline. The large overhead of OLSC 
cache accounts for the largest EDP (0.59). The mixed-cell and 
TS caches improve the average read latency by 51.5% and 
49.1%. For the TS scheme, the read penalty comes from the 
error correction in extra cycles.  
Fig. 10 (a) shows the normalized read energy and area 
overhead. By using 8T bitcells, ZCAL cache [4] performs the 
lowest energy dissipation, only 0.5X compared to the baseline. 
It can be explained by the reduced frequency of reading ‘0’ that 
requires a full RBL swing in ZCAL cache. Among the solutions 
based on the 6T SRAM, the TS cache performs the highest 
energy efficiency. The mixed-cell L1 and SECDED cache 
consume more energy due to their larger SRAM arrays. 
Regarding the segment OLSC (128, 64) ECC, the dedicated 
memory makes OLSC cache consume 1.15X energy compared 
to the baseline and nearly 2X compared to the TS cache. Fig. 
10 (b) shows the normalized area. As we have expected, the 
OLSC cache consumes 2X chip area. Meanwhile, the ZCAL 
cache and the mixed-cell cache also have a large area overhead. 
Oppositely, the TS scheme has the smallest area thanks to the 
limited assist hardware, which makes it more attractive to be 
applied in the IoT devices.  
V. MEASUREMENTS 
A 32KB single-cycle 2-way set-associative TS cache 
prototype is fabricated with 28nm TSMC technology in this 
paper, which consists of 8 data arrays with the size of 256 rows 
× 128 columns, a 64-bit width read/write port, and 4 tag arrays 
with the size of 64 rows × 64 columns. Fig. 11 (a) is the die 
micrograph and (b) depicts the testing logic of the chip. The test 
controller generates the chip enable signal (CEN), the write 
enable signals (WEN). To mimic the cache behavior in a 
processor, the requested addresses and data are pre-programed 
in the controller. Before all cachelines being accessed 
sequentially, the data ‘0x55’ and ‘0xAA’ is written into each 
byte of the cache by address traversal. The read outcome (Q) is 
sent to the comparator to count the number of error bits and 
error words. The WL enable time can be configured by the 
timing control module in the TS cache to achieve various access 
delays. The testing logic repeats these procedures when the 
timing configuration or supply voltage changes. The internal 
CK is generated from the replica bitline (RBL) [13] and input 
to the timing control in TS cache. If any error occurs, the clock 
is gated to wait for the correct data. All measurement results are 
collected from 20 chips at the room temperature (25℃).  
Table II lists the CK periods generated by the RBL module. 
Since all pulse width of timing signal is multiple cycles of the 
CK period, a low deviation of CK is crucial for achieving the 
Comparator 
& 
Error 
Counter
Address
Data
CEN
WEN
ERR
Q
TS Cache
CK
CK_G
Clock Gating
Test 
Controller
RBL
(a)
SAs & Output Driver 
& Error Detector
Write Buffers & 
Precharge
Bitcells
T: Timing Control
Data 
Array
Data 
Array
Data 
Array
Data 
Array
Tag 
Array
Data 
Array
Data 
Array
Data 
Array
Data 
Array
D
e
c
o
d
e
r 
&
 W
L
 
D
ri
v
e
r
D
e
c
o
d
e
r 
&
 W
L
 
D
ri
v
e
r
D
e
c
o
d
e
r 
&
 W
L
 
D
ri
v
e
r
D
e
c
o
d
e
r 
&
 W
L
 
D
ri
v
e
r
Tag 
Array
Tag 
Array
Tag 
Array
T T
(b)  
Fig. 11.  (a) The die micrograph and (b) the testing logic of chips. 
  
TABLE II 
THE CK PERIODS IN MEASUREMENTS  
VDD Avg. (ns) Max (ns) Min (ns) 
0.5V 0.687 0.744 0.658 
0.6V 0.265 0.279 0.254 
0.7V 0.167 0.172 0.161 
0.8V 0.122 0.125 0.119 
0.9V 0.099 0.108 0.096 
 
m
ix
e
d
-c
e
ll
Z
C
A
L
S
E
C
D
E
D
O
L
S
C
T
S
0.0
0.2
0.4
0.6
0.8
1.0
1.2
 other
 BL
 tag
n
o
rm
. 
re
a
d
 e
n
e
rg
y
(a) m
ix
e
d
-c
e
ll
Z
C
A
L
S
E
C
D
E
D
O
L
S
C
T
S
0.0
0.4
0.8
1.2
1.6
2.0
n
o
rm
. 
a
re
a
(b)
 
Fig. 10.  (a) Energy per cache reading and (b) area for different fault-tolerant 
caches, normalized to the baseline version. 
  
b
a
se
lin
e
m
ix
e
d
-c
e
ll
Z
C
A
L
S
E
C
D
E
D
O
L
S
C
T
S
0.0
0.2
0.4
0.6
0.8
1.0
n
o
rm
. 
E
D
P
0.0
0.2
0.4
0.6
0.8
1.0
a
v
g
. 
re
a
d
 d
e
la
y
 
Fig. 9.  The EDP and the average read delay normalized to the baseline for 
different fault-tolerant caches.  
  
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
 
8 
target yield. At 0.5V VDD, the CK periods vary from 0.744ns to 
0.658ns, such low deviation can satisfy the requirement of the 
timing design in the TS cache. Fig. 12 shows BER at the 
different WL enable time (= the number of CKs × CK period) 
at different VDD. Obviously, the BER curves with longer and 
flatter tails as the supply voltage scales down indicate that an 
extremely large timing margin is indispensable to ensure the 
target reading yield. For example, it takes 19.23ns (= 28 × 
0.687ns) at 0.5V and 5.3ns (= 20 × 0.265ns) at 0.6V of the WL 
enable time to read all cache content correctly (total 20 × 32K × 
8 testing bits). Fig. 13 (a) shows the average throughput gain of 
the TS cache. As section IV-B illustrates, we configure the WL 
enable time to achieve the 10-3 BER where the best benefit point 
(77%) is at 0.6V supply voltage. For 0.5V supply voltage, the 
higher DER (nearly 10%) that brings more penalties to extend 
reading nullifies the performance benefit of frequency boosting. 
Compared with the baseline cache, as Fig. 13 (b) shows, the 
frequency of TS cache is boosted by 1.6X (100MHz) at 0.5V 
and 1.9X (350MHz) at 0.6V VDD with merely 3.72% die area 
overhead.  
VI. CONCLUSION 
To address the problem of cache performance degradation 
under near-threshold voltage region, the TS cache is proposed 
in this work. By using a highly efficient timing-speculation 
mechanism, this paper breaks through the limitation of all 
memory accesses must be completely correct. The erroneous 
reading can be quickly identified by the low-cost error detector, 
then be corrected in an extended cycle. A 28nm TS cache 
prototype is fabricated to demonstrate the effectiveness and 
efficiency of this scheme. According to the measurements 
results, the TS cache can aggressively improve the cache 
throughput and frequency under low-voltage region. Beyond 
that, based on the standard 6T SRAM array, TS cache consumes 
lower chip area and energy as well. This work also conducts 
comprehensive comparisons with existing timing-speculation 
SRAMs and fault-tolerant caches including both circuit- and 
architecture-level solutions. All the result shows that the TS 
cache has a better energy efficiency and suits for the low-power 
system. 
REFERENCES 
[1] Alioto, Massimo. "Ultra-low power VLSI circuit design demystified and 
explained: A tutorial." IEEE Transactions on Circuits and Systems I: 
Regular Papers 59.1 (2012): 3-29.  
[2] Mukhopadhyay, Saibal, Hamid Mahmoodi-Meimand, and Kaushik Roy. 
"Modeling and estimation of failure probability due to parameter 
variations in nano-scale SRAMs for yield enhancement." 2004 
Symposium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No. 
04CH37525). IEEE, 2004. 
[3] Chen, Chien-Fu, et al. "A 210mV 7.3 MHz 8T SRAM with dual data-
aware write-assists and negative read wordline for high cell-stability, 
speed and area-efficiency." 2013 Symposium on VLSI Circuits. IEEE, 
2013. 
[4] WangP H, Cheng W C, Yu Y H, et al. “Zero-Counting and Adaptive-
Latency Cache Using a Voltage-Guardband Breakthrough for Energy-
Efficient Operations”. IEEE Trans. on Circuits and Systems, 2016, 63(10): 
969-973. 
[5] B. Calhoun and A. Chandrakasan, “A 256-kb 65-nm sub-threshold 
SRAM design for ultra-low-voltage operation,” IEEE J. Solid-State 
Circuits, vol. 42, no. 3, pp. 680–688, Mar. 2007. 
[6] Chen, Yen-Hao, et al. "A novel cache-utilization-based dynamic voltage-
frequency scaling mechanism for reliability enhancements." IEEE 
Transactions on Very Large Scale Integration (VLSI) Systems 25.3 
(2017): 820-832. 
[7] Zhou, Shi-Ting, et al. "Minimizing total area of low-voltage SRAM 
arrays through joint optimization of cell size, redundancy, and ECC." 
2010 IEEE International Conference on Computer Design. IEEE, 2010. 
[8] Hijaz F, Khan O. “NUCA-L1: A non-uniform access latency level-1 
cache architecture for multicores operating at near-threshold voltages”. 
ACM Transactions on Architecture and Code Optimization (TACO), 
2014, 11(3): 29. 
[9] Ferreron, Alexandra, et al. "Concertina: Squeezing in cache content to 
operate at near-threshold voltage." IEEE Transactions on Computers 65.3 
(2016): 755-769. 
[10] Khan S M, AlameldeenA R, Wilkerson C, et al. “Improving multi-core 
performance using mixed-cell cache architecture”. 2013 IEEE 19th 
International Symposium on High Performance Computer Architecture 
(HPCA). IEEE, 2013: 119-130. 
[11] Karl, Eric, Dennis Sylvester, and David Blaauw. "Timing error correction 
techniques for voltage-scalable on-chip memories." 2005 IEEE 
International Symposium on Circuits and Systems. IEEE, 2005. 
[12] Khayatzadeh, Mahmood, et al. "17.3 a reconfigurable dual-port memory 
with error detection and correction in 28nm fdsoi." 2016 IEEE 
International Solid-State Circuits Conference (ISSCC). IEEE, 2016. 
[13] Yang, Jun, et al. "A Double Sensing Scheme With Selective Bitline 
Voltage Regulation for Ultralow-Voltage Timing Speculative SRAM." 
IEEE Journal of Solid-State Circuits 53.8 (2018): 2415-2426. 
[14] Dreslinski, Ronald G., et al. "Near-threshold computing: Reclaiming 
moore's law through energy efficient integrated circuits." Proceedings of 
the IEEE 98.2 (2010): 253-266. 
[15] Chishti, Zeshan, et al. "Improving cache lifetime reliability at ultra-low 
voltages." Proceedings of the 42nd Annual IEEE/ACM International 
Symposium on Microarchitecture. ACM, 2009. 
[16] Duwe, Henry, et al. "Rescuing uncorrectable fault patterns in on-chip 
memories through error pattern transformation." 2016 ACM/IEEE 43rd 
Annual International Symposium on Computer Architecture (ISCA). 
IEEE, 2016. 
[17] Ernst, Dan, et al. "Razor: A low-power pipeline based on circuit-level 
timing speculation." Proceedings of the 36th annual IEEE/ACM 
International Symposium on Microarchitecture. IEEE Computer Society, 
2003. 
[18] Virtuoso Layout Suite, https://www.cadence.com/. 
 
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28
10-6
10-5
10-4
10-3
10-2
10-1
100
101
B
E
R
WL enable time (# of CK)
 0.5
 0.6
 0.7
 0.8
 0.9
 
Fig. 12.  The bit error rate (BER) from measurement chips at different supply 
voltages.  
  
0.5 0.6 0.7 0.8 0.9
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
fr
e
q
u
e
n
c
y
 (
G
H
z
)
(b) voltage (V)
1.93X
1.74X
1.5X
1.27X
conv. cache
TS cache
100MHz @ 0.5V
0.5 0.6 0.7 0.8 0.9
0
20
40
60
80
0
2
4
6
8
10
a
v
g
. 
th
ro
u
g
h
p
u
t 
g
a
in
 (
%
)
(a) voltage (V)
e
rr
o
r 
ra
te
 (
%
)
 
Fig. 13.  (a) The throughput improvement and the DER at different VDD. (b) 
The shmoo plot. 
 
  
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
 
9 
Shan Shen was born in 1993. He received 
the B.S. degree in Microelectronics 
Department from Jiangnan University in 
2016. And in 2016, he studied in in School 
of Microelectronics from Southeast 
University, where he is currently pursuing 
the Ph.D. degree in Microelectronics.  
His research interests mainly include 
hardware designs in Computer 
Architecture and Memory System. 
 
 
 
 
Tianxiang Shao was born in 1995. He 
received the B.S. degree in Optics And 
Electronic Department from China Ji 
Liang University in 2017. He is currently 
pursuing the M.S. degree in the School of 
Microelectronics, Southeast University.  
His research interests mainly include 
the architecture of CPU, cache and SRAM 
 
 
 
 
 
 
 
 
Xiaojing Shang was born in 1996. He 
received the B.S. degree in Microelectronic 
Department from Jiangnan University in 
2017. He is currently pursuing the M.S. 
degree in the School of Microelectronics, 
Southeast University.  
His research interests mainly include the 
architecture of CPU, cache and DRAM. 
  
 
 
 
 
 
 
 Yichen Guo received the B.S. degree from 
Anhui University, Hefei, China, in 2015. 
He is currently pursuing the M.S. degree 
with the School of Electronic Science and 
Engineering, Southeast University, 
Nanjing, China.  
His current research interests include 
low-voltage static random access memory 
(SRAM) circuit design. 
 
 
 
 
 
Ming Ling is an associate professor 
working in the National ASIC System 
Engineering Technology Research Center, 
Southeast University. He received his B.S. 
degree, M.S degree and Ph.D from 
Southeast University in 1994, 2001 and 
2011, respectively.  
His main research interests include 
memory subsystem of SoC, embedded 
software and SoC architecture. 
 
 
 
 
Jun Yang (M’15) received the B.S., M.S., 
and Ph.D. degrees from Southeast 
University, Nanjing, China, in 1999, 2001, 
and 2004, respectively. He is currently a 
Professor with the School of Electronic 
Science and Engineering, Southeast 
University.  
He has co-authored over 50 academic 
papers, and holds 40 patents. His current 
research interests include near-threshold 
circuit design and Global Navigation Satellite System (GNSS) 
algorithm. 
 
 
 
 Longxing Shi (SM’06) received the B.S., 
M.S., and Ph.D. degrees from Southeast 
University, Nanjing, China, in 1984, 1987, 
and 1992, respectively. From 1992 to 2000, 
he was an Associate Professor with the 
School of Electronic Science and 
Engineering. Since 2001, he has been a 
Professor and the Dean of the National 
ASIC System Engineering Research Center.  
He has authored one book and over 130 
articles. His current research interests include ultralow-power 
IC design. 
 
