Near- and Sub-Threshold Design for Ultra-Low-Power Embedded Systems by Burg, Andreas Peter
Andreas Burg,  
P. Meinerzhagen, A. Dogan, J. Constantin, M. M. Sabry Ali,  
G. Karakonstantis, D. Atienza, L. Benini 
EPFL, Switzerland 
1 
Energy autonomous systems with low- to moderate 
computing requirements 
(Wireless) sensor nodes 
Biomedical applications (wearable and implanted) 
2 
Pacemakers, watches 
Ultra Low Power Design with Voltage-Frequency Scaling 
 
System Level Strategies for Ultra-Low-Power 
Single core architectures 
ULP Multi core architectures 
 
Low-Power Memories for Scaled Voltages 
 
A New Way of Dealing with Reliability at Scaled Voltages 
3 
Dynamic power consumption 
Switching & short-circuit power 
Depends on circuit activity and  
clock frequency 
 
Leakage power consumption 
Independent of the activity 
4 
Supply voltage scaling  
reduces active and leakage power consumption 
but also reduces speed 
$P_{dyn} \sim \alpha C_{tot} V_{DD}^2 / T_{clk}$ 
$P_{leak} \sim k I_0 V_{DD}$ 
Energy per operation as a metric of efficiency 
$E \sim \alpha C_{tot} V_{DD}^2 + k I_0 V_{DD} T_{clk}$ 
Operation at critical path  
speed minimizes leakage 
 
Above VT ($V_{DD} > V_T$) 
$T_{clk}$ increases linearly as $V_{DD}$ 
decreases 
Active energy dominates 
Voltage scaling improves  
energy efficiency significantly 
 5 
Tradeoff between energy and speed 
 
Below VT ($V_{DD} < V_T$) 
$T_{clk}$ increases exponentially 
as $V_{DD}$ decreases 
Very high cost for energy  
reduction 
Poor energy-delay product 
 
Near-VT operation: balance  
between energy and delay 
6 
Energy per operation as a metric of efficiency 
$E \sim \alpha C_{tot} V_{DD}^2 + k I_0 V_{DD} T_{clk}$ 
 
Near/below VT ($V_{DD} < V_T$) 
Exponential delay increase 
outweighs the reduced $V_{DD}$ 
Leakage starts to dominate 
 
 
Energy minimum voltage (EMV): balance between leakage 
and active energy 
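The balance between active and leakage energy can be illustrated numerically. The following is a minimal sketch, not the model behind the slides: it evaluates the normalised expression E ~ V_DD^2 + r * V_DD * T_clk, where the delay model (an EKV-style drive-current interpolation) and all parameter values (`vt`, `n`, `phi_t`, `leak_ratio`) are assumptions chosen only for illustration.

```python
import math

def drive_factor(vdd, vt=0.4, n=1.5, phi_t=0.026):
    """Assumed on-current shape: exponential below VT, roughly quadratic above."""
    x = (vdd - vt) / (2.0 * n * phi_t)
    return math.log1p(math.exp(x)) ** 2

def clock_period(vdd):
    """Normalised critical-path delay, T_clk ~ V_DD / I_on (assumption)."""
    return vdd / drive_factor(vdd)

def energy_per_op(vdd, leak_ratio=0.01):
    """Return (total, active, leakage) energy per operation, normalised units.

    leak_ratio lumps k*I0 / (alpha*C_tot); its value here is hypothetical.
    """
    active = vdd ** 2
    leakage = leak_ratio * vdd * clock_period(vdd)
    return active + leakage, active, leakage

def energy_minimum_voltage(v_lo=0.15, v_hi=1.0, step=0.005):
    """Grid search for the supply voltage with the lowest energy per operation."""
    count = int(round((v_hi - v_lo) / step)) + 1
    grid = [v_lo + i * step for i in range(count)]
    return min(grid, key=lambda v: energy_per_op(v)[0])

if __name__ == "__main__":
    emv = energy_minimum_voltage()
    total, active, leakage = energy_per_op(emv)
    print(f"EMV ~ {emv:.3f} V, active {active:.4f}, leakage {leakage:.4f}")
```

With these assumed parameters the minimum falls below the threshold voltage; a larger `leak_ratio` moves it upwards, which is the qualitative behaviour the slide describes.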
 7 
J. Rodrigues, PATMOS 2011, Keynote 
Real-time embedded system requirements 
Handle a given workload with lowest power consumption 
Optimum solution 
Operation at the energy minimum voltage with power gating during 
idle periods to avoid leakage 
However, power gating is only effective when idle periods are long, 
and memories, which are the major source of leakage, often cannot 
be power gated 
8 
Operation @ EMV  
 
with power gating 
 
without power gating 
 
Operation below EMV  
without power gating 
9 
71% 
95% 
Voltage 
Constantin et al., VLSI-SoC, 2012 
For a given architecture, and a given workload 
Lower voltage results in lower power consumption 
But processing requirements determine the minimum 
operating voltage 
Need to minimize 
processing requirements 
to reduce active power 
and also leakage 
10 
Shimmer: developed by the Digital Health Group at Intel 
Careful algorithm design and optimization enables decent 
lifetimes, e.g., for ECG monitoring [Rincon et al., ITAB, 2011] 
 
11 
 TI MSP430 CPU 
• 16-bit operation 
• 10 kB RAM, 48 kB Flash 
• Up to 8 MHz 
• 8-channel, 12-bit ADC 
 Problems with state-of-the-art WBSN processors 
• Few widely used discrete components (e.g., TI MSP430 series) 
• Often outdated, complex architectures with significant legacy burden 
• Low operating frequencies even at nominal voltage 
• No or very limited voltage-frequency scaling for energy reduction  
12 
65% 
35% 
Active power 
Memory
Logic
7% 
93% 
Leakage power 
For embedded processors, memories occupy a large percentage of 
the silicon area 
No system power gating for many systems (retain contents) 
• Significant contribution to power consumption through 
leakage, especially at low voltages/frequencies 
 Achieve optimum energy efficiency: 
Reduce complexity AND memory requirements 
typical embedded processor 
13 
ASIP for Low-Power 
Embedded Systems 
Voltage 
Scaling 
A plethora of different factors influence energy efficiency 
Software (coding style) and compilation tools 
System architecture 
ISA + processor core architecture 
RTL and gate level (synthesis & constraints) 
Floorplan and physical design 
14 
[Figure: Core 1 and Core 2 with L1/L2/L3 cache levels, contrasted with a shared D/IMem]
Increasing speed improves potential for voltage scaling but usually 
comes also at the expense of power 
Better processors reduce execution cycles but are more complex 
and require longer cycle times 
Higher maximum clock frequencies require more complex circuits 
and more area (more leakage) 
 
Need to consider leakage as well at scaled voltages  
Memories are a major source of leakage and often dominate power 
consumption. Nevertheless, processing requirements are often 
reduced at the cost of memory 
 
15 
Need to reduce execution time and memory requirements without 
adding too much overhead. 
16 
ASIP core architecture 
described in LISA (PD) 
• Iterate and optimize 
core architecture for 
lowest power consumption 
Automatic generation: 
• Software tool-chain 
• Cycle-accurate ISS 
• Synthesizable HDL 
Tool-based processor 
architecture 
exploration and tools 
chain generation 
[Figure: design flow. Software side: benchmark source code, C compiler, compilation + linking, program binary, instruction set simulator (verification). Hardware side: LISA core model (processor description), Processor Designer, RTL description (VHDL source) of the ULP multicore architecture, synthesis, place & route, fully P&R design, RC extraction (parasitics, netlist + layout), post-layout simulation (program + data memory content), vcd, power analysis.]
Initial exploration 
 Accurate power 
analysis using 
vcd-based post-layout 
gate-level simulations 
 
17 
For small ULP systems, iterations through the 
complete design flow are possible 
Very simple, true RISC ISA 
 16-bit Harvard architecture, 3-stage  
pipeline, 24-bit instruction words 
 11 Single word, single cycle instructions 
Minimalistic ALU 
• ADD, SUB, AND, OR, XOR, Shift, Mult. 
 Simplified data memory interface 
 Addressing modes for efficient  
execution of DSP applications 
 Immediate branching + full data bypass 
 
 Less than 5% of an embedded platform (< 10 kGE) 
Operates up to 180 MHz for voltage scaling 
 Near-VT computing: Only 6pJ per Op at 0.5V, running at ~1 MHz 
18 
Processor model includes hardware 
support for: 
 Interrupt handling 
 Context / task switching 
 
FreeRTOS available on TamaRISC 
 C-compiler support for 32-bit operations 
 FreeRTOS compiled from original C-source 
 
Custom LISA simulation model (ISS) with support for 
 Sleep mode through clock-gating of the processing core: essential for ULP 
operation 
 Peripherals 
• Timer 
• ADC 
• Interrupt controller 
 
 
19 
                           MSP430    PIC24     TamaRISC
                           (COTS)    (LISA)    (LISA)
Max. Clock [MHz]           8         122       180
Core Area [kGE]            N.A.      33.5      14.1
Memory Area [kGE]          N.A.      58.2      58.2
Total Area [kGE]           N.A.      91.7      72.3
Application CS [kCycles]   800.0     99.9      99.8
Application DWT [kCycles]  4700      1714.9    1834.3
Execution Time CS [ms]     100       0.819     0.554
Execution Time DWT [ms]    588       14.06     10.19
20 
in 180 nm CMOS technology 
ISA + core architecture simplicity is key: 
TamaRISC outperforms PIC24 and MSP430 in clock frequency, area, and execution time 
(PIC24 needs slightly fewer cycles for the DWT) 
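The execution times in the comparison follow directly from the cycle counts and the maximum clock frequencies. As a small worked check (the function and dictionary names are illustrative, the numbers are taken from the comparison table):

```python
def execution_time_ms(kcycles, f_clk_mhz):
    """t = cycles / f_clk; kCycles divided by MHz yields milliseconds."""
    return kcycles / f_clk_mhz

# Values from the processor comparison (180 nm CMOS).
PROCESSORS = {
    "MSP430":   {"f_mhz": 8,   "cs_kcycles": 800.0, "dwt_kcycles": 4700.0},
    "PIC24":    {"f_mhz": 122, "cs_kcycles": 99.9,  "dwt_kcycles": 1714.9},
    "TamaRISC": {"f_mhz": 180, "cs_kcycles": 99.8,  "dwt_kcycles": 1834.3},
}

if __name__ == "__main__":
    for name, p in PROCESSORS.items():
        cs = execution_time_ms(p["cs_kcycles"], p["f_mhz"])
        dwt = execution_time_ms(p["dwt_kcycles"], p["f_mhz"])
        print(f"{name}: CS {cs:.3f} ms, DWT {dwt:.2f} ms")
```

The check makes the source of the speed advantage visible: TamaRISC and PIC24 need a similar number of cycles, so the shorter execution time of TamaRISC comes from its higher clock frequency.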
Energy-limited wireless sensor nodes 
Efficient data compression scheme (for communication and/or 
storage) 
ULP computing / circuit design (sub-threshold operation) 
Low complexity sensor data compression: Compressed Sensing 
Data independent 
 Low complexity 
21 
Samples of a sparse signal can be compressed efficiently while sensed 
Matrix-vector multiplication: 
Compress vector of input samples x, to sampled data vector y 
Random sensing matrix (e.g., with uniformly distributed entries) 
Sensing matrix can be very sparse! [M11] 
 Few non-zero entries per column 
 Entries can be one 
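With a sparse sensing matrix whose non-zero entries are all one, the matrix-vector product reduces to a handful of additions per input sample. The sketch below illustrates this under stated assumptions: the number of ones per column (12), the window size, and the use of Python's `random` module to pick row indices are illustrative choices, not details given on the slide.

```python
import random

def make_sparse_sensing_matrix(n_samples, n_measurements, ones_per_column, seed=1):
    """Store the matrix column-wise as lists of row indices holding a one."""
    rng = random.Random(seed)
    return [rng.sample(range(n_measurements), ones_per_column)
            for _ in range(n_samples)]

def compress(samples, columns, n_measurements):
    """Compute y = Phi * x using additions only (all non-zero entries are one)."""
    y = [0] * n_measurements
    for sample, rows in zip(samples, columns):
        for row in rows:
            y[row] += sample
    return y

if __name__ == "__main__":
    n, m, d = 512, 256, 12          # 512 samples, 50% compression, assumed d
    phi = make_sparse_sensing_matrix(n, m, d)
    x = [i % 64 for i in range(n)]  # placeholder input, not ECG data
    y = compress(x, phi, m)
    print(len(y), "measurements from", n, "samples")
```

Storing only the row indices is what makes a precomputed matrix compact, and it is also what a runtime index generator has to reproduce.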
22 
[M11] Mamaghanian, H.; et al., "Compressed sensing for real-time energy-efficient ECG compression on wireless body 
sensor nodes," IEEE Trans. on Biomedical Engineering, vol. 58, no. 9, pp. 2456–2466, 2011. 
 Efficient random number 
generation (indices) 
 Hardware support: 
instruction set extension! 
ASIP with dedicated ISE 
is crucial! 
Pseudo-random index sequence, as memory addressing offsets 
Precomputation: storage in data memory 
 6 Kbyte per sensing matrix (e.g. for 512 samples with 50% compression) 
 Large data memory footprint: large area, high leakage power! 
Computation at runtime: LFSR based 
 Software calculation: computational effort of 1k cycles / sample 
 
ULP operation in the sub-threshold regime 
significantly reduces core clock frequency: 
1–20 kHz 
Insufficient sampling rates: 1–20 Hz 
23 
CS Accumulation (CSA) instruction 
 Compression buffer address 
 Sample data 
Address random element in buffer 
 New random memory offset each cycle 
ISE hardware overhead: 
< 3% of total area 
Multiple LFSR steps per cycle 
 Reduce index correlation! 
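The behaviour of the CSA instruction can be sketched in software as follows. This is an illustrative model only: the 16-bit Galois LFSR with taps `0xB400` (a well-known maximal-length configuration), the seed, the number of LFSR steps per index, and the number of accumulations per sample are assumptions, since the slides do not specify them.

```python
def lfsr16_step(state):
    """One step of a 16-bit Galois LFSR (taps 0xB400, maximal length)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state

def csa(buffer, sample, state, steps_per_index=4):
    """Model of one CS accumulation: advance the LFSR several steps, then
    add the sample to the buffer element addressed by the new state."""
    for _ in range(steps_per_index):
        state = lfsr16_step(state)
    buffer[state % len(buffer)] += sample
    return state

def compress_window(samples, n_measurements=256, ones_per_column=12, seed=0xACE1):
    """Accumulate every sample into `ones_per_column` pseudo-random positions.
    Unlike a precomputed matrix, repeated indices within a column can occur."""
    buffer = [0] * n_measurements
    state = seed
    for sample in samples:
        for _ in range(ones_per_column):
            state = csa(buffer, sample, state)
    return buffer

if __name__ == "__main__":
    y = compress_window(list(range(512)))
    print(len(y), "compressed values")
```

Because the index sequence is regenerated from the seed, no sensing matrix has to be kept in data memory; the receiver only needs the same seed and LFSR configuration.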
24 
ECG data reconstruction quality 
 50% data compression 
    (at 125 Hz sampling rate) 
 TamaRISC-CS with ISE 
16.5 cycles/sample 
       2.1 kHz clock freq. 
       30.6 nW power 
 Without ISE (std. RISC) 
1025.5 cycles/sample 
       128 kHz clock freq. 
       355 nW power 
 Reference impl. [M11]: 
storage approach (large memory): 390.5 cycles/sample 
 
25 
Power consumption improved by 11.6x through CS ISE! 
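The clock frequencies and the improvement factor quoted above can be reproduced from the cycle counts and the sampling rate. A short worked check (function names are illustrative):

```python
def required_clock_hz(cycles_per_sample, sample_rate_hz):
    """Minimum core clock needed to process each sample in real time."""
    return cycles_per_sample * sample_rate_hz

def achievable_sample_rate_hz(f_clk_hz, cycles_per_sample):
    """Highest sampling rate a core can sustain at a given clock."""
    return f_clk_hz / cycles_per_sample

if __name__ == "__main__":
    fs = 125.0  # Hz, sampling rate from the slide
    print("with ISE   :", required_clock_hz(16.5, fs) / 1e3, "kHz")
    print("without ISE:", required_clock_hz(1025.5, fs) / 1e3, "kHz")
    print("power ratio:", 355.0 / 30.6)
    # Software LFSR at ~1k cycles/sample on a 1-20 kHz sub-threshold clock:
    print("sample rate:", achievable_sample_rate_hz(1e3, 1000.0), "to",
          achievable_sample_rate_hz(20e3, 1000.0), "Hz")
```

The same relation explains why a software-only LFSR at about 1k cycles per sample limits a 1 to 20 kHz core to sampling rates of 1 to 20 Hz.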
71% 
95% 
 SHA-3: set of cryptographic hash functions 
 Equally important design goals: hashing speed, memory footprint 
 Custom, algorithm-specific instruction set extensions 
    Average speedup: 172%; average memory savings: 40% 
 
 
 
26 
[Figure: AT-sweep comparison. Core area [kGE] vs. MET timing constraint [ns] for no ISE and for the BLAKE, Skein, Keccak, JH, and Grøstl ISEs]
Algorithm   PIC24           PIC24 + ISE     Reduction   Area
            [cycles/byte]   [cycles/byte]               Overhead
BLAKE       155.2           102.9           -34%        ~0%
Grøstl      462.3           57.6            -88%        +10%
JH          463.8           383.5           -17%        +10%
Keccak      188.3           131.7           -30%        ~0%
Skein       157.6           112.6           -29%        ~0%
Algorithm   Data PIC24 [byte]   Data +ISE   Text PIC24 [byte]   Text +ISE
BLAKE       488                 -59%        1,028               -20%
Grøstl      982                 -78%        2,619               -69%
JH          1,550               -87%        4,649               -53%
Keccak      448                 -21%        3,480               -31%
Skein       242                 0%          5,734               -18%
Negligible 
area overhead! 
Constantin et al., ASAP, 2012 
Leakage power 
savings: 40% 
Speedup → active power 
improvement: 1.2x - 8x 
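The reduction percentages and the speedup range can be recomputed from the cycles-per-byte figures in the table. The sketch below does this with illustrative helper names; the recomputed average speedup comes out at roughly 171%, close to the 172% quoted on the slide, which was presumably derived from unrounded measurements.

```python
# (PIC24, PIC24 + ISE) cycles per byte, from the table above.
CYCLES_PER_BYTE = {
    "BLAKE":   (155.2, 102.9),
    "Groestl": (462.3, 57.6),
    "JH":      (463.8, 383.5),
    "Keccak":  (188.3, 131.7),
    "Skein":   (157.6, 112.6),
}

def speedup(base, with_ise):
    """Speedup factor of the ISE version over plain PIC24."""
    return base / with_ise

def reduction_percent(base, with_ise):
    """Relative reduction in cycles per byte, in percent."""
    return 100.0 * (1.0 - with_ise / base)

def average_speedup_percent(table):
    """Mean of the per-algorithm speedups, expressed as a percentage gain."""
    factors = [speedup(b, i) for b, i in table.values()]
    return 100.0 * (sum(factors) / len(factors) - 1.0)

if __name__ == "__main__":
    for name, (base, ise) in CYCLES_PER_BYTE.items():
        print(f"{name}: {speedup(base, ise):.2f}x, "
              f"-{reduction_percent(base, ise):.0f}%")
    print(f"average speedup: {average_speedup_percent(CYCLES_PER_BYTE):.1f}%")
```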
Multi-core systems are on the rise to solve  
thermal issues in desktop and HPC 
Very high clock frequency 
Performance limited by technology  
and TDP 
Active power still dominates 
Ultra-low-power design 
• Leakage is a significant contributor 
• Performance is usually not technology limited 
• Voltage scaling limitations due to reliability issues 
 
27 
Are multi-core architectures power efficient for 
embedded ultra-low-power designs? 
Medical grade ECG recording requires multiple (4-8) leads 
 
Multi-channel signal analysis is  
often embarrassingly parallel 
Filtering 
Baseline removal 
Data compression 
 
Applications are well suited for multi-core platforms: process 
individual leads on multiple cores in parallel 
 
Low-power design concept: parallel processing enables more 
aggressive voltage-frequency scaling 
28 
29 
 Several ULP cores sharing multi-bank DM 
 Multiple instruction, multiple data 
execution (MIMD) 
 Logarithmic interconnect [R11] 
 Simplified Overall Design 
 Simplified Clock network 
 Single Supply Voltage 
 No need for multi-port DM  
 Low leakage consumption 
However,  
 Occasional stalls for cores 
 Clock gating for reduced active 
power 
Dogan et al., IETCDS, 2012 
[R11] Rahimi, A.; et al., "A fully-synthesizable single-cycle 
interconnection network for Shared-L1 processor clusters," DATE, 2011. 
30 
Multi-core meets workload requirements at lower supply voltages 
31 
Multicore is the only viable solution for workloads 
higher than 50.1 MOps/s 
32 
Between 1.3–50.1 MOps/s workloads multi-core is 
more power efficient, up to 66%  
33 
The power consumptions become equal at 1.3 MOps/s 
Multi-core has reached voltage-scaling-limit at 5.58 MOps/s  
34 
Single-core is more power efficient for workloads 
lower than 1.3 MOps/s 
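The three workload regimes described above (single-core wins at low workloads, multi-core wins at moderate workloads, only multi-core is feasible at high workloads) can be reproduced qualitatively with a toy model. Everything in this sketch is an assumption made for illustration: the frequency-voltage relation, the voltage limits `V_MIN` and `V_NOM`, and the leakage per core. Its numbers therefore do not match the 1.3 and 50.1 MOps/s break points measured for the real design.

```python
import math

V_MIN, V_NOM = 0.3, 1.0  # assumed voltage-scaling limit and nominal supply [V]

def f_max(vdd, vt=0.4, n=1.5, phi_t=0.026):
    """Assumed maximum rate per core [MOps/s, normalised] at supply vdd."""
    x = (vdd - vt) / (2.0 * n * phi_t)
    return math.log1p(math.exp(x)) ** 2 / vdd

def min_voltage(f_req):
    """Lowest supply in [V_MIN, V_NOM] that sustains f_req, or None."""
    if f_max(V_NOM) < f_req:
        return None
    if f_max(V_MIN) >= f_req:
        return V_MIN
    lo, hi = V_MIN, V_NOM
    for _ in range(60):  # bisection; f_max is monotonic on this interval
        mid = 0.5 * (lo + hi)
        if f_max(mid) >= f_req:
            hi = mid
        else:
            lo = mid
    return hi

def power(workload, n_cores, leak_per_core=0.05):
    """Normalised total power when the workload is split evenly over n_cores."""
    vdd = min_voltage(workload / n_cores)
    if vdd is None:
        return None  # workload cannot be met even at nominal voltage
    active = vdd ** 2 * workload               # ~ C * V^2 * total op rate
    leakage = n_cores * leak_per_core * vdd    # ~ N * I0 * V
    return active + leakage

if __name__ == "__main__":
    for w in (0.1, 20.0, 100.0):
        print(w, power(w, 1), power(w, 8))
```

In this model the multi-core advantage comes entirely from running each core at a lower voltage, while its disadvantage at low workloads comes from the additional leakage once both configurations have reached the voltage-scaling limit.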
35 
Fig. 2. Custom-Designed Core (TamaRISC) Architecture
ALU supports addition, subtraction, shift, logical AND, OR
and XOR, as well as full 16-bit by 16-bit multiplications. All
ALU instructions work on 3 operands, using the exact same
addressing mode options for each instruction, which reduces
the complexity of the architecture, since the operand fetch
logic and the arithmetic operation are completely decoupled.
Additionally, the instruction word encoding is designed as
regular (ﬁxed bit positions) and as simple as possible to allow
for very efﬁcient decoding of the operands and the different
instruction words in general. The supported addressing modes
are register direct, register indirect (with pre- or post-increment
and decrement), as well as register indirect with offset. Branching
is possible in direct and register indirect mode, as well as
by an offset with 15 different condition modes (dependent on
the processor status ﬂags: carry, zero, negative and overﬂow).
B. Crossbar Interconnects
The crossbar interconnects, both the D-Xbar and I-Xbar,
are a Mesh-of-Trees (MoT) interconnection network to sup-
port high-performance communication between processors and
memories [12]. The interconnects are intended to connect a
number of processing cores (in our case 8 cores) to a multi-
banked memory on data (i.e., 16 banks) and instruction sides
(i.e., 8 banks). The total memory access latency is one clock
cycle, however in case of multiple conﬂicting requests, for fair
access to memory banks, a round-robin scheduler arbitrates
access and a higher number of cycles is needed depending on
the number of conﬂicting requests, with no latency in between.
To reduce memory access time and increase shared memory
throughput, a read broadcast can be used and no extra cycles
are needed when such a broadcast occurs.
C. Instruction Memory Organization
As shown in Fig. 3, a significant amount of power (54%
of the total power consumption) is consumed by the IM in
the mc-ref architecture while executing the benchmark. This
is due to dedicated IM banks for each core. Biomedical signal
processing platforms often execute the same operations on
different input data on multiple cores, thus the same instructions
are read from the IM banks if the cores are in synchronization.
Nevertheless, all the IM banks are accessed, so power is
wasted. This power can be reduced by minimizing
the number of accesses to the IM banks: identical instructions
are read only once and broadcast to all
the cores. The IM banks have mostly identical contents and
differ only in a few instructions due to different memory locations for
the working data. However, a single instance of the compiled
application can be used for all the cores, provided that the
cores can access different working data with the same instruction
words. To this end, MMUs translate the same decoded
address to different physical memory addresses according to
the PID numbers. Using the same compiled application for
all the cores facilitates IM sharing via the I-Xbar. As opposed
to the mc-ref architecture, each core can access the entire
IM, 96 kBytes in total. However, instruction broadcasting
is beneficial only if the cores remain in synchronization. This
can be limited by possible DM conflicts, as analyzed
in Section IV-C2. Hence, to exploit instruction broadcasting,
we reorganize the DM and apply data broadcasting to
minimize the conflicts (cf. Section III-D).
Fig. 3. Power distribution in the mc-ref architecture: cores 27%, data memory 11%, data crossbar 3%, instruction memory 54%, clock 5%
Our proposed multi-core architecture enables two different
IM organizations: interleaved and banked instructions. The
ﬁrst one, ulpmc-int, interleaves instructions across the banks
to minimize IM conﬂicts in the case of synchronization lost
between the cores. The second one, ulpmc-bank, maps instruc-
tions into the minimum number of IM banks. The ulpmc-bank
is intended to reduce the memory leakage power consumption,
by applying power gating to the unused IM banks, which has
a signiﬁcant impact on the overall power consumption at low
workloads [9]. The only architectural difference between the
ulpmc-int and ulpmc-bank is the selection bits assignment for
the IM banks. The ulpmc-int selects the IM banks based on
the least signiﬁcant bits while the ulpmc-bank chooses the IM
banks according to the most signiﬁcant bits.
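The difference between the two IM organizations comes down to which address bits select the bank. The sketch below illustrates this; the bank size of 4096 words is an assumption derived from the 96 kBytes of total IM, 8 banks, and the 24-bit instruction words of TamaRISC, and the function names are hypothetical.

```python
N_BANKS = 8
WORDS_PER_BANK = 4096  # assumed: 96 kBytes / 3 bytes per word / 8 banks

def bank_interleaved(addr):
    """ulpmc-int style: bank chosen by the least significant address bits."""
    return addr % N_BANKS

def bank_banked(addr):
    """ulpmc-bank style: bank chosen by the most significant address bits."""
    return addr // WORDS_PER_BANK

def banks_used(program_words, select):
    """Set of IM banks touched by a program occupying addresses 0..N-1."""
    return {select(addr) for addr in range(program_words)}

if __name__ == "__main__":
    size = 1000  # hypothetical program size in instruction words
    print("interleaved:", sorted(banks_used(size, bank_interleaved)))
    print("banked     :", sorted(banks_used(size, bank_banked)))
```

With interleaving, consecutive instructions fall into different banks, which spreads out-of-sync fetches; with banking, a small program occupies a single bank so that the remaining banks can be power gated.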
D. Data Memory Organization
The working data sets are individual for each core whereas
read-only data can be shared between all cores. Application
proﬁling of the benchmark for DM accesses shows a distri-
bution of 76% private versus 24% shared accesses. Out of
the shared accesses, 92% are on the CS random vector while
8% are on the Huffman coding LUTs. To minimize data
access conﬂicts, the proposed architecture offers two different
sections in the DM: shared and private sections. The size of
the private and shared sections are conﬁgurable and deter-
mined during compilation of applications. The working data
is separate for each core, thus it is placed in the private section
whereas the shared LUTs are linked into the shared section.
The decoded address is used as the physical memory address
for a shared section access, whereas the address translation
is applied to generate the physical memory addresses for a
 High workloads:  > 50% of 
power due to instruction fetch 
Leakage power distribution: instruction memory 46%, data memory 47%, logic 7% 
 Low workloads: > 90% of the power 
due to leakage in memories 
Instruction fetch and instruction memory responsible for 
most of the power consumption 
Reduce number of memory accesses by exploiting 
application characteristics 
Reduce amount of (active) memory 
Dogan et al., DATE, 2012 
Multi-core architecture with 
voltage scaling for ultra-low-
power embedded DSP, utilizing 
advanced memory organization 
[D12b] 
 
Fully shared data and instruction 
memories 
Logarithmic interconnects [R11] 
between cores and memories 
Simple low-power processing 
cores (extended TamaRISC) 
36 
[D12b] Dogan; Constantin; Ruggiero; Burg; Atienza, “Multi-Core Architecture Design for Ultra-Low-Power Wearable 
Health Monitoring Systems”, DATE, 2012. 
 
Dogan et al., DATE, 2012 
37 
[Figure: execution timeline of Core-0 to Core-7 showing decoupled execution, core synchronization, and lockstep execution]
Application characteristics (multi-lead ECG) 
 Significant parts of the code execute on multiple data 
 Little or no data-dependent control flow 
 
Synchronize cores before running identical code segments in 
lockstep 
Dogan et al., DATE, 2012 
Power gating 
38 
Instruction 
broadcast 
Code segments for virtual SIMD execution 
Mapped to a single memory bank 
 Instructions broadcasting reduces memory access power 
Power gating of unused memory banks reduces leakage 
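The effect of instruction broadcasting on the number of IM reads can be illustrated by counting distinct fetch addresses per cycle. This is a simplified sketch: it assumes that identical program-counter values requested in the same cycle are served by a single broadcast read, and it ignores bank conflicts and stalls.

```python
def im_reads(pcs_per_cycle, broadcast=True):
    """Count instruction-memory reads for a trace of per-cycle fetch addresses.

    pcs_per_cycle: list of lists, one program-counter value per active core.
    """
    total = 0
    for pcs in pcs_per_cycle:
        total += len(set(pcs)) if broadcast else len(pcs)
    return total

if __name__ == "__main__":
    n_cores, n_cycles = 8, 100
    lockstep = [[cycle] * n_cores for cycle in range(n_cycles)]
    decoupled = [[cycle + 16 * core for core in range(n_cores)]
                 for cycle in range(n_cycles)]
    print("lockstep :", im_reads(lockstep), "vs", im_reads(lockstep, False))
    print("decoupled:", im_reads(decoupled), "vs", im_reads(decoupled, False))
```

The saving only materialises while the cores fetch the same address, which is why synchronization before lockstep code segments matters.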
Dogan et al., DATE, 2012 
39 
Address translation handled by a simple MMU 
 Address space partitioning can be changed at runtime 
Focus on minimizing data access conflicts 
 
Private segments: mapped into separate memory banks 
 No conflicts for accesses to private data 
Shared data: interleaved across banks to avoid conflicts 
 Data broadcasting avoids conflicts in SIMD mode 
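A possible address translation scheme is sketched below. The concrete layout is hypothetical and chosen only to illustrate the idea: shared addresses are interleaved across all 16 DM banks, and each of the 8 cores owns two banks for the private part of its address space, selected by its PID.

```python
N_BANKS = 16
N_CORES = 8
BANKS_PER_CORE = N_BANKS // N_CORES

def translate(addr, pid, shared_words):
    """Map a logical word address to (bank, row); the layout is hypothetical.

    Addresses below shared_words belong to the shared section and are
    interleaved across all banks. Higher addresses are private and are
    mapped into the banks owned by core `pid`.
    """
    if addr < shared_words:
        return addr % N_BANKS, addr // N_BANKS
    offset = addr - shared_words
    shared_rows = -(-shared_words // N_BANKS)  # rows reserved for shared data
    bank = pid * BANKS_PER_CORE + offset % BANKS_PER_CORE
    row = shared_rows + offset // BANKS_PER_CORE
    return bank, row

if __name__ == "__main__":
    shared = 64  # size of the shared section, adjustable at runtime
    for pid in range(N_CORES):
        print(pid, translate(100, pid, shared))
```

Since every core issues the same logical address when running identical code, mapping private data by PID keeps those accesses in disjoint banks, and changing `shared_words` corresponds to repartitioning the address space.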
Dogan et al., DATE, 2012 
  
 
 
 
 
 
Reference: processing cores with dedicated instruction memories 
86% power savings on instruction fetch 
Xbars consume less than 8% of the total power 
Broadcasting entails no noticeable power overhead 
40 
[Figure: active power [mW] of IM, DM, cores, D-Xbar, I-Xbar, and clock tree for the reference and the proposed architecture]
Dogan et al., DATE, 2012 
41 
For an embedded biosignal 
compression application: 
 Exploited SIMD operations 
 Only 4% execution time 
increase 
 Up to 45.7% power saving at 
high workloads  
 Up to 38.8% power saving at low 
workloads 
• Unused memory banks are 
power gated 
[Figure: normalized power of multi-core with MIMD vs. multi-core with SIMD + partial power gating]
Thanks to Virtual SIMD execution, a considerable amount of active 
power is saved at high workloads. Furthermore, leakage power is 
reduced dramatically for low workloads by partial power gating of 
memories 
Dogan et al., DATE, 2012 
42 
Virtual SIMD execution relies heavily on synchronous code execution 
 
 Lockstep execution not guaranteed even for embarrassingly parallel 
applications 
 Example: For a multiple-input biosignal filtering application, 
SIMD execution is exploited for less than 5% of instructions 
 
 What may bring synchronous cores out of lockstep code execution?  
• Shared DM accesses may lead to stalls of some cores 
• Conditional code sections in executed algorithms 
 
An algorithm with conditional code section 
43 
 Smart interconnects 
 Enhanced serving policy avoids 
synchronization loss due to DM 
access conflicts 
 
 Lightweight software-directed 
hardware synchronizer 
 Resynchronization after each 
section of conditional code 
 Dedicated Instructions keep 
overhead low 
[Figure: block diagram of the ULP multi-core. IM banks 0 to 7, I-Xbar, cores (Id = 0 to 7) with ISE, D-Xbar, DM banks 0 to 15, and a synchronizer connected to the cores through read/write ports with check-in, check-out, lock, stall, and wake-up signals]
Dogan et al., DATE, 2013 
44 
[Figure: power consumption [mW] vs. number of operations [MOps/s], without and with synchronizer (logarithmic axes), in three panels: Biosignal Filtering App., Iterative Square Root Kernel, Biosignal Feature Extraction App. Annotated operating points: 89 MOps/s at 10.46 mW and 211 MOps/s at 15.38 mW; 156 MOps/s at 12.61 mW and 290 MOps/s at 18.27 mW; 167 MOps/s at 13.93 mW and 336 MOps/s at 20.09 mW]
 Power reduction 
 Up to 2.3x increase for SIMD operations 
 64% power savings when voltage scaling is considered 
 Performance gains 
 Up to 2.4x speed-up due to better access coordination 
 
45 
Memory size in SoCs 
50x in 25 years 
[ITRS’11] 
46 
 Increasing need for embedded memories in SoCs & ASICs [ITRS’11] 
Memories are responsible for most of the active and leakage power 
in ULP embedded systems 
Memories are the first point of failure under scaled voltages and 
limit the yield  
SRAMs for scaled voltages (especially near- and sub-VT) are  
tedious handcrafted custom designs 
At low voltages, 6T SRAM suffers from: 
Read failures (bit-flips) 
Write contention (inability to flip bitcell) 
Hold failures (vanishing static noise margin) 
Access time failures 
 
Many new SRAM bitcells for robust  
sub-VT operation 
8T [1], 10T [2], and many more ... 
 
Complex write and read assist techniques 
47 
[1] Verma, ISSCC’07; [2] Calhoun, JSSC’07 
SNM degradation [2] 
8T SRAM 
bitcell with  
read buffer 
Intel 10-T SRAM 
[S. Jain, ISSCC’12] 
Memory compilers for scaled and 
sub-VT voltages are not available 
Standard-cell based memories (SCMs) 
 Synthesize from standard-cell libraries, or 
 from few specialized custom standard cells 
SCMs have many advantages 
Immediately functional across a large voltage range 
 From sub-VT to medium-performance near-VT  
Merge with logic (where appropriate) → power reduction 
Naturally implemented as two-port memories (higher bandwidth) 
Fine-granular organizations with any dimension 
Minimum design effort 
Generic description in any HDL 
Modifications at design time 
Portability (unless custom cells) 
Avoids complex power routing 
48 
LDPC decoder 
with SCMs 
Write Logic 
 Clock-gates (b): smaller 
and less power than 
enable flip-flops (a) 
Read Logic 
 Above/Near-VT 
 Multiplexers (c): 
smaller, faster, and 
less power than tri-
state buffers 
 Sub-VT 
 Tri-state buffers (d): 
less leakage 
(energy) than 
multiplexers 
Array of Storage Cells 
 Latch arrays smaller 
than flip-flop arrays, 
but longer write-
address setup time 
 
 
 
(a) Enable flip-flops (b) Clock gates 
(d) Tri-state buffers (c) Multiplexers 
Results are true for different technology nodes, 
fabs, and library providers 
Meinerzhagen et al., MWSCAS’10; Meinerzhagen et al., JETCAS’11 
49 
50 
Meinerzhagen et al., MWSCAS’10 
Silicon area:  
latch arrays and 
flip-flop arrays vs 
SRAM macrocells 
 
Up to ~1kb, SCMs 
smaller than 
SRAM macrocells 
 
For >1kb, SCMs 
have higher area 
cost than SRAM 
SNM 
SNM 
Meinerzhagen et al., JETCAS’11; [1] Calhoun et al., JSSC’05; [2] Calhoun et al., JSSC’07 
WID process variations in DSM CMOS compromise reliability 
Under gradual voltage scaling 
 Sequential cells fail earlier than combinational CMOS gates [1] 
Most reliability issues of 6T SRAM are avoided by latch 
 Write failures: unusually strong keeper ↔ feedback disabled 
 Read failures: degradation of SNM ↔ isolated output 
 Hold failures: SRAM bitcell = latch in non-transparent phase 
 Still good SNM at VDD=300mV 
SNM degradation 6T SRAM [2] 
Isolation buffer 
Disable keeper, 
No fight 
Standard-cell latch 
51 
Latch resembles very much a 
robust 10T SRAM cell 
 Transistor stacking and stretching for leakage reduction 
 3-state buffers used for read operation further reduce leakage, area, and routing 
52 Meinerzhagen et al., ESSCIRC 2012 
Chip microphotograph and zoomed-in layout picture 
 
Area cost of 12.7 µm² per bit (including peripherals) 
 
 Scan-chain test 
interface 
 
Functionality 
verification: 
W/R random and 
checker-board 
patterns 
 
Oven to control 
temperature: 
27 or 37°C 
315µm 
165µm 
53 Meinerzhagen et al., ESSCIRC 2012 
[1] MIT: Calhoun and Chandrakasan, JSSC 2007; [2] MIT: Sinangil, Verma, and Chandrakasan, 
JSSC 2009; [3] Intel CRL: Wang et al., JSSC 2008; [4] STM: Clerc et al., ESSCIRC 2012 
 Lowest leakage-power/bit ever reported in 65nm CMOS 
 Lowest active energy/bit-access ever reported in 65nm CMOS 
 Reduce leakage in array and periphery! 
Benefits of designing 1 custom standard cell 
 Leakage power reduced by 50% (at no area increase) w.r.t. commercial standard 
cell latch [Meinerzhagen et al., JETCAS’11] 
 
Considered work: Full macro, measured, 65nm node 
[D. Sylvester, ISCAS’11] has lower leakage power in 180nm CMOS 
Meinerzhagen et al., ESSCIRC 2012 54 
                     [1]          [2]          [3]            [4]    This work
VDDmin [mV]          380          250          700            350    420
VDDhold [mV]         230          250          500            250    220
Etot/bit [fJ/bit]    54 (0.4V)    86 (0.4V)    -              55     14 (0.5V)
Pleak/bit [pW/bit]   7.6 (0.3V)   6.1          6.0, 1.0 (a)   -      0.5
(a) Leakage power of bitcell only 
55 
Technology scaling and near threshold  
computing improves energy efficiency 
 
Voltage scaling compromises reliable  
operation due to  
 Process variations (permanent faults) 
 Timing violations (data dependent faults) 
 Single event upsets (temporary faults) 
 
BUT, biomedical systems must continue to  
operate within reliability limits 
2 
Dreslinski et al. IEEE, Feb 2010 
Need for recovery mechanisms to restore 
«sufficiently reliable» operation 
Protection against errors can be done in hardware or software 
 
Hardware protection 
 Error correcting codes (ECC) 
 Robust 
 High energy and area cost 
Firmware/software protection 
 Backward error correction: execution time ↑ 
 Forward error correction (SW-based ECC): complexity ↑ 
 
State-of-the-art methodology: protect  
everything for 100% reliable operation 
57 
Kim et al. MICRO-40 
Do we need to protect all data 
and operations equally? 
Main idea: Computations do not contribute equivalently to the QoS 
Equivalent computation vulnerability 
Protection overhead is significant 
Significance-based vulnerability 
 Few critical computations: guaranteed protection 
Many non-critical: best-effort protection 
 
 
3 
[Figure: % QoS delivered vs. % correct computations (both 0 to 100), with a good-case and a bad-case curve]
 
 
Significance-Driven Computation can achieve large energy 
savings while ensuring minimal QoS degradation 
59 
Need for low-power data 
compression on sensor node 
Data compression on the sensor node: 
reduce data to the essence to reduce 
power for radio communication! 
Courtesy of ESL@EPFL 
ECG data compression using Wavelet transform 
 
 
 
 
 High compression ratio (up to 80%) 
 High SNR (up to 25 dB) 
Signal representation in time- and wavelet domain: 
 Wavelet: energy concentrated in few coefficients Hint to unequal significance 
60 
DWT 
Significance sensitivity analysis 
Black box approach 
 Inject error (k) and observe faulty output 𝑌𝑘 
 
Sensitivity metric: Percentage root-mean  
square difference (PRD) 
PRD<2%     “very good signal” 
2% ≤ PRD ≤ 9%  “good signal” 
PRD>9%     “not suitable signal” 
 
Simulation setup/conditions 
 Sensor: 3 lead ECG 
ECG database: MIT-BIH 
Processor: TamaRISC @ 2MHz 
Data memory: 32KB (2 Byte words) 
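The sensitivity metric can be computed as sketched below. The function names are illustrative; the PRD is scaled to percent so that it can be compared directly with the 2% and 9% quality thresholds listed above.

```python
import math

def prd_percent(reference, faulty):
    """Percentage root-mean-square difference between the error-free
    output `reference` and the output `faulty` obtained with an injected error."""
    num = math.sqrt(sum((r - f) ** 2 for r, f in zip(reference, faulty)))
    den = math.sqrt(sum(r ** 2 for r in reference))
    return 100.0 * num / den

def classify(prd):
    """Signal-quality classes as listed on the slide."""
    if prd < 2.0:
        return "very good"
    if prd <= 9.0:
        return "good"
    return "not suitable"

if __name__ == "__main__":
    y = [3.0, 4.0]         # placeholder reference output, not ECG data
    y_k = [3.0, 3.5]       # same output with one corrupted value
    value = prd_percent(y, y_k)
    print(f"PRD = {value:.1f}% -> {classify(value)}")
```

Repeating this for an error injected at each location k yields a per-location sensitivity ranking, from which significant and less significant data can be separated.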
61 
Less significant significant 
$PRD(k) = \frac{\|Y - Y_k\|_2}{\|Y\|_2}$
 
Significant data: full protection 
 I: 12.5% significant 
 II: 25% significant 
 III: 37.5% significant 
 IV: 50% significant 
 
Significant data 
 Full software based error correction 
Less significant data 
 Only detection of errors 
 Remove erroneous data 
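The two protection levels can be sketched as follows. The specific mechanisms are assumptions chosen for illustration, since the slides do not name them: significant 16-bit words are stored in triplicate and recovered by a bitwise majority vote (correction), while less significant words carry a single parity bit and are replaced by zero when the check fails (detection and removal).

```python
def parity(word):
    """Even-parity bit of a 16-bit word."""
    return bin(word & 0xFFFF).count("1") & 1

def majority(a, b, c):
    """Bitwise majority vote over three copies."""
    return (a & b) | (a & c) | (b & c)

def store(words, n_significant):
    """Protect the first n_significant words fully, the rest with parity only."""
    significant = [[w, w, w] for w in words[:n_significant]]
    less_significant = [[w, parity(w)] for w in words[n_significant:]]
    return significant, less_significant

def load(significant, less_significant):
    """Correct significant words; drop less significant words that fail parity."""
    out = [majority(*copies) for copies in significant]
    out += [w if parity(w) == p else 0 for w, p in less_significant]
    return out

if __name__ == "__main__":
    data = [0x1234, 0x00FF, 0x0F0F, 0x8001]
    sig, rest = store(data, n_significant=2)
    sig[0][1] ^= 0x0010    # single bit flip in one copy of a significant word
    rest[1][0] ^= 0x0001   # single bit flip in a less significant word
    print([hex(w) for w in load(sig, rest)])
```

Shrinking `n_significant` reduces the storage and processing overhead of full protection, at the cost of more values being dropped when errors occur.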
 
62 
Error injection 
Rate: 1e-6 to 1e-9 bit/cycle 
Effective error occurrence: <40% of less significant data 
Case I and II (12.5/25% significant) 
 80% less overhead, 37% energy savings 
 But cannot meet the QoS requirements 
63 
[Figure: normalized energy consumption (1 to 1.9) vs. error injection rate (1e-9 to 1e-6 bit/cycle) for no significance, Case I, Case II, Case III, and Case IV]
Case III performance (37.5/50% sign.) 
 43% less overhead, 20% energy savings 
 Guarantee QoS 
Significance driven computation achieves 20% lower energy 
consumption, while guaranteeing good quality signal  
Voltage frequency scaling all the way down to sub-VT is a key 
technology for achieving ultra-low-power embedded systems 
 
Optimization of software-programmable architectures for low-
power consumption across all layers of abstraction  
Consider the right balance between complexity and voltage scaling 
Memory is often as important as logic, especially at scaled voltages 
 
Application specific processors bring significant benefit for memory 
reduction and enabling more voltage scaling 
 
64 
Multi-core architectures are an efficient means for ULP operation 
provided that  
application characteristics are properly exploited 
 Systems are optimized to exploit synergies between cores 
 
Finally, reliability issues must be addressed. New design paradigms 
such as deviation from 100% correct operation are promising ideas 
 
65 
  
 
 
 
Collaborators and Partners at  
  … Embedded Systems Laboratory at EPFL (Prof. David Atienza) 
  … Micrel Lab at University of Bologna (Prof. Luca Benini) 
  … Lund University (Prof. J. Rodrigues) 
 
66 
Thank you for your attention! 
 
Q & A 
andreas.burg@epfl.ch 
67 
