Application-Specific Processor Design for Low-Complexity & Low-Power Embedded Systems by Constantin, Jeremy Hugues-Felix
Jeremy Constantin, TCL, EPFL, Switzerland; Advisor: A. Burg (TCL); Collaboration: D. Atienza (ESL)
Low-Power Embedded Systems
Power Savings through Voltage Scaling
Custom Low-Power Processor Design & Evaluation Flow
Application to Compressed Sensing
Application to Cryptography
Application to Biomedical Signal Analysis
Conclusion
Portable, autonomous devices:
• Highly resource constrained systems
• Wireless (sensor) nodes
• Limited battery lifetime
• Application-specific use cases
• Flexibility & programmability desired
Embedded systems require…
Low power consumption through
• Low computational complexity
• Low memory footprint
Supply voltage down-scaling:
• Active power: $P_A = C \cdot V_{DD}^2 \cdot f$
• Drastic power savings
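The quadratic dependence on the supply voltage in the active power relation above can be made concrete with a small calculation. This is a minimal sketch: the capacitance, voltage, and frequency values are hypothetical and chosen only for illustration, and `active_power` is an assumed helper name.

```python
def active_power(c_eff, vdd, freq):
    """Active (dynamic) power P_A = C * VDD^2 * f, in watts."""
    return c_eff * vdd ** 2 * freq

# Hypothetical operating points (not taken from the poster).
C_EFF = 10e-12   # effective switched capacitance in farads (assumed)
FREQ = 10e6      # clock frequency in hertz (assumed)

nominal = active_power(C_EFF, vdd=1.0, freq=FREQ)
scaled = active_power(C_EFF, vdd=0.5, freq=FREQ)

# Ratio of scaled to nominal power at the same clock frequency.
ratio = scaled / nominal
print(f"power at half the supply voltage: {ratio:.2f} of nominal")
```

At a fixed clock frequency, halving the supply voltage cuts active power to roughly a quarter. In practice the achievable frequency also drops at lower voltage, which is the performance degradation discussed next.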
Near- and sub-VT operation:
• Reliability issues (solvable with circuit-level techniques and custom embedded memories)
• Strong performance degradation
To enable aggressive voltage scaling, alleviate the performance degradation using application-specific instruction set processors (ASIPs).
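The reasoning here is that an ASIP needs fewer cycles per task, so a lower clock frequency (and hence a lower supply voltage) still meets the same real-time deadline. A minimal sketch of that arithmetic follows; the workload numbers and the helper `required_clock_hz` are hypothetical, and the speedup factor is an assumed illustrative value.

```python
def required_clock_hz(cycles_per_task, tasks_per_second):
    """Minimum clock frequency that still meets the required task rate."""
    return cycles_per_task * tasks_per_second

# Hypothetical workload: cycles per task on the baseline core and task rate.
BASELINE_CYCLES = 620_000   # assumed
TASK_RATE = 10              # tasks per second, assumed
SPEEDUP = 62                # assumed speedup of the extended core

baseline_hz = required_clock_hz(BASELINE_CYCLES, TASK_RATE)
asip_hz = required_clock_hz(BASELINE_CYCLES // SPEEDUP, TASK_RATE)

print(f"baseline core needs {baseline_hz} Hz, ASIP needs {asip_hz} Hz")
```

The lower frequency requirement is what leaves room for operating at a reduced supply voltage despite slower transistors.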
TamaRISC: custom RISC MCU
• Baseline processing core (< 10 kGE)
• 16-bit Harvard architecture
• 3-stage pipeline
• 24-bit instruction word
• 14 single word, single cycle instructions
• Minimalistic 16-bit ALU
• Addressing modes for efficient execution of signal processing applications
• Sleep mode, interrupt, and basic HW-loop support
• Embedded real-time OS support
• Custom C compiler
[Figure: TamaRISC core architecture. Recoverable labels: Instruction Fetch; AND / OR / XOR / L/R(A)-Shift; 16-bit (un)signed Multiplier; Writeback]
Rapid design space exploration based on a single golden processor model
[Figure: design and evaluation flow. Recoverable label: ASIP core architecture]
Ultra-low-power ASIP: TamaRISC-CS [1]
• Low-complexity CS algorithm
• Instruction set extension (ISE) enables efficient random index generation
• 16-bit multi-step LFSR-based PRNG
• Drastically reduced memory footprint
• CS application speedup: 62x
• Power consumption improvement: 11.6x
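To illustrate how an LFSR-based PRNG can generate random indices on the fly instead of reading them from memory, here is a minimal software sketch. The feedback polynomial (taps 16, 14, 13, 11, a well-known maximal-length choice), the seed, the step count, and all function names are assumptions for illustration; the poster does not state which polynomial or step width the ISE uses.

```python
MASK16 = 0xFFFF

def lfsr_step(state):
    """Advance a 16-bit Fibonacci LFSR by one bit (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return ((state >> 1) | (bit << 15)) & MASK16

def lfsr_multi_step(state, steps):
    """Advance the LFSR by several bits, as a multi-step unit would in one go."""
    for _ in range(steps):
        state = lfsr_step(state)
    return state

def random_indices(seed, count, num_rows, steps=16):
    """Return `count` pseudo-random indices in [0, num_rows); seed must be nonzero."""
    state, out = seed, []
    for _ in range(count):
        state = lfsr_multi_step(state, steps)
        out.append(state % num_rows)
    return out

def lfsr_period(seed=0xACE1):
    """Count single-bit steps until the LFSR state returns to the seed."""
    state, n = lfsr_step(seed), 1
    while state != seed:
        state = lfsr_step(state)
        n += 1
    return n

print(random_indices(0xACE1, 12, 256))
```

Because the index sequence is fully determined by the seed, it can be regenerated whenever needed, which is one way such a scheme avoids storing the indices. Note that `state % num_rows` is only unbiased when `num_rows` is a power of two.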




for i := 1 → n do
    sample := getSample()
    …
Pseudocode of Reduced Complexity CS Algorithm
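The loop body is not reproduced here, so the following is only a sketch of how such a per-sample loop might continue. It assumes a sparse binary sensing matrix in which each incoming sample is added to a few pseudo-randomly chosen entries of the measurement vector; this is an assumption, not the author's stated algorithm. Python's `random.Random` stands in for a hardware PRNG, and `compress` and its parameters are hypothetical names.

```python
import random

def compress(samples, num_measurements, nonzeros_per_column, seed=1):
    """Accumulate each sample into a few pseudo-randomly chosen measurements.

    Equivalent to y = Phi * x for a sparse binary Phi whose nonzero
    positions are regenerated from the seed rather than stored.
    """
    rng = random.Random(seed)          # stand-in for a hardware PRNG
    y = [0] * num_measurements
    for sample in samples:             # one pass, one sample at a time
        for _ in range(nonzeros_per_column):
            idx = rng.randrange(num_measurements)
            y[idx] += sample           # additions only, no multiplications
    return y

x = list(range(1, 65))                 # 64 hypothetical input samples
y = compress(x, num_measurements=16, nonzeros_per_column=4)
print(len(x), "samples compressed into", len(y), "measurements")
```

In this form the compression needs only index generation and additions, and no sensing matrix has to be kept in memory.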
[1] Constantin, J., et al., “TamaRISC-CS: An Ultra-Low-Power Application-Specific
Processor for Compressed Sensing,” 20th IFIP/IEEE International Conference on
Very Large Scale Integration (VLSI-SoC), 2012
[2] Constantin, J., et al., “Instruction Set Extensions for Cryptographic Hash
Functions on a Microcontroller Architecture,” 23rd IEEE International Conference on
Application-specific Systems, Architectures and Processors (ASAP), 2012
[3] Constantin, J., et al., “Investigating the Potential of Custom Instruction Set
Extensions for SHA-3 Candidates on a 16-bit Microcontroller Architecture,”
Cryptology ePrint Archive, 2012
[4] Dogan, A., et al., “Low-power Processor Architecture Exploration for Online
Biomedical Signal Analysis,” IET Circuits, Devices & Systems, 2012
[5] Dogan, A., et al., “Multi-Core Architecture Design for Ultra-Low-Power Wearable
Health Monitoring Systems,” Design, Automation & Test in Europe Conference &
Exhibition (DATE), 2012
[6] Dogan, A., et al., “Synchronizing Code Execution on Ultra-Low-Power
Embedded Multi-Channel Signal Analysis Platforms,” Design, Automation & Test in
Europe Conference & Exhibition (DATE), 2013
Small but dedicated instruction set extensions with extremely low hardware
overhead enable considerable speedup and memory footprint reduction, and in
turn power savings.
Operation in the near- and sub-VT regime is feasible, even for more
demanding online real-time signal processing tasks, when
combined with custom-tailored ASIP systems.
Cryptographic Hash Functions (SHA-3) [2,3]
• Individual ASIP cores, algorithm-specific ISE
  - Extension of computational units
  - Finite state machines for data address generation
  - Lookup table integration
• Average speedup: 172%
• Average memory savings: 40%















[Figure: AT-Sweep Comparison. x-axis: MET Timing Constraint [ns]. Series: no ISE, Blake ISE, Skein ISE, Keccak ISE, JH ISE, Groestl ISE]
No significant additions to the processor datapath necessary
Negligible total area






Algorithm   no ISE   ISE     Speedup   Area
BLAKE       155.2    102.9   1.51      ~ 0%
Grøstl      462.3     57.6   8.03      +10%
JH          463.8    383.5   1.21      +10%
Keccak      188.3    131.7   1.43      ~ 0%
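As a quick consistency check on the four rows above, the third value in each row matches the ratio of the first two, rounded to two decimals. The field names `baseline` and `extended` are assumed labels for the first two columns, used here only for illustration.

```python
# (baseline, extended, reported ratio) per algorithm, copied from the rows above.
rows = {
    "BLAKE":  (155.2, 102.9, 1.51),
    "Grøstl": (462.3, 57.6, 8.03),
    "JH":     (463.8, 383.5, 1.21),
    "Keccak": (188.3, 131.7, 1.43),
}

# Recompute each ratio from the first two columns.
computed = {name: round(baseline / extended, 2)
            for name, (baseline, extended, _) in rows.items()}

for name, (_, _, reported) in rows.items():
    print(f"{name}: computed {computed[name]:.2f}, reported {reported:.2f}")
```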








BLAKE       488   -59%   1,028   -20%
Grøstl      982   -78%   2,619   -69%
JH        1,550   -87%   4,649   -53%
Keccak      448   -21%   3,480   -31%
Skein       242     0%   5,734   -18%
ISE example: state row rotation
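A state row rotation can be sketched in software as a cyclic shift of the bytes in each row of the hash state, which is the kind of data movement a dedicated instruction can perform without a sequence of shifts and masks. The row layout, the shift amounts, and the function names below are assumptions for illustration and are not taken from the poster's figure.

```python
def rotate_row(row, shift):
    """Cyclically rotate one state row to the left by `shift` positions."""
    shift %= len(row)
    return row[shift:] + row[:shift]

def rotate_state_rows(state, shifts):
    """Rotate every row of the state by its own shift amount."""
    return [rotate_row(row, s) for row, s in zip(state, shifts)]

# Hypothetical 2 x 4 byte state; the second row is rotated by one position.
state = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
print(rotate_state_rows(state, [0, 1]))
```

On a 16-bit core without such an extension, each rotated row would typically take several load, shift, mask, and store instructions.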









Wearable Online Health Monitoring Systems [4,5,6]
• TamaRISC integrated into single- and multi-core architectures for multi-lead ECG signal analysis
• Custom processor optimizations and extensions 





… for medium to high workloads (> 2 MOp/s)
• Banked shared memories
• Crossbar interconnects
• Virtual address space, MMU
• Data & instruction broadcast
• Core synchronization for efficient SIMD execution
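How banked shared memory and broadcast might interact when several cores read in the same cycle can be sketched as follows. The low-order interleaving (`addr % num_banks`), the rule of one access per bank per cycle, and the function names are assumptions for illustration, not details of the architectures in [4,5,6].

```python
def bank_of(addr, num_banks):
    """Map a word address to a bank using low-order interleaving (assumed)."""
    return addr % num_banks

def cycles_needed(addresses, num_banks):
    """Cycles to serve one read per core.

    Identical addresses are served once and broadcast to all requesters;
    distinct addresses that fall into the same bank are serialized.
    """
    per_bank = {}
    for addr in set(addresses):
        bank = bank_of(addr, num_banks)
        per_bank[bank] = per_bank.get(bank, 0) + 1
    return max(per_bank.values(), default=0)

print(cycles_needed([5, 5, 5, 5], 4))    # all cores read the same word
print(cycles_needed([0, 1, 2, 3], 4))    # each core hits a different bank
print(cycles_needed([0, 4, 8, 12], 4))   # distinct words in the same bank
```

Under these assumptions, cores that run in lockstep on the same instruction or data word are served in a single access, which is the case that synchronized SIMD-style execution makes common.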
