# **Application-Specific Processor Design** for Low-Complexity & Low-Power Embedded Systems

### Jeremy Constantin, TCL, EPFL, Switzerland; Advisor: A. Burg (TCL); Collaboration: D. Atienza (ESL)

## Low-Power Embedded Systems

### **Portable, autonomous devices:**

- Highly resource constrained systems
- Wireless (sensor) nodes
- Limited battery life-time
- Application-specific use cases
- Flexibility & programmability desired

Embedded systems require low power consumption, achieved through low computational complexity.



# **Power Savings through Voltage Scaling**

### **Supply voltage down-scaling:**

- Active power: $P_{A} = C \cdot V_{DD}^{2} \cdot f$
- Drastic power savings
- Near- and sub-V<sub>T</sub> operation:
  - Reliability issues (solvable with circuit-level techniques and custom embedded memories)
  - Strong performance degradation
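The quadratic dependence of active power on the supply voltage can be illustrated with a short calculation. This is a minimal sketch of the active-power relation above; the function names and the example voltages and frequencies are hypothetical values chosen for illustration, not figures from this work.

```python
def active_power(c_eff: float, v_dd: float, freq: float) -> float:
    """Active power C * V_DD^2 * f (switched capacitance in F, supply in V, clock in Hz)."""
    return c_eff * v_dd ** 2 * freq


def power_reduction(v_nom: float, v_scaled: float,
                    f_nom: float, f_scaled: float) -> float:
    """Factor by which active power drops when moving from the nominal
    operating point to the scaled one (the capacitance cancels out)."""
    return (v_nom ** 2 * f_nom) / (v_scaled ** 2 * f_scaled)


if __name__ == "__main__":
    # Hypothetical operating points: halving V_DD at a constant clock.
    print(power_reduction(1.0, 0.5, 1e6, 1e6))   # voltage effect only
    # Halving V_DD and also running 10x slower, as the lower voltage may force.
    print(power_reduction(1.0, 0.5, 1e6, 1e5))
```

Halving the supply voltage alone cuts active power by a factor of four in this model. The accompanying drop in achievable clock frequency is the performance degradation that has to be recovered elsewhere.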



To enable aggressive voltage scaling, alleviate the performance degradation using application-specific instruction set processors (ASIPs) with a low memory footprint.

### **Custom Low-Power Processor**



#### TamaRISC: custom RISC MCU

- Baseline processing core (< 10 kGE)
- 16-bit Harvard architecture
- 3-stage pipeline
- 24-bit instruction word
- 14 single-word, single-cycle instructions
- Minimalistic 16-bit ALU
- Addressing modes for efficient execution of signal processing applications
- Sleep mode, Interrupt, and basic HW-loop support
- Embedded real-time OS support
- Custom C-compiler

### **Design & Evaluation Flow**

Rapid design space exploration based on a single golden processor model



**ASIP core architecture** described in LISA (PD)

Automatic generation of:

- Software tool-chain
- Cycle-accurate ISS
- Synthesizable HDL



Accurate power analysis using VCD-based post-layout gate-level simulations

# **Application to Compressed Sensing and Cryptography**

#### Ultra-low-power ASIP: TamaRISC-CS [1]

- Low-complexity CS algorithm
- Instruction set extension (ISE) enables efficient random index generation
- 16-bit multi-step LFSR-based PRNG
- Drastically **reduced memory** footprint
- CS application **speedup**: **62x**
- Improved **power** consumption: **11.6x**
- ISE induces less than 3% area overhead
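The idea of generating random indices on the fly from a 16-bit LFSR, rather than storing them, can be sketched in software as follows. This is an illustrative model only: the tap positions (16, 14, 13, 11), the seed `0xACE1`, the number of steps per index, and the sparse binary sensing matrix in `compress` are assumptions made for the example and are not taken from the TamaRISC-CS design.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one bit (assumed taps: 16, 14, 13, 11)."""
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (feedback << 15)


def lfsr16_multi_step(state: int, steps: int) -> int:
    """Advance the LFSR by several bits at once, as one multi-step operation."""
    for _ in range(steps):
        state = lfsr16_step(state)
    return state


def random_indices(seed: int, count: int, num_rows: int, steps: int = 4) -> list:
    """Produce `count` pseudo-random indices in [0, num_rows)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("an all-zero LFSR state never changes")
    indices = []
    for _ in range(count):
        state = lfsr16_multi_step(state, steps)
        indices.append(state % num_rows)
    return indices


def compress(samples, num_measurements, seed=0xACE1, ones_per_column=2):
    """Accumulate each sample into a few pseudo-randomly chosen measurements.

    Models y = Phi * x for an assumed sparse binary Phi whose nonzero
    positions are regenerated from the seed instead of being stored.
    """
    y = [0] * num_measurements
    idx = random_indices(seed, len(samples) * ones_per_column, num_measurements)
    for i, x in enumerate(samples):
        for row in idx[i * ones_per_column:(i + 1) * ones_per_column]:
            y[row] += x
    return y
```

Because the index sequence is fully determined by the seed, only the 16-bit state needs to be kept, which is the kind of memory saving an on-the-fly generator is meant to provide.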





**Instruction set** extensions are key for sub-V<sub>T</sub> operation

#### **Cryptographic Hash Functions (SHA-3) [2,3]**

- Individual ASIP cores, algorithm-specific ISE
  - Extension of computational units
  - Finite state machines for data address generation
  - Lookup table integration
- Average **speedup**: **172%** (about 2.7x the reference throughput)
- Average **memory** savings: **40%**
- Maximum core area overhead: 10%

**ISE example**: state row rotation, memory to memory (64 bytes)

|               | Reference ISA                    | ISE                                         |
|---------------|----------------------------------|---------------------------------------------|
| Instructions: | 124                              | 4 2                                         |
| Cycles:       | 126                              | 4 37                                        |
| Speedup:      | 1.0                              | 1.4                                         |
|               | Shift by 1, 3, 5, 7, 0, 2, 4, 6  | FSM-based memory address pattern generation |
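The row rotation in the ISE example can be modelled in software to show what an address-pattern generator has to produce. This is a minimal sketch under stated assumptions: the 64-byte state is treated as 8 rows of 8 bytes in row-major order, rows are rotated to the left, and the generator `address_pattern` is a hypothetical stand-in for the FSM, not the hardware implementation.

```python
SHIFTS = (1, 3, 5, 7, 0, 2, 4, 6)  # per-row shift amounts from the ISE example
ROW_LEN = 8                        # assumed: 8 rows of 8 bytes = 64 bytes


def address_pattern(shifts=SHIFTS, row_len=ROW_LEN):
    """Yield (source, destination) byte addresses for the whole state.

    Mimics an address-generating FSM: the caller only moves bytes and
    never computes a rotation offset itself.
    """
    for row, shift in enumerate(shifts):
        base = row * row_len
        for col in range(row_len):
            yield base + (col + shift) % row_len, base + col


def rotate_rows(src: bytes) -> bytes:
    """Rotate every row of the state, memory to memory."""
    if len(src) != len(SHIFTS) * ROW_LEN:
        raise ValueError("expected a 64-byte state")
    dst = bytearray(len(src))
    for s, d in address_pattern():
        dst[d] = src[s]
    return bytes(dst)
```

With the address sequence generated separately, the copy loop reduces to one load and one store per byte, which is the part a dedicated instruction can take over.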

| Algorithm | PIC24 [cycles/byte] | PIC24 + ISE [cycles/byte] | Speedup | Area Overhead |
|-----------|---------------------|---------------------------|---------|---------------|
| BLAKE     | 155.2               | 102.9                     | 1.51    | ~ 0%          |
| Grøstl    | 462.3               | 57.6                      | 8.03    | +10%          |
| JH        | 463.8               | 383.5                     | 1.21    | +10%          |
| Keccak    | 188.3               | 131.7                     | 1.43    | ~ 0%          |
| Skein     | 157.6               | 112.6                     | 1.40    | ~ 0%          |

| Algorithm | Data PIC24 [byte] | Data +ISE | Text PIC24 [byte] | Text +ISE |
|-----------|-------------------|-----------|-------------------|-----------|
| BLAKE     | 488               | -59%      | 1,028             | -20%      |
| Grøstl    | 982               | -78%      | 2,619             | -69%      |
| JH        | 1,550             | -87%      | 4,649             | -53%      |
| Keccak    | 448               | -21%      | 3,480             | -31%      |
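The speedup column follows directly from the two cycles/byte columns, and the average can be checked the same way. The sketch below recomputes both from the tabulated values; the variable and function names are illustrative, and the average is taken as the arithmetic mean of the per-algorithm speedups, which is an assumption about how the reported figure was derived.

```python
# Cycles per byte from the performance table: (PIC24, PIC24 + ISE).
CYCLES_PER_BYTE = {
    "BLAKE": (155.2, 102.9),
    "Grøstl": (462.3, 57.6),
    "JH": (463.8, 383.5),
    "Keccak": (188.3, 131.7),
    "Skein": (157.6, 112.6),
}


def speedups(table=CYCLES_PER_BYTE) -> dict:
    """Speedup = reference cycles/byte divided by ISE cycles/byte."""
    return {name: ref / ise for name, (ref, ise) in table.items()}


def average_gain_percent(table=CYCLES_PER_BYTE) -> float:
    """Mean speedup expressed as a percentage improvement over the reference."""
    values = list(speedups(table).values())
    return (sum(values) / len(values) - 1.0) * 100.0


if __name__ == "__main__":
    for name, factor in speedups().items():
        print(f"{name}: {factor:.2f}x")
    print(f"average gain: {average_gain_percent():.1f}%")
```

The mean speedup comes out at roughly 2.7x, in line with the reported average of 172% up to rounding. The average is dominated by Grøstl, so the typical gain for the other four algorithms is closer to 1.2x to 1.5x.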



### No significant additions to the processor datapath necessary

# **Application to Biomedical Signal Analysis**

Wearable Online Health Monitoring Systems [4,5,6]

- TamaRISC integrated into single- and multi-core architectures for multi-lead ECG signal analysis
- Custom processor optimizations and extensions tailored to ultra-low-power multi-core operation:
  - Banked shared memories
  - Crossbar interconnects
  - Virtual address space, MMU
  - Data & instruction broadcast
  - Core synchronization for efficient SIMD execution

Parallel computing architectures provide better power efficiency for medium to high workloads (> 2 MOp/s)

### Conclusion

Considerable speedup, memory footprint reduction and, in turn, power savings are enabled by small but dedicated instruction set extensions with extremely low hardware overhead.

Operation in the near- and sub-V<sub>T</sub> regime is feasible, even for more demanding online real-time signal processing tasks, when combined with custom-tailored ASIP systems.

[1] Constantin, J., et al., "TamaRISC-CS: An Ultra-Low-Power Application-Specific Processor for Compressed Sensing," 20th IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), 2012

[2] Constantin, J., et al., "Instruction Set Extensions for Cryptographic Hash Functions on a Microcontroller Architecture," 23<sup>rd</sup> IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2012

[3] Constantin, J., et al., "Investigating the Potential of Custom Instruction Set Extensions for SHA-3 Candidates on a 16-bit Microcontroller Architecture," Cryptology ePrint Archive, 2012

[4] Dogan, A., et al., "Low-power Processor Architecture Exploration for Online Biomedical Signal Analysis," IET Circuits, Devices & Systems, 2012

[5] Dogan, A., et al., "Multi-Core Architecture Design for Ultra-Low-Power Wearable Health Monitoring Systems," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012

[6] Dogan, A., et al., "Synchronizing Code Execution on Ultra-Low-Power Embedded Multi-Channel Signal Analysis Platforms," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013



ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE



Telecommunications Circuits Laboratory

