Technical University of Denmark



# A heterogeneous multi-core platform for low power signal processing in systems-on-chip

Paker, Ozgun; Sparsø, Jens; Haandbæk, Niels; Isager, Mogens; Nielsen, Lars Skovby

*Published in:* Proceedings of the 28th European Solid-State Circuits Conference, 2002. ESSCIRC 2002.

Publication date: 2002

Document Version Publisher's PDF, also known as Version of record


Citation (APA):

Paker, O., Sparsø, J., Haandbæk, N., Isager, M., & Nielsen, L. S. (2002). A heterogeneous multi-core platform for low power signal processing in systems-on-chip. In Proceedings of the 28th European Solid-State Circuits Conference, 2002. ESSCIRC 2002. IEEE Press.


# A heterogeneous multi-core platform for low power signal processing in systems-on-chip

Özgün Paker<sup>1</sup> Jens Sparsø<sup>1</sup> <sup>1</sup>Informatics and Mathematical Modeling Technical University of Denmark 2800 Lyngby, Denmark {opa,jsp}@imm.dtu.dk

Niels Haandbæk<sup>2</sup> Mogens Isager<sup>2</sup> Lars Skovby Nielsen<sup>2</sup> <sup>2</sup> Oticon A/S, Strandvejen 58 2900 Hellerup, Denmark {nch,moi,lsn}@oticon.dk

# Abstract

This paper presents a low-power and programmable DSP architecture – a heterogeneous multiprocessor platform consisting of standard CPU/DSP cores, and a set of simple instruction set processors called mini-cores each optimized for a particular class of algorithm (FIR, IIR, LMS, etc.). Communication is based on message passing.

The mini-cores are designed as parameterized soft macros intended for a synthesis-based design flow. A 520,000-transistor $0.25\,\mu$m CMOS prototype chip containing 6 mini-cores has been fabricated and tested. While executing non-trivial industrial applications, its power consumption is only about 50% higher than that of a hardwired ASIC and more than 6–21 times lower than that of a general-purpose CPU/DSP core.

# 1. Introduction

The design space between hardwired ASICs (energy efficiency) and general-purpose DSPs (flexibility) attracts a significant amount of research interest [9, 1, 4]. Today, the challenge that most system designers are facing is to achieve energy efficiency and flexibility simultaneously. The work described in this paper is an attempt to develop a programmable platform whose energy efficiency approaches that of a dedicated ASIC.

The application domain we are considering, audio signal processing and more specifically digital hearing aids, has extremely low power consumption requirements. A total power consumption on the order of 0.5–1.0 mW (at a 1.0 V supply) is typical. For this reason, many commercial hearing aids are based on hardwired ASIC solutions (including the recently published [5]), but fully programmable DSP-based solutions are also starting to emerge [4].



Figure 1. A heterogeneous multiprocessor architecture.

# 2. Overall architecture and related work

The design of an audio signal processing application (for example a hearing aid) usually starts with a specification in Matlab, often in the form of a complex Simulink data-flow structure of filters and other signal processing blocks that communicate at the sampling rate: FIR, IIR, N-LMS, Viterbi, FFT, etc. The idea pursued in this paper is to provide a platform composed of simple instruction set processors called mini-cores, each optimized for one of these classes of algorithms, and to provide a communication network that supports message passing among mini-cores as shown in figure 1. In addition to the specialized mini-cores, the platform includes standard CPU/DSP cores to implement less regular signal processing algorithms and control-dominated tasks.

Such a multi-core platform is both flexible and energy efficient. Its energy efficiency stems from the mini-cores being small and each optimized for a given class of algorithms, and from the fact that communication across the interconnection network occurs at a very moderate rate (basically corresponding to the sample rate). Its flexibility stems from the individual mini-cores being programmable and from the multitude of different processor cores, the latter compensating for the specialization of the individual mini-cores.

Designing a mini-core based platform for a given application involves instantiating different mini-cores as well as different versions of some of the mini-cores. To enable this we envision a traditional synthesis-based ASIC design flow, where (parameterized) VHDL descriptions of the different mini-cores are mapped into netlists of standard cells. This soft-macro approach has further advantages: (1) it allows the integration of other proprietary circuits on the same chip, and (2) the implementation is foundry independent.

A related approach is taken in the Pleiades project [9]. Here an on-chip general-purpose microprocessor (ARM8) is augmented with an array of heterogeneous programmable units (e.g. MAC-unit, memory, address generator etc.) that are connected by a reconfigurable interconnect. The configuration of the interconnect as well as these programmable units corresponds to wiring up a dedicated data flow circuit. Because of the energy inefficiency related to a configurable interconnect, and the high communication rate between the programmable units, the interconnect is highly optimized, exploiting low-swing full-custom circuitry [10]. In this respect our approach is different: the mini-cores keep data structures and operator modules local, and the communication rate is typically very low, close to the sample rate.

Another related work that targets wireless communication is [1] where an instruction set processor with a configurable datapath is presented. The datapath consists of simple functional units that are used to configure a compound computational unit with macro-operations/instructions. In our approach, we avoid the complexity of configurable structures by using dedicated compound combinational circuitry.

The low-power DSPs presented in [6] and [2] all use a variety of full-custom circuit techniques, and some of them even use dual $V_t$ processes to obtain high speed and low standby power consumption at the same time. The Coyote processor developed by GN Resound and Audiologic is among the most power-efficient designs in existence today [4]. This design has a specialized instruction set and a special add-multiply-accumulate unit called PMAC. Compared with our approach it is a much more coarse-grained processor. When it comes to power efficiency, it benefits from a hand-crafted full-custom design methodology, but (like any other traditional DSP) it suffers from its size and from its highly flexible datapath.

# 3. Architecture implementation

To evaluate our architecture, we designed a test chip with 6 mini-cores and a bus-based interconnect. More details on the mini-core architectures and the prototype can be found in [8].



Figure 2. Die photo of the test chip.

## 3.1. Introduction

The mini-cores have been designed for minimum power consumption: they are very small and they provide efficient support for operand access and compound operations. The latter results in a very low instruction count for a given task, and in combination with the small size this results in a surprisingly low power consumption.

## 3.2. The test chip

A die photo of the test chip is shown in figure 2. The test chip is implemented using a $0.25\,\mu$m CMOS standard cell library from STMicroelectronics. The core area is approximately $5\,\mathrm{mm}^2$ and contains 520 K transistors. We have designed a test board that is connected to a PCI-based Xilinx FPGA board, and tested our prototype via a host PC. The chip is fully functional at 1.8 V.

The mini-cores on the test chip are instantiated with different memory sizes and are running parts of a nontrivial industrial application.

## 3.3. The FIR mini-core

The FIR mini-core is a simple 2-stage pipelined mini-DSP with a special and small instruction set (only 15 instructions) for handling FIR filters efficiently. A frequently used FIR filter for audio applications is the interpolated linear phase filter as shown in figure 3. The coefficients for such a filter are symmetric around the midpoint of the impulse response and most of them are zero.

Figure 3. An interpolated FIR filter used in hearing aids.

The FIR mini-core has a custom add/subtract-multiply-accumulate instruction (asmacc) that is frequently used in symmetric FIR filter programs. Because of the custom instruction set, symmetric FIR filters implemented on the FIR mini-core typically use significantly fewer instructions per sample than the same filters on a DSP processor.
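The saving from a combined add-multiply-accumulate step can be illustrated with a small sketch. It shows only the arithmetic of a symmetric filter, where two mirrored samples share one coefficient and zero coefficients are skipped. The function names, coefficient values and step counting are assumptions made for illustration and do not describe the actual asmacc semantics or the mini-core's instruction set.

```python
def fir_direct(coeffs, window):
    """Reference FIR: one multiply-accumulate per tap, zeros included."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_symmetric(coeffs, window):
    """Symmetric FIR: add mirrored samples first, then multiply once.

    Assumes coeffs[i] == coeffs[n - 1 - i]. Zero coefficients are skipped.
    Returns (output, number of add-multiply-accumulate steps used).
    """
    n = len(coeffs)
    acc = 0
    steps = 0
    for i in range(n // 2):
        if coeffs[i] == 0:
            continue  # a zero tap costs nothing in a hand-written program
        # One combined step: acc += c * (x[i] + x[n-1-i])
        acc += coeffs[i] * (window[i] + window[n - 1 - i])
        steps += 1
    if n % 2 == 1 and coeffs[n // 2] != 0:
        # The centre tap of an odd-length filter has no mirror partner.
        acc += coeffs[n // 2] * window[n // 2]
        steps += 1
    return acc, steps


# Hypothetical 9-tap symmetric filter in which every other tap is zero.
COEFFS = [1, 0, -3, 0, 8, 0, -3, 0, 1]
WINDOW = [5, 2, 7, 1, 4, 9, 3, 6, 8]

reference = fir_direct(COEFFS, WINDOW)
output, steps = fir_symmetric(COEFFS, WINDOW)
```

For this example the nine-tap filter needs three combined steps instead of nine multiply-accumulates, while producing the same output. An antisymmetric filter would use a subtraction in place of the addition.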

## 3.4. The IIR mini-core

The basic building block for implementing a high-order IIR filter is a second-order IIR filter, known as a "biquad", implemented in direct form II as shown in figure 4.

The IIR mini-core is a simple 3-stage pipelined mini-DSP with a special instruction set and data path designed to implement an entire biquad section in two clock cycles. For this purpose, it has a dual-multiply-accumulate unit that computes two multiplications and additions simultaneously.

Another feature of the IIR mini-core that differentiates it from a DSP processor is the specialized register file used to store the delay elements of a biquad section. Each "register" is a two-place push-down stack.
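A direct form II biquad with its two delay elements held in a two-place push-down stack can be sketched as follows. This is a behavioural model under assumed names (`PushDownPair`, `biquad_step`) and an assumed sign convention for the feedback coefficients. It does not model the pipeline, the word length, or how the dual multiply-accumulate unit schedules the multiplications.

```python
class PushDownPair:
    """Two-place push-down stack holding w[n-1] (top) and w[n-2] (second)."""

    def __init__(self):
        self.top = 0.0
        self.second = 0.0

    def push(self, value):
        # Pushing a new value moves the old top down one place and
        # discards the old second entry.
        self.second = self.top
        self.top = value


def biquad_step(x, coeffs, state):
    """One sample of a direct form II biquad.

    coeffs = (b0, b1, b2, a1, a2) for the transfer function
    (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2).
    """
    b0, b1, b2, a1, a2 = coeffs
    # Feedback part: two multiplications on the stored delay elements.
    w = x - a1 * state.top - a2 * state.second
    # Feed-forward part: reuses the same two delay elements.
    y = b0 * w + b1 * state.top + b2 * state.second
    state.push(w)  # a single push performs the whole delay-line update
    return y


def run_biquad(samples, coeffs):
    """Filter a sequence of samples through one biquad section."""
    state = PushDownPair()
    return [biquad_step(x, coeffs, state) for x in samples]


# Impulse response of a hypothetical section.
impulse_response = run_biquad([1.0, 0.0, 0.0, 0.0], (1.0, 2.0, 1.0, -0.5, 0.25))
```

The push operation replaces the two separate register moves that a conventional register file would need to shift the delay line, which is one plausible reading of why such a register organization suits a biquad.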

## 3.5. Interconnect network

The mini-cores communicate over a network using message passing supported by send and receive instructions. Only point-to-point channels are supported. This abstraction is provided by the network interface units (labeled "NI" in figure 1), which separate the mini-core design from the specific interconnection topology.

A mini-core executing a receive instruction goes to "sleep" until the requested data item shows up at the specified channel. Likewise a mini-core executing a send instruction halts until the network consumes the data item in the output buffer. These sleep modes are handled by clock gating at the module level. A mini-core is only clocked when necessary, and this results in significant power savings.
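The blocking behaviour of send and receive can be modelled in software with threads and one-item buffers. The sketch below is an analogy only: the `Channel` class, the `None` end-of-stream marker and the two example "mini-cores" are assumptions for illustration, and a blocked thread merely stands in for a clock-gated core. Note also that `send` here blocks only while the buffer is full, which approximates but does not exactly match halting until the network consumes the item.

```python
import queue
import threading


class Channel:
    """Point-to-point channel with a one-item buffer (hypothetical model)."""

    def __init__(self):
        self._buffer = queue.Queue(maxsize=1)

    def send(self, item):
        self._buffer.put(item)  # blocks while the buffer is still occupied

    def receive(self):
        return self._buffer.get()  # blocks ("sleeps") until data arrives


def source_core(out_channel, samples):
    """Sends each sample, then an end-of-stream marker (illustration only)."""
    for sample in samples:
        out_channel.send(sample)
    out_channel.send(None)


def gain_core(in_channel, out_channel, gain):
    """Receives a sample, scales it, and forwards it on another channel."""
    while True:
        sample = in_channel.receive()
        if sample is None:
            out_channel.send(None)
            return
        out_channel.send(gain * sample)


def run_pipeline(samples, gain):
    """Connects source -> gain -> collector using two channels."""
    first, second = Channel(), Channel()
    workers = [
        threading.Thread(target=source_core, args=(first, samples)),
        threading.Thread(target=gain_core, args=(first, second, gain)),
    ]
    for worker in workers:
        worker.start()
    results = []
    while True:
        sample = second.receive()
        if sample is None:
            break
        results.append(sample)
    for worker in workers:
        worker.join()
    return results


pipeline_output = run_pipeline([1, 2, 3], 2)
```

Each stage advances only when data is available, so the rate of the whole pipeline is set by the rate at which samples enter it.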

The test chip currently has a bus-based interconnect network with a simple round-robin arbitration scheme (based on a circulating token). The current network accounts for approximately 10% of the power consumption while running a typical filter application.
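Round-robin arbitration with a circulating token can be sketched as below. The exact protocol of the test chip is not described here, so the model assumes that the token advances by one node every clock cycle and that a node may use the bus only in the cycle in which it holds the token. The function name and parameters are hypothetical.

```python
def circulate_token(pending, num_nodes, cycles, start=0):
    """Simulate a circulating-token arbiter for a shared bus.

    pending   -- node indices that each have one transfer waiting
    num_nodes -- number of nodes attached to the bus
    cycles    -- number of clock cycles to simulate
    start     -- node holding the token in the first cycle
    Returns one entry per cycle: the node granted the bus, or None.
    """
    waiting = set(pending)
    grants = []
    token = start
    for _ in range(cycles):
        if token in waiting:
            waiting.discard(token)  # the token holder transfers its item
            grants.append(token)
        else:
            grants.append(None)  # bus idle, but the token still moves on
        token = (token + 1) % num_nodes
    return grants


# Six nodes, of which nodes 1 and 4 have data to send.
grant_sequence = circulate_token({1, 4}, num_nodes=6, cycles=6)
```

In this model the token keeps moving even when no node has data to send, so the arbiter is active in every cycle regardless of traffic.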



Figure 4. A biquad section.

Table 1. Comparing the mini-cores with hardwired ASICs and a low-power DSP core, extrapolating to 16 kHz sampling rate, 1 V power supply and similar semiconductor process.

| Benchmark  | Metric       | ASIC          | Mini-cores    | The ARC processor |
|------------|--------------|---------------|---------------|-------------------|
| IIR filter | Inst./sample | -             | 10            | 20                |
| IIR filter | Power @ 1 V  | $4.2\,\mu$W   | $6.8\,\mu$W   | $>148\,\mu$W      |
| Filterbank | Inst./sample | -             | 73            | 153               |
| Filterbank | Power @ 1 V  | $48\,\mu$W    | $71\,\mu$W    | $>423\,\mu$W      |

# 4. Results

In section 4.1, we will compare our mini-core designs with a 32-bit synthesizable DSP/RISC core, developed by ARC International, and hardwired ASICs developed by our industrial partners. To enable a fair comparison all power figures will relate to a 0.25  $\mu$ m CMOS process assuming a supply voltage of 1.0 V. Following this, we will report idle power consumption of the mini-cores, and the interconnect network in section 4.2. Finally, in order to put the mini-core approach into a broader perspective, section 4.3 will provide W/MIPS figures for a collection of other designs reported in the literature.

## 4.1. Benchmark comparisons

We have used two benchmark programs in this evaluation: (1) a highpass IIR filter with two biquad stages, and (2) a filterbank consisting of interpolated FIR filters that divides the input signal into 7 frequency bands [3]. We have assumed a 16 kHz sampling rate for all benchmarks. All designs are clocked at the minimum clock speed that meets the required throughput. The power supply for all the benchmarks is 1 V.

The power data in table 1 are based on simulation results, except for the filterbank ASIC and mini-core results, which were obtained from actual measurements. Our experience is that power consumption estimates obtained through simulation are 15–20% higher than measured values.

The ARC processor is a synthesizable 32-bit RISC core intended for low-power, high-performance SoC-based designs. The basic CPU can be extended with a MAC unit and an XY data memory. Furthermore, it has a user-defined extendable instruction set. The specific instance that we have evaluated includes the basic CPU, 2x128x32 bits of XY memory, and a 24-bit pipelined MAC unit. The processor data includes the power consumption of the XY memory but not the program memory, as we used a behavioral model for the program memory in the simulations. The results presented therefore represent a lower bound, as indicated by the ">" symbol in the table.
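The relative figures implied by table 1 can be recomputed directly from its entries. The short script below does only that arithmetic. The dictionary layout and function name are illustrative, and the ARC ratios are lower bounds because the ARC power values in the table are themselves lower bounds.

```python
# Power figures from table 1, in microwatts at 1 V.
TABLE_1_POWER = {
    "IIR filter": {"asic": 4.2, "mini_cores": 6.8, "arc_lower_bound": 148.0},
    "Filterbank": {"asic": 48.0, "mini_cores": 71.0, "arc_lower_bound": 423.0},
}


def power_ratios(row):
    """Return (mini-cores / ASIC, ARC lower bound / mini-cores)."""
    versus_asic = row["mini_cores"] / row["asic"]
    versus_arc = row["arc_lower_bound"] / row["mini_cores"]
    return versus_asic, versus_arc


for benchmark, row in TABLE_1_POWER.items():
    versus_asic, versus_arc = power_ratios(row)
    print(f"{benchmark}: {versus_asic:.2f}x the ASIC power, "
          f"at least {versus_arc:.1f}x below the ARC processor")
```

The mini-cores come out at roughly 1.5 to 1.6 times the ASIC power, and the ARC processor at roughly 6 to 22 times the mini-core power.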

## 4.2. Idle power

We have measured the idle power consumption of the chip by running a test program that puts all the mini-cores in "sleep" mode. Power is then dissipated mainly in the interconnection network, for two reasons: (1) the free-running system clock feeds this block first before it is gated and distributed to the mini-cores, and (2) even though the arbitration protocol is simple and scalable, it is not power efficient, as the arbitration needs to be handled at each clock cycle, contributing to idle power. The power consumption of this block is $6.2\,\mu$W at 1 V and 1 MHz. We have also been looking into asynchronous solutions for the network that are showing promise in terms of idle and overall power consumption [7].

Table 2. Comparing the mini-core approach with other designs in literature.

| Design       | Technology   | Power metric         |
|--------------|--------------|----------------------|
| Coyote       | $0.25\,\mu$m | 100 $\mu$W/MIPS      |
| Lee et al.   | $0.35\,\mu$m | 210 $\mu$W/MHz       |
| Mutoh et al. | $0.5\,\mu$m  | 1100 $\mu$W/MHz      |
| Pleiades     | $0.25\,\mu$m | 10–100 $\mu$W/MOPS   |
| Phonak IC    | $0.25\,\mu$m | 14.4 $\mu$W/MOPS     |
| Mini-cores   | $0.25\,\mu$m | 11–26 $\mu$W/MIPS    |

On the other hand, measurements of the mini-cores in "sleep" mode show a power consumption of less than 1 $\mu$W. This supports the architecture concept, as we envision that a SoC design may even contain mini-cores that are unused, depending on the application. For this to work, the idle power consumption of the mini-cores should be negligible.

## 4.3. Some additional comparisons

Many articles on low power DSP architectures report only energy-per-instruction measures like W/MIPS, or W/MMACs (Mega Multiply-Accumulate per second). These figures should be taken with some care as they ignore the instruction-count-per-task issue.
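The caveat can be made concrete with a simple relation. Since 1 $\mu$W/MIPS equals 1 pJ per instruction, the power drawn by a task that runs once per sample can be written as below, where $E_{\mathrm{inst}}$ (average energy per instruction), $N_{\mathrm{inst}}$ (instructions per sample) and $f_s$ (sampling rate) are symbols introduced here for illustration only:

$$P_{\mathrm{task}} = E_{\mathrm{inst}} \cdot N_{\mathrm{inst}} \cdot f_s$$

Two designs with the same $E_{\mathrm{inst}}$ can therefore differ in $P_{\mathrm{task}}$ by the ratio of their instruction counts. In table 1, for instance, the ARC processor needs roughly twice as many instructions per sample as the mini-cores (20 versus 10, and 153 versus 73).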

Based on the power figures and benchmark programs reported in the previous section, we can estimate the absolute power efficiency of a mini-core to be around 21–53 $\mu$W/MIPS (for relatively complex instructions), or 26–62 $\mu$W/MMACs. These results are obtained using a normal standard cell library. The foundry also offers a low-power version of the process and cell library, which exhibits half the power consumption. For comparison purposes it would thus be fair to claim 11–26 $\mu$W/MIPS and 13–31 $\mu$W/MMACs.

Table 2 shows a comparison with some other designs reported in the literature, which were introduced in section 2. All these designs involve at least some full-custom layout, and can be characterized as "optimized" DSPs where an instruction typically involves one multiply-accumulate operation and some address pointer updating. For the Pleiades architecture and the Phonak IC, it is unclear what is meant by an "instruction" or an "operation", and it is therefore unclear how to compare them with our design, where an instruction may be rather complex and involve several "operations". Since the mini-cores do more work per instruction than a general-purpose DSP core, as table 1 shows, a figure of perhaps 6–13 $\mu$W/MOPS may be more relevant for comparing our design.

# 5. Conclusion

This paper presented a low-power and programmable DSP architecture – a heterogeneous multiprocessor platform consisting of standard CPU/DSP cores, and a set of simple instruction set processors called mini-cores, each optimized for a particular class of algorithm (FIR, IIR, LMS, etc.). Communication is based on message passing. The mini-cores are parameterized in word size, memory size, etc. and can be instantiated according to the needs of the application at hand.

Results obtained from the design of a prototype chip show a remarkably low power consumption that is only 1.5–1.6 times that of commercial hardwired ASICs and more than 6–21 times lower than that of current state-of-the-art low-power DSP processors. This is due to: (1) the small size of the processors and (2) a smaller instruction count for a given task.

In summary, the work reported in this paper represents an argument in favor of heterogeneous multi-core architectures where even compute intensive tasks are executed by small application domain specific instruction set processors.

# References

- [1] T. A. Lee, D. C. Cox, J. Nichols, and S. Asghar. "Low Power Reconfigurable Macro-Operation Signal Processing for Wireless Communications". In *48th IEEE Vehicular Technology Conference*, volume 3, pages 2560–2564, May 1998.
- [2] W. Lee et al. "A 1-V Programmable DSP for Wireless Communications". *IEEE Journal of Solid State Circuits*, 32(11):1766–1776, November 1997.
- [3] T. Lunner and J. Hellgren. "A digital filterbank hearing aid – design, implementation and evaluation". In *Proceedings of ICASSP'91*, pages 3661–3664, Toronto, Canada, 1991.
- [4] F. Møller, N. Bisgaard, and J. Melanson. "Algorithm and Architecture of a 1V Low Power Hearing Instrument DSP". In *International Symposium on Low Power Electronics and Design*, pages 7–11, August 1999.
- [5] P. Mosch, G. V. Oerle, S. Menzl, N. Rougnon-Glasson, K. V. Nieuwenhove, and M. Wezelenburg. "A 720 μW 50 MOPs 1V DSP for a Hearing Aid Chip Set". In *Proceedings ISSCC 2000*, pages 238–239, Feb. 2000.
- [6] S. Mutoh et al. "A 1-V Multithreshold-Voltage CMOS Digital Signal Processor for Mobile Phone Application". *IEEE Journal of Solid State Circuits*, 31(11):1795–1802, November 1996.
- [7] S. F. Nielsen and J. Sparsø. "Analysis of low-power SoC interconnection networks". In *IEEE 19th Norchip Conference*, pages 77–86, November 2001.
- [8] Ö. Paker, J. Sparsø, N. Haandbæk, M. Isager, and L. S. Nielsen. "A heterogenous multiprocessor architecture for low-power audio signal processing". In A. Smailagic and H. D. Man, editors, *IEEE Computer Society Workshop on VLSI*, pages 47–53, April 2001.
- [9] H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, and J. Rabaey. "A 1-V Heterogenous Reconfigurable DSP IC for Wireless Baseband Digital Signal Processing". *IEEE Journal of Solid State Circuits*, 35(11):1697–1704, November 2000.
- [10] H. Zhang, M. Wan, V. George, and J. Rabaey. "Interconnect Architecture Exploration for Low Energy Reconfigurable Single-Chip DSPs". In *IEEE Computer Society Workshop On VLSI'99*, pages 2–8, April 1999.