Energy efficient embedded systems consist of a heterogeneous collection of very specific building blocks, connected together by a complex network of many dedicated busses and interconnect options. The trend to merge multiple functions into one device makes the design and integration of these "systems-on-chip" (SOC's) even more problematic. Yet, specifications and applications are never fixed and require the embedded units to be programmable.
1 Introduction.
Embedded systems (e.g. a cell phone, a GPS receiver, a portable DVD player, an HDD camcorder) use an architecture that is a heterogeneous collection of very specific building blocks, connected together by a complex network of many dedicated busses and interconnect options. General-purpose programmable processors are not used, for energy efficiency reasons. Typically, multiple small embedded processor cores with accelerators, IP cores, etc. are used. The trend to merge multiple functions into one device (e.g. a cell phone with video capabilities) makes the design and integration of these "systems-on-chip" (SOCs) even more challenging.
Yet, specifications and applications are never fixed and require the embedded units to be programmable. A good balance between energy efficiency and programmability can be obtained by using programmable domain-specific processors. A well-known example is the programmable digital signal processor (DSP). DSPs are developed for wireless communication systems (mostly driven by cellular standards). In a first generation this meant that DSPs were adapted to execute many types of filters (e.g. FIR, IIR); later, communication algorithms such as Viterbi decoding and, more recently, Turbo decoding were added.
A first trend we notice is that more applications and multiple applications run in parallel or on demand on the device, e.g. video decoding, data processing, multiple standards, etc. A second trend we notice is that these new applications tend to run either on a separate domain specific programmable processor or on a hardware accelerator (the distinction between the two being rather blurry) next to the embedded DSP or micro-controller instead of being tightly coupled into the instruction set of the host processor.
A third trend we notice is that general-purpose programming environments are getting more heterogeneous and domain-specific. For energy efficiency reasons, the general-purpose solutions are augmented with domain-specific units, accelerators, IP cores, etc. This is clearly visible in FPGAs, as the new generations now include specialized blocks such as embedded cores, block RAMs and large numbers of multipliers. One successful example is the Virtex-Pro family of Xilinx [17]. These devices contain up to four Power PC cores, multiple columns of SRAM, multiple columns of multipliers, gigabit I/O transceivers, etc.
The architecture design of this heterogeneous SOC is a search in a three-dimensional design space, which we call the reconfiguration hierarchy [12]. First, in the Y direction: at what level of abstraction should the programming be introduced? Secondly, in the X direction: which component of the architecture should be programmable? Thirdly, in the Z direction: what is the timing relation between processing and the configuration/programming? Programming can be introduced at multiple levels of abstraction. When it is introduced at the instruction set level, the result is called a "programmable processor". When it is introduced at the CLB level of an FPGA, the result is called a reconfigurable device. Regarding components, a processor has four basic components: data paths, control, memory and interconnect. One has a choice of making some or all of them programmable. The third question compares the processing activity to the binding time. The answer makes a system configurable, reconfigurable, or dynamically reconfigurable.
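As an illustration only, the three axes of the reconfiguration hierarchy can be written down as a small data model. The Python sketch below uses hypothetical names (`Abstraction`, `Component`, `Binding`, `DesignPoint`, `classify`) introduced here; it covers only the two abstraction levels named in the text, and the binding time chosen for the example fabric is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Abstraction(Enum):      # Y axis: level at which programming is introduced
    INSTRUCTION_SET = "instruction set"
    CLB = "CLB"


class Component(Enum):        # X axis: which processor component is programmable
    DATAPATH = "data path"
    CONTROL = "control"
    MEMORY = "memory"
    INTERCONNECT = "interconnect"


class Binding(Enum):          # Z axis: binding time relative to processing
    CONFIGURABLE = "configurable"
    RECONFIGURABLE = "reconfigurable"
    DYNAMICALLY_RECONFIGURABLE = "dynamically reconfigurable"


@dataclass(frozen=True)
class DesignPoint:
    abstraction: Abstraction
    components: frozenset     # set of Component members made programmable
    binding: Binding


def classify(point):
    """Name a design point by its abstraction level, as the text does."""
    if point.abstraction is Abstraction.INSTRUCTION_SET:
        return "programmable processor"
    return "reconfigurable device"


# Example: an FPGA fabric with all four components programmable.
# The RECONFIGURABLE binding is an assumption made for this example.
fpga_fabric = DesignPoint(Abstraction.CLB, frozenset(Component), Binding.RECONFIGURABLE)
print(classify(fpga_fabric))
```

Tuning an architecture to a domain then corresponds to restricting which combinations of the three fields a platform is allowed to offer.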
The challenge is to develop a design environment to navigate in this three-dimensional design space. Several SOC platforms have been presented in literature. Most of them focus on general-purpose regular architectures, e.g. [2]. Very few focus on the low power issue and the need to tune the architecture towards the application. One example is the low power Maya platform [18]. Unique to our design approach is that we combine the design and programming of the architecture with an environment to explore the best options.
The paper is organized as follows. Sections 2 and 3 look at the architecture design, while sections 4 and 5 discuss the design exploration, co-design and co-simulation challenges.
Energy efficient heterogeneous SOCs.
The system designer needs an architecture platform that gives him the lowest energy consumption, but at the same time provides enough flexibility to allow re-programming or re-configuration. The key to energy efficiency is to tune the architecture to the application domain. This means freezing flexibility in the X (components) and Y (level of abstraction) direction of the reconfiguration hierarchy. A hierarchy of so-called "Y charts" allows us to do this in a top-down fashion [5].
A complex SOC will consist of multiple domain-specific processing engines. Each processor is programmable to a greater or lesser degree. It can be highly programmable if the processor is a micro-controller, a DSP engine or a blank box of CLB units. The efficiency goes up as domain-specific instructions are added. An example of this is the addition of a MAC instruction to a DSP processor. Loosely coupled co-processors will be more energy efficient but less flexible, as they fit a narrower application domain. An example is the Turbo coder acceleration unit. The ultimate energy efficient block is the optimized hard IP unit. Yet, it does not provide any flexibility. In an SOC, a range and collection of these blocks are used.
Similar arguments can be made for the interconnect component of an SOC. Currently, we see only two extreme options: either dedicated one-to-one connections and specialized busses, which have the lowest power consumption (to a first order), or general-purpose global busses or interconnect, as provided by FPGAs [17] or networks on chip [2]. The latter two are both general-purpose solutions at different levels of abstraction that give the designer maximum flexibility and programmability.
The RINGS architecture [16] is an architecture platform that gives the designer the option to explore the energy-flexibility trade-offs. An example is shown in Fig. 1. A RINGS architecture contains a heterogeneous set of building blocks: programmable cores, both DSPs and microcontrollers, programmable and/or reconfigurable hardware accelerator units, specialized IP building blocks, front-end blocks, and so on. When designing a solution based on RINGS, it is important that the domain expert has freedom to select the appropriate level of flexibility, ranging from fully programmable approaches, such as embedded microcontrollers or FPGA blocks, to highly optimized IP blocks. For different domains, the flexibility will be supported in different ways, as domains have different characteristics. This domain-specific flexibility can be expressed as a domain-specific abstraction pyramid, as shown for Networking, Video, and Signal Processing in Fig. 1. In case of Video, the engine will consist of elements expressed in the Video pyramid, for example dedicated co-processors.
The SOC is connected together at the top level by a supervising software program, which typically runs on an embedded micro-controller. At the bottom level, the reconfigurable interconnect glues it together. The programming paradigm used in RINGS is a reconfigurable network-on-chip. Also in this network, flexibility can be traded for energy efficiency at different levels of abstraction. Designers can instantiate an arbitrary network of 1D and 2D router modules, leading to an architecture illustrated in Fig. 2. This network illustrates the three binding time concepts. At the level of configuration, the static network architecture with routers is instantiated. Reconfiguration is done by means of reprogramming the routing tables, and programming by giving each packet a target address. A traditional reconfiguration is obtained by reprogramming the routing tables in each node. An alternative approach is to use an easy-to-reconfigure physical channel. One example of this is a CDMA-based reconfigurable interconnect [6] [16]. Fig. 3 shows a conceptual picture of a source-synchronous CDMA implementation. Each sender and receiver gets a unique spreading code. By changing the Walsh code, a different configuration is obtained. Traditional busses, which are a TDMA channel, require hardware switches for reconfiguration. CDMA interconnect has the advantage that reconfiguration can occur "on-the-fly."
3 Ultra low power components.
The focus of this section is on the architecture design options to design ultra low power processor components, in many cases without losing performance.
DSP processors have real-time constraints or need to maximize their throughput for a given task while at the same time minimizing the power or energy consumption. Therefore, the design of DSP processors is very challenging, as it has to take into account contradictory goals: an increased throughput request at a reduced energy budget. On top of that, there are new issues due to very deep submicron technologies, such as interconnect delays and leakage. For instance, hearing aids used analog filters 15 years ago and were designed as digital ASIC-like circuits 5 years ago. Today they are designed with powerful DSP processors below 1 Volt and 1 mW of power consumption [8]. Hearing aid companies require DSP processors simply because they require flexibility, i.e. to program the applications in-house.
The design of ultra-low power DSP cores has to be performed at all design levels, i.e. system, architecture, circuit and technology levels. We will focus in this section on DSP architectures, but VHDL implementations as well as cell libraries are important too. Latch-based implementations including gated clocks described in VHDL or Verilog, low-power standard cell libraries and leakage reduction circuit techniques are necessary to reduce power consumption at these low levels.
Various DSP architectures can be and have been proposed to significantly reduce the power consumption while keeping the largest throughput. Beyond the single-MAC DSP core of 5-10 years ago, it is well known that parallel architectures with several MACs working in parallel allow the designers to reduce the supply voltage and the power consumption at the same throughput. This is why many VLIW or multitask DSP architectures have been proposed and used, even for hearing aids. The key parameter to benchmark these architectures is the number of simple operations executed per clock cycle, up to 50 or more. However, there are some drawbacks. The very large instruction words, up to 256 bits, significantly increase the energy per memory access. Some instructions in the set are still missing for new, better algorithms. Finally, the growing core complexity and transistor count become a problem because leakage is roughly proportional to the transistor count. To be significantly more energy efficient, there are basically two ways, each impacting either flexibility or the ease of programming:
(1) to design specific, very small DSP engines for each task, in such a way that each DSP task is executed in the most energy efficient way on the smallest piece of hardware [9]. For N DSP tasks within a given application, the resulting architecture will be N co-processors or hardware accelerators around a controller or a simple DSP core, as illustrated in Fig. 1.
(2) to design reconfigurable architectures such as the DART cluster [3], in which configuration bits allow the user to modify the hardware in such a way that it fits the executed algorithms much better. Fig. 4 shows an example.
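The voltage-scaling argument for parallel MAC units can be made concrete with the usual first-order dynamic power model, P = C·V²·f. The following Python sketch is illustrative only: the capacitance, throughput and supply voltages are assumed values chosen for the example, not measurements from the cited designs, and the model ignores leakage and the cost of wider instruction words.

```python
def dynamic_power(c_eff, vdd, freq):
    """First-order switching power in watts: P = C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq


# Assumed numbers, for illustration only.
C_MAC = 10e-12        # effective switched capacitance of one MAC unit (F)
THROUGHPUT = 100e6    # required MAC operations per second

# One MAC: must run at the full rate, which we assume needs 1.8 V.
single = dynamic_power(C_MAC, 1.8, THROUGHPUT)

# Four MACs in parallel: each runs at a quarter of the rate, which we
# assume can be sustained at 1.0 V. Total capacitance is four times larger.
n = 4
parallel = n * dynamic_power(C_MAC, 1.0, THROUGHPUT / n)

print(f"single MAC : {single * 1e3:.2f} mW")
print(f"{n} MACs     : {parallel * 1e3:.2f} mW")
print(f"reduction  : {single / parallel:.2f}x")
```

Under these assumptions the saving comes entirely from the V² term, since the larger capacitance and the lower per-unit frequency cancel each other.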
Option (1) is definitely the best one regarding power consumption. Each DSP task uses the minimal number of transistors and transitions to perform its work. The control code unavoidable in every application is also efficiently executed on the controller or on the simple DSP, and some unexpected DSP tasks can be executed on the simple DSP if no accelerator is available. However, the main issue is the software mapping of a given application onto so many heterogeneous processors and co-processors (see Section 4). Transistor count could be high, and some co-processors may be entirely unused in some applications. Regarding leakage, unused engines have to be cut off from the supply voltages, resulting in complex procedures to start and stop them.
Regarding option (2), reconfigurable DSP architectures are much more power efficient than FPGAs. For example, the MAGIC DSP consumes 1 mW/MHz in 1.8 V, 0.18 µm CMOS. The same MAGIC in an Altera Stratix FPGA consumes about 10 mW/MHz of dynamic power, but also has a huge static power of 900 mW. So at 10 MHz, it consumes 1000 mW (100 mW dynamic plus 900 mW static). The key point is to reconfigure only a limited number of units within the DSP core, such as some execution units and addressing units [11]. The latter are interesting, as operand fetch is generally a severe bottleneck in parallel machines, for which 8-16 operands are required each clock cycle. So, sophisticated addressing modes can be dynamically reconfigured depending on the DSP task to be executed. However, the power consumption is necessarily increased due to the relatively large number of reconfiguration bits that have to be loaded into the configuration registers. Similarly, the reconfigurable units are necessarily more complex than non-reconfigurable units in terms of transistor count and therefore consume more. Software issues are also difficult, as users can define new instructions or new addressing modes that are difficult for the development tools to support.
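The FPGA comparison above is simple arithmetic on the quoted figures: total power is the dynamic power per MHz times the clock frequency, plus the static power. The short Python sketch below reproduces it. The function name is ours, and the static power of the CMOS implementation is taken as zero only because the text does not quote one.

```python
def total_power_mw(dynamic_mw_per_mhz, freq_mhz, static_mw=0.0):
    """Total power in mW = dynamic (mW/MHz) * f (MHz) + static (mW)."""
    return dynamic_mw_per_mhz * freq_mhz + static_mw


FREQ_MHZ = 10

# MAGIC DSP in 1.8 V, 0.18 um CMOS: 1 mW/MHz (static power not quoted, assumed 0).
asic = total_power_mw(1.0, FREQ_MHZ)

# Same MAGIC core on an Altera Stratix FPGA: about 10 mW/MHz plus 900 mW static.
fpga = total_power_mw(10.0, FREQ_MHZ, static_mw=900.0)

print(f"CMOS: {asic:.0f} mW, FPGA: {fpga:.0f} mW, ratio: {fpga / asic:.0f}x")
```

At low clock frequencies the static term dominates the FPGA figure, which is why the gap at 10 MHz is far larger than the 10x difference in dynamic power alone.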
4 Design & architecture exploration.
The way a system behaves depends on the architecture, the way the applications are written, and how these applications are mapped onto the architecture, as compactly expressed by the Y-chart [5]. Examples of low-power architectures have already been given in the previous sections. On such an architecture, mapping onto reconfigurable fabrics is typically done by the behavioral synthesis tool and the place and route tools. In the case of DSPs and CPUs, the mapping is typically performed by C compilers dedicated to a particular type of DSP or CPU. An important question remains: how to specify the applications so that they can take advantage of the architecture in an effective manner.
A low-power architecture will typically employ different levels of parallelism, like bit-level parallelism, instruction parallelism or task-level parallelism, to take advantage of voltage scaling as already explained in the previous section. To successfully map a DSP application at a high level, the applications need to express task-level parallelism. This parallelism is typically not present, as the applications are written in sequential languages like C or Matlab. Therefore, mapping them is often a manual process that is very tedious and time consuming, leading to a suboptimal system. A designer would like to have tool support that automatically converts the sequential specification into a parallel format. Moreover, the tool should allow him to 'play' with the amount of parallelism extracted from the specification. In general, such tools are lacking in embedded system design. Some companies, like Pico and Art (ARM/Adelante), try to provide limited commercial solutions, but this field is still very much subject to research. The Compaan tool suite [13] aims at providing designers the option to play with parallelism for applications that are so-called "Nested Loop Programs," a very natural fit for DSP applications. A DSP application is specified in a subset of Matlab and is automatically converted by Compaan into a network of parallel processes. These processes can be specified in C and mapped, using a conventional C compiler, onto a DSP or CPU. On the other hand, they can also be specified in VHDL and mapped using the appropriate tools onto some reconfigurable fabric or realized as a dedicated IP core [19]. Hence, "programming" the RINGS architecture is reduced to putting some processes onto the CPUs and DSPs while others are mapped onto FPGAs or use dedicated IP cores.
There are many ways to find parallelism in the application and to partition the processes over the CPUs, DSPs and reconfigurable resources. Being able to explore these options early on in the design phase is crucial to get efficient embedded low-power systems. To allow designers to do this exploration, Compaan is equipped with a suite of techniques [14], such as Unfolding, Skewing and Merging, that let designers play with the level of parallelism exposed in the derived network of processes. Skewing and Unfolding increase the amount of parallelism, while Merging reduces parallelism. By applying these techniques, many different networks can be created that can be mapped in different ways onto the architecture. When they are applied in a systematic way, the design space can be explored and the best performing network of processes can be picked.
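To give a feel for what such a transformation does, the Python sketch below mimics unfolding on a toy single loop: the iterations are distributed round-robin over a chosen number of processes, and a merge step gathers the results back in the original order. This is only an analogy written for this text, with hypothetical helper names (`unfold`, `merge`, `body`); it is not Compaan's algorithm, which operates on nested loop programs and derives process networks.

```python
def unfold(iterations, factor):
    """Split an iteration sequence round-robin over `factor` processes."""
    return [iterations[p::factor] for p in range(factor)]


def merge(partitions):
    """Inverse of unfold: interleave the partitions back into one sequence."""
    merged = []
    longest = max(len(part) for part in partitions)
    for i in range(longest):
        for part in partitions:
            if i < len(part):
                merged.append(part[i])
    return merged


def body(i):
    """Stand-in for the loop body; iterations are independent here."""
    return i * i


iterations = list(range(10))
reference = [body(i) for i in iterations]          # sequential program

for factor in (1, 2, 4):
    partitions = unfold(iterations, factor)
    # Each partition could run as its own process; here we run them in turn.
    results = [[body(i) for i in part] for part in partitions]
    assert merge(results) == reference
    print(factor, [len(part) for part in partitions])
```

The unfolding factor is the knob: a larger factor exposes more processes that can be mapped onto separate resources, while the computed result stays the same.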
The difference in utilization of the architecture for a particular network can be huge. By rewriting a DSP application (like beam-forming) using the presented techniques, we are able to achieve performances on a QR algorithm (7 antennas, 21 updates) ranging from 12 MFlops to 472 MFlops. We realized QR using commercial floating-point IP cores from QinetiQ that are pipelined with 55 (Rotate) and 42 (Vectorize) stages. We achieved this performance increase without doing anything to the architecture or mapping tools, but only by playing with the way the QR application is written, effectively improving the way the pipelines of the IP cores are utilized. Using a system like Compaan, an experienced designer should be able to obtain networks with very different performance in days, and thus has the opportunity to explore different systems and pick the one that uses the least amount of power.
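Why rewriting the application alone can change throughput so much is easy to see in a toy model of a deeply pipelined operator. In the Python sketch below, a unit with a 55-stage pipeline accepts at most one operation per cycle, and an operation may only be issued once the previous operation of the same dependency chain has left the pipeline. The model and its names are assumptions made for illustration; it is not the QR network or the IP core itself.

```python
def cycles_to_finish(depth, streams, ops_per_stream):
    """Cycles needed by one pipelined unit that issues at most one op per cycle.

    Ops within a stream form a dependency chain: the next op of a stream can
    only issue `depth` cycles after the previous one was issued.
    """
    ready = [0] * streams                    # earliest issue cycle per stream
    remaining = [ops_per_stream] * streams   # ops still to issue per stream
    cycle = 0
    finish = 0
    while any(remaining):
        for s in range(streams):
            if remaining[s] and ready[s] <= cycle:
                remaining[s] -= 1
                ready[s] = cycle + depth     # result leaves the pipeline here
                finish = cycle + depth
                break                        # only one issue slot per cycle
        cycle += 1
    return finish


DEPTH = 55  # pipeline depth quoted for the Rotate core

for streams in (1, 8, 55):
    ops = streams * 10
    cycles = cycles_to_finish(DEPTH, streams, 10)
    print(f"{streams:2d} independent chains: {ops} ops in {cycles} cycles, "
          f"utilization {ops / cycles:.1%}")
```

With a single dependency chain the pipeline is almost always empty; interleaving as many independent chains as there are stages keeps one operation entering the unit every cycle.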
5 Domain-Specific Co-design Environments.
As discussed in the previous section, parallelism and distributed processing are key to energy efficient architectures. Because the ensemble of architecture elements (processors, busses, memories) cooperates towards a common application, the designer faces a considerable co-simulation and co-design problem. A key requirement is to have a good design model. Such a model allows building of simulation tools, compilers and code generators. We will look at a highly successful design model for programmable systems: the instruction-set architecture (ISA). Next we will consider the approach taken by the RINGS architecture.
In a classic Von-Neumann architecture, the instruction-set architecture (ISA) model maintains a single, consistent and abstracted view of the operation of the system. Such a view ties four independent architecture concepts together: control, interconnect, storage, and data operations [15]. This way the ISA becomes a template for the underlying target architecture, for which compiler algorithms (scheduling etc.) can be developed. Often, however, the ISA is unable to offer the right target template, in terms of parallelism, storage capabilities or otherwise.
In the RINGS architecture, we do not use an ISA as an intermediate design model, but approach each of the four components that make up an ISA independently. We enumerate them below and look at the requirements they impose on co-simulation and co-design.
• Data Operations: Energy efficient operation requires us to specialize each operator as much as possible. A RINGS system contains multiple processing cores. These can include hardwired or programmable (DSP or RISC) processors. We thus need to be able to combine instruction-set simulation with hardware simulation.
• Storage: Energy efficient operation requires us to distribute storage. In addition to the high-level design transformations discussed in the previous section, we aim to minimize storage bandwidth and use multiple distributed memories. Each processor in RINGS will work inside a private memory space. Many operations in multimedia can be implemented with dedicated storage architectures that take only a fraction of the energy cost of a full-blown ISA. Examples are matrix transposition or scan-conversion. Such dedicated storage can be captured as a hardwired processor.
• Interconnect: The energy efficient interconnect architecture discussed in section 2 requires explicit expression of interconnect operations, in contrast to an ISA where this is implicitly encoded in the instruction format. A network-on-chip can be modeled as a dedicated hardware architecture [1]. On top of the network-on-chip a suitable network protocol must be implemented, for example message-passing with the MPI standard [7]. However, this protocol is also subject to specialization and/or hard-coding. For example, a hardwired DCT coding unit attached to a DSP core through RINGS will have a fixed communication pattern. This pattern can be hard-coded in a collapsed and optimized protocol stack.
• Control: Energy efficient operation requires us to split the data-flow and control-flow in a RINGS architecture and handle them independently. Fig. 5 clarifies this point. It shows the effect of moving an AES encryption operation gradually from a high-level software (Java) implementation to a dedicated hardware implementation, while at the same time maintaining the interface to the high-level Java model. It can be seen that the interface overhead goes from 0.8% for a C-accelerated AES to 8000% for a hardware-accelerated AES! This overhead is obviously caused by all the interfaces moving data from Java to C to hardware and back. With the MPI message passing scheme, we have the freedom to route control flow and data flow independently as messages. This way, we can eliminate or minimize this interface overhead.
When we put the elements together, we conclude that the RINGS co-design environment should accommodate multiple instruction-set simulators with user-specified hardware models. All of these must be embedded in a model of an on-chip network. The timing accuracy of the simulation should be precise enough to simulate interactions such as network-on-chip communication conflicts. On the other hand, the simulation must also be fast enough to support reasonable design exploration capabilities.
We have built the ARMZILLA environment to evaluate one class of RINGS architectures, namely those that can be built with one or more ARM cores, a network-on-chip, and dedicated hardware processors. Fig. 6 illustrates the ARMZILLA setup. There are three components: a hardware simulation kernel (GEZEL), one or more instruction-set simulators (ISS), and a configuration unit. The GEZEL kernel [4] captures hardware models with the FSMD (Finite-State-Machine with Datapath) model-of-computation. It uses a specialized language and a scripted approach to promote interactive design exploration. The cycle-true models of GEZEL can also be automatically converted to synthesizable VHDL. For the ARM ISS we use the cycle-true SimIT-ARM environment [10]. The ARM ISS uses memory-mapped channels to connect to the GEZEL hardware models. Finally, the configuration unit specifies a symbolic name for each ARM ISS, and associates each ISS with an executable. This way the memory-mapped communication channels can be set up, and the hardware GEZEL models can address each ARM memory space uniquely.
An example of what can be done with the ARMZILLA environment is shown in Table 1. This table shows cycle counts that were obtained after partitioning a JPEG encoding algorithm. The reference implementation runs on a single-ARM ISS model. In the second implementation, we separate the chrominance and luminance channels over two ARM processors. This seems a logical partition that splits the data operations roughly in two parts. However, it also creates a communication bottleneck in the on-chip network, and the resulting implementation becomes slower than the O3-level optimized single-processor implementation. The third implementation shows a better partitioning. In this case, the data streams are routed out of the ARM and into dedicated hardware processors for JPEG encoder subtasks. These processors can communicate directly amongst themselves.
6 Conclusions.
In this paper, we presented architecture design and design exploration for low power systems-on-chip. Low power is obtained by tuning all components of the architecture (data paths, control, memory and interconnect) to the application. This can occur at different levels of abstraction. The design of this type of SOC requires support by design models and methods. The design environments Compaan and GEZEL/ARMZILLA are illustrations of supporting tools for this design space exploration.
7 References.
