High-performance embedded systems face challenges that are exacerbated in portable applications. The required tradeoffs involve energy consumption, computing power and non-recurring engineering (NRE) costs.
Proposed solutions enhance standard processors with specialized peripherals for critical functions [1]. Mask-level configurable processors are available [2], but such solutions are necessarily bound to defined application domains. An alternative is to combine processors with programmable gate arrays [3] to support configuration of specific logic at deployment time, but this raises hardware design issues unfamiliar to high-level language programmers. A possible solution to both problems is to use embedded programmable logic to dynamically reconfigure the processor instruction set at run-time, depending on the algorithm currently being executed. To keep C as the main programming tool and ensure a user-friendly environment for application development, hardware interfacing is simplified by modelling the programmable logic as an extension of the instruction set.
Our computational model is based on a Very Long Instruction Word (VLIW) RISC processor (XiRisc, eXtended instruction set RISC) that uses embedded logic to create a pipelined, run-time configurable data-path (PiCo-GateArray, pGA). Figure 14.3.1 shows the system architecture, based on a classic five-stage pipeline. Each cycle, two 32b instructions are fetched from the instruction cache and processed concurrently, determining two different execution flows. A set of hardwired function units is shared between the two channels, performing general purpose DSP calculations such as byte-wise parallel arithmetic, multiply-accumulation and branch-decrement operations. Programming and operation of the pGA are controlled by assembly instructions inserted in the program flow (Fig. 14.3.2). The configuration process is concurrent with processor execution. To minimize its effect on performance, configuration bits are loaded from a dedicated cache, while the core has free access to other memory resources.
The aim of the proposed architecture is to maintain a high level of parallelism while decreasing energy usage by cutting down the main sources of consumption (instruction fetch and accesses to data memories). Computation is based on three concurrent flows: two are fed each cycle by the two 32b instruction fetches, and one is based on the independent, variable-latency pipeline running on the pGA. The three channels exchange data via the processor's 32-entry register file, featuring 4 read and 4 destination ports, two of which are reserved for the pGA. System control, memory and I/O interface, and general purpose arithmetic are handled by the two hardwired channels, whose VLIW configuration sustains a very high memory access rate, while application-specific computational kernels are more effectively implemented on the configurable device. Each pGA operation has its own pipeline pattern, independent from the core. Program flow consistency is preserved through a hardware stall logic based on a register locking mechanism.
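The register locking idea can be illustrated with a small behavioural sketch in C. The text only states that stalls are based on register locking; the specifics below are assumptions made for illustration: a lock is set on the destination register when a pGA operation is issued, it is released at write-back, and a core instruction stalls if any of its operands is locked. The names `regfile_locks`, `pga_issue`, `pga_writeback` and `must_stall` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one lock bit per entry of the 32-entry register file. */
typedef struct {
    uint32_t locked; /* bit r set => register r is waiting for a pGA result */
} regfile_locks;

/* Assumed behaviour: issuing a pGA operation locks its destination register. */
static void pga_issue(regfile_locks *s, unsigned dest) {
    assert(dest < 32);
    s->locked |= (UINT32_C(1) << dest);
}

/* Assumed behaviour: the lock is released when the pGA writes its result back. */
static void pga_writeback(regfile_locks *s, unsigned dest) {
    assert(dest < 32);
    s->locked &= ~(UINT32_C(1) << dest);
}

/* A core instruction stalls while any register it reads or writes is locked. */
static bool must_stall(const regfile_locks *s,
                       unsigned src1, unsigned src2, unsigned dest) {
    assert(src1 < 32 && src2 < 32 && dest < 32);
    uint32_t used = (UINT32_C(1) << src1) |
                    (UINT32_C(1) << src2) |
                    (UINT32_C(1) << dest);
    return (s->locked & used) != 0;
}
```

Because the pGA pipeline has variable latency, a scheme of this kind lets the core continue with independent instructions and stall only when it actually needs a pending result.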
The instruction set, including instructions controlling the pGA, is supported by a compilation, profiling and software evaluation environment based on a customization of the GNU GCC tool chain [4].
The pGA is an array of rows, each representing a possible stage of a customized pipeline. As shown in Fig. 14.3.3, each row is connected to the other rows and to the register file through programmable interconnect channels. Pipeline activity is controlled by a dedicated configurable control unit (c.u.), which generates two signals for each row. The first enables the execution of the pipeline stage, allowing registers in the row to sample new data, while the second controls initialization of a state possibly held inside the array. The c.u. can thus force the array to hold its internal state, which supports the implementation of high-level language constructs such as for/while loops.
Each row is composed of 16 Reconfigurable Logic Cells (RLCs) and a configurable horizontal interconnect channel. An RLC contains two 4-input, 2-output look-up tables, four registers and an internal loop-back connection to easily implement accumulators. On the feedback path, a block controlled by the c.u. supports two different kinds of state initialization: one with constant values known at compile time, and one with variable values. A carry chain block and dedicated wires along each row allow fast propagation of carry signals, achieving single-cycle 32b addition. To minimize reconfiguration time, the pGA includes a 4-layer cache of configurations distributed inside the array, requiring one clock cycle to switch among them. While one layer is computing, the other layers can be reconfigured, ideally incurring no penalty even when more than four configurations are used. Dedicated on-chip memories form a second-level cache with a wide (192b) bus, so that reconfiguration requires only 100 cycles on average.
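As an illustration of what a 4-input, 2-output look-up table can hold, the following C sketch models one such table as 16 entries of 2 bits each. The packing of the entries into a 32-bit word and the function names are assumptions made for this example and do not reflect the actual pGA configuration format. The table is filled with a 1-bit full adder purely as a familiar example; in the pGA itself, fast addition relies on the dedicated carry chain described above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding: entry i occupies bits [2i+1 : 2i] of a 32-bit word. */
static unsigned lut4x2_eval(uint32_t config, unsigned inputs) {
    assert(inputs < 16);
    return (config >> (2u * inputs)) & 0x3u;
}

/* Example contents: a 1-bit full adder.
 * Input bit0 = a, bit1 = b, bit2 = carry-in, bit3 = unused (ignored).
 * Output bit0 = sum, bit1 = carry-out. */
static uint32_t lut4x2_full_adder(void) {
    uint32_t config = 0;
    for (unsigned i = 0; i < 16; i++) {
        /* a + b + cin is in 0..3, i.e. exactly (carry-out, sum) in 2 bits */
        unsigned total = (i & 1u) + ((i >> 1) & 1u) + ((i >> 2) & 1u);
        config |= (uint32_t)total << (2u * i);
    }
    return config;
}
```

For instance, `lut4x2_eval(lut4x2_full_adder(), 0x7)` looks up the entry for a = b = carry-in = 1 and yields both output bits set (sum 1, carry-out 1).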
A prototype chip, composed of the VLIW core enhanced by an 8-row pGA, was fabricated in 0.18µm CMOS technology. Figure 14.3.5 details overall energy consumption. The memory contribution, which is the largest, scales roughly with the number of clock cycles, so speedup and energy consumption are directly related. pGA consumption ranges from 30% to 60% of the overall value, hence XiRisc shows energy consumption improvements when the speedup is higher than 1.5x-2.5x, achieving a reduction of up to 90%. Technical details of the prototype are given in Fig. 14.3.6, and a chip micrograph is shown in Fig. 14.3.7.
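The link between the pGA share of consumption and the break-even speedup can be made explicit with a simplified model, sketched here in C. The model is an assumption introduced for illustration, not the analysis used for the measurements: the non-pGA power is taken as unchanged when the pGA is active, and energy is power multiplied by execution time. If the pGA accounts for a fraction f of total power, total power grows by 1/(1-f) while execution time shrinks by the speedup S, so the energy ratio is 1/((1-f)S).

```c
#include <assert.h>

/* Speedup at which energy with the pGA equals energy without it,
 * given the pGA's fraction of total power (0 <= f < 1).
 * Simplifying assumption: non-pGA power is the same in both cases. */
static double breakeven_speedup(double pga_fraction) {
    assert(pga_fraction >= 0.0 && pga_fraction < 1.0);
    return 1.0 / (1.0 - pga_fraction);
}

/* Energy with the pGA divided by energy without it (< 1 means a saving). */
static double energy_ratio(double pga_fraction, double speedup) {
    assert(speedup > 0.0);
    return breakeven_speedup(pga_fraction) / speedup;
}
```

Under these assumptions, a pGA share of 30% gives a break-even speedup of about 1.4x and a share of 60% gives 2.5x, which is close to the 1.5x-2.5x range quoted above.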
XiRisc proved to be a flexible architecture, offering good performance for a wide range of algorithms. Where a high level of parallelism can be exploited (e.g. motion estimation) or for bit-level operations (e.g. DES), the pGA remarkably improves efficiency, offering speedups from 4x to 14x. For multiply-intensive computations (e.g. DCT), where general purpose programmable devices are not efficient, the VLIW data-paths allow speedups up to 1.8x. Average power consumption of the processor core and memories is 1.5 mW/MHz, while the pGA consumes 200µW/MHz for each active row.
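The per-MHz figures above can be combined into a rough power estimate, sketched below in C. The linear scaling with clock frequency and with the number of active rows is a simplifying assumption, and the function and macro names are hypothetical; only the two coefficients and the 8-row array size come from the text.

```c
#include <assert.h>

#define CORE_MEM_MW_PER_MHZ 1.5 /* processor core and memories */
#define PGA_ROW_MW_PER_MHZ  0.2 /* 200 uW/MHz per active pGA row */
#define PGA_ROWS            8   /* rows in the prototype pGA */

/* Estimated total power in mW, assuming linear scaling with frequency
 * and with the number of active pGA rows. */
static double xirisc_power_mw(double freq_mhz, unsigned active_rows) {
    assert(freq_mhz >= 0.0 && active_rows <= PGA_ROWS);
    return freq_mhz * (CORE_MEM_MW_PER_MHZ + PGA_ROW_MW_PER_MHZ * active_rows);
}

/* Fraction of the estimated total power drawn by the pGA. */
static double pga_power_share(unsigned active_rows) {
    assert(active_rows <= PGA_ROWS);
    double pga = PGA_ROW_MW_PER_MHZ * active_rows;
    return pga / (CORE_MEM_MW_PER_MHZ + pga);
}
```

With all eight rows active, this estimate gives 3.1 mW/MHz in total, of which the pGA accounts for roughly half.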
