We present Synchroscalar, a tile-based architecture for embedded processing that is designed to provide the flexibility of DSPs while approaching the power efficiency of ASICs. We achieve this goal by providing high parallelism and voltage scaling while minimizing control and communication costs. Specifically, Synchroscalar uses columns of processor tiles organized into statically-assigned frequency-voltage domains to minimize power consumption. Furthermore, while columns use SIMD control to minimize overhead, data-dependent computations can be supported by extremely flexible statically-scheduled communication between columns.
Introduction
Next-generation embedded applications demand high throughput with low power consumption. Current approaches often use Application-Specific Integrated Circuits (ASICs) to satisfy these constraints. However, rapidly evolving application protocols, multi-protocol embedded devices, and increasing chip NRE costs all argue for a more flexible solution. In other words, we want the flexibility of a programmable Digital Signal Processor (DSP) with energy efficiency more similar to an ASIC. We propose the Synchroscalar architecture, a tile-based DSP designed to efficiently meet the throughput targets of applications with multi-rate computational subcomponents. We focus upon next-generation signal processing applications which cannot be efficiently supported on today's DSPs, including Orthogonal Frequency Division Multiplexing (OFDM) for 802.11a, MPEG4 encoding, stereo feature extraction and correlation, and software radio digital down conversion. Contrary to traditional microprocessor design goals of the highest performance possible, our goal is to design the lowest power solution for set performance targets. Consequently, our metric of success is the lowest system power to achieve a solution, not raw performance.
While conventional wisdom credits the low power of ASIC implementations to their low area per operation [8], Synchroscalar invests area to achieve programmability while compensating with voltage scaling to achieve low power. Specifically, Synchroscalar uses multiple processor tiles and wide buses to exploit parallelism in order to achieve performance targets while running at low frequencies. Ideally, linear gains in performance translate to quadratic reductions in power due to voltage scaling.
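As a rough illustration of the scaling argument, assume that dynamic power follows the usual CMOS relation and that the supply voltage can be lowered in proportion to frequency. This is an idealization, and the symbols $C$ (switched capacitance per tile), $P_1$ (single-tile power) and $P_N$ (power with $N$ tiles) are introduced here only for illustration:

```latex
P_{1} = C V^{2} f
\qquad\Longrightarrow\qquad
P_{N} = N \cdot C \left(\frac{V}{N}\right)^{2} \frac{f}{N} = \frac{P_{1}}{N^{2}}
```

Each of the $N$ tiles runs at $f/N$ and $V/N$, so per-tile power falls by $N^{3}$ while the tile count rises by $N$. In practice the gain is smaller, since voltage does not scale linearly with frequency and cannot drop below a floor.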
In designing Synchroscalar, we focused on several key features of ASICs that lead to their energy efficiency: high parallelism, low control overhead, and custom interconnect.
Our design achieves power efficiencies within 8-30X of known ASIC implementations, which is 10-60X better than conventional DSPs. The success of the Synchroscalar design stems from the nature of its target application class: exploiting their multi-rate structure, intra-task data parallelism, and statically predictable control and communication. To this end, Synchroscalar uses a column-oriented 2D tile structure that follows three design principles.
First, Synchroscalar exploits parallelism to perform voltage scaling. We minimize hardware complexity by scaling voltages spatially rather than temporally. Columns of processors are statically assigned voltages rather than dynamically varying voltage for each processor. Computations are mapped to the appropriate frequency and voltage, and communication facilitates moving from one voltage domain to another.
Second, Synchroscalar amortizes control overhead by grouping each column of processors into a single thread of control, implemented with a single SIMD control unit and program memory.
Third, Synchroscalar minimizes communication overhead through substantial investment in statically configurable interconnect. Specifically, Synchroscalar's low clock frequencies enable the use of wide segmented buses. Because communication can be heavily data dependent and consequently inefficient to manage under SIMD control, we introduce a decoupled communication controller in each column to orchestrate data motion using static schedules. This enables extremely low overhead register-to-register inter-tile communication, which allows us to compete with the dedicated interconnects of ASICs.
In the remainder of this paper, we provide an overview of the Synchroscalar architecture and our multi-rate applications to establish the context of our study. Then we describe our evaluation methodology, including power models, SPICE simulations, VHDL synthesis, software tool chain, and cycle-level simulation. We analyze our results and discuss our intuitions from this analysis. We then conclude with related and future work.
The Synchroscalar Architecture
The high level observation that led to this design was that if an application can be parallelized efficiently on an architecture, then the clock and voltage can be scaled down in order to reduce power consumption. The column-oriented nature of Synchroscalar allows us to greatly reduce the complexity of control, communication, clock distribution and voltage scaling. SIMD controllers are used to amortize the control overhead and support efficient application parallelization in each column, while Data Orchestration Units (DOUs) provide communication flexibility by supporting statically-scheduled zero-overhead irregular communication. Interconnect bandwidth is highest within and between columns, in order to provide high-speed communication within an application. Lower bandwidth is required for communication between components. Additionally, each column of four tiles is supported by its own clock generator and supply voltage, both of which are configured at startup.
We will use the Digital Down Converter (DDC) application as an example of how the parallelization and mapping process works. Parallelization begins by recognizing stages in the application with a specific data rate between each stage. The first two stages of this application are the digital mixer and the CIC integrator (see Section 3 for full application descriptions). After exploring the trade-offs between computation and communication with varying levels of parallelization (described in Section 5) we find that the first stage, the mixer, should run on 8 tiles and the integrator on 8 tiles to minimize power consumption. The mixer is then mapped to the first two columns and the integrator to the third and fourth columns.
Once the parallelization and mapping is complete, the clock and voltage can then be scaled down based on the application needs. Given the DDC's target execution rate of 64 million samples per second, the mixer tiles need to run at 120 MHz and the integrator tiles at 200 MHz. These clock rates are generated from reference clocks which are fed into clock dividers in each column as shown in Figure 1. Supply voltages are also externally supplied, and SPICE simulations from Section 4 indicate that the mixer tiles can operate at 0.8V and the integrator tiles at 1.0V. With this simple example and overview in mind, the remainder of this section describes the major components of Synchroscalar in greater detail.
Parallelism
Parallelism is critical to the success of Synchroscalar, for it is through parallelism that we can reduce the clock frequency, and thus voltage, while continuing to meet performance targets. In this we are greatly aided by the statically-predictable, highly data-parallel nature of signal processing. While our applications are all hand-parallelized in this study, future work will focus on automated tools. We believe that automation is realistic, since our applications fit the Synchronous Dataflow (SDF) model of computation used in existing DSP design tools such as Ptolemy from UC Berkeley [6, 7, 9], Simulink from Mathworks, and SPW from Cadence.
The dataflow models allow for two forms of parallelism: within a Synchroscalar column and between columns. SDF also provides predictability by restricting the number of data values produced and consumed by a task to be a constant. This restriction imposed by the SDF model offers the advantage of static scheduling and decidability of key verification problems such as bounded memory requirements and deadlock avoidance [21].
SIMD Control
Low-overhead control is critical to the efficiency of ASICs. The data-parallel nature of signal processing applications allows a reduction in the cost of instruction fetch and decode through a single SIMD controller that sends instructions to the tiles in a column. The SIMD controller performs all control instructions, only forwarding computation instructions to the tiles. To communicate data for conditional branches, the SIMD controller is connected to the segmented bus with the tiles.
In order to support branch prediction, there would need to be a mechanism to squash instructions that have already been sent to the processing elements. Instead, we provide a short pipeline in the control unit to calculate branches quickly, and delay instructions from reaching the processing elements. This introduces a single-cycle stall for each conditional branch. For zero-overhead loops, there is still no delay, because the PC is used for decision making, not the actual instruction.
Note that applications do not always parallelize evenly into columns of 4 tiles, requiring occasional idle tiles. Idle tiles are assumed to consume negligible power through supply gating, so we sacrifice their area to simplify our design. Idle tiles are decided at startup.
Reconfigurable Interconnect
Synchroscalar exploits parallelism to increase efficiency, but these gains must not be lost to the communication overhead to support this parallelism. In particular, latency-critical communication must not be allowed to increase the Given our design goal of low system clock rates, we find that we can approximate the specialized interconnects of ASICs through a combination of segmented buses and a decoupled communication controller called the Data Orchestration Unit (DOU). Refer back to Figure 1 to see these buses and controllers arranged in each column. We note that signal processing applications exhibit much higher communication requirements within computational blocks than between blocks. Consequently, we allocate only a single horizontal bus between columns, which both meets bandwidth requirements and facilitates gather-scatter operations.
Synchroscalar buses are 256 bits wide, grouped into eight 32-bit separable vertical buses that are segmented in between each of the tiles. Although 256 bits wide might seem power-hungry, we shall see in Section 4 that the power consumption of the buses is small compared to the cost of supporting a higher frequency tile.
In addition, by suitably controlling the segment controllers, the bus can perform several parallel communications. For instance, if all the controllers are turned on, the bus becomes a low-latency broadcast bus, and all tiles are able to receive the same data in a single cycle. Alternatively, two messages can pass between neighboring tiles using the same wires in different segments, achieving the approximate bandwidth of a mesh if code is allocated to the tiles intelligently. The segmentation of the bus allows Synchroscalar to achieve higher levels of local bandwidth for very little cost in area and power and reduces tile idle time due to remote data dependencies.
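The effect of segmentation on one vertical bus can be sketched in a few lines of Python. This is only an illustrative model: the function name `reachable_tiles` and the assumption of one segmenter between each pair of adjacent tiles (three for a four-tile column) are ours, not details taken from the hardware description.

```python
def reachable_tiles(switch_closed, driver):
    """Return the tiles that see data driven by `driver` on one vertical bus.

    switch_closed[i] is True when the (assumed) segmenter between tile i and
    tile i + 1 connects the two bus segments.
    """
    n_tiles = len(switch_closed) + 1
    lo = hi = driver
    # Extend the connected region upward and downward through closed switches.
    while lo > 0 and switch_closed[lo - 1]:
        lo -= 1
    while hi < n_tiles - 1 and switch_closed[hi]:
        hi += 1
    return list(range(lo, hi + 1))


# All segmenters on: a broadcast from tile 0 reaches every tile.
broadcast = reachable_tiles([True, True, True], driver=0)

# Middle segmenter off: tiles 0-1 and tiles 2-3 can communicate in parallel
# over the same wires.
upper = reachable_tiles([True, False, True], driver=0)
lower = reachable_tiles([True, False, True], driver=2)
```

With the middle segmenter open, the two transfers use disjoint segments of the same wires, which is the source of the mesh-like local bandwidth described above.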
Data Orchestration Unit (DOU)
A key feature of Synchroscalar is statically-scheduled communication provided by the DOU decoupled controllers located in each column. The goal of the DOUs is to provide zero-overhead data movement between producer and consumer tiles. A producer writes to a special register, and, at a statically-scheduled time, a consumer can read that value from a receive register. The DOUs provide separate, cycle-by-cycle control of data motion and interconnect configuration. This flexibility facilitates irregular data motion and allows our applications to be efficiently scheduled in SIMD tile computations. The DOU operates at the maximum frequency, the frequency of the bus. Since the DOUs are very small, their power contribution is minimal. There is one DOU for each of the columns on Synchroscalar. The gray boxes that overlap the data bus in Figure 1 represent the segmenters, and the gray lines that connect the DOU to the segmenters are the control lines that are necessary to control each of the 8 splits. Figure 2 depicts a detailed logical diagram of the segmenters.
Figure 3. DOU Implementation
The DOU is simply a state machine in which the outputs of each state control the segmenters. The DOU must be programmed with the desired communications patterns for the column bus it controls. There are 128 states in the DOU. Each state entry in the DOU has five types of fields: CNTRi, SEGi, Bufferi, NXTSTATE0i, and NXTSTATE1i, as shown in Figure 3.
The CNTR field specifies which of the four DOU down counters should be checked for a given state in the DOU. If the counter specified by the CNTR field is zero, then the next state is the state pointed to by the NXTSTATE0 field of that given state and the down counter is reset to its initial value. If not, the DOU state machine proceeds to the state pointed to by the NXTSTATE1 field and decrements the counter selected by CNTR. There are four 32-bit down counters that are preprogrammed with the dynamic instruction count of the associated loop, allowing four nested loops. The SEG and Buffer fields are the outputs of a given state. They control the bus segmenters for a given column and the communications buffers for each tile in a given column, respectively.
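The state-transition rule just described can be sketched as a small Python model. This is a behavioral illustration only: the class and attribute names (`DouState`, `Dou`, `step`) and the encoding of SEG and Buffer as plain integers are assumptions made for the example, not the hardware's actual field widths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DouState:
    cntr: int       # index of the down counter to test (0..3)
    seg: int        # output: bus segmenter control bits (encoding assumed)
    buffer: int     # output: tile buffer control bits (encoding assumed)
    nxtstate0: int  # next state when the selected counter is zero
    nxtstate1: int  # next state otherwise


class Dou:
    NUM_STATES = 128
    NUM_COUNTERS = 4

    def __init__(self, states, counter_init):
        assert len(states) <= self.NUM_STATES
        assert len(counter_init) == self.NUM_COUNTERS
        self.states = list(states)
        self.init = list(counter_init)
        self.counters = list(counter_init)
        self.current = 0

    def step(self):
        """Return this cycle's (SEG, Buffer) outputs, then advance the state."""
        s = self.states[self.current]
        outputs = (s.seg, s.buffer)
        if self.counters[s.cntr] == 0:
            # Loop finished: reload the counter and take the NXTSTATE0 edge.
            self.counters[s.cntr] = self.init[s.cntr]
            self.current = s.nxtstate0
        else:
            # Loop still running: count down and take the NXTSTATE1 edge.
            self.counters[s.cntr] -= 1
            self.current = s.nxtstate1
        return outputs


# State 0 repeats while counter 0 is non-zero, then falls through to an idle state 1.
program = [
    DouState(cntr=0, seg=0xFF, buffer=1, nxtstate0=1, nxtstate1=0),
    DouState(cntr=1, seg=0x00, buffer=0, nxtstate0=1, nxtstate1=1),
]
dou = Dou(program, counter_init=[2, 0, 0, 0])
trace = [dou.step() for _ in range(4)]
```

With counter 0 preloaded to 2, state 0 is visited three times before the machine moves on, and the counter is back at its initial value ready for the next pass through the loop.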
Here is a quick example of the DOU's operation. Figure 4 shows a nested pseudo-code loop. It requires two DOU counters, one for I and one for J. The I loop counter would need to be 4*A and the J loop counter would need to be 2*B, assuming the FOR instruction loop can be encoded in a single assembly instruction. The output pattern would need to be programmed for each of the instructions that access the global data bus. For all other instructions, the output pattern is a "don't care".
Synchroscalar Tiles are based on the ADI/Intel Blackfin DSP ISA [20], but with control provided by the SIMD controller instead of in each tile. Additionally, each of the tiles has a read and a write buffer as shown in Figure 2. These buffers have a dual purpose. Their first function is to adapt the tile voltage to the bus voltage, as tile voltages across the Synchroscalar design may be different. Secondly, the buffers align a word of data onto the desired split of the global data bus. Register R7 is the designated communications register on each of the tiles. The DOU controls the alignment of this register on the data bus. Only three bits are required from the DOU to control the placement of the data on the data bus for each of the 8 splits of the bus.
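The alignment function of the buffers can be illustrated with simple bit arithmetic. In this sketch the 256-bit bus is modeled as a Python integer and the helper names `place_on_bus` and `read_from_bus` are hypothetical; only the 8 x 32-bit organization and the 3-bit split selector come from the text.

```python
SPLIT_WIDTH = 32  # bits per split
NUM_SPLITS = 8    # selected by a 3-bit field from the DOU
WORD_MASK = (1 << SPLIT_WIDTH) - 1


def place_on_bus(word, split):
    """Align a 32-bit word (e.g. the contents of R7) onto one split of the bus."""
    if not 0 <= split < NUM_SPLITS:
        raise ValueError("split must fit in three bits")
    return (word & WORD_MASK) << (SPLIT_WIDTH * split)


def read_from_bus(bus, split):
    """Extract the 32-bit word carried by one split of the bus."""
    if not 0 <= split < NUM_SPLITS:
        raise ValueError("split must fit in three bits")
    return (bus >> (SPLIT_WIDTH * split)) & WORD_MASK


# Two tiles drive different splits in the same cycle.
bus = place_on_bus(0xDEADBEEF, 3) | place_on_bus(0x1234, 0)
```

Because each split is independent, several tiles can drive the bus in the same cycle as long as the DOU assigns them different splits.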
Clock and Voltage Domains
All of the design decisions above come together to provide clock and voltage scaling per column. Since the applications have known performance needs, the SDF model provides predictable performance and distinct tasks that are exploited by our SIMD column-based design. These columns become separate clock and voltage domains. Each task is performed at the lowest frequency that meets the application constraints and the corresponding voltage, down to the chosen voltage and frequency floors of 0.7 V and 100 MHz.
To further reduce complexity, we support only a small set of frequencies and voltages for a given design. The computational rates of each algorithmic block implemented in each column, however, must be matched to the target data rates. If one block runs too fast, then the subsequent block will not be able to keep up with the data produced.
A simple procedure for matching rates is to choose the minimum frequency necessary for each column, then add nops to throttle the computational rate. Unfortunately, adding nops to application code may not be convenient if the throttling rate is not a good multiple of the existing loop structure. Instead, we introduce a simple mechanism for flexible computation throttling in our multi-rate system called Zero Overhead Rate Matching. We add a simple programmable counter to each SIMD controller. This counter allows us to dynamically insert nops into the instruction stream sent to the tiles in each column at any chosen period of cycles, thus allowing perfect rate matching.
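The effect of such a counter can be sketched as a generator that wraps the instruction stream. The exact insertion policy of the hardware counter is not specified here, so the convention below (one nop in every `period` issue slots) and the name `throttle` are assumptions made for illustration.

```python
def throttle(instructions, period):
    """Yield the instruction stream with one nop injected every `period` slots.

    `period` counts issue slots including the nop itself, so the column does
    useful work in (period - 1) out of every `period` cycles.
    """
    if period < 2:
        raise ValueError("period must be at least 2")
    countdown = period - 1
    for instr in instructions:
        if countdown == 0:
            # Counter expired: stall the column for one cycle and reload.
            yield "nop"
            countdown = period - 1
        yield instr
        countdown -= 1


issued = list(throttle(["a", "b", "c", "d", "e", "f"], period=4))
```

Under this convention a column clocked at frequency f computes at an effective rate of f * (period - 1) / period, without any change to the application code or its loop structure.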
Applications
To drive the design of Synchroscalar, we selected four signal-processing applications, each of considerable complexity involving several computational subcomponents. Each cannot be executed at the required rate by any known commercial DSP at this time. These applications are: Digital Down Conversion, Stereo Vision, 802.11a, and MPEG-4. The next four sections briefly describe each of these applications.
Digital Down Conversion (DDC)
Digital Down Conversion is an integral component of many communication systems, and functions primarily to convert a received signal to baseband such that the signal of interest can be processed. This particular DDC was configured to support GSM cellular requirements of up to 64 Msamples per second. It comprises a Numerically Controlled Oscillator, a digital mixer, a Cascaded-Integrator-Comb (CIC) filter and a two-stage filter in the form of a compensating 21-tap filter (CFIR) and a 63-tap filter (PFIR).
Stereo Vision (SV)
Stereo Vision, used in the Mars Rover [26], has two stages: point feature extraction and point feature correlation. Each frame processed is 256 by 256 pixels in monochrome and is processed at a rate of 10 frames per second. Tomasi and Kanade's algorithm [10] was employed for point feature extraction, and singular value decomposition [30] was used for point feature correlation.
802.11a
802.11a is an end-to-end application. This IEEE standard for wireless communications supports data rates up to 54 Mbps. It is coded using OFDM and employs up to twelve 20 MHz channels in the 5 GHz frequency range. The four major components in the 802.11a receiver are the FFT, Demodulation, De-Interleaving and a K=7 Viterbi Decoder.
MPEG-4 is an ISO/IEC standard adopted in 1998. For encoding, we implement Motion Estimation, DCT and Quantization, which constitute about 90% of the video encoder [36]. For Synchroscalar, both CIF and QCIF MPEG4 encoding was performed at 30 frames per second.
Methodology
We now present an evaluation framework for Synchroscalar including: an application mapping methodology, tile and interconnect power models, VHDL synthesis, and cycle-accurate simulation.
Implementing Applications on Synchroscalar
In this section, we will outline the procedure used to map applications to Synchroscalar and evaluate their performance and power efficiency. The process involves finding an efficient mapping of an application on the Synchroscalar architecture, validating it for functional correctness and then determining the appropriate frequency and voltage of operation of each column. The frequency and voltage values are plugged into an empirical power model for Synchroscalar to evaluate the power consumption for that mapping. The detailed procedure is outlined below.
1. Start with the description of the application on a single tile.
2. Choose the number of tiles, N, that minimizes power.
3. Partition the application among the N tiles and insert data transfer operations to model the communication between the tiles.
4. Assume every data transfer takes one clock cycle. Statically schedule all the data transfers.
5. NOPs are introduced appropriately to avoid structural hazards due to bus conflicts.
6. Use the cycle-accurate simulator to determine the number of clock cycles required per input data sample. Code and data are in local tile memories when computing the clock cycle count.
7. Given the input data rate and number of cycles required by each tile, the frequency of operation for each column of tiles is computed. Let $f_i$ be the frequency of operation of the $i$-th column.
8. Using SPICE and the Berkeley Predictive Technology Models we find the required supply voltage ($V_i$) for a given frequency, for an assumed critical path delay of 20 FO4s.
9. The total power is estimated using the following equations
where $U$ is defined as the normalized power in milliwatts per MHz (mW/MHz) at the reference voltage $V_{ref}$, and it includes the active power consumed by the tile (including the data memory) and the DOU and the SIMD controller in each column, $C$ is the average bus capacitance switched per cycle, and $N$ is the number of tiles.
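Steps 7 to 9 can be sketched in Python under stated assumptions. The dynamic tile power term used below, N * U * f * (V / V_ref)^2, is the standard CMOS scaling form and is our assumption rather than the paper's exact equation; the bus and leakage terms are left out. The three look-up points are the operating points quoted elsewhere in the text, whereas a real table would come from the SPICE sweep. All function names are hypothetical.

```python
import bisect

# (frequency in MHz, supply voltage in V). Only the points quoted in the text
# are listed: the 100 MHz / 0.7 V floor, 120 MHz -> 0.8 V, 200 MHz -> 1.0 V.
FREQ_TO_VOLT = [(100.0, 0.7), (120.0, 0.8), (200.0, 1.0)]


def column_frequency_mhz(sample_rate_msps, cycles_per_sample):
    """Step 7: clock needed for each tile to keep up with the input rate."""
    return sample_rate_msps * cycles_per_sample


def supply_voltage(freq_mhz):
    """Step 8: lowest tabulated voltage that supports freq_mhz."""
    freqs = [f for f, _ in FREQ_TO_VOLT]
    idx = bisect.bisect_left(freqs, freq_mhz)
    if idx == len(freqs):
        raise ValueError("frequency above the tabulated range")
    return FREQ_TO_VOLT[idx][1]


def tile_dynamic_power_mw(n_tiles, freq_mhz, u_mw_per_mhz=0.1, v_ref=1.0):
    """Step 9, tile term only (assumed form): N * U * f * (V / V_ref)^2."""
    v = supply_voltage(freq_mhz)
    return n_tiles * u_mw_per_mhz * freq_mhz * (v / v_ref) ** 2


# DDC mixer: 64 Msamples/s on 8 tiles, assuming 1.875 cycles per input sample
# per tile (a value chosen here so that the quoted 120 MHz clock results).
mixer_f = column_frequency_mhz(64.0, 1.875)
mixer_v = supply_voltage(mixer_f)
mixer_p = tile_dynamic_power_mw(8, mixer_f)
```

Requests below the lowest table entry return the 0.7 V floor, which mirrors the voltage and frequency floors chosen for the design.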
Based on the procedure outlined above, it is clear that the key factors that influence our model are the power model for the tile, the power model for the buses or interconnect, and the leakage power. Next we will describe how we model these parameters and their validation.
Tile Model
To model the power of the tile we need two things. The first is the parameter U, which represents the normalized power of the tile and its associated components. The second is the relationship between the frequency of operation of the tile and the operating voltage.
The parameter U is estimated as follows. The tile consists of 2 40-bit ALUs, 4 8-bit video ALUs, 2 40-bit accumulators, 2 16x16 multipliers, 1 40-bit barrel shifter, a 32x32 register file with four read ports and 2 write ports, 32KB data memory and glue logic. It was modeled in VHDL and synthesized using the Synopsys Design Compiler. The multipliers, register file and memory were not synthesized. We mapped the design to a 0.25 µm ASIC library, at a supply voltage of 2.5V, and used Design Power to estimate the power from the gate-level netlist. We scaled the results to 130 nm geometry and found that the normalized power of the datapath was approximately 0.03mW/MHz. To this we added the contribution of the register file (0.11mW/MHz) [27] and the data memory (1.75mW/MHz) [28], by scaling the data appropriately. Hence, the total normalized power of the tile was estimated to be 1.89mW/MHz. To this we add the amortized overhead from the DOU and the SIMD controller. Assuming that there are four tiles per column, the contribution of the SIMD controller and the DOU to the power of each tile is roughly 0.25mW/MHz, for a total normalized power of 2.14mW/MHz, which corresponds to the parameter U in the equation above.
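The tally behind U can be reproduced directly from the quoted contributions. The variable names below are ours; the numbers are the ones given in the text.

```python
# Normalized power contributions in mW/MHz, as quoted in the text.
datapath = 0.03
register_file = 0.11
data_memory = 1.75
control_overhead = 0.25  # SIMD controller + DOU share per tile, four tiles per column

tile_only = datapath + register_file + data_memory
u_total = tile_only + control_overhead
memory_share = data_memory / u_total

print(f"tile: {tile_only:.2f} mW/MHz, with control: {u_total:.2f} mW/MHz")
print(f"data memory share of U: {memory_share:.0%}")
```

The breakdown makes clear that the data memory, not the synthesized datapath, dominates the estimate.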
We assume that by doing a custom logic implementation with appropriate transistor sizing we would cut the power of the synthesized portions of the logic in the SIMD controller and the tile by around 30%. With this assumption, we estimate the normalized power to be approximately 0.642mW/MHz, which reduces to 0.1mW/MHz at 1V supply. Although no Blackfin core power numbers are available, we can compare our estimate to a similar core from NEC, the SPXK5 [37], which consumes 0.07mW/MHz in 130 nanometer technology. Given that we are using an estimate, however, we will discuss the sensitivity of our results to tile power at the end of the results section.
The relationship between operating frequency and supply voltage of a column is found as follows. We assume the critical path is 20 FO4 gates, which is pessimistic, but appropriate for an embedded DSP core [16]. Using the Berkeley Predictive Technology Models [17] we simulate a 20 FO4 critical path in SPICE and plot the relationship between frequency and voltage. The graph in Figure 5 shows the variation of the frequency and voltage for the 130 nm process assuming critical paths of 15 and 20 FO4 lengths. This graph is captured as a look-up table to determine the appropriate voltage of operation of a tile given the frequency.
Interconnect Model
The interconnect model is largely based on the data from "The Future of Wires" paper [16]. In 0.18 µm technology, the gate capacitance of a minimum sized transistor is about 1-2fF [16]. This value is expected to remain constant over shrinking process technologies. The projected wire capacitance per unit length for a semi-global wire in 0.13 µm technology is 387fF/mm. Assuming the length of the chip is about 10mm (which corresponds to the length of the bus), the wire capacitance is about 3870fF. This suggests that even if the drivers and repeaters are 10 times the minimum size, their capacitance is about 20fF. If there are 8 drivers for each bus, this adds only 160fF to the wire capacitance. Also, we find that the gate and drain capacitances are orders of magnitude smaller than the wire capacitance per unit length. The drain-source capacitance of the segmenters and the gate and drain capacitances of the drivers are ignored. Thus the interconnect is modeled by the wire capacitance to a first order approximation. A summary of the key parameters of our model and their sources is given in Table 1.
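The first-order argument for ignoring driver capacitance can be checked with the quoted numbers. The constant names are ours, and the 10 mm bus length is the text's own assumption about chip size.

```python
WIRE_CAP_FF_PER_MM = 387.0  # semi-global wire, 0.13 um technology
BUS_LENGTH_MM = 10.0        # assumed chip length, equal to the bus length
DRIVER_CAP_FF = 20.0        # one driver or repeater at 10x minimum size
DRIVERS_PER_BUS = 8

wire_cap_ff = WIRE_CAP_FF_PER_MM * BUS_LENGTH_MM
driver_cap_ff = DRIVER_CAP_FF * DRIVERS_PER_BUS
driver_share = driver_cap_ff / (wire_cap_ff + driver_cap_ff)

print(f"wire: {wire_cap_ff:.0f} fF, drivers: {driver_cap_ff:.0f} fF "
      f"({driver_share:.1%} of the total)")
```

The drivers contribute roughly 4% of the total per-wire capacitance under these numbers, which is why a wire-only model is a reasonable first-order approximation.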
Leakage Power Estimation
Given that we are scaling the supply voltage aggressively, it is important to include the contribution of the leakage current in our estimations. Additionally, the fact that we trade area for power in Synchroscalar makes our leakage analysis even more critical. We use an analytical model to compute the leakage current:
$$I_{leak} = I_{on} \, e^{-V_{th} / (n V_T)}$$
where $I_{on}$ is the on current that depends on the process but is roughly equal to 0.3 µA per micron width, $V_T = kT/q$, which is roughly 26 millivolts at room temperature, $n$ depends on the device structure but is roughly between 1.3 and 1.5, and $V_{th}$ is the threshold voltage.
The leakage current increases with decrease in threshold voltage and increase in temperature. In order to model the leakage, we make the following assumptions: Using these numbers we calculate the leakage current, which comes to 830 pA per transistor for a minimum sized transistor. This leakage correlates well with the number published by Intel on their 130 nm process, where leakage current varies from 0.65 nA per transistor to 32.5 nA per transistor depending on whether the threshold voltage of the transistor is high or low, respectively [41].
We estimate 1.8 million transistors per tile, so we believe the leakage current to be around 1.5 mA per tile, assuming 830 pA of leakage per transistor. Of course, this estimate makes several assumptions, such as the average transistor width. While all results in this paper will assume 830 pA of leakage per transistor, we will present a leakage sensitivity analysis in the results section of this paper.
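The per-tile figure follows directly from the transistor count. The sketch below also converts the current to leakage power at a given supply, treating the leakage current as independent of supply voltage, which is a simplification of ours. The function names are hypothetical.

```python
TRANSISTORS_PER_TILE = 1.8e6
LEAK_PER_TRANSISTOR_A = 830e-12  # 830 pA, the baseline used in the text


def tile_leakage_ma(leak_per_transistor_a, transistors=TRANSISTORS_PER_TILE):
    """Total leakage current of one tile in milliamps."""
    return transistors * leak_per_transistor_a * 1e3


def tile_leakage_power_mw(leak_per_transistor_a, vdd):
    """Leakage power of one tile in milliwatts (current assumed independent of vdd)."""
    return tile_leakage_ma(leak_per_transistor_a) * vdd


baseline_ma = tile_leakage_ma(LEAK_PER_TRANSISTOR_A)
floor_power_mw = tile_leakage_power_mw(LEAK_PER_TRANSISTOR_A, vdd=0.7)
```

The baseline works out to just under 1.5 mA per tile, matching the estimate above. Because every added tile contributes this current, leakage grows linearly with the degree of parallelism.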
Cycle-Accurate Simulation
To obtain cycle-accurate performance measurements, we adapted an object-oriented variant of SimpleScalar to model the Synchroscalar architecture. The instruction set was retargeted to the Blackfin ISA [20] and communication mechanisms were added.
The applications were compiled down to assembly, and the inner loops hand-optimized. Inter-tile communication is hand-scheduled, and appropriate nops are inserted for synchronization between different clock domains.
Tile Area Estimation
The tile, the SIMD controller and the DOU were modeled in VHDL and synthesized using Synopsys Design Compiler for a 0.25 µm ASIC library and scaled to 0.13 µm. The various components of the tile and the SIMD controller are shown in Table 2. Memory, register file, and multipliers were not synthesized. Their area was estimated from [15], which has technology-independent models for various components. We assume 32KB SRAM of data memory per tile and 2KB SRAM for instruction memory. One field in the table models the glue logic and the wiring overhead between the top-level blocks. The area of the tile is 1.82 mm², and the areas of the SIMD controller and the DOU, which are shared by the whole column of four tiles, are approximately 0.25 mm² and 0.0875 mm² respectively.
Results
Since all of our applications have set performance targets, our metric is the system power required to achieve those targets. Synchroscalar delivers software implementations of challenging signal processing applications with energy efficiency generally within 8-30X of ASIC solutions and 10-60X better than DSPs performing even reduced data rate versions of the applications. The remainder of this section describes the benefits of Synchroscalar's unique column-oriented voltage scaling, parallelization's influence on the system power, interconnect costs and leakage current.
Table 2. Tile and DOU and SIMD Control Area Estimation
Power Savings
The multiple column-oriented voltage domains yield advantages, as shown by comparing the Single Voltage and Multiple Voltages columns in Table 4 and in Figure 5. For applications where there are a few tiles that run at high frequencies that cannot be parallelized into multiple tiles, we see the greatest power saving due to the voltage scaling. The Stereo Vision application is one such application. In other applications, where there is not one computationally demanding algorithm with limited exploitable parallelism, the power saved due to the voltage frequency scaling is much smaller. The wireless 802.11a application is one such instance. The true benefits of voltage scaling can be better demonstrated when applications need to be composed. This can be seen in the data where we have composed an AES-based message authentication code with the 802.11a receiver.
Effects of Parallelism
Figure 7 shows how much power is consumed for different levels of parallelization of our applications. By allocating more parallel resources we are able to run the applications at a lower frequency and a lower voltage, thereby saving power. However, there are diminishing returns from further parallelization due to increased communication. Figure 7 is a good example of how diminishing returns from additional communications requirements prevent us from further parallelizing the 802.11a application efficiently. This communications overhead negatively impacts our power efficiency and is represented by the dark portions of each of the application's bars in Figure 7. Another source of diminishing returns from further parallelization is a supply voltage floor. While tiles could operate at supply voltages lower than 0.7 V, due to leakage and noise constraints we chose 0.7 V as the minimum supply voltage supported. Therefore, further parallelizing an algorithm that is already running at the minimum supply voltage would not yield further power savings, and would likely increase the power consumption due to leakage and added communications cost.
Leakage Sensitivity Analysis
Since Synchroscalar trades spatial parallelism for tem poral parallelism and the power dissipation due to leakage is proportional to the spatial parallelism, a careful analy sis of leakage must be considered. Figures 9 and 10 show how different levels of parallelization of our four applica tions perform under varying levels of leakage currents. In the figures, the horizontal axis shows the leakage current 
Effects of Interconnect
In Figure 8 we have mapped how the power-area ef per Synchroscalar tile, and the vertical axis shows the power consumption of the applications in mW. The lowest leak age current (1.5 mA/tile) corresponds to the leakage per tile as calculated in Section 4.4. The largest leakage current ficiency of the Synchroscalar architecture scales for the Viterbi ACS with different sets of bus widths and different numbers of tiles. The Viterbi ACS is used here as the Viterbi Decoder has the most demanding communications require ments of any of the individual algorithms tested on the Syn chroscalar architecture. The three curves on Figure 8 each represent a Viterbi ACS trellis being completed on 8, 16 and 32 tiles. Each of the curves are comprised of power results for a few different bus widths (32b, 64b, 128b, etc...). We can see from this figure that increasing the bus width from 128 to 256 bits significantly improves the power efficiency of Synchroscalar on the Viterbi Decoder for all three im plementations. However, another such doubling of the bus width has a smaller reduction on our overall power con sumption as the curves become less steep. This leads us to choose a 256 bit bus for Synchroscalar. While it would be possible to attain lower power consumptions for the Viterbi graphed corresponds to the leakage current if each tile used only low Vt transistors as published by Intel [41] , which we believe represents the highest leakage current that we would consider in the development of Synchroscalar.
Of particular interest are the cross-over points between different levels of parallelization of an application, as in Figure 10 for MPEG4. Moving from eight to twelve tiles allows Synchroscalar to reduce the overall power consumption through frequency reduction and voltage scaling. These gains outweigh the leakage penalty and communications overhead. However, when moving from twelve to 36 tiles, the structure with the best overall power consumption depends heavily on the leakage current. When tiles leak less than 14.8 mA (corresponding to 8.3 nA/transistor), the more highly parallelized structure of 36 tiles is more efficient, but when tiles leak more than 14.8 mA, the twelve-tile structure is more efficient.
Figure 10. Leakage sensitivity for MPEG4, SV
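The cross-over behavior can be illustrated with a simple model in which total power is a dynamic term plus a leakage term proportional to the number of tiles. The sketch below is only an illustration under that assumption: the dynamic power and supply voltage values are hypothetical placeholders, not measurements from this study, so it does not reproduce the 14.8 mA cross-over reported above.

```python
# Illustrative leakage cross-over model. All configuration numbers are
# hypothetical placeholders, not results from the paper.

def total_power_mw(n_tiles, dynamic_mw, vdd, leak_ma_per_tile):
    """Dynamic power plus leakage power (mA * V = mW) summed over all tiles."""
    return dynamic_mw + n_tiles * leak_ma_per_tile * vdd

def crossover_leak_ma(small, large):
    """Per-tile leakage current at which two configurations draw equal power.

    Each configuration is (n_tiles, dynamic_mw, vdd). The larger configuration
    is assumed to have lower dynamic power but more total leakage.
    """
    n_s, dyn_s, v_s = small
    n_l, dyn_l, v_l = large
    return (dyn_s - dyn_l) / (n_l * v_l - n_s * v_s)

# Hypothetical: 12 tiles at 1.0 V versus 36 tiles at a scaled-down 0.7 V.
SMALL = (12, 900.0, 1.0)
LARGE = (36, 600.0, 0.7)

if __name__ == "__main__":
    x = crossover_leak_ma(SMALL, LARGE)
    print(f"cross-over at about {x:.1f} mA/tile")
    for leak in (1.5, 10.0, 30.0):
        p_small = total_power_mw(*SMALL, leak)
        p_large = total_power_mw(*LARGE, leak)
        best = "36 tiles" if p_large < p_small else "12 tiles"
        print(f"{leak:5.1f} mA/tile: {p_small:7.1f} mW vs {p_large:7.1f} mW -> {best}")
```

Below the cross-over current the wider configuration wins because its voltage-scaled dynamic power dominates; above it, the leakage of the extra tiles outweighs that saving.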
Discussion
How much should one parallelize the applications? The factors that limit the amount of parallelization are the voltage floor (i.e., the minimum voltage at which a given tile can run), the leakage current, and the structure of the application.
Additional parallel hardware helps to reduce power here because we are also scaling the voltage aggressively. Once all tiles are operating at the voltage floor, parallelizing further is not advantageous, since additional parallelization could increase the communications overhead. It would then be the goal of a compilation tool for Synchroscalar to help parallelize applications so that they run as close to the voltage floor as possible.
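The effect of the voltage floor can be sketched with a first-order model. The code below assumes ideal parallel speedup, a supply voltage that scales linearly with clock frequency down to the floor, and dynamic power proportional to V^2 f. The nominal voltage, nominal frequency, floor, and per-tile communication cost are hypothetical values, and the 0.1 mW/MHz figure is assumed here to apply at the nominal voltage.

```python
# First-order sketch of parallelism with voltage scaling and a voltage floor.
# V_MAX, F_MAX, V_FLOOR and the communication cost are hypothetical.

V_MAX = 1.2        # nominal supply voltage (V), assumed
F_MAX = 600.0      # clock frequency reachable at V_MAX (MHz), assumed
V_FLOOR = 0.6      # minimum usable supply voltage (V), assumed
MW_PER_MHZ = 0.1   # per-tile power at V_MAX, assumed to hold at nominal voltage

def supply_voltage(f_mhz):
    """Linear voltage-frequency scaling, clamped at the voltage floor."""
    return max(V_FLOOR, V_MAX * f_mhz / F_MAX)

def total_power_mw(n_tiles, work_mhz, comm_mw_per_tile=0.0):
    """Total power when work_mhz of aggregate cycles is split over n_tiles."""
    f = work_mhz / n_tiles                      # ideal parallel speedup
    v = supply_voltage(f)
    per_tile = MW_PER_MHZ * f * (v / V_MAX) ** 2
    return n_tiles * (per_tile + comm_mw_per_tile)

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        no_comm = total_power_mw(n, 2400.0)
        with_comm = total_power_mw(n, 2400.0, comm_mw_per_tile=1.0)
        print(f"{n:2d} tiles: {no_comm:6.1f} mW ideal, {with_comm:6.1f} mW with comm")
```

In this model, power drops steeply until the tiles reach the floor. After that the ideal total stays flat, and any per-tile communication cost makes further parallelization a net loss.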
Our results are sensitive to the per-tile power estimate that we derived in the methodology section. Since tile power is the dominant factor in the total Synchroscalar power, our power results are roughly linear in this estimate. Our qualitative results are valid for a large range of realistic values of tile power. For instance, let us compare the power consumption of Synchroscalar with that of the Blackfin DSP, both in 0.13 µm technology. Using the 0.1 mW/MHz estimate of power per tile for Synchroscalar, we have shown that the DDC application runs at 2.43 W for 64e6 samples/second, or 38.0 nW/sample. The Blackfin DSP can run at 280 mW for 113e3 samples/second at 600 MHz, or 2478 nW/sample, a difference of more than a factor of 60. Clearly, even if our per-tile power estimate is off by a factor of two, we still demonstrate significant power savings.
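The per-sample comparison can be checked with a few lines of arithmetic. The sketch below uses only the power and throughput figures quoted above. The "doubled" case assumes that total Synchroscalar power would double if the tile power estimate were off by a factor of two, which is a pessimistic bound since tile power is dominant but not the only component. Note that watts divided by samples per second is dimensionally an energy per sample; the numeric values are the same as those quoted.

```python
# Per-sample cost comparison using the figures quoted in the text.

def nano_per_sample(power_w, samples_per_s):
    """Power divided by throughput, scaled by 1e9 (nano-units per sample)."""
    return power_w / samples_per_s * 1e9

synchroscalar = nano_per_sample(2.43, 64e6)    # DDC on Synchroscalar
blackfin = nano_per_sample(0.280, 113e3)       # DDC on Blackfin at 600 MHz

ratio = blackfin / synchroscalar
# Pessimistic case: the Synchroscalar total doubles with the tile estimate.
ratio_if_doubled = blackfin / (2 * synchroscalar)

if __name__ == "__main__":
    print(f"Synchroscalar: {synchroscalar:.1f} per sample")
    print(f"Blackfin:      {blackfin:.0f} per sample")
    print(f"ratio: {ratio:.1f}, ratio with doubled estimate: {ratio_if_doubled:.1f}")
```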
Related Work
The challenges presented by next-generation applications in terms of higher data rates, lower power requirements, shrinking time-to-market requirements, and lower cost have resulted in tremendous interest in embedded architectures and platforms for communication appliances in the past few years. Researchers have approached the problem from several different angles. The DSP architecture companies have proposed highly parallel VLIW machines coupled with hardware accelerators or co-processors for the computation-intensive functions. The TI OMAP [18] is a good example of this category of solutions. However, this approach is not power efficient: very high clock frequencies would be needed to meet the throughput constraints of the applications considered in this paper.
The SCORE project at UC Berkeley [11] uses an FPGA-like fabric with specially tailored interconnect to exploit parallelism and improve power efficiency. The PLEIADES project at UC Berkeley [44] proposes an interconnection of a low-power FPGA, datapath units, memory, and processors, optimized for different application domains. The PLEIADES researchers conclude that a hierarchical generalized mesh interconnect structure [43] is most appropriate for their architecture as it balances both the global and the local interconnect. Our results are in agreement with this conclusion in general, but given that we are targeting streaming computations, we have greater emphasis on near-neighbor communication and have stayed away from a general mesh. Other reconfigurable machines, such as RAPID [12] and Piperench [33], illustrate interesting alternatives to our choice of tiles, and may be amenable to our coarse-grained voltage-frequency scaling techniques.
The adaptive SOC (aSOC) project at the University of Massachusetts [22] advocates an array of processors connected by a statically scheduled communication fabric. They allow different processors to operate at different clock frequencies and demonstrate significant power savings on video processing benchmarks. The key differences between this work and Synchroscalar are in the structure and contents of the tiles and the memory architecture. In aSOC the tiles are hardwired functional blocks such as Viterbi decoder, FFT, DCT, etc., while in Synchroscalar we assume programmable DSPs as the building blocks for the tiles. As a result, the memory architecture of the system is radically different, changing the data transfer and communication scheduling problem as well. Intel's tile-based architecture [14] shares the same objectives as ours, but the interconnection network is very different. Also, the tiles in [14] are much coarser grained, which means their power consumption will likely be higher. The tile-based architecture from the University of Texas [29] resembles Synchroscalar structurally, but it is designed for wire-delay scalability, not power efficiency given a data rate constraint, which is the unique feature of our work. Synchroscalar's use of spatial rather than temporal flexibility is somewhat inspired by the MIT RAW project [39] [38], but our mechanisms for ASIC-like performance are significantly different. The Imagine [31] processor approaches a similar problem domain from a stream-oriented perspective. The parallelization strategies used by Imagine are complementary to the voltage scaling, data orchestration, and multi-rate optimization used in Synchroscalar. The Smart Memories [24] project is another tile-based architecture whose reconfigurable tiles would also be complementary to Synchroscalar mechanisms. While the SIMD components of our applications are dominant, some phases could benefit from other models of computation.
Recently, there has been a revival of interest in the globally asynchronous and locally synchronous (GALS) approach to processor implementation [4], including the use of multiple clock domains and multiple voltages [25] [34]. The key difference between the GALS approach and the Synchroscalar approach is that Synchroscalar restricts the frequencies of different columns to be rationally related. This avoids the use of asynchronous FIFOs with their synchronization overhead. In this respect, Synchroscalar is closer to Numesh [35] than to the GALS approach.
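One way to see why rationally related frequencies make asynchronous FIFOs unnecessary is that the relative phase of the column clocks repeats after a finite interval, so inter-column transfers can be placed at fixed cycle numbers ahead of time. The following sketch computes that repeat interval for a set of column frequencies. The frequencies shown are hypothetical examples, and the function names are illustrative rather than part of any Synchroscalar tool.

```python
# Sketch: clock domains with rationally related frequencies realign after a
# finite "hyperperiod". Example frequencies are hypothetical.
from fractions import Fraction
from functools import reduce
from math import gcd

def _lcm(a, b):
    return a * b // gcd(a, b)

def hyperperiod(freqs_mhz):
    """Shortest interval (in microseconds) containing a whole number of
    cycles of every clock, for integer or Fraction frequencies in MHz."""
    periods = [1 / Fraction(f) for f in freqs_mhz]
    # LCM of reduced fractions = lcm(numerators) / gcd(denominators).
    num = reduce(_lcm, (p.numerator for p in periods))
    den = reduce(gcd, (p.denominator for p in periods))
    return Fraction(num, den)

def cycles_per_hyperperiod(freqs_mhz):
    """Number of cycles each clock completes before the pattern repeats."""
    h = hyperperiod(freqs_mhz)
    return [int(h * Fraction(f)) for f in freqs_mhz]

if __name__ == "__main__":
    columns = [100, 250, 400]   # hypothetical column clocks in MHz
    print("hyperperiod (us):", hyperperiod(columns))
    print("cycles per column:", cycles_per_hyperperiod(columns))
```

Because every column completes an integer number of cycles per hyperperiod, a communication schedule written for one hyperperiod can simply be repeated. With irrationally related or free-running clocks no such finite pattern exists, which is where synchronizing FIFOs become necessary.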
Conclusion
The design principles of Synchroscalar (high parallelism, efficient interconnect, low control overhead, and custom voltage/frequency domains) will lead to a new set of embedded architectures with efficiency approaching ASICs and with the programmability of DSPs. Our study has shown a promising proof-of-concept through hand optimization and code development. Future work will focus on a software tool chain to automate and optimize application parallelization and communication scheduling.
Acknowledgments
This work is supported by NSF ITR grants 0312837 and 0113418, and NSF CAREER and UC Davis Chancellor's fellowship awards to Fred Chong. Jedidiah Crandall's work was supported in part by a United States Department of Education Government Assistance in Areas of National Need (DOE-GAANN) grant P200A010306. Diana Franklin's faculty position is funded by a Forbes Endowment. We would also like to thank Rajeevan Amirtharajah, Bevan Baas, Dean Copsey and Matthew Farrens at the University of California at Davis, Mark Oskin at the University of Washington, and Timothy Sherwood at the University of California at Santa Barbara.
