Abstract. Embedded devices have hard performance targets and severe power and area constraints that depart significantly from our design intuitions derived from general-purpose microprocessor design. This paper describes our initial experiences in designing Synchroscalar, a tile-based embedded architecture targeted for multi-rate signal processing applications.
Introduction
Next-generation embedded applications demand high throughput with low power consumption. Current approaches often use Application-Specific Integrated Circuits (ASICs) to satisfy these constraints. However, rapidly evolving application protocols, multi-protocol embedded devices, and increasing chip NRE costs all argue for a more flexible solution. In other words, we want the flexibility of a programmable DSP with energy efficiency closer to that of an ASIC. We propose the Synchroscalar architecture, a tile-based DSP designed to efficiently meet the throughput targets of applications with multi-rate computational subcomponents.
In designing Synchroscalar, we focused on three key features of ASICs that lead to their energy efficiency: high parallelism, custom interconnect, and low control overhead. Parallelism is important in that it allows the frequency of an architecture to be reduced linearly with investment in logic, modulo communication. This linear reduction, when coupled with voltage scaling, yields a quadratic decrease in power and a linear decrease in system energy. Low communication latency, however, is important in maintaining the parallelism necessary for these energy gains. ASICs accomplish low latency through custom interconnect. We find that, in the low-frequency domain, a tile-based processing architecture can use segmentable global buses to achieve low latency with high energy efficiency. Control overhead of the buses is kept low by using statically scheduled segmentation and data motion. Control overhead of the tiles can be reduced by grouping columns into SIMD execution units.
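The power argument above can be illustrated with a first-order sketch. It assumes the standard dynamic-power relation P = C·V²·f, perfectly parallel work with no communication cost, and a supply voltage that can be lowered in proportion to frequency. The function name and the example numbers are hypothetical and are not taken from our models.

```python
def parallel_dynamic_power(n_tiles, base_freq_mhz, base_vdd, cap=1.0):
    """Total dynamic power (arbitrary units) of n_tiles tiles sharing one workload.

    Idealized assumptions, for illustration only:
      - the work divides perfectly, so each tile runs at base_freq / n_tiles;
      - supply voltage scales linearly with frequency (ignores the threshold voltage);
      - communication costs nothing.
    """
    freq = base_freq_mhz / n_tiles   # per-tile clock needed for the same throughput
    vdd = base_vdd / n_tiles         # idealized linear voltage scaling
    return n_tiles * cap * vdd ** 2 * freq


base = parallel_dynamic_power(1, 400.0, 1.8)
for n in (1, 2, 4):
    ratio = parallel_dynamic_power(n, 400.0, 1.8) / base
    print(f"{n} tile(s): relative power = {ratio:.4f}")
```

Under these assumptions total power falls as 1/N² with the number of tiles N. Real voltage scaling is limited by the threshold voltage, and communication adds cycles, so the gain in practice is smaller.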
In the remainder of this paper, we provide an overview of the Synchroscalar architecture to establish the context of our study. Then we provide some simple tile and interconnect models which we used to guide our design. We use these models to conduct an analysis of FIR, FFT, Viterbi, and AES kernels running on different points in the design space. We discuss our intuitions from this analysis and conclude with future work for our project.
Synchroscalar Architecture
In this section, we introduce the proposed Synchroscalar architecture and the rationale behind it. As noted in the previous section, we were motivated by the need for an embedded architecture with the flexibility of a general-purpose processor (DSP) and the power efficiency of an application-specific integrated circuit. We examined ASIC implementations of Viterbi, FFT, AES, and FIR and found that the key sources of the power efficiency of an ASIC are:
- Parallelism, with multiple clock and voltage domains
- Customized interconnect mirroring the dataflow inherent in the computation
- Distributed memory to provide high bandwidth
- Customized functional blocks to implement the computation
- Absence of instructions, removing instruction cache accesses and decode logic
If we want to approach the efficiency of an ASIC, our architecture should retain as many of the key strengths of an ASIC as possible. This directs us towards a tile-based multiprocessor architecture with multiple clock and voltage domains, reconfigurable interconnect, and low-overhead SIMD control.
Abstractly, Synchroscalar is a two-dimensional array of processing elements (PEs), with each column potentially operating at a different fixed frequency and hence voltage. There is a single vertical bus connecting the elements in a column, and these vertical buses are connected by a single horizontal bus for communication between columns. In reality, in order to reduce the distance between PEs in a single column, the column is folded over, so that there are PEs on both sides of the vertical bus. That is the basis for Synchroscalar, as shown in Figure 1 (we do not plan to support dynamic frequency/voltage scaling at present). Because of the data-parallel nature of computation, each PE can be viewed as one functional unit of a SIMD machine. There is a SIMD controller for each pair of columns. Each PE (tile) has a single DSP engine with two functional units, SRAM, a register file, and communication interfaces. For brevity, we will refer to this cluster of bus, two columns, and SIMD controller as a single column. Although the tiles are SIMD, the communication patterns are not identical, so programmable engines are required for controlling communication.
Programming Model
The architecture of Synchroscalar is motivated by the Synchronous Dataflow (SDF) model of computation [2, 3, 4]. DSP design environment tools created by Synopsys and Cadence use this model. SDF is a subset of general-purpose dataflow that restricts the number of data values produced and consumed by an actor to be a constant. The restriction imposed by the SDF model offers the advantage of static scheduling and decidability of key verification problems such as bounded memory requirements and deadlock avoidance [8]. Synchroscalar can be viewed as an architecture to support the SDF computation model efficiently. This predictability is crucial to providing the generality of programmable units while retaining much of the efficiency of ASICs.
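To make the static-scheduling property concrete, the sketch below solves the SDF balance equations for a small graph: for every edge, the firings of the producer times the tokens it produces must equal the firings of the consumer times the tokens it consumes. The graph, the actor names, and the function `repetition_vector` are hypothetical illustrations and are not part of the Synchroscalar tool flow.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+


def repetition_vector(actors, edges):
    """Return the smallest positive integer firing counts that balance every edge.

    edges is a list of (producer, consumer, tokens_produced, tokens_consumed).
    """
    rate = {actors[0]: Fraction(1)}
    progress = True
    while progress:  # propagate relative rates across the connected graph
        progress = False
        for src, dst, produced, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                progress = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                progress = True
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    for src, dst, produced, consumed in edges:
        if rate[src] * produced != rate[dst] * consumed:
            raise ValueError("inconsistent rates: no bounded-memory schedule")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {actor: int(r * scale) for actor, r in rate.items()}


# Hypothetical three-actor chain: A emits 2 tokens per firing, B consumes 3
# and emits 1, C consumes 2.
print(repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)]))
```

For this chain the balanced schedule fires A three times, B twice, and C once per period. Because the counts are fixed at compile time, buffer sizes and communication slots can be fixed as well.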
Clock and Voltage Domains
Clock and voltage domains are per-column, with the task parallelized within the column. Tasks can be mapped to different columns depending on their computational requirements. This mapping is crucial to performance, because once set, the voltage and frequency of a column may not change. Mapping algorithms must be developed to minimize communication and maximize power savings. Computationally intensive tasks are performed at the best available frequency and voltage that meets the performance requirements. Other tasks can be mapped to columns operated at lower frequency and voltage. We employ rational clocking [15] for the frequencies of different columns. If $f_m$ and $f_n$ are the frequencies of two columns of PEs, then $f_m / f_n = M/N$, where $M$ and $N$ are integers. While this allows a wide range of frequencies to be selected, the relation between the two frequencies provides the predictable communication points between the domains required for statically scheduled communication. Rational clocking eliminates the synchronization overhead associated with asynchronous or GALS systems while still giving us the flexibility of different frequency domains.
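The predictability of rational clocking can be seen in a small sketch. Assuming both column clocks are derived from a common source and start phase-aligned (an assumption made for this illustration), their rising edges coincide once every M cycles of the first column and every N cycles of the second. The function name and the example frequencies are hypothetical.

```python
from fractions import Fraction


def clock_alignment(f_m_mhz, f_n_mhz):
    """Return (M, N, base_mhz) for two integer clock frequencies in MHz.

    M/N is the reduced ratio f_m/f_n, and base_mhz = f_m/M = f_n/N is the
    rate at which edges of the two clocks coincide, assuming aligned phases.
    """
    ratio = Fraction(f_m_mhz, f_n_mhz)
    m, n = ratio.numerator, ratio.denominator
    base_mhz = Fraction(f_m_mhz, m)
    return m, n, base_mhz


m, n, base = clock_alignment(120, 80)
period_ns = 1000 / base  # 1000 ns per microsecond divided by MHz gives ns
print(f"M={m}, N={n}: edges coincide every {m} fast cycles / {n} slow cycles "
      f"({float(period_ns)} ns)")
```

With 120 MHz and 80 MHz columns, the ratio is 3/2 and the edges coincide every 25 ns, which gives the static scheduler a fixed set of safe transfer points between the two domains.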
ASICs benefit from high-bandwidth, low-latency communication provided by custom interconnects. We exploit low clock frequencies and static scheduling to maximize throughput while minimizing latency. Static scheduling is required to maintain guaranteed performance. Although the clock frequencies are low enough to traverse a column in a single cycle, we segment the bus in order to increase the usable bandwidth. Segment controllers are turned on or off by signals from a central per-column segment controller. As shown in Figure 1, the bus connecting two columns of PEs is partitioned into segments [23] by segment controllers.
The column segment controllers are small state machines which can be reconfigured for each algorithm. By suitably controlling the segment controllers, the bus can perform several parallel communications. For instance, if all the controllers are turned off, the bus becomes a broadcast bus, with all PEs able to receive the same data. Alternatively, two messages can pass between neighboring columns using the same wires in different segments if the segment controller between them is on. The tasks are mapped to the tile architecture such that the communication between the PEs is minimized. Highly communicating tasks are assigned to neighboring PEs. This reduces the number of segments the data has to travel, and hence saves power.
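A minimal sketch of the scheduling constraint follows. It models one column as a line of PEs with a segmenter between each adjacent pair and follows the convention used above: a segmenter that is on splits the bus, and one that is off passes signals through. The data layout and the function `can_share_bus` are hypothetical and do not describe the actual controller state machines.

```python
def can_share_bus(num_pes, isolating, transfers):
    """Check whether a set of transfers can use the bus in the same cycle.

    isolating[i] is True when the segmenter between PE i and PE i+1 is on,
    i.e. it splits the bus at that point. transfers is a list of (src, dst)
    PE indices. Every transfer must stay inside one electrically connected
    segment, and each segment may carry at most one transfer.
    """
    if len(isolating) != num_pes - 1:
        raise ValueError("need one segmenter between each adjacent PE pair")
    segment = [0]                      # segment id of each PE along the bus
    for split in isolating:
        segment.append(segment[-1] + (1 if split else 0))
    used = set()
    for src, dst in transfers:
        if segment[src] != segment[dst]:
            return False               # an isolating segmenter blocks the path
        if segment[src] in used:
            return False               # two drivers on the same wires
        used.add(segment[src])
    return True


# Four PEs with the middle segmenter on: two local transfers share the wires.
print(can_share_bus(4, [False, True, False], [(0, 1), (2, 3)]))   # True
# All segmenters off: the bus is one broadcast segment, so they conflict.
print(can_share_bus(4, [False, False, False], [(0, 1), (2, 3)]))  # False
```

A static schedule would evaluate a check of this kind for every cycle at compile time, so no arbitration is needed at run time.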
SIMD Control
In order to reduce the cost of instruction fetch and decode, a single SIMD controller sends instructions to the PEs in a column. The SIMD controller performs all control instructions, only forwarding computation instructions to the PEs. To communicate data (used for conditional branches), the SIMD controller is connected to the segmented bus with the PEs. Using branch prediction would require a mechanism to squash instructions that have already been sent to the processing elements. Instead, we provide a short pipeline in the control unit to calculate branches quickly, and delay instructions from reaching the processing elements. This introduces a single-cycle stall for each conditional branch. For zero-overhead loops, there is still no delay, because the PC is used for decision-making, not the actual instruction. Our implementation incurs no extra overhead for these loops, which are critical to DSP performance.
With the Synchroscalar architecture and motivation for context, we now present a general framework within which to evaluate the surrounding design space. The framework will use some simple first-order models of tile and interconnect power, validated with datapoints in the literature and VHDL designs of custom Synchroscalar elements. Although our models are by necessity abstract enough to cover the design space, we argue that the important scaling effects are captured and that our qualitative conclusions are valid.
Tile Model
We use the voltage-frequency scaling given by the alpha-power law of Sakurai and Newton, $f = k\,(V_{dd} - V_t)^{\alpha} / V_{dd}$, where $V_{dd}$ is the supply voltage, $V_t$ is the threshold voltage, and $k$ and $\alpha$ are technology-dependent constants. Our tile is based on low-power 16-bit VLIW DSPs similar to the Intel-ADI MSA-based Blackfin [7] and the SPXK5 from NEC [19]. The minimum core power is assumed to be 0.07 mW/MHz, similar to [19]. (We are in the process of finishing a detailed VHDL model for the tile and validating this assumption.) The SRAM power is taken to be 0.02 mA/MHz for 32 kB of memory. This number was obtained from the circuit given in [12], by scaling for technology and size.
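As an illustration of how such a tile model can be evaluated, the sketch below inverts the commonly cited form of the alpha-power law, f = k·(Vdd − Vt)^α / Vdd, to find the lowest supply voltage that sustains a target frequency, and then scales the 0.07 mW/MHz core figure by the square of the voltage. The values of k, Vt, α, and the nominal voltage, as well as the assumption that the 0.07 mW/MHz figure applies at that nominal voltage, are hypothetical placeholders rather than parameters of our model.

```python
def max_frequency_mhz(vdd, k, vt, alpha):
    """Alpha-power law: highest clock frequency sustainable at supply vdd."""
    return k * (vdd - vt) ** alpha / vdd


def min_voltage(f_target_mhz, k, vt, alpha, v_max, tol=1e-9):
    """Lowest vdd in (vt, v_max] that reaches f_target_mhz, found by bisection.

    Bisection is valid because the law is monotonically increasing in vdd
    for vdd > vt and alpha >= 1.
    """
    if max_frequency_mhz(v_max, k, vt, alpha) < f_target_mhz:
        raise ValueError("target frequency is unreachable at v_max")
    lo, hi = vt, v_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if max_frequency_mhz(mid, k, vt, alpha) >= f_target_mhz:
            hi = mid
        else:
            lo = mid
    return hi


def core_power_mw(f_mhz, vdd, v_nominal, mw_per_mhz=0.07):
    """Core power, assuming mw_per_mhz holds at v_nominal and scales with vdd**2."""
    return mw_per_mhz * f_mhz * (vdd / v_nominal) ** 2


# Hypothetical technology parameters, chosen only for illustration.
K, VT, ALPHA, V_NOM = 500.0, 0.4, 1.3, 1.8
for f in (400.0, 200.0, 100.0):
    v = min_voltage(f, K, VT, ALPHA, V_NOM)
    print(f"{f:6.1f} MHz -> Vdd = {v:.3f} V, core power = {core_power_mw(f, v, V_NOM):.2f} mW")
```

The point of the sketch is the shape of the curve: halving the frequency lowers the required voltage, so power drops faster than linearly, but the threshold voltage limits how far this can go.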
Interconnect Model
The interconnect model is largely based on the data given in [6]. We find that the gate and drain capacitances are orders of magnitude smaller than the wire capacitance per unit length. We thus model only the wire capacitance. The drain-source capacitance of the segmenters and the gate and drain capacitances of the drivers are ignored. In 0.18 µm technology, the gate capacitance of a minimum-sized transistor is about 1-2 fF [6]. This value is expected to remain constant over shrinking process technologies. The projected wire capacitance per unit length of a semi-global wire in 0.13 µm technology is 387 fF/mm. The chip length is about 10 mm, and hence the wire capacitance is about 3870 fF. This suggests that even if the drivers and repeaters are 10 times the minimum size, their capacitance is about 20 fF each. If there are 8 drivers for each bus, they add only 160 fF to the wire capacitance. We are in the process of completing VHDL models for the segment controllers, SIMD controller, and communication interfaces. We plan to augment our results with these models in the future, but we believe that they are unlikely to change the major trends in the results reported here.
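The arithmetic behind this approximation is reproduced below as a short check. The constants are the figures quoted above, using the upper end of the 1-2 fF gate-capacitance range; the variable names are our own.

```python
WIRE_CAP_FF_PER_MM = 387   # projected semi-global wire capacitance, 0.13 um
CHIP_LENGTH_MM = 10        # approximate bus length
MIN_GATE_CAP_FF = 2        # upper end of the 1-2 fF range for a minimum transistor
DRIVER_SIZE_FACTOR = 10    # drivers assumed 10 times minimum size
DRIVERS_PER_BUS = 8

wire_cap_ff = WIRE_CAP_FF_PER_MM * CHIP_LENGTH_MM                     # 3870 fF
driver_cap_ff = MIN_GATE_CAP_FF * DRIVER_SIZE_FACTOR                  # 20 fF each
total_driver_cap_ff = driver_cap_ff * DRIVERS_PER_BUS                 # 160 fF
driver_share = total_driver_cap_ff / wire_cap_ff

print(f"wire: {wire_cap_ff} fF, drivers: {total_driver_cap_ff} fF "
      f"({driver_share:.1%} of the wire capacitance)")
```

The drivers contribute roughly 4% on top of the wire capacitance, which is why the model keeps only the wire term.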
Applications
The main objective of this paper is an exploration of the design space defined by the goals of the Synchroscalar architecture. Specifically, we are interested in the impact on power of various architectural parameters, such as the number of tiles, the interconnect structure, and the width of the buses, while meeting the performance constraints of an application.
For an initial driving application, we choose the 54 Mbps 802.11(a) wireless LAN physical layer. This is currently outside the scope of DSP processors and is done with ASICs or DSPs with co-processors for the computationally intensive functions. The computationally challenging aspects of 802.11(a) are the Viterbi decoder, the FFT, and large FIR filters for equalization. We will evaluate each of these functions on the Synchroscalar architecture. We derive the performance (throughput) targets for each function so that we can meet the 54 Mbps data rate. In addition, we use the Advanced Encryption Standard (AES) as a benchmark, as it contains a very different kind of computation, intensive in bit manipulation and table look-ups, to see how our architecture fares on such workloads.
The FIR filter is used in the equalization function in the OFDM receiver. We model a 128-tap FIR filter and assume that the data rate is 64 Mbps. We also model a 128-point FFT and assume the data rate is 256 Mbps. FFT and IFFT are key components of the OFDM receiver. For the Viterbi decoder, we assume a constraint length of K=7 and a data rate of 54 Mbps. This is the most computation-intensive part of the OFDM receiver.
Our initial experimental procedure is as follows:
1. Write the function in C and verify it using the Blackfin Visual DSP simulation environment.
2. Replace the performance-critical sections of the code with Blackfin assembly code, to achieve optimal performance. This corresponds to the implementation on a single tile.
3. Map the application onto multiple tiles, using a homebrewed tool to assist in pruning the search space.
4. Manually insert the communication instructions.
5. Estimate the clock cycle count for the application.
6. Using the power models for the interconnect and the tile described in the previous section, estimate the power. The parameterized power models were implemented in Excel, which was used to generate the graphs reported in the next section.
While an extensive cycle-level simulation infrastructure is currently under development, we felt that hand-counts were appropriate for guiding the early design of the architecture. In particular, our driving signal processing applications are very amenable to hand-analysis, as their computations are focused on a small number of kernels.
Results
Our results focus on several key design questions. We explore the parallelism available in each algorithm by varying the number of processing tiles, the communication bandwidth necessary through varying global bus widths, and the power efficiency of communication by exploring segmented buses.
Figure 2, which plots power normalized to the 8x2 configuration, shows that as the number of tiles increases, there is the traditional tradeoff between computation and communication, but performance is not our goal. As the performance increases, we lower the clock frequency to maintain the target throughput. All three applications observe an initial decrease in total power, but by the 8x2 tile configuration, the diminishing returns of parallelization outweigh the benefits of voltage scaling. Thus we should provide either 2x2 or 4x2 tiles in each column.

Impact of Bus Width
We then vary bus width. Data dependencies prevent effective overlap of communication and computation. This makes fast communication critical to efficiency; otherwise, processor idle time will lead to wasted power. We note that processor power accounts for the majority of our system power and that it is impractical to turn processors on and off for periods on the order of a dozen cycles. Consequently, we see in Figures 3-5 that increasing bus width decreases processor idle time, which decreases system power. For FIR, the power begins increasing again at 256 bits because FIR cannot take advantage of the increased width. We further note that Amdahl's law comes into play, and we see the greatest power savings as we initially double bus width, cutting communication latencies in half. As we continue to invest in bus bandwidth, processor idle time becomes a smaller fraction of total run time. With cost as a concern, an area-conscious design philosophy would be to choose a bus width of 64 or 128 bits, where we get the most bang for the buck.
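The diminishing returns can be reproduced with a toy cycle count. The sketch assumes, as stated above, that communication does not overlap with computation, and that the bus moves one word of the given width per cycle. The cycle and bit counts are hypothetical and do not correspond to any of our kernels.

```python
def total_cycles(compute_cycles, comm_bits, bus_width_bits):
    """Cycles per kernel iteration when communication and computation do not overlap."""
    comm_cycles = -(-comm_bits // bus_width_bits)  # ceiling division
    return compute_cycles + comm_cycles


COMPUTE_CYCLES = 1000   # hypothetical compute cycles per iteration
COMM_BITS = 16384       # hypothetical bits exchanged per iteration

widths = [16, 32, 64, 128, 256]
cycles = [total_cycles(COMPUTE_CYCLES, COMM_BITS, w) for w in widths]
savings = [wide_before - wide_after for wide_before, wide_after in zip(cycles, cycles[1:])]

for w, c in zip(widths, cycles):
    print(f"{w:3d}-bit bus: {c} cycles")
print("cycles saved by each doubling:", savings)
```

Each doubling of the bus width saves half as many cycles as the previous one, while the wiring cost keeps growing. The sketch does not model the added wire power that causes the FIR upturn at 256 bits.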
Impact of Segmented Buses
Segmenting the bus allows two simultaneous, short-distance messages to use the same bits in the wire. At the low frequencies of the Synchroscalar system, segmenters are simple transmission gates with little signal restoration or latency involved. Figure 6 shows that as the number of tiles in the column increases, the savings from segmentation also increase, because there are more messages that can traverse the bus at once. Dramatic savings are seen in Viterbi with 8x2 tiles. Even at 4x2, the applications observe 17-54% power savings.
Discussion
Our simple design-space exploration has revealed several results that challenge our intuitions of microprocessor design. Primarily, substantial global interconnect makes sense in this domain. Low operating frequencies allow signals to traverse global buses in a single cycle. Data dependencies and tile power make the latency of global communication critical. Furthermore, statically scheduled segmented buses allow the power and utilization of our interconnect to approximate more specialized interconnects as used in ASICs.
Related Work
The challenges presented by next-generation applications in terms of higher data rates, lower power requirements, shrinking time-to-market requirements, and lower cost have resulted in tremendous interest in architectures and platforms for embedded communication appliances in the past few years. Researchers have approached the problem from several different angles. The DSP architecture companies have proposed highly parallel VLIW machines coupled with hardware accelerators or co-processors for the computation-intensive functions. The TI OMAP is a good example of this category of solutions.
The programmable logic community has been very active in this area as well, and there are numerous architectural proposals that are derivatives of the standard FPGA. The SCORE project at UC Berkeley [5] and the PipeRench project at CMU [16] are especially noteworthy. They use the dynamic reconfigurability of field-programmable gate arrays to improve power and performance efficiency. The PLEIADES project at UC Berkeley [21] proposes an interconnection of a low-power FPGA, datapath units, memory, and processors, optimized for different application domains. The Pleiades researchers conclude that a hierarchical generalized mesh interconnect structure [22] is most appropriate for their architecture because it balances both the global and the local interconnect. Our results are in agreement with this conclusion in general, but given that we are targeting streaming computations such as those encountered in a wireless transceiver, we place greater emphasis on near-neighbor communication, so we have stayed away from a general mesh.
The adaptive SOC (aSOC) project at the University of Massachusetts [10] advocates an array of processors connected by a statically scheduled communication fabric. They allow different processors to operate at different clock frequencies and demonstrate significant power savings on video processing benchmarks.
The key differences between this work and Synchroscalar are in the structure and contents of the tiles and the memory architecture. In aSOC the tiles are hardwired functional blocks such as a Viterbi decoder, FFT, DCT, etc., while in Synchroscalar we assume programmable DSPs as the building blocks for the tiles. As a result, the memory architecture of the system is radically different, changing the data transfer and communication scheduling problem as well. Still, it would be interesting to compare the results between the Synchroscalar and aSOC approaches.
Recently, there has been a revival of interest in the globally asynchronous, locally synchronous (GALS) approach to processor implementation [1], including the use of multiple clock domains and multiple voltages [11] [17]. The key difference between the GALS approach and the Synchroscalar approach is the restriction to rationally related frequencies between different columns. This avoids the use of asynchronous FIFOs with their synchronization overhead. In this respect, Synchroscalar is closer to NuMesh [18] than to the GALS approach.
Synchroscalar's use of spatial rather than temporal flexibility is somewhat inspired by the MIT RAW project [20] [9], but our focus on low power and embedded applications is significantly different. Nevertheless, we expect to be further inspired by the extensive compiler work from the RAW group. Although their compiler algorithms are geared towards dynamic general microprocessor algorithms such as speculation and caching, we expect to leverage their experiences with program analysis and resource allocation.
Another project with a less embedded focus is the Imagine stream processor, a tile architecture at Stanford [13]. Their experience with streaming applications will also be invaluable to the design of our high-level software. Our emphasis on Synchroscalar regions for power reduction and static scheduling of rationally clocked communication, however, will add significant challenges to our software solutions. Furthermore, both Imagine and RAW are focused on large-system scalability rather than the inexpensive design points of small, embedded systems. We believe that Synchroscalar's differing focus in cost and power will lead to significantly new tradeoffs and design decisions.
Conclusion
The goal of this work was to guide the initial design of a tile-based embedded architecture. Through simple power models, we found that our original intuitions regarding interconnect did not apply to the low-frequency, data-dependent nature of our application domain. We found that wide, segmented global buses give us some of the low latency and flexibility that conventional DSPs lack. We plan to continue our evaluation of the Synchroscalar architecture through extensive design and simulation of end-to-end applications. We are confident that a novel architecture can meet the challenges of tomorrow's embedded applications.
