Synchroscalar: A Multiple Clock Domain,
Power-Aware, Tile-Based Embedded Processor
John Oliver, Ravishankar Rao, Paul Sultana,
Jedidiah Crandall, Erik Czernikowski, Leslie W. Jones IV,
Diana Franklin, Venkatesh Akella, and Frederic T. Chong
University of California, Davis
California Polytechnic State University, San Luis Obispo
Abstract

We present Synchroscalar, a tile-based architecture for embedded processing that is designed to provide the flexibility of DSPs while approaching the power efficiency of ASICs. We achieve this goal by providing high parallelism and voltage scaling while minimizing control and communication costs. Specifically, Synchroscalar uses columns of processor tiles organized into statically-assigned frequency-voltage domains to minimize power consumption. Furthermore, while columns use SIMD control to minimize overhead, data-dependent computations can be supported by extremely flexible statically-scheduled communication between columns.
We provide a detailed evaluation of Synchroscalar including SPICE simulation, wire and device models, synthesis of key components, cycle-level simulation, and compiler- and hand-optimized signal processing applications. We find that the goal of meeting, not exceeding, performance targets with data-parallel applications leads to designs that depart significantly from our intuitions derived from general-purpose microprocessor design. In particular, synchronous design and substantial global interconnect are desirable in the low-frequency, low-power domain. This global interconnect supports parallelization and reduces processor idle time, which are critical to energy efficient implementations of high bandwidth signal processing. Overall, Synchroscalar provides programmability while achieving power efficiencies within 8-30X of known ASIC implementations, which is 10-60X better than conventional DSPs. In addition, frequency-voltage scaling in Synchroscalar provides power savings of between 3% and 32% in our application suite.

1. Introduction
Next-generation embedded applications demand high throughput with low power consumption. Current approaches often use Application-Specific Integrated Circuits (ASICs) to satisfy these constraints. However, rapidly evolving application protocols, multi-protocol embedded devices, and increasing chip NRE costs all argue for a more flexible solution. In other words, we want the flexibility of a programmable Digital Signal Processor (DSP) with energy efficiency closer to that of an ASIC. We propose the Synchroscalar architecture, a tile-based DSP designed to efficiently meet the throughput targets of applications with multi-rate computational subcomponents.
We focus upon next-generation signal processing applications which cannot be efficiently supported on today's DSPs, including Orthogonal Frequency Division Multiplexing (OFDM) for 802.11a, MPEG4 encoding, stereo feature extraction and correlation, and software radio digital down conversion. Contrary to traditional microprocessor design goals of the highest performance possible, our goal is to design the lowest power solution for set performance targets. Consequently, our metric of success is the lowest system power to achieve a solution, not raw performance.
While conventional wisdom credits the low power of ASIC implementations to their low area per operation [8], Synchroscalar invests area to achieve programmability while compensating with voltage scaling to achieve low power. Specifically, Synchroscalar uses multiple processor tiles and wide buses to exploit parallelism in order to achieve performance targets while running at low frequencies. Ideally, linear gains in performance translate to quadratic reductions in power due to voltage scaling.
In designing Synchroscalar, we focused on several key features of ASICs that lead to their energy efficiency: high parallelism, low control overhead, and custom interconnect.

Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA’04)
1063-6897/04 $ 20.00 © 2004 IEEE

Our design achieves power efficiencies within 8-30X of known ASIC implementations, which is 10-60X better than conventional DSPs. The success of the Synchroscalar design stems from the nature of its target application class: it exploits their multi-rate structure, intra-task data parallelism, and statically predictable control and communication. To this end, Synchroscalar uses a column-oriented 2D tile structure that follows three design principles.
First, Synchroscalar exploits parallelism to perform voltage scaling. We minimize hardware complexity by scaling voltages spatially rather than temporally. Columns of processors are statically assigned voltages rather than dynamically varying voltage for each processor. Computations are mapped to the appropriate frequency and voltage, and communication facilitates moving from one voltage domain to another.
Second, Synchroscalar amortizes control overhead by grouping each column of processors into a single thread of control, implemented with a single SIMD control unit and program memory.
Third, Synchroscalar minimizes communication overhead through substantial investment in statically configurable interconnect. Specifically, Synchroscalar's low clock frequencies enable the use of wide segmented buses. Because communication can be heavily data dependent and consequently inefficient to manage under SIMD control, we introduce a decoupled communication controller in each column to orchestrate data motion using static schedules. This enables extremely low overhead register-to-register inter-tile communication which allows us to compete with the dedicated interconnects of ASICs.
In the remainder of this paper, we provide an overview of the Synchroscalar architecture and our multi-rate applications to establish the context of our study. Then we describe our evaluation methodology, including power models, SPICE simulations, VHDL synthesis, software tool chain, and cycle-level simulation. We analyze our results and discuss our intuitions from this analysis. We then conclude with related and future work.

2. The Synchroscalar Architecture
The high level observation that led to this design was that if an application can be parallelized efficiently on an architecture, then the clock and voltage can be scaled down in order to reduce power consumption. The column-oriented nature of Synchroscalar allows us to greatly reduce the complexity of control, communication, clock distribution and voltage scaling. SIMD controllers are used to amortize the control overhead and support efficient application parallelization in each column, while Data Orchestration Units (DOUs) provide communication flexibility by supporting statically-scheduled zero-overhead irregular communication. Interconnect bandwidth is highest within and between columns, in order to provide high-speed communication within an application. Lower bandwidth is required for communication between components. Additionally, each column of four tiles is supported by its own clock generator and supply voltage, which are configured at startup.

Figure 1. The Synchroscalar Architecture (columns of four PEs, each column with a SIMD Controller, a DOU and a Clock Divider fed from a PLL, plus IN DATA and OUT DATA ports)

We will use the Digital Down Converter (DDC) application as an example of how the parallelization and mapping process works. Parallelization begins by recognizing stages in the application with a specific data rate between each stage. The first two stages of this application are the digital mixer and the CIC integrator (see Section 3 for full application descriptions). After exploring the trade-offs between computation and communication with varying levels of parallelization (described in Section 5) we find that the first stage, the mixer, should run on 8 tiles and the integrator on 8 tiles to minimize power consumption. The mixer is then mapped to the first two columns and the integrator to the third and fourth columns.
Once the parallelization and mapping are complete, the clock and voltage can then be scaled down based on the application needs. Given the DDC's target execution rate of 64 million samples per second, the mixer tiles need to run at 120 MHz and the integrator tiles at 200 MHz. These clock rates are generated from reference clocks which are fed into clock dividers in each column as shown in Figure 1. Supply voltages are also externally supplied, and SPICE simulations from Section 4 indicate that the mixer tiles can operate at 0.8V and the integrator tiles at 1.0V. With this simple example and overview in mind, the remainder of this section describes the major components of Synchroscalar in greater detail.
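The DDC mapping above can be turned into a small back-of-the-envelope check. The sketch below is only illustrative: it assumes that the input samples are spread evenly over the tiles assigned to a stage (an assumption, not something the text states), and the helper name `cycles_per_sample` is hypothetical. Under that assumption it shows the per-sample cycle budget implied by the quoted clock rates.

```python
# Illustrative check of the DDC mapping numbers quoted in the text.
# Assumption (not from the paper): samples are divided evenly among the
# tiles of a stage, so each tile handles sample_rate / n_tiles samples/s.

SAMPLE_RATE_HZ = 64e6  # DDC target rate: 64 million samples per second


def cycles_per_sample(freq_hz, n_tiles, sample_rate_hz=SAMPLE_RATE_HZ):
    """Cycles a stage can spend on each input sample at a given clock."""
    return freq_hz * n_tiles / sample_rate_hz


# Mixer: 8 tiles at 120 MHz; integrator: 8 tiles at 200 MHz.
mixer_budget = cycles_per_sample(120e6, 8)
integrator_budget = cycles_per_sample(200e6, 8)

print(f"mixer: {mixer_budget} cycles/sample")
print(f"integrator: {integrator_budget} cycles/sample")
```

Read in the other direction, the same relation gives the clock a stage needs once its cycle count per sample is known from simulation.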


2.1. Parallelism
Parallelism is critical to the success of Synchroscalar, for it is through parallelism that we can reduce the clock frequency, and thus voltage, while continuing to meet performance targets. In this we are greatly aided by the statically-predictable, highly data-parallel nature of signal processing. While our applications are all hand-parallelized in this study, future work will focus on automated tools. We believe that automation is realistic, since our applications fit the Synchronous Dataflow (SDF) model of computation used in existing DSP design tools such as Ptolemy from UC Berkeley [6, 7, 9], Simulink from MathWorks, and SPW from Cadence.
The dataflow models allow for two forms of parallelism: within a Synchroscalar column and between columns. SDF also provides predictability by restricting the number of data values produced and consumed by a task to be a constant. This restriction imposed by the SDF model offers the advantage of static scheduling and decidability of key verification problems such as bounded memory requirements and deadlock avoidance [21].

2.2. SIMD Control
Low-overhead control is critical to the efficiency of ASICs. The data-parallel nature of signal processing applications allows a reduction in the cost of instruction fetch and decode through a single SIMD controller that sends instructions to the tiles in a column. The SIMD controller performs all control instructions, only forwarding computation instructions to the tiles. To communicate data for conditional branches, the SIMD controller is connected to the segmented bus with the tiles.
In order to support branch prediction, there would need to be a mechanism to squash instructions that have already been sent to the processing elements. Instead, we provide a short pipeline in the control unit to calculate branches quickly, and delay instructions from reaching the processing elements. This introduces a single-cycle stall for each conditional branch. For zero-overhead loops, there is still no delay, because the PC is used for decision making, not the actual instruction.
Note that applications do not always parallelize evenly into columns of 4 tiles, requiring occasional idle tiles. Idle tiles are assumed to consume negligible power through supply gating, so we sacrifice their area to simplify our design. Idle tiles are decided at startup.

2.3. Reconfigurable Interconnect
Synchroscalar exploits parallelism to increase efficiency, but these gains must not be lost to the communication overhead to support this parallelism. In particular, latency-critical communication must not be allowed to increase the idle time of power-hungry processor tiles, especially when idle times are too small to shut down tiles.
Given our design goal of low system clock rates, we find that we can approximate the specialized interconnects of ASICs through a combination of segmented buses and a decoupled communication controller called the Data Orchestration Unit (DOU). Refer back to Figure 1 to see these buses and controllers arranged in each column. We note that signal processing applications exhibit much higher communication requirements within computational blocks than between blocks. Consequently, we allocate only a single horizontal bus between columns, which both meets bandwidth requirements and facilitates gather-scatter operations.
Synchroscalar buses are 256 bits wide, grouped into 8 32-bit separable vertical buses that are segmented in between each of the tiles. Although 256 bits wide might seem power-hungry, we shall see in Section 4 that the power consumption of the buses is small compared to the cost of supporting a higher frequency tile.
In addition, by suitably controlling the segment controllers, the bus can perform several parallel communications. For instance, if all the controllers are turned on, the bus becomes a low-latency broadcast bus, and all tiles are able to receive the same data in a single cycle. Alternatively, two messages can pass between neighboring tiles using the same wires in different segments, achieving the approximate bandwidth of a mesh if code is allocated to the tiles intelligently. The segmentation of the bus allows Synchroscalar to achieve higher levels of local bandwidth for very little cost in area and power and reduces tile idle time due to remote data dependencies.

Figure 2. Segment Controllers (Processors 0-3, each with a 4-to-16 decoder driven from the DOU and a 32-bit connection to register R7)

Data Orchestration Unit (DOU). A key feature of Synchroscalar is statically-scheduled communication provided by the DOU decoupled controllers located in each column. The goal of the DOUs is to provide zero-overhead data movement between producer and consumer tiles. A producer writes to a special register, and, at a statically-scheduled time, a consumer can read that value from a receive register. The DOUs provide separate, cycle-by-cycle control of data motion and interconnect configuration. This flexibility facilitates irregular data motion and allows our applications to be efficiently scheduled in SIMD tile computations. The DOU operates at the maximum frequency, the frequency of the bus. Since the DOUs are very small, the power contribution of the DOUs is minimal.

Figure 3. DOU Implementation (a table of 64-bit state entries with fields CNTR, SEG0-SEG3, Buffer0-Buffer3 and NXTSTATE 0-1, together with Counters 0-3)
There is one DOU for each of the columns on Synchroscalar. The gray boxes that overlap the data bus in Figure 1 represent the segmenters, and the gray lines that connect the DOU to the segmenters are the control lines that are necessary to control each of the 8 splits. Figure 2 depicts a detailed logical diagram of the segmenters.
The DOU is simply a state machine in which the outputs of each state control the segmenters. The DOU must be programmed with the desired communication patterns for the column bus it controls. There are 128 states in the DOU. Each state entry in the DOU has five types of fields, CNTR_i, SEG_i, Buffer_i, NXTSTATE0_i and NXTSTATE1_i, as shown in Figure 3.
The CNTR field specifies which of the four DOU down counters should be checked for a given state in the DOU. If the counter specified by the CNTR field is zero, then the next state is the state pointed to by the NXTSTATE0 field of that given state and the down counter is reset to its initial value. If not, the DOU state machine proceeds to the state pointed to by the NXTSTATE1 field and decrements the counter selected by CNTR. There are four 32-bit down counters that are preprogrammed with the dynamic instruction count of the associated loop, allowing four nested loops. The SEG and Buffer fields are the outputs of a given state. They control the bus segmenters for a given column and the communication buffers for each tile in a given column, respectively.
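The state-transition rule described above can be sketched as a small software model. This is a minimal illustration, not the hardware implementation: the class and field names (`DouState`, `Dou`, `step`) are hypothetical, the reset state is assumed to be state 0, and the bit widths of the SEG and Buffer outputs are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DouState:
    cntr: int       # which of the four down counters this state checks (0-3)
    seg: int        # segmenter control bits output in this state
    buffer: int     # tile buffer control bits output in this state
    nxtstate0: int  # next state when the selected counter is zero
    nxtstate1: int  # next state when the selected counter is non-zero


class Dou:
    MAX_STATES = 128
    NUM_COUNTERS = 4

    def __init__(self, states, counter_init):
        if len(states) > self.MAX_STATES:
            raise ValueError("the DOU holds at most 128 states")
        if len(counter_init) != self.NUM_COUNTERS:
            raise ValueError("the DOU has exactly four down counters")
        self.states = list(states)
        self.initial = list(counter_init)   # preprogrammed counter values
        self.counters = list(counter_init)
        self.state = 0                      # assumed reset state

    def step(self):
        """Advance one bus cycle and return this state's (SEG, Buffer) outputs."""
        entry = self.states[self.state]
        outputs = (entry.seg, entry.buffer)
        if self.counters[entry.cntr] == 0:
            # Counter expired: reload it and take the NXTSTATE0 branch.
            self.counters[entry.cntr] = self.initial[entry.cntr]
            self.state = entry.nxtstate0
        else:
            # Counter still running: decrement it and take NXTSTATE1.
            self.counters[entry.cntr] -= 1
            self.state = entry.nxtstate1
        return outputs


# Hypothetical two-state program: state 0 loops on itself until counter 0
# expires, then state 1 runs for one cycle and returns to state 0.
program = [
    DouState(cntr=0, seg=0b1111, buffer=0b0001, nxtstate0=1, nxtstate1=0),
    DouState(cntr=1, seg=0b0000, buffer=0b0000, nxtstate0=0, nxtstate1=0),
]
dou = Dou(program, counter_init=[2, 0, 0, 0])
seg_trace = [dou.step()[0] for _ in range(5)]
print(seg_trace)
```

With counter 0 loaded with 2, the model stays in state 0 for three cycles (the counter reads 2, 1, then 0) before moving on, which is the kind of loop-following behavior the down counters are meant to provide.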
Here is a quick example of the DOU's operation. Figure 4 shows a nested pseudo-code loop. It requires two DOU counters, one for I and one for J. The I loop counter would need to be 4*A and the J loop counter would need to be 2*B, assuming the FOR loop instruction can be encoded in a single assembly instruction. The output pattern would need to be programmed for each of the instructions that access the global data bus. For all other instructions, the output pattern is a "don't care".

For(i=0; i<A; i++){
    Outer_Instruction1;
    Outer_Instruction2;
    For(j=0; j<B; j++){
        Inner_Instruction1;
        Inner_Instruction2;
    }
    Outer_Instruction3;
}

Figure 4. Example DOU code

Synchroscalar tiles are based on the ADI/Intel Blackfin DSP ISA [20], but with control provided by the SIMD controller instead of in each tile. Additionally, each of the tiles has a read and a write buffer as shown in Figure 2. These buffers have a dual purpose. Their first function is to adapt the tile voltage to the bus voltage, as tile voltages across the Synchroscalar design may be different. Secondly, the buffers align a word of data onto the desired split of the global data bus. Register R7 is the designated communication register on each of the tiles. The DOU controls the alignment of this register on the data bus. Only three bits are required from the DOU to control the placement of the data on the data bus for each of the 8 splits of the bus.
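The behavior of the segmented bus described earlier in this section (a broadcast bus when every segment controller is on, independent neighbor-to-neighbor transfers when some are off) can be illustrated with a short sketch. The function name `bus_groups` and the choice of three segmenters between four tiles on one 32-bit lane are assumptions made for illustration; the paper's figures suggest the SIMD controller also sits on the bus, which this sketch ignores.

```python
def bus_groups(switch_closed):
    """Group tile indices that share one electrically connected bus segment.

    switch_closed[i] is True when the (hypothetical) segmenter between
    tile i and tile i + 1 is turned on, joining the two segments.
    """
    groups = [[0]]
    for i, closed in enumerate(switch_closed):
        if closed:
            groups[-1].append(i + 1)   # same segment as the previous tile
        else:
            groups.append([i + 1])     # a new, isolated segment starts here
    return groups


# All segmenters on: one segment spans the column, so a single write
# reaches every tile in the same cycle (broadcast).
broadcast = bus_groups([True, True, True])

# Middle segmenter off: tiles 0-1 and tiles 2-3 can each carry an
# independent message over the same wires in the same cycle.
paired = bus_groups([True, False, True])

print(broadcast, paired)
```

Each group can carry one word per cycle per lane, so opening segmenters trades reach for the number of simultaneous transfers.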

2.4. Clock and Voltage Domains
All of the design decisions above come together to provide clock and voltage scaling per column. Since the applications have known performance needs, the SDF model provides predictable performance and distinct tasks that are exploited by our SIMD column-based design. These columns become separate clock and voltage domains. Each task is performed at the lowest frequency that meets the application constraints and the corresponding voltage, down to the chosen voltage and frequency floors of 0.7 V and 100 MHz. To further reduce complexity, we support only a small set of frequencies and voltages for a given design. The computational rates of each algorithmic block implemented in each column, however, must be matched to the target data rates. If one block runs too fast, then the subsequent block will not be able to keep up with the data produced.
A simple procedure for matching rates is to choose the minimum frequency necessary for each column, then add nops to throttle the computational rate. Unfortunately, adding nops to application code may not be convenient if the throttling rate is not a good multiple of the existing loop structure. Instead, we introduce a simple mechanism for flexible computation throttling in our multi-rate system called Zero Overhead Rate Matching. We add a simple programmable counter to each SIMD controller. This counter lets us dynamically insert nops into the tiles of a column at any period of cycles, thus allowing perfect rate matching.
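A minimal sketch of the idea behind Zero Overhead Rate Matching follows. It models the programmable counter as a generator that turns every `period`-th issue slot into a nop; the exact counter semantics in the hardware are not given in the text, so the name `throttle` and the convention that `period` counts issue slots (including the nop) are assumptions.

```python
NOP = "nop"


def throttle(instructions, period):
    """Yield the instruction stream with every period-th slot replaced by a nop.

    Models a counter in the SIMD controller; the program itself is unchanged.
    """
    if period < 2:
        raise ValueError("period must be at least 2 cycles")
    countdown = period - 1          # real instructions left before the next nop
    for instr in instructions:
        if countdown == 0:
            yield NOP               # counter expired: insert a nop slot
            countdown = period - 1  # and reload the counter
        yield instr
        countdown -= 1


issued = list(throttle(["i0", "i1", "i2", "i3"], period=3))
print(issued)
```

With a period of 3, two of every three slots carry real instructions, so the column computes at two thirds of its clock rate without any change to the application code.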


3. Applications
To drive the design of Synchroscalar, we selected four signal-processing applications, each of considerable complexity involving several computational subcomponents. None can be executed at the required rate by any known commercial DSP at this time. These applications are: Digital Down Conversion, Stereo Vision, 802.11a, and MPEG-4. The next four paragraphs briefly describe each of these applications.
Digital Down Conversion (DDC) is an integral component of many communication systems, and functions primarily to convert a received signal to baseband such that the signal of interest can be processed. This particular DDC was configured to support GSM cellular requirements of up to 64 Msamples per second. It comprises a Numerically Controlled Oscillator, a digital mixer, a Cascaded-Integrator-Comb (CIC) filter and a two-stage filter in the form of a compensating 21-tap filter (CFIR) and a 63-tap filter (PFIR).
Stereo Vision (SV), used in the Mars Rover [26], has two stages: point feature extraction and point feature correlation. Each frame processed is 256 by 256 pixels in monochrome and is processed at a rate of 10 frames per second. Tomasi and Kanade's [10] algorithm was employed for point feature extraction, and singular value decomposition [30] was used for point feature correlation.
802.11a is an end-to-end application. This IEEE standard for wireless communications supports data rates up to 54 Mbps. It is coded using OFDM and employs up to twelve 20 MHz channels in the 5 GHz frequency range. The four major components in the 802.11a receiver are the FFT, Demodulation, De-Interleaving and a K=7 Viterbi Decoder.
MPEG-4 is an ISO/IEC standard adopted in 1998. For encoding, we implement Motion Estimation, DCT and Quantization, which constitute about 90% of the video encoder [36]. For Synchroscalar, both CIF and QCIF MPEG4 encoding were performed at 30 frames per second.

4. Methodology
We now present an evaluation framework for Synchroscalar including: an application mapping methodology, tile and interconnect power models, VHDL synthesis, and cycle-accurate simulation.

4.1. Implementing Applications on Synchroscalar
In this section, we will outline the procedure used to map applications to Synchroscalar and evaluate their performance and power efficiency. The process involves finding an efficient mapping of an application on the Synchroscalar architecture, validating it for functional correctness and then determining the appropriate frequency and voltage of operation of each column. The frequency and voltage values are plugged into an empirical power model for Synchroscalar to evaluate the power consumption for that mapping. The detailed procedure is outlined below.
1. Start with the description of the application on a single tile.
2. Choose the number of tiles, N, that minimizes power.
3. Partition the application among the N tiles and insert data transfer operations to model the communication between the tiles.
4. Assume every data transfer takes one clock cycle. Statically schedule all the data transfers.
5. Introduce NOPs as needed to avoid structural hazards due to bus conflicts.
6. Use the cycle-accurate simulator to determine the number of clock cycles required per input data sample. Code and data are in local tile memories when computing the clock cycle count.
7. Given the input data rate and number of cycles required by each tile, compute the frequency of operation for each column of tiles. Let $f_i$ be the frequency of operation of the $i$-th column.
8. Using SPICE and the Berkeley Predictive Technology Models, find the required supply voltage ($V_i$) for a given frequency, for an assumed critical path delay of 20 FO4s.
9. Estimate the total power using the following equations:

$$P_{total} = \sum_{i=1}^{columns} P_{column_i} + P_{interconnect} + P_{leakage}$$

$$P_{column_i} = N_i \cdot U \cdot f_i \cdot \left( \frac{V_i}{V_{ref}} \right)^2$$

$$P_{interconnect} = C \cdot V^2 \cdot f$$

where $U$ is defined as the normalized power in milliwatts per MHz (mW/MHz) at the reference voltage $V_{ref}$, and it includes the active power consumed by the tile (including the data memory) and the DOU and the SIMD controller in each column, $C$ is the average bus capacitance switched per cycle at the bus voltage $V$ and frequency $f$, and $N_i$ is the number of tiles in column $i$.

Based on the procedure outlined above, it is clear that the key factors that influence our model are the power model for the tile, the power model for the buses or interconnect, and the leakage power. Next we describe how we model these parameters and how we validate them.
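The tile portion of the power model can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' tool: it assumes that active power scales linearly with frequency and quadratically with supply voltage relative to the reference voltage, it omits interconnect and leakage, and the function names are hypothetical. The inputs reuse numbers quoted elsewhere in the text (U of 0.1 mW/MHz at 1 V, and the DDC example's 120 MHz at 0.8 V and 200 MHz at 1.0 V).

```python
def column_power_mw(n_tiles, u_mw_per_mhz, f_mhz, v, v_ref=1.0):
    """Active power of one column: N * U * f, scaled by (V / V_ref)^2.

    Assumes dynamic power is proportional to f and to V squared.
    """
    return n_tiles * u_mw_per_mhz * f_mhz * (v / v_ref) ** 2


def total_tile_power_mw(columns, u_mw_per_mhz, v_ref=1.0):
    """Sum active power over (n_tiles, f_mhz, volts) tuples, one per column."""
    return sum(
        column_power_mw(n, u_mw_per_mhz, f, v, v_ref) for n, f, v in columns
    )


# DDC example from Section 2: two mixer columns and two integrator columns,
# four tiles each. Interconnect and leakage power are not included here.
ddc_columns = [(4, 120.0, 0.8), (4, 120.0, 0.8), (4, 200.0, 1.0), (4, 200.0, 1.0)]
ddc_tile_power = total_tile_power_mw(ddc_columns, u_mw_per_mhz=0.1)
print(f"{ddc_tile_power:.2f} mW")
```

The result is only as good as the assumed scaling law and should not be read as a figure reported by the paper.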

4.2. Tile Model
To model the power of the tile we need two things. The first is the parameter U, which represents the normalized power of the tile and its associated components. The second is the relationship between the frequency of operation of the tile and the operating voltage.
The parameter U is estimated as follows. The tile consists of 2 40-bit ALUs, 4 8-bit video ALUs, 2 40-bit accumulators, 2 16x16 multipliers, 1 40-bit barrel shifter, a 32x32 register file with 4 read ports and 2 write ports, 32KB data memory and glue logic. It was modeled in VHDL and synthesized using the Synopsys Design Compiler. The multipliers, register file and memory were not synthesized. We mapped the design to a 0.25 µm ASIC library, at a supply voltage of 2.5V, and used Design Power to estimate the power from the gate-level netlist. We scaled the results to 130 nm geometry and found that the normalized power of the datapath was approximately 0.03mW/MHz. To this we added the contribution of the register file (0.11mW/MHz) [27] and the data memory (1.75mW/MHz) [28], by scaling the data appropriately. Hence, the total normalized power of the tile was estimated to be 1.89mW/MHz. To this we add the amortized overhead from the DOU and the SIMD controller. Assuming that there are four tiles per column, the contribution of the SIMD controller and the DOU to the power of each tile is roughly 0.25mW/MHz, for a total normalized power of 2.14mW/MHz, which corresponds to the parameter U in the equation above.
We assume that by doing a custom logic implementation with appropriate transistor sizing we would cut the power of the synthesized portions of the logic in the SIMD controller and the tile by around 30%. With this assumption, we estimate the normalized power to be approximately 0.642mW/MHz, which reduces to 0.1mW/MHz at 1V supply. Although no Blackfin core power numbers are available, we can compare our estimate to a similar core from NEC, the SPXK5 [37], which consumes 0.07mW/MHz in 130 nanometer technology. Given that we are using an estimate, however, we will discuss the sensitivity of our results to tile power at the end of the results section.
The relationship between operating frequency and supply voltage of a column is found as follows. We assume the critical path is 20 FO4 gates, which is pessimistic, but appropriate for an embedded DSP core [16]. Using the Berkeley Predictive Technology Models [17], we simulate a 20 FO4 critical path in SPICE and plot the relationship between frequency and voltage. The graph in Figure 5 shows the variation of the frequency and voltage for the 130 nm process assuming critical paths of 15 and 20 FO4 lengths. This graph is captured as a look-up table to determine the appropriate voltage of operation of a tile given the frequency.

Figure 5. Voltage-Frequency curve for a 20 FO4 pipelined processor (Operating Frequency (MHz) plotted against Supply Voltage (V) for 15 FO4 and 20 FO4 critical paths)

Table 1. Technology Parameters

Parameter                      Value          Source
Technology                     130 nm         Blackfin DSP [20]
Minimum Voltage                0.7 V          Estimated [17]
Maximum Voltage                1.65 V
Threshold Voltage              0.332 V        [17]
Temperature                    40 C           Assumed
Oxide Thickness                3.3 nm         [17]
Dielectric Strength of Oxide   5e6 V/cm       [17]
Max Frequency                  600 MHz        SPICE using [17]
Tile Power                     0.1 mW/MHz     See estimate above
Tile Size                      1.82 mm²       From Section 4.6
Wire Cap.                      387 fF/mm      Semi global [16]
Wire pitch                     16             Semi global wiring [16]
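The arithmetic behind the tile power estimate in Section 4.2 can be replayed as follows. This sketch assumes that the 0.642 mW/MHz figure refers to the 2.5 V library voltage and that power scales with the square of the supply voltage; both are assumptions made here to reproduce the quoted 0.1 mW/MHz at 1 V, and the variable names are hypothetical.

```python
# Normalized power contributions quoted in Section 4.2, in mW/MHz.
DATAPATH = 0.03
REGISTER_FILE = 0.11
DATA_MEMORY = 1.75
CONTROL_SHARE = 0.25   # SIMD controller + DOU, amortized over four tiles


def scale_to_voltage(power_mw_per_mhz, v_from, v_to):
    """Scale dynamic power quadratically with supply voltage (assumed P ~ V^2)."""
    return power_mw_per_mhz * (v_to / v_from) ** 2


tile = DATAPATH + REGISTER_FILE + DATA_MEMORY       # tile alone
u_synthesized = tile + CONTROL_SHARE                # tile plus control share
u_custom = 0.642                                    # custom-logic estimate from the text
ratio = u_custom / u_synthesized                    # custom estimate vs. synthesized total
u_at_1v = scale_to_voltage(u_custom, v_from=2.5, v_to=1.0)

print(round(tile, 2), round(u_synthesized, 2), round(ratio, 2), round(u_at_1v, 3))
```

Note that the quoted 0.642 mW/MHz is 30% of the 2.14 mW/MHz total, which is worth keeping in mind when reading the stated 30% reduction.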

4.3. Interconnect Model
The interconnect model is largely based on the data from the "The Future of Wires" paper [16]. In 0.18 µm technology, the gate capacitance of a minimum sized transistor is about 1-2fF [16]. This value is expected to remain constant over shrinking process technologies. The projected wire capacitance per unit length for a semi-global wire in 0.13 µm technology is 387fF/mm. Assuming the length of the chip is about 10mm (which corresponds to the length of the bus), the wire capacitance is about 3870fF. Even if the drivers and repeaters are 10 times the minimum size, their capacitance is about 20fF each. If there are 8 drivers for each bus, they add only 160fF to the wire capacitance. Also, we find that the gate and drain capacitances are orders of magnitude smaller than the wire capacitance per unit length. The drain-source capacitance of the segmenters and the gate and drain capacitances of the drivers are therefore ignored. Thus the interconnect is modeled by the wire capacitance to a first order approximation. A summary of the key parameters of our model and their sources is given in Table 1.
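The capacitance comparison above is easy to verify numerically. The sketch below recomputes the wire and driver capacitances from the quoted figures and then, purely as an illustration, evaluates C·V²·f for one wire; the operating point of 1.0 V and 200 MHz and the assumption that the wire switches every cycle are hypothetical choices, not values from the interconnect model.

```python
WIRE_CAP_FF_PER_MM = 387.0   # semi-global wire, 0.13 um technology
BUS_LENGTH_MM = 10.0         # assumed chip (and bus) length
DRIVER_CAP_FF = 20.0         # driver at 10x minimum size
DRIVERS_PER_BUS = 8

wire_cap_ff = WIRE_CAP_FF_PER_MM * BUS_LENGTH_MM    # 3870 fF
driver_cap_ff = DRIVER_CAP_FF * DRIVERS_PER_BUS     # 160 fF
driver_share = driver_cap_ff / wire_cap_ff          # drivers relative to wire


def switching_power_mw(cap_ff, volts, freq_mhz):
    """Dynamic power C * V^2 * f for a capacitance that switches every cycle."""
    return cap_ff * 1e-15 * volts ** 2 * freq_mhz * 1e6 * 1e3


# Hypothetical operating point, used only to show the order of magnitude.
wire_power_mw = switching_power_mw(wire_cap_ff, volts=1.0, freq_mhz=200.0)

print(f"drivers add {driver_share:.1%} to the wire capacitance")
print(f"one wire at 1.0 V, 200 MHz: {wire_power_mw:.3f} mW")
```

The drivers contribute only a few percent of the total, which supports modeling the interconnect by wire capacitance alone.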

4.4. Leakage Power Estimation
Given that we are scaling the supply voltage aggressively, it is important to include the contribution of the leakage current in our estimations. Additionally, the fact that we trade area for power in Synchroscalar makes our leakage analysis even more critical. We use an analytical model to compute the leakage current $I_{leak}$:

$$I_{leak} = I_{on} \cdot e^{-V_{th} / (n \cdot V_T)}$$

where $I_{on}$ is the on current that depends on the process but is roughly equal to 0.3 µA per micron width, $V_T = kT/q$, which is roughly 26 millivolts at room temperature, $n$ depends on the device structure but is roughly between 1.3 and 1.5, and $V_{th}$ is the threshold voltage.
The leakage current increases with a decrease in threshold voltage and an increase in temperature. In order to model the leakage, we make the following assumptions:
1. All the devices are operating at the threshold voltage of 0.332V
2. The temperature is 80 degrees Celsius
3. A transistor density of 1 million transistors per mm² in 0.13 µm technology
4. Tiles that are not used in an application are assumed not to contribute to the leakage current
Using these numbers we calculate the leakage current, $I_{leak}$, which is 830 pA per transistor for a minimum sized transistor. This leakage correlates well with the number published by Intel on their 130 nm process, where leakage current varies from 0.65 nA per transistor to 32.5 nA per transistor depending on whether the threshold voltage of the transistor is high or low, respectively [41].
We estimate 1.8 million transistors per tile, so we believe the leakage current to be around 1.5 mA per tile assuming 830 pA of leakage per transistor. Of course, this estimate makes several assumptions, such as the average transistor width. While all results in this paper will assume 830 pA of leakage per transistor, we will present a leakage sensitivity analysis in the results section of this paper.
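Two of the numbers in this section can be checked directly: the thermal voltage kT/q and the per-tile leakage current. The sketch below uses the standard physical constants; the conversion of leakage current to power at a 1.0 V supply is an illustrative assumption, and the function names are hypothetical.

```python
BOLTZMANN_J_PER_K = 1.380649e-23
ELECTRON_CHARGE_C = 1.602176634e-19


def thermal_voltage(temp_kelvin):
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN_J_PER_K * temp_kelvin / ELECTRON_CHARGE_C


LEAK_PER_TRANSISTOR_A = 830e-12   # 830 pA per transistor, from the text
TRANSISTORS_PER_TILE = 1.8e6      # estimated transistor count per tile

tile_leakage_a = LEAK_PER_TRANSISTOR_A * TRANSISTORS_PER_TILE


def leakage_power_mw(current_a, supply_v):
    """Static power I * V in milliwatts."""
    return current_a * supply_v * 1e3


vt_room = thermal_voltage(300.0)              # room temperature
vt_hot = thermal_voltage(273.15 + 80.0)       # the 80 C assumption
tile_leak_mw = leakage_power_mw(tile_leakage_a, supply_v=1.0)  # hypothetical 1.0 V

print(f"kT/q: {vt_room * 1e3:.2f} mV at 300 K, {vt_hot * 1e3:.2f} mV at 80 C")
print(f"tile leakage: {tile_leakage_a * 1e3:.3f} mA, {tile_leak_mw:.3f} mW at 1.0 V")
```

The per-tile current comes out just under 1.5 mA, matching the estimate in the text.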

4.5. Cycle-Accurate Simulation
To obtain cycle-accurate performance measurements, we adapted an object-oriented variant of SimpleScalar to model the Synchroscalar architecture. The instruction set was retargeted to the Blackfin ISA [20] and communication mechanisms were added.
The applications were compiled down to assembly, and the inner loops hand-optimized. Inter-tile communication is hand-scheduled, and appropriate nops are inserted for synchronization between different clock domains.

4.6. Tile Area Estimation
The tile, the SIMD controller and the DOU were modeled in VHDL and synthesized using Synopsys Design Compiler for a 0.25 µm ASIC library and scaled to 0.13 µm. The various components of the tile and the SIMD controller are shown in Table 2. Memory, register file, and multipliers were not synthesized. Their area was estimated from [15], which has technology independent models for various components. We assume 32KB SRAM of data memory per tile and 2KB SRAM for instruction memory. The Rest field models the glue logic and the wiring overhead between the top-level blocks. The area of the tile is 1.82 mm², and the areas of the SIMD controller and the DOU, which are shared by the whole column of four tiles, are approximately 0.25 mm² and 0.0875 mm² respectively.

Table 2. Tile and DOU and SIMD Control Area Estimation

TILE COMPONENT                              Area (µm²)
2 40-bit ALUs                               48,000
1 40-bit Shifter                            500,000
2 40-bit Accumulators                       11,060
2 16x16 mult                                100,000
32 KB SRAM                                  5,570,560
32x32 Regfile 4 read and 2 write ports      650,000
Rest                                        393,000
Total                                       7,270,000

SIMD CONTROLLER and DOU                     Area (µm²)
DOU                                         350,000
2 KB Instruction SRAM                       350,000
Sequencer                                   225,000
LBANK                                       59,000
STACK32                                     180,000
Rest                                        140,000
Total                                       650,000

5. Results
Since all of our applications have set performance targets, our metric is the system power required to achieve those targets. Table 3 summarizes the primary success of Synchroscalar: software implementation of challenging signal processing applications with energy efficiency generally within 8-30X of ASIC solutions and 10-60X better than DSPs performing even reduced data rate versions of the applications. The remainder of this section describes the benefits of Synchroscalar's unique column-oriented voltage scaling, parallelization's influence on the system power, interconnect costs and leakage current.

5.1. Power Savings
The multiple column-oriented voltage domains yield advantages, as
shown by comparing the Single Voltage and Multiple Voltages columns
in Table 4 and in Figure 6. Multiple voltages allow power savings of
up to 83% for application components and up to 32% for full
applications.
We see the greatest power savings from voltage scaling in
applications where a few tiles must run at high frequency because
their algorithm cannot be parallelized across more tiles. The Stereo
Vision application is one such application. In applications without a
single computationally demanding algorithm with limited exploitable
parallelism, the power saved by voltage-frequency scaling is much
smaller. The wireless 802.11a application is one such instance. The
true benefits of voltage scaling are better demonstrated when
applications need to be composed. This can be seen in the data where
we have composed an AES-based message authentication code with the
802.11a receiver.
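
The savings percentages can be recomputed from the two power columns
of Table 4. The short Python sketch below does this for the
full-application totals; the power values are copied from Table 4,
while the helper name and the dictionary layout are ours.

```python
def savings_percent(multi_voltage_mw: float, single_voltage_mw: float) -> float:
    """Power saved by multiple voltage domains, relative to a single voltage."""
    return 100.0 * (1.0 - multi_voltage_mw / single_voltage_mw)


# (multiple-voltage power, single-voltage power) in mW, TOTAL rows of Table 4.
table4_totals = {
    "DDC": (2427.23, 2717.24),
    "Stereo Vision": (857.40, 1266.28),
    "802.11a": (3930.53, 4039.28),
    "MPEG4 CIF": (370.03, 397.68),
}

for app, (multi, single) in table4_totals.items():
    print(f"{app}: {savings_percent(multi, single):.1f}% saved")
```

Rounded to whole percentages, these reproduce the TOTAL entries in the
last column of Table 4, with Stereo Vision at the top of the range and
802.11a at the bottom.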

5.2. Effects of Parallelism
Figure 7 shows how much power is consumed for different levels of
parallelization of our applications. By allocating more parallel
resources we are able to run the applications at a lower frequency and
a lower voltage, thereby saving power. However, there are diminishing
returns for further parallelization, in the form of increased
communications requirements and leakage current. The 802.11a parallel
implementations shown in Figure 7 are a good example of how
diminishing returns from additional communications requirements
prevent us from further parallelizing the 802.11a application
efficiently. This communications overhead negatively impacts our
power efficiency and is represented by the dark portions of each
application's bars in Figure 7. Another source of diminishing returns
from further parallelization is the supply voltage floor. While tiles
could operate at supply voltages lower than 0.7 V, due to leakage and
noise constraints we chose 0.7 V as the minimum supply voltage
supported. Therefore, further parallelizing an algorithm that is
already running at the minimum supply voltage would not yield further
power savings, and would likely increase the power consumption due to
leakage and added communications cost.

Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA’04)
1063-6897/04 $ 20.00 © 2004 IEEE

Application    Platform                  Process  Area        Power     Voltage   Notes
                                         (µm)     (mm²)       (mW)      (V)
DDC            Synchroscalar             0.13     139.88      2427.23   0.7-1.3   Programmable, 64 MS/s
               Intel Xeon 2.8 GHz [19]   0.13     146         71000     1.45      Programmable, only 19.0 MS/s, 1/3 required rate
               Blackfin 600 MHz [2]      0.13     2.5         280       1.2       Programmable, only 112.6 kS/s, 1/500 required rate
               Graychip [40]             UNK      UNK         250       3.3       ASIC, 64 MS/s
Stereo Vision  Synchroscalar             0.13     52.89       857.40    1.2-1.5   Programmable, 10 f/s 256x256, stereo
               Intel Xeon 2.8 GHz [19]   0.13     146         71000     1.45      4.96 f/s, 1/3 required rate
               Blackfin 600 MHz [2]      0.13     2.5         280       1.2       Programmable, 1.46 f/s, 1/7 required rate
               FPGA [5]                  UNK      UNK         15K-25K   UNK       30 f/s 320x240, not stereo, no SVD, 1.75x rate
802.11a        Synchroscalar             0.13     74.05       3930.53   0.7-1.7   Programmable, 54 Mbps RX only
               Atheros [4]               0.25     34.68       203       2.5       ASIC
               Icefyre [32]              0.18     UNK         720       UNK       ASIC Chipset, including ADC
               IMEC [42]                 0.18     20.8        146       1.8       ASIC, area includes ADC/DAC
               NEC [37]                  0.18     119         474       1.5       ASIC, MAC+PHY layer, Core Power only
               D. Su [13]                0.25     22          121.5     2.7       PHY Layer only
               Blackfin 600 MHz [2]      0.13     2.5         280       1.2       Programmable, only 556 Kbps
MPEG4 QCIF     Synchroscalar             0.13     32.32       47.24     0.7       QCIF @ 30 f/s
               Amphion [1]               0.18     110k gates  15        UNK       Application-Specific Core, QCIF @ 15 f/s
               Philips [23]              0.18     20          30        1.8       ASIP, QCIF @ 15 f/s
               Blackfin 600 MHz [2]      0.13     2.5         280       1.2       Programmable, QCIF @ 15 f/s
MPEG4 CIF      Synchroscalar             0.13     31.74       370.03    1.1, 0.7  CIF @ 30 f/s
               Toshiba [3]               0.13     43          160       1.5       SOC, CIF @ 15 f/s

Table 3. Power Comparison of Synchroscalar with other platforms.

Application       Algorithm             No. of  Frequency  Voltage  Power     Power (mW)      % Power Savings
                                        Tiles   (MHz)      (V)      (mW)      Single Voltage  Due to Multiple Voltages
DDC               Digital Mixer         8       120        0.8      76.29     191.83          60%
                  CIC Integrator        8       200        1.0      241.54    403.58          40%
                  CIC Comb              2       40         0.7      18.86     18.86           66%
                  CFIR                  16      380        1.3      1071.22   1071.22         0%
                  PFIR                  16      370        1.3      1031.75   1031.75         0%
                  TOTAL                 50                          2427.23   2717.24         11%
Stereo Vision     SVD                   1       500        1.5      114.27    114.27          0%
                  PFE                   16      310        1.2      742.68    1151.55         36%
                  TOTAL                 17                          857.40    1266.28         32%
802.11a           FFT                   2       90         0.8      16.74     79.60           79%
                  De-mod/De-Interleave  1       60         0.7      4.71      28.45           83%
                  Viterbi ACS           16      540        1.7      3848.01   3848.01         0%
                  Viterbi Traceback     1       330        1.2      61.07     83.22           27%
                  TOTAL                 20                          3930.53   4039.28         3%
802.11a + AES     FFT                   2       90         0.8      14.80     49.36           75%
                  De-mod/De-Interleave  1       60         0.7      4.71      28.45           83%
                  Viterbi ACS           16      540        1.7      3848.01   3848.01         0%
                  Viterbi Traceback     1       330        1.2      61.07     83.22           27%
                  AES                   16      110        0.8      159.50    556.56          71%
                  TOTAL                 36                          2443.68   2866.14         11%
MPEG4 30f/s QCIF  Motion Estimation     8       70         0.7      42.53     42.53           0%
                  DCT, Quant, IQ, IDCT  2       60         0.7      4.71      4.71            0%
                  TOTAL                 10                          47.24     47.24           0%
MPEG4 30f/s CIF   Motion Estimation     8       280        1.1      351.21    351.21          0%
                  DCT, Quant, IQ, IDCT  8       60         0.7      18.82     46.48           60%
                  TOTAL                 16                          370.03    397.68          7%

Table 4. Power Results Summary for DDC, SV, 802.11a and MPEG4 on the Synchroscalar Processor

Figure 6. Power Consumption by Application
[Bar chart. Vertical axis: Power (mW), 0 to 5000. Bars: DDC, SV,
802.11a, MPEG4-CIF, MPEG4-QCIF, 802.11a + AES. Legend: Power, Voltage
Scaling; Additional Power w/ No Voltage Scaling.]

Figure 7. Power Consumption of Applications with varying
parallelization
[Stacked bar chart. Vertical axis: Power (W), 0 to 6. Bars: DDC, SV,
802.11a and MPEG4, each at several tile counts. Legend: Compute
Power; Interconnect + Leakage.]

Figure 8. Power Consumption of Viterbi ACS with varying bus widths
and parallelization
[Line plot. Horizontal axis: Area (mm2), 0 to 160. Vertical axis:
Power (W), 0 to 10. One curve each for 8, 16 and 32 tiles; points are
labeled by bus width: 32, 64, 128, 256, 512, 1024.]

5.3. Effects of Interconnect
In Figure 8 we have mapped how the power-area efficiency of the
Synchroscalar architecture scales for the Viterbi ACS with different
bus widths and different numbers of tiles. The Viterbi ACS is used
here because the Viterbi Decoder has the most demanding
communications requirements of any of the individual algorithms
tested on the Synchroscalar architecture. The three curves in
Figure 8 each represent a Viterbi ACS trellis being completed on 8,
16 and 32 tiles. Each curve is composed of power results for several
bus widths (32b, 64b, 128b, etc.). We can see from this figure that
increasing the bus width from 128 to 256 bits significantly improves
the power efficiency of Synchroscalar on the Viterbi Decoder for all
three implementations. However, another such doubling of the bus
width yields a smaller reduction in overall power consumption, as the
curves become less steep. This leads us to choose a 256 bit bus for
Synchroscalar. While it would be possible to attain lower power
consumption for the Viterbi ACS, it would come at a significant area
cost. This trade-off is made in light of the fact that the other
applications show little need for bandwidth above a 256 bit bus.

5.4. Leakage Sensitivity Analysis
Since Synchroscalar trades spatial parallelism for temporal
parallelism and the power dissipation due to leakage is proportional
to the spatial parallelism, leakage must be analyzed carefully.
Figures 9 and 10 show how different levels of parallelization of our
four applications perform under varying levels of leakage current. In
the figures, the horizontal axis shows the leakage current per
Synchroscalar tile, and the vertical axis shows the power consumption
of the applications in mW. The lowest leakage current (1.5 mA/tile)
corresponds to the leakage per tile as calculated in Section 4.4. The
largest leakage current graphed corresponds to the leakage current if
each tile used only low Vt transistors as published by Intel [41],
which we believe represents the highest leakage current that we would
consider in the development of Synchroscalar.
Of particular interest are the cross-over points between different
levels of parallelization of an application, as in Figure 10 for
MPEG4. Moving from eight to twelve tiles allows Synchroscalar to
reduce the overall power consumption through frequency reduction and
voltage scaling. These gains outweigh the leakage penalty and
communications overhead. However, when moving from twelve to 36
tiles, the structure that has the best overall power consumption
depends heavily on the leakage current. When tiles leak less than
14.8 mA (corresponding to 8.3 nA/transistor), the more highly
parallelized structure of 36 tiles is more efficient, but when tiles
leak more than 14.8 mA, the twelve tile structure is more efficient.
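
The cross-over behavior can be illustrated with a simple linear model
in which total power is a leakage-independent part plus tiles x
leakage current x supply voltage. The Python sketch below is only
that: the two configurations, their supply voltages and their
non-leakage power are hypothetical values chosen for illustration,
not measurements from this paper, and each configuration is assumed
to run at a single supply voltage.

```python
from dataclasses import dataclass


@dataclass
class Config:
    tiles: int
    supply_v: float        # single supply voltage (simplifying assumption)
    non_leakage_mw: float  # compute + interconnect power, leakage excluded

    def power_mw(self, leak_ma_per_tile: float) -> float:
        """Total power in mW for a given per-tile leakage current in mA."""
        return self.non_leakage_mw + self.tiles * leak_ma_per_tile * self.supply_v


def crossover_leak_ma(a: Config, b: Config) -> float:
    """Per-tile leakage (mA) at which configurations a and b use equal power."""
    slope_diff = b.tiles * b.supply_v - a.tiles * a.supply_v
    if slope_diff == 0:
        raise ValueError("configurations have identical leakage slopes")
    return (a.non_leakage_mw - b.non_leakage_mw) / slope_diff


# Hypothetical configurations, for illustration only.
few_tiles = Config(tiles=12, supply_v=1.0, non_leakage_mw=700.0)
many_tiles = Config(tiles=36, supply_v=0.7, non_leakage_mw=500.0)

x = crossover_leak_ma(few_tiles, many_tiles)
print(f"cross-over at {x:.1f} mA per tile")
```

Below the cross-over the wider configuration wins because its lower
voltage and frequency dominate; above it, the extra leaking tiles cost
more than the voltage scaling saves.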

Figure 9. Leakage sensitivity for DDC, 802.11a
[Line plot. Horizontal axis: mA Leakage per Tile, 1.5 to 59.3.
Vertical axis: Power (mW), 0 to 8000. Curves: 802.11a on 36, 20 and
12 tiles; DDC on 50, 26 and 14 tiles.]

Figure 10. Leakage sensitivity for MPEG4, SV
[Line plot. Horizontal axis: mA Leakage per Tile, 1.5 to 59.3.
Vertical axis: Power (mW), 0 to 3500. Curves: SV on 17, 9 and 5
tiles; MPEG4 on 36, 12 and 8 tiles.]

5.5. Discussion
How much should one parallelize the applications? The factors that
limit the amount of parallelization are the voltage floor (i.e., the
minimum possible voltage at which we could run a given tile), the
leakage current, and the structure of the application.
Additional parallel hardware helps to reduce power because we are
scaling the voltage aggressively as well. Once all tiles are
operating at the voltage floor, parallelizing further is not
advantageous, as further parallelization could increase the
communications overhead. It would then be the goal of a compilation
tool for Synchroscalar to help parallelize applications so that they
run as close to the voltage floor as possible.
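
The effect of the voltage floor can be sketched with a small Python
model. Everything in it beyond the quoted numbers is an assumption
made for illustration: the work divides perfectly across tiles with no
communications cost, per-tile dynamic power scales as frequency times
voltage squared from the 0.1 mW/MHz per-tile estimate used in this
section (taken here to hold at 1 V), and the supply voltage is the
lowest one among a subset of the frequency-voltage pairs in Table 4
that supports the required frequency.

```python
import bisect

# (frequency in MHz, supply in V): a subset of the pairs listed in Table 4.
OPERATING_POINTS = [(60, 0.7), (120, 0.8), (200, 1.0), (310, 1.2),
                    (380, 1.3), (500, 1.5), (540, 1.7)]
_FREQS = [f for f, _ in OPERATING_POINTS]

MW_PER_MHZ_AT_1V = 0.1   # per-tile estimate; the 1 V reference is assumed
LEAK_MA_PER_TILE = 1.5   # per-tile leakage from Section 4.4


def supply_voltage(freq_mhz: float) -> float:
    """Lowest listed supply voltage that supports freq_mhz (floor is 0.7 V)."""
    i = bisect.bisect_left(_FREQS, freq_mhz)
    if i == len(_FREQS):
        raise ValueError("frequency above the highest listed operating point")
    return OPERATING_POINTS[i][1]


def total_power_mw(work_mhz: float, tiles: int) -> float:
    """Dynamic plus leakage power when work_mhz is split evenly over tiles."""
    freq = work_mhz / tiles
    volts = supply_voltage(freq)
    dynamic = tiles * MW_PER_MHZ_AT_1V * freq * volts ** 2
    leakage = tiles * LEAK_MA_PER_TILE * volts
    return dynamic + leakage


for n in (1, 2, 4, 8, 16, 32):
    print(n, "tiles:", round(total_power_mw(480, n), 2), "mW")
```

In this model power falls as tiles are added until the per-tile
frequency reaches the 0.7 V operating point; past that, dynamic power
stays flat while leakage keeps growing with the tile count, so total
power rises again even before any communications overhead is counted.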
Our results are sensitive to the tile power estimate that we derived
in the methodology section. Since tile power is the dominant factor
in the total Synchroscalar power, our power results are roughly
linear in this estimate. Our qualitative results are valid for a
large range of realistic values of tile power. For instance, let us
compare the Synchroscalar power consumption with the power
consumption of the Blackfin DSP, both of which are in 0.13 µm
technology. Using the 0.1 mW/MHz estimate of power per tile for
Synchroscalar, we have shown that the DDC application runs at 2.43 W
for 64e6 samples/second, or 38.0 nJ/sample. The Blackfin DSP can run
at 280 mW for 113e3 samples/second at 600 MHz, or 2478 nJ/sample, a
factor of 60 difference. So clearly, even if our tile power estimate
is off by a factor of two, we are still demonstrating significant
power savings.
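
The comparison above is a division of power by sample rate, which
gives energy per sample. A small Python sketch of the arithmetic
follows; the power and sample-rate figures are the ones quoted in the
text, and the helper name is ours.

```python
def energy_per_sample_nj(power_w: float, samples_per_s: float) -> float:
    """Energy per sample in nanojoules: watts divided by samples per second."""
    return power_w / samples_per_s * 1e9


synchroscalar = energy_per_sample_nj(2.43, 64e6)    # DDC on Synchroscalar
blackfin = energy_per_sample_nj(0.280, 113e3)       # DDC on Blackfin, 600 MHz
ratio = blackfin / synchroscalar

print(f"Synchroscalar: {synchroscalar:.1f} nJ/sample")
print(f"Blackfin:      {blackfin:.0f} nJ/sample")
print(f"ratio:         {ratio:.0f}x")
# Doubling the Synchroscalar tile power estimate halves the ratio.
print(f"ratio if tile power is 2x higher: {ratio / 2:.0f}x")
```

Because the result is linear in the tile power estimate, a factor of
two error in that estimate still leaves a gap of well over an order of
magnitude.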

6. Related Work
The challenges presented by next generation applications in terms of
higher data rates, lower power requirements, shrinking time-to-market
requirements, and lower cost have resulted in tremendous interest in
embedded architectures and platforms for communication appliances in
the past few years. Researchers have approached the problem from
several different angles. The DSP architecture companies have
proposed highly parallel VLIW machines coupled with hardware
accelerators or co-processors for the computation-intensive
functions. The TI OMAP [18] is a good example of this category of
solutions. However, this approach is not power efficient: very high
clock frequencies would be needed to meet the throughput constraints
of the applications considered in this paper.
The SCORE project at UC Berkeley [11] uses an FPGA-like fabric with
specially tailored interconnect to exploit parallelism and improve
power efficiency. The PLEIADES project at UC Berkeley [44] proposes an
interconnection of a low power FPGA, datapath units, memory, and
processors, optimized for different application domains. The PLEIADES
researchers conclude that a hierarchical generalized mesh
interconnect structure [43] is most appropriate for their
architecture as it balances both the global and the local
interconnect. Our results are in agreement with this conclusion in
general, but given that we are targeting streaming computations, we
place greater emphasis on near-neighbor communication and have stayed
away from a general mesh. Other reconfigurable machines, such as
RAPID [12] and Piperench [33], illustrate interesting alternatives to
our choice of tiles, and may be amenable to our coarse-grained
voltage-frequency scaling techniques.
The adaptive SOC (aSOC) project at University of Massachusetts [22]
advocates an array of processors connected by a statically scheduled
communication fabric. They allow different processors to operate at
different clock frequencies and demonstrate significant power savings
on video processing benchmarks. The key differences between this work
and Synchroscalar are in the structure and contents of the tiles and
the memory architecture. In aSOC the tiles are hardwired functional
blocks such as Viterbi decoder, FFT, DCT etc., while in Synchroscalar
we assume programmable DSPs as the building blocks for the tiles. As a
result, the memory architecture of the system is radically different,
changing the data transfer and communication scheduling problem as
well. Intel's tile based architecture [14] shares the same objectives
as ours, but the interconnection network is very different. Also, the
tiles in [14] are much coarser grained, which means their power
consumption will likely be higher. The tile-based architecture from
University of Texas [29] resembles Synchroscalar structurally, but it
is designed for wire-delay scalability, not power efficiency given a
data rate constraint, which is the unique feature of our work.
Synchroscalar's use of spatial rather than temporal flexibility is
somewhat inspired by the MIT RAW project [39] [38], but our mechanisms
for ASIC-like performance are significantly different. The
Imagine [31] processor approaches a similar problem domain from a
stream-oriented perspective. The parallelization strategies used by
Imagine are complementary to the voltage scaling, data orchestration,
and multi-rate optimization used in Synchroscalar. The Smart
Memories [24] project is another tile-based architecture whose
reconfigurable tiles would also be complementary to Synchroscalar
mechanisms. While the SIMD components of our applications are
dominant, some phases could benefit from other models of computation.
Recently, there has been a revival of interest in the globally
asynchronous and locally synchronous (GALS) approach to processor
implementation [4], including the use of multiple clock domains and
multiple voltages [25] [34]. The key difference between the GALS
approach and the Synchroscalar approach is that Synchroscalar
restricts different columns to rationally related frequencies. This
avoids the use of asynchronous FIFOs with their synchronization
overhead. In this respect Synchroscalar is closer to Numesh [35] than
to the GALS approach.

7. Conclusion
The design principles of Synchroscalar – high parallelism, efficient
interconnect, low control overhead, and custom voltage/frequency
domains – will lead to a new set of embedded architectures with
efficiency approaching ASICs and with the programmability of DSPs.
Our study has shown a promising proof-of-concept through hand
optimization and code development. Future work will focus on a
software tool chain to automate and optimize application
parallelization and communication scheduling.

8. Acknowledgments
This work is supported by NSF ITR grants 0312837 and 0113418, and NSF
CAREER and UC Davis Chancellor's fellowship awards to Fred Chong.
Jedidiah Crandall's work was supported in part by a United States
Department of Education Government Assistance in Areas of National
Need (DOE-GAANN) grant P200A010306. Diana Franklin's faculty position
is funded by a Forbes Endowment. We would also like to thank Rajeevan
Amirtharajah, Bevan Baas, Dean Copsey and Matthew Farrens at
University of California at Davis, Mark Oskin at University of
Washington and Timothy Sherwood at University of California at Santa
Barbara.

References
[1] Amphion. Amphion CS6701 Hybrid MPEG-4 Video Encoder.
http://www.amphion.com/cs6701.html.
[2] Analog Devices Press Release: Blackfin.
http://www.electronicstalk.com/news/anc/anc199.html, March 2003.
[3] H. Arakida, M. Takahashi, Y. T. 1, T. Nishikawa, H. Yamamoto,
T. Fujiyoshi, Y. Kitasho, Y. Ueda, M. Watanabe, T. Fujita,
T. Terazawa, K. Ohmori, M. Koana, H. Nakamura, E. Watanabe, H. Ando,
T. Aikawa, and T. Furuyama. A 160mW, 80nA standby, MPEG-4 audiovisual
LSI with 160mb embedded DRAM and a 50GOPS adaptive post filter. In
International Solid-State Circuits Conference, Digest of Technical
Papers, Feb. 2003.
[4] B. M. Baas. A parallel programmable energy-efficient architecture
for computationally intensive DSP systems. In Conference Record of
the Thirty-Seventh Asilomar Conference on Signals, Systems, and
Computers, Nov. 2003.
[5] A. Benedetti. Personal communication, 2003.
[6] S. Bhattacharya, P. Murthy, and E. Lee. Software synthesis from
dataflow graphs, 1996.
[7] S. Bhattacharya, P. Murthy, and E. Lee. Synthesis of embedded
software from synchronous dataflow specifications. Journal of VLSI
Signal Processing, (21):151–166, June 1999.
[8] R. Brodersen. Low voltage design for portable systems. In
International Solid State Circuits Conference, Feb. 2002.
[9] J. Buck et al. Ptolemy: A framework for simulating and
prototyping heterogeneous systems. Int. Journal in Computer
Simulation, 4(2), 1994.
[10] C. Tomasi and T. Kanade. Detection and tracking of point
features, 1991.
[11] E. Caspi, M. Chu, R. Huang, J. Yeh, J. Wawrzynek, and A. DeHon.
Stream computations organized for reconfigurable execution (SCORE).
In FPL, pages 605–614, 2000.
[12] D. C. Cronquist, P. Franklin, S. Berg, and C. Ebeling.
Specifying and compiling applications for RaPiD. In K. L. Pocek and
J. Arnold, editors, IEEE Symposium on FPGAs for Custom Computing
Machines, pages 116–125, Los Alamitos, CA, 1998. IEEE Computer
Society Press.
[13] D. Su, M. Zargari, P. Yue, S. Rabii, D. Weber, B. Kaczynski,
S. Mehta, K. Singh, S. Mendis, and B. Wooley. A 5GHz CMOS transceiver
for IEEE 802.11a wireless LAN. In International Solid State Circuits
Conference, 2002.
[14] E. Tsui and K. Ganapathy. A new distributed DSP architecture
based on the Intel IXS for wireless client and infrastructure. In HOT
CHIPS 14, Aug. 2002.
[15] S. Gupta, S. Keckler, and D. Burger. Technology independent area
and delay estimates for microprocessor building blocks. Technical
Report TR2000-05, Department of Computer Science, University of
Texas, 2000.

[16] R. Ho, K. Mai, and M. Horowitz. The future of wires. In
Proceedings of the IEEE, volume 89, pages 490–504, April 2001.
[17] C. Hu. Berkeley predictive technology model.
http://www-device.eecs.berkeley.edu/~ptm/introduction.html.
[18] T. Instruments. Omap backgrounder.
http://www.ti.com/corp/docs/press/backgrounder/omap.shtml.
[19] Intel Xeon processor 2.8 GHz datasheet.
http://www.intel.com/design/xeon/datashts/298642.htm, March 2003.
[20] R. Kolagotla, J. Fridman, B. Aldrich, M. Hoffman, W. Anderson,
M. Allen, D. Witt, R. Dunton, and L. Booth. High Performance Dual-MAC
DSP Architecture. IEEE Signal Processing Magazine, July 2002.
[21] E. A. Lee and D. G. Messerschmitt. Static scheduling of
synchronous dataflow programs for digital signal processing. IEEE
Transactions on Computers, C-36(1), January 1987.
[22] J. Liang, S. Swaminathan, and R. Tessier. aSOC: A scalable,
single-chip communications architecture. In IEEE PACT, pages 37–46,
2000.
[23] R. P. Llopis, R. Sethuraman, C. A. P. H. Peters, S. Maul, and
M. Oosterhuis. A low-cost and low-power multi-standard video encoder.
In First IEEE/ACM/IFIP International Conference on Hardware/Software
Codesign and System Synthesis, 2003.
[24] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and
M. Horowitz. Smart memories: A modular reconfigurable architecture.
In 27th Annual International Symposium on Computer Architecture
(27th ISCA-2000), Computer Architecture News, volume 28, Vancouver,
British Columbia, Canada, June 2000. ACM SIGARCH / IEEE.
[25] D. Marculescu and A. Iyer. Power and performance evaluation of
globally asynchronous locally synchronous processors. In D. DeGroot,
editor, Proceedings of the 29th International Symposium on Computer
Architecture (ISCA-02), volume 30, 2 of Computer Architecture News,
pages 158–170, New York, May 25–29 2002. ACM Press.
[26] L. Matthies, B. Chen, and J. Petrescu. Stereo vision, residual
image processing and mars rover localization, 1997.
[27] I. Mavroidis. A low power 200 MHz multiported register file for
the vector IRAM chip. Report No. UCB/CSD-01-1145, MS Thesis,
University of California, Berkeley, 2001.
[28] M. Mori, B. Amrutur, K. Mai, M. Horowitz, I. Fukushi, T. Izawa,
and S. Mitarai. A 1V 0.9mW at 100 MHz 2kx16b SRAM utilizing a
half-swing pulsed-decoder and write-bus architecture in 0.25 µm
dual-Vt CMOS. In IEEE International Solid-State Conference, Digest of
Technical Papers, 1998.
[29] R. Nagarajan, K. Sankaralingam, D. Burger, and S. W. Keckler. A
design space evaluation of grid processor architectures. In
Proceedings of the 34th annual ACM/IEEE international symposium on
Microarchitecture, pages 40–51. IEEE Computer Society, 2001.
[30] M. Pilu. A direct method for stereo correspondence based on
singular value decomposition, 1997.

[31] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany,
A. Lopez-Lagunas, P. R. Mattson, and J. D. Owens. A
bandwidth-efficient architecture for media processing. In
International Symposium on Microarchitecture, pages 3–13, 1998.
[32] P. Ryan, T. Arivoli, L. de Souza, G. Foyster, R. Keaney,
T. McDermott, A. Moini, S. Al-Sarawi, J. O'Sullivan, U. Parker,
G. Smith, N. Weste, and G. Zyner. A single-chip PHY COFDM modem for
IEEE 802.11a with integrated ADCs and DACs. In International Solid
State Circuits Conference, 2002.
[33] H. Schmit et al. Pipeline reconfigurable FPGA. Journal of VLSI
Signal Processing Systems for Signal, Image and Video Technology,
24(2):129–146, March 2000.
[34] G. Semeraro et al. Energy-efficient processor design using
multiple clock domains with dynamic voltage and frequency scaling. In
HPCA, pages 29–42, 2002.
[35] D. Shoemaker, F. Honore, C. Metcalf, and S. Ward. Numesh: An
architecture optimized for scheduled communication. Journal of
Supercomputing, 10(3), 1996.
[36] W. Stechele. Algorithmic complexity, motion estimation and a
VLSI architecture for MPEG-4 core profile video codecs. In
International Symposium on VLSI Technology, Systems and Applications,
2001.
[37] M. Y. T. Kumura, M. Ikekawa and I. Kuroda. VLIW DSP for Mobile
Applications. IEEE Signal Processing Magazine, July 2002.
[38] M. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat,
B. Greenwald, H. Ho, m Lee, P. Johnson, W. Lee, A. Ma, A. Saraf,
M. Seneski, N. Shnidman, V. Frank, S. Amarasinghe, and A. Agarwal.
The Raw microprocessor: A computational fabric for software circuits
and general purpose programs, 2002.
[39] M. B. Taylor et al. The Raw microprocessor: A computational
fabric for software circuits and general-purpose programs. IEEE
Micro, 22(2):25–35, Mar./Apr. 2002.
[40] Texas Instruments GC4014 quad receiver chip datasheet.
http://www-s.ti.com/sc/psheets/slws132/slws132.pdf, April 1999.
[41] S. Thompson, M. Alavi, M. Hussein, P. Jacob, C. Kenyon, P. Moon,
M. Prince, S. Sivakumar, S. Tyagi, and M. Bohr. 130nm logic
technology featuring 60nm transistors, low-k dielectrics, and Cu
interconnects. Intel Technology Journal, 6(2):5–13, May 2002.
[42] W. Eberle, V. Derudder, L. Van der Perre, G. Vanwijnsberghe,
M. Vergara, L. Deneire, B. Gyselinckx, M. Engels, I. Bolsens, and
H. De Man. A digital 72Mb/s 64-QAM OFDM transceiver for 5GHz wireless
LAN in 0.18um CMOS. In International Solid State Circuits Conference,
2002.
[43] H. Zhang, M. W. V. George, and J. Rabaey. Interconnect
Architecture Exploration for Low Energy Reconfigurable Single-Chip
DSP. In Proceedings of the Workshop on VLSI, Orlando, Florida, April
1999.
[44] H. Zhang, V. Prabhu, V. George, M. Benes, A. Abnous, and
J. Rabaey. A 1-V heterogeneous reconfigurable DSP IC for wireless
baseband digital signal processing. IEEE Journal of Solid State
Circuits, 35:1697–1704, November 2000.


