Balancing Design Options with Sherpa
Timothy Sherwood, Mark Oskin†, Brad Calder‡
sherwood@cs.ucsb.edu oskin@cs.washington.edu calder@cs.ucsd.edu
 Department of Computer Science, University of California, Santa Barbara
†Department of Computer Science and Engineering, University of Washington
‡Department of Computer Science and Engineering, University of California, San Diego
ABSTRACT
Application speciﬁc processors oﬀer the potential of rapidly
designed logic speciﬁcally constructed to meet the perfor-
mance and area demands of the task at hand. Recently,
there have been several major projects that attempt to au-
tomate the process of transforming a predetermined proces-
sor conﬁguration into a low level description for fabrication.
These projects either leave the speciﬁcation of the proces-
sor to the designer, which can be a signiﬁcant engineering
burden, or handle it in a fully automated fashion, which
completely removes the designer from the loop.
In this paper we introduce a technique for guiding the de-
sign and optimization of application speciﬁc processors. The
goal of the Sherpa design framework is to automate certain
design tasks and provide early feedback to help the designer
navigate their way through the architecture design space.
Our approach is to decompose the overall problem of choos-
ing an optimal architecture into a set of sub-problems that
are, to the ﬁrst order, independent. For each sub-problem,
we create a model that relates performance to area. From
this, we build a constraint system that can be solved us-
ing integer-linear programming techniques, and arrive at an
ideal parameter selection for all architectural components.
Our approach only takes a few minutes to explore the de-
sign space allowing the designer or compiler to see the po-
tential beneﬁts of optimizations rapidly. We show that the
expected performance using our model correlates strongly
to detailed pipeline simulations, and present results show-
ing design tradeoﬀs for several diﬀerent benchmarks.
Categories and Subject Descriptors:
C.4 [PERFORMANCE OF SYSTEMS]: Modeling techniques
General Terms:
Design
Keywords:
design space exploration, application speciﬁc processor (ASIP),
area minimization, computer architecture, piecewise linear
model.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for proﬁt or commercial advantage and that copies
bear this notice and the full citation on the ﬁrst page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior speciﬁc
permission and/or a fee.
CASES’04, September 22–25, 2004, Washington, DC, USA.
Copyright 2004 ACM 1-58113-890-3/04/0009 ...$5.00.
1. INTRODUCTION
The continuing expansion of the embedded market has
created a signiﬁcant demand for low–cost high–performance
computing solutions that can be realized with a minimum
of engineering effort. While die area and power consumption
are concerns when developing a desktop processor, in
the embedded design world these factors have a signiﬁcant
impact on the cost–point and ultimate functionality of a
product. Application speciﬁc processors allow a designer
to meet these goals by specializing the features of the pro-
cessor to a particular embedded application. Architectural
choices such as cache sizes, issue width, and functional unit
mix can be traded oﬀ in the cost–energy–performance space
on a per–application basis. Customization may also include
extensions in the form of custom accelerators, for instance
the inclusion of encryption instructions [30]. Building em-
bedded processors tuned to a speciﬁc application enables a
design that minimizes area and power while still meeting
performance requirements. At the same time, because the
design is based on a carefully tested template, the design
time is reduced and reliability increased signiﬁcantly over a
pure ASIC based approach.
In the desktop processing realm one might take an “ev-
erything and the kitchen sink” approach, focusing only on
bottom–line performance. But in the ﬁercely competitive
embedded domain, system cost (area) is the dominating factor.
Smaller processors mean more integration, resulting in
lower packaging cost, and less area means more chips per wafer
and fewer defects per chip.
In this space, it is vital to determine how to eﬀectively
divide your area between the instruction cache, data cache,
branch predictor, additional functional units, and other com-
ponents. This is because cost is the single most important
attribute for embedded applications. When designing a cus-
tom processor the designer needs to ﬁnd the lowest cost im-
plementation that meets the target performance constraints,
and this problem by its very nature requires that all optimizations
be viewed in concert with all other parts of the
processor. Faced with this extremely important goal of min-
imizing area and power, and a combinatorial number of de-
sign options, how should an embedded processor engineer
proceed?
The goal of this paper is to introduce a technique that
assists this processor engineer (or automated compiler) in
the navigation of the vast application speciﬁc processor de-
sign space. It is not practical to evaluate every possible
alternative through detailed simulation. But by building a
set of estimators and solvers, a very good characterization
of the design space can be built and explored. We present
the Sherpa framework as a guide for this application spe-
ciﬁc processor exploration. Sherpa eﬃciently explores the
custom–processor design space because, while the space is
non–linear and full of local minima it is, to ﬁrst order, de-
composable. Custom processors are currently built from
in–order processing cores [13, 25, 4, 1], and the design fea-
tures of these, as we will show in Section 5, can be broken
apart and modeled as independent optimization problems.
Sherpa is a framework for exploring these sub–problems
and recombining them to ﬁnd ideal parameters for the usual
processor features such as caches and branch predictors.
It is fully general, however, and can also explore whether
or not to use custom instruction accelerators, or variations
on functional unit type, for instance whether to use a slow
or fast multiplier, or no hardware multiplier at all. More
signiﬁcantly, it optimizes these selections against all of the
other design possibilities.
The Sherpa framework begins with an application and
carves up the entire processor design space into a set of
loosely independent regions. Each region represents a logical
architectural component, such as the data cache hierarchy,
branch predictor, custom accelerator, etc. These regions are
then treated as separate sub–problems and are explored us-
ing data driven analytical models or high level simulation.
The result of these independent explorations is a set of char-
acterizing functions that are combined to form a model of
the entire design space suitable for constrained minimiza-
tion. This global model is then optimized through standard
integer-linear programming techniques to arrive at desirable
parameter selections for each architectural component.
While integer–linear programming has been widely used
in the ASIC design community to automate dataﬂow graph
partitioning, to our knowledge we are the ﬁrst to apply this
powerful optimization technique to total processor design
space exploration. The signiﬁcant research advance we con-
tribute in this paper is the characterization of the processor
design space into component models, the constraint formal-
ism to combine them, and the validation of these techniques
in this environment. The research in this paper bridges the
gap between optimizing small ASIC dataﬂow graphs, and
complete processor designs; it lays the foundation for opti-
mization of more complex architectures. Additional contri-
butions of this paper are:
• An automated design space surveyor for embedded in–
order custom processors.
• A methodology to approximate various design options
for processor features with piece–wise linear functions
relating area to performance.
• An integer–linear programming formulation for com-
bining these sub–problems using these piece–wise lin-
ear functions. This provides an approach for rapidly
ﬁnding ideal parameters for the processor architecture.
• An in–depth analysis of trading oﬀ performance and
area against each other. This leads to the result that
when holistically designing a processor, instead of fo-
cusing only on a single component, area and perfor-
mance optimizations can be used to the same eﬀect.
• A demonstration of using Sherpa to decide whether to
use a custom instruction accelerator. For our purposes
we illustrate this with the design choice of whether or
not to use a hardware multiplier, however, the tech-
nique generalizes to any custom instruction one is try-
ing to decide whether or not to include in a design,
and several diﬀerent alternatives can be evaluated at
the same time.
• A demonstration of how to use Sherpa to quickly eval-
uate the potential beneﬁt of an architectural optimiza-
tion. As a case study we examine the use of applica-
tion speciﬁc ﬁnite state machine predictors for branch
prediction [24].
2. DESIGN NAVIGATION
There are currently several commercial ventures that of-
fer customizable RISC processors [13, 25, 4] such as the
XTensa processor from Tensilica [13]. Tensilica provides
support for XTensa in terms of compilation, synthesis, and
veriﬁcation [21]. Embedded designers use these tools by pro-
viding a high level speciﬁcation for a processor core. Pro-
duced from them is a component ready for integration onto
a system-on-a-chip (SOC) product. While this is a significant
advancement for customized processors, the designer is
still for the most part unassisted in determining the optimal
processor speciﬁcation for their embedded application. This
is where our research is focused with the Sherpa framework.
In this section we discuss the typical design ﬂow used when
generating a customized processor. We then illustrate where
Sherpa integrates into this workﬂow to provide rapid feed-
back and design optimization.
2.1 Process Flow for Creating an Application
Speciﬁc Processor
We will discuss the workﬂow used to create an applica-
tion speciﬁc processor, and then how Sherpa enhances it.
For concreteness our discussion will focus on the process
used for systems similar to the XTensa processor from Ten-
silica, however, Sherpa can also be used by more automated
tool-chains such as those proposed for the PICO [1, 2] cus-
tom processor (discussed further in Section 6). The current
design process, as follows, involves several iterated steps to
ﬁnd an optimized processor conﬁguration:
1. The embedded application is provided in a high-level
source language, such as C or Java. The design team
determines performance targets, e.g. page renders per
minute, or frames per second.
2. The application is compiled to a binary, or in some
cases several diﬀerent binaries depending upon the ar-
chitectural options (available registers, instructions,
etc).
3. Proﬁles of the application on the target platform are
generated from simulation. These proﬁles focus the
design team's effort on application and architectural
bottlenecks. Architectural changes involve a lengthy,
manual enumeration of the potential options such as
cache and branch table sizes in an attempt to ﬁnd an
ideal conﬁguration.
4. The design team then creates a new binary compiled
to the potential architectural choices from the previous
step. The processor speciﬁcation is passed through the
backend toolset, which generates an RTL description
of the custom processor.
5. The binary is then re-simulated on the new custom
processor and the entire process is iterated to further
optimize the configuration.
[Figure 1 diagram: Application Source Code, Binary, Requirements, Sherpa Design Guide, Synthesis, Emulation; arrows labeled "costly feedback", "early feedback", and "early use of requirements".]
Figure 1: The Application Specific Processor design flow with the Sherpa design guide. Sherpa minimizes the number
of times the design must be iterated on. This allows the designer to spend their time optimizing the algorithms used
or the way they are coded instead of having to iterate over the design space.
The above process, when done manually, can be tedious
and time consuming. It leaves to the designer the choice of
what architectural options to modify by guessing the pos-
sible outcome prior to executing a lengthy tool-chain and
simulation step. Unfortunately, the entire space of design
options for a conﬁgurable core is vast in both performance
and cost. Furthermore, this search space is filled with local
optima, requiring something more clever than a straightforward
greedy approach.
2.2 Using Sherpa to Guide Design
While engineers can perform the above process by hand,
they need to ﬁrst spend signiﬁcant time becoming famil-
iar with the target application, and then must quantify the
behavior of the program through exhaustive experimentation
and analysis. We instead provide Sherpa as an automated
system to examine program behavior and suggest
near-optimal (and often optimal) design decisions, allowing
the designer to make informed decisions early and experiment
with different optimizations. Sherpa
also allows the designer to more easily deal with processor
design tradeoﬀs resulting from late changes to the applica-
tion.
Figure 1 shows the application speciﬁc design ﬂow when
using the Sherpa design guide. The goal of the Sherpa de-
sign navigator is to allow its user to make design tradeoﬀs
early in the exploration phase and quickly narrow in on the
bottlenecks of a system. Sherpa analyzes the application
code from the existing custom processor design workﬂow
and quickly searches the entire architectural design space
for global optima. The embedded product design team then
uses this information to narrow in on the ﬁnal custom pro-
cessor architecture.
Figure 2 shows the internal process of the Sherpa Design
Guide (the gray box) shown in Figure 1. Sherpa takes the in-
put application, and proﬁles it to generate a set of represen-
tative traces. Next, Sherpa decomposes the vast custom pro-
cessor design space into a set of independent sub-problems
(e.g., data cache, branch predictor, etc). The application
trace is then used to initiate these models. Finally, Sherpa
combines the models using an integer-linear program (ILP)
solver to locate a globally ideal custom processor architec-
ture conﬁguration. This conﬁguration, along with the design
tradeoﬀs Sherpa made, is then passed back to the designer
for examination. The designer can then focus attention on
the signiﬁcant architectural components and application sec-
tions.
3. IMPLEMENTATION
We now describe the implementation of a Design Navi-
gator, Sherpa, beginning with a description of our overall
approach and a description of the target architecture, and
ﬁnishing with a detailed evaluation of the algorithms used.
We use the design navigator to ﬁnd ideal sizings for instruc-
tion and data cache, branch prediction logic, and multiplier
settings.
3.1 Overall Approach
Our initial implementation is based on a constraint sys-
tem that expresses the entire range of architectural options.
The constraint system is structured so that an integer-linear
solver will ﬁnd an optimal solution from the possible design
options while maintaining hard constraints, such as perfor-
mance needs, but at the same time optimizing for design
goals (such as area).
Our approach initially treats each customizable compo-
nent separately. For each component, an area-performance
model is developed by simulating a variety of component
conﬁgurations with a representative instruction trace. This
collection of area-performance models is then combined with
integer-linear programming and solved to yield an ideal over-
all processor conﬁguration. Before discussing the details of
this process we wish to describe the customizable processor
we are targeting in more detail.
This paper optimizes a customizable architecture similar
to the in-order XTensa Processor [13]. The reason that we
have chosen to use this processor model is that the work
done by Tensilica is mature enough that they have a work-
able design path from processor speciﬁcation all the way
down to silicon. They are able to do this because they work
with the CAD tool designs further down the design chain
to ensure that the parameterized processors they build will
be able to be synthesized into an eﬃcient form. The Sherpa
guide can be used as an automated front end to the XTensa
design process to guide the selection of ideal parameters of
the diﬀerent components.
The parameterized processor we model is a single issue
in-order 32-bit RISC core with parameterizable data cache,
instruction cache, branch prediction, an optional fast integer
multiplier and core. There is a basic core size which con-
tains the primary functionality of the chip such as the execu-
tion box, instruction handling, bus controller, and memory
management unit. We assume that the area of this core is
unchanging and leave its optimization to future work. The
parameterizable parts of the processor are more auxiliary
components used to reduce delay.
[Figure 2 diagram: Automated Profiling, Sub-Problem Characterization, Data Linearization, Constraint System Building, Constraint Exploration, ILP Solver, Post-Solve Visualization, Specifications; inputs labeled Architectural Template, Sub-Problem Performance Mapping, From Binary, Requirements, and Feedback from Detailed Evaluation.]
Figure 2: Overview of the Sherpa design guide implementation. The system starts with the input binary and the set
of requirements. The binary is profiled and the different sub-problems are analyzed. In the data linearization stage
a set of piecewise linear functions are built for each sub-problem. These, along with an automatically constructed
performance model, are converted into a constraint system which is solved using integer-linear programming. A set of
different constraints can then be iterated over. The output of the solver can then be visualized for human examination
and converted into a set of processor specifications for use by synthesis.
3.2 Simple Linear Constraint System
Suppose we wish to customize only a single architecture
parameter; for example the instruction cache size. Then
we could simply model the performance of the processor
with various cache sizes and pick the smallest one that still
met our performance requirement. However, suppose in-
stead that we were asked to choose the ideal instruction and
data cache sizes that together made a system that met the
performance requirement. In our case we deﬁne “ideal” as
the conﬁguration with the lowest area. A naive approach is
to enumerate through and simulate all possible instruction
and data cache size combinations and choose the optimal
one. This exhaustive search will ﬁnd the optimal conﬁgura-
tion but is neither fast nor scalable.
Our approach is to divide this design space into two sub-
problems (instruction cache area versus delay and data cache
area versus delay), and to explore each sub-problem in-
dependently. The model presented in this paper approx-
imates the performance penalty from a sub-problem as a
ﬁxed length processor stall, where this stall aﬀects the en-
tire pipeline. For example, a cache miss can be represented
as a ﬁxed 50 cycle processor stall, and a branch mispredic-
tion as a 6 cycle stall. Even though all of the sub-problems
we examine are not completely independent (e.g., branch
prediction and instruction cache delays) the probability of
these delays overlapping for an in-order processor is small
for the programs we examined. We measured the amount
of dependence between the diﬀerent subproblems using de-
tailed cycle accurate simulation and found it to be negligible.
This issue is discussed further in Section 5.
By allowing ourselves the simplifying assumption that these
are independent sub-problems we can express the overall
performance of our example two parameter processor by
Equation 1:
$$T_{Total} = T_{Base} + T_{DataStall} + T_{InstrStall} \tag{1}$$
where $T_{Total}$ is the total execution time of the program,
$T_{Base}$ is the time to execute the program assuming a perfect
memory, and the stalls for both the data cache and instruction
cache are added for the corresponding configurations
being considered.
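As a minimal sketch of Equation 1, assume each stall source is charged as a fixed per-event penalty, using the 50-cycle cache miss and 6-cycle branch misprediction quoted above. The function name, the dictionary keys, and the event counts are hypothetical and are not taken from the Sherpa tooling.

```python
def total_cycles(base_cycles, event_counts, penalties):
    """Additive stall model: base time plus independent per-event stalls."""
    return base_cycles + sum(
        count * penalties[name] for name, count in event_counts.items()
    )


# Fixed stall lengths quoted in the text; the event counts are made up.
PENALTIES = {"dcache_miss": 50, "icache_miss": 50, "branch_mispredict": 6}
counts = {"dcache_miss": 1_000, "icache_miss": 200, "branch_mispredict": 5_000}

print(total_cycles(1_000_000, counts, PENALTIES))
```

Because the terms are simply summed, each sub-problem can be characterized on its own and added back in later, which is the property the rest of the formulation relies on.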
In order to compare the diﬀerent conﬁgurations, we need
to create a function that relates area to performance. Fig-
ure 3 shows the performance versus area for two cache con-
ﬁgurations chosen for the data cache. For purposes of il-
lustration, we assume this is a simple linear function; the
next section introduces more realistic models. Given the
two points in Figure 3, the contribution to processor stall
from the cache can be expressed in terms of its structure
via the equation $T_{Stall} = k \times Area$, where $k$ is a constant.
Performing this operation on both sub-problems creates two
independent linear functions that relate performance to area
that must be combined to find the ideal balance between the
two. Rewriting Equation 1 we get:
$$T_{Total} = T_{Base} + (k_{Data} \times Area_{Data}) + (k_{Instr} \times Area_{Instr}) \tag{2}$$
Using this equation we set a performance constraint $T_{Total} \le T_{max}$,
and solve for the two unknowns: $Area_{Data}$ and $Area_{Instr}$.
The goal is to maintain this performance constraint while
minimizing the area, which is expressed as:
$$\text{Minimize: } Area_{Total} = Area_{Data} + Area_{Instr} \tag{3}$$
The result of this formulation is a system of linear equations
and an optimization criterion. This can be effectively
solved using linear programming tools. Commercial solvers
such as CPLEX are available, but we found the freely available
lpsolve tool to be sufficient [6].
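The two-cache linear program of Equations 2 and 3 is small enough to solve by hand, which the following sketch does by enumerating the vertices of the feasible region rather than calling lpsolve. It assumes each cache's stall is interpolated linearly between its two design points (an intercept form, slightly more general than $T_{Stall} = k \times Area$) and that the two points have different stall values. All names and numbers are illustrative.

```python
def cheapest_two_cache_design(t_base, t_max, data, instr):
    """Minimize area_d + area_i subject to the time bound of Equation 2.

    Each cache is (area_min, area_max, stall_at_min, stall_at_max); its
    stall is interpolated linearly between those two design points.
    """
    def slope(c):
        return (c[3] - c[2]) / (c[1] - c[0])

    def stall(c, area):
        return c[2] + slope(c) * (area - c[0])

    def area_for(c, target_stall):
        return c[0] + (target_stall - c[2]) / slope(c)

    budget = t_max - t_base  # total stall the design may spend
    # An LP optimum lies at a vertex: a box corner, or a point where the
    # time constraint crosses one of the area bounds.
    candidates = [(ad, ai) for ad in data[:2] for ai in instr[:2]]
    for ad in data[:2]:
        candidates.append((ad, area_for(instr, budget - stall(data, ad))))
    for ai in instr[:2]:
        candidates.append((area_for(data, budget - stall(instr, ai)), ai))

    eps = 1e-9
    feasible = [
        (ad, ai) for ad, ai in candidates
        if data[0] - eps <= ad <= data[1] + eps
        and instr[0] - eps <= ai <= instr[1] + eps
        and stall(data, ad) + stall(instr, ai) <= budget + eps
    ]
    return min(feasible, key=sum) if feasible else None


# Hypothetical caches: (area_min, area_max, stall_at_min, stall_at_max).
design = cheapest_two_cache_design(1000, 1600, (100, 500, 400, 100),
                                   (100, 300, 300, 100))
print(design)
```

Vertex enumeration only stays practical for a handful of variables; a general LP solver takes over once more sub-problems and segments are added.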
3.3 Data Linearization
The previous example of ﬁnding the ideal balance between
a set of instruction and data cache conﬁgurations relied upon
the assumption that there is a strict linear equation relating
cache area to performance. Clearly this assumption
is simplistic and we must support a more complex
area-performance model.
Figure 3 (x-axis: Penalty, y-axis: Area; points MINra, ra, MAXra): This graph shows two design
points MINra and MAXra, and the creation of a simple linear function
between these two points relating performance to area.
Figure 4 (x-axis: Delay, y-axis: Area; segments Ra, Rb, Rc): A graphical representation
of the Piecewise Linear Model which is used to quantify the
area-performance tradeoffs for each sub-problem. The Pareto optimal
design points are estimated by intersecting line segments for which the
slope and intercept are known.
Figure 5 (x-axis: R from 0 to 1, y-axis: Area; Segment a, Segment b, Segment c): Translated
piece-wise linear model. Note that for each line segment we see the
same height on the y-axis but now the lines are stretched to be over
the range of [0,1].
For this we turn to a piece-wise linear approximation of
the actual function. To form this approximation we begin by
identifying Pareto-optimal design points. A point p is said
to be Pareto-optimal (on the axes in Figure 4) if there is no
point p' such that p' requires less area and incurs less stall
time. Intuitively, if a point is not Pareto-optimal then there
is another design point that will achieve as good or better
performance with less area. We use these Pareto-optimal
points to help form a piece-wise linear approximation of the
area-performance function.
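The Pareto filter described above can be sketched as a single pass over the design points sorted by area. The function name is hypothetical, and ties are resolved by treating a point as dominated when another point is no worse in both area and stall and differs in at least one, which is an assumption slightly stricter than the wording in the text.

```python
def pareto_front(points):
    """Return the (area, stall) points not dominated by any other point."""
    front = []
    best_stall = float("inf")
    for area, stall in sorted(points):  # ascending area, then ascending stall
        if stall < best_stall:          # beats every design that is no larger
            front.append((area, stall))
            best_stall = stall
    return front


# Made-up design points as (area, stall) pairs.
samples = [(4, 3), (1, 10), (2, 9), (2, 8), (3, 8), (5, 4)]
print(pareto_front(samples))
```

The result comes back sorted by increasing area and decreasing stall, which is the order the greedy segment construction needs.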
Starting from the Pareto-optimal design points we greed-
ily construct straight line segments. A point is added to
a line segment if the correlation between that line segment
and all the points that approximate it does not drop below a
threshold. The correlation coeﬃcient, r, provides the extent
to which the bounding Pareto-optimal points being consid-
ered lie on a straight line. For the results we present in this
paper we ensure that r is always greater than 0.98. When
the correlation coefficient drops below this threshold, a new
line segment is started. To ensure that the line segments
always intersect, the last data point included in a given seg-
ment is also included in the next segment. By modeling the
data as a series of line segments, the optimization process is
greatly simpliﬁed as the resulting problem is almost exactly
linear programming.
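A minimal sketch of the greedy segment construction follows, assuming the input is a Pareto front already sorted by area and using the magnitude of Pearson's r (the slope of an area-delay front is negative). The helper names are hypothetical, and the handling of degenerate runs is an assumption.

```python
from math import sqrt


def pearson_r(points):
    """Pearson correlation coefficient of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if sxx == 0 or syy == 0:
        return 1.0  # assumption: treat a flat or vertical run as a perfect line
    return sxy / sqrt(sxx * syy)


def greedy_segments(front, threshold=0.98):
    """Split a non-empty, area-sorted Pareto front into nearly straight runs."""
    segments = [[front[0]]]
    for point in front[1:]:
        current = segments[-1]
        if len(current) >= 2 and abs(pearson_r(current + [point])) <= threshold:
            # Start a new segment that shares the joint point with the old one.
            segments.append([current[-1], point])
        else:
            current.append(point)
    return segments


# Made-up front with a sharp knee at area 3.
front = [(0, 100), (1, 80), (2, 60), (3, 40), (4, 39), (5, 38), (6, 37)]
for segment in greedy_segments(front):
    print(segment)
```

Sharing the last point of one run with the next is what keeps consecutive segments connected, as the text requires.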
The result of piecewise linearization can be seen in Fig-
ure 4. The points that are Pareto-optimal are connected
by line segments that capture the overall area-performance
trend. In this example three line segments Ra, Rb and Rc
make up the entire piece-wise linear approximation. Our
goal now is going to be to modify the constraint system to
use these far more precise sub-problem models, which we
describe next.
3.4 Final Constraint System
A mixed integer-linear program is similar to a linear pro-
gram except that some variables are optionally constrained
to be whole numbers. Solving a mixed integer-linear pro-
gram as opposed to a linear program is a straightforward
process. We refer the interested reader to [17] for details on
the internals of this algorithm.
To create our mixed integer-linear constraint system, the
ﬁrst step is to include each piece-wise linear function for
each sub-problem into the constraint system. This is per-
formed by breaking up each line-segment of the piece-wise
linear approximations into separate components. These line
segments are then translated to have a uniform length and
starting point along the horizontal axis. Figure 5 depicts the
same piece-wise linear function from Figure 4 after translation
is performed. For each piece, we generate linear functions
for area and delay whose independent variable has
been normalized to lie between zero and one.
The linear function for area is calculated as
$Area_x = BaseArea_x + SlopeArea_x \cdot R_x$, and the function for
delay is calculated as $Time_x = BaseTime_x + SlopeTime_x \cdot R_x$.
In both of these equations, $R_x$ is a design point with
a value between zero and one. $BaseArea_x$ and $BaseTime_x$
are the start area and time for each linear piece in Figure 4.
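The translation step can be sketched as follows, assuming each line segment is represented by its two end points given as (area, time) pairs. The function and key names are hypothetical; only the base and slope quantities come from the text.

```python
def normalize_segment(start, end):
    """Map a segment between two (area, time) points onto R in [0, 1]."""
    (area0, time0), (area1, time1) = start, end
    return {
        "base_area": area0, "slope_area": area1 - area0,
        "base_time": time0, "slope_time": time1 - time0,
    }


def evaluate(segment, r):
    """Area and time of the design point at position r along the segment."""
    area = segment["base_area"] + segment["slope_area"] * r
    time = segment["base_time"] + segment["slope_time"] * r
    return area, time


# Hypothetical segment from (area 100, time 400) to (area 200, time 200).
seg = normalize_segment((100, 400), (200, 200))
print(evaluate(seg, 0.5))
```

With this parameterization R = 0 is the start of the segment and R = 1 is its end, so every segment shares the same bounds on its continuous variable.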
Once distilled into separate line-segments we construct
a mixed integer-linear program where only a single line-
segment (from the piece-wise linear function) is selected for
each sub-problem. The single line-segment is chosen using
integer programming. To illustrate, Figure 6 depicts a new
constraint system with these constraints. This constraint
system works by choosing a tuple $(iR_x, R_x)$ with Equation 4.
$iR_x$ is an integer value of either zero or one. $R_x$ is a linear
value between zero and one. This tuple selects which line
segment to use with the integer variable $iR_x$, and the design
point within that line-segment to examine using the variable
$R_x$. Equation 4 cannot be implemented directly, and
we describe next how to construct it using integer and linear
constraints.
Equations 5 and 6 perform a line segment selection for
the sub-problem by ensuring that only one line segment is
selected, since each $iR_x$ can only have a value of 0 or 1.
Equation 5 ensures that only one $iR_x$ will be equal to one,
and all the rest will be equal to zero. Since we normalized
the linear functions to represent the area and delay using a
variable between 0 and 1, we are able to reuse the variable
$iR_x$ to constrain the selection of the design point $R_x$. This
is shown in Equations 7 - 9, where the design points along
the chosen line segment are examined.
$$(iR_x, R_x) = ChooseOne(\{(iR_1, R_1), (iR_2, R_2), \ldots, (iR_n, R_n)\}) \tag{4}$$
$$iR_1 + iR_2 + \ldots + iR_n = 1 \tag{5}$$
$$iR_1 \ge 0,\ iR_2 \ge 0,\ \ldots,\ iR_n \ge 0 \tag{6}$$
$$0 \le R_1 \le iR_1 \tag{7}$$
$$0 \le R_2 \le iR_2 \tag{8}$$
$$\ldots$$
$$0 \le R_n \le iR_n \tag{9}$$
$$0 \le Time_{required} - Time_{perfect} - (BaseTime_1 \cdot iR_1 + SlopeTime_1 \cdot R_1) - (BaseTime_2 \cdot iR_2 + SlopeTime_2 \cdot R_2) - \ldots - (BaseTime_n \cdot iR_n + SlopeTime_n \cdot R_n) \tag{10}$$
$$\text{Minimize: } (BaseArea_1 \cdot iR_1 + SlopeArea_1 \cdot R_1) + (BaseArea_2 \cdot iR_2 + SlopeArea_2 \cdot R_2) + \ldots + (BaseArea_n \cdot iR_n + SlopeArea_n \cdot R_n) \tag{11}$$
Figure 6: Constraint system for a piece-wise linear
sub-problem model.
Once a given tuple is selected, the translated $Base_{area,time}$
and $Slope_{area,time}$ values are used to determine performance
impact and area cost of the choice. The modiﬁed perfor-
mance bound is illustrated in Equation 10 and the new min-
imization goal is shown in Equation 11.
Figure 6 shows only the constraint system for one sub-
problem. To generate the overall constraint system to ex-
amine the trade-oﬀs between multiple sub-problems, each
sub-problem has its own constraints to choose an $iR_x$ and
$R_x$ particular to that sub-problem using equations similar to
those shown in Equations 5 - 9. The time delay for each
sub-problem is subtracted from Equation 10, and the area cost
for each sub-problem is added into Equation 11. Then the
overall system is run through the constraint solver to minimize
the area, while meeting the specified $Time_{required}$.
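To make the combined constraint system concrete, the sketch below solves a tiny instance without an ILP solver. It enumerates one segment per sub-problem (the integer choice of Equations 5 and 6) and then sets the continuous R values greedily, which is optimal for this single-constraint case as long as every segment has positive area slope and negative time slope. This is a stand-in under those stated assumptions, not the lpsolve formulation used in the paper, and all names and numbers are hypothetical.

```python
from itertools import product


def solve_design(subproblems, time_required, time_perfect, eps=1e-9):
    """Pick one segment per sub-problem and a point R on it, minimizing area.

    subproblems: one list of segments per sub-problem; each segment is
    (base_area, slope_area, base_time, slope_time) with slope_area > 0
    and slope_time < 0.
    Returns (total_area, [(segment_index, R), ...]) or None if infeasible.
    """
    budget = time_required - time_perfect  # stall allowed by Equation 10
    best = None
    index_ranges = [range(len(segs)) for segs in subproblems]
    for picks in product(*index_ranges):   # integer part: one iR = 1 each
        chosen = [subproblems[k][i] for k, i in enumerate(picks)]
        area = sum(seg[0] for seg in chosen)   # every R starts at 0
        stall = sum(seg[2] for seg in chosen)
        r = [0.0] * len(chosen)
        # Linear part: buy stall reduction where it costs the least area.
        by_value = sorted(range(len(chosen)),
                          key=lambda k: chosen[k][3] / chosen[k][1])
        for k in by_value:
            if stall <= budget + eps:
                break
            _, slope_area, _, slope_time = chosen[k]
            r[k] = min(1.0, (stall - budget) / -slope_time)
            area += slope_area * r[k]
            stall += slope_time * r[k]
        if stall <= budget + eps and (best is None or area < best[0]):
            best = (area, list(zip(picks, r)))
    return best


# Hypothetical two-segment models for a data cache and an instruction cache.
dcache = [(100, 100, 400, -200), (200, 300, 200, -100)]
icache = [(100, 100, 300, -150), (200, 100, 150, -50)]
print(solve_design([dcache, icache], time_required=1450, time_perfect=1000))
```

The enumeration grows with the product of the segment counts, so a real ILP solver is still the appropriate tool once more sub-problems are included.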
4. SUB-PROBLEM CHARACTERIZATION
In the previous sections we presented our general method-
ology of partitioning the complete application speciﬁc pro-
cessor design space into a set of loosely independent sub-
problems. In this section we discuss the sub-problems we
explored and how we modeled them. The sub-problems of
the design space that we chose to explore in depth are the
instruction and data caches, the branch predictor conﬁgu-
ration, and whether or not to include a fast hardware mul-
tiplier. The general methodology for constructing each of
these models is to explore a sampling of points through di-
rect simulation and then construct piece-wise linear models
from the Pareto optimal design points.
For a given subproblem (e.g., branch predictor) we can
trade oﬀ multiple designs (gshare, bimodal, 2-level) by plot-
ting them all on the same Pareto graph. The piece-wise
linear model built from that graph represents the ideal de-
sign to use for a given performance/area point. The same
approach would be taken if we were to have the option of
diﬀerent types of accelerators for a given instruction or sets
of instructions.
In this section we describe how we estimate the area for
each design point considered as well as how to calculate
the performance penalty for that conﬁguration for the trace
being examined.
4.1 Data and Instruction Cache
We start with an examination of the caches in our ar-
chitecture. As will be seen later in Section 5.2, the caches
are the dominant conﬁgurable area of an application speciﬁc
processor. Because of this it is imperative that the models
correctly capture the area-performance tradeoﬀs since they
are the ﬁrst order determinant of the overall accuracy of our
technique.
Cache design is a well-studied problem in computer archi-
tecture, and we wish to build upon past work in this area
where appropriate. Mulder [20] presents a validated area
model for caches and register files. Reinman and Jouppi [22]
present CACTI, a very detailed cache design space walker based on
the work presented in [29]. CACTI, when given a set of
cache parameters, such as associativity and cache size, will
find the delay-optimal cache array partitioning scheme (i.e.,
the fastest physical device layout for the given cache parameters).
For our research we use this design tool to aggressively
optimize the partition of the cache.
One concern with using CACTI to optimize for area-optimal
caches is that its internal optimization goal is programmed
to ﬁnd delay-optimal designs. To verify this would not skew
our results we modiﬁed the CACTI tool to produce Fig-
ure 7, which depicts the delay versus area tradeoﬀ of over
9,000 cache partitions that are examined for a single cache
parameter set. Out of all of these conﬁgurations, only three
Pareto optimal ones were generated. This shows that the
most performance-eﬃcient partition is extremely close to the
most area-eﬃcient. Figure 8 depicts this area variation for a
range of diﬀerent cache conﬁgurations. We found at most a
4% diﬀerence in area from the most area-eﬃcient to the most
performance-efficient design. Therefore, we chose to use the
delay-optimized cache partitioning from CACTI for our
results. Fortunately, the cache partitionings are not program
dependent, so they can be calculated once and stored in
a database for later use by any optimization process.
To capture the program-dependent performance eﬀect of
the diﬀerent cache options, we need to estimate the number
of cache hits and misses the target applications will have for
each cache conﬁguration. This can be done through simu-
lation or analytical modeling [3]. Eﬃcient cache simulators
have been proposed to simulate many diﬀerent conﬁgura-
tions at once [28]. For our work we found that we could
quickly simulate all of the reasonable power of two cache
sizes (256 bytes to 64K), associativity (direct mapped to 4-
way and fully associative) and line size (8 bytes to 64 bytes)
options. We then find the area of each of these configurations
using the previously mentioned area model and CACTI
partitioner.
[Figure 7 plot: Area (rbe) versus Delay (ps), with the Pareto
Optimal Points marked.]
Figure 7: Graph of area-delay tradeoffs for a single cache
configuration. This graph was generated by modifying the
CACTI tool to output info on all configurations considered.
For the example cache parameter set shown, over 9000
different configurations were considered but only three
Pareto optimal configurations were found.
Size/Assoc   1         2         4         8
1K           ±2.68%    ±3.71%    ±0.23%    ±0.00%
4K           ±3.30%    ±0.49%    ±0.25%    ±0.12%
16K          ±3.56%    ±1.32%    ±0.79%    ±0.13%
64K          ±2.14%    ±0.65%    ±0.66%    ±0.40%
Figure 8: Percent area variation between the most area
efficient design examined by the CACTI design tool, and most
delay efficient design. Results are shown for a 1K to 64K
cache from direct mapped to 8-way associative. In CACTI, the
cache partitioning scheme with the minimum delay is returned.
A variation of less than 4% is seen in area when optimizing
a cache configuration for area versus performance.
Figure 9 shows the area versus miss rate for the instruction
cache for the applications we explored, and Figure 10 shows
the area versus miss rate for the data cache. From these
points, we created the piece-wise linear models as described
in the prior section.
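As an illustration of how the Pareto optimal points might be extracted from the simulated (area, miss rate) pairs, the following Python sketch performs a single sorted sweep. It is a minimal sketch under our own assumptions and not the implementation used in this work; the name pareto_front and the sample values in the usage lines are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (area, miss_rate); lower is better for both


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats in both area and miss rate."""
    front: List[Point] = []
    best_miss = float("inf")
    # Sort by area (ties broken by miss rate) and sweep from small to large.
    for area, miss in sorted(set(points)):
        # Keep a design only if it improves on every smaller design's miss rate.
        if miss < best_miss:
            front.append((area, miss))
            best_miss = miss
    return front


# Hypothetical design points, not measurements from this paper.
samples = [(1, 0.20), (2, 0.10), (2, 0.30), (3, 0.10), (4, 0.05)]
print(pareto_front(samples))  # [(1, 0.2), (2, 0.1), (4, 0.05)]
```

Consecutive points of the returned front would then serve as the breakpoints of a piece-wise linear model.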
Since our application speciﬁc processor conﬁguration is for
embedded in-order processors, we can easily map the cache
miss penalty to performance for our constraint system. We
assume that each cache miss results in a penalty of 50 cycles,
and the processor stalls on a cache miss. We also examined
hit under miss data caches, but that optimization provided
only marginal performance gains. Therefore, in this paper
we only present results for caches that stall on misses.
4.2 Branch Predictors
To explore the custom design space for branch prediction,
we select from the following well-known branch
predictors: (1) a table of 2-bit counters, (2) a global
correlation predictor, and (3) a meta predictor that uses both
local and global correlation information.
We examine the branch prediction miss rate of these
different predictors for a variety of table sizes. Figure 11
shows the area versus miss rate for the Pareto optimal points
in the different programs examined. For all of the programs
examined, the most area efficient branch predictor design was
the table of 2-bit counters, similar to the one used in the
XScale embedded processor [18]. To achieve better
performance, the other predictor methods need significant
resources. For example, the Pareto curve for adpcm chooses
per-branch 2-bit counters for the low area points and a
meta predictor for higher area: the per-branch 2-bit counters
give the lowest miss rate for an area of less than 128K square
features, and a meta predictor gives the best miss rate for an
area greater than 128K. Crossing this design boundary
increases the area above 128K, but reduces the
misprediction rate from 45% down to 26%.
The misprediction rate measured here is used to calculate
the performance penalty for our in-order processor model.
By assuming a 6 cycle stall for each branch misprediction,
and knowing how many branches we mispredict, we can esti-
mate the total number of stall cycles incurred by the branch
predictor.
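The penalty arithmetic for the caches and the branch predictor can be sketched in a few lines of Python. This is an illustrative sketch only: the 50 cycle and 6 cycle penalties come from the text, while the function names, the event counts in the usage lines, and the assumption of a base CPI of 1 for the single-issue in-order core are ours.

```python
CACHE_MISS_PENALTY = 50    # cycles per cache miss (Section 4.1)
BRANCH_MISS_PENALTY = 6    # cycles per branch misprediction (Section 4.2)


def stall_cycles(icache_misses: int, dcache_misses: int, mispredictions: int) -> int:
    """Total stall cycles, assuming the processor stalls on every event."""
    cache = CACHE_MISS_PENALTY * (icache_misses + dcache_misses)
    branch = BRANCH_MISS_PENALTY * mispredictions
    return cache + branch


def normalized_time(instructions: int, stalls: int) -> float:
    """Execution time relative to a stall-free run (assumes a base CPI of 1)."""
    return (instructions + stalls) / instructions


# Hypothetical event counts for a one-million-instruction trace.
stalls = stall_cycles(icache_misses=100, dcache_misses=200, mispredictions=1000)
print(stalls)                              # 21000
print(normalized_time(1_000_000, stalls))  # 1.021
```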
4.3 Multiplier Unit
The last configurable option we examined was whether to
include a large, fast hardware multiplier in place of
the much smaller and slower iterative multiplier. This is
modeled by counting the number of multiply instructions
executed in each program. We chose this configurable option
to show that our design optimization technique can also
cleanly handle binary decisions such as whether or not to
include a specialized functional unit. Although the choice of
whether to use a fast multiplier is binary, in an overall
context it is complicated by the fact that its usefulness
must be traded off against other, non-binary decisions. Thus,
the actual decision is extremely difficult to make with
conventional techniques.
We model a hardware multiplier in the following way: if
the fast hardware multiplier is used, then a multiply takes
2 cycles, with the area cost of 3 million square features.
The area of the multiplier is derived from [23]. If the fast
hardware multiplier is not used, the multiply is performed
with a software routine, estimated to take 250 cycles to ex-
ecute [26].
4.4 Core Area
Finally, we are left with the area of the actual execution
core. Inside the core is the data path of the machine along
with all of the control. This includes the instruction fetch
control, the instruction memory management unit, the data
memory management unit, the bus controller, and the basic
functional units. While the control and data paths could be
further optimized, we leave these subproblems for future
work. The area for this remaining non-configurable set of
functionality is derived from [23] and is estimated to be
21 million square features.
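To make the multiplier and core models concrete, the following Python sketch combines the constants given above. The area and latency figures are taken from the text; the function names are hypothetical, and treating the slow multiply path as adding no configurable area beyond the core is our simplifying assumption.

```python
CORE_AREA = 21_000_000       # square features, fixed core (Section 4.4)
FAST_MULT_AREA = 3_000_000   # square features, fast multiplier (Section 4.3)
FAST_MULT_CYCLES = 2         # cycles per multiply with the fast unit
SLOW_MULT_CYCLES = 250       # cycles per multiply without it


def multiply_cycles(num_multiplies: int, use_fast: bool) -> int:
    """Cycles spent on multiplies for one program under the binary choice."""
    per_multiply = FAST_MULT_CYCLES if use_fast else SLOW_MULT_CYCLES
    return num_multiplies * per_multiply


def total_area(icache: int, dcache: int, bpred: int, use_fast: bool) -> int:
    """Core area plus the chosen configurable components, in square features.

    Assumption: the slow multiply path costs no extra configurable area.
    """
    return CORE_AREA + icache + dcache + bpred + (FAST_MULT_AREA if use_fast else 0)


# Hypothetical component areas; the multiply count is also illustrative.
print(multiply_cycles(1000, use_fast=True))    # 2000
print(multiply_cycles(1000, use_fast=False))   # 250000
print(total_area(4_000_000, 8_000_000, 128_000, use_fast=True))  # 36128000
```

Under these constants each multiply executed saves 248 cycles when the fast unit is present, which is the benefit the optimizer must weigh against 3 million square features.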
5. PUTTING IT ALL TOGETHER
Now that we have seen how the constraint system is built
and solved in Section 3 and how the sub-problems are formu-
lated and characterized in Section 4, we are ready to examine
how they work together to explore the design space.
[Figure 9 plot: Area (Square Features), 2M to 128M, versus
Miss-Rate, 0% to 15%, for adpcm, bzip, gs, gzip, and jpeg.]
Figure 9: Instruction cache miss rate versus area tradeoff
for different cache sizes, associativities, and line sizes.
Only the Pareto Optimal points are shown.
[Figure 10 plot: Area (Square Features), 2M to 256M, versus
Miss-Rate, 0% to 30%, for adpcm, bzip, gs, gzip, and jpeg.]
Figure 10: Data cache miss rate versus area tradeoff for
different cache sizes, associativities, and line sizes. Only
the Pareto Optimal points are shown.
[Figure 11 plot: Area (Square Features), 128K to 4096K, versus
Miss-Rate, 0% to 40%, for adpcm, bzip, gs, gzip, and jpeg.]
Figure 11: Branch misprediction rate versus area tradeoff.
Several branch prediction architectures were examined and the
Pareto optimal area/miss rate points are plotted for each
program.
[Figure 12 plot: Actual CPI, 0 to 6, versus Estimated CPI,
0 to 6.]
Figure 12: Verification of Performance Estimator. The
estimated CPI time as calculated by our formulation
is plotted as a function of the performance determined
through detailed cycle level simulation.
5.1 Running the Overall Solution
When running the system on a given program we must go
through a few steps. The ﬁrst step is sub-problem charac-
terization as presented in the previous section. During this
step we perform simulation of the diﬀerent design options
for each sub-problem. From this we generate the Pareto-
optimal points and piece-wise linear approximation to be
used for the next stage. The simulation time here varies
with both what is being investigated and the program itself.
We use fairly brute-force methods for examining each space,
and this process is the most time-consuming of all the
operations.
We built our simulation infrastructure on top of ATOM [27]
because it provides both high performance and ease of use
for a RISC architecture similar to those supported by conﬁg-
urable cores. The design point enumerations for each of the
subproblems are calculated at run-time. For the programs
we examined it takes anywhere from a couple of seconds
to 10 minutes per simulation. However, this process can
easily be sped up through the use of smarter simulation
algorithms [28] or analytical modeling.
In addition to generating the estimated performances of
the diﬀerent sub-problems, we also need to estimate the total
area consumed by each design point. This is done once for
each given subproblem, and the results can be used for each
of the diﬀerent programs. This step takes less than 1 minute
to run.
The final step of the optimization process is the actual
construction of the constraint system. This step is very
fast, which allows many different design parameters to be
examined quickly. To create the charts of Pareto-optimal
solutions shown in Figures 13 and 17, we generated almost 50
design solutions in less than 8 seconds. Each one of these
design solutions is the ideal combination of the different
sub-problems and represents the best design point out of many
hundreds of millions. Because of this fast turnaround, many
new design choices can be evaluated in real time. For
example, one could ask: what would the area impact be of
reducing the cache miss penalty by 30%? This question
could be answered without re-running any simulations.
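The reason such a question needs no new simulation is that the per-configuration event counts are already known, so only the penalty weights change. The Python sketch below illustrates this with a plain exhaustive search over a tiny design space. It is not the integer-linear program used by Sherpa: the name min_area_design, the option lists, and all of the numbers are hypothetical and chosen only for illustration.

```python
from itertools import product


def min_area_design(options, base_cycles, max_cycles, penalties):
    """Return (area, choice) for the smallest design meeting the cycle budget.

    options:   {component: [(area, event_count), ...]} from prior simulation
    penalties: {component: stall cycles charged per event}
    Returns None when no combination satisfies the constraint.
    """
    names = sorted(options)
    best = None
    for combo in product(*(options[name] for name in names)):
        area = sum(a for a, _ in combo)
        stalls = sum(penalties[n] * events for n, (_, events) in zip(names, combo))
        if base_cycles + stalls <= max_cycles and (best is None or area < best[0]):
            best = (area, dict(zip(names, combo)))
    return best


# Hypothetical (area, event count) options for two sub-problems.
opts = {
    "dcache": [(2000, 4000), (4000, 1500), (8000, 500)],
    "bpred": [(100, 3000), (500, 1000)],
}
# Same event counts, two different miss penalties (35 is 30% below 50).
print(min_area_design(opts, 100_000, 160_000, {"dcache": 50, "bpred": 6})[0])  # 8100
print(min_area_design(opts, 100_000, 160_000, {"dcache": 35, "bpred": 6})[0])  # 4500
```

Exhaustive enumeration is only workable for toy spaces like this one; the piece-wise linear formulation is what makes spaces of hundreds of millions of points tractable.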
5.2 Results
We used ATOM [27] to profile the applications, generate
traces and simulate cache and branch models. All
applications were compiled on a DEC Alpha AXP-21264
architecture using the DEC C compiler under the OSF/1 V4
operating system using full optimizations (-O4). We chose a
set of five benchmarks which could have applications in an
embedded environment. The application ijpeg is from the
SPEC95 benchmark suite, and gzip and bzip are from the
SPEC2000 benchmark suite. We also include two programs
from the MediaBench application suite: adpcm is a speech
compression and decompression program, and gs is an
implementation of a postscript interpreter.
The ﬁrst step in testing our system is to verify that the
performance we estimate using our combination of piece-
wise linear subproblems accurately matches with a detailed
pipeline simulation of the hardware. We compare the es-
timated CPI gathered for several design points chosen at
random, with the CPI of a detailed cycle-level simulation
using SimpleScalar 3.0a [7]. We assumed in our constraint
model and the simulator that all cache and branch
structures could be accessed in one cycle. The processor was
single issue, and used the same latencies for the different
sub-problems as described in Section 4.
[Figure 13 plot: Area (Square Features), 0M to 150M, versus
Normalized Execution Time, 1 to 3, for adpcm, bzip, gs, gzip,
and jpeg.]
Figure 13: Total Area (in millions of square features) as
a function of Normalized Execution Time for the different
programs. The areas shown represent the minimum
total size of the optimized core that can achieve the
performance shown. The execution time is normalized to
the performance of the base executable executed with
no stalls of any kind.
Fixed     Vary      Avg StdDev   Max StdDev
D-cache   I-cache   0.000000%    0.000000%
D-cache   Branch    0.000000%    0.000000%
I-cache   D-cache   0.004775%    0.046030%
I-cache   Branch    0.001985%    0.038971%
Branch    D-cache   0.007916%    0.080000%
Branch    I-cache   0.006044%    0.095263%
Table 1: Quantifying the independence of subproblems.
The first column in the table, labeled Fixed, shows the
subproblem under examination, and the second column,
labeled Vary, is the subproblem to which independence is
being evaluated. For example, the first row in the table
compares the independence of the data cache miss rate
from the instruction cache miss rate. For the first row
this is determined by holding the data cache size
constant, varying the size of the instruction cache, and
analyzing the change in data cache miss rate. This is then
repeated for several sizes of data cache. The numbers
reported are the average standard deviation and
maximum standard deviation in miss rates across all fixed
sizes evaluated.
Figure 12 shows the results of this veriﬁcation procedure.
In this graph we have plotted the estimated performance
of the processor from our constraint system against that
found through detailed cycle-level simulation. The results
show a strong correlation between the estimated and actual
values, with a correlation coefficient of r = 0.99829 for over
200 configurations chosen at random for evaluation across
the different benchmarks. This shows that our performance
estimation is accurate; otherwise, the points in Figure 12
would show trends that do not follow the diagonal line drawn
on the graph.
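For readers who wish to repeat this check on their own data, a correlation coefficient of this kind can be computed as in the Python sketch below. We assume here that r denotes the standard Pearson coefficient; the function name and the CPI values in the usage lines are hypothetical and are not the measurements reported above.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical estimated and simulated CPI values for five design points.
estimated_cpi = [1.2, 1.8, 2.5, 3.1, 4.4]
simulated_cpi = [1.3, 1.7, 2.6, 3.0, 4.5]
print(round(pearson_r(estimated_cpi, simulated_cpi), 3))
```

A value near 1 indicates that the points lie close to a straight line, although it does not by itself show that the line is the diagonal; plotting the points as in Figure 12 covers that second check.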
Table 1 shows our results to further verify the indepen-
dence of subproblems. The ﬁrst column in the table, labeled
Fixed, shows the subproblem under examination, and the
second column, labeled Vary, is the subproblem to which in-
dependence is being evaluated. For example, the ﬁrst row in
the table compares the independence of the data cache miss
rate from the instruction cache miss rate. For the ﬁrst row
this is determined by holding the data cache size constant,
varying the size of the instruction cache, and analyzing the
change in data cache miss rate. This is then repeated for
several sizes of data cache. The numbers reported are the
average standard deviation and maximum standard devia-
tion in miss rates across all ﬁxed sizes evaluated. The results
show that the subproblems are independent with less than
a 0.1% standard deviation at maximum when holding one
component constant and varying the other component. It
can therefore be concluded from this table that the data
cache miss rate does not change significantly as the
instruction cache miss rate varies for the cache sizes and
processor model we are exploring.
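The two statistics in Table 1 can be reproduced from raw miss rates along the following lines. This Python sketch is illustrative only: the function name and sample data are hypothetical, and the use of the population standard deviation is our assumption, since the text does not state which estimator was used.

```python
import statistics
from typing import Dict, List, Tuple


def independence_stats(miss_rates: Dict[str, List[float]]) -> Tuple[float, float]:
    """Average and maximum standard deviation across fixed configurations.

    miss_rates maps each fixed configuration (e.g. a data cache size) to the
    miss rates, in percent, seen while the other subproblem is varied.
    """
    deviations = [statistics.pstdev(rates) for rates in miss_rates.values()]
    return statistics.mean(deviations), max(deviations)


# Hypothetical data: data cache miss rate (%) for two fixed data cache
# sizes, each measured under several instruction cache sizes.
observed = {
    "4K": [5.0, 5.0, 5.0],   # unaffected by the varied subproblem
    "16K": [2.0, 4.0],       # deliberately large spread for illustration
}
avg_dev, max_dev = independence_stats(observed)
print(avg_dev, max_dev)  # 0.5 1.0
```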
5.2.1 Optimization of Tradeoffs
We now examine the results of running the entire system
on several programs. Figure 13 shows the resulting mini-
mized area for each program for several performance design
points. The area is shown in square features. The perfor-
mance constraint is shown normalized to the base executable
executed with no stalls for any components. For example,
in order for gs to be executed in 1.5x the amount of time
it would take to execute it with no stalls of any sort, the
processor will need an area of 60M (square features).
As expected, relaxing the performance constraint reduces
the area needed by the processor. All of the programs
exhibit a very strong elbow in their performance-area plot.
Hence, the major working sets of the program are captured
using a small amount of resources, and significantly more
area must be added on top in order to improve performance
further.
A closer look at an individual application shows how the
constraint system trades oﬀ performance and area between
all of the diﬀerent sub-problems we examined. Figure 14
shows this breakdown for the program ijpeg. When the
performance is tightly constrained, a great deal of area is
devoted to the caches and branch predictor, and the special
integer multiply functional unit is included. As we
relax the performance constraint, the optimizer balances
the different sub-problems in such a way that the
performance criterion is met and the area is minimized.
In relaxing the performance constraint, the ﬁrst area com-
ponent to be reduced is the large branch predictor, while
both the data and instruction cache are signiﬁcantly reduced
in size. Over the performance range between 1.14 and 1.44
the instruction cache size does not change signiﬁcantly and
the data cache is reduced.
It is interesting to note that analysis of our design tradeoff
graphs can be used to find the working set sizes for the
caches. Between 1.44 and 1.50 the performance constraint is
relaxed to the point that the instruction cache can be shrunk
to the next working set size. However to make up for this
increase in instruction miss rate, the data cache actually has
to increase in size. This sort of tradeoﬀ can again be seen
between 1.56 and 1.62 where the multiplier is cut out and
the size of the data cache is again increased. It is these sorts
of complex tradeoﬀs that our system has been designed to
optimize.
[Figure 14 plot: stacked Area (millions of square features),
0 to 100, versus Normalized Execution Time, 1.12 to 2.00;
stacks: Branch Pred, Instr Cache, Multiplier, Data Cache,
Core.]
Figure 14: A zoom in of the tradeoffs made for the application jpeg.
The axes are the same as shown in Figure 13 but here only a small
portion of the total constraints examined are shown. In addition,
the area of each component has been broken out into a separate
stack to show the relative importance of each sub-problem for this
application.
[Figure 15 plot: Percent Area Reduction, 0% to 30%, versus
Normalized Execution Time, 1.25 to 1.93.]
Figure 15: Percent reduction in area for gs when adding the
option to use a custom finite state machine predictor for
branch prediction into our Sherpa constraint system.
[Figure 16 plot: Area, 20M to 40M, versus Normalized Execution
Time, 1 to 4; annotations: Improvement at Same Performance,
Optimization, Resultant Area, Performance.]
Figure 16: This is an example to show the relationship
between performance optimizations and their effect on
area. Performance optimizations can reduce execution
time and at the same time can significantly reduce the
amount of area needed to meet a given performance
constraint.
5.2.2 Optimization Evaluation
The goal of Sherpa is not only to help guide the
design of the architectural components already discussed in
this paper, but also to quickly evaluate new architecture
optimizations. This allows designers to see the
significance of their ideas to the overall processor design
when trying to minimize area. To show this, we
investigated the application specific optimization by
Sherwood and Calder [24] for automatically creating custom
finite state machine (FSM) predictors for individual branches.
This optimization augments the branch predictor with
custom branch prediction entries. Each entry is hard
coded with a given branch address and a custom finite
state machine tailored to that branch. We add the custom
FSM predictor to the Sherpa constraint system as
another possible option for the branch predictor. Figure 15
shows the percent area reduction as a function of
performance when custom finite state machines are used
for branch prediction for gs.
The reason for the large reduction in area for the lower
execution times comes from the increase in performance
provided by the custom FSM branch predictor. Figure 16
graphically shows the reason for this by showing the rela-
tionship between performance optimizations and area. Opti-
mizations that reduce execution time can at the same time
signiﬁcantly reduce the amount of area needed to meet a
given performance constraint.
5.2.3 Performance/Area Versus Total Area Tradeoffs
Another sort of tradeoﬀ analysis that can be provided by
our system is computational eﬃciency analysis. Figure 17
shows a plot of performance per-unit area versus total area.
In this plot we can see that the best performance eﬃciency
for most of the programs we examined lies between 30 and
40 million square features, less than twice the size of the core
functionality of the chip. While performance per unit area
may not be useful to an embedded designer seeking to meet
a given performance constraint, it is a result that would
be very useful to chip multiprocessor designers in helping
to maximize the total processing power of a given chip
real-estate. Application targeted chip multiprocessors, such
as the Piranha project [5], which was targeted at transaction
processing, seek to get higher performance by clustering
together many simple processors with high computational
efficiency onto a single chip. Our constraint system could
be used to guide the design of the various subsystems on the
chip to maximize the total performance.
[Figure 17 plot: Performance/Area, 0.00 to 0.03, versus Area,
50 to 150, for adpcm, bzip, gs, gzip, and jpeg.]
Figure 17: Performance per unit area of the different
benchmarks shown as function of total chip area. While
performance per unit area may not be useful to an
embedded designer, seeking to meet some performance
constraint, it is a result that could be easily used by chip
multiprocessor designers to help maximize the total
processing power of a given chip real-estate.
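The efficiency metric behind this kind of analysis can be sketched as follows in Python. We assume, for illustration, that performance is the reciprocal of normalized execution time and that area is expressed in millions of square features; the function name and the design points are hypothetical and not taken from our results.

```python
from typing import List, Tuple


def best_efficiency(designs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (area, performance per unit area) for the most efficient design.

    designs holds (area, normalized_execution_time) pairs, one per optimized
    design point; performance is taken to be 1 / normalized_execution_time.
    """
    scored = [(area, (1.0 / time) / area) for area, time in designs]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical Pareto-optimal designs: (area in millions, normalized time).
points = [(25, 3.0), (35, 1.3), (60, 1.1), (120, 1.02)]
area, efficiency = best_efficiency(points)
print(area, round(efficiency, 4))  # 35 0.022
```

A chip multiprocessor designer would replicate the design with the highest score, since it delivers the most throughput per unit of silicon.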
6. RELATED WORK
There is a great deal of related work in hardware/software
co-design, application speciﬁc processor generators, and an-
alytic performance estimation, but we have found none that
address the problem we are attempting to solve. While a full
listing of related work is not feasible in a paper of this length,
we do point out some representative papers from each area
and describe how our research ﬁts into the broader context.
6.1 Conﬁgurable Cores
There are currently many configurable core designs that
can be customized or otherwise modified to application
specific requirements. The Xtensa processor [13] allows the
designer to input specifications that they want in a processor,
and the Xtensa tools generate the needed components ready
for integration into a system-on-chip. Another example of
a conﬁgurable RISC core is the LX4380 [25] processor core.
LX4380 supports adding up to 12 new instructions to the
core as needed by a designer. The ARC processor core [4]
is similar in design and intent to the LX4380. The tools
available for these three conﬁgurable cores do not assist the
designer in ﬁnding good design points for their application.
We provide an automated system that will examine pro-
gram behavior and suggest near-optimal design decisions,
allowing the designer to make informed decisions early and
experiment with diﬀerent optimizations.
6.2 Design Exploration for Application Specific Processors
An approach being examined for designing customized
processors is to automatically generate trial processors (with
a machine description language [14, 19], a GUI [15], or a tem-
plate as described above) for a speciﬁc application, and then
examine their performance and feed this evaluation back
into the automated processor generator.
The Lx [9, 10, 8] customizable VLIW architecture builds
custom hardware for loop nests in applications. They have
a clustered VLIW architecture that can be customized to a
given application domain with a semi-automated optimiza-
tion step. The system starts by generating a trial architec-
ture, for which it generates compilers and other tools. Key
parts of this application code are then compiled and mea-
sured, and are used to generate a new trial architecture, and
the process is repeated.
The PICO project [1, 2] uses a fully automated approach
to creating a customized processor. Their system starts by
designing a set of Pareto optimal memory hierarchies, pro-
cessors, and custom hardware accelerators. Then, diﬀerent
combinations of these points are tested by assembling the
processor and simulating its performance in detail. The best
combinations are noted and combinations that are similar to
these are evaluated. They report that the system takes from
10 hours to several days.
The iterative approaches for Lx and PICO yield very good
results, but take too long to be used interactively. We pro-
pose to use a high level modeling of performance ﬁrst via
Sherpa to allow the designer to very rapidly examine com-
plex design tradeoﬀs in real time, and then feed this infor-
mation into the iterative processor generation performed for
Lx and PICO.
The Platune System [12, 11] has goals similar to ours, in
that it too seeks to reduce the number of processor conﬁgu-
rations that must be explored. The Platune System makes
use of the fact that while some processor parameters are
coupled, there are many others that are independent or are
just one-way dependent. They then use this information,
entered by hand by the user, to prune their design space,
skipping over points that the user knows will not be Pareto
optimal. The BUILDABONG project performs these steps
automatically and intelligently prunes those configurations
that it believes will not be Pareto optimal [16]. Sherpa, on
the other hand, uses linear programming to model many
discrete points that are near linear as a line segment which
allows for the points to be searched analytically and thus
at very high speed. We believe that a combination of these
techniques will perform even better than any technique by
itself.
7. SUMMARY
Our approach to exploring this design space is to ﬁrst di-
vide it into separate regions mapped by largely independent
parameters. These regions form individual sub-problems
that can be eﬃciently and accurately modeled using data-
driven analytical techniques. We formulate the results of
these models into a single large constraint-based integer-
linear program that can be eﬃciently solved using conven-
tional mathematical techniques.
We showed that the results from solving the constraint
system lined up very closely with the actual performance
numbers obtained through detailed simulation. We exam-
ined the tradeoﬀs that were made between the diﬀerent com-
ponents included in optimization and saw how they can play
oﬀ each other in a complex manner. We further demon-
strated how the Sherpa framework can be applied to rapidly
evaluate potential optimizations and their impact on both
the performance and area of the processor.
A goal of this research is to provide a more scientiﬁcally
sound methodology for evaluating novel architectural tech-
niques in the embedded space. The traditional method of
proposing a new technique, and then examining the perfor-
mance enhancement relative to a single baseline data-point
is not very meaningful to overall system design in cost
sensitive domains. An architect needs a way of visualizing the
way new techniques trade off against a range of potential
design options. The methodology presented in this paper
provides the ability to perform this range analysis. It places
architectural changes in a global setting, allowing the
architect to gain a full picture of their usefulness.
To our knowledge we are the ﬁrst to apply integer-linear
programming to total-processor design space exploration.
The signiﬁcant research advance we contribute in this pa-
per is the characterization of the processor design space into
piece-wise linear models, the constraint formalism to com-
bine them, our formation of piece-wise linear functions, and
the validation of these techniques in this environment. The
research in this paper lays the foundation for optimization
of more complex architectures.
Acknowledgments
We would like to thank the anonymous reviewers and Mike
Swift for providing useful comments on this paper. This
work was funded in part by NSF grant No. CCR-0311712
and an equipment grant from Intel.
8. REFERENCES
[1] S. Abraham, B. Rau, R. Schreiber, G. Snider, and
M. Schlansker. Eﬃcient design space exploration in pico. In
Proc. of International Conference on Compilers,
Architecture, and Synthesis for Embedded Systems, pages
71–79, San Jose, California, November 2000.
[2] S. G. Abraham and S. A. Mahlke. Automatic and efficient
evaluation of memory hierarchies for embedded systems. In
32nd International Symposium on Microarchitecture, 1999.
[3] Anant Agarwal, Mark Horowitz, and John Hennessy. An
analytical cache model. ACM Transactions on Computer
Systems, 7(2):184–215, 1989.
[4] ARC. Whitepaper: Customizing a soft microprocessor core.
http://www.arccores.com, 2001.
[5] L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk,
S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese.
Piranha: A scalable architecture based on single-chip
multiprocessing. In 27th Annual International Symposium
on Computer Architecture, Vancouver, Canada, June 2000.
[6] M. Berkelaar. lp solve: a mixed integer linear program
solver. ftp://ftp.es.ele.tue.nl/pub/lp solve, September 1997.
[7] D. C. Burger and T. M. Austin. The simplescalar tool set,
version 2.0. Technical Report CS-TR-97-1342, University of
Wisconsin, Madison, June 1997.
[8] Paolo Faraboschi, Geoﬀrey Brown, Joseph A. Fisher,
Giuseppe Desoli, and Fred Homewood. Lx: a technology
platform for customizable vliw embedded processing. In
27th Annual International Symposium on Computer
Architecture, pages 203–213, 2000.
[9] J. A. Fisher, P. Faraboschi, and G. Desoli. Custom-ﬁt
processors: Letting applications deﬁne architectures. In
29th International Symposium on Microarchitecture, pages
324–335, December 1996.
[10] Joseph A. Fisher. Customized instruction-sets for
embedded processors. In Proceedings of the Design
Automation Conference, 1999, pages 253–257, 1999.
[11] T. Givargis and F. Vahid. Platune: A tuning framework for
system-on-a-chip platforms. IEEE Transactions on
Computer Aided Design, 21(11), November 2002.
[12] T. Givargis, F. Vahid, and J. Henkel. System-level
exploration for pareto-optimal conﬁgurations in
parameterized systems-on-a-chip. In International
Conference on Computer Aided Design, November 2001.
[13] R. E. Gonzalez. Xtensa: A conﬁgurable and extensible
processor. IEEE Micro, 20(2):60–70, March-April 2000.
[14] G. Hadjiyiannis, P. Russo, and S. Devadas. A methodology
for accurate performance evaluation in architecture
exploration. In Proceedings of the Design Automation
Conference (DAC 99), pages 927–932, 1999.
[15] M. Itoh, S. Higaki, J. Sato, A. Shiomi, Y. Takeuchi,
A. Kitajima, and M. Imai. Eﬀectiveness of the asip design
system peas-iii in design of pipelined processors. In
Proceedings of Asia and South Pacific Design Automation
Conference 2001 (ASP–DAC 2001), pages 649–654, 2001.
[16] Dirk Fischer, Jurgen Teich, Michael Thies, and Ralph Weper.
Efficient architecture/compiler co-exploration for asips.
In International Conference on Compilers, Architectures,
and Synthesis for Embedded Systems, October 2001.
[17] E. Lawler and D. Wood. Branch and bound methods: A
survey. Operations Research, 14(291):699–719, 1966.
[18] S. Leibson. Xscale (strongarm-2) muscles in.
Microprocessor Report, September 2000.
[19] T. Morimoto, K. Saito, H. Nakamura, T. Boku, and
K. Nakazawa. Advanced processor design using hardware
description language aidl. In Proceedings of Asia and
South Pacific Design Automation Conference 1997
(ASP–DAC 1997), pages 387–390, 1997.
[20] J. Mulder. An area model for on-chip memories and its
applications. IEEE Journal of Solid States Circuits,
26(2):98–106, February 1991.
[21] M. Puig-Medina, G. Ezer, and P. Konas. Veriﬁcation of
conﬁgurable processor cores. In Proceedings of the Design
Automation Conference (DAC2000), pages 426–431, 2000.
[22] G. Reinman and N. Jouppi. Cacti version 2.0.
http://www.research.digital.com/wrl/people/jouppi/CACTI.html,
June 1999.
[23] S. Santhanam. Strongarm 110: A 160mhz 32b 0.5w cmos
arm processor. In Proceedings of HotChips VIII, pages
119–130, 1996.
[24] T. Sherwood and B. Calder. Automated design of ﬁnite
state machine predictors for customized processors. In
Annual International Symposium on Computer
Architecture, June 2001.
[25] C. Snyder. Synthesizable core makeover: Is lexra’s
seven-stage pipelined core the speed king? In
Microprocessor Report, June 2001.
[26] C.D. Snyder. Fpga processors cores get serious.
Microprocessor Report, 14(9), September 2000.
[27] A. Srivastava and A. Eustace. ATOM: A system for
building customized program analysis tools. In Proceedings
of the Conference on Programming Language Design and
Implementation, pages 196–205. ACM, 1994.
[28] Rabin A. Sugumar and Santosh G. Abraham.
Set-associative cache simulation using generalized binomial
trees. ACM Transactions on Computer Systems,
13(1):32–56, 1995.
[29] S. Wilton and N. Jouppi. Cacti: An enhanced cache access
and cycle time model. In IEEE Journal of Solid-State
Circuits, May 1996.
[30] Lisa Wu, Chris Weaver, and Todd Austin. Cryptomaniac: a
fast ﬂexible architecture for secure communication. In 28th
Annual International Symposium on Computer
Architecture, pages 110–119, 2001.