Static task mapping for tiled chip multiprocessors with multiple voltage islands by Nikitin, Nikita & Cortadella, Jordi
Static Task Mapping for Tiled Chip Multiprocessors with
Multiple Voltage Islands
Nikita Nikitin
Universitat Polite`cnica de Catalunya
Barcelona, Spain
Jordi Cortadella
Universitat Polite`cnica de Catalunya
Barcelona, Spain
Abstract—The complexity of large Chip Multiprocessors (CMP) makes
design reuse a practical approach to reduce the manufacturing and design
cost of high-performance systems. This paper proposes techniques for
static task mapping onto general-purpose CMPs with multiple pre-defined
voltage islands for power management. The CMPs are assumed to contain
different classes of processing elements with multiple voltage/frequency
execution modes to better cover a large range of applications. Task
mapping is performed with awareness of both on-chip and off-chip
memory traffic, and communication constraints such as the link and
memory bandwidth. Besides proposing a linear programming model for
small systems, a novel mapping approach based on Extremal Optimization
is proposed for large-scale CMPs. This new combinatorial optimization
method has delivered very good results in quality and computational cost
when compared to the classical simulated annealing.
I. INTRODUCTION
Chip-multiprocessing (CMP) is becoming a major trend to take
advantage of Moore’s law under the power consumption limitations
dictated by the heat dissipation problems in high performance com-
puting systems. Commercial and prototype implementations have
shown the performance gains achieved by CMPs having up to
a hundred cores [1]–[5]. As we move down to deep nanometric
technologies, the design complexity of such systems increases sig-
nificantly. Manufacturing costs and time-to-market compromise the
viability of new products that are customized for specific applications.
Design reuse is a pragmatic solution to this problem, in both CMP
design and deployment. For an effective reuse during deployment,
CMPs are designed general-purpose, to support a variety of applica-
tions. Hence, a methodology for efficient mapping of applications
onto CMPs is essential. Many approaches have been proposed
to solve the mapping problem for application-specific and multi-
processor on-chip systems (SoCs) [6]. However, there are significant
differences between the SoCs and CMPs, that are of the great
importance for the mapping problem. To understand these differences
we have to consider two aspects of CMPs: the tiled architecture and
organization of power management.
A. Tiled CMP architecture
To reduce the design time, tile replication was shown to be an
efficient reuse methodology for many-core CMPs [4], [5]. This led to
the concept of tiled architecture, characterized by regular structures of
homogeneous tiles, each one consisting of a processing core, a cache
memory and a router. Further research in this area inspired designs
with heterogeneous tiles, preserving the regularity of the structure,
but introducing several classes of tiles [1], [7], [8]. Such systems
may include some specialized processors (e.g., graphics, DSP) or
different implementations of the same architecture (e.g., in-order/out-
of-order, multi-threading) with varied power-performance trade-offs.
Figure 1(a) depicts a tiled CMP with three classes of tiles: general-
purpose cores (C), cores with graphics units (G) and DSPs (D). Each
tile also incorporates cache memories of two levels (L1, L2) and
C C C C
C G D C
C D G C
C C C C
Core
L2
L1
R
MC MC
(a)
t8
t1 t2
t6t7
t3
t9 t13
t11 t12 t16
t10
t5
t4
t15
t14
(b)
Fig. 1: (a) Tiled CMP; (b) task graph to be mapped onto CMP.
an on-chip router (R), connecting it to the interconnection fabric.
Two memory controllers (MC) are placed at the periphery to provide
communication with the off-chip memory.
B. Power management
CMPs are designed to operate under a certain power budget that
assures the performance and thermal properties of the system. One
of the most effective ways to manage power is to floorplan various
voltage islands and assign the best voltage and frequency for each
core [9].
Unfortunately, voltage islands have a high design cost. Firstly, the
floorplanning of the system is constrained by the design of the power
delivery network and the location of the level shifters. Secondly,
and more important, power management requires different voltage
regulators for each power supply. Off-chip regulators need extra area
on the PCB that may be unacceptable if the system has a large amount
of power domains. On-chip regulators involve a significant area
overhead and power consumption due to the large inductances and
switching capacitors required to provide a stable supply voltage [10].
It is therefore realistic to consider that future CMPs will have
many cores (hundreds) and voltage islands with several cores (e.g., 4
or 8). This fact imposes an additional constraint in the task mapping
problem: even though some cores could possibly run at lower voltages
and frequencies, sharing the island with other cores may prevent to
take advantage of this flexibility. Hence, it is convenient to allocate
tasks in a way that cores within the same voltage islands can share
similar voltage/frequency parameters.
Up to now, the research on power-aware mapping has assumed
that the voltage islands are defined pre-silicon during task mapping
in application-specific SoCs, often disregarding the cost of imple-
menting the voltage islands. A broad overview of the related work
on SoC application mapping and island planning can be found in [6].
The approach in [11] considers performance constraints, but does
not account for the communication component of power. A more
realistic approach is proposed in [12] in which computation and
communication are both optimized taking into account a third com-
ponent related to voltage shifters. Thermal-aware island partitioning
via evolutionary algorithms was proposed in [13]. The distinction of
different processor classes was introduced in [14], but assuming that
every processor can run at an independent voltage level.
C. Task mapping for tiled CMPs
The mapping problem we want to address differs from previous
ones in that the CMP is assumed to be already manufactured.
Therefore, the voltage islands have been already floorplanned and
the maximum bandwidth of the links between cores is also known a
priori. Another peculiarity of CMP mapping (as opposed to SoCs),
captured by this work, is the presence of traffic to the off-chip
memory, as well as the limited bandwidth of the memory controllers
(MCs). Finally, the methods proposed for task mapping must be
scalable and suitable to handle systems with hundreds of cores.
Hence, scalability becomes a major concern of this work.
The work in [15] proposed a framework for accurate compiler-level
mapping of applications onto homogeneous mesh CMPs through de-
tailed analysis of the instructions and allocation of data. The approach
presented in this paper differs by considering the variety of processing
units, offered by heterogeneous CMPs. It also demonstrates better
scalability, due to the higher-level abstraction of application with a
task graph.
The examined problem consists of statically mapping a set of
parallel tasks onto a many-core CMP and selecting the voltages of the
CMP islands so that the total communication and computation power
is minimized. The application to be mapped is represented as a graph
of parallel tasks (Fig. 1(b)) with specified average communication
requirements between the tasks, that is a common assumption for
mapping onto the on-chip systems [6]. The partitioning of application
into parallel tasks can be obtained by profiling [15].
Every task has an associated throughput constraint (instructions
per second) that guarantees the required QoS for that task. A variety
of processor classes is supported, each one characterized by a set
of voltage/performance/power parameters that can be selected to find
the best performance/power trade-off for each task. However, this
flexibility is constrained by an important limitation: all the cores in
the same island must work with the same voltage.
This work will also assume that the cores are organized in a mesh
with XY-routing [16]. The task mapping must satisfy the link and
memory controller bandwidth constraints defined a priori to avoid a
saturation of the communication fabrics.
The main contributions of this work can be summarized as follows:
• Specification of the problem for power-aware task mapping
onto manufactured CMP with several tile classes, subject to
throughput constraints.
• Mathematical formulation of the problem as a mixed-integer lin-
ear programming problem (MILP) capable of delivering optimal
solutions for examples of small size.
• Scalable approach based on Extremal Optimization (EO) [17],
shown to outperform the optimization by simulated annealing,
both in quality of results and computational cost. The scalability
of the method is proved by examples with hundreds of cores.
We would like to emphasize the fact that Extremal Optimization
is a combinatorial optimization technique mostly unknown in the
EDA community. The results obtained for this problem have been
surprisingly good and very competitive with regard to Simulated
Annealing. We believe these results may encourage further research
in other areas related to layout synthesis. A related work, although
in a different context, can be found in [18].
The structure of the paper is as follows. Next section presents an
overview of the mapping problem by considering a small example.
Section III proposes an MILP formulation of the problem. The
metaheustistic techniques are explained in Section IV. Section V
discusses the experimental results.
1.0
1.0 0.5
t1 t2
t3 t4
(a) Task graph
C2C1
(b) Processor mesh
t1 t2
t3 t4
(c) Communication-
optimal mapping
  V       F       P
1.0   1.00  0.30
0.8   0.75  0.15
  V       F       P
1.0   0.50  0.10
0.8   0.35  0.05
  t1     t2      t3      t4
0.7   0.8   0.4   0.3
C1
C2
Throughput
(GIPS)
     IPC
t1, t2   1.0t3, t4   0.8 
     IPC
t1, t2   0.8t3, t4   1.0 
(d) Processor and throughput data
t1 t3
t2 t4
V1
V2
(e) Throughput-feasible map-
ping
t1 t4
t2 t3
V1
V2
(f) Power-optimal mapping
Fig. 2: Task mapping example.
II. PROBLEM OVERVIEW
This section discusses the task mapping problem using a small
example. Let us assume a task graph with four tasks (Fig. 2(a)). There
are three flows between the tasks, with the bandwidths specified in
the arcs of the graph (in Gbps). Figure 2(b) depicts a CMP with four
processors. There are two classes of processors: C1 (light) and C2
(dark). The task graph must be mapped onto the CMP.
1) Communication-optimal mapping: Figure 2(c) shows a task
mapping that optimizes the communication metric, that is the product
of bandwidth and hop-count. Assuming the distance between the
neighboring processors is one hop, the communication cost of this
mapping is
CCost1 = 1.0 · 1 + 1.0 · 1 + 0.5 · 2 = 3.0 (Gbps).
2) Throughput-feasible mapping: Now let us take into account
the processor parameters and consider the throughput requirements
of the tasks. Figure 2(d) describes the processor parameters. They can
operate at two voltages, 1.0 and 0.8V . The corresponding frequency
(F , in GHz) and power (P , in W ) for each voltage is shown in the
tables. Due to the nature of the tasks and the implementation of each
processor, each task may be executed with a different performance
(Instructions Per Cycle (IPC)) in each class of processors. Finally,
each task may require a specific throughput (given in giga-IPS in
Fig. 2(d)).
The mapping in Fig. 2(c) is infeasible with introduction of the
throughput constraints. Consider task t2 assigned to a C2-processor.
The maximum performance that C2 can provide for t2 is IPC(t2) ·
F (1.0V ) = 0.8 · 0.5 = 0.4 GIPS, while the throughput requirement
for t2 is 0.8 GIPS.
To satisfy the requirements, tasks t2 and t3 are swapped (see
Fig. 2(e)). This mapping satisfies the throughput constraints and still
keeps the optimal value for the communication metrics.
3) Power-optimal mapping: As a final step, let us consider the
partitioning of the CMP into voltage islands. Let us assume the
CMP has two islands, separated by the bold dotted line, as shown
in Fig. 2(e). Processors in the same island must operate at the same
voltage level, that is the minimal voltage required to satisfy all the
throughput constraints for the tasks mapped to this island.
For the mapping in Fig. 2(e), the upper island has to operate at
1.0V dictated by the throughput constraint of t3. The lower island
also has to run at 1.0V , because of t2. Thus, the computation power,
calculated using the data from Fig. 2(d), is Pcomp = 0.30 + 0.10 +
0.30 + 0.10 = 0.80 W . Let the energy to transfer one bit for one
hop be Ebit = 0.1nJ/bit. Then the communication power is
Pcomm = CCost2 · Ebit = 3.0Gbps · 0.1nJ/bit = 0.3 W,
and the total power P = Pcomp + Pcomm = 1.10 W .
Notice that if we swap tasks t3 and t4 (Fig. 2(f)), the upper
island can lower the voltage to 0.8V without violating the throughput
constraints. The new computation power is Pcomp = 0.15 + 0.05 +
0.30 + 0.10 = 0.60 W . The communication cost is increased:
CCost3 = 1.0 · 1 + 1.0 · 2 + 0.5 · 1 = 3.5 (Gbps · hop), so the
communication power becomes Pcomm = CCost3 ·Ebit = 3.5 · 0.1 =
0.35 W . However, the total power P = 0.95 W decreases, making
the assignment in Fig. 2(f) the best one in terms of total power.
The previous example demonstrates the importance of the task
mapping problem when trying to minimize power consumption in a
CMP with multiple classes of processors and voltage islands. The
next section shows how optimal solutions for small instances of the
problem can be found based on an MILP formulation.
III. A MATHEMATICAL MODEL
This section gives a formal definition of the problem via a Mixed-
Integer Linear Programming model. This model will be later used
as the basis of a heuristic method for large-scale systems based on
Extremal Optimization.
A. Parameters of the problem
The parameters of the problem are summarized in Table I. The
variables of the MILP formulation are outlined in Table II.
A task graph TG(T ,F) is a directed graph with vertices rep-
resenting the tasks ti ∈ T . Each arc represents a flow fsd ∈ F
that defines the communication from task ts to td. Every flow has
a minimum required bandwidth Bsd. Every task ti has a throughput
constraint IPS(ti), that is the minimum number of instructions per
second required to provide the service delivered by the task. Λ(ti)
defines the total traffic rate between ti and the memory controller.
The ratio between the traffic to and from the controller is specified by
the parameter ρ. Note that Λ(ti) value can be approximated, given
the amount of data, operated by the task (i.e. the working set), and
the size and miss-ratio of the tile cache.
A CMP is represented by a mesh of processors PM(P,L) with
dimensions of W ·H cells, where P is the set of processors and L
is the set of communication links. Links are organized into an on-
chip network with regular mesh topology [19]. The communication
capacity between the neighboring cells is determined by the global
parameter Cap (all links are assumed to have the same capacity).
Every cell represents a processor pj , belonging to one of the processor
classes in C = {c1, .., cC}. Different classes of processors have
distinct performance executing each task. The performance of pj to
execute task ti, measured in instructions per cycle, is specified by
the function IPC(ti, pj).
Task parameters
TG Task graph with tasks ti and flows fsd
Bsd Bandwidth requirement for flow fsd
IPS(ti) Throughput requirement for task ti (instr./sec.)
Λ(ti) Traffic of task ti to the memory controller
ρ Ratio of traffic rates to and from controller
Processor grid parameters
PM(P,L) Mesh of processors (P) with communication links (L)
Cap Maximum capacity of the communication links
IPC(ti, pj) Performance of pj executing ti (instr./cycle)
V Set of available operating voltages vk
F (pj , vk) Frequency of processor pj at voltage vk
P (pj , vk) Power of processor pj at voltage vk
MC(pj) Memory controller associated with pj
McDist(pj) Distance from pj to associated controller
McBw Maximum bandwidth of memory controllers
Voltage island parameters
{ιn} Set of voltage islands
I Map from processors to voltage islands
TABLE I: Input parameters of the problem.
Variable Type Description
aijk task ti is assigned to processor pj with voltage vk
vnk Binary voltage island ιn operates at voltage vk
rlsd link l belongs to the route of flow fsd
mlsd Real mapping indicator for the terminals of fsd
hxsd, h
y
sd hop-count (x and y) of route fsd
TABLE II: Variables of the MILP formulation.
The processors may operate at different voltages. We assume a
set of voltages V = {v1, .., vV } available for all processors. The
frequency and power of pj operating at voltage vk are defined by the
functions F (pj , vk) and P (pj , vk), respectively. Every pj belongs to
some voltage island ιn, as defined by the island map I. The voltage of
an island can be adjusted independently of the other islands, however,
all processors in an island must operate at the same voltage.
A CMP has a set of controllers to access the off-chip memory.
Guided by the existing implementations [4], [5], in this work we
assume controller placement at the periphery of processor mesh.
However, this does not limit the proposed approach from having the
controllers placed inside the mesh, that was demonstrated beneficial
by the recent research [20]. Another assumption we make is that
every processor pj is associated with one controller, as defined
by the function MC(pj). This assumption can be eliminated by
specifying the probabilities of accessing different controllers for pj .
Function McDist(pj) returns the hop-count distance from pj to the
related controller. The McBw parameter sets the maximum controller
bandwidth to guarantee performance of memory access.
B. Cost function
The goal of the model is to minimize power consumption under a
set of design and performance constraints.
The binary variables aijk define whether task ti is mapped onto
processor pj operating at voltage vk. The total power consumption
for computation can be defined as follows:
Pcomp =
∑
ti∈T
∑
pj∈P
∑
vk∈V
aijk · P (pj , vk).
The power consumption for communication has two terms: the on-
chip communication, defined by the flows between the tasks and
the off-chip communication, defined by the traffic to the memory
controllers. To model the first term, we introduce the variables
hxsd and h
y
sd that represent the hop-count of flow fsd in the x-
and y-directions, respectively. The power consumption for inter-task
communication can be defined as
P tcomm =
∑
fsd∈F
Bsd · (hxsd + hysd) · Ebit,
and the term related to communication with memory controllers
Pmccomm =
∑
ti∈T
∑
pj∈P
∑
vk∈V
aijk · Λ(ti) · McDist(pj) · Ebit.
where Ebit is the estimated energy for transmitting one bit over a
link. The objective of the problem is to minimize the total power:
min : P = Pcomp + Pcomm = Pcomp + P tcomm + P
mc
comm. (1)
C. Constraints
The first two constraints are the classical requirements for an
assignment problem. Every task ti has to be assigned to some
processor pj and every processor can hold one task at most:
∀ti ∈ T :
∑
pj∈P
∑
vk∈V
aijk = 1. (2)
∀pj ∈ P :
∑
ti∈T
∑
vk∈V
aijk ≤ 1. (3)
The next step is to model the communication component of the power.
A set of constraints is introduced to calculate the hop-count of each
flow assuming an XY-routing. Each processor pj is located in a tile
at column xj and row yj of the mesh (Fig. 3a). The coordinates are
uniquely defined by the index j: xj = j mod W and yj = bj/W c,
where W is the width of the mesh. For any task ti, we define (xi, yi)
as the location of the processor assigned to the task. Then, the location
is specified by the expressions over the task assignment variables:
xi =
∑
pj∈P
(j mod W )
∑
vk∈V
aijk
yi =
∑
pj∈P
(bj/W c)
∑
vk∈V
aijk. (4)
For every flow fsd, the source and destination tasks, ts and td,
are mapped onto processors ps and pd, with coordinates (xs, ys)
and (xd, yd), respectively, defined by (4). The horizontal hop-count,
hxsd = |xs − xd|, and the vertical hop-count, hysd = |ys − yd| are
defined by the following constraints1:
xs − xd ≤ hxsd, xd − xs ≤ hxsd
ys − yd ≤ hysd, yd − ys ≤ hysd. (5)
The next group of constraints defines the relations between voltage
islands and throughput. Let the binary variable vnk represent the fact
that the voltage island ιn operates at voltage vk. First, for every island
ιn only one voltage has to be selected:
∀ιn ∈ I :
∑
vk∈V
vnk = 1. (6)
To enforce that all processors in the same voltage island work with
the same voltage, the following constraint is added:
∀ιn ∈ I, ∀vk ∈ V :
∑
ti∈T
∑
pj∈ιn
aijk ≤ Num(aijk) · vnk , (7)
1the pair of inequalities and the fact that the h variables are implicitly
minimized with the cost function (since this implies minimization of power),
guarantee the equality with the absolute value.
xj
yj
0 1 2 3
0
3
2
1 pj EAST MC
WEST MC
(a) Processor pj located in the
cell (xj , yj) of the mesh.
xlb xle
yl
0 1 2 3
0
3
2
1 EAST MC
WEST MC
(b) East link from cell
(xlb, yl) to cell (xle, yl).
Fig. 3: Definition of the processor and link location in mesh.
where Num(aijk) is the number of aijk variables in the LHS of the
inequality. Expression (7) in combination with (6) guarantees that
only the assignment variables for the selected voltage may take non-
zero values.
The throughput constraint should guarantee that for each task ti
executed on processor pj the product of IPC(ti, pj) and the processor
frequency F (pj , vk) defined by the current voltage, is not less than
the required throughput IPS(ti). Hence, the following relation is
specified for each ti ∈ T :∑
pj∈P
∑
vk∈V
aijk · IPC(ti, pj) · F (pj , vk) ≥ IPS(ti). (8)
The last group of constraints aims at satisfying the requirements for
link capacity and memory controller bandwidth, under the assumption
of XY-routing. We start by considering the link capacity. The total
link bandwidth can be expressed as the sum of the bandwidths of all
flows that pass through the link. There are two terms that contribute
to link bandwidth, related to the inter-task and memory controller
traffic, hence the constraint can be written as
∀l ∈ L : TaskTerm(l) + McTerm(l) ≤ Cap. (9)
Let us consider the task term first. In XY-routing, the data is
always sent in the X-direction first and the Y-direction afterward.
Hence, the route of a flow will pass through a link, only in case
the source and destination tasks are mapped to a specific subset
of processor locations. Thus, for every link l and flow fsd we
define the binary properties, MapSrc(l, fsd) and MapDst(l, fsd),
that indicate whether the source and destination tasks are mapped
onto the locations that imply link l to be on the flow route.
To guarantee that l is on the route of fsd, both properties should be
asserted, i.e., MapSrc(l, fsd) · MapDst(l, fsd) = 1. This is a non-
linear constraint that we linearize by introducing the real variables
mlsd:
MapSrc(l, fsd) + MapDst(l, fsd) = m
l
sd. (10)
Since the mapping properties can only take binary values, the mlsd
variable can only take three values: 0, 1, or 2. We use another scaling
of mlsd to the binary variables r
l
sd, that take non-zero values only
when mlsd = 2, i.e. both mapping properties are true:
rlsd ≥ mlsd − 1, 2 · rlsd ≤ mlsd. (11)
The variables rlsd are equal to 1 if the route of flow fsd goes
through link l. Now the task term for link l is written as:
TaskTerm(l) =
∑
fsd∈F
Bsd · rlsd. (12)
Next we explain how to represent the mapping properties
MapSrc(l, fsd) and MapDst(l, fsd). Consider the horizontal east
link of a cell with coordinates (xlb, yl) to a cell (xle, yl) (Fig. 3b).
The XY-route of flow fsd can only pass through the link in case
the source task ts is mapped onto one of the two dotted processor
cells. Indeed, the processor ps should be located on the same row
and in a column that is lower than or equal to the origin of the link:
(xs ≤ xlb) ∧ (ys = yl). Hence, the source mapping property for an
east link is:
MapSrc(l, fsd) =
∑
pj :(xj≤xlb)∧(yj=yl)
∑
vk∈V
asjk. (13)
For the destination task td the requirement is to be located in a
column that is greater than or equal to the link endpoint xle (striped
cells). The destination mapping property becomes:
MapDst(l, fsd) =
∑
pj :(xj≥xle)
∑
vk∈V
adjk. (14)
The mapping properties for south, west and north links are derived
in a similar manner.
Now consider the term related to the memory controller traffic.
For link l let us denote Req(l) the set of processors that send
requests to their controllers through link l. Similarly, Rep(l) is the set
of processors that receive replies from controller through l. Hence,
traffic to and from controllers through l is defined by the rate of tasks
mapped to the sets Req(l) and Rep(l):
McTerm(l) =
∑
ti∈T
∑
pj∈Req(l)
∑
vk∈V
aijk · ρ · Λ(ti)+∑
ti∈T
∑
pj∈Rep(l)
∑
vk∈V
aijk · (1− ρ) · Λ(ti). (15)
The sets Req(l) and Rep(l) can be expressed similarly to the
MapSrc and MapDst properties for links. As an example, let us
consider the same east link (Fig. 3b). Assuming XY-routing, pj sends
requests in the direction of the associated controller. Hence, the set
of processors sending requests through the link is limited by those,
associated with the EAST controller and located on the same row in
the column that is lower or equal to the link origin:
Req(l) = {pj : (MC(pj) = EAST) ∧ (xj ≤ xlb) ∧ (yj = yl)}.
The sets for other links are derived similarly. The inequalities (9)
together with the scaling relations (10), (11) and definitions (12)-(15)
guarantee that the link capacity constraints are met.
The last step is to specify the bandwidth constraints of the memory
controllers. The bandwidth of controller mcκ ∈ MC is defined by the
rates of tasks mapped onto processors, associated with mcκ:
∀mcκ ∈ MC :
∑
ti∈T
∑
pj :MC(pj)=κ
∑
vk∈V
aijk · Λ(ti) ≤ McBw. (16)
D. Problem formulation
The problem can now be formulated as follows.
Minimize: power consumption (1)
subject to:
assignment constraints and hop-count definition (2)-(5),
voltage selection constraints (6), (7),
throughput constraints (8),
link capacity constraints (9)-(15)
and memory bandwidth constraints (16).
IV. SIMULATED ANNEALING AND EXTREMAL OPTIMIZATION
This section discusses two metaheuristics commonly used to solve
complex combinatorial problems: Simulated Annealing (SA) [21]
and Extremal Optimization (EO) [17]. Both methods are inspired by
equilibrium statistical physics. SA has been successfully applied in
many EDA problems, mostly related to layout synthesis. However,
EO has emerged as a very competitive alternative that can give
superior results in quality and computational cost. This section shows
how EO can be customized to effectively solve the task mapping
problem. The results prove the superiority with regard to SA.
Both metaheuristics start from an initial mapping obtained by
greedily placing the tasks with highest throughput to the fastest
processors. It is assumed that the system is not highly throughput-
constrained and that a feasible initial assignment can be achieved by
a greedy heuristic. The bandwidth constraints may be violated in the
initial mapping and will be handled during the optimization process.
A. Simulated annealing
The general algorithm for SA is described in procedure 1. To
evaluate every configuration, two functions are used. Cost() returns
the cost of a configuration, calculated as the total system power
according to equation (1). CapP() calculates the penalty for link
capacity and memory bandwidth violations:
CapP =
∑
l∈L
max
(
Bl − Cap
Cap
, 0
)
+
∑
mc∈MC
max
(
Bmc − McBw
McBw
, 0
)
,
where Bl is the total bandwidth of flows routed through link l and
Bmc is the bandwidth of controller mc. If all constraints are satisfied,
then CapP = 0.
The SA algorithm implements a conventional annealing schedule.
Given the initial temperature Tinit and the cooling factor α, a
new solution (NewSol) is generated (line 4) and may be accepted
probabilistically, depending on the current temperature Tcur (line 5).
The value of Tcur decreases as the system evolves in time (line 8).
For every temperature, a number of moves that depends on the size
of the system (P , that is the number of cells) is generated.
NewSol is obtained by swapping a pair of random tasks and
is accepted probabilistically according to Procedure 2. To penalize
capacity violations, the cost is calculated as shown in lines 1-2, where
λ = Tinit/Tcur is the weight for penalization, that grows in time.
This decreases the probability to accept infeasible solutions as the
simulation advances.
Solutions with better cost are always accepted (line 3), whereas
worse-cost solutions are accepted with probability Pa:
Pa =
(
1− |∆Cost|
CurCost
)
· e−2·
Tinit
Tcur ,
with ∆Cost = NewCost−CurCost. This probability depends on two
factors. The former avoids the acceptance of solutions with high cost
degradation. The latter reduces the probability of hill climbing as the
temperature cools down.
B. Extremal optimization
Extremal optimization, inspired by the principle of evolution in
ecosystems, is metaheuristic for complex combinatorial problems.
Ecosystems were observed to evolve by selecting against its worst
components.
For every possible solution, EO evaluates the fitness of each
component in the system. A high fitness value indicates that the
component has a comfortable low-cost status in the system.
Procedure 1 SIMULATEDANNEALING
1: Tcur = Tinit
2: while improvement in last k iterations do
3: for P iterations do
4: generate NewSol
5: if ACCEPT(CurSol, NewSol) then CurSol← NewSol
6: if Cost(NewSol) < Cost(BestSol) then BestSol← NewSol
7: end for
8: Tcur = α · Tcur
9: end while
10: return BestSol
Procedure 2 ACCEPT(CurSol, NewSol)
1: CurCost← Cost(CurSol) + λ CapP(CurSol)
2: NewCost← Cost(NewSol) + λ CapP(NewSol)
3: if NewCost < CurCost then return true
4: else return true with probability Pa
EO focuses on improving the status of components with low
fitness. At each iteration, some of the worst-fit components are
replaced by other components that contribute to improve their fitness.
Local optima are avoided by randomizing the selection process.
The components are ranked according to their fitness in ascending
order (the worst components have lower indices in the rank). The
components are randomly selected by some probability distribution
biased towards the ones with lowest fitness values. The power-law
distribution is a typical one for EO. For example, if the system has N
components ranked from 1 to N in ascending order of their fitness,
the index i of the selected component can be calculated as follows:
i = dN · pτe (17)
where p is a random number obtained from a uniform distribution
in the interval [0, 1] and τ is the exponent of the power law. In our
experiments, we used values of τ in the interval [3, 4].
For the task mapping problem, at each iteration EO selects a pair of
tasks to be swapped: an unfavorable task (tu) and a replacement task
(tr). Unlike SA, EO uses information about the system cost when
selecting the swapped tasks. This results into a faster progress towards
the final solution. In addition, EO accepts new solutions uncondition-
ally without depending on any temperature cooling schedule, thus
making the algorithm easier to tune.
The mapping problem is a multiobjective optimization problem,
since the Pcomp, P tcomm and Pmccomm terms of the cost function (1)
depend on weakly related voltage level and hop-count values. It
was observed in [18] (and proved by our experiments) that the
multiobjective EO operates better by interleaving the optimization of
individual objectives in time, rather than trying to optimize all of them
simultaneously. This suggests to introduce different fitness functions
for the optimization of three power components and alternate them
at different iterations of the algorithm.
The EO algorithm is outlined in procedure 3. After the definition
of an initial solution (greedily), the execution is continued until no
further improvement is observed during a certain number of iterations.
The algorithm used in this work is a variation of EO called
Continuous Extremal Optimization [22]. This variant combines EO
with a local search at the beginning of each iteration, that is performed
by sequentially swapping P random pairs of tasks and accepting only
those that improve the cost. This variant contributed to improve the
cost of the final solution and the speed of the algorithm.
Procedure 3 EXTREMALOPTIMIZATION
1: CurSol← BestSol← ”Some initial solution”
2: while some improvement in the last k iterations do
3: Local search: swap P randomly selected pairs sequentially and
4: accept only those that improve the cost of CurSol
5: if (iter mod 3) = 1 then /* improve task comm. cost */
6: sort all tasks in ascending order of Φcommu,t
7: select tu according to equation (17)
8: sort all tasks ti 6= tu in ascending order of Φcommr
9: select tr according to equation (17)
10: else if (iter mod 3) = 2 then /* improve mc comm. cost */
11: sort all tasks in ascending order of Φcommu,mc
12: select tu according to equation (17)
13: sort all tasks ti 6= tu in ascending order of Φcommr
14: select tr according to equation (17)
15: else /* improve comp. cost */
16: sort all tasks in ascending order of Φcompu
17: select tu according to equation (17)
18: sort all tasks ti 6= tu in ascending order of Φcompr
19: select tr according to equation (17)
20: swap tasks tu and tr in CurSol
21: if Cost(CurSol) < Cost(BestSol) then
22: BestSol← CurSol
23: end while
24: return BestSol
The core of the algorithm focuses on selecting the pair of tasks that
must be swapped. The fitness functions alternate depending on the
iteration number. In one case, fitness is oriented to improve the power
consumption generated by inter-task communication, considering the
hop-counts and bandwidth parameters. In the second case, the power
of communication with memory controllers is optimized. The last
case addresses the power generated by computations.
The first task, tu, is selected by using the Φu fitness function and
sorting the tasks according to the fitness value. The second task, tr , is
selected by ranking the task according to the improvement in cost that
the swap would produce (Φr functions). The power law described by
equation (17) is used to select the tasks randomly.
Finally the locations of tasks of tu and tr are swapped uncondi-
tionally and BestSol is updated if the cost is better than any other
solution visited so far.
C. Fitness functions
To model the fitness for the power consumption generated by the
inter-task traffic on the mesh, Φcommu,t ranks the tasks according to the
product of total traffic and the square of hop-count of the involved
flows:
Φcommu,t (ti) = −
∑
fsd:(ts=ti)∨(td=ti)
Bsd · (hxsd + hysd)2.
The square of hop-count penalizes tasks with longer flows, rather
than those with high bandwidth, since Bsd is a constant parameter
that cannot be changed. The selection of the ranked tasks tends to
pick tasks with high communication cost. The negative sign allows
to rank the tasks in ascending order of fitness.
Similar fitness is used to select unfavorable task ti mapped to
processor pj for the controller-related term of power:
Φcommu,mc (ti) = −Λ(ti) · McDist(pj)2.
Although the fitness functions selected for both terms of communi-
cation power look similar, we consider them as individual candidates
for multiproduct optimization. The intrinsic difference between the
two types of communication is that an inter-task flow depends on
mapping of both, source and destination tasks, while the memory
controller flow depends on one, either source or destination task.
The fitness function for the replacement task tr is the same for
both types of communication. It aims at selecting a task that, when
swapped with tu, would mostly decrease the communication cost and
contribute to reduce the violations of maximum bandwidth:
Φcommr (ti) = Cost(NewSol) · (1 + CapP(NewSol)),
where NewSol is the solution obtained by swapping ti and tu.
The computation-oriented fitness functions aim at finding power-
efficient solutions by smoothing the voltage spillover in the voltage
islands. Let us call V mini the minimum voltage required to guarantee
the throughput of task ti assigned to a processor in some voltage
island ιn. Since task is living in the same island with other tasks, it
may not be possible to assign V mini to it, as other tasks may require
a higher voltage.
We define the voltage spillover of ti as Spilloveri = V
min
i − V ,
where V is the average minimal voltage of all tasks allocated in the
same voltage island. The dispersion of island ιn is defined as
Dispersionn =
∑
ti∈ιn
(Spilloveri)
2.
and measures the voltage imbalance for the island. High dispersions
imply less power-efficient solutions, as more processors operate
at voltages higher than required. Computational fitnesses aim at
decreasing the voltage dispersion of the system. The unfavorable
component is selected from the tasks with the high spillover value:
Φcompu (ti) = −Spilloveri.
The replacement task is selected to maximize the product of the
cost improvement with the dispersion, penalizing solutions with large
capacity violations:
Φcompr (ti) =
1 + CapP(NewSol)
∆Cost ·∆Dispersion .
V. EXPERIMENTAL RESULTS
The results presented in this section have three primary objectives.
Firstly, optimal solutions are obtained for small examples by solving
the MILP model. It is shown that metaheuristics can also find the
optimum for these examples, and in much shorter time. Secondly,
the quality and speed of SA and EO are compared. The latter
is demonstrated to outperform in both metrics for a vast space
of solutions. Thirdly, the impact of the link capacity and memory
bandwidth constraints is discussed.
A. Examples and experimental setup
Every test case for the mapping problem is characterized by an
application task graph and a target CMP. The parameters of the test
cases are presented in Table III. The number of tasks and flows are
reported in the second and third columns. The fourth column shows
the dimensions of the mesh for the target CMP. The last column
displays the number of memory controllers in each test case.
The first group of examples is inspired by the realistic applications,
widely used in the SoC research domain (e.g. [12], [23]): Multi-
Window Displayer (MWD), MPEG4 decoder (MPEG4) and Object
Plane Decoder (OPD). To explore the scalability of the proposed
Name # of tasks # of flows Grid size # of MC
MWD 12 11 4× 3 2
MPEG4 12 13 4× 3 2
OPD 16 17 4× 4 2
64T 64 90 8× 8 4
144T 144 200 12× 12 4
256T 256 380 16× 16 8
400T 400 595 20× 20 8
TABLE III: Testcase configurations.
Class 1.2V 1.0V 0.8V
C1 1000MHz, 260mW 800MHz, 150mW 600MHz, 70mW
C2 450MHz, 200mW 350MHz, 120mW 250MHz, 60mW
C3 160MHz, 55mW 130MHz, 30mW 100MHz, 15mW
TABLE IV: Parameters of the processor classes.
technique, we generate a group of large examples for mapping onto
8× 8, 12× 12, 16× 16 and 20× 20-tile CMPs (test cases 64T to
400T). The task graphs for these configurations are obtained by
combining instances of MWD, MPEG4 and OPD. For instance, the
task graph for 400T consists of 30 small applications, 10 instances
of each type. To avoid having totally disconnected clusters of tasks,
few random flows were added between the components. The third
column of Table III displays the resulting number of flows in graphs.
For the experiments, we have considered three processor classes
(C1, C2 and C3) with different frequency and power parameters
operating at three different voltages: 1.2V, 1.0V and 0.8V. The
parameters are reported in Table IV. The distribution of tiles in
the CMP is as follows: 20% of the tiles have C1-processors, 30%
have C2-processors and 50% have C3-processors. The classes are
distributed uniformly in such a way that all voltage islands have a
similar mixture of classes. Without loss of generality, we assume that
all islands have the same size Svi (number of tiles). Different values
have been used in the experiments.
Every task has a different throughput requirement (IPS) and a
different performance when executed at each class of processor
(IPC). All these values are defined randomly, with IPC values in
the interval [0.5, 2.0] and guaranteeing that a feasible mapping exists
for the assigned performance and throughput requirements. This
randomization contributes to explore a larger set of configurations
and to have an unbiased tuning of the metaheuristics.
The traffic Λ(ti) between the task ti and memory controller was
estimated as 20% of total traffic between ti and all other tasks. The
ratio between the request and reply traffic was set to ρ = 0.2.
SA is parametrized by two values: the initial temperature Tinit and
the cooling factor α. In our experiments we define Tinit = 104 and
only vary the α value. Given that a large range of α values have
been explored for each experiment, the quality of the solutions is not
dependent on Tinit . The only parameter of EO is τ , i.e., the exponent
of the power-law (eq. (17)). Section V-C discusses the strategy used
to explore the values of α and τ .
B. Comparison with the optimal solution
The MILP formulation allows obtaining optimal solutions for the
mapping problem. However, the search of the optimum is computa-
tionally affordable only for small examples. We used CPLEX [24] to
solve the MILP problem for the test cases of the first group: MWD,
MPEG4 and OPD. The size of voltage islands Svi was set to four. The
time required to find the optimum is displayed in the “MILP” column
of Table V. One can observe the two-order increase in runtime for
Name MILP SA EO
MWD 85.25 0.01 0.01
MPEG4 120.17 0.02 0.01
OPD 4594.40 1.17 0.08
TABLE V: Time to reach the optimal solution (sec).
a 16-tile example (OPD) in comparison with the 12-tile examples
(MWD, MPEG4).
The metaheuristics are able to achieve the optimal solution for the
same examples in much shorter time (columns “SA” and “EO” of
Table V). In this experiment the SA and EO algorithms were run for
a variety of parameters (α and τ ), and the best runtime values were
selected. This comparison affirms the fact, that both metaheuristics
perform very well for the small examples with known optimum.
C. Simulated Annealing and Extremal Optimization: comparison
This section tries to give an apple-to-apple comparison of both
metaheuristics for the task mapping problem. The comparison is
illustrated using the 256T example with Svi = 16 and represents
a typical behavior of the two algorithms for the explored test cases.
The timeout for execution was set to 200 seconds, since no
significant improvements were observed after that time for both
methods. Figure 4 depicts the evolution of the cost function value
obtained by SA with various α and by EO with τ = 4.0. The traces
corresponding to higher values of α drop slower, but achieve better
solutions in the long run.
Let us now consider the EO trace. At every moment in time the
current solution found by EO is better than any of the SA solutions,
obtained with different α values. The resulting cost discovered by EO
upon timeout outperforms any of the SA solutions by 12%. Another
important fact is that EO solution cost drops rapidly (0.1-2.5 seconds,
depending on the test case), to the 10% of accuracy, with respect to
the value obtained in the long run. This makes EO useful to apply
when fast estimation of the cost is required, e.g. in exploration loops.
SA requires a careful tuning to eliminate the dependency of the α
parameter on the problem size. Otherwise, small changes in α may
lead to an important degradation in quality. In this work we do not
aim at tuning the SA method. Rather, we perform multiple runs with
different α values and select the best results. The aim is to show that
EO is a better alternative even with a very good tuning of SA.
In the experiments, it was also observed that EO is much less
sensitive to τ and to the size of the problem. This simplifies the
tuning of the algorithm. Note that some variation of τ may provide
slightly better results for certain examples. However, we do not aim
at demonstrating the highest improvement for all test cases. We prefer
to emphasize that even having τ fixed, EO is able to outperform SA
with any α. Guided by this reasoning, in the following experiments
we always define τ = 4.0. This value was found to deliver good
results for all test cases.
D. Power optimization with EO
In this section we analyze the final solutions obtained with a
timeout of 200 seconds. The goal is to study the reduction in
power that EO delivers in comparison with SA for broad set of
configurations. It is important to indicate that the results obtained by
SA were selected by taking the best solution from all the α values,
thus making the analysis independent of the cooling factor.
Three parameters are explored to obtain a comprehensive collection
of test cases. Firstly, examples of different size are considered. These
include the 64T, 144T, 256T and 400T configurations from Table III.
Secondly, for every test case the size of the voltage islands is varied
25
30
35
40
45
50
55
60
65
70
75
80
85
0 20 40 60 80 100 120 140 160 180 200 220
Runtime (sec)
P o
w
e r
 
( W
)
SA, α=0.999
SA, α=0.99995
SA, α=0.99998
SA, α=0.99999
EO, τ=4.0
12%
Fig. 4: Evolution of SA and EO solutions in time.
among 4, 8 and 16 processors. Thirdly, different ratios between
the computation power Pcomp and communication power Pcomm are
considered. This is an important parameter, as it reflects the ability
of the approach to give priority to one power component or improve
both simultaneously. Three values for Pcomp/Pcomm are explored: 0.2,
1.0 and 5.0. They and inspired by the results presented in [14].
Figure 5 reports the power (equation (1)) of the EO solution
with respect to the best value obtained by SA with various α. For
each configuration, denoted as testcase/Svi along the X-axis, three
values for different Pcomp/Pcomm are shown. For the majority of
configurations EO outperformed the results of SA, with a maximum
gain in power of 22.5% (configuration 256T/16, Pcomp/Pcomm =
5.0). Only for 3 of 36 explored configurations (64T/4, 64T/16 and
400T/4 with Pcomp/Pcomm = 0.2) EO was slightly worse than SA. The
difference in this case did not exceed 2.0%.
EO tends to perform better at higher Pcomp as well as for larger
values of Svi. In other words, EO better minimizes the voltage of
islands, due to the consideration of voltage spillover. As the island
size grows, the amount of tasks, required to be swapped in order
to improve the voltage, also increases. This is one of the important
features of EO, since it can model the fitness of each component in the
system. Modeling the voltage spillover in SA is difficult, since only
a global cost is considered in the acceptance of moves and random
swaps do not concentrate on the components with worst fitness.
As an example, Fig. 6 shows the final voltage distributions for the
256T example with Svi = 16. The system has 16 voltage islands and
each island contains 16 processors with a mixture of C1, C2 and C3
classes, as shown in Fig. 6(a). The final voltage assignment for each
island is represented by the three colors in the figure. The solution
obtained by SA has 8 islands at 1.2V, 5 at 1.0V and 3 at 0.8V. The
one obtained by EO has 3 islands at 1.2V, 6 at 1.0V and 7 at 0.8V.
The estimated power consumption of the EO solution is 12% smaller
than the SA solution.
Another intuitive result is that the total power grows with the size
of voltage islands (Fig 6(c)). The island size sets the number of tiles
0.6
0.7
0.8
0.9
1
1.1
64
T/4
64
T/8
64
T/1
6
14
4T
/4
14
4T
/8
14
4T
/16
25
6T
/4
25
6T
/8
25
6T
/16
40
0T
/4
40
0T
/8
40
0T
/16
Configuration
P o
w
e r
 
r e
d u
c t
i o
n
,  
E O
/ S
A
Pcomp/Pcomm ≈ 0.2
Pcomp/Pcomm ≈ 1.0
Pcomp/Pcomm ≈ 5.0
Fig. 5: Power reduction by EO with respect to SA.
C3 C2 C3C1
C2 C3 C1C3
C3 C2 C3C2
C1 C3 C2C3
0.8V 1.0V 1.2V
(a) SA solution (b) EO solution
VI
 
si
ze
 
=
 
8
VI
 
si
ze
 
=
 
16
VI
 
si
ze
 
=
 
4
0
5
10
15
20
25
30
35
40
Po
w
er
 
(W
)
(c) EO power
Fig. 6: Voltage distribution and power for 256T example.
that must run at the same voltage. Hence, larger islands imply less
mapping flexibility for individual tiles to reduce voltage.
E. Impact of link capacity and memory bandwidth
The link capacity and memory controller bandwidth constraints
have a relevant influence on power consumption. In this example we
fix the memory bandwidth and analyze how the solution changes as
the link capacity constraint becomes more stringent. The dependency
of power on the memory bandwidth has a similar trend. We use
again the test case 256T with Svi = 16 and set the bandwidth of
each memory controller to 3 Gbps.
The results for SA and EO are plotted in Fig. 7. A sequence of
solutions for different values of capacity constraints was obtained.
The minimum capacity required to obtain feasible solutions was
Capmin = 0.91 Gbps, and was reached by both methods. The
trendlines included in the plot help to analyze the evolution of the
solutions as the link capacity changes.
The tendency for power is to increase as capacity constraint tight-
ens up. This happens principally due to the growth in communication
cost as the tasks need to be spread to avoid congestion in the links.
Although the gap between the EO and SA costs decreases as the
capacity approaches to Capmin, EO wins in power for all considered
values. This example represents the typical behavior, observed for
both metaheuristics, when optimizing a configuration subject to
capacity constraints.
VI. CONCLUSIONS
Design reuse will become a major paradigm for engineering many-
core systems. This paper has addressed the problem of static task
mapping for large-scale tiled CMPs with multiple voltage islands, as
one of the approaches to reduce design cost and time-to-market. The
problem formulation considers task throughput requirements, on-chip
and off-chip memory traffic, and bandwidth constraints. Extremal
optimization metaheuristic has shown to be an efficient and scalable
approach to solve this complex combinatorial problem.
30
35
40
45
50
55
60
0.5 1 1.5 2 2.5 3 3.5 4
Link capacity (Gbps)
P o
w
e r
 
( W
)
SA
EO
min capacity = 0.91
Fig. 7: Impact of link capacity on system power.
VII. ACKNOWLEDGEMENT
This research has been funded by project CICYT TIN2007-66523,
FPI grant BES-2008-004612, and grants from Intel Corporation and
Catalan Government (SGR 2009-1137).
REFERENCES
[1] D. Pham et al., “Overview of the architecture, circuit design, and
physical implementation of a first-generation cell processor,” Solid-State
Circuits, vol. 41, no. 1, pp. 179–196, 2006.
[2] S. Rusu et al., “A 45nm 8-core enterprise Xeon processor,” in Solid-State
Circuits, nov. 2009, pp. 9–12.
[3] T. Fisher et al., “Design solutions for the Bulldozer 32nm SOI 2-core
processor module in an 8-core CPU,” in Solid-State Circuits, feb. 2011,
pp. 78–80.
[4] S. Bell et al., “TILE64 - processor: A 64-core SoC with mesh intercon-
nect,” in Solid-State Circuits, feb. 2008, pp. 88 –598.
[5] S. Vangal et al., “An 80-tile 1.28TFLOPS network-on-chip in 65nm
CMOS,” in Solid-State Circuits, feb. 2007, pp. 98 –589.
[6] R. Marculescu, U. Y. Ogras, L.-S. Peh, N. E. Jerger, and Y. Hoskote,
“Outstanding research problems in NoC design: system, microarchitec-
ture, and circuit perspectives,” IEEE Transactions on Computer-Aided
Design, vol. 28, no. 1, pp. 3–21, Jan. 2009.
[7] M. Azimi et al., “Integration Challenges and Tradeoffs for Tera-scale
Architectures,” Intel Technology Journal, August 2007.
[8] S. Balakrishnan, R. Rajwar, M. Upton, and K. Lai, “The impact of
performance asymmetry in emerging multicore architectures,” in Inter-
national Symposium on Computer Architecture, 2005, pp. 506–517.
[9] D. Lackey et al., “Managing power and performance for System-on-Chip
designs using voltage islands,” in Proc. International Conf. Computer-
Aided Design (ICCAD), 2002, pp. 195–202.
[10] W. Kim, M. Gupta, G.-Y. Wei, and D. Brooks, “System level analysis
of fast, per-core DVFS using on-chip switching regulators,” in High
Performance Computer Architecture, Feb. 2008, pp. 123–134.
[11] W.-K. Mak and J.-W. Chen, “Voltage island generation under perfor-
mance requirement for soc designs,” in Proc. of Asia and South Pacific
Design Automation Conference, 2007, pp. 798–803.
[12] P. Ghosh and A. Sen, “Energy efficient mapping and voltage islanding
for regular NoC under design constraints,” J. High Perform. Syst. Archit.,
vol. 2, pp. 132–144, August 2010.
[13] W.-L. Hung et al., “Temperature-aware voltage islands architecting in
System-on-Chip design,” in Proc. International Conf. Computer Design
(ICCD), 2005, pp. 689–696.
[14] G. Varatkar and R. Marculescu, “Communication-aware task scheduling
and voltage selection for total systems energy minimization,” in Proc.
International Conf. Computer-Aided Design (ICCAD), 2003.
[15] G. Chen, F. Li, S. Son, and M. Kandemir, “Application mapping for chip
multiprocessors,” in Proc. ACM/IEEE Design Automation Conference,
2008, pp. 620 –625.
[16] W. Dally and B. Towles, Principles and Practices of Interconnection
Networks. Morgan Kaufmann Publishers, Inc., 2003.
[17] S. Boettcher and A. G. Percus, “Extremal optimization: Methods derived
from co-evolution,” in Genetic and Evolutionary Computation, 1999, pp.
825–832.
[18] I. De Falco, A. Della Cioppa, D. Maisto, U. Scafuri, and E. Tarantino,
“A multiobjective extremal optimization algorithm for efficient mapping
in grids,” in Applications of Soft Computing, vol. 58, 2009, pp. 367–377.
[19] G. D. Micheli and L. Benini, Networks on Chips: Technology and Tools
(Systems on Silicon). Morgan Kaufmann Publishers, Inc., 2006.
[20] D. Abts, N. D. E. Jerger, J. Kim, D. Gibson, and M. H. Lipasti,
“Achieving predictable performance through better memory controller
placement in many-core CMPs,” in Int. Symp. Computer Architecture,
2009, pp. 451–461.
[21] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by
simulated annealing,” Science, vol. 220, pp. 671–680, 1983.
[22] T. Zhou, W.-J. Bai, L.-J. Cheng, and B.-H. Wang, “Continuous extremal
optimization for Lennard-Jones clusters,” Phys. Rev. E, vol. 72, no. 1,
Jul 2005.
[23] A. Pinto, L. P. Carloni, and A. L. Sangiovanni-Vincentelli, “A method-
ology for constraint-driven synthesis of on-chip communications,” IEEE
Transactions on Computer-Aided Design, vol. 28, no. 3, pp. 364–377,
Mar. 2009.
[24] “CPLEX,” http://www.ilog.com/products/cplex.
