Static task mapping for tiled chip multiprocessors with multiple voltage islands by Nikitin, Nikita & Cortadella, Jordi
Static Task Mapping for Tiled Chip
Multiprocessors with Multiple Voltage Islands
Nikita Nikitin and Jordi Cortadella
Universitat Politècnica de Catalunya, Barcelona, Spain
Abstract. The complexity of large Chip Multiprocessors (CMP) makes
design reuse a practical approach to reduce the manufacturing and de-
sign cost of high-performance systems. This paper proposes techniques
for static task mapping onto general-purpose CMPs with multiple pre-
deﬁned voltage islands for power management. The CMPs are assumed
to contain diﬀerent classes of processing elements with multiple volt-
age/frequency execution modes to better cover a large range of applica-
tions. Task mapping is performed with awareness of both on-chip and
oﬀ-chip memory traﬃc, and communication constraints such as the link
and memory bandwidth. A novel mapping approach based on Extremal
Optimization is proposed for large-scale CMPs. This new combinato-
rial optimization method has delivered very good results in quality and
computational cost when compared to the classical simulated annealing.
Keywords: Chip Multiprocessing, Task Mapping, Power Management,
Extremal Optimization.
1 Introduction
Chip-multiprocessing (CMP) is becoming a major trend to take advantage of
Moore’s law under the power consumption limitations dictated by the heat dis-
sipation problems in high performance computing systems. Commercial and pro-
totype implementations have shown the performance gains achieved by CMPs
having up to a hundred cores [1–3]. As we move down to deep nanometric tech-
nologies, the design complexity of such systems increases signiﬁcantly. Manufac-
turing costs and time-to-market compromise the viability of new products that
are customized for speciﬁc applications.
Design reuse is a pragmatic solution to this problem, in both CMP design
and deployment. For an eﬀective reuse during deployment, CMPs are designed
general-purpose, to support a variety of applications. Hence, a methodology for
eﬃcient mapping of applications onto CMPs is essential. Many approaches have
been proposed to solve the mapping problem for application-specific and multiprocessor
on-chip systems (SoCs) [4]. However, there are significant differences
between SoCs and CMPs that are of great importance for the mapping
problem. To understand these differences, we have to consider two aspects of
CMPs: the tiled architecture and the organization of power management.
A. Herkersdorf, K. Römer, and U. Brinkschulte (Eds.): ARCS 2012, LNCS 7179, pp. 50–62, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. (a) Tiled CMP; (b) task graph to be mapped onto CMP
Tile replication was proposed for many-core CMPs in order to reduce the
design time [2, 3]. This led to the concept of tiled architecture, characterized by
regular structures of homogeneous or heterogeneous tiles [1, 5, 6]. Such systems
may include specialized processors (e.g., graphics, DSP) or diﬀerent implemen-
tations of the same architecture (e.g., in-order/out-of-order) with varied power-
performance trade-oﬀs. Figure 1(a) depicts a CMP with three classes of tiles:
general-purpose cores (C), cores with graphics units (G) and DSPs (D). Each tile
also incorporates cache memories of two levels (L1, L2) and an on-chip router
(R), connecting it to the interconnection fabric. Two memory controllers (MC)
are placed at the periphery to provide communication with the oﬀ-chip memory.
To assure the performance and thermal properties of the system, CMPs are
designed to operate under a certain power budget. One of the most eﬀective
ways to manage power is to ﬂoorplan various voltage islands and assign the best
voltage and frequency for each core [7]. Unfortunately, the number of voltage
islands is constrained by the design of the power delivery network and costly
implementation of voltage regulators [8]. It is therefore realistic to consider that
future CMPs will have many cores and voltage islands with several cores (e.g., 4
or 8). This fact imposes an additional constraint in the task mapping problem:
even though some cores could possibly run at lower voltages and frequencies,
sharing the island with other cores may rule out this flexibility.
Up to now, the research on power-aware mapping has assumed that the voltage
islands are defined pre-silicon during task mapping in SoCs, often disregarding
the cost of their implementation. A broad overview of the related work on SoC
application mapping and island planning can be found in [4]. The approach in [9]
considers performance constraints, but does not account for the communication
power. A more realistic approach is proposed in [10] in which computation and
communication are optimized taking into account a component related to volt-
age shifters. The work in [11] emphasizes the importance of optimizing both
static and dynamic CMP power and describes eﬃcient heuristics for scheduling
of streaming applications. Thermal-aware island creation via evolutionary algorithms
was proposed in [12]. The distinction of processor classes was introduced
in [13], but assuming that every processor runs at an independent voltage level.
1.1 Task Mapping for Tiled CMPs
The mapping problem we want to address diﬀers from previous ones in that the
CMP is assumed to be already manufactured and reused for several applications
(similarly to FPGA reconﬁguration). Therefore, the voltage islands have been
already planned and the maximum bandwidth of the links is also known a priori.
Another peculiarity of CMP mapping (as opposed to SoCs), captured by this
work, is the presence of traﬃc to the oﬀ-chip memory. Finally, the proposed
method is scalable to handle systems with hundreds of cores.
The work in [14] oﬀers a framework for accurate compiler-level mapping of
applications onto homogeneous mesh CMPs through analysis of the instructions
and allocation of data. The approach presented in this paper differs by considering
the variety of processing units offered by heterogeneous CMPs.
The examined problem consists of statically mapping a set of parallel tasks
onto a many-core CMP and selecting the voltages of islands so that the total
system power is minimized. Every task is considered as an infinite process
(encountered in control, automotive and robotics systems) and has an associated
throughput constraint that guarantees the required QoS for that task. A
variety of processor classes is supported, each one characterized by a set of volt-
age/performance/power parameters used to ﬁnd the best performance/power
trade-oﬀ for each task. It is assumed that the cores are organized in a mesh with
XY-routing [15]. The main contributions of this work can be summarized as:
– Mathematical formulation of a mixed-integer linear programming (MILP)
problem delivering optimal mapping solutions for small examples.
– Heuristic mapping optimization by Simulated Annealing (SA).
– Scalable approach based on Extremal Optimization (EO) [16], shown to out-
perform the optimization by SA, both in quality and computational cost.
Due to the lack of space and to provide the reader with intuition on the EO-
based approach, we only focus on this algorithm in the paper. The MILP and
SA mapping are presented in detail in the technical report [17].
The structure of the paper is as follows. The next section presents an overview of
the problem by considering a small example. Section 3 presents the deﬁnition of
the problem. The proposed solution technique based on Extremal Optimization
is explained in Section 4. Section 5 discusses the experimental results.
2 Problem Overview
This section illustrates the task mapping problem using a small example. Let us
assume a task graph with four tasks (Fig. 2(a)). There are three ﬂows between
the tasks, with the bandwidths speciﬁed in the arcs of the graph (in Gbps). Fig-
ure 2(b) depicts a CMP with four processors. There are two classes of processors:
C1 (light) and C2 (dark). The task graph must be mapped onto the CMP.
Communication-optimal mapping. Figure 2(c) shows a mapping that opti-
mizes the communication metric, i.e. the product of bandwidth and hop-count.
Assuming the distance between the neighboring processors is one hop, the cost
of the mapping is CCost1 = 1.0 · 1 + 1.0 · 1 + 0.5 · 2 = 3.0 (Gbps·hop).
Throughput-feasible mapping. Now let us take into account the processor parameters
and consider the throughput requirements of the tasks. Figure 2(d) describes
these parameters. Processors can operate at two voltages, 1.0 V and 0.8 V.
[Figure 2: (a) task graph; (b) processor mesh; (c) communication-optimal mapping; (d) processor and throughput data; (e) throughput-feasible mapping; (f) power-optimal mapping]
Fig. 2. Task mapping example
The corresponding frequency (F, in GHz) and power (P, in W) for each voltage
are shown in the tables. Tasks may be executed with a different performance
(Instructions Per Cycle, IPC) in each class of processors. Each task may also
require a specific throughput (given in giga-instructions per second, GIPS, in Fig. 2(d)).
The mapping in Fig. 2(c) becomes infeasible once the throughput constraints
are introduced. Consider task t2 assigned to a C2-processor. The maximum performance
that C2 can provide for t2 is IPC(t2) · F(1.0 V) = 0.8 · 0.5 = 0.4 GIPS,
while the throughput requirement for t2 is 0.8 GIPS. To satisfy the requirements,
tasks t2 and t3 are swapped (Fig. 2(e)). This mapping satisfies the throughput
constraints and still keeps the optimal value of the communication metric.
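The feasibility test used above can be written as a minimal sketch. The function names are hypothetical; the numbers are the ones quoted in the text for task t2 on a C2-processor at 1.0 V.

```python
def max_throughput_gips(ipc, freq_ghz):
    """Peak throughput in GIPS: instructions per cycle times cycles per ns."""
    return ipc * freq_ghz


def is_feasible(ipc, freq_ghz, required_gips):
    """True if the processor can sustain the throughput required by the task."""
    return max_throughput_gips(ipc, freq_ghz) >= required_gips


# Task t2 on a C2-processor at 1.0 V: IPC = 0.8, F = 0.5 GHz, requirement 0.8 GIPS.
t2_peak = max_throughput_gips(ipc=0.8, freq_ghz=0.5)
t2_ok = is_feasible(ipc=0.8, freq_ghz=0.5, required_gips=0.8)
print(round(t2_peak, 3), t2_ok)
```

With these values the peak throughput is half of what t2 requires, which is why the mapping of Fig. 2(c) has to be changed.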
Power-optimal mapping. As a final step, let us consider the partitioning of the
CMP into voltage islands. Let us assume the CMP has two islands, separated by
the bold dotted line, as shown in Fig. 2(e). Processors in the same island must
operate at the same voltage level, namely the minimum voltage required to satisfy
all the throughput constraints of the tasks mapped to this island.
For the mapping in Fig. 2(e), the upper island has to operate at 1.0 V, as dictated
by the throughput constraint of t3. The lower island also has to run at
1.0 V, because of t2. Thus, the computation power, calculated using the data
from Fig. 2(d), is Pcomp = 0.30 + 0.10 + 0.30 + 0.10 = 0.80 W. Let the energy to
transfer one bit over one hop be Ebit = 0.1 nJ/bit. Then the communication power is
Pcomm = CCost2 · Ebit = 3.0 Gbps·hop · 0.1 nJ/bit = 0.3 W, and the total power is P =
Pcomp + Pcomm = 1.10 W. Notice that if tasks t3 and t4 are swapped (Fig. 2(f)),
the upper island can lower the voltage to 0.8 V without violating the throughput.
The new computation power is Pcomp = 0.15 + 0.05 + 0.30 + 0.10 = 0.60 W. The
communication cost is increased: CCost3 = 1.0 · 1 + 1.0 · 2 + 0.5 · 1 = 3.5 (Gbps·hop),
so the communication power becomes Pcomm = CCost3 · Ebit = 3.5 · 0.1 = 0.35 W.
However, the total power P = 0.95 W decreases, making the assignment in
Fig. 2(f) the best one in terms of total power efficiency.
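The arithmetic of this example can be replayed with a short sketch. The helper names are hypothetical; the per-processor powers, flow bandwidths, hop counts and Ebit are the values quoted in the text for Fig. 2(e) and Fig. 2(f).

```python
E_BIT_NJ = 0.1  # energy per bit per hop, in nJ/bit (value used in the text)


def comm_cost(flows):
    """Sum of bandwidth (Gbps) times hop count over all flows, in Gbps*hop."""
    return sum(bw * hops for bw, hops in flows)


def total_power(core_powers_w, flows):
    """Return (Pcomp, Pcomm, P) in watts."""
    p_comp = sum(core_powers_w)
    # Gbps * hop * nJ/bit = 1e9 * 1e-9 W, so the product is already in watts.
    p_comm = comm_cost(flows) * E_BIT_NJ
    return p_comp, p_comm, p_comp + p_comm


# Fig. 2(e): both islands at 1.0 V.
fig_2e = total_power([0.30, 0.10, 0.30, 0.10], [(1.0, 1), (1.0, 1), (0.5, 2)])
# Fig. 2(f): t3 and t4 swapped, upper island at 0.8 V.
fig_2f = total_power([0.15, 0.05, 0.30, 0.10], [(1.0, 1), (1.0, 2), (0.5, 1)])

for name, (p_comp, p_comm, p) in (("2(e)", fig_2e), ("2(f)", fig_2f)):
    print(name, round(p_comp, 2), round(p_comm, 2), round(p, 2))
```

The second mapping pays 0.05 W more for communication but saves 0.20 W in computation, which is the trade-off the section describes.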
3 Problem Definition
This section defines the mapping problem. A task graph TG(T, F) is a directed
graph with vertices representing the tasks ti ∈ T. Each arc represents a communication
flow fsd ∈ F from task ts to td, with the minimum required bandwidth
Bsd. Every task ti has a throughput constraint IPS(ti), that is, the minimum
number of instructions per second required to provide the service delivered by
the task. Λ(ti) defines the total traffic rate between ti and the memory controller.
The ratio between the traffic to and from the controller is specified by the parameter
ρ. Note that the value of Λ(ti) can be approximated given the amount of
data operated on by the task (i.e., the working set) and the size and miss ratio of
the tile cache.
A CMP is represented by a mesh of processors PM(P, L) with W · H cells,
where P is the set of processors and L is the set of communication links with
capacity Cap. The performance of pj when executing task ti, measured in instructions
per cycle, is specified by the function IPC(ti, pj). The processors may operate at
different voltages. We assume a set of voltages V = {v1, ..., vV} available for all
processors. The frequency and power of pj operating at voltage vk are defined
by the functions F(pj, vk) and P(pj, vk), respectively. Every pj belongs to some
voltage island ιn, as defined by the island map I.
A CMP has a set of controllers to access the off-chip memory. We assume
that every processor pj is associated with one controller. This assumption can
be relaxed by defining the probabilities with which pj accesses different controllers.
Function McDist(pj) returns the hop-count distance from pj to the related
controller. The McBw parameter sets the maximum controller bandwidth to
guarantee the performance of memory access.
The mapping problem is to assign all tasks in T to the processors in P, mini-
mizing total power consumption and satisfying a set of design and performance
constraints. The assignment constraint restricts every processor to hold one task
at most. The voltage selection constraints force all processors in the same island
to run at one voltage. Throughput constraints ensure that for every task ti running
on processor pj at voltage vk the throughput requirement is met, that is,
IPC(ti, pj) · F(pj, vk) ≥ IPS(ti). The link capacity and memory bandwidth constraints
bound the total traffic through links and memory controllers, evaluated
under the assumption of XY-routing.
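As an illustration of how link traffic can be evaluated under XY-routing, the sketch below routes each flow first along X and then along Y and accumulates the bandwidth per link. It assumes directed links, tile coordinates (x, y), and inter-task flows only (memory traffic is left out); all names and the example data are hypothetical.

```python
from collections import defaultdict


def xy_route(src, dst):
    """Directed links visited when routing along X first, then along Y."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def link_loads(flows, placement):
    """Total bandwidth per link; flows maps (src_task, dst_task) -> bandwidth."""
    loads = defaultdict(float)
    for (s, d), bw in flows.items():
        for link in xy_route(placement[s], placement[d]):
            loads[link] += bw
    return loads


def violated_links(loads, cap):
    """Links whose accumulated bandwidth exceeds the capacity Cap."""
    return {link: bw for link, bw in loads.items() if bw > cap}


placement = {"t1": (0, 0), "t2": (1, 0), "t3": (1, 1)}
flows = {("t1", "t3"): 1.0, ("t2", "t3"): 0.5}
loads = link_loads(flows, placement)
over = violated_links(loads, cap=1.2)
print(dict(loads))
print(over)
```

In this toy placement both flows share the link from (1, 0) to (1, 1), so that link carries 1.5 and is the only one above the assumed capacity of 1.2.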
Let the binary variables aijk deﬁne whether task ti is mapped onto processor
pj operating at voltage vk. The computation power can be written as:
\[ P_{comp} = \sum_{t_i \in \mathcal{T}} \sum_{p_j \in \mathcal{P}} \sum_{v_k \in V} a_{ijk} \cdot P(p_j, v_k). \]
The communication power has two terms: the on-chip communication, deﬁned
by the ﬂows between the tasks, and the oﬀ-chip communication, due to the traﬃc
to and from memory controllers. It can be deﬁned as:
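\[ P_{comm} = \sum_{f_{sd} \in \mathcal{F}} B_{sd} \cdot h_{sd} \cdot E_{bit} + \sum_{t_i \in \mathcal{T}} \sum_{p_j \in \mathcal{P}} \sum_{v_k \in V} a_{ijk} \cdot \Lambda(t_i) \cdot McDist(p_j) \cdot E_{bit}, \]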
where hsd represents the hop-count of ﬂow fsd, and Ebit is the estimated energy
for transmitting one bit over a link. The objective of mapping becomes:
min : P = Pcomp + Pcomm. (1)
The formal deﬁnition of the problem via MILP is available in the report [17].
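The objective (1) can be evaluated for a given mapping along the following lines. This is only a sketch under stated assumptions: the data layout (dictionaries keyed by hypothetical task and processor names), the helper names, and the task parameters are invented for illustration; the frequency and power values are those of classes C1 and C2 at 1.0 V and 0.8 V from Table 2, converted to GHz and W. Each island runs at the lowest voltage that satisfies every task mapped to it.

```python
def hops(a, b):
    """Hop count between two mesh positions (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def evaluate(mapping, tasks, procs, flows, voltages, e_bit):
    """Total power P = Pcomp + Pcomm in watts, or None if a throughput
    constraint cannot be met. mapping: task name -> processor name."""
    island_v = {}
    for t, p in mapping.items():
        proc = procs[p]
        ipc = tasks[t]["ipc"][proc["cls"]]
        ok = [v for v in voltages if ipc * proc["freq"][v] >= tasks[t]["ips"]]
        if not ok:
            return None
        isl = proc["island"]
        # The island voltage is the highest of the per-task minimum voltages.
        island_v[isl] = max(island_v.get(isl, min(voltages)), min(ok))
    p_comp = sum(procs[p]["power"][island_v[procs[p]["island"]]]
                 for p in mapping.values())
    cost = sum(bw * hops(procs[mapping[s]]["pos"], procs[mapping[d]]["pos"])
               for (s, d), bw in flows.items())
    cost += sum(tasks[t]["lam"] * procs[p]["mc_dist"] for t, p in mapping.items())
    return p_comp + cost * e_bit  # Gbps*hop * nJ/bit = W


procs = {
    "p0": {"cls": "C1", "pos": (0, 0), "island": 0, "mc_dist": 1,
           "freq": {0.8: 0.6, 1.0: 0.8}, "power": {0.8: 0.07, 1.0: 0.15}},
    "p1": {"cls": "C2", "pos": (1, 0), "island": 0, "mc_dist": 2,
           "freq": {0.8: 0.25, 1.0: 0.35}, "power": {0.8: 0.06, 1.0: 0.12}},
}
tasks = {  # hypothetical IPC, IPS (GIPS) and memory traffic (Gbps) values
    "a": {"ipc": {"C1": 1.0, "C2": 1.0}, "ips": 0.7, "lam": 0.2},
    "b": {"ipc": {"C1": 1.0, "C2": 1.0}, "ips": 0.2, "lam": 0.0},
}
flows = {("a", "b"): 1.0}
p_ok = evaluate({"a": "p0", "b": "p1"}, tasks, procs, flows, [0.8, 1.0], 0.1)
p_bad = evaluate({"a": "p1", "b": "p0"}, tasks, procs, flows, [0.8, 1.0], 0.1)
print(p_ok, p_bad)
```

In the first mapping task a forces the shared island to 1.0 V even though b alone could run at 0.8 V; in the second mapping a cannot reach its throughput on the C2 processor at any voltage, so the mapping is reported as infeasible.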
4 Task Mapping with Extremal Optimization
This section applies the Extremal Optimization (EO) [16] metaheuristic
to the task mapping problem. Like Simulated Annealing (SA) [18], it
is inspired by equilibrium statistical physics. SA has been successfully applied
in many EDA problems, mostly related to layout synthesis. Recently EO has
emerged as a competitive alternative delivering superior results in quality and
computational cost. The SA algorithm is outlined in [17]. This section shows
how EO can be customized to eﬀectively solve the task mapping problem.
Extremal optimization is guided by the principle of evolution in ecosystems,
which were observed to evolve by selecting against their worst components. For every
possible solution, EO evaluates the fitness of each component in the system.
A high ﬁtness value indicates that the component has a comfortable low-cost
status in the system. EO focuses on improving the status of components with
low ﬁtness. At each iteration, some of the worst-ﬁt components are replaced
by other components that contribute to improve their ﬁtness. Local optima are
avoided by randomizing the selection process. The components are ranked ac-
cording to their ﬁtness in ascending order, and are randomly selected by some
probability distribution biased towards the ones with lowest ﬁtness values. The
power-law distribution is a typical one for EO. For example, if the system has
N components ranked from 1 to N in ascending order of their ﬁtness, the index
i of the selected component can be calculated as follows:
i = N · pτ (2)
where p is a random number obtained from a uniform distribution in the interval
[0, 1] and τ is the exponent of the power law.
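A minimal sketch of this selection rule follows. It uses a 0-based rank and truncates N · p^τ to an integer, which is an implementation choice not spelled out in the text; the function name is hypothetical.

```python
import random


def select_rank(n, tau, rng=random):
    """0-based rank in [0, n-1], biased towards low ranks (lowest fitness)."""
    p = rng.random()  # uniform in [0, 1)
    return min(n - 1, int(n * p ** tau))


rng = random.Random(0)
ranks = [select_rank(100, 4.0, rng) for _ in range(10000)]
low_share = sum(r < 25 for r in ranks) / len(ranks)
print(min(ranks), max(ranks), low_share)
```

For τ = 4.0 a rank falls in the lowest quarter whenever p^4 < 0.25, that is, for p below roughly 0.71, so about 70% of the draws pick one of the 25% worst-fit components while any component can still be selected.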
For the task mapping problem, at each iteration EO selects a pair of tasks
to be swapped: an unfavorable task (tu) and a replacement task (tr). EO uses
information about the system cost when selecting the swapped tasks, which results
in fast progress towards the final solution. In addition, EO accepts new
solutions unconditionally, thus making the algorithm easier to tune.
The mapping problem can be considered as a multiobjective optimization
problem, since the Pcomp and Pcomm terms of the cost function (1) depend on
weakly related voltage level and hop-count values. It was observed in [19] that the
multiobjective EO operates better by interleaving the optimization of individual
objectives in time, rather than trying to optimize all of them simultaneously. This
suggests introducing different fitness functions for the optimization of the power
components and alternating them at odd and even iterations of the algorithm.
Procedure 1. ExtremalOptimization
CurSol ← BestSol ← "some initial solution"
while some improvement in the last k iterations do
    if even iteration then /* improve communication cost */
        sort all tasks in ascending order of Φ^comm_u
        select tu according to equation (2)
        sort all tasks ti ≠ tu in ascending order of Φ^comm_r
        select tr according to equation (2)
    else /* improve computation cost */
        sort all tasks in ascending order of Φ^comp_u
        select tu according to equation (2)
        sort all tasks ti ≠ tu in ascending order of Φ^comp_r
        select tr according to equation (2)
    end if
    swap tasks tu and tr in CurSol
    if Cost(CurSol) < Cost(BestSol) then
        BestSol ← CurSol
    end if
end while
return BestSol
The EO algorithm is outlined in Procedure 1. The initial mapping for EO is
obtained by greedily placing the tasks with the highest throughput on the fastest
processors. The bandwidth constraints may be violated in the initial mapping
and will be handled during the optimization process. After the deﬁnition of
an initial solution, the execution is continued until no further improvement is
observed during a certain number of iterations.
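The loop of Procedure 1 can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the fitness functions are passed in as pairs (phi_u, phi_r) and alternated by iteration, and the toy usage places four tasks on a line of slots with a single communication-oriented pair. All names, the toy flows and the `patience` stopping parameter are assumptions.

```python
import random


def power_law_pick(ranked, tau, rng):
    """Pick an element of a list sorted by ascending fitness, as in equation (2)."""
    i = min(len(ranked) - 1, int(len(ranked) * rng.random() ** tau))
    return ranked[i]


def extremal_optimization(init, cost, fitness_pairs, tau=4.0, patience=200, seed=0):
    """init: task -> slot. fitness_pairs: list of (phi_u, phi_r), alternated.
    phi_u(sol, t) scores task t; phi_r(sol, t_u, t) scores swapping t with t_u.
    Lower values are more likely to be selected."""
    rng = random.Random(seed)
    cur, best = dict(init), dict(init)
    stale, it = 0, 0
    while stale < patience:  # stop after `patience` iterations without improvement
        phi_u, phi_r = fitness_pairs[it % len(fitness_pairs)]
        ranked = sorted(cur, key=lambda t: phi_u(cur, t))
        t_u = power_law_pick(ranked, tau, rng)
        others = sorted((t for t in cur if t != t_u),
                        key=lambda t: phi_r(cur, t_u, t))
        t_r = power_law_pick(others, tau, rng)
        cur[t_u], cur[t_r] = cur[t_r], cur[t_u]  # swap accepted unconditionally
        if cost(cur) < cost(best):
            best, stale = dict(cur), 0
        else:
            stale += 1
        it += 1
    return best


# Toy problem: four tasks on a line, cost = sum of bandwidth * distance.
flows = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 0.5}
init = {"a": 0, "c": 1, "b": 2, "d": 3}


def cost(sol):
    return sum(bw * abs(sol[s] - sol[d]) for (s, d), bw in flows.items())


def swapped(sol, x, y):
    new = dict(sol)
    new[x], new[y] = new[y], new[x]
    return new


def phi_u(sol, t):
    return -sum(bw * (sol[s] - sol[d]) ** 2
                for (s, d), bw in flows.items() if t in (s, d))


def phi_r(sol, t_u, t):
    return cost(swapped(sol, t_u, t))


best = extremal_optimization(init, cost, [(phi_u, phi_r)])
print(best, cost(best))
```

Passing two pairs, one communication-oriented and one computation-oriented, reproduces the even/odd alternation of Procedure 1; only the best solution seen so far is kept, since the current solution may get worse after an unconditional swap.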
To evaluate a conﬁguration, two functions are used. Cost() returns the cost of
a conﬁguration, calculated as the total system power according to equation (1).
CapP() calculates the penalty for link and memory bandwidth violations:
\[ CapP = \sum_{l \in \mathcal{L}} \max\left( \frac{B_l - Cap}{Cap},\ 0 \right) + \sum_{mc \in MC} \max\left( \frac{B_{mc} - McBw}{McBw},\ 0 \right), \]
where Bl is the total bandwidth of ﬂows routed through link l and Bmc is the
bandwidth of controller mc. If all constraints are satisﬁed, then CapP = 0.
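The penalty can be computed directly from the per-link and per-controller bandwidths. The sketch below assumes these are given as plain lists; the function name and the example values are hypothetical.

```python
def cap_penalty(link_bw, cap, mc_bw, mc_cap):
    """CapP: sum of relative overshoots over links and memory controllers."""
    link_term = sum(max((b - cap) / cap, 0.0) for b in link_bw)
    mc_term = sum(max((b - mc_cap) / mc_cap, 0.0) for b in mc_bw)
    return link_term + mc_term


# One link exceeds Cap = 1.2 by 25%, one controller exceeds McBw = 2.5 by 20%.
penalty = cap_penalty([1.0, 1.5, 0.4], 1.2, [2.0, 3.0], 2.5)
no_penalty = cap_penalty([1.0, 0.4], 1.2, [2.0], 2.5)
print(round(penalty, 3), no_penalty)
```

Because each term is a relative overshoot, a link and a controller contribute comparably even though their capacities differ.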
The core of the algorithm focuses on selecting the pair of tasks that must be
swapped. The ﬁtness functions alternate depending on the iteration number. In
one case, ﬁtness is oriented to improve the power generated by inter-task and
memory communication, considering the hop-counts and bandwidth parameters.
In the second case the power generated by computations is addressed.
The first task, tu, is selected by using the Φu fitness function and sorting the
tasks according to the fitness value. The second task, tr, is selected by ranking
the tasks according to the improvement in cost that the swap would produce (Φr
functions). The power law described by equation (2) is used to select the tasks
randomly. Finally, the locations of tu and tr are swapped unconditionally
and BestSol is updated if the cost is better than that of any solution visited so far.
4.1 Fitness Functions
To model the fitness for the power consumption generated by the inter-task and
memory traffic on the mesh, Φ^comm_u ranks the tasks according to the product of
total traffic and the square of hop-count of the involved flows:
\[ \Phi^{comm}_{u}(t_i) = - \sum_{f_{sd}:\,(t_s = t_i) \lor (t_d = t_i)} B_{sd} \cdot (h^{x}_{sd} + h^{y}_{sd})^2 - \Lambda(t_i) \cdot McDist(p_j)^2. \]
The square of hop-count penalizes tasks with longer ﬂows, rather than those
with high bandwidth, since Bsd and Λ(ti) are constants that cannot be changed.
The selection of the ranked tasks tends to pick tasks with high communication
cost. The negative sign allows the tasks to be ranked in ascending order of fitness.
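As a sketch, the communication fitness of the unfavorable task can be computed as follows, assuming tasks are placed at (x, y) mesh coordinates and that the memory-controller distance is given per tile. The function name, data layout and example values are hypothetical.

```python
def phi_comm_u(task, placement, flows, lam, mc_dist):
    """Negated sum of bandwidth * hop_count^2 over the flows touching the task,
    minus the memory traffic times the squared distance to the controller."""
    total = 0.0
    for (s, d), bw in flows.items():
        if task in (s, d):
            (sx, sy), (dx, dy) = placement[s], placement[d]
            total += bw * (abs(sx - dx) + abs(sy - dy)) ** 2
    total += lam[task] * mc_dist[placement[task]] ** 2
    return -total


placement = {"t1": (0, 0), "t2": (2, 1), "t3": (0, 1)}
flows = {("t1", "t2"): 1.0, ("t1", "t3"): 0.5}
lam = {"t1": 0.3, "t2": 0.0, "t3": 0.0}
mc_dist = {(0, 0): 1, (2, 1): 2, (0, 1): 2}

ranking = sorted(placement, key=lambda t: phi_comm_u(t, placement, flows, lam, mc_dist))
print(ranking)
```

Task t1 ends up first in the ranking because it is an endpoint of the three-hop flow and also carries memory traffic, so it is the most likely candidate for tu.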
The fitness function for the replacement task tr aims at selecting a task that,
when swapped with tu, would decrease the communication cost the most and contribute
to reducing the violations of maximum bandwidth:
\[ \Phi^{comm}_{r}(t_i) = Cost(NewSol) \cdot (1 + CapP(NewSol)), \]
where NewSol is the solution obtained by swapping ti and tu.
The computation-oriented fitness functions aim at finding power-efficient solutions
by smoothing the voltage spillover in the voltage islands. Let us call V^min_i
the minimum voltage required to guarantee the throughput of task ti assigned
to a processor in some voltage island ιn. Since the task shares the island
with other tasks, it may not be possible to assign V^min_i to it, as other tasks may
require a higher voltage.
We define the voltage spillover of ti as Spillover_i = V^min_i − V̄, where V̄ is the
average minimal voltage of all tasks allocated in the same voltage island. The
dispersion of island ιn is defined as
\[ Dispersion_n = \sum_{t_i \in \iota_n} (Spillover_i)^2 \]
and measures the voltage imbalance for the island. High dispersions imply less
power-efficient solutions, as more processors operate at voltages higher than
required. The computational fitness functions aim at decreasing the voltage dispersion of
the system. The unfavorable component is selected from the tasks with high
spillover values:
\[ \Phi^{comp}_{u}(t_i) = -Spillover_i. \]
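Spillover and dispersion for one island can be sketched as below, assuming the minimum voltage of every task in the island is already known. The function names and the example voltages are hypothetical.

```python
def spillovers(v_min):
    """Spillover of each task: its minimum voltage minus the island average."""
    avg = sum(v_min.values()) / len(v_min)
    return {t: v - avg for t, v in v_min.items()}


def dispersion(v_min):
    """Sum of squared spillovers of the tasks in one island."""
    return sum(s * s for s in spillovers(v_min).values())


# One task needs 1.2 V while its three neighbours would be satisfied with 0.8 V.
island = {"t1": 1.2, "t2": 0.8, "t3": 0.8, "t4": 0.8}
spill = spillovers(island)
phi_comp_u = {t: -s for t, s in spill.items()}  # fitness of the unfavorable task
worst = min(phi_comp_u, key=phi_comp_u.get)
print(worst, round(dispersion(island), 3))
```

Task t1 has the largest spillover and therefore the lowest fitness, so it is the most likely candidate to be moved out of the island; an island whose tasks all need the same voltage has zero dispersion.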
The replacement task is selected to maximize the product of the cost improvement
and the change in dispersion, penalizing solutions with large capacity violations:
\[ \Phi^{comp}_{r}(t_i) = \frac{1 + CapP(NewSol)}{\Delta Cost \cdot \Delta Dispersion}. \]
5 Experimental Results
This section considers optimal mappings obtained by MILP for small examples
and compares the quality and speed of the SA and EO heuristics. The latter is
shown to outperform the former in both metrics for a vast space of configurations.
Every test case is characterized by an application graph and a target CMP, as
described by Table 1. The number of tasks and ﬂows are reported in the second
and third columns. The fourth and ﬁfth columns show the mesh dimensions and
the number of memory controllers. The first group of examples is inspired by applications
widely used in SoC research (e.g., [10]): Multi-Window Displayer
(MWD), MPEG4 decoder (MPEG4) and Object Plane Decoder (OPD). To explore
the scalability of EO, we generate large examples for mapping onto 8 × 8 to
20 × 20-tile CMPs (test cases 64T to 400T). The graphs for these configurations
are obtained by combining instances of MWD, MPEG4 and OPD. To avoid
disconnected clusters of tasks, a few random flows were added between the cores.
Table 1. Testcase configurations
Name     # of tasks   # of flows   Grid size   # of MC
MWD      12           11           4 × 3       2
MPEG4    12           13           4 × 3       2
OPD      16           17           4 × 4       2
64T      64           90           8 × 8       4
144T     144          200          12 × 12     4
256T     256          380          16 × 16     8
400T     400          595          20 × 20     8
We consider three processor classes with different frequency and power parameters
operating at three voltage levels: 1.2 V, 1.0 V and 0.8 V. The parameters
are reported in Table 2. The distribution of tiles in the CMP is as follows: 20%
of the tiles have C1 processors, 30% have C2 processors and 50% have C3 processors.
Without loss of generality, we assume that all islands have the same size Svi.
Table 2. Parameters of the processor classes
Class   1.2 V               1.0 V               0.8 V
C1      1000 MHz, 260 mW    800 MHz, 150 mW     600 MHz, 70 mW
C2      450 MHz, 200 mW     350 MHz, 120 mW     250 MHz, 60 mW
C3      160 MHz, 55 mW      130 MHz, 30 mW      100 MHz, 15 mW
Every task has a throughput requirement (IPS) and a different performance
when executed on each class of processor (IPC). These values are defined randomly,
with IPC values in the interval [0.5, 2.0]. This randomization contributes
to an unbiased tuning of the metaheuristics. The traffic Λ(ti) between the
task ti and the memory controller is estimated as 20% of the total traffic between ti and
other tasks. The ratio between the request and reply traffic is set to ρ = 0.2.
SA is parametrized by two values: the initial temperature Tinit and the cooling
factor α. We define Tinit = 10^4 and only vary the α value. Given that a large
range of α is explored for each experiment, the solution quality is not dependent
on Tinit. The only parameter of EO is τ, the exponent of the power law.
5.1 Simulated Annealing and Extremal Optimization: Comparison
The MILP formulation yields optimal solutions for the mapping problem;
however, it is computationally affordable only for small examples. Table 3 displays
the times required to solve the MILP problem by CPLEX [20] and to find
optimal solutions by SA and EO, for the examples of the first group. This comparison
confirms that both metaheuristics perform very well for the small
examples with known optimum.
Table 3. Time to reach the optimal solution (sec)
Name     MILP       SA      EO
MWD      85.25      0.01    0.01
MPEG4    120.17     0.02    0.01
OPD      4594.40    1.17    0.08
Next, we compare the SA and EO metaheuristics for the task mapping
problem, using the 256T example with Svi = 16. The timeout for execution
is set to 200 seconds, as no significant improvements are observed after that
time for either method. Figure 3(a) depicts the evolution of the cost function
value for SA with various α and EO with τ = 4.0. SA traces corresponding
to higher values of α drop more slowly, but achieve better solutions in the long run.
The resulting cost discovered by EO upon timeout outperforms any of the SA
solutions by 12%. Another important fact is that the EO solution cost drops rapidly
to the value obtained in the long run. This makes EO useful when fast estimation
of the cost is required, e.g., in exploration loops.
In this work we perform multiple SA runs with diﬀerent α values and select
the best results. The aim is to show that EO is a better alternative even with
a very good tuning of SA. It was also observed that EO is much less sensitive
to τ and to the size of the problem. Hence, we always deﬁne τ = 4.0, the value
found to deliver good results for all test cases.
5.2 Power Optimization with EO
The goal of this section is to study the reduction in power that EO delivers in
comparison with SA upon the timeout of 200 seconds. Examples of diﬀerent size
[Figure 3: (a) SA and EO cost in time: power (W) versus runtime (sec, 0 to 200) for SA with α = 0.999, 0.99995, 0.99998 and 0.99999 and for EO with τ = 4.0, with a 12% gap marked; (b) power reduction by EO: EO/SA power ratio for configurations 64T/4 through 400T/16, for Pcomp/Pcomm ≈ 0.2, 1.0 and 5.0]
Fig. 3. Power optimization by metaheuristics
are considered (64T to 400T) with voltage islands varied among 4, 8 and 16
processors. Another important parameter is the ratio between the computation
and communication power, as it reﬂects the ability of the approach to give pri-
ority to one power component or improve both simultaneously. Inspired by the
results in [13], three values for Pcomp/Pcomm are explored: 0.2, 1.0 and 5.0.
Figure 3(b) reports the power (equation (1)) of the EO solution with respect to
the best value obtained by SA. For each conﬁguration, denoted as testcase/Svi
along the X-axis, three values for diﬀerent Pcomp/Pcomm are shown. For the
majority of conﬁgurations EO outperformed the results of SA, with a maximum
gain in power of 22.5%. Only for 3 of 36 explored conﬁgurations EO was slightly
worse than SA. The diﬀerence in this case did not exceed 2.0%.
EO tends to perform better than SA at higher Pcomp as well as for larger
values of Svi. In other words, EO better minimizes the voltage of islands, due
to the consideration of voltage spillover. As the island size grows, the number
of tasks that must be swapped in order to improve the voltage also increases.
This is one of the important features of EO, since it can model the fitness of
each component in the system. Modeling the voltage spillover in SA is difficult,
since only a global cost is considered in the acceptance of moves.
As an example, Fig. 4 shows the voltage distributions for the 256T/16 test
case. Every island contains a mixture of C1, C2 and C3 processors, as shown in
Fig. 4(a). The ﬁnal voltage assignment is represented by the three colors in the
ﬁgure. The SA solution has 8 islands at 1.2V, 5 at 1.0V and 3 at 0.8V. The EO
solution has 3 islands at 1.2V, 6 at 1.0V and 7 at 0.8V. The estimated power
consumption of the EO solution is 12% lower. Another intuitive result is that
the total power grows with the size of voltage islands (Fig. 4(c)), since larger
islands imply less mapping ﬂexibility for individual tiles to reduce voltage.
[Figure 4: (a) SA solution; (b) EO solution; (c) EO power (W) for voltage island sizes 4, 8 and 16]
Fig. 4. Voltage distribution and power for 256T example
6 Conclusions
Design reuse is becoming a major paradigm for engineering many-core systems. This
paper has addressed the problem of static task mapping for tiled CMPs with mul-
tiple voltage islands, as an approach to reduce design cost and time-to-market.
The problem formulation considers task throughput requirements, on-chip and
off-chip memory traffic, and bandwidth constraints. Extremal optimization has
been shown to be an efficient method for solving this combinatorial problem.
Acknowledgement. This research has been funded by project CICYT TIN2007-
66523, FPI grant BES-2008-004612, and grants from Intel Corporation and Cata-
lan Government (SGR 2009-1137).
References
1. Pham, D., et al.: Overview of the architecture, circuit design, and physical imple-
mentation of a ﬁrst-generation cell processor. J. Solid-State Circuits 41 (2006)
2. Bell, S., et al.: Tile64 - processor: A 64-core SoC with mesh interconnect. In: Solid-
State Circuits Conference (2008)
3. Vangal, S., et al.: An 80-tile 1.28Tﬂops network-on-chip in 65nm CMOS. In: Solid-
State Circuits Conference (2007)
4. Marculescu, R., Ogras, U.Y., Peh, L.-S., Jerger, N.E., Hoskote, Y.: Outstanding
research problems in NoC design: system, microarchitecture, and circuit perspec-
tives. IEEE Trans. on Computer-Aided Design of Integrated Circuits 28 (2009)
5. Azimi, M., et al.: Integration Challenges and Tradeoffs for Tera-scale Architectures.
Intel Technology Journal (2007)
6. Balakrishnan, S., Rajwar, R., Upton, M., Lai, K.: The impact of performance
asymmetry in emerging multicore architectures. In: International Symposium on
Computer Architecture (2005)
7. Lackey, D., et al.: Managing power and performance for System-on-Chip designs
using voltage islands. In: Int. Conf. Computer-Aided Design (2002)
8. Kim, W., Gupta, M.S., Wei, G.-Y., Brooks, D.: System level analysis of fast, per-
core DVFS using on-chip switching regulators. In: International Symposium on
High Performance Computer Architecture (2008)
9. Mak, W.-K., Chen, J.-W.: Voltage island generation under performance require-
ment for SoC designs. In: Asia and South Paciﬁc Design Automation Conference
(2007)
10. Ghosh, P., Sen, A.: Energy eﬃcient mapping and voltage islanding for regular NoC
under design constraints. J. High Perform. Syst. Archit. 2 (2010)
11. Xu, R., Melhem, R., Mosse, D.: Energy-aware scheduling for streaming applications
on chip multiprocessors. In: Int. Symp. Real-Time Systems (2007)
12. Hung, W.-L., et al.: Temperature-aware voltage islands architecting in System-on-
Chip design. In: Int. Conf. Computer Design (2005)
13. Varatkar, G., Marculescu, R.: Communication-aware task scheduling and voltage
selection for total systems energy minimization. In: Int. Conf. Computer-Aided
Design (2003)
14. Chen, G., Li, F., Son, S., Kandemir, M.: Application mapping for chip multipro-
cessors. In: Design Automation Conference (2008)
15. Dally, W., Towles, B.: Principles and Practices of Interconnection Networks (2003)
16. Boettcher, S., Percus, A.G.: Extremal optimization: Methods derived from co-
evolution. In: Genetic and Evolutionary Computation Conf. (1999)
17. Nikitin, N., Cortadella, J.: Static task mapping for tiled chip multiprocessors with
multiple voltage islands. Technical Report (2011),
http://www.lsi.upc.edu/~techreps/files/R11-13.zip
18. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing.
Science 220 (1983)
19. De Falco, I., Della Cioppa, A., Maisto, D., Scafuri, U., Tarantino, E.: A multiob-
jective extremal optimization algorithm for eﬃcient mapping in grids 58 (2009)
20. CPLEX, http://www.ilog.com/products/cplex
