Power Control for Crossbar-based Input-Queued Switches by Bianco, Andrea et al.
Politecnico di Torino
Porto Institutional Repository
[Article] Power Control for Crossbar-based Input-Queued Switches
Original Citation:
Bianco A.; Giaccone P.; Masera G.; Ricca M. (2013). Power Control for Crossbar-based Input-
Queued Switches. In: IEEE TRANSACTIONS ON COMPUTERS, vol. 62 n. 1, pp. 74-82. - ISSN
0018-9340
Availability:
This version is available at : http://porto.polito.it/2460514/ since: November 2011
Publisher:
IEEE
Published version:
DOI:10.1109/TC.2011.220
Terms of use:
This article is made available under terms and conditions applicable to Open Access Policy Article
("Public - All rights reserved") , as described at http://porto.polito.it/terms_and_conditions.
html
Power Control for Crossbar-based
Input-Queued Switches
Andrea Bianco, Paolo Giaccone, Guido Masera, Marco Ricca
Dipartimento di Elettronica, Politecnico di Torino, Italy
Abstract—We consider an N×N input-queued switch with a crossbar-
based switching fabric implemented on a single chip. The power con-
sumption produced by the crossbar chip and due to the data transfer
grows as $N R^3$, where R is the maximum bit rate. Thus, at increasing
bit rate, power dissipation is becoming more and more challenging,
limiting the crossbar scalability for high performance switches.
We propose to exploit Dynamic Voltage and Frequency Scaling
(DVFS) techniques to control packet transmissions through each cross-
point of the switching fabric. Our power control operates independently
of the packet scheduler and exploits the knowledge of a traffic matrix
obtained by on-line measurements. We propose a family of control
algorithms to reduce the power consumption. The algorithms are partic-
ularly efficient in non-overloaded conditions. The actual potential of the
proposed approach is also evaluated on a real design case synthesized
on a 90 nm CMOS technology.
Index Terms—Input queued switch, power control, dynamic voltage
frequency scaling.
1 INTRODUCTION
The aggregate bandwidth of high speed routers is grow-
ing fast, due to the increased traffic demand in the
Internet. To support traffic growth, in core routers a
switching fabric that must switch data at increasing
speed is often implemented on a single integrated circuit.
The hardware design of such fabric is becoming more
and more critical, because of the large pin count and
the high bit rate. Indeed, if f is the maximum digital
signal frequency, the power consumption of a CMOS
device is proportional to $f^3$ [1]. In an N × N single-chip
crossbar with $N^2$ crosspoints, each implemented
through proper logic blocks, there are $\Theta(N^2)$ (see footnote 1) CMOS
components (i.e., a fixed number for each crosspoint),
and the total power consumption becomes proportional
to $R^3 N$, where R is the data-transmission bit rate and N
is the maximum number of data simultaneously flowing
across the switching fabric.
Thermal power dissipation is becoming a critical de-
sign issue, due to high integration level on a single
chip, that implies very high power spatial density [2].
In integrated circuits, Dynamic Voltage and Frequency
Scaling (DVFS) [1], a classical technique used to control
the power consumption, is based on the idea of jointly
varying the power supply voltage and the peak signal
frequency. In this paper we propose to exploit DVFS
for the power control of a single-chip crossbar, to reduce
the power consumption at the cost of increasing packet
delays at low-medium loads without sacrificing switch
throughput. The main idea is to reduce the power when
the traffic load is low, extending the packet transmission
duration through bit voltage and frequency reduction.

1. In Landau notation, function g(n) is Θ(h(n)) if, for n → ∞,
$k_1 h(n) < g(n) < k_2 h(n)$ for some positive constants $k_1$ and $k_2$.

Indeed, networks are typically provisioned for worst-
case or peak-hour traffic. However, several measure-
ments (see for example [3]) show that backbone utiliza-
tion rarely exceeds 30%, thus suggesting that exploiting
low traffic conditions can be a significant asset to reduce
power. We propose a set of algorithms for power control
that operate on an estimated traffic matrix to assess the
potential power gain that can be obtained exploiting
DVFS. We take an idealized approach based on fluid
model, i.e., we disregard the interaction with packet
scheduling algorithms that select the packets to be trans-
ferred across the switching fabric. We only concentrate
on the power of the crossbar chip, not considering the
power contribution of other components of the switching
architecture.
The paper is organized as follows. The system model
is defined in Sec. 2, while Sec. 3 formalizes the optimal
crossbar chip power control problem, describes its prop-
erties, and proposes a set of algorithms to solve it. Per-
formance results in Sec. 4 show the possible power gain
of our approach. Details on the hardware architecture
for a 410 Gbps crossbar are provided in Sec. 5, where
we show that the synthesis results fit the theoretical
model well.
2 PROBLEM DEFINITION
We start by considering a single CMOS component, the
basis of the combinatorial logic of a single crosspoint in
the crossbar chip.
2.1 Energy model for a single CMOS gate
The energy consumption of a CMOS gate is strongly
dependent on the supply voltage V and it can be mod-
eled as the sum of a dynamic energy component (due
to electrical signal switching activity needed to transfer
a sequence of 0s and 1s) and a static energy component
(due to leakage currents). We consider only the dy-
namic energy component, while we neglect the latter
contribution. Leakage currents tend to be proportional
to occupied area and are normally controlled by means
of circuit level techniques that are out of the scope of
this work. The energy due to a bit transition (i.e., the
switching activity) is a quadratic function of V according
to the well-known formula $E_{bit} = 0.5\,C V^2$, where C is
the load capacitance. If we consider a 0-1 square wave
signal with frequency f, the power consumption is
$$P = E_{bit}\, f \propto f V^2 \qquad (1)$$
that represents also the thermal power to dissipate. The
allowed frequency is f ∝ V due to the delay needed
to switch from one logic state to another [4]. Thereby,
the power consumption for a CMOS operating at maximum
frequency and voltage is proportional to $f^3$. DVFS
techniques jointly reduce V and f to minimize power
consumption, exploiting time periods in which the signal
can be “slowed down” to a lower peak frequency. This
approach is actually implemented in commercial CPUs,
where the processing speed changes with the instanta-
neous processing load [5].
We consider a CMOS device operating at voltage V ,
ranging between Vmin and Vmax. Within this range, we
assume that bit transmissions can occur at intermediate
voltage levels. When operating at V < Vmax, since f ∝ V ,
the signal frequency can be slowed down by a factor
α = Vmax/V with respect to the maximum frequency
allowed when using Vmax. Thus, α is the expansion factor
of the bit duration with respect to the bit duration
when using Vmax. Furthermore, V must be larger than
Vmin > 0, because technological constraints forbid
reducing the voltage level too far and because leakage
currents would otherwise become non-negligible.
Define β = Vmin/Vmax. Depending on the
technology, β = 0.5 for a classical DVFS scheme or
β = 0.3 in the case of an “extreme” DVFS scheme,
according to [1]. By construction, 1 ≤ α ≤ 1/β.
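The relations of this section can be illustrated with a minimal Python sketch. It assumes a gate that toggles continuously at the highest frequency its supply voltage allows, so that P ∝ f V^2 with f ∝ V; the function names are illustrative and do not come from the paper.

```python
def expansion_factor(v, v_max):
    """Bit-duration expansion factor alpha = Vmax / V."""
    if not 0 < v <= v_max:
        raise ValueError("V must lie in (0, Vmax]")
    return v_max / v


def relative_gate_power(alpha, beta):
    """Power relative to operation at Vmax for a continuously toggling gate.

    Since f is proportional to V, P ~ f * V^2 ~ V^3 = (Vmax / alpha)^3,
    hence the ratio to the power at Vmax is alpha^-3.
    """
    if not 1.0 <= alpha <= 1.0 / beta:
        raise ValueError("alpha must satisfy 1 <= alpha <= 1/beta")
    return alpha ** -3


# Largest expansion and corresponding power ratio for the two DVFS schemes
# mentioned in the text (beta = 0.5 and beta = 0.3).
for beta in (0.5, 0.3):
    alpha_max = 1.0 / beta
    print(beta, alpha_max, relative_gate_power(alpha_max, beta))
```

Note that this ratio refers to a gate that is always busy. When the same number of bits is simply spread over a longer time, the power scales as 1/α^2, which is the form used for the crosspoints in Sec. 3.2.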
2.2 Switching architecture
We consider in Fig. 1 an N×N input queued (IQ) switch,
with virtual output queueing (VOQ), i.e. one queue
VOQij for each input i and output j pair. The IQ archi-
tecture ensures high scalability in line rate and number
of ports, and the VOQ scheme is theoretically optimal
from the performance point of view. The switching fabric
is an N × N crossbar chip, with N2 crosspoints and
Θ(N2) CMOS components. The crosspoint connecting
input i to output j is denoted as XPij and is fed by
VOQij traffic.
A packet scheduler [6] selects the set of packets transferred
simultaneously through the crossbar, satisfying
the constraints that at most one packet is sent from each
input and to each output, to avoid output conflicts. We
do not focus on any particular scheduler, although for
simplicity the model assumptions hint at packet schedulers
able to achieve 100% throughput under admissible
traffic. The scheduling decisions occur at a packet level,
with a time granularity equal to the minimum packet
duration. In the case of minimum Ethernet packet size
and 10 Gbit/s line rates, a new scheduling decision must
be taken every 50 ns. Given such a strict timing constraint,
packet schedulers are often implemented directly
in hardware, but off-chip, i.e., on a separate chip with
respect to the crossbar chip.
[Fig. 1. Power control scheme in an IQ switch. Blocks shown: VOQs, crossbar, packet scheduler, rate estimator, power control.]
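The 50 ns figure can be checked with a short calculation. The sketch below assumes a 64-byte minimum Ethernet frame and ignores preamble and inter-frame gap; both choices are assumptions of this example, not statements from the paper.

```python
MIN_FRAME_BYTES = 64     # assumption: minimum Ethernet frame, no preamble or inter-frame gap
LINE_RATE_BPS = 10e9     # 10 Gbit/s line rate, as in the text


def slot_duration_ns(frame_bytes, line_rate_bps):
    """Transmission time of one frame, in nanoseconds."""
    return frame_bytes * 8 / line_rate_bps * 1e9


print(slot_duration_ns(MIN_FRAME_BYTES, LINE_RATE_BPS))  # about 51.2 ns
```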
3 CROSSBAR POWER CONTROL
The aim of the power control block in Fig. 1 is to
exploit DVFS at crosspoints to reduce the crossbar chip
power consumption. Based on traffic measurements on
the ms scale which provide rate estimations, the control
determines the DVFS factor αij for the combinatorial
logic at XPij , assuming that each crosspoint is controlled
independently. Due to the relaxed timing constraints,
the algorithm for power control is assumed to be im-
plemented as a software component running on an off-
chip processor. Since we focus on crossbar power con-
sumption, we disregard the power contribution of the
scheduler and of the power control block. However, the
only additional power consumption introduced by our
proposed DVFS is due to the power control block; this
contribution is negligible with respect to the scheduler
consumption due to comparable algorithmic complexity
and much larger time scale.
Let α = [αij ] be the N × N matrix with the DVFS
factors currently employed in the crossbar. Note that
setting αij > 1 implies that the forwarding rate at
XPij is reduced and the packet transmission time is
increased by the expansion factor αij . This has two
main consequences: i) an additional queueing delay in
VOQij , ii) the packet scheduler cannot serve any new
packet from input i and to output j until XPij ends the
packet transmission. Thus, the packet scheduler should
be slightly modified to take into account DVFS factors in
packet scheduling. We disregard this issue in the paper,
and we take an ideal fluid-based approach, looking only
at I/O flow rates, to evaluate the possible asymptotic
benefits in terms of reduced power consumption. Note
that extending packet duration might influence switch
throughput and buffer size requirements. However, the
power control algorithms avoid switch overloading by
increasing packet duration only at low-medium input
load. This translates into an internal load increase. In
other words, the switch always operates internally in a
high load regime, regardless of the real input load, but
never in overload. As such, buffer requirements are not
modified, because buffer sizes are designed for high load
conditions, which are not modified by the power control
scheme.
3.1 Input traffic
To avoid dealing with data content, we assume that a
data packet of length L is transmitted using L signal
transitions, i.e., each packet is composed of a sequence
of alternating 0s and 1s.
Denote the maximum line rate as rmax, measured in
[bit/s]: rmax is achievable only for V = Vmax. The
traffic load on each link is measured on a time window
whose duration Tw is in the order of ms. Let rij be
the average arrival rate [bit/s] for the traffic flows
enqueued at VOQij during the current time window,
and R = [rij ] the corresponding N × N traffic matrix.
Let S = [sij ] be the normalized traffic matrix obtained
by setting sij = rij/rmax, with sij ∈ [0, 1]. We assume
that sij > 0 for any i and j.
Definition 1: The average load of matrix S is defined as
$$\rho_{ave}(S) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} s_{ij}$$
Definition 2: The average load at input i and at output j
is $\rho^I_i(S) = \sum_{k=1}^{N} s_{ik}$ and $\rho^O_j(S) = \sum_{k=1}^{N} s_{kj}$, respectively.
Definition 3: The maximum load of matrix S is
$\rho_{max}(S) = \max\{\max_k\{\rho^I_k(S)\}, \max_k\{\rho^O_k(S)\}\}$.
Definition 4: The traffic matrix S is said to be admissible
iff $\rho_{max}(S) \le 1$.
Obviously, $\rho_{ave}(S) \le \rho_{max}(S)$.
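Definitions 1-4 translate directly into code. The following sketch represents S as a list of rows; the function names are illustrative only.

```python
def input_loads(S):
    """Row sums: load offered by each input (Definition 2)."""
    return [sum(row) for row in S]


def output_loads(S):
    """Column sums: load destined to each output (Definition 2)."""
    return [sum(col) for col in zip(*S)]


def average_load(S):
    """Sum of all entries divided by N (Definition 1)."""
    return sum(sum(row) for row in S) / len(S)


def max_load(S):
    """Largest row or column sum (Definition 3)."""
    return max(input_loads(S) + output_loads(S))


def is_admissible(S, tol=1e-12):
    """Admissibility check (Definition 4), with a small numerical tolerance."""
    return max_load(S) <= 1.0 + tol


S = [[0.2, 0.3],
     [0.1, 0.4]]
print(input_loads(S), output_loads(S), average_load(S), max_load(S), is_admissible(S))
```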
3.2 The minimum power control problem
To keep bounded queues and delays, and to avoid over-
load, we model the constraints related to the maximum
time expansion allowed for the transmitted bits. During
a measurement period, the total number of arrived bits is
$T_w r_{ij}$, smaller than the maximum number of bits $T_w r_{max}$
that can be transmitted at $V_{max}$. Hence, the maximum
allowed expansion factor for each bit is $r_{max}/r_{ij}$, i.e.
$\alpha_{ij} r_{ij} \le r_{max}$. At the same time, to avoid overload, it
is necessary to limit the expansion at each input and
output:
$$\sum_{k=1}^{N} \alpha_{ik} r_{ik} \le r_{max}, \qquad \sum_{k=1}^{N} \alpha_{kj} r_{kj} \le r_{max} \qquad \forall i, j$$
which can be normalized as
$$\sum_{k=1}^{N} \alpha_{ik} s_{ik} \le 1, \qquad \sum_{k=1}^{N} \alpha_{kj} s_{kj} \le 1 \qquad \forall i, j \qquad (2)$$
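Similarly to (1), the power consumption of XPij, denoted
as $P_{ij}$, is proportional to
$$P_{ij} \propto r_{ij} \left(\frac{V_{max}}{\alpha_{ij}}\right)^2 = s_{ij}\, r_{max} \left(\frac{V_{max}}{\alpha_{ij}}\right)^2 \propto \frac{s_{ij}}{\alpha_{ij}^2}$$
The total crossbar power consumption is the sum of the
power contributions of all crosspoints:
$$P_{tot} = \sum_{i=1}^{N} \sum_{j=1}^{N} P_{ij} \propto f_P(\alpha) = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{s_{ij}}{\alpha_{ij}^2} \qquad (3)$$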
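where $f_P(\alpha)$ is a power cost factor. Finally, the minimum
power problem (denoted as OPT-MP) becomes:
given an admissible S, find a feasible α minimizing $f_P$:
$$\min_{\alpha} f_P(\alpha) = \min_{\{\alpha_{ij} \in \mathbb{R}^+\}_{i,j}} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{s_{ij}}{\alpha_{ij}^2} \qquad (4)$$
subject to
$$\sum_{k=1}^{N} \alpha_{ik} s_{ik} \le 1 \quad \forall i \qquad (5)$$
$$\sum_{k=1}^{N} \alpha_{kj} s_{kj} \le 1 \quad \forall j \qquad (6)$$
$$\alpha_{ij} \in A \quad \forall i, j \qquad (7)$$
where A is the set of all available voltage levels.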
Property 1: OPT-MP is an integer convex non-linear
optimization problem.
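As an illustration of the formulation, the sketch below evaluates the power cost factor of (3) and checks the row and column constraints for a candidate α. Representing the set A as a list of allowed expansion factors (the optional `levels` argument) is an assumption of this example.

```python
def power_cost(S, alpha):
    """Power cost factor: sum over all crosspoints of s_ij / alpha_ij^2."""
    return sum(s / a ** 2
               for s_row, a_row in zip(S, alpha)
               for s, a in zip(s_row, a_row))


def is_feasible(S, alpha, levels=None, tol=1e-9):
    """Check the row and column constraints and, optionally, membership in A.

    `levels` is a hypothetical list of allowed expansion factors; when it is
    None, only the load constraints are checked.
    """
    expanded = [[a * s for s, a in zip(s_row, a_row)]
                for s_row, a_row in zip(S, alpha)]
    rows_ok = all(sum(row) <= 1.0 + tol for row in expanded)
    cols_ok = all(sum(col) <= 1.0 + tol for col in zip(*expanded))
    levels_ok = levels is None or all(
        any(abs(a - lv) <= tol for lv in levels) for row in alpha for a in row)
    return rows_ok and cols_ok and levels_ok


S = [[0.2, 0.3],
     [0.1, 0.4]]
no_dvfs = [[1.0, 1.0], [1.0, 1.0]]
too_slow = [[2.0, 2.0], [2.0, 2.0]]   # violates the constraint on output 2
print(power_cost(S, no_dvfs), is_feasible(S, no_dvfs), is_feasible(S, too_slow))
```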
3.2.1 Continuous version of the problem
Following a standard methodology, we start by relaxing
OPT-MP to continuous variables. This leads to the following
problem, denoted as CONT-MP: minimize $f_P(\alpha)$
subject to (5) and (6); (7) is substituted by
$$\alpha_{ij} \ge 1 \quad \forall i, j$$
corresponding to a DVFS scheme in which any voltage
between 0 and $V_{max}$ is allowed (see footnote 2). Let $\hat{\alpha}_{\text{OPT-MP}}$ be the
optimal solution of OPT-MP. Let $\hat{\alpha}_{\text{CONT-MP}}$ be the optimal
solution of CONT-MP. Since CONT-MP is a relaxed
version of OPT-MP, $f_P(\hat{\alpha}_{\text{CONT-MP}})$ is a lower bound on the
power cost of OPT-MP:
Property 2: $f_P(\hat{\alpha}_{\text{CONT-MP}}) \le f_P(\hat{\alpha}_{\text{OPT-MP}})$.
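Theorem 1: CONT-MP is equivalent to
$$\min_{\alpha} f_P(\alpha) \qquad (8)$$
subject to
$$\sum_{k=1}^{N} \alpha_{ik} s_{ik} = 1 \quad \forall i \qquad (9)$$
$$\sum_{k=1}^{N} \alpha_{kj} s_{kj} = 1 \quad \forall j \qquad (10)$$
$$\alpha_{ij} \ge 1 \quad \forall i, j \qquad (11)$$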
Proof: Assume $\hat{\alpha} = [\hat{\alpha}_{ij}]$ to be the optimal solution.
Define $\hat{s}_{ij} = \hat{\alpha}_{ij} s_{ij}$. By contradiction, assume that there
exists i such that $\sum_k \hat{s}_{ik} < 1$, i.e. the i-th row of $\hat{S} = [\hat{s}_{ij}]$
sums to less than one (the same argument holds for the
case in which a column sums to less than one). Now two cases
can occur. In the first case, there also exists a column j
that sums to less than one, i.e. $\sum_k \hat{s}_{kj} < 1$. Hence, it is
possible to increase $\hat{s}_{ij}$ to $\hat{s}'_{ij}$ while satisfying constraints
(5)-(6). The new corresponding $\alpha'_{ij} = \hat{s}'_{ij}/s_{ij}$ is feasible
and provides a lower cost function; this contradicts
our assumption. In the second case, all the columns
sum to one and, summing over all the columns, we
have $\sum_j \sum_k \hat{s}_{kj} = N$, which contradicts the assumption
$\sum_i \sum_k \hat{s}_{ik} < N$.

2. The constraint on Vmin will be discussed at the end of the section.

Note that one of the constraints in (9)-(10) is linearly
dependent on the others and can be omitted.
Definition 5: Given a non-negative matrix $H \in \mathbb{R}^{N \times N}$,
H is said to be ρ-double-stochastic if $\rho^I_i(H) = \rho^O_j(H) = \rho$
for any i and j, i.e. $\rho_{ave}(H) = \rho_{max}(H) = \rho$. A 1-double-stochastic
matrix is usually called a double-stochastic matrix.
Definition 6: Given a non-negative matrix $H \in \mathbb{R}^{N \times N}$,
H is said to be ρ-sub-stochastic if $\rho_{ave}(H) \le \rho_{max}(H) = \rho$.
Thanks to Theorem 1, CONT-MP translates to: given
a ρ-sub-stochastic matrix S, find a double-stochastic
matrix Sˆ = [sˆij ] such that the set of αij = sˆij/sij
minimizes fP (α). In other words, S is augmented to
become double-stochastic.
The following Theorem provides an easily computable
optimal solution:
Theorem 2: Given a ρ-double-stochastic matrix S, the
optimal solution $\hat{\alpha}$ for CONT-MP is $\hat{\alpha}_{ij} = 1/\rho$, for
any i, j. The corresponding power cost factor is
$f_P(\hat{\alpha}_{\text{CONT-MP}}) = N\rho^3$.
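Proof: The proof is based on the use of Lagrange
multipliers and on Taylor's Theorem for multivariate
functions. Denote by ⊗ the Hadamard product (i.e.,
element-by-element) of two matrices. Define $\hat{\alpha}$ as the
optimal solution given by $\hat{\alpha}_{ij} = 1/\rho$ and define α, with
$\alpha \ne \hat{\alpha}$, a generic feasible solution satisfying (9) and (10);
$\alpha \otimes S$ and $\hat{\alpha} \otimes S$ are both double-stochastic matrices.
We can define the matrix $\Delta = \alpha - \hat{\alpha}$ and assume that
$\max_{i,j}\{\Delta_{ij}\} \le \epsilon$ where $\epsilon > 0$. We can use the Birkhoff-von
Neumann Theorem [7] to claim that there exists a set of
real numbers $\gamma_k$ such that
$$\Delta \otimes S = \sum_k \gamma_k M^k, \qquad \sum_k \gamma_k = 0 \qquad (12)$$
where $M^k = [m^k_{ij}]$ is a permutation matrix. Equivalently,
$$\Delta_{ij} = \sum_k \gamma_k \frac{m^k_{ij}}{s_{ij}} \qquad (13)$$
For algebraic convenience, consider the vectorization
form of a matrix; the column vector form of
matrix Δ is denoted by $\boldsymbol{\Delta}$. By the classical Taylor's Theorem
for multivariate functions,
$$f_P(\alpha) - f_P(\hat{\alpha}) = \boldsymbol{\Delta}^T \nabla f_P(\hat{\alpha}) + \frac{1}{2} \boldsymbol{\Delta}^T H(\eta) \boldsymbol{\Delta} \qquad (14)$$
where $H(\eta)$ is the Hessian matrix computed in $\eta =
(1 - \gamma)\hat{\alpha} + \gamma\alpha = \hat{\alpha} + \gamma\Delta$, for some constant $\gamma \in [0, 1]$.
Equivalently,
$$\eta_{ij} = \hat{\alpha}_{ij} + \gamma \Delta_{ij} \qquad (15)$$
We first show that the first term on the right-hand side
of (14) is null. Indeed, by (12) and (13):
$$\boldsymbol{\Delta}^T \nabla f_P(\hat{\alpha}) = \sum_{i,j} \frac{-2 s_{ij}}{\hat{\alpha}_{ij}^3} \sum_k \gamma_k \frac{m^k_{ij}}{s_{ij}} = \sum_{i,j} (-2\rho^3) \sum_k \gamma_k m^k_{ij} = (-2\rho^3) \sum_k \gamma_k \sum_{i,j} m^k_{ij} = (-2\rho^3) \sum_k \gamma_k N = 0$$
Let us now consider the second term on the right-hand
side of (14). Observe that $H(\alpha)$ is a diagonal matrix, in
which the element corresponding to the (i, j) pair is equal
to $6 s_{ij}/\alpha_{ij}^4$. Hence, by (15) and since $\hat{\alpha}_{ij} = 1/\rho$:
$$\boldsymbol{\Delta}^T H(\eta) \boldsymbol{\Delta} = \sum_{i,j} \Delta_{ij}^2 \frac{6 s_{ij}}{\eta_{ij}^4} = \sum_{i,j} \Delta_{ij}^2 \frac{6 s_{ij} \rho^4}{(1 + \gamma\rho\Delta_{ij})^4}$$
Let $\epsilon' = \min_{ij}\{\Delta_{ij} \mid \Delta_{ij} > 0\}$ and $s' = \min_{ij}\{s_{ij}\}$. Finally,
keeping the factor 1/2 of (14), we can claim
$$f_P(\alpha) - f_P(\hat{\alpha}) = \frac{1}{2} \boldsymbol{\Delta}^T H(\eta) \boldsymbol{\Delta} \ge \frac{3\rho^4 (\epsilon')^2 s'}{(1 + \gamma\rho\epsilon)^4} > 0$$
which means that any $\alpha \ne \hat{\alpha}$ that satisfies (9) and (10)
cannot be the optimal solution.
The minimum power cost factor is immediately obtained
by computing $f_P(\hat{\alpha})$.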
In Sec. 5, we validate the cubic relation between power
and load through the results of the actual hardware
synthesis of a crossbar chip. Furthermore, we can get
an important intuition from the above theorem, which
will drive the design of approximated algorithms for the
CONT-MP problem: In the optimal solution, all the αij are
expanded proportionally by the same factor.
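A small numerical check of Theorem 2 can make the result concrete. The sketch below uses a hand-picked 2 × 2 example with ρ = 0.5 (the matrices are chosen for illustration and are not taken from the paper): the uniform expansion 1/ρ reaches the cost Nρ^3, while another feasible augmentation to a double-stochastic matrix costs more.

```python
def power_cost(S, alpha):
    """Power cost factor: sum over all crosspoints of s_ij / alpha_ij^2."""
    return sum(s / a ** 2
               for s_row, a_row in zip(S, alpha)
               for s, a in zip(s_row, a_row))


N, rho = 2, 0.5
S = [[0.25, 0.25],
     [0.25, 0.25]]                      # rho-double-stochastic with rho = 0.5

# Uniform expansion alpha_ij = 1 / rho (the optimum stated by Theorem 2).
uniform = [[1.0 / rho] * N for _ in range(N)]

# A different feasible choice: alpha = S_hat / S for another
# double-stochastic matrix S_hat (all resulting alpha_ij are >= 1).
S_hat = [[0.6, 0.4],
         [0.4, 0.6]]
skewed = [[S_hat[i][j] / S[i][j] for j in range(N)] for i in range(N)]

cost_uniform = power_cost(S, uniform)
cost_skewed = power_cost(S, skewed)
print(cost_uniform, N * rho ** 3, cost_skewed)
```

In this example the non-uniform choice is strictly more expensive, in agreement with the theorem.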
When considering also the constraint on $V_{min}$, the
expansion ratio is limited by $\alpha_{ij} \le 1/\beta$. For ρ-double-stochastic
matrices, the optimal solution becomes $\alpha_{ij} =
\min(1/\rho, 1/\beta)$, ∀i, j and the corresponding optimal power
cost factor for CONT-MP becomes:
$$f_P(\hat{\alpha}_{\text{CONT-MP}}) = \begin{cases} N\rho\beta^2 & \text{if } \rho < \beta \\ N\rho^3 & \text{if } \rho \ge \beta \end{cases} \qquad (16)$$
Thus, β is the value of “critical load” above which
DVFS is not able to expand the bit duration up to the
maximum factor 1/β, due to the
constraints imposed by the traffic load in (2).
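Equation (16) follows from clipping the uniform expansion at 1/β, as the following sketch shows for a ρ-double-stochastic matrix, whose entries sum to Nρ. The function names are illustrative.

```python
def cont_mp_alpha(rho, beta):
    """Uniform expansion factor, limited by load (1/rho) and by Vmin (1/beta)."""
    return min(1.0 / rho, 1.0 / beta)


def cont_mp_cost(n, rho, beta):
    """Power cost factor: the total load N*rho divided by alpha^2."""
    alpha = cont_mp_alpha(rho, beta)
    return n * rho / alpha ** 2


# Below the critical load the cost is N*rho*beta^2, above it N*rho^3.
print(cont_mp_cost(4, 0.2, 0.5), 4 * 0.2 * 0.5 ** 2)
print(cont_mp_cost(4, 0.8, 0.5), 4 * 0.8 ** 3)
```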
Consider now a relaxed version of the CONT-MP
problem, denoted as MISO-MP (Multiple-Inputs Single-
Output), in which we remove the expansion constraints
(9) on each input.
Theorem 3: Given any admissible traffic matrix S, the
optimal solution of MISO-MP is given by $\alpha_{ij} = 1/\rho^O_j(S)$.
The corresponding power cost factor is:
$$f_P(\hat{\alpha}_{\text{MISO-MP}}) = \sum_j \left(\rho^O_j(S)\right)^3$$
Note that this result does not require S to be a double-stochastic
matrix.
Proof: Define the Lagrange function as
$$\Lambda = \sum_{ij} \frac{s_{ij}}{\alpha_{ij}^2} + \sum_j \lambda_j \left( \sum_k s_{kj} \alpha_{kj} - 1 \right)$$
A necessary condition for the solution to be optimal is
$\partial\Lambda/\partial\alpha_{ij} = 0$, which implies $-2 s_{ij} \alpha_{ij}^{-3} + \lambda_j s_{ij} = 0$.
It follows that $\alpha_{ij} = (2/\lambda_j)^{1/3}$, i.e. for a fixed j, all the
$\alpha_{ij}$ are constant. Thus (10) becomes $\alpha_{ij} \sum_k s_{kj} = 1$ and
hence $\alpha_{ij} = 1/\rho^O_j(S)$. This also satisfies (11). By simple
substitution, we get the corresponding power cost.
Property 3: fP (αˆMISO-MP ) ≤ fP (αˆCONT-MP )
i.e. MISO-MP provides a lower bound, simple to com-
pute, for CONT-MP and OPT-MP under any admissible
traffic matrix.
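Since the MISO-MP bound only needs the column sums, it can be computed in a few lines. The sketch below assumes, as in Sec. 3.1, that every entry of S is positive; the function name is illustrative.

```python
def miso_mp(S):
    """Return the MISO-MP expansion factors and the resulting lower bound.

    Each column j uses the same factor 1 / rho_O_j, so the cost is the sum
    of the cubed output loads.
    """
    out_loads = [sum(col) for col in zip(*S)]
    alpha = [[1.0 / load for load in out_loads] for _ in S]
    bound = sum(load ** 3 for load in out_loads)
    return alpha, bound


S = [[0.2, 0.3],
     [0.1, 0.4]]
alpha, bound = miso_mp(S)
print(alpha, bound)
```

For this example the bound is about 0.37, against a cost of 1.0 when all factors are set to 1.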
3.2.2 Power consumption without DVFS
A feasible, but not optimal, solution for OPT-MP is when
no DVFS scheme is adopted, i.e. αij = 1 for all i, j. We
define this scheme as NODVFS and the corresponding
solution as αˆNODVFS . The power cost factor fP under any
admissible traffic matrix S can be obtained by setting
αij = 1 in (3):
$$f_P(\hat{\alpha}_{\text{NODVFS}}) = \sum_{i=1}^{N} \sum_{j=1}^{N} s_{ij} = N \rho_{ave}(S) \qquad (17)$$
denoting a linear relationship between the average load
on S and the total power consumption.
Property 4: fP (αˆOPT-MP ) ≤ fP (αˆNODVFS ).
Thus fP (αˆNODVFS ) is a loose upper bound for OPT-MP.
We define the relative power η(αˆ) of a DVFS solution
αˆ, relative to NODVFS, as:
$$\eta(\hat{\alpha}) = \frac{f_P(\hat{\alpha})}{f_P(\hat{\alpha}_{\text{NODVFS}})} = \frac{f_P(\hat{\alpha})}{N \rho_{ave}(S)}. \qquad (18)$$
Since η(αˆ) ∈ [0, 1], the closer η(αˆ) to zero, the larger the
scheme gain with respect to NODVFS.
For double-stochastic matrices, dividing (16) by (17):
Property 5: Under ρ-double-stochastic matrices,
$\eta(\hat{\alpha}_{\text{CONT-MP}}) = \beta^2$ for ρ < β, and $\rho^2$ for ρ ≥ β.
In summary, the solution to the CONT-MP problem,
which uses any voltage level between Vmin and Vmax,
provides a lower bound for the power of the OPT-
MP problem. When the matrix is double-stochastic, the
optimal solution to CONT-MP is trivial. Otherwise, a
lower bound can be found with the solution of MISO-
MP, trivial to compute.
3.3 Power control algorithms
To solve OPT-MP for any traffic matrix we propose to: i)
solve the corresponding CONT-MP problem, ii) approxi-
mate each αij to the closest smaller value available in the
set A. In other words, if αij is the solution for CONT-MP,
then use for OPT-MP:
α′ij = max{α ∈ A | α ≤ αij}
Note that, by construction, the set of α′ij defines an
admissible solution for OPT-MP.
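The rounding step can be sketched as follows. The example assumes that the set A is given as a list of allowed expansion factors (the reciprocals of the normalized voltage levels); the function name and the sample levels are illustrative.

```python
def quantize_down(alpha, levels):
    """Largest available expansion factor not exceeding alpha.

    Rounding down never increases alpha_ij * s_ij, so the row and column
    constraints satisfied by the continuous solution remain satisfied.
    """
    candidates = [lv for lv in levels if lv <= alpha]
    if not candidates:
        raise ValueError("no available level is <= alpha")
    return max(candidates)


# Assumed example: voltage levels 1, 0.71 and 0.5 (normalized to Vmax)
# correspond to expansion factors 1, 1/0.71 and 2.
levels = [1.0, 1.0 / 0.71, 2.0]
print([quantize_down(a, levels) for a in (1.2, 1.7, 2.5)])
```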
To solve CONT-MP, we adopt a quasi-optimal algorithm
based on the logarithmic barrier method for
convex problems [8], which provides an ε-approximation
of the optimal solution. Furthermore, we adopt a two-step
algorithm: we augment S to a double-stochastic $\hat{S}$
according to one of the algorithms AUGM-1, AUGM-MAX
or AUGM-SORT, described below. Then, we compute
$\alpha_{ij} = \hat{s}_{ij}/s_{ij}$.
INCREASE-MATRIX Algorithm
Input: N × N matrix $S = [s_{ij}]$, $\{\rho^I_i\}_{i=1}^{N}$, $\{\rho^O_j\}_{j=1}^{N}$, $\rho_T$, $\Omega^I$, $\Omega^O$.
Output: N × N matrix $\Delta = [\delta_{ij}]$
1. $\delta_{ij} = 0$ for any 1 ≤ i, j ≤ N
2. $\Omega^{IO} = \{(i, j) : i \in \Omega^I, j \in \Omega^O\}$
3. repeat until no choice is available anymore
4.    choose any $(i, j) \in \Omega^{IO}$ such that $\max\{\rho^I_i, \rho^O_j\} < \rho_T$
5.    $\delta_{ij} = \min\{\rho_T - \rho^I_i, \rho_T - \rho^O_j\}$
6.    $\rho^I_i = \rho^I_i + \delta_{ij}$, $\rho^O_j = \rho^O_j + \delta_{ij}$
We now describe the INCREASE-MATRIX procedure,
on which all the augmentation algorithms are based. The
inputs of the procedure are i) a sub-stochastic matrix S,
ii) the corresponding row sums $\rho^I_i$ and column sums $\rho^O_j$; iii)
a target load value $\rho_T$ such that $\max_k\{\rho^I_k, \rho^O_k\} \le \rho_T \le 1$,
and iv) a set of input ports $\Omega^I$ and output ports $\Omega^O$. The
algorithm returns a matrix $\Delta = [\delta_{ij}]$ with the largest
possible elements such that: (i) only the elements $\delta_{ij}$
whose row belongs to $\Omega^I$ and whose column belongs
to $\Omega^O$ may be > 0; (ii) the maximum row and column
sum is $\rho_T$, i.e.
$$\sum_{k=1}^{N} (s_{ik} + \delta_{ik}) \le \rho_T \quad \text{for any } i \in \Omega^I$$
$$\sum_{k=1}^{N} (s_{kj} + \delta_{kj}) \le \rho_T \quad \text{for any } j \in \Omega^O$$
The algorithm operates only on a sub-matrix restricted
to the rows in $\Omega^I$ and the columns in $\Omega^O$. It chooses
a sequence of elements whose row and column sum to
less than $\rho_T$. Then, each element in the sub-matrix is
augmented to reach $\rho_T$ without violating the constraints.
Note that the maximum number of iterations in step 3
is upper bounded by 2N.
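A minimal Python sketch of the procedure is given below. It makes two simplifications that are assumptions of this example: the row and column sums are computed inside the function instead of being passed as inputs, and the "choose any (i, j)" step is realized by a single row-major sweep. A single sweep suffices because loads never decrease: if a row and a column were both still below the target at the end, the pair would have been augmented when it was visited, saturating one of the two.

```python
def increase_matrix(S, rho_t, rows, cols, tol=1e-12):
    """Return Delta such that S + Delta has row/column sums at most rho_t.

    Only entries (i, j) with i in `rows` and j in `cols` may be increased.
    Assumes rho_t is not smaller than any row or column sum involved.
    """
    n = len(S)
    row_load = [sum(r) for r in S]
    col_load = [sum(c) for c in zip(*S)]
    delta = [[0.0] * n for _ in range(n)]
    for i in sorted(rows):
        for j in sorted(cols):
            # Largest increase that keeps both row i and column j within rho_t.
            d = min(rho_t - row_load[i], rho_t - col_load[j])
            if d > tol:
                delta[i][j] = d
                row_load[i] += d
                col_load[j] += d
    return delta


S = [[0.2, 0.3],
     [0.1, 0.4]]
delta = increase_matrix(S, rho_t=0.7, rows={0, 1}, cols={0, 1})
print(delta)
```

In this example all rows and columns of S + Δ sum to 0.7, i.e. the augmented matrix is 0.7-double-stochastic.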
Having defined INCREASE-MATRIX, we now describe
the algorithms we propose to augment S to a double-
stochastic Sˆ:
• AUGM-1:
1) compute $\rho^I_i$ and $\rho^O_j$ for any i and j;
2) run INCREASE-MATRIX on S, $\rho^I_i$, $\rho^O_j$, $\rho_T = 1$, $\Omega^I = \Omega^O = \{1, \ldots, N\}$;
3) compute $\hat{s}_{ij} = s_{ij} + \delta_{ij}$ for all i and j.
Note that AUGM-1 is a classical iterative algorithm (see
Sec. II.A of [7]) to augment a sub-stochastic matrix to a
double-stochastic one. The complexity is $O(N^2)$, due to
steps 1) and 3).
• AUGM-MAX:
1) compute $\rho^I_i$ and $\rho^O_j$ for any i and j;
2) compute $\rho_{max}(S)$;
3) run INCREASE-MATRIX on S, $\rho^I_i$, $\rho^O_j$, $\rho_T = \rho_{max}(S)$, $\Omega^I = \Omega^O = \{1, \ldots, N\}$;
4) compute $\hat{s}_{ij} = s_{ij} + \delta_{ij} + (1 - \rho_{max}(S))/N$.
The complexity of AUGM-MAX is $O(N^2)$, due to steps
1) and 4).
• AUGM-SORT:
1) compute $\rho^I_i$ and $\rho^O_j$ on S for any i and j;
2) sort $\rho^I_i$ and $\rho^O_j$ in increasing order. Let i(k) be the kth
input and j(k) be the kth output in such increasing
sequences;
3) initialize an auxiliary matrix $X^{(0)} = S$ and set $\Omega^I_0 = \Omega^O_0 = \emptyset$;
4) iterate, for k from 1 to N, the following steps:
a) $\Omega^I_k = \Omega^I_{k-1} \cup \{i(k)\}$, i.e. the set of the inputs with
the k smallest row sums;
b) $\Omega^O_k = \Omega^O_{k-1} \cup \{j(k)\}$, i.e. the set of the outputs with
the k smallest column sums;
c) run INCREASE-MATRIX on $X^{(k-1)}$, $\rho^I_i$, $\rho^O_j$, $\Omega^I_k$,
$\Omega^O_k$ and $\rho^{(k)}_T = \max\{\rho^I_{i(k)}, \rho^O_{j(k)}\}$, i.e. $\rho^{(k)}_T$ is the
maximum load for the first k inputs and outputs
of S;
d) $x^{(k)}_{ij} = x^{(k-1)}_{ij} + \delta_{ij}$ for any i, j, i.e. set $X^{(k)} = X^{(k-1)} + \Delta$;
e) if k < N, go to a) to start a new iteration;
5) compute $\hat{s}_{ij} = x^{(N)}_{ij} + (1 - \rho_{max}(X^{(N)}))/N$.
The complexity of AUGM-SORT is $O(N^2)$ by optimizing
the data structure to choose an $(i, j) \in \Omega^{IO}$ in INCREASE-MATRIX
and by sorting $\rho^I_i$ and $\rho^O_j$ only once.
Theorem 2 suggests that the optimal way to augment
S is proportionally, at least for some families of
traffic. AUGM-1 is a classical way to augment a matrix.
Instead, AUGM-MAX and AUGM-SORT tend to augment
the matrix more proportionally.
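The difference between AUGM-1 and AUGM-MAX can be seen on a small example. The sketch below restates the greedy augmentation compactly for the case in which all ports are eligible, and ignores the Vmin limit on the expansion factors; the helper names and the sample matrix are assumptions of this example.

```python
def _augment_to(S, rho_t, tol=1e-12):
    """Greedy INCREASE-MATRIX over all ports; returns S + Delta."""
    X = [row[:] for row in S]
    r = [sum(row) for row in X]
    c = [sum(col) for col in zip(*X)]
    for i in range(len(X)):
        for j in range(len(X)):
            d = min(rho_t - r[i], rho_t - c[j])
            if d > tol:
                X[i][j] += d
                r[i] += d
                c[j] += d
    return X


def augm_1(S):
    """AUGM-1: augment directly to target load 1."""
    return _augment_to(S, 1.0)


def augm_max(S):
    """AUGM-MAX: augment to rho_max(S), then add (1 - rho_max) / N everywhere."""
    n = len(S)
    rho_max = max([sum(r) for r in S] + [sum(c) for c in zip(*S)])
    pad = (1.0 - rho_max) / n
    return [[x + pad for x in row] for row in _augment_to(S, rho_max)]


def cost_of_augmentation(S, S_hat):
    """With alpha_ij = s_hat_ij / s_ij, s_ij / alpha_ij^2 = s_ij^3 / s_hat_ij^2."""
    return sum(s ** 3 / sh ** 2
               for s_row, sh_row in zip(S, S_hat)
               for s, sh in zip(s_row, sh_row))


S = [[0.2, 0.3],
     [0.1, 0.4]]
cost_1 = cost_of_augmentation(S, augm_1(S))
cost_max = cost_of_augmentation(S, augm_max(S))
print(cost_1, cost_max)
```

On this matrix AUGM-MAX yields a lower power cost than AUGM-1 and stays above the MISO-MP bound (about 0.37 for the same S), which is in line with the intuition that a more proportional augmentation is preferable.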
4 PERFORMANCE EVALUATION
We first discuss the performance for ρ-double-stochastic
matrices. Then, we move to ρ-sub-stochastic matrices.
4.1 Power consumption for double-stochastic matri-
ces
According to Theorem 2, the optimal solution for CONT-
MP is expressed by (16). Fig. 2 shows the power consump-
tion per port fP (αˆ)/N vs. the average load, for the optimal
solution of CONT-MP and β ∈ {0.3, 0.5, 0.7}. We show
also the linear growth of NODVFS, computed with (17).
For small loads, DVFS is very efficient, reducing the
power by a factor $1/\beta^2$ (see Property 5), equal to about 11, 4
and 2, respectively, for each value of β. For larger loads,
the DVFS power reduction decreases, becoming negligible
in highly loaded conditions, because bit expansion is
not allowed due to the high traffic load.
[Fig. 2. Optimal solution for continuous DVFS (CONT-MP), under any ρ-double-stochastic matrix. Power per port vs. average load. Curves: NoDVFS; Cont-MP with β = 0.3, β = 0.5 and β = 0.7.]
We now consider the effect of a finite set A of voltage
levels. Table 1 shows the worst-case (for any load)
ratio between the consumption of OPT-MP with a finite
set of voltage levels and the consumption of CONT-MP
with continuous DVFS, as a function of the number of
available voltage levels. The |A|−2 intermediate voltage
levels between Vmin and Vmax have been numerically
optimized to minimize such ratio. Note that very few
intermediate levels (i.e., one for β = 0.5) are sufficient to
observe differences lower than 10%. Hence, the simple
solution to CONT-MP approximates well the solution to
the OPT-MP problem. Finally, very few voltage levels
are enough to exploit the potential benefits of DVFS.
TABLE 1
The power consumption ratio between DVFS with
discrete voltage levels (OPT-MP) and continuous DVFS
(CONT-MP), for double-stochastic matrices. The last column
is the maximum over 0 ≤ ρ ≤ 1 of $f_P(\hat{\alpha}_{\text{OPT-MP}}) / f_P(\hat{\alpha}_{\text{CONT-MP}})$.
|A|   β     Voltage levels / Vmax          Worst-case ratio
3     0.3   0.3, 0.55, 1                   1.31
3     0.5   0.5, 0.71, 1                   1.09
3     0.7   0.7, 0.84, 1                   1.02
4     0.3   0.3, 0.45, 0.67, 1             1.13
4     0.5   0.5, 0.63, 0.79, 1             1.04
4     0.7   0.7, 0.78, 0.89, 1             1.01
5     0.3   0.3, 0.41, 0.55, 0.74, 1       1.07
5     0.5   0.5, 0.60, 0.71, 0.84, 1       1.02
5     0.7   0.7, 0.76, 0.84, 0.92, 1       1.01
4.2 Power consumption for sub-stochastic matrices
We consider the family of random traffic matrices
generated as follows. Given ρ ∈ (0, 1], generate a matrix
$U = [u_{ij}]$ of $N^2$ random variables, uniformly distributed
on the interval (0, 1]. Then, derive each element of S as
$s_{ij} = u_{ij}\,\rho / \rho_{max}(U)$. Using this construction, it can be
shown that the corresponding average load $\rho_{ave}(S) \approx
\rho / (1 + \sqrt{0.67 \log(N)/N})$ for large enough N.
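The construction can be sketched with the Python standard library as follows. The function name and the use of a seeded generator are choices of this example; by construction the resulting matrix has maximum load equal to ρ.

```python
import random


def random_traffic_matrix(n, rho, rng):
    """Scale a matrix of uniform (0, 1] entries so that its maximum load is rho."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1].
    U = [[1.0 - rng.random() for _ in range(n)] for _ in range(n)]
    rho_max_u = max(max(sum(row) for row in U),
                    max(sum(col) for col in zip(*U)))
    return [[u * rho / rho_max_u for u in row] for row in U]


rng = random.Random(0)          # seeded only to make the example repeatable
S = random_traffic_matrix(16, 0.6, rng)
rho_max = max(max(sum(row) for row in S), max(sum(col) for col in zip(*S)))
rho_ave = sum(sum(row) for row in S) / len(S)
print(rho_max, rho_ave)
```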
We compare the algorithms proposed in Sec. 3.3 for
continuous DVFS because, as shown in the previous
section, CONT-MP is a good approximation of OPT-MP
even when few voltage levels are available. We show the
optimal solution for CONT-MP only for smaller switch
sizes (N = 16), due to computational constraints. We
also report the lower bound provided by
MISO-MP. Although the results shown are for β = 0.3, similar
results were obtained for other values of β.
[Fig. 3. Relative power vs. average load for N = 16 and β = 0.3, under sub-stochastic matrices. Curves: NoDVFS, Augm-1, Augm-MAX, Augm-SORT, Optimal, Miso-MP.]
[Fig. 4. Relative power vs. average load for N = 256 and β = 0.3, under sub-stochastic matrices. Curves: NoDVFS, Augm-1, Augm-MAX, Augm-SORT, Miso-MP.]
Figs. 3 and 4 show the relative power (Eq. (18)) for different
N. Note that, to ensure admissibility, the maximum
average load in the abscissa is limited by construction
to be always less than $1/(1 + \sqrt{0.67 \log(N)/N})$, i.e. 0.75
and 0.88 for N = 16 and N = 256 respectively.
When increasing ρave(S), the relative power of MISO-
MP shows a quadratic growth, similarly to double-
stochastic matrices for which Property 5 holds. The
behavior is close to the optimal solution, justifying its use
to approximate CONT-MP for large N . Even if not op-
timal, AUGM-SORT and AUGM-MAX show performance
close to the optimal. On the contrary, AUGM-1 behaves
the worst, only providing minor power reductions with
respect to NODVFS.
Similar results hold for N = 256 in Fig. 4. We were
unable to obtain the optimal solution in reasonable time.
AUGM-1 does not provide any benefit. AUGM-SORT and
AUGM-MAX provide performance close to the lower
bound MISO-MP. Thus, these DVFS schemes appear
to be the most efficient, especially at low average load,
regardless of the switch size.
[Fig. 5. Mux-based 3 × 3 crossbar, with inputs and outputs.]
5 HARDWARE DESIGN AND EVALUATION
To better explore the effects of DVFS on a real switch
fabric, a 128 × 128 crossbar switch was adopted as a
case study. To optimize crossbar scalability, instead of the
classical X-Y architecture, we chose a mux-tree based
pipelined architecture. Indeed, in classical X-Y based
crossbar switches [9], any input–output connection is
provided by horizontal and vertical wires spanning the
whole area. Hence, propagation delay along wires tends
to grow rapidly with the number of input-output ports
and soon becomes the limiting factor for throughput
performance. Multiple bit slices can be used to cope with
limited clock frequency, while reaching at the same time
high line throughput. However, in this case, improved
performance comes at the cost of additional implemen-
tation complexity.
High data rates over a large switch, with more than
one hundred input output ports, can be obtained at
a lower implementation complexity with a mux-tree
based pipelined architecture [9], shown in Fig. 5: Each
output is connected through a tree of multiplexers that
receive all input ports. Two basic features of the tree
organization can be exploited to improve speed: (i) the
entire multiplexing operation can be split in several
tree stages, with each stage individually sized to match
timing constants according to its load capacitance, and
(ii) pipeline registers can be inserted along the tree to
cut critical path delays, thus achieving very high clock
frequency.
The mux-tree based pipelined switch of size 128 × 128
was modeled in VHDL and synthesized
to derive area occupation, achievable throughput and
dissipated power. Fig. 6 shows the structure of a single
slice of the crossbar fabric: each input port receives data
serially and the 128 inputs are divided into two halves,
each dealing with 64 inputs.
Internal registers are used to provide pipelining.
In the upper half of the fabric, the first and second pipeline
stages contain 16 and 4 multiplexers of size 4 × 1,
respectively. A single 4 × 1 multiplexer is allocated in
the third pipeline stage. The same structure is repeated
in the lower half, and a 2 × 1 multiplexer is used for the
final selection. Thus, the slice shown forms a 128 × 1
multiplexer with pipelining. To control the whole set of
multiplexers, 85 select lines are required.
The complete fabric architecture consists of 128 slices
identical to the one given in Fig. 6. The same data inputs
are applied to each slice and a total of 128 × 85 = 10880
select lines are used to control the switch. Destination
conflicts are not allowed in the described architecture,
and are prevented by a proper scheduling algorithm [6].
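The select-line count quoted above can be checked with a few lines of arithmetic. The sketch below assumes that every multiplexer has its own binary-encoded select inputs (2 bits for a 4 × 1 multiplexer, 1 bit for a 2 × 1 multiplexer); the helper name select_bits is illustrative and does not come from the VHDL model.

```python
# Count the select lines of one pipelined 128x1 mux-tree slice.
# Assumption: each multiplexer has independent, binary-encoded select inputs.

def select_bits(mux_inputs: int) -> int:
    """Select bits needed by one k-to-1 multiplexer (k a power of two)."""
    return (mux_inputs - 1).bit_length()

N_SLICES = 128

# One 64-input half: three pipeline stages holding 16, 4 and 1 multiplexers (4x1).
muxes_per_half = [16, 4, 1]
half_lines = sum(muxes_per_half) * select_bits(4)

# Two halves plus the final 2x1 multiplexer.
slice_lines = 2 * half_lines + select_bits(2)
fabric_lines = N_SLICES * slice_lines

print(slice_lines, fabric_lines)  # prints: 85 10880
```

Under this assumption, the 21 multiplexers of each half contribute 42 lines, so a slice needs 2 × 42 + 1 = 85 lines, matching the figure given in the text.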
Fig. 6. Architecture of a slice of the switch fabric. Signals: 128 Inputs, 85 Sel, clk, clk_en, reset, Outputs. Each 64-input half uses three pipelined stages of 4x1 MUX blocks (16, 4 and 1) separated by REG registers, followed by a final 2x1 MUX.
A further important property of the adopted switch
fabric architecture is its modularity. This feature makes
it possible to adopt a hierarchical synthesis flow that
simplifies the floorplan. Additionally, although this is
not exploited in this work, the modular structure of the
switch also allows different choices of voltage
and frequency scaling to be applied to individual slices. If
lower traffic is observed along the paths associated
with a specific slice, then voltage and frequency scaling
for this single slice would reduce power
consumption while still allowing
higher throughput across the other slices.
The VHDL code of the fabric was written, debugged
and simulated under Mentor Graphics Modelsim using
randomly generated patterns of input data. Synthesis
was performed using Synopsys Design Compiler on a 90
nm CMOS technology. The power analysis of the switch
fabric was performed using Synopsys Power Compiler.
We do not consider the power contribution due to the
implementation of the power control algorithm or any
other component because we focus on the crossbar chip.
We restrict our analysis to the synthesis results and we
do not consider the consumption due to the actual chip
layout; hence, our power consumption results are rela-
tive. Derived power dissipation figures are based on the
actual switching activities measured at circuit nodes dur-
ing simulation of the fabric in the presence of different
test patterns. Thanks to the high level of pipelining,
the maximum operating frequency of the designed
crossbar, when the supply voltage is not scaled, is as high
as 3.2 GHz, which yields an aggregate bandwidth
of 410 Gbps. To evaluate the potential of the described
DVFS approach, the crossbar was synthesized with several
values of supply voltage and clock frequency.
Six scaling factors (i.e., {0.4, 0.5, 0.6, 0.7, 0.8, 0.9},
corresponding to α = {2.50, 2.00, 1.67, 1.43, 1.25, 1.11})
were used to reduce the supply voltage. In addition, the
clock frequency, fCK, was varied from
the maximum achievable value of 3.2 GHz down to
200 MHz, equally for all the ports. Hence, the corresponding
traffic matrix S is ρ-double-stochastic with all
sij = ρ/N and ρ = fCK/(3.2 GHz). The power consumption
in the fabric is associated with the switching
activity in the slice components and therefore with the
average data throughput. For each selected value of fCK,
the maximum possible data rate has been assumed for
input data. For example, with fCK = 1.2 GHz, data are
received at the rate of 1,200 Mbps per input port. The
select lines that control the multiplexers are assumed
to switch at a rate 1000 times lower. Note that power
would also be consumed to change between voltage
levels. Furthermore, each transition to new values of
supply voltage and fCK introduces a latency, which may
affect the global throughput. However, for simplicity,
the latency and power overheads generated by these
transitions are not considered in this study.
Fig. 7. Power obtained by the VHDL synthesis, for a 128 × 128 crossbar with 410 Gbps bandwidth. Axes: dissipated power per port (mW) versus average load (fCK / 3.2 GHz). Curves: NoDVFS, α = 1.11, 1.25, 1.43, 1.67, 2.00, 2.50, and Theoretical. Inset: zoom in log scale.
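The parameters of the synthesis sweep can be reproduced with a short script. This is a minimal sketch built only on the constants quoted in the text; the function names load and uniform_entry are illustrative, and the listed values of α are treated as the reciprocals of the voltage scaling factors rounded to two decimals, which is what the quoted numbers suggest.

```python
# Sketch of the synthesis sweep parameters; constants are taken from the text.
N_PORTS = 128
F_MAX_GHZ = 3.2

# Voltage scaling factors and the corresponding alpha = 1 / factor.
scaling_factors = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
alphas = [round(1 / s, 2) for s in scaling_factors]

# Serial ports: one bit per port per clock cycle, so Gbps per port equals fCK in GHz.
aggregate_gbps = N_PORTS * F_MAX_GHZ

def load(f_ck_ghz: float) -> float:
    """Normalized load rho = fCK / 3.2 GHz."""
    return f_ck_ghz / F_MAX_GHZ

def uniform_entry(f_ck_ghz: float) -> float:
    """Entry s_ij = rho / N of the uniform traffic matrix S."""
    return load(f_ck_ghz) / N_PORTS

print(alphas)
print(aggregate_gbps)
print(load(1.2), N_PORTS * uniform_entry(1.2))
```

Each row and column of the uniform matrix sums to N · ρ/N = ρ, which is the ρ-double-stochastic property stated above; the aggregate of 128 × 3.2 = 409.6 Gbps is the value rounded to 410 Gbps in the text.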
Switch fabric power consumption per port is reported
in Fig. 7 for different voltage scaling factors and clock
frequencies. The theoretical curve is ρave(S)^3. As expected,
power consumption scales linearly with fCK and
thus with input data rate, but the slope depends on
the applied voltage scaling. Therefore, different power
reduction gains can be obtained at different input data
rates. For example, if the input data rate is 50% of the
maximum allowed level, 75% of the dissipated power
can be saved, from 4.2 mW with no applied DVFS to
1 mW with a voltage scaling factor equal to 0.5. A
lower reduction of dissipated power is possible when
working at higher data rates: with input data at 75% of
the maximum frequency, the dissipated power can be
reduced by 51% from 6.3 mW to 3.1 mW.
Furthermore, the filled points on the theoretical curve
for a specific load ρ are aligned with the linear inter-
polation of the powers obtained for a specific value of
α = 1/ρ. This means that the cubic dissipation model of
Theorem 2, based on a single expansion factor for the
whole crossbar, is accurate.
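The relation between the linear curves and the cubic curve can be illustrated numerically. The sketch below is not the synthesis flow: it assumes the standard dynamic-power model in which power is proportional to the clock frequency times the square of the supply voltage, and it takes the full-load power without DVFS as 8.4 mW per port, a value extrapolated linearly from the 4.2 mW quoted at 50% load. The names P_FULL_MW, power_mw and cubic_power_mw are illustrative.

```python
# Illustrative dynamic-power model: P = P_full * load * v_scale**2.
# Assumption: P_FULL_MW is extrapolated from the 4.2 mW quoted at 50% load.
P_FULL_MW = 8.4

def power_mw(load: float, v_scale: float = 1.0) -> float:
    """Per-port power at a given load with the supply scaled by v_scale."""
    return P_FULL_MW * load * v_scale ** 2

def cubic_power_mw(load: float) -> float:
    """Ideal case alpha = 1/load: the voltage scaling factor equals the load."""
    return power_mw(load, v_scale=load)

no_dvfs = power_mw(0.5)            # linear in the load, nominal voltage
scaled = power_mw(0.5, 0.5)        # voltage scaling factor 0.5
saving = 1 - scaled / no_dvfs

print(f"{no_dvfs:.2f} mW -> {scaled:.2f} mW, saving {saving:.0%}")
# prints: 4.20 mW -> 1.05 mW, saving 75%
```

For a fixed voltage scaling factor the model is linear in the load, and picking the factor equal to the load collapses it onto the cubic curve, which is the alignment described above. The modeled 1.05 mW is close to the 1 mW reported from synthesis at 50% load.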
6 CONCLUSIONS
We discussed the potential power gains that DVFS techniques
can provide when controlling a crossbar used as
a switching fabric in an input-queued switch. We took
an idealized approach, disregarding the details related
to packet scheduling and looking only at flow rates.
Performance results, validated through a real hardware
synthesis, show that a significant power reduction
can be obtained, especially at low loads. The proposed
algorithms are computationally simple and achieve
performance gains close to those of more complex, optimal
algorithms.
REFERENCES
[1] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, “The limit of
dynamic voltage scaling and insomniac dynamic voltage scaling,”
IEEE Trans. on VLSI Systems, vol. 13, no. 11, pp. 1239–1252, Nov.
2005.
[2] F. Hameed, M. Faruque, and J. Henkel, “Dynamic thermal management
in 3D multi-core architecture through run-time adaptation,”
in IEEE Design, Automation & Test in Europe (DATE), 2011.
[3] https://research.sprintlabs.com/packstat/packetoverview.php.
[4] M. Flynn and P. Hung, “Microprocessor design issues: thoughts on
the road ahead,” IEEE Micro, vol. 25, no. 3, pp. 16–31, May 2005.
[5] T. Kolpe, A. Zhai, and S. Sapatnekar, “Enabling improved power
management in multicore processors through clustered DVFS,” in
IEEE Design, Automation & Test in Europe (DATE), 2011.
[6] H. J. Chao and B. Liu, High Performance Switches and Routers.
Wiley-IEEE Press, 2007.
[7] C.-S. Chang, W.-J. Chen, and H.-Y. Huang, “Birkhoff-von Neumann
input buffered crossbar switches,” in IEEE INFOCOM, vol. 3,
March 2000, pp. 1614–1623.
[8] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge
University Press, 2004.
[9] T. Wu, C.-Y. Tsui, and M. Hamdi, “A 2Gb/s 256 x 256
CMOS crossbar switch fabric core design using pipelined MUX,”
in IEEE International Symposium on Circuits and Systems (ISCAS),
Phoenix-Scottsdale, AZ, May 2002.
