## POLITECNICO DI TORINO

## Repository ISTITUZIONALE

## Thermal control for crossbar-based input-queued switches

Original
Thermal control for crossbar-based input-queued switches / Bianco A.; Giaccone P.; Masera G.; Ricca M. - PRINT. (2010). (Paper presented at the IEEE Globecom 2010 conference, held in Miami, FL, in December 2010.)

Availability:
This version is available at: 11583/2375042 since:
Publisher:
IEEE

Published
DOI:10.1109/GLOCOM.2010.5683306

Terms of use.
openAccess
This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository


# Thermal Control for Crossbar-based Input-Queued Switches 

Andrea Bianco, Paolo Giaccone, Guido Masera, Marco Ricca<br>Dipartimento di Elettronica, Politecnico di Torino, Italy


#### Abstract

We consider an $N \times N$ input-queued switch based on a crossbar switching fabric implemented on a single chip. The thermal power produced by the crossbar chip grows as $N R^{3}$, where $R$ is the maximum bit rate. Power dissipation is becoming more and more challenging, limiting the crossbar scalability for high performance switches.

We propose to exploit Dynamic Voltage and Frequency Scaling (DVFS) techniques, quite commonly used in integrated circuit design, to control packet transmissions through each crosspoint of the switching fabric. Our thermal control operates independently of the packet scheduler and it is based on short-term traffic measurements. We propose a family of control algorithms to reduce the thermal power dissipation in non-overloaded conditions.

Index Terms: Input-queued switch, energy, thermal control.


## I. Introduction

The aggregate bandwidth of high-speed routers is growing fast, due to the increased traffic demand in the Internet. Usually, one or a few switching fabrics are present in the core of a router to switch all the data from the inputs to the outputs; each fabric is often implemented on a single integrated circuit. The hardware design of such fabrics is becoming more and more critical, because of the large pin count and the high bit rate. Indeed, if $f$ is the maximum digital signal frequency, the power consumption of a single CMOS gate is proportional to $f^{3}$ [1]. In an $N \times N$ single-chip crossbar with $N^{2}$ crosspoints, each implemented through combinatorial logic, we have $\theta\left(N^{2}\right)$ CMOS gates (i.e., a fixed number for each crosspoint), and the total power consumption becomes proportional to $R^{3} N$, where $R$ is the bit rate and $N$ is the maximum number of data flows simultaneously crossing the switching fabric.

Thermal dissipation is becoming a critical design issue, due to the high integration level on a single chip, which implies a very high spatial power density [2]. In integrated circuits, Dynamic Voltage and Frequency Scaling (DVFS) [1] is a classical technique used to control the power consumption. DVFS is based on the idea of jointly varying the power supply voltage and the peak signal frequency. A vast literature on DVFS techniques is available, focusing for example on a single CMOS gate, on a cascade of CMOS gates, or on a complete CPU.

In this paper we propose to use DVFS for the thermal control of a single-chip crossbar, analyzing the tradeoff between throughput (i.e., performance) and power consumption (i.e., thermal power to dissipate). The main idea is to exploit low-load traffic conditions to extend the packet duration, by reducing the voltage and the bit frequency, so as to control the thermal power. Note that networks are typically provisioned for worst-case or peak-hour traffic. However, several measurements (see for example [3]) show that backbone utilization rarely exceeds $30 \%$, thus suggesting that exploiting low traffic conditions can be a significant asset to reduce thermal power. We propose a set of algorithms for thermal control that operate on an estimated traffic matrix, to assess the potential power gain that can be obtained by exploiting DVFS. We take an idealized approach, i.e., we disregard the interaction with the packet scheduling algorithms that select the packets to be transferred across the switching fabric. The system model is defined in Sec. II. Sec. III formalizes the optimal thermal control problem, describes its properties, and proposes a set of algorithms to solve it. Performance results in Sec. IV show the possible thermal gain of our approach. The main contributions of the paper are: (i) the definition of the thermal control problem for the crossbar; (ii) the definition of the optimal algorithm and of simpler approximated algorithms; (iii) the performance evaluation of such algorithms.

## II. Problem Definition

We start by considering a single CMOS on which the combinatorial logic is based. Then, we examine the switching architecture to define the crossbar thermal control problem.

## A. Energy model for a single CMOS gate

The energy consumption of a CMOS gate is strongly dependent on the supply voltage $V$ and can be modeled as the sum of a dynamic energy component (due to the electrical signal switching activity needed to transfer sequences of 0s and 1s) and a static energy component (due to leakage currents). We consider only the dynamic component and neglect the static one. Indeed, leakage currents can be made negligible with a proper hardware design, whose discussion is beyond the scope of this paper. The energy due to a bit transition (i.e., the switching activity) is a quadratic function of $V$ according to the well-known formula $E_{bit}=0.5 C V^{2}$, where $C$ is the load capacitance. If we consider a 0-1 square wave signal with frequency $f$, the power consumption is $P=E_{bit} f \propto f V^{2}$; this value also represents the thermal power to dissipate. The maximum allowed frequency is

$$
\begin{equation*}
f_{\max } \propto V \tag{1}
\end{equation*}
$$

due to the delay needed to switch from one logic state to another [4]. Thereby, the power consumption for a CMOS operating at maximum frequency and voltage is proportional to $f^{3}$. DVFS techniques jointly reduce $V$ and $f$ to minimize power consumption, exploiting time periods in which the signal can be "slowed down" to a lower peak frequency.


Fig. 1. Thermal control scheme

We consider a CMOS device operating at a voltage $V$ between a minimum $V_{\min }$ and a maximum $V_{\max }$. Within this range, we assume that bit transmission can occur at intermediate voltage levels. When operating at $V<V_{\max }$, thanks to (1), the signal peak frequency can be slowed down by a factor $\alpha=V_{\max } / V$ with respect to the maximum frequency allowed when using $V_{\max }$. Thus, $\alpha$ represents an expansion factor of the bit duration with respect to the bit duration when using $V_{\max }$. Furthermore, $V$ cannot be lower than $V_{\min }>0$, both because of technological constraints that prevent the voltage level from being reduced too much and because of the impact of leakage currents, which would no longer be negligible. Define $\beta=V_{\min } / V_{\max }$. Depending on the technology, $\beta=0.5$ for a classical DVFS scheme or $\beta=0.3$ in the case of an "extreme" DVFS scheme, according to [1]. By construction, $1 \leq \alpha \leq 1 / \beta$.
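The scaling argument of this subsection can be checked numerically. The following is a minimal sketch, assuming the proportionalities $P \propto f V^{2}$ and $f_{\max } \propto V$ with all constants of proportionality set to one; the function names are illustrative and do not come from the paper.

```python
def relative_power(alpha: float) -> float:
    """Power while transmitting, relative to operation at V_max and f_max.

    Assumes P ∝ f * V^2 and f ∝ V (Eq. (1)), with V = V_max / alpha.
    """
    if alpha < 1.0:
        raise ValueError("alpha must be >= 1")
    v = 1.0 / alpha      # normalized voltage V / V_max
    f = v                # normalized peak frequency, since f_max ∝ V
    return f * v ** 2    # equals alpha ** -3


def relative_energy_per_bit(alpha: float) -> float:
    """Energy per bit transition (E_bit ∝ V^2), relative to operation at V_max."""
    if alpha < 1.0:
        raise ValueError("alpha must be >= 1")
    return (1.0 / alpha) ** 2


def max_expansion(beta: float) -> float:
    """Largest admissible expansion factor 1 / beta, with beta = V_min / V_max."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return 1.0 / beta
```

Under these assumptions, a classical DVFS scheme with $\beta=0.5$ allows $\alpha$ up to 2: each bit then lasts twice as long and costs one quarter of the energy.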

## B. Switching architecture

We consider an $N \times N$ input-queued (IQ) switch with virtual output queueing (VOQ), i.e., one queue $\mathrm{VOQ}_{i j}$ for each pair of input $i$ and output $j$. This IQ architecture allows high scalability in terms of line rate and number of ports, and the VOQ scheme is theoretically optimal from the performance point of view. To avoid dealing with data content at this abstract level, we assume that a data packet of length $P$ is transmitted using $P$ signal transitions, i.e., each packet is composed of a sequence of alternating 0s and 1s. The maximum line rate for each port is $R_{\text {max }}$, measured in [bit/s]; this value can only be reached for $V=V_{\max }$. The switching fabric is an $N \times N$ crossbar, with $N^{2}$ crosspoints and $\theta\left(N^{2}\right)$ CMOS gates. The crosspoint from input $i$ to output $j$ is denoted as $\mathrm{XP}_{i j}$.

## III. Crossbar thermal control

In a real switch implementation, a packet scheduler is responsible for selecting the set of packets to transfer simultaneously through the crossbar, satisfying the constraint that at most one packet is sent from each input and to each output. The scheduling decisions occur at the packet level, with a time granularity equal to the minimum packet duration. In the case of minimum Ethernet packet size and $10 \mathrm{Gbit} / \mathrm{s}$ line rates, a new scheduling decision must be taken roughly every 50 ns (a 64-byte packet lasts 51.2 ns at $10 \mathrm{Gbit} / \mathrm{s}$). Given such a strict timing constraint, packet schedulers are implemented directly in hardware. A large literature is available on the design of low-complexity and high-performance packet schedulers for input-queued switches [5]-[7].

Unlike the packet scheduler, the thermal control operates at a larger time scale, related to the thermal time constants (in the order of milliseconds) of the materials employed to build the chip. As shown in Fig. 1, we propose a thermal control scheme decoupled from the packet scheduling decisions, whose aim is to exploit DVFS at the crosspoints to reduce the crossbar thermal power. Based on traffic measurements on the millisecond scale, the control sets the DVFS factor $\alpha_{i j}$ for the combinatorial logic present at $\mathrm{XP}_{i j}$; each crosspoint is controlled independently. Note that, due to the relaxed timing constraints, the algorithm for thermal control can be implemented in software.

Let $\hat{\alpha}=\left[\alpha_{i j}\right]$ be the $N \times N$ matrix with all the DVFS factors. Note that setting $\alpha_{i j}>1$ implies that the forwarding rate at $\mathrm{XP}_{i j}$ is reduced and the packet transmission time is increased by a factor $\alpha_{i j}$. This has two main consequences: i) an additional queueing delay in $\mathrm{VOQ}_{i j}$; ii) the packet scheduler cannot serve any new packet from input $i$ or to output $j$ until $\mathrm{XP}_{i j}$ ends the packet transmission. This means that the packet scheduler should take into account the expansion factors set by the thermal control. We disregard this issue in the paper and take an ideal fluid-based approach, looking only at the rates of I/O flows, to understand the potential benefits that can be obtained in terms of reduced power consumption.

## A. Problem definition

The traffic load on each link is measured over a time window whose duration is in the order of the thermal constants (ms). Let $r_{i j}$ be the average rate [bit/s] of the traffic flows enqueued at $\mathrm{VOQ}_{i j}$, and $R=\left[r_{i j}\right]$ the corresponding $N \times N$ traffic matrix. Let $S=\left[s_{i j}\right]$ be the normalized traffic matrix obtained by setting $s_{i j}=r_{i j} / R_{\text {max }}$, with $s_{i j} \in[0,1]$. We assume that $s_{i j}>0$ for any $i$ and $j$. The average load of matrix $S$ is defined as $\rho_{\text {ave }}(S)=\left(\sum_{i=1}^{N} \sum_{j=1}^{N} s_{i j}\right) / N$. The load at input $i$ and at output $j$ is $\rho_{i}^{I}(S)=\sum_{k=1}^{N} s_{i k}$ and $\rho_{j}^{O}(S)=\sum_{k=1}^{N} s_{k j}$, respectively. The maximum load of matrix $S$ is defined as $\rho_{\text {max }}(S)=\max \left\{\max _{k}\left\{\rho_{k}^{I}\right\}, \max _{k}\left\{\rho_{k}^{O}\right\}\right\}$, and $S$ is said to be admissible iff $\rho_{\max }(S) \leq 1$. Obviously, $\rho_{\text {ave }}(S) \leq \rho_{\max }(S)$.
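The load definitions above translate directly into a few lines of code. The sketch below uses plain Python lists for the normalized matrix $S$; the helper names and the example matrix are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def input_loads(s: Matrix) -> List[float]:
    """rho_i^I(S): sum of each row of the normalized traffic matrix."""
    return [sum(row) for row in s]


def output_loads(s: Matrix) -> List[float]:
    """rho_j^O(S): sum of each column of the normalized traffic matrix."""
    return [sum(col) for col in zip(*s)]


def average_load(s: Matrix) -> float:
    """rho_ave(S): total normalized traffic divided by the number of ports N."""
    return sum(input_loads(s)) / len(s)


def max_load(s: Matrix) -> float:
    """rho_max(S): largest load over all inputs and outputs."""
    return max(input_loads(s) + output_loads(s))


def is_admissible(s: Matrix, tol: float = 1e-12) -> bool:
    """S is admissible iff rho_max(S) <= 1 (up to a numerical tolerance)."""
    return max_load(s) <= 1.0 + tol


# Hypothetical 2x2 example: row sums (0.3, 0.4), column sums (0.4, 0.3).
S_EXAMPLE: Matrix = [[0.1, 0.2], [0.3, 0.1]]
```

For `S_EXAMPLE`, the average load is 0.35 and the maximum load is 0.4, so the matrix is admissible and $\rho_{\text {ave }}(S) \leq \rho_{\max }(S)$ holds as expected.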

We now model the constraints related to the maximum time expansion allowed for the transmitted bits. Consider a generic time period $T$, during which the rate of a flow is equal to $r_{i j}$. Let $r_{\max }$ denote the maximum line rate ($R_{\text {max }}$ in Sec. II-B). The total duration of the bit transmissions is $T_{1}=r_{i j} T / r_{\max }$ and the maximum bit expansion factor is $T / T_{1}=r_{\max } / r_{i j}$, i.e., $\alpha_{i j} r_{i j} \leq r_{\max }$. At the same time, stricter constraints on the expansion factors are imposed by the traffic load in $T$ on each input and each output: $\sum_{i=1}^{N} \alpha_{i k} r_{i k} \leq r_{\max }$, $\sum_{j=1}^{N} \alpha_{k j} r_{k j} \leq r_{\max }, \quad \forall k \in\{1, \ldots, N\}$, which can be normalized as

$$
\begin{equation*}
\sum_{i=1}^{N} \alpha_{i k} s_{i k} \leq 1 \quad \sum_{j=1}^{N} \alpha_{k j} s_{k j} \leq 1 \tag{2}
\end{equation*}
$$

The power consumption of $\mathrm{XP}_{i j}$ is

$$
P_{i j}=r_{i j}\left(\frac{V_{\max }}{\alpha_{i j}}\right)^{2}=s_{i j} r_{\max }\left(\frac{V_{\max }}{\alpha_{i j}}\right)^{2}
$$

neglecting all constants of proportionality. The total crossbar power consumption is the sum of the power contributions of all crosspoints:

$$
P_{t o t}=\sum_{i=1}^{N} \sum_{j=1}^{N} P_{i j}=\sum_{i=1}^{N} \sum_{j=1}^{N} \frac{s_{i j}}{\alpha_{i j}^{2}} r_{\max } V_{\max }^{2}
$$

Finally, the minimum thermal power problem (denoted as OPT-MTP) becomes: given an admissible $S$, find $\hat{\alpha}$ that minimizes the cost function $f_{P}$:

$$
\begin{equation*}
\min _{\hat{\alpha}} f_{P}(\hat{\alpha})=\min _{\alpha_{i j} \in \mathbb{R}^{+}} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{s_{i j}}{\alpha_{i j}^{2}} \tag{3}
\end{equation*}
$$

such that

$$
\begin{align*}
\sum_{k=1}^{N} \alpha_{i k} s_{i k} & \leq 1  \tag{4}\\
\sum_{k=1}^{N} \alpha_{k j} s_{k j} & \leq 1  \tag{5}\\
\alpha_{i j} & \in \mathcal{A} \tag{6}
\end{align*}
$$

where $\mathcal{A}$ is the set of expansion factors corresponding to the available voltage levels.

Property 1: OPT-MTP is an integer convex non-linear optimization problem.

Following a standard methodology, we start by relaxing OPT-MTP to continuous variables; this defines the following problem, denoted as CONT-MTP: minimize (3) subject to (4) and (5), with (6) replaced by $\alpha_{i j} \geq 1 \; \forall i, j$, which corresponds to a DVFS scheme in which any voltage between 0 and $V_{\max }$ is allowed. Let $\hat{\alpha}_{\text {OPT-MTP }}$ be the optimal solution of OPT-MTP and let $\hat{\alpha}_{\text {CONT-MTP }}$ be the optimal solution of CONT-MTP.
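To make the problem statement concrete, the sketch below evaluates the cost function (3) and checks the constraints (4)-(5) together with the CONT-MTP bound $\alpha_{i j} \geq 1$ for a candidate matrix of expansion factors. It is only an evaluator, not a solver; the function names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def power_cost(s: Matrix, alpha: Matrix) -> float:
    """Cost function f_P of Eq. (3): sum over all crosspoints of s_ij / alpha_ij^2."""
    return sum(
        s_ij / a_ij ** 2
        for s_row, a_row in zip(s, alpha)
        for s_ij, a_ij in zip(s_row, a_row)
    )


def is_feasible(s: Matrix, alpha: Matrix, tol: float = 1e-9) -> bool:
    """Check alpha_ij >= 1 and the input/output constraints (4)-(5)."""
    n = len(s)
    if any(a < 1.0 - tol for row in alpha for a in row):
        return False
    rows_ok = all(
        sum(alpha[i][k] * s[i][k] for k in range(n)) <= 1.0 + tol
        for i in range(n)
    )
    cols_ok = all(
        sum(alpha[k][j] * s[k][j] for k in range(n)) <= 1.0 + tol
        for j in range(n)
    )
    return rows_ok and cols_ok


def uniform(n: int, value: float) -> Matrix:
    """N x N matrix with all entries equal to `value` (helper for examples)."""
    return [[value] * n for _ in range(n)]
```

With the hypothetical matrix `[[0.1, 0.2], [0.3, 0.1]]`, whose maximum load is 0.4, a uniform expansion by 2 is feasible and divides the cost by four, while a uniform expansion by 3 violates the load constraints.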

Property 2: $f_{P}\left(\hat{\alpha}_{\text {CONT-MTP }}\right) \leq f_{P}\left(\hat{\alpha}_{\text {OPT-MTP }}\right)$,
i.e., $\hat{\alpha}_{\text {CONT-MTP }}$ provides a lower bound on the thermal power cost.

Theorem 1: CONT-MTP is equivalent to

$$
\begin{align*}
\min _{\hat{\alpha}} f_{P}(\hat{\alpha}) & \\
\sum_{k=1}^{N} \alpha_{i k} s_{i k}=1 & \forall i  \tag{7}\\
\sum_{k=1}^{N} \alpha_{k j} s_{k j}=1 & \forall j  \tag{8}\\
\alpha_{i j} \geq 1 & \forall i, j \tag{9}
\end{align*}
$$

The proof of Theorem 1 is omitted for the sake of space. Note that exactly one of the constraints in (7)-(8) is linearly dependent on the others, and it can be omitted.

A non-negative matrix $H \in \mathbb{R}^{N \times N}$ is said to be $\rho$-double-stochastic if $\rho_{i}^{I}=\rho$ for any $i$ and $\rho_{j}^{O}=\rho$ for any $j$. In this case, $\rho_{\text {ave }}(H)=\rho_{\max }(H)=\rho$. A 1-double-stochastic matrix is simply called a double-stochastic matrix.

A non-negative matrix $H \in \mathbb{R}^{N \times N}$ is said to be $\rho$-sub-stochastic if $\rho_{\max }(H)=\rho$. In this case, $\rho_{\text {ave }}(H) \leq \rho_{\max }(H)$.

Thanks to Theorem 1, CONT-MTP has the following interpretation: given a $\rho$-sub-stochastic matrix $S$, find a double-stochastic matrix $\hat{S}=\left[\hat{s}_{i j}\right]$ such that the set of $\alpha_{i j}=\hat{s}_{i j} / s_{i j}$ minimizes (3). Hence, the problem consists of augmenting $S$ such that it becomes double-stochastic.

In the following specific case, we can analytically compute the optimal solution:

Theorem 2: Given a $\rho$-double-stochastic matrix $S$, the optimal solution for CONT-MTP is $\hat{\alpha}_{i j}=1 / \rho$, for any $i, j$. The corresponding power consumption is $f_{P}\left(\hat{\alpha}_{\text {CONT-MTP }}\right)=N \rho^{3}$.

The proof is based on Lagrange multipliers and on Taylor's theorem for multivariate functions, and it is omitted here due to lack of space. The theorem provides an important intuition that will drive the design of approximated algorithms for the CONT-MTP problem: in the optimal solution, all the elements of the traffic matrix are expanded proportionally, i.e., by the same factor.
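As a consistency check (not the omitted proof), the cost stated in Theorem 2 can be recovered by substituting the uniform solution $\alpha_{i j}=1 / \rho$ into (3) and using the fact that the elements of a $\rho$-double-stochastic $S$ sum to $N \rho$:

$$
f_{P}(\hat{\alpha})=\sum_{i=1}^{N} \sum_{j=1}^{N} \frac{s_{i j}}{(1 / \rho)^{2}}=\rho^{2} \sum_{i=1}^{N} \sum_{j=1}^{N} s_{i j}=\rho^{2} \cdot N \rho=N \rho^{3}
$$

The same substitution shows that every constraint in (7)-(8) holds with equality, since $\sum_{k=1}^{N} \alpha_{i k} s_{i k}=\rho_{i}^{I}(S) / \rho=1$ for any $i$, and likewise for each output.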

When $V_{\min }$ is also considered, the expansion factor is limited by $\alpha_{i j} \leq 1 / \beta$. For a $\rho$-double-stochastic $S$, for which $\rho_{\text {ave }}(S)=\rho_{\max }(S)=\rho$, the optimal solution becomes $\alpha_{i j}=\min \left(1 / \rho_{\max }(S), 1 / \beta\right) \; \forall i, j$ and the corresponding optimal cost for CONT-MTP becomes:

$$
f_{P}\left(\hat{\alpha}_{\text {Cont-MTP }}\right)= \begin{cases}N \rho_{\max }(S) \beta^{2} & \text { if } \rho_{\text {ave }}(S)<\beta  \tag{10}\\ N\left(\rho_{\max }(S)\right)^{3} & \text { if } \rho_{\text {ave }}(S) \geq \beta\end{cases}
$$

According to (10), $\beta$ is the value of the "critical load": below it, the bit expansion is limited by $V_{\min }$, whereas above it DVFS cannot expand the bit duration up to $1 / \beta$ because of the constraints imposed by the traffic load in (2). Recall that, in practical applications, $\beta \in[0.3,0.5]$.
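Equation (10) can be evaluated with a short helper. The sketch below assumes a $\rho$-double-stochastic matrix, so that a single load value $\rho$ describes it; the function names are illustrative.

```python
def uniform_expansion(rho: float, beta: float) -> float:
    """Expansion factor min(1 / rho, 1 / beta) applied to every crosspoint."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return min(1.0 / rho, 1.0 / beta)


def cont_mtp_cost(n: int, rho: float, beta: float) -> float:
    """Cost of Eq. (10) for an N x N rho-double-stochastic matrix.

    The total normalized traffic is N * rho and every element is expanded
    by the same factor, so the cost is N * rho / alpha^2.
    """
    alpha = uniform_expansion(rho, beta)
    return n * rho / alpha ** 2
```

For instance, with $N=4$ and $\beta=0.5$, a load $\rho=0.2$ falls in the first branch of (10) and gives $N \rho \beta^{2}=0.2$, whereas $\rho=0.8$ falls in the second branch and gives $N \rho^{3}=2.048$.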

Now we consider a relaxed version of the CONT-MTP problem, denoted as MISO-MTP, in which we remove the expansion constraints (7) on each input.

Theorem 3: The optimal solution of MISO-MTP is given by $\alpha_{i j}=1 / \rho_{j}^{O}(S)$. The related power cost is $f_{P}\left(\hat{\alpha}_{\text {MISO-MTP }}\right)=\sum_{j}\left(\rho_{j}^{O}(S)\right)^{3}$. The proof, based on the definition of the Lagrange function, is omitted for lack of space.

Property 3: $f_{P}\left(\hat{\alpha}_{\text {MISO-MTP }}\right) \leq f_{P}\left(\hat{\alpha}_{\text {CONT-MTP }}\right)$,
i.e., MISO-MTP provides a lower bound, immediate to compute, for CONT-MTP and OPT-MTP.
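The MISO-MTP bound of Theorem 3 only needs the column sums of $S$. A minimal sketch follows; the function names and the example matrix are illustrative assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def miso_mtp(s: Matrix) -> Tuple[Matrix, float]:
    """Return (alpha, cost) with alpha_ij = 1 / rho_j^O and cost = sum_j (rho_j^O)^3.

    Assumes every column sum is positive, which holds when s_ij > 0 for all i, j.
    """
    n = len(s)
    out_loads = [sum(col) for col in zip(*s)]
    alpha = [[1.0 / out_loads[j] for j in range(n)] for _ in range(n)]
    cost = sum(load ** 3 for load in out_loads)
    return alpha, cost


def row_usage(s: Matrix, alpha: Matrix) -> List[float]:
    """Left-hand side of the input constraint (4) for each input."""
    return [sum(a * x for a, x in zip(a_row, s_row)) for a_row, s_row in zip(alpha, s)]
```

For the hypothetical matrix `[[0.1, 0.2], [0.3, 0.1]]`, the bound is $0.4^{3}+0.3^{3}=0.091$. The second input ends up with a usage above 1, which illustrates that the MISO-MTP solution is a bound for CONT-MTP and not necessarily a feasible point of it.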

A feasible, but not optimal, solution for OPT-MTP is obtained when no DVFS scheme is adopted, i.e., $\alpha_{i j}=1$ for all $i, j$. We define this scheme as NODVFS and the corresponding solution as $\hat{\alpha}_{\text {NoDVFS }}$. The power cost $f_{P}$ becomes

$$
\begin{equation*}
f_{P}\left(\hat{\alpha}_{\mathrm{NODVFS}}\right)=\sum_{i=1}^{N} \sum_{j=1}^{N} s_{i j}=N \rho_{\text {ave }}(S) \tag{11}
\end{equation*}
$$

denoting a linear relationship between the average load on $S$ and the total power consumption.

Property 4: $f_{P}\left(\hat{\alpha}_{\text {OPT-MTP }}\right) \leq f_{P}\left(\hat{\alpha}_{\text {NoDVFS }}\right)$
This allows $f_{P}\left(\hat{\alpha}_{\text {NoDVFS }}\right)$ to be used as a loose upper bound on the cost of OPT-MTP.

In summary, the solution to the CONT-MTP problem, assuming that any voltage level between $V_{\min }$ and $V_{\max }$ can be used, provides a lower bound on the thermal power of the OPT-MTP problem, which deals with a finite number of voltage levels; the optimal solution to CONT-MTP is immediate only for $\rho$-double-stochastic matrices.

## B. Thermal control algorithm

To solve OPT-MTP for any matrix we propose to: i) solve the corresponding CONT-MTP problem; ii) round each $\alpha_{i j}$ down to the closest value available in $\mathcal{A}$. If $\alpha_{i j}$ is the solution for CONT-MTP, then use for OPT-MTP: $\alpha_{i j}^{\prime}=\max \left\{\alpha \in \mathcal{A} \mid \alpha \leq \alpha_{i j}\right\}$. Note that, by construction, the set of $\alpha_{i j}^{\prime}$ defines an admissible solution for OPT-MTP.
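The rounding step ii) can be sketched as follows. The set of levels used in the usage note is a hypothetical choice for $\mathcal{A}$, not one given in the paper, and the sketch assumes that $\mathcal{A}$ contains 1, so that every $\alpha_{i j} \geq 1$ has a level below it.

```python
from bisect import bisect_right
from typing import List, Sequence

Matrix = List[List[float]]


def round_down(alpha: float, sorted_levels: Sequence[float]) -> float:
    """Largest available level that does not exceed alpha.

    `sorted_levels` must be sorted in increasing order.
    """
    idx = bisect_right(sorted_levels, alpha)
    if idx == 0:
        raise ValueError("no available level is <= alpha")
    return sorted_levels[idx - 1]


def quantize(alpha: Matrix, levels: Sequence[float]) -> Matrix:
    """Apply alpha'_ij = max{a in A : a <= alpha_ij} to every crosspoint."""
    sorted_levels = sorted(levels)
    return [[round_down(a, sorted_levels) for a in row] for row in alpha]
```

With the hypothetical set `[1.0, 1.25, 1.5, 2.0]`, the factors `[[1.3, 2.7], [1.0, 1.99]]` are rounded to `[[1.25, 2.0], [1.0, 1.5]]`. Since rounding down can only decrease the left-hand sides of (4)-(5), a feasible continuous solution stays feasible.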

To solve CONT-MTP, we adopt two approaches:

- quasi-optimal algorithm (denoted as OPT), obtained by adopting the logarithmic barrier method for convex problems [8]. Given a sufficiently large number of iterations, it provides an $\epsilon$-approximation of the optimal solution, where $\epsilon$ is an input parameter. It converges quite slowly in our scenarios. Thus, we adopt it only as a reference case for the optimal solution.
- two-step algorithm: we augment $S$ to a double-stochastic $\hat{S}$ according to one of three algorithms: AUGM-1, AUGM-MAX or AUGM-SORT. Then, we compute $\alpha_{i j}=\hat{s}_{i j} / s_{i j}$.

The three algorithms to augment $S$ to a double-stochastic $\hat{S}$ are based on the MATRIX-INCREASE algorithm, described in the pseudo-code below.

```
MATRIX-INCREASE Algorithm
Input: \(N \times N\) matrix \(S=\left[s_{i j}\right],\left\{\rho_{i}^{I}\right\}_{i=1}^{N},\left\{\rho_{j}^{O}\right\}_{j=1}^{N}, \rho_{T}, \Omega^{I}, \Omega^{O}\).
Output: \(N \times N\) matrix \(\Delta=\left[\delta_{i j}\right]\)
1. \(\delta_{i j}=0\) for any \(1 \leq i, j \leq N\)
2. \(\Omega^{I O}=\left\{(i, j): i \in \Omega^{I}, j \in \Omega^{O}\right\}\)
3. repeat until no choice is available anymore
4.     choose any \((i, j) \in \Omega^{I O}\) such that \(\max \left\{\rho_{i}^{I}, \rho_{j}^{O}\right\}<\rho_{T}\)
5.     \(\delta_{i j}=\min \left\{\rho_{T}-\rho_{i}^{I}, \rho_{T}-\rho_{j}^{O}\right\}\)
6.     \(\rho_{i}^{I}=\rho_{i}^{I}+\delta_{i j}, \quad \rho_{j}^{O}=\rho_{j}^{O}+\delta_{i j}\)
```

The inputs of the MATRIX-INCREASE algorithm are: i) a sub-stochastic matrix $S$, whose row sums $\rho_{i}^{I}$ and column sums $\rho_{j}^{O}$ are pre-computed; ii) a target load value $\rho_{T}$ such that $\rho_{T} \geq \max _{k \in \Omega^{I}}\left\{\rho_{k}^{I}\right\}$ and $\rho_{T} \geq \max _{k \in \Omega^{O}}\left\{\rho_{k}^{O}\right\}$; and iii) a set of inputs $\Omega^{I}$ and a set of outputs $\Omega^{O}$. The algorithm returns a matrix $\Delta=\left[\delta_{i j}\right]$ with the largest possible elements such that: (i) only the elements $\delta_{i j}$ with row $i \in \Omega^{I}$ and column $j \in \Omega^{O}$ may be $>0$; (ii) the maximum row and column sum is $\rho_{T}$, i.e.

$$
\begin{aligned}
& \sum_{k=1}^{N}\left(s_{i k}+\delta_{i k}\right) \leq \rho_{T} \text { for any } i \in \Omega^{I} \\
& \sum_{k=1}^{N}\left(s_{k j}+\delta_{k j}\right) \leq \rho_{T} \text { for any } j \in \Omega^{O}
\end{aligned}
$$
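A runnable reading of the MATRIX-INCREASE pseudo-code is sketched below. It is one possible implementation, under the following assumptions: the choice of $(i, j)$ is made by always taking the last still-open row and column, a small tolerance `tol` replaces the exact comparison with $\rho_{T}$ to cope with floating-point rounding, and indices are 0-based. Since only the row and column sums are needed, the matrix $S$ itself is not passed in.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matrix_increase(
    row_sum: List[float],
    col_sum: List[float],
    rho_t: float,
    rows: Sequence[int],
    cols: Sequence[int],
    tol: float = 1e-12,
) -> Matrix:
    """Return Delta; `row_sum` and `col_sum` are updated in place (step 6)."""
    n = len(row_sum)
    delta = [[0.0] * n for _ in range(n)]                        # step 1
    # Rows and columns of the sub-matrix that are still below the target load.
    open_rows = [i for i in rows if rho_t - row_sum[i] > tol]    # step 2
    open_cols = [j for j in cols if rho_t - col_sum[j] > tol]
    while open_rows and open_cols:                               # step 3
        i, j = open_rows[-1], open_cols[-1]                      # step 4
        d = min(rho_t - row_sum[i], rho_t - col_sum[j])          # step 5
        delta[i][j] += d
        row_sum[i] += d                                          # step 6
        col_sum[j] += d
        # At least one of the two reaches rho_t, so the loop runs at most
        # len(rows) + len(cols) times.
        if rho_t - row_sum[i] <= tol:
            open_rows.pop()
        if rho_t - col_sum[j] <= tol:
            open_cols.pop()
    return delta
```

Each iteration saturates at least one row or one column, which is why the number of iterations is bounded by the number of rows plus the number of columns considered.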

The algorithm operates only on the sub-matrix corresponding to the rows in $\Omega^{I}$ and the columns in $\Omega^{O}$. It chooses a sequence of elements in this sub-matrix for which both the row and the column sum to less than $\rho_{T}$. Then, each chosen element is augmented as much as possible to reach $\rho_{T}$. Note that the maximum number of iterations in steps 3-6 is upper bounded by $2 N$. Having defined MATRIX-INCREASE, we describe the algorithms that we propose to augment $S$ to a double-stochastic $\hat{S}$:

- AUGM-1: i) compute $\rho_{i}^{I}$ and $\rho_{j}^{O}$ for any $i, j$; ii) run MATRIX-INCREASE on $S, \rho_{i}^{I}, \rho_{j}^{O}, \rho_{T}=1, \Omega^{I}=\Omega^{O}=\{1, \ldots, N\}$; iii) compute $\hat{s}_{i j}=s_{i j}+\delta_{i j}$. This is a classical iterative algorithm (see Sec. II.A of [9]) to augment a sub-stochastic matrix to a double-stochastic one. The complexity is $O\left(N^{2}\right)$, due to steps i) and iii).
- AUGM-MAX: i) for any $i, j$ compute $\rho_{i}^{I}$, $\rho_{j}^{O}$ and then $\rho_{\max }(S)$; ii) run MATRIX-INCREASE on $S, \rho_{i}^{I}, \rho_{j}^{O}, \rho_{T}=\rho_{\max }(S), \Omega^{I}=\Omega^{O}=\{1, \ldots, N\}$; iii) compute $\hat{s}_{i j}=s_{i j}+\delta_{i j}+\left(1-\rho_{\max }(S)\right) / N$. The complexity is $O\left(N^{2}\right)$, due to steps i) and iii).
- AUGM-SORT: i) initialize $\hat{S}$ as $S$ by setting $\hat{s}_{i j}=s_{i j}$ for any $i$ and $j$; ii) compute $\rho_{i}^{I}$ and $\rho_{j}^{O}$ for any $i$ and $j$; iii) sort $\rho_{i}^{I}$ and $\rho_{j}^{O}$ in increasing order; the induced orderings are described by $I_{(k)}$, defined as the $k$-th input, and by $O_{(k)}$, defined as the $k$-th output; iv) let $\Omega^{I}=\Omega^{O}=\emptyset$. Iterate, for $k$ from 1 to $N$, the following procedure: iv.a) $\Omega^{I}=\Omega^{I} \cup\left\{I_{(k)}\right\}$, i.e., compute the set of the $k$ inputs with the smallest row sums; iv.b) $\Omega^{O}=\Omega^{O} \cup\left\{O_{(k)}\right\}$, i.e., compute the set of the $k$ outputs with the smallest column sums; iv.c) run MATRIX-INCREASE on $S, \rho_{i}^{I}, \rho_{j}^{O}, \Omega^{I}, \Omega^{O}$ and $\rho_{T}=\max \left\{\rho_{I_{(k)}}^{I}, \rho_{O_{(k)}}^{O}\right\}$, i.e., $\rho_{T}$ is the maximum load for the first $k$ inputs and outputs of $S$; iv.d) update $\hat{s}_{i j}=\hat{s}_{i j}+\delta_{i j}$ for any $i$ and $j$, and continue with a new iteration; v) compute $\hat{s}_{i j}=\hat{s}_{i j}+\left(1-\rho_{\max }(S)\right) / N$. The complexity is $O\left(N^{2}\right)$. In iv) this complexity is achieved by optimizing the data structure used to choose an $(i, j) \in \Omega^{I O}$ in MATRIX-INCREASE and by initializing $\delta_{i j}$ only once.
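The first two augmentation schemes can be sketched on top of a compact MATRIX-INCREASE helper. This is an illustrative implementation under stated assumptions (0-based indices, a fixed order for choosing $(i, j)$, a floating-point tolerance, and $s_{i j}>0$ as assumed in Sec. III-A); the helper names are not from the paper.

```python
from typing import List

Matrix = List[List[float]]


def _matrix_increase(row_sum: List[float], col_sum: List[float],
                     rho_t: float, tol: float = 1e-12) -> Matrix:
    """Greedy MATRIX-INCREASE over all rows and columns (sums updated in place)."""
    n = len(row_sum)
    delta = [[0.0] * n for _ in range(n)]
    open_rows = [i for i in range(n) if rho_t - row_sum[i] > tol]
    open_cols = [j for j in range(n) if rho_t - col_sum[j] > tol]
    while open_rows and open_cols:
        i, j = open_rows[-1], open_cols[-1]
        d = min(rho_t - row_sum[i], rho_t - col_sum[j])
        delta[i][j] += d
        row_sum[i] += d
        col_sum[j] += d
        if rho_t - row_sum[i] <= tol:
            open_rows.pop()
        if rho_t - col_sum[j] <= tol:
            open_cols.pop()
    return delta


def _augment(s: Matrix, rho_t: float, pad: float) -> Matrix:
    """Raise all row/column sums to rho_t, then add `pad` to every element."""
    n = len(s)
    row_sum = [sum(r) for r in s]
    col_sum = [sum(c) for c in zip(*s)]
    delta = _matrix_increase(row_sum, col_sum, rho_t)
    return [[s[i][j] + delta[i][j] + pad for j in range(n)] for i in range(n)]


def augm_1(s: Matrix) -> Matrix:
    """AUGM-1: target load 1, no uniform padding."""
    return _augment(s, 1.0, 0.0)


def augm_max(s: Matrix) -> Matrix:
    """AUGM-MAX: target load rho_max(S), then add (1 - rho_max) / N everywhere."""
    rho_max = max([sum(r) for r in s] + [sum(c) for c in zip(*s)])
    return _augment(s, rho_max, (1.0 - rho_max) / len(s))


def expansion_factors(s: Matrix, s_hat: Matrix) -> Matrix:
    """alpha_ij = s_hat_ij / s_ij (requires s_ij > 0)."""
    return [[h / x for h, x in zip(h_row, s_row)] for h_row, s_row in zip(s_hat, s)]


def power_cost(s: Matrix, alpha: Matrix) -> float:
    """Cost function f_P of Eq. (3)."""
    return sum(x / a ** 2 for s_row, a_row in zip(s, alpha)
               for x, a in zip(s_row, a_row))
```

On a uniform $2 \times 2$ matrix with all elements equal to 0.1 (a $\rho$-double-stochastic matrix with $\rho=0.2$), `augm_max` expands every element by the same factor and reproduces the cost $N \rho^{3}$ of Theorem 2, while `augm_1` concentrates the increase on a few elements and yields a higher cost.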

AUGM-1 is a classical way to augment a matrix, but has the disadvantage that it increases a selected element until the sum of the corresponding row or column equals one. As such, a non-uniform increase of the matrix elements is obtained. AUGM-MAX is a simple variant of AUGM-1 in which the matrix is first augmented to reach exactly $\rho_{\max }$ for all rows and columns. Then, all matrix elements are augmented by the same amount, $\left(1-\rho_{\max }(S)\right) / N$, so that the matrix becomes double-stochastic. This approach is based on the intuition derived from Theorem 2. Unfortunately, if even a single row or column sums to 1, AUGM-MAX degenerates into AUGM-1. Finally, AUGM-SORT orders rows and columns in the original matrix so as to define a set of sub-matrices of different sizes. Each $k \times k$ sub-matrix includes a sub-matrix of size $(k-1) \times(k-1)$ of smaller load. Elements in each sub-matrix are augmented to reach the maximum admissible value in the sub-matrix, starting from the sub-matrix of smallest size.
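The nested sub-matrix idea behind AUGM-SORT can be sketched as follows. This is one possible reading of the description, with several assumptions: 0-based indices, ties in the sort broken by index, the greedy choice inside each sub-matrix made on the last open row and column, and a floating-point tolerance `tol`. It does not implement the optimized data structure mentioned in the text.

```python
from typing import List

Matrix = List[List[float]]


def augm_sort(s: Matrix, tol: float = 1e-12) -> Matrix:
    """Augment S to a double-stochastic matrix, growing nested sub-matrices."""
    n = len(s)
    row = [sum(r) for r in s]                      # step ii): row sums
    col = [sum(c) for c in zip(*s)]                #           column sums
    rho_max = max(row + col)
    in_order = sorted(range(n), key=lambda i: row[i])    # step iii)
    out_order = sorted(range(n), key=lambda j: col[j])
    s_hat = [list(r) for r in s]                   # step i)
    rows: List[int] = []                           # Omega^I
    cols: List[int] = []                           # Omega^O
    for k in range(n):                             # step iv)
        rows.append(in_order[k])
        cols.append(out_order[k])
        # The k-th row/column has not been modified yet, so these are the
        # loads of the original matrix S.
        rho_t = max(row[in_order[k]], col[out_order[k]])
        open_rows = [i for i in rows if rho_t - row[i] > tol]
        open_cols = [j for j in cols if rho_t - col[j] > tol]
        while open_rows and open_cols:             # MATRIX-INCREASE on the sub-matrix
            i, j = open_rows[-1], open_cols[-1]
            d = min(rho_t - row[i], rho_t - col[j])
            s_hat[i][j] += d
            row[i] += d
            col[j] += d
            if rho_t - row[i] <= tol:
                open_rows.pop()
            if rho_t - col[j] <= tol:
                open_cols.pop()
    pad = (1.0 - rho_max) / n                      # step v)
    return [[x + pad for x in r] for r in s_hat]
```

After the last iteration the sub-matrix coincides with the whole matrix and the target is $\rho_{\max }(S)$, so all rows and columns sum to $\rho_{\max }(S)$ before the final uniform increase brings them to 1.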

## IV. Performance evaluation

To compare the DVFS schemes, we define the relative power $\eta(\hat{\alpha})$ of a DVFS solution $\hat{\alpha}$, relative to NODVFS, as:

$$
\begin{equation*}
\eta(\hat{\alpha})=\frac{f_{P}(\hat{\alpha})}{f_{P}\left(\hat{\alpha}_{\mathrm{NoDVFS}}\right)}=\frac{f_{P}(\hat{\alpha})}{N \rho_{\text {ave }}(S)} \tag{12}
\end{equation*}
$$

Roughly speaking, $\eta(\hat{\alpha})$ is the thermal reduction factor compared to NODVFS. Since $\eta(\hat{\alpha}) \in[0,1]$, the closer $\eta(\hat{\alpha})$ is to zero, the larger the gain of the scheme with respect to NODVFS.
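The metric in (12) is straightforward to compute for any candidate solution. A minimal sketch, with an illustrative function name:

```python
from typing import List

Matrix = List[List[float]]


def relative_power_eta(s: Matrix, alpha: Matrix) -> float:
    """eta of Eq. (12): f_P(alpha) divided by the NoDVFS cost N * rho_ave(S)."""
    cost = sum(x / a ** 2 for s_row, a_row in zip(s, alpha)
               for x, a in zip(s_row, a_row))
    no_dvfs_cost = sum(sum(row) for row in s)   # equals N * rho_ave(S), Eq. (11)
    return cost / no_dvfs_cost
```

As a sanity check, for a $\rho$-double-stochastic matrix expanded uniformly by $1 / \rho$ (Theorem 2, ignoring the $V_{\min }$ limit), (12) reduces to $\eta=N \rho^{3} /(N \rho)=\rho^{2}$; a uniform $2 \times 2$ matrix with $\rho=0.5$ therefore gives $\eta=0.25$.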

## A. Power consumption under $\rho$-double-stochastic matrices

According to Theorem 2, the optimal solution for CONT-MTP is expressed by (10). Fig. 2 shows the power consumption per port $f_{P}(\hat{\alpha}) / N$ vs. the average load for CONT-MTP


Fig. 2. Optimal solution for continuous DVFS, $\rho$-double-stochastic matrices.


Fig. 3. Relative power vs. average load for $N=256$ and continuous DVFS.

growth. As a consequence, the algorithms AUGM-SORT and AUGM-MAX provide performance close to the lower bound MISO-MTP. Regardless of the considered load, $\eta\left(\hat{\alpha}_{\text {AUGM-1 }}\right)$ overlaps $\eta\left(\hat{\alpha}_{\text {NoDVFS }}\right)$, confirming the inability of AUGM-1 to exploit the DVFS gain.

## V. Conclusions

We discussed the potential thermal gains that DVFS techniques can provide when controlling a crossbar used as a switching fabric in an input-queued switch. We took an idealized approach, disregarding the details related to packet scheduling and looking only at flow rates. Our results show that DVFS schemes can be used effectively to reduce power consumption, especially at low average load, regardless of the switch size. The proposed algorithms are computationally simple and achieve gains close to those of more complex, optimal algorithms.

## Acknowledgments

The research leading to these results has received funding from the European Community's $7^{\text {th }}$ Framework Programme under grant agreement 247674 (STRONGEST).

## REFERENCES

[1] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, "The limit of dynamic voltage scaling and insomniac dynamic voltage scaling," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 11, pp. 1239-1252, Nov. 2005.
[2] F. J. Pollack, "New microarchitecture challenges in the coming generations of CMOS process technologies," IEEE/ACM International Symposium on Microarchitecture, 1999.
[3] https://research.sprintlabs.com/packstat/packetoverview.php.
[4] M. Flynn and P. Hung, "Microprocessor design issues: thoughts on the road ahead," IEEE Micro, vol. 25, no. 3, pp. 16-31, May-June 2005.
[5] N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand, "Achieving 100\% throughput in an input-queued switch," IEEE Transactions on Communications, pp. 1260-302, 1999.
[6] P. Giaccone, B. Prabhakar, and D. Shah, "Randomized scheduling algorithms for high-aggregate bandwidth switches," IEEE Journal on Selected Areas in Communications, High-performance electronic switches/routers for high-speed internet, vol. 21, pp. 546-559, May 2003.
[7] H. J. Chao and B. Liu, High Performance Switches and Routers. Wiley-IEEE Press, 2007.
[8] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[9] C.-S. Chang, W.-J. Chen, and H.-Y. Huang, "Birkhoff-von Neumann input buffered crossbar switches," in IEEE INFOCOM, vol. 3, March 2000, pp. 1614-1623.

