Abstract - This paper presents a novel architectural mechanism and a power management structure for the design of an energy-efficient Gigabit Ethernet controller. Key characteristics of such a controller are the low latency and high bandwidth required to meet the pressing demands of extremely high frame and control data rates, which in turn cause difficulties in managing power dissipation. We propose a flow-through-queue (FTQ) based power management method, which allows some of the tasks involved in processing the frame data to be offloaded. This in turn enables the use of multiple clock rates and multiple voltages for different cores inside the Ethernet controller. A modeling approach based on a semi-Markov decision process (SMDP) and queuing models is employed, which allows one to apply mathematical programming formulations for energy optimization under performance constraints. The proposed Gigabit Ethernet controller is designed in a 130nm CMOS technology that includes both high and low threshold voltages. Experimental results show that the proposed power optimization method can achieve system-wide energy savings under tighter performance constraints.
I. INTRODUCTION
A look at today's high-speed networking system trends reveals that, as Internet link speeds continue to grow exponentially, the Gigabit Ethernet controller is becoming more complex in order to satisfy the high-functionality, high-performance demands of today's applications. For example, the Gigabit Ethernet controller must be able to support high frame-rate data processing and low-latency access to achieve full-duplex line rates for maximum-sized, e.g., 1518-byte, frames [1]. However, this trend also translates into higher power densities, higher operating temperatures, and lower circuit reliability. Power consumption increases rapidly with an increase in link speed [2]. Thus, designers of the Gigabit Ethernet controller must consider power dissipation as one of the primary issues.
Although power savings are commonly achieved through circuit-level optimization techniques, many opportunities exist at the system and architecture levels to reduce energy consumption. Furthermore, current CMOS technologies allow an increasing number of clock and voltage domains to be specified on the same chip, which allows dynamic voltage and frequency scaling (DVFS) and multiple supply and threshold voltage (Vdd and Vth) assignments to be utilized [7]. System designs utilizing multiple clocks and multiple voltage cores, where a globally asynchronous and locally synchronous (GALS) communication architecture is deployed, face increasing difficulty in managing power consumption under tighter performance constraints [4] [8].

This work was supported in part by a grant (# 0541469) from the CCF division of the NSF.
As reported in [3] - [6], the problem of power modeling and optimization at high levels of abstraction in GALS has received a lot of attention, especially with respect to multiple voltage domains. In [3], the authors show that GALS processors with multiple clocks and a single voltage are not necessarily better in terms of power consumption than a fully synchronous design, due to the asynchronous communication overhead. It is also reported that the use of dynamic voltage scaling in multiple voltage cores improves power savings by up to 20%. The work presented in [4] studies an online DVFS scheme in the context of a multiple clock domain architecture by utilizing interface queues to guide the DVFS control. Voltage island-based power management is proposed in [5] to satisfy the required performance in multi-threshold CMOS technologies. In [6], the authors present an architecture for GALS systems, which allows dynamic load-balancing and adaptive inter-task voltage scaling based on the load in each of the processing units.
Although these techniques perform DVFS, little attention has been given to modeling a power-managed system with multiple Vdd/Vth choices. Indeed, a centralized DVFS architecture [3] that utilizes interface queues to transfer high-bandwidth data between multiple voltage domains tends to perform rather poorly under tight performance constraints. Finally, GALS [6] often incurs a timing overhead penalty due to the complexity of its configurations.
In this paper, we propose a flow-through-queue (FTQ) based power management method that offloads some of the tasks involved in processing the frame data, which enables multiple clock rates and multiple voltage cores inside the Ethernet controller chip. Note that in the Gigabit Ethernet controller, the control data must be accessed with low latency, while the frame data must be accessed with high bandwidth so as to maximize the transfer speed. These two competing requirements create a very challenging power minimization problem. The FTQ, which directs the frame data processing between functional modules, improves hardware support for higher performance with respect to handling the incoming packets. We also present a systematic approach for constructing a stochastic power management model. The numerical optimization solution of this stochastic model is based on a semi-Markov decision process (SMDP). Note that the SMDP model, which offers a robust theoretical framework, enables one to apply strong mathematical optimization techniques to derive optimal power management policies. To achieve further energy savings in multi-threshold CMOS technologies, mathematical programming problems are formulated with multiple Vdd/Vth assignments under tight performance constraints.
The remainder of this paper is organized as follows: Section 2 provides a brief background on the Gigabit Ethernet controller, while Section 3 describes the details of the proposed FTQ-based architecture. In Section 4, we model the FTQ-based system with SMDP and queuing models. Section 5 provides performance optimization methods. Experimental results and conclusions are given in Sections 6 and 7, respectively.
II. BACKGROUND: ETHERNET CONTROLLER
The host system of a networking server uses the Ethernet controller to send and receive packets. Sending and receiving packets over the local interconnect, e.g., the PCI-E bus [9], is handled by the Ethernet controller and the device driver in the host operating system. The Ethernet controller typically has a direct memory access (DMA) engine to transfer data between the host system memory and the network interface memory. In addition, the Ethernet controller includes a medium access control (MAC) unit to implement the link-level protocol for the underlying network, and uses signal processing hardware to implement the physical (PHY) layer defined in the network. Figure 1 shows the processing steps for a received packet (labeled in alphabetical order). In step (a), the Ethernet controller receives a data stream from the selected physical layer interface. It performs address checking, CRC calculation, and CSMA/CD functions [2] in step (b). In step (c), the Ethernet controller calculates the checksum and parses the TCP/IP headers, while it classifies the frame based on a set of matching rules in step (d). In step (e), the Ethernet controller strips the VLAN (Virtual Local Area Network) tag, and then temporarily places the packet data and header into the pre-allocated receive buffer (i.e., RXMBUF) in step (f). After that, the Ethernet controller completes the buffer descriptors for the packet in step (g). Finally, in step (h), the packet data and descriptors are transferred by DMA to the host memory via the PCI-E interface, and the device driver is notified by means of an interrupt.
III. FTQ-BASED ARCHITECTURE
As described before, defragmenting the packets of various communication protocols in hardware remains an extremely complex task. Thus, the Ethernet controller needs more functional modules and specialized hardware units that efficiently transfer data between the local interconnect and the network. Therefore, for a power-constrained system, it is necessary to capture parallelism and asynchrony among multiple functional modules operating at multiple clocks and voltages.
Figure 2. Concept of Flow-Through-Queue.
The FTQs provide a FIFO mechanism between the state machines describing the various functional modules. Each state machine essentially reacts to the content of its corresponding FTQ to initiate and direct its processing activities, as shown in Figure 2. The content of an FTQ includes pointers that indicate where the frame data is located in the buffers. When the FTQ is empty, the state machine has no work to perform and is in the idle mode. A functional module can switch between different power-speed levels. Switching between the power-saving modes in the active state is managed by a power management policy. The DVFS controller for each functional module utilizes information about the FTQ of the module, i.e., how full the queue is and how quickly the number of entries in the queue changes, to dynamically vary the supply voltage and frequency setting.
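The FTQ-driven control loop described above can be sketched in a few lines of Python. This is only an illustrative model, not the controller's actual logic: the class names, the occupancy thresholds, and the clock frequencies are assumptions introduced here (the three supply voltages are the ones offered by the library used later in the paper), and the rule "step up when the queue is filling, step down when it is draining" is one simple way to use queue occupancy and its rate of change.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical power-speed levels: (name, Vdd in volts, clock in MHz).
# The clock values are placeholders chosen for illustration only.
LEVELS = [("low", 1.35, 66), ("mid", 1.5, 100), ("high", 1.65, 133)]


@dataclass
class FlowThroughQueue:
    """FIFO of pointers to frame data held in the memory buffers."""
    capacity: int
    entries: deque = field(default_factory=deque)

    def push(self, pointer: int) -> bool:
        if len(self.entries) >= self.capacity:
            return False  # queue full, producer must stall
        self.entries.append(pointer)
        return True

    def pop(self):
        return self.entries.popleft() if self.entries else None

    @property
    def fill(self) -> float:
        return len(self.entries) / self.capacity


class DvfsController:
    """Picks a power-speed level from FTQ occupancy and its change."""

    def __init__(self, ftq: FlowThroughQueue, hi: float = 0.75, lo: float = 0.25):
        self.ftq = ftq
        self.hi, self.lo = hi, lo  # assumed occupancy thresholds
        self.prev_len = 0
        self.level = 0

    def update(self) -> str:
        n = len(self.ftq.entries)
        growth = n - self.prev_len  # how quickly the entry count changes
        self.prev_len = n
        if n == 0:
            return "idle"  # empty FTQ: the state machine has no work
        if (self.ftq.fill >= self.hi or growth > 0) and self.level < len(LEVELS) - 1:
            self.level += 1  # queue is filling: raise Vdd and frequency
        elif self.ftq.fill <= self.lo and growth <= 0 and self.level > 0:
            self.level -= 1  # queue is draining: lower Vdd and frequency
        return LEVELS[self.level][0]
```

A module would call `update()` once per control interval; an empty queue maps to the idle mode, matching the behavior described for the state machines.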
The FTQ abstraction enables high levels of parallelism by permitting different frames in the same stage of processing to proceed concurrently. In general, the frame data is provisionally stored in memory buffers before being sent to the local interconnect or the network, while the control data is processed by a string of functional modules, each requiring low latency, as shown in Figure 1 (see steps (c), (d), (e), and (g)). Thus, this architecture targets the control-dominated tasks rather than the storage and forwarding of the frame data. The event-queue mechanism of the FTQ enables multiple clocks and multiple voltages for the functional modules, satisfying both the low-latency control data access and the high-bandwidth frame data access. The FTQ configuration for the packet receive path is illustrated in Figure 3. Functional modules, i.e., QP (Queue Placement), DI
The timing diagram of FTQ-based processing for the packet receive path is depicted in Figure 4, together with some of the FTQ-related signals. The timing diagram shows that the pointer values (e.g., C0002, C0604, C0A05, and C0C06) are transferred to the following functional modules via FTQs while packet header processing is performed at each step, whereas the frame data is transferred directly from the memory buffer to the PCI-E bus through the DMA.
IV. MODELING A FTQ-BASED SYSTEM
In this section, we present a systematic approach for modeling a FTQ-based system with stochastic processes, i.e., semi-Markov decision processes (SMDP) [12]. Note that a SMDP is a tuple <S, E, Y, Z, R>, where S is a set of states, E is a set of actions, Y is the transition probability function, Z specifies the probability distribution of transition times for each state-action pair, and R is the expected reward function [12]. Figure 5 shows the SMDP model of the Gigabit Ethernet controller for the packet receive path with a state set S = {S1, S2, ..., Sm}, where m is the number of processing modes available to the system. This figure shows that each state in the SMDP model interacts with the relevant functional modules, implying dependency between these modules. For example, the S5 state involves the RISC, QP, MA (Memory Arbiter), and RXMBUF modules. Definitions of the states for this SMDP model are provided in Table 1. The idle and sleep modes shown in Figure 5 are for the whole system, i.e., all functional modules go to sleep in the system sleep state. Note also that each functional module has its own idle and sleep modes, as shown in Figure 2.
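To make the tuple <S, E, Y, Z, R> concrete, the following Python sketch stores each component as a field and samples one transition. It is a toy illustration under stated assumptions: the three states, the three actions, the probabilities, the exponential transition times, and the use of negative rewards to stand for power cost are all invented here and are not the states of Table 1 or the model of Figure 5.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
Action = str


@dataclass
class SMDP:
    states: List[State]                                         # S
    actions: List[Action]                                       # E
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # Y
    sojourn: Callable[[State, Action, random.Random], float]    # Z (as a sampler)
    reward: Callable[[State, Action], float]                    # R

    def validate(self) -> None:
        """Check that every row of Y is a probability distribution."""
        for (s, a), dist in self.transition.items():
            assert s in self.states and a in self.actions
            assert abs(sum(dist.values()) - 1.0) < 1e-9

    def step(self, s: State, a: Action, rng: random.Random):
        """Sample (next state, transition time, reward) for one decision."""
        dist = self.transition[(s, a)]
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        return nxt, self.sojourn(s, a, rng), self.reward(s, a)


# Hypothetical three-state example (not the model of Figure 5).
toy = SMDP(
    states=["sleep", "idle", "active"],
    actions=["wake", "process", "shut_down"],
    transition={
        ("sleep", "wake"): {"idle": 1.0},
        ("idle", "process"): {"active": 1.0},
        ("idle", "shut_down"): {"sleep": 1.0},
        ("active", "process"): {"active": 0.7, "idle": 0.3},
    },
    sojourn=lambda s, a, rng: rng.expovariate(1.0),
    reward=lambda s, a: {"sleep": 0.0, "idle": -0.2, "active": -1.0}[s],
)
toy.validate()
```

Representing Z as a sampler rather than a closed-form distribution keeps the sketch short; an optimization formulation would instead need the distribution (or its mean) for each state-action pair.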
The FTQ may be represented by the G/M/1 queuing model, where inter-arrival times are arbitrarily distributed and service times are exponentially distributed [13]. A general distribution is assumed for the inter-arrival times because an exponential distribution would underestimate the occurrence probability for long request inter-arrival times and so it does not adequately model the request arrival time in the idle periods [14]. The service time behavior is captured by a given service time distribution for the functional module when it is in the active mode. Similarly, the input request behavior is modeled by a given inter-arrival time distribution. Let Si represent the ith state in a SMP, and Ii denote the task (i.e., the job descriptor) inter-arrival time whose distribution depends only on the present state Si. Assuming that inter-arrival times are mutually independent, we may define the arrival process of tasks at time t from state i to state j of the SMDP as follows:

$a_{ij}(t) = \mathrm{Prob}\{\, S^{n+1} = j,\ I_i < t \mid S^{n} = i \,\}$
where y is the unique solution of Laplace-Stieltjes transform of the inter-arrival time distribution function [16], which is

We would like to consider the utilization of a functional module, i.e., how much of the computational resource provided by the functional module is utilized by the application. More precisely, the utilization ratio, Uk, may be defined as:
where BP is the duration of the busy period of the functional module, and IP is the duration of its idle period. Without presenting the proof, we simply state (cf. [13] ):
where E(T) is the expected number of transitions in the SMDP. Thus, given the number, n, of tasks waiting in the FTQ, we can calculate BP and IP as follows

where p(x, y) is the probability that the system moves to state y from state x (see Figure 4). For a policy π, we define the discounted cost C of a processing path δ of length k as follows.

$C_{\pi}(\delta) = \sum_{i=0}^{k} \gamma^{t^{i}} \, \mathrm{cost}(s^{i}, a^{i}) \qquad (10)$

where t^i is the time that the system spends in state s^i before action a^i causes a transition to state s^(i+1).

2 In this paper, subscripts denote state information whereas superscripts denote time stamp.

[19] based on RTL simulation of the system. We use the TSMC 130nm LP library, which has 3 optional operating voltages (1.35V, 1.5V, and 1.65V) and dual (high and low) Vth for standard cells.
Our proposed multiple Vdd/Vth assignment method takes as input a circuit that has been optimized for maximum speed by using the available slack, which is obtained by Synopsys Design Compiler. After determining the timing-critical paths of the circuit, we use a high supply voltage, Vdd,h, and a low threshold voltage, Vth,l, for the gates on those paths. We use a low supply voltage, Vdd,l, for the other gates, especially those that drive a large capacitance, since this approach yields the largest dynamic power savings. Figure 6 shows the power characteristics of EthernetMAC. On the other hand, an all low-Vth cell-based design produces 38uW leakage power with a 9.36ns circuit delay. In addition to the reduction in leakage power, this approach also reduces the peak power dissipation. (The peak power dissipation in a localized space can cause local heating and a peak temperature.) Figure 7 shows the change in power distribution inside the EthernetMAC module before and after the multiple Vdd/Vth assignment.
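The assignment rule in the preceding paragraph can be sketched as a simple slack-based classification. This is a minimal illustration, not the actual flow: the `Gate` record, the slack threshold, and the example numbers are hypothetical, and giving non-critical gates the high threshold voltage is an assumption made here (the text only states that they receive the low supply voltage). Level-converter overhead between voltage domains is ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Gate:
    name: str
    slack_ns: float  # timing slack as reported by a synthesis or STA tool
    load_ff: float   # driven capacitance, illustrative units


def assign_vdd_vth(gates: List[Gate],
                   critical_slack_ns: float = 0.0) -> Dict[str, Tuple[str, str]]:
    """Gates on timing-critical paths get (Vdd_h, Vth_l); others get Vdd_l."""
    assignment = {}
    for g in gates:
        if g.slack_ns <= critical_slack_ns:
            assignment[g.name] = ("Vdd_h", "Vth_l")  # keep critical paths fast
        else:
            # Assumption: pair the low supply with the high threshold
            # on non-critical gates to also cut leakage.
            assignment[g.name] = ("Vdd_l", "Vth_h")
    return assignment


def low_vdd_priority(gates: List[Gate],
                     critical_slack_ns: float = 0.0) -> List[str]:
    """Non-critical gates ordered by driven load, largest first,
    since heavily loaded gates offer the largest dynamic power savings."""
    candidates = [g for g in gates if g.slack_ns > critical_slack_ns]
    return [g.name for g in sorted(candidates, key=lambda g: g.load_ff, reverse=True)]


# Hypothetical netlist fragment.
netlist = [Gate("u1", 0.0, 12.0), Gate("u2", 1.4, 40.0), Gate("u3", 0.9, 85.0)]
plan = assign_vdd_vth(netlist)
order = low_vdd_priority(netlist)
```

In a real flow the slack values would be re-evaluated after each change, because lowering Vdd or raising Vth on one gate consumes slack on every path through it.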
Figure 6. Power characteristics of EthernetMAC (high and low Vth at Vdd = 1.35V, 1.5V, and 1.65V).

Figure 9 indicates that the performance metrics (delay and leakage power) are adjusted gradually and that the trade-off becomes more dramatic at the corner cases (all high-Vth and all low-Vth cell assignments).

Figure 10. Energy due to leakage currents vs. workload.
In the fourth experiment, we set the performance constraints on Td and Uk (e.g., Td = 5 and Uk = 0.6) as in equation (12). The solution of the SMDP-based optimization problem produces an optimal policy. Different arrival rates (λ) of tasks are used to generate the multiple rows in Table 2, which reports the energy consumption for various Vdd/Vth assignments in the active and idle modes of the functional modules (e.g., QP, DI, and DMA). We assume that the service time is 1 to simplify the calculations. Next, we apply different workloads to each module to simulate the optimal policy, as shown in Table 3. The results demonstrate that SMDP-based optimization produces energy savings of up to 20% in the active mode and up to 56% in the idle mode.
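For readers who want to reproduce the queueing side of such an experiment, the sketch below evaluates a G/M/1 queue numerically using standard textbook results: the root σ of σ = A*(μ(1 - σ)), where A* is the Laplace-Stieltjes transform of the inter-arrival time distribution, gives a mean time in system of 1/(μ(1 - σ)). Several things here are assumptions rather than statements from the paper: the function names, the two example arrival distributions, reading Td as a bound on the mean time in system, and defining utilization as BP/(BP + IP).

```python
import math
from typing import Callable


def gm1_sigma(lst: Callable[[float], float], mu: float,
              tol: float = 1e-12, max_iter: int = 100_000) -> float:
    """Smallest root in (0, 1) of sigma = A*(mu * (1 - sigma)),
    found by fixed-point iteration starting from 0 (requires load < 1)."""
    sigma = 0.0
    for _ in range(max_iter):
        nxt = lst(mu * (1.0 - sigma))
        if abs(nxt - sigma) < tol:
            return nxt
        sigma = nxt
    raise RuntimeError("fixed-point iteration did not converge")


def mean_sojourn(sigma: float, mu: float) -> float:
    """Mean time a task spends in a G/M/1 queue (waiting plus service)."""
    return 1.0 / (mu * (1.0 - sigma))


def utilization(busy: float, idle: float) -> float:
    """Assumed definition: fraction of time spent in busy periods."""
    return busy / (busy + idle)


def exp_lst(lam: float) -> Callable[[float], float]:
    """LST of exponential inter-arrival times with rate lam."""
    return lambda s: lam / (lam + s)


def det_lst(lam: float) -> Callable[[float], float]:
    """LST of deterministic inter-arrival times with period 1/lam."""
    return lambda s: math.exp(-s / lam)


mu = 1.0   # service time of 1, as in the experiment
lam = 0.6  # example arrival rate (assumed value)
sigma_exp = gm1_sigma(exp_lst(lam), mu)  # approximately 0.6 for this case
delay_exp = mean_sojourn(sigma_exp, mu)  # approximately 2.5
sigma_det = gm1_sigma(det_lst(lam), mu)
meets_td = delay_exp <= 5.0              # check against Td = 5
```

With exponential arrivals the model reduces to M/M/1, which provides a convenient sanity check because σ then equals λ/μ; the deterministic case yields a smaller σ, reflecting less queueing for the same load.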
