Abstract
Introduction
As more power-managed functional blocks (FBs) are built to exploit power-saving opportunities through dynamic voltage and frequency scaling (DVFS) techniques, the task of integrating multiple power management policies in a single system is becoming ever more challenging. Furthermore, although current CMOS technologies allow an increasing number of clock and voltage domains to be specified on the same chip, traditional dynamic power management (DPM) methods have not been able to take full advantage of DVFS techniques because of intricate trade-offs between power savings and performance constraints. This is because a system-level power manager (PM) has only limited control over power-saving techniques, owing to the additional power and delay costs incurred during power-mode transitions [1]. In addition, the power management routine, most likely residing in the operating system, can itself become a heavy burden, since it has to continually monitor the workload of the FBs and send DVFS assignment commands to them [2].
DVFS-enabling techniques depend not only on the configuration of the voltage/frequency control circuits, but also on the efficiency of the prediction mechanism used by the PM to set the voltage/frequency levels. As shown in [3]-[8], the problem of determining a system-level power management policy with DVFS techniques has received a lot of attention. In [3], the authors present a frequency management method that uses variable update intervals instead of a fixed update interval, with frequency scheduling based on an effective deadline mechanism. Analytical models for selecting an optimal DVFS setting under tight performance constraints are presented in [4] [5]. Reference [6] presents an online DVFS technique that utilizes interface queues to guide the DVFS control in a multiple clock and voltage domain architecture. A voltage island-based power management technique is proposed in [7] to meet a performance constraint in multi-threshold CMOS technologies. The authors of [8] present an optimization technique for power-mode transitions under timing constraints.
Although all of the above techniques perform DVFS based on power management policies, little attention has been paid to handling variable frequency adjustment and prediction with hardware-control mechanisms, which minimize the computational effort required of the PM. Furthermore, traditional approaches to DPM, which rely mainly on software control of power-saving techniques, are highly dependent on the speed of the operating system and therefore incur non-negligible overhead. Thus, minimizing the overhead of power-mode transitions in real time with a hardware-control architecture is an important step toward guaranteeing the quality of DPM techniques. This is precisely the contribution of the present paper.
In this paper, we present a power management framework for dynamic continuous frequency adjustment (CFA), which provides power-saving opportunities by dynamically and continuously adjusting a variable operating frequency. The basic idea of CFA is to eliminate the power and delay costs incurred by power-mode transitions that involve clock generators (e.g., a PLL). Predicting the workload of tasks is formulated as an initial value problem (IVP) [9], and the frequency is adjusted by the proposed dynamic frequency adapter. Note that the IVP formulation in this paper determines the workload value at a future time, given its value at the current time.
The remainder of this paper is organized as follows. Section 2 provides a motivational example, and Section 3 describes the proposed architecture. In Section 4, we present the details of the workload prediction technique. Experimental results and the conclusion are given in Sections 5 and 6, respectively.
Motivational Example
According to conventional DPM approaches [10], where many FBs in a system are equipped with multiple power-performance states (e.g., sleep, idle, and active modes), a PM sends a command, i.e., a DFS (dynamic frequency scaling) value, to each FB. For example, as shown in Figure 1, where we assume that each FB has three active states controlled by DFS values, the service provider (SP) or the service requestor (SR) can switch between the different speed levels, where $\mathrm{DFS}_1 < \mathrm{DFS}_2 < \mathrm{DFS}_3$ in terms of frequency values. The PM monitors the current workload of the system by looking into the corresponding service queue (SQ) to adjust the DFS value for each FB. A transition into or out of a power-performance state (i.e., the dotted arrows in the figure), however, consumes energy and/or incurs a delay penalty that may not be negligible. Attempting to respond greedily to workload changes so as to provide an optimal DVFS value can result in significant energy and delay overheads associated with mode transitions. To solve this problem, a software component that predicts the required future performance level of the system, and thereby prevents frequent power-mode transitions, has been incorporated into the power managers of [2] [3]. Although these prediction methods help reduce the energy overheads, they have two disadvantages: i) a software-oriented prediction algorithm increases the computational overhead of the PM that resides in the driver or the operating system, and ii) when a phase-locked loop (PLL) is used to effect a frequency change, the FB may be stalled during the lock time of the PLL. Consequently, using the PLL to realize the DFS setting commanded by the PM may result in a sizeable performance penalty [11]. The main contribution of our work is that we predict the workload level for the next time step and ramp the system frequency up (or down) in a continuous manner until the target frequency value is reached, so that the FB never needs to be stalled.
Power Management Architecture
In this section we present a platform-specific continuous frequency adjustment (CFA) technique. The target system is comprised of various software and hardware modules, which include a power manager (PM), a performance monitoring unit (PMU), and a dynamic frequency adapter (DFA), as depicted in Figure 2 . Note that the DFA is implemented as a hardware module to minimize software-oriented computational efforts. 
Power Manager
The main goal of the PM is to determine and execute a power management policy (i.e., one that maps workloads to power state transition commands so as to minimize the system energy dissipation), based on the information provided by the PMU. The PM consists of a workload prediction component and a policy determination component. In this paper, we focus on the selection of the optimal frequency value and on the continuous (gradual) change of the system frequency from its current value to the target value.
Performance Monitoring Unit
The PMU profiles and analyzes the workload characteristics (e.g., the arrival rate of tasks) by looking into the SQ. In our problem setup, the SQ of each FB is represented by the G/M/1 queuing model, whereby the inter-arrival times are arbitrarily distributed and the service times are exponentially distributed [12]. Note that the service time behavior of each FB is captured in the form of the service time distribution for the FB when it is in the active mode. Similarly, the input request behavior (i.e., workload) for each FB is modeled by the request inter-arrival time distribution at the corresponding input queue. Details of the G/M/1 queuing model are omitted here to save space; interested readers may refer to [12].
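As a rough illustration of the queue behavior that the PMU observes, the waiting times in a single-server queue with general inter-arrival times and exponential service times can be simulated with the Lindley recursion. The sketch below is not taken from the paper: the function name `lindley_waits`, the uniform inter-arrival distribution, the seed, and the sample count are assumptions chosen only for illustration.

```python
import random

def lindley_waits(inter_arrivals, services):
    """Waiting time of each customer in a single-server FIFO queue.

    inter_arrivals[n] is the gap between the arrivals of customers n and n+1;
    services[n] is the service time of customer n. The first customer never waits.
    """
    waits = [0.0]
    for gap, service in zip(inter_arrivals, services):
        # Lindley recursion: W_{n+1} = max(0, W_n + S_n - A_{n+1})
        waits.append(max(0.0, waits[-1] + service - gap))
    return waits

# Assumed workload: uniform ("general") inter-arrival times, exponential service (mean 1).
rng = random.Random(0)
n = 10_000
gaps = [rng.uniform(0.5, 2.5) for _ in range(n)]
servs = [rng.expovariate(1.0) for _ in range(n)]
waits = lindley_waits(gaps, servs)
print(f"mean waiting time over {len(waits)} customers: {sum(waits) / len(waits):.3f}")
```

Swapping the inter-arrival generator for another distribution changes the "G" part of the model without touching the recursion itself.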
Dynamic Frequency Adapter
When the workload of an FB changes frequently, deciding what frequency value to assign to the FB becomes increasingly difficult. Furthermore, conventional PLL-based frequency scaling techniques incur energy and delay overheads whenever they change the frequency value. To overcome these shortcomings, we present a workload-aware DFA that generates a continuously varying frequency for each FB. One benefit of using a variable frequency is that each FB remains functional even while its frequency is being adjusted, while still satisfying the required performance. Furthermore, the DFA is able to increase (or decrease) the operating frequency at a slow or fast rate with the help of the PMU, depending on how slowly or quickly the workload is changing and what the user-specified preferences are, as depicted in Figure 3.
The procedure for continuously adjusting the frequency is as follows. The PM examines the workload of each FB at decision epoch n+1 for the time interval from decision epoch n to n+1 and subsequently sets the frequency value of each FB for the next interval, from epoch n+1 to the next decision epoch n+2 (see the next section for details of the frequency prediction algorithm). Note that each time-based or interrupt-based event occurrence is called a decision epoch. We assume that a mapping table for selecting an optimal operating frequency as a function of the workload has been provided. If the workload changes quickly (slowly), the interval during which the frequency adjustment is performed is shortened (lengthened) to improve DFA responsiveness. In the proposed framework, determining which frequency level to use in which time interval is implemented in hardware. Figure 4 shows the block diagram of the proposed DFA, which generates a variable frequency by using a pulse width modulation technique. The DFA is implemented on chip, where noise and signal integrity problems can be managed effectively.
Note that we apply this architecture to a specific target system (e.g., high-speed networking controller), where the operating frequency is rather low (e.g., around 200MHz), yet the system provides high throughput. 
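The gradual frequency change performed by the DFA can be pictured with a simple software model. The sketch below assumes a linear ramp and uses hypothetical names (`ramp_frequency`, `num_steps`) and example frequencies in MHz; the actual DFA is a hardware block based on pulse width modulation, which this model does not attempt to reproduce.

```python
def ramp_frequency(f_current, f_target, num_steps):
    """Return the intermediate frequencies of a linear ramp that ends at f_target."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    delta = f_target - f_current
    return [f_current + delta * k / num_steps for k in range(1, num_steps + 1)]

# Fewer steps model a fast ramp (rapidly changing workload); more steps model a slow one.
fast = ramp_frequency(100.0, 200.0, num_steps=2)
slow = ramp_frequency(100.0, 200.0, num_steps=4)
print(fast)
print(slow)
```

Because every intermediate value is a valid operating point, the modeled block keeps running throughout the ramp, which is the property the text contrasts with a PLL relock.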
Workload-Driven Frequency Adjustment
In this section, we present a workload prediction technique based on the initial value problem (IVP) formulation and a procedure of dynamic frequency adjustment.
IVP-based Workload Prediction
Assume that, by utilizing the PMU, a PM is able to monitor the current workload of the tasks at decision epochs $t_1, \ldots, t_n$, where $t_{i+1} = t_i + T$. Let $w(t)$ denote the workload (i.e., the arrival rate of tasks) of a target FB at time $t$, and let $f$ be a function providing the operating frequency for the FB in terms of time and workload in every interval $[t_i, t_{i+1}]$. Then, an initial value problem (IVP) may be defined to predict $w(t)$ as follows:
where the smaller this time step $h$ is, the more accurate the results will be. ODE solvers differ in how they approximate $w'(t)$ and in whether and how they adaptively adjust $h$. Considering accuracy and overhead, we have evaluated a number of methods for solving the IVP, including Euler's method, the 4th-order Runge-Kutta method, and the 4th-order Adams predictor-corrector method [9]. In Figure 5, we assume $w(0) = 0.3$ as the initial value. The time step size is defined as $h = T/K$, where the time interval $[t_i, t_i + T]$ is divided into $K$ equal-length segments. It is clearly seen that Euler's method, the simplest approach for solving the IVP, shows low accuracy (i.e., high error) in predicting the workload value, where the error is defined as the difference between the exact values and the computed approximations. In contrast, the 4th-order Runge-Kutta method exhibits low error and consistent stability in predicting the workload value. The 4th-order Adams predictor-corrector method is also accurate, but has higher computational complexity.
Figure 6 shows the trade-off between accuracy and the time step $h$ in terms of workload prediction performance, where time (x-axis) is defined in terms of successive time steps. In this evaluation, the 4th-order Runge-Kutta method is used with an initial value $w(0) = 0.3$. The choice of the time step size is crucial, since a smaller time step increases the computational overhead in software (e.g., the operating system). We use various values for the time step size $h$ (2, 5, and 10), with $T$ fixed, while monitoring the error in predicting the workload values. A time step of size 2 yields high accuracy but increases the computational effort in software (due to more computations in the same interval), whereas a step size of 10 requires less computation at the cost of lower accuracy.
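The comparison between Euler's method and the 4th-order Runge-Kutta method can be reproduced in miniature. The sketch below assumes that the IVP takes the standard form $w'(t) = f(t, w(t))$ with $w(0) = 0.3$, and it uses a hypothetical right-hand side (logistic growth) chosen only because its exact solution is known; the values of `T` and `K` are likewise illustrative and are not the paper's.

```python
import math

def f(t, w):
    # Hypothetical workload dynamics (logistic growth); an assumption, not the paper's f.
    return w * (1.0 - w)

def exact(t, w0=0.3):
    # Closed-form solution of w' = w(1 - w) with w(0) = w0.
    return 1.0 / (1.0 + (1.0 / w0 - 1.0) * math.exp(-t))

def euler_step(f, t, w, h):
    return w + h * f(t, w)

def rk4_step(f, t, w, h):
    k1 = f(t, w)
    k2 = f(t + h / 2, w + h / 2 * k1)
    k3 = f(t + h / 2, w + h / 2 * k2)
    k4 = f(t + h, w + h * k3)
    return w + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(step, f, w0, T, K):
    """Advance from t = 0 to t = T in K equal steps of size h = T / K."""
    h = T / K
    t, w = 0.0, w0
    for _ in range(K):
        w = step(f, t, w, h)
        t += h
    return w

T, K = 2.0, 10
euler_err = abs(integrate(euler_step, f, 0.3, T, K) - exact(T))
rk4_err = abs(integrate(rk4_step, f, 0.3, T, K) - exact(T))
print(f"Euler error: {euler_err:.2e}, RK4 error: {rk4_err:.2e}")
```

Increasing `K` (that is, shrinking `h`) reduces both errors, but each Runge-Kutta step costs four evaluations of `f` instead of one, which is the accuracy-versus-overhead trade-off discussed above.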
In our problem setup, we have empirically observed that a time step size of 5 provides a reasonable trade-off point. To make the workload prediction technique more suitable for online implementation, an efficient one-step method known as the midpoint method is utilized to solve the IVP. Specifically, at time instance $t$ in $[t_i, t_i + T]$, we predict the workload value for time $t + h$ based on the value at time $t + h/2$, which is obtained by using the midpoint method, as depicted in Figure 7. First, the current workload at time $t$ is monitored by the PMU, and the power manager reads a frequency value from a pre-characterized workload-frequency mapping table. Note that we do not use the value predicted for time $t$, which was previously computed at time $t - h$, because the exact frequency value can be obtained at time $t$. Next, the workload value at time $t + h/2$ is estimated by using a moving average method (for example, with a moving-average window of size 2). This workload value is subsequently used as the midpoint estimate of the workload in the upcoming period. In particular, it is used along with $w_{exact}(t)$ to compute $w_{pred}(t + h)$ by applying the IVP. The advantage of this prediction method is that we do not attempt to predict $w_{pred}(t + h)$ directly by using a moving average method only. Instead, we estimate the workload value for a nearer time in the future (which should provide higher accuracy) and use that value to initially estimate the rate of workload change in the upcoming period, before finally computing $w_{pred}(t + h)$ by solving the IVP.
If $w_{pred}(t + h) = w_{exact}(t)$, the current frequency value is maintained. It is worthwhile to mention that the DFA is capable of handling throughput and power budgets. If there is a target throughput, for example, the DFA will slowly increase the frequency up to a target frequency value that results in just-enough throughput and minimum power dissipation.
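The one-step midpoint update can be sketched as follows. This is the textbook explicit midpoint method, not necessarily the exact procedure of Figure 7: the function name `midpoint_step`, the optional `w_mid` argument (a hook for an externally estimated midpoint workload, such as a moving-average estimate), and the example right-hand side `f` are all assumptions made for illustration.

```python
def midpoint_step(f, t, w, h, w_mid=None):
    """Advance w(t) to w(t + h) with the explicit midpoint method.

    If w_mid is given, it is used as the estimate of w(t + h/2);
    otherwise the textbook half-step Euler estimate is used.
    """
    if w_mid is None:
        w_mid = w + (h / 2) * f(t, w)
    # The slope evaluated at the midpoint drives the full step.
    return w + h * f(t + h / 2, w_mid)

def f(t, w):
    # Hypothetical workload dynamics, used only to exercise the function.
    return w * (1.0 - w)

w_exact_t = 0.3  # workload monitored at time t (illustrative value)
w_pred = midpoint_step(f, 0.0, w_exact_t, h=0.5)
print(f"predicted workload at t + h: {w_pred:.4f}")
```

Passing `w_mid` explicitly mirrors the idea in the text of estimating a nearer-term workload value first and then using it to set the rate of change over the full step.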
Workload-Driven Frequency Adjustment

Mapping of Workloads to Optimal Frequency
The entries of the workload-frequency mapping table correspond to various values of workload (i.e., the arrival rate of tasks). Figure 9 illustrates the mapping from workloads to an optimal operating frequency, assuming that the arrival rate lies between 0.1 and 0.9. The pre-characterized mapping table in this figure is obtained through extensive offline simulations at design time, considering the performance characteristics of each FB provided by the user or application, in a way similar to [5] [14]. For example, when the power manager predicts the workload for the near future, an optimal frequency value for the next decision epoch is selected and provided to the DFA, which continuously changes the operating frequency from its present value to the target value. Note that the mapping from workload to operating frequency is realized by a simple linear function that respects the maximum and minimum operating frequencies applicable to the FB in question.
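The linear workload-to-frequency mapping can be sketched as a clamped interpolation. The function name and the frequency bounds used below (50 MHz and 200 MHz) are hypothetical; the paper's table is pre-characterized offline, and its actual entries are not reproduced here.

```python
def workload_to_frequency(w, f_min, f_max, w_min=0.1, w_max=0.9):
    """Map a normalized arrival rate to a frequency by linear interpolation.

    Workloads outside [w_min, w_max] are clamped so that the result always
    stays within the allowed range [f_min, f_max].
    """
    w = min(max(w, w_min), w_max)
    return f_min + (f_max - f_min) * (w - w_min) / (w_max - w_min)

# Assumed bounds in MHz, for illustration only.
for rate in (0.05, 0.1, 0.5, 0.9, 1.0):
    print(rate, workload_to_frequency(rate, f_min=50.0, f_max=200.0))
```

The value returned for a predicted workload would then serve as the target toward which the frequency is ramped.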
Experimental Results
In the experimental setup, we applied the proposed CFA technique to a high-speed network controller (i.e., a gigabit Ethernet controller), which includes IEEE 802.3 PHY/MAC blocks, a RISC processor, a direct memory access (DMA) engine, a PCI-E core, etc., as shown in Figure 10. This embedded system is implemented with the TSMC 65nmLP library. To capture the power-saving opportunities offered by the proposed technique, we consider a part of the packet reception process inside the system, which involves the Ethernet MAC (EMAC) block and the control block. Note that concentrating on these blocks (i.e., those inside the dotted box in Figure 10) simplifies the experimental setup without weakening the evaluation, since they sufficiently exhibit the characteristics of the SR, the SP, and the SQ. Thus, the continuously varying frequency value is applied to the control block (i.e., the SP) by the hardware-implemented DFA, where its frequency is optimally adjusted. The functions performed by these blocks (i.e., the EMAC and control blocks) are briefly explained as follows. The EMAC receives a data stream from the selected physical layer interface and performs address checking, CRC calculation, and CSMA/CD functions [15]. The control block calculates the checksum, parses TCP/IP headers, and classifies the frames based on a set of matching rules. While the control data is processed in the control block, the frame data is temporarily stored in memory buffers before being sent to the local interconnect through the PCI-E interface.
Figure 11. Power consumption of the service queue.
We first obtain the power and energy dissipation of the service provider (i.e., the control block) and the service queue (i.e., memory) by using the TSMC 65nmLP library, which offers three operating voltage options (1.08 V, 1.20 V, and 1.29 V). To calculate accurate values for static and dynamic power consumption, we used SAIF (Switching Activity Interchange File) data from back-annotated RTL simulation of the system with Power Compiler [16]. To obtain the energy dissipation of the control block, different workloads (e.g., the arrival rate of traffic) are used to generate the multiple columns in Table 1, where both dynamic and static power values are considered. For the simulation setup, we assume that the maximum full-duplex bandwidth (e.g., 1000Base-T) is achieved. Note that the overhead of adding the DFA block to the system is negligible because of its small gate count (around 150 standard cells) and low power dissipation (around 2 µW, including dynamic and static power). Figure 11 shows the power consumption of the service queue in terms of the normalized arrival rate of the traffic. We set the packet size to 64 bytes and the service time to 1 (by using the G/M/1 queuing model) for simplicity. For example, when the normalized arrival rate of the tasks is 0.29, the memory size necessary for buffering the incoming data is 5.8 times greater than when the arrival rate is 0.04.
Next, we evaluate the effectiveness of the proposed CFA technique. We assume that the workload changes dynamically between 0.1 and 0.9. For comparison purposes, we implemented two power management policies (denoted by PM1 and PM2 and described below) as representatives of the conventional methods, similar to [7] [13]. We use three frequency values to simplify the experimental setup ($F_1 < F_2 < F_3$). Then, we randomly generate dynamic workloads with 100, 500, 1000, and 5000 power management decision epochs and apply both of the above-mentioned conventional policies and the proposed power management technique to the control block.
PM1:
The simulation results in Figure 12, which correspond to the case of 100 decision epochs, show that the proposed CFA technique achieves energy savings compared to the conventional methods. The results in Table 2, which also reports the characteristics of the workload distribution (e.g., Low: 0.1 ≤ arrival rate ≤ 0.3), demonstrate that, compared to the PM1 policy, our approach achieves power and energy savings of up to 13.6% and 12.3%, respectively.
Conclusion
In this paper, we addressed the problem of power management under dynamic frequency scaling, where the power-mode transition cost is no longer negligible in nano-scale systems. We proposed a continuous frequency adjustment technique based on a workload prediction method, which minimizes the transition cost. Experimental results with a 65nm design show that the proposed technique achieves robust energy savings under dynamic workloads.
