Abstract-In this paper, we consider energy-aware network devices (e.g., routers, switches, etc.) able to trade their energy consumption for packet forwarding performance by means of both low power idle and adaptive rate schemes. We focus on state-of-the-art packet processing engines, which generally represent the most energy-hungry components of network devices, and which are often composed of a number of parallel pipelines to "divide and conquer" the incoming traffic load. Our goal is to control both the power configuration of the pipelines and the way traffic flows are distributed among them, in order to optimize the trade-off between energy consumption and network performance indexes. With this aim, we propose and analyze a constrained optimization policy, which tries to find the best trade-off between power consumption and packet latency times. To understand the impact of this policy in depth, a number of tests have been performed using experimental data from SW router architectures and real-world traffic traces.
I. INTRODUCTION
It is well known that network links and devices are provisioned for busy or rush-hour load, which typically exceeds their average utilization by a wide margin [1]. Although this margin is seldom reached, it determines the power consumption, which remains more or less constant even in the presence of fluctuating traffic loads. This situation suggests the possibility of adapting network energy requirements to the actual traffic profiles. Thus the key to any advanced power-saving criterion lies in dynamically adapting the resources provided at the network, link or equipment level to current traffic requirements and loads [2] [3].
In more detail, it is well known that today's networks rely very strongly on electronics, despite the great progress of optics in transmission and switching. Operational power requirements arise from all the HW elements realizing network-specific functionalities, like the ones related to the data- and control-planes, as well as from elements devoted to auxiliary functionalities (e.g., air cooling, power supply, etc.). In this respect, the data-plane certainly represents the most energy-hungry and critical element in most network device architectures, since it is generally composed of special-purpose HW elements (packet processing engines, network interfaces, etc.) that have to perform per-packet forwarding operations at very high speeds.
In this sense, Tucker et al. [4] and Neilson [5] focused on high-end IP routers, and estimated that the data-plane accounts for 54% of the power consumption of the overall device, vs. 11% for the control plane and 35% for power and heat management. The same authors further broke down the energy consumption sources at the data-plane on a per-functionality basis: internal packet processing engines require about 60% of the power consumed at the data-plane of a high-end router, network interfaces account for 13%, the switching fabric for 18.5% and buffer management for 8.5%.
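As a rough worked example of what these figures imply, the short Python sketch below combines the two breakdowns quoted above. It assumes that the per-functionality shares apply to the same data-plane budget as the 54% figure, which the text suggests but does not state explicitly; the dictionary and function names are illustrative only.

```python
# Shares quoted in the text from [4] and [5], in percent.
DEVICE_SHARE = {
    "data plane": 54.0,
    "control plane": 11.0,
    "power and heat management": 35.0,
}
DATA_PLANE_SHARE = {
    "packet processing engines": 60.0,
    "network interfaces": 13.0,
    "switching fabric": 18.5,
    "buffer management": 8.5,
}


def share_of_device(component: str) -> float:
    """Percentage of the whole device budget attributed to a data-plane component.

    Assumption: the data-plane breakdown can be multiplied by the data-plane
    share of the device budget.
    """
    return DEVICE_SHARE["data plane"] * DATA_PLANE_SHARE[component] / 100.0


if __name__ == "__main__":
    for name in DATA_PLANE_SHARE:
        print(f"{name}: {share_of_device(name):.1f}% of the device budget")
```

Under this assumption, packet processing engines alone would account for roughly one third of the power drawn by a high-end router.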
Starting from these data, we decided to focus on packet processing engines, which generally represent the most energy-hungry physical component of many network devices, and not only of high-end routers. These engines are realized with heterogeneous HW technologies (from classical ASIC [6] or FPGA [7] chips to GPU-based ones [8]), and often have highly parallel architectures in order to "divide and conquer" the traffic load incoming from a number of high-speed interfaces.
Traffic flows enter and leave the engine through Serializer/Deserializer (SerDes) buses, which are realized with different standards such as PCI Express, SGMII, XGMII, XAUI, etc. In high-performance architectures, as shown in Fig. 1, a specific HW component is required in order to multiplex and de-multiplex traffic between the SerDes and the parallel pipelines of the engine. This component can be included inside the packet processing engine itself [6], or it can be placed in the interface cards before the SerDes bus (as in the Receive-Side Scaling (RSS) standard for server network interface cards [9]).
This work has been supported by the ECONET (low Energy Consumption NETworks) project co-funded by the European Commission under the 7th Framework Programme (FP7).
IEEE Online Conference on Green Communications
978-1-4244-9519-1/11/$26.00 ©2011 IEEE
In such a scenario, we assume the adoption of two basic techniques, already widespread in silicon technologies, in order to reduce the energy requirements of packet processing engines: Adaptive Rate (AR) and Low Power Idle (LPI). The former allows dynamically modulating the capacity of a processing engine (or of a single pipeline) in order to meet traffic loads and service requirements, while the latter forces processing engines (or single pipelines) to enter low power states when not sending or processing packets. As outlined in a number of previous works [1] [2] [12], the use of such techniques generally allows trading energy consumption for networking performance (in terms of packet latency times, loss rate, etc.).
Assuming the possibility of selectively tuning AR and LPI mechanisms for each parallel pipeline, our goal is to dynamically manage the engine configuration in order to optimally balance its energy consumption against its network performance. Given the incoming load features and parameters, we want to find i) how many pipelines have to be active, ii) their AR and LPI configurations, and iii) which share of the incoming traffic volume the load balancer module must assign to each of them. To this purpose, we modeled the energy- and network-aware dynamics of packet processing engines, and formalized an optimization problem in a sufficiently general way to reflect different criteria, such as:
i) the minimization of energy consumption for a certain constraint on packet latency time, or ii) the maximization of network performance for a given energy cap, or iii) the optimization of a given trade-off between the two previous policies. The optimization problem takes constraints on maximum energy consumption and packet latencies explicitly into account.
The paper is organized as follows. Section II introduces AR and LPI capabilities and how they affect network performance. The model for energy-aware pipelines is described in Section III, and the optimization problem is defined in Section IV. Some numerical results obtained with real traffic traces are in Section V, and the conclusions in Section VI.
II. ENERGY-AWARE SILICON AND NETWORK PERFORMANCE
Most current network equipment does not include power scaling capabilities, but power management is a key feature in today's processors across all market segments, and it is rapidly evolving in other hardware (HW) technologies as well [10]. The rest of this section is structured as follows. Subsection II.A introduces how the ACPI (Advanced Configuration and Power Interface) standard makes AR and LPI capabilities accessible to the SW layer. Subsection II.B discusses the impact of AR and LPI on the forwarding performance of a network device, and how these two capabilities may interact with each other.
A. The ACPI example
In general purpose computing systems, the ACPI [11] standard models AR and LPI functionalities by introducing two sets of energy-aware states, namely performance and power states (P- and C-states), respectively.
Regarding the C-states, $C_0$ is an active state where the CPU executes instructions, while $C_1$ through $C_n$ are processor LPI states. As the sleeping power state ($C_1, \dots, C_n$) becomes deeper, the transition between active and sleeping (and vice versa) requires a longer time.
ACPI also allows the performance of the processor's core to be tuned through P-state transitions. P-states allow modifying the operating energy point of a core by altering its working frequency and/or voltage, or by throttling its clock. Thus, by using P-states, a core can consume different amounts of power while providing different processing performance in the $C_0$ state. At a given P-state, the core can move to higher C-states in idle conditions. In general, the higher the index of the P- and C-states, the lower the power consumed and the heat dissipated. Due to issues in silicon electrical stability, transitions between different P-states are generally slow: a large part of current CPUs can switch their operating P-state in about 10 ms. Given such large P-state transition times, it is worth noting that closed-loop control policies with tight time constraints are not feasible and cannot be adopted for optimizing power consumption inside network device architectures.
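As an illustration of how such states become visible to the SW layer, the sketch below reads idle-state and frequency information from the Linux sysfs tree. This is not taken from the paper: it assumes a Linux host whose cpuidle and cpufreq subsystems are populated, and the helper names are hypothetical. The `scaling_available_frequencies` file is only provided by some frequency drivers, so the second helper returns an empty list when it is absent.

```python
from pathlib import Path


def read_idle_states(cpu_dir):
    """Return (name, exit latency in microseconds) for each cpuidle state of one CPU."""
    state_dirs = sorted(
        Path(cpu_dir, "cpuidle").glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        latency_us = int((state_dir / "latency").read_text().strip())
        states.append((name, latency_us))
    return states


def read_available_frequencies(cpu_dir):
    """Return the selectable core frequencies in kHz, or [] if the driver does not list them."""
    path = Path(cpu_dir, "cpufreq", "scaling_available_frequencies")
    if not path.exists():
        return []
    return [int(khz) for khz in path.read_text().split()]


if __name__ == "__main__":
    cpu0 = "/sys/devices/system/cpu/cpu0"
    print("idle states:", read_idle_states(cpu0))
    print("frequencies [kHz]:", read_available_frequencies(cpu0))
```

On a typical host the listed exit latencies grow with the depth of the idle state, which matches the qualitative behaviour described above.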
B. The energy-aware trade-offs
As previously sketched, LPI and AR have different impacts on packet forwarding performance. As shown in Fig. 2, AR (Fig. 2c) stretches packet service times, while the sole adoption of LPI (Fig. 2b) introduces an additional delay in packet service, due to the wake-up times. Moreover, preliminary studies in this field [1] showed that performance scaling and idle logic act as traffic shaping mechanisms, with opposite effects on the traffic burstiness level. The wake-up times in LPI favour packet grouping, and hence an increase in traffic burstiness, while service time expansion in AR favours burst untying, and consequently a smoothing of the traffic profile. Finally, as outlined in Fig. 2d, the joint adoption of both energy-aware capabilities may not lead to outstanding energy gains, since performance scaling causes larger packet service times, and consequently shorter idle periods. It is worth noting that the overall energy saving and the network performance strictly depend on incoming traffic volumes and statistical features (inter-arrival times, burstiness levels, etc.). For instance, idle logic provides the best energy and network performance when the incoming traffic has a high burstiness level. This is because fewer active-idle transitions (and wake-up times) are needed, and the HW can remain in a low consumption state for longer periods.
Table I. Model notation
$\tau_{on}(C_x)$: time needed to wake up the HW from the $C_x$ sleeping state
$\tau_{off}(C_x)$: time needed to put the active HW into the $C_x$ sleeping state
$\tau_{setup}(P_y)$: time to recover forwarding operation after the HW wake-up
$\mu(P_y)$: packet service rate in the $P_y$ state
$\Phi_a(P_y)$: power consumption when the server is active in the $P_y$ state
$\Phi_{idle}(C_x)$: power consumption when the server is sleeping in the $C_x$ state
$\Phi_t(C_x)$: power consumption during the $\tau_{on}$ and $\tau_{off}$ periods
$\tau$: server vacation time, $\tau = \tau_{on}(C_x) + \tau_{setup}(P_y) + \tau_{off}(C_x)$
$\lambda$: rate of batch arrival
$\beta_j$: probability that an incoming burst contains $j$ packets
$\beta$: average number of customers in a batch
$P_n$: stationary probability of having $n \in [0, N]$ packets in the queuing system
$\rho$: traffic utilization of the server, which in the case of an infinite buffer can be expressed as $\rho = \lambda \beta / \mu$
III. MODELING ENERGY-AWARE PIPELINES
This section is organized as follows. Subsection A introduces the model for pipelines, discussing how AR and LPI influence packet processing. The model for the incoming traffic is in subsection B. Finally, subsection C briefly reports some details of the adopted analytical model, and defines the energy- and network-aware performance indexes.
A. The pipeline model
In order to represent the behavior of the pipelines of an energy-aware packet processing engine with LPI and AR capabilities, we decided to adopt the model in [13]. This model is founded on classical concepts of queuing theory, and it is specifically designed to estimate energy- and network-aware performance indexes. For the sake of simplicity, let us adopt the ACPI representation of power management primitives, and refer to AR and LPI configurations in terms of P- and C-states. We model the packet computation engine of the network device as a single-server queuing system with maximum service rate $\mu$.
The selection of different P- and C-states is supposed to affect the pipeline performance in terms of both the packet service capacity and the wake-up times of the server. Similarly to [12] and as previously sketched, the service rate $\mu$ represents the device capacity in terms of packet headers that can be processed per second. Moreover, we assume that all packet headers require a constant service time. This hypothesis represents a reasonable approximation for a large part of current routing and switching devices. The model notation is introduced in Table I. Let $\{C_0, C_1, \dots, C_X\}$ and $\{P_0, P_1, \dots, P_Y\}$ be the sets of sleeping and performance states available in the pipeline, respectively.
Each sleeping state is associated with both a different value of idle power consumption $\Phi_{idle}(C_x)$ and different transition times $\tau_{off}(C_x)$ and $\tau_{on}(C_x)$, needed to enter and to wake up from the idle state, respectively. Let us suppose that a deeper sleeping state is characterized both by a lower power consumption and by a larger transition period.
In a similar way, each P-state is associated with a different active power consumption $\Phi_a(P_y)$, as well as a different packet processing capacity $\mu(P_y)$. As the index $y$ of the $P_y$ state increases, both $\Phi_a(P_y)$ and $\mu(P_y)$ decrease.
However, transitions from the active state $C_0$ to the $C_x$ state are not instantaneous, and a transition time $\tau_{off}$ is required. When new packets are received, the pipeline has to wake up by exiting the $C_x$ state and returning to the active one (this requires an additional $\tau_{on}$ period). Furthermore, depending on the specific HW/SW architecture and implementation, an additional time $\tau_{setup}$ is required to set up and suitably configure the packet processing. It is worth noting that, while $\tau_{on}$ and $\tau_{off}$ depend on the sleeping state $C_x$, the $\tau_{setup}$ parameter depends on the $P_y$ state, since it represents a certain number of operations that have to be performed by the server before restarting packet-forwarding operations. The instantaneous power requirements $\Phi(t)$ can be expressed as follows:
$$\Phi(t) = \begin{cases} \Phi_a(P_y) & \text{if the server is in the } C_0 \text{ state} \\ \Phi_{idle}(C_x) & \text{if the server is in the } C_x \text{ state} \\ \Phi_t(C_x) & \text{if the server is moving between } C_0 \text{ and } C_x \end{cases}$$
Since in most HW platforms $\tau_{off} \ll \tau_{on}$, in the model derived in this paper we neglect the $\tau_{off}$ period.
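The pipeline model above can be captured in a small data structure. The following Python sketch is only one possible encoding: the class and field names are hypothetical, and any numerical values a reader plugs in are their own, not those of the paper's tables.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class CState:
    name: str
    idle_power_w: float        # Phi_idle(C_x)
    transition_power_w: float  # Phi_t(C_x)
    t_on_s: float              # tau_on(C_x), wake-up time
    t_off_s: float             # tau_off(C_x), time to enter the sleeping state


@dataclass(frozen=True)
class PState:
    name: str
    active_power_w: float      # Phi_a(P_y)
    service_rate_pps: float    # mu(P_y), packet headers per second
    t_setup_s: float           # tau_setup(P_y)


class Mode(Enum):
    ACTIVE = "active"          # server in C_0
    SLEEPING = "sleeping"      # server in C_x
    TRANSITION = "transition"  # server moving between C_0 and C_x


def instantaneous_power(mode, c_state, p_state):
    """Piecewise instantaneous power of one pipeline, following the three cases in the text."""
    if mode is Mode.ACTIVE:
        return p_state.active_power_w
    if mode is Mode.SLEEPING:
        return c_state.idle_power_w
    return c_state.transition_power_w


def vacation_time(c_state, p_state, neglect_t_off=True):
    """Server vacation time tau; tau_off is dropped by default, as in the text."""
    tau = c_state.t_on_s + p_state.t_setup_s
    return tau if neglect_t_off else tau + c_state.t_off_s
```

Keeping the C-state and P-state parameters in separate immutable records mirrors the fact that the two can be selected independently on each pipeline.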
B. The traffic model
Packet inter-arrival times are well known to exhibit Long Range Dependency (LRD) and multi-fractal statistical features [14]. However, as argued more recently in [15] and [16], a Batch Markov Arrival Process (BMAP) can effectively approximate the behavior of network traffic.
Therefore, we decided to model incoming traffic through a BMAP with LRD batch sizes. We assume that groups of $j$ packets are received at exponential inter-arrival times with average value equal to $1/\lambda$. The sizes $j$ of packet batches are supposed to follow Zipf's law (which can be thought of as the discrete version of the Pareto probability distribution).
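A minimal sketch of this traffic model is given below, using only the Python standard library. The Zipf exponent and the truncation of batch sizes at `max_size` are assumptions introduced to keep the example finite; the text does not specify either, and the function names are illustrative.

```python
import random


def zipf_pmf(exponent, max_size):
    """Truncated Zipf probabilities beta_j for batch sizes j = 1..max_size."""
    weights = [j ** -exponent for j in range(1, max_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_batch_size(pmf):
    """Average number of packets per batch (beta)."""
    return sum(j * p for j, p in enumerate(pmf, start=1))


def generate_batches(batch_rate, pmf, horizon_s, rng):
    """Yield (arrival time, batch size) pairs up to horizon_s.

    Batches arrive with exponential inter-arrival times of mean 1/batch_rate;
    sizes are drawn independently from the given pmf.
    """
    sizes = range(1, len(pmf) + 1)
    t = rng.expovariate(batch_rate)
    while t < horizon_s:
        yield t, rng.choices(sizes, weights=pmf)[0]
        t += rng.expovariate(batch_rate)


if __name__ == "__main__":
    pmf = zipf_pmf(exponent=1.5, max_size=100)   # hypothetical parameters
    rng = random.Random(42)
    batches = list(generate_batches(1000.0, pmf, 1.0, rng))
    print(len(batches), "batches, mean size", mean_batch_size(pmf))
```

Passing an explicit `random.Random` instance keeps a generated trace reproducible, which is convenient when comparing load balancing policies on the same input.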
C. The network- and energy-aware performance indexes
The model we propose corresponds to an $M^X/D/1/SET$ queuing system [17]. Packets arrive in batches with exponentially distributed inter-arrival times at average rate $\lambda$, and are served by a single server at a fixed rate $\mu$. In order to take the LPI transition periods into account, the model considers deterministic server setup times. In more detail, when the system becomes empty, the server is turned off. The system becomes operative again only when a batch of packets arrives. At this point, service can begin only after an interval $\tau = \tau_{on} + \tau_{conf}$ has elapsed.
Under these assumptions, and as demonstrated in [13], the average packet waiting time $\widetilde{W}$ can be expressed as follows:
and the average power consumption as:
This model has been validated against SW router architectures based on COTS HW. The results showed good accuracy, since the maximum estimation error was lower than 2% for both power consumption and packet latency times.
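To give a feeling for how such indexes behave, the sketch below evaluates textbook mean-value results for a batch-arrival, deterministic-service queue with a deterministic setup time. It is not the expression derived in [13]: it assumes an infinite buffer, requires the second moment of the batch size as an extra input, and bills the whole setup interval at the transition power. All function and parameter names are hypothetical.

```python
def utilization(batch_rate, mean_batch, service_rate):
    """rho = lambda * beta / mu (infinite-buffer case)."""
    return batch_rate * mean_batch / service_rate


def mean_waiting_time(batch_rate, mean_batch, second_moment_batch,
                      service_rate, setup_time):
    """Mean time a packet waits before its service starts.

    Sum of three terms: delay behind earlier batches, delay behind packets
    of the same batch, and extra delay caused by the setup (wake-up) time.
    """
    rho = utilization(batch_rate, mean_batch, service_rate)
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    s = 1.0 / service_rate  # constant per-packet service time
    behind_batches = batch_rate * second_moment_batch * s ** 2 / (2.0 * (1.0 - rho))
    inside_batch = (second_moment_batch - mean_batch) * s / (2.0 * mean_batch)
    setup = ((2.0 * setup_time + batch_rate * setup_time ** 2)
             / (2.0 * (1.0 + batch_rate * setup_time)))
    return behind_batches + inside_batch + setup


def mean_power(batch_rate, mean_batch, service_rate, setup_time,
               p_active, p_idle, p_transition):
    """Time-averaged power from the fractions of time spent serving, sleeping and waking up."""
    rho = utilization(batch_rate, mean_batch, service_rate)
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    sleeping = (1.0 - rho) / (1.0 + batch_rate * setup_time)
    waking = (1.0 - rho) - sleeping
    return rho * p_active + sleeping * p_idle + waking * p_transition
```

With single-packet batches and no setup time, the waiting time reduces to the classical M/D/1 value, which is a convenient sanity check for the implementation.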
IV. THE ENERGY-AWARE LOAD BALANCING
This section is organized as follows. The definition of the optimization problem is in subsection A. Subsection B introduces some preliminary results that can be used to better understand the proposed policy according to different trade-off values and traffic volumes.
A. Optimization Problem Definition
We consider a traffic de-multiplexer distributing the incoming traffic among $\Lambda$ parallel pipelines. The traffic incoming to the de-multiplexer is represented as a BMAP process with a batch arrival rate $\hat{\lambda}$, and with Zipf-distributed packet batches with an average length equal to $\hat{\beta}$.
Starting from the main achievements of previous works [1], and in order to make optimal use of LPI primitives, we decided not to untie the incoming packet batches, and to send every packet composing a batch to a single pipeline. This design choice reduces the power consumption of the system at the cost of a slight increase in packet latency times, especially at low incoming traffic loads¹. Under such assumptions, we can simply deduce that the traffic incoming to each pipeline is still a BMAP, with the following parameters:
Thus, we can define the average power consumption of our system as the sum of the contributions from the $\Lambda$ parallel pipelines:
$$\Phi = \sum_{i=0}^{\Lambda-1} \Phi^{(i)}$$
and the average latency time experienced by a packet entering the system can be defined as follows:
¹ The model and the load balancing criterion can also be extended in a straightforward way to consider the untying of packet batches.
Given the features of the incoming traffic load (in terms of $\hat{\lambda}$, $\hat{\beta}_j$ and $\hat{\beta}$) and thresholds on the maximum values of both packet latency $W^*$ and power consumption $\Phi^*$, the objective of the load balancing criterion is to find the best values of $\lambda^{(i)}$, $C_x^{(i)}$, and $P_y^{(i)}$ for all $i = 0, \dots, \Lambda - 1$, so that the system achieves the best trade-off between network performance and energy consumption. Thus, we define our optimization problem as follows:
where the index $\gamma$ ranges between 0 and 1, and represents the "trade-off parameter", which weights the minimization of power consumption against that of average packet latency. It is worth noting that, for $\gamma = 0$, our optimization problem corresponds to the maximization of network performance for a given power consumption cap, while for $\gamma = 1$ it corresponds to the minimization of the system power consumption constrained to a maximum value of average latency.
The optimization problem is quite complex, since it has a non-linear objective function, which depends on both discrete variables (i.e., $C_x^{(i)}$, $P_y^{(i)}$ for all $i = 0, \dots, \Lambda - 1$) and continuous variables (i.e., $\lambda^{(i)}$ for all $i = 0, \dots, \Lambda - 1$).
Since the number of pipelines $\Lambda$ and the number of available C- and P-states are generally low, our minimization strategy mainly consists in solving the problem for each available configuration of the C- and P-states of the pipelines. In more detail, for each feasible combination $\{(C_x^{(0)}, P_y^{(0)}), \dots, (C_x^{(\Lambda-1)}, P_y^{(\Lambda-1)})\}$, we find the values $\{\dot{\lambda}^{(0)}, \dots, \dot{\lambda}^{(\Lambda-1)}\}$ that minimize the objective function and satisfy the constraints.
Moreover, exploiting the last constraint in eq. 9, we can express $\lambda^{(\Lambda-1)} = \hat{\lambda} - \sum_{i=0}^{\Lambda-2} \lambda^{(i)}$ and consequently reduce the number of variables. Then, we find the minimum of the objective function by studying its partial derivatives with respect to $\lambda^{(i)}$, for all $i = 0, \dots, \Lambda - 2$, inside the region satisfying the constraints and on its boundary.
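The enumeration strategy described above can be sketched in Python as follows. Several elements are assumptions made for illustration: the state tables are hypothetical (they are not the values of Tables II and III), the per-pipeline indexes use textbook infinite-buffer expressions rather than those of [13], the objective is taken to be a normalized weighted sum of power and latency, and the continuous shares are searched on a coarse grid instead of through partial derivatives.

```python
from itertools import product

# Hypothetical state tables, for illustration only.
P_STATES = [(60.0, 800e3), (45.0, 600e3)]  # (active power [W], service rate mu [pkt/s])
C_STATES = [(20.0, 10e-6), (8.0, 50e-6)]   # (idle power [W], wake-up time tau_on [s])


def pipeline_metrics(lam, beta1, beta2, p_state, c_state):
    """Return (mean power [W], mean latency [s]) of one pipeline, or None if overloaded.

    lam is the batch rate sent to the pipeline; beta1 and beta2 are the first
    and second moments of the batch size.
    """
    p_active, mu = p_state
    p_idle, t_on = c_state
    if lam == 0.0:
        return p_idle, 0.0                 # an unused pipeline simply sleeps
    rho = lam * beta1 / mu
    if rho >= 1.0:
        return None
    tau = t_on + 1.0 / mu                  # wake-up time plus a setup time of 1/mu
    wait = (lam * beta2 / (2.0 * mu ** 2 * (1.0 - rho))
            + (beta2 - beta1) / (2.0 * beta1 * mu)
            + (2.0 * tau + lam * tau ** 2) / (2.0 * (1.0 + lam * tau)))
    sleeping = (1.0 - rho) / (1.0 + lam * tau)
    # Assumption: the whole interval tau is billed at the active power.
    power = (1.0 - sleeping) * p_active + sleeping * p_idle
    return power, wait + 1.0 / mu


def grid_shares(n, steps):
    """Yield every n-tuple of non-negative integers summing to steps."""
    if n == 1:
        yield (steps,)
        return
    for k in range(steps + 1):
        for rest in grid_shares(n - 1, steps - k):
            yield (k,) + rest


def optimize(lam_total, beta1, beta2, n_pipes, gamma, w_max, phi_max, steps=10):
    """Exhaustive search over state combinations and gridded traffic shares."""
    best = None
    configs = list(product(range(len(C_STATES)), range(len(P_STATES))))
    for combo in product(configs, repeat=n_pipes):
        for split in grid_shares(n_pipes, steps):
            power, latency = 0.0, 0.0
            for (c, p), k in zip(combo, split):
                m = pipeline_metrics(lam_total * k / steps, beta1, beta2,
                                     P_STATES[p], C_STATES[c])
                if m is None:
                    break
                power += m[0]
                latency += (k / steps) * m[1]   # traffic-weighted average
            else:
                if power > phi_max or latency > w_max:
                    continue
                cost = gamma * power / phi_max + (1.0 - gamma) * latency / w_max
                if best is None or cost < best["cost"]:
                    best = {"cost": cost, "states": combo, "power": power,
                            "latency": latency,
                            "shares": tuple(lam_total * k / steps for k in split)}
    return best


if __name__ == "__main__":
    for g in (0.0, 0.5, 1.0):
        print(g, optimize(200e3, 4.0, 30.0, 2, g, 50e-6, 250.0))
```

The search cost grows exponentially with the number of pipelines, which is acceptable only because, as noted above, the number of pipelines and of available states is generally low.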
B. Analyzing the trade-off
In order to better understand and characterize the effects of the proposed optimization policy and the role of the trade-off parameter $\gamma$, we decided to perform some preliminary tests in the presence of a variable incoming load.
In more detail, we considered a packet processing engine with Λ=4 pipelines, and we used the parameters of a Xeon 5550 processor, generally used in Linux-based SW routers [12] . This choice is mainly because current HW routers do not include AR and LPI capabilities, and only their nominal and/or maximum power consumptions are reported in the datasheets.
Each pipeline corresponds to a processor core and, as shown in Tables II and III, includes AR and LPI capabilities in terms of 4 available P-states and 3 C-states (including $C_0$), respectively. Previous experiments on SW router architectures [12] suggest using the values indicated in Table II for the $\tau_{on}$ parameter, and fixing $\tau_{conf} = \mu^{-1}$. The selection of a C- or P-state on a pipeline is fully independent of the other pipelines.
As far as the incoming traffic is concerned, by observing parameters in real traffic traces (e.g., see Fig. 12), we decided to fix $\beta = 4$, while we increased the value of $\lambda$ from 1 kpkt/s to 2.5 Mpkt/s (which, in our case, roughly corresponds to the threshold beyond which the optimization constraints cannot be satisfied). The optimization problem has been solved for various values of the trade-off parameter, namely $\gamma$ = 0, 0.25, 0.5, 0.75 and 1. The maximum latency $W^*$ has been fixed to 50 μs, and the constraint on power consumption $\Phi^*$ to 250 W.
Figs. 3-7 show the optimal shares $\{\dot{\lambda}^{(0)}, \dots, \dot{\lambda}^{(3)}\}$ of incoming traffic load for each pipeline with respect to different values of $\gamma$. Figs. 8 and 9 report the estimated power consumptions and the packet latency times, respectively, in the optimal configurations. Figs. 10 and 11 show how many pipelines are working in the available P- and C-states in the $\gamma$ = 0 and $\gamma$ = 0.75 cases.
By observing Figs. 3-7, we can see that, when latency times are minimized under the energy consumption constraint (i.e., $\gamma = 0$), the optimal policy divides the incoming load uniformly among the pipelines. This fairness is lost only at the highest load volumes ($\lambda$ of about 2.4 Mpkt/s or more). In fact, in order to satisfy the power consumption constraint, the optimization policy keeps 3 pipelines in $P_0$ and $C_1$, and reduces the energy consumption of the whole engine by decreasing the performance of pipeline 0. Accordingly, the load balancer reduces the load share sent to this pipeline.
On the contrary, when we minimize the power consumption for a given threshold on maximum latency times (i.e., $\gamma = 1$), the load balancer tries to concentrate as much traffic volume as possible into a few pipelines. For instance, with reference to Fig. 7, the load balancer redirects traffic only to pipeline 3 at very low incoming volumes. When a change in the C- or P-state configuration of pipeline 3 is needed to satisfy the network performance constraints, the optimization policy delays this configuration change and uses a few other pipelines as well. However, as the incoming traffic load increases further, the configuration change on pipeline 3 soon becomes more energy-efficient, and the largest part of the load returns to this pipeline. When $\lambda$ exceeds about 1.5 Mpkt/s, the optimization policy starts to distribute traffic among pipelines in an increasingly fair way in order to satisfy the $W^*$ constraint.
Regarding energy consumption and average latency times, the $\gamma = 1$ case exhibits a nearly linear behavior of $\Phi$ with respect to $\hat{\lambda}$, while $W$ is almost equal to $W^*$ for most values of $\hat{\lambda}$. This behavior is markedly different from the $\gamma = 0$ case, where $\Phi$ increases with a concave trend in $\hat{\lambda}$, and the $W$ values remain much lower than $W^*$.
As far as the other values of $\gamma$ are concerned (see Figs. 4-6), the optimization policy roughly behaves as the minimization of power consumption ($\gamma = 1$) at low traffic volumes, and as the minimization of packet latency ($\gamma = 0$) at higher loads. The macroscopic role of the trade-off parameter $\gamma$ appears to be that of moving the point where the optimization policy switches between the minimization of power consumption and the maximization of network performance: as $\gamma$ increases, the region with unfair traffic shares grows. This role is also evident in Fig. 8, where the power consumptions of the cases $\gamma$ = 0.25, 0.5 and 0.75 start by agreeing with the $\gamma$ = 1 curve and, as $\hat{\lambda}$ increases, end up meeting the $\gamma$ = 0 values one by one. As $\gamma$ rises, this meeting point occurs at higher traffic volumes. By observing Figs. 10 and 11, we can also see that the P- and C-state transitions become more frequent as $\gamma$ increases.
V. NUMERICAL RESULTS
In order to evaluate the proposed optimization policy in a suitable way, we decided to use the daily dynamics of real Internet traffic. In more detail, we used data from the traffic traces that are publicly available in [18] and are part of "A Day in the Life of the Internet" [19]². We used a 96-hour-long traffic trace divided into sequential time windows of 15 minutes. For each time window, we applied our optimization policy with the same values of $\gamma$ as in Section IV.B. Moreover, to obtain the results in this section, we kept the same packet processing engine configuration and the same values of $W^*$ and $\Phi^*$ as in the previous section.
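The windowing step can be sketched as below. The helper names are hypothetical and timestamps are assumed to be expressed in seconds from the start of the trace.

```python
WINDOW_S = 15 * 60  # 15-minute windows, as in the text


def number_of_windows(trace_hours, window_s=WINDOW_S):
    """Number of complete windows in a trace of the given duration."""
    return int(trace_hours * 3600 // window_s)


def split_into_windows(timestamps, window_s=WINDOW_S):
    """Group arrival timestamps into consecutive windows, keyed by window index."""
    windows = {}
    for t in timestamps:
        windows.setdefault(int(t // window_s), []).append(t)
    return windows


if __name__ == "__main__":
    print(number_of_windows(96))  # windows in a 96-hour trace
    print(split_into_windows([0.5, 899.9, 900.0, 1900.0]))
```

For the 96-hour trace this gives 384 windows, each of which is treated as an independent instance of the optimization problem.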
As far as the incoming traffic is concerned, for each time window we used the $\lambda$, $\beta$, and $\beta_i$ values as calculated from the traffic trace. In detail, these parameters were obtained by a least-squares fit of the Zipf distribution to the trace sample. The evolution of the offered traffic load over the duration of the reference traffic trace is reported in Fig. 12 in terms of burst arrival rates and burst sizes. The traffic load is at its minimum from 3:00 to 6:00, while rush hours occur at 11:00 and 14:00. It is interesting to note that an increase in the incoming traffic volume is due to the rise of both the batch arrival rate and the burst sizes.
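The text does not detail how the least-squares fit is carried out. One common choice, shown here purely as an assumption, is a linear regression of the empirical batch-size frequencies on a log-log scale, whose slope gives the Zipf exponent. Function names are illustrative.

```python
import math
from collections import Counter


def fit_zipf_exponent(batch_sizes):
    """Least-squares estimate of the Zipf exponent from observed batch sizes.

    Fits log(frequency) = a - s * log(size) and returns s.
    """
    counts = Counter(batch_sizes)
    if len(counts) < 2:
        raise ValueError("need at least two distinct batch sizes")
    total = sum(counts.values())
    sizes = sorted(counts)
    xs = [math.log(j) for j in sizes]
    ys = [math.log(counts[j] / total) for j in sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def window_parameters(batch_sizes, window_s):
    """Return (batch arrival rate, mean batch size, fitted Zipf exponent) for one window."""
    rate = len(batch_sizes) / window_s
    mean_size = sum(batch_sizes) / len(batch_sizes)
    return rate, mean_size, fit_zipf_exponent(batch_sizes)
```

A log-log regression gives equal weight to rare large batches and frequent small ones; other fitting choices (for example, fitting the probabilities directly) would weight the head of the distribution more heavily.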
Figs. 13 and 14 show the estimated values of $\Phi$ and $W$, respectively, in the optimal configuration with respect to the traffic trace time windows and different values of the trade-off parameter $\gamma$. In the same scenario, Figs. 15 and 16 show how many pipelines are using a certain P- or C-state, respectively. These figures clearly show that the optimization policies for $\gamma$ = 0, 0.25 and 0.5 provide almost the same results. As discussed in subsection IV.B, this is mainly because, for small values of $\gamma$ (such as 0.25 and 0.5), the optimization policy behaves like the pure minimization of packet latency beyond low volumes of incoming traffic, and the volumes in the considered traffic trace are higher than these thresholds. However, for $\gamma = 1$, the optimization policy saves about 12% of energy with respect to $\gamma = 0$. On the other hand, with $\gamma = 1$, the average packet latency time is always close to the $W^*$ value. Finally, for $\gamma = 0.75$, we obtain an energy saving of 2.5% with respect to $\gamma = 0$, and the $W$ values are slightly higher (by at most 5 μs) than for $\gamma = 0$, especially during low load periods (from 0:00 to 9:00).
² In order to meet the Software Router capacities in Table III, we increased the traffic volumes in the original trace by a scaling factor of 30.
VI. CONCLUSIONS
In this paper, we considered energy-aware network devices (e.g., routers, switches, etc.) able to trade their energy consumption for packet forwarding performance by means of both low power idle and adaptive rate schemes. We focused on state-of-the-art packet processing engines, which generally represent the most energy-hungry components of network devices, and which are often composed of a number of parallel pipelines to "divide and conquer" the incoming traffic load. Our goal was to control both the power configuration of the pipelines and the way traffic flows are distributed among them, in order to optimize the trade-off between energy consumption and network performance indexes. With this aim, we proposed and analyzed a constrained optimization policy, which optimizes the trade-off between power consumption and packet latency times. To understand the impact of this policy in depth, a number of tests have been performed using experimental data from SW router architectures and real-world traffic traces.
The obtained results showed that the proposed optimization policy, for low traffic volumes, roughly corresponds to the minimization of energy consumption constrained to a maximum packet latency. For higher volumes, the same policy starts to maximize network performance for a given energy cap. By tuning the trade-off parameter in the proposed objective function, we can control the incoming load at which the policy switches between the two behaviors.
