Abstract-The backbone of a large-scale supercomputer is the interconnection network. As compute nodes become more energy-efficient, the interconnect is accounting for an increasing proportion of the total system energy consumption. The interconnect's energy consumption is, however, only starting to receive serious attention. Some hardware-based schemes have been proposed that exploit idle periods or low utilisation, either by turning off the links or by lowering the frequency and voltage. Although these schemes are effective in certain cases, they do not have enough global information about the application's communication behaviour to efficiently manage the network power consumption. This paper proposes an alternative approach: moving the intelligence into the PMPI layer of the MPI library, and using prediction to discover repetitive patterns in the application's communication behaviour.
I. INTRODUCTION
High-Performance Computing (HPC) is a crucial tool for modern science. There is a constant need for more powerful supercomputers, but increasing performance is leading to excessive peak power demand and total energy consumption. While supercomputers have traditionally been ranked only by performance, now that power and energy are first-order design constraints, systems are also being ranked on energy efficiency [1] . An important characteristic of energy-efficient system components is energy proportionality, which means that energy consumption depends linearly on utilization.
The system's interconnect accounts for an important fraction of its total energy consumption. Nevertheless, although a significant effort has been invested into achieving energy proportionality of processors and memory, similar techniques in networks have not reached wide adoption. With energy-efficient processing elements and larger networks, the interconnection network is expected to account for up to 30% of the system's total power [2]. Outside HPC, where data centre processors often have low utilisation, this fraction can reach 50% [3]. Most of this power consumption is due to the interconnection links. For example, the links in an IBM eight-port Infiniband 12× switch consume 64% of the switch power [4]. High-performance interconnect links are, however, not energy proportional, since their power consumption is always near peak, whether or not they are actually being used for message transmission.
One approach to reduce network energy consumption is to put the links into low-power mode when they are not being used. The problem is that link state changes, from off to active, can take up to 10μs [5] . Since state changes add to the latency of MPI messages, and many HPC applications are highly sensitive to latency, this leads to an unacceptable loss in performance. An alternative is to lower the voltage and bandwidth of links when utilization is low, which has faster link reactivation, at about 100ns, but the potential power saving is much lower [3] . Both mechanisms switch between power modes using low-level hardware schemes [6] , [7] , [8] . Common drawbacks are the inability to capture significant energy savings, as well as an unknown and uncontrollable performance penalty.
Most HPC applications follow the bulk synchronous programming paradigm, in which application processes are synchronised, either all performing computation at the same time or all involved in communication. In general, application developers view the time spent in communication as overhead, and therefore try to minimize it. This leads to high peak bandwidth demand and latency sensitivity, but low average utilisation, which, as explored in the following section, provides significant opportunities for energy savings. Unfortunately, as mentioned above, current interconnects are not energy proportional, so the potential energy savings are lost.
The majority of execution time in most HPC applications is spent in a large number of iterative execution phases. Since the communication pattern inside each phase is essentially the same, it is possible to observe the communication behaviour in one iteration, and use the knowledge gained to predict the behaviour of the subsequent iterations. Specifically, this means detecting the patterns of MPI calls that are repeating within each MPI process. To achieve this, we use an algorithm based on n-gram extraction techniques, a widely used concept from statistical natural language processing [9], [10]. Our Pattern Prediction Algorithm (PPA) enables on-the-fly detection of these repeating patterns.

We apply the algorithm to recently announced power-saving features for Infiniband (IB) switches, reducing link power consumption without losing network connectivity. This feature works by shutting down all but one lane of a 4× IB link. The proposed energy-saving mechanism consists of two parts. The first part detects consecutive repeatable patterns in MPI communication, leading to successful prediction. The second part uses this prediction to control the shifting between link power modes. Both parts are executed in the PMPI profiling layer of MPI, eliminating the need for user involvement, but enabling customisation by the system integrator and by HPC operations. Existing MPI programs benefit from energy savings without needing any source code modification.
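To illustrate how interposition in the PMPI profiling layer works, here is a minimal sketch in C; the hook name ppa_observe_event and the event encoding are our placeholders, not the paper's actual code. Each MPI entry point can be wrapped so that the pattern-detection logic runs before the real call is forwarded:

```c
#include <mpi.h>

/* Illustrative hook; the name and event encoding are placeholders. */
void ppa_observe_event(int event_type);

/* The profiling (PMPI) layer lets a library intercept any MPI call:
 * the application calls MPI_Allreduce, this wrapper runs first, and
 * the real implementation is reached through PMPI_Allreduce. */
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    ppa_observe_event(10 /* assumed event id for Allreduce */);
    return PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
}
```

Because the wrapper is resolved by the linker, the application binary needs no source changes, which is what allows the mechanism to be deployed by the system integrator.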
Specifically, this paper makes the following contributions:
• We demonstrate a large potential to save energy in the interconnection network. The majority of HPC workloads we tested have long link idle times, allowing overheads in switching power modes to be amortised by large energy savings. We also point out how newly-announced features of IB switches will allow power savings without loss of network connectivity.
• We show that, for the studied HPC applications, the PPA algorithm can successfully exploit patterns in MPI communication. We measure prediction accuracy of up to 98%. We also provide a complete description of our power saving mechanism, enabling it to be run within the PMPI layer of MPI.
• We evaluate our energy-saving mechanism using an event-driven simulator and traces obtained from a production run on a real supercomputer. Results show an average reduction in IB switch energy consumption of up to 33%, compared with the power-unaware scheme where links are "always-on". We also show there is no significant increase in execution time; in particular, the worst average increase was around 1%.

The rest of this paper is structured as follows. Section II provides the motivation and necessary background to understand switching of IB links to low-power mode during iterative computation phases. Section III introduces the design of our link power saving mechanism and the PPA algorithm. Section IV describes the methodology and experimental evaluation, and it explores the energy-time tradeoff. Section V compares with the related work. Finally, Section VI presents the most important conclusions from this work.
II. BACKGROUND

A. Motivation
As discussed above, HPC applications typically follow the bulk synchronous programming model, in which network traffic is concentrated into distinct communication phases. It is reasonable to expect that, since the network links are idle during computation phases, there is an automatic opportunity to enter power-saving mode. It is, however, important to take account of the overhead in changing power mode, which is approximately 10μs [5]. There can be no energy savings from idle periods that are shorter than the total time to turn the link off then back on again. A significant energy saving is only possible if the idle period is much longer than this overhead. For simplicity in exposition, we assume that the time to turn the link off is the same as the time to turn it back on again, and therefore denote both by T_react. In summary, energy savings are only possible for idle periods with T_idle > 2×T_react.
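Expressed as code, the profitability test is a one-line predicate. A minimal sketch, assuming the 10μs transition time quoted above:

```c
#include <stdbool.h>

#define T_REACT_US 10.0   /* assumed link off/on transition time, microseconds */

/* An idle interval can save energy only if it is longer than the
 * combined time to turn the link off and back on again. */
static bool idle_interval_profitable(double t_idle_us)
{
    return t_idle_us > 2.0 * T_REACT_US;
}
```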
We evaluated the potential for link power reduction by analysing traces of typical HPC applications (Gromacs [11] , Alya [12] , WRF [13] and two NAS Parallel Benchmarks [14] ) running on a production machine. The machine is based on Bull B505 nodes, each with two 6-core Intel Xeon E5649 processors running at 2.53GHz and with 24GB of RAM. We configured the applications to use one MPI process per processor. We used strong scaling, in which the same workload was used irrespective of the number of processors.
The results are shown in Table I. We see that, for almost all applications, 99% of the link idle time is inside idle intervals that are longer than 20μs, which is twice the typical value of T_react. Even more importantly, in the majority of cases, more than 90% of the total link idle time is in longer idle intervals of duration T_idle > 200μs, where significant power can be saved. Since the goal is a reduction in operational costs over the lifetime of the supercomputer, the important consideration is the average potential energy savings over all applications. Only the NAS MG benchmark, when running with a large number of processes, has a figure lower than 90%. All the results in this paper are for the pessimistic case of strong scaling; better results are expected for weak scaling. Nevertheless, although, for strong scaling, the number of short intervals (T_idle < 20μs) rises with the number of MPI processes, short intervals still contribute a small proportion of the total idle time. Since long idle intervals account for most of the idle time, reducing link power only during the long idle intervals is sufficient to obtain most of the potential energy savings, resulting in close to energy proportionality.
While deactivating IB lanes can be overlapped with computation, reactivation may incur a latency in subsequent communication. In an ideal case, the IB link lanes would be turned on in time to avoid a latency penalty on the next message. We solve this problem by providing the necessary knowledge using a prediction algorithm.
B. Network power management support on IB switches
Mellanox has recently developed Host Channel Adaptors (HCAs) and switches that save power by optimizing each port's link width and speed. These optimizations are embedded in the HCA and switch hardware, and are enabled via the firmware. Port link width reduction is done using a method called Width Reduction Power Saving (WRPS). For example, using WRPS, a 40 Gb/s 4× QDR port can run as a 10 Gb/s 1× QDR port by shutting down three of its four QDR lanes. This reduction in link width reduces the power consumption of the Mellanox SX6036 switch to only 43% of its nominal power (when all four lanes are active) [15]. We use this published value of 43% in the evaluation section as the power consumption of an IB switch in low-power mode.
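A back-of-envelope calculation makes the opportunity concrete (our arithmetic, using only the 43% figure above): a switch drawing power P at full width draws 0.43P in WRPS mode, so each unit of time spent in low-power mode saves (1 − 0.43)P = 0.57P. A link that spends, say, 60% of its time in low-power mode would therefore cut switch power by roughly 0.57 × 0.6 ≈ 34%, which is consistent in magnitude with the savings reported later in this paper.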
III. DESIGN
This section describes our energy-saving mechanism, which reduces link power consumption during idle periods, with negligible impact on execution time. Figure 1 is a high-level view of our proposal, which consists of two parts. The first part, the Pattern Prediction Component, groups the observed MPI events into grams and runs the PPA algorithm to search for a repeating pattern; when one is found, it sets the patternPrediction flag. When patternPrediction is true, however, control of the link's power modes is transferred to the second part, the Power Mode Control Component (PMCC). Whenever this component is active, it is invoked after every MPI event. It compares the actual MPI events with those expected from the pattern. So long as they continue to match, the length of the next idle interval can be read from the pattern. At the start of expected long idle intervals, the link is put into low-power mode for the appropriate amount of time. As long as the program continues to follow the pattern, there is no need to invoke PPA, since the pattern is already known. It is only necessary to continue updating the idle intervals with recent values, allowing some adaptation to varying application characteristics. If the current MPI event does not match the pattern, however, PMCC sets patternPrediction to false. In that case, PPA is reactivated and the link is kept in full-power mode until the next repeatable pattern.
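A minimal sketch of this top-level control flow follows; the function names are ours, not the paper's, and the two callees stand for the components described in Sections III-A and III-C:

```c
#include <stdbool.h>

void pmcc_step(int event_type, double previous_idle_us);    /* Section III-C */
void grams_update(int event_type, double previous_idle_us); /* Algorithm 1   */
void ppa_maybe_run(void);                                   /* Algorithm 2   */

bool patternPrediction = false; /* true while a repeating pattern is followed */

/* Called from the PMPI wrapper on every intercepted MPI event. */
void on_mpi_event(int event_type, double previous_idle_us)
{
    if (patternPrediction) {
        /* Pattern known: compare against the prediction and drive link
         * power modes. On mismatch, pmcc_step() clears patternPrediction. */
        pmcc_step(event_type, previous_idle_us);
    } else {
        /* No pattern yet: keep forming grams; PPA runs once enough grams
         * exist, and sets patternPrediction on detection. */
        grams_update(event_type, previous_idle_us);
        ppa_maybe_run();
    }
}
```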
A. Pattern Prediction Component
The algorithm uses the concept of n-grams, which is extensively used in the area of natural language processing. The n-gram extraction approach has been used to efficiently detect DNA patterns [16] and patterns in musical notes [17]. An n-gram is defined to be a subsequence of n items in a sequence. In our case, the sequence of items, known as grams, is derived from the MPI events in the program's execution. Each gram is one or more consecutive MPI events that are separated only by short idle intervals, whereas the idle intervals between different grams are long. An n-gram is a sequence of n consecutive grams. Note that PPA works on the MPI events in a single process. Although it is outside the scope of this paper, if there are multiple MPI processes per node, prediction should be done inside each MPI process separately, with their outputs combined using a single PMCC per node.
Before the PPA algorithm is invoked, the grams need to be formed. Algorithm 1 performs the grouping of MPI events into grams, based on the idle time interval between adjacent MPI events. (PPA itself is called from Algorithm 1, on line 9, only when pos ≥ posNext + patSize, i.e. when enough grams have been completed.) Two consecutive MPI events are considered to be part of the same gram whenever the idle time separating them is less than a threshold known as the grouping threshold (GT). The intention is that the link enters low-power mode between grams but not inside them, so this grouping threshold should be larger than the critical value of 2×T_react discussed in Section II-A.
The input to the algorithm is the current MPI event type, eventType, and the length of the idle time preceding it, previousIdleTime. The output of the algorithm is the predicted pattern of MPI events, predictedPattern, and the current partial gram, currentGram, required by PMCC.
Here, an array of tuples, array, is created. Each tuple in this array holds the list of MPI events in the current gram, as well as the length of the idle interval that follows the gram. Note that the current gram can only be inserted into array when this latter length is known, i.e. on the first MPI event of the next gram, once the idle interval following the current gram has ended.

Figure 2 illustrates the effect of Algorithm 1. Each set of three consecutive MPI Sendrecv calls is grouped together to form a single gram, while each MPI Allreduce call is isolated as a separate gram. These grams will be used as building blocks to construct the repeatable communication patterns. The building of patterns is done by the PPA algorithm, which is invoked (on line 9 of Algorithm 1) only when there is no currently repeating pattern and a sufficient number of grams has been seen.
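A sketch of the gram data structure and the grouping rule of Algorithm 1 follows; all field and variable names are illustrative, and the GT value shown is only an example (Table III lists the per-application values actually used):

```c
#define GT_US        20.0   /* example grouping threshold; must exceed 2 x T_react */
#define MAX_GRAM_LEN 64

typedef struct {
    int    events[MAX_GRAM_LEN]; /* MPI event types in this gram */
    int    n_events;
    double idle_after_us;        /* idle interval following the gram */
} gram_t;

void array_append(const gram_t *g); /* appends a completed gram to 'array' */

static gram_t current_gram;

/* Grouping rule: an event separated from its predecessor by less than GT
 * extends the current gram; a longer gap closes the gram (recording the
 * gap as its trailing idle interval) and starts a new one. */
void grams_update(int event_type, double previous_idle_us)
{
    if (current_gram.n_events > 0 && previous_idle_us >= GT_US) {
        current_gram.idle_after_us = previous_idle_us;
        array_append(&current_gram);
        current_gram.n_events = 0;   /* start a new gram */
    }
    if (current_gram.n_events < MAX_GRAM_LEN)
        current_gram.events[current_gram.n_events++] = event_type;
}
```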
B. Pattern Prediction Algorithm
A repeating pattern is a sequence of grams that has been observed to occur at least twice consecutively. We established the following policy to discover these repeating patterns and accurately predict their continuation:
• After observing three consecutive occurrences of the same pattern, it is predicted to continue to repeat for a long time, meaning that the Power Mode Control Component is activated.
• On misprediction, the Power Mode Control Component is deactivated. However, observing the pattern once more causes it to be detected again, meaning that the Power Mode Control Component is reactivated. (A minimal sketch of this policy follows the list.)
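The policy above reduces to a small amount of state per candidate pattern. Below is a minimal counter-based rendering; it is our sketch, not the paper's Algorithm 2, which additionally grows and shrinks the n-gram:

```c
#include <stdbool.h>

typedef struct {
    int  consecutive_repeats;  /* consecutive repeats observed so far */
    bool used_before;          /* pattern was previously used for prediction */
} pattern_state_t;

/* Called each time the candidate pattern completes one occurrence. */
bool should_activate_pmcc(pattern_state_t *s, bool occurrence_matches)
{
    if (!occurrence_matches) {          /* misprediction: deactivate */
        s->consecutive_repeats = 0;
        return false;
    }
    s->consecutive_repeats++;
    /* Three consecutive occurrences = two repeats after the first sighting;
     * a previously-used pattern is reactivated after a single repeat. */
    return s->consecutive_repeats >= (s->used_before ? 1 : 2);
}
```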
This policy is implemented by Algorithm 2, the Pattern Prediction Algorithm (PPA). It is based on an algorithm proposed by Alawneh for the detection of process patterns [18]. We modified the algorithm to detect continuous repetitions of patterns in program execution, and to predict future appearances of a pattern from its previous ones.
The input to the PPA algorithm is the array of tuples, array, from Algorithm 1. Each tuple in the array corresponds to a completed gram, holding the list of MPI events inside it, as well as the length of the idle interval that follows it. The PPA algorithm builds a uthash [19] hash table, known as the pattern list, whose key is the pattern sequence (list of grams) and whose value is a tuple giving the pattern's length, its positions in the array, its frequency, the list of idle intervals between grams and the total number of MPI calls in the sequence. In addition, there are two indices into array: posCur, initially zero, which points to the current pattern, and posNext, initially equal to patSize, which itself starts with the value two.
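A sketch of such a pattern-list entry using the uthash API follows; the serialization of the gram sequence into a string key, and all field names, are our illustration, and the per-pattern position and idle-interval lists are elided:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include "uthash.h"

typedef struct {
    char          *key;       /* gram sequence serialized as a string,
                               * e.g. "41-41-41|10" for the bi-gram below */
    int            length;    /* pattern length in grams */
    int            frequency; /* how many times the sequence has been seen */
    /* positions in 'array', idle intervals, total MPI calls: elided */
    UT_hash_handle hh;        /* makes this struct hashable by uthash */
} pattern_t;

static pattern_t *pattern_list = NULL;

/* Insert or update a pattern; returns true on first insertion (newPattern). */
bool updatePL(const char *seq)
{
    pattern_t *p;
    HASH_FIND_STR(pattern_list, seq, p);
    if (p) {
        p->frequency++;       /* seen before: bump the count */
        return false;
    }
    p = calloc(1, sizeof *p);
    p->key = strdup(seq);
    p->frequency = 1;
    HASH_ADD_KEYPTR(hh, pattern_list, p->key, strlen(p->key), p);
    return true;
}
```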
The PPA algorithm is best understood using an example. Figure 3 illustrates the execution of the algorithm for the Alya workload. At the top, in Figure 3(a), is the list of MPI events grouped into grams; it is an extension of the example in Figure 2. Next, Figure 3(b) shows the progress of the algorithm, with each row corresponding to an MPI event. For simplicity, the lengths of the idle intervals have been omitted from array. At the bottom, Figure 3(c) shows the insertions into the pattern list.
We now follow the progress of the algorithm in Figure 3(b). The PPA algorithm will not be executed until there are sufficient completed grams in array (line 9 of Algorithm 1). Since the initial values of patSize and posNext are both two, the number of formed grams becomes large enough only on the ninth MPI call (line 9 in the PPA execution in Figure 3(b)). At this point, since newPattern is true and checkConsec is false, the only action is to insert the current gram into the pattern list (lines 48 to 50). The first bi-gram, 41-41-41 10, is therefore read from array, and added to the pattern list (lines 48 and 49). This insertion is shown in Figure 3(c). The return value from updatePL indicates whether this is the first insertion of that particular pattern sequence. It is, so newPattern is true.
On the next MPI event, newPattern and checkConsec are both true, so the first action is to check whether there are two consecutive identical patterns in the array (line 23). The comparison is between the bi-grams 41-41-41 10 and 10 41-41-41 at the beginning of the array. These do not match, so control passes to lines 36 to 40, where checkConsec becomes false, and both posCur and posNext are shifted one position. On the 11th MPI call, the second bi-gram, 10 10, is added to the pattern list, in a similar manner to the first. On the 13th MPI call, the third bi-gram is added. On the 15th MPI event, the 41-41-41 10 bi-gram is encountered for a second time. Since it was already present in the pattern list, newPattern is set to false (line 49). Inside updatePL, the frequency count, shown in the third column of the insertions list in Figure 3, is increased to two, and the list of positions is extended to [0, 3]. Next, on the 16th MPI event, checkConsec is true, but, as before, there is no consecutive repeat of the bi-gram 41-41-41 10. Therefore, checkConsec is set to false; but, since newPattern is now false, it is first necessary to check whether the enlarged pattern can detect its repetitions, before shifting both indices by patSize − 1.
On the 17th MPI event, newPattern is still false, and checkConsec is now false, since the sequence of grams 41-41-41 10 has been seen twice, but not consecutively. The pattern is therefore extended by one gram, forming the tri-gram 41-41-41 10 10. If this pattern had previously been used for prediction, then it would be immediately reactivated (lines 9 and 11), according to the second statement in the policy at the beginning of this section. This is not the case, so instead, line 15 checks whether all previous occurrences of the bi-gram 41-41-41 10 can be extended to the new tri-gram. If the newly constructed tri-gram cannot be detected at any previous position of its prefix bi-gram, and there are no consecutive repeats, then it is removed from the pattern list (line 43) and the n-gram size is reset to the minimal value of 2 (a bi-gram). Here, that is not the case, so match is set to true. Eventually, on the 17th call, the first consecutive repetition of the tri-gram 41-41-41 10 10 is found. At this point, consecutiveRepeats is incremented to 1, and both posCur and posNext are advanced by the pattern size. When PPA is next invoked, on the 21st MPI event, the second consecutive repeat is seen. The pattern is assigned to predictedPattern and patternPrediction is set to true (lines 28 to 31), since the PPA algorithm has successfully found the repeating pattern.

In order to recognize the natural (real) iteration of the application, and to predict each iteration based on the behaviour of the previous one, we must avoid merging multiple application iterations into a single pattern. This is done by setting the maximum pattern size to the length of the current pattern (line 28). If this were not done, and increasing numbers of application iterations were combined into a single pattern, prediction accuracy would suffer, since idle intervals would be predicted based on stale values from many iterations earlier. The pattern size can therefore vary from the smallest bi-gram up to the size defined by maxPatternSize.
C. Power Mode Control Component
The Power Mode Control Component is responsible for switching between link power modes, according to the current repeatable pattern. The algorithm is presented as Algorithm 3, which is executed only when the patternPrediction flag is true. The first input to the algorithm is the predicted pattern from Algorithm 2. This pattern is described by two arrays: the first is the sequence of grams, predictedPattern, and the second is the sequence of idle time intervals following those grams, idleTimeArray. The other input to the algorithm is the current gram being built by Algorithm 1.
Algorithm 3 works with the partial gram, and considers it to be complete when it has the correct length (line 5). If, in addition, the MPI events in the actual gram match the prediction (line 6), then the predicted length of the upcoming idle period, idleTime, is read from the array. It is adjusted by the displacement factor, as described below, and by T_react, yielding the final prediction, predictIdleTime. The resulting value can be passed as the argument to the WRPS method, giving the time to remain in low-power mode. If, on the other hand, the actual gram does not match the prediction, then the current pattern has finished, and PPA is reactivated by setting the patternPrediction flag to false.
The displacement factor, mentioned above, is a safety factor, used to take account of variability in the link idle intervals. To reduce the likelihood that the link is turned on too late, the predicted idle time is reduced using the displacement factor (line 8 of Algorithm 3). It is a value between 0 and 1, where 0 means that the predicted idle time is not reduced, and 1 means that it is reduced all the way to zero. For simplicity in presentation, the displacement factor is expressed as a percentage (so a displacement factor of 5% is equivalent to a value of 0.05 in the algorithm).
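One plausible reading of the line-8 computation is sketched below, under the assumption that T_react is subtracted so that reactivation completes before the next expected message; the exact expression in Algorithm 3 may differ:

```c
#define T_REACT_US 10.0   /* assumed lane reactivation time, microseconds */

/* Time to spend in low-power mode: shorten the predicted idle interval
 * by the displacement factor d (0..1) as a safety margin, and by T_react
 * so that the lanes are back at full width before the next message. */
double predict_idle_time(double idle_time_us, double d)
{
    double t = idle_time_us * (1.0 - d) - T_REACT_US;
    return t > 0.0 ? t : 0.0;   /* never request a negative sleep */
}
```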
Algorithm 3 Power Mode Control Component
The function of the displacement factor is illustrated in Figure 4. Figure 4(a) shows the case when the current pattern has an idle interval slightly longer than predicted. In this case, a displacement factor of 10% reduces the energy savings by slightly more than 10%, compared with optimal. Figure 4(b) shows the case when the current pattern has an idle interval shorter than predicted. In this case, the displacement factor of 10% has avoided the latency penalty that would have been incurred by switching on the link too late. In general, in the context of HPC, it is better to reduce the energy savings than risk a noticeable degradation in performance. Varying the displacement factor exposes a trade-off between the two.
D. Hardware Support

Figure 5 shows the hardware support that is required for IB link power management. A special command is required, which enables user code to request that the link enter low-power mode once any ongoing communication has completed. In order to avoid interrupting the CPU when it is time to wake up, we propose adding one hardware timer associated with the link. This timer is programmed using the predicted idle time. After the programmed delay elapses, the timer generates an interrupt to the firmware, which reactivates the lanes. Communication between PMCC and the hardware is unidirectional, meaning that there is no feedback to the system on the correctness of prediction.
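A hypothetical driver-level rendering of these two primitives is sketched below; the function names and shapes are entirely our illustration, as the actual WRPS control path is firmware-specific:

```c
#include <stdint.h>

/* Hypothetical firmware primitives for the support described above. */

/* Request that the port drop to 1x width once in-flight traffic drains,
 * and program the per-link hardware timer to restore full width after
 * wake_after_us microseconds (fire-and-forget: no feedback path). */
int ib_port_enter_low_power(int port, uint64_t wake_after_us);

/* Immediate reactivation path, needed only on misprediction. */
int ib_port_force_full_width(int port);
```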

IV. EXPERIMENTAL EVALUATION

A. Methodology
In order to quantify the performance impact and energy savings, we use the Venus-Dimemas [20], [21] simulator. Dimemas is an event-driven simulator, which replays a trace of the application's computation bursts and MPI activity, preserving causal relationships and timings. Venus is a detailed network simulator, which models the complete network architecture, including topology, routing, and an accurate switch/adapter model. Computation bursts are modelled by recording their durations in the trace. We obtained traces of five representative HPC applications on a machine based on Bull B505 nodes, each with two 6-core Intel Xeon E5649 processors running at 2.53 GHz and with 24 GB of RAM. The applications were configured with one MPI process per node and strong scaling (i.e. a fixed workload). The parameters of the simulated system are given in Table II; among them, routing is random, and switch power consumption in low-power mode is 43% of nominal [15].

We first ran the simulations without modifying the traces, in order to check that the original execution times were reproduced. Next, we applied PPA to the traces, inserting new events that mark when prediction is possible and when links are in low-power mode. When mispredictions happen, delays due to reactivation of the lanes are inserted into the traces. All other overheads associated with the power-saving mechanism are also inserted, including the time to execute the PPA algorithm, as well as the overheads of data collection. Finally, we simulate the new traces on Venus-Dimemas, in order to quantify the resulting performance and energy savings.
Using the Paraver tool [22], we measure the total amount of time for which the IB links are fully active, as well as the time that the links are in low-power mode. Figure 6 shows a trace from Paraver, in which the dark blue regions represent durations for which the links are fully active.
B. Results
This section presents and analyzes the experimental results, in terms of execution time and energy savings. For all benchmarks except NAS BT, we show results for runs with 8, 16, 32, 64 and 128 MPI processes. Since NAS BT requires the number of processors to be square, we instead run it with 9, 16, 36, 64, and 100 MPI processes. Figure 7 shows the energy savings and performance impact for a medium value of the displacement factor, equal to 5%. Since we used strong scaling workloads, the amount of communication relative to computation increases with the number of nodes, inevitably reducing the opportunities for energy savings. We expect this problem not to occur with weak scaling. For the same reason, larger scale runs suffer from a larger increase in execution time, but still the maximum average increase, across applications, is around 1%. Due to greater inter-process communication, the delays introduced by our power-saving mechanism can accumulate between processes. Depending on the communication pattern during execution, these delays can agglomerate, creating a total delay in the entire application that is much larger than a single local delay in one MPI process. This can be seen for the Gromacs application, where, in a run with 128 processes, we see more than a 4% increase in execution time.

Figures 8 and 9 explore the trade-off in varying the displacement factor. Choosing a larger displacement factor reduces the overheads incurred by waking the link up too late, at the cost of reduced time in low-power mode. The results for a large displacement factor of 10%, in Figure 8, show that the average energy reduction is lower, at 30.6%, with an almost negligible increase in execution time compared with the original. Using the smaller displacement factor of 1%, in contrast, shown in Figure 9, gives the largest average energy savings, of 33.5%, at the cost of a potentially larger impact on execution time.
The energy consumption of the interconnection network can be reduced further if other components in the switches can be turned off; e.g. the input buffers and crossbars. The reactivation times of these elements are much longer, at up to a millisecond, which could cause an unacceptably large increase in execution time. We expect that our power saving mechanism can better amortize larger reactivation times and allow switches to go to deeper low-power modes without any major negative effect on the execution times.
C. Grouping Threshold (GT) Value
An important parameter in the PPA algorithm is the grouping threshold (GT) value, which determines whether two consecutive MPI calls should be considered as part of the same gram. Since there are no opportunities for power savings during idle periods shorter than 2×T_react, the value of GT should be greater than this value. Table III shows the values of the grouping threshold that were used for evaluation, as well as the resulting prediction accuracy. Prediction accuracy is averaged over all MPI calls, including those outside the iterative parts of the application, which correspond to less predictable initialization and finalization phases. This is an important consideration for WRF, and partially for Gromacs, while for Alya, NAS BT and NAS MG, the majority of calls are inside the iterative phase and the prediction accuracy is rather high. It is interesting that although the WRF application has the lowest prediction accuracy, it has the second-largest power savings; see Figure 9(a). This is because the majority of large idle intervals are inside the iterative phase, while idle intervals in the other parts of the application are quite small. The opposite is true for the Alya application, where the prediction accuracy is high but the power savings are smaller. Here, the majority of the large idle intervals are not in the iterative part of the application.
D. System Overheads
To measure overheads, we relied on the system clock, using the gettimeofday system call. The overheads associated with intercepting an MPI call and reading the system time are approximately 1μs. These overheads occur on every MPI call, whereas the overheads of the power-saving mechanism itself do not. When the algorithm has predicted a repeating pattern, allowing the power-saving mechanism to shut down inactive lanes, PPA is disabled, and is only relaunched after a pattern misprediction. Likewise, PPA is not invoked until a sufficient number of grams has been formed. For activation/deactivation of the IB lanes, we chose a typical latency of 10μs. While deactivation is overlapped with computation, the reactivation penalty must be paid in the case of misprediction. The penalty can be equal to or smaller than the reactivation time, if reactivation has already started. The PPA overheads also vary, depending on the pattern size and the number of distinct patterns detected during the execution. We used a uthash [19] hash table to store the pattern objects, with the pattern itself as the key. Table IV shows the average overheads of PPA across the HPC applications. Although the overhead per MPI call may at first sight seem large, it occurs on only a small number of MPI calls (2.1% on average). The overheads associated with PPA could be further reduced by using faster hash tables.
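The measurement itself can be as simple as bracketing the step being timed with gettimeofday. A minimal sketch, in which the two hooks are assumed names standing for the measured step and the statistics sink:

```c
#include <sys/time.h>

void ppa_run(void);               /* assumed hook: the step being measured */
void record_overhead(double us);  /* assumed hook: accumulate statistics */

/* Microsecond-resolution wall-clock delta between two timestamps. */
static double elapsed_us(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
}

void timed_ppa_run(void)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    ppa_run();
    gettimeofday(&t1, NULL);
    record_overhead(elapsed_us(t0, t1));
}
```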
V. RELATED WORK
Optimization of the interconnection network's power consumption is an important target for HPC system designers. Hoefler [5] gives an overview of the power problem and related aspects of interconnect power, with a focus on supercomputers. Power models for the interconnection network, which characterize the power profile of network routers and links, have been proposed, enabling further research into power-efficient techniques [23]. Several power reduction techniques have also been proposed, most of them based on DVS. Shang et al. [7] proposed a history-based DVS policy, where past network utilization is used to predict future traffic. Soteriou et al. [24] propose software techniques that extend a parallelizing compiler flow, statically generating DVS instructions that later direct run-time network power reduction.
Other techniques are based on turning off communication links that are either idle or have low utilization. Alonso et al. [8] propose a power-saving mechanism for regular interconnection networks built from high-degree switches, where each network dimension is formed from multiple links in parallel. The idea is to turn the links that compose a trunk link off and on, as a function of the network traffic. All links but one can be turned off; therefore, connectivity in the network is maintained, which allows the use of the same routing algorithm regardless of the power reduction level. In the work of Kim et al. [25], a DVS technique is complemented with powering down under-utilized links. The use of an adaptive routing algorithm is required, in order to avoid deadlocks.
Li et al. [26] propose a compiler-directed communication link shut-down strategy. The compiler determines the final use of communication links within each loop nest, and inserts a link shut-down instruction. The link is reactivated upon the next access to it. Our approach is similar, in that shut-down instructions are issued from the CPU in a way that depends on an analysis of communication patterns. In our approach, however, these instructions are issued by the runtime, without requiring any modification of the source code, whereas in their work the turn-off instructions are inserted during the compilation process.
The work of Lim et al. [27] is complementary to ours, since the MPI run-time system dynamically reduces CPU power consumption during communication phases. The runtime system identifies communication regions and selects the processor frequency in order to minimize the energy-delay product, without profiling or training.
The work of Jian et al. [28] focuses on non-predictive power-saving techniques. Links are powered up just before they are needed, relying on hints from built-in system events or from macros in the MPI source code. A separate, always-on control network is needed, through which the link activation messages flow. In our approach, we instead rely on the IB architecture, whose links offer a dynamic range in both performance and power.
In the work of Abts et al. [3], the authors propose energy-proportional datacenter networks. Link data rates are selected on the basis of traffic intensity in the network. They use a congestion-sensing heuristic to estimate traffic intensity, dynamically activating links as they are needed. While their work is focused on datacenter applications, which can tolerate small changes in latency, HPC applications cannot afford such a performance loss.
Saravanan et al. [29] provide a detailed evaluation on Energy Efficient Ethernet (EEE) from the perspective of HPC. They propose a technique to further increase power savings of HPC systems by leaving the link in active mode until a threshold time is reached.
VI. CONCLUSIONS
High-performance computing is an increasingly important tool for modern science. There is a constant demand for more powerful supercomputers, but increasing performance is leading to excessive peak power demand and energy consumption. Now that energy consumption is beginning to account for a significant fraction of an HPC system's total cost of ownership, there is pressure for all system components to become more energy efficient. An important characteristic of energy efficiency is energy proportionality. Although processors and memories are now close to energy proportional, high-performance interconnects are not.
This paper presents a software-directed mechanism for interconnect link energy proportionality. We propose the PPA algorithm, to be executed within the PMPI layer of MPI. Putting the intelligence in the MPI library allows differentiation by the system integrator and customisation by the operations department, while avoiding any need for modifications to the user's source code. This allows energy savings to be achieved for unmodified existing MPI applications.
The PPA algorithm detects the repetitive communication patterns that are typical of modern scientific applications, and it uses this knowledge to predict the durations of the link idle periods. The links are put into low-power mode during idle periods until a short time before they are expected to become active again, leading to a significant reduction in the average link energy consumption at negligible loss in performance.
We evaluated the possible power savings with strong scaling runs, which give pessimistic results, since network utilization, which should be proportional to energy consumption, increases with the number of nodes. In addition, strong scaling leads to shorter computation periods, meaning that the constant overheads of changing power mode must be amortized over shorter idle periods. Weak scaling runs would therefore lead to larger observed energy savings. Nevertheless, the results show the possibility of significant energy savings in IB switches of up to 33%, with a negligible increase in execution time of around 1%.
We also show possibilities for further switch energy savings, by powering down other elements in the switch. Such elements take much longer to change power state, requiring up to a millisecond to wake, increasing the need for accurate prediction mechanisms such as the PPA algorithm. We will evaluate this scenario in future work, by taking into account the powering down of other elements in the switch.
Finally, it is important to note that the principles of our system are not restricted to Infiniband. Many modern interconnect technologies, like Infiniband, have multiple lanes at the physical layer. For example, 40GbE has four lanes at 10 Gb/s each, although there is currently no standard for turning lanes on and off individually. Proposals like ours may have an impact on future standardisation efforts.
