Abstract-Energy consumption has become one of the greatest challenges in the field of high performance computing (HPC). The cost of the energy consumed by a supercomputer over the lifetime of the installation is similar to its acquisition cost. Thus, besides its impact on the environment, energy is a limiting factor for HPC.
I. INTRODUCTION
For decades, the only goal of high performance computing has been to increase the processing speed of computationally complex applications, such as scientific applications. Supercomputers were designed exclusively with the aim of increasing the number of floating point operations per second (FLOPS). This is reflected in the TOP500 list [1], which uses the FLOPS metric to determine the ranking of supercomputers. Performance and the price/performance trade-off were the most important objectives. Unfortunately, not only did the number of transistors double every 18-24 months to increase the performance of a node (Moore's law), but energy consumption doubled as well [2]. This led to the appearance of supercomputers that consume vast amounts of electrical power and produce so much heat that large cooling facilities must be built to ensure proper operation.
According to the Lawrence Livermore National Laboratory (LLNL), for every watt (W) of power consumed, 0.7 W is spent on cooling to dissipate the resulting heat. The energy consumption of current supercomputers is so high that it has a huge economic impact. In 2005, annual spending on electrical energy at LLNL was 14.6 million dollars [2]. In 2002, Dr. Schmidt, CEO of Google, said: "what matters most to the computer designers at Google is not speed but power - low power, because data centres can consume as much electricity as a city" [3]. Many supercomputing centres say that their annual spending on energy is equivalent to the acquisition cost of the supercomputer.
Energy consumption does not only have an economic impact. The lack of exploitation of renewable and clean energy sources also affects the environment and society. It may be noted that most (and the largest) of the supercomputers in the world are in the U.S. Half of the electrical energy in this country is produced from coal [4], which heavily impacts the environment and threatens people's health and lives because of, among other things, mineral extraction and the pollution from burning coal.
Green computing aims to avoid the damage resulting from the high energy consumption of computers. In 2007 the first Green500 list [5] was published, ranking the most energy-efficient supercomputers in the world. Thus began the new era of green computing, moving away from the focus on performance at any cost. Today, the TOP500 is no longer the only ranking of interest; the Green500 is as well.
Our research aims to reduce the energy consumed by computer systems when running parallel HPC applications. In this way we hope to contribute on the economic, environmental and social fronts. In this article we analyse the possible influence on energy consumption of the shared memory and message passing parallel programming paradigms, and the behaviour that they present at different CPU clock frequencies. Specifically, we use the OpenMP (a shared memory parallel programming model) [6] and MPI (a message passing programming model) [7] implementations of the NAS parallel benchmarks, running on a dual socket server with dual core processors. The experiments presented in this paper are limited to the computational node and do not cover the interconnection network.
The results show that the programming model has a major impact on the energy consumption of computer systems. It was found that the impact of reduced clock frequencies on the execution time, energy efficiency, and maximum power depends not only on the type of application but also on its implementation in a specific programming model. We believe that this study may be an important starting point for future work in the area.
The remainder of this paper is organised as follows. Section II briefly discusses the state of the art regarding the energy consumption of HPC systems. Section III analyses the current state of parallel programming models. Their influence on the energy consumption of computer systems, and their behaviour at different CPU clock frequencies, are discussed in Section IV. Finally, Section V presents the conclusions and future work.
II. GREEN HPC
This section first describes the metrics related to energy, then presents the energy characteristics of modern microprocessors, and finally explains the energy-saving techniques for HPC platforms.
A. Energy Metrics
Power is the rate at which the computer consumes electrical energy. The watt (W) is the unit of power, equivalent to 1 joule per second (1 J/s). Energy is the total amount of electrical energy that the computer consumes over time, and is measured in joules (J) or watt-hours (Wh).
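As a small illustration of how these units relate, the following Python sketch converts a constant average power draw into energy. The function names are hypothetical; the 117 W value is the idle power reported for the platform in Section IV-B, and the one-hour duration is chosen only for the example.

```python
JOULES_PER_WATT_HOUR = 3600.0  # 1 Wh = 1 W sustained for 3600 s


def energy_joules(average_power_w: float, duration_s: float) -> float:
    """Energy in joules consumed at a constant average power over duration_s seconds."""
    return average_power_w * duration_s


def joules_to_watt_hours(energy_j: float) -> float:
    """Convert an energy value from joules to watt-hours."""
    return energy_j / JOULES_PER_WATT_HOUR


if __name__ == "__main__":
    # Assumed scenario: the node idles at about 117 W for one hour.
    energy = energy_joules(117.0, 3600.0)
    print(energy, "J =", joules_to_watt_hours(energy), "Wh")
```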
The most common metric to measure energy efficiency is the one used by Green500: performance per watt (MFLOPS/W) [8] . However, other metrics have been defined. For example, Sun Microsystems defines the metric SWaP (Space, Wattage and Performance) which considers the measure of space to calculate the energy efficiency of a computer.
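The performance-per-watt metric can be sketched as follows. This is only an illustration: the function names and the two systems with their figures are hypothetical, chosen to show that the faster system is not necessarily the more energy-efficient one.

```python
from typing import Dict, List, Tuple


def mflops_per_watt(sustained_mflops: float, average_power_w: float) -> float:
    """Energy efficiency as sustained performance divided by average power."""
    if average_power_w <= 0:
        raise ValueError("average power must be positive")
    return sustained_mflops / average_power_w


def rank_by_efficiency(systems: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return system names ordered from most to least energy-efficient.

    systems maps a name to (sustained MFLOPS, average power in W).
    """
    return sorted(systems, key=lambda name: mflops_per_watt(*systems[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical systems: B is three times faster than A but draws four times the power.
    systems = {"A": (1_000_000.0, 500.0), "B": (3_000_000.0, 2000.0)}
    for name in rank_by_efficiency(systems):
        print(name, mflops_per_watt(*systems[name]), "MFLOPS/W")
```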
Generally, work on HPC platforms is oriented towards reducing energy consumption rather than power. Energy is related to the total economic cost of the electricity consumed by the system. However, the maximum power of a computer system determines the required capacity of the electrical infrastructure and cooling equipment [9]. Thus, we decided to include power measurements in our experiments.
B. Energy Efficiency of Modern Microprocessors
According to the latest list published by the Green500, in November 2010, accelerator-based supercomputers hold 8 of the top 10 spots on the Green500 list. Most of the accelerators used in these supercomputers are based on Cell processors (PowerXCell 8i) of IBM, or on Graphics Processing Units (GPUs) of NVIDIA or AMD. Cell processors are hybrids that have a general purpose processor and several small processing units designed for vectorised floating-point code. GPU computing or GPGPU (General Purpose GPU) is the use of a GPU to do general purpose scientific and engineering computing. The model for GPU computing is to use a CPU and GPU together in a heterogeneous co-processing computing model. The sequential part of the application runs on the CPU and the computationally-intensive part is accelerated by the GPU.
The five Green500 lists preceding the latest one were topped by accelerator-based systems. However, in the latest list the trend has changed, and the Green500 is topped by the IBM BlueGene/Q prototype. Its CPU is a 17-core, 64-bit Power processor with 4-way multithreading. The prototype system achieves 1684 MFLOPS/W. In second place is an accelerator-based system that achieves 1448 MFLOPS/W. This system uses the GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction) many-core accelerator, which was specially developed for scientific applications.
C. Energy Saving Techniques in HPC
There are two forms of energy consumption: dynamic and static. Dynamic consumption arises from the activity of the circuits, while static consumption is produced during the idle state of the circuits (also known as idle or leakage power consumption). Techniques to reduce the energy consumption of computer systems target both types of consumption, and involve both hardware and software. It is important to mention that some techniques may reduce power but not necessarily energy, and vice versa.
There are several techniques to save energy in HPC systems, of which the most important are:
Circuit and Logic Level Techniques: It is possible to reduce the width of transistors to reduce their dynamic power consumption; however, the transistors' delay will increase. Other techniques deal with the reordering of transistors, logic gates and signals, the development of new flip-flops that consume less energy, and the redesign of both the processor's control logic and the algorithm that determines the voltage needed for each available clock frequency [10].
Exploitation of Parallelism: The multicore era is emerging as a solution to three problems: the memory wall, the instruction level parallelism (ILP) wall, and the power wall [11]. The power wall refers to the limit on the amount of energy that can be dissipated in a CPU chip using commodity cooling techniques. This limit on energy dissipation put an end to the design strategy used until the early 2000s, in which system performance was increased by raising the CPU clock frequency.
Currently, processors have several slower cores instead of a single fast core. For example, Intel has gone from the single-core "Pentium 4 Prescott" (2004) at 3.6 GHz and 103 W to the quad-core "Kentsfield Core 2" (2007) at 2.6 GHz and 95 W. Whenever a good speedup can be obtained, trading clock speed for parallelism allows energy to be saved [9].
Functional Units and Nodes Interconnections: It is important to reduce the energy consumed by communications between functional units inside the node. Many techniques therefore deal with the redesign of buses: segmentation, data encoding to minimise the transitions of the bus lines (between 0 and 1), and voltage reduction. Furthermore, buses can be replaced by networks to support higher bandwidth and more concurrent connections (such as the QuickPath Interconnect of Intel or the HyperTransport of AMD). Other techniques focus on inter-node communications, dealing with the redesign of the interconnection networks of parallel systems.
Memory Optimisations: One possible strategy is to divide the memory into smaller components, and activate only the memory circuits required for each memory access. The memory hierarchy of systems also helps to reduce energy consumption. For example, in a memory system with one level of cache, a cache access consumes less power than a main memory access because the cache is smaller.
Adaptive Processor Architecture: Some hardware structures can be designed so that their parameters are configurable as needed. This may make it possible to use only the minimum resources required for each code execution. For example, adaptive caches have been proposed which can be partially activated according to the access pattern and workload. Furthermore, adaptive instruction queues which can be partially activated have been proposed [12].
Dynamic Voltage Scaling: Reducing the supply voltage reduces power consumption. However, it increases the delay of logic gates, so the clock frequency must be reduced to allow the circuit to work properly. Its implementation is not trivial because it may result in an intolerable performance loss, or in an increase in energy caused by a longer program run time [13].
Our proposal to analyse how applications react to different clock frequencies falls within this field.
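The trade-off described above can be made explicit with the standard first-order CMOS power model. This model is not taken from the measurements in this paper; the symbols $\alpha$ (activity factor), $C$ (switched capacitance), $V$ (supply voltage), $f$ (clock frequency), $P_{static}$ (static power), $T$ (run time) and $s$ (frequency scaling factor) are introduced here only for illustration.

\[
P_{dyn} \approx \alpha C V^{2} f, \qquad E = \left(P_{dyn} + P_{static}\right) T
\]

Under the simplifying assumptions that the frequency is scaled by a factor $s < 1$, that the voltage can be lowered roughly in proportion, and that the program is purely CPU bound so that $T$ grows as $1/s$, the two energy terms behave differently:

\[
E_{dyn}(s) \approx s^{2} E_{dyn}(1), \qquad E_{static}(s) \approx \frac{1}{s} E_{static}(1)
\]

Whether the total energy decreases therefore depends on the ratio of static to dynamic power and on how strongly the run time of the application depends on the clock frequency.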
Resource Hibernation: The computer components consume power even when idle. Thus, the technique of resource hibernation turns off or disconnects components in idle moments. The components that can be hibernated depend on each system and can include: hard disks, cores, network interface cards, and memories.
Power Management at Compiler Level: There are many ways in which a compiler can help to reduce energy consumption. Some performance optimisations are also beneficial for saving energy. Examples are reducing the number of memory accesses, and keeping data close to the processor, preferably in the registers and the lower cache levels. However, some performance optimisations increase the code size or the parallelism, which could increase the use of resources and power peaks. Therefore, it is necessary to make code optimisations specifically for energy.
Power Management at Application Level: It is possible to save energy at the application level. Some techniques focus on adapting applications to the execution platform, providing methodologies for designing energy efficient applications. Others involve developing application programming interfaces (APIs) for data exchange between different levels: hardware, operating system and application. Sharing information between these levels allows better decisions to be taken to reduce energy consumption. For example, if the disk has been hibernated, the application could activate the disk before having to access it. Thus, the activation delay of the device can be avoided.
Our proposal to analyse which parallel programming model should be selected to implement an application so that it produces the lowest energy consumption falls within this field.
III. CURRENT STATE OF PARALLEL PROGRAMMING MODELS
A parallel programming paradigm or model provides the programmer with an abstract framework to represent a parallel algorithm, and to map connections between the logical components of the application and the physical components of the underlying parallel machine. Typically, a particular programming paradigm results in higher performance than others for certain architectures of parallel systems. Moreover, these paradigms are generally aimed at the exploitation of a particular type of parallelism. Thus, the benefits obtained by using a programming paradigm are closely related to the type of parallelism that can be exploited in the specific application, and to the architecture of the parallel system where it will run.
Programming models differ in the way they deal with process communication and synchronisation, using either shared memory or distributed memory. In the case of shared memory, processes communicate through a common memory address space. OpenMP is the most widely used shared memory model. In distributed memory, a model of message passing between processes is adopted, where data is moved from one address space to another. MPI (Message Passing Interface) is the most widely used implementation of the latter model.
Implementations of shared memory programming models have been presented to run on distributed memory machines [14]. However, the high cost of this virtualisation is reflected in lower application performance, with cache consistency being the major cause of the performance loss. Something similar happens when using MPI on shared memory, which usually results in a loss of performance. Thus, a hybrid parallel programming model is adopted, which is a mixture of a message passing model and a shared memory model.
Neither MPI nor OpenMP is suitable for use with accelerators. The internal parallelism of accelerators is usually MIMD (Multiple Instruction Multiple Data) and SIMD (Single Instruction Multiple Data), and the latter is not addressed by MPI or OpenMP. Moreover, GPUs are not designed to run an operating system, and a CPU is used for this. Thus, GPUs add another level of parallelism determined by the concurrency between the CPU and the GPU. This new level of parallelism is not natively supported by MPI and OpenMP.
Language extensions to work with accelerators have been proposed, such as the extension to OpenMP [15]. However, a standardised high-level programming language for accelerators is not yet available.
We believe it is important to study the energy performance of both the message passing and the shared memory standard models (MPI and OpenMP), with the aim of establishing a methodology for using them to develop energy efficient programs.
IV. ANALYSIS OF THE IMPACT OF PARALLEL PROGRAMMING MODELS AND CPUS CLOCK FREQUENCY ON ENERGY CONSUMPTION
In this section we describe the methodology and analyse the experiments carried out to verify the influence of parallel programming models and CPU frequency scaling on the energy consumption of parallel systems. For these experiments we used the NAS benchmarks, implemented in both MPI and OpenMP. The evaluated parallel platform is a dual socket Intel Server System SC5650BCDP, with dual core Intel Xeon E5502 processors and 16 GB of main memory (8 GB per socket).
The experiments presented in this paper are limited to a single computational node because, for the moment, we are not focusing on the interconnection network. The system runs a Linux operating system (kernel 2.6.32-31), OpenMPI 1.4.1 (an implementation of the MPI-2 standard), and GOMP 4.4.3 (an implementation of the OpenMP v3.0 standard for the C, C++, and Fortran 95 compilers).
The following describes the methodology used to measure energy consumption and to scale the CPUs' clock frequency. Next, the energy consumption ranges of the platform under study are presented. Finally, the energy and performance behaviour of the system when executing the NAS benchmarks implemented in OpenMP and MPI at different CPUs' clock frequencies is detailed and analysed.
A. Methodology
This section describes the methodology used to measure the energy consumption of the parallel system, and describes how to scale the CPUs' clock frequency.
Energy Measurement Procedure: We measure the energy consumption of the whole HPC node. For this, we use the Yokogawa DL2700 oscilloscope [16] and the Fluke 80i-110s AC/DC current probe [17]. The DL2700 is a multichannel (8 channels), long-recording digital oscilloscope. The voltage is measured directly by one input channel. The current of the phase conductor is measured using the Fluke current probe, which is connected to another input channel of the oscilloscope. The power is then calculated as the product of the measured voltage and current. The sample rate for the experiments was 10 samples per second.
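The post-processing of such a trace can be sketched as follows. This is a minimal illustration and not the tooling used for the experiments: the function names are hypothetical, the trace values are made up, and the sketch assumes that each voltage and current pair already represents values whose product is the real power over that sampling period.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 10.0  # sampling rate stated in the text


def power_samples(voltage_v: Sequence[float], current_a: Sequence[float]) -> List[float]:
    """Power (W) per sample, as the product of measured voltage and current."""
    if len(voltage_v) != len(current_a):
        raise ValueError("voltage and current traces must have the same length")
    return [v * i for v, i in zip(voltage_v, current_a)]


def energy_joules(power_w: Sequence[float], sample_rate_hz: float = SAMPLE_RATE_HZ) -> float:
    """Energy (J) by the rectangle rule: each sample is held for one sampling period."""
    return sum(power_w) / sample_rate_hz


def peak_power(power_w: Sequence[float]) -> float:
    """Maximum power (W) observed in the trace."""
    return max(power_w)


if __name__ == "__main__":
    # Hypothetical one-second trace: constant 230 V, current rising from 0.5 A to 0.95 A.
    volts = [230.0] * 10
    amps = [0.5 + 0.05 * k for k in range(10)]
    watts = power_samples(volts, amps)
    print("energy (J):", energy_joules(watts))
    print("peak power (W):", peak_power(watts))
```

Both quantities are derived from the same trace: energy is the accumulated area under the power curve, whereas the maximum power is a single sample, which is why a program can rank differently on the two.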
For ease of comparisons between experiments, measurements were converted to relative values or percentages.
Clock Frequency Scaling: Modern general purpose processors can scale the frequency of each core individually. Access is through the ACPI. It is possible to know, for a given core, the available frequencies and the frequency currently in use, respectively, reading the following two files:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
To change the frequency, the cpufreq-selector command can be used (it requires root privileges). Running the command "cpufreq-selector -c 0 -f 1000000" sets core number 0 to 1 GHz.
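Reading these two files can be sketched in Python as shown below. The helper names are hypothetical, and the sketch assumes that the files contain whitespace-separated integers in kHz, which is consistent with the value 1000000 corresponding to 1 GHz in the command above. On systems whose frequency driver does not expose scaling_available_frequencies, the read simply fails and is reported.

```python
from pathlib import Path
from typing import List

CPU_ROOT = Path("/sys/devices/system/cpu")


def parse_khz_list(text: str) -> List[int]:
    """Parse a whitespace-separated list of frequencies given in kHz."""
    return [int(token) for token in text.split()]


def available_frequencies_khz(core: int = 0) -> List[int]:
    """Frequencies (kHz) that the given core can be set to."""
    path = CPU_ROOT / f"cpu{core}" / "cpufreq" / "scaling_available_frequencies"
    return parse_khz_list(path.read_text())


def current_frequency_khz(core: int = 0) -> int:
    """Frequency (kHz) currently in use by the given core."""
    path = CPU_ROOT / f"cpu{core}" / "cpufreq" / "scaling_cur_freq"
    return int(path.read_text().strip())


if __name__ == "__main__":
    try:
        print("available (kHz):", available_frequencies_khz(0))
        print("current (kHz):", current_frequency_khz(0))
    except OSError as err:
        # The cpufreq files are absent on some platforms and drivers.
        print("cpufreq information not available:", err)
```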
B. Energy Consumption Ranges of the Platform
We performed experiments to quantify the influence of running programs on the energy consumption of the parallel system. The energy consumption of the system under study was measured in the idle state, i.e., running only the processes of the operating system. We also measured the maximum peak power of the platform for different NAS parallel benchmarks at the highest CPUs' clock frequency. We determined the problem size of the benchmarks so that their main memory requirements for execution are met and memory swapping never happens.
The server consumes about 117 W in the idle state and had a maximum peak power of about 221 W. That is, running programs can nearly double the idle-state power consumption of the server. These results show the high capacity of software to alter the power consumption of the platform, justifying the importance of its study.
C. OpenMP versus MPI at Different CPUs' Clock Frequencies
We selected the CG, IS and EP benchmarks, which are computation bound, and the MG benchmark, which is communication bound [18]. We show only these benchmarks in this work because the behaviour of the other NAS parallel benchmarks does not provide additional information for our objective. We determined the problem size of the benchmarks so that their main memory requirements for execution are met and memory swapping never happens. Thus, we chose the class C data size.
Four experiments per benchmark were defined, determined by the combination of the two programming models (OpenMP and MPI) and the maximum and minimum CPU frequencies supported by the platform. The experiments were made with a fixed CPUs' clock frequency (no DVS algorithm was used). Figure 1 shows the performance (max is better), energy efficiency (max is better) and maximum power consumption (min is better), linearly scaled between 0 and 1 for each benchmark. Note that the results shown in one graph cannot be compared with the results of a different graph because the scale is relative to each specific benchmark.
The following peculiarities of the experimental results are observed. At equal CPU frequency, OpenMP produces a lower or equal maximum peak power than MPI for every evaluated benchmark. With respect to performance and energy efficiency, OpenMP is better than MPI for the IS, EP and MG benchmarks, but is worse for the CG benchmark.
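One plausible reading of "linearly scaled between 0 and 1" is a min-max normalisation over the four experiments of a benchmark, sketched below. The paper does not state the exact formula, so this is an assumption; the function name and the sample values are hypothetical.

```python
from typing import List, Sequence


def scale_unit(values: Sequence[float]) -> List[float]:
    """Map values linearly so that the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All experiments gave the same value, so no spread can be shown.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Hypothetical maximum power (W) for the four experiments of one benchmark:
    # OpenMP at max frequency, OpenMP at min frequency, MPI at max, MPI at min.
    max_power_w = [210.0, 180.0, 220.0, 185.0]
    print(scale_unit(max_power_w))
```

Because the minimum and maximum are taken per benchmark, a scaled value of 1 in one graph and in another may correspond to very different absolute measurements, which is why the graphs cannot be compared with each other.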
For each evaluated benchmark, lower CPUs' clock frequencies resulted in better energy behaviour. However, sometimes the gain in energy efficiency is low and the performance loss is very high, as happens with the EP benchmark. Still, in this case
To make an in-depth analysis of the impact of decreasing the CPUs' clock frequency on each benchmark, we use three synthetic parameters: energy savings, slowdown, and maximum power reduction. The results for each evaluated parameter are shown in figures 2, 3, and 4, respectively.
For all the evaluated benchmarks, modifying the CPUs' clock frequency produces a bigger energy impact in MPI than in OpenMP programs (see figure 2). For our parallel platform, the maximum possible energy savings with the lowest CPUs' clock frequency is 6.2%.
The slowdown (see figure 3) is bigger for OpenMP than for MPI programs when the CPUs' clock frequency is decreased. For our experiments the slowdown ranges from 8.57% to 16.72%, but this could be reduced by using DVS algorithms. The impact of the CPUs' clock frequency on maximum peak power is very high; sometimes it is higher with OpenMP programs and sometimes with MPI programs (see figure 4). For our parallel machine, the impact on maximum peak power ranges from 12% to 16.31%.
From our findings, we can say that the programming model has a high influence on the energy consumed by the machine running the parallel application. Given an algorithm, changing the CPUs' clock frequency affects energy consumption, maximum peak power and execution time differently depending on the programming model used to implement the application.
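The three parameters can be sketched as relative changes between the run at the maximum frequency and the run at the minimum frequency. The paper does not give their formulas, so the definitions below are assumptions; the function names and the input numbers are hypothetical.

```python
def energy_savings_pct(energy_max_freq_j: float, energy_min_freq_j: float) -> float:
    """Energy saved at the minimum frequency, as a percentage of the maximum-frequency energy."""
    return 100.0 * (energy_max_freq_j - energy_min_freq_j) / energy_max_freq_j


def slowdown_pct(time_max_freq_s: float, time_min_freq_s: float) -> float:
    """Extra run time at the minimum frequency, as a percentage of the maximum-frequency time."""
    return 100.0 * (time_min_freq_s - time_max_freq_s) / time_max_freq_s


def max_power_reduction_pct(peak_max_freq_w: float, peak_min_freq_w: float) -> float:
    """Peak power reduction at the minimum frequency, as a percentage of the maximum-frequency peak."""
    return 100.0 * (peak_max_freq_w - peak_min_freq_w) / peak_max_freq_w


if __name__ == "__main__":
    # Hypothetical measurements for one benchmark and one programming model.
    print("energy savings (%):", energy_savings_pct(1000.0, 950.0))
    print("slowdown (%):", slowdown_pct(100.0, 110.0))
    print("max power reduction (%):", max_power_reduction_pct(200.0, 170.0))
```

With these definitions, a configuration is attractive when the energy savings and the maximum power reduction are large while the slowdown stays small.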
Unlike other energy-related works, we include the measure of power in our experiments. However, it is important to note that, ideally, the energy consumption of the cooling system should be considered in the calculation of power and energy consumption. Thus, the characteristics of the cooling equipment may influence the choice of the programming model used to implement a particular application.
V. CONCLUSIONS AND FUTURE WORKS
This article presented an analysis of the impact of parallel programming models and CPUs' clock frequency on the energy consumption of HPC systems. We studied the energy consumption of a parallel system running scientific HPC applications. Specifically, the NAS benchmarks were selected for the experiments. We searched for evidence of the influence of the parallel programming model (used to implement an application) on energy efficiency, performance and maximum power usage. We evaluated the OpenMP implementation of the shared memory model, and the MPI implementation of the message passing model. We also studied the effect on these parameters of using different CPUs' clock frequencies.
As a next step we plan to study the reason behind the difference in the energy consumption behaviour of the two studied parallel programming models, and to extend the energy analysis to the interconnection network. In addition, we plan to synthesise the behaviour of different parallel applications (implemented in a certain parallel programming model) in several microbenchmarks. We believe that, by knowing the energy characteristics of these microbenchmarks, we could predict the energy consumption of many applications. Through these studies, we intend to develop more accurate mapping and DVS algorithms that take into account performance, energy efficiency, and maximum peak power consumption.
