Abstract-The dramatic environmental and economic impact of the ever-increasing power and energy consumption of modern computing devices in data centers is now a critical challenge. On the one hand, designers use technology scaling as one of the methods to face the phenomenon called dark silicon (only segments of a chip can function concurrently due to power restrictions). On the other hand, designers use extreme-scale systems such as teradevices to meet the performance needs of their applications, which in turn increases the power consumption of the platform. To overcome these challenges, we need novel computing paradigms that address energy efficiency. One promising solution is to incorporate parallel distributed methodologies at different abstraction levels. The FP7 project ParaDIME pursues this objective by providing distributed methodologies (software-hardware techniques) at different abstraction levels to attack the power-wall problem. In particular, the ParaDIME framework will utilize: circuit and architecture operation below safe voltage limits for drastic energy savings, specialized energy-aware computing accelerators, heterogeneous computing, energy-aware runtime, approximate computing and power-aware message passing. The major outcome of the project will be a processor architecture for a heterogeneous distributed system that utilizes future device characteristics for drastic energy savings. Wherever possible, ParaDIME will adopt multidisciplinary techniques, such as hardware support for message passing, runtime energy optimization utilizing new hardware energy performance counters, use of accelerators for error recovery from sub-safe voltage operation, and approximate computing through annotated code.
Furthermore, we will establish and investigate the theoretical limits of energy savings at the device, circuit, architecture, runtime and programming model levels of the computing stack, as well as quantify the actual energy savings achieved by the ParaDIME approach for the complete computing stack in a real environment.
I. INTRODUCTION
The growing popularity of cloud computing has greatly increased the scale of data centers, resulting in a significant increase in power dissipation. We have identified several issues at different levels (programming model, runtime, hardware) that need to be addressed in order to reduce the power consumption and increase the energy efficiency of data centers.
The first issue is that modern servers suffer from sharp power fluctuations due to sudden changes in workload. This is attributed to the low utilization of processor power when running the software: in real scenarios, the processor needs only 20% of its processing power to execute the application. When scaling beyond the computing node to the data center, power consumption is independent of the computing load of the system. The nodes keep state in memory and on local disk, which means that they cannot be turned off even if the load is low. To address this issue, we need a novel scheduling policy across the computing nodes that raises the workload while adhering to Service Level Agreements (SLAs).
The second issue is the trend in technology miniaturization and the associated gains in performance and productivity. On the one hand, we expect technology scaling to finally come face-to-face with the problem of dark silicon (only segments of a chip can function concurrently due to power restrictions), which will push us to use devices with completely new characteristics that must be studied. On the other hand, as core counts increase, the shared memory model based on cache coherence will severely limit code scalability and increase energy consumption. Therefore, to overcome these problems, we need new computer architectures that are radically more energy efficient.
The third issue is related to the programming model for data center applications. It must provide an interface, such as annotations, that lets applications maximize the utilization of resources at runtime. It must also interact with the hardware to provide information that can be used to apply aggressive energy saving techniques. Moreover, it should not rely on the shared-memory paradigm, so that it can scale to a high number of cores.
The ParaDIME project addresses these issues to minimize energy consumption of data centers. The high level objectives of the ParaDIME Project can be summarized as follows:
• Objective 1: To develop an energy-aware, programming-model-driven ecosystem, the ParaDIME Computing Node/Stack (applications, runtime and architecture), that combines energy-efficient SW programming and HW design methodologies and utilizes new emerging devices at the limit of CMOS scaling to radically decrease energy consumption.
• Objective 2: To build a reference Data Center (Infrastructure as a Service or IaaS) platform that incorporates new energy conscious workload scheduling techniques utilizing information from the runtime to radically decrease energy consumption.
The rest of the paper is organized as follows. Section II presents the proposed ParaDIME project flow and methodologies. The selection of the benchmarks is described in Section III. Section IV presents the preliminary results of the proposed methodologies. Finally, we conclude in Section V.
II. PARADIME PROJECT FLOW AND METHODOLOGIES
ParaDIME stands for Parallel Distributed Infrastructure for Minimization of Energy. As the name states, this project focuses on the minimization and optimization of energy consumption in the data center. Fig. 1 presents an overview of the project, which is based on several energy minimization methodologies. Fig. 1 shows two computing nodes. The first node represents the prototype developed with the simulator at the architectural level, and the second node is the real hardware. The energy-efficient methodologies proposed in this project at the different levels are as follows:
A. Device-level
1) Emerging Devices:
Geometric scaling has been the driving force during the past decades of CMOS dominance, but it is losing steam. We expect device scaling in the next decade to be tougher and to rely on material science and engineering. Fig. 2 shows the logic Vdd scaling trend over different technology nodes. As is evident from the figure, the diminishing difference between Vdd and Vt results in performance degradation and induces more variability in the design. To balance this problem, commensurate electrostatic and mobility scaling is necessary. In the near future, FinFET devices seem plausible as an effective means to extend MOS scaling for high-performance or low-power technologies at 20 nm and beyond. They provide sufficient protection against short-channel effects, especially "off-state" leakage current. As we go beyond 10 nm scaling, III-V devices are promising successors of MOSFETs due to their potential for a sub-60 mV/decade sub-threshold swing. Such a reduced swing will be a requirement for the ultra-low voltage operation of future generations of transistors. For the near future, the behavior of these devices is expected to follow past patterns. However, the emerging devices for the mid to far future will require new logic cell concepts that steer us towards energy-efficient operation and yield-aware design. ParaDIME will develop functional blocks such as adders, multipliers and register files, as well as virtual prototypes of different ARM processor cores, to enable architectural-level designers to predict the performance of both near- and far-future technology devices and to minimize the energy consumption.
2) Voltage Limits:
Due to the recent issues impacting device scaling as we approach the end of the CMOS roadmap, safe operation margins have been increasing. In particular, there is a substantial "tax" in the form of guard-bands for the supply voltage. This guard-band is increasing due to systematic and random variability, increased thermal stresses and noise margins. If we go below the safe limit and the associated guard-band, we might encounter sporadic errors while simultaneously saving energy dramatically. As an example, decreasing the supply voltage from 1.2 V to 0.7 V will lead to 3x dynamic energy savings. This will require support from other levels, for example using selective duplex replication at the architectural and programming model levels for driving the chip below the safe Vdd.
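The 3x figure quoted above is consistent with the standard first-order model in which dynamic switching energy scales with the square of the supply voltage (E_dyn proportional to C * Vdd^2). The following minimal sketch checks the arithmetic under that assumption; it ignores leakage and any change in switched capacitance, and the function name is illustrative rather than part of ParaDIME.

```python
def dynamic_energy_ratio(v_high: float, v_low: float) -> float:
    """Ratio of dynamic switching energy at v_high to that at v_low.

    Assumes the first-order model E_dyn = C * Vdd^2 with the same
    switched capacitance C at both voltages; leakage is ignored.
    """
    if v_high <= 0 or v_low <= 0:
        raise ValueError("supply voltages must be positive")
    return (v_high / v_low) ** 2


ratio = dynamic_energy_ratio(1.2, 0.7)
print(f"dynamic energy reduction: {ratio:.2f}x")
```

Under this model the reduction is roughly 2.9x, which matches the approximately 3x savings stated in the text.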
B. Architectural-level
1) Efficient message passing:
At the microarchitecture level, we have two goals for efficient message passing: (1) to leverage message passing to implement an efficient and scalable architecture and (2) to reduce the energy consumption of delivering messages. We will achieve the first goal by eliminating the structures and overhead required for cache-coherent shared-memory systems, thus improving the overall energy efficiency of the system. We are aware that cache coherency is the state of the art for chip multiprocessors, and that it is very convenient for the programmer, as data communication between cores is hidden. However, scaling to higher core counts is not trivial, and alternatives are already being explored for many-core architectures [1]. We will address our second goal of reducing the energy consumption of delivering messages by providing two relatively simple mechanisms: including instructions to send messages to a particular level of the cache hierarchy of the destination core, and avoiding copying data when sending messages to the same core.
On top of that, we propose to use a message passing co-processor that accelerates the processing of messages by offloading it to the co-processor and avoiding the OS/runtime software. One step on the road to achieving this accelerated processing of messages is to have fast task switching between the threads in the processor. The cost of spawning a thread can be very high, and it is not cost effective if the task to execute is small. This would typically prevent programmers from specifying small tasks, with the consequence that part of the available parallelism is not expressed. We envision a system where the main core can spawn threads in a co-processor at a much reduced cost, enabled by a very simple interface based on full/empty bits.
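To illustrate the idea of a full/empty-bit interface, the following is a minimal software model, not the proposed hardware: a memory cell carries one extra bit, a synchronized write blocks until the cell is empty, and a synchronized read blocks until it is full. The class and function names are hypothetical, and a Python thread stands in for the co-processor.

```python
import threading


class FullEmptyCell:
    """Software model of a memory word guarded by a full/empty bit."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_when_empty(self, value):
        # Block until the cell is empty, then store the value and mark it full.
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value, self._full = value, True
            self._cond.notify_all()

    def read_when_full(self):
        # Block until the cell is full, then take the value and mark it empty.
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value, self._value, self._full = self._value, None, False
            self._cond.notify_all()
            return value


def run_demo(tasks):
    """The main 'core' hands small tasks to a 'co-processor' thread."""
    inbox, outbox = FullEmptyCell(), FullEmptyCell()

    def coprocessor():
        while True:
            task = inbox.read_when_full()
            if task is None:  # sentinel value: shut down
                return
            outbox.write_when_empty(task * task)

    worker = threading.Thread(target=coprocessor)
    worker.start()
    results = []
    for task in tasks:
        inbox.write_when_empty(task)
        results.append(outbox.read_when_full())
    inbox.write_when_empty(None)
    worker.join()
    return results


print(run_demo([1, 2, 3]))
```

The point of the sketch is that handing over a task requires only one synchronized store and one synchronized load, with no scheduler call on the critical path.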
2) Operation below safe Vdd:
Reducing the supply voltage of a circuit (Vdd) is a well-known technique for making trade-offs between performance and power [17]. Many commodity processors implement Dynamic Voltage and Frequency Scaling (DVFS) and offer a number of power modes that use different levels of supply voltage. However, the applicability of Vdd reduction is limited by the guard-bands that are necessary to ensure correctness of execution. Redundancy can be used to detect and correct errors, but full replication incurs significant energy overhead if not used carefully. In ParaDIME, we investigate techniques that lower the Vdd below the safe limits only when this results in an overall reduction in energy consumption. We assume that the programmer indicates when performance can be aggressively traded off for reduced power dissipation (e.g., with the help of annotations at the programming level).
3) Reduced precision computing:
In ParaDIME, we will approximate floating point computation by reducing the precision whenever the programmer indicates that the computation does not require high precision, e.g., the full precision of the IEEE 754 floating point format. Thus, floating point data will use fewer bits than the standard format (e.g., 16), and both the floating point unit that operates on them and the register file can be considerably smaller [16]. We will also implement a similar mechanism for integer values that are known to be narrow, i.e., that lie inside a small range of values [2]. In a similar fashion, data caches and memory will be modified to exploit the smaller size of data, either by powering down some blocks or by packing more data into the same capacity. This can also be leveraged to reduce the amount of data sent in messages.
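The effect of a 16-bit floating point representation and of narrow integers can be illustrated in software. The sketch below uses the IEEE 754 binary16 format available through Python's standard struct module as a stand-in for a reduced-precision format; the actual formats used by the ParaDIME hardware are not specified here, and the helper names are hypothetical.

```python
import struct


def to_half_precision(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value, returned as a float.

    struct raises OverflowError for magnitudes beyond the binary16 range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_narrow_int(value: int, bits: int = 8) -> bool:
    """True if value fits in a signed two's-complement integer of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


x = 0.1
print("half-precision value:", to_half_precision(x))
print("absolute error:", abs(to_half_precision(x) - x))
print("bytes per value:", struct.calcsize("<e"), "vs", struct.calcsize("<d"))
print("100 fits in 8 bits:", fits_narrow_int(100))
print("300 fits in 8 bits:", fits_narrow_int(300))
```

A 16-bit value occupies a quarter of the storage of a 64-bit double, which is the saving that smaller register files, caches and messages would exploit, at the cost of a rounding error that the programmer has declared acceptable.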
4) Heterogeneous computing:
In ParaDIME, we propose to implement different strategies that incorporate heterogeneity at two levels: architecture and device. At the architecture level, one promising way to deal with the dark silicon issue is to use heterogeneous processors, for example with several accelerators, of which only a small number are powered on simultaneously. In ParaDIME, we will use specialized accelerators such as vector co-processors, and heterogeneous CPU cores/processors in the same system, such as ARM (big.LITTLE) and NVIDIA's Quadro. We will evaluate the power, performance and energy characteristics of common existing hardware accelerators, such as FPGAs, GPUs or DSPs, and of heterogeneous CPUs. At the device level, we will introduce heterogeneity to reduce energy consumption while maintaining the CPU performance without any degradation.
a) Architecture-level heterogeneity:
In the context of ParaDIME, we will use the Power Estimation Tool at System-level (PETS) [14] for estimating and optimizing power. This tool simplifies application porting and enables the user to choose the processor architecture upon which to perform hardware/software co-simulation. PETS was initially developed for the evaluation of MPSoC systems [10], [9], [11]. In ParaDIME, we have extended it to model a variety of other systems, including DSPs, FPGAs and multi-core (dual- and quad-core) ARM processors [13], [12], [15].
b) Device-level heterogeneity:
The main challenge in reducing power consumption by lowering the supply voltage is the risk of either reducing performance (due to reduced drive currents) or increasing leakage (when the threshold voltage is reduced simultaneously). The sub-threshold slope of the transistor is a key factor influencing the leakage power consumption. In this work, we propose the use of III-V devices that exhibit sub-threshold slopes steeper than the theoretical limit of 60 mV/decade found in CMOS devices. Consequently, III-V devices can provide higher performance than FinFET-based designs at lower voltages. However, at higher voltages, the Ion of FinFETs is much larger than can be accomplished by the flushing mechanism employed in existing III-V devices. This trade-off enables architectural innovations through the use of heterogeneous systems that employ both III-V and FinFET-based circuit elements. Heterogeneous chip-multiprocessors that incorporate cores with different frequencies, micro-architectural resources and instruction-set architectures are already emerging. In all these works, the energy-performance optimizations are performed by appropriately mapping the application to a preferred core.
C. Programming-level
The programmer can influence the software to help the overall system become more energy efficient. In this project, we focus on the interaction of applications with the hardware as well as with the runtime. The interface to the hardware can be seen as an extension of the programming model (API), which allows the programmer to indicate safe sections for lowering Vdd, as well as to mark types and methods for calculating and storing values with reduced precision. For efficient message passing, we rely on the actor model.
The interface to the runtime (at data centers) provides distributed communication for enabling energy-efficient allocation and scheduling of resources by means of static energy profiles. Static energy profiles are configuration-based indications given by the user to the runtime.
For ParaDIME, we decided to use Scala, a general-purpose language that runs on top of the JVM and combines functional and object-oriented programming patterns. Since the release of Scala v2.10, Scala includes the Akka framework, a library with actor model support.
1) API: General ParaDIME annotation design:
In addition to Scala being compatible with the Java API, it is also possible to write part of the code in other languages such as C or CUDA and then integrate it into the main Scala application rather easily. We have used this capability to define low-level annotations for indicating data types with reduced precision and safe regions for operation below the safe Vdd. For low-level annotations, we use Scala annotation macros, which allow us to define annotations together with linked macros that are executed at compile time.
We define two general annotations (@AlterMethod, @AlterType) that can be applied to the definition of types as well as methods and are fairly generic, which allows us to adapt to the provided features in the course of the project. We do not support annotations that target method calls, to avoid errors if the programmer forgets to reset the voltage or precision. The annotations accept a variable number of arguments in the form of key = value pairs, which can be combined as necessary.
To simplify usage for the programmer, we define three profiles (nominal, safe and unsafe) for interacting with the hardware when lowering the Vdd. The programmer can choose among three types of annotations as specified in the following listing 1. Although it is also possible to define a low Vdd for types using the @AlterType annotation, this is not recommended, since a change of voltage requires several instruction cycles (a compiler warning will be raised in such a case). An annotation consists of the key-value pair lowVdd=<nominal, safe, unsafe>.
We chose these simple types to be independent of the supply voltages of the hardware. Note that the nominal value represents the default case and does not necessarily have to be used yet. However, depending on the future development of ParaDIME, we leave open the option of extending the scope of the annotations from definitions of methods and types to calls.
Safe means that the hardware chooses a reduced voltage that is still above the safe limits, whereas unsafe lowers the voltage so radically that errors are very likely, so error detection and recovery mechanisms are required. Which error detection and recovery mechanism is appropriate will be chosen by the hardware; the programmer only indicates whether a reliability mechanism is required.
Similarly to lowering the Vdd, a programmer can annotate types and methods for using reduced precision. Again, we provide three profiles; the programmer can choose from the following key-value pairs: precision=<standard, reduced, radical>. The hardware chooses the appropriate values linked to these profiles, where standard means full precision. As specified by the hardware, it is possible to add type annotations to Float and Double definitions. In all cases, the hardware will use fewer bits to represent data with reduced precision.
Open questions to consider include (1) the specifics of annotations at method declaration, (2) the effects of calculations with values of reduced precision and (3) the effects of annotations on the definition of variables with custom types. The second point might be handled under the hood.
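The key-value design of the annotations can be mimicked with a small decorator. The sketch below is a Python analogue written for illustration only: it is not the project's Scala macro implementation and not a reproduction of its listing. The decorator name alter_method and the attribute paradime_hints are hypothetical; only the keys lowVdd and precision and their three values each are taken from the text.

```python
# Allowed key-value pairs, as named in the text.
ALLOWED = {
    "lowVdd": {"nominal", "safe", "unsafe"},
    "precision": {"standard", "reduced", "radical"},
}


def alter_method(**hints):
    """Hypothetical Python stand-in for an @AlterMethod-style annotation."""
    for key, value in hints.items():
        if key not in ALLOWED:
            raise ValueError(f"unknown annotation key: {key}")
        if value not in ALLOWED[key]:
            raise ValueError(f"invalid value {value!r} for {key}")

    def decorate(func):
        # Attach the hints as metadata that a runtime could inspect.
        func.paradime_hints = dict(hints)
        return func

    return decorate


@alter_method(lowVdd="unsafe", precision="reduced")
def average(a: float, b: float) -> float:
    return (a + b) / 2.0


print(average.paradime_hints)
print(average(1.0, 3.0))
```

Because the hints are attached to the definition rather than to individual calls, there is no point at which the programmer has to remember to reset the voltage or precision, which is the motivation given above for not annotating method calls.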
2) Efficient Message Passing: Actor Model and Scala STM extension:
We proposed several extensions for improving the efficiency of the actor model that are more or less visible to the programmer. We first introduced concurrent message processing by encapsulating the processing of each message within an actor in a transaction (using Transactional Memory (TM)) [4]. With sequential processing, access to the state is suboptimal when operations do not conflict (e.g., modifications to disjoint parts of the state, multiple read operations). TM can guarantee safe concurrent access in most of these cases and can handle conflicting situations by aborting and restarting transactions. However, we noticed that in cases of high contention, the performance of parallel processing dropped close to or even below the performance of sequential processing. Hence, we need to reduce the contention. We presented an extension of this work [3] which, in contrast to the other extensions, is visible to the programmer because we introduce a new method call to Scala STM. We propose a combination of two approaches: (1) relaxing the atomicity and isolation for some read-only operations and (2) dynamically determining the optimal number of threads executing transactional operations throughout the execution of the algorithm.
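The abort-and-restart behavior described above can be sketched with a version counter on the actor state. This is a deliberately simplified optimistic-concurrency model, not Scala STM and not the implementation from [3], [4]; all names are hypothetical.

```python
import threading


class VersionedState:
    """Actor state with a version counter for optimistic concurrency."""

    def __init__(self, value):
        self._lock = threading.Lock()
        self.value = value
        self.version = 0

    def snapshot(self):
        with self._lock:
            return self.value, self.version

    def try_commit(self, expected_version, new_value):
        # Commit only if no other transaction committed in the meantime.
        with self._lock:
            if self.version != expected_version:
                return False
            self.value = new_value
            self.version += 1
            return True


def atomic(state, update):
    """Run `update` as a transaction; abort and restart on conflict."""
    retries = 0
    while True:
        value, version = state.snapshot()
        if state.try_commit(version, update(value)):
            return retries
        retries += 1


def process_messages_concurrently(state, messages, n_threads=4):
    """Process messages in several threads, one transaction per message."""

    def handle(chunk):
        for message in chunk:
            atomic(state, lambda v, m=message: v + m)

    threads = [
        threading.Thread(target=handle, args=(messages[i::n_threads],))
        for i in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state.value


state = VersionedState(0)
print(process_messages_concurrently(state, list(range(100))))
```

Every message here updates the same value, which is the high-contention case: each conflict costs a wasted recomputation, which gives an intuition for why performance can fall to or below that of sequential processing.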
3) Static energy profiles:
In this section, we define static energy profiles, which will be extended in the course of the project. As an example, the profiles can be low energy, economy, and high performance (similar to battery usage modes on a PC). If the programmer indicates that the application should run in low energy mode, the ParaDIME framework can make decisions on allocating the required resources: e.g., low number of VMs on low-performance CPUs, limited parallelization, low number of actors, etc.
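A static energy profile can be thought of as a configuration key that the runtime maps to resource limits. The sketch below shows one possible shape of such a mapping; the profile names follow the example in the text, while the field names and all numeric values are placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourcePlan:
    max_vms: int
    cpu_class: str
    max_actors: int


# Placeholder values: the text names the profiles, not the numbers.
PROFILES = {
    "low_energy": ResourcePlan(max_vms=2, cpu_class="low-performance", max_actors=4),
    "economy": ResourcePlan(max_vms=8, cpu_class="mid-range", max_actors=32),
    "high_performance": ResourcePlan(max_vms=32, cpu_class="high-performance", max_actors=256),
}


def plan_for(profile: str) -> ResourcePlan:
    """Look up the resource plan for a profile chosen by the programmer."""
    try:
        return PROFILES[profile]
    except KeyError:
        raise ValueError(f"unknown energy profile: {profile}") from None


print(plan_for("low_energy"))
```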
In ParaDIME, we target the static energy profiles towards data centers, i.e., we can use the energy profiles for managing the allocation of VMs and for any runtime-level decision. We plan to use performance counters as well as power estimation to validate energy profiles.
D. Runtime-level
1) Operation below safe Vdd:
We want to increase the energy efficiency at the CPU level by decreasing the CPU's supply voltage. Modern CPUs already incorporate energy-efficiency measures. The processor supports several power states in the real machine. The ACPI standard defines exactly four different power states: C0-C3. Besides power states, the processor may also support different performance states. The number of performance states differs between processors. They are numbered P0, P1, ..., Pn. Each successively higher state reduces the processor's performance, because the voltage and/or frequency are reduced. Based on the application annotations, the Vdd can be lowered. This is achieved, for example, by annotating non-critical sections. The runtime also provides automatic voltage and frequency scaling and allows for low and near-threshold operation. It will provide the current state information to the application.
2) Energy-efficiency at the data center level:
To increase the energy efficiency at the data center, we plan to increase average utilization levels. By pushing utilization levels up, the comparatively high baseline power consumption is compensated for. We plan to combat the previously mentioned drawbacks of high utilization levels by executing a mix of compute tasks on each server. We distinguish between two types of tasks: interactive and batch. Interactive tasks have stringent performance requirements expressed as service level agreements (SLAs). Batch tasks, on the other hand, have turnaround times 2 to 3 orders of magnitude larger than interactive tasks, i.e., hours or days. This flexibility allows us to achieve utilization values of 90% and higher. Each server executes a mix of interactive and batch tasks. We have to ensure that interactive jobs never account for more than, say, 50% of the load. Additional capacity is consumed by batch tasks. Whenever there is a spike in interactive load, batch tasks will yield their resources to the interactive tasks. As soon as the surge in interactive load subsides, the batch tasks will continue executing, occupying all available spare resources. In ParaDIME, we propose energy-efficient scheduling decisions for the runtime based on the information given to it by the hardware and the application. We also propose support for migrating applications between physical servers when the runtime deems this beneficial with respect to energy efficiency, as well as a mechanism to swiftly reactivate suspended virtual machines.
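The yielding behavior between interactive and batch tasks can be sketched as a simple capacity split on one server. This is an illustration under assumed units (abstract capacity units) and assumed function names, not the ParaDIME scheduler; the 50% cap is the example figure from the text.

```python
def share_capacity(capacity, interactive_demand, batch_demand):
    """Serve interactive load first; batch tasks take whatever is left.

    Returns (interactive_share, batch_share, utilization).
    """
    interactive = min(interactive_demand, capacity)
    batch = min(batch_demand, capacity - interactive)
    return interactive, batch, (interactive + batch) / capacity


def admits_interactive(capacity, planned_interactive, new_task, cap=0.5):
    """Placement check: planned interactive load must stay within the cap."""
    return planned_interactive + new_task <= cap * capacity


# Normal operation: batch work fills the spare capacity.
print(share_capacity(100, 30, 200))
# Interactive spike: batch tasks yield resources.
print(share_capacity(100, 80, 200))
# Placement: a server planned at 40 units can take 10 more, but not 11.
print(admits_interactive(100, 40, 10), admits_interactive(100, 40, 11))
```

In both the normal and the spike case the server stays fully utilized, because the batch backlog absorbs whatever capacity the interactive tasks leave free.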
3) Energy-proportionality at the data center level:
Energy-proportional computing is a concept where the power required by a computing system is directly proportional to the work performed. A standard commodity server is typically not energy-proportional. An energy-proportional server would draw 0 W at 0% utilization, and the power drawn would increase linearly with utilization. Even though individual components, here servers, may not be energy-proportional, it has been shown that energy-proportionality can be approached at the aggregate level. A server with close to 0% utilization can be switched off, while the work is taken over by the remaining servers. We propose novel energy-proportional placement decisions for virtual machines, as well as a mechanism that determines which servers to switch off and which workloads must be moved between servers. Migration decisions follow a cost/benefit analysis.
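The gap between a commodity server and an energy-proportional one, and the benefit of consolidation, can be made concrete with a linear power model. The model and the wattages below are assumptions chosen for illustration (100 W idle, 200 W peak); they are not measurements from the project.

```python
def server_power(utilization, idle_w, peak_w):
    """Linear power model of a non-proportional server (assumed model)."""
    return idle_w + (peak_w - idle_w) * utilization


def ideal_power(utilization, peak_w):
    """Energy-proportional server: 0 W at 0% load, linear up to peak."""
    return peak_w * utilization


def cluster_power(utilizations, idle_w, peak_w):
    """Total power; servers with zero utilization are switched off."""
    return sum(server_power(u, idle_w, peak_w) for u in utilizations if u > 0)


IDLE_W, PEAK_W = 100.0, 200.0  # placeholder wattages

spread = cluster_power([0.25, 0.25, 0.25, 0.25], IDLE_W, PEAK_W)
consolidated = cluster_power([1.0, 0.0, 0.0, 0.0], IDLE_W, PEAK_W)
ideal = sum(ideal_power(u, PEAK_W) for u in [0.25] * 4)

print(f"spread over 4 servers: {spread} W")
print(f"consolidated on 1 server: {consolidated} W")
print(f"ideal energy-proportional: {ideal} W")
```

With these placeholder numbers, consolidating the same work onto one server and switching off the other three brings the aggregate power down to the ideal energy-proportional value, even though no single server is energy-proportional.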
4) Carbon-aware scheduling between multiple data centers:
Energy efficiency is less important if sufficient cheap and carbon emission-free energy sources are available. Because energy is a growing cost factor for data center operators, reducing the overall consumption in turn reduces the overall operating expenditures. Coupled with penalties for carbon emissions, the urge to cut energy consumption is even stronger. If, however, a cheap and green energy source is available, the overall consumption may suddenly be secondary. When data centers have access to alternative energy sources, say solar and coal, the question of where to process a task then also depends on where energy is cheap, plentiful, and green. Within ParaDIME, we propose scheduling decisions across data centers that select the "greenest" data center among those available. The placement decision is based on information about projected energy availability, cost, and heat demand.
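A placement decision of this kind can be sketched as a filter followed by a ranking. The text names the inputs (projected energy availability, cost, heat demand) but not how they are weighted, so the ordering used below (carbon intensity first, then cost, then higher heat demand) is purely an assumption, as are the field names and sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenter:
    name: str
    available_kwh: float      # projected energy availability
    carbon_g_per_kwh: float   # assumed measure of how "green" the energy is
    cost_per_kwh: float
    heat_demand_kw: float     # assumed: local demand for reusable waste heat


def greenest(centers, required_kwh):
    """Pick the greenest data center that can supply the required energy."""
    candidates = [dc for dc in centers if dc.available_kwh >= required_kwh]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda dc: (dc.carbon_g_per_kwh, dc.cost_per_kwh, -dc.heat_demand_kw),
    )


centers = [
    DataCenter("coal-site", 500.0, 900.0, 0.08, 0.0),
    DataCenter("solar-site", 120.0, 40.0, 0.05, 10.0),
    DataCenter("small-solar", 20.0, 40.0, 0.04, 0.0),
]
choice = greenest(centers, required_kwh=100.0)
print(choice.name if choice else "no data center can host the task")
```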
5) Heterogeneous Computing:
CPUs are general purpose microprocessors. But even they have a multitude of special purpose circuitry to help with, e.g., floating point operations and streaming data manipulation (SSE1/2/3/4). Besides the CPU, there are other components that take over specialized tasks. The most prominent example is the graphics processing unit (GPU). The GPU is an accelerator for graphics processing. Rendering 3D scenes is a complex task which can, however, be sped up significantly with special-purpose hardware. The idea of accelerators is to implement certain functionality in hardware instead of executing it in software on a general purpose processing unit. By offloading tasks to the accelerator, the CPU is free to do alternative work, or to sleep if there is nothing else to do. The accelerator, because it is specialized, will perform the same task more efficiently. The runtime provides mechanisms to indicate when a task can be offloaded to an accelerator, as well as a mechanism to turn accelerators off.
6) Energy-efficient Storage:
The energy-efficient storage system is an object store with a simple interface to get, put, update, and delete objects. Objects are binary data blobs as far as the storage system is concerned. Each object is replicated R times, where R is the replication factor. The replication factor is tunable. It allows different trade-offs for data availability and storage overhead. Besides the minimum replication factor R, there exist additional copies of popular objects. These exist solely to cope with increased read requests. Whenever the aggregated client read throughput exceeds the available bandwidth of live replicas, additional copies are brought online. This ensures that the storage system only consumes energy in proportion to the client demands. In ParaDIME, we propose an interface to the application to persistently store data. The interface is linked to the application in the form of a library. The library provides the basic primitives to create, read, update, and delete data objects. To the library each object is an opaque binary string. The storage library provides an additional level of encapsulation: the details of accessing the storage system can be changed without re-writing the dependent applications.
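The storage interface and the demand-driven number of live replicas can be sketched with an in-memory model. This is an illustration only: the class name, the per-replica bandwidth parameter and its units are assumptions, and a real system would distribute replicas across servers rather than keep a single dictionary.

```python
import math


class ObjectStore:
    """In-memory sketch of an object store with tunable replication."""

    def __init__(self, replication_factor=3, replica_bandwidth=100.0):
        self.replication_factor = replication_factor  # minimum copies, R
        self.replica_bandwidth = replica_bandwidth    # assumed MB/s per replica
        self._objects = {}

    def put(self, key, blob: bytes):
        if key in self._objects:
            raise KeyError(f"object already exists: {key}")
        self._objects[key] = bytes(blob)

    def get(self, key) -> bytes:
        return self._objects[key]

    def update(self, key, blob: bytes):
        if key not in self._objects:
            raise KeyError(f"no such object: {key}")
        self._objects[key] = bytes(blob)

    def delete(self, key):
        del self._objects[key]

    def live_replicas(self, read_throughput):
        """Copies that must be online to serve the aggregate read demand."""
        needed = math.ceil(read_throughput / self.replica_bandwidth)
        return max(self.replication_factor, needed)


store = ObjectStore(replication_factor=3, replica_bandwidth=100.0)
store.put("report", b"opaque binary data")
print(store.get("report"))
print(store.live_replicas(250.0), store.live_replicas(450.0))
```

Extra copies come online only when read demand exceeds what the R baseline replicas can serve, which is how the design keeps energy consumption in proportion to client demand.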
III. BENCHMARKS
We began our selection process by surveying a wide range of algorithms and their available implementations (as benchmarks) with respect to a set of high level, "nonstarter" criteria.
A. K-means Benchmark
The K-means algorithm groups objects in an N-dimensional space into K clusters. This algorithm is not embarrassingly parallel and may benefit from optimistic concurrency. In the first year of ParaDIME, we released several versions of K-means. The first version that we released is our research baseline, which is a reimplementation in Scala of the sequential k-means application found in STAMP. Second, we implemented a multi-threaded shared memory version of k-means. Finally, we released an actor-based implementation to show the performance differences for message passing.
Our proposed actor implementation of k-means consists of two types of actors: (a) a single coordinating actor and (b) many worker actors. Worker actors (WAs) claim responsibility for processing a disjoint chunk of the input dataset and executing the multi-threaded shared memory k-means algorithm.
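The division of labor between the coordinator and the workers can be sketched as one k-means iteration over disjoint chunks. This is a simplified, single-threaded illustration in which function calls stand in for messages; it is not the released Scala/Akka implementation, and each worker here runs a sequential step rather than the multi-threaded algorithm.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def worker_step(chunk, centroids):
    """Worker: per-cluster partial sums and counts for its own chunk."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = nearest(point, centroids)
        counts[k] += 1
        for d, x in enumerate(point):
            sums[k][d] += x
    return sums, counts


def coordinator_step(chunks, centroids):
    """Coordinator: merge the workers' replies and compute new centroids."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for sums, counts in (worker_step(chunk, centroids) for chunk in chunks):
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]
    # An empty cluster keeps its previous centroid.
    return [
        [s / total_counts[k] for s in total_sums[k]] if total_counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


chunks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
print(coordinator_step(chunks, [(0.0, 0.0), (10.0, 10.0)]))
```

Workers exchange only small per-cluster summaries with the coordinator, never the points themselves, which is what makes the chunked formulation a natural fit for message passing.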
B. Hydraulic Sub-surface Simulation (Hydra) Application
Multiple-point geostatistics [8] is a prominent tool that has proven effective for performing geostatistical simulations. At its core, the technique analyzes the relationships between multiple variables in several locations at a time. In general, the cost associated with the deterministic determination of the hydraulic properties of the subsurface is prohibitively high. Hence, the aim of multiple-point geostatistical simulation is to simulate the hydraulic properties of the subsurface based on a given number of samples. Hydra is already implemented in CUDA and has a graphical user interface that allows for executing a mixture of batch and interactive tasks. Additionally, the base implementation of Hydra is of the appropriate size and complexity. Finally, it lends itself to possible use in a data center, making it particularly desirable for testing the storage API that will be provided by the runtime. This is particularly important, because none of the other applications to which it was compared were suitable for testing this aspect of the infrastructure.
IV. PRELIMINARY EXPERIMENTAL RESULTS
In this section, we will present our preliminary results based on the previously proposed methodologies.
A. Architectural-level results

1) Below safe Vdd:
In this section, we analyze the feasibility of applying the error detection schemes with TMbased error recovery. We are specifically interested in how much we can lower the voltage while still providing high error detection capability. For the evaluation we consider the following two scenarios: 1) We investigate the energy overhead of the error detection schemes and the combined error detection and recovery and 2) combination of different error detection schemes. Our preliminary results are shown in Fig. 3 and Fig. 4 . In Fig. 3 , we summarize the performance of all applications in the SPLASH benchmark by averaging their energy consumption. The energy consumption is normalized to the error-free base case in which 2V supply voltage is used. From this graph (Fig. 3) , we can observe that when a transaction consists of 100 instructions, Double Modular Redundancy (DMR) starts to outperform the base-case, when Vdd is 1.4V (up to 28% reduction) or 1.2V (up to 54% reduction). Due to the increase in the fault rate, the probability of faults causing rollbacks repeatedly becomes significantly high. Thus, the energy consumption of DMR increases drastically after this voltage level. There is a trade-off between energy efficiency and reliability, as we can see for DMR and symptom-based error detection and TM recovery. Thus, we can for example combine symptom-based error detection and DMR for consuming less energy, but providing full reliability for critical parts. In Fig. 4 , we analyzed the energy overhead of this combination in comparison to the base case and DMR only for a transaction size of 100 instructions. We assume that 30, 50 or 70% of the application are only secured by symptom-based error detection. With this combination it is possible to lower the Vdd to 1 V (in comparison to 1.2 V with DMR only) and still be more efficient than the base case. Specifically, we reduce the energy consumption by 66% in comparison to the base case.
2) Multicore and heterogeneous computing at the architectural-level: As mentioned before, we used the PETS tool to estimate power for multi-core and heterogeneous processors at the architectural level. The processor architectures used in this project for heterogeneous computing are the DSP C64x, the GPU Tegra3, the FPGA Xilinx Zynq and the multicore Cortex-A9. All the cores/processors execute the same workload. Fig. 5 shows the total energy consumption in mJ for the K-Means application, which is one of the selected benchmarks for this project. We observe that, up to a certain number of cores, the total system energy consumption decreases as the number of execution cycles is reduced, and then it tends to stabilize as the system performance improves. Increasing the number of processors beyond a certain limit is futile, as it only adds new conflicts at the bus level, leading to more waiting cycles. In addition, the FPGA achieves higher energy efficiency than the other platforms, but programming it in an HDL is tedious. The GPU is the second most energy-efficient platform, ahead of the DSP and the multicore. Coupling a low-power ARM core with a GPU would therefore be more energy efficient than using large Intel or AMD cores.
B. Programming-level results
Experiments in this section were executed on an Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz with 2 cores and 4 threads, using 16 clusters. We estimate the power consumption at the process level using PowerAPI [7].
Regarding power estimation, Figure 6 a) shows that the sequential implementation requires the lowest power, because only one of the cores is occupied. The other implementations use all of the cores, which is reflected in an almost doubled power consumption. However, parallel execution can considerably improve performance, which affects the energy consumption of the application (energy = power * time). Figure 6 a) shows that the actor and thread implementations reduce the execution time, but the execution consumes around the same amount of energy. To actually gain from concurrent execution, it is necessary to scale further, as shown in Figure 6 b). For these results, we execute the same experiment with an increasing number of threads on a 48-core AMD Opteron. Starting from similar numbers with 4 threads, the actor implementation scales better than the lock-based implementation. Additionally, from 8 threads onwards the improvement in execution time is high enough to save energy, leading to a final improvement of 90 percentage points over the sequential execution in terms of execution time.
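The relation energy = power * time explains why a faster parallel run does not automatically save energy. The following minimal sketch uses hypothetical numbers, not the measurements of Figure 6, and the function names are introduced here for illustration only. It shows that a run drawing twice the power must finish in less than half the time to consume less energy.

```python
def energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy consumed by a run with constant average power."""
    return avg_power_watts * runtime_seconds


def saves_energy(seq_power: float, seq_time: float,
                 par_power: float, par_time: float) -> bool:
    """True if the parallel run consumes strictly less energy."""
    return energy_joules(par_power, par_time) < energy_joules(seq_power, seq_time)


def break_even_speedup(seq_power: float, par_power: float) -> float:
    """Speedup the parallel run must exceed to save energy."""
    return par_power / seq_power


if __name__ == "__main__":
    # Hypothetical values: the parallel version draws twice the power.
    seq_power, seq_time = 10.0, 4.0   # watts, seconds
    par_power = 20.0
    for par_time in (2.0, 1.5):
        print(par_time, saves_energy(seq_power, seq_time, par_power, par_time))
```

With doubled power, a speedup of exactly 2 only breaks even, which matches the observation that the execution time drops while the energy stays roughly the same until the implementation scales further.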
C. Data center-level results
Fast virtual machine resume: As outlined in Section II-D, one aspect of ParaDIME is to increase the energy efficiency of a single data center. In this context, we identified a previously neglected class of applications with only sporadic resource requirements, for example, a web server that answers only a few requests per hour. For this class of infrequently accessed services, it makes sense to suspend the service while it is idle and resume it only when a new request arrives. While suspending idle services helps the resource provider reduce its required capacity, it is important to reactivate the service swiftly once more work arrives. We have modified the open-source virtual machine emulator qemu/kvm to resume virtual machines almost instantly. To evaluate our modifications, we performed benchmarks with different applications, storage technologies (HDD vs. SSD), and storage locations (direct-attached vs. networked). Figure 7 illustrates the results for resuming a virtual machine from a checkpoint stored on a network-accessible SSD.
While Figure 7 presents data for three different resume strategies, we focus on the hybrid resume variant. Depending on the application, it is possible to resume a virtual machine over the network in 1.0 to 2.8 seconds. Some applications, notably Mediawiki, take longer to resume because they access more memory during the resume, while other applications, e.g., Django and Rubis, take less time. This is the worst-case delay, experienced only on the first request. Subsequent requests, which arrive when the VM is already running again, will be answered much faster. Further measurements and a more detailed explanation are available in the original publication [6].
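The suspend-on-idle behaviour described above can be sketched as a small simulation over logical time. This is not the modified qemu/kvm code: the class name, the idle timeout and the policy of suspending after a fixed idle period are assumptions made for illustration. Only the 2.8 second resume latency is taken from the upper end of the range reported above.

```python
class SuspendableService:
    """Toy model of a service that is suspended while idle (hypothetical)."""

    def __init__(self, idle_timeout_s: float, resume_latency_s: float):
        self.idle_timeout_s = idle_timeout_s
        self.resume_latency_s = resume_latency_s
        self.last_active_s = None  # None: the service starts suspended

    def is_suspended(self, now_s: float) -> bool:
        """The service counts as suspended once it has been idle long enough."""
        if self.last_active_s is None:
            return True
        return now_s - self.last_active_s >= self.idle_timeout_s

    def handle_request(self, now_s: float) -> float:
        """Return the extra delay in seconds this request pays for resuming."""
        delay = self.resume_latency_s if self.is_suspended(now_s) else 0.0
        self.last_active_s = now_s + delay
        return delay


if __name__ == "__main__":
    service = SuspendableService(idle_timeout_s=60.0, resume_latency_s=2.8)
    for arrival in (0.0, 10.0, 1000.0):
        print(arrival, service.handle_request(arrival))
```

Only the first request after an idle period pays the resume latency; requests that arrive while the service is still running see no extra delay.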
Periodic state synchronization: The goal of powering off servers is complicated by directly attached storage, because as soon as a server is offline, the stored data becomes inaccessible. In this context, we developed a system named dsync to efficiently synchronize gigabytes of data between two machines. Figure 8 compares the synchronization time taken by different synchronization methods. We observe that dsync is among the fastest methods; we refer the interested reader to the original publication [5] for the exact differences between the various methods. The wall-clock time taken to synchronize is only one aspect in which the methods differ. Resource consumption, such as disk I/O and computational overhead, is also an important metric to consider in this context.
At the device level, we are currently investigating 14nm and 7nm nodes of FinFET and III-V devices and exploring the stacking concept for data centers (2.5D and 3D). We also examine the concept of lowering Vdd and the delay error rates for those device specifications.
At the architectural level, we have started to implement a message-passing co-processor and the lowering of Vdd for ARM and Intel processors. We have successfully implemented the heterogeneous part of this project with FPGA, DSP and GPU. Currently, we are exploring device-level heterogeneity at the architectural level.
At the programming level, we have released the Scala implementation of K-means. We are now working on programmer-friendly annotations with which software developers can indicate the parts of the code where lowering of Vdd and error recovery can be applied.
At the runtime level, we are now implementing the below-safe-Vdd concept and exploring other methods to promote green computing, such as energy-efficient storage facilities.
The outcome of this project will serve as a roadmap for future data center processors, will promote green computing, and will serve as an example for programmers developing energy-efficient software for data centers.
