8,981 research outputs found
Toward sustainable data centers: a comprehensive energy management strategy
Data centers are major contributors to the emission of carbon dioxide to the atmosphere, and this contribution is expected to increase in the following years. This has encouraged the development of techniques to reduce the energy consumption and the environmental footprint of data centers. Whereas some of these techniques have succeeded to reduce the energy consumption of the hardware equipment of data centers (including IT, cooling, and power supply systems), we claim that sustainable data centers will be only possible if the problem is faced by means of a holistic approach that includes not only the aforementioned techniques but also intelligent and unifying solutions that enable a synergistic and energy-aware management of data centers.
In this paper, we propose a comprehensive strategy to reduce the carbon footprint of data centers that uses the energy as a driver of their management procedures. In addition, we present a holistic management architecture for sustainable data centers that implements the aforementioned strategy, and we propose design guidelines to accomplish each step of the proposed strategy, referring to related achievements and enumerating the main challenges that must be still solved.Peer ReviewedPostprint (author's final draft
Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications
Many scientific problems require multiple distinct computational tasks to be
executed in order to achieve a desired solution. We introduce the Ensemble
Toolkit (EnTK) to address the challenges of scale, diversity and reliability
they pose. We describe the design and implementation of EnTK, characterize its
performance and integrate it with two distinct exemplar use cases: seismic
inversion and adaptive analog ensembles. We perform nine experiments,
characterizing EnTK overheads, strong and weak scalability, and the performance
of two use case implementations, at scale and on production infrastructures. We
show how EnTK meets the following general requirements: (i) implementing
dedicated abstractions to support the description and execution of ensemble
applications; (ii) support for execution on heterogeneous computing
infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv)
fault tolerance. We discuss novel computational capabilities that EnTK enables
and the scientific advantages arising thereof. We propose EnTK as an important
addition to the suite of tools in support of production scientific computing
High-Throughput Computing on High-Performance Platforms: A Case Study
The computing systems used by LHC experiments has historically consisted of
the federation of hundreds to thousands of distributed resources, ranging from
small to mid-size resource. In spite of the impressive scale of the existing
distributed computing solutions, the federation of small to mid-size resources
will be insufficient to meet projected future demands. This paper is a case
study of how the ATLAS experiment has embraced Titan---a DOE leadership
facility in conjunction with traditional distributed high- throughput computing
to reach sustained production scales of approximately 52M core-hours a years.
The three main contributions of this paper are: (i) a critical evaluation of
design and operational considerations to support the sustained, scalable and
production usage of Titan; (ii) a preliminary characterization of a next
generation executor for PanDA to support new workloads and advanced execution
modes; and (iii) early lessons for how current and future experimental and
observational systems can be integrated with production supercomputers and
other platforms in a general and extensible manner
Predicting Scheduling Failures in the Cloud
Cloud Computing has emerged as a key technology to deliver and manage
computing, platform, and software services over the Internet. Task scheduling
algorithms play an important role in the efficiency of cloud computing services
as they aim to reduce the turnaround time of tasks and improve resource
utilization. Several task scheduling algorithms have been proposed in the
literature for cloud computing systems, the majority relying on the
computational complexity of tasks and the distribution of resources. However,
several tasks scheduled following these algorithms still fail because of
unforeseen changes in the cloud environments. In this paper, using tasks
execution and resource utilization data extracted from the execution traces of
real world applications at Google, we explore the possibility of predicting the
scheduling outcome of a task using statistical models. If we can successfully
predict tasks failures, we may be able to reduce the execution time of jobs by
rescheduling failed tasks earlier (i.e., before their actual failing time). Our
results show that statistical models can predict task failures with a precision
up to 97.4%, and a recall up to 96.2%. We simulate the potential benefits of
such predictions using the tool kit GloudSim and found that they can improve
the number of finished tasks by up to 40%. We also perform a case study using
the Hadoop framework of Amazon Elastic MapReduce (EMR) and the jobs of a gene
expression correlations analysis study from breast cancer research. We find
that when extending the scheduler of Hadoop with our predictive models, the
percentage of failed jobs can be reduced by up to 45%, with an overhead of less
than 5 minutes
On the feasibility of collaborative green data center ecosystems
The increasing awareness of the impact of the IT sector on the environment, together with economic factors, have fueled many research efforts to reduce the energy expenditure of data centers. Recent work proposes to achieve additional energy savings by exploiting, in concert with customers, service workloads and to reduce data centers’ carbon footprints by adopting demand-response mechanisms between data centers and their energy providers. In this paper, we debate about the incentives that customers and data centers can have to adopt such measures and propose a new service type and pricing scheme that is economically attractive and technically realizable. Simulation results based on real measurements confirm that our scheme can achieve additional energy savings while preserving service performance and the interests of data centers and customers.Peer ReviewedPostprint (author's final draft
Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators
In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications’ output correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude.
We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particles interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research
project grant 99999.007556/2015-02, EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement
n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.Postprint (author's final draft
- …