38,801 research outputs found
Exploring the Fairness and Resource Distribution in an Apache Mesos Environment
Apache Mesos, a cluster-wide resource manager, is widely deployed in massive
scale at several Clouds and Data Centers. Mesos aims to provide high cluster
utilization via fine grained resource co-scheduling and resource fairness among
multiple users through Dominant Resource Fairness (DRF) based allocation. DRF
takes into account different resource types (CPU, Memory, Disk I/O) requested
by each application and determines the share of each cluster resource that
could be allocated to the applications. Mesos has adopted a two-level
scheduling policy: (1) DRF to allocate resources to competing frameworks and
(2) task level scheduling by each framework for the resources allocated during
the previous step. We have conducted experiments in a local Mesos cluster when
used with frameworks such as Apache Aurora, Marathon, and our own framework
Scylla, to study resource fairness and cluster utilization. Experimental
results show how informed decision regarding second level scheduling policy of
frameworks and attributes like offer holding period, offer refusal cycle and
task arrival rate can reduce unfair resource distribution. Bin-Packing
scheduling policy on Scylla with Marathon can reduce unfair allocation from
38\% to 3\%. By reducing unused free resources in offers we bring down the
unfairness from to 90\% to 28\%. We also show the effect of task arrival rate
to reduce the unfairness from 23\% to 7\%
A Competitive Flow Time Algorithm for Heterogeneous Clusters Under Polytope Constraints
Modern data centers consist of a large number of heterogeneous resources such as CPU, memory, network bandwidth, etc. The resources are pooled into clusters for various reasons such as scalability, resource consolidation, and privacy. Clusters are often heterogeneous so that they can better serve jobs with different characteristics submitted from clients. Each job benefits differently depending on how much resource is allocated to the job, which in turn translates to how quickly the job gets completed.
In this paper, we formulate this setting, which we term Multi-Cluster Polytope Scheduling (MCPS). In MCPS, a set of n jobs arrive over time to be executed on m clusters. Each cluster i is associated with a polytope P_i, which constrains how fast one can process jobs assigned to the cluster. For MCPS, we seek to optimize the popular objective of minimizing average weighted flow time of jobs in the online setting. We give a constant competitive algorithm with small constant resource augmentation for a large class of polytopes, which capture many interesting problems that arise in practice. Further, our algorithm is non-clairvoyant. Our algorithm and analysis combine and generalize techniques developed in the recent results for the classical unrelated machines scheduling and the polytope scheduling problem [10,12,11]
Power Management Techniques for Data Centers: A Survey
With growing use of internet and exponential growth in amount of data to be
stored and processed (known as 'big data'), the size of data centers has
greatly increased. This, however, has resulted in significant increase in the
power consumption of the data centers. For this reason, managing power
consumption of data centers has become essential. In this paper, we highlight
the need of achieving energy efficiency in data centers and survey several
recent architectural techniques designed for power management of data centers.
We also present a classification of these techniques based on their
characteristics. This paper aims to provide insights into the techniques for
improving energy efficiency of data centers and encourage the designers to
invent novel solutions for managing the large power dissipation of data
centers.Comment: Keywords: Data Centers, Power Management, Low-power Design, Energy
Efficiency, Green Computing, DVFS, Server Consolidatio
Energy-Aware Lease Scheduling in Virtualized Data Centers
Energy efficiency has become an important measurement of scheduling
algorithms in virtualized data centers. One of the challenges of
energy-efficient scheduling algorithms, however, is the trade-off between
minimizing energy consumption and satisfying quality of service (e.g.
performance, resource availability on time for reservation requests). We
consider resource needs in the context of virtualized data centers of a private
cloud system, which provides resource leases in terms of virtual machines (VMs)
for user applications. In this paper, we propose heuristics for scheduling VMs
that address the above challenge. On performance evaluation, simulated results
have shown a significant reduction on total energy consumption of our proposed
algorithms compared with an existing First-Come-First-Serve (FCFS) scheduling
algorithm with the same fulfillment of performance requirements. We also
discuss the improvement of energy saving when additionally using migration
policies to the above mentioned algorithms.Comment: 10 pages, 2 figures, Proceedings of the Fifth International
Conference on High Performance Scientific Computing, March 5-9, 2012, Hanoi,
Vietna
Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics
Continued reliance on human operators for managing data centers is a major
impediment for them from ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using live data,
opens one possible path towards limiting the role of operators in data centers.
In this paper, we present a data-science study of a public Google dataset
collected in a 12K-node cluster with the goal of building and evaluating
predictive models for node failures. Our results support the practicality of a
data-driven approach by showing the effectiveness of predictive models based on
data found in typical data center logs. We use BigQuery, the big data SQL
platform from the Google Cloud suite, to process massive amounts of data and
generate a rich feature set characterizing node state over time. We describe
how an ensemble classifier can be built out of many Random Forest classifiers
each trained on these features, to predict if nodes will fail in a future
24-hour window. Our evaluation reveals that if we limit false positive rates to
5%, we can achieve true positive rates between 27% and 88% with precision
varying between 50% and 72%.This level of performance allows us to recover
large fraction of jobs' executions (by redirecting them to other nodes when a
failure of the present node is predicted) that would otherwise have been wasted
due to failures. [...
- …