16,770 research outputs found
Single machine scheduling with general positional deterioration and rate-modifying maintenance
We present polynomial-time algorithms for single machine problems with generalized positional deterioration effects and machine maintenance. The decisions should be taken regarding possible sequences of jobs and on the number of maintenance activities to be included into a schedule in order to minimize the overall makespan. We deal with general non-decreasing functions to represent deterioration rates of job processing times. Another novel extension of existing models is our assumption that a maintenance activity does not necessarily fully restore the machine to its original perfect state. In the resulting schedules, the jobs are split into groups, a particular group to be sequenced after a particular maintenance period, and the actual processing time of a job is affected by the group that job is placed into and its position within the group
Variable Annealing Length and Parallelism in Simulated Annealing
In this paper, we propose: (a) a restart schedule for an adaptive simulated
annealer, and (b) parallel simulated annealing, with an adaptive and
parameter-free annealing schedule. The foundation of our approach is the
Modified Lam annealing schedule, which adaptively controls the temperature
parameter to track a theoretically ideal rate of acceptance of neighboring
states. A sequential implementation of Modified Lam simulated annealing is
almost parameter-free. However, it requires prior knowledge of the annealing
length. We eliminate this parameter using restarts, with an exponentially
increasing schedule of annealing lengths. We then extend this restart schedule
to parallel implementation, executing several Modified Lam simulated annealers
in parallel, with varying initial annealing lengths, and our proposed parallel
annealing length schedule. To validate our approach, we conduct experiments on
an NP-Hard scheduling problem with sequence-dependent setup constraints. We
compare our approach to fixed length restarts, both sequentially and in
parallel. Our results show that our approach can achieve substantial
performance gains, throughout the course of the run, demonstrating our approach
to be an effective anytime algorithm.Comment: Tenth International Symposium on Combinatorial Search, pages 2-10.
June 201
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly-efficient policies
automatically. Decima uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load
Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics
Continued reliance on human operators for managing data centers is a major
impediment for them from ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using live data,
opens one possible path towards limiting the role of operators in data centers.
In this paper, we present a data-science study of a public Google dataset
collected in a 12K-node cluster with the goal of building and evaluating
predictive models for node failures. Our results support the practicality of a
data-driven approach by showing the effectiveness of predictive models based on
data found in typical data center logs. We use BigQuery, the big data SQL
platform from the Google Cloud suite, to process massive amounts of data and
generate a rich feature set characterizing node state over time. We describe
how an ensemble classifier can be built out of many Random Forest classifiers
each trained on these features, to predict if nodes will fail in a future
24-hour window. Our evaluation reveals that if we limit false positive rates to
5%, we can achieve true positive rates between 27% and 88% with precision
varying between 50% and 72%.This level of performance allows us to recover
large fraction of jobs' executions (by redirecting them to other nodes when a
failure of the present node is predicted) that would otherwise have been wasted
due to failures. [...
SHADHO: Massively Scalable Hardware-Aware Distributed Hyperparameter Optimization
Computer vision is experiencing an AI renaissance, in which machine learning
models are expediting important breakthroughs in academic research and
commercial applications. Effectively training these models, however, is not
trivial due in part to hyperparameters: user-configured values that control a
model's ability to learn from data. Existing hyperparameter optimization
methods are highly parallel but make no effort to balance the search across
heterogeneous hardware or to prioritize searching high-impact spaces. In this
paper, we introduce a framework for massively Scalable Hardware-Aware
Distributed Hyperparameter Optimization (SHADHO). Our framework calculates the
relative complexity of each search space and monitors performance on the
learning task over all trials. These metrics are then used as heuristics to
assign hyperparameters to distributed workers based on their hardware. We first
demonstrate that our framework achieves double the throughput of a standard
distributed hyperparameter optimization framework by optimizing SVM for MNIST
using 150 distributed workers. We then conduct model search with SHADHO over
the course of one week using 74 GPUs across two compute clusters to optimize
U-Net for a cell segmentation task, discovering 515 models that achieve a lower
validation loss than standard U-Net.Comment: 10 pages, 6 figure
Garbage collection auto-tuning for Java MapReduce on Multi-Cores
MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests
ERA: A Framework for Economic Resource Allocation for the Cloud
Cloud computing has reached significant maturity from a systems perspective,
but currently deployed solutions rely on rather basic economics mechanisms that
yield suboptimal allocation of the costly hardware resources. In this paper we
present Economic Resource Allocation (ERA), a complete framework for scheduling
and pricing cloud resources, aimed at increasing the efficiency of cloud
resources usage by allocating resources according to economic principles. The
ERA architecture carefully abstracts the underlying cloud infrastructure,
enabling the development of scheduling and pricing algorithms independently of
the concrete lower-level cloud infrastructure and independently of its
concerns. Specifically, ERA is designed as a flexible layer that can sit on top
of any cloud system and interfaces with both the cloud resource manager and
with the users who reserve resources to run their jobs. The jobs are scheduled
based on prices that are dynamically calculated according to the predicted
demand. Additionally, ERA provides a key internal API to pluggable algorithmic
modules that include scheduling, pricing and demand prediction. We provide a
proof-of-concept software and demonstrate the effectiveness of the architecture
by testing ERA over both public and private cloud systems -- Azure Batch of
Microsoft and Hadoop/YARN. A broader intent of our work is to foster
collaborations between economics and system communities. To that end, we have
developed a simulation platform via which economics and system experts can test
their algorithmic implementations
Single-machine scheduling with a time-dependent learning effect
Author name used in this publication: J.-B. WangAuthor name used in this publication: C. T. NgAuthor name used in this publication: T. C. E. Cheng2007-2008 > Academic research: refereed > Publication in refereed journalAccepted ManuscriptPublishe
- ā¦