Timely-Throughput Optimal Coded Computing over Cloud Networks
In modern distributed computing systems, unpredictable and unreliable
infrastructures result in high variability of computing resources. Meanwhile,
there is significantly increasing demand for timely and event-driven services
with deadline constraints. Motivated by measurements over Amazon EC2 clusters,
we consider a two-state Markov model for variability of computing speed in
cloud networks. In this model, each worker can be either in a good state or a
bad state in terms of the computation speed, and the transition between these
states is modeled as a Markov chain which is unknown to the scheduler. We then
consider a Coded Computing framework, in which the data is possibly encoded and
stored at the worker nodes in order to provide robustness against nodes that
may be in a bad state. With timely computation requests submitted to the system
with computation deadlines, our goal is to design the optimal computation-load
allocation scheme and the optimal data encoding scheme that maximize the timely
computation throughput (i.e., the average number of computation tasks that are
accomplished before their deadline). Our main result is the development of a
dynamic computation strategy called Lagrange Estimate-and-Allocate (LEA)
strategy, which achieves the optimal timely computation throughput. It is shown
that compared to the static allocation strategy, LEA increases the timely
computation throughput by 1.4X - 17.5X in various scenarios via simulations and
by 1.27X - 6.5X in experiments over Amazon EC2 clusters.
Comment: to appear in MobiHoc 201
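The two-state worker model described in this abstract can be illustrated with a minimal sketch. It is not the paper's LEA strategy; it only simulates one worker that switches between a good and a bad state according to a Markov chain. The function names and the transition probabilities `p_gb` and `p_bg` are assumptions made for illustration and do not come from the paper.

```python
import random


def simulate_worker_states(p_gb, p_bg, steps, seed=0):
    """Simulate a two-state (good/bad) Markov chain for a single worker.

    p_gb: assumed probability of moving from good to bad in one step.
    p_bg: assumed probability of moving from bad to good in one step.
    Returns a list of length `steps` whose entries are "good" or "bad".
    """
    rng = random.Random(seed)
    state = "good"  # assumption: the worker starts in the good state
    states = []
    for _ in range(steps):
        states.append(state)
        if state == "good":
            if rng.random() < p_gb:
                state = "bad"
        else:
            if rng.random() < p_bg:
                state = "good"
    return states


def stationary_good_fraction(p_gb, p_bg):
    """Long-run fraction of time spent in the good state for this chain."""
    return p_bg / (p_gb + p_bg)


if __name__ == "__main__":
    trace = simulate_worker_states(p_gb=0.1, p_bg=0.3, steps=100_000)
    print("empirical good fraction:", trace.count("good") / len(trace))
    print("stationary good fraction:", stationary_good_fraction(0.1, 0.3))
```

In the setting of the abstract the scheduler does not know the transition probabilities, so a scheduler built on such a model would have to estimate them from observed worker speeds rather than read them off as this sketch does.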
Why Let Resources Idle? Aggressive Cloning of Jobs with Dolly
Despite prior research on outlier mitigation, our analysis of jobs from the Facebook cluster shows that outliers still occur, especially in small jobs. Small jobs are particularly sensitive to long-running outlier tasks because of their interactive nature. Outlier mitigation strategies rely on comparing different tasks of the same job and launching speculative copies for the slower tasks. However, small jobs execute all their tasks simultaneously, thereby not providing sufficient time to observe and compare tasks. Building on the observation that clusters are underutilized, we take speculation to its logical extreme: run full clones of jobs to mitigate the effect of outliers. The heavy-tail distribution of job sizes implies that we can impact most jobs without using many resources. Trace-driven simulations show that average completion time of all the small jobs improves by 47% using cloning, at the cost of just 3% extra resources.
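The intuition behind job cloning can be sketched with a small Monte Carlo simulation. This is not Dolly's implementation: it assumes that a job finishes when its slowest task finishes, that a cloned job finishes when its fastest clone finishes, and that task durations follow a Pareto distribution with shape 2.0. The distribution, its parameter, and the function names are all assumptions chosen for illustration.

```python
import random


def job_completion_time(num_tasks, num_clones, rng, shape=2.0):
    """Completion time of one job run as `num_clones` full clones.

    Each clone runs all tasks in parallel and finishes with its slowest
    task; the job finishes as soon as the fastest clone is done.
    Task durations are drawn from an assumed Pareto(shape) distribution.
    """
    clone_times = []
    for _ in range(num_clones):
        task_times = [rng.paretovariate(shape) for _ in range(num_tasks)]
        clone_times.append(max(task_times))
    return min(clone_times)


def mean_completion(num_tasks, num_clones, trials=20_000, seed=0):
    """Average completion time over `trials` simulated jobs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += job_completion_time(num_tasks, num_clones, rng)
    return total / trials


if __name__ == "__main__":
    for clones in (1, 2, 3):
        print(clones, "clone(s):", round(mean_completion(5, clones), 3))
```

Under these assumptions, extra clones cut the average completion time of a small job because a single slow task no longer determines when the job finishes. The sketch ignores the resource cost of cloning, which the abstract reports separately.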
Hopper: Decentralized Speculation-aware Cluster Scheduling at Scale
As clusters continue to grow in size and complexity, providing scalable and predictable performance is an increasingly important challenge. A crucial roadblock to achieving predictable performance is stragglers, i.e., tasks that take significantly longer than expected to run. At this point, speculative execution has been widely adopted to mitigate the impact of stragglers. However, speculation mechanisms are designed and operated independently of job scheduling when, in fact, scheduling a speculative copy of a task has a direct impact on the resources available for other jobs. In this work, we present Hopper, a job scheduler that is speculation-aware, i.e., that integrates the tradeoffs associated with speculation into job scheduling decisions. We implement both centralized and decentralized prototypes of the Hopper scheduler and show that 50% (66%) improvements over state-of-the-art centralized (decentralized) schedulers and speculation strategies can be achieved through the coordination of scheduling and speculation.