Towards providing reliable job completion time predictions using PCS
In this paper we build a case for providing job completion time predictions
to cloud users, similar to the delivery date of a package or arrival time of a
booked ride. Our analysis reveals that providing predictability can come at the
expense of performance and fairness. Existing cloud scheduling systems optimize
for extreme points in the trade-off space, making them either extremely
unpredictable or impractical.
To address this challenge, we present PCS, a new scheduling framework that
aims to provide predictability while balancing other traditional objectives.
The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a
suitable configuration of different WFQ parameters (e.g., class weights) that
meets specific goals for predictability. It uses a simulation-aided search
strategy to efficiently discover WFQ configurations that lie on the Pareto
front of the trade-off space between these objectives. We implement and
evaluate PCS in the context of DNN job scheduling on GPUs. Our evaluation, on a
small-scale GPU testbed and larger-scale simulations, shows that PCS can
provide accurate completion time estimates while only marginally compromising
on performance and fairness.
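
To make the notion of a Pareto front over WFQ configurations concrete, the following is a minimal sketch and not the PCS implementation. It assumes each candidate configuration has already been scored (for example by a simulator) on three metrics that are all minimized. The names Candidate, pred_error, avg_jct, and unfairness, and all the numbers, are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """A scored WFQ configuration (all fields are illustrative assumptions)."""
    weights: Tuple[float, ...]  # hypothetical per-class WFQ weights
    pred_error: float           # completion-time prediction error (lower is better)
    avg_jct: float              # average job completion time (lower is better)
    unfairness: float           # some unfairness measure (lower is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    a_metrics, b_metrics = a[1:], b[1:]
    no_worse = all(x <= y for x, y in zip(a_metrics, b_metrics))
    strictly_better = any(x < y for x, y in zip(a_metrics, b_metrics))
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up scores for four candidate weight settings.
candidates = [
    Candidate((1.0, 1.0, 1.0), pred_error=0.40, avg_jct=10.0, unfairness=0.10),
    Candidate((4.0, 2.0, 1.0), pred_error=0.10, avg_jct=12.0, unfairness=0.20),
    Candidate((8.0, 1.0, 1.0), pred_error=0.05, avg_jct=20.0, unfairness=0.50),
    Candidate((2.0, 2.0, 1.0), pred_error=0.45, avg_jct=11.0, unfairness=0.15),
]

front = pareto_front(candidates)
for c in front:
    print(c.weights, c.pred_error, c.avg_jct, c.unfairness)
```

In this toy data the last candidate is dominated by the first (it is worse on all three metrics), so it is excluded; the remaining three each trade predictability against performance and fairness and therefore all lie on the front.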