Many resource management techniques for task scheduling, energy and carbon
efficiency, and cost optimization in workflows rely on a priori knowledge of
task runtimes. Building runtime prediction models on historical data is often not
feasible in practice as workflows, their input data, and the cluster
infrastructure change. Online methods, on the other hand, which estimate task
runtimes on specific machines while the workflow is running, have to cope with
a lack of measurements during start-up. Frequently, scientific workflows are
executed on heterogeneous infrastructures consisting of machines with different
CPU, I/O, and memory configurations, which further complicates runtime
prediction because task runtimes differ across machine types.
This paper presents Lotaru, a method for locally predicting the runtimes of
scientific workflow tasks before they are executed on heterogeneous compute
clusters. Crucially, our approach does not rely on historical data and copes
with a lack of training data during start-up. To this end, we use
microbenchmarks, reduce the input data to quickly profile the workflow locally,
and predict a task's runtime with a Bayesian linear regression based on the
gathered data points from the local workflow execution and the microbenchmarks.
Due to its Bayesian approach, Lotaru provides uncertainty estimates that can be
used for advanced scheduling methods on distributed cluster infrastructures.
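
The prediction step described above can be illustrated with a minimal sketch. It is not Lotaru's implementation: it assumes a conjugate Bayesian linear regression with a zero-mean Gaussian prior (precision `alpha`) and a fixed noise precision `beta`, fitted on (input size, runtime) pairs from local runs on reduced inputs. The function names and the `machine_factor` argument, a hypothetical scaling ratio that would be derived from microbenchmark scores of the local and the target machine, are assumptions made for illustration only.

```python
import numpy as np

def fit_bayesian_linear(sizes, runtimes, alpha=1e-6, beta=1.0):
    """Return posterior mean and covariance of the weights (intercept, slope).

    Assumes runtime = w0 + w1 * size + Gaussian noise with precision beta,
    and a zero-mean isotropic Gaussian prior on the weights with precision alpha.
    """
    X = np.column_stack([np.ones(len(sizes)), np.asarray(sizes, dtype=float)])
    y = np.asarray(runtimes, dtype=float)
    S = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)  # posterior covariance
    m = beta * S @ X.T @ y                                 # posterior mean
    return m, S

def predict_runtime(m, S, size, beta=1.0, machine_factor=1.0):
    """Return predictive mean and standard deviation for one input size.

    machine_factor is a hypothetical local-to-target scaling ratio; it is
    applied to both the mean and the standard deviation.
    """
    x = np.array([1.0, float(size)])
    mean = x @ m
    std = np.sqrt(1.0 / beta + x @ S @ x)  # noise variance + weight uncertainty
    return machine_factor * mean, machine_factor * std

# Illustrative, made-up profiling data: input sizes and local runtimes.
local_sizes = [1.0, 2.0, 3.0, 4.0]
local_runtimes = [3.0, 5.0, 7.0, 9.0]

m, S = fit_bayesian_linear(local_sizes, local_runtimes)
mean_full, std_full = predict_runtime(m, S, size=10.0)
mean_target, std_target = predict_runtime(m, S, size=10.0, machine_factor=0.5)
```

In this sketch the predictive standard deviation grows as the queried input size moves away from the profiled sizes, which is the kind of uncertainty estimate a scheduler could take into account.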
In our evaluation with five real-world scientific workflows, our method
outperforms two state-of-the-art runtime prediction baselines and decreases the
absolute prediction error by more than 12.5%. In a second set of experiments,
using the runtimes predicted by our method for state-of-the-art scheduling,
carbon reduction, and cost prediction yields results close to those achieved
with perfect prior knowledge of runtimes.