Predictive Performance Modeling for Distributed Computing using Black-Box Monitoring and Machine Learning
In many domains, the previous decade was characterized by increasing data
volumes and growing complexity of computational workloads, creating new demands
for highly data-parallel computing in distributed systems. Effective operation
of these systems is challenging when facing uncertainties about the performance
of jobs and tasks under varying resource configurations, e.g., for scheduling
and resource allocation. We survey predictive performance modeling (PPM)
approaches to estimate performance metrics such as execution duration, required
memory, or wait times of future jobs and tasks based on past performance
observations. We focus on non-intrusive methods, i.e., methods that can be
applied to any workload without modification, since the workload is usually a
black box from the perspective of the systems managing the computational
infrastructure. We classify and compare sources of performance variation,
predicted performance metrics, required training data, use cases, and the
underlying prediction techniques. We conclude by identifying several open
problems and pressing research needs in the field.
Comment: 19 pages, 3 figures, 5 tables