Balsam: Automated Scheduling and Execution of Dynamic, Data-Intensive HPC Workflows
We introduce the Balsam service to manage high-throughput task scheduling and
execution on supercomputing systems. Balsam allows users to populate a task
database with a variety of tasks ranging from simple independent tasks to
dynamic multi-task workflows. With abstractions for the local resource
scheduler and MPI environment, Balsam dynamically packages tasks into ensemble
jobs and manages their scheduling lifecycle. The ensembles execute in a pilot
"launcher" which (i) ensures concurrent, load-balanced execution of arbitrary
serial and parallel programs with heterogeneous processor requirements, (ii)
requires no modification of user applications, (iii) is tolerant of task-level
faults and provides several options for error recovery, (iv) stores provenance
data (e.g., task history, error logs) in the database, and (v) supports dynamic
workflows, in which tasks are created or killed at runtime. Here, we present
the design and Python implementation of the Balsam service and launcher. The
efficacy of this system is illustrated using two case studies: hyperparameter
optimization of deep neural networks, and high-throughput single-point quantum
chemistry calculations. We find that the unique combination of flexible
job-packing and automated scheduling with dynamic (pilot-managed) execution
facilitates excellent resource utilization. The scripting overheads typically
needed to manage resources and launch workflows on supercomputers are
substantially reduced, accelerating workflow development and execution.
Comment: SC '18: 8th Workshop on Python for High-Performance and Scientific
Computing (PyHPC 2018)
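
The job-packing idea in the abstract, in which tasks with heterogeneous processor requirements are grouped into ensemble jobs, can be illustrated with a minimal sketch. This is not Balsam's API or its actual scheduling algorithm. It is a generic first-fit-decreasing heuristic, and the names `Task`, `EnsembleJob`, and `pack_tasks`, as well as the node-count field, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record: a name and the number of nodes it needs."""
    name: str
    nodes: int


@dataclass
class EnsembleJob:
    """Hypothetical ensemble job: a fixed node capacity and its packed tasks."""
    capacity: int
    tasks: list = field(default_factory=list)

    @property
    def used(self) -> int:
        # Total nodes claimed by the tasks packed into this job.
        return sum(t.nodes for t in self.tasks)


def pack_tasks(tasks, job_capacity):
    """Pack tasks into ensemble jobs using first-fit decreasing.

    Assumption: every job has the same capacity and all packed tasks
    run concurrently, so the node counts in a job must fit its capacity.
    """
    jobs = []
    # Place the largest tasks first; sorted() is stable, so ties keep input order.
    for task in sorted(tasks, key=lambda t: t.nodes, reverse=True):
        if task.nodes > job_capacity:
            raise ValueError(f"task {task.name!r} needs more nodes than a job provides")
        for job in jobs:
            if job.used + task.nodes <= job.capacity:
                job.tasks.append(task)
                break
        else:
            # No existing job has room, so open a new one.
            jobs.append(EnsembleJob(job_capacity, [task]))
    return jobs
```

Calling `pack_tasks` with a list of `Task` objects and a per-job node capacity returns a list of `EnsembleJob` objects. A real service would also have to account for wall-time limits, queue policies, and tasks that arrive at runtime, none of which this sketch models.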