Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows
Scientific workflows are designed as directed acyclic graphs (DAGs) and
consist of multiple dependent task definitions. They are executed over a large
amount of data, often resulting in thousands of tasks with heterogeneous
compute requirements and long runtimes, even on cluster infrastructures. In
order to optimize the workflow performance, enough resources, e.g., CPU and
memory, need to be provisioned for the respective tasks. Typically, workflow
systems rely on user resource estimates which are known to be highly
error-prone and can result in over- or underprovisioning. While resource
overprovisioning leads to high resource wastage, underprovisioning can result
in long runtimes or even failed tasks.
In this paper, we propose two different reinforcement learning approaches
based on gradient bandits and Q-learning, respectively, in order to minimize
resource wastage by selecting suitable CPU and memory allocations. We provide a
prototypical implementation in the well-known scientific workflow management
system Nextflow, evaluate our approaches with five workflows, and compare them
against the default resource configurations and a state-of-the-art feedback
loop baseline. The evaluation shows that our reinforcement learning approaches
significantly reduce resource wastage compared to the default configuration.
Further, our approaches also reduce the allocated CPU hours compared to the
state-of-the-art feedback loop by 6.79% and 24.53%.
Comment: Paper accepted in 2022 IEEE International Conference on Big Data
Workshop BPOD 202
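To make the gradient bandit idea in the abstract above more concrete, here is a minimal sketch of how such an agent could pick a memory allocation from a fixed set of options. It is not the paper's implementation: the class GradientBandit, the reward function, the candidate allocations, and the simulated task usage of 3.5 GB are all assumptions chosen for illustration. The reward penalizes a failed (underprovisioned) task most and otherwise equals the negative fraction of wasted memory.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into selection probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


class GradientBandit:
    """Textbook gradient bandit with an average-reward baseline (hypothetical helper)."""

    def __init__(self, n_actions, step_size=0.1, seed=0):
        self.prefs = [0.0] * n_actions
        self.step_size = step_size
        self.avg_reward = 0.0
        self.steps = 0
        self.rng = random.Random(seed)

    def select(self):
        probs = softmax(self.prefs)
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, action, reward):
        # Update the running average reward, which serves as the baseline.
        self.steps += 1
        self.avg_reward += (reward - self.avg_reward) / self.steps
        probs = softmax(self.prefs)
        advantage = reward - self.avg_reward
        for a in range(len(self.prefs)):
            indicator = 1.0 if a == action else 0.0
            self.prefs[a] += self.step_size * advantage * (indicator - probs[a])


def reward(allocated_gb, used_gb):
    """Assumed reward: -1 for a failed task, else the negative wasted fraction."""
    if allocated_gb < used_gb:
        return -1.0
    return -(allocated_gb - used_gb) / allocated_gb


def train(options_gb, used_gb, rounds=2000):
    """Repeatedly allocate memory for a simulated task and learn from the reward."""
    bandit = GradientBandit(len(options_gb))
    for _ in range(rounds):
        action = bandit.select()
        bandit.update(action, reward(options_gb[action], used_gb))
    return bandit


options = [2, 4, 8, 16]           # candidate allocations in GB (assumed)
bandit = train(options, used_gb=3.5)
best = max(range(len(options)), key=lambda a: bandit.prefs[a])
print("preferred allocation:", options[best], "GB")
```

In a real workflow system the reward would come from monitored task executions rather than a fixed simulated usage, and CPU allocations could be handled by a second bandit of the same shape.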
Predicting Dynamic Memory Requirements for Scientific Workflow Tasks
With the increasing amount of data available to scientists in disciplines as
diverse as bioinformatics, physics, and remote sensing, scientific workflow
systems are becoming increasingly important for composing and executing
scalable data analysis pipelines. When writing such workflows, users need to
specify the resources to be reserved for tasks so that sufficient resources are
allocated on the target cluster infrastructure. Crucially, underestimating a
task's memory requirements can result in task failures. Therefore, users often
resort to overprovisioning, resulting in significant resource wastage and
decreased throughput.
In this paper, we propose a novel online method that uses monitoring time
series data to predict task memory usage in order to reduce the memory wastage
of scientific workflow tasks. Our method predicts a task's runtime, divides it
into k equally-sized segments, and learns the peak memory value for each
segment depending on the total file input size. We evaluate the prototype
implementation of our method using workflows from the publicly available
nf-core repository, showing an average memory wastage reduction of 29.48%
compared to the best state-of-the-art approach.
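The segment-wise prediction described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: it assumes one memory sample per monitoring interval, uses ordinary least-squares lines as the per-segment models, and the helper names segment_peaks, fit_line, train, and predict are hypothetical.

```python
def segment_peaks(series, k):
    """Split a memory time series into k near-equal segments; return each peak."""
    n = len(series)
    peaks = []
    for i in range(k):
        start = i * n // k
        end = max((i + 1) * n // k, start + 1)  # keep every segment non-empty
        peaks.append(max(series[start:end]))
    return peaks


def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train(history, k):
    """history: list of (total_input_size, memory_series) from past executions."""
    sizes = [size for size, _ in history]
    runtimes = [len(series) for _, series in history]  # runtime in samples
    runtime_model = fit_line(sizes, runtimes)
    peaks = [segment_peaks(series, k) for _, series in history]
    segment_models = [fit_line(sizes, [p[i] for p in peaks]) for i in range(k)]
    return runtime_model, segment_models


def predict(model, input_size):
    """Return (segment_start_time, predicted_peak_memory) for each of the k segments."""
    runtime_model, segment_models = model
    runtime = runtime_model[0] * input_size + runtime_model[1]
    segment_length = runtime / len(segment_models)
    return [
        (i * segment_length, slope * input_size + intercept)
        for i, (slope, intercept) in enumerate(segment_models)
    ]


# Toy history: memory grows linearly with input size (assumed data).
history = [(s, [1 * s, 2 * s, 3 * s, 2 * s]) for s in (1, 2, 3)]
model = train(history, k=2)
plan = predict(model, input_size=4)
print(plan)
```

A scheduler could use such a plan to reserve less memory in early segments and raise the reservation only when a later segment's predicted peak requires it.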
Scalable HPC & AI infrastructure for COVID-19 therapeutics
COVID-19 has claimed more than 2.7 × 10^6 lives and resulted in over 124 × 10^6 infections. There is an urgent need to identify drugs that can inhibit SARS-CoV-2. We discuss innovations in computational infrastructure and methods that are accelerating and advancing drug design. Specifically, we describe several methods that integrate artificial intelligence and simulation-based approaches, and the design of computational infrastructure to support these methods at scale. We discuss their implementation, characterize their performance, and highlight science advances that these capabilities have enabled.
How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface
Scientific workflow management systems (SWMSs) and resource managers together
ensure that tasks are scheduled on provisioned resources so that all
dependencies are obeyed, and some optimization goal, such as makespan
minimization, is fulfilled. In practice, however, there is no clear separation
of scheduling responsibilities between an SWMS and a resource manager because
there exists no agreed-upon separation of concerns between their different
components. This has two consequences. First, the lack of a standardized API to
exchange scheduling information between SWMSs and resource managers hinders
portability. It incurs costly adaptations when a component should be replaced
by another one (e.g., an SWMS with another SWMS on the same resource manager).
Second, due to overlapping functionalities, current installations often
actually have two schedulers, both making partial scheduling decisions under
incomplete information, leading to suboptimal workflow scheduling.
In this paper, we propose a simple REST interface between SWMSs and resource
managers, which allows any SWMS to pass dynamic workflow information to a
resource manager, enabling maximally informed scheduling decisions. We provide
an exemplary implementation of this API for Nextflow as an SWMS and Kubernetes
as a resource manager. Our experiments with nine real-world workflows show that
this strategy reduces makespan by up to 25.1% and 10.8% on average compared to
the standard Nextflow/Kubernetes configuration. Furthermore, a more widespread
implementation of this API would enable leaner code bases, a simpler exchange
of components of workflow systems, and a unified place to implement new
scheduling algorithms.
Comment: Paper accepted in: 2023 23rd IEEE International Symposium on Cluster,
Cloud and Internet Computing (CCGrid
Failure-awareness and dynamic adaptation in data scheduling
Over the years, scientific applications have become more complex and more data-intensive. Large-scale simulations and scientific experiments in areas such as physics, biology, astronomy, and earth sciences in particular demand highly distributed resources to satisfy excessive computational requirements. Increasing data requirements and the distributed nature of the resources have made I/O the major bottleneck for end-to-end application performance. Existing systems fail to address issues such as reliability, scalability, and efficiency in dealing with wide-area data access, retrieval, and processing. In this study, we explore data-intensive distributed computing and study challenges in data placement in distributed environments. After analyzing different application scenarios, we develop new data scheduling methodologies and the key attributes for reliability, adaptability, and performance optimization of distributed data placement tasks. Inspired by techniques used in microprocessor and operating system architectures, we extend and adapt some of the known low-level data handling and optimization techniques to distributed computing. The two major contributions of this work are (i) a failure-aware data placement paradigm for increased fault tolerance, and (ii) adaptive scheduling of data placement tasks for improved end-to-end performance. The failure-aware data placement includes early error detection, error classification, and use of this information in scheduling decisions for the prevention of and recovery from possible future errors. The adaptive scheduling approach includes dynamically tuning data transfer parameters over wide-area networks for efficient utilization of available network capacity and optimized end-to-end data transfer performance.