38 research outputs found

    How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface

    Scientific workflow management systems (SWMSs) and resource managers together ensure that tasks are scheduled on provisioned resources so that all dependencies are obeyed and some optimization goal, such as makespan minimization, is fulfilled. In practice, however, there is no clear separation of scheduling responsibilities between an SWMS and a resource manager, because no agreed-upon separation of concerns between their components exists. This has two consequences. First, the lack of a standardized API for exchanging scheduling information between SWMSs and resource managers hinders portability and incurs costly adaptations whenever one component is replaced by another (e.g., one SWMS swapped for another on the same resource manager). Second, due to overlapping functionality, current installations often effectively run two schedulers, each making partial scheduling decisions under incomplete information, leading to suboptimal workflow scheduling. In this paper, we propose a simple REST interface between SWMSs and resource managers that allows any SWMS to pass dynamic workflow information to a resource manager, enabling maximally informed scheduling decisions. We provide an exemplary implementation of this API for Nextflow as the SWMS and Kubernetes as the resource manager. Our experiments with nine real-world workflows show that this strategy reduces makespan by up to 25.1%, and by 10.8% on average, compared to the standard Nextflow/Kubernetes configuration. Furthermore, a more widespread implementation of this API would enable leaner code bases, simpler exchange of workflow-system components, and a unified place to implement new scheduling algorithms.
    Comment: Paper accepted at the 2023 23rd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid).
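    As a rough illustration of the kind of interface the paper argues for, the sketch below shows an SWMS-side client POSTing its task graph to a resource manager so the latter can schedule with full dependency information. The endpoint path, payload fields, and host name are assumptions for illustration, not the interface the authors specify.

```python
# Hedged sketch of an SWMS pushing dynamic workflow state to a resource
# manager over REST. Endpoint and payload shape are illustrative assumptions.
import json
import urllib.request

SCHEDULER_URL = "http://resource-manager.local:8080/v1/workflows"  # hypothetical

def register_workflow(workflow_id, tasks):
    """POST the task graph so the resource manager sees all dependencies.

    `tasks` maps a task name to its resource request and upstream tasks.
    """
    payload = {
        "workflow_id": workflow_id,
        "tasks": [
            {
                "name": name,
                "cpus": spec["cpus"],
                "memory_mb": spec["memory_mb"],
                "depends_on": spec["depends_on"],
            }
            for name, spec in tasks.items()
        ],
    }
    req = urllib.request.Request(
        SCHEDULER_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(register_workflow("mosaic-1", {
        "align":  {"cpus": 2, "memory_mb": 4096,  "depends_on": []},
        "stitch": {"cpus": 8, "memory_mb": 16384, "depends_on": ["align"]},
    }))
```

    Because the resource manager sees the whole dependency graph rather than one task at a time, it can make the "maximally informed" placement decisions the abstract describes.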

    Heuristics for Scaling up Distributed Protein Docking

    Docking is a computational technique that predicts the interaction between a protein and a potential drug compound. Virtual screening is a docking-based tool for investigating huge libraries of compounds and predicting which molecules bind favorably to the protein of interest. One such commercially available library contains about 13 million compounds; examining it would take approximately 400 years of CPU time. As an alternative, a high-performance computing application with a distributed docking strategy is needed, one that can efficiently predict favorable compounds and can eventually scale to huge libraries. In this thesis, IncreDock, a scoring-based incremental docking program, was developed to improve the efficiency of the virtual screening process. IncreDock provides two approaches to the distributed docking problem. First, it allows a completely parallel implementation in which the entire library is explored simultaneously in an unordered fashion. Second, and more importantly, it offers an ordered incremental parallel implementation in which the library is explored in increments and a scoring function determines the order of dockings. IncreDock was used to perform docking studies on four different proteins using a library of 10,573 compounds. The results suggest that IncreDock predicts better compounds in early increments of docking. IncreDock thus provides an effective initial strategy for sampling good ligands in less compute time and forms a good precursor to a tool that can proficiently investigate huge chemical libraries.
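    A minimal sketch of the two modes described above, using toy integer compound IDs and a fake scoring function; the re-ranking heuristic (preferring compounds "near" the current best) is an assumption for illustration, since the thesis's actual scoring function is not given in the abstract.

```python
# Sketch of IncreDock's two modes: an unordered fully parallel sweep, and an
# ordered incremental sweep where scores from early increments steer the rest.
from concurrent.futures import ProcessPoolExecutor
import random

def dock(compound):
    """Stand-in for an expensive docking run; returns (compound, score)."""
    random.seed(compound)                        # deterministic fake score
    return compound, random.uniform(-12.0, 0.0)

def parallel_screen(library, workers=4):
    """Mode 1: dock the whole library at once, in no particular order."""
    with ProcessPoolExecutor(workers) as pool:
        return dict(pool.map(dock, library))

def incremental_screen(library, increment=100, workers=4):
    """Mode 2: dock in increments, re-ranking the remainder after each one."""
    remaining, results = list(library), {}
    while remaining:
        batch, remaining = remaining[:increment], remaining[increment:]
        with ProcessPoolExecutor(workers) as pool:
            results.update(pool.map(dock, batch))
        best = min(results, key=results.get)      # best-scoring compound so far
        remaining.sort(key=lambda c: abs(c - best))  # toy "similarity" ordering
    return results

if __name__ == "__main__":
    scores = incremental_screen(range(1000), increment=200, workers=2)
    top = min(scores, key=scores.get)
    print(f"best compound: {top}, score: {scores[top]:.2f}")
```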

    Workload Schedulers - Genesis, Algorithms and Comparisons

    In this article we provide brief descriptions of three classes of schedulers: Operating Systems Process Schedulers, Cluster Systems Job Schedulers, and Big Data Schedulers. We describe their evolution from early adoptions to modern implementations, considering both the algorithms used and their features. We then discuss the differences between the presented classes of schedulers and trace their chronological development. In conclusion, we highlight similarities in the design of scheduling strategies that apply to both local and distributed systems.
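    As a toy illustration of how two of these classes differ, the sketch below contrasts a preemptive round-robin policy (typical of OS process schedulers) with a non-preemptive FIFO batch policy (typical of early cluster job schedulers); the job lengths and time quantum are arbitrary example values, not drawn from the article.

```python
# Round-robin vs. FIFO on the same job set: preemption lets short jobs
# finish early, while FIFO serves strictly in arrival order.
from collections import deque

def round_robin(jobs, quantum=2):
    """Preemptive: each job runs for at most `quantum` ticks per turn."""
    queue, t, finish = deque(jobs.items()), 0, {}
    while queue:
        name, left = queue.popleft()
        run = min(quantum, left)
        t += run
        if left > run:
            queue.append((name, left - run))   # not done: back of the queue
        else:
            finish[name] = t
    return finish

def fifo(jobs):
    """Non-preemptive: jobs run to completion in arrival order."""
    t, finish = 0, {}
    for name, length in jobs.items():
        t += length
        finish[name] = t
    return finish

jobs = {"A": 6, "B": 2, "C": 4}                # name -> CPU ticks needed
print("round-robin:", round_robin(jobs))       # B finishes at t=4
print("fifo:       ", fifo(jobs))              # B waits behind A until t=8
```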

    Remodelling Scientific Workflows for Cloud

    In recent years, cloud computing has attracted significant interest in the scientific community. Running scientific experiments in the cloud has advantages such as elasticity, scalability, and easier software maintenance. However, communication latencies have been observed to be a major hindrance to migrating scientific computing applications to the cloud. The problem escalates further for scientific workflows, where significant data is exchanged across different tasks. One way to overcome this problem is to reduce data communication by partitioning and scheduling the workflow and adopting peer-to-peer file sharing among the nodes. Montage workflows of different sizes were considered for the analysis of this problem. The study found that partitioning, together with peer-to-peer file sharing, reduced data communication in the cloud by up to 80%.
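    A minimal sketch of the co-location idea, under the assumption that the workflow is given as per-task-pair data volumes: greedily merge the endpoints of the heaviest edges until only as many groups as VMs remain. The greedy union and the toy Montage-style task names are illustrative; the thesis partitions real Montage workflows and layers a P2P file-sharing system on top.

```python
# Greedy partitioning: tasks that exchange the most data end up on the
# same VM, shrinking cross-VM traffic.
def partition(edges, n_vms):
    """edges: {(task_a, task_b): data_mb}. Returns task -> group root."""
    tasks = {t for edge in edges for t in edge}
    parent = {t: t for t in tasks}             # union-find forest

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]      # path halving
            t = parent[t]
        return t

    n_groups = len(tasks)
    for (a, b), _mb in sorted(edges.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb and n_groups > n_vms:
            parent[ra] = rb                    # co-locate the heavy pair
            n_groups -= 1
    return {t: find(t) for t in tasks}

def cross_vm_mb(edges, placement):
    """Data that still crosses VM boundaries after placement."""
    return sum(mb for (a, b), mb in edges.items() if placement[a] != placement[b])

edges = {("mProject", "mDiff"): 900, ("mDiff", "mFitplane"): 850,
         ("mFitplane", "mBackground"): 50, ("mProject", "mAdd"): 40}
placement = partition(edges, n_vms=2)
print(placement, "cross-VM MB:", cross_vm_mb(edges, placement))
```

    A production partitioner would also balance group sizes so no single VM is overloaded; this sketch optimizes only for communication volume.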

    Quality of service based data-aware scheduling

    Distributed supercomputers have been widely used for solving complex computational problems and modeling complex phenomena such as black holes, the environment, and supply-chain economics. In this work we analyze the use of these distributed supercomputers for time-sensitive, data-driven applications. We present the scheduling challenges involved in running deadline-sensitive applications on shared distributed supercomputers running large parallel jobs, and introduce a "data-aware" scheduling paradigm that overcomes these challenges by using Quality of Service classes for running applications on shared resources. We evaluate the new data-aware scheduling paradigm using an event-driven hurricane simulation framework, which attempts to run various simulations modeling storm surge, wave height, and related phenomena in a timely fashion for use by first responders and emergency officials. We further generalize the work and demonstrate with examples how data-aware computing can be used in other applications with similar requirements.
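    A hedged sketch of the QoS-class idea: urgent, deadline-driven runs are admitted into a high class that is served ahead of best-effort work. The class names, the admission rule, and the job names are assumptions for illustration, not the paper's actual policy.

```python
# Two-class QoS queue: deadline-sensitive jobs that can still finish in time
# are served first, ordered by earliest deadline; everything else is best-effort.
import heapq
import itertools

URGENT, BEST_EFFORT = 0, 1          # lower class value is served first
_tiebreak = itertools.count()       # keeps heap comparisons away from job names

def submit(q, name, runtime, deadline=None, now=0.0):
    """Admit a job; it goes URGENT only if it can still meet its deadline."""
    urgent = deadline is not None and now + runtime <= deadline
    qos = URGENT if urgent else BEST_EFFORT
    dl = deadline if deadline is not None else float("inf")
    heapq.heappush(q, (qos, dl, next(_tiebreak), name, runtime))

def run(q, now=0.0):
    """Serve strictly by QoS class, then by earliest deadline."""
    order = []
    while q:
        _qos, _dl, _, name, runtime = heapq.heappop(q)
        now += runtime
        order.append((name, now))
    return order

q = []
submit(q, "routine-analysis", runtime=5.0)
submit(q, "storm-surge-sim", runtime=2.0, deadline=4.0)
submit(q, "wave-height-sim", runtime=1.0, deadline=2.0)
print(run(q))   # urgent simulations finish before the best-effort job
```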

    Statistical Modeling of Resource Availability in Desktop Grids

    Desktop grids are compute platforms that aggregate and harvest the idle CPU cycles of individually owned personal computers and workstations. A challenge in using these platforms is that the compute resources are volatile; due to this volatility, the vast majority of desktop grid applications are embarrassingly parallel and high-throughput. A deeper understanding of the nature of resource availability is needed to enable the use of desktop grids for a broader class of applications. In this document we further this understanding through statistical analysis of availability traces collected on real-world desktop grid platforms.
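    One common step in such an analysis, sketched here with synthetic data: fit candidate distributions to the lengths of uninterrupted-availability intervals and compare them by log-likelihood. The Weibull-shaped synthetic trace and the candidate set are assumptions for illustration, not the document's actual traces or findings.

```python
# Fit candidate distributions to availability interval lengths and compare
# goodness of fit via log-likelihood.
import numpy as np
from scipy import stats

# Synthetic stand-in for a real trace: hours a machine stayed available.
rng = np.random.default_rng(0)
availability_hours = rng.weibull(0.7, size=2000) * 5.0

for dist in (stats.weibull_min, stats.expon, stats.lognorm):
    params = dist.fit(availability_hours, floc=0)   # pin location at zero
    loglik = np.sum(dist.logpdf(availability_hours, *params))
    print(f"{dist.name:12s} log-likelihood = {loglik:10.1f}")
```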

    Grid-centric scheduling strategies for workflow applications

    Grid computing faces a great challenge because its resources are not localized, but distributed, heterogeneous, and dynamic. It is therefore essential to provide a set of programming tools that execute an application on Grid resources with as little input from the user as possible. The thesis of this work is that Grid-centric scheduling techniques for workflow applications can provide good usability of the Grid environment by reliably executing the application on a large-scale distributed system with good performance. We support our thesis with new and effective approaches in the following five aspects. First, we modeled the performance of existing scheduling approaches in a multi-cluster Grid environment. We implemented several widely used scheduling algorithms and identified the best candidate. The study further introduced a new measurement, based on our experiments, which can improve the schedule quality of some scheduling algorithms by as much as 20-fold in a multi-cluster Grid environment. Second, we studied the scalability of existing Grid scheduling algorithms. To deal with Grid systems consisting of hundreds of thousands of resources, we designed and implemented a novel approach that performs explicit resource selection decoupled from scheduling. Our experimental evaluation confirmed that our decoupled approach scales in such an environment without sacrificing schedule quality by more than 10%. Third, we proposed solutions to address the dynamic nature of Grid computing with a new cluster-based hybrid scheduling mechanism. Our experimental results, collected from real executions on production clusters, demonstrated that this approach produces programs running 30% to 100% faster than the other scheduling approaches we implemented, on both reserved and shared resources. Fourth, we improved the reliability of Grid computing by incorporating fault-tolerance and recovery mechanisms into workflow application execution. Our experiments on a simulated multi-cluster Grid environment demonstrated the effectiveness of our approach and characterized the three-way trade-off between reliability, performance, and resource usage when executing a workflow application. Finally, we reduced the large batch-queue wait times often found in production Grid clusters. We developed a novel approach that partitions the workflow application and submits the parts judiciously to achieve a lower total batch-queue wait time. Experimental results derived from production-site batch-queue logs show that our approach can reduce total wait time by as much as 70%. Combined, our approaches can greatly improve the usability of Grid computing while increasing the performance of workflow applications in a multi-cluster Grid environment.
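    A minimal sketch of the decoupling described in the second aspect: a cheap selection pass shrinks a very large resource pool to a workflow-sized subset, and an ordinary greedy earliest-finish-time list scheduler then runs only on that subset. The ranking criterion (raw speed) and the data shapes are assumptions for illustration, not the thesis's actual algorithm.

```python
# Explicit resource selection decoupled from scheduling: select first,
# then schedule on the much smaller subset.
def select_resources(resources, n_tasks):
    """Selection step: keep only the fastest resources the workflow can use."""
    ranked = sorted(resources, key=lambda r: r["speed"], reverse=True)
    return ranked[:max(1, min(len(ranked), n_tasks))]

def list_schedule(tasks, resources):
    """Scheduling step: greedy earliest-finish-time mapping onto the subset."""
    free_at = {r["name"]: 0.0 for r in resources}
    speed = {r["name"]: r["speed"] for r in resources}
    placement = {}
    for task in sorted(tasks, key=lambda t: t["work"], reverse=True):
        best = min(free_at, key=lambda n: free_at[n] + task["work"] / speed[n])
        free_at[best] += task["work"] / speed[best]
        placement[task["name"]] = best
    return placement

# A 100,000-resource Grid shrinks to a task-sized subset before scheduling.
grid = [{"name": f"node{i}", "speed": 1 + (i % 7)} for i in range(100_000)]
tasks = [{"name": f"t{i}", "work": 10.0 * (i + 1)} for i in range(8)]
subset = select_resources(grid, len(tasks))
print(list_schedule(tasks, subset))
```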

    Executing Large Scale Scientific Workflows in Public Clouds

    Scientists in fields such as high-energy physics, earth science, and astronomy are developing large-scale workflow applications. In many use cases, scientists need to run a set of interrelated but independent workflows (i.e., workflow ensembles) for an entire scientific analysis. Because a workflow ensemble usually contains many sub-workflows, each with hundreds or thousands of jobs under precedence constraints, executing such an ensemble raises serious cost concerns even with elastic, pay-as-you-go cloud resources. In this thesis, we develop a set of methods to optimize the execution of large-scale scientific workflows in public clouds under both cost and deadline constraints, using a two-step approach. First, we present a set of methods to optimize the execution of a scientific workflow in public clouds, with the Montage astronomical mosaic engine running on Amazon EC2 as an example. Second, we address three main challenges in realizing the benefits of public clouds when executing large-scale workflow ensembles: (1) execution coordination, (2) resource provisioning, and (3) data staging. To this end, we develop a new pulling-based workflow execution system with a profiling-based resource provisioning strategy. Our results show that our system can achieve an 80% speed-up, by removing scheduling overhead, compared to the well-known Pegasus workflow management system when running scientific workflow ensembles. Furthermore, our evaluation using Montage workflow ensembles on Amazon EC2 clusters of around 1000 cores demonstrates the efficacy of our resource provisioning strategy in terms of cost-effectiveness while meeting deadlines.
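    A minimal sketch of the pulling model: workers fetch whatever task is ready from a shared queue instead of waiting for a central scheduler to push assignments, which is where the removed scheduling overhead comes from. Threads and a local queue stand in for cloud VMs and a distributed task store; none of this reflects the thesis's actual implementation.

```python
# Pulling-based execution: idle workers pull the next ready task themselves,
# so no central scheduler sits on the critical path of every task.
import queue
import threading

def worker(tasks, done):
    """Each worker pulls ready tasks until the queue stays empty."""
    while True:
        try:
            job = tasks.get(timeout=0.5)   # pull whatever is ready
        except queue.Empty:
            return                         # queue drained: worker exits
        done.append(job())                 # run the task, record the result
        tasks.task_done()

tasks, done = queue.Queue(), []
for i in range(20):
    tasks.put(lambda i=i: i * i)           # toy stand-ins for workflow jobs

workers = [threading.Thread(target=worker, args=(tasks, done)) for _ in range(4)]
for w in workers:
    w.start()
tasks.join()                               # wait until every task is done
for w in workers:
    w.join()
print(sorted(done))
```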