2,059 research outputs found

    Scheduling Computational Workflows on Failure-Prone Platforms

    Get PDF
    International audienceWe study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, roll-back and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and whether to checkpoint or not checkpoint a task after it completes. We give a polynomial-time algorithm for fork graphs and show that the problem is NP-complete with join graphs. Our main result is a polynomial-time algorithm to compute the execution time of a workflow with specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose efficient heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations

    Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications

    Full text link
    Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. We introduce the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability they pose. We describe the design and implementation of EnTK, characterize its performance and integrate it with two distinct exemplar use cases: seismic inversion and adaptive analog ensembles. We perform nine experiments, characterizing EnTK overheads, strong and weak scalability, and the performance of two use case implementations, at scale and on production infrastructures. We show how EnTK meets the following general requirements: (i) implementing dedicated abstractions to support the description and execution of ensemble applications; (ii) support for execution on heterogeneous computing infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv) fault tolerance. We discuss novel computational capabilities that EnTK enables and the scientific advantages arising thereof. We propose EnTK as an important addition to the suite of tools in support of production scientific computing

    Chiminey: Reliable Computing and Data Management Platform in the Cloud

    Full text link
    The enabling of scientific experiments that are embarrassingly parallel, long running and data-intensive into a cloud-based execution environment is a desirable, though complex undertaking for many researchers. The management of such virtual environments is cumbersome and not necessarily within the core skill set for scientists and engineers. We present here Chiminey, a software platform that enables researchers to (i) run applications on both traditional high-performance computing and cloud-based computing infrastructures, (ii) handle failure during execution, (iii) curate and visualise execution outputs, (iv) share such data with collaborators or the public, and (v) search for publicly available data.Comment: Preprint, ICSE 201

    Exascale machines require new programming paradigms and runtimes

    Get PDF
    Extreme scale parallel computing systems will have tens of thousands of optionally accelerator-equiped nodes with hundreds of cores each, as well as deep memory hierarchies and complex interconnect topologies. Such Exascale systems will provide hardware parallelism at multiple levels and will be energy constrained. Their extreme scale and the rapidly deteriorating reliablity of their hardware components means that Exascale systems will exhibit low mean-time-between-failure values. Furthermore, existing programming models already require heroic programming and optimisation efforts to achieve high efficiency on current supercomputers. Invariably, these efforts are platform-specific and non-portable. In this paper we will explore the shortcomings of existing programming models and runtime systems for large scale computing systems. We then propose and discuss important features of programming paradigms and runtime system to deal with large scale computing systems with a special focus on data-intensive applications and resilience. Finally, we also discuss code sustainability issues and propose several software metrics that are of paramount importance for code development for large scale computing systems

    SANTO: Social Aerial NavigaTion in Outdoors

    Get PDF
    In recent years, the advances in remote connectivity, miniaturization of electronic components and computing power has led to the integration of these technologies in daily devices like cars or aerial vehicles. From these, a consumer-grade option that has gained popularity are the drones or unmanned aerial vehicles, namely quadrotors. Although until recently they have not been used for commercial applications, their inherent potential for a number of tasks where small and intelligent devices are needed is huge. However, although the integrated hardware has advanced exponentially, the refinement of software used for these applications has not beet yet exploited enough. Recently, this shift is visible in the improvement of common tasks in the field of robotics, such as object tracking or autonomous navigation. Moreover, these challenges can become bigger when taking into account the dynamic nature of the real world, where the insight about the current environment is constantly changing. These settings are considered in the improvement of robot-human interaction, where the potential use of these devices is clear, and algorithms are being developed to improve this situation. By the use of the latest advances in artificial intelligence, the human brain behavior is simulated by the so-called neural networks, in such a way that computing system performs as similar as possible as the human behavior. To this end, the system does learn by error which, in an akin way to the human learning, requires a set of previous experiences quite considerable, in order for the algorithm to retain the manners. Applying these technologies to robot-human interaction do narrow the gap. Even so, from a bird's eye, a noticeable time slot used for the application of these technologies is required for the curation of a high-quality dataset, in order to ensure that the learning process is optimal and no wrong actions are retained. Therefore, it is essential to have a development platform in place to ensure these principles are enforced throughout the whole process of creation and optimization of the algorithm. In this work, multiple already-existing handicaps found in pipelines of this computational gauge are exposed, approaching each of them in a independent and simple manner, in such a way that the solutions proposed can be leveraged by the maximum number of workflows. On one side, this project concentrates on reducing the number of bugs introduced by flawed data, as to help the researchers to focus on developing more sophisticated models. On the other side, the shortage of integrated development systems for this kind of pipelines is envisaged, and with special care those using simulated or controlled environments, with the goal of easing the continuous iteration of these pipelines.Thanks to the increasing popularity of drones, the research and development of autonomous capibilities has become easier. However, due to the challenge of integrating multiple technologies, the available software stack to engage this task is restricted. In this thesis, we accent the divergencies among unmanned-aerial-vehicle simulators and propose a platform to allow faster and in-depth prototyping of machine learning algorithms for this drones

    DIET : new developments and recent results

    Get PDF
    Among existing grid middleware approaches, one simple, powerful, and flexibleapproach consists of using servers available in different administrative domainsthrough the classic client-server or Remote Procedure Call (RPC) paradigm.Network Enabled Servers (NES) implement this model also called GridRPC.Clients submit computation requests to a scheduler whose goal is to find aserver available on the grid. The aim of this paper is to give an overview of anNES middleware developed in the GRAAL team called DIET and to describerecent developments. DIET (Distributed Interactive Engineering Toolbox) is ahierarchical set of components used for the development of applications basedon computational servers on the grid.Parmi les intergiciels de grilles existants, une approche simple, flexible et performante consiste a utiliser des serveurs disponibles dans des domaines administratifs diffĂ©rents Ă  travers le paradigme classique de l’appel de procĂ©dure Ă distance (RPC). Les environnements de ce type, connus sous le terme de Network Enabled Servers, implĂ©mentent ce modĂšle appelĂ© GridRPC. Des clientssoumettent des requĂȘtes de calcul Ă  un ordonnanceur dont le but consiste Ă trouver un serveur disponible sur la grille.Le but de cet article est de donner un tour d’horizon d’un intergiciel dĂ©veloppĂ©dans le projet GRAAL appelĂ© DIET 1. DIET (Distributed Interactive Engineering Toolbox) est un ensemble hiĂ©rarchique de composants utilisĂ©s pour ledĂ©veloppement d’applications basĂ©es sur des serveurs de calcul sur la grille
    • 

    corecore