6,782 research outputs found

    What does fault tolerant Deep Learning need from MPI?

    Full text link
    Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive - even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults - requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for de- signing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by ex- tending MaTEx-Caffe for using ULFM-based implementation. Our evaluation using the ImageNet dataset and AlexNet, and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI based ULFM

    A Study on the Parallelization of Terrain-Covering Ant Robots Simulations

    Get PDF
    Agent-based simulation is used as a tool for supporting (time-critical) decision making in differentiated contexts. Hence, techniques for speeding up the execution of agent-based models, such as Parallel Discrete Event Simulation (PDES), are of great relevance/benefit. On the other hand, parallelism entails that the final output provided by the simulator should closely match the one provided by a traditional sequential run. This is not obvious given that, for performance and efficiency reasons, parallel simulation engines do not allow the evaluation of global predicates on the simulation model evolution with arbitrary time-granularity along the simulation time-Axis. In this article we present a study on the effects of parallelization of agent-based simulations, focusing on complementary aspects such as performance and reliability of the provided simulation output. We target Terrain Covering Ant Robots (TCAR) simulations, which are useful in rescue scenarios to determine how many agents (i.e., robots) should be used to completely explore a certain terrain for possible victims within a given time. © 2014 Springer-Verlag Berlin Heidelberg

    A Taxonomy of Workflow Management Systems for Grid Computing

    Full text link
    With the advent of Grid and application technologies, scientists and engineers are building more and more complex applications to manage and process large data sets, and execute scientific experiments on distributed resources. Such application scenarios require means for composing and executing complex workflows. Therefore, many efforts have been made towards the development of workflow management systems for Grid computing. In this paper, we propose a taxonomy that characterizes and classifies various approaches for building and executing workflows on Grids. We also survey several representative Grid workflow systems developed by various projects world-wide to demonstrate the comprehensiveness of the taxonomy. The taxonomy not only highlights the design and engineering similarities and differences of state-of-the-art in Grid workflow systems, but also identifies the areas that need further research.Comment: 29 pages, 15 figure

    A WOA-based optimization approach for task scheduling in cloud Computing systems

    Get PDF
    Task scheduling in cloud computing can directly affect the resource usage and operational cost of a system. To improve the efficiency of task executions in a cloud, various metaheuristic algorithms, as well as their variations, have been proposed to optimize the scheduling. In this work, for the first time, we apply the latest metaheuristics WOA (the whale optimization algorithm) for cloud task scheduling with a multiobjective optimization model, aiming at improving the performance of a cloud system with given computing resources. On that basis, we propose an advanced approach called IWC (Improved WOA for Cloud task scheduling) to further improve the optimal solution search capability of the WOA-based method. We present the detailed implementation of IWC and our simulation-based experiments show that the proposed IWC has better convergence speed and accuracy in searching for the optimal task scheduling plans, compared to the current metaheuristic algorithms. Moreover, it can also achieve better performance on system resource utilization, in the presence of both small and large-scale tasks

    Glowworm swarm optimisation based task scheduling for cloud computing

    Get PDF
    corecore