
6,782 research outputs found

What does fault tolerant Deep Learning need from MPI?

Author: Amatya Vinay
Daily Jeff
Siegel Charles
Vishnu Abhinav
Publication date: 01/01/2017

Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithms for large-scale data analysis. DL algorithms are computationally expensive: even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long-running DL applications become susceptible to faults, requiring the development of a fault tolerant system infrastructure in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for checkpointing of any critical data structures; and, most importantly, consideration of several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe to use a ULFM-based implementation. Our evaluation using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI-based ULFM.

arXiv.org e-Print Archive

Crossref

A Study on the Parallelization of Terrain-Covering Ant Robots Simulations

Author: D.R. Jefferson
G. Cordasco
J. Svennebring
M.W. Macy
P. Richmond
R. Brown
S. Luke
T. Takahashi
W. Marurngsith
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

Agent-based simulation is used as a tool for supporting (time-critical) decision making in differentiated contexts. Hence, techniques for speeding up the execution of agent-based models, such as Parallel Discrete Event Simulation (PDES), are of great relevance and benefit. On the other hand, parallelism entails that the final output provided by the simulator should closely match the one provided by a traditional sequential run. This is not obvious given that, for performance and efficiency reasons, parallel simulation engines do not allow the evaluation of global predicates on the simulation model evolution with arbitrary time granularity along the simulation time axis. In this article we present a study on the effects of parallelization of agent-based simulations, focusing on complementary aspects such as performance and reliability of the provided simulation output. We target Terrain Covering Ant Robots (TCAR) simulations, which are useful in rescue scenarios to determine how many agents (i.e., robots) should be used to completely explore a certain terrain for possible victims within a given time. © 2014 Springer-Verlag Berlin Heidelberg

Crossref

ART

Archivio della ricerca- Università di Roma La Sapienza

Self-Chord: a Bio-Inspired P2P Framework for Self-Organizing Distributed Systems

Author: Forestiero A.
Leonardi Emilio
Mastroianni C.
Meo Michela
Publication venue: IEEE and ACM
Publication date: 01/01/2010

Crossref

PORTO Publications Open Repository TOrino

A Taxonomy of Workflow Management Systems for Grid Computing

Author: Buyya Rajkumar
Yu Jia
Publication date: 01/01/2005

With the advent of Grid and application technologies, scientists and engineers are building more and more complex applications to manage and process large data sets, and execute scientific experiments on distributed resources. Such application scenarios require means for composing and executing complex workflows. Therefore, many efforts have been made towards the development of workflow management systems for Grid computing. In this paper, we propose a taxonomy that characterizes and classifies various approaches for building and executing workflows on Grids. We also survey several representative Grid workflow systems developed by various projects world-wide to demonstrate the comprehensiveness of the taxonomy. The taxonomy not only highlights the design and engineering similarities and differences of state-of-the-art Grid workflow systems, but also identifies the areas that need further research. Comment: 29 pages, 15 figures

arXiv.org e-Print Archive

CiteSeerX

A WOA-based optimization approach for task scheduling in cloud Computing systems

Author: Chen Xuan
Cheng Long
Liu Cong
Liu Jinwei
Liu Qingzhi
Mao Ying
Murphy John
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2020

Task scheduling in cloud computing can directly affect the resource usage and operational cost of a system. To improve the efficiency of task executions in a cloud, various metaheuristic algorithms, as well as their variations, have been proposed to optimize the scheduling. In this work, for the first time, we apply the latest metaheuristic, WOA (the whale optimization algorithm), to cloud task scheduling with a multiobjective optimization model, aiming at improving the performance of a cloud system with given computing resources. On that basis, we propose an advanced approach called IWC (Improved WOA for Cloud task scheduling) to further improve the optimal solution search capability of the WOA-based method. We present the detailed implementation of IWC, and our simulation-based experiments show that the proposed IWC has better convergence speed and accuracy in searching for the optimal task scheduling plans, compared to the current metaheuristic algorithms. Moreover, it can also achieve better performance on system resource utilization, in the presence of both small and large-scale tasks.

Irish Universities

DCU Online Research Access Service

Glowworm swarm optimisation based task scheduling for cloud computing

Author: Alboaneen Dabiah Ahmed
Tianfield Huaglory
Zhang Yan
Publication venue: Association for Computing Machinery (ACM)
Publication date: 22/03/2017

ResearchOnline@GCU