Search CORE

97 research outputs found

Reliable Fault Tolerance System for Service Composition in Mobile Ad Hoc Network

Author: C Shoba Bindu
Poola Veeresh
R Praveen Sam
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/08/2019

Due to the rapid development of smart processing mobile devices, mobile applications are exploring the use of web services in MANETs to satisfy user needs. Complex user needs are satisfied by service composition, in which a complex service is created by combining one or more atomic services. Service composition is a significant challenge in MANETs because of limited bandwidth, constrained energy sources, dynamic node movement, and frequent node failures. These constraints increase the failure rate of service composition. To overcome them, we propose a Reliable Fault Tolerant System for Service Composition in MANETs (RFTSC), which uses the checkpointing technique for service composition in MANETs. We propose a fault policy for each fault that can occur in service composition. Failed services in the composition process are recovered locally by using the checkpointing system and discovered services that satisfy the QoS constraints. A Multi-Service Tree (MST) is proposed to recover failed services with O(1) time complexity. Simulation results show that the proposed approach is efficient compared to existing approaches.

Institute of Advanced Engineering and Science

Enhancing Energy Production with Exascale HPC Methods

Author: Camata José J.
Cela José M.
Costa Danilo
Coutinho Alvaro LGA
Fernández-Galisteo Daniel
Jiménez Carmen
Kourdioumov Vadim
Mattoso Marta
Mayo-García Rafael
Miras Thomas
Moríñigo José A.
Navarro Jorge
Navaux Philippe O.A.
Oliveira Daniel de
Rodríguez-Pascual Manuel
Silva Vítor
Souza Renan
Valduriez Patrick
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

High Performance Computing (HPC) resources have become the key actor for achieving more ambitious challenges in many disciplines. In this step beyond, an explosion in the available parallelism and the use of special-purpose processors are crucial. With such a goal, the HPC4E project applies new exascale HPC techniques to energy industry simulations, customizing them if necessary, and going beyond the state-of-the-art in the required HPC exascale simulations for different energy sources. In this paper, a general overview of these methods is presented as well as some specific preliminary results. The research leading to these results has received funding from the European Union's Horizon 2020 Programme (2014-2020) under the HPC4E Project (www.hpc4e.eu), grant agreement n° 689772, the Spanish Ministry of Economy and Competitiveness under the CODEC2 project (TIN2015-63562-R), and from the Brazilian Ministry of Science, Technology and Innovation through Rede Nacional de Pesquisa (RNP). Computer time on the Endeavour cluster is provided by the Intel Corporation, which enabled us to obtain the presented experimental results in uncertainty quantification in seismic imaging. Postprint (author's final draft)

INRIA a CCSD electronic archive server

HAL-Rennes 1

SSP: Eliminating Redundant Writes in Failure-Atomic NVRAMs via Shadow Sub-Paging

Author: Bittman Daniel
Coburn Joel
Hitz Dave
Kolli Aasheesh
Kwon Youngjin
Lee Changman
Lee Se Kwon
Minh Chi Cao
Ni Yuanjiang
Pelley Steven
Talluri Madhusudhan
Venkataraman Shivaram
Volos Haris
Xu Jian
Yang Jun
Zhao Jishen
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

eScholarship - University of California

Applying future Exascale HPC methodologies in the energy sector

Author: Camata José J.
Cela José M.
Costa Danilo
Coutinho Alvaro LGA
Fernández-Galisteo Daniel
Jiménez Carmen
Kourdioumov Vadim
Mattoso Marta
Mayo-García Rafael
Miras Thomas
Moríñigo José A.
Navarro Jose
Oliveira Daniel de
Rodríguez-Pascual Manuel
Silva Vítor
Souza Renan
Valduriez Patrick
Publication date: 26/09/2016

The application of new exascale HPC techniques to energy industry simulations is absolutely needed nowadays. In this sense, the common procedure is to customize these techniques to the specific energy sector of interest in order to go beyond the state-of-the-art in the required HPC exascale simulations. With this aim, the HPC4E project is developing new exascale methodologies for three different energy sources that are the present and the future of energy: wind energy production and design, efficient combustion systems for biomass-derived fuels (biogas), and exploration geophysics for hydrocarbon reservoirs. In this work, the general exascale advances proposed as part of HPC4E and the specific results they have produced in different domains are presented. The research leading to these results has received funding from the European Union's Horizon 2020 Programme (2014-2020) under the HPC4E Project (www.hpc4e.eu), grant agreement n° 689772, the Spanish Ministry of Economy and Competitiveness under the CODEC2 project (TIN2015-63562-R), and from the Brazilian Ministry of Science, Technology and Innovation through Rede Nacional de Pesquisa (RNP). Computer time on the Endeavour cluster is provided by the Intel Corporation, which enabled us to obtain the presented experimental results in uncertainty quantification in seismic imaging. Postprint (author's final draft)

TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

Author: Chen Xu
Chen Yuheng
Guo Yongqiang
Li Kangyu
Li Qingping
Li Shigang
Wu Baodong
Xia Lei
Xiang Tieyao
Publication date: 18/10/2023

Large language models (LLMs) with hundreds of billions or trillions of parameters, represented by ChatGPT, have had a profound impact on various fields. However, training LLMs with super-large-scale parameters requires large high-performance GPU clusters and long training periods lasting for months. Due to the inevitable hardware and software failures in large-scale clusters, maintaining uninterrupted and long-duration training is extremely challenging. As a result, a substantial amount of training time is devoted to task checkpoint saving and loading, task rescheduling and restart, and manual task anomaly checks, which greatly harms the overall training efficiency. To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system. In this work, we design three key subsystems: the training pipeline automatic fault tolerance and recovery mechanism named Transom Operator and Launcher (TOL), the training task multi-dimensional metric automatic anomaly detection system named Transom Eagle Eye (TEE), and the training checkpoint asynchronous access automatic fault tolerance and recovery technology named Transom Checkpoint Engine (TCE). Here, TOL manages the lifecycle of training tasks, while TEE is responsible for task monitoring and anomaly reporting. TEE detects training anomalies and reports them to TOL, which automatically enters the fault tolerance strategy to eliminate abnormal nodes and restart the training task. The asynchronous checkpoint saving and loading functionality provided by TCE greatly reduces the fault tolerance overhead. The experimental results indicate that TRANSOM significantly enhances the efficiency of large-scale LLM training on clusters. Specifically, the pre-training time for GPT3-175B has been reduced by 28%, while checkpoint saving and loading performance has improved by a factor of 20. Comment: 14 pages, 9 figures

arXiv.org e-Print Archive
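
The asynchronous checkpoint saving mentioned in the TRANSOM abstract above can be sketched in a few lines of Python. This is only a minimal illustration of the general technique (snapshot the state, then write it in a background thread so the training loop is not blocked), not TCE's actual design, which the abstract does not detail. The class `AsyncCheckpointer`, its methods, and the JSON file format are all hypothetical choices made for the example.

```python
import json
import os
import tempfile
import threading
from typing import Dict, Optional


class AsyncCheckpointer:
    """Hypothetical helper that saves state snapshots in a background thread."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._worker: Optional[threading.Thread] = None

    def save(self, state: Dict[str, float]) -> None:
        self.wait()             # allow at most one write in flight
        snapshot = dict(state)  # copy, so the caller may keep mutating state
        self._worker = threading.Thread(target=self._write, args=(snapshot,))
        self._worker.start()

    def _write(self, snapshot: Dict[str, float]) -> None:
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(snapshot, f)
        # Atomic rename: a reader never observes a partially written file.
        os.replace(tmp, self.path)

    def wait(self) -> None:
        """Block until any pending write has finished."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def load(self) -> Dict[str, float]:
        with open(self.path) as f:
            return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = AsyncCheckpointer(os.path.join(d, "ckpt.json"))
        state = {"step": 1.0, "loss": 0.5}
        ckpt.save(state)     # returns immediately; the write runs in the background
        state["step"] = 2.0  # training continues while the snapshot is saved
        ckpt.wait()
        print(ckpt.load())
```

Copying the state before handing it to the worker thread is what lets training proceed safely; writing to a temporary file and renaming it keeps the previous checkpoint usable if the process dies mid-write.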