Search CORE

35 research outputs found

High Performance Pipelined Process Migration with RDMA

Author: Besseron Xavier
Ouyang Xiangyong
Panda Dhabaleswar K.
Rajachandrasekar Raghunath
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2011
Field of study

Crossref

Open Repository and Bibliography - Luxembourg

Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging?

Author: Besseron Xavier
Meshram Vilobh
Ouyang Xiangyong
Panda Dhabaleswar K.
Rajachandrasekar Raghunath
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/08/2011
Field of study

Crossref

Open Repository and Bibliography - Luxembourg

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

Author: Besseron Xavier
Huang Jian
Ouyang Xiangyong
Panda Dhabaleswar K.
Rajachandrasekar Raghunath
Wang Hao
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/09/2011
Field of study

Crossref

Open Repository and Bibliography - Luxembourg

Modified teaching-learning based optimization for 0–1 knapsack optimization problems

Author: Haibin Ouyang
Qing Wang
Xiangyong Kong
Publication venue: IEEE
Publication date: 01/05/2017
Field of study

Crossref

A prediction-based adaptive grouping differential evolution algorithm for constrained numerical optimization

Author: Haibin Ouyang
Xiangyong Kong
Xiaoxue Piao
Publication venue: Springer Science and Business Media LLC
Publication date: 28/07/2013
Field of study

Crossref

Accelerating checkpoint operation by node-level write aggregation on multicore systems

Author: Dhabaleswar K. P
Karthik Gopalakrishnan
Xiangyong Ouyang
Publication venue
Publication date: 01/01/2009
Field of study

Abstract—Clusters and applications continue to grow in size while their mean time between failure (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for large scale parallel jobs. However, the performance of the Checkpoint/Restart mechanism does not scale well with increasing job size due to constraints within the file system. Furthermore, with the advent of multi-core architecture, the situation is aggravated due to larger number of processes running on the same node, trying to checkpoint simultaneously. This results in increased number of file writes at the time of checkpointing which leads to performance degradation. As a result, deployment of Checkpoint/Restart mechanisms for large scale parallel applications is limited. In this work, we explore the Checkpoint/Restart mechanism in MVAPICH2, which uses BLCR as the checkpointing library. Our profiling of the checkpoints for the NAS parallel benchmarks revealed a large number of small file writes interspersed with large writes. Based on these observation we propose to optimize checkpoint creation by classifying checkpoint file writes into small writes, medium writes and large writes based on their size of data to write, and use write aggregation to optimize the small and medium writes. At the aggregation threshold of 512KB, the implementation of our design in BLCR shows improvements from 27 % to 32 % over the original BLCR in terms of time cost to checkpoint an MPI application. I

CiteSeerX

Crossref

An Efficient Commit Protocol Exploiting Primary-Backup Placement in a Distributed Storage System

Author: Haruo Yokota
Tomohiro Yoshiharay
Xiangyong Ouyang
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date
Field of study

Crossref

Enhancing Checkpoint Performance with Staging IO and SSD

Author: Dhabaleswar K. Panda
Sonya Marcarelli
Xiangyong Ouyang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010
Field of study

With the ever-growing size of computer clusters and applications, system failures are becoming inevitable. Checkpointing, a strategy to ensure fault tolerance, has become imperative in such an environment. How-ever existing mechanism of checkpoint writing to par-allel file systems doesn’t perform well with increasing job size. Solid State Disk(SSD) is attracting more and more attention due to its technical merits such as good random access performance, low power consumption and shock resistance. However, how to apply SSDs into a parallel storage system to improve checkpoint writing still remains an open question. In this paper we propose a new strategy to en-hance checkpoint writing performance by aggregating checkpoint writing at client side, and utilizing staging IO on data servers. We also explore the potentials to substitute traditional hard disks with SSDs on data server to achieve better write bandwidth. Our strat-egy achieves up to 6.3 times higher write bandwidth than a popular parallel file system PVFS2 [6] with 8 client nodes and 4 data servers. In experiments with real applications using 64 application processes and 4 data servers, our strategy can accelerate checkpoint writing by up to 9.9 times compared to PVFS2.

CiteSeerX

Crossref