
    Improving Performance of Iterative Methods by Lossy Checkpointing

    Iterative methods are commonly used to solve large, sparse linear systems, which are fundamental operations in many modern scientific simulations. When large-scale iterative methods run with many ranks in parallel, they must checkpoint their dynamic variables periodically to survive unavoidable fail-stop errors, which requires fast I/O systems and large storage space. Significantly reducing the checkpointing overhead is therefore critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and theoretically derive an upper bound for the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee a performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., the extra iterations caused by lossy checkpoint files) for multiple types of iterative methods. (4) We evaluate the lossy checkpointing scheme with optimal checkpointing intervals in a high-performance computing environment with 2,048 cores, using the well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that, in the presence of system failures, our optimized lossy checkpointing scheme reduces the fault tolerance overhead of iterative methods by 23%-70% compared with traditional checkpointing and by 20%-58% compared with lossless-compressed checkpointing.
    Comment: 14 pages, 10 figures, HPDC'18
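
    A rough sketch of the kind of performance model at play (illustrative symbols and a Young-style approximation, not the paper's exact formulation): with per-checkpoint cost $C$ and mean time between failures $M$, the optimal checkpoint interval is $\tau_{opt} \approx \sqrt{2CM}$. Lossy compression shrinks the checkpoint cost to $C_\ell < C$, but each restart from a lossy checkpoint may add $\Delta n$ extra iterations of duration $t_{it}$. Over a run of length $T$ there are about $T/\tau$ checkpoints and $T/M$ failures, so lossy checkpointing pays off roughly when

        \frac{T}{\tau}\,(C - C_\ell) \;>\; \frac{T}{M}\,\Delta n\, t_{it}
        \quad\Longleftrightarrow\quad
        \Delta n\, t_{it} \;<\; (C - C_\ell)\,\sqrt{\frac{M}{2\,C_\ell}},

    taking $\tau = \sqrt{2 C_\ell M}$. An upper bound on $\Delta n$, such as the one the paper derives, caps the restart penalty so that an inequality of this form can be guaranteed.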

    Dynamic Quality Metric Oriented Error-bounded Lossy Compression for Scientific Datasets

    With the ever-increasing execution scale of high performance computing (HPC) applications, vast amounts of data are produced by scientific research every day. Error-bounded lossy compression is considered a very promising solution to the big-data issue for scientific applications because it can significantly reduce the data volume at low time cost while allowing users to control the compression error with a specified error bound. The existing error-bounded lossy compressors, however, are all built on inflexible designs or compression pipelines that cannot adapt to the diverse compression quality requirements and metrics favored by different application users. In this paper, we propose a novel dynamic quality-metric-oriented error-bounded lossy compression framework, namely QoZ. The detailed contribution is threefold. (1) We design a novel highly parameterized multi-level interpolation-based data predictor, which can significantly improve the overall compression quality at the same compressed size. (2) We design the error-bounded lossy compression framework QoZ based on the adaptive predictor, which can auto-tune the critical parameters and optimize the compression result according to user-specified quality metrics during online compression. (3) We evaluate QoZ carefully by comparing its compression quality with multiple state-of-the-art compressors on various real-world scientific application datasets. Experiments show that, compared with the second-best lossy compressor, QoZ achieves up to 70% higher compression ratio under the same error bound, up to 150% higher compression ratio under the same PSNR, and up to 270% higher compression ratio under the same SSIM.
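
    The following toy sketch illustrates the interpolation-predictor idea behind compressors of this family; it is not QoZ's actual code, and the function names are invented for illustration. Each level predicts midpoints from already-reconstructed neighbors and quantizes the residual so the pointwise error stays within the bound.

    import numpy as np

    def compress(data, eb):
        """Toy multi-level interpolation compressor (illustrative, not QoZ).
        Expects len(data) == 2**k + 1; guarantees |data - recon| <= eb."""
        n = len(data)
        q = 2.0 * eb                      # bin width of the uniform quantizer
        recon = np.zeros(n)
        codes = np.zeros(n, dtype=np.int64)
        recon[0], recon[n - 1] = data[0], data[n - 1]  # anchors kept exactly
        stride = (n - 1) // 2
        while stride >= 1:                # coarse-to-fine interpolation levels
            for i in range(stride, n - 1, 2 * stride):
                pred = 0.5 * (recon[i - stride] + recon[i + stride])
                k = int(np.rint((data[i] - pred) / q))  # quantized residual
                codes[i] = k
                recon[i] = pred + k * q   # decoder-visible value, error <= eb
            stride //= 2
        return codes, (data[0], data[n - 1])

    A decoder repeats the same loop, reading each code instead of computing it. The actual size reduction comes from entropy-coding the (mostly near-zero) codes; QoZ's contribution is to expose the predictor's parameters and auto-tune them online against a user-chosen metric such as PSNR or SSIM.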

    H-RADIC: The Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments

    Even though cloud platforms promise to be reliable, several availability incidents prove that they are not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture that executes a parallel application in at least three different virtual clusters or sites. The execution state of each site is saved periodically in another site and is recovered in case of failure. The paper details the configuration of the architecture and the experimental results using three virtual clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that execution time increased by 5% to 36% without failures and by 27% to 66% when failures occurred.
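
    A minimal sketch of the cross-site protection pattern described above, assuming a shared mount between sites and DMTCP-produced checkpoint files; all paths and names are hypothetical, and the real system coordinates this through the RADIC protectors rather than a copy loop.

    import shutil, time
    from pathlib import Path

    SITES = ["siteA", "siteB", "siteC"]        # at least 3 virtual clusters

    def protector_of(site):                    # ring: each site guards the next
        return SITES[(SITES.index(site) + 1) % len(SITES)]

    def protect(site, ckpt_dir, interval_s=600):
        """Periodically replicate a site's newest checkpoint to its protector,
        staying off the application's critical path."""
        while True:
            ckpts = list(Path(ckpt_dir).glob("*.dmtcp"))
            if ckpts:
                latest = max(ckpts, key=lambda p: p.stat().st_mtime)
                dest = Path(f"/mnt/{protector_of(site)}/ckpts")  # assumed mount
                dest.mkdir(parents=True, exist_ok=True)
                shutil.copy2(latest, dest)
            time.sleep(interval_s)

    def recover(failed_site):
        """After a site failure, locate the newest copy held by its protector
        (the restart itself would go through dmtcp_restart)."""
        src = Path(f"/mnt/{protector_of(failed_site)}/ckpts")
        return max(src.glob("*.dmtcp"), key=lambda p: p.stat().st_mtime)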

    Lossy Compression and Its Application on Large Scale Scientific Datasets

    High Performance Computing (HPC) applications keep growing in data size and computational complexity, so fault tolerance and system recovery must be considered to reduce computation and resource costs in HPC systems. Modern large-scale HPC applications face bottlenecks due to computational complexity, increased runtime, and large data storage requirements; these issues cannot be ignored in the current supercomputing era. Data compression is one of the most effective ways to address the storage issue, and lossy compression is far more feasible and efficient than traditional lossless compression given the low I/O bandwidth available to large applications. The goal of this work is to find the optimal lossy compression configuration: the one that yields the maximum compression ratio with the smallest user-controlled error. For this purpose, two large-scale applications were experimented with under various parameters of the well-known compression method SZ. The first is NWChem, a quantum chemistry HPC application. The second is vascular blood flow simulation data generated by HemeLB, a parallel lattice Boltzmann code for fluid flow simulations with complex geometries. The SZ compressor is integrated into the applications' code to test correctness and scalability and to give a comparative picture of the performance change. Lastly, statistical methods are tested to pre-determine the data distortion for different error bounds.
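
    A sketch of the configuration sweep this describes: try several error bounds, measure compression ratio and worst-case error, and keep the tightest bound that meets a target ratio. The quantize-plus-zlib stand-in below is purely illustrative; the thesis uses the real SZ compressor.

    import zlib
    import numpy as np

    def toy_compress(data, eb):
        codes = np.rint(data / (2 * eb)).astype(np.int32)  # |error| <= eb
        blob = zlib.compress(codes.tobytes())              # stand-in for SZ
        recon = codes.astype(data.dtype) * (2 * eb)
        return blob, recon

    def sweep(data, bounds, min_ratio=10.0):
        """Return (error_bound, ratio, max_error) for the smallest error bound
        reaching min_ratio, or None if no configuration qualifies."""
        for eb in sorted(bounds):                 # tightest bounds first
            blob, recon = toy_compress(data, eb)
            ratio = data.nbytes / len(blob)
            max_err = float(np.max(np.abs(data - recon)))
            if ratio >= min_ratio:
                return eb, ratio, max_err
        return None

    field = np.fromfunction(lambda i: np.sin(i / 50.0), (1 << 16,))
    print(sweep(field, bounds=[1e-5, 1e-4, 1e-3, 1e-2]))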

    H-RADIC: a fault tolerance solution for virtual clusters in multi-cloud environments

    Even though cloud platforms promise to be reliable, several availability incidents prove that they are not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture that executes parallel applications protected by RADIC in at least three different virtual clusters or sites. The execution state of each site is saved periodically in another site and is recovered in case of failure. The paper details the configuration of the architecture and the experimental results using three clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that by adding a cluster protector it is possible to implement the next level in the hierarchy, where the first level of the RADIC hierarchy works as an observer at the site level. In addition, the experiments showed that the protection implementation stays out of the application's critical path and depends only on the resources used.
