
    Improving Performance of Iterative Methods by Lossy Checkpointing

    Iterative methods are commonly used to solve large, sparse linear systems, which are fundamental operations in many modern scientific simulations. When large-scale iterative methods run with many ranks in parallel, they must checkpoint their dynamic variables periodically to survive unavoidable fail-stop errors, which requires fast I/O systems and large storage space. Significantly reducing the checkpointing overhead is therefore critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and theoretically derive an upper bound for the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee a performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., the extra iterations caused by lossy checkpoint files) for multiple types of iterative methods. (4) We evaluate the lossy checkpointing scheme with optimal checkpointing intervals in a high-performance computing environment with 2,048 cores, using the well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that, in the presence of system failures, our optimized lossy checkpointing scheme reduces the fault tolerance overhead of iterative methods by 23%-70% compared with traditional checkpointing and by 20%-58% compared with lossless-compressed checkpointing.
    Comment: 14 pages, 10 figures, HPDC'18
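
    A rough sketch of the kind of performance model at play (illustrative symbols and a Young-style approximation, not the paper's exact formulation): with per-checkpoint cost $C$ and mean time between failures $M$, the optimal checkpoint interval is $\tau_{opt} \approx \sqrt{2CM}$. Lossy compression shrinks the checkpoint cost to $C_\ell < C$, but each restart from a lossy checkpoint may add $\Delta n$ extra iterations of duration $t_{it}$. Over a run of length $T$ there are about $T/\tau$ checkpoints and $T/M$ failures, so lossy checkpointing pays off roughly when

        \frac{T}{\tau}\,(C - C_\ell) \;>\; \frac{T}{M}\,\Delta n\, t_{it}
        \quad\Longleftrightarrow\quad
        \Delta n\, t_{it} \;<\; (C - C_\ell)\,\sqrt{\frac{M}{2\,C_\ell}},

    taking $\tau = \sqrt{2 C_\ell M}$. An upper bound on $\Delta n$, such as the one the paper derives, caps the restart penalty so that an inequality of this form can be guaranteed.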

    Dynamic Quality Metric Oriented Error-bounded Lossy Compression for Scientific Datasets

    With the ever-increasing execution scale of high performance computing (HPC) applications, vast amounts of data are produced by scientific research every day. Error-bounded lossy compression is considered a very promising solution to the big-data issue for scientific applications because it can significantly reduce the data volume at low time cost while allowing users to control the compression error with a specified error bound. The existing error-bounded lossy compressors, however, are all built on inflexible designs or compression pipelines that cannot adapt to the diverse compression quality requirements and metrics favored by different application users. In this paper, we propose a novel dynamic quality-metric-oriented error-bounded lossy compression framework, namely QoZ. The detailed contribution is threefold. (1) We design a novel highly parameterized multi-level interpolation-based data predictor, which can significantly improve the overall compression quality at the same compressed size. (2) We design the error-bounded lossy compression framework QoZ based on the adaptive predictor, which can auto-tune the critical parameters and optimize the compression result according to user-specified quality metrics during online compression. (3) We evaluate QoZ carefully by comparing its compression quality with multiple state-of-the-art compressors on various real-world scientific application datasets. Experiments show that, compared with the second-best lossy compressor, QoZ achieves up to 70% higher compression ratio under the same error bound, up to 150% higher compression ratio under the same PSNR, and up to 270% higher compression ratio under the same SSIM.
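
    The following toy sketch illustrates the interpolation-predictor idea behind compressors of this family; it is not QoZ's actual code, and the function names are invented for illustration. Each level predicts midpoints from already-reconstructed neighbors and quantizes the residual so the pointwise error stays within the bound.

    import numpy as np

    def compress(data, eb):
        """Toy multi-level interpolation compressor (illustrative, not QoZ).
        Expects len(data) == 2**k + 1; guarantees |data - recon| <= eb."""
        n = len(data)
        q = 2.0 * eb                      # bin width of the uniform quantizer
        recon = np.zeros(n)
        codes = np.zeros(n, dtype=np.int64)
        recon[0], recon[n - 1] = data[0], data[n - 1]  # anchors kept exactly
        stride = (n - 1) // 2
        while stride >= 1:                # coarse-to-fine interpolation levels
            for i in range(stride, n - 1, 2 * stride):
                pred = 0.5 * (recon[i - stride] + recon[i + stride])
                k = int(np.rint((data[i] - pred) / q))  # quantized residual
                codes[i] = k
                recon[i] = pred + k * q   # decoder-visible value, error <= eb
            stride //= 2
        return codes, (data[0], data[n - 1])

    A decoder repeats the same loop, reading each code instead of computing it. The actual size reduction comes from entropy-coding the (mostly near-zero) codes; QoZ's contribution is to expose the predictor's parameters and auto-tune them online against a user-chosen metric such as PSNR or SSIM.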

    H-RADIC: The Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments

    Even though cloud platforms promise to be reliable, several availability incidents prove that they are not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture that executes a parallel application in at least three different virtual clusters or sites. The execution state of each site is saved periodically in another site and is recovered in case of failure. The paper details the configuration of the architecture and the experimental results using three virtual clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that execution time increased by 5% to 36% without failures and by 27% to 66% when failures occurred.
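
    A minimal sketch of the cross-site protection pattern described above, assuming a shared mount between sites and DMTCP-produced checkpoint files; all paths and names are hypothetical, and the real system coordinates this through the RADIC protectors rather than a copy loop.

    import shutil, time
    from pathlib import Path

    SITES = ["siteA", "siteB", "siteC"]        # at least 3 virtual clusters

    def protector_of(site):                    # ring: each site guards the next
        return SITES[(SITES.index(site) + 1) % len(SITES)]

    def protect(site, ckpt_dir, interval_s=600):
        """Periodically replicate a site's newest checkpoint to its protector,
        staying off the application's critical path."""
        while True:
            ckpts = list(Path(ckpt_dir).glob("*.dmtcp"))
            if ckpts:
                latest = max(ckpts, key=lambda p: p.stat().st_mtime)
                dest = Path(f"/mnt/{protector_of(site)}/ckpts")  # assumed mount
                dest.mkdir(parents=True, exist_ok=True)
                shutil.copy2(latest, dest)
            time.sleep(interval_s)

    def recover(failed_site):
        """After a site failure, locate the newest copy held by its protector
        (the restart itself would go through dmtcp_restart)."""
        src = Path(f"/mnt/{protector_of(failed_site)}/ckpts")
        return max(src.glob("*.dmtcp"), key=lambda p: p.stat().st_mtime)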

    Lossy Compression and Its Application on Large Scale Scientific Datasets

    High Performance Computing (HPC) applications keep growing in data size and computational complexity, so fault tolerance and system recovery must be considered to reduce computation and resource costs in HPC systems. Modern large-scale HPC applications face bottlenecks due to computational complexity, increased runtime, and large data storage requirements; these issues cannot be ignored in the current supercomputing era. Data compression is one of the most effective ways to address the storage issue, and lossy compression is far more feasible and efficient than traditional lossless compression given the low I/O bandwidth available to large applications. The goal of this work is to find the optimal lossy compression configuration: the one that yields the maximum compression ratio with the smallest user-controlled error. For this purpose, two large-scale applications were experimented with under various parameters of the well-known compression method SZ. The first is NWChem, a quantum chemistry HPC application. The second is vascular blood flow simulation data generated by HemeLB, a parallel lattice Boltzmann code for fluid flow simulations with complex geometries. The SZ compressor is integrated into the applications' code to test correctness and scalability and to give a comparative picture of the performance change. Lastly, statistical methods are tested to pre-determine the data distortion for different error bounds.
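
    A sketch of the configuration sweep this describes: try several error bounds, measure compression ratio and worst-case error, and keep the tightest bound that meets a target ratio. The quantize-plus-zlib stand-in below is purely illustrative; the thesis uses the real SZ compressor.

    import zlib
    import numpy as np

    def toy_compress(data, eb):
        codes = np.rint(data / (2 * eb)).astype(np.int32)  # |error| <= eb
        blob = zlib.compress(codes.tobytes())              # stand-in for SZ
        recon = codes.astype(data.dtype) * (2 * eb)
        return blob, recon

    def sweep(data, bounds, min_ratio=10.0):
        """Return (error_bound, ratio, max_error) for the smallest error bound
        reaching min_ratio, or None if no configuration qualifies."""
        for eb in sorted(bounds):                 # tightest bounds first
            blob, recon = toy_compress(data, eb)
            ratio = data.nbytes / len(blob)
            max_err = float(np.max(np.abs(data - recon)))
            if ratio >= min_ratio:
                return eb, ratio, max_err
        return None

    field = np.fromfunction(lambda i: np.sin(i / 50.0), (1 << 16,))
    print(sweep(field, bounds=[1e-5, 1e-4, 1e-3, 1e-2]))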

    H-RADIC: a fault tolerance solution for virtual clusters in multi-cloud environments

    Even though cloud platforms promise to be reliable, several availability incidents prove that they are not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper presents H-RADIC, an approach based on the RADIC architecture that executes parallel applications protected by RADIC in at least three different virtual clusters or sites. The execution state of each site is saved periodically in another site and is recovered in case of failure. The paper details the configuration of the architecture and the experimental results using three clusters running NAS parallel applications protected with DMTCP, a well-known distributed multi-threaded checkpointing tool. Our experiments show that by adding a cluster protector it is possible to implement the next level in the hierarchy, where the first level of the RADIC hierarchy works as an observer at the site level. In addition, the experiments showed that the protection implementation stays out of the application's critical path and depends only on the resources used.
