4 research outputs found

    Application-level differential checkpointing for HPC applications with dynamic datasets

    High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and the amount of memory in supercomputers keep increasing, the size of checkpoint data also increases dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize the checkpointing overhead by writing only data differences. This is typically implemented at the memory page level, sometimes complemented with hashing algorithms. However, such a technique is unable to cope with dynamic-size datasets. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We identify dirty data blocks using hash algorithms. In order to evaluate the dCP performance, we ported the HPC applications xPic, LULESH 2.0 and Heat2D and analyzed their potential for reducing I/O with dCP and how this data reduction influences checkpoint performance. In our experiments, we achieve reductions of up to 62% of the checkpoint time. This project has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) and the Horizon 2020 (H2020) funding framework under grant agreement no. H2020-FETHPC-754304 (DEEP-EST); and from the European Union's Horizon 2020 research and innovation programme under the LEGaTO Project (legato-project.eu), grant agreement No 780681. Peer Reviewed. Postprint (author's final draft)
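The hash-based detection of dirty blocks described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation or file format: the block size, the choice of SHA-256, and the names `block_hashes` and `differential_checkpoint` are all assumptions made for illustration. The sketch treats a dataset that has grown since the last checkpoint as having new trailing blocks, which are always written; shrinking datasets are not handled.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size in bytes


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash every fixed-size block of the dataset (the last block may be short)."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def differential_checkpoint(data, previous_hashes, block_size=BLOCK_SIZE):
    """Return ({block_index: block_bytes} for dirty blocks, new hash list).

    A block is dirty if its hash differs from the previous checkpoint or if
    it did not exist then (the dataset grew).
    """
    current = block_hashes(data, block_size)
    delta = {}
    for i, digest in enumerate(current):
        if i >= len(previous_hashes) or digest != previous_hashes[i]:
            delta[i] = bytes(data[i * block_size:(i + 1) * block_size])
    return delta, current


# First checkpoint: nothing stored yet, so every block is written.
data = bytearray(4 * BLOCK_SIZE)
full, hashes = differential_checkpoint(data, [])

# Modify one byte in block 1 and grow the dataset by 100 bytes.
data[5000] = 1
data.extend(b"\x02" * 100)
delta, hashes = differential_checkpoint(data, hashes)
print(sorted(delta))  # indices of the blocks that must be written
```

In this example only the modified block and the newly appended block are written on the second checkpoint, while the three unchanged blocks are skipped.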


    End-to-end data integrity is of utmost importance when sending data through a communication network, and a common way to ensure this is by appending a few bits for error detection (e.g., a checksum or cyclic redundancy check) to the data sent. Data can be corrupted at the sending or receiving hosts, in one of the intermediate systems (e.g., routers and switches), in the network interface card, or on the transmission link. The Internet’s Transmission Control Protocol (TCP) uses a 16-bit one’s complement checksum for end-to-end error detection of each TCP segment [1]. The TCP protocol specification dates back to the 1970s, and better error detection alternatives exist (e.g., Fletcher checksum, Adler checksum, Cyclic Redundancy Check (CRC)) that provide higher error detection efficiency; nevertheless, the one’s complement checksum is still in use today as part of the TCP standard. The TCP checksum has low computational complexity when compared to software implementations of the other algorithms. Some of the original reasons for selecting the 16-bit one’s complement checksum are its simple calculation and the property that its computation on big- and little-endian machines results in the same checksum, but byte-swapped. This latter characteristic is not true for a two’s complement checksum. A negative characteristic of one’s and two’s complement checksums is that changing the order of the data does not affect the checksum. In [2], the authors collected two years of data and concluded after analysis that the TCP checksum “will fail to detect errors for roughly one in 16 million to 10 billion packets.” While some of the sources responsible for TCP checksum errors have decreased in the nearly 20 years since this study was published (e.g., the ACK-of-FIN TCP software bug), it is not clear what we would find if the study were repeated. It would also be difficult to repeat this study today because of privacy concerns.
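The two properties of the 16-bit one’s complement checksum mentioned above (byte-swapped results across endianness, and insensitivity to word order) can be checked with a small Python sketch. The function names `internet_checksum` and `swap16` are chosen here for illustration; the input bytes are the worked example from RFC 1071.

```python
def internet_checksum(data: bytes, byteorder: str = "big") -> int:
    """16-bit one's complement checksum over 16-bit words.

    `byteorder` selects how each pair of bytes is read as a word, which
    emulates running the same computation on a big- or little-endian machine.
    """
    if len(data) % 2:
        data = data + b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += int.from_bytes(data[i:i + 2], byteorder)
    while total >> 16:  # fold carries back in (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | (value >> 8)


data = bytes.fromhex("0001f203f4f5f6f7")
big = internet_checksum(data, "big")        # 0x220D
little = internet_checksum(data, "little")  # 0x0D22
print(hex(big), hex(little), little == swap16(big))

# Reordering the 16-bit words leaves the checksum unchanged,
# so this kind of corruption goes undetected.
reordered = bytes.fromhex("f4f50001f6f7f203")
print(internet_checksum(reordered) == big)
```

Both comparisons print `True`: the little-endian result is the byte-swapped big-endian result, and the reordered data yields the same checksum.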
The advent of hardware CRC32C instructions on Intel x86 and ARM CPUs offers the promise of significantly improved error detection (probability of undetected errors proportional to 2^-32 versus 2^-16) at a comparable CPU time to the one’s complement checksum. The goal of this research is to compare the execution time of the following error detection algorithms: CRC32C (using generator polynomial 0x1EDC6F41), Adler checksum, Fletcher checksum, and one’s complement checksum, using both software and special hardware instructions. For CRC32C, the software implementations tested were bit-wise, nibble-wise, byte-wise, slicing-by-4 and slicing-by-8 algorithms. Intel’s CRC32 and PCLMULQDQ instructions and ARM’s CRC32C instruction were also used as part of testing hardware instruction implementations. A comparative study of all these algorithms on an Intel Core i3-2330M shows that the CRC32C hardware instruction implementation is approximately 38% faster than the 16-bit TCP one’s complement checksum at 1500 bytes, and the 16-bit TCP one’s complement checksum is roughly 11% faster than the hardware instruction based CRC32C at 64 bytes. On the ARM Cortex-A53, the hardware CRC32C algorithm is approximately 20% faster than the 16-bit TCP one’s complement checksum at 64 bytes, and the 16-bit TCP one’s complement checksum is roughly 13% faster than the hardware instruction based CRC32C at 1500 bytes. Because the hardware CRC32C instruction is available on most Intel processors and a growing number of ARM processors these days, we argue that it is time to reconsider adding a TCP Option to use hardware CRC32C. The primary impediments to replacing the TCP one’s complement checksum with CRC32C are Network Address Translation (NAT) and TCP checksum offload. NAT requires the recalculation of the TCP checksum in the NAT device because the IPv4 address, and possibly the TCP port number, change when packets move through a NAT device.
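As an illustration of the bit-wise and byte-wise software variants of CRC32C named above, here is a minimal Python sketch. It is not the code benchmarked in the study. It assumes the conventional reflected formulation of CRC32C (initial value and final XOR of 0xFFFFFFFF, with 0x82F63B78 being the bit-reversed form of the generator polynomial 0x1EDC6F41); the function names are chosen for illustration.

```python
POLY_REFLECTED = 0x82F63B78  # bit-reversed form of generator 0x1EDC6F41


def crc32c_bitwise(data: bytes) -> int:
    """Process the input one bit at a time."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def make_table() -> list:
    """Precompute the CRC of every possible byte value (256 entries)."""
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY_REFLECTED if crc & 1 else 0)
        table.append(crc)
    return table


TABLE = make_table()


def crc32c_bytewise(data: bytes) -> int:
    """Process the input one byte at a time using the lookup table."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


# Standard check input for CRC catalogues; both variants must agree.
print(hex(crc32c_bitwise(b"123456789")))   # 0xe3069283
print(hex(crc32c_bytewise(b"123456789")))  # 0xe3069283
```

The byte-wise variant trades a 1 KiB table for eight times fewer loop iterations per byte; the slicing-by-4 and slicing-by-8 variants extend the same idea with more tables.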
These NAT devices are able to compute the new checksum incrementally due to the properties of the one’s complement checksum. The eventual transition to IPv6 will hopefully eliminate the need for NAT. Most Ethernet Network Interface Cards (NICs) support TCP checksum offload, where the TCP checksum is computed in the NIC rather than on the host CPU. There is a risk of undetected errors with this approach since the error detection is no longer end-to-end; nevertheless, it is the default configuration in many operating systems including Windows 10 [3] and macOS. CRC32C has been implemented in some NICs to support the iSCSI protocol, so it is possible that TCP CRC32C offload could be supported in the future. In the near term, our proposal is to include a TCP Option for CRC32C in addition to the one’s complement checksum for higher reliability.
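The incremental recomputation that NAT devices rely on can be sketched in Python as follows. The sketch uses the update rule from RFC 1624, HC' = ~(~HC + ~m + m'), where HC is the old checksum, m the old 16-bit word and m' the new one. The function names and the example word values are hypothetical, and a real NAT device would apply the update once per changed 16-bit word of the address and port.

```python
def fold(total: int) -> int:
    """Fold carries back into the low 16 bits (end-around carry)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum(words) -> int:
    """Full one's complement checksum over a sequence of 16-bit words."""
    return ~fold(sum(words)) & 0xFFFF


def incremental_update(old_checksum: int, old_word: int, new_word: int) -> int:
    """Update a checksum after one 16-bit word changes: HC' = ~(~HC + ~m + m')."""
    total = (~old_checksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    return ~fold(total) & 0xFFFF


# Hypothetical segment contents as 16-bit words.
words = [0xC0A8, 0x0001, 0x1F90, 0x1234]
old = checksum(words)

# The first word is rewritten (for example, half of a translated address).
rewritten = [0x0A00] + words[1:]
updated = incremental_update(old, 0xC0A8, 0x0A00)
print(hex(updated), updated == checksum(rewritten))
```

The update touches only the old checksum and the changed word, so its cost does not depend on the segment length. A CRC does not share this simple additive structure, which is why NAT is listed as an impediment.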

    Resilience for large ensemble computations

    With the increasing power of supercomputers, ever more detailed models of physical systems can be simulated, and ever larger problem sizes can be considered for any kind of numerical system. During the last twenty years the performance of the fastest clusters went from the teraFLOPS domain (ASCI RED: 2.3 teraFLOPS) to the pre-exaFLOPS domain (Fugaku: 442 petaFLOPS), and we will soon have the first supercomputer with a peak performance above one exaFLOPS (El Capitan: 1.5 exaFLOPS). Ensemble techniques are experiencing a renaissance with the availability of those extreme scales. Recent techniques in particular, such as particle filters, will benefit from it. Current ensemble methods in climate science, such as ensemble Kalman filters, exhibit a linear dependency between the problem size and the ensemble size, while particle filters show an exponential dependency. Nevertheless, with the prospect of massive computing power come challenges such as power consumption and fault tolerance. The mean time between failures shrinks with the number of components in the system, and failures are expected every few hours at exascale. In this thesis, we explore and develop techniques to protect large ensemble computations from failures. We present novel approaches in differential checkpointing, elastic recovery, fully asynchronous checkpointing, and checkpoint compression. Furthermore, we design and implement a fault-tolerant particle filter with pre-emptive particle prefetching and caching. Finally, we design and implement a framework for the automatic validation and application of lossy compression in ensemble data assimilation. Altogether, we present five contributions in this thesis, where the first two improve state-of-the-art checkpointing techniques, and the last three address the resilience of ensemble computations. The contributions represent stand-alone fault-tolerance techniques; however, they can also be used to improve the properties of each other.
For instance, we utilize elastic recovery (2nd contribution) to improve resiliency in an online ensemble data assimilation framework (3rd contribution), and we built our validation framework (5th contribution) on top of our particle filter implementation (4th contribution). We further demonstrate that our contributions improve resilience and performance with experiments on various architectures such as Intel, IBM, and ARM processors. Postprint (published version)