221 research outputs found

    Resilience for large ensemble computations

    With the increasing power of supercomputers, ever more detailed models of physical systems can be simulated, and ever larger problem sizes can be considered for any kind of numerical system. During the last twenty years, the performance of the fastest clusters went from the teraFLOPS domain (ASCI Red: 2.3 teraFLOPS) to the pre-exaFLOPS domain (Fugaku: 442 petaFLOPS), and we will soon have the first supercomputer with a peak performance exceeding one exaFLOPS (El Capitan: 1.5 exaFLOPS). Ensemble techniques are experiencing a renaissance with the availability of these extreme scales; recent techniques, such as particle filters, will benefit from them in particular. Current ensemble methods in climate science, such as ensemble Kalman filters, exhibit a linear dependency between the problem size and the ensemble size, while particle filters show an exponential dependency. Nevertheless, with the prospect of massive computing power come challenges such as power consumption and fault tolerance. The mean time between failures shrinks with the number of components in the system, and failures are expected every few hours at exascale. In this thesis, we explore and develop techniques to protect large ensemble computations from failures. We present novel approaches in differential checkpointing, elastic recovery, fully asynchronous checkpointing, and checkpoint compression. Furthermore, we design and implement a fault-tolerant particle filter with pre-emptive particle prefetching and caching. Finally, we design and implement a framework for the automatic validation and application of lossy compression in ensemble data assimilation. Altogether, we present five contributions in this thesis, where the first two improve state-of-the-art checkpointing techniques and the last three address the resilience of ensemble computations. The contributions represent stand-alone fault-tolerance techniques; however, they can also be used to improve the properties of each other.
For instance, we utilize elastic recovery (2nd contribution) to improve resilience in an online ensemble data assimilation framework (3rd contribution), and we build our validation framework (5th contribution) on top of our particle filter implementation (4th contribution). We further demonstrate that our contributions improve resilience and performance with experiments on various architectures, such as Intel, IBM, and ARM processors.
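The abstract's claim that the mean time between failures shrinks with the number of components follows from assuming independent node failures: the machine as a whole fails roughly num_nodes times as often as a single node. A minimal sketch with illustrative numbers (not figures from the thesis):

```python
def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """Expected system MTBF assuming independent, identically failing
    nodes: the machine fails num_nodes times as often as one node."""
    return node_mtbf_hours / num_nodes

# A node that fails once every five years, in a 100,000-node machine:
node_mtbf = 5 * 365 * 24               # ~43,800 hours per node
machine_mtbf = system_mtbf(node_mtbf, 100_000)
assert machine_mtbf < 1.0              # well under one hour between failures
```

This back-of-the-envelope scaling is why the thesis treats failures every few hours as the expected operating regime at exascale.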

    Application-level differential checkpointing for HPC applications with dynamic datasets

    High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and the memory in supercomputers keep increasing, the size of checkpoint data also increases dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize the checkpointing overhead by writing only data differences. This is typically implemented at the memory-page level, sometimes complemented with hashing algorithms. However, such a technique is unable to cope with dynamic-size datasets. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We identify dirty data blocks using hash algorithms. In order to evaluate the dCP performance, we port the HPC applications xPic, LULESH 2.0, and Heat2D and analyze their potential for reducing I/O with dCP and how this data reduction influences the checkpoint performance. In our experiments, we achieve reductions of up to 62% of the checkpoint time. This project has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) and the Horizon 2020 (H2020) funding framework under grant agreement no. H2020-FETHPC-754304 (DEEP-EST), and from the European Union's Horizon 2020 research and innovation programme under the LEGaTO Project (legato-project.eu), grant agreement no. 780681.
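The hash-based dirty-block detection described above can be sketched as follows. This is a generic illustration, not the paper's actual file format; the block size and the choice of MD5 are assumptions, and a dataset that grows simply contributes new blocks with no previous hash:

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative granularity, not the library's real value

def dirty_blocks(data: bytes, prev_hashes: dict) -> tuple[dict, dict]:
    """Split a dataset into fixed-size blocks and return (new_hashes,
    blocks_to_write): only blocks whose digest changed since the last
    checkpoint need to be written."""
    new_hashes, to_write = {}, {}
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.md5(block).digest()
        new_hashes[offset] = digest
        if prev_hashes.get(offset) != digest:
            to_write[offset] = block
    return new_hashes, to_write

# First checkpoint writes all four blocks; after dirtying one block,
# only that block is re-written.
state = bytearray(4 * BLOCK_SIZE)
hashes, out = dirty_blocks(bytes(state), {})
assert len(out) == 4
state[0] = 1                           # touch the first block only
hashes, out = dirty_blocks(bytes(state), hashes)
assert list(out) == [0]
```

Writing per-block fragments (plus their offsets) rather than whole pages is what lets the checkpoint file follow a dataset whose size changes between checkpoints.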

    Towards zero-waste recovery and zero-overhead checkpointing in ensemble data assimilation

    Ensemble data assimilation is a powerful tool for increasing the accuracy of climatological states. It is based on combining observations with the results from numerical model simulations. The method comprises two steps: (1) the propagation, where the ensemble states are advanced by the numerical model, and (2) the analysis, where the model states are corrected with observations. One bottleneck in ensemble data assimilation is circulating the ensemble states between the two steps. Often, the states are circulated using files. This article presents an extended implementation of Melissa-DA, an in-memory ensemble data assimilation framework, allowing zero-overhead checkpointing and recovery with little or no recomputation. We hide the checkpoint creation using dedicated threads and MPI processes. We benchmark our implementation with up to 512 members simulating the Lorenz96 model using 10⁹ grid points. We utilize up to 8K processes and 8 TB of checkpoint data per cycle and reach a peak performance of 52 teraFLOPS. Part of the research presented here has received funding from the Horizon 2020 (H2020) funding framework under grant/award number 824158, Energy oriented Centre of Excellence II (EoCoE-II).
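Hiding checkpoint creation behind dedicated threads, as the abstract describes, amounts to handing a snapshot to a writer thread and letting the solver continue immediately. A minimal single-process sketch of that pattern (file names, pickle serialization, and the queue-based hand-off are illustrative assumptions, not the Melissa-DA implementation):

```python
import os
import pickle
import queue
import tempfile
import threading

class AsyncCheckpointer:
    """Dedicated writer thread: checkpoint() enqueues a snapshot copy
    and returns at once; the thread performs the actual I/O."""
    def __init__(self, directory: str):
        self.directory = directory
        self.q = queue.Queue()
        self.thread = threading.Thread(target=self._writer, daemon=True)
        self.thread.start()

    def _writer(self):
        while True:
            item = self.q.get()
            if item is None:                   # shutdown sentinel
                break
            step, snapshot = item
            path = os.path.join(self.directory, f"ckpt_{step}.pkl")
            with open(path, "wb") as f:
                pickle.dump(snapshot, f)

    def checkpoint(self, step, state):
        self.q.put((step, list(state)))        # copy: solver may keep mutating

    def close(self):
        self.q.put(None)
        self.thread.join()

with tempfile.TemporaryDirectory() as d:
    ckpt = AsyncCheckpointer(d)
    state = [0.0] * 1000
    for step in range(3):
        state[0] += 1.0                        # stand-in "propagation" step
        ckpt.checkpoint(step, state)           # returns immediately
    ckpt.close()
    assert sorted(os.listdir(d)) == ["ckpt_0.pkl", "ckpt_1.pkl", "ckpt_2.pkl"]
```

The essential design choice is copying the state at enqueue time, so the overhead visible to the solver is one memory copy rather than the full I/O latency.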

    Checkpoint restart support for heterogeneous HPC applications

    As we approach the era of exascale computing, fault tolerance is of growing importance. The increasing number of cores, as well as the increased complexity of modern heterogeneous systems, results in a substantial decrease of the expected mean time between failures. Among the different fault-tolerance techniques, checkpoint/restart is widely adopted in supercomputing systems. Although many supercomputers in the TOP500 list use GPUs, only a few checkpoint/restart mechanisms support GPUs. In this paper, we extend an application-level checkpoint library, called the Fault Tolerance Interface (FTI), to support multi-node/multi-GPU checkpoints. In contrast to previous work, our library includes a memory manager which, upon a checkpoint invocation, tracks the actual location of the data to be stored and handles the data accordingly. We analyze the overhead of the checkpoint/restart procedure and present a series of optimization steps to massively decrease the checkpoint and recovery time of our implementation. To further reduce the checkpoint time, we present a differential checkpoint approach which writes only the updated data to the checkpoint file. Our approach is evaluated and, in the best-case scenario, the execution time of a normal checkpoint is reduced by 15x in contrast with a non-optimized version; in the case of differential checkpointing, the overhead can drop to 2.6% when checkpointing every 30 s. The research leading to these results has received funding from the European Union's Horizon 2020 Programme under the LEGaTO Project (www.legato-project.eu), grant agreement #780681.
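The memory-manager idea above, tracking where each protected buffer lives and staging device-resident data to the host before writing, can be sketched in a few lines. Everything here is a hypothetical illustration: `fetch_from_device` stands in for a real GPU copy (e.g. a cudaMemcpy), and the registry and serialization are not FTI's actual API:

```python
import io
import pickle

class CkptMemoryManager:
    """Track the location of each protected buffer; at checkpoint time,
    device-resident buffers are staged to the host before being written."""
    def __init__(self):
        self.registry = {}                       # var_id -> (location, data)

    def protect(self, var_id, data, location="host"):
        self.registry[var_id] = (location, data)

    def checkpoint(self, fetch_from_device) -> bytes:
        staged = {}
        for var_id, (loc, data) in self.registry.items():
            # Only device buffers need an explicit host copy first.
            staged[var_id] = fetch_from_device(data) if loc == "device" else data
        out = io.BytesIO()
        pickle.dump(staged, out)
        return out.getvalue()

mgr = CkptMemoryManager()
mgr.protect("host_field", [1, 2, 3])
mgr.protect("gpu_field", "device_handle", location="device")
blob = mgr.checkpoint(fetch_from_device=lambda handle: [9, 9])  # fake copy
assert pickle.loads(blob) == {"host_field": [1, 2, 3], "gpu_field": [9, 9]}
```

The point of routing every buffer through one registry is that the application calls a single checkpoint function and never needs to know which data currently resides on the GPU.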

    Impacts of the Tropical Pacific/Indian Oceans on the Seasonal Cycle of the West African Monsoon

    The current consensus is that drought has developed in the Sahel during the second half of the twentieth century as a result of remote effects of oceanic anomalies amplified by local land–atmosphere interactions. This paper focuses on the impacts of oceanic anomalies upon West African climate and specifically aims to identify those from SST anomalies in the Pacific/Indian Oceans during the spring and summer seasons, when they were significant. Idealized sensitivity experiments are performed with four atmospheric general circulation models (AGCMs). The prescribed SST patterns used in the AGCMs are based on the leading mode of covariability between SST anomalies over the Pacific/Indian Oceans and summer rainfall over West Africa. The results show that such oceanic anomalies in the Pacific/Indian Ocean lead to a northward shift of an anomalous dry belt from the Gulf of Guinea to the Sahel as the season advances. In the Sahel, the magnitude of rainfall anomalies is comparable to that obtained by other authors using SST anomalies confined to the proximity of the Atlantic Ocean. The mechanism connecting the Pacific/Indian SST anomalies with West African rainfall has a strong seasonal cycle. In spring (May and June), anomalous subsidence develops over both the Maritime Continent and the equatorial Atlantic in response to the enhanced equatorial heating. Precipitation increases over continental West Africa in association with stronger zonal convergence of moisture. In addition, precipitation decreases over the Gulf of Guinea. During the monsoon peak (July and August), the SST anomalies move westward over the equatorial Pacific, and the two regions where subsidence occurred earlier in the season merge over West Africa. The monsoon weakens and rainfall decreases over the Sahel, especially in August.

    Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019

    Background: In an era of shifting global agendas and expanded emphasis on non-communicable diseases and injuries along with communicable diseases, sound evidence on trends by cause at the national level is essential. The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) provides a systematic scientific assessment of published, publicly available, and contributed data on incidence, prevalence, and mortality for a mutually exclusive and collectively exhaustive list of diseases and injuries. Methods: GBD estimates incidence, prevalence, mortality, years of life lost (YLLs), years lived with disability (YLDs), and disability-adjusted life-years (DALYs) due to 369 diseases and injuries, for two sexes, and for 204 countries and territories. Input data were extracted from censuses, household surveys, civil registration and vital statistics, disease registries, health service use, air pollution monitors, satellite imaging, disease notifications, and other sources. Cause-specific death rates and cause fractions were calculated using the Cause of Death Ensemble model and spatiotemporal Gaussian process regression. Cause-specific deaths were adjusted to match the total all-cause deaths calculated as part of the GBD population, fertility, and mortality estimates. Deaths were multiplied by standard life expectancy at each age to calculate YLLs. A Bayesian meta-regression modelling tool, DisMod-MR 2.1, was used to ensure consistency between incidence, prevalence, remission, excess mortality, and cause-specific mortality for most causes. Prevalence estimates were multiplied by disability weights for mutually exclusive sequelae of diseases and injuries to calculate YLDs. We considered results in the context of the Socio-demographic Index (SDI), a composite indicator of income per capita, years of schooling, and fertility rate in females younger than 25 years. 
Uncertainty intervals (UIs) were generated for every metric using the 25th and 975th ordered 1000 draw values of the posterior distribution. Findings: Global health has steadily improved over the past 30 years as measured by age-standardised DALY rates. After taking into account population growth and ageing, the absolute number of DALYs has remained stable. Since 2010, the pace of decline in global age-standardised DALY rates has accelerated in age groups younger than 50 years compared with the 1990–2010 time period, with the greatest annualised rate of decline occurring in the 0–9-year age group. Six infectious diseases were among the top ten causes of DALYs in children younger than 10 years in 2019: lower respiratory infections (ranked second), diarrhoeal diseases (third), malaria (fifth), meningitis (sixth), whooping cough (ninth), and sexually transmitted infections (which, in this age group, is fully accounted for by congenital syphilis; ranked tenth). In adolescents aged 10–24 years, three injury causes were among the top causes of DALYs: road injuries (ranked first), self-harm (third), and interpersonal violence (fifth). Five of the causes that were in the top ten for ages 10–24 years were also in the top ten in the 25–49-year age group: road injuries (ranked first), HIV/AIDS (second), low back pain (fourth), headache disorders (fifth), and depressive disorders (sixth). In 2019, ischaemic heart disease and stroke were the top-ranked causes of DALYs in both the 50–74-year and 75-years-and-older age groups. Since 1990, there has been a marked shift towards a greater proportion of burden due to YLDs from non-communicable diseases and injuries. In 2019, there were 11 countries where non-communicable disease and injury YLDs constituted more than half of all disease burden. 
Decreases in age-standardised DALY rates have accelerated over the past decade in countries at the lower end of the SDI range, while improvements have started to stagnate or even reverse in countries with higher SDI. Interpretation: As disability becomes an increasingly large component of disease burden and a larger component of health expenditure, greater research and development investment is needed to identify new, more effective intervention strategies. With a rapidly ageing global population, the demands on health services to deal with disabling outcomes, which increase with age, will require policy makers to anticipate these changes. The mix of universal and more geographically specific influences on health reinforces the need for regular reporting on population health in detail and by underlying cause to help decision makers to identify success stories of disease control to emulate, as well as opportunities to improve. Funding: Bill & Melinda Gates Foundation. © 2020 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.
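The burden arithmetic in the Methods reduces to two multiplications and a sum: YLLs are deaths times standard life expectancy at the age of death, YLDs are prevalence times the disability weight, and DALYs are their sum. A toy illustration with made-up numbers (not GBD estimates):

```python
def ylls(deaths: float, std_life_expectancy: float) -> float:
    """Years of life lost: deaths x standard life expectancy at death."""
    return deaths * std_life_expectancy

def ylds(prevalence: float, disability_weight: float) -> float:
    """Years lived with disability: prevalence x disability weight."""
    return prevalence * disability_weight

def dalys(deaths, std_life_expectancy, prevalence, disability_weight):
    """Disability-adjusted life-years: YLLs + YLDs."""
    return (ylls(deaths, std_life_expectancy)
            + ylds(prevalence, disability_weight))

# One hypothetical cause in one age group: 1,000 deaths at an age with
# 30 remaining standard life-expectancy years, 50,000 prevalent cases
# with disability weight 0.2.
burden = dalys(deaths=1_000, std_life_expectancy=30.0,
               prevalence=50_000, disability_weight=0.2)
assert burden == 40_000.0   # 30,000 YLLs + 10,000 YLDs
```

In the study itself these quantities are computed per draw of the posterior distribution, which is how the uncertainty intervals are obtained.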

    Global age-sex-specific fertility, mortality, healthy life expectancy (HALE), and population estimates in 204 countries and territories, 1950-2019 : a comprehensive demographic analysis for the Global Burden of Disease Study 2019

    Background: Accurate and up-to-date assessment of demographic metrics is crucial for understanding a wide range of social, economic, and public health issues that affect populations worldwide. The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2019 produced updated and comprehensive demographic assessments of the key indicators of fertility, mortality, migration, and population for 204 countries and territories and selected subnational locations from 1950 to 2019. Methods: 8078 country-years of vital registration and sample registration data, 938 surveys, 349 censuses, and 238 other sources were identified and used to estimate age-specific fertility. Spatiotemporal Gaussian process regression (ST-GPR) was used to generate age-specific fertility rates for 5-year age groups between ages 15 and 49 years. With extensions to age groups 10–14 and 50–54 years, the total fertility rate (TFR) was then aggregated using the estimated age-specific fertility between ages 10 and 54 years. 7417 sources were used for under-5 mortality estimation and 7355 for adult mortality. ST-GPR was used to synthesise data sources after correction for known biases. Adult mortality was measured as the probability of death between ages 15 and 60 years based on vital registration, sample registration, and sibling histories, and was also estimated using ST-GPR. HIV-free life tables were then estimated using estimates of under-5 and adult mortality rates using a relational model life table system created for GBD, which closely tracks observed age-specific mortality rates from complete vital registration when available. Independent estimates of HIV-specific mortality generated by an epidemiological analysis of HIV prevalence surveys and antenatal clinic serosurveillance and other sources were incorporated into the estimates in countries with large epidemics. 
Annual and single-year age estimates of net migration and population for each country and territory were generated using a Bayesian hierarchical cohort component model that analysed estimated age-specific fertility and mortality rates along with 1250 censuses and 747 population registry years. We classified location-years into seven categories on the basis of the natural rate of increase in population (calculated by subtracting the crude death rate from the crude birth rate) and the net migration rate. We computed healthy life expectancy (HALE) using years lived with disability (YLDs) per capita, life tables, and standard demographic methods. Uncertainty was propagated throughout the demographic estimation process, including fertility, mortality, and population, with 1000 draw-level estimates produced for each metric. Findings: The global TFR decreased from 2·72 (95% uncertainty interval [UI] 2·66–2·79) in 2000 to 2·31 (2·17–2·46) in 2019. Global annual livebirths increased from 134·5 million (131·5–137·8) in 2000 to a peak of 139·6 million (133·0–146·9) in 2016. Global livebirths then declined to 135·3 million (127·2–144·1) in 2019. Of the 204 countries and territories included in this study, in 2019, 102 had a TFR lower than 2·1, which is considered a good approximation of replacement-level fertility. All countries in sub-Saharan Africa had TFRs above replacement level in 2019 and accounted for 27·1% (95% UI 26·4–27·8) of global livebirths. Global life expectancy at birth increased from 67·2 years (95% UI 66·8–67·6) in 2000 to 73·5 years (72·8–74·3) in 2019. The total number of deaths increased from 50·7 million (49·5–51·9) in 2000 to 56·5 million (53·7–59·2) in 2019. Under-5 deaths declined from 9·6 million (9·1–10·3) in 2000 to 5·0 million (4·3–6·0) in 2019. Global population increased by 25·7%, from 6·2 billion (6·0–6·3) in 2000 to 7·7 billion (7·5–8·0) in 2019. 
In 2019, 34 countries had negative natural rates of increase; in 17 of these, the population declined because immigration was not sufficient to offset the negative natural rate of increase. Globally, HALE increased from 58·6 years (56·1–60·8) in 2000 to 63·5 years (60·8–66·1) in 2019. HALE increased in 202 of 204 countries and territories between 2000 and 2019.
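Two of the quantities defined in this abstract have simple closed forms: the TFR aggregates age-specific fertility rates over consecutive 5-year age groups (each rate applies for the 5 years a woman spends in that group), and the natural rate of increase is the crude birth rate minus the crude death rate. A sketch with illustrative rates (not estimates from the study):

```python
def total_fertility_rate(asfr_5yr: list[float]) -> float:
    """Expected births per woman, from age-specific fertility rates
    (births per woman per year) for consecutive 5-year age groups."""
    return 5 * sum(asfr_5yr)

def natural_rate_of_increase(crude_birth_rate: float,
                             crude_death_rate: float) -> float:
    """Natural population growth (per 1000): CBR minus CDR,
    migration excluded."""
    return crude_birth_rate - crude_death_rate

# Illustrative rates for the nine age groups 10-14, 15-19, ..., 50-54:
asfr = [0.001, 0.04, 0.10, 0.11, 0.08, 0.04, 0.01, 0.001, 0.0]
assert abs(total_fertility_rate(asfr) - 1.91) < 1e-9   # below replacement
assert natural_rate_of_increase(12.0, 14.0) == -2.0     # natural decline
```

A country with a negative natural rate of increase, like the second example, can still grow if net immigration exceeds that deficit, which is the distinction the abstract draws for the 34 countries.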

    Vapor phase preparation and characterization of the carbon micro-coils


    Study of W boson production in pPb collisions at √s_NN = 5.02 TeV

    The first study of W boson production in pPb collisions is presented, for bosons decaying to a muon or electron, and a neutrino. The measurements are based on a data sample corresponding to an integrated luminosity of 34.6 nb⁻¹ at a nucleon–nucleon centre-of-mass energy of √s_NN = 5.02 TeV, collected by the CMS experiment. The W boson differential cross sections, lepton charge asymmetry, and forward–backward asymmetries are measured for leptons of transverse momentum exceeding 25 GeV/c, and as a function of the lepton pseudorapidity in the range |η_lab| < 2.4. Deviations from the expectations based on currently available parton distribution functions are observed, showing the need for including W boson data in nuclear parton distribution global fits.
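The lepton charge asymmetry mentioned above is conventionally the difference over the sum of positive- and negative-lepton yields in each pseudorapidity bin. A toy illustration (the yields are made up, not CMS data):

```python
def charge_asymmetry(n_plus: int, n_minus: int) -> float:
    """(N(l+) - N(l-)) / (N(l+) + N(l-)) in one pseudorapidity bin."""
    return (n_plus - n_minus) / (n_plus + n_minus)

# Hypothetical yields in one |eta_lab| bin:
asym = charge_asymmetry(1200, 1000)
assert abs(asym - 200 / 2200) < 1e-12   # ~0.09, sensitive to u/d PDFs
```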