Fault Tolerant Computation of Hyperbolic Partial Differential Equations with the Sparse Grid Combination Technique
As the computing power of supercomputers continues to increase exponentially, the mean time between failures (MTBF) is
decreasing. Checkpoint-restart has historically been the method
of choice for recovering from failures. However, such methods
become increasingly inefficient as the time required to complete
a checkpoint-restart cycle approaches the MTBF. There is
therefore a need to explore different ways of making computations
fault tolerant. This thesis studies generalisations of the sparse
grid combination technique with the goal of developing and
analysing a holistic approach to the fault tolerant computation
of partial differential equations (PDEs). Sparse grids allow one
to reduce the computational complexity of high-dimensional
problems with only small loss of accuracy. A drawback is the need
to perform computations with a hierarchical basis rather than a
traditional nodal basis. We survey classical error estimates for
sparse grid interpolation and extend results to functions which
are non-zero on the boundary. The combination technique
approximates sparse grid solutions via a sum of many coarse
approximations which need not be computed with a hierarchical
basis. Study of the combination technique often assumes that
approximations satisfy an error splitting formula. We adapt
classical error splitting results to our slightly different
convention of combination level.
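To make this concrete, here is a minimal Python sketch of the classical two-dimensional combination technique applied to bilinear interpolation: coarse anisotropic grids along the diagonals |l|_1 = n and |l|_1 = n - 1 are combined with coefficients +1 and -1. The test function, levels, and evaluation point are illustrative choices, not taken from the thesis.

```python
import numpy as np

def coarse_interpolant(f, level_x, level_y):
    """Sample f on a (2^lx + 1) x (2^ly + 1) grid on [0,1]^2 and
    return a bilinear (nodal-basis) interpolant for that grid."""
    nx, ny = 2**level_x + 1, 2**level_y + 1
    xs = np.linspace(0.0, 1.0, nx)
    ys = np.linspace(0.0, 1.0, ny)
    vals = f(xs[:, None], ys[None, :])

    def interp(x, y):
        i = min(int(x * (nx - 1)), nx - 2)
        j = min(int(y * (ny - 1)), ny - 2)
        tx = x * (nx - 1) - i
        ty = y * (ny - 1) - j
        return ((1 - tx) * (1 - ty) * vals[i, j]
                + tx * (1 - ty) * vals[i + 1, j]
                + (1 - tx) * ty * vals[i, j + 1]
                + tx * ty * vals[i + 1, j + 1])
    return interp

def combination_technique(f, n, x, y):
    """Classical 2D combination: sum_{i+j=n} u_{i,j} - sum_{i+j=n-1} u_{i,j}."""
    total = 0.0
    for i in range(n + 1):      # coefficient +1 on the diagonal i + j = n
        total += coarse_interpolant(f, i, n - i)(x, y)
    for i in range(n):          # coefficient -1 on the diagonal i + j = n - 1
        total -= coarse_interpolant(f, i, n - 1 - i)(x, y)
    return total

f = lambda x, y: np.sin(np.pi * x) * np.sin(np.pi * y)
print(combination_technique(f, 6, 0.3, 0.7), f(0.3, 0.7))
```

Each coarse approximation here is an ordinary nodal-basis interpolant; no hierarchical basis is needed, which is exactly the appeal of the combination technique noted above.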
Literature on the application of the combination technique to
hyperbolic PDEs is scarce, particularly when solved with explicit
finite difference methods. We show a particular family of finite
difference discretisations for the advection equation solved via
the method of lines has solutions which satisfy an error
splitting formula. As a consequence, classical error splitting
based estimates are readily applied to finite difference
solutions of many hyperbolic PDEs. Our analysis also reveals how
repeated combinations throughout the computation lead to a
reduction in approximation error.
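As an illustration of the kind of discretisation discussed here, the sketch below applies a first-order upwind finite difference scheme to the one-dimensional advection equation u_t + a u_x = 0, semi-discretised in space (method of lines) and advanced with forward Euler. The scheme and all parameters are illustrative choices, not the specific family analysed in the thesis.

```python
import numpy as np

def advect_upwind(u0, a=1.0, T=1.0, level=6, cfl=0.8):
    """Solve u_t + a u_x = 0 on [0,1) with periodic boundaries using
    first-order upwind differences in space (method of lines) and
    forward Euler in time.  `level` sets the grid: 2^level cells."""
    n = 2**level
    dx = 1.0 / n
    x = np.arange(n) * dx
    u = u0(x).astype(float)
    dt = cfl * dx / abs(a)
    steps = int(np.ceil(T / dt))
    dt = T / steps                      # land exactly on t = T
    for _ in range(steps):
        # semi-discrete RHS: -a * (u_i - u_{i-1}) / dx  (upwind for a > 0)
        u = u - a * dt / dx * (u - np.roll(u, 1))
    return x, u

# after one full period the exact solution returns to the initial data
x, u = advect_upwind(lambda x: np.sin(2 * np.pi * x), level=8)
print(np.max(np.abs(u - np.sin(2 * np.pi * x))))
```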
Generalisations of the combination technique are studied and
developed in depth. The truncated combination technique is a
modification of the classical method used in practical
applications, and we provide analogues of classical error
estimates. Adaptive sparse grids are then studied via a lattice
framework. A detailed examination reveals many results regarding
combination coefficients and extensions of classical error
estimates. The framework is also applied to the study of
extrapolation formulae. These extensions of the combination
technique provide the foundations for the development of the
general coefficient problem. Solutions to this problem allow one
to combine any collection of coarse approximations on nested
grids. Lastly, we show how the combination technique is made
fault tolerant via application of the general coefficient
problem. Rather than recompute coarse solutions which fail, we
instead find new coefficients to combine remaining solutions.
This significantly reduces computational overheads in the
presence of faults with only small loss of accuracy. The latter
is established with a careful study of the expected error for
some select cases. We perform numerical experiments by computing
combination solutions of the scalar advection equation in a
parallel environment with simulated faults. The results support
the preceding analysis and show that the overheads are indeed
small and a significant improvement over traditional
checkpoint-restart methods.
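The thesis develops the general coefficient problem in full; as a simplified sketch of the recombination idea, the code below computes combination coefficients for a downward-closed set of level multi-indices via the standard inclusion-exclusion formula c_l = sum_{z in {0,1}^d, l+z in I} (-1)^{|z|}, and after a simulated fault restricts to the largest surviving downward-closed set and recombines rather than recomputing the lost solution. This special case is standard for adaptive sparse grids and is only a stand-in for the thesis's more general method.

```python
from itertools import product

def combination_coefficients(index_set):
    """Coefficients for a downward-closed set I of level multi-indices,
    via inclusion-exclusion: c_l = sum_{z in {0,1}^d, l+z in I} (-1)^|z|."""
    I = set(index_set)
    d = len(next(iter(I)))
    coeffs = {}
    for l in I:
        c = sum((-1) ** sum(z)
                for z in product((0, 1), repeat=d)
                if tuple(li + zi for li, zi in zip(l, z)) in I)
        if c != 0:
            coeffs[l] = c
    return coeffs

def drop_failed(index_set, failed):
    """Largest downward-closed subset avoiding the failed indices."""
    I = set(index_set) - set(failed)
    changed = True
    while changed:
        changed = False
        for l in list(I):
            # keep l only if every immediate predecessor is still present
            for k, lk in enumerate(l):
                if lk > 0 and l[:k] + (lk - 1,) + l[k + 1:] not in I:
                    I.discard(l)
                    changed = True
                    break
    return I

# classical 2D level-3 combination, then one coarse solution lost to a fault
full = {(i, j) for i in range(4) for j in range(4) if i + j <= 3}
print(combination_coefficients(full))       # +1 / -1 on the top two diagonals
survivors = drop_failed(full, failed={(1, 2)})
print(combination_coefficients(survivors))  # new coefficients, nothing recomputed
```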
Resiliency in numerical algorithm design for extreme scale simulations
This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, which was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of an enormous amount of resources. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation?

Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar.

The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.

Peer reviewed. Article signed by 36 authors: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Göddeke, Marco Heisig, Fabienne Jézéquel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortí, Francesco Rizzi, Ulrich Rüde, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thönnes, Andreas Wagner and Barbara Wohlmuth.
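As a quick check of the back-of-envelope figures above, and to make the checkpoint trade-off concrete, the sketch below reproduces the arithmetic and evaluates the classical Young/Daly first-order optimal checkpoint interval sqrt(2 * C * MTBF). The Young/Daly formula is a standard result rather than something from this article, and the MTBF and checkpoint-cost values are made up for illustration.

```python
import math

# Back-of-envelope check of the numbers quoted in the abstract.
power_mw   = 20      # system power draw
hours      = 48      # wall-clock time of the run
flops_rate = 1e18    # exascale: 10^18 floating-point ops per second

energy_kwh = power_mw * 1e3 * hours
total_ops  = flops_rate * hours * 3600
print(f"{energy_kwh:.2e} kWh, {total_ops:.2e} flop")  # ~1e6 kWh, ~1.7e23 flop

# Young/Daly optimal checkpoint interval: sqrt(2 * C * MTBF),
# where C is the time to write one checkpoint.
mtbf_s, checkpoint_s = 3600.0, 600.0   # illustrative values, not from the text
tau = math.sqrt(2 * checkpoint_s * mtbf_s)
print(f"optimal interval ~ {tau / 60:.0f} min")       # ~35 min
```

With a checkpoint cost this close to the MTBF, the run spends a large fraction of its time writing and replaying state, which is the scaling problem the abstract describes.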
Magic-State Functional Units: Mapping and Scheduling Multi-Level Distillation Circuits for Fault-Tolerant Quantum Architectures
Quantum computers have recently made great strides and are on a long-term
path towards useful fault-tolerant computation. A dominant overhead in
fault-tolerant quantum computation is the production of high-fidelity encoded
qubits, called magic states, which enable reliable error-corrected computation.
We present the first detailed designs of hardware functional units that
implement space-time optimized magic-state factories for surface code
error-corrected machines. Interactions among distant qubits require surface
code braids (physical pathways on chip) which must be routed. Magic-state
factories are circuits composed of a complex set of braids that is more difficult to route than the quantum circuits considered in previous work [1]. This paper explores the impact of scheduling techniques such as gate reordering and qubit renaming, and proposes two novel mapping techniques: braid repulsion and dipole-moment braid rotation. We combine these techniques with graph
partitioning and community detection algorithms, and further introduce a
stitching algorithm for mapping subgraphs onto a physical machine. Our results
show a factor of 5.64 reduction in space-time volume compared to the best-known
previous designs for magic-state factories.
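The paper's mapping pipeline is specific to braids, but one of its ingredients, community detection on an interaction graph, is easy to illustrate. The sketch below uses networkx's greedy_modularity_communities to group logical qubits that braid with each other frequently, so that each group could be placed contiguously on the surface-code lattice; the interaction graph is hypothetical, and this is not the paper's actual algorithm (which adds braid repulsion, dipole-moment rotation, and subgraph stitching).

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical qubit-interaction graph: nodes are logical qubits, edge
# weights count how many braids (two-qubit interactions) connect them.
G = nx.Graph()
G.add_weighted_edges_from([
    ("q0", "q1", 5), ("q1", "q2", 4), ("q0", "q2", 3),  # one dense cluster
    ("q3", "q4", 6), ("q4", "q5", 2), ("q3", "q5", 4),  # another cluster
    ("q2", "q3", 1),                                    # weak link between them
])

# Group qubits that braid together often; placing each community in one
# contiguous region of the chip shortens the braids that must be routed.
for i, community in enumerate(greedy_modularity_communities(G, weight="weight")):
    print(f"region {i}: {sorted(community)}")
```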