30,600 research outputs found
Distributed Parallel Extreme Event Analysis in Next Generation Simulation Architectures
Numerical simulations present challenges as they reach exascale because they generate petabyte-scale data that cannot be saved without interrupting the simulation
due to I/O constraints. Data scientists must be able to reduce, extract, and visualize the data while the simulation is running, which is essential for in transit and
post analysis. Next generation architectures in supercomputing include a burst buļ¬er
technology composed of SSDs primarily for the use of checkpointing the simulation
in case a restart is required. In the case of turbulence simulations, this checkpoint
provides an opportunity to perform analysis on the data without interrupting the
simulation.
First, we present a method of extracting velocity data in high vorticity regions.
This method requires calculating the vorticity of the entire dataset and identifying
regions where the threshold is above a speciļ¬ed value. Next we create a 3D stencil
from values above the threshold and dilate the stencil. Finally we use the stencil to
extract velocity data from the original dataset. The result is a dataset that is over
an order of magnitude smaller and contains all the data required to study extreme events and visualization of vorticity.
The next extraction utilizes the zfp lossy compressor to compress the entire velocity dataset. The compressed representation results in a dataset an order of magnitude
smaller than the raw simulation data. This provides the researcher approximate data
not captured by the velocity extraction. The error introduced is bounded, and results in a dataset that is visually indistinguishable from the original dataset.
Finally we present a modular distributed parallel extraction system. This system allows a data scientist to run the previously mentioned extraction algorithms in a
distributed parallel cluster of burst buļ¬er nodes. The extraction algorithms are built
as modules for the system and run in parallel on burst buļ¬er nodes. A feature extraction coordinator synchronizes the simulation with the extraction process. A data
scientist only needs to write one module that performs the extraction or visualization
on a single subset of data and the system will execute that module at scale on burst
buļ¬ers, managing all the communication, synchronization, and parallelism required
to perform the analysis
ASCR/HEP Exascale Requirements Review Report
This draft report summarizes and details the findings, results, and
recommendations derived from the ASCR/HEP Exascale Requirements Review meeting
held in June, 2015. The main conclusions are as follows. 1) Larger, more
capable computing and data facilities are needed to support HEP science goals
in all three frontiers: Energy, Intensity, and Cosmic. The expected scale of
the demand at the 2025 timescale is at least two orders of magnitude -- and in
some cases greater -- than that available currently. 2) The growth rate of data
produced by simulations is overwhelming the current ability, of both facilities
and researchers, to store and analyze it. Additional resources and new
techniques for data analysis are urgently needed. 3) Data rates and volumes
from HEP experimental facilities are also straining the ability to store and
analyze large and complex data volumes. Appropriately configured
leadership-class facilities can play a transformational role in enabling
scientific discovery from these datasets. 4) A close integration of HPC
simulation and data analysis will aid greatly in interpreting results from HEP
experiments. Such an integration will minimize data movement and facilitate
interdependent workflows. 5) Long-range planning between HEP and ASCR will be
required to meet HEP's research needs. To best use ASCR HPC resources the
experimental HEP program needs a) an established long-term plan for access to
ASCR computational and data resources, b) an ability to map workflows onto HPC
resources, c) the ability for ASCR facilities to accommodate workflows run by
collaborations that can have thousands of individual members, d) to transition
codes to the next-generation HPC platforms that will be available at ASCR
facilities, e) to build up and train a workforce capable of developing and
using simulations and analysis to support HEP scientific research on
next-generation systems.Comment: 77 pages, 13 Figures; draft report, subject to further revisio
Parallel and Distributed Simulation from Many Cores to the Public Cloud (Extended Version)
In this tutorial paper, we will firstly review some basic simulation concepts
and then introduce the parallel and distributed simulation techniques in view
of some new challenges of today and tomorrow. More in particular, in the last
years there has been a wide diffusion of many cores architectures and we can
expect this trend to continue. On the other hand, the success of cloud
computing is strongly promoting the everything as a service paradigm. Is
parallel and distributed simulation ready for these new challenges? The current
approaches present many limitations in terms of usability and adaptivity: there
is a strong need for new evaluation metrics and for revising the currently
implemented mechanisms. In the last part of the paper, we propose a new
approach based on multi-agent systems for the simulation of complex systems. It
is possible to implement advanced techniques such as the migration of simulated
entities in order to build mechanisms that are both adaptive and very easy to
use. Adaptive mechanisms are able to significantly reduce the communication
cost in the parallel/distributed architectures, to implement load-balance
techniques and to cope with execution environments that are both variable and
dynamic. Finally, such mechanisms will be used to build simulations on top of
unreliable cloud services.Comment: Tutorial paper published in the Proceedings of the International
Conference on High Performance Computing and Simulation (HPCS 2011). Istanbul
(Turkey), IEEE, July 2011. ISBN 978-1-61284-382-
A Pattern Language for High-Performance Computing Resilience
High-performance computing systems (HPC) provide powerful capabilities for
modeling, simulation, and data analytics for a broad class of computational
problems. They enable extreme performance of the order of quadrillion
floating-point arithmetic calculations per second by aggregating the power of
millions of compute, memory, networking and storage components. With the
rapidly growing scale and complexity of HPC systems for achieving even greater
performance, ensuring their reliable operation in the face of system
degradations and failures is a critical challenge. System fault events often
lead the scientific applications to produce incorrect results, or may even
cause their untimely termination. The sheer number of components in modern
extreme-scale HPC systems and the complex interactions and dependencies among
the hardware and software components, the applications, and the physical
environment makes the design of practical solutions that support fault
resilience a complex undertaking. To manage this complexity, we developed a
methodology for designing HPC resilience solutions using design patterns. We
codified the well-known techniques for handling faults, errors and failures
that have been devised, applied and improved upon over the past three decades
in the form of design patterns. In this paper, we present a pattern language to
enable a structured approach to the development of HPC resilience solutions.
The pattern language reveals the relations among the resilience patterns and
provides the means to explore alternative techniques for handling a specific
fault model that may have different efficiency and complexity characteristics.
Using the pattern language enables the design and implementation of
comprehensive resilience solutions as a set of interconnected resilience
patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of
Program
Engineering failure analysis and design optimisation with HiP-HOPS
The scale and complexity of computer-based safety critical systems, like those used in the transport and manufacturing industries, pose significant challenges for failure analysis. Over the last decade, research has focused on automating this task. In one approach, predictive models of system failure are constructed from the topology of the system and local component failure models using a process of composition. An alternative approach employs model-checking of state automata to study the effects of failure and verify system safety properties. In this paper, we discuss these two approaches to failure analysis. We then focus on Hierarchically Performed Hazard Origin & Propagation Studies (HiP-HOPS) - one of the more advanced compositional approaches - and discuss its capabilities for automatic synthesis of fault trees, combinatorial Failure Modes and Effects Analyses, and reliability versus cost optimisation of systems via application of automatic model transformations. We summarise these contributions and demonstrate the application of HiP-HOPS on a simplified fuel oil system for a ship engine. In light of this example, we discuss strengths and limitations of the method in relation to other state-of-the-art techniques. In particular, because HiP-HOPS is deductive in nature, relating system failures back to their causes, it is less prone to combinatorial explosion and can more readily be iterated. For this reason, it enables exhaustive assessment of combinations of failures and design optimisation using computationally expensive meta-heuristics. (C) 2010 Elsevier Ltd. All rights reserved
- ā¦