242 research outputs found
Spatial support vector regression to detect silent errors in the exascale era
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs) or silent errors are one of the major sources that corrupt the executionresults of HPC applications without being detected. In this work, we explore a low-memory-overhead SDC detector, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications that can be characterized by an impact error bound. The key contributions are three fold. (1) Our design takes spatialfeatures (i.e., neighbouring data values for each data point in a snapshot) into training data, such that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show thatour detector can achieve the detection sensitivity (i.e., recall) up to 99% yet suffer a less than 1% of false positive rate for most cases. Our detector incurs low performance overhead, 5% on average, for all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff considering the detection ability and overheads.This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing
Research Program, under Contract DE-AC02-06CH11357, by FI-DGR 2013 scholarship, by HiPEAC PhD Collaboration
Grant, the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and TIN2015-65316-P.Peer ReviewedPostprint (author's final draft
Toward bio-inspired information processing with networks of nano-scale switching elements
Unconventional computing explores multi-scale platforms connecting
molecular-scale devices into networks for the development of scalable
neuromorphic architectures, often based on new materials and components with
new functionalities. We review some work investigating the functionalities of
locally connected networks of different types of switching elements as
computational substrates. In particular, we discuss reservoir computing with
networks of nonlinear nanoscale components. In usual neuromorphic paradigms,
the network synaptic weights are adjusted as a result of a training/learning
process. In reservoir computing, the non-linear network acts as a dynamical
system mixing and spreading the input signals over a large state space, and
only a readout layer is trained. We illustrate the most important concepts with
a few examples, featuring memristor networks with time-dependent and history
dependent resistances
ASCR/HEP Exascale Requirements Review Report
This draft report summarizes and details the findings, results, and
recommendations derived from the ASCR/HEP Exascale Requirements Review meeting
held in June, 2015. The main conclusions are as follows. 1) Larger, more
capable computing and data facilities are needed to support HEP science goals
in all three frontiers: Energy, Intensity, and Cosmic. The expected scale of
the demand at the 2025 timescale is at least two orders of magnitude -- and in
some cases greater -- than that available currently. 2) The growth rate of data
produced by simulations is overwhelming the current ability, of both facilities
and researchers, to store and analyze it. Additional resources and new
techniques for data analysis are urgently needed. 3) Data rates and volumes
from HEP experimental facilities are also straining the ability to store and
analyze large and complex data volumes. Appropriately configured
leadership-class facilities can play a transformational role in enabling
scientific discovery from these datasets. 4) A close integration of HPC
simulation and data analysis will aid greatly in interpreting results from HEP
experiments. Such an integration will minimize data movement and facilitate
interdependent workflows. 5) Long-range planning between HEP and ASCR will be
required to meet HEP's research needs. To best use ASCR HPC resources the
experimental HEP program needs a) an established long-term plan for access to
ASCR computational and data resources, b) an ability to map workflows onto HPC
resources, c) the ability for ASCR facilities to accommodate workflows run by
collaborations that can have thousands of individual members, d) to transition
codes to the next-generation HPC platforms that will be available at ASCR
facilities, e) to build up and train a workforce capable of developing and
using simulations and analysis to support HEP scientific research on
next-generation systems.Comment: 77 pages, 13 Figures; draft report, subject to further revisio
The OpenMC Monte Carlo particle transport code
A new Monte Carlo code called OpenMC is currently under development at the Massachusetts Institute of Technology as a tool for simulation on high-performance computing platforms. Given that many legacy codes do not scale well on existing and future parallel computer architectures, OpenMC has been developed from scratch with a focus on high performance scalable algorithms as well as modern software design practices. The present work describes the methods used in the OpenMC code and demonstrates the performance and accuracy of the code on a variety of problems.United States. Department of Energy (DE-AC05-00OR22725
Scalability in the Presence of Variability
Supercomputers are used to solve some of the world’s most computationally demanding
problems. Exascale systems, to be comprised of over one million cores and capable of 10^18
floating point operations per second, will probably exist by the early 2020s, and will provide
unprecedented computational power for parallel computing workloads. Unfortunately,
while these machines hold tremendous promise and opportunity for applications in High
Performance Computing (HPC), graph processing, and machine learning, it will be a major
challenge to fully realize their potential, because to do so requires balanced execution across
the entire system and its millions of processing elements. When different processors take different
amounts of time to perform the same amount of work, performance imbalance arises,
large portions of the system sit idle, and time and energy are wasted. Larger systems incorporate
more processors and thus greater opportunity for imbalance to arise, as well as larger
performance/energy penalties when it does. This phenomenon is referred to as performance
variability and is the focus of this dissertation.
In this dissertation, we explain how to design system software to mitigate variability
on large scale parallel machines. Our approaches span (1) the design, implementation, and
evaluation of a new high performance operating system to reduce some classes of performance
variability, (2) a new performance evaluation framework to holistically characterize
key features of variability on new and emerging architectures, and (3) a distributed modeling
framework that derives predictions of how and where imbalance is manifesting in order to
drive reactive operations such as load balancing and speed scaling. Collectively, these efforts
provide a holistic set of tools to promote scalability through the mitigation of variability
Exploring the capabilities of support vector machines in detecting silent data corruptions
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are one of the major sources that corrupt the execution results of HPC applications without being detected.
In this work, we explore a set of novel SDC detectors – by leveraging epsilon-insensitive support vector machine regression – to detect SDCs that occur in HPC applications. The key contributions are threefold. (1) Our exploration takes temporal, spatial, and spatiotemporal features into account and analyzes different detectors based on different features. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that support-vector-machine-based detectors can achieve detection sensitivity (i.e., recall) up to 99% yet suffer a less than 1% false positive rate for most cases. Our detectors incur low performance overhead, 5% on average, for all benchmarks studied in this work.This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number 66905, program manager Lucy Nowell. Pacific Northwest National Laboratory is operated by Battelle for DOE under Contract DE-AC05-76RL01830. In addition, this material is based upon work supported by the National Science Foundation under Grant No. 1619253, and also by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC02-06CH11357 (DOE Catalog project) and in part by the European Union FEDER funds under contract TIN2015-65316-P.Peer ReviewedPostprint (author's final draft
Master of Science
thesisTo minimize resource consumption and maximize performance, computer architecture research has been investigating approaches that may compute inaccurate solutions. Such hardware inaccuracies may induce a wide variety of program behaviors which are not obs
Runtime-aware architectures
In the last few years, the traditional ways to keep the increase of hardware performance to the rate predicted by the Moore’s Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well defined Instruction Set Architecture (ISA). This simple interface allowed developing applications without worrying too much about the underlying hardware, while hardware designers were able to aggressively exploit instruction-level parallelism (ILP) in superscalar processors. Current multi-cores are designed as simple symmetric multiprocessors (SMP) on a chip. However, we believe that this is not enough to overcome all the problems that multi-cores face. The runtime system of the parallel programming model has to drive the design of future multi-cores to overcome the restrictions in terms of power, memory, programmability and resilience that multi-cores have. In the paper, we introduce an approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime’s perspective.This work has been partially supported by the European Research Council under the European Union’s 7th FP, ERC Grant Agreement number 321253, by the Spanish Ministry of Science and Innovation under grant TIN2012-34557 and by the HiPEAC Network of Excellence. M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI- 2012-15047, and M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Co-fund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).Peer ReviewedPostprint (author's final draft
New Design Techniques for Dynamic Reconfigurable Architectures
L'abstract è presente nell'allegato / the abstract is in the attachmen
- …