
1,605 research outputs found

Recommended from our members

Toward Resilience and Data Reduction in Exascale Scientific Computing

Author: Liang Xin
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Because of the ever-increasing execution scale, reliability and data management are becoming increasingly important for scientific applications. On the one hand, exascale systems are anticipated to be more susceptible to soft errors, e.g., silent data corruptions, due to the reduction in transistor size and the increase in the number of components. These errors lead to corrupted results without warning, making the output of the computation untrustworthy. On the other hand, large volumes of highly variable data are produced at high velocity by scientific computing on exascale systems or advanced instruments, and the I/O time for storing these data is prohibitive due to the I/O bottleneck in parallel file systems. In this work, we leverage algorithm-based fault tolerance (ABFT) and error-bounded lossy compression to tackle the two problems, in order to support efficient scientific computing on exascale systems.

We propose an efficient fault-tolerant scheme to tolerate soft errors in the Fast Fourier Transform (FFT), one of the most important computation kernels widely used in scientific computing. Traditional redundancy approaches at least double the execution time or resources, and this large overhead limits their use in practice. Previous work on offline ABFT algorithms for FFT mitigates this problem by providing resilient FFT with lower overhead, but these algorithms fail to make progress in vulnerable environments with high error rates because they can only detect and correct errors after the whole computation finishes. We propose an online ABFT scheme for large-scale FFT inspired by the divide-and-conquer nature of the FFT computation. We devise fault-tolerant schemes for both computational and memory errors in FFT, with both serial and parallel optimizations. Experimental results demonstrate that the proposed approach provides more timely error detection and recovery as well as better fault coverage with less overhead, compared to the offline ABFT algorithm.

To alleviate the I/O bottleneck in parallel file systems, we work on a prediction-based error-bounded lossy compressor that significantly reduces the size of scientific datasets while retaining the accuracy of the decompressed data, using adaptive prediction algorithms and compression models. We first propose a regression-based predictor that achieves better prediction accuracy than traditional approaches under large error bounds, followed by an adaptive algorithm that dynamically selects between the traditional Lorenzo predictor and the proposed regression-based predictor, leading to very high compression ratios with little visual distortion. We further unify the prediction-based model and the transform-based model by using transform-based compressors as a predictor, with novel optimizations toward efficient coefficient encoding for both models. The proposed adaptive multi-algorithm design provides better compression ratios at the same distortion, significantly reducing storage requirements and I/O time.

We further adapt the compression algorithms and compressors to different requirements and objectives in realistic scenarios. We leverage a logarithmic transform to precondition the data, which turns a relative-error-bound compression problem into an absolute-error-bound compression problem. This transform aligns two different error requirements while improving the compression quality, efficiently reducing the workload for compressor design. We also correlate the compression algorithm with system information to achieve better I/O performance than traditional single-compressor deployment. These studies further improve the efficiency of lossy compression from the perspective of efficient I/O in the context of scientific simulation, making scientific applications running on exascale systems more efficient.
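
The logarithmic preconditioning step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's compressor: the uniform quantizer below is only a stand-in for an absolute-error-bounded compressor, the function names are hypothetical, and the data are assumed to be strictly positive. If the logarithm of each value is reconstructed to within log(1 + e), the value itself is reconstructed to within a relative error of e.

```python
import math

def quantize_abs(values, abs_bound):
    """Stand-in for an absolute-error-bounded compressor (assumption):
    uniform quantization with bin width 2*abs_bound, so every
    reconstructed value lies within abs_bound of the original."""
    step = 2.0 * abs_bound
    codes = [round(v / step) for v in values]
    return codes, step

def dequantize(codes, step):
    """Reconstruct values from integer quantization codes."""
    return [c * step for c in codes]

def compress_relative(data, rel_bound):
    """Relative-error-bounded compression via a log transform.
    Assumes strictly positive data; signs and zeros would need
    separate handling that this sketch does not cover."""
    if any(x <= 0 for x in data):
        raise ValueError("this sketch assumes strictly positive values")
    abs_bound = math.log1p(rel_bound)      # log(1 + rel_bound)
    logs = [math.log(x) for x in data]     # precondition the data
    return quantize_abs(logs, abs_bound)

def decompress_relative(codes, step):
    """Invert the log transform after absolute-error-bounded decoding."""
    return [math.exp(y) for y in dequantize(codes, step)]

data = [0.001, 0.5, 3.0, 1200.0, 4.2e6]
codes, step = compress_relative(data, rel_bound=0.01)
restored = decompress_relative(codes, step)
max_rel_err = max(abs(r - x) / x for x, r in zip(data, restored))
```

Because the bound is enforced in the log domain, the same absolute-error machinery serves values spanning many orders of magnitude, which is the alignment of the two error requirements that the abstract refers to.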

eScholarship - University of California

Reliable Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement

Author: Anam Mohammad Ashraful
Andreopoulos Yiannis
Publication date: 16/04/2016

A new technique is proposed for fault-tolerant linear, sesquilinear and bijective (LSB) operations on $M$ integer data streams ($M \geq 3$), such as: scaling, additions/subtractions, inner or outer vector products, permutations and convolutions. In the proposed method, the $M$ input integer data streams are linearly superimposed to form $M$ numerically-entangled integer data streams that are stored in-place of the original inputs. A series of LSB operations can then be performed directly using these entangled data streams. The results are extracted from the $M$ entangled output streams by additions and arithmetic shifts. Any soft errors affecting any single disentangled output stream are guaranteed to be detectable via a specific post-computation reliability check. In addition, when utilizing a separate processor core for each of the $M$ streams, the proposed approach can recover all outputs after any single fail-stop failure. Importantly, unlike algorithm-based fault tolerance (ABFT) methods, the number of operations required for the entanglement, extraction and validation of the results is linearly related to the number of the inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal in an Intel processor (Haswell architecture with AVX2 support) via fast Fourier transforms, circular convolutions, and matrix multiplication operations. Our analysis and experiments reveal that the proposed approach incurs between $0.03\%$ and $7\%$ reduction in processing throughput for a wide variety of LSB operations. This overhead is 5 to 1000 times smaller than that of the equivalent ABFT method that uses a checksum stream. Thus, our proposal can be used in fault-generating processor hardware or safety-critical applications, where high reliability is required without the cost of ABFT or modular redundancy.

Comment: to appear in IEEE Trans. on Signal Processing, 201
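
For context, the checksum-stream ABFT baseline that the abstract compares against can be sketched as follows. This is a generic illustration of checksum-based error detection for a linear operation, not the numerical entanglement method itself. The function names, the sample streams and the kernel are hypothetical.

```python
def circular_convolve(x, h):
    """Integer circular convolution of x with kernel h (equal lengths)."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]

def process_with_checksum(streams, kernel):
    """Apply the linear operation to every stream and to an extra
    checksum stream formed as the element-wise sum of the inputs."""
    checksum_in = [sum(col) for col in zip(*streams)]
    outputs = [circular_convolve(s, kernel) for s in streams]
    checksum_out = circular_convolve(checksum_in, kernel)
    return outputs, checksum_out

def outputs_consistent(outputs, checksum_out):
    """By linearity, the element-wise sum of the outputs must equal
    the processed checksum stream when no error has occurred."""
    return [sum(col) for col in zip(*outputs)] == checksum_out

streams = [[1, 2, 3, 4], [5, -6, 7, 8], [9, 10, -11, 12]]  # M = 3
kernel = [2, 0, -1, 3]
outputs, checksum_out = process_with_checksum(streams, kernel)
clean_ok = outputs_consistent(outputs, checksum_out)

outputs[1][2] += 1 << 5  # simulate a soft error: add 32 to one output sample
corrupted_ok = outputs_consistent(outputs, checksum_out)
```

In this sketch the checksum stream costs one additional run of the operation itself, so its overhead grows with the complexity of the operation. That dependence is the one the abstract says the entanglement approach avoids.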

arXiv.org e-Print Archive

UCL Discovery

Data criticality estimation in software applications

Author: Benso Alfredo
Di Carlo Stefano
Di Natale Giorgio
Prinetto Paolo Ernesto
Tagliaferri Luca
Publication venue: IEEE
Publication date: 01/01/2003

In safety-critical applications it is often possible to exploit software techniques to increase a system's fault tolerance. Common approaches are based on data redundancy to prevent data corruption during software execution. Duplicating only the most critical variables can significantly reduce the memory and performance overheads, while still guaranteeing very good results in terms of fault-tolerance improvement. This paper presents a new methodology to compute the criticality of variables in target software applications. Instead of resorting to time-consuming fault injection experiments, the proposed solution is based on the run-time analysis of the variables' behavior logged during the execution of the target application under different workloads.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Virtual Runtime Application Partitions for Resource Management in Massively Parallel Architectures

Author: Jafri Syed Mohammad Asad Hassan
Publication venue: Turku Centre for Computer Science
Publication date: 28/01/2015

This thesis presents a novel design paradigm, called Virtual Runtime Application Partitions (VRAP), to judiciously utilize on-chip resources. As the dark silicon era approaches, where power considerations will allow only a fraction of the chip to be powered on, judicious resource management will become a key consideration in future designs. Most work on resource management treats only the physical components (i.e. computation, communication, and memory blocks) as resources and manipulates the component-to-application mapping to optimize various parameters (e.g. energy efficiency). To further enhance the optimization potential, in addition to the physical resources we propose to manipulate abstract resources (i.e. voltage/frequency operating point, the fault-tolerance strength, the degree of parallelism, and the configuration architecture). The proposed framework (i.e. VRAP) encapsulates methods, algorithms, and hardware blocks to provide each application with the abstract resources tailored to its needs. To test the efficacy of this concept, we have developed three distinct self-adaptive environments: (i) Private Operating Environment (POE), (ii) Private Reliability Environment (PRE), and (iii) Private Configuration Environment (PCE) that collectively ensure that each application meets its deadlines using minimal platform resources. In this work several novel architectural enhancements, algorithms and policies are presented to realize the virtual runtime application partitions efficiently. Considering future design trends, we have chosen Coarse Grained Reconfigurable Architectures (CGRAs) and Networks-on-Chip (NoCs) to test the feasibility of our approach. Specifically, we have chosen Dynamically Reconfigurable Resource Array (DRRA) and McNoC as the representative CGRA and NoC platforms. The proposed techniques are compared and evaluated using a variety of quantitative experiments. Synthesis and simulation results demonstrate that VRAP significantly enhances energy and power efficiency compared to the state of the art.

Transferred from Doria

UTUPub

NASA. Lewis Research Center Advanced Modulation and Coding Project: Introduction and overview

Author: Budinger James M.

The Advanced Modulation and Coding Project at LeRC is sponsored by the Office of Space Science and Applications, Communications Division, Code EC, at NASA Headquarters and conducted by the Digital Systems Technology Branch of the Space Electronics Division. Advanced Modulation and Coding is one of three focused technology development projects within the branch's overall Processing and Switching Program. The program consists of industry contracts for developing proof-of-concept (POC) and demonstration model hardware, university grants for analyzing advanced techniques, and in-house integration and testing of performance verification and systems evaluation. The Advanced Modulation and Coding Project is broken into five elements: (1) bandwidth- and power-efficient modems; (2) high-speed codecs; (3) digital modems; (4) multichannel demodulators; and (5) very high-data-rate modems. At least one contract and one grant were awarded for each element.

NASA Technical Reports Server

Multichannel demultiplexer/demodulator technologies for future satellite communication systems

Author: Abramovitz Irwin
Budinger James M.
Courtois Hector A.
Ivancic William D.
Staples Edward J.

NASA-Lewis' Space Electronics Division supports ongoing research in advanced satellite communication architectures, onboard processing, and technology development. Recent studies indicate that meshed VSAT (very small aperture terminal) satellite communication networks using FDMA (frequency division multiple access) uplinks and TDMA (time division multiplexed) downlinks are required to meet future communication needs. One of the critical advancements in such a satellite communication network is the multichannel demultiplexer/demodulator (MCDD). Progress made in MCDD development using acousto-optical, optical, or digital technologies is described.

NASA Technical Reports Server

A Prototype SpaceVPX Lite (VITA 78.1) System using SpaceFibre for Data and Control Planes

Author: Ferrer Florit Albert
Gonzalez-Villafranca Alberto
McClements Christopher
Parkes Stephen
Srivastava Ashish
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/03/2017

Crossref

University of Dundee Online Publications

Failure Mitigation in Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement

Author: Anam Mohammad Ashraful
Andreopoulos Yiannis
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 13/09/2015

A new roll-forward technique is proposed that recovers from any single fail-stop failure in $M$ integer data streams ($M \geq 3$) when undergoing linear, sesquilinear or bijective (LSB) operations, such as: scaling, additions/subtractions, inner or outer vector products and permutations. In the proposed approach, the $M$ input integer data streams are linearly superimposed to form $M$ numerically entangled integer data streams that are stored in-place of the original inputs. A series of LSB operations can then be performed directly using these entangled data streams. The output results can be extracted from any $M-1$ entangled output streams by additions and arithmetic shifts, thereby guaranteeing robustness to a fail-stop failure in any single stream computation. Importantly, unlike other methods, the number of operations required for the entanglement, extraction and recovery of the results is linearly related to the number of the inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal in an Intel processor (Haswell architecture with AVX2 support) via convolution operations. Our analysis and experiments reveal that the proposed approach incurs only $1.8\%$ to $2.8\%$ reduction in processing throughput in comparison to the failure-intolerant approach. This overhead is 9 to 14 times smaller than that of the equivalent checksum-based method. Thus, our proposal can be used in distributed systems and unreliable processor hardware, or safety-critical applications, where robustness against fail-stop failures becomes a necessity.

Comment: Proc. 21st IEEE International On-Line Testing Symposium (IOLTS 2015), July 2015, Halkidiki, Greece
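
The checksum-based method that the abstract uses as its point of comparison can be sketched for the fail-stop case as follows. This is a generic illustration under assumed names and data, not the paper's entanglement scheme: one extra stream carries the element-wise sum of the inputs, and because the operation is linear, the result of a single failed stream equals the checksum result minus the surviving results.

```python
def inner_product(x, w):
    """Integer inner product, used here as the example linear operation."""
    return sum(a * b for a, b in zip(x, w))

def recover_missing(results, checksum_result):
    """Roll forward after one fail-stop failure: fill in the single
    missing result (marked None) from the checksum result."""
    missing = [i for i, r in enumerate(results) if r is None]
    if len(missing) != 1:
        raise ValueError("this sketch recovers exactly one failed stream")
    known = sum(r for r in results if r is not None)
    recovered = list(results)
    recovered[missing[0]] = checksum_result - known
    return recovered

streams = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]   # M = 3 hypothetical inputs
weights = [2, -1, 3]
checksum_stream = [sum(col) for col in zip(*streams)]

expected = [inner_product(s, weights) for s in streams]
checksum_result = inner_product(checksum_stream, weights)

results = list(expected)
results[1] = None  # the core handling stream 1 stopped before finishing
recovered = recover_missing(results, checksum_result)
```

No recomputation of the failed stream is needed, which is what makes the recovery roll-forward, but the checksum stream itself requires one more run of the operation.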

arXiv.org e-Print Archive

Crossref