DFT and BIST of a multichip module for high-energy physics experiments
Engineers at Politecnico di Torino designed a multichip module (MCM) for high-energy physics experiments conducted on the Large Hadron Collider. An array of these MCMs handles multichannel data acquisition and signal processing. Testing the MCM from board to die level required a combination of DFT strategies
A software-based self test of CUDA Fermi GPUs
Graphical Processing Units (GPUs) have become increasingly popular due to their high computational power and low prices. This makes them particularly suitable for high-performance computing applications, such as data processing and financial computation. In these fields, highly efficient test methodologies are mandatory. One of the most effective ways to detect and localize hardware faults in GPUs is a Software-Based Self-Test (SBST) methodology. In this paper, a comprehensive SBST and fault localization methodology for GPUs is presented. This novel approach exploits different custom test strategies for each component inside the GPU architecture. Such strategies guarantee both permanent fault detection and accurate fault localization.
Simulating chemistry efficiently on fault-tolerant quantum computers
Quantum computers can in principle simulate quantum physics exponentially
faster than their classical counterparts, but some technical hurdles remain.
Here we consider methods to make proposed chemical simulation algorithms
computationally fast on fault-tolerant quantum computers in the circuit model.
Fault tolerance constrains the choice of available gates, so that arbitrary
gates required for a simulation algorithm must be constructed from sequences of
fundamental operations. We examine techniques for constructing arbitrary gates
which perform substantially faster than circuits based on the conventional
Solovay-Kitaev algorithm [C.M. Dawson and M.A. Nielsen, Quantum Inf.
Comput. 6:81, 2006]. For a given approximation error $\epsilon$,
arbitrary single-qubit gates can be produced fault-tolerantly and using a
limited set of gates in time which is $O(\log \epsilon)$ or $O(\log \log \epsilon)$; with sufficient parallel preparation of ancillas, constant average
depth is possible using a method we call programmable ancilla rotations.
Moreover, we construct and analyze efficient implementations of first- and
second-quantized simulation algorithms using the fault-tolerant arbitrary gates
and other techniques, such as implementing various subroutines in constant
time. A specific example we analyze is the ground-state energy calculation for
lithium hydride. Comment: 33 pages, 18 figures
Fault Diagnosis of Hybrid Computing Systems Using Chaotic-Map Method
Computing systems are becoming increasingly complex, with nodes consisting of a combination of multi-core central processing units (CPUs), many integrated core (MIC) accelerators, and graphics processing unit (GPU) accelerators. These computing units and their interconnections are subject to different classes of hardware and software faults, which should be detected to support mitigation measures. We present the chaotic-map method, which uses the exponential divergence and wide Fourier properties of the trajectories, combined with memory allocations and assignments, to diagnose component-level faults in these hybrid computing systems. We propose lightweight codes that utilize highly parallel chaotic-map computations tailored to isolate faults in arithmetic units, memory elements, and interconnects. The diagnosis module on a node utilizes pthreads to place chaotic-map threads on CPU and MIC cores, and CUDA C and OpenCL kernels on GPU blocks. We present experimental diagnosis results on five multi-core CPUs, one MIC, and seven GPUs, with typical diagnosis run-times under a minute.
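The role of exponential divergence in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: it assumes the logistic map x -> 4x(1-x) as the chaotic map, and the function names and the injected fault model are hypothetical. Because the map is chaotic, a tiny arithmetic error grows roughly exponentially, so a unit that should reproduce a reference trajectory soon disagrees with it in a way that is easy to detect.

```python
def logistic_trajectory(x0, steps, fault_step=None, fault_size=0.0):
    """Iterate the logistic map x -> 4x(1-x); optionally inject one small error."""
    xs = []
    x = x0
    for i in range(steps):
        x = 4.0 * x * (1.0 - x)
        if i == fault_step:
            x += fault_size  # hypothetical fault model: one perturbed result
        xs.append(x)
    return xs


def first_divergence(reference, observed, tolerance=1e-3):
    """Return the first step where the trajectories differ by more than tolerance, or None."""
    for i, (r, o) in enumerate(zip(reference, observed)):
        if abs(r - o) > tolerance:
            return i
    return None


reference = logistic_trajectory(0.3, 200)
healthy = logistic_trajectory(0.3, 200)
# A single error of size 1e-12 at step 10 stands in for a faulty arithmetic unit.
faulty = logistic_trajectory(0.3, 200, fault_step=10, fault_size=1e-12)

healthy_result = first_divergence(reference, healthy)
faulty_result = first_divergence(reference, faulty)
print(healthy_result, faulty_result)
```

The healthy run matches the reference at every step, while the faulty run crosses the tolerance some steps after the injected error, once the perturbation has been amplified by the map.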
Recursive quantum repeater networks
Internet-scale quantum repeater networks will be heterogeneous in physical
technology, repeater functionality, and management. The classical control
necessary to use the network will therefore face similar issues as Internet
data transmission. Many scalability and management problems that arose during
the development of the Internet might have been solved in a more uniform
fashion, improving flexibility and reducing redundant engineering effort.
Quantum repeater network development is currently at the stage where we risk
similar duplication when separate systems are combined. We propose a unifying
framework that can be used with all existing repeater designs. We introduce the
notion of a Quantum Recursive Network Architecture, developed from the emerging
classical concept of 'recursive networks', extending recursive mechanisms from
a focus on data forwarding to a more general distributed computing request
framework. Recursion abstracts independent transit networks as single relay
nodes, unifies software layering, and virtualizes the addresses of resources to
improve information hiding and resource management. Our architecture is useful
for building arbitrary distributed states, including fundamental distributed
states such as Bell pairs and GHZ, W, and cluster states. Comment: 14 pages
Hard and Soft Error Resilience for One-sided Dense Linear Algebra Algorithms
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This dissertation develops fault tolerance algorithms for one-sided dense matrix factorizations that handle both hard and soft errors.
For hard errors, we propose methods based on diskless checkpointing and Algorithm Based Fault Tolerance (ABFT) to provide full matrix protection, including the left and right factors that are normally seen in dense matrix factorizations. A horizontal parallel diskless checkpointing scheme is devised to maintain the checkpoint data with scalable performance and low space overhead, while the ABFT checksum, which is generated before the factorization, is continually updated by the factorization operations to protect the right factor. In addition, for environments without a fault-tolerant MPI implementation, we have also integrated the Checkpoint-on-Failure (CoF) mechanism into one-sided dense linear operations such as QR factorization to recover the running stack of the failed MPI process.
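The idea of a checksum that is generated before the factorization and then updated by the factorization itself can be sketched as follows. This is a small serial illustration under stated assumptions, not the dissertation's parallel scheme: it uses Gaussian elimination without pivoting on a tiny dense matrix, a single checksum column of row sums, and hypothetical function names. Because every row update is linear, applying it to the checksum column as well keeps that column equal to the row sums of the upper-triangular (right) factor.

```python
def lu_with_checksum(a):
    """Gaussian elimination (no pivoting) on [A | row sums]; returns the reduced augmented matrix."""
    n = len(a)
    # Append the checksum column: the sum of each row of A.
    m = [row[:] + [sum(row)] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            # The same row update is applied to the data and to the checksum column.
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    return m


def checksum_mismatches(m, tolerance=1e-9):
    """Return indices of rows whose data no longer sums to the stored checksum."""
    n = len(m)
    return [i for i in range(n) if abs(sum(m[i][:n]) - m[i][n]) > tolerance]


a = [[4.0, 3.0, 2.0],
     [2.0, 4.0, 1.0],
     [1.0, 2.0, 3.0]]

reduced = lu_with_checksum(a)
clean_rows = checksum_mismatches(reduced)

# Simulate a corrupted entry in row 1 of the upper-triangular factor.
corrupted = [row[:] for row in reduced]
corrupted[1][2] += 0.5
bad_rows = checksum_mismatches(corrupted)
print(clean_rows, bad_rows)
```

The uncorrupted result passes the check on every row, and the simulated corruption is flagged in the row where it occurred.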
Soft errors are more challenging because they cause silent data corruption, which leads to a large area of erroneous data due to error propagation. Full matrix protection is developed in which the left factor is protected by column-wise local diskless checkpointing, and the right factor is protected by a combination of a floating-point weighted checksum scheme and a soft error modeling technique. To allow practical use on large-scale systems, we have also developed a complexity reduction scheme such that correct computing results can be recovered with low performance overhead.
Experimental results on a large-scale cluster system and a multicore+GPGPU hybrid system have confirmed that our hard and soft error fault tolerance algorithms exhibit the expected error-correcting capability, low space and performance overhead, and compatibility with double-precision floating-point operations.
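The weighted checksum mentioned for soft errors can be illustrated with a generic textbook construction, which is not necessarily the scheme used in the dissertation. The sketch assumes a single corrupted entry in a vector, weights 1..n, and hypothetical function names. The plain checksum reveals the size of the error, and the ratio of the weighted discrepancy to the plain discrepancy reveals its position.

```python
def make_checksums(values):
    """Return (plain sum, weighted sum with weights 1..n) for a list of floats."""
    plain = sum(values)
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return plain, weighted


def locate_and_correct(values, plain, weighted, tolerance=1e-9):
    """Fix at most one corrupted entry in place; return its index or None."""
    new_plain, new_weighted = make_checksums(values)
    delta = new_plain - plain  # equals the error magnitude
    if abs(delta) <= tolerance:
        return None  # no detectable error
    # The weighted discrepancy is (index + 1) * delta, so the ratio gives the position.
    index = round((new_weighted - weighted) / delta) - 1
    values[index] -= delta
    return index


data = [1.5, -2.0, 3.25, 4.0, 0.5]
plain, weighted = make_checksums(data)

data[3] += 0.75  # simulate a silent single-entry corruption
found = locate_and_correct(data, plain, weighted)
print(found, data)
```

With real floating-point data the ratio is only approximately an integer, which is why the sketch rounds it; a production scheme would also need to bound rounding error when choosing the tolerance.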