Exploiting Data Representation for Fault Tolerance
We explore the link between data representation and soft errors in dot
products. We present an analytic model for the absolute error introduced should
a soft error corrupt a bit in an IEEE-754 floating-point number. We show how
this finding relates to the fundamental linear algebra concepts of
normalization and matrix equilibration. We present a case study illustrating
that the probability of experiencing a large error in a dot product is
minimized when both vectors are normalized. Furthermore, when the data is
normalized, we show that the absolute error is either less than one or very large,
which allows us to detect large errors. We demonstrate how this finding can be
used by instrumenting the GMRES iterative solver. We count all possible errors
that can be introduced through faults in arithmetic in the computationally
intensive orthogonalization phase, and show that, when scaling is used, the
absolute error can be bounded above by one.
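To make the link between bit position and error size concrete, the following is a minimal sketch (not the authors' analytic model) that flips each bit of a double-precision value and records the resulting absolute error. The helper names `flip_bit` and `flip_errors` are hypothetical, and non-finite results (infinity or NaN) are reported as an infinite error by assumption.

```python
import math
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent field, and bit 63 is the sign bit.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]


def flip_errors(x: float) -> list:
    """Absolute error caused by each of the 64 possible single-bit flips."""
    errors = []
    for bit in range(64):
        y = flip_bit(x, bit)
        # Assumption: treat a flip that yields inf or NaN as an infinite error.
        errors.append(abs(y - x) if math.isfinite(y) else math.inf)
    return errors


if __name__ == "__main__":
    for value in (0.5, 1000.0):
        errs = flip_errors(value)
        print(f"x = {value}: largest finite error = "
              f"{max(e for e in errs if math.isfinite(e)):.3e}")
```

Running the sketch on a value with magnitude below one and on a larger value shows how the spread of possible errors depends on the stored exponent, which is the effect the abstract attributes to normalization.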
Evaluating the Impact of SDC on the GMRES Iterative Solver
Increasing parallelism and transistor density, along with increasingly
tight energy and peak power constraints, may force exposure of occasionally
incorrect computation or storage to application codes. Silent data corruption
(SDC) will likely be infrequent, yet one SDC suffices to make numerical
algorithms like iterative linear solvers cease progress towards the correct
answer. Thus, we focus on resilience of the iterative linear solver GMRES to a
single transient SDC. We derive inexpensive checks to detect the effects of an
SDC in GMRES that work for a more general SDC model than presuming a bit flip.
Our experiments show that when GMRES is used as the inner solver of an
inner-outer iteration, it can "run through" SDC of almost any magnitude in the
computationally intensive orthogonalization phase. That is, it gets the right
answer using faulty data without any required rollback. Those SDCs that it
cannot run through are caught by our detection scheme.
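The abstract does not spell out the checks, so the following is only a sketch of one inexpensive test that is consistent with it. In the Arnoldi process the basis vectors have unit norm, so every upper Hessenberg entry h_ij = q_i^T A q_j satisfies |h_ij| <= ||A||_2 <= ||A||_F. Whether this is the bound the authors use is an assumption, and the function names and the `fault` parameter are hypothetical.

```python
import math


def matvec(A, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def frobenius_norm(A):
    return math.sqrt(sum(a * a for row in A for a in row))


def arnoldi_step_checked(A, basis, bound, fault=None):
    """One modified Gram-Schmidt orthogonalization step with a bound check.

    basis: list of orthonormal vectors q_1..q_j built so far.
    bound: upper bound on |h_ij|, for example the Frobenius norm of A.
    fault: optional (i, value) pair that overwrites h_i to simulate an SDC.
    Returns (h, w): the projection coefficients and the unnormalized vector.
    Raises ValueError if a coefficient violates the bound.
    """
    w = matvec(A, basis[-1])
    h = []
    for i, q in enumerate(basis):
        hij = dot(q, w)
        if fault is not None and fault[0] == i:
            hij = fault[1]  # injected corruption, for illustration only
        if not abs(hij) <= bound:  # written this way so NaN also fails
            raise ValueError(f"h[{i}] = {hij!r} exceeds bound {bound:.3e}")
        h.append(hij)
        w = [wk - hij * qk for wk, qk in zip(w, q)]
    return h, w


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 3.0]]
    q1 = [1.0, 0.0]
    print(arnoldi_step_checked(A, [q1], frobenius_norm(A)))
    try:
        arnoldi_step_checked(A, [q1], frobenius_norm(A), fault=(0, 1e10))
    except ValueError as err:
        print("detected:", err)
```

The check costs one comparison per coefficient and does not assume that the corruption is a single bit flip, only that it pushes a value outside the theoretical range.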
Soft-Decision-Driven Channel Estimation for Pipelined Turbo Receivers
We consider channel estimation specific to turbo equalization for
multiple-input multiple-output (MIMO) wireless communication. We develop a
soft-decision-driven sequential algorithm geared to the pipelined turbo
equalizer architecture operating on orthogonal frequency division multiplexing
(OFDM) symbols. One interesting feature of the pipelined turbo equalizer is
that multiple soft-decisions become available at various processing stages. A
tricky issue is that these multiple decisions from different pipeline stages
have varying levels of reliability. This paper establishes an effective
strategy for the channel estimator to track the target channel, while dealing
with observation sets with different qualities. The resulting algorithm is
basically a linear sequential estimation algorithm and, as such, is
Kalman-based in nature. The main difference here, however, is that the proposed
algorithm employs puncturing on observation samples to effectively deal with
the inherent correlation among the multiple demapper/decoder module outputs
that cannot easily be removed by the traditional innovations approach. The
proposed algorithm continuously monitors the quality of the feedback decisions
and incorporates it in the channel estimation process. The proposed channel
estimation scheme shows clear performance advantages relative to existing
channel estimation techniques.
Comment: 11 pages; IEEE Transactions on Communications 201
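As a rough illustration of a Kalman-style sequential estimator that weights observations by their reliability, the sketch below tracks a single real-valued channel coefficient. It is not the proposed algorithm: the scalar random-walk channel model, the function name `kalman_update`, and the use of a larger `noise_var` to represent a less reliable soft decision are all assumptions, and the puncturing step described in the abstract is not modeled.

```python
def kalman_update(h_est, p_est, y, s, noise_var, process_var=0.0):
    """One scalar Kalman step for the model y = s * h + noise.

    h_est, p_est: current channel estimate and its error variance.
    y: received sample; s: (soft) symbol decision used as the regressor.
    noise_var: observation noise variance; assumed larger for less
               reliable decisions, so those observations are trusted less.
    process_var: variance of the assumed random-walk channel drift.
    Returns the updated (h_est, p_est).
    """
    p_pred = p_est + process_var           # predict step (random walk)
    innovation = y - s * h_est             # part of y not yet explained
    innovation_var = s * p_pred * s + noise_var
    gain = p_pred * s / innovation_var     # Kalman gain
    h_new = h_est + gain * innovation
    p_new = (1.0 - gain * s) * p_pred
    return h_new, p_new


if __name__ == "__main__":
    # Same observation, two different reliability levels.
    print(kalman_update(0.0, 1.0, y=1.0, s=1.0, noise_var=1.0))
    print(kalman_update(0.0, 1.0, y=1.0, s=1.0, noise_var=3.0))
```

With the same observation, the less reliable case moves the estimate less and leaves more residual uncertainty, which is the qualitative behavior expected when decisions from different pipeline stages differ in quality.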
Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators
In this paper, we evaluate the criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as far as imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications' output, correlating the number of corrupted elements with their spatial locality. We also provide the mean relative error (dataset-wise) to evaluate the magnitude of radiation-induced errors.
We apply the selected metrics to experimental results obtained in various radiation test campaigns, for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while the Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.
This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research project grant 99999.007556/2015-02, EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.
Postprint (author's final draft)
An extensive study on iterative solver resilience: characterization, detection and prediction
Soft errors caused by transient bit flips have the potential to significantly impact an application's behavior. This has motivated the design of an array of techniques to detect, isolate, and correct soft errors using microarchitectural, architectural, compilation-based, or application-level techniques to minimize their impact on the executing application. The first step toward the design of good error detection/correction techniques involves an understanding of an application's vulnerability to soft errors. This work focuses on the effects of silent data corruption on iterative solvers and on efforts to mitigate those effects.
In this thesis, we first present the first comprehensive characterization of the impact of soft errors on the convergence characteristics of six iterative methods using application-level fault injection. We analyze the impact of soft errors in terms of the type of error (single- vs. multi-bit), the distribution and location of the bits affected, the data structure and statement impacted, and variation with time. We create a public-access database with more than 1.5 million fault injection results. We then analyze the performance of soft error detection mechanisms and present the comparative results. Motivated by our observations, we evaluate a machine-learning-based detector that takes as its features the runtime features that the individual detectors observe to arrive at their conclusions. Our evaluation demonstrates improved results over individual detectors. We then propose a machine-learning-based method to predict a program's error behavior to make fault injection studies more efficient. We demonstrate this method on assessing the performance of soft error detectors. We show that our method maintains 84% accuracy on average with up to 53% less cost. We also show that, once a model is trained, further fault injection tests would cost 10% of the expected full fault injection runs.
Postprint (published version)
A new method for aspherical surface fitting with large-volume datasets
In the framework of form characterization of aspherical surfaces, European National Metrology Institutes (NMIs) have been developing ultra-high-precision machines able to measure aspherical lenses with an uncertainty of a few tens of nanometers. The fitting of the acquired aspherical datasets onto their corresponding theoretical model should be achieved at the same level of precision. In this article, three fitting algorithms are investigated: the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), the Levenberg–Marquardt (LM) and one variant of the Iterative Closest Point (ICP). They are assessed on their ability to converge relatively fast to a nanometric level of accuracy, to manage a large volume of data and to be robust to the position of the data with respect to the model. Nevertheless, the algorithms are first evaluated on simulated datasets and their performances are studied. The comparison of these algorithms is then extended to measured datasets of an aspherical lens. The results validate the newly used method for the fitting of aspherical surfaces and reveal that it is well adapted, faster and less complex than the LM or ICP methods. EMR