24 research outputs found

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

Author: Architectures Software Developer's Intel®
Berry M.
Degalahal V.
Family Intel® Xeon®
Kleen A.
Li X.
Manual Architecture Programmer's
Shewchuk J. R.
Sorin D.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015

This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE) relying on error detection techniques already available in commodity hardware. Detection operates at the memory page level, which enables the use of simple algorithmic redundancies to correct errors. Such redundancies would be inapplicable under coarse-grain error detection, but become very powerful when the hardware is able to precisely detect errors. Relations straightforwardly extracted from the solver allow lost data to be recovered exactly. This method is free of the overheads of backward recovery techniques such as checkpointing, and does not compromise the mathematical convergence properties of the solver as restarting would. We apply this recovery to three widely used Krylov subspace methods, CG, GMRES and BiCGStab, and their preconditioned versions. We implement our resilience techniques on CG considering scenarios from small (8 cores) to large (1024 cores) scales, and demonstrate very low overheads compared to state-of-the-art solutions. We deploy our recovery techniques either by overlapping them with algorithmic computations or by forcing them to be in the critical path of the application. A trade-off exists between both approaches depending on the error rate the solver is suffering. Under realistic error rates, overlapping decreases overheads from 5.37% down to 3.59% for a non-preconditioned CG on 8 cores.

This work has been partially supported by the European Research Council under the European Union's 7th FP, ERC Advanced Grant 321253, and by the Spanish Ministry of Science and Innovation under grant TIN2012-34557. L. Jaulmes has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2013/06982. M. Moreto has been partially supported by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas has been partially supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Co-fund programme of the Marie Curie Actions of the European Union's 7th FP (contract 2013 BP B 00243).

Peer reviewed. Postprint (author's final draft).
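
As a rough illustration of the exact forward recovery idea in the abstract above, the following sketch rebuilds one lost entry of the iterate x from the residual relation r = b - Ax that CG maintains. It is not taken from the paper: the function names `residual` and `recover_entry`, the dense list-of-rows storage, the single-entry granularity (the abstract speaks of memory pages) and the sample data are all assumptions made for illustration, and the sketch further assumes that A, b, r and the remaining entries of x are intact.

```python
def residual(A, x, b):
    """Return r = b - A x for a dense matrix stored as a list of rows."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x))
            for row, bi in zip(A, b)]


def recover_entry(A, x, b, r, i):
    """Rebuild a lost x[i] from row i of the relation r = b - A x.

    Hypothetical helper: assumes A[i][i] != 0 and that A, b, r and every
    x[j] with j != i are undamaged. The damaged x[i] itself is never read.
    """
    off_diag = sum(A[i][j] * x[j] for j in range(len(x)) if j != i)
    return (b[i] - r[i] - off_diag) / A[i][i]


# Small symmetric positive definite system (illustrative data only).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = [0.5, -0.25, 1.0]
r = residual(A, x, b)          # residual kept alongside the iterate

damaged = list(x)
damaged[1] = float("nan")      # simulate a detected, uncorrected error in x[1]
recovered = recover_entry(A, damaged, b, r, 1)
```

Because the lost value is recomputed from data the solver already holds, no checkpoint needs to be restored and the iteration can continue from the same state, which is the property the abstract contrasts with backward recovery and restarting.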

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Energy balance between voltage-frequency scaling and resilience for linear algebra routines on low-power multicore architectures

Author: Alameldeen
Alonso
Anderson
Bacha
Cappello
Catalán
Chandrakasan
Degalahal
Enrique S. Quintana-Ortí
Ernst
Gensh
Golub
Hennessy
Ibtesham
Johnston
José R. Herrero
Karpuzcu
Leng
Mills
Moore
Petersen
Rafael Rodríguez-Sánchez
Sandra Catalán
Smith
Smith
Tan
Tan
Wei
Weiser
Wilkerson
Zee
Zee
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017

Near Threshold Voltage (NTV) computing has recently been proposed as a technique to save energy, at the cost of higher error rates, including, among others, Silent Data Corruption (SDC). In this paper, we evaluate the energy efficiency of dense linear algebra routines on several low-power multicore processors, and we analyze whether the potential energy reduction achieved by scaling the processor to operate at a low voltage compensates for the cost of integrating a fault tolerance mechanism that tackles SDC. Our study targets algorithm-based fault-tolerant versions of the dense matrix-vector and matrix-matrix multiplication kernels (GEMV and GEMM, respectively), using the BLIS framework, as well as an implementation of the LU factorization with partial pivoting built on top of GEMM. Furthermore, we tailor the study to a number of representative 32-bit and 64-bit multicore processors from ARM that were specifically designed for energy efficiency. (C) 2017 Elsevier B.V. All rights reserved.

The researchers from Universidad Jaume I were supported by project CICYT TIN2014-53495-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universitat Politecnica de Catalunya was supported by projects TIN2015-65316-P from the Spanish Ministry of Education and 2014 SGR 1051 from the Generalitat de Catalunya, Dep. d'Innovacio, Universitats i Empresa.

Catalán, S.; Herrero, JR.; Quintana Ortí, ES.; Rodríguez-Sánchez, R. (2018). Energy balance between voltage-frequency scaling and resilience for linear algebra routines on low-power multicore architectures. Parallel Computing. 73:28-39. https://doi.org/10.1016/j.parco.2017.05.004

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositori Institucional de la Universitat Jaume I

RiuNet

Soft errors issues in low-power caches

Author: Lin Li
M. Kandemir
M.J. Irwin
V. Degalahal
V. Narayanan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

The effect of threshold voltages on soft error rate

Author: M. J. Irwin
N. Vijaykrishnan
R. Ramanarayanan
V. Degalahal
Y. Xie

CiteSeerX

Compiler-directed instruction duplication for soft error detection

Author: Feihui Li
Jie S Hu
Mahmut Kandemir
Mary J Irwin
N Vijaykrishnan
Vijay Degalahal
Publication venue: Design, Automation and Test in Europe
Publication date: 01/01/2005

In this work, we experiment with compiler-directed instruction duplication to detect soft errors in VLIW datapaths. In the proposed approach, the compiler determines the instruction schedule by balancing the permissible performance degradation with the required degree of duplication. Our experimental results show that our algorithms allow the designer to perform trade-off analysis between performance and reliability.

CiteSeerX