202 research outputs found

    Hard and Soft Error Resilience for One-sided Dense Linear Algebra Algorithms

    Get PDF
    Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This dissertation develops fault tolerance algorithms for one-sided dense matrix factorizations, which handles Both hard and soft errors. For hard errors, we propose methods based on diskless checkpointing and Algorithm Based Fault Tolerance (ABFT) to provide full matrix protection, including the left and right factor that are normally seen in dense matrix factorizations. A horizontal parallel diskless checkpointing scheme is devised to maintain the checkpoint data with scalable performance and low space overhead, while the ABFT checksum that is generated before the factorization constantly updates itself by the factorization operations to protect the right factor. In addition, without an available fault tolerant MPI supporting environment, we have also integrated the Checkpoint-on-Failure(CoF) mechanism into one-sided dense linear operations such as QR factorization to recover the running stack of the failed MPI process. Soft error is more challenging because of the silent data corruption, which leads to a large area of erroneous data due to error propagation. Full matrix protection is developed where the left factor is protected by column-wise local diskless checkpointing, and the right factor is protected by a combination of a floating point weighted checksum scheme and soft error modeling technique. To allow practical use on large scale system, we have also developed a complexity reduction scheme such that correct computing results can be recovered with low performance overhead. Experiment results on large scale cluster system and multicore+GPGPU hybrid system have confirmed that our hard and soft error fault tolerance algorithms exhibit the expected error correcting capability, low space and performance overhead and compatibility with double precision floating point operation

    Resilient Authenticated Execution of Critical Applications in Untrusted Environments

    Get PDF
    Abstract-Modern computer systems are built on a foundation of software components from a variety of vendors. While critical applications may undergo extensive testing and evaluation procedures, the heterogeneity of software sources threatens the integrity of the execution environment for these trusted programs. For instance, if an attacker can combine an application exploit with a privilege escalation vulnerability, the operating system (OS) can become corrupted. Alternatively, a malicious or faulty device driver running with kernel privileges could threaten the application. While the importance of ensuring application integrity has been studied in prior work, proposed solutions immediately terminate the application once corruption is detected. Although, this approach is sufficient for some cases, it is undesirable for many critical applications. In order to overcome this shortcoming, we have explored techniques for leveraging a trusted virtual machine monitor (VMM) to observe the application and potentially repair damage that occurs. In this paper, we describe our system design, which leverages efficient coding and authentication schemes, and we present the details of our prototype implementation to quantify the overhead of our approach. Our work shows that it is feasible to build a resilient execution environment, even in the presence of a corrupted OS kernel, with a reasonable amount of storage and performance overhead

    Benchmarking: More Aspects of High Performance Computing

    Get PDF
    The original HPL algorithm makes the assumption that all data can be fit entirely in the main memory. This assumption will obviously give a good performance due to the absence of disk I/O. However, not all applications can fit their entire data in memory. These applications which require a fair amount of I/O to move data to and from main memory and secondary storage, are more indicative of usage of an Massively Parallel Processor (MPP) System. Given this scenario a well designed I/O architecture will play a significant part in the performance of the MPP System on regular jobs. And, this is not represented in the current Benchmark. The modified HPL algorithm is hoped to be a step in filling this void. The most important factor in the performance of out-of-core algorithms is the actual I/O operations performed and their efficiency in transferring data to/from main memory and disk, Various methods were introduced in the report for performing I/O operations. The I/O method to use depends on the design of the out-of-core algorithm. Conversely, the performance of the out-of-core algorithm is affected by the choice of I/O operations. This implies, good performance is achieved when I/O efficiency is closely tied with the out-of-core algorithms. The out-of-core algorithms must be designed from the start. It is easily observed in the timings for various plots, that I/O plays a significant part in the overall execution time. This leads to an important conclusion, retro-fitting an existing code may not be the best choice. The right-looking algorithm selected for the LU factorization is a recursive algorithm and performs well when the entire dataset is in memory. At each stage of the loop the entire trailing submatrix is read into memory panel by panel. This gives a polynomial number of I/O reads and writes. If the left-looking algorithm was selected for the main loop, the number of I/O operations involved will be linear on the number of columns. This is due to the data access pattern for the left-looking factorization. The right-looking algorithm performs better for in-core data, but the left-looking will perform better for out-of-core data due to the reduced I/O operations. Hence the conclusion that out-of-core algorithms will perform better when designed from start. The out-of-core and thread based computation do not interact in this case, since I/O is not done by the threads. The performance of the thread based computation does not depend on I/O as the algorithms are in the BLAS algorithms which assumes all the data to be in memory. This is the reason the out-of-core results and OpenMP threads results were presented separately and no attempt to combine them was made. In general, the modified HPL performs better with larger block sizes, due to less I/O involved for out-of-core part and better cache utilization for the thread based computation

    Exploiting task-based programming models for resilience

    Get PDF
    Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuits ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events. Other constraints related to the diminishing sizes of transistors, such as power consumption and memory latency have caused the microprocessor industry to turn to increasingly complex processor architectures. To solve the difficulties arising from programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer on the hardware-software abstraction stack, that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems dispose of a lot of information, both from the hardware and the applications, and offer thus many possibilities for optimisations. This thesis proposes solutions to the increasing fault rates in memory, across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems. The first contribution of this thesis is an algorithmic-based resilience technique, allowing to tolerate detected errors in memory. This technique allows to recover data that is lost by performing computations that rely on simple redundancy relations identified in the program. The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, which allows to mask the already low overheads of this technique. The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and assurances on the error rate. This metric reveals a key insight into data that is not relevant to the program, and we propose an OS-level strategy to ignore errors in such data, by delaying the reporting of detected errors. This allows to reduce failure rates of running programs, by ignoring errors that have no impact. The architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme, that can increase protection of memory regions where the impact of errors is highest. A runtime methodology is presented to estimate the fault rate at runtime using our metric, through performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online using the methodology's vulnerability estimates allows to decrease error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC. This provides a useful and wide range of trade-offs between redundancy and error rates. The work presented in this thesis demonstrates that runtime systems allow to make the most of redundancy stored in memory, to help tackle increasing error rates in DRAM. This exploited redundancy can be an inherent part of algorithms that allows to tolerate higher fault rates, or in the form of dead data stored in memory. Redundancy can also be added to a program, in the form of ECC. In all cases, the runtime allows to decrease failure rates efficiently, by diminishing recovery costs, identifying redundant data, or targeting critical data. It is thus a very valuable tool for the future computing systems, as it can perform optimisations across different layers of abstractions.Los errores en memoria se vuelven más comunes a medida que las tecnologías de silicio reducen su tamaño. La variabilidad de fabricación y el envejecimiento de los circuitos causan fallos permanentes e intermitentes. Aunque se pueden mitigar una vez identificados, su continua tasa de aparición siempre causa errores inesperados. Además, la memoria también sufre de fallos transitorios contra los cuales no se puede proteger eficientemente. Estos fallos están causados por efectos como la radiación o los reducidos márgenes de voltaje y frecuencia. Otras restricciones coetáneas, como el consumo de energía y la latencia de la memoria, obligaron a las arquitecturas de computadores a volverse cada vez más complejas. Para programar tales procesadores, se desarrollaron modelos de programación basados en entornos de ejecución. Estos sistemas forman una nueva abstracción entre hardware y software, realizando tareas como la distribución del trabajo entre recursos informáticos: núcleos de procesadores, aceleradores, etc. Estos entornos de ejecución disponen de mucha información tanto sobre el hardware como sobre las aplicaciones, y ofrecen así muchas posibilidades de optimización. Esta tesis propone soluciones a los fallos en memoria entre múltiples disciplinas de resiliencia, desde la tolerancia a fallos basada en algoritmos, hasta los códigos de corrección de errores en hardware, incluyendo estrategias de resiliencia del sistema operativo. La eficiencia de estas soluciones depende de las oportunidades que presentan los entornos de ejecución. La primera contribución de esta tesis es una técnica a nivel algorítmico que permite corregir fallos encontrados mientras el programa su ejecuta. Para corregir fallos se han identificado redundancias simples en los datos del programa para toda una clase de algoritmos, los métodos del subespacio de Krylov (gradiente conjugado, GMRES, etc). La estrategia de recuperación de datos desarrollada permite corregir errores sin tener que reinicializar el algoritmo, y aprovecha el modelo de programación para superponer las computaciones del algoritmo y de la recuperación de datos. La segunda parte de esta tesis propone una métrica para caracterizar el impacto de los fallos en la memoria. Esta métrica supera en precisión a las métricas de vanguardia y permite identificar datos que son menos relevantes para el programa. Se propone una estrategia a nivel del sistema operativo retrasando la notificación de los errores detectados, que permite ignorar fallos en estos datos y reducir la tasa de fracaso del programa. Por último, la contribución a nivel arquitectónico de esta tesis es un esquema de Código de Corrección de Errores (ECC por sus siglas en inglés) adaptable dinámicamente. Este esquema puede aumentar la protección de las regiones de memoria donde el impacto de los errores es mayor. Se presenta una metodología para estimar el riesgo de fallo en tiempo de ejecución utilizando nuestra métrica, a través de las herramientas de monitorización del rendimiento disponibles en los procesadores actuales. El esquema de ECC guiado dinámicamente con estas estimaciones de vulnerabilidad permite disminuir la tasa de fracaso de los programas a una fracción del coste de redundancia requerido para un ECC uniformemente más fuerte. El trabajo presentado en esta tesis demuestra que los entornos de ejecución permiten aprovechar al máximo la redundancia contenida en la memoria, para contener el aumento de los errores en ella. Esta redundancia explotada puede ser una parte inherente de los algoritmos que permite tolerar más fallos, en forma de datos inutilizados almacenados en la memoria, o agregada a la memoria de un programa en forma de ECC. En todos los casos, el entorno de ejecución permite disminuir los efectos de los fallos de manera eficiente, disminuyendo los costes de recuperación, identificando datos redundantes, o focalizando esfuerzos de protección en los datos críticos.Postprint (published version

    Adjoint-Based Uncertainty Quantification and Sensitivity Analysis for Reactor Depletion Calculations

    Get PDF
    Depletion calculations for nuclear reactors model the dynamic coupling between the material composition and neutron flux and help predict reactor performance and safety characteristics. In order to be trusted as reliable predictive tools and inputs to licensing and operational decisions, the simulations must include an accurate and holistic quantification of errors and uncertainties in its outputs. Uncertainty quantification is a formidable challenge in large, realistic reactor models because of the large number of unknowns and myriad sources of uncertainty and error. We present a framework for performing efficient uncertainty quantification in depletion problems using an adjoint approach, with emphasis on high-fidelity calculations using advanced massively parallel computing architectures. This approach calls for a solution to two systems of equations: (a) the forward, engineering system that models the reactor, and (b) the adjoint system, which is mathematically related to but different from the forward system. We use the solutions of these systems to produce sensitivity and error estimates at a cost that does not grow rapidly with the number of uncertain inputs. We present the framework in a general fashion and apply it to both the source-driven and k-eigenvalue forms of the depletion equations. We describe the implementation and verification of solvers for the forward and ad- joint equations in the PDT code, and we test the algorithms on realistic reactor analysis problems. We demonstrate a new approach for reducing the memory and I/O demands on the host machine, which can be overwhelming for typical adjoint algorithms. Our conclusion is that adjoint depletion calculations using full transport solutions are not only computationally tractable, they are the most attractive option for performing uncertainty quantification on high-fidelity reactor analysis problems

    Enabling Distributed Applications Optimization in Cloud Environment

    Get PDF
    The past few years have seen dramatic growth in the popularity of public clouds, such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Container-as-a-Service (CaaS). In both commercial and scientific fields, quick environment setup and application deployment become a mandatory requirement. As a result, more and more organizations choose cloud environments instead of setting up the environment by themselves from scratch. The cloud computing resources such as server engines, orchestration, and the underlying server resources are served to the users as a service from a cloud provider. Most of the applications that run in public clouds are the distributed applications, also called multi-tier applications, which require a set of servers, a service ensemble, that cooperate and communicate to jointly provide a certain service or accomplish a task. Moreover, a few research efforts are conducting in providing an overall solution for distributed applications optimization in the public cloud. In this dissertation, we present three systems that enable distributed applications optimization: (1) the first part introduces DocMan, a toolset for detecting containerized application’s dependencies in CaaS clouds, (2) the second part introduces a system to deal with hot/cold blocks in distributed applications, (3) the third part introduces a system named FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications
    corecore