14 research outputs found

    Design and Optimization of Adaptable BCH Codecs for NAND Flash Memories

    NAND flash memories represent a key storage technology for solid-state storage systems. However, they suffer from serious reliability and endurance issues that must be mitigated by the use of proper error correction codes. This paper proposes the design and implementation of an optimized Bose-Chaudhuri-Hocquenghem (BCH) hardware codec core able to adapt its correction capability within a range of predefined values. Code adaptability makes it possible to efficiently trade off in-field reliability against code complexity. This feature is very important because the reliability of a NAND flash memory continuously decreases over time, meaning that the required correction capability is not fixed during the life of the device. Experimental results show that the proposed architecture saves resources when the device is in the early stages of its lifecycle, while introducing a limited overhead in terms of are

    Dependability Assessment of NAND Flash-memory for Mission-critical Applications

    NAND flash memory devices are well established in the consumer market. However, the architectures adopted in the consumer market are not necessarily suitable for mission-critical applications such as space. USB flash drives, digital cameras, and MP3 players are usually used to store "less significant" data that does not change frequently (e.g., MP3s, pictures, etc.). Therefore, in spite of NAND flash's drawbacks, only modest complexity is usually needed in the logic of commercial flash drives. Mission-critical applications, on the other hand, have different reliability requirements from commercial scenarios. Moreover, they usually operate in a hostile environment (e.g., space), which worsens all of these issues. We aim to provide practical, valuable guidelines, comparisons, and tradeoffs among the large number of dimensions of fault-tolerant methodologies for NAND flash applied to critical environments. We hope that such guidelines will be useful for our ongoing research and for all interested readers

    FLARES: an aging aware algorithm to autonomously adapt the error correction capability in NAND Flash memories

    With the advent of solid-state storage systems, NAND flash memories are becoming a key storage technology. However, they suffer from serious reliability and endurance issues during their operating lifetime, which can be handled by using appropriate error correction codes (ECC) to reconstruct the information when needed. Adaptable ECCs may provide the flexibility to avoid worst-case reliability design, thus leading to improved performance. However, a way to control the strength of such adaptable ECCs is required. This paper proposes FLARES, an algorithm able to adapt the ECC correction capability of each page of a flash memory based on a flash RBER prediction model and on a measurement of the number of errors detected in a given time window. FLARES has been fully implemented within the YAFFS 2 filesystem under the Linux operating system. This allowed us to perform an extensive set of simulations on a set of standard benchmarks, which highlighted the benefit of FLARES for the overall storage subsystem performance

    Performance and Reliability Analysis of Cross-Layer Optimizations of NAND Flash Controllers

    NAND flash memories are becoming the predominant technology in the implementation of mass storage systems for both embedded and high-performance applications. However, when considering data and code storage in non-volatile memories (NVMs), such as NAND flash memories, reliability and performance become a serious concern for system designers. Designing NAND flash-based systems around worst-case scenarios wastes resources in terms of performance, power consumption, and storage capacity. This clearly conflicts with the demand for run-time reconfigurability, adaptivity, and resource optimization in today's computing systems. There is a clear trend toward supporting differentiated access modes in flash memory controllers, each one setting a different trade-off point in the performance-reliability optimization space. This is supported by the possibility of tuning NAND flash memory performance, reliability, and power consumption by acting on several tuning knobs, such as the flash programming algorithm and the flash error correcting code. However, to successfully exploit these degrees of freedom, it is mandatory to clearly understand the effect that the combined tuning of these parameters has on the full NVM sub-system. This paper performs a comprehensive quantitative analysis of the benefits provided by the run-time reconfigurability of an MLC NAND flash controller through the combined effect of adaptable memory programming circuitry coupled with run-time adaptation of the ECC correction capability. The full non-volatile memory (NVM) sub-system is taken into account, from the characterization of the low-level circuitry to the effect of the adaptation on a wide set of realistic benchmarks, in order to give readers a clear picture of the benefit this combined adaptation would provide at the system level

    Towards Design and Analysis For High-Performance and Reliable SSDs

    NAND Flash-based Solid State Disks (SSDs) have many attractive technical merits, such as low power consumption, light weight, shock resistance, tolerance of hotter operating regimes, and extraordinarily high performance for random read access. These merits have made SSDs immensely popular and widely employed in different types of environments, including portable devices, personal computers, large data centers, and distributed data systems. However, current SSDs still suffer from several critical inherent limitations, such as the inability to update in place, asymmetric read and write performance, slow garbage collection processes, limited endurance, and degraded write performance with the adoption of MLC and TLC techniques. To alleviate these limitations, we propose optimizations at both the outside application layer and the SSDs' internal layer. Because SSDs are a good compromise between performance and price, they are widely deployed as second-layer caches sitting between DRAM and hard disks to boost system performance. Due to the special properties of SSDs, such as their internal garbage collection processes and limited lifetime, optimizations designed for traditional cache devices like DRAM and SRAM might not work consistently for SSD-based caches. Therefore, for the outside application layer, our work focuses on integrating the special properties of SSDs into the optimization of SSD caches. Moreover, our work also addresses the increased Flash write latency and ECC complexity caused by the adoption of MLC and TLC technologies by analyzing real-world workloads

    Cross-layer reliability evaluation, moving from the hardware architecture to the system level: A CLERECO EU project overview

    Advanced computing systems realized in forthcoming technologies hold the promise of a significant increase in computational capabilities. However, the same path that is leading technologies toward these remarkable achievements is also making electronic devices increasingly unreliable. Developing new methods to evaluate the reliability of these systems at an early design stage has the potential to save costs, produce optimized designs, and have a positive impact on the product time-to-market. The CLERECO European FP7 research project addresses early reliability evaluation with a cross-layer approach across different computing disciplines, across computing system layers, and across computing market segments. The fundamental objective of the project is to investigate in depth a methodology to assess system reliability early in the design cycle of the future systems of the emerging computing continuum. This paper presents a general overview of the CLERECO project, focusing on the main tools and models under development that could be of interest for the research community and engineering practice

    Exploiting task-based programming models for resilience

    Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuit ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events. Other constraints related to the diminishing sizes of transistors, such as power consumption and memory latency, have caused the microprocessor industry to turn to increasingly complex processor architectures. To solve the difficulties arising from programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer in the hardware-software abstraction stack that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems have access to a great deal of information, both from the hardware and from the applications, and thus offer many possibilities for optimisation. This thesis proposes solutions to the increasing fault rates in memory across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems. The first contribution of this thesis is an algorithm-based resilience technique that makes it possible to tolerate detected errors in memory. This technique recovers lost data by performing computations that rely on simple redundancy relations identified in the program. 
The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, which masks the already low overheads of this technique. The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and in assurances on the error rate. This metric reveals a key insight into data that is not relevant to the program, and we propose an OS-level strategy to ignore errors in such data by delaying the reporting of detected errors. This reduces the failure rates of running programs by ignoring errors that have no impact. The architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme that can increase the protection of memory regions where the impact of errors is highest. A methodology is presented to estimate the fault rate at runtime using our metric, through the performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online using the methodology's vulnerability estimates decreases the error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC. This provides a useful and wide range of trade-offs between redundancy and error rates. The work presented in this thesis demonstrates that runtime systems make it possible to get the most out of redundancy stored in memory, to help tackle increasing error rates in DRAM. This exploited redundancy can be an inherent part of algorithms, allowing them to tolerate higher fault rates, or can take the form of dead data stored in memory. Redundancy can also be added to a program in the form of ECC. In all cases, the runtime decreases failure rates efficiently by diminishing recovery costs, identifying redundant data, or targeting critical data. 
It is thus a very valuable tool for future computing systems, as it can perform optimisations across different layers of abstraction.
Memory errors become more common as silicon technologies shrink. Manufacturing variability and circuit ageing cause permanent and intermittent faults. Although these can be mitigated once identified, their continuous rate of appearance always causes unexpected errors. In addition, memory also suffers from transient faults against which it cannot be protected efficiently. These faults are caused by effects such as radiation or reduced voltage and frequency margins. Other concurrent constraints, such as power consumption and memory latency, have forced computer architectures to become increasingly complex. To program such processors, programming models based on runtime environments were developed. These systems form a new abstraction between hardware and software, performing tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime environments have a great deal of information about both the hardware and the applications, and thus offer many possibilities for optimisation. This thesis proposes solutions to memory faults across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, including operating system resilience strategies. The efficiency of these solutions depends on the opportunities presented by runtime environments. The first contribution of this thesis is an algorithm-level technique that corrects faults encountered while the program is running. 
To correct faults, simple redundancies have been identified in the program data for a whole class of algorithms, the Krylov subspace methods (conjugate gradient, GMRES, etc.). The data recovery strategy developed corrects errors without having to restart the algorithm, and takes advantage of the programming model to overlap the computations of the algorithm with those of the data recovery. The second part of this thesis proposes a metric to characterise the impact of faults in memory. This metric is more precise than state-of-the-art metrics and identifies data that is less relevant to the program. An operating-system-level strategy is proposed that delays the notification of detected errors, which makes it possible to ignore faults in such data and reduce the failure rate of the program. Finally, the architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme. This scheme can increase the protection of memory regions where the impact of errors is greatest. A methodology is presented to estimate the risk of failure at runtime using our metric, through the performance monitoring tools available in current processors. The ECC scheme, guided dynamically by these vulnerability estimates, decreases the failure rate of programs at a fraction of the redundancy cost required for a uniformly stronger ECC. The work presented in this thesis demonstrates that runtime environments make it possible to get the most out of the redundancy contained in memory, in order to contain the increase in memory errors. 
This exploited redundancy can be an inherent part of algorithms, allowing them to tolerate more faults, can take the form of unused data stored in memory, or can be added to the memory of a program in the form of ECC. In all cases, the runtime environment reduces the effects of faults efficiently by diminishing recovery costs, identifying redundant data, or focusing protection efforts on critical data.
Postprint (published version)

    The Fifth NASA Symposium on VLSI Design

    The fifth annual NASA Symposium on VLSI Design had 13 sessions including Radiation Effects, Architectures, Mixed Signal, Design Techniques, Fault Testing, Synthesis, Signal Processing, and other Featured Presentations. The symposium provides insights into developments in VLSI and digital systems which can be used to increase data systems performance. The presentations share insights into next generation advances that will serve as a basis for future VLSI design