2,786 research outputs found

    Sustainable Fault-handling Of Reconfigurable Logic Using Throughput-driven Assessment

    A sustainable Evolvable Hardware (EH) system is developed for SRAM-based reconfigurable Field Programmable Gate Arrays (FPGAs) using outlier detection and group-testing-based assessment principles. The fault diagnosis methods presented herein leverage throughput-driven, relative fitness assessment to maintain resource viability autonomously. Group-testing-based techniques are developed for adaptive, input-driven fault isolation in FPGAs without the need for exhaustive testing or coding-based evaluation. The techniques keep the device operational and, when possible, generate validated outputs throughout the repair process. Adaptive fault isolation methods based on discrepancy-enabled pairwise comparisons are developed. By observing the discrepancy characteristics of multiple Concurrent Error Detection (CED) configurations, a method for robust detection of faults is developed based on pairwise parallel evaluation using Discrepancy Mirror logic. The results from the analytical FPGA model are demonstrated via a self-healing, self-organizing evolvable hardware system. Reconfigurability of the SRAM-based FPGA is leveraged to identify logic resource faults, which are successively excluded by group testing using alternate device configurations. This simplifies the system architect's role to defining the functionality in a high-level Hardware Description Language (HDL) and choosing a system-level performance-versus-availability operating point. System availability, throughput, and mean time to isolate faults are monitored and maintained using an Observer-Controller model. Results are demonstrated using a Data Encryption Standard (DES) core that occupies approximately 305 FPGA slices on a Xilinx Virtex-II Pro FPGA. With a single simulated stuck-at fault, the system identifies a completely validated replacement configuration within three to five positive tests. The approach demonstrates a readily implemented yet robust organic hardware application framework featuring a high degree of autonomous self-control.
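
    As a rough illustration of the group-testing idea described above, the following Python sketch (a toy model with made-up configuration and resource names, not the dissertation's actual flow) treats each alternate configuration as a set of logic resources: configurations whose outputs agree under a pairwise discrepancy check validate the resources they use, and the resource that is never cleared remains the suspect.

        import itertools

        # Toy model: each alternate configuration maps the design onto a subset of
        # logic resources; resource 4 carries a simulated stuck-at fault.
        CONFIGS = {
            "cfg_a": {1, 2, 3},
            "cfg_b": {3, 4, 5},
            "cfg_c": {1, 4, 6},
            "cfg_d": {2, 5, 6},
        }
        FAULTY = {4}  # unknown to the isolation procedure

        def output_of(cfg):
            # Throughput-driven evaluation on runtime inputs: a configuration that
            # touches the faulty resource corrupts its result in a config-specific way.
            return "golden" if not (CONFIGS[cfg] & FAULTY) else "corrupt:" + cfg

        def isolate():
            suspects = set().union(*CONFIGS.values())
            for x, y in itertools.combinations(CONFIGS, 2):
                if output_of(x) == output_of(y):
                    # The pair agrees (the discrepancy-mirror check passes), so both
                    # members are validated and the resources they use are cleared.
                    suspects -= CONFIGS[x] | CONFIGS[y]
            return suspects

        print(isolate())  # {4}: the only resource never covered by an agreeing pair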

    Autonomous Recovery Of Reconfigurable Logic Devices Using Priority Escalation Of Slack

    Field Programmable Gate Array (FPGA) devices offer a suitable platform for survivable hardware architectures in mission-critical systems. In this dissertation, active dynamic redundancy-based fault-handling techniques are proposed which exploit the dynamic partial reconfiguration capability of SRAM-based FPGAs. Self-adaptation is realized by employing reconfiguration in the detection, diagnosis, and recovery phases. To extend these concepts to semiconductor aging and process variation in the deep-submicron era, resilient adaptable processing systems are sought to maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. A new approach to autonomous fault-handling which addresses these goals is developed using only a uniplex hardware arrangement. It operates by observing a health metric to achieve Fault Demotion using Reconfigurable Slack (FaDReS). Here, an autonomous fault isolation scheme is employed that neither requires test vectors nor suspends computational throughput, but instead observes the value of a health metric derived from runtime inputs. The deterministic flow of the fault isolation scheme guarantees success in a bounded number of reconfigurations of the FPGA fabric. FaDReS is then extended to the Priority Using Resource Escalation (PURE) online redundancy scheme, which considers fault-isolation latency and throughput trade-offs under a dynamic spare arrangement. While deep-submicron designs introduce new challenges, the use of adaptive techniques is seen to provide several promising avenues for improving resilience. The schemes developed are demonstrated by the hardware design of various signal processing circuits and their implementation on a Xilinx Virtex-4 FPGA device. These include a Discrete Cosine Transform (DCT) core, Motion Estimation (ME) engine, Finite Impulse Response (FIR) Filter, Support Vector Machine (SVM), and Advanced Encryption Standard (AES) blocks, in addition to MCNC benchmark circuits. A significant reduction in power consumption is achieved, ranging from 83% for low-motion-activity scenes to 12.5% for high-motion-activity video scenes, in a novel ME engine configuration. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32 dB. The diagnosability, reconfiguration latency, and resource overhead of each approach are analyzed. Compared to previous alternatives, PURE maintains a PSNR within 4.02 dB to 6.67 dB of the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The results indicate the benefits of priority-aware resiliency over conventional redundancy approaches in terms of fault recovery, power consumption, and resource-area requirements. Together, these provide a broad range of strategies to achieve autonomous recovery of reconfigurable logic devices under a variety of constraints, operating conditions, and optimization criteria.
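
    The deterministic fault-isolation flow with reconfigurable slack can be sketched as a divide-and-conquer search (Python; the slot names, health check, and relocation callback are hypothetical stand-ins for dynamic partial reconfiguration, and the real FaDReS/PURE schemes additionally weigh isolation latency against throughput): half of the suspect regions are vacated onto slack resources, the runtime health metric is observed, and the suspect set is halved until a single faulty region remains, which bounds the number of reconfigurations.

        def isolate_faulty_slot(slots, health_ok, vacate):
            """Halve the suspect set on each reconfiguration: vacate half of the
            suspect slots onto slack resources and observe the health metric."""
            suspects = list(slots)
            reconfigs = 0
            while len(suspects) > 1:
                half = suspects[: len(suspects) // 2]
                vacate(half)                 # stands in for dynamic partial reconfiguration
                reconfigs += 1
                if health_ok():              # metric recovers -> the fault lies in `half`
                    suspects = half
                else:                        # fault is among the slots still occupied
                    suspects = suspects[len(half):]
            return suspects[0], reconfigs

        # Toy harness: slot5 is faulty; the health metric is healthy only while
        # slot5 hosts no active function.
        ALL = ["slot%d" % i for i in range(8)]
        occupied = set(ALL)

        def vacate(group):
            occupied.clear()
            occupied.update(s for s in ALL if s not in group)

        def health_ok():
            return "slot5" not in occupied

        print(isolate_faulty_slot(ALL, health_ok, vacate))  # ('slot5', 3) reconfigurations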

    A Holistic Solution for Reliability of 3D Parallel Systems

    As device scaling slows down, emerging technologies such as 3D integration and carbon nanotube field-effect transistors are among the most promising solutions to increase device density and performance. These emerging technologies offer shorter interconnects, higher performance, and lower power. However, higher operating temperatures and current densities project significantly higher failure rates. Moreover, due to the infancy of the manufacturing process and its high variation and defect densities, chip designers are reluctant to consider these emerging technologies as a stand-alone replacement for silicon-based transistors. The goal of this dissertation is to introduce new architectural and circuit techniques that can work around the high fault rates of emerging 3D technologies, delivering performance and reliability comparable to silicon. We propose a new holistic approach to the reliability problem that addresses the necessary aspects of an effective solution, such as detection, diagnosis, repair, and prevention, synergistically for a practical solution. By leveraging 3D fabric layouts, the dissertation proposes an underlying architecture that efficiently repairs the system in the presence of faults. This thesis presents a fault detection scheme that re-executes instructions on idle identical units, distinguishing between transient and permanent faults while localizing them to the granularity of a pipeline stage. Furthermore, with the use of a dynamic and adaptive reconfiguration policy based on activity factors and temperature variation, we propose a framework that delivers a significant improvement in lifetime management to prevent faults due to aging. Finally, a design framework is presented that can be used for large-scale chip production while mitigating the yield and variation failures involved in bringing up carbon nanotube-based technology. The proposed framework is capable of efficiently supporting high-variation technologies by providing protection against manufacturing defects at different granularities: module and pipeline-stage levels.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/168118/1/javadb_1.pd
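
    The transient-versus-permanent classification by re-execution can be sketched as follows (Python; the unit models and retry count are illustrative assumptions, and the actual scheme localizes faults to individual pipeline stages): an instruction whose duplicated execution disagrees is retried, and a disagreement that persists is classified as a permanent fault.

        import random

        def make_unit(permanent_fault=False, transient_rate=0.0):
            """Model of a functional unit: a permanent fault always corrupts the
            result, while a transient (soft) error corrupts it only occasionally."""
            def run(operand):
                if permanent_fault or random.random() < transient_rate:
                    return operand ^ 1
                return operand
            return run

        def classify(unit, idle_twin, operand, retries=2):
            """Re-execute on an idle identical unit; persistent disagreement
            indicates a permanent fault, a vanishing one a transient fault."""
            if unit(operand) == idle_twin(operand):
                return "no fault observed"
            for _ in range(retries):
                if unit(operand) == idle_twin(operand):
                    return "transient"
            return "permanent"

        golden = make_unit()
        print(classify(make_unit(transient_rate=0.3), golden, 7))   # usually transient or no fault
        print(classify(make_unit(permanent_fault=True), golden, 7)) # permanent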

    Reliability Improvement and Performance Optimization Techniques for Ultra-Large Solid State Drives

    Thesis (Ph.D.), Department of Computer Science and Engineering, College of Engineering, Seoul National University Graduate School, August 2021. Advisor: Jihong Kim. The development of ultra-large NAND flash storage devices (SSDs) has recently been made possible by NAND flash memory process scaling and multi-leveling techniques, together with NAND packaging technology that allows storage capacity to keep growing by mounting many NAND flash memory dies in a single SSD. As the capacity of an SSD increases, the total cost of ownership of the storage system can be reduced very effectively; however, the reliability and performance limitations of ultra-large SSDs remain obstacles to their wide adoption. In order to take advantage of ultra-large SSDs, new techniques are needed to address these reliability and performance issues. In this dissertation, we propose various optimization techniques to solve the reliability and performance issues of ultra-large SSDs. In order to overcome the optimization limitations of existing approaches, our techniques are designed based on extensive characterization of NAND flash devices and analyses of the field failure characteristics of real SSDs. We first propose a low-stress erase technique for reducing the characteristic deviation between wordlines (WLs) in a NAND flash block. By reducing the erase stress on weak WLs, it effectively slows down NAND degradation and improves NAND endurance. From the NAND evaluation results, the conditions that can most effectively guard the weak WLs are defined as the GErase mode. In addition, considering user workload characteristics, we propose a technique to dynamically select the optimal GErase mode that maximizes the lifetime of the SSD. Secondly, we propose an integrated approach that maximizes the efficiency of copyback operations to improve performance while not compromising data reliability. Based on characterization using real 3D TLC flash chips, we propose a novel per-block error propagation model under consecutive copyback operations. Our model significantly increases the number of successive copybacks by exploiting the aging characteristics of NAND blocks. Furthermore, we devise a resource-efficient error management scheme that can handle successive copybacks where pages move around multiple blocks with different reliability. By utilizing the proposed copyback operation for internal data movement, SSD performance can be effectively improved without reliability issues. Finally, we propose a new recovery scheme, called reparo, for a RAID storage system with ultra-large SSDs. Unlike existing RAID recovery schemes, reparo repairs a failed SSD at the NAND die granularity without replacing it with a new SSD, thus avoiding most of the inter-SSD data copies during a RAID recovery step. When a NAND die of an SSD fails, reparo exploits the multi-core processor of the SSD controller to identify failed LBAs on the failed NAND die and to recover data from those LBAs. Furthermore, reparo ensures no negative post-recovery impact on the performance and lifetime of the repaired SSD. In order to evaluate the effectiveness of the proposed techniques, we implemented them in a storage device prototype, an open NAND flash storage device development environment, and a real SSD environment, and their usefulness was verified using various benchmarks and I/O traces collected from real-world applications. The experimental results show that the reliability and performance of ultra-large SSDs can be effectively improved through the proposed techniques.
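
    The copyback error-management idea can be sketched with a per-page budget that depends on block wear (Python; the thresholds and structure names are illustrative assumptions, whereas the dissertation derives its limits from a measured per-block error-propagation model on real 3D TLC chips): a page may be moved on-chip by copyback only while its accumulated copyback count stays within the limit of the more worn of the two blocks involved; otherwise the data takes the slower read-correct-program path through the controller, which resets the budget.

        from dataclasses import dataclass

        @dataclass
        class Block:
            pe_cycles: int          # program/erase cycles (wear) of the block

        @dataclass
        class Page:
            copybacks_since_scrub: int = 0
            src_pe: int = 0         # wear of the block currently holding the page

        def copyback_limit(pe_cycles):
            # Illustrative thresholds standing in for the error-propagation model.
            if pe_cycles < 1000:
                return 5
            if pe_cycles < 3000:
                return 3
            return 1

        def migrate_page(page, dst, nand_copyback, read_correct_program):
            limit = min(copyback_limit(page.src_pe), copyback_limit(dst.pe_cycles))
            if page.copybacks_since_scrub < limit:
                nand_copyback(page, dst)          # on-chip move: fast, but errors accumulate
                page.copybacks_since_scrub += 1
            else:
                read_correct_program(page, dst)   # controller ECC pass resets the budget
                page.copybacks_since_scrub = 0
            page.src_pe = dst.pe_cycles

        page, young, old = Page(), Block(pe_cycles=500), Block(pe_cycles=4000)
        migrate_page(page, young, lambda p, b: None, lambda p, b: None)   # uses copyback
        migrate_page(page, old, lambda p, b: None, lambda p, b: None)     # old block: budget 1, ECC path
        print(page.copybacks_since_scrub)                                  # 0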

    Reliability-aware memory design using advanced reconfiguration mechanisms

    Fast and complex data memory systems have become a necessity in the computational units of today's integrated circuits. These memory systems are integrated in the form of large embedded memories for data manipulation and storage. This goal has been achieved by the aggressive scaling of transistor dimensions to few-nanometer (nm) sizes; such progress, however, comes with a drawback, making it difficult to obtain high chip yields. Process variability, due to manufacturing imperfections, along with temporal aging, mainly induced by higher electric fields and temperatures, are two of the most significant threats that can no longer be ignored in nanoscale embedded memory circuits and can have a high impact on their robustness. Static Random Access Memory (SRAM) is one of the most widely used embedded memories; it is generally implemented with the smallest device dimensions, and its robustness is therefore highly important in the nanometer design paradigm. Reliable operation needs to be considered and achieved both at the cell level and in the architectural design of SRAM arrays. Recently, as design generations approach and go below 10 nm, novel non-FET devices such as memristors are attracting considerable attention as possible candidates to replace conventional memory technologies. Despite favorable characteristics such as low power and high scalability, they also suffer from reliability challenges, such as process variability and endurance degradation, which need to be mitigated at the device and architectural levels. This thesis tackles these reliability concerns in memories by utilizing advanced reconfiguration techniques. Novel reconfiguration strategies that can extend memory lifetime are considered and analyzed for both SRAM arrays and memristive crossbar memories. These techniques include monitoring circuits that check the reliability status of the memory units, and architectural mechanisms that reconfigure the memory system into a more reliable configuration before a failure happens.
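
    A minimal sketch of the monitor-and-reconfigure idea (Python; the aging score, guard threshold, and spare-row names are illustrative assumptions rather than the thesis's monitoring circuits): when a monitored row's degradation estimate crosses a guard threshold, its contents are remapped to a spare row before an actual failure occurs, and subsequent accesses are transparently redirected.

        SPARE_ROWS = ["spare0", "spare1"]       # redundant rows reserved at design time
        GUARD_THRESHOLD = 0.8                   # remap before the predicted failure point
        remap = {}                              # logical row -> replacement row

        def on_monitor_sample(row, aging_score):
            """Called by the (simulated) degradation monitor for each sampled row."""
            if aging_score >= GUARD_THRESHOLD and row not in remap and SPARE_ROWS:
                remap[row] = SPARE_ROWS.pop(0)  # reconfigure ahead of failure

        def resolve(row):
            """Address translation applied on every read/write access."""
            return remap.get(row, row)

        on_monitor_sample("row17", 0.85)
        print(resolve("row17"), resolve("row3"))  # -> spare0 row3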

    DECISION SUPPORT MODEL IN FAILURE-BASED COMPUTERIZED MAINTENANCE MANAGEMENT SYSTEM FOR SMALL AND MEDIUM INDUSTRIES

    A maintenance decision support system is crucial to ensure the maintainability and reliability of equipment in production lines. This thesis investigates several decision support models to aid maintenance management activities in small and medium industries. In order to improve the reliability of resources in production lines, the study introduces a conceptual framework to be used in failure-based maintenance. Maintenance strategies are identified using the Decision-Making Grid model, based on two important factors: the machines' downtimes and their frequency of failures. The machines are categorized into three downtime and failure-frequency levels: high, medium, and low. The research derives a formula based on maintenance cost to re-position the machines prior to the Decision-Making Grid analysis. Subsequently, the clustering formula in the Decision-Making Grid model is improved to solve multiple-criteria problems. The work also introduces a formula to estimate a contractor's response and repair time. The estimates are used as input parameters in the Analytical Hierarchy Process model. Decisions are synthesized using models based on the contractors' technical skills, such as experience in maintenance, skill in diagnosing machines, and ability to take prompt action during troubleshooting activities. Another important criterion considered in the Analytical Hierarchy Process is the contractors' business principles, which include maintenance quality, tools, equipment, and enthusiasm in problem-solving. Raw data were collected through observation, interviews, and surveys in case studies to understand risk factors in small and medium food-processing industries. The risk factors are analysed with the Ishikawa fishbone diagram to reveal delay times in machinery maintenance. Experimental studies are conducted using maintenance records from food-processing industries. The Decision-Making Grid model can detect the ten worst production machines on the production lines. The Analytical Hierarchy Process model is used to rank the contractors and their best maintenance practices. The research recommends displaying the results on the production's indicator boards and implementing the strategies on the production shop floor. The proposed models can be used by decision makers to identify maintenance strategies and enhance competitiveness among contractors in failure-based maintenance. The models can be programmed as decision support sub-procedures in computerized maintenance management systems.
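
    The Decision-Making Grid step can be illustrated with a simple lookup (Python; the category cut-offs and the strategy placed in each cell follow one common variant of the grid and are assumptions, not the calibrated grid or the cost-based repositioning formula developed in the thesis): each machine is placed on a 3x3 grid by its downtime and failure-frequency category, and the cell suggests a maintenance strategy.

        # Illustrative cell assignments for the Decision-Making Grid (assumed variant).
        STRATEGY = {
            ("low",    "low"):    "operate to failure",
            ("low",    "medium"): "fixed-time maintenance",
            ("low",    "high"):   "skill-level upgrade",
            ("medium", "low"):    "fixed-time maintenance",
            ("medium", "medium"): "fixed-time maintenance",
            ("medium", "high"):   "fixed-time maintenance",
            ("high",   "low"):    "condition-based maintenance",
            ("high",   "medium"): "fixed-time maintenance",
            ("high",   "high"):   "design-out maintenance",
        }

        def categorize(value, low_cut, high_cut):
            return "low" if value <= low_cut else "high" if value > high_cut else "medium"

        def recommend(downtime_hours, failures, downtime_cuts=(10, 50), freq_cuts=(2, 5)):
            d = categorize(downtime_hours, *downtime_cuts)   # downtime category
            f = categorize(failures, *freq_cuts)             # failure-frequency category
            return STRATEGY[(d, f)]

        print(recommend(60, 7))   # high downtime, high frequency -> design-out maintenance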

    Asset management strategies for power electronic converters in transmission networks: Application to HVdc and FACTS devices

    Increasing transmission capacity while enhancing reliability and sustainability through operating-cost reduction has become a major objective of electric utilities worldwide. Power electronics have contributed to this goal for decades by providing additional flexibility and controllability to power systems. Among power-electronic-based assets, high-voltage dc (HVdc) transmission systems and flexible ac transmission systems (FACTS) controllers have played a substantial role in sustainable grid infrastructure. Recent advancements in power semiconductor devices, in particular in voltage source converter based technology, have facilitated the widespread application of HVdc systems and FACTS devices in transmission networks. Converters with larger power ratings and higher numbers of switches have been increasingly deployed for bulk power transfer and large-scale renewable integration, increasing the need to manage power converter assets optimally and efficiently. To this end, this paper reviews the state of the art of asset management strategies in the power industry and indicates the research challenges associated with the management of high-power converter assets. Emphasis is placed on the following aspects: condition monitoring, maintenance policies, and ageing and failure mechanisms. Within this context, the use of a physics-of-failure based assessment for the life-cycle management of power converter assets is introduced and discussed.
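
    As a hedged illustration of a physics-of-failure style assessment (Python; the activation energy, reference life, and mission profile below are assumed values, not parameters from the paper), thermal acceleration via the Arrhenius relation combined with linear damage accumulation gives a consumed-life fraction over a converter's temperature profile.

        import math

        K_B = 8.617e-5            # Boltzmann constant, eV/K
        E_A = 0.7                 # assumed activation energy, eV
        L_REF_HOURS = 100_000.0   # assumed rated life at the reference temperature
        T_REF = 273.15 + 40       # reference junction temperature, K

        def life_at(temp_c):
            """Expected life at a constant junction temperature, scaled by Arrhenius."""
            t = 273.15 + temp_c
            af = math.exp((E_A / K_B) * (1.0 / T_REF - 1.0 / t))  # acceleration vs. T_REF
            return L_REF_HOURS / af

        def consumed_life(profile):
            """profile: list of (junction_temp_C, hours); returns the damage fraction."""
            return sum(hours / life_at(t) for t, hours in profile)

        # Example mission profile: 6000 h/year at 55 C plus 2000 h/year at 80 C.
        print("annual damage fraction: %.1f%%" % (100 * consumed_life([(55, 6000), (80, 2000)])))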

    Redundant disk arrays: Reliable, parallel secondary storage

    During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays provide the cost, volume, and capacity of current disk subsystems but, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among the alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures.
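
    The parity-coding claim at the end can be made concrete in a few lines (Python; the block contents and single-parity layout are illustrative): the parity block is the XOR of the data blocks across a stripe, so the contents of any single failed, self-identifying disk are recovered by XOR-ing the surviving blocks.

        from functools import reduce

        def parity(blocks):
            """Compute the parity block for one stripe across equal-length blocks."""
            return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

        def reconstruct(surviving_blocks):
            """Rebuild the missing block of a stripe from the other blocks plus parity."""
            return parity(surviving_blocks)     # XOR is its own inverse

        data = [b"disk0 block", b"disk1 block", b"disk2 block"]
        p = parity(data)
        lost = 1                                # disk 1 fails and identifies itself
        rebuilt = reconstruct([blk for i, blk in enumerate(data) if i != lost] + [p])
        assert rebuilt == data[lost]
        print(rebuilt)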