12 research outputs found

    Towards Energy-Efficient and Reliable Computing: From Highly-Scaled CMOS Devices to Resistive Memories

    Get PDF
    The continuous increase in transistor density based on Moore\u27s Law has led us to highly scaled Complementary Metal-Oxide Semiconductor (CMOS) technologies. These transistor-based process technologies offer improved density as well as a reduction in nominal supply voltage. An analysis regarding different aspects of 45nm and 15nm technologies, such as power consumption and cell area to compare these two technologies is proposed on an IEEE 754 Single Precision Floating-Point Unit implementation. Based on the results, using the 15nm technology offers 4-times less energy and 3-fold smaller footprint. New challenges also arise, such as relative proportion of leakage power in standby mode that can be addressed by post-CMOS technologies. Spin-Transfer Torque Random Access Memory (STT-MRAM) has been explored as a post-CMOS technology for embedded and data storage applications seeking non-volatility, near-zero standby energy, and high density. Towards attaining these objectives for practical implementations, various techniques to mitigate the specific reliability challenges associated with STT-MRAM elements are surveyed, classified, and assessed herein. Cost and suitability metrics assessed include the area of nanomagmetic and CMOS components per bit, access time and complexity, Sense Margin (SM), and energy or power consumption costs versus resiliency benefits. In an attempt to further improve the Process Variation (PV) immunity of the Sense Amplifiers (SAs), a new SA has been introduced called Adaptive Sense Amplifier (ASA). ASA can benefit from low Bit Error Rate (BER) and low Energy Delay Product (EDP) by combining the properties of two of the commonly used SAs, Pre-Charge Sense Amplifier (PCSA) and Separated Pre-Charge Sense Amplifier (SPCSA). ASA can operate in either PCSA or SPCSA mode based on the requirements of the circuit such as energy efficiency or reliability. Then, ASA is utilized to propose a novel approach to actually leverage the PV in Non-Volatile Memory (NVM) arrays using Self-Organized Sub-bank (SOS) design. SOS engages the preferred SA alternative based on the intrinsic as-built behavior of the resistive sensing timing margin to reduce the latency and power consumption while maintaining acceptable access time

    MFPA: Mixed-Signal Field Programmable Array for Energy-Aware Compressive Signal Processing

    Get PDF
    Compressive Sensing (CS) is a signal processing technique which reduces the number of samples taken per frame to decrease energy, storage, and data transmission overheads, as well as reducing time taken for data acquisition in time-critical applications. The tradeoff in such an approach is increased complexity of signal reconstruction. While several algorithms have been developed for CS signal reconstruction, hardware implementation of these algorithms is still an area of active research. Prior work has sought to utilize parallelism available in reconstruction algorithms to minimize hardware overheads; however, such approaches are limited by the underlying limitations in CMOS technology. Herein, the MFPA (Mixed-signal Field Programmable Array) approach is presented as a hybrid spin-CMOS reconfigurable fabric specifically designed for implementation of CS data sampling and signal reconstruction. The resulting fabric consists of 1) slice-organized analog blocks providing amplifiers, transistors, capacitors, and Magnetic Tunnel Junctions (MTJs) which are configurable to achieving square/square root operations required for calculating vector norms, 2) digital functional blocks which feature 6-input clockless lookup tables for computation of matrix inverse, and 3) an MRAM-based nonvolatile crossbar array for carrying out low-energy matrix-vector multiplication operations. The various functional blocks are connected via a global interconnect and spin-based analog-to-digital converters. Simulation results demonstrate significant energy and area benefits compared to equivalent CMOS digital implementations for each of the functional blocks used: this includes an 80% reduction in energy and 97% reduction in transistor count for the nonvolatile crossbar array, 80% standby power reduction and 25% reduced area footprint for the clockless lookup tables, and roughly 97% reduction in transistor count for a multiplier built using components from the analog blocks. Moreover, the proposed fabric yields 77% energy reduction compared to CMOS when used to implement CS reconstruction, in addition to latency improvements

    The route to a defect tolerant LUT through artificial evolution

    Full text link

    A Multi-layer Fpga Framework Supporting Autonomous Runtime Partial Reconfiguration

    Get PDF
    Partial reconfiguration is a unique capability provided by several Field Programmable Gate Array (FPGA) vendors recently, which involves altering part of the programmed design within an SRAM-based FPGA at run-time. In this dissertation, a Multilayer Runtime Reconfiguration Architecture (MRRA) is developed, evaluated, and refined for Autonomous Runtime Partial Reconfiguration of FPGA devices. Under the proposed MRRA paradigm, FPGA configurations can be manipulated at runtime using on-chip resources. Operations are partitioned into Logic, Translation, and Reconfiguration layers along with a standardized set of Application Programming Interfaces (APIs). At each level, resource details are encapsulated and managed for efficiency and portability during operation. An MRRA mapping theory is developed to link the general logic function and area allocation information to the device related physical configuration level data by using mathematical data structure and physical constraints. In certain scenarios, configuration bit stream data can be read and modified directly for fast operations, relying on the use of similar logic functions and common interconnection resources for communication. A corresponding logic control flow is also developed to make the entire process autonomous. Several prototype MRRA systems are developed on a Xilinx Virtex II Pro platform. The Virtex II Pro on-chip PowerPC core and block RAM are employed to manage control operations while multiple physical interfaces establish and supplement autonomous reconfiguration capabilities. Area, speed and power optimization techniques are developed based on the developed Xilinx prototype. Evaluations and analysis of these prototype and techniques are performed on a number of benchmark and hashing algorithm case studies. The results indicate that based on a variety of test benches, up to 70% reduction in the resource utilization, up to 50% improvement in power consumption, and up to 10 times increase in run-time performance are achieved using the developed architecture and approaches compared with Xilinx baseline reconfiguration flow. Finally, a Genetic Algorithm (GA) for a FPGA fault tolerance case study is evaluated as a ultimate high-level application running on this architecture. It demonstrated that this is a hardware and software infrastructure that enables an FPGA to dynamically reconfigure itself efficiently under the control of a soft microprocessor core that is instantiated within the FPGA fabric. Such a system contributes to the observed benefits of intelligent control, fast reconfiguration, and low overhead

    Design Disjunction for Resilient Reconfigurable Hardware

    Get PDF
    Contemporary reconfigurable hardware devices have the capability to achieve high performance, power efficiency, and adaptability required to meet a wide range of design goals. With scaling challenges facing current complementary metal oxide semiconductor (CMOS), new concepts and methodologies supporting efficient adaptation to handle reliability issues are becoming increasingly prominent. Reconfigurable hardware and their ability to realize self-organization features are expected to play a key role in designing future dependable hardware architectures. However, the exponential increase in density and complexity of current commercial SRAM-based field-programmable gate arrays (FPGAs) has escalated the overhead associated with dynamic runtime design adaptation. Traditionally, static modular redundancy techniques are considered to surmount this limitation; however, they can incur substantial overheads in both area and power requirements. To achieve a better trade-off among performance, area, power, and reliability, this research proposes design-time approaches that enable fine selection of redundancy level based on target reliability goals and autonomous adaptation to runtime demands. To achieve this goal, three studies were conducted: First, a graph and set theoretic approach, named Hypergraph-Cover Diversity (HCD), is introduced as a preemptive design technique to shift the dominant costs of resiliency to design-time. In particular, union-free hypergraphs are exploited to partition the reconfigurable resources pool into highly separable subsets of resources, each of which can be utilized by the same synthesized application netlist. The diverse implementations provide reconfiguration-based resilience throughout the system lifetime while avoiding the significant overheads associated with runtime placement and routing phases. Evaluation on a Motion-JPEG image compression core using a Xilinx 7-series-based FPGA hardware platform has demonstrated the potential of the proposed FT method to achieve 37.5% area saving and up to 66% reduction in power consumption compared to the frequently-used TMR scheme while providing superior fault tolerance. Second, Design Disjunction based on non-adaptive group testing is developed to realize a low-overhead fault tolerant system capable of handling self-testing and self-recovery using runtime partial reconfiguration. Reconfiguration is guided by resource grouping procedures which employ non-linear measurements given by the constructive property of f-disjunctness to extend runtime resilience to a large fault space and realize a favorable range of tradeoffs. Disjunct designs are created using the mosaic convergence algorithm developed such that at least one configuration in the library evades any occurrence of up to d resource faults, where d is lower-bounded by f. Experimental results for a set of MCNC and ISCAS benchmarks have demonstrated f-diagnosability at the individual slice level with average isolation resolution of 96.4% (94.4%) for f=1 (f=2) while incurring an average critical path delay impact of only 1.49% and area cost roughly comparable to conventional 2-MR approaches. Finally, the proposed Design Disjunction method is evaluated as a design-time method to improve timing yield in the presence of large random within-die (WID) process variations for application with a moderately high production capacity

    Autonomous Recovery Of Reconfigurable Logic Devices Using Priority Escalation Of Slack

    Get PDF
    Field Programmable Gate Array (FPGA) devices offer a suitable platform for survivable hardware architectures in mission-critical systems. In this dissertation, active dynamic redundancy-based fault-handling techniques are proposed which exploit the dynamic partial reconfiguration capability of SRAM-based FPGAs. Self-adaptation is realized by employing reconfiguration in detection, diagnosis, and recovery phases. To extend these concepts to semiconductor aging and process variation in the deep submicron era, resilient adaptable processing systems are sought to maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. A new approach to autonomous fault-handling which addresses these goals is developed using only a uniplex hardware arrangement. It operates by observing a health metric to achieve Fault Demotion using Recon- figurable Slack (FaDReS). Here an autonomous fault isolation scheme is employed which neither requires test vectors nor suspends the computational throughput, but instead observes the value of a health metric based on runtime input. The deterministic flow of the fault isolation scheme guarantees success in a bounded number of reconfigurations of the FPGA fabric. FaDReS is then extended to the Priority Using Resource Escalation (PURE) online redundancy scheme which considers fault-isolation latency and throughput trade-offs under a dynamic spare arrangement. While deep-submicron designs introduce new challenges, use of adaptive techniques are seen to provide several promising avenues for improving resilience. The scheme developed is demonstrated by hardware design of various signal processing circuits and their implementation on a Xilinx Virtex-4 FPGA device. These include a Discrete Cosine Transform (DCT) core, Motion Estimation (ME) engine, Finite Impulse Response (FIR) Filter, Support Vector Machine (SVM), and Advanced Encryption Standard (AES) blocks in addition to MCNC benchmark circuits. A iii significant reduction in power consumption is achieved ranging from 83% for low motion-activity scenes to 12.5% for high motion activity video scenes in a novel ME engine configuration. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32dB. The diagnosability, reconfiguration latency, and resource overhead of each approach is analyzed. Compared to previous alternatives, PURE maintains a PSNR within a difference of 4.02dB to 6.67dB from the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The results indicate the benefits of priority-aware resiliency over conventional redundancy approaches in terms of fault-recovery, power consumption, and resource-area requirements. Together, these provide a broad range of strategies to achieve autonomous recovery of reconfigurable logic devices under a variety of constraints, operating conditions, and optimization criteria

    Leveraging Signal Transfer Characteristics and Parasitics of Spintronic Circuits for Area and Energy-Optimized Hybrid Digital and Analog Arithmetic

    Get PDF
    While Internet of Things (IoT) sensors offer numerous benefits in diverse applications, they are limited by stringent constraints in energy, processing area and memory. These constraints are especially challenging within applications such as Compressive Sensing (CS) and Machine Learning (ML) via Deep Neural Networks (DNNs), which require dot product computations on large data sets. A solution to these challenges has been offered by the development of crossbar array architectures, enabled by recent advances in spintronic devices such as Magnetic Tunnel Junctions (MTJs). Crossbar arrays offer a compact, low-energy and in-memory approach to dot product computation in the analog domain by leveraging intrinsic signal-transfer characteristics of the embedded MTJ devices. The first phase of this dissertation research seeks to build on these benefits by optimizing resource allocation within spintronic crossbar arrays. A hardware approach to non-uniform CS is developed, which dynamically configures sampling rates by deriving necessary control signals using circuit parasitics. Next, an alternate approach to non-uniform CS based on adaptive quantization is developed, which reduces circuit area in addition to energy consumption. Adaptive quantization is then applied to DNNs by developing an architecture allowing for layer-wise quantization based on relative robustness levels. The second phase of this research focuses on extension of the analog computation paradigm by development of an operational amplifier-based arithmetic unit for generalized scalar operations. This approach allows for 95% area reduction in scalar multiplications, compared to the state-of-the-art digital alternative. Moreover, analog computation of enhanced activation functions allows for significant improvement in DNN accuracy, which can be harnessed through triple modular redundancy to yield 81.2% reduction in power at the cost of only 4% accuracy loss, compared to a larger network. Together these results substantiate promising approaches to several challenges facing the design of future IoT sensors within the targeted applications of CS and ML

    Consensus-based evaluation for fault isolation and on-line evolutionary regeneration

    No full text
    While the fault repair capability of Evolvable Hardware (EH) approaches have been previously demonstrated, further improvements to fault handling capability can be achieved by exploiting population diversity during a phases of the fault handling process. A new paradigm for online EH regeneration using Genetic Algorithms (GAs) called Consensus Based Evaluation (CBE) is developed where the performance of individuals is assessed based on broad consensus of the population instead of a conventional fitness function. Adoption of CBE enables information contained in the population to not only enrich the evolutionary process, but also support fault detection and isolation. On-line regeneration of functionality is achieved without additional test vectors by using the results of competitions between individuals in the population. Relative fitness measures support adaptation of the fitness evaluation procedure to support graceful degredation even in the presence of unpredictable changes in the operational environment, inputs, or the FPGA application. Application of CBE to FPGA-based multipliers demonstrates 100% isolation of randomily injected stuck-at faults and evolution of a complete regeneration within 135 repair iterations while precluding the propagation of any discrepant output. The throughput of the system is maintained at 85.35% throughout the repair process

    Consensus-Based Evaluation For Fault Isolation And On-Line Evolutionary Regeneration

    No full text
    While the fault repair capability of Evolvable Hardware (EH) approaches have been previously demonstrated, further improvements to fault handling capability can be achieved by exploiting population diversity during all phases of the fault handling process. A new paradigm for online EH regeneration using Genetic Algorithms (GAs) called Consensus Based Evaluation (CBE) is developed where the performance of individuals is assessed based on broad consensus of the population instead of a conventional fitness function. Adoption of CBE enables information contained in the population to not only enrich the evolutionary process, but also support fault detection and isolation. On-line regeneration of functionality is achieved without additional test vectors by using the results of competitions between individuals in the population. Relative fitness measures support adaptation of the fitness evaluation procedure to support graceful degredation even in the presence of unpredictable changes in the operational environment, inputs, or the FPGA application. Application of CBE to FPGA-based multipliers demonstrates 100% isolation of randomly injected stuck-at faults and evolution of a complete regeneration within 135 repair iterations while precluding the propagation of any discrepant output. The throughput of the system is maintained at 85.35% throughout the repair process. © Springer-Verlag Berlin Heidelberg 2005
    corecore