4,538 research outputs found

    6T CMOS SRAM Stability in Nanoelectronic Era: From Metrics to Built-in Monitoring

    Get PDF
    The digital technology in the nanoelectronic era is based on intensive data processing and battery-based devices. As a consequence, the need for larger and energy-efficient circuits with large embedded memories is growing rapidly in current system-on-chip (SoC). In this context, where embedded SRAM yield dominate the overall SoC yield, the memory sensitivity to process variation and aging effects has aggressively increased. In addition, long-term aging effects introduce extra variability reducing the failure-free period. Therefore, although stability metrics are used intensively in the circuit design phases, more accurate and non-invasive methodologies must be proposed to observe the stability metric for high reliability systems. This chapter reviews the most extended memory cell stability metrics and evaluates the feasibility of tracking SRAM cell reliability evolution implementing a detailed bit-cell stability characterization measurement. The memory performance degradation observation is focused on estimating the threshold voltage (Vth) drift caused by process variation and reliability mechanisms. A novel SRAM stability degradation measurement architecture is proposed to be included in modern memory designs with minimal hardware intrusion. The new architecture may extend the failure-free period by introducing adaptable circuits depending on the measured memory stability parameter

    From FPGA to ASIC: A RISC-V processor experience

    Get PDF
    This work document a correct design flow using these tools in the Lagarto RISC- V Processor and the RTL design considerations that must be taken into account, to move from a design for FPGA to design for ASIC

    FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software

    Full text link
    A deep-learning inference accelerator is synthesized from a C-language software program parallelized with Pthreads. The software implementation uses the well-known producer/consumer model with parallel threads interconnected by FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into parallel FPGA hardware, translating software parallelism into spatial parallelism. A complete system is generated where convolution, pooling and padding are realized in the synthesized accelerator, with remaining tasks executing on an embedded ARM processor. The accelerator incorporates reduced precision, and a novel approach for zero-weight-skipping in convolution. On a mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective GOPS

    Hydra: An Accelerator for Real-Time Edge-Aware Permeability Filtering in 65nm CMOS

    Full text link
    Many modern video processing pipelines rely on edge-aware (EA) filtering methods. However, recent high-quality methods are challenging to run in real-time on embedded hardware due to their computational load. To this end, we propose an area-efficient and real-time capable hardware implementation of a high quality EA method. In particular, we focus on the recently proposed permeability filter (PF) that delivers promising quality and performance in the domains of HDR tone mapping, disparity and optical flow estimation. We present an efficient hardware accelerator that implements a tiled variant of the PF with low on-chip memory requirements and a significantly reduced external memory bandwidth (6.4x w.r.t. the non-tiled PF). The design has been taped out in 65 nm CMOS technology, is able to filter 720p grayscale video at 24.8 Hz and achieves a high compute density of 6.7 GFLOPS/mm2 (12x higher than embedded GPUs when scaled to the same technology node). The low area and bandwidth requirements make the accelerator highly suitable for integration into SoCs where silicon area budget is constrained and external memory is typically a heavily contended resource

    An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

    Full text link
    Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65nm technology, consumes less than 20mW on average at 0.8V achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition in 5.74pJ/op; and seizure detection with encrypted data collection from EEG within 12.7pJ/op.Comment: 15 pages, 12 figures, accepted for publication to the IEEE Transactions on Circuits and Systems - I: Regular Paper

    Microprocessor fault-tolerance via on-the-fly partial reconfiguration

    Get PDF
    This paper presents a novel approach to exploit FPGA dynamic partial reconfiguration to improve the fault tolerance of complex microprocessor-based systems, with no need to statically reserve area to host redundant components. The proposed method not only improves the survivability of the system by allowing the online replacement of defective key parts of the processor, but also provides performance graceful degradation by executing in software the tasks that were executed in hardware before a fault and the subsequent reconfiguration happened. The advantage of the proposed approach is that thanks to a hardware hypervisor, the CPU is totally unaware of the reconfiguration happening in real-time, and there's no dependency on the CPU to perform it. As proof of concept a design using this idea has been developed, using the LEON3 open-source processor, synthesized on a Virtex 4 FPG

    XNOR Neural Engine: a Hardware Accelerator IP for 21.6 fJ/op Binary Neural Network Inference

    Full text link
    Binary Neural Networks (BNNs) are promising to deliver accuracy comparable to conventional deep neural networks at a fraction of the cost in terms of memory and energy. In this paper, we introduce the XNOR Neural Engine (XNE), a fully digital configurable hardware accelerator IP for BNNs, integrated within a microcontroller unit (MCU) equipped with an autonomous I/O subsystem and hybrid SRAM / standard cell memory. The XNE is able to fully compute convolutional and dense layers in autonomy or in cooperation with the core in the MCU to realize more complex behaviors. We show post-synthesis results in 65nm and 22nm technology for the XNE IP and post-layout results in 22nm for the full MCU indicating that this system can drop the energy cost per binary operation to 21.6fJ per operation at 0.4V, and at the same time is flexible and performant enough to execute state-of-the-art BNN topologies such as ResNet-34 in less than 2.2mJ per frame at 8.9 fps.Comment: 11 pages, 8 figures, 2 tables, 3 listings. Accepted for presentation at CODES'18 and for publication in IEEE Transactions on Computer-Aided Design of Circuits and Systems (TCAD) as part of the ESWEEK-TCAD special issu
    corecore