90 research outputs found

    Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs

    Full text link
    Convolution is a fundamental operation in many applications, such as computer vision, natural language processing, image processing, etc. Recent successes of convolutional neural networks in various deep learning applications put even higher demand on fast convolution. The high computation throughput and memory bandwidth of graphics processing units (GPUs) make GPUs a natural choice for accelerating convolution operations. However, maximally exploiting the available memory bandwidth of GPUs for convolution is a challenging task. This paper introduces a general model to address the mismatch between the memory bank width of GPUs and computation data width of threads. Based on this model, we develop two convolution kernels, one for the general case and the other for a special case with one input channel. By carefully optimizing memory access patterns and computation patterns, we design a communication-optimized kernel for the special case and a communication-reduced kernel for the general case. Experimental data based on implementations on Kepler GPUs show that our kernels achieve 5.16X and 35.5% average performance improvement over the latest cuDNN library, for the special case and the general case, respectively

    Computing-In-Memory Neural Network Accelerators for Safety-Critical Systems: Can Small Device Variations Be Disastrous?

    Full text link
    Computing-in-Memory (CiM) architectures based on emerging non-volatile memory (NVM) devices have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, NVM devices suffer from various non-idealities, especially device-to-device variations due to fabrication defects and cycle-to-cycle variations due to the stochastic behavior of devices. As such, the DNN weights actually mapped to NVM devices could deviate significantly from the expected values, leading to large performance degradation. To address this issue, most existing works focus on maximizing average performance under device variations. This objective would work well for general-purpose scenarios. But for safety-critical applications, the worst-case performance must also be considered. Unfortunately, this has been rarely explored in the literature. In this work, we formulate the problem of determining the worst-case performance of CiM DNN accelerators under the impact of device variations. We further propose a method to effectively find the specific combination of device variation in the high-dimensional space that leads to the worst-case performance. We find that even with very small device variations, the accuracy of a DNN can drop drastically, causing concerns when deploying CiM accelerators in safety-critical applications. Finally, we show that surprisingly none of the existing methods used to enhance average DNN performance in CiM accelerators are very effective when extended to enhance the worst-case performance, and further research down the road is needed to address this problem

    Compact and High-Performance TCAM Based on Scaled Double-Gate FeFETs

    Full text link
    Ternary content addressable memory (TCAM), widely used in network routers and high-associativity caches, is gaining popularity in machine learning and data-analytic applications. Ferroelectric FETs (FeFETs) are a promising candidate for implementing TCAM owing to their high ON/OFF ratio, non-volatility, and CMOS compatibility. However, conventional single-gate FeFETs (SG-FeFETs) suffer from relatively high write voltage, low endurance, potential read disturbance, and face scaling challenges. Recently, a double-gate FeFET (DG-FeFET) has been proposed and outperforms SG-FeFETs in many aspects. This paper investigates TCAM design challenges specific to DG-FeFETs and introduces a novel 1.5T1Fe TCAM design based on DG-FeFETs. A 2-step search with early termination is employed to reduce the cell area and improve energy efficiency. A shared driver design is proposed to reduce the peripherals area. Detailed analysis and SPICE simulation show that the 1.5T1Fe DG-TCAM leads to superior search speed and energy efficiency. The 1.5T1Fe TCAM design can also be built with SG-FeFETs, which achieve search latency and energy improvement compared with 2FeFET TCAM.Comment: Accepted by Design Automation Conference (DAC) 202

    Reliable Dynamic Packet Scheduling over Lossy Real-Time Wireless Networks

    Get PDF
    Along with the rapid development and deployment of real-time wireless network (RTWN) technologies in a wide range of applications, effective packet scheduling algorithms have been playing a critical role in RTWNs for achieving desired Quality of Service (QoS) for real-time sensing and control, especially in the presence of unexpected disturbances. Most existing solutions in the literature focus either on static or dynamic schedule construction to meet the desired QoS requirements, but have a common assumption that all wireless links are reliable. Although this assumption simplifies the algorithm design and analysis, it is not realistic in real-life settings. To address this drawback, this paper introduces a novel reliable dynamic packet scheduling framework, called RD-PaS. RD-PaS can not only construct static schedules to meet both the timing and reliability requirements of end-to-end packet transmissions in RTWNs for a given periodic network traffic pattern, but also construct new schedules rapidly to handle abruptly increased network traffic induced by unexpected disturbances while minimizing the impact on existing network flows. The functional correctness of the RD-PaS framework has been validated through its implementation and deployment on a real-life RTWN testbed. Extensive simulation-based experiments have also been performed to evaluate the effectiveness of RD-PaS, especially in large-scale network settings
    • …
    corecore