1,776 research outputs found

    Parallel processing and expert systems

    Get PDF
    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 90's cannot enjoy an increased level of autonomy without the efficient use of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real time demands are met for large expert systems. Speed-up via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial labs in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems was surveyed. The survey is divided into three major sections: (1) multiprocessors for parallel expert systems; (2) parallel languages for symbolic computations; and (3) measurements of parallelism of expert system. Results to date indicate that the parallelism achieved for these systems is small. In order to obtain greater speed-ups, data parallelism and application parallelism must be exploited

    A survey of near-data processing architectures for neural networks

    Get PDF
    Data-intensive workloads and applications, such as machine learning (ML), are fundamentally limited by traditional computing systems based on the von-Neumann architecture. As data movement operations and energy consumption become key bottlenecks in the design of computing systems, the interest in unconventional approaches such as Near-Data Processing (NDP), machine learning, and especially neural network (NN)-based accelerators has grown significantly. Emerging memory technologies, such as ReRAM and 3D-stacked, are promising for efficiently architecting NDP-based accelerators for NN due to their capabilities to work as both high-density/low-energy storage and in/near-memory computation/search engine. In this paper, we present a survey of techniques for designing NDP architectures for NN. By classifying the techniques based on the memory technology employed, we underscore their similarities and differences. Finally, we discuss open challenges and future perspectives that need to be explored in order to improve and extend the adoption of NDP architectures for future computing platforms. This paper will be valuable for computer architects, chip designers, and researchers in the area of machine learning.This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.Peer ReviewedPostprint (published version

    A Phase Change Memory and DRAM Based Framework For Energy-Efficient and High-Speed In-Memory Stochastic Computing

    Get PDF
    Convolutional Neural Networks (CNNs) have proven to be highly effective in various fields related to Artificial Intelligence (AI) and Machine Learning (ML). However, the significant computational and memory requirements of CNNs make their processing highly compute and memory-intensive. In particular, the multiply-accumulate (MAC) operation, which is a fundamental building block of CNNs, requires enormous arithmetic operations. As the input dataset size increases, the traditional processor-centric von-Neumann computing architecture becomes ill-suited for CNN-based applications. This results in exponentially higher latency and energy costs, making the processing of CNNs highly challenging. To overcome these challenges, researchers have explored the Processing-In Memory (PIM) technique, which involves placing the processing unit inside or near the memory unit. This approach reduces data migration length and utilizes the internal memory bandwidth at the memory chip level. However, developing a reliable PIM-based system with minimal hardware modifications and design complexity remains a significant challenge. The proposed solution in the report suggests utilizing different memory technologies, such as Dynamic RAM (DRAM) and phase change memory (PCM), with Stochastic arithmetic and minimal add-on logic. Stochastic computing is a technique that uses random numbers to perform arithmetic operations instead of traditional binary representation. This technique reduces hardware requirements for CNN\u27s arithmetic operations, making it possible to implement them with minimal add-on logic. The report details the workflow for performing arithmetical operations used by CNNs, including MAC, activation, and floating-point functions. The proposed solution includes designs for scalable Stochastic Number Generator (SNG), DRAM CNN accelerator, non-volatile memory (NVM) class PCRAM-based CNN accelerator, and DRAM-based stochastic to binary conversion (StoB) for in-situ deep learning. These designs utilize stochastic computing to reduce the hardware requirements for CNN\u27s arithmetic operations and enable energy and time-efficient processing of CNNs. The report also identifies future research directions for the proposed designs, including in-situ PCRAM-based SNG, ODIN (A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-Situ Neural Network Processing in Phase Change RAM), ATRIA (Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing), and AGNI (In-Situ, Iso-Latency Stochastic-to-Binary Number Conversion for In-DRAM Deep Learning), and presents initial findings for these ideas. In summary, the proposed solution in the report offers a comprehensive approach to address the challenges of processing CNNs, and the proposed designs have the potential to improve the energy and time efficiency of CNNs significantly. Using Stochastic Computing and different memory technologies enables the development of reliable PIM-based systems with minimal hardware modifications and design complexity, providing a promising path for the future of CNN-based applications

    HMC-Based Accelerator Design For Compressed Deep Neural Networks

    Get PDF
    Deep Neural Networks (DNNs) offer remarkable performance of classifications and regressions in many high dimensional problems and have been widely utilized in real-word cognitive applications. In DNN applications, high computational cost of DNNs greatly hinder their deployment in resource-constrained applications, real-time systems and edge computing platforms. Moreover, energy consumption and performance cost of moving data between memory hierarchy and computational units are higher than that of the computation itself. To overcome the memory bottleneck, data locality and temporal data reuse are improved in accelerator design. In an attempt to further improve data locality, memory manufacturers have invented 3D-stacked memory where multiple layers of memory arrays are stacked on top of each other. Inherited from the concept of Process-In-Memory (PIM), some 3D-stacked memory architectures also include a logic layer that can integrate general-purpose computational logic directly within main memory to take advantages of high internal bandwidth during computation. In this dissertation, we are going to investigate hardware/software co-design for neural network accelerator. Specifically, we introduce a two-phase filter pruning framework for model compression and an accelerator tailored for efficient DNN execution on HMC, which can dynamically offload the primitives and functions to PIM logic layer through a latency-aware scheduling controller. In our compression framework, we formulate filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as conditional entropy over filters in a convolutional layer, i.e., distribution of entropy conditioned on network loss. Based on the definition, different pruning efficiencies of global and layer-wise pruning strategies are compared, and two-phase pruning method is proposed. The proposed pruning method can achieve a reduction of 88% filters and 46% inference time reduction on VGG16 within 2% accuracy degradation. In this dissertation, we are going to investigate hardware/software co-design for neural network accelerator. Specifically, we introduce a two-phase filter pruning framework for model compres- sion and an accelerator tailored for efficient DNN execution on HMC, which can dynamically offload the primitives and functions to PIM logic layer through a latency-aware scheduling con- troller. In our compression framework, we formulate filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as conditional entropy over filters in a convolutional layer, i.e., distribution of entropy conditioned on network loss. Based on the definition, different pruning efficiencies of global and layer-wise pruning strategies are compared, and two-phase pruning method is proposed. The proposed pruning method can achieve a reduction of 88% filters and 46% inference time reduction on VGG16 within 2% accuracy degradation

    AI and ML Accelerator Survey and Trends

    Full text link
    This paper updates the survey of AI accelerators and processors from past three years. This paper collects and summarizes the current commercial accelerators that have been publicly announced with peak performance and power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are again discussed and analyzed. Two new trends plots based on accelerator release dates are included in this year's paper, along with the additional trends of some neuromorphic, photonic, and memristor-based inference accelerators.Comment: 10 pages, 4 figures, 2022 IEEE High Performance Extreme Computing (HPEC) Conference. arXiv admin note: substantial text overlap with arXiv:2009.00993, arXiv:2109.0895
    • …
    corecore