860 research outputs found

    Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

    Full text link
    Fully realizing the potential of acceleration for Deep Neural Networks (DNNs) requires understanding and leveraging algorithmic properties. This paper builds on the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent accuracy loss, the required bitwidth varies significantly across DNNs and may even be adjusted for each layer. Thus, a fixed-bitwidth accelerator either offers limited benefit because it must accommodate the worst-case bitwidth requirement, or degrades final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator that comprises an array of bit-level processing elements which dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture minimizes computation and communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of Bit Fusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle-accurate simulation, we compare Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss and Stripes. At the same area, frequency, and process technology, Bit Fusion offers 3.9x speedup and 5.1x energy savings over Eyeriss. Compared to Stripes, Bit Fusion provides 2.6x speedup and 3.9x energy reduction at the 45 nm node when its area and frequency are set to those of Stripes. Scaled to the 16 nm GPU technology node, Bit Fusion almost matches the performance of a 250-Watt Titan Xp that uses 8-bit vector instructions, while consuming merely 895 milliwatts of power.
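    As a rough illustration of the bit-level decomposition idea (a sketch of the underlying arithmetic identity, not the paper's Verilog microarchitecture), the following Python snippet shows how a multiplication can be broken into 2-bit partial products that are shifted and summed; a bit-flexible array fuses more or fewer such bit-level units depending on the bitwidths a layer needs.

        # Sketch only: decompose a multiply into 2-bit x 2-bit partial products,
        # the arithmetic identity behind fusing bit-level processing elements.
        def to_slices(x, bits, brick=2):
            """Split an unsigned 'bits'-wide operand into 2-bit slices, LSB first."""
            assert 0 <= x < (1 << bits)
            return [(x >> s) & ((1 << brick) - 1) for s in range(0, bits, brick)]

        def fused_multiply(a, b, a_bits, b_bits, brick=2):
            """Sum shifted products of 2-bit slices: a 2-bit x 2-bit multiply uses
            one unit, while an 8-bit x 8-bit multiply fuses 16 of them."""
            acc = 0
            for i, sa in enumerate(to_slices(a, a_bits, brick)):
                for j, sb in enumerate(to_slices(b, b_bits, brick)):
                    acc += (sa * sb) << (brick * (i + j))
            return acc

        assert fused_multiply(173, 29, 8, 8) == 173 * 29   # 16 fused units
        assert fused_multiply(3, 2, 2, 2) == 3 * 2         # 1 unit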

    FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

    Full text link
    Research has shown that convolutional neural networks contain significant redundancy, and that high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 µs latency on the MNIST dataset at 95.8% accuracy, and 21906 image classifications per second with 283 µs latency on the CIFAR-10 and SVHN datasets at 80.1% and 94.9% accuracy, respectively. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks. Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017.
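    The arithmetic that makes binarized layers so cheap in hardware is the XNOR-popcount dot product. The sketch below (illustrative Python with my own encoding choices, not FINN's HLS code) shows the identity it relies on: with weights and activations in {-1, +1} stored as {0, 1} bits, the dot product equals 2 * popcount(XNOR(a, w)) - N.

        # Sketch of the XNOR-popcount dot product used for binarized layers.
        def binarize(v):
            """1 encodes +1 (value >= 0), 0 encodes -1."""
            return [1 if x >= 0 else 0 for x in v]

        def bnn_dot(a_bits, w_bits):
            """Dot product over {-1,+1} vectors stored as bits:
            2 * popcount(XNOR(a, w)) - N."""
            n = len(a_bits)
            matches = sum(1 for a, w in zip(a_bits, w_bits) if a == w)  # popcount of XNOR
            return 2 * matches - n

        a = binarize([0.3, -1.2, 0.7, -0.1])
        w = binarize([1.0, -0.5, -0.9, 0.4])
        # Agrees with the signed dot product over {-1, +1}.
        assert bnn_dot(a, w) == sum((2 * x - 1) * (2 * y - 1) for x, y in zip(a, w))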

    NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps

    Get PDF
    Convolutional neural networks (CNNs) have become the dominant neural network architecture for solving many state-of-the-art (SOA) visual processing tasks. Even though Graphics Processing Units (GPUs) are most often used for training and deploying CNNs, their power efficiency is less than 10 GOp/s/W for single-frame runtime inference. We propose a flexible and efficient CNN accelerator architecture called NullHop that implements SOA CNNs useful for low-power and low-latency application scenarios. NullHop exploits the sparsity of neuron activations in CNNs to accelerate the computation and reduce memory requirements. The flexible architecture allows high utilization of available computing resources across kernel sizes ranging from 1x1 to 7x7. NullHop can process up to 128 input and 128 output feature maps per layer in a single pass. We implemented the proposed architecture on a Xilinx Zynq FPGA platform and present results showing how our implementation reduces external memory transfers and compute time in five different CNNs, ranging from small networks up to the widely known large VGG16 and VGG19 CNNs. Post-synthesis simulations using Mentor Modelsim in a 28 nm process with a clock frequency of 500 MHz show that the VGG19 network achieves over 450 GOp/s. By exploiting sparsity, NullHop achieves an efficiency of 368%, maintains over 98% utilization of the MAC units, and achieves a power efficiency of over 3 TOp/s/W in a core area of 6.3 mm². As further proof of NullHop's usability, we interfaced its FPGA implementation with a neuromorphic event camera for real-time interactive demonstrations.
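    NullHop's savings come from storing feature maps sparsely. A minimal sketch of a sparsity-map encoding in that spirit (field names are illustrative, not the accelerator's exact on-chip format): each pixel costs one bit in a map, and only the non-zero values are stored and fed to the MAC units, so zeros are skipped instead of multiplied.

        # Sketch: encode a flattened feature map as a 1-bit sparsity map plus the
        # list of non-zero values, so zero activations can be skipped entirely.
        def compress(feature_map):
            sparsity_map = [1 if v != 0 else 0 for v in feature_map]
            nonzero_values = [v for v in feature_map if v != 0]
            return sparsity_map, nonzero_values

        def decompress(sparsity_map, nonzero_values):
            it = iter(nonzero_values)
            return [next(it) if bit else 0 for bit in sparsity_map]

        fm = [0, 0, 5, 0, 3, 0, 0, 7]       # post-ReLU activations are often mostly zero
        sm, nz = compress(fm)
        assert decompress(sm, nz) == fm     # 8 words -> 8 bits of map + 3 words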

    A Phase Change Memory and DRAM Based Framework For Energy-Efficient and High-Speed In-Memory Stochastic Computing

    Get PDF
    Convolutional Neural Networks (CNNs) have proven to be highly effective in various fields related to Artificial Intelligence (AI) and Machine Learning (ML). However, the significant computational and memory requirements of CNNs make their processing highly compute- and memory-intensive. In particular, the multiply-accumulate (MAC) operation, a fundamental building block of CNNs, requires an enormous number of arithmetic operations. As the input dataset size increases, the traditional processor-centric von Neumann computing architecture becomes ill-suited for CNN-based applications, resulting in sharply higher latency and energy costs and making the processing of CNNs highly challenging. To overcome these challenges, researchers have explored the Processing-in-Memory (PIM) technique, which places the processing unit inside or near the memory unit. This approach shortens data movement paths and utilizes the internal memory bandwidth at the memory chip level. However, developing a reliable PIM-based system with minimal hardware modifications and design complexity remains a significant challenge. The solution proposed in this report utilizes different memory technologies, such as Dynamic RAM (DRAM) and phase change memory (PCM), with stochastic arithmetic and minimal add-on logic. Stochastic computing is a technique that uses random bitstreams to perform arithmetic operations instead of traditional binary representation. This technique reduces the hardware required for CNN arithmetic operations, making it possible to implement them with minimal add-on logic. The report details the workflow for performing the arithmetic operations used by CNNs, including MAC, activation, and floating-point functions. The proposed solution includes designs for a scalable Stochastic Number Generator (SNG), a DRAM CNN accelerator, a non-volatile memory (NVM) class PCRAM-based CNN accelerator, and DRAM-based stochastic-to-binary conversion (StoB) for in-situ deep learning. These designs utilize stochastic computing to reduce the hardware requirements of CNN arithmetic operations and enable energy- and time-efficient processing of CNNs. The report also identifies future research directions for the proposed designs, including in-situ PCRAM-based SNG, ODIN (A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-Situ Neural Network Processing in Phase Change RAM), ATRIA (Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing), and AGNI (In-Situ, Iso-Latency Stochastic-to-Binary Number Conversion for In-DRAM Deep Learning), and presents initial findings for these ideas. In summary, the proposed solution offers a comprehensive approach to addressing the challenges of processing CNNs, and the proposed designs have the potential to significantly improve the energy and time efficiency of CNNs. Using stochastic computing and different memory technologies enables the development of reliable PIM-based systems with minimal hardware modifications and design complexity, providing a promising path for the future of CNN-based applications.
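    As a toy illustration of the stochastic arithmetic the report builds on (unipolar encoding; the SNG here uses Python's PRNG rather than the proposed DRAM/PCM-based generators), multiplying two values in [0, 1] reduces to a bitwise AND of their bitstreams followed by a count of ones.

        # Sketch of unipolar stochastic multiplication: P(bit = 1) encodes the value.
        import random

        def sng(p, length, seed=None):
            """Stochastic number generator: bitstream with P(bit = 1) = p."""
            rng = random.Random(seed)
            return [1 if rng.random() < p else 0 for _ in range(length)]

        def sc_multiply(p_a, p_b, length=4096):
            """AND two independent streams and count ones to estimate p_a * p_b."""
            a = sng(p_a, length, seed=1)
            b = sng(p_b, length, seed=2)
            return sum(x & y for x, y in zip(a, b)) / length

        print(sc_multiply(0.5, 0.25))   # close to 0.125, within stochastic error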

    A DRAM-Based Neural Network Accelerator Architecture for Binary Neural Networks

    Get PDF
    Doctoral dissertation -- Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2021. Advisor: Sungjoo Yoo. In convolutional neural network applications, most computation arises from the multiply-accumulate operations of the convolution and fully-connected layers. From the hardware perspective (i.e., in the gate-level circuits), these operations are performed as a large number of dot products between feature-map and kernel vectors. Since the feature maps and kernels are 3D or 4D tensors, the vectors derived from them are reused many times across the matrix multiplications. As DNN throughput increases, the power consumption and the performance bottleneck caused by data movement become more critical issues. More importantly, power consumption due to off-chip memory accesses dominates the total, since an off-chip memory access consumes several hundred times more energy than a computation. Accelerator throughput reaches several hundred GOPS to several TOPS, yet memory bandwidth is below 25.6 or 34 GB/s (with DDR4 or LPDDR4, respectively). Reducing the network size and/or the amount of data movement improves both the data-movement power and the performance bottleneck. Among algorithmic approaches, quantization is widely used; Binary Neural Networks (BNNs) reduce precision all the way down to 1 bit. Their accuracy is lower than that of FP16 networks, but it continues to improve through ongoing research. On the architecture side, data-flow control can reduce redundant data movement by increasing data reuse. Both methods are widely applied in accelerators because they require no additional computation during inference. In this dissertation, I present 1) a DRAM-based accelerator architecture and 2) a DRAM refresh method that mitigates the performance loss caused by refresh. The two methods are orthogonal, so they can be integrated into one DRAM chip and operate independently. First, we propose a DRAM-based accelerator architecture capable of massive vector dot-product operations. In the field of CNN accelerators to which BNNs can be applied, computing-in-memory (CIM) structures that utilize the cell-array structure of memory for vector dot products are being actively studied. Since DRAM stores all of the neural network data, it is well placed to reduce the amount of data transfer. The proposed architecture operates using only the basic operations of DRAM, without modifying the cell-array structure. The second method reduces the performance degradation and power consumption caused by DRAM refresh. Because DRAM cannot read or write data while performing a periodic refresh, system performance decreases. The proposed refresh method tests the retention characteristics inside the DRAM chip during self-refresh and extends the refresh period according to those characteristics. Since it operates independently inside the DRAM, it can be applied to any system using DRAM, including deep neural network accelerators. We also surveyed system integration, including the software stack, for using the in-DRAM accelerator from a deep learning framework. We expect that extending the existing TVM compiler and the FPGA-based TVM/VTA accelerator with the memory controller verified in the refresh experiments and a custom compiler will allow the in-DRAM accelerator to be controlled. In addition, we added a performance-simulation function for the in-DRAM accelerator to PyTorch: when running a neural network in PyTorch, it reports the computation latency and the data-movement latency of each layer executed on the in-DRAM accelerator.
    It is a significant advantage to be able to predict hardware performance while co-designing the network.
    Contents: Abstract; List of Tables; List of Figures
    Chapter 1 Introduction
    Chapter 2 Background: Neural Network Operation; Data Movement Overhead; Binary Neural Networks; Computing-in-Memory; Memory Bottleneck due to Refresh
    Chapter 3 In-DRAM Neural Network Accelerator: Backgrounds (DRAM Hierarchy; DRAM Basic Operation; DRAM Commands with Timing Parameters; Bit-wise Operation in DRAM); Motivations; Proposed Architecture (Operation Examples of Row Operator; Convolutions on DRAM Chip); Data Flow (Input Broadcasting in DRAM; Input Data Movement with M2V; Internal Data Movement with SiD; Data Partitioning for Parallel Operation); Experiments (Performance Estimation; Configuration of In-DRAM Accelerator; Improving the Accuracy of BNN; Comparison with the Existing Works); Discussion (Performance Comparison with ASIC Accelerators; Challenges of the Proposed Architecture); Conclusion
    Chapter 4 Reducing DRAM Refresh Power Consumption by Runtime Profiling of Retention Time and Dual-row Activation: Introduction; Background; Related Works; Observations; Solution Overview; Runtime Profiling (Basic Operation; Profiling Multiple Rows in Parallel; Temperature, Data Backup and Error Check); Dual-row Activation; Experiments (Experimental Setup; Refresh Period Improvement; Power Reduction); Conclusion
    Chapter 5 System Integration: Integrate the Proposed Methods; Software Stack
    Chapter 6 Conclusion
    Bibliography; Abstract in Korean
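    To make the abstract's bandwidth argument concrete, here is a back-of-the-envelope sketch (the 1 TOPS, 25.6 GB/s DDR4, and FP16-weight figures are illustrative assumptions, not measurements from the dissertation) comparing compute time with off-chip weight-transfer time for a single fully-connected layer.

        # Sketch: compute time vs. DRAM transfer time for one FC layer.
        def layer_times(in_features, out_features, bytes_per_weight=2,
                        peak_ops_per_s=1e12, dram_bytes_per_s=25.6e9):
            macs = in_features * out_features
            compute_s = (2 * macs) / peak_ops_per_s                    # 1 MAC = 2 ops
            transfer_s = (macs * bytes_per_weight) / dram_bytes_per_s  # weights streamed once
            return compute_s, transfer_s

        c, t = layer_times(4096, 4096)
        print(f"compute {c * 1e6:.0f} us vs DRAM transfer {t * 1e6:.0f} us")
        # Transfer dominates by roughly 40x here, which is why 1-bit (BNN) quantization
        # and computing inside DRAM attack the data-movement side of the problem.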