
    Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

    In the modern era of technology, a paradigm shift has been witnessed in areas involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). In particular, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications such as computer vision, image and video processing, and robotics. In the context of mature digital technologies and the availability of authentic data and data-handling infrastructure, DNNs have become a credible choice for solving complex real-life problems. The performance and accuracy of a DNN can exceed human intelligence in certain situations. However, DNNs are computationally cumbersome in terms of the resources and time required to handle these computations, and general-purpose architectures such as CPUs struggle with such computationally intensive algorithms. Therefore, considerable interest and effort have been invested by the research community in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) for the effective implementation of computationally intensive algorithms. This paper brings forward the various research works carried out on the development and deployment of DNNs using the aforementioned specialized hardware architectures and embedded AI accelerators. The review provides a detailed description of the specialized hardware-based accelerators used in the training and/or inference of DNNs. A comparative study based on factors such as power, area, and throughput is also made of the various accelerators discussed. Finally, future research and development directions are discussed, such as future trends in DNN implementation on specialized hardware accelerators. This review article is intended to serve as a guide to hardware architectures for accelerating and improving the effectiveness of deep learning research.
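The comparative factors named above (power, area, throughput) are commonly folded into derived metrics such as energy efficiency (throughput per watt) and area efficiency (throughput per mm^2). The sketch below is a minimal illustration of that kind of comparison; the accelerator names and numbers are hypothetical placeholders, not values from the survey.

```python
# Minimal sketch: deriving energy/area efficiency from the comparison
# factors named in the survey (power, area, throughput).
# All entries below are hypothetical placeholders, not survey data.
accelerators = {
    # name         (throughput GOP/s, power W, area mm^2)
    "gpu_like":    (1000.0, 150.0, 500.0),
    "fpga_like":   (300.0,  25.0,  200.0),
    "asic_like":   (500.0,  5.0,   50.0),
}

for name, (gops, watts, mm2) in accelerators.items():
    energy_eff = gops / watts   # GOP/s per watt
    area_eff = gops / mm2       # GOP/s per mm^2
    print(f"{name:10s}  {energy_eff:8.1f} GOP/s/W  {area_eff:8.1f} GOP/s/mm^2")
```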

    A transprecision floating-point cluster for efficient near-sensor data analytics

    Recent applications in the domain of near-sensor computing require the adoption of floating-point arithmetic to reconcile high-precision results with a wide dynamic range. In this paper, we propose a multi-core computing cluster that leverages the fine-grained tunable principles of transprecision computing to support near-sensor applications at a minimum power budget. Our design, based on the open-source RISC-V architecture, combines parallelization and sub-word vectorization with near-threshold operation, leading to a highly scalable and versatile system. We perform an exhaustive exploration of the design space of the transprecision cluster on a cycle-accurate FPGA emulator, with the aim of identifying the most efficient configurations in terms of performance, energy efficiency, and area efficiency. We also provide full-fledged software stack support, including a parallel runtime and a compilation toolchain, to enable the development of end-to-end applications. We perform an experimental assessment of our design on a set of benchmarks representative of the near-sensor processing domain, complementing the timing results with a post place-&-route analysis of the power consumption. Finally, a comparison with the state of the art shows that our solution outperforms the competitors in energy efficiency, reaching a peak of 97 Gflop/s/W on single-precision scalars and 162 Gflop/s/W on half-precision vectors.
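As a rough software analogue of the transprecision idea, the sketch below runs the same dot product at single and half precision with NumPy and reports the accuracy cost of the narrower format. It only illustrates the precision/throughput trade-off; it is not the RISC-V cluster's actual SIMD datapath or instruction set.

```python
import numpy as np

# Minimal sketch of the transprecision trade-off: the same kernel run at
# float32 and float16. On the proposed cluster, half-precision values are
# packed two per 32-bit word (sub-word vectorization), roughly doubling
# throughput; here we only show the numerical side of that trade-off.
rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

ref = np.dot(a.astype(np.float64), b.astype(np.float64))   # high-precision reference
fp32 = np.dot(a, b)
fp16 = np.dot(a.astype(np.float16), b.astype(np.float16))  # half-precision inputs

print(f"float32 relative error: {abs(fp32 - ref) / abs(ref):.2e}")
print(f"float16 relative error: {abs(fp16 - ref) / abs(ref):.2e}")
```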

    NP-CGRA: Extending CGRAs for Efficient Processing of Light-weight Deep Neural Networks

    Coarse-grained reconfigurable architectures (CGRAs) can provide both high energy efficiency and flexibility, making them well-suited for machine learning applications. However, previous work on CGRAs offers very limited support for deep neural networks (DNNs), especially for recent light-weight models such as depthwise separable convolution (DSC), which are an important workload for mobile environments. In this paper, we propose a set of architecture extensions and a mapping scheme to greatly enhance CGRA performance for DSC kernels. Our experimental results using MobileNets demonstrate that our proposed CGRA enhancement can deliver an 8-18x improvement in area-delay product depending on layer type, over a baseline CGRA with a state-of-the-art CGRA compiler. Moreover, our proposed CGRA architecture can also speed up 3D convolution with efficiency similar to previous work, demonstrating the effectiveness of our architectural features beyond DSC layers.
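For context on why DSC layers behave differently from standard convolution, the sketch below compares the multiply-accumulate counts of a dense convolution and its depthwise separable factorization (depthwise + pointwise), the structure used in MobileNets. The layer dimensions are hypothetical, chosen only to show the typical order-of-magnitude reduction.

```python
# Minimal sketch: MAC counts for a standard convolution vs. the
# depthwise separable factorization used in MobileNets.
# The layer shape below is a hypothetical example, not taken from the paper.
H, W = 56, 56          # output feature map size
Cin, Cout = 128, 128   # input / output channels
K = 3                  # kernel size

standard = H * W * Cin * Cout * K * K   # one dense KxK convolution
depthwise = H * W * Cin * K * K         # one KxK filter per input channel
pointwise = H * W * Cin * Cout          # 1x1 convolution mixing channels
separable = depthwise + pointwise

print(f"standard conv MACs : {standard:,}")
print(f"separable conv MACs: {separable:,}")
print(f"reduction factor   : {standard / separable:.1f}x")
```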

    Automatic Generation of Execution Plans for Efficient Execution of Convolutional Neural Networks

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 8. Bernhard Egger. Over the past years, a large number of architectures and accelerators for Deep Neural Networks (DNNs) have been proposed. While exhibiting common features, the number and arrangement of processing elements, the sizes and types of on-chip memory, and the possibilities of parallel execution vary significantly, especially in the embedded system domain. The number of off-chip memory accesses and the performance of a DNN on a given accelerator depend not only on the supported computational patterns and the available on-chip memory but also on the sizes and shapes of each layer. Finding a computational pattern that minimizes off-chip memory accesses while maximizing performance is thus a tedious and error-prone task. This thesis presents e-PlaNNer, a compiler framework that generates an optimized execution plan for a given embedded accelerator and Convolutional Neural Network (CNN). For each layer, e-PlaNNer determines the performance-optimal configuration by considering the data movement, tiling, and work distribution. The generated execution plan is transformed to code, allowing for a fast development cycle with different CNNs and hardware accelerators. Evaluated with five neural networks under varying memory configurations and compared to previous works on the Nvidia Jetson TX2, e-PlaNNer achieves a 6x speedup and a 21.14% reduction of off-chip memory access volume on average. In addition, e-PlaNNer shows meaningful performance compared to well-known deep learning frameworks in terms of end-to-end execution.
    Contents: Chapter 1 Introduction; Chapter 2 Related Work; Chapter 3 Background (Convolutional Neural Networks, DNN Accelerator, Roofline Model); Chapter 4 Graph Level Processing (Graph Construction, Schedule Caching); Chapter 5 Convolutional Layer Analysis (Loop Structure, Loop Tiling, Dataflow); Chapter 6 Execution Planning (Architecture Configurations, Modeling Off-Chip Memory Accesses, Modeling Performance, Search Space Exploration); Chapter 7 Code Generation (Intermediate Representation, Target Code Generation); Chapter 8 Evaluation (Experimental Setup, Performance Results, Comparison of Off-chip Memory Access, Framework Results); Chapter 9 Discussion; Chapter 10 Conclusion; Bibliography; Abstract (Korean).

    Towards Closing the Programmability-Efficiency Gap using Software-Defined Hardware

    The past decade has seen the breakdown of two important trends in the computing industry: Moore's law, an observation that the number of transistors in a chip roughly doubles every eighteen months, and Dennard scaling, which enabled the use of these transistors within a constant power budget. This has caused a surge in domain-specific accelerators, i.e., specialized hardware that delivers significantly better energy efficiency than general-purpose processors, such as CPUs. While the performance and efficiency of such accelerators are highly desirable, the fast pace of algorithmic innovation and non-recurring engineering costs have deterred their widespread use, since they are only programmable across a narrow set of applications. This has engendered a programmability-efficiency gap across contemporary platforms. A practical solution that can close this gap is thus lucrative and is likely to have broad impact in both academic research and the industry. This dissertation proposes such a solution with a reconfigurable Software-Defined Hardware (SDH) system that morphs parts of the hardware on-the-fly to tailor to the requirements of each application phase. This system is designed to deliver near-accelerator-level efficiency across a broad set of applications, while retaining CPU-like programmability. The dissertation first presents a fixed-function solution to accelerate sparse matrix multiplication, which forms the basis of many applications in graph analytics and scientific computing. The solution consists of a tiled hardware architecture, co-designed with the outer product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate each phase of the algorithm. A proof-of-concept is then presented in the form of a prototyped 40 nm Complementary Metal-Oxide Semiconductor (CMOS) chip that demonstrates energy efficiency and performance per die area improvements of 12.6x and 17.1x over a high-end CPU, and serves as a stepping stone towards a full SDH system. The next piece of the dissertation enhances the proposed hardware with reconfigurability of the dataflow and resource sharing modes, in order to extend acceleration support to a set of common parallelizable workloads. This reconfigurability lends the system the ability to cater to distinct data access and compute patterns, such as workloads with extensive data sharing and reuse, workloads with limited reuse and streaming access patterns, among others. Moreover, this system incorporates commercial cores and a prototyped software stack for CPU-level programmability. The proposed system is evaluated on a diverse set of compute-bound and memory-bound kernels that compose applications in the domains of graph analytics, machine learning, image and language processing. The evaluation shows average performance and energy-efficiency gains of 5.0x and 18.4x over the CPU. The final part of the dissertation proposes a runtime control framework that uses low-cost monitoring of hardware performance counters to predict the next best configuration and reconfigure the hardware, upon detecting a change in phase or nature of data within the application. In comparison to prior work, this contribution targets multicore CGRAs, uses low-overhead decision tree based predictive models, and incorporates reconfiguration cost-awareness into its policies.
Compared to the best-average static (non-reconfiguring) configuration, the dynamically reconfigurable system achieves a 1.6x improvement in performance-per-Watt in the Energy-Efficient mode of operation, or the same performance with 23% lower energy in the Power-Performance mode, for SpMM across a suite of real-world inputs. The proposed reconfiguration mechanism itself outperforms the state-of-the-art approach for dynamic runtime control by up to 2.9x in terms of energy efficiency. PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169859/1/subh_1.pd
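The outer-product SpMM formulation mentioned above computes C = A·B as a sum of outer products between columns of A and the corresponding rows of B, which separates a multiply phase from a merge phase. The sketch below is a plain-Python illustration of that formulation using a simple dictionary-of-keys sparse format; it is not the dissertation's hardware mapping or tiling scheme.

```python
from collections import defaultdict

# Minimal sketch of outer-product SpMM: C = A @ B computed as the sum of
# outer products of column k of A with row k of B. Matrices are stored as
# {(row, col): value} dictionaries (a simple sparse format for illustration).
def spmm_outer_product(A, B):
    # Multiply phase: group A by column and B by row.
    a_cols = defaultdict(list)   # k -> [(i, value), ...]
    b_rows = defaultdict(list)   # k -> [(j, value), ...]
    for (i, k), v in A.items():
        a_cols[k].append((i, v))
    for (k, j), v in B.items():
        b_rows[k].append((j, v))

    # Merge phase: accumulate the partial products of each outer product.
    C = defaultdict(float)
    for k in a_cols.keys() & b_rows.keys():
        for i, av in a_cols[k]:
            for j, bv in b_rows[k]:
                C[(i, j)] += av * bv
    return dict(C)

A = {(0, 0): 2.0, (1, 2): 3.0}
B = {(0, 1): 4.0, (2, 0): 5.0}
print(spmm_outer_product(A, B))   # {(0, 1): 8.0, (1, 0): 15.0}
```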