310 research outputs found

    FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

    Full text link
    Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 µs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 µs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks. Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017.
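    Binarized networks of the kind FINN targets replace floating-point multiply-accumulate operations with XNOR and popcount. The NumPy sketch below is a minimal software illustration of that equivalence for a single dot product, assuming weights and activations constrained to {-1, +1}; it is not FINN's actual hardware mapping.

        import numpy as np

        def binarize(x):
            """Map real values to {-1, +1} with the sign function (0 maps to +1)."""
            return np.where(x >= 0, 1, -1).astype(np.int8)

        def xnor_popcount_dot(a_bin, w_bin):
            """Dot product of two {-1, +1} vectors via XNOR + popcount.

            Encoding -1 as bit 0 and +1 as bit 1, the dot product equals
            2 * popcount(XNOR(a, w)) - n, where n is the vector length.
            """
            a_bits = a_bin > 0
            w_bits = w_bin > 0
            matches = int(np.sum(~(a_bits ^ w_bits)))  # popcount of the XNOR result
            return 2 * matches - a_bin.size

        # Sanity check against the ordinary integer dot product.
        rng = np.random.default_rng(0)
        a = binarize(rng.standard_normal(128))
        w = binarize(rng.standard_normal(128))
        assert xnor_popcount_dot(a, w) == int(a.astype(np.int32) @ w.astype(np.int32))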

    Scalable Computer System Design for Heterogeneous Natural Language Processing Models

    Get PDF
    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Jangwoo Kim.

    Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) comprising the specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP) neural-network models (e.g., Memory Networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only their own basic operations and/or customize them for specific models, which leads to low performance improvement and narrow model coverage. An ideal NLP accelerator should therefore first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and should adaptively optimize its architecture to achieve the best performance for the given model. To address these scalability and model/configuration diversity issues, the dissertation introduces two projects, MnnFast and NLP-Fast, to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (i.e., high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks (a software sketch of the zero-skipping idea follows this entry). Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation caused by model/configuration diversity in emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (i.e., CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform.

    Korean abstract: As natural language processing has grown in importance, companies and research groups have proposed increasingly diverse and complex NLP models: the models are becoming more complex in structure, larger in scale, and more varied in kind. To address this complexity, scalability, and diversity, this dissertation presents several key ideas: (1) static/dynamic analysis to characterize how performance overheads are distributed across diverse NLP models; (2) a holistic model-parallelization technique that optimizes the memory usage of the main performance bottlenecks identified by that analysis; (3) techniques that reduce the amount of computation in several operations, together with a dynamic scheduler that resolves the skewness introduced by that reduction; and (4) a technique that derives a design optimized for each model, addressing the performance diversity of current NLP models. Because these core techniques apply generically to many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they are highly effective and can be broadly applied to computer system design for NLP models.
    This dissertation applies these techniques on CPU, GPU, and FPGA platforms and shows that all of the proposed techniques achieve meaningful performance improvements in each environment.

    Table of contents:
    1 INTRODUCTION
    2 Background
        2.1 Memory Networks
        2.2 Deep Learning for NLP
    3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks
        3.1 Motivation & Design Goals
            3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements
            3.1.2 Performance Problems in MemNN - High Computation
            3.1.3 Performance Problems in MemNN - Shared Cache Contention
            3.1.4 Design Goals
        3.2 MnnFast
            3.2.1 Column-Based Algorithm
            3.2.2 Zero Skipping
            3.2.3 Embedding Cache
        3.3 Implementation
            3.3.1 General-Purpose Architecture - CPU
            3.3.2 General-Purpose Architecture - GPU
            3.3.3 Custom Hardware (FPGA)
        3.4 Evaluation
            3.4.1 Experimental Setup
            3.4.2 CPU
            3.4.3 GPU
            3.4.4 FPGA
            3.4.5 Comparison Between CPU and FPGA
        3.5 Conclusion
    4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models
        4.1 Motivation & Design Goals
            4.1.1 High Model Complexity
            4.1.2 High Memory Bandwidth
            4.1.3 Heavy Computation
            4.1.4 Huge Performance Variation
            4.1.5 Design Goals
        4.2 NLP-Fast
            4.2.1 Bottleneck Analysis of NLP Models
            4.2.2 Holistic Model Partitioning
            4.2.3 Cross-operation Zero Skipping
            4.2.4 Adaptive Hardware Reconfiguration
        4.3 NLP-Fast Toolkit
        4.4 Implementation
            4.4.1 General-Purpose Architecture - CPU
            4.4.2 General-Purpose Architecture - GPU
            4.4.3 Custom Hardware (FPGA)
        4.5 Evaluation
            4.5.1 Experimental Setup
            4.5.2 CPU
            4.5.3 GPU
            4.5.4 FPGA
        4.6 Conclusion
    5 Related Work
        5.1 Various DNN Accelerators
        5.2 Various NLP Accelerators
        5.3 Model Partitioning
        5.4 Approximation
        5.5 Improving Flexibility
        5.6 Resource Optimization
    6 Conclusion
    Abstract (In Korean)
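    The zero-skipping idea mentioned above exploits the fact that, after the softmax, most attention weights in a memory network are close to zero and contribute almost nothing to the weighted sum. The NumPy sketch below is a loose software illustration of that idea; the threshold, shapes, and kernel structure are illustrative and do not reproduce MnnFast's actual design.

        import numpy as np

        def attention_weighted_sum(scores, values, threshold=1e-3):
            """Weighted sum over memory entries, skipping near-zero attention weights.

            scores    : (num_entries,) raw attention scores
            values    : (num_entries, dim) memory value vectors
            threshold : weights below this value are treated as zero and skipped
            """
            w = np.exp(scores - scores.max())   # numerically stable softmax
            w /= w.sum()
            keep = w > threshold                # only these entries are computed
            return w[keep] @ values[keep]

        rng = np.random.default_rng(1)
        scores = rng.standard_normal(512)
        values = rng.standard_normal((512, 64))

        w_full = np.exp(scores - scores.max())
        w_full /= w_full.sum()
        exact = w_full @ values
        approx = attention_weighted_sum(scores, values)
        print(np.max(np.abs(approx - exact)))   # small error introduced by skipping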

    A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS

    Get PDF
    Accelerator-based (or heterogeneous) computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While some solutions use custom-made components, most of today's systems rely on commodity high-end CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware suffers from inherently limited power efficiency, that is, low GFLOPS per watt, which is now considered a primary metric. The many-core model and architectural customization can play a key role here, as they enable unprecedented levels of power efficiency compared to CPUs/GPUs. However, such paradigms are still immature, and deeper exploration is indispensable. This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and the network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank-remapping system to reduce bank conflicts. The experimental results show the benefits of using both a customizable hardware bank-remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master suits many-core systems better than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes give an indication of the area overhead incurred by our solution and demonstrate the benefits of dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse-directory approach, with a selective coherence-maintenance system that allows coherence to be deactivated for blocks that do not require it. Experimental results show that a hybrid coherent/non-coherent architectural mechanism, along with an extended coherence protocol, can enhance performance. The above results were all collected on a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which, together with the above methodological results, is part of the contribution of this dissertation. As a key benefit, the experimental platform enables users to integrate novel hardware/software solutions at full-system scale, whereas existing platforms do not always support comprehensive heterogeneous architecture exploration.
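    To illustrate why a configurable bank-remapping function can reduce scratchpad bank conflicts, the following Python model (purely illustrative, not the dissertation's RTL) compares plain modulo interleaving with a simple XOR-based remap on a strided access pattern that is pathological for the modulo scheme; bank count, word size, and the remap function are assumptions.

        NUM_BANKS = 8          # assumed bank count, power of two
        WORD_BYTES = 4         # assumed word size

        def bank_modulo(addr):
            """Baseline: bank = word address mod number of banks."""
            return (addr // WORD_BYTES) % NUM_BANKS

        def bank_xor_remap(addr):
            """Configurable remap (here: XOR the low bank bits with higher address bits)."""
            word = addr // WORD_BYTES
            return (word ^ (word >> 3)) % NUM_BANKS

        def conflict_cycles(addresses, mapper, lanes=NUM_BANKS):
            """Count extra cycles caused by two or more lanes hitting the same bank."""
            extra = 0
            for i in range(0, len(addresses), lanes):
                group = [mapper(a) for a in addresses[i:i + lanes]]
                worst = max(group.count(b) for b in set(group))
                extra += worst - 1          # a conflict-free group takes one cycle
            return extra

        # Stride of NUM_BANKS words: every access hits the same bank under modulo mapping.
        addrs = [i * NUM_BANKS * WORD_BYTES for i in range(64)]
        print("modulo conflicts:  ", conflict_cycles(addrs, bank_modulo))
        print("remapped conflicts:", conflict_cycles(addrs, bank_xor_remap))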

    hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices

    Full text link
    Accessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. To support domain scientists, we have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate machine learning algorithms for implementation with both FPGA and ASIC technologies. We expand on previous hls4ml work by extending capabilities and techniques towards low-power implementations and increased usability: new Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long pipeline kernels for low power, and new device backends including an ASIC workflow. Taken together, these and continued efforts in hls4ml will arm a new generation of domain scientists with accessible, efficient, and powerful tools for machine-learning-accelerated discovery. Comment: 10 pages, 8 figures, TinyML Research Symposium 2021.
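    For reference, a typical hls4ml Python flow looks roughly like the sketch below; exact function names and arguments may differ between hls4ml releases, and the Keras model and FPGA part are placeholders, so treat this as an outline rather than a verified recipe.

        import hls4ml
        from tensorflow import keras

        # Small Keras model standing in for a trained (possibly pruned/quantized) network.
        model = keras.Sequential([
            keras.layers.Dense(32, activation='relu', input_shape=(16,)),
            keras.layers.Dense(5, activation='softmax'),
        ])

        # Derive an HLS configuration (precision, reuse factors) from the model.
        config = hls4ml.utils.config_from_keras_model(model, granularity='model')

        # Convert to an HLS project targeting a specific FPGA part.
        hls_model = hls4ml.converters.convert_from_keras_model(
            model,
            hls_config=config,
            output_dir='hls4ml_prj',
            part='xcu250-figd2104-2L-e',   # example device; substitute your own target
        )

        # Emulate the fixed-point design in software; synthesis can then be launched.
        hls_model.compile()
        # y_hls = hls_model.predict(x_test)
        # hls_model.build(csim=False)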

    A library-based tool to translate high level DNN models into hierarchical VHDL descriptions

    Get PDF
    This work presents a tool to convert high-level models of deep neural networks into register-transfer-level designs. To make it useful for different target technologies, the output designs are based on hierarchical VHDL descriptions, which are accepted as input files by a wide variety of FPGA, SoC, and ASIC digital synthesis tools. The tool aims to speed up the design and synthesis cycle of such systems and gives the designer some capability to balance network latency against hardware resources. It also provides a clock-domain crossing to interface the input layer of the synthesized neural networks with sensors running at different clock frequencies. The tool is tested with a neural network combining convolutional and fully connected layers, designed to perform traffic-sign recognition and synthesized under different hardware-resource-usage specifications on a Zynq UltraScale+ MPSoC development board. This work has been partially funded by the Spanish Ministerio de Ciencia e Innovación (MCI), Agencia Estatal de Investigación (AEI) and European Regional Development Fund (ERDF/FEDER) under grant RTI2018-097088-B-C33.
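    The tool itself is not shown in this listing, but the general idea of lowering a layer list into a hierarchical VHDL description can be sketched in Python as follows; the entity names, generics, and port style are invented for illustration and do not reflect the actual tool's output.

        # Hypothetical lowering of a high-level layer list into a hierarchical VHDL
        # top level: each layer becomes one component instance wired in a pipeline.
        LAYERS = [
            {"kind": "conv", "in_ch": 3, "out_ch": 16, "kernel": 3},
            {"kind": "fc", "in_dim": 1024, "out_dim": 43},   # e.g., 43 traffic-sign classes
        ]

        def emit_instance(i, layer):
            if layer["kind"] == "conv":
                entity = "conv_layer"
                generics = f"IN_CH => {layer['in_ch']}, OUT_CH => {layer['out_ch']}, K => {layer['kernel']}"
            else:
                entity = "fc_layer"
                generics = f"IN_DIM => {layer['in_dim']}, OUT_DIM => {layer['out_dim']}"
            return (f"  u_layer{i}: entity work.{entity}\n"
                    f"    generic map ({generics})\n"
                    f"    port map (clk => clk, d_in => stage({i}), d_out => stage({i + 1}));")

        print("architecture rtl of dnn_top is\nbegin")
        print("\n".join(emit_instance(i, layer) for i, layer in enumerate(LAYERS)))
        print("end architecture;")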

    Early-Stage Design Space Exploration Tool for Neural Network Inference Accelerators

    Get PDF