49 research outputs found
Artificial neural networks acceleration on field-programmable gate arrays considering model redundancy
Artificial Neural Networks (ANNs) have dramatically developed over the last ten years, and have been successfully applied in many important areas. A natural follow-up topic is to deploy ANNs to a wider range of hardware platforms. However, modern ANN models may aim for millisecond- or even nanosecond-level latency for each input processing while it is common for them to require million-level operations and gigabyte-scale data access for computing each input. This intrinsic high computational complexity introduces hardware challenges to the system implementation. Meanwhile, the integration of computing resources on hardware platforms is hampered by the slowing down of Moore’s Law. Therefore, it is important to study new design methods for ANN hardware systems that produce high model accuracy with low resource usage. Field-Programmable Gate Array (FPGA) is a natural fit for this topic due to its reconfigurability and flexibility. These features of FPGA allow us to implement customised data paths and data representations on hardware, which makes it the primary platform in this research.
The main topics discussed in this thesis include neural network redundancy and its impact on hardware systems. The main goal is to reduce hardware complexity by reducing neural network redundancy and maintaining accuracy at the same time. To achieve this, redundancy is firstly categorised into two types: model- and data-level. Then, each type is studied in isolation before both are combined in a single system design. First, to study model-level redundancy, an algorithm called dropout is implemented as a way to reduce model-level redundancy during training and used here to reduce hardware cost. Our proposed system achieves a 50% reduction in DSP usage and 33% to 47% fewer on-chip memory usage compared to conventional implementations. Second, in terms of data-level redundancy, we aim to study how data precision affects hardware cost and system throughput. Our experiments show that reduced-precision data present negligible or even no accuracy loss to full-precision data on the tested benchmarks. In particular, the 4-bit fixed point presents a good trade-off between model accuracy and hardware cost compared to other tested data representations. Third, we studied the interactive effect of reducing both model- and data-level redundancy and proposed a FPGA accelerator design for Redundancy-Reduced (RR-) MobileNet [Hea17]. Our proposed RR-MobileNet system achieves a state-of-the-art latency, 7.85 ms, for single image processing in ImageNet inference. Finally, a design guideline is proposed as a step-by-step guidance for redundancy-reduced neural network system design.Open Acces
DASNet: Dynamic Activation Sparsity for Neural Network Efficiency Improvement
To improve the execution speed and efficiency of neural networks in embedded
systems, it is crucial to decrease the model size and computational complexity.
In addition to conventional compression techniques, e.g., weight pruning and
quantization, removing unimportant activations can reduce the amount of data
communication and the computation cost. Unlike weight parameters, the pattern
of activations is directly related to input data and thereby changes
dynamically. To regulate the dynamic activation sparsity (DAS), in this work,
we propose a generic low-cost approach based on winners-take-all (WTA) dropout
technique. The network enhanced by the proposed WTA dropout, namely
\textit{DASNet}, features structured activation sparsity with an improved
sparsity level. Compared to the static feature map pruning methods, DASNets
provide better computation cost reduction. The WTA technique can be easily
applied in deep neural networks without incurring additional training
variables. More importantly, DASNet can be seamlessly integrated with other
compression techniques, such as weight pruning and quantization, without
compromising on accuracy. Our experiments on various networks and datasets
present significant run-time speedups with negligible accuracy loss
Recommended from our members
Lane Compression: A Lightweight Lossless Compression Method for Machine Learning on Embedded Systems
This article presents Lane Compression, a lightweight lossless compression technique for machine learning that is based on a detailed study of the statistical properties of machine learning data. The proposed technique profiles machine learning data gathered ahead of run-time and partitions values bit-wise into different
lanes
with more distinctive statistical characteristics. Then the most appropriate compression technique is chosen for each lane out of a small number of low-cost compression techniques. Lane Compression’s compute and memory requirements are very low and yet it achieves a compression rate comparable to or better than Huffman coding. We evaluate and analyse Lane Compression on a wide range of machine learning networks for both inference and re-training. We also demonstrate the profiling prior to run-time and the ability to configure the hardware based on the profiling guarantee robust performance across different models and datasets. Hardware implementations are described and the scheme’s simplicity makes it suitable for compressing both on-chip and off-chip traffic.
Samsung Advanced Institute of Technology (SAIT
Tools for efficient Deep Learning
In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges for designing DNNs that are efficient in time, in resources and in power consumption.
We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) stabler by layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by at most 4\%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have maximally 1\% higher accuracy than prior work.
This thesis also addresses the challenges lying in the gap between DNN descriptions and executables by Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist can speed up software execution in sequential and parallel by 2.53 and 9.47 times on Polybench/C. POLSCA achieves 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C.
Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators of streaming architectures with advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvement of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets.
All these tools are open source, some of which have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.Open Acces