On the Reduction of Computational Complexity of Deep Convolutional Neural Networks
Deep convolutional neural networks (ConvNets), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks. Unfortunately, this accuracy often comes at significant computational cost, limiting deployability. In modern ConvNets it is typical for the convolution layers to consume the vast majority of computational resources during inference, which has made the acceleration of these layers an important research area in academia and industry. In this paper, we examine the effects of co-optimizing the internal structures of the convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speedup of a ConvNet, achieving a ten-fold increase over baseline. We also introduce a new class of fast one-dimensional (1D) convolutions for ConvNets using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well-grounded, robust, and does not require any time-consuming retraining, while still achieving speedups solely from the convolutional layers with no loss in baseline accuracy.
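To make the Toom-Cook idea concrete, here is a minimal sketch of the classic F(2,3) instance, which produces two outputs of a 3-tap 1D filter using 4 multiplications instead of the 6 needed by the direct method. The transform matrices are the standard textbook construction, not necessarily the paper's exact scheme, and the input values are illustrative.

    import numpy as np

    # F(2,3) Toom-Cook / Winograd transforms: 2 outputs of a 3-tap filter
    # from 4 input samples, using 4 elementwise multiplications.
    BT = np.array([[1, 0, -1, 0],
                   [0, 1,  1, 0],
                   [0, -1, 1, 0],
                   [0, 1,  0, -1]], dtype=float)
    G = np.array([[1.0,  0.0, 0.0],
                  [0.5,  0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0,  0.0, 1.0]])
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=float)

    def f23(d, g):
        """d: 4 input samples, g: 3 filter taps -> 2 filter outputs."""
        U = G @ g              # filter transform (precomputable once per filter)
        V = BT @ d             # input transform
        return AT @ (U * V)    # 4 multiplications + output transform

    d = np.array([1.0, 2.0, 3.0, 4.0])
    g = np.array([1.0, 0.0, -1.0])
    print(f23(d, g))                                # fast path
    print(np.convolve(d, g[::-1], mode='valid'))    # direct path, same result

Because the filter transform can be computed once and reused across the whole input, the saving comes almost entirely from the shrunken elementwise stage.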
Model-Architecture Co-design of Deep Neural Networks for Embedded Systems
In deep learning, a convolutional neural network (ConvNet or CNN) is a powerful tool for building interesting embedded applications that use data to make predictions. An application running on an embedded system typically has limited access to memory, processing power, and storage. Implementing deep convolutional neural network-based inference on resource-constrained devices can be very challenging, as these environments cannot usually draw on the massive computing power and storage available in cloud server environments. Furthermore, the constantly evolving nature of modern deep network architectures aggravates the problem by forcing a balance between flexibility and specialisation. However, much of the baseline architecture of a deep convolutional neural network has stayed the same. With careful optimisation of the most common and widely occurring layer architectures, it is typically possible to accelerate these emerging workloads on resource-constrained embedded systems.
This thesis makes four contributions. I first developed a lossy three-stage low-rank approximation scheme that can reduce the computational complexity of a pre-trained model by 3-5x and up to 8-9x for individual convolutional layers. This scheme requires restructuring of the convolutional layers and generally suits the scenario where both the training data and trained model are available.
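As a sketch of the low-rank idea (a single step, not the thesis's three-stage scheme), a convolutional layer's filter bank can be matricized and factored with a truncated SVD into two cheaper layers. The layer sizes and the rank r below are hypothetical; r is the knob that trades accuracy for speed.

    import numpy as np

    # Hypothetical layer sizes; W stands in for a pre-trained filter bank.
    C_out, C_in, k, r = 64, 32, 3, 16
    W = np.random.randn(C_out, C_in, k, k)

    W2d = W.reshape(C_out, C_in * k * k)      # matricize the filter bank
    U, s, Vt = np.linalg.svd(W2d, full_matrices=False)
    A = U[:, :r] * s[:r]                      # (C_out, r): acts as a 1x1 conv
    B = Vt[:r, :]                             # (r, C_in*k*k): acts as a kxk conv

    # Multiply-accumulate counts per output position, before vs after.
    direct  = C_out * C_in * k * k
    lowrank = r * C_in * k * k + C_out * r
    print(direct, lowrank, round(direct / lowrank, 2))   # ~3.3x fewer MACs here

Without the fine-tuning step the full scheme assumes, the truncation error goes straight into prediction accuracy, which is why the availability of training data matters.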
In many scenarios, however, the training data is not available, so any loss in prediction accuracy caused by structural changes made to a model as a post-processing step cannot be fine-tuned away. Besides the lack of training data, there are other situations where the architecture of a model cannot be changed after training. My second contribution handles this scenario with a low-level optimisation scheme that, unlike the low-rank approximation scheme, requires no changes to the model architecture. This novel scheme uses a modified version of the Cook-Toom algorithm to reduce the computational intensity of commonly occurring dense and spatial convolutional layers and to speed up inference by 2-4x.
My third contribution is an efficient implementation of the Cook-Toom class of algorithms on Arm's ubiquitous low-power Cortex processors. Unlike direct convolution, computing convolutions with the modified Cook-Toom algorithm requires a different data processing pipeline, as it involves pre- and post-transformations of the intermediate activations. I introduced a multi-channel multi-region (MCMR) scheme to enable an efficient implementation of the fast Cook-Toom algorithm. I demonstrate that by effectively using SIMD instructions and the MCMR scheme, an average 2-3x and a peak 4x per-layer speedup is easily achievable.
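The thesis does not spell out the MCMR data layout here, but the underlying idea can be sketched as batching the transform and Hadamard stages over many channels and input tiles at once, so that every SIMD lane does useful work. The sketch below reuses the F(2,3) transforms and, for simplicity, does per-channel (depthwise-style) filtering; all shapes and values are illustrative assumptions.

    import numpy as np

    # F(2,3) input and output transforms (standard construction).
    BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                   [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
    AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

    C, R = 8, 16                        # channels x tiles (regions) per channel
    tiles = np.random.randn(C, R, 4)    # overlapping 4-sample input tiles
    Ug = np.random.randn(C, 4)          # pre-transformed filters (stand-in values)

    V = tiles @ BT.T                    # transform every tile of every channel
    M = V * Ug[:, None, :]              # Hadamard stage, dense over C and R
    Y = M @ AT.T                        # (C, R, 2): two outputs per tile

On a Neon-class SIMD unit the dense (C, R, 4) Hadamard stage vectorizes cleanly, which is where the per-layer speedup is won.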
My final contribution is the Cook-Toom accelerator, a custom hardware architecture for modern convolutional neural networks. This accelerator architecture is designed from the ground up to address some of the limitations of a resource-constrained SIMD processor. I also illustrate how new emerging layer types can be mapped efficiently to the same flexible architecture without any modification.
Communication lower bounds for nested bilinear algorithms
We develop lower bounds on communication in the memory hierarchy or between processors for nested bilinear algorithms, such as Strassen's algorithm for matrix multiplication. We build on a previous framework that establishes communication lower bounds by use of the rank expansion, or the minimum rank of any fixed-size subset of columns of a matrix, for each of the three matrices encoding the bilinear algorithm. This framework provides lower bounds for any way of computing a bilinear algorithm, which encompasses a larger space of algorithms than fixing a particular dependency graph. Nested bilinear algorithms include fast recursive algorithms for convolution, matrix multiplication, and contraction of tensors with symmetry. Two bilinear algorithms can be nested by taking Kronecker products between their encoding matrices. Our main result is a lower bound on the rank expansion of a matrix constructed by a Kronecker product, derived from lower bounds on the rank expansion of the Kronecker product's operands. To prove this bound, we map a subset of columns from a submatrix to a 2D grid, collapse them into a dense grid, expand the grid, and use the size of the expanded grid to bound the number of linearly independent columns of the submatrix. We apply the rank expansion lower bounds to obtain novel communication lower bounds for nested Toom-Cook convolution, Strassen's algorithm, and fast algorithms for partially symmetric contractions.
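The nesting construction the abstract refers to is easy to demonstrate concretely. Below is a small sketch (an illustration, not the paper's code) that encodes Karatsuba multiplication of 2-coefficient polynomials as a bilinear algorithm (A, B, C) with z = C((A^T x) * (B^T y)), nests it with itself via Kronecker products of the encoding matrices, and checks that the nested algorithm multiplies 4-coefficient polynomials with 9 products instead of 16.

    import numpy as np

    # Karatsuba for 2-coefficient polynomials as a bilinear algorithm (A, B, C):
    # z = C @ ((A.T @ x) * (B.T @ y)) uses 3 multiplications instead of 4.
    A = np.array([[1, 1, 0],
                  [0, 1, 1]], dtype=float)       # evaluations x0, x0+x1, x1
    B = A.copy()
    C = np.array([[ 1, 0,  0],
                  [-1, 1, -1],
                  [ 0, 0,  1]], dtype=float)     # interpolation to coefficients

    # Nest the algorithm with itself: Kronecker products of the encodings.
    AA, BB, CC = np.kron(A, A), np.kron(B, B), np.kron(C, C)

    x, y = np.random.randn(4), np.random.randn(4)
    m = (AA.T @ x) * (BB.T @ y)                  # 9 products instead of 16
    z9 = CC @ m                                  # two-level coefficient grid

    # Recombine the (u = t^2, v = t) grid into a single 7-coefficient result.
    z = np.zeros(7)
    for k1 in range(3):
        for k2 in range(3):
            z[2 * k1 + k2] += z9[3 * k1 + k2]
    print(np.allclose(z, np.convolve(x, y)))     # True

The paper's communication bounds apply to the nested encodings AA, BB, CC themselves, via the rank expansion of the operands A, B, C.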
1D-FALCON: Accelerating Deep Convolutional Neural Network Inference by Co-optimization of Models and Underlying Arithmetic Implementation
Deep convolutional neural networks (CNNs), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity, limiting their deployability. In modern CNNs, convolutional layers typically consume about 90% of the processing time during forward inference, and acceleration of these layers is of great research and commercial interest. In this paper, we examine the effects of co-optimizing internal structures of convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speed-up of a CNN, achieving a tenfold increase over baseline. We also introduce a new class of fast 1-D convolutions for CNNs using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speedups solely from convolutional layers with no loss in baseline accuracy.
Even faster integer multiplication
We give a new proof of Fürer's bound for the cost of multiplying n-bit integers in the bit complexity model. Unlike Fürer, our method does not require constructing special coefficient rings with "fast" roots of unity. Moreover, we prove the more explicit bound O(n log n · K^(log* n)) with K = 8. We show that an optimised variant of Fürer's algorithm achieves only K = 16, suggesting that the new algorithm is faster than Fürer's by a factor of 2^(log* n). Assuming standard conjectures about the distribution of Mersenne primes, we give yet another algorithm that achieves K = 4.
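For intuition about the size of that gap, here is a tiny sketch that computes the iterated logarithm log* n and the resulting 2^(log* n) factor; the bit sizes chosen are arbitrary examples.

    import math

    def log_star(n, base=2.0):
        """Iterated logarithm: how many times log can be applied before n <= 1."""
        count = 0
        while n > 1:
            n = math.log(n, base)
            count += 1
        return count

    # The bound is O(n log n * K^(log* n)), so moving from K = 16 to K = 8
    # saves a factor of 2^(log* n), which grows extraordinarily slowly.
    for bits in (2**20, 2**64, 2**256):
        ls = log_star(bits)
        print(f"n = 2^{bits.bit_length() - 1}: log* n = {ls}, gap = {2**ls}x")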
Algorithmic Views of Vectorized Polynomial Multipliers for NTRU and NTRU Prime (Long Paper)
This paper explores the design space of vector-optimized polynomial multiplications in the lattice-based key-encapsulation mechanisms NTRU and NTRU Prime. Since NTRU and NTRU Prime do not support straightforward applications of number-theoretic transforms, the state-of-the-art vector code either resorted to Toom–Cook or introduced various techniques for coefficient ring extensions. All these techniques lead to a large number of small-degree polynomial multiplications, which is the bottleneck in our experiments.
For NTRU Prime, we show how to reduce the number of small-degree polynomial multiplications to roughly a quarter of that in the previous vectorized code with the same functionality. Our transformations are based on careful choices of FFTs, including Good–Thomas, Rader's, Schönhage's, and Bruun's FFTs. For NTRU, we show how to deploy Toom-5 with 3-bit losses.
Furthermore, we show that the Toeplitz matrix–vector product naturally translates into efficient implementations built on vector-by-scalar multiplication instructions, an approach that does not appear in prior vector-optimized implementations.
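To illustrate why the Toeplitz view maps onto vector-by-scalar instructions, here is a minimal sketch (my illustration, not the paper's code): a polynomial product is a Toeplitz matrix-vector product in which each scalar coefficient of one operand scales the whole other operand and accumulates at an offset, exactly the pattern of a vector-by-scalar multiply-accumulate such as Neon's MLA with a scalar lane.

    import numpy as np

    def poly_mul_vector_by_scalar(a, b):
        """Schoolbook polynomial product written as vector-by-scalar MACs."""
        acc = np.zeros(len(a) + len(b) - 1)
        for i, s in enumerate(b):          # one scalar coefficient per step...
            acc[i:i + len(a)] += s * a     # ...times the whole vector a
        return acc

    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = np.array([5.0, 6.0, 7.0])
    print(np.allclose(poly_mul_vector_by_scalar(a, b), np.convolve(a, b)))  # True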
For experiments we choose the ARM Cortex-A72 CPU, which implements the Armv8-A architecture, because of its wide use in smartphones and because its Neon vector instruction set provides vector-by-scalar multiplications that do not appear in most other vector instruction sets, such as Intel's AVX2.
Even for platforms without vector-by-scalar multiplications, we expect significant improvements compared to the state of the art, since our transformations reduce the number of multiplication instructions by a large margin.
Compared to the state-of-the-art optimized implementations, we achieve 2.18× and 6.7× faster polynomial multiplications for NTRU and NTRU Prime, respectively. For full schemes, we additionally vectorize the polynomial inversions, sorting network, and encoding/decoding subroutines in NTRU and NTRU Prime. For ntruhps2048677, we achieve 7.67×, 2.48×, and 1.77× faster key generation, encapsulation, and decapsulation, respectively. For ntrulpr761, we achieve 3×, 2.87×, and 3.25× faster key generation, encapsulation, and decapsulation, respectively. For sntrup761, there are no previously optimized implementations and we significantly outperform the reference implementation.