On the Reduction of Computational Complexity of Deep Convolutional Neural Networks
Deep convolutional neural networks (ConvNets), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks. Unfortunately, this accuracy often comes at significant computational cost, limiting deployability. In modern ConvNets it is typical for the convolution layers to consume the vast majority of computational resources during inference, which has made the acceleration of these layers an important research area in academia and industry. In this paper, we examine the effects of co-optimizing the internal structures of the convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speedup of a ConvNet, achieving a ten-fold increase over baseline. We also introduce a new class of fast one-dimensional (1D) convolutions for ConvNets using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well-grounded, robust, and does not require any time-consuming retraining, while still achieving speedups solely from the convolutional layers with no loss in baseline accuracy.
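To make the Toom-Cook idea concrete, here is a minimal sketch of the classic F(2,3) instance, which produces two outputs of a 3-tap 1D filter using 4 multiplications instead of the 6 needed by the direct method. The transform matrices are the standard textbook construction, not necessarily the paper's exact scheme, and the input values are illustrative.

    import numpy as np

    # F(2,3) Toom-Cook / Winograd transforms: 2 outputs of a 3-tap filter
    # from 4 input samples, using 4 elementwise multiplications.
    BT = np.array([[1, 0, -1, 0],
                   [0, 1,  1, 0],
                   [0, -1, 1, 0],
                   [0, 1,  0, -1]], dtype=float)
    G = np.array([[1.0,  0.0, 0.0],
                  [0.5,  0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0,  0.0, 1.0]])
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=float)

    def f23(d, g):
        """d: 4 input samples, g: 3 filter taps -> 2 filter outputs."""
        U = G @ g              # filter transform (precomputable once per filter)
        V = BT @ d             # input transform
        return AT @ (U * V)    # 4 multiplications + output transform

    d = np.array([1.0, 2.0, 3.0, 4.0])
    g = np.array([1.0, 0.0, -1.0])
    print(f23(d, g))                                # fast path
    print(np.convolve(d, g[::-1], mode='valid'))    # direct path, same result

Because the filter transform can be computed once and reused across the whole input, the saving comes almost entirely from the shrunken elementwise stage.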
Model-Architecture Co-design of Deep Neural Networks for Embedded Systems
In deep learning, a convolutional neural network (ConvNet or CNN) is a powerful tool for building interesting embedded applications that use data to make predictions. An application running on an embedded system typically has limited access to memory, processing power, and storage. Implementing deep convolutional neural network-based inference on resource-constrained devices can be very challenging, as these environments cannot usually draw on the massive computing power and storage available in cloud server environments. Furthermore, the constantly evolving nature of modern deep network architectures aggravates the problem by forcing a balance between flexibility and specialisation. However, much of the baseline architecture of a deep convolutional neural network has stayed the same. With careful optimisation of the most common and widely occurring layer architectures, it is typically possible to accelerate these emerging workloads on resource-constrained embedded systems.
This thesis makes four contributions. I first developed a lossy three-stage low-rank approximation scheme that can reduce the computational complexity of a pre-trained model by 3-5x and up to 8-9x for individual convolutional layers. This scheme requires restructuring of the convolutional layers and generally suits the scenario where both the training data and trained model are available.
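As a sketch of the low-rank idea (a single step, not the thesis's three-stage scheme), a convolutional layer's filter bank can be matricized and factored with a truncated SVD into two cheaper layers. The layer sizes and the rank r below are hypothetical; r is the knob that trades accuracy for speed.

    import numpy as np

    # Hypothetical layer sizes; W stands in for a pre-trained filter bank.
    C_out, C_in, k, r = 64, 32, 3, 16
    W = np.random.randn(C_out, C_in, k, k)

    W2d = W.reshape(C_out, C_in * k * k)      # matricize the filter bank
    U, s, Vt = np.linalg.svd(W2d, full_matrices=False)
    A = U[:, :r] * s[:r]                      # (C_out, r): acts as a 1x1 conv
    B = Vt[:r, :]                             # (r, C_in*k*k): acts as a kxk conv

    # Multiply-accumulate counts per output position, before vs after.
    direct  = C_out * C_in * k * k
    lowrank = r * C_in * k * k + C_out * r
    print(direct, lowrank, round(direct / lowrank, 2))   # ~3.3x fewer MACs here

Without the fine-tuning step the full scheme assumes, the truncation error goes straight into prediction accuracy, which is why the availability of training data matters.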
In many scenarios, however, the training data is not available, so any loss in prediction accuracy caused by structural changes made to a model as a post-processing step cannot be fine-tuned away. Besides the lack of training data, there are other situations where the architecture of a model cannot be changed after training. My second contribution handles this scenario with a low-level optimisation scheme that, unlike the low-rank approximation scheme, requires no changes to the model architecture. This novel scheme uses a modified version of the Cook-Toom algorithm to reduce the computational intensity of commonly occurring dense and spatial convolutional layers and to speed up inference by 2-4x.
My third contribution is an efficient implementation of the Cook-Toom class of algorithms on Arm's ubiquitous low-power Cortex processors. Unlike direct convolution, computing convolutions with the modified Cook-Toom algorithm requires a different data processing pipeline, as it involves pre- and post-transformations of the intermediate activations. I introduced a multi-channel multi-region (MCMR) scheme to enable an efficient implementation of the fast Cook-Toom algorithm. I demonstrate that by effectively using SIMD instructions and the MCMR scheme, an average 2-3x and a peak 4x per-layer speedup is easily achievable.
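The thesis does not spell out the MCMR data layout here, but the underlying idea can be sketched as batching the transform and Hadamard stages over many channels and input tiles at once, so that every SIMD lane does useful work. The sketch below reuses the F(2,3) transforms and, for simplicity, does per-channel (depthwise-style) filtering; all shapes and values are illustrative assumptions.

    import numpy as np

    # F(2,3) input and output transforms (standard construction).
    BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                   [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
    AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

    C, R = 8, 16                        # channels x tiles (regions) per channel
    tiles = np.random.randn(C, R, 4)    # overlapping 4-sample input tiles
    Ug = np.random.randn(C, 4)          # pre-transformed filters (stand-in values)

    V = tiles @ BT.T                    # transform every tile of every channel
    M = V * Ug[:, None, :]              # Hadamard stage, dense over C and R
    Y = M @ AT.T                        # (C, R, 2): two outputs per tile

On a Neon-class SIMD unit the dense (C, R, 4) Hadamard stage vectorizes cleanly, which is where the per-layer speedup is won.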
My final contribution is the Cook-Toom accelerator, a custom hardware architecture for modern convolutional neural networks. This accelerator architecture is designed from the ground up to address some of the limitations of a resource-constrained SIMD processor. I also illustrate how new emerging layer types can be mapped efficiently to the same flexible architecture without any modification.
Communication lower bounds for nested bilinear algorithms
We develop lower bounds on communication in the memory hierarchy or between processors for nested bilinear algorithms, such as Strassen's algorithm for matrix multiplication. We build on a previous framework that establishes communication lower bounds by use of the rank expansion, or the minimum rank of any fixed-size subset of columns of a matrix, for each of the three matrices encoding the bilinear algorithm. This framework provides lower bounds for any way of computing a bilinear algorithm, which encompasses a larger space of algorithms than fixing a particular dependency graph. Nested bilinear algorithms include fast recursive algorithms for convolution, matrix multiplication, and contraction of tensors with symmetry. Two bilinear algorithms can be nested by taking Kronecker products between their encoding matrices. Our main result is a lower bound on the rank expansion of a matrix constructed by a Kronecker product, derived from lower bounds on the rank expansion of the Kronecker product's operands. To prove this bound, we map a subset of columns from a submatrix to a 2D grid, collapse them into a dense grid, expand the grid, and use the size of the expanded grid to bound the number of linearly independent columns of the submatrix. We apply the rank expansion lower bounds to obtain novel communication lower bounds for nested Toom-Cook convolution, Strassen's algorithm, and fast algorithms for partially symmetric contractions.
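The nesting construction the abstract refers to is easy to demonstrate concretely. Below is a small sketch (an illustration, not the paper's code) that encodes Karatsuba multiplication of 2-coefficient polynomials as a bilinear algorithm (A, B, C) with z = C((A^T x) * (B^T y)), nests it with itself via Kronecker products of the encoding matrices, and checks that the nested algorithm multiplies 4-coefficient polynomials with 9 products instead of 16.

    import numpy as np

    # Karatsuba for 2-coefficient polynomials as a bilinear algorithm (A, B, C):
    # z = C @ ((A.T @ x) * (B.T @ y)) uses 3 multiplications instead of 4.
    A = np.array([[1, 1, 0],
                  [0, 1, 1]], dtype=float)       # evaluations x0, x0+x1, x1
    B = A.copy()
    C = np.array([[ 1, 0,  0],
                  [-1, 1, -1],
                  [ 0, 0,  1]], dtype=float)     # interpolation to coefficients

    # Nest the algorithm with itself: Kronecker products of the encodings.
    AA, BB, CC = np.kron(A, A), np.kron(B, B), np.kron(C, C)

    x, y = np.random.randn(4), np.random.randn(4)
    m = (AA.T @ x) * (BB.T @ y)                  # 9 products instead of 16
    z9 = CC @ m                                  # two-level coefficient grid

    # Recombine the (u = t^2, v = t) grid into a single 7-coefficient result.
    z = np.zeros(7)
    for k1 in range(3):
        for k2 in range(3):
            z[2 * k1 + k2] += z9[3 * k1 + k2]
    print(np.allclose(z, np.convolve(x, y)))     # True

The paper's communication bounds apply to the nested encodings AA, BB, CC themselves, via the rank expansion of the operands A, B, C.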
1D-FALCON: Accelerating Deep Convolutional Neural Network Inference by Co-optimization of Models and Underlying Arithmetic Implementation
Deep convolutional neural networks (CNNs), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity, limiting their deployability. In modern CNNs, convolutional layers typically consume about 90% of the processing time during forward inference, and acceleration of these layers is of great research and commercial interest. In this paper, we examine the effects of co-optimizing internal structures of convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speed-up of a CNN, achieving a tenfold increase over baseline. We also introduce a new class of fast 1-D convolutions for CNNs using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speedups solely from convolutional layers with no loss in baseline accuracy.
Even faster integer multiplication
We give a new proof of Fürer's bound for the cost of multiplying n-bit integers in the bit complexity model. Unlike Fürer, our method does not require constructing special coefficient rings with "fast" roots of unity. Moreover, we prove the more explicit bound O(n log n · K^(log* n)) with K = 8. We show that an optimised variant of Fürer's algorithm achieves only K = 16, suggesting that the new algorithm is faster than Fürer's by a factor of 2^(log* n). Assuming standard conjectures about the distribution of Mersenne primes, we give yet another algorithm that achieves K = 4.
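For intuition about the size of that gap, here is a tiny sketch that computes the iterated logarithm log* n and the resulting 2^(log* n) factor; the bit sizes chosen are arbitrary examples.

    import math

    def log_star(n, base=2.0):
        """Iterated logarithm: how many times log can be applied before n <= 1."""
        count = 0
        while n > 1:
            n = math.log(n, base)
            count += 1
        return count

    # The bound is O(n log n * K^(log* n)), so moving from K = 16 to K = 8
    # saves a factor of 2^(log* n), which grows extraordinarily slowly.
    for bits in (2**20, 2**64, 2**256):
        ls = log_star(bits)
        print(f"n = 2^{bits.bit_length() - 1}: log* n = {ls}, gap = {2**ls}x")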
Algorithmic Views of Vectorized Polynomial Multipliers for NTRU and NTRU Prime (Long Paper)
This paper explores the design space of vector-optimized polynomial multiplications in the lattice-based key-encapsulation mechanisms NTRU and NTRU Prime. Since NTRU and NTRU Prime do not support straightforward applications of number-theoretic transforms, the state-of-the-art vector code either resorted to Toom–Cook or introduced various techniques for coefficient ring extensions. All these techniques lead to a large number of small-degree polynomial multiplications, which is the bottleneck in our experiments.
For NTRU Prime, we show how to reduce the number of small-degree polynomial multiplications to roughly a quarter of that in the previous vectorized code with the same functionality. Our transformations are based on careful choices of FFTs, including Good–Thomas, Rader's, Schönhage's, and Bruun's FFTs. For NTRU, we show how to deploy Toom-5 with 3-bit losses.
Furthermore, we show that the Toeplitz matrix–vector product naturally translates into efficient implementations built on vector-by-scalar multiplication instructions, an approach that does not appear in prior vector-optimized implementations.
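To illustrate why the Toeplitz view maps onto vector-by-scalar instructions, here is a minimal sketch (my illustration, not the paper's code): a polynomial product is a Toeplitz matrix-vector product in which each scalar coefficient of one operand scales the whole other operand and accumulates at an offset, exactly the pattern of a vector-by-scalar multiply-accumulate such as Neon's MLA with a scalar lane.

    import numpy as np

    def poly_mul_vector_by_scalar(a, b):
        """Schoolbook polynomial product written as vector-by-scalar MACs."""
        acc = np.zeros(len(a) + len(b) - 1)
        for i, s in enumerate(b):          # one scalar coefficient per step...
            acc[i:i + len(a)] += s * a     # ...times the whole vector a
        return acc

    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = np.array([5.0, 6.0, 7.0])
    print(np.allclose(poly_mul_vector_by_scalar(a, b), np.convolve(a, b)))  # True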
For experiments we choose the ARM Cortex-A72 CPU, which implements the Armv8-A architecture, because of its wide use in smartphones and because its Neon vector instruction set provides vector-by-scalar multiplications that do not appear in most other vector instruction sets, such as Intel's AVX2.
Even for platforms without vector-by-scalar multiplications, we expect significant improvements compared to the state of the art, since our transformations reduce the number of multiplication instructions by a large margin.
Compared to the state-of-the-art optimized implementations, we achieve 2.18× and 6.7× faster polynomial multiplications for NTRU and NTRU Prime, respectively. For full schemes, we additionally vectorize the polynomial inversions, sorting network, and encoding/decoding subroutines in NTRU and NTRU Prime. For ntruhps2048677, we achieve 7.67×, 2.48×, and 1.77× faster key generation, encapsulation, and decapsulation, respectively. For ntrulpr761, we achieve 3×, 2.87×, and 3.25× faster key generation, encapsulation, and decapsulation, respectively. For sntrup761, there are no previously optimized implementations and we significantly outperform the reference implementation.