45 research outputs found

    On the Reduction of Computational Complexity of Deep Convolutional Neural Networks.

    Get PDF
    Deep convolutional neural networks (ConvNets), which are at the heart of many emerging applications, achieve remarkable performance in audio and visual recognition tasks. Unfortunately, this accuracy often comes at significant computational cost, which limits deployability. In modern ConvNets the convolution layers typically consume the vast majority of computational resources during inference, making the acceleration of these layers an important research area in academia and industry. In this paper, we examine the effects of co-optimizing the internal structures of the convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a large impact on the overall speedup of a ConvNet, achieving a ten-fold increase over the baseline. We also introduce a new class of fast one-dimensional (1D) convolutions for ConvNets based on the Toom-Cook algorithm. We show that our proposed scheme is mathematically well-grounded and robust, requires no time-consuming retraining, and achieves its speedups solely from the convolutional layers with no loss in baseline accuracy.
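    The abstract does not spell out the paper's new 1D kernels, but the Toom-Cook construction it builds on is easy to illustrate. Below is a minimal sketch (my own illustration, not the paper's implementation; function names and the tiling loop are mine) of the classic F(2,3) tile, which produces two correlation outputs per 3-tap filter with 4 multiplications instead of the naive 6.

```python
import numpy as np

def toom_cook_f23(d, g):
    """One F(2,3) tile: two correlation outputs from four inputs and a
    3-tap filter, using 4 multiplications instead of the naive 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

def conv1d_toom(x, g):
    """'Valid' 1D correlation assembled from overlapping F(2,3) tiles."""
    n_out = len(x) - len(g) + 1
    y = np.empty(n_out)
    for i in range(0, n_out - 1, 2):
        y[i:i + 2] = toom_cook_f23(x[i:i + 4], g)
    if n_out % 2:                      # odd leftover output
        y[-1] = np.dot(x[n_out - 1:n_out + 2], g)
    return y

x, g = np.random.randn(10), np.random.randn(3)
assert np.allclose(conv1d_toom(x, g), np.correlate(x, g, mode="valid"))
```

    In a ConvNet the filter-side transforms (the combinations of g above) can be precomputed once per trained filter, which is why such schemes need no retraining and why the savings compound across a whole convolution layer.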

    Communication lower bounds for nested bilinear algorithms

    Full text link
    We develop lower bounds on communication in the memory hierarchy or between processors for nested bilinear algorithms, such as Strassen's algorithm for matrix multiplication. We build on a previous framework that establishes communication lower bounds via the rank expansion, i.e., the minimum rank of any fixed-size subset of columns of a matrix, for each of the three matrices encoding the bilinear algorithm. This framework provides lower bounds for any way of computing a bilinear algorithm, which encompasses a larger space of algorithms than fixing a particular dependency graph does. Nested bilinear algorithms include fast recursive algorithms for convolution, matrix multiplication, and contraction of tensors with symmetry. Two bilinear algorithms can be nested by taking Kronecker products between their encoding matrices. Our main result is a lower bound on the rank expansion of a matrix constructed by a Kronecker product, derived from lower bounds on the rank expansion of the Kronecker product's operands. To prove this bound, we map a subset of columns from a submatrix to a 2D grid, collapse them into a dense grid, expand the grid, and use the size of the expanded grid to bound the number of linearly independent columns of the submatrix. We apply the rank expansion lower bounds to obtain novel communication lower bounds for nested Toom-Cook convolution, Strassen's algorithm, and fast algorithms for partially symmetric contractions.
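    As a concrete instance of the nesting described above, here is a small sketch (my own illustration, not from the paper) in which Karatsuba's bilinear algorithm, encoded as a matrix triple (U, V, W), is nested with itself via Kronecker products to multiply degree-3 polynomials with 9 multiplications instead of 16.

```python
import numpy as np

# Karatsuba as a bilinear algorithm (U, V, W):
#   z = W @ ((U.T @ x) * (V.T @ y))  computes the degree-1 poly product.
U = np.array([[1, 1, 0],
              [0, 1, 1]])            # columns evaluate x0, x0+x1, x1
V = U.copy()
W = np.array([[ 1,  0,  0],
              [-1,  1, -1],
              [ 0,  0,  1]])         # interpolate back to z0, z1, z2

def bilinear(U, V, W, x, y):
    return W @ ((U.T @ x) * (V.T @ y))

# One-level check on degree-1 polynomials.
x, y = np.random.randn(2), np.random.randn(2)
assert np.allclose(bilinear(U, V, W, x, y), np.convolve(x, y))

# Nested (two-level) Karatsuba: Kronecker products of the encodings
# give a 9-multiplication algorithm for the degree-3 product.
U2, V2, W2 = np.kron(U, U), np.kron(V, V), np.kron(W, W)
x, y = np.random.randn(4), np.random.randn(4)
z9 = bilinear(U2, V2, W2, x, y)      # coefficients of t^i s^j, t = s^2

# Overlap-add: substitute t = s^2 to collapse (i, j) onto s^(2i+j).
z = np.zeros(7)
for i in range(3):
    z[2 * i:2 * i + 3] += z9[3 * i:3 * i + 3]
assert np.allclose(z, np.convolve(x, y))
```

    It is exactly this Kronecker structure of the nested encoding matrices whose rank expansion the paper bounds.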

    Even faster integer multiplication

    Full text link
    We give a new proof of Fürer's bound for the cost of multiplying n-bit integers in the bit complexity model. Unlike Fürer, our method does not require constructing special coefficient rings with "fast" roots of unity. Moreover, we prove the more explicit bound O(n log n · K^(log* n)) with K = 8. We show that an optimised variant of Fürer's algorithm achieves only K = 16, suggesting that the new algorithm is faster than Fürer's by a factor of 2^(log* n). Assuming standard conjectures about the distribution of Mersenne primes, we give yet another algorithm that achieves K = 4.
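    To get a feel for the K^(log* n) factor, the sketch below (illustrative only; the function name is mine) computes the iterated logarithm log* n and the resulting 2^(log* n) gap between the K = 16 and K = 8 bounds.

```python
import math

def log_star(n, base=2.0):
    """Iterated logarithm: how many times log must be applied to n
    before the result drops to 1 or below."""
    count, x = 0, float(n)
    while x > 1.0:
        x = math.log(x, base)
        count += 1
    return count

# O(n log n * K^(log* n)): the gap between K = 16 (optimised Fürer)
# and K = 8 (this paper) is (16/8)^(log* n) = 2^(log* n).
for n in (2**16, 2**64, 2**1000):
    ls = log_star(n)
    print(f"n = 2^{n.bit_length() - 1}: log* n = {ls}, gap = {2**ls}x")
```

    Since log* grows so slowly, the gap is a small constant (4x to 32x for any practical n), which is why the value of K is the headline of the bound.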

    Algorithmic Views of Vectorized Polynomial Multipliers for NTRU and NTRU Prime (Long Paper)

    Get PDF
    This paper explores the design space of vector-optimized polynomial multiplications in the lattice-based key-encapsulation mechanisms NTRU and NTRU Prime. Since NTRU and NTRU Prime do not support straightforward applications of number-theoretic transforms, state-of-the-art vector code has either resorted to Toom-Cook or introduced various techniques for coefficient ring extensions. All of these techniques lead to a large number of small-degree polynomial multiplications, which is the bottleneck in our experiments. For NTRU Prime, we show how to reduce the number of small-degree polynomial multiplications to roughly a quarter of that of previous vectorized code with the same functionality. Our transformations are based on careful choices of FFTs, including Good–Thomas, Rader's, Schönhage's, and Bruun's FFTs. For NTRU, we show how to deploy Toom-5 with 3-bit losses. Furthermore, we show that the Toeplitz matrix–vector product naturally translates into efficient implementations built on vector-by-scalar multiplication instructions, which do not appear in prior vector-optimized implementations. For our experiments we choose the ARM Cortex-A72 CPU, which implements the Armv8-A architecture, because of its wide use in smartphones and because its Neon vector instruction set provides the vector-by-scalar multiplications absent from most other vector instruction sets, such as Intel's AVX2. Even for platforms without vector-by-scalar multiplications, we expect significant improvements over the state of the art, since our transformations reduce the number of multiplication instructions by a large margin. Compared to the state-of-the-art optimized implementations, we achieve 2.18× and 6.7× faster polynomial multiplications for NTRU and NTRU Prime, respectively. For full schemes, we additionally vectorize the polynomial inversions, sorting network, and encoding/decoding subroutines in NTRU and NTRU Prime. For ntruhps2048677, we achieve 7.67×, 2.48×, and 1.77× faster key generation, encapsulation, and decapsulation, respectively. For ntrulpr761, we achieve 3×, 2.87×, and 3.25× faster key generation, encapsulation, and decapsulation, respectively. For sntrup761, there are no previously optimized implementations, and we significantly outperform the reference implementation.
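    The Toeplitz matrix–vector product mentioned above maps naturally onto vector-by-scalar multiply-accumulate. The following sketch (my own NumPy illustration, not the paper's Neon code) shows the core pattern: multiplying by a polynomial b is a Toeplitz matrix acting on the coefficient vector a, and each column of that matrix contributes one whole-vector-times-scalar accumulation.

```python
import numpy as np

def toeplitz_polymul(a, b):
    """Polynomial product c = a*b as a Toeplitz matrix-vector product:
    column j of the Toeplitz matrix is a shifted copy of a, so each
    step is one whole-vector times scalar b[j] multiply-accumulate."""
    c = np.zeros(len(a) + len(b) - 1)
    for j, bj in enumerate(b):
        c[j:j + len(a)] += a * bj   # one vector-by-scalar step
    return c

a = np.random.randint(-5, 5, 8)
b = np.random.randint(-5, 5, 8)
assert np.allclose(toeplitz_polymul(a, b), np.convolve(a, b))
```

    On Neon, each such step becomes a handful of multiply-accumulate-by-lane instructions over the vector a, which is the instruction shape the paper's Toeplitz formulation is designed to exploit.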