
    Algorithmic Views of Vectorized Polynomial Multipliers – NTRU Prime

    In this paper, we explore the cost of vectorization for polynomial multiplication with coefficients in $\mathbb{Z}_q$ for an odd prime $q$. If there is a large power of two dividing $q-1$, we can apply radix-2 Cooley–Tukey fast Fourier transforms to multiply polynomials in $\mathbb{Z}_q[x]$. The radix-2 nature admits efficient vectorization. Conversely, if 2 is the only power of two dividing $q-1$, we can apply Schönhage’s and Nussbaumer’s FFTs to craft radix-2 roots of unity, but these double the number of coefficients. We show how to avoid this doubling while maintaining vectorization friendliness with Good–Thomas, Rader’s, and Bruun’s FFTs. In particular, we exploit the existing Fermat-prime factor of $q-1$ for Rader’s FFT and the power-of-two factor of $q+1$ for Bruun’s FFT. We implement these ideas for the NTRU Prime instances ntrulpr761/sntrup761, operating over the coefficient ring $\mathbb{Z}_{4591}$ on a Cortex-A72. sntrup761 is currently used in OpenSSH 9.0 by default. Our polynomial multiplication outperforms the state-of-the-art vector-optimized implementation by 6.1×. For ntrulpr761, our keygen, encap, and decap are 2.98×, 2.79×, and 3.07× faster than the state-of-the-art vector-optimized implementation. For sntrup761, we outperform the reference implementation significantly.
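
    To see why Rader’s and Bruun’s FFTs apply to this coefficient ring, a quick check of the factorizations of q - 1 and q + 1 for q = 4591 suffices (an illustration, not code from the paper): q - 1 contains the Fermat prime 17 = 2^4 + 1, and q + 1 contains the power-of-two factor 2^4.

```python
# Illustrative sketch (not from the paper): factor q - 1 and q + 1 for
# q = 4591, the NTRU Prime coefficient modulus, to see which FFTs apply.

def factorize(n):
    """Return the prime factorization of n as a {prime: exponent} dict."""
    factors, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

q = 4591
print(factorize(q - 1))  # {2: 1, 3: 3, 5: 1, 17: 1} -> 17 = 2^4 + 1 is a Fermat prime (Rader)
print(factorize(q + 1))  # {2: 4, 7: 1, 41: 1}       -> a 2^4 factor of q + 1 (Bruun)
```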

    Computing the fast Fourier transform on SIMD microprocessors

    This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than or very close to the speed of state-of-the-art libraries such as FFTW (“Fastest Fourier Transform in the West”), SPIRAL and Intel Integrated Performance Primitives (IPP). The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2-way, 4-way and 8-way SIMD machines show that the performance scales much better than FFTW or SPIRAL. The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT (“Streaming Fast Fourier Transform”), and benchmarked against FFTW, SPIRAL, Intel IPP and Apple Accelerate on sixteen x86 machines and two ARM NEON machines, and shown to be, in many cases, faster than these state-of-the-art libraries, but without having to perform extensive machine-specific calibration, thus demonstrating that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated).
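
    For readers unfamiliar with the conjugate-pair algorithm, the following is a plain, unvectorized Python sketch of its split-radix recursion (the even half plus the x[4n+1] and x[4n-1 mod N] quarters, whose twiddle factors are complex conjugates). It is meant only to convey the data flow; it is not the SIMD implementations benchmarked in the thesis.

```python
import numpy as np

def cp_fft(x):
    """Conjugate-pair split-radix FFT of a power-of-two-length sequence.

    Textbook-style recursion for illustration only; real implementations
    unroll the recursion, precompute twiddles, and vectorize.
    """
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    if n == 2:
        return np.array([x[0] + x[1], x[0] - x[1]], dtype=complex)
    u = cp_fft(x[0::2])                                       # even samples, length n/2
    z = cp_fft(x[1::4])                                       # x[4m+1], length n/4
    zc = cp_fft([x[(4 * m - 1) % n] for m in range(n // 4)])  # x[4m-1 mod n], length n/4
    k = np.arange(n // 4)
    w = np.exp(-2j * np.pi * k / n)                           # twiddles w^k
    t1 = w * z + np.conj(w) * zc                              # conjugate-pair twiddles w^k, w^-k
    t2 = 1j * (w * z - np.conj(w) * zc)
    X = np.empty(n, dtype=complex)
    X[k] = u[k] + t1
    X[k + n // 2] = u[k] - t1
    X[k + n // 4] = u[k + n // 4] - t2
    X[k + 3 * n // 4] = u[k + n // 4] + t2
    return X

x = np.random.default_rng(1).standard_normal(64)
assert np.allclose(cp_fft(x), np.fft.fft(x))
```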

    Algorithmic Views of Vectorized Polynomial Multipliers for NTRU and NTRU Prime (Long Paper)

    This paper explores the design space of vector-optimized polynomial multiplications in the lattice-based key-encapsulation mechanisms NTRU and NTRU Prime. Since NTRU and NTRU Prime do not support straightforward applications of number-theoretic transforms, the state-of-the-art vector code either resorted to Toom–Cook or introduced various techniques for coefficient ring extensions. All these techniques lead to a large number of small-degree polynomial multiplications, which is the bottleneck in our experiments. For NTRU Prime, we show how to reduce the number of small-degree polynomial multiplications to roughly a quarter of that of the previous vectorized code with the same functionality. Our transformations are based on careful choices of FFTs, including Good–Thomas, Rader’s, Schönhage’s, and Bruun’s FFTs. For NTRU, we show how to deploy Toom-5 with 3-bit losses. Furthermore, we show that the Toeplitz matrix–vector product naturally translates into efficient implementations with vector-by-scalar multiplication instructions, which do not appear in all prior vector-optimized implementations. For our experiments we choose the ARM Cortex-A72 CPU, which implements the Armv8-A architecture, because of its wide use in smartphones and because its Neon vector instruction set provides vector-by-scalar multiplications, which most other vector instruction sets, such as Intel’s AVX2, lack. Even for platforms without vector-by-scalar multiplications, we expect significant improvements compared to the state of the art, since our transformations reduce the number of multiplication instructions by a large margin. Compared to the state-of-the-art optimized implementations, we achieve 2.18× and 6.7× faster polynomial multiplications for NTRU and NTRU Prime, respectively. For full schemes, we additionally vectorize the polynomial inversions, sorting network, and encoding/decoding subroutines in NTRU and NTRU Prime. For ntruhps2048677, we achieve 7.67×, 2.48×, and 1.77× faster key generation, encapsulation, and decapsulation, respectively. For ntrulpr761, we achieve 3×, 2.87×, and 3.25× faster key generation, encapsulation, and decapsulation, respectively. For sntrup761, there are no previously optimized implementations, and we significantly outperform the reference implementation.
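
    As a toy illustration of the Toeplitz view mentioned above (not the paper’s optimized routine): multiplying by a fixed polynomial a(x) is a linear map whose matrix has shifted copies of a’s coefficients in its columns, so the full product is a Toeplitz matrix–vector product.

```python
import numpy as np

# Toy sketch (not the paper's code): the product a(x) * b(x) computed as a
# Toeplitz matrix-vector product, where column j of T is a shifted down by j.
n = 8
rng = np.random.default_rng(0)
a = rng.integers(0, 4591, n)        # coefficients of a(x), lowest degree first
b = rng.integers(0, 4591, n)        # coefficients of b(x)

T = np.zeros((2 * n - 1, n), dtype=np.int64)
for j in range(n):
    T[j:j + n, j] = a

c = T @ b                           # full product, degree <= 2n - 2
assert np.array_equal(c, np.convolve(a, b))
```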

    Fast, Dense Feature SDM on an iPhone

    In this paper, we present our method for enabling dense SDM to run at over 90 FPS on a mobile device. Our contributions are two-fold. First, drawing inspiration from the FFT, we propose a Sparse Compositional Regression (SCR) framework, which enables a significant speed-up over classical dense regressors. Second, we propose Binary Approximated SIFT (BASIFT) features, a computationally efficient approximation to SIFT, a feature commonly used with SDM. We demonstrate the performance of our algorithm on an iPhone 7 and show that we achieve accuracy similar to SDM.
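
    As a generic illustration of why binarized descriptors are cheap to match on mobile hardware (this sketch is not the paper’s BASIFT construction, whose details the abstract does not give), a real-valued descriptor can be thresholded into bits and compared with XOR and a popcount.

```python
import numpy as np

# Generic sketch: binarize two SIFT-sized descriptors and compare them with
# a Hamming distance instead of a floating-point distance.
rng = np.random.default_rng(0)
d1 = rng.random(128)                          # a 128-dimensional descriptor
d2 = rng.random(128)

b1 = np.packbits(d1 > np.median(d1))          # 128 floats -> 16 bytes
b2 = np.packbits(d2 > np.median(d2))

hamming = int(np.unpackbits(b1 ^ b2).sum())   # cheap bitwise distance
print(hamming)
```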

    Acceleration of Deep Learning on FPGA

    In recent years, deep convolutional neural networks (ConvNets) have become popular in various real-world applications. To provide more accurate results, state-of-the-art ConvNets require millions of parameters and billions of operations to process a single image, which represents a computational challenge for general-purpose processors. As a result, hardware accelerators such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) have been adopted to improve the performance of ConvNets. However, GPU-based solutions consume a considerable amount of power, and a traditional RTL design on FPGA requires tedious, time-consuming development. In this work, we propose a scalable and parameterized end-to-end ConvNet design using the Intel FPGA SDK for OpenCL. To validate the design, we implement the VGG-16 model on two different FPGA boards. Our designs achieve 306.41 GOPS on Intel Stratix A7 and 318.94 GOPS on Intel Arria 10 GX 10AX115. To the best of our knowledge, this outperforms previous FPGA-based accelerators. Compared to a CPU (Intel Xeon E5-2620) and a mid-range GPU (Nvidia K40), our design is 24.3X and 1.7X more energy efficient, respectively.
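
    To put the reported throughput into perspective, a rough back-of-envelope estimate converts GOPS into images per second. The assumptions here (standard VGG-16 layer shapes, one multiply–accumulate counted as two operations, quoted GOPS sustained end to end) are ours, not the paper’s, apart from the two GOPS figures.

```python
# Back-of-envelope: operation count of the standard VGG-16 configuration
# (3x3 convolutions plus three fully connected layers) and the resulting
# frame rate if the reported GOPS were fully sustained.
conv = [  # (output H = W, C_in, C_out), all 3x3 kernels
    (224, 3, 64), (224, 64, 64),
    (112, 64, 128), (112, 128, 128),
    (56, 128, 256), (56, 256, 256), (56, 256, 256),
    (28, 256, 512), (28, 512, 512), (28, 512, 512),
    (14, 512, 512), (14, 512, 512), (14, 512, 512),
]
fc = [(25088, 4096), (4096, 4096), (4096, 1000)]

ops = sum(2 * s * s * cout * 9 * cin for s, cin, cout in conv)
ops += sum(2 * i * o for i, o in fc)
print(f"{ops / 1e9:.1f} GOPs per image")              # about 30.9 GOPs
for gops in (306.41, 318.94):
    print(f"{gops / (ops / 1e9):.1f} images/s at {gops} GOPS")
```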

    Application Performance Tuning on Xeon Phi

    The fast multipole method at exascale

    This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore optimizations and tuning for the FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design. To demonstrate the scientific significance of the FMM, we present two applications: direct simulation of blood, which is a multi-scale, multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood. It comprises two key algorithmic components, of which the FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose can be calculated using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step toward solving biologically challenging problems which have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities. Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms are being redesigned to take advantage of the architectural features of emerging multicore processors.
These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our CnC implementations were able to match, and in some cases even exceed, competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing the FMM in CnC; this approach tries to marry performance with the productivity that will be critical on future systems. Looking forward, we would like to extend this work to distributed-memory machines, specifically to implement the FMM in the new distributed CnC (distCnC) to express fine-grained parallelism that would require significant effort in alternative models.
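
    For context, the direct N-body evaluation that the FMM accelerates costs O(N^2): every target accumulates a kernel, here 1/r, over every source. The sketch below is a generic illustration of that baseline (not code from the thesis); the FMM reduces the cost to O(N) by approximating well-separated interactions with controlled error.

```python
import numpy as np

# Generic O(N^2) direct summation that the FMM approximates in O(N) work.
rng = np.random.default_rng(0)
N = 1000
pos = rng.random((N, 3))            # particle positions
q = rng.random(N)                   # source strengths (e.g., charges or masses)

def direct_potential(pos, q):
    """Potential phi_i = sum_{j != i} q_j / |x_i - x_j|, evaluated directly."""
    diff = pos[:, None, :] - pos[None, :, :]      # (N, N, 3) pairwise offsets
    r = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(r, np.inf)                   # exclude self-interaction
    return (q[None, :] / r).sum(axis=1)

phi = direct_potential(pos, q)      # O(N^2) time and memory
print(phi[:3])
```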

    A Mathematical Approach to Monochromatic Aberration Correction

    Doctoral dissertation, Seoul National University Graduate School, Department of Mathematical Sciences, College of Natural Sciences, February 2020. Advisor: 강명주 (Myungjoo Kang). This thesis introduces efficient and effective methods for solving monochromatic aberration correction problems. The proposed methods are based on the Forward-Backward proximal splitting method, which solves the optimization problem by iteratively solving two subproblems at each step: (1) gradient descent and (2) noise removal. Since the gradient-descent part has a high computational cost, we develop a low-cost implementation of the aberration operator and its transpose. We then propose six different methods, based on six different types of regularization in the noise-removal part. In this thesis, we perform experiments on the proposed image restoration methods. In the experiments, we use synthetic images generated by point spread functions (PSFs), which emulate the effects of monochromatic aberration in modern digital cameras.
    Contents: 1 Introduction. 2 Related Works: 2.1 Approximation Methods (2.1.1 Methods; 2.1.2 Methods Comparison and Conclusion); 2.2 Basic Fourier Optics (2.2.1 Wavefront Optical Path Difference W(x, y); 2.2.2 Pupil and Amplitude Transfer Functions; 2.2.3 Point Spread Functions); 2.3 Mathematical Preliminaries (2.3.1 Basic Properties of svc Operators; 2.3.2 Regularizations in Inverse Problems; 2.3.3 Convex Optimization Theory). 3 Proposed Methods: 3.1 Low Cost Implementation Using Small Support Assumption (3.1.1 Vectorization Techniques); 3.2 Proposed Algorithm (3.2.1 Forward Backward Splitting Algorithm; 3.2.2 Split Bregman Method; 3.2.3 Algorithms). 4 Experiments: 4.1 Implementation Details (4.1.1 Generation of Synthetic Blurry Images); 4.2 Numerical Results (4.2.1 Synthetically Blurred Images; 4.2.2 Image Restoration). 5 Conclusion and Future Work: 5.1 Conclusion; 5.2 Future Work. Abstract (in Korean).
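
    As a minimal sketch of the Forward-Backward (proximal gradient) iteration the thesis builds on, shown here for a generic sparse recovery problem with an L1 proximal step rather than the thesis’s aberration operator and its six regularizers, each iteration takes a gradient step on the data-fidelity term followed by a denoising (proximal) step.

```python
import numpy as np

# Minimal Forward-Backward (proximal gradient / ISTA) sketch for
# min_x 0.5 * ||A x - b||^2 + lam * ||x||_1.  Generic example; the thesis
# uses an aberration operator A and several different regularizers.
rng = np.random.default_rng(0)
m, n = 60, 100
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, 8, replace=False)] = rng.standard_normal(8)
b = A @ x_true + 0.01 * rng.standard_normal(m)

lam = 0.05
step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / Lipschitz constant of the gradient

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (the 'noise removal' step)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n)
for _ in range(500):
    grad = A.T @ (A @ x - b)                              # forward (gradient) step
    x = soft_threshold(x - step * grad, step * lam)       # backward (proximal) step

print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))   # relative recovery error
```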