1,877 research outputs found

    NVIDIA Tensor Core Programmability, Performance & Precision

    Full text link
    The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using of NVIDIA Tensor Cores.Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 201

    A Study of Energy and Locality Effects using Space-filling Curves

    Full text link
    The cost of energy is becoming an increasingly important driver for the operating cost of HPC systems, adding yet another facet to the challenge of producing efficient code. In this paper, we investigate the energy implications of trading computation for locality using Hilbert and Morton space-filling curves with dense matrix-matrix multiplication. The advantage of these curves is that they exhibit an inherent tiling effect without requiring specific architecture tuning. By accessing the matrices in the order determined by the space-filling curves, we can trade computation for locality. The index computation overhead of the Morton curve is found to be balanced against its locality and energy efficiency, while the overhead of the Hilbert curve outweighs its improvements on our test system.Comment: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW

    Chameleon: A Hybrid Secure Computation Framework for Machine Learning Applications

    Get PDF
    We present Chameleon, a novel hybrid (mixed-protocol) framework for secure function evaluation (SFE) which enables two parties to jointly compute a function without disclosing their private inputs. Chameleon combines the best aspects of generic SFE protocols with the ones that are based upon additive secret sharing. In particular, the framework performs linear operations in the ring Z2l\mathbb{Z}_{2^l} using additively secret shared values and nonlinear operations using Yao's Garbled Circuits or the Goldreich-Micali-Wigderson protocol. Chameleon departs from the common assumption of additive or linear secret sharing models where three or more parties need to communicate in the online phase: the framework allows two parties with private inputs to communicate in the online phase under the assumption of a third node generating correlated randomness in an offline phase. Almost all of the heavy cryptographic operations are precomputed in an offline phase which substantially reduces the communication overhead. Chameleon is both scalable and significantly more efficient than the ABY framework (NDSS'15) it is based on. Our framework supports signed fixed-point numbers. In particular, Chameleon's vector dot product of signed fixed-point numbers improves the efficiency of mining and classification of encrypted data for algorithms based upon heavy matrix multiplications. Our evaluation of Chameleon on a 5 layer convolutional deep neural network shows 133x and 4.2x faster executions than Microsoft CryptoNets (ICML'16) and MiniONN (CCS'17), respectively

    Issues with rounding in the GCC implementation of the ISO 18037:2008 standard fixed-point arithmetic

    Full text link
    We describe various issues caused by the lack of round-to-nearest mode in the \textit{gcc} compiler implementation of the fixed-point arithmetic data types and operations. We demonstrate that round-to-nearest is not performed in the conversion of constants, conversion from one numerical type to a less precise type and results of multiplications. Furthermore, we show that mixed-precision operations in fixed-point arithmetic lose precision on arguments, even before carrying out arithmetic operations. The ISO 18037:2008 standard was created to standardize C language extensions, including fixed-point arithmetic, for embedded systems. Embedded systems are usually based on ARM processors, of which approximately 100 billion have been manufactured by now. Therefore, the observations about numerical issues that we discuss in this paper can be rather dangerous and are important to address, given the wide ranging type of applications that these embedded systems are running.Comment: To appear in the proceedings of the 27th IEEE Symposium on Computer Arithmeti

    FPGA implementation of a LSTM Neural Network

    Get PDF
    Este trabalho pretende fazer uma implementação customizada, em Hardware, duma Rede Neuronal Long Short-Term Memory. O modelo python, assim como a descrição Verilog, e síntese RTL, encontram-se terminadas. Falta apenas fazer o benchmarking e a integração de um sistema de aprendizagem
    • …
    corecore