11 research outputs found

    Edwards curves and FFT-based multiplication

    Get PDF
    This paper introduces fast algorithms for performing group operations on Edwards curves using FFT-based multiplication. Previously known algorithms can use such multiplication too, but better results can be achieved if particular properties of FFT-based arithmetic are accounted for. The introduced algorithms perform operations in extended Edwards coordinates and in Montgomery single coordinate

    Accelerating NTRU based Homomorphic Encryption using GPUs

    Get PDF
    In this work we introduce a large polynomial arithmetic library optimized for Nvidia GPUs to support fully homomorphic encryption schemes. To realize the large polynomial arithmetic library we convert the polynomial with large coefficients using the Chinese Remainder Theorem into many polynomials with small coefficients, and then carry out modular multiplications in the residue space using a custom developed discrete Fourier transform library. We further extend the library to support the homomorphic evaluation operations, i.e. addition, multiplication, and relinearization, in an NTRU based somewhat homomorphic encryption library. Finally, we put the library to use to evaluate homomorphic evaluation of two block ciphers: Prince and AES, which show 2.57 times and 7.6 times speedup, respectively, over an Intel Xeon software implementation

    FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

    Full text link
    Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FFT)--which allows long convolutions to run in O(NlogN)O(N logN) time in sequence length NN but has poor hardware utilization. In this paper, we study how to optimize the FFT convolution. We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy. In response, we propose FlashFFTConv. FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O. We also present two sparse convolution algorithms--1) partial convolutions and 2) frequency-sparse convolutions--which can be implemented simply by skipping blocks in the matrix decomposition, enabling further opportunities for memory and compute savings. FlashFFTConv speeds up exact FFT convolutions by up to 7.93×\times over PyTorch and achieves up to 4.4×\times speedup end-to-end. Given the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity on the PILE and M2-BERT-base to achieve 3.3 points higher GLUE score--matching models with twice the parameter count. FlashFFTConv also achieves 96.1% accuracy on Path-512, a high-resolution vision task where no model had previously achieved better than 50%. Furthermore, partial convolutions enable longer-sequence models--yielding the first DNA model that can process the longest human genes (2.3M base pairs)--and frequency-sparse convolutions speed up pretrained models while maintaining or improving model quality

    Accelerating zero knowledge proofs

    Get PDF
    Les proves de coneixement zero són una eina criptogràfica altament prometedora que permet demostrar que un predicat és correcte sense revelar informació addicional sobre aquest. Aquestes tipus de proves són útils en aplicacions que requereixen tant integritat computacional com privadesa, com ara verificar la correcció dels resultats d'una computació delegada a una altra entitat, on hi poden haver involucrats valors d'entrada confidencials. Tanmateix, té un impediment que obstaculitza la seva adopció pràctica: el procés potencialment lent de generació de les proves. Així doncs, aquest projecte explora la viabilitat d'accelerar les proves de coneixement zero mitjançant hardware, amb l'objectiu de superar aquest obstacle crític.Las pruebas de conocimiento cero representan una herramienta criptográfica altamente prometedora que permite demostrar la corrección de un predicado sin revelar información adicional. Estas pruebas son útiles en aplicaciones que requieren tanto integridad computacional como privacidad, como por ejemplo la validación de los resultados de una computación delegada a otra entidad, donde pueden estar involucrados valores de entrada confidenciales. Sin embargo, existe un desafío significativo que obstaculiza su adopción práctica: el proceso potencialmente lento de generación de pruebas. Como resultado, este proyecto explora la viabilidad de acelerar las pruebas de conocimiento cero utilizando hardware, con el objetivo de superar este obstáculo crítico.Zero-knowledge proofs represent a highly promising cryptographic tool that enables the validation of a statement's correctness without revealing any supplementary information. These proofs find utility in applications demanding both computational integrity and privacy, such as validating outsourced computation results, where confidential input values may be involved. However, a significant challenge hinders their practical adoption: the potentially time-consuming process of generating proofs. Consequently, this project investigates the feasibility of accelerating zero-knowledge proofs using hardware, aiming to overcome this critical hurdle.Outgoin

    CUDA-Accelerated RNS Multiplication in Word-Wise Homomorphic Encryption Schemes

    Get PDF
    Homomorphic encryption (HE), which allows computation over encrypted data, has often been used to preserve privacy. However, the computationally heavy nature and complexity of network topologies make the deployment of HE schemes in the Internet of Things (IoT) scenario difficult. In this work, we propose CARM, the first optimized GPU implementation that covers BGV, BFV and CKKS, targeting for accelerating homomorphic multiplication using GPU in heterogeneous IoT systems. We offer constant-time low-level arithmetic with minimum instructions and memory usage, as well as performance- and memory-prior configurations, and exploit a parametric and generic design, and offer various trade-offs between resource and efficiency, yielding a solution suitable for accelerating RNS homomorphic multiplication on both high-performance and embedded GPUs. Through this, we can offer more real-time evaluation results and relieve the computational pressure on cloud devices. We deploy our implementations on two GPUs and achieve up to 378.4×, 234.5×, and 287.2× speedup for homomorphic multiplication of BGV, BFV, and CKKS on Tesla V100S, and 8.8×, 9.2×, and 10.3× on Jetson AGX Xavier, respectively

    Using GPUs to Compute Large Out-of-card FFTs

    Get PDF
    ABSTRACT The optimization of Fast Fourier Transfer (FFT) problems that can fit into GPU memory has been studied extensively. Such on-card FFT libraries like CUFFT can generally achieve much better performance than their counterparts on a CPU, as the data transfer between CPU and GPU is usually not counted in their performance. This high performance, however, is limited by the GPU memory size. When the FFT problem size increases, the data transfer between system and GPU memory can comprise a substantial part of the overall execution time. Therefore, optimizations for FFT problems that outgrow the GPU memory can not bypass the tuning of data transfer between CPU and GPU. However, no prior study has attacked this problem. This paper is the first effort of using GPUs to efficiently compute large FFTs in the CPU memory of a single compute node. In this paper, the performance of the PCI bus during the transfer of a batch of FFT subarrays is studied and a blocked buffer algorithm is proposed to improve the effective bandwidth. More importantly, several FFT decomposition algorithms are proposed so as to increase the data locality, further improve the PCI bus efficiency and balance computation between kernels. By integrating the above two methods, we demonstrate an out-of-card FFT optimization strategy and develop an FFT library that efficiently computes large 1D, 2D and 3D FFTs that can not fit into the GPU's memory. On three of the latest GPUs, our large FFT library achieves much better double precision performance than two of the most efficient CPU based libraries, FFTW and Intel MKL. On average, our large FFTs on a single GeForce GTX480 are 46% faster than FFTW and 57% faster than MKL with multiple threads running on a four-core Intel i7 CPU. The speedup on a Tesla C2070 is 1.93× and 2.11× over FFTW and MKL. A peak performance of 21GFLOPS is achieved for a 2D FFT of size 2048 × 65536 on C2070 with double precision

    New hardware support transactional memory and parallel debugging in multicore processors

    Get PDF
    This thesis contributes to the area of hardware support for parallel programming by introducing new hardware elements in multicore processors, with the aim of improving the performance and optimize new tools, abstractions and applications related with parallel programming, such as transactional memory and data race detectors. Specifically, we configure a hardware transactional memory system with signatures as part of the hardware support, and we develop a new hardware filter for reducing the signature size. We also develop the first hardware asymmetric data race detector (which is also able to tolerate them), based also in hardware signatures. Finally, we propose a new module of hardware signatures that solves some of the problems that we found in the previous tools related with the lack of flexibility in hardware signatures

    Protocoles scalables de cohérence des caches pour processeurs manycore à espace d'adressage partagé visant la basse consommation.

    Get PDF
    The TSAR architecture (Tera-Scale ARchitecture) developed jointly by Lip6 Bull and CEA-LETI is a CC-NUMA manycore architecture which is scalable up to 1024 cores. The DHCCP cache coherence protocol in the TSAR architecture is a global directory protocol using the write-through policy in the L1 cache for scalability purpose, but this write policy causes a high power consumption which we want to reduce. Currently the biggest semiconductors companies, such as Intel or AMD, use the MESI MOESI protocols in their multi-core processors. These protocols use the write-back policy to reduce the high power consumption due to writes. However, the complexity of implementation and the sharp increase in the coherence traffic when the number of processors increases limits the scalability of these protocols beyond a few dozen cores. In this thesis, we propose a new cache coherence protocol using a hybrid method to process write requests in the L1 private cache : for exclusive lines, the L1 cache controller chooses the write-back policy in order to modify locally the lines as well as eliminate the write traffic for exclusive lines. For shared lines, the L1 cache controller uses the write-through policy to simplify the protocol and in order to guarantee the scalability. We also optimized the current solution for the TLB coherence problem in the TSAR architecture. The new method which is called CC-TLB not only improves the performance, but also reduces the energy consumption. Finally, this thesis introduces a new micro cache between the core and the L1 cache, which allows to reduce the number of accesses to the instruction cache, in order to save energy.L'architecture TSAR (Tera-Scale ARchitecture) développée conjointement par BULL, le Lip6 et le CEA-LETI est une architecture manycore CC-NUMA extensible jusqu'à 1024 cœurs. Le protocole de cohérence de cache DHCCP dans l'architecture TSAR repose sur le principe du répertoire global distribué en utilisant la stratégie d'écriture simultanée afin de passer à l'échelle, mais cette scalabilité a un coût énergétique important que nous cherchons à réduire. Actuellement, les plus grosses entreprises dans le domaine des semi-conducteurs, comme Intel ou AMD, utilisent les protocoles MESI ou MOESI dans leurs processeurs multicoeurs. Ces types de protocoles utilisent la stratégie d'écriture différée pour réduire la forte consommation énergétique due aux écritures. Mais la complexité d'implémentation et la forte augmentation de ce trafic de cohérence quand le nombre de processeurs augmente limite le passage à l'échelle de ces protocoles au-delà de quelques dizaines de coeurs. Dans cette thèse, nous proposons un nouveau protocole de cohérence de cache utilisant une méthode hybride pour traiter les écritures dans le cache L1 privé : pour les lignes non partagées, le contrôleur de cache L1 utilise la stratégie d'écriture différée, de façon à modifier les lignes localement. Pour les lignes partagées, le contrôleur de cache L1 utilise la stratégie d'écriture immédiate pour éviter l'état de propriété exclusive sur ces lignes partagées. Cette méthode, appelée RWT pour Released Write Through, passe non seulement à l'échelle, mais réduit aussi significativement la consommation énergétique liée aux écritures. Nous avons aussi optimisé la solution actuelle pour gérer la cohérence des TLBs dans l'architecture TSAR, en termes de performance et de consommation énergétique. Enfin, nous introduisons dans cette thèse un nouveau petit cache, appelé micro-cache, entre le coeur et le cache L1, afin de réduire le nombre d'accès au cache d'instructions
    corecore