224 research outputs found

    Binary Field Multiplication on ARMv8

    Get PDF
    In this paper, we show efficient implementations of binary field multiplication over ARMv8. We exploit an advanced 64-bit polynomial multiplication (\texttt{PMULL}) supported by ARMv8 and conduct multiple levels of asymptotically faster Karatsuba multiplication. Finally, our method conducts binary field multiplication within 57 clock cycles for B-251. Our proposed method on ARMv8 improves the performance by a factor of 5.55.5 times than previous techniques on ARMv7

    Efficient Implementations of Pairing-Based Cryptography on Embedded Systems

    Get PDF
    Many cryptographic applications use bilinear pairing such as identity based signature, instance identity-based key agreement, searchable public-key encryption, short signature scheme, certificate less encryption and blind signature. Elliptic curves over finite field are the most secure and efficient way to implement bilinear pairings for the these applications. Pairing based cryptosystems are being implemented on different platforms such as low-power and mobile devices. Recently, hardware capabilities of embedded devices have been emerging which can support efficient and faster implementations of pairings on hand-held devices. In this thesis, the main focus is optimization of Optimal Ate-pairing using special class of ordinary curves, Barreto-Naehring (BN), for different security levels on low-resource devices with ARM processors. Latest ARM architectures are using SIMD instructions based NEON engine and are helpful to optimize basic algorithms. Pairing implementations are being done using tower field which use field multiplication as the most important computation. This work presents NEON implementation of two multipliers (Karatsuba and Schoolbook) and compare the performance of these multipliers with different multipliers present in the literature for different field sizes. This work reports the fastest implementation timing of pairing for BN254, BN446 and BN638 curves for ARMv7 architecture which have security levels as 128-, 164-, and 192-bit, respectively. This work also presents comparison of code performance for ARMv8 architectures

    Faster ECC over F2571 (feat. PMULL)

    Get PDF
    In this paper, we show efficient elliptic curve cryptography implementations for B-571 over ARMv8. We improve the previous binary field multiplication with finely aligned multiplication and incomplete reduction techniques by taking advantages of advanced 64-bit polynomial multiplication (\texttt{PMULL}) supported by ARMv8. This approach shows performance enhancements by a factor of 1.34 times than previous binary field implementations. For the point addition and doubling, the special types of multiplication, squaring and addition operations are combined together and optimized, where one reduction operation is optimized in each case. The scalar multiplication is implemented in constant-time Montgomery ladder algorithm, which is secure against timing attacks. Finally the proposed implementations achieved 759,630/331,944 clock cycles for random/fixed scalar multiplications for B-571 over ARMv8, respectively

    2DT-GLS: Faster and exception-free scalar multiplication in the GLS254 binary curve

    Get PDF
    We revisit and improve performance of arithmetic in the binary GLS254 curve by introducing the 2DT-GLS scalar multiplication algorithm. The algorithm includes theoretical and practice-oriented contributions of potential independent interest: (i) for the first time, a proof that the GLS scalar multiplication algorithm does not incur exceptions, such that faster incomplete formulas can be used; (ii) faster dedicated atomic formulas that alleviate the cost of precomputation; (iii) a table compression technique that reduces the storage needed for precomputed points; (iv) a refined constant-time scalar decomposition algorithm that is more robust to rounding. We also present the first GLS254 implementation for Armv8. With our contributions, we set new speed records for constant-time scalar multiplication by 34.5%34.5\% and 6%6\% on 64-bit Arm and Intel platforms, respectively

    DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization

    Full text link
    Efficiently deploying deep neural networks on low-resource edge devices is challenging due to their ever-increasing resource requirements. To address this issue, researchers have proposed multiplication-free neural networks, such as Power-of-Two quantization, or also known as Shift networks, which aim to reduce memory usage and simplify computation. However, existing low-bit Shift networks are not as accurate as their full-precision counterparts, typically suffering from limited weight range encoding schemes and quantization loss. In this paper, we propose the DenseShift network, which significantly improves the accuracy of Shift networks, achieving competitive performance to full-precision networks for vision and speech applications. In addition, we introduce a method to deploy an efficient DenseShift network using non-quantized floating-point activations, while obtaining 1.6X speed-up over existing methods. To achieve this, we demonstrate that zero-weight values in low-bit Shift networks do not contribute to model capacity and negatively impact inference computation. To address this issue, we propose a zero-free shifting mechanism that simplifies inference and increases model capacity. We further propose a sign-scale decomposition design to enhance training efficiency and a low-variance random initialization strategy to improve the model's transfer learning performance. Our extensive experiments on various computer vision and speech tasks demonstrate that DenseShift outperforms existing low-bit multiplication-free networks and achieves competitive performance compared to full-precision networks. Furthermore, our proposed approach exhibits strong transfer learning performance without a drop in accuracy. Our code was released on GitHub

    Optimized Implementation of Encapsulation and Decapsulation of Classic McEliece on ARMv8

    Get PDF
    Recently, the results of the NIST PQC contest were announced. Classic McEliece, one of the 3rd round candidates, was selected as the fourth round candidate. Classic McEliece is the only code-based cipher in the NIST PQC finalists in third round and the algorithm is regarded as secure. However, it has low efficiency. In this paper, we propose an efficient software implementation of Classic McEliece, a code-based cipher, on 64-bit ARMv8 processors. Classic McEliece can be divided into Key Generation, Encapsulation, and Decapsulation. Among them, we propose an optimal implementation for Encapsulation and Decapsulation. Optimized Encapsulation implementation utilizes vector registers to perform 16-byte parallel operations, and optimize using the specificity of the identity matrix. Decapsulation implemented efficient Multiplication and Inversion on F2mF_2^m field. Compared with the previous results, Encapsulation showed the performance improvement of up-to 1.99× than the-state-of-art works
    corecore