19 research outputs found

    Improved Key Pair Generation for Falcon, BAT and Hawk

    Get PDF
    In this short note, we describe a few implementation techniques that allow performing key pair generation for the Falcon and Hawk lattice-based signature schemes, and for the BAT key encapsulation scheme, in a fully constant-time way and without any use of floating-point operations. Our new code is faster than previously published implementations, especially when running on small embedded systems, and uses less RAM

    Falcon Takes Off - A Hardware Implementation of the Falcon Signature Scheme

    Get PDF
    Falcon is one out of three post-quantum signature schemes which have been selected for standardization by NIST in July 2022. To the best of our knowledge, Falcon is the only selected algorithm that does not yet have a publicly reported hardware description that performs signing or key generation. The reason might be that the Falcon signature and key generation algorithms do not fit well in hardware due to the use of floating-point numbers and recursive functions. This publication describes the first hardware implementation for Falcon signing and key generation. To overcome the complexity of the Falcon algorithms, High-Level Synthesis (HLS) was preferred over a hardware description language like Verilog or VHDL. Our HLS code is based on the C reference implementation available at NIST. We describe the required modifications in order to be compliant with HLS, such as rewriting recursive functions into iterative versions. The hardware core at security level 5 requires 45,223 LUTs, 41,370 FFs, 182 DSPs, and 37 BRAMs to calculate one signature in 8.7 ms on a Zynq UltraScale+ FPGA. Security level 5 key generation takes 320.3 ms and requires 100,649 LUTs, 91,029 FFs, 1,215 DSPs, and 69 BRAMs

    Efficient Hardware RNS Decomposition for Post-Quantum Signature Scheme FALCON

    Get PDF
    The recently announced National Institute of Standards and Technology (NIST) Post-quantum cryptography (PQC) third-round standardization process has released its candidates to be standardized and Falcon is one of them. On the other hand, however, very few hardware implementation works for Falcon have been released due to its very complicated computation procedure and intensive complexity. With this background, in this paper, we propose an efficient hardware structure to implement residue numeral system (RNS) decomposition within NTRUSolve (a key arithmetic component for key generation of Falcon). In total, we have proposed three stages of coherent interdependent efforts to finish the proposed work. First, we have identified the necessary algorithmic operation related to RNS decomposition. Then, we have innovatively designed a hardware structure to realize these algorithms. Finally, field-programmable gate array (FPGA)-based implementation has been carried out to verify the superior performance of the proposed hardware structure. For instance, the proposed hardware design involves at least 3.91x faster operational time than the software implementation. To the authors\u27 best knowledge, this is the first paper about the hardware acceleration of RNS decomposition for Falcon, and we hope the outcome of this work will facilitate the research in this area

    Benchmarking and Analysing the NIST PQC Lattice-Based Signature Schemes Standards on the ARM Cortex M7

    Get PDF
    This paper presents an analysis of the two lattice-based digital signature schemes, Dilithium and Falcon, which have been chosen by NIST for standardisation, on the ARM Cortex M7 using the STM32F767ZI NUCLEO-144 development board. This research is motivated by the ARM Cortex M7 device being the only processor in the Cortex-M family to offer a double precision (i.e., 64-bit) floating-point unit, making Falcon\u27s implementations, requiring 53 bits of double precision, able to fully run native floating-point operations without any emulation. When benchmarking natively, Falcon shows significant speed-ups between 6.2-8.3x in clock cycles, 6.2-11.8x in runtime, and Dilithium does not show much improvement other than those gained by the slightly faster processor. We then present profiling results of the two schemes on the ARM Cortex M7 to show their respective bottlenecks and operations where the improvements are and can be made. This demonstrates, for example, that some operations in Falcon\u27s procedures observe speed-ups by an order of magnitude. Finally, since Falcon\u27s use of floating points is so rare in cryptography, we test the native FPU instructions on 4 different STM32 development boards with the ARM Cortex M7 and also a Raspberry Pi 3 which is used in some of Falcon\u27s official benchmarking results. We find constant-time irregularities in all of these devices, which makes Falcon insecure on these devices for applications where signature generation can be timed by an attacker

    Isochronous Gaussian Sampling: From Inception to Implementation

    Get PDF
    Gaussian sampling over the integers is a crucial tool in lattice-based cryptography, but has proven over the recent years to be surprisingly challenging to perform in a generic, efficient and provable secure manner. In this work, we present a modular framework for generating discrete Gaussians with arbitrary center and standard deviation. Our framework is extremely simple, and it is precisely this simplicity that allowed us to make it easy to implement, provably secure, portable, efficient, and provably resistant against timing attacks. Our sampler is a good candidate for any trapdoor sampling and it is actually the one that has been recently implemented in the Falcon signature scheme. Our second contribution aims at systematizing the detection of implementation errors in Gaussian samplers. We provide a statistical testing suite for discrete Gaussians called SAGA (Statistically Acceptable GAussian). In a nutshell, our two contributions take a step towards trustable and robust Gaussian sampling real-world implementations

    Nibbling MAYO: Optimized Implementations for AVX2 and Cortex-M4

    Get PDF
    MAYO is a popular high-calorie condiment as well as an auspicious candidate in the ongoing NIST competition for additional post-quantum signature schemes achieving competitive signature and public key sizes. In this work, we present high-speed implementations of MAYO using the AVX2 and Armv7E-M instruction sets targeting recent x86 platforms and the Arm Cortex-M4. Moreover, the main contribution of our work is showing that MAYO can be even faster when switching from a bitsliced representation of keys to a nibble-sliced representation. While the bitsliced representation was primarily motivated by faster arithmetic on microcontrollers, we show that it is not necessary for achieving high performance on Cortex-M4. On Cortex-M4, we instead propose to implement the large matrix multiplications of MAYO using the Method of the Four Russians (M4R), which allows us to achieve better performance than when using the bitsliced approach. This results in up to 21% faster signing. For AVX2, the change in representation allows us to implement the arithmetic much faster using shuffle instructions. Signing takes up to 3.2x fewer cycles and key generation and verification enjoy similar speedups. This shows that MAYO is competitive with lattice-based signature schemes on x86 CPUs, and a factor of 2-6 slower than lattice-based signature schemes on Cortex-M4 (which can still be considered competitive)

    Improved Power Analysis Attacks on Falcon

    Get PDF
    Falcon is one of the three post-quantum signature schemes selected for standardization by NIST. Due to its low bandwidth and high efficiency, Falcon is seen as an attractive option for quantum-safe embedded systems. In this work, we study Falcon\u27s side-channel resistance by analysing its Gaussian samplers. Our results are mainly twofold. The first result is an improved key recovery exploiting the leakage within the base sampler investigated by Guerreau et al. (CHES 2022). Instead of resorting to the fourth moment as in former parallelepiped-learning attacks, we work with the second order statistics covariance and use its spectral decomposition to recover the secret information. Our approach substantially reduces the requirement for measurements and computation resources: 220 000220\,000 traces is sufficient to recover the secret key of Falcon 512 within half an hour with a probability of ≈25%\approx 25\%. As a comparison, even with 10610^6 traces, the former attack still needs about 1000 hours CPU time of lattice reduction for a full key recovery. In addition, our approach is robust to inaccurate leakage classification, which is another advantage over parallelepiped-learning attacks. Our second result is a practical power analysis targeting the integer Gaussian sampler of Falcon. The analysis relies on the leakage of random sign flip within the integer Gaussian sampling. This leakage was exposed in 2018 by Kim and Hong, but it is not considered in Falcon\u27s implementation and unexploited for side channel analysis until now. We identify the leakage within the reference implementation of Falcon on an ARM Cortex-M4 STM32F407IGT6 microprocessor. We also show that this single bit of leakage is in effect enough for practical key recovery: with 170 000170\,000 traces one can fully recover the key of Falcon-512 within half an hour. Furthermore, combining the sign leakage and the aforementioned leakage, one can recover the key with only 45 00045\,000 signature measurements in a short time. As a by-product, we also extend our power analysis to Mitaka which is a recent variant of Falcon. The same leakages exist within the integer Gaussian samplers of Mitaka, and they can also be used to mount key recovery attacks. Nevertheless, the key recovery in Mitaka requires much more traces than it does in Falcon, due to their different lattice Gaussian samplers

    Nibbling MAYO: Optimized Implementations for AVX2 and Cortex-M4

    Get PDF
    MAYO is a popular high-calorie condiment as well as an auspicious candidate in the ongoing NIST competition for additional post-quantum signature schemes achieving competitive signature and public key sizes. In this work, we present high-speed implementations of MAYO using the AVX2 and Armv7E-M instruction sets targeting recent x86 platforms and the Arm Cortex-M4. Moreover, the main contribution of our work is showing that MAYO can be even faster when switching from a bitsliced representation of keys to a nibble-sliced representation. While the bitsliced representation was primarily motivated by faster arithmetic on microcontrollers, we show that it is not necessary for achieving high performance on Cortex-M4. On Cortex-M4, we instead propose to implement the large matrix multiplications of MAYO using the Method of the Four Russians (M4R), which allows us to achieve better performance than when using the bitsliced approach. This results in up to 21% faster signing. For AVX2, the change in representation allows us to implement the arithmetic much faster using shuffle instructions. Signing takes up to 3.2x fewer cycles and key generation and verification enjoy similar speedups. This shows that MAYO is competitive with lattice-based signature schemes on x86 CPUs, and a factor of 2-6 slower than lattice-based signature schemes on Cortex-M4 (which can still be considered competitive)

    Post-Quantum Signatures on RISC-V with Hardware Acceleration

    Get PDF
    CRYSTALS-Dilithium and Falcon are digital signature algorithms based on cryptographic lattices, that are considered secure even if large-scale quantum computers will be able to break conventional public-key cryptography. Both schemes have been selected for standardization in the NIST post-quantum competition. In this work, we present a RISC-V HW/SW odesign that aims to combine the advantages of software- and hardware implementations, i.e. flexibility and performance. It shows the use of lexible hardware accelerators, which have been previously used for Public-Key Encryption (PKE) and Key-Encapsulation Mechanism (KEM), for post-quantum signatures. It is optimized for Dilithium as a generic signature cheme but also accelerates applications that require fast verification of Falcon’s compact signatures. We provide a comparison with previous works showing that for Dilithium and Falcon, cycle counts are significantly reduced, such that our design is faster than previous software implementations or other HW/SW codesigns. In addition to that, we present a compact Globalfoundries 22 nm ASIC design that runs at 800MHz. By using hardware acceleration, energy consumption for Dilithium is reduced by up to 92.2%, and up to 67.5% for Falcon’s signature verification

    HAETAE: Shorter Lattice-Based Fiat-Shamir Signatures

    Get PDF
    We present HAETAE(Hyperball bimodAl modulE rejecTion signAture schemE), a new lattice-based signature scheme, which we submitted to the Korean Post-Quantum Cryptography Competition for standardization. Like the NIST-selected Dilithium signature scheme, HAETAE is based on the Fiat-Shamir with Aborts paradigm,but our design choices target an improved complexity/compactness compromise that is highly relevant for many space-limited application scenarios. We primarily focus on reducing signature and verification key sizes so that signatures fit into one TCP or UDP datagram while preserving a high level of security against a variety of attacks. As a result, our scheme has signature and verification key sizes up to 40% and 25% smaller, respectively, compared than Dilithium. Moreover, we describe how to efficiently protect HAETAE against implementation attacks such as side-channel analysis, making it an attractive candidate for use in IoT and other embedded systems
    corecore