396 research outputs found

    Exploring Parallelism to Improve the Performance of FrodoKEM in Hardware

    FrodoKEM is a lattice-based key encapsulation mechanism, currently a semi-finalist in NIST’s post-quantum standardisation effort. A condition for these candidates is to use NIST standards for sources of randomness (i.e. seed-expanding), and as such most candidates utilise SHAKE, an XOF defined in the SHA-3 standard. However, for many of the candidates, this module is a significant implementation bottleneck. Trivium is a lightweight, ISO standard stream cipher which performs well in hardware and has been used in previous hardware designs for lattice-based cryptography. This research proposes optimised designs for FrodoKEM, concentrating on high throughput by parallelising the matrix multiplication operations within the cryptographic scheme. This process is eased by the use of Trivium due to its higher throughput and lower area consumption. The parallelisations proposed also complement the addition of first-order masking to the decapsulation module. Overall, we significantly increase the throughput of FrodoKEM; for encapsulation we see a 16× speed-up, achieving 825 operations per second, and for decapsulation we see a 14× speed-up, achieving 763 operations per second, compared to the previous state of the art, whilst also maintaining a similar FPGA area footprint of less than 2000 slices.

    Provable Secure Software Masking in the Real-World

    We evaluate eight implementations of provably secure side-channel masking schemes that were published in top-tier academic venues such as Eurocrypt, Asiacrypt, CHES and SAC. Specifically, we evaluate the side-channel attack resistance of eight open-source, first-order side-channel protected AES-128 software implementations on the Cortex-M4 platform. Using a t-test-based leakage assessment, we demonstrate that all implementations produce first-order leakage with as few as 10,000 traces. Additionally, we demonstrate that all except for two Inner Product Masking based implementations are vulnerable to a straightforward correlation power analysis attack. We provide an assembly-level analysis showing potential sources of leakage for two implementations. Some of the studied implementations were provided for benchmarking purposes. We demonstrate several flaws in the benchmarking procedures and question the usefulness of the reported performance numbers in the face of the implementations’ poor side-channel resistance. This work serves as a reminder that practical evaluations cannot be omitted in the context of side-channel analysis.
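The t-test-based leakage assessment mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common fixed-versus-random setup with Welch's t statistic computed independently at every sample point, and the conventional |t| > 4.5 threshold; neither detail is stated in the abstract. The function names and the synthetic three-point traces are hypothetical and are not data from the paper.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def leaky_points(fixed_traces, random_traces, threshold=4.5):
    """Return the sample-point indices where |t| exceeds the threshold.

    The default threshold of 4.5 is the conventional choice in leakage
    assessment; it is an assumption here, not a value from the abstract.
    """
    flagged = []
    for i in range(len(fixed_traces[0])):
        column_fixed = [trace[i] for trace in fixed_traces]
        column_random = [trace[i] for trace in random_traces]
        if abs(welch_t(column_fixed, column_random)) > threshold:
            flagged.append(i)
    return flagged


def synthetic_traces(n, leak, rng):
    """Hypothetical traces with three sample points; only point 1 is shifted by `leak`."""
    return [[rng.gauss(0, 1), rng.gauss(leak, 1), rng.gauss(0, 1)] for _ in range(n)]


rng = random.Random(0)
fixed_set = synthetic_traces(2000, 0.5, rng)   # fixed-input class, mean shifted at point 1
random_set = synthetic_traces(2000, 0.0, rng)  # random-input class, no shift
print(leaky_points(fixed_set, random_set))
```

A flagged point only indicates that the two trace populations are statistically distinguishable at first order. It does not by itself yield a key, which is why the abstract reports a separate correlation power analysis.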

    SNEIK on Microcontrollers: AVR, ARMv7-M, and RISC-V with Custom Instructions

    SNEIK is a family of lightweight cryptographic algorithms derived from a single 512-bit permutation. The SNEIGEN "entropy distribution function" was designed to speed up certain functions in post-quantum and lattice-based public key algorithms. We implement and evaluate SNEIK algorithms on popular 8-bit AVR and 32-bit ARMv7-M (Cortex M3/M4) microcontrollers, and also describe an implementation for the open-source RISC-V (RV32I) Instruction Set Architecture (ISA). Our results demonstrate that SNEIK algorithms usually outperform AES and SHA-2/3 on these lightweight targets while having a naturally constant-time design and significantly smaller implementation footprint. The RISC-V architecture is becoming increasingly popular for custom embedded designs that integrate a CPU core with application-specific hardware components. We show that inclusion of two simple custom instructions into the RV32I ISA yields a radical (more than five-fold) speedup of the SNEIK permutation and derived algorithms on that target, allowing us to reach 12.4 cycles/byte SNEIKEN-128 authenticated encryption performance on PQShield's "Crimson Puppy" RV32I-based SoC. Our performance measurements are for realistic message sizes and have been made using real hardware. We also offer implementation size metrics in terms of RAM, firmware size, and additional FPGA logic for the custom instruction set extensions.

    Faster binary-field multiplication and faster binary-field MACs

    This paper shows how to securely authenticate messages using just 29 bit operations per authenticated bit, plus a constant overhead per message. The authenticator is a standard type of "universal" hash function providing information-theoretic security; what is new is computing this type of hash function at very high speed. At a lower level, this paper shows how to multiply two elements of a field of size 2^128 using just 9062 ≈ 71 × 128 bit operations, and how to multiply two elements of a field of size 2^256 using just 22164 ≈ 87 × 256 bit operations. This performance relies on a new representation of field elements and new FFT-based multiplication techniques. This paper's constant-time software uses just 1.89 Core 2 cycles per byte to authenticate very long messages. On a Sandy Bridge it takes 1.43 cycles per byte, without using Intel's PCLMULQDQ polynomial-multiplication hardware. This is much faster than the speed records for constant-time implementations of GHASH without PCLMULQDQ (over 10 cycles/byte), even faster than Intel's best Sandy Bridge implementation of GHASH with PCLMULQDQ (1.79 cycles/byte), and almost as fast as state-of-the-art 128-bit prime-field MACs using Intel's integer-multiplication hardware (around 1 cycle/byte). Keywords: Performance, FFTs, Polynomial multiplication, Universal hashing, Message authentication

    Efficient Cryptography on the RISC-V Architecture

    RISC-V is a promising free and open-source instruction set architecture. Most of the instruction set has been standardized and several hardware implementations are commercially available. In this paper we highlight features of RISC-V that are interesting for optimizing implementations of cryptographic primitives. We provide the first optimized assembly implementations of table-based AES, bitsliced AES, ChaCha, and the Keccak-f[1600] permutation for the RV32I instruction set. With respect to public-key cryptography, we study the performance of arbitrary-precision integer arithmetic without a carry flag. We then estimate the improvement that can be gained by several RISC-V extensions. These performance studies also serve to aid design choices for future RISC-V extensions and implementations.
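To make the "arithmetic without a carry flag" point concrete, here is a minimal sketch of multi-limb addition in which each carry is recovered by an unsigned comparison, the role that the `sltu` instruction plays on RV32I. It is written in Python with explicit 32-bit masking purely for illustration; the helper names are hypothetical, and this is not the paper's assembly code.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit registers


def add_limbs(a, b):
    """Add two equal-length little-endian lists of 32-bit limbs.

    Returns (sum_limbs, carry_out). No carry flag is used: a wrap-around is
    detected by checking whether the truncated sum is smaller than an operand.
    """
    out = []
    carry = 0
    for x, y in zip(a, b):
        s = (x + y) & MASK32        # add: result wraps modulo 2^32
        c1 = 1 if s < x else 0      # unsigned compare: wrapped iff s < x
        t = (s + carry) & MASK32    # add the incoming carry
        c2 = 1 if t < s else 0      # second unsigned compare
        out.append(t)
        carry = c1 | c2             # both cannot be set for the same limb
    return out, carry


def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian 32-bit limbs."""
    return [(n >> (32 * i)) & MASK32 for i in range(count)]


def from_limbs(limbs):
    """Reassemble an integer from little-endian 32-bit limbs."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


total, carry_out = add_limbs(to_limbs(2**64 - 1, 2), to_limbs(1, 2))
print(total, carry_out)
```

Each limb costs two additions and two comparisons here, where an architecture with an add-with-carry instruction would need one instruction. That overhead is the kind of cost the abstract sets out to measure.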

    Exploring NIST LWC/PQC Synergy with R5Sneik: How SNEIK 1.1 Algorithms were Designed to Support Round5

    Most NIST Post-Quantum Cryptography (PQC) candidate algorithms use symmetric primitives internally for various purposes such as "seed expansion" and CPA-to-CCA transforms. Such auxiliary symmetric operations constituted only a fraction of total execution time of traditional RSA and ECC algorithms, but with faster lattice algorithms the impact of symmetric algorithm characteristics can be very significant. A choice to use a specific PQC algorithm implies that its internal symmetric components must also be implemented on all target platforms. This can be problematic for lightweight, embedded (IoT), and hardware implementations. It has been widely observed that current NIST-approved symmetric components (AES, GCM, SHA, SHAKE) form a major bottleneck on embedded and hardware implementation footprint and performance for many of the most efficient NIST PQC proposals. Meanwhile, a separate NIST effort is ongoing to standardize lightweight symmetric cryptography (LWC). Therefore it makes sense to explore which NIST LWC candidates are able to efficiently support internals of post-quantum asymmetric cryptography. We discuss R5Sneik, a variant of Round5 that internally uses SNEIK 1.1 permutation-based primitives instead of SHAKE and AES-GCM. The SNEIK family includes parameter selections specifically designed to support lattice cryptography. R5Sneik is up to 40% faster than Round5 for some parameter sets on ARM Cortex M4, and has substantially smaller implementation footprint. We introduce the concept of a fast Entropy Distribution Function (EDF), a lightweight diffuser that we expect to have sufficient security properties for lattice seed expansion and many types of sampling, but not for plain encryption or hashing. The same SNEIK 1.1 permutation core (but with a different number of rounds) can also be used to replace AES-GCM as an AEAD when building lightweight cryptographic protocols, halving typical flash footprint on Cortex M4, while boosting performance.

    SPEEDY on Cortex-M3: Efficient Software Implementation of SPEEDY on ARM Cortex-M3

    The SPEEDY block cipher suite announced at CHES 2021 shows excellent hardware performance. However, SPEEDY was not designed to be efficient in software implementations. SPEEDY's 6-bit S-box and bit permutation operations generally do not work efficiently in software. We implemented the SPEEDY block cipher using the bit-slicing implementation technique. With bit slicing, SPEEDY runs very efficiently in software and can be deployed on microcontrollers. With the round keys calculated in advance, SPEEDY-5-192, SPEEDY-6-192, and SPEEDY-7-192 take 65.7, 75.25, and 85.16 clock cycles per byte (cpb), respectively, on ARM Cortex-M3. This is better than the constant-time implementations of AES-128 and GIFT on the same platform. From this, we conclude that SPEEDY can show good performance in embedded environments.