A Lyra2 FPGA Core for Lyra2REv2-Based Cryptocurrencies
Lyra2REv2 is a hashing algorithm that consists of a chain of individual
hashing algorithms and it is used as a proof-of-work function in several
cryptocurrencies that aim to be ASIC-resistant. The most crucial hashing
algorithm in the Lyra2REv2 chain is a specific instance of the general Lyra2
algorithm. In this work we present the first FPGA implementation of the
aforementioned instance of Lyra2 and we explain how several properties of the
algorithm can be exploited in order to optimize the design.
Comment: 5 pages, to be presented at the IEEE International Symposium on
Circuits and Systems (ISCAS) 201
Mining CryptoNight-Haven on the Varium C1100 Blockchain Accelerator Card
Cryptocurrency mining is an energy-intensive process that presents a prime
candidate for hardware acceleration. This work-in-progress presents the first
coprocessor design for the ASIC-resistant CryptoNight-Haven Proof of Work (PoW)
algorithm. We construct our hardware accelerator as a Xilinx Run Time (XRT) RTL
kernel targeting the Xilinx Varium C1100 Blockchain Accelerator Card. The
design employs deeply pipelined computation and High Bandwidth Memory (HBM) for
the underlying scratchpad data. We aim to compare our accelerator to existing
CPU and GPU miners to show increased throughput and energy efficiency of its
hash computation.
A Standalone FPGA-based Miner for Lyra2REv2 Cryptocurrencies
Lyra2REv2 is a hashing algorithm that consists of a chain of individual
hashing algorithms, and it is used as a proof-of-work function in several
cryptocurrencies. The most crucial and exotic hashing algorithm in the
Lyra2REv2 chain is a specific instance of the general Lyra2 algorithm. This
work presents the first hardware implementation of the specific instance of
Lyra2 that is used in Lyra2REv2. Several properties of the aforementioned
algorithm are exploited in order to optimize the design. In addition, an
FPGA-based hardware implementation of a standalone miner for Lyra2REv2 on a
Xilinx Multi-Processor System on Chip is presented. The proposed Lyra2REv2
miner is shown to be significantly more energy efficient than both a GPU and a
commercially available FPGA-based miner. Finally, we also explain how the
simplified Lyra2 and Lyra2REv2 architectures can be modified with minimal
effort to also support the recent Lyra2REv3 chained hashing algorithm.
Comment: 13 pages, accepted for publication in IEEE Trans. Circuits Syst. I.
arXiv admin note: substantial text overlap with arXiv:1807.0576
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
Fully Homomorphic Encryption is a technique that allows computation on
encrypted data. It has the potential to change privacy considerations in the
cloud, but computational and memory overheads are preventing its adoption. TFHE
is a promising Torus-based FHE scheme that relies on bootstrapping, the
noise-removal tool invoked after each encrypted logical/arithmetical operation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is
the first hardware accelerator to exploit the inherent noise present in FHE
calculations. Instead of double or single-precision floating-point arithmetic,
it implements TFHE bootstrapping entirely with approximate fixed-point
arithmetic. Using an in-depth analysis of noise propagation in bootstrapping
FFT computations, FPT is able to use noise-trimmed fixed-point representations
that are up to 50% smaller than prior implementations.
FPT is built as a streaming processor inspired by traditional streaming DSPs:
it instantiates directly cascaded high-throughput computational stages, with
minimal control logic and routing networks. We explore throughput-balanced
compositions of streaming kernels with a user-configurable streaming width in
order to construct a full bootstrapping pipeline. Our approach allows 100%
utilization of arithmetic units and requires only a small bootstrapping key
cache, enabling an entirely compute-bound bootstrapping throughput of 1 BS /
35us. This is in stark contrast to the classical CPU approach to FHE
bootstrapping acceleration, which is typically constrained by memory and
bandwidth.
FPT is implemented and evaluated as a bootstrapping FPGA kernel for an Alveo
U280 datacenter accelerator card. FPT achieves two to three orders of magnitude
higher bootstrapping throughput than existing CPU-based implementations, and
2.5x higher throughput compared to recent ASIC emulation experiments.
Comment: ACM CCS 202
Neural Network Quantisation for Faster Homomorphic Encryption
Homomorphic encryption (HE) enables calculating on encrypted data, which
makes it possible to perform privacy-preserving neural network inference. One
disadvantage of this technique is that it is several orders of magnitude
slower than calculation on unencrypted data. Neural networks are commonly
trained using floating-point, while most homomorphic encryption libraries
calculate on integers, thus requiring a quantisation of the neural network. A
straightforward approach would be to quantise to large integer sizes (e.g. 32
bit) to avoid large quantisation errors. In this work, we reduce the integer
sizes of the networks, using quantisation-aware training, to allow more
efficient computations. For the targeted MNIST architecture proposed by Badawi
et al., we reduce the integer sizes by 33% without significant loss of
accuracy, while for the CIFAR architecture, we can reduce the integer sizes by
43%. Implementing the resulting networks under the BFV homomorphic encryption
scheme using SEAL, we could reduce the execution time of an MNIST neural
network by 80% and by 40% for a CIFAR neural network.
Comment: 5 pages, 2 figures, 3 tables
Revisiting Higher-Order Masked Comparison for Lattice-Based Cryptography: Algorithms and Bit-sliced Implementations
Masked comparison is one of the most expensive operations in side-channel secure implementations of lattice-based post-quantum cryptography, especially for higher masking orders. First, we introduce two new masked comparison algorithms, which improve the arithmetic comparison of D'Anvers et al. and the hybrid comparison method of Coron et al. respectively. We then look into implementation-specific optimizations, and show that small specific adaptations can have a significant impact on the overall performance. Finally, we implement various state-of-the-art comparison algorithms and benchmark them on the same platform (ARM Cortex-M4) to allow a fair comparison between them. We improve on the arithmetic comparison of D'Anvers et al. with a factor by using Galois Field multiplications and the hybrid comparison of Coron et al. with a factor by streamlining the design. Our implementation-specific improvements allow a speedup of a straightforward comparison implementation of . We discuss the differences between the various algorithms and provide the implementations and a testing framework to ease future research.
Hardware Acceleration of FHEW
The magic of Fully Homomorphic Encryption (FHE) is that it allows operations on encrypted data without decryption. Unfortunately, its slow computation time limits its adoption. The slow computation time results from the vast memory requirements (64 Kbit per ciphertext), a bootstrapping key of 1.3 GB, and sizeable computational overhead (10240 NTTs, each NTT requiring 5120 32-bit multiplications). We accelerate the FHEW bootstrapping in hardware on a high-end U280 FPGA.
To reduce the computational complexity, we propose a fast hardware NTT architecture modified from with support for negatively wrapped convolution. The IP module includes large I/O ports to the NTT accelerator and an index bit-reversal block. The total architecture requires less than 225000 LUTs and 1280 DSPs.
Assuming that a fast interface to the FHEW bootstrapping key is available, the execution speed of FHEW bootstrapping can increase by at least 7.5 times.
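The negatively wrapped convolution mentioned in this abstract is multiplication of polynomials modulo x^n + 1. The following is a minimal sketch of the idea, not the paper's hardware architecture: it uses a toy modulus q = 17, length n = 4, and the 2n-th root of unity psi = 2, all chosen here purely for illustration, and it replaces the fast butterfly NTT with a naive O(n^2) transform so the structure stays visible. The function names are hypothetical.

```python
def negacyclic_schoolbook(a, b, q):
    """Reference: multiply a and b modulo (x^n + 1, q) term by term."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                # x^n wraps around to -1, hence the subtraction
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c


def naive_ntt(values, root, q):
    """Naive O(n^2) number-theoretic transform (stand-in for a fast NTT)."""
    n = len(values)
    return [
        sum(values[j] * pow(root, i * j, q) for j in range(n)) % q
        for i in range(n)
    ]


def negacyclic_ntt(a, b, q, psi):
    """Negatively wrapped convolution via NTT.

    psi must be a primitive 2n-th root of unity modulo q (psi^n = -1).
    """
    n = len(a)
    omega = psi * psi % q          # primitive n-th root of unity
    psi_inv = pow(psi, -1, q)      # modular inverse (Python 3.8+)
    n_inv = pow(n, -1, q)
    # Pre-scale coefficient i by psi^i so the cyclic transform becomes negacyclic
    a_scaled = [a[i] * pow(psi, i, q) % q for i in range(n)]
    b_scaled = [b[i] * pow(psi, i, q) % q for i in range(n)]
    a_hat = naive_ntt(a_scaled, omega, q)
    b_hat = naive_ntt(b_scaled, omega, q)
    c_hat = [x * y % q for x, y in zip(a_hat, b_hat)]
    # Inverse transform, then undo the psi scaling
    c_scaled = naive_ntt(c_hat, pow(omega, -1, q), q)
    return [c_scaled[i] * n_inv * pow(psi_inv, i, q) % q for i in range(n)]
```

Both functions should agree on any pair of inputs; the NTT route replaces the n^2 coefficient products with n pointwise products plus three transforms, which is what makes it attractive for hardware once a fast transform is used.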
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) is a technique that allows computation on encrypted data. It has the potential to drastically change privacy considerations in the cloud, but high computational and memory overheads are preventing its broad adoption. TFHE is a promising Torus-based FHE scheme that heavily relies on bootstrapping, the noise-removal tool invoked after each encrypted logical/arithmetical operation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is the first hardware accelerator to heavily exploit the inherent noise present in FHE calculations. Instead of double or single-precision floating-point arithmetic, it implements TFHE bootstrapping entirely with approximate fixed-point arithmetic. Using an in-depth analysis of noise propagation in bootstrapping FFT computations, FPT is able to use noise-trimmed fixed-point representations that are up to 50% smaller than prior implementations that prefer floating-point or integer FFTs.
FPT is built as a streaming processor inspired by traditional streaming DSPs: it instantiates directly cascaded high-throughput computational stages, with minimal control logic and routing networks. We explore different throughput-balanced compositions of streaming kernels with a user-configurable streaming width in order to construct a full bootstrapping pipeline. Our proposed approach allows 100% utilization of arithmetic units and requires only a small bootstrapping key cache, enabling an entirely compute-bound bootstrapping throughput of 1 BS / 35 µs. This is in stark contrast to the established classical CPU approach to FHE bootstrapping acceleration, which is typically constrained by memory and bandwidth.
FPT is fully implemented and evaluated as a bootstrapping FPGA kernel for an Alveo U280 datacenter accelerator card. FPT achieves two to three orders of magnitude higher bootstrapping throughput than existing CPU-based implementations, and 2.5x higher throughput compared to recent ASIC emulation experiments.
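To make the fixed-point idea in this abstract concrete, here is a minimal sketch of fixed-point multiplication with a configurable number of fractional bits. It is not FPT's datapath: the helper names, the 16-bit fractional width, and the sample operands are assumptions made for illustration only. It shows that each operation adds a small, bounded rounding error, which is the kind of error a scheme with inherent noise can absorb.

```python
def to_fixed(x, frac_bits):
    """Encode a real number as an integer with frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def from_fixed(v, frac_bits):
    """Decode a fixed-point integer back to a float."""
    return v / (1 << frac_bits)


def fixed_mul(a, b, frac_bits):
    """Multiply two fixed-point values and round back to frac_bits.

    The raw product carries 2 * frac_bits fractional bits, so half an
    output unit is added before shifting right to round to nearest.
    """
    product = a * b
    return (product + (1 << (frac_bits - 1))) >> frac_bits


def fixed_mul_error(x, y, frac_bits):
    """Absolute error of the fixed-point product against the float product."""
    fx, fy = to_fixed(x, frac_bits), to_fixed(y, frac_bits)
    return abs(from_fixed(fixed_mul(fx, fy, frac_bits), frac_bits) - x * y)
```

For example, `fixed_mul_error(0.7071, -0.3827, 16)` stays well below 2^-14, while dropping to 8 fractional bits gives a visibly larger error; choosing the smallest width whose error still hides under the ciphertext noise is the trade-off the abstract describes.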
Neural Network Quantisation for Faster Homomorphic Encryption
Homomorphic encryption (HE) enables calculating on encrypted data, which
makes it possible to perform privacy-preserving neural network inference. One
disadvantage of this technique is that it is several orders of magnitude
slower than calculation on unencrypted data. Neural networks are commonly
trained using floating-point, while most homomorphic encryption libraries
calculate on integers, thus requiring a quantisation of the neural network. A
straightforward approach would be to quantise to large integer sizes (e.g., 32
bit) to avoid large quantisation errors. In this work, we reduce the integer
sizes of the networks, using quantisation-aware training, to allow more
efficient computations. For the targeted MNIST architecture proposed by Badawi
et al., we reduce the integer sizes by 33% without significant loss of
accuracy, while for the CIFAR architecture, we can reduce the integer sizes by
43%. Implementing the resulting networks under the BFV homomorphic encryption
scheme using SEAL, we could reduce the execution time of an MNIST neural
network by 80% and by 40% for a CIFAR neural network.
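As a rough illustration of what quantising weights to a given integer size means, the sketch below applies symmetric uniform quantisation to a list of floating-point weights. This is plain post-training quantisation, not the quantisation-aware training used in the paper, and the function names and sample weights are hypothetical. It only shows how the bit width determines the integer range and the rounding error.

```python
def quantise(weights, bits):
    """Map float weights to signed integers that fit in the given bit width.

    Returns the integer weights and the scale needed to map them back.
    """
    qmax = (1 << (bits - 1)) - 1            # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    ints = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return ints, scale


def dequantise(ints, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in ints]


def max_quantisation_error(weights, bits):
    """Largest absolute difference between original and recovered weights."""
    ints, scale = quantise(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantise(ints, scale)))
```

Smaller bit widths give a coarser scale and therefore a larger worst-case error (at most half a scale step per weight), which is why the abstract pairs the reduction with training that compensates for it.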
Higher-order masked Saber
Side-channel attacks are formidable threats to the cryptosystems deployed in the real world. An effective and provably secure countermeasure against side-channel attacks is masking. In this work, we present a detailed study of higher-order masking techniques for the key-encapsulation mechanism Saber. Saber is one of the lattice-based finalist candidates in the National Institute of Standards and Technology's post-quantum standardization procedure. We provide a detailed analysis of different masking algorithms proposed for Saber in the recent past and propose an optimized implementation of higher-order masked Saber. Our proposed techniques for first-, second-, and third-order masked Saber have performance overheads of 2.7x, 5x, and 7.7x respectively compared to the unmasked Saber. We show that compared to Kyber, which is another lattice-based finalist scheme, Saber's performance degrades less with an increase in the order of masking. We also show that higher-order masked Saber needs fewer random bytes than higher-order masked Kyber. Additionally, we adapt our masked implementation to uSaber, a variant of Saber that was specifically designed to allow an efficient masked implementation. We present the first masked implementation of uSaber, showing that it indeed outperforms masked Saber by at least 12% for any order. We provide optimized implementations of all our proposed masking schemes on ARM Cortex-M4 microcontrollers.
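For readers unfamiliar with masking, the sketch below shows the basic idea of splitting a secret into order + 1 random shares, in both Boolean (XOR) and arithmetic (addition modulo q) form. It is a toy model with hypothetical function names and makes no claim to side-channel security in Python; the modulus 8192 = 2^13 is used in the usage note as an example of a power-of-two modulus. It is not the masked Saber implementation from the paper.

```python
import secrets
from functools import reduce
from operator import xor


def boolean_mask(x, order, bits=16):
    """Split x into order + 1 shares whose XOR equals x."""
    shares = [secrets.randbits(bits) for _ in range(order)]
    return shares + [reduce(xor, shares, x)]


def boolean_unmask(shares):
    """Recombine Boolean shares."""
    return reduce(xor, shares, 0)


def arithmetic_mask(x, order, q):
    """Split x into order + 1 shares whose sum modulo q equals x."""
    shares = [secrets.randbelow(q) for _ in range(order)]
    return shares + [(x - sum(shares)) % q]


def arithmetic_unmask(shares, q):
    """Recombine arithmetic shares."""
    return sum(shares) % q


def masked_add(shares_a, shares_b, q):
    """Add two arithmetically masked values share by share.

    Linear operations never need the shares to be recombined.
    """
    return [(a + b) % q for a, b in zip(shares_a, shares_b)]
```

With `q = 8192`, calling `arithmetic_unmask(masked_add(arithmetic_mask(x, 3, q), arithmetic_mask(y, 3, q), q), q)` returns `(x + y) % q` without any single share revealing `x` or `y`. Non-linear steps, such as the comparisons and conversions between Boolean and arithmetic sharing, are where the overheads quoted in the abstract arise.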