SoK: Fully Homomorphic Encryption Accelerators
Fully Homomorphic Encryption (FHE) is a key technology enabling
privacy-preserving computing. However, the fundamental challenge of FHE is its
inefficiency, due primarily to the underlying polynomial computations with high
computation complexity and extremely time-consuming ciphertext maintenance
operations. To tackle this challenge, various FHE accelerators have recently
been proposed by both research and industrial communities. This paper takes the
first initiative to conduct a systematic study on the 14 FHE accelerators --
cuHE/cuFHE, nuFHE, HEAT, HEAX, HEXL, HEXL-FPGA, 100×, F1, CraterLake,
BTS, ARK, Poseidon, FAB and TensorFHE. We first make our observations on the
evolution trajectory of these existing FHE accelerators to establish a
qualitative connection between them. Then, we perform testbed evaluations of
representative open-source FHE accelerators to provide a quantitative
comparison of them. Finally, with the insights learned from both qualitative
and quantitative studies, we discuss potential directions to inform the future
design and implementation of FHE accelerators.
TPU as Cryptographic Accelerator
Polynomials defined on specific rings are heavily involved in various
cryptographic schemes, and the corresponding operations are usually the
computation bottleneck of the whole scheme.
We propose to use the TPU, an emerging hardware platform designed for AI
applications, to speed up polynomial operations and turn the TPU into a
cryptographic accelerator.
We also conduct a preliminary evaluation and discuss the limitations of the
current work and our future plans.
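The abstract does not say which ring or which mapping is used. As a minimal sketch, assume the common lattice-cryptography ring Z_q[x]/(x^n + 1): multiplying by a fixed polynomial then becomes a matrix-vector product, which is the kind of operation matrix-oriented hardware is built for. The function names and the small parameters below are illustrative assumptions, not taken from the paper.

```python
def negacyclic_matrix(a, q):
    """Build the n x n matrix M such that M @ b equals a * b mod (x^n + 1, q)."""
    n = len(a)
    rows = []
    for i in range(n):
        row = []
        for j in range(n):
            if i >= j:
                # a[i - j] * x^(i - j) times x^j lands on x^i directly.
                row.append(a[i - j] % q)
            else:
                # The product wraps past x^n, and x^n = -1 flips the sign.
                row.append((-a[n + i - j]) % q)
        rows.append(row)
    return rows


def matvec_mod(m, v, q):
    """Matrix-vector product with every entry reduced modulo q."""
    return [sum(x * y for x, y in zip(row, v)) % q for row in m]


def schoolbook_negacyclic(a, b, q):
    """Reference O(n^2) multiplication in Z_q[x]/(x^n + 1)."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:
                out[k - n] = (out[k - n] - ai * bj) % q
    return out


if __name__ == "__main__":
    q = 17                      # toy modulus, assumed for illustration
    a = [1, 1, 0, 0]            # 1 + x
    b = [0, 0, 0, 1]            # x^3
    # (1 + x) * x^3 = x^3 + x^4 = x^3 - 1, i.e. [16, 0, 0, 1] modulo 17
    print(matvec_mod(negacyclic_matrix(a, q), b, q))
```

Stacking several right-hand polynomials as columns would turn this into a matrix-matrix product, at the cost of storing n^2 entries per fixed polynomial.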
Model-Architecture Co-design of Deep Neural Networks for Embedded Systems
In deep learning, a convolutional neural network (ConvNet or CNN) is a powerful tool for building interesting embedded applications that use data to make predictions. An application running on an embedded system typically has limited access to memory resources, processing power, and storage. Implementing deep convolutional neural network-based inference on resource-constrained devices can be very challenging, as these environments cannot usually make use of the massive computing power and storage that are present in cloud server environments. Furthermore, the constantly evolving nature of modern deep network architectures aggravates the problem by making it necessary to balance flexibility against specialisation to avoid the inability to adapt. However, much of the baseline architecture of deep convolutional neural networks has stayed the same. With careful optimisation of the most common and widely occurring layer architectures, it is typically possible to accelerate these emerging workloads for resource-constrained embedded systems.
This thesis makes four contributions. I first developed a lossy three-stage low-rank approximation scheme that can reduce the computational complexity of a pre-trained model by 3-5x and up to 8-9x for individual convolutional layers. This scheme requires restructuring of the convolutional layers and generally suits the scenario where both the training data and trained model are available.
In many scenarios, the training data is not available for fine-tuning to recover the prediction accuracy lost when structural changes are made to a model as a post-processing step. Besides the lack of availability of training data, there are other situations where the architecture of a model cannot be changed after training. My second contribution handles this scenario by using a low-level optimisation scheme that requires no changes to the model architecture, unlike the low-rank approximation scheme. This novel scheme uses a modified version of the Cook-Toom algorithm to reduce the computational intensity of commonly occurring dense and spatial convolutional layers and speed up inference by 2-4x.
My third contribution is an efficient implementation of the Cook-Toom class of algorithms on Arm's ubiquitous low-power Cortex processors. Unlike direct convolution, computing convolutions using the modified Cook-Toom algorithm requires a different data processing pipeline, as it involves pre- and post-transformations of the intermediate activations. I introduced a multi-channel multi-region (MCMR) scheme to enable an efficient implementation of the fast Cook-Toom algorithm. I demonstrate that by effectively using SIMD instructions and the MCMR scheme, an average 2-3x and a peak 4x per-layer speedup is easily achievable.
My final contribution is the Cook-Toom accelerator, a custom hardware architecture for modern convolutional neural networks. This accelerator architecture is designed from the ground up to address some of the limitations of a resource-constrained SIMD processor. I also illustrate how new emerging layer types can be mapped efficiently to the same flexible architecture without any modification.
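The abstract does not spell out the modified Cook-Toom algorithm. As a rough illustration of the general idea only, the sketch below uses the standard textbook F(2,3) minimal-filtering form, which produces two outputs of a 3-tap filter with four multiplications instead of six. It is not the thesis's modified variant, and the function names are assumptions made for this example.

```python
from fractions import Fraction


def direct_f23(d, g):
    """Two outputs of a 3-tap filter computed directly (6 multiplications)."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return [y0, y1]


def fast_f23(d, g):
    """Same two outputs via the F(2,3) fast algorithm (4 multiplications).

    The filter-side sums can be precomputed once per filter, which is the
    "pre-transformation" step; combining m1..m4 is the "post-transformation".
    """
    d = [Fraction(v) for v in d]
    g = [Fraction(v) for v in g]
    half = Fraction(1, 2)

    # Filter transform (reusable across all input tiles).
    u1 = g[0]
    u2 = (g[0] + g[1] + g[2]) * half
    u3 = (g[0] - g[1] + g[2]) * half
    u4 = g[2]

    # Input transform followed by the four element-wise multiplications.
    m1 = (d[0] - d[2]) * u1
    m2 = (d[1] + d[2]) * u2
    m3 = (d[2] - d[1]) * u3
    m4 = (d[1] - d[3]) * u4

    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    d = [1, 2, 3, 4]      # four consecutive input samples
    g = [1, 0, -1]        # 3-tap filter
    print(direct_f23(d, g), fast_f23(d, g))
```

Exact fractions are used here so the comparison with the direct form is exact; a real implementation would use fixed- or floating-point arithmetic and would have to account for the resulting rounding error.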
CryptoLight: An Electro-Optical Accelerator for Fully Homomorphic Encryption
Fully homomorphic encryption (FHE) protects data privacy in cloud computing
by enabling computations to directly occur on ciphertexts. Although the speed
of computationally expensive FHE operations can be significantly boosted by
prior ASIC-based FHE accelerators, the performance of key-switching, the
dominant primitive in various FHE operations, is seriously limited by their
small bit-width datapaths and frequent matrix transpositions. In this paper, we
present an electro-optical (EO) FHE accelerator, CryptoLight, to accelerate FHE
operations. Its 512-bit datapath supporting 510-bit residues greatly reduces
the key-switching cost. We also create an in-scratchpad-memory transpose unit
to transpose matrices quickly. Compared to prior FHE accelerators, on average,
CryptoLight reduces the latency of various FHE applications by >94.4% and the
energy consumption by >95%. Comment: 6 pages, 8 figures
Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework
Hardware accelerations of deep learning systems have been extensively
investigated in industry and academia. The aim of this paper is to achieve
ultra-high energy efficiency and performance for hardware implementations of
deep neural networks (DNNs). An algorithm-hardware co-optimization framework is
developed, which is applicable to different DNN types, sizes, and application
scenarios. The algorithm part adopts the general block-circulant matrices to
achieve a fine-grained tradeoff between accuracy and compression ratio. It
applies to both fully-connected and convolutional layers and contains a
mathematically rigorous proof of the effectiveness of the method. The proposed
algorithm reduces computational complexity per layer from $O(n^2)$ to $O(n \log n)$ and storage complexity from $O(n^2)$ to $O(n)$, both for training and
inference. The hardware part consists of highly efficient Field Programmable
Gate Array (FPGA)-based implementations using effective reconfiguration, batch
processing, deep pipelining, resource re-using, and hierarchical control.
Experimental results demonstrate that the proposed framework achieves at least
152X speedup and 71X energy efficiency gain compared with IBM TrueNorth
processor under the same test accuracy. It achieves at least 31X energy
efficiency gain compared with the reference FPGA-based work. Comment: 6 figures, AAAI Conference on Artificial Intelligence, 201
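To make the block-circulant idea concrete, here is a minimal sketch for a single circulant block. A circulant matrix is fully determined by one length-n vector, and its product with a vector is a circular convolution, which an FFT computes in O(n log n) operations instead of the O(n^2) of a dense product. The pure-Python radix-2 FFT and all function names are assumptions made for illustration; this is not the paper's implementation.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix with first column c by x via the FFT."""
    n = len(c)
    prod = [a * b for a, b in zip(fft(c), fft(x))]
    # The inverse transform above is unnormalised, so divide by n here.
    return [v.real / n for v in fft(prod, inverse=True)]


def circulant_matvec_direct(c, x):
    """Reference O(n^2) product using C[i][j] = c[(i - j) mod n]."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


if __name__ == "__main__":
    c = [1.0, 2.0, 3.0, 4.0]   # defines the whole 4 x 4 block
    x = [0.5, -1.0, 2.0, 0.0]
    print(circulant_matvec_direct(c, x))
    print(circulant_matvec_fft(c, x))
```

Only the vector `c` needs to be stored for each block, which is where the linear storage cost comes from; a full layer would apply this to every block of the partitioned weight matrix and sum the partial results.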
Hardware Architectures for Post-Quantum Cryptography
The rapid development of quantum computers poses severe threats to many commonly used cryptographic algorithms that are embedded in different hardware devices to ensure the security and privacy of data and communication. In the search for new solutions that can resist attacks from quantum computers, a new research field called Post-Quantum Cryptography (PQC) has emerged; it studies cryptosystems that are deployed on classical computers and are conjectured to be secure against attacks using large-scale quantum computers. In order to secure data during storage or communication, and in many other future applications, this dissertation focuses on the design, implementation, and evaluation of efficient PQC schemes in hardware. Four PQC algorithms, each from a different family, are studied in this dissertation. The first hardware architecture presented in this dissertation is focused on the code-based scheme Classic McEliece. The research presented in this dissertation is the first that builds the hardware architecture for the Classic McEliece cryptosystem. This research successfully demonstrated that a complex code-based PQC algorithm can be run efficiently on hardware. Furthermore, this dissertation shows that the implementation of this scheme on hardware can be easily tuned to different configurations by implementing support for flexible choices of security parameters as well as configurable hardware performance parameters. The successful prototype of the Classic McEliece scheme on hardware increased confidence in this scheme, and helped Classic McEliece to get recognized as one of seven finalists in the third round of the NIST PQC standardization process. While Classic McEliece serves as a ready-to-use candidate for many high-end applications, PQC solutions are also needed for low-end embedded devices. Embedded devices play an important role in our daily life.
Despite their typically constrained resources, these devices require strong security measures to protect them against cyber attacks. Towards securing this type of device, the second research direction presented in this dissertation focuses on the hash-based digital signature scheme XMSS. This research is the first that explores and presents a practical hardware-based XMSS solution for low-end embedded devices. In the design of the XMSS hardware, a heterogeneous software-hardware co-design approach was adopted, which combined the flexibility of the soft core with the acceleration from the hard core. The practicability and efficiency of the XMSS software-hardware co-design is further demonstrated by providing a hardware prototype on an open-source RISC-V based System-on-a-Chip (SoC) platform. The third research direction covered in this dissertation focuses on lattice-based cryptography, which represents one of the most promising and popular alternatives to today's widely adopted public key solutions. Prior research has presented hardware designs targeting the computing blocks that are necessary for the implementation of lattice-based systems. However, a recurrent issue in most existing designs is that these hardware designs are not fully scalable or parameterized, and hence are limited to specific cryptographic primitives and security parameter sets. The research presented in this dissertation is the first that develops hardware accelerators that are designed to be fully parameterized to support different lattice-based schemes and parameters. Further, these accelerators are utilized to realize the first software-hardware co-design of provably-secure instances of qTESLA, which is a lattice-based digital signature scheme. This dissertation demonstrates that even demanding, provably-secure schemes can be realized efficiently with proper use of software-hardware co-design.
The final research direction presented in this dissertation is focused on the isogeny-based scheme SIKE, which recently made it to the final round of the PQC standardization process. This research shows that hardware accelerators can be designed to offload compute-intensive elliptic curve and isogeny computations to hardware in a versatile fashion. These hardware accelerators are designed to be fully parameterized to support different security parameter sets of SIKE as well as flexible hardware configurations targeting different user applications. This research is the first that presents versatile hardware accelerators for SIKE that can be mapped efficiently to both FPGA and ASIC platforms. Based on these accelerators, an efficient software-hardware co-design is constructed for speeding up SIKE. In the end, this dissertation demonstrates that, despite being embedded with expensive arithmetic, the isogeny-based SIKE scheme can be run efficiently by exploiting specialized hardware. These four research directions combined demonstrate the practicability of building efficient hardware architectures for complex PQC algorithms. The exploration of efficient PQC solutions for different hardware platforms will eventually help migrate high-end servers and low-end embedded devices towards the post-quantum era.
Precision analysis for hardware acceleration of numerical algorithms
The precision used in an algorithm affects the error and performance of individual computations, the
memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating
an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision
throughout an algorithm to meet a range or error specification are often overlooked; the major reason
is that it is hard to choose a number system which can guarantee any such specification can be met.
Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be
‘no worse’ than a software implementation. However, the flexibility in the number representation is one
of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and hence ignoring
this potential significantly limits the performance achievable.
In order to optimise the performance of hardware reliably, we require a method that can tractably
calculate tight bounds for the error or range of any variable within an algorithm, but currently only a
handful of methods to calculate such bounds exist, and these either sacrifice tightness or tractability,
whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new
method to calculate these bounds, taking into account both input ranges and finite precision effects,
which we show to be, in general, tighter in comparison to existing methods; this in turn can be used to
tune the hardware to the algorithm specifications.
We demonstrate the use of this software to optimise hardware for various algorithms to accelerate
the solution of a system of linear equations, which forms the basis of many problems in engineering
and science, and show that significant performance gains can be obtained by using this new approach in
conjunction with more traditional hardware optimisations.
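The abstract does not describe the new bounding method itself. For context only, the sketch below shows plain interval arithmetic, one simple and tractable way to bound the range of a variable, and why such bounds can be loose: each occurrence of a variable is treated as independent. The `Interval` class is a hypothetical helper written for this illustration and is not the thesis's method.

```python
class Interval:
    """Closed interval [lo, hi] with naive range-propagation rules."""

    def __init__(self, lo, hi):
        if lo > hi:
            raise ValueError("lower bound exceeds upper bound")
        self.lo = lo
        self.hi = hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Worst case: smallest minus largest, largest minus smallest.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [
            self.lo * other.lo,
            self.lo * other.hi,
            self.hi * other.lo,
            self.hi * other.hi,
        ]
        return Interval(min(products), max(products))

    def width(self):
        return self.hi - self.lo

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"


if __name__ == "__main__":
    x = Interval(0, 1)
    # The true range of x - x is exactly {0}, but the bound is much wider
    # because both operands are treated as unrelated values in [0, 1].
    print(x - x)

    y = Interval(-1, 2)
    # The true range of y * y is [0, 4]; the computed bound is looser.
    print(y * y)
```

A bound of this kind is always safe but may force wider number formats than the algorithm really needs, which is the tightness versus tractability trade-off the abstract refers to.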
Exploring the acceleration of Nekbone on reconfigurable architectures
Hardware technological advances are struggling to match scientific ambition,
and a key question is how we can use the transistors that we already have more
effectively. This is especially true for HPC, where the tendency is often to
throw computation at a problem whereas codes themselves are commonly bound,
at least to some extent, by other factors. By redesigning an algorithm and
moving from a Von Neumann to a dataflow style, there is potentially more
opportunity to address these bottlenecks on reconfigurable architectures
than on more general-purpose architectures.
In this paper we explore the porting of Nekbone's AX kernel, a widely popular
HPC mini-app, to FPGAs using High Level Synthesis via Vitis. Whilst computation
is an important part of this code, it is also memory bound on CPUs, and a key
question is whether one can ameliorate this by leveraging FPGAs. We first
explore optimisation strategies for obtaining good performance, with over a
4000 times runtime difference between the first and final version of our kernel
on FPGAs. Subsequently, performance and power efficiency of our approach on an
Alveo U280 are compared against a 24 core Xeon Platinum CPU and NVIDIA V100
GPU, with the FPGA outperforming the CPU by around four times, achieving almost
three quarters of the GPU performance, and being significantly more power efficient than
both. The result of this work is a comparison and set of techniques that both
apply to Nekbone on FPGAs specifically and are also of interest more widely in
accelerating HPC codes on reconfigurable architectures. Comment: Pre-print of paper accepted to IEEE/ACM International Workshop on
Heterogeneous High-performance Reconfigurable Computing (H2RC
Co-processor offloading applied to passive coherent location with Doppler and bearing data
This project dealt with the acceleration of an aircraft tracking algorithm using a ClearSpeed mathematical co-processor. The algorithm is based on non-linear differential correction (also known as the Gauss-Newton method) and uses Doppler and bearing data from a Passive Coherent Location (PCL) radar system. A PCL radar uses a network of receivers to track targets through their back-scatter from existing Continuous Wave (CW) transmissions, such as broadcast TV or radio. The lack of an active transmitter in a PCL system results in relatively low procurement, operation and maintenance costs. This is of particular advantage for airports in third world countries, many of which do not have radar-assisted air traffic control.