93 research outputs found
Flexible HLS-Based Implementation of the Karatsuba Multiplier Targeting Homomorphic Encryption Schemes
Custom accelerators for high-precision integer arithmetic are increasingly used in compute-intensive applications, in particular homomorphic encryption schemes. This work seeks to advance a strategy for faster deployment of these accelerators using high-level synthesis (HLS). Insights from existing number theory software libraries and custom hardware accelerators are used to develop a scalable implementation of Karatsuba modular polynomial multiplication. The accelerator generated from this implementation by the high-level synthesis tool Vivado HLS achieves significant speedup over the implementations available in the highly optimized FLINT software library. This is an important first step towards a larger goal of enabling HLS-based homomorphic encryption in the cloud.
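As a rough illustration of the Karatsuba polynomial multiplication named in the abstract above, the following is a minimal Python sketch. It is not the paper's HLS design: the function names, the `threshold` cutoff, the power-of-two length requirement, and the final coefficient reduction by a modulus `q` are all assumptions made for this example.

```python
def schoolbook_mul(a, b):
    """Reference O(n^2) product of two coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def karatsuba_mul(a, b, threshold=8):
    """Karatsuba product of two equal-length coefficient lists.

    Assumption of this sketch: len(a) == len(b) and the length is a power of two.
    """
    n = len(a)
    if n <= max(threshold, 1):
        return schoolbook_mul(a, b)
    h = n // 2
    a_lo, a_hi = a[:h], a[h:]
    b_lo, b_hi = b[:h], b[h:]
    # Three half-size products instead of four.
    lo = karatsuba_mul(a_lo, b_lo, threshold)
    hi = karatsuba_mul(a_hi, b_hi, threshold)
    mid = karatsuba_mul([x + y for x, y in zip(a_lo, a_hi)],
                        [x + y for x, y in zip(b_lo, b_hi)], threshold)
    # result = lo + (mid - lo - hi) * x^h + hi * x^(2h)
    out = [0] * (2 * n - 1)
    for i, v in enumerate(lo):
        out[i] += v
        out[i + h] -= v
    for i, v in enumerate(hi):
        out[i + 2 * h] += v
        out[i + h] -= v
    for i, v in enumerate(mid):
        out[i + h] += v
    return out


def karatsuba_mul_mod(a, b, q, threshold=8):
    """Karatsuba product with every coefficient reduced modulo q."""
    return [c % q for c in karatsuba_mul(a, b, threshold)]
```

The `threshold` parameter marks the point where the recursion falls back to the schoolbook method; in a hardware setting the analogous choice would be how deep to unroll the recursion, which this sketch does not model.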
RISE: RISC-V SoC for En/decryption Acceleration on the Edge for Homomorphic Encryption
Today edge devices commonly connect to the cloud to use its storage and
compute capabilities. This leads to security and privacy concerns about user
data. Homomorphic Encryption (HE) is a promising solution to address the data
privacy problem as it allows arbitrarily complex computations on encrypted data
without ever needing to decrypt it. While there has been a lot of work on
accelerating HE computations in the cloud, little attention has been paid to
the message-to-ciphertext and ciphertext-to-message conversion operations on
the edge. In this work, we profile the edge-side conversion operations, and our
analysis shows that, during conversion, the error sampling, encryption, and
decryption operations are the bottlenecks. To overcome these bottlenecks, we
present RISE, an area and energy-efficient RISC-V SoC. RISE leverages an
efficient and lightweight pseudo-random number generator core and combines it
with fast sampling techniques to accelerate the error sampling operations. To
accelerate the encryption and decryption operations, RISE uses scalable,
data-level parallelism to implement the number theoretic transform operation,
the main bottleneck within the encryption and decryption operations. In
addition, RISE saves area by implementing a unified en/decryption datapath, and
efficiently exploits techniques like memory reuse and data reordering to
utilize a minimal amount of on-chip memory. We evaluate RISE using a complete
RTL design containing a RISC-V processor interfaced with our accelerator. Our
analysis reveals that, for message-to-ciphertext conversion and
ciphertext-to-message conversion, RISE is up to 6191.19X and 2481.44X
more energy-efficient, respectively, than using just the RISC-V
processor.
CoFHEE: A Co-processor for Fully Homomorphic Encryption Execution
The migration of computation to the cloud has raised privacy concerns as
sensitive data becomes vulnerable to attacks since it needs to be decrypted
for processing. Fully Homomorphic Encryption (FHE) mitigates this issue as it
enables meaningful computations to be performed directly on encrypted data.
Nevertheless, FHE is orders of magnitude slower than unencrypted computation,
which hinders its practicality and adoption. Therefore, improving FHE
performance is essential for its real world deployment. In this paper, we
present a year-long effort to design, implement, fabricate, and post-silicon
validate a hardware accelerator for Fully Homomorphic Encryption dubbed CoFHEE.
With a design area of , CoFHEE aims to improve performance of
ciphertext multiplications, the most demanding arithmetic FHE operation, by
accelerating several primitive operations on polynomials, such as polynomial
additions and subtractions, Hadamard product, and Number Theoretic Transform.
CoFHEE supports polynomial degrees of up to with a maximum
coefficient size of 128 bits, while it is capable of performing ciphertext
multiplications entirely on chip for . CoFHEE is fabricated in
55nm CMOS technology and achieves 250 MHz with our custom-built low-power
digital PLL design. In addition, our chip includes two communication interfaces
to the host machine: UART and SPI. This manuscript presents all steps and
design techniques in the ASIC development process, ranging from RTL design to
fabrication and validation. We evaluate our chip with performance and power
experiments and compare it against state-of-the-art software implementations
and other ASIC designs. The developed RTL files are available in an open-source
repository.
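The abstract above lists the polynomial primitives behind ciphertext multiplication: the Number Theoretic Transform and the Hadamard (pointwise) product. The Python sketch below shows how these compose into a polynomial product. It is only an illustration under stated assumptions: the toy parameters `n = 8`, `q = 17`, and `omega = 2` (a primitive 8th root of unity modulo 17) are chosen for readability and are not CoFHEE's parameters, and the sketch computes a cyclic product modulo x^n - 1, whereas FHE schemes typically work modulo x^n + 1.

```python
def ntt(a, omega, q):
    """Recursive radix-2 NTT; len(a) must be a power of two and
    omega a primitive len(a)-th root of unity modulo q."""
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % q
    even = ntt(a[0::2], omega_sq, q)
    odd = ntt(a[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q           # butterfly, upper half
        out[k + n // 2] = (even[k] - t) % q  # butterfly, lower half
        w = w * omega % q
    return out


def intt(a, omega, q):
    """Inverse NTT: forward transform with omega^-1, then scale by n^-1."""
    n_inv = pow(len(a), -1, q)  # modular inverse, Python 3.8+
    return [x * n_inv % q for x in ntt(a, pow(omega, -1, q), q)]


def hadamard(a, b, q):
    """Pointwise product of two equal-length vectors modulo q."""
    return [x * y % q for x, y in zip(a, b)]


def cyclic_mul_ntt(a, b, omega, q):
    """Product of a and b modulo (x^n - 1, q) via NTT, Hadamard product, inverse NTT."""
    return intt(hadamard(ntt(a, omega, q), ntt(b, omega, q), q), omega, q)


def cyclic_mul_naive(a, b, q):
    """O(n^2) reference for the same cyclic product."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + a[i] * b[j]) % q
    return out


# Toy parameters (assumptions of this sketch).
Q, OMEGA = 17, 2
A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [8, 7, 6, 5, 4, 3, 2, 1]
C = cyclic_mul_ntt(A, B, OMEGA, Q)
```

Here `C` equals `cyclic_mul_naive(A, B, Q)`; the transform replaces the quadratic coefficient loop with two forward transforms, one pointwise product, and one inverse transform.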
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
Fully Homomorphic Encryption is a technique that allows computation on
encrypted data. It has the potential to change privacy considerations in the
cloud, but computational and memory overheads are preventing its adoption. TFHE
is a promising Torus-based FHE scheme that relies on bootstrapping, the
noise-removal tool invoked after each encrypted logical/arithmetical operation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is
the first hardware accelerator to exploit the inherent noise present in FHE
calculations. Instead of double or single-precision floating-point arithmetic,
it implements TFHE bootstrapping entirely with approximate fixed-point
arithmetic. Using an in-depth analysis of noise propagation in bootstrapping
FFT computations, FPT is able to use noise-trimmed fixed-point representations
that are up to 50% smaller than prior implementations.
FPT is built as a streaming processor inspired by traditional streaming DSPs:
it instantiates directly cascaded high-throughput computational stages, with
minimal control logic and routing networks. We explore throughput-balanced
compositions of streaming kernels with a user-configurable streaming width in
order to construct a full bootstrapping pipeline. Our approach allows 100%
utilization of arithmetic units and requires only a small bootstrapping key
cache, enabling an entirely compute-bound bootstrapping throughput of 1 BS /
35us. This is in stark contrast to the classical CPU approach to FHE
bootstrapping acceleration, which is typically constrained by memory and
bandwidth.
FPT is implemented and evaluated as a bootstrapping FPGA kernel for an Alveo
U280 datacenter accelerator card. FPT achieves two to three orders of magnitude
higher bootstrapping throughput than existing CPU-based implementations, and
2.5x higher throughput compared to recent ASIC emulation experiments.
Comment: ACM CCS 202
Efficient Computation and FPGA implementation of Fully Homomorphic Encryption with Cloud Computing Significance
Homomorphic Encryption provides a unique security solution for cloud computing. It ensures not only that data in the cloud remain confidential but also that data processing by the cloud server does not compromise data privacy. The Fully Homomorphic Encryption (FHE) scheme proposed by Lopez-Alt, Tromer, and Vaikuntanathan (LTV), also known as the NTRU (Nth degree truncated polynomial ring) based method, is considered one of the most important FHE methods suitable for practical implementation. In this thesis, an efficient algorithm and architecture for LTV Fully Homomorphic Encryption is proposed. The conventional linear feedback shift register (LFSR) structure is expanded and modified to perform the truncated polynomial ring multiplication of the LTV scheme in parallel. Novel and efficient modular multiplier, modular adder, and modular subtractor designs are proposed to support high-speed processing of LFSR operations. In addition, a family of special moduli is selected for high-speed computation of modular operations. Although the area complexity remains O(Nn^2), with no advantage at the circuit level, the proposed architecture effectively reduces the time complexity from O(N log N) to linear time, O(N), compared to the best existing works. An FPGA implementation of the proposed architecture for LTV FHE is achieved and demonstrated. An elaborate comparison of the existing methods and the proposed work is presented, which shows that the proposed work gains significant speedup over existing works.
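To make the idea of a shift-register style truncated polynomial ring multiplication more concrete, here is a minimal Python sketch. It is not the thesis architecture: the function names, the `negacyclic` switch (reduction modulo x^N + 1 versus x^N - 1), and the toy modulus are assumptions of this example. Each loop iteration stands for one step in which a register holding a(x) times a power of x is multiplied by one coefficient of b(x) and accumulated, then shifted with wrap-around. If the N multiply-accumulates of a step run in parallel, the N steps correspond to the linear time the abstract mentions.

```python
def shift_register_mul(a, b, q, negacyclic=True):
    """Multiply a and b (coefficient lists of length N, lowest degree first)
    modulo q and modulo x^N + 1 (negacyclic=True) or x^N - 1 (negacyclic=False)."""
    n = len(a)
    reg = [c % q for c in a]   # register holds a(x) * x^i in the ring
    acc = [0] * n
    for b_i in b:
        # One step: N independent multiply-accumulates (parallel in hardware).
        acc = [(s + b_i * r) % q for s, r in zip(acc, reg)]
        # Multiply the register by x: shift up, wrap the top coefficient around.
        top = reg[-1]
        wrapped = (-top if negacyclic else top) % q
        reg = [wrapped] + reg[:-1]
    return acc


def ring_mul_naive(a, b, q, negacyclic=True):
    """O(N^2) reference: schoolbook product followed by ring reduction."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            term = a[i] * b[j]
            if k >= n:
                k -= n
                if negacyclic:
                    term = -term   # x^N = -1 in the ring x^N + 1
            out[k] = (out[k] + term) % q
    return out
```

Calling `shift_register_mul([1, 2], [3, 4], 17)` follows the two steps by hand: (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2, and with x^2 = -1 this reduces to -5 + 10x, that is `[12, 10]` modulo 17.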
On designing hardware accelerator-based systems: interfaces, taxes and benefits
Complementary Metal Oxide Semiconductor (CMOS) Technology scaling has slowed down. One promising approach to sustain the historic performance improvement of computing systems is to utilize hardware accelerators. Today, many commercial computing systems integrate one or more accelerators, with each accelerator optimized to efficiently execute specific tasks.
Over the years, there has been a substantial amount of research on designing hardware accelerators for machine learning (ML) training and inference tasks. Hardware accelerators are also widely employed to accelerate data privacy and security algorithms. In particular, there is currently a growing interest in the use of hardware accelerators for accelerating homomorphic encryption (HE) based privacy-preserving computing.
While the use of hardware accelerators is promising, a realistic end-to-end evaluation of an accelerator when integrated into the full system often reveals that the benefits of an accelerator are not always as expected. Simply assessing the performance of the accelerated portion of an application, such as the inference kernel in ML applications, during performance analysis can be misleading. When designing an accelerator-based system, it is critical to evaluate the system as a whole and account for all the accelerator taxes.
In the first part of our research, we highlight the need for a holistic, end-to-end analysis of workloads using ML and HE applications. Our evaluation of an ML application for a database management system (DBMS) shows that the benefits of offloading ML inference to accelerators depend on several factors, including backend hardware, model complexity, data size, and the level of integration between the ML inference pipeline and the DBMS. We also found that the end-to-end performance improvement is bottlenecked by data retrieval and pre-processing, as well as inference. Additionally, our evaluation of an HE video encryption application shows that while HE client-side operations, i.e., message-to-ciphertext and ciphertext-to-message conversion operations, are bottlenecked by number theoretic transform (NTT) operations, accelerating NTT in hardware alone is not sufficient to improve application throughput (frames per second) enough. We need to address all bottlenecks, such as error sampling, encryption, and decryption, in the message-to-ciphertext and ciphertext-to-message conversion pipeline.
In the second part of our research, we address the lack of a scalable evaluation infrastructure for building and evaluating accelerator-based systems. To solve this problem, we propose a robust and scalable software-hardware framework for accelerator evaluation, which uses an open-source RISC-V based System-on-Chip (SoC) design called BlackParrot. This framework can be utilized by accelerator designers and system architects to perform an end-to-end performance analysis of coherent and non-coherent accelerators while carefully accounting for the interaction between the accelerator and the rest of the system.
In the third part of our research, we present RISE, which is a full RISC-V SoC designed to efficiently perform message-to-ciphertext and ciphertext-to-message conversion operations. RISE comprises a BlackParrot core and an efficient custom-designed accelerator tailored to accelerate end-to-end message-to-ciphertext and ciphertext-to-message conversion operations. Our RTL-based evaluation demonstrates that RISE improves the throughput of the video encryption application by 10x-27x for different frame resolutions.
HEAX: An Architecture for Computing on Encrypted Data
With the rapid increase in cloud computing, concerns surrounding data
privacy, security, and confidentiality have also increased significantly.
Not only are cloud providers susceptible to internal and external hacks, but
in some scenarios data owners also cannot outsource the computation due to
privacy laws such as GDPR, HIPAA, or CCPA. Fully Homomorphic Encryption (FHE)
is a groundbreaking invention in cryptography that, unlike traditional
cryptosystems, enables computation on encrypted data without ever decrypting
it. However, the most critical obstacle in deploying FHE at large scale is the
enormous computation overhead.
In this paper, we present HEAX, a novel hardware architecture for FHE that
achieves unprecedented performance improvement. HEAX leverages multiple levels
of parallelism, ranging from ciphertext-level to fine-grained modular
arithmetic level. Our first contribution is a new highly-parallelizable
architecture for number-theoretic transform (NTT) which can be of independent
interest as NTT is frequently used in many lattice-based cryptography systems.
Building on top of the NTT engine, we design a novel architecture for computation
on homomorphically encrypted data. We also introduce several techniques to
enable an end-to-end, fully pipelined design and to reduce on-chip memory
consumption. Our implementation on reconfigurable hardware demonstrates
164-268x performance improvement for a wide range of FHE parameters.
Comment: To appear in proceedings of ACM ASPLOS 202
CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure
Fully homomorphic encryption (FHE) is in the spotlight as a definitive
solution for privacy, but the high computational overhead of FHE poses a
challenge to its practical adoption. Although prior studies have attempted to
design ASIC accelerators to mitigate the overhead, their designs require
excessive amounts of chip resources (e.g., areas) to contain and process
massive data for FHE operations.
We propose CiFHER, a chiplet-based FHE accelerator with a resizable
structure, to tackle the challenge with a cost-effective multi-chip module
(MCM) design. First, we devise a flexible architecture of a chiplet core whose
configuration can be adjusted to conform to the global organization of chiplets
and design constraints. The distinctive feature of our core is a recomposable
functional unit providing varying computational throughput for number-theoretic
transform (NTT), the most dominant function in FHE. Then, we establish
generalized data mapping methodologies to minimize the network overhead when
organizing the chips into the MCM package in a tiled manner, which becomes a
significant bottleneck due to the technology constraints of MCMs. Also, we
analyze the effectiveness of various algorithms, including a novel limb
duplication algorithm, on the MCM architecture. A detailed evaluation shows
that a CiFHER package composed of 4 to 64 compact chiplets provides performance
comparable to state-of-the-art monolithic ASIC FHE accelerators with
significantly lower package-wide power consumption while reducing the area of a
single core to as small as 4.28 mm².
Comment: 15 pages, 9 figures