Towards Fast and Scalable Private Inference
Privacy and security have rapidly emerged as first-order design constraints.
Users now demand more protection over who can see their data (confidentiality)
as well as how it is used (control). Here, existing cryptographic techniques
for security fall short: they secure data when stored or communicated but must
decrypt it for computation. Fortunately, a new paradigm of computing exists,
which we refer to as privacy-preserving computation (PPC). Emerging PPC
technologies can be leveraged for secure outsourced computation or to enable
two parties to compute without revealing either party's secret data. Despite
their phenomenal potential to revolutionize user protection in the digital age,
their realization has been limited by exorbitant computational, communication,
and storage overheads.
This paper reviews recent efforts on addressing various PPC overheads using
private inference (PI) in neural networks as a motivating application. First,
the problem and various technologies, including homomorphic encryption (HE),
secret sharing (SS), garbled circuits (GCs), and oblivious transfer (OT), are
introduced. Next, a characterization of their overheads when used to implement
PI is covered. The characterization motivates the need for both GCs and HE
accelerators. Then two solutions are presented: HAAC for accelerating GCs and
RPU for accelerating HE. To conclude, results and effects are shown with a
discussion on what future work is needed to overcome the remaining overheads of
PI.
Comment: Appears in the 20th ACM International Conference on Computing Frontiers.
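Of the primitives this survey introduces, additive secret sharing is the simplest to illustrate concretely. The toy sketch below (hypothetical code, not from any of the surveyed systems) splits a secret into shares that individually look uniformly random, and shows why addition is "free" in SS-based protocols: each party just adds its shares locally.

```python
import secrets

Q = 2**61 - 1  # public modulus; all arithmetic is mod Q

def share(x, n=2):
    """Split x into n additive shares that sum to x mod Q."""
    shares = [secrets.randbelow(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Each party holds one share; no single share reveals anything about x.
a_shares = share(42)
b_shares = share(100)
# Addition requires no communication: parties add their shares pointwise.
c_shares = [(a + b) % Q for a, b in zip(a_shares, b_shares)]
print(reconstruct(c_shares))  # 142
```

Multiplication, by contrast, requires interaction (e.g., Beaver triples or OT), which is where the communication overheads characterized in the paper arise.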
REED: Chiplet-Based Scalable Hardware Accelerator for Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) has emerged as a promising technology for processing encrypted data without the need for decryption. Despite its potential, its practical implementation has faced challenges due to substantial computational overhead. To address this issue, we propose the chiplet-based FHE accelerator design REED, which enables scalability and offers high throughput, thereby enhancing homomorphic-encryption deployment in real-world scenarios. The design accounts for the well-known wafer-yield issues during fabrication, which significantly impact production costs. In contrast to state-of-the-art approaches, we also address data-exchange overhead by proposing a non-blocking inter-chiplet communication strategy. We incorporate novel pipelined Number Theoretic Transform and automorphism techniques, leveraging parallelism to provide high throughput.
Experimental results demonstrate that the REED 2.5D integrated circuit consumes 177 mm² of chip area and 82.5 W average power in a 7nm technology, and achieves an impressive speedup of up to 5,982× compared to a CPU (24-core, 2× Intel X5690), with 2× better energy efficiency and 50% lower development cost than a state-of-the-art ASIC accelerator. To evaluate its practical impact, we are the first to benchmark encrypted deep neural network training. Overall, this work successfully enhances the practicality and deployability of fully homomorphic encryption in real-world scenarios.
SHIELD: Scalable Homomorphic Implementation of Encrypted Data-Classifiers
Homomorphic encryption (HE) systems enable computations on encrypted data, without decrypting and without knowledge of the secret key. In this work, we describe an optimized Ring Learning With Errors (RLWE) based implementation of a variant of the HE system recently proposed by Gentry, Sahai and Waters (GSW). Although this system was widely believed to be less efficient than its contemporaries, we demonstrate quite the opposite behavior for a large class of applications. We first highlight and carefully exploit the algebraic features of the system to achieve significant speedup over the state-of-the-art HE implementation, namely the IBM homomorphic encryption library (HElib). We introduce several optimizations on top of our HE implementation, and use the resulting scheme to construct a homomorphic Bayesian spam filter, secure multiple-keyword search, and a homomorphic evaluator for binary decision trees. Our results show a 10× improvement in performance (under the same security settings and CPU platforms) compared to IBM HElib for these applications. Our system is built to be easily portable to GPUs (unlike IBM HElib), which yields an additional speedup of up to 103.5×, for an overall speedup of 1,035×.
Toward Practical Privacy-Preserving Convolutional Neural Networks Exploiting Fully Homomorphic Encryption
Incorporating fully homomorphic encryption (FHE) into the inference process
of a convolutional neural network (CNN) draws enormous attention as a viable
approach for achieving private inference (PI). FHE allows delegating the entire
computation process to the server while ensuring the confidentiality of
sensitive client-side data. However, practical FHE implementation of a CNN
faces significant hurdles, primarily due to FHE's substantial computational and
memory overhead. To address these challenges, we propose a set of
optimizations, which includes GPU/ASIC acceleration, an efficient activation
function, and an optimized packing scheme. We evaluate our method using the
ResNet models on the CIFAR-10 and ImageNet datasets, achieving several orders
of magnitude improvement compared to prior work and reducing the latency of the
encrypted CNN inference to 1.4 seconds on an NVIDIA A100 GPU. We also show that
the latency drops to a mere 0.03 seconds with a custom hardware design.
Comment: 3 pages, 1 figure, appears at DISCC 2023 (2nd Workshop on Data
Integrity and Secure Cloud Computing, in conjunction with the 56th
International Symposium on Microarchitecture, MICRO 2023).
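FHE schemes natively evaluate only additions and multiplications, which is why works like this one replace ReLU with an "efficient activation function", typically a low-degree polynomial surrogate. As an illustrative sketch (not the paper's actual activation, whose construction is not given here), the toy code below fits a degree-2 least-squares polynomial to ReLU on [-1, 1]; the result uses only FHE-evaluable operations.

```python
def polyfit(xs, ys, deg):
    """Least-squares polynomial fit via the normal equations (V^T V)c = V^T y."""
    m, n = len(xs), deg + 1
    V = [[x ** j for j in range(n)] for x in xs]
    A = [[sum(V[k][i] * V[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    b = [sum(V[k][i] * ys[k] for k in range(m)) for i in range(n)]
    for col in range(n):                      # Gaussian elimination, partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for cc in range(col, n):
                A[r][cc] -= f * A[col][cc]
            b[r] -= f * b[col]
    c = [0.0] * n                             # back substitution
    for r in range(n - 1, -1, -1):
        c[r] = (b[r] - sum(A[r][j] * c[j] for j in range(r + 1, n))) / A[r][r]
    return c                                  # coefficients c0 + c1*x + c2*x^2 + ...

xs = [i / 100 - 1 for i in range(201)]        # sample grid on [-1, 1]
relu = [max(x, 0.0) for x in xs]
c0, c1, c2 = polyfit(xs, relu, 2)
approx = lambda x: c0 + c1 * x + c2 * x * x   # adds/mults only: FHE-friendly
max_err = max(abs(approx(x) - r) for x, r in zip(xs, relu))
```

The trade-off driving such designs: a higher-degree polynomial approximates ReLU more closely but consumes more multiplicative depth, which in FHE translates directly into larger parameters and latency.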
A Programmable SoC-Based Accelerator for Privacy-Enhancing Technologies and Functional Encryption
A multitude of privacy-enhancing technologies (PETs) has been presented recently to solve the privacy problems of contemporary services utilizing cloud computing. Many of them are based on additively homomorphic encryption (AHE), which allows the computation of additions on encrypted data. The main technical obstacles to adopting PETs in practical systems are the performance overheads relative to current privacy-violating alternatives. In this article, we present a hardware/software (HW/SW) codesign for programmable systems-on-chip (SoCs) that is designed for accelerating applications based on Paillier encryption. Our implementation is a microcode-based multicore architecture suitable for accelerating various PETs that use AHE with large-integer modular arithmetic. We instantiate the implementation in a Xilinx Zynq-7000 programmable SoC and provide performance evaluations in real hardware. We also investigate its efficiency in a high-end Xilinx UltraScale+ programmable SoC. We evaluate the implementation with two target use cases relevant to PETs: privacy-preserving computation of squared Euclidean distances over encrypted data and multi-input functional encryption (FE) for inner products. Both represent the first hardware-acceleration results for such operations, and the latter in particular is among the very first published implementations of FE on any platform.
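Since the accelerator targets Paillier encryption, a minimal software sketch of the scheme helps fix ideas. The parameters below are toy-sized and wholly insecure (real deployments use 2048-bit and larger moduli, which is exactly why the large-integer modular arithmetic above needs hardware acceleration); the key property shown is that multiplying ciphertexts modulo n² adds the underlying plaintexts.

```python
import math, secrets

# Toy Paillier keypair (tiny primes, illustration only -- NOT secure).
p, q = 1_000_003, 1_000_033
n = p * q
n2 = n * n
g = n + 1                                    # standard choice of generator
lam = math.lcm(p - 1, q - 1)                 # lambda = lcm(p-1, q-1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)  # mu = L(g^lam mod n^2)^-1 mod n

def enc(m):
    """Encrypt m with fresh randomness r coprime to n."""
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    """Decrypt via L(c^lam mod n^2) * mu mod n, where L(u) = (u-1)/n."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Additive homomorphism: Enc(a) * Enc(b) mod n^2 decrypts to a + b.
c = (enc(20) * enc(22)) % n2
print(dec(c))  # 42
```

The accelerated use cases follow directly from this property: squared Euclidean distances and inner products decompose into sums, which the server can compute entirely on ciphertexts.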
CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure
Fully homomorphic encryption (FHE) is in the spotlight as a definitive
solution for privacy, but the high computational overhead of FHE poses a
challenge to its practical adoption. Although prior studies have attempted to
design ASIC accelerators to mitigate the overhead, their designs require
excessive amounts of chip resources (e.g., areas) to contain and process
massive data for FHE operations.
We propose CiFHER, a chiplet-based FHE accelerator with a resizable
structure, to tackle the challenge with a cost-effective multi-chip module
(MCM) design. First, we devise a flexible architecture of a chiplet core whose
configuration can be adjusted to conform to the global organization of chiplets
and design constraints. The distinctive feature of our core is a recomposable
functional unit providing varying computational throughput for number-theoretic
transform (NTT), the most dominant function in FHE. Then, we establish
generalized data mapping methodologies to minimize the network overhead when
organizing the chips into the MCM package in a tiled manner, which becomes a
significant bottleneck due to the technology constraints of MCMs. Also, we
analyze the effectiveness of various algorithms, including a novel limb
duplication algorithm, on the MCM architecture. A detailed evaluation shows
that a CiFHER package composed of 4 to 64 compact chiplets provides performance
comparable to state-of-the-art monolithic ASIC FHE accelerators with
significantly lower package-wide power consumption while reducing the area of a
single core to as small as 4.28 mm².
Comment: 15 pages, 9 figures.
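The number-theoretic transform that dominates FHE workloads is simply a DFT over a finite field. A minimal sketch follows, with toy parameters (n = 4, q = 13, ω = 5, a primitive 4th root of unity mod 13) chosen only so the arithmetic is checkable by hand; production accelerators like CiFHER implement O(n log n) butterfly versions of this at n in the tens of thousands.

```python
# Toy cyclic NTT: n = 4, modulus q = 13, omega = 5 (5^2 = -1, 5^4 = 1 mod 13).
q, n, w = 13, 4, 5

def ntt(a, root):
    """Naive O(n^2) transform: A[k] = sum_j a[j] * root^(j*k) mod q."""
    return [sum(a[j] * pow(root, j * k, q) for j in range(n)) % q
            for k in range(n)]

def intt(A):
    """Inverse transform: use omega^-1 and rescale by n^-1 mod q."""
    inv_n = pow(n, -1, q)
    a = ntt(A, pow(w, -1, q))
    return [(x * inv_n) % q for x in a]

a = [1, 2, 3, 4]
assert intt(ntt(a, w)) == a
# Pointwise multiplication in the NTT domain is cyclic convolution mod q --
# the property that turns polynomial multiplication into O(n log n) work.
b = [5, 6, 7, 8]
prod = intt([x * y % q for x, y in zip(ntt(a, w), ntt(b, w))])
```

The "recomposable functional unit" discussed above is essentially a way to vary how many of these butterfly-style operations run in parallel per chiplet.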
Edge Computing for AI and ML: Enhancing Performance and Privacy in Data Analysis
Centralised cloud-computing paradigms are encountering difficulties with latency, bandwidth, privacy, and security due to the exponential growth of data volumes produced by sensors and Internet of Things (IoT) devices. One potential answer to these constraints is edge computing, which moves computation and storage closer to the data sources. With this paradigm change, data privacy is improved, network congestion is decreased, and real-time processing becomes possible. Aiming to improve the efficiency and confidentiality of data-analysis applications powered by artificial intelligence (AI) and machine learning (ML), this article investigates the potential of edge computing. We provide a thorough analysis of the latest developments in edge-computing frameworks, algorithms, and architectures that allow for safe and fast training and inference of AI/ML models at the edge. We also discuss the main obstacles and promising directions for future research. Our work lays the groundwork for future intelligent edge systems by demonstrating the substantial advantages of edge computing in facilitating low-latency, energy-efficient, and privacy-preserving AI/ML applications.
 
Harnessing the Power of Distributed Computing: Advancements in Scientific Applications, Homomorphic Encryption, and Federated Learning Security
Data explosion poses many challenges to state-of-the-art systems, applications, and methodologies. It has been reported that 181 zettabytes of data are expected to be generated in 2025, an increase of over 150% compared to the data expected to be generated in 2023. However, while system manufacturers consistently develop devices with larger storage spaces and provide alternative storage capacity in the cloud at affordable rates, another key challenge is how to effectively process large volumes of stored data in time-critical conventional systems. One transformative paradigm revolutionizing the processing and management of such data is distributed computing, whose application requires deep understanding. This dissertation explores the potential impact of applying efficient distributed-computing concepts to long-standing challenges in (i) a widely used data-intensive scientific application, (ii) applying homomorphic encryption to data-intensive workloads found in outsourced databases, and (iii) the security of tokenized incentive mechanisms for federated learning (FL) systems.
The first part of the dissertation tackles the microelectrode array (MEA) parameterization problem from an orthogonal viewpoint enlightened by algebraic topology, which allows us to algebraically parametrize MEAs whose structure and intrinsic parallelism are hard to identify otherwise. We implement a new paradigm, namely Parma, to demonstrate the effectiveness of the proposed approach and report how it outperforms the state of the practice in time, scalability, and memory usage.
The second part discusses our work on introducing the concept of parallel caching of secure aggregation to mitigate the performance overhead incurred by the HE module in outsourced databases.
The key idea of this optimization approach is caching selected radix-ciphertexts in parallel without violating the existing security guarantees of the primitive/base HE scheme. A new radix HE algorithm was designed and applied to both batch and incremental HE schemes, and experiments carried out on six workloads show that the proposed caching boosts state-of-the-art HE schemes by orders of magnitude.
The third part discusses our work on leveraging the security benefits of blockchains to enhance the fairness and reliability of tokenized incentive mechanisms for FL systems. We designed a blockchain-based auditing protocol to mitigate Gaussian attacks and carried out experiments with multiple FL aggregation algorithms, popular datasets, and a variety of scales to validate its effectiveness.
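As background for the secure-aggregation theme above, the toy sketch below illustrates the classic pairwise-masking construction (a generic textbook idea, not the dissertation's caching or auditing protocol): each client perturbs its update with random masks that cancel only in the server-side sum, so the aggregate is exact while individual updates stay hidden.

```python
import secrets

Q = 2**31  # public modulus for masked updates

def masked_updates(updates):
    """Pairwise-mask each client's update so only the sum is recoverable.

    Client i adds mask m[i][j] for each j > i and subtracts m[j][i] for each
    j < i; every mask appears once with + and once with -, so the masks
    cancel when the server sums all submissions."""
    n = len(updates)
    masks = {(i, j): secrets.randbelow(Q)
             for i in range(n) for j in range(i + 1, n)}
    out = []
    for i, u in enumerate(updates):
        v = u
        for j in range(n):
            if i < j:
                v = (v + masks[(i, j)]) % Q
            elif j < i:
                v = (v - masks[(j, i)]) % Q
        out.append(v)
    return out

updates = [3, 7, 5]                        # toy scalar model updates
server_sum = sum(masked_updates(updates)) % Q
print(server_sum)  # 15 -- individual updates stay hidden
```

Real protocols add dropout recovery and key agreement on top of this skeleton; a blockchain-based audit layer, as in the third part above, targets the orthogonal problem of clients submitting dishonest (e.g., Gaussian-noise) updates.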
GPU-based Private Information Retrieval for On-Device Machine Learning Inference
On-device machine learning (ML) inference can enable the use of private user
data on user devices without revealing them to remote servers. However, a pure
on-device solution to private ML inference is impractical for many applications
that rely on embedding tables that are too large to be stored on-device. In
particular, recommendation models typically use multiple embedding tables, each
on the order of 1-10 GB, making them impractical to store on-device.
To overcome this barrier, we propose the use of private information retrieval
(PIR) to efficiently and privately retrieve embeddings from servers without
sharing any private information. As off-the-shelf PIR algorithms are usually
too computationally intensive to directly use for latency-sensitive inference
tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR
with the downstream ML application to obtain further speedup. Our GPU
acceleration strategy substantially improves system throughput over an
optimized CPU PIR implementation, and our PIR-ML co-design provides a further
throughput improvement at fixed model quality. Together, for various on-device
ML applications such as recommendation and language modeling, our system on a
single V100 GPU can serve queries at a rate well beyond a CPU-based baseline
while maintaining model accuracy.
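The PIR primitive itself is easiest to illustrate in the two-server information-theoretic setting (a classic construction shown here as background; the GPU system above is a different, computational PIR scheme): the client sends each server a random-looking selection vector, and XORing the two answers recovers exactly the wanted record without either server learning the index.

```python
import secrets

def pir_query(db_size, index):
    """Two queries whose XOR is the unit vector e_index; each alone is uniform."""
    q1 = [secrets.randbelow(2) for _ in range(db_size)]
    q2 = q1.copy()
    q2[index] ^= 1
    return q1, q2

def pir_answer(db, query):
    """Server XORs together the records selected by the query bits."""
    acc = 0
    for rec, bit in zip(db, query):
        if bit:
            acc ^= rec
    return acc

db = [0x11, 0x22, 0x33, 0x44]   # toy stand-in for an embedding table
q1, q2 = pir_query(len(db), 2)
# Each server sees only a uniformly random bit vector and learns nothing
# about which record the client wants.
print(pir_answer(db, q1) ^ pir_answer(db, q2))  # 51 (0x33)
```

Note that each server still touches the whole database per query; that linear scan is the computational cost that motivates GPU acceleration and PIR-ML co-design in the first place.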