305 research outputs found

    Towards Fast and Scalable Private Inference

    Full text link
    Privacy and security have rapidly emerged as first order design constraints. Users now demand more protection over who can see their data (confidentiality) as well as how it is used (control). Here, existing cryptographic techniques for security fall short: they secure data when stored or communicated but must decrypt it for computation. Fortunately, a new paradigm of computing exists, which we refer to as privacy-preserving computation (PPC). Emerging PPC technologies can be leveraged for secure outsourced computation or to enable two parties to compute without revealing either users' secret data. Despite their phenomenal potential to revolutionize user protection in the digital age, the realization has been limited due to exorbitant computational, communication, and storage overheads. This paper reviews recent efforts on addressing various PPC overheads using private inference (PI) in neural network as a motivating application. First, the problem and various technologies, including homomorphic encryption (HE), secret sharing (SS), garbled circuits (GCs), and oblivious transfer (OT), are introduced. Next, a characterization of their overheads when used to implement PI is covered. The characterization motivates the need for both GCs and HE accelerators. Then two solutions are presented: HAAC for accelerating GCs and RPU for accelerating HE. To conclude, results and effects are shown with a discussion on what future work is needed to overcome the remaining overheads of PI.Comment: Appear in the 20th ACM International Conference on Computing Frontier

    REED: Chiplet-Based Scalable Hardware Accelerator for Fully Homomorphic Encryption

    Get PDF
    Fully Homomorphic Encryption (FHE) has emerged as a promising technology for processing encrypted data without the need for decryption. Despite its potential, its practical implementation has faced challenges due to substantial computational overhead. To address this issue, we propose the firstfirst chiplet-based FHE accelerator design `REED\u27, which enables scalability and offers high throughput, thereby enhancing homomorphic encryption deployment in real-world scenarios. It incorporates well-known wafer yield issues during fabrication which significantly impacts production costs. In contrast to state-of-the-art approaches, we also address data exchange overhead by proposing a non-blocking inter-chiplet communication strategy. We incorporate novel pipelined Number Theoretic Transform and automorphism techniques, leveraging parallelism and providing high throughput. Experimental results demonstrate that REED 2.5D integrated circuit consumes 177 mm2^2 chip area, 82.5 W average power in 7nm technology, and achieves an impressive speedup of up to 5,982×\times compared to a CPU (24-core 2×\timesIntel X5690), and 2×\times better energy efficiency and 50\% lower development cost than state-of-the-art ASIC accelerator. To evaluate its practical impact, we are the firstfirst to benchmark an encrypted deep neural network training. Overall, this work successfully enhances the practicality and deployability of fully homomorphic encryption in real-world scenarios

    SHIELD: Scalable Homomorphic Implementation of Encrypted Data-Classifiers

    Get PDF
    Homomorphic encryption (HE) systems enable computations on encrypted data, without decrypting and without knowledge of the secret key. In this work, we describe an optimized Ring Learning With Errors (RLWE) based implementation of a variant of the HE system recently proposed by Gentry, Sahai and Waters (GSW). Although this system was widely believed to be less efficient than its contemporaries, we demonstrate quite the opposite behavior for a large class of applications. We first highlight and carefully exploit the algebraic features of the system to achieve significant speedup over the state-of-the-art HE implementation, namely the IBM homomorphic encryption library (HElib). We introduce several optimizations on top of our HE implementation, and use the resulting scheme to construct a homomorphic Bayesian spam filter, secure multiple keyword search, and a homomorphic evaluator for binary decision trees. Our results show a factor of 10× improvement in performance (under the same security settings and CPU platforms) compared to IBM HElib for these applications. Our system is built to be easily portable to GPUs (unlike IBM HElib) which results in an additional speedup of up to a factor of 103.5× to offer an overall speedup of 1,035×

    Toward Practical Privacy-Preserving Convolutional Neural Networks Exploiting Fully Homomorphic Encryption

    Full text link
    Incorporating fully homomorphic encryption (FHE) into the inference process of a convolutional neural network (CNN) draws enormous attention as a viable approach for achieving private inference (PI). FHE allows delegating the entire computation process to the server while ensuring the confidentiality of sensitive client-side data. However, practical FHE implementation of a CNN faces significant hurdles, primarily due to FHE's substantial computational and memory overhead. To address these challenges, we propose a set of optimizations, which includes GPU/ASIC acceleration, an efficient activation function, and an optimized packing scheme. We evaluate our method using the ResNet models on the CIFAR-10 and ImageNet datasets, achieving several orders of magnitude improvement compared to prior work and reducing the latency of the encrypted CNN inference to 1.4 seconds on an NVIDIA A100 GPU. We also show that the latency drops to a mere 0.03 seconds with a custom hardware design.Comment: 3 pages, 1 figure, appears at DISCC 2023 (2nd Workshop on Data Integrity and Secure Cloud Computing, in conjunction with the 56th International Symposium on Microarchitecture (MICRO 2023)

    A Programmable SoC-Based Accelerator for Privacy-Enhancing Technologies and Functional Encryption

    Get PDF
    A multitude of privacy-enhancing technologies (PETs) has been presented recently to solve the privacy problems of contemporary services utilizing cloud computing. Many of them are based on additively homomorphic encryption (AHE) that allows the computation of additions on encrypted data. The main technical obstacles for adaptation of PETs in practical systems are related to performance overheads compared with current privacy-violating alternatives. In this article, we present a hardware/software (HW/SW) codesign for programmable systems-on-chip (SoCs) that is designed for accelerating applications based on the Paillier encryption. Our implementation is a microcode-based multicore architecture that is suitable for accelerating various PETs using AHE with large integer modular arithmetic. We instantiate the implementation in a Xilinx Zynq-7000 programmable SoC and provide performance evaluations in real hardware. We also investigate its efficiency in a high-end Xilinx UltraScale+ programmable SoC. We evaluate the implementation with two target use cases that have relevance in PETs: privacy-preserving computation of squared Euclidean distances over encrypted data and multi-input functional encryption (FE) for inner products. Both of them represent the first hardware acceleration results for such operations, and in particular, the latter one is among the very first published implementation results of FE on any platform.Peer reviewe

    CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure

    Full text link
    Fully homomorphic encryption (FHE) is in the spotlight as a definitive solution for privacy, but the high computational overhead of FHE poses a challenge to its practical adoption. Although prior studies have attempted to design ASIC accelerators to mitigate the overhead, their designs require excessive amounts of chip resources (e.g., areas) to contain and process massive data for FHE operations. We propose CiFHER, a chiplet-based FHE accelerator with a resizable structure, to tackle the challenge with a cost-effective multi-chip module (MCM) design. First, we devise a flexible architecture of a chiplet core whose configuration can be adjusted to conform to the global organization of chiplets and design constraints. The distinctive feature of our core is a recomposable functional unit providing varying computational throughput for number-theoretic transform (NTT), the most dominant function in FHE. Then, we establish generalized data mapping methodologies to minimize the network overhead when organizing the chips into the MCM package in a tiled manner, which becomes a significant bottleneck due to the technology constraints of MCMs. Also, we analyze the effectiveness of various algorithms, including a novel limb duplication algorithm, on the MCM architecture. A detailed evaluation shows that a CiFHER package composed of 4 to 64 compact chiplets provides performance comparable to state-of-the-art monolithic ASIC FHE accelerators with significantly lower package-wide power consumption while reducing the area of a single core to as small as 4.28mm2^2.Comment: 15 pages, 9 figure

    Edge Computing for AI and ML: Enhancing Performance and Privacy in Data Analysis

    Get PDF
    Centralised cloud computing paradigms are encountering difficulties with latency, bandwidth, privacy, and security due to the exponential growth of data volumes produced by sensors and Internet of Things (IoT) devices. One potential approach to these constraints is edge computing, which moves computers and storage closer to the data sources. With this paradigm change, data privacy is improved, network congestion is decreased, and real-time processing is made possible. Aiming to improve the efficiency and confidentiality of data analysis applications powered by artificial intelligence (AI) and machine learning (ML), this article investigated the possibility of edge computing. We provide a thorough analysis of the latest developments in edge computing frameworks, algorithms, and architectures that allow for safe and fast training and inference of AI/ML models at the edge. We also go over the main obstacles and where the field may go from here in terms of research. Our research lays the groundwork for future intelligent edge systems by demonstrating the substantial advantages of edge computing in facilitating low-latency, energy-efficient, and privacy-preserving AI/ML applications. &nbsp

    Harnessing the Power of Distributed Computing: Advancements in Scientific Applications, Homomorphic Encryption, and Federated Learning Security

    Get PDF
    Data explosion poses lot of challenges to the state-of-the art systems, applications, and methodologies. It has been reported that 181 zettabytes of data are expected to be generated in 2025 which is over 150\% increase compared to the data that is expected to be generated in 2023. However, while system manufacturers are consistently developing devices with larger storage spaces and providing alternative storage capacities in the cloud at affordable rates, another key challenge experienced is how to effectively process the fraction of large scale of stored data in time-critical conventional systems. One transformative paradigm revolutionizing the processing and management of these large data is distributed computing whose application requires deep understanding. This dissertation focuses on exploring the potential impact of applying efficient distributed computing concepts to long existing challenges or issues in (i) a widely data-intensive scientific application (ii) applying homomorphic encryption to data intensive workloads found in outsourced databases and (iii) security of tokenized incentive mechanism for Federated learning (FL) systems.The first part of the dissertation tackles the Microelectrode arrays (MEAs) parameterization problem from an orthogonal viewpoint enlightened by algebraic topology, which allows us to algebraically parametrize MEAs whose structure and intrinsic parallelism are hard to identify otherwise. We implement a new paradigm, namely Parma, to demonstrate the effectiveness of the proposed approach and report how it outperforms the state-of-the-practice in time, scalability, and memory usage.The second part discusses our work on introducing the concept of parallel caching of secure aggregation to mitigate the performance overhead incurred by the HE module in outsourced databases. The key idea of this optimization approach is caching selected radix-ciphertexts in parallel without violating existing security guarantees of the primitive/base HE scheme. A new radix HE algorithm was designed and applied to both batch and incremental HE schemes, and experiments carried out on six workloads show that the proposed caching boost state-of-the-art HE schemes by high orders of magnitudes.In the third part, I will discuss our work on leveraging the security benefit of blockchains to enhance or protect the fairness and reliability of tokenized incentive mechanism for FL systems. We designed a blockchain-based auditing protocol to mitigate Gaussian attacks and carried out experiments with multiple FL aggregation algorithms, popular data sets and a variety of scales to validate its effectiveness

    GPU-based Private Information Retrieval for On-Device Machine Learning Inference

    Full text link
    On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing them to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables each on the order of 1-10 GBs of data, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20×20 \times over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over 5×5 \times additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000100,000 queries per second -- a >100×>100 \times throughput improvement over a CPU-based baseline -- while maintaining model accuracy
    • …
    corecore