874 research outputs found

    Architecture and Circuit Design Optimization for Compute-In-Memory

    Get PDF
    The objective of the proposed research is to optimize computing-in-memory (CIM) design for accelerating Deep Neural Network (DNN) algorithms. As compute peripheries such as analog-to-digital converter (ADC) introduce significant overhead in CIM inference design, the research first focuses on the circuit optimization for inference acceleration and proposes a resistive random access memory (RRAM) based ADC-free in-memory compute scheme. We comprehensively explore the trade-offs involving different types of ADCs and investigate a new ADC design especially suited for the CIM, which performs the analog shift-add for multiple weight significance bits, improving the throughput and energy efficiency under similar area constraints. Furthermore, we prototype an ADC-free CIM inference chip design with a fully-analog data processing manner between sub-arrays, which can significantly improve the hardware performance over the conventional CIM designs and achieve near-software classification accuracy on ImageNet and CIFAR-10/-100 dataset. Secondly, the research focuses on hardware support for CIM on-chip training. To maximize hardware reuse of CIM weight stationary dataflow, we propose the CIM training architectures with the transpose weight mapping strategy. The cell design and periphery circuitry are modified to efficiently support bi-directional compute. A novel solution of signed number multiplication is also proposed to handle the negative input in backpropagation. Finally, we propose an SRAM-based CIM training architecture and comprehensively explore the system-level hardware performance for DNN on-chip training based on silicon measurement results.Ph.D

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation

    Towards compact bandwidth and efficient privacy-preserving computation

    Get PDF
    In traditional cryptographic applications, cryptographic mechanisms are employed to ensure the security and integrity of communication or storage. In these scenarios, the primary threat is usually an external adversary trying to intercept or tamper with the communication between two parties. On the other hand, in the context of privacy-preserving computation or secure computation, the cryptographic techniques are developed with a different goal in mind: to protect the privacy of the participants involved in a computation from each other. Specifically, privacy-preserving computation allows multiple parties to jointly compute a function without revealing their inputs and it has numerous applications in various fields, including finance, healthcare, and data analysis. It allows for collaboration and data sharing without compromising the privacy of sensitive data, which is becoming increasingly important in today's digital age. While privacy-preserving computation has gained significant attention in recent times due to its strong security and numerous potential applications, its efficiency remains its Achilles' heel. Privacy-preserving protocols require significantly higher computational overhead and bandwidth when compared to baseline (i.e., insecure) protocols. Therefore, finding ways to minimize the overhead, whether it be in terms of computation or communication, asymptotically or concretely, while maintaining security in a reasonable manner remains an exciting problem to work on. This thesis is centred around enhancing efficiency and reducing the costs of communication and computation for commonly used privacy-preserving primitives, including private set intersection, oblivious transfer, and stealth signatures. Our primary focus is on optimizing the performance of these primitives.Im Gegensatz zu traditionellen kryptografischen Aufgaben, bei denen Kryptografie verwendet wird, um die Sicherheit und Integrität von Kommunikation oder Speicherung zu gewährleisten und der Gegner typischerweise ein Außenstehender ist, der versucht, die Kommunikation zwischen Sender und Empfänger abzuhören, ist die Kryptografie, die in der datenschutzbewahrenden Berechnung (oder sicheren Berechnung) verwendet wird, darauf ausgelegt, die Privatsphäre der Teilnehmer voreinander zu schützen. Insbesondere ermöglicht die datenschutzbewahrende Berechnung es mehreren Parteien, gemeinsam eine Funktion zu berechnen, ohne ihre Eingaben zu offenbaren. Sie findet zahlreiche Anwendungen in verschiedenen Bereichen, einschließlich Finanzen, Gesundheitswesen und Datenanalyse. Sie ermöglicht eine Zusammenarbeit und Datenaustausch, ohne die Privatsphäre sensibler Daten zu kompromittieren, was in der heutigen digitalen Ära immer wichtiger wird. Obwohl datenschutzbewahrende Berechnung aufgrund ihrer starken Sicherheit und zahlreichen potenziellen Anwendungen in jüngster Zeit erhebliche Aufmerksamkeit erregt hat, bleibt ihre Effizienz ihre Achillesferse. Datenschutzbewahrende Protokolle erfordern deutlich höhere Rechenkosten und Kommunikationsbandbreite im Vergleich zu Baseline-Protokollen (d.h. unsicheren Protokollen). Daher bleibt es eine spannende Aufgabe, Möglichkeiten zu finden, um den Overhead zu minimieren (sei es in Bezug auf Rechen- oder Kommunikationsleistung, asymptotisch oder konkret), während die Sicherheit auf eine angemessene Weise gewährleistet bleibt. Diese Arbeit konzentriert sich auf die Verbesserung der Effizienz und Reduzierung der Kosten für Kommunikation und Berechnung für gängige datenschutzbewahrende Primitiven, einschließlich private Schnittmenge, vergesslicher Transfer und Stealth-Signaturen. Unser Hauptaugenmerk liegt auf der Optimierung der Leistung dieser Primitiven

    BalanceProofs: Maintainable Vector Commitments with Fast Aggregation

    Get PDF
    We present BalanceProofs, the first vector commitment that is maintainable (i.e., supporting sublinear updates) while also enjoying fast proof aggregation and verification. The basic version of BalanceProofs has O(nlogn)O(\sqrt{n}\log n) update time and O(n)O(\sqrt{n}) query time and its constant-size aggregated proofs can be produced and verified in milliseconds. In particular, BalanceProofs improves the aggregation time and aggregation verification time of the only known maintainable and aggregatable vector commitment scheme, Hyperproofs (USENIX SECURITY 2022), by up to 1000×\times and up to 100×\times respectively. Fast verification of aggregated proofs is particularly useful for applications such as stateless cryptocurrencies (and was a major bottleneck for Hyperproofs), where an aggregated proof of balances is produced once but must be verified multiple times and by a large number of nodes. As a limitation, the updating time in BalanceProofs compared to Hyperproofs is roughly 6×6\times slower, but always stays in the range from 10 to 18 milliseconds. We finally study useful tradeoffs in BalanceProofs between (aggregate) proof size, update time and (aggregate) proof computation and verification, by introducing a bucketing technique, and present an extensive evaluation as well as a comparison to Hyperproofs

    Tools for efficient Deep Learning

    Get PDF
    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges for designing DNNs that are efficient in time, in resources and in power consumption. We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) stabler by layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by at most 4\%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have maximally 1\% higher accuracy than prior work. This thesis also addresses the challenges lying in the gap between DNN descriptions and executables by Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist can speed up software execution in sequential and parallel by 2.53 and 9.47 times on Polybench/C. POLSCA achieves 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators of streaming architectures with advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvement of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, some of which have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.Open Acces

    Rotatable Zero Knowledge Sets: Post Compromise Secure Auditable Dictionaries with application to Key Transparency

    Get PDF
    Key Transparency (KT) systems allow end-to-end encrypted service providers (messaging, calls, etc.) to maintain an auditable directory of their users’ public keys, producing proofs that all participants have a consistent view of those keys, and allowing each user to check updates to their own keys. KT has lately received a lot of attention, in particular its privacy preserving variants, which also ensure that users and auditors do not learn anything beyond what is necessary to use the service and keep the service provider accountable. Abstractly, the problem of building such systems reduces to constructing so-called append-only Zero-Knowledge Sets (aZKS). Unfortunately, existing aZKS (and KT) solutions do not allow to adequately restore the privacy guarantees after a server compromise, a form of Post-Compromise Security (PCS), while maintaining the auditability properties. In this work we address this concern through the formalization of an extension of aZKS called Rotatable ZKS (RZKS). In addition to providing PCS, our notion of RZKS has several other attractive features, such as a stronger (extractable) soundness notion, and the ability for a communication party with out-of-date data to efficiently “catch up” to the current epoch while ensuring that the server did not erase any of the past data. Of independent interest, we also introduce a new primitive called a Rotatable Verifiable Random Function (VRF), and show how to build RZKS in a modular fashion from a rotatable VRF, ordered accumulator, and append-only vector commitment schemes

    Improving low latency applications for reconfigurable devices

    Get PDF
    This thesis seeks to improve low latency application performance via architectural improvements in reconfigurable devices. This is achieved by improving resource utilisation and access, and by exploiting the different environments within which reconfigurable devices are deployed. Our first contribution leverages devices deployed at the network level to enable the low latency processing of financial market data feeds. Financial exchanges transmit messages via two identical data feeds to reduce the chance of message loss. We present an approach to arbitrate these redundant feeds at the network level using a Field-Programmable Gate Array (FPGA). With support for any messaging protocol, we evaluate our design using the NASDAQ TotalView-ITCH, OPRA, and ARCA data feed protocols, and provide two simultaneous outputs: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods. Our second contribution is a new ring-based architecture for low latency, parallel access to FPGA memory. Traditional FPGA memory is formed by grouping block memories (BRAMs) together and accessing them as a single device. Our architecture accesses these BRAMs independently and in parallel. Targeting memory-based computing, which stores pre-computed function results in memory, we benefit low latency applications that rely on: highly-complex functions; iterative computation; or many parallel accesses to a shared resource. We assess square root, power, trigonometric, and hyperbolic functions within the FPGA, and provide a tool to convert Python functions to our new architecture. Our third contribution extends the ring-based architecture to support any FPGA processing element. We unify E heterogeneous processing elements within compute pools, with each element implementing the same function, and the pool serving D parallel function calls. Our implementation-agnostic approach supports processing elements with different latencies, implementations, and pipeline lengths, as well as non-deterministic latencies. Compute pools evenly balance access to processing elements across the entire application, and are evaluated by implementing eight different neural network activation functions within an FPGA.Open Acces

    TPU as Cryptographic Accelerator

    Full text link
    Polynomials defined on specific rings are heavily involved in various cryptographic schemes, and the corresponding operations are usually the computation bottleneck of the whole scheme. We propose to utilize TPU, an emerging hardware designed for AI applications, to speed up polynomial operations and convert TPU to a cryptographic accelerator. We also conduct preliminary evaluation and discuss the limitations of current work and future plan

    Zero-Knowledge Functional Elementary Databases

    Get PDF
    Zero-knowledge elementary databases (ZK-EDBs) enable a prover to commit a database D{D} of key-value (x,v)(x,v) pairs and later provide a convincing answer to the query ``send me the value D(x)D(x) associated with xx\u27\u27 without revealing any extra knowledge (including the size of D{D}). After its introduction, several works extended it to allow more expressive queries, but the expressiveness achieved so far is still limited: only a relatively simple queries--range queries over the keys and values-- can be handled by known constructions. In this paper we introduce a new notion called zero knowledge functional elementary databases (ZK-FEDBs), which allows the most general functional queries. Roughly speaking, for any Boolean circuit ff, ZK-FEDBs allows the ZK-EDB prover to provide convincing answers to the queries of the form ``send me all records (x,v){(x,v)} in D{{D}} satisfying f(x,v)=1f(x,v)=1,\u27\u27 without revealing any extra knowledge (including the size of D{D}). We present a construction of ZK-FEDBs in the random oracle model and generic group model, whose proof size is only linear in the length of record and the size of query circuit, and is independent of the size of input database DD. Our technical constribution is two-fold. Firstly, we introduce a new variant of zero-knowledge sets (ZKS) which supports combined operations on sets, and present a concrete construction that is based on groups with unknown order. Secondly, we develop a tranformation that tranforms the query of Boolean circuit into a query of combined operations on related sets, which may be of independent interest
    corecore