Optimized Surface Code Communication in Superconducting Quantum Computers

Author: Brown Kenneth R.
Chong Frederic T.
Franklin Diana
Gokhale Pranav
Holmes Adam
Javadi-Abhari Ali
Martonosi Margaret
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2017

Quantum computing (QC) is at the cusp of a revolution. Machines with 100 quantum bits (qubits) are anticipated to be operational by 2020 [googlemachine,gambetta2015building], and several-hundred-qubit machines are around the corner. Machines of this scale have the capacity to demonstrate quantum supremacy, the tipping point where QC is faster than the fastest classical alternative for a particular problem. Because error correction techniques will be central to QC and will be the most expensive component of quantum computation, choosing the lowest-overhead error correction scheme is critical to overall QC success. This paper evaluates two established quantum error correction codes (planar and double-defect surface codes) using a set of compilation, scheduling and network simulation tools. We consider scalable methods for optimizing both codes in the context of a full microarchitectural and compiler analysis. Contrary to previous predictions, we find that the simpler planar codes are sometimes more favorable for implementation on superconducting quantum computers, especially under conditions of high communication congestion.

Comment: 14 pages, 9 figures, The 50th Annual IEEE/ACM International Symposium on Microarchitecture

arXiv.org e-Print Archive

Princeton University Open Access Repository

Crossref

PARSNIP: Performant Architecture for Race Safety with No Impact on Precision

Author: Devietti Joseph
Peng Yuanfeng
Wood Benjamin P.
Publication venue: Wellesley College Digital Scholarship and Archive

Data race detection is a useful dynamic analysis for multithreaded programs that is a key building block in record-and-replay, enforcing strong consistency models, and detecting concurrency bugs. Existing software race detectors are precise but slow, and hardware support for precise data race detection relies on assumptions like type safety that many programs violate in practice. We propose PARSNIP, a fully precise hardware-supported data race detector. PARSNIP exploits new insights into the redundancy of race detection metadata to reduce storage overheads. PARSNIP also adopts new race detection metadata encodings that accelerate the common case while preserving soundness and completeness. When bounded hardware resources are exhausted, PARSNIP falls back to a software race detector to preserve correctness. PARSNIP does not assume that target programs are type safe, and is thus suitable for race detection on arbitrary code. Our evaluation of PARSNIP on several PARSEC benchmarks shows that performance overheads range from negligible to 2.6x, with an average overhead of just 1.5x. Moreover, PARSNIP outperforms the state-of-the-art Radish hardware race detector by 4.6x.

Wellesley College

Optimized Compilation of Aggregated Instructions for Realistic Quantum Computers

Author: Chong Fred T.
Gokhale Pranav
Hoffman Henry
Leung Nelson
Rossi Zane
Schuster David I.
Shi Yunong
Publication venue: Association for Computing Machinery (ACM)
Publication date: 17/02/2019

Recent developments in engineering and algorithms have made real-world applications in quantum computing possible in the near future. Existing quantum programming languages and compilers use a quantum assembly language composed of 1- and 2-qubit (quantum bit) gates. Quantum compiler frameworks translate this quantum assembly to electric signals (called control pulses) that implement the specified computation on specific physical devices. However, there is a mismatch between the operations defined by the 1- and 2-qubit logical ISA and their underlying physical implementation, so the current practice of directly translating logical instructions into control pulses results in inefficient, high-latency programs. To address this inefficiency, we propose a universal quantum compilation methodology that aggregates multiple logical operations into larger units that manipulate up to 10 qubits at a time. Our methodology then optimizes these aggregates by (1) finding commutative intermediate operations that result in more efficient schedules and (2) creating custom control pulses optimized for the aggregate (instead of individual 1- and 2-qubit operations). Compared to the standard gate-based compilation, the proposed approach realizes a deeper vertical integration of high-level quantum software and low-level, physical quantum hardware. We evaluate our approach on important near-term quantum applications on simulations of superconducting quantum architectures. Our proposed approach provides a mean speedup of $5\times$, with a maximum of $10\times$. Because latency directly affects the feasibility of quantum computation, our results not only improve performance but also have the potential to enable quantum computation sooner than otherwise possible.

Comment: 13 pages, to appear in ASPLOS

arXiv.org e-Print Archive

Crossref

A survey of near-data processing architectures for neural networks

Author: González Colás Antonio María
Hassanpour Mehdi
Riera Villanueva Marc
Publication venue: MDPI AG
Publication date: 17/01/2022

Data-intensive workloads and applications, such as machine learning (ML), are fundamentally limited by traditional computing systems based on the von Neumann architecture. As data movement operations and energy consumption become key bottlenecks in the design of computing systems, the interest in unconventional approaches such as Near-Data Processing (NDP), machine learning, and especially neural network (NN)-based accelerators has grown significantly. Emerging memory technologies, such as ReRAM and 3D-stacked memory, are promising for efficiently architecting NDP-based accelerators for NN because they can serve both as high-density, low-energy storage and as in-memory or near-memory computation and search engines. In this paper, we present a survey of techniques for designing NDP architectures for NN. By classifying the techniques based on the memory technology employed, we underscore their similarities and differences. Finally, we discuss open challenges and future perspectives that need to be explored in order to improve and extend the adoption of NDP architectures for future computing platforms. This paper will be valuable for computer architects, chip designers, and researchers in the area of machine learning.

This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.

Peer Reviewed. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC

Characterizing Sources of Ineffectual Computations in Deep Learning Networks

Author: Mahmoud M
Moshovos A
Mullins R
Nikolic M
Zhao Y
Publication venue: Proceedings - 2019 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2019
Publication date: 01/03/2019

Hardware accelerators for inference with neural networks can take advantage of the properties of the data they process. Performance gains and reduced memory bandwidth during inference have been demonstrated by using narrower data types [1] [2] and by exploiting the ability to skip and compress values that are zero [3]-[6]. Similarly useful properties have been identified at a lower level, such as varying precision requirements [7] and bit-level sparsity [8] [9]. To date, the analysis of these potential sources of superfluous computation and communication has been constrained to a small number of older Convolutional Neural Networks (CNNs) used for image classification. It is an open question whether they exist more broadly. This paper aims to determine whether these properties persist in: (1) more recent and thus more accurate and better performing image classification networks, (2) models for image applications other than classification, such as image segmentation and low-level computational imaging, (3) Long Short-Term Memory (LSTM) models for non-image applications such as natural language processing, and (4) quantized image classification models. We demonstrate that such properties persist and discuss the implications and opportunities for future accelerator designs.

Crossref

Apollo (Cambridge)

Improving virtual memory performance in virtualized environments

Author: Marathe Yashwant
Publication date: 25/02/2021

Virtual memory is a major system performance bottleneck in virtualized environments. In addition to expensive address translations, frequent virtual machine context switches are common in virtualized environments, resulting in increased TLB miss rates, subsequent expensive page walks, and data cache contention due to incoming page table entries evicting useful data. Orthogonally, translation coherence, which is currently an expensive operation implemented in software, can consume up to 50% of the runtime of an application executing on the guest. To improve the performance of virtual memory in virtualized environments, this thesis proposes two solutions. (1) Context Switch Aware Large TLB (CSALT) is an architecture that addresses the problem of increased TLB miss rates and their adverse impact on data caches. CSALT copes with the increased demand of context switches by storing a large number of TLB entries. It mitigates data cache contention by employing a novel TLB-aware cache partitioning scheme. On 8-core systems that switch between two virtual machine contexts executing multi-threaded workloads, CSALT achieves an average performance improvement of 85% over a baseline with conventional L1-L2 TLBs and 25% over a baseline that has a large L3 TLB. (2) Translation Coherence using Addressable TLBs (TCAT) is a hardware translation coherence scheme that eliminates almost all of the overheads associated with address translation coherence. TCAT overlays translation coherence atop cache coherence to accurately identify slave cores. It then leverages the addressable Part-Of-Memory TLB (POM-TLB) to eliminate expensive Inter Processor Interrupts (IPI) and achieve precise invalidations on the slave core. On 8-core systems with one virtual machine context executing multi-threaded workloads, TCAT achieves an average performance improvement of 13% over the kvmtlb baseline.

Electrical and Computer Engineering

Texas ScholarWorks