Optimized Surface Code Communication in Superconducting Quantum Computers
Quantum computing (QC) is at the cusp of a revolution. Machines with 100
quantum bits (qubits) are anticipated to be operational by 2020
[googlemachine,gambetta2015building], and several-hundred-qubit machines are
around the corner. Machines of this scale have the capacity to demonstrate
quantum supremacy, the tipping point where QC is faster than the fastest
classical alternative for a particular problem. Because error correction
techniques will be central to QC and will be the most expensive component of
quantum computation, choosing the lowest-overhead error correction scheme is
critical to overall QC success. This paper evaluates two established quantum
error correction codes---planar and double-defect surface codes---using a set
of compilation, scheduling and network simulation tools. In considering
scalable methods for optimizing both codes, we do so in the context of a full
microarchitectural and compiler analysis. Contrary to previous predictions, we
find that the simpler planar codes are sometimes more favorable for
implementation on superconducting quantum computers, especially under
conditions of high communication congestion. Comment: 14 pages, 9 figures, The 50th Annual IEEE/ACM International Symposium
on Microarchitecture
Learning Independent Program and Architecture Representations for Generalizable Performance Modeling
This paper proposes PerfVec, a novel deep learning-based performance modeling
framework that learns high-dimensional, independent/orthogonal program and
microarchitecture representations. Once learned, a program representation can
be used to predict its performance on any microarchitecture, and likewise, a
microarchitecture representation can be applied in the performance prediction
of any program. Additionally, PerfVec yields a foundation model that captures
the performance essence of instructions, which can be directly used by
developers in numerous performance modeling related tasks without incurring its
training cost. The evaluation demonstrates that PerfVec is more general,
efficient, and accurate than previous approaches.
Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators
We show that DNN accelerator micro-architectures and their program mappings
represent specific choices of loop order and hardware parallelism for computing
the seven nested loops of DNNs, which enables us to create a formal taxonomy of
all existing dense DNN accelerators. Surprisingly, the loop transformations
needed to create these hardware variants can be precisely and concisely
represented by Halide's scheduling language. By modifying the Halide compiler
to generate hardware, we create a system that can fairly compare these prior
accelerators. As long as proper loop blocking schemes are used, and the
hardware can support mapping replicated loops, many different hardware
dataflows yield similar energy efficiency with good performance. This is
because the loop blocking can ensure that most data references stay on-chip
with good locality and the processing units have high resource utilization. How
resources are allocated, especially in the memory system, has a large impact on
energy and performance. By optimizing hardware resource allocation while
keeping throughput constant, we achieve up to 4.2X energy improvement for
Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long
Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively. Comment: Published as a conference paper at ASPLOS 202
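To make the "seven nested loops" framing concrete, the following is a minimal sketch of a convolutional layer written as seven loops, together with a variant that blocks (tiles) the output-channel and input-channel loops. The function names, the tensor layout, and the tile sizes are illustrative assumptions, not taken from the paper, and the sketch uses plain Python lists rather than Halide.

```python
def _dims(inp, wts):
    """Derive loop bounds from nested lists (assumed layout, see below)."""
    # inp: [N][C][Y+R-1][X+S-1], wts: [K][C][R][S]
    N, C = len(inp), len(inp[0])
    K, R, S = len(wts), len(wts[0][0]), len(wts[0][0][0])
    Y = len(inp[0][0]) - R + 1
    X = len(inp[0][0][0]) - S + 1
    return N, K, C, Y, X, R, S


def conv_reference(inp, wts):
    """Convolution as seven nested loops: n, k, c, y, x, r, s."""
    N, K, C, Y, X, R, S = _dims(inp, wts)
    out = [[[[0] * X for _ in range(Y)] for _ in range(K)] for _ in range(N)]
    for n in range(N):                      # batch
        for k in range(K):                  # output channels
            for c in range(C):              # input channels
                for y in range(Y):          # output rows
                    for x in range(X):      # output columns
                        for r in range(R):  # filter rows
                            for s in range(S):  # filter columns
                                out[n][k][y][x] += (
                                    inp[n][c][y + r][x + s] * wts[k][c][r][s]
                                )
    return out


def conv_blocked(inp, wts, k_tile=2, c_tile=2):
    """Same computation with the k and c loops blocked into tiles.

    The tile sizes are hypothetical; a real design would pick them so
    that each tile's working set fits in on-chip memory.
    """
    N, K, C, Y, X, R, S = _dims(inp, wts)
    out = [[[[0] * X for _ in range(Y)] for _ in range(K)] for _ in range(N)]
    for k0 in range(0, K, k_tile):          # outer loop over k tiles
        for c0 in range(0, C, c_tile):      # outer loop over c tiles
            for n in range(N):
                for k in range(k0, min(k0 + k_tile, K)):
                    for c in range(c0, min(c0 + c_tile, C)):
                        for y in range(Y):
                            for x in range(X):
                                for r in range(R):
                                    for s in range(S):
                                        out[n][k][y][x] += (
                                            inp[n][c][y + r][x + s]
                                            * wts[k][c][r][s]
                                        )
    return out
```

Both functions perform the same multiply-accumulate operations and differ only in iteration order, which is the sense in which different loop orders and blockings describe different dataflows for the same layer.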
Bridging the gap between mobile CPU design and user satisfaction via crowdsourcing
This report aims to provide an understanding of how mobile CPU designs have evolved and how they influence end-user satisfaction. To that end, a quantitative performance analysis is conducted across ten cutting-edge mobile CPU designs studied within top-selling off-the-shelf smartphones released over the past seven years. This analysis is then used to guide a large-scale user study spanning over 25,000 participants via crowdsourcing on the Amazon Mechanical Turk service. The user study asks participants to assess the responsiveness of interactive application use cases for a set of current-generation applications (e.g., Angry Birds and Facebook) and next-generation applications (i.e., face recognition and augmented reality) relative to the performance capabilities of the devices studied. This framework allows us to quantitatively link the mobile CPU designs studied to end-user satisfaction. The study results indicate that mobile CPU designs have exhibited significant performance improvements through aggressive core scaling techniques prevalent in desktop CPUs. Just as was observed in desktop CPU design, these same techniques have led to excessive mobile CPU power consumption. However, from an end-user perspective, this power consumption was not without success. Mobile CPUs have evolved to provide satisfactory experiences for the studied current-generation applications. The reason is that many of these applications rely heavily on single-threaded performance. Other, more recent applications actually multi-thread user-critical parts of the application, which also demonstrates that multi-core mobile CPUs are an important design consideration, contrary to conventional wisdom. However, looking ahead, the same mobile CPUs were not able to provide satisfactory experiences for many of the next-generation applications studied, calling into question the sustainability of these power-hungry design techniques in future mobile CPU designs. Electrical and Computer Engineering
Pipelining Saturated Accumulation
Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator that can operate at 280 MHz on a Xilinx Spartan-3 (XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM.
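One way to see why saturated addition can be made associative is to treat each input as a function on the accumulator, y -> min(max(y + shift, low), high), and to note that composing two such functions yields another function of the same form. The sketch below is a software illustration under that assumption; the (shift, low, high) encoding, the function names, and the Hillis-Steele style scan are choices made for this example and are not necessarily the paper's exact hardware formulation.

```python
from typing import List, Tuple

LO, HI = -(1 << 15), (1 << 15) - 1  # 16-bit signed saturation bounds

# A saturating step encoded as (shift, low, high):
#   y -> min(max(y + shift, low), high)
SatOp = Tuple[int, int, int]


def clamp(v: int, lo: int, hi: int) -> int:
    return min(max(v, lo), hi)


def sat_op(x: int) -> SatOp:
    """The step 'add x, then saturate to [LO, HI]'."""
    return (x, LO, HI)


def apply_op(op: SatOp, y: int) -> int:
    shift, lo, hi = op
    return clamp(y + shift, lo, hi)


def compose(f: SatOp, g: SatOp) -> SatOp:
    """Return the single step equivalent to applying f first, then g."""
    s1, l1, h1 = f
    s2, l2, h2 = g
    # Shift f's bounds by g's shift, then clamp them into g's bounds.
    return (s1 + s2, clamp(l1 + s2, l2, h2), clamp(h1 + s2, l2, h2))


def sequential_sat_accumulate(xs: List[int], start: int = 0) -> List[int]:
    """Reference: the usual loop with a cyclic dependency on acc."""
    out, acc = [], start
    for x in xs:
        acc = clamp(acc + x, LO, HI)
        out.append(acc)
    return out


def prefix_sat_accumulate(xs: List[int], start: int = 0) -> List[int]:
    """Inclusive scan over composed steps in log2(len(xs)) rounds.

    Within a round, every compose() is independent of the others, so a
    hardware implementation could evaluate them in parallel.
    """
    ops = [sat_op(x) for x in xs]
    d = 1
    while d < len(ops):
        ops = [
            ops[i] if i < d else compose(ops[i - d], ops[i])
            for i in range(len(ops))
        ]
        d *= 2
    return [apply_op(op, start) for op in ops]
```

Because compose is associative (though not commutative), the scan may group steps in any order as long as their left-to-right sequence is preserved, which is what removes the one-element-per-cycle dependency of the sequential loop.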