4,000 research outputs found
Optimized Surface Code Communication in Superconducting Quantum Computers
Quantum computing (QC) is at the cusp of a revolution. Machines with 100
quantum bits (qubits) are anticipated to be operational by 2020
[googlemachine,gambetta2015building], and several-hundred-qubit machines are
around the corner. Machines of this scale have the capacity to demonstrate
quantum supremacy, the tipping point where QC is faster than the fastest
classical alternative for a particular problem. Because error correction
techniques will be central to QC and will be the most expensive component of
quantum computation, choosing the lowest-overhead error correction scheme is
critical to overall QC success. This paper evaluates two established quantum
error correction codes---planar and double-defect surface codes---using a set
of compilation, scheduling and network simulation tools. In considering
scalable methods for optimizing both codes, we do so in the context of a full
microarchitectural and compiler analysis. Contrary to previous predictions, we
find that the simpler planar codes are sometimes more favorable for
implementation on superconducting quantum computers, especially under
conditions of high communication congestion.Comment: 14 pages, 9 figures, The 50th Annual IEEE/ACM International Symposium
on Microarchitectur
Tupleware: Redefining Modern Analytics
There is a fundamental discrepancy between the targeted and actual users of
current analytics frameworks. Most systems are designed for the data and
infrastructure of the Googles and Facebooks of the world---petabytes of data
distributed across large cloud deployments consisting of thousands of cheap
commodity machines. Yet, the vast majority of users operate clusters ranging
from a few to a few dozen nodes, analyze relatively small datasets of up to a
few terabytes, and perform primarily compute-intensive operations. Targeting
these users fundamentally changes the way we should build analytics systems.
This paper describes the design of Tupleware, a new system specifically aimed
at the challenges faced by the typical user. Tupleware's architecture brings
together ideas from the database, compiler, and programming languages
communities to create a powerful end-to-end solution for data analysis. We
propose novel techniques that consider the data, computations, and hardware
together to achieve maximum performance on a case-by-case basis. Our
experimental evaluation quantifies the impact of our novel techniques and shows
orders of magnitude performance improvement over alternative systems
An assessment of the connection machine
The CM-2 is an example of a connection machine. The strengths and problems of this implementation are considered as well as important issues in the architecture and programming environment of connection machines in general. These are contrasted to the same issues in Multiple Instruction/Multiple Data (MIMD) microprocessors and multicomputers
Towards an Achievable Performance for the Loop Nests
Numerous code optimization techniques, including loop nest optimizations,
have been developed over the last four decades. Loop optimization techniques
transform loop nests to improve the performance of the code on a target
architecture, including exposing parallelism. Finding and evaluating an
optimal, semantic-preserving sequence of transformations is a complex problem.
The sequence is guided using heuristics and/or analytical models and there is
no way of knowing how close it gets to optimal performance or if there is any
headroom for improvement. This paper makes two contributions. First, it uses a
comparative analysis of loop optimizations/transformations across multiple
compilers to determine how much headroom may exist for each compiler. And
second, it presents an approach to characterize the loop nests based on their
hardware performance counter values and a Machine Learning approach that
predicts which compiler will generate the fastest code for a loop nest. The
prediction is made for both auto-vectorized, serial compilation and for
auto-parallelization. The results show that the headroom for state-of-the-art
compilers ranges from 1.10x to 1.42x for the serial code and from 1.30x to
1.71x for the auto-parallelized code. These results are based on the Machine
Learning predictions.Comment: Accepted at the 31st International Workshop on Languages and
Compilers for Parallel Computing (LCPC 2018
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse due to multiple
proprietary vendor implementations with different characteristics, and, thus,
required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications when compiled using pocl were faster or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via
arxi
- …