17,320 research outputs found
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs whose programming is conventionally a RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve the FPGA
programmability, it still leaves programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels
Comprehensive Evaluation of OpenCL-Based CNN Implementations for FPGAs
Deep learning has significantly advanced the state of the
art in artificial intelligence, gaining wide popularity from both industry
and academia. Special interest is around Convolutional Neural Networks
(CNN), which take inspiration from the hierarchical structure
of the visual cortex, to form deep layers of convolutional operations,
along with fully connected classifiers. Hardware implementations of these
deep CNN architectures are challenged with memory bottlenecks that
require many convolution and fully-connected layers demanding large
amount of communication for parallel computation. Multi-core CPU
based solutions have demonstrated their inadequacy for this problem
due to the memory wall and low parallelism. Many-core GPU architectures
show superior performance but they consume high power and also
have memory constraints due to inconsistencies between cache and main
memory. OpenCL is commonly used to describe these architectures for
their execution on GPGPUs or FPGAs. FPGA design solutions are also
actively being explored, which allow implementing the memory hierarchy
using embedded parallel BlockRAMs. This boosts the parallel use
of shared memory elements between multiple processing units, avoiding
data replicability and inconsistencies. This makes FPGAs potentially
powerful solutions for real-time classification of CNNs. In this
paper both Altera and Xilinx adopted OpenCL co-design frameworks
for pseudo-automatic development solutions are evaluated. A comprehensive
evaluation and comparison for a 5-layer deep CNN is presented.
Hardware resources, temporal performance and the OpenCL architecture
for CNNs are discussed. Xilinx demonstrates faster synthesis, better
FPGA resource utilization and more compact boards. Altera provides
multi-platforms tools, mature design community and better execution
times.Ministerio de Economía y Competitividad TEC2016-77785-
Optimized Surface Code Communication in Superconducting Quantum Computers
Quantum computing (QC) is at the cusp of a revolution. Machines with 100
quantum bits (qubits) are anticipated to be operational by 2020
[googlemachine,gambetta2015building], and several-hundred-qubit machines are
around the corner. Machines of this scale have the capacity to demonstrate
quantum supremacy, the tipping point where QC is faster than the fastest
classical alternative for a particular problem. Because error correction
techniques will be central to QC and will be the most expensive component of
quantum computation, choosing the lowest-overhead error correction scheme is
critical to overall QC success. This paper evaluates two established quantum
error correction codes---planar and double-defect surface codes---using a set
of compilation, scheduling and network simulation tools. In considering
scalable methods for optimizing both codes, we do so in the context of a full
microarchitectural and compiler analysis. Contrary to previous predictions, we
find that the simpler planar codes are sometimes more favorable for
implementation on superconducting quantum computers, especially under
conditions of high communication congestion.Comment: 14 pages, 9 figures, The 50th Annual IEEE/ACM International Symposium
on Microarchitectur
- …