Run-Time-Reconfigurable Multi-Precision Floating-Point Matrix Multiplier Intellectual Property Core on FPGA
In today's world, high-performance computing applications such as image processing,
digital signal processing, graphics, and robotics require enormous computing
power. These applications rely on matrix operations, especially matrix
multiplication, which is computationally expensive and complex to design in
hardware. For such applications, field-programmable gate arrays can serve as
low-cost hardware accelerators alongside a low-cost general-purpose processor,
in place of a high-cost application-specific processor. In this work, we employ
the efficient Strassen's algorithm for matrix multiplication and a highly
efficient run-time-reconfigurable floating-point multiplier for multiplying
matrix elements. The run-time-reconfigurable floating-point multiplier is
implemented with a custom floating-point format for variable-precision
applications. A very efficient combination of the Karatsuba algorithm and the
Urdhva Tiryagbhyam algorithm is used to implement the binary multiplier. By
reconfiguring itself at run time, the design can adjust its power and delay to
match different accuracy requirements.
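Strassen's algorithm, used above for the matrix multiplication, trades one block multiplication for extra additions: each 2x2 block product needs only seven recursive multiplications instead of eight. A minimal pure-Python sketch of the idea (illustrative only, not the paper's FPGA IP core; function names are our own):

```python
def add(X, Y):
    # Elementwise matrix addition
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    # Elementwise matrix subtraction
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def strassen(A, B):
    """Strassen multiply for square matrices whose size is a power of two."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quads(M):
        # Split M into four h x h blocks: (top-left, top-right, bottom-left, bottom-right)
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    A11, A12, A21, A22 = quads(A)
    B11, B12, B21, B22 = quads(B)
    # Seven recursive products replace the naive eight
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine the seven products into the four output blocks
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

In hardware, each of the seven element multiplications at the recursion's leaves would be handled by the reconfigurable floating-point multiplier the abstract describes.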
Criteria and Approaches for Virtualization on Modern FPGAs
Modern field-programmable gate arrays (FPGAs) deliver high performance across
a wide range of applications, and their computational capacity is becoming
abundant in personal computers. Despite this, FPGA virtualization is still an
emerging research field. Its challenges today stem not only from technical
difficulties but also from the ambiguous standards of virtualization. In this
paper, we introduce novel criteria for FPGA virtualization and discuss several
approaches to meeting those criteria. In addition, we present and describe in
detail the specific FPGA virtualization architecture that we developed on an
Intel Arria 10 FPGA. We evaluate our solution with a combination of
applications and microbenchmarks. The results show that our virtualization
solution provides a full abstraction of the FPGA device from both the user and
developer perspectives while maintaining reasonable performance compared to a
native FPGA.
Polystore++: Accelerated Polystore System for Heterogeneous Workloads
Modern real-time business analytics consist of heterogeneous workloads (e.g.,
database queries, graph processing, and machine learning). These analytics
applications need programming environments that can capture all aspects of the
constituent workloads (including the data models they work on and the movement
of data across processing engines). Polystore systems suit such applications;
however, these systems currently execute on CPUs, and the slowdown of Moore's
Law means they cannot meet the performance and efficiency requirements of
modern workloads. We envision Polystore++, an architecture that accelerates
existing polystore systems using hardware accelerators (e.g., FPGAs, CGRAs, and
GPUs). Polystore++ systems can achieve high performance at low power by
identifying and offloading components of a polystore system that are amenable
to acceleration using specialized hardware. Building a Polystore++ system is
challenging and introduces new research problems motivated by the use of
hardware accelerators (e.g., optimizing and mapping query plans across
heterogeneous computing units, and exploiting hardware pipelining and
parallelism to improve performance). In this paper, we discuss these challenges
in detail and list possible approaches to addressing them.
Comment: 11 pages, Accepted in ICDCS 201
A configurable accelerator for manycores: the Explicitly Many-Processor Approach
A new approach to designing processor accelerators is presented: a new
computing model and a special kind of accelerator with a dynamic
(end-user-programmable) architecture. The model considers a processor in which
a newly introduced supervisor layer coordinates the work of the cores. Based on
parallelization information provided by the compiler, and with the help of the
supervisor, a core can outsource part of the job it receives to a neighbouring
core. These changes modify the architecture and operation of computing systems
in essential and advantageous ways: computing throughput increases
drastically, the efficiency of the technological implementation (computing
performance per logic gate) improves, non-payload activity for using
operating-system services decreases, real-time behavior changes
advantageously, and connecting accelerators to the processor becomes much
simpler. Here only some details of the architecture and operation of the
processor are discussed; the rest is described elsewhere.
Comment: 12 pages, 6 figures
Renewing computing paradigms for more efficient parallelization of single-threads
Computing is still based on the 70-year-old paradigms introduced by von
Neumann. The need for more performant, comfortable, and safe computing has
forced the development and use of numerous tricks in both hardware and
software. Until now, technology has enabled performance increases without
changing the basic computing paradigms. The recent stalling of single-threaded
performance, however, requires redesigning computing to deliver the expected
performance, which means the computing paradigms themselves must be
scrutinized. The limitations caused by an overly restrictive interpretation of
the computing paradigms are demonstrated, an extended computing paradigm is
introduced, ideas for changing elements of the computing stack are suggested,
and some implementation details of both hardware and software are discussed.
The resulting new computing stack offers considerably higher computing
throughput, a simplified hardware architecture, drastically improved real-time
behavior, and, in general, a simplified and more efficient computing stack.
Comment: 28 pages; 7 figures
FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies and the availability of
credible data, an area of artificial intelligence, deep learning, has emerged
and has demonstrated its ability and effectiveness in solving complex learning
problems not possible before. In particular, convolutional neural networks
(CNNs) have demonstrated their effectiveness in image detection and recognition
applications. However, they demand intensive computation and memory bandwidth
that general-purpose CPUs cannot supply at the desired performance levels.
Consequently, hardware accelerators that use application-specific integrated
circuits (ASICs), field-programmable gate arrays (FPGAs), and graphics
processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have recently been adopted for accelerating the
implementation of deep learning networks due to their ability to maximize
parallelism as well as their energy efficiency. In this paper, we review recent
techniques for accelerating deep learning networks on FPGAs. We highlight the
key features employed by the various techniques for improving acceleration
performance. In addition, we provide recommendations for enhancing the
utilization of FPGAs for CNN acceleration. The techniques investigated in this
paper represent the recent trends in FPGA-based accelerators of deep learning
networks. Thus, this review is expected to direct future advances in efficient
hardware accelerators and to be useful for deep learning researchers.
Comment: This article has been accepted for publication in IEEE Access
(December 2018)
Optical Hardware Accelerators using Nonlinear Dispersion Modes for Energy Efficient Computing
This paper proposes a new class of hardware accelerators to alleviate
bottlenecks in the acquisition, analytics, storage, and computation of
information carried by wideband streaming signals.
Comment: 12 figures
Smart technologies for effective reconfiguration: the FASTER approach
Current and future computing systems increasingly require that their functionality stay flexible after the system is operational, in order to cope with changing user requirements and improvements in system features, i.e., changing protocols and data-coding standards, evolving demands for support of different user applications, and newly emerging applications in communication, computing, and consumer electronics. Extending the functionality and the lifetime of products therefore requires adding new functionality to track and satisfy customers' needs and market and technology trends. Many contemporary products incorporate, alongside the software, hardware accelerators for reasons of performance and power efficiency. While the adaptivity of software is straightforward, adapting the hardware to changing requirements is a challenging problem that requires delicate solutions. The FASTER (Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration) project aims to introduce a complete methodology that allows designers to easily implement a system specification on a platform comprising a general-purpose processor combined with multiple accelerators running on an FPGA, taking a high-level description as input and fully exploiting, both at design time and at run time, the capabilities of partial dynamic reconfiguration. The goal is that, for selected application domains, the FASTER toolchain will reduce the design and verification time of complex reconfigurable systems while providing additional novel verification features not available in existing tool flows.
GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks
Generative adversarial networks (GANs) are one of the most recent deep
learning models that generate synthetic data from limited genuine datasets.
GANs are on the frontier as further extension of deep learning into many
domains (e.g., medicine, robotics, content synthesis) requires massive sets of
labeled data that are generally either unavailable or prohibitively costly to
collect. Although GANs are gaining prominence in various fields, there are no
accelerators for these new models. In fact, GANs leverage a new operator,
called transposed convolution, that exposes unique challenges for hardware
acceleration. This operator first inserts zeros within the multidimensional
input, then convolves a kernel over this expanded array to add information to
the embedded zeros. Even though there is a convolution stage in this operator,
the inserted zeros lead to underutilization of the compute resources when a
conventional convolution accelerator is employed. We propose the GANAX
architecture to alleviate the sources of inefficiency associated with the
acceleration of GANs on conventional convolution accelerators, making the
first GAN accelerator design possible. We propose a reorganization of the
output computations that allocates compute rows with similar patterns of zeros
to adjacent processing engines, which also avoids inconsequential multiply-adds
on the zeros. This compulsory adjacency reclaims data reuse across these
neighboring processing engines, which would otherwise be diminished by the
inserted zeros. The reordering breaks the full SIMD execution model that is
prominent in convolution accelerators. Therefore, we propose a unified
MIMD-SIMD design for GANAX that leverages repeated patterns in the computation
to create distinct microprograms that execute concurrently in SIMD mode.
Comment: Proceedings of the 45th International Symposium on Computer
Architecture (ISCA), 201
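The zero-insertion step of transposed convolution described above can be made concrete in one dimension; a minimal pure-Python sketch (our own illustration, not the GANAX dataflow or any framework API):

```python
def transposed_conv1d(x, kernel, stride=2):
    """Transposed convolution expressed as zero insertion followed by an
    ordinary (valid) convolution over the expanded array."""
    # Step 1: insert (stride - 1) zeros between adjacent input elements
    expanded = []
    for i, v in enumerate(x):
        expanded.append(v)
        if i != len(x) - 1:
            expanded.extend([0] * (stride - 1))
    # Step 2: slide the kernel over the expanded array
    k = len(kernel)
    out = []
    for i in range(len(expanded) - k + 1):
        out.append(sum(expanded[i + j] * kernel[j] for j in range(k)))
    return expanded, out
```

For `x = [1, 2, 3]` with stride 2, the expanded array is `[1, 0, 2, 0, 3]`: roughly half of the multiply-adds in the convolution stage land on inserted zeros, which is exactly the compute underutilization the abstract says a conventional convolution accelerator suffers from.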
NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors
© 2016 Cheung, Schultz and Luk. NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high-performance computing systems using customizable hardware processors such as field-programmable gate arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation and deliver optimized performance, for example in the degree of parallelism to employ. The compilation process supports the use of PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current- or conductance-based neuronal models, such as the integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve real-time performance for 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times over an 8-core processor, or 2.83 times over GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
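To give a flavor of the neuronal models mentioned above, here is a minimal leaky integrate-and-fire update in plain Python. The parameters and function name are illustrative placeholders, not NeuroFlow's configuration or PyNN's API:

```python
def simulate_lif(input_current, dt=1.0, tau=20.0, v_rest=-65.0,
                 v_reset=-65.0, v_thresh=-50.0):
    """Leaky integrate-and-fire: dv/dt = (v_rest - v + I) / tau.
    Emits a spike and resets the membrane whenever v crosses v_thresh.
    Returns the list of time steps at which spikes occurred."""
    v = v_rest
    spikes = []
    for t, current in enumerate(input_current):
        # Forward-Euler step of the membrane equation
        v += dt * (v_rest - v + current) / tau
        if v >= v_thresh:
            spikes.append(t)
            v = v_reset
    return spikes
```

With a constant input of 20.0 the membrane settles toward -45 mV and spikes repeatedly; with 10.0 it settles at -55 mV, below threshold, and stays silent. A hardware platform like NeuroFlow evaluates many such per-neuron updates in parallel each time step.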