4,040 research outputs found
Combined Integer and Floating Point Multiplication Architecture(CIFM) for FPGAs and Its Reversible Logic Implementation
In this paper, the authors propose the idea of a combined integer and
floating point multiplier (CIFM) for FPGAs. The authors propose the replacement
of existing 18x18 dedicated multipliers in FPGAs with dedicated 24x24
multipliers designed with small 4x4 bit multipliers. It is also proposed that
for every dedicated 24x24 bit multiplier block designed with 4x4 bit
multipliers, four redundant 4x4 multipliers should be provided to make the
block self-repairable (able to recover from faults). The proposed CIFM is also
reconfigurable at run time, which lowers power consumption. The
major source of motivation for providing the dedicated 24x24 bit multiplier
stems from the fact that single precision floating point multiplier requires
24x24 bit integer multiplier for mantissa multiplication. A reconfigurable,
self-repairable 24x24 bit multiplier (implemented with 4x4 bit multiply
modules) will ideally suit this purpose, making FPGAs more suitable for integer
as well as floating point operations. A dedicated 4x4 bit multiplier is also
proposed in this paper. Moreover, in recent years, reversible logic has
emerged as a promising technology having its applications in low power CMOS,
quantum computing, nanotechnology, and optical computing. It is not possible to
realize quantum computing without reversible logic. Thus, this paper also
provides the reversible logic implementation of the proposed CIFM. The
reversible CIFM designed and proposed here will form the basis of the
completely reversible FPGAs.
Comment: Published in the proceedings of the 49th IEEE International
Midwest Symposium on Circuits and Systems (MWSCAS 2006), Puerto Rico, August
2006. Nominated for the Student Paper Award (12 papers were nominated for the
Student Paper Award among all submissions).
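The decomposition described in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the function names (`mul4x4`, `mul24x24`, `mantissa24`) are hypothetical, the partial products are summed with ordinary integer addition rather than a specific adder tree, and the redundant blocks and run-time reconfiguration are not modelled. It shows that a 24x24 bit product splits into 6 x 6 = 36 products of 4-bit digits, and that the significand of a normalized single-precision number (23 stored bits plus the implicit leading 1) fits in 24 bits.

```python
import struct


def mul4x4(a, b):
    """Model of one 4x4 bit multiplier block: two 4-bit inputs, 8-bit product."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def mul24x24(x, y):
    """Multiply two 24-bit unsigned integers using only 4x4 bit products.

    Returns (product, number_of_4x4_blocks_used).
    """
    assert 0 <= x < (1 << 24) and 0 <= y < (1 << 24)
    # Split each operand into six 4-bit digits, least significant first.
    xs = [(x >> (4 * i)) & 0xF for i in range(6)]
    ys = [(y >> (4 * j)) & 0xF for j in range(6)]
    product = 0
    blocks_used = 0
    for i, xd in enumerate(xs):
        for j, yd in enumerate(ys):
            # Each partial product is weighted by 16**(i + j).
            product += mul4x4(xd, yd) << (4 * (i + j))
            blocks_used += 1
    return product, blocks_used


def mantissa24(value):
    """24-bit significand of a normalized float32 (implicit leading 1 restored).

    Zero, subnormals, infinities and NaNs are not handled in this sketch.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return (bits & 0x7FFFFF) | (1 << 23)
```

For example, `mul24x24(mantissa24(1.5), mantissa24(1.5))` yields the 48-bit significand product that a floating point multiplier would then normalize and round; the exponent and sign handling are separate steps not shown here.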
Floating-Point Matrix Product on FPGA
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
Copyright IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
A Reconfigurable Vector Instruction Processor for Accelerating a Convection Parametrization Model on FPGAs
High Performance Computing (HPC) platforms allow scientists to model
computationally intensive algorithms. HPC clusters increasingly use
General-Purpose Graphics Processing Units (GPGPUs) as accelerators; FPGAs
provide an attractive alternative to GPGPUs for use as co-processors, but they
are still far from being mainstream due to a number of challenges faced when
using FPGA-based platforms. Our research aims to make FPGA-based high
performance computing more accessible to the scientific community. In this work
we present the results of investigating the acceleration of a particular
atmospheric model, Flexpart, on FPGAs. We focus on accelerating the most
computationally intensive kernel from this model. The key contribution of our
work is the architectural exploration we undertook to arrive at a solution that
best exploits the parallelism available in the legacy code, and is also
convenient to program, so that eventually the compilation of high-level legacy
code to our architecture can be fully automated. We present three different
types of architecture, comparing their resource utilization and performance,
and propose that an architecture where there are a number of computational
cores, each built along the lines of a vector instruction processor, works best
in this particular scenario, and is a promising candidate for a generic
FPGA-based platform for scientific computation. We also present the results of
experiments done with various configuration parameters of the proposed
architecture, to show its utility in adapting to a range of scientific
applications.
Comment: This is an extended pre-print version of work that was presented at
the international symposium on Highly Efficient Accelerators and
Reconfigurable Technologies (HEART2014), Sendai, Japan, June 9-11, 2014.
A Many-Core Overlay for High-Performance Embedded Computing on FPGAs
In this work, we propose a configurable many-core overlay for
high-performance embedded computing. The size of internal memory, supported
operations and number of ports can be configured independently for each core of
the overlay. The overlay was evaluated with matrix multiplication, LU
decomposition and Fast Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform.
The results show that using a system-level many-core overlay avoids complex
hardware design and still provides good performance results.
Comment: Presented at First International Workshop on FPGAs for Software
Programmers (FSP 2014) (arXiv:1408.4423).
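The per-core configurability mentioned in this abstract (internal memory size, supported operations, number of ports) could be captured in a description along the following lines. This is a hypothetical sketch: the class name `CoreConfig`, its field names, the operation names, and the example values are all assumptions made for illustration and are not taken from the overlay itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    """Hypothetical description of one core of a many-core overlay."""
    memory_words: int      # size of the core's internal memory (assumed unit: words)
    operations: frozenset  # names of the operations this core supports
    ports: int             # number of communication ports


def total_memory(cores):
    """Sum the internal memory over all configured cores."""
    return sum(core.memory_words for core in cores)


def cores_supporting(cores, operation):
    """Indices of the cores that can execute the given operation."""
    return [i for i, core in enumerate(cores) if operation in core.operations]


# Two cores configured independently (values are made up).
overlay = [
    CoreConfig(memory_words=1024, operations=frozenset({"add", "mul"}), ports=2),
    CoreConfig(memory_words=4096, operations=frozenset({"add", "mul", "div"}), ports=4),
]
```

Keeping each core's parameters in its own record mirrors the idea that cores are sized independently, so a mapping tool could query which cores are able to host a given kernel.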
Maximizing CNN Accelerator Efficiency Through Resource Partitioning
Convolutional neural networks (CNNs) are revolutionizing machine learning,
but they present significant computational challenges. Recently, many
FPGA-based accelerators have been proposed to improve the performance and
efficiency of CNNs. Current approaches construct a single processor that
computes the CNN layers one at a time; the processor is optimized to maximize
the throughput at which the collection of layers is computed. However, this
approach leads to inefficient designs because the same processor structure is
used to compute CNN layers of radically varying dimensions.
We present a new CNN accelerator paradigm and an accompanying automated
design methodology that partitions the available FPGA resources into multiple
processors, each of which is tailored for a different subset of the CNN
convolutional layers. Using the same FPGA resources as a single large
processor, multiple smaller specialized processors increase computational
efficiency and lead to a higher overall throughput. Our design methodology
achieves 3.8x higher throughput than the state-of-the-art approach on
evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more
recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x.
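The inefficiency argument in this abstract can be made concrete with a toy utilization model. This is a sketch under assumptions, not the paper's methodology: it assumes a processor built as a `tn` x `tm` multiplier array that processes `tn` input channels and `tm` output channels per pass, and the layer and array dimensions in the usage note are made-up illustrative values, not those of AlexNet, SqueezeNet, or GoogLeNet.

```python
from math import ceil


def utilization(n_in, m_out, tn, tm):
    """Fraction of a tn x tm multiplier array doing useful work on one layer.

    n_in / m_out: input and output channel counts of the layer.
    tn / tm: channels the array processes in parallel (assumed structure).
    """
    # The array needs this many passes to cover all channel pairs;
    # passes over partially filled tiles leave multipliers idle.
    passes = ceil(n_in / tn) * ceil(m_out / tm)
    return (n_in * m_out) / (passes * tn * tm)
```

With these hypothetical numbers, a layer with 3 input and 48 output channels keeps only about 28% of an 8 x 32 array busy (`utilization(3, 48, 8, 32)`), while a smaller 3 x 16 array matched to that layer is fully used (`utilization(3, 48, 3, 16)`). A layer with 256 input and 192 output channels fills the 8 x 32 array completely, which is the kind of mismatch that motivates giving differently shaped layers differently shaped processors.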
- …