2,155 research outputs found
Using reconfigurable computing technology to accelerate matrix decomposition and applications
Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems and computer vision. Matrix decompositions are computationally expensive, and their sequential implementations often fail to meet the requirements of many time-sensitive applications.
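As a concrete reference point for these roles, the following NumPy/SciPy sketch performs PCA via SVD (and the equivalent EVD of the covariance) and solves a linear system via both QRD and LUD; the shapes and data are illustrative assumptions, not drawn from the dissertation.

```python
# A minimal NumPy/SciPy sketch of the decompositions named above and the
# roles they play; array shapes and data are illustrative only.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))          # data matrix: 100 samples, 8 features

# PCA via SVD: principal components are the right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T                  # project onto the top-2 components

# The same principal subspace via EVD of the (scaled) covariance matrix.
w, V = np.linalg.eigh(Xc.T @ Xc)           # eigenvalues returned in ascending order

# Solving A x = b with QRD: A = QR, then solve the triangular system R x = Q^T b.
A = rng.standard_normal((8, 8))
b = rng.standard_normal(8)
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Solving the same system with LUD (dense here; sparse problems use scipy.sparse).
lu, piv = lu_factor(A)
x_lu = lu_solve((lu, piv), b)
assert np.allclose(x_qr, x_lu)
```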
The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
• We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
• We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing matrices of arbitrary size.
• We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each (see the sketch after this abstract).
• We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
• By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
• We present a configurable architecture to accelerate Homotopy l1-minimization, in which a modified version of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application- and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.
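As a rough software illustration of the two QRD kernels the hybrid architecture selects between, the sketch below applies one Householder reflection (which zeroes an entire subdiagonal column per step) and one Givens rotation (which zeroes a single entry and suits fine-grained pipelining). This is a NumPy sketch of the textbook operations, not the FPGA design itself.

```python
# Illustrative NumPy kernels for the two QRD building blocks the hybrid
# architecture chooses between; a software sketch, not the hardware design.
import numpy as np

def householder_step(A, k):
    """Reflect column k so all entries below the diagonal become zero."""
    x = A[k:, k].copy()
    v = x.copy()
    v[0] += np.copysign(np.linalg.norm(x), x[0])
    v /= np.linalg.norm(v)
    A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
    return A

def givens_step(A, i, j):
    """Rotate rows (j, i) to zero the single entry A[i, j]."""
    r = np.hypot(A[j, j], A[i, j])
    c, s = A[j, j] / r, A[i, j] / r
    G = np.array([[c, s], [-s, c]])
    A[[j, i], :] = G @ A[[j, i], :]
    return A

A = np.random.default_rng(1).standard_normal((4, 4))
R = A.copy()
R = householder_step(R, 0)   # one reflection clears a whole subdiagonal column
R = givens_step(R, 2, 1)     # one rotation clears exactly one entry
```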
Binarized Convolutional Neural Networks with Separable Filters for Efficient Hardware Acceleration
State-of-the-art convolutional neural networks are enormously costly in both compute and memory, demanding massively parallel GPUs for execution. Such networks strain the computational capabilities and energy available to embedded and mobile processing platforms, restricting their use in many important applications. In this paper, we push the boundaries of hardware-effective CNN design by proposing BCNN with Separable Filters (BCNNw/SF), which applies Singular Value Decomposition (SVD) on BCNN kernels to further reduce computational and storage complexity. To enable its implementation, we provide a closed form of the gradient over SVD to calculate the exact gradient with respect to every binarized weight in backward propagation. We verify BCNNw/SF on the MNIST, CIFAR-10, and SVHN datasets, and implement an accelerator for CIFAR-10 on FPGA hardware. Our BCNNw/SF accelerator realizes memory savings of 17% and execution time reduction of 31.3% compared to BCNN with only minor accuracy sacrifices.
Comment: 9 pages, 6 figures, accepted for Embedded Vision Workshop (CVPRW)
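To make the separable-filter idea concrete: the rank-1 SVD factors of a binarized k×k kernel turn one 2-D convolution into two 1-D convolutions. The sketch below is a minimal NumPy/SciPy illustration under assumed shapes; it omits the paper's re-binarization of the factors and the closed-form SVD gradient used in training.

```python
# Rank-1 (separable) approximation of a binarized 3x3 kernel via SVD:
# one 2-D convolution becomes a vertical then a horizontal 1-D convolution.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(2)
W = np.sign(rng.standard_normal((3, 3)))   # a binarized 3x3 kernel (+1/-1)

U, s, Vt = np.linalg.svd(W)
col = U[:, :1] * np.sqrt(s[0])             # 3x1 vertical filter
row = Vt[:1, :] * np.sqrt(s[0])            # 1x3 horizontal filter

x = rng.standard_normal((8, 8))
full = convolve2d(x, W, mode='valid')                      # one 2-D conv
sep = convolve2d(convolve2d(x, col, mode='valid'),
                 row, mode='valid')                        # two 1-D convs
err = np.linalg.norm(full - sep) / np.linalg.norm(full)   # rank-1 approximation error
```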
A general framework for efficient FPGA implementation of matrix product
High performance systems are required by developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Field-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks in the construction of high performance systems at an economical price. Given the importance and the use of matrix algorithms in scientific computing applications, they seem ideal candidates to harness and exploit the advantages offered by FPGAs. In this paper, a system for matrix algorithm core generation is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, spanning three different matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general purpose or a specific application core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles, while the second one is a parallel floating-point matrix multiplier designed for 3D affine transformations.
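As a toy illustration of the first core's workload, colour space conversion is a constant matrix product applied per pixel. The sketch below uses the standard ITU-R BT.601 RGB-to-YCbCr coefficients, which are an assumption here rather than values taken from the paper.

```python
# Colour space conversion as a per-pixel constant matrix multiply,
# using the standard BT.601 RGB -> YCbCr coefficients (assumed, not
# necessarily the exact constants used by the generated core).
import numpy as np

M = np.array([[ 0.299,   0.587,   0.114 ],
              [-0.1687, -0.3313,  0.5   ],
              [ 0.5,    -0.4187, -0.0813]])
offset = np.array([0.0, 128.0, 128.0])

rgb = np.array([200.0, 120.0, 64.0])       # one 8-bit pixel as floats
ycbcr = M @ rgb + offset                   # the core pipelines this product
```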
Novel sparse OBC based distributed arithmetic architecture for matrix transforms
Inner product (IP) forms the basis of a number of signal processing algorithms and applications such as transforms, filters and communication systems. Distributed arithmetic (DA) provides an effective methodology to implement the IP of vectors and matrices using a simple combination of memory elements, adders and shifters instead of lumped multipliers. This bit-level rearrangement results in much higher computational efficiency and yields compact designs highly suited to high performance, resource-constrained applications. Offset binary coding (OBC) is an effective technique to further optimize DA, and allows us to reduce the memory requirements by a factor of two with minimal additional computational complexity. This makes OBC-DA attractive for applications that are both resource and memory constrained. In addition, sparse matrix factorization techniques can be exploited to further reduce the size of the DA ROMs. In this paper, the design and implementation of a novel OBC-based DA is demonstrated using a generic architecture for implementing discrete orthogonal transforms (DOTs). Implementation is performed on the Xilinx Virtex-II Pro field programmable gate array (FPGA), and a detailed comparison between conventional and OBC-based DA is presented to highlight the trade-offs in various design metrics including performance, area and power.
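A minimal bit-serial simulation of the DA principle is sketched below: the inner product is assembled by shift-adds over a precomputed 2^K-entry table of partial coefficient sums, with no multipliers. Plain DA is shown; OBC would additionally exploit the table's odd symmetry to halve it to 2^(K-1) entries. The coefficient and input values are illustrative.

```python
# Bit-serial distributed arithmetic: <c, x> from table lookups and
# shift-adds only. The table holds all 2^K partial sums of the fixed
# coefficients c; inputs are processed one bit-plane at a time.

def da_inner_product(c, x, bits=8):
    """c: fixed coefficients; x: signed ints representable in `bits` bits."""
    K = len(c)
    table = [sum(c[k] for k in range(K) if addr >> k & 1)
             for addr in range(1 << K)]          # ROM of partial sums
    acc = 0
    for j in range(bits):
        addr = 0
        for k in range(K):                       # gather bit j of every input
            addr |= ((x[k] >> j) & 1) << k
        weight = -(1 << j) if j == bits - 1 else (1 << j)  # two's-complement MSB
        acc += weight * table[addr]              # shift-add, no multiplier
    return acc

c = [3, -5, 7, 2]
x = [12, -7, 33, -20]
assert da_inner_product(c, x) == sum(ci * xi for ci, xi in zip(c, x))
```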
Approximate FPGA-based LSTMs under Computation Time Constraints
Recurrent Neural Networks, and in particular Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. However, the models are becoming increasingly demanding in terms of computational and memory load. Emerging latency-sensitive applications, including mobile robots and autonomous vehicles, often operate under stringent computation time constraints. In this paper, we address the challenge of deploying computationally demanding LSTMs at a constrained time budget by introducing an approximate computing scheme that combines iterative low-rank compression and pruning, along with a novel FPGA-based LSTM architecture. Combined in an end-to-end framework, the approximation method's parameters are optimised and the architecture is configured to address the problem of high-performance LSTM execution in time-constrained applications. Quantitative evaluation on a real-life image captioning application indicates that the proposed methods required up to 6.5x less time to achieve the same application-level accuracy compared to a baseline method, while achieving an average of 25x higher accuracy under the same computation time constraints.
Comment: Accepted at the 14th International Symposium in Applied Reconfigurable Computing (ARC)
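The low-rank half of the scheme can be sketched in a few lines: a truncated SVD replaces a dense n×m LSTM weight matrix with rank-r factors, cutting the multiply-accumulates per matrix-vector product from n·m to r·(n+m). The sketch below is a minimal NumPy illustration with assumed shapes; rank selection, pruning and retraining are the paper's contribution and are omitted.

```python
# Truncated-SVD low-rank compression of one LSTM weight matrix:
# W (n x m) is replaced by A (n x r) and B (r x m), so a matrix-vector
# product costs r*(n+m) MACs instead of n*m.
import numpy as np

rng = np.random.default_rng(3)
n, m, r = 512, 512, 64                   # illustrative sizes and rank
W = rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]                     # n x r factor (columns scaled by s)
B = Vt[:r, :]                            # r x m factor

x = rng.standard_normal(m)
y_full = W @ x                           # n*m MACs
y_low = A @ (B @ x)                      # r*(n+m) MACs
rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)
```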
Compressive Sensing Using Iterative Hard Thresholding with Low Precision Data Representation: Theory and Applications
Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems. Lossy compression of data is an intriguing solution, but comes with its own drawbacks, such as potential signal loss and the need for careful optimization of the compression ratio. In this work, we focus on a setting where this problem is especially acute: compressive sensing frameworks for interferometry and medical imaging. We ask the following question: can the precision of the data representation be lowered for all inputs, with recovery guarantees and practical performance? Our first contribution is a theoretical analysis of the normalized Iterative Hard Thresholding (IHT) algorithm when all input data, meaning both the measurement matrix and the observation vector, are quantized aggressively. We present a variant of low precision normalized IHT that, under mild conditions, can still provide recovery guarantees. The second contribution is the application of our quantization framework to radio astronomy and magnetic resonance imaging. We show that lowering the precision of the data can significantly accelerate image recovery. We evaluate our approach on telescope data and samples of brain images using CPU and FPGA implementations, achieving up to a 9x speed-up with negligible loss of recovery quality.
Comment: 19 pages, 5 figures, 1 table, in IEEE Transactions on Signal Processing
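A compact sketch of the setting analysed here: both the measurement matrix and the observation vector are rounded to a low-precision grid before running hard-thresholded gradient steps. The uniform quantizer, fixed step size and problem sizes below are illustrative assumptions, not the paper's normalized IHT variant.

```python
# Iterative hard thresholding on aggressively quantized inputs: a basic
# IHT loop with a conservative fixed step, fed 4-bit versions of A and y.
import numpy as np

def quantize(v, bits=4):
    """Round v onto a uniform low-precision grid (illustrative quantizer)."""
    scale = np.max(np.abs(v)) / (2 ** (bits - 1) - 1)
    return np.round(v / scale) * scale

def iht(A, y, k, iters=200):
    """Recover a k-sparse x from y ~ A x by hard-thresholded gradient steps."""
    x = np.zeros(A.shape[1])
    mu = 1.0 / np.linalg.norm(A, 2) ** 2     # conservative fixed step size
    for _ in range(iters):
        x = x + mu * (A.T @ (y - A @ x))     # gradient step on 0.5*||y - Ax||^2
        idx = np.argsort(np.abs(x))[:-k]     # hard threshold: keep top-k entries
        x[idx] = 0.0
    return x

rng = np.random.default_rng(4)
n, d, k = 80, 200, 5
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true

x_hat = iht(quantize(A), quantize(y), k)     # both inputs quantized to 4 bits
```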
- …