
43 research outputs found

An FPGA implementation of givens rotation based digital architecture for computing eigenvalues of asymmetric matrix

Author: Ayhan Tuba
Köseoğlu İlayda
Yalçın Mustak Erhan
Öztürk Elif
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2022

This paper proposes a digital circuit design that computes the eigenvalues of asymmetric matrices with real-valued elements. Eigenvalues are computed iteratively through the QR algorithm. In the QR algorithm, the input matrix is factorized into an orthogonal matrix Q and an upper triangular matrix R, then the RQ product is calculated to obtain the iterated matrix. For a time-efficient QR decomposition, the Givens Rotation (GR) principle is used because it can be parallelized. Parallelization is managed by a Systolic Array (SA) architecture that is created by placing Givens Generation (GG) and Row Update (RU) blocks in a triangular array. In this paper, a 4×4 input matrix is used to create a triangular systolic array (TSA) architecture including n−1 diagonal (GG) modules and n(n−1)/2 off-diagonal (RU) modules. In the results section, Givens Rotation is compared with the Gram-Schmidt algorithm used in our previous study [1] in terms of error and area usage.
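The iteration described in this abstract (factor the matrix as QR with Givens rotations, then form RQ and repeat) can be illustrated in software. The following is a minimal sequential sketch, not the paper's hardware design: the function names `givens_qr`, `matmul`, and `qr_eigenvalues` are hypothetical, no shifts are used, and it assumes the matrix has real eigenvalues with distinct magnitudes so that the iterates approach an upper triangular matrix.

```python
import math


def givens_qr(a):
    """Factor a square matrix (list of rows) as a = q * r using Givens rotations."""
    n = len(a)
    r = [row[:] for row in a]
    q = [[float(i == j) for j in range(n)] for i in range(n)]
    for j in range(n - 1):                 # column being cleared
        for i in range(j + 1, n):          # entry (i, j) to zero out
            if r[i][j] == 0.0:
                continue
            # Rotation generation: choose c, s so that entry (i, j) becomes zero.
            h = math.hypot(r[j][j], r[i][j])
            c, s = r[j][j] / h, r[i][j] / h
            for k in range(n):
                # Row update on r (rows j and i).
                r[j][k], r[i][k] = c * r[j][k] + s * r[i][k], -s * r[j][k] + c * r[i][k]
                # Accumulate the transposed rotation into q (columns j and i).
                q[k][j], q[k][i] = c * q[k][j] + s * q[k][i], -s * q[k][j] + c * q[k][i]
    return q, r


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def qr_eigenvalues(a, iterations=200):
    """Unshifted QR iteration: repeat a <- r * q and read the diagonal."""
    for _ in range(iterations):
        q, r = givens_qr(a)
        a = matmul(r, q)
    return [a[i][i] for i in range(len(a))]


# Asymmetric example whose exact eigenvalues are 5 and 2.
print(qr_eigenvalues([[4.0, 1.0], [2.0, 3.0]]))
```

For the 2×2 example the printed values should be close to 5 and 2. Matrices with complex eigenvalue pairs would not converge to triangular form under this simple scheme.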

MEF University Institutional Repository

Accelerating Extreme Learning Machine on FPGA by Hardware Implementation of Given Rotation - QRD

Author: Hon Jin Yong
Ismail Nordinah
Ooi Chia Yee
Tan Chong Yeam
Publication venue: 'Penerbit UTHM'
Publication date: 01/01/2019

Extreme Learning Machine (ELM) is currently one of the research trends in machine learning due to its remarkable performance in terms of complexity and computational speed. However, the big data era and the limitations of general-purpose processors have increased interest in hardware implementations of ELM to reduce computation time. Hence, this work presents a hardware-software co-design of ELM to improve overall performance. In the co-design paradigm, one of the important components of ELM, namely Given Rotation-QRD (GR-QRD), is developed as a hardware core. A Field Programmable Gate Array (FPGA) is chosen as the platform for the ELM implementation due to its reconfigurability and high parallelism. Learning accuracy and computational time are used to evaluate the performance of the proposed ELM design. Our experiment shows that the GR-QRD accelerator reduces the computational time of ELM training by 41.75% while maintaining the same training accuracy as a pure software implementation of ELM.

Journals of Universiti Tun Hussein Onn Malaysia (UTHM)

International Journal of Integrated Engineering

Universiti Teknologi Malaysia Institutional Repository


Parallelisation of greedy algorithms for compressive sensing reconstruction

Author: Turner David William
Publication venue: University of Cambridge
Publication date: 25/07/2019

Compressive Sensing (CS) is a technique which allows a signal to be compressed at the same time as it is captured. The process of capturing and simultaneously compressing the signal is represented as linear sampling, which can encompass a variety of physical processes or signal processing. Instead of explicitly identifying redundancies in the source signal, CS relies on the property of sparsity in order to reconstruct the compressed signal. While linear sampling is much less burdensome than conventional compression, this is more than made up for by the high computational cost of reconstructing a signal which has been captured using CS. Even when using some of the fastest reconstruction techniques, known as greedy pursuits, reconstruction of large problems can pose a significant burden, consuming a great deal of memory as well as compute time. Parallel computing is the foundation of the field of High Performance Computing (HPC). Modern supercomputers are generally composed of large clusters of standard servers, with a dedicated low-latency high-bandwidth interconnect network. On such a cluster, an appropriately written program can harness vast quantities of memory and computational power. However, in order to exploit a parallel compute resource, an algorithm usually has to be redesigned from the ground up. In this thesis I describe the development of parallel variants of two algorithms commonly used in CS reconstruction, Matching Pursuit (MP) and Orthogonal Matching Pursuit (OMP), resulting in the new distributed compute algorithms DistMP and DistOMP. I present the results from experiments showing how DistMP and DistOMP can utilise a compute cluster to solve CS problems much more quickly than a single computer could alone. Speed-up of as much as a factor of 76 is observed with DistMP when utilising 210 workers across 14 servers, compared to a single worker. 
Finally, I demonstrate how DistOMP can solve a problem with a 429GB equivalent sampling matrix in as little as 62 minutes using a 16-node compute cluster.
Funded by an ICASE award from the Engineering and Physical Sciences Research Council, with sponsorship provided by Thales Research and Technology.
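For readers unfamiliar with the greedy pursuits named in this abstract, the following is a minimal single-process sketch of textbook Orthogonal Matching Pursuit. It is not the thesis's distributed DistOMP algorithm. The helper names `dot`, `solve`, and `omp` are hypothetical, the sampling matrix is assumed to have no all-zero columns, and the least-squares step uses the normal equations for brevity rather than a numerically preferable factorization.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def solve(m, b):
    """Solve the square system m x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [m[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda row: abs(aug[row][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for row in range(c + 1, n):
            f = aug[row][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[row][k] -= f * aug[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def omp(a, y, sparsity):
    """Recover a sparse x with a x ~= y, selecting one column per iteration."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    norms = [math.sqrt(dot(c, c)) for c in cols]
    residual, support, coef = y[:], [], []
    for _ in range(sparsity):
        # Greedy step: pick the unused column most correlated with the residual.
        scores = [-1.0 if j in support else abs(dot(cols[j], residual)) / norms[j]
                  for j in range(n)]
        support.append(max(range(n), key=scores.__getitem__))
        # Orthogonal step: least-squares fit of y on all selected columns.
        gram = [[dot(cols[p], cols[q]) for q in support] for p in support]
        rhs = [dot(cols[p], y) for p in support]
        coef = solve(gram, rhs)
        residual = [y[i] - sum(c * cols[j][i] for c, j in zip(coef, support))
                    for i in range(m)]
    x = [0.0] * n
    for c, j in zip(coef, support):
        x[j] = c
    return x


# Toy 3x4 sampling matrix; y is built from columns 0 and 2 only.
A = [[1.0, 0.0, 0.0, 0.5],
     [0.0, 1.0, 0.0, 0.5],
     [0.0, 0.0, 1.0, 0.5]]
print(omp(A, [2.0, 0.0, -1.0], sparsity=2))
```

Plain Matching Pursuit would skip the least-squares refit and only subtract the selected column's contribution from the residual, which is cheaper per iteration but can reselect columns.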

Apollo (Cambridge)

LOS Throughput Measurements in Real-Time with a 128-Antenna Massive MIMO Testbed

Author: Armour Simon
Beach Mark
Doufexi Angela
Harris Paul
Kundargi Nikhil
Mellios Evangelos
Nieman Karl
Nix Andrew
Zhang Siming
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2017

Explore Bristol Research

Implementação em hardware reconfigurável de operadores matriciais para solução numérica de sistemas lineares

Author: Arias García Janier
Publication venue: 'Biblioteca Central da UNB'
Publication date: 14/11/2014

Thesis (master's), Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Mecânica, 2014.
This work presents a study on the implementation of matrix operators for the numerical solution of linear systems on FPGAs (Field Programmable Gate Arrays). The architectures are based on direct methods: QR, Schur, and Gaussian elimination. The methods were developed using both control-oriented and data-flow-oriented topologies with a floating-point arithmetic representation, which makes it possible to exploit the intrinsic parallelism of the different algorithms for solving linear systems. The resulting architectures keep error propagation under control while gaining performance in terms of runtime, with a view to their applicability in inverse problems. Architectures were developed both for inverting a matrix and for solving a system of linear equations, based on the Gaussian elimination method (or its Gauss-Jordan variant). Additionally, this work proposes and implements a novel architecture based on the Schur method, composed of the following circuits: QRD-MGS (QR Decomposition via Modified Gram-Schmidt), MMM (Matrix-Matrix Multiplication) and MDTM (Matrix-Diagonal-Transpose-Multiplication). Furthermore, this work presents studies of resource usage for different matrix sizes as well as an error propagation analysis in order to verify the applicability of the algorithms on reconfigurable hardware. The Gaussian elimination module developed in this work was also used to support the calculations of a GMDH neural network in an application to predict the 3D structure of a protein.
Finally, two methodologies were implemented: Datapath Fusion, to keep error propagation under control using only a single-precision representation, and Verification/Validation, to create a benchmark for validating the results of the hardware implementations.
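The QRD-MGS circuit named in this abstract computes a QR decomposition via Modified Gram-Schmidt. As a software illustration only, the following is a minimal sketch of the textbook algorithm. The function name `mgs_qr` is hypothetical, the input is assumed to have linearly independent columns, and nothing here reflects the thesis's hardware architecture or its arithmetic format.

```python
import math


def mgs_qr(a):
    """Modified Gram-Schmidt: factor an m x n matrix (list of rows) as a = q * r.

    q is m x n with orthonormal columns and r is n x n upper triangular.
    """
    m, n = len(a), len(a[0])
    v = [[a[i][j] for i in range(m)] for j in range(n)]   # working copy, by columns
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Normalise the current column to obtain the j-th orthonormal vector.
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        qj = [x / r[j][j] for x in v[j]]
        q_cols.append(qj)
        # Immediately remove the qj component from every remaining column;
        # this ordering is what distinguishes the modified from the classical variant.
        for k in range(j + 1, n):
            r[j][k] = sum(qj[i] * v[k][i] for i in range(m))
            v[k] = [v[k][i] - r[j][k] * qj[i] for i in range(m)]
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r


Q, R = mgs_qr([[2.0, 1.0, 0.0],
               [1.0, 3.0, 1.0],
               [0.0, 1.0, 4.0]])
print(R)
```

Updating the remaining columns as soon as each orthonormal vector is available generally loses less orthogonality in finite precision than the classical ordering.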

Repositório Institucional da Universidade de Brasília

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

RCAAP - Repositório Científico de Acesso Aberto de Portugal

Performance Analysis of Modified Gram-Schmidt Cholesky Implementation on 16 bits-DSP-chip

Author
Publication venue: 'Deanship of Scientific Research'
Publication date

Crossref

Energy Efficient VLSI Circuits for MIMO-WLAN

Author: Senning Carl Christian Sten Dominic
Publication venue: Lausanne, EPFL
Publication date: 22/10/2014

Mobile communication - anytime, anywhere access to data and communication services - has been continuously increasing since the operation of the first wireless communication link by Guglielmo Marconi. The demand for higher data rates, despite the limited bandwidth, led to the development of multiple-input multiple-output (MIMO) communication, which is often combined with orthogonal frequency division multiplexing (OFDM). Together, these two techniques achieve a high bandwidth efficiency. Unfortunately, techniques such as MIMO-OFDM significantly increase the signal processing complexity of transceivers. While fast improvements in integrated circuit (IC) technology have made it possible to implement more signal processing complexity per chip, considerable effort has been and still is required on novel algorithms as well as on efficient very large scale integration (VLSI) architectures in order to meet today's and tomorrow's requirements for mobile wireless communication systems. In this thesis, we will present architectures and VLSI implementations of complete physical (PHY) layer application specific integrated circuits (ASICs) under the constraints imposed by an industrial wireless communication standard. Contrary to many other publications, we do not elaborate individual components of a MIMO-OFDM communication system stand-alone, but in the context of the complete PHY layer ASIC. We will investigate the performance of several MIMO detectors and the corresponding preprocessing circuits, integrated into the entire PHY layer ASIC, in terms of achievable error rate, power consumption, and area requirement. Finally, we will assemble the results from the proposed PHY layer implementations in order to enhance the energy efficiency of a transceiver. To this end, we propose a cross-layer optimization of the PHY layer and the medium access control (MAC) layer.

Infoscience - École polytechnique fédérale de Lausanne

SYSTEM-ON-A-CHIP (SOC)-BASED HARDWARE ACCELERATION FOR HUMAN ACTION RECOGNITION WITH CORE COMPONENTS

Author: Safaei Amin
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2018

Today, the implementation of machine vision algorithms on embedded platforms or in portable systems is growing rapidly due to the demand for machine vision in daily human life. Among the applications of machine vision, human action and activity recognition has become an active research area, and market demand for providing integrated smart security systems is growing rapidly. Among the available approaches, embedded vision is in the top tier; however, current embedded platforms may not be able to fully exploit the potential performance of machine vision algorithms, especially in terms of low power consumption. Complex algorithms can impose immense computation and communication demands, especially action recognition algorithms, which require various stages of preprocessing, processing and machine learning blocks that need to operate concurrently. The market demands embedded platforms that operate with a power consumption of only a few watts. Attempts have been made to improve the performance of traditional embedded approaches by adding more powerful processors; this solution may solve the computation problem but increases the power consumption. System-on-a-chip field-programmable gate arrays (SoC-FPGAs) have emerged as a major architecture approach for improving power efficiency while increasing computational performance. In a SoC-FPGA, an embedded processor and an FPGA serving as an accelerator are fabricated in the same die to simultaneously improve power consumption and performance. Still, current SoC-FPGA-based vision implementations either shy away from supporting complex and adaptive vision algorithms or operate at very limited resolutions due to the immense communication and computation demands. The aim of this research is to develop a SoC-based hardware acceleration workflow for the realization of advanced vision algorithms. Hardware acceleration can improve performance for highly complex mathematical calculations or repeated functions.
The performance of a SoC system can thus be improved by using a hardware acceleration method to accelerate the element that incurs the highest performance overhead. The outcome of this research could be used for the implementation of various vision algorithms, such as face recognition, object detection or object tracking, on embedded platforms. The contributions of SoC-based hardware acceleration for hardware-software codesign platforms include the following: (1) development of frameworks for complex human action recognition in both 2D and 3D; (2) realization of a framework with four main implemented IPs, namely, foreground and background subtraction (foreground probability), human detection, 2D/3D point-of-interest detection and feature extraction, and OS-ELM as a machine learning algorithm for action identification; (3) use of an FPGA-based hardware acceleration method to resolve system bottlenecks and improve system performance; and (4) measurement and analysis of system specifications, such as the acceleration factor, power consumption, and resource utilization. Experimental results show that the proposed SoC-based hardware acceleration approach provides better performance in terms of the acceleration factor, resource utilization and power consumption among all recent works. In addition, a comparison of the accuracy of the framework that runs on the proposed embedded platform (SoC-FPGA) with the accuracy of other PC-based frameworks shows that the proposed approach outperforms most other approaches.

Scholarship at UWindsor

Adaptive Baseband Processing and Configurable Hardware for Wireless Communication

Author: Gangarajaiah Rakesh
Publication venue: Department of Electrical and Information Technology, Lund University
Publication date: 01/01/2017

The world of information is literally at one’s fingertips, allowing access to previously unimaginable amounts of data, thanks to advances in wireless communication. The growing demand for high speed data has necessitated the use of wider bandwidths, and wireless technologies such as Multiple-Input Multiple-Output (MIMO) have been adopted to increase spectral efficiency. These advanced communication technologies require sophisticated signal processing, often leading to higher power consumption and reduced battery life. Therefore, increasing energy efficiency of baseband hardware for MIMO signal processing has become extremely vital. High Quality of Service (QoS) requirements invariably lead to a larger number of computations and a higher power dissipation. However, recognizing the dynamic nature of the wireless communication medium in which only some channel scenarios require complex signal processing, and that not all situations call for high data rates, allows the use of an adaptive channel aware signal processing strategy to provide a desired QoS. Information such as interference conditions, coherence bandwidth and Signal to Noise Ratio (SNR) can be used to reduce algorithmic computations in favorable channels. Hardware circuits which run these algorithms need flexibility and easy reconfigurability to switch between multiple designs for different parameters. These parameters can be used to tune the operations of different components in a receiver based on feedback from the digital baseband. This dissertation focuses on the optimization of digital baseband circuitry of receivers which use feedback to trade power and performance. A co-optimization approach, where designs are optimized starting from the algorithmic stage through the hardware architectural stage to the final circuit implementation, is adopted to realize energy efficient digital baseband hardware for mobile 4G devices. These concepts are also extended to the next generation 5G systems where the energy efficiency of the base station is improved.
This work includes six papers that examine digital circuits in MIMO wireless receivers. Several key blocks in these receivers include analog circuits that have residual non-linearities, leading to signal intermodulation and distortion. Paper-I introduces a digital technique to detect such non-linearities and calibrate analog circuits to improve signal quality. The concept of a digital nonlinearity tuning system developed in Paper-I is implemented and demonstrated in hardware. The performance of this implementation is tested with an analog channel select filter, and results are presented in Paper-II. MIMO systems such as the ones used in 4G may employ QR Decomposition (QRD) processors to simplify the implementation of tree search based signal detectors. However, the small form factor of the mobile device increases spatial correlation, which is detrimental to signal multiplexing. Consequently, a QRD processor capable of handling high spatial correlation is presented in Paper-III. The algorithm and hardware implementation are optimized for carrier aggregation, which increases requirements on signal processing throughput, leading to higher power dissipation. Paper-IV presents a method to perform channel-aware processing with a simple interpolation strategy to adaptively reduce QRD computation count. Channel properties such as coherence bandwidth and SNR are used to reduce multiplications by 40% to 80%. These concepts are extended to use time domain correlation properties, and a full QRD processor for 4G systems fabricated in 28 nm FD-SOI technology is presented in Paper-V. The design is implemented with a configurable architecture and measurements show that circuit tuning results in a highly energy efficient processor, requiring 0.2 nJ to 1.3 nJ for each QRD.
Finally, these adaptive channel-aware signal processing concepts are examined in the scope of the next generation of communication systems. Massive MIMO systems increase spectral efficiency by using a large number of antennas at the base station. Consequently, the signal processing at the base station has a high computational count. Paper-VI presents a configurable detection scheme which reduces this complexity by using techniques such as selective user detection and interpolation based signal processing. Hardware is optimized for resource sharing, resulting in a highly reconfigurable and energy efficient uplink signal detector.

Lund University Publications