2,187 research outputs found

    A pipeline structure for the block QR update in digital signal processing

    [EN] Some problems in digital signal processing, such as the filtering of acoustic signals, require processing a large amount of data in real time. The beamforming algorithm, for instance, can be modeled by a rectangular matrix that is built from the input signals of an acoustic system and therefore changes in real time. Obtaining the output signals requires computing the QR factorization of this matrix. In this paper, we propose to organize the concurrent computational resources of a given multicore computer in a pipeline structure to perform this factorization as fast as possible. The pipeline has been implemented using both OpenMP, an application programming interface, and GrPPI, a library interface for designing parallel applications based on parallel patterns. We tackle not only the performance challenge but also the programmability of our idea using parallel programming frameworks.
    This work was supported by the Spanish Ministry of Economy and Competitiveness under MINECO and FEDER projects TIN2014-53495-R and TEC2015-67387-C4-1-R.
    Dolz, MF.; Alventosa, FJ.; Alonso-Jordá, P.; Vidal Maciá, AM. (2019). A pipeline structure for the block QR update in digital signal processing. The Journal of Supercomputing. 75(3):1470-1482. https://doi.org/10.1007/s11227-018-2666-1

    A Pipeline for the QR Update in Digital Signal Processing

    [EN] The input and output signals of a digital signal processing system can often be represented by a rectangular matrix. This is the case for the beamformer algorithm, which extracts the original input signal after cleaning it of noise and room reverberation. We use a version of this algorithm in which the system matrix must be factorized to solve a least squares problem. The matrix changes periodically according to the sampled input signal; therefore, the factorization needs to be recalculated as fast as possible. In this paper, we propose to use parallelism through a pipeline pattern. With our pipeline, some partial computations are performed in advance, so that the final time required to update the factorization is greatly reduced.
    This work was supported by the Spanish Ministry of Economy and Competitiveness under MINECO and FEDER projects TIN2014-53495-R and TEC2015-67387-C4-1-R.
    Dolz, MF.; Alventosa, FJ.; Alonso-Jordá, P.; Vidal Maciá, AM. (2019). A Pipeline for the QR Update in Digital Signal Processing. Computational and Mathematical Methods. 1:1-13. https://doi.org/10.1002/cmm4.1022
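To make the idea of updating a factorization (rather than recomputing it) concrete, the sketch below shows a textbook row-append update of the triangular factor using Givens rotations. It is only an illustration of the general technique and is not the block or pipelined scheme of the paper; the function name `qr_append_row` and the small test values are hypothetical.

```python
import math

def qr_append_row(R, w):
    """Return the upper triangular factor of the matrix [R; w].

    R is a p x p upper triangular matrix (list of rows) and w is a new
    row of length p. Each Givens rotation mixes row j of R with w so
    that w[j] becomes zero. The result R2 satisfies
    R2^T R2 = R^T R + w^T w, the normal-equations matrix after the
    new sample row has been added.
    """
    p = len(R)
    R = [row[:] for row in R]   # work on copies; inputs are not modified
    w = list(w)
    for j in range(p):
        a, b = R[j][j], w[j]
        h = math.hypot(a, b)
        if h == 0.0:
            continue            # nothing to eliminate in this column
        c, s = a / h, b / h     # rotation that maps (a, b) to (h, 0)
        for k in range(j, p):
            rjk, wk = R[j][k], w[k]
            R[j][k] = c * rjk + s * wk
            w[k] = -s * rjk + c * wk
    return R

# Hypothetical example: a 2 x 2 factor updated with one new sample row.
R_new = qr_append_row([[2.0, 1.0], [0.0, 3.0]], [1.0, 2.0])
```

The update costs on the order of p² operations per appended row, whereas refactorizing an m × p matrix from scratch costs on the order of mp², which is why update schemes suit matrices that change with every new block of samples.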

    A fast algorithm for $QR^{-1}$ factorization of Toeplitz matrices

    In this paper a new order recursive algorithm for the efficient $QR^{-1}$ factorization of Toeplitz matrices is described. The proposed algorithm can be seen as a fast modified Gram-Schmidt method which recursively computes the orthonormal columns, indexed $i = 1, 2, \ldots, p$, of $Q$, as well as the elements of $R^{-1}$, of a Toeplitz matrix with dimensions $L \times p$. The $Q$ factor estimation requires $8Lp$ MADS (multiplications and divisions). Matrix $R^{-1}$ is subsequently estimated using $3p^2$ MADS. A faster algorithm, based on a mixed $Q$ and $R^{-1}$ updating scheme, is also derived. It requires $7Lp + 3.5p^2$ MADS. The algorithm can be efficiently applied to batch least squares FIR filtering and system identification. When determination of the optimal filter is the desired task, it can be utilized to compute the least squares filter in an order recursive way. The algorithm operates directly on the experimental data, overcoming the need for covariance estimates. An orthogonalized version of the proposed $QR^{-1}$ algorithm is derived. Matlab code implementing the algorithm is also supplied.
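As a point of reference for what the factorization produces, the following sketch runs the ordinary modified Gram-Schmidt method while accumulating $R^{-1}$ alongside $Q$. It does not exploit Toeplitz structure and costs on the order of $Lp^2$ operations, so it is not the fast algorithm of the paper. The helper names `toeplitz` and `mgs_q_rinv` and the sample data are hypothetical, and the input is assumed to have full column rank.

```python
import math

def toeplitz(first_col, first_row):
    """Dense L x p Toeplitz matrix (list of rows) from its first column and row."""
    L, p = len(first_col), len(first_row)
    return [[first_col[i - j] if i >= j else first_row[j - i]
             for j in range(p)] for i in range(L)]

def mgs_q_rinv(X):
    """Modified Gram-Schmidt returning Q and U = R^{-1}, so that Q = X U.

    Every column operation applied to the working copy of X is also
    applied to U, which starts as the identity. The invariant Q = X U
    therefore holds throughout, and U is upper triangular at the end.
    Assumes X has full column rank (no zero column norms).
    """
    L, p = len(X), len(X[0])
    Q = [row[:] for row in X]
    U = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for i in range(p):
        norm = math.sqrt(sum(Q[k][i] ** 2 for k in range(L)))
        for k in range(L):
            Q[k][i] /= norm          # normalize column i of Q
        for k in range(p):
            U[k][i] /= norm          # same scaling on U
        for j in range(i + 1, p):
            r = sum(Q[k][i] * Q[k][j] for k in range(L))
            for k in range(L):
                Q[k][j] -= r * Q[k][i]   # remove component along q_i
            for k in range(p):
                U[k][j] -= r * U[k][i]   # same column operation on U
    return Q, U

# Hypothetical 5 x 3 Toeplitz data matrix.
X = toeplitz([4.0, 1.0, 2.0, 3.0, 1.0], [4.0, 2.0, 1.0])
Q, U = mgs_q_rinv(X)
```

Having $R^{-1}$ available explicitly means the least squares solution of $Xc \approx y$ can be formed as $c = R^{-1} Q^T y$ with matrix-vector products only, without a triangular solve.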

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems and computer vision. Matrix decompositions are computationally expensive and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
    • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
    • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrary sized matrices.
    • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
    • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
    • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
    • We present a configurable architecture to accelerate Homotopy l1-minimization, in which the modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
    Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.

    FPGA-Based Co-processor for Singular Value Array Reconciliation Tomography

    This thesis describes a co-processor system that has been designed to accelerate computations associated with Singular Value Array Reconciliation Tomography (SART), a method for locating a wide-band RF source which may be positioned within an indoor environment, where RF propagation characteristics make source localization very challenging. The co-processor system is based on field programmable gate array (FPGA) technology, which offers a low-cost alternative to customized integrated circuits, while still providing the high performance, low power, and small size associated with a custom integrated solution. The system has been developed in VHDL, and implemented on a Virtex-4 SX55 FPGA development platform. The system is easy to use, and may be accessed through a C program or MATLAB script. Compared to a Pentium 4 CPU running at 3 GHz, use of the co-processor system provides a speed-up of about 6 times for the current signal matrix size of 128-by-16. Greater speed-ups may be obtained by using multiple devices in parallel. The system is capable of computing the SART metric to an accuracy of about -145 dB with respect to its true value. This level of accuracy, which is shown to be better than that obtained using single precision floating point arithmetic, allows even relatively weak signals to make a meaningful contribution to the final SART solution