
    The LifeV library: engineering mathematics beyond the proof of concept

    LifeV is a library for the finite element (FE) solution of partial differential equations in one, two, and three dimensions. It is written in C++ and designed to run on diverse parallel architectures, including cloud and high-performance computing facilities. Despite its academic research nature, that is, a library for the development and testing of new methods, one distinguishing feature of LifeV is its use on real-world problems: it is intended to provide a tool for many engineering applications. It has in fact been used in computational hemodynamics, including cardiac mechanics and fluid-structure interaction problems, in porous media, and in ice sheet dynamics, for both forward and inverse problems. In this paper we give a short overview of the features of LifeV and its coding paradigms on simple problems. The main focus is on the parallel environment, which is mainly driven by domain decomposition methods and based on external libraries such as MPI, the Trilinos project, HDF5, and ParMetis. Dedicated to the memory of Fausto Saleri.
    Comment: Review of the LifeV Finite Element library
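
    As a concrete taste of the FE workflow such a library automates, here is a minimal, self-contained 1D sketch (plain NumPy, not the LifeV C++ API; every name is illustrative) of assembling and solving a Poisson problem with piecewise-linear elements:

```python
import numpy as np

def solve_poisson_1d(n_elems, f):
    """Solve -u'' = f on (0, 1) with u(0) = u(1) = 0 using linear elements."""
    h = 1.0 / n_elems
    n_nodes = n_elems + 1
    K = np.zeros((n_nodes, n_nodes))   # global stiffness matrix
    b = np.zeros(n_nodes)              # global load vector
    for e in range(n_elems):           # element-by-element assembly
        i, j = e, e + 1
        # local stiffness of a linear element on [x_i, x_j]: (1/h)[[1,-1],[-1,1]]
        K[i, i] += 1.0 / h
        K[j, j] += 1.0 / h
        K[i, j] -= 1.0 / h
        K[j, i] -= 1.0 / h
        # one-point (midpoint) quadrature for the element load vector
        xm = (e + 0.5) * h
        b[i] += 0.5 * h * f(xm)
        b[j] += 0.5 * h * f(xm)
    u = np.zeros(n_nodes)              # homogeneous Dirichlet boundary values
    u[1:-1] = np.linalg.solve(K[1:-1, 1:-1], b[1:-1])
    return u

# manufactured solution u(x) = sin(pi x), hence f(x) = pi^2 sin(pi x)
u = solve_poisson_1d(64, lambda x: np.pi ** 2 * np.sin(np.pi * x))
print(abs(u[32] - 1.0))                # small discretization error at x = 0.5
```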

    RTL implementation of one-sided jacobi algorithm for singular value decomposition

    Multi-dimensional digital signal processing tasks such as image processing and image reconstruction involve manipulating matrix data. Higher-quality images involve larger amounts of data, which results in unacceptably slow computation. A parallel processing scheme is a possible solution to this problem. This project presents an analysis and comparison of various algorithms for widely used matrix decomposition techniques on various computer architectures. As a result, a parallel implementation of the one-sided Jacobi algorithm for computing the singular value decomposition (SVD) of a 2×2 matrix on field programmable gate arrays (FPGAs) is developed. The proposed SVD design is based on a pipelined-datapath architecture. The design process starts by evaluating the algorithm in Matlab, then designing the datapath and control units, coding in SystemVerilog HDL, and performing verification and synthesis with Quartus II and simulation with ModelSim-Altera. Original matrices of size 4×4 and 8×8 are used to exercise the SVD processing element (PE), and the results are compared with the Matlab version of the algorithm to evaluate the PE. The computation of the SVD can be sped up by a factor of more than 2 by increasing the number of PEs, at the cost of increased circuit area.
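
    For context, the core of the one-sided Jacobi method is a sweep of pairwise column rotations that drives every pair of columns to orthogonality, after which the column norms are the singular values. Below is a minimal NumPy model of that kernel, the kind of software reference the project builds in Matlab before the RTL stage; all names here are illustrative, not the project's:

```python
import numpy as np

def one_sided_jacobi_svd(A, sweeps=30, tol=1e-12):
    """One-sided Jacobi SVD of a full-rank matrix: U * sigma @ V.T == A."""
    A = A.astype(float).copy()
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(sweeps):
        off = 0.0
        for p in range(n - 1):
            for q in range(p + 1, n):
                # rotate columns p and q so they become orthogonal
                alpha = A[:, p] @ A[:, p]
                beta = A[:, q] @ A[:, q]
                gamma = A[:, p] @ A[:, q]
                off = max(off, abs(gamma))
                if abs(gamma) < tol:
                    continue
                zeta = (beta - alpha) / (2.0 * gamma)
                t = (1.0 if zeta >= 0 else -1.0) / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                R = np.array([[c, s], [-s, c]])
                A[:, [p, q]] = A[:, [p, q]] @ R
                V[:, [p, q]] = V[:, [p, q]] @ R
        if off < tol:                        # all column pairs orthogonal
            break
    sigma = np.linalg.norm(A, axis=0)        # singular values
    U = A / sigma                            # left singular vectors
    return U, sigma, V

M = np.array([[4.0, 0.0], [3.0, -5.0]])
U, s, V = one_sided_jacobi_svd(M)
print(np.allclose(U * s @ V.T, M))           # True: factorization recovered
```

    The rotation angle comes from solving for the tangent t that zeroes the rotated columns' inner product; the numerically stable root keeps |t| <= 1, which is also what makes the kernel attractive for fixed-latency hardware pipelines.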

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among the numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining, and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems, and computer vision. Matrix decompositions are computationally expensive, and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
    • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
    • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrarily sized matrices.
    • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
    • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
    • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
    • We present a configurable architecture to accelerate Homotopy l1-minimization, in which a modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
    Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application- and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.
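
    As an illustration of the two QRD kernels the reconfigurable design above can switch between, here is a minimal NumPy sketch of both Givens-rotation and Householder-transformation QR. This is purely a software reference for the mathematics, not the FPGA architectures themselves:

```python
import numpy as np

def qr_givens(A):
    """QR via Givens rotations: each rotation zeroes one subdiagonal entry."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):       # zero column j from the bottom up
            r = np.hypot(R[j, j], R[i, j])
            if r == 0.0:
                continue
            c, s = R[j, j] / r, R[i, j] / r
            G = np.array([[c, s], [-s, c]])
            R[[j, i], :] = G @ R[[j, i], :]
            Q[:, [j, i]] = Q[:, [j, i]] @ G.T
    return Q, R

def qr_householder(A):
    """QR via Householder reflections: each reflection zeroes a whole column."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        if np.linalg.norm(v) == 0.0:
            continue
        v /= np.linalg.norm(v)              # unit reflector
        R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

A = np.random.default_rng(0).standard_normal((6, 4))
for qr in (qr_givens, qr_householder):
    Q, R = qr(A)
    assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(6))
```

    The trade-off the dissertation exploits is visible even here: Givens rotations touch only two rows at a time (fine-grained, easy to pipeline), while each Householder reflection updates an entire trailing submatrix (fewer, heavier steps).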

    Precision analysis for hardware acceleration of numerical algorithms

    The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision throughout an algorithm to meet a range or error specification are often overlooked; the major reason is that it is hard to choose a number system which can guarantee any such specification can be met. Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be ‘no worse’ than a software implementation. However, the flexibility in the number representation is one of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and hence ignoring this potential significantly limits the performance achievable. In order to optimise the performance of hardware reliably, we require a method that can tractably calculate tight bounds for the error or range of any variable within an algorithm, but currently only a handful of methods to calculate such bounds exist, and these either sacrifice tightness or tractability, whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new method to calculate these bounds, taking into account both input ranges and finite precision effects, which we show to be, in general, tighter in comparison to existing methods; this in turn can be used to tune the hardware to the algorithm specifications. We demonstrate the use of this software to optimise hardware for various algorithms to accelerate the solution of a system of linear equations, which forms the basis of many problems in engineering and science, and show that significant performance gains can be obtained by using this new approach in conjunction with more traditional hardware optimisations
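
    A toy illustration of the kind of guaranteed bound propagation the thesis improves on: naive interval arithmetic that widens every operation by the unit roundoff of a chosen precision. All parameters here are assumed for illustration, and the thesis's method is substantially tighter; the sketch only shows how the chosen precision enters an error/range bound for a simple dot product:

```python
from dataclasses import dataclass

U = 2.0 ** -11   # unit roundoff of a hypothetical 12-bit-mantissa format

@dataclass
class Interval:
    lo: float
    hi: float
    def __add__(self, other):
        return _round(Interval(self.lo + other.lo, self.hi + other.hi))
    def __mul__(self, other):
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return _round(Interval(min(ps), max(ps)))

def _round(iv):
    # widen by one relative rounding error, as if stored in the low precision
    m = max(abs(iv.lo), abs(iv.hi))
    return Interval(iv.lo - U * m, iv.hi + U * m)

# guaranteed enclosure for a dot product of inputs known to lie in [-1, 1]
acc = Interval(0.0, 0.0)
for _ in range(8):
    acc = acc + Interval(-1.0, 1.0) * Interval(-1.0, 1.0)
print(acc)   # slightly wider than [-8, 8]; shrinking U tightens the bound
```

    Interval arithmetic of this kind is always sound but often very loose (it ignores correlations between variables), which is exactly the tightness-versus-tractability gap the thesis's method addresses.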

    Doctor of Philosophy

    Deep Neural Networks (DNNs) are the state-of-the-art solution for a growing number of tasks, including computer vision, speech recognition, and genomics. However, DNNs are computationally expensive, as they are carefully trained to extract and abstract features from raw data using multiple layers of neurons with millions of parameters. In this dissertation, we primarily focus on inference, e.g., using a DNN to classify an input image. This is an operation that will be repeatedly performed on billions of devices in the datacenter, in self-driving cars, in drones, etc. We observe that DNNs spend the vast majority of their runtime performing matrix-by-vector multiplications (MVMs). MVMs have two major bottlenecks: fetching the matrix and performing sum-of-product operations. To address these bottlenecks, we use in-situ computing, where the matrix is stored in programmable resistor arrays, called crossbars, and sum-of-product operations are performed using analog computing. In this dissertation, we propose two hardware units, ISAAC and Newton. In ISAAC, we show that in-situ computing designs can outperform digital DNN accelerators if they leverage pipelining, smart encodings, and can distribute a computation in time and space, within crossbars and across crossbars. In the ISAAC design, roughly half the chip area/power can be attributed to the analog-to-digital conversion (ADC), i.e., it remains the key design challenge in mixed-signal accelerators for deep networks. In spite of the ADC bottleneck, ISAAC is able to outperform the computational efficiency of the state-of-the-art design (DaDianNao) by 8x. In Newton, we take advantage of a number of techniques to address ADC inefficiency. These techniques exploit matrix transformations, heterogeneity, and smart mapping of computation to the analog substrate. We show that Newton can increase the efficiency of in-situ computing by an additional 2x. Finally, we show that in-situ computing, unfortunately, cannot be easily adapted to handle training of deep networks, i.e., it is only suitable for inference of already-trained networks. By improving the efficiency of DNN inference with ISAAC and Newton, we move closer to low-cost deep learning that in turn will have societal impact through self-driving cars, assistive systems for the disabled, and precision medicine.
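
    A back-of-the-envelope software model of the bit-serial, in-situ MVM style such designs build on, with low-precision cells, 1-bit input streaming, and a shared ADC per column. The parameters and names below are assumptions for illustration, not the dissertation's design:

```python
import numpy as np

def crossbar_mvm(Wq, x_int, x_bits=8, adc_bits=8):
    """Wq: integer conductance levels (e.g. 0..3 for 2-bit cells).
    x_int: unsigned integer inputs, streamed one bit per cycle (1-bit DAC).
    Each cycle the analog column current is digitized by an adc_bits ADC,
    then a shift-and-add network accumulates the per-bit partial results."""
    rows, cols = Wq.shape
    full_scale = Wq.max() * rows            # largest possible column current
    out = np.zeros(cols)
    for b in range(x_bits):                 # one input bit per cycle
        bits = (x_int >> b) & 1             # this cycle's 1-bit inputs
        current = bits @ Wq                 # analog sum-of-products per column
        lsb = full_scale / (2 ** adc_bits - 1)
        digital = np.round(current / lsb) * lsb   # ADC quantization
        out += digital * (1 << b)           # shift-and-add accumulation
    return out

rng = np.random.default_rng(1)
Wq = rng.integers(0, 4, size=(128, 8))      # 2-bit cells, 128x8 crossbar
x = rng.integers(0, 256, size=128)          # 8-bit unsigned inputs
exact = x @ Wq
approx = crossbar_mvm(Wq, x)
print(np.max(np.abs(approx - exact) / np.abs(exact)))  # small relative error
```

    Even this toy model exposes the central tension the dissertation studies: the ADC resolution needed to digitize a full-scale column current is what dominates the area/power budget, so reducing ADC work is where most of the efficiency gains lie.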

    NASA high performance computing and communications program

    The National Aeronautics and Space Administration's HPCC program is part of a new Presidential initiative aimed at producing a 1000-fold increase in supercomputing speed and a 100-fold improvement in available communications capability by 1997. As more advanced technologies are developed under the HPCC program, they will be used to solve NASA's 'Grand Challenge' problems, which include improving the design and simulation of advanced aerospace vehicles, allowing people at remote locations to communicate more effectively and share information, increasing scientists' abilities to model the Earth's climate and forecast global environmental trends, and improving the development of advanced spacecraft. NASA's HPCC program is organized into three projects which are unique to the agency's mission: the Computational Aerosciences (CAS) project, the Earth and Space Sciences (ESS) project, and the Remote Exploration and Experimentation (REE) project. An additional project, the Basic Research and Human Resources (BRHR) project, exists to promote long-term research in computer science and engineering and to increase the pool of trained personnel in a variety of scientific disciplines. This document presents an overview of the objectives and organization of these projects as well as summaries of individual research and development programs within each project.

    Physics-Informed Deep Neural Operator Networks

    Standard neural networks can approximate general nonlinear operators, represented either explicitly by a combination of mathematical operators, e.g., in an advection-diffusion-reaction partial differential equation, or simply as a black box, e.g., a system-of-systems. The first neural operator was the Deep Operator Network (DeepONet), proposed in 2019 based on rigorous approximation theory. Since then, a few other, less general operators have been published, e.g., based on graph neural networks or Fourier transforms. For black-box systems, training of neural operators is data-driven only, but if the governing equations are known they can be incorporated into the loss function during training to develop physics-informed neural operators. Neural operators can be used as surrogates in design problems, uncertainty quantification, autonomous systems, and almost any application requiring real-time inference. Moreover, independently pre-trained DeepONets can be used as components of a complex multi-physics system by coupling them together with relatively light training. Here, we present a review of DeepONet, the Fourier neural operator, and the graph neural operator, as well as appropriate extensions with feature expansions, and highlight their usefulness in diverse applications in computational mechanics, including porous media, fluid mechanics, and solid mechanics.
    Comment: 33 pages, 14 figures. arXiv admin note: text overlap with arXiv:2204.00997 by other authors
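
    A minimal forward-pass sketch of the DeepONet architecture described above: a branch net encodes the input function sampled at m sensor locations, a trunk net encodes a query coordinate y, and the prediction is their inner product. This is plain NumPy under assumed layer sizes, not a reference implementation; physics-informed training would additionally penalize the PDE residual in the loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(sizes):
    """Random (untrained) weights and biases for a small tanh MLP."""
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

m, p = 100, 32                       # sensor count, latent dimension
branch = mlp_params([m, 64, p])      # input: u(x_1), ..., u(x_m)
trunk = mlp_params([1, 64, p])       # input: query coordinate y
bias = 0.0

def deeponet(u_sensors, y):
    """G(u)(y) ~ sum_k branch_k(u) * trunk_k(y) + bias."""
    return mlp(branch, u_sensors) @ mlp(trunk, y) + bias

xs = np.linspace(0.0, 1.0, m)
u = np.sin(2 * np.pi * xs)           # one input function, sampled at sensors
y = np.array([0.3])                  # one query point
print(deeponet(u, y))                # untrained scalar prediction G(u)(y)
```

    Because the branch and trunk factorize the operator, a trained network can be evaluated at arbitrary query points y without retraining, which is what makes DeepONets attractive as real-time surrogates.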