115 research outputs found

    Memory-friendly fixed-point iteration method for nonlinear surface mode oscillations of acoustically driven bubbles: from the perspective of high-performance GPU programming

    A fixed-point iteration technique is presented to handle the implicit nature of the governing equations of nonlinear surface mode oscillations of acoustically excited microbubbles. The model is adopted from the theoretical work of Shaw [1], where the dynamics of the mean bubble radius and the surface modes are bi-directionally coupled via nonlinear terms. The model comprises a set of second-order ordinary differential equations; it extends the classic Keller–Miksis equation and the linearized dynamical equations of the surface modes. Only the implicit parts (containing the second derivatives) are re-evaluated during the iteration process. The performance of the technique is tested at various parameter combinations. The majority of the test cases need only a single re-evaluation to reach an error of 10^-9. Although the arithmetic operation count is higher than that of Gaussian elimination, the memory-friendly, matrix-free nature of the iteration makes it a viable alternative for high-performance GPU computations in massive parameter studies.
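
    The core of the scheme — re-evaluating only the implicit, second-derivative terms instead of assembling and solving a linear system at every right-hand-side call — can be sketched generically. The Python fragment below is a minimal illustration with an assumed toy two-mode coupling; solve_accelerations, toy_rhs and the coefficients are hypothetical stand-ins, not Shaw's radial/surface-mode equations.

        import numpy as np

        def solve_accelerations(y, v, implicit_rhs, a0=None, tol=1e-9, max_iter=20):
            """Fixed-point iteration for an implicit acceleration equation a = G(y, v, a).

            Only the implicit part G (containing the second derivatives) is
            re-evaluated in each sweep; no Jacobian is assembled, so the scheme
            is matrix-free.
            """
            a = np.zeros_like(y) if a0 is None else a0.copy()
            for _ in range(max_iter):
                a_new = implicit_rhs(y, v, a)      # re-evaluate implicit terms only
                if np.max(np.abs(a_new - a)) < tol:
                    return a_new
                a = a_new
            return a

        # Hypothetical toy coupling standing in for the radial/surface-mode equations:
        # each acceleration depends weakly on the other, so one or two sweeps converge.
        def toy_rhs(y, v, a):
            return np.array([-y[0] - 0.1 * v[0] + 0.05 * a[1],
                             -4.0 * y[1] - 0.2 * v[1] + 0.05 * a[0]])

        y, v = np.array([1.0, 0.1]), np.zeros(2)
        print(solve_accelerations(y, v, toy_rhs))

    Because no matrix is stored or factorized, a GPU thread integrating one parameter combination needs only a few registers, which is the memory-friendly, matrix-free property highlighted above.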

    A Study on Efficient Application Mapping onto Parallel Computing Accelerators (並列計算アクセラレータへの効率的なアプリケーションマッピングに関する研究)

    Nagasaki University (長崎大学) doctoral thesis. Degree number: 博(工)甲第3号; degree conferred: March 20, 2014 (Heisei 26). Doctoral program

    Precision analysis for hardware acceleration of numerical algorithms

    The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision throughout the algorithm to meet a range or error specification are often overlooked; the major reason is that it is hard to choose a number system which can guarantee that any such specification will be met. Instead, the problem is mitigated by opting for IEEE standard double-precision arithmetic so as to be 'no worse' than a software implementation. However, flexibility in the number representation is one of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and ignoring this potential significantly limits the achievable performance. In order to optimise the performance of hardware reliably, we require a method that can tractably calculate tight bounds for the error or range of any variable within an algorithm. Currently only a handful of methods to calculate such bounds exist, and these sacrifice either tightness or tractability, whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new method to calculate these bounds, taking into account both input ranges and finite-precision effects, which we show to be, in general, tighter than existing methods; this in turn can be used to tune the hardware to the algorithm specifications. We demonstrate the use of this software to optimise hardware for various algorithms that accelerate the solution of a system of linear equations, which forms the basis of many problems in engineering and science, and show that significant performance gains can be obtained by using this new approach in conjunction with more traditional hardware optimisations.
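
    As a point of reference for the kind of guarantee discussed above, the sketch below shows a classical baseline rather than the thesis's own method: propagating interval ranges through an expression while widening each result by one rounding error of the target format. The Interval class, the EPS value and the example expression are illustrative assumptions.

        from dataclasses import dataclass

        EPS = 2.0 ** -52  # machine epsilon of IEEE double; change for a custom format

        @dataclass
        class Interval:
            lo: float
            hi: float
            def __add__(self, other):
                return widen(Interval(self.lo + other.lo, self.hi + other.hi))
            def __mul__(self, other):
                p = [self.lo * other.lo, self.lo * other.hi,
                     self.hi * other.lo, self.hi * other.hi]
                return widen(Interval(min(p), max(p)))

        def widen(iv):
            """Grow an interval by one relative rounding error of the target format."""
            err = max(abs(iv.lo), abs(iv.hi)) * EPS
            return Interval(iv.lo - err, iv.hi + err)

        # Guaranteed enclosure of x*y + z for x in [1, 2], y in [3, 4], z in [-1, 1].
        x, y, z = Interval(1.0, 2.0), Interval(3.0, 4.0), Interval(-1.0, 1.0)
        print(x * y + z)   # an enclosure, generally looser than the true range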

    No Excess Babbage - Design Considerations for the Interface to a Systolic Matrix Processor

    Pages 174-179 have been removed for reasons of commercial confidentiality (including from the print copy). Computational systems used for the solution of large matrix-based numerical problems are rapidly converging on the limits of technology, and novel architectures are now being sought to improve performance. In signal processing and control theory, high-performance systems are required that do not carry the physical size penalty of current supercomputers. To achieve performance comparable with current supercomputers, a systolic processing array specifically targeted at matrix applications has been developed at the University of Adelaide. The work of this thesis addresses the problem of delivering and receiving the data moving between the processing array and the memory subsystem. This involves reformatting existing algorithms to map efficiently onto the matrix array, the design and VLSI layout of a matrix address generation unit using signed-digit arithmetic for enhanced performance, and the block-level description of a multiport cached memory system. Performance estimates predict that a modest configuration will perform selected matrix routines in excess of three GigaFLOPS. Thesis (MESc.) -- University of Adelaide, Department of Electrical and Electronic Engineering, 199
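
    The reformatting the interface must perform can be illustrated with a cycle-level software model of an output-stationary systolic array: operands of A and B are skewed by one cycle per row or column before being streamed into the edge processing elements, which is the kind of access sequence an address-generation unit has to produce. The Python sketch below is a generic model, not the Adelaide processor's actual design.

        import numpy as np

        def systolic_matmul(A, B):
            """Cycle-level model of an n x n output-stationary systolic array computing A @ B.

            Rows of A enter from the left and columns of B from the top, each skewed
            by one cycle per row/column; the products accumulate in place in each PE.
            """
            n = A.shape[0]
            C = np.zeros((n, n))
            a_reg = np.zeros((n, n))   # operand held in each PE's horizontal register
            b_reg = np.zeros((n, n))   # operand held in each PE's vertical register
            for t in range(3 * n - 2):                 # cycles for the full wavefront
                for i in reversed(range(n)):           # shift right/down, then inject
                    for j in reversed(range(n)):
                        a_reg[i, j] = a_reg[i, j - 1] if j > 0 else (
                            A[i, t - i] if 0 <= t - i < n else 0.0)
                        b_reg[i, j] = b_reg[i - 1, j] if i > 0 else (
                            B[t - j, j] if 0 <= t - j < n else 0.0)
                        C[i, j] += a_reg[i, j] * b_reg[i, j]
            return C

        A, B = np.random.rand(4, 4), np.random.rand(4, 4)
        print(np.allclose(systolic_matmul(A, B), A @ B))   # True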

    Implementation in Embedded Systems of State Observers Based on Multibody Dynamics

    Programa Oficial de Doutoramento en Enxeñaría Naval e Industrial. 5015V01. [Abstract] Simulation has become an important tool in industry that minimizes both the cost and the time of developing and testing new products. In the automotive industry, the use of simulation is being extended to virtual sensing. Through an accurate model of the vehicle combined with a state estimator, variables that are difficult or costly to measure can be estimated. The virtual sensing approach is limited by the low computational power of in-vehicle hardware, due to the strict timing, reliability and safety requirements imposed by automotive standards. With new-generation hardware, the computational power of embedded platforms has increased. These platforms are based on heterogeneous processors, where the main processor is combined with a co-processor such as a Field-Programmable Gate Array (FPGA). This thesis explores the implementation of a state estimator based on a multibody model of a vehicle in new-generation embedded hardware. Different implementation strategies are tested in order to explore the advantages that an FPGA can provide. A new state-parameter-input observer is developed, providing accurate estimations. The proposed observer is combined with an efficient multibody model of a vehicle, achieving real-time execution.
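
    As a rough illustration of virtual sensing with a state observer, the sketch below runs a generic discrete-time Luenberger observer that estimates an unmeasured velocity from position measurements of a toy mass-spring-damper. The matrices and the gain L are assumed values; this is not the thesis's state-parameter-input observer nor its multibody vehicle model.

        import numpy as np

        dt = 0.001
        A = np.array([[1.0, dt], [-10.0 * dt, 1.0 - 0.5 * dt]])  # toy mass-spring-damper
        C = np.array([[1.0, 0.0]])                               # only position is measured
        L = np.array([[0.4], [8.0]])                             # assumed observer gain

        x = np.array([[1.0], [0.0]])        # true state (position, velocity)
        x_hat = np.zeros((2, 1))            # observer estimate, starts wrong
        for _ in range(5000):
            y = C @ x                                   # sensor reading
            x_hat = A @ x_hat + L @ (y - C @ x_hat)     # model prediction + correction
            x = A @ x                                   # true plant evolves
        print("true velocity:", x[1, 0], "estimated:", x_hat[1, 0])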

    Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

    With the end of Dennard scaling and Moore's law, architects have moved towards heterogeneous designs consisting of specialized cores to achieve higher performance and energy efficiency for a target application domain. Applications of linear algebra are ubiquitous in scientific computing, machine learning, statistics and related fields, with matrix computations being fundamental to these linear-algebra-based solutions. Designing multiple dense (or sparse) matrix computation routines on the same platform is quite challenging. Adding to the complexity is the fact that dense and sparse matrix computations have large differences in their storage and access patterns and are difficult to optimize on the same architecture. This thesis addresses this challenge and introduces a reconfigurable accelerator that supports both dense and sparse matrix computations efficiently. The reconfigurable architecture has been optimized to execute the following linear algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM (Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver), LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication), and SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where each core consists of a 2D array of processing elements (PEs). The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4-sized matrix updates efficiently; a sequence of such updates is used to solve a larger problem inside a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR) is used to perform sparse kernel updates. Scalable partitioning and mapping schemes are presented that map input matrices of any given size to the multicore architecture. Design trade-offs related to the PE array dimension, the size of local memory inside a core, and the bandwidth between on-chip memories and the cores are presented, and an optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of up to 32 GOPS using a single core. Dissertation/Thesis. Masters Thesis, Computer Engineering, 201
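
    The way a larger dense problem is reduced to a stream of 4x4 updates can be sketched in software. The fragment below is a schematic analogue of the core-level scheduling, assuming dimensions padded to multiples of 4; it does not reproduce the PBCSC/PBCSR sparse format or the hardware mapping.

        import numpy as np

        BLK = 4  # matches the 4x4 PE array; each call below models one core-level update

        def pe_array_update(C_blk, A_blk, B_blk):
            """One 4x4 rank-4 update, i.e. the unit of work scheduled onto the PE array."""
            return C_blk + A_blk @ B_blk

        def blocked_gemm(A, B):
            """Dense GEMM decomposed into 4x4 block updates (dimensions assumed padded)."""
            n, k = A.shape
            _, m = B.shape
            assert n % BLK == 0 and k % BLK == 0 and m % BLK == 0
            C = np.zeros((n, m))
            for i in range(0, n, BLK):
                for j in range(0, m, BLK):
                    for p in range(0, k, BLK):
                        C[i:i+BLK, j:j+BLK] = pe_array_update(
                            C[i:i+BLK, j:j+BLK],
                            A[i:i+BLK, p:p+BLK],
                            B[p:p+BLK, j:j+BLK])
            return C

        A, B = np.random.rand(8, 12), np.random.rand(12, 16)
        print(np.allclose(blocked_gemm(A, B), A @ B))   # True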

    GPU Accelerated Approach to Numerical Linear Algebra and Matrix Analysis with CFD Applications

    A GPU-accelerated approach to numerical linear algebra and matrix analysis with CFD applications is presented. The work's objectives are to (1) develop stable and efficient algorithms utilizing multiple NVIDIA GPUs with CUDA to accelerate common matrix computations, (2) optimize these algorithms through CPU/GPU memory allocation, GPU kernel development, CPU/GPU communication, data transfer and bandwidth control, and (3) develop parallel CFD applications for Navier-Stokes and Lattice Boltzmann analysis methods. Special consideration is given to performing the linear algebra algorithms on particular matrix types (banded, dense, diagonal, sparse, symmetric and triangular). Benchmarks are performed for all analyses, with baseline CPU times being determined to find speed-up factors and measure the computational capability of the GPU-accelerated algorithms. The GPU-implemented algorithms used in this work, along with the optimization techniques performed, are measured against pre-existing work and test matrices available in the NIST Matrix Market. The CFD analysis strengthens the assessment of this work by providing a direct engineering application that benefits from matrix optimization techniques and accelerated algorithms. Overall, this work develops optimizations for selected linear algebra and matrix computations on modern GPU architectures with CUDA, applied directly to mathematical and engineering applications through CFD analysis.
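
    A typical CPU-baseline-versus-GPU speed-up measurement of the sort described above can be sketched as follows. The fragment assumes the CuPy library as a GPU stand-in purely for illustration (the work itself develops CUDA kernels directly), and the matrix size and repeat count are arbitrary.

        import time
        import numpy as np
        import cupy as cp   # assumed GPU stand-in; the thesis writes CUDA kernels directly

        def bench(fn, sync=lambda: None, repeats=3):
            fn(); sync()                          # warm-up (library/kernel initialisation)
            t0 = time.perf_counter()
            for _ in range(repeats):
                fn()
            sync()
            return (time.perf_counter() - t0) / repeats

        n = 4096
        A_cpu, b_cpu = np.random.rand(n, n), np.random.rand(n)
        A_gpu, b_gpu = cp.asarray(A_cpu), cp.asarray(b_cpu)

        t_cpu = bench(lambda: np.linalg.solve(A_cpu, b_cpu))
        t_gpu = bench(lambda: cp.linalg.solve(A_gpu, b_gpu),
                      sync=cp.cuda.Stream.null.synchronize)
        print(f"CPU {t_cpu:.3f} s, GPU {t_gpu:.3f} s, speed-up x{t_cpu / t_gpu:.1f}")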

    Accelerating Pattern Recognition Algorithms On Parallel Computing Architectures

    The move to more parallel computing architectures places more responsibility on the programmer to achieve greater performance. The programmer must now have a greater understanding of the underlying architecture and the inherent algorithmic parallelism. Using parallel computing architectures to exploit algorithmic parallelism can be a complex task. This dissertation demonstrates various techniques for using parallel computing architectures to exploit algorithmic parallelism. Specifically, three pattern recognition (PR) approaches are examined for acceleration across multiple parallel computing architectures, namely field-programmable gate arrays (FPGAs) and general-purpose graphics processing units (GPGPUs). Phase-only filter correlation for fingerprint identification was studied as the first PR approach. This approach's sensitivity to angular rotations, scaling, and missing data was surveyed. Additionally, a novel FPGA implementation of this algorithm was created using fixed-point computations, deep pipelining, and four computation phases. Communication and computation were overlapped to efficiently process large fingerprint galleries. The FPGA implementation showed approximately a 47-times speedup over a central processing unit (CPU) implementation with negligible impact on precision. For the second PR approach, a spiking neural network (SNN) algorithm for a character recognition application was examined. A novel FPGA implementation of the approach was developed, incorporating a scalable modular SNN processing element (PE) to efficiently perform neural computations. The modular SNN PE incorporated streaming memory, fixed-point computation, and deep pipelining. This design showed speedups of approximately 3.3 and 8.5 times over CPU implementations for neural networks of size 624 and 9,264, respectively. Results indicate that the PE design could scale easily to process larger networks. Finally, for the third PR approach, cellular simultaneous recurrent networks (CSRNs) were investigated for GPGPU acceleration. In particular, the applications of maze traversal and face recognition were studied. Novel GPGPU implementations were developed employing varying quantities of task-level, data-level, and instruction-level parallelism to achieve efficient runtime performance. Furthermore, the performance of the face recognition application was examined across a heterogeneous cluster of multi-core and GPGPU architectures. A combination of multi-core processors and GPGPUs achieved roughly a 996-times speedup over a single-core CPU implementation. From examining these PR approaches for acceleration, this dissertation presents useful techniques and insight applicable to other algorithms to improve performance when designing a parallel implementation.
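
    For the first approach, the underlying operation is standard phase-only correlation: the cross-power spectrum is normalised to unit magnitude so that only phase information contributes, and a sharp peak marks the aligning shift. The sketch below demonstrates this on a synthetic image pair in floating point; it does not reproduce the dissertation's fixed-point, four-phase FPGA pipeline or its fingerprint data.

        import numpy as np

        def phase_only_correlation(f, g, eps=1e-12):
            """Phase-only correlation of two equally sized images.

            The cross-power spectrum is normalised to unit magnitude, so only the
            phase contributes; a sharp peak marks the shift aligning f with g.
            """
            R = np.fft.fft2(f) * np.conj(np.fft.fft2(g))
            R /= np.abs(R) + eps
            return np.real(np.fft.ifft2(R))

        # Toy check: shift an image and recover the shift from the correlation peak.
        rng = np.random.default_rng(0)
        img = rng.random((64, 64))
        shifted = np.roll(img, shift=(5, 9), axis=(0, 1))
        poc = phase_only_correlation(shifted, img)
        print(np.unravel_index(np.argmax(poc), poc.shape))   # -> (5, 9)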

    Gaussian Belief Propagation on a Field-Programmable Gate Array for Solving Linear Equation Systems

    Solving Linear Equation Systems (LESs) is a common problem in numerous fields of science. Even though the problem is well studied and powerful solvers are available nowadays, solving LESs is still a bottleneck in many numerical applications in terms of computation time. This issue especially pertains to applications in mobile robotics constrained by real-time requirements, where power consumption and weight additionally play an important role. This paper provides a general framework to approximately solve large LESs by Gaussian Belief Propagation (GaBP), which is extremely well suited for parallelization and implementation in hardware on a Field-Programmable Gate Array (FPGA). We derive the simple update rules of the message-passing algorithm for GaBP and show how to implement the approach efficiently on a System on a Programmable Chip (SoPC). In particular, multiple dedicated co-processors take care of recurring computations in GaBP. Exploiting multiple Direct Memory Access (DMA) controllers in scatter-gather mode and the available arithmetic logic slices for numerical calculations accelerates the algorithm. The presented evaluations demonstrate that the approach not only provides an accurate approximate solution of the LES, but also outperforms traditional solvers with respect to computation time for certain LESs.
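
    The update rules referred to above are the standard GaBP message-passing equations and can be exercised in a few lines. The sketch below is a plain software reference for a small symmetric, diagonally dominant system; it does not model the paper's SoPC co-processors, DMA transfers or scheduling.

        import numpy as np

        def gabp_solve(A, b, iters=100, tol=1e-9):
            """Solve A x = b by Gaussian Belief Propagation (A symmetric, diagonally dominant).

            Each edge (i, j) carries a precision message P[i, j] and a mean message
            mu[i, j]; every update is local, which is what makes the algorithm easy
            to parallelise in hardware.
            """
            n = len(b)
            P = np.zeros((n, n))      # precision message i -> j
            mu = np.zeros((n, n))     # mean message i -> j
            x = b / np.diag(A)
            for _ in range(iters):
                P_new, mu_new = np.zeros_like(P), np.zeros_like(mu)
                for i in range(n):
                    for j in range(n):
                        if i == j or A[i, j] == 0.0:
                            continue
                        # aggregate what node i knows, excluding what j told it
                        P_excl = A[i, i] + P[:, i].sum() - P[j, i]
                        mu_excl = (b[i] + (P[:, i] * mu[:, i]).sum()
                                   - P[j, i] * mu[j, i]) / P_excl
                        P_new[i, j] = -A[i, j] ** 2 / P_excl
                        mu_new[i, j] = P_excl * mu_excl / A[i, j]
                P, mu = P_new, mu_new
                x_new = (b + (P * mu).sum(axis=0)) / (np.diag(A) + P.sum(axis=0))
                if np.max(np.abs(x_new - x)) < tol:
                    return x_new
                x = x_new
            return x

        A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])
        b = np.array([1.0, 2.0, 3.0])
        print(gabp_solve(A, b), np.linalg.solve(A, b))   # the two should agree

    For matrices that are not diagonally dominant (more precisely, not walk-summable), GaBP is not guaranteed to converge, so such a software reference is also a useful convergence check before committing a system to hardware.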