16 research outputs found

    A scalable, portable, FPGA-based implementation of the Unscented Kalman Filter

    Get PDF
    Sustained technological progress has reached a point where robotic/autonomous systems may soon become ubiquitous. For these systems to be genuinely useful, an increase in autonomous capability is necessary in aerospace, as well as other, applications. Greater aerospace autonomous capability creates a need for high-performance state estimation. However, the desire to reduce costs through simplified development processes and compact form factors can limit performance. A hardware-based approach, such as using a Field Programmable Gate Array (FPGA), is common when high performance is required, but hardware approaches tend to have a more complicated development process than traditional software approaches; greater development complexity, in turn, results in higher costs. Leveraging the advantages of both hardware-based and software-based approaches, a hardware/software (HW/SW) codesign of the Unscented Kalman Filter (UKF), based on an FPGA, is presented. The UKF is split into an application-specific part, implemented in software to retain portability, and a non-application-specific part, implemented in hardware as a parameterisable IP core to increase performance. The codesign is split into three versions (Serial, Parallel and Pipeline) to provide flexibility when choosing the balance between resources and performance, allowing system designers to simplify the development process. Simulation results demonstrating two possible implementations of the design, a nanosatellite application and a Simultaneous Localisation and Mapping (SLAM) application, are presented. These results validate the performance of the HW/SW UKF and demonstrate its portability, particularly in small aerospace systems. Implementation (synthesis, timing, power) details for a variety of situations are presented and analysed to demonstrate how the HW/SW codesign can be scaled for any application.
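
    To make the HW/SW split concrete: the heart of the non-application-specific part of a UKF is the unscented transform, fixed matrix arithmetic that is the same for every application. Below is a minimal NumPy sketch of that transform in its standard scaled formulation; the function name, the default alpha/beta/kappa values, and the use of a Cholesky square root are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def unscented_transform(mean, cov, f, alpha=1e-3, beta=2.0, kappa=0.0):
    """Propagate (mean, cov) through a nonlinear model f via sigma points."""
    n = mean.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * cov)            # matrix square root
    sigma = np.vstack([mean, mean + S.T, mean - S.T])  # 2n + 1 sigma points

    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))     # mean weights
    wc = wm.copy()                                     # covariance weights
    wm[0] = lam / (n + lam)
    wc[0] = wm[0] + (1.0 - alpha**2 + beta)

    y = np.array([f(s) for s in sigma])  # f: the application-specific model
    y_mean = wm @ y
    d = y - y_mean
    y_cov = (wc[:, None] * d).T @ d
    return y_mean, y_cov
```

    Only `f` changes between applications; everything around it is fixed matrix arithmetic, which is the kind of computation that can live in a parameterisable hardware IP core while the model stays in software.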

    A Scalable, FPGA-Based Implementation of the Unscented Kalman Filter

    Get PDF
    Autonomous aerospace systems may well soon become ubiquitous, pending an increase in autonomous capability. Greater autonomous capability means there is a need for high-performance state estimation. However, the desire to reduce costs through simplified development processes and compact form factors can limit performance. A hardware-based approach, such as using a field-programmable gate array (FPGA), is common when high performance is required, but hardware approaches tend to have a more complicated development process than traditional software approaches; greater development complexity, in turn, results in higher costs. Leveraging the advantages of both hardware-based and software-based approaches, a hardware/software (HW/SW) codesign of the unscented Kalman filter (UKF), based on an FPGA, is presented. The UKF is split into an application-specific part, implemented in software to simplify the development process, and a non-application-specific part, implemented in hardware as a parameterisable ‘black box’ module (i.e. an IP core) to increase performance. Simulation results demonstrating a possible nanosatellite application of the design are presented; implementation (synthesis, timing, power) details are also presented.

    Generating Posit-Based Accelerators With High-Level Synthesis

    Get PDF
    Recently, the posit number system has demonstrated higher accuracy than standard floating-point arithmetic for many scientific applications. However, when it comes to implementing accelerators for these applications, tool support for this arithmetic format is still missing, especially during the hardware generation step. In this paper, we incorporate the posit data type into the high-level synthesis (HLS) design process, so that we can generate the implementation directly from a given behavioral specification, but using posit numbers instead of the classical floating-point notations. Our evaluations show that, even if posit-based circuits require more area than their floating-point counterparts, they offer higher accuracy at the same bitwidth. For example, using posit arithmetic can reduce computation errors by about two orders of magnitude compared to standard floating-point numbers. Our approach also includes an alternative that mitigates the high overheads of posits, broadening the potential use of this format. In particular, we propose a hybrid scheme that uses posit numbers only in the private local memory, while the accelerator operates in the classic floating-point notation. This solution is useful when designers want to optimize local memories and data transfers but must still use legacy HLS tools that only support traditional floating-point notations.
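
    For intuition about the format itself, here is a didactic Python sketch of decoding an n-bit posit with es exponent bits (sign, run-length-encoded regime, exponent, fraction). It is a hand-written illustration under simplifying assumptions, not the paper's tool flow, and it glosses over the corner case where a long regime truncates the exponent field.

```python
def decode_posit(bits, n=16, es=1):
    """Didactic decoder: n-bit posit (as an unsigned int) -> Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float('nan')                    # NaR ("not a real")
    sign = 1.0
    if bits >> (n - 1):                        # negative: two's complement first
        sign, bits = -1.0, (-bits) & mask
    first = (bits >> (n - 2)) & 1              # regime: a run of this bit value
    run, i = 1, n - 3
    while i >= 0 and (bits >> i) & 1 == first:
        run, i = run + 1, i - 1
    k = run - 1 if first else -run             # regime value
    rem = max(n - 2 - run, 0)                  # bits left after the terminator
    tail = bits & ((1 << rem) - 1)
    e = tail >> max(rem - es, 0)               # exponent field
    f_len = max(rem - es, 0)
    frac = (tail & ((1 << f_len) - 1)) / (1 << f_len) if f_len else 0.0
    useed = 1 << (1 << es)                     # useed = 2**(2**es)
    return sign * float(useed) ** k * 2.0 ** e * (1.0 + frac)

print(decode_posit(0x4000))   # 1.0
print(decode_posit(0x6000))   # 4.0 (regime k=1, useed=4)
```

    The variable-length regime is what gives posits their tapered accuracy (more fraction bits near 1.0), and it also makes posit decoders costlier than a fixed-field float unpack, one plausible source of the area overhead reported above.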

    New Results on Non-normalized Floating-point Formats

    Get PDF
    Compulsory normalization of the represented numbers is a key requirement of the floating-point standard. This requirement underpins fundamental characteristics of the standard, such as making the most of the available precision, reproducibility, and easier comparison and other operations. However, it also significantly restricts the efficiency with which basic arithmetic operations can be implemented. In many embedded applications it may be worthwhile to sacrifice the benefits of normalization for gains in implementation metrics. This paper analyzes and measures the effect of removing the normalization requirement in terms of precision and implementation savings for embedded applications. We propose several adder and multiplier architectures to deal with non-normalized floating-point numbers, and quantify the accuracy loss and the improvements in hardware implementation. Our experiments show that it is possible to reduce area and power consumption by up to 78% in ASIC and 50% in FPGA implementations with a reasonable accuracy loss. (Funding: TIN2016-80920-R, JA2012 P12-TIC-1692, JA2012 P12-TIC-147)
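
    To see where the savings come from, consider a toy software model of a non-normalized multiply on (significand, exponent) pairs; the field widths and semantics here are illustrative assumptions, not the paper's architectures.

```python
def nn_mul(a_sig, a_exp, b_sig, b_exp, sig_bits=8):
    """Toy non-normalized multiply on (sig, exp) pairs, value = sig * 2**exp.

    A normalized multiplier would count leading zeros in the product and
    shift the significand back into [1, 2); skipping that step removes the
    leading-zero counter and variable shifter from the datapath, at the
    cost of letting precision drift as leading zeros accumulate.
    """
    prod = a_sig * b_sig                               # 2*sig_bits-wide product
    return prod >> sig_bits, a_exp + b_exp + sig_bits  # truncate, no normalize

sig, exp = nn_mul(200, -8, 150, -8)   # 0.78125 * 0.5859375
print(sig * 2.0**exp)                 # 0.45703..., exact is 0.45776...
```

    The leading-zero counter and full-width normalization shifter are relatively expensive blocks in an FPGA floating-point unit, which is consistent with the area and power reductions the paper reports.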

    Reconfigurable hardware implementation of matrix operators for the numerical solution of linear systems

    Get PDF
    Master's thesis, Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Mecânica, 2014. This work presents a study of the implementation of matrix operators for the numerical solution of linear systems on FPGAs (Field Programmable Gate Arrays). The architectures were based on the direct QR and Schur methods, as well as on Gaussian elimination. The methods were developed using control-oriented and dataflow-oriented topologies with floating-point arithmetic, exploiting the intrinsic parallelism of the different algorithms for solving linear systems. The resulting architectures maintain control over error propagation while achieving performance gains in execution time, with a view to their applicability in inverse problems. Architectures were developed to compute the inverse of a matrix as well as the solution of a system of linear equations, based on the Gaussian elimination method (or its Gauss-Jordan variant). In addition, a novel architecture based on the Schur method was proposed and implemented, composed of the following circuits: QRD-MGS (QR Decomposition via Modified Gram-Schmidt), MMM (Matrix-Matrix Multiplication) and MDTM (Matrix-Diagonal-Transpose Multiplication). Furthermore, studies of resource usage for different matrix sizes, together with an error propagation analysis, were carried out to verify the applicability of the algorithms on reconfigurable architectures. The Gaussian elimination module developed in this work was also used to support the calculations of a GMDH neural network in an application predicting the 3D structure of a protein. Finally, two methodologies were implemented: Datapath Fusion, to maintain control of error propagation using only a single-precision representation, and Verification/Validation, to provide a standard benchmark for validating the hardware implementations.
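
    As a point of reference for the Gaussian elimination / Gauss-Jordan module described above, here is a minimal NumPy sketch of Gauss-Jordan matrix inversion with partial pivoting; this is a generic software reference (assuming a nonsingular input), not the thesis's circuit.

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Invert a nonsingular matrix A by Gauss-Jordan elimination."""
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])        # augmented [A | I]
    for col in range(n):
        pivot = col + np.argmax(np.abs(M[col:, col]))  # partial pivoting
        M[[col, pivot]] = M[[pivot, col]]              # swap pivot row up
        M[col] /= M[col, col]                          # scale pivot row to 1
        for row in range(n):
            if row != col:
                M[row] -= M[row, col] * M[col]         # eliminate the column
    return M[:, n:]                                    # [I | A^-1]

A = np.array([[4.0, 3.0], [6.0, 3.0]])
print(gauss_jordan_inverse(A) @ A)                     # ~ identity
```

    The row eliminations within each column step are mutually independent, which is the kind of intrinsic parallelism a dataflow FPGA architecture can exploit.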

    Power and Thermal Management of System-on-Chip

    Get PDF

    Precision analysis for hardware acceleration of numerical algorithms

    No full text
    The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating an algorithm onto hardware, the potential improvements obtainable by tuning the precision throughout an algorithm to meet a range or error specification are often overlooked; the major reason is that it is hard to choose a number system which can guarantee that any such specification is met. Instead, the problem is mitigated by opting for IEEE standard double precision arithmetic so as to be ‘no worse’ than a software implementation. However, flexibility in the number representation is one of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and ignoring this potential significantly limits the achievable performance. To optimise the performance of hardware reliably, we require a method that can tractably calculate tight bounds on the error or range of any variable within an algorithm; currently only a handful of methods to calculate such bounds exist, and these sacrifice either tightness or tractability, while simulation-based methods cannot guarantee their error estimates. This thesis presents a new method to calculate these bounds, taking into account both input ranges and finite precision effects, which we show to be, in general, tighter than existing methods; this in turn can be used to tune the hardware to the algorithm specifications. We demonstrate the use of this method to optimise hardware for various algorithms that accelerate the solution of a system of linear equations, which forms the basis of many problems in engineering and science, and show that significant performance gains can be obtained by using this new approach in conjunction with more traditional hardware optimisations.
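
    To make the bound-calculation problem concrete, the sketch below propagates value ranges with plain interval arithmetic, inflating each result by a per-operation relative rounding term. This is the classic sound-but-loose baseline such work improves upon, not the thesis's own method, and the eps value assumes double precision.

```python
def iv_add(a, b, eps=2**-53):
    """Interval sum, inflated by a worst-case relative rounding error."""
    lo, hi = a[0] + b[0], a[1] + b[1]
    r = eps * max(abs(lo), abs(hi))
    return lo - r, hi + r

def iv_mul(a, b, eps=2**-53):
    """Interval product: bound by the four corner products, then inflate."""
    ps = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    lo, hi = min(ps), max(ps)
    r = eps * max(abs(lo), abs(hi))
    return lo - r, hi + r

# Enclose x*y + y for x in [1, 2] and y in [-1, 3]:
x, y = (1.0, 2.0), (-1.0, 3.0)
print(iv_add(iv_mul(x, y), y))   # a guaranteed enclosure of the true range
```

    Because each interval is treated independently, correlated uses of a variable lose tightness (x - x, for instance, encloses to [-1, 1] rather than 0); that tightness-versus-tractability tradeoff is exactly what a tighter bounding method addresses.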

    Custom optimization algorithms for efficient hardware implementation

    No full text
    The focus is on real-time optimal decision making, with applications in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than are currently available if their application is to extend to highly dynamical systems and to setups with resource-constrained embedded computing platforms. A range of techniques is proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and for a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations, we develop a custom storage scheme for the KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance of our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special characteristics of our custom architectures, we propose several high-level parallel optimal control schemes that can reduce computation time. A novel optimization formulation is devised for reducing the computational effort of solving certain problems, independent of the computing platform used. In order to solve optimization problems in fixed-point arithmetic, which is significantly more resource-efficient than floating-point, tailored linear algebra algorithms were developed for solving the linear systems that form the computational bottleneck in many optimization methods; these methods come with guarantees for reliable operation. We also provide a finite-precision error analysis for fixed-point implementations of first-order methods, which can be used to minimize the use of resources while meeting accuracy specifications. The suggested techniques are demonstrated on several practical examples, including a hardware-in-the-loop setup for optimization-based control of a large airliner.
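
    As a flavour of the first-order methods such code generators target, here is a minimal projected-gradient sketch for a box-constrained QP, min 0.5*x'Hx + f'x subject to l <= x <= u; the fixed step size and iteration count are illustrative textbook choices, not the thesis's generated architecture.

```python
import numpy as np

def projected_gradient(H, f, l, u, iters=200):
    """Box-constrained QP by projected gradient with a fixed 1/L step."""
    L = np.linalg.norm(H, 2)            # Lipschitz constant of the gradient
    x = np.clip(np.zeros_like(f), l, u)
    for _ in range(iters):              # fixed, data-independent iteration count
        x = np.clip(x - (H @ x + f) / L, l, u)   # gradient step + projection
    return x

H = np.array([[2.0, 0.5], [0.5, 1.0]])
f = np.array([-1.0, -1.0])
print(projected_gradient(H, f, l=np.zeros(2), u=np.ones(2)))
```

    The appeal for hardware is that every iteration performs the same matrix-vector work on the same memory layout, so latency, resource use and (in fixed point) worst-case error can all be bounded at design time.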

    FPGA-Based Adaptive Digital Beamforming Using Machine Learning for MIMO Systems

    Get PDF
    In modern Multiple-Input and Multiple-Output (MIMO) systems, such as cellular and Wi-Fi technology, an array of antenna elements is used to spatially steer RF signals, with the goal of changing the overall antenna gain pattern to achieve a higher signal-to-interference-plus-noise ratio (SINR). Digital beamforming (DBF) achieves this steering effect by applying weighted coefficients to antenna elements, similar to digital filtering, which adjust the phase and gain of the received or transmitted signals. Since real-world MIMO systems are often used in dynamic environments, adaptive beamforming techniques have been used to overcome variable challenges to system SINR, such as dispersive channels or inter-device interference, by applying statistically based algorithms that calculate weights adaptively. However, large element-count array systems, with their high degrees of freedom (DOF), face many challenges in the practical application of these adaptive algorithms. The underlying statistical matrix methods can be computationally prohibitive, or must resort to non-optimal simplifications, in order to provide adaptive weights in time for an application on a given system's computational capability; for instance, MIMO communication devices with strict size, weight and power (SWaP) constraints often have processing limitations due to the use of low-power processors or Field-Programmable Gate Arrays (FPGAs). This thesis therefore addresses these adaptive MIMO challenges with a twofold approach. First, it is shown that advances in Machine Learning (ML) and Deep Neural Networks (DNNs) can be applied directly to the computationally complex problem of calculating optimal adaptive beamforming weights, via a custom Convolutional Neural Network (CNN). Second, the derived adaptive beamforming CNN is shown to map efficiently to programmable-logic FPGA resources, which can update adaptive coefficients in real time. This machine learning implementation is contrasted against the current state-of-the-art FPGA architecture for adaptive beamforming, which uses traditional Recursive Least Squares (RLS) computation, and is shown to provide adaptive beamforming weights faster and with fewer FPGA logic resources. The reduction in both processing latency and FPGA fabric utilization enables SWaP-constrained MIMO processors to perform adaptive beamforming for higher channel-count systems than is currently possible with traditional computation methods.
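
    For context on the RLS baseline mentioned above, here is a minimal NumPy sketch of recursive-least-squares adaptive weight estimation from array snapshots; the forgetting factor, initialisation, and reference-signal formulation are illustrative textbook choices, not the thesis's FPGA architecture.

```python
import numpy as np

def rls_weights(snapshots, desired, lam=0.99, delta=1e2):
    """Classic complex RLS update for adaptive beamforming weights.

    snapshots: (K, N) complex array samples (K snapshots, N elements)
    desired:   (K,) complex reference signal
    Returns w such that the beamformer output is y = w.conj() @ x.
    """
    K, N = snapshots.shape
    w = np.zeros(N, dtype=complex)
    P = delta * np.eye(N, dtype=complex)     # inverse-correlation estimate
    for x, d in zip(snapshots, desired):
        Px = P @ x
        k = Px / (lam + x.conj() @ Px)       # gain vector
        e = d - w.conj() @ x                 # a priori error
        w = w + k * np.conj(e)               # weight update
        P = (P - np.outer(k, x.conj() @ P)) / lam
    return w
```

    Each snapshot costs O(N^2) multiplies for the P update, and the sequential data dependence limits pipelining, which is consistent with the latency and resource savings reported for the CNN approach above.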