
    Square-rich fixed point polynomial evaluation on FPGAs

    Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, Horner's rule, requires the fewest operations but can suffer high latency because the computation is fully serial. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but achieve this at the expense of a large hardware overhead. This paper presents an efficient polynomial evaluation algorithm that reforms the evaluation process to include an increased number of squaring steps. By using a squarer design that is more efficient than a general multiplier, this results in polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex 6 FPGA. When applied to fixed-point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain over Horner's rule with only a 4% area overhead when evaluating 5th-degree polynomials.
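    The underlying idea can be contrasted in a few lines. The sketch below is a minimal Python illustration, not the paper's actual square-rich reformulation or squarer design: splitting a polynomial into even and odd parts halves the serial multiply chain at the cost of one squaring step, which maps onto cheaper dedicated hardware.

```python
# Minimal sketch contrasting Horner's rule with an even/odd split that
# trades part of the serial multiply chain for a squaring step.
# Illustrative only: the paper's square-rich reformulation and its
# dedicated squarer are more elaborate than this textbook split.

def horner(coeffs, x):
    """coeffs[i] is the coefficient of x**i; fully serial chain."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c           # one multiply-add per coefficient, in series
    return acc

def even_odd(coeffs, x):
    """p(x) = E(x^2) + x * O(x^2): the two sub-polynomials can be
    evaluated in parallel, and x*x maps to a cheap dedicated squarer."""
    x2 = x * x                      # squaring (cheaper than a general multiply in hardware)
    even = horner(coeffs[0::2], x2) # half-length chains -> roughly half the latency
    odd  = horner(coeffs[1::2], x2)
    return even + x * odd

coeffs = [3, -1, 4, 1, -5, 9]       # a 5th-degree example, as in the paper's evaluation
assert horner(coeffs, 2) == even_odd(coeffs, 2)
```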

    Mapping for maximum performance on FPGA DSP blocks

    The digital signal processing (DSP) blocks on modern field-programmable gate arrays (FPGAs) are highly capable and support a variety of datapath configurations. Unfortunately, inference in synthesis tools can fail to produce circuits that reach maximum DSP block throughput. We have developed a tool that maps graphs of add/sub/mult nodes to DSP blocks on Xilinx FPGAs, ensuring maximum throughput. This is done by delaying scheduling until after the graph has been partitioned onto DSP blocks, then scheduling based on their pipeline structure, resulting in a throughput-optimized implementation. Our tool also prepares equivalent implementations using a variety of other methods, including high-level synthesis (HLS), for comparison. We show that the proposed approach offers a 100% improvement in frequency over standard pipelined code and 23% over the Vivado HLS implementation, while retaining code portability, at the cost of a modest increase in logic resource usage.
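    As a rough illustration of the mapping idea (the node representation here is hypothetical, not the tool's internal IR), a pattern such as (a + b) * c + d can be matched onto the pre-adder, multiplier and post-adder of a single Xilinx DSP48-style block before any scheduling decisions are made:

```python
# Toy sketch of the mapping idea: fuse (a + b) * c + d patterns onto a
# DSP-style block with pre-adder, multiplier and post-adder, so the
# whole pattern runs inside one fully pipelined primitive.
# The node/edge representation here is hypothetical, not the tool's IR.

from dataclasses import dataclass

@dataclass
class Node:
    op: str            # 'add' or 'mul'; leaf inputs are plain strings
    lhs: object
    rhs: object

def fuse_dsp(root):
    """Match add(mul(add(a, b), c), d) and report it as one DSP block."""
    if (isinstance(root, Node) and root.op == 'add'
            and isinstance(root.lhs, Node) and root.lhs.op == 'mul'
            and isinstance(root.lhs.lhs, Node) and root.lhs.lhs.op == 'add'):
        pre = root.lhs.lhs
        return f"DSP48: ({pre.lhs} + {pre.rhs}) * {root.lhs.rhs} + {root.rhs}"
    return None

expr = Node('add', Node('mul', Node('add', 'a', 'b'), 'c'), 'd')
print(fuse_dsp(expr))   # -> DSP48: (a + b) * c + d
```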

    Precision analysis for hardware acceleration of numerical algorithms

    The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating an algorithm onto hardware, the potential improvements obtainable by tuning the precision throughout an algorithm to meet a range or error specification are often overlooked; the major reason is that it is hard to choose a number system which can guarantee that any such specification is met. Instead, the problem is mitigated by opting for IEEE standard double precision arithmetic so as to be ‘no worse’ than a software implementation. However, flexibility in the number representation is one of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and ignoring this potential significantly limits the achievable performance. To optimise the performance of hardware reliably, we require a method that can tractably calculate tight bounds on the error or range of any variable within an algorithm; currently only a handful of methods to calculate such bounds exist, and these sacrifice either tightness or tractability, while simulation-based methods cannot guarantee their error estimates. This thesis presents a new method to calculate these bounds, taking into account both input ranges and finite-precision effects, which we show to be, in general, tighter than existing methods; this in turn can be used to tune the hardware to the algorithm's specifications. We demonstrate the use of this software to optimise hardware for various algorithms that accelerate the solution of a system of linear equations, which forms the basis of many problems in engineering and science, and show that significant performance gains can be obtained by using this new approach in conjunction with more traditional hardware optimisations.
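    To give a flavour of the kind of bound propagation involved, the following minimal sketch (our own illustration, not the thesis's method, which is tighter) carries a value interval and a worst-case absolute error through each fixed-point operation:

```python
# Minimal sketch of range/error bound propagation: each variable carries
# a value interval plus a worst-case absolute error, and every fixed-point
# operation widens both. Plain intervals like these overestimate when
# variables are correlated, which is exactly the looseness tighter
# analyses aim to avoid.

class Bound:
    def __init__(self, lo, hi, err=0.0):
        self.lo, self.hi, self.err = lo, hi, err

    def __add__(self, other):
        return Bound(self.lo + other.lo, self.hi + other.hi,
                     self.err + other.err)

    def __mul__(self, other):
        corners = [a * b for a in (self.lo, self.hi)
                         for b in (other.lo, other.hi)]
        m = max(abs(self.lo), abs(self.hi))
        n = max(abs(other.lo), abs(other.hi))
        # |x'y' - xy| <= |x|*err_y + |y|*err_x + err_x*err_y
        return Bound(min(corners), max(corners),
                     m * other.err + n * self.err + self.err * other.err)

    def round_to(self, frac_bits):
        """Model truncation to a fixed-point format with frac_bits bits."""
        return Bound(self.lo, self.hi, self.err + 2.0 ** -frac_bits)

x = Bound(-1.0, 1.0).round_to(16)
y = Bound(0.5, 2.0).round_to(16)
z = (x * y + y).round_to(16)
print(f"range [{z.lo}, {z.hi}], worst-case error {z.err:.3e}")
```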

    Reconfigurable elliptic curve cryptography

    Elliptic Curve Cryptosystems (ECC) have been proposed as an alternative to other established public key cryptosystems such as RSA (Rivest-Shamir-Adleman). ECC provides more security per bit than other known public key schemes based on the discrete logarithm problem. Smaller key sizes result in faster computations, lower power consumption, and memory and bandwidth savings, making ECC a fast, flexible and cost-effective solution for providing security in constrained environments. Implementing ECC on a reconfigurable platform combines the speed, security and concurrency of hardware with the flexibility of the software approach. This work proposes a generic architecture for an elliptic curve cryptosystem on a Field Programmable Gate Array (FPGA) that performs an elliptic curve scalar multiplication in 1.16 milliseconds over GF(2^163), which is considerably faster than most other documented implementations. One of the benefits of the proposed processor architecture is that it is easily reprogrammable to use different algorithms and is adaptable to any field order. Also, through reconfiguration, the arithmetic unit can be optimized for different area/speed requirements. The mathematics uses binary extension fields of the form GF(2^n) as the underlying field and a polynomial basis for the representation of field elements. A significant performance gain is obtained by using projective coordinates for the points on the curve during the computation.
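    For illustration, here is a bit-serial software sketch of polynomial-basis multiplication in GF(2^163), reduced modulo the standard irreducible pentanomial x^163 + x^7 + x^6 + x^3 + 1. A hardware field unit computes the same operation with XOR trees in a handful of cycles rather than this serial loop:

```python
# Polynomial-basis GF(2^163) multiplication: carry-less shift-and-add
# followed by reduction modulo the irreducible pentanomial
# x^163 + x^7 + x^6 + x^3 + 1. Field elements are bit vectors stored
# as Python ints; bit i is the coefficient of x^i.

N = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_mul(a, b):
    """Multiply two field elements."""
    acc = 0
    while b:                     # carry-less (XOR) shift-and-add
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    # reduce the product (degree up to 2N-2) back below degree N
    for i in range(2 * N - 2, N - 1, -1):
        if acc >> i & 1:
            acc ^= POLY << (i - N)
    return acc

a, b = 0b1011, 0b110
print(hex(gf_mul(a, b)))         # small operands here; any 163-bit values work
```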

    Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

    Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory (HBM) technologies, but when applications are memory-bound, designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from a high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to produce systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming to fully exploit the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy efficient than expert-crafted Intel CPU implementations.
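    As a hypothetical illustration of the style of input such a flow consumes (the paper's actual DSL syntax is not shown here), a spectral-element-style tensor contraction can be written as pure index arithmetic, with NumPy's einsum standing in for the DSL, leaving parallelisation, HBM banking and on-chip buffering entirely to the compiler:

```python
# Hypothetical stand-in for the tensor-expression DSL: the computation
# is stated as index arithmetic only, with no reference to memory layout
# or parallel hardware. Shapes and the kernel are illustrative.

import numpy as np

# An interpolation kernel typical of spectral-element CFD codes:
# u[e,a,b,c] contracted with a 1-D operator A along each axis.
n, p = 64, 8                               # elements, points per axis
A = np.random.rand(p, p)
u = np.random.rand(n, p, p, p)

v = np.einsum('ia,jb,kc,eabc->eijk', A, A, A, u)
print(v.shape)                             # (64, 8, 8, 8)
```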

    Fracturable DSP block for multi-context reconfigurable architectures

    Multi-context architectures such as NATURE enable low-power applications to leverage fast context switching for improved energy efficiency and a lower area footprint. The NATURE architecture incorporates 16-bit reconfigurable DSP blocks for accelerating arithmetic computations; however, their fixed precision prevents efficient reuse in mixed-width arithmetic circuits. This paper presents an improved DSP block architecture for NATURE, with native support for temporal folding and run-time fracturability. The proposed DSP block can compute multiple sub-width operations in the same clock cycle and can dynamically switch between sub-width and full-width operations in different cycles. The NanoMap tool for mapping circuits onto NATURE is extended to exploit the fracturable multiplier unit incorporated in the DSP block. We demonstrate the efficiency of the proposed dynamically fracturable DSP block by implementing logic-intensive and compute-intensive benchmark applications. Our results show that the fracturable DSP block can achieve a 53.7% reduction in DSP block utilization and a 42.5% reduction in area, with a 122.5% reduction in power-delay product, without exploiting logic folding. We also observe an average 6.43% reduction in power-delay product for circuits that use NATURE's temporal folding, compared with the existing full-precision DSP block in NATURE, leading to highly compact, energy-efficient designs.
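    The arithmetic identity behind sub-width fracturing can be sketched directly: two independent products that share a multiplicand fit into one wide multiplication when guard bits keep their partial products in disjoint bit fields. This toy Python version (the real DSP block fractures its partial-product array instead) illustrates the principle:

```python
# Two independent 8-bit x 8-bit products that share one multiplicand,
# computed with a single wide multiply: the partial products land in
# disjoint bit fields because enough guard bits separate the packed
# operands. Only the arithmetic identity is shown here.

GUARD = 18          # > 16 bits, so an 8x8 product (at most 16 bits)
                    # cannot overflow into the upper lane

def two_products(a0, a1, b):
    """Return (a0*b, a1*b) for 8-bit operands using one multiplication."""
    packed = (a1 << GUARD) | a0          # one wide operand, two lanes
    wide = packed * b                    # single full-width multiply
    lo = wide & ((1 << GUARD) - 1)       # lane 0: a0*b
    hi = wide >> GUARD                   # lane 1: a1*b
    return lo, hi

assert two_products(200, 131, 255) == (200 * 255, 131 * 255)
```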

    Strategies for neural networks in ballistocardiography with a view towards hardware implementation

    A thesis submitted for the degree of Doctor of Philosophy at the University of Luton. The work described in this thesis is based on the results of a clinical trial conducted by the research team at the Medical Informatics Unit of the University of Cambridge, which show that the ballistocardiogram (BCG) has prognostic value in detecting impaired left ventricular function before it becomes clinically overt as myocardial infarction leading to sudden death. The objective of this study is to develop and demonstrate a framework for realising an on-line BCG signal classification model in a portable device that would have the potential to find pathological signs as early as possible for home health care. Two new on-line automatic BCG classification models for time-domain BCG classification are proposed. Both systems are based on a two-stage process: input feature extraction followed by a neural classifier. One system uses a principal component analysis neural network, and the other a discrete wavelet transform, to reduce the input dimensionality. Results of the classification, dimensionality reduction, and comparison are presented. It is indicated that the combined wavelet transform and MLP system performs more reliably than the combined neural networks system in situations where the data available to determine the network parameters is limited. Moreover, the wavelet transform requires no prior knowledge of the statistical distribution of data samples, and the computational complexity and training time are reduced. Overall, a methodology for realising an automatic BCG classification system for a portable instrument is presented. A fully parallel neural network design for a low-cost platform using field programmable gate arrays (Xilinx's XC4000 series) is explored. This addresses the potential speed requirements in the biomedical signal processing field. It also demonstrates a flexible hardware design approach so that an instrument's parameters can be updated as data expands with time. To reduce the hardware design complexity and to increase system performance, a hybrid learning algorithm using random optimisation and the backpropagation rule is developed to achieve an efficient weight update mechanism in learning with low weight precision. The simulation results show that the hybrid learning algorithm is effective in solving the network paralysis problem and converges much faster than the standard backpropagation rule. The hidden and output layer nodes have been mapped onto Xilinx FPGAs with automatic placement and routing tools. The static timing analysis results suggest that the proposed network implementation could sustain a performance of 2.7 billion connections per second.
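    A minimal sketch of the hybrid update idea, with a toy quadratic loss and an assumed low-precision weight grid standing in for the thesis's network: take the backpropagation step when it improves the loss, and fall back to a random perturbation when the quantised gradient step stalls, which is the "network paralysis" case:

```python
# Hybrid weight update: gradient step if it helps, random search if the
# quantised gradient step stalls. The loss, step sizes and weight grid
# below are toy stand-ins, not the thesis's actual network.

import random

def quantise(w, step=1 / 256):            # low-precision weight grid
    return round(w / step) * step

def hybrid_step(w, loss, grad, lr=0.1, sigma=0.05):
    cand = quantise(w - lr * grad(w))     # backprop-style move
    if loss(cand) < loss(w):
        return cand
    # gradient step made no progress on the grid: random optimisation
    cand = quantise(w + random.gauss(0.0, sigma))
    return cand if loss(cand) < loss(w) else w

loss = lambda w: (w - 0.3) ** 2
grad = lambda w: 2 * (w - 0.3)
w = 1.0
for _ in range(50):
    w = hybrid_step(w, loss, grad)
print(round(w, 4))                        # converges near 0.3
```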

    A novel parallel algorithm for surface editing and its FPGA implementation

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy. Surface modelling and editing is one of the important subjects in computer graphics. Decades of computer graphics research have addressed both low-level, hardware-related algorithms and high-level, abstract software. Computer graphics has succeeded in many application areas, such as multimedia, visualisation, virtual reality and the Internet. However, the hardware realisation of the OpenGL architecture on an FPGA (field programmable gate array) is beyond the scope of most computer graphics research. It is an uncultivated research area in which the OpenGL pipeline, from hardware through the whole embedded system (ES) up to applications, is implemented in an FPGA chip. This research proposes a hybrid approach investigating both software and hardware methods. It aims to bridge the gap between software and hardware methods and to enhance overall graphics performance. It consists of four parts: the construction of an FPGA-based ES, a Mesa-based OpenGL implementation for FPGA-based ESs, parallel processing, and a novel algorithm for surface modelling and editing. The FPGA-based ES is built up: in addition to the Nios II soft processor and DDR SDRAM memory, it consists of the LCD display device, frame buffers, a video pipeline, and an algorithm-specific module to support graphics processing. Since no implementation of OpenGL ES is available for FPGA-based ESs, a specific OpenGL implementation based on Mesa is carried out. Because of the limited FPGA resources, the implementation adopts fixed-point arithmetic, which offers faster computation and lower storage than floating-point arithmetic while providing accuracy satisfying the needs of 3D rendering. Moreover, the implementation includes Bézier-spline curve and surface algorithms to support surface modelling and editing. Pipelined parallelism and co-processors are used to accelerate graphics processing in this research. These two parallelism methods extend traditional computation parallelism to fine-grained parallel tasks in FPGA-based ESs. The novel algorithm for surface modelling and editing, called the Progressive and Mixing Algorithm (PAMA), is proposed and implemented on FPGA-based ESs. Compared with the two main surface editing methods, subdivision and deformation, PAMA eliminates the large storage requirement and computing cost of intermediate processes. With four independent shape parameters, PAMA can freely model and edit the shape of an open or closed surface while globally maintaining zero-order geometric continuity. PAMA can be applied not only to FPGA-based ESs but also to other platforms. With parallel processing, small size, and low costs of computing, storage and power, the FPGA-based ES provides an effective hybrid solution to surface modelling and editing.
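    As an illustration of the fixed-point style of computation involved (Q16.16 is an assumed format here; the PAMA algorithm itself is not reproduced), a cubic Bézier curve point can be evaluated with de Casteljau's algorithm using only integer operations:

```python
# Cubic Bezier evaluation via de Casteljau's algorithm in Q16.16
# fixed-point arithmetic: every value is an integer holding the real
# number scaled by 2^16, so no floating point is needed.

FRAC = 16                                 # Q16.16: 16 fractional bits

def to_fx(x):     return int(round(x * (1 << FRAC)))
def from_fx(x):   return x / (1 << FRAC)
def fx_mul(a, b): return (a * b) >> FRAC  # multiply, then drop extra bits

def lerp(a, b, t):                        # a + (b - a) * t, all Q16.16
    return a + fx_mul(b - a, t)

def bezier(ctrl, t):
    """de Casteljau: repeated linear interpolation of control points."""
    pts = [to_fx(p) for p in ctrl]
    tf = to_fx(t)
    while len(pts) > 1:
        pts = [lerp(p, q, tf) for p, q in zip(pts, pts[1:])]
    return from_fx(pts[0])

print(bezier([0.0, 1.0, 1.0, 0.0], 0.5))  # -> 0.75 for this cubic
```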