524 research outputs found

    Optimisations arithmétiques et synthèse de haut niveau

    Get PDF
    High-level synthesis (HLS) tools offer increased productivity regarding FPGA programming.However, due to their relatively young nature, they still lack many arithmetic optimizations.This thesis proposes safe arithmetic optimizations that should always be applied.These optimizations are simple operator specializations, following the C semantic.Other require to a lift the semantic embedded in high-level input program languages, which are inherited from software programming, for an improved accuracy/cost/performance ratio.To demonstrate this claim, the sum-of-product of floating-point numbers is used as a case study. The sum is performed on a fixed-point format, which is tailored to the application, according to the context in which the operator is instantiated.In some cases, there is not enough information about the input data to tailor the fixed-point accumulator.The fall-back strategy used in this thesis is to generate an accumulator covering the entire floating-point range.This thesis explores different strategies for implementing such a large accumulator, including new ones.The use of a 2's complement representation instead of a sign+magnitude is demonstrated to save resources and to reduce the accumulation loop delay.Based on a tapered precision scheme and an exact accumulator, the posit number systems claims to be a candidate to replace the IEEE floating-point format.A throughout analysis of posit operators is performed, using the same level of hardware optimization as state-of-the-art floating-point operators.Their cost remains much higher that their floating-point counterparts in terms of resource usage and performance. Finally, this thesis presents a compatibility layer for HLS tools that allows one code to be deployed on multiple tools.This library implements a strongly typed custom size integer type along side a set of optimized custom operators.À cause de la nature relativement jeune des outils de synthèse de haut-niveau (HLS), de nombreuses optimisations arithmétiques n'y sont pas encore implémentées. Cette thèse propose des optimisations arithmétiques se servant du contexte spécifique dans lequel les opérateurs sont instanciés.Certaines optimisations sont de simples spécialisations d'opérateurs, respectant la sémantique du C.D'autres nécéssitent de s'éloigner de cette sémantique pour améliorer le compromis précision/coût/performance.Cette proposition est démontré sur des sommes de produits de nombres flottants.La somme est réalisée dans un format en virgule-fixe défini par son contexte.Quand trop peu d’informations sont disponibles pour définir ce format en virgule-fixe, une stratégie est de générer un accumulateur couvrant l'intégralité du format flottant.Cette thèse explore plusieurs implémentations d'un tel accumulateur.L'utilisation d'une représentation en complément à deux permet de réduire le chemin critique de la boucle d'accumulation, ainsi que la quantité de ressources utilisées. Un format alternatif aux nombres flottants, appelé posit, propose d'utiliser un encodage à précision variable.De plus, ce format est augmenté par un accumulateur exact.Pour évaluer précisément le coût matériel de ce format, cette thèse présente des architectures d'opérateurs posits, implémentés avec le même degré d'optimisation que celui de l'état de l'art des opérateurs flottants.Une analyse détaillée montre que le coût des opérateurs posits est malgré tout bien plus élevé que celui de leurs équivalents flottants.Enfin, cette thèse présente une couche de compatibilité entre outils de HLS, permettant de viser plusieurs outils avec un seul code. Cette bibliothèque implémente un type d'entiers de taille variable, avec de plus une sémantique strictement typée, ainsi qu'un ensemble d'opérateurs ad-hoc optimisés

    Software-based approximate computing for mathematical functions

    Get PDF
    The four arithmetic floating-point operations (+,−,÷and×) have been precisely specified in IEEE-754 since 1985, but the situation for floating-point mathematical libraries and even some hardware operations such as fused multiply-add is more nuanced as there are varying opinions on which standards should be followed and when it is acceptable to allow some error or when it is necessary to be correctly-rounded. Deterministic correctly-rounded elementary mathematical functions are important in many applications. Others are tolerant to some level of error and would benefit from less accurate, better-performing approximations. We found that, despite IEEE-754 (2008 and 2019 only)specifying that ‘recommended functions’ such as sin, cos or log should be correctly rounded, the mathematical libraries available through standard interfaces in popular programming languages provide neither correct-rounding nor maximally performing approximations, partly due to the differing accuracy requirements of these functions in conflicting standards provided for some languages, such as C. This dissertation seeks to explore the current methods used for the implementation of mathematical functions, show the error present in them and demonstrate methods to produce both low-cost correctly-rounded solutions and better approximations for specific use-cases. This is achieved by: First, exploring the error within existing mathematical libraries and examining how it is impacting existing applications and the development of programming language standards. We then make two contributions which address the accuracy and standard conformance problems that were found: 1) an approach for a correctly-rounded 32-bit implementation of the elementary functions with minimal additional performance cost on modern hardware; and 2) an approach for developing a better performing incorrectly-rounded solution for use when some error is acceptable and conforming with the IEEE-754 standard is not a requirement. For the latter contribution, we introduce a tool for semi-automated generic code sensitivity analysis and approximation. Next, we target the creation of approximations for the standard activation functions used in neural networks. Identifying that significant time is spent in the computation of the activation functions, we generate approximations with different levels of error and better performance characteristics. These functions are then tested in standard neural networks to determine if the approximations have any detrimental effect on the output of the network. We show that, for many networks and activation functions, very coarse approximations are suitable replacements to train the networks equally well at a lower overall time cost. This dissertation makes original contributions to the area of approximate computing. We demonstrate new approaches to safe-approximation and justify approximate computation generally by showing that existing mathematical libraries are already suffering the downsides of approximation and latent error without fully exploiting the optimisation space available due to the existing tolerance to that error and showing that correctly-rounded solutions are possible without a significant performance impact for many 32-bit mathematical functions

    pocl: A Performance-Portable OpenCL Implementation

    Get PDF
    OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with different characteristics, and, thus, required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications when compiled using pocl were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via arxi

    Testing interval arithmetic libraries, including their IEEE-1788 compliance

    Get PDF
    International audienceAs developers of libraries implementing interval arithmetic, we faced the same difficulties when it came to testing our libraries. What must be tested? How can we devise relevant test cases for unit testing? How can we ensure a high (and possibly 100%) test coverage? In this paper we list the different aspects that, in our opinion, must be tested, giving indications on the choice of test cases. Then we examine how several interval arithmetic libraries actually perform tests. Next, we introduce two frameworks developed specifically to gather test cases and to incorporate easily new libraries in order to test them, namely JInterval and ITF1788. Not every important aspects of our libraries fit in these frameworks and we list extra tests that we deem important, but not easy, to perform

    Digital signal processor fundamentals and system design

    Get PDF
    Digital Signal Processors (DSPs) have been used in accelerator systems for more than fifteen years and have largely contributed to the evolution towards digital technology of many accelerator systems, such as machine protection, diagnostics and control of beams, power supply and motors. This paper aims at familiarising the reader with DSP fundamentals, namely DSP characteristics and processing development. Several DSP examples are given, in particular on Texas Instruments DSPs, as they are used in the DSP laboratory companion of the lectures this paper is based upon. The typical system design flow is described; common difficulties, problems and choices faced by DSP developers are outlined; and hints are given on the best solution

    Energy efficient FPGA based hardware accelerators for financial applications

    Get PDF

    Auto-tuning compiler options for HPC

    Get PDF

    Proceedings of the 7th Conference on Real Numbers and Computers (RNC'7)

    Get PDF
    These are the proceedings of RNC7

    Multiprocessor platform using LEON3 processor

    Get PDF
    The recent advances in embedded systems world, lead us to more complex systems with application specific blocks (IP cores), the System on Chip (SoC) devices. A good example of these complex devices can be encountered in the cell phones that can have image processing cores, communication cores, memory card cores, and others. The need of augmenting systems’ processing performance with lowest power, leads to a concept of Multiprocessor System on Chip (MSoC) in which the execution of multiple tasks can be distributed along various processors. This thesis intends to address the creation of a synthesizable multiprocessing system to be placed in a FPGA device, providing a good flexibility to tailor the system to a specific application. To deliver a multiprocessing system, will be used the synthesisable 32-bit SPARC V8 compliant, LEON3 processor.Os avanços recentes no mundo dos sistemas embebidos levam-nos a sistemas mais complexos com blocos para aplicações específicas (IP cores), os dispositivos System on Chip (SoC). Um bom exemplo destes complexos dispositivos pode ser encontrado nos telemóveis, que podem conter cores de processamento de imagem, cores de comunicações, cores para cartões de memória, entre outros. A necessidade de aumentar o desempenho dos sistemas de processamento com o menor consumo possível, leva ao conceito de Multiprocessor System on Chip (MSoC) em que a execução de múltiplas tarefas pode ser distribuída por vários processadores. Esta Tese pretende abordar a criação de um sistema de multiprocessamento sintetizável para ser colocado numa FPGA, proporcionando uma boa flexibilidade para a adaptação do sistema a uma aplicação específica. Para obter o sistema multiprocessamento, irá ser utilizado o processador sintetizável SPARC V8 de 32-bit, LEON3
    corecore