28 research outputs found

    Optimistic Parallelization of Floating-Point Accumulation

    Get PDF
    Floating-point arithmetic is notoriously non-associative due to the limited precision representation which demands intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6Ă— with randomly generated data and 3-7Ă— with summations extracted from Conjugate Gradient benchmarks

    Pipelining Saturated Accumulation

    Get PDF
    Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3(XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM

    Design-space exploration for the Kulisch accumulator

    Get PDF
    Floating-point sums and dot products accumulate rounding errors that may render the result very inaccurate. To address this, Kulisch proposed to use an internal accumulator large enough to cover the full exponent range of floating-point. With it, sums and dot products become exact operations. This idea failed to materialize in general purpose processors, as it was considered to slow and/or too expensive in terms of resources. It may however be an interesting option in recon-figurable computing, where a designer may use use smaller, more resource-efficient floating-point formats, knowing that sums and dot products will be exact. Another motivation of this work is that these exact operations, contrary to classical floating point ones, are associative, which enables better compiler optimizations. This work therefore compares, in the context of modern FPGAs, several implementations of the Kulisch accumulator: three proposed by Kulisch, and two novel ones. These architectures are implemented in a VivadoHLS-compliant C++ generator that is fully customiz-able. Comparisons targeting Xilinx's Kintex 7 FPGAs show improvement over Kulisch' proposal in both area and speed. In single precision, compared with a naive use of classical operators , the proposed accumulator runs at similar frequency, consumes 10x more resource in single precision, but reduces the overall latency of a large dot product by 25x while vastly improving accuracy

    Optimisations arithmétiques et synthèse de haut niveau

    Get PDF
    High-level synthesis (HLS) tools offer increased productivity regarding FPGA programming.However, due to their relatively young nature, they still lack many arithmetic optimizations.This thesis proposes safe arithmetic optimizations that should always be applied.These optimizations are simple operator specializations, following the C semantic.Other require to a lift the semantic embedded in high-level input program languages, which are inherited from software programming, for an improved accuracy/cost/performance ratio.To demonstrate this claim, the sum-of-product of floating-point numbers is used as a case study. The sum is performed on a fixed-point format, which is tailored to the application, according to the context in which the operator is instantiated.In some cases, there is not enough information about the input data to tailor the fixed-point accumulator.The fall-back strategy used in this thesis is to generate an accumulator covering the entire floating-point range.This thesis explores different strategies for implementing such a large accumulator, including new ones.The use of a 2's complement representation instead of a sign+magnitude is demonstrated to save resources and to reduce the accumulation loop delay.Based on a tapered precision scheme and an exact accumulator, the posit number systems claims to be a candidate to replace the IEEE floating-point format.A throughout analysis of posit operators is performed, using the same level of hardware optimization as state-of-the-art floating-point operators.Their cost remains much higher that their floating-point counterparts in terms of resource usage and performance. Finally, this thesis presents a compatibility layer for HLS tools that allows one code to be deployed on multiple tools.This library implements a strongly typed custom size integer type along side a set of optimized custom operators.À cause de la nature relativement jeune des outils de synthèse de haut-niveau (HLS), de nombreuses optimisations arithmétiques n'y sont pas encore implémentées. Cette thèse propose des optimisations arithmétiques se servant du contexte spécifique dans lequel les opérateurs sont instanciés.Certaines optimisations sont de simples spécialisations d'opérateurs, respectant la sémantique du C.D'autres nécéssitent de s'éloigner de cette sémantique pour améliorer le compromis précision/coût/performance.Cette proposition est démontré sur des sommes de produits de nombres flottants.La somme est réalisée dans un format en virgule-fixe défini par son contexte.Quand trop peu d’informations sont disponibles pour définir ce format en virgule-fixe, une stratégie est de générer un accumulateur couvrant l'intégralité du format flottant.Cette thèse explore plusieurs implémentations d'un tel accumulateur.L'utilisation d'une représentation en complément à deux permet de réduire le chemin critique de la boucle d'accumulation, ainsi que la quantité de ressources utilisées. Un format alternatif aux nombres flottants, appelé posit, propose d'utiliser un encodage à précision variable.De plus, ce format est augmenté par un accumulateur exact.Pour évaluer précisément le coût matériel de ce format, cette thèse présente des architectures d'opérateurs posits, implémentés avec le même degré d'optimisation que celui de l'état de l'art des opérateurs flottants.Une analyse détaillée montre que le coût des opérateurs posits est malgré tout bien plus élevé que celui de leurs équivalents flottants.Enfin, cette thèse présente une couche de compatibilité entre outils de HLS, permettant de viser plusieurs outils avec un seul code. Cette bibliothèque implémente un type d'entiers de taille variable, avec de plus une sémantique strictement typée, ainsi qu'un ensemble d'opérateurs ad-hoc optimisés

    Adaptive FPGA NoC-based Architecture for Multispectral Image Correlation

    Full text link
    An adaptive FPGA architecture based on the NoC (Network-on-Chip) approach is used for the multispectral image correlation. This architecture must contain several distance algorithms depending on the characteristics of spectral images and the precision of the authentication. The analysis of distance algorithms is required which bases on the algorithmic complexity, result precision, execution time and the adaptability of the implementation. This paper presents the comparison of these distance computation algorithms on one spectral database. The result of a RGB algorithm implementation was discussed

    Exploration architecturale de l'accumulateur de Kulisch

    Get PDF
    National audienceLes sommes de produit utilisant le format flottant accumulent des erreurs d’arrondi pouvantaltérer la précision du résultat. Face à ce constat, Kulisch a proposé d’utiliser un accumulateurinterne suffisamment grand pour couvrir l’éventail d’exposants flottants, ce qui permet de nejamais arrondir les additions. Cette architecture n’a jamais aboutie dans les processeurs grandpublic car elle était considérée comme trop lente et/ou utilisant trop de ressources. Cependant,elle peut être intéressante dans le cadre des FPGAs pour deux raisons. D’une part, on peut yutiliser un format flottant non standard, plus petit que 32 bits. La faible précision du formatpeut être compensée par l’exactitude de l’accumulation dans une architecture dont la tailledevient raisonnable. D’autre part, l’addition de nombres flottants dans un tel accumulateurest associative. Dans un flot de synthèse de haut niveau, ceci permet des optimisations quisont interdites en flottant standard. Ce travail compare donc plusieurs implémentations del’accumulateur de Kulisch, dont deux sont originales. Ces architectures sont implémentées dansun générateur de C++ entièrement configurable produisant du code compatible avec VivadoHLS.Les comparaisons effectuées sur FPGAs Xilinx Kintex 7 montrent une amélioration par rapportà la solution de Kulisch en termes de surface et de rapidité. De plus, la comparaison avec desimplémentations flottantes classiques montre des compromis intéressants

    Vector support for multicore processors with major emphasis on configurable multiprocessors

    Get PDF
    It recently became increasingly difficult to build higher speed uniprocessor chips because of performance degradation and high power consumption. The quadratically increasing circuit complexity forbade the exploration of more instruction-level parallelism (JLP). To continue raising the performance, processor designers then focused on thread-level parallelism (TLP) to realize a new architecture design paradigm. Multicore processor design is the result of this trend. It has proven quite capable in performance increase and provides new opportunities in power management and system scalability. But current multicore processors do not provide powerful vector architecture support which could yield significant speedups for array operations while maintaining arealpower efficiency. This dissertation proposes and presents the realization of an FPGA-based prototype of a multicore architecture with a shared vector unit (MCwSV). FPGA stands for Filed-Programmable Gate Array. The idea is that rather than improving only scalar or TLP performance, some hardware budget could be used to realize a vector unit to greatly speedup applications abundant in data-level parallelism (DLP). To be realistic, limited by the parallelism in the application itself and by the compiler\u27s vectorizing abilities, most of the general-purpose programs can only be partially vectorized. Thus, for efficient resource usage, one vector unit should be shared by several scalar processors. This approach could also keep the overall budget within acceptable limits. We suggest that this type of vector-unit sharing be established in future multicore chips. The design, implementation and evaluation of an MCwSV system with two scalar processors and a shared vector unit are presented for FPGA prototyping. The MicroBlaze processor, which is a commercial IP (Intellectual Property) core from Xilinx, is used as the scalar processor; in the experiments the vector unit is connected to a pair of MicroBlaze processors through standard bus interfaces. The overall system is organized in a decoupled and multi-banked structure. This organization provides substantial system scalability and better vector performance. For a given area budget, benchmarks from several areas show that the MCwSV system can provide significant performance increase as compared to a multicore system without a vector unit. However, a MCwSV system with two MicroBlazes and a shared vector unit is not always an optimized system configuration for various applications with different percentages of vectorization. On the other hand, the MCwSV framework was designed for easy scalability to potentially incorporate various numbers of scalar/vector units and various function units. Also, the flexibility inherent to FPGAs can aid the task of matching target applications. These benefits can be taken into account to create optimized MCwSV systems for various applications. So the work eventually focused on building an architecture design framework incorporating performance and resource management for application-specific MCwSV (AS-MCwSV) systems. For embedded system design, resource usage, power consumption and execution latency are three metrics to be used in design tradeoffs. The product of these metrics is used here to choose the MCwSV system with the smallest value

    Pipelining saturated accumulation

    Full text link

    Adaptive image filtering using run-time reconfiguration

    Get PDF
    This thesis implements an adaptive linear smoothing image filtering algorithm, on a Virtex™-E FPGA using run-time reconfiguration (RTR). An adaptive filter uses a filtering window that runs over the entire image pixel-by-pixel, generating new (filtered) values of the pixels. As the name suggests, an adaptive filter can adapt to the varying nature of an image by adjusting the coefficients of the filtering window depending upon the local variance in the intensity values of pixels. It filters an image in a non-uniform fashion providing greater smoothing in largely uniform areas of the image and lesser smoothing when it encounters edges and step changes in the image. These continual changes, in the coefficient values of the adaptive filter pose a problem in utilizing run-time reconfiguration (RTR) for its implementation, as benefits of RTR emerge only with considerable computing time between reconfigurations. This thesis provides a solution to this problem and reduces the running time of the algorithm through aggressive use of RTR. This work provides details on the RTR implementation of an adaptive filter, along with an estimate of running time and hardware resource requirements, when synthesized on the Virtex™-E FPGA. We use a 3 ×3 size filtering window, and a 256 256 ×size gray scale image as a specific case, achieving speedup of 31 and 84 over pure software implementations running on Pentium III and Sun Ultra systems respectively

    Numerical solutions of differential equations on FPGA-enhanced computers

    Get PDF
    Conventionally, to speed up scientific or engineering (S&E) computation programs on general-purpose computers, one may elect to use faster CPUs, more memory, systems with more efficient (though complicated) architecture, better software compilers, or even coding with assembly languages. With the emergence of Field Programmable Gate Array (FPGA) based Reconfigurable Computing (RC) technology, numerical scientists and engineers now have another option using FPGA devices as core components to address their computational problems. The hardware-programmable, low-cost, but powerful “FPGA-enhanced computer” has now become an attractive approach for many S&E applications. A new computer architecture model for FPGA-enhanced computer systems and its detailed hardware implementation are proposed for accelerating the solutions of computationally demanding and data intensive numerical PDE problems. New FPGAoptimized algorithms/methods for rapid executions of representative numerical methods such as Finite Difference Methods (FDM) and Finite Element Methods (FEM) are designed, analyzed, and implemented on it. Linear wave equations based on seismic data processing applications are adopted as the targeting PDE problems to demonstrate the effectiveness of this new computer model. Their sustained computational performances are compared with pure software programs operating on commodity CPUbased general-purpose computers. Quantitative analysis is performed from a hierarchical set of aspects as customized/extraordinary computer arithmetic or function units, compact but flexible system architecture and memory hierarchy, and hardwareoptimized numerical algorithms or methods that may be inappropriate for conventional general-purpose computers. The preferable property of in-system hardware reconfigurability of the new system is emphasized aiming at effectively accelerating the execution of complex multi-stage numerical applications. Methodologies for accelerating the targeting PDE problems as well as other numerical PDE problems, such as heat equations and Laplace equations utilizing programmable hardware resources are concluded, which imply the broad usage of the proposed FPGA-enhanced computers
    corecore