1,934 research outputs found
Square-rich fixed point polynomial evaluation on FPGAs
Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, referred to as Horner's rule, involves the fewest steps but can lead to increased latency due to serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but achieve this at the expense of large hardware overhead. This paper presents an efficient polynomial evaluation algorithm, which reforms the evaluation process to include an increased number of squaring steps. By using a squarer design that is more efficient than general multiplication, this can result in polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex 6 FPGA. When applied in fixed point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead in evaluating 5th degree polynomials.
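To make the two baselines named in this abstract concrete, here is a minimal software sketch of Horner's rule and Estrin's method. It is not the paper's square-rich algorithm, which the abstract does not describe. The function names and the coefficient ordering (`coeffs[i]` multiplies `x**i`) are assumptions made for illustration.

```python
def horner(coeffs, x):
    """Horner's rule: one multiply-add per coefficient, strictly serial."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Estrin's method: combine adjacent pairs, then repeat with a squared power.

    All pair combinations within one pass are independent of each other,
    which is what allows a parallel hardware implementation.
    """
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0)  # pad so every term has a partner
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power  # squaring step: x -> x^2 -> x^4 -> ...
    return terms[0] if terms else 0


# A 5th-degree example: 1 + 2x + 3x^2 + 4x^3 + 5x^4 + 6x^5 at x = 2.
coeffs = [1, 2, 3, 4, 5, 6]
print(horner(coeffs, 2), estrin(coeffs, 2))
```

For a degree-5 polynomial, Horner's loop chains five dependent multiply-adds, while Estrin's loop finishes in three passes, each of which could run its operations concurrently. The only multiplications that are squarings are the `power * power` updates.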
On the Hardware Implementation of Triangle Traversal Algorithms for Graphics Processing
Current GPU architectures provide impressive processing rates in graphical applications because of their specialized graphics pipeline. However, little attention has been paid to the analysis and study of different hardware architectures to implement specific pipeline stages. In this work we have identified one of the key stages in the graphics pipeline, the triangle traversal procedure, and we have implemented three different algorithms in hardware: bounding-box, zig-zag and Hilbert curve-based. The experimental results show that important area-performance trade-offs can be met when implementing key image processing algorithms in hardware.
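As an illustration of the simplest of the three traversal schemes named above, the following is a minimal software sketch of bounding-box traversal using edge functions. It is an assumption-laden model, not the paper's hardware design: vertices are integer pixel coordinates, samples are taken at integer positions, pixels on an edge count as covered (no fill rule), and the names `edge` and `bounding_box_traversal` are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def bounding_box_traversal(v0, v1, v2):
    """Visit every pixel in the triangle's bounding box and keep the covered ones."""
    area = edge(*v0, *v1, *v2)
    if area == 0:
        return []  # degenerate triangle covers nothing
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    covered = []
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            w0 = edge(*v1, *v2, x, y)
            w1 = edge(*v2, *v0, x, y)
            w2 = edge(*v0, *v1, x, y)
            # Accept either vertex winding by comparing against the sign of the area.
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                covered.append((x, y))
    return covered


pixels = bounding_box_traversal((0, 0), (4, 0), (0, 4))
print(len(pixels))
```

The bounding-box scheme tests every pixel in the box, including those outside the triangle. Avoiding that wasted work is the usual motivation for smarter traversal orders such as zig-zag.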
A low cost reconfigurable soft processor for multimedia applications: design synthesis and programming model
This paper presents an FPGA implementation of a low-cost 8-bit reconfigurable processor core for media processing applications. The core is optimized to provide all basic arithmetic and logic functions required by media processing and other domains, as well as to make it easily integrable into a 2D array. This paper presents an investigation of the feasibility of the core as a potential soft processing architecture for FPGA platforms. The core was synthesized on the entire Virtex FPGA family to evaluate its overall performance, scalability and portability. A special feature of the proposed architecture is its simple programming model, which allows low-level programming. Throughput results for popular benchmarks coded using the programming model and a cycle-accurate simulator are presented.
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs, whose programming is conventionally an RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve FPGA
programmability, they still leave programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels.
Optimisations arithmétiques et synthèse de haut niveau
High-level synthesis (HLS) tools offer increased productivity for FPGA programming. However, because they are relatively young, they still lack many arithmetic optimizations. This thesis proposes safe arithmetic optimizations that should always be applied. Some of these optimizations are simple operator specializations that follow the C semantics. Others require lifting the semantics embedded in high-level input languages, which are inherited from software programming, for an improved accuracy/cost/performance ratio. To demonstrate this claim, the sum of products of floating-point numbers is used as a case study. The sum is performed in a fixed-point format, which is tailored to the application according to the context in which the operator is instantiated. In some cases, there is not enough information about the input data to tailor the fixed-point accumulator. The fall-back strategy used in this thesis is to generate an accumulator covering the entire floating-point range. This thesis explores different strategies for implementing such a large accumulator, including new ones. The use of a 2's complement representation instead of sign+magnitude is demonstrated to save resources and to reduce the accumulation loop delay. Based on a tapered precision scheme and an exact accumulator, the posit number system claims to be a candidate to replace the IEEE floating-point format. A thorough analysis of posit operators is performed, using the same level of hardware optimization as state-of-the-art floating-point operators. Their cost remains much higher than that of their floating-point counterparts in terms of resource usage and performance.
Finally, this thesis presents a compatibility layer for HLS tools that allows one code to be deployed on multiple tools. This library implements a strongly typed custom-size integer type alongside a set of optimized custom operators.
Because high-level synthesis (HLS) tools are relatively young, many arithmetic optimizations are not yet implemented in them. This thesis proposes arithmetic optimizations that exploit the specific context in which operators are instantiated. Some optimizations are simple operator specializations that respect the C semantics. Others require departing from these semantics to improve the accuracy/cost/performance trade-off. This proposal is demonstrated on sums of products of floating-point numbers. The sum is performed in a fixed-point format defined by its context. When too little information is available to define this fixed-point format, one strategy is to generate an accumulator covering the entire floating-point format. This thesis explores several implementations of such an accumulator. Using a two's complement representation reduces the critical path of the accumulation loop as well as the amount of resources used.
An alternative format to floating-point numbers, called posit, proposes using a variable-precision encoding. In addition, this format is augmented with an exact accumulator. To evaluate the hardware cost of this format precisely, this thesis presents architectures of posit operators, implemented with the same degree of optimization as state-of-the-art floating-point operators. A detailed analysis shows that the cost of posit operators is nevertheless much higher than that of their floating-point equivalents. Finally, this thesis presents a compatibility layer between HLS tools that makes it possible to target several tools with a single code base. This library implements a variable-size integer type with strongly typed semantics, as well as a set of optimized ad hoc operators.
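The idea of an accumulator covering the entire floating-point range can be modelled in software. The sketch below is an illustration, not the thesis's hardware architecture: it assumes finite IEEE double-precision inputs, uses Python's arbitrary-precision integers as a stand-in for a wide two's complement register, and the names `FRAC_BITS`, `exact_sum_of_products` and `naive_sum_of_products` are hypothetical. Every product of two doubles is an integer multiple of 2^-2148, because the smallest subnormal double is 2^-1074, so scaling by 2^2148 makes each product an exact integer.

```python
from fractions import Fraction

# Assumption: 2148 fractional bits, so any product of two finite doubles
# is exactly representable as an integer after scaling.
FRAC_BITS = 2 * 1074


def exact_sum_of_products(xs, ys):
    """Accumulate x*y exactly in a wide fixed-point (integer) accumulator."""
    acc = 0  # Python int: an unbounded signed integer, no rounding on add
    for x, y in zip(xs, ys):
        scaled = Fraction(x) * Fraction(y) * (1 << FRAC_BITS)
        assert scaled.denominator == 1  # holds for all finite doubles
        acc += scaled.numerator
    # Single rounding at the very end, when converting back to a double.
    return float(Fraction(acc, 1 << FRAC_BITS))


def naive_sum_of_products(xs, ys):
    """Ordinary floating-point accumulation, rounding after every step."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
print(naive_sum_of_products(xs, ys), exact_sum_of_products(xs, ys))
```

In this example the floating-point loop loses the small term when it is added to 1e16, whereas the fixed-point accumulator keeps every bit and rounds only once. A hardware version would bound the accumulator width instead of relying on unbounded integers.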
HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA
Heterogeneous embedded systems on chip (HESoCs) co-integrate a standard host
processor with programmable manycore accelerators (PMCAs) to combine
general-purpose computing with domain-specific, efficient processing
capabilities. While leading companies successfully advance their HESoC
products, research lags behind due to the challenges of building a prototyping
platform that unites an industry-standard host processor with an open research
PMCA architecture. In this work we introduce HERO, an FPGA-based research
platform that combines a PMCA composed of clusters of RISC-V cores, implemented
as soft cores on an FPGA fabric, with a hard ARM Cortex-A multicore host
processor. The PMCA architecture mapped on the FPGA is silicon-proven,
scalable, configurable, and fully modifiable. HERO includes a complete software
stack that consists of a heterogeneous cross-compilation toolchain with support
for OpenMP accelerator programming, a Linux driver, and runtime libraries for
both host and PMCA. HERO is designed to facilitate rapid exploration on all
software and hardware layers: run-time behavior can be accurately analyzed by
tracing events, and modifications can be validated through fully automated
hardware and software builds and executed tests. We demonstrate the usefulness of
HERO by means of case studies from our research.