11 research outputs found

    A Multithreaded Soft Processor for SoPC Area Reduction

    Full text link
    The growth in size and performance of Field Programmable Gate Arrays (FPGAs) has compelled System-on-a-Programmable-Chip (SoPC) designers to use soft proces-sors for controlling systems with large numbers of intellec-tual property (IP) blocks. Soft processors control IP blocks, which are accessed by the processor either as peripheral de-vices or/and by using custom instructions (CIs). In large systems, chip multiprocessors (CMPs) are used to execute many programs concurrently. When these programs require the use of the same IP blocks which are accessed as periph-eral devices, they may have to stall waiting for their turn. In the case of CIs, the FPGA logic blocks that implement the CIs may have to be replicated for each processor. In both of these cases FPGA area is wasted, either by idle soft processors or the replication of CI logic blocks. This paper presents a multithreaded (MT) soft processor for area reduction in SoPC implementations. An MT proces-sor allows multiple programs to access the same IP without the need for the logic replication or the replication of whole processors. We first designed a single-threaded processor that is instruction-set compatible to Altera’s Nios II soft processor. Our processor is approximately the same size as the Nios II Economy version, with equivalent performance. We augmented our processor to have 4-way interleaved mul-tithreading capabilities. This paper compares the area us-age and performance of the MT processor versus two CMP systems, using Altera’s and our single-threaded processors, separately. Our results show that we can achieve an area savings of about 45 % for the processor itself, in addition to the area savings due to not replicating CI logic blocks. 1

    A soft processor overlay with tightly-coupled FPGA accelerator

    Get PDF
    FPGA overlays are commonly implemented as coarse-grained reconfigurable architectures with a goal to improve designers’ productivity through balancing flexibility and ease of configuration of the underlying fabric. To truly facilitate full application acceleration, it is often necessary to also include a highly efficient processor that integrates and collaborates with the accelerators while maintaining the benefits of being implemented within the same overlay framework. This paper presents an open-source soft processor that is designed to tightly-couple with FPGA accelerators as part of an overlay framework. RISC-V is chosen as the instruction set for its openness and portability, and the soft processor is designed as a 4-stage pipeline to balance resource consumption and performance when implemented on FPGAs. The processor is generically implemented so as to promote design portability and compatibility across different FPGA platforms. Experimental results show that integrated software-hardware applications using the proposed tightly-coupled architecture achieve comparable performance as hardware-only accelerators while the proposed architecture provides additional run-time flexibility. The processor has been synthesized to both low-end and high-performance FPGA families from different vendors, achieving the highest frequency of 268:67MHz and resource consumption comparable to existing RISC-V designs.postprin

    A template-based methodology for efficient microprocessor and FPGA accelerator co-design

    Get PDF
    Embedded applications usually require Software/Hardware (SW/HW) designs to meet the hard timing constraints and the required design flexibility. Exhaustive exploration for SW/HW designs is a very time consuming task, while the adhoc approaches and the use of partially automatic tools usually lead to less efficient designs. To support a more efficient codesign process for FPGA platforms we propose a systematic methodology to map an application to SW/HW platform with a custom HW accelerator and a microprocessor core. The methodology mapping steps are expressed through parametric templates for the SW/HW Communication Organization, the Foreground (FG) Memory Management and the Data Path (DP) Mapping. Several performance-area tradeoff design Pareto points are produced by instantiating the templates. A real-time bioimaging application is mapped on a FPGA to evaluate the gains of our approach, i.e. 44,8% on performance compared with pure SW designs and 58% on area compared with pure HW designs

    A Hybrid Partially Reconfigurable Overlay Supporting Just-In-Time Assembly of Custom Accelerators on FPGAs

    Get PDF
    The state of the art in design and development flows for FPGAs are not sufficiently mature to allow programmers to implement their applications through traditional software development flows. The stipulation of synthesis as well as the requirement of background knowledge on the FPGAs\u27 low-level physical hardware structure are major challenges that prevent programmers from using FPGAs. The reconfigurable computing community is seeking solutions to raise the level of design abstraction at which programmers must operate, and move the synthesis process out of the programmers\u27 path through the use of overlays. A recent approach, Just-In-Time Assembly (JITA), was proposed that enables hardware accelerators to be assembled at runtime, all from within a traditional software compilation flow. The JITA approach presents a promising path to constructing hardware designs on FPGAs using pre-synthesized parallel programming patterns, but suffers from two major limitations. First, all variant programming patterns must be pre-synthesized. Second, conditional operations are not supported. In this thesis, I present a new reconfigurable overlay, URUK, that overcomes the two limitations imposed by the JITA approach. Similar to the original JITA approach, the proposed URUK overlay allows hardware accelerators to be constructed on FPGAs through software compilation flows. To this basic capability, URUK adds additional support to enable the assembly of presynthesized fine-grained computational operators to be assembled within the FPGA. This thesis provides analysis of URUK from three different perspectives; utilization, performance, and productivity. The analysis includes comparisons against High-Level Synthesis (HLS) and the state of the art approach to creating static overlays. The tradeoffs conclude that URUK can achieve approximately equivalent performance for algebra operations compared to HLS custom accelerators, which are designed with simple experience on FPGAs. Further, URUK shows a high degree of flexibility for runtime placement and routing of the primitive operations. The analysis shows how this flexibility can be leveraged to reduce communication overhead among tiles, compared to traditional static overlays. The results also show URUK can enable software programmers without any hardware skills to create hardware accelerators at productivity levels consistent with software development and compilation

    Accuracy-Guaranteed Fixed-Point Optimization in Hardware Synthesis and Processor Customization

    Get PDF
    RÉSUMÉ De nos jours, le calcul avec des nombres fractionnaires est essentiel dans une vaste gamme d’applications de traitement de signal et d’image. Pour le calcul numĂ©rique, un nombre fractionnaire peut ĂȘtre reprĂ©sentĂ© Ă  l’aide de l’arithmĂ©tique en virgule fixe ou en virgule flottante. L’arithmĂ©tique en virgule fixe est largement considĂ©rĂ©e prĂ©fĂ©rable Ă  celle en virgule flottante pour les architectures matĂ©rielles dĂ©diĂ©es en raison de sa plus faible complexitĂ© d’implĂ©mentation. Dans la mise en Ɠuvre du matĂ©riel, la largeur de mot attribuĂ©e Ă  diffĂ©rents signaux a un impact significatif sur des mĂ©triques telles que les ressources (transistors), la vitesse et la consommation d'Ă©nergie. L'optimisation de longueur de mot (WLO) en virgule fixe est un domaine de recherche bien connu qui vise Ă  optimiser les chemins de donnĂ©es par l'ajustement des longueurs de mots attribuĂ©es aux signaux. Un nombre en virgule fixe est composĂ© d’une partie entiĂšre et d’une partie fractionnaire. Il y a une limite infĂ©rieure au nombre de bits allouĂ©s Ă  la partie entiĂšre, de façon Ă  prĂ©venir les dĂ©bordements pour chaque signal. Cette limite dĂ©pend de la gamme de valeurs que peut prendre le signal. Le nombre de bits de la partie fractionnaire, quant Ă  lui, dĂ©termine la taille de l'erreur de prĂ©cision finie qui est introduite dans les calculs. Il existe un compromis entre la prĂ©cision et l'efficacitĂ© du matĂ©riel dans la sĂ©lection du nombre de bits de la partie fractionnaire. Le processus d'attribution du nombre de bits de la partie fractionnaire comporte deux procĂ©dures importantes: la modĂ©lisation de l'erreur de quantification et la sĂ©lection de la taille de la partie fractionnaire. Les travaux existants sur la WLO ont portĂ© sur des circuits spĂ©cialisĂ©s comme plate-forme cible. Dans cette thĂšse, nous introduisons de nouvelles mĂ©thodologies, techniques et algorithmes pour amĂ©liorer l’implĂ©mentation de calculs en virgule fixe dans des circuits et processeurs spĂ©cialisĂ©s. La thĂšse propose une approche amĂ©liorĂ©e de modĂ©lisation d’erreur, basĂ©e sur l'arithmĂ©tique affine, qui aborde certains problĂšmes des mĂ©thodes existantes et amĂ©liore leur prĂ©cision. La thĂšse introduit Ă©galement une technique d'accĂ©lĂ©ration et deux algorithmes semi-analytiques pour la sĂ©lection de la largeur de la partie fractionnaire pour la conception de circuits spĂ©cialisĂ©s. Alors que le premier algorithme suit une stratĂ©gie de recherche progressive, le second utilise une mĂ©thode de recherche en forme d'arbre pour l'optimisation de la largeur fractionnaire. Les algorithmes offrent deux options de compromis entre la complexitĂ© de calcul et le coĂ»t rĂ©sultant. Le premier algorithme a une complexitĂ© polynomiale et obtient des rĂ©sultats comparables avec des approches heuristiques existantes. Le second algorithme a une complexitĂ© exponentielle, mais il donne des rĂ©sultats quasi-optimaux par rapport Ă  une recherche exhaustive. Cette thĂšse propose Ă©galement une mĂ©thode pour combiner l'optimisation de la longueur des mots dans un contexte de conception de processeurs configurables. La largeur et la profondeur des blocs de registres et l'architecture des unitĂ©s fonctionnelles sont les principaux objectifs ciblĂ©s par cette optimisation. Un nouvel algorithme d'optimisation a Ă©tĂ© dĂ©veloppĂ© pour trouver la meilleure combinaison de longueurs de mots et d'autres paramĂštres configurables dans la mĂ©thode proposĂ©e. Les exigences de prĂ©cision, dĂ©finies comme l'erreur pire cas, doivent ĂȘtre respectĂ©es par toute solution. Pour faciliter l'Ă©valuation et la mise en Ɠuvre des solutions retenues, un nouvel environnement de conception de processeur a Ă©galement Ă©tĂ© dĂ©veloppĂ©. Cet environnement, qui est appelĂ© PolyCuSP, supporte une large gamme de paramĂštres, y compris ceux qui sont nĂ©cessaires pour Ă©valuer les solutions proposĂ©es par l'algorithme d'optimisation. L’environnement PolyCuSP soutient l’exploration rapide de l'espace de solution et la capacitĂ© de modĂ©liser diffĂ©rents jeux d'instructions pour permettre des comparaisons efficaces.----------ABSTRACT Fixed-point arithmetic is broadly preferred to floating-point in hardware development due to the reduced hardware complexity of fixed-point circuits. In hardware implementation, the bitwidth allocated to the data elements has significant impact on efficiency metrics for the circuits including area usage, speed and power consumption. Fixed-point word-length optimization (WLO) is a well-known research area. It aims to optimize fixed-point computational circuits through the adjustment of the allocated bitwidths of their internal and output signals. A fixed-point number is composed of an integer part and a fractional part. There is a minimum number of bits for the integer part that guarantees overflow and underflow avoidance in each signal. This value depends on the range of values that the signal may take. The fractional word-length determines the amount of finite-precision error that is introduced in the computations. There is a trade-off between accuracy and hardware cost in fractional word-length selection. The process of allocating the fractional word-length requires two important procedures: finite-precision error modeling and fractional word-length selection. Existing works on WLO have focused on hardwired circuits as the target implementation platform. In this thesis, we introduce new methodologies, techniques and algorithms to improve the hardware realization of fixed-point computations in hardwired circuits and customizable processors. The thesis proposes an enhanced error modeling approach based on affine arithmetic that addresses some shortcomings of the existing methods and improves their accuracy. The thesis also introduces an acceleration technique and two semi-analytical fractional bitwidth selection algorithms for WLO in hardwired circuit design. While the first algorithm follows a progressive search strategy, the second one uses a tree-shaped search method for fractional width optimization. The algorithms offer two different time-complexity/cost efficiency trade-off options. The first algorithm has polynomial complexity and achieves comparable results with existing heuristic approaches. The second algorithm has exponential complexity but achieves near-optimal results compared to an exhaustive search. The thesis further proposes a method to combine word-length optimization with application-specific processor customization. The supported datatype word-length, the size of register-files and the architecture of the functional units are the main target objectives to be optimized. A new optimization algorithm is developed to find the best combination of word-length and other customizable parameters in the proposed method. Accuracy requirements, defined as the worst-case error bound, are the key consideration that must be met by any solution. To facilitate evaluation and implementation of the selected solutions, a new processor design environment was developed. This environment, which is called PolyCuSP, supports necessary customization flexibility to realize and evaluate the solutions given by the optimization algorithm. PolyCuSP supports rapid design space exploration and capability to model different instruction-set architectures to enable effective compari

    Increasing the efficacy of automated instruction set extension

    Get PDF
    The use of Instruction Set Extension (ISE) in customising embedded processors for a specific application has been studied extensively in recent years. The addition of a set of complex arithmetic instructions to a baseline core has proven to be a cost-effective means of meeting design performance requirements. This thesis proposes and evaluates a reconfigurable ISE implementation called “Configurable Flow Accelerators” (CFAs), a number of refinements to an existing Automated ISE (AISE) algorithm called “ISEGEN”, and the effects of source form on AISE. The CFA is demonstrated repeatedly to be a cost-effective design for ISE implementation. A temporal partitioning algorithm called “staggering” is proposed and demonstrated on average to reduce the area of CFA implementation by 37% for only an 8% reduction in acceleration. This thesis then turns to concerns within the ISEGEN AISE algorithm. A methodology for finding a good static heuristic weighting vector for ISEGEN is proposed and demonstrated. Up to 100% of merit is shown to be lost or gained through the choice of vector. ISEGEN early-termination is introduced and shown to improve the runtime of the algorithm by up to 7.26x, and 5.82x on average. An extension to the ISEGEN heuristic to account for pipelining is proposed and evaluated, increasing acceleration by up to an additional 1.5x. An energyaware heuristic is added to ISEGEN, which reduces the energy used by a CFA implementation of a set of ISEs by an average of 1.6x, up to 3.6x. This result directly contradicts the frequently espoused notion that “bigger is better” in ISE. The last stretch of work in this thesis is concerned with source-level transformation: the effect of changing the representation of the application on the quality of the combined hardwaresoftware solution. A methodology for combined exploration of source transformation and ISE is presented, and demonstrated to improve the acceleration of the result by an average of 35% versus ISE alone. Floating point is demonstrated to perform worse than fixed point, for all design concerns and applications studied here, regardless of ISEs employed

    Optimising and evaluating designs for reconfigurable hardware

    No full text
    Growing demand for computational performance, and the rising cost for chip design and manufacturing make reconfigurable hardware increasingly attractive for digital system implementation. Reconfigurable hardware, such as field-programmable gate arrays (FPGAs), can deliver performance through parallelism while also providing flexibility to enable application builders to reconfigure them. However, reconfigurable systems, particularly those involving run-time reconfiguration, are often developed in an ad-hoc manner. Such an approach usually results in low designer productivity and can lead to inefficient designs. This thesis covers three main achievements that address this situation. The first achievement is a model that captures design parameters of reconfigurable hardware and performance parameters of a given application domain. This model supports optimisations for several design metrics such as performance, area, and power consumption. The second achievement is a technique that enhances the relocatability of bitstreams for reconfigurable devices, taking into account heterogeneous resources. This method increases the flexibility of modules represented by these bitstreams while reducing configuration storage size and design compilation time. The third achievement is a technique to characterise the power consumption of FPGAs in different activity modes. This technique includes the evaluation of standby power and dedicated low-power modes, which are crucial in meeting the requirements for battery-based mobile devices

    Ant colony optimization on runtime reconfigurable architectures

    Get PDF
    corecore