705 research outputs found

    Qualitative and fuzzy analogue circuit design.


    Precision analysis for hardware acceleration of numerical algorithms

    The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating an algorithm onto hardware, the potential improvements obtainable by tuning the precision throughout the algorithm to meet a range or error specification are often overlooked, chiefly because it is hard to choose a number system that can guarantee such a specification will be met. Instead, the problem is sidestepped by opting for IEEE standard double-precision arithmetic so as to be ‘no worse’ than a software implementation. However, flexibility in the number representation is one of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and ignoring this potential significantly limits the achievable performance. To optimise hardware performance reliably, we require a method that can tractably calculate tight bounds on the error or range of any variable within an algorithm; currently only a handful of methods to calculate such bounds exist, and they sacrifice either tightness or tractability, whilst simulation-based methods cannot guarantee their error estimates. This thesis presents a new method to calculate these bounds, taking into account both input ranges and finite-precision effects, which we show to be, in general, tighter than existing methods; this in turn can be used to tune the hardware to the algorithm's specifications. We demonstrate the use of this software to optimise hardware for various algorithms that accelerate the solution of a system of linear equations, which forms the basis of many problems in engineering and science, and show that significant performance gains can be obtained by using this new approach in conjunction with more traditional hardware optimisations.
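The idea of propagating guaranteed range bounds through an algorithm can be illustrated with plain interval arithmetic. This is a minimal sketch for illustration only, not the thesis's method (which develops tighter bounds); it also shows why naive intervals over-approximate, which is the tightness-versus-tractability tension the abstract describes.

```python
# Minimal interval-arithmetic sketch: propagate a guaranteed enclosure of a
# variable's range through arithmetic operations. Illustrative only; the
# thesis's method computes tighter bounds than plain intervals.

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Sum of intervals: add endpoints.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # Product: take min/max over all endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

# Propagate the input range x in [-1, 2] through y = x*x + x.
x = Interval(-1.0, 2.0)
y = x * x + x
print(y.lo, y.hi)
```

The enclosure is [-3, 6], while the true range of x² + x on [-1, 2] is [-0.25, 6]: the bound is guaranteed but loose, because interval arithmetic forgets that both occurrences of `x` are the same variable. This "dependency problem" is exactly what more sophisticated bounding methods aim to reduce.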

    Approximate and timing-speculative hardware design for high-performance and energy-efficient video processing

    Since the end of 2-D transistor scaling appeared on the horizon, innovative circuit design paradigms have been on the rise to go beyond well-established, ultraconservative exact computing. Many compute-intensive applications, such as video processing, exhibit intrinsic error resilience and do not necessarily require perfect accuracy in their numerical operations. Approximate computing (AxC) is emerging as a design alternative that improves performance and energy efficiency for many applications by trading their intrinsic error tolerance for algorithm and circuit efficiency. Exact computing also imposes worst-case timing on the conventional design of hardware accelerators to ensure reliability, leading to an efficiency loss. Conversely, the timing-speculative (TS) hardware design paradigm allows increasing the frequency or decreasing the voltage beyond the limits determined by static timing analysis (STA), thereby narrowing the pessimistic safety margins that conventional design methods implement to prevent hardware timing errors. Timing errors should be evaluated by accurate gate-level simulation, but a significant gap remains: how do these timing errors propagate from the underlying hardware all the way up to the behavior of the entire algorithm, where they may degrade the performance and quality of service of the application at stake? This thesis tackles this issue by developing and demonstrating a cross-layer framework capable of investigating both AxC techniques (approximate arithmetic operators, approximate synthesis, gate-level pruning) and TS hardware design (voltage over-scaling, frequency over-clocking, temperature increases, and device aging). The framework can simulate both timing errors and logic errors at the gate level by crossing them dynamically, linking hardware results with the algorithm level, and vice versa, as the application runs.
Existing frameworks investigate AxC and TS techniques at the circuit level (i.e., at the output of the accelerator), agnostic to the ultimate impact at the application level (i.e., where the impact truly manifests), leading to weaker optimization. Unlike the state of the art, the proposed framework offers a holistic approach to assessing the trade-offs of AxC and TS techniques at the application level. It maximizes energy efficiency and performance by identifying the maximum approximation levels that still fulfill the required "good enough" quality at the application level. This thesis evaluates the framework with an 8-way SAD (Sum of Absolute Differences) hardware accelerator operating inside an HEVC encoder as a case study. Application-level results show that the SAD based on approximate adders achieves savings of up to 45% in energy per operation with an increase of only 1.9% in BD-BR. VOS (Voltage Over-Scaling) applied to the SAD yields savings of up to 16.5% in energy per operation with an increase of around 6% in BD-BR. The framework also reveals that the boost of about 6.96% (at 50°C) to 17.41% (at 75°C with 10-year aging) in the maximum clock frequency achieved with TS hardware design is entirely lost to processing overheads of 8.06% to 46.96% when an unreliable variant of the block matching algorithm (BMA) is chosen; we also show that this overhead can be avoided by adopting a reliable BMA. The thesis further presents approximate DTT (Discrete Tchebichef Transform) hardware designs that explore transform-matrix approximation, truncation, and pruning. The results show that the approximate DTT hardware increases the maximum frequency by up to 64%, reduces circuit area by up to 43.6%, and saves up to 65.4% in power dissipation. Mapped to an FPGA, the DTT design shows an increase of up to 58.9% in maximum frequency and savings of about 28.7% and 32.2% in slices and dynamic power, respectively, compared with state-of-the-art designs.
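A lower-part-OR adder (LOA) is one well-known approximate-adder family and serves here only as an assumed stand-in for the adders actually evaluated in the thesis. The sketch below shows how accumulating an 8-way SAD through such an adder trades accuracy for hardware simplicity:

```python
# Hypothetical sketch: an 8-way SAD accumulated through a lower-OR adder
# (LOA). The LOA adds the upper bits exactly but replaces addition of the
# k low bits with a bitwise OR, which is much cheaper in hardware. This is
# an illustrative assumption, not the specific adders from the thesis.

def loa_add(a, b, k=1):
    """Approximate add: exact on the high bits, bitwise OR on the k low bits."""
    mask = (1 << k) - 1
    high = (a & ~mask) + (b & ~mask)  # exact addition of the upper bits
    low = (a & mask) | (b & mask)     # cheap OR approximation of the low bits
    return high | low                 # high part has zero low bits, so OR merges

def sad_approx(block_a, block_b, k=1):
    """SAD of two pixel blocks, accumulated with the approximate adder."""
    acc = 0
    for a, b in zip(block_a, block_b):
        acc = loa_add(acc, abs(a - b), k)
    return acc

# Two hypothetical 8-pixel blocks from a block-matching search.
block_a = [16, 25, 31, 7, 90, 14, 3, 60]
block_b = [14, 28, 30, 9, 88, 20, 1, 61]
exact = sum(abs(a - b) for a, b in zip(block_a, block_b))
print(exact, sad_approx(block_a, block_b))
```

For these blocks the exact SAD is 19 and the approximate SAD is 17; in a video encoder such small SAD errors mostly perturb the motion-vector choice, which is why SAD is a good target for approximation.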

    Design and Implementation of Hardware Accelerators for Neural Processing Applications

    The primary motivation for this work was the need to implement hardware accelerators for a newly proposed ANN structure called the Auto Resonance Network (ARN) for robotic motion planning. ARN is an approximating, feed-forward, hierarchical, and explainable network. It can be used in various AI applications, but its application base was small. Therefore, the objective of the research was twofold: to develop a new application using ARN and to implement a hardware accelerator for ARN. As per the suggestions of the Doctoral Committee, an image recognition system using ARN was implemented. An accuracy of around 94% was achieved with only 2 layers of ARN. The network also required a small training set of about 500 images, drawn from the publicly available MNIST dataset. All coding was done in Python. The massive parallelism seen in ANNs presents several challenges to CPU design. For a given functionality, e.g., multiplication, several copies of a serial module can be realized within the same area as one parallel module; the advantage of using serial modules over parallel modules under area constraints is discussed. One module often useful in ANNs is a multi-operand adder. One problem in its implementation is estimating the number of carry bits required when the number of operands changes. A theorem to calculate the exact number of carry bits required for a multi-operand addition is presented in the thesis, which alleviates this problem. The main advantage of the modular approach to multi-operand addition is the possibility of pipelined addition with low reconfiguration overhead. This results in an overall increase in throughput for the large numbers of additions typically seen in several DNN configurations.
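The bit-growth rule behind multi-operand addition can be checked numerically. The sketch below uses the standard result that summing n unsigned k-bit operands needs at most k + ceil(log2 n) bits, i.e. ceil(log2 n) carry bits; the thesis's exact theorem may be stated differently, so this is an illustration of the problem, not a restatement of its result.

```python
# Standard bit-growth bound for multi-operand addition: the sum of n
# unsigned k-bit operands fits in k + ceil(log2(n)) bits. (Illustrative;
# the thesis proves an exact theorem for its adder structure.)
import math

def result_width(n_operands, k_bits):
    """Bits needed to hold the sum of n_operands unsigned k_bits values."""
    return k_bits + math.ceil(math.log2(n_operands))

# Worst case: n copies of the largest k-bit value.
n, k = 5, 8
worst = n * ((1 << k) - 1)  # 5 * 255 = 1275
print(result_width(n, k), worst.bit_length())
```

Both expressions give 11 bits for five 8-bit operands, i.e. three carry bits on top of the operand width; knowing this exactly is what allows a pipelined multi-operand adder to be sized without over-provisioning.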

    Accuracy-Guaranteed Fixed-Point Optimization in Hardware Synthesis and Processor Customization

    ABSTRACT Fixed-point arithmetic is broadly preferred to floating-point in hardware development due to the reduced hardware complexity of fixed-point circuits. In hardware implementation, the bit-width allocated to the data elements has a significant impact on efficiency metrics for the circuits, including area usage, speed, and power consumption. Fixed-point word-length optimization (WLO) is a well-known research area; it aims to optimize fixed-point computational circuits through adjustment of the bit-widths allocated to their internal and output signals. A fixed-point number is composed of an integer part and a fractional part. There is a minimum number of bits for the integer part that guarantees overflow and underflow avoidance for each signal; this value depends on the range of values the signal may take. The fractional word-length determines the amount of finite-precision error introduced into the computations, and there is a trade-off between accuracy and hardware cost in fractional word-length selection. Allocating the fractional word-length requires two important procedures: finite-precision error modeling and fractional word-length selection. Existing work on WLO has focused on hardwired circuits as the target implementation platform. In this thesis, we introduce new methodologies, techniques, and algorithms to improve the hardware realization of fixed-point computations in hardwired circuits and customizable processors.
The thesis proposes an enhanced error modeling approach based on affine arithmetic that addresses shortcomings of existing methods and improves their accuracy. It also introduces an acceleration technique and two semi-analytical fractional bit-width selection algorithms for WLO in hardwired circuit design. While the first algorithm follows a progressive search strategy, the second uses a tree-shaped search for fractional width optimization. The algorithms offer two different trade-offs between time complexity and cost efficiency: the first has polynomial complexity and achieves results comparable to existing heuristic approaches, while the second has exponential complexity but achieves near-optimal results relative to an exhaustive search. The thesis further proposes a method to combine word-length optimization with application-specific processor customization. The supported data-type word-lengths, the size of the register files, and the architecture of the functional units are the main optimization targets. A new optimization algorithm is developed to find the best combination of word-lengths and other customizable parameters in the proposed method. Accuracy requirements, defined as a worst-case error bound, are the key constraint that any solution must meet. To facilitate evaluation and implementation of the selected solutions, a new processor design environment, called PolyCuSP, was developed. It supports the customization flexibility necessary to realize and evaluate the solutions given by the optimization algorithm, rapid design-space exploration, and the modeling of different instruction-set architectures to enable effective comparisons.
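The core accuracy/cost trade-off in fractional word-length selection comes from the quantization step: with f fractional bits the step is 2^-f, so worst-case rounding error per quantization is 2^-(f+1). The sketch below illustrates only this basic relationship; the thesis's affine-arithmetic error model is considerably more elaborate.

```python
# Hedged sketch of the quantization-error/word-length relationship: with f
# fractional bits the representable step is 2**-f, so round-to-nearest
# introduces at most 2**-(f+1) error per quantization. Illustrative only.

def quantize(x, f):
    """Round x to the nearest multiple of 2**-f (fixed-point rounding)."""
    step = 2.0 ** -f
    return round(x / step) * step

x = 0.7371
for f in (4, 8, 12):
    err = abs(quantize(x, f) - x)
    bound = 2.0 ** -(f + 1)
    print(f, err, err <= bound)
```

Each extra fractional bit halves the worst-case error but widens every datapath that carries the signal, which is precisely the cost that the WLO search algorithms described above trade against accuracy.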

    Power system real-time thermal rating estimation

    This thesis describes the development and testing of a real-time rating estimation algorithm developed at Durham University within the framework of the partially government-funded research and development project “Active network management based on component thermal properties”, involving Durham University, ScottishPower EnergyNetworks, AREVA-T&D, PB Power and Imass. The concept of real-time ratings is based on the observation that the current-carrying capacity of power system components is strongly influenced by variable environmental parameters such as air temperature and wind speed. By contrast, current operating practice uses static component ratings based on conservative assumptions. The adoption of real-time ratings would therefore allow latent network capacity to be unlocked, with positive outcomes in a number of aspects of distribution network operation. This research focuses mainly on facilitating renewable energy connection at the distribution level, since thermal overloads are the main cause of constraints for connections at medium and high voltage levels. Its application is also expected to facilitate network operation in the case of thermal problems created by load growth, delaying and optimizing network reinforcements. The work aims to solve some of the problems inherent in developing a real-time rating system, such as reducing the number of measurement points, data uncertainty, and communication failure. An extensive validation allowed the performance of the developed algorithm to be quantified, building the confidence necessary for practical application of the system.
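The dependence of a conductor's rating on weather can be sketched with a steady-state heat balance, where ampacity solves I²R = q_convective + q_radiative - q_solar. All coefficients below are illustrative assumptions loosely modeled on IEEE 738-style terms, not the project's actual thermal model:

```python
# Simplified steady-state heat-balance sketch of a real-time conductor
# rating. All cooling/heating coefficients are illustrative assumptions
# (per metre of conductor), not the project's actual model.
import math

def real_time_rating(wind_speed, air_temp, conductor_temp=75.0,
                     r_ac=8e-5, solar_gain=10.0):
    """Ampacity in amperes from a toy heat balance: I^2*R = qc + qr - qs."""
    dT = conductor_temp - air_temp
    q_conv = (1.0 + 1.35 * wind_speed) * dT  # forced convection (assumed form)
    q_rad = 0.2 * dT                         # linearized radiation (assumed)
    return math.sqrt(max(q_conv + q_rad - solar_gain, 0.0) / r_ac)

# Cooler, windier weather unlocks capacity over a conservative static rating.
print(real_time_rating(wind_speed=0.5, air_temp=30.0))
print(real_time_rating(wind_speed=5.0, air_temp=10.0))
```

Even in this toy model the windy, cool case rates far higher than the calm, hot case, which is the latent capacity that a static worst-case rating leaves untapped.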

    Numerical and Evolutionary Optimization 2020

    This book was established after the 8th International Workshop on Numerical and Evolutionary Optimization (NEO) and represents a collection of papers at the intersection of the two research areas covered by the workshop: numerical optimization and evolutionary search techniques. While focusing on the design of fast and reliable methods that lie across these two paradigms, the resulting techniques apply to a broad class of real-world problems, such as pattern recognition, routing, energy, production lines, prediction, and modeling, among others. This volume is intended to serve as a useful reference for mathematicians, engineers, and computer scientists exploring current issues and solutions emerging from these mathematical and computational methods and their applications.
