823 research outputs found

    High Performance and Optimal Configuration of Accurate Heterogeneous Block-Based Approximate Adder

    Full text link
    Approximate computing is an emerging paradigm to improve power and performance efficiency for error-resilient application. Recent approximate adders have significantly extended the design space of accuracy-power configurable approximate adders, and find optimal designs by exploring the design space. In this paper, a new energy-efficient heterogeneous block-based approximate adder (HBBA) is proposed; which is a generic/configurable model that can be transformed to a particular adder by defining some configurations. An HBBA, in general, is composed of heterogeneous sub-adders, where each sub-adder can have a different configuration. A set of configurations of all the sub-adders in an HBBA defines its configuration. The block-based adders are approximated through inexact logic configuration and truncated carry chains. HBBA increases design space providing additional design points that fall on the Pareto-front and offer better power-accuracy trade-off compared to other configurations. Furthermore, to avoid Mont-Carlo simulations, we propose an analytical modelling technique to evaluate the probability of error and Probability Mass Function (PMF) of error value. Moreover, the estimation method estimates delay, area and power of heterogeneous block-based approximate adders. Thus, based on the analytical model and estimation method, the optimal configuration under a given error constraint can be selected from the whole design space of the proposed adder model by exhaustive search. The simulation results show that our HBBA provides improved accuracy in terms of error metrics compared to some state-of-the-art approximate adders. HBBA with 32 bits length serves about 15% reduction in area and up to 17% reduction in energy compared to state-of-the-art approximate adders.Comment: Submitted to the IEEE-TCAD journal, 16 pages, 16 figure

    APPROXIMATE COMPUTING BASED PROCESSING OF MEA SIGNALS ON FPGA

    Get PDF
    The Microelectrode Array (MEA) is a collection of parallel electrodes that may measure the extracellular potential of nearby neurons. It is a crucial tool in neuroscience for researching the structure, operation, and behavior of neural networks. Using sophisticated signal processing techniques and architectural templates, the task of processing and evaluating the data streams obtained from MEAs is a computationally demanding one that needs time and parallel processing.This thesis proposes enhancing the capability of MEA signal processing systems by using approximate computing-based algorithms. These algorithms can be implemented in systems that process parallel MEA channels using the Field Programmable Gate Arrays (FPGAs). In order to develop approximate signal processing algorithms, three different types of approximate adders are investigated in various configurations. The objective is to maximize performance improvements in terms of area, power consumption, and latency associated with real-time processing while accepting lower output accuracy within certain bounds. On FPGAs, the methods are utilized to construct approximate processing systems, which are then contrasted with the precise system. Real biological signals are used to evaluate both precise and approximative systems, and the findings reveal notable improvements, especially in terms of speed and area. Processing speed enhancements reach up to 37.6%, and area enhancements reach 14.3% in some approximate system modes without sacrificing accuracy. Additional cases demonstrate how accuracy, area, and processing speed may be traded off. Using approximate computing algorithms allows for the design of real-time MEA processing systems with higher speeds and more parallel channels. The application of approximate computing algorithms to process biological signals on FPGAs in this thesis is a novel idea that has not been explored before

    Designing Approximate Computing Circuits with Scalable and Systematic Data-Driven Techniques

    Get PDF
    Semiconductor feature size has been shrinking significantly in the past decades. This decreasing trend of feature size leads to faster processing speed as well as lower area and power consumption. Among these attributes, power consumption has emerged as the primary concern in the design of integrated circuits in recent years due to the rapid increasing demand of energy efficient Internet of Things (IoT) devices. As a result, low power design approaches for digital circuits have become of great attractive in the past few years. To this end, approximate computing in hardware design has emerged as a promising design technique. It provides design opportunities to improve timing and energy efficiency by relaxing computing quality. This technique is feasible because of the error-resiliency of many emerging resource-hungry computational applications such as multimedia processing and machine learning. Thus, it is reasonable to utilize this characteristic to trade an acceptable amount of computing quality for energy saving. In the literature, most prior works on approximate circuit design focus on using manual design strategies to redesign fundamental computational blocks such as adders and multipliers. However, the manual design techniques are not suitable for system level hardware due to much higher design complexity. In order to tackle this challenge, we focus on designing scalable, systematic and general design methodologies that are applicable on any circuits. In this paper, we present two novel approximate circuit design methods based on machine learning techniques. Both methods skip the complicated manual analysis steps and primarily look at the given input-error pattern to generate approximate circuits. Our first work presents a framework for designing compensation block, an essential component in many approximate circuits, based on feature selection. Our second work further extends and optimizes this framework and integrates data-driven consideration into the design. Several case studies on fixed-width multipliers and other approximate circuits are presented to demonstrate the effectiveness of the proposed design methods. The experimental results show that both of the proposed methods are able to automatically and efficiently design low-error approximate circuits

    Approximate and timing-speculative hardware design for high-performance and energy-efficient video processing

    Get PDF
    Since the end of transistor scaling in 2-D appeared on the horizon, innovative circuit design paradigms have been on the rise to go beyond the well-established and ultraconservative exact computing. Many compute-intensive applications – such as video processing – exhibit an intrinsic error resilience and do not necessarily require perfect accuracy in their numerical operations. Approximate computing (AxC) is emerging as a design alternative to improve the performance and energy-efficiency requirements for many applications by trading its intrinsic error tolerance with algorithm and circuit efficiency. Exact computing also imposes a worst-case timing to the conventional design of hardware accelerators to ensure reliability, leading to an efficiency loss. Conversely, the timing-speculative (TS) hardware design paradigm allows increasing the frequency or decreasing the voltage beyond the limits determined by static timing analysis (STA), thereby narrowing pessimistic safety margins that conventional design methods implement to prevent hardware timing errors. Timing errors should be evaluated by an accurate gate-level simulation, but a significant gap remains: How these timing errors propagate from the underlying hardware all the way up to the entire algorithm behavior, where they just may degrade the performance and quality of service of the application at stake? This thesis tackles this issue by developing and demonstrating a cross-layer framework capable of performing investigations of both AxC (i.e., from approximate arithmetic operators, approximate synthesis, gate-level pruning) and TS hardware design (i.e., from voltage over-scaling, frequency over-clocking, temperature rising, and device aging). The cross-layer framework can simulate both timing errors and logic errors at the gate-level by crossing them dynamically, linking the hardware result with the algorithm-level, and vice versa during the evolution of the application’s runtime. Existing frameworks perform investigations of AxC and TS techniques at circuit-level (i.e., at the output of the accelerator) agnostic to the ultimate impact at the application level (i.e., where the impact is truly manifested), leading to less optimization. Unlike state of the art, the framework proposed offers a holistic approach to assessing the tradeoff of AxC and TS techniques at the application-level. This framework maximizes energy efficiency and performance by identifying the maximum approximation levels at the application level to fulfill the required good enough quality. This thesis evaluates the framework with an 8-way SAD (Sum of Absolute Differences) hardware accelerator operating into an HEVC encoder as a case study. Application-level results showed that the SAD based on the approximate adders achieve savings of up to 45% of energy/operation with an increase of only 1.9% in BD-BR. On the other hand, VOS (Voltage Over-Scaling) applied to the SAD generates savings of up to 16.5% in energy/operation with around 6% of increase in BD-BR. The framework also reveals that the boost of about 6.96% (at 50°) to 17.41% (at 75° with 10- Y aging) in the maximum clock frequency achieved with TS hardware design is totally lost by the processing overhead from 8.06% to 46.96% when choosing an unreliable algorithm to the blocking match algorithm (BMA). We also show that the overhead can be avoided by adopting a reliable BMA. This thesis also shows approximate DTT (Discrete Tchebichef Transform) hardware proposals by exploring a transform matrix approximation, truncation and pruning. The results show that the approximate DTT hardware proposal increases the maximum frequency up to 64%, minimizes the circuit area in up to 43.6%, and saves up to 65.4% in power dissipation. The DTT proposal mapped for FPGA shows an increase of up to 58.9% on the maximum frequency and savings of about 28.7% and 32.2% on slices and dynamic power, respectively compared with stat

    High-level power optimisation for Digital Signal Processing in Recon gurable Logic

    No full text
    This thesis is concerned with the optimisation of Digital Signal Processing (DSP) algorithm implementations on recon gurable hardware via the selection of appropriate word-lengths for the signals in these algorithms, in order to minimise system power consumption. Whilst existing word-length optimisation work has concentrated on the minimisation of the area of algorithm implementations, this work introduces the rst set of power consumption models that can be evaluated quickly enough to be used within the search of the enormous design space of multiple word-length optimisation problems. These models achieve their speed by estimating both the power consumed within the arithmetic components of an algorithm and the power in the routing wires that connect these components, using only a high-level description of the algorithm itself. Trading o a small reduction in power model accuracy for a large increase in speed is one of the major contributions of this thesis. In addition to the work on power consumption modelling, this thesis also develops a new technique for selecting the appropriate word-lengths for an algorithm implementation in order to minimise its cost in terms of power (or some other metric for which models are available). The method developed is able to provide tight lower and upper bounds on the optimal cost that can be obtained for a particular word-length optimisation problem and can, as a result, nd provably near-optimal solutions to word-length optimisation problems without resorting to an NP-hard search of the design space. Finally the costs of systems optimised via the proposed technique are compared to those obtainable by word-length optimisation for minimisation of other metrics (such as logic area) and the results compared, providing greater insight into the nature of wordlength optimisation problems and the extent of the improvements obtainable by them

    Automated Design of Approximate Accelerators

    Get PDF
    In den letzten zehn Jahren hat das Bedürfnis nach Recheneffizienz die Entwicklung neuer Geräte, Architekturen und Entwurfstechniken motiviert. Approximate Computing hat sich als modernes, energieeffizientes Entwurfsparadigma für Anwendungen herausgestellt, die eine inhärente Fehlertoleranz aufweisen. Wenn die Genauigkeit der Ergebnisse in aktuellen Anwendungen wie Bildverarbeitung, Computer Vision und maschinellem Lernen auf ein akzeptables Maß reduziert wird, können Einsparungen im Schaltungsbereich, bei der Schaltkreisverzögerung und beim Stromverbrauch erzielt werden. Mit dem Aufkommen dieses Approximate Computing Paradigmas wurden in der Literatur viele approximierte Funktionseinheiten angegeben, insbesondere approximierte Addierer und Multiplizierer. Für eine Vielzahl solcher approximierter Schaltkreise und unter Berücksichtigung ihrer Verwendung als Bausteine für den Entwurf von approximierten Beschleunigern für fehlertolerante Anwendungen, ergibt sich eine Herausforderung: die Auswahl dieser approximierten Schaltkreise für eine bestimmte Anwendung, die die erforderlichen Ressourcen minimieren und gleichzeitig eine definierte Genauigkeit erfüllen. Diese Dissertation schlägt automatisierte Methoden zum Entwerfen und Implementieren von approximierten Beschleunigern vor, die aus approximierten arithmetischen Schaltungen aufgebaut sind. Um dies zu erreichen, befasst sich diese Dissertation mit folgenden Herausforderungen und liefert die nachfolgenden neuartigen Beiträge: In der Literatur wurden viele approximierte Addierer und Multiplizierer vorgestellt, indem entweder approximierte Entwürfe aus genauen Implementierungen wie dem Ripple-Carry-Addierer vorgeschlagen oder durch Approximate Logic Synthesis (ALS) Methoden generiert wurden. Ein repräsentativer Satz dieser approximierten Komponenten ist erforderlich, um approximierte Beschleuniger zu bauen. In diesem Sinne präsentiert diese Dissertation zwei Ansätze, um solche approximierte arithmetische Schaltungen zu erstellen. Zunächst wird AUGER vorgestellt, ein Tool, mit dem Register-Transfer Level (RTL) Beschreibungen für einen breiten Satz von approximierten Addierern und Multiplizierer für unterschiedliche Datenbitbreiten- und Genauigkeitskonfigurationen generiert werden können. Mit AUGER kann eine Design Space Exploration (DSE) von approximierten Komponenten durchgeführt werden, um diejenigen zu finden, die für eine gegebene Bitbreite, einen gegebenen Approximationsbereich und eine gegebene Schaltungsmetrik Pareto-optimal sind. Anschließend wird AxLS vorgestellt, ein Framework für ALS, das die Implementierung modernster Methoden und den Vorschlag neuartiger Methoden ermöglicht, um strukturelle Netzlistentransformationen durchzuführen und approximierte arithmetische Schaltungen aus genauen Schaltungen zu generieren. Darüber hinaus bieten beide Werkzeuge eine Fehlercharakterisierung in Form einer Fehlerverteilung und Schaltungseigenschaften (Fläche, Schaltkreisverzögerung und Leistung) für jede von ihnen erzeugte approximierte Schaltung. Diese Informationen sind für das Untersuchungsziel dieser Dissertation von wesentlicher Bedeutung. Trotz der Fehlertoleranz müssen approximierte Beschleuniger so ausgelegt sein, dass sie Genauigkeitsvorgaben erfüllen. Für den Entwurf solcher Beschleuniger unter Verwendung von approximierten arithmetischen Schaltungen ist es daher unerlässlich zu bewerten, wie sich die durch approximierte Schaltungen verursachten Fehler durch andere Berechnungen ausbreiten, entweder genau oder ungenau, und sich schließlich am Ausgang ansammeln. Diese Dissertation schlägt analytische Modelle vor, um die Fehlerpropagation durch genaue und approximierte Berechnungen zu beschreiben. Mit ihnen wird eine automatisierte, compilerbasierte Methodik vorgeschlagen, um die Fehlerpropagation auf approximierten Beschleunigerdesigns abzuschätzen. Diese Methode ist in ein Tool, CEDA, integriert, um schnelle, simulationsfreie Genauigkeitsschätzungen von approximierten Beschleunigermodellen durchzuführen, die unter Verwendung von C-Code beschrieben wurden. Beim Entwurf von approximierten Beschleunigern benötigen sich wiederholende Simulationen auf Gate-Level und die Schaltungssynthese viel Zeit, um viele oder sogar alle möglichen Kombinationen für einen gegebenen Satz von approximierten arithmetischen Schaltungen zu untersuchen. Andererseits basieren aktuelle Trends beim Entwerfen von Beschleunigern auf High-Level Synthesis (HLS) Werkzeugen. In dieser Dissertation werden analytische Modelle zur Schätzung der erforderlichen Rechenressourcen vorgestellt, wenn approximierte Addierer und Multiplizierer in Konstruktionen von approximierten Beschleunigern verwendet werden. Darüber hinaus werden diese Modelle zusammen mit den vorgeschlagenen analytischen Modellen zur Genauigkeitsschätzung in eine DSE-Methodik für fehlertolerante Anwendungen, DSEwam, integriert, um Pareto-optimale oder nahezu Pareto-optimale Lösungen für approximierte Beschleuniger zu identifizieren. DSEwam ist in ein HLS-Tool integriert, um automatisch RTL-Beschreibungen von approximierten Beschleunigern aus C-Sprachbeschreibungen für eine bestimmte Fehlerschwelle und ein bestimmtes Minimierungsziel zu generieren. Die Verwendung von approximierten Beschleunigern muss sicherstellen, dass Fehler, die aufgrund von approximierten Berechnungen erzeugt werden, innerhalb eines definierten Maximalwerts für eine gegebene Genauigkeitsmetrik bleiben. Die Fehler, die durch approximierte Beschleuniger erzeugt werden, hängen jedoch von den Eingabedaten ab, die hinsichtlich der für das Design verwendeten Daten unterschiedlich sein können. In dieser Dissertation wird ECAx vorgestellt, eine automatisierte Methode zur Untersuchung und Anwendung feinkörniger Fehlerkorrekturen mit geringem Overhead in approximierten Beschleunigern, um die Kosten für die Fehlerkorrektur auf Softwareebene (wie es in der Literatur gemacht wird) zu senken. Dies erfolgt durch selektive Korrektur der signifikantesten Fehler (in Bezug auf ihre Größenordnung), die von approximierten Komponenten erzeugt werden, ohne die Vorteile der Approximationen zu verlieren. Die experimentelle Auswertung zeigt Beschleunigungsverbesserungen für die Anwendung im Austausch für einen leicht gestiegenen Flächen- und Leistungsverbrauch im approximierten Beschleunigerdesign
    corecore