483 research outputs found
VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing
The hardware implementation of deep neural networks (DNNs) has recently
received tremendous attention: many applications in fact require high-speed
operations that suit a hardware implementation. However, numerous elements and
complex interconnections are usually required, leading to a large area
occupation and copious power consumption. Stochastic computing has shown
promising results for low-power area-efficient hardware implementations, even
though existing stochastic algorithms require long streams that cause long
latencies. In this paper, we propose an integer form of stochastic computation
and introduce some elementary circuits. We then propose an efficient
implementation of a DNN based on integral stochastic computing. The proposed
architecture has been implemented on a Virtex7 FPGA, resulting in 45% and 62%
average reductions in area and latency compared to the best reported
architecture in literature. We also synthesize the circuits in a 65 nm CMOS
technology and we show that the proposed integral stochastic architecture
results in up to 21% reduction in energy consumption compared to the binary
radix implementation at the same misclassification rate. Due to the fault-tolerant
nature of stochastic architectures, we also consider a quasi-synchronous
implementation which yields a 33% reduction in energy consumption w.r.t. the
binary radix implementation without any compromise on performance. Comment: 11 pages, 12 figures
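The stochastic representation mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's circuits: it assumes unipolar streams (a value p in [0, 1] is the probability of a 1), multiplication by bitwise AND of independent streams, and an integer stream formed as the element-wise sum of m binary streams so that values in [0, m] can be carried. All function names are hypothetical.

```python
import random

def to_stream(p, length, rng):
    """Encode a probability p in [0, 1] as a unipolar stochastic bit stream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def stream_value(stream):
    """Decode a stream as the mean of its elements."""
    return sum(stream) / len(stream)

def multiply(a, b):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [x & y for x, y in zip(a, b)]

def to_integer_stream(value, m, length, rng):
    """Assumed integer form: encode value in [0, m] as the element-wise sum
    of m binary streams, each carrying probability value / m."""
    streams = [to_stream(value / m, length, rng) for _ in range(m)]
    return [sum(bits) for bits in zip(*streams)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 20000                # long streams reduce the estimation noise
product = stream_value(multiply(to_stream(0.5, n, rng), to_stream(0.4, n, rng)))
integer_stream = to_integer_stream(1.5, 2, n, rng)
```

The decoded product approaches 0.5 × 0.4 only as the stream length grows, which is the latency cost the abstract refers to. Each element of the integer stream carries more than one bit of information.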
Automated Design of Approximate Accelerators
Over the last ten years, the need for computational efficiency has motivated the development of new devices, architectures, and design techniques. Approximate computing has emerged as a modern, energy-efficient design paradigm for applications that exhibit inherent error tolerance. When the accuracy of results in current applications such as image processing, computer vision, and machine learning is reduced to an acceptable level, savings in circuit area, circuit delay, and power consumption can be achieved.
With the advent of this approximate computing paradigm, many approximate functional units have been reported in the literature, in particular approximate adders and multipliers. Given the multitude of such approximate circuits, and considering their use as building blocks for the design of approximate accelerators for error-tolerant applications, a challenge arises: selecting, for a given application, the approximate circuits that minimize the required resources while meeting a defined accuracy.
This dissertation proposes automated methods for designing and implementing approximate accelerators built from approximate arithmetic circuits. To achieve this, the dissertation addresses the following challenges and makes the following novel contributions:
Many approximate adders and multipliers have been presented in the literature, either by proposing approximate designs derived from exact implementations such as the ripple-carry adder, or by generating them with Approximate Logic Synthesis (ALS) methods. A representative set of these approximate components is required to build approximate accelerators. To this end, this dissertation presents two approaches for creating such approximate arithmetic circuits. First, AUGER is presented, a tool that generates Register-Transfer Level (RTL) descriptions for a broad set of approximate adders and multipliers for different data bit-width and accuracy configurations. With AUGER, a Design Space Exploration (DSE) of approximate components can be performed to find those that are Pareto-optimal for a given bit-width, approximation range, and circuit metric. Next, AxLS is presented, a framework for ALS that enables the implementation of state-of-the-art methods and the proposal of novel methods to perform structural netlist transformations and generate approximate arithmetic circuits from exact circuits. In addition, both tools provide an error characterization, in the form of an error distribution, and circuit properties (area, circuit delay, and power) for every approximate circuit they generate. This information is essential to the research goal of this dissertation.
Despite their error tolerance, approximate accelerators must be designed to meet accuracy constraints. For the design of such accelerators using approximate arithmetic circuits, it is therefore essential to evaluate how the errors caused by approximate circuits propagate through other computations, whether exact or inexact, and finally accumulate at the output. This dissertation proposes analytical models to describe error propagation through exact and approximate computations. With them, an automated, compiler-based methodology is proposed to estimate error propagation in approximate accelerator designs. This methodology is integrated into a tool, CEDA, to perform fast, simulation-free accuracy estimates of approximate accelerator models described in C code.
In the design of approximate accelerators, repeated gate-level simulations and circuit synthesis take a long time when exploring many or even all possible combinations for a given set of approximate arithmetic circuits. On the other hand, current trends in accelerator design rely on High-Level Synthesis (HLS) tools. This dissertation presents analytical models for estimating the required computational resources when approximate adders and multipliers are used in approximate accelerator designs. Furthermore, these models, together with the proposed analytical models for accuracy estimation, are integrated into a DSE methodology for error-tolerant applications, DSEwam, to identify Pareto-optimal or near-Pareto-optimal solutions for approximate accelerators. DSEwam is integrated into an HLS tool to automatically generate RTL descriptions of approximate accelerators from C-language descriptions for a given error threshold and minimization goal.
The use of approximate accelerators must ensure that errors produced by approximate computations stay within a defined maximum value for a given accuracy metric. However, the errors produced by approximate accelerators depend on the input data, which may differ from the data used for the design. This dissertation presents ECAx, an automated method for exploring and applying fine-grained, low-overhead error correction in approximate accelerators, in order to reduce the cost of error correction at the software level (as is done in the literature). This is done by selectively correcting the most significant errors (in terms of their magnitude) produced by approximate components, without losing the benefits of the approximations. The experimental evaluation shows speedup improvements for the application in exchange for a slightly increased area and power consumption in the approximate accelerator design.
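The notion of a Pareto-optimal component used in this abstract can be made concrete with a small filter. The sketch below is a generic illustration, not the dissertation's tooling: the function name, the component names, and the (error, area) figures are all invented for the example, and lower is assumed to be better for both metrics.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is (name, error, area). A candidate is dominated when
    another one is no worse in both metrics and strictly better in one.
    """
    front = []
    for name, err, area in candidates:
        dominated = any(
            (e <= err and a <= area) and (e < err or a < area)
            for _, e, a in candidates
        )
        if not dominated:
            front.append((name, err, area))
    return front

# Hypothetical adders with made-up (mean error, area) values.
adders = [
    ("exact", 0.0, 100.0),
    ("ax1", 0.5, 80.0),
    ("ax2", 0.7, 85.0),   # worse than ax1 in both metrics
    ("ax3", 2.0, 60.0),
]
front = pareto_front(adders)
```

Here `ax2` is discarded because `ax1` has both lower error and lower area. The remaining three each represent a distinct accuracy/area trade-off.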
A Genetic-algorithm-based Approach to the Design of DCT Hardware Accelerators
As modern applications demand an unprecedented level of computational resources, traditional computing system design paradigms are no longer adequate to guarantee significant performance enhancement at an affordable cost. Approximate Computing (AxC) has been introduced as a potential candidate to achieve better computational performance by relaxing non-critical functional system specifications. In this article, we propose a systematic and high-abstraction-level approach allowing the automatic generation of near Pareto-optimal approximate configurations for a Discrete Cosine Transform (DCT) hardware accelerator. We obtain the approximate variants by using approximate operations with a configurable approximation degree rather than fully precise ones. We use a genetic search algorithm to find the appropriate tuning of the approximation degree, leading to optimal tradeoffs between accuracy and gains. Finally, to evaluate the actual HW gains, we synthesize non-dominated approximate DCT variants for two different target technologies, namely, Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs). Experimental results show that the proposed approach allows performing a meaningful exploration of the design space to find the best tradeoffs in a reasonable time. Indeed, compared to the state-of-the-art work on approximate DCT, the proposed approach allows an 18% average energy improvement while at the same time improving image quality.
A FRAMEWORK FOR OPTIMAL DESIGN OF LOW-POWER FIR FILTERS
Approximate Computing has emerged as a new low-power design approach for application domains characterized by intrinsic error resilience. Digital Signal Processing (DSP) is one such domain where outputs of acceptable quality can be produced even though the internal computations are carried out in an approximate manner. With the ever-increasing need for higher data rates at lower power usage, the need for improved complexity reduction schemes for DSP systems continues. One of the most widely performed steps in DSP is FIR filtering. FIR filters are preferred due to their linea
Calcul approximatif à haute efficacité énergétique pour des applications de l'internet des objets
Reduced-width units are one of the power reduction methods. However, such units have mostly been evaluated separately, i.e., not evaluated in complete applications. In this thesis, we extend the RISC-V processor with reduced-width computation and memory units, in which only a number of most significant bits (MSBs), configurable at runtime, is active. The trade-offs between energy reduction and output quality for applications executed on the extended RISC-V are studied. The results indicate that the energy can be reduced by up to 14% for an error ≤ 0.1%. Moreover, we propose a generic energy model that includes both software parameters and hardware architecture parameters. It gives software and hardware designers an early insight into the effects of optimizations on the software and/or the units.
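The idea of keeping only a runtime-configurable number of MSBs active can be sketched as a masking step in front of an ordinary addition. This is a minimal software model under stated assumptions, not the thesis's RISC-V extension: operands are unsigned, inactive low-order bits are simply read as zero, and the function names are hypothetical.

```python
def keep_msbs(value, width, active_bits):
    """Zero all but the `active_bits` most significant bits of an unsigned
    `width`-bit word."""
    if not 0 <= active_bits <= width:
        raise ValueError("active_bits must be between 0 and width")
    dropped = width - active_bits
    mask = ((1 << width) - 1) & ~((1 << dropped) - 1)
    return value & mask

def reduced_width_add(a, b, width, active_bits):
    """Add two operands after masking each to its active MSBs; the result
    wraps around at `width` bits."""
    a_r = keep_msbs(a, width, active_bits)
    b_r = keep_msbs(b, width, active_bits)
    return (a_r + b_r) & ((1 << width) - 1)

exact = (0x1234 + 0x00FF) & 0xFFFF
approx = reduced_width_add(0x1234, 0x00FF, 16, 8)
relative_error = abs(exact - approx) / exact
```

With 8 of 16 bits active, the second operand vanishes entirely. This shows why the achievable error depends on the data as well as on the configured width.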
Investigation of reconfigurable-accuracy approximate adder designs for image processing applications
Ph. D. Thesis. In recent decades, integrated circuits in CMOS technology have faced
progressive scaling challenges from both increased power density and
power dissipation. Meanwhile, the high-performance requirements of
current and future applications place rapidly growing demands on
computing resources such as power. This design conflict has driven
much effort to find high-performance and energy-efficient
design approaches, such as approximate computing.
Approximate computing exploits the error resilience of compute-intensive
applications, such as image processing, to
implement approximation techniques at different levels
of abstraction and scalability. The basic principle is to relax the
strict accuracy requirements in favour of a lower design complexity,
thereby achieving more computational performance (i.e., speed)
and energy savings. The adder is considered one
of the essential computational blocks in most applications.
As such, much effort has gone into new designs of efficient
approximate adders.
This thesis presents an investigation into design enhancement,
novel approximate adder designs and implementation approaches.
The first approach introduces a modification to the error detection
technique of a popular configurable-accuracy approximate adder
design. The proposed lightweight error detection technique reduces
the number of gates required by the error detection circuit, thus mitigating
the design area overhead. Furthermore, in the error correction
process of the adder, we propose an extensive error detection
that activates more than one correction stage concurrently. As a
result, optimum output accuracy is achieved for
the worst case of quality requirements.
In general, approximate (speculative) adder designs use
segmentation to divide the adder into multiple short
sub-adders which operate in parallel. This limits the
long carry propagation chains and improves performance.
However, overlapping parts of the sub-adders
for better carry speculation, and hence more accuracy,
incurs a large design area overhead. The
second approach mitigates this challenge by presenting
a novel and simpler technique for dividing the adder into a number of
sub-adders. The new method uses what is known as the carry-kill
signal both to limit carry propagation and to apply adder
segmentation. Further, between every two adjacent sub-adders,
one AND gate and one XOR gate are used for carry speculation
and error (i.e., carry propagation) detection, respectively. Thus, a
significant reduction of the design overhead is achieved
with acceptable levels of output accuracy. In the third and final
approach, simple logic OR gates are used to build the approximate
adder while approximating the operation of conventional full adders.
The resulting approximate adder design has very low complexity,
high speed, and low power consumption. Furthermore, instead
of adding an error recovery circuit, short bit-length exact adders
are used as correction stages to control the general level of output
quality (i.e., without error detection overhead). At the final correction
stage, the proposed design operates the same as an exact
adder.
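Two of the general ideas named above, segmentation into independent sub-adders and replacing full adders with OR gates, can be modelled in software as follows. This is a generic sketch and not the thesis's designs: it omits the carry-kill signal and the AND/XOR speculation and detection logic, it assumes unsigned operands, and the function names and the `approx_bits` parameter are hypothetical.

```python
def segmented_add(a, b, width, segment):
    """Add two unsigned `width`-bit words using independent sub-adders of
    `segment` bits; the carry out of each sub-adder is dropped instead of
    being passed to the next one."""
    seg_mask = (1 << segment) - 1
    result = 0
    for shift in range(0, width, segment):
        part = ((a >> shift) & seg_mask) + ((b >> shift) & seg_mask)
        result |= (part & seg_mask) << shift
    return result & ((1 << width) - 1)

def or_lower_add(a, b, width, approx_bits):
    """Approximate the lowest `approx_bits` bits with a bitwise OR and add
    the remaining upper bits exactly, with no carry from the lower part."""
    low_mask = (1 << approx_bits) - 1
    low = (a | b) & low_mask
    high = ((a >> approx_bits) + (b >> approx_bits)) << approx_bits
    return (high | low) & ((1 << width) - 1)

# 0x0F + 0x01 needs a carry across the 4-bit boundary, so both models err.
print(hex(segmented_add(0x0F, 0x01, 8, 4)))  # 0x0  (exact sum is 0x10)
print(hex(or_lower_add(0x0F, 0x01, 8, 4)))   # 0xf  (exact sum is 0x10)
# With no approximated bits, the OR-based model is an exact 8-bit adder.
print(or_lower_add(200, 100, 8, 0))          # 44, i.e. 300 mod 256
```

Shrinking `approx_bits` moves the model toward an exact adder, which loosely mirrors the role the abstract gives to successive correction stages.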
To validate the efficiency of these approaches, a number of adders
with different bit-widths are designed and synthesized, showing
considerable reductions in critical delay and silicon area and greater
savings in energy consumption compared to other existing approaches,
in addition to acceptable levels of output error, which
are extensively analysed for each proposed design.
In this study, the proposed configurable adder designs exhibit
energy/quality trade-offs at different numbers of correction stages.
These trade-offs can be effectively exploited to implement adders
in applications where energy can be gracefully minimised within
the envelope of quality requirements. As such, an implementation of the designs
in an image processing application, a Gaussian blur
filter, is presented, demonstrating the loss in image quality
at each error correction stage. The output images show promising
results for using the proposed designs in more energy-efficient
applications where output quality requirements can be relaxed.
Mutah Universit
Practical Techniques for Improving Performance and Evaluating Security on Circuit Designs
As modern semiconductor technology approaches the nanometer era, integrated circuits (ICs) are facing more and more challenges in meeting performance and security demands. With the expansion of markets in mobile and consumer electronics, the increasing demands require much faster delivery of reliable and secure IC products. In order to improve the performance and evaluate the security of emerging circuits, we present three practical techniques on approximate computing, split manufacturing and analog layout automation. Approximate computing is a promising approach for low-power IC design. Although a few accuracy-configurable adder (ACA) designs have been developed in the past, these designs tend to incur large area overheads as they rely on either redundant computing or complicated carry prediction. We investigate a simple ACA design that contains no redundancy or error detection/correction circuitry and uses very simple carry prediction. The simulation results show that our design dominates the latest previous work in the accuracy-delay-power tradeoff while using 39% less area. One variant of this design provides finer-grained and larger tunability than that of the previous works. Moreover, we propose a delay-adaptive self-configuration technique to further improve the accuracy-delay-power tradeoff. Split manufacturing prevents attacks from an untrusted foundry. The untrusted foundry has the front-end-of-line (FEOL) layout and the original circuit netlist and attempts to identify critical components on the layout for Trojan insertion. Although defense methods for this scenario have been developed, the corresponding attack technique is not well explored. Hence, the defense methods are mostly evaluated with the k-security metric without actual attacks. We develop a new attack technique based on structural pattern matching.
Experimental comparison with an existing attack shows that the new attack technique achieves about the same success rate at much higher speed for cases without the k-security defense, and has a much better success rate at the same runtime for cases with the k-security defense. The results offer an alternative and practical interpretation of k-security in split manufacturing.
Analog layout automation is still far behind its digital counterpart. We develop a layout automation framework for analog/mixed-signal ICs. A hierarchical layout synthesis flow which works in a bottom-up manner is presented. To ensure qualified layouts for better circuit performance, we use a constraint-driven placement and routing methodology which employs expert knowledge via design constraints. The constraint-driven placement uses a simulated annealing process to find the optimal solution. The packing, represented by sequence pairs and constraint graphs, can simultaneously handle different kinds of placement constraints. The constraint-driven routing consists of two stages: integer linear programming (ILP) based global routing and sequential detailed routing. The experimental results demonstrate that our flow can handle complicated hierarchical designs with multiple design constraints. Furthermore, the placement performance can be further improved by using mixed-size block placement, which works on large blocks in priority.
Highly Automated Formal Verification of Arithmetic Circuits
This dissertation investigates the problems of two distinctive formal verification techniques for verifying large-scale multiplier circuits and proposes two approaches to overcome some of these problems. The first technique is equivalence checking based on recurrence relations, while the second is the symbolic computation technique based on the theory of Gröbner bases. This investigation demonstrates that approaches based on symbolic computation have better scalability and more robustness than state-of-the-art equivalence checking techniques for the verification of arithmetic circuits. Based on this conclusion, the thesis leverages the symbolic computation technique to verify floating-point designs. It proposes a new algebraic equivalence checking technique. In contrast to classical combinational equivalence checking, the proposed technique is capable of checking the equivalence of two circuits which have different architectures of arithmetic units as well as control logic parts, e.g., floating-point multipliers.