213 research outputs found

    Investigation of reconfigurable-accuracy approximate adder designs for image processing applications

    Ph.D. thesis. In recent decades, integrated circuits in CMOS technology have faced the progressive scaling challenges of increased power density and power dissipation. Meanwhile, the high-performance requirements of current and future applications place rapidly growing demands on computing resources such as power. This design conflict has motivated the search for high-performance, energy-efficient design approaches such as approximate computing. Approximate computing exploits the error resilience of compute-intensive applications, such as image processing, to apply approximation techniques at different levels of abstraction and scalability. The basic principle is to relax strict accuracy requirements in favour of lower design complexity, thereby achieving higher computational performance (i.e., speed) and energy savings. The adder is one of the essential computational blocks in most applications, so much effort has gone into the design of efficient approximate adders. This thesis presents an investigation into design enhancements, novel approximate adder designs, and implementation approaches. The first approach introduces a modification to the error detection technique of a popular configurable-accuracy approximate adder design. The proposed lightweight error detection technique reduces the number of gates required by the error detection circuit, thus mitigating the design area overhead. Furthermore, for the error correction process of the adder, we propose an extensive error detection scheme that activates more than one correction stage concurrently. As a result, the outputs achieve optimum accuracy under worst-case quality requirements. In general, approximate (speculative) adder designs use a segmentation technique that divides the adder into multiple short sub-adders operating in parallel. This limits long carry-propagation chains and yields faster operation. However, overlapping parts of the sub-adders to improve carry speculation, and hence accuracy, incurs a significant design area overhead. The second approach mitigates this challenge by presenting a novel, simpler technique for dividing the adder into a number of sub-adders. The new method uses what is known as the carry-kill signal both to limit carry propagation and to apply adder segmentation. Between every two adjacent sub-adders, one AND gate and one XOR gate are used for carry speculation and error (i.e., carry propagation) detection, respectively. Thus, a significant reduction of the design overhead is achieved while maintaining acceptable output accuracy. In the third and final approach, simple OR gates are used to build the approximate adder in place of conventional full adders. The resulting design exhibits very low complexity, high speed, and low power consumption. Furthermore, instead of augmenting the design with an error recovery circuit, short bit-length exact adders are used as correction stages to control the overall output quality (i.e., without error detection overhead). At the final correction stage, the proposed design operates exactly like an exact adder.
    To validate the efficiency of these approaches, a number of adders with different bit-widths were designed and synthesized, showing considerable reductions in critical delay and silicon area and greater energy savings compared to other existing approaches, in addition to acceptable levels of output error, which are analysed extensively for each proposed design. The proposed configurable adder designs exhibit energy/quality trade-offs across different numbers of correction stages. These trade-offs can be effectively exploited to implement adders in applications where energy can be gracefully minimised within the envelope of quality requirements. Accordingly, the designs were implemented in an image processing application, the Gaussian blur filter, demonstrating the loss in image quality at each error correction stage. The output images showed promising potential for using the proposed designs in energy-efficient applications where output quality requirements can be relaxed.
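
    As an illustration of the segmentation principle described above, the following minimal Python sketch models a block-segmented speculative adder in which the carry into each sub-adder is guessed from the top operand bits of the previous block (the AND-gate speculation of the second approach). The block width, the 16-bit word size, and the exact speculation rule are illustrative assumptions, not the thesis's precise design.

    # Minimal sketch of a segmented (speculative) approximate adder.
    # The adder is split into k-bit sub-adders operating in parallel;
    # the carry into each block is speculated from the MSBs of the
    # previous block (an AND gate) instead of waiting for the true
    # ripple carry. All parameters are illustrative assumptions.
    def approx_add(a: int, b: int, width: int = 16, k: int = 4) -> int:
        mask = (1 << k) - 1
        result = 0
        for i in range(0, width, k):
            if i == 0:
                cin = 0
            else:
                # Speculated carry-in: AND of the previous block's MSBs.
                cin = ((a >> (i - 1)) & 1) & ((b >> (i - 1)) & 1)
            blk = (((a >> i) & mask) + ((b >> i) & mask) + cin) & mask
            result |= blk << i
        return result

    if __name__ == "__main__":
        for a, b in [(1234, 5678), (0x0FFF, 0x0001), (40000, 2500)]:
            print(approx_add(a, b), (a + b) & 0xFFFF)  # approximate vs exact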

    Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

    The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus, typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field in the last 15 years has established power as a first-class design concern. As a result, the computing systems community is forced to find alternative design approaches that facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at the software, hardware, and architectural levels. Over the last decade, a plethora of approximation techniques has emerged in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing; it reviews its motivation, terminology, and principles, and it classifies and presents the technical details of state-of-the-art software and hardware approximation techniques. Comment: under review at ACM Computing Surveys.

    Automated Design of Approximate Accelerators

    In the last decade, the need for computational efficiency has motivated the development of new devices, architectures, and design techniques. Approximate computing has emerged as a modern, energy-efficient design paradigm for applications that exhibit inherent error tolerance. When the accuracy of results in current applications such as image processing, computer vision, and machine learning is reduced to an acceptable level, savings in circuit area, circuit delay, and power consumption can be achieved. With the advent of this approximate computing paradigm, many approximate functional units have been reported in the literature, in particular approximate adders and multipliers. Given the large variety of such approximate circuits, and considering their use as building blocks for the design of approximate accelerators for error-tolerant applications, a challenge arises: selecting, for a given application, those approximate circuits that minimize the required resources while meeting a defined accuracy. This dissertation proposes automated methods for designing and implementing approximate accelerators built from approximate arithmetic circuits. To this end, it addresses the following challenges and provides the following novel contributions. Many approximate adders and multipliers have been presented in the literature, either derived as approximate designs from exact implementations such as the ripple-carry adder, or generated by Approximate Logic Synthesis (ALS) methods. A representative set of these approximate components is required to build approximate accelerators. Accordingly, this dissertation presents two approaches for creating such approximate arithmetic circuits. First, AUGER is introduced, a tool that generates Register-Transfer Level (RTL) descriptions for a broad set of approximate adders and multipliers across different data bit-widths and accuracy configurations. With AUGER, a Design Space Exploration (DSE) of approximate components can be performed to find those that are Pareto-optimal for a given bit-width, approximation range, and circuit metric. Second, AxLS is presented, a framework for ALS that enables the implementation of state-of-the-art methods and the proposal of novel ones, performing structural netlist transformations to generate approximate arithmetic circuits from exact ones. In addition, both tools provide an error characterization in the form of an error distribution, together with circuit properties (area, circuit delay, and power), for every approximate circuit they generate. This information is essential to the research goal of this dissertation. Despite their error tolerance, approximate accelerators must be designed to meet accuracy specifications.
    When designing such accelerators from approximate arithmetic circuits, it is therefore essential to evaluate how the errors introduced by the approximate circuits propagate through the remaining computations, whether exact or inexact, and finally accumulate at the output. This dissertation proposes analytical models to describe error propagation through exact and approximate computations. Building on these models, an automated, compiler-based methodology is proposed to estimate error propagation in approximate accelerator designs. This methodology is integrated into a tool, CEDA, to perform fast, simulation-free accuracy estimation of approximate accelerator models described in C code. When designing approximate accelerators, repeated gate-level simulations and circuit syntheses take a long time to explore many, or even all, possible combinations of a given set of approximate arithmetic circuits. At the same time, current trends in accelerator design rely on High-Level Synthesis (HLS) tools. This dissertation presents analytical models for estimating the computational resources required when approximate adders and multipliers are used in approximate accelerator designs. Furthermore, these models, together with the proposed analytical models for accuracy estimation, are integrated into a DSE methodology for error-tolerant applications, DSEwam, to identify Pareto-optimal or near-Pareto-optimal approximate accelerator solutions. DSEwam is integrated into an HLS tool to automatically generate RTL descriptions of approximate accelerators from C-language descriptions for a given error threshold and minimization target. The use of approximate accelerators must ensure that the errors produced by approximate computations remain within a defined maximum for a given accuracy metric. However, the errors produced by approximate accelerators depend on the input data, which may differ from the data used during design. This dissertation therefore introduces ECAx, an automated method for exploring and applying fine-grained, low-overhead error correction in approximate accelerators, lowering the cost of error correction at the software level (as is done in the literature). This is achieved by selectively correcting the most significant errors (in terms of their magnitude) produced by the approximate components, without losing the benefits of the approximations. The experimental evaluation shows application speedups in exchange for a slight increase in area and power consumption of the approximate accelerator design.
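
    The analytical error-propagation models themselves are not given in this abstract; purely as an illustration of the idea, the Python sketch below lets every intermediate value carry error statistics, combines them through exact operations, and injects a characterized error at each approximate operation. The (mean, variance) representation and the independence of error terms are simplifying assumptions, not CEDA's actual models.

    # Illustrative sketch of analytic error propagation (not CEDA itself).
    class Err:
        def __init__(self, mean=0.0, var=0.0):
            self.mean, self.var = mean, var

    def add(x: Err, y: Err) -> Err:
        # Exact addition: error means and (independent) variances add.
        return Err(x.mean + y.mean, x.var + y.var)

    def approx_add(x: Err, y: Err, e_mean: float, e_var: float) -> Err:
        # Approximate adder: inject its characterized error distribution.
        s = add(x, y)
        return Err(s.mean + e_mean, s.var + e_var)

    # Example: a 4-input adder tree with one approximate adder at the root.
    a = b = c = d = Err()                  # exact inputs carry no error
    root = approx_add(add(a, b), add(c, d), e_mean=-0.5, e_var=2.0)
    print(root.mean, root.var)             # estimated output error statistics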

    Design of Approximate Circuits by Fabrication of False Timing Paths: The Carry Cut-Back Adder

    This paper introduces a novel method for designing approximate circuits by fabricating and exploiting false timing paths, i.e., critical paths that cannot be logically activated. This makes it possible to strongly relax timing constraints while guaranteeing a minimal and controlled behavioral change. The technique is applied to an approximate adder architecture, called the Carry Cut-Back Adder (CCBA), in which high-significance stages can cut the carry propagation chain at lower-significance positions. This lightweight approach prevents the logic activation of the full carry chain, improving performance and energy efficiency while guaranteeing low worst-case errors. A design methodology is presented along with implementation, error optimization, and design-space minimization. The CCBA is shown to be capable of extremely high accuracy while delivering significant circuit savings. For a worst-case precision of 99.999%, energy savings of up to 36% are demonstrated compared to exact adders. Finally, an industry-oriented comparison of 32-bit approximate and truncated adders is carried out for mean and worst-case relative errors. The CCBA outperforms both state-of-the-art and truncated adders for high-accuracy, low-power circuits, confirming the value of the proposed concept for building highly efficient approximate or precision-scalable hardware accelerators.
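
    The gate-level construction is the paper's contribution; as a purely behavioral illustration of the cut-back principle, the Python sketch below cuts the carry chain at a lower-significance position whenever the higher-significance bits form a long propagate run, substituting a speculative guess so that the full-length carry path is never exercised. The cut position, guard range, and guess value are assumptions, not the CCBA netlist.

    # Behavioral sketch of the carry cut-back idea (not the CCBA circuit).
    def propagating(a: int, b: int, lo: int, hi: int) -> bool:
        # True when bits lo..hi-1 all propagate (a XOR b == 1), i.e. the
        # long-carry condition that the cut is meant to break.
        return all(((a >> i) ^ (b >> i)) & 1 for i in range(lo, hi))

    def ccba_like_add(a: int, b: int, width: int = 16,
                      cut: int = 8, guard: int = 12) -> int:
        carry, result = 0, 0
        for i in range(width):
            if i == cut and propagating(a, b, guard, width):
                carry = 1              # speculative guess replaces the carry
            ai, bi = (a >> i) & 1, (b >> i) & 1
            result |= (ai ^ bi ^ carry) << i
            carry = (ai & bi) | (carry & (ai ^ bi))
        return result

    a, b = 0xF211, 0x0F22
    print(hex(ccba_like_add(a, b)), hex((a + b) & 0xFFFF))  # 0x233 vs 0x133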

    Design Techniques for Energy-Quality Scalable Digital Systems

    Energy efficiency is one of the key design goals in modern computing. Increasingly complex tasks are being executed in mobile devices and Internet of Things end-nodes, which are expected to operate for long time intervals, on the order of months or years, within the limited energy budgets provided by small form-factor batteries. Fortunately, many such tasks are error resilient, meaning that they can tolerate some relaxation in the accuracy, precision, or reliability of internal operations without a significant impact on the overall output quality. The error resilience of an application may derive from a number of factors. The processing of analog sensor inputs measuring quantities from the physical world may not always require maximum precision, as the amount of information that can be extracted is limited by the presence of external noise. Outputs destined for human consumption may also contain small or occasional errors, thanks to the limited capabilities of our vision and hearing systems. Finally, some computational patterns commonly found in domains such as statistics, machine learning, and operational research naturally tend to reduce or eliminate errors. Energy-Quality (EQ) scalable digital systems systematically trade off the quality of computations against energy efficiency, by relaxing the precision, the accuracy, or the reliability of internal software and hardware components in exchange for energy reductions. This design paradigm is believed to offer one of the most promising solutions to the pressing need for low-energy computing. Despite these high expectations, the current state of the art in EQ scalable design suffers from important shortcomings. First, the great majority of techniques proposed in the literature focus only on processing hardware and software components. Nonetheless, for many real devices, processing contributes only a small portion of the total energy consumption, which is dominated by other components (e.g., I/O, memory, or data transfers). Second, in order to fulfill its promises and become diffused in commercial devices, EQ scalable design needs to achieve industrial-level maturity. This involves moving from purely academic research based on high-level models and theoretical assumptions to engineered flows compatible with existing industry standards. Third, the time-varying nature of error tolerance, both among different applications and within a single task, should become more central in the proposed design methods. This involves designing “dynamic” systems in which the precision or reliability of operations (and consequently their energy consumption) can be tuned at runtime, rather than “static” solutions in which the output quality is fixed at design time. This thesis introduces several new EQ scalable design techniques for digital systems that take the previous observations into account. Besides processing, the proposed methods apply the principles of EQ scalable design also to interconnects and peripherals, which are often relevant contributors to the total energy in sensor nodes and mobile systems, respectively. Regardless of the target component, the presented techniques pay special attention to the accurate evaluation of the benefits and overheads deriving from EQ scalability, using industrial-level models, and to integration with existing standard tools and protocols. Moreover, all the works presented in this thesis allow the dynamic reconfiguration of output quality and energy consumption.
    More specifically, the contribution of this thesis is divided into three parts. In a first body of work, the design of EQ scalable modules for processing hardware datapaths is considered. Three design flows are presented, targeting different technologies and exploiting different ways to achieve EQ scalability, i.e., timing-induced errors and precision reduction. These works are inspired by previous approaches from the literature, namely Reduced-Precision Redundancy and Dynamic Accuracy Scaling, which are re-thought to make them compatible with standard Electronic Design Automation (EDA) tools and flows, providing solutions to overcome their main limitations. The second part of the thesis investigates the application of EQ scalable design to serial interconnects, which are the de facto standard for data exchanges between processing hardware and sensors. In this context, two novel bus encodings are proposed, called Approximate Differential Encoding and Serial-T0, which exploit the statistical characteristics of data produced by sensors to reduce the energy consumption on the bus at the cost of controlled data approximations. The two techniques achieve different results for data of different origins, but share the common features of allowing runtime reconfiguration of the allowed error and being compatible with standard serial bus protocols. Finally, the last part of the manuscript is devoted to the application of EQ scalable design principles to displays, which are often among the most energy-hungry components in mobile systems. The two proposals in this context leverage the emissive nature of Organic Light-Emitting Diode (OLED) displays to save energy by altering the displayed image, thus inducing an output quality reduction that depends on the amount of such alteration. The first technique implements an image-adaptive form of brightness scaling, whose outputs are optimized in terms of the balance between power consumption and similarity to the input. The second approach achieves concurrent power reduction and image enhancement by means of an adaptive polynomial transformation. Both solutions focus on minimizing the overheads associated with a real-time implementation of the transformations in software or hardware, so that these do not offset the savings in the display. For each of these three topics, results show that the aforementioned goal of building EQ scalable systems compatible with existing best practices and mature enough to be integrated in commercial devices can be effectively achieved. Moreover, they also show that very simple and similar principles can be applied to design EQ scalable versions of different system components (processing, peripherals, and I/O), and to equip these components with knobs for the runtime reconfiguration of the energy-versus-quality trade-off.
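
    The precise Approximate Differential Encoding scheme is not detailed in this abstract; the short sketch below only conveys the flavor of a statistics-driven, error-bounded bus encoding under assumed parameters: consecutive sensor samples tend to be close, so a narrow, saturated delta is sent instead of the full word, and the delta width acts as the runtime knob on the allowed error.

    # Hypothetical differential encoding with bounded (saturated) deltas.
    def encode(samples, delta_bits=4):
        lo, hi = -(1 << (delta_bits - 1)), (1 << (delta_bits - 1)) - 1
        prev, out = 0, []
        for s in samples:
            d = max(lo, min(hi, s - prev))  # saturation is the approximation
            out.append(d)
            prev += d                       # track the decoder-visible value
        return out

    def decode(deltas):
        prev, out = 0, []
        for d in deltas:
            prev += d
            out.append(prev)
        return out

    print(decode(encode([0, 3, 5, 40, 42])))  # the jump to 40 is saturated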

    Energy-efficient approximate computing for Internet of Things applications

    Reduced-width units are one of the methods proposed for reducing power consumption. However, such units have mostly been evaluated separately, i.e., not within complete applications. In this thesis, we extend the RISC-V processor with reduced-width computation and memory units, in which only a number of most significant bits (MSBs), configurable at runtime, is active. The energy reduction versus output quality trade-offs of applications executed on the extended RISC-V are studied. The results indicate that energy can be reduced by up to 14% for an error ≤ 0.1%. Moreover, we propose a generic energy model that includes both software parameters and hardware architecture ones. It allows software and hardware designers to gain early insight into the effects of optimizations on the software and/or the units.
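
    A minimal sketch of the reduced-width mechanism, emulated in software under assumed parameters (32-bit words, a multiplication as the example operation): only the n most significant bits of each operand stay active, the remaining LSBs are cleared, and n is the knob that is configurable at runtime.

    W = 32  # assumed datapath width

    def reduce_width(x: int, n_active: int) -> int:
        drop = W - n_active
        return (x >> drop) << drop          # clear the inactive LSBs

    def approx_mul(a: int, b: int, n_active: int) -> int:
        return reduce_width(a, n_active) * reduce_width(b, n_active)

    exact = 123456 * 789012
    for n in (32, 24, 16):                  # wider -> exact, narrower -> cheaper
        print(n, abs(exact - approx_mul(123456, 789012, n)) / exact)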

    Approximate computing: An integrated cross-layer framework

    A new design approach, called approximate computing (AxC), leverages the flexibility provided by intrinsic application resilience to realize hardware or software implementations that are more efficient in energy or performance. Approximate computing techniques forsake exact (numerical or Boolean) equivalence in the execution of some of the application's computations, while ensuring that the output quality is acceptable. While early efforts in approximate computing have demonstrated great potential, they consist of ad hoc techniques applied to a very narrow set of applications, leaving in question the applicability of approximate computing in a broader context. The primary objective of this thesis is to develop an integrated cross-layer approach to approximate computing, and to thereby establish its applicability to a broader range of applications. The proposed framework comprises three key components: (i) at the circuit level, systematic approaches to design approximate circuits, i.e., circuits that realize a slightly modified function with improved efficiency; (ii) at the architecture level, methods to utilize approximate circuits to build programmable approximate processors; and (iii) at the software level, methods to apply approximate computing to machine learning classifiers, which represent an important class of applications used across the computing spectrum. Towards this end, the thesis extends the state of the art in approximate computing in the following important directions. Synthesis of Approximate Circuits: First, the thesis proposes a rigorous framework for the automatic synthesis of approximate circuits, which are the hardware building blocks of approximate computing platforms. Designing approximate circuits involves making judicious changes to the function implemented by the circuit such that its hardware complexity is lowered without violating the specified quality constraint. Inspired by classical approaches to Boolean optimization in logic synthesis, the thesis proposes two synthesis tools, called SALSA and SASIMI, that are general, i.e., applicable to any given circuit and quality specification. The framework is further extended to automatically design quality-configurable circuits, which are approximate circuits with the capability to reconfigure their quality at runtime. Over a wide range of arithmetic circuits, complex modules, and complete datapaths, the circuits synthesized using the proposed framework demonstrate significant benefits in area and energy. Programmable AxC Processors: Next, the thesis extends approximate computing to the realm of programmable processors by introducing the concept of quality programmable processors (QPPs). A key principle of QPPs is that the notion of quality is explicitly codified in their HW/SW interface, i.e., the instruction set. Instructions in the ISA are extended with quality fields, enabling software to specify the accuracy level that must be met during their execution. The micro-architecture is designed with hardware mechanisms to interpret these quality specifications and translate them into energy savings. As a first embodiment of QPPs, the thesis presents a quality programmable 1D/2D vector processor, QP-Vec, which contains a 3-tiered hierarchy of processing elements. Based on an implementation of QP-Vec with 289 processing elements, energy benefits of up to 2.5X are demonstrated across a wide range of applications.
    Software and Algorithms for AxC: Finally, the thesis addresses the problem of applying approximate computing to an important class of applications, viz., machine learning classifiers such as deep learning networks. To this end, the thesis proposes two approaches: AxNN and scalable effort classifiers. Both approaches leverage domain-specific insights to transform a given application into an energy-efficient approximate version that meets a specified application output quality. In the context of deep learning networks, AxNN adapts backpropagation to identify neurons that contribute less significantly to the network's accuracy, approximates these neurons (e.g., by using lower precision), and incrementally re-trains the network to mitigate the impact of the approximations on output quality. On the other hand, scalable effort classifiers leverage the heterogeneity in the inherent classification difficulty of inputs to dynamically modulate the effort expended by machine learning classifiers. This is achieved by building a chain of classifiers of progressively growing complexity (and accuracy) such that the number of stages used for classification scales with input difficulty. Scalable effort classifiers yield substantial energy benefits, as a majority of the inputs in real-world datasets require very low effort. In summary, the concepts and techniques presented in this thesis broaden the applicability of approximate computing, thus taking a significant step towards bringing approximate computing to the mainstream. (Abstract shortened by ProQuest.)
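
    The chain-of-classifiers idea reduces to a few lines: stages of growing complexity are tried in order, and evaluation stops at the first one that is confident enough, so easy inputs exit early. The interface below (each stage returning a label and a confidence) and the threshold are assumptions for illustration, not the thesis's implementation.

    # Sketch of a scalable-effort classifier chain.
    def cascade_predict(x, stages, threshold=0.9):
        for model in stages:                 # cheapest model first
            label, confidence = model(x)
            if confidence >= threshold:
                return label                 # early exit for easy inputs
        return label                         # fall back to the last stage

    stages = [lambda x: ("pos" if x > 0 else "neg", min(1.0, abs(x))),
              lambda x: ("pos" if x >= 0 else "neg", 1.0)]
    print(cascade_predict(0.3, stages))      # escalates to the second stage
    print(cascade_predict(5.0, stages))      # resolved by the first stage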

    Energy-efficient embedded machine learning algorithms for smart sensing systems

    Embedded autonomous electronic systems are required in numerous application domains such as the Internet of Things (IoT), wearable devices, and biomedical systems. Embedded electronic systems usually host sensors, and each sensor hosts multiple input channels (e.g., tactile, vision), tightly coupled to the electronic computing unit (ECU). The ECU extracts information, often employing sophisticated methods such as Machine Learning. However, embedding Machine Learning algorithms poses essential challenges in terms of hardware resources and energy consumption because of: 1) the high amount of data to be processed; 2) computationally demanding methods. Leveraging the trade-off between quality requirements and computational complexity and time latency can reduce the system complexity without affecting performance. The objectives of the thesis are to develop: 1) energy-efficient arithmetic circuits that outperform state-of-the-art solutions for embedded machine learning algorithms, 2) an energy-efficient embedded electronic system for the “electronic-skin” (e-skin) application. As such, this thesis exploits two main approaches. Approximate Computing: In recent years, the approximate computing paradigm has become a major field of research, since it is able to enhance the energy efficiency and performance of digital systems. Approximate Computing (AC) has turned out to be a practical approach to trade accuracy for better power, latency, and size. AC targets error-resilient applications and offers promising benefits by conserving resources. Approximate results are usually acceptable for many applications, e.g., tactile data processing, image processing, and data mining; thus, it is highly recommended to take advantage of energy reduction with minimal variation in performance. In our work, we developed two approximate multipliers: 1) the “META” multiplier, based on the Error Tolerant Adder (ETA); 2) the “Approximate Baugh-Wooley (BW)” multiplier, where the approximations are implemented in the generation of the partial products. We showed that the proposed approximate arithmetic circuits achieve reductions in power consumption and time delay of around 80.4% and 24%, respectively, with respect to the exact BW multiplier. Next, to prove the feasibility of AC in real-world applications, we explored the approximate multipliers in a case study, the e-skin application. The e-skin application comprises multiple sensing components, including 1) structural materials, 2) signal processing, 3) data acquisition, and 4) data processing. In particular, processing the data originating from the e-skin into low- or high-level information is the main problem to be addressed by the embedded electronic system. Many studies have shown that Machine Learning is a promising approach for processing tactile data when classifying input touch modalities. In our work, we proposed a methodology for evaluating the behavior of the system when introducing approximate arithmetic circuits in the main stages (i.e., the signal and data processing stages) of the system. Based on the proposed methodology, we first implemented the approximate multipliers in the low-pass Finite Impulse Response (FIR) filter in the signal processing stage of the application.
    We observed that the FIR filter based on the Approx-BW multiplier outperforms state-of-the-art solutions while respecting the trade-off between accuracy and power consumption, with an SNR degradation of 1.39 dB. Second, we implemented approximate adders and multipliers into the Coordinate Rotation Digital Computer (CORDIC) and the Singular Value Decomposition (SVD) circuits, respectively, since CORDIC and SVD account for a significant part of the computationally expensive Machine Learning algorithms employed in tactile data processing. We showed benefits of up to 21% and 19% in power reduction, at the cost of less than 5% accuracy loss, for the CORDIC and SVD circuits when scaling the number of approximated bits. 2) Parallel Computing Platforms (PCP): Exploiting parallel architectures for near-threshold computing based on multi-core clusters is a promising approach to improve the performance of smart sensing systems. In our work, we exploited a novel computing platform embedding a Parallel Ultra Low Power processor (PULP), called “Mr. Wolf,” for the implementation of Machine Learning (ML) algorithms for touch modality classification. First, we tested the ML algorithms at the software level; for an RGB image case study and a tactile dataset, we achieved accuracies of 97% and 83.5%, respectively. After validating the effectiveness of the ML algorithm at the software level, we performed on-board classification of two touch modalities, demonstrating the promising use of Mr. Wolf for smart sensing systems. Moreover, we proposed a memory management strategy for storing the needed amount of trained tensors (i.e., 50 trained tensors for each class) in the on-chip memory. We evaluated the execution cycles for Mr. Wolf using a single core, 2 cores, and 3 cores, taking advantage of parallelization. We presented a comparison with the popular low-power ARM Cortex-M4F microcontroller usually employed for battery-operated devices. We showed that the ML algorithm on the proposed platform runs 3.7 times faster than on the ARM Cortex-M4F (STM32F40), consuming only 28 mW. The proposed platform achieves 15.7× better energy efficiency than classification on the STM32F40, consuming 81 mJ per classification and 150 pJ per operation.
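
    As an illustration of approximating partial-product generation (the Approx-BW design targets the signed Baugh-Wooley array, which is not reproduced here), the Python sketch below drops the partial-product bits that fall in the k least significant columns of an unsigned array multiplier, trading a bounded error for fewer adder cells. All parameters are hypothetical.

    # Hypothetical truncated array multiplier: omit low partial-product columns.
    def truncated_mul(a: int, b: int, width: int = 8, k: int = 6) -> int:
        acc = 0
        for i in range(width):
            if (b >> i) & 1:
                for j in range(width):
                    if (a >> j) & 1 and i + j >= k:   # skip columns below k
                        acc += 1 << (i + j)
        return acc

    print(truncated_mul(173, 219), 173 * 219)  # approximate vs exact product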

    VLSI Circuits for Approximate Computing

    Approximate Computing has recently emerged as a promising solution to enhance circuit performance by relaxing the requirement of exact calculation. Multimedia and Machine Learning constitute typical examples of error-resilient, albeit compute-intensive, applications. In this dissertation, the design and optimization of approximate fundamental VLSI digital blocks is investigated. In chapter one, the theoretical motivations of Approximate Computing are discussed from the VLSI perspective. In chapter two, my research activity on approximate adders is reported; approximate adders for both traditional non-error-tolerant applications and error-resilient applications are discussed. In chapter three, precision-scalable units are investigated. Real-time precision scalability allows the precision level of the unit to be adapted to the precision requirements of the application. In this context, my research activities regarding approximate Multiply-and-Accumulate and memory units are described. In chapter four, a precision-scalable approximate convolver for computer vision applications is discussed. It is composed of both the approximate Multiply-and-Accumulate and memory units presented in chapter three.

    Energy efficient hardware acceleration of multimedia processing tools

    The world of mobile devices is experiencing an ongoing trend of feature enhancement and general-purpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks. Based on the survey that this thesis presents on modern video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at the algorithmic level in order to design reusable, optimised hardware acceleration cores. To prove these conclusions, the work in this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high-level techniques such as redundant computation elimination, parallelism, and low-switching computation structures. Both architectures compare favourably against the relevant prior art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation: both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early exit mechanism that achieves large search-space reductions. Results show an improvement over state-of-the-art algorithms, with future potential for even greater savings.
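
    For context on the CMM formulation, each output of y = Cx is a dot product with constant coefficients, so every multiplication can be decomposed into signed power-of-two shifts and adds; finding a well-shared network of such shift-adds is the kind of solution the proposed genetic-programming search explores. The Python sketch below shows only the standard shift-and-add decomposition (for non-negative constants), not the search itself.

    def csd_terms(c: int):
        # Decompose c >= 0 into signed power-of-two terms (CSD-like),
        # e.g. 7 -> [(-1, 0), (+1, 3)], meaning -2^0 + 2^3.
        terms, k = [], 0
        while c:
            if c & 1:
                r = 2 - (c & 3)         # canonical signed digit: +1 or -1
                terms.append((r, k))
                c -= r
            c >>= 1
            k += 1
        return terms

    def const_mul(c: int, x: int) -> int:
        # Multiply by a constant using only shifts and adds/subtracts.
        return sum(s * (x << k) for s, k in csd_terms(c))

    def cmm(C, x):
        # y[i] = sum_j C[i][j] * x[j], every product built from shift-adds.
        return [sum(const_mul(cij, xj) for cij, xj in zip(row, x)) for row in C]

    print(cmm([[7, 3], [5, 1]], [10, 4]))   # [82, 54]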