Search CORE

487 research outputs found

FPGA structures for high speed and low overhead dynamic circuit specialization

Author: Kulkarni Amit
Publication venue: Ghent University. Faculty of Engineering and Architecture.
Publication date: 01/01/2017
Field of study

A Field Programmable Gate Array (FPGA) is a programmable digital electronic chip. The FPGA does not come with a predefined function from the manufacturer; instead, the developer has to define its function through implementing a digital circuit on the FPGA resources. The functionality of the FPGA can be reprogrammed as desired and hence the name “field programmable”. FPGAs are useful in small volume digital electronic products as the design of a digital custom chip is expensive. Changing the FPGA (also called configuring it) is done by changing the configuration data (in the form of bitstreams) that defines the FPGA functionality. These bitstreams are stored in a memory of the FPGA called configuration memory. The SRAM cells of LookUp Tables (LUTs), Block Random Access Memories (BRAMs) and DSP blocks together form the configuration memory of an FPGA. The configuration data can be modified according to the user’s needs to implement the user-defined hardware. The simplest way to program the configuration memory is to download the bitstreams using a JTAG interface. However, modern techniques such as Partial Reconfiguration (PR) enable us to configure a part in the configuration memory with partial bitstreams during run-time. The reconfiguration is achieved by swapping in partial bitstreams into the configuration memory via a configuration interface called Internal Configuration Access Port (ICAP). The ICAP is a hardware primitive (macro) present in the FPGA used to access the configuration memory internally by an embedded processor. The reconfiguration technique adds flexibility to use specialized ci rcuits that are more compact and more efficient t han t heir b ulky c ounterparts. An example of such an implementation is the use of specialized multipliers instead of big generic multipliers in an FIR implementation with constant coefficients. To specialize these circuits and reconfigure during the run-time, researchers at the HES group proposed the novel technique called parameterized reconfiguration that can be used to efficiently and automatically implement Dynamic Circuit Specialization (DCS) that is built on top of the Partial Reconfiguration method. It uses the run-time reconfiguration technique that is tailored to implement a parameterized design. An application is said to be parameterized if some of its input values change much less frequently than the rest. These inputs are called parameters. Instead of implementing these parameters as regular inputs, in DCS these inputs are implemented as constants, and the application is optimized for the constants. For every change in parameter values, the design is re-optimized (specialized) during run-time and implemented by reconfiguring the optimized design for a new set of parameters. In DCS, the bitstreams of the parameterized design are expressed as Boolean functions of the parameters. For every infrequent change in parameters, a specialized FPGA configuration is generated by evaluating the corresponding Boolean functions, and the FPGA is reconfigured with the specialized configuration. A detailed study of overheads of DCS and providing suitable solutions with appropriate custom FPGA structures is the primary goal of the dissertation. I also suggest different improvements to the FPGA configuration memory architecture. After offering the custom FPGA structures, I investigated the role of DCS on FPGA overlays and the use of custom FPGA structures that help to reduce the overheads of DCS on FPGA overlays. By doing so, I hope I can convince the developer to use DCS (which now comes with minimal costs) in real-world applications. I start the investigations of overheads of DCS by implementing an adaptive FIR filter (using the DCS technique) on three different Xilinx FPGA platforms: Virtex-II Pro, Virtex-5, and Zynq-SoC. The study of how DCS behaves and what is its overhead in the evolution of the three FPGA platforms is the non-trivial basis to discover the costs of DCS. After that, I propose custom FPGA structures (reconfiguration controllers and reconfiguration drivers) to reduce the main overhead (reconfiguration time) of DCS. These structures not only reduce the reconfiguration time but also help curbing the power hungry part of the DCS system. After these chapters, I study the role of DCS on FPGA overlays. I investigate the effect of the proposed FPGA structures on Virtual-Coarse-Grained Reconfigurable Arrays (VCGRAs). I classify the VCGRA implementations into three types: the conventional VCGRA, partially parameterized VCGRA and fully parameterized VCGRA depending upon the level of parameterization. I have designed two variants of VCGRA grids for HPC image processing applications, namely, the MAC grid and Pixie. Finally, I try to tackle the reconfiguration time overhead at the hardware level of the FPGA by customizing the FPGA configuration memory architecture. In this part of my research, I propose to use a parallel memory structure to improve the reconfiguration time of DCS drastically. However, this improvement comes with a significant overhead of hardware resources which will need to be solved in future research on commercial FPGA configuration memory architectures

Ghent University Academic Bibliography

Energy-efficient embedded machine learning algorithms for smart sensing systems

Author: OSTA MARIO
Publication venue: Universit\ue0 degli studi di Genova
Publication date: 27/02/2020
Field of study

Embedded autonomous electronic systems are required in numerous application domains such as Internet of Things (IoT), wearable devices, and biomedical systems. Embedded electronic systems usually host sensors, and each sensor hosts multiple input channels (e.g., tactile, vision), tightly coupled to the electronic computing unit (ECU). The ECU extracts information by often employing sophisticated methods, e.g., Machine Learning. However, embedding Machine Learning algorithms poses essential challenges in terms of hardware resources and energy consumption because of: 1) the high amount of data to be processed; 2) computationally demanding methods. Leveraging on the trade-off between quality requirements versus computational complexity and time latency could reduce the system complexity without affecting the performance. The objectives of the thesis are to develop: 1) energy-efficient arithmetic circuits outperforming state of the art solutions for embedded machine learning algorithms, 2) an energy-efficient embedded electronic system for the \u201celectronic-skin\u201d (e-skin) application. As such, this thesis exploits two main approaches: Approximate Computing: In recent years, the approximate computing paradigm became a significant major field of research since it is able to enhance the energy efficiency and performance of digital systems. \u201cApproximate Computing\u201d(AC) turned out to be a practical approach to trade accuracy for better power, latency, and size . AC targets error-resilient applications and offers promising benefits by conserving some resources. Usually, approximate results are acceptable for many applications, e.g., tactile data processing,image processing , and data mining ; thus, it is highly recommended to take advantage of energy reduction with minimal variation in performance . In our work, we developed two approximate multipliers: 1) the first one is called \u201cMETA\u201d multiplier and is based on the Error Tolerant Adder (ETA), 2) the second one is called \u201cApproximate Baugh-Wooley(BW)\u201d multiplier where the approximations are implemented in the generation of the partial products. We showed that the proposed approximate arithmetic circuits could achieve a relevant reduction in power consumption and time delay around 80.4% and 24%, respectively, with respect to the exact BW multiplier. Next, to prove the feasibility of AC in real world applications, we explored the approximate multipliers on a case study as the e-skin application. The e-skin application is defined as multiple sensing components, including 1) structural materials, 2) signal processing, 3) data acquisition, and 4) data processing. Particularly, processing the originated data from the e-skin into low or high-level information is the main problem to be addressed by the embedded electronic system. Many studies have shown that Machine Learning is a promising approach in processing tactile data when classifying input touch modalities. In our work, we proposed a methodology for evaluating the behavior of the system when introducing approximate arithmetic circuits in the main stages (i.e., signal and data processing stages) of the system. Based on the proposed methodology, we first implemented the approximate multipliers on the low-pass Finite Impulse Response (FIR) filter in the signal processing stage of the application. We noticed that the FIR filter based on (Approx-BW) outperforms state of the art solutions, while respecting the tradeoff between accuracy and power consumption, with an SNR degradation of 1.39dB. Second, we implemented approximate adders and multipliers respectively into the Coordinate Rotational Digital Computer (CORDIC) and the Singular Value Decomposition (SVD) circuits; since CORDIC and SVD take a significant part of the computationally expensive Machine Learning algorithms employed in tactile data processing. We showed benefits of up to 21% and 19% in power reduction at the cost of less than 5% accuracy loss for CORDIC and SVD circuits when scaling the number of approximated bits. 2) Parallel Computing Platforms (PCP): Exploiting parallel architectures for near-threshold computing based on multi-core clusters is a promising approach to improve the performance of smart sensing systems. In our work, we exploited a novel computing platform embedding a Parallel Ultra Low Power processor (PULP), called \u201cMr. Wolf,\u201d for the implementation of Machine Learning (ML) algorithms for touch modalities classification. First, we tested the ML algorithms at the software level; for RGB images as a case study and tactile dataset, we achieved accuracy respectively equal to 97% and 83.5%. After validating the effectiveness of the ML algorithm at the software level, we performed the on-board classification of two touch modalities, demonstrating the promising use of Mr. Wolf for smart sensing systems. Moreover, we proposed a memory management strategy for storing the needed amount of trained tensors (i.e., 50 trained tensors for each class) in the on-chip memory. We evaluated the execution cycles for Mr. Wolf using a single core, 2 cores, and 3 cores, taking advantage of the benefits of the parallelization. We presented a comparison with the popular low power ARM Cortex-M4F microcontroller employed, usually for battery-operated devices. We showed that the ML algorithm on the proposed platform runs 3.7 times faster than ARM Cortex M4F (STM32F40), consuming only 28 mW. The proposed platform achieves 15 7 better energy efficiency than the classification done on the STM32F40, consuming 81mJ per classification and 150 pJ per operation

Archivio istituzionale della ricerca - Università di Genova

Reconfigurable FPGA-Based Baseband Processor for Multi-mode Spectrum Aggregation

Author: Mário Lopes Ferreira
Publication venue
Publication date: 29/07/2019
Field of study

Repositório Aberto da Universidade do Porto

Recommended from our members

Efficient FPGA implementation and power modelling of image and signal processing IP cores

Author: Chandrasekaran Shrutisagar
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2007
Field of study

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Field Programmable Gate Arrays (FPGAs) are the technology of choice in a number ofimage and signal processing application areas such as consumer electronics, instrumentation, medical data processing and avionics due to their reasonable energy consumption, high performance, security, low design-turnaround time and reconfigurability. Low power FPGA devices are also emerging as competitive solutions for mobile and thermally constrained platforms. Most computationally intensive image and signal processing algorithms also consume a lot of power leading to a number of issues including reduced mobility, reliability concerns and increased design cost among others. Power dissipation has become one of the most important challenges, particularly for FPGAs. Addressing this problem requires optimisation and awareness at all levels in the design flow. The key achievements of the work presented in this thesis are summarised here. Behavioural level optimisation strategies have been used for implementing matrix product and inner product through the use of mathematical techniques such as Distributed Arithmetic (DA) and its variations including offset binary coding, sparse factorisation and novel vector level transformations. Applications to test the impact of these algorithmic and arithmetic transformations include the fast Hadamard/Walsh transforms and Gaussian mixture models. Complete design space exploration has been performed on these cores, and where appropriate, they have been shown to clearly outperform comparable existing implementations. At the architectural level, strategies such as parallelism, pipelining and systolisation have been successfully applied for the design and optimisation of a number of cores including colour space conversion, finite Radon transform, finite ridgelet transform and circular convolution. A pioneering study into the influence of supply voltage scaling for FPGA based designs, used in conjunction with performance enhancing strategies such as parallelism and pipelining has been performed. Initial results are very promising and indicated significant potential for future research in this area. A key contribution of this work includes the development of a novel high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called Functional Level Power Analysis and Modelling (FLPAM). FLPAM is scalable, platform independent and compares favourably with existing approaches. A hybrid, top-down design flow paradigm integrating FLPAM with commercially available design tools for systematic optimisation of IP cores has also been developed

Brunel University Research Archive

Image Processing Using FPGAs

Author: Bailey Donald
Publication venue: 'MDPI AG'
Publication date: 01/01/2019
Field of study

This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state-of-the-art on image processing using FPGAs

Directory of Open Access Books (DOAB)

Analysis of Bandwidth and Latency Constraints on a Packetized Cloud Radio Access Network Fronthaul

Author: Chaudhary Jay Kant
Publication venue
Publication date: 20/05/2020
Field of study

Cloud radio access network (C-RAN) is a promising architecture for the next-generation RAN to meet the diverse and stringent requirements envisioned by fifth generation mobile communication systems (5G) and future generation mobile networks. C-RAN offers several advantages, such as reduced capital expenditure (CAPEX) and operational expenditure (OPEX), increased spectral efficiency (SE), higher capacity and improved cell-edge performance, and efficient hardware utilization through resource sharing and network function virtualization (NFV). However, these centralization gains come with the need for a fronthaul, which is the transport link connecting remote radio units (RRUs) to the base band unit (BBU) pool. In conventional C-RAN, legacy common public radio interface (CPRI) protocol is used on the fronthaul network to transport the raw, unprocessed baseband in-phase/quadrature-phase (I/Q) samples between the BBU and the RRUs, and it demands a huge fronthaul bandwidth, a strict low-latency, in the order of a few hundred microseconds, and a very high reliability. Hence, in order to relax the excessive fronthaul bandwidth and stringent low-latency requirements, as well as to enhance the flexibility of the fronthaul, it is utmost important to redesign the fronthaul, while still profiting from the acclaimed centralization benefits. Therefore, a flexibly centralized C-RAN with different functional splits has been introduced. In addition, 5G mobile fronthaul (often also termed as an evolved fronthaul ) is envisioned to be packet-based, utilizing the Ethernet as a transport technology. In this thesis, to circumvent the fronthaul bandwidth constraint, a packetized fronthaul considering an appropriate functional split such that the fronthaul data rate is coupled with actual user data rate, unlike the classical C-RAN where fronthaul data rate is always static and independent of the traffic load, is justifiably chosen. We adapt queuing and spatial traffic models to derive the mathematical expressions for statistical multiplexing gains that can be obtained from the randomness in the user traffic. Through this, we show that the required fronthaul bandwidth can be reduced significantly, depending on the overall traffic demand, correlation distance and outage probability. Furthermore, an iterative optimization algorithm is developed, showing the impacts of number of pilots on a bandwidth-constrained fronthaul. This algorithm achieves additional reduction in the required fronthaul bandwidth. Next, knowing the multiplexing gains and possible fronthaul bandwidth reduction, it is beneficial for the mobile network operators (MNOs) to deploy the optical transceiver (TRX) modules in C-RAN cost efficiently. For this, using the same framework, a cost model for fronthaul TRX cost optimization is presented. This is essential in C-RAN, because in a wavelength division multiplexing-passive optical network (WDM-PON) system, TRXs are generally deployed to serve at a peak load. But, because of variations in the traffic demands, owing to tidal effect, the fronthaul can be dimensioned requiring a lower capacity allowing a reasonable outage, thus giving rise to cost saving by deploying fewer TRXs, and energy saving by putting the unused TRXs in sleep mode. The second focus of the thesis is the fronthaul latency analysis, which is a critical performance metric, especially for ultra-reliable and low latency communication (URLLC). An analytical framework to calculate the latency in the uplink (UL) of C-RAN massive multiple-input multiple-output (MIMO) system is presented. For this, a continuous-time queuing model for the Ethernet switch in the fronthaul network, which aggregates the UL traffic from several massive MIMO-aided RRUs, is considered. The closed-form solutions for the moment generating function (MGF) of sojourn time, waiting time and queue length distributions are derived using Pollaczek–Khinchine formula for our M/HE/1 queuing model, and evaluated via numerical solutions. In addition, the packet loss rate – due to the inability of the packets to reach the destination in a certain time – is derived. Due to the slotted nature of the UL transmissions, the model is extended to a discrete-time queuing model. The impact of the packet arrival rate, average packet size, SE of users, and fronthaul capacity on the sojourn time, waiting time and queue length distributions are analyzed. While offloading more signal processing functionalities to the RRU reduces the required fronthaul bandwidth considerably, this increases the complexity at the RRU. Hence, considering the 5G New Radio (NR) flexible numerology and XRAN functional split with a detailed radio frequency (RF) chain at the RRU, the total RRU complexity is computed first, and later, a tradeoff between the required fronthaul bandwidth and RRU complexity is analyzed. We conclude that despite the numerous C-RAN benefits, the stringent fronthaul bandwidth and latency constraints must be carefully evaluated, and an optimal functional split is essential to meet diverse set of requirements imposed by new radio access technologies (RATs).Ein cloud-basiertes Mobilfunkzugangsnetz (cloud radio access network, C-RAN) stellt eine vielversprechende Architektur für das RAN der nächsten Generation dar, um die vielfältigen und strengen Anforderungen der fünften (5G) und zukünftigen Generationen von Mobilfunknetzen zu erfüllen. C-RAN bietet mehrere Vorteile, wie z.B. reduzierte Investitions- (CAPEX) und Betriebskosten (OPEX), erhöhte spektrale Effizienz (SE), höhere Kapazität und verbesserte Leistung am Zellrand sowie effiziente Hardwareauslastung durch Ressourcenteilung und Virtualisierung von Netzwerkfunktionen (network function virtualization, NFV). Diese Zentralisierungsvorteile erfordern jedoch eine Transportverbindung (Fronthaul), die die Antenneneinheiten (remote radio units, RRUs) mit dem Pool an Basisbandeinheiten (basisband unit, BBU) verbindet. Im konventionellen C-RAN wird das bestehende CPRI-Protokoll (common public radio interface) für das Fronthaul-Netzwerk verwendet, um die rohen, unverarbeitet n Abtastwerte der In-Phaseund Quadraturkomponente (I/Q) des Basisbands zwischen der BBU und den RRUs zu transportieren. Dies erfordert eine enorme Fronthaul-Bandbreite, eine strenge niedrige Latenz in der Größenordnung von einigen hundert Mikrosekunden und eine sehr hohe Zuverlässigkeit. Um die extrem große Fronthaul-Bandbreite und die strengen Anforderungen an die geringe Latenz zu lockern und die Flexibilität des Fronthauls zu erhöhen, ist es daher äußerst wichtig, das Fronthaul neu zu gestalten und dabei trotzdem von den erwarteten Vorteilen der Zentralisierung zu profitieren. Daher wurde ein flexibel zentralisiertes CRAN mit unterschiedlichen Funktionsaufteilungen eingeführt. Außerdem ist das mobile 5G-Fronthaul (oft auch als evolved Fronthaul bezeichnet) als paketbasiert konzipiert und nutzt Ethernet als Transporttechnologie. Um die Bandbreitenbeschränkung zu erfüllen, wird in dieser Arbeit ein paketbasiertes Fronthaul unter Berücksichtigung einer geeigneten funktionalen Aufteilung so gewählt, dass die Fronthaul-Datenrate mit der tatsächlichen Nutzdatenrate gekoppelt wird, im Gegensatz zum klassischen C-RAN, bei dem die Fronthaul-Datenrate immer statisch und unabhängig von der Verkehrsbelastung ist. Wir passen Warteschlangen- und räumliche Verkehrsmodelle an, um mathematische Ausdrücke für statistische Multiplexing- Gewinne herzuleiten, die aus der Zufälligkeit im Benutzerverkehr gewonnen werden können. Hierdurch zeigen wir, dass die erforderliche Fronthaul-Bandbreite abhängig von der Gesamtverkehrsnachfrage, der Korrelationsdistanz und der Ausfallwahrscheinlichkeit deutlich reduziert werden kann. Darüber hinaus wird ein iterativer Optimierungsalgorithmus entwickelt, der die Auswirkungen der Anzahl der Piloten auf das bandbreitenbeschränkte Fronthaul zeigt. Dieser Algorithmus erreicht eine zusätzliche Reduktion der benötigte Fronthaul-Bandbreite. Mit dem Wissen über die Multiplexing-Gewinne und die mögliche Reduktion der Fronthaul-Bandbreite ist es für die Mobilfunkbetreiber (mobile network operators, MNOs) von Vorteil, die Module des optischen Sendeempfängers (transceiver, TRX) kostengünstig im C-RAN einzusetzen. Dazu wird unter Verwendung des gleichen Rahmenwerks ein Kostenmodell zur Fronthaul-TRX-Kostenoptimierung vorgestellt. Dies ist im C-RAN unerlässlich, da in einem WDM-PON-System (wavelength division multiplexing-passive optical network) die TRX im Allgemeinen bei Spitzenlast eingesetzt werden. Aufgrund der Schwankungen in den Verkehrsanforderungen (Gezeiteneffekt) kann das Fronthaul jedoch mit einer geringeren Kapazität dimensioniert werden, die einen vertretbaren Ausfall in Kauf nimmt, was zu Kosteneinsparungen durch den Einsatz von weniger TRXn und Energieeinsparungen durch den Einsatz der ungenutzten TRX im Schlafmodus führt. Der zweite Schwerpunkt der Arbeit ist die Fronthaul-Latenzanalyse, die eine kritische Leistungskennzahl liefert, insbesondere für die hochzuverlässige und niedriglatente Kommunikation (ultra-reliable low latency communications, URLLC). Ein analytisches Modell zur Berechnung der Latenz im Uplink (UL) des C-RAN mit massivem MIMO (multiple input multiple output) wird vorgestellt. Dazu wird ein Warteschlangen-Modell mit kontinuierlicher Zeit für den Ethernet-Switch im Fronthaul-Netzwerk betrachtet, das den UL-Verkehr von mehreren RRUs mit massivem MIMO aggregiert. Die geschlossenen Lösungen für die momenterzeugende Funktion (moment generating function, MGF) von Verweildauer-, Wartezeit- und Warteschlangenlängenverteilungen werden mit Hilfe der Pollaczek-Khinchin-Formel für unser M/HE/1-Warteschlangenmodell hergeleitet und mittels numerischer Verfahren ausgewertet. Darüber hinaus wird die Paketverlustrate derjenigen Pakete, die das Ziel nicht in einer bestimmten Zeit erreichen, hergeleitet. Aufgrund der Organisation der UL-Übertragungen in Zeitschlitzen wird das Modell zu einem Warteschlangenmodell mit diskreter Zeit erweitert. Der Einfluss der Paketankunftsrate, der durchschnittlichen Paketgröße, der SE der Benutzer und der Fronthaul-Kapazität auf die Verweildauer-, dieWartezeit- und dieWarteschlangenlängenverteilung wird analysiert. Während das Verlagern weiterer Signalverarbeitungsfunktionalitäten an die RRU die erforderliche Fronthaul-Bandbreite erheblich reduziert, erhöht sich dadurch im Gegenzug die Komplexität der RRU. Daher wird unter Berücksichtigung der flexiblen Numerologie von 5G New Radio (NR) und der XRAN-Funktionenaufteilung mit einer detaillierten RF-Kette (radio frequency) am RRU zunächst die gesamte RRU-Komplexität berechnet und später ein Kompromiss zwischen der erforderlichen Fronthaul-Bandbreite und der RRU-Komplexität untersucht. Wir kommen zu dem Schluss, dass trotz der zahlreichen Vorteile von C-RAN die strengen Bandbreiten- und Latenzbedingungen an das Fronthaul sorgfältig geprüft werden müssen und eine optimale funktionale Aufteilung unerlässlich ist, um die vielfältigen Anforderungen der neuen Funkzugangstechnologien (radio access technologies, RATs) zu erfüllen

Technische Universität Dresden: Qucosa

Designing energy-efficient computing systems using equalization and machine learning

Author: Takhirov Zafar
Publication venue
Publication date: 20/02/2018
Field of study

As technology scaling slows down in the nanometer CMOS regime and mobile computing becomes more ubiquitous, designing energy-efficient hardware for mobile systems is becoming increasingly critical and challenging. Although various approaches like near-threshold computing (NTC), aggressive voltage scaling with shadow latches, etc. have been proposed to get the most out of limited battery life, there is still no “silver bullet” to increasing power-performance demands of the mobile systems. Moreover, given that a mobile system could operate in a variety of environmental conditions, like different temperatures, have varying performance requirements, etc., there is a growing need for designing tunable/reconfigurable systems in order to achieve energy-efficient operation. In this work we propose to address the energy- efficiency problem of mobile systems using two different approaches: circuit tunability and distributed adaptive algorithms. Inspired by the communication systems, we developed feedback equalization based digital logic that changes the threshold of its gates based on the input pattern. We showed that feedback equalization in static complementary CMOS logic enabled up to 20% reduction in energy dissipation while maintaining the performance metrics. We also achieved 30% reduction in energy dissipation for pass-transistor digital logic (PTL) with equalization while maintaining performance. In addition, we proposed a mechanism that leverages feedback equalization techniques to achieve near optimal operation of static complementary CMOS logic blocks over the entire voltage range from near threshold supply voltage to nominal supply voltage. Using energy-delay product (EDP) as a metric we analyzed the use of the feedback equalizer as part of various sequential computational blocks. Our analysis shows that for near-threshold voltage operation, when equalization was used, we can improve the operating frequency by up to 30%, while the energy increase was less than 15%, with an overall EDP reduction of ≈10%. We also observe an EDP reduction of close to 5% across entire above-threshold voltage range. On the distributed adaptive algorithm front, we explored energy-efficient hardware implementation of machine learning algorithms. We proposed an adaptive classifier that leverages the wide variability in data complexity to enable energy-efficient data classification operations for mobile systems. Our approach takes advantage of varying classification hardness across data to dynamically allocate resources and improve energy efficiency. On average, our adaptive classifier is ≈100× more energy efficient but has ≈1% higher error rate than a complex radial basis function classifier and is ≈10× less energy efficient but has ≈40% lower error rate than a simple linear classifier across a wide range of classification data sets. We also developed a field of groves (FoG) implementation of random forests (RF) that achieves an accuracy comparable to Convolutional Neural Networks (CNN) and Support Vector Machines (SVM) under tight energy budgets. The FoG architecture takes advantage of the fact that in random forests a small portion of the weak classifiers (decision trees) might be sufficient to achieve high statistical performance. By dividing the random forest into smaller forests (Groves), and conditionally executing the rest of the forest, FoG is able to achieve much higher energy efficiency levels for comparable error rates. We also take advantage of the distributed nature of the FoG to achieve high level of parallelism. Our evaluation shows that at maximum achievable accuracies FoG consumes ≈1.48×, ≈24×, ≈2.5×, and ≈34.7× lower energy per classification compared to conventional RF, SVM-RBF , Multi-Layer Perceptron Network (MLP), and CNN, respectively. FoG is 6.5× less energy efficient than SVM-LR, but achieves 18% higher accuracy on average across all considered datasets

Boston University Institutional Repository (OpenBU)

Specializing Risc-V Cores for Performance and Power

Author: Henrique Veloso de Sousa
Publication venue
Publication date: 26/09/2022
Field of study

Repositório Aberto da Universidade do Porto

Power-Aware Design Methodologies for FPGA-Based Implementation of Video Processing Systems

Author: Ngo Hau Trung
Publication venue: ODU Digital Commons
Publication date: 01/01/2007
Field of study

The increasing capacity and capabilities of FPGA devices in recent years provide an attractive option for performance-hungry applications in the image and video processing domain. FPGA devices are often used as implementation platforms for image and video processing algorithms for real-time applications due to their programmable structure that can exploit inherent spatial and temporal parallelism. While performance and area remain as two main design criteria, power consumption has become an important design goal especially for mobile devices. Reduction in power consumption can be achieved by reducing the supply voltage, capacitances, clock frequency and switching activities in a circuit. Switching activities can be reduced by architectural optimization of the processing cores such as adders, multipliers, multiply and accumulators (MACS), etc. This dissertation research focuses on reducing the switching activities in digital circuits by considering data dependencies in bit level, word level and block level neighborhoods in a video frame. The bit level data neighborhood dependency consideration for power reduction is illustrated in the design of pipelined array, Booth and log-based multipliers. For an array multiplier, operands of the multipliers are partitioned into higher and lower parts so that the probability of the higher order parts being zero or one increases. The gating technique for the pipelined approach deactivates part(s) of the multiplier when the above special values are detected. For the Booth multiplier, the partitioning and gating technique is integrated into the Booth recoding scheme. In addition, a delay correction strategy is developed for the Booth multiplier to reduce the switching activities of the sign extension part in the partial products. A novel architecture design for the computation of log and inverse-log functions for the reduction of power consumption in arithmetic circuits is also presented. This also utilizes the proposed partitioning and gating technique for further dynamic power reduction by reducing the switching activities. The word level and block level data dependencies for reducing the dynamic power consumption are illustrated by presenting the design of a 2-D convolution architecture. Here the similarities of the neighboring pixels in window-based operations of image and video processing algorithms are considered for reduced switching activities. A partitioning and detection mechanism is developed to deactivate the parallel architecture for window-based operations if higher order parts of the pixel values are the same. A neighborhood dependent approach (NDA) is incorporated with different window buffering schemes. Consideration of the symmetry property in filter kernels is also applied with the NDA method for further reduction of switching activities. The proposed design methodologies are implemented and evaluated in a FPGA environment. It is observed that the dynamic power consumption in FPGA-based circuit implementations is significantly reduced in bit level, data level and block level architectures when compared to state-of-the-art design techniques. A specific application for the design of a real-time video processing system incorporating the proposed design methodologies for low power consumption is also presented. An image enhancement application is considered and the proposed partitioning and gating, and NDA methods are utilized in the design of the enhancement system. Experimental results show that the proposed multi-level power aware methodology achieves considerable power reduction. Research work is progressing In utilizing the data dependencies in subsequent frames in a video stream for the reduction of circuit switching activities and thereby the dynamic power consumption

Old Dominion University

Wavelet Theory

Author
Publication venue: 'IntechOpen'
Publication date: 20/04/2021
Field of study

The wavelet is a powerful mathematical tool that plays an important role in science and technology. This book looks at some of the most creative and popular applications of wavelets including biomedical signal processing, image processing, communication signal processing, Internet of Things (IoT), acoustical signal processing, financial market data analysis, energy and power management, and COVID-19 pandemic measurements and calculations. The editor’s personal interest is the application of wavelet transform to identify time domain changes on signals and corresponding frequency components and in improving power amplifier behavior

Directory of Open Access Books (DOAB)