Search CORE

50 research outputs found

An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

Author: Benini Luca
Conti Francesco
Gautschi Michael
Gürkaynak Frank Kagan
Haugou Germain
Loi Igor
Mangard Stefan
Muehlberghuber Michael
Pullini Antonio
Rossi Davide
Schiavone Pasquale Davide
Schilling Robert
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/04/2017
Field of study

Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65nm technology, consumes less than 20mW on average at 0.8V achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition in 5.74pJ/op; and seizure detection with encrypted data collection from EEG within 12.7pJ/op.Comment: 15 pages, 12 figures, accepted for publication to the IEEE Transactions on Circuits and Systems - I: Regular Paper

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Ultra-Low Power IoT Smart Visual Sensing Devices for Always-ON Applications

Author: Rusci Manuele <1989>
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 27/04/2018
Field of study

This work presents the design of a Smart Ultra-Low Power visual sensor architecture that couples together an ultra-low power event-based image sensor with a parallel and power-optimized digital architecture for data processing. By means of mixed-signal circuits, the imager generates a stream of address events after the extraction and binarization of spatial gradients. When targeting monitoring applications, the sensing and processing energy costs can be reduced by two orders of magnitude thanks to either the mixed-signal imaging technology, the event-based data compression and the use of event-driven computing approaches. From a system-level point of view, a context-aware power management scheme is enabled by means of a power-optimized sensor peripheral block, that requests the processor activation only when a relevant information is detected within the focal plane of the imager. When targeting a smart visual node for triggering purpose, the event-driven approach brings a 10x power reduction with respect to other presented visual systems, while leading to comparable results in terms of detection accuracy. To further enhance the recognition capabilities of the smart camera system, this work introduces the concept of event-based binarized neural networks. By coupling together the theory of binarized neural networks and focal-plane processing, a 17.8% energy reduction is demonstrated on a real-world data classification with a performance drop of 3% with respect to a baseline system featuring commercial visual sensors and a Binary Neural Network engine. Moreover, if coupling the BNN engine with the event-driven triggering detection flow, the average power consumption can be as low as the sleep power of 0.3mW in case of infrequent events, which is 8x lower than a smart camera system featuring a commercial RGB imager

ANALOG SIGNAL PROCESSING SOLUTIONS AND DESIGN OF MEMRISTOR-CMOS ANALOG CO-PROCESSOR FOR ACCELERATION OF HIGH-PERFORMANCE COMPUTING APPLICATIONS

Author: Athreyas Nihar
Publication venue: ScholarWorks@UMass Amherst
Publication date: 16/07/2018
Field of study

Emerging applications in the field of machine vision, deep learning and scientific simulation require high computational speed and are run on platforms that are size, weight and power constrained. With the transistor scaling coming to an end, existing digital hardware architectures will not be able to meet these ever-increasing demands. Analog computation with its rich set of primitives and inherent parallel architecture can be faster, more efficient and compact for some of these applications. The major contribution of this work is to show that analog processing can be a viable solution to this problem. This is demonstrated in the three parts of the dissertation. In the first part of the dissertation, we demonstrate that analog processing can be used to solve the problem of stereo correspondence. Novel modifications to the algorithms are proposed which improves the computational speed and makes them efficiently implementable in analog hardware. The analog domain implementation provides further speedup in computation and has lower power consumption than a digital implementation. In the second part of the dissertation, a prototype of an analog processor was developed using commercially available off-the-shelf components. The focus was on providing experimental results that demonstrate functionality and to show that the performance of the prototype for low-level and mid-level image processing tasks is equivalent to a digital implementation. To demonstrate improvement in speed and power consumption, an integrated circuit design of the analog processor was proposed, and it was shown that such an analog processor would be faster than state-of-the-art digital and other analog processors. In the third part of the dissertation, a memristor-CMOS analog co-processor that can perform floating point vector matrix multiplication (VMM) is proposed. VMM computation underlies some of the major applications. To demonstrate the working of the analog co-processor at a system level, a new tool called PSpice Systems Option is used. It is shown that the analog co-processor has a superior performance when compared to the projected performances of digital and analog processors. Using the new tool, various application simulations for image processing and solution to partial differential equations are performed on the co-processor model

Analogue VLSI for temporal frequency analysis of visual data

Author: Sutherland Alasdair
Publication venue: The University of Edinburgh
Publication date: 01/01/2003
Field of study

A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

Author: Azarkhish Erfan
Benini Luca
Bonetti Andrea
Emery Stephane
Jokic Petar
Pons Marc
Publication venue
Publication date: 24/06/2021
Field of study

Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each utilized optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing-down the research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing to assess the design choices for each building block separately. Reported optimizations range from up to 10'000x memory savings to 33x energy reductions, providing chip designers an overview of design choices for implementing efficient low power neural network accelerators

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Energy-Efficient Computing for Mobile Signal Processing

Author: Seo Sangwon
Publication venue
Publication date: 01/01/2011
Field of study

Mobile devices have rapidly proliferated, and deployment of handheld devices continues to increase at a spectacular rate. As today's devices not only support advanced signal processing of wireless communication data but also provide rich sets of applications, contemporary mobile computing requires both demanding computation and efficiency. Most mobile processors combine general-purpose processors, digital signal processors, and hardwired application-specific integrated circuits to satisfy their high-performance and low-power requirements. However, such a heterogeneous platform is inefficient in area, power and programmability. Improving the efficiency of programmable mobile systems is a critical challenge and an active area of computer systems research. SIMD (single instruction multiple data) architectures are very effective for data-level-parallelism intense algorithms in mobile signal processing. However, new characteristics of advanced wireless/multimedia algorithms require architectural re-evaluation to achieve better energy efficiency. Therefore, fourth generation wireless protocol and high definition mobile video algorithms are analyzed to enhance a wide-SIMD architecture. The key enhancements include 1) programmable crossbar to support complex data alignment, 2) SIMD partitioning to support fine-grain SIMD computation, and 3) fused operation to support accelerating frequently used instruction pairs. Near-threshold computation has been attractive in low-power architecture research because it balances performance and power. To further improve energy efficiency in mobile computing, near-threshold computation is applied to a wide SIMD architecture. This proposed near-threshold wide SIMD architecture-Diet SODA-presents interesting architectural design decisions such as 1) very wide SIMD datapath to compensate for degraded performance induced by near-threshold computation and 2) scatter-gather data prefetcher to exploit large latency gap between memory and the SIMD datapath. Although near-threshold computation provides excellent energy efficiency, it suffers from increased delay variations. A systematic study of delay variations in near-threshold computing is performed and simple techniques-structural duplication and voltage/frequency margining-are explored to tolerate and mitigate the delay variations in near-threshold wide SIMD architectures. This dissertation analyzes representative wireless/multimedia mobile signal processing algorithms, proposes an energy-efficient programmable platform, and evaluates performance and power. A main theme of this dissertation is that the performance and efficiency of programmable embedded systems can be significantly improved with a combination of parallel SIMD and near-threshold computations.Ph.D.Electrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/86356/1/swseo_1.pd

TinyVers: A Tiny Versatile System-on-chip with State-Retentive eMRAM for ML Inference at the Extreme Edge

Author: Boons Bert
De Roose Jaro
Giraldo Sebastian
Jain Vikram
Mei Linyan
Verhelst Marian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 09/01/2023
Field of study

Extreme edge devices or Internet-of-thing nodes require both ultra-low power always-on processing as well as the ability to do on-demand sampling and processing. Moreover, support for IoT applications like voice recognition, machine monitoring, etc., requires the ability to execute a wide range of ML workloads. This brings challenges in hardware design to build flexible processors operating in ultra-low power regime. This paper presents TinyVers, a tiny versatile ultra-low power ML system-on-chip to enable enhanced intelligence at the Extreme Edge. TinyVers exploits dataflow reconfiguration to enable multi-modal support and aggressive on-chip power management for duty-cycling to enable smart sensing applications. The SoC combines a RISC-V host processor, a 17 TOPS/W dataflow reconfigurable ML accelerator, a 1.7

\mu

W deep sleep wake-up controller, and an eMRAM for boot code and ML parameter retention. The SoC can perform up to 17.6 GOPS while achieving a power consumption range from 1.7

\mu

W-20 mW. Multiple ML workloads aimed for diverse applications are mapped on the SoC to showcase its flexibility and efficiency. All the models achieve 1-2 TOPS/W of energy efficiency with power consumption below 230

\mu

W in continuous operation. In a duty-cycling use case for machine monitoring, this power is reduced to below 10

\mu

W.Comment: Accepted in IEEE Journal of Solid-State Circuit

arXiv.org e-Print Archive

Digital and Analog Computing Paradigms in Printed Electronics

Author: Weller Dennis
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 24/03/2021
Field of study

Da das Ende von Moore\u27s Gesetz schon absehbar ist, müssen neue Wege gefunden werden um den innovationsgetriebenen IT-Markt mit neuartiger Elektronik zu sättigen. Durch den Einsatz von kostengünstiger Hardware mit flexiblem Formfaktor, welche auf neuartigen Materialien und Technologien beruhen, können neue Anwendungsbereiche erschlossen werden, welche über konventionelle siliziumbasierte Elektronik hinausgehen. Im Fokus sind hier insbesondere elektronische Systeme, welche es ermöglichen Konsumgüter für den täglichen Bedarf zu überwachen - z.B. im Zusammenhang einer Qualitätskontrolle - indem sie in das Produkt integriert werden als Teil einer intelligenten Verpackung und dadurch nur begrenzte Produktlebenszeit erfordern. Weitere vorhersehbare Anwendungsbereiche sind tragbare Elektronik oder Produkte für das "Internet der Dinge". Hier entstehen Systemanforderungen wie flexible, dehnbare Hardware unter Einsatz von ungiftigen Materialien. Aus diesem Grund werden additive Technologien herangezogen, wie zum Beispiel gedruckte Elektronik, welche als komplementär zu siliziumbasierten Technologien betrachtet wird, da sie durch den simplen Herstellungsprozess sehr geringe Produktionskosten ermöglicht, und darüber hinaus auf ungiftigen und funktionalen Materialien basiert, welche auf flexible Plastik- oder Papiersubstrate aufgetragen werden können. Unter den verschiedenen Druckprozessen ist insbesondere der Tintenstrahldruck für zukünftige gedruckte Elektronikanwendungen interessant, da er eine Herstellung vor Ort und nach Bedarf ermöglicht auf Grund seines maskenlosen Druckprozesses. Da sich jedoch die Technologie der Tintenstrahl-druckbaren Elektronik in der Frühphasenentwicklung befindet, ist es fraglich ob Schaltungen für zukünftige Anwendungsfelder überhaupt entworfen werden können, beziehungsweise ob sie überhaupt herstellbar sind. Da die laterale Auflösung von Druckprozessen sich um mehrere Größenordnungen über siliziumbasierten Herstellungstechnologien befindet und des Weiteren entweder nur p- oder n-dotierte Transistoren verfügbar sind, können existierende Schaltungsentwürfe nicht direkt in die gedruckte Elektronik überführt werden. Dies führt zu der wissenschaftlichen Fragestellung, welche Rechenparadigmen überhaupt sinnvoll anwendbar sind im Bereich der gedruckten Elektronik. Die Beantwortung dieser Frage wird Schaltungsdesignern in der Zukunft helfen, erfolgreich gedruckte Schaltungen für den sich rasch entwickelnden Konsumgütermarkt zu entwerfen und zu produzieren. Aus diesem Anlass exploriert diese Arbeit verschiedene Rechenparadigmen und Schaltungsentwürfe, welche als essenziell für zukünftige, gedruckte Systeme betrachtet werden. Die erfolgte Analyse beruht auf der recht jungen "Electrolyte-gated Transistor" (EGT) Technologie, welche auf einem kostengünstigen Tintenstrahldruckverfahren basiert und sehr geringe Betriebsspannungen ermöglicht. Da bisher nur einfache Logik-Gatter in der EGT-Technologie realisiert wurden, wird in dieser Arbeit der Entwurfsraum weiter exploriert, durch die Entwicklung von gedruckten Speicherbausteinen, Lookup Tabellen, künstliche Neuronen und Entscheidungsbäume. Besonders bei dem künstlichen Neuron und den Entscheidungsbäumen wird Bezug auf Hardware-Implementierungen von Algorithmen des maschinellen Lernens gemacht und die Skalierung der Schaltungen auf die Anwendungsebene aufgezeigt. Die Rechenparadigmen, welche in dieser Arbeit evaluiert wurden, reichen von digitalen, analogen, neuromorphen Berechnungen bis zu stochastischen Verfahren. Zusätzlich wurden individuell anpassbare Schaltungsentwürfe untersucht, welche durch das Tintenstrahldruckverfahren ermöglicht werden und zu substanziellen Verbesserungen bezüglich des Flächenbedarfs, Leistungsverbrauch und Schaltungslatenzen führen, indem variable Entwurfsparameter in die Schaltung fest verdrahtet werden. Da die explorierten Schaltungen die Komplexität von bisher hergestellter, gedruckter Hardware weit übertreffen, ist es prinzipiell nicht automatisch garantiert, dass sie herstellbar sind, was insbesondere die nicht-digitalen Schaltungen betrifft. Aus diesem Grund wurden in dieser Arbeit EGT-basierte Hardware-Prototypen hergestellt und bezüglich Flächenbedarf, Leistungsverbrauch und Latenz charakterisiert. Die Messergebnisse können verwendet werden, um eine Extrapolation auf komplexere anwendungsbezogenere Schaltungsentwürfe durchzuführen. In diesem Zusammenhang wurden Validierungen von den entwickelten Hardware-Implementierungen von Algorithmen des maschinellen Lernens durchgeführt, um einen Wirksamkeitsnachweis zu erhalten. Die Ergebnisse dieser Thesis führen zu mehreren Schlussfolgerungen. Zum ersten kann gefolgert werden, dass die sequentielle Verarbeitung von Algorithmen in gedruckter EGT-basierter Hardware prinzipiell möglich ist, da, wie in dieser Arbeit dargestellt wird, neben kombinatorischen Schaltungen auch Speicherbausteine implementiert werden können. Letzteres wurde experimentell validiert. Des Weiteren können analoge und neuromorphe Rechenparadigmen sinnvoll eingesetzt werden, um gedruckte Hardware für maschinelles Lernen zu realisieren, um gegenüber konventionellen Methoden die Komplexität von Schaltungsentwürfen erheblich zu minimieren, welches schlussendlich zu einer höheren Produktionsausbeute im Herstellungsprozess führt. Ebenso können neuronale Netzwerkarchitekturen, welche auf Stochastic Computing basieren, zur Reduzierung des Hardwareumfangs gegenüber konventionellen Implementierungen verwendet werden. Letztlich kann geschlussfolgert werden, dass durch den Tintenstrahldruckprozess Schaltungsentwürfe bezüglich Kundenwünschen während der Herstellung individuell angepasst werden können, um die Anwendbarkeit von gedruckter Hardware generell zu erhöhen, da auch hier geringerer Hardwareaufwand im Vergleich zu konventionellen Schaltungsentwürfen erreicht wird. Es wird antizipiert, dass die in dieser Thesis vorgestellten Forschungsergebnisse relevant sind für Informatiker, Elektrotechniker und Materialwissenschaftler, welche aktiv im Bereich der druckbaren Elektronik arbeiten. Die untersuchten Rechenparadigmen und ihr Einfluss auf Verhalten und wichtige Charakteristiken gedruckter Hardware geben Einblicke darüber, wie gedruckte Schaltungen in der Zukunft effizient umgesetzt werden können, um neuartige auf Druckverfahren-basierte Produkte im Elektronikbereich zu ermöglichen

Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine

Author: Andri Renzo
Benini Luca
Cavigelli Lukas
Rossi Davide
Publication venue
Publication date: 01/01/2019
Field of study

Deep neural networks have achieved impressive results in computer vision and machine learning. Unfortunately, state-of-the-art networks are extremely compute and memory intensive which makes them unsuitable for mW-devices such as IoT end-nodes. Aggressive quantization of these networks dramatically reduces the computation and memory footprint. Binary-weight neural networks (BWNs) follow this trend, pushing weight quantization to the limit. Hardware accelerators for BWNs presented up to now have focused on core efficiency, disregarding I/O bandwidth and system-level efficiency that are crucial for deployment of accelerators in ultra-low power devices. We present Hyperdrive: a BWN accelerator dramatically reducing the I/O bandwidth exploiting a novel binary-weight streaming approach, which can be used for arbitrarily sized convolutional neural network architecture and input resolution by exploiting the natural scalability of the compute units both at chip-level and system-level by arranging Hyperdrive chips systolically in a 2D mesh while processing the entire feature map together in parallel. Hyperdrive achieves 4.3 TOp/s/W system-level efficiency (i.e., including I/Os)---3.1x higher than state-of-the-art BWN accelerators, even if its core uses resource-intensive FP16 arithmetic for increased robustness

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna