Design-Space Exploration of Mixed-precision DNN Accelerators based on Sum-Together Multipliers
Mixed-precision quantization (MPQ) is gaining momentum in academia and industry as a way to improve the trade-off between accuracy and latency of Deep Neural Networks (DNNs) in edge applications. MPQ requires dedicated hardware that supports different bit-widths. One approach uses Precision-Scalable MAC units (PSMACs) based on multipliers operating in Sum-Together (ST) mode, which can be configured to compute N = 1, 2 or 4 multiplications/dot-products in parallel with operands at 16/N bits. We contribute to the State of the Art (SoA) in three directions: we compare for the first time the SoA ST multiplier architectures in terms of performance, power and area; we extend the portfolio of ST-based accelerators by proposing three designs for the most common DNN operations: 2D Convolution, Depth-wise Convolution and Fully-Connected layers; and we show how these accelerators can be obtained with a High-Level Synthesis (HLS) flow. In particular, we perform a design-space exploration (DSE) in area, latency and power, varying many knobs, including PSMAC unit parallelism, clock frequency and ST multiplier type. From the DSE on a 28-nm technology we observe that, both at the multiplier level and at the accelerator level, there is no one-size-fits-all solution for every scenario. Our findings allow accelerator designers to choose, out of a rich variety, the best combination of ST multiplier and HLS knobs depending on their target: high performance, low area, or low power.
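The Sum-Together idea, one wide multiplication whose subword fields yield a dot product of low-precision operands, can be illustrated with a minimal functional model. This is a sketch, not the paper's circuit: the function name, the use of a 2-element (N = 2) case, and the generous 16-bit guard field `s` are our assumptions for clarity.

```python
def st_dot2(a, b, s=16):
    """Functional model of a Sum-Together multiplier in N=2 mode:
    a single wide multiplication returns a[0]*b[0] + a[1]*b[1]
    for small unsigned operands. 's' is the subword field width,
    chosen wide enough that no field overflows into its neighbour."""
    # Pack 'a' low-to-high and 'b' high-to-low, so both cross products
    # (a0*b0 and a1*b1) land in the middle field of the wide product.
    A = a[0] + (a[1] << s)
    B = b[1] + (b[0] << s)
    P = A * B  # the single wide multiplication
    # Middle field holds a0*b0 + a1*b1; the low field (a0*b1) cannot
    # carry into it because a0*b1 < 2**s for small operands.
    return (P >> s) & ((1 << s) - 1)

assert st_dot2((3, 5), (7, 2)) == 3*7 + 5*2  # dot product = 31
```

A real ST multiplier gets the same effect by gating carries inside one partial-product array rather than by software packing, but the field arithmetic above is the principle being exploited.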
A Reconfigurable Depth-Wise Convolution Module for Heterogeneously Quantized DNNs
In Deep Neural Networks (DNNs), the depth-wise separable convolution has often replaced the standard 2D convolution, as it requires far fewer parameters and operations. Another common technique to squeeze DNNs is heterogeneous quantization, which uses a different bit-width for each layer. In this context, we propose for the first time a novel Reconfigurable Depth-wise convolution Module (RDM), which uses multipliers that can be reconfigured to perform 1, 2 or 4 operations at the same time at correspondingly lower operand precision. We leveraged High-Level Synthesis to produce five RDM variants with different channel parallelism to cover a wide range of DNNs. Comparisons with a non-configurable Standard Depth-wise convolution module (SDM) on a CMOS FDSOI 28-nm technology show a significant latency reduction for a given silicon area in the low-precision configurations.
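The parameter saving of the depth-wise separable factorization mentioned above follows directly from the standard counting formulas; the short sketch below (illustrative, with our own function names and an arbitrary example layer shape) makes the ratio concrete.

```python
def conv2d_params(k, c_in, c_out):
    """Weights of a standard 2D convolution with a k x k kernel (bias ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    """Depth-wise separable convolution = depth-wise stage (one k x k filter
    per input channel) + point-wise 1x1 stage mixing channels."""
    return k * k * c_in + c_in * c_out

std = conv2d_params(3, 64, 128)        # 3*3*64*128 = 73728
sep = dw_separable_params(3, 64, 128)  # 3*3*64 + 64*128 = 8768
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For this example shape the separable form needs roughly 8x fewer weights, which is why it dominates edge-oriented DNNs and why a dedicated depth-wise module is worth reconfiguring.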
STAR: Sum-Together/Apart Reconfigurable Multipliers for Precision-Scalable ML Workloads
To achieve an optimal balance between accuracy and latency in Deep Neural Networks (DNNs), precision-scalability has become a paramount feature of hardware specialized for Machine Learning (ML) workloads. Recently, many precision-scalable (PS) multipliers and multiply-and-accumulate (MAC) units have been proposed. They are mainly divided into two categories, Sum-Apart (SA) and Sum-Together (ST), and have always been presented as alternative implementations. Instead, in this paper, we introduce for the first time a new class of PS Sum-Together/Apart Reconfigurable multipliers, which we call STAR, designed to support both SA and ST modes with a single reconfigurable architecture. STAR multipliers could be useful in the MAC units of CPUs or hardware accelerators, for example, enabling them to handle both 2D Convolution (in ST mode) and Depth-wise Convolution (in SA mode) with a single PS hardware design, thus saving hardware resources. We derive four distinct STAR multiplier architectures, including two derived from the well-known Divide-and-Conquer and Sub-word Parallel SA and ST families, which support 16-, 8- and 4-bit precision. We perform an extensive exploration of these architectures in terms of power, performance, and area, across a wide range of clock frequency constraints, from 0.4 to 2.0 GHz, targeting a 28-nm CMOS technology. We identify the Pareto-optimal solutions with the lowest area and power in the low-frequency, mid-frequency, and high-frequency ranges. Our findings allow designers to select the best STAR solution depending on their design target: low power and low area, high performance, or a balance of the two.
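The functional difference between the two modes that STAR unifies can be stated in a few lines. This is a toy behavioral model, not the hardware: the function name and operand values are our own, and a real unit would pack the N operand pairs into the subwords of a single multiplier.

```python
def ps_multiply(a_words, b_words, mode):
    """Toy behavioral model of a precision-scalable multiplier with
    N low-precision operand pairs (N = 1, 2 or 4).
    Sum-Apart  ('SA'): keep the N products separate, e.g. N parallel
                       depth-wise channels, each with its own accumulator.
    Sum-Together ('ST'): reduce the N products into one dot product,
                       e.g. a single 2D-convolution partial sum."""
    products = [a * b for a, b in zip(a_words, b_words)]
    return products if mode == "SA" else sum(products)

a, b = [3, -1, 4, 2], [2, 5, -3, 6]
print(ps_multiply(a, b, "SA"))  # [6, -5, -12, 12]
print(ps_multiply(a, b, "ST"))  # 1
```

The point of a STAR design is that the same partial-product hardware produces either result, selected by a mode signal, instead of instantiating one SA and one ST multiplier side by side.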
A Machine-Learning Based Microwave Sensing Approach to Food Contaminant Detection
To detect contaminants accidentally included in packaged foods, food industries use an array of systems ranging from metal detectors to X-ray imagers. Low-density plastic or glass contaminants, however, are not easily detected with standard methods. If the dielectric contrast between the packaged food and these contaminants in the microwave spectrum is appreciable, Microwave Sensing (MWS) can be used as a contactless detection method, which is particularly useful when the food is already packaged. In this paper we propose using MWS combined with Machine Learning (ML). In particular, we report on experiments with packaged cocoa-hazelnut spread and show the accuracy of our approach. We also present an FPGA accelerator that runs the ML processing in real time so as to keep up with the throughput of a production line.
Gender differences and hypercholesterolemia: real-world evidence from the study WECARE (Women Effective CArdiovascular Risk Evaluation)
Introduction: The therapeutic control of LDL cholesterol is essential in cardiovascular prevention, as recommended by recent guidelines.
Objective: To evaluate gender differences in demographic and clinical characteristics, treatment patterns, treatment adherence and healthcare costs in patients on lipid-lowering therapy, stratified by cardiovascular risk, in Italian real-world clinical practice.
Methods: An observational analysis was conducted on the administrative databases of healthcare institutions covering about 6.1 million health-assisted subjects. After inclusion of all patients on lipid-lowering therapy between January 2017 and June 2020, the population was investigated in the period before the first prescription of a lipid-lowering drug and followed up for at least 12 months. Clinical and demographic variables were compared after stratification by gender and by cardiovascular risk (very high/high/other risk). The main outcome measures were treatment adherence and direct healthcare costs during follow-up.
Results: Of the 684,829 patients with high/very high cardiovascular risk, 337,394 were men and 347,435 were women, aged on average 69.3 and 72.1 years, respectively (p < 0.001). Men had a worse comorbidity profile. Regardless of cardiovascular risk, women showed greater use of low-potency statins and lower adherence (p < 0.001). The annual healthcare costs per patient during follow-up were higher in men than in women (p < 0.001).
Conclusions: The results highlighted greater use of low-potency statins, lower adherence and a milder comorbidity profile in women, the latter plausibly explaining their reduced healthcare costs compared to men.
Design of an NFC-based interface circuit for ultra-low-power sensor nodes
One of the most important factors behind the exponential growth of the Internet of Things (IoT) is the energy efficiency achieved by modern electronics, which has made ultra-low-power integrated circuits possible. In this context, it is feasible to build nano-current sensor nodes with datalogger functionality based on microcontrollers. For these devices, a theoretical lifetime of several years on a simple coin-cell battery has been demonstrated, which makes them suitable for applications in which replacing the battery is difficult or uneconomical.
The purpose of this work is to extend the design of such a nano-current sensor node by adding a communication interface to the outside world for retrieving the acquired data, one that is energetically self-sufficient and therefore does not affect battery lifetime.
The project discussed in this thesis exploits Near Field Communication (NFC) technology.
The work starts from a general study of NFC technology and NFC antennas, discussing the electromagnetic laws underlying energy and data transfer, up to the realization of an antenna prototype on a PCB (Printed Circuit Board). It then examines the characteristics of the components used and the reasons for their selection, followed by the programming of the microcontroller. It concludes with the analysis of waveforms captured with an oscilloscope and a protocol analyzer to demonstrate the correct operation of the circuit, and measures the impact of the added components on the system's power consumption.
A Reconfigurable Multiplier/Dot-Product Unit for Precision-Scalable Deep Learning Applications
Across different Deep Learning (DL) applications, or within the same application in different phases, the bit-width precision of activations and weights may vary. Moreover, the energy and latency of MAC units have to be minimized, especially at the edge. Hence, various precision-scalable MAC units optimized for DL have recently emerged. Our contribution is a new precision-configurable multiplier/dot-product unit based on a modified Radix-4 Booth signed multiplier with Sum-Together (ST) mode. Besides 16-bit full-precision multiplications, it can be reconfigured to perform dot products between two 8-bit or four 4-bit sub-words of the input operands without requiring an external adder, thus reducing the number of cycles of MAC operations. Synthesis results in performance, power and area on a 28-nm technology show that our unit (1) is superior in area to other state-of-the-art ST multipliers (≈35% less) in the clock frequency range between 100 and 1000 MHz, and (2) reduces latency by up to 4x when used to compute a convolutional layer, at the cost of limited overheads in area (+10%) and power (+13%) compared to a conventional 16-bit Booth multiplier. This unit can play an important role in the design of variable-precision MAC units or DL accelerators for edge devices.
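The radix-4 Booth recoding on which the unit is built halves the number of partial products of a signed multiplication, which is what makes it an attractive base for adding ST reconfigurability. The sketch below is a textbook model of the recoding only (function names and the software summation of partial products are our own; the paper's circuit reduces them with hardware adders).

```python
def booth_radix4_digits(b, nbits=16):
    """Radix-4 Booth recoding of an nbits-wide two's-complement value:
    nbits/2 digits, each in {-2, -1, 0, +1, +2}, so only half as many
    partial products as a bit-at-a-time scheme."""
    bits = [(b >> i) & 1 for i in range(nbits)]
    bits = [0] + bits  # implicit bit b_{-1} = 0
    digits = []
    for i in range(0, nbits, 2):
        # d_i = -2*b_{2i+1} + b_{2i} + b_{2i-1}
        d = -2 * bits[i + 2] + bits[i + 1] + bits[i]
        digits.append(d)
    return digits

def booth_multiply(a, b, nbits=16):
    """Sum the Booth-selected partial products a * d_i * 4^i."""
    return sum(d * a * (4 ** i)
               for i, d in enumerate(booth_radix4_digits(b, nbits)))

assert booth_multiply(123, -45) == 123 * -45
```

In the ST extension described by the paper, the same partial-product array is split so that its fields accumulate two 8-bit or four 4-bit products instead of one 16-bit product; the recoding per sub-word is unchanged.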
A Reconfigurable 2D-Convolution Accelerator for DNNs Quantized with Mixed-Precision
Mixed-precision quantization assigns to each layer of a Deep Neural Network the minimum bit-width that preserves accuracy. In this context, our new Reconfigurable 2D-Convolution Module (RCM) computes N = 1, 2 or 4 Multiply-and-Accumulate operations in parallel, with precision configurable from 1 to 16/N bits. Our design-space exploration via High-Level Synthesis obtains the best points in the latency vs. area space, varying the size of the tensor tile handled by our RCM and its parallelism. A comparison with a non-configurable module on a 28-nm technology shows many reconfigurable Pareto points for low bit-width configurations, making our RCM a promising mixed-precision accelerator for inference.
High-Level Design of Precision-Scalable DNN Accelerators Based on Sum-Together Multipliers
Precision-scalable (PS) multipliers are gaining traction in Deep Neural Network accelerators, particularly for enabling mixed-precision (MP) quantization in Deep Learning at the edge. This paper focuses on the Sum-Together (ST) class of PS multipliers, which are subword-parallel multipliers that can execute a standard multiplication at full precision or a dot product with parallel low-precision operands. Our contributions in this area encompass multiple aspects: we enrich our previous comparison of SoA ST multipliers by including our recent radix-4 Booth ST multiplier and two novel designs; we extend the explanation of the architecture and design flow of our previously proposed ST-based PS hardware accelerators for 2D-Convolution, Depth-wise Convolution, and Fully-Connected layers, developed using High-Level Synthesis (HLS); we implement the uniform integer quantization equations in hardware; we conduct a broad HLS-driven design-space exploration of our ST-based accelerators, varying numerous hardware parameters; finally, we showcase the advantages of ST-based accelerators when integrated into Systems-on-Chip (SoCs) in three different scenarios (low-area, low-power, and low-latency), running inference on MP-quantized MLPerf Tiny models as a case study. Across the three scenarios, the results show average latency speedups of 1.46x, 1.33x, and 1.29x, reduced energy consumption in most cases, and marginal area overheads of 0.9%, 2.5% and 8.0%, compared to SoCs with accelerators based on fixed-precision 16-bit multipliers. In summary, our work provides a comprehensive understanding of ST-based accelerators' performance in an SoC context, paving the way for future enhancements and the resolution of the identified inefficiencies.
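The uniform integer quantization that the paper implements in hardware follows the standard textbook affine mapping; a minimal reference model is sketched below (symbol names, the example scale/zero-point, and the signed 8-bit range are our assumptions, not the paper's specific parameters).

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Uniform integer quantization: q = clamp(round(x / scale) + zero_point).
    'scale' maps one integer step to a real-valued interval; 'zero_point'
    is the integer that represents real 0."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Inverse affine mapping: x_hat = scale * (q - zero_point)."""
    return scale * (q - zero_point)

s, z = 0.05, 10
q = quantize(1.37, s, z)        # round(27.4) + 10 = 37
x_hat = dequantize(q, s, z)     # 0.05 * (37 - 10), about 1.35
print(q, x_hat)
```

In an MP-quantized accelerator these equations run per layer with that layer's scale and zero-point, so the same integer datapath serves every bit-width configuration.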