Adaptive Latency Insensitive Protocols and Elastic Circuits with Early Evaluation: A Comparative Analysis
Latency Insensitive Protocols (LIP) and Elastic Circuits (EC) solve the same problem of rendering a design tolerant to additional latencies caused by wires or computational elements. They are performance-limited by a firing semantics that enforces coherency through a lazy evaluation rule: computation is enabled only if all inputs to a block are simultaneously available. Adaptive LIPs (ALIP) and EC with early evaluation (ECEE) increase performance by relaxing the evaluation rule: computation is enabled as soon as the subset of inputs needed at a given time is available. Their differences in implementation and in behavior in selected cases justify the comparative analysis reported in this paper. Results have been obtained through simple examples, a single representative case study already used in the context of both LIPs and EC, and extensive simulations over a suite of benchmarks.
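The two firing rules described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical two-input multiplexer block with inputs named `sel`, `a` and `b`; these names and the dictionary-based interface are illustrative only and are not taken from the paper. The sketch covers only the enable condition and ignores how a real protocol deals with the unneeded input when it arrives later.

```python
def lazy_can_fire(valid):
    """Lazy rule: the block fires only when every input is valid."""
    return all(valid.values())


def early_can_fire(valid, select_value):
    """Early rule for a 2-input multiplexer (hypothetical block).

    The block needs the select input plus only the data input that
    the select value chooses (0 selects "a", anything else selects "b").
    """
    if not valid["sel"]:
        return False
    needed = "a" if select_value == 0 else "b"
    return valid[needed]


# Input "b" has not arrived yet, but the select value chooses "a".
valid = {"sel": True, "a": True, "b": False}
print(lazy_can_fire(valid))      # False: "b" is missing
print(early_can_fire(valid, 0))  # True: only "sel" and "a" are needed
```

With the same input state, the lazy rule stalls while the early rule lets the block proceed, which is the source of the performance gain the abstract refers to.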
Design-Space Exploration of Mixed-precision DNN Accelerators based on Sum-Together Multipliers
Mixed-precision quantization (MPQ) is gaining momentum in academia and industry as a way to improve the trade-off between accuracy and latency of Deep Neural Networks (DNNs) in edge applications. MPQ requires dedicated hardware to support different bit-widths. One approach uses Precision-Scalable MAC units (PSMACs) based on multipliers operating in Sum-Together (ST) mode. These can be configured to compute N = 1, 2, 4 multiplications/dot-products in parallel with operands at 16/N bits. We contribute to the State of the Art (SoA) in three directions: we compare for the first time the SoA ST multipliers architectures in performance, power and area; compared to previous work, we contribute to the portfolio of ST-based accelerators proposing three designs for the most common DNN algorithms: 2D-Convolution, Depth-wise Convolution and Fully-Connected; we show how these accelerators can be obtained with a High-Level Synthesis (HLS) flow. In particular, we perform a design-space exploration (DSE) in area, latency, power, varying many knobs, including PSMAC units parallelism, clock frequency and ST multipliers type. From the DSE on a 28-nm technology we observe that both at multiplier level and at accelerator level there is no one-fits-all solution for each possible scenario. Our findings allow accelerators’ designers to choose, out of a rich variety, the best combination of ST multiplier and HLS knobs depending on the target, either high performance, low area, or low power
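As a rough functional illustration of the Sum-Together mode described above, the sketch below treats each 16-bit operand as N packed sub-words of 16/N bits and returns the sum of the N sub-word products. The function name `st_multiply` and the restriction to unsigned operands are assumptions made for brevity; the sketch models behavior only and says nothing about the hardware architectures compared in the paper.

```python
def st_multiply(a, b, n):
    """Behavioral model of a Sum-Together multiplier (unsigned, assumed).

    a, b: 16-bit operands, each holding n packed sub-words of 16/n bits.
    n:    number of parallel multiplications (1, 2 or 4).
    Returns the dot product of the sub-words of a and b.
    """
    if n not in (1, 2, 4):
        raise ValueError("n must be 1, 2 or 4")
    width = 16 // n
    mask = (1 << width) - 1
    total = 0
    for i in range(n):
        shift = i * width
        sub_a = (a >> shift) & mask  # i-th sub-word of a
        sub_b = (b >> shift) & mask  # i-th sub-word of b
        total += sub_a * sub_b
    return total


print(st_multiply(0x0102, 0x0304, 1))  # 258 * 772 = 199176
print(st_multiply(0x0102, 0x0304, 2))  # 1*3 + 2*4 = 11
print(st_multiply(0x1234, 0x1111, 4))  # 1 + 2 + 3 + 4 = 10
```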
A Reconfigurable Depth-Wise Convolution Module for Heterogeneously Quantized DNNs
In Deep Neural Networks (DNN), the depth-wise separable convolution has often replaced the standard 2D convolution having much fewer parameters and operations. Another common technique to squeeze DNNs is heterogeneous quantization, which uses a different bitwidth for each layer. In this context we propose for the first time a novel Reconfigurable Depth-wise convolution Module (RDM), which uses multipliers that can be reconfigured to support 1, 2 or 4 operations at the same time at increasingly lower precision of the operands. We leveraged High Level Synthesis to produce five RDM variants with different channels parallelism to cover a wide range of DNNs. The comparisons with a non-configurable Standard Depth-wise convolution module (SDM) on a CMOS FDSOI 28-nm technology show a significant latency reduction for a given silicon area for the low-precision configurations
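The parameter saving mentioned at the start of this abstract follows from the standard weight-count formulas, sketched here for a K x K kernel with biases ignored. The layer sizes in the example are arbitrary and are not taken from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard 2D convolution: one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depth-wise separable convolution.

    Depth-wise stage: one k x k filter per input channel.
    Point-wise stage: a 1 x 1 convolution mapping c_in to c_out channels.
    """
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Arbitrary example: 3x3 kernel, 32 input channels, 64 output channels.
print(standard_conv_params(3, 32, 64))        # 18432
print(depthwise_separable_params(3, 32, 64))  # 2336
```

For this example the separable form needs roughly 8 times fewer weights than the standard convolution.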
HLS-based dataflow hardware architecture for Support Vector Machine in FPGA
Implementing fast and accurate Support Vector Machine (SVM) classifiers in embedded systems with limited compute and memory capacity and in applications with real-time constraints, such as continuous medical monitoring for anomaly detection, can be challenging and calls for low-cost, low-power and resource-efficient hardware accelerators. In this paper, we propose a flexible FPGA-based SVM accelerator highly optimized through a dataflow architecture. Thanks to High Level Synthesis (HLS) and the dataflow method, our design is scalable and can be used for large data dimensions when there is limited on-chip memory. The hardware parallelism is adjustable and can be specified according to the available FPGA resources. The performance of different SVM kernels is evaluated in hardware. In addition, an efficient fixed-point implementation is proposed to improve the speed. We compared our design with recent SVM accelerators and achieved a minimum of 10x speed-up compared to other HLS-based designs and 4.4x compared to HDL-based designs.
Efficient FPGA Implementation of PCA Algorithm for Large Data using High Level Synthesis
Principal Component Analysis (PCA) is a widely used method for dimensionality reduction in different application areas, including microwave imaging where the size of input data is large. Despite its popularity, one of the difficulties in using PCA is its high computational complexity, especially for large dimensional data. In recent years several FPGA implementations have been proposed to accelerate PCA computation. However, most of them use manual RTL design, which requires more time for design and development. In this paper, we propose an FPGA implementation of PCA using High Level Synthesis (HLS), which allows us to explore the design space more efficiently than with hand-coded RTL design. Starting from a PCA algorithm written in C++, we apply various hardware optimization techniques to the same code using Vivado HLS in order to quickly explore the design space. Our experiments show that the performance of the design obtained with the proposed method is superior to the state-of-the-art RTL design in terms of resource utilization, latency and frequency
Exact and heuristic allocation of multi-kernel applications to multi-FPGA platforms
FPGA-based accelerators demonstrated high energy efficiency compared to GPUs and CPUs. However, single FPGA designs may not achieve sufficient task parallelism. In this work, we optimize the mapping of high-performance multi-kernel applications, like Convolutional Neural Networks, to multi-FPGA platforms. First, we formulate the system level optimization problem, choosing within a huge design space the parallelism and number of compute units for each kernel in the pipeline. Then we solve it using a combination of Geometric Programming, producing the optimum performance solution given resource and DRAM bandwidth constraints, and a heuristic allocator of the compute units on the FPGA cluster.
Simulation-based Machine Learning Training for Brain Anomalies Localization at Microwaves
Machine learning is entering medical applications and, in this paper, it is combined with microwave imaging for brain stroke classification. One of the main challenges in this application is the need for a large amount of training data, which can be obtained via measurements or simulations. In this work, we propose to train the algorithm on simulations based on a linear integral operator, which reduces the data generation time by three orders of magnitude with respect to standard full-wave simulations. This method is used here to train the multilayer perceptron algorithm. The dataset is organized in nine classes, related to the presence, the type and the position of the stroke within the brain. We verified that the algorithm metrics (accuracy, recall and precision) reach values close to 1 for each class.
STAR: Sum-Together/Apart Reconfigurable Multipliers for Precision-Scalable ML Workloads
To achieve an optimal balance between accuracy and latency in Deep Neural Networks (DNNs), precision-scalability has become a paramount feature for hardware specialized for Machine Learning (ML) workloads. Recently, many precision-scalable (PS) multipliers and multiply-and-accumulate (MAC) units have been proposed. They are mainly divided in two categories, Sum-Apart (SA) and Sum-Together (ST), and have been always presented as alternative implementations. Instead, in this paper, we introduce for the first time a new class of PS Sum-Together/Apart Reconfigurable multipliers, which we call STAR, designed to support both SA and ST modes with a single reconfigurable architecture. STAR multipliers could be useful in MAC units of CPU or hardware accelerators, for example, enabling them to handle both 2D Convolution (in ST mode) and Depth-wise Convolution (in SA mode) with a unique PS hardware design, thus saving hardware resources. We derive four distinct STAR multiplier architectures, including two derived from the well-known Divide-and-Conquer and Sub-word Parallel SA and ST families, which support 16, 8 and 4-bit precision. We perform an extensive exploration of these architectures in terms of power, performance, and area, across a wide range of clock frequency constraints, from 0.4 to 2.0 GHz, targeting a 28-nm CMOS technology. We identify the Pareto-optimal solutions with the lowest area and power in the low-frequency, mid-frequency, and high-frequency ranges. Our findings allow designers to select the best STAR solution depending on their design target, either low-power and low-area, high performance, or balanced
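The difference between the two modes that STAR multipliers combine can be sketched functionally as follows. The sketch assumes that Sum-Apart keeps the sub-word products separate while Sum-Together adds them, and that operands are unsigned and packed into 16 bits; the names `subword_products` and `star_multiply` are hypothetical and do not reflect any of the four architectures derived in the paper.

```python
def subword_products(a, b, bits):
    """Multiply matching sub-words of two packed 16-bit operands (unsigned, assumed)."""
    if bits not in (16, 8, 4):
        raise ValueError("bits must be 16, 8 or 4")
    count = 16 // bits
    mask = (1 << bits) - 1
    return [
        ((a >> (i * bits)) & mask) * ((b >> (i * bits)) & mask)
        for i in range(count)
    ]


def star_multiply(a, b, bits, mode):
    """Behavioral model with a run-time mode switch.

    mode "SA": return the products separately, lowest sub-word first.
    mode "ST": return the sum of the products.
    """
    products = subword_products(a, b, bits)
    if mode == "SA":
        return products
    if mode == "ST":
        return sum(products)
    raise ValueError("mode must be 'SA' or 'ST'")


print(star_multiply(0x0102, 0x0304, 8, "SA"))  # [8, 3]
print(star_multiply(0x0102, 0x0304, 8, "ST"))  # 11
```

Both modes reuse the same sub-word products and differ only in the final step, which gives an intuition for why a single reconfigurable design can serve both.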
Model-based data generation for support vector machine stroke classification
This paper presents a new and efficient method to generate a dataset for brain stroke classification. Exploiting the Born approximation, it derives scattering parameters at the antenna locations in a 3-D scenario through a linear integral operator. This technique allows a large amount of data to be created in a short time compared with full-wave simulations or measurements. Then, the support vector machine is used to create the classifier model, based on training set data with a supervised method, and to classify the test set. The dataset is composed of 9 classes, differentiated by presence, type and position of the stroke. The algorithm is able to classify the test set with high accuracy.
Brain Stroke Classification via Machine Learning Algorithms Trained with a Linearized Scattering Operator
This paper proposes an efficient and fast method to create large datasets for machine learning algorithms applied to brain stroke classification via microwave imaging systems. The proposed method is based on the distorted Born approximation and linearization of the scattering operator, in order to minimize the time to generate the large datasets needed to train the machine learning algorithms. The method is then applied to a microwave imaging system, which consists of twenty-four antennas conformal to the upper part of the head, realized with a 3D anthropomorphic multi-tissue model. Each antenna acts as a transmitter and receiver, and the working frequency is 1 GHz. The data are elaborated with three machine learning algorithms: support vector machine, multilayer perceptron, and k-nearest neighbours, comparing their performance. All classifiers can identify the presence or absence of the stroke, the kind of stroke (haemorrhagic or ischemic), and its position within the brain. The trained algorithms were tested with datasets generated via full-wave simulations of the overall system, considering also slightly modified antennas and limiting the data acquisition to amplitude only. The obtained results are promising for a possible real-time brain stroke classification
- …