
    Energy-efficient embedded machine learning algorithms for smart sensing systems

    Embedded autonomous electronic systems are required in numerous application domains such as the Internet of Things (IoT), wearable devices, and biomedical systems. Embedded electronic systems usually host sensors, and each sensor hosts multiple input channels (e.g., tactile, vision) tightly coupled to the electronic computing unit (ECU). The ECU extracts information, often employing sophisticated methods such as Machine Learning. However, embedding Machine Learning algorithms poses essential challenges in terms of hardware resources and energy consumption because of: 1) the high amount of data to be processed; and 2) computationally demanding methods. Leveraging the trade-off between quality requirements and computational complexity and latency can reduce system complexity without affecting performance. The objectives of this thesis are to develop: 1) energy-efficient arithmetic circuits that outperform state-of-the-art solutions for embedded machine learning algorithms, and 2) an energy-efficient embedded electronic system for the “electronic-skin” (e-skin) application. As such, this thesis exploits two main approaches.

    1) Approximate Computing: In recent years, the approximate computing paradigm has become a major field of research since it can enhance the energy efficiency and performance of digital systems. Approximate Computing (AC) has turned out to be a practical approach to trading accuracy for better power, latency, and area. AC targets error-resilient applications and offers promising benefits by conserving resources. Approximate results are acceptable for many applications, e.g., tactile data processing, image processing, and data mining; thus, it is attractive to take advantage of the energy reduction at the cost of a minimal variation in performance. In our work, we developed two approximate multipliers: 1) the “META” multiplier, based on the Error Tolerant Adder (ETA), and 2) the “Approximate Baugh-Wooley (BW)” multiplier, in which the approximations are introduced in the generation of the partial products. We showed that the proposed approximate arithmetic circuits achieve a relevant reduction in power consumption and time delay, around 80.4% and 24% respectively, with respect to the exact BW multiplier. Next, to prove the feasibility of AC in real-world applications, we explored the approximate multipliers in a case study: the e-skin application. The e-skin system comprises multiple components: 1) structural materials, 2) signal processing, 3) data acquisition, and 4) data processing. In particular, processing the data originating from the e-skin into low- or high-level information is the main problem to be addressed by the embedded electronic system. Many studies have shown that Machine Learning is a promising approach for processing tactile data when classifying input touch modalities. In our work, we proposed a methodology for evaluating the behavior of the system when introducing approximate arithmetic circuits into its main stages (i.e., the signal and data processing stages). Based on the proposed methodology, we first implemented the approximate multipliers in the low-pass Finite Impulse Response (FIR) filter of the signal processing stage of the application.
    We found that the FIR filter based on the Approx-BW multiplier outperforms state-of-the-art solutions while respecting the trade-off between accuracy and power consumption, with an SNR degradation of only 1.39 dB. Second, we implemented approximate adders and multipliers in the Coordinate Rotation Digital Computer (CORDIC) and Singular Value Decomposition (SVD) circuits, respectively, since CORDIC and SVD account for a significant share of the computationally expensive Machine Learning algorithms employed in tactile data processing. We showed benefits of up to 21% and 19% in power reduction, at the cost of less than 5% accuracy loss, for the CORDIC and SVD circuits when scaling the number of approximated bits.

    2) Parallel Computing Platforms (PCP): Exploiting parallel architectures for near-threshold computing based on multi-core clusters is a promising approach to improving the performance of smart sensing systems. In our work, we exploited a novel computing platform embedding a Parallel Ultra Low Power processor (PULP), called “Mr. Wolf,” for the implementation of Machine Learning (ML) algorithms for touch modality classification. First, we tested the ML algorithms at the software level; using RGB images as a case study and a tactile dataset, we achieved accuracies of 97% and 83.5%, respectively. After validating the effectiveness of the ML algorithm at the software level, we performed the on-board classification of two touch modalities, demonstrating the promising use of Mr. Wolf for smart sensing systems. Moreover, we proposed a memory management strategy for storing the required trained tensors (i.e., 50 trained tensors per class) in the on-chip memory. We evaluated the execution cycles for Mr. Wolf using a single core, 2 cores, and 3 cores, taking advantage of the benefits of parallelization. We presented a comparison with the popular low-power ARM Cortex-M4F microcontroller usually employed for battery-operated devices. We showed that the ML algorithm on the proposed platform runs 3.7 times faster than on the ARM Cortex-M4F (STM32F40), consuming only 28 mW. The proposed platform achieves 15.7× better energy efficiency than the classification done on the STM32F40, consuming 81 mJ per classification and 150 pJ per operation.
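    To make the approximate-multiplier idea concrete, here is a minimal software sketch (not the META or Approx-BW circuits themselves) in which the partial products of an unsigned shift-and-add multiplication have their low-order bits truncated; the word width, truncation parameter k, and function name are illustrative assumptions.

```python
def approx_multiply(a: int, b: int, width: int = 8, k: int = 4) -> int:
    """Approximate unsigned shift-and-add multiply: every partial product has
    its bits below position k dropped (only the early partial products are
    affected, since later ones carry no weight there). Illustrative only; the
    META and Approx-BW circuits apply different, circuit-level approximations."""
    mask = ~((1 << k) - 1)          # clears bit positions 0 .. k-1
    result = 0
    for i in range(width):
        if (b >> i) & 1:            # i-th bit of the multiplier
            result += (a << i) & mask
    return result

# Compare the approximate and exact products
a, b = 93, 217
print(approx_multiply(a, b), a * b)
```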

    Singular value decomposition based pipeline architecture for MIMO communication systems

    This thesis presents the design, implementation, and performance benchmarking of custom hardware for computing the Singular Value Decomposition (SVD) of the radio communication channel characteristic matrix. Software Defined Radio (SDR) is a concept in which the radio transceiver is implemented by software programs running on a processor. The SVD of the channel characteristic matrix is used in precoding, equalization, and beamforming for Multiple Input Multiple Output (MIMO) and Orthogonal Frequency Division Multiplexing (OFDM) communication systems (e.g., IEEE 802.11n). Since the SVD is computationally intensive, it may require custom hardware to reduce the computing time. The pipeline processor developed in this thesis computes the SVD of a sequence of 2×2 matrices: a stream of 2×2 matrices is sent to the custom hardware, which returns the corresponding streams of singular values and unitary matrices. The architecture is based on the two-sided Jacobi method and utilizes Coordinate Rotation Digital Computer (CORDIC) algorithms. A 2×2 SVD prototype was implemented on a Field-Programmable Gate Array (FPGA) for SDR applications. The prototype outputs the singular values and the corresponding unitary matrices in a pipelined fashion while operating at a data rate of 324 MHz on a Virtex 6 (xc6vlx240t-1ff1156) FPGA. The design consists of fifty-five CORDIC cores, which take 32 percent of the available logic on the FPGA, and achieves the optimal pipeline rate, equal to the maximum hardware clock rate. The depth of the pipeline (latency) is 173 clock cycles for 16-bit data hardware. The proposed architecture provides performance gains over standard software libraries, such as the ZGESVD function of the Linear Algebra PACKage (LAPACK) library, which is based on the Golub-Kahan-Reinsch SVD algorithm, when running on standard processors. The ZGESVD function of LAPACK, as implemented in Intel’s Math Kernel Library (MKL), achieves a projected data rate of 40 MHz on a 2.50 GHz Intel Quad (Q9300) CPU, whereas the pipelined SVD hardware bandwidth equals the clock frequency and the data rate can reach 324 MHz on the ML605 board (Virtex 6 xc6vlx240t). The proposed architecture can also be easily extended to solve 4×4 SVD problems used in precoding and equalization schemes. The proposed algorithm and design perform better for small matrices, even though the general time complexity is O(n²) compared with the O(n log n) complexity of the Brent-Luk-Van Loan (BLV) systolic array using non-pipelined 2×2 processors. The performance gain of the proposed design comes at the cost of increased circuit area. M.S., Computer Engineering -- Drexel University, 201
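    The 2×2 kernel at the heart of the two-sided Jacobi (Kogbetliantz) approach admits a closed-form solution with two plane rotations. The sketch below is not the thesis's fixed-point CORDIC design; the function name and use of floating-point trigonometry are illustrative. It computes the singular values and rotation angles of a 2×2 matrix and checks the singular values against NumPy.

```python
import numpy as np

def svd_2x2(a, b, c, d):
    """Closed-form 2x2 SVD via two plane rotations (one two-sided Jacobi /
    Kogbetliantz step). In the thesis hardware the atan2 and rotations are
    realized with CORDIC; floats are used here only for clarity."""
    e, f = (a + d) / 2, (a - d) / 2
    g, h = (c + b) / 2, (c - b) / 2
    q, r = np.hypot(e, h), np.hypot(f, g)
    s1, s2 = q + r, abs(q - r)                         # singular values
    a1, a2 = np.arctan2(g, f), np.arctan2(h, e)
    theta_u, theta_v = (a2 + a1) / 2, (a2 - a1) / 2    # rotation angles (sign conventions vary)
    return s1, s2, theta_u, theta_v

m = np.array([[1.0, 2.0], [3.0, 4.0]])
s1, s2, tu, tv = svd_2x2(*m.ravel())
print(np.allclose([s1, s2], np.linalg.svd(m, compute_uv=False)))  # True
```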

    HIGH PERFORMANCE, LOW COST SUBSPACE DECOMPOSITION AND POLYNOMIAL ROOTING FOR REAL TIME DIRECTION OF ARRIVAL ESTIMATION: ANALYSIS AND IMPLEMENTATION

    This thesis develops high-performance, real-time signal processing modules for direction of arrival (DOA) estimation in localization systems. It proposes highly parallel algorithms for subspace decomposition and polynomial rooting, which are traditionally implemented with sequential algorithms, addressing the emerging need for real-time localization in a wide range of applications. As the antenna array size increases, the complexity of the signal processing algorithms grows, making it increasingly difficult to satisfy real-time constraints. This thesis addresses real-time implementation by proposing parallel algorithms that maintain a considerable improvement over traditional ones, especially for systems with a larger number of antenna array elements. Singular value decomposition (SVD) and polynomial rooting are two computationally complex steps and act as the bottleneck to achieving real-time performance. The proposed algorithms are suitable for implementation on field-programmable gate arrays (FPGAs), single instruction multiple data (SIMD) hardware, or application-specific integrated circuits (ASICs), which offer a large number of processing elements that can be exploited for parallel processing. The designs proposed in this thesis are modular, easily expandable, and easy to implement. First, this thesis proposes a fast-converging SVD algorithm. The proposed method reduces the number of iterations needed to converge to the correct singular values, bringing performance closer to real time. A general algorithm and a modular system design are provided, making it easy for designers to replicate and extend the design to larger matrix sizes. Moreover, the method is highly parallel, which can be exploited on the hardware platforms mentioned earlier. A fixed-point implementation of the proposed SVD algorithm is presented. The FPGA design is pipelined to the maximum extent to increase the maximum achievable operating frequency, and the system was developed with the objective of achieving high throughput. Various modern cores available in FPGAs were used to maximize performance, and these modules are described in detail. Finally, a parallel polynomial rooting technique based on Newton’s method, applicable exclusively to root-MUSIC polynomials, is proposed. Unique characteristics of the root-MUSIC polynomial’s complex dynamics were exploited to derive this rooting method. The technique is parallel and converges to the desired roots within a fixed number of iterations, making it suitable for rooting polynomials of large degree. We believe this is the first time that the complex dynamics of the root-MUSIC polynomial have been analyzed to derive such an algorithm. In all, the thesis addresses two major bottlenecks in a direction of arrival estimation system by providing simple, high-throughput, parallel algorithms.
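    The abstract does not give the exact Newton iteration or starting points used for root-MUSIC rooting, so the sketch below only illustrates the general idea under stated assumptions: Newton's method is run independently from several starting points near the unit circle (where the roots of interest lie), so each start could be assigned to its own processing element. The function name, toy polynomial, and iteration count are hypothetical.

```python
import numpy as np

def newton_roots(coeffs, starts, iters=20):
    """Run Newton's method independently from each starting point; in hardware
    every start could be handled by a separate processing element. `coeffs`
    are polynomial coefficients, highest degree first."""
    p = np.poly1d(coeffs)
    dp = p.deriv()
    z = np.asarray(starts, dtype=complex)
    for _ in range(iters):
        z = z - p(z) / dp(z)             # one Newton update for all starts at once
    return z

# Toy polynomial z**2 - 1, with starting points spread just outside the unit circle
starts = 1.1 * np.exp(2j * np.pi * (np.arange(8) + 0.5) / 8)
print(np.round(newton_roots([1, 0, -1], starts), 6))   # converges to +1 or -1
```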

    Study of CORDIC based processing element for digital signal processing algorithms

    There is a high demand for the efficient implementation of complex arithmetic operations in many Digital Signal Processing (DSP) algorithms. The COordinate Rotation DIgital Computer (CORDIC) algorithm is well suited to DSP algorithms because its handling of complex arithmetic is simple and elegant, and since it avoids multiplications, adopting it can reduce complexity. In this project, a CORDIC-based processing element for constructing digital signal processing algorithms is implemented. This is a flexible device that can be used to implement functions such as the Singular Value Decomposition (SVD) and the Discrete Cosine Transform (DCT), as well as many other important functions. It uses a CORDIC module to perform arithmetic operations, and the result is a flexible computational processing element (PE) for digital signal processing algorithms. To implement CORDIC-based architectures for functions like the SVD and DCT, their computations must be decomposed into CORDIC operations. The SVD is widely used in digital signal processing applications such as direction estimation, recursive least squares (RLS) filtering, and system identification. Two Jacobi-type methods for parallel SVD computation are usually considered, namely the Kogbetliantz (two-sided rotation) and Hestenes (one-sided rotation) methods. Kogbetliantz’s method has been chosen because it maps well onto a CORDIC array architecture and is highly suitable for parallel computation. In this implementation, the CORDIC algorithm provides the arithmetic units required in the processing elements, enabling the efficient implementation of plane rotation and phase computation. Many fundamental aspects of linear algebra rely on determining the rank of a matrix, making the SVD an important and widely used technique. The DCT is one of the most widely used transform techniques in digital signal processing, and its computation involves many multiplications and additions. A CORDIC-based DCT does not need multipliers; moreover, it has a regular and simple architecture and is used to compress a wide variety of images by transforming data into the frequency domain. These digital signal processing algorithms are used in many applications. The purpose of this thesis is to describe a solution in which a conventional CORDIC system is used to implement SVD and DCT processing elements. The approach presented combines low circuit complexity with high performance.
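    For reference, a minimal software model of a CORDIC rotation, the arithmetic primitive such a processing element is built on: in hardware the multiplications by 2**-i are arithmetic shifts, so only shifts and adds are needed; the iteration count and the use of floating point here are purely illustrative.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians with the CORDIC shift-and-add scheme.
    In hardware the 2**-i factors are arithmetic shifts; floats are used here
    only to keep the sketch short."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for t in angles:
        gain *= math.cos(t)                  # compensates the accumulated CORDIC scale factor
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotation direction for this micro-rotation
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x * gain, y * gain

print(cordic_rotate(1.0, 0.0, math.pi / 3))  # approximately (0.5, 0.866)
```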

    Rapid Frequency Estimation

    Frequency estimation plays an important role in many digital signal processing applications. Many areas have benefited from the discovery of the Fast Fourier Transform (FFT) and, more recently, from advances in modern spectral estimation techniques. As processor and programmable logic technologies advance, unconventional methods for rapid frequency estimation in white Gaussian noise should be considered for real-time applications. In this thesis, a practical hardware implementation that combines two known frequency estimation techniques is presented and characterized. The combined implementation, which uses the well-known FFT together with a less well-known modern spectral analysis method, the Direct State Space (DSS) algorithm, is used to demonstrate and promote the application of modern spectral methods in various real-time applications, including Electronic Countermeasure (ECM) techniques.
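    As a rough illustration of the FFT stage of such a combined estimator (the DSS refinement described in the thesis is not reproduced here), the following sketch picks the peak FFT bin of a noisy tone; the sampling rate, tone frequency, and function name are assumptions made up for the example.

```python
import numpy as np

def coarse_frequency_estimate(x, fs):
    """Coarse single-tone frequency estimate from the peak FFT bin; a finer
    stage (e.g., the DSS algorithm in the thesis) would refine this estimate."""
    spectrum = np.abs(np.fft.rfft(x))
    k = int(np.argmax(spectrum))
    return k * fs / len(x)

fs, f0, n = 1000.0, 123.4, 1024
t = np.arange(n) / fs
x = np.cos(2 * np.pi * f0 * t) + 0.1 * np.random.randn(n)   # tone in white Gaussian noise
print(coarse_frequency_estimate(x, fs))   # close to 123.4 Hz, within one bin (~0.98 Hz)
```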

    Application-specific instruction set processor for SoC implementation of modern signal processing algorithms


    Embedded Machine Learning: Emphasis on Hardware Accelerators and Approximate Computing for Tactile Data Processing

    Machine Learning (ML), a subset of Artificial Intelligence (AI), is driving the industrial and technological revolution of the present and future. We envision a world with smart devices that are able to mimic human behavior (sense, process, and act) and perform tasks that we once thought could only be carried out by humans. The vision is to achieve such a level of intelligence with affordable, power-efficient, and fast hardware platforms. However, embedding machine learning algorithms in many application domains such as the Internet of Things (IoT), prostheses, robotics, and wearable devices is an ongoing challenge, one governed by the computational complexity of ML algorithms, the performance and availability of hardware platforms, and the application’s budget (power constraints, real-time operation, etc.). In this dissertation, we focus on the design and implementation of efficient ML algorithms to handle these challenges. First, we apply Approximate Computing Techniques (ACTs) to reduce the computational complexity of ML algorithms. Then, we design custom hardware accelerators to improve the performance of the implementation within a specified budget. Finally, a tactile data processing application is adopted to validate the proposed exact and approximate embedded machine learning accelerators. The dissertation starts by introducing the various ML algorithms used for tactile data processing; these algorithms are assessed in terms of their computational complexity and the hardware platforms available for implementation. Afterward, a survey of existing approximate computing techniques and hardware accelerator design methodologies is presented. Based on the findings of the survey, an approach for applying algorithmic-level ACTs to machine learning algorithms is provided. Then three novel hardware accelerators are proposed: (1) a k-Nearest Neighbor (kNN) accelerator based on a selection-based sorter, (2) a Tensorial Support Vector Machine (TSVM) accelerator based on Shallow Neural Networks, and (3) a Hybrid Precision Binary Convolutional Neural Network (BCNN). The three accelerators offer real-time classification with substantial reductions in hardware resources and power consumption compared with existing FPGA implementations targeting the same tactile data processing application. Moreover, the approximate accelerators maintain high classification accuracy, with a loss of at most 5%.
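    As an illustration of the selection-based-sorter idea behind the kNN accelerator, the sketch below selects the k smallest distances without fully sorting them; the toy dataset, distance metric, and function name are illustrative assumptions, not the dissertation's FPGA design.

```python
import numpy as np

def knn_classify(train_x, train_y, query, k=3):
    """kNN where only the k smallest distances are selected (np.argpartition),
    mirroring the selection-based sorter idea: a full sort is unnecessary."""
    d = np.sum((train_x - query) ** 2, axis=1)        # squared Euclidean distances
    nearest = np.argpartition(d, k)[:k]               # partial selection, not a sort
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote among neighbors

# Toy example with two clusters
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_classify(x, y, np.array([4.5, 5.2]), k=5))   # expected: 1
```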

    A custom computing framework for orientation and photogrammetry

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2000. Includes bibliographical references (p. 211-223). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. There is great demand today for real-time computer vision systems, with applications including image enhancement, target detection and surveillance, autonomous navigation, and scene reconstruction. These operations generally require extensive computing power; when multiple conventional processors and custom gate arrays are inappropriate, due to either excessive cost or risk, a class of devices known as Field-Programmable Gate Arrays (FPGAs) can be employed. FPGAs offer the flexibility of a programmable solution and nearly the performance of a custom gate array. When implementing a custom algorithm in an FPGA, one must be more efficient than with gate array technology. By tailoring the algorithms, architectures, and precisions, the gate count of an algorithm may be reduced enough to fit into an FPGA. The challenge is to perform this customization of the algorithm while still maintaining the required performance. The techniques required to perform algorithmic optimization for FPGAs are scattered across many fields; what is currently lacking is a framework for utilizing all of these well-known and developing techniques. The purpose of this thesis is to develop this framework for orientation and photogrammetry systems. by Paul D. Fiore. Ph.D.
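    A minimal software model of the precision tailoring the abstract describes: quantizing data to a signed fixed-point format and observing how the error shrinks as the word length grows. The word lengths and helper name are illustrative, not taken from the thesis.

```python
import numpy as np

def quantize(x, total_bits=12, frac_bits=8):
    """Round to a signed fixed-point grid with the given word length,
    saturating at the representable range; a simple model of trading
    precision for gate count on an FPGA."""
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1)) / scale
    hi = (2 ** (total_bits - 1) - 1) / scale
    return np.clip(np.round(np.asarray(x) * scale) / scale, lo, hi)

x = np.random.default_rng(1).uniform(-2, 2, 1000)
for bits in (8, 12, 16):
    err = np.max(np.abs(quantize(x, bits, bits - 4) - x))
    print(bits, err)   # worst-case error shrinks as the word length grows
```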