
    A Memory-Centric Customizable Domain-Specific FPGA Overlay for Accelerating Machine Learning Applications

    Low-latency inferencing is of paramount importance to a wide range of real-time and user-facing Machine Learning (ML) applications. Field Programmable Gate Arrays (FPGAs) offer unique advantages in delivering low-latency as well as energy-efficient accelerators for inferencing. Unfortunately, creating machine learning accelerators in FPGAs is not easy, requiring the use of vendor-specific CAD tools and low-level digital and hardware microarchitecture design knowledge that the majority of ML researchers do not possess. The continued refinement of High-Level Synthesis (HLS) tools can reduce but not eliminate the need for hardware-specific design knowledge. The designs produced by these tools can also use FPGA resources inefficiently, which ultimately limits the performance of the neural network. This research investigated a new FPGA-based software-hardware co-designed overlay architecture that opens the advantages of FPGAs to the broader ML user community. As an overlay, the proposed design allows rapid coding and deployment of different ML network configurations and different data-widths, eliminating the prior barrier of needing to resynthesize each design. This brings important attributes of code portability across different FPGA families. The proposed overlay design is a Single-Instruction-Multiple-Data (SIMD) Processor-In-Memory (PIM) architecture developed as a programmable overlay for FPGAs. In contrast to point designs, it can be programmed to implement different types of machine learning algorithms. The overlay architecture integrates bit-serial Arithmetic Logic Units (ALUs) with distributed Block RAMs (BRAMs). The PIM design increases the scale of arithmetic operations and on-chip storage capacity. User-visible inference latencies are reduced by exploiting concurrent accesses to network parameters (weights and biases) and partial results stored throughout the distributed BRAMs. Run-time performance comparisons show that the proposed design achieves a speedup over HLS-based and custom-tuned equivalent designs. Notably, the proposed design is programmable, allowing rapid design space exploration without the need to resynthesize when changing ML algorithms on the FPGA.
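    Bit-serial arithmetic is what makes it feasible to place an ALU next to every BRAM: each ALU processes one bit of its operands per clock cycle, so a D-bit add costs D cycles but needs only a 1-bit datapath. The sketch below is a minimal Python model of that idea, not the paper's actual overlay; the function names and the 8-bit width are illustrative assumptions.

    ```python
    def bit_serial_add(a_bits, b_bits):
        """Add two LSB-first bit vectors one bit per 'cycle', as a
        1-bit full adder with a carry register would in hardware."""
        carry = 0
        out = []
        for a, b in zip(a_bits, b_bits):         # one iteration ~ one clock cycle
            out.append(a ^ b ^ carry)            # sum bit
            carry = (a & b) | (carry & (a ^ b))  # carry into the next cycle
        return out

    def to_bits(x, width=8):
        """Integer -> LSB-first bit list; 'width' models the configurable data-width."""
        return [(x >> i) & 1 for i in range(width)]

    def from_bits(bits):
        return sum(b << i for i, b in enumerate(bits))

    # An 8-bit add takes exactly 8 'cycles', independent of the operand values.
    assert from_bits(bit_serial_add(to_bits(45), to_bits(17))) == 62
    ```

    Because every cycle touches a single bit, changing the data-width only changes the loop length, which is consistent with the abstract's claim that different data-widths can be supported without resynthesis.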

    Recent Advances in Embedded Computing, Intelligence and Applications

    The latest proliferation of Internet of Things deployments and edge computing combined with artificial intelligence has led to exciting new application scenarios in which embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately foster the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence, including hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.

    An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

    Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is 27× faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and 1.34× faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is 2.8× and 3.2× faster than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers and architects of future memory-centric computing systems.
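    To see why such workloads suit PIM, consider K-Means: each PIM core can keep its shard of the training set resident and return only per-cluster partial sums, so per-iteration traffic is O(k·dim) rather than O(n·dim). The following is a minimal sketch of that partitioning in plain NumPy, not the paper's actual PIM programming framework; all names and shapes are illustrative assumptions.

    ```python
    import numpy as np

    def pim_core_step(shard, centroids):
        """Work one 'PIM core' does locally on its resident data shard."""
        # assign each local point to its nearest centroid
        dist = ((shard[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        k, dim = centroids.shape
        sums = np.zeros((k, dim))
        counts = np.zeros(k, dtype=int)
        for c in range(k):
            mask = labels == c
            sums[c] = shard[mask].sum(axis=0)
            counts[c] = mask.sum()
        return sums, counts                  # O(k*dim) traffic, not O(n*dim)

    def kmeans_iteration(shards, centroids):
        """Host-side reduction of the tiny partial results from every core."""
        parts = [pim_core_step(s, centroids) for s in shards]  # parallel on real PIM
        sums = sum(p[0] for p in parts)
        counts = sum(p[1] for p in parts)
        new = centroids.copy()
        nonzero = counts > 0
        new[nonzero] = sums[nonzero] / counts[nonzero][:, None]
        return new

    rng = np.random.default_rng(0)
    data = rng.normal(size=(4096, 2))
    shards = np.array_split(data, 8)         # 8 stand-in 'PIM cores'
    centroids = data[:3].copy()
    for _ in range(10):
        centroids = kmeans_iteration(shards, centroids)
    ```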

    Efficient emotion recognition using hyperdimensional computing with combinatorial channel encoding and cellular automata

    In this paper, a hardware-optimized approach to emotion recognition based on the efficient brain-inspired hyperdimensional computing (HDC) paradigm is proposed. Emotion recognition provides valuable information for human-computer interaction; however, the large number of input channels (>200) and modalities (>3) involved makes it significantly expensive from a memory perspective. To address this, methods for memory reduction and optimization are proposed, including a novel approach that takes advantage of the combinatorial nature of the encoding process, and an elementary cellular automaton. HDC with early sensor fusion is implemented alongside the proposed techniques, achieving two-class multi-modal classification accuracies of >76% for valence and >73% for arousal on the multi-modal AMIGOS and DEAP datasets, almost always exceeding the state of the art. The required vector storage is reduced by 98%, and the frequency of vector requests by at least a factor of 5. The results demonstrate the potential of efficient hyperdimensional computing for low-power, multi-channel emotion recognition tasks.
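    One way to read the cellular-automaton idea: rather than storing a distinct random hypervector for every input channel, a single seed vector is kept in memory and channel vectors are regenerated on demand by iterating an elementary cellular automaton. The sketch below illustrates this trade of computation for memory, using Rule 90 as an assumed example; the dimensionality, the toy level encoding, and all names are illustrative, not the paper's exact scheme.

    ```python
    import numpy as np

    D = 1024                                       # hypervector dimensionality
    rng = np.random.default_rng(42)
    seed = rng.integers(0, 2, D, dtype=np.uint8)   # the only vector actually stored

    def rule90_step(v):
        """One elementary-CA step: each cell becomes left XOR right neighbor."""
        return np.roll(v, 1) ^ np.roll(v, -1)

    def channel_hv(i):
        """Regenerate channel i's binary hypervector instead of storing it."""
        v = seed
        for _ in range(i):
            v = rule90_step(v)
        return v

    def value_hv(x, levels=16):
        """Toy level encoding: quantize x in [0,1] and flip a prefix of the seed."""
        q = int(np.clip(x, 0, 1) * (levels - 1))
        v = seed.copy()
        v[: q * (D // levels)] ^= 1
        return v

    def encode(sample):
        """Bind each channel's value vector to its CA-derived channel vector
        (XOR) and bundle all channels by majority vote."""
        bound = [channel_hv(i) ^ value_hv(x) for i, x in enumerate(sample)]
        return (np.sum(bound, axis=0) > len(bound) / 2).astype(np.uint8)

    hv = encode(rng.random(8))   # 8-channel toy sample; real setups have >200 channels
    ```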

    Automated Debugging Methodology for FPGA-based Systems

    Electronic devices are a vital part of our lives, from mobile phones, laptops and computers to home automation systems. Modern designs comprise billions of transistors. With this evolution, however, ensuring that devices fulfill the designer's expectations under variable conditions has become a great challenge, requiring considerable design time and effort. Whenever an error is encountered, the process is restarted. It is therefore desirable to minimize the number of spins required to achieve an error-free product, as each spin results in a loss of time and effort. Software-based simulation is the main technique for verifying a design before fabrication. However, a few design errors (bugs) are likely to escape the simulation process and subsequently appear during the post-silicon phase. Finding such bugs is time-consuming due to the inherently limited observability of hardware. Instead of software simulation of the design in the pre-silicon phase, post-silicon techniques permit designers to verify functionality through physical implementations of the design. The main benefit of this methodology is that the implemented design in the post-silicon phase runs many orders of magnitude faster than its pre-silicon counterpart, allowing designers to validate their design more exhaustively.

    This thesis presents five main contributions enabling a fast and automated debugging solution for reconfigurable hardware. Throughout the research, an obstacle avoidance system for robotic vehicles served as a use case to illustrate how to apply the proposed debugging solution in practical environments. The first contribution presents a debugging system capable of providing a lossless trace of debugging data, permitting cycle-accurate replay. This methodology ensures that both permanent and intermittent errors in the implemented design are captured. The contribution also describes a solution to enhance hardware observability: utilizing processor-configurable concentration networks, employing debug data compression to transmit the data more efficiently, and partially reconfiguring the debugging system at run-time to save design re-compilation time and preserve timing closure. The second contribution presents a solution for communication-centric designs, along with solutions for designs with multiple clock domains. The third contribution presents a priority-based signal selection methodology to identify the signals that are most helpful during the debugging process, together with a connectivity generation tool that maps the identified signals to the debugging system. The fourth contribution presents an automated error detection solution that captures permanent as well as intermittent errors without continuous monitoring of debugging data; the proposed solution works even in the absence of a golden reference. The fifth contribution proposes the use of artificial intelligence for post-silicon debugging: a recurrent neural network is used for debugging when a golden reference is available for training the network, and the idea is extended to designs where no golden reference is present.
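    To give a flavor of automated error detection with a golden reference, the sketch below replays a captured cycle-accurate debug trace against a golden trace and logs every mismatching signal, which is how a single-cycle intermittent glitch can be caught without a human continuously watching waveforms. It is a minimal illustration; the trace layout and signal names are invented for this example and are not the thesis toolchain.

    ```python
    def detect_errors(debug_trace, golden_trace, watch_signals):
        """Each trace is a list of {signal_name: value} dicts, one per clock cycle."""
        errors = []
        for cycle, (seen, golden) in enumerate(zip(debug_trace, golden_trace)):
            for sig in watch_signals:
                if seen.get(sig) != golden.get(sig):
                    # record (cycle, signal, expected, observed)
                    errors.append((cycle, sig, golden.get(sig), seen.get(sig)))
        return errors

    # Example: an intermittent single-cycle glitch on 'dist_valid' is caught even
    # though the signal recovers on the next cycle.
    golden = [{"dist_valid": 1, "steer": 0}, {"dist_valid": 1, "steer": 1}]
    seen   = [{"dist_valid": 1, "steer": 0}, {"dist_valid": 0, "steer": 1}]
    assert detect_errors(seen, golden, ["dist_valid", "steer"]) == \
        [(1, "dist_valid", 1, 0)]
    ```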

    Towards a robust, effective and resource-efficient machine learning technique for IoT security monitoring.

    Internet of Things (IoT) devices are becoming increasingly popular and an integral part of our everyday lives, making them a lucrative target for attackers. These devices require suitable security mechanisms that enable robust and effective detection of attacks. Machine learning (ML) and its subdivision Deep Learning (DL) offer promise, but they can be computationally expensive when providing better detection for resource-constrained IoT devices. This research therefore proposes an optimization method to train ML and DL models for effective and efficient security monitoring of IoT devices. It first investigates the feasibility of the Light Gradient Boosting Machine (LGBM) for attack detection in IoT environments, proposing an optimization procedure to obtain effective variants. The trained LGBM successfully discerns attacks from regular traffic in the various IoT benchmark datasets used in this research. As LGBM is a traditional ML technique, it may struggle to learn the complex network traffic patterns present in IoT datasets. We therefore further examine Deep Neural Networks (DNNs), proposing an effective and efficient DNN-based security solution for IoT security monitoring that yields greater resource savings and accurate attack detection. Investigation results are promising: the proposed optimization method exploits mini-batch gradient descent with simulated micro-batching to build effective and efficient DNN-based IoT security solutions. Following the success of DNNs for effective and efficient attack detection, we further exploit them in the context of adversarial attack resistance. The resulting DNN is more resistant to adversarial samples than its benchmark counterparts and other conventional ML methods. To evaluate the effectiveness of our proposal, we considered on-device learning in federated learning settings, using decentralized edge devices to augment data privacy in resource-constrained environments. To this end, the performance of the method was evaluated against various realistic IoT datasets (e.g. NBaIoT, MNIST) on virtual and realistic testbed set-ups with GB-BXBT-2807 edge-computing-like devices. The experimental results show that the proposed method reduces memory and time usage by 81% and 22%, respectively, in the simulated environment of virtual workers compared to its benchmark counterpart. In the realistic testbed scenario, it saves 6% of the memory footprint and reduces execution time by 15%, while maintaining state-of-the-art accuracy.
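    The micro-batching idea can be pictured as gradient accumulation: the optimizer still takes one step per mini-batch, but the gradient is computed over small micro-batches so peak memory scales with the micro-batch size rather than the mini-batch size. The sketch below shows this for a simple logistic-regression step in NumPy; it is an illustration of the general technique under assumed shapes and names, not the thesis implementation.

    ```python
    import numpy as np

    def logistic_grad(w, X, y):
        """Gradient of the logistic loss on one micro-batch (the memory-heavy part)."""
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return X.T @ (p - y) / len(y)

    def train_step(w, X_batch, y_batch, lr=0.1, micro=32):
        """One mini-batch update computed as a weighted average over micro-batches."""
        grad = np.zeros_like(w)
        n = len(y_batch)
        for i in range(0, n, micro):
            Xm, ym = X_batch[i:i + micro], y_batch[i:i + micro]
            grad += logistic_grad(w, Xm, ym) * (len(ym) / n)  # accumulate, don't step
        return w - lr * grad                                  # single optimizer step

    rng = np.random.default_rng(1)
    X = rng.normal(size=(256, 10))
    y = (X[:, 0] > 0).astype(float)
    w = np.zeros(10)
    for _ in range(100):
        w = train_step(w, X, y)   # peak activation memory bounded by micro-batch size
    ```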

    New Waves of IoT Technologies Research – Transcending Intelligence and Senses at the Edge to Create Multi Experience Environments

    The next wave of the Internet of Things (IoT) and the Industrial Internet of Things (IIoT) brings new technological developments that incorporate radical advances in Artificial Intelligence (AI), edge computing, new sensing capabilities, stronger security protection and autonomous functions, accelerating progress towards IoT systems that can self-develop, self-maintain and self-optimise. The emergence of hyper-autonomous IoT applications with enhanced sensing, distributed intelligence, edge processing and connectivity, combined with human augmentation, has the potential to power the transformation and optimisation of industrial sectors and to change the innovation landscape. This chapter reviews the most recent advances in the next wave of the IoT, looking not only at the technology enabling the IoT but also at the platform and smart data aspects that will bring intelligence, sustainability, dependability and autonomy, and will support human-centric solutions.