
    Hardware Implementation of Deep Network Accelerators Towards Healthcare and Biomedical Applications

    Get PDF
    With the advent of dedicated Deep Learning (DL) accelerators and neuromorphic processors, new opportunities are emerging for applying deep and Spiking Neural Network (SNN) algorithms to healthcare and biomedical applications at the edge. This can facilitate the advancement of medical Internet of Things (IoT) systems and Point of Care (PoC) devices. In this paper, we provide a tutorial describing how various technologies, ranging from emerging memristive devices to established Field Programmable Gate Arrays (FPGAs) and mature Complementary Metal Oxide Semiconductor (CMOS) technology, can be used to develop efficient DL accelerators that solve a wide variety of diagnostic, pattern recognition, and signal processing problems in healthcare. Furthermore, we explore how spiking neuromorphic processors can complement their DL counterparts for processing biomedical signals. After providing the required background, we unify the sparsely distributed research on neural network and neuromorphic hardware implementations as applied to the healthcare domain. In addition, we benchmark various hardware platforms on a biomedical electromyography (EMG) signal processing task and compare them in terms of inference delay and energy. Finally, we provide our analysis of the field and share a perspective on the advantages, disadvantages, challenges, and opportunities that different accelerators and neuromorphic processors bring to the healthcare and biomedical domains. This paper can serve a broad audience, from nanoelectronics researchers to biomedical and healthcare practitioners, in grasping the fundamental interplay between hardware, algorithms, and clinical adoption of these tools, as we shed light on the future of deep networks and spiking neuromorphic processing systems as drivers of progress in biomedical circuits and systems. Comment: Submitted to IEEE Transactions on Biomedical Circuits and Systems (21 pages, 10 figures, 5 tables).
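
    The benchmarking procedure summarized above (comparing hardware platforms on an EMG task by inference delay and energy) can be illustrated with a minimal Python sketch. It assumes a software stand-in for each accelerator (model_fn), pre-segmented EMG windows, and an externally measured average board power; these names and the energy-from-average-power approximation are illustrative assumptions, not details from the paper.

        import time

        def benchmark_emg_inference(model_fn, emg_windows, avg_power_watts):
            """Time a batch of EMG window classifications and estimate energy.

            model_fn        -- callable mapping one EMG window to a class label
            emg_windows     -- iterable of pre-segmented EMG windows
            avg_power_watts -- average board power measured externally during inference
            """
            start = time.perf_counter()
            predictions = [model_fn(w) for w in emg_windows]
            elapsed = time.perf_counter() - start

            latency_per_inference = elapsed / len(predictions)              # seconds
            energy_per_inference = avg_power_watts * latency_per_inference  # joules
            return predictions, latency_per_inference, energy_per_inference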

    The importance of space and time in neuromorphic cognitive agents

    Full text link
    Artificial neural networks and computational neuroscience models have made tremendous progress, allowing computers to achieve impressive results in artificial intelligence (AI) applications such as image recognition, natural language processing, and autonomous driving. Despite this remarkable progress, biological neural systems consume orders of magnitude less energy than today's artificial neural networks and are much more agile and adaptive. This efficiency and adaptivity gap is partially explained by the computing substrate of biological neural processing systems, which is fundamentally different from the way today's computers are built. Biological systems use in-memory computing elements operating in a massively parallel way rather than time-multiplexed computing units that are reused in a sequential fashion. Moreover, the activity of biological neurons follows continuous-time dynamics in real, physical time, instead of operating on discrete temporal cycles abstracted away from real time. Here, we present neuromorphic processing devices that emulate the biological style of processing by using parallel instances of mixed-signal analog/digital circuits that operate in real time. We argue that this approach brings significant advantages in efficiency of computation. We show examples of embodied neuromorphic agents that use such devices to interact with the environment and exhibit autonomous learning.
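
    The continuous-time, massively parallel style of processing described above is commonly modeled with leaky integrate-and-fire (LIF) neuron dynamics. The sketch below integrates the membrane equation tau * dV/dt = -(V - V_rest) + R * I with forward Euler; all parameter values are illustrative assumptions, not figures from the paper.

        def lif_step(v, i_in, dt=1e-4, tau=0.02, v_rest=0.0,
                     v_thresh=1.0, v_reset=0.0, r=1.0):
            """One forward-Euler step of a leaky integrate-and-fire neuron.

            Approximates the continuous-time membrane dynamics
                tau * dV/dt = -(V - v_rest) + r * i_in
            and returns (new_voltage, spiked).
            """
            v = v + (-(v - v_rest) + r * i_in) * (dt / tau)
            if v >= v_thresh:
                return v_reset, True   # threshold crossed: emit a spike and reset
            return v, False

        # Drive one neuron with a constant input current for 100 ms of simulated time.
        v, spike_count = 0.0, 0
        for _ in range(1000):
            v, spiked = lif_step(v, i_in=1.5)
            spike_count += spiked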

    Simulation and implementation of novel deep learning hardware architectures for resource constrained devices

    Get PDF
    Corey Lammie designed mixed-signal memristive-complementary metal-oxide-semiconductor (CMOS) and field-programmable gate array (FPGA) hardware architectures, which were used to reduce the power and resource requirements of Deep Learning (DL) systems during both inference and training. Disruptive design methodologies, such as those explored in this thesis, can be used to facilitate the design of next-generation DL systems.

    DRAM-based Processing-in-Memory Microarchitecture for Memory-intensive Machine Learning Applications

    Get PDF
    Ph.D. dissertation, Graduate School of Convergence Science and Technology (Intelligent Convergence Systems major), Seoul National University, February 2022. Advisor: Jung Ho Ahn. Recently, as research on neural networks has gained significant traction, a number of memory-intensive neural network models, such as recurrent neural network (RNN) models and recommendation models, have been introduced to process various tasks. RNN models and recommendation models spend most of their execution time processing matrix-vector multiplication (MV-mul) and embedding layers, respectively. A fundamental primitive of embedding layers, tensor gather-and-reduction (GnR), gathers embedding vectors and then reduces them to a new embedding vector. Because the matrices in RNNs and the embedding tables in recommendation models have poor reusability, and their ever-increasing sizes become too large to fit in the on-chip storage of devices, the performance and energy efficiency of MV-mul and GnR are determined by those of main-memory DRAM. Therefore, computing these operations within DRAM draws significant attention. In this dissertation, we first propose a main-memory architecture called MViD, which performs MV-mul by placing MAC units inside DRAM banks. For higher computational efficiency, we use a sparse matrix format and exploit quantization. Because of the limited power budget for DRAM devices, we implement the MAC units on only a portion of the DRAM banks. We architect MViD to slow down or pause MV-mul in order to concurrently serve memory requests from processors while satisfying the limited power budget. Our results show that MViD provides 7.2x higher throughput than the baseline system with four DRAM ranks (performing MV-mul on a chip-multiprocessor) while running inference of Deep Speech 2 alongside a memory-intensive workload. We then propose TRiM, an NDP architecture for accelerating recommendation systems. Based on the observation that the DRAM datapath has a hierarchical tree structure, TRiM augments the DRAM datapath with "in-DRAM" reduction units at the DDR4/5 rank/bank-group/bank level. We modify the DRAM interface to effectively issue commands to multiple reduction units running in parallel. We also propose a host-side architecture with hot embedding-vector replication to alleviate the load imbalance that arises across the reduction units. An optimal TRiM design based on DDR5 achieves up to 7.7x and 3.9x speedups and reduces the energy consumption of embedding-vector gather and reduction by 55% and 50% over the baseline and a state-of-the-art NDP architecture, respectively, with a minimal area overhead equivalent to 2.66% of the DRAM chip.
    Table of contents: 1 Introduction; 2 Background; 3 MViD: Sparse Matrix-Vector Multiplication in Mobile DRAM for Accelerating Recurrent Neural Networks; 4 TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory; 5 Discussion; 6 Related Work; 7 Conclusion.
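
    The tensor gather-and-reduction (GnR) primitive described in the abstract above, which TRiM offloads to in-DRAM reduction units, can be expressed functionally in a few lines. The table size, embedding dimension, and the use of summation as the reduction are illustrative assumptions.

        import numpy as np

        def gather_and_reduce(embedding_table, indices):
            """Tensor gather-and-reduction (GnR): fetch the embedding vectors
            selected by `indices` and reduce them (here, by summation) into a
            single output embedding vector. This is the functional behavior
            that near-data-processing designs move closer to DRAM."""
            gathered = embedding_table[indices]   # gather: (len(indices), dim)
            return gathered.sum(axis=0)           # reduce: (dim,)

        # Illustrative sizes only: 100k-entry table, 64-dim embeddings, one 40-index lookup.
        table = np.random.rand(100_000, 64).astype(np.float32)
        lookup = np.random.randint(0, table.shape[0], size=40)
        pooled = gather_and_reduce(table, lookup)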

    Power efficient machine learning-based hardware architectures for biomedical applications

    Get PDF
    The future of critical health diagnosis will involve intelligent and smart devices that are low-cost, wearable, and lightweight, requiring low-power, energy-efficient hardware platforms. Various machine learning models, such as deep learning architectures, have been employed to design intelligent healthcare systems. However, deploying these sophisticated and intelligent devices in real-time embedded systems with limited hardware resources and power budgets is complex because of the high computational power required to achieve a high accuracy rate. As a result, a significant gap exists between the advancement of computing technology and the associated device technologies for healthcare applications. Power-efficient machine learning-based digital hardware design techniques are introduced in this work to realize compact designs while maintaining optimal prediction accuracy. Two hardware design approaches, DeepSAC and SABiNN, are proposed and analyzed. DeepSAC is a shift-accumulator-based technique, whereas SABiNN is a 2's complement-based binarized digital hardware technique. Neural network models such as feedforward networks, convolutional neural networks, residual networks, and other popular machine learning and deep neural networks are selected to benchmark the proposed architectures. Various deep compression techniques, such as pruning, n-bit (n = 8, 16) integer quantization, and binarization of hyper-parameters, are also employed. These techniques reduced power consumption by 5x and model size by 13x, and improved model latency. For efficient use of these models in biomedical applications, a sleep apnea (SA) detection device for adults is developed to detect SA events in real time. The input to the system consists of two physiological signals: an ECG signal capturing chest movement and an SpO2 measurement from a pulse oximeter, used to predict the occurrence of SA episodes. In the training phase, actual patient data are used, and the network model is converted into the proposed hardware models to achieve medically backed accuracy. After achieving an acceptable accuracy of 88 percent, all parameters are extracted for inference at the edge. In the inference phase, reconfigurable hardware is used to validate the extracted parameters for model precision and power consumption before the design is translated onto silicon. The final model is implemented in CMOS using 130 nm and 180 nm commercial processes. Includes bibliographical references.
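
    SABiNN is described above as a 2's complement-based binarized hardware technique. A functional sketch of a binarized dense layer is given below: with weights constrained to +1/-1, every multiply collapses to keeping or negating (2's-complementing) the input before accumulation. The layer sizes and the sign activation are assumptions made for illustration, not the thesis design.

        import numpy as np

        def binarized_dense(x, w_binary, bias):
            """Functional model of a binarized fully connected layer.

            w_binary holds only +1/-1 values, so each product reduces to a keep
            or a negate followed by accumulation -- the multiplier-free style of
            arithmetic a binarized hardware implementation can exploit.
            """
            acc = w_binary @ x + bias                 # additions/subtractions only
            return np.where(acc >= 0, 1.0, -1.0)      # sign activation

        rng = np.random.default_rng(0)
        x = np.sign(rng.standard_normal(16))          # binarized input vector
        w = np.sign(rng.standard_normal((8, 16)))     # binarized weight matrix
        y = binarized_dense(x, w, bias=np.zeros(8))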

    VLSI design of configurable low-power coarse-grained array architecture

    Get PDF
    Biomedical signal acquisition from in- or on-body sensors often requires local (on-node) low-level pre-processing before the data are sent to a remote node for aggregation and further processing. Local processing is required for many different operations, including signal cleanup (noise removal), sensor calibration, event detection, and data compression. In this environment, processing is subject to aggressive energy consumption restrictions while often operating under real-time requirements. These conflicting requirements impose either the use of dedicated circuits addressing a very specific task or the use of domain-specific customization to obtain significant gains in power efficiency. However, economic and time-to-market constraints often make the development or use of application-specific platforms very risky. One way to address these challenges is to develop a sensor node with a general-purpose architecture combining a low-power, low-performance general microprocessor or micro-controller with a coarse-grained reconfigurable array (CGRA) acting as an accelerator. A CGRA consists of a fixed number of processing units (e.g., ALUs) whose function and interconnections are determined by configuration data. The objective of this work is to create an RTL-level description of a low-power CGRA of ALUs and produce a low-power VLSI (standard cell) implementation that supports power-saving features. The CGRA implementation should use as few resources as possible and fully exploit the intended operation environment. The design will be evaluated with a set of simple signal processing tasks.
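
    The abstract above defines a CGRA as an array of processing units whose function and interconnections are determined by configuration data. The toy interpreter below makes that concrete: each configuration entry selects an ALU operation and the sources of its two operands. The operation set and the linear-array topology are simplifications for illustration, not the architecture developed in this work.

        import operator

        OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul,
               "max": max, "min": min}

        def run_cgra(config, inputs):
            """Evaluate a linear array of ALUs described purely by configuration data.

            Each config entry (op, src_a, src_b) picks the ALU's operation and which
            earlier values (primary inputs or previous ALU outputs) feed its operands,
            mimicking how a CGRA's datapath is fixed by a configuration word.
            """
            values = list(inputs)
            for op_name, src_a, src_b in config:
                values.append(OPS[op_name](values[src_a], values[src_b]))
            return values[-1]

        # (x0 + x1) * (x0 - x2), expressed as three configured ALUs.
        cfg = [("add", 0, 1), ("sub", 0, 2), ("mul", 3, 4)]
        result = run_cgra(cfg, [5, 3, 2])   # -> 24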

    Adaptive extreme edge computing for wearable devices

    Get PDF
    Wearable devices are a fast-growing technology with impact on personal healthcare for both society and the economy. Due to the widespread deployment of sensors in pervasive and distributed networks, power consumption, processing speed, and system adaptation are vital in future smart wearable devices. The visioning and forecasting of how to bring computation to the edge in smart sensors have already begun, with an aspiration to provide adaptive extreme edge computing. Here, we provide a holistic view of hardware and theoretical solutions towards smart wearable devices that can provide guidance to research in this pervasive computing era. We propose various solutions based on biologically plausible models for continual learning in neuromorphic computing technologies for wearable sensors. To envision this concept, we provide a systematic outline of prospective low-power, low-latency scenarios for wearable sensors on neuromorphic platforms. We then describe the key potential landscapes of neuromorphic processors exploiting complementary metal-oxide-semiconductor (CMOS) and emerging memory technologies (e.g., memristive devices). Furthermore, we evaluate the requirements for edge computing within wearable devices in terms of footprint, power consumption, latency, and data size. We additionally investigate challenges beyond neuromorphic computing hardware, algorithms, and devices that could impede the advancement of adaptive edge computing in smart wearable devices.

    Intelligent Computing: The Latest Advances, Challenges and Future

    Get PDF
    Computing is a critical driving force in the development of human civilization. In recent years, we have witnessed the emergence of intelligent computing, a new computing paradigm that is reshaping traditional computing and promoting the digital revolution in the era of big data, artificial intelligence, and the Internet of Things with new computing theories, architectures, methods, systems, and applications. Intelligent computing has greatly broadened the scope of computing, extending it from traditional computing on data to increasingly diverse computing paradigms such as perceptual intelligence, cognitive intelligence, autonomous intelligence, and human-computer fusion intelligence. Intelligence and computing have long followed distinct paths of evolution and development but have become increasingly intertwined in recent years: intelligent computing is not only intelligence-oriented but also intelligence-driven. Such cross-fertilization has prompted the emergence and rapid advancement of intelligent computing. Intelligent computing is still in its infancy, and an abundance of innovations in its theories, systems, and applications is expected to occur soon. We present the first comprehensive survey of the literature on intelligent computing, covering its theoretical fundamentals, the technological fusion of intelligence and computing, important applications, challenges, and future perspectives. We believe this survey is highly timely and will provide a comprehensive reference and cast valuable insights into intelligent computing for academic and industrial researchers and practitioners.

    Designing energy-efficient computing systems using equalization and machine learning

    Full text link
    As technology scaling slows down in the nanometer CMOS regime and mobile computing becomes more ubiquitous, designing energy-efficient hardware for mobile systems is becoming increasingly critical and challenging. Although various approaches like near-threshold computing (NTC), aggressive voltage scaling with shadow latches, etc. have been proposed to get the most out of limited battery life, there is still no "silver bullet" for the increasing power-performance demands of mobile systems. Moreover, given that a mobile system may operate in a variety of environmental conditions, such as different temperatures, and have varying performance requirements, there is a growing need to design tunable/reconfigurable systems in order to achieve energy-efficient operation. In this work, we propose to address the energy-efficiency problem of mobile systems using two different approaches: circuit tunability and distributed adaptive algorithms. Inspired by communication systems, we developed feedback-equalization-based digital logic that changes the threshold of its gates based on the input pattern. We showed that feedback equalization in static complementary CMOS logic enabled up to a 20% reduction in energy dissipation while maintaining the performance metrics. We also achieved a 30% reduction in energy dissipation for pass-transistor digital logic (PTL) with equalization while maintaining performance. In addition, we proposed a mechanism that leverages feedback equalization techniques to achieve near-optimal operation of static complementary CMOS logic blocks over the entire voltage range from near-threshold to nominal supply voltage. Using the energy-delay product (EDP) as a metric, we analyzed the use of the feedback equalizer as part of various sequential computational blocks. Our analysis shows that for near-threshold voltage operation, when equalization is used, the operating frequency can improve by up to 30% while the energy increase is less than 15%, with an overall EDP reduction of about 10%. We also observe an EDP reduction of close to 5% across the entire above-threshold voltage range. On the distributed adaptive algorithm front, we explored energy-efficient hardware implementations of machine learning algorithms. We proposed an adaptive classifier that leverages the wide variability in data complexity to enable energy-efficient data classification for mobile systems. Our approach takes advantage of varying classification hardness across data to dynamically allocate resources and improve energy efficiency. On average, our adaptive classifier is about 100x more energy efficient but has an about 1% higher error rate than a complex radial basis function classifier, and is about 10x less energy efficient but has an about 40% lower error rate than a simple linear classifier, across a wide range of classification data sets. We also developed a field of groves (FoG) implementation of random forests (RF) that achieves accuracy comparable to Convolutional Neural Networks (CNN) and Support Vector Machines (SVM) under tight energy budgets. The FoG architecture takes advantage of the fact that in random forests a small portion of the weak classifiers (decision trees) might be sufficient to achieve high statistical performance. By dividing the random forest into smaller forests (groves) and conditionally executing the rest of the forest, FoG is able to achieve much higher energy efficiency for comparable error rates. We also take advantage of the distributed nature of the FoG to achieve a high level of parallelism. Our evaluation shows that at maximum achievable accuracies FoG consumes about 1.48x, 24x, 2.5x, and 34.7x lower energy per classification compared to conventional RF, SVM-RBF, Multi-Layer Perceptron Network (MLP), and CNN, respectively. FoG is 6.5x less energy efficient than SVM-LR, but achieves 18% higher accuracy on average across all considered datasets.
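
    The conditional-execution idea behind FoG described above (evaluate a small grove of trees first and consult the rest of the forest only when the partial vote is not yet decisive) can be sketched as follows. The vote-margin threshold and the grouping of trees into groves are hypothetical parameters chosen for illustration.

        def fog_predict(groves, sample, margin=0.8):
            """Field-of-Groves style conditional inference over a random forest.

            `groves` is a non-empty list of groves, each a list of decision-tree
            predict functions returning a class label. Groves are evaluated one at
            a time; once the leading class holds at least `margin` of the votes cast
            so far, the remaining groves are skipped, trading a little accuracy for
            fewer tree evaluations (and hence less energy).
            """
            votes = {}
            cast = 0
            for grove in groves:
                for tree in grove:
                    label = tree(sample)
                    votes[label] = votes.get(label, 0) + 1
                    cast += 1
                best_label, best_count = max(votes.items(), key=lambda kv: kv[1])
                if best_count / cast >= margin:
                    break              # early exit: the rest of the forest is skipped
            return best_label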