1,314 research outputs found

    Automated CNN pipeline generation for heterogeneous architectures

    Heterogeneity is a vital feature of emerging processor chip designs. Asymmetric multicore clusters, such as a high-performance cluster paired with a power-efficient cluster, are common in modern edge devices. One example is Intel's Alder Lake, featuring Golden Cove high-performance cores and Gracemont power-efficient cores. Chiplet-based technology allows multiple cores to be organized as multi-chip modules, thus housing a large number of cores in a processor. Interposer-based packaging has enabled embedding High Bandwidth Memory (HBM) on chip and has reduced the transmission latency and energy consumption of the chiplet-to-chiplet interconnect. For instance, Intel's XeHPC Ponte Vecchio package integrates a multi-chip GPU organization along with HBM modules. Since new devices feature heterogeneity at the level of cores, memory and on-chip interconnect, it has become important to steer optimization at the application level in order to leverage the new heterogeneous, high-performing and power-efficient features of the underlying computing platforms. An important high-performance application paradigm is Convolutional Neural Networks (CNNs). CNNs are widely used in many practical applications. The pipelined parallel implementation of a CNN is favored for inference on edge devices. In this Licentiate thesis we present a novel scheme for automatic scheduling of CNN pipelines on heterogeneous devices. A pipeline schedule is a configuration that specifies the depth of the pipeline, the grouping of CNN layers into pipeline stages and the mapping of pipeline stages onto computing units. We utilize simple compile-time hints, which consist of workload information for individual CNN layers and performance hints for computing units. The proposed approach provides a near-optimal solution for a throughput-maximizing pipeline. We model the problem as a design space exploration problem. We developed a time-efficient design space navigation through heuristics extracted from knowledge of the CNN structure and the underlying computing platform. The proposed search scheme utilizes real-time performance measurements as fitness values. The results demonstrate that the proposed scheme converges faster and can scale when used with larger networks and computing platforms. Since the scheme utilizes online performance measurements, one of the challenges is to avoid expensive configurations during online tuning. The results demonstrate that, on average, ~80% of the tested configurations are sub-optimal solutions. Another challenge is to reduce convergence time. The experiments show that the proposed approach is 35x faster than stochastic optimization algorithms. Since the design space is large and complex, we show that the proposed scheme explores only ~0.1% of the total design space in the case of large CNNs (having 50+ layers) and results in a near-optimal solution.
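To make the notion of a pipeline schedule concrete, the following is a minimal sketch that enumerates every schedule for a tiny network: the pipeline depth, the contiguous grouping of layers into stages, and the mapping of stages onto computing units. It is an exhaustive baseline, not the heuristic search described in the abstract. The names `best_pipeline`, `layer_work` and `unit_speed` are hypothetical, and the cost model (stage time = summed layer workload divided by unit speed) is an assumption made for illustration.

```python
from itertools import combinations, permutations


def best_pipeline(layer_work, unit_speed):
    """Exhaustively search pipeline schedules for a small CNN.

    layer_work: per-layer workload estimates (hypothetical compile-time hints).
    unit_speed: per-unit performance hints (work processed per unit of time).
    Returns (bottleneck_time, stage_bounds, units), where stage i covers
    layers stage_bounds[i]:stage_bounds[i + 1] and runs on units[i].
    """
    n = len(layer_work)
    best = None
    # The pipeline depth cannot exceed the number of layers or units.
    for depth in range(1, min(n, len(unit_speed)) + 1):
        # Pick depth - 1 cut points to split the layers into contiguous stages.
        for cuts in combinations(range(1, n), depth - 1):
            bounds = (0, *cuts, n)
            stage_work = [sum(layer_work[a:b]) for a, b in zip(bounds, bounds[1:])]
            # Map each stage onto a distinct computing unit.
            for units in permutations(range(len(unit_speed)), depth):
                # Steady-state throughput is limited by the slowest stage.
                bottleneck = max(w / unit_speed[u] for w, u in zip(stage_work, units))
                candidate = (bottleneck, bounds, units)
                if best is None or candidate < best:
                    best = candidate
    return best


if __name__ == "__main__":
    # Made-up workloads for five layers and two units of unequal speed.
    bottleneck, bounds, units = best_pipeline([4, 2, 6, 3, 2], [3.0, 1.0])
    print("stage boundaries:", bounds, "units:", units, "bottleneck:", bottleneck)
```

Minimizing the bottleneck stage time is equivalent to maximizing throughput under this cost model. The number of candidate schedules grows combinatorially with the number of layers and units, which is why a search that visits only a small fraction of the design space matters for large CNNs.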

    FFT-Based Deep Learning Deployment in Embedded Systems

    Deep learning has demonstrated its power in many application domains, especially in image and speech recognition. As the backbone of deep learning, deep neural networks (DNNs) consist of multiple layers of various types with hundreds to thousands of neurons. Embedded platforms are now becoming essential for deep learning deployment due to their portability, versatility, and energy efficiency. The large model size of DNNs, while providing excellent accuracy, also burdens embedded platforms with intensive computation and storage. Researchers have investigated reducing DNN model size with negligible accuracy loss. This work proposes a Fast Fourier Transform (FFT)-based DNN training and inference model suitable for embedded platforms, with reduced asymptotic complexity of both computation and storage, which distinguishes our approach from existing approaches. We develop the training and inference algorithms based on FFT as the computing kernel and deploy the FFT-based inference model on embedded platforms, achieving extraordinary processing speed. Comment: Design, Automation, and Test in Europe (DATE). For source code, please contact Mahdi Nazemi at <[email protected]>
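The abstract does not spell out how the FFT enters the computation. One way an FFT kernel can reduce both computation and storage, assumed here purely for illustration, is to constrain a weight matrix to be circulant: the matrix is then defined by a single column of n values (O(n) storage instead of O(n^2)), and the matrix-vector product becomes a circular convolution computable in O(n log n). The sketch below checks that identity with a small pure-Python radix-2 FFT; all function names are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two.

    The inverse transform is returned unscaled (the caller divides by n).
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circulant_matvec_fft(c, x):
    """Compute C @ x in O(n log n), where C is circulant with first column c."""
    n = len(c)
    product = [a * b for a, b in zip(fft(c), fft(x))]
    return [v.real / n for v in fft(product, inverse=True)]


def circulant_matvec_direct(c, x):
    """Reference O(n^2) product using C[i][j] = c[(i - j) % n]."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


if __name__ == "__main__":
    c = [1.0, 2.0, 3.0, 4.0]   # first column of the circulant weight matrix
    x = [1.0, 0.0, -1.0, 2.0]  # input activations
    print(circulant_matvec_direct(c, x))
    print(circulant_matvec_fft(c, x))
```

The two functions agree up to floating-point rounding. A production implementation would use an optimized FFT library rather than this recursive version.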

    Smart Sensor Architectures for Multimedia Sensing in IoMT

    Today, a wide range of developments and paradigms require the use of embedded systems characterized by restrictions on their computing capacity, consumption, cost, and network connection. The evolution of the Internet of Things (IoT) towards Industrial IoT (IIoT) or the Internet of Multimedia Things (IoMT), its impact within Industry 4.0, the evolution of cloud computing towards edge or fog computing, also called near-sensor computing, and the increase in the use of embedded vision are current examples of this trend. One of the most common methods of reducing energy consumption is processor frequency scaling based on a particular policy. The algorithms that define this policy are intended to respond well to the workloads that occur in smartphones. There has been no study that allows a correct definition of these algorithms for workloads such as those expected in the above scenarios. This paper presents a method to determine the operating parameters of the dynamic governor algorithm called Interactive, which offers significant improvements in power consumption without reducing the performance of the application. These improvements depend on the load that the system has to support, so the results are evaluated against three different loads, from higher to lower, showing improvements ranging from 62% to 26%. This work has been supported by the MCyU (Spanish Ministry of Science and Universities) under the project ATLAS (PGC2018-094151-B-I00), which is partially funded by AEI, FEDER and EU. Silvestre-Blanes, J.; Sempere Paya, VM.; Albero Albero, T. (2020). Smart Sensor Architectures for Multimedia Sensing in IoMT. Sensors. 20(5):1-16. https://doi.org/10.3390/s20051400

    NASA Center for Intelligent Robotic Systems for Space Exploration

    NASA's program for the civilian exploration of space is a challenge to scientists and engineers to help maintain and further develop the United States' position of leadership in a focused sphere of space activity. Such an ambitious plan requires the contribution and further development of many scientific and technological fields. One research area essential for the success of these space exploration programs is Intelligent Robotic Systems. These systems represent a class of autonomous and semi-autonomous machines that can perform human-like functions with or without human interaction. They are fundamental for activities too hazardous for humans or too distant or complex for remote telemanipulation. To meet this challenge, Rensselaer Polytechnic Institute (RPI) has established an Engineering Research Center for Intelligent Robotic Systems for Space Exploration (CIRSSE). The Center was created with a five-year, $5.5 million grant from NASA, based on a proposal submitted by a team from the Robotics and Automation Laboratories. The Robotics and Automation Laboratories of RPI are the result of the 1987 merger of the Robotics and Automation Laboratory of the Department of Electrical, Computer, and Systems Engineering (ECSE) and the Research Laboratory for Kinematics and Robotic Mechanisms of the Department of Mechanical Engineering, Aeronautical Engineering, and Mechanics (ME,AE,&M). This report is an examination of the activities that are centered at CIRSSE.

    Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures

    For reasons of both performance and energy efficiency, high-performance computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL framework supports portable programming across a wide range of computing devices and is gaining influence in programming next-generation accelerators. Characterizing the performance of these devices across a range of applications requires a diverse, portable and configurable benchmark suite, and OpenCL is an attractive programming model for this purpose. We present an extended and enhanced version of the OpenDwarfs OpenCL benchmark suite, with a strong focus on the robustness of applications, the curation of additional benchmarks with an increased emphasis on correctness of results, and the choice of problem size. Preliminary results and analysis are reported for eight benchmark codes on a diverse set of architectures: three Intel CPUs, five Nvidia GPUs, six AMD GPUs and a Xeon Phi. Comment: 10 pages, 5 figures

    Power, Performance, and Energy Management of Heterogeneous Architectures

    Modern many-core multiprocessor systems-on-chip offer tremendous power and performance optimization opportunities through the tuning of thousands of potential voltage, frequency and core configurations. Applications running on these architectures are becoming increasingly complex. As the basic building blocks that make up the application change during runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is a daunting task due to the large number of workloads and configurations. Therefore, there is a strong need to evaluate the metrics of interest as a function of the supported configurations. This thesis focuses on two different types of modern multiprocessor systems-on-chip (SoC): mobile heterogeneous systems and the tile-based Intel Xeon Phi architecture. For mobile heterogeneous systems, this thesis presents a novel methodology that can accurately instrument different types of applications with specific performance monitoring calls. These calls provide a rich set of performance statistics at the basic block level while the application runs on the target platform. The target architecture used for this work (Odroid XU3) is capable of running at 4940 different frequency and core combinations. With the help of the instrumented applications, a vast amount of characterization data is collected that provides details about performance, power and CPU state at every instrumented basic block across 19 different types of applications. The data collected has enabled two runtime schemes. The first work provides a methodology to find optimal configurations in a heterogeneous architecture using classifiers and demonstrates an average increase of 93%, 81% and 6% in performance per watt (PPW) compared to the interactive, ondemand and powersave governors, respectively. The second work, using the same data, presents a novel imitation learning framework for dynamically controlling the type, number, and frequencies of active cores to achieve an average PPW improvement of 109% compared to the default governors. This work also shows how to accurately profile the tile-based Intel Xeon Phi architecture while training different types of neural networks using an open image dataset on a deep learning framework. The data collected allows deep exploratory analysis. It also showcases how different hardware parameters affect the performance of Xeon Phi. Dissertation/Thesis Masters Thesis Engineering 201
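The performance-per-watt metric used in the comparison above can be illustrated with a short sketch that picks the best configuration from a set of measurements and reports its gain over a baseline. This is not the classifier or the imitation learning framework from the thesis. The function names, the dictionary layout and all numbers are hypothetical and chosen only to show the arithmetic.

```python
def perf_per_watt(perf, power_w):
    """Performance per watt (PPW): work done per unit of power drawn."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return perf / power_w


def best_config(measurements):
    """Return the measurement with the highest PPW."""
    return max(measurements, key=lambda m: perf_per_watt(m["perf"], m["power_w"]))


def ppw_gain_percent(candidate, baseline):
    """Relative PPW improvement of candidate over baseline, in percent."""
    cand = perf_per_watt(candidate["perf"], candidate["power_w"])
    base = perf_per_watt(baseline["perf"], baseline["power_w"])
    return (cand / base - 1.0) * 100.0


if __name__ == "__main__":
    # Made-up measurements: core counts, frequency (MHz), performance, power (W).
    baseline = {"name": "governor default", "perf": 100.0, "power_w": 4.0}
    candidates = [
        {"name": "4 little @ 1400", "perf": 60.0, "power_w": 1.5},
        {"name": "2 big @ 1600", "perf": 90.0, "power_w": 2.0},
        {"name": "4 big @ 2000", "perf": 120.0, "power_w": 5.0},
    ]
    chosen = best_config(candidates)
    print(chosen["name"], round(ppw_gain_percent(chosen, baseline), 1), "% PPW gain")
```

A gain above 100%, like the 109% average quoted for the imitation learning framework, means the PPW more than doubled relative to the baseline.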