
    EXPLORING MULTIPLE LEVELS OF PERFORMANCE MODELING FOR HETEROGENEOUS SYSTEMS

    The current trend in High-Performance Computing (HPC) is to extract concurrency from clusters that include heterogeneous resources such as General-Purpose Graphics Processing Units (GPGPUs) and Field Programmable Gate Arrays (FPGAs). Although these heterogeneous systems can provide substantial performance for massively parallel applications, many of the available computing resources are often under-utilized due to inefficient application mapping, load balancing, and tuning. While several performance prediction models exist to help tune applications efficiently, they often require significant computing architecture knowledge for reliable prediction. In addition, they do not address multiple levels of design-space abstraction, making it difficult to choose a reliable prediction model for a given design. In this research, we develop a multi-level suite of performance prediction models for heterogeneous systems that primarily targets Synchronous Iterative Algorithms (SIAs). The modeling suite aims to produce accurate and straightforward application runtime predictions prior to the actual large-scale implementation. The suite addresses two levels of system abstraction: 1) a low level, where partial knowledge of the application implementation is available along with the system specifications, and 2) a high level, where implementation details are minimal and only high-level computing system specifications are given. The performance prediction modeling suite is developed using our proposed Synchronous Iterative GPGPU Execution (SIGE) model for GPGPU clusters, motivated by the RC Amenability Test for Scalable Systems (RATSS) model for FPGA clusters. The low-level abstraction for GPGPU clusters consists of a regression-based performance prediction framework that statistically abstracts system architecture characteristics, enabling performance prediction without detailed architecture knowledge. In this framework, the overall execution time of an application is predicted using regression models developed for the host-device computations and network-level communications performed in the algorithm. We used a family of Spiking Neural Network (SNN) models and an Anisotropic Diffusion Filter (ADF) algorithm as SIA case studies to verify the regression-based framework and achieved over 90% prediction accuracy relative to the actual implementations for the several GPGPU cluster configurations tested. These results establish the adequacy of the low-level abstraction model for advanced, fine-grained performance prediction and design space exploration (DSE). The high-level abstraction consists of two primary modeling approaches: qualitative modeling, which uses existing subjective-analytical models for computation and communication, and quantitative modeling, which predicts computation and communication performance by measuring hardware events associated with objective-analytical models using micro-benchmarks. The performance prediction provided by the high-level abstraction approaches, albeit coarse-grained, delivers useful insight into application performance on the chosen heterogeneous system. A blend of the two high-level modeling approaches, labeled hybrid modeling, is explored for insightful preliminary performance prediction. The performance prediction models in the multi-level suite are verified and compared for their accuracy and ease of use, allowing developers to choose the model that best fits their level of design-space abstraction.
We also construct a roadmap that guides users from optimal Application-to-Accelerator (A2A) mapping to fine-grained performance prediction, thereby providing a hierarchical approach to optimal application porting on the target heterogeneous system. The end goal of this dissertation research is to offer the HPC community a thorough, non-architecture-specific performance prediction framework in the form of a hierarchical modeling suite that enables developers to utilize heterogeneous resources optimally.
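
To make the regression idea behind the low-level abstraction concrete, below is a minimal sketch that fits separate least-squares models for per-iteration computation and communication times and sums them to predict total runtime. The features (workload per GPU, bytes exchanged per iteration) and the calibration numbers are illustrative assumptions, not the SIGE model's actual regressors.

```python
# Hypothetical sketch of regression-based runtime prediction for a
# synchronous iterative algorithm: fit separate models for host-device
# computation and network communication, then sum per-iteration costs.
import numpy as np

# Small-scale calibration runs: (workload per GPU, measured compute time in s)
work = np.array([1e5, 2e5, 4e5, 8e5])
t_comp = np.array([0.011, 0.021, 0.040, 0.079])

# (bytes exchanged per iteration, measured communication time in s)
msg = np.array([1e4, 5e4, 1e5, 5e5])
t_comm = np.array([0.0012, 0.0019, 0.0028, 0.0101])

# Fit simple linear models  t = a * x + b  by least squares.
a_c, b_c = np.polyfit(work, t_comp, 1)
a_m, b_m = np.polyfit(msg, t_comm, 1)

def predict_runtime(iterations, workload_per_gpu, bytes_per_iter):
    """Predicted total runtime of the synchronous iterative application."""
    per_iter = (a_c * workload_per_gpu + b_c) + (a_m * bytes_per_iter + b_m)
    return iterations * per_iter

# Example: 1,000 iterations at a larger, unseen problem size.
print(predict_runtime(1_000, 1.6e6, 2e5))
```

In practice the calibration data would come from small-scale profiling runs on the target cluster, so that runtime can be estimated before committing to a full-scale implementation.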

    Proceedings, MSVSCC 2014

    Proceedings of the 8th Annual Modeling, Simulation & Visualization Student Capstone Conference, held on April 17, 2014, at VMASC in Suffolk, Virginia.

    Deep Model for Improved Operator Function State Assessment

    A deep learning framework is presented for engagement assessment using EEG signals. Deep learning is a recently developed machine learning technique that has been applied in many domains. In this paper, we propose a deep learning strategy for operator function state (OFS) assessment. Fifteen pilots participated in a flight simulation from Seattle to Chicago. During the four-hour simulation, EEG signals were recorded for each pilot. We labeled 20 minutes of data as engaged or disengaged to fine-tune the deep network and utilized the remaining, much larger amount of unlabeled data to initialize the network. The trained deep network was then used to assess whether a pilot was engaged during the four-hour simulation.
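
The following is a minimal PyTorch-style sketch of the unlabeled-pretraining-then-fine-tuning strategy described above, assuming an autoencoder-style initialization and a binary engaged/disengaged head; the layer sizes, feature dimensionality, and training details are placeholders, not the paper's actual architecture.

```python
# Hypothetical sketch: initialize a network on unlabeled EEG features with an
# autoencoder, then fine-tune an engaged/disengaged classifier on the small
# labeled subset. All tensors and sizes here are stand-ins.
import torch
import torch.nn as nn

n_features = 128                      # e.g. spectral features per EEG window
encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                        nn.Linear(64, 32), nn.ReLU())
decoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                        nn.Linear(64, n_features))

unlabeled = torch.randn(10_000, n_features)   # stand-in for unlabeled windows
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
for _ in range(10):                   # unsupervised pretraining
    recon = decoder(encoder(unlabeled))
    loss = nn.functional.mse_loss(recon, unlabeled)
    opt.zero_grad(); loss.backward(); opt.step()

# Fine-tune on the small labeled subset (engaged = 1, disengaged = 0).
classifier = nn.Sequential(encoder, nn.Linear(32, 1))
labeled_x = torch.randn(600, n_features)
labeled_y = torch.randint(0, 2, (600, 1)).float()
opt = torch.optim.Adam(classifier.parameters(), lr=1e-4)
for _ in range(20):
    logits = classifier(labeled_x)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labeled_y)
    opt.zero_grad(); loss.backward(); opt.step()
```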

    Energy-Efficient Recurrent Neural Network Accelerators for Real-Time Inference

    Over the past decade, Deep Learning (DL) and Deep Neural Networks (DNNs) have gone through rapid development. They are now applied to a vast range of applications and have profoundly changed the lives of human beings. As an essential class of DNNs, Recurrent Neural Networks (RNNs) are helpful in processing time-sequential data and are widely used in applications such as speech recognition and machine translation. RNNs are difficult to compute because of their massive arithmetic operations and large memory footprint. RNN inference workloads used to be executed on conventional general-purpose processors, including Central Processing Units (CPUs) and Graphics Processing Units (GPUs); however, these contain hardware blocks that are unnecessary for RNN computation, such as branch predictors and caching systems, making them suboptimal for RNN processing. To accelerate RNN computations and outperform conventional processors, previous work has focused on optimization methods in both software and hardware. On the software side, previous works mainly used model compression to reduce the memory footprint and the arithmetic operations of RNNs. On the hardware side, previous works designed domain-specific hardware accelerators based on Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs) with customized hardware pipelines optimized for efficient processing of RNNs. By following this software-hardware co-design strategy, previous works achieved at least 10X speedup over conventional processors. Many previous works focused on achieving high throughput with a large batch of input streams. However, in real-time applications such as gaming Artificial Intelligence (AI) and dynamical system control, low latency is more critical. Moreover, there is a trend of offloading neural network workloads to edge devices to provide a better user experience and privacy protection. Edge devices, such as mobile phones and wearable devices, are usually resource-constrained with a tight power budget. They require RNN hardware that is more energy-efficient to realize both low-latency inference and long battery life. Brain neurons exhibit sparsity in both the spatial and the temporal domain. Inspired by this biological property, previous work mainly explored model compression to induce spatial sparsity in RNNs. The delta network algorithm, in contrast, induces temporal sparsity in RNNs and has been shown in previous works to save over 10X arithmetic operations. In this work, we have proposed customized hardware accelerators that exploit temporal sparsity in Gated Recurrent Unit (GRU)-RNNs and Long Short-Term Memory (LSTM)-RNNs to achieve energy-efficient real-time RNN inference. First, we have proposed DeltaRNN, the first-ever RNN accelerator to exploit temporal sparsity in GRU-RNNs. DeltaRNN has achieved 1.2 TOp/s effective throughput with a batch size of 1, which is 15X higher than related works. Second, we have designed EdgeDRNN to accelerate GRU-RNN edge inference. Compared to DeltaRNN, EdgeDRNN does not rely on on-chip memory to store RNN weights and focuses on reducing off-chip Dynamic Random Access Memory (DRAM) data traffic using a more scalable architecture. EdgeDRNN has realized real-time inference of large GRU-RNNs with submillisecond latency and only 2.3 W wall-plug power consumption, achieving 4X higher energy efficiency than commercial edge AI platforms like the NVIDIA Jetson Nano.
Third, we have used DeltaRNN to realize the first-ever continuous speech recognition system with the Dynamic Audio Sensor (DAS) as the front-end. The DAS is a neuromorphic event-driven sensor that produces a stream of asynchronous events instead of audio data sampled at a fixed rate. We have also showcased how an RNN accelerator can be integrated with an event-driven sensor on the same chip to realize ultra-low-power Keyword Spotting (KWS) on the extreme edge. Fourth, we have used EdgeDRNN to control a powered robotic prosthesis, replacing a conventional proportional-derivative (PD) controller with an RNN controller. EdgeDRNN has achieved 21 μs latency when running the RNN controller and maintains stable control of the prosthesis. These applications demonstrate the value of DeltaRNN and EdgeDRNN in solving real-world problems. Finally, we have applied the delta network algorithm to LSTM-RNNs and combined it with a customized structured pruning method, called Column-Balanced Targeted Dropout (CBTD), to induce spatio-temporal sparsity in LSTM-RNNs. We have then proposed another FPGA-based accelerator called Spartus, the first RNN accelerator that exploits spatio-temporal sparsity. Spartus has achieved 9.4 TOp/s effective throughput with a batch size of 1, the highest among present FPGA-based RNN accelerators with a power budget of around 10 W, and can complete the inference of an LSTM layer with 5 million parameters within 1 μs.
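
Below is a minimal numpy sketch of the delta-network idea that DeltaRNN and EdgeDRNN exploit: input components whose change since the last update falls below a threshold contribute nothing, so the corresponding weight-matrix columns can be skipped. This illustrates the principle only; the class, parameters, and the simplified matrix-vector stage are illustrative, not the accelerators' actual dataflow or the full GRU/LSTM gate equations.

```python
# Hypothetical sketch of temporal sparsity via the delta network idea:
# remember the last "transmitted" input, propagate only components that
# changed by more than a threshold, and skip the matrix columns that
# belong to unchanged components.
import numpy as np

class DeltaLayer:
    """One delta-network matrix-vector stage with thresholded updates."""
    def __init__(self, n_in, n_out, theta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_out, n_in))
        self.theta = theta
        self.x_prev = np.zeros(n_in)   # last transmitted input
        self.acc = np.zeros(n_out)     # accumulated pre-activation memory

    def step(self, x):
        delta = x - self.x_prev
        active = np.abs(delta) > self.theta      # columns worth computing
        # Only the columns of W for active inputs are touched; skipping the
        # rest is where the arithmetic savings come from.
        self.acc += self.W[:, active] @ delta[active]
        self.x_prev = np.where(active, x, self.x_prev)
        return self.acc, 1.0 - active.mean()     # pre-activation, skip ratio

layer = DeltaLayer(n_in=256, n_out=128)
rng = np.random.default_rng(1)
x = np.zeros(256)
for t in range(100):
    x = 0.95 * x + 0.05 * rng.normal(size=256)   # slowly varying input
    h_pre, skipped = layer.step(x)               # most columns are skipped
```

Because the input varies slowly between time steps, most components stay within the threshold and their columns are never multiplied, which is the temporal-sparsity saving the accelerators exploit in hardware.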

    Event-Driven Technologies for Reactive Motion Planning: Neuromorphic Stereo Vision and Robot Path Planning and Their Application on Parallel Hardware

    Robotics is increasingly becoming a key factor in technological progress. Despite impressive advances in recent decades, mammalian brains still outperform even the most powerful machines in vision and motion planning. Industrial robots are very fast and precise, but their planning algorithms are not capable enough for the highly dynamic environments required for human-robot collaboration (HRC). Without fast and adaptive motion planning, safe HRC cannot be guaranteed. Neuromorphic technologies, including visual sensors and hardware chips, operate asynchronously and therefore process spatio-temporal information very efficiently. Event-based visual sensors in particular already outperform conventional, synchronous cameras in many applications. Event-based methods thus hold great potential for enabling faster and more energy-efficient motion control algorithms for HRC. This thesis presents an approach to flexible, reactive motion control of a robot arm, in which exteroception is achieved through event-based stereo vision and path planning is implemented in a neural representation of the configuration space. The multi-view 3D reconstruction is evaluated through a qualitative analysis in simulation and transferred to a stereo system of event-based cameras. A demonstrator with an industrial robot is used to evaluate the reactive, collision-free online planning; it is also used for a comparative study against sampling-based planners. This is complemented by a benchmark of parallel hardware solutions, with robotic path planning chosen as the test scenario. The results show that the proposed neural solutions are an effective way to realize robot control for dynamic scenarios. This work lays a foundation for neural solutions in adaptive manufacturing processes, including collaboration with humans, without sacrificing speed or safety. It thus paves the way for the integration of brain-inspired hardware and algorithms into industrial robotics and HRC.

    Heterogeneous multicore systems for signal processing

    This thesis explores the capabilities of heterogeneous multi-core systems based on multiple Graphics Processing Units (GPUs) in a standard desktop framework. Multi-GPU accelerated desk-side computers are an appealing alternative to other high-performance computing (HPC) systems: being composed of commodity hardware components fabricated in large quantities, their price-performance ratio is unparalleled in the world of high-performance computing. Essentially bringing “supercomputing to the masses”, this opens up new possibilities for application fields where investing in HPC resources had previously been considered infeasible. One of these is the field of bioelectrical imaging, a class of medical imaging technologies that occupy a low-cost niche next to million-dollar systems like functional Magnetic Resonance Imaging (fMRI). In the scope of this work, several computational challenges encountered in bioelectrical imaging are tackled with this new kind of computing resource, striving to help these methods approach their true potential. Specifically, the following main contributions were made: Firstly, a novel dual-GPU implementation of parallel triangular matrix inversion (TMI) is presented, addressing a crucial kernel in the computation of multi-mesh head models for electroencephalographic (EEG) source localization. This includes not only a highly efficient implementation of the routine itself, achieving excellent speedups versus an optimized CPU implementation, but also a novel GPU-friendly compressed storage scheme for triangular matrices. Secondly, a scalable multi-GPU solver for non-Hermitian linear systems was implemented. It is integrated into a simulation environment for electrical impedance tomography (EIT) that requires frequent solution of complex systems with millions of unknowns, a task that this solver can perform within seconds. In terms of computational throughput, it outperforms not only a highly optimized multi-CPU reference but related GPU-based work as well. Finally, a GPU-accelerated graphical EEG real-time source localization software was implemented. Thanks to the acceleration, it can meet real-time requirements at unprecedented anatomical detail while running more complex localization algorithms. Additionally, a novel implementation to extract anatomical priors from static Magnetic Resonance (MR) scans has been included.
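
The abstract mentions a GPU-friendly compressed storage scheme for triangular matrices; the thesis's novel scheme is not described here, but the sketch below shows the classical row-major packed layout that such schemes build on, which stores only the n(n+1)/2 non-zero entries and roughly halves memory versus dense storage. The helper names are illustrative.

```python
# Hypothetical sketch of packed storage for a lower-triangular matrix:
# only the n*(n+1)/2 entries on or below the diagonal are kept, stored
# contiguously row by row. This is the classical packed layout, not the
# thesis's novel GPU-friendly scheme.
import numpy as np

def pack_lower(L):
    """Flatten the lower triangle (including the diagonal) row by row."""
    n = L.shape[0]
    return L[np.tril_indices(n)], n

def packed_index(i, j):
    """Offset of element (i, j), with j <= i, in the packed array."""
    return i * (i + 1) // 2 + j

def packed_get(packed, i, j):
    """Read element (i, j); entries above the diagonal are implicitly zero."""
    return 0.0 if j > i else packed[packed_index(i, j)]

n = 4
L = np.tril(np.arange(1, n * n + 1, dtype=float).reshape(n, n))
packed, n = pack_lower(L)
assert all(packed_get(packed, i, j) == L[i, j]
           for i in range(n) for j in range(n))
```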

    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and the results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions: (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    Aerial Vehicles

    This book contains 35 chapters written by experts in developing techniques for making aerial vehicles more intelligent, more reliable, more flexible in use, and safer in operation. It will also serve as an inspiration for further improvement of the design and application of aerial vehicles. The advanced techniques and research described here may also be applicable to other high-tech areas such as robotics, avionics, vetronics, and space.

    Computer Science & Technology Series : XIX Argentine Congress of Computer Science. Selected papers

    CACIC’13 was the nineteenth Congress in the CACIC series. It was organized by the Department of Computer Systems at the CAECE University in Mar del Plata. The Congress included 13 workshops with 165 accepted papers, 5 conferences, 3 invited tutorials, several meetings related to Computer Science education (professors, PhD students, curricula), and an International School with 5 courses. CACIC 2013 followed the traditional Congress format, with 13 workshops covering a diversity of dimensions of Computer Science research. Each topic was supervised by a committee of 3-5 chairs from different universities. The call for papers attracted a total of 247 submissions. An average of 2.5 review reports were collected for each paper, for a grand total of 676 review reports that involved about 210 different reviewers. A total of 165 full papers, involving 489 authors and 80 universities, were accepted, and 25 of them were selected for this book. Red de Universidades con Carreras en Informática (RedUNCI).

    Design and Implementation of Hardware Accelerators Using Multi-Level Parallelization and Application-Oriented Data Layout

    Degree type: Doctoral degree (course-based). Dissertation committee: (Chair) Professor 稲葉 雅幸, The University of Tokyo; Professor 須田 礼仁, The University of Tokyo; Professor 五十嵐 健夫, The University of Tokyo; Professor 山西 健司, The University of Tokyo; Associate Professor 稲葉 真理, The University of Tokyo; Lecturer 中山 英樹, The University of Tokyo. University of Tokyo (東京大学).