550 research outputs found

    Intel oneAPI for Heterogeneous Computing

    Get PDF
    Trabajo de Fin de Grado (final-year undergraduate thesis) in Computer Science Engineering, Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, academic year 2020/2021.

    "oneAPI is a cross-industry, open, standards-based unified programming model that delivers a common developer experience across accelerator architectures—for faster application performance, more productivity, and greater innovation." (www.oneapi.com) The Intel DPC++ Compatibility Tool is a component of the Intel oneAPI Base Toolkit; it automatically transforms CUDA code into Data Parallel C++ (DPC++), assisting in the migration process. This project consists of an analysis of the DPC++ Compatibility Tool, considering the manual intervention required and the problems encountered while migrating the Rodinia benchmarks, together with a comparative study of the performance of the migrated code.
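To give a feel for what the migration target looks like, here is a minimal hand-written SYCL/DPC++ vector-addition kernel of the kind a CUDA kernel plus its cudaMalloc/cudaMemcpy calls would map to. It assumes a SYCL 2020 compiler (e.g., Intel DPC++) and is an illustrative sketch, not actual output of the Compatibility Tool.

```cpp
// Illustrative SYCL/DPC++ equivalent of a CUDA vector-addition kernel.
// Not generated by the Intel DPC++ Compatibility Tool; hand-written sketch.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    constexpr size_t N = 1 << 20;
    sycl::queue q;  // default device selection (CPU, GPU, or FPGA emulator)

    // Unified shared memory replaces explicit cudaMalloc/cudaMemcpy pairs.
    float *a = sycl::malloc_shared<float>(N, q);
    float *b = sycl::malloc_shared<float>(N, q);
    float *c = sycl::malloc_shared<float>(N, q);
    for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // parallel_for plays the role of a CUDA kernel launch (<<<grid, block>>>).
    q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    std::cout << "c[0] = " << c[0] << "\n";  // expect 3
    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}
```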

    A Computation Offloading System for Edge Cloud Environments

    Get PDF
    Thesis (Ph.D.), Graduate School of Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, February 2020. Advisor: Soo-Mook Moon.

    The purpose of my dissertation is to build lightweight edge computing systems that provide seamless offloading services even when users move across multiple edge servers. I focus on two application domains: 1) web applications and 2) DNN applications. First, I propose an edge computing system that offloads computations from web-supported devices to edge servers. The proposed system exploits the portability of web apps, i.e., that they are distributed as source code and runnable without installation, when migrating the execution state of a web app. This significantly reduces the complexity of state migration, allowing a web app to migrate within a few seconds. The system also supports offloading of WebAssembly, a standard low-level instruction format for web apps, achieving up to 8.4x speedup compared to offloading pure JavaScript code. Second, I propose incremental offloading of neural network (IONN), which offloads DNN execution while the DNN model is still being deployed, reducing the overhead of DNN model deployment. I also extend IONN to support large-scale edge server environments by proactively migrating DNN layers to the edge servers that mobile users are predicted to visit, improving cold-start performance. Simulation with an open-source mobility dataset shows that the proposed system can significantly reduce the overhead of deploying a DNN model. (A sketch of the split-point reasoning behind IONN-style partitioning follows the contents list below.)

    Contents: Chapter 1. Introduction; Chapter 2. Seamless Offloading of Web App Computations; Chapter 3. IONN: Incremental Offloading of Neural Network Computations; Chapter 4. PerDNN: Offloading DNN Computations to Pervasive Edge Servers; Chapter 5. Related Works; Chapter 6. Conclusion
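The core decision in IONN-style partitioning is where to split a DNN between the mobile device and the edge server so that total latency (local compute + transfer of an intermediate tensor + server compute) is minimized. The sketch below uses invented per-layer timings and considers only a single split in a linear chain; the actual IONN algorithm formulates partitioning on an NN execution graph (Chapter 3) and uploads DNN partitions incrementally.

```cpp
// Simplified latency-driven DNN split between client and edge server.
// All numbers are made up for illustration.
#include <cstdio>
#include <vector>

struct Layer {
    double client_ms;  // execution time on the mobile device
    double server_ms;  // execution time on the edge server
    double out_kb;     // size of the layer's output tensor
};

// Latency if layers [0, k) run on the client and [k, n) on the server.
double split_latency(const std::vector<Layer>& net, size_t k,
                     double input_kb, double kb_per_ms) {
    double t = 0.0;
    for (size_t i = 0; i < k; ++i) t += net[i].client_ms;
    if (k < net.size()) {
        double sent_kb = (k == 0) ? input_kb : net[k - 1].out_kb;
        t += sent_kb / kb_per_ms;  // transfer the intermediate tensor
        for (size_t i = k; i < net.size(); ++i) t += net[i].server_ms;
    }
    return t;  // k == n means everything runs locally, no transfer
}

int main() {
    std::vector<Layer> net = {{12, 2, 800}, {30, 4, 400}, {25, 3, 100}, {8, 1, 4}};
    double input_kb = 600.0, kb_per_ms = 50.0;  // assumed input size, bandwidth
    size_t best = 0; double best_ms = 1e18;
    for (size_t k = 0; k <= net.size(); ++k) {
        double ms = split_latency(net, k, input_kb, kb_per_ms);
        if (ms < best_ms) { best_ms = ms; best = k; }
    }
    std::printf("best split before layer %zu: %.1f ms\n", best, best_ms);
}
```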

    HALO 1.0: A Hardware-agnostic Accelerator Orchestration Framework for Enabling Hardware-agnostic Programming with True Performance Portability for Heterogeneous HPC

    Full text link
    This paper presents HALO 1.0, an open-ended, extensible multi-agent software framework that implements a set of proposed hardware-agnostic accelerator orchestration (HALO) principles. HALO implements a novel compute-centric message passing interface (C^2MPI) specification for enabling the performance-portable execution of a hardware-agnostic host application across heterogeneous accelerators. Experimental results from evaluating eight widely used HPC subroutines on Intel Xeon E5-2620 CPUs, Intel Arria 10 GX FPGAs, and NVIDIA GeForce RTX 2080 Ti GPUs show that HALO 1.0 allows a unified control flow for host programs to run across all the computing devices with a consistently top performance portability score, up to five orders of magnitude higher than that of the OpenCL-based solution.
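To make the orchestration idea concrete, here is a hypothetical sketch of a compute-centric dispatch layer: host code requests a named subroutine, and a runtime policy binds it to whichever accelerator backend has registered an implementation, so the host program never changes per device. All names here (Orchestrator, add, run) are invented for illustration and are not the actual C^2MPI specification.

```cpp
// Hypothetical illustration of compute-centric dispatch: the host addresses
// accelerators through one uniform task interface and stays agnostic of
// CPU/FPGA/GPU specifics. Invented API, not the real C^2MPI.
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Kernel = std::function<void(std::vector<float>&)>;

class Orchestrator {
    // kernel name -> device name -> implementation
    std::map<std::string, std::map<std::string, Kernel>> impls_;
public:
    void add(const std::string& kernel, const std::string& device, Kernel k) {
        impls_[kernel][device] = std::move(k);
    }
    // Host code asks for a kernel by name; the device choice is a runtime
    // policy, so the same host program runs on any registered hardware.
    void run(const std::string& kernel, const std::string& device,
             std::vector<float>& data) {
        impls_.at(kernel).at(device)(data);
    }
};

int main() {
    Orchestrator halo;
    halo.add("scale2x", "cpu", [](std::vector<float>& v) {
        for (auto& x : v) x *= 2.0f;  // reference CPU implementation
    });
    // A GPU or FPGA backend would register a different body under "scale2x".
    std::vector<float> v{1, 2, 3};
    halo.run("scale2x", "cpu", v);
    std::printf("%.1f %.1f %.1f\n", v[0], v[1], v[2]);
}
```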

    Efficient hardware implementations of bio-inspired networks

    Get PDF
    The human brain, with its massive computational capability and power efficiency in a small form factor, continues to inspire the ultimate goal of building machines that can perform tasks without being explicitly programmed. In an effort to mimic the natural information-processing paradigms observed in the brain, several neural network generations have been proposed over the years. Among the neural networks inspired by biology, second-generation Artificial or Deep Neural Networks (ANNs/DNNs) use memoryless neuron models and have shown unprecedented success, surpassing humans in a wide variety of tasks. Unlike ANNs, third-generation Spiking Neural Networks (SNNs) closely mimic biological neurons by operating on discrete, sparse events in time called spikes, which are obtained by the time integration of previous inputs. Implementation of data-intensive neural network models on computers based on the von Neumann architecture is limited mainly by the continuous data transfer between the physically separated memory and processing units. Hence, non-von Neumann architectural solutions are essential for processing these memory-intensive bio-inspired neural networks in an energy-efficient manner. Among non-von Neumann architectures, implementations employing non-volatile memory (NVM) devices are the most promising due to their compact size and low operating power. However, it is non-trivial to integrate these nanoscale devices on conventional computational substrates due to their non-idealities, such as limited dynamic range, finite bit resolution, and programming variability. This dissertation demonstrates architectural and algorithmic optimizations for implementing bio-inspired neural networks using emerging nanoscale devices.

    The first half of the dissertation focuses on the hardware acceleration of DNN implementations. A 4-layer stochastic DNN in a crossbar architecture with memristive devices at the cross points is analyzed for accelerating DNN training. This network is then used as a baseline to explore the impact of experimental memristive device behavior on network performance. Programming variability is found to play a critical role in determining network performance compared to the other non-ideal characteristics of the devices. In addition, noise-resilient inference engines are demonstrated using stochastic memristive DNNs with 100 bits for stochastic encoding during inference and 10 bits for the expensive training.

    The second half of the dissertation focuses on a novel probabilistic framework for SNNs that uses Generalized Linear Model (GLM) neurons to capture neuronal behavior. This work demonstrates that probabilistic SNNs achieve performance comparable to equivalent ANNs on two popular benchmarks: handwritten-digit classification and human activity recognition. Considering the potential of SNNs for energy-efficient implementations, a hardware accelerator for inference is proposed, termed the Spintronic Accelerator for Probabilistic SNNs (SpinAPS). The learning algorithm is optimized for a hardware-friendly implementation and uses a first-to-spike decoding scheme for low-latency inference. With binary spintronic synapses and digital CMOS logic neurons for computations, SpinAPS achieves a performance improvement of 4x in terms of GSOPS/W/mm² compared to a conventional SRAM-based design.

    Collectively, this work demonstrates the potential of emerging memory technologies for building energy-efficient hardware architectures for deep and spiking neural networks. The design strategies adopted in this work can be extended to other spike- and non-spike-based systems for building embedded solutions with power/energy constraints.
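As a schematic illustration of the GLM neuron dynamics and first-to-spike readout mentioned above, the sketch below computes a sigmoid spiking probability from a stimulus kernel and a self-feedback kernel, samples a spike train, and reports the earliest spike. All kernel taps, the bias, and the input spikes are invented values.

```cpp
// Schematic GLM (Generalized Linear Model) spiking neuron with a
// first-to-spike readout. Invented parameters, illustration only.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Spike probability at step t: sigma(bias + stimulus term + feedback term).
double glm_spike_prob(const std::vector<int>& in_spikes,
                      const std::vector<int>& own_spikes,
                      const std::vector<double>& w_stim,
                      const std::vector<double>& w_fb,
                      double bias, size_t t) {
    double u = bias;
    for (size_t d = 0; d < w_stim.size() && d < t; ++d)
        u += w_stim[d] * in_spikes[t - 1 - d];  // stimulus kernel over past inputs
    for (size_t d = 0; d < w_fb.size() && d < t; ++d)
        u += w_fb[d] * own_spikes[t - 1 - d];   // feedback kernel over own spikes
    return 1.0 / (1.0 + std::exp(-u));          // sigmoid firing probability
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<int> input = {1, 0, 1, 1, 0, 1, 1, 1};  // assumed input spikes
    std::vector<int> spikes(input.size(), 0);
    std::vector<double> w_stim = {1.5, 0.8, 0.3};  // assumed stimulus taps
    std::vector<double> w_fb = {-1.0, -0.5};       // refractory-like feedback

    for (size_t t = 0; t < input.size(); ++t) {
        double p = glm_spike_prob(input, spikes, w_stim, w_fb, -1.2, t);
        spikes[t] = (unif(rng) < p) ? 1 : 0;  // stochastic firing
        if (spikes[t]) {
            // First-to-spike decoding: the earliest-firing output decides.
            std::printf("first spike at step %zu\n", t);
            break;
        }
    }
}
```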