February 17th, 2017
Since 2006, we have been experiencing two very important developments in computing. One is that tremendous amounts of resources have been invested into innovative applications such as first-principle based models, deep learning and cognitive computing. Many application domains are defying the conventional “it is too expensive” thinking that led to inaccuracies and missed opportunities. The other is that the industry has been taking a technological path where application performance and power efficiency vary by more than two orders of magnitude depending on their parallelism, heterogeneity, and locality. Today, most of the top supercomputers in the world are heterogeneous parallel computing systems. New standards such as the Heterogeneous Systems Architecture (HSA) are emerging to facilitate software development. Much has been and needs to be learned about algorithms, languages, compilers and hardware architecture in these movements. What are the applications that continue to drive the technology development? How will we program these systems? How will innovations in memory and storage devices present further opportunities and challenges? What is the impact on the long-term software engineering cost of applications? In this talk, I will present some research opportunities and challenges that are brought about by this perfect storm.
Enabling GPU Support for the COMPSs-Mobile Framework
Using the GPUs embedded in mobile devices allows for increasing the performance of the applications running on them while reducing the energy consumption of their execution. This article presents a task-based solution for adaptive, collaborative heterogeneous computing in mobile cloud environments. To implement our proposal, we extend the COMPSs-Mobile framework – an implementation of the COMPSs programming model for building mobile applications that offload part of the computation to the Cloud – to support offloading computation to GPUs through OpenCL. To evaluate our solution, we subject the prototype to three benchmark applications representing different application patterns. This work is partially supported by the Joint-Laboratory on Extreme Scale Computing (JLESC), by the European Union through the Horizon 2020 research and innovation programme under contract 687584 (TANGO Project), by the Spanish Government (TIN2015-65316-P, BES-2013-067167, EEBB-2016-11272, SEV-2011-00067) and the Generalitat de Catalunya (2014-SGR-1051).
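The COMPSs-Mobile runtime itself is Java-based and its annotation API is not reproduced here. As a rough, hedged illustration of what "offloading computation to GPUs through OpenCL" means for a single task, the following Python/pyopencl sketch builds a small kernel and runs one task's computation on an OpenCL device; the names (vec_add, offload_vec_add) are illustrative, not part of the framework.

```python
# Hedged sketch: one task's computation offloaded to an OpenCL device.
# This is NOT the COMPSs-Mobile API, only an illustration of the OpenCL
# offload target that a GPU-enabled task would map to.
import numpy as np
import pyopencl as cl

KERNEL_SRC = """
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""

def offload_vec_add(a, b):
    """Run an element-wise addition 'task' on the first available OpenCL device."""
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prog = cl.Program(ctx, KERNEL_SRC).build()

    mf = cl.mem_flags
    d_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    d_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    d_out = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prog.vec_add(queue, a.shape, None, d_a, d_b, d_out)

    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, d_out)
    return out

if __name__ == "__main__":
    a = np.random.rand(1024).astype(np.float32)
    b = np.random.rand(1024).astype(np.float32)
    assert np.allclose(offload_vec_add(a, b), a + b)
```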
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments
Deep neural networks (DNNs) have become core computation components within
low latency Function as a Service (FaaS) prediction pipelines: including image
recognition, object detection, natural language processing, speech synthesis,
and personalized recommendation pipelines. Cloud computing, as the de-facto
backbone of modern computing infrastructure for both enterprise and consumer
applications, has to be able to handle user-defined pipelines of diverse DNN
inference workloads while maintaining isolation and latency guarantees, and
minimizing resource waste. The current solution for guaranteeing isolation
within FaaS is suboptimal -- suffering from "cold start" latency. A major cause
of such inefficiency is the need to move large amounts of model data within and
across servers. We propose TrIMS as a novel solution to address these issues.
Our proposed solution consists of a persistent model store across the GPU, CPU,
local storage, and cloud storage hierarchy, an efficient resource management
layer that provides isolation, and a succinct set of application APIs and
container technologies for easy and transparent integration with FaaS, Deep
Learning (DL) frameworks, and user code. We demonstrate our solution by
interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24x
speedup in latency for image classification models and up to 210x speedup for
large models. We achieve up to 8x system throughput improvement. Comment: In Proceedings CLOUD 201
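To make the "cold start" problem concrete, here is a hedged Python sketch of the core idea behind a persistent model store: repeated FaaS invocations on the same server reuse already-loaded model weights instead of re-reading them from disk or cloud storage on every call. The real TrIMS system shares GPU/CPU memory across processes and integrates with DL frameworks; this sketch only shows within-process caching, and all names (ModelStore, fake_loader) are illustrative.

```python
# Hedged sketch of a persistent model store (not the TrIMS implementation):
# the first request for a model pays the load cost, later requests reuse the
# resident copy, avoiding the cold-start penalty on warm servers.
import threading

class ModelStore:
    """Process-wide cache mapping a model id to its loaded weights."""

    def __init__(self, loader):
        self._loader = loader          # function: model_id -> loaded model
        self._models = {}
        self._lock = threading.Lock()

    def get(self, model_id):
        # Fast path: model already resident, no load cost.
        with self._lock:
            model = self._models.get(model_id)
        if model is not None:
            return model
        # Slow path: load once, then keep it resident for later invocations.
        model = self._loader(model_id)
        with self._lock:
            return self._models.setdefault(model_id, model)

# Example: the first call pays the load; later calls are served from the store.
def fake_loader(model_id):
    print(f"loading {model_id} (expensive)")
    return {"id": model_id, "weights": b"\x00" * 1024}

store = ModelStore(fake_loader)
handler_a = store.get("resnet50")   # cold: loads from storage
handler_b = store.get("resnet50")   # warm: reuses the resident copy
assert handler_a is handler_b
```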
A Feature Taxonomy and Survey of Synchronization Primitive Implementations
Coordinated Science Laboratory was formerly known as Control Systems Laboratory. NCR Corporation
Performance Implications of Synchronization Support for Parallel FORTRAN Programs
Coordinated Science Laboratory was formerly known as Control Systems Laboratory. Joint Services Electronics Program / N00014-90-J-1270. National Science Foundation / MIP-8809478. National Aeronautics and Space Administration / NASA NAG 1-613. NCR. AMD 29K Advanced Processor Development Division
Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
Graph Neural Networks (GNNs) are emerging as a powerful tool for learning
from graph-structured data and performing sophisticated inference tasks in
various application domains. Although GNNs have been shown to be effective on
modest-sized graphs, training them on large-scale graphs remains a significant
challenge due to lack of efficient data access and data movement methods.
Existing frameworks for training GNNs use CPUs for graph sampling and feature
aggregation, while the training and updating of model weights are executed on
GPUs. However, our in-depth profiling shows that the CPUs cannot achieve the
throughput required to saturate GNN model training on the GPUs, causing gross
under-utilization of expensive GPU resources. Furthermore, when the graph and
its embeddings do not fit in the CPU memory, the overhead introduced by the
operating system, e.g., for handling page faults, falls on the critical path of
execution.
To address these issues, we propose the GPU Initiated Direct Storage Access
(GIDS) dataloader, to enable GPU-oriented GNN training for large-scale graphs
while efficiently utilizing all hardware resources, such as CPU memory,
storage, and GPU memory with a hybrid data placement strategy. By enabling GPU
threads to fetch feature vectors directly from storage, GIDS dataloader solves
the memory capacity problem for GPU-oriented GNN training. Moreover, GIDS
dataloader leverages GPU parallelism to tolerate storage latency and eliminates
expensive page-fault overhead. Doing so enables us to design novel
optimizations for exploiting locality and increasing effective bandwidth for
GNN training. Our evaluation using a single GPU on terabyte-scale GNN datasets
shows that GIDS dataloader accelerates the overall DGL GNN training pipeline by
up to 392X when compared to the current, state-of-the-art DGL dataloader. Comment: Under Submission. Source code:
https://github.com/jeongminpark417/GID
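For context, the following hedged Python sketch shows the baseline DGL mini-batch pipeline the abstract describes as CPU-bound: neighbor sampling and feature gathering run on the CPU, and only the model update runs on the GPU. GIDS replaces the feature gathering with GPU-initiated direct storage accesses; its own API is not shown. The "feat"/"label" field names, fanouts, and model signature are assumptions for illustration.

```python
# Hedged sketch of the baseline (CPU-driven) DGL GNN training pipeline that
# GIDS targets; not the GIDS dataloader itself.
import torch
import dgl

def baseline_training_loop(g, train_nids, model, opt, device="cuda"):
    sampler = dgl.dataloading.NeighborSampler([10, 10])      # 2-hop fanouts (assumed)
    loader = dgl.dataloading.DataLoader(
        g, train_nids, sampler,
        batch_size=1024, shuffle=True, drop_last=False, num_workers=4)

    for input_nodes, output_nodes, blocks in loader:
        # CPU-side feature aggregation: gather embeddings for the sampled
        # nodes from host memory (triggering page faults when they are not
        # resident), then copy them to the GPU.
        x = g.ndata["feat"][input_nodes].to(device)
        y = g.ndata["label"][output_nodes].to(device)
        blocks = [b.to(device) for b in blocks]

        # Only this part runs on the GPU in the baseline pipeline.
        loss = torch.nn.functional.cross_entropy(model(blocks, x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```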