February 17th, 2017
Since 2006, we have been experiencing two very important developments in computing. One is that tremendous amounts of resources have been invested into innovative applications such as first-principle based models, deep learning and cognitive computing. Many application domains are defying the conventional “it is too expensive” thinking that led to inaccuracies and missed opportunities. The other is that the industry has been taking a technological path where application performance and power efficiency vary by more than two orders of magnitude depending on their parallelism, heterogeneity, and locality. Today, most of the top supercomputers in the world are heterogeneous parallel computing systems. New standards such as the Heterogeneous Systems Architecture (HSA) are emerging to facilitate software development. Much has been and needs to be learned about algorithms, languages, compilers and hardware architecture in these movements. What are the applications that continue to drive the technology development? How will we program these systems? How will innovations in memory and storage devices present further opportunities and challenges? What is the impact on the long-term software engineering cost of applications? In this talk, I will present some research opportunities and challenges that are brought about by this perfect storm.
Enabling GPU Support for the COMPSs-Mobile Framework
Using the GPUs embedded in mobile devices allows for increasing the performance of the applications running on them while reducing the energy consumption of their execution. This article presents a task-based solution for adaptive, collaborative heterogeneous computing in mobile cloud environments. To implement our proposal, we extend the COMPSs-Mobile framework – an implementation of the COMPSs programming model for building mobile applications that offload part of the computation to the Cloud – to support offloading computation to GPUs through OpenCL. To evaluate our solution, we subject the prototype to three benchmark applications representing different application patterns. This work is partially supported by the Joint-Laboratory on Extreme Scale Computing (JLESC), by the European Union through the Horizon 2020 research and innovation programme under contract 687584 (TANGO Project), by the Spanish Government (TIN2015-65316-P, BES-2013-067167, EEBB-2016-11272, SEV-2011-00067) and the Generalitat de Catalunya (2014-SGR-1051).
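The COMPSs-Mobile runtime itself is Java-based and its annotation API is not reproduced here. As a rough, hedged illustration of what "offloading computation to GPUs through OpenCL" means for a single task, the following Python/pyopencl sketch builds a small kernel and runs one task's computation on an OpenCL device; the names (vec_add, offload_vec_add) are illustrative, not part of the framework.

```python
# Hedged sketch: one task's computation offloaded to an OpenCL device.
# This is NOT the COMPSs-Mobile API, only an illustration of the OpenCL
# offload target that a GPU-enabled task would map to.
import numpy as np
import pyopencl as cl

KERNEL_SRC = """
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""

def offload_vec_add(a, b):
    """Run an element-wise addition 'task' on the first available OpenCL device."""
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prog = cl.Program(ctx, KERNEL_SRC).build()

    mf = cl.mem_flags
    d_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    d_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    d_out = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prog.vec_add(queue, a.shape, None, d_a, d_b, d_out)

    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, d_out)
    return out

if __name__ == "__main__":
    a = np.random.rand(1024).astype(np.float32)
    b = np.random.rand(1024).astype(np.float32)
    assert np.allclose(offload_vec_add(a, b), a + b)
```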
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments
Deep neural networks (DNNs) have become core computation components within
low latency Function as a Service (FaaS) prediction pipelines: including image
recognition, object detection, natural language processing, speech synthesis,
and personalized recommendation pipelines. Cloud computing, as the de-facto
backbone of modern computing infrastructure for both enterprise and consumer
applications, has to be able to handle user-defined pipelines of diverse DNN
inference workloads while maintaining isolation and latency guarantees, and
minimizing resource waste. The current solution for guaranteeing isolation
within FaaS is suboptimal -- suffering from "cold start" latency. A major cause
of such inefficiency is the need to move large amounts of model data within and
across servers. We propose TrIMS as a novel solution to address these issues.
Our proposed solution consists of a persistent model store across the GPU, CPU,
local storage, and cloud storage hierarchy, an efficient resource management
layer that provides isolation, and a succinct set of application APIs and
container technologies for easy and transparent integration with FaaS, Deep
Learning (DL) frameworks, and user code. We demonstrate our solution by
interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24x
speedup in latency for image classification models and up to 210x speedup for
large models. We achieve up to 8x system throughput improvement. Comment: In Proceedings CLOUD 201
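To make the "cold start" problem concrete, here is a hedged Python sketch of the core idea behind a persistent model store: repeated FaaS invocations on the same server reuse already-loaded model weights instead of re-reading them from disk or cloud storage on every call. The real TrIMS system shares GPU/CPU memory across processes and integrates with DL frameworks; this sketch only shows within-process caching, and all names (ModelStore, fake_loader) are illustrative.

```python
# Hedged sketch of a persistent model store (not the TrIMS implementation):
# the first request for a model pays the load cost, later requests reuse the
# resident copy, avoiding the cold-start penalty on warm servers.
import threading

class ModelStore:
    """Process-wide cache mapping a model id to its loaded weights."""

    def __init__(self, loader):
        self._loader = loader          # function: model_id -> loaded model
        self._models = {}
        self._lock = threading.Lock()

    def get(self, model_id):
        # Fast path: model already resident, no load cost.
        with self._lock:
            model = self._models.get(model_id)
        if model is not None:
            return model
        # Slow path: load once, then keep it resident for later invocations.
        model = self._loader(model_id)
        with self._lock:
            return self._models.setdefault(model_id, model)

# Example: the first call pays the load; later calls are served from the store.
def fake_loader(model_id):
    print(f"loading {model_id} (expensive)")
    return {"id": model_id, "weights": b"\x00" * 1024}

store = ModelStore(fake_loader)
handler_a = store.get("resnet50")   # cold: loads from storage
handler_b = store.get("resnet50")   # warm: reuses the resident copy
assert handler_a is handler_b
```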
A Feature Taxonomy and Survey of Synchronization Primitive Implementations
Coordinated Science Laboratory was formerly known as Control Systems Laboratory. NCR Corporation
Performance Implications of Synchronization Support for Parallel FORTRAN Programs
Coordinated Science Laboratory was formerly known as Control Systems Laboratory. Joint Services Electronics Program / N00014-90-J-1270. National Science Foundation / MIP-8809478. National Aeronautics and Space Administration / NASA NAG 1-613. NCR. AMD 29K Advanced Processor Development Division
Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
Graph Neural Networks (GNNs) are emerging as a powerful tool for learning
from graph-structured data and performing sophisticated inference tasks in
various application domains. Although GNNs have been shown to be effective on
modest-sized graphs, training them on large-scale graphs remains a significant
challenge due to lack of efficient data access and data movement methods.
Existing frameworks for training GNNs use CPUs for graph sampling and feature
aggregation, while the training and updating of model weights are executed on
GPUs. However, our in-depth profiling shows that the CPUs cannot achieve the
throughput required to saturate GNN model training on the GPUs, causing gross
under-utilization of expensive GPU resources. Furthermore, when the graph and
its embeddings do not fit in the CPU memory, the overhead introduced by the
operating system, e.g., for handling page faults, falls on the critical path of
execution.
To address these issues, we propose the GPU Initiated Direct Storage Access
(GIDS) dataloader, to enable GPU-oriented GNN training for large-scale graphs
while efficiently utilizing all hardware resources, such as CPU memory,
storage, and GPU memory with a hybrid data placement strategy. By enabling GPU
threads to fetch feature vectors directly from storage, GIDS dataloader solves
the memory capacity problem for GPU-oriented GNN training. Moreover, GIDS
dataloader leverages GPU parallelism to tolerate storage latency and eliminates
expensive page-fault overhead. Doing so enables us to design novel
optimizations for exploiting locality and increasing effective bandwidth for
GNN training. Our evaluation using a single GPU on terabyte-scale GNN datasets
shows that GIDS dataloader accelerates the overall DGL GNN training pipeline by
up to 392X when compared to the current, state-of-the-art DGL dataloader. Comment: Under Submission. Source code:
https://github.com/jeongminpark417/GID
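For context, the following hedged Python sketch shows the baseline DGL mini-batch pipeline the abstract describes as CPU-bound: neighbor sampling and feature gathering run on the CPU, and only the model update runs on the GPU. GIDS replaces the feature gathering with GPU-initiated direct storage accesses; its own API is not shown. The "feat"/"label" field names, fanouts, and model signature are assumptions for illustration.

```python
# Hedged sketch of the baseline (CPU-driven) DGL GNN training pipeline that
# GIDS targets; not the GIDS dataloader itself.
import torch
import dgl

def baseline_training_loop(g, train_nids, model, opt, device="cuda"):
    sampler = dgl.dataloading.NeighborSampler([10, 10])      # 2-hop fanouts (assumed)
    loader = dgl.dataloading.DataLoader(
        g, train_nids, sampler,
        batch_size=1024, shuffle=True, drop_last=False, num_workers=4)

    for input_nodes, output_nodes, blocks in loader:
        # CPU-side feature aggregation: gather embeddings for the sampled
        # nodes from host memory (triggering page faults when they are not
        # resident), then copy them to the GPU.
        x = g.ndata["feat"][input_nodes].to(device)
        y = g.ndata["label"][output_nodes].to(device)
        blocks = [b.to(device) for b in blocks]

        # Only this part runs on the GPU in the baseline pipeline.
        loss = torch.nn.functional.cross_entropy(model(blocks, x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```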