4 research outputs found

Characterizing Network Requirements for GPU API Remoting in AI Applications

Authors: Chen Haibo, Chen Rong, Chen Zhuofu, Gu Jinyu, Wang Tianxia, Wei Xingda
Publication date: 24/01/2024

GPU remoting is a promising technique for supporting AI applications. Networking plays a key role in enabling remoting. However, the network requirements for efficient remoting, in terms of latency and bandwidth, are unknown. In this paper, we take a GPU-centric approach to derive the minimum latency and bandwidth requirements for GPU remoting, while ensuring no (or little) performance degradation for AI applications. Our study, which includes a theoretical model, demonstrates that, with careful remoting design, unmodified AI applications can run on the remoting setup using commodity networking hardware, with low network demands, without any overhead or even with better performance.

arXiv.org e-Print Archive

Scheduling computations with provably low synchronization overheads

Authors: Paulino Hervé, Rito Guilherme
Publication date: 01/01/2019

Work Stealing has been a very successful algorithm for scheduling parallel computations, and is known to achieve high performance even for computations exhibiting fine-grained parallelism. We present a variant of Work Stealing that provably avoids most synchronization overheads by keeping processors' deques entirely private by default, and only exposing work when requested by thieves. This is the first paper that obtains bounds on the synchronization overheads that are (essentially) independent of the total amount of work, thus corresponding to a great improvement, in both algorithm design and theory, over state-of-the-art Work Stealing algorithms. Consider any computation with work $T_{1}$ and critical-path length $T_{\infty}$ executed by $P$ processors using our scheduler. Our analysis shows that the expected execution time is $O\left(\frac{T_{1}}{P} + T_{\infty}\right)$, and the expected synchronization overheads incurred during the execution are at most $O\left(\left(C_{CAS} + C_{MFence}\right)PT_{\infty}\right)$, where $C_{CAS}$ and $C_{MFence}$ respectively denote the maximum cost of executing a Compare-And-Swap instruction and a Memory Fence instruction.
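
To make the two bounds in this abstract concrete, the following minimal Python sketch evaluates the expressions inside the O(.) terms. It ignores the constant factors that the asymptotic notation hides, and every number in it (processor count, critical-path length, instruction costs) is a hypothetical value chosen for illustration, not a figure from the paper. The function names are likewise assumptions introduced here.

```python
def time_bound(t1: float, t_inf: float, p: int) -> float:
    """Expression inside O(.) for expected execution time: T_1 / P + T_inf."""
    return t1 / p + t_inf


def sync_overhead_bound(c_cas: float, c_mfence: float, p: int, t_inf: float) -> float:
    """Expression inside O(.) for synchronization overhead:
    (C_CAS + C_MFence) * P * T_inf. Note that T_1 does not appear."""
    return (c_cas + c_mfence) * p * t_inf


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    p, t_inf = 8, 1_000          # processors, critical-path length
    c_cas, c_mfence = 20, 30     # assumed instruction costs (arbitrary units)

    # Growing the total work T_1 changes the time term but not the
    # synchronization term, which depends only on P and T_inf.
    for t1 in (10**6, 10**9):
        print(t1, time_bound(t1, t_inf, p),
              sync_overhead_bound(c_cas, c_mfence, p, t_inf))
```

Under these assumed values the synchronization term stays the same for both choices of T_1, which is the sense in which the abstract describes the overhead bound as (essentially) independent of the total amount of work.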

arXiv.org e-Print Archive

E-LIS