Characterizing Network Requirements for GPU API Remoting in AI Applications
GPU remoting is a promising technique for supporting AI applications.
Networking plays a key role in enabling remoting. However, for efficient
remoting, the network requirements in terms of latency and bandwidth are
unknown. In this paper, we take a GPU-centric approach to derive the minimum
latency and bandwidth requirements for GPU remoting, while ensuring no (or
little) performance degradation for AI applications. Our study, which includes a
theoretical model, demonstrates that, with careful remoting design, unmodified
AI applications can run in a remoting setup on commodity networking hardware
without any overhead, or even with better performance, while placing only low
demands on the network.
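The requirement can be illustrated with a toy back-of-envelope model (this is my own sketch, not the paper's derivation; every parameter name and figure below is hypothetical): with asynchronous, pipelined kernel launches, remoting adds no overhead as long as the network can deliver the next launch before the remote GPU drains its queue.

```python
def remoting_is_hidden(kernel_ms, launch_bytes, rtt_ms, bandwidth_MBps, queue_depth):
    """Toy model: True if the network keeps the remote GPU busy.

    kernel_ms:      average kernel execution time on the GPU (ms)
    launch_bytes:   bytes sent per remoted API call (hypothetical figure)
    rtt_ms:         network round-trip time (ms)
    bandwidth_MBps: link bandwidth (MB/s)
    queue_depth:    launches the client keeps in flight (async pipelining)
    """
    # Serialization delay of one launch message, in ms
    # (1 MB/s = 1000 bytes/ms).
    send_ms = launch_bytes / (bandwidth_MBps * 1e3)
    # With queue_depth calls in flight, the GPU only stalls if one round
    # trip plus serialization exceeds the work already queued on the device.
    return rtt_ms + send_ms <= queue_depth * kernel_ms


# Millisecond-scale kernels tolerate commodity-network latency easily...
print(remoting_is_hidden(1.0, 512, 0.05, 10_000, 8))   # deep pipeline, slow kernels
# ...while microsecond kernels with a shallow pipeline do not.
print(remoting_is_hidden(0.01, 512, 0.2, 10_000, 4))
```

The model only captures launch-path latency; data-transfer bandwidth for inputs and outputs would add a separate, analogous constraint.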
Scheduling computations with provably low synchronization overheads
Work Stealing (WS) has been a very successful algorithm for scheduling parallel
computations, and is known to achieve high performance even for computations
exhibiting fine-grained parallelism. We present a variant of WS that provably
avoids most synchronization overheads by keeping processors' deques entirely
private by default, and only exposing work when requested by thieves. This is
the first paper that obtains bounds on the synchronization overheads that are
(essentially) independent of the total amount of work, thus corresponding to a
great improvement, in both algorithm design and theory, over state-of-the-art
WS algorithms. Consider any computation with work $T_1$ and critical-path
length $T_{\infty}$ executed by $P$ processors using our scheduler. Our
analysis shows that the expected execution time is $O(T_1/P + T_{\infty})$,
and the expected synchronization overheads incurred during the execution are
at most $O((C_{CAS} + C_{MFence}) \cdot P T_{\infty})$, where $C_{CAS}$ and
$C_{MFence}$ respectively denote the maximum cost of executing a
Compare-And-Swap instruction and a Memory Fence instruction.
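The private-deque idea can be sketched as follows (a minimal, illustrative Python simulation under my own assumptions, not the paper's implementation; the class and method names are invented): the owner's push/pop touch no shared state, so the common path needs no CAS or fence; a thief posts a request into a small mailbox, and the victim transfers work the next time it reaches a task boundary, so synchronization cost scales with the number of steals rather than with the total work.

```python
from collections import deque
import threading

class Worker:
    """Worker with an entirely private deque.

    Only the mailbox (self.request) is shared; push/pop on the deque
    itself involve no synchronization at all.
    """
    def __init__(self, wid):
        self.id = wid
        self.deque = deque()          # private: only the owner touches it
        self.request = None           # mailbox: a thief's response box, or None
        self.lock = threading.Lock()  # protects only the mailbox

    def push(self, task):
        self.deque.append(task)       # fast path: no CAS, no fence

    def pop(self):
        self._serve_request()         # communication happens only here
        return self.deque.pop() if self.deque else None

    def _serve_request(self):
        # Called by the owner at task boundaries: if a thief is waiting,
        # expose the oldest task (or None if the deque is empty).
        with self.lock:
            box = self.request
            if box is not None:
                box.append(self.deque.popleft() if self.deque else None)
                self.request = None

    def steal_from(self, victim):
        box = []
        with victim.lock:
            if victim.request is not None:
                return None           # another thief is already waiting
            victim.request = box
        # In a real scheduler the thief spins or blocks until the victim
        # fills the box; here we simulate the victim reaching a task
        # boundary by invoking its serve routine directly.
        while not box:
            victim._serve_request()
        return box[0]


victim, thief = Worker(0), Worker(1)
for i in range(3):
    victim.push(i)
print(thief.steal_from(victim))  # thief receives the oldest task
print(victim.pop())              # owner still pops LIFO from its own end
```

Because work is only exposed on demand, the number of expensive synchronization episodes is tied to the number of steal requests, which the analysis above bounds by $O(P T_{\infty})$ rather than by the total work $T_1$.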