74 research outputs found
Massively Parallel Video Networks
We introduce a class of causal video understanding models that aim to
improve the efficiency of video processing by maximising throughput, minimising
latency, and reducing the number of clock cycles. Leveraging operation
pipelining and multi-rate clocks, these models perform a minimal amount of
computation (e.g. as few as four convolutional layers) for each frame per
timestep to produce an output. The models are still very deep, with dozens of
such operations in total, but performed in a pipelined fashion that enables
depth-parallel computation. We illustrate the proposed principles by applying
them to existing image architectures and analyse their behaviour on two video
tasks: action recognition and human keypoint localisation. The results show
that a significant degree of parallelism, and implicitly speedup, can be
achieved with little loss in performance.
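The depth-parallel pipelining described above can be sketched in a few lines. This is an illustrative toy, not the paper's code; the layer functions and data are invented for the example. The key mechanism is that at each timestep, layer k consumes what layer k-1 produced at the previous timestep, so all layers can fire concurrently on different frames: each worker does one layer of work per step, at the price of an output latency equal to the pipeline depth.

```python
# Toy sketch of depth-parallel pipelining (invented for illustration):
# at each timestep, layer k processes the activation that layer k-1
# produced at the PREVIOUS timestep, so all layers run concurrently
# on different frames.

def make_layer(k):
    # stand-in for a convolutional layer: tag the frame with the layer index
    return lambda x: (x[0], x[1] + [k])

def pipelined_run(frames, num_layers=4):
    layers = [make_layer(k) for k in range(num_layers)]
    regs = [None] * num_layers          # pipeline registers between layers
    outputs = []
    for frame in frames:
        new_regs = [None] * num_layers
        # all layers fire "in parallel", each on data from the previous step
        for k in range(num_layers):
            inp = (frame, []) if k == 0 else regs[k - 1]
            if inp is not None:
                new_regs[k] = layers[k](inp)
        regs = new_regs
        # the first output emerges only after the pipeline has filled
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs

outs = pipelined_run(list(range(10)), num_layers=4)
# frame 0 emerges at timestep 3, having passed through layers [0, 1, 2, 3]
```

With 10 input frames and a depth of 4, the last 3 frames are still in flight when the input ends, which is the latency/throughput trade-off the abstract describes.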
GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism
Communication is a key bottleneck for distributed graph neural network (GNN)
training. This paper proposes GNNPipe, a new approach that scales
distributed full-graph deep GNN training. GNNPipe is the first to use
layer-level model parallelism for GNN training: it partitions GNN layers among
GPUs, and each device performs the computation for a disjoint subset of
consecutive GNN layers on the whole graph. Compared to graph parallelism, where
each GPU handles
a graph partition, GNNPipe reduces the communication volume by a factor of the
number of GNN layers. GNNPipe overcomes the unique challenges for pipelined
layer-level model parallelism on the whole graph by partitioning it into
dependent chunks, allowing the use of historical vertex embeddings, and
applying specific training techniques to ensure convergence. We also propose a
hybrid approach that combines GNNPipe with graph parallelism to handle large
graphs, achieve better compute resource utilization, and ensure model
convergence. We build a general GNN training system supporting all three
parallelism settings. Extensive experiments show that our method reduces the
per-epoch training time by up to 2.45x (on average 1.58x) and reduces the
communication volume and overhead by up to 22.89x and 27.21x (on average 8.69x
and 11.60x), respectively, while achieving a comparable level of model accuracy
and convergence speed compared to graph parallelism.
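The layer-partitioning idea can be sketched as follows. This is a minimal illustration under invented helper names, not GNNPipe's actual API: consecutive layers are sliced into per-GPU stages, and cross-device activation exchanges happen only at the stage boundaries, versus an exchange at every layer under graph parallelism, which is consistent with the abstract's claim of a communication reduction by a factor of roughly the number of layers.

```python
# Illustrative sketch (names invented, not GNNPipe's implementation):
# assign each GPU a disjoint, consecutive slice of GNN layers, and
# compare how many cross-device exchanges each scheme needs.

def partition_layers(num_layers, num_gpus):
    # split layers into contiguous stages, one per GPU
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for g in range(num_gpus):
        size = base + (1 if g < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

def comm_events(num_layers, num_gpus):
    # graph parallelism: neighbor aggregation crosses GPUs at every layer;
    # layer-level pipeline: activations cross only at stage boundaries
    graph_parallel = num_layers
    pipeline = num_gpus - 1
    return graph_parallel, pipeline

stages = partition_layers(num_layers=8, num_gpus=4)
# stages -> [[0, 1], [2, 3], [4, 5], [6, 7]]
gp, pipe = comm_events(num_layers=8, num_gpus=4)
# 8 per-layer exchanges vs. 3 boundary exchanges
```

The dependent-chunk scheduling and historical vertex embeddings mentioned in the abstract are what make this pipeline correct on a whole graph; they are omitted here, since this sketch only counts where communication occurs.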
- …