Multi-node Acceleration for Large-scale GCNs
Limited by memory capacity and compute power, single-node graph
convolutional network (GCN) accelerators cannot execute GCNs on today's
explosively large graphs within a reasonable amount of time. Large-scale
GCNs therefore call for a multi-node acceleration system (MultiAccSys),
analogous to the TPU-Pod for large-scale neural networks. In this work, we
aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale
graphs. We first identify the communication pattern and challenges of
multi-node acceleration for GCNs on large-scale graphs. We observe that (1)
coarse-grained communication patterns exist in the execution of GCNs in
MultiAccSys, which introduces massive amount of redundant network transmissions
and off-chip memory accesses; (2) overall, the acceleration of GCNs in
MultiAccSys is bandwidth-bound and latency-tolerant. Guided by these two
observations, we then propose MultiGCN, the first MultiAccSys for large-scale
GCNs that trades network latency for network bandwidth. Specifically, by
leveraging this latency tolerance, we first propose a topology-aware
multicast mechanism with a one-put-per-multicast message-passing model to
reduce transmissions and alleviate network bandwidth requirements. Second, we
introduce a scatter-based round execution mechanism which cooperates with the
multicast mechanism and reduces redundant off-chip memory accesses. Compared to
the baseline MultiAccSys, MultiGCN achieves a 4~12x speedup using only
28%~68% of the energy, while reducing network transmissions by 32% and
off-chip memory accesses by 73% on average. It not only achieves a 2.5~8x
speedup over the state-of-the-art multi-GPU solution, but also scales to
large-scale graphs, unlike single-node GCN accelerators.
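
The two mechanisms can be pictured with a small simulation. The Python sketch below is illustrative only, not the authors' implementation: the ring topology, the one-consumer-per-node toy graph, and all names (Node, multicast_put, the vertex ids) are assumptions. It shows how a single one-put-per-multicast message traveling hop-by-hop lets every link carry a feature at most once, while each node scatter-accumulates the feature into its local partial aggregations on arrival instead of re-reading it from off-chip memory later.

from collections import defaultdict

FEAT_DIM = 4

class Node:
    """One accelerator node holding a partition of the graph."""
    def __init__(self, node_id, in_edges):
        self.id = node_id
        self.in_edges = in_edges  # local dst vertex -> set of src vertices it aggregates
        self.partial = defaultdict(lambda: [0.0] * FEAT_DIM)  # partial aggregations

    def needs(self, src_vid):
        return any(src_vid in srcs for srcs in self.in_edges.values())

    def scatter(self, src_vid, feat):
        # Scatter-style execution: consume the remote feature the moment it
        # arrives, updating every local partial sum that uses it, rather than
        # buffering it and re-fetching it from off-chip memory later.
        for dst, srcs in self.in_edges.items():
            if src_vid in srcs:
                self.partial[dst] = [a + f for a, f in zip(self.partial[dst], feat)]

def multicast_put(nodes, origin, src_vid, feat):
    """One put per multicast: the feature makes a single pass around a ring
    of nodes; each hop scatters it locally and forwards it, so every link
    carries the feature at most once."""
    hops, nxt = 0, (origin + 1) % len(nodes)
    while nxt != origin:
        hops += 1                      # one link traversal per forward
        nodes[nxt].scatter(src_vid, feat)
        nxt = (nxt + 1) % len(nodes)
    return hops

# Toy run: vertex 0 lives on node 0 and its feature is needed by all three
# remote nodes (one consumer vertex per node; ids 10/20/30 are made up).
nodes = [Node(0, {}), Node(1, {10: {0}}), Node(2, {20: {0}}), Node(3, {30: {0}})]
feat = [1.0] * FEAT_DIM

# Unicast baseline: one point-to-point put per consumer, each traveling its
# ring distance from node 0, i.e. 1 + 2 + 3 = 6 link traversals.
unicast_hops = sum(n.id for n in nodes[1:] if n.needs(0))
multicast_hops = multicast_put(nodes, 0, 0, feat)   # single trip: 3 traversals
print(unicast_hops, multicast_hops)                 # -> 6 3

In this sketch the multicast message takes more hops before reaching the last consumer (higher latency) but uses each link only once (lower bandwidth demand), which is the latency-for-bandwidth trade the abstract describes.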