On a GPU cluster, the high ratio of computing power to communication
bandwidth makes scaling breadth-first search (BFS) on a scale-free graph
extremely challenging. By separating high and low out-degree vertices, we
present an implementation with scalable computation and a model for scalable
communication for BFS and direction-optimized BFS. Our communication model uses
global reduction for high-degree vertices, and point-to-point transmission for
low-degree vertices. Leveraging the characteristics of degree separation, we
reduce the graph size to one third of the conventional edge list
representation. With several other optimizations, we observe linear weak
scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a
scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access
system.

Comment: 12 pages, 13 figures. To appear at IPDPS 201
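
The communication split described in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' GPU implementation: the function names, the modulo ownership rule, and the degree threshold are all hypothetical. A set union stands in for the global reduction over high-degree vertices, and per-destination buckets stand in for point-to-point messages carrying low-degree vertices.

```python
from collections import defaultdict


def partition_by_degree(adj, threshold):
    """Split vertex ids into high and low out-degree sets (threshold is an assumption)."""
    high = {v for v, nbrs in adj.items() if len(nbrs) >= threshold}
    low = set(adj) - high
    return high, low


def bfs_step(adj, owner, num_ranks, frontier, visited, high):
    """Simulate one BFS level across num_ranks workers in a single process."""
    # Per-rank discoveries of high-degree vertices, later merged by a "reduction".
    local_high = [set() for _ in range(num_ranks)]
    # Per-rank outboxes keyed by destination rank, standing in for point-to-point sends.
    outbox = [defaultdict(set) for _ in range(num_ranks)]

    for u in frontier:
        src = owner(u)
        for v in adj.get(u, ()):
            if v in visited:
                continue
            if v in high:
                local_high[src].add(v)
            else:
                outbox[src][owner(v)].add(v)

    # Stand-in for a global reduction (for example, a bitwise OR over bitmasks).
    reduced_high = set().union(*local_high)

    # Stand-in for delivering the point-to-point messages.
    received = set()
    for src in range(num_ranks):
        for vertices in outbox[src].values():
            received |= vertices

    return reduced_high | received


def bfs_levels(adj, source, num_ranks=2, threshold=3):
    """Return a dict mapping each reachable vertex to its BFS depth."""
    high, _ = partition_by_degree(adj, threshold)
    owner = lambda v: v % num_ranks  # hypothetical vertex-to-rank ownership rule
    depth = {source: 0}
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        frontier = bfs_step(adj, owner, num_ranks, frontier, depth, high)
        for v in frontier:
            depth[v] = level
    return depth


# Tiny example graph: vertex 0 is the only high out-degree vertex.
example = {0: [1, 2, 3], 1: [4], 2: [4], 3: [], 4: [0]}
depths = bfs_levels(example, 0)
```

In this toy graph only vertex 0 meets the threshold, so every vertex discovered from it travels through the bucketed (point-to-point) path; a vertex such as 0 rediscovered by several ranks would instead be merged once by the reduction.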