Spatial Mixture-of-Experts
Many data have an underlying dependence on spatial location; it may be
weather on the Earth, a simulation on a mesh, or a registered image. Yet this
feature is rarely taken advantage of, and violates common assumptions made by
many neural network layers, such as translation equivariance. Further, many
works that do incorporate locality fail to capture fine-grained structure. To
address this, we introduce the Spatial Mixture-of-Experts (SMoE) layer, a
sparsely-gated layer that learns spatial structure in the input domain and
routes experts at a fine-grained level to utilize it. We also develop new
techniques to train SMoEs, including a self-supervised routing loss and damping
expert errors. Finally, we show strong results for SMoEs on numerous tasks, and
set new state-of-the-art results for medium-range weather prediction and
post-processing ensemble weather forecasts.
Comment: 20 pages, 3 figures; NeurIPS 2022
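A minimal sketch of the per-location routing idea described above; the class name, shapes, and hard top-1 routing are illustrative assumptions, not the paper's implementation:

```python
# Sketch of a spatial mixture-of-experts layer: a learned per-location routing
# map assigns each spatial position to one expert. For clarity all experts are
# evaluated densely and routing is a hard argmax; a real SMoE exploits the
# sparsity and trains the routing (e.g. via a self-supervised routing loss).
import torch
import torch.nn as nn

class SpatialMoESketch(nn.Module):
    def __init__(self, channels: int, n_experts: int, height: int, width: int):
        super().__init__()
        # One small expert (here a 1x1 conv) per slot in the mixture.
        self.experts = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(n_experts)]
        )
        # Learned per-location routing logits: the "spatial structure".
        self.routing_logits = nn.Parameter(torch.zeros(n_experts, height, width))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        assignment = self.routing_logits.argmax(dim=0)   # (H, W), one expert per cell
        out = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = (assignment == idx).to(x.dtype)       # (H, W), broadcasts over B, C
            out = out + expert(x) * mask
        return out
```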
Neural parameter allocation search
https://arxiv.org/pdf/2006.10598.pdf
First author draft
Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks
The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, and sometimes even better than, the original dense networks. Sparsity promises to reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.
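As a concrete instance of the sparsification methods the survey covers, here is a minimal sketch of global magnitude pruning; the function name and toy parameters are assumptions for illustration, not code from the paper:

```python
# Sketch of global magnitude pruning: zero out roughly the fraction `sparsity`
# of weights with the smallest absolute value, across all layers at once.
import torch

def magnitude_prune(weights: dict, sparsity: float) -> dict:
    all_weights = torch.cat([w.abs().flatten() for w in weights.values()])
    k = int(sparsity * all_weights.numel())
    # k-th smallest magnitude becomes the pruning threshold.
    threshold = torch.kthvalue(all_weights, k).values if k > 0 else torch.tensor(0.0)
    return {name: w * (w.abs() > threshold) for name, w in weights.items()}

# Example: prune a toy two-layer model to 90% sparsity.
params = {"fc1": torch.randn(256, 128), "fc2": torch.randn(10, 256)}
pruned = magnitude_prune(params, sparsity=0.9)
```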
Learning Combinatorial Node Labeling Algorithms
We present a graph neural network to learn graph coloring heuristics using
reinforcement learning. Our learned deterministic heuristics give better
solutions than classical degree-based greedy heuristics and only take seconds
to evaluate on graphs with tens of thousands of vertices. As our approach is
based on policy gradients, it also learns a probabilistic policy. These
probabilistic policies outperform all greedy coloring baselines and a machine
learning baseline. Our approach generalizes several previous machine-learning
frameworks, which were applied to problems such as minimum vertex cover. We also
demonstrate that our approach outperforms two greedy heuristics on minimum
vertex cover.
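For reference, a minimal sketch of the kind of degree-based greedy coloring heuristic the learned policies are compared against; the function and toy graph are purely illustrative, not the paper's baseline code:

```python
# Sketch of a degree-based greedy coloring baseline: visit vertices in order of
# decreasing degree and give each the smallest color unused by its neighbors.
def greedy_color_by_degree(adj: dict) -> dict:
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    colors = {}
    for v in order:
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors

# Example: a 4-cycle with one chord, which needs 3 colors.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(greedy_color_by_degree(graph))
```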
Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Deep learning at scale is dominated by communication time. Distributing
samples across nodes usually yields the best performance, but poses scaling
challenges due to global information dissemination and load imbalance across
uneven sample lengths. State-of-the-art decentralized optimizers mitigate the
problem, but require more iterations to achieve the same accuracy as their
globally-communicating counterparts. We present Wait-Avoiding Group Model
Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global
communication via subgroup weight exchange. The key insight is a combination of
algorithmic changes to the averaging scheme and the use of a group allreduce
operation. We prove the convergence of WAGMA-SGD, and empirically show that it
retains convergence rates similar to Allreduce-SGD. For evaluation, we train
ResNet-50 on ImageNet; Transformer for machine translation; and deep
reinforcement learning for navigation at scale. Compared with state-of-the-art
decentralized SGD variants, WAGMA-SGD significantly improves training
throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves
the fastest time-to-solution (e.g., the highest score using the shortest
training time for Transformer).
Comment: Published in IEEE Transactions on Parallel and Distributed Systems
(IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021
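A minimal sketch of the subgroup-averaging step at the heart of such an optimizer, assuming torch.distributed is already initialized and the world size divides evenly into groups; the static group layout and synchronous call are simplifications, not the paper's wait-avoiding (asynchronous) algorithm:

```python
# Sketch: average model replicas only within small groups instead of globally,
# replacing the global allreduce of Allreduce-SGD with a group allreduce.
import torch
import torch.distributed as dist

def make_groups(world_size: int, group_size: int):
    # new_group is collective: every rank must create every group, in the same order.
    return [
        dist.new_group(ranks=list(range(start, start + group_size)))
        for start in range(0, world_size, group_size)
    ]

def group_average(model: torch.nn.Module, groups, group_size: int) -> None:
    group = groups[dist.get_rank() // group_size]
    for p in model.parameters():
        # Sum the parameters within the group, then divide to get the group mean.
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
        p.data /= group_size
```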
Neural Graph Databases
Graph databases (GDBs) enable processing and analysis of unstructured,
complex, rich, and usually vast graph datasets. Despite the large significance
of GDBs in both academia and industry, little effort has been made to integrate
them with the predictive power of graph neural networks (GNNs). In
this work, we show how to seamlessly combine nearly any GNN model with the
computational capabilities of GDBs. For this, we observe that the majority of
these systems are based on, or support, a graph data model called the Labeled
Property Graph (LPG), where vertices and edges can have arbitrarily complex
sets of labels and properties. We then develop LPG2vec, an encoder that
transforms an arbitrary LPG dataset into a representation that can be directly
used with a broad class of GNNs, including convolutional, attentional,
message-passing, and even higher-order or spectral models. In our evaluation,
we show that the rich information represented as LPG labels and properties is
properly preserved by LPG2vec, and it increases the accuracy of predictions,
regardless of the targeted learning task or the GNN model used, by up to 34%
compared to graphs with no LPG labels/properties. In general, LPG2vec enables
combining the predictive power of the most powerful GNNs with the full scope of
information encoded in the LPG model, paving the way for neural graph
databases, a class of systems where the vast complexity of maintained data will
benefit from modern and future graph machine learning methods.
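A minimal sketch of the general idea of encoding LPG labels and properties as GNN-ready vertex features; the multi-hot/concatenation scheme below is an illustrative assumption, not LPG2vec's actual encoding:

```python
# Sketch: turn each vertex's labels into a multi-hot vector and append numeric
# property values, producing a feature matrix a standard GNN can consume.
import torch

def encode_lpg_vertices(vertices, label_vocab, prop_keys):
    """vertices: list of (labels: set of str, props: dict of str -> float)."""
    feats = []
    for labels, props in vertices:
        label_vec = [1.0 if l in labels else 0.0 for l in label_vocab]
        prop_vec = [float(props.get(k, 0.0)) for k in prop_keys]
        feats.append(label_vec + prop_vec)
    return torch.tensor(feats)  # shape: (num_vertices, |label_vocab| + |prop_keys|)

# Example: two vertices from a toy labeled property graph.
x = encode_lpg_vertices(
    [({"Person"}, {"age": 34.0}), ({"Person", "Author"}, {"age": 28.0})],
    label_vocab=["Person", "Author", "Paper"],
    prop_keys=["age"],
)
```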