Initialization of ReLUs for Dynamical Isometry
Deep learning relies on good initialization schemes and hyperparameter
choices prior to training a neural network. Random weight initializations
induce random network ensembles, which determine the trainability, training
speed, and sometimes also the generalization ability of an instance. In addition,
such ensembles provide theoretical insights into the space of candidate models
of which one is selected during training. The results obtained so far rely on
mean field approximations that assume infinite layer width and that study
average squared signals. We derive the joint signal output distribution
exactly, without mean field assumptions, for fully-connected networks with
Gaussian weights and biases, and analyze deviations from the mean field
results. For rectified linear units, we further discuss limitations of the
standard initialization scheme, such as its lack of dynamical isometry, and
propose a simple alternative that overcomes these limitations through initial
parameter sharing.
Comment: NeurIPS 2019
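As a point of reference for the parameter-sharing idea, here is a minimal NumPy sketch of a mirrored ("looks-linear") ReLU initialization, in which shared weights of opposite sign make the network exactly linear, and hence isometric, at initialization. This is an illustration of the general idea, not necessarily the exact scheme proposed in the paper:

import numpy as np

rng = np.random.default_rng(0)

def orthogonal(d, rng):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

d = 4
x = rng.standard_normal(d)

# Layer 1 duplicates its pre-activations with opposite signs, so no
# information is lost to ReLU clipping: h = [relu(W1 x); relu(-W1 x)].
W1 = orthogonal(d, rng)
h = np.maximum(0.0, np.concatenate([W1 @ x, -(W1 @ x)]))

# Layer 2 shares parameters across the mirrored blocks: [W2, -W2].
# Since relu(a) - relu(-a) = a, the two layers compose to W2 @ W1 @ x.
W2 = orthogonal(d, rng)
y = np.hstack([W2, -W2]) @ h

assert np.allclose(y, W2 @ W1 @ x)                       # exactly linear at init
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))  # orthogonal => isometric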
International crop trade networks: The impact of shocks and cascades
Analyzing available FAO data from 176 countries over 21 years, we observe an
increase of complexity in the international trade of maize, rice, soy, and
wheat. A larger number of countries play a role as producers or intermediaries,
either for trade or food processing. As a consequence, we find that the trade
networks become more prone to failure cascades caused by exogenous shocks. In
our model, countries compensate for demand deficits by imposing export
restrictions. To capture these effects, we construct higher-order trade dependency
networks for the different crops and years. These networks reveal hidden
dependencies between countries and allow us to discuss policy implications.
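To make the shock-propagation mechanism concrete, the following is a deliberately simplified Python sketch of a failure cascade driven by export restrictions; the trade matrix, shock size, and compensation rule are toy assumptions and do not reproduce the paper's calibrated higher-order dependency networks:

import numpy as np

# T[i, j]: crop volume exported from country i to country j (toy numbers).
T = np.array([[0.0, 5.0, 2.0],
              [0.0, 0.0, 4.0],
              [1.0, 0.0, 0.0]])
deficit = np.array([4.0, 0.0, 0.0])  # exogenous shock hits country 0

for step in range(10):
    exports = T.sum(axis=1)
    # Shocked countries restrict exports to compensate their own deficit.
    cut = np.minimum(deficit, exports)
    frac = np.divide(cut, exports, out=np.zeros_like(cut), where=exports > 0)
    withheld = T * frac[:, None]        # export volume withheld on each link
    T -= withheld
    deficit -= cut
    new_deficit = withheld.sum(axis=0)  # importers inherit the shortfall
    if new_deficit.sum() < 1e-9:
        break                           # the cascade has died out
    deficit += new_deficit

print("trade volumes after the cascade:\n", T)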
Convolutional and Residual Networks Provably Contain Lottery Tickets
The Lottery Ticket Hypothesis continues to have a profound practical impact on the quest for small-scale deep neural networks that solve modern deep learning tasks at competitive performance. These lottery tickets are identified by pruning large, randomly initialized neural networks with architectures as diverse as their applications. Yet, theoretical insights attesting to their existence have mostly focused on deep fully connected feedforward networks with ReLU activation functions. We prove that modern architectures consisting of convolutional and residual layers, which can be equipped with almost arbitrary activation functions, can also contain lottery tickets with high probability.
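Existence proofs in this line of work typically reduce the approximation of a single target weight to a subset-sum problem over random weights. The toy Python sketch below illustrates that reduction with a greedy heuristic; it is our simplified illustration of why overparameterized random networks contain good subnetworks, not the construction from the paper:

import numpy as np

rng = np.random.default_rng(1)

def greedy_subset_sum(target, candidates):
    """Keep a subset of random weights whose sum approximates one target
    weight; pruning discards the rest."""
    total, kept = 0.0, []
    for c in sorted(candidates, key=abs, reverse=True):
        if abs(total + c - target) < abs(total - target):
            total += c
            kept.append(c)
    return total, kept

target = 0.731                                # one weight of a target network
candidates = rng.uniform(-1.0, 1.0, size=50)  # overparameterized random weights
approx, kept = greedy_subset_sum(target, candidates)
print(f"target={target:.4f} approx={approx:.4f} kept={len(kept)}/50")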
Modeling the formation of R\&D alliances: An agent-based model with empirical validation
We develop an agent-based model to reproduce the size distribution of R\&D
alliances of firms. Agents are selected uniformly at random to initiate an
alliance and to invite collaboration partners, who decide about acceptance by
comparing an individual threshold with the utility they expect from joining
the current alliance. The benefit of alliances results from the fitness of the
agents involved. Fitness is obtained from an empirical distribution of agents'
activities. The cost of an alliance reflects its coordination effort. Two free
parameters scale the costs and the individual thresholds, respectively. If
initiators receive too many rejections of invitations, where the maximum number
of rejections is a third free parameter, the alliance formation stops and
another initiator is selected. The three free parameters
are calibrated against a large-scale data set of about 15,000 firms engaging in
about 15,000 R\&D alliances over 26 years. For the validation of the model we
compare the empirical size distribution with the theoretical one, using
confidence bands, and find very good agreement. As an asset of our agent-based
model, we provide an analytical solution that allows us to reduce the simulation
effort considerably. The analytical solution applies to general forms of the
utility of alliances. Hence, the model can be extended to other cases of
alliance formation. While no information about the initiators of an alliance is
available, our results indicate that it is mostly firms with high fitness that
are able to attract newcomers and to establish larger alliances.
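A minimal Python sketch of the mechanism described above may help; the functional forms of utility and cost, the fitness distribution, and all parameter values below are illustrative assumptions, not the calibrated model:

import numpy as np

rng = np.random.default_rng(42)

N = 1000
fitness = rng.pareto(2.0, size=N)  # stand-in for the empirical activity data
threshold_scale = 0.5              # free parameter: scales acceptance thresholds
cost_scale = 0.05                  # free parameter: scales coordination costs
max_rejections = 3                 # free parameter: rejections before stopping

def utility(members):
    benefit = fitness[members].mean()      # benefit from members' fitness
    cost = cost_scale * len(members) ** 2  # coordination effort grows with size
    return benefit - cost

sizes = []
for _ in range(2000):
    initiator = int(rng.integers(N))       # uniformly selected initiator
    alliance, rejections = [initiator], 0
    while rejections < max_rejections:
        invitee = int(rng.integers(N))
        if invitee in alliance:
            continue
        if utility(alliance + [invitee]) > threshold_scale * fitness[invitee]:
            alliance.append(invitee)       # invitee accepts and joins
        else:
            rejections += 1                # counts toward the stopping rule
    sizes.append(len(alliance))

print("mean alliance size:", np.mean(sizes), "max:", max(sizes))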
Most Activation Functions Can Win the Lottery Without Excessive Depth
The strong lottery ticket hypothesis has highlighted the potential for training deep neural networks by pruning, which has inspired interesting practical and theoretical insights into how neural networks can represent functions. For networks with ReLU activation functions, it has been proven that a target network of depth L can be approximated by a subnetwork of a randomly initialized neural network that has double the target's depth, 2L, and is wider by a logarithmic factor. We show that a network of depth L + 1 is sufficient. This result indicates that we can expect to find lottery tickets at realistic, commonly used depths while requiring only logarithmic overparametrization. Our novel construction approach applies to a large class of activation functions and is not limited to ReLUs.
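For the special case of ReLUs, a classical identity gives some intuition for how a single extra layer can replace the factor-two depth overhead (a loose heuristic on our part, not the paper's general construction, which covers a much larger class of activations):
\[
\phi(x) - \phi(-x) = x \qquad \text{for } \phi(x) = \max(0, x),
\]
so a pair of mirrored units pruned out of one random layer can pass any signal through unchanged, letting the remaining layers match the target network one layer at a time.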
Cascade Size Distributions: Why They Matter and How to Compute Them Efficiently
Cascade models are central to understanding, predicting, and controlling
epidemic spreading and information propagation. Related optimization tasks,
including influence maximization, model parameter inference, and the development
of vaccination strategies, rely heavily on sampling from a model, which is
either inefficient or inaccurate. As an alternative, we present an efficient message
passing algorithm that computes the probability distribution of the cascade
size for the Independent Cascade Model on weighted directed networks and
generalizations. Our approach is exact on trees but can be applied to any
network topology. It approximates locally tree-like networks well, scales to
large networks, and can lead to surprisingly good performance even on denser
networks, as we also demonstrate on real-world data.
Comment: Accepted at AAAI 2021
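As an illustration of the message passing idea, the sketch below computes the exact cascade-size distribution on a small rooted tree by convolving the children's subtree-size distributions; the tree and edge probabilities are toy assumptions, and the paper's generalization to arbitrary weighted directed networks is not included:

import numpy as np

children = {0: [1, 2], 1: [3], 2: [], 3: []}  # toy rooted tree
p = {(0, 1): 0.5, (0, 2): 0.3, (1, 3): 0.8}   # edge activation probabilities
n = len(children)

def size_dist(u):
    """Distribution of the number of nodes activated in u's subtree,
    conditioned on u being active (entry k = probability of size k)."""
    dist = np.zeros(n + 1)
    dist[1] = 1.0                               # u itself counts as active
    for v in children[u]:
        child = size_dist(v)
        # The child's subtree joins the cascade with probability p[(u, v)],
        # otherwise it contributes size 0.
        msg = p[(u, v)] * child
        msg[0] += 1.0 - p[(u, v)]
        dist = np.convolve(dist, msg)[: n + 1]  # sum independent contributions
    return dist

dist = size_dist(0)  # seed the cascade at the root
print({k: round(float(dist[k]), 4) for k in range(1, n + 1) if dist[k] > 0})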
Are GATs Out of Balance?
While the expressive power and computational capabilities of graph neural
networks (GNNs) have been theoretically studied, their optimization and
learning dynamics, in general, remain largely unexplored. Our study focuses on
the Graph Attention Network (GAT), a popular GNN architecture in which a node's
neighborhood aggregation is weighted by parameterized attention coefficients.
We derive a conservation law of GAT gradient flow dynamics, which explains why
a large fraction of the parameters in GATs with standard initialization struggle to
change during training. This effect is amplified in deeper GATs, which perform
significantly worse than their shallow counterparts. To alleviate this problem,
we devise an initialization scheme that balances the GAT network. Our approach
i) allows more effective propagation of gradients and in turn enables
trainability of deeper networks, and ii) attains a considerable speedup in
training and convergence time in comparison to the standard initialization. Our
main theorem serves as a stepping stone to studying the learning dynamics of
positive homogeneous models with attention mechanisms.Comment: 25 pages. To be published in Advances in Neural Information
Processing Systems (NeurIPS), 202
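A small Python sketch of norm balancing for a plain two-layer ReLU block conveys the flavor of the approach; the paper's GAT-specific scheme additionally involves the attention parameters, so the following only illustrates the underlying rescaling invariance:

import numpy as np

rng = np.random.default_rng(0)

W1 = rng.standard_normal((64, 32)) * 0.01  # incoming weights, tiny scale
W2 = rng.standard_normal((16, 64))         # outgoing weights, much larger scale

x = rng.standard_normal(32)
y_before = W2 @ np.maximum(0.0, W1 @ x)

for j in range(64):  # balance incoming and outgoing norms of each hidden unit
    s = np.sqrt(np.linalg.norm(W2[:, j]) / np.linalg.norm(W1[j, :]))
    W1[j, :] *= s    # ReLU is positively homogeneous, relu(s*a) = s*relu(a),
    W2[:, j] /= s    # so this rescaling leaves the network function unchanged

y_after = W2 @ np.maximum(0.0, W1 @ x)
assert np.allclose(y_before, y_after)  # same function, balanced weight norms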