Are GATs Out of Balance?
While the expressive power and computational capabilities of graph neural
networks (GNNs) have been theoretically studied, their optimization and
learning dynamics, in general, remain largely unexplored. Our study examines
the Graph Attention Network (GAT), a popular GNN architecture in which a node's
neighborhood aggregation is weighted by parameterized attention coefficients.
We derive a conservation law of GAT gradient flow dynamics, which explains why
a large portion of parameters in GATs with standard initialization struggle to
change during training. This effect is amplified in deeper GATs, which perform
significantly worse than their shallow counterparts. To alleviate this problem,
we devise an initialization scheme that balances the GAT network. Our approach
i) allows more effective propagation of gradients and in turn enables
trainability of deeper networks, and ii) attains a considerable speedup in
training and convergence time in comparison to the standard initialization. Our
main theorem serves as a stepping stone to studying the learning dynamics of
positive homogeneous models with attention mechanisms.
Comment: 25 pages. To be published in Advances in Neural Information Processing Systems (NeurIPS), 2023.
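For readers unfamiliar with the architecture, the following is a minimal numpy sketch of the attention-weighted neighborhood aggregation the abstract describes, assuming the standard single-head GAT formulation; the function and parameter names (gat_aggregate, a_src, a_dst) are illustrative and not taken from the paper's code.
```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_aggregate(h, adj, W, a_src, a_dst):
    """One single-head GAT aggregation step (standard formulation, illustrative).

    h            : (N, F) node features
    adj          : (N, N) binary adjacency matrix with self-loops
    W            : (F, F') shared linear transform
    a_src, a_dst : (F',) halves of the attention parameter vector
    """
    z = h @ W                                      # transformed node features
    out = np.zeros_like(z)
    for i in range(h.shape[0]):
        nbrs = np.nonzero(adj[i])[0]
        # attention logits e_ij = LeakyReLU(a^T [z_i || z_j])
        e = leaky_relu(z[i] @ a_src + z[nbrs] @ a_dst)
        alpha = softmax(e)                         # attention coefficients alpha_ij
        out[i] = alpha @ z[nbrs]                   # weighted neighborhood aggregation
    return out
```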
Identifying overparameterization in Quantum Circuit Born Machines
In machine learning, overparameterization is associated with qualitative
changes in the empirical risk landscape, which can lead to more efficient
training dynamics. For many parameterized models used in statistical learning,
there exists a critical number of parameters, or model size, above which the
model is constructed and trained in the overparameterized regime. There are
many characteristics of overparameterized loss landscapes. The most significant
is the convergence of standard gradient descent to global or local minima of
low loss. In this work, we study the onset of overparameterization transitions
for quantum circuit Born machines, generative models that are trained using
non-adversarial gradient-based methods. We observe that bounds based on
numerical analysis are in general good lower bounds on the overparameterization
transition. However, bounds based on the quantum circuit's algebraic structure
are very loose upper bounds. Our results indicate that fully understanding the
trainability of these models remains an open question.
Comment: 11 pages, 16 figures.
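As a concrete reference point for what is being trained, here is a toy numpy sketch of a quantum circuit Born machine; the hardware-efficient ansatz (RY rotations plus a CNOT ring) and all helper names are assumptions made for illustration, not the circuits studied in the paper. The number of trainable parameters, which is the quantity compared against an overparameterization threshold, simply grows linearly with circuit depth here.
```python
import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def embed_1q(gate, q, n):
    """Lift a single-qubit gate on qubit q to the full 2^n-dimensional space."""
    ops = [gate if k == q else np.eye(2, dtype=complex) for k in range(n)]
    U = ops[0]
    for op in ops[1:]:
        U = np.kron(U, op)
    return U

def cnot(ctrl, tgt, n):
    """Full 2^n x 2^n CNOT built by permuting computational basis states."""
    dim = 2 ** n
    U = np.zeros((dim, dim), dtype=complex)
    for x in range(dim):
        bits = [(x >> (n - 1 - k)) & 1 for k in range(n)]
        if bits[ctrl]:
            bits[tgt] ^= 1
        y = int("".join(map(str, bits)), 2)
        U[y, x] = 1.0
    return U

def born_probs(thetas, n, layers):
    """Born-rule probabilities |<x|psi(theta)>|^2 for an RY + CNOT-ring ansatz."""
    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0
    t = iter(thetas)
    for _ in range(layers):
        for q in range(n):
            psi = embed_1q(ry(next(t)), q, n) @ psi
        for q in range(n):
            psi = cnot(q, (q + 1) % n, n) @ psi
    return np.abs(psi) ** 2

n, layers = 4, 3
n_params = n * layers                        # parameter count grows linearly with depth
p = born_probs(np.random.uniform(0, 2 * np.pi, n_params), n, layers)
print(n_params, p.sum())                     # 12 parameters; probabilities sum to 1
```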
Towards Strong Pruning for Lottery Tickets with Non-Zero Biases
The strong lottery ticket hypothesis holds the promise that pruning randomly initialized deep neural networks could offer a computationally efficient alternative to deep learning with stochastic gradient descent. Common parameter initialization schemes and existence proofs, however, are focused on networks with zero biases, thus foregoing the potential universal approximation property of pruning. To fill this gap, we extend multiple initialization schemes and existence proofs to non-zero biases, including explicit 'looks-linear' approaches for ReLU activation functions. These not only enable truly orthogonal parameter initialization but also reduce potential pruning errors. In experiments on standard benchmark data sets, we further highlight the practical benefits of non-zero bias initialization schemes, and present theoretically inspired extensions for state-of-the-art strong lottery ticket pruning.
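As an illustration of the 'looks-linear' idea mentioned above, the numpy sketch below builds a mirrored two-layer ReLU network that computes an exactly linear map at initialization; the specific non-zero bias values and the Gaussian base matrix are illustrative assumptions and may differ from the schemes proved in the paper.
```python
import numpy as np

def looks_linear_init(d_in, d_hidden, d_out, rng=np.random.default_rng(0)):
    """Mirrored 'looks-linear' initialization for a two-layer ReLU network.

    Because relu(a) - relu(-a) = a, the network computes an exactly linear
    map of its input at initialization.  The non-zero biases and the Gaussian
    base matrix are illustrative choices, not the paper's exact scheme.
    """
    W = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)   # base weight matrix
    b = 0.1 * rng.standard_normal(d_hidden)                     # non-zero bias (assumption)
    W1 = np.concatenate([W, -W], axis=0)                        # mirrored first layer
    b1 = np.concatenate([b, -b])
    V = rng.standard_normal((d_out, d_hidden)) / np.sqrt(d_hidden)
    W2 = np.concatenate([V, -V], axis=1)                        # mirrored second layer
    return W1, b1, W2

def forward(x, W1, b1, W2):
    h = np.maximum(W1 @ x + b1, 0.0)     # ReLU hidden layer of width 2*d_hidden
    return W2 @ h

# At initialization the ReLU network equals the linear map V(Wx + b):
W1, b1, W2 = looks_linear_init(5, 8, 3)
x = np.random.default_rng(1).standard_normal(5)
linear = W2[:, :8] @ (W1[:8] @ x + b1[:8])
print(np.allclose(forward(x, W1, b1, W2), linear))   # True
```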
Householder-Absolute Neural Layers For High Variability and Deep Trainability
We propose a new architecture for artificial neural networks called
Householder-absolute neural layers, or Han-layers for short, that use
Householder reflectors as weight matrices and the absolute-value function for
activation. Han-layers, functioning as fully connected layers, are motivated by
recent results on neural-network variability and are designed to increase
the activation ratio and reduce the chance of Collapse to Constants. Neural
networks constructed chiefly from Han-layers are called HanNets. By
construction, HanNets enjoy a theoretical guarantee that vanishing or exploding
gradients never occur. We conduct several proof-of-concept experiments. Some
surprising results obtained on styled test problems suggest that, under certain
conditions, HanNets exhibit an unusual ability to produce nearly perfect
solutions unattainable by fully connected networks. Experiments on regression
datasets show that HanNets can significantly reduce the number of model
parameters while maintaining or improving the level of generalization accuracy.
In addition, by adding a few Han-layers into the pre-classification FC-layer of
a convolutional neural network, we are able to quickly improve a
state-of-the-art result on the CIFAR10 dataset. These proof-of-concept results are
sufficient to necessitate further studies on HanNets to understand their
capacities and limits, and to exploit their potential in real-world
applications.
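Below is a minimal numpy sketch of a single Han-layer as described above (a Householder reflector followed by an elementwise absolute value); it demonstrates the norm-preservation property behind the no-vanishing/exploding-gradient guarantee. The parameterization is illustrative and omits the paper's full construction.
```python
import numpy as np

def han_layer(x, v):
    """Householder-absolute (Han) layer: y = |H x|, with H = I - 2 v v^T / ||v||^2.

    H is an orthogonal reflector and the absolute value is elementwise, so the
    layer preserves the Euclidean norm of its input exactly.
    """
    v = v / np.linalg.norm(v)
    Hx = x - 2.0 * v * (v @ x)           # apply the reflector without forming H
    return np.abs(Hx)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
print(np.linalg.norm(x))                 # input norm
for _ in range(50):                      # stack 50 Han-layers with random reflectors
    x = han_layer(x, rng.standard_normal(16))
print(np.linalg.norm(x))                 # same norm: no vanishing or exploding signal
```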
From Tight Gradient Bounds for Parameterized Quantum Circuits to the Absence of Barren Plateaus in QGANs
Barren plateaus are a central bottleneck in the scalability of variational
quantum algorithms (VQAs), and are known to arise in various ways, from circuit
depth and hardware noise to global observables. However, a caveat of most
existing results is the requirement of t-design circuit assumptions that are
typically not satisfied in practice. In this work, we loosen these assumptions
altogether and derive tight upper and lower bounds on gradient concentration,
for a large class of parameterized quantum circuits and arbitrary observables.
By requiring only a couple of design choices that are constructive and easily
verified, our results can readily be leveraged to rule out barren plateaus for
explicit circuits and mixed observables, namely, observables containing a
non-vanishing local term. This insight has direct implications for hybrid
Quantum Generative Adversarial Networks (qGANs), a generative model that can be
reformulated as a VQA with an observable composed of local and global terms. We
prove that designing the discriminator appropriately leads to 1-local weights
that stay constant in the number of qubits, regardless of discriminator depth.
Combined with our first contribution, this implies that qGANs with shallow
generators can be trained at scale without suffering from barren plateaus --
making them a promising candidate for applications in generative quantum
machine learning. We demonstrate this result by training a qGAN to learn a 2D
mixture of Gaussian distributions with up to 16 qubits, and provide numerical
evidence that global contributions to the gradient, while initially
exponentially small, may kick in substantially over the course of training.
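As a toy illustration of gradient concentration for local versus global observables (a product ansatz, not the circuit class analyzed in the paper), the numpy sketch below estimates the variance of a parameter-shift gradient over random initializations: for a 1-local observable the variance stays roughly constant in the number of qubits, while for a global observable it decays as 2^-n.
```python
import numpy as np
rng = np.random.default_rng(1)

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def kron_all(mats):
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def expval(thetas, obs):
    """<psi(theta)| obs |psi(theta)> for a single layer of RY rotations on |0...0>."""
    psi = kron_all([ry(t) for t in thetas])[:, 0]    # first column equals U |0...0>
    return np.real(psi.conj() @ obs @ psi)

def grad0(thetas, obs):
    """Parameter-shift gradient with respect to the first rotation angle."""
    shift = np.zeros_like(thetas)
    shift[0] = np.pi / 2
    return 0.5 * (expval(thetas + shift, obs) - expval(thetas - shift, obs))

Z = np.diag([1.0, -1.0]).astype(complex)
I2 = np.eye(2, dtype=complex)
for n in (2, 4, 6):
    local = kron_all([Z] + [I2] * (n - 1))    # 1-local observable: Z on qubit 0
    glob = kron_all([Z] * n)                  # global observable: Z on every qubit
    g_loc = [grad0(rng.uniform(0, 2 * np.pi, n), local) for _ in range(500)]
    g_glo = [grad0(rng.uniform(0, 2 * np.pi, n), glob) for _ in range(500)]
    print(n, np.var(g_loc), np.var(g_glo))    # local ~ 1/2 for all n, global ~ 2^-n
```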