Pushing Stochastic Gradient towards Second-Order Methods -- Backpropagation Learning with Transformations in Nonlinearities
Recently, we proposed to transform the outputs of each hidden neuron in a
multi-layer perceptron network to have zero output and zero slope on average,
and use separate shortcut connections to model the linear dependencies instead.
We continue the work, first by introducing a third transformation to normalize
the scale of the outputs of each hidden neuron, and second by analyzing the
connections to second-order optimization methods. We show that the
transformations make a simple stochastic gradient behave closer to second-order
optimization methods and thus speed up learning. This is shown both in theory
and with experiments. The experiments on the third transformation show that
while it further increases the speed of learning, it can also hurt performance
by converging to a worse local optimum, where both the inputs and outputs of
many hidden neurons are close to zero.
Comment: 10 pages, 5 figures, ICLR2013
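The transformations can be made concrete with a short sketch. The following is a minimal illustration based only on the abstract's description (the function names and the choice of tanh are assumptions, not the paper's code): each hidden unit's nonlinearity is shifted so that its average output and average slope are zero, and the third transformation rescales the result; the linear trend removed this way is meant to be carried by separate shortcut connections.

```python
import numpy as np

def transformed_tanh(x, alpha, beta, gamma):
    # Transformed nonlinearity: zero output, zero slope, and normalized
    # scale on average (the three transformations described above).
    return gamma * (np.tanh(x) - alpha * x - beta)

def update_transforms(x):
    # Re-estimate the transformation parameters from the current
    # activations x of one hidden unit (e.g. over a minibatch).
    slope = 1.0 - np.tanh(x) ** 2            # derivative of tanh
    alpha = slope.mean()                     # makes the average slope zero
    beta = (np.tanh(x) - alpha * x).mean()   # makes the average output zero
    y = np.tanh(x) - alpha * x - beta
    gamma = 1.0 / (y.std() + 1e-8)           # normalizes the output scale
    # The linear part alpha * x removed here is modeled instead by
    # separate shortcut connections (not shown).
    return alpha, beta, gamma
```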
Graph Kernels
We present a unified framework to study graph kernels, special cases of which include the random
walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004;
Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time
complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3).
We find a spectral decomposition approach even more efficient when computing entire kernel matrices.
For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3)
time per iteration, where d is the size of the label set. By extending the necessary linear algebra to
Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels,
and O(n^4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n^2)
time per iteration in all cases. Experiments on graphs from bioinformatics and other application
domains show that these techniques can speed up computation of the kernel by an order of magnitude
or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when
specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to
R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment
kernel of Fröhlich et al. (2006) yet provably positive semi-definite.
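For reference, the quantity being accelerated can be written down directly. The sketch below computes the geometric random walk kernel k(G, G') = q^T (I - lambda W_x)^{-1} p naively via the Kronecker product of the adjacency matrices, which costs O(n^6); the paper's Sylvester-equation reduction yields the same value in O(n^3). The uniform start and stop distributions and the decay lambda are illustrative assumptions, and lambda must be below 1/rho(W_x) for the walk series to converge.

```python
import numpy as np

def random_walk_kernel(A1, A2, lam=0.1):
    # Adjacency matrix of the direct product graph (Kronecker product).
    n1, n2 = A1.shape[0], A2.shape[0]
    Wx = np.kron(A1, A2)
    # Uniform start distribution and stopping weights over vertex pairs.
    p = np.ones(n1 * n2) / (n1 * n2)
    q = np.ones(n1 * n2)
    # k = q^T (I - lam * Wx)^{-1} p: a lam^k-weighted sum of common walks.
    return q @ np.linalg.solve(np.eye(n1 * n2) - lam * Wx, p)
```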
Efficient Exact Inference in Planar Ising Models
We give polynomial-time algorithms for the exact computation of lowest-energy
(ground) states, worst margin violators, log partition functions, and marginal
edge probabilities in certain binary undirected graphical models. Our approach
provides an interesting alternative to the well-known graph cut paradigm in
that it does not impose any submodularity constraints; instead we require
planarity to establish a correspondence with perfect matchings (dimer
coverings) in an expanded dual graph. We implement a unified framework while
delegating complex but well-understood subproblems (planar embedding,
maximum-weight perfect matching) to established algorithms for which efficient
implementations are freely available. Unlike graph cut methods, we can perform
penalized maximum-likelihood as well as maximum-margin parameter estimation in
the associated conditional random fields (CRFs), and employ marginal posterior
probabilities as well as maximum a posteriori (MAP) states for prediction.
Maximum-margin CRF parameter estimation on image denoising and segmentation
problems shows our approach to be efficient and effective. A C++ implementation
is available from http://nic.schraudolph.org/isinf/
Comment: Fixed a number of bugs in v1; added 10 pages of additional figures, explanations, proofs, and experiments.
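As a point of reference for what "exact inference" means here, the sketch below computes the same quantities by brute-force enumeration for a tiny model; the paper's contribution is obtaining them in polynomial time on planar graphs via perfect matchings in an expanded dual graph. The function name and the symmetric coupling matrix J are illustrative assumptions.

```python
import itertools
import numpy as np

def ising_quantities(J):
    # Exhaustive reference computation (feasible only for tiny n) of the
    # quantities the paper computes in polynomial time on planar graphs.
    n = J.shape[0]
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    # E(s) = -sum_{i<j} J_ij s_i s_j = -0.5 s^T J s (J symmetric, zero diag).
    energies = -0.5 * np.einsum('si,ij,sj->s', states, J, states)
    ground = states[energies.argmin()]        # lowest-energy (ground) state
    logZ = np.logaddexp.reduce(-energies)     # log partition function
    probs = np.exp(-energies - logZ)
    # Marginal edge probabilities P(s_i = s_j) for every pair (i, j).
    agree = np.einsum('s,sij->ij', probs,
                      (states[:, :, None] == states[:, None, :]).astype(float))
    return ground, logZ, agree

J = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 0.3],
              [-0.5, 0.3, 0.0]])
ground, logZ, agree = ising_quantities(J)
```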
BAMBI: blind accelerated multimodal Bayesian inference
In this paper we present an algorithm for rapid Bayesian analysis that
combines the benefits of nested sampling and artificial neural networks. The
blind accelerated multimodal Bayesian inference (BAMBI) algorithm implements
the MultiNest package for nested sampling as well as the training of an
artificial neural network (NN) to learn the likelihood function. In the case of
computationally expensive likelihoods, this allows the substitution of a much
more rapid approximation in order to increase significantly the speed of the
analysis. We begin by demonstrating, with a few toy examples, the ability of an
NN to learn complicated likelihood surfaces. BAMBI's ability to decrease
running time for Bayesian inference is then demonstrated in the context of
estimating cosmological parameters from Wilkinson Microwave Anisotropy Probe
and other observations. We show that valuable speed increases are achieved in
addition to obtaining NNs trained on the likelihood functions for the different
model and data combinations. These NNs can then be used for an even faster
follow-up analysis using the same likelihood and different priors. This is a
fully general algorithm that can be applied, without any pre-processing, to
other problems with computationally expensive likelihood functions.
Comment: 12 pages, 8 tables, 17 figures; accepted by MNRAS; v2 to reflect minor changes in published version.
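The core idea lends itself to a compact sketch. The following is a hypothetical illustration of the acceleration scheme, not BAMBI itself (which couples MultiNest with its own network training): cache the (parameter, log-likelihood) pairs a sampler evaluates, fit a regressor on them, and substitute it for the expensive likelihood once it is accurate enough. The scikit-learn regressor, the tolerance, and the toy likelihood are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def expensive_loglike(theta):
    # Stand-in for a slow likelihood (e.g. a full cosmological model run).
    return -0.5 * np.sum((theta - 1.0) ** 2)

rng = np.random.default_rng(0)
cache_X = rng.normal(size=(2000, 4))   # points a sampler would have visited
cache_y = np.array([expensive_loglike(t) for t in cache_X])

# Train a network on the cached evaluations to learn the likelihood surface.
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
nn.fit(cache_X, cache_y)

# Substitute the surrogate only once it reproduces the true values closely.
test = rng.normal(size=(200, 4))
err = np.abs(nn.predict(test) - np.array([expensive_loglike(t) for t in test]))
use_surrogate = bool(err.max() < 0.1)  # accuracy tolerance (assumption)
```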
A Formalization of The Natural Gradient Method for General Similarity Measures
In optimization, the natural gradient method is well-known for likelihood
maximization. The method uses the Kullback-Leibler divergence, corresponding
infinitesimally to the Fisher-Rao metric, which is pulled back to the parameter
space of a family of probability distributions. This way, gradients with
respect to the parameters respect the Fisher-Rao geometry of the space of
distributions, which might differ vastly from the standard Euclidean geometry
of the parameter space, often leading to faster convergence. However, when
minimizing an arbitrary similarity measure between distributions, it is
generally unclear which metric to use. We provide a general framework that,
given a similarity measure, derives a metric for the natural gradient. We then
discuss connections between the natural gradient method and multiple other
optimization techniques in the literature. Finally, we provide computations of
the formal natural gradient to show overlap with well-known cases and to
compute natural gradients in novel frameworks.
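The classical case the paper generalizes admits a short worked example. The sketch below runs natural gradient descent on the parameters (mu, sigma) of a Gaussian, minimizing the KL divergence to a target distribution: the Euclidean gradient is preconditioned by the inverse of the Fisher information, which for N(mu, sigma^2) in (mu, sigma) coordinates is F = diag(1/sigma^2, 2/sigma^2). The learning rate, finite-difference gradient, and target are illustrative choices.

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    # KL( N(mu1, s1^2) || N(mu2, s2^2) ) in closed form.
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def natural_gradient_step(theta, target, lr=0.1, eps=1e-5):
    mu, s = theta
    # Euclidean gradient of the loss, here by central finite differences.
    g = np.zeros(2)
    for k in range(2):
        d = np.zeros(2); d[k] = eps
        g[k] = (kl_gauss(*(theta + d), *target)
                - kl_gauss(*(theta - d), *target)) / (2 * eps)
    # Fisher information of N(mu, s^2) in (mu, s) coordinates.
    F = np.diag([1.0 / s**2, 2.0 / s**2])
    # Natural gradient update: precondition by the inverse metric.
    return theta - lr * np.linalg.solve(F, g)

theta = np.array([0.0, 1.0])     # initial (mu, sigma)
target = np.array([3.0, 0.5])    # target (mu, sigma)
for _ in range(200):
    theta = natural_gradient_step(theta, target)
print(theta)  # approaches the target parameters
```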
Towards Pose-Invariant 2D Face Classification for Surveillance
A key problem for "face in the crowd" recognition from existing surveillance cameras in public spaces (such as mass transit centres) is the issue of pose mismatches between probe and gallery faces. In addition to accuracy, scalability is also important, necessarily limiting the complexity of face classification algorithms. In this paper we evaluate recent approaches to the recognition of faces at relatively large pose angles from a gallery of frontal images and propose novel adaptations as well as modifications. Specifically, we compare and contrast the accuracy, robustness and speed of an Active Appearance Model (AAM) based method (where realistic frontal faces are synthesized from non-frontal probe faces) against bag-of-features methods (which are local feature approaches based on block Discrete Cosine Transforms and Gaussian Mixture Models). We show a novel approach where the AAM based technique is sped up by directly obtaining pose-robust features, allowing the omission of the computationally expensive and artefact-producing image synthesis step. Additionally, we adapt a histogram-based bag-of-features technique to face classification and contrast its properties with a previously proposed direct bag-of-features method. We also show that the two bag-of-features approaches can be considerably sped up, without a loss in classification accuracy, via an approximation of the exponential function. Experiments on the FERET and PIE databases suggest that the bag-of-features techniques generally attain better performance, with significantly lower computational loads. The histogram-based bag-of-features technique is capable of achieving an average recognition accuracy of 89% for pose angles of around 25 degrees.
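The exponential-function approximation mentioned above is worth illustrating, since GMM likelihood evaluation is dominated by exp calls. The abstract does not say which approximation was used; one well-known option is a Schraudolph-style bit manipulation of an IEEE-754 double, sketched below with numpy (the constants give roughly 4% maximum relative error and are valid for |x| up to about 700).

```python
import numpy as np

def fast_exp(x):
    # Approximate exp(x) by writing a scaled and shifted value directly
    # into the high bits of an IEEE-754 double (Schraudolph-style trick;
    # ~4% max relative error, valid for |x| < ~700).
    x = np.asarray(x, dtype=np.float64)
    i = (1512775.0 * x + 1072632447.0).astype(np.int64) << 32
    return i.view(np.float64)

x = np.linspace(-5.0, 5.0, 5)
print(fast_exp(x))  # compare against np.exp(x)
```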