Learning detectors quickly using structured covariance matrices
The computer vision community is increasingly interested in the rapid
estimation of object detectors. Canonical hard negative mining strategies are
slow, as they require multiple passes over the large negative training set. Recent work has
demonstrated that if the distribution of negative examples is assumed to be
stationary, then Linear Discriminant Analysis (LDA) can learn comparable
detectors without ever revisiting the negative set. Even with this insight,
however, the time to learn a single object detector can still be on the order
of tens of seconds on a modern desktop computer. This paper proposes to
leverage the resulting structured covariance matrix to obtain detectors with
identical performance in orders of magnitude less time and memory. We
elucidate an important connection to the correlation filter literature,
demonstrating that correlation filters can likewise be trained without ever
revisiting the negative set.
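The computation at the heart of this approach reduces to a single linear solve: given the (precomputed) negative mean and covariance, the LDA template is w = Σ⁻¹(μ₊ − μ₋), so no pass over the negatives is needed at training time. A minimal NumPy sketch, where the function name, shapes and regulariser are illustrative rather than taken from the paper:

```python
import numpy as np

def lda_detector(pos_feats, neg_mean, neg_cov, reg=1e-2):
    """Learn a linear detector via LDA without revisiting the negative set.

    pos_feats: (n_pos, d) features of positive windows
    neg_mean:  (d,) mean of negative features (precomputed once)
    neg_cov:   (d, d) covariance of negative features (precomputed once)
    reg:       ridge term for numerical stability (illustrative value)
    """
    mu_pos = pos_feats.mean(axis=0)
    sigma = neg_cov + reg * np.eye(neg_cov.shape[0])
    # w = Sigma^{-1} (mu_pos - mu_neg); solve rather than invert.
    return np.linalg.solve(sigma, mu_pos - neg_mean)

# Toy usage with random stand-ins for the precomputed negative statistics.
rng = np.random.default_rng(0)
d = 64
neg = rng.normal(size=(1000, d))
w = lda_detector(rng.normal(loc=0.5, size=(10, d)), neg.mean(0), np.cov(neg.T))
```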
Staple: Complementary Learners for Real-Time Tracking
Correlation Filter-based trackers have recently achieved excellent
performance, showing great robustness to challenging situations exhibiting
motion blur and illumination changes. However, since the model that they learn
depends strongly on the spatial layout of the tracked object, they are
notoriously sensitive to deformation. Models based on colour statistics have
complementary traits: they cope well with variation in shape, but suffer when
illumination is not consistent throughout a sequence. Moreover, colour
distributions alone can be insufficiently discriminative. In this paper, we
show that a simple tracker combining complementary cues in a ridge regression
framework can operate faster than 80 FPS and outperform not only all entries in
the popular VOT14 competition, but also recent and far more sophisticated
trackers according to multiple benchmarks.
Comment: To appear in CVPR 2016.
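At test time, the two complementary cues are merged by a simple convex combination of their per-pixel response maps. A sketch of that merging step, with the weight and the map shapes chosen for illustration:

```python
import numpy as np

def merge_responses(template_resp, hist_resp, alpha=0.3):
    """Combine correlation-filter and colour-histogram response maps.

    Both maps share the same spatial support; alpha is a fixed merging
    weight (the value here is illustrative, not the paper's setting).
    """
    return (1.0 - alpha) * template_resp + alpha * hist_resp

# The tracked position is the argmax of the merged response map.
resp = merge_responses(np.random.rand(64, 64), np.random.rand(64, 64))
row, col = np.unravel_index(resp.argmax(), resp.shape)
```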
On progressive sharpening, flat minima and generalisation
We present a new approach to understanding the relationship between loss
curvature and input-output model behaviour in deep learning. Specifically, we
use existing empirical analyses of the spectrum of deep network loss Hessians
to ground an ansatz tying together the loss Hessian and the input-output
Jacobian of a deep neural network over training samples throughout training. We
then prove a series of theoretical results which quantify the degree to which
the input-output Jacobian of a model approximates its Lipschitz norm over a
data distribution, and deduce a novel generalisation bound in terms of the
empirical Jacobian. We use our ansatz, together with our theoretical results,
to give a new account of the recently observed progressive sharpening
phenomenon, as well as the generalisation properties of flat minima.
Experimental evidence is provided to validate our claims.
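The quantity underlying the bound is the input-output Jacobian evaluated at training points. As a toy illustration (not the paper's construction), the sketch below computes the largest Jacobian singular value of a two-layer ReLU network analytically over a sample of inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(32, 8)), rng.normal(size=(4, 32))

def jacobian_spectral_norm(x):
    """Largest singular value of the input-output Jacobian of
    f(x) = W2 @ relu(W1 @ x), computed via the ReLU activation mask."""
    mask = (W1 @ x > 0).astype(float)   # derivative of ReLU at W1 @ x
    J = W2 @ (mask[:, None] * W1)       # chain rule: W2 diag(mask) W1
    return np.linalg.svd(J, compute_uv=False)[0]

# Empirical estimate of Lipschitz behaviour over a data sample.
xs = rng.normal(size=(100, 8))
empirical = max(jacobian_spectral_norm(x) for x in xs)
```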
End-to-end representation learning for Correlation Filter based tracking
The Correlation Filter is an algorithm that trains a linear template to
discriminate between images and their translations. It is well suited to object
tracking because its formulation in the Fourier domain provides a fast
solution, enabling the detector to be re-trained once per frame. Previous works
that use the Correlation Filter, however, have adopted features that were
either manually designed or trained for a different task. This work is the
first to overcome this limitation by interpreting the Correlation Filter
learner, which has a closed-form solution, as a differentiable layer in a deep
neural network. This enables learning deep features that are tightly coupled to
the Correlation Filter. Experiments illustrate that our method has the
important practical benefit of allowing lightweight architectures to achieve
state-of-the-art performance at high frame rates.
Comment: To appear at CVPR 2017.
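The closed-form solution that makes the Correlation Filter usable as a differentiable layer is a per-frequency ridge regression in the Fourier domain. A single-channel sketch (MOSSE-style, with an illustrative regulariser):

```python
import numpy as np

def correlation_filter(x, y, lam=1e-2):
    """Closed-form single-channel correlation filter.

    x: (h, w) training patch; y: (h, w) desired response
    (e.g. a centred Gaussian). Solves, independently per frequency:
        h_hat = conj(x_hat) * y_hat / (conj(x_hat) * x_hat + lam)
    Every step is differentiable, which is what allows the solver
    to be backpropagated through as a network layer.
    """
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    h_hat = np.conj(x_hat) * y_hat / (np.conj(x_hat) * x_hat + lam)
    return np.real(np.fft.ifft2(h_hat))

def respond(h, z):
    """Detection on a new patch z is cross-correlation, again via the FFT."""
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(h)) * np.fft.fft2(z)))
```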
Learning feed-forward one-shot learners
One-shot learning is usually tackled by using generative models or
discriminative embeddings. Discriminative methods based on deep learning, which
are very effective in other learning scenarios, are ill-suited for one-shot
learning as they need large amounts of training data. In this paper, we propose
a method to learn the parameters of a deep model in one shot. We construct the
learner as a second deep network, called a learnet, which predicts the
parameters of a pupil network from a single exemplar. In this manner we obtain
an efficient feed-forward one-shot learner, trained end-to-end by minimizing a
one-shot classification objective in a learning to learn formulation. In order
to make the construction feasible, we propose a number of factorizations of the
parameters of the pupil network. We demonstrate encouraging results by learning
characters from single exemplars in Omniglot, and by tracking visual objects
from a single initial exemplar in the Visual Object Tracking benchmark.
Comment: The first three authors contributed equally, and are listed in
alphabetical order.
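The learnet idea, one network emitting the parameters of another from a single exemplar, can be sketched in a few lines. The sizes and the single predicted linear layer below are illustrative, not the factorisation used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_exemplar, d_in, d_out = 16, 8, 4

# Learnet: here, a fixed linear map from the exemplar to the pupil's weights.
L = rng.normal(size=(d_out * d_in, d_exemplar)) * 0.1

def pupil_forward(exemplar, x):
    """Predict the pupil's weight matrix from one exemplar, then apply it.
    In the paper the learnet is deep and its prediction is factorised;
    a single linear prediction stands in for it here."""
    W = (L @ exemplar).reshape(d_out, d_in)   # parameters, in one shot
    return W @ x

z = rng.normal(size=d_exemplar)   # the single exemplar
x = rng.normal(size=d_in)         # a query input
y = pupil_forward(z, x)
```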
On skip connections and normalisation layers in deep optimisation
We introduce a general theoretical framework, designed for the study of
gradient optimisation of deep neural networks, that encompasses ubiquitous
architecture choices including batch normalisation, weight normalisation and
skip connections. Our framework determines the curvature and regularity
properties of multilayer loss landscapes in terms of their constituent layers,
thereby elucidating the roles played by normalisation layers and skip
connections in globalising these properties. We then demonstrate the utility of
this framework in two respects. First, we give the only proof of which we are
aware that a class of deep neural networks can be trained using gradient
descent to global optima even when such optima only exist at infinity, as is
the case for the cross-entropy cost. Second, we identify a novel causal
mechanism by which skip connections accelerate training, which we verify
predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet.
Comment: NeurIPS 2023.
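For concreteness, the architectural ingredients the framework covers, a normalisation layer composed with a skip connection, look like this in a toy forward pass (illustrative notation, not the paper's):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise features to zero mean, unit variance.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block(x, W):
    """y = x + W @ relu(layer_norm(x)). The skip connection preserves the
    identity path, the mechanism whose effect on loss-landscape curvature
    the framework analyses."""
    return x + W @ np.maximum(layer_norm(x), 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=32)
y = residual_block(x, rng.normal(size=(32, 32)) * 0.1)
```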
Devon: Deformable Volume Network for Learning Optical Flow
State-of-the-art neural network models estimate large-displacement optical
flow in a multi-resolution fashion, using warping to propagate the estimate
between resolutions. Despite their impressive results, the approach has two
known problems. First, multi-resolution estimation of optical flow fails when
small objects move fast. Second, warping creates artifacts where occlusion or
dis-occlusion occurs. In this paper, we propose a new neural network module,
the Deformable Cost Volume, which alleviates both problems. Based on this
module, we design the Deformable Volume Network (Devon), which estimates
multi-scale optical flow at a single high resolution. Experiments show that
Devon handles fast-moving small objects better and achieves results comparable
to state-of-the-art methods on public benchmarks.
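A standard cost volume stores feature correlations over a contiguous displacement grid; a deformable variant samples the second feature map at deformed offsets instead. A toy sketch of the sampling pattern, with shapes and a fixed dilation chosen for illustration (Devon itself deforms the grid using the current flow estimate rather than a fixed dilation):

```python
import numpy as np

def cost_volume(f1, f2, max_disp=2, dilation=2):
    """Correlate f1 with f2 over a dilated displacement grid.

    f1, f2: (c, h, w) feature maps. Sampling f2 at spread-out offsets,
    rather than contiguous ones, lets a single-resolution volume cover
    large displacements, the idea behind a deformable cost volume.
    """
    c, h, w = f1.shape
    disps = [d * dilation for d in range(-max_disp, max_disp + 1)]
    vol = np.zeros((len(disps), len(disps), h, w))
    for i, dy in enumerate(disps):
        for j, dx in enumerate(disps):
            shifted = np.roll(f2, shift=(dy, dx), axis=(1, 2))
            vol[i, j] = (f1 * shifted).sum(axis=0) / c
    return vol

vol = cost_volume(np.random.rand(8, 16, 16), np.random.rand(8, 16, 16))
```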
Long-Term Visual Object Tracking Benchmark
We propose a new long video dataset (called Track Long and Prosper - TLP) and
benchmark for single object tracking. The dataset consists of 50 HD videos from
real-world scenarios, encompassing a duration of over 400 minutes (676K
frames), making it more than 20 times larger in average duration per sequence
and more than 8 times larger in total covered duration, compared to
existing generic datasets for visual tracking. The proposed dataset paves a way
to suitably assess long term tracking performance and train better deep
learning architectures (avoiding/reducing augmentation, which may not reflect
real world behaviour). We benchmark the dataset on 17 state-of-the-art
trackers and rank them according to tracking accuracy and run-time speed. We
further present a thorough qualitative and quantitative evaluation
highlighting the importance of the long-term aspect of tracking. Our most
interesting observations are that (a) existing short-sequence benchmarks fail
to bring out the inherent differences between tracking algorithms, which widen
on long sequences, and (b) the accuracy of trackers drops abruptly on
challenging long sequences, suggesting the need for research efforts directed
at long-term tracking.
Comment: ACCV 2018 (Oral).
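Tracking accuracy in such benchmarks is commonly measured as the mean intersection-over-union between predicted and ground-truth boxes over a sequence. A minimal sketch of that metric; the (x, y, w, h) box format is an assumption, and benchmarks differ in how they aggregate scores and handle failures:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (x, y, w, h) format."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def sequence_accuracy(preds, gts):
    """Mean per-frame IoU over one sequence, one common accuracy measure."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```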