Unsupervised learning from videos using temporal coherency deep networks
In this work we address the challenging problem of unsupervised learning from
videos. Existing methods utilize the spatio-temporal continuity in contiguous
video frames as regularization for the learning process. Typically, this
temporal coherence of close frames is used as a free form of annotation,
encouraging the learned representations to exhibit small differences between
these frames. But this type of approach fails to capture the dissimilarity
between videos with different content, hence learning less discriminative
features. We here propose two Siamese architectures for Convolutional Neural
Networks, and their corresponding novel loss functions, to learn from unlabeled
videos, which jointly exploit the local temporal coherence between contiguous
frames, and a global discriminative margin used to separate representations of
different videos. An extensive experimental evaluation is presented, where we
validate the proposed models on various tasks. First, we show how the learned
features can be used to discover actions and scenes in video collections.
Second, we show the benefits of unsupervised learning from unlabeled videos
alone, which can be used directly as a prior for the supervised
recognition tasks of actions and objects in images, where our results further
show that our features can even surpass a traditional and heavily supervised
pre-training plus fine-tuning strategy.
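
The idea of combining local temporal coherence with a global discriminative margin can be illustrated with a generic contrastive (Siamese) loss. The sketch below is not the loss proposed in this work, whose exact form the abstract does not give. It assumes a standard hinge-style formulation: pairs of frames from the same video are pulled together, and pairs from different videos are pushed at least a margin apart. The function name, the `margin` parameter, and the use of squared Euclidean distance are all assumptions made for illustration.

```python
import numpy as np


def siamese_contrastive_loss(f_a, f_b, same_video, margin=1.0):
    """Generic contrastive loss over a batch of paired frame embeddings.

    f_a, f_b   : arrays of shape (batch, dim), the two Siamese branch outputs.
    same_video : boolean array of shape (batch,); True when both frames are
                 contiguous frames of the same video.
    margin     : hypothetical minimum distance between different videos.
    """
    f_a = np.asarray(f_a, dtype=float)
    f_b = np.asarray(f_b, dtype=float)
    same = np.asarray(same_video, dtype=bool)

    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(f_a - f_b, axis=1)

    # Local temporal coherence: contiguous frames should lie close together.
    coherent = dist ** 2

    # Global margin: frames of different videos are penalized only while
    # they are closer than `margin`.
    separated = np.maximum(0.0, margin - dist) ** 2

    return float(np.mean(np.where(same, coherent, separated)))


if __name__ == "__main__":
    a = np.zeros((2, 2))
    b = np.array([[0.5, 0.0], [0.5, 0.0]])
    # First pair: same video. Second pair: different videos.
    print(siamese_contrastive_loss(a, b, [True, False], margin=1.0))
```

In a sketch like this, the coherence term alone would let all embeddings collapse to a single point; the margin term is what keeps representations of different videos apart, which is the limitation of coherence-only methods that the abstract points out.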