Unsupervised Learning of Visual Representations using Videos
Is strong supervision necessary for learning a good visual representation? Do
we really need millions of semantically-labeled images to train a Convolutional
Neural Network (CNN)? In this paper, we present a simple yet surprisingly
powerful approach for unsupervised learning of CNNs. Specifically, we use
hundreds of thousands of unlabeled videos from the web to learn visual
representations. Our key idea is that visual tracking provides the supervision.
That is, two patches connected by a track should have similar visual
representation in deep feature space since they probably belong to the same
object or object part. We design a Siamese-triplet network with a ranking loss
function to train this CNN representation. Without using a single image from
ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train
an ensemble of unsupervised networks that achieves 52% mAP (no bounding box
regression). This performance comes tantalizingly close to its
ImageNet-supervised counterpart, an ensemble which achieves a mAP of 54.4%. We
also show that our unsupervised network can perform competitively in other
tasks such as surface-normal estimation.
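
The ranking objective described in this abstract can be illustrated with a minimal sketch. It assumes cosine distance between feature vectors and a margin of 0.5; neither choice is stated in the abstract, and the function names are hypothetical. The loss is zero once the tracked (positive) patch is closer to the anchor patch than the unrelated (negative) patch by at least the margin.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_ranking_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style ranking loss on one (anchor, positive, negative) triplet.

    anchor, positive: features of two patches linked by a track.
    negative: feature of a patch from elsewhere.
    margin: assumed value, chosen only for illustration.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Tracked pair is identical, negative is orthogonal: margin is satisfied.
easy = triplet_ranking_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Roles swapped: the "positive" is far and the "negative" is near.
hard = triplet_ranking_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In a Siamese-triplet setup the three features would come from three weight-sharing copies of the same network; here they are plain lists so the sketch stays self-contained.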
Deep Predictive Models for Collision Risk Assessment in Autonomous Driving
In this paper, we investigate a predictive approach for collision risk
assessment in autonomous and assisted driving. A deep predictive model is
trained to anticipate imminent accidents from traditional video streams. In
particular, the model learns to identify cues in RGB images that are predictive
of hazardous upcoming situations. In contrast to previous work, our approach
incorporates (a) temporal information during decision making, (b) multi-modal
information about the environment, as well as the proprioceptive state and
steering actions of the controlled vehicle, and (c) information about the
uncertainty inherent to the task. To this end, we discuss Deep Predictive
Models and present an implementation using a Bayesian Convolutional LSTM.
Experiments in a simple simulation environment show that the approach can learn
to predict impending accidents with reasonable accuracy, especially when
multiple cameras are used as input sources.
Comment: 8 pages, 4 figures
FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification
Unmanned aerial vehicles (UAVs) are now widely used for data acquisition because of their low cost and fast mobility. With the increasing volume of aerial videos, the demand for automatically parsing these videos is surging. To achieve this, current research mainly focuses on extracting a holistic feature with convolutions along both spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture the long-term temporal dependencies that are important for describing complicated dynamics. In this article, we propose a novel deep neural network, termed Fusing Temporal relations and Holistic features for aerial video classification (FuTH-Net), to model not only holistic features but also temporal relations for aerial video classification. Furthermore, the holistic features are refined by the multiscale temporal relations in a novel fusion module to yield more discriminative video representations. More specifically, FuTH-Net employs a two-pathway architecture: 1) a holistic representation pathway that learns a general feature of both frame appearances and short-term temporal variations and 2) a temporal relation pathway that captures multiscale temporal relations across arbitrary frames, providing long-term temporal dependencies. Afterward, a novel fusion module is proposed to spatiotemporally integrate the features learned by the two pathways. Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results. This demonstrates its effectiveness and good generalization capacity across different recognition tasks (event classification and human action recognition). To facilitate further research, we release the code at https://gitlab.lrz.de/ai4eo/reasoning/futh-net
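
One way to picture "multiscale temporal relations across arbitrary frames" is to enumerate ordered frame tuples of several sizes, each of which a relation module could then score. The sketch below is only an illustration under that assumption; it is not taken from the released FuTH-Net code, and the function name and the choice of scales are hypothetical.

```python
from itertools import combinations


def multiscale_frame_tuples(num_frames, scales=(2, 3)):
    """Map each scale k to all k-frame index tuples in temporal order.

    Frames in a tuple need not be adjacent, so a tuple such as (0, 3)
    spans a long-range dependency that a short convolution would miss.
    """
    return {k: list(combinations(range(num_frames), k)) for k in scales}


tuples_by_scale = multiscale_frame_tuples(num_frames=4)
# tuples_by_scale[2] holds the 6 frame pairs, tuples_by_scale[3] the 4 triples.
```

In practice the number of tuples grows quickly with the number of frames, so a model would typically subsample a few tuples per scale rather than use all of them.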
Future Person Localization in First-Person Videos
We present a new task that predicts future locations of people observed in
first-person videos. Consider a first-person video stream continuously recorded
by a wearable camera. Given a short clip of a person that is extracted from the
complete stream, we aim to predict that person's location in future frames. To
facilitate this future person localization ability, we make the following three
key observations: a) First-person videos typically involve significant
ego-motion which greatly affects the location of the target person in future
frames; b) Scales of the target person act as a salient cue to estimate a
perspective effect in first-person videos; c) First-person videos often capture
people up-close, making it easier to leverage target poses (e.g., where they
look) for predicting their future locations. We incorporate these three
observations into a prediction framework with a multi-stream
convolution-deconvolution architecture. Experimental results reveal our method
to be effective on our new dataset as well as on a public social interaction
dataset.
Comment: Accepted to CVPR 201