Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video
We explore object discovery and detector adaptation based on unlabeled video
sequences captured from a mobile platform. We propose a fully automatic
approach for object mining from video that builds on a generic object
tracker. By applying this method to three large video datasets from
autonomous driving and mobile robotics scenarios, we demonstrate its robustness
and generality. Based on the object mining results, we propose a novel approach
for unsupervised object discovery by appearance-based clustering. We show that
this approach successfully discovers interesting objects relevant to driving
scenarios. In addition, we perform self-supervised detector adaptation in order
to improve detection performance on the KITTI dataset for existing categories.
Our approach has direct relevance for enabling large-scale object learning for
autonomous driving.
Comment: CVPR'18 submission
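To make the appearance-based clustering step concrete, here is a minimal sketch, assuming each mined track has already been pooled into a fixed-length appearance embedding; the use of k-means and all names here are illustrative, not the paper's exact pipeline.

```python
# A minimal sketch of appearance-based clustering for object discovery.
# Assumes each mined track is summarized by a fixed-length appearance
# embedding; k-means and all names are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def discover_object_clusters(track_embeddings: np.ndarray, n_clusters: int = 20):
    """Group mined tracks by appearance; each cluster is a candidate object category."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(track_embeddings)
    return labels, kmeans.cluster_centers_

# Example: 1000 mined tracks with 128-D embeddings.
embeddings = np.random.randn(1000, 128).astype(np.float32)
labels, centers = discover_object_clusters(embeddings)
```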
Unsupervised Deep Tracking
We propose an unsupervised visual tracking method in this paper. Different
from existing approaches using extensive annotated data for supervised
learning, our CNN model is trained on large-scale unlabeled videos in an
unsupervised manner. Our motivation is that a robust tracker should be
effective in both the forward and backward predictions (i.e., the tracker can
forward localize the target object in successive frames and backtrace to its
initial position in the first frame). We build our framework on a Siamese
correlation filter network, which is trained using unlabeled raw videos.
Meanwhile, we propose a multiple-frame validation method and a cost-sensitive
loss to facilitate unsupervised learning. Without bells and whistles, the
proposed unsupervised tracker achieves the baseline accuracy of fully
supervised trackers, which require complete and accurate labels during
training. Furthermore, the unsupervised framework shows potential for
leveraging unlabeled or weakly labeled data to further improve tracking
accuracy.
Comment: to appear in CVPR 2019
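The forward-backward idea lends itself to a simple cycle-consistency loss. Below is a minimal sketch; the `tracker` callable, box parameterization, and dummy offset are placeholders, not the paper's Siamese correlation filter network.

```python
# A minimal sketch of forward-backward (cycle) consistency: track forward
# through a frame pair, track back, and penalize round-trip drift.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(tracker, frame_a, frame_b, init_box):
    fwd_box = tracker(frame_a, frame_b, init_box)  # localize forward in time
    bwd_box = tracker(frame_b, frame_a, fwd_box)   # backtrace to the start
    return F.mse_loss(bwd_box, init_box)           # penalize round-trip drift

# Toy usage with a dummy tracker that shifts boxes by a learnable offset.
offset = torch.zeros(4, requires_grad=True)
tracker = lambda fa, fb, box: box + offset
loss = cycle_consistency_loss(tracker, None, None,
                              torch.tensor([10.0, 10.0, 32.0, 32.0]))
loss.backward()  # gradients flow without any ground-truth labels
```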
Inserting Videos into Videos
In this paper, we introduce a new problem of manipulating a given video by
inserting other videos into it. Our main task is, given an object video and a
scene video, to insert the object video at a user-specified location in the
scene video so that the resulting video looks realistic. We aim to handle
different object motions and complex backgrounds without expensive segmentation
annotations. As it is difficult to collect training pairs for this problem, we
synthesize fake training pairs that can provide helpful supervisory signals
when training a neural network with unpaired real data. The proposed network
architecture can take both real and fake pairs as input and perform both
supervised and unsupervised training in an adversarial learning scheme. To
synthesize a realistic video, the network renders each frame based on the
current input and previous frames. Within this framework, we observe that
injecting noise into previous frames while generating the current frame
stabilizes training. We conduct experiments on real-world videos in object
tracking and person re-identification benchmark datasets. Experimental results
demonstrate that the proposed algorithm is able to synthesize long sequences of
realistic videos with a given object video inserted.
Comment: CVPR 2019
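The noise-injection trick the abstract mentions can be sketched as follows; `generator` is a stub argument, and the noise level is an illustrative assumption, not the paper's setting.

```python
# A minimal sketch of injecting noise into previous frames while generating
# the current one, so the generator does not over-trust its own (possibly
# drifting) history.
import torch

def generate_sequence(generator, first_frame, n_frames, noise_std=0.05):
    frames = [first_frame]
    for _ in range(n_frames - 1):
        prev = frames[-1]
        noisy_prev = prev + noise_std * torch.randn_like(prev)  # stabilizes training
        frames.append(generator(noisy_prev))
    return torch.stack(frames)  # (n_frames, C, H, W)
```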
You-Do, I-Learn: Unsupervised Multi-User Egocentric Approach Towards Video-Based Guidance
This paper presents an unsupervised approach towards automatically extracting
video-based guidance on object usage, from egocentric video and wearable gaze
tracking, collected from multiple users while performing tasks. The approach i)
discovers task relevant objects, ii) builds a model for each, iii)
distinguishes different ways in which each discovered object has been used and
iv) discovers the dependencies between object interactions. The work
investigates using appearance, position, motion and attention, and presents
results using each feature individually and in combination. Moreover, an online
scalable approach is presented and is compared to offline results. The paper
proposes a method for selecting a suitable video guide to be displayed to a
novice user indicating how to use an object, purely triggered by the user's
gaze. The potential assistive mode can also recommend an object to be used next
based on the learnt sequence of object interactions. The approach was tested on
a variety of daily tasks such as initialising a printer, preparing a coffee and
setting up a gym machine.
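A hypothetical sketch of the gaze-triggered guide selection is below: when the user fixates on an object, retrieve the usage clip whose discovered object model best matches the attended appearance. All names are illustrative and not taken from the paper.

```python
# Illustrative gaze-triggered retrieval of a video guide (not the paper's code).
import numpy as np

def select_guide(attended_feature, object_models, guide_clips):
    """object_models: (n, d) descriptors of discovered objects;
    guide_clips: n clip identifiers, one per discovered object."""
    dists = np.linalg.norm(object_models - attended_feature, axis=1)
    return guide_clips[int(np.argmin(dists))]
```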
Self-Learning Camera: Autonomous Adaptation of Object Detectors to Unlabeled Video Streams
Learning object detectors requires massive amounts of labeled training
samples from the specific data source of interest. This is impractical when
dealing with many different sources (e.g., in camera networks), or constantly
changing ones such as mobile cameras (e.g., in robotics or driving assistant
systems). In this paper, we address the problem of self-learning detectors in
an autonomous manner, i.e. (i) detectors continuously updating themselves to
efficiently adapt to streaming data sources (contrary to transductive
algorithms), (ii) without any labeled data strongly related to the target data
stream (contrary to self-paced learning), and (iii) without manual intervention
to set and update hyper-parameters. To that end, we propose an unsupervised,
on-line, and self-tuning learning algorithm to optimize a multi-task learning
convex objective. Our method uses confident but laconic oracles (high-precision
but low-recall off-the-shelf generic detectors), and exploits the structure of
the problem to jointly learn on-line an ensemble of instance-level trackers,
from which we derive an adapted category-level object detector. Our approach is
validated on real-world, publicly available video object datasets.
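Under stated assumptions, the self-learning loop can be sketched as below: a high-precision / low-recall oracle seeds instance tracks, and confidently tracked boxes become pseudo-labels for an online detector update. Every component here is an illustrative stub, not the paper's multi-task convex formulation.

```python
# A schematic of the oracle -> trackers -> detector self-learning loop.
from dataclasses import dataclass, field

@dataclass
class Track:
    box: tuple
    score: float = 1.0
    def predict(self, frame):   # placeholder: a real tracker would relocalize
        return self.box
    def confident(self):
        return self.score > 0.5

@dataclass
class State:
    tracks: list = field(default_factory=list)

def self_learning_step(frame, oracle_detect, state, detector_update):
    # 1) Laconic oracle: few but high-precision detections seed new trackers.
    for box in oracle_detect(frame):
        state.tracks.append(Track(box))
    # 2) Instance-level trackers propagate labels across frames.
    pseudo_labels = [t.predict(frame) for t in state.tracks if t.confident()]
    # 3) Online, label-free update of the category-level detector.
    detector_update(frame, pseudo_labels)
```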
Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects
We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep
generative model for videos of moving objects. It can reliably discover and
track objects throughout the sequence of frames, and can also generate future
frames conditioning on the current frame, thereby simulating expected motion of
objects. This is achieved by explicitly encoding object presence, locations and
appearances in the latent variables of the model. SQAIR retains all strengths
of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016),
including learning in an unsupervised manner, and addresses its shortcomings.
We use a moving multi-MNIST dataset to show limitations of AIR in detecting
overlapping or partially occluded objects, and show how SQAIR overcomes them by
leveraging temporal consistency of objects. Finally, we also apply SQAIR to
real-world pedestrian CCTV data, where it learns to reliably detect, track and
generate walking pedestrians with no supervision.
Comment: 25 pages, 19 figures, NeurIPS 2018, code: https://github.com/akosiorek/sqair, video: https://youtu.be/-IUNQgSLE0
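A schematic of the per-object latent structure the abstract describes is sketched below: each object slot carries presence, location ("where") and appearance ("what") latents. This only illustrates the factorization, not SQAIR's inference or propagation networks.

```python
# Illustrative per-object latent factorization: presence, where, what.
import torch

def sample_object_latents(n_slots=3, where_dim=4, what_dim=16):
    z_pres = torch.bernoulli(torch.full((n_slots,), 0.5))  # does the object exist?
    z_where = torch.randn(n_slots, where_dim)              # position and scale
    z_what = torch.randn(n_slots, what_dim)                # appearance code
    return z_pres, z_where, z_what
```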
Unsupervised Object-Level Video Summarization with Online Motion Auto-Encoder
Unsupervised video summarization plays an important role in digesting,
browsing, and searching the ever-growing volume of videos, yet the underlying
fine-grained semantic and motion information (i.e., objects of interest and
their key motions) in online videos has barely been touched. In this paper, we
investigate a pioneering research direction: fine-grained unsupervised
object-level video summarization. It differs from existing pipelines in two
aspects: it extracts the key motions of participating objects, and it learns
to summarize in an unsupervised and online manner. To achieve this
goal, we propose a novel online motion Auto-Encoder (online motion-AE)
framework that functions on the super-segmented object motion clips.
Comprehensive experiments on a newly-collected surveillance dataset and public
datasets have demonstrated the effectiveness of our proposed method.
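One way to make the auto-encoder idea concrete is to score motion clips by reconstruction error, assuming each super-segmented clip has been encoded as a fixed-length feature vector; the architecture and feature choice below are illustrative, not the paper's model.

```python
# A minimal sketch: clips the auto-encoder reconstructs poorly are treated
# as novel key motions worth keeping in the summary.
import torch
import torch.nn as nn

class MotionAE(nn.Module):
    def __init__(self, dim=256, hidden=64):
        super().__init__()
        self.enc = nn.Linear(dim, hidden)
        self.dec = nn.Linear(hidden, dim)
    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

def key_motion_scores(model, clips):
    """clips: (n, dim) motion-clip features -> per-clip novelty scores."""
    with torch.no_grad():
        recon = model(clips)
    return ((recon - clips) ** 2).mean(dim=1)  # poorly reconstructed = novel
```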
Unsupervised Incremental Learning of Deep Descriptors From Video Streams
We present a novel unsupervised method for face identity learning from video
sequences. The method exploits the ResNet deep network for face detection and
VGGface fc7 face descriptors together with a smart learning mechanism that
exploits the temporal coherence of visual data in video streams. We present a
novel feature matching solution based on Reverse Nearest Neighbour and a
feature forgetting strategy that supports incremental learning with memory size
control, while time progresses. It is shown that the proposed learning
procedure is asymptotically stable and can be effectively applied to relevant
applications such as multiple face tracking.
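A minimal sketch of Reverse Nearest Neighbour matching follows, assuming stored identity descriptors and a batch of new descriptors from the stream: the match direction is reversed so that each stored descriptor claims its nearest incoming descriptor. The paper's forgetting strategy and memory control are omitted.

```python
# Illustrative reverse-direction matching between memory and a stream batch.
import numpy as np

def reverse_nearest_neighbour(memory: np.ndarray, batch: np.ndarray):
    """memory: (m, d) stored descriptors; batch: (n, d) new descriptors.
    Returns, for each memory item, the index of its nearest batch item."""
    dists = np.linalg.norm(memory[:, None, :] - batch[None, :, :], axis=2)
    return dists.argmin(axis=1)
```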
Self-supervised Learning for Video Correspondence Flow
The objective of this paper is self-supervised learning of feature embeddings
that are suitable for matching correspondences across video frames, which we term
correspondence flow. By leveraging the natural spatial-temporal coherence in
videos, we propose to train a "pointer" that reconstructs a target frame by
copying pixels from a reference frame.
We make the following contributions: First, we introduce a simple information
bottleneck that forces the model to learn robust features for correspondence
matching and prevents it from learning trivial solutions, e.g., matching based on
low-level colour information. Second, to tackle the challenges from tracker
drifting, due to complex object deformations, illumination changes and
occlusions, we propose to train a recursive model over long temporal windows
with scheduled sampling and cycle consistency. Third, we achieve
state-of-the-art performance on DAVIS 2017 video segmentation and JHMDB
keypoint tracking tasks, outperforming all previous self-supervised learning
approaches by a significant margin. Fourth, in order to shed light on the
potential of self-supervised learning on the task of video correspondence flow,
we probe the upper bound by training on additional data, i.e., more diverse
videos, further demonstrating significant improvements on video segmentation.
Comment: BMVC 2019 (Oral Presentation)
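The "pointer" can be illustrated as an attention-weighted copy of reference-frame pixels driven by feature similarity; the temperature value and flattened-feature layout below are assumptions for illustration, not the paper's exact formulation.

```python
# A minimal sketch of reconstructing the target frame by copying pixels
# from the reference frame via a soft attention "pointer".
import torch
import torch.nn.functional as F

def copy_pointer(feat_ref, feat_tgt, pix_ref, temperature=0.07):
    """feat_*: (hw, c) features; pix_ref: (hw, 3) reference pixels."""
    affinity = feat_tgt @ feat_ref.t() / temperature  # (hw_tgt, hw_ref)
    attn = F.softmax(affinity, dim=1)                 # soft pointer per pixel
    return attn @ pix_ref                             # reconstructed target
```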
Spatiotemporal CNN for Video Object Segmentation
In this paper, we present a unified, end-to-end trainable spatiotemporal CNN
model for VOS, which consists of two branches, i.e., the temporal coherence
branch and the spatial segmentation branch. Specifically, the temporal
coherence branch, pretrained in an adversarial fashion from unlabeled video
data, is designed to capture the dynamic appearance and motion cues of video
sequences to guide object segmentation. The spatial segmentation branch focuses
on segmenting objects accurately based on the learned appearance and motion
cues. To obtain accurate segmentation results, we design a coarse-to-fine
process to sequentially apply a designed attention module on multi-scale
feature maps, and concatenate them to produce the final prediction. In this
way, the spatial segmentation branch is enforced to gradually concentrate on
object regions. These two branches are jointly fine-tuned on video segmentation
sequences in an end-to-end manner. Several experiments are carried out on three
challenging datasets (i.e., DAVIS-2016, DAVIS-2017 and Youtube-Object) to show
that our method achieves favorable performance against state-of-the-art methods.
Code is available at https://github.com/longyin880815/STCNN.
Comment: 10 pages, 3 figures, 6 tables, CVPR 2019
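The coarse-to-fine step can be sketched as follows: gate each scale's feature map with a spatial attention, upsample to a common resolution, and concatenate for the final prediction head. The 1x1-conv gate is a simple stand-in for the paper's attention module, not its actual design.

```python
# A minimal sketch of attention over multi-scale feature maps, coarse to fine.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GateAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)
    def forward(self, x):
        return x * torch.sigmoid(self.gate(x))  # spatially re-weight features

def coarse_to_fine(feature_maps, out_size):
    """feature_maps: list of (1, c, h, w) tensors, ordered coarse to fine."""
    attended = []
    for f in feature_maps:
        a = GateAttention(f.shape[1])(f)
        attended.append(F.interpolate(a, size=out_size, mode="bilinear",
                                      align_corners=False))
    return torch.cat(attended, dim=1)  # input to the segmentation head
```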