Learning to Detect and Retrieve Objects from Unlabeled Videos
Learning an object detector or a retrieval model requires a large data set with
manual annotations. Such data sets are expensive and time consuming to create
and therefore difficult to obtain on a large scale. In this work, we propose to
exploit the natural correlation between narrations and the visual presence of
objects in video to learn an object detector and a retrieval model without any
manual labeling. We pose the problem as weakly supervised learning with noisy
labels, and propose a novel object detection paradigm under these constraints.
We handle background rejection using contrastive samples and confront
the high level of label noise with a new clustering score. Our evaluation is
based on a set of 11 manually annotated objects in over 5000 frames. We
compare against a weakly-supervised baseline and provide a strongly labeled
upper bound.
Comment: ICCV 2019 Workshop on Multi-modal Video Analysis and Moments in Time
Challenge
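To make the clustering idea concrete, here is a minimal sketch (not the paper's exact score) of a centroid-cohesion measure over clustered region proposals, assuming proposal embeddings and cluster assignments are already computed:

```python
import numpy as np

def clustering_score(embeddings, cluster_ids):
    """Score each cluster of region proposals by cohesion: proposals that
    cluster tightly under a shared narrated word are less likely to be noise.
    embeddings: (N, D) proposal features; cluster_ids: (N,) cluster labels."""
    scores = {}
    for c in np.unique(cluster_ids):
        members = embeddings[cluster_ids == c]
        centroid = members.mean(axis=0)
        # Mean cosine similarity of members to their cluster centroid
        sims = members @ centroid / (
            np.linalg.norm(members, axis=1) * np.linalg.norm(centroid) + 1e-8)
        scores[c] = float(sims.mean())
    return scores  # low-scoring clusters can be rejected as label noise
```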
Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video
We explore object discovery and detector adaptation based on unlabeled video
sequences captured from a mobile platform. We propose a fully automatic
method for object mining from video, which builds upon a generic object
tracker. By applying this method to three large video datasets from
autonomous driving and mobile robotics scenarios, we demonstrate its robustness
and generality. Based on the object mining results, we propose a novel approach
for unsupervised object discovery by appearance-based clustering. We show that
this approach successfully discovers interesting objects relevant to driving
scenarios. In addition, we perform self-supervised detector adaptation in order
to improve detection performance on the KITTI dataset for existing categories.
Our approach has direct relevance for enabling large-scale object learning for
autonomous driving.
Comment: CVPR'18 submission
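A minimal sketch of the appearance-based clustering step, assuming each mined tracklet has been summarized by precomputed CNN features (the tracker and feature extractor are outside this snippet):

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_objects(tracklet_features, n_clusters=20):
    """Cluster mined tracklets by appearance to surface recurring objects.
    tracklet_features: list of (n_frames_i, D) arrays, one per tracklet."""
    # Summarize each tracklet by its mean crop feature, then L2-normalize
    X = np.stack([f.mean(axis=0) for f in tracklet_features])
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-8
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return km.labels_  # each cluster is a candidate discovered category
```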
Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision
State-of-the-art supervised computer vision techniques are in general data
hungry. Curating their training data poses the challenges of expensive human
labeling, inadequate computing resources, and long experiment turnaround
times. Training data subset selection and active learning
techniques have been proposed as possible solutions to these challenges. A
special class of subset selection functions naturally models notions of
diversity, coverage, and representation, and can be used to eliminate
redundancy, lending itself well to training-data subset selection. They can also
help improve the efficiency of active learning in further reducing human
labeling efforts by selecting a subset of the examples obtained using the
conventional uncertainty sampling based techniques. In this work, we
empirically demonstrate the effectiveness of two diversity models, namely the
Facility-Location and Dispersion models for training-data subset selection and
reducing labeling effort. We demonstrate this across the board for a variety of
computer vision tasks including Gender Recognition, Face Recognition, Scene
Recognition, Object Detection and Object Recognition. Our results show that
diversity-based subset selection, done in the right way, can increase
accuracy by up to 5-10% over existing baselines, particularly in settings
where less training data is available. This allows the training of complex
machine learning models like Convolutional Neural Networks with much less
training data and labeling costs while incurring minimal performance loss.
Comment: Accepted to WACV 2019. arXiv admin note: substantial text overlap
with arXiv:1805.1119
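The Facility-Location objective is submodular, so a simple greedy maximizer enjoys the usual (1 - 1/e) approximation guarantee. A minimal sketch, assuming a precomputed pairwise similarity matrix:

```python
import numpy as np

def facility_location_greedy(sim, k):
    """Greedily maximize f(S) = sum_i max_{j in S} sim[i, j].
    sim: (N, N) nonnegative pairwise similarities; returns k selected indices."""
    n = sim.shape[0]
    selected = []
    best = np.zeros(n)  # best[i] = similarity of item i to its nearest selection
    for _ in range(k):
        # Marginal gain of adding each candidate j: f(S + {j}) - f(S)
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[selected] = -np.inf  # never reselect
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected
```

The selected subset covers the dataset in feature space, which is the notion of representation the abstract refers to.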
Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos
Human behavior understanding in videos is a complex, still unsolved problem
that requires accurately modeling motion at both the local (pixel-wise dense
prediction) and global (aggregation of motion cues) levels. Current approaches
based on supervised learning require large amounts of annotated data, whose
scarce availability is one of the main limiting factors to the development of
general solutions. Unsupervised learning can instead leverage the vast amount
of videos available on the web and it is a promising solution for overcoming
the existing limitations. In this paper, we propose an adversarial GAN-based
framework that learns video representations and dynamics through a
self-supervision mechanism in order to perform dense and global prediction in
videos. Our approach synthesizes videos by 1) factorizing the process into the
generation of static visual content and motion, 2) learning a suitable
representation of a motion latent space in order to enforce spatio-temporal
coherency of object trajectories, and 3) incorporating motion estimation and
pixel-wise dense prediction into the training procedure. Self-supervision is
enforced by using motion masks produced by the generator, as a co-product of
its generation process, to supervise the discriminator network in performing
dense prediction. Performance evaluation, carried out on standard benchmarks,
shows that our approach is able to learn, in an unsupervised way, both local
and global video dynamics. The learned representations then support the
training of video object segmentation methods with considerably fewer (about
50%) annotations, giving performance comparable to the state of the art.
Furthermore, the proposed method achieves promising performance in generating
realistic videos, outperforming state-of-the-art approaches especially on
motion-related metrics.
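A schematic PyTorch training step showing the self-supervision mechanism described above, with toy stand-in networks (the real architectures factorize content and motion far more elaborately): the motion mask the generator emits as a co-product becomes the dense target for the discriminator.

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Emits a fake frame plus a motion mask as a co-product of generation."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.ConvTranspose2d(z_dim, 16, 8), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 4, 9))  # 3 RGB + 1 mask
    def forward(self, z):
        out = self.net(z)
        return out[:, :3], torch.sigmoid(out[:, 3:])

class ToyDiscriminator(nn.Module):
    """Per-pixel (dense) prediction head instead of a single real/fake score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 1))
    def forward(self, x):
        return self.net(x)

G, D = ToyGenerator(), ToyDiscriminator()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

z = torch.randn(2, 64, 1, 1)
fake_frame, motion_mask = G(z)
# Self-supervision: the generator's motion mask supervises the
# discriminator's dense prediction on the generated frame.
opt_d.zero_grad()
loss_d = bce(D(fake_frame.detach()), motion_mask.detach())
loss_d.backward()
opt_d.step()
```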
Multimodal Co-Training for Selecting Good Examples from Webly Labeled Video
We tackle the problem of learning concept classifiers from videos on the web
without using manually labeled data. Although metadata attached to videos
(e.g., video titles, descriptions) can help in collecting training data for
the target concept, the collected data is often very noisy. The main challenge
is therefore how to select good examples from noisy training data. Previous
approaches first learn easy examples that are unlikely to be noisy and then
gradually learn more complex examples. However, hard examples that differ
greatly from easy ones are never learned. In this paper, we propose an
approach called multimodal co-training (MMCo) for selecting good examples from
noisy training data. MMCo jointly learns classifiers for multiple modalities
that complement each other to select good examples. Since MMCo selects examples
by consensus of multimodal classifiers, a hard example for one modality can
still be used as a training example by exploiting the power of the other
modalities. The algorithm is very simple and easy to implement, yet yields
consistent and significant boosts in example selection and classification
performance on the FCVID and YouTube8M benchmarks.
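A minimal sketch of the selection rule this implies (illustrative, not the paper's exact criterion): each modality's classifier nominates the webly-labeled examples it is confident about, so an example that is hard for one modality can still enter the training set through another.

```python
import numpy as np

def cotrain_select(probs_by_modality, threshold=0.9):
    """probs_by_modality: list of (N,) arrays, each modality classifier's
    predicted probability that an example truly shows the target concept.
    Returns indices selected by at least one confident modality."""
    picked = np.zeros_like(probs_by_modality[0], dtype=bool)
    for probs in probs_by_modality:
        picked |= probs >= threshold  # union of per-modality nominations
    return np.where(picked)[0]
```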
Self-Training for Domain Adaptive Scene Text Detection
Though deep learning based scene text detection has achieved great progress,
well-trained detectors suffer severe performance degradation when applied to new
domains. In general, a tremendous amount of data is indispensable to train the
detector in the target domain. However, data collection and annotation are
expensive and time-consuming. To address this problem, we propose a
self-training framework to automatically mine hard examples with pseudo-labels
from unannotated videos or images. To reduce the noise of hard examples, a
novel text mining module is implemented based on the fusion of detection and
tracking results. Then, an image-to-video generation method is designed for
tasks where videos are unavailable and only images can be used. Experimental
results on standard benchmarks, including ICDAR2015, MSRA-TD500, ICDAR2017 MLT,
demonstrate the effectiveness of our self-training method. A simple Mask
R-CNN adapted with self-training and fine-tuned on real data achieves results
comparable or even superior to state-of-the-art methods.
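A minimal sketch of the detection-tracking fusion for pseudo-label mining, under an assumed tracklet format (the mining module in the paper fuses detector and tracker outputs; here a simple track-length filter stands in for it):

```python
def mine_pseudo_labels(tracks, min_track_len=5):
    """Keep a detected text box as a pseudo-label only when a tracker confirms
    it over several frames; isolated single-frame detections are treated as noise.
    tracks: list of tracklets, each a list of (frame_id, box) tuples."""
    pseudo_labels = {}
    for track in tracks:
        if len(track) < min_track_len:
            continue  # short tracks are likely spurious detections
        for frame_id, box in track:
            pseudo_labels.setdefault(frame_id, []).append(box)
    return pseudo_labels  # frame_id -> list of confirmed boxes
```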
Learning Spatiotemporal Features via Video and Text Pair Discrimination
Current video representations rely heavily on learning from manually
annotated video datasets, which are time-consuming and expensive to acquire. We
observe videos are naturally accompanied by abundant text information such as
YouTube titles and Instagram captions. In this paper, we leverage this
visual-textual connection to learn spatiotemporal features in an efficient
weakly-supervised manner. We present a general cross-modal pair discrimination
(CPD) framework to capture this correlation between a video and its associated
text. Specifically, we adopt noise-contrastive estimation to tackle the
computational issue imposed by the huge number of pair instance classes and
design a practical curriculum learning strategy. We train our CPD models on
both a standard video dataset (Kinetics-210k) and an uncurated web video
dataset (Instagram-300k) to demonstrate their effectiveness. Without further fine-tuning,
the learnt models obtain competitive results for action classification on
Kinetics under the linear classification protocol. Moreover, our visual model
provides an effective initialization to fine-tune on downstream tasks, which
yields a remarkable performance gain for action recognition on UCF101 and
HMDB51, compared with the existing state-of-the-art self-supervised training
methods. In addition, our CPD model yields a new state of the art for zero-shot
action recognition on UCF101 by directly utilizing the learnt visual-textual
embeddings. The code will be made available at
https://github.com/MCG-NJU/CPD-Video.
Comment: Technical Report
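The pair-discrimination objective can be sketched with the now-standard InfoNCE form of noise-contrastive estimation, where the other pairs in a batch act as noise samples (a simplification of the per-pair-class estimation the abstract describes), assuming L2-normalized embeddings:

```python
import torch
import torch.nn.functional as F

def pair_discrimination_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) L2-normalized embeddings of paired
    clips and their titles/captions; positives sit on the diagonal."""
    logits = video_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Each video must pick out its own text among B candidates (and vice versa)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```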
Improving Object Detection with Selective Self-supervised Self-training
We study how to leverage Web images to augment human-curated object detection
datasets. Our approach is two-pronged. On the one hand, we retrieve Web images
by image-to-image search, which incurs less domain shift from the curated data
than other search methods. The Web images are diverse, supplying a wide
variety of object poses, appearances, and interactions with context. On the
other hand, we propose a novel learning method motivated by two parallel lines
of work that explore unlabeled data for image classification: self-training and
self-supervised learning. They fail to improve object detectors in their
vanilla forms due to the domain gap between the Web images and curated
datasets. To tackle this challenge, we propose a selective net to rectify the
supervision signals in Web images. It not only identifies positive bounding
boxes but also creates a safe zone for mining hard negative boxes. We report
state-of-the-art results on detecting backpacks and chairs from everyday
scenes, along with other challenging object classes.
Comment: Accepted to ECCV 2020
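A minimal sketch of the rectification idea: trusted positives, an ignore band (the "safe zone"), and hard-negative candidates. The paper learns this split with a selective net; fixed, illustrative thresholds stand in for it here.

```python
def rectify_boxes(boxes, scores, pos_thresh=0.8, ignore_thresh=0.4):
    """Split pseudo-detections on Web images into trusted positives, an
    ignored uncertainty band, and candidates for hard-negative mining."""
    positives, ignored, hard_negatives = [], [], []
    for box, score in zip(boxes, scores):
        if score >= pos_thresh:
            positives.append(box)       # trusted supervision signal
        elif score >= ignore_thresh:
            ignored.append(box)         # safe zone: excluded from the loss
        else:
            hard_negatives.append(box)  # confidently background: mine as negatives
    return positives, ignored, hard_negatives
```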
Long and Short Memory Balancing in Visual Co-Tracking using Q-Learning
Employing one or more additional classifiers to break the self-learning loop
in tracking-by-detection has gained considerable attention. Most such
trackers merely utilize the redundancy to address the accumulating label error
in the tracking loop, and suffer from high computational complexity as well as
tracking challenges that may interrupt all classifiers (e.g., temporal
occlusions). We propose the active co-tracking framework, in which the main
classifier of the tracker labels samples of the video sequence and consults
the auxiliary classifier only when it is uncertain. Based on the source of the
uncertainty and the differences between the two classifiers (e.g., accuracy,
speed, update frequency), different policies should be adopted to exchange
information between them. Here, we introduce a reinforcement
learning approach to find the appropriate policy by considering the state of
the tracker in a specific sequence. The proposed method yields promising
results in comparison to the best tracking-by-detection approaches.
Comment: Submitted to ICIP 2019
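A tabular Q-learning sketch of such a consult policy, under assumed discretizations: the state is the tracker's binned uncertainty, and the two actions are "trust the main classifier" versus "pay the cost of consulting the auxiliary one". The environment callbacks are hypothetical placeholders, not the paper's formulation.

```python
import numpy as np

N_STATES, N_ACTIONS = 10, 2  # binned uncertainty levels; {trust, consult}
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def q_step(state, reward_fn, transition_fn):
    """One epsilon-greedy Q-learning step; reward_fn and transition_fn are
    placeholders for the tracking environment (e.g., labeling correctness
    minus a consultation cost)."""
    action = rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[state].argmax())
    reward = reward_fn(state, action)
    nxt = transition_fn(state, action)
    # Standard temporal-difference update toward the greedy target
    Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
    return nxt
```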
Multigrid Predictive Filter Flow for Unsupervised Learning on Videos
We introduce multigrid Predictive Filter Flow (mgPFF), a framework for
unsupervised learning on videos. The mgPFF takes as input a pair of frames and
outputs per-pixel filters to warp one frame to the other. Compared to optical
flow used for warping frames, mgPFF is more powerful in modeling sub-pixel
movement and dealing with corruption (e.g., motion blur). We develop a
multigrid coarse-to-fine modeling strategy that avoids the requirement of
learning large filters to capture large displacement. This allows us to train
an extremely compact model (4.6MB) which operates in a progressive way over
multiple resolutions with shared weights. We train mgPFF on unsupervised,
free-form videos and show that mgPFF not only estimates long-range flow for
frame reconstruction and detects video shot transitions, but is also readily
amenable to video object segmentation and pose tracking, where it
substantially outperforms the published state-of-the-art without bells and
whistles. Moreover, owing to mgPFF's nature of per-pixel filter prediction, we
have the unique opportunity to visualize how each pixel evolves while
solving these tasks, thus gaining better interpretability.
Comment: webpage (https://www.ics.uci.edu/~skong2/mgpff.html)