Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos
Human behavior understanding in videos is a complex, still unsolved problem
that requires accurately modeling motion at both the local (pixel-wise dense
prediction) and global (aggregation of motion cues) levels. Current approaches
based on supervised learning require large amounts of annotated data, whose
scarce availability is one of the main limiting factors to the development of
general solutions. Unsupervised learning can instead leverage the vast amount
of videos available on the web and it is a promising solution for overcoming
the existing limitations. In this paper, we propose an adversarial GAN-based
framework that learns video representations and dynamics through a
self-supervision mechanism in order to perform dense and global prediction in
videos. Our approach synthesizes videos by 1) factorizing the process into the
generation of static visual content and motion, 2) learning a suitable
representation of a motion latent space in order to enforce spatio-temporal
coherency of object trajectories, and 3) incorporating motion estimation and
pixel-wise dense prediction into the training procedure. Self-supervision is
enforced by using motion masks produced by the generator, as a by-product of
its generation process, to supervise the discriminator network in performing
dense prediction. Performance evaluation, carried out on standard benchmarks,
shows that our approach is able to learn, in an unsupervised way, both local
and global video dynamics. The learned representations, in turn, support the
training of video object segmentation methods with considerably fewer (about
50%) annotations, yielding performance comparable to the state of the art.
Furthermore, the proposed method achieves promising performance in generating
realistic videos, outperforming state-of-the-art approaches, especially on
motion-related metrics.
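As a rough illustration of the factorized generation pipeline this abstract describes, here is a minimal PyTorch sketch; the module names, sizes, and GRU-based motion model are hypothetical, not the paper's architecture. One static content latent is shared across the clip, a recurrent network makes the per-frame motion latents temporally coherent, and the generator emits a motion mask alongside each frame.

```python
# Minimal sketch (not the authors' code) of a factorized video GAN.
import torch
import torch.nn as nn

class FactorizedGenerator(nn.Module):
    def __init__(self, z_content=128, z_motion=32, hidden=256, frames=16):
        super().__init__()
        self.zc, self.zm, self.frames = z_content, z_motion, frames
        # Recurrence over motion latents enforces temporally coherent trajectories.
        self.motion_rnn = nn.GRU(z_motion, z_motion, batch_first=True)
        self.decode = nn.Sequential(         # maps (content, motion) to a frame
            nn.Linear(z_content + z_motion, hidden), nn.ReLU(),
            nn.Linear(hidden, 64 * 64 * 4),  # 3 RGB channels + 1 motion-mask channel
        )

    def forward(self, batch):
        z_c = torch.randn(batch, self.zc)                  # static visual content
        z_m = torch.randn(batch, self.frames, self.zm)     # per-frame motion noise
        z_m, _ = self.motion_rnn(z_m)                      # coherent motion latents
        z = torch.cat([z_c.unsqueeze(1).expand(-1, self.frames, -1), z_m], -1)
        out = self.decode(z).view(batch, self.frames, 4, 64, 64)
        frames, masks = out[:, :, :3].tanh(), out[:, :, 3:].sigmoid()
        return frames, masks  # masks are the by-product used for self-supervision

# The discriminator (omitted) would score realism and also predict a dense
# motion mask per frame, trained against the generator's own masks.
```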
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Large-scale labeled data are generally required to train deep neural networks
to achieve strong performance in visual feature learning from images or videos
for computer vision applications. To avoid the extensive cost of collecting and
annotating large-scale datasets, self-supervised learning methods, a subset of
unsupervised learning, have been proposed to learn general image and video
features from large-scale unlabeled data without any human-annotated labels.
This paper provides an extensive review of deep
learning-based self-supervised general visual feature learning methods from
images or videos. First, the motivation, general pipeline, and terminologies of
this field are described. Then, the common deep neural network architectures
used for self-supervised learning are summarized. Next, the main
components and evaluation metrics of self-supervised learning methods are
reviewed followed by the commonly used image and video datasets and the
existing self-supervised visual feature learning methods. Finally, quantitative
performance comparisons of the reviewed methods on benchmark datasets are
summarized and discussed for both image and video feature learning. The paper
concludes with a set of promising future directions for self-supervised visual
feature learning.
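To make the "general pipeline" mentioned above concrete, the following is a minimal, illustrative PyTorch sketch of the two standard stages such surveys review: pretext pretraining on automatically generated labels (rotation prediction is used here purely as one common example), followed by frozen-backbone linear evaluation. The tiny network and toy data are placeholders.

```python
# Sketch of the generic self-supervised pipeline: (1) pretrain on a pretext
# task whose labels are free, then (2) freeze the backbone and fit a probe.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                         nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
pretext_head = nn.Linear(64, 4)  # predict rotation in {0, 90, 180, 270}

def pretext_batch(images):
    # Rotate each image by a random multiple of 90 degrees; the chosen
    # multiple is the label, obtained without any human annotation.
    k = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(im, int(r), dims=(1, 2))
                           for im, r in zip(images, k)])
    return rotated, k

opt = torch.optim.Adam(list(backbone.parameters()) + list(pretext_head.parameters()))
for _ in range(10):  # pretext pretraining loop (toy scale)
    x, y = pretext_batch(torch.rand(8, 3, 32, 32))
    loss = nn.functional.cross_entropy(pretext_head(backbone(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Downstream evaluation: freeze the features, train only a linear classifier.
for p in backbone.parameters():
    p.requires_grad = False
probe = nn.Linear(64, 10)
```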
Review of Visual Saliency Detection with Comprehensive Information
Visual saliency detection models simulate the human visual system's perception
of a scene and have been widely used in many vision tasks. With the development
of acquisition technology, more comprehensive information, such as depth cues,
inter-image correspondences, or temporal relationships, has become available,
extending image saliency detection to RGBD saliency detection, co-saliency
detection, and video saliency detection. RGBD saliency detection models focus
on extracting salient regions from RGBD images by incorporating depth
information. Co-saliency detection models introduce an inter-image
correspondence constraint to discover the common salient object in an image
group. Video saliency detection models aim to locate motion-related salient
objects in video sequences, considering motion cues and spatiotemporal
constraints jointly. In this paper, we review different types of saliency
detection algorithms, summarize the important issues of the existing methods,
and discuss the open problems and future work. Moreover, the evaluation
datasets and quantitative measurements are briefly introduced, and an
experimental analysis and discussion are conducted to provide a holistic
overview of different saliency detection methods.
Comment: 18 pages, 11 figures, 7 tables. Accepted by IEEE Transactions on
Circuits and Systems for Video Technology, 2018. https://rmcong.github.io
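As a concrete reference for the quantitative measurements such reviews typically introduce, here is a small NumPy sketch of two standard saliency metrics, MAE and the F-measure with the conventional beta^2 = 0.3 weighting; the adaptive-threshold choice below is one common convention, not prescribed by this particular review.

```python
# Two common saliency metrics: mean absolute error (lower is better) and
# F-measure with beta^2 = 0.3, which favors precision over recall.
import numpy as np

def mae(saliency, gt):
    # Both maps are assumed normalized to [0, 1].
    return np.abs(saliency - gt).mean()

def f_measure(saliency, gt, beta2=0.3):
    # Binarize with an adaptive threshold (twice the mean is a common choice).
    binary = saliency >= min(2 * saliency.mean(), 1.0)
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    return ((1 + beta2) * precision * recall /
            max(beta2 * precision + recall, 1e-8))

pred = np.random.rand(64, 64)                    # toy saliency map
mask = (np.random.rand(64, 64) > 0.8).astype(float)
print(mae(pred, mask), f_measure(pred, mask))
```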
Learning to Align Multi-Camera Domains using Part-Aware Clustering for Unsupervised Video Person Re-Identification
Most video person re-identification (re-ID) methods are mainly based on
supervised learning, which requires cross-camera ID labeling. Since the cost of
labeling increases dramatically as the number of cameras increases, it is
difficult to apply the re-identification algorithm to a large camera network.
In this paper, we address the scalability issue by presenting deep
representation learning without ID information across multiple cameras.
Technically, we train neural networks to generate both ID-discriminative and
camera-invariant features. To achieve the ID discrimination ability of the
embedding features, we maximize feature distances between different person IDs
within a camera by using a metric learning approach. At the same time,
considering each camera as a different domain, we apply adversarial learning
across multiple camera domains for generating camera-invariant features. We
also propose a part-aware adaptation module, which effectively performs
multi-camera domain invariant feature learning in different spatial regions. We
carry out comprehensive experiments on three public re-ID datasets (i.e.,
PRID-2011, iLIDS-VID, and MARS). Our method outperforms state-of-the-art
methods by a large margin of about 20% in terms of rank-1 accuracy on the
large-scale MARS dataset.
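A minimal PyTorch sketch of the two training signals described above, with hypothetical shapes and names rather than the paper's actual networks: a hinge-style metric loss that pushes apart features of different tracklets within the same camera, and a camera classifier trained through a gradient-reversal layer so the backbone is driven toward camera-invariant features.

```python
# Sketch of intra-camera metric learning + camera-adversarial training.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad  # flip gradients so the backbone confuses the camera classifier

backbone = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 128))
camera_clf = nn.Linear(128, 6)  # e.g., 6 cameras in the network

def losses(feats_a, feats_b, cam_labels, margin=1.0):
    # feats_a / feats_b: features of two different tracklets seen by the same
    # camera; no cross-camera ID labels are used anywhere.
    fa, fb = backbone(feats_a), backbone(feats_b)
    # Hinge loss: push different-ID features apart up to the margin.
    separation = torch.relu(margin - (fa - fb).norm(dim=1)).mean()
    # Adversarial camera loss via gradient reversal.
    cam_logits = camera_clf(GradReverse.apply(fa))
    adversarial = nn.functional.cross_entropy(cam_logits, cam_labels)
    return separation + adversarial
```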
Learning Correspondence from the Cycle-Consistency of Time
We introduce a self-supervised method for learning visual correspondence from
unlabeled video. The main idea is to use cycle-consistency in time as a free
supervisory signal for learning visual representations from scratch. At
training time, our model learns a feature map representation to be useful for
performing cycle-consistent tracking. At test time, we use the acquired
representation to find nearest neighbors across space and time. We demonstrate
the generalizability of the representation -- without finetuning -- across a
range of visual correspondence tasks, including video object segmentation,
keypoint tracking, and optical flow. Our approach outperforms previous
self-supervised methods and performs competitively with strongly supervised
methods.
Comment: CVPR 2019 Oral. Project page: http://ajabri.github.io/timecycl
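The cycle-consistency signal can be sketched in a few lines of PyTorch; everything below (the soft nearest-neighbor tracker, the toy start location) is an illustrative simplification of the idea rather than the authors' implementation: track a feature forward through time, track it back, and penalize how far it drifts from where it started.

```python
# Cycle-consistency of time as a self-supervised loss.
import torch

def soft_track(query, frame_feats):
    # query: (C,) feature; frame_feats: (C, H, W) feature map of one frame.
    C, H, W = frame_feats.shape
    sims = (frame_feats.view(C, -1) * query[:, None]).sum(0)
    attn = sims.softmax(dim=0)              # soft localization over H*W positions
    return frame_feats.view(C, -1) @ attn   # attended feature at the match

def cycle_loss(feats):
    # feats: list of (C, H, W) feature maps over time, from the learned encoder.
    start = feats[0][:, 4, 4]               # toy start location
    q = start
    for f in feats[1:]:                     # track forward in time
        q = soft_track(q, f)
    for f in reversed(feats[:-1]):          # track backward in time
        q = soft_track(q, f)
    return (q - start).pow(2).sum()         # should return to where it began
```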
Fast User-Guided Video Object Segmentation by Interaction-and-Propagation Networks
We present a deep learning method for interactive video object
segmentation. Our method is built upon two core operations, interaction and
propagation, and each operation is conducted by Convolutional Neural Networks.
The two networks are connected both internally and externally so that the
networks are trained jointly and interact with each other to solve the complex
video object segmentation problem. We propose a new multi-round training scheme
for interactive video object segmentation so that the networks learn to
understand the user's intention and correct erroneous estimations during
training. At test time, our method produces high-quality results and
also runs fast enough to work with users interactively. We evaluated the
proposed method quantitatively on the interactive track benchmark at the DAVIS
Challenge 2018. We outperformed other competing methods by a significant margin
in both speed and accuracy. We also demonstrated that our method works well
with real user interactions.
Comment: CVPR 2019
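A schematic PyTorch sketch of the interaction-and-propagation loop described above; the one-layer "networks" and the channel layout (frame + previous mask + two scribble channels) are placeholders standing in for the paper's CNNs.

```python
# Schematic interaction-and-propagation round with placeholder networks.
import torch
import torch.nn as nn

interaction_net = nn.Conv2d(3 + 1 + 2, 1, 3, padding=1)  # frame + prev mask + scribbles
propagation_net = nn.Conv2d(3 + 1, 1, 3, padding=1)      # frame + neighbor mask

def interactive_round(frames, masks, t, scribbles):
    # Refine the user-annotated frame t (scribbles: positive/negative strokes),
    # then propagate the corrected mask to the rest of the sequence.
    x = torch.cat([frames[t], masks[t], scribbles], dim=0)
    masks[t] = torch.sigmoid(interaction_net(x.unsqueeze(0)))[0]
    for s in range(t + 1, len(frames)):                   # propagate forward
        y = torch.cat([frames[s], masks[s - 1]], dim=0)
        masks[s] = torch.sigmoid(propagation_net(y.unsqueeze(0)))[0]
    for s in range(t - 1, -1, -1):                        # propagate backward
        y = torch.cat([frames[s], masks[s + 1]], dim=0)
        masks[s] = torch.sigmoid(propagation_net(y.unsqueeze(0)))[0]
    return masks

# Each user round calls interactive_round again on the worst frame,
# mirroring the multi-round scheme used during training.
```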
Word Searching in Scene Image and Video Frame in Multi-Script Scenario using Dynamic Shape Coding
Retrieval of text information from natural scene images and video frames is a
challenging task due to inherent problems such as complex character shapes,
low resolution, and background noise. Available OCR systems often fail to
retrieve such information in scene/video frames. Keyword spotting, an
alternative way to retrieve information, performs efficient text searching in
such scenarios. However, current word spotting techniques in scene/video images
are script-specific and they are mainly developed for Latin script. This paper
presents a novel word spotting framework using dynamic shape coding for text
retrieval in natural scene images and video frames. The framework is designed
to search for a query keyword across multiple scripts with the help of
on-the-fly keyword generation for the corresponding script. We have used a
two-stage word spotting approach using Hidden Markov Model (HMM) to detect the
translated keyword in a given text line by identifying the script of the line.
A novel unsupervised dynamic shape coding based scheme has been used to group
similar shape characters to avoid confusion and to improve text alignment.
Next, the hypotheses locations are verified to improve retrieval performance.
To evaluate the proposed system for searching keywords in natural scene images
and video frames, we have considered two popular Indic scripts, Bangla
(Bengali) and Devanagari, along with English. Inspired by the zone-wise
recognition approach in Indic scripts [1], zone-wise text information has been
used to improve traditional word spotting performance in Indic scripts. For
our experiment, a dataset consisting of scene images and video frames in the
English, Bangla, and Devanagari scripts was considered. The obtained results
show the effectiveness of our proposed word spotting approach.
Comment: Multimedia Tools and Applications, Springer
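The dynamic shape coding idea can be illustrated with a toy Python example; the character-to-shape-class grouping below is invented for illustration (the paper learns such groupings in an unsupervised manner), and the stage-1 script identification step is omitted.

```python
# Toy shape coding: confusable glyphs collapse to one shape class, so a
# query keyword is matched on coded labels rather than exact characters.
SHAPE_CLASS = {
    'o': 'S1', '0': 'S1', 'O': 'S1',   # round shapes grouped together
    'l': 'S2', '1': 'S2', 'I': 'S2',   # tall thin strokes grouped together
    'm': 'S3', 'n': 'S3',
}

def shape_code(word):
    # Replace each confusable character with its shape-class label.
    return ''.join(SHAPE_CLASS.get(ch, ch) for ch in word)

def spot(query, recognized_line_words):
    # Mimics only the verification stage: match candidates on shape codes.
    target = shape_code(query)
    return [w for w in recognized_line_words if shape_code(w) == target]

print(spot("cool", ["co0l", "coal", "c0ol"]))  # -> ['co0l', 'c0ol']
```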
A Review of Co-saliency Detection Technique: Fundamentals, Applications, and Challenges
Co-saliency detection is a newly emerging and rapidly growing research area
in the computer vision community. As a novel branch of visual saliency, co-saliency
detection refers to the discovery of common and salient foregrounds from two or
more relevant images, and can be widely used in many computer vision tasks. The
existing co-saliency detection algorithms mainly consist of three components:
extracting effective features to represent the image regions, exploring the
informative cues or factors to characterize co-saliency, and designing
effective computational frameworks to formulate co-saliency. Although numerous
methods have been developed, the literature still lacks a deep review and
evaluation of co-saliency detection techniques. In this paper, we aim at
providing a comprehensive review of the fundamentals, challenges, and
applications of co-saliency detection. Specifically, we provide an overview of
some related computer vision works, review the history of co-saliency
detection, summarize and categorize the major algorithms in this research area,
discuss some open issues in this area, present the potential applications of
co-saliency detection, and finally point out some unsolved challenges and
promising future work. We expect this review to benefit both new and senior
researchers in this field, and to give researchers in other related areas
insight into the utility of co-saliency detection algorithms.
Comment: 28 pages, 12 figures, 3 tables
Unseen Object Segmentation in Videos via Transferable Representations
In order to learn object segmentation models in videos, conventional methods
require a large number of pixel-wise ground-truth annotations. However,
collecting such supervised data is time-consuming and labor-intensive. In this
paper, we exploit existing annotations in source images and transfer such
visual information to segment videos with unseen object categories. Without
using any annotations in the target video, we propose a method to jointly mine
useful segments and learn feature representations that better adapt to the
target frames. The entire process is decomposed into two tasks: 1) solving a
submodular function for selecting object-like segments, and 2) learning a CNN
model with a transferable module for adapting seen categories in the source
domain to the unseen target video. We present an iterative update scheme
between two tasks to self-learn the final solution for object segmentation.
Experimental results on numerous benchmark datasets show that the proposed
method performs favorably against the state-of-the-art algorithms.
Comment: Accepted in ACCV'18 (oral). Code is available at
https://github.com/wenz116/TransferSe
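The first task, selecting object-like segments with a submodular objective, can be sketched with a facility-location-style function and the standard greedy maximizer; the similarity matrix here is a stand-in for the paper's actual objective terms.

```python
# Greedy maximization of a facility-location-style submodular objective
# for picking representative, object-like segment proposals.
import numpy as np

def greedy_submodular_select(similarity, k):
    # similarity[i, j]: how well proposal j "represents" proposal i.
    n = similarity.shape[0]
    selected = []
    covered = np.zeros(n)                   # best coverage of each item so far
    for _ in range(k):
        best, best_gain = None, -1.0
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding proposal j to the selected set.
            gain = np.maximum(covered, similarity[:, j]).sum() - covered.sum()
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
        covered = np.maximum(covered, similarity[:, best])
    return selected

feats = np.random.rand(20, 8)               # toy descriptors for 20 proposals
sim = feats @ feats.T                       # stand-in similarity matrix
print(greedy_submodular_select(sim, k=3))
```

The greedy rule enjoys the usual (1 - 1/e) approximation guarantee for monotone submodular objectives, which is why it is a common choice for this kind of selection step.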
A survey on trajectory clustering analysis
This paper comprehensively surveys the development of trajectory clustering.
Considering the critical role of trajectory data mining in modern intelligent
systems for surveillance security, abnormal behavior detection, crowd behavior
analysis, and traffic control, trajectory clustering has attracted growing
attention. Existing trajectory clustering methods can be grouped into three
categories: unsupervised, supervised, and semi-supervised algorithms. Despite
considerable development, the success of trajectory clustering remains limited
by complex conditions such as application scenarios and data dimensionality.
This paper provides a holistic understanding of and deep insight into
trajectory clustering, and presents a comprehensive analysis of representative
methods and promising future directions.
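For readers new to the area, here is a minimal, generic example of unsupervised trajectory clustering (not drawn from any specific surveyed method): trajectories are resampled to a fixed length and clustered with DBSCAN on Euclidean distance; real systems substitute richer distances such as Hausdorff or DTW and richer features.

```python
# Minimal unsupervised trajectory clustering example.
import numpy as np
from sklearn.cluster import DBSCAN

def resample(traj, n=16):
    # traj: (m, 2) array of (x, y) points; linear re-interpolation to n points.
    t = np.linspace(0, 1, len(traj))
    ti = np.linspace(0, 1, n)
    return np.stack([np.interp(ti, t, traj[:, 0]),
                     np.interp(ti, t, traj[:, 1])], axis=1)

# Toy data: 40 random-walk trajectories of varying length.
trajectories = [np.cumsum(np.random.randn(np.random.randint(10, 30), 2), 0)
                for _ in range(40)]
X = np.stack([resample(tr).ravel() for tr in trajectories])
labels = DBSCAN(eps=5.0, min_samples=3).fit_predict(X)
print(labels)  # -1 marks outlier trajectories, useful for abnormality detection
```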