Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos
Human behavior understanding in videos is a complex, still unsolved problem
that requires accurately modeling motion at both the local (pixel-wise dense
prediction) and global (aggregation of motion cues) levels. Current approaches
based on supervised learning require large amounts of annotated data, whose
scarce availability is one of the main limiting factors to the development of
general solutions. Unsupervised learning can instead leverage the vast amount
of videos available on the web and it is a promising solution for overcoming
the existing limitations. In this paper, we propose an adversarial GAN-based
framework that learns video representations and dynamics through a
self-supervision mechanism in order to perform dense and global prediction in
videos. Our approach synthesizes videos by 1) factorizing the process into the
generation of static visual content and motion, 2) learning a suitable
representation of a motion latent space in order to enforce spatio-temporal
coherency of object trajectories, and 3) incorporating motion estimation and
pixel-wise dense prediction into the training procedure. Self-supervision is
enforced by using motion masks produced by the generator, as a by-product of
its generation process, to supervise the discriminator network in performing
dense prediction. Performance evaluation, carried out on standard benchmarks,
shows that our approach is able to learn, in an unsupervised way, both local
and global video dynamics. The learned representations then support the
training of video object segmentation methods with considerably fewer (about
50%) annotations, yielding performance comparable to the state of the art.
Furthermore, the proposed method achieves promising performance in generating
realistic videos, outperforming state-of-the-art approaches especially on
motion-related metrics.
Semi-Supervised Domain Adaptation for Weakly Labeled Semantic Video Object Segmentation
Deep convolutional neural networks (CNNs) have been immensely successful in
many high-level computer vision tasks given large labeled datasets. However,
for video semantic object segmentation, a domain where labels are scarce,
effectively exploiting the representation power of CNNs with limited training
data remains a challenge. Simply borrowing an existing pretrained CNN image
recognition model for the video segmentation task can severely hurt performance.
We propose a semi-supervised approach to adapting a CNN image recognition model
trained on labeled image data to the target domain, exploiting both the semantic
evidence learned by the CNN and the intrinsic structures of video data. By
explicitly modeling and compensating for the domain shift from the source
domain to the target domain, this proposed approach underpins a robust semantic
object segmentation method against the changes in appearance, shape and
occlusion in natural videos. We present extensive experiments on challenging
datasets that demonstrate the superior performance of our approach compared
with the state-of-the-art methods.
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
This paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers from
several conferences and journals, including CVPR, ICCV, ECCV, NIPS, PAMI, and IJCV.
A survey on trajectory clustering analysis
This paper comprehensively surveys the development of trajectory clustering.
Considering the critical role of trajectory data mining in modern intelligent
systems for surveillance security, abnormal behavior detection, crowd behavior
analysis, and traffic control, trajectory clustering has attracted growing
attention. Existing trajectory clustering methods can be grouped into three
categories: unsupervised, supervised, and semi-supervised algorithms. Despite
this progress, the success of trajectory clustering remains limited by complex
conditions such as application scenarios and data dimensionality. This paper
provides a holistic understanding of and deep insight into trajectory
clustering, and presents a comprehensive analysis of representative methods
and promising future directions.
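As a minimal illustration of one common unsupervised approach the survey covers (not code from the survey itself; all names are hypothetical), trajectories can be compared with a point-set distance such as the symmetric Hausdorff distance and then grouped by a simple distance threshold:

```python
# Illustrative sketch: unsupervised trajectory clustering via a
# symmetric Hausdorff distance and greedy single-link grouping.
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two 2-D trajectories,
    given as (n, 2) and (m, 2) arrays of points."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (n, m) pairwise
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def cluster_trajectories(trajs, eps):
    """A trajectory joins the first cluster that contains a member
    within distance eps; otherwise it starts a new cluster."""
    clusters = []
    for t in trajs:
        for c in clusters:
            if any(hausdorff(t, m) <= eps for m in c):
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters

# Two nearby straight-line trajectories and one far-away line -> 2 clusters
line = np.stack([np.linspace(0, 1, 5), np.zeros(5)], axis=1)
trajs = [line, line + [0.0, 0.1], line + [0.0, 5.0]]
print(len(cluster_trajectories(trajs, eps=0.5)))  # -> 2
```

Real methods replace both ingredients (e.g., DTW or LCSS distances, density-based or spectral clustering), but the two design choices shown here, a trajectory-to-trajectory distance plus a grouping rule, are the common skeleton.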
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers presented
at CVPR2015, the premier annual computer vision event held in June 2015, in
order to grasp the trends in the field. Further, we propose "DeepSurvey" as a
mechanism embodying the entire process, from reading all the papers, through
the generation of ideas, to the writing of a paper. Comment: Survey Paper.
Adversarial Constraint Learning for Structured Prediction
Constraint-based learning reduces the burden of collecting labels by having
users specify general properties of structured outputs, such as constraints
imposed by physical laws. We propose a novel framework for simultaneously
learning these constraints and using them for supervision, bypassing the
difficulty of using domain expertise to manually specify constraints. Learning
requires a black-box simulator of structured outputs, which generates valid
labels, but need not model their corresponding inputs or the input-label
relationship. At training time, we constrain the model to produce outputs that
cannot be distinguished from simulated labels by adversarial training.
Providing our framework with a small number of labeled inputs gives rise to a
new semi-supervised structured prediction model; we evaluate this model on
multiple tasks --- tracking, pose estimation and time series prediction --- and
find that it achieves high accuracy with only a small number of labeled inputs.
In some cases, no labels are required at all. Comment: To appear at IJCAI 2018.
SfM-Net: Learning of Structure and Motion from Video
We propose SfM-Net, a geometry-aware neural network for motion estimation in
videos that decomposes frame-to-frame pixel motion in terms of scene and object
depth, camera motion and 3D object rotations and translations. Given a sequence
of frames, SfM-Net predicts depth, segmentation, camera and rigid object
motions, converts those into a dense frame-to-frame motion field (optical
flow), differentiably warps frames in time to match pixels and back-propagates.
The model can be trained with various degrees of supervision: 1)
self-supervised by the re-projection photometric error (completely
unsupervised), 2) supervised by ego-motion (camera motion), or 3) supervised by
depth (e.g., as provided by RGBD sensors). SfM-Net extracts meaningful depth
estimates and successfully estimates frame-to-frame camera rotations and
translations. It often successfully segments the moving objects in the scene,
even though such supervision is never provided.
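The self-supervised variant (option 1 above) rests on the photometric re-projection error: warp the next frame back to the current one using the predicted dense motion field and penalize the remaining intensity difference. A minimal sketch of that idea, using nearest-neighbor sampling for brevity (SfM-Net itself uses differentiable bilinear warping; the function names here are hypothetical):

```python
# Sketch of a photometric reconstruction error under a dense flow field.
import numpy as np

def warp_nearest(frame, flow):
    """frame: (H, W) intensities; flow: (H, W, 2) of (dy, dx) offsets.
    Returns frame sampled at (y + dy, x + dx), clamped to the image bounds."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return frame[sy, sx]

def photometric_error(frame_t, frame_t1, flow):
    """Mean absolute intensity difference after warping frame t+1 back to t."""
    return np.abs(frame_t - warp_nearest(frame_t1, flow)).mean()

# A vertical edge that shifts right by one pixel is perfectly explained
# by a constant flow of (0, +1), so the reconstruction error is zero.
f_t = np.zeros((4, 4)); f_t[:, 2:] = 1.0
f_t1 = np.zeros((4, 4)); f_t1[:, 3:] = 1.0
flow = np.zeros((4, 4, 2)); flow[..., 1] = 1.0
print(photometric_error(f_t, f_t1, flow))  # -> 0.0
```

In training, minimizing this error with respect to the networks that predict depth, camera motion, and object motion is what lets the whole pipeline learn without any ground-truth labels.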
Unsupervised Object-Level Video Summarization with Online Motion Auto-Encoder
Unsupervised video summarization plays an important role in digesting,
browsing, and searching the ever-growing volume of videos produced every day,
yet the fine-grained semantic and motion information (i.e., objects of
interest and their key motions) underlying online videos has barely been
touched. In this paper, we investigate a pioneering research direction:
fine-grained unsupervised object-level video summarization. It is
distinguished from existing pipelines in two aspects: it extracts the key
motions of participating objects, and it learns to summarize in an
unsupervised and online manner. To this end, we propose a novel online motion
Auto-Encoder (online motion-AE) framework that operates on super-segmented
object motion clips. Comprehensive experiments on a newly collected
surveillance dataset and on public datasets demonstrate the effectiveness of
the proposed method.
Recovering Spatiotemporal Correspondence between Deformable Objects by Exploiting Consistent Foreground Motion in Video
Given unstructured videos of deformable objects, we automatically recover
spatiotemporal correspondences to map one object to another (such as animals in
the wild). While traditional methods based on appearance fail in such
challenging conditions, we exploit consistency in object motion between
instances. Our approach discovers pairs of short video intervals where the
object moves in a consistent manner and uses these candidates as seeds for
spatial alignment. We model the spatial correspondence between the point
trajectories on the object in one interval to those in the other using a
time-varying Thin Plate Spline deformation model. On a large dataset of tiger
and horse videos, our method automatically aligns thousands of pairs of frames
to a high accuracy, and outperforms the popular SIFT Flow algorithm. Comment:
9 pages, 14 figures. This article is obsolete; its contents are now covered in
arXiv:1511.09319, where we discuss a comprehensive system for behavior
discovery and spatial alignment of articulated object classes from
unstructured video (available at https://arxiv.org/abs/1511.09319).
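The Thin Plate Spline deformation model mentioned above maps matched control points between the two intervals while interpolating them exactly. A time-invariant sketch of the standard TPS fit (the paper uses a time-varying variant; this is not the authors' code and the names are hypothetical):

```python
# Sketch of a 2-D Thin Plate Spline warp: solve a bordered linear system
# for kernel weights plus an affine part, with kernel U(r) = r^2 log r.
import numpy as np

def tps_fit(src, dst):
    """src, dst: (n, 2) matched control points. Returns (n+3, 2) weights."""
    n = len(src)
    d = np.linalg.norm(src[:, None] - src[None, :], axis=-1)
    K = np.where(d > 0, d**2 * np.log(d, where=d > 0), 0.0)  # U(0) = 0
    P = np.hstack([np.ones((n, 1)), src])                    # affine terms
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    Y = np.vstack([dst, np.zeros((3, 2))])
    return np.linalg.solve(L, Y)

def tps_apply(src, W, pts):
    """Warp query points pts (m, 2) with weights W fitted on src."""
    d = np.linalg.norm(pts[:, None] - src[None, :], axis=-1)
    U = np.where(d > 0, d**2 * np.log(d, where=d > 0), 0.0)
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return U @ W[:len(src)] + P @ W[len(src):]

# TPS interpolates the control points exactly.
src = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5]])
dst = src + [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]
W = tps_fit(src, dst)
print(np.allclose(tps_apply(src, W, src), dst))  # -> True
```

The appeal of TPS here is that it gives the smoothest (minimum bending energy) warp consistent with the matched point trajectories, which is why deviations from it can flag bad correspondences.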
Video Object Segmentation with Language Referring Expressions
Most state-of-the-art semi-supervised video object segmentation methods rely
on a pixel-accurate mask of a target object provided for the first frame of a
video. However, obtaining a detailed segmentation mask is expensive and
time-consuming. In this work we explore an alternative way of identifying a
target object, namely by employing language referring expressions. Besides
being a more practical and natural way of pointing out a target object, using
language specifications can help to avoid drift as well as make the system more
robust to complex dynamics and appearance variations. Leveraging recent
advances of language grounding models designed for images, we propose an
approach to extend them to video data, ensuring temporally coherent
predictions. To evaluate our method we augment the popular video object
segmentation benchmarks, DAVIS'16 and DAVIS'17 with language descriptions of
target objects. We show that our language-supervised approach performs on par
with methods that have access to a pixel-level mask of the target object on
DAVIS'16, and is competitive with methods using scribbles on the challenging
DAVIS'17 dataset. Comment: ACCV 2018: 14th Asian Conference on Computer Vision.