Behavior Discovery and Alignment of Articulated Object Classes from Unstructured Video
We propose an automatic system for organizing the content of a collection of
unstructured videos of an articulated object class (e.g. tiger, horse). By
exploiting the recurring motion patterns of the class across videos, our
system: 1) identifies its characteristic behaviors; and 2) recovers
pixel-to-pixel alignments across different instances. Our system can be useful
for organizing video collections for indexing and retrieval. Moreover, it can
be a platform for learning the appearance or behaviors of object classes from
Internet video. Traditional supervised techniques cannot exploit this wealth of
data directly, as they require a large amount of time-consuming manual
annotation.
The behavior discovery stage generates temporal video intervals, each
automatically trimmed to one instance of the discovered behavior, clustered by
type. It relies on our novel motion representation for articulated motion based
on the displacement of ordered pairs of trajectories (PoTs). The alignment
stage aligns hundreds of instances of the class with great accuracy despite
considerable appearance variations (e.g. an adult tiger and a cub). It uses a
flexible Thin Plate Spline deformation model that can vary through time. We
carefully evaluate each step of our system on a new, fully annotated dataset.
On behavior discovery, we outperform the state-of-the-art Improved DTF
descriptor. On spatial alignment, we outperform the popular SIFT Flow
algorithm.
Comment: 19 pages, 19 figures, 3 tables. arXiv admin note: substantial text overlap with arXiv:1411.788
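The PoT representation described above is built from displacements of ordered pairs of point trajectories. As a rough illustration only (the actual descriptor involves trajectory selection, ordering and normalisation steps not detailed here), a toy pair-of-trajectories displacement feature might look like the following sketch; the function name and the scale normalisation are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a toy pair-of-trajectories (PoT) style descriptor
# built from the relative displacement of two point trajectories over time.
import numpy as np

def pot_descriptor(traj_a, traj_b):
    """traj_a, traj_b: (T, 2) arrays of (x, y) point positions over T frames."""
    traj_a = np.asarray(traj_a, dtype=float)
    traj_b = np.asarray(traj_b, dtype=float)
    rel = traj_b - traj_a                  # relative position of b w.r.t. a per frame
    disp = np.diff(rel, axis=0)            # change of the relative position over time
    scale = np.linalg.norm(rel, axis=1).mean() + 1e-8
    return (disp / scale).ravel()          # scale-normalised, flattened descriptor

# Toy usage: two trajectories of a hypothetical articulated limb pair.
t = np.linspace(0, 1, 15)
upper = np.stack([t * 10, np.zeros_like(t)], axis=1)
lower = np.stack([t * 10, np.sin(t * np.pi) * 5], axis=1)
print(pot_descriptor(upper, lower).shape)   # (28,) for 15-frame trajectories
```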
Video object segmentation and applications in temporal alignment and aspect learning
Modern computer vision has recently seen significant progress in learning visual concepts
from examples. This progress has been fuelled by recent models of visual appearance
as well as recently collected large-scale datasets of manually annotated still
images. Video is a promising alternative, as it inherently contains much richer information
compared to still images. For instance, in video we can observe an object move,
which allows us to differentiate it from its surroundings, or we can observe a smooth
transition between different viewpoints of the same object instance. This richness in
information allows us to effectively tackle tasks that would otherwise be very difficult
if we only considered still images, or even address tasks that are video-specific.
Our first contribution is a computationally efficient technique for video object segmentation.
Our method relies solely on motion in order to rapidly create a rough initial
estimate of the foreground object. This rough initial estimate is then refined through
an energy formulation to be spatio-temporally smooth. The method is able to handle
rapidly moving backgrounds and objects, as well as non-rigid deformations and articulations
without having prior knowledge about the object's appearance, size or location.
In addition to this class-agnostic method, we present a class-specific method that incorporates
additional class-specific appearance cues when the class of the foreground
object is known in advance (e.g. a video of a car).
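To make the first stage concrete, here is a minimal, hypothetical sketch of a purely motion-driven rough foreground estimate from dense optical flow between two consecutive grayscale frames; the threshold, the median-flow camera-motion compensation, and the morphological cleanup are generic placeholders standing in for the energy-based spatio-temporal refinement, not the thesis' actual method.

```python
# Hedged sketch: rough foreground mask from motion only (not the actual method).
import cv2
import numpy as np

def rough_foreground(prev, curr, flow_thresh=2.0):
    """prev, curr: consecutive grayscale frames as uint8 arrays of equal size."""
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)            # per-pixel flow magnitude
    # Subtract the median magnitude as a crude camera-motion compensation.
    mask = (mag - np.median(mag)) > flow_thresh
    # Simple morphological cleanup in place of the energy-based refinement.
    kernel = np.ones((7, 7), np.uint8)
    mask = cv2.morphologyEx(mask.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    return mask.astype(bool)
```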
For our second contribution, we propose a novel model for temporal video alignment
with regard to the viewpoint of the foreground object (i.e., a pair of aligned
frames shows the same object viewpoint). Our work relies on our video object segmentation
technique to automatically localise the foreground objects and extract appearance
measurements solely from them instead of the background. Our model is able
to temporally align realistic videos, where events may occur in a different order, or
occur only in one of the videos. This is in contrast to previous works that typically
assume that the videos show a scripted sequence of events and can simply be aligned
by stretching or compressing one of the videos.
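As a loose illustration of alignment without an assumed event order, the sketch below matches each frame of one video to its nearest neighbour in the other using per-frame foreground descriptors; the descriptors and the simple argmax matching are assumptions for illustration, and the model described above is considerably richer.

```python
# Illustrative only: order-free frame matching on foreground descriptors.
import numpy as np

def align_frames(desc_a, desc_b):
    """desc_a: (Na, D), desc_b: (Nb, D) L2-normalised frame descriptors.
    Returns, for each frame of video A, the best-matching frame of video B."""
    sim = desc_a @ desc_b.T                 # cosine similarity matrix
    return sim.argmax(axis=1)               # no monotonicity constraint imposed

# Toy usage with random descriptors.
rng = np.random.default_rng(0)
a = rng.normal(size=(30, 64)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(40, 64)); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(align_frames(a, b)[:5])
```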
As a final contribution, we once again use our video object segmentation technique
as a basis for automatic visual aspect discovery from videos of an object class. Compared
to previous works, we use a broader definition of an aspect that considers four
factors of variation: viewpoint, articulated pose, occlusions and cropping by the image
border. We pose the aspect discovery task as a clustering problem and provide an
extensive experimental exploration of the benefits of object segmentation for this task.
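A minimal sketch of aspect discovery posed as clustering follows; the choice of k-means, the feature dimensionality, and the number of aspects are illustrative assumptions rather than the settings used in the thesis.

```python
# Hedged sketch: aspect discovery as clustering of per-frame foreground features.
import numpy as np
from sklearn.cluster import KMeans

def discover_aspects(frame_features, n_aspects=8, seed=0):
    """frame_features: (N, D) array of foreground appearance features."""
    km = KMeans(n_clusters=n_aspects, n_init=10, random_state=seed)
    labels = km.fit_predict(frame_features)
    return labels, km.cluster_centers_

# Toy usage with random features standing in for real foreground descriptors.
labels, centers = discover_aspects(np.random.rand(500, 128))
print(np.bincount(labels))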
Unsupervised Object Discovery and Tracking in Video Collections
This paper addresses the problem of automatically localizing dominant objects
as spatio-temporal tubes in a noisy collection of videos with minimal or even
no supervision. We formulate the problem as a combination of two complementary
processes: discovery and tracking. The first one establishes correspondences
between prominent regions across videos, and the second one associates
successive similar object regions within the same video. Interestingly, our
algorithm also discovers the implicit topology of frames associated with
instances of the same object class across different videos, a role normally
left to supervisory information in the form of class labels in conventional
image and video understanding methods. Indeed, as demonstrated by our
experiments, our method can handle video collections featuring multiple object
classes, and substantially outperforms the state of the art in colocalization,
even though it tackles a broader problem with much less supervision.
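The tracking half of this formulation can be illustrated with a simple greedy linker that associates similar object regions across successive frames into a tube; the IoU-based matching below is a generic stand-in and does not reproduce the paper's discovery step or its actual association model.

```python
# Hedged sketch: greedily link per-frame region proposals into a spatio-temporal
# tube by overlap between consecutive frames. Boxes are (x1, y1, x2, y2).
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_tube(per_frame_boxes, start_box):
    """per_frame_boxes: list over frames, each a list of candidate boxes."""
    tube, current = [], start_box
    for candidates in per_frame_boxes:
        current = max(candidates, key=lambda b: iou(current, b))
        tube.append(current)
    return tube
```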
Object-Oriented Dynamics Learning through Multi-Level Abstraction
Object-based approaches for learning action-conditioned dynamics have
demonstrated promise for generalization and interpretability. However, existing
approaches suffer from structural limitations and optimization difficulties for
common environments with multiple dynamic objects. In this paper, we present a
novel self-supervised learning framework, called Multi-level Abstraction
Object-oriented Predictor (MAOP), which employs a three-level learning
architecture that enables efficient object-based dynamics learning from raw
visual observations. We also design a spatial-temporal relational reasoning
mechanism for MAOP to support instance-level dynamics learning and handle
partial observability. Our results show that MAOP significantly outperforms
previous methods in terms of sample efficiency and generalization over novel
environments for learning environment models. We also demonstrate that learned
dynamics models enable efficient planning in unseen environments, comparable to
true environment models. In addition, MAOP learns semantically and visually
interpretable disentangled representations.
Comment: Accepted to the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2020
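For readers unfamiliar with the object-based setup, the toy sketch below predicts each object's next state from its current state and the action with a single shared linear map; it only illustrates action-conditioned, per-object dynamics in general and does not reflect MAOP's three-level architecture or relational reasoning.

```python
# Toy, framework-agnostic sketch of action-conditioned per-object dynamics.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2            # e.g. (x, y, vx, vy) and a 2-D action
W = rng.normal(scale=0.1, size=(STATE_DIM + ACTION_DIM, STATE_DIM))

def predict_next(object_states, action):
    """object_states: (K, STATE_DIM) for K objects; action: (ACTION_DIM,).
    The same dynamics function is applied to every object."""
    inp = np.concatenate(
        [object_states, np.tile(action, (object_states.shape[0], 1))], axis=1)
    return object_states + inp @ W       # residual prediction of the next state

print(predict_next(rng.normal(size=(3, STATE_DIM)), np.array([1.0, 0.0])).shape)
```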