Unsupervised Learning from Narrated Instruction Videos
We address the problem of automatically learning the main steps to complete a
certain task, such as changing a car tire, from a set of narrated instruction
videos. The contributions of this paper are three-fold. First, we develop a new
unsupervised learning approach that takes advantage of the complementary nature
of the input video and the associated narration. The method solves two
clustering problems, one in text and one in video, applied one after the other
and linked by joint constraints to obtain a single coherent sequence of steps
in both modalities. Second, we collect and annotate a new challenging dataset
of real-world instruction videos from the Internet. The dataset contains about
800,000 frames for five different tasks that include complex interactions
between people and objects, and are captured in a variety of indoor and outdoor
settings. Third, we experimentally demonstrate that the proposed method can
automatically discover, in an unsupervised manner, the main steps to achieve
the task and locate the steps in the input videos.
Comment: Appears in: 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2016). 21 page
Fusion of Head and Full-Body Detectors for Multi-Object Tracking
In order to track all persons in a scene, the tracking-by-detection paradigm
has proven to be a very effective approach. Yet, relying solely on a single
detector is also a major limitation, as useful image information might be
ignored. Consequently, this work demonstrates how to fuse two detectors into a
tracking system. To obtain the trajectories, we propose to formulate tracking
as a weighted graph labeling problem, resulting in a binary quadratic program.
As such problems are NP-hard, the solution can only be approximated. Based on
the Frank-Wolfe algorithm, we present a new solver that is crucial to handle
such difficult problems. Evaluation on pedestrian tracking is provided for
multiple scenarios, showing superior results over single detector tracking and
standard QP-solvers. Finally, our tracker ranks 2nd on the MOT16 benchmark and
1st on the new MOT17 benchmark, outperforming over 90 trackers.
Comment: 10 pages, 4 figures; Winner of the MOT17 challenge; CVPRW 201
Convex Global 3D Registration with Lagrangian Duality
The registration of 3D models by a Euclidean transformation is a fundamental task at the core of many applications in computer vision. This problem is non-convex due to the presence of rotational constraints, making traditional local optimization methods prone to getting stuck in local minima. This paper addresses finding the globally optimal transformation in various 3D registration problems by a unified formulation that integrates common geometric registration modalities (namely point-to-point, point-to-line and point-to-plane). This formulation renders the optimization problem independent of both the number and nature of the correspondences.
The main novelty of our proposal is the introduction of a strengthened Lagrangian dual relaxation for this problem, which surpasses previous similar approaches [32] in effectiveness.
In fact, although there are no theoretical guarantees, exhaustive empirical evaluation in both synthetic and real experiments always resulted in a tight relaxation, which allowed us to recover a guaranteed globally optimal solution by exploiting duality theory.
Thus, our approach effectively solves the 3D registration problem with global optimality guarantees while running in a fraction of the time of the state-of-the-art alternative [34], which is based on a more computationally intensive Branch and Bound method.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
Weakly-Supervised Alignment of Video With Text
Suppose that we are given a set of videos, along with natural language
descriptions in the form of multiple sentences (e.g., manual annotations, movie
scripts, sport summaries etc.), and that these sentences appear in the same
temporal order as their visual counterparts. We propose in this paper a method
for aligning the two modalities, i.e., automatically providing a time stamp for
every sentence. Given vectorial features for both video and text, we propose to
cast this task as a temporal assignment problem, with an implicit linear
mapping between the two feature modalities. We formulate this problem as an
integer quadratic program, and solve its continuous convex relaxation using an
efficient conditional gradient algorithm. Several rounding procedures are
proposed to construct the final integer solution. After demonstrating
significant improvements over the state of the art on the related task of
aligning video with symbolic labels [7], we evaluate our method on a
challenging dataset of videos with associated textual descriptions [36], using
both bag-of-words and continuous representations for text.
Comment: ICCV 2015 - IEEE International Conference on Computer Vision, Dec
2015, Santiago, Chil
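The relax-then-round idea in the abstract above (relax an integer quadratic program to a convex problem, solve it with a conditional gradient method, then round) can be illustrated with a minimal sketch. This is not the authors' solver: it assumes a toy convex quadratic objective over the probability simplex rather than the paper's temporal assignment constraints, and the names `frank_wolfe_simplex`, `round_to_vertex`, `Q` and `c` are hypothetical.

```python
def frank_wolfe_simplex(Q, c, iters=5000):
    """Minimize 0.5 * x^T Q x + c^T x over the probability simplex.

    Assumes Q is symmetric positive semidefinite (convex objective).
    Uses the standard conditional gradient step size 2 / (k + 2).
    """
    n = len(c)
    x = [1.0 / n] * n  # start at the simplex barycenter
    for k in range(iters):
        # Gradient of the quadratic: Q x + c
        grad = [sum(Q[i][j] * x[j] for j in range(n)) + c[i] for i in range(n)]
        # Linear minimization over the simplex picks the vertex e_i
        # with the smallest gradient coordinate.
        i_min = min(range(n), key=lambda i: grad[i])
        gamma = 2.0 / (k + 2)
        # Convex combination of the current point and the chosen vertex.
        x = [(1.0 - gamma) * xi for xi in x]
        x[i_min] += gamma
    return x


def round_to_vertex(x):
    """Naive rounding (an assumption): snap to the largest coordinate."""
    best = max(range(len(x)), key=lambda i: x[i])
    return [1 if i == best else 0 for i in range(len(x))]


# Toy instance: the relaxed optimum is approximately (0.75, 0.25, 0).
Q = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
c = [-2.0, -1.0, 0.0]
x_relaxed = frank_wolfe_simplex(Q, c)
x_integer = round_to_vertex(x_relaxed)
```

Each iteration only needs a linear minimization over the feasible set, which is why conditional gradient methods suit problems whose constraints make projection expensive. The rounding step here is the simplest possible choice and stands in for the several rounding procedures the abstract mentions.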
On Pairwise Costs for Network Flow Multi-Object Tracking
Multi-object tracking has recently been approached with min-cost network
flow optimization techniques. Such methods simultaneously resolve multiple
object tracks in a video and enable modeling of dependencies among tracks.
Min-cost network flow methods also fit well within the "tracking-by-detection"
paradigm where object trajectories are obtained by connecting per-frame outputs
of an object detector. Object detectors, however, often fail due to occlusions
and clutter in the video. To cope with such situations, we propose to add
pairwise costs to the min-cost network flow framework. While integer solutions
to such a problem become NP-hard, we design a convex relaxation solution with
an efficient rounding heuristic which empirically gives certificates of small
suboptimality. We evaluate two particular types of pairwise costs and
demonstrate improvements over recent tracking methods in real-world video
sequences.
Mapping, Localization and Path Planning for Image-based Navigation using Visual Features and Map
Building on progress in feature representations for image retrieval,
image-based localization has seen a surge of research interest. Image-based
localization has the advantage of being inexpensive and efficient, often
avoiding the use of 3D metric maps altogether. That said, the need to maintain
a large number of reference images as an effective support of localization in a
scene nonetheless calls for them to be organized in a map structure of some
kind.
The problem of localization often arises as part of a navigation process. We
are, therefore, interested in summarizing the reference images as a set of
landmarks, which meet the requirements for image-based navigation. A
contribution of this paper is to formulate such a set of requirements for the
two sub-tasks involved: map construction and self-localization. These
requirements are then exploited for compact map representation and accurate
self-localization, using the framework of a network flow problem. During this
process, we formulate the map construction and self-localization problems as
convex quadratic and second-order cone programs, respectively. We evaluate our
methods on publicly available indoor and outdoor datasets, where they
outperform existing methods significantly.
Comment: CVPR 2019, for implementation see https://github.com/janinethom
A Convex Polynomial Force-Motion Model for Planar Sliding: Identification and Application
We propose a polynomial force-motion model for planar sliding. The set of
generalized friction loads is the 1-sublevel set of a polynomial whose gradient
directions correspond to generalized velocities. Additionally, the polynomial
is confined to be convex even-degree homogeneous in order to obey the maximum
work inequality, symmetry, shape invariance in scale, and fast invertibility.
We present a simple and statistically-efficient model identification procedure
using a sum-of-squares convex relaxation. Simulation and robotic experiments
validate the accuracy and efficiency of our approach. We also show practical
applications of our model including stable pushing of objects and free sliding
dynamic simulations.
Comment: 2016 IEEE International Conference on Robotics and Automation (ICRA
- …
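The force-motion model described in the planar sliding abstract above can be written compactly as follows. The symbols are assumptions introduced for illustration and do not come from the listing: $H$ is the polynomial, $F$ a generalized friction load, $V$ a generalized velocity, $d$ the polynomial degree, and $s$ a non-negative scale factor.

```latex
\[
\mathcal{F} = \{\, F \in \mathbb{R}^3 : H(F) \le 1 \,\}, \qquad
V = s\,\nabla H(F), \quad s \ge 0, \quad \text{for } H(F) = 1 .
\]
\[
H(\lambda F) = \lambda^{d} H(F) \ \ (d \text{ even})
\;\Longrightarrow\; H(-F) = H(F) \quad \text{(symmetry)} .
\]
\[
H \text{ convex},\ H(F) = 1,\ H(F') \le 1
\;\Longrightarrow\;
\nabla H(F)^{\top} (F - F') \;\ge\; H(F) - H(F') \;\ge\; 0
\quad \text{(maximum work inequality)} .
\]
```

The last line follows from the first-order convexity condition $H(F') \ge H(F) + \nabla H(F)^{\top}(F' - F)$, which suggests why the abstract requires convexity for the maximum work inequality.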