Unsupervised Learning from Narrated Instruction Videos
We address the problem of automatically learning the main steps to complete a
certain task, such as changing a car tire, from a set of narrated instruction
videos. The contributions of this paper are three-fold. First, we develop a new
unsupervised learning approach that takes advantage of the complementary nature
of the input video and the associated narration. The method solves two
clustering problems, one in text and one in video, applied one after the other
and linked by joint constraints to obtain a single coherent sequence of steps
in both modalities. Second, we collect and annotate a new challenging dataset
of real-world instruction videos from the Internet. The dataset contains about
800,000 frames for five different tasks that include complex interactions
between people and objects, and are captured in a variety of indoor and outdoor
settings. Third, we experimentally demonstrate that the proposed method can
automatically discover, in an unsupervised manner, the main steps to achieve
the task and locate the steps in the input videos.
Comment: Appears in: 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2016). 21 pages
On Pairwise Costs for Network Flow Multi-Object Tracking
Multi-object tracking has been recently approached with the min-cost network
flow optimization techniques. Such methods simultaneously resolve multiple
object tracks in a video and enable modeling of dependencies among tracks.
Min-cost network flow methods also fit well within the "tracking-by-detection"
paradigm where object trajectories are obtained by connecting per-frame outputs
of an object detector. Object detectors, however, often fail due to occlusions
and clutter in the video. To cope with such situations, we propose to add
pairwise costs to the min-cost network flow framework. While integer solutions
to such a problem become NP-hard, we design a convex relaxation solution with
an efficient rounding heuristic which empirically gives certificates of small
suboptimality. We evaluate two particular types of pairwise costs and
demonstrate improvements over recent tracking methods in real-world video
sequences.
Fusion of Head and Full-Body Detectors for Multi-Object Tracking
In order to track all persons in a scene, the tracking-by-detection paradigm
has proven to be a very effective approach. Yet, relying solely on a single
detector is also a major limitation, as useful image information might be
ignored. Consequently, this work demonstrates how to fuse two detectors into a
tracking system. To obtain the trajectories, we propose to formulate tracking
as a weighted graph labeling problem, resulting in a binary quadratic program.
As such problems are NP-hard, the solution can only be approximated. Based on
the Frank-Wolfe algorithm, we present a new solver that is crucial to handle
such difficult problems. Evaluation on pedestrian tracking is provided for
multiple scenarios, showing superior results over single detector tracking and
standard QP-solvers. Finally, our tracker ranks 2nd on the MOT16 benchmark and
1st on the new MOT17 benchmark, outperforming over 90 trackers.
Comment: 10 pages, 4 figures; Winner of the MOT17 challenge; CVPRW 201
New Variants of Frank-Wolfe Algorithm for Video Co-localization Problem
The co-localization problem is a model that simultaneously localizes objects
of the same class within a series of images or videos. In
\cite{joulin2014efficient}, the authors present new variants of the Frank-Wolfe
algorithm (also known as conditional gradient) that increase the efficiency of
solving the image and video co-localization problems. They measure the
efficiency of their methods by the rate at which a quantity called the Wolfe gap
decreases at each iteration of the algorithm. In this project, inspired by the
conditional gradient sliding algorithm (CGS) \cite{CGS:Lan}, we propose
algorithms for solving such problems and demonstrate the efficiency of the
proposed algorithms through numerical experiments. The efficiency of these
methods with respect to the Wolfe gap is compared by implementing them on the
YouTube-Objects dataset for videos.
Comment: 20 pages, 7 figures, Future Technologies Conference (FTC) 202
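To make the Wolfe gap mentioned in this abstract concrete, here is a minimal sketch of the basic Frank-Wolfe iteration. It is not one of the variants proposed in the paper. The feasible set (the probability simplex), the toy quadratic objective, the step size 2/(k+2), and all names such as `frank_wolfe_simplex` and `target` are assumptions chosen for illustration only.

```python
def frank_wolfe_simplex(grad, x0, max_iters=200, tol=1e-6):
    """Basic Frank-Wolfe over the probability simplex (illustrative sketch).

    grad: function returning the gradient of the objective at x.
    Returns the final iterate and the Wolfe gap recorded at each iteration.
    """
    x = list(x0)
    gaps = []
    for k in range(max_iters):
        g = grad(x)
        # Linear minimization oracle: over the simplex, the minimizer of
        # <g, s> is the vertex e_i with the smallest gradient entry.
        i = min(range(len(x)), key=lambda j: g[j])
        # Wolfe gap: <g, x - s>, an upper bound on f(x) - f* for convex f.
        gap = sum(gj * xj for gj, xj in zip(g, x)) - g[i]
        gaps.append(gap)
        if gap <= tol:
            break
        gamma = 2.0 / (k + 2.0)  # standard open-loop step size (assumed here)
        x = [(1.0 - gamma) * xj for xj in x]
        x[i] += gamma
    return x, gaps


# Hypothetical toy problem: minimize 0.5 * ||x - target||^2 over the simplex.
target = [0.2, 0.5, 0.3]
grad = lambda x: [xj - cj for xj, cj in zip(x, target)]
x, gaps = frank_wolfe_simplex(grad, [1.0, 0.0, 0.0])
```

Plotting `gaps` against the iteration index gives the kind of Wolfe-gap curve that such comparisons are typically based on.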
Image Co-localization by Mimicking a Good Detector's Confidence Score Distribution
Given a set of images containing objects from the same category, the task of
image co-localization is to identify and localize each instance. This paper
shows that this problem can be solved by a simple but intriguing idea, that is,
a common object detector can be learnt by making its detection confidence
scores distributed like those of a strongly supervised detector. More
specifically, we observe that given a set of object proposals extracted from an
image that contains the object of interest, an accurate strongly supervised
object detector should give high scores to only a small minority of proposals,
and low scores to most of them. Thus, we devise an entropy-based objective
function to enforce the above property when learning the common object
detector. Once the detector is learnt, we resort to a segmentation approach to
refine the localization. We show that despite its simplicity, our approach
outperforms state-of-the-art methods.
Comment: Accepted to Proc. European Conf. Computer Vision 201
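The observation behind the entropy-based objective can be illustrated with a small sketch. The abstract does not give the paper's exact objective, so the softmax normalization and the example scores below are assumptions. The sketch only shows that a score distribution concentrated on a few proposals has lower entropy than a nearly uniform one.

```python
import math


def softmax(scores):
    """Turn raw proposal scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical detector scores over five object proposals in one image.
peaked = [8.0, 0.1, 0.0, -0.2, 0.1]   # one confident proposal, rest low
flat = [0.1, 0.0, 0.1, -0.1, 0.0]     # no proposal stands out

h_peaked = entropy(softmax(peaked))
h_flat = entropy(softmax(flat))
```

Under these assumptions, penalizing entropy pushes a detector toward the peaked pattern, which is the behavior the abstract attributes to a strongly supervised detector.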
Weakly-Supervised Alignment of Video With Text
Suppose that we are given a set of videos, along with natural language
descriptions in the form of multiple sentences (e.g., manual annotations, movie
scripts, sport summaries etc.), and that these sentences appear in the same
temporal order as their visual counterparts. We propose in this paper a method
for aligning the two modalities, i.e., automatically providing a time stamp for
every sentence. Given vectorial features for both video and text, we propose to
cast this task as a temporal assignment problem, with an implicit linear
mapping between the two feature modalities. We formulate this problem as an
integer quadratic program, and solve its continuous convex relaxation using an
efficient conditional gradient algorithm. Several rounding procedures are
proposed to construct the final integer solution. After demonstrating
significant improvements over the state of the art on the related task of
aligning video with symbolic labels [7], we evaluate our method on a
challenging dataset of videos with associated textual descriptions [36], using
both bag-of-words and continuous representations for text.
Comment: ICCV 2015 - IEEE International Conference on Computer Vision, Dec
2015, Santiago, Chile
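The ordering constraint described in this abstract (sentences appear in the same temporal order as the video content) can be sketched with a small dynamic program. This is not the paper's integer quadratic program or any of its rounding procedures. The cost matrix, the function name `ordered_assignment`, and the choice to allow two sentences to share a time step are all assumptions made for illustration.

```python
def ordered_assignment(cost):
    """Assign each sentence a time step so that time steps never decrease.

    cost[i][t] is the (hypothetical) cost of placing sentence i at step t.
    Returns the chosen time steps and the minimal total cost.
    """
    n, T = len(cost), len(cost[0])
    best = [list(cost[0])]  # best[i][t]: min cost of sentences 0..i, i at t
    back = []               # back[i-1][t]: time step chosen for sentence i-1
    for i in range(1, n):
        row, ptr = [], []
        run_min, run_arg = float("inf"), -1
        for t in range(T):
            # Running minimum over all earlier-or-equal steps of sentence i-1.
            if best[i - 1][t] < run_min:
                run_min, run_arg = best[i - 1][t], t
            row.append(cost[i][t] + run_min)
            ptr.append(run_arg)
        best.append(row)
        back.append(ptr)
    # Pick the best final step, then follow the back-pointers.
    t = min(range(T), key=lambda j: best[-1][j])
    total = best[-1][t]
    stamps = [t]
    for ptr in reversed(back):
        t = ptr[t]
        stamps.append(t)
    stamps.reverse()
    return stamps, total


# Hypothetical costs: the per-sentence minima (steps 0, 2, 1) violate the order.
cost = [
    [0, 5, 5, 5],
    [5, 5, 0, 5],
    [5, 0, 5, 5],
]
stamps, total = ordered_assignment(cost)
```

In this toy example, picking the cheapest step for each sentence independently would place the third sentence before the second, so the ordered solution has to pay a higher total cost to keep the time stamps in sequence.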
On Variants of Sliding and Frank-Wolfe Type Methods and Their Applications in Video Co-localization
In this dissertation, our main focus is to design and analyze first-order methods for computing approximate solutions to convex, smooth optimization problems over certain feasible sets. Specifically, our goal is to explore some variants of sliding and Frank-Wolfe (FW) type algorithms, analyze their convergence complexity, and examine their performance in numerical experiments. The dissertation makes three contributions. First, we incorporate a linesearch technique into a well-known projection-free sliding algorithm, namely the conditional gradient sliding (CGS) method. Our proposed algorithm, called conditional gradient sliding with linesearch (CGSls), does not require knowledge of the Lipschitz constant of the gradient of the objective function, a constant that is critical in the numerical implementation of the CGS method. Second, we explore the possibility of designing a bundle-level type version of the CGS method, which to the best of our knowledge has not yet appeared in the literature. Our proposed sliding APL (SAPL) method achieves the same complexity as the CGS method. Third, we study numerical algorithms for solving the image co-localization problem. For this problem, we propose new variants of the Frank-Wolfe (FW) method and compare their empirical performance with other existing methods.
The dissertation is organized as follows. In the first chapter, we review some projection-based and projection-free algorithms, their variants, and their respective advantages and disadvantages. This chapter also introduces several useful definitions, theorems, and lemmas that are used throughout the dissertation. For completeness, we prove most of the known results listed in this chapter (proofs are deferred to the appendix). In the second chapter, we incorporate a linesearch technique into the CGS method and propose the CGSls method. We show that the proposed CGSls method converges with similar complexity to the CGS method. We also examine the performance of the proposed algorithm by comparing it to the CGS method and other projection-free algorithms. In the third chapter, we explore the possibility of designing a bundle-level type variant of the CGS method. The proposed SAPL method is inspired by previous literature on bundle-level type methods; no such method has yet appeared in the literature on sliding algorithms. We show that the proposed SAPL method converges with the same order of complexity as the CGS and CGSls methods. In the fourth chapter, we apply the algorithms studied in previous chapters to the well-known video co-localization problem. We also propose new variants of the FW method and compare their empirical performance with other numerical methods.
Unsupervised Object Discovery and Tracking in Video Collections
This paper addresses the problem of automatically localizing dominant objects
as spatio-temporal tubes in a noisy collection of videos with minimal or even
no supervision. We formulate the problem as a combination of two complementary
processes: discovery and tracking. The first one establishes correspondences
between prominent regions across videos, and the second one associates
successive similar object regions within the same video. Interestingly, our
algorithm also discovers the implicit topology of frames associated with
instances of the same object class across different videos, a role normally
left to supervisory information in the form of class labels in conventional
image and video understanding methods. Indeed, as demonstrated by our
experiments, our method can handle video collections featuring multiple object
classes, and substantially outperforms the state of the art in colocalization,
even though it tackles a broader problem with much less supervision.