Hamiltonian Streamline Guided Feature Extraction with Applications to Face Detection
We propose a new feature extraction method based on two dynamical systems
induced by the intensity landscape: the negative gradient system and the
Hamiltonian system. We build features based on the Hamiltonian streamlines.
These features contain nice global topological information about the intensity
landscape, and can be used for object detection. We show that for training
images of the same size, our feature space is much smaller than that generated
by Haar-like features. The training time is extremely short, and detection
speed and accuracy are similar to those of Haar-like-feature-based classifiers.
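As a concrete illustration, here is a minimal sketch of tracing a streamline of the Hamiltonian system induced by an intensity landscape I(x, y), assuming the standard construction dx/dt = ∂I/∂y, dy/dt = -∂I/∂x (the image gradient rotated 90 degrees, so trajectories follow level curves); the function is illustrative, not the paper's implementation.

```python
import numpy as np

def trace_hamiltonian_streamline(image, start_yx, n_steps=200, dt=0.5):
    """Trace a streamline of the Hamiltonian system induced by intensity
    I(x, y): velocity (dy/dt, dx/dt) = (-I_x, I_y) is the gradient rotated
    90 degrees, so the trajectory follows a level curve of I."""
    gy, gx = np.gradient(image.astype(float))    # I_y (rows), I_x (cols)
    h, w = image.shape
    path = [np.asarray(start_yx, dtype=float)]
    for _ in range(n_steps):
        y, x = path[-1]
        iy, ix = int(round(y)), int(round(x))
        if not (0 <= iy < h and 0 <= ix < w):
            break                                 # left the image
        v = np.array([-gx[iy, ix], gy[iy, ix]])   # (dy/dt, dx/dt)
        norm = np.linalg.norm(v)
        if norm < 1e-9:
            break                                 # critical point of I
        path.append(path[-1] + dt * v / norm)     # unit-speed Euler step
    return np.array(path)
```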
LIBSVX: A Supervoxel Library and Benchmark for Early Video Processing
Supervoxel segmentation has strong potential to be incorporated into early
video analysis as superpixel segmentation has in image analysis. However, there
are many plausible supervoxel methods and little understanding as to when and
where each is most appropriate. Indeed, we are not aware of a single
comparative study on supervoxel segmentation. To that end, we study seven
supervoxel algorithms, including both off-line and streaming methods, in the
context of what we consider to be a good supervoxel: namely, spatiotemporal
uniformity, object/region boundary detection, region compression and parsimony.
For the evaluation we propose a comprehensive suite of seven quality metrics to
measure these desirable supervoxel characteristics. In addition, we evaluate
the methods in a supervoxel classification task as a proxy for subsequent
high-level uses of the supervoxels in video analysis. We use six existing
benchmark video datasets with a variety of content-types and dense human
annotations. Our findings have led us to conclusive evidence that the
hierarchical graph-based (GBH), segmentation by weighted aggregation (SWA) and
temporal superpixels (TSP) methods are the top-performers among the seven
methods. They all perform well in terms of segmentation accuracy, but vary in
regard to the other desiderata: GBH captures object boundaries best; SWA has
the best potential for region compression; and TSP achieves the best
undersegmentation error.
Comment: In review at the International Journal of Computer Vision
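For reference, one widely used metric in this family is the 3D undersegmentation error; the sketch below follows the common formulation (charging each ground-truth segment for the extra volume of the supervoxels overlapping it) and is not LIBSVX's actual code.

```python
import numpy as np

def undersegmentation_error_3d(supervoxels, ground_truth):
    """3D undersegmentation error: for each ground-truth segment g, take
    every supervoxel that overlaps g, sum their total volume, subtract |g|,
    and normalize by |g|; average over segments, lower is better.
    Both inputs are integer label videos of shape (T, H, W)."""
    errors = []
    for g in np.unique(ground_truth):
        inside = ground_truth == g
        touching = np.unique(supervoxels[inside])
        bleed = np.isin(supervoxels, touching).sum() - inside.sum()
        errors.append(bleed / inside.sum())
    return float(np.mean(errors))
```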
Actor-Action Semantic Segmentation with Grouping Process Models
Actor-action semantic segmentation marked an important step toward advanced
video understanding: what action is happening; who is performing the
action; and where the action is in space-time. Current models for this problem
are local, based on layered CRFs, and are unable to capture long-range
interactions among video parts. We propose a new model that combines these local
labeling CRFs with a hierarchical supervoxel decomposition. The supervoxels
provide cues for possible groupings of nodes, at various scales, in the CRFs to
encourage adaptive, high-order groups for more effective labeling. Our model is
dynamic and continuously exchanges information during inference: the local CRFs
influence what supervoxels in the hierarchy are active, and these active nodes
influence the connectivity in the CRF; we hence call it a grouping process
model. The experimental results on a recent large-scale video dataset show a
large margin of 60% relative improvement over the state of the art, which
demonstrates the effectiveness of the dynamic, bidirectional flow between
labeling and grouping.
Comment: Technical report
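The bidirectional flow can be pictured as a short alternating loop; the sketch below is a toy rendering with hypothetical `crf.infer`, `hierarchy.groups`, and `group.pixels` interfaces, meant only to show the loop structure, not the paper's actual CRF energies.

```python
def label_agreement(labels, group):
    """Fraction of the group's pixels sharing the majority label (toy)."""
    votes = [labels[p] for p in group.pixels]       # hypothetical pixel list
    return votes.count(max(set(votes), key=votes.count)) / len(votes)

def grouping_process_inference(crf, hierarchy, n_rounds=5, agree_thresh=0.9):
    """Alternate between (a) CRF labeling given the currently active
    supervoxel groups and (b) activating hierarchy nodes whose member
    pixels agree on a label, which rewires the CRF's high-order terms."""
    active = set()
    labels = crf.infer(active_groups=active)        # hypothetical API
    for _ in range(n_rounds):
        active = {g for g in hierarchy.groups()     # hypothetical API
                  if label_agreement(labels, g) > agree_thresh}
        labels = crf.infer(active_groups=active)    # relabel under new groups
    return labels
```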
Video Object Segmentation using Supervoxel-Based Gerrymandering
Pixels operate locally. Superpixels have some potential to collect
information across many pixels; supervoxels have more potential by implicitly
operating across time. In this paper, we explore this well-established notion,
thoroughly analyzing how supervoxels can be used in place of, and in conjunction
with, other means of aggregating information across space-time. Focusing on the
problem of strictly unsupervised video object segmentation, we devise a method
called supervoxel gerrymandering that links masks of foregroundness and
backgroundness via local and non-local consensus measures. We pose and answer a
series of critical questions about the ability of supervoxels to adequately
sway local voting; the questions regard type and scale of supervoxels as well
as local versus non-local consensus, and the questions are posed in a general
way so as to impact the broader knowledge of the use of supervoxels in video
understanding. We work with the DAVIS dataset and find that our analysis yields
an unsupervised method that outperforms all other known unsupervised methods
and even many supervised ones.
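The bloc-voting intuition behind gerrymandering can be sketched in a few lines; per-pixel foregroundness scores from an upstream cue are assumed, and the pooling and threshold choices are illustrative.

```python
import numpy as np

def supervoxel_vote(foreground_scores, supervoxels, thresh=0.5):
    """Let each supervoxel vote as a bloc: pool the per-pixel foregroundness
    scores inside it and write the pooled decision back to every pixel it
    covers, so coherent space-time regions sway the local vote together."""
    out = np.zeros(foreground_scores.shape, dtype=bool)
    for sv in np.unique(supervoxels):
        mask = supervoxels == sv
        out[mask] = foreground_scores[mask].mean() > thresh
    return out
```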
Efficient Hierarchical Markov Random Fields for Object Detection on a Mobile Robot
Object detection and classification using video is necessary for intelligent
planning and navigation on a mobile robot. However, current methods can be too
slow or insufficient for distinguishing multiple classes. Techniques that
rely on binary (foreground/background) labels incorrectly identify areas with
multiple overlapping objects as a single segment. We propose two Hierarchical
Markov Random Field models in an effort to distinguish connected objects using
tiered, binary label sets. Near-real-time performance has been achieved using
efficient optimization methods that run at up to 11 frames per second on a
dual-core 2.2 GHz processor. Evaluation of both models is done using footage taken
from a robot obstacle course at the 2010 Intelligent Ground Vehicle
Competition.
Comment: 7 pages
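The tiered, binary labeling idea can be sketched with two sequential binary MRFs; the toy below uses a few ICM sweeps with wrap-around neighbors in place of the paper's efficient optimization, and all unary costs are assumed given.

```python
import numpy as np

def icm_binary(unary0, unary1, beta=1.0, sweeps=5):
    """Binary MRF labeling by iterated conditional modes: each sweep picks,
    per pixel, the label minimizing its unary cost plus a Potts penalty for
    disagreeing with its 4-neighborhood (np.roll wraps edges; toy only)."""
    labels = (unary1 < unary0).astype(int)
    for _ in range(sweeps):
        ones = sum(np.roll(labels, s, axis=a)
                   for a, s in ((0, 1), (0, -1), (1, 1), (1, -1)))
        cost0 = unary0 + beta * ones           # neighbors labeled 1 disagree
        cost1 = unary1 + beta * (4 - ones)     # neighbors labeled 0 disagree
        labels = (cost1 < cost0).astype(int)
    return labels

# Tiered use: tier 1 separates foreground from background, then tier 2 runs
# on the foreground pixels only, splitting overlapping objects apart -- two
# binary label sets standing in for one multi-label problem.
```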
BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames
Semi-supervised video object segmentation has made significant progress on
real and challenging videos in recent years. The current paradigm for
segmentation methods and benchmark datasets is to segment objects in video
provided a single annotation in the first frame. However, we find that
segmentation performance across the entire video varies dramatically when
selecting an alternative frame for annotation. This paper addresses the problem
of learning to suggest the single best frame across the video for user
annotation; this is, in fact, never the first frame of the video. We achieve this by
introducing BubbleNets, a novel deep sorting network that learns to select
frames using a performance-based loss function, which enables the conversion of
expansive amounts of training examples from already existing datasets. Using
BubbleNets, we are able to achieve an 11% relative improvement in segmentation
performance on the DAVIS benchmark without any changes to the underlying method
of segmentation.
Comment: CVPR 2019
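Consistent with the "deep sorting" name, frame selection can be sketched as bubble sort driven by a learned pairwise comparator; `compare` is a placeholder for the network, and note that any two frames with known downstream performance yield one training comparison, which is how existing datasets expand into many examples.

```python
def select_guidance_frame(frames, compare):
    """Bubble-sort frames with a learned comparator and return the best.
    `compare(a, b)` is assumed to return a positive value when annotating
    frame `a` is predicted to give better segmentation than annotating `b`."""
    order = list(frames)
    for i in range(len(order) - 1):
        for j in range(len(order) - 1 - i):
            if compare(order[j + 1], order[j]) > 0:  # right frame is better:
                order[j], order[j + 1] = order[j + 1], order[j]  # bubble it up
    return order[0]  # predicted-best frame to hand to the annotator
```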
Adviser Networks: Learning What Question to Ask for Human-In-The-Loop Viewpoint Estimation
Humans have an unparalleled visual intelligence and can overcome visual
ambiguities that machines currently cannot. Recent works have shown that
incorporating guidance from humans during inference for monocular
viewpoint-estimation can help overcome difficult cases in which the
computer alone would otherwise have failed. These hybrid-intelligence
approaches are hence gaining traction. However, deciding what question to ask
the human at inference time remains an open problem.
We address this question by formulating it as an Adviser Problem: can we
learn a mapping from the input to a specific question to ask the human to
maximize the expected positive impact on the overall task? We formulate a
solution to the adviser problem for viewpoint estimation using a deep network
where the question asks for the location of a keypoint in the input image. We
show that by using the Adviser Network's recommendations, the model and the
human outperform the previous hybrid-intelligence state of the art by 3.7%,
and the computer-only state of the art by 5.28% absolute.
Comment: 15 pages, 3 figures. Updated acknowledgments
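At inference time the Adviser idea reduces to an argmax over candidate questions; the sketch below assumes a network that outputs one expected-impact score per keypoint, with all names illustrative.

```python
import numpy as np

def choose_question(image, adviser_net, keypoint_names):
    """Ask the human about the keypoint whose answer the Adviser Network
    predicts will most improve viewpoint estimation for this image."""
    expected_gain = adviser_net(image)   # hypothetical: one score per keypoint
    best = int(np.argmax(expected_gain))
    return "Where is the %s in the image?" % keypoint_names[best]
```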
Tukey-Inspired Video Object Segmentation
We investigate the problem of strictly unsupervised video object
segmentation, i.e., the separation of a primary object from background in video
without a user-provided object mask or any training on an annotated dataset. We
find foreground objects in low-level vision data using a John Tukey-inspired
measure of "outlierness". This Tukey-inspired measure also estimates the
reliability of each data source as video characteristics change (e.g., a camera
starts moving). The proposed method achieves state-of-the-art results for
strictly unsupervised video object segmentation on the challenging DAVIS
dataset. Finally, we use a variant of the Tukey-inspired measure to combine the
output of multiple segmentation methods, including those using supervision
during training, runtime, or both. This collectively more robust method of
segmentation improves the Jaccard measure of its constituent methods by as much
as 28%.
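For reference, a Tukey-style outlierness score built on the classic quartile fences might look like the sketch below; the paper's exact measure may differ, so the scaling is an assumption.

```python
import numpy as np

def tukey_outlierness(values):
    """Score how far each value sits beyond Tukey's fences: zero inside
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR], growing linearly in IQR units outside,
    so large scores flag likely-foreground outliers among low-level cues."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = max(q3 - q1, 1e-12)
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.maximum(0.0, np.maximum(lo - values, values - hi)) / iqr
```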
Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction
We study weakly-supervised video object grounding: given a video segment and
a corresponding descriptive sentence, the goal is to localize objects
mentioned in the sentence within the video. During training, no object bounding
boxes are available, but the set of possible objects to be grounded is known
beforehand. Existing approaches in the image domain use Multiple Instance
Learning (MIL) to ground objects by enforcing matches between visual and
semantic features. A naive extension of this approach to the video domain is to
treat the entire segment as a bag of spatial object proposals. However, an
object appearing only sparsely across frames may be missed entirely,
since successfully spotting it in a single frame already triggers a
satisfactory match. To this end, we propagate the weak supervisory signal from
the segment level to frames that likely contain the target object. For frames
that are unlikely to contain the target objects, we use an alternative penalty
loss. We also leverage the interactions among objects as a textual guide for
the grounding. We evaluate our model on the newly-collected benchmark
YouCook2-BoundingBox and show improvements over competitive baselines.
Comment: 16 pages including appendix
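The frame-level weighting of the MIL signal can be sketched as follows, assuming precomputed region features, a word embedding, and per-frame weights; the actual model's loss and its penalty for unlikely frames are richer than this positive-bag term.

```python
import torch
import torch.nn.functional as F

def weighted_mil_loss(region_feats, word_emb, frame_weights):
    """Frame-weighted MIL grounding sketch.
    region_feats: (T, R, D) proposal features per frame; word_emb: (D,);
    frame_weights: (T,) confidence that each frame shows the object.
    Each frame's bag score is its best-matching proposal, and frames
    likely to contain the object contribute more to the match objective."""
    sims = torch.einsum('trd,d->tr', region_feats, word_emb)  # (T, R)
    bag_scores = sims.max(dim=1).values                       # (T,)
    return -(frame_weights * F.logsigmoid(bag_scores)).mean()
```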
Robot-Supervised Learning for Object Segmentation
To be effective in unstructured and changing environments, robots must learn
to recognize new objects. Deep learning has enabled rapid progress for object
detection and segmentation in computer vision; however, this progress comes at
the price of human annotators labeling many training examples. This paper
addresses the problem of extending learning-based segmentation methods to
robotics applications where annotated training data is not available. Our
method enables pixelwise segmentation of grasped objects. We factor the problem
of segmenting the object from the background into two sub-problems: (1)
segmenting the robot manipulator and object from the background and (2)
segmenting the object from the manipulator. We propose a kinematics-based
foreground segmentation technique to solve (1). To solve (2), we train a
self-recognition network that segments the robot manipulator. We train this
network without human supervision, leveraging our foreground segmentation
technique from (1) to label a training set of images containing the robot
manipulator without a grasped object. We demonstrate experimentally that our
method outperforms state-of-the-art adaptable in-hand object segmentation. We
also show that a training set composed of automatically labeled images of
grasped objects improves segmentation performance on a test set of images of
the same objects in the environment.
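The two-sub-problem factorization reduces to a mask subtraction at inference time; the sketch below assumes both stages' outputs are available, with all names illustrative.

```python
import numpy as np

def grasped_object_mask(frame, kinematic_fg, self_recognition_net):
    """Factorized in-hand segmentation: (1) a kinematics-based mask separates
    robot-plus-object from background, then (2) the self-recognition
    network's manipulator mask is subtracted, leaving only object pixels."""
    fg = kinematic_fg(frame)             # hypothetical: arm + grasped object
    robot = self_recognition_net(frame)  # hypothetical: manipulator only
    return np.logical_and(fg, np.logical_not(robot))
```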