2,588 research outputs found
Unsupervised Learning from Narrated Instruction Videos
We address the problem of automatically learning the main steps to complete a
certain task, such as changing a car tire, from a set of narrated instruction
videos. The contributions of this paper are three-fold. First, we develop a new
unsupervised learning approach that takes advantage of the complementary nature
of the input video and the associated narration. The method solves two
clustering problems, one in text and one in video, applied one after each other
and linked by joint constraints to obtain a single coherent sequence of steps
in both modalities. Second, we collect and annotate a new challenging dataset
of real-world instruction videos from the Internet. The dataset contains about
800,000 frames for five different tasks that include complex interactions
between people and objects, and are captured in a variety of indoor and outdoor
settings. Third, we experimentally demonstrate that the proposed method can
automatically discover, in an unsupervised manner, the main steps to achieve
the task and locate the steps in the input videos.Comment: Appears in: 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2016). 21 page
Open-Vocabulary Object Detection via Scene Graph Discovery
In recent years, open-vocabulary (OV) object detection has attracted
increasing research attention. Unlike traditional detection, which only
recognizes fixed-category objects, OV detection aims to detect objects in an
open category set. Previous works often leverage vision-language (VL) training
data (e.g., referring grounding data) to recognize OV objects. However, they
only use pairs of nouns and individual objects in VL data, while these data
usually contain much more information, such as scene graphs, which are also
crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based
Discovery Network (SGDN) that exploits scene graph cues for OV detection.
Firstly, a scene-graph-based decoder (SGDecoder) including sparse
scene-graph-guided attention (SSGA) is presented. It captures scene graphs and
leverages them to discover OV objects. Secondly, we propose scene-graph-based
prediction (SGPred), where we build a scene-graph-based offset regression
(SGOR) mechanism to enable mutual enhancement between scene graph extraction
and object localization. Thirdly, we design a cross-modal learning mechanism in
SGPred. It takes scene graphs as bridges to improve the consistency between
cross-modal embeddings for OV object classification. Experiments on COCO and
LVIS demonstrate the effectiveness of our approach. Moreover, we show the
ability of our model for OV scene graph detection, while previous OV scene
graph generation methods cannot tackle this task
PlaNet - Photo Geolocation with Convolutional Neural Networks
Is it possible to build a system to determine the location where a photo was
taken using just its pixels? In general, the problem seems exceptionally
difficult: it is trivial to construct situations where no location can be
inferred. Yet images often contain informative cues such as landmarks, weather
patterns, vegetation, road markings, and architectural details, which in
combination may allow one to determine an approximate location and occasionally
an exact location. Websites such as GeoGuessr and View from your Window suggest
that humans are relatively good at integrating these cues to geolocate images,
especially en-masse. In computer vision, the photo geolocation problem is
usually approached using image retrieval methods. In contrast, we pose the
problem as one of classification by subdividing the surface of the earth into
thousands of multi-scale geographic cells, and train a deep network using
millions of geotagged images. While previous approaches only recognize
landmarks or perform approximate matching using global image descriptors, our
model is able to use and integrate multiple visible cues. We show that the
resulting model, called PlaNet, outperforms previous approaches and even
attains superhuman levels of accuracy in some cases. Moreover, we extend our
model to photo albums by combining it with a long short-term memory (LSTM)
architecture. By learning to exploit temporal coherence to geolocate uncertain
photos, we demonstrate that this model achieves a 50% performance improvement
over the single-image model
CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free
The emergence of CLIP has opened the way for open-world image perception. The
zero-shot classification capabilities of the model are impressive but are
harder to use for dense tasks such as image segmentation. Several methods have
proposed different modifications and learning schemes to produce dense output.
Instead, we propose in this work an open-vocabulary semantic segmentation
method, dubbed CLIP-DIY, which does not require any additional training or
annotations, but instead leverages existing unsupervised object localization
approaches. In particular, CLIP-DIY is a multi-scale approach that directly
exploits CLIP classification abilities on patches of different sizes and
aggregates the decision in a single map. We further guide the segmentation
using foreground/background scores obtained using unsupervised object
localization methods. With our method, we obtain state-of-the-art zero-shot
semantic segmentation results on PASCAL VOC and perform on par with the best
methods on COCO
Building an enhanced vocabulary of the robot environment with a ceiling pointing camera
Mobile robots are of great help for automatic monitoring tasks in different environments. One of the first tasks that needs to be addressed when creating these kinds of robotic systems is modeling the robot environment. This work proposes a pipeline to build an enhanced visual model of a robot environment indoors. Vision based recognition approaches frequently use quantized feature spaces, commonly known as Bag of Words (BoW) or vocabulary representations. A drawback using standard BoW approaches is that semantic information is not considered as a criteria to create the visual words. To solve this challenging task, this paper studies how to leverage the standard vocabulary construction process to obtain a more meaningful visual vocabulary of the robot work environment using image sequences. We take advantage of spatio-temporal constraints and prior knowledge about the position of the camera. The key contribution of our work is the definition of a new pipeline to create a model of the environment. This pipeline incorporates (1) tracking information to the process of vocabulary construction and (2) geometric cues to the appearance descriptors. Motivated by long term robotic applications, such as the aforementioned monitoring tasks, we focus on a configuration where the robot camera points to the ceiling, which captures more stable regions of the environment. The experimental validation shows how our vocabulary models the environment in more detail than standard vocabulary approaches, without loss of recognition performance. We show different robotic tasks that could benefit of the use of our visual vocabulary approach, such as place recognition or object discovery. For this validation, we use our publicly available data-set
Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey
The recent enthusiasm for open-world vision systems show the high interest of
the community to perform perception tasks outside of the closed-vocabulary
benchmark setups which have been so popular until now. Being able to discover
objects in images/videos without knowing in advance what objects populate the
dataset is an exciting prospect. But how to find objects without knowing
anything about them? Recent works show that it is possible to perform
class-agnostic unsupervised object localization by exploiting self-supervised
pre-trained features. We propose here a survey of unsupervised object
localization methods that discover objects in images without requiring any
manual annotation in the era of self-supervised ViTs. We gather links of
discussed methods in the repository
https://github.com/valeoai/Awesome-Unsupervised-Object-Localization
- …