TennisVid2Text: Fine-grained Descriptions for Domain Specific Videos
Automatically describing videos has long been a fascinating problem. In this work, we
attempt to describe videos from a specific domain - broadcast videos of lawn
tennis matches. Given a video shot from a tennis match, we intend to generate a
textual commentary similar to what a human expert would write on a sports
website. Unlike many recent works that focus on generating short captions, we
are interested in generating semantically richer descriptions. This demands a
detailed low-level analysis of the video content, especially the actions and
interactions among subjects. We address this by limiting our domain to the game
of lawn tennis. Rich descriptions are generated by leveraging a large corpus of
human-created descriptions harvested from the Internet. We evaluate our method on a
newly created tennis video dataset. Extensive analysis demonstrates that our
approach addresses both the semantic correctness and the readability aspects of
the task.
Comment: BMVC 2015
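A minimal sketch of the corpus-retrieval idea above: score each harvested human-written commentary against features extracted from the video shot and return the best match. The feature dimensionality, the toy corpus, and cosine matching are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

# Hypothetical corpus of human-written commentaries, each paired with a
# feature vector describing the rally it annotates (placeholder data).
corpus_text = [
    "Player serves wide and finishes the point with a forehand winner.",
    "A long baseline rally ends with a backhand error into the net.",
]
corpus_feats = np.random.rand(len(corpus_text), 128)

def describe_shot(video_feats: np.ndarray) -> str:
    """Return the corpus commentary whose features best match the shot.

    `video_feats` stands in for the low-level analysis of the shot
    (player actions, rally events); cosine similarity is only an
    illustrative matching rule, not the paper's model.
    """
    sims = corpus_feats @ video_feats
    sims /= np.linalg.norm(corpus_feats, axis=1) * np.linalg.norm(video_feats)
    return corpus_text[int(np.argmax(sims))]

print(describe_shot(np.random.rand(128)))
```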
Grounding Spatio-Semantic Referring Expressions for Human-Robot Interaction
Human language is one of the most natural interfaces for humans to
interact with robots. This paper presents a robot system that retrieves
everyday objects with unconstrained natural language descriptions. A core issue
for the system is semantic and spatial grounding, which is to infer objects and
their spatial relationships from images and natural language expressions. We
introduce a two-stage neural-network grounding pipeline that maps natural
language referring expressions directly to objects in the images. The first
stage uses visual descriptions in the referring expressions to generate a
candidate set of relevant objects. The second stage examines all pairwise
relationships between the candidates and predicts the most likely referred
object according to the spatial descriptions in the referring expressions. A
key feature of our system is that by leveraging a large dataset of images
labeled with text descriptions, it allows unrestricted object types and natural
language referring expressions. Preliminary results indicate that our system
outperforms a near state-of-the-art object comprehension system on standard
benchmark datasets. We also present a robot system that follows voice commands
to pick and place previously unseen objects.
Comment: 8 pages, 4 figures. Accepted at the RSS 2017 Workshop on Spatial-Semantic
Representations in Robotics
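The two-stage pipeline can be pictured roughly as follows: a first pass keeps candidate objects whose appearance matches the visual part of the expression, and a second pass scores all pairwise spatial relationships among the survivors and picks the most likely referent. The scoring functions and object names below are stand-ins for the paper's neural networks, not its implementation.

```python
from itertools import permutations

# Placeholder detections: (object id, visual-match score for the expression).
candidates = {"cup_1": 0.9, "cup_2": 0.8, "book_1": 0.7, "plate_1": 0.1}

def spatial_score(target: str, landmark: str, relation: str) -> float:
    """Stand-in for the stage-two network that scores how well the
    (target, landmark) pair satisfies the spatial phrase, e.g. 'left of'."""
    table = {("cup_1", "book_1", "left of"): 0.95,
             ("cup_2", "book_1", "left of"): 0.20}
    return table.get((target, landmark, relation), 0.05)

def ground(relation: str, visual_threshold: float = 0.5):
    # Stage 1: keep candidates whose appearance matches the expression.
    kept = [o for o, s in candidates.items() if s >= visual_threshold]
    # Stage 2: examine all ordered pairs and pick the most likely referent.
    best = max(permutations(kept, 2),
               key=lambda p: spatial_score(p[0], p[1], relation))
    return best[0]

print(ground("left of"))  # -> cup_1 under these toy scores
```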
Spatio-temporal Person Retrieval via Natural Language Queries
In this paper, we address the problem of spatio-temporal person retrieval
from multiple videos using a natural language query, in which we output a tube
(i.e., a sequence of bounding boxes) which encloses the person described by the
query. For this problem, we introduce a novel dataset consisting of videos
containing people annotated with bounding boxes for each second and with five
natural language descriptions. To retrieve the tube of the person described by
a given natural language query, we design a model that combines methods for
spatio-temporal human detection and multimodal retrieval. We conduct
comprehensive experiments to compare a variety of tube and text representations
and multimodal retrieval methods, and present a strong baseline for this task as
well as demonstrate the efficacy of our tube representation and multimodal
feature embedding technique. Finally, we demonstrate the versatility of our
model by applying it to two other important tasks.
Comment: Accepted to ICCV 2017
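A tube is simply a per-second sequence of bounding boxes; retrieval then reduces to ranking tube embeddings against a query embedding in a shared space. The random embeddings below stand in for the paper's human detector and text encoder.

```python
import numpy as np

# A "tube" is a sequence of (t, x1, y1, x2, y2) boxes, one per second,
# plus an embedding in a joint visual-text space (placeholder vectors).
rng = np.random.default_rng(0)
tubes = [
    {"boxes": [(t, 10, 20, 50, 120) for t in range(5)],
     "emb": rng.normal(size=256)}
    for _ in range(10)
]

def retrieve(query_emb: np.ndarray, top_k: int = 3):
    """Rank tubes by cosine similarity with the natural language query embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [cos(query_emb, t["emb"]) for t in tubes]
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), scores[i]) for i in order]

print(retrieve(rng.normal(size=256)))
```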
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
We introduce the dense captioning task, which requires a computer vision
system to both localize and describe salient regions in images in natural
language. The dense captioning task generalizes object detection when the
descriptions consist of a single word, and Image Captioning when one predicted
region covers the full image. To address the localization and description task
jointly we propose a Fully Convolutional Localization Network (FCLN)
architecture that processes an image with a single, efficient forward pass,
requires no external region proposals, and can be trained end-to-end with a
single round of optimization. The architecture is composed of a Convolutional
Network, a novel dense localization layer, and a Recurrent Neural Network
language model that generates the label sequences. We evaluate our network on
the Visual Genome dataset, which comprises 94,000 images and 4,100,000
region-grounded captions. We observe both speed and accuracy improvements over
baselines based on current state-of-the-art approaches in both generation and
retrieval settings.
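A drastically simplified structural sketch of the three components named above (convolutional backbone, localization layer, RNN language model) in a single forward pass; it omits the paper's region-feature interpolation and all training machinery, and every layer size is an arbitrary toy value.

```python
import torch
import torch.nn as nn

class TinyDenseCap(nn.Module):
    """Structural sketch only: backbone -> localization head -> RNN captioner."""
    def __init__(self, vocab_size=1000, hidden=128, num_regions=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4))                     # B x 64 x 4 x 4
        self.num_regions = num_regions
        # Localization head: per-region objectness score plus a box (x, y, w, h).
        self.loc = nn.Linear(64 * 4 * 4, num_regions * 5)
        # Per-region feature fed to the language model.
        self.region_feat = nn.Linear(64 * 4 * 4, num_regions * hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.word = nn.Linear(hidden, vocab_size)

    def forward(self, images, caption_len=8):
        b = images.size(0)
        feats = self.backbone(images).flatten(1)             # B x 1024
        loc = self.loc(feats).view(b, self.num_regions, 5)   # scores + boxes
        region = self.region_feat(feats).view(b, self.num_regions, -1)
        # Condition the RNN on each region feature, repeated over time steps.
        rnn_in = region.unsqueeze(2).expand(-1, -1, caption_len, -1)
        rnn_in = rnn_in.reshape(b * self.num_regions, caption_len, -1)
        out, _ = self.rnn(rnn_in)
        logits = self.word(out)          # (B*R) x T x vocab
        return loc, logits

model = TinyDenseCap()
scores_boxes, word_logits = model(torch.randn(2, 3, 64, 64))
print(scores_boxes.shape, word_logits.shape)
```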
A Comprehensive Survey of Deep Learning for Image Captioning
Generating a description of an image is called image captioning. Image
captioning requires recognizing the important objects, their attributes, and
their relationships in an image. It also requires generating syntactically and
semantically correct sentences. Deep learning-based techniques are capable of
handling the complexities and challenges of image captioning. In this survey
paper, we aim to present a comprehensive review of existing deep learning-based
image captioning techniques. We discuss the foundation of the techniques to
analyze their performances, strengths and limitations. We also discuss the
datasets and the evaluation metrics popularly used in deep learning-based
automatic image captioning.
Comment: 36 pages. Accepted as a journal paper in ACM Computing Surveys (October 2018).
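Deep learning-based captioning methods typically follow a CNN-encoder / RNN-decoder skeleton: the image is encoded into a feature vector that conditions a recurrent decoder emitting the sentence word by word. A toy sketch of that generic skeleton (not any specific surveyed model):

```python
import torch
import torch.nn as nn

class ToyCaptioner(nn.Module):
    """Generic CNN-encoder / RNN-decoder skeleton; sizes are arbitrary toy values."""
    def __init__(self, vocab_size=1000, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden))
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image, caption_tokens):
        h0 = self.encoder(image).unsqueeze(0)    # image feature initializes the decoder
        x = self.embed(caption_tokens)
        out, _ = self.decoder(x, h0)
        return self.out(out)                     # per-step word logits

logits = ToyCaptioner()(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
print(logits.shape)   # torch.Size([2, 7, 1000])
```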
Dialog-based Interactive Image Retrieval
Existing methods for interactive image retrieval have demonstrated the merit
of integrating user feedback to improve retrieval results. However, most
current systems rely on restricted forms of user feedback, such as binary
relevance responses, or feedback based on a fixed set of relative attributes,
which limits their impact. In this paper, we introduce a new approach to
interactive image search that enables users to provide feedback via natural
language, allowing for more natural and effective interaction. We formulate the
task of dialog-based interactive image retrieval as a reinforcement learning
problem, and reward the dialog system for improving the rank of the target
image during each dialog turn. To mitigate the cumbersome and costly process of
collecting human-machine conversations as the dialog system learns, we train
our system with a user simulator, which is itself trained to describe the
differences between target and candidate images. The efficacy of our approach
is demonstrated in a footwear retrieval application. Experiments on both
simulated and real-world data show that 1) our proposed learning framework
achieves better accuracy than other supervised and reinforcement learning
baselines and 2) user feedback based on natural language rather than
pre-specified attributes leads to more effective retrieval results, and a more
natural and expressive communication interface.
Comment: Accepted at NeurIPS 2018
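The per-turn reward described above can be written down directly: after each dialog turn the candidates are re-scored, and the reward is how much the target image's rank improves. The retrieval scores below are placeholders for the dialog system's retriever.

```python
import numpy as np

def rank_of_target(scores: np.ndarray, target: int) -> int:
    """1-based rank of the target image under the current retrieval scores."""
    order = np.argsort(scores)[::-1]
    return int(np.where(order == target)[0][0]) + 1

def turn_reward(prev_scores, new_scores, target):
    """Reward = improvement in the target's rank after this dialog turn."""
    return rank_of_target(prev_scores, target) - rank_of_target(new_scores, target)

prev = np.array([0.2, 0.9, 0.4, 0.1])   # target (index 0) ranked 3rd
new = np.array([0.8, 0.7, 0.4, 0.1])    # after feedback, ranked 1st
print(turn_reward(prev, new, target=0))  # -> 2
```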
Learning to Disambiguate by Asking Discriminative Questions
The ability to ask questions is a powerful tool to gather information in
order to learn about the world and resolve ambiguities. In this paper, we
explore a novel problem of generating discriminative questions to help
disambiguate visual instances. Our work can be seen as a complement and new
extension to the rich research studies on image captioning and question
answering. We introduce the first large-scale dataset with over 10,000
carefully annotated image-question tuples to facilitate benchmarking. In
particular, each tuple consists of a pair of images and 4.6 discriminative
questions (as positive samples) and 5.9 non-discriminative questions (as
negative samples) on average. In addition, we present an effective method for
visual discriminative question generation. The method can be trained in a
weakly supervised manner, without discriminative image-question tuples, using just
existing visual question answering datasets. Promising results are shown
against representative baselines through quantitative evaluations and user
studies.
Comment: 14 pages, 12 figures, ICCV 2017
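One way to picture what makes a question discriminative for an image pair (an illustration of the idea, not the paper's training procedure): if a VQA model's answer distributions on the two images diverge strongly for the same question, that question helps tell the images apart. The answer distributions below are made up.

```python
import numpy as np

def discriminativeness(p_img1: np.ndarray, p_img2: np.ndarray) -> float:
    """Symmetric KL divergence between the answer distributions a VQA model
    assigns to the same question on two images (toy scoring rule)."""
    eps = 1e-9
    kl = lambda p, q: float(np.sum(p * np.log((p + eps) / (q + eps))))
    return 0.5 * (kl(p_img1, p_img2) + kl(p_img2, p_img1))

# Answers over ("red", "blue", "yes", "no") for "What color is the mug?"
q_color = (np.array([0.90, 0.05, 0.03, 0.02]), np.array([0.05, 0.90, 0.03, 0.02]))
# Same image pair, "Is it indoors?" -- both answer "yes", so it is non-discriminative.
q_indoor = (np.array([0.01, 0.01, 0.95, 0.03]), np.array([0.02, 0.02, 0.90, 0.06]))

print(discriminativeness(*q_color) > discriminativeness(*q_indoor))  # True
```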
DeepStyle: Multimodal Search Engine for Fashion and Interior Design
In this paper, we propose a multimodal search engine that combines visual and
textual cues to retrieve items from a multimedia database that are aesthetically
similar to the query. The goal of our engine is to enable intuitive retrieval of
fashion and interior design items such as clothes or furniture. Existing search engines treat
textual input only as an additional source of information about the query image
and do not correspond to the real-life scenario where the user looks for 'the
same shirt but of denim'. Our novel method, dubbed DeepStyle, mitigates those
shortcomings by using a joint neural network architecture to model contextual
dependencies between features of different modalities. We prove the robustness
of this approach on two different challenging datasets of fashion items and
furniture where our DeepStyle engine outperforms baseline methods by 18-21% on
the tested datasets. Our search engine is commercially deployed and available
through a Web-based application.
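The "same shirt but of denim" scenario amounts to fusing the query image's features with the text modifier in one joint network and ranking the catalog in that joint space. The tiny fusion module below only illustrates that idea; it is not the DeepStyle architecture, and all dimensions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyJointQuery(nn.Module):
    """Fuse an image feature with a text-modifier feature into one
    query vector used to rank catalog items (illustrative only)."""
    def __init__(self, img_dim=512, txt_dim=300, joint_dim=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim))

    def forward(self, img_feat, txt_feat):
        return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

model = ToyJointQuery()
query = model(torch.randn(1, 512), torch.randn(1, 300))       # fused query
catalog = torch.randn(1000, 256)                               # item embeddings
scores = F.cosine_similarity(query, catalog)                   # rank all items
print(catalog[scores.topk(5).indices].shape)                   # top-5 retrieved
```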
A Procedural Texture Generation Framework Based on Semantic Descriptions
Procedural textures are normally generated from mathematical models with
parameters carefully selected by experienced users. However, for naive users,
the intuitive way to obtain a desired texture is to provide semantic
descriptions such as "regular," "lacelike," and "repetitive" and then a
procedural model with proper parameters will be automatically suggested to
generate the corresponding textures. By contrast, it is less practical for
users to learn mathematical models and tune parameters based on multiple
examinations of large numbers of generated textures. In this study, we propose
a novel framework that generates procedural textures according to user-defined
semantic descriptions, and we establish a mapping between procedural models and
semantic texture descriptions. First, based on a vocabulary of semantic
attributes collected from psychophysical experiments, a multi-label learning
method is employed to annotate a large number of textures with semantic
attributes to form a semantic procedural texture dataset. Then, we derive a
low-dimensional semantic space in which the semantic descriptions can be separated
from one another. Finally, given a set of semantic descriptions, the diverse
properties of the samples in the semantic space can guide the framework to an
appropriate generation model and parameter settings that produce a
desired texture. The experimental results show that the proposed framework is
effective and that the generated textures closely correlate with the input
semantic descriptions.
Comment: 9 pages, 10 figures
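The final step of the framework can be pictured as a nearest-neighbour lookup: project the requested attributes into the semantic space and return the procedural model and parameters of the closest annotated sample. The projection matrix, samples, and parameters below are placeholder data, not the learned mapping.

```python
import numpy as np

rng = np.random.default_rng(1)
attribute_vocab = ["regular", "lacelike", "repetitive", "rough"]

# Placeholder projection from attribute vectors into a low-dimensional
# semantic space, plus annotated samples carrying a generator and its parameters.
projection = rng.normal(size=(len(attribute_vocab), 3))
samples = [
    {"pos": rng.normal(size=3), "model": "perlin",  "params": {"octaves": 4}},
    {"pos": rng.normal(size=3), "model": "voronoi", "params": {"cells": 32}},
]

def suggest_generator(requested):
    """Map the requested attributes into the semantic space and return the
    procedural model/parameters of the nearest annotated sample."""
    attr = np.array([a in requested for a in attribute_vocab], dtype=float)
    point = attr @ projection
    nearest = min(samples, key=lambda s: np.linalg.norm(s["pos"] - point))
    return nearest["model"], nearest["params"]

print(suggest_generator(["regular", "repetitive"]))
```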
Attention-based Natural Language Person Retrieval
Following the recent progress in image classification and captioning using
deep learning, we develop a novel natural language person retrieval system
based on an attention mechanism. More specifically, given the description of a
person, the goal is to localize the person in an image. To this end, we first
construct a benchmark dataset for natural language person retrieval. To do so,
we generate bounding boxes for persons in a public image dataset from the
segmentation masks, which are then annotated with descriptions and attributes
using Amazon Mechanical Turk. We then adopt the region proposal network from
Faster R-CNN as a candidate region generator. The cropped images based on the
region proposals as well as the whole images with attention weights are fed
into Convolutional Neural Networks for visual feature extraction, while the
natural language expression and attributes are input to Bidirectional Long
Short-Term Memory (BLSTM) models for text feature extraction. The visual and
text features are integrated to score region proposals, and the one with the
highest score is retrieved as the output of our system. The experimental
results show a significant improvement over the state-of-the-art method for
generic object retrieval, and this line of research promises to benefit search
in surveillance video footage.
Comment: CVPR 2017 Workshop (vision meets cognition)