Prediction and Description of Near-Future Activities in Video
Most of the existing works on human activity analysis focus on recognition or
early recognition of the activity labels from complete or partial observations.
Similarly, existing video captioning approaches focus on the observed events in
videos. Predicting the labels and the captions of future activities where no
frames of the predicted activities have been observed is a challenging problem,
with important applications that require anticipatory response. In this work,
we propose a system that can infer the labels and the captions of a sequence of
future activities. Our proposed network for predicting the labels of a future
activity sequence resembles a hybrid Siamese network with three branches: the
first takes visual features from the objects present in the scene, the second
takes features of the observed activity sequence, and the third captures
features of the last observed activity. The predicted labels and the
observed scene context are then mapped to meaningful captions using a
sequence-to-sequence learning-based method. Experiments on three challenging
activity analysis datasets and a video description dataset demonstrate that
both our label prediction framework and our captioning framework outperform the
state of the art.
Comment: 14 pages, 4 figures, 14 tables
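As a concrete illustration of the architecture sketched above, the following is a minimal three-branch fusion network with concatenation-based fusion; all names, dimensions, and layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ThreeBranchPredictor(nn.Module):
    """Sketch: fuse scene-object, activity-sequence, and last-activity features."""
    def __init__(self, obj_dim, act_dim, hidden=256, num_labels=50):
        super().__init__()
        self.obj_branch = nn.Sequential(nn.Linear(obj_dim, hidden), nn.ReLU())
        self.seq_branch = nn.GRU(act_dim, hidden, batch_first=True)   # observed sequence
        self.last_branch = nn.Sequential(nn.Linear(act_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(3 * hidden, num_labels)

    def forward(self, obj_feats, act_seq, last_act):
        h_obj = self.obj_branch(obj_feats)        # scene-object features
        _, h_seq = self.seq_branch(act_seq)       # summary of observed activities
        h_last = self.last_branch(last_act)       # most recent activity
        fused = torch.cat([h_obj, h_seq.squeeze(0), h_last], dim=-1)
        return self.classifier(fused)             # logits over next-activity labels

model = ThreeBranchPredictor(obj_dim=2048, act_dim=512)
logits = model(torch.randn(4, 2048), torch.randn(4, 10, 512), torch.randn(4, 512))
```

The predicted label sequence would then be fed, together with scene context, into a separate sequence-to-sequence captioner.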
TennisVid2Text: Fine-grained Descriptions for Domain Specific Videos
Automatically describing videos has always been fascinating. In this work, we
attempt to describe videos from a specific domain - broadcast videos of lawn
tennis matches. Given a video shot from a tennis match, we intend to generate a
textual commentary similar to what a human expert would write on a sports
website. Unlike many recent works that focus on generating short captions, we
are interested in generating semantically richer descriptions. This demands a
detailed low-level analysis of the video content, especially the actions and
interactions among subjects. We address this by limiting our domain to the game
of lawn tennis. Rich descriptions are generated by leveraging a large corpus of
human-created descriptions harvested from the Internet. We evaluate our method
on a newly created tennis video dataset. Extensive analysis demonstrates that
our approach addresses both the semantic correctness and the readability
aspects of the task.
Comment: BMVC 2015
SoDeep: a Sorting Deep net to learn ranking loss surrogates
Several tasks in machine learning are evaluated using non-differentiable
metrics such as mean average precision or Spearman correlation. However, their
non-differentiability prevents their use as objective functions in a
learning framework. Surrogate and relaxation methods exist but tend to be
specific to a given metric.
In the present work, we introduce a new method to learn approximations of
such non-differentiable objective functions. Our approach is based on a deep
architecture that approximates the sorting of arbitrary sets of scores. It is
trained virtually for free using synthetic data. This sorting deep (SoDeep) net
can then be combined in a plug-and-play manner with existing deep
architectures. We demonstrate the usefulness of our approach on three different
tasks that require ranking: cross-modal text-image retrieval, multi-label image
classification, and visual memorability ranking. Our approach yields very
competitive results on these three tasks, which validates the merit and the
flexibility of SoDeep as a proxy for the sorting operation in ranking-based
losses.
Comment: Accepted to CVPR 2019
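The core idea invites a short sketch: train a small network on synthetic score/rank pairs to approximate the non-differentiable rank operator, then reuse it as a differentiable proxy inside a ranking loss. The architecture, sizes, and losses below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

N = 16  # number of scores ranked jointly

# Small net mapping raw scores to approximate normalized ranks.
sorter = nn.Sequential(
    nn.Linear(N, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, N),
)

# "Virtually free" training: synthetic scores, exact ranks as targets.
opt = torch.optim.Adam(sorter.parameters(), lr=1e-3)
for step in range(1000):
    scores = torch.randn(64, N)
    ranks = scores.argsort(dim=1).argsort(dim=1).float() / (N - 1)  # true ranks
    loss = nn.functional.l1_loss(sorter(scores), ranks)
    opt.zero_grad(); loss.backward(); opt.step()

# Plug-and-play use: a differentiable, Spearman-like ranking loss.
def rank_surrogate_loss(pred_scores, target_ranks):
    approx_ranks = sorter(pred_scores)  # differentiable w.r.t. the scores
    return nn.functional.mse_loss(approx_ranks, target_ranks)
```

Once trained, the sorter's weights stay fixed while gradients flow through it to whatever upstream model produces the scores.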
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
low-level sensation and high-level content excitement must be bridged using
appropriate strategies. In this regard, a content-aware approach is required to
determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT)
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences often
use prior word-level knowledge. The current study aims to leverage visual
information in order to capture sentence-level semantics without the need for
word embeddings. We use a multimodal sentence encoder trained on a corpus of
images with matching text captions to produce visually grounded sentence
embeddings. Deep neural networks are trained to map the two modalities to a
common embedding space such that for an image the corresponding caption can be
retrieved and vice versa. We show that our model achieves results comparable to
the current state-of-the-art on two popular image-caption retrieval benchmark
data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the
resulting sentence embeddings using the data from the Semantic Textual
Similarity benchmark task and show that the multimodal embeddings correlate
well with human semantic similarity judgements. The system achieves
state-of-the-art results on several of these benchmarks, which shows that a
system trained solely on multimodal data, without assuming any word
representations, is able to capture sentence-level semantics. Importantly, this
result shows that we do not need prior knowledge of lexical-level semantics in
order to model sentence-level semantics. These findings demonstrate the
importance of visual information in semantics.
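A minimal sketch of such a joint embedding setup follows, assuming a character-level sentence encoder (consistent with avoiding word embeddings) and a bidirectional margin-based triplet loss; the abstract does not specify these details, so treat them as illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedEncoder(nn.Module):
    """Sketch: map images and captions into one L2-normalized space."""
    def __init__(self, img_dim=2048, char_vocab=100, emb=128, joint=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint)      # image side
        self.char_emb = nn.Embedding(char_vocab, emb)  # characters, not words
        self.sent_rnn = nn.GRU(emb, joint, batch_first=True)

    def forward(self, img_feats, char_ids):
        v = F.normalize(self.img_proj(img_feats), dim=-1)
        _, h = self.sent_rnn(self.char_emb(char_ids))
        s = F.normalize(h.squeeze(0), dim=-1)
        return v, s

def bidirectional_triplet_loss(v, s, margin=0.2):
    sims = v @ s.t()                                  # image-caption similarities
    pos = sims.diag().unsqueeze(1)                    # matching pairs on diagonal
    cost_s = (margin + sims - pos).clamp(min=0)       # caption-retrieval errors
    cost_v = (margin + sims - pos.t()).clamp(min=0)   # image-retrieval errors
    mask = torch.eye(len(v), dtype=torch.bool)
    return cost_s.masked_fill(mask, 0.0).mean() + cost_v.masked_fill(mask, 0.0).mean()

enc = GroundedEncoder()
v, s = enc(torch.randn(32, 2048), torch.randint(0, 100, (32, 40)))
loss = bidirectional_triplet_loss(v, s)
```

After training, the caption embeddings can be reused directly as sentence representations for tasks such as semantic textual similarity.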
Semantic speech retrieval with a visually grounded model of untranscribed speech
There is growing interest in models that can learn from unlabelled speech
paired with visual context. This setting is relevant for low-resource speech
processing, robotics, and human language acquisition research. Here we study
how a visually grounded speech model, trained on images of scenes paired with
spoken captions, captures aspects of semantics. We use an external image tagger
to generate soft text labels from images, which serve as targets for a neural
model that maps untranscribed speech to (semantic) keyword labels. We introduce
a newly collected data set of human semantic relevance judgements and an
associated task, semantic speech retrieval, where the goal is to search for
spoken utterances that are semantically relevant to a given text query. Without
seeing any text, the model trained on parallel speech and images achieves a
precision of almost 60% on its top ten semantic retrievals. Compared to a
supervised model trained on transcriptions, our model matches human judgements
better by some measures, especially in retrieving non-verbatim semantic
matches. We perform an extensive analysis of the model and its resulting
representations.
Comment: 10 pages, 3 figures, 5 tables; accepted to IEEE/ACM Transactions on
Audio, Speech, and Language Processing
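A compact sketch of the supervision described above: soft labels from an image tagger act as multi-label targets for a network that maps untranscribed speech to keyword scores. The layer choices and the binary cross-entropy objective are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpeechKeywordNet(nn.Module):
    """Sketch: untranscribed audio in, per-keyword relevance scores out."""
    def __init__(self, n_mels=40, vocab=1000):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, 64, kernel_size=9, padding=4)
        self.rnn = nn.GRU(64, 256, batch_first=True)
        self.out = nn.Linear(256, vocab)

    def forward(self, mels):                    # mels: (batch, n_mels, time)
        x = torch.relu(self.conv(mels)).transpose(1, 2)
        _, h = self.rnn(x)
        return torch.sigmoid(self.out(h.squeeze(0)))  # per-keyword scores

model = SpeechKeywordNet()
audio = torch.randn(8, 40, 300)
soft_targets = torch.rand(8, 1000)  # stand-in for image-tagger soft labels
loss = nn.functional.binary_cross_entropy(model(audio), soft_targets)
```

For semantic speech retrieval, utterances would then be ranked by their predicted score for the keyword of a text query.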
mAnI: Movie Amalgamation using Neural Imitation
Cross-modal data retrieval has been the basis of various creative tasks
performed by Artificial Intelligence (AI). One such highly challenging task for
AI is to convert a book into its corresponding movie, as many creative
filmmakers do today. In this research, we take the first step
towards it by visualizing the content of a book using its corresponding movie
visuals. Given a set of sentences from a book or even a fan-fiction written in
the same universe, we employ deep learning models to visualize the input by
stitching together relevant frames from the movie. We study and compare three
different settings for matching the book with the movie content: (i) Dialog
model: using only the dialog from the movie; (ii) Visual model: using only the
visual content from the movie; and (iii) Hybrid model: using both the dialog
and the visual content from the movie. Experiments on the publicly available
MovieBook dataset show the effectiveness of the proposed models.
Comment: Accepted at the ML4Creativity workshop at KDD 2017. Preprint
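The hybrid matching step can be sketched as a weighted combination of dialog and visual similarities, assuming pre-encoded embeddings in a shared space; the encoders and the simple weighted sum are illustrative, not the paper's exact models.

```python
import numpy as np

def hybrid_match(sent_vec, dialog_vecs, visual_vecs, alpha=0.5):
    """Return the index of the movie segment best matching one book sentence."""
    def cos(a, B):
        return B @ a / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)
    scores = alpha * cos(sent_vec, dialog_vecs) + (1 - alpha) * cos(sent_vec, visual_vecs)
    return int(scores.argmax())

# Stitch a clip: one best segment per sentence of the input passage.
sentences = np.random.randn(5, 300)    # pre-encoded book sentences
dialogs = np.random.randn(200, 300)    # pre-encoded movie dialog
visuals = np.random.randn(200, 300)    # frame embeddings in the same space
clip = [hybrid_match(s, dialogs, visuals) for s in sentences]
```

Setting alpha to 1 or 0 recovers the pure Dialog and pure Visual settings, respectively.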
Attention-based Natural Language Person Retrieval
Following the recent progress in image classification and captioning using
deep learning, we develop a novel natural language person retrieval system
based on an attention mechanism. More specifically, given the description of a
person, the goal is to localize the person in an image. To this end, we first
construct a benchmark dataset for natural language person retrieval. To do so,
we generate bounding boxes for persons in a public image dataset from the
segmentation masks, which are then annotated with descriptions and attributes
using Amazon Mechanical Turk. We then adopt the region proposal network of
Faster R-CNN as a candidate region generator. The cropped images based on the
region proposals as well as the whole images with attention weights are fed
into Convolutional Neural Networks for visual feature extraction, while the
natural language expression and attributes are input to Bidirectional Long
Short-Term Memory (BLSTM) models for text feature extraction. The visual and
text features are integrated to score region proposals, and the one with the
highest score is retrieved as the output of our system. The experimental
results show a significant improvement over the state-of-the-art method for
generic object retrieval, and this line of research promises to benefit search
in surveillance video footage.
Comment: CVPR 2017 Workshop (vision meets cognition)
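The final scoring stage might look like the following sketch, which fuses each proposal's visual feature with the text encoding and keeps the highest-scoring region; fusion by concatenation plus an MLP is an assumption for illustration.

```python
import torch
import torch.nn as nn

class RegionScorer(nn.Module):
    """Sketch: score each region proposal against one text query."""
    def __init__(self, vis_dim=2048, txt_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, region_feats, text_feat):
        # region_feats: (num_proposals, vis_dim); text_feat: (txt_dim,)
        txt = text_feat.expand(region_feats.size(0), -1)  # broadcast query
        return self.mlp(torch.cat([region_feats, txt], dim=1)).squeeze(1)

scorer = RegionScorer()
scores = scorer(torch.randn(100, 2048), torch.randn(512))
best_box = scores.argmax()  # proposal retrieved as the described person
```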
Learning to discover and localize visual objects with open vocabulary
To alleviate the cost of obtaining accurate bounding boxes for training
today's state-of-the-art object detection models, recent weakly supervised
detection work has proposed techniques to learn from image-level labels.
However, requiring discrete image-level labels is both restrictive and
suboptimal. Real-world "supervision" usually consists of more unstructured
text, such as captions. In this work we learn association maps between images
and captions. We then use a novel objectness criterion to rank the resulting
candidate boxes, such that high-ranking boxes have strong gradients along all
edges. Thus, we can detect objects beyond a fixed object category vocabulary,
if those objects are frequent and distinctive enough. We show that our
objectness criterion improves the proposed bounding boxes in relation to prior
weakly supervised detection methods. Further, we show encouraging results on
object detection from image-level captions only.
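The objectness criterion can be made concrete with a short sketch: score a box by the gradient strength along its four edges, so that only boxes with strong gradients on all edges rank highly. Taking the minimum of the mean edge gradients is an illustrative choice, not necessarily the paper's exact formulation.

```python
import numpy as np

def edge_objectness(grad_mag, box):
    """grad_mag: (H, W) gradient-magnitude map; box: (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    edges = [grad_mag[y0, x0:x1],       # top edge
             grad_mag[y1 - 1, x0:x1],   # bottom edge
             grad_mag[y0:y1, x0],       # left edge
             grad_mag[y0:y1, x1 - 1]]   # right edge
    # Requiring strong gradients on *all* edges -> score by the weakest edge.
    return min(e.mean() for e in edges)

image = np.random.rand(240, 320)
gy, gx = np.gradient(image)
grad_mag = np.hypot(gx, gy)
boxes = [(10, 20, 110, 120), (50, 60, 200, 180)]
ranked = sorted(boxes, key=lambda b: edge_objectness(grad_mag, b), reverse=True)
```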
Visual Relationship Detection using Scene Graphs: A Survey
Understanding a scene by decoding the visual relationships depicted in an
image has been a long-studied problem. While recent advances in deep
learning and the use of deep neural networks have achieved near-human
accuracy on many tasks, a considerable gap remains between human- and
machine-level performance on various visual relationship detection tasks.
Building on earlier tasks such as object recognition, segmentation, and
captioning, which focused on relatively coarse image understanding, newer
tasks have recently been introduced to deal with a finer level of image
understanding. A scene graph is one such technique to better
represent a scene and the various relationships present in it. With its wide
range of applications in tasks such as Visual Question Answering, Semantic
Image Retrieval, and Image Generation, among many others, it has proved to
be a useful tool for deeper and better visual relationship understanding. In
this paper, we present a detailed survey of the various techniques for scene
graph generation, their efficacy in representing visual relationships, and how
scene graphs have been used to solve various downstream tasks. We also attempt
to analyze the directions in which the field might advance. As one of the
first papers to give a detailed survey on this topic, we also hope to give a
succinct introduction to scene graphs and to guide practitioners developing
approaches for their applications.
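To make the structure concrete, here is a minimal, generic scene-graph representation with objects as nodes and labelled directed edges as relationships; this is an illustration, not any particular paper's format.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)    # node labels, e.g. "man"
    relations: list = field(default_factory=list)  # (subj_idx, predicate, obj_idx)

    def add_object(self, label):
        self.objects.append(label)
        return len(self.objects) - 1

    def add_relation(self, subj, predicate, obj):
        self.relations.append((subj, predicate, obj))

g = SceneGraph()
man, horse = g.add_object("man"), g.add_object("horse")
g.add_relation(man, "riding", horse)  # encodes the triplet "man riding horse"
```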