Clue: Cross-modal Coherence Modeling for Caption Generation
We use coherence relations inspired by computational models of discourse to
study the information needs and goals of image captioning. Using an annotation
protocol specifically devised for capturing image--caption coherence relations,
we annotate 10,000 instances from publicly-available image--caption pairs. We
introduce a new task for learning inferences in imagery and text, coherence
relation prediction, and show that these coherence annotations can be exploited
to learn relation classifiers as an intermediary step, and also train
coherence-aware, controllable image captioning models. The results show a
dramatic improvement in the consistency and quality of the generated captions
with respect to information needs specified via coherence relations.
Comment: Accepted as a long paper to ACL 202
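As a rough sketch of the coherence relation prediction task described above (not the paper's actual model; the relation label set and the linear scoring are illustrative assumptions), a classifier maps a fused image--caption feature vector to a distribution over discrete coherence relations:

```python
import math

# Illustrative coherence relation labels -- a hypothetical set for this sketch.
RELATIONS = ["Visible", "Subjective", "Action", "Story", "Meta"]

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def predict_relation(fused_features, weights, bias):
    """Score each coherence relation from a fused image--caption feature
    vector with a simple linear layer, then normalize with softmax."""
    logits = [
        sum(w * f for w, f in zip(weights[k], fused_features)) + bias[k]
        for k in range(len(RELATIONS))
    ]
    probs = softmax(logits)
    best = max(range(len(RELATIONS)), key=lambda k: probs[k])
    return RELATIONS[best], probs
```

The predicted relation can then condition a captioning decoder, which is how a relation classifier serves as the intermediary step the abstract mentions.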
Generating Video Descriptions with Topic Guidance
Generating video descriptions in natural language (a.k.a. video captioning)
is a more challenging task than image captioning as the videos are
intrinsically more complicated than images in two aspects. First, videos cover
a broader range of topics, such as news, music, sports and so on. Second,
multiple topics could coexist in the same video. In this paper, we propose a
novel caption model, topic-guided model (TGM), to generate topic-oriented
descriptions for videos in the wild via exploiting topic information. In
addition to predefined topics, i.e., category tags crawled from the web, we
also mine topics in a data-driven way based on training captions by an
unsupervised topic mining model. We show that data-driven topics reflect a
better topic schema than the predefined topics. As for testing video topic
prediction, we treat the topic mining model as the teacher and train the
student, a topic prediction model, using the full set of modalities in the
video, especially the speech modality. We propose a series of caption models to
exploit topic guidance, including implicitly using the topics as input features
to generate words related to the topic and explicitly modifying the weights in
the decoder with topics to function as an ensemble of topic-aware language
decoders. Our comprehensive experimental results on the current largest video
caption dataset MSR-VTT prove the effectiveness of our topic-guided model,
which significantly surpasses the winning performance in the 2016 MSR video to
language challenge.
Comment: Appeared at ICMR 201
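The "ensemble of topic-aware language decoders" idea can be sketched as a mixture model (a simplified illustration, not the paper's implementation): each topic contributes its own next-word distribution, and the predicted topic weights mix them at every decoding step.

```python
def topic_mixture_next_word(per_topic_probs, topic_weights):
    """Explicit topic guidance as a mixture of topic-aware decoders:
    per_topic_probs is a list of next-word distributions (dicts over a
    shared vocabulary), one per topic; topic_weights are the predicted
    topic probabilities for the current video."""
    vocab = per_topic_probs[0].keys()
    return {
        w: sum(tw * dist[w] for tw, dist in zip(topic_weights, per_topic_probs))
        for w in vocab
    }
```

For a video predicted to be mostly about sports, the mixed distribution leans toward sports vocabulary while still borrowing mass from co-occurring topics, matching the abstract's observation that multiple topics can coexist in one video.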
Neural Architecture Search using Deep Neural Networks and Monte Carlo Tree Search
Neural Architecture Search (NAS) has shown great success in automating the
design of neural networks, but the prohibitive amount of computations behind
current NAS methods requires further investigations in improving the sample
efficiency and the network evaluation cost to get better results in a shorter
time. In this paper, we present a novel scalable Monte Carlo Tree Search (MCTS)
based NAS agent, named AlphaX, to tackle these two aspects. AlphaX improves the
search efficiency by adaptively balancing the exploration and exploitation at
the state level, and by using a Meta-Deep Neural Network (Meta-DNN) to predict network
accuracies for biasing the search toward a promising region. To amortize the
network evaluation cost, AlphaX accelerates MCTS rollouts with a distributed
design and reduces the number of epochs in evaluating a network by transfer
learning, which is guided with the tree structure in MCTS. In 12 GPU days and
1000 samples, AlphaX found an architecture that reaches 97.84% top-1 accuracy
on CIFAR-10, and 75.5% top-1 accuracy on ImageNet, exceeding SOTA NAS methods
in both the accuracy and sampling efficiency. Particularly, we also evaluate
AlphaX on NASBench-101, a large scale NAS dataset; AlphaX is 3x and 2.8x more
sample efficient than Random Search and Regularized Evolution in finding the
global optimum. Finally, we show the searched architecture improves a variety
of vision applications from Neural Style Transfer, to Image Captioning and
Object Detection.
Comment: To appear in the Thirty-Fourth AAAI conference on Artificial Intelligence (AAAI-2020
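The state-level exploration/exploitation balance in an MCTS-based NAS agent typically follows standard UCT selection. The sketch below shows that selection rule in isolation (an assumption about the general MCTS mechanism, not AlphaX's exact formula), where each child state is an architecture whose mean reward comes from evaluated or Meta-DNN-predicted accuracy:

```python
import math

def uct_select(children, c=1.4):
    """Pick the child state balancing exploitation (mean reward, e.g. measured
    or predicted validation accuracy) against exploration (a visit-count
    bonus), following the standard UCT rule. Each child is a dict with
    'visits' and cumulative 'value'."""
    total = sum(ch["visits"] for ch in children)

    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # always expand unvisited states first
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore

    return max(children, key=score)
```

Replacing some true evaluations with a learned accuracy predictor, as the abstract describes, biases this selection toward promising regions without paying the full training cost for every rollout.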
Evaluation of Automatic Video Captioning Using Direct Assessment
We present Direct Assessment, a method for manually assessing the quality of
automatically-generated captions for video. Evaluating the accuracy of video
captions is particularly difficult because for any given video clip there is no
definitive ground truth or correct answer against which to measure. Automatic
metrics for comparing automatic video captions against a manual caption such as
BLEU and METEOR, drawn from techniques used in evaluating machine translation,
were used in the TRECVid video captioning task in 2016 but these are shown to
have weaknesses. The work presented here brings human assessment into the
evaluation by crowdsourcing how well a caption describes a video. We
automatically degrade the quality of some sample captions which are assessed
manually and from this we are able to rate the quality of the human assessors,
a factor we take into account in the evaluation. Using data from the TRECVid
video-to-text task in 2016, we show that our direct assessment method is
replicable and robust, and should scale to settings where there are many
caption-generation techniques to be evaluated.
Comment: 26 pages, 8 figure
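Two mechanics of Direct Assessment can be sketched concretely (a simplified illustration; the z-normalization step and the 5-point margin are assumptions, not the paper's exact procedure): per-assessor score standardization removes individual scale bias, and deliberately degraded captions provide a quality check on each assessor.

```python
from statistics import mean, stdev

def standardize(scores):
    """Z-normalize one assessor's raw ratings (e.g. 0-100 sliders) so that
    harsh and lenient raters become comparable."""
    mu = mean(scores)
    sd = stdev(scores)
    return [(s - mu) / sd for s in scores] if sd > 0 else [0.0] * len(scores)

def reliable(original_scores, degraded_scores, margin=5.0):
    """An assessor passes the quality check if, on average, they rate
    deliberately degraded captions lower than the originals by some margin."""
    return mean(original_scores) - mean(degraded_scores) > margin
```

Only scores from assessors who pass the degradation check would then feed into the per-system averages, which is how assessor quality is "taken into account in the evaluation".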
Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
Recognising objects according to a pre-defined, fixed set of class labels has
been well studied in Computer Vision. However, there are many practical
applications where the subjects of interest are not known beforehand, or are
not so easily delineated. In many of these cases natural language dialog is a
natural way to specify the subject of interest, and the task of achieving this
capability (a.k.a. Referring Expression Comprehension) has recently attracted
attention. To this end we propose a unified framework, the
ParalleL AttentioN (PLAN) network, to discover the object in an image that is
being referred to in variable-length natural-language expressions, from short
phrase queries to long multi-round dialogs. The PLAN network has two
attention mechanisms that relate parts of the expressions to both the global
visual content and also directly to object candidates. Furthermore, the
attention mechanisms are recurrent, making the referring process visualizable
and explainable. The attended information from these dual sources is combined
to reason about the referred object. These two attention mechanisms can be
trained in parallel, and we find that the combined system outperforms the
state of the art on several benchmark datasets with language inputs of
different lengths, such as RefCOCO, RefCOCO+ and GuessWhat?!.
Comment: 11 page
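The dual-attention structure can be sketched with plain dot-product attention (a minimal illustration of the general mechanism, not PLAN's recurrent formulation): one branch attends over global image regions, the other directly over object candidates, and the two attended contexts are concatenated for the final reasoning step.

```python
import math

def attend(query, keys):
    """Dot-product attention: weight each key vector by its softmaxed
    similarity to the query, and return the weighted sum plus the weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    attended = [
        sum(w * key[i] for w, key in zip(weights, keys))
        for i in range(len(keys[0]))
    ]
    return attended, weights

def parallel_attention(query, global_regions, object_candidates):
    """Run the two attention branches in parallel and concatenate their
    contexts; the candidate attention weights indicate the referred object."""
    ctx_global, _ = attend(query, global_regions)
    ctx_objects, obj_weights = attend(query, object_candidates)
    return ctx_global + ctx_objects, obj_weights
```

Because the candidate-branch weights are explicit probabilities over objects, they can be inspected at each dialog round, which is what makes the referring process visualizable.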