Comparing Extant Story Classifiers: Results & New Directions
Having access to a large set of stories is a necessary first step for robust and wide-ranging computational narrative modeling; happily, language data, including stories, are increasingly available in electronic form. Unhappily, automatically separating stories from other forms of written discourse is not straightforward, and this has resulted in a data collection bottleneck. Researchers have therefore sought to develop reliable, robust automatic algorithms for identifying story text mixed with non-story text. In this paper we report on the reimplementation and experimental comparison of the two approaches to this task: Gordon's unigram classifier and Corman's semantic triplet classifier. We cross-analyze their performance on both Gordon's and Corman's corpora, discuss similarities, differences, and gaps in the performance of these classifiers, and point the way forward to improving their approaches.
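To make the idea of a unigram story classifier concrete, here is a minimal sketch. It assumes a multinomial naive Bayes model with add-one smoothing over whitespace tokens, which is a stand-in chosen for illustration and not necessarily the learner used in Gordon's classifier. The class name `UnigramNB`, the helper `tokenize`, and the toy training texts are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercased whitespace tokens serve as the unigram features.
    return text.lower().split()


class UnigramNB:
    """Illustrative multinomial naive Bayes over unigram counts."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for tok in tokenize(text):
                counts[tok] += 1
                self.vocab.add(tok)
        return self

    def log_score(self, text, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab_size = len(self.vocab)
        # Log prior plus add-one smoothed log likelihood of each token.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokenize(text):
            score += math.log((counts[tok] + 1) / (total + vocab_size))
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Toy data, invented purely for illustration.
train_texts = [
    "once upon a time she walked home and met a stranger",
    "yesterday i went to the store and then we drove home",
    "the quarterly report lists revenue and expenses",
    "install the package and configure the settings file",
]
train_labels = ["story", "story", "nonstory", "nonstory"]
model = UnigramNB().fit(train_texts, train_labels)
```

Calling `model.predict(...)` on a new passage returns whichever label has the higher smoothed log score; a real comparison would of course train and evaluate on the corpora named in the abstract.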
ADVISE: Symbolism and External Knowledge for Decoding Advertisements
In order to convey the most content in their limited space, advertisements
embed references to outside knowledge via symbolism. For example, a motorcycle
stands for adventure (a positive property the ad wants associated with the
product being sold), and a gun stands for danger (a negative property to
dissuade viewers from undesirable behaviors). We show how to use symbolic
references to better understand the meaning of an ad. We further show how
anchoring ad understanding in general-purpose object recognition and image
captioning improves results. We formulate the ad understanding task as matching
the ad image to human-generated statements that describe the action that the ad
prompts, and the rationale it provides for taking this action. Our proposed
method outperforms the state of the art on this task, and on an alternative
formulation of question-answering on ads. We show additional applications of
our learned representations for matching ads to slogans, and clustering ads
according to their topic, without extra training.
Comment: To appear, Proceedings of the European Conference on Computer Vision
(ECCV
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
We introduce a model for bidirectional retrieval of images and sentences
through a multi-modal embedding of visual and natural language data. Unlike
previous models that directly map images or sentences into a common embedding
space, our model works on a finer level and embeds fragments of images
(objects) and fragments of sentences (typed dependency tree relations) into a
common space. In addition to a ranking objective seen in previous work, this
allows us to add a new fragment alignment objective that learns to directly
associate these fragments across modalities. Extensive experimental evaluation
shows that reasoning on both the global level of images and sentences and the
finer level of their respective fragments significantly improves performance on
image-sentence retrieval tasks. Additionally, our model provides interpretable
predictions since the inferred inter-modal fragment alignment is explicit.
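The ranking objective mentioned in this abstract can be illustrated with a small sketch. It assumes a common max-margin formulation over a square matrix of image-sentence similarity scores, where entry `scores[i][j]` scores image `i` against sentence `j` and matching pairs sit on the diagonal. The abstract does not spell out the exact form, so the function name `bidirectional_ranking_loss`, the `margin` default, and the score layout are assumptions made for illustration.

```python
def bidirectional_ranking_loss(scores, margin=1.0):
    """Sum of hinge penalties in both retrieval directions.

    Assumption: scores[i][j] is the similarity of image i and sentence j,
    and the ground-truth pairs are (i, i).
    """
    n = len(scores)
    loss = 0.0
    for i in range(n):
        pos = scores[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score its own sentence above sentence j.
            loss += max(0.0, margin + scores[i][j] - pos)
            # Sentence i should score its own image above image j.
            loss += max(0.0, margin + scores[j][i] - pos)
    return loss


# Well-separated pairs incur no penalty; poorly separated pairs do.
separated = [[3.0, 0.0], [0.0, 3.0]]
confusable = [[1.0, 0.5], [0.2, 1.0]]
loss_separated = bidirectional_ranking_loss(separated)
loss_confusable = bidirectional_ranking_loss(confusable)
```

The fragment alignment objective described in the abstract would add a further term over region-relation pairs; it is not sketched here because the abstract gives no detail on its form.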
Summarizing First-Person Videos from Third Persons' Points of Views
Video highlighting or summarization is an interesting topic in computer
vision that benefits a variety of applications such as viewing, searching, and
storage. However, most existing studies rely on training data of third-person
videos, which cannot easily generalize to highlight the first-person ones. With
the goal of deriving an effective model to summarize first-person videos, we
propose a novel deep neural network architecture for describing and
discriminating vital spatiotemporal information across videos with different
points of view. Our proposed model is realized in a semi-supervised setting, in
which fully annotated third-person videos, unlabeled first-person videos, and a
small number of annotated first-person ones are presented during training. In
our experiments, qualitative and quantitative evaluations on both benchmarks
and our collected first-person video datasets are presented.
Comment: 16+10 pages, ECCV 201
Detecting Visual Relationships with Deep Relational Networks
Relationships among objects play a crucial role in image understanding.
Despite the great success of deep learning techniques in recognizing individual
objects, reasoning about the relationships among objects remains a challenging
task. Previous methods often treat this as a classification problem,
considering each type of relationship (e.g. "ride") or each distinct visual
phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with
significant difficulties caused by the high diversity of visual appearance for
each kind of relationship or the large number of distinct visual phrases. We
propose an integrated framework to tackle this problem. At the heart of this
framework is the Deep Relational Network, a novel formulation designed
specifically for exploiting the statistical dependencies between objects and
their relationships. On two large datasets, the proposed method achieves
substantial improvement over the state of the art.
Comment: To appear in CVPR 2017 as an oral paper
Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets
Visual question answering (Visual QA) has attracted a lot of attention
lately, seen essentially as a form of (visual) Turing test that artificial
intelligence should strive to achieve. In this paper, we study a crucial
component of this task: how can we design good datasets for the task? We focus
on the design of multiple-choice based datasets where the learner has to select
the right answer from a set of candidate ones including the target (i.e., the
correct one) and the decoys (i.e., the incorrect ones). Through careful analysis
of the results attained by state-of-the-art learning models and human
annotators on existing datasets, we show that the design of the decoy answers
has a significant impact on how and what the learning models learn from the
datasets. In particular, the resulting learner can ignore the visual
information, the question, or both while still doing well on the task. Inspired
by this, we propose automatic procedures to remedy such design deficiencies. We
apply the procedures to re-construct decoy answers for two popular Visual QA
datasets as well as to create a new Visual QA dataset from the Visual Genome
project, resulting in the largest dataset for this task. Extensive empirical
studies show that the design deficiencies have been alleviated in the remedied
datasets and the performance on them is likely a more faithful indicator of the
difference among learning models. The datasets are released and publicly
available via http://www.teds.usc.edu/website_vqa/.
Comment: Accepted for Oral Presentation at NAACL-HLT 201