Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language
Generation (NLG), defined as the task of generating text or speech from
non-linguistic input. A survey of NLG is timely in view of the changes that the
field has undergone over the past decade or so, especially in relation to new
(usually data-driven) methods, as well as new applications of NLG technology.
This survey therefore aims to (a) give an up-to-date synthesis of research on
the core tasks in NLG and the architectures adopted in which such tasks are
organised; (b) highlight a number of relatively recent research topics that
have arisen partly as a result of growing synergies between NLG and other areas
of artificial intelligence; (c) draw attention to the challenges in NLG
evaluation, relating them to similar challenges faced in other areas of Natural
Language Processing, with an emphasis on different evaluation methods and the
relationships between them. Comment: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118
pages, 8 figures, 1 table
Auto-Encoding Scene Graphs for Image Captioning
We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language
inductive bias into the encoder-decoder image captioning framework for more
human-like captions. Intuitively, we humans use the inductive bias to compose
collocations and contextual inference in discourse. For example, when we see
the relation `person on bike', it is natural to replace `on' with `ride' and
infer `person riding bike on a road' even if the `road' is not evident. Therefore,
exploiting such bias as a language prior is expected to make conventional
encoder-decoder models less likely to overfit to the dataset bias and more likely to focus on
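reasoning. Specifically, we use the scene graph --- a directed graph
($\mathcal{G}$) where an object node is connected by adjective nodes and
relationship nodes --- to represent the complex structural layout of both image
($\mathcal{I}$) and sentence ($\mathcal{S}$). In the textual domain, we use
SGAE to learn a dictionary ($\mathcal{D}$) that helps to reconstruct sentences
in the $\mathcal{S} \rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline, where $\mathcal{D}$ encodes the desired language prior;
in the vision-language domain, we use the shared $\mathcal{D}$ to guide the
encoder-decoder in the $\mathcal{I} \rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline. Thanks to the scene graph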
representation and shared dictionary, the inductive bias is transferred across
domains in principle. We validate the effectiveness of SGAE on the challenging
MS-COCO image captioning benchmark, e.g., our SGAE-based single-model achieves
a new state-of-the-art CIDEr-D on the Karpathy split, and a competitive
CIDEr-D (c40) on the official server even compared to other ensemble
models.
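As a rough illustration of the scene graph described in the abstract above (a directed graph in which object nodes connect to adjective nodes and relationship nodes), the following sketch builds one for the `person on bike' example. The class name, method names, and the adjectives `young' and `red' are hypothetical and are not taken from the SGAE paper; the dictionary learning and the encoder-decoder are not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy directed scene graph: objects, their adjectives, and relations."""
    objects: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)  # object -> list of adjectives
    relations: list = field(default_factory=list)   # (subject, predicate, object)

    def add_object(self, name, *adjectives):
        # Register an object node and attach any adjective nodes to it.
        self.objects.add(name)
        self.attributes.setdefault(name, []).extend(adjectives)

    def add_relation(self, subject, predicate, obj):
        # Both endpoints must be object nodes; register them if missing.
        self.add_object(subject)
        self.add_object(obj)
        self.relations.append((subject, predicate, obj))

    def edges(self):
        """Directed edges: object -> adjective, subject -> relation -> object."""
        out = []
        for obj, adjs in self.attributes.items():
            out.extend((obj, adj) for adj in adjs)
        for s, p, o in self.relations:
            rel_node = f"{s}-{p}-{o}"  # one relationship node per triple
            out.append((s, rel_node))
            out.append((rel_node, o))
        return out


graph = SceneGraph()
graph.add_object("person", "young")
graph.add_object("bike", "red")
graph.add_relation("person", "on", "bike")
```

The same structure can hold a graph parsed from a sentence or one detected in an image, which is what lets a single representation serve both domains.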
Attentive Tensor Product Learning
This paper proposes a new architecture - Attentive Tensor Product Learning
(ATPL) - to represent grammatical structures in deep learning models. ATPL
exploits Tensor Product Representations (TPR), a structured neural-symbolic
model developed in cognitive science, to integrate deep learning with explicit
language structures and rules. The key ideas of ATPL are: 1) unsupervised learning of
role-unbinding vectors of words via TPR-based deep neural network; 2) employing
attention modules to compute TPR; and 3) integration of TPR with typical deep
learning architectures including Long Short-Term Memory (LSTM) and Feedforward
Neural Network (FFNN). The novelty of our approach lies in its ability to
extract the grammatical structure of a sentence by using role-unbinding
vectors, which are obtained in an unsupervised manner. This ATPL approach is
applied to 1) image captioning, 2) part of speech (POS) tagging, and 3)
constituency parsing of a sentence. Experimental results demonstrate the
effectiveness of the proposed approach.
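The role-unbinding vectors mentioned in the abstract above come from the general algebra of Tensor Product Representations, which can be sketched in a few lines of NumPy. This is only an illustration of TPR binding and unbinding, not of ATPL itself: ATPL learns its unbinding vectors with a neural network, whereas here they are computed in closed form from known role vectors. All dimensions, the role matrix, and the variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_filler = 3, 4

# One filler vector per word (hypothetical random embeddings).
fillers = rng.normal(size=(n_words, d_filler))

# One role vector per position; rows are linearly independent by construction.
roles = np.array([[1.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0]])

# Binding: S = sum_i outer(f_i, r_i), a (d_filler, d_role) matrix.
S = sum(np.outer(f, r) for f, r in zip(fillers, roles))

# Unbinding vectors are dual to the roles: u_i . r_j = 1 if i == j else 0.
unbinders = np.linalg.inv(roles).T  # row i is u_i

# Unbinding: multiplying S by u_i recovers the filler bound to role i.
recovered = np.stack([S @ u for u in unbinders])
```

When the role vectors are orthonormal, each unbinding vector coincides with its role vector, so the matrix inverse is not needed.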
Structure-Aware Generation Network for Recipe Generation from Images
Sharing food has become very popular with the development of social media.
For many real-world applications, people are keen to know the underlying
recipes of a food item. In this paper, we are interested in automatically
generating cooking instructions for food. We investigate an open research task
of generating cooking instructions based on only food images and ingredients,
which is similar to the image captioning task. However, unlike the captions in
image captioning datasets, the target recipes are long paragraphs and have no
annotations of structure information. To address the above limitations, we
propose a novel framework of Structure-aware Generation Network (SGN) to tackle
the food recipe generation task. Our approach brings together several novel
ideas in a systematic framework: (1) exploiting an unsupervised learning
approach to obtain the sentence-level tree structure labels before training;
(2) generating trees of target recipes from images with the supervision of tree
structure labels learned from (1); and (3) integrating the inferred tree
structures with the recipe generation procedure. Our proposed model can produce
high-quality and coherent recipes, and achieves state-of-the-art performance
on the benchmark Recipe1M dataset. Comment: Published at ECCV 202
Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition
Recently, there has been a lot of interest in automatically generating
descriptions for an image. Most existing language-model based approaches for
this task learn to generate an image description word by word in its original
word order. However, for humans, it is more natural to locate the objects and
their relationships first, and then elaborate on each object, describing
notable attributes. We present a coarse-to-fine method that decomposes the
original image description into a skeleton sentence and its attributes, and
generates the skeleton sentence and attribute phrases separately. By this
decomposition, our method can generate more accurate and novel descriptions
than the previous state of the art. Experimental results on MS-COCO and the
larger-scale Stock3M dataset show that our algorithm yields consistent
improvements across different evaluation metrics, especially on the SPICE
metric, which has much higher correlation with human ratings than the
conventional metrics. Furthermore, our algorithm can generate descriptions with
varied length, benefiting from the separate control of the skeleton and
attributes. This enables image description generation that better accommodates
user preferences. Comment: Accepted by CVPR 201
Knowledge and Reasoning for Image Understanding
Image Understanding is a long-established discipline in computer vision, which encompasses a body of advanced image processing techniques that are used to locate ("where"), characterize and recognize ("what") objects, regions, and their attributes in the image. However, the notion of "understanding" (and the goal of artificially intelligent machines) goes beyond factual recall of the recognized components and includes reasoning and thinking beyond what can be seen (or perceived). Understanding is often evaluated by asking questions of increasing difficulty. Thus, the expected functionalities of an intelligent Image Understanding system can be expressed in terms of the functionalities that are required to answer questions about an image. Answering questions about images requires primarily three components: Image Understanding, question (natural language) understanding, and reasoning based on knowledge. Any question that asks beyond what can be directly seen requires modeling of commonsense (or background/ontological/factual) knowledge and reasoning.
Knowledge and reasoning have seen scarce use in image understanding applications. In this thesis, we demonstrate the utility of incorporating background knowledge and using explicit reasoning in image understanding applications. We first present a comprehensive survey of the previous work that utilized background knowledge and reasoning in understanding images. This survey outlines the limited use of commonsense knowledge in high-level applications. We then present a set of vision and reasoning-based methods to solve several applications and show that these approaches benefit in terms of accuracy and interpretability from the explicit use of knowledge and reasoning. We propose novel knowledge representations of images, knowledge acquisition methods, and a new implementation of an efficient probabilistic logical reasoning engine that can utilize publicly available commonsense knowledge to solve applications such as visual question answering and image puzzles. Additionally, we identify the need for new datasets that explicitly require external commonsense knowledge to solve. We propose the new task of Image Riddles, which requires a combination of vision and reasoning based on ontological knowledge, and we collect a sufficiently large dataset to serve as an ideal testbed for vision and reasoning research. Lastly, we propose end-to-end deep architectures that can combine vision, knowledge and reasoning modules together and achieve large performance boosts over state-of-the-art methods. Dissertation/Thesis: Doctoral Dissertation, Computer Science 201
Unsupervised structure induction and multimodal grounding
Structured representations build upon symbolic abstraction (e.g., words in natural language and visual concepts in natural images), offer a principled way of encoding our perceptions about the physical world, and enable the human-like generalization of machine learning systems. The predominant paradigm for learning structured representations of the observed data has been supervised learning, but it is limited in several respects. First, supervised learning is challenging given the scarcity of labeled data. Second, conventional approaches to structured prediction have relied on a single modality (e.g., either images or text), ignoring the learning cues that may be present in, and can be readily obtained from, other modalities of data. In this thesis, we investigate unsupervised approaches to structure induction in a multimodal setting.
Unsupervised learning is inherently difficult in general, let alone inducing complex and discrete structures from data without direct supervision. By considering the multimodal setting, we leverage the alignments between different data modalities (e.g., text, audio, and images) to facilitate the learning of structure-induction models, e.g., knowing that the individual words in ``a white pigeon'' always appear with the same visual object, a language parser is likely to treat them as a whole (i.e., phrase). The multimodal learning setting is practically viable because multimodal alignments are generally abundant. For example, they can be found in online posts such as news and tweets that usually contain images and associated text, and in (YouTube) videos, where audio, scripts, and scenes are synchronized and grounded in each other.
We develop structure-induction models, which are capable of exploiting bimodal image-text alignments, for two modalities: (1) for natural language, we consider unsupervised syntactic parsing with phrase-structure grammars and regularize the parser by using visual image groundings; and (2) for visual images, we induce scene graph representations by mapping arguments and predicates in the text to their visual counterparts (i.e., visual objects and relations among them) in an unsupervised manner. While useful, crossmodal alignments are not always abundantly available on the web, e.g., the alignments between non-speech audio and text. We tackle the challenge by sharing the visual modality between image-text alignment and image-audio alignment; images function as a pivot and connect audio and text. The contributions of this thesis range from model development to data collection. We demonstrate the feasibility of applying multimodal learning techniques to unsupervised structure induction and multimodal alignment collection. Our work opens up new avenues for multimodal and unsupervised structured representation learning.