Word-Region Alignment-Guided Multimodal Neural Machine Translation
We propose word-region alignment-guided multimodal neural machine translation (MNMT), a novel MNMT model that links the semantic correlation between the textual and visual modalities through word-region alignment (WRA). Existing studies on MNMT have mainly focused on the effect of integrating the visual and textual modalities but do not exploit the semantic relevance between them. We strengthen this semantic correlation by incorporating WRA as a bridge between the two modalities. The proposal is implemented on the two mainstream neural machine translation (NMT) architectures: the recurrent neural network (RNN) and the Transformer. Experiments on public benchmarks, English-German and English-French translation on the Multi30k dataset and English-Japanese translation on the Flickr30kEnt-JP dataset, show that our model improves significantly over competitive baselines across different evaluation metrics and outperforms most existing MNMT models. For example, BLEU improves by 1.0 for English-German and by 1.1 for English-French on the Multi30k test2016 set, and by 0.7 for English-Japanese on the Flickr30kEnt-JP test set. Further analysis demonstrates that integrating WRA leads to better use of visual information and thus better translation performance.
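To make the bridging idea concrete, below is a minimal sketch of how a WRA auxiliary loss could be attached to NMT training. The tensor shapes, the temperature, the cross-entropy form, and the loss weight are our illustrative assumptions, not the paper's actual formulation.

```python
# Sketch: word-region alignment (WRA) as an auxiliary loss bridging
# text and vision during NMT training. All names/shapes are assumptions.
import torch
import torch.nn.functional as F

def wra_alignment_loss(word_states, region_feats, gold_alignment, temperature=0.1):
    """word_states:    (T, d) encoder/decoder states, one per source word
    region_feats:   (R, d) projected visual features, one per image region
    gold_alignment: (T, R) 0/1 word-region links from an external aligner
    """
    w = F.normalize(word_states, dim=-1)
    r = F.normalize(region_feats, dim=-1)
    log_p = (w @ r.t() / temperature).log_softmax(dim=-1)  # model's soft alignment

    # Cross-entropy between the model's alignment and the WRA supervision,
    # averaged over words linked to at least one region.
    linked = gold_alignment.sum(dim=-1) > 0
    per_word = -(gold_alignment * log_p).sum(dim=-1)
    return (per_word[linked] / gold_alignment.sum(dim=-1)[linked]).mean()

# Assumed usage: total_loss = nmt_loss + lambda_wra * wra_alignment_loss(...)
```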
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns
visual concepts, words, and semantic parsing of sentences without explicit
supervision on any of them; instead, our model learns by simply looking at
images and reading paired questions and answers. Our model builds an
object-based scene representation and translates sentences into executable,
symbolic programs. To bridge the learning of the two modules, we use a
neuro-symbolic reasoning module that executes these programs on the latent
scene representation. Analogous to human concept learning, the perception
module learns visual concepts based on the language description of the object
being referred to. Meanwhile, the learned visual concepts facilitate learning
new words and parsing new sentences. We use curriculum learning to guide the search over the large compositional space of images and language. Extensive
experiments demonstrate the accuracy and efficiency of our model on learning
visual concepts, word representations, and semantic parsing of sentences.
Further, our method allows easy generalization to new object attributes,
compositions, language concepts, scenes and questions, and even new program
domains. It also empowers applications including visual question answering and
bidirectional image-text retrieval.
Comment: ICLR 2019 (Oral). Project page: http://nscl.csail.mit.edu
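The execution idea can be illustrated with a toy sketch: a program runs as differentiable set operations over per-object concept scores computed from the latent scene representation. The two operations, the concept vocabulary, and all dimensions below are invented for illustration; NS-CL's actual domain-specific language is richer.

```python
# Toy sketch of neuro-symbolic execution over an object-based scene
# representation. Ops, concepts, and dimensions are illustrative only.
import torch

class ToyExecutor:
    def __init__(self, object_feats, concept_embeds):
        self.objs = object_feats        # (N, d): one latent vector per object
        self.concepts = concept_embeds  # name -> (d,) learned concept embedding

    def filter(self, mask, concept):
        # Soft set intersection: probability each object has the concept.
        score = torch.cosine_similarity(
            self.objs, self.concepts[concept].unsqueeze(0), dim=-1)
        return mask * torch.sigmoid(score / 0.1)

    def count(self, mask):
        return mask.sum()  # expected number of objects in the soft set

objs = torch.randn(5, 64)
vocab = {"red": torch.randn(64), "cube": torch.randn(64)}
ex = ToyExecutor(objs, vocab)
# "How many red cubes?" -> count(filter(filter(scene, red), cube))
answer = ex.count(ex.filter(ex.filter(torch.ones(5), "red"), "cube"))
```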
Consensus Graph Representation Learning for Better Grounded Image Captioning
Contemporary visual captioning models frequently hallucinate objects that
are not actually in a scene, due to visual misclassification or
over-reliance on priors, resulting in semantic inconsistency between the
visual information and the target lexical words. The most common remedy is
to encourage the captioning model to dynamically link generated object
words or phrases to appropriate regions of the image, i.e., grounded image
captioning (GIC). However, GIC relies on an auxiliary task (grounding
objects) that does not resolve the key cause of object hallucination, the
semantic inconsistency. In this paper, we take a novel perspective on this
issue: exploiting the semantic coherency between the visual and language
modalities. Specifically, we propose the Consensus Graph Representation
Learning framework (CGRL) for GIC, which incorporates a consensus
representation into the grounded captioning pipeline. The consensus is
learned by aligning the visual graph (e.g., a scene graph) to the language
graph, considering both the nodes and the edges of each graph. With the
aligned consensus, the captioning model can capture both the correct
linguistic characteristics and the visual relevance, and then ground
appropriate image regions. We validate the effectiveness of our model with
a significant decline in object hallucination (-9% CHAIRi) on the
Flickr30k Entities dataset. In addition, we evaluate CGRL with several
automatic metrics and human evaluation; the results indicate that the
proposed approach simultaneously improves image captioning (+2.9
CIDEr) and grounding (+2.3 F1LOC).
Comment: 9 pages, 5 figures, AAAI 2021
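As a rough sketch of what aligning the two graphs over both nodes and edges might look like, assume each graph is represented by node embeddings and edge (relation) embeddings; the soft-matching objective below is an illustrative choice on our part, not CGRL's published loss.

```python
# Sketch: node-and-edge alignment between a visual graph and a language
# graph. The symmetric soft-matching objective is an assumption.
import torch.nn.functional as F

def graph_alignment_loss(vis_nodes, lang_nodes, vis_edges, lang_edges):
    """vis_nodes: (Nv, d), lang_nodes: (Nl, d); *_edges analogous for relations."""
    def soft_match(src, tgt):
        sim = F.normalize(src, dim=-1) @ F.normalize(tgt, dim=-1).t()
        # Every source element should match some target element well.
        return -sim.max(dim=-1).values.mean()
    # Align both the objects (nodes) and the relations (edges) of the graphs.
    return (soft_match(vis_nodes, lang_nodes) + soft_match(lang_nodes, vis_nodes)
            + soft_match(vis_edges, lang_edges) + soft_match(lang_edges, vis_edges))
```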
Language with Vision: a Study on Grounded Word and Sentence Embeddings
Grounding language in vision is an active field of research seeking to
construct cognitively plausible word and sentence representations by
incorporating perceptual knowledge from vision into text-based representations.
Despite many attempts at language grounding, achieving an optimal equilibrium
between textual representations of language and our embodied experiences
remains an open problem. Common concerns include the following. Is visual
grounding advantageous for abstract words, or is its effectiveness restricted
to concrete words? What is the optimal way of bridging the gap between text and
vision? To what extent is perceptual knowledge from images advantageous for
acquiring high-quality embeddings? Leveraging the current advances in machine
learning and natural language processing, the present study addresses these
questions by proposing a simple yet very effective computational grounding
model for pre-trained word embeddings. Our model effectively balances the
interplay between language and vision by aligning textual embeddings with
visual information while simultaneously preserving the distributional
statistics that characterize word usage in text corpora. By applying a learned
alignment, we are able to indirectly ground unseen words including abstract
words. A series of evaluations on a range of behavioural datasets shows that
visual grounding is beneficial not only for concrete words but also for
abstract words, lending support to the indirect theory of abstract concepts.
Moreover, our approach offers advantages for contextualized embeddings, such as
those generated by BERT, but only when trained on corpora of modest,
cognitively plausible sizes. Code and grounded embeddings for English are
available at https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2
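A rough sketch of the balance the abstract describes: a learned map pulls textual embeddings toward paired visual vectors while a back-projection term preserves the original distributional space, and the learned map can then ground unseen words indirectly. The dimensions, the weighting, and the MSE objective below are assumptions, not the authors' exact setup.

```python
# Sketch: grounding map M plus back-projection R that preserves
# distributional statistics. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

d_text, d_vis, lambda_text = 300, 512, 0.5   # e.g., GloVe-sized text vectors
M = torch.nn.Linear(d_text, d_vis)           # grounding (alignment) map
R = torch.nn.Linear(d_vis, d_text)           # back-projection

def grounding_loss(text_emb, vis_emb):
    grounded = M(text_emb)
    align = F.mse_loss(grounded, vis_emb)          # move toward vision
    preserve = F.mse_loss(R(grounded), text_emb)   # keep textual statistics
    return align + lambda_text * preserve

# Unseen words (including abstract ones) are grounded indirectly by
# applying the learned map M to their text-only embeddings.
```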