Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs
which play a crucial role in the compositional power of language.
Comment: The paper has been published in the Proceedings of the 27th International Conference on Computational Linguistics. Please refer to that version for citations:
https://www.aclweb.org/anthology/papers/C/C18/C18-1197
Computational explorations of semantic cognition
Motivated by the widespread use of distributional models of semantics within the cognitive science community, we follow a computational modelling approach in order to better understand and expand the applicability of such models, and to test ways in which they can be improved and extended. We review evidence in favour of the assumption that distributional models capture important aspects of semantic cognition. We look at the models' ability to account for behavioural data and fMRI patterns of brain activity, and investigate the structure of model-based semantic networks.

We test whether introducing affective information, obtained from a neural network model designed to predict emojis from co-occurring text, can improve the performance of linguistic and linguistic-visual models of semantics in accounting for similarity/relatedness ratings. We find that adding visual and affective representations improves performance, with visual information chiefly benefiting concrete words and affective information chiefly benefiting abstract words.

We describe a processing model based on distributional semantics, in which activation spreads throughout a semantic network as dictated by the patterns of semantic similarity between words. We show that the activation profile of the network, measured at various time points, can account for response times and accuracies in lexical and semantic decision tasks, as well as for concreteness/imageability and similarity/relatedness ratings.

We evaluate the differences between concrete and abstract words in terms of the structure of the semantic networks derived from distributional models of semantics, and examine how that structure relates to a number of factors that have been argued to differ between concrete and abstract words, namely imageability, age of acquisition, hedonic valence, contextual diversity, and semantic diversity.
We use distributional models to explore factors that might be responsible for the poor linguistic performance of children suffering from Developmental Language Disorder. On the assumption that certain model parameters can be given a psychological interpretation, we start from "healthy" models and generate "lesioned" models by manipulating those parameters. This allows us to determine the importance of each factor and its effects on learning concrete vs. abstract words.
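The spreading-activation mechanism described in this abstract can be sketched in a few lines. Everything below is an illustrative assumption, not the authors' implementation: the toy network, the words, the decay parameter, and the normalisation step are all hypothetical stand-ins for a similarity-weighted network derived from a distributional model.

```python
import numpy as np

# Hypothetical similarity-weighted semantic network (rows/cols = words).
# Real weights would come from, e.g., cosine similarities in a distributional model.
words = ["dog", "cat", "bone", "justice"]
W = np.array([
    [0.0, 0.8, 0.6, 0.1],
    [0.8, 0.0, 0.4, 0.1],
    [0.6, 0.4, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.0],
])

def spread_activation(W, seed_idx, steps=3, decay=0.5):
    """Spread activation from a seed word: at each step every node passes a
    decayed share of its activation to neighbours in proportion to edge weight."""
    act = np.zeros(W.shape[0])
    act[seed_idx] = 1.0
    history = [act.copy()]
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    for _ in range(steps):
        act = act + decay * (P.T @ act)  # incoming activation, damped by decay
        act /= act.max()                 # keep values on a comparable scale
        history.append(act.copy())
    return history

history = spread_activation(W, seed_idx=words.index("dog"))
```

Reading the activation vector out of `history` at successive time points mirrors the abstract's idea of measuring the network's activation profile at various moments to model response times: strongly associated words ("cat", "bone") accumulate activation faster than weakly associated ones ("justice").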
Towards hypergraph cognitive networks as feature-rich models of knowledge
Semantic networks provide a useful tool to understand how related concepts
are retrieved from memory. However, most current network approaches use
pairwise links to represent memory recall patterns. Pairwise connections
neglect higher-order associations, i.e. relationships between more than two
concepts at a time. These higher-order interactions might covary with (and
thus contain information about) how similar concepts are along psycholinguistic
dimensions like arousal, valence, familiarity, gender, and others. We overcome
these limits by introducing feature-rich cognitive hypergraphs as quantitative
models of human memory where: (i) concepts recalled together can all engage in
hyperlinks involving also more than two concepts at once (cognitive hypergraph
aspect), and (ii) each concept is endowed with a vector of psycholinguistic
features (feature-rich aspect). We build hypergraphs from word association data
and use machine learning methods on these features to predict concept
concreteness. Since concepts with similar concreteness tend to cluster together
in human memory, we expect to be able to leverage this structure. Using word
association data from the Small World of Words dataset, we compared a pairwise
network and a hypergraph with N=3586 concepts/nodes. Interpretable artificial
intelligence models trained on (1) psycholinguistic features only, (2)
pairwise-based feature aggregations, and on (3) hypergraph-based aggregations
show significant differences between pairwise and hypergraph links.
Specifically, our results show that higher-order and feature-rich hypergraph
models contain richer information than pairwise networks leading to improved
prediction of word concreteness. The relation to previous studies on
conceptual clustering and compartmentalisation in associative knowledge and
human memory is discussed.
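The hypergraph-based feature aggregation described above can be sketched as follows. The hyperedges, words, and valence values below are toy illustrations, not data from the Small World of Words norms; the aggregation rule (averaging a psycholinguistic feature over hyperedge co-members) is one plausible reading of the feature-rich approach, not the paper's exact pipeline.

```python
from statistics import mean

# Toy word-association responses: a cue together with all of its responses
# forms one hyperedge (illustrative data, not the Small World of Words norms).
hyperedges = [
    {"apple", "pear", "fruit", "red"},
    {"red", "blood", "anger"},
    {"fruit", "juice", "sweet"},
]
# Hypothetical psycholinguistic feature per concept (e.g. valence on a 1-9 scale).
valence = {"apple": 6.8, "pear": 6.4, "fruit": 7.0, "red": 5.5,
           "blood": 3.1, "anger": 2.3, "juice": 6.6, "sweet": 7.2}

def hyper_neighbourhood(word):
    """All concepts sharing at least one hyperedge with `word`."""
    neigh = set()
    for edge in hyperedges:
        if word in edge:
            neigh |= edge - {word}
    return neigh

def aggregate_feature(word, feature):
    """Hypergraph-based aggregation: mean feature value over all hyperedge
    co-members. Aggregates like this can feed interpretable predictive models."""
    return mean(feature[n] for n in hyper_neighbourhood(word))

agg = aggregate_feature("red", valence)
```

The contrast with a pairwise network is that a hyperedge contributes all of its co-members at once, so the neighbourhood (and therefore the aggregate) reflects higher-order co-recall structure rather than only binary links.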
Modeling Visual Rhetoric and Semantics in Multimedia
Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but also its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher-level abstract semantic concepts which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g. right vs. left political bias), in contrast to explicit visual semantic concepts such as objects.
Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story. We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented which improve the modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems.
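One standard way to learn cross-modal representations from paired image and text data is a symmetric contrastive objective that pulls matched pairs together and pushes mismatched pairs apart. The sketch below illustrates that general technique only; the embeddings are random stand-ins, and nothing here is claimed to be the dissertation's actual method for weakly aligned multimedia.

```python
import numpy as np

# Random stand-ins for encoder outputs: 4 image embeddings and 4 loosely
# aligned text embeddings of dimension 8 (real features would come from
# trained image and text encoders).
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.1 * rng.normal(size=(4, 8))

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss(img, txt, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of (image, text) pairs:
    each image should score its own text highest, and vice versa."""
    sims = l2norm(img) @ l2norm(txt).T / temperature  # pairwise similarities
    logprobs_i2t = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    logprobs_t2i = sims - np.log(np.exp(sims).sum(axis=0, keepdims=True))
    n = sims.shape[0]
    # Diagonal entries correspond to the matched pairs.
    return -(np.trace(logprobs_i2t) + np.trace(logprobs_t2i)) / (2 * n)

loss = contrastive_loss(img, txt)
```

Because the loss only asks each image to rank its own text above the others in the batch, it tolerates loose, complementary pairings better than objectives that demand exact content overlap, which is why contrastive training is a natural fit for weakly aligned multimedia.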
PIQA: Reasoning about Physical Commonsense in Natural Language
To apply eyeshadow without a brush, should I use a cotton swab or a
toothpick? Questions requiring this kind of physical commonsense pose a
challenge to today's natural language understanding systems. While recent
pretrained models (such as BERT) have made progress on question answering over
more abstract domains - such as news articles and encyclopedia entries, where
text is plentiful - in more physical domains, text is inherently limited due to
reporting bias. Can AI systems learn to reliably answer physical commonsense
questions without experiencing the physical world? In this paper, we introduce
the task of physical commonsense reasoning and a corresponding benchmark
dataset, Physical Interaction: Question Answering (PIQA). Though humans find
the dataset easy (95% accuracy), large pretrained models struggle (77%). We
provide analysis about the dimensions of knowledge that existing models lack,
which offers significant opportunities for future research.
Comment: AAAI 202
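The evaluation setup implied by the abstract (two candidate solutions per goal, accuracy as the fraction where the model prefers the labelled one) can be sketched as below. The examples and the plausibility scorer are toy stand-ins; real PIQA evaluation would score each goal–solution pair with a pretrained model rather than this hypothetical keyword heuristic.

```python
# Minimal sketch of PIQA-style binary-choice evaluation: each example pairs a
# goal with two candidate solutions, and accuracy is the fraction of examples
# where the higher-scoring candidate is the labelled one.
examples = [
    {"goal": "apply eyeshadow without a brush",
     "sol1": "use a cotton swab", "sol2": "use a toothpick", "label": 0},
    {"goal": "open a paint can",
     "sol1": "pry the lid with a spoon handle",
     "sol2": "pry the lid with a feather", "label": 0},
]

def toy_score(goal: str, solution: str) -> float:
    """Hypothetical plausibility scorer; a real evaluation would use a
    pretrained model's likelihood for the goal + solution text."""
    plausible_tools = {"swab", "spoon"}
    return sum(word.strip(",.") in plausible_tools for word in solution.split())

def accuracy(examples, score_fn):
    correct = 0
    for ex in examples:
        scores = [score_fn(ex["goal"], ex["sol1"]),
                  score_fn(ex["goal"], ex["sol2"])]
        pred = max(range(2), key=lambda i: scores[i])
        correct += int(pred == ex["label"])
    return correct / len(examples)

acc = accuracy(examples, toy_score)
```

The gap the abstract reports (95% human vs. 77% model accuracy) is exactly this metric computed over the full benchmark.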