9 research outputs found
Learning Multimodal Word Representation via Dynamic Fusion Methods
Multimodal models have been proven to outperform text-based models on
learning semantic word representations. Almost all previous multimodal models
typically treat the representations from different modalities equally. However,
it is obvious that information from different modalities contributes
differently to the meaning of words. This motivates us to build a multimodal
model that can dynamically fuse the semantic representations from different
modalities according to different types of words. To that end, we propose three
novel dynamic fusion methods to assign importance weights to each modality, in
which weights are learned under the weak supervision of word association pairs.
Extensive experiments demonstrate that the proposed methods outperform strong unimodal baselines and state-of-the-art multimodal models.
Comment: To appear in AAAI-1
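The core idea above, assigning each modality a learned, word-specific importance weight before fusing, can be illustrated with a minimal softmax-gating sketch. The function name, the linear gate parameters, and the toy vectors below are all hypothetical; in the paper the weights are learned under weak supervision from word-association pairs, not hand-set.

```python
import numpy as np

def dynamic_fuse(modality_vecs, gate_weights):
    """Fuse per-modality word vectors with softmax importance weights.

    modality_vecs: list of equally-sized vectors (e.g. text, image).
    gate_weights:  one scoring vector per modality (hypothetical stand-ins
                   for the paper's learned gating parameters).
    """
    scores = np.array([v @ g for v, g in zip(modality_vecs, gate_weights)])
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # per-word importance of each modality
    fused = sum(wi * v for wi, v in zip(w, modality_vecs))
    return fused, w

# Toy example: a gate that favours the text modality for this word.
text_vec = np.array([1.0, 0.0])
img_vec = np.array([0.0, 1.0])
fused, weights = dynamic_fuse([text_vec, img_vec],
                              [np.array([2.0, 0.0]), np.array([0.0, 0.0])])
```

Because the weights are a function of the input vectors, a concrete noun and an abstract noun can receive different text/image mixes, which is the dynamic behaviour the abstract motivates.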
Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs
which play a crucial role in the compositional power of language.
Comment: The paper has been published in the Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). Please refer to this version for citations: https://www.aclweb.org/anthology/papers/C/C18/C18-1197
From Words to Behaviour via Semantic Networks
The contents and structure of semantic networks have been the focus of much recent research, with major advances in the development of distributional models. In parallel, connectionist modeling has extended our knowledge of the processes engaged in semantic activation. However, these two lines of investigation have rarely been brought together. Here, starting from a standard textual model of semantics, we allow activation to spread throughout its associated semantic network, as dictated by the patterns of semantic similarity between words. We find that the activation profile of the network, measured at various time points, can successfully account for response times in the lexical decision task, as well as for subjective concreteness and imageability ratings.
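The spreading-activation process described above can be sketched as repeated multiplication of an activation vector by a row-normalized similarity matrix. This is only a generic illustration of the mechanism; the paper's exact update rule, decay value, and readout times are not specified here, so the parameters below are assumptions.

```python
import numpy as np

def spread_activation(sim, a0, steps, decay=0.8):
    """Spread activation over a semantic network built from word similarities.

    sim:   nonnegative word-by-word similarity matrix (the network).
    a0:    initial activation (e.g. 1.0 on the cue word, 0 elsewhere).
    steps: number of time points at which the profile is read out.
    decay: per-step dampening (hypothetical value).
    """
    P = sim / sim.sum(axis=1, keepdims=True)   # row-stochastic transitions
    a = np.asarray(a0, dtype=float)
    profile = [a.copy()]
    for _ in range(steps):
        a = decay * (a @ P)                    # one step of spreading
        profile.append(a.copy())
    return profile                             # activation at each time point

# Toy network of 3 equally similar words, cue on word 0.
profile = spread_activation(np.ones((3, 3)), np.array([1.0, 0.0, 0.0]), steps=2)
```

Reading the profile at different time points gives the time-varying activation that the abstract links to lexical-decision response times.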
Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities
Multimodal sentiment analysis is a core research area that studies speaker
sentiment expressed from the language, visual, and acoustic modalities. The
central challenge in multimodal learning involves inferring joint
representations that can process and relate information from these modalities.
However, existing work learns joint representations by requiring all modalities
as input and as a result, the learned representations may be sensitive to noisy
or missing modalities at test time. With the recent success of sequence to
sequence (Seq2Seq) models in machine translation, there is an opportunity to
explore new ways of learning joint representations that may not require all
input modalities at test time. In this paper, we propose a method to learn
robust joint representations by translating between modalities. Our method is
based on the key insight that translation from a source to a target modality
provides a method of learning joint representations using only the source
modality as input. We augment modality translations with a cycle consistency
loss to ensure that our joint representations retain maximal information from
all modalities. Once our translation model is trained with paired multimodal
data, we only need data from the source modality at test time for final
sentiment prediction. This ensures that our model remains robust from
perturbations or missing information in the other modalities. We train our
model with a coupled translation-prediction objective and it achieves new
state-of-the-art results on multimodal sentiment analysis datasets: CMU-MOSI,
ICT-MMMO, and YouTube. Additional experiments show that our model learns
increasingly discriminative joint representations with more input modalities
while maintaining robustness to missing or perturbed modalities.
Comment: AAAI 2019, code available at https://github.com/hainow/MCT
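The cycle-consistency idea above can be sketched in a few lines: translate the source modality to the target, translate back, and penalize the reconstruction error. The linear translators below are hypothetical stand-ins for the paper's Seq2Seq models, used only to make the loss concrete.

```python
import numpy as np

def cycle_consistency_loss(src, W_st, W_ts):
    """Cycle loss for modality translation: source -> target -> source.

    src:  batch of source-modality features, shape (n, d_s).
    W_st: linear source-to-target translator, shape (d_s, d_t)
          (a toy substitute for a trained Seq2Seq model).
    W_ts: linear target-to-source translator, shape (d_t, d_s).
    """
    tgt_hat = src @ W_st                   # e.g. language -> acoustic
    src_hat = tgt_hat @ W_ts               # translate back to the source
    return np.mean((src - src_hat) ** 2)   # reconstruction error

# A perfect round trip (identity translators) incurs zero cycle loss.
src = np.eye(2)
perfect = cycle_consistency_loss(src, np.eye(2), np.eye(2))
broken = cycle_consistency_loss(src, np.eye(2), np.zeros((2, 2)))
```

Minimizing this term alongside the prediction loss pushes the intermediate representation to retain enough information to reconstruct the source, which is why only the source modality is needed at test time.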
Multimodal Word Representation Learning Using Visual Context
Representing the meaning of a word is a major challenge for automatic language processing. To date, most methods determine the meaning of a word from its contexts in a text corpus. More recently, some authors have drawn on the visual appearance of an object to improve the semantic representation of the corresponding word. However, these works ignore the environment and the visual context in which the object appears. In this article, we propose to learn word representations that benefit from the complementarity of the text and image modalities by simultaneously taking into account the textual and visual contexts of words. We explore several choices for modeling visual context and present a joint method that integrates visual context into a multimodal skip-gram model. Finally, the contribution of these representations to semantic analysis tasks is evaluated on several datasets. This article is a translation of [ZPSG18]
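A multimodal skip-gram objective of the kind described above can be sketched by adding a visual-context term to the standard negative-sampling loss. The function, the weighting factor alpha, and the toy embeddings are assumptions for illustration; the paper's actual formulation of the visual term may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multimodal_sg_loss(w, c_text, c_vis, negatives, alpha=0.5):
    """Skip-gram negative-sampling loss with an added visual-context term.

    w, c_text:  target and textual-context embeddings (standard skip-gram).
    c_vis:      embedding of the word's visual context (the extra signal);
                alpha weights its contribution (hypothetical value).
    negatives:  sampled context embeddings treated as non-contexts.
    """
    loss = -np.log(sigmoid(w @ c_text))          # textual context term
    loss += -alpha * np.log(sigmoid(w @ c_vis))  # visual context term
    for n in negatives:                          # negative samples
        loss += -np.log(sigmoid(-w @ n))
    return loss

# A target aligned with its contexts should score a lower loss.
w = np.array([1.0, 0.0])
neg = [np.array([0.0, 1.0])]
aligned = multimodal_sg_loss(w, np.array([1.0, 0.0]), np.array([1.0, 0.0]), neg)
misaligned = multimodal_sg_loss(w, np.array([-1.0, 0.0]), np.array([1.0, 0.0]), neg)
```

Training word vectors against both context terms is what lets the visual environment of an object, not just its appearance, shape the word's representation.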
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges: representation, alignment, reasoning,
generation, transference, and quantification covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy.
Computational explorations of semantic cognition
Motivated by the widespread use of distributional models of semantics within the cognitive science community, we follow a computational modelling approach in order to better understand and expand the applicability of such models, as well as to test potential ways in which they can be improved and extended. We review evidence in favour of the assumption that distributional models capture important aspects of semantic cognition. We look at the models’ ability to account for behavioural data and fMRI patterns of brain activity, and investigate the structure of model-based semantic networks. We test whether introducing affective information, obtained from a neural network model designed to predict emojis from co-occurring text, can improve the performance of linguistic and linguistic-visual models of semantics in accounting for similarity/relatedness ratings. We find that adding visual and affective representations improves performance, especially for concrete and abstract words, respectively. We describe a processing model based on distributional semantics, in which activation spreads throughout a semantic network, as dictated by the patterns of semantic similarity between words. We show that the activation profile of the network, measured at various time points, can account for response times and accuracies in lexical and semantic decision tasks, as well as for concreteness/imageability and similarity/relatedness ratings. We evaluate the differences between concrete and abstract words in terms of the structure of the semantic networks derived from distributional models of semantics. We examine how this structure is related to a number of factors that have been argued to differ between concrete and abstract words, namely imageability, age of acquisition, hedonic valence, contextual diversity, and semantic diversity.
We use distributional models to explore factors that might be responsible for the poor linguistic performance of children suffering from Developmental Language Disorder. Based on the assumption that certain model parameters can be given a psychological interpretation, we start from “healthy” models and generate “lesioned” models by manipulating those parameters. This allows us to determine the importance of each factor and its effects with respect to learning concrete vs. abstract words.
Relatedness and Compatibility: Semantic Dimensions of the Concept of Privacy in Chinese and English Corpora
This dissertation is a study of how privacy as an ethical concept exists in two languages: Mandarin Chinese and American English. The assumption for this dissertation is that different languages will have their own distinctive expressions and understandings when it comes to privacy. Specifically, I have proposed a cross-genre and cross-language study that includes two genres of language corpora for each of the languages: social media posts and news articles. In addition, the language corpora span from 2010 to 2019, which supported an observation of how privacy-related language may have changed and evolved over the years. I took a mixed-methods approach, using two computational methods, semantic network analysis (SNA) and structural topic modeling (STM), to process the natural language corpora. When it came to labeling and interpreting the results of topic modeling, I relied on external coders for labeling and on my own in-depth reading of the topic words as well as the original documents to make sense of the meaning of these topics. Last but not least, based on the interpretations of the topics, I proposed four semantic dimensions and used them to code all the topics, giving an overall depiction of the topics across the two languages and two genres. The four semantic dimensions, though present in both languages, revealed unequal presence across them. Specifically, the institution dimension has much more presence in the English language, while in the Chinese language it is the individual dimension that is frequently seen across topics in both genres. Apart from topics, this different emphasis on the institution and individual dimensions is also reflected in the semantic network analysis, where the nodes with the leading centrality scores over the years differ between the two languages.
After considering the limitations of the data in this study, I conclude by arguing that, overall, it is more cautious and appropriate to characterize the incompatibilities by saying that the two languages differ in their emphasis on different dimensions. This study is one of the first empirically grounded intercultural explorations of the concept of privacy. It not only provides an examination of the concept as it is understood at the current time of writing but also shows that natural language is a promising means of operationalizing intercultural privacy research and comparative privacy research.