Multi-modal Machine Learning for Vehicle Rating Predictions Using Image, Text, and Parametric Data
Accurate vehicle rating prediction can facilitate designing and configuring
good vehicles. This prediction allows vehicle designers and manufacturers to
optimize and improve their designs in a timely manner, enhance their product
performance, and effectively attract consumers. However, most of the existing
data-driven methods rely on data from a single mode, e.g., text, image, or
parametric data, which results in a limited and incomplete exploration of the
available information. These methods lack comprehensive analyses and
exploration of data from multiple modes, which probably leads to inaccurate
conclusions and hinders progress in this field. To overcome this limitation, we
propose a multi-modal learning model for more comprehensive and accurate
vehicle rating predictions. Specifically, the model simultaneously learns
features from the parametric specifications, text descriptions, and images of
vehicles to predict five vehicle rating scores, including the total score,
critics score, performance score, safety score, and interior score. We compare
the multi-modal learning model to the corresponding unimodal models and find
that the multi-modal model's explanatory power is 4% - 12% higher than that of
the unimodal models. On this basis, we conduct sensitivity analyses using SHAP
to interpret our model and provide design and optimization directions to
designers and manufacturers. Our study underscores the importance of the
data-driven multi-modal learning approach for vehicle design, evaluation, and
optimization. We have made the code publicly available at
http://decode.mit.edu/projects/vehicleratings/.Comment: The paper submitted to IDETC/CIE2023, the International Design
Engineering Technical Conferences & Computers and Information in Engineering
Conference, has been accepte
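The multi-modal idea this abstract describes — learning jointly from parametric specifications, text, and images to predict five rating scores — can be sketched as simple late fusion. All dimensions, variable names, and the linear head below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def fuse_features(parametric, text_emb, image_emb):
    """Late fusion: concatenate unimodal feature vectors into one vector."""
    return np.concatenate([parametric, text_emb, image_emb])

def predict_ratings(fused, weights, bias):
    """Toy linear head mapping the fused vector to five rating scores."""
    return fused @ weights + bias

rng = np.random.default_rng(0)
parametric = rng.normal(size=8)   # e.g. engine specs, dimensions (assumed size)
text_emb = rng.normal(size=16)    # embedding of the text description
image_emb = rng.normal(size=32)   # embedding of the vehicle image

fused = fuse_features(parametric, text_emb, image_emb)
weights = rng.normal(size=(fused.size, 5))
scores = predict_ratings(fused, weights, np.zeros(5))
print(scores.shape)  # (5,) -- total, critics, performance, safety, interior
```

A unimodal baseline would pass only one of the three vectors through its own head; the comparison the abstract reports (4%-12% higher explanatory power) contrasts those baselines with the fused model.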
Multimodal Language Analysis with Recurrent Multistage Fusion
Computational modeling of human multimodal language is an emerging research
area in natural language processing spanning the language, visual and acoustic
modalities. Comprehending multimodal language requires modeling not only the
interactions within each modality (intra-modal interactions) but more
importantly the interactions between modalities (cross-modal interactions). In
this paper, we propose the Recurrent Multistage Fusion Network (RMFN) which
decomposes the fusion problem into multiple stages, each of them focused on a
subset of multimodal signals for specialized, effective fusion. Cross-modal
interactions are modeled using this multistage fusion approach which builds
upon intermediate representations of previous stages. Temporal and intra-modal
interactions are modeled by integrating our proposed fusion approach with a
system of recurrent neural networks. The RMFN displays state-of-the-art
performance in modeling human multimodal language across three public datasets
relating to multimodal sentiment analysis, emotion recognition, and speaker
traits recognition. We provide visualizations to show that each stage of fusion
focuses on a different subset of multimodal signals, learning increasingly
discriminative multimodal representations.
Comment: EMNLP 2018
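The multistage idea — each stage emphasizing a (soft) subset of modality signals and building on the previous stage's intermediate representation — can be sketched in a few lines. This is a toy illustration with assumed shapes and a softmax gate, not the RMFN architecture itself:

```python
import numpy as np

def multistage_fusion(modalities, stage_weights):
    """Toy multistage fusion: each stage softly selects a subset of the
    modality signals and updates the representation from the prior stage."""
    fused = np.zeros_like(modalities[0])
    for w in stage_weights:                      # one gate vector per stage
        gates = np.exp(w) / np.exp(w).sum()      # soft subset of modalities
        stage_input = sum(g * m for g, m in zip(gates, modalities))
        fused = np.tanh(fused + stage_input)     # build on previous stage
    return fused

rng = np.random.default_rng(1)
language = rng.normal(size=4)   # assumed 4-d features per modality
visual = rng.normal(size=4)
acoustic = rng.normal(size=4)
stages = [rng.normal(size=3) for _ in range(3)]  # 3 stages over 3 modalities
out = multistage_fusion([language, visual, acoustic], stages)
print(out.shape)  # (4,)
```

In the actual model, the gating and recurrence are learned end-to-end with LSTMs rather than fixed random weights; the sketch only shows the stage-by-stage control flow.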
The neural basis of audiovisual integration
Our perception is continuous and unified. Yet, sensory information reaches our brains through different senses and needs to be processed in order to create that unified percept. Interactions between sensory modalities already occur at primary cortical levels. The purpose of such interactions and what kind of information they transmit is still largely unknown. The current thesis aimed to reveal the interactions between auditory pitch and visual size in polar coordinates, two modality-specific stimulus features that have robust topographic representations in the human brain. In Chapter 1, I present the background of cross-modal interactions in early sensory cortices and of the pitch-size relationship. In Chapter 2, we explored the pitch-size relationship in a speeded classification task and, in Chapter 3, at the level of functional Magnetic Resonance Imaging activation patterns. In Chapter 4, we investigated the effects of actively learning a specific pitch-size mapping during one session on the speeded classification task. In Chapter 5, we extended learning over multiple sessions and examined learning effects with behavioral and neural measures. Finally, in Chapter 6, I summarize the findings of the thesis, its contributions to the literature, and outline directions for future research.
BanglaAbuseMeme: A Dataset for Bengali Abusive Meme Classification
The dramatic increase in the use of social media platforms for information
sharing has also fueled a steep growth in online abuse. A simple yet effective
way of abusing individuals or communities is by creating memes, which often
integrate an image with a short piece of text layered on top of it. Such
harmful elements are in rampant use and are a threat to online safety. Hence it
is necessary to develop efficient models to detect and flag abusive memes. The
problem becomes more challenging in a low-resource setting (e.g., Bengali
memes, i.e., images with Bengali text embedded in them) because of the
absence of benchmark datasets on which AI models could be trained. In this
paper we bridge this gap by building a Bengali meme dataset. To set up an
effective benchmark we
implement several baseline models for classifying abusive memes using this
dataset. We observe that multimodal models that use both textual and visual
information outperform unimodal models. Our best-performing model achieves a
macro F1 score of 70.51. Finally, we perform a qualitative error analysis of
the misclassified memes of the best-performing text-based, image-based and
multimodal models.
Comment: EMNLP 2023 (main conference)
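The macro F1 score reported here (70.51) is the unweighted mean of per-class F1 scores, so minority classes count as much as the majority class. A minimal implementation, with made-up labels for illustration:

```python
def macro_f1(y_true, y_pred, labels):
    """Macro F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# toy predictions on a two-class abusive-meme task
y_true = ["abusive", "abusive", "benign", "benign", "benign"]
y_pred = ["abusive", "benign", "benign", "benign", "abusive"]
print(round(macro_f1(y_true, y_pred, ["abusive", "benign"]), 3))  # 0.583
```

This matches `sklearn.metrics.f1_score(..., average="macro")`; it is spelled out here only to show what the reported number aggregates.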
MUTEX: Learning Unified Policies from Multimodal Task Specifications
Humans use different modalities, such as speech, text, images, videos, etc.,
to communicate their intent and goals with teammates. For robots to become
better assistants, we aim to endow them with the ability to follow instructions
and understand tasks specified by their human partners. Most robotic policy
learning methods have focused on one single modality of task specification
while ignoring the rich cross-modal information. We present MUTEX, a unified
approach to policy learning from multimodal task specifications. It trains a
transformer-based architecture to facilitate cross-modal reasoning, combining
masked modeling and cross-modal matching objectives in a two-stage training
procedure. After training, MUTEX can follow a task specification in any of the
six learned modalities (video demonstrations, goal images, text goal
descriptions, text instructions, speech goal descriptions, and speech
instructions) or a combination of them. We systematically evaluate the benefits
of MUTEX in a newly designed dataset with 100 tasks in simulation and 50 tasks
in the real world, annotated with multiple instances of task specifications in
different modalities, and observe improved performance over methods trained
specifically for any single modality. More information at
https://ut-austin-rpl.github.io/MUTEX/
Comment: Accepted at the 7th Conference on Robot Learning (CoRL 2023),
Atlanta, USA
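The cross-modal matching objective mentioned above can be illustrated with a toy score: embeddings of two specifications of the same task (say, a video demo and a speech description) should match more closely than embeddings of different tasks. Everything below — the embeddings, dimensions, and cosine score — is an assumed stand-in, not MUTEX's actual objective:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
task_core = rng.normal(size=16)                      # shared task semantics
video_emb = task_core + 0.1 * rng.normal(size=16)    # video spec of task A
speech_emb = task_core + 0.1 * rng.normal(size=16)   # speech spec of task A
other_emb = rng.normal(size=16)                      # spec of an unrelated task

same = cosine(video_emb, speech_emb)   # same task, different modalities
diff = cosine(video_emb, other_emb)    # different tasks
print(same > diff)
```

A contrastive training loss would push `same` up and `diff` down across a batch; combined with masked modeling over modality tokens, this is the two-stage recipe the abstract sketches.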
Say That Again: The role of multimodal redundancy in communication and context
With several modes of expression, such as facial expressions, body language, and speech working together to convey meaning, social communication is rich in redundancy. While typically relegated to signal preservation, this study investigates the role of cross-modal redundancies in establishing performance context, focusing on unaided, solo performances. Drawing on information theory, I operationalize redundancy as predictability and use an array of machine learning models to featurize speakers' facial expressions, body poses, movement speeds, acoustic features, and spoken language from 24 TEDTalks and 16 episodes of Comedy Central Stand-Up Presents. This analysis demonstrates that it is possible to distinguish between these performance types based on cross-modal predictions, while also highlighting the significant amount of prediction supported by the signals' synchrony across modalities. Further research is needed to unravel the complexities of redundancy's place in social communication, paving the way for more effective and engaging communication strategies.
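"Redundancy as predictability" can be made concrete with a minimal example: score how well one modality's feature track predicts another's, e.g. via the R^2 of a least-squares fit. The features and coefficients below are synthetic stand-ins, not the study's data or models:

```python
import numpy as np

def cross_modal_predictability(source, target):
    """R^2 of a least-squares fit predicting one modality's feature track
    from another's -- higher means more cross-modal redundancy."""
    X = np.column_stack([source, np.ones_like(source)])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(2)
pitch = rng.normal(size=200)                            # e.g. acoustic feature
mouth_open = 0.8 * pitch + 0.2 * rng.normal(size=200)   # redundant visual cue
r2 = cross_modal_predictability(pitch, mouth_open)
print(r2 > 0.8)  # strongly predictable -> highly redundant pair
```

The study's machine-learning featurizers play the role of `pitch` and `mouth_open` here; the point is only that predictability across modalities is a measurable quantity.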