958 research outputs found

    Multi-modal Machine Learning for Vehicle Rating Predictions Using Image, Text, and Parametric Data

    Accurate vehicle rating prediction can facilitate the design and configuration of good vehicles. This prediction allows vehicle designers and manufacturers to optimize and improve their designs in a timely manner, enhance their product performance, and effectively attract consumers. However, most existing data-driven methods rely on data from a single mode, e.g., text, image, or parametric data, which results in a limited and incomplete exploration of the available information. These methods lack a comprehensive analysis of data from multiple modes, which probably leads to inaccurate conclusions and hinders progress in this field. To overcome this limitation, we propose a multi-modal learning model for more comprehensive and accurate vehicle rating predictions. Specifically, the model simultaneously learns features from the parametric specifications, text descriptions, and images of vehicles to predict five vehicle rating scores: the total score, critics score, performance score, safety score, and interior score. We compare the multi-modal learning model to the corresponding unimodal models and find that the multi-modal model's explanatory power is 4%-12% higher than that of the unimodal models. On this basis, we conduct sensitivity analyses using SHAP to interpret our model and provide design and optimization directions to designers and manufacturers. Our study underscores the importance of the data-driven multi-modal learning approach for vehicle design, evaluation, and optimization. We have made the code publicly available at http://decode.mit.edu/projects/vehicleratings/. Comment: The paper submitted to IDETC/CIE2023, the International Design Engineering Technical Conferences & Computers and Information in Engineering Conference, has been accepted.
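
    As a rough illustration of the late-fusion setup described above, the sketch below encodes parametric, text, and image features separately and regresses the five rating scores. All dimensions, layer sizes, and names (e.g., MultiModalRatingModel) are placeholder assumptions, not the authors' released implementation, which is available at the link above.

        # Hypothetical late-fusion sketch: encode each modality separately, then
        # concatenate the features and regress the five rating scores.
        import torch
        import torch.nn as nn

        class MultiModalRatingModel(nn.Module):
            def __init__(self, n_params=30, text_dim=768, img_dim=512,
                         hidden=128, n_scores=5):
                super().__init__()
                # One small encoder per modality (dimensions are placeholders).
                self.param_enc = nn.Sequential(nn.Linear(n_params, hidden), nn.ReLU())
                self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
                self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
                # Fused head: total, critics, performance, safety, interior scores.
                self.head = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU(),
                                          nn.Linear(hidden, n_scores))

            def forward(self, params, text_emb, img_emb):
                fused = torch.cat([self.param_enc(params),
                                   self.text_enc(text_emb),
                                   self.img_enc(img_emb)], dim=-1)
                return self.head(fused)

        model = MultiModalRatingModel()
        scores = model(torch.randn(4, 30), torch.randn(4, 768), torch.randn(4, 512))
        print(scores.shape)  # torch.Size([4, 5])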

    Multimodal Language Analysis with Recurrent Multistage Fusion

    Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each focused on a subset of multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach, which builds upon intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. RMFN displays state-of-the-art performance in modeling human multimodal language across three public datasets relating to multimodal sentiment analysis, emotion recognition, and speaker trait recognition. We provide visualizations to show that each stage of fusion focuses on a different subset of multimodal signals, learning increasingly discriminative multimodal representations. Comment: EMNLP 2018.
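
    To make the multistage idea concrete, here is a minimal PyTorch sketch in which each stage gates a subset of the concatenated language, acoustic, and visual features conditioned on the previous stage's fused state. The feature dimensions and the GRU-cell fusion are illustrative assumptions, not the published RMFN architecture.

        # Rough sketch of multistage fusion: each stage "highlights" a subset of
        # the concatenated modality features, conditioned on the previous stage's
        # fused representation, then folds them in with a GRU cell.
        import torch
        import torch.nn as nn

        class MultistageFusion(nn.Module):
            def __init__(self, lang_dim=300, acou_dim=74, vis_dim=35,
                         fused_dim=128, n_stages=3):
                super().__init__()
                in_dim = lang_dim + acou_dim + vis_dim
                self.n_stages = n_stages
                self.highlight = nn.Linear(in_dim + fused_dim, in_dim)
                self.fuse = nn.GRUCell(in_dim, fused_dim)

            def forward(self, lang, acou, vis):
                x = torch.cat([lang, acou, vis], dim=-1)
                z = x.new_zeros(x.size(0), self.fuse.hidden_size)
                for _ in range(self.n_stages):
                    gate = torch.sigmoid(self.highlight(torch.cat([x, z], dim=-1)))
                    z = self.fuse(gate * x, z)  # build on the previous stage's output
                return z

        fusion = MultistageFusion()
        out = fusion(torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35))
        print(out.shape)  # torch.Size([8, 128])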

    The neural basis of audiovisual integration

    Our perception is continuous and unified. Yet, sensory information reaches our brains through different senses and needs to be processed in order to create that unified percept. Interactions between sensory modalities already occur at primary cortical levels. The purpose of such interactions and what kind of information they transmit are still largely unknown. This thesis aimed to reveal the interactions between auditory pitch and visual size in polar coordinates, two modality-specific stimulus features that have robust topographic representations in the human brain. In Chapter 1, I present the background of cross-modal interactions in early sensory cortices and of the pitch-size relationship. In Chapter 2, we explored the pitch-size relationship in a speeded classification task and, in Chapter 3, at the level of functional Magnetic Resonance Imaging activation patterns. In Chapter 4, we investigated the effects of actively learning a specific pitch-size mapping during one session on the speeded classification task. In Chapter 5, we extended learning over multiple sessions and examined learning effects with behavioral and neural measures. Finally, in Chapter 6, I summarize the findings of the thesis and its contributions to the literature, and outline directions for future research.

    BanglaAbuseMeme: A Dataset for Bengali Abusive Meme Classification

    The dramatic increase in the use of social media platforms for information sharing has also fueled a steep growth in online abuse. A simple yet effective way of abusing individuals or communities is by creating memes, which often integrate an image with a short piece of text layered on top of it. Such harmful elements are in rampant use and are a threat to online safety. Hence, it is necessary to develop efficient models to detect and flag abusive memes. The problem becomes more challenging in a low-resource setting (e.g., Bengali memes, i.e., images with Bengali text embedded in them) because of the absence of benchmark datasets on which AI models could be trained. In this paper, we bridge this gap by building a Bengali meme dataset. To set up an effective benchmark, we implement several baseline models for classifying abusive memes using this dataset. We observe that multimodal models that use both textual and visual information outperform unimodal models. Our best-performing model achieves a macro F1 score of 70.51. Finally, we perform a qualitative error analysis of the misclassified memes of the best-performing text-based, image-based, and multimodal models. Comment: EMNLP 2023 (main conference).
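
    A minimal sketch of the kind of multimodal baseline described here, assuming precomputed text and image embeddings; the classifier, its dimensions, and the dummy labels are illustrative placeholders rather than the paper's models.

        # Hypothetical multimodal baseline: concatenate precomputed text and image
        # embeddings and classify a meme as abusive or not; report macro F1.
        import torch
        import torch.nn as nn
        from sklearn.metrics import f1_score

        class MemeClassifier(nn.Module):
            def __init__(self, text_dim=768, img_dim=512, hidden=256, n_classes=2):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(text_dim + img_dim, hidden),
                                         nn.ReLU(), nn.Dropout(0.3),
                                         nn.Linear(hidden, n_classes))

            def forward(self, text_emb, img_emb):
                return self.net(torch.cat([text_emb, img_emb], dim=-1))

        clf = MemeClassifier()
        logits = clf(torch.randn(16, 768), torch.randn(16, 512))
        preds = logits.argmax(dim=-1)
        labels = torch.randint(0, 2, (16,))  # dummy ground-truth labels
        print("macro F1:", f1_score(labels.numpy(), preds.numpy(), average="macro"))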

    MUTEX: Learning Unified Policies from Multimodal Task Specifications

    Humans use different modalities, such as speech, text, images, videos, etc., to communicate their intent and goals with teammates. For robots to become better assistants, we aim to endow them with the ability to follow instructions and understand tasks specified by their human partners. Most robotic policy learning methods have focused on a single modality of task specification while ignoring the rich cross-modal information. We present MUTEX, a unified approach to policy learning from multimodal task specifications. It trains a transformer-based architecture to facilitate cross-modal reasoning, combining masked modeling and cross-modal matching objectives in a two-stage training procedure. After training, MUTEX can follow a task specification in any of the six learned modalities (video demonstrations, goal images, text goal descriptions, text instructions, speech goal descriptions, and speech instructions) or a combination of them. We systematically evaluate the benefits of MUTEX on a newly designed dataset with 100 tasks in simulation and 50 tasks in the real world, annotated with multiple instances of task specifications in different modalities, and observe improved performance over methods trained specifically for any single modality. More information at https://ut-austin-rpl.github.io/MUTEX/. Comment: Accepted at the 7th Conference on Robot Learning (CoRL 2023), Atlanta, USA.
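
    The two training objectives named above can be sketched roughly as follows; the encoder, feature dimensions, and exact loss forms are assumptions for illustration and not the MUTEX implementation linked above.

        # Simplified sketch of the two objectives: mask and reconstruct token
        # embeddings (masked modeling) and pull paired representations from two
        # modalities together with an InfoNCE-style matching loss.
        import torch
        import torch.nn.functional as F

        def masked_modeling_loss(encoder, tokens, mask_prob=0.15):
            mask = torch.rand(tokens.shape[:2]) < mask_prob       # (B, T)
            corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
            recon = encoder(corrupted)                            # (B, T, D)
            return F.mse_loss(recon[mask], tokens[mask])

        def cross_modal_matching_loss(z_a, z_b, temperature=0.07):
            z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
            logits = z_a @ z_b.t() / temperature                  # (B, B)
            return F.cross_entropy(logits, torch.arange(z_a.size(0)))

        encoder = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
            num_layers=2)
        tokens = torch.randn(8, 10, 64)                           # e.g. text-spec tokens
        z_text, z_video = torch.randn(8, 64), torch.randn(8, 64)  # paired spec embeddings
        loss = masked_modeling_loss(encoder, tokens) + cross_modal_matching_loss(z_text, z_video)
        print(loss.item())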

    Say That Again: The role of multimodal redundancy in communication and context

    With several modes of expression, such as facial expressions, body language, and speech, working together to convey meaning, social communication is rich in redundancy. While typically relegated to signal preservation, this study investigates the role of cross-modal redundancies in establishing performance context, focusing on unaided, solo performances. Drawing on information theory, I operationalize redundancy as predictability and use an array of machine learning models to featurize speakers' facial expressions, body poses, movement speeds, acoustic features, and spoken language from 24 TED Talks and 16 episodes of Comedy Central Stand-Up Presents. This analysis demonstrates that it is possible to distinguish between these performance types based on cross-modal predictions, while also highlighting the significant amount of prediction supported by the signals' synchrony across modalities. Further research is needed to unravel the complexities of redundancy's place in social communication, paving the way for more effective and engaging communication strategies.
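
    As a toy example of operationalizing redundancy as predictability, the snippet below fits a regressor that predicts one (synthetic) modality's features from another and reports held-out R^2; the data and model choice are assumptions for illustration, not the study's pipeline.

        # Toy illustration of redundancy-as-predictability: regress a synthetic
        # "acoustic" feature track from body-pose features and read off held-out R^2.
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        pose = rng.normal(size=(2000, 12))                 # per-frame pose features
        acoustic = pose @ rng.normal(size=(12, 4)) + 0.5 * rng.normal(size=(2000, 4))

        X_tr, X_te, y_tr, y_te = train_test_split(pose, acoustic, random_state=0)
        model = Ridge(alpha=1.0).fit(X_tr, y_tr)
        # Higher held-out R^2 means more of the acoustic signal is redundant with pose.
        print("cross-modal predictability (R^2):", round(model.score(X_te, y_te), 3))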