Hierarchical Delta-Attention Method for Multimodal Fusion
In vision and linguistics, the main input modalities are facial expressions,
speech patterns, and the words uttered. The issue with analyzing any one mode
of expression (visual, verbal, or vocal) is that a lot of contextual
information can be lost. This pushes researchers to inspect multiple
modalities to gain a thorough understanding of the cross-modal dependencies
and the temporal context of the situation when analyzing an expression. This
work attempts to preserve the long-range dependencies within and across
different modalities, which would be bottlenecked by the use of recurrent
networks, and adds the concept of delta-attention to focus on local
differences per modality and capture the idiosyncrasies of different people.
We explore a cross-attention fusion technique to obtain a global view of the
emotion expressed through these delta-self-attended modalities, fusing all
the local nuances and the global context together. The use of attention is
new to the multimodal fusion field, and the stage at which the attention
mechanism should be applied is still under scrutiny. This work achieves
competitive overall and per-class classification accuracy, close to the
current state of the art, with almost half the number of parameters.
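
Below is a minimal PyTorch sketch of the idea as described: self-attention
over per-modality frame-to-frame differences (one plausible reading of
"delta-attention"), followed by a cross-attention fusion step. The module
names, dimensions, delta formulation, and residual wiring are illustrative
assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DeltaSelfAttention(nn.Module):
    """Self-attention over local (frame-to-frame) differences of one modality."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Differences between consecutive timesteps (zero at t=0), meant to
        # emphasize a person's deviations from their own baseline.
        delta = x - torch.cat([x[:, :1], x[:, :-1]], dim=1)
        out, _ = self.attn(delta, delta, delta)
        return out + x  # residual keeps the absolute signal

class CrossModalFusion(nn.Module):
    """One modality attends to another for a global view of the expression."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod: torch.Tensor, context_mod: torch.Tensor):
        out, _ = self.attn(query_mod, context_mod, context_mod)
        return out

# Toy visual/audio sequences: batch=2, 50 timesteps, feature dim=64.
visual, audio = torch.randn(2, 50, 64), torch.randn(2, 50, 64)
v = DeltaSelfAttention(64)(visual)
a = DeltaSelfAttention(64)(audio)
fused = CrossModalFusion(64)(v, a)  # visual queries attend to audio
print(fused.shape)  # torch.Size([2, 50, 64])
```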
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
In the rapidly advancing field of multi-modal machine learning (MMML), the
convergence of multiple data modalities has the potential to reshape various
applications. This paper presents a comprehensive overview of the current
state, advancements, and challenges of MMML within the sphere of engineering
design. The review begins with a deep dive into five fundamental concepts of
MMML: multi-modal information representation, fusion, alignment, translation,
and co-learning. Following this, we explore the cutting-edge applications of
MMML, placing a particular emphasis on tasks pertinent to engineering design,
such as cross-modal synthesis, multi-modal prediction, and cross-modal
information retrieval. Through this comprehensive overview, we highlight the
inherent challenges in adopting MMML in engineering design, and proffer
potential directions for future research. To spur on the continued evolution of
MMML in engineering design, we advocate for concentrated efforts to construct
extensive multi-modal design datasets, develop effective data-driven MMML
techniques tailored to design applications, and enhance the scalability and
interpretability of MMML models. MMML models, as the next generation of
intelligent design tools, hold a promising future to impact how products are
designed.
Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora
Speech translation for subtitling (SubST) is the task of automatically
translating speech data into well-formed subtitles by inserting subtitle breaks
compliant with specific display guidelines. Similar to speech translation
(ST), model training requires parallel data comprising audio inputs paired with
their textual translations. In SubST, however, the text also has to be
annotated with subtitle breaks. So far, this requirement has represented a
bottleneck for system development, as confirmed by the dearth of publicly
available SubST corpora. To fill this gap, we propose a method to convert
existing ST corpora into SubST resources without human intervention. We build a
segmenter model that automatically segments texts into proper subtitles by
exploiting audio and text in a multimodal fashion, achieving high segmentation
quality in zero-shot conditions. Comparative experiments with SubST systems
trained on manual and automatic segmentations, respectively, result in similar
performance, showing the effectiveness of our approach.
Comment: Accepted to AACL 2022
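
A sketch of one way such a multimodal segmenter could look, assuming a
token-level break classifier whose text states attend to audio frames via
cross-attention; all names, the 80-dim log-Mel input, and the architecture
itself are assumptions rather than the paper's model.

```python
import torch
import torch.nn as nn

class SubtitleSegmenter(nn.Module):
    """Predicts, per token, whether a subtitle break follows it."""
    def __init__(self, vocab_size: int, dim: int = 256, heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.audio_proj = nn.Linear(80, dim)  # e.g. 80-dim log-Mel frames
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)   # break / no-break per token

    def forward(self, tokens: torch.Tensor, audio: torch.Tensor):
        txt = self.embed(tokens)              # (B, T_text, dim)
        aud = self.audio_proj(audio)          # (B, T_audio, dim)
        fused, _ = self.cross(txt, aud, aud)  # text attends to audio
        return self.classifier(fused + txt)   # logits per token

tokens = torch.randint(0, 1000, (2, 20))   # dummy token ids
audio = torch.randn(2, 300, 80)            # dummy audio features
logits = SubtitleSegmenter(vocab_size=1000)(tokens, audio)
print(logits.shape)  # torch.Size([2, 20, 2])
```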
Silo NLP's Participation at WAT2022
This paper provides the system description of "Silo NLP's" submission to the Workshop on Asian Translation (WAT2022). We participated in the Indic Multimodal tasks (English->Hindi, English->Malayalam, and English->Bengali Multimodal Translation). For text-only translation, we trained Transformers from scratch and fine-tuned mBART-50 models. For multimodal translation, we used the same mBART architecture and extracted object tags from the images to use as visual features, concatenated with the text sequence. Our submission tops many tasks, including English->Hindi multimodal translation (evaluation test), English->Malayalam text-only and multimodal translation (evaluation test), English->Bengali multimodal translation (challenge test), and English->Bengali text-only translation (evaluation test).
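
A minimal sketch of the described input concatenation with the public
mBART-50 checkpoint from Hugging Face; the example sentence, the object tags
(which in the paper would come from an object detector), and the plain-space
separator are assumptions for illustration, and the exact preprocessing and
fine-tuning may differ.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(name)
tokenizer = MBart50TokenizerFast.from_pretrained(name, src_lang="en_XX")

source = "A man rides a horse on the beach."
object_tags = ["man", "horse", "beach"]  # assumed detector output
multimodal_input = source + " " + " ".join(object_tags)

# Translate the tag-augmented source into Hindi (English->Hindi task).
batch = tokenizer(multimodal_input, return_tensors="pt")
generated = model.generate(
    **batch, forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```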
Energy of visual and verbal modalities in language education
Pictures, like words are omnipresent in our lives. Each form of communication carries visible (clear) and
invisible (hidden) messages. This paper will describe the research conducted during a workshop about
visual and verbal input in language education. The workshop was addressed to students and teachers who
participate in children’s language education. A focus was on the sociocultural context of learning and visual
literacy as essential skills for reading multimodal texts and transferring information in the 21st century.
There were two questions stated: What is the role of verbal and visual modalities in language education?
What is the image-text relationship in transferring information? The qualitative, sociocultural and MDA
approaches were applied to raise participant’s awareness of image-text intermodality. The idea was also to
practise selection and evaluation of ELT materials. The paper hopes to increase the role of visual
methodology and multimodal perspective in language education