Hierarchical Delta-Attention Method for Multimodal Fusion
In vision and linguistics, the main input modalities are facial expressions,
speech patterns, and the words uttered. The issue with analyzing any one mode
of expression (visual, verbal, or vocal) is that a lot of contextual
information can be lost. This pushes researchers to inspect multiple
modalities to gain a thorough understanding of the cross-modal dependencies
and the temporal context of the situation when analyzing an expression. This
work attempts to preserve the long-range dependencies within and across
different modalities, which would be bottlenecked by the use of recurrent
networks, and adds the concept of delta-attention to focus on local
differences per modality and capture the idiosyncrasies of different people.
We explore a cross-attention fusion technique to obtain a global view of the
emotion expressed through these delta-self-attended modalities, fusing all
the local nuances and the global context together. The use of attention is
new to the multimodal fusion field, and the stage at which the attention
mechanism should be applied is still under scrutiny. This work achieves
competitive overall and per-class classification accuracy, close to the
current state of the art, with almost half the number of parameters.
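
Below is a minimal PyTorch sketch of the idea as described: self-attention
over per-modality frame-to-frame differences (one plausible reading of
"delta-attention"), followed by a cross-attention fusion step. The module
names, dimensions, delta formulation, and residual wiring are illustrative
assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DeltaSelfAttention(nn.Module):
    """Self-attention over local (frame-to-frame) differences of one modality."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Differences between consecutive timesteps (zero at t=0), meant to
        # emphasize a person's deviations from their own baseline.
        delta = x - torch.cat([x[:, :1], x[:, :-1]], dim=1)
        out, _ = self.attn(delta, delta, delta)
        return out + x  # residual keeps the absolute signal

class CrossModalFusion(nn.Module):
    """One modality attends to another for a global view of the expression."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod: torch.Tensor, context_mod: torch.Tensor):
        out, _ = self.attn(query_mod, context_mod, context_mod)
        return out

# Toy visual/audio sequences: batch=2, 50 timesteps, feature dim=64.
visual, audio = torch.randn(2, 50, 64), torch.randn(2, 50, 64)
v = DeltaSelfAttention(64)(visual)
a = DeltaSelfAttention(64)(audio)
fused = CrossModalFusion(64)(v, a)  # visual queries attend to audio
print(fused.shape)  # torch.Size([2, 50, 64])
```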
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
In the rapidly advancing field of multi-modal machine learning (MMML), the
convergence of multiple data modalities has the potential to reshape various
applications. This paper presents a comprehensive overview of the current
state, advancements, and challenges of MMML within the sphere of engineering
design. The review begins with a deep dive into five fundamental concepts of
MMML: multi-modal information representation, fusion, alignment, translation,
and co-learning. Following this, we explore the cutting-edge applications of
MMML, placing a particular emphasis on tasks pertinent to engineering design,
such as cross-modal synthesis, multi-modal prediction, and cross-modal
information retrieval. Through this comprehensive overview, we highlight the
inherent challenges in adopting MMML in engineering design, and proffer
potential directions for future research. To spur on the continued evolution of
MMML in engineering design, we advocate for concentrated efforts to construct
extensive multi-modal design datasets, develop effective data-driven MMML
techniques tailored to design applications, and enhance the scalability and
interpretability of MMML models. MMML models, as the next generation of
intelligent design tools, hold a promising future to impact how products are
designed.
Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora
Speech translation for subtitling (SubST) is the task of automatically
translating speech data into well-formed subtitles by inserting subtitle breaks
compliant with specific display guidelines. Similar to speech translation
(ST), model training requires parallel data comprising audio inputs paired with
their textual translations. In SubST, however, the text also has to be
annotated with subtitle breaks. So far, this requirement has represented a
bottleneck for system development, as confirmed by the dearth of publicly
available SubST corpora. To fill this gap, we propose a method to convert
existing ST corpora into SubST resources without human intervention. We build a
segmenter model that automatically segments texts into proper subtitles by
exploiting audio and text in a multimodal fashion, achieving high segmentation
quality in zero-shot conditions. Comparative experiments with SubST systems
trained on manual and automatic segmentations, respectively, result in similar
performance, showing the effectiveness of our approach.
Comment: Accepted to AACL 2022
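
A sketch of one way such a multimodal segmenter could look, assuming a
token-level break classifier whose text states attend to audio frames via
cross-attention; all names, the 80-dim log-Mel input, and the architecture
itself are assumptions rather than the paper's model.

```python
import torch
import torch.nn as nn

class SubtitleSegmenter(nn.Module):
    """Predicts, per token, whether a subtitle break follows it."""
    def __init__(self, vocab_size: int, dim: int = 256, heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.audio_proj = nn.Linear(80, dim)  # e.g. 80-dim log-Mel frames
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)   # break / no-break per token

    def forward(self, tokens: torch.Tensor, audio: torch.Tensor):
        txt = self.embed(tokens)              # (B, T_text, dim)
        aud = self.audio_proj(audio)          # (B, T_audio, dim)
        fused, _ = self.cross(txt, aud, aud)  # text attends to audio
        return self.classifier(fused + txt)   # logits per token

tokens = torch.randint(0, 1000, (2, 20))   # dummy token ids
audio = torch.randn(2, 300, 80)            # dummy audio features
logits = SubtitleSegmenter(vocab_size=1000)(tokens, audio)
print(logits.shape)  # torch.Size([2, 20, 2])
```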
Silo NLP's Participation at WAT2022
This paper provides the system description of "Silo NLP's" submission to the Workshop on Asian Translation (WAT2022). We participated in the Indic Multimodal tasks (English->Hindi, English->Malayalam, and English->Bengali Multimodal Translation). For text-only translation, we trained Transformers from scratch and fine-tuned mBART-50 models. For multimodal translation, we used the same mBART architecture and extracted object tags from the images to use as visual features, concatenated with the text sequence. Our submission tops many tasks, including English->Hindi multimodal translation (evaluation test), English->Malayalam text-only and multimodal translation (evaluation test), English->Bengali multimodal translation (challenge test), and English->Bengali text-only translation (evaluation test).
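
A minimal sketch of the described input concatenation with the public
mBART-50 checkpoint from Hugging Face; the example sentence, the object tags
(which in the paper would come from an object detector), and the plain-space
separator are assumptions for illustration, and the exact preprocessing and
fine-tuning may differ.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(name)
tokenizer = MBart50TokenizerFast.from_pretrained(name, src_lang="en_XX")

source = "A man rides a horse on the beach."
object_tags = ["man", "horse", "beach"]  # assumed detector output
multimodal_input = source + " " + " ".join(object_tags)

# Translate the tag-augmented source into Hindi (English->Hindi task).
batch = tokenizer(multimodal_input, return_tensors="pt")
generated = model.generate(
    **batch, forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```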
Energy of visual and verbal modalities in language education
Pictures, like words are omnipresent in our lives. Each form of communication carries visible (clear) and
invisible (hidden) messages. This paper will describe the research conducted during a workshop about
visual and verbal input in language education. The workshop was addressed to students and teachers who
participate in children’s language education. A focus was on the sociocultural context of learning and visual
literacy as essential skills for reading multimodal texts and transferring information in the 21st century.
There were two questions stated: What is the role of verbal and visual modalities in language education?
What is the image-text relationship in transferring information? The qualitative, sociocultural and MDA
approaches were applied to raise participant’s awareness of image-text intermodality. The idea was also to
practise selection and evaluation of ELT materials. The paper hopes to increase the role of visual
methodology and multimodal perspective in language education