New ideas and trends in deep multimodal content understanding: a review
The focus of this survey is the analysis of two modalities in multimodal deep learning: image and text. Unlike classic reviews of deep learning, where monomodal image classifiers such as VGG, ResNet and the Inception family are the central topic, this paper examines recent multimodal deep models and structures, including auto-encoders, generative adversarial nets and their variants. These models go beyond simple image classification: they support uni-directional tasks (e.g. image captioning, image generation) as well as bi-directional tasks (e.g. cross-modal retrieval, visual question answering). In addition, we analyze two aspects of the challenge of better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial to overcoming the aforementioned challenges. Finally, we outline several promising directions for future research.
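As an illustration of the feature embedding approaches and objective function design discussed in this survey, the following minimal sketch shows a joint image-text embedding trained with a contrastive (InfoNCE-style) objective; the encoder output dimensions and module names are illustrative assumptions rather than details of any specific model in the review.

```python
# Minimal sketch of a cross-modal joint embedding with a contrastive
# objective; dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)   # project image features
        self.txt_proj = nn.Linear(txt_dim, emb_dim)   # project text features

    def forward(self, img_feat, txt_feat):
        # L2-normalise so similarity reduces to the cosine of the embeddings
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_img, z_txt

def contrastive_loss(z_img, z_txt, temperature=0.07):
    # Matched image/text pairs lie on the diagonal of the similarity matrix;
    # all other entries act as negatives (symmetric cross-entropy).
    logits = z_img @ z_txt.t() / temperature
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```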
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
In the rapidly advancing field of multi-modal machine learning (MMML), the
convergence of multiple data modalities has the potential to reshape various
applications. This paper presents a comprehensive overview of the current
state, advancements, and challenges of MMML within the sphere of engineering
design. The review begins with a deep dive into five fundamental concepts of
MMML: multi-modal information representation, fusion, alignment, translation,
and co-learning. Following this, we explore the cutting-edge applications of
MMML, placing a particular emphasis on tasks pertinent to engineering design,
such as cross-modal synthesis, multi-modal prediction, and cross-modal
information retrieval. Through this comprehensive overview, we highlight the
inherent challenges in adopting MMML in engineering design, and proffer
potential directions for future research. To spur on the continued evolution of
MMML in engineering design, we advocate for concentrated efforts to construct
extensive multi-modal design datasets, develop effective data-driven MMML
techniques tailored to design applications, and enhance the scalability and
interpretability of MMML models. As the next generation of intelligent design tools, MMML models hold great promise to reshape how products are designed.
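To make the fusion concept concrete, the sketch below shows a simple late-fusion predictor that combines image and text features for a design-related prediction task; the feature dimensions and the regression target are hypothetical placeholders, not a dataset or model from the review.

```python
# Illustrative sketch of simple late fusion for a multi-modal prediction
# task; dimensions and the regression target are hypothetical.
import torch
import torch.nn as nn

class LateFusionPredictor(nn.Module):
    def __init__(self, img_dim=512, txt_dim=300, hidden=128, out_dim=1):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        # Concatenate per-modality representations and predict a design
        # property (e.g. a performance score) from the fused vector.
        self.head = nn.Linear(2 * hidden, out_dim)

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_branch(img_feat),
                           self.txt_branch(txt_feat)], dim=-1)
        return self.head(fused)
```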
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Inspired by the fact that human brains can emphasize discriminative parts of the input and suppress irrelevant ones, numerous local mechanisms have been designed to advance computer vision. They not only focus on target parts to learn discriminative local representations, but also process information selectively to improve efficiency. In terms of application
scenarios and paradigms, local mechanisms have different characteristics. In
this survey, we provide a systematic review of local mechanisms for various
computer vision tasks and approaches, including fine-grained visual
recognition, person re-identification, few-/zero-shot learning, multi-modal
learning, self-supervised learning, Vision Transformers, and so on.
The categorization of local mechanisms in each field is summarized. The advantages and disadvantages of every category are then analyzed in depth, leaving room for further exploration. Finally, future research directions on local mechanisms that may benefit future work are discussed. To the best of our knowledge, this is the first survey on local mechanisms in computer vision. We hope that this survey can shed light on future research in the computer vision field.
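The following minimal sketch illustrates one common form of local mechanism, a spatial attention module that up-weights discriminative regions of a feature map and suppresses irrelevant ones; the shapes and layer choices are illustrative assumptions, not a specific method from the survey.

```python
# Minimal sketch of a local (spatial) attention mechanism; shapes are
# illustrative assumptions.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # A 1x1 convolution scores each spatial location of the feature map
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):                   # feat: (B, C, H, W)
        attn = torch.sigmoid(self.score(feat)) # (B, 1, H, W), values in [0, 1]
        return feat * attn                     # emphasise local parts, keep shape
```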
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
Existing visual question answering methods tend to capture cross-modal spurious correlations and fail to discover the true causal mechanisms that support truthful reasoning grounded in the dominant visual evidence and the question intention. In addition, existing methods usually ignore cross-modal event-level understanding, which requires jointly modeling event temporality, causality, and dynamics. In this work, we focus on event-level
visual question answering from a new perspective, i.e., cross-modal causal
relational reasoning, by introducing causal intervention methods to discover
the true causal structures for visual and linguistic modalities. Specifically,
we propose a novel event-level visual question answering framework named
Cross-Modal Causal RelatIonal Reasoning (CMCIR), to achieve robust
causality-aware visual-linguistic question answering. To discover cross-modal
causal structures, the Causality-aware Visual-Linguistic Reasoning (CVLR)
module is proposed to collaboratively disentangle the visual and linguistic
spurious correlations via front-door and back-door causal interventions. To
model the fine-grained interactions between linguistic semantics and
spatial-temporal representations, we build a Spatial-Temporal Transformer (STT)
that creates multi-modal co-occurrence interactions between visual and
linguistic content. To adaptively fuse the causality-aware visual and linguistic
features, we introduce a Visual-Linguistic Feature Fusion (VLFF) module that
leverages the hierarchical linguistic semantic relations as the guidance to
learn the global semantic-aware visual-linguistic representations adaptively.
Extensive experiments on four event-level datasets demonstrate the superiority
of our CMCIR in discovering visual-linguistic causal structures and achieving
robust event-level visual question answering.
Comment: 17 pages, 9 figures. This work has been submitted to the IEEE for possible publication. The datasets, code and models are available at https://github.com/YangLiu9208/CMCI
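As a rough illustration of how causal intervention is often approximated in vision-language models, the sketch below performs back-door adjustment over a fixed dictionary of confounder prototypes; it conveys the general idea only and is not the authors' CMCIR implementation, and all names and shapes are assumptions.

```python
# Schematic sketch of back-door adjustment over a confounder dictionary;
# a generic illustration, not the CMCIR implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackdoorAdjustment(nn.Module):
    def __init__(self, feat_dim=512, num_confounders=100):
        super().__init__()
        # Dictionary of confounder prototypes (e.g. cluster centres of
        # training features), treated here as learnable for simplicity.
        self.confounders = nn.Parameter(torch.randn(num_confounders, feat_dim))

    def forward(self, x):                          # x: (B, feat_dim)
        # Attend from the input feature to every confounder, then take the
        # expectation over confounders, approximating sum_z P(z) f(x, z).
        attn = F.softmax(x @ self.confounders.t() / x.size(-1) ** 0.5, dim=-1)
        adjusted = attn @ self.confounders         # (B, feat_dim)
        return x + adjusted                        # combine original and adjusted
```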
Multimodal sentiment analysis in real-life videos
This thesis extends the emerging field of multimodal sentiment analysis of real-life videos, taking two components into consideration: the emotion and the emotion's target.
The emotion component of media is traditionally represented as a segment-based intensity model of emotion classes. This representation is replaced here by a value- and time-continuous view. Adjacent research fields, such as affective computing, have largely neglected the linguistic information available from automatic transcripts of audio-video material. As is demonstrated here, this text modality is well-suited for time- and value-continuous prediction. Moreover, source-specific problems, such as trustworthiness, have been largely unexplored so far.
This work examines perceived trustworthiness of the source, and its quantification, in user-generated video data and presents a possible modelling path. Furthermore, the transfer between the continuous and discrete emotion representations is explored in order to summarise the emotional context at a segment level.
The other component deals with the target of the emotion, for example, the topic the speaker is addressing. Emotion targets in a video dataset can, as is shown here, be coherently extracted based on automatic transcripts without limiting a priori parameters, such as the expected number of targets. Furthermore, alternatives to purely linguistic investigation in predicting targets, such as knowledge-bases and multimodal systems, are investigated.
A new dataset is designed for this investigation, and, in conjunction with proposed novel deep neural networks, extensive experiments are conducted to explore the components described above.
The developed systems show robust prediction results and demonstrate the strengths of the respective modalities, feature sets, and modelling techniques. Finally, foundations are laid for cross-modal information prediction systems, with applications to the correction of corrupted in-the-wild signals from real-life videos.
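As a rough illustration of time- and value-continuous prediction from transcript features, the sketch below regresses a continuous emotion value (e.g. valence) at every time step with a small recurrent model; the feature dimension and architecture are illustrative assumptions, not the systems developed in the thesis.

```python
# Minimal sketch of time- and value-continuous emotion prediction from
# transcript features; dimensions and architecture are assumptions.
import torch
import torch.nn as nn

class ContinuousEmotionRegressor(nn.Module):
    def __init__(self, txt_dim=300, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(txt_dim, hidden, batch_first=True)
        # One regression value (e.g. valence or arousal) per time step
        self.head = nn.Linear(hidden, 1)

    def forward(self, txt_seq):               # txt_seq: (B, T, txt_dim)
        out, _ = self.rnn(txt_seq)
        return self.head(out).squeeze(-1)     # (B, T) continuous predictions
```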