Semi-supervised Deep Generative Modelling of Incomplete Multi-Modality Emotional Data
Emotion recognition faces three challenges. First, it is difficult
to recognize a person's emotional state from a single modality alone.
Second, manually annotating emotional data is expensive. Third,
emotional data often suffer from missing modalities due to unforeseeable
sensor malfunctions or configuration issues. In this paper, we address all of these
problems under a novel multi-view deep generative framework. Specifically, we
propose to model the statistical relationships of multi-modality emotional data
using multiple modality-specific generative networks with a shared latent
space. By imposing a Gaussian mixture assumption on the posterior approximation
of the shared latent variables, our framework can learn the joint deep
representation from multiple modalities and evaluate the importance of each
modality simultaneously. To solve the labeled-data-scarcity problem, we extend
our multi-view model to the semi-supervised learning scenario by casting the
semi-supervised classification problem as a specialized missing-data imputation
task. To address the missing-modality problem, we further extend our
semi-supervised multi-view model to deal with incomplete data, where a missing
view is treated as a latent variable and integrated out during inference. This
way, the proposed overall framework can utilize all available (both labeled and
unlabeled, as well as both complete and incomplete) data to improve its
generalization ability. The experiments conducted on two real multi-modal
emotion datasets demonstrate the superiority of our framework.
Comment: arXiv admin note: text overlap with arXiv:1704.07548; 2018 ACM Multimedia Conference (MM'18)
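To make the modelling idea above concrete, the following is a minimal PyTorch sketch of modality-specific encoders and decoders sharing a latent space, with a Gaussian-mixture posterior whose learned weights act as per-modality importance scores. All module names, dimensions, and loss choices are illustrative assumptions on our part, not the authors' implementation (which additionally handles labels and missing views).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalVAE(nn.Module):
    """Sketch: modality-specific encoders/decoders around a shared latent
    space; the approximate posterior is a Gaussian mixture whose learned
    weights can be read as per-modality importance scores."""
    def __init__(self, dims=(128, 64), z_dim=32):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, 2 * z_dim) for d in dims)
        self.decoders = nn.ModuleList(nn.Linear(z_dim, d) for d in dims)
        self.mix_logits = nn.Parameter(torch.zeros(len(dims)))  # modality importance

    def forward(self, xs):
        # One Gaussian component per modality: q(z|x) = sum_m w_m N(mu_m, sigma_m).
        stats = [enc(x).chunk(2, dim=-1) for enc, x in zip(self.encoders, xs)]
        w = F.softmax(self.mix_logits, dim=0)                   # mixture weights
        batch = xs[0].size(0)
        # Hard-sample a component per example for brevity; a full treatment
        # would marginalize over components or use a relaxed sample.
        comp = torch.multinomial(w.expand(batch, -1), 1).squeeze(-1)
        idx = torch.arange(batch)
        mu = torch.stack([m for m, _ in stats], dim=1)[idx, comp]
        logvar = torch.stack([lv for _, lv in stats], dim=1)[idx, comp]
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterization
        recons = [dec(z) for dec in self.decoders]
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        rec = sum(F.mse_loss(r, x) for r, x in zip(recons, xs))
        return rec + kl, w                                      # ELBO-style loss, importances
```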
Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching
Automatic emotion recognition is an active research topic with a wide range of
applications. Due to the high manual annotation cost and inevitable label
ambiguity, emotion recognition datasets are limited in both
scale and quality. Therefore, one of the key challenges is how to build
effective models with limited data resources. Previous works have explored
different approaches to tackle this challenge, including data enhancement,
transfer learning, and semi-supervised learning. However, these existing
approaches suffer from weaknesses such as training instability, large
performance loss during transfer, or only marginal improvement.
In this work, we propose a novel semi-supervised multi-modal emotion
recognition model based on cross-modality distribution matching, which
leverages abundant unlabeled data to enhance the model training under the
assumption that the inner emotional status is consistent at the utterance level
across modalities.
We conduct extensive experiments to evaluate the proposed model on two
benchmark datasets, IEMOCAP and MELD. The experimental results show that the
proposed semi-supervised learning model can effectively utilize unlabeled data
and combine multiple modalities to boost emotion recognition performance,
outperforming other state-of-the-art approaches under the same conditions.
The proposed model also achieves competitive performance compared with existing
approaches that take advantage of additional auxiliary information such as
speaker and interaction context.
Comment: 10 pages, 5 figures, to be published at ACM Multimedia 202
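As a concrete illustration of the cross-modal distribution-matching idea, here is a small PyTorch sketch: supervised cross-entropy on labeled utterances plus a symmetric KL term that encourages the per-modality class distributions of unlabeled utterances to agree (the "same underlying emotion" assumption). The tensor names and the choice of divergence are our assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits_a, logits_t, logits_a_u, logits_t_u, labels):
    # Supervised term on labeled utterances (e.g., audio and text classifiers).
    sup = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_t, labels)
    # Distribution matching on unlabeled utterances: symmetric KL between
    # the two modalities' predicted emotion distributions.
    log_pa = F.log_softmax(logits_a_u, dim=-1)
    log_pt = F.log_softmax(logits_t_u, dim=-1)
    match = 0.5 * (F.kl_div(log_pa, log_pt.exp(), reduction="batchmean")
                   + F.kl_div(log_pt, log_pa.exp(), reduction="batchmean"))
    return sup + match
```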
A Concise yet Effective Model for Non-Aligned Incomplete Multi-view and Missing Multi-label Learning
In reality, learning from multi-view multi-label data inevitably confronts
three challenges: missing labels, incomplete views, and non-aligned views.
Existing methods mainly address the first two and commonly need multiple
assumptions to attack them, so that even state-of-the-art methods involve at least two
explicit hyper-parameters, making model selection quite difficult. Worse,
they fail to handle the third challenge, let alone address
the three jointly. In this paper, we aim to meet these challenges under the fewest
assumptions by building a concise yet effective model with just one
hyper-parameter. To ease the insufficiency of available labels, we exploit not only
the consensus of multiple views but also the global and local structures hidden
among multiple labels. Specifically, we introduce an indicator matrix to tackle
the first two challenges in a regression form while aligning the same
individual labels and all labels of different views in a common label space to
battle the third challenge. During alignment, we characterize the global and local
structures of multiple labels as high-rank and low-rank, respectively.
Subsequently, an efficient algorithm with linear time complexity in the number
of samples is established. Finally, even without view alignment, our method
substantially outperforms state-of-the-art methods that rely on view alignment on five real
datasets.
Comment: 15 pages, 7 figures
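To illustrate the indicator-matrix idea, the following NumPy sketch shows how missing labels can simply drop out of a regression-style loss; the nuclear-norm term is only a stand-in for the low-rank (local) label-structure regularizer. This is a simplified reading under our own assumptions; the paper's actual formulation, its single hyper-parameter, and its linear-time solver differ.

```python
import numpy as np

def masked_multiview_loss(Xs, Y, Ws, G, lam=1.0):
    # Xs: list of view matrices (n x d_v); Ws: per-view weights (d_v x c);
    # Y: label matrix (n x c); G: indicator matrix, 1 = observed, 0 = missing.
    loss = 0.0
    for X, W in zip(Xs, Ws):
        P = X @ W                                    # predictions in the common label space
        loss += np.sum(G * (P - Y) ** 2)             # missing labels drop out of the loss
        loss += lam * np.linalg.norm(P, ord="nuc")   # stand-in low-rank regularizer
    return loss
```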
Discriminative Multimodal Learning via Conditional Priors in Generative Models
Deep generative models with latent variables have been used lately to learn
joint representations and generative processes from multi-modal data. These two
learning mechanisms can, however, conflict with each other and representations
can fail to embed information on the data modalities. This research studies the
realistic scenario in which all modalities and class labels are available for
model training, but where some modalities and labels required for downstream
tasks are missing. We show that, in this scenario, the variational lower bound
limits the mutual information between joint representations and missing modalities.
To counteract these problems, we introduce a novel conditional multi-modal
discriminative model that uses an informative prior distribution and optimizes
a likelihood-free objective function that maximizes mutual information between
joint representations and missing modalities. Extensive experimentation
demonstrates the benefits of our proposed model: empirical results show that
it achieves state-of-the-art results in representative problems such as
downstream classification, acoustic inversion, and image and annotation
generation.
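One concrete, likelihood-free way to maximize such a mutual-information term is an InfoNCE-style contrastive bound between the joint representation and the embedding of the modality that will be missing downstream. The sketch below uses InfoNCE purely as an illustrative stand-in; the paper's actual objective and its conditional prior are not reproduced here.

```python
import torch
import torch.nn.functional as F

def info_nce(z_joint, m_missing, temperature=0.1):
    # z_joint: (B, d) joint representations; m_missing: (B, d) embeddings of
    # the modality that is absent at test time. Matched pairs share an index.
    z = F.normalize(z_joint, dim=-1)
    m = F.normalize(m_missing, dim=-1)
    logits = z @ m.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)           # positives on the diagonal
```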
Inconsistent Matters: A Knowledge-guided Dual-consistency Network for Multi-modal Rumor Detection
Rumor spreaders are increasingly utilizing multimedia content to attract the
attention and trust of news consumers. Though quite a few rumor detection
models have exploited multi-modal data, they seldom consider the
inconsistent semantics between images and texts, and rarely spot the
inconsistency between the post contents and background knowledge. In addition,
they commonly assume the completeness of multiple modalities and are thus
incapable of handling missing modalities in real-life scenarios.
Motivated by the intuition that rumors in social media are more likely to have
inconsistent semantics, a novel Knowledge-guided Dual-consistency Network is
proposed to detect rumors with multimedia contents. It uses two consistency
detection subnetworks to capture the inconsistency at the cross-modal level and
the content-knowledge level simultaneously. It also enables robust multi-modal
representation learning under different missing visual modality conditions,
using a special token to discriminate between posts with visual modality and
posts without visual modality. Extensive experiments on three public real-world
multimedia datasets demonstrate that our framework can outperform the
state-of-the-art baselines under both complete and incomplete modality
conditions. Our code is available at https://github.com/MengzSun/KDCN
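The special-token mechanism can be sketched as a learnable placeholder embedding that replaces the visual features whenever a post has no image, so the fusion network can distinguish "image present" from "image absent". The snippet below is our illustrative reading, with hypothetical names and dimensions.

```python
import torch
import torch.nn as nn

class VisualOrPlaceholder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.no_img = nn.Parameter(torch.zeros(dim))  # learnable "no image" token

    def forward(self, img_feat, has_image):
        # img_feat: (B, dim) visual features; has_image: (B,) boolean mask.
        placeholder = self.no_img.expand_as(img_feat)
        return torch.where(has_image.unsqueeze(-1), img_feat, placeholder)
```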
Pathway to Future Symbiotic Creativity
This report presents a comprehensive view of our vision of the development
path of human-machine symbiotic art creation. We propose a classification
of creative systems into a hierarchy of five classes, showing the pathway of
creativity evolving from mimic-human artists (Turing Artists) to a machine
artist in its own right. We begin with an overview of the limitations of
Turing Artists, then focus on the top two levels of the hierarchy, Machine Artists,
emphasizing machine-human communication in art creation. In art creation,
machines need to understand humans' mental states, including desires,
appreciation, and emotions; humans also need to understand machines' creative
capabilities and limitations. The rapid development of immersive environments
and their further evolution into the new concept of the metaverse enable symbiotic art
creation through unprecedented flexibility of bi-directional communication
between artists and art manifestation environments. By examining the latest
sensor and XR technologies, we illustrate a novel way of collecting art data
that constitutes the basis of a new form of human-machine bidirectional
communication and understanding in art creation. Based on such communication
and understanding mechanisms, we propose a novel framework for building future
Machine artists, which comes with the philosophy that a human-compatible AI
system should be based on the "human-in-the-loop" principle rather than the
traditional "end-to-end" dogma. By proposing a new form of inverse
reinforcement learning model, we outline the platform design of machine
artists, demonstrate its functions, and showcase some examples of technologies
we have developed. We also provide a systematic exposition of the ecosystem for
the AI-based symbiotic art form and community, with an economic model built on NFT
technology. Ethical issues for the development of machine artists are also
discussed.
Enhancing Modality-Agnostic Representations via Meta-Learning for Brain Tumor Segmentation
In medical vision, different imaging modalities provide complementary
information. However, in practice, not all modalities may be available during
inference or even training. Previous approaches, e.g., knowledge distillation
or image synthesis, often assume the availability of full modalities for all
patients during training; this is unrealistic and impractical due to the
variability in data collection across sites. We propose a novel approach to
learn enhanced modality-agnostic representations by employing a meta-learning
strategy in training, even when only limited full modality samples are
available. Meta-learning enhances partial modality representations to full
modality representations by meta-training on partial modality data and
meta-testing on limited full modality samples. Additionally, we co-supervise
this feature enrichment by introducing an auxiliary adversarial learning
branch. More specifically, a missing modality detector is used as a
discriminator to mimic the full modality setting. Our segmentation framework
significantly outperforms state-of-the-art brain tumor segmentation techniques
in missing modality scenarios.
Comment: Accepted at ICCV 202
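A rough sketch of the auxiliary adversarial branch: randomly drop modalities, train a detector to recognize which ones were dropped from the fused feature, and train the segmentation network to make dropped-modality features look "full". The meta-train/meta-test split over partial and full modality samples is omitted, and all function names and the loss weighting are hypothetical.

```python
import torch
import torch.nn.functional as F

def adversarial_missing_modality_step(fuse, seg_head, detector, xs, target,
                                      adv_weight=0.1):
    # xs: list of per-modality image tensors; target: segmentation labels.
    mask = (torch.rand(len(xs)) > 0.5).float()       # 1 = modality kept, 0 = dropped
    if mask.sum() == 0:
        mask[torch.randint(len(xs), (1,))] = 1.0     # keep at least one modality
    dropped = [x * m for x, m in zip(xs, mask)]
    feat = fuse(dropped)                             # modality-agnostic feature
    seg_loss = F.cross_entropy(seg_head(feat), target)
    det_logits = detector(feat)                      # predicts, per modality, kept vs. dropped
    det_target = mask.unsqueeze(0).expand(det_logits.size(0), -1)
    det_loss = F.binary_cross_entropy_with_logits(det_logits, det_target)
    # The detector minimizes det_loss; the segmentation network minimizes
    # seg_loss - adv_weight * det_loss, i.e. it tries to fool the detector.
    return seg_loss - adv_weight * det_loss, det_loss
```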