63 research outputs found
Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text
Real world multimedia data is often composed of multiple modalities such as
an image or a video with associated text (e.g. captions, user comments, etc.)
and metadata. Such multimodal data packages are prone to manipulations, where a
subset of these modalities can be altered to misrepresent or repurpose data
packages, with possible malicious intent. It is, therefore, important to
develop methods to assess or verify the integrity of these multimedia packages.
Using computer vision and natural language processing methods to directly
compare the image (or video) and the associated caption to verify the integrity
of a media package is only possible for a limited set of objects and scenes. In
this paper, we present a novel deep learning-based approach for assessing the
semantic integrity of multimedia packages containing images and captions, using
a reference set of multimedia packages. We construct a joint embedding of
images and captions with deep multimodal representation learning on the
reference dataset in a framework that also provides image-caption consistency
scores (ICCSs). The integrity of query media packages is assessed as the
inlierness of the query ICCSs with respect to the reference dataset. We present
the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media
packages from Flickr, which we make available to the research community. We use
both the newly created dataset as well as Flickr30K and MS COCO datasets to
quantitatively evaluate our proposed approach. The reference dataset does not
contain unmanipulated versions of tampered query packages. Our method is able
to achieve F1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K and MS COCO,
respectively, for detecting semantically incoherent media packages.
Comment: *Ayush Jaiswal and Ekraam Sabir contributed equally to the work in this paper.
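As a rough illustration of the pipeline described in this abstract, the sketch below computes an image-caption consistency score (ICCS) as a cosine similarity in a joint embedding space and treats integrity assessment as inlier detection against a reference set. The encoders and the isolation-forest detector are stand-ins chosen for brevity, not the authors' actual models.

```python
# Minimal sketch of the consistency-score idea, not the authors' implementation:
# embed_image / embed_caption stand in for learned multimodal encoders, and an
# isolation forest stands in for whatever inlierness model is actually used.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

def embed_image(image):
    # Hypothetical image encoder: returns an L2-normalized embedding.
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def embed_caption(caption):
    # Hypothetical text encoder mapping into the same joint space.
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def iccs(image, caption):
    # Image-caption consistency score: cosine similarity in the joint space.
    return float(embed_image(image) @ embed_caption(caption))

# Reference set of (image, caption) packages -> reference ICCS distribution.
reference_scores = np.array([[iccs(f"img_{i}", f"cap_{i}")] for i in range(500)])
detector = IsolationForest(random_state=0).fit(reference_scores)

# A query package is flagged as possibly manipulated if its ICCS is an
# outlier with respect to the reference distribution.
query_score = np.array([[iccs("query_img", "query_caption")]])
is_consistent = detector.predict(query_score)[0] == 1
print("semantically consistent" if is_consistent else "possibly manipulated")
```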
Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention
Automatic emotion recognition (ER) has recently gained a lot of interest due to
its potential in many real-world applications. In this context, multimodal
approaches have been shown to improve performance (over unimodal approaches) by
combining diverse and complementary sources of information, providing some
robustness to noisy and missing modalities. In this paper, we focus on
dimensional ER based on the fusion of facial and vocal modalities extracted
from videos, where complementary audio-visual (A-V) relationships are explored
to predict an individual's emotional states in valence-arousal space. Most
state-of-the-art fusion techniques rely on recurrent networks or conventional
attention mechanisms that do not effectively leverage the complementary nature
of A-V modalities. To address this problem, we introduce a joint
cross-attentional model for A-V fusion that extracts the salient features
across A-V modalities, allowing it to effectively leverage the inter-modal
relationships while retaining the intra-modal relationships. In particular, it
computes the cross-attention weights based on the correlation between the joint
feature representation and that of the individual modalities. Deploying the
joint A-V feature representation in the cross-attention module helps to
simultaneously leverage both the intra- and inter-modal relationships, thereby
significantly improving the performance of the system over the vanilla
cross-attention module. The effectiveness of our proposed approach is validated
experimentally on challenging videos from the RECOLA and AffWild2 datasets.
Results indicate that our joint cross-attentional A-V fusion model provides a
cost-effective solution that can outperform state-of-the-art approaches, even
when the modalities are noisy or absent.
Comment: arXiv admin note: substantial text overlap with arXiv:2203.14779, arXiv:2111.0522
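The sketch below illustrates, in simplified form, how cross-attention weights can be derived from the correlation between a joint audio-visual representation and each individual modality. The weight matrices, dimensions, and tanh scoring are illustrative assumptions rather than the paper's exact formulation.

```python
# Simplified numpy sketch of joint cross-attention between audio (A) and
# visual (V) feature sequences; all parameters here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 64                       # sequence length and per-modality feature size

A = rng.normal(size=(T, d))         # audio features, one row per time step
V = rng.normal(size=(T, d))         # visual features, aligned with the audio

J = np.concatenate([A, V], axis=1)  # joint A-V representation, shape (T, 2d)

W_a = rng.normal(size=(2 * d, d)) * 0.01   # joint -> audio correlation weights
W_v = rng.normal(size=(2 * d, d)) * 0.01   # joint -> visual correlation weights

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention weights computed from the correlation between the joint
# representation and each individual modality.
attn_a = softmax(np.tanh(J @ W_a) @ A.T / np.sqrt(d))   # (T, T)
attn_v = softmax(np.tanh(J @ W_v) @ V.T / np.sqrt(d))   # (T, T)

A_att = attn_a @ A                  # audio features re-weighted by joint context
V_att = attn_v @ V                  # visual features re-weighted by joint context

fused = np.concatenate([A_att, V_att], axis=1)   # fused A-V representation
print(fused.shape)                               # (50, 128)
```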
Learning Cross-Modality Representations from Multi-Modal Images
Machine learning algorithms can have difficulties adapting to data from different sources, for example from different imaging modalities. We present and analyze three techniques for unsupervised cross-modality feature learning, using a shared autoencoder-like convolutional network that learns a common representation from multi-modal data. We investigate a form of feature normalization, a learning objective that minimizes cross-modality differences, and modality dropout, in which the network is trained with varying subsets of modalities. We measure the same-modality and cross-modality classification accuracies and explore whether the models learn modality-specific or shared features. This paper presents experiments on two public datasets, with knee images from two MRI modalities, provided by the Osteoarthritis Initiative, and brain tumor segmentation on four MRI modalities from the BRATS challenge. All three approaches improved the cross-modality classification accuracy, with modality dropout and per-feature normalization giving the largest improvement. We observed that the networks tend to learn a combination of cross-modality and modality-specific features. Overall, a combination of all three methods produced the most cross-modality features and the highest cross-modality classification accuracy, while maintaining most of the same-modality accuracy.
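Two of the ingredients mentioned above, a cross-modality objective on the shared representation and modality dropout at the input, can be sketched as follows. The one-layer encoder and the specific loss are stand-ins for the shared autoencoder-like convolutional network the paper actually uses.

```python
# Illustrative numpy sketch of a cross-modality objective and modality dropout;
# the encoder is a stand-in, not the paper's convolutional network.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # Stand-in shared encoder: a single linear layer with a ReLU.
    return np.maximum(x @ W, 0.0)

d_in, d_feat = 256, 32
W_shared = rng.normal(size=(d_in, d_feat)) * 0.05

x_mod1 = rng.normal(size=(8, d_in))   # a batch of cases in one MRI modality
x_mod2 = rng.normal(size=(8, d_in))   # the same cases in another modality

# Cross-modality objective: encodings of the same case should agree
# regardless of which modality produced them.
z1, z2 = encode(x_mod1, W_shared), encode(x_mod2, W_shared)
cross_modality_loss = np.mean((z1 - z2) ** 2)

# Modality dropout: zero out a random subset of modalities at the input,
# so the shared encoder must learn features that survive missing modalities.
drop = rng.random(2) < 0.5
inputs = [np.zeros_like(x) if d else x for x, d in zip((x_mod1, x_mod2), drop)]

print(f"cross-modality feature difference: {cross_modality_loss:.4f}")
print(f"modalities dropped this step: {int(drop.sum())}")
```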
Learning to Make Predictions on Graphs with Autoencoders
We examine two fundamental tasks associated with graph representation
learning: link prediction and semi-supervised node classification. We present a
novel autoencoder architecture capable of learning a joint representation of
both local graph structure and available node features for the multi-task
learning of link prediction and node classification. Our autoencoder
architecture is efficiently trained end-to-end in a single learning stage to
simultaneously perform link prediction and node classification, whereas
previous related methods require multiple training steps that are difficult to
optimize. We provide a comprehensive empirical evaluation of our models on nine
benchmark graph-structured datasets and demonstrate significant improvement
over related methods for graph representation learning. Reference code and data
are available at https://github.com/vuptran/graph-representation-learning
Comment: Published as a conference paper at IEEE DSAA 2018
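A minimal sketch of the multi-task setup described here: a single encoder produces node embeddings from the adjacency matrix and node features, an inner-product decoder scores links, and a softmax head classifies nodes. The layers and shapes below are illustrative assumptions; the reference code at the linked repository is the authoritative implementation.

```python
# Illustrative numpy sketch of a graph autoencoder trained for both link
# prediction and node classification from a shared set of node embeddings.
import numpy as np

rng = np.random.default_rng(0)
n, d_feat, d_hid, n_classes = 6, 16, 8, 3

A = (rng.random((n, n)) < 0.3).astype(float)     # adjacency matrix
A = np.maximum(A, A.T)                            # make it symmetric
X = rng.normal(size=(n, d_feat))                  # node features

# Encoder: mix node features with neighborhood structure.
W_enc = rng.normal(size=(d_feat, d_hid)) * 0.1
Z = np.maximum((A + np.eye(n)) @ X @ W_enc, 0.0)  # node embeddings

# Task 1, link prediction: inner-product decoder reconstructs edge scores.
edge_scores = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))

# Task 2, node classification: linear head over the same embeddings.
W_cls = rng.normal(size=(d_hid, n_classes)) * 0.1
logits = Z @ W_cls
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(edge_scores.shape, probs.shape)             # (6, 6) (6, 3)
```

Both heads share the encoder, so a single training stage can optimize a combined link-reconstruction and classification loss over the same embeddings.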
Computer audition for emotional wellbeing
This thesis is focused on the application of computer audition (i.e., machine listening) methodologies for monitoring states of emotional wellbeing. Computer audition is a growing field and has been successfully applied to an array of use cases in recent years. There are several advantages to audio-based computational analysis; for example, audio can be recorded non-invasively, stored economically, and can capture rich information on happenings in a given environment, e.g., human behaviour. With this in mind, maintaining emotional wellbeing is a challenge for humans, and emotion-altering conditions, including stress and anxiety, have become increasingly common in recent years. Such conditions manifest in the body, inherently changing how we express ourselves. Research shows these alterations are perceivable within vocalisation, suggesting that speech-based audio monitoring may be valuable for developing artificially intelligent systems that target improved wellbeing. Furthermore, computer audition applies machine learning and other computational techniques to audio understanding, and so by combining computer audition with applications in the domain of computational paralinguistics and emotional wellbeing, this research concerns the broader field of empathy for Artificial Intelligence (AI). To this end, speech-based audio modelling that incorporates and understands paralinguistic wellbeing-related states may be a vital cornerstone for improving the degree of empathy that an artificial intelligence has.
To summarise, this thesis investigates the extent to which speech-based computer audition methodologies can be utilised to understand human emotional wellbeing. A fundamental background on the fields in question as they pertain to emotional wellbeing is first presented, followed by an outline of the applied audio-based methodologies. Next, detail is provided for several machine learning experiments focused on emotional wellbeing applications, including analysis and recognition of under-researched phenomena in speech, e.g., anxiety, and markers of stress. Core contributions from this thesis include the collection of several related datasets, hybrid fusion strategies for an emotional gold standard, novel machine learning strategies for data interpretation, and an in-depth acoustic-based computational evaluation of several human states. All of these contributions focus on ascertaining the advantage of audio in the context of modelling emotional wellbeing. Given the sensitive nature of human wellbeing, the ethical implications involved with developing and applying such systems are discussed throughout.
The effect of concussion history on relevancy-based gating of visual and tactile stimuli
At the cortical level, incoming sensory inputs are subject to interruption as they are transmitted to modality-specific cortical areas, in order to prevent excessive processing of task-irrelevant sensory information and aid in planning appropriate responses. This interruption of stimulus transmission is known as sensory gating, and it is an important component of attentional orienting: effective gating allows attention to be oriented only to stimuli which are task-relevant. Concussion is a condition in which attentional orienting processes appear to be disrupted, but the details of this disruption and its underlying mechanisms are unknown. The experiments contained within this thesis address questions related to the effects of concussion on sensory processing. This thesis aimed to characterize the electrophysiological and behavioural correlates of attentional orienting and sensory gating on a sensory selection task; to build an understanding of the cortical mechanisms involved in sensory selection under conditions of varying task-relevance and in the presence of distractor stimuli; and to address the clinical hypothesis that a history of concussion injury leads to problems gating irrelevant sensory information out of the processing stream. The results provide insight into the process of sensory gating based on task-relevance, the top-down and bottom-up factors that modulate attentional orienting, the cortical networks involved, and the disruption to these processes that can occur with concussion.