Preserving Specificity in Federated Graph Learning for fMRI-based Neurological Disorder Identification
Resting-state functional magnetic resonance imaging (rs-fMRI) offers a
non-invasive approach to examining abnormal brain connectivity associated with
brain disorders. Graph neural networks (GNNs) have gained popularity in fMRI
representation learning and brain disorder analysis owing to their powerful graph
representation capabilities. Training a general GNN often necessitates a
large-scale dataset from multiple imaging centers/sites, but centralizing
multi-site data generally faces inherent challenges related to data privacy,
security, and storage burden. Federated Learning (FL) enables collaborative
model training without centralized multi-site fMRI data. Unfortunately,
previous FL approaches for fMRI analysis often ignore site-specificity,
including demographic factors such as age, gender, and education level. To this
end, we propose a specificity-aware federated graph learning (SFGL) framework
for rs-fMRI analysis and automated brain disorder identification, with a server
and multiple clients/sites for federated model aggregation and prediction. At
each client, our model consists of a shared and a personalized branch, where
parameters of the shared branch are sent to the server while those of the
personalized branch remain local. This can facilitate knowledge sharing among
sites and also helps preserve site specificity. In the shared branch, we employ
a spatio-temporal attention graph isomorphism network to learn dynamic fMRI
representations. In the personalized branch, we integrate vectorized
demographic information (i.e., age, gender, and education years) and functional
connectivity networks to preserve site-specific characteristics.
Representations generated by the two branches are then fused for
classification. Experimental results on two fMRI datasets with a total of 1,218
subjects suggest that SFGL outperforms several state-of-the-art approaches.
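The shared/personalized split described above can be illustrated with a short sketch. The class and function names, dimensions, and FedAvg-style averaging below are illustrative assumptions, not the paper's exact implementation:

```python
# Minimal sketch: two-branch client model with shared-branch-only federation.
import copy
import torch
import torch.nn as nn

class ClientModel(nn.Module):
    """Two-branch client model: the shared branch is federated,
    the personalized branch never leaves the site."""
    def __init__(self, shared: nn.Module, personalized: nn.Module,
                 shared_dim: int, pers_dim: int, num_classes: int):
        super().__init__()
        self.shared = shared              # spatio-temporal graph branch (weights sent to the server)
        self.personalized = personalized  # demographics + functional-connectivity branch (kept local)
        self.classifier = nn.Linear(shared_dim + pers_dim, num_classes)

    def forward(self, fmri_graph, demographics, fc_network):
        h_shared = self.shared(fmri_graph)                        # dynamic fMRI representation
        h_personal = self.personalized(demographics, fc_network)  # site-specific representation
        return self.classifier(torch.cat([h_shared, h_personal], dim=-1))

def aggregate_shared_branches(client_models):
    """FedAvg-style averaging applied only to the shared branches of all clients."""
    avg_state = copy.deepcopy(client_models[0].shared.state_dict())
    for key in avg_state:
        avg_state[key] = torch.stack(
            [m.shared.state_dict()[key].float() for m in client_models]).mean(dim=0)
    for m in client_models:
        m.shared.load_state_dict(avg_state)   # personalized branches are left untouched
```

Only the `shared` sub-module ever reaches the server; the demographic and functional-connectivity processing stays at each site, which is what preserves site specificity in this scheme.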
Face-to-Face Contrastive Learning for Social Intelligence Question-Answering
Creating artificial social intelligence - algorithms that can understand the
nuances of multi-person interactions - is an exciting and emerging challenge in
processing facial expressions and gestures from multimodal videos. Recent
multimodal methods have set the state of the art on many tasks, but have
difficulty modeling the complex face-to-face conversational dynamics across
speaking turns in social interaction, particularly in a self-supervised setup.
In this paper, we propose Face-to-Face Contrastive Learning (F2F-CL), a graph
neural network designed to model social interactions using factorization nodes
to contextualize the multimodal face-to-face interaction along the boundaries
of the speaking turn. With the F2F-CL model, we propose to perform contrastive
learning between the factorization nodes of different speaking turns within the
same video. We experimentally evaluate our approach on the challenging Social-IQ
dataset and show state-of-the-art results.
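As a rough illustration of contrasting factorization-node embeddings across speaking turns, here is a generic InfoNCE-style sketch. Treating other turns of the same video as positives and turns from other videos in the batch as negatives is an assumption for illustration, not necessarily the paper's exact pairing:

```python
# Generic InfoNCE-style loss over factorization-node embeddings (one per speaking turn).
import torch
import torch.nn.functional as F

def turn_contrastive_loss(node_emb: torch.Tensor, video_ids: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """node_emb: (N, d) factorization-node embeddings, one per speaking turn.
    video_ids: (N,) index of the video each turn comes from."""
    z = F.normalize(node_emb, dim=-1)
    sim = z @ z.t() / temperature                           # pairwise cosine similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (video_ids.unsqueeze(0) == video_ids.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))            # never contrast a node with itself
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()                 # average over nodes with positives
```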
TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models
Pre-trained large language models have recently achieved ground-breaking
performance in a wide variety of language understanding tasks. However, the
same model cannot be applied to multimodal behavior understanding tasks (e.g.,
video sentiment/humor detection) unless non-verbal features (e.g., acoustic and
visual) can be integrated with language. Jointly modeling multiple modalities
significantly increases the model complexity, and makes the training process
data-hungry. While an enormous amount of text data is available via the web,
collecting large-scale multimodal behavioral video datasets is extremely
expensive, both in terms of time and money. In this paper, we investigate
whether large language models alone can successfully incorporate non-verbal
information when it is presented in textual form. We present a way to
convert the acoustic and visual information into corresponding textual
descriptions and concatenate them with the spoken text. We feed this augmented
input to a pre-trained BERT model and fine-tune it on three downstream
multimodal tasks: sentiment, humor, and sarcasm detection. Our approach,
TextMI, significantly reduces model complexity, adds interpretability to the
model's decision, and can be applied for a diverse set of tasks while achieving
superior (multimodal sarcasm detection) or near SOTA (multimodal sentiment
analysis and multimodal humor detection) performance. We propose TextMI as a
general, competitive baseline for multimodal behavioral analysis tasks,
particularly in a low-resource setting.
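The textualize-and-concatenate idea can be sketched with standard Hugging Face components. The prompt layout, separator usage, and cue descriptions below are illustrative assumptions rather than the paper's exact formatting:

```python
# Sketch: turn non-verbal cues into text, concatenate with the spoken words,
# and classify with a (to-be-fine-tuned) BERT model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

spoken_text = "well that was a great plan"
acoustic_desc = "the speaker's pitch is high and the voice sounds excited"   # hypothetical textual cue
visual_desc = "the speaker is smiling and raising their eyebrows"            # hypothetical textual cue

# Augmented input: spoken text followed by textual descriptions of the non-verbal cues.
augmented = f"{spoken_text} [SEP] {acoustic_desc} [SEP] {visual_desc}"
inputs = tokenizer(augmented, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1)   # e.g., sarcastic vs. not sarcastic once fine-tuned
```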
Joyful: Joint Modality Fusion and Graph Contrastive Learning for Multimodal Emotion Recognition
Multimodal emotion recognition aims to recognize emotions for each utterance
of multiple modalities, which has received increasing attention for its
application in human-machine interaction. Current graph-based methods fail to
simultaneously depict global contextual features and local diverse uni-modal
features in a dialogue. Furthermore, as the number of graph layers increases,
they are prone to over-smoothing. In this paper, we propose a
method for joint modality fusion and graph contrastive learning for multimodal
emotion recognition (Joyful), where multimodality fusion, contrastive learning,
and emotion recognition are jointly optimized. Specifically, we first design a
new multimodal fusion mechanism that can provide deep interaction and fusion
between the global contextual and uni-modal specific features. Then, we
introduce a graph contrastive learning framework with inter-view and intra-view
contrastive losses to learn more distinguishable representations for samples
with different sentiments. Extensive experiments on three benchmark datasets
indicate that Joyful achieves state-of-the-art (SOTA) performance compared with
all baselines.
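A generic (GRACE-style) inter-/intra-view graph contrastive loss in the spirit of the above might look like the following; the exact loss formulation and view construction in Joyful may differ:

```python
# Sketch: one-directional inter-/intra-view contrastive loss over two augmented
# views of the same dialogue graph (symmetrizing over both views is a common extension).
import torch
import torch.nn.functional as F

def graph_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, d) node embeddings from two augmented views.
    Inter-view: node i in view 1 vs. node i in view 2 is the positive pair.
    Intra-view: the other nodes of the same view act as additional negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    inter = torch.exp(z1 @ z2.t() / tau)     # (N, N) cross-view similarities
    intra = torch.exp(z1 @ z1.t() / tau)     # (N, N) within-view similarities
    pos = inter.diag()
    denom = inter.sum(dim=1) + intra.sum(dim=1) - intra.diag()  # exclude self-similarity
    return -torch.log(pos / denom).mean()
```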
NetGPT: A Native-AI Network Architecture Beyond Provisioning Personalized Generative Services
Large language models (LLMs) have achieved tremendous success in empowering
daily life with generative information, and personalizing LLMs could further
enhance their applications through better alignment with human intent.
Towards personalized generative services, a collaborative cloud-edge
methodology sounds promising, as it facilitates the effective orchestration of
heterogeneous distributed communication and computing resources. In this
article, after discussing the pros and cons of several candidate cloud-edge
collaboration techniques, we put forward NetGPT to capably deploy appropriate
LLMs at the edge and the cloud in accordance with their computing capacity. In
addition, edge LLMs could efficiently leverage location-based information for
personalized prompt completion, thus benefiting the interaction with cloud
LLMs. After deploying representative open-source LLMs (e.g., GPT-2-base and
LLaMA) at the edge and the cloud, we demonstrate the feasibility of NetGPT on
the basis of low-rank adaptation-based light-weight fine-tuning. Subsequently,
we highlight the essential changes required for a native artificial
intelligence (AI) network architecture towards NetGPT, with special emphasis on
deeper integration of communications and computing resources and careful
calibration of logical AI workflow. Furthermore, we demonstrate several
by-product benefits of NetGPT, given edge LLM's astonishing capability to
predict trends and infer intents, which possibly leads to a unified solution
for intelligent network management and orchestration. In a nutshell, we argue
that NetGPT is a promising native-AI network architecture beyond provisioning
personalized generative services.
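Low-rank adaptation (LoRA), the light-weight fine-tuning technique mentioned above, can be sketched in plain PyTorch by adding a trainable low-rank update to a frozen linear layer. The hyper-parameters and wrapping strategy below are illustrative assumptions, not NetGPT's exact setup:

```python
# Sketch of a LoRA-wrapped linear layer: the pretrained weight stays frozen,
# only the low-rank matrices A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                              # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.t() @ self.lora_b.t())

# Usage sketch: wrap, e.g., the attention projection layers of an edge-side GPT-2 with
# LoRALinear so that only the small A/B matrices need to be trained and exchanged.
```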
Adaptive Graph Spatial-Temporal Transformer Network for Traffic Flow Forecasting
Traffic flow forecasting on graphs has real-world applications in many
fields, such as transportation systems and computer networks. Traffic
forecasting can be highly challenging due to complex spatial-temporal
correlations and non-linear traffic patterns. Existing works mostly model such
spatial-temporal dependencies by considering spatial correlations and temporal
correlations separately and fail to model the direct spatial-temporal
correlations. Inspired by the recent success of transformers in the graph
domain, in this paper, we propose to directly model the cross-spatial-temporal
correlations on the spatial-temporal graph using local multi-head
self-attentions. To reduce the time complexity, we set the attention receptive
field to the spatially neighboring nodes, and we also introduce an adaptive
graph to capture the hidden spatial-temporal dependencies. Based on these
attention mechanisms, we propose a novel Adaptive Graph Spatial-Temporal
Transformer Network (ASTTN), which stacks multiple spatial-temporal attention
layers to apply self-attention on the input graph, followed by linear layers
for predictions. Experimental results on public traffic network datasets,
METR-LA, PEMS-BAY, PeMSD4, and PeMSD7, demonstrate the superior performance of
our model.
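The locally restricted self-attention with an adaptive graph can be sketched as follows (single head, one spatial slice, for brevity). Combining the given and learned adjacencies into a union mask is an assumption about the details, not ASTTN's exact design:

```python
# Sketch: self-attention over graph nodes, masked to spatial neighbours,
# with a learnable adaptive adjacency adding hidden dependencies.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSpatialAttention(nn.Module):
    def __init__(self, num_nodes: int, dim: int, emb_dim: int = 16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.node_emb = nn.Parameter(torch.randn(num_nodes, emb_dim))  # for the adaptive graph

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_nodes, dim); adj: (num_nodes, num_nodes) adjacency (0/1, self-loops optional)
        adaptive = torch.relu(self.node_emb @ self.node_emb.t())   # learned hidden dependencies
        mask = (adj + adaptive) > 0                                # union of given and adaptive edges
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.size(-1) ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))          # attend only within the local mask
        return F.softmax(scores, dim=-1) @ self.v(x)
```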