414 research outputs found
Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation
As a combination of visual and audio signals, video is inherently
multi-modal. However, existing video generation methods are primarily intended
for the synthesis of visual frames, whereas audio signals in realistic videos
are disregarded. In this work, we concentrate on the rarely investigated problem
of text-guided sounding video generation and propose the Sounding Video
Generator (SVG), a unified framework for generating realistic videos along with
audio signals. Specifically, we present the SVG-VQGAN to transform visual
frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a
novel hybrid contrastive learning method to model inter-modal and intra-modal
consistency and improve the quantized representations. A cross-modal attention
module is employed to extract associated features of visual frames and audio
signals for contrastive learning. Then, a Transformer-based decoder is used to
model associations between texts, visual frames, and audio signals at token
level for auto-regressive sounding video generation. AudioSetCap, a human
annotated text-video-audio paired dataset, is produced for training SVG.
Experimental results on the Kinetics and VAS datasets demonstrate the
superiority of our method over existing text-to-video generation methods as
well as audio generation methods.
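The hybrid contrastive learning above is only summarized; as a rough, hypothetical sketch, its inter-modal term could be an InfoNCE-style loss over paired visual and audio features (all names, shapes, and values here are illustrative, not the paper's actual implementation):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu = math.sqrt(dot(u, u)) or 1.0
    nv = math.sqrt(dot(v, v)) or 1.0
    return dot(u, v) / (nu * nv)

def infonce_intermodal(video_feats, audio_feats, temperature=0.1):
    """InfoNCE over aligned video/audio feature pairs.

    video_feats[i] and audio_feats[i] come from the same clip (positive
    pair); every other pairing in the batch serves as a negative.
    """
    n = len(video_feats)
    loss = 0.0
    for i in range(n):
        logits = [cosine(video_feats[i], a) / temperature for a in audio_feats]
        log_denom = math.log(sum(math.exp(x) for x in logits))
        loss += -(logits[i] - log_denom)  # -log softmax of the positive pair
    return loss / n

# Toy batch of two clips with 3-d features per modality.
v = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
# Correctly aligned pairs yield a lower loss than shuffled ones.
print(infonce_intermodal(v, a) < infonce_intermodal(v, a[::-1]))
```

Minimizing such a term pulls features of the same clip together across modalities, which is the inter-modal consistency the abstract refers to.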
Deep Lifelong Cross-modal Hashing
Hashing methods have made significant progress in cross-modal retrieval tasks
with fast query speed and low storage cost. Among them, deep learning-based
hashing achieves better performance on large-scale data due to its excellent
extraction and representation ability for nonlinear heterogeneous features.
However, two main challenges remain: catastrophic forgetting when data with
new categories arrive continuously, and the time-consuming retraining
required to update non-continuous hashing retrieval. To this end, we propose
in this paper a novel deep lifelong cross-modal hashing method that achieves
lifelong hashing retrieval instead of repeatedly re-training the hash
function when new data arrive. Specifically, we design a lifelong learning
strategy that updates hash functions by training directly on the incremental
data instead of retraining new hash functions on all the accumulated data,
which significantly reduces training time. Then, we propose a lifelong
hashing loss that enables original hash codes to participate in lifelong
learning while remaining invariant, and further preserves the similarity and
dissimilarity among original and incremental hash codes to maintain
performance. Additionally, considering distribution heterogeneity when new
data arrive continuously, we introduce multi-label semantic similarity to
supervise hash learning; detailed analysis shows that this similarity
improves performance. Experimental results on benchmark datasets show that
the proposed method achieves comparable performance to recent
state-of-the-art cross-modal hashing methods, yields average improvements of
over 20\% in retrieval accuracy, and reduces training time by over 80\% when
new data arrive continuously.
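As a toy illustration of the multi-label semantic similarity supervision described above (the helper names and the exact loss form are our own simplification, not the paper's): label-set overlap defines how similar two hash codes should be.

```python
def multilabel_similarity(labels_a, labels_b):
    """Soft similarity in [0, 1]: Jaccard overlap of the two label sets."""
    inter = len(labels_a & labels_b)
    union = len(labels_a | labels_b)
    return inter / union if union else 0.0

def hamming_similarity(code_a, code_b):
    """Fraction of matching bits between two {-1, +1} hash codes."""
    return sum(x == y for x, y in zip(code_a, code_b)) / len(code_a)

def pairwise_hash_loss(codes, labels):
    """Penalize mismatch between label similarity and code similarity."""
    n = len(codes)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s = multilabel_similarity(labels[i], labels[j])
            h = hamming_similarity(codes[i], codes[j])
            loss += (s - h) ** 2
    return loss

codes = [[1, 1, -1, 1], [1, 1, -1, -1], [-1, -1, 1, -1]]
labels = [{"cat", "pet"}, {"cat", "pet"}, {"car"}]
# Items 0 and 1 share labels and most bits, item 2 shares neither,
# so the total penalty is small.
print(pairwise_hash_loss(codes, labels))
```

In the lifelong setting, the original codes would be held fixed while only the codes of incremental data are optimized against this kind of objective.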
2023-2024 Catalog
The 2023-2024 Governors State University Undergraduate and Graduate Catalog is a comprehensive listing of current information regarding: Degree Requirements, Course Offerings, and Undergraduate and Graduate Rules and Regulations.
Seamless Multimodal Biometrics for Continuous Personalised Wellbeing Monitoring
Artificially intelligent perception is increasingly present in the lives of
every one of us. Vehicles are no exception, (...) In the near future, pattern
recognition will have an even stronger role in vehicles, as self-driving cars
will require automated ways to understand what is happening around (and within)
them and act accordingly. (...) This doctoral work focused on advancing
in-vehicle sensing through the research of novel computer vision and pattern
recognition methodologies for both biometrics and wellbeing monitoring. The
main focus has been on electrocardiogram (ECG) biometrics, a trait well-known
for its potential for seamless driver monitoring. Major efforts were devoted to
achieving improved performance in identification and identity verification in
off-the-person scenarios, well-known for increased noise and variability. Here,
end-to-end deep learning ECG biometric solutions were proposed and important
topics were addressed such as cross-database and long-term performance,
waveform relevance through explainability, and interlead conversion. Face
biometrics, a natural complement to the ECG in seamless unconstrained
scenarios, was also studied in this work. The open challenges of masked face
recognition and interpretability in biometrics were tackled in an effort to
evolve towards algorithms that are more transparent, trustworthy, and robust to
significant occlusions. Within the topic of wellbeing monitoring, improved
solutions to multimodal emotion recognition in groups of people and
activity/violence recognition in in-vehicle scenarios were proposed. Finally,
we also proposed a novel way to learn template security within end-to-end
models, dismissing additional separate encryption processes, and a
self-supervised learning approach tailored to sequential data, in order to
ensure data security and optimal performance. (...)
Comment: Doctoral thesis presented and approved on the 21st of December 2022
to the University of Port
Geometric Learning on Graph Structured Data
Graphs provide a ubiquitous and universal data structure that can be applied in many domains such as social networks, biology, chemistry, physics, and computer science. In this thesis we focus on two fundamental paradigms in graph learning: representation learning and similarity learning over graph-structured data. Graph representation learning aims to learn embeddings for nodes by integrating topological and feature information of a graph. Graph similarity learning brings into play similarity functions that compute similarity between pairs of graphs in a vector space. We address several challenging issues in these two paradigms, designing powerful, yet efficient and theoretically guaranteed machine learning models that can leverage rich topological structural properties of real-world graphs.
This thesis is structured into two parts. In the first part of the thesis, we will present how to develop powerful Graph Neural Networks (GNNs) for graph representation learning from three different perspectives: (1) spatial GNNs, (2) spectral GNNs, and (3) diffusion GNNs. We will discuss the model architecture, representational power, and convergence properties of these GNN models. Specifically, we first study how to develop expressive, yet efficient and simple message-passing aggregation schemes that can go beyond the Weisfeiler-Leman test (1-WL). We propose a generalized message-passing framework by incorporating graph structural properties into an aggregation scheme. Then, we introduce a new local isomorphism hierarchy on neighborhood subgraphs. We further develop a novel neural model, namely GraphSNN, and theoretically prove that this model is more expressive than the 1-WL test. After that, we study how to build an effective and efficient graph convolution model with spectral graph filters. In this study, we propose a spectral GNN model, called DFNets, which incorporates a novel spectral graph filter, namely feedback-looped filters. As a result, this model can provide better localization on neighborhoods while achieving fast convergence and linear memory requirements. Finally, we study how to capture the rich topological information of a graph using graph diffusion. We propose a novel GNN architecture with dynamic PageRank, based on a learnable transition matrix. We explore two variants of this GNN architecture: a forward-Euler solution and an invariable feature solution, and theoretically prove that our forward-Euler GNN architecture is guaranteed to converge to a stationary distribution.
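The generalized message-passing idea above can be caricatured in a few lines; the structural coefficient below (common-neighborhood size) is only a stand-in for GraphSNN's actual scheme, which this summary does not spell out:

```python
def neighbors(adj, v):
    """Neighbor set of node v in a dense adjacency matrix."""
    return {u for u, w in enumerate(adj[v]) if w}

def structural_weight(adj, v, u):
    """Toy structural coefficient: size of the common neighborhood
    of the edge (v, u), plus 1 so every edge carries some weight."""
    return 1.0 + len(neighbors(adj, v) & neighbors(adj, u))

def aggregate(adj, feats, v):
    """One structure-aware message-passing step for node v:
    the node's own feature plus a structurally weighted sum of
    its neighbors' features."""
    out = list(feats[v])
    for u in neighbors(adj, v):
        w = structural_weight(adj, v, u)
        for k in range(len(out)):
            out[k] += w * feats[u][k]
    return out

# Triangle 0-1-2 plus a pendant node 3 attached to 0.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
feats = [[1.0], [1.0], [1.0], [1.0]]
# Edges inside the triangle share a neighbor (weight 2);
# the pendant edge (0, 3) shares none (weight 1), so plain
# degree-counting aggregation would not tell them apart.
print(aggregate(adj, feats, 0))
```

Weighting messages by edge-level structure like this is what lets such schemes distinguish neighborhoods that look identical to a purely degree-based 1-WL-style aggregation.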
In the second part of this thesis, we will introduce a new optimal transport distance metric on graphs in a regularized learning framework for graph kernels. This optimal transport distance metric can preserve both local and global structures between graphs during the transport, in addition to preserving features and their local variations. Furthermore, we propose two strongly convex regularization terms to theoretically guarantee the convergence and numerical stability in finding an optimal assignment between graphs. One regularization term is used to regularize a Wasserstein distance between graphs in the same ground space. This helps to preserve the local clustering structure on graphs by relaxing the optimal transport problem to be a cluster-to-cluster assignment between locally connected vertices. The other regularization term is used to regularize a Gromov-Wasserstein distance between graphs across different ground spaces based on degree-entropy KL divergence. This helps to improve the matching robustness of an optimal alignment to preserve the global connectivity structure of graphs. We have evaluated our optimal transport-based graph kernel on different benchmark tasks. The experimental results show that our models considerably outperform all the state-of-the-art methods in all benchmark tasks.
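For intuition, the computational core of such optimal transport distances is the standard entropic (Sinkhorn) scheme sketched below; the thesis's two strongly convex regularizers are not reproduced here:

```python
import math

def sinkhorn(cost, mu, nu, eps=0.1, iters=200):
    """Entropic-regularized optimal transport between histograms mu, nu.

    Returns a transport plan T whose row sums approach mu and column
    sums approach nu. This is the textbook Sinkhorn iteration, standing
    in for the graph-specific regularized variants of the thesis.
    """
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * len(mu)
    v = [1.0] * len(nu)
    for _ in range(iters):
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(len(nu)))
             for i in range(len(mu))]
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(len(mu)))
             for j in range(len(nu))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(nu))]
            for i in range(len(mu))]

# Two 2-node graphs whose matching nodes are cheap to pair (diagonal).
cost = [[0.0, 1.0], [1.0, 0.0]]
mu = [0.5, 0.5]
nu = [0.5, 0.5]
plan = sinkhorn(cost, mu, nu)
# Mass concentrates on the cheap diagonal pairings.
```

The graph-kernel constructions above add structure-preserving terms on top of this basic assignment so that the plan respects clustering and connectivity, not just pointwise cost.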
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited
Recommendation models that utilize unique identities (IDs) to represent
distinct users and items have been state-of-the-art (SOTA) and dominated the
recommender systems (RS) literature for over a decade. Meanwhile, the
pre-trained modality encoders, such as BERT and ViT, have become increasingly
powerful in modeling the raw modality features of an item, such as text and
images. Given this, a natural question arises: can a purely modality-based
recommendation model (MoRec) outperform or match a pure ID-based model
(IDRec) by replacing the itemID embedding with a SOTA modality encoder? In
fact, this question was answered ten years ago, when IDRec beat MoRec by a
large margin in both recommendation accuracy and efficiency. We aim to revisit
this `old' question and systematically study MoRec from several aspects.
Specifically, we study several sub-questions: (i) which recommendation
paradigm, MoRec or IDRec, performs better in practical scenarios, especially in
the general setting and warm item scenarios where IDRec has a strong advantage?
does this hold for items with different modality features? (ii) can the latest
technical advances from other communities (i.e., natural language processing
and computer vision) translate into accuracy improvement for MoRec? (iii) how
to effectively utilize item modality representation, can we use it directly or
do we have to adjust it with new data? (iv) are there some key challenges for
MoRec to be solved in practical applications? To answer them, we conduct
rigorous experiments for item recommendations with two popular modalities,
i.e., text and vision. We provide the first empirical evidence that MoRec is
already comparable to its IDRec counterpart with an expensive end-to-end
training method, even for warm item recommendation. Our results potentially
imply that the dominance of IDRec in the RS field may be greatly challenged in
the future.
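In code, the contrast between the two paradigms is roughly the following (every name here is hypothetical; a real MoRec would use an encoder like BERT or ViT, not a toy function):

```python
# IDRec: each item id indexes its own trainable embedding row,
# so an unseen item has no representation at all.
ID_EMBEDDINGS = {"item_42": [0.1, -0.3, 0.7]}

def id_item_vector(item_id):
    """Look up the learned per-item embedding (fails for new items)."""
    return ID_EMBEDDINGS[item_id]

def modality_item_vector(item_text, encode_text):
    """MoRec: the item vector comes from a modality encoder applied to
    the item's raw content, so cold-start items still get a vector."""
    return encode_text(item_text)

def toy_encoder(text):
    """Stand-in for a pretrained text encoder: a few character counts
    plus the text length, purely for illustration."""
    return [text.count("a"), text.count("e"), len(text)]

v_known = modality_item_vector("red sneakers", toy_encoder)
v_new = modality_item_vector("blue sneakers", toy_encoder)  # no ID needed
```

The paper's question is precisely whether swapping the lookup table for a strong encoder like this can match or beat IDRec, and at what training cost.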
MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation
The goal of sequential recommendation (SR) is to predict the items a user is
potentially interested in based on his/her historical interaction sequences. Most
existing sequential recommenders are developed based on ID features, which,
despite their widespread use, often underperform with sparse IDs and struggle
with the cold-start problem. Besides, inconsistent ID mappings hinder the
model's transferability, isolating similar recommendation domains that could
have been co-optimized. This paper aims to address these issues by exploring
the potential of multi-modal information in learning robust and generalizable
sequence representations. We propose MISSRec, a multi-modal pre-training and
transfer learning framework for SR. On the user side, we design a
Transformer-based encoder-decoder model, where the contextual encoder learns to
capture the sequence-level multi-modal synergy while a novel interest-aware
decoder is developed to grasp item-modality-interest relations for better
sequence representation. On the candidate item side, we adopt a dynamic fusion
module to produce user-adaptive item representation, providing more precise
matching between users and items. We pre-train the model with contrastive
learning objectives and fine-tune it in an efficient manner. Extensive
experiments demonstrate the effectiveness and flexibility of MISSRec, promising
a practical solution for real-world recommendation scenarios.
Comment: Accepted to ACM MM 202
BLE-based Indoor Localization and Contact Tracing Approaches
Internet of Things (IoT) has penetrated different aspects of modern life with smart sensors being prevalent within our surrounding indoor environments. Furthermore, dependence on IoT-based Contact Tracing (CT) models has significantly increased mainly due to the COVID-19 pandemic. There is, therefore, an urgent quest to develop/design efficient, autonomous, trustworthy, and secure indoor CT solutions leveraging accurate indoor localization/tracking approaches. In this context, the first objective of this Ph.D. thesis is to enhance the accuracy of Bluetooth Low Energy (BLE)-based indoor localization. BLE-based localization is typically performed based on the Received Signal Strength Indicator (RSSI). Extreme fluctuations of the RSSI occurring due to different factors such as multi-path effects and noise, however, prevent BLE technology from being a reliable solution with acceptable accuracy for dynamic tracking/localization in indoor environments. In this regard, first, an IoT dataset is constructed based on multiple thoroughly separated indoor environments to incorporate the effects of various interferences faced in different spaces. The constructed dataset is then used to develop a Reinforcement Learning (RL)-based information fusion strategy to form a multiple-model implementation consisting of RSSI, Pedestrian Dead Reckoning (PDR), and Angle-of-Arrival (AoA)-based models. In the second part of the thesis, the focus is devoted to the application of multi-agent Deep Neural Networks (DNN) models for indoor tracking. DNN-based approaches are, however, prone to overfitting and high sensitivity to parameter selection, which results in sample inefficiency. Moreover, data labelling is a time-consuming and costly procedure. To address these issues, we leverage Successor Representations (SR)-based techniques, which can learn the expected discounted future state occupancy and the immediate reward of each state.
A Deep Multi-Agent Successor Representation framework is proposed that can adapt to changes in a multi-agent environment faster than Model-Free (MF) RL methods and with a lower computational cost compared to Model-Based (MB) RL algorithms. In the third part of the thesis, the developed indoor localization techniques are utilized to design a novel indoor CT solution, referred to as the Trustworthy Blockchain-enabled system for Indoor Contact Tracing (TB-ICT) framework. The TB-ICT is a fully distributed and innovative blockchain platform exploiting the proposed dynamic Proof of Work (dPoW) approach coupled with Randomized Hash Window (W-Hash) and dynamic Proof of Credit (dPoC) mechanisms.
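For context, RSSI-based ranging as discussed above typically inverts the log-distance path-loss model; a minimal sketch with illustrative calibration values:

```python
def rssi_to_distance(rssi_dbm, tx_power_dbm=-59.0, path_loss_exp=2.0):
    """Invert the log-distance path-loss model:
    RSSI = tx_power - 10 * n * log10(d), with d in metres.

    tx_power_dbm is the measured RSSI at 1 m; both it and the
    path-loss exponent n must be calibrated per environment. The
    thesis's point is that raw RSSI fluctuates far too much for this
    alone to be reliable, motivating fusion with PDR and AoA models.
    """
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10.0 * path_loss_exp))

print(rssi_to_distance(-59.0))  # at the 1 m reference power: 1.0 m
print(rssi_to_distance(-79.0))  # 20 dB weaker with n = 2: 10.0 m
```

A multi-path reflection or a body blocking the line of sight can easily shift RSSI by several dB, which this model converts into metres of error, hence the fusion strategies above.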
A Survey of Multimodal Information Fusion for Smart Healthcare: Mapping the Journey from Data to Wisdom
Multimodal medical data fusion has emerged as a transformative approach in
smart healthcare, enabling a comprehensive understanding of patient health and
personalized treatment plans. In this paper, a journey from data to information
to knowledge to wisdom (DIKW) is explored through multimodal fusion for smart
healthcare. We present a comprehensive review of multimodal medical data fusion
focused on the integration of various data modalities. The review explores
different approaches such as feature selection, rule-based systems, machine
learning, deep learning, and natural language processing, for fusing and
analyzing multimodal data. This paper also highlights the challenges associated
with multimodal fusion in healthcare. By synthesizing the reviewed frameworks
and theories, it proposes a generic framework for multimodal medical data
fusion that aligns with the DIKW model. Moreover, it discusses future
directions related to the four pillars of healthcare: Predictive, Preventive,
Personalized, and Participatory approaches. The components of the comprehensive
survey presented in this paper form the foundation for more successful
implementation of multimodal fusion in smart healthcare. Our findings can guide
researchers and practitioners in leveraging the power of multimodal fusion with
the state-of-the-art approaches to revolutionize healthcare and improve patient
outcomes.
Comment: This work has been submitted to Elsevier for possible
publication. Copyright may be transferred without notice, after which this
version may no longer be accessible