VIP5: Towards Multimodal Foundation Models for Recommendation
Computer Vision (CV), Natural Language Processing (NLP), and Recommender
Systems (RecSys) are three prominent AI applications that have traditionally
developed independently, resulting in disparate modeling and engineering
methodologies. This has impeded these fields' ability to benefit directly
from each other's advancements. With the recent development of
foundation models, large language models have emerged as a potential
general-purpose interface for unifying different modalities and problem
formulations. In light of this, we propose the development of a multimodal
foundation model (MFM) considering visual, textual, and personalization
modalities under the P5 recommendation paradigm, thus named VIP5 (Visual P5),
to unify various modalities and recommendation tasks. This will enable the
processing of multiple modalities in a shared architecture for improved
recommendations. To achieve this, we introduce multimodal personalized prompts
to accommodate multiple modalities under a shared format. Additionally, we
propose a parameter-efficient training method for foundation models, which
involves freezing the P5 backbone and fine-tuning lightweight adapters,
resulting in improved recommendation performance and increased efficiency in
terms of training time and memory usage. Code and data of VIP5 are available at
https://github.com/jeykigung/VIP5.
Comment: Accepted by EMNLP 2023
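The parameter-efficient recipe the abstract describes (freeze the backbone, train only small adapters) can be sketched in a few lines. This is a minimal NumPy illustration of the bottleneck-adapter idea; the layer sizes, the zero-initialized up-projection, and all variable names are illustrative assumptions, not VIP5's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen backbone layer: these weights stay fixed during fine-tuning.
d_model, d_bottleneck = 16, 4
W_backbone = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

# Lightweight adapter: down-project, nonlinearity, up-project, residual.
# Only these two small matrices would receive gradient updates.
W_down = rng.standard_normal((d_model, d_bottleneck)) / np.sqrt(d_model)
W_up = np.zeros((d_bottleneck, d_model))  # zero init: adapter starts as a no-op

def adapter_layer(x):
    """Frozen backbone output plus a residual bottleneck-adapter path."""
    h = x @ W_backbone                           # frozen computation
    return h + np.maximum(h @ W_down, 0) @ W_up  # trainable adapter path

x = rng.standard_normal((2, d_model))
y = adapter_layer(x)
```

Because the up-projection is zero-initialized, fine-tuning starts exactly from the frozen backbone's behavior, which is one common way adapters are made stable to train.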
Tissue Segmentation of Thick-Slice Fetal Brain MR Scans with Guidance from High-Quality Isotropic Volumes
Accurate tissue segmentation of thick-slice fetal brain magnetic resonance
(MR) scans is crucial for both reconstruction of isotropic brain MR volumes and
the quantification of fetal brain development. However, this task is
challenging due to the use of thick-slice scans in clinically-acquired fetal
brain data. To address this issue, we propose to leverage high-quality
isotropic fetal brain MR volumes (and also their corresponding annotations) as
guidance for segmentation of thick-slice scans. Due to the existence of a
significant domain gap between the high-quality isotropic volumes (i.e., source
data) and thick-slice scans (i.e., target data), we employ a domain adaptation technique
to achieve the associated knowledge transfer (from high-quality
volumes to thick-slice scans). Specifically, we first register the
available high-quality isotropic fetal brain MR volumes across different
gestational weeks to construct longitudinally-complete source data. To capture
domain-invariant information, we then perform Fourier decomposition to extract
image content and style codes. Finally, we propose a novel Cycle-Consistent
Domain Adaptation Network (C2DA-Net) to efficiently transfer the knowledge
learned from high-quality isotropic volumes for accurate tissue segmentation of
thick-slice scans. Our C2DA-Net can fully utilize a small set of annotated
isotropic volumes to guide tissue segmentation on unannotated thick-slice
scans. Extensive experiments on a large-scale dataset of 372 clinically
acquired thick-slice MR scans demonstrate that our C2DA-Net achieves much
better performance than cutting-edge methods quantitatively and qualitatively.
Comment: 10 pages, 9 figures, 5 tables, Fetal MRI, Brain tissue segmentation,
Unsupervised domain adaptation, Cycle-consistency
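The abstract's Fourier decomposition into "content" and "style" codes is commonly instantiated by treating the phase spectrum as content and the (low-frequency) amplitude spectrum as style, as in Fourier-domain adaptation. The sketch below shows that decomposition on a toy image; the function names and the band-size parameter `beta` are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def fourier_content_style(img):
    """Split an image into 'content' (phase) and 'style' (amplitude) spectra."""
    spec = np.fft.fft2(img)
    return np.angle(spec), np.abs(spec)

def swap_style(img_src, img_tgt, beta=0.1):
    """Give img_src the low-frequency amplitude (style) of img_tgt while
    keeping img_src's phase (content). beta sets the swapped band size."""
    phase_s, amp_s = fourier_content_style(img_src)
    _, amp_t = fourier_content_style(img_tgt)
    h, w = img_src.shape
    bh, bw = int(h * beta), int(w * beta)
    # After fftshift, the centered low-frequency block carries global style.
    amp = np.fft.fftshift(amp_s.copy())
    amp_t_sh = np.fft.fftshift(amp_t)
    ch, cw = h // 2, w // 2
    amp[ch - bh:ch + bh + 1, cw - bw:cw + bw + 1] = \
        amp_t_sh[ch - bh:ch + bh + 1, cw - bw:cw + bw + 1]
    amp = np.fft.ifftshift(amp)
    # Recombine target-style amplitude with source phase (content).
    return np.real(np.fft.ifft2(amp * np.exp(1j * phase_s)))

rng = np.random.default_rng(1)
src = rng.standard_normal((32, 32))  # stand-in for an isotropic-volume slice
tgt = rng.standard_normal((32, 32))  # stand-in for a thick-slice scan
mixed = swap_style(src, tgt)
```

A useful sanity check of the decomposition is that swapping an image's style with itself reconstructs the image exactly, since amplitude times the complex phase recovers the original spectrum.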
The effects of co-colonising ectomycorrhizal fungi on mycorrhizal colonisation and sporocarp formation in Laccaria japonica colonising seedlings of Pinus densiflora
Forest trees are colonised by different species of ectomycorrhizal (ECM) fungi that interact competitively or mutualistically with one another. Most ECM fungi can produce sporocarps. To date, the effects of co-colonising fungal species on sporocarp formation in ECM fungi remain unknown. In this study, we examined host plant growth, mycorrhizal colonisation, and sporocarp formation when roots of Pinus densiflora are colonised by Laccaria japonica and three other ECM fungal species (Cenococcum geophilum, Pisolithus sp., and Suillus luteus). Sporocarp numbers were recorded throughout the experimental period. The biomass, photosynthetic rate, and mycorrhizal colonisation rate of the seedlings were also measured at 45 days, 62 days, and 1 year after seedlings were transplanted. Results indicated that C. geophilum and S. luteus may negatively impact mycorrhizal colonisation and sporocarp formation in L. japonica. Sporocarp formation in L. japonica was positively correlated with conspecific mycorrhizal colonisation but negatively correlated with the biomass of seedlings of P. densiflora. The co-occurring ECM fungi largely competed with L. japonica, resulting in various effects on mycorrhizal colonisation and sporocarp formation in L. japonica. A variety of mechanisms may be involved in the competitive interactions among the different ECM fungal species, including abilities to more rapidly colonise root tips, acquire soil nutrients, or produce antibiotics. These mechanisms need to be confirmed in further studies.
Peer reviewed
Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning
Despite the success of fully-supervised human skeleton sequence modeling,
utilizing self-supervised pre-training for skeleton sequence representation
learning has been an active field because acquiring task-specific skeleton
annotations at large scales is difficult. Recent studies focus on learning
video-level temporal and discriminative information using contrastive learning,
but overlook the hierarchical spatial-temporal nature of human skeletons.
Different from such superficial supervision at the video level, we propose a
self-supervised hierarchical pre-training scheme incorporated into a
hierarchical Transformer-based skeleton sequence encoder (Hi-TRS), to
explicitly capture spatial, short-term, and long-term temporal dependencies at
frame, clip, and video levels, respectively. To evaluate the proposed
self-supervised pre-training scheme with Hi-TRS, we conduct extensive
experiments covering three skeleton-based downstream tasks including action
recognition, action detection, and motion prediction. Under both supervised and
semi-supervised evaluation protocols, our method achieves the state-of-the-art
performance. Additionally, we demonstrate that the prior knowledge learned by
our model in the pre-training stage has strong transfer capability for
different downstream tasks.
Comment: Accepted to ECCV 2022
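The frame/clip/video hierarchy the abstract describes amounts to viewing one skeleton sequence at three granularities. The sketch below builds those three views for a toy sequence; the clip length, the NTU-style 25-joint layout, and the function name are illustrative assumptions, not Hi-TRS's actual data pipeline:

```python
import numpy as np

def hierarchical_views(seq, clip_len=4):
    """Split a skeleton sequence of shape (T, joints, coords) into the
    three granularities the hierarchy uses: individual frames (spatial
    level), short non-overlapping clips (short-term temporal level),
    and the whole video (long-term temporal level)."""
    T = seq.shape[0]
    frames = [seq[t] for t in range(T)]
    clips = [seq[t:t + clip_len] for t in range(0, T - clip_len + 1, clip_len)]
    video = seq
    return frames, clips, video

# Toy sequence: 8 frames, 25 joints, 3-D coordinates.
seq = np.zeros((8, 25, 3))
frames, clips, video = hierarchical_views(seq)
```

Each level can then be fed to its own encoder stage, so spatial structure, short-term motion, and long-term dynamics are supervised separately during pre-training.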
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Given an input video, its associated audio, and a brief caption, the
audio-visual scene aware dialog (AVSD) task requires an agent to engage in a
question-answer dialog with a human about the audio-visual content. This task
thus poses a challenging multi-modal representation learning and reasoning
scenario, advancements into which could influence several human-machine
interaction applications. To solve this task, we introduce a
semantics-controlled multi-modal shuffled Transformer reasoning framework,
consisting of a sequence of Transformer modules, each taking a modality as
input and producing representations conditioned on the input question. Our
proposed Transformer variant applies a shuffling scheme to its multi-head
outputs, which improves regularization. To encode fine-grained visual
information, we present a novel dynamic scene graph representation learning
pipeline that consists of an intra-frame reasoning layer producing
spatio-semantic graph representations for every frame, and an inter-frame
aggregation module capturing temporal cues. Our entire pipeline is trained
end-to-end. We present experiments on the benchmark AVSD dataset, both on
answer generation and selection tasks. Our results demonstrate state-of-the-art
performances on all evaluation metrics.
Comment: Accepted at AAAI 2021
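The head-shuffling idea above can be sketched in isolation: before the per-head outputs of a multi-head attention layer are concatenated and projected, their order is randomly permuted, so no slice of the output projection can over-specialize to one head. This is a minimal NumPy illustration under assumed shapes, not the paper's actual Transformer code:

```python
import numpy as np

def shuffle_heads(head_outputs, rng):
    """Randomly permute per-head outputs, shape (heads, time, head_dim),
    before concatenation. The permutation changes which slice of the
    downstream output projection sees which head, acting as a
    regularizer during training."""
    perm = rng.permutation(head_outputs.shape[0])
    return head_outputs[perm]

rng = np.random.default_rng(0)
# Toy multi-head output: 4 heads, 5 time steps, head dimension 2.
heads = np.arange(4 * 5 * 2, dtype=float).reshape(4, 5, 2)
shuffled = shuffle_heads(heads, rng)
```

Note the shuffle only reorders heads: each head's output tensor is left intact, so the information content is unchanged while the head-to-projection assignment is randomized.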