
    VIP5: Towards Multimodal Foundation Models for Recommendation

    Computer Vision (CV), Natural Language Processing (NLP), and Recommender Systems (RecSys) are three prominent AI applications that have traditionally developed independently, resulting in disparate modeling and engineering methodologies. This has impeded the ability of these fields to benefit directly from each other's advances. With the recent development of foundation models, large language models have emerged as a potential general-purpose interface for unifying different modalities and problem formulations. In light of this, we propose the development of a multimodal foundation model (MFM) considering visual, textual, and personalization modalities under the P5 recommendation paradigm, thus named VIP5 (Visual P5), to unify various modalities and recommendation tasks. This enables the processing of multiple modalities in a shared architecture for improved recommendations. To achieve this, we introduce multimodal personalized prompts to accommodate multiple modalities under a shared format. Additionally, we propose a parameter-efficient training method for foundation models, which involves freezing the P5 backbone and fine-tuning lightweight adapters, resulting in improved recommendation performance and increased efficiency in terms of training time and memory usage. Code and data of VIP5 are available at https://github.com/jeykigung/VIP5.
    Comment: Accepted by EMNLP 2023
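
    To make the parameter-efficient training idea concrete, here is a minimal PyTorch sketch: the backbone layers are frozen and only small bottleneck adapters receive gradients. The Adapter and AdaptedLayer classes, the bottleneck width, and the toy encoder stack are illustrative assumptions, not the actual VIP5 code.

        import torch
        import torch.nn as nn

        class Adapter(nn.Module):
            """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
            def __init__(self, d_model, bottleneck=64):
                super().__init__()
                self.down = nn.Linear(d_model, bottleneck)
                self.act = nn.ReLU()
                self.up = nn.Linear(bottleneck, d_model)

            def forward(self, x):
                return x + self.up(self.act(self.down(x)))

        class AdaptedLayer(nn.Module):
            """Runs a frozen backbone layer, then a trainable adapter."""
            def __init__(self, layer, d_model):
                super().__init__()
                self.layer, self.adapter = layer, Adapter(d_model)

            def forward(self, x):
                return self.adapter(self.layer(x))

        # Toy stand-in for the frozen backbone: a stack of encoder layers.
        d_model = 512
        backbone = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(6)
        )
        for p in backbone.parameters():
            p.requires_grad = False          # backbone stays frozen

        adapted = nn.ModuleList(AdaptedLayer(l, d_model) for l in backbone)
        trainable = [p for p in adapted.parameters() if p.requires_grad]
        optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # adapters only

        x = torch.randn(2, 16, d_model)      # (batch, tokens, features)
        for block in adapted:
            x = block(x)

    Because only the adapter weights receive gradients, the optimizer state and per-step memory scale with the adapter size rather than the backbone, which is consistent with the training-time and memory savings claimed above.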

    Tissue Segmentation of Thick-Slice Fetal Brain MR Scans with Guidance from High-Quality Isotropic Volumes

    Accurate tissue segmentation of thick-slice fetal brain magnetic resonance (MR) scans is crucial both for the reconstruction of isotropic brain MR volumes and for the quantification of fetal brain development. However, this task is challenging due to the use of thick-slice scans in clinically acquired fetal brain data. To address this issue, we propose to leverage high-quality isotropic fetal brain MR volumes (and their corresponding annotations) as guidance for the segmentation of thick-slice scans. Because of the significant domain gap between high-quality isotropic volumes (i.e., source data) and thick-slice scans (i.e., target data), we employ a domain adaptation technique to achieve the associated knowledge transfer (from high-quality volumes to thick-slice scans). Specifically, we first register the available high-quality isotropic fetal brain MR volumes across different gestational weeks to construct longitudinally complete source data. To capture domain-invariant information, we then perform Fourier decomposition to extract image content and style codes. Finally, we propose a novel Cycle-Consistent Domain Adaptation Network (C2DA-Net) to efficiently transfer the knowledge learned from high-quality isotropic volumes for accurate tissue segmentation of thick-slice scans. Our C2DA-Net can fully utilize a small set of annotated isotropic volumes to guide tissue segmentation on unannotated thick-slice scans. Extensive experiments on a large-scale dataset of 372 clinically acquired thick-slice MR scans demonstrate that our C2DA-Net achieves much better performance than cutting-edge methods, both quantitatively and qualitatively.
    Comment: 10 pages, 9 figures, 5 tables. Fetal MRI, brain tissue segmentation, unsupervised domain adaptation, cycle-consistency
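
    The Fourier content/style decomposition can be illustrated with a short NumPy sketch, in which the amplitude spectrum acts as a style code and the phase spectrum as a content code; swapping low-frequency amplitudes then moves an image toward another domain's appearance. This follows the common Fourier-domain-adaptation recipe; the function names and the band-size parameter beta are assumptions, not the C2DA-Net implementation.

        import numpy as np

        def fourier_content_style(image):
            """Split an image into a 'style' code (amplitude spectrum)
            and a 'content' code (phase spectrum) via the 2-D FFT."""
            spectrum = np.fft.fft2(image)
            return np.abs(spectrum), np.angle(spectrum)

        def swap_low_freq_style(src, tgt, beta=0.05):
            """Transfer the low-frequency amplitude (style) of tgt onto
            src while keeping src's phase (content)."""
            amp_s, pha_s = fourier_content_style(src)
            amp_t, _ = fourier_content_style(tgt)
            amp_s = np.fft.fftshift(amp_s)       # move low freqs to center
            amp_t = np.fft.fftshift(amp_t)
            h, w = src.shape
            bh, bw = int(h * beta), int(w * beta)
            cy, cx = h // 2, w // 2
            # Replace the centred low-frequency band of the source amplitude.
            amp_s[cy - bh:cy + bh, cx - bw:cx + bw] = \
                amp_t[cy - bh:cy + bh, cx - bw:cx + bw]
            amp_s = np.fft.ifftshift(amp_s)
            stylised = np.fft.ifft2(amp_s * np.exp(1j * pha_s))
            return np.real(stylised)

        # Example: restyle a source slice toward the target domain.
        src = np.random.rand(128, 128)
        tgt = np.random.rand(128, 128)
        out = swap_low_freq_style(src, tgt)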

    The effects of co-colonising ectomycorrhizal fungi on mycorrhizal colonisation and sporocarp formation in Laccaria japonica colonising seedlings of Pinus densiflora

    Forest trees are colonised by different species of ectomycorrhizal (ECM) fungi that interact competitively or mutualistically with one another. Most ECM fungi can produce sporocarps. To date, the effects of co-colonising fungal species on sporocarp formation in ECM fungi remain unknown. In this study, we examined host plant growth, mycorrhizal colonisation, and sporocarp formation when roots of Pinus densiflora are colonised by Laccaria japonica together with three other ECM fungal species (Cenococcum geophilum, Pisolithus sp., and Suillus luteus). Sporocarp numbers were recorded throughout the experimental period. The biomass, photosynthetic rate, and mycorrhizal colonisation rate of the seedlings were also measured at 45 days, 62 days, and 1 year after the seedlings were transplanted. Results indicated that C. geophilum and S. luteus may negatively impact mycorrhizal colonisation and sporocarp formation in L. japonica. Sporocarp formation in L. japonica was positively correlated with conspecific mycorrhizal colonisation but negatively correlated with the biomass of P. densiflora seedlings. The co-occurring ECM fungi largely competed with L. japonica, with varied effects on its mycorrhizal colonisation and sporocarp formation. A variety of mechanisms may be involved in the competitive interactions among the different ECM fungal species, including the ability to colonise root tips more rapidly, acquire soil nutrients, or produce antibiotics. These mechanisms need to be confirmed in further studies.
    Peer reviewed

    Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning

    Despite the success of fully-supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has become an active research area, because acquiring task-specific skeleton annotations at large scale is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Unlike such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) to explicitly capture spatial, short-term temporal, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability across downstream tasks.
    Comment: Accepted to ECCV 2022
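
    A compact sketch of the three-level hierarchy is given below: one transformer attends over joints within a frame, one over frames within a clip, and one over clips within the video, with mean pooling between levels. The tensor layout, the pooling choice, and all dimensions are illustrative assumptions rather than the exact Hi-TRS architecture.

        import torch
        import torch.nn as nn

        def encoder(d_model, depth=2):
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)

        class HierarchicalSkeletonEncoder(nn.Module):
            """Three stacked transformers: joints within a frame (spatial),
            frames within a clip (short-term), clips within a video (long-term)."""
            def __init__(self, d_model=128):
                super().__init__()
                self.frame_enc = encoder(d_model)   # over joints
                self.clip_enc = encoder(d_model)    # over frames
                self.video_enc = encoder(d_model)   # over clips

            def forward(self, x):
                # x: (batch, clips, frames, joints, d_model) joint embeddings,
                # assumed to be precomputed from raw 3-D joint coordinates.
                b, c, f, j, d = x.shape
                x = self.frame_enc(x.reshape(b * c * f, j, d)).mean(dim=1)
                x = self.clip_enc(x.reshape(b * c, f, d)).mean(dim=1)
                x = self.video_enc(x.reshape(b, c, d)).mean(dim=1)
                return x                             # (batch, d_model)

        model = HierarchicalSkeletonEncoder()
        video = torch.randn(2, 4, 8, 25, 128)        # e.g. 25 joints per frame
        embedding = model(video)                     # (2, 128)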

    Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

    Given an input video, its associated audio, and a brief caption, the audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, and advances on it could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant applies a shuffling scheme to its multi-head outputs, which provides better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame and an inter-frame aggregation module capturing temporal cues. The entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, on both the answer generation and selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
    Comment: Accepted at AAAI 2021
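
    The head-shuffling idea can be read as a standard multi-head self-attention block whose per-head outputs are randomly permuted before the output projection during training, so the output projection cannot rely on a fixed head ordering. The module below is a minimal sketch under that reading, not the authors' exact layer.

        import torch
        import torch.nn as nn

        class ShuffledHeadAttention(nn.Module):
            """Multi-head self-attention whose per-head outputs are randomly
            permuted before the output projection during training."""
            def __init__(self, d_model=256, n_heads=8):
                super().__init__()
                assert d_model % n_heads == 0
                self.h, self.dk = n_heads, d_model // n_heads
                self.qkv = nn.Linear(d_model, 3 * d_model)
                self.out = nn.Linear(d_model, d_model)

            def forward(self, x):
                b, t, d = x.shape
                q, k, v = self.qkv(x).chunk(3, dim=-1)
                # Reshape to (batch, heads, tokens, head_dim).
                q, k, v = (z.view(b, t, self.h, self.dk).transpose(1, 2)
                           for z in (q, k, v))
                attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5,
                                     dim=-1)
                heads = attn @ v                     # (b, h, t, dk)
                if self.training:                    # shuffle head order
                    heads = heads[:, torch.randperm(self.h)]
                return self.out(heads.transpose(1, 2).reshape(b, t, d))

        layer = ShuffledHeadAttention()
        y = layer(torch.randn(2, 10, 256))           # (2, 10, 256)

    At inference time (layer.eval()) the permutation is skipped, so the shuffling acts purely as a training-time regularizer, analogous in spirit to dropout.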