1,548 research outputs found

    Studying the impact of the Full-Network embedding on multimodal pipelines

    The current state of the art for image annotation and image retrieval tasks is obtained through deep neural network multimodal pipelines, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding (FNE) in this setting, replacing the original image representation in four competitive multimodal embedding generation schemes. Unlike the one-layer image embeddings typically used by most approaches, the Full-Network embedding provides a multi-scale discrete representation of images, which results in richer characterisations. Extensive testing is performed on three different datasets, comparing the performance of the studied variants and the impact of the FNE on a level playing field, i.e., with the same data, source CNN models and hyper-parameter tuning. The results obtained indicate that the Full-Network embedding is consistently superior to the one-layer embedding. Furthermore, its impact on performance is greater than the improvement stemming from the other variants studied. These results motivate the integration of the Full-Network embedding into any multimodal embedding generation scheme. This work is partially supported by the Joint Study Agreement no. W156463 under the IBM/BSC Deep Learning Center agreement, by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, by the Generalitat de Catalunya (contract 2014-SGR-1051), and by the Core Research for Evolutional Science and Technology (CREST) program of the Japan Science and Technology Agency (JST). Peer Reviewed. Postprint (author's final draft)

    Sequential Multi-Dimensional Self-Supervised Learning for Clinical Time Series

    Self-supervised learning (SSL) for clinical time series data has received significant attention in recent literature, since these data are rich and provide important information about a patient's physiological state. However, most existing SSL methods for clinical time series are limited in that they are designed for unimodal time series, such as a sequence of structured features (e.g., lab values and vital signs) or an individual high-dimensional physiological signal (e.g., an electrocardiogram). These existing methods cannot be readily extended to model time series that exhibit multimodality, with structured features and high-dimensional data being recorded at each timestep in the sequence. In this work, we address this gap and propose a new SSL method -- Sequential Multi-Dimensional SSL -- where an SSL loss is applied both at the level of the entire sequence and at the level of the individual high-dimensional data points in the sequence in order to better capture information at both scales. Our strategy is agnostic to the specific form of loss function used at each level -- it can be contrastive, as in SimCLR, or non-contrastive, as in VICReg. We evaluate our method on two real-world clinical datasets, where the time series contains sequences of (1) high-frequency electrocardiograms and (2) structured data from lab values and vital signs. Our experimental results indicate that pre-training with our method and then fine-tuning on downstream tasks improves performance over baselines on both datasets, and in several settings, can lead to improvements across different self-supervised loss functions. Comment: ICML 202

    MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

    In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. A second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrates the challenges of generalising on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.

    Network Medicine Framework for Identifying Drug Repurposing Opportunities for COVID-19

    The current pandemic has highlighted the need for methodologies that can quickly and reliably prioritize clinically approved compounds for their potential effectiveness against SARS-CoV-2 infections. In the past decade, network medicine has developed and validated multiple predictive algorithms for drug repurposing, exploiting the sub-cellular network-based relationship between a drug's targets and disease genes. Here, we deployed algorithms relying on artificial intelligence, network diffusion, and network proximity, tasking each of them to rank 6,340 drugs for their expected efficacy against SARS-CoV-2. To test the predictions, we used as ground truth 918 drugs that had been experimentally screened in VeroE6 cells, and the list of drugs under clinical trial, which captures the medical community's assessment of drugs with potential COVID-19 efficacy. We find that while most algorithms offer predictive power for these ground truth data, no single method offers consistently reliable outcomes across all datasets and metrics. This prompted us to develop a multimodal approach that fuses the predictions of all algorithms, showing that a consensus among the different predictive methods consistently exceeds the performance of the best individual pipelines. We find that 76 of the 77 drugs that successfully reduced viral infection do not bind the proteins targeted by SARS-CoV-2, indicating that these drugs rely on network-based actions that cannot be identified using docking-based strategies. These advances offer a methodological pathway to identify repurposable drugs for future pathogens and neglected diseases underserved by the costs and extended timeline of de novo drug development.
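
    The abstract above does not state which rule is used to fuse the individual rankings. As a rough illustration only, the sketch below assumes a simple mean-rank (Borda-style) consensus; the function name `consensus_ranking` and the drug labels are hypothetical and do not come from the paper.

```python
from statistics import mean

def consensus_ranking(rankings):
    """Fuse several best-first rankings by mean rank position (Borda-style).

    Assumption: every pipeline ranks exactly the same set of items.
    """
    items = set(rankings[0])
    if any(set(r) != items for r in rankings):
        raise ValueError("every ranking must cover the same items")

    # Collect the 1-based position each pipeline assigns to each item.
    positions = {item: [] for item in items}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            positions[item].append(position)

    # Lower mean position is better; ties are broken alphabetically.
    return sorted(items, key=lambda item: (mean(positions[item]), item))

# Three hypothetical pipelines ranking four hypothetical drugs, best first.
pipelines = [
    ["drug_a", "drug_b", "drug_c", "drug_d"],
    ["drug_b", "drug_a", "drug_d", "drug_c"],
    ["drug_a", "drug_c", "drug_b", "drug_d"],
]
fused = consensus_ranking(pipelines)
print(fused)
```

    Mean rank is only one of several possible aggregation rules; a different rule may be what the authors actually used.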

    Correlated Multimodal Imaging in Life Sciences: Expanding the Biomedical Horizon

    International audience. The frontiers of bioimaging are currently being pushed toward the integration and correlation of several modalities to tackle biomedical research questions holistically and across multiple scales. Correlated Multimodal Imaging (CMI) gathers information about exactly the same specimen with two or more complementary modalities that, in combination, create a composite and complementary view of the sample (including insights into structure, function, dynamics and molecular composition). CMI makes it possible to describe biomedical processes within their overall spatio-temporal context and to gain a mechanistic understanding of cells, tissues, diseases or organisms by untangling their molecular mechanisms within their native environment. The two best-established CMI implementations for small animals and model organisms are hardware-fused platforms in preclinical imaging (Hybrid Imaging) and Correlated Light and Electron Microscopy (CLEM) in biological imaging. Although the merits of Preclinical Hybrid Imaging (PHI) and CLEM are well-established, both approaches would benefit from standardization of protocols, ontologies and data handling, and the development of optimized and advanced implementations. Specifically, CMI pipelines that aim at bridging preclinical and biological imaging beyond CLEM and PHI are rare but bear great potential to substantially advance both bioimaging and biomedical research. CMI faces three mai

    Listening while Speaking and Visualizing: Improving ASR through Multimodal Chain

    Previously, a machine speech chain, which is based on sequence-to-sequence deep learning, was proposed to mimic speech perception and production behavior. Such chains separately processed listening and speaking by automatic speech recognition (ASR) and text-to-speech synthesis (TTS) and simultaneously enabled them to teach each other in semi-supervised learning when they received unpaired data. Unfortunately, this speech chain study is limited to speech and textual modalities. In fact, natural communication is multimodal and involves both auditory and visual sensory systems. Although this speech chain reduces the requirement of having a full amount of paired data, we still need a large amount of unpaired data. In this research, we take a further step and construct a multimodal chain and design a closely knit chain architecture that combines ASR, TTS, image captioning, and image production models into a single framework. The framework allows the training of each component without requiring a large amount of parallel multimodal data. Our experimental results also show that an ASR can be further trained without speech and text data and cross-modal data augmentation remains possible through our proposed chain, which improves the ASR performance. Comment: Accepted in IEEE ASRU 201

    MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks

    Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space. Multimodal (MM) models aim to extract the synergistic predictive potential of multiple data types to create a shared feature space with aligned semantic meaning across inputs of drastically varying sizes (e.g., images, text, sound). Most current MM architectures fuse these representations in parallel, which not only limits their interpretability but also creates a dependency on modality availability. We present MultiModN, a multimodal, modular network that fuses latent representations in a sequence of any number, combination, or type of modality while providing granular real-time predictive feedback on any number or combination of predictive tasks. MultiModN's composable pipeline is interpretable-by-design, as well as innately multi-task and robust to the fundamental issue of biased missingness. We perform four experiments on several benchmark MM datasets across 10 real-world tasks (predicting medical diagnoses, academic performance, and weather), and show that MultiModN's sequential MM fusion does not compromise performance compared with a baseline of parallel fusion. By simulating the challenging bias of missing not-at-random (MNAR), this work shows that, contrary to MultiModN, parallel fusion baselines erroneously learn MNAR and suffer catastrophic failure when faced with different patterns of MNAR at inference. To the best of our knowledge, this is the first inherently MNAR-resistant approach to MM modeling. In conclusion, MultiModN provides granular insights, robustness, and flexibility without compromising performance. Comment: Accepted as a full paper at NeurIPS 2023 in New Orleans, US
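
    To make the idea of sequential fusion in the abstract above concrete, the following is a minimal sketch and not the MultiModN implementation. It assumes a shared state that each available modality updates in turn, with a prediction read out after every step and missing modalities simply skipped. The encoders and the decoder are toy, hand-written stand-ins for learned networks, and all names and values are hypothetical.

```python
import math

def make_encoder(weight):
    """Toy stand-in for a learned per-modality encoder (hypothetical)."""
    def encode(state, features):
        # Fold the average feature value into every state component.
        signal = weight * sum(features) / len(features)
        return [math.tanh(value + signal) for value in state]
    return encode

def decode(state):
    """Toy stand-in for a task head: squash the state into a probability."""
    return 1.0 / (1.0 + math.exp(-sum(state)))

def sequential_fusion(sample, encoders, state_size=3):
    """Update one shared state modality by modality, predicting after each."""
    state = [0.0] * state_size
    predictions = []
    for name, encode in encoders.items():
        features = sample.get(name)
        if features is None:
            continue  # a missing modality is skipped; the state is unchanged
        state = encode(state, features)
        predictions.append((name, decode(state)))
    return state, predictions

encoders = {
    "image": make_encoder(0.5),
    "text": make_encoder(-0.3),
    "sound": make_encoder(0.8),
}
# The text modality is missing for this hypothetical sample.
sample = {"image": [0.2, 0.9, 0.4], "text": None, "sound": [0.7, 0.1]}
final_state, predictions = sequential_fusion(sample, encoders)
for name, probability in predictions:
    print(name, round(probability, 3))
```

    Because no step depends on a fixed set of inputs being present, the same loop runs unchanged whichever modalities are available, which is the property the abstract contrasts with parallel fusion.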

    An Eye for AI: A Multimodal Bottleneck Transformer Approach for Predicting Individual Eye Movements: Towards Foundation Models for Human Factors & Neuroscience

    Human perception has been a subject of study for centuries. Various eye tracking methods in many study designs have shed light on individual differences in perception and visual navigation. However, accurately identifying individuals based on gaze behaviour remains a challenge. Artificial intelligence (AI) based methods have led to large successes in domains such as vision and language; they are also being introduced into human factors & neuroscience (HFN). Leveraging AI for HFN requires quantities of data several orders of magnitude larger than the field is used to organising; there exists a clear discrepancy in the standardisation of data publication. In this work, we work towards foundation models (FM) for HFN by highlighting important data insights from AI. A multimodal bottleneck transformer is proposed, a model architecture that can effectively and efficiently represent and work with the varying modalities encountered in HFN. Results indicate that classification of individuals and prediction of gaze are possible, given more training data.