Studying the impact of the Full-Network embedding on multimodal pipelines
The current state of the art for image annotation and image retrieval tasks is obtained through deep neural network multimodal pipelines, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding (FNE) in this setting, replacing the original image representation in four competitive multimodal embedding generation schemes. Unlike the one-layer image embeddings typically used by most approaches, the Full-Network embedding provides a multi-scale discrete representation of images, which results in richer characterisations. Extensive testing is performed on three different datasets, comparing the performance of the studied variants and the impact of the FNE on a level playing field, i.e., with the same data, source CNN models and hyper-parameter tuning. The results obtained indicate that the Full-Network embedding is consistently superior to the one-layer embedding. Furthermore, its impact on performance is greater than the improvement stemming from the other variants studied. These results motivate the integration of the Full-Network embedding into any multimodal embedding generation scheme. This work is partially supported by the Joint Study Agreement no. W156463 under the IBM/BSC Deep Learning Center agreement, by the Spanish Government through Programa Severo Ochoa (SEV-2015-
0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project and by the Generalitat de Catalunya (contract 2014-SGR-1051), and by the Core Research for Evolutional Science and
Technology (CREST) program of the Japan Science and Technology Agency (JST). Peer Reviewed. Postprint (author's final draft).
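As a rough illustration of the Full-Network embedding idea described above, the sketch below concatenates per-layer CNN activations, standardises each feature across the image collection, and discretises the result into {-1, 0, 1}. The threshold values and function names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def full_network_embedding(layer_activations, thresholds=(-0.25, 0.15)):
    """Sketch of a Full-Network-style embedding (hypothetical simplification):
    concatenate per-layer CNN features, standardise each dimension across the
    dataset, then discretise into {-1, 0, 1}."""
    # Concatenate features from every layer into one vector per image.
    feats = np.concatenate(layer_activations, axis=1)  # (n_images, n_dims)
    # Standardise each feature dimension across the image collection.
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-8
    z = (feats - mean) / std
    # Discretise: clearly below average -> -1, near average -> 0, above -> +1.
    lo, hi = thresholds
    emb = np.zeros_like(z, dtype=np.int8)
    emb[z < lo] = -1
    emb[z > hi] = 1
    return emb
```

The discrete multi-scale code is what distinguishes this representation from the usual single-layer embedding taken from the last hidden layer.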
Sequential Multi-Dimensional Self-Supervised Learning for Clinical Time Series
Self-supervised learning (SSL) for clinical time series data has received
significant attention in recent literature, since these data are highly rich
and provide important information about a patient's physiological state.
However, most existing SSL methods for clinical time series are limited in that
they are designed for unimodal time series, such as a sequence of structured
features (e.g., lab values and vital signs) or an individual high-dimensional
physiological signal (e.g., an electrocardiogram). These existing methods
cannot be readily extended to model time series that exhibit multimodality,
with structured features and high-dimensional data being recorded at each
timestep in the sequence. In this work, we address this gap and propose a new
SSL method -- Sequential Multi-Dimensional SSL -- where an SSL loss is applied
both at the level of the entire sequence and at the level of the individual
high-dimensional data points in the sequence in order to better capture
information at both scales. Our strategy is agnostic to the specific form of
loss function used at each level -- it can be contrastive, as in SimCLR, or
non-contrastive, as in VICReg. We evaluate our method on two real-world
clinical datasets, where the time series contains sequences of (1)
high-frequency electrocardiograms and (2) structured data from lab values and
vital signs. Our experimental results indicate that pre-training with our
method and then fine-tuning on downstream tasks improves performance over
baselines on both datasets, and in several settings, can lead to improvements
across different self-supervised loss functions. Comment: ICML 202
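The two-scale idea above can be sketched as a weighted sum of a sequence-level and a timestep-level SSL loss. The sketch below uses a simplified SimCLR-style (NT-Xent) contrastive term; the weighting scheme and function names are assumptions for illustration, and either term could equally be a non-contrastive loss such as VICReg's.

```python
import numpy as np

def ntxent_loss(z1, z2, temperature=0.5):
    """Simplified SimCLR-style contrastive loss between two views
    (rows of z1 and z2 are embeddings of matching items)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (n, n) similarity matrix
    # Positive pairs sit on the diagonal; the rest act as negatives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def two_level_ssl_loss(seq_emb_a, seq_emb_b, step_emb_a, step_emb_b, w=0.5):
    """Hypothetical sketch of the two-scale idea: one SSL loss over
    whole-sequence embeddings plus one over per-timestep embeddings."""
    seq_loss = ntxent_loss(seq_emb_a, seq_emb_b)
    step_loss = ntxent_loss(step_emb_a, step_emb_b)
    return w * seq_loss + (1 - w) * step_loss
```

In the paper's setting, the sequence-level term would see embeddings of entire patient stays while the instance-level term sees embeddings of individual high-dimensional measurements (e.g., single ECGs).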
MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities
In this paper, we introduce the MLM (Multiple Languages and Modalities)
dataset - a new resource to train and evaluate multitask systems on samples in
multiple modalities and three languages. The generation process and inclusion
of semantic data provide a resource that further tests the ability of
multitask systems to learn relationships between entities. The dataset is
designed for researchers and developers who build applications that perform
multiple tasks on data encountered on the web and in digital archives. A second
version of MLM provides a geo-representative subset of the data with weighted
samples for countries of the European Union. We demonstrate the value of the
resource in developing novel applications in the digital humanities with a
motivating use case and specify a benchmark set of tasks to retrieve modalities
and locate entities in the dataset. Evaluation of baseline multitask and single
task systems on the full and geo-representative versions of MLM demonstrates the
challenges of generalising on diverse data. In addition to the digital
humanities, we expect the resource to contribute to research in multimodal
representation learning, location estimation, and scene understanding.
Network Medicine Framework for Identifying Drug Repurposing Opportunities for COVID-19
The current pandemic has highlighted the need for methodologies that can
quickly and reliably prioritize clinically approved compounds for their
potential effectiveness for SARS-CoV-2 infections. In the past decade, network
medicine has developed and validated multiple predictive algorithms for drug
repurposing, exploiting the sub-cellular network-based relationship between a
drug's targets and disease genes. Here, we deployed algorithms relying on
artificial intelligence, network diffusion, and network proximity, tasking each
of them to rank 6,340 drugs for their expected efficacy against SARS-CoV-2. To
test the predictions, we used as ground truth 918 drugs that had been
experimentally screened in VeroE6 cells, and the list of drugs in clinical
trials, which captures the medical community's assessment of drugs with potential
COVID-19 efficacy. We find that while most algorithms offer predictive power
for these ground truth data, no single method offers consistently reliable
outcomes across all datasets and metrics. This prompted us to develop a
multimodal approach that fuses the predictions of all algorithms, showing that
a consensus among the different predictive methods consistently exceeds the
performance of the best individual pipelines. We find that 76 of the 77 drugs
that successfully reduced viral infection do not bind the proteins targeted by
SARS-CoV-2, indicating that these drugs rely on network-based actions that
cannot be identified using docking-based strategies. These advances offer a
methodological pathway to identify repurposable drugs for future pathogens and
neglected diseases underserved by the costs and extended timeline of de novo
drug development.
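One of the network-medicine signals mentioned above, network proximity, can be illustrated with a toy computation: for each drug target, take the shortest path in the protein-protein interaction network to the nearest disease gene, then average over targets. The sketch below is a minimal illustration of that idea on a plain adjacency-list graph; the published pipelines additionally z-score the measure against random gene sets of matched degree.

```python
from collections import deque

def shortest_path_lengths(graph, sources):
    """BFS distances from a set of source nodes in an unweighted graph
    given as {node: [neighbours]}."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def network_proximity(graph, drug_targets, disease_genes):
    """Toy closest-distance proximity: average, over drug targets, of the
    shortest path to the nearest disease gene (unreachable -> inf)."""
    dist_to_disease = shortest_path_lengths(graph, disease_genes)
    d = [dist_to_disease.get(t, float("inf")) for t in drug_targets]
    return sum(d) / len(d)
```

A low proximity suggests the drug's targets sit in the network neighbourhood of the disease module, which is the intuition behind ranking the 6,340 candidate drugs.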
Correlated Multimodal Imaging in Life Sciences: Expanding the Biomedical Horizon
The frontiers of bioimaging are currently being pushed toward the integration and correlation of several modalities to tackle biomedical research questions holistically and across multiple scales. Correlated Multimodal Imaging (CMI) gathers information about exactly the same specimen with two or more complementary modalities that, in combination, create a composite and complementary view of the sample (including insights into structure, function, dynamics and molecular composition). CMI makes it possible to describe biomedical processes within their overall spatio-temporal context and to gain a mechanistic understanding of cells, tissues, diseases or organisms by untangling their molecular mechanisms within their native environment. The two best-established CMI implementations for small animals and model organisms are hardware-fused platforms in preclinical imaging (Hybrid Imaging) and Correlated Light and Electron Microscopy (CLEM) in biological imaging. Although the merits of Preclinical Hybrid Imaging (PHI) and CLEM are well established, both approaches would benefit from standardization of protocols, ontologies and data handling, and from the development of optimized and advanced implementations. Specifically, CMI pipelines that aim at bridging preclinical and biological imaging beyond CLEM and PHI are rare but bear great potential to substantially advance both bioimaging and biomedical research. CMI faces three mai
Listening while Speaking and Visualizing: Improving ASR through Multimodal Chain
Previously, a machine speech chain, which is based on sequence-to-sequence
deep learning, was proposed to mimic speech perception and production behavior.
Such chains separately processed listening and speaking by automatic speech
recognition (ASR) and text-to-speech synthesis (TTS) and simultaneously enabled
them to teach each other in semi-supervised learning when they received
unpaired data. Unfortunately, this speech chain study is limited to speech and
textual modalities. In fact, natural communication is multimodal and
involves both auditory and visual sensory systems. Although the said speech
chain reduces the requirement of having a full amount of paired data, in this
case we still need a large amount of unpaired data. In this research, we take a
further step and construct a multimodal chain and design a closely knit chain
architecture that combines ASR, TTS, image captioning, and image production
models into a single framework. The framework allows the training of each
component without requiring a large amount of parallel multimodal data. Our
experimental results also show that an ASR can be further trained without
speech and text data and cross-modal data augmentation remains possible through
our proposed chain, which improves the ASR performance. Comment: Accepted in IEEE ASRU 201
MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks
Predicting multiple real-world tasks in a single model often requires a
particularly diverse feature space. Multimodal (MM) models aim to extract the
synergistic predictive potential of multiple data types to create a shared
feature space with aligned semantic meaning across inputs of drastically
varying sizes (i.e. images, text, sound). Most current MM architectures fuse
these representations in parallel, which not only limits their interpretability
but also creates a dependency on modality availability. We present MultiModN, a
multimodal, modular network that fuses latent representations in a sequence of
any number, combination, or type of modality while providing granular real-time
predictive feedback on any number or combination of predictive tasks.
MultiModN's composable pipeline is interpretable-by-design, as well as innately
multi-task and robust to the fundamental issue of biased missingness. We
perform four experiments on several benchmark MM datasets across 10 real-world
tasks (predicting medical diagnoses, academic performance, and weather), and
show that MultiModN's sequential MM fusion does not compromise performance
compared with a baseline of parallel fusion. By simulating the challenging bias
of missing not-at-random (MNAR) data, this work shows that, unlike MultiModN,
parallel fusion baselines erroneously learn MNAR and suffer catastrophic
failure when faced with different patterns of MNAR at inference. To the best of
our knowledge, this is the first inherently MNAR-resistant approach to MM
modeling. In conclusion, MultiModN provides granular insights, robustness, and
flexibility without compromising performance. Comment: Accepted as a full paper at NeurIPS 2023 in New Orleans, US
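A minimal sketch of sequential fusion in the MultiModN spirit: a shared state vector is updated one modality at a time by a per-modality encoder, and a missing modality simply leaves the state unchanged rather than forcing imputation. The encoder form (a single linear map with tanh) and all names here are illustrative assumptions, not the paper's exact modules.

```python
import numpy as np

def sequential_fusion(state, modality_inputs, encoders):
    """Fuse modalities one at a time into a shared state vector.

    modality_inputs: ordered list of (name, feature_vector_or_None);
    encoders: {name: weight matrix mapping [state, features] -> new state}.
    """
    for name, x in modality_inputs:
        if x is None:          # missing modality: state passes through unchanged
            continue
        W = encoders[name]     # per-modality encoder (here just a linear map)
        state = np.tanh(W @ np.concatenate([state, x]))
    return state
```

Because the state is valid after every step, a prediction head can be queried at any point in the sequence, which is what gives the design its granular real-time feedback and its robustness to arbitrary missingness patterns.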
An Eye for AI: A Multimodal Bottleneck Transformer Approach for Predicting Individual Eye Movements: Towards Foundation Models for Human Factors & Neuroscience
Human perception has been a subject of study for centuries. Various eye tracking methods across many study designs have shed light on individual differences in perception and visual navigation. However, accurately identifying individuals based on gaze behaviour remains a challenge. Artificial intelligence (AI) based methods have led to large successes in domains such as vision and language, and they are now making their introduction into human factors & neuroscience (HFN). Leveraging AI for HFN requires quantities of data several orders of magnitude larger than the field is accustomed to organising, and there is a clear gap in the standardisation of data publication. In this work, we work towards foundation models (FM) for HFN by highlighting important data insights from AI. A multimodal bottleneck transformer is proposed: a model architecture that can effectively and efficiently represent and work with the varying modalities encountered in HFN. Results indicate that classification of individuals and prediction of gaze are possible, given more training data.
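The bottleneck idea can be sketched as follows: modalities never attend to each other directly, but only through a small set of shared bottleneck tokens, which caps cross-modal bandwidth and keeps computation manageable. The sketch below is an illustrative simplification (plain single-head attention, no learned projections), not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention (single head, no projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    return softmax(scores, axis=1) @ values

def bottleneck_fusion(modality_tokens, bottleneck, n_rounds=2):
    """Fuse modalities through a small set of shared bottleneck tokens.

    modality_tokens: list of (n_tokens_i, dim) arrays, one per modality;
    bottleneck: (n_bottleneck, dim) array of shared fusion tokens.
    """
    for _ in range(n_rounds):
        # Bottleneck tokens gather information from every modality...
        gathered = [attend(bottleneck, toks, toks) for toks in modality_tokens]
        bottleneck = np.mean(gathered, axis=0)
        # ...then each modality reads the fused bottleneck back (residual add).
        modality_tokens = [toks + attend(toks, bottleneck, bottleneck)
                           for toks in modality_tokens]
    return modality_tokens, bottleneck
```

For the eye-movement setting above, one modality stream might hold gaze-sample tokens and another stimulus tokens; the handful of bottleneck tokens forces a compact shared summary.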