Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges (representation, alignment, reasoning,
generation, transference, and quantification) covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy.
Vehicle Detection of Multi-source Remote Sensing Data Using Active Fine-tuning Network
Vehicle detection in remote sensing images has attracted increasing interest
in recent years. However, detection performance is limited by the lack of
well-annotated samples, especially in densely crowded scenes. Furthermore,
since a variety of remotely sensed data sources are available, efficiently
exploiting useful information from multi-source data for better vehicle
detection is challenging. To solve the above issues, a multi-source active
fine-tuning vehicle detection (Ms-AFt) framework is proposed, which integrates
transfer learning, segmentation, and active classification into a unified
framework for auto-labeling and detection. The proposed Ms-AFt first employs a
fine-tuning network to generate a vehicle training set from an
unlabeled dataset. To cope with the diversity of vehicle categories, a
multi-source based segmentation branch is then designed to construct additional
candidate object sets. High-quality vehicle instances are then separated by a
dedicated attentive classification network. Finally, all three branches are
combined to achieve vehicle detection. Extensive experiments conducted
on two open ISPRS benchmark datasets, namely the Vaihingen village and Potsdam
city datasets, demonstrate the superiority and effectiveness of the proposed
Ms-AFt for vehicle detection. In addition, the generalization ability of Ms-AFt
in dense remote sensing scenes is further verified on stereo aerial imagery of
a large camping site.
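To make the branch-combination idea concrete, here is a minimal, hypothetical Python sketch of pooling candidate detections from two branches and keeping only those accepted by a classifier with sufficient confidence; all names, the box format, and the threshold are illustrative assumptions, not taken from the Ms-AFt paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    box: tuple    # (x, y, w, h) in image coordinates
    score: float  # classifier confidence that the box contains a vehicle

def merge_branches(fine_tune_cands: List[Candidate],
                   segmentation_cands: List[Candidate],
                   classifier: Callable[[Candidate], float],
                   threshold: float = 0.8) -> List[Candidate]:
    """Pool candidates from both branches and keep only those the
    classifier accepts with high confidence (illustrative only)."""
    pooled = fine_tune_cands + segmentation_cands
    kept = []
    for cand in pooled:
        cand.score = classifier(cand)
        if cand.score >= threshold:
            kept.append(cand)
    return kept

# Toy usage with a dummy classifier that simply trusts larger boxes more.
dummy_classifier = lambda c: min(1.0, c.box[2] * c.box[3] / 900.0)
auto_labels = merge_branches(
    [Candidate((10, 10, 30, 30), 0.0)],
    [Candidate((50, 60, 20, 15), 0.0)],
    dummy_classifier,
)
print(len(auto_labels), "candidates kept for the auto-labeled training set")
```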
Deep learning in remote sensing: a review
Standing at the paradigm shift towards data-intensive science, machine
learning techniques are becoming increasingly important. In particular, as a
major breakthrough in the field, deep learning has proven to be an extremely
powerful tool in many fields. Should we embrace deep learning as the key to everything?
Or should we resist a 'black-box' solution? There are controversial opinions
in the remote sensing community. In this article, we analyze the challenges of
using deep learning for remote sensing data analysis, review the recent
advances, and provide resources to make deep learning in remote sensing
ridiculously simple to start with. More importantly, we encourage remote sensing
scientists to bring their expertise into deep learning and use it as an
implicit general model to tackle unprecedented, large-scale, influential
challenges, such as climate change and urbanization.
Comment: Accepted for publication in IEEE Geoscience and Remote Sensing Magazine.
MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning
Learning multimodal representations involves integrating information from
multiple heterogeneous sources of data. In order to accelerate progress towards
understudied modalities and tasks while ensuring real-world robustness, we
release MultiZoo, a public toolkit consisting of standardized implementations
of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark
spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
Together, these provide an automated end-to-end machine learning pipeline that
simplifies and standardizes data loading, experimental setup, and model
evaluation. To enable holistic evaluation, we offer a comprehensive methodology
to assess (1) generalization, (2) time and space complexity, and (3) modality
robustness. MultiBench paves the way towards a better understanding of the
capabilities and limitations of multimodal models, while ensuring ease of use,
accessibility, and reproducibility. Our toolkits are publicly available, will
be regularly updated, and welcome input from the community.
Comment: JMLR Open Source Software 2023. Code available at
https://github.com/pliang279/MultiBench
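As an illustration of the kind of end-to-end multimodal pipeline such a toolkit standardizes, the sketch below builds a simple late-fusion model in PyTorch (per-modality encoders, concatenation fusion, shared prediction head). It deliberately does not use the MultiBench API; the module names, dimensions, and fusion choice are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Encode each modality separately, concatenate features, then classify."""
    def __init__(self, input_dims, hidden_dim, num_classes):
        super().__init__()
        # One small MLP encoder per modality (illustrative architecture).
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden_dim), nn.ReLU()) for d in input_dims
        )
        self.head = nn.Linear(hidden_dim * len(input_dims), num_classes)

    def forward(self, modalities):
        feats = [enc(x) for enc, x in zip(self.encoders, modalities)]
        return self.head(torch.cat(feats, dim=-1))

# Toy batch: two modalities with 20- and 35-dimensional feature vectors.
model = LateFusionModel(input_dims=[20, 35], hidden_dim=64, num_classes=10)
batch = [torch.randn(8, 20), torch.randn(8, 35)]
logits = model(batch)
print(logits.shape)  # torch.Size([8, 10])
```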
BInGo: Bayesian Intrinsic Groupwise Registration via Explicit Hierarchical Disentanglement
Multimodal groupwise registration aligns internal structures in a group of
medical images. Current approaches to this problem involve developing
similarity measures over the joint intensity profile of all images, which may
be computationally prohibitive for large image groups and unstable under
various conditions. To tackle these issues, we propose BInGo, a general
unsupervised hierarchical Bayesian framework based on deep learning, to learn
intrinsic structural representations to measure the similarity of multimodal
images. In particular, a variational auto-encoder with a novel posterior is
proposed, which facilitates disentangled learning of structural
representations and spatial transformations, and characterizes the imaging
process from the common structure with shape transition and appearance
variation. Notably, BInGo can be trained on small groups yet applied to
large-scale groupwise registration at test time, thus significantly reducing
computational costs. We compared BInGo with five iterative or deep learning
methods on three public intrasubject and intersubject datasets, i.e., BraTS,
MS-CMR of the heart, and Learn2Reg abdomen MR-CT, and demonstrated its superior
accuracy and computational efficiency, even for very large group sizes (e.g.,
over 1300 2D images from MS-CMR in each group).
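To make the variational formulation more tangible, a generic evidence lower bound with separate structural and transformation latents could be written as below; this is a schematic sketch, and the factorized posterior and priors are assumptions rather than BInGo's exact objective.

```latex
\log p(I_1,\dots,I_N) \;\ge\;
\mathbb{E}_{q(z_s,\, z_{\phi_{1:N}} \mid I_{1:N})}\!\left[\sum_{n=1}^{N} \log p\!\left(I_n \mid z_s, z_{\phi_n}\right)\right]
\;-\; \mathrm{KL}\!\left(q(z_s,\, z_{\phi_{1:N}} \mid I_{1:N}) \,\middle\|\, p(z_s)\prod_{n=1}^{N} p(z_{\phi_n})\right)
```

Here z_s plays the role of the common structural representation shared by the group, while each z_{\phi_n} encodes the spatial transformation of image I_n.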
Sensor fusion in driving assistance systems
International Mention in the doctoral degree.
Life in developed and developing countries is highly dependent on road and
urban motor transport. This activity involves a high cost for its active and passive
users in terms of pollution and accidents, which are largely attributable to
the human factor. New developments in safety and driving assistance, called
Advanced Driving Assistance Systems (ADAS), are intended to improve
safety in transportation and, in the mid-term, lead to autonomous driving.
ADAS, like human driving, are based on sensors, which provide information
about the environment, and sensor reliability is crucial for ADAS
applications, in the same way that sensing abilities are crucial for human driving.
One way to improve sensor reliability is the use of Sensor
Fusion: developing novel strategies for environment modeling with the help of
several sensors and obtaining enhanced information from the combination
of the available data.
The present thesis is intended to offer a novel solution for obstacle detection
and classification in automotive applications using sensor fusion with two
widely available sensors on the market: the visible spectrum camera and the laser
scanner. Cameras and lasers are commonly used sensors in the scientific
literature, increasingly affordable and ready to be deployed in real-world
applications. The proposed solution provides detection and classification
of some obstacles commonly present on the road, such as pedestrians and cyclists.
Novel approaches for detection and classification have been explored in this
thesis, from classification of point cloud clusters obtained from the laser
scanner, to domain adaptation techniques for synthetic dataset creation,
including intelligent cluster extraction and ground detection and removal from point clouds.
Official Doctoral Programme in Electrical, Electronic and Automation Engineering.
Thesis committee: President: Cristina Olaverri Monreal; Secretary: Arturo de la Escalera Hueso; Member: José Eugenio Naranjo Hernández.
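As a small, hypothetical illustration of the ground-removal and clustering steps mentioned in the abstract above (not the thesis' actual implementation), the following Python sketch drops near-ground points with a simple height threshold and groups the remainder into obstacle candidates with DBSCAN; the threshold, clustering parameters, and toy point cloud are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def remove_ground(points: np.ndarray, ground_z: float = 0.2) -> np.ndarray:
    """Drop points whose height above the estimated ground plane is below a
    threshold (a crude stand-in for plane fitting / RANSAC)."""
    return points[points[:, 2] > ground_z]

def cluster_obstacles(points: np.ndarray, eps: float = 0.5, min_points: int = 10):
    """Group the remaining points into candidate obstacles with DBSCAN."""
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(points)
    return [points[labels == k] for k in set(labels) if k != -1]

# Toy point cloud: a flat ground patch plus two elevated blobs.
rng = np.random.default_rng(0)
ground = np.c_[rng.uniform(0, 10, (500, 2)), rng.normal(0.0, 0.03, 500)]
blob_a = rng.normal([2.0, 3.0, 1.0], 0.1, (100, 3))
blob_b = rng.normal([7.0, 6.0, 0.9], 0.1, (100, 3))
cloud = np.vstack([ground, blob_a, blob_b])

obstacles = cluster_obstacles(remove_ground(cloud))
print(f"{len(obstacles)} obstacle clusters found")
```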
X-Metric: An N-Dimensional Information-Theoretic Framework for Groupwise Registration and Deep Combined Computing
This paper presents a generic probabilistic framework for estimating the
statistical dependency and finding the anatomical correspondences among an
arbitrary number of medical images. The method builds on a novel formulation of
the N-dimensional joint intensity distribution by representing the common
anatomy as latent variables and estimating the appearance model with
nonparametric estimators. Through connection to maximum likelihood and the
expectation-maximization algorithm, an information-theoretic metric called
X-metric and a co-registration algorithm named X-CoReg
are induced, allowing groupwise registration of the observed images with
computational complexity of O(N). Moreover, the method naturally
extends to a weakly-supervised scenario where anatomical labels of certain
images are provided. This leads to a combined computing framework
implemented with deep learning, which performs registration and segmentation
simultaneously and collaboratively in an end-to-end fashion. Extensive
experiments were conducted to demonstrate the versatility and applicability of
our model, including multimodal groupwise registration, motion correction for
dynamic contrast enhanced magnetic resonance images, and deep combined
computing for multimodal medical images. Results show the superiority of our
method in various applications in terms of both accuracy and efficiency,
highlighting the advantage of the proposed representation of the imaging
process.
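In the spirit of the formulation summarized above (a latent common anatomy generating the observed images, fitted by maximum likelihood with EM), a schematic version of the model can be written as follows; the exact parameterization and nonparametric appearance estimators used in the paper may differ.

```latex
% Illustrative latent-anatomy likelihood: z_x is the common-anatomy label at
% location x, \pi_k its prior probability, and \phi_n warps image I_n into the
% common space. EM alternates between the posterior over z_x and updates of the
% appearance model and the transformations \phi_{1:N}.
p\!\left(I_1,\dots,I_N \mid \phi_{1:N}\right)
  \;=\; \prod_{x \in \Omega} \; \sum_{k=1}^{K} \pi_k \prod_{n=1}^{N}
        p\!\left(I_n(\phi_n(x)) \mid z_x = k\right)
```

Because the images are conditionally independent given the latent label in this schematic model, evaluating the likelihood scales linearly with the number of images in the group.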