Fast and Efficient Zero-Learning Image Fusion
We propose a real-time image fusion method using pre-trained neural networks.
Our method generates a single image containing features from multiple sources.
We first decompose images into a base layer representing large-scale intensity
variations and a detail layer containing small-scale changes. We use visual
saliency to fuse the base layers, and deep feature maps extracted from a
pre-trained neural network to fuse the detail layers. We conduct ablation
studies to analyze our method's parameters such as decomposition filters,
weight construction methods, and network depth and architecture. Then, we
validate its effectiveness and speed on thermal, medical, and multi-focus
fusion. We also apply it to multiple image inputs such as multi-exposure
sequences. The experimental results demonstrate that our technique achieves
state-of-the-art performance in visual quality, objective assessment, and
runtime efficiency.
Comment: 13 pages, 10 figures
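To make the pipeline concrete, here is a minimal two-scale fusion sketch in NumPy/SciPy. The Gaussian decomposition filter, the saliency measure (smoothed absolute Laplacian), and the max-absolute detail rule are illustrative stand-ins chosen for brevity; the paper fuses the detail layers with deep feature maps from a pre-trained network rather than this simple rule.

```python
import numpy as np
from scipy import ndimage

def two_scale_fuse(images, sigma=5.0):
    """Fuse grayscale float images (list of HxW arrays in [0, 1])."""
    # Base/detail decomposition: base = low-pass, detail = residual.
    bases = [ndimage.gaussian_filter(im, sigma) for im in images]
    details = [im - b for im, b in zip(images, bases)]

    # Visual saliency proxy: smoothed absolute Laplacian response.
    sal = [ndimage.gaussian_filter(np.abs(ndimage.laplace(im)), sigma)
           for im in images]
    w = np.stack(sal)
    w = w / (w.sum(axis=0) + 1e-12)          # normalize weights per pixel
    fused_base = sum(wi * bi for wi, bi in zip(w, bases))

    # Detail fusion: keep the detail coefficient with largest magnitude.
    d = np.stack(details)
    idx = np.argmax(np.abs(d), axis=0)
    fused_detail = np.take_along_axis(d, idx[None], axis=0)[0]
    return np.clip(fused_base + fused_detail, 0.0, 1.0)
```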
The bilateral solver for quality estimation based multi-focus image fusion
In this work, a fast Bilateral Solver for Quality Estimation Based
multi-focus Image Fusion method (BS-QEBIF) is proposed. The all-in-focus image
is generated by pixel-wise weighted summation of the multi-focus source
images, using their focus-level maps as weights. Since the visual quality of an
image patch is
highly correlated with its focus level, the focus-level maps are preliminarily
obtained based on visual quality scores, as pre-estimations. However, the
pre-estimations are not ideal. Thus the fast bilateral solver is then adopted
to smooth the pre-estimations, and edges in the multi-focus source images can
be preserved simultaneously. The edge-preserving smoothed results are utilized
as final focus-level maps. Moreover, this work provides a confidence-map
solution for the unstable fusion in the focus-level-changed boundary regions.
Experiments were conducted on pairs of source images. The proposed
BS-QEBIF outperforms other fusion methods both objectively and subjectively.
The all-in-focus image produced by the proposed method preserves the details of
the multi-focus source images and does not suffer from residual errors.
Experimental results show that BS-QEBIF can handle the focus-level-changed
boundary regions without blocking, ringing, or blurring artifacts.
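A hedged sketch of the overall weighted-sum scheme follows. The local-variance quality score and the Gaussian smoothing below are crude stand-ins: the paper derives pre-estimations from visual quality scores and refines them with the fast bilateral solver, which preserves source-image edges in a way plain Gaussian smoothing does not.

```python
import numpy as np
from scipy import ndimage

def focus_map(im, win=7):
    # Local variance as a simple sharpness/quality pre-estimation.
    mean = ndimage.uniform_filter(im, win)
    var = ndimage.uniform_filter(im * im, win) - mean * mean
    return np.maximum(var, 0.0)

def fuse_multifocus(images, smooth_sigma=3.0):
    # Pre-estimate focus levels, then smooth them (bilateral-solver stand-in).
    maps = np.stack([focus_map(im) for im in images])
    maps = np.stack([ndimage.gaussian_filter(m, smooth_sigma) for m in maps])
    maps = maps / (maps.sum(axis=0, keepdims=True) + 1e-12)
    # Pixel-wise weighted sum of the source images.
    return sum(m * im for m, im in zip(maps, images))
```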
Multisource and Multitemporal Data Fusion in Remote Sensing
The sharp and recent increase in the availability of data captured by
different sensors combined with their considerably heterogeneous natures poses
a serious challenge for the effective and efficient processing of remotely
sensed data. Such an increase in remote sensing and ancillary datasets,
however, opens up the possibility of utilizing multimodal datasets in a joint
manner to further improve the performance of the processing approaches with
respect to the application at hand. Multisource data fusion has, therefore,
received enormous attention from researchers worldwide for a wide variety of
applications. Moreover, thanks to the revisit capability of several spaceborne
sensors, the integration of the temporal information with the spatial and/or
spectral/backscattering information of the remotely sensed data is possible and
helps to move from a representation of 2D/3D data to 4D data structures, where
the time variable adds new information as well as challenges for the
information extraction algorithms. A huge number of research works are
dedicated to multisource and multitemporal data fusion, but the methods for
fusing different modalities have evolved along different paths in each research
community. This paper brings together the advances of multisource
and multitemporal data fusion approaches with respect to different research
communities and provides a thorough and discipline-specific starting point for
researchers at different levels (i.e., students, researchers, and senior
researchers) wishing to conduct novel investigations on this challenging topic,
by supplying sufficient detail and references.
Learning Multi-Modal Word Representation Grounded in Visual Context
Representing the semantics of words is a long-standing problem for the
natural language processing community. Most methods compute word semantics
given their textual context in large corpora. More recently, researchers have
attempted to integrate perceptual and visual features. Most of these works
consider the visual appearance of objects to enhance word representations but
they ignore the visual environment and context in which objects appear. We
propose to unify text-based techniques with vision-based techniques by
simultaneously leveraging textual and visual context to learn multimodal word
embeddings. We explore various choices for what can serve as a visual context
and present an end-to-end method to integrate visual context elements in a
multimodal skip-gram model. We provide experiments and an extensive analysis
of the obtained results.
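As an illustration of the general idea (not the paper's exact objective), the following PyTorch sketch augments a negative-sampling skip-gram loss with a term aligning a word's embedding to a projected visual feature of its context; all layer sizes and the loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

class MultimodalSkipGram(torch.nn.Module):
    def __init__(self, vocab, dim=300, vis_dim=2048):
        super().__init__()
        self.emb_in = torch.nn.Embedding(vocab, dim)
        self.emb_out = torch.nn.Embedding(vocab, dim)
        self.vis_proj = torch.nn.Linear(vis_dim, dim)  # maps CNN features

    def forward(self, center, context, negatives, vis_feat, alpha=0.5):
        # center: (B,), context: (B,), negatives: (B, K), vis_feat: (B, vis_dim)
        v = self.emb_in(center)                          # (B, dim)
        pos = (v * self.emb_out(context)).sum(-1)
        neg = torch.bmm(self.emb_out(negatives), v.unsqueeze(-1)).squeeze(-1)
        text_loss = -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()
        # Visual-context term: pull the word toward its visual context.
        vis_loss = 1 - F.cosine_similarity(v, self.vis_proj(vis_feat)).mean()
        return text_loss + alpha * vis_loss
```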
Multimodal Classification of Events in Social Media
A large amount of social media hosted on platforms like Flickr and Instagram
is related to social events. The task of social event classification refers to
the distinction of event and non-event-related content as well as the
classification of event types (e.g. sports events, concerts, etc.). In this
paper, we provide an extensive study of textual, visual, as well as multimodal
representations for social event classification. We investigate strengths and
weaknesses of the modalities and study synergy effects between the modalities.
Experimental results obtained with our multimodal representation outperform
state-of-the-art methods and provide a new baseline for future research.
Comment: Preprint of accepted manuscript for the Elsevier Image and Vision
Computing Journal (IMAVIS). The paper will be published by IMAVIS under DOI
10.1016/j.imavis.2015.12.00
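For reference, a minimal early-fusion baseline of the kind such studies compare against can be written in a few lines of scikit-learn: concatenate the textual and visual feature vectors and train a linear classifier. The feature extractors are assumed given; this only illustrates the multimodal combination, not the paper's best-performing representation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_multimodal(text_feats, visual_feats, labels):
    # Early fusion: concatenate per-item textual and visual feature vectors.
    X = np.hstack([text_feats, visual_feats])   # (N, d_text + d_vis)
    return LogisticRegression(max_iter=1000).fit(X, labels)
```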
Latent Variable Algorithms for Multimodal Learning and Sensor Fusion
Multimodal learning has been lacking principled ways of combining information
from different modalities and learning a low-dimensional manifold of meaningful
representations. We study multimodal learning and sensor fusion from a latent
variable perspective. We first present a regularized recurrent attention filter
for sensor fusion. This algorithm can dynamically combine information from
different types of sensors in a sequential decision-making task. Each sensor is
paired with a modular neural network to maximize the utility of its own
information. A gating modular neural network dynamically generates a set of
mixing weights for the outputs of the sensor networks by balancing the utility
of all sensors' information. We design a co-learning mechanism to encourage
co-adaptation and independent learning of each sensor at the same time, and
propose a regularization-based co-learning method. In the second part, we focus
on recovering the manifold of latent representations. We propose a co-learning
approach using a probabilistic graphical model that imposes a structural prior
on the generative model, the multimodal variational RNN (MVRNN), and derive a
variational lower bound for its objective function. In the third part, we
extend the siamese structure to sensor fusion for robust acoustic event
detection. We perform experiments to investigate the extracted latent
representations; further work will be carried out in the coming months. Our
experiments show that the recurrent attention filter can dynamically combine
different sensor inputs according to the information carried in the inputs. We
expect the MVRNN to identify latent representations that are useful for many
downstream tasks such as speech synthesis, activity recognition, and control
and planning. Both algorithms are general frameworks that can be applied to
other tasks where different types of sensors are jointly used for decision
making.
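The gating idea can be sketched as follows in PyTorch: one modular network per sensor plus a gating network that emits softmax mixing weights over the sensors' outputs. The recurrent attention, sequential decision making, and co-learning regularization are omitted, and all layer sizes are illustrative.

```python
import torch

class GatedSensorFusion(torch.nn.Module):
    def __init__(self, sensor_dims, hidden=64, out=32):
        super().__init__()
        # One modular network per sensor.
        self.sensor_nets = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(d, hidden), torch.nn.ReLU(),
                                torch.nn.Linear(hidden, out))
            for d in sensor_dims)
        # Gating network: emits one mixing weight per sensor.
        self.gate = torch.nn.Linear(sum(sensor_dims), len(sensor_dims))

    def forward(self, inputs):                  # list of (B, d_i) tensors
        feats = torch.stack([net(x) for net, x in
                             zip(self.sensor_nets, inputs)], dim=1)  # (B,S,out)
        w = torch.softmax(self.gate(torch.cat(inputs, dim=-1)), dim=-1)
        return (w.unsqueeze(-1) * feats).sum(dim=1)  # weighted mixture (B,out)
```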
Linked Component Analysis from Matrices to High Order Tensors: Applications to Biomedical Data
With the increasing availability of various sensor technologies, we now have
access to large amounts of multi-block (also called multi-set,
multi-relational, or multi-view) data that need to be jointly analyzed to
explore their latent connections. Various component analysis methods have
played an increasingly important role for the analysis of such coupled data. In
this paper, we first provide a brief review of existing matrix-based (two-way)
component analysis methods for the joint analysis of such data with a focus on
biomedical applications. Then, we discuss their important extensions and
generalization to multi-block multiway (tensor) data. We show how constrained
multi-block tensor decomposition methods are able to extract similar or
statistically dependent common features that are shared by all blocks, by
incorporating the multiway nature of data. Special emphasis is given to the
flexible common and individual feature analysis of multi-block data with the
aim to simultaneously extract common and individual latent components with
desired properties and types of diversity. Illustrative examples are given to
demonstrate their effectiveness for biomedical data analysis.
Comment: 20 pages, 11 figures, Proceedings of the IEEE, 201
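As a toy illustration of the coupled analysis, the sketch below factorizes several data blocks that share a common component matrix, fit by alternating least squares. Constraints, individual components, and the tensor extensions discussed in the paper are left out; the rank and iteration count are arbitrary.

```python
import numpy as np

def coupled_factorize(blocks, rank=5, iters=100, seed=0):
    """Fit X_k ~ W @ H_k with a common W across all blocks (n x f_k each)."""
    rng = np.random.default_rng(seed)
    n = blocks[0].shape[0]
    W = rng.standard_normal((n, rank))
    for _ in range(iters):
        # Block-specific loadings given the shared components.
        Hs = [np.linalg.lstsq(W, X, rcond=None)[0] for X in blocks]  # (r, f_k)
        # Shared components given all loadings (stacked across blocks).
        A = np.hstack(Hs)
        X_all = np.hstack(blocks)
        W = np.linalg.lstsq(A.T, X_all.T, rcond=None)[0].T
    return W, Hs
```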
Scale-Invariant Structure Saliency Selection for Fast Image Fusion
In this paper, we present a fast yet effective method for pixel-level
scale-invariant image fusion in spatial domain based on the scale-space theory.
Specifically, we propose a scale-invariant structure saliency selection scheme
based on the difference-of-Gaussian (DoG) pyramid of images to build the
weights or activity map. Due to the scale-invariant structure saliency
selection, our method preserves both the details of small objects and the
structural integrity of large objects. In addition, our method is very
efficient, since no complex operations are involved, and it is easy to
implement; it can therefore be used for fast fusion of high-resolution images.
Experimental results demonstrate that the proposed method yields competitive or
even better results compared to state-of-the-art image fusion methods, both in
terms of visual quality and objective evaluation metrics. Furthermore, the
proposed method is very fast and can fuse high-resolution images in real time.
Code is available at https://github.com/yiqingmy/Fusion
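A compact sketch of the DoG-based saliency weighting might look as follows: build a difference-of-Gaussian stack per image and take the maximum absolute response over scales as the activity map. The scale settings and weight normalization here are illustrative choices, not the paper's exact configuration (see the released code for that).

```python
import numpy as np
from scipy import ndimage

def dog_saliency(im, sigmas=(1, 2, 4, 8)):
    # Difference-of-Gaussian responses across a small scale stack.
    blurred = [ndimage.gaussian_filter(im, s) for s in sigmas]
    dogs = [b1 - b2 for b1, b2 in zip(blurred[:-1], blurred[1:])]
    # Scale-invariant structure saliency: max response over scales.
    return np.max(np.abs(np.stack(dogs)), axis=0)

def fuse(images):
    w = np.stack([dog_saliency(im) for im in images])
    w = w / (w.sum(axis=0, keepdims=True) + 1e-12)  # per-pixel activity map
    return sum(wi * im for wi, im in zip(w, images))
```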
EEG-based Intention Recognition from Spatio-Temporal Representations via Cascade and Parallel Convolutional Recurrent Neural Networks
A Brain-Computer Interface (BCI) is a system empowering humans to communicate
with or control the outside world using brain intentions alone.
Electroencephalography (EEG)-based BCIs are promising solutions due to their
convenient and portable instrumentation. Motor imagery EEG (MI-EEG) is one of
the most widely studied types of EEG signals, revealing a subject's movement
intentions without actual actions. Despite the extensive research on MI-EEG in
recent
years, it is still challenging to interpret EEG signals effectively due to the
massive noise in EEG signals (e.g., low signal-to-noise ratio and incomplete EEG
signals), and difficulties in capturing the inconspicuous relationships between
EEG signals and certain brain activities. Most existing works either consider
EEG only as chain-like sequences, neglecting complex dependencies between
adjacent signals, or perform simple temporal averaging over EEG sequences. In
this paper, we introduce both cascade and parallel convolutional recurrent
neural network models for precisely identifying human intended movements by
effectively learning compositional spatio-temporal representations of raw EEG
streams. The proposed models capture the spatial correlations between
physically neighboring EEG signals by converting the chain-like EEG sequences
into a 2D mesh-like hierarchy. An LSTM-based recurrent network then extracts
the subtle temporal dependencies of EEG data streams. Extensive experiments on
a
large-scale MI-EEG dataset (108 subjects, 3,145,160 EEG records) have
demonstrated that both models achieve high accuracy near 98.3% and outperform a
set of baseline methods and the most recent deep-learning-based EEG
recognition models, yielding a significant accuracy increase of 18% in the
cross-subject validation scenario.
Comment: 8 pages, 3 figures
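A hedged sketch of the cascade variant is given below: each time step's electrode readings are arranged on a 2D mesh, encoded by a small CNN, and the per-step codes are fed to an LSTM. The mesh size, channel counts, and layer widths are assumptions, not the paper's configuration.

```python
import torch

class CascadeCNNLSTM(torch.nn.Module):
    def __init__(self, hidden=128, classes=5):
        super().__init__()
        # Spatial encoder applied to each mesh-shaped EEG frame.
        self.cnn = torch.nn.Sequential(
            torch.nn.Conv2d(1, 16, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(16, 32, 3, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1))
        self.lstm = torch.nn.LSTM(32, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, classes)

    def forward(self, x):             # x: (B, T, H, W) mesh-shaped EEG frames
        B, T, H, W = x.shape
        z = self.cnn(x.reshape(B * T, 1, H, W)).reshape(B, T, 32)
        out, _ = self.lstm(z)         # temporal dependencies over time steps
        return self.head(out[:, -1])  # classify from the last hidden state
```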
Discriminative Representation Combinations for Accurate Face Spoofing Detection
Three discriminative representations for face presentation attack detection
are introduced in this paper. Firstly, we design a descriptor called the
spatial pyramid coding micro-texture (SPMT) feature to characterize local
appearance information. Secondly, we utilize SSD, a deep-learning detection
framework, to mine context cues and conduct end-to-end face presentation attack
detection. Finally, we design a descriptor called the template face matched
binocular depth (TFBD) feature to characterize stereo structures
of real and fake faces. For accurate presentation attack detection, we also
design two kinds of representation combinations. Firstly, we propose a
decision-level cascade strategy to combine SPMT with SSD. Secondly, we use a
simple score fusion strategy to combine face structure cues (TFBD) with local
micro-texture features (SPMT). To demonstrate the effectiveness of our design,
we evaluate the representation combination of SPMT and SSD on three public
datasets, which outperforms all other state-of-the-art methods. In addition, we
evaluate the representation combination of SPMT and TFBD on our dataset and
excellent performance is also achieved.
Comment: To be published in Pattern Recognition
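The score-fusion combination can be illustrated with a minimal weighted-sum sketch; the weight and threshold below are hypothetical hyperparameters, not values from the paper.

```python
import numpy as np

def fuse_scores(texture_score, depth_score, w=0.6, threshold=0.5):
    # Weighted sum of a micro-texture score and a depth-structure score.
    s = w * texture_score + (1 - w) * depth_score
    return np.asarray(s > threshold, dtype=int)  # 1 = real face, 0 = attack
```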