DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck
Deep reinforcement learning (DRL) agents are often sensitive to visual
changes that were unseen in their training environments. To address this
problem, we leverage the sequential nature of RL to learn robust
representations that encode only task-relevant information from observations
based on the unsupervised multi-view setting. Specifically, we introduce an
auxiliary objective based on the multi-view information bottleneck (MIB)
principle which quantifies the amount of task-irrelevant information and
encourages learning representations that are both predictive of the future and
less sensitive to task-irrelevant distractions. This enables us to train
high-performance policies that are robust to visual distractions and can
generalize to unseen environments. We demonstrate that our approach achieves
state-of-the-art performance on diverse visual control tasks from the DeepMind Control Suite,
even when the background is replaced with natural videos. In addition, we show
that our approach outperforms well-established baselines for generalization to
unseen environments on the Procgen benchmark. Our code is open-sourced and
available at https://github.com/JmfanBU/DRIBO.
Comment: 27 pages
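To make the MIB objective above concrete, here is a minimal sketch of such an auxiliary loss, assuming two augmented views of the same observation: an InfoNCE term preserves information shared between the views, while a symmetrized KL between the two stochastic encodings penalizes view-specific (task-irrelevant) information. The encoder layout, names, and choice of bounds are illustrative assumptions, not DRIBO's released implementation.

```python
# A sketch of a multi-view information-bottleneck auxiliary loss.
# Names and the choice of bounds are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps an observation to the mean and log-variance of a Gaussian posterior."""
    def __init__(self, obs_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def mib_loss(enc, view1, view2, beta=1e-3):
    mu1, logvar1 = enc(view1)
    mu2, logvar2 = enc(view2)
    z1 = mu1 + torch.randn_like(mu1) * (0.5 * logvar1).exp()
    z2 = mu2 + torch.randn_like(mu2) * (0.5 * logvar2).exp()
    # InfoNCE bound on I(z1; z2): matching views sit on the diagonal.
    logits = z1 @ z2.t() / z1.shape[1] ** 0.5
    labels = torch.arange(z1.shape[0], device=z1.device)
    nce = F.cross_entropy(logits, labels)
    # Symmetrized Gaussian KL penalizes view-specific (superfluous) information.
    var1, var2 = logvar1.exp(), logvar2.exp()
    kl12 = 0.5 * (var1 / var2 + (mu1 - mu2).pow(2) / var2 - 1 + logvar2 - logvar1).sum(-1)
    kl21 = 0.5 * (var2 / var1 + (mu2 - mu1).pow(2) / var1 - 1 + logvar1 - logvar2).sum(-1)
    return nce + beta * 0.5 * (kl12 + kl21).mean()

loss = mib_loss(Encoder(obs_dim=64, z_dim=16), torch.randn(8, 64), torch.randn(8, 64))
```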
Efficient and Effective Deep Multi-view Subspace Clustering
Recent multi-view subspace clustering achieves impressive results utilizing
deep networks, where the self-expressive correlation is typically modeled by a
fully connected (FC) layer. However, they still suffer from two limitations. i)
The parameter count of the FC layer grows quadratically with the number of
samples, incurring time and memory costs that make these methods impractical on
large-scale datasets. ii) Extracting a unified representation that
simultaneously satisfies minimal sufficiency and discriminability remains
under-explored. To address these limitations, we propose a novel deep framework, termed
Efficient and Effective deep Multi-View Subspace Clustering (EMVSC).
Instead of a parameterized FC layer, we design a Relation-Metric Net that
decouples the network parameter scale from the number of samples, yielding
greater computational efficiency. Most importantly, the proposed method devises a multi-type
auto-encoder to explicitly decouple consistent, complementary, and superfluous
information from every view, which is supervised by a soft clustering
assignment similarity constraint. Following information bottleneck theory and
the maximal coding rate reduction principle, a sufficient yet minimal unified
representation can be obtained while promoting intra-cluster aggregation and
inter-cluster separability within it. Extensive experiments show that EMVSC
yields results comparable to existing methods while achieving
state-of-the-art performance on various types of multi-view datasets.
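As a sketch of the efficiency idea above: rather than storing an N x N self-expressive coefficient matrix in an FC layer, a small metric network can score pairwise relations between embeddings, so the parameter count depends only on the embedding dimension, not on N. The class name RelationMetricNet and its layout are our guesses at the spirit of the design, not the paper's code.

```python
# A sketch: affinities computed by a small metric network over embeddings
# instead of an N x N self-expressive FC layer. Names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationMetricNet(nn.Module):
    """Scores pairwise affinities; parameter count is independent of N."""
    def __init__(self, z_dim, hidden=64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * z_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, z):
        n = z.shape[0]
        zi = z.unsqueeze(1).expand(n, n, -1)  # anchor embeddings
        zj = z.unsqueeze(0).expand(n, n, -1)  # candidate embeddings
        s = self.score(torch.cat([zi, zj], dim=-1)).squeeze(-1)
        eye = torch.eye(n, dtype=torch.bool, device=z.device)
        s = s.masked_fill(eye, float("-inf"))  # forbid self-affinity
        return F.softmax(s, dim=1)             # row-normalized affinities

# Self-expressive reconstruction: each sample approximated by the others.
z = torch.randn(8, 32)
C = RelationMetricNet(32)(z)
recon_loss = F.mse_loss(C @ z, z)
```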
Deep Variational Multivariate Information Bottleneck -- A Framework for Variational Losses
Variational dimensionality reduction methods are known for their high
accuracy, generative abilities, and robustness. We introduce a framework to
unify many existing variational methods and design new ones. The framework is
based on an interpretation of the multivariate information bottleneck, in which
an encoder graph, specifying what information to compress, is traded off
against a decoder graph, specifying a generative model. Using this framework,
we rederive existing dimensionality reduction methods including the deep
variational information bottleneck and variational auto-encoders. The framework
naturally introduces a trade-off parameter extending the deep variational CCA
(DVCCA) family of algorithms to beta-DVCCA. We derive a new method, the deep
variational symmetric information bottleneck (DVSIB), which simultaneously
compresses two variables to preserve information between their compressed
representations. We implement these algorithms and evaluate their ability to
produce shared low-dimensional latent spaces on the Noisy MNIST dataset. We show
that algorithms that are better matched to the structure of the data (in our
case, beta-DVCCA and DVSIB) produce better latent spaces as measured by
classification accuracy, dimensionality of the latent variables, and sample
efficiency. We believe that this framework can be used to unify other
multi-view representation learning algorithms and to derive and implement novel
problem-specific loss functions.
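To illustrate the trade-off the framework formalizes, here is a minimal sketch of a DVSIB-style loss, assuming Gaussian encoders for both variables: KL terms to a standard-normal prior play the role of the encoder graph (compression), and an InfoNCE critic stands in for the decoder graph, preserving information between the two compressed representations; beta is the trade-off parameter. Using InfoNCE as the information bound is our simplification, not the paper's derivation.

```python
# A sketch of a DVSIB-style variational loss under Gaussian-encoder assumptions.
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)

def dvsib_loss(mu_x, logvar_x, mu_y, logvar_y, beta=1.0):
    zx = mu_x + torch.randn_like(mu_x) * (0.5 * logvar_x).exp()
    zy = mu_y + torch.randn_like(mu_y) * (0.5 * logvar_y).exp()
    # Symmetric InfoNCE bound on I(Z_X; Z_Y): paired samples on the diagonal.
    logits = zx @ zy.t()
    labels = torch.arange(zx.shape[0], device=zx.device)
    info_nce = 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))
    # Compression (rate) terms for both encoder branches.
    rate = (kl_to_standard_normal(mu_x, logvar_x)
            + kl_to_standard_normal(mu_y, logvar_y)).mean()
    # Minimizing info_nce maximizes the preserved information.
    return beta * rate + info_nce
```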
InfiNet: Fully Convolutional Networks for Infant Brain MRI Segmentation
We present a novel, parameter-efficient and practical fully convolutional
neural network architecture, termed InfiNet, aimed at voxel-wise semantic
segmentation of infant brain MRI at the iso-intense stage, which can be
easily extended to other multi-modal segmentation tasks.
InfiNet consists of double encoder arms for T1 and T2 input scans that feed
into a joint-decoder arm that terminates in the classification layer. The
novelty of InfiNet lies in the manner in which the decoder upsamples lower
resolution input feature map(s) from multiple encoder arms. Specifically, the
pooling indices computed in the max-pooling layers of each of the encoder blocks
are relayed to the corresponding decoder block to perform non-linear,
learning-free upsampling. The resulting sparse maps are concatenated with intermediate
encoder representations (skip connections) and convolved with trainable filters
to produce dense feature maps. InfiNet is trained end-to-end to optimize for
the Generalized Dice Loss, which is well-suited for high class imbalance.
InfiNet achieves whole-volume segmentation in under 50 seconds, and we
demonstrate competitive performance against multiple state-of-the-art deep
architectures and their multi-modal variants.
Comment: 4 pages, 3 figures, conference, IEEE ISBI, 201
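The index-based upsampling described above can be sketched in a few lines, assuming a 3D PyTorch setting with illustrative channel sizes: max-pool indices saved in an encoder block drive MaxUnpool in the matching decoder block, and the resulting sparse map is concatenated with the skip connection before trainable convolutions densify it.

```python
# A sketch of non-linear, learning-free upsampling via pooling indices.
# Channel sizes and the single encoder/decoder pair are illustrative.
import torch
import torch.nn as nn

pool = nn.MaxPool3d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool3d(2, stride=2)
conv = nn.Conv3d(32, 16, kernel_size=3, padding=1)  # densifies the sparse map

enc_feat = torch.randn(1, 16, 32, 32, 32)  # output of one encoder block
pooled, idx = pool(enc_feat)               # indices are kept for the decoder

# ...deeper encoder/decoder blocks would process `pooled` here...

sparse = unpool(pooled, idx)               # learning-free upsampling
dense = conv(torch.cat([sparse, enc_feat], dim=1))  # fuse with skip connection
assert dense.shape == (1, 16, 32, 32, 32)
```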
End-to-End Audiovisual Fusion with LSTMs
Several end-to-end deep learning approaches have recently been presented
which simultaneously extract visual features from the input images and perform
visual speech classification. However, research on jointly extracting audio and
visual features and performing classification is very limited. In this work, we
present an end-to-end audiovisual model based on Bidirectional Long Short-Term
Memory (BLSTM) networks. To the best of our knowledge, this is the first
audiovisual fusion model which simultaneously learns to extract features
directly from the pixels and spectrograms and perform classification of speech
and nonlinguistic vocalisations. The model consists of multiple identical
streams, one for each modality, which extract features directly from mouth
regions and spectrograms. The temporal dynamics in each stream/modality are
modeled by a BLSTM and the fusion of multiple streams/modalities takes place
via another BLSTM. An absolute improvement of 1.9% in the mean F1 measure across 4
nonlinguistic vocalisations, compared to audio-only classification, is reported
on the AVIC database. At the same time, the proposed end-to-end audiovisual
fusion system improves the state-of-the-art performance on the AVIC database,
leading to a 9.7% absolute increase in the mean F1 measure. We also perform audiovisual
speech recognition experiments on the OuluVS2 database using different views of
the mouth, from frontal to profile. The proposed audiovisual system significantly
outperforms the audio-only model for all views when the acoustic noise is high.
Comment: Accepted to AVSP 2017. arXiv admin note: substantial text overlap
with arXiv:1709.00443 and text overlap with arXiv:1701.0584
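Below is a minimal sketch of the two-stream BLSTM fusion described above, with illustrative feature dimensions (the paper's front-ends learn features directly from mouth-region pixels and spectrograms): each modality gets its own bidirectional LSTM, and the concatenated per-step outputs feed a second BLSTM that fuses the streams before classification.

```python
# A sketch of two-stream BLSTM fusion; dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class BLSTMFusion(nn.Module):
    def __init__(self, vis_dim=256, aud_dim=128, hidden=128, n_classes=4):
        super().__init__()
        self.vis = nn.LSTM(vis_dim, hidden, batch_first=True, bidirectional=True)
        self.aud = nn.LSTM(aud_dim, hidden, batch_first=True, bidirectional=True)
        self.fuse = nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, n_classes)

    def forward(self, vis_seq, aud_seq):
        v, _ = self.vis(vis_seq)                 # (B, T, 2*hidden) per modality
        a, _ = self.aud(aud_seq)
        f, _ = self.fuse(torch.cat([v, a], -1))  # fuse modalities over time
        return self.cls(f[:, -1])                # classify from final time step

logits = BLSTMFusion()(torch.randn(2, 30, 256), torch.randn(2, 30, 128))
```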