566 research outputs found
Synthetic IR image refinement using adversarial learning with bidirectional mappings
© 2019 IEEE. Collecting a large dataset of real infrared (IR) images is expensive, time-consuming, and even infeasible in some scenarios. With recent progress in machine learning, it has become more feasible to replace real IR images with qualified synthetic IR images in learning-based IR systems. However, this alternative may fail to achieve the desired performance, due to the gap between real and synthetic IR images. Inspired by adversarial learning for image-to-image translation, we propose the Synthetic IR Refinement Generative Adversarial Network (SIR-GAN) to narrow this gap. By learning bidirectional mappings between two unpaired domains, where the source domain contains a large number of simulated IR images and the target domain contains a limited quantity of real IR images, the realism of the simulated IR images generated by the IR Simulator is significantly improved. Specifically, driven by the idea of transferring infrared characteristics while protecting target semantic information, we propose a SIR refinement loss that adds an infrared loss and a structure loss to the adversarial loss and the consistency loss. To further reduce the gap, stabilize training, and avoid artefacts, we refine the proposed algorithm by developing a training strategy, adding a U-Net to the generators, using dilated convolutions in the discriminators, and adopting NAdam as the optimizer. Qualitative, quantitative, and ablation-study experiments demonstrate the superiority of the proposed approach over state-of-the-art techniques in terms of realism and fidelity. In addition, our refined IR images are evaluated in a feasibility study, where the accuracy of the trained classifier is significantly improved by adding our refined data to a small real-data training set.
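The SIR refinement loss described above combines four terms. A minimal sketch of that combination, with illustrative placeholder weights (the lambda values are not the paper's actual hyperparameters, and the per-term inputs stand in for real network outputs):

```python
def sir_refinement_loss(adv, cyc, infrared, structure,
                        lambda_cyc=10.0, lambda_ir=1.0, lambda_st=1.0):
    """Weighted sum of the adversarial, cycle-consistency, infrared,
    and structure loss terms; the weights are illustrative only."""
    return adv + lambda_cyc * cyc + lambda_ir * infrared + lambda_st * structure

# Toy per-term values standing in for real network outputs.
total = sir_refinement_loss(adv=0.7, cyc=0.05, infrared=0.2, structure=0.1)
print(total)  # 0.7 + 10*0.05 + 0.2 + 0.1 = 1.5
```

In practice the cycle-consistency weight is typically the largest, since it anchors the refined image to the semantic content of its source.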
Multimodal Unsupervised Image-to-Image Translation
Unsupervised image-to-image translation is an important and challenging
problem in computer vision. Given an image in the source domain, the goal is to
learn the conditional distribution of corresponding images in the target
domain, without seeing any pairs of corresponding images. While this
conditional distribution is inherently multimodal, existing approaches make an
overly simplified assumption, modeling it as a deterministic one-to-one
mapping. As a result, they fail to generate diverse outputs from a given source
domain image. To address this limitation, we propose a Multimodal Unsupervised
Image-to-image Translation (MUNIT) framework. We assume that the image
representation can be decomposed into a content code that is domain-invariant,
and a style code that captures domain-specific properties. To translate an
image to another domain, we recombine its content code with a random style code
sampled from the style space of the target domain. We analyze the proposed
framework and establish several theoretical results. Extensive experiments with
comparisons to the state-of-the-art approaches further demonstrate the
advantage of the proposed framework. Moreover, our framework allows users to
control the style of translation outputs by providing an example style image.
Code and pretrained models are available at https://github.com/nvlabs/MUNIT
Comment: Accepted by ECCV 201
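MUNIT's core idea, recombining a domain-invariant content code with a style code sampled from the target domain, can be sketched with toy linear maps (the matrices `Ec` and `G` and all dimensions here are illustrative placeholders, not MUNIT's actual encoder/decoder networks):

```python
import numpy as np

rng = np.random.default_rng(0)

D, C, S = 8, 4, 3              # toy image, content, and style dimensions
Ec = rng.normal(size=(C, D))   # "content encoder": domain-invariant code
G = rng.normal(size=(D, C + S))  # "decoder" of the target domain

def translate(x, style_code=None):
    """Recombine x's content code with a target-domain style code."""
    content = Ec @ x
    if style_code is None:                 # sample from the style space
        style_code = rng.normal(size=S)
    return G @ np.concatenate([content, style_code])

x = rng.normal(size=D)
y1, y2 = translate(x), translate(x)   # two sampled styles
print(np.allclose(y1, y2))            # False: diverse outputs for one input
```

Sampling different style codes for the same content is exactly what makes the translation multimodal rather than a deterministic one-to-one mapping.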
When Autonomous Systems Meet Accuracy and Transferability through AI: A Survey
With widespread applications of artificial intelligence (AI), the
capabilities of the perception, understanding, decision-making and control for
autonomous systems have improved significantly in the past years. When
autonomous systems consider the performance of accuracy and transferability,
several AI methods, like adversarial learning, reinforcement learning (RL) and
meta-learning, show their powerful performance. Here, we review the
learning-based approaches in autonomous systems from the perspectives of
accuracy and transferability. Accuracy means that a well-trained model shows
good results during the testing phase, in which the testing set shares the same
task or data distribution as the training set. Transferability means that the
accuracy remains good when a well-trained model is transferred to other testing
domains. Firstly, we introduce some basic concepts of transfer learning
and then present some preliminaries of adversarial learning, RL and
meta-learning. Secondly, we review accuracy, transferability, or both to show
the advantages of adversarial learning, such as generative adversarial networks
(GANs), in typical computer vision tasks in autonomous systems, including image
style transfer, image super-resolution, image
deblurring/dehazing/rain removal, semantic segmentation, depth estimation,
pedestrian detection and person re-identification (re-ID). Then, we further
review the performance of RL and meta-learning in autonomous systems, in terms
of accuracy, transferability, or both, involving pedestrian tracking, robot
navigation, and robotic manipulation. Finally, we discuss several challenges
and future topics for using adversarial learning, RL, and meta-learning in
autonomous systems.
Semantic speech retrieval with a visually grounded model of untranscribed speech
There is growing interest in models that can learn from unlabelled speech
paired with visual context. This setting is relevant for low-resource speech
processing, robotics, and human language acquisition research. Here we study
how a visually grounded speech model, trained on images of scenes paired with
spoken captions, captures aspects of semantics. We use an external image tagger
to generate soft text labels from images, which serve as targets for a neural
model that maps untranscribed speech to (semantic) keyword labels. We introduce
a newly collected data set of human semantic relevance judgements and an
associated task, semantic speech retrieval, where the goal is to search for
spoken utterances that are semantically relevant to a given text query. Without
seeing any text, the model trained on parallel speech and images achieves a
precision of almost 60% on its top ten semantic retrievals. Compared to a
supervised model trained on transcriptions, our model matches human judgements
better by some measures, especially in retrieving non-verbatim semantic
matches. We perform an extensive analysis of the model and its resulting
representations.
Comment: 10 pages, 3 figures, 5 tables; accepted to the IEEE/ACM Transactions
on Audio, Speech and Language Processing
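The training target described above, soft text labels from an image tagger scored against the speech model's keyword predictions, amounts to a binary cross-entropy with soft targets. A small sketch under that assumption (the keyword vocabulary and all values are toy illustrations, not the paper's setup):

```python
import numpy as np

def soft_label_bce(logits, soft_targets):
    """Binary cross-entropy against soft keyword probabilities.

    `soft_targets` plays the role of the image tagger's soft text labels;
    the speech network's keyword logits are scored against them.
    """
    p = 1.0 / (1.0 + np.exp(-logits))      # sigmoid over the keyword vocab
    eps = 1e-12
    return -np.mean(soft_targets * np.log(p + eps)
                    + (1 - soft_targets) * np.log(1 - p + eps))

# Toy vocabulary of 4 keywords; the tagger assigns soft relevance scores.
targets = np.array([0.9, 0.1, 0.6, 0.0])
good = soft_label_bce(np.array([4.0, -4.0, 1.0, -4.0]), targets)
bad = soft_label_bce(np.array([-4.0, 4.0, -1.0, 4.0]), targets)
print(good < bad)  # True: logits that match the soft labels score lower
```

Because the targets are soft rather than one-hot transcriptions, the model can be rewarded for predicting semantically relevant keywords that never appear verbatim in the utterance.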
On the Origin of Deep Learning
This paper is a review of the evolutionary history of deep learning models.
It spans from the genesis of neural networks, when associationist modeling of
the brain was studied, to the models that dominated the last decade of research
in deep learning like convolutional neural networks, deep belief networks, and
recurrent neural networks. In addition to a review of these models, this paper
primarily focuses on the precedents of the models above, examining how the
initial ideas are assembled to construct the early models and how these
preliminary models are developed into their current forms. Many of these
evolutionary paths last more than half a century and have a diversity of
directions. For example, CNNs are built on prior knowledge of the biological
visual system; DBNs evolved from a trade-off between the modeling power and
computational complexity of graphical models; and many modern models are neural
counterparts of older linear models. This paper reviews these evolutionary paths and
offers a concise thought flow of how these models are developed, and aims to
provide a thorough background for deep learning. More importantly, along with
the path, this paper summarizes the gist behind these milestones and proposes
many directions to guide the future research of deep learning.
Comment: 70 pages, 200 references
Mask-ShadowGAN: Learning to Remove Shadows from Unpaired Data
This paper presents a new method for shadow removal using unpaired data,
enabling us to avoid tedious annotations and obtain more diverse training
samples. However, directly employing adversarial learning and cycle-consistency
constraints is insufficient to learn the underlying relationship between the
shadow and shadow-free domains, since the mapping between shadow and
shadow-free images is not simply one-to-one. To address the problem, we
formulate Mask-ShadowGAN, a new deep framework that automatically learns to
produce a shadow mask from the input shadow image and then takes the mask to
guide the shadow generation via re-formulated cycle-consistency constraints.
Particularly, the framework simultaneously learns to produce shadow masks and
learns to remove shadows, to maximize the overall performance. Also, we
prepared an unpaired dataset for shadow removal and demonstrated the
effectiveness of Mask-ShadowGAN in various experiments, even though it was
trained on unpaired data.
Comment: Accepted to ICCV 201
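The key ingredient above is a shadow mask derived from the input shadow image and its generated shadow-free counterpart. A simplified stand-in for that step, using a fixed illustrative threshold rather than the paper's actual binarization:

```python
import numpy as np

def shadow_mask(shadow_img, shadow_free_img, threshold=0.05):
    """Binarize the difference between the generated shadow-free image
    and the input shadow image; shadowed pixels are the darker regions.

    The fixed threshold is an illustrative choice, not the paper's."""
    diff = shadow_free_img - shadow_img
    return (diff > threshold).astype(np.float32)

free = np.array([[0.8, 0.8], [0.8, 0.8]])
shad = np.array([[0.8, 0.3], [0.8, 0.3]])   # right column in shadow
print(shadow_mask(shad, free))
# [[0. 1.]
#  [0. 1.]]
```

The mask then conditions the shadow *generation* direction of the cycle, which is what lets the cycle-consistency constraint cope with the many-to-one shadow/shadow-free mapping.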
Breaking Down the Barriers To Operator Workload Estimation: Advancing Algorithmic Handling of Temporal Non-Stationarity and Cross-Participant Differences for EEG Analysis Using Deep Learning
This research focuses on two barriers to using EEG data for workload assessment: day-to-day variability and cross-participant applicability. Several signal processing techniques and deep learning approaches are evaluated in multi-task environments. These methods account for temporal, spatial, and frequential data dependencies. Variance of frequency-domain power distributions for cross-day workload classification is statistically significant. Skewness and kurtosis are not significant in an environment absent workload transitions, but are salient with transitions present. LSTMs improve day-to-day feature stationarity, decreasing error by 59% compared to previous best results. A multi-path convolutional recurrent model using bi-directional, residual recurrent layers significantly increases predictive accuracy and decreases cross-participant variance. Deep learning regression approaches are applied to a multi-task environment with workload transitions. Accounting for temporal dependence significantly reduces error and increases correlation compared to baselines. Visualization techniques for LSTM feature saliency are developed to understand EEG analysis model biases.
Generative Adversarial Transformers
We introduce the GANformer, a novel and efficient type of transformer, and
explore it for the task of visual generative modeling. The network employs a
bipartite structure that enables long-range interactions across the image
while maintaining linear computational efficiency, and can readily scale to
high-resolution synthesis. It iteratively propagates information from a set of
latent variables to the evolving visual features and vice versa, to support the
refinement of each in light of the other and encourage the emergence of
compositional representations of objects and scenes. In contrast to the classic
transformer architecture, it utilizes multiplicative integration that allows
flexible region-based modulation, and can thus be seen as a generalization of
the successful StyleGAN network. We demonstrate the model's strength and
robustness through a careful evaluation over a range of datasets, from
simulated multi-object environments to rich real-world indoor and outdoor
scenes, showing it achieves state-of-the-art results in terms of image quality
and diversity, while enjoying fast learning and better data-efficiency. Further
qualitative and quantitative experiments offer us an insight into the model's
inner workings, revealing improved interpretability and stronger
disentanglement, and illustrating the benefits and efficacy of our approach. An
implementation of the model is available at
https://github.com/dorarad/gansformer
Comment: Published as a conference paper at ICML 202
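The bipartite propagation between latents and image features can be sketched with plain scaled dot-product attention run in both directions (the dimensions, the residual updates, and the single-head form are illustrative simplifications, not the GANformer's duplex-attention implementation):

```python
import numpy as np

def attend(queries, keys, values):
    """Scaled dot-product attention: queries gather from keys/values."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)       # softmax over the keys
    return w @ values

rng = np.random.default_rng(0)
k, n, d = 4, 16, 8            # latents, image features, channel dim
latents = rng.normal(size=(k, d))
features = rng.normal(size=(n, d))

# Bipartite exchange: latents gather from all features, then features
# read back from the updated latents -- cost O(k*n) rather than the
# O(n^2) of full self-attention over the image.
latents = latents + attend(latents, features, features)
features = features + attend(features, latents, latents)
print(features.shape)  # (16, 8)
```

Because every image position only attends to the small set of latents (and vice versa), long-range interactions are possible at a cost linear in image size.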
Deep-ESN: A Multiple Projection-encoding Hierarchical Reservoir Computing Framework
As an efficient recurrent neural network (RNN) model, reservoir computing
(RC) models, such as Echo State Networks, have attracted widespread attention
in the last decade. However, while they have had great success with time series
data [1], [2], many time series have a multiscale structure, which a
single-hidden-layer RC model may have difficulty capturing. In this paper, we
propose a novel hierarchical reservoir computing framework we call Deep Echo
State Networks (Deep-ESNs). The most distinctive feature of a Deep-ESN is its
ability to deal with time series through hierarchical projections.
Specifically, when an input time series is projected into the high-dimensional
echo-state space of a reservoir, a subsequent encoding layer (e.g., a PCA,
autoencoder, or a random projection) can project the echo-state representations
into a lower-dimensional space. These low-dimensional representations can then
be processed by another ESN. By using projection layers and encoding layers
alternately in the hierarchical framework, a Deep-ESN can not only attenuate
the effects of the collinearity problem in ESNs, but also fully take advantage
of the temporal kernel property of ESNs to explore multiscale dynamics of time
series. To fuse the multiscale representations obtained by each reservoir, we
add connections from each encoding layer to the last output layer. Theoretical
analyses prove that stability of a Deep-ESN is guaranteed by the echo state
property (ESP), and the time complexity is equivalent to a conventional ESN.
Experimental results on artificial and real-world time series demonstrate
that Deep-ESNs can capture multiscale dynamics and outperform both standard
ESNs and previous hierarchical ESN-based models.
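The alternating reservoir/encoder structure can be sketched in a few lines, here with a leaky reservoir followed by a PCA encoding layer and a second reservoir (reservoir size, leak rate, spectral radius, and the PCA dimension are all illustrative choices, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def reservoir(u, n_res=50, rho=0.9, leak=0.3):
    """Run a leaky ESN reservoir over an input sequence u of shape (T, d)."""
    W_in = rng.uniform(-0.5, 0.5, size=(n_res, u.shape[1]))
    W = rng.normal(size=(n_res, n_res))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1 (ESP)
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = (1 - leak) * x + leak * np.tanh(W_in @ u_t + W @ x)
        states.append(x)
    return np.array(states)                 # (T, n_res) echo states

def pca_encode(states, k=5):
    """Encoding layer: project echo states into a k-dimensional subspace."""
    c = states - states.mean(axis=0)
    _, _, vt = np.linalg.svd(c, full_matrices=False)
    return c @ vt[:k].T

# Reservoir -> PCA encoder -> reservoir, alternating as in a Deep-ESN.
u = np.sin(np.linspace(0, 8 * np.pi, 200))[:, None]
h1 = pca_encode(reservoir(u))
h2 = reservoir(h1)
print(h2.shape)  # (200, 50)
```

In the full framework each encoding layer's output would also feed the final readout, so the multiscale representations from every reservoir contribute to the prediction.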
Learning to see across Domains and Modalities
Deep learning has raised hopes and expectations as a general solution for
many applications; indeed it has proven effective, but it also showed a strong
dependence on large quantities of data. Luckily, it has been shown that, even
when data is scarce, a successful model can be trained by reusing prior
knowledge. Thus, developing techniques for transfer learning, in its broadest
definition, is a crucial element towards the deployment of effective and
accurate intelligent systems. This thesis will focus on a family of transfer
learning methods applied to the task of visual object recognition, specifically
image classification. Transfer learning is a general term, and specific
settings have been given specific names: when the learner only has access to
unlabeled data from a target domain and labeled data from a different domain
(the source), the problem is known as "unsupervised domain adaptation" (DA).
The first part of this work will focus on three methods for this setting: one
deals with features, one with images, while the third uses both. The second
part will focus on the real-life issues of
robotic perception, specifically RGB-D recognition. Robotic platforms are
usually not limited to color perception; very often they also carry a Depth
camera. Unfortunately, the depth modality is rarely used for visual recognition
due to the lack of pretrained models from which to transfer and little data to
train one on from scratch. Two methods for dealing with this scenario will be
presented: one using synthetic data and the other exploiting cross-modality
transfer learning