Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries
Associating image regions with text queries has been recently explored as a
new way to bridge visual and linguistic representations. A few pioneering
approaches have been proposed based on recurrent neural language models trained
generatively (e.g., to generate captions), but they achieve somewhat limited
localization accuracy. To better address natural-language-based visual entity
localization, we propose a discriminative approach. We formulate a
discriminative bimodal neural network (DBNet), which can be trained by a
classifier with extensive use of negative samples. Our training objective
encourages better localization on single images, incorporates text phrases in a
broad range, and properly pairs image regions with text phrases into positive
and negative examples. Experiments on the Visual Genome dataset demonstrate the
proposed DBNet significantly outperforms previous state-of-the-art methods both
for localization on single images and for detection on multiple images. We
also establish an evaluation protocol for natural-language visual detection.
Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
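
As a rough illustration of the discriminative formulation, the following is a minimal PyTorch sketch of a binary classifier over (image region, text phrase) pairs trained with mismatched negatives; the feature dimensions, fusion network, and pairing scheme are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a discriminative region-phrase scorer in the spirit of
# DBNet: a binary classifier over (region, phrase) pairs. All dimensions and
# the two-layer fusion network are assumptions for illustration only.
import torch
import torch.nn as nn

class RegionPhraseScorer(nn.Module):
    def __init__(self, region_dim=2048, phrase_dim=300, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(region_dim + phrase_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: does this phrase describe this region?
        )

    def forward(self, region_feat, phrase_emb):
        return self.fuse(torch.cat([region_feat, phrase_emb], dim=-1)).squeeze(-1)

scorer = RegionPhraseScorer()
regions = torch.randn(8, 2048)  # pooled CNN features for 8 candidate regions
phrases = torch.randn(8, 300)   # embeddings of 8 text phrases
labels = torch.tensor([1., 0., 0., 1., 0., 0., 0., 1.])  # 1 = matched pair

# Mismatched (negative) pairs dominate the batch, mirroring the abstract's
# "extensive use of negative samples" in the training objective.
loss = nn.functional.binary_cross_entropy_with_logits(
    scorer(regions, phrases), labels)
```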
Learning Deep Disentangled Embeddings with the F-Statistic Loss
Deep-embedding methods aim to discover representations of a domain that make
explicit the domain's class structure and thereby support few-shot learning.
Disentangling methods aim to make explicit compositional or factorial
structure. We combine these two active but independent lines of research and
propose a new paradigm suitable for both goals. We propose and evaluate a novel
loss function based on the F statistic, which describes the separation of two
or more distributions. By ensuring that distinct classes are well separated on
a subset of embedding dimensions, we obtain embeddings that are useful for
few-shot learning. By not requiring separation on all dimensions, we encourage
the discovery of disentangled representations. Our embedding method matches or
beats state-of-the-art, as evaluated by performance on recall@1 and few-shot
learning tasks. Our method also obtains performance superior to a variety of
alternatives on disentangling, as evaluated by two key properties of a
disentangled representation: modularity and explicitness. The goal of our work
is to obtain more interpretable, manipulable, and generalizable deep
representations of concepts and categories.
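
The loss can be pictured as a one-way-ANOVA F statistic (between-class variance over within-class variance) computed per embedding dimension, with separation enforced only on a subset of dimensions. The sketch below is a simplified PyTorch surrogate under that reading; the paper's actual loss is derived from the F distribution, and the top-k selection here is an assumption.

```python
import torch

def f_statistic_loss(embeddings, labels, top_k=16):
    # Per-dimension one-way ANOVA F statistic:
    # F = between-class variance / within-class variance.
    classes = labels.unique()
    overall_mean = embeddings.mean(0)
    between, within = 0.0, 0.0
    for c in classes:
        x = embeddings[labels == c]
        mu = x.mean(0)
        between = between + x.shape[0] * (mu - overall_mean) ** 2
        within = within + ((x - mu) ** 2).sum(0)
    k, n = len(classes), embeddings.shape[0]
    f = (between / (k - 1)) / (within / (n - k) + 1e-8)
    # Reward separation only on the top_k most discriminative dimensions;
    # the remaining dimensions stay free to encode other factors.
    top_f, _ = f.topk(top_k)
    return -top_f.log().mean()

emb = torch.randn(64, 128, requires_grad=True)
lbl = torch.randint(0, 8, (64,))
f_statistic_loss(emb, lbl).backward()
```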
The MBPEP: a deep ensemble pruning algorithm providing high quality uncertainty prediction
Machine learning algorithms have been effectively applied into various real
world tasks. However, it is difficult to provide high-quality machine learning
solutions to accommodate an unknown distribution of input datasets; this
difficulty is called the uncertainty prediction problem. In this paper, a
margin-based Pareto deep ensemble pruning (MBPEP) model is proposed. It
achieves high-quality uncertainty estimation with a small mean prediction
interval width (MPIW) and a high prediction interval coverage probability
(PICP) by using deep ensemble networks. In addition to
these networks, unique loss functions are proposed, and these functions make
the sub-learners available for standard gradient descent learning. Furthermore,
the margin criterion fine-tuning-based Pareto pruning method is introduced to
optimize the ensembles. Several experiments including predicting uncertainties
of classification and regression are conducted to analyze the performance of
MBPEP. The experimental results show that MBPEP achieves a small interval width
and a low learning error with an optimal number of ensembles. On real-world
problems, MBPEP performs well on input datasets with unknown distributions and
improves learning performance on a multitask problem compared to each single
model.
Comment: 20 pages, 7 figures
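
Both quality measures are simple to compute from an ensemble's predictive intervals. A minimal NumPy sketch, using a toy five-member ensemble and an assumed ±2σ interval-construction rule:

```python
import numpy as np

def interval_metrics(y_true, lower, upper):
    picp = ((y_true >= lower) & (y_true <= upper)).mean()  # coverage probability
    mpiw = (upper - lower).mean()                          # mean interval width
    return picp, mpiw

y = np.linspace(0, 1, 100)                 # toy regression targets
preds = y + 0.1 * np.random.randn(5, 100)  # 5 ensemble members' predictions
mean, std = preds.mean(0), preds.std(0)
picp, mpiw = interval_metrics(y, mean - 2 * std, mean + 2 * std)
print(f"PICP={picp:.3f}  MPIW={mpiw:.3f}")  # want PICP high, MPIW small
```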
Speech Dereverberation with Context-aware Recurrent Neural Networks
In this paper, we propose a model to perform speech dereverberation by
estimating its spectral magnitude from the reverberant counterpart. Our models
are capable of extracting features that take into account both short and
long-term dependencies in the signal through a convolutional encoder (which
extracts features from a short, bounded context of frames) and a recurrent
neural network for extracting long-term information. Our model outperforms a
recently proposed model that uses different context information depending on
the reverberation time, without requiring any sort of additional input,
yielding improvements of up to 0.4 on PESQ, 0.3 on STOI, and 1.0 on POLQA
relative to reverberant speech. We also show our model is able to generalize to
real room impulse responses even when only trained with simulated room impulse
responses, different speakers, and high reverberation times. Lastly, listening
tests show the proposed method outperforming benchmark models in reduction of
perceived reverberation.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language
Processing
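
A minimal PyTorch sketch of the described architecture shape, i.e. a convolutional encoder over a bounded frame context feeding a recurrent layer that predicts the clean magnitude; the layer sizes, kernel width, and STFT dimensions are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DereverbNet(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        # Conv1d over time: features from a short, bounded context of frames.
        self.encoder = nn.Conv1d(n_freq, hidden, kernel_size=11, padding=5)
        # LSTM: long-term dependencies across the whole utterance.
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, reverberant_mag):  # (batch, frames, freq bins)
        x = self.encoder(reverberant_mag.transpose(1, 2)).transpose(1, 2)
        x, _ = self.rnn(torch.relu(x))
        return self.out(x)  # estimated clean spectral magnitude

net = DereverbNet()
clean_est = net(torch.rand(2, 100, 257))  # 2 utterances, 100 frames each
```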
Leveraging Machine Learning and Big Data for Smart Buildings: A Comprehensive Survey
Future buildings will offer new convenience, comfort, and efficiency
possibilities to their residents. Changes will occur to the way people live as
technology becomes woven into people's lives and information processing is fully
integrated into their daily living activities and objects. The future
expectation of smart buildings includes making the residents' experience as
easy and comfortable as possible. The massive streaming data generated and
captured by smart building appliances and devices contains valuable information
that needs to be mined to facilitate timely actions and better decision making.
Machine learning and big data analytics will undoubtedly play a critical role
to enable the delivery of such smart services. In this paper, we survey the
area of smart building with a special focus on the role of techniques from
machine learning and big data analytics. This survey also reviews the current
trends and challenges faced in the development of smart building services.
AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation
We introduce a method for learning to generate the surface of 3D shapes. Our
approach represents a 3D shape as a collection of parametric surface elements
and, in contrast to methods generating voxel grids or point clouds, naturally
infers a surface representation of the shape. Beyond its novelty, our new shape
generation framework, AtlasNet, comes with significant advantages, such as
improved precision and generalization capabilities, and the possibility to
generate a shape of arbitrary resolution without memory issues. We demonstrate
these benefits and compare to strong baselines on the ShapeNet benchmark for
two applications: (i) auto-encoding shapes, and (ii) single-view reconstruction
from a still image. We also provide results showing its potential for other
applications, such as morphing, parametrization, super-resolution, matching,
and co-segmentation.
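
The core idea of a learned parametric surface element can be sketched as an MLP that maps points sampled from the unit square, conditioned on a shape latent, to points in 3D; arbitrary resolution then comes from simply sampling more points. The PyTorch sketch below covers a single patch with assumed layer sizes; AtlasNet itself uses a collection of such elements.

```python
import torch
import torch.nn as nn

class SurfaceElement(nn.Module):
    # Maps a 2D parameter point, conditioned on a shape latent, to a 3D point.
    def __init__(self, latent_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 3),
        )

    def forward(self, uv, latent):  # uv: (N, 2) in [0, 1]^2
        z = latent.expand(uv.shape[0], -1)
        return self.mlp(torch.cat([uv, z], dim=-1))

patch = SurfaceElement()
latent = torch.randn(1024)  # shape code, e.g. from an image or shape encoder
uv = torch.rand(2000, 2)    # sample density sets the output resolution
points = patch(uv, latent)  # (2000, 3) points on the generated surface
```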
Scale Aware Adaptation for Land-Cover Classification in Remote Sensing Imagery
Land-cover classification using remote sensing imagery is an important Earth
observation task. Recently, land-cover classification has benefited from the
development of fully convolutional neural networks for semantic segmentation. The
benchmark datasets available for training deep segmentation models in remote
sensing imagery tend to be small, however, often consisting of only a handful
of images from a single location with a single scale. This limits the models'
ability to generalize to other datasets. Domain adaptation has been proposed to
improve the models' generalization but we find these approaches are not
effective for dealing with the scale variation commonly found between remote
sensing image collections. We therefore propose a scale aware adversarial
learning framework to perform joint cross-location and cross-scale land-cover
classification. The framework has a dual discriminator architecture with a
standard feature discriminator as well as a novel scale discriminator. We also
introduce a scale attention module which produces scale-enhanced features.
Experimental results show that the proposed framework outperforms
state-of-the-art domain adaptation methods by a large margin.
Comment: The open-source code is available on GitHub:
https://github.com/xdeng7/scale-aware_d
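
A minimal sketch of the dual-discriminator idea: one discriminator judges which domain (location) a feature came from and a second judges its scale, while the segmentation network is trained adversarially to produce features that fool both. The architectures, the pooled-feature interface, and the binary label conventions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_disc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))
scale_disc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

features = torch.randn(8, 256)                # pooled segmenter features
domain = torch.randint(0, 2, (8, 1)).float()  # 0 = source site, 1 = target
scale = torch.randint(0, 2, (8, 1)).float()   # 0 = native scale, 1 = rescaled

# Discriminator step: learn to tell domains and scales apart.
d_loss = bce(feat_disc(features.detach()), domain) + \
         bce(scale_disc(features.detach()), scale)
# Adversarial step for the segmenter: push features toward domain- and
# scale-invariance by inverting the targets (alternating updates in practice).
g_loss = bce(feat_disc(features), 1 - domain) + \
         bce(scale_disc(features), 1 - scale)
```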
ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation
The performance of optical character recognition (OCR) systems has improved
significantly in the deep learning era. This is especially true for handwritten
text recognition (HTR), where each author has a unique style, unlike printed
text, where the variation is smaller by design. That said, deep learning based
HTR is limited, as in every other task, by the number of training examples.
Gathering data is a challenging and costly task, and even more so the labeling
task that follows, on which we focus here. One possible approach to reduce the
burden of data annotation is semi-supervised learning. Semi-supervised methods
use, in addition to labeled data, some unlabeled samples to improve
performance, compared to fully supervised ones. Consequently, such methods may
adapt to unseen images during test time.
We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten
text images that are versatile both in style and lexicon. ScrabbleGAN relies on
a novel generative model which can generate images of words with an arbitrary
length. We show how to operate our approach in a semi-supervised manner,
enjoying the aforementioned benefits such as performance boost over state of
the art supervised HTR. Furthermore, our generator can manipulate the resulting
text style. This allows us to change, for instance, whether the text is
cursive, or how thin the pen stroke is.
Comment: in CVPR 2020
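
The arbitrary-length property can be sketched by giving each character its own latent tile, concatenating the tiles along the width axis, and upsampling them jointly so that image width grows with word length and neighbouring characters can influence each other. The toy PyTorch sketch below omits the style noise vector and the adversarial/recognizer training of the full model, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class WordGenerator(nn.Module):
    def __init__(self, n_chars=26, z_dim=32):
        super().__init__()
        self.z_dim = z_dim
        self.char_emb = nn.Embedding(n_chars, z_dim * 4)  # 4-wide tile per char
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, word):  # word: (L,) character indices
        tiles = self.char_emb(word).view(len(word), self.z_dim, 1, 4)
        canvas = torch.cat(list(tiles), dim=-1).unsqueeze(0)  # width = 4 * L
        return self.upsample(canvas)  # joint upsampling blends neighbours

gen = WordGenerator()
img = gen(torch.tensor([18, 2, 17, 0, 1]))  # longer words yield wider images
```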
Real-Time User-Guided Image Colorization with Learned Deep Priors
We propose a deep learning approach for user-guided image colorization. The
system directly maps a grayscale image, along with sparse, local user "hints"
to an output colorization with a Convolutional Neural Network (CNN). Rather
than using hand-defined rules, the network propagates user edits by fusing
low-level cues along with high-level semantic information, learned from
large-scale data. We train on a million images, with simulated user inputs. To
guide the user towards efficient input selection, the system recommends likely
colors based on the input image and current user inputs. The colorization is
performed in a single feed-forward pass, enabling real-time use. Even with
randomly simulated user inputs, we show that the proposed system helps novice
users quickly create realistic colorizations, and offers large improvements in
colorization quality with just a minute of use. In addition, we demonstrate
that the framework can incorporate other user "hints" to the desired
colorization, showing an application to color histogram transfer. Our code and
models are available at https://richzhang.github.io/ideepcolor.
Comment: Accepted to SIGGRAPH 2017. Project page:
https://richzhang.github.io/ideepcolor
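
The described interface (grayscale image plus sparse local hints) can be sketched as a CNN over a four-channel input: the grayscale L channel, two ab hint channels that are zero wherever no hint was given, and a binary mask marking hinted pixels. The tiny network below is a stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HintColorizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1),  # predicted ab color channels
        )

    def forward(self, L, ab_hints, mask):
        return self.net(torch.cat([L, ab_hints, mask], dim=1))

model = HintColorizer()
L = torch.rand(1, 1, 64, 64)    # grayscale input
ab = torch.zeros(1, 2, 64, 64)  # sparse user hints in ab space
m = torch.zeros(1, 1, 64, 64)   # mask: 1 where the user placed a hint
ab[:, :, 30, 30] = torch.tensor([0.4, -0.2])
m[:, :, 30, 30] = 1.0
pred_ab = model(L, ab, m)       # one feed-forward pass, real-time friendly
```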
Deep Active Learning for Joint Classification & Segmentation with Weak Annotator
CNN visualization and interpretation methods, like class-activation maps
(CAMs), are typically used to highlight the image regions linked to class
predictions. These models make it possible to simultaneously classify images and extract
class-dependent saliency maps, without the need for costly pixel-level
annotations. However, they typically yield segmentations with high
false-positive rates and, therefore, coarse visualisations, more so when
processing challenging images, as encountered in histology. To mitigate this
issue, we propose an active learning (AL) framework, which progressively
integrates pixel-level annotations during training. Given training data with
global image-level labels, our deep weakly-supervised learning model jointly
performs supervised image-level classification and active learning for
segmentation, integrating pixel annotations by an oracle. Unlike standard AL
methods that focus on sample selection, we also leverage large numbers of
unlabeled images via pseudo-segmentations (i.e., self-learning at the pixel
level), and integrate them with the oracle-annotated samples during training.
We report extensive experiments over two challenging benchmarks --
high-resolution medical images (histology GlaS data for colon cancer) and
natural images (CUB-200-2011 for bird species). Our results indicate that, by
simply using random sample selection, the proposed approach can significantly
outperform state-of-the-art CAMs and AL methods, with an identical
oracle-supervision budget. Our code is publicly available.
Comment: 20 pages, 12 figures, WACV 2021
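
The training loop can be pictured as alternating oracle annotation of a randomly selected batch with CAM-derived pseudo-segmentations for the rest of the unlabeled pool. In the sketch below, train_step, the CAM thresholding, and the oracle masks are hypothetical placeholders standing in for the real model and annotator.

```python
import numpy as np

rng = np.random.default_rng(0)
unlabeled = list(range(1000))  # images with only global image-level labels
oracle_masks, pseudo_masks = {}, {}

for al_round in range(5):
    # Random sample selection, as in the reported experiments.
    picked = rng.choice(unlabeled, size=10, replace=False)
    for idx in picked:
        oracle_masks[idx] = f"oracle_mask_{idx}"  # placeholder pixel labels
        unlabeled.remove(idx)
    # Self-learning at the pixel level: pseudo-segmentations for the rest,
    # e.g. by thresholding the model's CAMs (placeholder below).
    for idx in unlabeled:
        pseudo_masks[idx] = f"cam_mask_{idx}"
    # train_step(oracle_masks, pseudo_masks)  # joint classification + segmentation
```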