Semi-supervised and Population Based Training for Voice Commands Recognition
We present a rapid design methodology that combines automated hyper-parameter
tuning with semi-supervised training to build highly accurate and robust models
for voice command classification. The proposed approach allows quick evaluation of
network architectures to fit performance and power constraints of available
hardware, while ensuring good hyper-parameter choices for each network in
real-world scenarios. Leveraging the vast amount of unlabeled data with a
student/teacher based semi-supervised method, classification accuracy is
improved from 84% to 94% on the validation set. For model optimization, we
explore the hyper-parameter space through population based training and obtain
an optimized model in the same time frame as it takes to train a single model.
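The student/teacher step above can be sketched as follows; the confidence threshold and the hard-label selection rule are illustrative assumptions, since the abstract does not specify how the teacher's predictions on unlabeled audio become training targets for the student:

```python
import numpy as np

def pseudo_label(teacher_probs, threshold=0.9):
    """Keep only unlabeled examples the teacher predicts confidently.

    teacher_probs: (N, C) class probabilities from the teacher model.
    Returns indices of confident examples and their hard pseudo-labels.
    """
    confidence = teacher_probs.max(axis=1)
    keep = np.where(confidence >= threshold)[0]
    labels = teacher_probs[keep].argmax(axis=1)
    return keep, labels

# Toy teacher output for 4 unlabeled clips over 3 command classes.
probs = np.array([[0.95, 0.03, 0.02],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.92, 0.03],
                  [0.10, 0.10, 0.80]])
keep, labels = pseudo_label(probs, threshold=0.9)
# keep -> [0, 2]; labels -> [0, 1]
```

Only clips 0 and 2 pass the 0.9 threshold; the ambiguous clip 1 is excluded, which is the usual way to keep pseudo-label noise out of the student's training set.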
High-Fidelity Image Generation With Fewer Labels
Deep generative models are becoming a cornerstone of modern machine learning.
Recent work on conditional generative adversarial networks has shown that
learning complex, high-dimensional distributions over natural images is within
reach. While the latest models are able to generate high-fidelity, diverse
natural images at high resolution, they rely on a vast quantity of labeled
data. In this work we demonstrate how one can benefit from recent work on self-
and semi-supervised learning to outperform the state of the art in both
unsupervised ImageNet synthesis and the conditional setting. In
particular, the proposed approach is able to match the sample quality (as
measured by FID) of the current state-of-the-art conditional model BigGAN on
ImageNet using only 10% of the labels and outperform it using 20% of the
labels.
Comment: Mario Lucic, Michael Tschannen, and Marvin Ritter contributed equally
to this work. ICML 2019 camera-ready version. Code available at
https://github.com/google/compare_ga
Recurrent Topic-Transition GAN for Visual Paragraph Generation
A natural image usually conveys rich semantic content and can be viewed from
different angles. Existing image description methods are largely restricted by
small sets of biased visual paragraph annotations, and fail to cover rich
underlying semantics. In this paper, we investigate a semi-supervised paragraph
generative framework that is able to synthesize diverse and semantically
coherent paragraph descriptions by reasoning over local semantic regions and
exploiting linguistic knowledge. The proposed Recurrent Topic-Transition
Generative Adversarial Network (RTT-GAN) builds an adversarial framework
between a structured paragraph generator and multi-level paragraph
discriminators. The paragraph generator generates sentences recurrently by
incorporating region-based visual and language attention mechanisms at each
step. The quality of generated paragraph sentences is assessed by multi-level
adversarial discriminators from two aspects, namely, plausibility at sentence
level and topic-transition coherence at paragraph level. The joint adversarial
training of RTT-GAN drives the model to generate realistic paragraphs with
smooth logical transition between sentence topics. Extensive quantitative
experiments on image and video paragraph datasets demonstrate the effectiveness
of our RTT-GAN in both supervised and semi-supervised settings. Qualitative
results on telling diverse stories for an image also verify the
interpretability of RTT-GAN.
Comment: 10 pages, 6 figures
Iterative Pseudo-Labeling for Speech Recognition
Pseudo-labeling has recently shown promise in end-to-end automatic speech
recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised
algorithm which efficiently performs multiple iterations of pseudo-labeling on
unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an
existing model at each iteration using both labeled data and a subset of
unlabeled data. We study the main components of IPL: decoding with a language
model and data augmentation. We then demonstrate the effectiveness of IPL by
achieving state-of-the-art word-error rate on the Librispeech test sets in both
standard and low-resource settings. We also study the effect of language models
trained on different corpora to show IPL can effectively utilize additional
text. Finally, we release a new large in-domain text corpus which does not
overlap with the Librispeech training transcriptions to foster research in
low-resource, semi-supervised ASR.
Comment: INTERSPEECH 202
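The IPL loop can be sketched as follows. The `toy_finetune` table and the decode callable are stand-ins: in the paper, fine-tuning updates an acoustic model with data augmentation, and decoding is beam search with a language model, all abstracted away here.

```python
import random

def ipl(finetune, decode, labeled, unlabeled, iterations=3, subset_frac=0.5):
    """Iterative Pseudo-Labeling, schematically.

    finetune(dataset) -> a model (here, any callable audio -> transcript);
    decode(model, x)  -> a transcript for unlabeled input x.
    """
    model = finetune(labeled)  # initial supervised model
    for _ in range(iterations):
        # Re-label a fresh subset of unlabeled data as the model evolves.
        subset = random.sample(unlabeled, int(len(unlabeled) * subset_frac))
        pseudo = [(x, decode(model, x)) for x in subset]
        model = finetune(labeled + pseudo)  # fine-tune on both sets
    return model

# Toy stand-in: a "model" is a lookup table with an uppercase fallback.
def toy_finetune(dataset):
    table = dict(dataset)
    return lambda x: table.get(x, x.upper())

model = ipl(toy_finetune, lambda m, x: m(x),
            labeled=[("go", "GO!")], unlabeled=["stop", "left"],
            iterations=2, subset_frac=1.0)
# model("go") -> "GO!"; model("stop") -> "STOP"
```

The key property the sketch preserves is that pseudo-labels are regenerated at every iteration with the current model, rather than fixed once up front.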
Learning to Generate with Memory
Memory units have been widely used to enrich the capabilities of deep
networks on capturing long-term dependencies in reasoning and prediction tasks,
but little investigation exists on deep generative models (DGMs) which are good
at inferring high-level invariant representations from unlabeled data. This
paper presents a deep generative model with a possibly large external memory
and an attention mechanism to capture the local detail information that is
often lost in the bottom-up abstraction process in representation learning. By
adopting a smooth attention model, the whole network is trained end-to-end by
optimizing a variational bound of data likelihood via auto-encoding variational
Bayesian methods, where an asymmetric recognition network is learnt jointly to
infer high-level invariant representations. The asymmetric architecture can
reduce the competition between bottom-up invariant feature extraction and
top-down generation of instance details. Our experiments on several datasets
demonstrate that memory can significantly boost the performance of DGMs and
even achieve state-of-the-art results on various tasks, including density
estimation, image generation, and missing value imputation.
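A minimal sketch of a smooth attention read over an external memory, the differentiable operation that lets such a model train end-to-end; the dot-product scoring is an assumption, since the abstract does not specify the attention form:

```python
import numpy as np

def memory_read(memory, query):
    """Smooth attention read over an external memory.

    memory: (slots, dim) matrix of stored detail vectors;
    query:  (dim,) vector coming from the generative path.
    Returns a convex combination of memory slots, so the read is
    differentiable and the whole network can be trained end-to-end.
    """
    scores = memory @ query              # (slots,) dot-product similarities
    scores = scores - scores.max()       # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()             # softmax attention weights
    return weights @ memory

memory = np.eye(3)                       # three orthogonal "detail" slots
read = memory_read(memory, np.array([10.0, 0.0, 0.0]))
# the read vector is dominated by slot 0
```

Because the softmax is smooth rather than a hard argmax lookup, gradients flow through the read into both the query network and the memory contents, which is what permits joint training with the variational bound.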
DASNet: Reducing Pixel-level Annotations for Instance and Semantic Segmentation
Pixel-level annotation demands expensive human effort and limits the
performance of deep networks, which usually benefit from more such training
data. In this work we aim to achieve high quality instance and semantic
segmentation results over a small set of pixel-level mask annotations and a
large set of box annotations. The basic idea is exploring detection models to
simplify the pixel-level supervised learning task and thus reduce the required
amount of mask annotations. Our architecture, named DASNet, consists of three
modules: detection, attention, and segmentation. The detection module detects
all classes of objects, the attention module generates multi-scale
class-specific features, and the segmentation module recovers the binary masks.
Our method demonstrates substantially improved performance compared to existing
semi-supervised approaches on the PASCAL VOC 2012 dataset.
Learning Pixel-wise Labeling from the Internet without Human Interaction
Deep learning stands at the forefront in many computer vision tasks. However,
deep neural networks are usually data-hungry and require a huge amount of
well-annotated training samples. Collecting sufficient annotated data is very
expensive in many applications, especially for pixel-level prediction tasks
such as semantic segmentation. To solve this fundamental issue, we consider a
new challenging vision task, Internetly supervised semantic segmentation, which
only uses Internet data with noisy image-level supervision of corresponding
query keywords for segmentation model training. We address this task by
proposing the following solution. A class-specific attention model unifying
multiscale forward and backward convolutional features is proposed to provide
initial segmentation "ground truth". The model trained with such noisy
annotations is then improved by an online fine-tuning procedure. It achieves
state-of-the-art performance under the weakly-supervised setting on the PASCAL
VOC 2012 dataset. The proposed framework also paves a new way towards learning
from the Internet without human interaction and could serve as a strong
baseline therein. Code and data will be released upon paper acceptance.
Joint Acoustic and Class Inference for Weakly Supervised Sound Event Detection
Sound event detection is a challenging task, especially for scenes with
multiple simultaneous events. While event classification methods tend to be
fairly accurate, event localization presents additional challenges, especially
when large amounts of labeled data are not available. Task4 of the 2018 DCASE
challenge presents an event detection task that requires accuracy in both
segmentation and recognition of events while providing only weakly labeled
training data. Supervised methods can produce accurate event labels but are
limited in event segmentation when training data lacks event timestamps. On the
other hand, unsupervised methods that model the acoustic properties of the
audio can produce accurate event boundaries but are not guided by the
characteristics of event classes and sound categories. We present a hybrid
approach that combines an acoustic-driven event boundary detection and a
supervised label inference using a deep neural network. This framework
leverages benefits of both unsupervised and supervised methodologies and takes
advantage of large amounts of unlabeled data, making it ideal for large-scale
weakly labeled event detection. Compared to a baseline system, the proposed
approach delivers a 15% absolute improvement in F-score, demonstrating the
benefits of the hybrid bottom-up, top-down approach.
Comment: Submitted to ICASSP 201
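The hybrid bottom-up/top-down idea can be sketched as follows: an unsupervised, acoustic-driven step proposes event boundaries, and a supervised classifier (a stub lambda here; a deep network in the paper) labels each segment. The energy thresholding and the toy classifier are illustrative assumptions, not the paper's actual components.

```python
import numpy as np

def energy_segments(energy, threshold=0.5):
    """Acoustic-driven boundary detection: contiguous runs of frames whose
    energy exceeds a threshold become candidate event segments."""
    segments, start = [], None
    for i, e in enumerate(energy):
        if e > threshold and start is None:
            start = i                      # segment opens
        elif e <= threshold and start is not None:
            segments.append((start, i))    # segment closes
            start = None
    if start is not None:
        segments.append((start, len(energy)))
    return segments

def hybrid_detect(energy, classify, threshold=0.5):
    """Unsupervised boundaries + supervised label inference per segment."""
    return [(s, e, classify(energy[s:e]))
            for s, e in energy_segments(energy, threshold)]

energy = np.array([0.1, 0.9, 0.8, 0.1, 0.1, 0.7, 0.6, 0.1])
events = hybrid_detect(
    energy, classify=lambda seg: "speech" if seg.mean() > 0.75 else "music")
# events -> [(1, 3, 'speech'), (5, 7, 'music')]
```

The boundaries come from the signal alone (no timestamps needed), while class identity comes from the supervised model, mirroring the division of labor the abstract describes.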
Training Deep Neural Networks on Noisy Labels with Bootstrapping
Current state-of-the-art deep learning systems for visual object recognition
and detection use purely supervised training with regularization such as
dropout to avoid overfitting. The performance depends critically on the amount
of labeled examples, and in current practice the labels are assumed to be
unambiguous and accurate. However, this assumption often does not hold; e.g. in
recognition, class labels may be missing; in detection, objects in the image
may not be localized; and in general, the labeling may be subjective. In this
work we propose a generic way to handle noisy and incomplete labeling by
augmenting the prediction objective with a notion of consistency. We consider a
prediction consistent if the same prediction is made given similar percepts,
where the notion of similarity is between deep network features computed from
the input data. In experiments we demonstrate that our approach yields
substantial robustness to label noise on several datasets. On MNIST handwritten
digits, we show that our model is robust to label corruption. On the Toronto
Face Database, we show that our model handles well the case of subjective
labels in emotion recognition, achieving state-of-the-art results, and can
also benefit from unlabeled face images with no modification to our method. On
the ILSVRC2014 detection challenge data, we show that our approach extends to
very deep networks, high resolution images and structured outputs, and results
in improved, scalable detection.
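One common form of the consistency-augmented objective described above is soft bootstrapping: the training target mixes the given (possibly noisy) label with the model's own prediction, so a confident, consistent prediction can down-weight a corrupted label. The exact mixing coefficient `beta` is an illustrative choice.

```python
import numpy as np

def soft_bootstrap_loss(probs, targets, beta=0.95):
    """Soft bootstrapping cross-entropy.

    probs:   (N, C) model predictions (softmax outputs);
    targets: (N, C) one-hot, possibly noisy, labels.
    The effective target is beta * targets + (1 - beta) * probs,
    i.e. the model partially "trusts itself" over the given labels.
    """
    mixed = beta * targets + (1.0 - beta) * probs
    return -(mixed * np.log(probs + 1e-12)).sum(axis=1).mean()

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
noisy = np.array([[1.0, 0.0],
                  [1.0, 0.0]])   # the second label is likely corrupted
loss_bootstrap = soft_bootstrap_loss(probs, noisy, beta=0.8)
loss_plain = soft_bootstrap_loss(probs, noisy, beta=1.0)  # plain cross-entropy
# bootstrapping down-weights the implausible second label,
# so loss_bootstrap < loss_plain
```

With `beta=1.0` the loss reduces to ordinary cross-entropy against the noisy labels; lowering `beta` trades label trust for prediction consistency.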
In-Order Transition-based Constituent Parsing
Both bottom-up and top-down strategies have been used for neural
transition-based constituent parsing. The parsing strategies differ in terms of
the order in which they recognize productions in the derivation tree, where
bottom-up strategies and top-down strategies take post-order and pre-order
traversal over trees, respectively. Bottom-up parsers benefit from rich
features from readily built partial parses, but lack lookahead guidance in the
parsing process; top-down parsers benefit from non-local guidance for local
decisions, but rely on a strong encoder over the input to predict a constituent
hierarchy before its construction. To mitigate both issues, we propose a novel
parsing system based on in-order traversal over syntactic trees, designing a
set of transition actions to find a compromise between bottom-up constituent
information and top-down lookahead information. Based on stack-LSTM, our
psycholinguistically motivated constituent parsing system achieves 91.8 F1 on
WSJ benchmark. Furthermore, the system achieves 93.6 F1 with supervised
reranking and 94.2 F1 with semi-supervised reranking, which are the best
results on the WSJ benchmark.
Comment: Accepted by TAC
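The three traversal orders can be illustrated on a toy derivation tree. This sketch only shows the visit order each strategy induces, not the paper's full transition system (shift/reduce actions, stack-LSTM encoder, etc.):

```python
def traverse(tree, order):
    """Visit labels of a derivation tree in pre-, post-, or in-order.

    A tree is either a terminal string or a (label, children) pair.
    Bottom-up parsers follow post-order, top-down parsers pre-order,
    and the in-order strategy emits a node right after its first child.
    """
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    kids = [traverse(c, order) for c in children]
    if order == "pre":
        return [label] + [x for k in kids for x in k]
    if order == "post":
        return [x for k in kids for x in k] + [label]
    # in-order: first child, then the node, then the remaining children
    return kids[0] + [label] + [x for k in kids[1:] for x in k]

# (S (NP she) (VP (V likes) (NP parsing)))
tree = ("S", [("NP", ["she"]),
              ("VP", [("V", ["likes"]), ("NP", ["parsing"])])])
pre = traverse(tree, "pre")    # ['S', 'NP', 'she', 'VP', 'V', 'likes', 'NP', 'parsing']
post = traverse(tree, "post")  # ['she', 'NP', 'likes', 'V', 'parsing', 'NP', 'VP', 'S']
ino = traverse(tree, "in")     # ['she', 'NP', 'S', 'likes', 'V', 'VP', 'parsing', 'NP']
```

The in-order sequence shows the compromise the abstract describes: each constituent label is predicted after its first child is built (partial bottom-up evidence) but before the rest (top-down lookahead for the remaining children).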