62,522 research outputs found
Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition
We propose a novel approach to semi-supervised automatic speech recognition
(ASR). We first exploit a large amount of unlabeled audio data via
representation learning, where we reconstruct a temporal slice of filterbank
features from past and future context frames. The resulting deep contextualized
acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end
ASR system using a smaller amount of labeled audio data. In our experiments, we
show that systems trained on DeCoAR consistently outperform ones trained on
conventional filterbank features, giving 42% and 19% relative improvement over
the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our
approach can drastically reduce the amount of labeled data required;
unsupervised training on LibriSpeech then supervision with 100 hours of labeled
data achieves performance on par with training on all 960 hours directly.
Pre-trained models and code will be released online.Comment: Accepted to ICASSP 2020 (oral
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We address the problem of video representation learning without
human-annotated labels. While previous efforts address the problem by designing
novel self-supervised tasks using video data, the learned features are merely
on a frame-by-frame basis, which are not applicable to many video analytic
tasks where spatio-temporal features are prevailing. In this paper we propose a
novel self-supervised approach to learn spatio-temporal features for video
representation. Inspired by the success of two-stream approaches in video
classification, we propose to learn visual features by regressing both motion
and appearance statistics along spatial and temporal dimensions, given only the
input video data. Specifically, we extract statistical concepts (fast-motion
region and the corresponding dominant direction, spatio-temporal color
diversity, dominant color, etc.) from simple patterns in both spatial and
temporal domains. Unlike prior puzzles that are even hard for humans to solve,
the proposed approach is consistent with human inherent visual habits and
therefore easy to answer. We conduct extensive experiments with C3D to validate
the effectiveness of our proposed approach. The experiments show that our
approach can significantly improve the performance of C3D when applied to video
classification tasks. Code is available at
https://github.com/laura-wang/video_repres_mas.Comment: CVPR 201
ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids
We introduce an unsupervised feature learning approach that embeds 3D shape
information into a single-view image representation. The main idea is a
self-supervised training objective that, given only a single 2D image, requires
all unseen views of the object to be predictable from learned features. We
implement this idea as an encoder-decoder convolutional neural network. The
network maps an input image of an unknown category and unknown viewpoint to a
latent space, from which a deconvolutional decoder can best "lift" the image to
its complete viewgrid showing the object from all viewing angles. Our
class-agnostic training procedure encourages the representation to capture
fundamental shape primitives and semantic regularities in a data-driven
manner---without manual semantic labels. Our results on two widely-used shape
datasets show 1) our approach successfully learns to perform "mental rotation"
even for objects unseen during training, and 2) the learned latent space is a
powerful representation for object recognition, outperforming several existing
unsupervised feature learning methods.Comment: To appear at ECCV 201
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Data augmentation is one of the most effective ways to make end-to-end
automatic speech recognition (ASR) perform close to the conventional hybrid
approach, especially when dealing with low-resource tasks. Using recent
advances in speech synthesis (text-to-speech, or TTS), we build our TTS system
on an ASR training database and then extend the data with synthesized speech to
train a recognition model. We argue that, when the training data amount is
relatively low, this approach can allow an end-to-end model to reach hybrid
systems' quality. For an artificial low-to-medium-resource setup, we compare
the proposed augmentation with the semi-supervised learning technique. We also
investigate the influence of vocoder usage on final ASR performance by
comparing Griffin-Lim algorithm with our modified LPCNet. When applied with an
external language model, our approach outperforms a semi-supervised setup for
LibriSpeech test-clean and only 33% worse than a comparable supervised setup.
Our system establishes a competitive result for end-to-end ASR trained on
LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for
test-other
- …