Unsupervised Feature Learning for Audio Analysis
Identifying acoustic events in a continuously streaming audio source is of interest for many applications, including environmental monitoring for basic research. In this scenario, neither the different event classes nor what distinguishes one class from another is known in advance. This paper therefore presents an unsupervised feature learning method for exploring audio data. It
incorporates two novel contributions: first, an audio frame predictor based on a Convolutional LSTM autoencoder, which is used for unsupervised feature extraction; second, a training method for autoencoders that leads to distinct features by amplifying event similarities. Compared to standard approaches, the features extracted from the audio frame predictor trained with the novel method show 13% better results when used with a classifier and 36% better results when used for clustering.
Comment: Presented at the 5th International Conference on Learning Representations (ICLR) 2017, Workshop Track, Toulon, France.
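The abstract does not pin down the predictor's architecture. As a rough, hypothetical illustration in PyTorch, a ConvLSTM-based frame predictor could look like the sketch below; the class names, single-cell design, and channel sizes are assumptions of this sketch, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four LSTM gates from one 2-D convolution."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, c

class FramePredictor(nn.Module):
    """Encode T observed spectrogram frames and predict frame T+1.
    The last hidden state doubles as the unsupervised feature map."""
    def __init__(self, hid_ch=32):
        super().__init__()
        self.cell = ConvLSTMCell(1, hid_ch)
        self.readout = nn.Conv2d(hid_ch, 1, 1)  # project back to one channel

    def forward(self, frames):                  # frames: (B, T, 1, H, W)
        B, T, _, H, W = frames.shape
        h = frames.new_zeros(B, self.cell.hid_ch, H, W)
        c = torch.zeros_like(h)
        for t in range(T):
            h, c = self.cell(frames[:, t], (h, c))
        return self.readout(h), h               # predicted frame, features
```

Trained with, for example, an MSE loss between the predicted and actual next frame, the feature map h can then be pooled and handed to a classifier or clustering algorithm, mirroring the paper's evaluation setup.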
On The Inductive Bias of Words in Acoustics-to-Word Models
Acoustics-to-word models are end-to-end speech recognizers that use words as
targets without relying on pronunciation dictionaries or graphemes. These
models are notoriously difficult to train due to the lack of linguistic
knowledge. It is also unclear how the amount of training data impacts the
optimization and generalization of such models. In this work, we study the
optimization and generalization of acoustics-to-word models under different
amounts of training data. In addition, we study three types of inductive bias:
leveraging a pronunciation dictionary, word boundary annotations, and
constraints on word durations. We find that constraining word durations leads
to the most improvement. Finally, we analyze the word embedding space learned
by the model, and find that the space has a structure dominated by the
pronunciation of words. This suggests that the contexts of words, instead of
their phonetic structure, should be the future focus of inductive bias in
acoustics-to-word models.
Acoustic-to-Word Models with Conversational Context Information
Conversational context information, higher-level knowledge that spans sentences, can help in recognizing a long conversation. However, existing speech recognition models are typically built at the sentence level and thus may not capture important conversational context. Recent progress in
end-to-end speech recognition enables integrating context with other available
information (e.g., acoustic, linguistic resources) and directly recognizing
words from speech. In this work, we present a direct acoustic-to-word,
end-to-end speech recognition model capable of utilizing the conversational
context to better process long conversations. We evaluate our proposed approach
on the Switchboard conversational speech corpus and show that our system
outperforms a standard end-to-end speech recognition system.
Comment: NAACL 2019. arXiv admin note: text overlap with arXiv:1808.0217
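The abstract does not detail how the context is injected. One plausible, entirely hypothetical rendering in PyTorch is a word-level decoder that carries a mean-pooled context vector from one utterance to the next; the module name, dimensions, and pooling choice are this sketch's assumptions.

```python
import torch
import torch.nn as nn

class ContextualWordDecoder(nn.Module):
    """Word-level decoder conditioned on a context vector carried over
    from the previous utterance in the conversation."""
    def __init__(self, vocab, emb=256, hid=512, ctx=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb + ctx, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, prev_words, context):     # context: (B, ctx)
        e = self.embed(prev_words)               # (B, U, emb)
        ctx = context.unsqueeze(1).expand(-1, e.size(1), -1)
        h, _ = self.rnn(torch.cat([e, ctx], dim=-1))
        logits = self.out(h)
        # mean of hidden states becomes the context for the next utterance
        return logits, h.mean(dim=1)
```

At inference, the context returned for utterance n is fed in when decoding utterance n+1 (a zero vector for the first utterance), which is what lets the model look beyond sentence boundaries.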
Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition
We investigate the effectiveness of generative adversarial networks (GANs)
for speech enhancement, in the context of improving noise robustness of
automatic speech recognition (ASR) systems. Prior work demonstrates that GANs
can effectively suppress additive noise in raw waveform speech signals,
improving perceptual quality metrics; however, this technique was not justified
in the context of ASR. In this work, we conduct a detailed study to measure the
effectiveness of GANs in enhancing speech contaminated by both additive and
reverberant noise. Motivated by recent advances in image processing, we propose
operating GANs on log-Mel filterbank spectra instead of waveforms, which
requires less computation and is more robust to reverberant noise. While GAN
enhancement improves the performance of a clean-trained ASR system on noisy
speech, it falls short of the performance achieved by conventional multi-style
training (MTR). By appending the GAN-enhanced features to the noisy inputs and
retraining, we achieve a 7% WER improvement relative to the MTR system.
Comment: Published as a conference paper at ICASSP 2018.
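For concreteness, the front end and the feature-appending step might look roughly as follows with torchaudio; the 80-Mel, 25 ms window / 10 ms hop settings are common defaults assumed here, not necessarily the paper's.

```python
import torch
import torchaudio

# Log-Mel filterbank features: the domain the GANs operate in here.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)

def log_mel(wav):                        # wav: (B, samples)
    return (mel(wav) + 1e-6).log()       # (B, 80, frames)

# Retraining input: GAN-enhanced features appended to the noisy ones,
# so the acoustic model sees both views of each frame.
noisy = log_mel(torch.randn(1, 16000))        # stand-in for noisy speech
enhanced = noisy                              # stand-in for the GAN's output
asr_input = torch.cat([noisy, enhanced], dim=1)   # (B, 160, frames)
```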
Multiresolution and Multimodal Speech Recognition with Transformers
This paper presents an audio-visual automatic speech recognition (AV-ASR)
system using a Transformer-based architecture. We particularly focus on the
scene context provided by the visual information, to ground the ASR. We extract
representations for audio features in the encoder layers of the transformer and
fuse video features using an additional crossmodal multihead attention layer.
Additionally, we incorporate a multitask training criterion for multiresolution
ASR, where we train the model to generate both character and subword level
transcriptions.
Experimental results on the How2 dataset indicate that multiresolution training can speed up convergence by around 50% and relatively improves word error rate (WER) performance by up to 18% over subword prediction models. Further, incorporating visual information improves performance, with relative gains of up to 3.76% over audio-only models.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based
architectures.
Comment: Accepted for ACL 2020.
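A minimal sketch of such a crossmodal attention layer in PyTorch, assuming both modalities have already been projected to a shared dimension (the residual-plus-LayerNorm wiring is a typical Transformer choice assumed here, not taken from the paper):

```python
import torch
import torch.nn as nn

class CrossmodalFusion(nn.Module):
    """Audio states attend over video features via multihead attention;
    the attended summary is added back residually."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio, video):    # audio: (B, Ta, d), video: (B, Tv, d)
        ctx, _ = self.attn(query=audio, key=video, value=video)
        return self.norm(audio + ctx)   # residual connection + layer norm
```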
Multi-encoder multi-resolution framework for end-to-end speech recognition
Attention-based methods and Connectionist Temporal Classification (CTC)
networks have been promising research directions for end-to-end Automatic Speech
Recognition (ASR). The joint CTC/Attention model has achieved great success by
utilizing both architectures during multi-task training and joint decoding. In
this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework
based on the joint CTC/Attention model. Two heterogeneous encoders with
different architectures, temporal resolutions and separate CTC networks work in
parallel to extract complementary acoustic information. A hierarchical
attention mechanism is then used to combine the encoder-level information. To
demonstrate the effectiveness of the proposed model, experiments are conducted
on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate
(WER) reductions of 18.0-32.1%. Moreover, the proposed MEMR model achieves 3.6% WER on the WSJ eval92 test set, which is the best WER reported for an end-to-end system on this benchmark.
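The hierarchical step can be pictured as a second attention over the per-encoder context vectors. A minimal sketch, with the dimension names and concatenation-based scorer being assumptions of this illustration:

```python
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    """Combine per-encoder context vectors with stream-level attention
    weights computed from the current decoder state."""
    def __init__(self, d_ctx, d_dec):
        super().__init__()
        self.score = nn.Linear(d_ctx + d_dec, 1)

    def forward(self, contexts, dec_state):
        # contexts: list of (B, d_ctx) vectors, one per encoder stream
        stacked = torch.stack(contexts, dim=1)             # (B, S, d_ctx)
        dec = dec_state.unsqueeze(1).expand(-1, stacked.size(1), -1)
        w = self.score(torch.cat([stacked, dec], dim=-1))  # (B, S, 1)
        w = torch.softmax(w, dim=1)
        return (w * stacked).sum(dim=1)        # fused context, (B, d_ctx)
```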
Promising Accurate Prefix Boosting for sequence-to-sequence ASR
In this paper, we present promising accurate prefix boosting (PAPB), a
discriminative training technique for attention based sequence-to-sequence
(seq2seq) ASR. PAPB is devised to unify the training and testing scheme in an
effective manner. The training procedure involves maximizing the score of each partially correct sequence obtained during beam search relative to the other hypotheses. The training objective also includes minimization of the token (character) error rate. PAPB shows its efficacy by achieving 3.8% and 10.8% WER with and without an RNNLM, respectively, on the Wall Street Journal dataset.
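Read this way, the prefix term behaves like a ranking loss over the beam. A toy rendering of just that term (the token-error-rate part of the objective is omitted, and the tensor layout is assumed):

```python
import torch.nn.functional as F

def prefix_boost_loss(beam_scores, correct_idx):
    """Push probability mass toward the beam entry whose prefix matches
    the reference, relative to the competing hypotheses.

    beam_scores : (B, beam) cumulative log-probabilities of each prefix
    correct_idx : (B,) index of the reference-matching prefix in the beam
    """
    return F.cross_entropy(beam_scores, correct_idx)
```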
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
We present a recurrent encoder-decoder deep neural network architecture that
directly translates speech in one language into text in another. The model does
not explicitly transcribe the speech into text in the source language, nor does
it require supervision from the ground truth source language transcription
during training. We apply a slightly modified sequence-to-sequence with
attention architecture that has previously been used for speech recognition and
show that it can be repurposed for this more complex task, illustrating the
power of attention-based models. A single model trained end-to-end obtains
state-of-the-art performance on the Fisher Callhome Spanish-English speech
translation task, outperforming a cascade of independently trained
sequence-to-sequence speech recognition and machine translation models by 1.8
BLEU points on the Fisher test set. In addition, we find that making use of the
training data in both languages by multi-task training sequence-to-sequence
speech translation and recognition models with a shared encoder network can
improve performance by a further 1.4 BLEU points.
Comment: 5 pages, 1 figure. Interspeech 2017.
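A stripped-down picture of that multi-task setup: one shared encoder feeding a translation decoder and a recognition decoder, with the two cross-entropy losses summed. Everything here (the callables, teacher-forcing layout, equal loss weights) is a hypothetical sketch, not the paper's recipe.

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padding positions

def multitask_step(encoder, st_decoder, asr_decoder,
                   speech, translation, transcript):
    """One training step: shared encoder, two decoders, summed losses."""
    enc = encoder(speech)                              # shared representation
    st_logits = st_decoder(enc, translation[:, :-1])   # teacher forcing
    asr_logits = asr_decoder(enc, transcript[:, :-1])
    loss_st = ce(st_logits.transpose(1, 2), translation[:, 1:])
    loss_asr = ce(asr_logits.transpose(1, 2), transcript[:, 1:])
    return loss_st + loss_asr
```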
Future Semantic Segmentation with Convolutional LSTM
We consider the problem of predicting semantic segmentation of future frames
in a video. Given several observed frames in a video, our goal is to predict
the semantic segmentation map of future frames that are not yet observed. A
reliable solution to this problem is useful in many applications that require
real-time decision making, such as autonomous driving. We propose a novel model
that uses convolutional LSTM (ConvLSTM) to encode the spatiotemporal
information of observed frames for future prediction. We also extend our model
to use bidirectional ConvLSTM to capture temporal information in both
directions. Our proposed approach outperforms other state-of-the-art methods on
the benchmark dataset.
Comment: Accepted to BMVC 2018.
ESPnet: End-to-End Speech Processing Toolkit
This paper introduces a new open source platform for end-to-end speech
processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech
recognition (ASR), and adopts widely-used dynamic neural network toolkits,
Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the
Kaldi ASR toolkit style for data processing, feature extraction/format, and
recipes to provide a complete setup for speech recognition and other speech
processing experiments. This paper explains the major architecture of this software platform, several important functionalities that differentiate ESPnet from other open-source ASR toolkits, and experimental results on major ASR benchmarks.
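The joint CTC/attention training that ESPnet builds on interpolates the two losses, L = lam * L_ctc + (1 - lam) * L_att. A stripped-down stand-in for that multitask loss (the weight 0.3 and the tensor layouts are illustrative assumptions, not ESPnet's actual code):

```python
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
att_loss = nn.CrossEntropyLoss(ignore_index=-1)

def joint_loss(log_probs, in_lens, targets, tgt_lens,
               dec_logits, dec_targets, lam=0.3):
    """Interpolated multi-task loss of a hybrid CTC/attention model.

    log_probs   : (T, B, V) log-softmax encoder outputs for CTC
    targets     : (B, S) label sequences for CTC
    dec_logits  : (B, U, V) attention-decoder outputs
    dec_targets : (B, U) shifted label sequences, -1 where padded
    """
    l_ctc = ctc_loss(log_probs, targets, in_lens, tgt_lens)
    l_att = att_loss(dec_logits.transpose(1, 2), dec_targets)
    return lam * l_ctc + (1 - lam) * l_att
```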