Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models
We develop streaming keyword spotting systems using a recurrent neural
network transducer (RNN-T) model: an all-neural, end-to-end trained,
sequence-to-sequence model which jointly learns acoustic and language model
components. Our models are trained to predict either phonemes or graphemes as
subword units, thus allowing us to detect arbitrary keyword phrases, without
any out-of-vocabulary words. In order to adapt the models to the requirements
of keyword spotting, we propose a novel technique which biases the RNN-T system
towards a specific keyword of interest.
Our systems are compared against a strong sequence-trained, connectionist
temporal classification (CTC) based "keyword-filler" baseline, which is
augmented with a separate phoneme language model. Overall, our RNN-T system
with the proposed biasing technique significantly improves performance over the
baseline system.
Comment: To appear in Proceedings of IEEE ASRU 201
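The abstract does not detail the biasing mechanism, so the sketch below is only a generic, shallow-fusion-style illustration of keyword biasing: at each decoding step, the log posterior of the next expected keyword subword unit is boosted before renormalisation. The function name and the bias value are hypothetical, not taken from the paper.

import numpy as np

def bias_keyword_step(log_probs, expected_unit, bias=2.0):
    """Illustrative shallow-fusion-style keyword biasing (not the paper's method).

    log_probs: (V,) per-step log posteriors over subword units.
    expected_unit: index of the next unconsumed unit of the keyword phrase.
    bias: hypothetical additive log-domain boost for that unit.
    """
    biased = log_probs.copy()
    biased[expected_unit] += bias          # boost the keyword continuation
    biased -= np.logaddexp.reduce(biased)  # renormalise in the log domain
    return biased

# Toy usage: 5 subword units, keyword expects unit 3 next.
step = np.log(np.array([0.2, 0.1, 0.3, 0.15, 0.25]))
print(bias_keyword_step(step, expected_unit=3))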
Transparent pronunciation scoring using articulatorily weighted phoneme edit distance
To research the effects of gamification on foreign language learning by
children in the "Say It Again, Kid!" project, we developed a feedback paradigm
that can drive gameplay in pronunciation learning games. We describe our
scoring system, which is based on the difference between a reference phone
sequence and the output of a multilingual CTC phoneme recogniser. We present a
white-box scoring model: a mapped, weighted Levenshtein edit distance between
the reference and the recognised sequence, with error weights for articulatory
differences computed from a training set of scored utterances. The system can
produce a human-readable list of each detected mispronunciation's contribution
to the utterance score. We compare our scoring method to established black-box methods.
Comment: Submitted to Interspeech 201
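As a concrete illustration of the kind of scoring described above, the following sketch computes a weighted Levenshtein distance between a reference phone sequence and a recogniser output, with substitution costs drawn from a table of articulatory differences. The weight table and cost values are made-up placeholders, not the trained weights from the paper.

def weighted_edit_distance(ref, hyp, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Dynamic-programming edit distance with per-pair substitution weights."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else sub_cost(ref[i - 1], hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + del_cost,
                          d[i][j - 1] + ins_cost,
                          d[i - 1][j - 1] + sub)
    return d[n][m]

# Hypothetical articulatory weights: /t/ vs /d/ differ only in voicing,
# so that substitution is cheaper than /t/ vs /m/.
ARTIC_WEIGHTS = {("t", "d"): 0.3, ("t", "m"): 0.9}
def sub_cost(a, b):
    return ARTIC_WEIGHTS.get((a, b), ARTIC_WEIGHTS.get((b, a), 1.0))

print(weighted_edit_distance(["k", "ae", "t"], ["k", "ae", "d"], sub_cost))  # 0.3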
Large-Scale Visual Speech Recognition
This work presents a scalable solution to open-vocabulary visual speech
recognition. To achieve this, we constructed the largest existing visual speech
recognition dataset, consisting of pairs of text and video clips of faces
speaking (3,886 hours of video). In tandem, we designed and trained an
integrated lipreading system, consisting of a video processing pipeline that
maps raw video to stable videos of lips and sequences of phonemes, a scalable
deep neural network that maps the lip videos to sequences of phoneme
distributions, and a production-level speech decoder that outputs sequences of
words. The proposed system achieves a word error rate (WER) of 40.9% as
measured on a held-out set. In comparison, professional lipreaders achieve
either 86.4% or 92.9% WER on the same dataset when having access to additional
types of contextual information. Our approach significantly improves on other
lipreading approaches, including variants of LipNet and of Watch, Attend, and
Spell (WAS), which only achieve 89.8% and 76.8% WER, respectively.
The challenges of SVM optimization using Adaboost on a phoneme recognition problem
The use of digital technology is growing at a very fast pace, which has led to
the emergence of systems based on cognitive infocommunications. The expansion
of this sector requires combining methods in order to ensure robustness in
cognitive systems.
On The Inductive Bias of Words in Acoustics-to-Word Models
Acoustics-to-word models are end-to-end speech recognizers that use words as
targets without relying on pronunciation dictionaries or graphemes. These
models are notoriously difficult to train due to the lack of linguistic
knowledge. It is also unclear how the amount of training data impacts the
optimization and generalization of such models. In this work, we study the
optimization and generalization of acoustics-to-word models under different
amounts of training data. In addition, we study three types of inductive bias,
leveraging a pronunciation dictionary, word boundary annotations, and
constraints on word durations. We find that constraining word durations leads
to the most improvement. Finally, we analyze the word embedding space learned
by the model, and find that the space has a structure dominated by the
pronunciation of words. This suggests that the contexts of words, instead of
their phonetic structure, should be the future focus of inductive bias in
acoustics-to-word models.
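One way to probe such a word embedding space, sketched below under the assumption that the trained model exposes an embedding matrix over its word targets, is to list the cosine nearest neighbours of a word and check whether they are phonetically similar (e.g. "cat"/"cap") rather than contextually related. All names and vectors here are illustrative.

import numpy as np

def nearest_words(embeddings, vocab, query, k=5):
    """Return the k words whose embeddings are closest (cosine) to `query`'s."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = E[vocab.index(query)]
    sims = E @ q
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order[1:k + 1]]

# Toy usage with random vectors standing in for the learned embedding matrix.
rng = np.random.default_rng(0)
vocab = ["cat", "cap", "dog", "fog", "log"]
emb = rng.normal(size=(len(vocab), 16))
print(nearest_words(emb, vocab, "cat", k=3))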
Spatial Concept Acquisition for a Mobile Robot that Integrates Self-Localization and Unsupervised Word Discovery from Spoken Sentences
In this paper, we propose a novel unsupervised learning method for the
lexical acquisition of words related to places visited by robots, from human
continuous speech signals. We address the problem of learning novel words by a
robot that has no prior knowledge of these words except for a primitive
acoustic model. Further, we propose a method that allows a robot to effectively
use the learned words and their meanings for self-localization tasks. The
proposed method, nonparametric Bayesian spatial concept acquisition (SpCoA),
integrates a generative model for self-localization with unsupervised word
segmentation of uttered sentences via latent variables related to the spatial
concept. We implemented the proposed method, SpCoA, on
SIGVerse, which is a simulation environment, and TurtleBot2, which is a mobile
robot in a real environment. Further, we conducted experiments for evaluating
the performance of SpCoA. The experimental results showed that SpCoA enabled
the robot to acquire the names of places from spoken sentences. They also
revealed that the robot could effectively utilize the acquired spatial concepts
and reduce the uncertainty in self-localization.
Comment: This paper was accepted in the IEEE Transactions on Cognitive and Developmental Systems. (04-May-2016)
Unsupervised speech representation learning using WaveNet autoencoders
We consider the task of unsupervised extraction of meaningful latent
representations of speech by applying autoencoding neural networks to speech
waveforms. The goal is to learn a representation able to capture high level
semantic content from the signal, e.g.\ phoneme identities, while being
invariant to confounding low level details in the signal such as the underlying
pitch contour or background noise. Since the learned representation is tuned to
contain only phonetic content, we resort to using a high capacity WaveNet
decoder to infer information discarded by the encoder from previous samples.
Moreover, the behavior of autoencoder models depends on the kind of constraint
that is applied to the latent representation. We compare three variants: a
simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder
(VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of
learned representations in terms of speaker independence, the ability to
predict phonetic content, and the ability to accurately reconstruct individual
spectrogram frames. Moreover, for discrete encodings extracted using the
VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a
regularization scheme that forces the representations to focus on the phonetic
content of the utterance and report performance comparable with the top entries
in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
Comment: Accepted to IEEE TASLP, final version available at http://dx.doi.org/10.1109/TASLP.2019.293886
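Of the three bottlenecks compared, the vector-quantised one is the easiest to make concrete: each encoder output frame is replaced by its nearest codebook vector, and the resulting discrete code indices are what is later mapped to phonemes. The NumPy sketch below illustrates that quantisation step generically; the shapes and the straight-through gradient note are assumptions, not the paper's implementation.

import numpy as np

def vq_bottleneck(z_e, codebook):
    """Map each encoder frame to its nearest codebook entry (VQ-VAE bottleneck).

    z_e:      (T, D) encoder outputs for T frames.
    codebook: (K, D) learned embedding vectors.
    Returns the quantised frames and the chosen code indices.
    """
    # Squared Euclidean distance between every frame and every code.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    z_q = codebook[idx]
    # During training, gradients are typically passed straight through the
    # quantisation (z_q = z_e + stop_gradient(z_q - z_e)); omitted here.
    return z_q, idx

# Toy usage: 4 frames, 8 codes, 16-dim latents.
rng = np.random.default_rng(0)
z_q, codes = vq_bottleneck(rng.normal(size=(4, 16)), rng.normal(size=(8, 16)))
print(codes)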
Building DNN Acoustic Models for Large Vocabulary Speech Recognition
Deep neural networks (DNNs) are now a central component of nearly all
state-of-the-art speech recognition systems. Building neural network acoustic
models requires several design decisions including network architecture, size,
and training loss function. This paper offers an empirical investigation on
which aspects of DNN acoustic model design are most important for speech
recognition system performance. We report DNN classifier performance and final
speech recognizer word error rates, and compare DNNs using several metrics to
quantify factors influencing differences in task performance. Our first set of
experiments use the standard Switchboard benchmark corpus, which contains
approximately 300 hours of conversational telephone speech. We compare standard
DNNs to convolutional networks, and present the first experiments using
locally-connected, untied neural networks for acoustic modeling. We
additionally build systems on a corpus of 2,100 hours of training data by
combining the Switchboard and Fisher corpora. This larger corpus allows us to
more thoroughly examine performance of large DNN models -- with up to ten times
more parameters than those typically used in speech recognition systems. Our
results suggest that a relatively simple DNN architecture and optimization
technique produces strong results. These findings, along with previous work,
help establish a set of best practices for building DNN hybrid speech
recognition systems with maximum likelihood training. Our experiments in DNN
optimization additionally serve as a case study for training DNNs with
discriminative loss functions for speech tasks, as well as DNN classifiers more
generally.
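For readers who want a concrete starting point, the sketch below shows a minimal hybrid-style DNN acoustic model in PyTorch: stacked acoustic frames in, tied-state (senone) posteriors out, trained with cross-entropy. The context window, layer sizes, and number of senones are placeholders, not the configurations studied in the paper.

import torch
import torch.nn as nn

CONTEXT, FEAT_DIM, N_SENONES = 11, 40, 3000  # assumed, illustrative values

model = nn.Sequential(
    nn.Linear(CONTEXT * FEAT_DIM, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, N_SENONES),              # logits over tied HMM states
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One toy training step on random data standing in for aligned frames.
x = torch.randn(32, CONTEXT * FEAT_DIM)   # 32 stacked-frame examples
y = torch.randint(0, N_SENONES, (32,))    # frame-level senone labels
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
print(float(loss))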
Speech Recognition Front End Without Information Loss
Speech representation and modelling in high-dimensional spaces of acoustic
waveforms, or a linear transformation thereof, is investigated with the aim of
improving the robustness of automatic speech recognition to additive noise. The
motivation behind this approach is twofold: (i) the information in acoustic
waveforms that is usually removed in the process of extracting low-dimensional
features might aid robust recognition by virtue of structured redundancy
analogous to channel coding, (ii) linear feature domains allow for exact noise
adaptation, as opposed to representations that involve non-linear processing
which makes noise adaptation challenging. Thus, we develop a generative
framework for phoneme modelling in high-dimensional linear feature domains, and
use it in phoneme classification and recognition tasks. Results show that
classification and recognition in this framework perform better than analogous
PLP and MFCC classifiers below 18 dB SNR. A combination of the high-dimensional
and MFCC features at the likelihood level performs uniformly better than either
of the individual representations across all noise levels.
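The "exact noise adaptation" property follows from linearity: for independent additive noise, the mean and covariance of the noisy linear features are simply the clean and noise statistics summed and pushed through the feature map. The sketch below shows that closed-form adaptation for a single Gaussian phoneme model; the dimensions and the feature map are invented for the example.

import numpy as np

def adapt_linear_gaussian(mu_x, cov_x, mu_n, cov_n, F):
    """Exact adaptation of a Gaussian waveform-domain model to additive noise.

    If the clean segment x ~ N(mu_x, cov_x), the independent noise
    n ~ N(mu_n, cov_n), and the features are a linear map F, then the noisy
    features F(x + n) are exactly Gaussian with the moments returned here.
    """
    mu_y = F @ (mu_x + mu_n)
    cov_y = F @ (cov_x + cov_n) @ F.T
    return mu_y, cov_y

# Toy usage: 8-sample waveform segments, a random 4x8 linear feature map.
rng = np.random.default_rng(0)
D, d = 8, 4
F = rng.normal(size=(d, D))
mu_x, cov_x = rng.normal(size=D), np.eye(D)
mu_n, cov_n = np.zeros(D), 0.1 * np.eye(D)
print(adapt_linear_gaussian(mu_x, cov_x, mu_n, cov_n, F)[0])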
Speech Recognition by Machine, A Review
This paper presents a brief survey on Automatic Speech Recognition and
discusses the major themes and advances made in the past 60 years of research,
so as to provide a technological perspective and an appreciation of the
fundamental progress that has been accomplished in this important area of
speech communication. After years of research and development, the accuracy of
automatic speech recognition remains one of the important research challenges
(e.g., variations of context, speakers, and environment). The design of a
speech recognition system requires careful attention to the following issues:
definition of the various types of speech classes, speech representation,
feature extraction techniques, speech classifiers, databases, and performance
evaluation. The problems that exist in ASR and the various techniques developed
by researchers to solve them are presented in chronological order. The authors
therefore hope that this work will be a contribution to the area of speech
recognition. The objective of this review paper is to summarize and compare
some of the well-known methods used in the various stages of a speech
recognition system and to identify research topics and applications that are at
the forefront of this exciting and challenging field.
Comment: 25 pages, IEEE format, International Journal of Computer Science and Information Security, IJCSIS December 2009, ISSN 1947-5500, http://sites.google.com/site/ijcsis