Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
There are a number of studies on extracting bottleneck (BN) features
from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases
and triphone states in order to improve the performance of text-dependent speaker
verification (TD-SV). However, only moderate success has been achieved. A recent
study [1] presented a time contrastive learning (TCL) concept to explore the
non-stationarity of brain signals for classification of brain states. Speech
signals have similar non-stationarity property, and TCL further has the
advantage of having no need for labeled data. We therefore present a TCL based
BN feature extraction method. The method uniformly partitions each speech
utterance in a training dataset into a predefined number of multi-frame
segments. Each segment in an utterance corresponds to one class, and class
labels are shared across utterances. DNNs are then trained to discriminate all
speech frames among the classes to exploit the temporal structure of speech. In
addition, we propose a segment-based unsupervised clustering algorithm to
re-assign class labels to the segments. TD-SV experiments were conducted on the
RedDots challenge database. The TCL-DNNs were trained using speech data of
fixed pass-phrases that were excluded from the TD-SV evaluation set, so the
learned features can be considered phrase-independent. We compare the
performance of the proposed TCL bottleneck (BN) feature with those of
short-time cepstral features and BN features extracted from DNNs discriminating
speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels
and boundaries are generated by three different automatic speech recognition
(ASR) systems. Experimental results show that the proposed TCL-BN outperforms
cepstral features and speaker+pass-phrase discriminant BN features, and its
performance is on par with that of ASR-derived BN features.
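The uniform partitioning and label-sharing scheme described above can be sketched in a few lines. This is a minimal illustration only; the segment count and the frame counts below are placeholder values, not figures from the paper:

```python
import numpy as np

def tcl_labels(n_frames, n_segments=10):
    """Uniformly partition an utterance's frames into n_segments
    contiguous classes; label k is shared across all utterances."""
    edges = np.linspace(0, n_frames, n_segments + 1).astype(int)
    labels = np.empty(n_frames, dtype=int)
    for k in range(n_segments):
        labels[edges[k]:edges[k + 1]] = k
    return labels

# Two utterances of different lengths map onto the same 10 classes,
# so a DNN trained on these labels must exploit temporal structure
# rather than utterance identity.
print(tcl_labels(23))
print(tcl_labels(57)[:6])
```

Because the labels require no transcription or speaker metadata, this step is fully unsupervised, which is the advantage the abstract highlights over ASR-derived BN targets.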
Whole Word Phonetic Displays for Speech Articulation Training
The main objective of this dissertation is to investigate and develop speech recognition technologies for speech training for people with hearing impairments. During the course of this work, a computer-aided system for articulation speech training was also designed and implemented. The speech training system places emphasis on displays to improve children's pronunciation of isolated Consonant-Vowel-Consonant (CVC) words, with displays at both the phonetic level and the whole-word level. This dissertation presents two hybrid methods for combining Hidden Markov Models (HMMs) and Neural Networks (NNs) for speech recognition. The first method uses NN outputs as posterior probability estimators for HMMs. The second method uses NNs to transform the original speech features into normalized features with reduced correlation. Based on experimental testing, both of the hybrid methods give higher accuracy than standard HMM methods. The second method, using the NN to create normalized features, outperforms the first method in terms of accuracy. Several graphical displays were developed to provide real-time visual feedback to users, to help them to improve and correct their pronunciations.
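The first hybrid method, using NN outputs as posterior estimators for HMM emissions, conventionally converts posteriors into scaled likelihoods by dividing by the state priors. A minimal sketch of that conversion follows; the division by priors is the standard hybrid NN/HMM recipe, and the numbers are hypothetical:

```python
import numpy as np

def scaled_likelihoods(posteriors, priors):
    """Convert NN state posteriors P(s|x) into HMM emission scores.
    By Bayes' rule, p(x|s) = P(s|x) p(x) / P(s); p(x) is constant per
    frame, so P(s|x)/P(s) can stand in for the emission density
    during Viterbi decoding."""
    return posteriors / priors

# Hypothetical 3-state example: softmax outputs for one frame and
# state priors estimated from a training alignment.
post = np.array([0.7, 0.2, 0.1])
priors = np.array([0.5, 0.3, 0.2])
print(scaled_likelihoods(post, priors))  # frequent states are down-weighted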
Joint learning of phonetic units and word pronunciations for ASR
Abstract The creation of a pronunciation lexicon remains the most inefficient process in developing an Automatic Speech Recognizer (ASR). In this paper, we propose an unsupervised alternative -requiring no language-specific knowledge -to the conventional manual approach for creating pronunciation dictionaries. We present a hierarchical Bayesian model, which jointly discovers the phonetic inventory and the Letter-to-Sound (L2S) mapping rules in a language using only transcribed data. When tested on a corpus of spontaneous queries, the results demonstrate the superiority of the proposed joint learning scheme over its sequential counterpart, in which the latent phonetic inventory and L2S mappings are learned separately. Furthermore, the recognizers built with the automatically induced lexicon consistently outperform grapheme-based recognizers and even approach the performance of recognition systems trained using conventional supervised procedures
Modelling Speech Dynamics with Trajectory-HMMs
Institute for Communicating and Collaborative SystemsThe conditional independence assumption imposed by the hidden Markov models
(HMMs) makes it difficult to model temporal correlation patterns in human speech.
Traditionally, this limitation is circumvented by appending the first and second-order
regression coefficients to the observation feature vectors. Although this leads to improved
performance in recognition tasks, we argue that a straightforward use of dynamic
features in HMMs will result in an inferior model, due to the incorrect handling
of dynamic constraints. In this thesis I will show that an HMM can be transformed
into a Trajectory-HMM capable of generating smoothed output mean trajectories, by
performing a per-utterance normalisation. The resulting model can be trained by either
maximisingmodel log-likelihood or minimisingmean generation errors on the training
data. To combat the exponential growth of paths in searching, the idea of delayed path
merging is proposed and a new time-synchronous decoding algorithm built on the concept
of token-passing is designed for use in the recognition task. The Trajectory-HMM
brings a new way of sharing knowledge between speech recognition and synthesis
components, by tackling both problems in a coherent statistical framework. I evaluated
the Trajectory-HMM on two different speech tasks using the speaker-dependent
MOCHA-TIMIT database. First as a generative model to recover articulatory features
from speech signal, where the Trajectory-HMM was used in a complementary way
to the conventional HMM modelling techniques, within a joint Acoustic-Articulatory
framework. Experiments indicate that the jointly trained acoustic-articulatory models
are more accurate (having a lower Root Mean Square error) than the separately trained
ones, and that Trajectory-HMM training results in greater accuracy compared with
conventional Baum-Welch parameter updating. In addition, the Root Mean Square
(RMS) training objective proves to be consistently better than the Maximum Likelihood
objective. However, experiment of the phone recognition task shows that the
MLE trained Trajectory-HMM, while retaining attractive properties of being a proper
generative model, tends to favour over-smoothed trajectories among competing hypothesises,
and does not perform better than a conventional HMM. We use this to
build an argument that models giving a better fit on training data may suffer a reduction
of discrimination by being too faithful to the training data. Finally, experiments
on using triphone models show that increasing modelling detail is an effective way to
leverage modelling performance with little added complexity in training
Recommended from our members
Speech Enabled Avatar from a Single Photograph
This paper presents a complete framework for creating speech-enabled 2D and 3D avatars from a single image of a person. Our approach uses a generic facial motion model which represents deformations of the prototype face during speech. We have developed an HMM-based facial animation algorithm which takes into account both lexical stress and coarticulation. This algorithm produces realistic animations of the prototype facial surface from either text or speech. The generic facial motion model is transformed to a novel face geometry using a set of corresponding points between the generic mesh and the novel face. In the case of a 2D avatar, a single photograph of the person is used as input. We manually select a small number of features on the photograph and these are used to deform the prototype surface. The deformed surface is then used to animate the photograph. In the case of a 3D avatar, we use a single stereo image of the person as input. The sparse geometry of the face is computed from this image and used to warp the prototype surface to obtain the complete 3D surface of the person's face. This surface is etched into a glass cube using sub-surface laser engraving (SSLE) technology. Synthesized facial animation videos are then projected onto the etched glass cube. Even though the etched surface is static, the projection of facial animation onto it results in a compelling experience for the viewer. We show several examples of 2D and 3D avatars that are driven by text and speech inputs
Hidden Markov model-based speech enhancement
This work proposes a method of model-based speech enhancement that uses a network of
HMMs to first decode noisy speech and to then synthesise a set of features that enables
a speech production model to reconstruct clean speech. The motivation is to remove the
distortion and residual and musical noises that are associated with conventional filteringbased
methods of speech enhancement.
STRAIGHT forms the speech production model for speech reconstruction and requires
a time-frequency spectral surface, aperiodicity and a fundamental frequency contour.
The technique of HMM-based synthesis is used to create the estimate of the timefrequency
surface, and aperiodicity after the model and state sequence is obtained from
HMM decoding of the input noisy speech. Fundamental frequency were found to be best
estimated using the PEFAC method rather than synthesis from the HMMs.
For the robust HMM decoding in noisy conditions it is necessary for the HMMs
to model noisy speech and consequently noise adaptation is investigated to achieve this
and its resulting effect on the reconstructed speech measured. Even with such noise
adaptation to match the HMMs to the noisy conditions, decoding errors arise, both
in terms of incorrect decoding and time alignment errors. Confidence measures are
developed to identify such errors and then compensation methods developed to conceal
these errors in the enhanced speech signal.
Speech quality and intelligibility analysis is first applied in terms of PESQ and NCM
showing the superiority of the proposed method against conventional methods at low
SNRs. Three way subjective MOS listening test then discovers the performance of the
proposed method overwhelmingly surpass the conventional methods over all noise conditions
and then a subjective word recognition test shows an advantage of the proposed
method over speech intelligibility to the conventional methods at low SNRs
Confusion modelling for lip-reading
Lip-reading is mostly used as a means of communication by people with hearing di�fficulties. Recent work has explored the automation of this process, with the aim
of building a speech recognition system entirely driven by lip movements. However, this work has so far produced poor results because of factors such as high variability
of speaker features, diffi�culties in mapping from visual features to speech sounds, and high co-articulation of visual features.
The motivation for the work in this thesis is inspired by previous work in dysarthric speech recognition [Morales, 2009]. Dysathric speakers have poor control over their
articulators, often leading to a reduced phonemic repertoire. The premise of this thesis is that recognition of the visual speech signal is a similar problem to recog-
nition of dysarthric speech, in that some information about the speech signal has been lost in both cases, and this brings about a systematic pattern of errors in the
decoded output.
This work attempts to exploit the systematic nature of these errors by modelling them in the framework of a weighted finite-state transducer cascade. Results
indicate that the technique can achieve slightly lower error rates than the conventional approach. In addition, it explores some interesting more general questions for
automated lip-reading
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Most speech and language technologies are trained with massive amounts of
speech and text information. However, most of the world languages do not have
such resources or stable orthography. Systems constructed under these almost
zero resource conditions are not only promising for speech technology but also
for computational language documentation. The goal of computational language
documentation is to help field linguists to (semi-)automatically analyze and
annotate audio recordings of endangered and unwritten languages. Example tasks
are automatic phoneme discovery or lexicon discovery from the speech signal.
This paper presents a speech corpus collected during a realistic language
documentation process. It is made up of 5k speech utterances in Mboshi (Bantu
C25) aligned to French text translations. Speech transcriptions are also made
available: they correspond to a non-standard graphemic form close to the
language phonology. We present how the data was collected, cleaned and
processed and we illustrate its use through a zero-resource task: spoken term
discovery. The dataset is made available to the community for reproducible
computational language documentation experiments and their evaluation.Comment: accepted to LREC 201
- …