Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for one setting or task to other settings or tasks. For example, in speech recognition, an acoustic model trained for one language can be used to recognize speech in another language with little or no retraining data. Transfer learning is closely related to multi-task learning (cross-lingual vs. multilingual), and has traditionally been studied under the name of 'model adaptation'. Recent advances in deep learning show that transfer learning becomes much easier and more effective with the high-level abstract features learned by deep models, and that the 'transfer' can be conducted not only between data distributions and data types, but also between model structures (e.g., shallow nets and deep nets) or even model types (e.g., Bayesian models and neural models). This review paper summarizes some recent prominent research in this direction, particularly for speech and language processing. We also report some results from our group and highlight the potential of this very interesting research field.
Comment: 13 pages, APSIPA 201
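As a rough illustration of the cross-lingual transfer described above (not code from the paper), the following sketch reuses the lower layers of a source-language acoustic model and fine-tunes only a new output layer on target-language data. All module names, layer sizes, and the checkpoint path are assumptions for illustration.

```python
# Minimal sketch of cross-lingual acoustic-model transfer (illustrative only):
# keep the shared feature-learning layers, swap and retrain the output layer.
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_feats=40, n_hidden=512, n_targets=100):
        super().__init__()
        # shared layers: learn high-level, largely language-independent features
        self.shared = nn.Sequential(
            nn.Linear(n_feats, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        # output layer: language-specific phone/senone posteriors
        self.output = nn.Linear(n_hidden, n_targets)

    def forward(self, x):
        return self.output(self.shared(x))

# Source-language model (pretrained weights assumed to exist on disk).
model = AcousticModel(n_targets=100)
# model.load_state_dict(torch.load("source_language_am.pt"))  # hypothetical path

# Transfer: new output layer for the target language's phone set,
# shared layers frozen so their abstract features are reused as-is.
model.output = nn.Linear(512, 60)
for p in model.shared.parameters():
    p.requires_grad = False

# Fine-tune only the trainable (output) parameters on the small target-language set.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
```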
Optical music recognition of the singer using formant frequency estimation of vocal fold vibration and lip motion with interpolated GMM classifiers
The main aim of this paper is to identify the musical genre of a singer through optical detection of lip motion. Optical music recognition has recently attracted much attention; in this study it refers to an automatic information-engineering technique for determining the musical style of a singer. This paper proposes a method for optical music recognition in which acoustic formant analysis of both vocal fold vibration and lip motion is combined with interpolated Gaussian mixture model (GMM) estimation to classify the musical genre of the singer. The developed approach for this classification task is called GMM-Formant. Since humming and voiced speech cause periodic vibration of the vocal folds and corresponding motion of the lips, the proposed GMM-Formant first acquires the required formant information, an important acoustic feature for recognition and classification. GMM-Formant then uses linear interpolation to combine GMM likelihood estimates and formant evaluation results appropriately, adjusting the estimated formant evaluation outcomes according to the likelihood scores derived from the GMM calculations. The superiority and effectiveness of the presented GMM-Formant are demonstrated by a series of experiments on musical genre classification of singers.
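A minimal sketch of the score fusion the abstract describes (not the authors' implementation): a linear interpolation between a per-genre GMM likelihood and a formant-based score. The weight alpha, feature dimensionality, genre names, and formant scores below are assumptions for illustration.

```python
# Illustrative GMM-Formant-style fusion: per-genre GMMs plus a formant score,
# combined by linear interpolation; the genre with the highest fused score wins.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_formant_score(features, formant_score, gmm, alpha=0.7):
    """Interpolate the mean GMM log-likelihood with a formant-based score."""
    gmm_score = gmm.score(features)          # mean per-frame log-likelihood
    return alpha * gmm_score + (1.0 - alpha) * formant_score

# One GMM per genre, trained on that genre's acoustic features
# (random placeholder data here, purely for illustration).
genres = ["pop", "rock", "opera"]
gmms = {g: GaussianMixture(n_components=8).fit(np.random.randn(500, 13))
        for g in genres}

# Classify a test singer: pick the genre with the highest interpolated score.
test_feats = np.random.randn(200, 13)
formant_scores = {"pop": -1.2, "rock": -0.8, "opera": -2.0}  # assumed values
best = max(genres, key=lambda g: gmm_formant_score(
    test_feats, formant_scores[g], gmms[g]))
print(best)
```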
Articulatory-WaveNet: Deep Autoregressive Model for Acoustic-to-Articulatory Inversion
Acoustic-to-Articulatory Inversion, the estimation of articulatory kinematics from speech, is an important problem which has received significant attention in recent years. Articulatory movements estimated by such models can be used in many applications, including speech synthesis, automatic speech recognition, and facial kinematics for talking-head animation. Knowledge of articulator positions can also be extremely useful in speech therapy systems and in Computer-Aided Language Learning (CALL) and Computer-Aided Pronunciation Training (CAPT) systems for second-language learners. Acoustic-to-Articulatory Inversion is a challenging problem due to the complexity of articulation patterns and significant inter-speaker differences, and it becomes even more challenging when applied to non-native speakers without any kinematic training data. This dissertation addresses these problems through the development of improved architectures for articulatory inversion. The proposed Articulatory-WaveNet architecture is based on a stack of dilated causal convolutional layers that improves Acoustic-to-Articulatory Inversion results in both speaker-dependent and speaker-independent scenarios. The system has been evaluated on the ElectroMagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE), consisting of 39 speakers including both native English speakers and Mandarin-accented English speakers. Results show that Articulatory-WaveNet significantly improves the performance of speaker-dependent and speaker-independent Acoustic-to-Articulatory Inversion systems compared to previously reported results.
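The following sketch (not the thesis code) illustrates the core building block named in the abstract: a stack of dilated causal 1-D convolutions mapping acoustic frames to articulatory trajectories. Channel sizes, the dilation schedule, and feature dimensions are assumptions for illustration.

```python
# Illustrative dilated causal convolution stack for acoustic-to-articulatory
# inversion: left-padded convolutions keep the mapping causal in time.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation     # left-pad so output is causal
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                           # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))     # pad only on the left
        return self.conv(x)

class ArticulatoryInversionNet(nn.Module):
    def __init__(self, n_acoustic=40, n_articulatory=12, hidden=64):
        super().__init__()
        dilations = [1, 2, 4, 8, 16]                # exponentially growing receptive field
        layers, ch = [], n_acoustic
        for d in dilations:
            layers += [CausalConv1d(ch, hidden, dilation=d), nn.ReLU()]
            ch = hidden
        self.stack = nn.Sequential(*layers)
        self.out = nn.Conv1d(hidden, n_articulatory, kernel_size=1)

    def forward(self, acoustics):
        return self.out(self.stack(acoustics))

# Example: 100 frames of 40-dim acoustic features -> 12 articulatory channels.
y = ArticulatoryInversionNet()(torch.randn(1, 40, 100))    # shape (1, 12, 100)
```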
Background-tracking acoustic features for genre identification of broadcast shows
This paper presents a novel method for extracting acoustic features that characterise the background environment in audio recordings. These features are based on the output of an alignment that fits multiple parallel background-based Constrained Maximum Likelihood Linear Regression transformations asynchronously to the input audio signal. With this setup, the resulting features can track changes in the audio background, such as the appearance and disappearance of music, applause or laughter, independently of the speakers in the foreground of the audio. The ability to provide this type of acoustic description of audiovisual data has many potential applications, including automatic classification of broadcast archives and improved automatic transcription and subtitling. In this paper, the performance of these features is explored in a genre identification task on a set of 332 BBC shows. Using Gaussian Mixture Model classifiers, the proposed background-tracking features outperform short-term Perceptual Linear Prediction features in this task (72% vs. 62% accuracy). The use of more complex classifiers, Hidden Markov Models and Support Vector Machines, increases the performance of the system with the novel background-tracking features to 79% and 81% accuracy respectively.
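A minimal sketch of the classification back-end mentioned in the abstract (not the paper's system): one Gaussian Mixture Model per genre trained on feature vectors, with a show assigned to the genre whose GMM gives the highest likelihood. The CMLLR-based background-tracking feature extraction itself is out of scope here and stood in for by placeholder arrays; genre names and dimensions are assumptions.

```python
# Illustrative GMM genre classifier over pre-computed feature vectors.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_genre_gmms(features_by_genre, n_components=16):
    """Fit one GMM per genre on that genre's frame-level feature vectors."""
    return {genre: GaussianMixture(n_components=n_components,
                                   covariance_type="diag").fit(feats)
            for genre, feats in features_by_genre.items()}

def classify_show(show_features, gmms):
    """Return the genre whose GMM assigns the highest average log-likelihood."""
    return max(gmms, key=lambda g: gmms[g].score(show_features))

# Placeholder data standing in for background-tracking feature vectors.
train = {"news": np.random.randn(1000, 20), "comedy": np.random.randn(1000, 20)}
gmms = train_genre_gmms(train)
print(classify_show(np.random.randn(300, 20), gmms))
```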
Acoustic model selection for recognition of regional accented speech
Accent is frequently cited as an issue for speech recognition systems. Our experiments showed that the ASR word error rate can be up to seven times greater for accented speech than for standard British English. The main objective of this research is to develop Automatic Speech Recognition (ASR) techniques that are robust to accent variation. We applied different acoustic modelling techniques to compensate for the effects of regional accents on ASR performance. For conventional GMM-HMM based ASR systems, we showed that using a small amount of data from a test speaker either to choose an accent-dependent model via an accent identification (AID) system, or to build a model from the data of the N neighbouring speakers in AID space, results in better performance than unsupervised or supervised speaker adaptation. In addition, we showed that using a DNN-HMM rather than a GMM-HMM based acoustic model improves recognition accuracy considerably. Even when two stages of accent adaptation followed by speaker adaptation are applied to the GMM-HMM baseline system, it does not outperform the baseline DNN-HMM based system. For the more contemporary DNN-HMM based ASR systems, we investigated how adding different types of accented data to the training set can provide better recognition accuracy on accented speech. Finally, we proposed a new approach for visualisation of the AID feature space, which is helpful in analysing AID recognition accuracies and AID confusion matrices.
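As a rough sketch of the model-selection idea in the abstract (not the thesis implementation): place a test speaker in the AID feature space using a small amount of their speech, then pool the N nearest training speakers to build or adapt an acoustic model. The vector dimensionality and speaker data below are placeholder assumptions.

```python
# Illustrative nearest-neighbour speaker selection in an AID feature space.
import numpy as np

def n_nearest_speakers(test_vec, speaker_vecs, n=10):
    """Return the IDs of the N training speakers closest to the test speaker."""
    dists = {spk: np.linalg.norm(test_vec - vec)
             for spk, vec in speaker_vecs.items()}
    return sorted(dists, key=dists.get)[:n]

# AID-space vectors for training speakers (placeholder random values).
rng = np.random.default_rng(0)
speaker_vecs = {f"spk{i:03d}": rng.normal(size=32) for i in range(200)}

# A small amount of test-speaker data mapped into the same space.
test_vec = rng.normal(size=32)

# Speakers whose data would be pooled to build the accent-matched model.
neighbours = n_nearest_speakers(test_vec, speaker_vecs, n=10)
print(neighbours)
```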
Automatic Speech Recognition for ageing voices
With ageing, human voices undergo several changes which are typically characterised
by increased hoarseness, breathiness, changes in articulatory patterns and slower speaking
rate. The focus of this thesis is to understand the impact of ageing on Automatic
Speech Recognition (ASR) performance and improve the ASR accuracies for older
voices.
Baseline results on three corpora indicate that the word error rates (WERs) for older adults are significantly higher than those of younger adults, and that the degradation in accuracy is greater for male speakers than for female speakers.
Acoustic parameters such as jitter and shimmer, which measure glottal source disfluencies, were found to be significantly higher for older adults. However, the hypothesis that these changes explain the difference in WER between the two age groups proves incorrect: experiments in which glottal source disfluencies are artificially introduced into speech from younger adults show no significant impact on WERs. Changes in fundamental frequency, observed quite often in older voices, have only a marginal impact on ASR accuracies.
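For reference, the two glottal-source measures referred to above can be computed from per-cycle pitch periods and peak amplitudes. The sketch below (not the thesis code) uses the standard local-jitter and local-shimmer definitions; the per-cycle measurements are assumed to come from an upstream pitch tracker and the example values are invented.

```python
# Local jitter and shimmer from per-cycle period and amplitude estimates.
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods / mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Mean absolute difference of consecutive peak amplitudes / mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Made-up glottal cycle measurements (periods in seconds, linear amplitudes).
periods = [0.0100, 0.0103, 0.0098, 0.0101, 0.0105]
amps = [0.82, 0.79, 0.85, 0.80, 0.78]
print(f"jitter  = {local_jitter(periods):.3%}")
print(f"shimmer = {local_shimmer(amps):.3%}")
```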
Analysis of phoneme errors between younger and older speakers shows that certain phonemes, especially lower vowels, are more strongly affected by ageing, although these changes vary across speakers. Another factor strongly associated with ageing voices is a decrease in speaking rate. Experiments analysing the impact of slower speaking rate on ASR accuracies indicate that insertion errors increase when decoding slower speech with models trained on relatively faster speech.
We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms, and observe that speakers (especially males) can be separated by age with reasonable accuracy. Inspired by this, we investigate supervised hierarchical acoustic models based on gender and age, which achieve significant improvements in word accuracy over the baseline results. The idea is then extended to unsupervised hierarchical models, which also outperform the baseline models by a good margin.
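A minimal sketch of the idea in the previous paragraph (not the thesis system): represent each speaker by a vectorised adaptation transform (for example, a flattened fMLLR/CMLLR matrix) and check how well a simple classifier separates younger from older speakers in that space. The transform vectors and labels below are random placeholders.

```python
# Illustrative age-group separation in speaker-adaptation-transform space.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_speakers, dim = 120, 40 * 41          # e.g. a flattened 40x41 fMLLR transform
transforms = rng.normal(size=(n_speakers, dim))
age_group = rng.integers(0, 2, size=n_speakers)   # 0 = younger, 1 = older

# How separable are the age groups in transform space?
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, transforms, age_group, cv=5)
print(f"age-group classification accuracy: {scores.mean():.2f}")
```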
Finally, we hypothesize that ASR accuracies can be improved by augmenting the adaptation data with speech from the acoustically closest speakers, and propose a strategy for selecting these augmentation speakers. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation data is limited to a few seconds. The efficacy of this speaker selection strategy is analysed for both younger and older adults.
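The final paragraph's augmentation strategy might look roughly like the following sketch (not the thesis implementation): when a test speaker contributes only a few seconds of adaptation data, pad it with data from the acoustically closest training speakers, here ranked by cosine similarity between speaker-level vectors. The vectors and utterance lists are illustrative placeholders.

```python
# Illustrative selection of augmentation speakers for sparse adaptation data.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def augmentation_pool(test_vec, speaker_vecs, utterances, k=5):
    """Concatenate utterance lists of the k most similar training speakers."""
    ranked = sorted(speaker_vecs,
                    key=lambda s: cosine(test_vec, speaker_vecs[s]),
                    reverse=True)
    pool = []
    for spk in ranked[:k]:
        pool.extend(utterances[spk])
    return pool

rng = np.random.default_rng(2)
speaker_vecs = {f"spk{i}": rng.normal(size=16) for i in range(50)}
utterances = {s: [f"{s}_utt{j}" for j in range(3)] for s in speaker_vecs}
extra = augmentation_pool(rng.normal(size=16), speaker_vecs, utterances, k=5)
print(extra)
```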