96 research outputs found
Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations
This paper proposes a framework for performing adaptation to complex and non-stationary background conditions in Automatic Speech Recognition (ASR) by means of asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transforms and asynchronous Noise Adaptive Training (aNAT). The proposed method aims to apply the feature transform that best compensates the background for every input frame. The implementation is done with a new Hidden Markov Model (HMM) topology that expands the usual left-to-right HMM into parallel branches adapted to different background conditions and permits transitions among them. Using this, the proposed adaptation does not require ground truth or previous knowledge about the background in each frame as it aims to maximise the overall log-likelihood of the decoded utterance. The proposed aCMLLR transforms can be further improved by retraining models in an aNAT fashion and by using speaker-based MLLR transforms in cascade for an efficient modelling of background effects and speaker. An initial evaluation in a modified version of the WSJCAM0 corpus incorporating 7 different background conditions provides a benchmark in which to evaluate the use of aCMLLR transforms. A relative reduction of 40.5% in Word Error Rate (WER) was achieved by the combined use of aCMLLR and MLLR in cascade. Finally, this selection of techniques was applied in the transcription of multi-genre media broadcasts, where the use of aNAT training, aCMLLR transforms and MLLR transforms provided a relative improvement of 2–3%
Recommended from our members
Speaker and Expression Factorization for Audiobook Data: Expressiveness and Transplantation
Expressive synthesis from text is a challenging
problem. There are two issues. First, read text is often highly
expressive to convey the emotion and scenario in the text. Second,
since the expressive training speech is not always available for
different speakers, it is necessary to develop methods to share the
expressive information over speakers. This paper investigates the
approach of using very expressive, highly diverse audiobook data
from multiple speakers to build an expressive speech synthesis
system. Both of two problems are addressed by considering a
factorized framework where speaker and emotion are modelled
in separate sub-spaces of a cluster adaptive training (CAT)
parametric speech synthesis system. The sub-spaces for the
expressive state of a speaker and the characteristics of the speaker
are jointly trained using a set of audiobooks. In this work, the
expressive speech synthesis system works in two distinct modes.
In the first mode, the expressive information is given by audio
data and the adaptation method is used to extract the expressive
information in the audio data. In the second mode, the input of
the synthesis system is plain text and a full expressive synthesis
system is examined where the expressive state is predicted from
the text. In both modes, the expressive information is shared
and transplanted over different speakers. Experimental results
show that in both modes, the expressive speech synthesis method
proposed in this work significantly improves the expressiveness
of the synthetic speech for different speakers. Finally, this paper
also examines whether it is possible to predict the expressive
states from text for multiple speakers using a single model, or
whether the prediction process needs to be speaker specific.This is the accepted manuscript. The final version is available from IEEE at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6995936&filter%3DAND%28p_IS_Number%3A7055953%29
Robust learning of acoustic representations from diverse speech data
Automatic speech recognition is increasingly applied to new domains. A key challenge is
to robustly learn, update and maintain representations to cope with transient acoustic
conditions. A typical example is broadcast media, for which speakers and environments
may change rapidly, and available supervision may be poor. The concern of this
thesis is to build and investigate methods for acoustic modelling that are robust to the
characteristics and transient conditions as embodied by such media.
The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio
with approximate labels, but training methods can be sensitive to label errors, and their
use is therefore not trivial. State-of-the-art semi-supervised training makes effective
use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid
overfitting to poor supervision, but does not make use of the transcriptions. Existing
approaches that do aim to make use of the transcriptions typically employ an algorithm
to filter or combine the transcriptions with the recognition output from a seed model,
but the final result does not encode uncertainty. We propose a method to combine the
lattice output from a biased recognition pass with the transcripts, crucially preserving
uncertainty in the lattice where appropriate. This substantially reduces the word error
rate on a broadcast task.
The second contribution is a method to factorise representations for speakers and
environments so that they may be combined in novel combinations. In realistic scenarios,
the speaker or environment transform at test time might be unknown, or there may be
insufficient data to learn a joint transform. We show that in such cases, factorised, or
independent, representations are required to avoid deteriorating performance. Using
i-vectors, we factorise speaker or environment information using multi-condition training
with neural networks. Specifically, we extract bottleneck features from networks trained
to classify either speakers or environments. The resulting factorised representations
prove beneficial when one factor is missing at test time, or when all factors are seen,
but not in the desired combination.
The third contribution is an investigation of model adaptation in a longitudinal
setting. In this scenario, we repeatedly adapt a model to new data, with the constraint
that previous data becomes unavailable. We first demonstrate the effect of such a
constraint, and show that using a cyclical learning rate may help. We then observe
that these successive models lend themselves well to ensembling. Finally, we show
that the impact of this constraint in an active learning setting may be detrimental to
performance, and suggest to combine active learning with semi-supervised training to
avoid biasing the model.
The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature
extractor, known as SincNet. In contrast to traditional techniques that warp the
filterbank frequencies in standard feature extraction, adapting SincNet parameters is
more flexible and more readily optimised, whilst maintaining interpretability. On a task
adapting from adult to child speech, we show that this layer is well suited for adaptation
and is very effective with respect to the small number of adapted parameters
Online source separation in reverberant environments exploiting known speaker locations
This thesis concerns blind source separation techniques using second order statistics and higher order statistics for reverberant environments. A focus of the thesis is algorithmic simplicity with a view to the algorithms being implemented in their online forms. The main challenge of blind source separation applications is to handle reverberant acoustic environments; a further complication is changes in the acoustic environment such as when human speakers physically move.
A novel time-domain method which utilises a pair of finite impulse response filters is proposed. The method of principle angles is defined which exploits a singular value decomposition for their design. The pair of filters are implemented within a generalised sidelobe canceller structure, thus the method can be considered as a beamforming method which cancels one source. An adaptive filtering stage is then employed to recover the remaining source, by exploiting the output of the beamforming stage as a noise reference.
A common approach to blind source separation is to use methods that use higher order statistics such as independent component analysis. When dealing with realistic convolutive audio and speech mixtures, processing in the frequency domain at each frequency bin is required. As a result this introduces the permutation problem, inherent in independent component analysis, across the frequency bins. Independent vector analysis directly addresses this issue by modeling the dependencies between frequency bins, namely making use of a source vector prior. An alternative source prior for real-time (online) natural gradient independent vector analysis is proposed. A Student's t probability density function is known to be more suited for speech sources, due to its heavier tails, and is incorporated into a real-time version of natural gradient independent vector analysis. The final algorithm is realised as a real-time embedded application on a floating point Texas Instruments digital signal processor platform.
Moving sources, along with reverberant environments, cause significant problems in realistic source separation systems as mixing filters become time variant. A method which employs the pair of cancellation filters, is proposed to cancel one source coupled with an online natural gradient independent vector analysis technique to improve average separation performance in the context of step-wise moving sources. This addresses `dips' in performance when sources move. Results show the average convergence time of the performance parameters is improved.
Online methods introduced in thesis are tested using impulse responses measured in reverberant environments, demonstrating their robustness and are shown to perform better than established methods in a variety of situations
Computation of the one-dimensional unwrapped phase
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.Includes bibliographical references (p. 101-102). "Cepstrum bibliography" (p. 67-100).In this thesis, the computation of the unwrapped phase of the discrete-time Fourier transform (DTFT) of a one-dimensional finite-length signal is explored. The phase of the DTFT is not unique, and may contain integer multiple of 27r discontinuities. The unwrapped phase is the instance of the phase function chosen to ensure continuity. This thesis presents existing algorithms for computing the unwrapped phase, discussing their weaknesses and strengths. Then two composite algorithms are proposed that use the existing ones, combining their strengths while avoiding their weaknesses. The core of the proposed methods is based on recent advances in polynomial factoring. The proposed methods are implemented and compared to the existing ones.by Zahi Nadim Karam.S.M
Ambisonics
This open access book provides a concise explanation of the fundamentals and background of the surround sound recording and playback technology Ambisonics. It equips readers with the psychoacoustical, signal processing, acoustical, and mathematical knowledge needed to understand the inner workings of modern processing utilities, special equipment for recording, manipulation, and reproduction in the higher-order Ambisonic format. The book comes with various practical examples based on free software tools and open scientific data for reproducible research. The book’s introductory section offers a perspective on Ambisonics spanning from the origins of coincident recordings in the 1930s to the Ambisonic concepts of the 1970s, as well as classical ways of applying Ambisonics in first-order coincident sound scene recording and reproduction that have been practiced since the 1980s. As, from time to time, the underlying mathematics become quite involved, but should be comprehensive without sacrificing readability, the book includes an extensive mathematical appendix. The book offers readers a deeper understanding of Ambisonic technologies, and will especially benefit scientists, audio-system and audio-recording engineers. In the advanced sections of the book, fundamentals and modern techniques as higher-order Ambisonic decoding, 3D audio effects, and higher-order recording are explained. Those techniques are shown to be suitable to supply audience areas ranging from studio-sized to hundreds of listeners, or headphone-based playback, regardless whether it is live, interactive, or studio-produced 3D audio material
Machine Learning
Machine Learning can be defined in various ways related to a scientific domain concerned with the design and development of theoretical and implementation tools that allow building systems with some Human Like intelligent behavior. Machine learning addresses more specifically the ability to improve automatically through experience
Discriminative and adaptive training for robust speech recognition and understanding
Robust automatic speech recognition (ASR) and understanding (ASU) under various conditions remains to be a challenging problem even with the advances of deep learning. To achieve robust ASU, two discriminative training objectives are proposed for keyword spotting and topic classification: (1) To accurately recognize the semantically important keywords, the non-uniform error cost minimum classification error training of deep neural network (DNN) and bi-directional long short-term memory (BLSTM) acoustic models is proposed to minimize the recognition errors of only the keywords. (2) To compensate for the mismatched objectives of speech recognition and understanding, minimum semantic error cost training of the BLSTM acoustic model is proposed to generate semantically accurate lattices for topic classification. Further, to expand the application of the ASU system to various conditions, four adaptive training approaches are proposed to improve the robustness of the ASR under different conditions: (1) To suppress the effect of inter-speaker variability on speaker-independent DNN acoustic model, speaker-invariant training is proposed to learn a deep representation in the DNN that is both senone-discriminative and speaker-invariant through adversarial multi-task training (2) To achieve condition-robust unsupervised adaptation with parallel data, adversarial teacher-student learning is proposed to suppress multiple factors of condition variability in the procedure of knowledge transfer from a well-trained source domain LSTM acoustic model to the target domain. (3) To further improve the adversarial learning for unsupervised adaptation with unparallel data, domain separation networks are used to enhance the domain-invariance of the senone-discriminative deep representation by explicitly modeling the private component that is unique to each domain. (4) To achieve robust far-field ASR, an LSTM adaptive beamforming network is proposed to estimate the real-time beamforming filter coefficients to cope with non-stationary environmental noise and dynamic nature of source and microphones positions.Ph.D
- …