74 research outputs found
Asynchronous factorisation of speaker and background with feature transforms in speech recognition
This paper presents a novel approach to separating the effects of speaker and background conditions by applying feature-transform-based adaptation for Automatic Speech Recognition (ASR). So far, factorisation has been shown to yield improvements in the case of utterance-synchronous environments. In this paper we show successful separation of conditions asynchronous with speech, such as background music. Our work takes account of the asynchronous nature of the background by estimating condition-specific Constrained Maximum Likelihood Linear Regression (CMLLR) transforms. In addition, speaker adaptation is performed, allowing speaker and background effects to be factorised. Equally, background transforms are used asynchronously in the decoding process, using a modified Hidden Markov Model (HMM) topology which applies the optimal transform for each frame. Experimental results are presented on the WSJCAM0 corpus of British English speech, modified to contain controlled sections of background music. This addition of music degrades the baseline Word Error Rate (WER) from 10.1% to 26.4%. While synchronous factorisation with CMLLR transforms provides a 28% relative improvement in WER over the baseline, our asynchronous approach increases this reduction to 33%.
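As a rough illustration of per-frame transform selection, the sketch below applies two hypothetical condition-specific CMLLR feature transforms (x' = A_c x + b_c) to each frame of a toy utterance and keeps whichever condition scores higher under a single-Gaussian acoustic model. The transforms, model parameters, and condition names are invented for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3                                  # feature dimension
frames = rng.normal(size=(5, D))       # a short toy utterance

# Two hypothetical background conditions, each with a CMLLR transform (A, b)
transforms = {
    "clean": (np.eye(D), np.zeros(D)),
    "music": (0.8 * np.eye(D), 0.1 * np.ones(D)),
}

mean, var = np.zeros(D), np.ones(D)    # toy single-Gaussian acoustic model

def log_gauss(x, mean, var):
    """Diagonal-covariance Gaussian log-density."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def best_transform(x):
    """Return the condition whose transformed frame scores highest.
    CMLLR scoring adds log|det A|, the Jacobian of the feature transform."""
    scores = {}
    for cond, (A, b) in transforms.items():
        x_t = A @ x + b
        scores[cond] = log_gauss(x_t, mean, var) + np.linalg.slogdet(A)[1]
    return max(scores, key=scores.get)

labels = [best_transform(x) for x in frames]
print(labels)
```

The log|det A| term is the one genuinely CMLLR-specific ingredient here: because the transform is applied to the features rather than the model, the likelihood must include the Jacobian of the feature mapping.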
Robustness issues in a data-driven spoken language understanding system
Robustness is a key requirement in spoken language understanding (SLU) systems. Human speech is often ungrammatical and ill-formed, and there will frequently be a mismatch between training and test data. This paper discusses robustness and adaptation issues in a statistically-based SLU system which is entirely data-driven. To test robustness, the system has been tested on data from the Air Travel Information Service (ATIS) domain which has been artificially corrupted with varying levels of additive noise. Although the speech recognition performance degraded steadily, the system did not fail catastrophically. Indeed, the rate at which the end-to-end performance of the complete system degraded was significantly slower than that of the actual recognition component. In a second set of experiments, the ability to rapidly adapt the core understanding component of the system to a different application within the same broad domain has been tested. Using only a small amount of training data, experiments have shown that a semantic parser based on the Hidden Vector State (HVS) model originally trained on the ATIS corpus can be straightforwardly adapted to the somewhat different DARPA Communicator task using standard adaptation algorithms. The paper concludes by suggesting that the results presented provide initial support for the claim that an SLU system which is statistically-based and trained entirely from data is intrinsically robust and can be readily adapted to new applications.
Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations
This paper proposes a framework for performing adaptation to complex and non-stationary background conditions in Automatic Speech Recognition (ASR) by means of asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transforms and asynchronous Noise Adaptive Training (aNAT). The proposed method aims to apply the feature transform that best compensates for the background in every input frame. The implementation is done with a new Hidden Markov Model (HMM) topology that expands the usual left-to-right HMM into parallel branches adapted to different background conditions and permits transitions among them. Using this, the proposed adaptation does not require ground truth or prior knowledge about the background in each frame, as it aims to maximise the overall log-likelihood of the decoded utterance. The proposed aCMLLR transforms can be further improved by retraining models in an aNAT fashion and by using speaker-based MLLR transforms in cascade for efficient modelling of background and speaker effects. An initial evaluation on a modified version of the WSJCAM0 corpus incorporating 7 different background conditions provides a benchmark in which to evaluate the use of aCMLLR transforms. A relative reduction of 40.5% in Word Error Rate (WER) was achieved by the combined use of aCMLLR and MLLR in cascade. Finally, this selection of techniques was applied to the transcription of multi-genre media broadcasts, where the use of aNAT training, aCMLLR transforms and MLLR transforms provided a relative improvement of 2–3%.
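The parallel-branch decoding idea can be caricatured as a small Viterbi search over condition branches: each background condition is a branch, decoding chooses one branch per frame, and sticky transition probabilities discourage implausibly frequent background switches. All numbers below are illustrative, not the paper's topology or values.

```python
import numpy as np

log_lik = np.log(np.array([            # per-frame log-likelihood of each branch
    [0.9, 0.1],                        # frame 0: branch 0 ("clean") likely
    [0.2, 0.8],                        # frame 1: branch 1 ("music") likely
    [0.1, 0.9],
    [0.85, 0.15],
]))
log_trans = np.log(np.array([[0.8, 0.2],   # moderately sticky transitions
                             [0.2, 0.8]])) # between condition branches

def viterbi_branches(log_lik, log_trans):
    """Max-likelihood branch sequence over T frames and B branches."""
    T, B = log_lik.shape
    score = log_lik[0].copy()
    back = np.zeros((T, B), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans       # cand[from, to]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_lik[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):               # backtrace
        path.append(int(back[t][path[-1]]))
    return path[::-1]

best = viterbi_branches(log_lik, log_trans)
print(best)
```

With these numbers the search switches into the "music" branch for frames 1-2 and back out again, the behaviour the asynchronous topology is designed to permit.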
Using contextual information in Joint Factor Eigenspace MLLR for speech recognition in diverse scenarios
This paper presents a new approach for rapid adaptation in the presence of highly diverse scenarios that takes advantage of information describing the input signals. We introduce a new method for joint factorisation of the background and the speaker in an eigenspace MLLR framework: Joint Factor Eigenspace MLLR (JFEMLLR). We further propose to use contextual information describing the speaker and background, such as tags or more complex metadata, to provide an immediate estimate of the best MLLR transformation for the utterance. This provides instant adaptation, since it does not require any transcription from a previous decoding stage. Evaluation on a highly diverse Automatic Speech Recognition (ASR) task, a modified version of WSJCAM0, yields an improvement of 26.9% over the baseline, which is an extra 1.2% reduction over two-pass MLLR adaptation.
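A minimal sketch of the tag-driven idea, under our assumption (not spelled out in the abstract) that an utterance transform is a weighted combination of basis ("eigen") MLLR transforms and that weights have been precomputed per metadata tag pair. The `tag_weights` table, tag names, and all numbers are hypothetical.

```python
import numpy as np

D = 2
basis = np.stack([np.eye(D),                          # eigen-transform 0
                  np.array([[1.1, 0.0], [0.0, 0.9]])])  # eigen-transform 1

# Hypothetical precomputed weights per (speaker-tag, background-tag) pair
tag_weights = {
    ("female", "music"): np.array([0.3, 0.7]),
    ("male", "clean"):   np.array([0.9, 0.1]),
}

def transform_from_tags(speaker_tag, background_tag):
    """One-shot transform estimate from metadata: no first-pass decoding
    or transcription is needed, which is what makes adaptation instant."""
    w = tag_weights[(speaker_tag, background_tag)]
    return np.tensordot(w, basis, axes=1)             # weighted sum of bases

A = transform_from_tags("female", "music")
print(A)
```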
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for one setting or task to other settings or tasks. For example, in speech recognition, an acoustic model trained for one language can be used to recognize speech in another language with little or no re-training data. Transfer learning is closely related to multi-task learning (cross-lingual vs. multilingual), and has traditionally been studied under the name 'model adaptation'. Recent advances in deep learning show that transfer learning becomes much easier and more effective with high-level abstract features learned by deep models, and that the 'transfer' can be conducted not only between data distributions and data types, but also between model structures (e.g., shallow nets and deep nets) or even model types (e.g., Bayesian models and neural models). This review paper summarizes some recent prominent research in this direction, particularly for speech and language processing. We also report some results from our group and highlight the potential of this very interesting research field.
Comment: 13 pages, APSIPA 201
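The cross-lingual example above can be sketched as the usual freeze-and-replace recipe: keep the source-language hidden layers and re-initialise only the output layer for the target language's phone set. The toy two-layer model and all shapes below are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source-language acoustic model: features -> hidden -> phone posteriors
source_model = {
    "hidden": rng.normal(size=(40, 64)),   # feature -> hidden weights
    "output": rng.normal(size=(64, 45)),   # hidden -> 45 source phones
}

def transfer(source_model, n_target_phones):
    """Reuse the hidden layer as-is; fresh output layer for target phones."""
    hidden = source_model["hidden"]
    return {
        "hidden": hidden,                                  # transferred
        "output": rng.normal(size=(hidden.shape[1],
                                   n_target_phones)) * 0.01,  # re-initialised
    }

target_model = transfer(source_model, n_target_phones=38)
print(target_model["output"].shape)
```

In practice the transferred layers may then be fine-tuned on whatever target-language data exists; with none, only the new output layer is trained.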
Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information
This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages which lack resources for speech and language processing. We focus on finding approaches that allow using data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. On the application side, this thesis also includes research on non-native and code-switching speech.
Gated Recurrent Attention and Multivariate Information Minimization for Controllable Speech Synthesis
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021.8.
Speech is one of the most useful interfaces, enabling a person to communicate with others at a distance while keeping the hands free for other tasks. With the growing use of speech interfaces in mobile devices, home appliances, and automobiles, research on human-machine speech interfaces is expanding. This thesis deals with speech synthesis, which enables machines to generate speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control is still a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor by factor.
First, the conventional style control techniques which have been proposed for speech synthesis systems are introduced. To control speaker identity, emotion, accent and prosody, we introduce control methods for both statistical parametric and deep learning-based speech synthesis systems.
We propose gated recurrent attention (GRA), a novel attention mechanism with a controllable gated recurrence. GRA is suitable for learning various styles because its two gates can control the recurrent attention state corresponding to each output location. In experiments, GRA was found to be more effective in transferring unseen styles, which implies that GRA generalizes better than conventional techniques.
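A rough sketch of the gated-recurrence idea (our simplification with scalar gates, not the thesis's exact formulation): the alignment over encoder steps is carried as a recurrent state, and two gates decide how much of the previous alignment feeds the new attention energies and how much of it survives the update.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 6, 4                        # encoder length, key/query dimension
keys = rng.normal(size=(T, d))     # toy encoder outputs

def gra_step(query, prev_align, z_gate, r_gate):
    """One decoder step. z_gate, r_gate in [0, 1] play the roles of
    update and reset gates (scalars here; learned vectors in practice)."""
    energies = keys @ query + r_gate * prev_align     # reset-modulated recurrence
    new_align = softmax(energies)
    return z_gate * prev_align + (1 - z_gate) * new_align  # gated update

align = np.full(T, 1.0 / T)        # start from a uniform alignment
for step in range(3):
    query = rng.normal(size=d)     # stand-in for the decoder state
    align = gra_step(query, align, z_gate=0.3, r_gate=0.5)

print(align.round(3))
```

Because the update is a convex combination of two distributions, the alignment remains a valid distribution at every step; the gates make the degree of recurrence controllable, which is the property the thesis exploits for style.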
We propose a multivariate information minimization method that disentangles three or more latent representations. We show that the control factors can be disentangled by minimizing their interactive dependency, which can be expressed as a sum of mutual-information upper-bound terms. Since the upper-bound estimate converges from the early training stage, there is little performance degradation due to the auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer with multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests validate that the proposed method can improve the synthesizer in terms of quality as well as controllability.
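A toy illustration of the auxiliary loss as a sum of pairwise mutual-information terms. Instead of the thesis's CLUB neural estimator, this sketch uses the closed-form Gaussian mutual information, I(X; Y) = -(1/2) ln(1 - rho^2), so it stays self-contained; the factor names and correlations are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
base = rng.normal(size=n)

# Three toy "latent factors" (names illustrative): the first two are
# deliberately entangled through a shared component, the third is not.
language = base + 0.3 * rng.normal(size=n)
speaker  = base + 0.3 * rng.normal(size=n)
style    = rng.normal(size=n)

def gaussian_mi(x, y):
    """MI of two samples assumed jointly Gaussian (closed form via rho)."""
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

factors = {"language": language, "speaker": speaker, "style": style}
names = list(factors)
# Interactive dependency penalty: sum over all factor pairs
penalty = sum(gaussian_mi(factors[a], factors[b])
              for i, a in enumerate(names) for b in names[i + 1:])
mi_ls = gaussian_mi(language, speaker)
print(round(mi_ls, 3), round(penalty, 3))
```

Minimizing such a penalty during training pushes each pair of factors toward independence; CLUB replaces the closed form with a learned contrastive upper bound so the same idea applies to arbitrary (non-Gaussian) latents.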
1 Introduction
1.1 Evolution of Speech Synthesis Technology
1.2 Attention-based Speech Synthesis Systems
1.2.1 Tacotron
1.2.2 Deep Convolutional TTS
1.3 Non-autoregressive Speech Synthesis Systems
1.3.1 Glow-TTS
1.3.2 SpeedySpeech
1.4 Outline of the thesis
2 Style Modeling Techniques for Speech Synthesis
2.1 Introduction
2.2 Style Modeling Techniques for Statistical Parametric Speech Synthesis
2.3 Style Modeling Techniques for Deep Learning-based Speech Synthesis
2.4 Summary
3 Gated Recurrent Attention for Multi-Style Speech Synthesis
3.1 Introduction
3.2 Related Works
3.2.1 Gated recurrent unit
3.2.2 Location-sensitive attention
3.3 Gated Recurrent Attention
3.4 Experiments and results
3.4.1 Tacotron2 with global style tokens
3.4.2 Decaying guided attention
3.4.3 Datasets and feature processing
3.4.4 Evaluation methods
3.4.5 Evaluation results
3.5 Guided attention and decaying guided attention
3.6 Summary
4 A Controllable Multi-lingual Multi-speaker Multi-style Text-to-Speech Synthesis with Multivariate Information Minimization
4.1 Introduction
4.2 Related Works
4.2.1 Disentanglement Studies for Speech Synthesis
4.2.2 Total Correlation and Mutual Information
4.2.3 CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information
4.3 Proposed method
4.4 Experiments and Results
4.4.1 Quality and Naturalness of Speech
4.4.2 Speaker and style similarity
4.5 Summary
5 Conclusions
Bibliography
Abstract (in Korean)
Acknowledgements
- β¦