74 research outputs found

    Asynchronous factorisation of speaker and background with feature transforms in speech recognition

    This paper presents a novel approach to separating the effects of speaker and background conditions by applying feature-transform-based adaptation for Automatic Speech Recognition (ASR). So far, factorisation has been shown to yield improvements for utterance-synchronous environments. In this paper we show successful separation of conditions that are asynchronous with the speech, such as background music. Our work takes account of the asynchronous nature of the background by estimating condition-specific Constrained Maximum Likelihood Linear Regression (CMLLR) transforms. In addition, speaker adaptation is performed, allowing speaker and background effects to be factorised. The background transforms are likewise used asynchronously in the decoding process, via a modified Hidden Markov Model (HMM) topology which applies the optimal transform for each frame. Experimental results are presented on the WSJCAM0 corpus of British English speech, modified to contain controlled sections of background music. This addition of music degrades the baseline Word Error Rate (WER) from 10.1% to 26.4%. While synchronous factorisation with CMLLR transforms provides a 28% relative improvement in WER over the baseline, our asynchronous approach increases this reduction to 33%.
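
    As an informal illustration of the condition-specific CMLLR idea above (not the authors' implementation): each background condition c has an affine feature transform x' = A_c x + b_c, and a candidate transform can be scored by the adapted log-likelihood plus the Jacobian term log|det A_c|. The function names and the greedy per-frame selection below are assumptions for illustration only; in the paper the choice is made inside the decoder through the modified HMM topology.

        import numpy as np

        def apply_cmllr(x, A, b):
            # CMLLR / fMLLR acts in feature space: x' = A x + b
            return A @ x + b

        def score_conditions(x, transforms, log_likelihood):
            """Score one frame under every condition-specific CMLLR transform.

            transforms: list of (A, b) pairs, one per background condition.
            log_likelihood: callable scoring a transformed frame against the
            unadapted acoustic model (e.g. a GMM log-density).
            """
            scores = []
            for A, b in transforms:
                jacobian = np.log(abs(np.linalg.det(A)))  # keeps transforms comparable
                scores.append(log_likelihood(apply_cmllr(x, A, b)) + jacobian)
            return int(np.argmax(scores)), max(scores)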

    Robustness issues in a data-driven spoken language understanding system

    Robustness is a key requirement in spoken language understanding (SLU) systems. Human speech is often ungrammatical and ill-formed, and there will frequently be a mismatch between training and test data. This paper discusses robustness and adaptation issues in a statistically-based SLU system which is entirely data-driven. To test robustness, the system has been tested on data from the Air Travel Information Service (ATIS) domain which has been artificially corrupted with varying levels of additive noise. Although the speech recognition performance degraded steadily, the system did not fail catastrophically. Indeed, the end-to-end performance of the complete system degraded significantly more slowly than that of the recognition component itself. In a second set of experiments, the ability to rapidly adapt the core understanding component of the system to a different application within the same broad domain has been tested. Using only a small amount of training data, experiments have shown that a semantic parser based on the Hidden Vector State (HVS) model, originally trained on the ATIS corpus, can be straightforwardly adapted to the somewhat different DARPA Communicator task using standard adaptation algorithms. The paper concludes by suggesting that the results presented provide initial support for the claim that an SLU system which is statistically based and trained entirely from data is intrinsically robust and can be readily adapted to new applications.
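
    The "standard adaptation algorithms" are not spelled out in this abstract; one common choice for adapting a discrete statistical model such as the HVS parser to a new domain is MAP-style count interpolation, sketched below under that assumption:

        P_adapt(c | h) = ( N_target(h, c) + tau * P_prior(c | h) ) / ( N_target(h) + tau )

    where N_target are counts from the small in-domain (Communicator) training set, P_prior is the ATIS-trained model, and tau controls how strongly the prior model is trusted.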

    Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations

    This paper proposes a framework for performing adaptation to complex and non-stationary background conditions in Automatic Speech Recognition (ASR) by means of asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transforms and asynchronous Noise Adaptive Training (aNAT). The proposed method aims to apply the feature transform that best compensates for the background in every input frame. The implementation is done with a new Hidden Markov Model (HMM) topology that expands the usual left-to-right HMM into parallel branches adapted to different background conditions and permits transitions among them. With this topology, the proposed adaptation does not require ground truth or prior knowledge about the background in each frame, as it aims to maximise the overall log-likelihood of the decoded utterance. The proposed aCMLLR transforms can be further improved by retraining models in an aNAT fashion and by using speaker-based MLLR transforms in cascade for efficient modelling of background and speaker effects. An initial evaluation on a modified version of the WSJCAM0 corpus incorporating 7 different background conditions provides a benchmark for evaluating the use of aCMLLR transforms. A relative reduction of 40.5% in Word Error Rate (WER) was achieved by the combined use of aCMLLR and MLLR in cascade. Finally, this selection of techniques was applied to the transcription of multi-genre media broadcasts, where the use of aNAT training, aCMLLR transforms and MLLR transforms provided a relative improvement of 2–3%.
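
    The parallel-branch decoding described above can be pictured, in a much simplified form, as a Viterbi search over background conditions: each frame may stay in its current condition branch or switch to another, and the path maximising the total adapted log-likelihood is kept. The array shapes, the single switch penalty, and the standalone search below are assumptions for illustration; the actual method embeds the branches in the recogniser's HMM topology.

        import numpy as np

        def best_condition_path(frame_logliks, switch_penalty=2.0):
            # frame_logliks[t, c]: log-likelihood of frame t under the aCMLLR
            # transform of background condition c (shape: T x C).
            T, C = frame_logliks.shape
            score = frame_logliks[0].copy()
            back = np.zeros((T, C), dtype=int)
            for t in range(1, T):
                # staying in the same branch is free; switching pays a penalty
                trans = score[:, None] - switch_penalty * (1.0 - np.eye(C))
                back[t] = trans.argmax(axis=0)
                score = trans.max(axis=0) + frame_logliks[t]
            # backtrace the per-frame condition sequence
            path = [int(score.argmax())]
            for t in range(T - 1, 0, -1):
                path.append(int(back[t, path[-1]]))
            return path[::-1]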

    Using contextual information in Joint Factor Eigenspace MLLR for speech recognition in diverse scenarios

    This paper presents a new approach for rapid adaptation in the presence of highly diverse scenarios that takes advantage of information describing the input signals. We introduce a new method for joint factorisation of the background and the speaker in an eigenspace MLLR framework: Joint Factor Eigenspace MLLR (JFEMLLR). We further propose to use contextual information describing the speaker and background, such as tags or more complex metadata, to provide an immediate estimate of the best MLLR transformation for the utterance. This provides instant adaptation, since it does not require any transcription from a previous decoding stage. Evaluation on a highly diverse Automatic Speech Recognition (ASR) task, a modified version of WSJCAM0, yields an improvement of 26.9% over the baseline, which is an extra 1.2% reduction over two-pass MLLR adaptation.
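
    As a rough sketch of the eigenspace view used above (the exact JFEMLLR parameterisation is not given in this abstract, so the additive split over two factors below is an assumption): the MLLR mean transform is built from a small set of basis transforms, with separate weights for the speaker and background factors, and contextual tags can be mapped directly to those weights so that no first-pass transcription is needed.

        \hat{\mu}_m = W \xi_m, \qquad \xi_m = [\, 1 \;\; \mu_m^{\top} \,]^{\top},
        \qquad
        W \approx W_0 + \sum_{k} \lambda_k^{(\mathrm{spk})} W_k^{(\mathrm{spk})}
                      + \sum_{j} \lambda_j^{(\mathrm{bkg})} W_j^{(\mathrm{bkg})}

    where the W_k and W_j are eigen-transforms estimated from training data and the weights lambda are either estimated by maximum likelihood or predicted from the metadata describing the utterance.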

    Transfer Learning for Speech and Language Processing

    Transfer learning is a vital technique that generalizes models trained for one setting or task to other settings or tasks. For example, in speech recognition an acoustic model trained for one language can be used to recognize speech in another language, with little or no re-training data. Transfer learning is closely related to multi-task learning (cross-lingual vs. multilingual), and is traditionally studied under the name of 'model adaptation'. Recent advances in deep learning show that transfer learning becomes much easier and more effective with the high-level abstract features learned by deep models, and that the 'transfer' can be conducted not only between data distributions and data types, but also between model structures (e.g., shallow nets and deep nets) or even model types (e.g., Bayesian models and neural models). This review paper summarizes some recent prominent research in this direction, particularly for speech and language processing. We also report some results from our group and highlight the potential of this very interesting research field.
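
    A minimal sketch of the classic cross-lingual transfer recipe mentioned above (hidden layers reused, output layer replaced), written with PyTorch; the network, its sizes, and the freezing choice are illustrative assumptions rather than a setup taken from the paper.

        import torch.nn as nn

        class AcousticModel(nn.Module):
            """Toy feed-forward acoustic model: shared feature layers + output layer."""
            def __init__(self, feat_dim=40, hidden=512, n_targets=2000):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Linear(feat_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                )
                self.classifier = nn.Linear(hidden, n_targets)

            def forward(self, x):
                return self.classifier(self.features(x))

        source = AcousticModel(n_targets=2000)   # trained on a resource-rich language
        target = AcousticModel(n_targets=800)    # new output layer for the target language
        target.features.load_state_dict(source.features.state_dict())
        for p in target.features.parameters():
            p.requires_grad = False               # freeze the transferred representation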

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages which lack resources for speech and language processing. We focus on approaches that allow data from multiple languages to be used to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. On the application side, this thesis also includes research work on non-native and code-switching speech.

    Gated Recurrent Attention and Multivariate Information Minimization for Controllable Speech Synthesis

    Ph.D. dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021.8. μ²œμ„±μ€€.

    Speech is one of the most useful interfaces, enabling a person to communicate with distant others while keeping the hands free for other tasks. With the growing use of speech interfaces in mobile devices, home appliances, and automobiles, research on human-machine speech interfaces is expanding. This thesis deals with speech synthesis, which enables machines to generate speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control is still a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor by factor. First, conventional style control techniques proposed for speech synthesis systems are reviewed: to control speaker identity, emotion, accent and prosody, we cover control methods for both statistical parametric and deep learning-based speech synthesis systems. We then propose gated recurrent attention (GRA), a novel attention mechanism with controllable gated recurrence. GRA is suitable for learning various styles because its two gates control the recurrent attention state associated with each output location. In experiments, GRA was found to be more effective at transferring unseen styles, implying that it generalizes better than conventional techniques. We also propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing their interactive dependency, which can be expressed as a sum of mutual-information upper-bound terms. Since the upper-bound estimate converges from the early training stage, there is little performance degradation due to the auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer on multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests with 15 speech experts validate that the proposed method improves the synthesizer in terms of quality as well as controllability.

    Table of contents:
    1 Introduction
      1.1 Evolution of Speech Synthesis Technology
      1.2 Attention-based Speech Synthesis Systems
        1.2.1 Tacotron
        1.2.2 Deep Convolutional TTS
      1.3 Non-autoregressive Speech Synthesis Systems
        1.3.1 Glow-TTS
        1.3.2 SpeedySpeech
      1.4 Outline of the thesis
    2 Style Modeling Techniques for Speech Synthesis
      2.1 Introduction
      2.2 Style Modeling Techniques for Statistical Parametric Speech Synthesis
      2.3 Style Modeling Techniques for Deep Learning-based Speech Synthesis
      2.4 Summary
    3 Gated Recurrent Attention for Multi-Style Speech Synthesis
      3.1 Introduction
      3.2 Related Works
        3.2.1 Gated recurrent unit
        3.2.2 Location-sensitive attention
      3.3 Gated Recurrent Attention
      3.4 Experiments and results
        3.4.1 Tacotron2 with global style tokens
        3.4.2 Decaying guided attention
        3.4.3 Datasets and feature processing
        3.4.4 Evaluation methods
        3.4.5 Evaluation results
      3.5 Guided attention and decaying guided attention
      3.6 Summary
    4 A Controllable Multi-lingual Multi-speaker Multi-style Text-to-Speech Synthesis with Multivariate Information Minimization
      4.1 Introduction
      4.2 Related Works
        4.2.1 Disentanglement Studies for Speech Synthesis
        4.2.2 Total Correlation and Mutual Information
        4.2.3 CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information
      4.3 Proposed method
      4.4 Experiments and Results
        4.4.1 Quality and Naturalness of Speech
        4.4.2 Speaker and style similarity
      4.5 Summary
    5 Conclusions
    Bibliography
    Abstract (in Korean)
    Acknowledgements