2,766 research outputs found
Recommended from our members
Unsupervised intralingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction
Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to estimate the transcription of the adaptation data. This paper firstly presents an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for such supplementary acoustic models. This is achieved by defining a mapping between HMM-based synthesis models and ASR-style models, via a two-pass decision tree construction process. Secondly, it is shown that this mapping also enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data. Thirdly, this paper demonstrates how this technique lends itself to the task of unsupervised cross-lingual adaptation of HMM-based speech synthesis models, and explains the advantages of such an approach. Finally, listener evaluations reveal that the proposed unsupervised adaptation methods deliver performance approaching that of supervised adaptation
Two-pass decision tree construction for unsupervised adaptation of HMM-based synthesis models
Hidden Markov model (HMM) -based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to firstly estimate the transcription of the adaptation data. By defining a mapping between HMM-based synthesis models and ASR-style models, this paper introduces an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for supplementary acoustic models. Further, this enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data
Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction
This paper demonstrates how unsupervised cross-lingual adaptation of HMM-based speech synthesis models may be performed without explicit knowledge of the adaptation data
language. A two-pass decision tree construction technique is deployed for this purpose. Using parallel translated datasets, cross-lingual and intralingual adaptation are compared in a controlled manner. Listener evaluations reveal that the
proposed method delivers performance approaching that of unsupervised intralingual adaptation
Speech Emotion Recognition Using Multi-hop Attention Mechanism
In this paper, we are interested in exploiting textual and acoustic data of
an utterance for the speech emotion classification task. The baseline approach
models the information from audio and text independently using two deep neural
networks (DNNs). The outputs from both the DNNs are then fused for
classification. As opposed to using knowledge from both the modalities
separately, we propose a framework to exploit acoustic information in tandem
with lexical data. The proposed framework uses two bi-directional long
short-term memory (BLSTM) for obtaining hidden representations of the
utterance. Furthermore, we propose an attention mechanism, referred to as the
multi-hop, which is trained to automatically infer the correlation between the
modalities. The multi-hop attention first computes the relevant segments of the
textual data corresponding to the audio signal. The relevant textual data is
then applied to attend parts of the audio signal. To evaluate the performance
of the proposed system, experiments are performed in the IEMOCAP dataset.
Experimental results show that the proposed technique outperforms the
state-of-the-art system by 6.5% relative improvement in terms of weighted
accuracy.Comment: 5 pages, Accepted as a conference paper at ICASSP 2019 (oral
presentation
Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as
they have been shown to benefit from more training data, and better lend
themselves to adaptation to under-resourced languages. However, initialisation
from monolingual context-dependent models leads to an explosion of
context-dependent states. Connectionist Temporal Classification (CTC) is a
potential solution to this as it performs well with monophone labels.
We investigate multilingual CTC in the context of adaptation and
regularisation techniques that have been shown to be beneficial in more
conventional contexts. The multilingual model is trained to model a universal
International Phonetic Alphabet (IPA)-based phone set using the CTC loss
function. Learning Hidden Unit Contribution (LHUC) is investigated to perform
language adaptive training. In addition, dropout during cross-lingual
adaptation is also studied and tested in order to mitigate the overfitting
problem.
Experiments show that the performance of the universal phoneme-based CTC
system can be improved by applying LHUC and it is extensible to new phonemes
during cross-lingual adaptation. Updating all the parameters shows consistent
improvement on limited data. Applying dropout during adaptation can further
improve the system and achieve competitive performance with Deep Neural Network
/ Hidden Markov Model (DNN/HMM) systems on limited data
AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline
An open-source Mandarin speech corpus called AISHELL-1 is released. It is by
far the largest corpus which is suitable for conducting the speech recognition
research and building speech recognition systems for Mandarin. The recording
procedure, including audio capturing devices and environments are presented in
details. The preparation of the related resources, including transcriptions and
lexicon are described. The corpus is released with a Kaldi recipe. Experimental
results implies that the quality of audio recordings and transcriptions are
promising.Comment: Oriental COCOSDA 201
- …