
    A Voice Conversion Method for Arbitrary Speakers Using Autoencoders

    Voice conversion (VC) is a technique that converts input speech into the voice characteristics of a target speaker. Gaussian Mixture Model (GMM)-based methods have traditionally been the standard approach, but with the recent rise of deep learning, methods based on Deep Neural Networks (DNNs) have attracted attention. However, most GMM- and DNN-based methods perform one-to-one conversion; few studies handle input from arbitrary speakers, and existing arbitrary-speaker methods suffer lower conversion accuracy than one-to-one conversion. In addition, conventional DNN-based methods use complex networks for both one-to-one and many-to-one conversion, so they require large amounts of training data and long conversion times. To address these problems, this work proposes voice conversion methods based on autoencoders and sparse autoencoders. The proposed method compresses acoustic features into high-level features with an autoencoder, converts them into the target speaker's high-level features with a DNN, and reconstructs acoustic features with the target speaker's autoencoder. In evaluation experiments comparing the proposed and conventional methods, the autoencoder-based method performed spectral conversion with slightly higher accuracy than the conventional method while shortening conversion time. The sparse-autoencoder-based method further improved spectral conversion accuracy and the naturalness of the converted speech compared with the plain autoencoder method, improving voice conversion accuracy for arbitrary speakers. University of Electro-Communications (電気通信大学), 201
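The three-stage pipeline the abstract describes (encode acoustic features to a compact high-level code, map that code to the target speaker, then decode with the target speaker's autoencoder) can be sketched as below. This is a minimal illustration of the data flow only: the layer sizes, random untrained weights, and all function names are assumptions for illustration, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Build random-weight layers (illustration only; a real system trains these)."""
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    """Apply each affine layer followed by a tanh nonlinearity."""
    for W, b in layers:
        x = np.tanh(x @ W + b)
    return x

# Hypothetical sizes: 40-dim spectral frames, 16-dim autoencoder bottleneck.
D_ACOUSTIC, D_CODE = 40, 16

src_encoder = mlp([D_ACOUSTIC, D_CODE])  # source autoencoder, encoder half
code_mapper = mlp([D_CODE, D_CODE])      # DNN mapping source code -> target code
tgt_decoder = mlp([D_CODE, D_ACOUSTIC])  # target speaker's autoencoder, decoder half

frames = rng.standard_normal((100, D_ACOUSTIC))  # 100 frames of source features
converted = forward(tgt_decoder,
                    forward(code_mapper,
                            forward(src_encoder, frames)))
print(converted.shape)  # (100, 40): same frame count, target-domain features
```

Because the DNN operates on the low-dimensional bottleneck codes rather than the raw acoustic features, the mapping network stays small, which is consistent with the abstract's claim of reduced training-data needs and shorter conversion time.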

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility resulting from slow, uncoordinated control of the speech production muscles. Automatic speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS that adds a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels. In addition, we extend this work with a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that provide only discrete dysarthria severity labels. This approach increases the controllability of the system, allowing us to generate a broader range of dysarthric speech. To evaluate the effectiveness of the synthesized training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that adding the severity level and pause insertion controls decreases WER by a further 6.5%, showing the effectiveness of these parameters. Overall, results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems.
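The WER figures quoted above are relative improvements in word error rate, the standard ASR metric: the word-level Levenshtein edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal, self-contained sketch of the metric (not the dissertation's evaluation code):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[len(r)][len(h)] / len(r)

score = wer("the cat sat on the mat", "the cat sat on mat")
print(round(score, 3))  # one deletion over six reference words -> 0.167
```

A "WER improvement of 12.2%" in this context is a relative reduction of this quantity against the baseline system's WER, not an absolute 12.2-point drop.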