Sub-Band Knowledge Distillation Framework for Speech Enhancement
In single-channel speech enhancement, methods based on full-band spectral
features have been widely studied. However, only a few methods pay attention to
non-full-band spectral features. In this paper, we explore a knowledge
distillation framework based on sub-band spectral mapping for single-channel
speech enhancement. Specifically, we divide the full frequency band into
multiple sub-bands and pre-train an elite-level sub-band enhancement model
(teacher model) for each sub-band. These teacher models are dedicated to
processing their own sub-bands. Next, under the teacher models' guidance, we
train a general sub-band enhancement model (student model) that works for all
sub-bands. Without increasing the number of model parameters or the computational
complexity, the student model's performance is further improved. To evaluate
our proposed method, we conducted a large number of experiments on an
open-source dataset. The final experimental results show that the guidance
from the elite-level teacher models dramatically improves the student model's
performance, which surpasses that of the full-band model while using fewer parameters.
Comment: Published in Interspeech 202
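As a rough illustration of the idea described above, the following sketch (a minimal example, not the authors' implementation; the module names, band count, and loss weights are assumptions) shows how a full-band magnitude spectrogram could be split into sub-bands, with the shared student's output on each band pulled toward a pre-trained per-band teacher as well as the clean target.

```python
import torch
import torch.nn as nn

NUM_BANDS = 4  # assumed number of sub-bands

class SubBandEnhancer(nn.Module):
    """Toy per-sub-band mask estimator (stands in for both teacher and student)."""
    def __init__(self, band_bins: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(band_bins, 128), nn.ReLU(),
                                 nn.Linear(128, band_bins), nn.Sigmoid())

    def forward(self, band_spec):               # (batch, frames, band_bins)
        return self.net(band_spec) * band_spec  # masked magnitude spectrum

def distillation_step(student, teachers, noisy_mag, clean_mag, alpha=0.5):
    """One training step: the shared student is supervised by each frozen
    per-band teacher plus the clean target (the loss weights are assumptions)."""
    bands_noisy = torch.chunk(noisy_mag, NUM_BANDS, dim=-1)
    bands_clean = torch.chunk(clean_mag, NUM_BANDS, dim=-1)
    loss = 0.0
    for noisy_b, clean_b, teacher in zip(bands_noisy, bands_clean, teachers):
        with torch.no_grad():
            teacher_out = teacher(noisy_b)       # elite-level per-band guidance
        student_out = student(noisy_b)           # one student model for every band
        loss = loss + alpha * nn.functional.mse_loss(student_out, teacher_out) \
                    + (1 - alpha) * nn.functional.mse_loss(student_out, clean_b)
    return loss / NUM_BANDS

# Usage sketch: a 256-bin magnitude spectrogram split into 4 bands of 64 bins.
band_bins = 64
student = SubBandEnhancer(band_bins)
teachers = [SubBandEnhancer(band_bins) for _ in range(NUM_BANDS)]
noisy = torch.rand(8, 100, band_bins * NUM_BANDS)
clean = torch.rand(8, 100, band_bins * NUM_BANDS)
distillation_step(student, teachers, noisy, clean).backward()
```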
SNR-Based Teachers-Student Technique for Speech Enhancement
It is very challenging for speech enhancement methods to achieve robust
performance under both high signal-to-noise ratio (SNR) and low SNR
simultaneously. In this paper, we propose a method that integrates an SNR-based
teachers-student technique and time-domain U-Net to deal with this problem.
Specifically, this method consists of multiple teacher models and a student
model. We first train each teacher model on a narrow SNR range, with the ranges
not overlapping one another, so that each teacher performs speech enhancement
well within its specific SNR range. Then, we choose different teacher models to
supervise the training of the student model according to the SNR of the
training data. Eventually, the student model can perform speech enhancement
under both high SNR and low SNR. To evaluate the proposed method, we
constructed a dataset with SNRs ranging from -20 dB to 20 dB based on a
public dataset. We experimentally analyzed the effectiveness of the SNR-based
teachers-student technique and compared the proposed method with several
state-of-the-art methods.
Comment: Published in 2020 IEEE International Conference on Multimedia and
Expo (ICME 2020)
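A minimal sketch of the teacher-selection idea (the SNR bands, module names, and loss weights below are my assumptions, not the authors' code): each training sample is routed to the teacher whose SNR range covers it, and that teacher's enhanced output supervises the student alongside the clean target.

```python
import torch
import torch.nn as nn

# Assumed non-overlapping SNR ranges in dB, one per teacher model.
SNR_RANGES = [(-20, -10), (-10, 0), (0, 10), (10, 20)]

def pick_teacher(snr_db: float, teachers):
    """Return the teacher whose SNR range covers this training sample."""
    for (low, high), teacher in zip(SNR_RANGES, teachers):
        if low <= snr_db < high:
            return teacher
    if snr_db == SNR_RANGES[-1][1]:               # include the upper edge
        return teachers[-1]
    raise ValueError(f"SNR {snr_db} dB outside the supported range")

def student_loss(student, teachers, noisy_wav, clean_wav, snr_db, alpha=0.5):
    """Blend supervision from the clean signal and the matching frozen teacher."""
    teacher = pick_teacher(snr_db, teachers)
    with torch.no_grad():
        teacher_out = teacher(noisy_wav)          # SNR-matched teacher guidance
    student_out = student(noisy_wav)              # time-domain model (U-Net in the paper)
    return alpha * nn.functional.l1_loss(student_out, teacher_out) \
         + (1 - alpha) * nn.functional.l1_loss(student_out, clean_wav)

# Usage sketch with tiny placeholder networks standing in for the U-Nets.
make_net = lambda: nn.Sequential(nn.Conv1d(1, 8, 9, padding=4), nn.ReLU(),
                                 nn.Conv1d(8, 1, 9, padding=4))
student = make_net()
teachers = [make_net() for _ in SNR_RANGES]
noisy = torch.randn(4, 1, 16000)                  # 1 s of 16 kHz audio
clean = torch.randn(4, 1, 16000)
student_loss(student, teachers, noisy, clean, snr_db=-5.0).backward()
```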
USING THE TWO-LEVEL MORPHOLOGY ON MODERN MONGOLIAN LINGUISTICS
This study primarily compiles the word structure of the Modern Mongolian language and, furthermore, focuses on the possibility of describing Mongolian in PC-KIMMO, a two-level method of morphological parsing. The rules file and lexicon presented in the paper describe the morphology of Mongolian words. A lexicon containing the root words of contemporary Mongolian is used in the testing. As a result, two-level morphology is found to be entirely usable for Mongolian linguistics. In addition, a PC-KIMMO description of the traditional Mongolian script is considered feasible.
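To give a flavour of the two-level idea, the toy Python sketch below (my own illustration, not the paper's PC-KIMMO rules; the suffixes and harmony classes are simplified assumptions) realizes a surface form from a lexical stem plus an abstract suffix whose vowel alternates with the stem's vowel-harmony class, which is the kind of lexical-to-surface alternation a two-level rule would license.

```python
# Toy two-level-style alternation for Mongolian vowel harmony: an abstract
# suffix vowel "A" surfaces as "a" after back-vowel stems and "e" after
# front-vowel stems. Real PC-KIMMO rules are finite-state constraints over
# lexical:surface character pairs; this is only a simplified sketch.

BACK_VOWELS = set("aou")    # assumed "masculine" harmony class
FRONT_VOWELS = set("eöü")   # assumed "feminine" harmony class

# Lexical (underlying) suffixes with the abstract archiphoneme "A".
SUFFIXES = {
    "ablative": "-AAs",     # surface -aas / -ees (simplified)
    "dative":   "-d",       # no alternating vowel
}

def harmony_class(stem: str) -> str:
    """Classify a stem by the last harmonizing vowel it contains."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return "back"
        if ch in FRONT_VOWELS:
            return "front"
    return "back"           # fallback for stems without a clear harmonizing vowel

def realize(stem: str, suffix_name: str) -> str:
    """Map the lexical form stem+suffix to its surface form."""
    lexical = SUFFIXES[suffix_name].lstrip("-")
    vowel = "a" if harmony_class(stem) == "back" else "e"
    return stem + lexical.replace("A", vowel)

if __name__ == "__main__":
    print(realize("gar", "ablative"))    # garaas  (back-harmony stem)
    print(realize("ger", "ablative"))    # gerees  (front-harmony stem)
```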
Controllable Accented Text-to-Speech Synthesis
Accented text-to-speech (TTS) synthesis seeks to generate speech with an
accent (L2) as a variant of the standard version (L1). Accented TTS synthesis
is challenging as L2 differs from L1 in terms of both phonetic
rendering and prosody pattern. Furthermore, there is no easy solution to the
control of the accent intensity in an utterance. In this work, we propose a
neural TTS architecture that allows us to control the accent and its intensity
during inference. This is achieved through three novel mechanisms: 1) an accent
variance adaptor to model the complex accent variance with three prosody
controlling factors, namely pitch, energy and duration; 2) an accent intensity
modeling strategy to quantify the accent intensity; 3) a consistency constraint
module to encourage the TTS system to render the expected accent intensity at a
fine level. Experiments show that the proposed system attains superior
performance to the baseline models in terms of accent rendering and intensity
control. To the best of our knowledge, this is the first study of accented TTS
synthesis with explicit intensity control.
Comment: To be submitted for possible journal publication
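A minimal sketch of how accent-intensity control of this kind might work at inference time (the module names, the scaling scheme, and the interpolation with L1 prosody are my assumptions, not the paper's exact design): predicted pitch, energy, and duration offsets for the accent are scaled by an intensity factor before being applied.

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Tiny stand-in predictor for one prosodic factor (pitch/energy/duration)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h):                    # (batch, phones, dim) -> (batch, phones)
        return self.net(h).squeeze(-1)

class AccentVarianceAdaptor(nn.Module):
    """Assumed adaptor: L1 prosody plus intensity-scaled accent (L2) offsets."""
    def __init__(self, dim: int):
        super().__init__()
        factors = ("pitch", "energy", "duration")
        self.l1 = nn.ModuleDict({k: ProsodyPredictor(dim) for k in factors})
        self.l2_offset = nn.ModuleDict({k: ProsodyPredictor(dim) for k in factors})

    def forward(self, h, intensity: float):
        """intensity in [0, 1]: 0 -> standard (L1) prosody, 1 -> full accent."""
        return {k: self.l1[k](h) + intensity * self.l2_offset[k](h)
                for k in ("pitch", "energy", "duration")}

# Usage sketch: phoneme encodings for one utterance, rendered at 30% intensity.
hidden = torch.randn(1, 42, 256)             # (batch, phones, hidden_dim)
adaptor = AccentVarianceAdaptor(256)
prosody = adaptor(hidden, intensity=0.3)
print({k: v.shape for k, v in prosody.items()})
```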
FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the
right linguistic and affective prosody in a conversational context. The
correlation between the current utterance and the dialogue history at the
utterance level was used to improve the expressiveness of synthesized speech.
However, the fine-grained information in the dialogue history at the word level
also has an important impact on the prosodic expression of an utterance, which
has not been well studied in prior work. Therefore, we propose a novel
expressive conversational TTS model, termed FCTalker, that learns the fine-
and coarse-grained context dependencies at the same time during speech
generation. Specifically, FCTalker includes fine- and coarse-grained
encoders to exploit word-level and utterance-level context dependencies. To model
the word-level dependencies between an utterance and its dialogue history, the
fine-grained dialogue encoder is built on top of a dialogue BERT model. The
experimental results show that the proposed method outperforms all baselines
and generates more expressive speech that is contextually appropriate. We
release the source code at: https://github.com/walker-hyf/FCTalker.
Comment: 5 pages, 4 figures, 1 table. Submitted to ICASSP 2023. We release the
source code at: https://github.com/walker-hyf/FCTalker
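The sketch below illustrates the general fine-plus-coarse context idea (a hypothetical reading of the abstract; the stand-in encoders, pooling choices, and fusion layer are assumptions, not FCTalker's actual layers, which are available at the repository above): word-level history features, which in the paper come from a dialogue BERT model, and utterance-level history features are fused into a single conditioning vector for the TTS backbone.

```python
import torch
import torch.nn as nn

class FineGrainedContextEncoder(nn.Module):
    """Word-level history encoder; a plain embedding + GRU stands in for the
    dialogue BERT model used in the paper (assumption)."""
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, history_word_ids):              # (batch, num_words)
        _, h = self.rnn(self.embed(history_word_ids))
        return h.squeeze(0)                           # (batch, dim)

class CoarseGrainedContextEncoder(nn.Module):
    """Utterance-level history encoder over per-utterance sentence embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, history_utt_embs):              # (batch, num_utts, dim)
        _, h = self.rnn(history_utt_embs)
        return h.squeeze(0)                           # (batch, dim)

class ContextFusion(nn.Module):
    """Fuse fine and coarse context into one conditioning vector for TTS."""
    def __init__(self, dim=256):
        super().__init__()
        self.fine = FineGrainedContextEncoder(dim=dim)
        self.coarse = CoarseGrainedContextEncoder(dim=dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, history_word_ids, history_utt_embs):
        fused = torch.cat([self.fine(history_word_ids),
                           self.coarse(history_utt_embs)], dim=-1)
        return self.proj(fused)                       # conditioning for the TTS backbone

# Usage sketch: 2 dialogues, 30 history words, 5 history utterances each.
fusion = ContextFusion()
cond = fusion(torch.randint(0, 10000, (2, 30)), torch.randn(2, 5, 256))
print(cond.shape)                                     # torch.Size([2, 256])
```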
Accurate emotion strength assessment for seen and unseen speech based on data-driven deep learning
Emotion classification of speech and assessment of emotion strength are required in applications such as emotional text-to-speech and voice conversion. An emotion attribute ranking function based on a Support Vector Machine (SVM) was previously proposed to predict emotion strength for an emotional speech corpus. However, the trained ranking function does not generalize to new domains, which limits the scope of applications, especially for out-of-domain or unseen speech. In this paper, we propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech. This is achieved by fusing emotional data from various domains. We follow a multi-task learning network architecture that includes an acoustic encoder, a strength predictor, and an auxiliary emotion predictor. Experiments show that the emotion strength predicted by the proposed StrengthNet is highly correlated with ground-truth scores for both seen and unseen speech. We release the source code at: https://github.com/ttslr/StrengthNet
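A compact sketch of a multi-task layout of this kind (the dimensions, layer choices, and loss weights are assumptions; this is not the released StrengthNet code, which is at the repository linked above): a shared acoustic encoder feeds a strength regressor and an auxiliary emotion classifier, trained jointly.

```python
import torch
import torch.nn as nn

class StrengthNetSketch(nn.Module):
    """Shared acoustic encoder + strength regressor + auxiliary emotion classifier."""
    def __init__(self, n_mels=80, hidden=128, n_emotions=5):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.strength_head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                           nn.Linear(hidden, 1), nn.Sigmoid())
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, mel):                      # (batch, frames, n_mels)
        feats, _ = self.encoder(mel)
        pooled = feats.mean(dim=1)               # utterance-level representation
        strength = self.strength_head(pooled).squeeze(-1)   # score in [0, 1]
        return strength, self.emotion_head(pooled)

def multitask_loss(strength, emotion_logits, strength_target, emotion_target, beta=0.3):
    """Main strength regression loss plus weighted auxiliary emotion loss."""
    return nn.functional.mse_loss(strength, strength_target) \
         + beta * nn.functional.cross_entropy(emotion_logits, emotion_target)

# Usage sketch on a random batch pooled from multiple emotional corpora.
model = StrengthNetSketch()
mel = torch.randn(16, 200, 80)                   # 16 mel-spectrogram clips
strength, logits = model(mel)
loss = multitask_loss(strength, logits, torch.rand(16), torch.randint(0, 5, (16,)))
loss.backward()
```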
MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline
This paper introduces a high-quality open-source text-to-speech (TTS)
synthesis dataset for Mongolian, a low-resource language spoken by over 10
million people worldwide. The dataset, named MnTTS, consists of about 8 hours
of transcribed audio recordings spoken by a 22-year-old professional female
Mongolian announcer. It is the first publicly available dataset developed to
promote Mongolian TTS applications in both academia and industry. In this
paper, we share our experience by describing the dataset development procedures
and the challenges we faced. To demonstrate the reliability of our dataset, we built a
powerful non-autoregressive baseline system based on the FastSpeech2 model and the
HiFi-GAN vocoder, and evaluated it using the subjective mean opinion score
(MOS) and real time factor (RTF) metrics. Evaluation results show that the
powerful baseline system trained on our dataset achieves a MOS above 4 and an RTF
of about , which makes it applicable for practical use. The
dataset, training recipe, and pretrained TTS models are freely available
at https://github.com/walker-hyf/MnTTS.
Comment: Accepted at the 2022 International Conference on Asian Language
Processing (IALP 2022)
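As context for the evaluation metrics mentioned above, here is a small sketch (generic helper functions of my own, not the MnTTS recipe) of how the real-time factor and a mean opinion score are typically computed: RTF is wall-clock synthesis time divided by the duration of the generated audio, and MOS is the mean of listeners' ratings on a 1-5 scale.

```python
import time
from statistics import mean

def real_time_factor(synthesize, text: str, sample_rate: int = 22050) -> float:
    """RTF = synthesis time / duration of the generated audio.
    `synthesize` is any callable returning a sequence of audio samples;
    an RTF below 1 means synthesis is faster than real time."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)

def mean_opinion_score(ratings) -> float:
    """MOS: the mean of listener ratings on a 1-5 naturalness scale."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("MOS ratings must lie in [1, 5]")
    return mean(ratings)

# Usage sketch with a dummy synthesizer that "generates" 2 s of silence.
dummy_tts = lambda text: [0.0] * (2 * 22050)
print(f"RTF: {real_time_factor(dummy_tts, 'Sain baina uu'):.4f}")
print(f"MOS: {mean_opinion_score([4, 5, 4, 4, 5]):.2f}")
```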