154 research outputs found
A Fully Time-domain Neural Model for Subband-based Speech Synthesizer
This paper introduces a deep neural network model for a subband-based speech
synthesizer. The model exploits the short bandwidth of the subband signals
to reduce the complexity of the time-domain speech generator. We employ
multi-level wavelet analysis/synthesis to decompose/reconstruct the signal into
subbands in the time domain. Inspired by WaveNet, a convolutional neural
network (CNN) model predicts the subband speech signals fully in the time
domain. Because of the short bandwidth of the subbands, a simple network
architecture is enough to learn the simple patterns of the subbands accurately.
In ground-truth experiments with teacher forcing, the subband synthesizer
significantly outperforms the fullband model in terms of both subjective and
objective measures. In addition, by conditioning the model on the phoneme
sequence using a pronunciation dictionary, we achieve a fully time-domain
neural model for a subband-based text-to-speech (TTS) synthesizer, which is
nearly end-to-end. The generated speech of the subband TTS shows quality
comparable to the fullband one with a lighter network architecture for each
subband. Comment: 5 pages, 3 figures
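The wavelet analysis/synthesis step above can be illustrated with a single-level Haar filter bank; this is a minimal NumPy sketch, not the authors' exact filter bank, and a multi-level decomposition would recurse on the low band.

```python
import numpy as np

def haar_analysis(x):
    """Split a signal into low-pass (approximation) and high-pass
    (detail) subbands, each at half the original rate."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)   # approximation coefficients
    high = (even - odd) / np.sqrt(2.0)  # detail coefficients
    return low, high

def haar_synthesis(low, high):
    """Perfectly reconstruct the signal from the two subbands."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

x = np.sin(2 * np.pi * 5 * np.arange(64) / 64)  # toy "speech" frame
low, high = haar_analysis(x)
x_rec = haar_synthesis(low, high)
print(np.allclose(x, x_rec))  # → True (perfect reconstruction)
```

Each subband carries half the bandwidth of the input, which is the property the paper's per-subband generators rely on.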
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is becoming increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.
Style-Adaptive Speech Synthesis Techniques Using Deep Learning
Thesis (Ph.D.) -- Graduate School of Seoul National University: Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Nam Soo Kim.
Neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis has achieved remarkable generated speech quality, problems remain, such as the limited modeling power of neural statistical parametric speech synthesis systems, and the limited style expressiveness and lack of a robust attention model in end-to-end speech synthesis systems. In this thesis, novel alternatives are proposed to resolve these drawbacks of conventional neural speech synthesis systems.
In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. We also apply an adversarial learning scheme when training the AdVRNN to overcome the oversmoothing problem. The experimental results show that the proposed AdVRNN-based method outperforms conventional RNN-based techniques.
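The abstract does not spell out the adversarial objective used against oversmoothing; as one common instantiation, a least-squares GAN loss added to the acoustic model's usual training criterion might look like the following sketch (the scores, and the idea of a lambda-weighted combination with the VRNN's ELBO-style loss, are assumptions for illustration).

```python
import numpy as np

def lsgan_losses(d_real, d_fake):
    """Least-squares GAN objectives: the discriminator pushes scores on
    real acoustic frames toward 1 and on generated frames toward 0,
    while the generator pushes its own scores toward 1."""
    d_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    g_loss = 0.5 * np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss

# Scores a discriminator might emit early in training (made up).
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.2, 0.1, 0.3])
d_loss, g_loss = lsgan_losses(d_real, d_fake)
# The generator's total objective would add g_loss, weighted by some
# lambda, to the VRNN's reconstruction + KL terms.
```

Penalizing the generator for frames the discriminator recognizes as synthetic discourages the over-averaged spectra that plain likelihood training tends to produce.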
In the second approach, we propose a novel style modeling method employing a mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is applied to increase the target-style information and suppress the text information in the style embedding by adding a MINE loss term to the loss function. The experimental results show that the MINE-based method achieves promising performance in both speech quality and style similarity for the global style token Tacotron.
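MINE estimates mutual information by maximizing the Donsker-Varadhan lower bound with a learned statistics network. The sketch below evaluates that bound on toy Gaussian data with a fixed, hand-picked statistic T; the neural network for T, the style embeddings, and the loss weighting from the thesis are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def dv_bound(x, y, T):
    """Donsker-Varadhan lower bound on I(X;Y) that MINE maximizes:
    E_{p(x,y)}[T] - log E_{p(x)p(y)}[exp(T)]."""
    joint = T(x, y)                        # samples from the joint
    marginal = T(x, rng.permutation(y))    # shuffling y breaks dependence
    return joint.mean() - np.log(np.exp(marginal).mean())

n = 50000
x = rng.normal(size=n)
y_dep = x + 0.5 * rng.normal(size=n)       # strongly dependent pair
y_ind = rng.normal(size=n)                 # independent pair
T = lambda a, b: 0.3 * a * b               # hand-picked statistic
# The bound is clearly larger for the dependent pair.
print(dv_bound(x, y_dep, T) > dv_bound(x, y_ind, T))  # → True
```

In the thesis's setting, maximizing this bound between the style embedding and the target style (and penalizing it against the text) shapes what the embedding encodes.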
In the third approach, we propose a novel attention method for end-to-end speech synthesis called memory attention, which is inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence modeling power of the LSTM gating technique, memory attention obtains a stable alignment from content-based and location-based features. We evaluate memory attention and compare its performance with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can robustly generate speech with large variability.
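The gating idea can be sketched as a sigmoid gate blending content-based and location-based scores before the softmax; this is an illustration of the mechanism, not the thesis's exact memory-attention equations, and the scores and gate logit here are made up.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gated_attention(content_score, location_score, gate_logit):
    """Blend content-based and location-based attention scores with an
    LSTM-style sigmoid gate, then normalize to an alignment."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))      # gate in (0, 1)
    return softmax(g * content_score + (1.0 - g) * location_score)

content = np.array([0.1, 2.0, 0.3, 0.1])       # favors position 1
location = np.array([0.0, 0.5, 3.0, 0.2])      # favors position 2
w = gated_attention(content, location, gate_logit=4.0)  # gate ~ 0.98
print(w.argmax())  # → 1: with the gate nearly open, content wins
```

In a full model the gate logit would itself be predicted at each decoder step, letting the network decide how much to trust each feature type.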
In the last approach, we propose selective multi-attention for style-adaptive end-to-end speech synthesis systems. A conventional single-attention model may limit the expressivity needed to represent the numerous alignment paths that depend on style. To achieve variation in attention alignment, we propose a multi-attention model with a selection network. The multi-attention generates alignment candidates for the target style, and the selection network chooses the most appropriate attention among them. The experimental results show that selective multi-attention outperforms conventional single-attention techniques in multi-speaker and emotional speech synthesis.
1 Introduction 1
1.1 Background 1
1.2 Scope of thesis 3
2 Neural Speech Synthesis System 7
2.1 Overview of a Neural Statistical Parametric Speech Synthesis System 7
2.2 Overview of End-to-end Speech Synthesis System 9
2.3 Tacotron2 10
2.4 Attention Mechanism 12
2.4.1 Location Sensitive Attention 12
2.4.2 Forward Attention 13
2.4.3 Dynamic Convolution Attention 14
3 Neural Statistical Parametric Speech Synthesis using AdVRNN 17
3.1 Introduction 17
3.2 Background 19
3.2.1 Variational Autoencoder 19
3.2.2 Variational Recurrent Neural Network 20
3.3 Speech Synthesis Using AdVRNN 22
3.3.1 AdVRNN based Acoustic Modeling 23
3.3.2 Training Procedure 24
3.4 Experiments 25
3.4.1 Objective performance evaluation 28
3.4.2 Subjective performance evaluation 29
3.5 Summary 29
4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis 31
4.1 Introduction 31
4.2 Background 33
4.2.1 Mutual Information 33
4.2.2 Mutual Information Neural Estimator 34
4.2.3 Global Style Token 34
4.3 Style Token end-to-end speech synthesis using MINE 35
4.4 Experiments 36
4.5 Summary 38
5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis 45
5.1 Introduction 45
5.2 Background 48
5.3 Memory Attention 49
5.4 Experiments 52
5.4.1 Experiments on Single Speaker Speech Synthesis 53
5.4.2 Experiments on Emotional Speech Synthesis 56
5.5 Summary 59
6 Selective Multi-Attention for Style-Adaptive End-to-End Speech Synthesis 63
6.1 Introduction 63
6.2 Background 65
6.3 Selective multi-attention model 66
6.4 Experiments 67
6.4.1 Multi-speaker speech synthesis experiments 68
6.4.2 Experiments on Emotional Speech Synthesis 73
6.5 Summary 77
7 Conclusions 79
Bibliography 83
Abstract (in Korean) 93
Acknowledgements 95
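The selective multi-attention described in the thesis abstract above can be sketched as a softmax-weighted choice among candidate alignments; the candidates and selection logits here are made up, whereas in the thesis both come from learned networks.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def selective_multi_attention(candidates, selection_logits):
    """Combine K candidate alignments (K x T) using weights from a
    selection network; here the selection logits are given directly."""
    sel = softmax(selection_logits)            # one weight per candidate
    return sel @ candidates                    # (T,) blended alignment

# Three hypothetical candidate alignments over 4 encoder positions.
candidates = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
sel_logits = np.array([0.2, 5.0, 0.1])         # selector strongly prefers #1
alignment = selective_multi_attention(candidates, sel_logits)
print(alignment.argmax())  # → 1
```

A sharp selection softmax approximates picking a single candidate, while a softer one interpolates between alignment paths, which is how the style-dependent variation can be realized.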
Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis
Neural Text-to-speech (TTS) synthesis is a powerful technology that can
generate speech using neural networks. One of the most remarkable features of
TTS synthesis is its capability to produce speech in the voice of different
speakers. This paper introduces voice-cloning
(https://pypi.org/project/voice-cloning/), an open-source Python package that
helps people with speech disorders communicate more effectively and supports
professionals seeking to integrate voice cloning or speech synthesis
capabilities into their projects. This package aims to generate synthetic
speech that sounds like the natural voice of an individual, but it does not
replace the natural human voice. The architecture of the system comprises a
speaker verification system, a synthesizer, a vocoder, and noise reduction.
The speaker verification system is trained on a varied set of speakers to
achieve optimal generalization performance without relying on transcriptions.
The synthesizer is trained on both audio and transcriptions to generate a Mel
spectrogram from text, and the vocoder converts the generated Mel spectrogram
into the corresponding audio signal. The audio signal is then processed
by a noise reduction algorithm to eliminate unwanted noise and enhance speech
clarity. The performance of synthesized speech from seen and unseen speakers
is then evaluated using subjective and objective measures such as Mean
Opinion Score (MOS), Gross Pitch Error (GPE), and Spectral Distortion (SD). The
model can create speech in distinct voices by including speaker characteristics
that are chosen randomly.
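Of the metrics listed above, Gross Pitch Error is simple to state concretely. This sketch uses one common definition, the fraction of frames voiced in both signals whose F0 deviates from the reference by more than 20%; exact conventions vary across papers, and the frame values below are made up.

```python
import numpy as np

def gross_pitch_error(f0_ref, f0_syn, tol=0.2):
    """Fraction of frames voiced in both contours where the synthesized
    F0 deviates from the reference by more than `tol` (relative)."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)
    voiced = (f0_ref > 0) & (f0_syn > 0)       # 0 marks unvoiced frames
    err = np.abs(f0_syn[voiced] - f0_ref[voiced]) / f0_ref[voiced]
    return float(np.mean(err > tol))

ref = [120, 125, 130, 0, 140, 150]             # Hz per frame, 0 = unvoiced
syn = [118, 190, 131, 0, 141, 100]             # two gross errors
print(gross_pitch_error(ref, syn))  # → 0.4 (2 of 5 voiced frames)
```

A lower GPE indicates that the synthesized pitch contour tracks the reference speaker's intonation more faithfully.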
- …