11 research outputs found

    StoRIR: Stochastic Room Impulse Response Generation for Audio Data Augmentation

    In this paper we introduce StoRIR, a stochastic room impulse response (RIR) generation method dedicated to audio data augmentation in machine learning applications. In contrast to geometrical methods such as image-source or ray tracing, this technique does not require prior definition of room geometry, absorption coefficients, or microphone and source placement; it depends solely on the acoustic parameters of the room. The method is intuitive, easy to implement, and allows the generation of RIRs for very complicated enclosures. We show that StoRIR, when used for audio data augmentation in a speech enhancement task, allows deep learning models to achieve better results on a wide range of metrics than the conventional image-source method, improving many of them by more than 5%. We publish a Python implementation of StoRIR online. Comment: Accepted for INTERSPEECH 202
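    The core idea of stochastic RIR generation can be illustrated without any room geometry: shape random noise with an exponential energy decay derived from a reverberation-time parameter. The sketch below is only a generic illustration of that idea, not the StoRIR algorithm itself; the function name and the single `rt60` parameter are assumptions for the example.

    ```python
    import math
    import random

    def toy_stochastic_rir(rt60=0.5, fs=16000, seed=0):
        """Toy stochastic RIR: white noise under an exponential decay envelope.

        The envelope realizes the -60 dB amplitude drop implied by rt60
        seconds. This is a hedged sketch of the general concept only.
        """
        rng = random.Random(seed)
        n = int(rt60 * fs)
        # time constant so amplitude falls by 60 dB (factor 1e-3) over rt60 s
        tau = rt60 / (3.0 * math.log(10.0))
        return [rng.gauss(0.0, 1.0) * math.exp(-i / (fs * tau)) for i in range(n)]

    rir = toy_stochastic_rir()
    # Augmentation would then convolve clean speech with this response.
    ```

    A real method of this kind would additionally control early reflections, frequency-dependent decay, and the direct-to-reverberant ratio, which is where the room's acoustic parameters enter.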

    Expressive Multilingual Speech Synthesizer

    The aim of this doctoral thesis is to investigate the possibility of synthesizing speech in the voice of a speaker in a language that the speaker has never spoken. Multilingual models were created, both for languages whose speech databases are annotated using the same conventions and for languages whose databases are annotated using different conventions, including Serbian. In terms of the quality of the synthesized speech, some models even surpass standard models trained on speech material in a single language. In addition to the architecture for multilingual models, a method for adapting such a model to the data of a new speaker is proposed. The proposed adaptation method enables fast and simple production of new voices while preserving the ability to synthesize speech in any language supported by the model, regardless of the new speaker's original language.

    Noisy speech database for training speech enhancement algorithms and TTS models

    Clean and noisy parallel speech database. The database was designed to train and test speech enhancement methods that operate at 48 kHz. A more detailed description can be found in the papers associated with the database. For the 28-speaker dataset, details can be found in: C. Valentini-Botinhao, X. Wang, S. Takaki & J. Yamagishi, "Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System using Deep Recurrent Neural Networks", in Proc. Interspeech 2016. For the 56-speaker dataset: C. Valentini-Botinhao, X. Wang, S. Takaki & J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech", in Proc. SSW 2016. Some of the noises used to create the noisy speech were obtained from the DEMAND database, available at http://parole.loria.fr/DEMAND/. The speech was obtained from the CSTR VCTK Corpus, available at http://dx.doi.org/10.7488/ds/1994. The speech-shaped and babble noise files used to create this dataset are available at http://homepages.inf.ed.ac.uk/cvbotinh/se/noises/. Citation: Valentini-Botinhao, Cassia. (2017). Noisy speech database for training speech enhancement algorithms and TTS models, 2016 [sound]. University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR). http://dx.doi.org/10.7488/ds/2117
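    Noisy parallel corpora of this kind are typically built by scaling a noise recording and adding it to clean speech at a chosen signal-to-noise ratio. The sketch below shows that standard construction in a hedged form; the function name and the assumption of equal-length float sample lists are illustrative, and this is not the exact procedure used for this database.

    ```python
    import math

    def mix_at_snr(clean, noise, snr_db):
        """Add noise to clean speech at a target SNR in dB.

        Assumes equal-length lists of float samples. The noise is scaled so
        that clean power / scaled-noise power equals 10 ** (snr_db / 10).
        """
        p_clean = sum(x * x for x in clean) / len(clean)
        p_noise = sum(x * x for x in noise) / len(noise)
        # gain g so that p_clean / (g**2 * p_noise) == 10 ** (snr_db / 10)
        g = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return [c + g * n for c, n in zip(clean, noise)]
    ```

    In practice the noise clip is also cropped or looped to match the utterance length, and mixtures are generated at several SNRs per speaker.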