11 research outputs found

    StoRIR: Stochastic Room Impulse Response Generation for Audio Data Augmentation

    In this paper we introduce StoRIR, a stochastic room impulse response (RIR) generation method dedicated to audio data augmentation in machine learning applications. In contrast to geometrical methods such as image-source or ray tracing, this technique does not require prior definition of room geometry, absorption coefficients, or microphone and source placement; it depends solely on the acoustic parameters of the room. The method is intuitive, easy to implement, and allows the generation of RIRs for very complicated enclosures. We show that StoRIR, when used for audio data augmentation in a speech enhancement task, allows deep learning models to achieve better results on a wide range of metrics than the conventional image-source method, improving many of them by more than 5%. We publish a Python implementation of StoRIR online. Comment: Accepted for INTERSPEECH 202
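    The core idea of stochastic RIR generation can be illustrated without any room geometry: shape random noise with an exponential energy decay derived from a reverberation-time parameter. The sketch below is only a generic illustration of that idea, not the StoRIR algorithm itself; the function name and the single `rt60` parameter are assumptions for the example.

    ```python
    import math
    import random

    def toy_stochastic_rir(rt60=0.5, fs=16000, seed=0):
        """Toy stochastic RIR: white noise under an exponential decay envelope.

        The envelope realizes the -60 dB amplitude drop implied by rt60
        seconds. This is a hedged sketch of the general concept only.
        """
        rng = random.Random(seed)
        n = int(rt60 * fs)
        # time constant so amplitude falls by 60 dB (factor 1e-3) over rt60 s
        tau = rt60 / (3.0 * math.log(10.0))
        return [rng.gauss(0.0, 1.0) * math.exp(-i / (fs * tau)) for i in range(n)]

    rir = toy_stochastic_rir()
    # Augmentation would then convolve clean speech with this response.
    ```

    A real method of this kind would additionally control early reflections, frequency-dependent decay, and the direct-to-reverberant ratio, which is where the room's acoustic parameters enter.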

    Expressive Multilingual Speech Synthesizer

    The aim of this doctoral thesis is to investigate the possibility of synthesizing speech in the voice of a speaker in a language that the speaker has never spoken. Multilingual models were created, both for languages whose speech databases are annotated using the same conventions and for languages whose databases are annotated using different conventions, including Serbian. In terms of the quality of the synthesized speech, some models even surpass standard models trained on speech material in a single language. In addition to the architecture for multilingual models, a method for adapting such a model to the data of a new speaker is proposed. The proposed adaptation method enables fast and simple production of new voices while preserving the ability to synthesize speech in any language supported by the model, regardless of the new speaker's original language.

    Noisy speech database for training speech enhancement algorithms and TTS models

    Clean and noisy parallel speech database. The database was designed to train and test speech enhancement methods that operate at 48 kHz. A more detailed description can be found in the papers associated with the database. For the 28-speaker dataset, details can be found in: C. Valentini-Botinhao, X. Wang, S. Takaki & J. Yamagishi, "Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System using Deep Recurrent Neural Networks", in Proc. Interspeech 2016. For the 56-speaker dataset: C. Valentini-Botinhao, X. Wang, S. Takaki & J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech", in Proc. SSW 2016. Some of the noises used to create the noisy speech were obtained from the DEMAND database, available at http://parole.loria.fr/DEMAND/. The speech was obtained from the CSTR VCTK Corpus, available at http://dx.doi.org/10.7488/ds/1994. The speech-shaped and babble noise files used to create this dataset are available at http://homepages.inf.ed.ac.uk/cvbotinh/se/noises/. Citation: Valentini-Botinhao, Cassia. (2017). Noisy speech database for training speech enhancement algorithms and TTS models, 2016 [sound]. University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR). http://dx.doi.org/10.7488/ds/2117
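    Noisy parallel corpora of this kind are typically built by scaling a noise recording and adding it to clean speech at a chosen signal-to-noise ratio. The sketch below shows that standard construction in a hedged form; the function name and the assumption of equal-length float sample lists are illustrative, and this is not the exact procedure used for this database.

    ```python
    import math

    def mix_at_snr(clean, noise, snr_db):
        """Add noise to clean speech at a target SNR in dB.

        Assumes equal-length lists of float samples. The noise is scaled so
        that clean power / scaled-noise power equals 10 ** (snr_db / 10).
        """
        p_clean = sum(x * x for x in clean) / len(clean)
        p_noise = sum(x * x for x in noise) / len(noise)
        # gain g so that p_clean / (g**2 * p_noise) == 10 ** (snr_db / 10)
        g = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return [c + g * n for c, n in zip(clean, noise)]
    ```

    In practice the noise clip is also cropped or looped to match the utterance length, and mixtures are generated at several SNRs per speaker.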