Controllable Emphasis with zero data for text-to-speech
We present a scalable method to produce high quality emphasis for
text-to-speech (TTS) that does not require recordings or annotations. Many TTS
models include a phoneme duration model. A simple but effective method to
achieve emphasized speech is to increase the predicted duration of the
emphasized word. We show that this is significantly better than spectrogram
modification techniques, improving both naturalness and testers' correct
identification of the emphasized word in a sentence on a reference
female en-US voice. We show that this technique significantly closes the gap to
methods that require explicit recordings. The method proved to be scalable and
preferred in all four languages tested (English, Spanish, Italian, German), for
different voices and multiple speaking styles.
Comment: In proceedings of the 12th Speech Synthesis Workshop (SSW) 202
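The duration-based emphasis idea above can be sketched in a few lines: a duration model predicts per-phoneme frame counts, and the phonemes of the emphasized word are simply lengthened. The function name, the word-to-phoneme spans, and the 1.5x scaling factor are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def emphasize_durations(durations, word_spans, emphasized_word, scale=1.5):
    """Scale predicted phoneme durations for one emphasized word.

    durations:       predicted frame counts, one entry per phoneme
    word_spans:      dict mapping word -> (start, end) phoneme indices
    emphasized_word: the word whose phonemes get lengthened
    """
    durations = np.asarray(durations, dtype=float).copy()
    start, end = word_spans[emphasized_word]
    durations[start:end] *= scale            # lengthen only that word
    return np.round(durations).astype(int)   # frame counts must be integers

# Toy example: "the cat sat", phonemes grouped per word.
pred = [3, 4, 5, 6, 4, 5, 6]
spans = {"the": (0, 2), "cat": (2, 4), "sat": (4, 7)}
out = emphasize_durations(pred, spans, "cat")
print(out.tolist())  # "cat" phonemes lengthened: [3, 4, 8, 9, 4, 5, 6]
```

The rest of the synthesis pipeline is untouched, which is why the approach needs no emphasis recordings or annotations.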
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for
Big Adaptive Streamable TTS with
Emergent abilities. BASE TTS is the largest TTS model to date,
trained on 100K hours of public domain speech data, achieving a new
state-of-the-art in speech naturalness. It deploys a 1-billion-parameter
autoregressive Transformer that converts raw texts into discrete codes
("speechcodes") followed by a convolution-based decoder which converts these
speechcodes into waveforms in an incremental, streamable manner. Further, our
speechcodes are built using a novel speech tokenization technique that features
speaker ID disentanglement and compression with byte-pair encoding. Echoing the
widely-reported "emergent abilities" of large language models when trained on
increasing volumes of data, we show that BASE TTS variants built with 10K+ hours
and 500M+ parameters begin to demonstrate natural prosody on textually complex
sentences. We design and share a specialized dataset to measure these emergent
abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE
TTS by evaluating against baselines that include publicly available large-scale
text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated
by the model can be heard at https://amazon-ltts-paper.com/.
Comment: v1.1 (fixed typos)
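The compression step the abstract mentions, byte-pair encoding over discrete speech tokens, can be illustrated with a minimal sketch: repeatedly replace the most frequent adjacent token pair with a new symbol, shortening the sequence. The token values and merge count here are invented for the example; BASE TTS's actual tokenizer is more involved.

```python
from collections import Counter

def bpe_compress(tokens, num_merges):
    """Repeatedly merge the most frequent adjacent pair into a new symbol."""
    tokens = list(tokens)
    next_symbol = max(tokens) + 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing worth merging
            break
        merges.append(((a, b), next_symbol))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(next_symbol)  # replace the pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_symbol += 1
    return tokens, merges

codes = [1, 2, 1, 2, 3, 1, 2, 3]
compressed, merges = bpe_compress(codes, num_merges=2)
print(compressed)  # [4, 5, 5] -- shorter after merging frequent pairs
```

Shorter speechcode sequences mean fewer autoregressive steps for the 1-billion-parameter Transformer, which is the practical payoff of the compression.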
Field inversion and machine learning in turbulence modeling
Turbulence closure models will continue to be necessary in order to perform computationally affordable simulations in the foreseeable future. It is expected that Reynolds-averaged Navier-Stokes (RANS) turbulence models will remain useful even as the more accurate, but computationally expensive, large eddy simulation (LES) develops further, especially in industry. The use of the robust but often inaccurate linear eddy viscosity closures is still widespread in industry. More complex closure models, such as Reynolds stress models and nonlinear eddy viscosity models, provide a more general description of the underlying physics of turbulent flows. Nevertheless, because of implementation difficulties or failure to provide consistent improvements over the more robust linear models, RANS turbulence modeling is considered to have reached a plateau. In the past few years, the availability of high-fidelity datasets, the increased accuracy of machine learning algorithms, and the rise in computational power have led to the proposal of several data-driven approaches to turbulence modeling. The general idea is to use experimental and high-fidelity data to develop or enhance RANS turbulence models, instead of employing an approach based purely on physics. As in any emerging field, there are many possibilities for further developing these novel approaches to data-driven turbulence modeling. Recent work combined machine learning with statistical inversion. First, a spatially varying correction is applied to the RANS model and optimized by minimizing the discrepancy between the RANS output and the reference data for several flows. Machine learning is then used to learn a mapping from a set of flow features to the inferred corrections.
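The inversion step just described can be illustrated on a toy problem: a spatially varying multiplicative correction beta is inferred by gradient descent so that the corrected model matches reference data, with a regularization term keeping beta near its baseline value of 1. The linear toy model and all numbers here are illustrative assumptions; the actual work applies this to a RANS model with adjoint-computed gradients.

```python
import numpy as np

n = 20
base = np.ones(n)                                      # baseline model prediction
truth = 1.0 + 0.3 * np.sin(np.linspace(0, np.pi, n))   # "high-fidelity" data

beta = np.ones(n)   # correction field, start with no correction
lam = 1e-3          # prior weight pulling beta toward 1 (regularization)
lr = 0.1            # gradient-descent step size

# Minimize J(beta) = sum((beta*base - truth)^2) + lam * sum((beta - 1)^2)
for _ in range(500):
    resid = beta * base - truth                 # model-vs-data discrepancy
    grad = 2 * resid * base + 2 * lam * (beta - 1.0)
    beta -= lr * grad                           # steepest-descent update

print(float(np.abs(beta * base - truth).max()))  # small remaining residual
```

In the real setting the model is nonlinear in beta, so the gradient cannot be written down directly and is instead obtained from an adjoint solve, as the second part of the abstract describes.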
The aim of this work is to further investigate this methodology, called the paradigm of field inversion and machine learning, in a broader set of test cases by inferring a spatially varying correction to the production term of the ω-equation in the k−ω model and to the eigenvalues of the Reynolds stress tensor. The gradients of this high-dimensional optimization process are obtained by implementing the continuous adjoint of the k−ω model in OpenFOAM. Gaussian processes and random forests are then used to learn a mapping from mean flow features to the inferred corrections. It was found that both formulations are able to accurately infer the mean velocities and related quantities of interest, but that the inferred corrective terms are often non-unique or not physically interpretable. For several flow cases, the corrective terms were able to generalize to unseen Reynolds numbers and flow geometries.
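The machine-learning step, regressing the inferred corrections onto mean-flow features so they can be predicted for unseen cases, can be sketched with a small Gaussian-process regression in plain NumPy. The single scalar feature, RBF kernel, and hyperparameters are illustrative assumptions, not the thesis's actual feature set.

```python
import numpy as np

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between 1-D feature vectors a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length**2)

# Training pairs: flow feature x -> inferred correction beta(x).
x_train = np.linspace(0.0, 1.0, 8)
beta_train = 1.0 + 0.3 * np.sin(np.pi * x_train)

noise = 1e-6                                   # jitter for numerical stability
K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
alpha = np.linalg.solve(K, beta_train - 1.0)   # regress the deviation from 1

def predict_beta(x_new):
    """GP posterior mean of the correction at new feature values."""
    return 1.0 + rbf(x_new, x_train) @ alpha

x_test = np.array([0.25, 0.5])
print(predict_beta(x_test))  # close to 1 + 0.3*sin(pi*x)
```

Using the deviation from 1 as the regression target means the model falls back to "no correction" far from the training data, a conservative default that matters when extrapolating to unseen flows.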