Controllable Emphasis with zero data for text-to-speech
We present a scalable method to produce high quality emphasis for
text-to-speech (TTS) that does not require recordings or annotations. Many TTS
models include a phoneme duration model. A simple but effective method to
achieve emphasized speech is to increase the predicted duration of the
emphasized word. We show that this is significantly better than spectrogram
modification techniques, improving both naturalness and testers' correct
identification of the emphasized word in a sentence on a reference
female en-US voice. We show that this technique significantly closes the gap to
methods that require explicit recordings. The method proved to be scalable and
preferred in all four languages tested (English, Spanish, Italian, German), for
different voices and multiple speaking styles.
Comment: In proceedings of the 12th Speech Synthesis Workshop (SSW) 202
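The duration-based emphasis idea above can be sketched in a few lines: a duration model predicts per-phoneme frame counts, and the phonemes of the emphasized word are simply lengthened. The function name, the word-to-phoneme spans, and the 1.5x scaling factor are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def emphasize_durations(durations, word_spans, emphasized_word, scale=1.5):
    """Scale predicted phoneme durations for one emphasized word.

    durations:       predicted frame counts, one entry per phoneme
    word_spans:      dict mapping word -> (start, end) phoneme indices
    emphasized_word: the word whose phonemes get lengthened
    """
    durations = np.asarray(durations, dtype=float).copy()
    start, end = word_spans[emphasized_word]
    durations[start:end] *= scale            # lengthen only that word
    return np.round(durations).astype(int)   # frame counts must be integers

# Toy example: "the cat sat", phonemes grouped per word.
pred = [3, 4, 5, 6, 4, 5, 6]
spans = {"the": (0, 2), "cat": (2, 4), "sat": (4, 7)}
out = emphasize_durations(pred, spans, "cat")
print(out.tolist())  # "cat" phonemes lengthened: [3, 4, 8, 9, 4, 5, 6]
```

The rest of the synthesis pipeline is untouched, which is why the approach needs no emphasis recordings or annotations.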
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for
Big Adaptive Streamable TTS with
Emergent abilities. BASE TTS is the largest TTS model to date,
trained on 100K hours of public domain speech data, achieving a new
state-of-the-art in speech naturalness. It deploys a 1-billion-parameter
autoregressive Transformer that converts raw texts into discrete codes
("speechcodes") followed by a convolution-based decoder which converts these
speechcodes into waveforms in an incremental, streamable manner. Further, our
speechcodes are built using a novel speech tokenization technique that features
speaker ID disentanglement and compression with byte-pair encoding. Echoing the
widely-reported "emergent abilities" of large language models when trained on
increasing volumes of data, we show that BASE TTS variants built with 10K+ hours
and 500M+ parameters begin to demonstrate natural prosody on textually complex
sentences. We design and share a specialized dataset to measure these emergent
abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE
TTS by evaluating against baselines that include publicly available large-scale
text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated
by the model can be heard at https://amazon-ltts-paper.com/.
Comment: v1.1 (fixed typos)
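The compression step the abstract mentions, byte-pair encoding over discrete speech tokens, can be illustrated with a minimal sketch: repeatedly replace the most frequent adjacent token pair with a new symbol, shortening the sequence. The token values and merge count here are invented for the example; BASE TTS's actual tokenizer is more involved.

```python
from collections import Counter

def bpe_compress(tokens, num_merges):
    """Repeatedly merge the most frequent adjacent pair into a new symbol."""
    tokens = list(tokens)
    next_symbol = max(tokens) + 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing worth merging
            break
        merges.append(((a, b), next_symbol))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(next_symbol)  # replace the pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_symbol += 1
    return tokens, merges

codes = [1, 2, 1, 2, 3, 1, 2, 3]
compressed, merges = bpe_compress(codes, num_merges=2)
print(compressed)  # [4, 5, 5] -- shorter after merging frequent pairs
```

Shorter speechcode sequences mean fewer autoregressive steps for the 1-billion-parameter Transformer, which is the practical payoff of the compression.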
Field inversion and machine learning in turbulence modeling
Turbulence closure models will continue to be necessary in order to perform computationally affordable simulations in the foreseeable future. It is expected that Reynolds-averaged Navier-Stokes (RANS) turbulence models will remain useful even as the more accurate, but computationally expensive, large eddy simulation (LES) develops further, especially in industry. The use of the robust but often inaccurate linear eddy viscosity closures is still widespread in industry. More complex closure models, such as Reynolds stress models and nonlinear eddy viscosity models, provide a more general description of the underlying physics of turbulent flows. Nevertheless, because of implementation difficulties or failure to provide consistent improvements over the more robust linear models, RANS turbulence modeling is considered to have reached a plateau. In the past few years, the availability of high-fidelity datasets, the increased accuracy of machine learning algorithms, and the rise in computational power have led to the proposal of several data-driven approaches to turbulence modeling. The general idea is to use experimental and high-fidelity data to develop or enhance RANS turbulence models, instead of employing an approach based purely on physics. As in any emerging field, there are many possibilities for further developing these novel approaches to data-driven turbulence modeling. Recent work combined machine learning with statistical inversion. First, a spatially varying correction is applied to the RANS model and optimized by minimizing the discrepancy between the RANS output and the reference data for several flows. Machine learning is then used to learn a mapping from a set of flow features to the inferred corrections.
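The inversion step just described can be illustrated on a toy problem: a spatially varying multiplicative correction beta is inferred by gradient descent so that the corrected model matches reference data, with a regularization term keeping beta near its baseline value of 1. The linear toy model and all numbers here are illustrative assumptions; the actual work applies this to a RANS model with adjoint-computed gradients.

```python
import numpy as np

n = 20
base = np.ones(n)                                      # baseline model prediction
truth = 1.0 + 0.3 * np.sin(np.linspace(0, np.pi, n))   # "high-fidelity" data

beta = np.ones(n)   # correction field, start with no correction
lam = 1e-3          # prior weight pulling beta toward 1 (regularization)
lr = 0.1            # gradient-descent step size

# Minimize J(beta) = sum((beta*base - truth)^2) + lam * sum((beta - 1)^2)
for _ in range(500):
    resid = beta * base - truth                 # model-vs-data discrepancy
    grad = 2 * resid * base + 2 * lam * (beta - 1.0)
    beta -= lr * grad                           # steepest-descent update

print(float(np.abs(beta * base - truth).max()))  # small remaining residual
```

In the real setting the model is nonlinear in beta, so the gradient cannot be written down directly and is instead obtained from an adjoint solve, as the second part of the abstract describes.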
The aim of this work is to further investigate this methodology, called the paradigm of field inversion and machine learning, in a broader set of test cases by inferring a spatially varying correction to the production term of the ω-equation in the k−ω model and to the eigenvalues of the Reynolds stress tensor. The gradients of this high-dimensional optimization process are obtained by implementing the continuous adjoint of the k−ω model in OpenFOAM. Gaussian processes and random forests are then used to learn a mapping from mean flow features to the inferred corrections. It was found that both formulations are able to accurately infer the mean velocities and related quantities of interest, but that the inferred corrective terms are often non-unique or not physically interpretable. For several flow cases, the corrective terms were able to generalize to unseen Reynolds numbers and flow geometries.
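The machine-learning step, regressing the inferred corrections onto mean-flow features so they can be predicted for unseen cases, can be sketched with a small Gaussian-process regression in plain NumPy. The single scalar feature, RBF kernel, and hyperparameters are illustrative assumptions, not the thesis's actual feature set.

```python
import numpy as np

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between 1-D feature vectors a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length**2)

# Training pairs: flow feature x -> inferred correction beta(x).
x_train = np.linspace(0.0, 1.0, 8)
beta_train = 1.0 + 0.3 * np.sin(np.pi * x_train)

noise = 1e-6                                   # jitter for numerical stability
K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
alpha = np.linalg.solve(K, beta_train - 1.0)   # regress the deviation from 1

def predict_beta(x_new):
    """GP posterior mean of the correction at new feature values."""
    return 1.0 + rbf(x_new, x_train) @ alpha

x_test = np.array([0.25, 0.5])
print(predict_beta(x_test))  # close to 1 + 0.3*sin(pi*x)
```

Using the deviation from 1 as the regression target means the model falls back to "no correction" far from the training data, a conservative default that matters when extrapolating to unseen flows.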