65 research outputs found
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech
synthesis directly from text. The system is composed of a recurrent
sequence-to-sequence feature prediction network that maps character embeddings
to mel-scale spectrograms, followed by a modified WaveNet model acting as a
vocoder to synthesize time-domain waveforms from those spectrograms. Our model
achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for
professionally recorded speech. To validate our design choices, we present
ablation studies of key components of our system and evaluate the impact of
using mel spectrograms as the input to WaveNet instead of linguistic, duration,
and F0 features. We further demonstrate that using a compact acoustic
intermediate representation enables significant simplification of the WaveNet
architecture.
Comment: Accepted to ICASSP 201
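As an illustration of the mel-scale spectrogram that serves as the intermediate representation between the two networks, here is a minimal NumPy sketch. It is not the paper's front end: the HTK-style mel formula, the Hann window, and the parameter values (`n_fft=1024`, `hop=256`, `n_mels=80`, the `1e-5` clipping floor) are all assumptions chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style mel formula; one common convention, assumed here.
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        # Each filter is the positive part of the smaller of the two ramps.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_spectrogram(signal, sample_rate, n_fft=1024, hop=256, n_mels=80):
    """Log-compressed mel spectrogram with shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    mel = magnitude @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(np.clip(mel, 1e-5, None))  # floor avoids log(0)

# Usage: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
tone = np.sin(2.0 * np.pi * 440.0 * np.arange(sr) / sr)
features = mel_spectrogram(tone, sr)
```

In a system of the kind described, a feature prediction network would output frames like the rows of `features`, and a vocoder would turn them back into a waveform; the sketch only covers how such frames could be computed from audio.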
Large Language Model-guided Document Selection
Large Language Model (LLM) pre-training exhausts an ever-growing compute
budget, yet recent research has demonstrated that careful document selection
enables comparable model quality with only a fraction of the FLOPs. Inspired by
efforts suggesting that domain-specific training document selection is in fact
an interpretable process [Gunasekar et al., 2023], as well as research showing
that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et
al., 2023], we explore a promising direction for scalable general-domain
document selection: employing a prompted LLM as a document grader, we distill
quality labels into a classifier model, which is then applied autonomously at
scale to a large, already heavily filtered, web-crawl-derived corpus. Following
the guidance of this classifier, we drop 75% of the corpus and train LLMs on
the remaining data. Results across multiple benchmarks show that (1) filtering
allows us to quality-match a model trained on the full corpus across diverse
benchmarks with at most 70% of the FLOPs; (2) more capable LLM labelers and
classifier models lead to better results that are less sensitive to the
labeler's prompt; and (3) in-context learning helps to boost the performance of
less-capable labeling models. In all cases we use open-source datasets, models,
recipes, and evaluation frameworks, so that results can be reproduced by the
community.
Comment: 9 pages
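The grade, distill, and filter steps described in this abstract can be sketched on toy data as follows. This is only an assumed illustration, not the authors' recipe: a hard-coded label list stands in for the prompted LLM grader, a one-feature logistic regression stands in for the classifier model, and every name and number below (`fit_logistic`, `keep_top_fraction`, the feature values) is hypothetical. Only the 25% keep rate comes from the abstract's statement that 75% of the corpus is dropped.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(features, labels, lr=0.5, steps=500):
    """Distill binary quality labels into a logistic-regression scorer."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y      # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def keep_top_fraction(scores, keep_fraction=0.25):
    """Return the sorted indices of the highest-scoring documents."""
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(len(scores) * keep_fraction)))
    # argsort is ascending, so the last n_keep entries are the top scores.
    return np.sort(np.argsort(scores, kind="stable")[-n_keep:])

# Made-up one-dimensional document features (e.g. some quality proxy).
doc_features = np.array([[0.1], [0.9], [0.3], [0.8], [0.0], [0.4], [0.7], [0.2]])
# Stand-in for the prompted LLM grader: 1 = high quality, 0 = low quality.
grader_labels = [0, 1, 0, 1, 0, 0, 1, 0]

w, b = fit_logistic(doc_features, grader_labels)
scores = sigmoid(doc_features @ w + b)     # classifier applied to the whole corpus
kept = keep_top_fraction(scores, keep_fraction=0.25)  # drop 75% of the documents
```

In practice the grader would label only a small sample and the classifier would score a far larger corpus; the toy example labels and scores the same eight documents purely to stay short.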
