StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that
leverages style diffusion and adversarial training with large speech language
models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its
predecessor by modeling styles as a latent random variable through diffusion
models to generate the most suitable style for the text without requiring
reference speech, achieving efficient latent diffusion while benefiting from
the diverse speech synthesis offered by diffusion models. Furthermore, we
employ large pre-trained SLMs, such as WavLM, as discriminators with our novel
differentiable duration modeling for end-to-end training, resulting in improved
speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker
LJSpeech dataset and matches them on the multispeaker VCTK dataset as judged by
native English speakers. Moreover, when trained on the LibriTTS dataset, our
model outperforms previous publicly available models for zero-shot speaker
adaptation. This work achieves the first human-level TTS on both single and
multispeaker datasets, showcasing the potential of style diffusion and
adversarial training with large SLMs. The audio demos and source code are
available at https://styletts2.github.io/
The limits of the Mean Opinion Score for speech synthesis evaluation
The release of WaveNet and Tacotron has forever transformed the speech synthesis landscape. Thanks to these game-changing innovations, the quality of synthetic speech has reached unprecedented levels. However, to measure this leap in quality, an overwhelming majority of studies still rely on the Absolute Category Rating (ACR) protocol and compare systems using its output, the Mean Opinion Score (MOS). This protocol is not without controversy, and as the current state-of-the-art synthesis systems now produce outputs remarkably close to human speech, it is now vital to determine how reliable this score is. To do so, we conducted a series of four experiments replicating and following the 2013 edition of the Blizzard Challenge. With these experiments, we asked four questions about the MOS: How stable is the MOS of a system across time? How do the scores of lower quality systems influence the MOS of higher quality systems? How does the introduction of modern technologies influence the scores of past systems? How does the MOS of modern technologies evolve in isolation? The results of our experiments are manifold. Firstly, we verify the superiority of modern technologies in comparison to historical synthesis. Then, we show that despite its origin as an absolute category rating, MOS is a relative score. While minimal variations are observed during the replication of the 2013-EH2 task, these variations can still lead to different conclusions for the intermediate systems. Our experiments also illustrate the sensitivity of MOS to the presence/absence of lower and higher anchors. Overall, our experiments suggest that we may have reached the end of a cul-de-sac by only evaluating the overall quality with MOS. We must embark on a new road and develop different evaluation protocols better suited to the analysis of modern speech synthesis technologies.
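As a point of reference for the abstract above, the MOS is conventionally the arithmetic mean of listeners' ACR ratings on a 1 to 5 scale. The following is a minimal sketch of that calculation, not the procedure used in the paper: the function name `mean_opinion_score`, the sample ratings, and the normal-approximation 95% interval are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for ACR ratings on a 1-5 scale.

    Hypothetical helper for illustration; the interval uses a simple
    normal approximation (1.96 standard errors), which is an assumption.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    mos = mean(ratings)
    # Standard error of the mean, scaled to an approximate 95% interval.
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up ratings for one system (not data from any study).
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Because each rating is given in the context of the other stimuli in the same test, two MOS values computed this way are only directly comparable when they come from the same listening test, which is the sensitivity to anchors that the abstract describes.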
An overview & analysis of sequence-to-sequence emotional voice conversion
Emotional voice conversion (EVC) focuses on converting a speech utterance from a source to a target emotion; it can thus be a key enabling technology for human-computer interaction applications and beyond. However, EVC remains an unsolved research problem with several challenges. In particular, as speech rate and rhythm are two key factors of emotional conversion, models have to generate output sequences of differing length. Sequence-to-sequence modelling has recently emerged as a competitive paradigm for models that can overcome those challenges. In an attempt to stimulate further research in this promising new direction, recent sequence-to-sequence EVC papers were systematically investigated and reviewed from six perspectives: their motivation, training strategies, model architectures, datasets, model inputs, and evaluation methods. This information is organised to provide the research community with an easily digestible overview of the current state-of-the-art. Finally, we discuss existing challenges of sequence-to-sequence EVC.
A MODEL FOR PREDICTING THE PERFORMANCE OF IP VIDEOCONFERENCING
With the incorporation of free desktop videoconferencing (DVC) software on the
majority of the world's PCs over recent years, there has inevitably been considerable
interest in using DVC over the Internet. The growing popularity of DVC
increases the need for multimedia quality assessment. However, the task of predicting
the perceived multimedia quality over the Internet Protocol (IP) networks is
complicated by the fact that the audio and video streams are susceptible to unique
impairments due to the unpredictable nature of IP networks, different types of task
scenarios, different levels of complexity, and other related factors. To date, a standard
consensus to define the IP media Quality of Service (QoS) has yet to be implemented.
The thesis addresses this problem by investigating a new approach to
assess the quality of audio, video, and audiovisual overall as perceived in low cost
DVC systems.
The main aim of the thesis is to investigate current methods used to assess the perceived
IP media quality, and then propose a model which will predict the quality of
audiovisual experience from prevailing network parameters.
This thesis investigates the effects of various traffic conditions, such as packet loss,
jitter, and delay, and of other factors that may influence end-user acceptance, when low
cost DVC is used over the Internet. It also investigates the interaction effects between
the audio and video media, and the issues involving lip synchronisation
error. The thesis provides the empirical evidence that the subjective mean opinion
score (MOS) of the perceived multimedia quality is unaffected by lip synchronisation
error in low cost DVC systems.
The data-gathering approach that is advocated in this thesis involves both field and
laboratory trials to enable the comparisons of results between classroom-based experiments
and real-world environments to be made, and to provide actual real-world
confirmation of the bench tests. The subjective test method was employed
since it has been proven to be more robust and suitable for the research studies, as
compared to objective testing techniques.
The MOS results, and the number of observations obtained, have enabled a set of
criteria to be established that can be used to determine the acceptable QoS for given
network conditions and task scenarios. Based upon these comprehensive findings,
the final contribution of the thesis is the proposal of a new adaptive architecture
method that is intended to enable the performance of IP based DVC of a particular
session to be predicted for a given network condition.
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era
Speech is the fundamental mode of human communication, and its synthesis has
long been a core priority in human-computer interaction research. In recent
years, machines have managed to master the art of generating speech that is
understandable by humans. But the linguistic content of an utterance
encompasses only a part of its meaning. Affect, or expressivity, has the
capacity to turn speech into a medium capable of conveying intimate thoughts,
feelings, and emotions -- aspects that are essential for engaging and
naturalistic interpersonal communication. While the goal of imparting
expressivity to synthesised utterances has so far remained elusive, following
recent advances in text-to-speech synthesis, a paradigm shift is well under way
in the fields of affective speech synthesis and conversion as well. Deep
learning, as the technology which underlies most of the recent advances in
artificial intelligence, is spearheading these efforts. In the present
overview, we outline ongoing trends and summarise state-of-the-art approaches
in an attempt to provide a comprehensive overview of this exciting field.
Comment: Submitted to the Proceedings of IEE