122 research outputs found
Speaking rate attention-based duration prediction for speed control TTS
With the advent of high-quality speech synthesis, there is growing interest
in controlling various prosodic attributes of speech. Speaking rate is an
essential attribute for modelling the expressivity of speech. In this work,
we propose a novel approach to control the speaking rate for non-autoregressive
TTS. We achieve this by conditioning the speaking rate inside the duration
predictor, allowing implicit speaking rate control. We show the benefits of
this approach by synthesising audio at various speaking rate factors and
measuring the quality of speaking rate-controlled synthesised speech. Further,
we study the effect of the speaking rate distribution of the training data
towards effective rate control. Finally, we fine-tune a baseline pretrained TTS
model to obtain speaking rate control TTS. We provide various analyses to
showcase the benefits of using this proposed approach, along with objective as
well as subjective metrics. We find that the proposed methods achieve higher
subjective scores and lower speaking rate errors than the baseline across many
speaking rate factors.
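The abstract does not give implementation details, but the central idea, feeding the speaking rate factor into the duration predictor as an extra conditioning signal so that rate control is learned implicitly, can be illustrated with a minimal PyTorch sketch. The module layout, dimensions, and rate range below are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class RateConditionedDurationPredictor(nn.Module):
    """Hypothetical sketch: a non-autoregressive TTS duration predictor
    that receives a scalar speaking-rate factor as conditioning, so the
    model learns rate-dependent durations rather than post-hoc scaling."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.rate_proj = nn.Linear(1, hidden_dim)  # embed the rate factor
        self.convs = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.out = nn.Linear(hidden_dim, 1)  # log-duration per phoneme

    def forward(self, text_hidden: torch.Tensor, rate: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, time, hidden); rate: (batch, 1), e.g. 0.5-2.0
        h = text_hidden + self.rate_proj(rate).unsqueeze(1)  # broadcast over time
        h = self.convs(h.transpose(1, 2)).transpose(1, 2)
        return self.out(h).squeeze(-1)  # (batch, time) predicted log-durations
```

At inference, the same text can then be synthesised at different speeds simply by changing the rate input, which is the implicit control the abstract describes.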
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning
Zero-shot speaker cloning aims to synthesize speech for any target speaker
unseen during TTS system building, given only a single speech reference of the
speaker at hand. Although more practical in real applications, the current
zero-shot methods still produce speech with unsatisfactory naturalness and speaker
similarity. Moreover, endowing the target speaker with arbitrary speaking
styles in the zero-shot setup has not been considered. This is because the
unique challenge of zero-shot speaker and style cloning is to learn the
disentangled speaker and style representations from only short references
representing an arbitrary speaker and an arbitrary style. To address this
challenge, we propose U-Style, which employs Grad-TTS as the backbone,
particularly cascading a speaker-specific encoder and a style-specific encoder
between the text encoder and the diffusion decoder. Thus, leveraging signal
perturbation, U-Style is explicitly decomposed into speaker- and style-specific
modeling parts, achieving better speaker and style disentanglement. To improve
unseen speaker and style modeling ability, these two encoders conduct
multi-level speaker and style modeling by skip-connected U-nets, incorporating
the representation extraction and information reconstruction process. Besides,
to improve the naturalness of synthetic speech, we adopt mean-based instance
normalization and style adaptive layer normalization in these encoders to
perform representation extraction and condition adaptation, respectively.
Experiments show that U-Style significantly surpasses the state-of-the-art
methods in unseen speaker cloning regarding naturalness and speaker similarity.
Notably, U-Style can transfer the style from an unseen source speaker to
another unseen target speaker, achieving flexible combinations of desired
speaker timbre and style in zero-shot voice cloning.
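The abstract names style-adaptive layer normalization as the conditioning mechanism inside the encoders. The sketch below shows that general technique as it appears in earlier TTS work, with a style vector predicting the normalization gain and bias; the dimensions and layout are assumptions, not U-Style's actual modules.

```python
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    """Minimal sketch of style-adaptive layer normalization: the style
    embedding predicts the gain and bias applied after layer norm."""
    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden); style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)
```

Because the affine parameters come from the style vector rather than being learned constants, each utterance is normalized and then re-scaled in a style-dependent way, which is what lets a single encoder adapt to arbitrary reference styles.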
Current trends in multilingual speech processing
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.
Acoustic-Prosodic Entrainment in Human-Human and Human-Computer Dialogue
Entrainment (sometimes called adaptation or alignment) is the tendency of human speakers to adapt to or imitate characteristics of their interlocutors' behavior. This work focuses on entrainment on acoustic-prosodic features. Acoustic-prosodic entrainment has been extensively studied but is not well understood. In particular, it is difficult to compare the results of different studies, since entrainment is usually measured in different ways, reflecting disparate conceptualizations of the phenomenon. In the first part of this thesis, we look for evidence of entrainment on a variety of acoustic-prosodic features according to various conceptualizations, and show that human speakers of both Standard American English and Mandarin Chinese entrain to each other globally and locally, in synchrony, and that this entrainment can be constant or convergent. We explore the relationship between entrainment and gender and show that entrainment on some acoustic-prosodic features is related to social behavior and dialogue coordination. In addition, we show that humans entrain in a novel domain, backchannel-inviting cues, and propose and test a novel hypothesis: that entrainment will be stronger in the case of an outlier feature value. In the second part of the thesis, we describe a method for flexibly and dynamically entraining a TTS voice to multiple acoustic-prosodic features of a user's input utterances, and show in an exploratory study that users prefer an entraining avatar to one that does not entrain, are more likely to ask its advice, and choose more positive adjectives to describe its voice.
This work introduces a coherent view of entrainment in both familiar and novel domains. Our results add to the body of knowledge of entrainment in human-human conversations and propose new directions for making use of that knowledge to enhance human-computer interactions.
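The thesis measures entrainment rather than implementing it, but one of the conceptualizations mentioned above, synchrony on a turn-level acoustic-prosodic feature, is easy to make concrete. A minimal sketch, assuming turn-aligned mean-F0 values for two interlocutors (the data here is invented for illustration):

```python
import numpy as np

def synchrony(feature_a: np.ndarray, feature_b: np.ndarray) -> float:
    """Synchrony as the Pearson correlation between one speaker's
    turn-level feature values and the partner's adjacent-turn values."""
    return float(np.corrcoef(feature_a, feature_b)[0, 1])

# Illustrative mean F0 (Hz) per turn, for adjacent turn pairs
f0_a = np.array([210.0, 195.0, 220.0, 205.0])
f0_b = np.array([190.0, 185.0, 200.0, 192.0])
print(f"synchrony: {synchrony(f0_a, f0_b):.3f}")  # positive => features co-vary
```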
Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
Zero-shot speaker adaptation aims to clone an unseen speaker's voice without
any adaptation time and parameters. Previous studies usually use a speaker
encoder to extract a global fixed speaker embedding from the reference speech,
and several attempts have explored variable-length speaker embeddings. However, they
neglect to transfer the personal pronunciation characteristics related to
phoneme content, leading to poor speaker similarity in terms of detailed
speaking styles and pronunciation habits. To improve the ability of the speaker
encoder to model personal pronunciation characteristics, we propose
content-dependent fine-grained speaker embedding for zero-shot speaker
adaptation. Local content embeddings and the corresponding local speaker
embeddings are extracted from the reference speech. Instead of modeling the
temporal relations, a reference attention module is introduced to model the
content relevance between the reference speech and the input text, and to
generate the fine-grained speaker embedding for each phoneme encoder output.
The experimental results show that our proposed method can improve the speaker
similarity of synthesized speech, especially for unseen speakers.
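The reference attention module is described only at a high level. A minimal cross-attention reading of that description is sketched below: phoneme encoder outputs act as queries over the reference's local content embeddings, and the resulting weights pool the local speaker embeddings into one fine-grained speaker embedding per phoneme. All names, shapes, and the dot-product formulation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReferenceAttention(nn.Module):
    """Hypothetical sketch of content-dependent speaker embedding:
    attention over reference content selects which local speaker
    embeddings are relevant to each phoneme of the input text."""
    def __init__(self, phoneme_dim: int, content_dim: int):
        super().__init__()
        self.query = nn.Linear(phoneme_dim, content_dim)
        self.scale = content_dim ** -0.5

    def forward(self, phoneme_hidden, ref_content, ref_speaker):
        # phoneme_hidden: (B, T_text, phoneme_dim)  phoneme encoder outputs
        # ref_content:    (B, T_ref, content_dim)   local content embeddings
        # ref_speaker:    (B, T_ref, speaker_dim)   local speaker embeddings
        q = self.query(phoneme_hidden)                          # (B, T_text, content_dim)
        scores = q @ ref_content.transpose(1, 2) * self.scale   # content relevance
        weights = scores.softmax(dim=-1)                        # (B, T_text, T_ref)
        return weights @ ref_speaker                            # per-phoneme speaker embedding
```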
Data-Driven Enhancement of State Mapping-Based Cross-Lingual Speaker Adaptation
The thesis work was motivated by the goal of developing personalized speech-to-speech translation and focused on one of its key component techniques: cross-lingual speaker adaptation for text-to-speech synthesis. A personalized speech-to-speech translator enables a person's spoken input to be translated into spoken output in another language while maintaining his/her voice identity. Before addressing any technical issues, work in this thesis set out to understand human perception of speaker identity. Listening tests were conducted in order to determine whether people could differentiate between speakers when they spoke different languages. The results demonstrated that differentiating between speakers across languages was an achievable task. However, it was difficult for listeners to differentiate between speakers across both languages and speech types (original recordings versus synthesized samples).

The underlying challenge in cross-lingual speaker adaptation is how to apply speaker adaptation techniques when the language of the adaptation data is different from that of the synthesis models. The main body of the thesis work was devoted to the analysis and improvement of HMM state mapping-based cross-lingual speaker adaptation. Firstly, the effect of unsupervised cross-lingual adaptation was investigated, as it relates to the application scenario of personalized speech-to-speech translation. The comparison of paired supervised and unsupervised systems shows that the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised fashion, even though the average phoneme error rate of the unsupervised systems is around 75%. Then the effect of the language mismatch between synthesis models and adaptation data was investigated. The mismatch is found to transfer undesirable language information from adaptation data to synthesis models, thereby limiting the effectiveness of generating multiple regression class-specific transforms, using larger quantities of adaptation data and estimating adaptation transforms iteratively.

Thirdly, in order to tackle the problems caused by the language mismatch, a data-driven adaptation framework using phonological knowledge is proposed. Its basic idea is to group HMM states according to phonological knowledge in a data-driven manner and then to map each state to a phonologically consistent counterpart in a different language. This framework is also applied to regression class tree construction for transform estimation. It is found that the proposed framework alleviates the negative effect of the language mismatch and gives consistent improvement compared to previous state-of-the-art approaches. Finally, a two-layer hierarchical transformation framework is developed, where one layer captures speaker characteristics and the other compensates for the language mismatch. The most appropriate means of constructing the hierarchical arrangement of transforms was investigated in an initial study. While early results show some promise, further in-depth investigation is needed to confirm the validity of this hierarchy.
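The state-mapping idea above builds on matching HMM states across languages by the similarity of their output distributions. A minimal sketch of that underlying step, nearest-state mapping under symmetrised KL divergence for diagonal-covariance Gaussians, is below; it is a simplification for illustration and omits the thesis's data-driven phonological grouping.

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL divergence between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def map_states(src_states, tgt_states):
    """Map each source-language HMM state to the target-language state
    whose output distribution is closest in symmetrised KL divergence.
    States are (mean, variance) pairs of NumPy arrays."""
    mapping = {}
    for i, (mu_s, var_s) in enumerate(src_states):
        dists = [kl_diag_gauss(mu_s, var_s, mu_t, var_t)
                 + kl_diag_gauss(mu_t, var_t, mu_s, var_s)
                 for mu_t, var_t in tgt_states]
        mapping[i] = int(np.argmin(dists))
    return mapping
```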