163 research outputs found
Normal-to-Lombard Adaptation of Speech Synthesis Using Long Short-Term Memory Recurrent Neural Networks
In this article, three adaptation methods are compared based on how well they change the speaking style of a neural network based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are: auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context. The evaluated vocoders are: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. In the subjective evaluations, speaking style similarity and speech intelligibility were assessed. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In objective evaluations and speaking style similarity tests, we found that the FT method outperformed the other two adaptation methods. In speech intelligibility tests, we found that there were no significant differences between vocoders although the PML vocoder showed slightly better performance compared to the three other vocoders.Peer reviewe
A review of differentiable digital signal processing for music and speech synthesis
The term “differentiable digital signal processing” describes a family of techniques in which loss function gradients are backpropagated through digital signal processors, facilitating their integration into neural networks. This article surveys the literature on differentiable audio signal processing, focusing on its use in music and speech synthesis. We catalogue applications to tasks including music performance rendering, sound matching, and voice transformation, discussing the motivations for and implications of the use of this methodology. This is accompanied by an overview of digital signal processing operations that have been implemented differentiably, which is further supported by a web book containing practical advice on differentiable synthesiser programming (https://intro2ddsp.github.io/). Finally, we highlight open challenges, including optimisation pathologies, robustness to real-world conditions, and design trade-offs, and discuss directions for future research
Recommended from our members
A Log Domain Pulse Model for Parametric Speech Synthesis
Most of the degradation in current Statistical Parametric Speech Synthesis (SPSS) results from the form of the vocoder. One of the main causes of degradation is the reconstruction of the noise. In this article, a new signal model is proposed that leads to a simple synthesizer, without the need for ad-hoc tuning of model parameters. The model is not based on the traditional additive linear source-filter model, it adopts a combination of speech components that are additive in the log domain. Also, the same representation for voiced and unvoiced segments is used, rather than relying on binary voicing decisions. This avoids voicing error discontinuities that can occur in many current vocoders. A simple binary mask is used to denote the presence of noise in the time-frequency domain, which is less sensitive to classification errors. Four experiments have been carried out to evaluate this new model. The first experiment examines the noise reconstruction issue. Three listening tests have also been carried out that demonstrate the advantages of this model: comparison with the STRAIGHT vocoder; the direct prediction of the binary noise mask by using a mixed output configuration; and partial improvements of creakiness using a mask correction mechanism.European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie; 10.13039/501100000266-EPSR
The development of speech coding and the first standard coder for public mobile telephony
This thesis describes in its core chapter (Chapter 4) the original algorithmic and design features of the ??rst coder for public mobile telephony, the GSM full-rate speech coder, as standardized in 1988. It has never been described in so much detail as presented here. The coder is put in a historical perspective by two preceding chapters on the history of speech production models and the development of speech coding techniques until the mid 1980s, respectively. In the epilogue a brief review is given of later developments in speech coding. The introductory Chapter 1 starts with some preliminaries. It is de- ??ned what speech coding is and the reader is introduced to speech coding standards and the standardization institutes which set them. Then, the attributes of a speech coder playing a role in standardization are explained. Subsequently, several applications of speech coders - including mobile telephony - will be discussed and the state of the art in speech coding will be illustrated on the basis of some worldwide recognized standards. Chapter 2 starts with a summary of the features of speech signals and their source, the human speech organ. Then, historical models of speech production which form the basis of di??erent kinds of modern speech coders are discussed. Starting with a review of ancient mechanical models, we will arrive at the electrical source-??lter model of the 1930s. Subsequently, the acoustic-tube models as they arose in the 1950s and 1960s are discussed. Finally the 1970s are reviewed which brought the discrete-time ??lter model on the basis of linear prediction. In a unique way the logical sequencing of these models is exposed, and the links are discussed. Whereas the historical models are discussed in a narrative style, the acoustic tube models and the linear prediction tech nique as applied to speech, are subject to more mathematical analysis in order to create a sound basis for the treatise of Chapter 4. This trend continues in Chapter 3, whenever instrumental in completing that basis. In Chapter 3 the reader is taken by the hand on a guided tour through time during which successive speech coding methods pass in review. In an original way special attention is paid to the evolutionary aspect. Speci??cally, for each newly proposed method it is discussed what it added to the known techniques of the time. After presenting the relevant predecessors starting with Pulse Code Modulation (PCM) and the early vocoders of the 1930s, we will arrive at Residual-Excited Linear Predictive (RELP) coders, Analysis-by-Synthesis systems and Regular- Pulse Excitation in 1984. The latter forms the basis of the GSM full-rate coder. In Chapter 4, which constitutes the core of this thesis, explicit forms of Multi-Pulse Excited (MPE) and Regular-Pulse Excited (RPE) analysis-by-synthesis coding systems are developed. Starting from current pulse-amplitude computation methods in 1984, which included solving sets of equations (typically of order 10-16) two hundred times a second, several explicit-form designs are considered by which solving sets of equations in real time is avoided. Then, the design of a speci??c explicitform RPE coder and an associated eÆcient architecture are described. The explicit forms and the resulting architectural features have never been published in so much detail as presented here. Implementation of such a codec enabled real-time operation on a state-of-the-art singlechip digital signal processor of the time. This coder, at a bit rate of 13 kbit/s, has been selected as the Full-Rate GSM standard in 1988. Its performance is recapitulated. Chapter 5 is an epilogue brie y reviewing the major developments in speech coding technology after 1988. Many speech coding standards have been set, for mobile telephony as well as for other applications, since then. The chapter is concluded by an outlook
Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables
This paper introduces GlOttal-flow LPC Filter (GOLF), a novel method for
singing voice synthesis (SVS) that exploits the physical characteristics of the
human voice using differentiable digital signal processing. GOLF employs a
glottal model as the harmonic source and IIR filters to simulate the vocal
tract, resulting in an interpretable and efficient approach. We show it is
competitive with state-of-the-art singing voice vocoders, requiring fewer
synthesis parameters and less memory to train, and runs an order of magnitude
faster for inference. Additionally, we demonstrate that GOLF can model the
phase components of the human voice, which has immense potential for rendering
and analysing singing voices in a differentiable manner. Our results highlight
the effectiveness of incorporating the physical properties of the human voice
mechanism into SVS and underscore the advantages of signal-processing-based
approaches, which offer greater interpretability and efficiency in synthesis.
Audio samples are available at https://yoyololicon.github.io/golf-demo/.Comment: 9 pages, 4 figures. Accepted at ISMIR 202
Statistical parametric speech synthesis based on sinusoidal models
This study focuses on improving the quality of statistical speech synthesis based on sinusoidal
models. Vocoders play a crucial role during the parametrisation and reconstruction process,
so we first lead an experimental comparison of a broad range of the leading vocoder types.
Although our study shows that for analysis / synthesis, sinusoidal models with complex amplitudes
can generate high quality of speech compared with source-filter ones, component
sinusoids are correlated with each other, and the number of parameters is also high and varies
in each frame, which constrains its application for statistical speech synthesis.
Therefore, we first propose a perceptually based dynamic sinusoidal model (PDM) to decrease
and fix the number of components typically used in the standard sinusoidal model.
Then, in order to apply the proposed vocoder with an HMM-based speech synthesis system
(HTS), two strategies for modelling sinusoidal parameters have been compared. In the first
method (DIR parameterisation), features extracted from the fixed- and low-dimensional PDM
are statistically modelled directly. In the second method (INT parameterisation), we convert
both static amplitude and dynamic slope from all the harmonics of a signal, which we term
the Harmonic Dynamic Model (HDM), to intermediate parameters (regularised cepstral coefficients
(RDC)) for modelling. Our results show that HDM with intermediate parameters can
generate comparable quality to STRAIGHT.
As correlations between features in the dynamic model cannot be modelled satisfactorily
by a typical HMM-based system with diagonal covariance, we have applied and tested a deep
neural network (DNN) for modelling features from these two methods. To fully exploit DNN
capabilities, we investigate ways to combine INT and DIR at the level of both DNN modelling
and waveform generation. For DNN training, we propose to use multi-task learning to
model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. We
conclude from our results that sinusoidal models are indeed highly suited for statistical parametric
synthesis. The proposed method outperforms the state-of-the-art STRAIGHT-based
equivalent when used in conjunction with DNNs.
To further improve the voice quality, phase features generated from the proposed vocoder
also need to be parameterised and integrated into statistical modelling. Here, an alternative
statistical model referred to as the complex-valued neural network (CVNN), which treats complex coefficients as a whole, is proposed to model complex amplitude explicitly. A complex-valued
back-propagation algorithm using a logarithmic minimisation criterion which includes
both amplitude and phase errors is used as a learning rule. Three parameterisation methods
are studied for mapping text to acoustic features: RDC / real-valued log amplitude, complex-valued
amplitude with minimum phase and complex-valued amplitude with mixed phase. Our
results show the potential of using CVNNs for modelling both real and complex-valued acoustic
features. Overall, this thesis has established competitive alternative vocoders for speech
parametrisation and reconstruction. The utilisation of proposed vocoders on various acoustic
models (HMM / DNN / CVNN) clearly demonstrates that it is compelling to apply them for
the parametric statistical speech synthesis
Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables
This paper introduces GlOttal-flow LPC Filter (GOLF), a novel method for singing voice synthesis (SVS) that exploits the physical characteristics of the human voice using differentiable digital signal processing. GOLF employs a glottal model as the harmonic source and IIR filters to simulate the vocal tract, resulting in an interpretable and efficient approach. We show it is competitive with state-of-the-art singing voice vocoders, requiring fewer synthesis parameters and less memory to train, and runs an order of magnitude faster for inference. Additionally, we demonstrate that GOLF can model the phase components of the human voice, which has immense potential for rendering and analysing singing voices in a differentiable manner. Our results highlight the effectiveness of incorporating the physical properties of the human voice mechanism into SVS and underscore the advantages of signal-processing-based approaches, which offer greater interpretability and efficiency in synthesis
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is getting increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify main challenges
and provide recommendations for future research directions
- …