97 research outputs found
The CereProc Blizzard Entry 2009: Some dumb algorithms that don't work
Within unit selection systems there is a constant tension between data sparsity and quality, which limits the control possible in a unit selection system. The RP data used in Blizzard this year and last year is expressive and spoken in a spirited manner. Last year's entry focused on maintaining expressiveness; this year we focused on two simple algorithms to restrain and control this prosodic variation: 1) variable-width valley floor pruning on duration and pitch (applied to the full-database entry EH1); 2) bulking of the data with average HTS data (applied to the small-database entry EH2). Results for both techniques were disappointing. The full-database system achieved an MOS of around 2 (compared to 4 for a similar system attempting to emphasise variation in 2008), while the small-database entry also achieved an MOS of 2 (compared to 3 for a similar system, but with a different voice, entered in 2007). Index Terms: speech synthesis, unit selection.
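The abstract does not specify the pruning rule, so the following is only a rough illustrative sketch: pruning candidate units whose duration falls outside a variable-width band around the database median. The function name, the data, and the median-absolute-deviation band are all assumptions, not the CereProc algorithm.

```python
import statistics

def valley_floor_prune(units, key, width=2.0):
    """Keep only units whose feature lies within `width` median absolute
    deviations of the database median (a hypothetical reading of
    variable-width valley floor pruning; `width` sets the band width)."""
    values = [u[key] for u in units]
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [u for u in units if abs(u[key] - med) <= width * mad]

# Toy durations in ms; the 300 ms outlier is pruned, the rest survive.
units = [{"dur": d} for d in (80, 90, 95, 100, 105, 110, 300)]
pruned = valley_floor_prune(units, "dur")
```

A wider `width` retains more prosodic variation; shrinking it pulls the database toward its central tendency, which is the "restraining" effect the entry was after.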
Automatic Classification of Synthetic Voices for Voice Banking Using Objective Measures
Speech is the most common way of communicating among humans. People who cannot communicate through speech due to partial or total loss of the voice can benefit from Alternative and Augmentative Communication devices and text-to-speech technology. One problem with using these technologies is that the included synthetic voices may be impersonal and badly adapted to the user in terms of age, accent or even gender. In this context, the use of synthetic voices from voice banking systems is an attractive alternative. New voices can be obtained by applying adaptation techniques to recordings from people with healthy voices (donors) or from the users themselves before they lose their own voice. In this way, the goal is to offer a wide voice catalogue to potential users. However, as there is no control over the recording or the adaptation processes, some method to control the final quality of the voice is needed. We present the work carried out to automatically select the best synthetic voices using a set of objective measures and a subjective Mean Opinion Score (MOS) evaluation. An MOS prediction algorithm has been built whose output correlates with the subjective scores as well as the most correlated individual measure does. This work has been funded by the Basque Government under the project ref. PIBA 2018-035 and IT-1355-19. This work is part of the project Grant PID 2019-108040RB-C21 funded by MCIN/AEI/10.13039/501100011033.
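As an illustration of the kind of MOS predictor described, a plain least-squares combination of objective measures can be fitted as below. The measure columns and all numbers are invented for illustration; the actual system may well use a different model.

```python
import numpy as np

# Hypothetical objective scores (rows: voices; columns: measures such as
# MCD, F0 RMSE, PESQ -- names illustrative) and subjective MOS labels.
X = np.array([[6.1, 30.0, 3.2],
              [5.2, 22.0, 3.8],
              [7.0, 35.0, 2.9],
              [4.8, 20.0, 4.1]])
mos = np.array([3.1, 3.8, 2.7, 4.2])

# Fit a linear MOS predictor with an intercept term via least squares.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, mos, rcond=None)
predicted = A @ w
```

Voices whose predicted MOS falls below a threshold could then be rejected from the catalogue without running a listening test on every candidate.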
In search of the optimal acoustic features for statistical parametric speech synthesis
In the Statistical Parametric Speech Synthesis (SPSS) paradigm, speech is generally
represented as acoustic features and the waveform is generated by a vocoder. A comprehensive
summary of state-of-the-art vocoding techniques is presented, highlighting
their characteristics, advantages, and drawbacks, primarily when used in SPSS. We
conclude that state-of-the-art vocoding methods are suboptimal and are a cause of significant loss of quality, even though numerous vocoders have been proposed in the last
decade. In fact, it seems that the most complicated methods perform worse than simpler
ones based on more robust analysis/synthesis algorithms. Typical methods, based on
the source-filter or sinusoidal models, rely on excessive simplifying assumptions. They
perform what we call an "extreme decomposition" of speech (e.g., source+filter or sinusoids+
noise), which we believe to be a major drawback. Problems include: difficulties
in the estimation of components; modelling of complex non-linear mechanisms; a lack
of ground truth. In addition, the statistical dependence that exists between stochastic
and deterministic components of speech is not modelled.
We start by improving just the waveform generation stage of SPSS, using standard
acoustic features. We propose a new method of waveform generation tailored for SPSS,
based on neither source-filter separation nor sinusoidal modelling. The proposed waveform
generator avoids unnecessary assumptions and decompositions as far as possible,
and uses only the fundamental frequency and spectral envelope as acoustic features. A
very small speech database is used as a source of base speech signals which are subsequently
"reshaped" to match the specifications output by the acoustic model in the
SPSS framework. All of this is done without any decomposition, such as source+filter
or harmonics+noise. A comprehensive description of the waveform generation process
is presented, along with implementation issues. Two SPSS voices, a female and a male,
were built to test the proposed method by using a standard TTS toolkit, Merlin. In
a subjective evaluation, listeners preferred the proposed waveform generator over a
state-of-the-art vocoder, STRAIGHT.
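A toy sketch of the core "reshaping" idea: filter a base frame so that its magnitude spectrum matches a target spectral envelope while keeping the base phase. This is a drastic simplification of the thesis method, and all names and values are illustrative.

```python
import numpy as np

def reshape_frame(base, target_env):
    """Very rough sketch: scale the magnitude spectrum of a base speech
    frame toward a target spectral envelope, keeping the base phase.
    (Illustrative only; the actual reshaping procedure is far more
    elaborate and avoids explicit decomposition.)"""
    spec = np.fft.rfft(base)
    mag = np.abs(spec)
    gain = target_env / np.maximum(mag, 1e-9)
    return np.fft.irfft(spec * gain, n=len(base))

n = 256
base = np.random.default_rng(0).standard_normal(n)   # stand-in base signal
target_env = np.linspace(1.0, 0.1, n // 2 + 1)       # hypothetical envelope
frame = reshape_frame(base, target_env)
```

Because only the magnitude is steered toward the acoustic model's output while phase comes from real recorded speech, no source+filter or harmonics+noise split is ever performed.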
Even though the proposed "waveform reshaping" generator generates higher speech
quality than STRAIGHT, the improvement is not large enough. Consequently, we propose
a new acoustic representation, whose implementation involves feature extraction
and waveform generation, i.e., a complete vocoder. The new representation encodes
the complex spectrum derived from the Fourier Transform in a way explicitly designed
for SPSS, rather than for speech coding or copy-synthesis. The feature set comprises
four feature streams describing magnitude spectrum, phase spectrum, and fundamental
frequency; all of these are represented by real numbers. It avoids heuristics or unstable
methods for phase unwrapping. The new feature extraction does not attempt to
decompose the speech structure and thus the "phasiness" and "buzziness" found in a
typical vocoder, such as STRAIGHT, is dramatically reduced. Our method works at
a lower frame rate than a typical vocoder. To demonstrate the proposed method, two
DNN-based voices, a male and a female, were built using the Merlin toolkit. Subjective
comparisons were performed with a state-of-the-art baseline. The proposed vocoder
substantially outperformed the baseline for both voices and under all configurations
tested. Furthermore, several enhancements were made over the original design, which
are beneficial for either sound quality or compatibility with other tools. In addition to
its use in SPSS, the proposed vocoder is also demonstrated for join smoothing
in unit selection-based systems, and can be used for voice conversion or automatic
speech recognition.
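The abstract names four real-valued streams for magnitude, phase and fundamental frequency without giving formulas. A toy illustration of the general idea of encoding a complex spectrum into real-valued features without phase unwrapping is shown below; the stream layout (log magnitude plus cosine and sine of the phase) is an assumption, not the thesis design.

```python
import numpy as np

def encode_frame(frame):
    """Toy encoding of a complex spectrum into real-valued streams:
    log magnitude plus cosine and sine of the phase, so no phase
    unwrapping is ever needed. Layout is illustrative only."""
    spec = np.fft.rfft(frame)
    logmag = np.log(np.maximum(np.abs(spec), 1e-9))
    return logmag, np.cos(np.angle(spec)), np.sin(np.angle(spec))

def decode_frame(logmag, cos_p, sin_p, n):
    """Rebuild the complex spectrum from the real-valued streams."""
    spec = np.exp(logmag) * (cos_p + 1j * sin_p)
    return np.fft.irfft(spec, n=n)

x = np.sin(2 * np.pi * 10 * np.arange(128) / 128)  # toy frame
y = decode_frame(*encode_frame(x), n=128)
```

Representing phase through its cosine and sine keeps every feature a bounded real number, which is friendlier to statistical modelling than a wrapped angle.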
Ahots sintetiko pertsonalizatuak: esperientzia baten deskribapena [Personalized synthetic voices: description of an experience]
The voice is so essential for human communication that its loss drastically affects a person's integration in society. Text-to-speech can provide a synthetic voice for people with oral disabilities. The most common solutions usually provide a standard voice, and users have difficulty identifying themselves with it. For this reason, we need to create personalized synthetic voices and offer a catalogue of voices to people with oral disabilities, so that they can choose one that suits their needs. The objective of the ZureTTS project is to provide these personalized voices, both in Spanish and in Basque. Through the AhoMyTTS web portal, people who are going to lose their voice, or altruistic people who want to provide voices to those who do not have one, record 100 carefully selected sentences. A synthetic voice with characteristics similar to the voice in the recordings is generated by applying an adaptation process. The user is provided with a synthesis engine along with that personalized voice, so that they can use it in applications that require oral message generation. In addition, we offer a catalogue of voices to choose from for those no longer able to record. More than 1,200 people have used the system to obtain a personalized voice, and 58 of them have been selected for inclusion in the catalogue. User surveys show satisfaction with various aspects of the synthetic voice: most think that the synthetic voice is similar to the original, pleasant and clear, although a bit robotic. This work contributes mainly to Sustainable Development Goal 10, reducing inequality within and among countries. It also contributes to Goal 4, providing tools that facilitate access for all to an inclusive, equitable and quality education.
A Parametric Approach for Efficient Speech Storage, Flexible Synthesis and Voice Conversion
During the past decades, many areas of speech processing have benefited from the vast increases in the available memory sizes and processing power. For example, speech recognizers can be trained with enormous speech databases and high-quality speech synthesizers can generate new speech sentences by concatenating speech units retrieved from a large inventory of speech data. However, even in today's world of ever-increasing memory sizes and computational resources, there are still lots of embedded application scenarios for speech processing techniques where the memory capacities and the processor speeds are very limited. Thus, there is still a clear demand for solutions that can operate with limited resources, e.g., on low-end mobile devices.
This thesis introduces a new segmental parametric speech codec referred to as the VLBR codec. The novel proprietary sinusoidal speech codec designed for efficient speech storage is capable of achieving relatively good speech quality at compression ratios beyond the ones offered by the standardized speech coding solutions, i.e., at bitrates of approximately 1 kbps and below. The efficiency of the proposed coding approach is based on model simplifications, mode-based segmental processing, and the method of adaptive downsampling and quantization. The coding efficiency is also further improved using a novel flexible multi-mode matrix quantizer structure and enhanced dynamic codebook reordering. The compression is also facilitated using a new perceptual irrelevancy removal method.
The VLBR codec is also applied to text-to-speech synthesis. In particular, the codec is utilized for the compression of unit selection databases and for the parametric concatenation of speech units. It is also shown that the efficiency of the database compression can be further enhanced using speaker-specific retraining of the codec. Moreover, the computational load is significantly decreased using a new compression-motivated scheme for very fast and memory-efficient calculation of concatenation costs, based on techniques and implementations used in the VLBR codec.
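The idea of a compression-motivated fast concatenation cost can be sketched as precomputing all pairwise distances between codebook entries once, turning every join-cost evaluation at synthesis time into a table lookup. The codebook and distance below are made up for illustration.

```python
import numpy as np

# Hypothetical codebook of spectral parameter vectors (K entries).
rng = np.random.default_rng(1)
codebook = rng.standard_normal((8, 4))

# Precompute all K x K join costs once (Euclidean distance here);
# since every unit is represented by a codebook index, the runtime
# concatenation cost is a single table lookup.
diff = codebook[:, None, :] - codebook[None, :, :]
cost_table = np.sqrt((diff ** 2).sum(axis=-1))

def join_cost(left_idx, right_idx):
    """Concatenation cost between the coded frames at a unit boundary."""
    return cost_table[left_idx, right_idx]
```

With K codebook entries the table needs only K*K floats, so the memory cost stays small while the per-join computation drops to an array access.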
Finally, the VLBR codec and the related speech synthesis techniques are complemented with voice conversion methods that allow modifying the perceived speaker identity which in turn enables, e.g., cost-efficient creation of new text-to-speech voices. The VLBR-based voice conversion system combines compression with the popular Gaussian mixture model based conversion approach. Furthermore, a novel method is proposed for converting the prosodic aspects of speech. The performance of the VLBR-based voice conversion system is also enhanced using a new approach for mode selection and through explicit control of the degree of voicing.
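A minimal sketch of Gaussian-mixture-model-based conversion in one dimension: a source feature is mapped to the target domain as a posterior-weighted combination of per-mixture predictions. All parameters are invented for illustration, and the full method also uses cross-covariance terms omitted here.

```python
import numpy as np

# Toy 1-D GMM mapping with two mixtures (parameters made up).
weights = np.array([0.5, 0.5])
mu_x = np.array([-1.0, 1.0])   # source-side mixture means
var_x = np.array([0.5, 0.5])   # source-side variances
mu_y = np.array([-2.0, 2.0])   # target-side mixture means

def convert(x):
    """Posterior-weighted, mean-only GMM mapping of a scalar feature."""
    lik = weights * np.exp(-0.5 * (x - mu_x) ** 2 / var_x) / np.sqrt(var_x)
    post = lik / lik.sum()      # mixture posteriors given x
    # Simplified: full GMM conversion adds a covariance-based shift.
    return float((post * mu_y).sum())

y = convert(1.0)
```

The soft posterior weighting is what makes the mapping continuous across mixture boundaries, avoiding the audible switching artefacts of hard classification.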
The solutions proposed in the thesis together form a complete system that can be utilized in different ways and configurations. The VLBR codec itself can be utilized, e.g., for efficient compression of audio books, and the speech synthesis related methods can be used for reducing the footprint and the computational load of concatenative text-to-speech synthesizers to levels required in some embedded applications. The VLBR-based voice conversion techniques can be used to complement the codec both in storage applications and in connection with speech synthesis. It is also possible to only utilize the voice conversion functionality, e.g., in games or other entertainment applications.
Language Design for Reactive Systems: On Modal Models, Time, and Object Orientation in Lingua Franca and SCCharts
Reactive systems play a crucial role in the embedded domain. They continuously interact with their environment, handle concurrent operations, and are commonly expected to provide deterministic behavior to enable application in safety-critical systems. In this context, language design is a key aspect, since carefully tailored language constructs can aid in addressing the challenges faced in this domain, as illustrated by the various concurrency models that prevent the known pitfalls of regular threads. Today, many languages exist in this domain and often provide unique characteristics that make them specifically fit for certain use cases. This thesis revolves around two distinctive languages: the actor-oriented polyglot coordination language Lingua Franca and the synchronous statecharts dialect SCCharts. While they take different approaches in providing reactive modeling capabilities, they share clear similarities in their semantics and complement each other in design principles. This thesis analyzes and compares key design aspects in the context of these two languages. For three particularly relevant concepts, it provides and evaluates lean and seamless language extensions that are carefully aligned with the fundamental principles of the underlying language. Specifically, Lingua Franca is extended toward coordinating modal behavior, while SCCharts receives a timed automaton notation with an efficient execution model using dynamic ticks and an extension toward the object-oriented modeling paradigm.
Statistical parametric speech synthesis based on sinusoidal models
This study focuses on improving the quality of statistical speech synthesis based on sinusoidal
models. Vocoders play a crucial role during the parametrisation and reconstruction process,
so we first conduct an experimental comparison of a broad range of the leading vocoder
types. Although our study shows that, for analysis/synthesis, sinusoidal models with complex
amplitudes can generate higher-quality speech than source-filter ones, the component
sinusoids are correlated with each other, and the number of parameters is high and varies
from frame to frame, which constrains their application in statistical speech synthesis.
Therefore, we first propose a perceptually based dynamic sinusoidal model (PDM) to decrease
and fix the number of components typically used in the standard sinusoidal model.
Then, in order to apply the proposed vocoder with an HMM-based speech synthesis system
(HTS), two strategies for modelling sinusoidal parameters have been compared. In the first
method (DIR parameterisation), features extracted from the fixed- and low-dimensional PDM
are statistically modelled directly. In the second method (INT parameterisation), we convert
both static amplitude and dynamic slope from all the harmonics of a signal, which we term
the Harmonic Dynamic Model (HDM), to intermediate parameters (regularised cepstral coefficients
(RDC)) for modelling. Our results show that HDM with intermediate parameters can
generate comparable quality to STRAIGHT.
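The Harmonic Dynamic Model's idea of a static amplitude plus a dynamic slope per harmonic can be sketched as a toy resynthesis, as below. This is an illustration of the parameterisation only, not the thesis implementation.

```python
import numpy as np

def hdm_frame(f0, amps, slopes, n, fs=16000):
    """Toy Harmonic Dynamic Model frame: harmonic k of f0 carries a
    static amplitude a_k and a linear amplitude slope b_k across the
    frame, i.e. (a_k + b_k * t) * cos(2*pi*k*f0*t). Phases are zero
    here for simplicity."""
    t = np.arange(n) / fs
    frame = np.zeros(n)
    for k, (a, b) in enumerate(zip(amps, slopes), start=1):
        frame += (a + b * t) * np.cos(2 * np.pi * k * f0 * t)
    return frame

# Two harmonics of a 200 Hz tone; the second decays across the frame.
frame = hdm_frame(200.0, amps=[1.0, 0.5], slopes=[0.0, -10.0], n=160)
```

The slope terms let each frame capture within-frame amplitude movement, which a purely static sinusoidal model would smear out.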
As correlations between features in the dynamic model cannot be modelled satisfactorily
by a typical HMM-based system with diagonal covariance, we have applied and tested a deep
neural network (DNN) for modelling features from these two methods. To fully exploit DNN
capabilities, we investigate ways to combine INT and DIR at the level of both DNN modelling
and waveform generation. For DNN training, we propose to use multi-task learning to
model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. We
conclude from our results that sinusoidal models are indeed highly suited for statistical parametric
synthesis. The proposed method outperforms the state-of-the-art STRAIGHT-based
equivalent when used in conjunction with DNNs.
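A minimal numpy sketch of the multi-task setup: a shared hidden layer feeds a primary head (standing in for cepstra from INT) and a secondary head (log amplitudes from DIR), trained under a weighted two-task loss. Dimensions, weights, and the loss weighting are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-task network: one shared hidden layer and two linear heads.
# Layer sizes are arbitrary; weights are random for illustration.
W_shared = rng.standard_normal((10, 16))
W_primary = rng.standard_normal((16, 40))     # e.g. cepstra head
W_secondary = rng.standard_normal((16, 50))   # e.g. log-amplitude head

def forward(x):
    h = np.tanh(x @ W_shared)                 # shared representation
    return h @ W_primary, h @ W_secondary

def multitask_loss(pred, target, alpha=0.7):
    """Weighted sum of per-task squared errors; alpha weights the
    primary task, 1 - alpha the secondary task."""
    (p1, p2), (t1, t2) = pred, target
    return alpha * np.mean((p1 - t1) ** 2) + (1 - alpha) * np.mean((p2 - t2) ** 2)

x = rng.standard_normal(10)
primary, secondary = forward(x)
```

Because both heads back-propagate through the same shared layer, the secondary task acts as a regulariser that shapes the representation used by the primary task.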
To further improve the voice quality, phase features generated from the proposed vocoder
also need to be parameterised and integrated into statistical modelling. Here, an alternative
statistical model referred to as the complex-valued neural network (CVNN), which treats complex coefficients as a whole, is proposed to model complex amplitude explicitly. A complex-valued
back-propagation algorithm using a logarithmic minimisation criterion which includes
both amplitude and phase errors is used as a learning rule. Three parameterisation methods
are studied for mapping text to acoustic features: RDC / real-valued log amplitude, complex-valued
amplitude with minimum phase and complex-valued amplitude with mixed phase. Our
results show the potential of using CVNNs for modelling both real and complex-valued acoustic
features. Overall, this thesis has established competitive alternative vocoders for speech
parametrisation and reconstruction. The use of the proposed vocoders with various acoustic
models (HMM / DNN / CVNN) clearly demonstrates that they are compelling choices for
parametric statistical speech synthesis.
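The logarithmic criterion combining amplitude and phase errors can be illustrated as follows. This is a generic CVNN-style loss written for clarity; the exact form used in the thesis may differ.

```python
import numpy as np

def log_loss(pred, target, eps=1e-9):
    """Logarithmic error for complex-valued outputs: penalises both the
    log-amplitude difference and the (wrapped) phase difference, so
    amplitude and phase errors contribute jointly to training."""
    amp_err = np.log(np.abs(pred) + eps) - np.log(np.abs(target) + eps)
    phase_err = np.angle(pred) - np.angle(target)
    phase_err = np.angle(np.exp(1j * phase_err))  # wrap to (-pi, pi]
    return float(np.mean(amp_err ** 2 + phase_err ** 2))

z = np.array([1 + 1j, 2 - 1j])
loss_same = log_loss(z, z)   # identical prediction: zero loss
```

Wrapping the phase difference through the complex exponential keeps the criterion smooth across the +/- pi boundary, sidestepping phase unwrapping entirely.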
Prosody generation for text-to-speech synthesis
The absence of convincing intonation makes current parametric speech
synthesis systems sound dull and lifeless, even when trained on expressive
speech data. Typically, these systems use regression techniques to predict the
fundamental frequency (F0) frame by frame. This approach leads to overly
smooth pitch contours and fails to construct an appropriate prosodic structure
across the full utterance. In order to capture and reproduce larger-scale
pitch patterns, we propose a template-based approach for automatic F0 generation,
where per-syllable pitch-contour templates (from a small, automatically
learned set) are predicted by a recurrent neural network (RNN). The use of
syllable templates mitigates the over-smoothing problem and is able to reproduce
pitch patterns observed in the data. The use of an RNN, paired with connectionist
temporal classification (CTC), enables the prediction of structure in
the pitch contour spanning the entire utterance. This novel F0 prediction system
is used alongside separate LSTMs for predicting phone durations and the
other acoustic features, to construct a complete text-to-speech system. Later,
we investigate the benefits of including long-range dependencies in duration
prediction at frame-level using uni-directional recurrent neural networks.
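The per-syllable template set in the first approach could be learned, for example, by clustering fixed-length syllable F0 contours; a tiny k-means sketch is given below. This is a simplification for illustration, not the automatic template-learning procedure used in the work.

```python
import numpy as np

def learn_templates(contours, k=2, iters=20, seed=0):
    """Tiny k-means over fixed-length per-syllable F0 contours; the
    centroids act as the pitch-contour template set. Details (contour
    length normalisation, number of templates) are simplified."""
    rng = np.random.default_rng(seed)
    X = np.asarray(contours, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)                 # nearest template per syllable
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers, labels

# Two rising and two falling toy contours cluster into two templates.
rises = [[0.0, 1.0, 2.0], [0.0, 1.1, 2.2]]
falls = [[2.0, 1.0, 0.0], [2.2, 1.0, 0.1]]
centers, labels = learn_templates(rises + falls, k=2)
```

Predicting a template index per syllable turns F0 generation into a classification problem, which is what allows an RNN with CTC to impose utterance-level structure instead of regressing a smooth frame-wise average.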
Since prosody is a supra-segmental property, we consider an alternate approach
to intonation generation which exploits long-term dependencies of
F0 by effective modelling of linguistic features using recurrent neural networks.
For this purpose, we propose a hierarchical encoder-decoder and
multi-resolution parallel encoder where the encoder takes word and higher
level linguistic features at the input and upsamples them to phone-level
through a series of hidden layers; this is integrated into a hybrid system
that was then submitted to the Blizzard Challenge workshop. We then highlight
some of the issues in current approaches, and a plan for future directions of
investigation is outlined along with ongoing work.
Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis
Statistical parametric speech synthesis (SPSS) has seen improvements over
recent years, especially in terms of intelligibility. Synthetic speech is often clear
and understandable, but it can also be bland and monotonous. Proper generation
of natural speech prosody is still a largely unsolved problem. This is relevant
especially in the context of expressive audiobook speech synthesis, where speech
is expected to be fluid and captivating.
In general, prosody can be seen as a layer that is superimposed on the segmental
(phone) sequence. Listeners can perceive the same melody or rhythm
in different utterances, and the same segmental sequence can be uttered with a
different prosodic layer to convey a different message. For this reason, prosody
is commonly accepted to be inherently suprasegmental. It is governed by longer
units within the utterance (e.g. syllables, words, phrases) and beyond the utterance
(e.g. discourse). However, common techniques for the modeling of speech
prosody - and speech in general - operate mainly on very short intervals, either at
the state or frame level, in both hidden Markov model (HMM) and deep neural
network (DNN) based speech synthesis.
This thesis presents contributions supporting the claim that stronger representations
of suprasegmental variation are essential for the natural generation of
fundamental frequency for statistical parametric speech synthesis. We conceptualize
the problem by dividing it into three sub-problems: (1) representations of
acoustic signals, (2) representations of linguistic contexts, and (3) the mapping
of one representation to another. The contributions of this thesis provide novel
methods and insights relating to these three sub-problems.
In terms of sub-problem 1, we propose a multi-level representation of f0 using
the continuous wavelet transform and the discrete cosine transform, as well
as a wavelet-based decomposition strategy that is linguistically and perceptually
motivated. In terms of sub-problem 2, we investigate additional linguistic
features such as text-derived word embeddings and syllable bag-of-phones and
we propose a novel method for learning word vector representations based on
acoustic counts. Finally, considering sub-problem 3, insights are given regarding
hierarchical models such as parallel and cascaded deep neural networks
- …