17 research outputs found
An examination into the effects of speech rate on perceived stress in monolingual and bilingual populations
Stress is the body’s response to adverse or demanding circumstances and can cause physical changes, such as increased respiratory rate and increased vocal cord muscle tension, which can affect speech production and the acoustic properties of speech such as speech rate (duration). Acoustic properties such as duration and intensity act as cues in stress judgements, with duration proving to be the factor that provides the greatest fluctuation in these judgements. In the present study, 55 monolingual and bilingual (Spanish and English speaking) participants listened to 6 audio files spoken at 3 varying speeds in both English and Spanish and rated how stressed they perceived the speaker to be. Contrary to what was predicted, the results demonstrated high intra-cultural similarities in terms of perceptions of stress. As hypothesised, higher stress ratings were attributed to the faster spoken files, although they were also attributed to those files spoken in Spanish. There were interactions between the speed of audio and participant group; language spoken and participant group; speed of audio and language spoken; and finally speed of audio, language being spoken and participant group. These results demonstrate that speech rate has significant effects on perceptions of stress and also suggest that the effect of speech rate on these perceptions varies between languages. However, previous literature would suggest that the acoustic properties of speech are affected differently in real life scenarios compared to when speech is manipulated artificially. This suggests that further research should endeavour to avoid electronically manipulated audio, instead capturing naturally occurring audio files.
HMM-based speech synthesis using an acoustic glottal source model
Parametric speech synthesis has received increased attention in recent years following
the development of statistical HMM-based speech synthesis. However, the speech
produced using this method still does not sound as natural as human speech and there
is limited parametric flexibility to replicate voice quality aspects, such as breathiness.
The hypothesis of this thesis is that speech naturalness and voice quality can be
more accurately replicated by a HMM-based speech synthesiser using an acoustic glottal
source model, the Liljencrants-Fant (LF) model, to represent the source component
of speech instead of the traditional impulse train.
Two different analysis-synthesis methods were developed during this thesis, in order
to integrate the LF-model into a baseline HMM-based speech synthesiser, which is
based on the popular HTS system and uses the STRAIGHT vocoder. The first method,
which is called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model
signal through a glottal post-filter to obtain the source signal and then generating
speech, by passing this source signal through the spectral envelope filter. The system
which uses the GPF method (HTS-GPF system) is similar to the baseline system,
but it uses a different source signal instead of the impulse train used by STRAIGHT.
The second method, called Glottal Spectral Separation (GSS), generates speech by
passing the LF-model signal through the vocal tract filter. The major advantage of the
synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic
properties of the LF-model parameters are automatically learnt by the HMMs.
In this thesis, an initial perceptual experiment was conducted to compare the LF-model
to the impulse train. The results showed that the LF-model was significantly
better, both in terms of speech naturalness and replication of two basic voice qualities
(breathy and tense). In a second perceptual evaluation, the HTS-LF system was better
than the baseline system, although the difference between the two had been expected to
be more significant. A third experiment was conducted to evaluate the HTS-GPF system
and an improved HTS-LF system, in terms of speech naturalness, voice similarity
and intelligibility. The results showed that the HTS-GPF system performed similarly
to the baseline. However, the HTS-LF system was significantly outperformed by the
baseline. Finally, acoustic measurements were performed on the synthetic speech to
investigate the speech distortion in the HTS-LF system. The results indicated that a
problem in replicating the rapid variations of the vocal tract filter parameters at transitions
between voiced and unvoiced sounds is the most significant cause of speech
distortion. This problem encourages future work to further improve the system.
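For reference, the traditional impulse-train excitation that the abstract above contrasts with the LF-model can be sketched as a generic source-filter computation. This is only an illustrative simplification: it is neither the STRAIGHT vocoder nor an LF-model implementation, and the function names, the pitch period and the single filter coefficient are hypothetical values chosen for the example.

```python
def impulse_train(n_samples, period):
    """Return a unit impulse every `period` samples (a crude voiced source)."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def all_pole_filter(x, a):
    """Apply y[n] = x[n] - sum_k a[k] * y[n-k], with k starting at 1.

    `a` holds the feedback coefficients a[1..p]; this stands in for a
    spectral-envelope (vocal tract) filter purely for illustration.
    """
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


# Hypothetical settings: an impulse every 80 samples, one-pole filter.
source = impulse_train(400, 80)
speech_like = all_pole_filter(source, [-0.9])
```

In a sketch like this, replacing `impulse_train` with a richer glottal waveform is the point at which a model such as the LF-model would enter; how the thesis does this in detail is not specified in the abstract.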
Quality aspects of Internet telephony
Internet telephony has had a tremendous impact on how people communicate.
Many now maintain contact using some form of Internet telephony.
Therefore the motivation for this work has been to address the quality aspects
of real-world Internet telephony for both fixed and wireless telecommunication.
The focus has been on the quality aspects of voice communication,
since poor quality often leads to user dissatisfaction. The scope of the work
has been broad in order to address the main factors within IP-based voice
communication.
The first four chapters of this dissertation constitute the background
material. The first chapter outlines where Internet telephony is deployed
today. It also motivates the topics and techniques used in this research.
The second chapter provides the background on Internet telephony including
signalling, speech coding and voice Internetworking. The third chapter
focuses solely on quality measures for packetised voice systems and finally
the fourth chapter is devoted to the history of voice research.
The appendix of this dissertation constitutes the research contributions.
It includes an examination of the access network, focusing on how calls are
multiplexed in wired and wireless systems. Subsequently in the wireless
case, we consider how to handover calls from 802.11 networks to the cellular
infrastructure. We then consider the Internet backbone where most of our
work is devoted to measurements specifically for Internet telephony. The
applications of these measurements have been estimating telephony arrival
processes, measuring call quality, and quantifying the trend in Internet telephony
quality over several years. We also consider the end systems, since
they are responsible for reconstructing a voice stream given loss and delay
constraints. Finally we estimate voice quality using the ITU proposal PESQ
and the packet loss process.
The main contribution of this work is a systematic examination of Internet
telephony. We describe several methods to enable adaptable solutions
for maintaining consistent voice quality. We have also found that relatively
small technical changes can lead to substantial user quality improvements.
A second contribution of this work is a suite of software tools designed to
ascertain voice quality in IP networks. Some of these tools are in use within
commercial systems today.
Joint estimation of vocal tract and source parameters of a speech production model
This thesis describes algorithms developed to jointly estimate vocal tract shapes and source signals from real speech. The methodology was developed and evaluated using simple articulatory models of the vocal tract, coupled with lumped parametric models of the loss mechanisms in the tract.
The vocal tract is modelled by a five-parameter area function model [Lin, 1990]. Energy losses due to wall vibration and glottal resistance are modelled as a pole-zero filter placed at the glottis. A model described in [Laine, 1982] is used to approximate the lip radiation characteristic.
An articulatory-to-acoustic "linked codebook" of approximately 1600 shapes is generated and exhaustively searched to estimate the vocal tract parameters.
Glottal waveforms (input signals) are obtained by inverse filtering real speech using the estimated vocal tract parameters. The inverse filter is constructed using the estimated area function. A new method is proposed to fit the Liljencrants-Fant glottal flow model [Fant, Liljencrants and Lin, 1985] to the inverse filtered signals. Estimates of the parameters are found from both the inverse filtered signal and its derivative.
The described model successfully estimates articulatory parameters for artificial speech waveforms. Tests on recorded vowels suggest that the technique is applicable to real speech.
The technique has applications in the development of natural sounding speech synthesis, the treatment of speech disorders and the reduction of data bit rates in speech coding.
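The exhaustive search of a linked codebook mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state the acoustic representation or the distance measure, so the three-entry toy codebook, the formant-like acoustic vectors and the Euclidean distance below are all assumptions made for the example, and `search_codebook` is a hypothetical name.

```python
import math

# Hypothetical toy codebook: each entry links an articulatory shape
# (five area-function parameters) to an acoustic vector (here, three
# formant-like values in Hz). The real codebook has about 1600 shapes.
CODEBOOK = [
    ((0.5, 1.0, 4.0, 6.0, 2.0), (700.0, 1200.0, 2500.0)),
    ((3.0, 0.8, 1.5, 0.5, 1.0), (300.0, 2300.0, 3000.0)),
    ((1.0, 3.0, 0.6, 2.0, 0.4), (350.0, 800.0, 2400.0)),
]


def search_codebook(target, codebook):
    """Scan every entry and return the shape whose acoustic vector is
    closest to `target`, together with that distance.

    Euclidean distance is an assumption; the source does not name the
    measure used.
    """
    best_shape, best_dist = None, math.inf
    for shape, acoustic in codebook:
        dist = math.dist(target, acoustic)
        if dist < best_dist:
            best_shape, best_dist = shape, dist
    return best_shape, best_dist


# Example: an observed acoustic vector measured from a speech frame.
shape, dist = search_codebook((310.0, 2200.0, 2900.0), CODEBOOK)
```

An exhaustive scan is linear in the codebook size, which is inexpensive for a codebook of around 1600 entries.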
Overcoming the limitations of statistical parametric speech synthesis
At the time of beginning this thesis, statistical parametric speech synthesis (SPSS)
using hidden Markov models (HMMs) was the dominant synthesis paradigm within the
research community. SPSS systems are effective at generalising across the linguistic
contexts present in training data to account for inevitable unseen linguistic contexts at
synthesis-time, making these systems flexible and their performance stable. However,
HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning
that, despite great progress, the speech output is rarely confused for natural speech.
There are many hypotheses for the causes of reduced synthesis quality, and subsequent
required improvements, for HMM speech synthesis in the literature. However, until this
thesis, these hypothesised causes were rarely tested.
This thesis makes two types of contributions to the field of speech synthesis; each
of these appears in a separate part of the thesis. Part I introduces a methodology for
testing hypothesised causes of limited quality within HMM speech synthesis systems.
This investigation aims to identify what causes these systems to fall short of natural
speech. Part II uses the findings from Part I of the thesis to make informed improvements
to speech synthesis.
The usual approach taken to improve synthesis systems is to attribute reduced synthesis
quality to a hypothesised cause. A new system is then constructed with the aim
of removing that hypothesised cause. However, this is typically done without prior testing
to verify the hypothesised cause of reduced quality. As such, even if improvements
in synthesis quality are observed, there is no knowledge of whether a real underlying
issue has been fixed or if a more minor issue has been fixed. In contrast, I perform a
wide range of perceptual tests in Part I of the thesis to discover what the real underlying
causes of reduced quality in HMM synthesis are and the level to which they contribute.
Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements
to synthesis quality. Two well-motivated improvements to standard HMM
synthesis are investigated. The first of these improvements follows on from averaging
across differing linguistic contexts being identified as a major contributing factor to
reduced synthesis quality. This is a practice typically performed during decision tree
regression in HMM synthesis. Therefore a system which removes averaging across
differing linguistic contexts and instead performs averaging only across matching linguistic
contexts (called rich-context synthesis) is investigated. The second of the motivated
improvements follows the finding that the parametrisation (i.e., vocoding) of
speech, standard practice in SPSS, introduces a noticeable drop in quality before any
modelling is even performed. Therefore the hybrid synthesis paradigm is investigated.
These systems aim to remove the effect of vocoding by using SPSS to inform the selection
of units in a unit selection system. Both of the motivated improvements applied
in Part II are found to make significant gains in synthesis quality, demonstrating the
benefit of performing the style of perceptual testing conducted in the thesis.
Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm
This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the 'best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of 'buzzyness', may be introduced into the speech signal.
Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co-occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings.
Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion. The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processing distortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes, trained using the experimental data collected during the listening tests. The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence-level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes. The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence-level were less than those seen in the previous investigative experiments that made use of word-level stimuli. This suggests that the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community.