The contribution of voice quality to the expression of politeness: an experimental study
This thesis investigates the role of voice quality in the expression of politeness under conditions of varying relative social status among Japanese male speakers. The thesis also sheds light on four important methodological issues: 1) experimental control of sociolinguistic variables, 2) elicitation of semi-natural spontaneous speech, 3) recording quality suitable for voice quality analysis, and 4) the use of direct waveform and spectrum measurement as a non-invasive method for measuring glottal characteristics related to perceived voice quality.

Japanese is believed to rely on what has been called "negative politeness" (formality and deference). Since explicitly expressing deference through Keigo (the Japanese system of honorifics) requires mastery of a highly complex system, in daily conversation this function may be taken over by vocal paralinguistics. Also, as the Keigo system is not meant to convey "positive politeness" (friendliness and solidarity), vocal paralinguistics may contribute to this aspect of politeness. High fundamental frequency (F0), often considered a universal cue of politeness, is more closely associated with femininity in Japanese, and this use of F0 is not observed in male speakers, who may employ other vocal cues to express politeness. This study therefore focuses on the voice quality of male speakers expressing politeness.

To obtain natural, unscripted utterances, the speech data were collected with the Map Task, which also allows the effect of manipulating relative social status differences among participants in the same community to be studied. For voice quality analysis, direct waveform and spectrum measurement (Hanson 1995) was employed. We mainly computed relative amplitudes of harmonics and formant peaks in the spectrum, as alternatives to certain well-known parameters used in previous studies. We also measured F0 and amplitude perturbations as possible indicators of voice quality.

An experiment was conducted to observe the alignment between the acoustic measures and the perceived politeness of both written and spoken versions of the utterances obtained from the Map Task. The results suggest two principal findings. First, Keigo does not play a role in conveying politeness in everyday conversation, but speakers showed politeness through voice quality variations. Second, in judging the politeness of test utterances, listeners reacted to waveform irregularity and to spectral characteristics in the third formant region. In particular, a speaker's spectral tilt range was inversely related to the correlation between politeness ratings and spectral tilt, and extremely large or small spectral tilt values were consistently perceived as inappropriate when addressing social superiors. These results are expected to contribute to both sociolinguistics and speech technology, such as speech synthesis.
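The "relative amplitudes of harmonics" family of measures can be made concrete with a minimal sketch. The snippet below is not the thesis's procedure; it is a rough illustration of one Hanson-style measure (H1-H2, the amplitude difference between the first two harmonics), assuming F0 is already known from a separate pitch tracker and that the harmonics fall close to FFT bins.

```python
import numpy as np

def h1_minus_h2(frame, sample_rate, f0):
    """Crude H1-H2 estimate (dB): relative amplitude of the first two
    harmonics, one member of the Hanson (1995) family of spectral
    measures. `f0` is assumed known from an external pitch tracker."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    def peak_db(target_hz, tol_hz=f0 / 4):
        # Take the strongest bin in a narrow band around the harmonic.
        band = (freqs > target_hz - tol_hz) & (freqs < target_hz + tol_hz)
        return 20 * np.log10(spectrum[band].max())

    return peak_db(f0) - peak_db(2 * f0)
```

On a synthetic vowel-like signal whose second harmonic is 20 dB below the first, the function recovers approximately 20 dB.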
What makes a voice masculine: physiological and acoustical correlates of women's ratings of men's vocal masculinity
Men's voices contain acoustic cues to body size and hormonal status, which have been found to affect women's ratings of speaker size, masculinity and attractiveness. However, the extent to which these voice parameters mediate the relationship between speakers' fitness-related features and listeners' judgments of their masculinity has not yet been investigated.
We audio-recorded 37 adult heterosexual males performing a range of speech tasks and asked 20 adult heterosexual female listeners to rate speakers' masculinity on the basis of their voices only. We then used a two-level (speaker within listener) path analysis to examine the relationships between the physiological (testosterone, height), acoustic (fundamental frequency or F0, and resonances or ΔF) and perceptual dimensions (listeners' ratings) of speakers' masculinity. Overall, results revealed that male speakers who were taller and had higher salivary testosterone levels also had lower F0 and ΔF, and were in turn rated as more masculine. The relationship between testosterone and perceived masculinity was essentially mediated by F0, while that of height and perceived masculinity was partially mediated by both F0 and ΔF.
These observations confirm that women listeners attend to sexually dimorphic voice cues to assess the masculinity of unseen male speakers. In turn, variation in these voice features correlates with speakers' variation in stature and hormonal status, highlighting the interdependence of these physiological, acoustic and perceptual dimensions.
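For readers unfamiliar with ΔF: formant spacing is commonly estimated by treating the vocal tract as a uniform tube closed at one end, so that the i-th formant sits near (2i - 1)·ΔF/2, and fitting ΔF to the measured formants by least squares. The sketch below illustrates that standard estimator; it is not necessarily the exact procedure used in the study above.

```python
import numpy as np

def estimate_delta_f(formants_hz):
    """Estimate formant spacing (deltaF) from measured formants F1..Fn,
    assuming a uniform tube closed at one end: Fi ~ (2i - 1) * deltaF / 2."""
    i = np.arange(1, len(formants_hz) + 1)
    predictors = (2 * i - 1) / 2.0
    f = np.asarray(formants_hz, dtype=float)
    # Least-squares fit through the origin: deltaF = sum(x*y) / sum(x*x)
    return float(np.dot(predictors, f) / np.dot(predictors, predictors))

# Example: formants of a roughly 17.5 cm vocal tract (deltaF ~ 1000 Hz)
print(estimate_delta_f([500, 1500, 2500, 3500]))  # -> 1000.0
```

Lower ΔF implies more widely spaced resonances are absent, i.e. a longer vocal tract, which is why ΔF serves as a size-related masculinity cue.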
Evaluation of glottal characteristics for speaker identification.
Based on the assumption that the physical characteristics of people's vocal apparatus cause their voices to have distinctive characteristics, this thesis reports on investigations into the use of the long-term average glottal response for speaker identification. The long-term average glottal response is a new feature that is obtained by overlaying successive vocal tract responses within an utterance.
The way in which the long-term average glottal response varies with accent and gender is examined using a population of 352 American English speakers from eight different accent regions. Descriptors are defined that characterize the shape of the long-term average glottal response. Factor analysis of the descriptors of the long-term average glottal responses shows that the most important factor contains significant contributions from descriptors composed of the coefficients of cubics fitted to the long-term average glottal response. Discriminant analysis demonstrates that the long-term average glottal response is potentially useful for classifying speakers according to their gender, but is not useful for distinguishing American accents.
The identification accuracy of the long-term average glottal response is compared with that obtained from vocal tract features. Identification experiments are performed using a speaker database containing utterances from twenty speakers of the digits zero to nine. Vocal tract features, which consist of cepstral coefficients, partial correlation coefficients and linear prediction coefficients, are shown to be more accurate than the long-term average glottal response. Despite analysis of the training data indicating that the long-term average glottal response was uncorrelated with the vocal tract features, various feature combinations gave insignificant improvements in identification accuracy.
The effect of noise and distortion on speaker identification is examined for each of the features. It is found that the identification performance of the long-term average glottal response is insensitive to noise compared with cepstral coefficients, partial correlation coefficients and the long-term average spectrum, but that it is highly sensitive to variations in the phase response of the speech transmission channel.
Before reporting on the identification experiments, the thesis introduces speech production, speech models and background to the various features used in the experiments. Investigations into the long-term average glottal response demonstrate that it approximates the glottal pulse convolved with the long-term average impulse response, and this relationship is verified using synthetic speech. Furthermore, the spectrum of the long-term average glottal response extracted from pre-emphasized speech is shown to be similar to the long-term average spectrum of pre-emphasized speech, but computationally much simpler.
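The "overlaying successive responses" operation can be pictured with a small sketch. This is a loose illustration, not the thesis's algorithm: it simply averages excitation-aligned frames, and it assumes the glottal-cycle onsets have already been located by an external pitch tracker.

```python
import numpy as np

def long_term_average_response(signal, period_starts, frame_len):
    """Overlay successive excitation-aligned frames and average them.

    A rough sketch of the long-term-average idea: frames aligned on
    glottal-cycle starts are averaged, so cycle-to-cycle variation
    tends toward its long-term mean. `period_starts` (cycle onsets)
    are assumed to come from an external pitch tracker."""
    frames = [signal[s:s + frame_len]
              for s in period_starts
              if s + frame_len <= len(signal)]
    return np.mean(frames, axis=0)
```

On a perfectly periodic signal the average reproduces one cycle exactly; on real speech, uncorrelated components are attenuated by the averaging.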
Improving wordspotting performance with limited training data
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (leaves 149-155). By Eric I-Chao Chang. Ph.D.
A novel framework for high-quality voice source analysis and synthesis
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.

The analysis, parameterization and modeling of voice source estimates obtained via inverse filtering of recorded speech are among the most challenging areas of speech processing, owing to the fact that humans produce a wide range of voice source realizations and that voice source estimates commonly contain artifacts due to non-linear, time-varying source-filter coupling. Currently, the most widely adopted representation of the voice source signal is the Liljencrants-Fant (LF) model, developed in late 1985. Because of its overly simplistic interpretation of voice source dynamics, the LF model can represent neither the fine temporal structure of glottal flow derivative realizations nor sufficient spectral richness to facilitate truly natural-sounding speech synthesis. In this thesis we have introduced Characteristic Glottal Pulse Waveform Parameterization and Modeling (CGPWPM), an entirely novel framework for voice source analysis, parameterization and reconstruction. In a comparative evaluation of CGPWPM and the LF model, we have demonstrated that the proposed method preserves higher levels of speaker-dependent information from the voice source estimates and realizes more natural-sounding speech synthesis. In general, we have shown that CGPWPM-based speech synthesis rates highly on the scale of absolute perceptual acceptability and that speech signals are faithfully reconstructed on a consistent basis, across speakers and genders. We have applied CGPWPM to voice quality profiling and to a text-independent voice quality conversion method. The proposed voice conversion method achieves the desired perceptual effects, and the modified speech remains as natural-sounding and intelligible as natural speech. In this thesis, we have also developed an optimal wavelet thresholding strategy for voice source signals, which is able to suppress aspiration noise while retaining both the slow and the rapid variations in the voice source estimate.
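Wavelet soft thresholding in general (not the optimized strategy developed in the thesis) can be sketched in a few lines. The example below uses a single-level Haar transform so it stays self-contained; real voice-source denoising would use a deeper decomposition and a data-driven threshold.

```python
import numpy as np

def haar_soft_denoise(x, threshold):
    """One-level Haar wavelet soft thresholding, a minimal stand-in for
    the thesis's optimized strategy (not reproduced here). Detail
    coefficients are shrunk toward zero, which is how aspiration-like
    noise would be suppressed; the approximation branch, carrying the
    slow variation of the signal, is left untouched."""
    x = np.asarray(x, dtype=float)
    assert len(x) % 2 == 0, "even-length input expected"
    s = 1 / np.sqrt(2.0)
    approx = s * (x[0::2] + x[1::2])   # low-pass: slow variation
    detail = s * (x[0::2] - x[1::2])   # high-pass: rapid variation + noise
    detail = np.sign(detail) * np.maximum(np.abs(detail) - threshold, 0.0)
    y = np.empty_like(x)
    y[0::2] = s * (approx + detail)    # inverse Haar transform
    y[1::2] = s * (approx - detail)
    return y
```

With a zero threshold the transform round-trips exactly; with a very large threshold every sample pair collapses to its mean, showing the two extremes of the shrinkage.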
Optimization of acoustic feature extraction from dysarthric speech
Thesis (Ph.D.), Harvard-MIT Division of Health Sciences and Technology, February 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 171-180).

Dysarthria is a motor speech disorder characterized by weak or uncoordinated movements of the speech musculature. While unfamiliar listeners struggle to understand speakers with severe dysarthria, familiar listeners are often able to comprehend with high accuracy. This observation implies that although the speech produced by an individual with dysarthria may appear distorted and unintelligible to the untrained listener, there must be a set of consistent acoustic cues that the familiar communication partner is able to interpret. While dysarthric speech has been characterized both acoustically and perceptually, most accounts tend to compare dysarthric productions to those of healthy controls rather than identify the set of reliable and consistently controlled segmental cues. This work aimed to elucidate possible recognition strategies used by familiar listeners by optimizing a model of human speech recognition, Stevens' Lexical Access from Features (LAFF) framework, for ten individual speakers with dysarthria (SWDs). The LAFF model is rooted in distinctive feature theory, with acoustic landmarks indicating changes in the manner of articulation. The acoustic correlates manifested around landmarks provide the identity of articulator-free (manner) and articulator-bound (place) features. SWDs created weaker consonantal landmarks, likely due to an inability to form complete closures in the vocal tract and to fully release consonantal constrictions. Identification of speaker-optimized acoustic correlate sets improved discrimination of each speaker's productions, as evidenced by increased sensitivity and specificity. While there was overlap between the types of correlates identified for healthy and dysarthric speakers, using the optimal sets of correlates identified for SWDs impaired discrimination of healthy speech. These results suggest that the combinations of correlates suggested for SWDs were specific to the individual and different from the segmental cues used by healthy individuals. Application of the LAFF model to dysarthric speech has potential clinical utility as a diagnostic tool, highlighting the fine-grained components of speech production that require intervention and quantifying the degree of impairment.

By Thomas M. DiCicco, Jr. Ph.D.
Production and perception of Libyan Arabic vowels
PhD Thesis

This study investigates the production and perception of Libyan Arabic (LA)
vowels by native speakers and the relation between these major aspects of speech. The
aim was to provide a detailed acoustic and auditory description of the vowels available in
the LA inventory and to compare the phonetic features of these vowels with those of
other Arabic varieties.
A review of the relevant literature showed that the LA dialect has not been
investigated experimentally. The small number of studies conducted in the last few
decades have been based mainly on impressionistic accounts. This study consists of two
main investigations: one concerned with vowel production and the other with vowel
perception. In terms of production, the study focused on gathering the data necessary to
define the vowel inventory of the dialect and to explore the qualitative and quantitative
characteristics of the vowels contained in this inventory. Twenty native speakers of LA
were recorded while reading target monosyllabic words in carrier sentences. Acoustic and
auditory analyses were used in order to provide a fairly comprehensive and objective
description of the vocalic system of LA. The results showed that phonologically short and
long Arabic vowels vary significantly in quality as well as quantity; a finding which is
increasingly being reported in experimental studies of other Arabic dialects. Short vowels
in LA tend to be more centralised than has been reported for other Arabic vowels,
especially with regard to short /a/. The study also looked at the effect of voicing in
neighbouring consonants and vowel height on vowel duration, and the findings were
compared to those of other varieties/languages.
The perception part of the study explored the extent to which listeners use the
same acoustic cues of length and quality in vowel perception that are evident in their
production. This involved the use of continua from synthesised vowels which varied
along duration and/or formant frequency dimensions. The continua were randomised and
played to 20 native listeners who took part in an identification task. The results show that,
when it comes to perception, Arabic listeners still rely mainly on quantity for the
distinction between phonologically long and short vowels. That is, when presented with
stimuli containing conflicting acoustic cues (formant frequencies that are typical of long
vowels but with short duration or formant frequencies that are typical of short vowels but
with long duration), listeners reacted consistently to duration rather than formant
frequency.
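The continuum design can be made concrete with a toy sketch. All numbers below are hypothetical placeholders, not the study's stimulus values; the point is only the factorial crossing of duration and quality, which is what produces the cue-conflict stimuli.

```python
import itertools

# Hypothetical endpoint values for a short/long vowel contrast; the
# actual synthesis parameters used in the study are not given here.
durations_ms = [60, 90, 120, 150, 180]     # short -> long
f1_steps_hz = [700, 650, 600, 550, 500]    # peripheral -> centralised quality

# Full factorial continuum: every duration paired with every quality
# step, so off-diagonal stimuli carry conflicting length/quality cues.
continuum = list(itertools.product(durations_ms, f1_steps_hz))
print(len(continuum))  # -> 25
```

A stimulus such as (180 ms, 700 Hz) pairs a long-vowel duration with a short-vowel quality; consistent "long" responses to it are what indicate that listeners weight duration over formant frequency.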
The results of both parts of the study provided some understanding of the LA
vowel system. The production data allowed for a detailed description of the phonetic
characteristics of LA vowels, and the acoustic space that they occupy was compared with
those of other Arabic varieties. The perception data showed that production and
perception do not always go hand in hand and that primary acoustic cues for the
identification of vowels are dialect- and language-specific.
A large-scale analysis of the acoustic-phonetic markers of speaker sex.
The research for this thesis lies within the field of speaker characterisation through the
acoustic-phonetic analysis of speech. The thesis consists of two parts:
1. An investigation of the acoustic-phonetic differences between the speech of women
and men;
2. An examination of the practicalities of automating the investigation to analyse a
large speech database.
The acoustic-phonetic markers of speaker sex examined here are the fundamental frequency,
the formant frequencies, and the relative amplitude of the first harmonic. The
aims of the investigation were, firstly, to establish to what extent these markers differentiate
between the sexes, and secondly, to examine the extent of between- and within-speaker
deviation from the female and male norms, or average values for each sex.
These points were investigated by an automated acoustic-phonetic analysis of the TIMIT
database, involving a data set of almost 16,000 segments of speech. An automated method
was developed to enable the signal processing and statistical analysis of a data set of this
size. The problems encountered in the analysis of a highly variable data source (i.e.
the acoustic speech waveform) are addressed.
Personalising synthetic voices for individuals with severe speech impairment.
Speech technology can help individuals with speech disorders to interact more easily. Many individuals with severe speech impairment, due to conditions such as Parkinson's disease or motor neurone disease, use voice output communication aids (VOCAs), which have synthesised or pre-recorded voice output. This voice output effectively becomes the voice of the individual and should therefore represent the user accurately.
Currently available personalisation of speech synthesis techniques require a large amount of data input, which is difficult to produce for individuals with severe speech impairment. These techniques also do not provide a solution for those individuals whose voices have begun to show the effects of dysarthria.
The thesis shows that Hidden Markov Model (HMM)-based speech synthesis is a promising approach to 'voice banking' for individuals, both before their condition causes deterioration of speech and once deterioration has begun. Data input requirements for building personalised voices with this technique are investigated using human listener judgements. The results show that 100 sentences is the minimum required to build a voice that is significantly different from an average voice model and shows some resemblance to the target speaker, although this amount depends on the speaker and the average model used.
A neural network analysis trained on extracted acoustic features revealed that spectral features had the most influence for predicting human listener judgements of similarity of synthesised speech to a target speaker. Accuracy of prediction significantly improves if other acoustic features are introduced and combined non-linearly.
These results were used to inform the reconstruction of personalised synthetic voices for speakers whose voices had begun to show the effects of their conditions. Using HMM-based synthesis, personalised synthetic voices were built from dysarthric speech, showing similarity to target speakers without recreating the impairment in the synthesised speech output.