115 research outputs found

    The contribution of voice quality to the expression of politeness: an experimental study

    This thesis investigates the role of voice quality in the expression of politeness under conditions of varying relative social status among Japanese male speakers. The thesis also sheds light on four important methodological issues: 1) experimental control of sociolinguistic aspects, 2) elicitation of semi-natural spontaneous speech, 3) recording quality suitable for voice quality analysis, and 4) the use of direct waveform and spectrum measurement as a non-invasive method for measuring glottal characteristics related to perceived voice quality.

    Japanese has been believed to rely on what has been called "negative politeness" (formality and deference). Since explicitly expressing deference under Keigo (the Japanese system of honorifics) requires mastery of a highly complex system, this function may be taken over by vocal paralinguistics in daily conversation. Also, as the Keigo system is not supposed to convey "positive politeness" (friendliness and solidarity), vocal paralinguistics may contribute to this aspect of politeness. High fundamental frequency (F0), which has been considered a universal cue of politeness, is more strongly associated with femininity in Japanese; this use of F0 is not observed in male speakers, who possibly employ other vocal cues to express politeness. This study therefore focuses on the voice quality of male speakers expressing politeness.

    To obtain natural, unscripted utterances, the speech data were collected with the Map Task. This task also makes it possible to study the effect of manipulating relative social status differences among participants in the same community. For voice quality analysis, direct waveform and spectrum measurement (Hanson 1995) was employed. We mainly computed relative amplitudes of harmonics and formant peaks in the spectrum, as alternatives to certain well-known parameters used in previous studies.
    We also measured F0 and amplitude perturbations as possible indicators of voice quality. An experiment was conducted to observe the alignment between the acoustic measures and the politeness perceived from both written and spoken versions of the utterances obtained from the Map Task. The results suggest two principal findings. First, Keigo does not play a role in conveying politeness in everyday conversation; instead, speakers showed politeness through voice quality variations. Second, in judging the politeness of test utterances, listeners reacted to the irregularity of the waveform and to spectral characteristics in the third formant region. In particular, a speaker's spectral tilt range was inversely related to the correlation coefficient between politeness ratings and spectral tilt, and extremely large or small spectral tilt values were consistently perceived as inappropriate when addressing social superiors. These results are expected to contribute to both sociolinguistics and speech technology, such as speech synthesis.
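    The relative harmonic and formant-peak amplitudes mentioned above can be illustrated with a short sketch. The following computes one such measure, H1 - A3 (the amplitude of the first harmonic minus that of the strongest harmonic in the third-formant region), a common spectral-tilt proxy. The function names, tolerances, and the synthetic -12 dB/octave test signal are illustrative assumptions, not the thesis's actual procedure.

```python
import numpy as np

def harmonic_amplitude(spectrum_db, freqs, target_hz, tol_hz=20.0):
    """Amplitude (dB) of the strongest spectral peak within tol_hz of target_hz."""
    band = (freqs >= target_hz - tol_hz) & (freqs <= target_hz + tol_hz)
    return spectrum_db[band].max()

def spectral_tilt_h1_a3(signal, fs, f0, f3):
    """H1 - A3 in dB: first-harmonic amplitude minus the strongest
    harmonic amplitude near the third formant (steeper tilt -> larger value)."""
    windowed = signal * np.hanning(len(signal))
    spectrum_db = 20 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    h1 = harmonic_amplitude(spectrum_db, freqs, f0)
    a3 = harmonic_amplitude(spectrum_db, freqs, f3, tol_hz=f0 / 2)
    return h1 - a3

# Synthetic "vowel": harmonics of 120 Hz with a -12 dB/octave roll-off.
fs, f0, dur = 16000, 120.0, 0.5
t = np.arange(int(fs * dur)) / fs
signal = sum((1.0 / k**2) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 26))
tilt = spectral_tilt_h1_a3(signal, fs, f0=f0, f3=2500.0)
```

    On this artificial source the measure recovers the imposed roll-off (roughly 50 dB between the first harmonic and the one near 2.5 kHz); on real speech the value also reflects glottal configuration, which is what makes it useful as a voice quality measure.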

    Evaluation of glottal characteristics for speaker identification.

    Based on the assumption that the physical characteristics of people's vocal apparatus cause their voices to have distinctive characteristics, this thesis reports on investigations into the use of the long-term average glottal response for speaker identification. The long-term average glottal response is a new feature that is obtained by overlaying successive vocal tract responses within an utterance. The way in which the long-term average glottal response varies with accent and gender is examined using a population of 352 American English speakers from eight different accent regions. Descriptors are defined that characterize the shape of the long-term average glottal response. Factor analysis of the descriptors of the long-term average glottal responses shows that the most important factor contains significant contributions from descriptors comprised of the coefficients of cubics fitted to the long-term average glottal response. Discriminant analysis demonstrates that the long-term average glottal response is potentially useful for classifying speakers according to their gender, but is not useful for distinguishing American accents.

    The identification accuracy of the long-term average glottal response is compared with that obtained from vocal tract features. Identification experiments are performed using a speaker database containing utterances from twenty speakers of the digits zero to nine. Vocal tract features, which consist of cepstral coefficients, partial correlation coefficients and linear prediction coefficients, are shown to be more accurate than the long-term average glottal response. Despite analysis of the training data indicating that the long-term average glottal response was uncorrelated with the vocal tract features, various feature combinations gave insignificant improvements in identification accuracy. The effect of noise and distortion on speaker identification is examined for each of the features. It is found that the identification performance of the long-term average glottal response is insensitive to noise compared with cepstral coefficients, partial correlation coefficients and the long-term average spectrum, but that it is highly sensitive to variations in the phase response of the speech transmission channel.

    Before reporting on the identification experiments, the thesis introduces speech production, speech models and background to the various features used in the experiments. Investigations into the long-term average glottal response demonstrate that it approximates the glottal pulse convolved with the long-term average impulse response, and this relationship is verified using synthetic speech. Furthermore, the spectrum of the long-term average glottal response extracted from pre-emphasized speech is shown to be similar to the long-term average spectrum of pre-emphasized speech, but computationally much simpler.
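    The "overlaying" idea behind the long-term average glottal response can be sketched in a few lines: cut the signal into consecutive period-length frames and average them, so the repeated glottal component survives while uncorrelated components cancel. The fixed-period alignment and crude pulse shape below are illustrative simplifications, not the thesis's actual extraction method.

```python
import numpy as np

def long_term_average_response(signal, period_samples):
    """Average consecutive period-length frames of a (near-)periodic signal.
    The repeated component is preserved; uncorrelated noise averages out."""
    n_frames = len(signal) // period_samples
    frames = signal[: n_frames * period_samples].reshape(n_frames, period_samples)
    return frames.mean(axis=0)

rng = np.random.default_rng(0)
period = 160                      # e.g. a 100 Hz pitch period at 16 kHz
pulse = np.zeros(period)
pulse[:40] = np.hanning(40)       # crude stand-in for a glottal pulse shape
clean = np.tile(pulse, 200)       # 200 identical periods
noisy = clean + 0.5 * rng.standard_normal(clean.size)
avg = long_term_average_response(noisy, period)
err = np.abs(avg - pulse).max()   # averaging 200 frames shrinks the noise
```

    With 200 frames the additive noise is attenuated by roughly a factor of 14 (the square root of the frame count), which is why the averaged response exposes glottal structure that individual periods do not.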

    Improving wordspotting performance with limited training data

    Thesis (Ph.D.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (leaves 149-155). By Eric I-Chao Chang.

    Optimization of acoustic feature extraction from dysarthric speech

    Thesis (Ph.D.) -- Harvard-MIT Division of Health Sciences and Technology, February 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 171-180).

    Dysarthria is a motor speech disorder characterized by weak or uncoordinated movements of the speech musculature. While unfamiliar listeners struggle to understand speakers with severe dysarthria, familiar listeners are often able to comprehend with high accuracy. This observation implies that although the speech produced by an individual with dysarthria may appear distorted and unintelligible to the untrained listener, there must be a set of consistent acoustic cues that the familiar communication partner is able to interpret. While dysarthric speech has been characterized both acoustically and perceptually, most accounts tend to compare dysarthric productions to those of healthy controls rather than identify the set of reliable and consistently controlled segmental cues. This work aimed to elucidate possible recognition strategies used by familiar listeners by optimizing a model of human speech recognition, Stevens' Lexical Access from Features (LAFF) framework, for ten individual speakers with dysarthria (SWDs). The LAFF model is rooted in distinctive feature theory, with acoustic landmarks indicating changes in the manner of articulation. The acoustic correlates manifested around landmarks provide the identity of articulator-free (manner) and articulator-bound (place) features.

    SWDs created weaker consonantal landmarks, likely due to an inability to form complete closures in the vocal tract and to fully release consonantal constrictions. Identification of speaker-optimized sets of acoustic correlates improved discrimination of each speaker's productions, as evidenced by increased sensitivity and specificity. While there was overlap between the types of correlates identified for healthy and dysarthric speakers, using the optimal sets of correlates identified for SWDs impaired discrimination of healthy speech. These results suggest that the combinations of correlates identified for SWDs were specific to the individual and different from the segmental cues used by healthy individuals. Application of the LAFF model to dysarthric speech has potential clinical utility as a diagnostic tool, highlighting the fine-grained components of speech production that require intervention and quantifying the degree of impairment. By Thomas M. DiCicco, Jr.
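    The landmark idea at the heart of the LAFF framework can be illustrated with a toy detector that marks frames where energy changes abruptly, as at consonantal closures and releases. The single broadband energy measure and the 9 dB threshold below are invented for illustration; the actual model uses multi-band criteria.

```python
import numpy as np

def consonantal_landmarks(signal, fs, frame_ms=10.0, jump_db=9.0):
    """Toy landmark detector: return (time, label) pairs where frame
    energy jumps by at least jump_db ('+g' onset, '-g' offset)."""
    hop = int(fs * frame_ms / 1000)
    n = len(signal) // hop
    frames = signal[: n * hop].reshape(n, hop)
    energy_db = 10 * np.log10(np.mean(frames**2, axis=1) + 1e-10)
    jumps = np.diff(energy_db)
    return [(i * hop / fs, "+g" if d > 0 else "-g")
            for i, d in enumerate(jumps, start=1) if abs(d) >= jump_db]

# A 200 Hz tone that is loud only between 0.3 s and 0.6 s: the detector
# should mark an onset at 0.3 s and an offset at 0.6 s.
fs = 16000
t = np.arange(fs) / fs
sig = np.where((t > 0.3) & (t < 0.6),
               np.sin(2 * np.pi * 200 * t),
               0.001 * np.sin(2 * np.pi * 200 * t))
marks = consonantal_landmarks(sig, fs)
```

    Weaker consonantal landmarks, as produced by the SWDs in this work, correspond to smaller energy jumps at closures and releases, which a fixed threshold like the one above would begin to miss.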

    Automatic acoustic analysis of waveform perturbations


    Production and perception of Libyan Arabic vowels

    PhD Thesis. This study investigates the production and perception of Libyan Arabic (LA) vowels by native speakers and the relation between these major aspects of speech. The aim was to provide a detailed acoustic and auditory description of the vowels available in the LA inventory and to compare the phonetic features of these vowels with those of other Arabic varieties. A review of the relevant literature showed that the LA dialect has not been investigated experimentally; the small number of studies conducted in the last few decades have been based mainly on impressionistic accounts.

    This study consists of two main investigations: one concerned with vowel production and the other with vowel perception. In terms of production, the study focused on gathering the data necessary to define the vowel inventory of the dialect and to explore the qualitative and quantitative characteristics of the vowels contained in this inventory. Twenty native speakers of LA were recorded while reading target monosyllabic words in carrier sentences. Acoustic and auditory analyses were used in order to provide a fairly comprehensive and objective description of the vocalic system of LA. The results showed that phonologically short and long Arabic vowels vary significantly in quality as well as quantity, a finding which is increasingly being reported in experimental studies of other Arabic dialects. Short vowels in LA tend to be more centralised than has been reported for other Arabic vowels, especially with regard to short /a/. The study also looked at the effect of voicing in neighbouring consonants and of vowel height on vowel duration, and the findings were compared to those of other varieties and languages.

    The perception part of the study explored the extent to which listeners use the same acoustic cues of length and quality in vowel perception that are evident in their production. This involved the use of continua of synthesised vowels which varied along duration and/or formant frequency dimensions. The continua were randomised and played to 20 native listeners who took part in an identification task. The results show that, in perception, Arabic listeners rely mainly on quantity for the distinction between phonologically long and short vowels. That is, when presented with stimuli containing conflicting acoustic cues (formant frequencies typical of long vowels but with short duration, or formant frequencies typical of short vowels but with long duration), listeners responded consistently to duration rather than formant frequency.

    The results of both parts of the study provide some understanding of the LA vowel system. The production data allowed for a detailed description of the phonetic characteristics of LA vowels, and the acoustic space that they occupy was compared with those of other Arabic varieties. The perception data showed that production and perception do not always go hand in hand, and that the primary acoustic cues for the identification of vowels are dialect- and language-specific.
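    The quantity-over-quality finding can be expressed as a simple cue-weighting model in which each cue votes for "long" or "short" and duration carries most of the weight. All parameter values below (boundary, typical F1 values, weights) are invented for illustration, not fitted to the thesis data.

```python
def classify_vowel_length(duration_ms, f1_hz, *, duration_boundary_ms=120.0,
                          f1_long_hz=700.0, f1_short_hz=550.0, w_duration=0.8):
    """Combine a duration cue and a quality (F1) cue with a heavy weight
    on duration, mirroring listeners who rely mainly on quantity."""
    dur_vote = 1.0 if duration_ms >= duration_boundary_ms else -1.0
    # Quality vote: F1 closer to the typical long-vowel value counts as 'long'.
    qual_vote = 1.0 if abs(f1_hz - f1_long_hz) < abs(f1_hz - f1_short_hz) else -1.0
    score = w_duration * dur_vote + (1.0 - w_duration) * qual_vote
    return "long" if score > 0 else "short"

# Conflicting-cue stimulus: long-vowel quality (F1 = 700 Hz) but short
# duration (80 ms). With duration weighted heavily, duration wins.
print(classify_vowel_length(80.0, 700.0))   # -> "short"
```

    Under this weighting, only duration can flip the decision, which reproduces the behaviour observed in the identification task; a listener from a quality-dominant language would correspond to a much smaller duration weight.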

    A large-scale analysis of the acoustic-phonetic markers of speaker sex.

    The research for this thesis lies within the field of speaker characterisation through the acoustic-phonetic analysis of speech. The thesis consists of two parts: 1. an investigation of the acoustic-phonetic differences between the speech of women and men; 2. an examination of the practicalities of automating the investigation to analyse a large speech database. The acoustic-phonetic markers of speaker sex examined here are the fundamental frequency, the formant frequencies, and the relative amplitude of the first harmonic. The aims of the investigation were, firstly, to establish to what extent these markers differentiate between the sexes, and secondly, to examine the extent of between- and within-speaker deviation from the female and male norms, or average values for each sex. These points were investigated by an automated acoustic-phonetic analysis of the TIMIT database, involving a data set of almost 16,000 segments of speech. An automated method was developed to enable the signal processing and statistical analysis of a data set of this size. The problems to be encountered in the analysis of a highly variable data source (i.e. the acoustic speech waveform) are addressed.
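    Automating a fundamental-frequency measurement over thousands of segments requires a per-segment estimator that runs without manual intervention. The following autocorrelation sketch illustrates the kind of measure such a pipeline applies; a robust system would add voicing decisions and octave-error checks, which are omitted here.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of a voiced frame via the autocorrelation peak
    within the plausible pitch-period range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

# Two synthetic 40 ms frames near typical female and male F0 values.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
female = np.sin(2 * np.pi * 210 * t)
male = np.sin(2 * np.pi * 115 * t)
f0_female = estimate_f0(female, fs)   # close to 210 Hz
f0_male = estimate_f0(male, fs)       # close to 115 Hz
```

    Applied frame by frame over a corpus the size of TIMIT, estimates like these yield the per-sex F0 distributions whose separation and overlap the thesis quantifies.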

    Personalising synthetic voices for individuals with severe speech impairment.

    Speech technology can help individuals with speech disorders to interact more easily. Many individuals with severe speech impairment, due to conditions such as Parkinson's disease or motor neurone disease, use voice output communication aids (VOCAs), which have synthesised or pre-recorded voice output. This voice output effectively becomes the voice of the individual and should therefore represent the user accurately. Currently available speech synthesis personalisation techniques require a large amount of input data, which is difficult to produce for individuals with severe speech impairment. These techniques also do not provide a solution for those individuals whose voices have begun to show the effects of dysarthria.

    The thesis shows that Hidden Markov Model (HMM)-based speech synthesis is a promising approach to 'voice banking' for individuals both before their condition causes deterioration of the speech and once deterioration has begun. The data input requirements for building personalised voices with this technique are investigated using human listener judgements. The results show that 100 sentences is the minimum required to build a voice that differs significantly from an average voice model and shows some resemblance to the target speaker; this amount depends on the speaker and the average model used. A neural network trained on extracted acoustic features revealed that spectral features had the most influence in predicting human listener judgements of the similarity of synthesised speech to a target speaker. Accuracy of prediction improves significantly if other acoustic features are introduced and combined non-linearly. These results were used to inform the reconstruction of personalised synthetic voices for speakers whose voices had begun to show the effects of their conditions. Using HMM-based synthesis, personalised synthetic voices were built from dysarthric speech that showed similarity to the target speakers without recreating the impairment in the synthesised output.
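    The claim that prediction of listener judgements improves when features are combined non-linearly can be demonstrated in miniature with least squares rather than a neural network: on data where the listener score depends on an interaction between cues, adding the non-linear (product) term reduces the fitting error. The feature names and the synthetic scoring function below are invented for illustration and are unrelated to the thesis's actual data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
spectral = rng.uniform(0, 1, n)   # stand-in for spectral distance features
prosodic = rng.uniform(0, 1, n)   # stand-in for F0/duration features
# Synthetic listener similarity score: depends non-linearly on the two cues.
score = 1.0 - spectral + 0.5 * spectral * prosodic + 0.05 * rng.standard_normal(n)

def fit_mse(features, target):
    """Least-squares fit with an intercept; returns mean squared error."""
    X = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean((X @ coef - target) ** 2))

mse_spectral = fit_mse([spectral], score)
mse_combined = fit_mse([spectral, prosodic, spectral * prosodic], score)
# The model with the extra features and the non-linear interaction term
# fits the scores better than the spectral-only model.
```

    A neural network, as used in the thesis, learns such interaction terms automatically instead of requiring them to be specified by hand.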