2,613 research outputs found

    Phoneme duration modelling for speaker verification

    Get PDF
    Higher-level features are considered to be a potential remedy against transmission line and cross-channel degradations, currently some of the biggest problems associated with speaker verification. Phoneme durations in particular are not altered by these factors; thus a robust duration model will be a particularly useful addition to traditional cepstral based speaker verification systems. In this dissertation we investigate the feasibility of phoneme durations as a feature for speaker verification. Simple speaker specific triphone duration models are created to statistically represent the phoneme durations. Durations are obtained from an automatic hidden Markov model (HMM) based automatic speech recognition system and are modeled using single mixture Gaussian distributions. These models are applied in a speaker verification system (trained and tested on the YOHO corpus) and found to be a useful feature, even when used in isolation. When fused with acoustic features, verification performance increases significantly. A novel speech rate normalization technique is developed in order to remove some of the inherent intra-speaker variability (due to differing speech rates). Speech rate variability has a negative impact on both speaker verification and automatic speech recognition. Although the duration modelling seems to benefit only slightly from this procedure, the fused system performance improvement is substantial. Other factors known to influence the duration of phonemes are incorporated into the duration model. Utterance final lengthening is known be a consistent effect and thus “position in sentence” is modeled. “Position in word” is also modeled since triphones do not provide enough contextual information. This is found to improve performance since some vowels’ duration are particularly sensitive to its position in the word. Data scarcity becomes a problem when building speaker specific duration models. By using information from available data, unknown durations can be predicted in an attempt to overcome the data scarcity problem. To this end we develop a novel approach to predict unknown phoneme durations from the values of known phoneme durations for a particular speaker, based on the maximum likelihood criterion. This model is based on the observation that phonemes from the same broad phonetic class tend to co-vary strongly, but that there is also significant cross-class correlations. This approach is tested on the TIMIT corpus and found to be more accurate than using back-off techniques.Dissertation (MEng)--University of Pretoria, 2009.Electrical, Electronic and Computer Engineeringunrestricte

    Production and perception of speaker-specific phonetic detail at word boundaries

    Get PDF
    Experiments show that learning about familiar voices affects speech processing in many tasks. However, most studies focus on isolated phonemes or words and do not explore which phonetic properties are learned about or retained in memory. This work investigated inter-speaker phonetic variation involving word boundaries, and its perceptual consequences. A production experiment found significant variation in the extent to which speakers used a number of acoustic properties to distinguish junctural minimal pairs e.g. 'So he diced them'—'So he'd iced them'. A perception experiment then tested intelligibility in noise of the junctural minimal pairs before and after familiarisation with a particular voice. Subjects who heard the same voice during testing as during the familiarisation period showed significantly more improvement in identification of words and syllable constituents around word boundaries than those who heard different voices. These data support the view that perceptual learning about the particular pronunciations associated with individual speakers helps listeners to identify syllabic structure and the location of word boundaries

    Acoustic correlates of linguistic rhythm: Perspectives

    Get PDF
    The empirical grounding of a typology of languages' rhythm is again a hot issue. The currently popular approach is based on the durations of vocalic and intervocalic intervals and their variability. Despite some successes, many questions remain. The main findings still need to be generalised to much larger corpora including many more languages. But a straightforward continuation of the current work faces many difficulties. Perspectives are outlined for future work, including proposals for the cross-linguistic control of speech rate, improvements on the statistical analyses, and prospects raised by automatic speech processing

    Assessing the Prosody of Non-Native Speakers of English: Measures and Feature Sets

    Get PDF
    In this paper, we describe a new database with audio recordings of non-native (L2) speakers of English, and the perceptual evaluation experiment conducted with native English speakers for assessing the prosody of each recording. These annotations are then used to compute the gold standard using different methods, and a series of regression experiments is conducted to evaluate their impact on the performance of a regression model predicting the degree of Abstract naturalness of L2 speech. Further, we compare the relevance of different feature groups modelling prosody in general (without speech tempo), speech rate and pauses modelling speech tempo (fluency), voice quality, and a variety of spectral features. We also discuss the impact of various fusion strategies on performance.Overall, our results demonstrate that the prosody of non-native speakers of English as L2 can be reliably assessed using supra- segmental audio features; prosodic features seem to be the most important ones

    Pauses and the temporal structure of speech

    Get PDF
    Natural-sounding speech synthesis requires close control over the temporal structure of the speech flow. This includes a full predictive scheme for the durational structure and in particuliar the prolongation of final syllables of lexemes as well as for the pausal structure in the utterance. In this chapter, a description of the temporal structure and the summary of the numerous factors that modify it are presented. In the second part, predictive schemes for the temporal structure of speech ("performance structures") are introduced, and their potential for characterising the overall prosodic structure of speech is demonstrated

    An acoustic investigation of Arabic vowels pronounced by Malay speakers

    Get PDF
    AbstractIn Malaysia, Arabic language is spoken, and commonly used among the Malays. Malays use Arabic in their daily life, such as during performing worship. Hence, in this paper, some of the Arabic vowels attributes are investigated, analyzed and initial findings are presented based on tokens articulated by Malay speakers as we can consider the spoken Arabic by Malays as one of the Arabic dialects. It is known that in Arabic language there are 28 consonants and 6 main vowels. Firstly, the duration, variability, and overlapping attributes are highlighted based on syllables of Consonant–Vowel with each syllable representing every Arabic consonant with the corresponding vowels. Next, the dispersion of each vowel is examined to be compared with each other along with the variability among vowels that may cause overlapping between vowels in the vowel-space. Results showed that the vowel overlapping occurred between short vowels and their long counterpart vowels. Furthermore, an investigation of the Arabic vowel duration is addressed as well, and duration analysis for all the vowels is discussed, followed by the analysis for each vowel separately. In addition, a comparison between long and short vowels is presented as well as comparison between high and low vowel is carried out

    A computational model for studying L1’s effect on L2 speech learning

    Get PDF
    abstract: Much evidence has shown that first language (L1) plays an important role in the formation of L2 phonological system during second language (L2) learning process. This combines with the fact that different L1s have distinct phonological patterns to indicate the diverse L2 speech learning outcomes for speakers from different L1 backgrounds. This dissertation hypothesizes that phonological distances between accented speech and speakers' L1 speech are also correlated with perceived accentedness, and the correlations are negative for some phonological properties. Moreover, contrastive phonological distinctions between L1s and L2 will manifest themselves in the accented speech produced by speaker from these L1s. To test the hypotheses, this study comes up with a computational model to analyze the accented speech properties in both segmental (short-term speech measurements on short-segment or phoneme level) and suprasegmental (long-term speech measurements on word, long-segment, or sentence level) feature space. The benefit of using a computational model is that it enables quantitative analysis of L1's effect on accent in terms of different phonological properties. The core parts of this computational model are feature extraction schemes to extract pronunciation and prosody representation of accented speech based on existing techniques in speech processing field. Correlation analysis on both segmental and suprasegmental feature space is conducted to look into the relationship between acoustic measurements related to L1s and perceived accentedness across several L1s. Multiple regression analysis is employed to investigate how the L1's effect impacts the perception of foreign accent, and how accented speech produced by speakers from different L1s behaves distinctly on segmental and suprasegmental feature spaces. Results unveil the potential application of the methodology in this study to provide quantitative analysis of accented speech, and extend current studies in L2 speech learning theory to large scale. Practically, this study further shows that the computational model proposed in this study can benefit automatic accentedness evaluation system by adding features related to speakers' L1s.Dissertation/ThesisDoctoral Dissertation Speech and Hearing Science 201

    The Effect of Speaking Rate on Audio and Visual Speech

    Get PDF
    The speed that an utterance is spoken affects both the duration of the speech and the position of the articulators. Consequently, the sounds that are produced are modified, as are the position and appearance of the lips, teeth, tongue and other visible articulators. We describe an experiment designed to measure the effect of variable speaking rate on audio and visual speech by comparing sequences of phonemes and dynamic visemes appearing in the same sentences spoken at different speeds. We find that both audio and visual speech production are affected by varying the rate of speech, however, the effect is significantly more prominent in visual speech
    • …
    corecore