
    CLASSIFICATION OF VISEMES USING VISUAL CUES

    Studies have shown that visual features extracted from the lips of a speaker (visemes) can be used to automatically classify the visual representation of phonemes. Different visual features were extracted from the audio-visual recordings of a set of phonemes and used to define Linear Discriminant Analysis (LDA) functions to classify the phonemes. Audio-visual recordings from 18 native speakers of American English for 12 Vowel-Consonant-Vowel (VCV) sounds were obtained using the consonants /b,v,w,ð,d,z/ and the vowels /ɑ,i/. The visual features used in this study were related to the lip height, lip width, motion in the upper lip, and the rate at which the lips move while producing the VCV sequences. Features extracted from half of the speakers were used to design the classifiers, and features extracted from the other half were used to test them. When each VCV sound was treated as an independent class, resulting in 12 classes, the percentage of correct recognition was 55.3% in the training set and 43.1% in the testing set. This percentage increased as classes were merged based on the level of confusion appearing between them in the results. When the same consonants with different vowels were treated as one class, resulting in 6 classes, the percentage of correct classification was 65.2% in the training set and 61.6% in the testing set. This is consistent with psycho-visual experiments in which subjects were unable to distinguish between visemes associated with VCV words with the same consonant but different vowels. When the VCV sounds were grouped into 3 classes, the percentage of correct classification was 84.4% in the training set and 81.1% in the testing set. In the second part of the study, linear discriminant functions were developed for every speaker, resulting in 18 different sets of LDA functions. For every speaker, five VCV utterances were used to design the LDA functions, and three different VCV utterances were used to test these functions. For the training data, the range of correct classification across the 18 speakers was 90-100%, with an average of 96.2%. For the testing data, the range of correct classification was 50-86%, with an average of 68%. A step-wise linear discriminant analysis evaluated the contribution of different features to the discrimination problem. The analysis indicated that classifiers using only the top 7 features had a performance drop of 2-5%. The top 7 features were related to the shape of the mouth and the rate of motion of the lips while the consonant in the VCV sequence was being produced. Results of this work showed that visual features extracted from the lips can separate the visual representation of phonemes into different classes
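
    As a purely illustrative sketch (not the authors' code), the snippet below shows how lip-derived feature vectors could be classified with LDA under a speaker-disjoint train/test split like the one described above; the feature values, feature count, and class layout are synthetic stand-ins.

```python
# Hypothetical sketch: LDA classification of 12 VCV classes from lip features,
# training on half of the speakers and testing on the other half.
# All feature values below are synthetic placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_speakers, n_classes, n_feats = 18, 12, 10   # feature count is assumed

X, y, speaker = [], [], []
for s in range(n_speakers):
    for c in range(n_classes):
        # placeholders for lip height, lip width, upper-lip motion, lip rate, ...
        X.append(rng.normal(loc=c, scale=2.0, size=(5, n_feats)))
        y.extend([c] * 5)
        speaker.extend([s] * 5)
X, y, speaker = np.vstack(X), np.array(y), np.array(speaker)

train = speaker < n_speakers // 2             # classifier designed on one half
test = ~train                                 # and tested on the other half

lda = LinearDiscriminantAnalysis().fit(X[train], y[train])
print("train accuracy:", accuracy_score(y[train], lda.predict(X[train])))
print("test accuracy:", accuracy_score(y[test], lda.predict(X[test])))
```

    Merging confusable classes, as described in the abstract, would amount to remapping the labels in y (for example, collapsing same-consonant classes) before refitting.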

    The shadow of a doubt? Evidence for perceptuo-motor linkage during auditory and audiovisual close-shadowing

    One classical argument in favor of a functional role of the motor system in speech perception comes from the close-shadowing task, in which a subject has to identify and repeat an auditory speech stimulus as quickly as possible. The fact that close shadowing can occur very rapidly, much faster than manual identification of the speech target, is taken to suggest that perceptually-induced speech representations are already shaped in a motor-compatible format. Another argument is provided by audiovisual interactions, often interpreted as referring to a multisensory-motor framework. In this study, we attempted to combine these two paradigms by testing whether the visual modality could speed motor responses in a close-shadowing task. To this aim, both oral and manual responses were evaluated during the perception of auditory and audio-visual speech stimuli, clear or embedded in white noise. Overall, oral responses were faster than manual ones, but they were also less accurate in noise, which suggests that the motor representations evoked by the speech input may still be coarse at a first processing stage. In the presence of acoustic noise, the audiovisual modality led to both faster and more accurate responses than the auditory modality. However, no interaction was observed between modality and response. Altogether, these results are interpreted within a two-stage sensory-motor framework, in which the auditory and visual streams are integrated with each other and with internally generated motor representations before a final decision becomes available

    Real-Time Contrast Enhancement to Improve Speech Recognition

    An algorithm that operates in real-time to enhance the salient features of speech is described and its efficacy is evaluated. The Contrast Enhancement (CE) algorithm implements dynamic compressive gain and lateral inhibitory sidebands across channels in a modified winner-take-all circuit, which together produce a form of suppression that sharpens the dynamic spectrum. Normal-hearing listeners identified spectrally smeared consonants (VCVs) and vowels (hVds) in quiet and in noise. Consonant and vowel identification, especially in noise, were improved by the processing. The amount of improvement did not depend on the degree of spectral smearing or talker characteristics. For consonants, when results were analyzed according to phonetic feature, the most consistent improvement was for place of articulation. This is encouraging for hearing aid applications because confusions between consonants differing in place are a persistent problem for listeners with sensorineural hearing loss
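
    The following is a rough sketch, not the published CE algorithm: it applies a compressive gain to a short-time spectrum and then an across-channel centre-surround kernel as a stand-in for the lateral inhibitory sidebands; the kernel weights, compression exponent, and STFT settings are assumptions.

```python
# Illustrative sketch of spectral contrast enhancement: compressive gain plus
# across-channel lateral inhibition, which sharpens peaks in the spectrum.
# Weights and parameters are assumed, not taken from the published algorithm.
import numpy as np
from scipy.signal import stft

def contrast_enhance(x, fs, compress=0.3):
    f, t, S = stft(x, fs=fs, nperseg=512)
    mag = np.abs(S) ** compress                      # dynamic compressive gain
    # excitatory centre with inhibitory sidebands across frequency channels
    kernel = np.array([-0.1, -0.2, 1.0, -0.2, -0.1])
    sharpened = np.apply_along_axis(
        lambda chan: np.convolve(chan, kernel, mode="same"), 0, mag)
    return f, t, np.maximum(sharpened, 0.0)          # half-wave rectification

fs = 16000
x = np.random.randn(fs)                              # stand-in for a VCV token
f, t, enhanced = contrast_enhance(x, fs)
print(enhanced.shape)
```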

    Acoustic analyses and perceptual data on anticipatory labial coarticulation in adults and children

    This is the published version, also available here: http://dx.doi.org/10.1121/1.394917. The present study investigated anticipatory labial coarticulation in the speech of adults and children. CV syllables, composed of [s], [t], and [d] before [i] and [u], were produced by four adult speakers and eight child speakers aged 3-7 years. Each stimulus was computer edited to include only the aperiodic portion of fricative-vowel and stop-vowel syllables. LPC spectra were then computed for each excised segment. Analyses of the effect of the following vowel on the spectral peak associated with the second formant frequency and on the characteristic spectral prominence for each consonant were performed. Perceptual data were obtained by presenting the aperiodic consonantal segments to subjects who were instructed to identify the following vowel as [i] or [u]. Both the acoustic and the perceptual data show strong coarticulatory effects for the adults and comparable, although less consistent, coarticulation in the speech stimuli of the children. The results are discussed in terms of the articulatory and perceptual aspects of coarticulation in language learning
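
    A minimal sketch of the kind of LPC analysis described above, assuming a synthetic segment and arbitrary analysis settings (sampling rate, LPC order, and F2 search band are not taken from the study):

```python
# Hypothetical sketch: compute an LPC spectrum for an excised aperiodic
# consonant segment and locate the spectral peak in a rough F2 region.
import numpy as np
import librosa

def lpc_spectrum(segment, sr, order=14, n_fft=1024):
    a = librosa.lpc(segment, order=order)             # LPC coefficients
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    mag = 1.0 / (np.abs(np.fft.rfft(a, n=n_fft)) + 1e-9)  # all-pole response
    return freqs, 20 * np.log10(mag)

sr = 10000
segment = np.random.randn(int(0.05 * sr)).astype(np.float32)  # stand-in for frication
freqs, spec_db = lpc_spectrum(segment, sr)
band = (freqs > 1000) & (freqs < 3000)                # assumed F2 search region
f2_peak = freqs[band][np.argmax(spec_db[band])]
print(f"spectral peak near F2: {f2_peak:.0f} Hz")
```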

    Segmental alignment of English syllables with singleton and cluster onsets

    Recent research has shown fresh evidence that consonant and vowel are synchronised at the syllable onset, as predicted by a number of theoretical models. The finding was made by using a minimal contrast paradigm to determine segment onset in Mandarin CV syllables, which differed from the conventional method of detecting gesture onset with a velocity threshold [1]. It has remained unclear, however, if CV co-onset also occurs between the nucleus vowel and a consonant cluster, as predicted by the articulatory syllable model [2]. This study applied the minimal contrast paradigm to British English in both CV and clusterV (CLV) syllables, and analysed the spectral patterns with signal chopping in conjunction with recurrent neural networks (RNN) with long short-term memory (LSTM) [3]. Results show that vowel onset is synchronised with the onset of the first consonant in a cluster, thus supporting the articulatory syllable model
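
    A minimal sketch of an LSTM that labels chopped spectral frames, in the spirit of the RNN/LSTM analysis cited in [3]; the input dimensionality, hidden size, and class set are assumptions, not the study's configuration.

```python
# Hypothetical sketch: per-frame labelling of chopped spectral frames with an
# LSTM, e.g. to locate where vowel-related structure begins in CV vs CLV tokens.
import torch
import torch.nn as nn

class FrameLabeller(nn.Module):
    def __init__(self, n_bins=40, hidden=64, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)      # e.g. consonant / vowel / silence

    def forward(self, x):                            # x: (batch, frames, n_bins)
        h, _ = self.lstm(x)
        return self.out(h)                           # per-frame class scores

model = FrameLabeller()
dummy = torch.randn(8, 120, 40)                      # 8 tokens, 120 frames, 40 bins
scores = model(dummy)
print(scores.shape)                                  # torch.Size([8, 120, 3])
```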

    Automated nasal feature detection for the lexical access from features project

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (leaves 150-151). The focus of this thesis was the design, implementation, and evaluation of a set of automated algorithms to detect nasal consonants from the speech waveform in a distinctive feature-based speech recognition system. The study used a VCV database of over 450 utterances recorded from three speakers, two male and one female. The first stage of processing for each speech waveform included automated 'pivot' estimation using the Consonant Landmark Detector; these 'pivots' were considered possible sonorant closures and releases in further analyses. Estimated pivots were analyzed acoustically for nasal murmur and vowel-nasal boundary characteristics. For the nasal murmur, the analyzed cues included the presence of a low-frequency resonance in the short-time spectra, stability in the signal energy, and a characteristic spectral tilt. The acoustic cues for the nasal boundary measured the change in the energy of the first harmonic and the net energy change of the 0-350 Hz and 350-1000 Hz frequency bands around the pivot time. The results of the acoustic analyses were translated into a simple set of general acoustic criteria that detected 98% of true nasal pivots. The high detection rate was partially offset by a relatively large number of false positives: 16% of all non-nasal pivots were also detected as showing characteristics of the nasal murmur and nasal boundary. The advantage of the presented algorithms is in their consistency and accuracy across users and contexts, and their unlimited applicability to spontaneous speech. By Neira Hajro, M.Eng.
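
    The sketch below illustrates one of the boundary cues in a hypothetical form: the net energy change of the 0-350 Hz and 350-1000 Hz bands measured just before and just after a candidate pivot. The window lengths and STFT settings are assumptions, not the thesis implementation.

```python
# Hypothetical sketch: band energy change around a candidate pivot as a cue
# for the vowel-nasal boundary. Window and frame settings are assumed.
import numpy as np
from scipy.signal import stft

def band_energy(spec_frame, freqs, lo, hi):
    band = (freqs >= lo) & (freqs < hi)
    return 10 * np.log10(np.sum(np.abs(spec_frame[band]) ** 2) + 1e-12)

def boundary_cues(x, fs, pivot_time, win=0.02):
    freqs, times, S = stft(x, fs=fs, nperseg=int(0.01 * fs))
    before = np.argmin(np.abs(times - (pivot_time - win)))
    after = np.argmin(np.abs(times - (pivot_time + win)))
    low = band_energy(S[:, after], freqs, 0, 350) - band_energy(S[:, before], freqs, 0, 350)
    mid = band_energy(S[:, after], freqs, 350, 1000) - band_energy(S[:, before], freqs, 350, 1000)
    return low, mid                                   # dB change in each band

fs = 16000
x = np.random.randn(fs)                               # stand-in for a VCV utterance
print(boundary_cues(x, fs, pivot_time=0.5))
```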

    Acoustic analysis of Sindhi speech - a precursor for an ASR system

    The functional and formative properties of speech sounds are usually referred to as acoustic-phonetics in linguistics. This research aims to demonstrate acoustic-phonetic features of the elemental sounds of Sindhi, which is a branch of the Indo-European family of languages mainly spoken in the Sindh province of Pakistan and in some parts of India. In addition to the available articulatory-phonetic knowledge, acoustic-phonetic knowledge has been classified for the identification and classification of Sindhi language sounds. Determining the acoustic features of the language sounds helps to bring together the sounds with similar acoustic characteristics under the name of one natural class of meaningful phonemes. The obtained acoustic features and corresponding statistical results for a particular natural class of phonemes provide a clear understanding of the meaningful phonemes of Sindhi and also help to eliminate redundant sounds present in the inventory. At present Sindhi includes nine redundant, three interchanging, three substituting, and three confused pairs of consonant sounds. Some of the unique acoustic-phonetic contributions of this study are determining the acoustic features of the large number of contrastive voiced implosives of Sindhi and the acoustic impact of the language's flexibility in terms of the insertion and digestion of the short vowels in the utterance. In addition to this, the issue of the presence of the affricate class of sounds and the diphthongs in Sindhi is addressed. The compilation of the meaningful phoneme set of the language by learning the sounds' acoustic-phonetic features serves one of the major goals of this study, because twelve such sounds of Sindhi are studied that are not yet part of the language alphabet. The main acoustic features learned for the phonological structures of Sindhi are the fundamental frequency, the formants, and the duration, along with the analysis of the obtained acoustic waveforms, the formant tracks, and the computer-generated spectrograms. The impetus for this research comes from the fact that detailed knowledge of the sound characteristics of the language elements has a broad variety of applications, from developing accurate synthetic speech production systems to modeling robust speaker-independent speech recognizers. The major research achievements and contributions of this study include: the compilation and classification of the elemental sounds of Sindhi; comprehensive measurement of the acoustic features of the language sounds, suitable to be incorporated into the design of a Sindhi ASR system; an understanding of the dialect-specific acoustic variation of the elemental sounds of Sindhi; a speech database comprising voice samples of native Sindhi speakers; identification of the language's redundant, substituting, and interchanging pairs of sounds; and identification of the language's sounds that can potentially lead to segmentation and recognition errors in a Sindhi ASR system design. These achievements create the fundamental building blocks for future work to design a state-of-the-art prototype: a gender- and environment-independent, continuous and conversational ASR system for Sindhi
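
    As an illustrative sketch of the kinds of measurements mentioned (fundamental frequency, formants), the snippet below estimates F0 with pYIN and approximates formant frequencies from LPC root angles; the analysis settings and the synthetic test tone are assumptions, not the study's procedure.

```python
# Hypothetical sketch: F0 via pYIN and rough formant estimates from LPC roots.
# Settings (F0 range, LPC order, sampling rate) are assumed for illustration.
import numpy as np
import librosa

def f0_and_formants(y, sr, lpc_order=12):
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
    a = librosa.lpc(y, order=lpc_order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]   # keep upper half-plane
    formants = sorted(np.angle(roots) * sr / (2 * np.pi))
    return np.nanmean(f0), formants[:3]                  # mean F0, first three formants

sr = 16000
y = librosa.tone(150, sr=sr, duration=0.3)               # stand-in for a vowel token
print(f0_and_formants(y, sr))
```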

    Vowel recognition in continuous speech

    In continuous speech, the identification of phonemes requires the ability to extract features that are capable of characterizing the acoustic signal. Previous work has shown that relatively high classification accuracy can be obtained from a single spectrum taken during the steady-state portion of the phoneme, assuming that the phonetic environment is held constant. The present study represents an attempt to extend this work to variable phonetic contexts by using dynamic rather than static spectral information. This thesis has four aims: 1) Classify vowels in continuous speech; 2) Find the optimal set of features that best describe the vowel regions; 3) Compare the classification results using a multivariate maximum likelihood distance measure with those of a neural network using the backpropagation model; 4) Examine the classification performance of a Hidden Markov Model given a pathway through phonetic space
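
    A minimal sketch of the baseline in aim 3, assuming synthetic feature vectors: a multivariate Gaussian maximum likelihood classifier that fits one Gaussian per vowel class and assigns a token to the class with the highest log-likelihood.

```python
# Hypothetical sketch: multivariate Gaussian maximum likelihood classification
# of vowel feature vectors. Feature data below are synthetic placeholders.
import numpy as np
from scipy.stats import multivariate_normal

def train_ml_classifier(X, y):
    """Fit one Gaussian per vowel class from training feature vectors."""
    return {c: multivariate_normal(X[y == c].mean(axis=0),
                                   np.cov(X[y == c], rowvar=False))
            for c in np.unique(y)}

def classify(models, x):
    return max(models, key=lambda c: models[c].logpdf(x))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(i, 1.0, size=(50, 4)) for i in range(3)])  # 3 vowel classes
y = np.repeat([0, 1, 2], 50)
models = train_ml_classifier(X, y)
print(classify(models, X[0]))                            # expect class 0
```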