
    Robust Classification of Stop Consonants Using Auditory-Based Speech Processing

    In this work, a feature-based system for the automatic classification of stop consonants in speaker-independent continuous speech is reported. The system uses a new auditory-based speech-processing front-end built on the biologically rooted property of average localized synchrony detection (ALSD). It incorporates new algorithms for the extraction and manipulation of acoustic-phonetic features that proved, statistically, to be rich in information content. The experiments are performed on stop consonants extracted from the TIMIT database with additive white Gaussian noise at various signal-to-noise ratios. The obtained classification accuracy compares favorably with previous work. The results also show a consistent improvement of 3% in place detection over the Generalized Synchrony Detector (GSD) system under identical conditions on clean and noisy speech. This illustrates the superior ability of the ALSD to suppress spurious peaks and produce a consistent and robust formant (peak) representation.

    Segregation of Vowels and Consonants in Human Auditory Cortex: Evidence for Distributed Hierarchical Organization

    The speech signal consists of a continuous stream of consonants and vowels, which must be decoded and encoded in human auditory cortex to ensure the robust recognition and categorization of speech sounds. We used small-voxel functional magnetic resonance imaging to study information encoded in local brain activation patterns elicited by consonant-vowel syllables, and by a control set of noise bursts. First, activation of anterior–lateral superior temporal cortex was seen when controlling for unspecific acoustic processing (syllables versus band-passed noises, in a “classic” subtraction-based design). Second, a classifier algorithm, which was trained and tested iteratively on data from all subjects to discriminate local brain activation patterns, yielded separations of cortical patches discriminative of vowel category versus patches discriminative of stop-consonant category across the entire superior temporal cortex, yet with regional differences in average classification accuracy. Overlap (voxels correctly classifying both speech sound categories) was surprisingly sparse. Third, lending further plausibility to the results, classification of speech–noise differences was generally superior to speech–speech classifications, with the notable exception of a left anterior region, where speech–speech classification accuracies were significantly better. These data demonstrate that acoustic–phonetic features are encoded in complex yet sparsely overlapping local patterns of neural activity distributed hierarchically across different regions of the auditory cortex. The redundancy apparent in these multiple patterns may partly explain the robustness of phonemic representations.
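    As a concrete illustration of the kind of across-subject pattern classification described above, the sketch below trains a linear classifier on local activation patterns for one cortical patch and evaluates it with leave-one-subject-out cross-validation in scikit-learn. The choice of a linear SVM, the data layout, and the cross-validation scheme are illustrative assumptions, not details reported in the study.

```python
# A hedged sketch of multivoxel pattern classification for one cortical
# patch: a linear classifier trained and tested across subjects
# (leave-one-subject-out) to discriminate stimulus categories from local
# activation patterns. All names, shapes, and parameters are illustrative.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def patch_accuracy(patterns, labels, subject_ids):
    """patterns: (n_trials, n_voxels) activation estimates for one patch;
    labels: stimulus category per trial (e.g. vowel vs. stop consonant);
    subject_ids: subject identifier per trial, used as CV groups."""
    clf = LinearSVC(max_iter=10000)
    scores = cross_val_score(clf, patterns, labels,
                             groups=subject_ids, cv=LeaveOneGroupOut())
    return scores.mean()  # mean accuracy over held-out subjects

# Example with synthetic data: 2 categories, 8 subjects, 20 trials each.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 50))        # 160 trials x 50 voxels
y = np.tile([0, 1], 80)               # alternating category labels
groups = np.repeat(np.arange(8), 20)  # 8 subjects, 20 trials each
print(patch_accuracy(X, y, groups))
```

    On real data, running such a classifier patch by patch across superior temporal cortex is what yields the per-region accuracy maps the abstract refers to.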

    Changes in the McGurk Effect Across Phonetic Contexts

    To investigate the process underlying audiovisual speech perception, the McGurk illusion was examined across a range of phonetic contexts. Two major changes were found. First, the frequency of illusory /g/ fusion percepts increased relative to the frequency of illusory /d/ fusion percepts as vowel context was shifted from /i/ to /a/ to /u/. This trend could not be explained by biases present in perception of the unimodal visual stimuli. However, the change found in the McGurk fusion effect across vowel environments did correspond systematically with changes in second formant frequency patterns across contexts. Second, the order of consonants in illusory combination percepts was found to depend on syllable type. This may be due to differences occurring across syllable contexts in the time courses of inputs from the two modalities, as delaying the auditory track of a vowel-consonant stimulus resulted in a change in the order of consonants perceived. Taken together, these results suggest that the speech perception system either fuses audiovisual inputs into a visually compatible percept with a second formant pattern similar to that of the acoustic stimulus, or interleaves the information from different modalities, at a phonemic or subphonemic level, based on their relative arrival times. National Institutes of Health (R01 DC02852)

    Acoustic-Phonetic Features for the Automatic Classification of Stop Consonants

    In this paper, the acoustic–phonetic characteristics of American English stop consonants are investigated. Features studied in the literature are evaluated for their information content, and new features are proposed. A statistically guided, knowledge-based, acoustic–phonetic system for the automatic classification of stops in speaker-independent continuous speech is proposed. The system uses a new auditory-based front-end and incorporates new algorithms for the extraction and manipulation of the acoustic–phonetic features that proved to be rich in information content. Recognition experiments are performed using hard-decision algorithms on stops extracted from continuous speech in the TIMIT database from 60 speakers (not used in the design process) representing seven different dialects of American English. An accuracy of 96% is obtained for voicing detection, 90% for place of articulation detection, and 86% for the overall classification of stops.

    Robust Auditory-Based Speech Processing Using the Average Localized Synchrony Detection

    In this paper, a new auditory-based speech processing system based on the biologically rooted property of average localized synchrony detection (ALSD) is proposed. The system detects periodicity in the speech signal at Bark-scaled frequencies while reducing the response’s spurious peaks and its sensitivity to implementation mismatches, and hence presents a consistent and robust representation of the formants. The system is evaluated for its ability to extract formants while suppressing spurious peaks. It is compared with other auditory-based and traditional systems in the tasks of vowel and consonant recognition on clean speech from the TIMIT database and in the presence of noise. The results illustrate the advantage of the ALSD system in extracting the formants and reducing the spurious peaks. They also indicate the superiority of the synchrony measures over the mean-rate measure in the presence of noise.
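    To make the idea of synchrony detection over Bark-scaled channels concrete, the sketch below computes a toy, correlation-based synchrony profile: the signal is band-pass filtered around Bark-spaced characteristic frequencies, and each channel is scored by its normalized autocorrelation at the lag of one characteristic-frequency period. This is a simplified illustration in the spirit of synchrony front-ends such as the GSD and ALSD, not the published ALSD algorithm; the channel range, filter design, and scoring rule are assumptions.

```python
# A minimal sketch of a generic synchrony measure over Bark-spaced channels.
# Channel spacing, filter order, and the correlation-based score are
# illustrative assumptions, not the ALSD algorithm from the paper.
import numpy as np
from scipy.signal import butter, lfilter

def bark_to_hz(b):
    # Traunmueller's approximate inverse Bark transform.
    return 1960.0 * (b + 0.53) / (26.28 - b)

def synchrony_profile(signal, sr=16000, n_channels=20):
    """Return one synchrony score per Bark-spaced channel; peaks in the
    profile roughly track formant locations."""
    barks = np.linspace(1.0, 15.5, n_channels)   # roughly 120 Hz .. 2.9 kHz
    cfs = bark_to_hz(barks)
    scores = np.zeros(n_channels)
    for k, cf in enumerate(cfs):
        # Band-pass the signal around the channel's characteristic frequency.
        lo, hi = 0.8 * cf / (sr / 2), 1.2 * cf / (sr / 2)
        b, a = butter(2, [lo, hi], btype='band')
        y = lfilter(b, a, signal)
        # Synchrony: normalized autocorrelation at a lag of one CF period.
        lag = int(round(sr / cf))
        num = np.dot(y[:-lag], y[lag:])
        den = np.sqrt(np.dot(y[:-lag], y[:-lag]) * np.dot(y[lag:], y[lag:])) + 1e-12
        scores[k] = num / den
    return cfs, scores
```

    Peaks in such a profile tend to align with formant frequencies; the ALSD's contribution, as the abstract describes, is to make that formant representation consistent and robust by additionally suppressing spurious peaks.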

    Changes in the McGurk Effect across Phonetic Contexts. I. Fusions

    The McGurk effect has generally been studied within a limited range of phonetic contexts. With the goal of characterizing the McGurk effect through a wider range of contexts, a parametric investigation across three different vowel contexts, /i/, /α/, and /u/, and two different syllable types, consonant-vowel (CV) and vowel-consonant (VC), was conducted. This paper discusses context-dependent changes found specifically in the McGurk fusion phenomenon (Part II addresses changes found in combination percepts). After normalizing for differences in the magnitude of the McGurk effect in different contexts, a large qualitative change in the effect across vowel contexts became apparent. In particular, the frequency of illusory /g/ percepts increased relative to the frequency of illusory /d/ percepts as vowel context was shifted from /i/ to /α/ to /u/. This trend was seen in both syllable sets and held regardless of whether the visual stimulus used was a /g/ or /d/ articulation. This qualitative change in the McGurk fusion effect across vowel environments corresponded systematically with changes in the typical second formant frequency patterns of the syllables presented. The findings are therefore consistent with sensory-based theories of speech perception which emphasize the importance of second formant patterns as cues in multimodal speech perception. National Institute on Deafness and Other Communication Disorders (R29 02852); Alfred P. Sloan Foundation and National Institute on Deafness and Other Communication Disorders (R29 02852)

    Effects of Aging and Spectral Shaping on the Sub-cortical (Brainstem) Differentiation of Contrastive Stop Consonants

    Purpose: The objectives of this dissertation are to: (1) evaluate the influence of aging on the sub-cortical (brainstem) differentiation of voiced stop consonants (i.e., /b-d-g/); (2) determine whether potential aging deficits at the brainstem level influence behavioral identification of the /b-d-g/ stimuli; (3) investigate whether spectral shaping diminishes any aging impairments at the brainstem level; and (4) if so, determine whether minimizing these deficits improves behavioral identification of the speech stimuli. Subjects: Behavioral and electrophysiological responses were collected from 11 older adults (> 50 years old) with near-normal to normal hearing and were compared to those of 16 normal-hearing younger adults (control group). Stimuli and Methods: Speech-evoked auditory brainstem responses (speech-ABRs) were recorded for three 100-ms /b-d-g/ consonant-vowel exemplars in unshaped and shaped conditions, for a total of six stimuli. Frequency-dependent spectral shaping enhanced the second formant (F2) transition relative to the rest of the stimulus: it reduced gain for low frequencies and increased gain for mid and high frequencies, the frequency region of the F2 transition in the /b-d-g/ syllables. Behavioral identification of 15-step perceptual unshaped and shaped /b-d-g/ continua was assessed by generating psychometric functions in order to quantify stimulus perception. Speech-ABR peak amplitudes and latencies and stop consonant differentiation scores were measured for the six stimuli (three unshaped and three shaped). Summary of Findings: Older adults exhibited more robust categorical perception and subtle sub-cortical deficits when compared to younger adults. Individual data showed fewer of the expected latency patterns for the /b-d-g/ speech-ABRs in older adults than in younger adults, especially for major peaks. Spectral shaping improved the stop consonant differentiation score for major peaks in older adults, moving them in the direction of the younger adults’ responses. Conclusion: Sub-cortical impairments, at least those measured in this study, do not seem to influence the behavioral differentiation of stop consonants in older adults. On the other hand, cue enhancement by spectral shaping seems to overcome some of the deficits noted at the electrophysiological level. However, due to a possible ceiling effect, improvements at the behavioral level to the originally robust perception of older adults were not found. Significance: Aging seems to reduce sub-cortical responsiveness to dynamic spectral cues without distorting spectral coding, as evidenced by the “reparable” age-related changes seen at the electrophysiological level. Cue enhancement appears to increase the neural responsiveness of aged but intact neurons, yielding better sub-cortical differentiation of stop consonants.

    Spoken Word Recognition Using Hidden Markov Model

    The main aim of this project is to develop an isolated spoken word recognition system using hidden Markov models (HMMs) with good accuracy across the full frequency range of the human voice. Ten different words are recorded by different speakers, both male and female, and results are compared across different feature extraction methods. Earlier work recognized seven short utterances with an HMM using only one feature extraction method. The spoken word recognition system is divided into two major blocks. The first comprises recording the database and extracting features from the recorded signals; Mel-frequency cepstral coefficients, linear frequency cepstral coefficients, and fundamental frequency are used as feature extraction methods. To obtain Mel-frequency cepstral coefficients, the signal goes through pre-emphasis, framing, windowing, the Fast Fourier transform, a Mel filter bank, and finally the discrete cosine transform, whereas linear frequency cepstral coefficients do not use the Mel scale. The second block describes the HMM used for modeling and recognizing the spoken words. All the training samples are clustered using the K-means algorithm, and Gaussian mixtures parameterized by means, variances, and weights serve as the modeling parameters. The Baum-Welch algorithm is used to train on the samples and re-estimate the parameters. Finally, the Viterbi algorithm finds the best state sequence for a given spoken utterance to be recognized. All simulations are carried out in MATLAB on the Microsoft Windows 7 operating system.
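    The MFCC front-end enumerated above (pre-emphasis, framing, windowing, FFT, Mel filter bank, discrete cosine transform) can be sketched compactly. The snippet below is a minimal NumPy/SciPy illustration of that pipeline, not the authors' MATLAB implementation; the frame length, hop size, FFT size, and filter counts are illustrative assumptions.

```python
# A minimal sketch of the MFCC pipeline described in the abstract:
# pre-emphasis -> framing -> windowing -> FFT -> Mel filter bank -> DCT.
# All parameter values below are illustrative defaults, not the paper's.
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=0.025, hop=0.010, n_filters=26, n_ceps=13):
    # 1. Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2. Framing into overlapping windows.
    flen, fhop = int(frame_len * sr), int(hop * sr)
    n_frames = 1 + max(0, (len(emphasized) - flen) // fhop)
    idx = np.arange(flen)[None, :] + fhop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]

    # 3. Window function (Hamming) to reduce spectral leakage.
    frames = frames * np.hamming(flen)

    # 4. FFT and power spectrum.
    nfft = 512
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # 5. Mel-spaced triangular filter bank, applied to the power spectrum.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv_mel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    mel_pts = np.linspace(mel(0), mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * inv_mel(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    fb_energies = np.log(power @ fbank.T + np.finfo(float).eps)

    # 6. DCT to decorrelate the log filter-bank energies -> cepstral coefficients.
    return dct(fb_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

    The resulting per-frame coefficient vectors would then feed the second block of the system: Gaussian-mixture HMMs initialized via K-means clustering, re-estimated with the Baum-Welch algorithm, and decoded with the Viterbi algorithm, as the abstract describes.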