Consonant recognition by humans and machines

Abstract

Thesis (Ph.D.)--Harvard-MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, 1998. Includes bibliographical references (p. 113-117).

The goal of this research is to determine how aspects of human speech processing can be used to improve the performance of Automatic Speech Recognition (ASR) systems. Three traditional ASR parameterizations, each paired with Hidden Markov Models (HMMs), are compared to humans on a consonant recognition task using Consonant-Vowel-Consonant (CVC) nonsense syllables degraded by highpass filtering, lowpass filtering, or additive noise. Confusion matrices were determined by recognizing the syllables using different ASR front ends, including Mel-Filter Bank (MFB) energies, Mel-Filtered Cepstral Coefficients (MFCCs), and the Ensemble Interval Histogram (EIH). For syllables degraded by lowpass and highpass filtering, automated systems trained on the degraded condition recognized the consonants roughly as well as humans. Moreover, all the ASR systems produced similar patterns of recognition errors for a given filtering condition. These patterns differ significantly from those characteristic of humans under the same filtering conditions. For syllables degraded by additive speech-shaped noise, none of the automated systems recognized consonants as well as humans. As with the filtered conditions, confusion matrices revealed similar error patterns across all the ASR systems. While the error patterns of humans and machines were more similar for noise conditions than for filtered conditions, the similarities were not as great as those among the ASR systems. The greatest difference between human and machine performance was in determining the correct voiced/unvoiced classification of consonants. Given these results, work focused on recognition of the correct voicing classification in additive noise (0 dB SNR). The approach taken attempted to automatically extract attributes of the speech signal, termed subphonetic features, which are useful in determining the distinctive feature voicing. Two subphonetic features, intervocal period (the length of time between the onset of the vowel and any preceding vocalization) and delta fundamental (the average first difference of fundamental frequency over the first 90 msec of the vowel), proved particularly useful. When these two features were appended to traditional ASR parameters, the deficit exhibited by the automated systems was reduced substantially, though not eliminated.

by Jason Sroka, Ph.D.
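The two subphonetic features can be sketched as simple computations over a fundamental-frequency (F0) track and segment boundary times. This is an illustrative reconstruction from the definitions given above, not code from the thesis; the function names, frame rate, and input conventions are assumptions.

```python
import numpy as np

def delta_fundamental(f0_track, frame_rate_hz=100.0, window_ms=90.0):
    """Average first difference of F0 over the first window_ms of the vowel.

    f0_track: hypothetical array of F0 estimates (Hz), one per frame,
    beginning at vowel onset. frame_rate_hz (assumed 100 frames/s here)
    determines how many frames cover the 90 msec window.
    """
    n_frames = int(round(window_ms / 1000.0 * frame_rate_hz))
    f0 = np.asarray(f0_track[:n_frames], dtype=float)
    # Mean of consecutive frame-to-frame F0 differences (Hz per frame).
    return float(np.mean(np.diff(f0)))

def intervocal_period(vowel_onset_s, preceding_voicing_offset_s):
    """Time (s) between the end of any preceding vocalization and vowel onset."""
    return vowel_onset_s - preceding_voicing_offset_s
```

A rising F0 at vowel onset yields a positive delta fundamental, while a longer intervocal period reflects a longer voiceless gap before the vowel; both behaviors are cues associated with the voicing distinction the thesis targets.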
