2,549 research outputs found

    Phonetic Error Analysis Beyond Phone Error Rate

    In this article, we analyse the performance of TIMIT-based phone recognition systems beyond the overall phone error rate (PER) metric. We consider three sets of broad phonetic classes (BPCs): {affricate, diphthong, fricative, nasal, plosive, semi-vowel, vowel, silence}, {consonant, vowel, silence} and {voiced, unvoiced, silence}, and calculate the contribution of each phonetic class in terms of substitutions, deletions, insertions and PER. Furthermore, for each BPC we investigate the following: evolution of PER during training, effect of noise (NTIMIT), importance of different spectral subbands (1, 2, 4, and 8 kHz), usefulness of bidirectional vs unidirectional sequential modelling, transfer learning from WSJ and regularisation via monophones. In addition, we construct a confusion matrix for each BPC and analyse the confusions via dimensionality reduction to 2D at the input (acoustic features) and output (logits) levels of the acoustic model. We also compare the performance and confusion matrices of the BLSTM-based hybrid baseline system with those of the GMM-HMM based hybrid, Conformer and wav2vec 2.0 based end-to-end phone recognisers. Finally, the relationship of the unweighted and weighted PERs with the broad phonetic class priors is studied for both the hybrid and end-to-end systems.
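
A per-class breakdown of PER of the kind this abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the phone-to-class map `PHONE_TO_BPC`, the helper names `align` and `per_by_class`, and the rule that insertions are attributed to the class of the inserted phone are all assumptions made here. When several alignments have the same cost, the attribution also depends on how ties are broken.

```python
from collections import Counter

# Hypothetical, deliberately tiny phone-to-class map (illustration only).
PHONE_TO_BPC = {
    "sil": "silence",
    "aa": "vowel", "iy": "vowel",
    "k": "consonant", "t": "consonant", "s": "consonant",
}


def align(ref, hyp):
    """Levenshtein-align two phone sequences.

    Returns a list of (op, ref_phone, hyp_phone) triples, where op is one of
    "ok", "sub", "del" or "ins".
    """
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner, preferring the diagonal on ties.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def per_by_class(ref, hyp, phone_to_bpc=PHONE_TO_BPC):
    """Return (PER, Counter keyed by (class, error type))."""
    counts = Counter()
    for op, r, h in align(ref, hyp):
        if op == "ok":
            continue
        # Assumption: substitutions and deletions are charged to the reference
        # phone's class, insertions to the inserted phone's class.
        phone = h if op == "ins" else r
        counts[(phone_to_bpc[phone], op)] += 1
    per = sum(counts.values()) / len(ref)  # (S + D + I) / N
    return per, counts


ref = ["sil", "k", "aa", "t", "sil"]
hyp = ["sil", "k", "iy", "sil"]
per, counts = per_by_class(ref, hyp)
print(f"PER = {per:.2f}")
for (bpc, op), n in sorted(counts.items()):
    print(bpc, op, n)
```

Summing the counter over error types for one class gives that class's share of the total error count; dividing by the number of reference phones in the class instead would give a class-conditional error rate.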

    Applications of broad class knowledge for noise robust speech recognition

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 157-164). This thesis introduces a novel technique for noise robust speech recognition by first describing a speech signal through a set of broad speech units, and then conducting a more detailed analysis from these broad classes. These classes are formed by grouping together parts of the acoustic signal that have similar temporal and spectral characteristics, and therefore have much less variability than typical sub-word units used in speech recognition (i.e., phonemes, acoustic units). We explore broad classes formed along phonetic and acoustic dimensions. This thesis first introduces an instantaneous adaptation technique to robustly recognize broad classes in the input signal. Given an initial set of broad class models and input speech data, we explore a gradient steepness metric using the Extended Baum-Welch (EBW) transformations to explain how much these initial models must be adapted to fit the target data. We incorporate this gradient metric into a Hidden Markov Model (HMM) framework for broad class recognition and illustrate that this metric allows for a simple and effective adaptation technique which does not suffer from issues such as data scarcity and computational intensity that affect other adaptation methods such as Maximum a-Posteriori (MAP), Maximum Likelihood Linear Regression (MLLR) and feature-space Maximum Likelihood Linear Regression (fMLLR). Broad class recognition experiments indicate that the EBW gradient metric method outperforms the standard likelihood technique, both when initial models are adapted via MLLR and without adaptation. Next, we explore utilizing broad class knowledge as a pre-processor for segment-based speech recognition systems, which have been observed to be quite sensitive to noise. The experiments are conducted with the SUMMIT segment-based speech recognizer, which detects landmarks, representing possible transitions between phonemes, from large energy changes in the acoustic signal. These landmarks are often poorly detected in noisy conditions. We investigate using the transitions between broad classes, which typically occur at areas of large acoustic change in the audio signal, to aid in landmark detection. We also explore broad classes motivated along both acoustic and phonetic dimensions. Phonetic recognition experiments indicate that utilizing either phonetically or acoustically motivated broad classes offers significant recognition improvements compared to the baseline landmark method in both stationary and non-stationary noise conditions. Finally, this thesis investigates using broad class knowledge for island-driven search. Reliable regions of a speech signal, known as islands, carry most of the information in the signal compared to unreliable regions, known as gaps. Most speech recognizers do not differentiate between island and gap regions during search and as a result most of the search computation is spent in unreliable regions. Island-driven search addresses this problem by first identifying islands in the speech signal and directing the search outwards from these islands. In this thesis, we develop a technique to identify islands from broad classes which have been confidently identified from the input signal. We explore a technique to prune the search space given island/gap knowledge. Finally, to further limit the amount of computation in unreliable regions, we investigate scoring less detailed broad class models in gap regions and more detailed phonetic models in island regions. Experiments on both small and large scale vocabulary tasks indicate that the island-driven search strategy results in an improvement in recognition accuracy and computation time. By Tara N. Sainath. Ph.D.

    Broad phonetic class definition driven by phone confusions

    Intermediate representations between the speech signal and phones may be used to improve discrimination among phones that are often confused. These representations are usually based on broad phonetic classes, which are defined by a phonetician. This article proposes an alternative data-driven method to generate these classes. Phone confusion information from the analysis of the output of a phone recognition system is used to find clusters at high risk of mutual confusion. A metric is defined to compute the distance between phones. The results, using TIMIT data, show that the proposed confusion-driven phone clustering method is an attractive alternative to the approaches based on human knowledge. A hierarchical classification structure to improve phone recognition is also proposed using a discriminative weight training method. Experiments show improvements in phone recognition on the TIMIT database compared to a baseline system.
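
The idea of deriving broad classes from phone confusions can be sketched as below. The abstract does not state the distance metric or the clustering procedure, so everything here is an assumption for illustration: the distance is one minus the mean of the two row-normalised confusion rates, the merging is greedy average-linkage agglomeration, and the names `confusion_distance` and `cluster_phones` as well as the confusion counts are invented.

```python
from itertools import product


def confusion_distance(conf, a, b):
    """Assumed metric: 1 - mean of P(b | ref=a) and P(a | ref=b).

    conf[x][y] is the number of times reference phone x was recognised as y.
    """
    p_ab = conf[a].get(b, 0) / sum(conf[a].values())
    p_ba = conf[b].get(a, 0) / sum(conf[b].values())
    return 1.0 - 0.5 * (p_ab + p_ba)


def cluster_phones(conf, n_clusters):
    """Greedy average-linkage agglomerative clustering of phones."""
    clusters = [{p} for p in sorted(conf)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = list(product(clusters[i], clusters[j]))
                d = sum(confusion_distance(conf, a, b) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]  # merge the closest pair of clusters
        del clusters[j]
    return clusters


# Invented confusion counts: nasals confuse with nasals, fricatives with fricatives.
conf = {
    "m": {"m": 80, "n": 20},
    "n": {"n": 75, "m": 25},
    "s": {"s": 85, "z": 15},
    "z": {"z": 70, "s": 28, "n": 2},
}
print(cluster_phones(conf, 2))
```

With these toy counts, the phones that are most often mistaken for each other end up in the same cluster, which is the property a confusion-driven class definition relies on.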

    Segment-based automatic language identification


    ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION

    Current Automatic Speech Recognition (ASR) systems fall well short of human speech recognition performance due to their lack of robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as 'beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and their variation occurs due to a combination of factors including speech style, speaking rate etc.; a phenomenon commonly known as 'coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof-of-concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task from the speech signal. Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated, where the former was used to train a TV-estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs (estimated from the acoustic speech signal). In this setup the articulatory gestures were modeled as hidden random variables, hence eliminating the necessity for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only can help to account for coarticulatory variations but can also significantly improve the noise robustness of ASR systems.

    Automatic prosodic analysis for computer aided pronunciation teaching

    Correct pronunciation of spoken language requires the appropriate modulation of acoustic characteristics of speech to convey linguistic information at a suprasegmental level. Such prosodic modulation is a key aspect of spoken language and is an important component of foreign language learning, for purposes of both comprehension and intelligibility. Computer aided pronunciation teaching involves automatic analysis of the speech of a non-native talker in order to provide a diagnosis of the learner's performance in comparison with the speech of a native talker. This thesis describes research undertaken to automatically analyse the prosodic aspects of speech for computer aided pronunciation teaching. It is necessary to describe the suprasegmental composition of a learner's speech in order to characterise significant deviations from a native-like prosody, and to offer some kind of corrective diagnosis. Phonological theories of prosody aim to describe the suprasegmental composition of speech..