
    Using Audio and Video Features to Classify the Most Dominant Person in a Group Meeting

    The automated extraction of semantically meaningful information from multi-modal data is becoming increasingly necessary as the volume of data captured for archival escalates. A novel area of multi-modal data labelling, which has received relatively little attention, is the automatic estimation of the most dominant person in a group meeting. In this paper, we provide a framework for detecting dominance in group meetings using different audio and video cues. We show that by using a simple model for dominance estimation we can obtain promising results.
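    As a concrete illustration of the kind of simple model the abstract describes, the sketch below combines hypothetical per-person audio cues (speaking time, speaking turns) and a video cue (motion activity) into a weighted dominance score. The cue names, weights, and example values are illustrative assumptions, not the authors' features.

    from dataclasses import dataclass

    @dataclass
    class PersonCues:
        name: str
        speaking_time: float    # seconds of detected speech (audio cue)
        speaking_turns: int     # number of speaking turns (audio cue)
        visual_activity: float  # accumulated motion energy (video cue)

    def dominance_score(p: PersonCues, w_time=1.0, w_turns=0.5, w_video=0.3) -> float:
        # Weighted sum of the raw cues; in practice each cue would be
        # normalized over the meeting and the weights tuned on labelled data.
        return w_time * p.speaking_time + w_turns * p.speaking_turns + w_video * p.visual_activity

    def most_dominant(people: list[PersonCues]) -> str:
        return max(people, key=dominance_score).name

    meeting = [
        PersonCues("A", speaking_time=310.0, speaking_turns=42, visual_activity=0.8),
        PersonCues("B", speaking_time=120.0, speaking_turns=18, visual_activity=1.2),
        PersonCues("C", speaking_time=95.0, speaking_turns=11, visual_activity=0.4),
    ]
    print(most_dominant(meeting))  # -> "A"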

    Desperately Seeking Impostors: Data-Mining for Competitive Impostor Testing in a Text-Dependent Speaker Verification System

    Precise determination of the operating point of a real-world verification application is of great importance. For a text-dependent, password-based security system this can be a challenging task, as lexically matched impostor test data may be nonexistent. In this work we present a data-mining approach for extracting suitable impostor data. The approach may be applied either to the Target database (the application data itself) or to the Stock databases (data from other applications). The method entails 1) determining Levenshtein distances of impostor text utterances with respect to the claimant password, 2) selecting subsets of impostor data at various levels of lexical distance, 3) calculating the score threshold using such subsets, and 4) extrapolating the score threshold (and hence the operating point) for lexically perfectly matched data. Experiments on four databases in two languages are presented. This approach, as applied to the Target database, provides an accurate and inexpensive solution to a formidable real-world problem.
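    The four-step method lends itself to a short sketch. The version below is an assumed reconstruction, not the authors' implementation: it computes character-level Levenshtein distances (the abstract does not specify the unit), bins impostor trials by distance, takes the score threshold at an illustrative 1% false-accept rate in each bin, and linearly extrapolates the threshold back to distance zero, i.e. to lexically perfectly matched data.

    import numpy as np

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance (step 1).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def extrapolated_threshold(password, impostor_trials, max_dist=6, far=0.01):
        # impostor_trials: (utterance_text, verification_score) pairs mined
        # from the Target or Stock databases.
        bins = {}
        for text, score in impostor_trials:  # steps 1-2: distance, then binning
            bins.setdefault(levenshtein(text, password), []).append(score)
        dists, thresholds = [], []
        for d, scores in sorted(bins.items()):
            if d <= max_dist and len(scores) >= 20:  # require enough trials per bin
                dists.append(d)
                # Step 3: threshold at the target false-accept rate in this bin.
                thresholds.append(np.quantile(scores, 1.0 - far))
        # Step 4: fit threshold vs. distance and extrapolate to distance 0.
        slope, intercept = np.polyfit(dists, thresholds, 1)
        return intercept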

    Towards Robustness To Fast Speech In ASR

    Psychoacoustic studies show that human listeners are sensitive to speaking rate variations [10]. Automatic speech recognition (ASR) systems are even more affected by changes in rate: double to quadruple the word recognition error rates of average speakers have been observed for fast speakers on many ASR systems [6]. In our earlier work [5], we studied the causes of the higher error rates and concluded that both acoustic-phonetic and phonological differences contribute to them. In this work, we have studied various measures for quantifying rate of speech (ROS), and used simple methods for estimating the speaking rate of a novel utterance using ASR technology. We have also implemented mechanisms that make our ASR system more robust to fast speech. Using our ROS estimator to identify fast sentences in the test set, our rate-dependent system has 24.5% fewer errors on the fastest sentences and 6.2% fewer errors on all sentences of the WSJ93 evaluation set, relative to the baseline HMM/MLP system.
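    For concreteness, here is a minimal sketch of one lexical ROS measure of the kind mentioned above, phones per second computed from a first-pass recognition alignment, together with a threshold rule for flagging fast sentences. The Phone type and the percentile cut-off are assumptions for illustration, not the paper's definitions.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Phone:
        label: str
        start: float  # seconds
        end: float

    def phones_per_second(alignment: list[Phone]) -> float:
        # Lexical ROS: number of aligned phones over total speech duration.
        duration = alignment[-1].end - alignment[0].start
        return len(alignment) / duration

    def fast_threshold(training_ros: list[float], pct: float = 90.0) -> float:
        # A cut-off computed from the training set, e.g. the 90th percentile
        # of per-utterance ROS values; utterances above it count as "fast".
        return float(np.percentile(training_ros, pct))

    def is_fast(alignment: list[Phone], threshold: float) -> bool:
        return phones_per_second(alignment) > threshold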

    Speech Recognition Using On-Line Estimation Of Speaking Rate

    In this paper, we describe a rate of speech estimator that is derived directly from the acoustic signal. This measure has been developed as an alternative to lexical measures of speaking rate such as phones or syllables per second, which, in previous work, we estimated using a first recognition pass; the accuracy of our earlier lexical rate estimate depended on the quality of recognition. Here we show that our new measure is a good predictor of word error rate, and in addition, correlates moderately well with lexical speech rate. We also show that a simple modification of the model transition probabilities based on this measure can reduce the error rate almost as much as using lexical phones per second calculated from manually transcribed data. When we categorized test utterances based on speaking rate thresholds computed from the training set, we observed that a different transition probability value was required to minimize the error rate in each speaking rate bin. However, the reduc..
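    The sketch below illustrates both ideas under stated assumptions: a signal-derived rate proxy obtained by counting peaks in a smoothed short-time energy envelope (a stand-in for the paper's acoustic measure, which the abstract does not specify), and a rate-binned choice of HMM self-loop probability, mirroring the observation that each speaking-rate bin needs a different transition value. Window sizes, bin edges, and the self-loop values are all illustrative.

    import numpy as np

    def acoustic_rate(signal: np.ndarray, sr: int, win=0.01, smooth=0.15) -> float:
        # Short-time energy per 10 ms frame, smoothed over ~150 ms, then
        # local maxima counted as syllable-like events per second.
        hop = int(win * sr)
        frames = [signal[i:i + hop] for i in range(0, len(signal) - hop, hop)]
        energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
        kernel = np.ones(int(smooth / win))
        env = np.convolve(energy, kernel / kernel.sum(), mode="same")
        peaks = np.sum((env[1:-1] > env[:-2]) & (env[1:-1] > env[2:]))
        return float(peaks) / (len(signal) / sr)

    def self_loop_prob(rate: float) -> float:
        # Faster speech gets a smaller self-loop probability, i.e. a shorter
        # expected state duration; the bin edges and values are assumptions.
        if rate < 3.5:
            return 0.65
        if rate < 5.5:
            return 0.60
        return 0.50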

    Why Is ASR Harder For Fast Speech And What Can We Do About It?

    It has been observed in various NIST evaluations (e.g. WSJ-Nov93 and RM-Sep92) that ASR systems typically have about 2-3 times higher word error rates on very fast speakers [2, 3]. This observation naturally inspires the following question: "why do ASR systems perform significantly worse on fast speech?" We have considered two reasons for the higher error rate of faster speakers. First, due to increased coarticulation effects, the spectral features of fast speech are inherently different from those of normal speech, and these differences are reflected in the extracted features (acoustic-phonetic causes). Phonological causes are the second potential culprit: the normal word models may be unsuitable for fast speech because fast speakers often violate the phonemic durational constraints of the word models (durational errors) or omit phones altogether (deletion errors). In the following sections, we describe our inv
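    The two phonological failure modes named above can be made concrete with a small sketch: fast speech violating the word model's minimum phone durations (durational errors) and dropping phones outright (deletion errors). The canonical pronunciation, the 30 ms duration floor, and the naive membership test are illustrative assumptions, not the paper's word models.

    MIN_DURATION = 0.03  # seconds; roughly a 3-state HMM at 10 ms frames

    def phonological_mismatches(canonical, aligned):
        # canonical: the word model's phone sequence.
        # aligned: (phone, duration) pairs from a forced alignment.
        realized = [p for p, _ in aligned]
        deletions = [p for p in canonical if p not in realized]  # naive check
        too_short = [p for p, d in aligned if d < MIN_DURATION]
        return deletions, too_short

    # "probably" /p r aa b ax b l iy/ reduced in fast speech to "prolly":
    canonical = ["p", "r", "aa", "b", "ax", "b", "l", "iy"]
    aligned = [("p", 0.04), ("r", 0.02), ("aa", 0.05), ("l", 0.03), ("iy", 0.06)]
    print(phonological_mismatches(canonical, aligned))
    # -> (['b', 'ax', 'b'], ['r'])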