5 research outputs found

    Improving ASR error detection with non-decoder based features

    This study reports error detection experiments on large vocabulary automatic speech recognition (ASR) systems using statistical classifiers. We explored new features gathered from knowledge sources other than the decoder itself: a binary feature that compares the outputs of two different ASR systems word by word, a feature based on the number of hits returned when the hypothesized bigrams are submitted as queries to a very popular Web search engine, and finally a feature related to topics automatically inferred at the sentence and word levels. Experiments were conducted on a European Portuguese broadcast news corpus. Combining the baseline decoder-based features with two of these additional features led to significant improvements over a baseline using only decoder-based features: from 13.87% to 12.16% classification error rate (CER) with a maximum entropy model, and from 14.01% to 12.39% CER with linear-chain conditional random fields.
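    A minimal sketch of how such a word-level error detector might look, assuming hypothetical feature values (decoder confidence, cross-system agreement, web bigram hit count, topic-consistency score) and using scikit-learn's logistic regression as a stand-in for the maximum entropy model; the paper's actual feature extraction and corpus are not reproduced here.

```python
# Sketch: word-level ASR error detection with a maximum-entropy-style classifier.
# Feature names and values are illustrative assumptions, not the paper's exact setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row describes one hypothesized word:
# [decoder confidence, agreement with a second ASR system (0/1),
#  log10 web hit count of the hypothesized bigram, topic-consistency score]
X = np.array([
    [0.92, 1, 6.1, 0.80],   # likely correct
    [0.35, 0, 2.0, 0.10],   # likely an error
    [0.88, 1, 5.4, 0.75],
    [0.41, 0, 1.3, 0.20],
])
y = np.array([0, 1, 0, 1])  # 1 = recognition error, 0 = correct word

# Multinomial logistic regression is equivalent to a maximum entropy model.
clf = LogisticRegression().fit(X, y)

new_word = np.array([[0.50, 0, 2.5, 0.30]])
print("P(error) =", clf.predict_proba(new_word)[0, 1])
```

    A linear-chain CRF over the word sequence (the paper's second classifier) would additionally model dependencies between the labels of consecutive words; a library such as sklearn-crfsuite could play that role.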

    Speaker model adaptation based on confidence score

    Confidence measures are expected to give a measure of reliability for the result of a speech/speaker recognition system. The most commonly used confidence measures are based on posterior word or phoneme probabilities, which can be obtained from the output of the recognizer. In this paper we introduce a linear interpretation of the posterior-probability-based confidence measure by using the inverse Fisher transformation. Speaker adaptation consists of updating the parameters of a speaker-independent model to better represent the current speaker. Confidence measures give more reliable criteria for selecting the utterances that best represent the speaker, and a linear interpretation of the confidence measure is very important for selecting the most representative data for adaptation.
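    The abstract does not give the exact formula, so the following is only a plausible reading: a minimal sketch in which the inverse Fisher transformation (tanh) maps an unbounded logit-style score into a bounded, roughly linear confidence that is then used to select utterances for adaptation. The scale factor, threshold, and utterance representation are illustrative assumptions; with scale 0.5 the transform reduces to the linear rescaling 2p - 1 of the posterior p, which is one way to read the "linear interpretation" claim.

```python
# Sketch: bounded confidence scores from recognizer posteriors, then selection of
# utterances for speaker adaptation. The exact transform used in the paper is not
# given in the abstract; this is a hypothetical reading.
import math

def inverse_fisher(z):
    """Inverse Fisher transformation: maps an unbounded score into (-1, 1)."""
    return math.tanh(z)

def word_confidence(posterior, scale=0.5):
    """Hypothetical bounded confidence from a word posterior via its logit.
    With scale=0.5 this reduces to the linear rescaling 2*p - 1."""
    logit = math.log(posterior) - math.log(1.0 - posterior)
    return inverse_fisher(scale * logit)

def select_adaptation_utterances(utterances, threshold=0.6):
    """Keep utterances whose average word confidence exceeds a threshold,
    to be used for updating the speaker-independent model."""
    selected = []
    for words in utterances:  # each utterance: a list of word posteriors
        conf = sum(word_confidence(p) for p in words) / len(words)
        if conf >= threshold:
            selected.append(words)
    return selected

# Toy usage: only the confidently recognized utterance is kept for adaptation.
print(len(select_adaptation_utterances([[0.95, 0.9, 0.85], [0.55, 0.4, 0.6]])))
```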

    High-Level Approaches to Confidence Estimation in Speech Recognition

    We describe some high-level approaches to estimating confidence scores for the words output by a speech recognizer. By "high-level" we mean that the proposed measures do not rely on decoder-specific "side information" and so should find more general applicability than measures that have been developed for specific recognizers. Our main approach is to attempt to decouple the language modeling and acoustic modeling in the recognizer in order to generate independent information from these two sources that can then be used for confidence estimation. We isolate these two information sources by using a phone recognizer working in parallel with the word recognizer. A set of techniques for estimating confidence measures using the phone recognizer output in conjunction with the word recognizer output is described. The most effective of these techniques is based on the construction of "metamodels," which generate alternative word hypotheses for an utterance. An alternative approach requires no other recognizers or extra information for confidence estimation and is based on the notion that a word that is semantically "distant" from the other decoded words in the utterance is likely to be incorrect. We describe a method for constructing "semantic similarities" between words and hence estimating a confidence. Results using the U.K. version of the Wall Street Journal corpus are given for each technique.
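    A minimal sketch of the second, semantics-based idea, assuming hypothetical precomputed word vectors and cosine similarity in place of the paper's own "semantic similarities" (whose construction the abstract does not describe): a decoded word that is semantically distant from the rest of the utterance receives a low confidence.

```python
# Sketch: semantic-distance confidence for decoded words, using assumed toy word
# vectors and cosine similarity purely for illustration.
import numpy as np

def semantic_confidence(decoded_words, vectors):
    """Score each word by its mean cosine similarity to the other decoded words.
    A word that is semantically distant from the rest gets a low confidence."""
    confidences = {}
    for w in decoded_words:
        others = [o for o in decoded_words if o != w and o in vectors]
        if w not in vectors or not others:
            confidences[w] = 0.0
            continue
        v = vectors[w]
        sims = [np.dot(v, vectors[o]) / (np.linalg.norm(v) * np.linalg.norm(vectors[o]))
                for o in others]
        confidences[w] = float(np.mean(sims))
    return confidences

# Toy vectors: "bank" and "loan" are related, "zebra" is the odd one out.
toy_vectors = {
    "bank":  np.array([0.9, 0.1, 0.0]),
    "loan":  np.array([0.8, 0.2, 0.1]),
    "zebra": np.array([0.0, 0.1, 0.9]),
}
print(semantic_confidence(["bank", "loan", "zebra"], toy_vectors))
```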

    A Discriminative Locally-Adaptive Nearest Centroid Classifier for Phoneme Classification

    Phoneme classification is a key area of speech recognition. Phonemes are the basic modeling units in modern speech recognition, and they are the constructive units of words. Thus, being able to quickly and accurately classify the phonemes that are input to a speech-recognition system is a basic and important step towards improving, and eventually perfecting, speech recognition as a whole. Many classification approaches currently exist that can be applied to the task of classifying phonemes. These techniques range from simple ones, such as the nearest centroid classifier, to complex ones, such as support vector machines. Amongst the existing classifiers, the simpler ones tend to be quicker to train but lower in accuracy, whereas the more complex ones tend to be higher in accuracy but slower to train. Because phoneme classification involves very large datasets, it is desirable to have classifiers that are both quick to train and high in accuracy. The formulation of such classifiers remains an active research topic in phoneme classification. One paradigm for formulating such classifiers attempts to increase the accuracy of the simpler classifiers with minimal sacrifice to their running times; the opposite paradigm attempts to increase the training speed of the more complex classifiers with minimal sacrifice to their accuracy. The objective of this research is to develop a new centroid-based classifier that builds upon the simpler nearest centroid classifier by incorporating a new discriminative locally-adaptive training procedure developed from recent advances in machine learning. This new classifier, referred to as the discriminative locally-adaptive nearest centroid (DLANC) classifier, achieves much higher accuracy than the nearest centroid classifier whilst having relatively low computational complexity and being able to scale up to very large datasets.
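    A minimal sketch of a centroid-based classifier with a simple per-class feature weighting (inverse within-class variance) standing in for a locally adaptive distance; the actual discriminative training procedure of DLANC is not described in the abstract and is not reproduced here.

```python
# Sketch: nearest centroid classification with per-class feature weights as a
# stand-in for a locally adaptive metric. Data and weighting are illustrative.
import numpy as np

class WeightedNearestCentroid:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = {}
        self.weights_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            self.centroids_[c] = Xc.mean(axis=0)
            # Features with low within-class variance count more for this class.
            self.weights_[c] = 1.0 / (Xc.var(axis=0) + 1e-6)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            dists = {c: np.sum(self.weights_[c] * (x - self.centroids_[c]) ** 2)
                     for c in self.classes_}
            preds.append(min(dists, key=dists.get))
        return np.array(preds)

# Toy phoneme-like data: two classes in a 2-D feature space (e.g. formant-style features).
X = np.array([[1.0, 5.0], [1.2, 5.1], [0.9, 4.8], [3.0, 1.0], [3.2, 1.1], [2.9, 0.8]])
y = np.array([0, 0, 0, 1, 1, 1])
model = WeightedNearestCentroid().fit(X, y)
print(model.predict(np.array([[1.1, 5.0], [3.1, 1.0]])))
```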

    Error handling in multimodal voice-enabled interfaces of tour-guide robots using graphical models

    Mobile service robots are going to play an increasing role in human society. Voice-enabled interaction with service robots becomes very important if such robots are to be deployed in real-world environments and accepted by the vast majority of potential human users. The research presented in this thesis addresses the problem of integrating speech recognition into an interactive voice-enabled interface of a service robot, in particular a tour-guide robot. The task of a tour-guide robot is to engage visitors to mass exhibitions (users) in dialogue, providing the services it is designed for (e.g. exhibit presentations) within a limited time. In managing tour-guide dialogues, extracting the user goal (intention) behind the request for a particular service at each dialogue state is the key issue. In mass exhibition conditions, speech recognition errors are inevitable because of noisy speech and uncooperative users who have no prior experience with robots. Such errors can jeopardize user goal identification, and wrongly identified user goals can lead to communication failures. Therefore, to reduce the risk of such failures, methods for detecting and compensating for communication failures in human-robot dialogue are needed. During the short-term interaction with visitors, the interpretation of the user goal at each dialogue state can be improved by combining speech recognition in the speech modality with information from the other available robot modalities. The methods presented in this thesis exploit probabilistic models for fusing information from speech and auxiliary modalities of the robot for user goal identification and communication failure detection. To compensate for the detected communication failures, we investigate multimodal methods for recovery from communication failures. To model the process of modality fusion, taking into account the uncertainties in the information extracted from each input modality during human-robot interaction, we use the probabilistic framework of Bayesian networks. Bayesian networks are graphical models that represent a joint probability function over a set of random variables. They are used to model the dependencies among variables associated with the user goals, modality-related events (e.g. the event of user presence, inferred from the laser scanner modality of the robot), and observed modality features providing evidence in favor of these modality events. Bayesian networks are used to calculate posterior probabilities over the possible user goals at each dialogue state. These probabilities serve as a basis for deciding whether the user goal is valid, i.e. whether it can be mapped into a tour-guide service (e.g. exhibit presentation) or is undefined, signaling a possible communication failure. The Bayesian network can also be used to elicit probabilities over the modality events, revealing information about the possible cause of a communication failure. Introducing new user goal aspects (e.g. new modality events and related features) that provide auxiliary information for detecting communication failures makes the design process cumbersome and calls for a systematic approach to Bayesian network modelling. Generally, introducing new variables for user goal identification into the Bayesian networks can lead to complex and computationally expensive models. In order to make the design process more systematic and modular, we adapt principles from the theory of grounding in human communication.
When people communicate, they resolve understanding problems in a collaborative joint effort of providing evidence of common shared knowledge (grounding). We use Bayesian network topologies, tailored to limited computational resources, to model a state-based grounding model that fuses information from three different input modalities (laser, video and speech) to infer possible grounding states. These grounding states are associated with modality events showing whether the user is present in range for communication, whether the user is attending to the interaction, whether the speech modality is reliable, and whether the user goal is valid. The state-based grounding model is used to compute probabilities that intermediary grounding states have been reached. This serves as a basis for detecting whether the user has reached the final grounding state or whether a repair dialogue sequence is needed. In the case of a repair dialogue sequence, the tour-guide robot can exploit the multiple available modalities along with speech. For example, if the user has failed to reach the grounding state related to her/his presence in range for communication, the robot can use its move modality to search for visitors and attract their attention. When speech recognition is detected to be unreliable, the robot can offer the alternative use of the buttons modality in the repair sequence. Given the probability of each grounding state, and the dialogue sequence that can be executed in the next dialogue state, a tour-guide robot has different preferences over the possible dialogue continuations. If the possible dialogue sequences at each dialogue state are defined as actions, the principle of maximum expected utility (MEU) provides an explicit way of selecting actions, based on their utility, given the evidence about the user goal at each dialogue state. Decision networks, constructed as graphical models based on Bayesian networks, are proposed to perform MEU-based decisions, incorporating the utility of the actions to be chosen by the tour-guide robot at each dialogue state. These action utilities are defined taking into account the tour-guide task requirements. The proposed graphical models for user goal identification and dialogue error handling in human-robot dialogue are evaluated in experiments with multimodal data. These data were collected during the operation of the tour-guide robot RoboX at the Autonomous System Lab of EPFL and at the Swiss National Exhibition in 2002 (Expo.02). The evaluation experiments use component- and system-level metrics for technical (objective) and user-based (subjective) evaluation. At the component level, the technical evaluation is done by calculating accuracies as objective measures of the performance of the grounding model and of the resulting user goal identification in dialogue. The benefit of the proposed error handling framework is demonstrated by comparing the accuracy of a baseline interactive system, which employs only speech recognition for user goal identification, with that of a system equipped with multimodal grounding models for error handling.
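    A minimal sketch of the two decision steps described above, with made-up goals, modality events, probabilities, and utilities: evidence from hypothetical modality events is fused into a posterior over user goals using a naive Bayesian-network-style factorization, and the dialogue action with maximum expected utility is then selected. The grounding-state topologies and the utilities used with RoboX are not reproduced.

```python
# Sketch: fusing modality evidence into a posterior over user goals and picking
# the dialogue action with maximum expected utility (MEU). All probabilities,
# goals, actions, and utilities below are illustrative assumptions.

GOALS = ["exhibit_presentation", "undefined"]          # "undefined" signals a possible failure
PRIOR = {"exhibit_presentation": 0.6, "undefined": 0.4}

# P(evidence | goal) for two simple modality events:
# user_present (laser scanner) and speech_reliable (speech modality).
LIKELIHOOD = {
    "exhibit_presentation": {"user_present": 0.9, "speech_reliable": 0.8},
    "undefined":            {"user_present": 0.5, "speech_reliable": 0.2},
}

def posterior(evidence):
    """Posterior over goals given observed binary modality events (naive factorization)."""
    unnorm = {}
    for g in GOALS:
        p = PRIOR[g]
        for event, observed in evidence.items():
            like = LIKELIHOOD[g][event]
            p *= like if observed else (1.0 - like)
        unnorm[g] = p
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}

# Utility of each dialogue action given the true goal (tour-guide task requirements).
UTILITY = {
    "present_exhibit": {"exhibit_presentation": 10.0, "undefined": -5.0},
    "repair_dialogue": {"exhibit_presentation": -1.0, "undefined": 4.0},
}

def meu_action(evidence):
    """Choose the action maximizing expected utility under the goal posterior."""
    post = posterior(evidence)
    expected = {a: sum(post[g] * UTILITY[a][g] for g in GOALS) for a in UTILITY}
    return max(expected, key=expected.get), expected

# Example: user detected in range, but speech recognition looks unreliable,
# so the repair action (e.g. offering the buttons modality) is preferred.
print(meu_action({"user_present": True, "speech_reliable": False}))
```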