
    The RWTH Aachen German and English LVCSR systems for IWSLT-2013

    Abstract: In this paper, the German and English large vocabulary continuous speech recognition (LVCSR) systems developed by RWTH Aachen University for the IWSLT-2013 evaluation campaign are presented. Good improvements are obtained with state-of-the-art monolingual and multilingual bottleneck features. In addition, an open vocabulary approach using morphemic sub-lexical units is investigated, along with language model adaptation for the German LVCSR. For both languages, competitive WERs are achieved using system combination.

    Investigations on neural networks, discriminative training criteria and error bounds

    The task of an automatic speech recognition system is to convert speech signals into written text by choosing the recognition result according to a statistical decision rule. Discriminative training of the underlying statistical model is an essential part of improving the word error rate performance of the system. In automatic speech recognition, a mismatch exists between the loss used in the word error rate performance measure, the loss of the decision rule, and the loss of the discriminative training criterion. In the course of this thesis, the analysis of this mismatch leads to the development of novel error bounds and training criteria. The novel training criteria are evaluated in practical speech recognition experiments. In summary, we come to the conclusion that the statistical model is able to compensate for this mismatch if the discriminative training criterion involves the loss of the performance measure.

    Automatic speech recognition is based on Bayes decision rule, which chooses the most probable sentence as the recognition result for a given speech signal. The word error rate measures the performance of the recognition result. It is based on the Levenshtein loss, i.e., the minimum number of insertions, deletions, and substitutions needed to transform the spoken sentence into the recognized one. This choice of performance measure, however, bears a fundamental mismatch to the maximum probability decision rule: by definition, Bayes decision rule minimizes the sentence error rate, which does not guarantee optimizing the performance measure of automatic speech recognition, the word error rate. The straightforward approach to overcome this problem incorporates the Levenshtein loss into Bayes decision rule by choosing as the recognition result the sentence minimizing the posterior-expected Levenshtein loss. However, evaluating this decision rule is too time- and memory-consuming, so it is only performed as a post-processing step after the search of the maximum probability decision rule.

    In practice, we have to make a model assumption in Bayes decision theory. The theory assumes the true distribution, i.e., the joint distribution of all speech signals and spoken sentences, which is unknown in practice. To stay as close as possible to the principle of Bayes decision rule, a model distribution with free parameters substitutes for the true distribution; the corresponding maximum probability decision rule using the model is called the model-based decision rule. The free parameters of the model are learned from training data, e.g., with generative training, and discriminative training subsequently fine-tunes the model. For automatic speech recognition, the type of discriminative training criterion plays a crucial role. For example, the Minimum Phone Error (MPE) criterion, which involves the Levenshtein loss, performs better than other discriminative criteria like cross-entropy or maximum mutual information. Despite its superior practical performance, the MPE criterion lacks theoretical justification. In contrast, the cross-entropy criterion can be derived via a formal derivation scheme from the Kullback-Leibler divergence between the true and model distributions. In this scheme, the Kullback-Leibler divergence is an upper bound on the error difference between the model-based and Bayes decision rules, which measures the performance difference between the two rules. For the MPE criterion, unlike the cross-entropy criterion, no such derivation scheme exists relating the training criterion to an upper bound on the error difference. In this thesis, we close this gap and give a theoretical justification for the MPE criterion.

    In the first part of this thesis, we develop a scheme to derive discriminative training criteria from bounds on the error difference between the model-based and Bayes decision rules. The f-divergence, a family of divergences that generalizes the Kullback-Leibler divergence and is used to compare two distributions, is the basis for the examined error bounds. We start by formulating proofs that derive upper f-divergence bounds on the classification error difference. These proofs are then extended to error bounds based on more general losses, including the Levenshtein loss, which are relevant to the mismatch between the performance measure and the model-based decision rule in automatic speech recognition. We ultimately find a type of explicit bound which is suitable for deriving discriminative training criteria. Before this thesis, no derivation scheme for more general losses like the Levenshtein loss existed relating the training criterion to an upper bound on the error difference. Practical automatic speech recognition experiments evaluate our novel training criteria, including frame-wise training of neural networks as well as sequence training of log-linear mixture models. We show that our novel f-divergence training criteria achieve competitive or better performance than the conventional cross-entropy and minimum phone error criteria.

    The second part of this thesis summarizes our successful participation in the QUAERO project evaluation campaign. We contributed the automatic speech recognition system for German in all project periods, achieving the best or competitive results.
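The mismatch described above can be made concrete with a small sketch: a word-level Levenshtein loss and a decision rule that picks, from an n-best list with posterior probabilities, the hypothesis minimizing the posterior-expected Levenshtein loss. This is a hypothetical toy illustration (the function names and the n-best list are invented), not the thesis implementation:

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the word sequence ref into hyp (the Levenshtein loss)."""
    d = list(range(len(hyp) + 1))          # distances for the empty ref prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i               # prev holds the diagonal d[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                  # delete r from the reference
                d[j - 1] + 1,              # insert h into the reference
                prev + (r != h),           # substitute (free on a match)
            )
    return d[-1]

def expected_loss_decode(nbest):
    """Choose, from an n-best list of (word sequence, posterior) pairs,
    the hypothesis with minimum posterior-expected Levenshtein loss."""
    def expected_loss(hyp):
        return sum(p * levenshtein(ref, hyp) for ref, p in nbest)
    return min((hyp for hyp, _ in nbest), key=expected_loss)
```

On a toy n-best list such as ("a b c", 0.4), ("a b d", 0.35), ("a x d", 0.25), the maximum probability rule picks "a b c", while the expected-loss rule picks "a b d" (expected loss 0.65 vs. 0.85), illustrating that the two decision rules can disagree.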

    Application-Agnostic Language Modeling for On-Device ASR

    On-device automatic speech recognition systems face several challenges compared to server-based systems. They have to meet stricter constraints in terms of speed, disk size, and memory while maintaining the same accuracy. Often they have to serve several applications with different data distributions at once, such as communicating with a virtual assistant and speech-to-text dictation. The simplest solution for serving multiple applications is to build application-specific (language) models, but this leads to an increase in memory. Therefore, we explore different data- and architecture-driven language modeling approaches to build a single application-agnostic model. We propose two novel feed-forward architectures that find an optimal trade-off between the different on-device constraints. In comparison to the application-specific solution, one of our novel approaches reduces the disk size by half while maintaining the speed and accuracy of the original model. (Comment: accepted for the ACL 2023 industry track.)
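For context, the model class behind such systems can be sketched as a classic feed-forward n-gram language model: the last N words are embedded, concatenated, and mapped through a hidden layer to next-word log-probabilities. All dimensions and names below are invented for illustration; this is a generic sketch, not the two architectures proposed in the paper:

```python
import numpy as np

# Invented toy dimensions: vocabulary, embedding size, hidden size, context length.
V, E, H, N = 1000, 32, 64, 3

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(V, E))   # word embedding table
W1 = rng.normal(scale=0.1, size=(N * E, H))
W2 = rng.normal(scale=0.1, size=(H, V))

def next_word_logprobs(context_ids):
    """Log-probabilities over the next word given the last N word ids."""
    x = emb[context_ids].reshape(-1)       # concatenate the N context embeddings
    h = np.tanh(x @ W1)                    # hidden layer
    logits = h @ W2
    logits -= logits.max()                 # stabilize the softmax
    return logits - np.log(np.exp(logits).sum())
```

Because a single parameter set (emb, W1, W2) serves every application, the on-device disk cost stays constant, whereas per-application copies of the same model would multiply it.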

    Does the Cost Function Matter in Bayes Decision Rule?


    Spoken Language Translation Using Automatically Transcribed Text in Training

    In spoken language translation, a machine translation system takes speech as input and translates it into another language. A standard machine translation system is trained on written language data and expects written language as input. In this paper, we propose an approach to close the gap between the output of automatic speech recognition and the input of machine translation by training the translation system on automatically transcribed speech. In our experiments, we show improvements of up to 0.9 BLEU points on the IWSLT 2012 English-to-French speech translation task.