A language score based output selection method for multilingual speech recognition
The quality of a multilingual speech recognition system can be improved by
adaptation methods if the input language is specified. For systems that can
accept multilingual inputs, the popular approach is to apply a language
identifier to the input and then switch or configure decoders in the next step,
or to use an additional subsequent model to select the output from a set of
candidates.
Motivated by the goal of reducing latency for real-time applications, in
this paper a language model rescoring method is first applied to produce all
possible candidates for the target languages; a simple score is then proposed
to automatically select the output without any identifier model or
specification of the input language. The main point is that this score can be
estimated simply and automatically on the fly, so the whole decoding pipeline
is simpler and more compact. Experimental results showed that this method can
achieve the same quality as when the input language is specified. In addition,
we present the design of an English-Vietnamese end-to-end model that addresses
not only the problem of cross-lingual speakers but also serves as a solution
to improve the accuracy of English borrowed words in Vietnamese.
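The abstract's selection step can be illustrated with a minimal sketch. This is an assumption, not the paper's exact formula: it picks, among per-language rescored candidates, the one with the best length-normalized combined acoustic and language-model log score, so no language identifier is needed.

```python
# Hypothetical score-based output selection among per-language candidates.
# The scoring function and candidate format are assumptions for illustration.

def select_output(candidates):
    """Pick the candidate with the best length-normalized log score.

    candidates: list of (text, acoustic_logprob, lm_logprob) tuples,
    one per target language, produced by LM rescoring.
    """
    def score(c):
        text, am_logprob, lm_logprob = c
        n = max(len(text.split()), 1)  # normalize by length so languages compare fairly
        return (am_logprob + lm_logprob) / n
    return max(candidates, key=score)[0]

candidates = [
    ("hello world", -12.0, -5.0),        # English decoder output
    ("xin chao the gioi", -30.0, -22.0), # Vietnamese decoder output
]
print(select_output(candidates))  # -> hello world
```

Because the score is computed directly from quantities already available after rescoring, it can be evaluated on the fly without a separate identification pass.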
Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?
Phones, the segmental units of the International Phonetic Alphabet (IPA), are
used for lexical distinctions in most human languages; tones, the
suprasegmental units of the IPA, are used in perhaps 70% of them. Many previous studies
have explored cross-lingual adaptation of automatic speech recognition (ASR)
phone models, but few have explored the multilingual and cross-lingual transfer
of synchronization between phones and tones. In this paper, we test four
Connectionist Temporal Classification (CTC)-based acoustic models, differing in
the degree of synchrony they impose between phones and tones. Models are
trained and tested multilingually in three languages, then adapted and tested
cross-lingually in a fourth. Both synchronous and asynchronous models are
effective in both multilingual and cross-lingual settings. Synchronous models
achieve a lower error rate on the joint phone+tone tier, but asynchronous
training results in a lower tone error rate.
Comment: Accepted to Interspeech 202
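The synchronous/asynchronous distinction can be sketched as two ways of building CTC label sequences from toned syllables. This tier construction is an assumption for illustration; the paper's four models differ in how synchrony is imposed during training, which is not shown here.

```python
# Hypothetical label-tier construction for tonal-language CTC training.
# Each syllable is represented as (list_of_phones, tone).

def synchronous_labels(syllables):
    # Joint tier: one symbol per phone, each carrying its syllable's tone,
    # forcing phone and tone predictions to align frame-for-frame.
    return [f"{p}_{t}" for phones, t in syllables for p in phones]

def asynchronous_labels(syllables):
    # Separate tiers: independent phone and tone sequences, so the two
    # CTC heads may align to the audio at different times.
    phones = [p for ph, _ in syllables for p in ph]
    tones = [t for _, t in syllables]
    return phones, tones

# Mandarin "ni3 hao3" as (phones, tone) syllables:
syl = [(["n", "i"], "3"), (["h", "a", "o"], "3")]
print(synchronous_labels(syl))   # ['n_3', 'i_3', 'h_3', 'a_3', 'o_3']
print(asynchronous_labels(syl))  # (['n', 'i', 'h', 'a', 'o'], ['3', '3'])
```

The joint tier yields a larger symbol inventory but a single consistent alignment, while the separate tiers keep each inventory small at the cost of losing explicit phone-tone synchrony.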