Stochastic Modelling: From Pattern Classification to Speech Recognition and Language Translation
This paper gives an overview of the stochastic modelling approach to machine translation. Starting with the Bayes decision rule as in pattern classification and speech recognition, we show how the resulting system architecture can be structured into three parts: the language model probability, the string translation model probability and the search procedure that generates the word sequence in the target language. We discuss the properties of the system components and report results on the translation of spoken dialogues in the VERBMOBIL project. The experience obtained in the VERBMOBIL project, in particular a large-scale end-to-end evaluation, showed that the stochastic modelling approach resulted in significantly lower error rates than three competing translation approaches: the sentence error rate was 29% in comparison with 52% to 62% for the other translation approaches.
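The three-part architecture described above follows from the standard Bayes decision rule for translation, which can be sketched as follows (generic notation with target word sequence e_1^I and source sequence f_1^J; the symbols are illustrative, not taken from the paper):

```latex
\hat{e}_1^{I} \;=\; \operatorname*{argmax}_{e_1^{I}} \;
  \Pr(e_1^{I}) \cdot \Pr(f_1^{J} \mid e_1^{I})
```

Here the first factor is the language model probability, the second is the string translation model probability, and computing the argmax is the task of the search procedure, matching the three components the abstract names.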
Combining Several ASR Outputs in a Graph-Based SLU System
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-25751-8_66

In this paper, we present an approach to Spoken Language Understanding (SLU) where we perform a combination of multiple hypotheses from several Automatic Speech Recognizers (ASRs) in order to reduce the impact of recognition errors in the SLU module. This combination is performed using a Grammatical Inference algorithm that provides a generalization of the input sentences by means of a weighted graph of words. We have also developed a specific SLU algorithm that is able to process these graphs of words according to a stochastic semantic modelling. The results show that the combinations of several hypotheses from the ASR module outperform the results obtained by taking just the 1-best transcription.

This work is partially supported by the Spanish MEC under contract TIN2014-54288-C4-3-R and FPU Grant AP2010-4193.

Calvo Lance, M.; Hurtado Oliver, L.F.; García-Granada, F.; Sanchís Arnal, E. (2015). Combining Several ASR Outputs in a Graph-Based SLU System. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer, pp. 551-558. https://doi.org/10.1007/978-3-319-25751-8_66
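A minimal sketch of the idea of merging several ASR hypotheses into a weighted graph of words, with votes accumulated per word slot. This is only an illustrative stand-in (alignment against the first hypothesis via `difflib`), not the Grammatical Inference algorithm the paper actually uses; all names and the toy sentences are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def word_graph(hypotheses):
    """Build a slot-indexed weighted graph of words from several ASR
    hypotheses by aligning each one against the first (the pivot).
    Illustrative only; not the paper's Grammatical Inference method."""
    pivot = hypotheses[0]
    # graph[i] maps each candidate word at slot i to its vote count
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(pivot):
        graph[i][word] += 1
    for hyp in hypotheses[1:]:
        matcher = SequenceMatcher(a=pivot, b=hyp, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                for k in range(i2 - i1):
                    graph[i1 + k][pivot[i1 + k]] += 1
            elif op == "replace":
                # spread substituted words over the overlapping pivot slots
                for k in range(min(i2 - i1, j2 - j1)):
                    graph[i1 + k][hyp[j1 + k]] += 1
    return {i: dict(words) for i, words in graph.items()}

hyps = [
    "i want a ticket to valencia".split(),
    "i want a ticket to planeta".split(),
    "i want the ticket to valencia".split(),
]
g = word_graph(hyps)
# majority path through the graph recovers the consensus transcription
best = " ".join(max(g[i], key=g[i].get) for i in sorted(g))
print(best)  # i want a ticket to valencia
```

Even this crude voting recovers the correct word where any single 1-best hypothesis is wrong, which is the intuition behind combining several ASR outputs before SLU.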
Multimodal Grounding for Sequence-to-Sequence Speech Recognition
Humans are capable of processing speech by making use of multiple sensory
modalities. For example, the environment where a conversation takes place
generally provides semantic and/or acoustic context that helps us to resolve
ambiguities or to recall named entities. Motivated by this, there have been
many works studying the integration of visual information into the speech
recognition pipeline. Specifically, in our previous work, we proposed a
multistep visual adaptive training approach that improves the accuracy of an
audio-based Automatic Speech Recognition (ASR) system. This approach, however,
is not end-to-end as it requires fine-tuning the whole model with an adaptation
layer. In this paper, we propose novel end-to-end multimodal ASR systems and
compare them to the adaptive approach by using a range of visual
representations obtained from state-of-the-art convolutional neural networks.
We show that adaptive training is effective for sequence-to-sequence (S2S)
models, leading to an absolute improvement of 1.4% in word error rate. As for
the end-to-end systems, although they perform better than the baseline, their
improvements are slightly smaller than those of adaptive training, with a 0.8
absolute WER reduction in single-best models. Using ensemble decoding, the
end-to-end models reach a WER of 15%, which is the lowest score among all
systems. Comment: ICASSP 201
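The WER figures quoted above are the standard edit-distance-based metric. As a reminder of how that metric is defined, here is a minimal self-contained implementation (the function name and example sentences are illustrative, not from the paper):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance between the word sequences
    (substitutions + insertions + deletions) divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words
```

An "absolute improvement of 1.4%" in this metric means the ratio above drops by 0.014, independent of the baseline's starting point.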
Efficient Invariant Features for Sensor Variability Compensation in Speaker Recognition
In this paper, we investigate the use of invariant features for speaker recognition. Owing to their characteristics, these features are introduced to cope with the difficult and challenging problem of sensor variability, a major source of performance degradation in speaker recognition systems. Our experiments show: (1) the effectiveness of these features in match cases; (2) the benefit of combining these features with the mel-frequency cepstral coefficients to exploit their discriminative power under uncontrolled conditions (mismatch cases). Consequently, the proposed invariant features result in a performance improvement, as demonstrated by a reduction in the equal error rate and the minimum decision cost function compared to GMM-UBM speaker recognition systems based on MFCC features alone.
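The equal error rate (EER) reported above is the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of how it is computed from two lists of trial scores (the scores and function name here are hypothetical; the paper's GMM-UBM scoring is not reproduced):

```python
def equal_error_rate(genuine, impostor):
    """EER: the threshold-sweep point where the false-acceptance rate
    (impostors accepted) equals the false-rejection rate (targets
    rejected). Scores are similarities: higher = more target-like."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# toy similarity scores for target-speaker and impostor trials
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.5, 0.45, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))  # 0.2
```

A lower EER means the genuine and impostor score distributions overlap less, which is exactly the effect the combined invariant + MFCC features are reported to have.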