62,875 research outputs found
Incorporating Device Context In Natural Language Understanding
Automatic speech recognition (ASR) models are used to recognize user commands or queries in products such as smartphones, smart speakers/displays, and other products that enable speech interaction. Automatic speech recognition is a complex problem that requires correct processing of both the acoustic and semantic signals in the voice input. Natural language understanding (NLU) systems sometimes fail to correctly interpret utterances that are associated with multiple possible intents. Per the techniques described herein, device context features, such as the identity of the foreground application, are used to disambiguate the intent of a voice query. Incorporating device context as input to NLU models improves their ability to correctly interpret utterances with ambiguous intent.
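The rescoring idea in the abstract above can be sketched as combining text-only intent scores with a prior conditioned on the foreground application. This is a minimal illustrative sketch, not the actual system: the intents, app names, scores, and the `disambiguate` helper are all made up for illustration.

```python
# Toy intent scores from a text-only NLU model for an ambiguous query
# such as "play frozen" (movie vs. song). Values are illustrative.
text_scores = {"play_movie": 0.48, "play_song": 0.52}

# Hypothetical prior over intents conditioned on the foreground app.
context_prior = {
    "video_app": {"play_movie": 0.9, "play_song": 0.1},
    "music_app": {"play_movie": 0.1, "play_song": 0.9},
}

def disambiguate(text_scores, foreground_app):
    """Rescore intents with the device-context prior and renormalize."""
    prior = context_prior[foreground_app]
    joint = {i: text_scores[i] * prior[i] for i in text_scores}
    z = sum(joint.values())
    return {i: s / z for i, s in joint.items()}

scores = disambiguate(text_scores, "video_app")
best = max(scores, key=scores.get)
```

With a video app in the foreground, the context prior flips the decision to `play_movie` even though the text-only model slightly preferred `play_song`, which is the disambiguation effect the abstract describes.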
A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding
Self-supervised speech representations such as wav2vec 2.0 and HuBERT are making revolutionary progress in Automatic Speech Recognition (ASR). However, it has not been fully established that self-supervised models improve performance on tasks other than ASR. In this work, we explore partial fine-tuning and entire fine-tuning of wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks: Speech Emotion Recognition, Speaker Verification, and Spoken Language Understanding. We also compare pre-trained models with and without ASR fine-tuning. With simple downstream frameworks, the best scores reach 79.58% weighted accuracy for Speech Emotion Recognition on IEMOCAP, 2.36% equal error rate for Speaker Verification on VoxCeleb1, and 87.51% accuracy for Intent Classification and 75.32% F1 for Slot Filling on SLURP, setting a new state of the art for these three benchmarks and showing that fine-tuned wav2vec 2.0 and HuBERT models can better learn prosodic, voice-print, and semantic representations. Comment: 5 pages, 2 figures
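The partial/entire distinction above can be sketched as a trainability toggle over the model's modules. One common reading, assumed here rather than taken from the paper, is that partial fine-tuning freezes the convolutional feature extractor and tunes only the transformer layers plus the task head, while entire fine-tuning updates everything; the module names below are illustrative.

```python
# Toy model as a dict of modules with trainability flags; in a real
# framework this would be setting requires_grad on parameter groups.
model = {
    "cnn_feature_extractor": {"trainable": True},  # pre-trained wave encoder
    "transformer_encoder":   {"trainable": True},  # pre-trained context network
    "downstream_head":       {"trainable": True},  # task head, always trained
}

def set_finetuning_mode(model, mode):
    """'partial': freeze the CNN feature extractor and tune the rest;
    'entire': tune every module, including the CNN (assumed convention)."""
    if mode == "partial":
        model["cnn_feature_extractor"]["trainable"] = False
        model["transformer_encoder"]["trainable"] = True
    elif mode == "entire":
        model["cnn_feature_extractor"]["trainable"] = True
        model["transformer_encoder"]["trainable"] = True
    model["downstream_head"]["trainable"] = True

set_finetuning_mode(model, "partial")
trainable = [name for name, m in model.items() if m["trainable"]]
```

Freezing the low-level feature extractor is a common stabilizer when the downstream corpus is small, which is one motivation for comparing the two modes.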
Investigating Adaptation and Transfer Learning for End-to-End Spoken Language Understanding from Speech
This work investigates speaker adaptation and transfer learning for spoken language understanding (SLU). We focus on the direct extraction of semantic tags from the audio signal using an end-to-end neural network approach. We demonstrate that the learning performance of the target predictive function for the semantic slot filling task can be substantially improved by speaker adaptation and by various knowledge transfer approaches. First, we explore speaker adaptive training (SAT) for end-to-end SLU models and propose using zero pseudo i-vectors for more efficient model initialization and pretraining in SAT. Second, to improve learning convergence for the target semantic slot filling (SF) task, models trained for different tasks, such as automatic speech recognition and named entity extraction, are used to initialize neural end-to-end models trained for the target task. In addition, we explore the impact of knowledge transfer to SLU from a speech recognition task trained in a different language. These approaches make it possible to develop end-to-end SLU systems in low-resource data scenarios where there is not enough in-domain semantically labeled data, but other resources, such as word transcriptions for the same or another language or named entity annotations, are available.
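The "zero pseudo i-vector" trick above can be sketched as input construction: the network always receives an i-vector slot alongside the acoustic features, and during initialization/pretraining that slot is filled with zeros so the same architecture can later consume real speaker i-vectors. This is a minimal sketch with made-up dimensions, not the paper's actual pipeline.

```python
# Illustrative dimensionality; real i-vectors are often a few hundred dims.
IVEC_DIM = 100

def augment_frames(frames, ivector=None):
    """Append a speaker i-vector (or a zero pseudo i-vector) to each frame."""
    if ivector is None:                 # pretraining: zero pseudo i-vector
        ivector = [0.0] * IVEC_DIM
    return [frame + ivector for frame in frames]

frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]           # toy 3-dim features
pre = augment_frames(frames)                           # pretraining input
sat = augment_frames(frames, ivector=[1.0] * IVEC_DIM) # speaker-adapted input
```

Because the input layout is identical in both phases, pretrained weights transfer directly into the speaker-adaptive training stage.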
A multilingual SLU system based on semantic decoding of graphs of words
In this paper, we present a statistical approach to Language Understanding that avoids the effort of obtaining new semantic models when changing the language. This way, it is not necessary to acquire and label new training corpora in the new language. Our approach consists of learning all the semantic models in a target language and performing the semantic decoding of the sentences uttered in the source language after a translation process. In order to deal with the errors and the limited coverage of the translations, a mechanism to generalize the output of several translators is proposed. The graph of words generated in this phase is the input to a semantic decoding algorithm specifically designed to combine statistical models and graphs of words. Experiments that show the good behavior of the proposed approach are also presented. Calvo Lance, M.; Hurtado Oliver, L.F.; García Granada, F.; Sanchís Arnal, E. (2012). A multilingual SLU system based on semantic decoding of graphs of words. In: Advances in Speech and Language Technologies for Iberian Languages. Springer Verlag (Germany). 328:158-167. doi:10.1007/978-3-642-35292-8_17
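The generalization mechanism above can be sketched as merging several translator outputs into a weighted graph of words, so that no single translation error dominates. The sketch below uses a crude position-based confusion network with made-up sentences; the paper's actual graph construction is more sophisticated (e.g. alignment-based).

```python
from collections import defaultdict

# Hypothetical outputs of three translators for the same source sentence.
translations = [
    "i want a ticket to valencia".split(),
    "i would like a ticket to valencia".split(),
    "i want one ticket for valencia".split(),
]

def build_word_graph(hypotheses):
    """Vote-weighted edges keyed by (position, word): a toy word graph."""
    edges = defaultdict(int)
    for hyp in hypotheses:
        for pos, word in enumerate(hyp):
            edges[(pos, word)] += 1
    return edges

def best_path(edges):
    """Pick the highest-voted word at each position."""
    length = max(pos for pos, _ in edges) + 1
    return [max((w for p, w in edges if p == pos),
                key=lambda w: edges[(pos, w)]) for pos in range(length)]

edges = build_word_graph(translations)
consensus = best_path(edges)
```

In the paper the graph itself, not a single consensus path, is fed to the semantic decoder, so alternative wordings survive into the decoding stage.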
Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances. Previous research has made progress in end-to-end SLU by using paired speech-text data, such as pre-trained Automatic Speech Recognition (ASR) models or paired text as intermediate targets. However, acquiring paired transcripts is expensive and impractical for unwritten languages. Textless SLU, on the other hand, extracts semantic information from speech without using paired transcripts; however, the absence of intermediate targets and training guidance in textless SLU often results in suboptimal performance. In this work, inspired by the content-disentangled discrete units from self-supervised speech models, we propose to use discrete units as intermediate guidance to improve textless SLU performance. Our method surpasses the baseline method on five SLU benchmark corpora. Additionally, we find that unit guidance facilitates few-shot learning and enhances the model's ability to handle noise. Comment: Accepted by Interspeech 202
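Discrete units of the kind the abstract relies on are typically obtained by clustering continuous self-supervised features and collapsing consecutive repeats; the resulting unit sequence then serves as a text-free intermediate target. The sketch below is a toy stand-in: 1-D "features" and a hand-picked codebook, not real SSL representations.

```python
def assign_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid
    (a stand-in for k-means quantization of SSL features)."""
    def nearest(x):
        return min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
    return [nearest(f) for f in frames]

def deduplicate(units):
    """Collapse consecutive repeats, as is common for unit targets."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

frames = [0.1, 0.15, 0.9, 0.95, 0.5]   # toy 1-D "SSL features"
centroids = [0.1, 0.5, 0.9]            # toy learned codebook
units = deduplicate(assign_units(frames, centroids))
```

The unit sequence plays the role that a transcript plays in text-based SLU: a discrete, content-bearing target that requires no written language.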
Exploiting multiple ASR outputs for a spoken language understanding task
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-01931-4_19. In this paper, we present an approach to Spoken Language Understanding where the input to the semantic decoding process is a composition of multiple hypotheses provided by the Automatic Speech Recognition module. This way, the semantic constraints can be applied not only to a single hypothesis but also to other hypotheses that could represent a better recognition of the utterance. To do this, we have developed an algorithm that combines multiple sentences into a weighted graph of words, which is the input to the semantic decoding process. It has also been necessary to develop a specific algorithm to process these graphs of words according to the statistical models that represent the semantics of the task. This approach has been evaluated on an SLU task in Spanish. Results, considering different configurations of ASR outputs, show that the system behaves better when a combination of hypotheses is considered. This work is partially supported by the Spanish MICINN under contract TIN2011-28169-C05-01 and under FPU Grant AP2010-4193. Calvo Lance, M.; García Granada, F.; Hurtado Oliver, L.F.; Jiménez Serrano, S.; Sanchís Arnal, E. (2013). Exploiting multiple ASR outputs for a spoken language understanding task. In: Speech and Computer. Springer Verlag (Germany). 8113:138-145. https://doi.org/10.1007/978-3-319-01931-4_19
Spoken Language Intent Detection using Confusion2Vec
Decoding a speaker's intent is a crucial part of spoken language understanding (SLU). The presence of noise or errors in the text transcriptions in real-life scenarios makes the task more challenging. In this paper, we address spoken language intent detection under the noisy conditions imposed by automatic speech recognition (ASR) systems. We propose to employ the confusion2vec word feature representation to compensate for the errors made by ASR and to increase the robustness of the SLU system. Confusion2vec, motivated by human speech production and perception, models acoustic relationships between words in addition to the semantic and syntactic relations of words in human language. We hypothesize that ASR often makes errors between acoustically similar words, and that confusion2vec, with its inherent model of acoustic relationships between words, is able to compensate for these errors. Through experiments on the ATIS benchmark dataset, we demonstrate the robustness of the proposed model, which achieves state-of-the-art results under noisy ASR conditions. Our system reduces classification error rate (CER) by 20.84% and improves robustness by 37.48% (lower CER degradation) relative to the previous state of the art when going from clean to noisy transcripts. Improvements are also demonstrated when training the intent detection models on noisy transcripts.
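The robustness argument above can be sketched in miniature: if acoustically confusable words sit close together in the embedding space, then an ASR substitution barely moves a bag-of-embeddings sentence representation, so a downstream intent classifier sees nearly the same input. The 2-D vectors below are made up for illustration and are not confusion2vec embeddings.

```python
# Toy embedding space where the homophones "fare"/"fair" are neighbors,
# mimicking the acoustic-similarity structure confusion2vec learns.
emb = {
    "fare":    [0.90, 0.10],
    "fair":    [0.88, 0.12],
    "show":    [0.20, 0.80],
    "flights": [0.50, 0.50],
}

def sentence_vec(words):
    """Average the word embeddings (bag-of-embeddings sentence vector)."""
    dims = len(next(iter(emb.values())))
    return [sum(emb[w][d] for w in words) / len(words) for d in range(dims)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

clean = sentence_vec(["show", "flights", "fare"])
noisy = sentence_vec(["show", "flights", "fair"])   # ASR confusion
sim = cosine(clean, noisy)                          # remains near 1.0
```

A purely semantic embedding would place "fare" and "fair" far apart, so the same substitution could move the sentence vector enough to flip the predicted intent; encoding acoustic neighborhoods closes that gap.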