    A Train-on-Target Strategy for Multilingual Spoken Language Understanding

    There are two main strategies to adapt a Spoken Language Understanding system to deal with languages different from the original (source) language: test-on-source and train-on-target. In the train-on-target approach, a new understanding model is trained in the target language, which is the language in which the test utterances are pronounced. To do this, a segmented and semantically labeled training set for each new language is needed. In this work, we use several general-purpose translators to obtain the translation of the training set and we apply an alignment process to automatically segment the training sentences. We have applied this train-on-target approach to estimate the understanding module of a Spoken Dialog System for the DIHANA task, which consists of an information system about train timetables and fares in Spanish. We present an evaluation of our train-on-target multilingual approach for two target languages, French and English.
    This work has been partially funded by the project ASLP-MULAN: Audio, Speech and Language Processing for Multimedia Analytics (MEC TIN2014-54288-C4-3-R).
    García-Granada, F.; Segarra Soriano, E.; Millán, C.; Sanchís Arnal, E.; Hurtado Oliver, LF. (2016). A Train-on-Target Strategy for Multilingual Spoken Language Understanding. Lecture Notes in Computer Science. 10077:224-233. https://doi.org/10.1007/978-3-319-49169-1_22
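The abstract above mentions an alignment process that segments translated training sentences automatically. The sketch below is one generic way to do this (annotation projection through word alignments) and is not taken from the paper. The function names, the concept labels, the `"null"` default and the rule that an unaligned word inherits the label of the word before it are all illustrative assumptions.

```python
def project_labels(src_labels, alignment, tgt_len, default="null"):
    """Project per-word concept labels from a source sentence onto its translation.

    src_labels: one label per source word.
    alignment:  (source_index, target_index) pairs from a word aligner.
    Assumption: an unaligned target word inherits the label of the previous
    target word, or `default` if it is the first word.
    """
    tgt_labels = [None] * tgt_len
    for s, t in alignment:
        if tgt_labels[t] is None:  # keep the first alignment seen for a word
            tgt_labels[t] = src_labels[s]
    prev = default
    for i, label in enumerate(tgt_labels):
        if label is None:
            tgt_labels[i] = prev
        prev = tgt_labels[i]
    return tgt_labels


def to_segments(words, labels):
    """Group consecutive words that share a label into (label, text) segments."""
    segments = []
    for word, label in zip(words, labels):
        if segments and segments[-1][0] == label:
            segments[-1][1].append(word)
        else:
            segments.append((label, [word]))
    return [(label, " ".join(ws)) for label, ws in segments]


# Hypothetical example: Spanish source, English translation.
src_labels = ["timetable", "destination", "destination"]  # horarios / a / Valencia
tgt_words = ["train", "timetables", "to", "Valencia"]
alignment = [(0, 1), (1, 2), (2, 3)]  # "train" is left unaligned

tgt_labels = project_labels(src_labels, alignment, len(tgt_words))
segments = to_segments(tgt_words, tgt_labels)
```

Here the unaligned first word falls into its own `"null"` segment. A real system would need a more careful policy for unaligned and reordered words.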

    An Efficient Probabilistic Deep Learning Model for the Oral Proficiency Assessment of Student Speech Recognition and Classification

    Natural Language Processing is a branch of artificial intelligence (AI) that focuses on the interaction between computers and human language. Speech recognition systems use machine learning algorithms and statistical models to analyze acoustic features of speech, such as pitch, duration, and frequency, to convert spoken words into written text. The Student English Oral Proficiency Assessment and Feedback System provides students with a comprehensive evaluation of their spoken English skills and offers tailored feedback to help them improve. It can be used in language learning institutions, universities, or online platforms to support language education and enhance oral communication abilities. This paper constructs a framework called Latent Dirichlet Integrated Deep Learning (LDiDL) for the assessment of student English proficiency. The system begins by collecting a comprehensive dataset of spoken English samples, encompassing various proficiency levels. Relevant features are extracted from the samples, including acoustic characteristics and linguistic attributes. Using Latent Dirichlet Allocation (LDA), the system uncovers latent topics within the data, enabling a deeper understanding of the underlying themes present in the spoken English. To further enhance the analysis, a deep learning model is developed that integrates the LDA topics with the extracted features. This model is trained using appropriate techniques and evaluated using performance metrics. Using the predictions made by the model, the system generates personalized feedback for each student, focusing on areas of improvement such as vocabulary, grammar, fluency, and pronunciation. The simulation uses native English speech audio for LDiDL training and classification. The experimental analysis reports that the proposed LDiDL model achieves an accuracy of 99% for the assessment of English proficiency.

    Spontaneous speech recognition using visual context-aware language models

    Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2003. Includes bibliographical references (p. 83-88). The thesis presents a novel situationally-aware multimodal spoken language system called Fuse that performs speech understanding for visual object selection. An experimental task was created in which people were asked to refer, using speech alone, to objects arranged on a table top. During training, Fuse acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions of individual objects. Fuse determines a set of visually salient words and phrases and associates them with a set of visual features. Given a new scene, Fuse uses the acquired knowledge to generate class-based language models conditioned on the objects present in the scene, as well as a spatial language model that predicts the occurrences of spatial terms conditioned on target and landmark objects. The speech recognizer in Fuse uses a weighted mixture of these language models to search for more likely interpretations of user speech in the context of the current scene. During decoding, the weights are updated using a visual attention model which redistributes attention over objects based on partially decoded utterances. The dynamic situationally-aware language models enable Fuse to jointly infer the spoken language utterances underlying speech signals as well as the identities of the target objects they refer to. In an evaluation of the system, visual situationally-aware language modeling shows a significant decrease, of more than 30%, in speech recognition and understanding error rates. The underlying ideas of situation-aware speech understanding that have been developed in Fuse may be applied in numerous areas including assistive and mobile human-machine interfaces. by Niloy Mukherjee. S.M
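The "weighted mixture of language models" in the abstract above can be illustrated with plain linear interpolation. This is a minimal sketch, not Fuse's implementation. The unigram dictionaries, the function name `mixture_prob` and the example attention weights are assumptions made for illustration. The abstract does not specify how the visual attention model updates the weights, so that step is not modeled here.

```python
def mixture_prob(word, models, weights):
    """Probability of `word` under a weighted mixture of language models.

    models:  list of dicts mapping word -> probability (one hypothetical
             unigram model per object in the scene).
    weights: non-negative attention weights, one per model; they are
             normalized here so the mixture remains a distribution.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum((w / total) * m.get(word, 0.0) for m, w in zip(models, weights))


# Hypothetical scene with two objects, each with its own small language model.
red_ball_lm = {"red": 0.5, "ball": 0.5}
blue_ball_lm = {"blue": 0.5, "ball": 0.5}
models = [red_ball_lm, blue_ball_lm]

# Attention currently favors the first object.
p_red = mixture_prob("red", models, [0.75, 0.25])
p_blue = mixture_prob("blue", models, [0.75, 0.25])
p_ball = mixture_prob("ball", models, [0.75, 0.25])
```

Shifting weight toward one object raises the probability of the words that describe it. This is the effect the abstract attributes to redistributing attention during decoding.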

    ASR error management for improving spoken language understanding

    This paper addresses the problem of automatic speech recognition (ASR) error detection and its use for improving spoken language understanding (SLU) systems. In this study, the SLU task consists of automatically extracting, from ASR transcriptions, semantic concepts and concept/value pairs in, e.g., a touristic information system. An approach is proposed for enriching the set of semantic labels with error-specific labels and for using a recently proposed neural approach based on word embeddings to compute well-calibrated ASR confidence measures. Experimental results show that it is possible to significantly decrease the Concept/Value Error Rate with a state-of-the-art system, outperforming previously published results on the same experimental data. It is also shown that, by combining an SLU approach based on conditional random fields with a neural encoder/decoder attention-based architecture, it is possible to effectively identify confidence islands and uncertain semantic output segments, which are useful for deciding appropriate error handling actions by the dialogue manager strategy. Comment: Interspeech 2017, Aug 2017, Stockholm, Sweden. 201

    Dialogue history integration into end-to-end signal-to-concept spoken language understanding systems

    This work investigates embeddings for representing dialog history in spoken language understanding (SLU) systems. We focus on the scenario in which the semantic information is extracted directly from the speech signal by means of a single end-to-end neural network model. We propose to integrate dialog history into an end-to-end signal-to-concept SLU system. The dialog history is represented in the form of dialog history embedding vectors (so-called h-vectors) and is provided as additional information to end-to-end SLU models in order to improve the system performance. The following three types of h-vectors are proposed and experimentally evaluated in this paper: (1) supervised-all embeddings predicting the bag-of-concepts expected in the answer of the user from the last dialog system response; (2) supervised-freq embeddings focusing on predicting only a selected set of semantic concepts (corresponding to the most frequent errors in our experiments); and (3) unsupervised embeddings. Experiments on the MEDIA corpus for the semantic slot filling task demonstrate that the proposed h-vectors improve the model performance. Comment: Accepted for ICASSP 2020 (Submitted: October 21, 2019