
    Speaker-adapted confidence measures for speech recognition of video lectures

    [EN] Automatic speech recognition applications can benefit from a confidence measure (CM) to predict the reliability of the output. Previous works showed that a word-dependent naive Bayes (NB) classifier outperforms the conventional word posterior probability as a CM. However, a discriminative formulation usually renders improved performance due to the available training techniques. Taking this into account, we propose a logistic regression (LR) classifier defined with simple input functions to approximate the NB behaviour. Additionally, as a main contribution, we propose to adapt the CM to the speaker in cases in which it is possible to identify the speakers, such as online lecture repositories. The experiments have shown that speaker-adapted models outperform their non-adapted counterparts on two difficult tasks from English (videoLectures.net) and Spanish (poliMedia) educational lectures. They have also shown that the NB model is clearly superseded by the proposed LR classifier.

    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287755. Also supported by the Spanish MINECO (iTrans2 TIN2009-14511 and Active2Trans TIN2012-31723) research projects and the FPI Scholarship BES-2010-033005.

    Sanchez-Cortina, I.; Andrés Ferrer, J.; Sanchis Navarro, JA.; Juan Císcar, A. (2016). Speaker-adapted confidence measures for speech recognition of video lectures. Computer Speech and Language. 37:11-23. https://doi.org/10.1016/j.csl.2015.10.003
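    A minimal sketch of the core idea in this entry: a logistic regression confidence classifier over simple word-level features, adapted to a known speaker by pulling the speaker-specific weights towards the generic ones. The toy features and the MAP-style adaptation below are assumptions for illustration, not the paper's exact formulation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_lr(X, y, w0=None, prior_w=None, prior_tau=0.0, lr=0.1, epochs=200):
        """Fit LR by gradient descent; if prior_w is given, add an L2 pull
        towards it (a simple MAP-style form of speaker adaptation)."""
        n, d = X.shape
        w = np.zeros(d) if w0 is None else w0.copy()
        for _ in range(epochs):
            p = sigmoid(X @ w)
            grad = X.T @ (p - y) / n
            if prior_w is not None:
                grad += prior_tau * (w - prior_w)
            w -= lr * grad
        return w

    # Toy word-level features: [bias, word posterior, log duration] (assumed).
    rng = np.random.default_rng(0)
    X_gen = rng.normal(size=(5000, 3)); X_gen[:, 0] = 1.0
    y_gen = (rng.uniform(size=5000) < sigmoid(X_gen @ np.array([0.2, 2.0, -0.5]))).astype(float)

    w_generic = train_lr(X_gen, y_gen)                      # speaker-independent CM
    X_spk, y_spk = X_gen[:200], y_gen[:200]                 # small per-speaker supervision
    w_adapted = train_lr(X_spk, y_spk, w0=w_generic,
                         prior_w=w_generic, prior_tau=1.0)  # speaker-adapted CM

    confidence = sigmoid(X_spk @ w_adapted)                 # per-word confidence in [0, 1]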

    Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional Recurrent Neural Networks

    © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

    [EN] In the last years, Deep Bidirectional Recurrent Neural Networks (DBRNN) and DBRNN with Long Short-Term Memory cells (DBLSTM) have outperformed the most accurate classifiers for confidence estimation in automatic speech recognition. At the same time, we have recently shown that speaker adaptation of confidence measures using DBLSTM yields significant improvements over non-adapted confidence measures. In accordance with these two recent contributions to the state of the art in confidence estimation, this paper presents a comprehensive study of speaker-adapted confidence measures using DBRNN and DBLSTM models. Firstly, we present new empirical evidence of the superiority of RNN-based confidence classifiers evaluated over a large speech corpus consisting of the English LibriSpeech and the Spanish poliMedia tasks. Secondly, we show new results on speaker-adapted confidence measures considering a multi-task framework in which RNN-based confidence classifiers trained with LibriSpeech are adapted to speakers of the TED-LIUM corpus. These experiments confirm that speaker-adapted confidence measures outperform their non-adapted counterparts. Lastly, we describe an unsupervised adaptation method of the acoustic DBLSTM model based on confidence measures which results in better automatic speech recognition performance.

    This work was supported in part by the European Union's Horizon 2020 research and innovation programme under Grant 761758 (X5gon), in part by the Seventh Framework Programme (FP7/2007-2013) under Grant 287755 (transLectures), in part by the ICT Policy Support Programme (ICT PSP/2007-2013) as part of the Competitiveness and Innovation Framework Programme under Grant 621030 (EMMA), and in part by the Spanish Government's TIN2015-68326-R (MINECO/FEDER) research project MORE.

    Del Agua Teba, MA.; Giménez Pastor, A.; Sanchis Navarro, JA.; Civera Saiz, J.; Juan, A. (2018). Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional Recurrent Neural Networks. IEEE/ACM Transactions on Audio Speech and Language Processing. 26(7):1198-1206. https://doi.org/10.1109/TASLP.2018.2819900
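    A minimal sketch, in PyTorch, of a bidirectional LSTM confidence estimator of the kind studied in this entry: it reads a sequence of per-word feature vectors for an utterance and emits one correct/incorrect probability per word. Feature and layer sizes are assumed; the paper's DBRNN/DBLSTM configurations and adaptation recipes are not reproduced here.

    import torch
    import torch.nn as nn

    class BLSTMConfidence(nn.Module):
        def __init__(self, feat_dim=10, hidden=64, layers=2):
            super().__init__()
            self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                                 batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, 1)       # forward + backward states

        def forward(self, feats):                     # feats: (batch, words, feat_dim)
            h, _ = self.blstm(feats)
            return torch.sigmoid(self.out(h)).squeeze(-1)   # (batch, words)

    model = BLSTMConfidence()
    feats = torch.randn(4, 25, 10)                    # 4 utterances, 25 words each
    conf = model(feats)                               # per-word confidence scores

    # Speaker adaptation would then fine-tune (a subset of) these weights on
    # data from a single speaker, starting from the speaker-independent model.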

    Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models

    [EN] Although Long-Short Term Memory (LSTM) networks and deep Transformers are now extensively used in offline ASR, it is unclear how offline systems can best be adapted to work under the streaming setup. After gaining considerable experience in this regard in recent years, in this paper we show how an optimized, low-latency streaming decoder can be built in which bidirectional LSTM acoustic models, together with general interpolated language models, can be nicely integrated with minimal performance degradation. In brief, our streaming decoder consists of a one-pass, real-time search engine relying on a limited-duration window sliding over time and a number of ad hoc acoustic and language model pruning techniques. Extensive empirical assessment is provided on truly streaming tasks derived from the well-known LibriSpeech and TED talks datasets, as well as from TV shows on a main Spanish broadcasting station.

    This work was supported in part by the European Union's Horizon 2020 Research and Innovation Programme under Grants 761758 (X5gon) and 952215 (TAILOR) and the Erasmus+ Education Programme under Grant Agreement 20-226-093604-SCH, in part by MCIN/AEI/10.13039/501100011033 ERDF "A way of making Europe" under Grant RTI2018-094879-B-I00, and in part by the Generalitat Valenciana's research project Classroom Activity Recognition under Grant PROMETEO/2019/111. Funding for open access charge: CRUE-Universitat Politècnica de València. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Lei Xie.

    Jorge-Cano, J.; Giménez Pastor, A.; Silvestre Cerdà, JA.; Civera Saiz, J.; Sanchis Navarro, JA.; Juan, A. (2022). Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models. IEEE/ACM Transactions on Audio Speech and Language Processing. 30:148-161. https://doi.org/10.1109/TASLP.2021.3133216
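    A minimal sketch of the limited-duration sliding window underlying the streaming decoder described above: acoustic frames arrive one at a time, only the most recent frames are kept, and a decode step is run on each new chunk. The real one-pass search engine and its pruning techniques are replaced here by a placeholder function.

    from collections import deque

    def streaming_decode(frame_stream, window_frames=150, chunk=10, decode_step=None):
        window = deque(maxlen=window_frames)    # bounded context keeps latency low
        pending = 0
        for frame in frame_stream:
            window.append(frame)
            pending += 1
            if pending == chunk:                # emit a partial hypothesis every `chunk` frames
                pending = 0
                if decode_step is not None:
                    yield decode_step(list(window))

    # Usage with a dummy frame stream and decode step:
    frames = range(1000)                        # stand-in for a live stream of acoustic frames
    for partial_hyp in streaming_decode(frames, decode_step=lambda w: len(w)):
        pass                                    # consume partial hypotheses as they arrive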

    The TransLectures-UPV Toolkit

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-13623-3_28

    Over the past few years, online multimedia educational repositories have increased in number and popularity. The main aim of the transLectures project is to develop cost-effective solutions for producing accurate transcriptions and translations for large video lecture repositories, such as VideoLectures.NET or the Universitat Politècnica de València's repository, poliMedia. In this paper, we present the transLectures-UPV toolkit (TLK), which has been specifically designed to meet the requirements of the transLectures project, but can also be used as a conventional ASR toolkit. The main features of the current release include HMM training and decoding with speaker adaptation techniques (fCMLLR). TLK has been tested on the VideoLectures.NET and poliMedia repositories, yielding very competitive results. TLK has been released under the permissive open source Apache License v2.0 and can be directly downloaded from the transLectures website.

    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287755 (transLectures) and the ICT Policy Support Programme (ICT PSP/2007-2013) as part of the Competitiveness and Innovation Framework Programme (CIP) under grant agreement no 621030 (EMMA), and the Spanish MINECO Active2Trans (TIN2012-31723) research project.

    Del Agua Teba, MA.; Giménez Pastor, A.; Serrano Martinez Santos, N.; Andrés Ferrer, J.; Civera Saiz, J.; Sanchis Navarro, JA.; Juan Císcar, A. (2014). The TransLectures-UPV Toolkit. En Advances in Speech and Language Technologies for Iberian Languages: Second International Conference, IberSPEECH 2014, Las Palmas de Gran Canaria, Spain, November 19-21, 2014. Proceedings. Springer International Publishing. 269-278. https://doi.org/10.1007/978-3-319-13623-3_28
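    To illustrate the fCMLLR-style feature-space speaker adaptation mentioned in this entry, here is a minimal sketch that applies a per-speaker affine transform to acoustic features; how the transform is estimated (by maximising likelihood under the HMMs) is not shown, and the feature dimension is an assumption.

    import numpy as np

    def apply_fcmllr(features, A, b):
        """features: (frames, dim); A: (dim, dim); b: (dim,). Returns A @ x + b per frame."""
        return features @ A.T + b

    dim = 40                                    # e.g. a filterbank/MFCC dimension (assumed)
    feats = np.random.randn(300, dim)           # frames of one utterance
    A, b = np.eye(dim), np.zeros(dim)           # identity transform = no adaptation
    adapted = apply_fcmllr(feats, A, b)         # same shape, speaker-normalised features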

    MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks

    [EN] This work describes the participation of the MLLP-VRAIN research group in the two shared tasks of the IWSLT 2022 conference: Simultaneous Speech Translation and Speech-to-Speech Translation. We present our streaming-ready ASR, MT and TTS systems for Speech Translation and Synthesis from English into German. Our submission combines these systems by means of a cascade approach, paying special attention to data preparation and decoding for streaming inference.

    The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no. 761758 (X5Gon) and 952215 (TAILOR), and the Erasmus+ Education programme under grant agreement no. 20-226-093604-SCH (EXPERT); the Government of Spain's grant RTI2018-094879-B-I00 (Multisub) funded by MCIN/AEI/10.13039/501100011033 & ERDF "A way of making Europe", and FPU scholarship FPU18/04135; and the Generalitat Valenciana's research project Classroom Activity Recognition (ref. PROMETEO/2019/111).

    Iranzo-Sánchez, J.; Jorge-Cano, J.; Pérez-González De Martos, A.; Giménez, A.; Garcés Díaz-Munío, G.; Baquero-Arnal, P.; Silvestre Cerdà, JA.... (2022). MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks. Association for Computational Linguistics (ACL). 255-264. https://doi.org/10.18653/v1/2022.iwslt-1.22
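    A minimal sketch of the cascade approach described above: streaming ASR output is translated and then synthesised, chunk by chunk. The three component functions are placeholders standing in for the actual MLLP-VRAIN ASR, MT and TTS systems.

    def cascade_speech_to_speech(audio_chunks, asr, mt, tts):
        """Yield German synthetic speech for a stream of English audio chunks."""
        for chunk in audio_chunks:
            source_text = asr(chunk)            # EN transcription (possibly partial)
            if not source_text:
                continue
            target_text = mt(source_text)       # EN -> DE translation
            yield tts(target_text)              # DE synthetic speech

    # Usage with trivial stand-ins for the real systems:
    audio = [b"chunk-1", b"chunk-2"]
    dubbed_speech = list(cascade_speech_to_speech(
        audio,
        asr=lambda a: "hello world",
        mt=lambda s: "hallo welt",
        tts=lambda t: t.encode()))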

    Hacia la traducción integral de vídeo charlas educativas

    [EN] More and more universities and educational institutions are banking on the production of technological resources for different uses in higher education. The MLLP research group has been working closely with the ASIC at UPV in order to enrich educational multimedia resources through the use of machine learning technologies, such as automatic speech recognition, machine translation or text-to-speech synthesis. In this work, developed under the Plan de Docencia en Red 2016-17 framework, we present the application of innovative technologies in order to achieve the integral translation of educational videos.

    [ES] Cada vez son más las universidades e instituciones educativas que apuestan por la producción de recursos tecnológicos para diversos usos en enseñanza superior. El grupo de investigación MLLP lleva años colaborando con el ASIC de la UPV con el fin de enriquecer estos materiales haciendo uso de tecnologías de machine learning, como son el reconocimiento automático del habla, la traducción automática o la síntesis de voz. En este trabajo, bajo el marco del Plan de Docencia en Red 2016-17, abordaremos la traducción integral de vídeos docentes mediante el uso de estas tecnologías.

    The research presented here has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287755 (transLectures) and from the ICT PSP/2007-2013 as part of the Competitiveness and Innovation Framework Programme (CIP) under grant agreement no 621030 (EMMA); as well as from the Spanish national research project TIN2015-68326-R (MINECO/FEDER) (MORE) and the Generalitat Valenciana's VALi+d scholarship ACIF/2015/082.

    Piqueras, S.; Pérez González De Martos, AM.; Turró Ribalta, C.; Jimenez, M.; Sanchis Navarro, JA.; Civera Saiz, J.; Juan Císcar, A. (2017). Hacia la traducción integral de vídeo charlas educativas. En In-Red 2017. III Congreso Nacional de innovación educativa y de docencia en red. Editorial Universitat Politècnica de València. 117-124. https://doi.org/10.4995/INRED2017.2017.6812

    MLLP-VRAIN Spanish ASR Systems for the Albayzín-RTVE 2020 Speech-to-Text Challenge: Extension

    [EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting of building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a similar configuration as the primary system with a smaller context window of 0.6 s, scored 16.9% WER on the same test set, with a measured empirical latency of 0.81 ± 0.09 s (mean ± stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER, respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.

    The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no. 761758 (X5Gon) and 952215 (TAILOR), and the Erasmus+ Education programme under grant agreement no. 20-226-093604-SCH (EXPERT); the Government of Spain's grant RTI2018-094879-B-I00 (Multisub) funded by MCIN/AEI/10.13039/501100011033 & "ERDF A way of making Europe", and FPU scholarships FPU14/03981 and FPU18/04135; the Generalitat Valenciana's research project Classroom Activity Recognition (ref. PROMETEO/2019/111), and predoctoral research scholarship ACIF/2017/055; and the Universitat Politècnica de València's PAID-01-17 R&D support programme.

    Baquero-Arnal, P.; Jorge-Cano, J.; Giménez Pastor, A.; Iranzo-Sánchez, J.; Pérez-González De Martos, AM.; Garcés Díaz-Munío, G.; Silvestre Cerdà, JA.... (2022). MLLP-VRAIN Spanish ASR Systems for the Albayzín-RTVE 2020 Speech-to-Text Challenge: Extension. Applied Sciences. 12(2):1-14. https://doi.org/10.3390/app12020804
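    For reference, a minimal sketch of the word error rate (WER) metric behind these figures, computed as a word-level edit distance normalised by the reference length, together with the roughly 6% relative degradation quoted between the 16.0% and 16.9% WER systems.

    def wer(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / reference words."""
        r, h = reference.split(), hypothesis.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i                              # deletions
        for j in range(len(h) + 1):
            d[0][j] = j                              # insertions
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(r)][len(h)] / max(len(r), 1)

    print(wer("live captions for media streams", "live caption for media streams"))  # 0.2
    relative_degradation = (16.9 - 16.0) / 16.0      # ~0.056, i.e. about 6% relative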

    Doblaje automático de vídeo-charlas educativas en UPV[Media]

    [EN] More and more universities are banking on the production of digital contents to support online or blended learning in higher education. Over the last years, the MLLP research group has been working closely with the UPV's ASIC media services in order to enrich educational multimedia resources through the application of natural language processing technologies including automatic speech recognition, machine translation and text-to-speech. In this work we present the steps that are being followed for the comprehensive translation of these materials, specifically through (semi-)automatic dubbing by making use of state-of-the-art speaker-adaptive text-to-speech technologies.

    [ES] Cada vez son más las universidades que apuestan por la producción de contenidos digitales como apoyo al aprendizaje en línea o combinado en la enseñanza superior. El grupo de investigación MLLP lleva años trabajando junto al ASIC de la UPV para enriquecer estos materiales, y particularmente su accesibilidad y oferta lingüística, haciendo uso de tecnologías del lenguaje como el reconocimiento automático del habla, la traducción automática y la síntesis de voz. En este trabajo presentamos los pasos que se están dando hacia la traducción integral de estos materiales, concretamente a través del doblaje (semi-)automático mediante sistemas de síntesis de voz adaptables al locutor.

    This work has received funding from the Government of Spain through grant RTI2018-094879-B-I00 (Multisub), funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe"; from the Erasmus+ Education programme under grant agreement 20-226-093604-SCH (EXPERT); and from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 761758 (X5gon).

    Pérez González De Martos, AM.; Giménez Pastor, A.; Jorge Cano, J.; Iranzo Sánchez, J.; Silvestre Cerdà, JA.; Garcés Díaz-Munío, GV.; Baquero Arnal, P.... (2023). Doblaje automático de vídeo-charlas educativas en UPV[Media]. En In-Red 2022 - VIII Congreso Nacional de Innovación Educativa y Docencia en Red. Editorial Universitat Politècnica de València. https://doi.org/10.4995/INRED2022.2022.1584
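    A minimal sketch of the (semi-)automatic dubbing flow described above, assuming a speaker embedding extracted from the original lecturer's audio and a duration-constrained synthesis call so the dubbed audio fits the original segment; every component function below is a hypothetical placeholder, not the UPV[Media] pipeline itself.

    def dub_segment(segment, translate, synthesize, extract_speaker_embedding):
        """Return synthetic target-language audio for one transcribed segment."""
        spk = extract_speaker_embedding(segment["audio"])       # voice of the original lecturer
        target_text = translate(segment["text"])                # e.g. ES -> EN translation
        duration = segment["end"] - segment["start"]            # fit the original timing
        return synthesize(target_text, speaker=spk, target_duration=duration)

    # Usage with trivial stand-ins for the real components:
    segment = {"audio": b"...", "text": "hola a todos", "start": 0.0, "end": 1.8}
    dubbed = dub_segment(segment,
                         translate=lambda t: "hello everyone",
                         synthesize=lambda t, speaker, target_duration: b"synthetic-audio",
                         extract_speaker_embedding=lambda a: [0.0] * 256)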

    TransLectures

    transLectures (Transcription and Translation of Video Lectures) is an EU STREP project in which advanced automatic speech recognition and machine translation techniques are being tested on large video lecture repositories. The project began in November 2011 and will run for three years. This paper will outline the project's main motivation and objectives, and give a brief description of the two main repositories being considered: VideoLectures.NET and poliMedia. The first results obtained by the UPV group for the poliMedia repository will also be provided.

    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. Funding was also provided by the Spanish Government (iTrans2 project, TIN2009-14511; FPI scholarship BES-2010-033005; FPU scholarship AP2010-4349).

    Silvestre Cerdà, JA.; Del Agua Teba, MA.; Garcés Díaz-Munío, GV.; Gascó Mora, G.; Giménez Pastor, A.; Martínez-Villaronga, AA.; Pérez González De Martos, AM.... (2012). TransLectures. IberSPEECH 2012. 345-351. http://hdl.handle.net/10251/37290

    Jardins per a la salut

    Facultat de Farmàcia, Universitat de Barcelona. Degree: Pharmacy. Course: Botànica Farmacèutica (Pharmaceutical Botany). Academic year: 2014-2015. Coordinators: Joan Simon, Cèsar Blanché and Maria Bosch.

    The materials presented here are the collection of botanical records for 128 species found in the Jardí Ferran Soldevila of the Historic Building of the UB. The records were produced individually by students from groups M-3 and T-1 of the Pharmaceutical Botany course between February and May of the 2014-15 academic year, as the final outcome of the teaching innovation project «Jardins per a la salut: aprenentatge servei a Botànica farmacèutica» (code 2014PID-UB/054). All the work was carried out on the GoogleDocs platform and was supervised by the course lecturers. The main objective of the activity was to foster autonomous and collaborative learning in pharmaceutical botany. A further aim was to motivate students by giving part of their effort back to society through a service-learning experience, making their work available on a public website that can also be consulted in situ in the garden itself via QR codes with a smartphone.