14 research outputs found

    Comparison of ALBAYZIN query-by-example spoken term detection 2012 and 2014 evaluations

    Query-by-example spoken term detection (QbE STD) aims at retrieving data from a speech repository given an acoustic query containing the term of interest as input. It is currently receiving much interest due to the large volume of multimedia information. This paper presents the systems submitted to the ALBAYZIN QbE STD 2014 evaluation, held as part of the ALBAYZIN 2014 Evaluation campaign within the context of the IberSPEECH 2014 conference. This is the second QbE STD evaluation in Spanish, which allows us to evaluate the progress in this technology for this language. The evaluation consists of retrieving the speech files that contain the input queries, indicating the start and end times where the input queries were found, along with a score value that reflects the confidence given to the detection of the query. Evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from workshops, which amount to about 7 h of speech. We present the database, the evaluation metric, the systems submitted to the evaluation, and the results, and compare this second evaluation with the first ALBAYZIN QbE STD evaluation held in 2012. Four different research groups took part in the evaluations held in 2012 and 2014. In 2014, new multi-word and foreign queries were added to the single-word and in-language queries used in 2012. Systems submitted to the second evaluation are hybrid systems which integrate letter transcription- and template matching-based systems. Despite the significant improvement obtained by the systems submitted to this second evaluation compared to those of the first evaluation, results still show the difficulty of this task and indicate that there is still room for improvement.
    This research was funded by the Spanish Government ('SpeechTech4All Project' TEC2012 38939 C03 01 and 'CMC-V2 Project' TEC2012 37585 C02 01), the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and 'AtlantTIC Project' CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under project TACTICA
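The system output described in this abstract (speech file, start and end times, confidence score) can be pictured as a simple record. The sketch below is illustrative only: the field names, the example values, and the thresholding helper are assumptions, not the evaluation's actual submission format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One hypothesised occurrence of a query (field names are assumptions)."""
    file_id: str    # speech file in which the query was found
    query_id: str   # identifier of the spoken query
    start_s: float  # start time of the match, in seconds
    end_s: float    # end time of the match, in seconds
    score: float    # confidence; higher means more confident


def keep_confident(detections, threshold):
    """Return detections scoring at or above the threshold, best first."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Made-up detections, purely for illustration.
detections = [
    Detection("talk_01.wav", "q1", 12.30, 12.95, 0.91),
    Detection("talk_02.wav", "q1", 3.10, 3.80, 0.42),
    Detection("talk_01.wav", "q2", 45.00, 45.70, 0.77),
]
confident = keep_confident(detections, threshold=0.5)
```

Here `confident` holds the two detections whose score is at least 0.5, ordered from most to least confident.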

    Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation

    The huge amount of information stored in audio and video repositories makes search on speech (SoS) a priority area nowadays. Within SoS, Query-by-Example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given a spoken query. Research in this area is continuously fostered by the organization of QbE STD evaluations. This paper presents a multi-domain, internationally open evaluation for QbE STD in Spanish. The evaluation aims at retrieving the speech files that contain the queries, providing their start and end times, and a score that reflects the confidence given to the detection. Three different Spanish speech databases that encompass different domains have been employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast television (TV) shows; and the COREMAH database, which contains two-person spontaneous speech conversations about different topics. The evaluation has been designed carefully so that several analyses of the main results can be carried out. We present the evaluation itself, the three databases, the evaluation metrics, the systems submitted to the evaluation, the results, and the detailed post-evaluation analyses based on some query properties (within-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). Fusion results of the primary systems submitted to the evaluation are also presented. Three different teams took part in the evaluation, and ten different systems were submitted. The results suggest that the QbE STD task is still in progress, and the performance of these systems is highly sensitive to changes in the data domain. Nevertheless, QbE STD strategies are able to outperform text-based STD in unseen data domains.
    Centro singular de investigación de Galicia; ED431G/04. Universidad del País Vasco; GIU16/68. Ministerio de Economía y Competitividad; TEC2015-68172-C2-1-P. Ministerio de Ciencia, Innovación y Competitividad; RTI2018-098091-B-I00. Xunta de Galicia; ED431G/0

    ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

    Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given an acoustic (spoken) query containing the term of interest as the input. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation held as a part of the ALBAYZIN 2016 Evaluation Campaign at the IberSPEECH 2016 conference. Special attention was given to the evaluation design so that a thorough post-analysis of the main results could be carried out. Two different Spanish speech databases, which cover different acoustic and language domains, were used in the evaluation: the MAVIR database, which consists of a set of talks from workshops, and the EPIC database, which consists of a set of European Parliament sessions in Spanish. We present the evaluation design, both databases, the evaluation metric, the systems submitted to the evaluation, the results, and a thorough analysis and discussion. Four different research groups participated in the evaluation, and a total of eight template matching-based systems were submitted. We compare the systems submitted to the evaluation and make an in-depth analysis based on some properties of the spoken queries, such as query length, single-word/multi-word queries, and in-language/out-of-language queries.
    This work was partially supported by Fundacao para a Ciencia e Tecnologia (FCT) under the projects UID/EEA/50008/2013 (pluriannual funding in the scope of the LETSREAD project) and UID/CEC/50021/2013, and Grant SFRH/BD/97187/2013. Jorge Proenca is supported by the SFRH/BD/97204/2013 FCT Grant.
    This work was also supported by the Galician Government ('Centro singular de investigacion de Galicia' accreditation 2016-2019 ED431G/01 and the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014)), the European Regional Development Fund (ERDF), the projects "DSSL: Redes Profundas y Modelos de Subespacios para Deteccion y Seguimiento de Locutor, Idioma y Enfermedades Degenerativas a partir de la Voz" (TEC2015-68172-C2-1-P) and the TIN2015-64282-R funded by Ministerio de Economia y Competitividad in Spain, the Spanish Government through the project "TraceThem" (TEC2015-65345-P), and AtlantTIC ED431G/04.
    Tejedor, J.; Toledano, DT.; Lopez-Otero, P.; Docio-Fernandez, L.; Proença, J.; Perdigão, F.; García-Granada, F.... (2018). ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation. EURASIP Journal on Audio, Speech and Music Processing. 1-25. https://doi.org/10.1186/s13636-018-0125-9

    Query-by-Example Spoken Term Detection

    This thesis investigates query-by-example (QbE) spoken term detection (STD). Queries are entered in their spoken form and searched for in a pool of recorded spoken utterances, providing a list of detections with their scores and timing. We describe, analyze and compare three different approaches to QbE STD, in various language-dependent and language-independent setups with diverse audio conditions, using one example and five examples per query. For our experiments we used Czech, Hungarian, English and Levantine Arabic data, and for each of the languages we trained a 3-state phone posterior estimator. This gave us 16 possible combinations of the evaluation language and the language of the posterior estimator, out of which 4 combinations were language-dependent and 12 were language-independent. All QbE systems were evaluated on the same data and the same phone posterior features, using two metrics: non-pooled Figure-of-Merit and our proposed utterance-normalized non-pooled Figure-of-Merit, which provided us with relevant data for the comparison of these QbE approaches and for gaining a better insight into their behavior. The QbE approaches presented in this work are: sequential statistical modeling (GMM/HMM), template matching of features (DTW) and matching of phone lattices (WFST). To compare the performance of QbE approaches with the common query-by-text STD systems, for language-dependent setups we also evaluated an acoustic keyword spotting system (AKWS) and a system searching for phone strings in lattices (WFSTlat). The core of this thesis is the development, analysis and improvement of the WFST QbE STD system, which, after the improvements, achieved similar performance to the DTW system in language-dependent setups.
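The count of 16 setups (4 language-dependent, 12 language-independent) follows from pairing each of the four evaluation languages with each of the four posterior estimators. A small sketch of that enumeration, where the variable names are illustrative and not taken from the thesis:

```python
from itertools import product

languages = ["Czech", "Hungarian", "English", "Levantine"]

# Each setup pairs an evaluation language with the language
# on which the phone posterior estimator was trained.
setups = list(product(languages, repeat=2))

# Language-dependent: estimator trained on the evaluation language.
dependent = [(ev, est) for ev, est in setups if ev == est]
# Language-independent: estimator trained on a different language.
independent = [(ev, est) for ev, est in setups if ev != est]

print(len(setups), len(dependent), len(independent))  # 16 4 12
```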

    Representation Learning for Spoken term Detection

    Spoken Term Detection (STD) is the task of searching for a given spoken query word in a large speech database. Applications of STD include speech data indexing, voice dialling, telephone monitoring and data mining. The performance of STD depends mainly on the representation of the speech signal and on the matching of the represented signals. This work investigates methods for a robust representation of the speech signal, invariant to speaker variability, in the context of the STD task. Here the representation takes the form of templates, that is, sequences of feature vectors. The typical representation in the speech community, Mel-Frequency Cepstral Coefficients (MFCC), carries both speech-specific and speaker-specific information, hence the need for a better representation. Searching is done by matching the sequences of feature vectors of the query and reference utterances using Subsequence Dynamic Time Warping (DTW). The performance of the proposed representation is evaluated on Telugu broadcast news data. In the absence of labelled data, i.e., in an unsupervised setting, we propose to capture the joint density of the acoustic space spanned by MFCCs using Gaussian Mixture Models (GMM) and Gaussian-Bernoulli Restricted Boltzmann Machines (GBRBM). Posterior features extracted from the trained models are used to search for the query word. We observe 8% and 12% improvements in STD performance over MFCC when using GMM and GBRBM posterior features, respectively. As transcribed data is not required, this approach is an optimal solution for low-resource languages. But due to its intermediate performance, this method cannot be an immediate solution for high resource language
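The matching step named in this abstract, subsequence DTW, can be sketched in a few lines. This is a generic textbook formulation under stated assumptions (Euclidean frame distance, unit step sizes, no path-length normalisation), not the thesis's implementation; the function names and toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_dtw(query, reference, dist=euclidean):
    """Find the reference span that best matches the whole query.

    Returns (cost, start, end); start and end are inclusive 0-based
    frame indices into the reference.
    """
    n, m = len(query), len(reference)
    inf = float("inf")
    # acc[i][j]: best cost of aligning query[:i] with a span of the
    # reference that ends at frame j-1.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0] = [0.0] * (m + 1)  # the match may start at any reference frame
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = dist(query[i - 1], reference[j - 1]) + step
    # The match may also end at any reference frame.
    end = min(range(1, m + 1), key=lambda j: acc[n][j])
    # Backtrack to the first query frame to recover the start frame.
    i, j = n, end
    while i > 1:
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    return acc[n][end], j - 1, end - 1


# Toy one-dimensional "features": the query occurs at reference frames 2..4.
query = [[0.0], [1.0], [2.0]]
reference = [[5.0], [5.0], [0.0], [1.0], [2.0], [5.0]]
match = subsequence_dtw(query, reference)
print(match)  # (0.0, 2, 4)
```

Unlike ordinary DTW, the first row of the accumulated-cost matrix is zero and the end point is chosen freely, so the query can align with any span inside a longer utterance.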

    End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

    Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which causes them to have a complex indexing and search pipeline. This has led to interest in ASR-free approaches to simplify the search procedure. We recently proposed a neural ASR-free keyword search model which achieves competitive performance while maintaining an efficient and simplified pipeline, where queries and documents are encoded with a pair of recurrent neural network encoders and the encodings are combined with a dot-product. In this article, we extend this work with multilingual pretraining and detailed analysis of the model. Our experiments show that the proposed multilingual training significantly improves the model performance and that despite not matching a strong ASR-based conventional keyword search system for short queries and queries comprising in-vocabulary words, the proposed model outperforms the ASR-based system for long queries and queries that do not appear in the training data.
    Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 202
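The dot-product combination mentioned in this abstract can be illustrated without the encoders themselves. In the sketch below, the hand-written vectors stand in for the outputs of the query and document encoders; their dimensionality, the values, and the choice of taking the highest-scoring frame are all assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def frame_scores(query_vec, doc_frames):
    """Score each document frame encoding against the query encoding."""
    return [dot(query_vec, frame) for frame in doc_frames]


# Stand-ins for encoder outputs (hypothetical values).
query_vec = [1.0, 0.0, -1.0]
doc_frames = [
    [0.1, 0.2, 0.3],
    [0.9, 0.1, -0.8],
    [-0.5, 0.4, 0.5],
]

scores = frame_scores(query_vec, doc_frames)
# Index of the frame whose encoding agrees most with the query.
best_frame = max(range(len(scores)), key=scores.__getitem__)
```

Because scoring reduces to dot products, document encodings can be computed once and reused for every query, which is one reason such a pipeline is simpler than indexing ASR lattices.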

    Low Resource Efficient Speech Retrieval

    Speech retrieval refers to the task of retrieving the information, which is useful or relevant to a user query, from speech collection. This thesis aims to examine ways in which speech retrieval can be improved in terms of requiring low resources - without extensively annotated corpora on which automated processing systems are typically built - and achieving high computational efficiency. This work is focused on two speech retrieval technologies, spoken keyword retrieval and spoken document classification. Firstly, keyword retrieval - also referred to as keyword search (KWS) or spoken term detection - is defined as the task of retrieving the occurrences of a keyword specified by the user in text form, from speech collections. We make advances in an open vocabulary KWS platform using context-dependent Point Process Model (PPM). We further accomplish a PPM-based lattice generation framework, which improves KWS performance and enables automatic speech recognition (ASR) decoding. Secondly, the massive volumes of speech data motivate the effort to organize and search speech collections through spoken document classification. In classifying real-world unstructured speech into predefined classes, the wildly collected speech recordings can be extremely long, of varying length, and contain multiple class label shifts at variable locations in the audio. For this reason each spoken document is often first split into sequential segments, and then each segment is independently classified. We present a general purpose method for classifying spoken segments, using a cascade of language independent acoustic modeling, foreign-language to English translation lexicons, and English-language classification. Next, instead of classifying each segment independently, we demonstrate that exploring the contextual dependencies across sequential segments can provide large classification performance improvements. 
Lastly, we remove the need of any orthographic lexicon and instead exploit alternative unsupervised approaches to decoding speech in terms of automatically discovered word-like or phoneme-like units. We show that the spoken segment representations based on such lexical or phonetic discovery can achieve competitive classification performance as compared to those based on a domain-mismatched ASR or a universal phone set ASR

    Unsupervised spoken keyword spotting and learning of acoustically meaningful units

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 103-106). The problem of keyword spotting in audio data has been explored for many years. Typically researchers use supervised methods to train statistical models to detect keyword instances. However, such supervised methods require large quantities of annotated data that is unlikely to be available for the majority of languages in the world. This thesis addresses this lack-of-annotation problem and presents two completely unsupervised spoken keyword spotting systems that do not require any transcribed data. In the first system, a Gaussian Mixture Model is trained to label speech frames with a Gaussian posteriorgram, without any transcription information. Given several spoken samples of a keyword, a segmental dynamic time warping is used to compare the Gaussian posteriorgrams between keyword samples and test utterances. The keyword detection result is then obtained by ranking the distortion scores of all the test utterances. In the second system, to avoid the need for spoken samples, a Joint-Multigram model is used to build a mapping from the keyword text samples to the Gaussian component indices. A keyword instance in the test data can be detected by calculating the similarity score of the Gaussian component index sequences between keyword samples and test utterances. The proposed two systems are evaluated on the TIMIT and MIT Lecture corpus. The result demonstrates the viability and effectiveness of the two systems. Furthermore, encouraged by the success of using unsupervised methods to perform keyword spotting, we present some preliminary investigation on the unsupervised detection of acoustically meaningful units in speech. by Yaodong Zhang. S.M
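
A Gaussian posteriorgram, as used in the first system above, assigns each speech frame a vector of posterior probabilities over the mixture components. The sketch below is illustrative only: it assumes one-dimensional frames and a GMM whose weights, means and variances are fixed by hand rather than trained, and the function name `gaussian_posteriorgram` is hypothetical.

```python
import math


def gaussian_posteriorgram(frames, weights, means, variances):
    """Return P(component k | frame) for every frame.

    Assumes scalar frames and a GMM with the given (hand-set, not
    trained) parameters. Uses the log-sum-exp trick for stability.
    """
    posteriorgram = []
    for x in frames:
        # Log of weight * Gaussian density for each component.
        log_joint = [
            math.log(w) - 0.5 * (math.log(2.0 * math.pi * v) + (x - mu) ** 2 / v)
            for w, mu, v in zip(weights, means, variances)
        ]
        peak = max(log_joint)
        unnormalised = [math.exp(value - peak) for value in log_joint]
        total = sum(unnormalised)
        posteriorgram.append([u / total for u in unnormalised])
    return posteriorgram


if __name__ == "__main__":
    rows = gaussian_posteriorgram(
        frames=[0.0, 2.0, 4.0],
        weights=[0.5, 0.5],
        means=[0.0, 4.0],
        variances=[1.0, 1.0],
    )
    for row in rows:
        print([round(p, 4) for p in row])
```

Each row sums to one, so sequences of such rows can be compared frame by frame with a dynamic time warping procedure in place of raw spectral features.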

    Application of automatic speech recognition technologies to singing

    The research field of Music Information Retrieval is concerned with the automatic analysis of musical characteristics. One aspect that has not received much attention so far is the automatic analysis of sung lyrics. On the other hand, the field of Automatic Speech Recognition has produced many methods for the automatic analysis of speech, but those have rarely been employed for singing. This thesis analyzes the feasibility of applying various speech recognition methods to singing, and suggests adaptations. In addition, the routes to practical applications for these systems are described. Five tasks are considered: Phoneme recognition, language identification, keyword spotting, lyrics-to-audio alignment, and retrieval of lyrics from sung queries. The main bottleneck in almost all of these tasks lies in the recognition of phonemes from sung audio. Conventional models trained on speech do not perform well when applied to singing. Training models on singing is difficult due to a lack of annotated data. This thesis offers two approaches for generating such data sets. For the first one, speech recordings are made more “song-like”. In the second approach, textual lyrics are automatically aligned to an existing singing data set. In both cases, these new data sets are then used for training new acoustic models, offering considerable improvements over models trained on speech. Building on these improved acoustic models, speech recognition algorithms for the individual tasks were adapted to singing by either improving their robustness to the differing characteristics of singing, or by exploiting the specific features of singing performances. Examples of improving robustness include the use of keyword-filler HMMs for keyword spotting, an i-vector approach for language identification, and a method for alignment and lyrics retrieval that allows highly varying durations. 
Features of singing are utilized in various ways: In an approach for language identification that is well-suited for long recordings; in a method for keyword spotting based on phoneme durations in singing; and in an algorithm for alignment and retrieval that exploits known phoneme confusions in singing.