2,555 research outputs found

    Query-by-Example Spoken Term Detection ALBAYZIN 2012 evaluation: overview, systems, results and discussion

    Get PDF
    The final publication is available at Springer via http://dx.doi.org/10.1186/1687-4722-2013-23Query-by-Example Spoken Term Detection (QbE STD) aims at retrieving data from a speech data repository given an acoustic query containing the term of interest as input. Nowadays, it has been receiving much interest due to the high volume of information stored in audio or audiovisual format. QbE STD differs from automatic speech recognition (ASR) and keyword spotting (KWS)/spoken term detection (STD) since ASR is interested in all the terms/words that appear in the speech signal and KWS/STD relies on a textual transcription of the search term to retrieve the speech data. This paper presents the systems submitted to the ALBAYZIN 2012 QbE STD evaluation held as a part of ALBAYZIN 2012 evaluation campaign within the context of the IberSPEECH 2012 Conferencea. The evaluation consists of retrieving the speech files that contain the input queries, indicating their start and end timestamps within the appropriate speech file. Evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from MAVIR workshopsb, which amount at about 7 h of speech in total. We present the database metric systems submitted along with all results and some discussion. Four different research groups took part in the evaluation. Evaluation results show the difficulty of this task and the limited performance indicates there is still a lot of room for improvement. The best result is achieved by a dynamic time warping-based search over Gaussian posteriorgrams/posterior phoneme probabilities. This paper also compares the systems aiming at establishing the best technique dealing with that difficult task and looking for defining promising directions for this relatively novel task.Tejedor, J.; Toledano, DT.; Anguera, X.; Varona, A.; Hurtado Oliver, LF.; Miguel, A.; ColĂĄs, J. (2013). Query-by-Example Spoken Term Detection ALBAYZIN 2012 evaluation: overview, systems, results and discussion. EURASIP Journal on Audio, Speech, and Music Processing. (23):1-17. doi:10.1186/1687-4722-2013-23S11723Zhang T, Kuo CCJ: Hierarchical classification of audio data for archiving and retrieving. In Proceedings of ICASSP. Phoenix; 15–19 March 1999:3001-3004.HelĂ©n M, Virtanen T: Query by example of audio signals using Euclidean distance between Gaussian Mixture Models. In Proceedings of ICASSP. Honolulu; 15–20 April 2007:225-228.HelĂ©n M, Virtanen T: Audio query by example using similarity measures between probability density functions of features. EURASIP J. Audio Speech Music Process 2010, 2010: 2:1-2:12.Tzanetakis G, Ermolinskyi A, Cook P: Pitch histograms in audio and symbolic music information retrieval. In Proceedings of the Third International Conference on Music Information Retrieval: ISMIR. Paris; 2002:31-38.Tsai HM, Wang WH: A query-by-example framework to retrieve music documents by singer. In Proceedings of the IEEE International Conference on Multimedia and Expo. Taipei; 27–30 June 2004:1863-1866.Chia TK, Sim KC, Li H, Ng HT: A lattice-based approach to query-by-example spoken document retrieval. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Singapore; 20–24 July 2008:363-370.Tejedor J, FapĆĄo M, Szöke I, ČernockĂœ H, GrĂ©zl F: Comparison of methods for language-dependent and language-independent query-by-example spoken term detection. ACM Trans. Inf. Syst 2012, 30(3):18:1-18:34.Muscariello A, Gravier G, Bimbot F: Zero-resource audio-only spoken term detection based on a combination of template matching techniques. In Proceedings of Interspeech. Florence; 27–31 August 2011:921-924.Lin H, Stupakov A, Bilmes J: Spoken keyword spotting via multi-lattice alignment. In 9th International Speech Communication Association Annual Conference. Brisbane, Australia; September 2008:2191-2194.Parada C, Sethy A, Ramabhadran B: Query-by-Example Spoken Term Detection for OOV terms. In Proceedings of ASRU. Merano; 13-17 December 2009:404-409.Shen W, White TJ, Hazen CM: A comparison of Query-by-Example methods for Spoken Term Detection. In Proceedings of Interspeech. Brighton; September 2009:2143-2146.Lin H, Stupakov A, Bilmes J: Improving multi-lattice alignment based spoken keyword spotting. In Proceedings of ICASSP. Taipei; 19–24 April 2009:4877-4880.Barnard E, Davel M, van Heerden C, Kleynhans N, Bali K: Phone recognition for spoken web search. In Proceedings of MediaEval. Pisa; 1–2 September 2011:5-6.Buzo A, Cucu H, Safta M, Ionescu B, Burileanu C: ARF@MediaEval 2012: a Romanian ASR-based approach to spoken term detection. In Proceedings of MediaEval. Pisa; 4–5 October 2012:7-8.Abad A, Astudillo RF: The L2F spoken web search system for MediaEval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:9-10.Varona A, Penagarikano M, RodrĂ­guez-Fuentes L, Bordel L, Diez M: GTTS system for the spoken web search task at MediaEval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:13-14.Szöke I, Faps̆o M, VeselĂœ K: BUT2012 Approaches for spoken web search - MediaEval 2012. In Proceedings of MediaEval. Pisa;4–5October 2012:15-16.Hazen W, Shen TJ, White CM: Query-by-Example spoken term detection using phonetic posteriorgram templates. In Proceedings of ASRU. Merano; 13–17 December 2009:421-426.Zhang Y, Glass JR: Unsupervised spoken keyword spotting via segmental DTW on Gaussian Posteriorgrams. In Proceedings of ASRU. Merano; 13–17 December 2009:398-403.Chan C, Lee L: Unsupervised spoken-term detection with spoken queries using segment-based dynamic time warping. In Proceedings of Interspeech. Makuhari; 26–30 September 2010:693-696.Anguera X, Macrae R, Oliver N: Partial sequence matching using an unbounded dynamic time warping algorithm. In Proceedings of ICASSP. Dallas; 14–19 March 2010:3582-3585.Anguera X: Telefonica system for the spoken web search Task at Mediaeval 2011. In Proceedings of MediaEval. Pisa; 1–2 September 2011:3-4.Muscariello A, Gravier G: Irisa MediaEval 2011 spoken web search system. In Proceedings of MediaEval. Pisa; 1–2 September 2011:9-10.Szöke I, Tejedor J, Faps̆o M, ColĂĄs J: BUT-HCTLab approaches for spoken web search - MediaEval 2011. In Proceedings of MediaEval. Pisa; 1–2 September 2011:11-12.Wang H, Lee T: CUHK System for the spoken web search task at Mediaeval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:3-4.Joder C, Weninger F, Wöllmer M, Schuller M: The TUM cumulative DTW approach for the Mediaeval 2012 spoken web search task. In Proceedings of MediaEval. Pisa; 4–5 October 2012:5-6.Vavrek J, Pleva M, JuhĂĄr J: TUKE MediaEval 2012: spoken web search using DTW and unsupervised SVM. In Proceedings of MediaEval. Pisa; 4–5 October 2012:11-12.Jansen A, Durme P, Clark BV: The JHU-HLTCOE spoken web search system for MediaEval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:17-18.Anguera X: Telefonica Research System for the spoken web search task at Mediaeval 2012. In Proceedings of MediaEval. Pisa; 4–5 October 2012:19-20.NIST: The Ninth Text REtrieval Conference (TREC 9). 2000. http://trec.nist.gov . Accessed 16 September 2013NIST: The Spoken Term Detection (STD) 2006 Evaluation Plan. 10 (National Institute of Standards and Technology (NIST), Gaithersburg, 2006). . Accessed 16 September 2013 http://www.nist.gov/speech/tests/stdSakai T, Joho H: Overview of NTCIR-9. Proceedings of NTCIR-9 Workshop 2011, 1-7.Rajput N, Metze F: Spoken web search. In Proceedings of MediaEval. Pisa; 1–2 September 2011:1-2.Metze F, Barnard E, Davel M, van Heerden C, Anguera X, Gravier G, Rajput N: Spoken web search. In Proceedings of MediaEval. Pisa; 4–5 October 2012:1-2.Tokyo University of Technology: Evaluation of information access technologies: information retrieval, question answering and cross-lingual information access. 2013. http://research.nii.ac.jp/ntcir/ntcir-10/ . Accessed 16 September 2013NIST: The OpenKWS13 evaluation plan. 1, (National Institute of Standards and Technology (NIST), Gaithersburg, 2013). . Accessed 16 September 2013 http://www.nist.gov/itl/iad/mig/openkws13.cfmTaras B, Nadeu C: Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion. EURASIP J. Audio Speech Music Process 2011, 1: 1-10.ZelenĂĄk M, Schulz H, Hernando J: Speaker diarization of broadcast news in Albayzin 2010 evaluation campaign. EURASIP J. Audio Speech Music Process 2012, 19: 1-9.RodrĂ­guez-Fuentes LJ, Penagarikano M, Varona A, DĂ­ez M, Bordel G: The Albayzin 2010 language recognition evaluation. In Proceedings of Interspeech. Florence; 27–31 August 2011:1529-1532.MĂ©ndez F, DocĂ­o L, Arza M, Campillo F: The Albayzin 2010 text-to-speech evaluation. In Proceedings of FALA. Vigo; November 2010:317-340.Fiscus JG, Ajot J, Garofolo JS, Doddington G: Results of the 2006 spoken term detection evaluation. In Proceedings of SIGIR Workshop Searching Spontaneous Conversational Speech. Rhodes; 22–25 September 2007:45-50.Martin A, Doddington G, Kamm T, Ordowski M, Przybocki M: The DET curve in assessment of detection task performance. In Proceedings of Eurospeech. Rhodes; 22-25 September 1997:1895-1898.NIST: NIST Speech Tools and APIs: 2006 (National Institute of Standards and Technology (NIST), Gaithersburg, 1996). . Accessed 16 September 2013 http://www.nist.gov/speech/tools/index.htmIberspeech 2012: VII Jornadas en TecnologĂ­a del Habla and III Iberian SLTech Workshop. . Accessed 16 September 2013 http://iberspeech2012.ii.uam.es/IberSPEECH2012_OnlineProceedings.pdfAnguera X: Speaker independent discriminant feature extraction for acoustic pattern-matching. In Proceedings of ICASSP. Kyoto; 25–30 March 2012:485-488.Anguera X, Ferrarons M: Memory efficient subsequence DTW for Query-by-Example spoken term detection. Proceedings of ICME 2013. http://www.xavieranguera.com/papers/sdtw_icme2013.pdfAnguera X: Telefonica Research System for the Query-by-example task at Albayzin 2012. In Proceedings of IberSPEECH. Madrid, Spain; 21–23 November 2012:626-632.Schwarz P: Phoneme recognition based on long temporal context. PhD Thesis, FIT, BUT, Brno, Czech Republic. 2008.Stolckem A: SRILM - an extensible language modeling toolkit. In Proceedings of Interspeech. Denver; 2002:901-904.Wang D, King S, Frankel J: Stochastic pronunciation modelling for out-of-vocabulary spoken term detection. IEEE Trans. Audio Speech Language Process 2011, 19(4):688-698.Wang D, Tejedor J, King S, Frankel J: Term-dependent confidence normalization for out-of-vocabulary spoken term detection. J. Comput. Sci. Technol 2012, 27(2):358-375. 10.1007/s11390-012-1228-xWang D, King S, Frankel J, Vipperla R, Evans N, Troncy R: Direct posterior confidence for out-of-vocabulary spoken term detection. ACM Trans. Inf. Syst 2012, 30(3):1-34.Varona A, Penagarikano M, RodrĂ­guez-Fuentes LJ, Bordel G, Diez M: GTTS systems for the query-by-example spoken term detection task of the Albayzin 2012 search on speech evaluation. In Proceedings of IberSPEECH. Madrid, Spain; 21–23 November 2012:619-625.GĂłmez J, Sanchis E, Castro-Bleda M: Automatic speech segmentation based on acoustical clustering. Proceedings of the Joint IAPR International Conference on Structural, Syntactic, and Statistical Pattern Recognition 2010, 540-548.GĂłmez J, Castro M: Automatic segmentation of speech at the phonetic level. Proceedings of the joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition 2002, 672-680.Sanchis E, Hurtado LF, GĂłmez JA, Calvo M, Fabra R: The ELiRF Query-by-example STD systems for the Albayzin 2012 search on speech evaluation. In Proceedings of IberSPEECH. Madrid, Spain; 21–23 November 2012:611-618.Park A, Glass J: Towards unsupervised pattern discovery in speech. In Proceedings of ASRU. Cancun; 27 November to 1 December 2005:53-58.Young S, Evermann G, Gales M, Hain T, Kershaw D, Liu X, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P: The HTK Book. Engineering Department, Cambridge University; 2006.Miguel A, Villalba J, Ortega A, Lleida E: Albayzin 2012 search on speech @ ViVoLab UZ. In Proceedings of IberSPEECH. Madrid, Spain; 21–23 November 2012:633-642.Boersma P, Weenink D: Praat: Doing Phonetics by Computer. University of Amsterdam, Spuistraat, 210, Amsterdam, Holland. 2007. http://www.fon.hum.uva.nl/praat/ . Accessed 16 September 2013Goldwater S, Jurafsky D, Maning CD: Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates. Speech Commun 2009, 52(3):181-200.Mertens T, Wallace R, Schneider D: Cross-site combination and evaluation of subword spoken term detection systems. In Proceedings of CBMI. Madrid; 13–15 June 2011:61-66

    ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

    Full text link
    [EN] Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given an acoustic (spoken) query containing the term of interest as the input. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation held as a part of the ALBAYZIN 2016 Evaluation Campaign at the IberSPEECH 2016 conference. Special attention was given to the evaluation design so that a thorough post-analysis of the main results could be carried out. Two different Spanish speech databases, which cover different acoustic and language domains, were used in the evaluation: the MAVIR database, which consists of a set of talks from workshops, and the EPIC database, which consists of a set of European Parliament sessions in Spanish. We present the evaluation design, both databases, the evaluation metric, the systems submitted to the evaluation, the results, and a thorough analysis and discussion. Four different research groups participated in the evaluation, and a total of eight template matching-based systems were submitted. We compare the systems submitted to the evaluation and make an in-depth analysis based on some properties of the spoken queries, such as query length, single-word/multi-word queries, and in-language/out-of-language queries.This work was partially supported by Fundacao para a Ciencia e Tecnologia (FCT) under the projects UID/EEA/50008/2013 (pluriannual funding in the scope of the LETSREAD project) and UID/CEC/50021/2013, and Grant SFRH/BD/97187/2013. Jorge Proenca is supported by the SFRH/BD/97204/2013 FCT Grant. This work was also supported by the Galician Government ('Centro singular de investigacion de Galicia' accreditation 2016-2019 ED431G/01 and the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014)), the European Regional Development Fund (ERDF), the projects "DSSL: Redes Profundas y Modelos de Subespacios para Deteccion y Seguimiento de Locutor, Idioma y Enfermedades Degenerativas a partir de la Voz" (TEC2015-68172-C2-1-P) and the TIN2015-64282-R funded by Ministerio de Economia y Competitividad in Spain, the Spanish Government through the project "TraceThem" (TEC2015-65345-P), and AtlantTIC ED431G/04.Tejedor, J.; Toledano, DT.; Lopez-Otero, P.; Docio-Fernandez, L.; Proença, J.; PerdigĂŁo, F.; GarcĂ­a-Granada, F.... (2018). ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation. EURASIP Journal on Audio, Speech and Music Processing. 1-25. https://doi.org/10.1186/s13636-018-0125-9S125Jarina, R, Kuba, M, Gubka, R, Chmulik, M, Paralic, M (2013). UNIZA system for the spoken web search task at MediaEval 2013. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 791–792).Ali, A, & Clements, MA (2013). Spoken web search using and ergodic hidden Markov model of speech. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 861–862).Buzo, A, Cucu, H, Burileanu, C (2014). SpeeD@MediaEval 2014: Spoken term detection with robust multilingual phone recognition. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 721–722).Caranica, A, Buzo, A, Cucu, H, Burileanu, C (2015). SpeeD@MediaEval 2015: Multilingual phone recognition approach to Query By Example STD. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 781–783).Kesiraju, S, Mantena, G, Prahallad, K (2014). IIIT-H system for MediaEval 2014 QUESST. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 761–762).Ma, M, & Rosenberg, A (2015). CUNY systems for the Query-by-Example search on speech task at MediaEval 2015. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 831–833).Takahashi, J, Hashimoto, T, Konno, R, Sugawara, S, Ouchi, K, Oshima, S, Akyu, T, Itoh, Y (2014). An IWAPU STD system for OOV query terms and spoken queries. In Proc. of NTCIR-11. National Institute of Informatics, Tokyo, (pp. 384–389).Makino, M, & Kai, A (2014). Combining subword and state-level dissimilarity measures for improved spoken term detection in NTCIR-11 SpokenQuery & Doc task. In Proc. of NTCIR-11. National Institute of Informatics, Tokyo, (pp. 413–418).Konno, R, Ouchi, K, Obara, M, Shimizu, Y, Chiba, T, Hirota, T, Itoh, Y (2016). An STD system using multiple STD results and multiple rescoring method for NTCIR-12 SpokenQuery & Doc task. In Proc. of NTCIR-12. National Institute of Informatics, Tokyo, (pp. 200–204).Sakamoto, N, Yamamoto, K, Nakagawa, S (2015). Combination of syllable based N-gram search and word search for spoken term detection through spoken queries and IV/OOV classification. In Proc. of ASRU. IEEE, New York, (pp. 200–206).Hou, J, Pham, VT, Leung, C-C, Wang, L, 2, HX, Lv, H, Xie, L, Fu, Z, Ni, C, Xiao, X, Chen, H, Zhang, S, Sun, S, Yuan, Y, Li, P, Nwe, TL, Sivadas, S, Ma, B, Chng, ES, Li, H (2015). The NNI Query-by-Example system for MediaEval 2015. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 141–143).Vavrek, J, Viszlay, P, Lojka, M, Pleva, M, Juhar, J, Rusko, M (2015). TUKE at MediaEval 2015 QUESST. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 451–453).Mantena, G, Achanta, S, Prahallad, K (2014). Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping. IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(5), 946–955.Anguera, X, & Ferrarons, M (2013). Memory efficient subsequence DTW for query-by-example spoken term detection. In Proc. of ICME. IEEE, New York, (pp. 1–6).Tulsiani, H, & Rao, P (2015). The IIT-B Query-by-Example system for MediaEval 2015. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 341–343).Bouallegue, M, Senay, G, Morchid, M, Matrouf, D, Linares, G, Dufour, R (2013). LIA@MediaEval 2013 spoken web search task: An I-Vector based approach. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 771–772).Rodriguez-Fuentes, LJ, Varona, A, Penagarikano, M, Bordel, G, Diez, M (2013). GTTS systems for the SWS task at MediaEval 2013. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 831–832).Wang, H, Lee, T, Leung, C-C, Ma, B, Li, H (2013). Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection. In Proc. of ICASSP. IEEE, New York, (pp. 8545–8549).Wang, H, & Lee, T (2013). The CUHK spoken web search system for MediaEval 2013. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 681–682).Proenca, J, Veiga, A, PerdigĂŁo, F (2014). The SPL-IT query by example search on speech system for MediaEval 2014. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 741–742).Proenca, J, Veiga, A, Perdigao, F (2015). Query by example search with segmented dynamic time warping for non-exact spoken queries. In Proc. of EUSIPCO. Springer, Berlin, (pp. 1691–1695).Proenca, J, Castela, L, Perdigao, F (2015). The SPL-IT-UC Query by Example search on speech system for MediaEval 2015. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 471–473).Proenca, J, & Perdigao, F (2016). Segmented dynamic time warping for spoken Query-by-Example search. In Proc. of Interspeech. ISCA, Baixas, (pp. 750–754).Lopez-Otero, P, Docio-Fernandez, L, Garcia-Mateo, C (2015). GTM-UVigo systems for the Query-by-Example search on speech task at MediaEval 2015. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 521–523).Lopez-Otero, P, Docio-Fernandez, L, Garcia-Mateo, C (2015). Phonetic unit selection for cross-lingual Query-by-Example spoken term detection. In Proc. of ASRU. IEEE, New York, (pp. 223–229).Saxena, A, & Yegnanarayana, B (2015). Distinctive feature based representation of speech for Query-by-Example spoken term detection. In Proc. of Interspeech. ISCA, Baixas, (pp. 3680–3684).Skacel, M, & Szöke, I (2015). BUT QUESST 2015 system description. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 721–723).Chen, H, Leung, C-C, Xie, L, Ma, B, Li, H (2016). Unsupervised bottleneck features for low-resource Query-by-Example spoken term detection. In Proc. of Interspeech. ISCA, Baixas, (pp. 923–927).Yuan, Y, Leung, C-C, Xie, L, Chen, H, Ma, B, Li, H (2017). Pairwise learning using multi-lingual bottleneck features for low-resource Query-by-Example spoken term detection. In Proc. of ICASSP. IEEE, New York, (pp. 5645–5649).Torbati, AHHN, & Picone, J (2016). A nonparametric Bayesian approach for spoken term detection by example query. In Proc. of Interspeech. ISCA, Baixas, (pp. 928–932).Popli, A, & Kumar, A (2015). Query-by-example spoken term detection using low dimensional posteriorgrams motivated by articulatory classes. In Proc. of MMSP. IEEE, New York, (pp. 1–6).Yang, P, Leung, C-C, Xie, L, Ma, B, Li, H (2014). Intrinsic spectral analysis based on temporal context features for query-by-example spoken term detection. In Proc. of Interspeech. ISCA, Baixas, (pp. 1722–1726).George, B, Saxena, A, Mantena, G, Prahallad, K, Yegnanarayana, B (2014). Unsupervised query-by-example spoken term detection using bag of acoustic words and non-segmental dynamic time warping. In Proc. of Interspeech. ISCA, Baixas, (pp. 1742–1746).Hazen, TJ, Shen, W, White, CM (2009). Query-by-example spoken term detection using phonetic posteriorgram templates. In Proc. of ASRU. IEEE, New York, (pp. 421–426).Abad, A, Astudillo, RF, Trancoso, I (2013). The L2F spoken web search system for mediaeval 2013. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 851–852).Szöke, I, SkĂĄcel, M, Burget, L (2014). BUT QUESST 2014 system description. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 621–622).Szöke, I, Burget, L, GrĂ©zl, F, ČernockĂœ, JH, Ondel, L (2014). Calibration and fusion of query-by-example systems - BUT SWS 2013. In Proc. of ICASSP. IEEE, New York, (pp. 621–622).Abad, A, RodrĂ­guez-Fuentes, LJ, Penagarikano, M, Varona, A, Bordel, G (2013). On the calibration and fusion of heterogeneous spoken term detection systems. In Proc. of Interspeech. ISCA, Baixas, (pp. 20–24).Yang, P, Xu, H, Xiao, X, Xie, L, Leung, C-C, Chen, H, Yu, J, Lv, H, Wang, L, Leow, SJ, Ma, B, Chng, ES, Li, H (2014). The NNI query-by-example system for MediaEval 2014. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 691–692).Leung, C-C, Wang, L, Xu, H, Hou, J, Pham, VT, Lv, H, Xie, L, Xiao, X, Ni, C, Ma, B, Chng, ES, Li, H (2016). Toward high-performance language-independent Query-by-Example spoken term detection for MediaEval 2015: Post-evaluation analysis. In Proc. of Interspeech. ISCA, Baixas, (pp. 3703–3707).Xu, H, Hou, J, Xiao, X, Pham, VT, Leung, C-C, Wang, L, Do, VH, Lv, H, Xie, L, Ma, B, Chng, ES, Li, H (2016). Approximate search of audio queries by using DTW with phone time boundary and data augmentation. In Proc. of ICASSP. IEEE, New York, (pp. 6030–6034).Oishi, S, Matsuba, T, Makino, M, Kai, A (2016). Combining state-level and DNN-based acoustic matches for efficient spoken term detection in NTCIR-12 SpokenQuery &Doc-2 task. In Proc. of NTCIR-12. National Institute of Informatics, Tokyo, (pp. 205–210).Oishi, S, Matsuba, T, Makino, M, Kai, A (2016). Combining state-level spotting and posterior-based acoustic match for improved query-by-example spoken term detection. In Proc. of Interspeech. ISCA, Baixas, (pp. 740–744).Obara, M, Kojima, K, Tanaka, K, Lee, S-w, Itoh, Y (2016). Rescoring by combination of posteriorgram score and subword-matching score for use in Query-by-Example. In Proc. of Interspeech. ISCA, Baixas, (pp. 1918–1922).NIST. The Ninth Text REtrieval Conference (TREC 9). http://trec.nist.gov . Accessed Feb 2018.Anguera, X, Rodriguez-Fuentes, LJ, Szöke, I, Buzo, A, Metze, F (2014). Query by Example Search on Speech at Mediaeval 2014. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 351–352).Joho, H, & Kishida, K (2014). Overview of the NTCIR-11 SpokenQuery&Doc Task. In Proc. of NTCIR-11. National Institute of Informatics, Tokyo, (pp. 1–7).NIST. Draft KWS16 Keyword Search Evaluation Plan. https://www.nist.gov/sites/default/files/documents/itl/iad/mig/KWS16-evalplan-v04.pdf . Accessed Feb 2018.Anguera, X, Metze, F, Buzo, A, Szöke, I, Rodriguez-Fuentes, LJ (2013). The spoken web search task. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 921–922).Taras, B, & Nadeu, C (2011). Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion. EURASIP Journal on Audio, Speech, and Music Processing, 2011(1), 1–10.ZelenĂĄk, M, Schulz, H, Hernando, J (2012). Speaker diarization of broadcast news in Albayzin 2010 evaluation campaign. EURASIP Journal on Audio, Speech, and Music Processing, 2012(19), 1–9.RodrĂ­guez-Fuentes, LJ, Penagarikano, M, Varona, A, DĂ­ez, M, Bordel, G (2011). The Albayzin 2010 Language Recognition Evaluation. In Proc. of Interspeech. ISCA, Baixas, (pp. 1529–1532).Tejedor, J, Toledano, DT, Lopez-Otero, P, Docio-Fernandez, L, Garcia-Mateo, C, Cardenal, A, Echeverry-Correa, JD, Coucheiro-Limeres, A, Olcoz, J, Miguel, A (2015). Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion. EURASIP, Journal on Audio, Speech and Music Processing, 2015(21), 1–27.Tejedor, J, Toledano, DT, Anguera, X, Varona, A, Hurtado, LF, Miguel, A, ColĂĄs, J (2013). Query-by-example spoken term detection ALBAYZIN 2012 evaluation: overview, systems, results, and discussion. EURASIP, Journal on Audio, Speech, and Music Processing, 2013(23), 1–17.Tejedor, J, Toledano, DT, Lopez-Otero, P, Docio-Fernandez, L, Garcia-Mateo, C (2016). Comparison of ALBAYZIN query-by-example spoken term detection 2012 and 2014 evaluations. EURASIP, Journal on Audio, Speech and Music Processing, 2016(1), 1–19.MĂ©ndez, F, DocĂ­o, L, Arza, M, Campillo, F (2010). The Albayzin 2010 text-to-speech evaluation. In Proc. of FALA. UniversidadeVigo, Vigo, (pp. 317–340).Billa, J, Ma, KW, McDonough, JW, Zavaliagkos, G, Miller, DR, Ross, KN, El-Jaroudi, A (1997). Multilingual speech recognition: the 1996 Byblos Callhome system. In Proc. of Eurospeech. ISCA, Baixas, (pp. 363–366).Killer, M, Stuker, S, Schultz, T (2003). Grapheme based speech recognition. In Proc. of Eurospeech. ISCA, Baixas, (pp. 3141–3144).Burget, L, Schwarz, P, Agarwal, M, Akyazi, P, Feng, K, Ghoshal, A, Glembek, O, Goel, N, Karafiat, M, Povey, D, Rastrow, A, Rose, RC, Thomas, S (2010). Multilingual acoustic modeling for speech recognition based on subspace gaussian mixture models. In Proc. of ICASSP. IEEE, New York, (pp. 4334–4337).Cuayahuitl, H, & Serridge, B (2002). Out-of-vocabulary word modeling and rejection for Spanish keyword spotting systems. In Proc. of MICAI. Springer, Berlin, (pp. 156–165).Tejedor, J (2009). Contributions to keyword spotting and spoken term detection for information retrieval in audio mining. PhD thesis, Universidad AutĂłnoma de Madrid, Madrid, Spain.Tejedor, J, Toledano, DT, Wang, D, King, S, ColĂĄs, J (2014). Feature analysis for discriminative confidence estimation in spoken term detection. Computer Speech and Language, 28(5), 1083–1114.Li, J, Wang, X, Xu, B (2014). An empirical study of multilingual and low-resource spoken term detection using deep neural networks. In Proc. of Interspeech. ISCA, Baixas, (pp. 1747–1751).NIST. The Spoken Term Detection (STD) 2006 evaluation plan. http://berlin.csie.ntnu.edu.tw/Courses/Special%20Topics%20in%20Spoken%20Language%20Processing/Lectures2008/SLP2008S-Lecture12-Spoken%20Term%20Detection.pdf . Accessed Feb 2018.Fiscus, JG, Ajot, J, Garofolo, JS, Doddingtion, G (2007). Results of the 2006 spoken term detection evaluation. In Proc. of SSCS. ACM, New York, (pp. 45–50).Martin, A, Doddington, G, Kamm, T, Ordowski, M, Przybocki, M (1997). The DET curve in assessment of detection task performance. In Proc. of Eurospeech. ISCA, Baixas, (pp. 1895–1898).NIST. Evaluation Toolkit (STDEval) software. https://www.nist.gov/itl/iad/mig/tools . Accessed Feb 2018.Union, IT. ITU-T Recommendation P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. http://www.itu.int/rec/T-REC-P.563/en . Accessed Feb 2018.Rajput, N, & Metze, F (2011). Spoken web search. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 1–2).Metze, F, Barnard, E, Davel, M, van Heerden, C, Anguera, X, Gravier, G, Rajput, N (2012). The spoken web search task. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 41–42).Szöke, I, Rodriguez-Fuentes, LJ, Buzo, A, Anguera, X, Metze, F, Proenca, J, Lojka, M, Xiong, X (2015). Query by Example Search on Speech at Mediaeval 2015. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 81–82).Szöke, I, & Anguera, X (2016). Zero-cost speech recognition task at Mediaeval 2016. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 81–82).Akiba, T, Nishizaki, H, Nanjo, H, Jones, GJF (2014). Overview of the NTCIR-11 spokenquery &doc task. In Proc. of NTCIR-11. National Institute of Informatics, Tokyo, (pp. 1–15).Akiba, T, Nishizaki, H, Nanjo, H, Jones, GJF (2016). Overview of the NTCIR-12 spokenquery &doc-2. In Proc. of NTCIR-12. National Institute of Informatics, Tokyo, (pp. 1–13).Schwarz, P (2008). Phoneme recognition based on long temporal context. PhD thesis, FIT, BUT, Brno, Czech Republic.Varona, A, Penagarikano, M, RodrĂ­guez-Fuentes, LJ, Bordel, G (2011). On the use of lattices of time-synchronous cross-decoder phone co-occurrences in a SVM-phonotactic language recognition system. In Proc. of Interspeech. ISCA, Baixas, (pp. 2901–2904).Eyben, F, Wollmer, M, Schuller, B (2010). OpenSMILE—the munich versatile and fast open-source audio feature extractor. In Proc. of ACM Multimedia (MM). ACM, New York, (pp. 1459–1462).Lopez-Otero, P, Docio-Fernandez, L, Garcia-Mateo, C (2016). Finding relevant features for zero-resource query-by-example search on speech. Speech Communication, 84(1), 24–35.Zhang, Y, & Glass, JR (2009). Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In Proc. of ASRU. IEEE, New York, (pp. 398–403).Povey, D, Ghoshal, A, Boulianne, G, Burget, L, Glembek, O, Goel, N, Hannemann, M, Motlicek, P, Qian, Y, Schwarz, P, Silovsky, J, Stemmer, G, Vesely, K (2011). The KALDI speech recognition toolkit. In Proc. of ASRU. IEEE, New York, (pp. 1–4).Muller, M. (2007). Information retrieval for music and motion. New York: Springer.Szöke, I, Skacel, M, Burget, L (2014). BUT QUESST 2014 system description. In Proc. of MediaEval. Ruzica Piskac, New Haven, (pp. 621–622).BrĂŒmmer, N, & van Leeuwen, D (2006). On calibration of language recognition scores. In Proc of the IEEE Odyssey: The speaker and language recognition workshop. IEEE, New York, (pp. 1–8).BrĂŒmmer, N, & de Villiers, E. The BOSARIS toolkit user guide: Theory, algorithms and code for binary classifier score processing. Technical report. https://sites.google.com/site/nikobrummer . Accessed Feb 2018.Meinedo, H, & Neto, J (2005). A stream-based audio segmentation, classification and clustering pre-processing system for broadcast news using ANN models. In Proc. of Interspeech. ISCA, Baixas, (pp. 237–240).Morgan, N, & Bourlard, H (1995). An introduction to hybrid HMM/connectionist continuous speech recognition. IEEE Signal Processing Magazine, 12(3), 25–42.Meinedo, H, Abad, A, Pellegrini, T, Trancoso, I, Neto, J (2010). The L2F broadcast news speech recognition system. In Proc. of FALA. UniversidadeVigo, Vigo, (pp. 93–96).Abad, A, Luque, J, Trancoso, I (2011). Parallel transformation network features for speaker recognition. In Proc. of ICASSP. IEEE, New York, (pp. 5300–5303).Diez, M, Varona, A, Penagarikano, M, Rodriguez-Fuentes, LJ, Bordel, G (2012). On the use of phone log-likelihood ratios as features in spoken language recognition. In Proc. of SLT. IEEE, New York, (pp. 274–279).Diez, M, Varona, A, Penagarikano, M, Rodriguez-Fuentes, LJ, Bordel, G (2014). New insight into the use of phone log-likelihood ratios as features for language recognition. In Proc. of Interspeech. ISCA, Baixas, (pp. 1841–1845).Abad, A, Ribeiro, E, Kepler, F, Astudillo, R, Trancoso, I (2016). Exploiting phone log-likelihood ratio features for the detection of the native language of non-native English speakers. In Proc. of Interspeech. ISCA, Baixas, (pp. 2413–2417).RodrĂ­guez-Fuentes, LJ, Varona, A, Peñagarikano, M, Bordel, G, DĂ­ez, M (2014). High-performance query-by-example spoken term detection on the SWS 2013 evaluation. In Proc. of ICASSP. IEEE, New York, (pp. 7819–7823).Vesely, K, Ghoshal, A, Burget, L, Povey, D (2013). Sequence-discriminative training of deep neural networks. In Proc. of Interspeech. ISCA, Baixas, (pp. 2345–2349).Ghahremani, P, BabaAli, B, Povey, D, Riedhammer, K, Trmal, J, Khudanpur, S (2014). A pitch extraction algorithm tuned for automatic speech recognition. In Proc. of ICASSP. IEEE, New York, (pp. 2494–2498).Povey, D, Hannemann, M, Boulianne, G, Burget, L, Ghoshal, A, Janda, M, Karafiat, M, Kombrink, S, Motlicek, P, Qian, Y, Riedhammer, K, Vesely, K, Vu, NT (2012). Generating exact lattices in the WFST framework. In Proc. of ICASSP. IEEE, New York, (pp. 4213–4216).Garcia-Mateo, C, Dieguez-Tirado, J, Docio-Fernandez, L, Cardenal-Lopez, A (2004). Transcrigal: A bilingual system for automatic indexing of broadcast news. In Proc. of LREC. ELRA, Paris, (pp. 2061–2064).Stolcke, A (2002). SRILM—an extensible language modeling toolkit. In Proc. of Interspeech. ISCA, Baixas, (pp. 901–904).Lopez-Otero, P, Docio-Fernandez, L, Garcia-Mateo, C (2016). GTM-UVigo systems for Albayzin 2016 search on speech evaluation. In Proc. of Iberspeech. Springer, Berlin, (pp. 65–74).Chen, G, Khudanpur, S, Povey, D, Trmal, J, Yarowsky, D, Yilmaz, O (2013). Quantifying the value of pronunciation lexicons for keyword search in low resource languages. In Proc. of ICASSP. IEEE, New York, (pp. 8560–8564).Pham, VT, Chen, NF, Sivadas, S, Xu, H, Chen, I-F, Ni, C, Chng, ES, Li, H (2014). System and keyword dependent fusion for spoken term detection. In Proc. of SLT. IEEE, New York, (pp. 430–435).Can, D, & Saraclar, M (2011). Lattice indexing for spoken term detection. IEEE Transactions on Audio, Speech and Language Processing, 19(8), 2338–2347.Miller, DRH, K

    MediaEval 2011 evaluation campaign

    Get PDF
    MediaEval is an international multimedia benchmarking initiative offering innovative new tasks to the multimedia community. MediaEval 2011 featured tasks incorporating social media search, analysis of affect and location placing of images

    An introduction to crowdsourcing for language and multimedia technology research

    Get PDF
    Language and multimedia technology research often relies on large manually constructed datasets for training or evaluation of algorithms and systems. Constructing these datasets is often expensive with significant challenges in terms of recruitment of personnel to carry out the work. Crowdsourcing methods using scalable pools of workers available on-demand offers a flexible means of rapid low-cost construction of many of these datasets to support existing research requirements and potentially promote new research initiatives that would otherwise not be possible

    Language Transfer of Audio Word2Vec: Learning Audio Segment Representations without Target Language Data

    Full text link
    Audio Word2Vec offers vector representations of fixed dimensionality for variable-length audio segments using Sequence-to-sequence Autoencoder (SA). These vector representations are shown to describe the sequential phonetic structures of the audio segments to a good degree, with real world applications such as query-by-example Spoken Term Detection (STD). This paper examines the capability of language transfer of Audio Word2Vec. We train SA from one language (source language) and use it to extract the vector representation of the audio segments of another language (target language). We found that SA can still catch phonetic structure from the audio segments of the target language if the source and target languages are similar. In query-by-example STD, we obtain the vector representations from the SA learned from a large amount of source language data, and found them surpass the representations from naive encoder and SA directly learned from a small amount of target language data. The result shows that it is possible to learn Audio Word2Vec model from high-resource languages and use it on low-resource languages. This further expands the usability of Audio Word2Vec.Comment: arXiv admin note: text overlap with arXiv:1603.0098

    Blip10000: a social video dataset containing SPUG content for tagging and retrieval

    Get PDF
    The increasing amount of digital multimedia content available is inspiring potential new types of user interaction with video data. Users want to easilyfind the content by searching and browsing. For this reason, techniques are needed that allow automatic categorisation, searching the content and linking to related information. In this work, we present a dataset that contains comprehensive semi-professional user generated (SPUG) content, including audiovisual content, user-contributed metadata, automatic speech recognition transcripts, automatic shot boundary les, and social information for multiple `social levels'. We describe the principal characteristics of this dataset and present results that have been achieved on different tasks

    Comparison of ALBAYZIN query-by-example spoken term detection 2012 and 2014 evaluations

    Full text link
    Query-by-example spoken term detection (QbE STD) aims at retrieving data from a speech repository given an acoustic query containing the term of interest as input. Nowadays, it is receiving much interest due to the large volume of multimedia information. This paper presents the systems submitted to the ALBAYZIN QbE STD 2014 evaluation held as a part of the ALBAYZIN 2014 Evaluation campaign within the context of the IberSPEECH 2014 conference. This is the second QbE STD evaluation in Spanish, which allows us to evaluate the progress in this technology for this language. The evaluation consists in retrieving the speech files that contain the input queries, indicating the start and end times where the input queries were found, along with a score value that reflects the confidence given to the detection of the query. Evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from workshops, which amount to about 7 h of speech. We present the database, the evaluation metric, the systems submitted to the evaluation, the results, and compare this second evaluation with the first ALBAYZIN QbE STD evaluation held in 2012. Four different research groups took part in the evaluations held in 2012 and 2014. In 2014, new multi-word and foreign queries were added to the single-word and in-language queries used in 2012. Systems submitted to the second evaluation are hybrid systems which integrate letter transcription- and template matching-based systems. Despite the significant improvement obtained by the systems submitted to this second evaluation compared to those of the first evaluation, results still show the difficulty of this task and indicate that there is still room for improvement.This research was funded by the Spanish Government ('SpeechTech4All Project' TEC2012 38939 C03 01 and 'CMC-V2 Project' TEC2012 37585 C02 01), the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and 'AtlantTIC Project' CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under project TACTICA
    • 

    corecore