169 research outputs found

    Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation

    [Abstract] The huge amount of information stored in audio and video repositories makes search on speech (SoS) a priority area nowadays. Within SoS, Query-by-Example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given a spoken query. Research on this area is continuously fostered with the organization of QbE STD evaluations. This paper presents a multi-domain internationally open evaluation for QbE STD in Spanish. The evaluation aims at retrieving the speech files that contain the queries, providing their start and end times, and a score that reflects the confidence given to the detection. Three different Spanish speech databases that encompass different domains have been employed in the evaluation: MAVIR database, which comprises a set of talks from workshops; RTVE database, which includes broadcast television (TV) shows; and COREMAH database, which contains 2-people spontaneous speech conversations about different topics. The evaluation has been designed carefully so that several analyses of the main results can be carried out. We present the evaluation itself, the three databases, the evaluation metrics, the systems submitted to the evaluation, the results, and the detailed post-evaluation analyses based on some query properties (within-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). Fusion results of the primary systems submitted to the evaluation are also presented. Three different teams took part in the evaluation, and ten different systems were submitted. The results suggest that the QbE STD task is still in progress, and the performance of these systems is highly sensitive to changes in the data domain. 
Nevertheless, QbE STD strategies are able to outperform text-based STD in unseen data domains.
    Funding: Centro singular de investigación de Galicia, ED431G/04; Universidad del País Vasco, GIU16/68; Ministerio de Economía y Competitividad, TEC2015-68172-C2-1-P; Ministerio de Ciencia, Innovación y Competitividad, RTI2018-098091-B-I00; Xunta de Galicia, ED431G/0

    Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates

    This paper presents a novel approach for indoor acoustic source localization using microphone arrays, based on a Convolutional Neural Network (CNN). The proposed solution is, to the best of our knowledge, the first published work in which the CNN is designed to directly estimate the three-dimensional position of an acoustic source, using the raw audio signal as the input information and avoiding the use of hand-crafted audio features. Given the limited amount of available localization data, we propose in this paper a two-step training strategy. We first train our network using semi-synthetic data generated from close-talk speech recordings, in which we simulate the time delays and distortion suffered by the signal as it propagates from the source to the array of microphones. We then fine-tune this network using a small amount of real data. Our experimental results show that this strategy is able to produce networks that significantly improve on existing localization methods based on SRP-PHAT strategies. In addition, our experiments show that our CNN method exhibits better resistance against varying gender of the speaker and different window sizes compared with the other methods.
    Comment: 18 pages, 3 figures, 8 table
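    The semi-synthetic data step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes a fixed speed of sound, integer-sample delays, and a simple 1/r attenuation in place of the unspecified distortion model, and the names `propagation_delays` and `simulate_array` are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value, not from the paper


def propagation_delays(source, mics, sample_rate):
    """Delay, in whole samples, from the source position to each microphone."""
    return [round(math.dist(source, mic) / SPEED_OF_SOUND * sample_rate)
            for mic in mics]


def simulate_array(signal, source, mics, sample_rate):
    """Build one delayed, attenuated copy of a close-talk signal per microphone.

    Assumption: attenuation is a plain 1/r gain; the paper's distortion
    model is not specified in the abstract.
    """
    channels = []
    delays = propagation_delays(source, mics, sample_rate)
    for mic, delay in zip(mics, delays):
        gain = 1.0 / max(math.dist(source, mic), 1e-3)  # avoid division by zero
        channels.append([0.0] * delay + [gain * s for s in signal])
    return channels
```

    A call such as `simulate_array(speech, (1.0, 2.0, 1.5), mic_positions, 16000)` would give one channel per microphone, which could then be paired with the known source position as a training target.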

    Comparison of ALBAYZIN query-by-example spoken term detection 2012 and 2014 evaluations

    Query-by-example spoken term detection (QbE STD) aims at retrieving data from a speech repository given an acoustic query containing the term of interest as input. Nowadays, it is receiving much interest due to the large volume of multimedia information. This paper presents the systems submitted to the ALBAYZIN QbE STD 2014 evaluation held as a part of the ALBAYZIN 2014 Evaluation campaign within the context of the IberSPEECH 2014 conference. This is the second QbE STD evaluation in Spanish, which allows us to evaluate the progress in this technology for this language. The evaluation consists in retrieving the speech files that contain the input queries, indicating the start and end times where the input queries were found, along with a score value that reflects the confidence given to the detection of the query. Evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from workshops, which amount to about 7 h of speech. We present the database, the evaluation metric, the systems submitted to the evaluation, the results, and compare this second evaluation with the first ALBAYZIN QbE STD evaluation held in 2012. Four different research groups took part in the evaluations held in 2012 and 2014. In 2014, new multi-word and foreign queries were added to the single-word and in-language queries used in 2012. Systems submitted to the second evaluation are hybrid systems which integrate letter transcription- and template matching-based systems. 
Despite the significant improvement obtained by the systems submitted to this second evaluation compared to those of the first evaluation, results still show the difficulty of this task and indicate that there is still room for improvement.
    This research was funded by the Spanish Government ('SpeechTech4All Project' TEC2012 38939 C03 01 and 'CMC-V2 Project' TEC2012 37585 C02 01), the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and 'AtlantTIC Project' CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under project TACTICA
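    The task in this abstract (return the start time, end time, and a confidence score for each detected query) can be illustrated with a template-matching sketch. The Python below uses subsequence dynamic time warping, a technique commonly used for template matching in QbE STD; the abstract does not say which matching algorithm the submitted systems used, so this is an assumption. Frames are plain feature tuples, the local distance is Euclidean, and the function name `subsequence_dtw` is hypothetical.

```python
import math


def subsequence_dtw(query, utterance):
    """Find the utterance span that best matches the query.

    Returns (cost, start, end): accumulated distance (lower is better)
    and inclusive frame indices of the matched span.
    """
    n, m = len(query), len(utterance)
    # cost[i][j]: best accumulated distance for query[:i + 1] ending at utterance[j]
    cost = [[math.inf] * m for _ in range(n)]
    # start[i][j]: utterance frame where that best path began
    start = [[0] * m for _ in range(n)]
    for j in range(m):
        # The match may begin at any utterance frame at no extra cost.
        cost[0][j] = math.dist(query[0], utterance[j])
        start[0][j] = j
    for i in range(1, n):
        for j in range(m):
            candidates = [(cost[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((cost[i][j - 1], start[i][j - 1]))
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))
            best_cost, best_start = min(candidates)
            cost[i][j] = best_cost + math.dist(query[i], utterance[j])
            start[i][j] = best_start
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], start[n - 1][end], end
```

    In practice the accumulated cost would usually be normalized by path length and mapped to a confidence score, and frame indices would be converted to times using the feature frame rate.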

    Albayzin 2018 Evaluation: The IberSpeech-RTVE Challenge on Speech Technologies for Spanish Broadcast Media

    The IberSpeech-RTVE Challenge presented at IberSpeech 2018 is a new Albayzin evaluation series supported by the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla (RTTH)). That series was focused on speech-to-text transcription, speaker diarization, and multimodal diarization of television programs. For this purpose, the Corporacion Radio Television Española (RTVE), the main public service broadcaster in Spain, and the RTVE Chair at the University of Zaragoza made more than 500 h of broadcast content and subtitles available for scientists. The dataset included about 20 programs of different kinds and topics produced and broadcast by RTVE between 2015 and 2018. The programs presented different challenges from the point of view of speech technologies such as: the diversity of Spanish accents, overlapping speech, spontaneous speech, acoustic variability, background noise, or specific vocabulary. This paper describes the database and the evaluation process and summarizes the results obtained

    Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion

    The electronic version of this article is the complete one and can be found online at: http://dx.doi.org/10.1186/s13636-015-0063-8
    Spoken term detection (STD) aims at retrieving data from a speech repository given a textual representation of the search term. Nowadays, it is receiving much interest due to the large volume of multimedia information. STD differs from automatic speech recognition (ASR) in that ASR is interested in all the terms/words that appear in the speech data, whereas STD focuses on a selected list of search terms that must be detected within the speech data. This paper presents the systems submitted to the STD ALBAYZIN 2014 evaluation, held as a part of the ALBAYZIN 2014 evaluation campaign within the context of the IberSPEECH 2014 conference. This is the first STD evaluation that deals with the Spanish language. The evaluation consists of retrieving the speech files that contain the search terms, indicating their start and end times within the appropriate speech file, along with a score value that reflects the confidence given to the detection of the search term. The evaluation is conducted on a Spanish spontaneous speech database, which comprises a set of talks from workshops and amounts to about 7 h of speech. We present the database, the evaluation metrics, the systems submitted to the evaluation, the results, and a detailed discussion. Four different research groups took part in the evaluation. Evaluation results show reasonable performance for a moderate out-of-vocabulary term rate. This paper compares the systems submitted to the evaluation and provides an in-depth analysis based on some search term properties (term length, in-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and in-language/foreign terms).
    This work has been partly supported by project CMC-V2 (TEC2012-37585-C02-01) from the Spanish Ministry of Economy and Competitiveness. This research was also funded by the European Regional Development Fund, the Galician Regional Government (GRC2014/024, “Consolidation of Research Units: AtlantTIC Project” CN2012/160)

    ALBAYZIN 2018 spoken term detection evaluation: a multi-domain international evaluation in Spanish

    [Abstract] Search on speech (SoS) is a challenging area due to the huge amount of information stored in audio and video repositories. Spoken term detection (STD) is an SoS-related task aiming to retrieve data from a speech repository given a textual representation of a search term (which can include one or more words). This paper presents a multi-domain internationally open evaluation for STD in Spanish. The evaluation has been designed carefully so that several analyses of the main results can be carried out. The evaluation task aims at retrieving the speech files that contain the terms, providing their start and end times, and a score that reflects the confidence given to the detection. Three different Spanish speech databases that encompass different domains have been employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast news programs; and the COREMAH database, which contains 2-people spontaneous speech conversations about different topics. We present the evaluation itself, the three databases, the evaluation metric, the systems submitted to the evaluation, the results, and detailed post-evaluation analyses based on some term properties (within-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and native/foreign terms). Fusion results of the primary systems submitted to the evaluation are also presented. Three different research groups took part in the evaluation, and 11 different systems were submitted. 
The obtained results suggest that the STD task is still in progress and performance is highly sensitive to changes in the data domain.
    Funding: Ministerio de Economía y Competitividad, TIN2015-64282-R; Ministerio de Economía y Competitividad, RTI2018-093336-B-C22; Ministerio de Economía y Competitividad, TEC2015-65345-P; Xunta de Galicia, ED431B 2016/035; Xunta de Galicia, GPC ED431B 2019/003; Xunta de Galicia, GRC 2014/024; Xunta de Galicia, ED431G/01; Xunta de Galicia, ED431G/04; Agrupación estratéxica consolidada, GIU16/68; Ministerio de Economía y Competitividad, TEC2015-68172-C2-1-

    Language recognition using phonotactic-based shifted delta coefficients and multiple phone recognizers

    A new language recognition technique based on applying the philosophy of the Shifted Delta Coefficients (SDC) to phone log-likelihood ratio (PLLR) features is described. The new methodology allows the incorporation of long-span phonetic information at a frame-by-frame level while dealing with the temporal length of each phone unit. The proposed features are used to train an i-vector based system and are tested on the Albayzin LRE 2012 dataset. The results show a relative improvement of 33.3% in Cavg in comparison with different state-of-the-art acoustic i-vector based systems. The paper also presents the integration of parallel phone ASR systems, each of which generates multiple PLLR coefficients that are stacked together and then projected into a reduced dimension. Finally, the paper shows how incorporating state information from the phone ASR provides additional improvements, and how fusion with the other acoustic and phonotactic systems provides an important improvement of 25.8% over the system presented during the competition.
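    As background for this abstract, the classic Shifted Delta Coefficients stacking can be sketched in a few lines of Python. This is the standard SDC formulation applied to a generic per-frame feature sequence, not the paper's phone-duration-aware variant; the parameter names `d`, `P`, `k` follow the usual N-d-P-k convention, the edge clamping is an assumption, and the function name `shifted_delta` is hypothetical.

```python
def shifted_delta(frames, d=1, P=3, k=7):
    """Stack k delta blocks per frame, each shifted P frames further ahead.

    frames: list of equal-length feature vectors (e.g. PLLR vectors).
    Block i at time t is frames[t + i*P + d] - frames[t + i*P - d],
    with indices clamped to the valid range (an assumption for edges).
    """
    last = len(frames) - 1
    output = []
    for t in range(len(frames)):
        stacked = []
        for i in range(k):
            center = min(t + i * P, last)
            ahead = frames[min(center + d, last)]
            behind = frames[max(center - d, 0)]
            stacked.extend(a - b for a, b in zip(ahead, behind))
        output.append(stacked)
    return output
```

    Each output vector has `k` times the input dimension, which is how SDC-style features carry long-span context at a frame-by-frame level.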

    ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

    [EN] Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given an acoustic (spoken) query containing the term of interest as the input. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation held as a part of the ALBAYZIN 2016 Evaluation Campaign at the IberSPEECH 2016 conference. Special attention was given to the evaluation design so that a thorough post-analysis of the main results could be carried out. Two different Spanish speech databases, which cover different acoustic and language domains, were used in the evaluation: the MAVIR database, which consists of a set of talks from workshops, and the EPIC database, which consists of a set of European Parliament sessions in Spanish. We present the evaluation design, both databases, the evaluation metric, the systems submitted to the evaluation, the results, and a thorough analysis and discussion. Four different research groups participated in the evaluation, and a total of eight template matching-based systems were submitted. We compare the systems submitted to the evaluation and make an in-depth analysis based on some properties of the spoken queries, such as query length, single-word/multi-word queries, and in-language/out-of-language queries.
    This work was partially supported by Fundacao para a Ciencia e Tecnologia (FCT) under the projects UID/EEA/50008/2013 (pluriannual funding in the scope of the LETSREAD project) and UID/CEC/50021/2013, and Grant SFRH/BD/97187/2013. Jorge Proenca is supported by the SFRH/BD/97204/2013 FCT Grant. This work was also supported by the Galician Government ('Centro singular de investigacion de Galicia' accreditation 2016-2019 ED431G/01 and the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014)), the European Regional Development Fund (ERDF), the projects "DSSL: Redes Profundas y Modelos de Subespacios para Deteccion y Seguimiento de Locutor, Idioma y Enfermedades Degenerativas a partir de la Voz" (TEC2015-68172-C2-1-P) and the TIN2015-64282-R funded by Ministerio de Economia y Competitividad in Spain, the Spanish Government through the project "TraceThem" (TEC2015-65345-P), and AtlantTIC ED431G/04.
    Tejedor, J.; Toledano, DT.; Lopez-Otero, P.; Docio-Fernandez, L.; Proença, J.; Perdigão, F.; García-Granada, F.... (2018). ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation. EURASIP Journal on Audio, Speech and Music Processing. 1-25. https://doi.org/10.1186/s13636-018-0125-9

    Using word graphs as intermediate representation of uttered sentences

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-33275-3_35
    We present an algorithm for building graphs of words as an intermediate representation of uttered sentences. No language model is used. The input data for the algorithm are the pronunciation lexicon, organized as a tree, and the sequence of acoustic frames. The transitions between consecutive units are considered as additional units. Nodes represent discrete instants of time, arcs are labelled with words, and a confidence measure is assigned to each detected word, computed from the phonetic probabilities of the subsequence of acoustic frames used for completing the word. We evaluated the obtained word graphs by searching for the path that best matches the correct sentence and then measuring its word accuracy, i.e. the oracle word accuracy. © 2012 Springer-Verlag.
    This work was supported by the Spanish MICINN under contract TIN2011-28169-C05-01 and the Vic. d'Investigació of the UPV under contract 20110897.
    Gómez Adrian, JA.; Sanchís Arnal, E. (2012). Using word graphs as intermediate representation of uttered sentences. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer Verlag (Germany). 284-291. doi:10.1007/978-3-642-33275-3_35
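The word-graph representation and the oracle word accuracy described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the arc tuple layout, the function name `oracle_word_accuracy`, the integer node numbering (assumed to increase with time, so every arc goes from a lower to a higher node), and the toy Spanish example are all assumptions made for illustration. The sketch finds the path whose word sequence has the smallest edit distance to the reference sentence and reports 1 minus that distance divided by the reference length.

```python
from collections import defaultdict


def oracle_word_accuracy(arcs, start, final, reference):
    """Best achievable word accuracy over all start-to-final paths.

    arcs: list of (from_node, to_node, word, confidence) tuples.
    Assumption: nodes are integers ordered by time, so from_node < to_node
    and sorting the nodes gives a topological order.
    """
    n = len(reference)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    inf = float("inf")

    outgoing = defaultdict(list)
    nodes = {start, final}
    for u, v, word, _confidence in arcs:
        outgoing[u].append((v, word))
        nodes.update((u, v))

    # cost[node][j]: fewest errors to reach node having consumed j reference words
    cost = {node: [inf] * (n + 1) for node in nodes}
    cost[start][0] = 0

    for u in sorted(nodes):
        row = cost[u]
        # Deletion: a reference word has no counterpart on the path
        for j in range(n):
            row[j + 1] = min(row[j + 1], row[j] + 1)
        for v, word in outgoing[u]:
            nxt = cost[v]
            for j in range(n + 1):
                # Insertion: the arc word has no counterpart in the reference
                nxt[j] = min(nxt[j], row[j] + 1)
                if j < n:
                    # Match (cost 0) or substitution (cost 1)
                    nxt[j + 1] = min(nxt[j + 1], row[j] + (word != reference[j]))

    errors = cost[final][n]
    if errors == inf:
        raise ValueError("final node is not reachable from start node")
    return 1.0 - errors / n


# Hypothetical toy graph: nodes 0..4 are time instants, arcs carry a word
# and a confidence value (stored here, but not used by the oracle search).
arcs = [
    (0, 1, "el", 0.9),
    (0, 1, "al", 0.4),
    (1, 2, "tren", 0.8),
    (1, 3, "tren", 0.5),
    (2, 3, "sale", 0.7),
    (3, 4, "hoy", 0.6),
]

print(oracle_word_accuracy(arcs, 0, 4, ["el", "tren", "sale", "hoy"]))
print(oracle_word_accuracy(arcs, 0, 4, ["el", "tren", "llega", "hoy"]))
```

The confidence values play no role in the oracle search, which only asks how good the best path in the graph could be; they would matter in a later stage that has to choose a path without access to the reference.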