    Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion

    The electronic version of this article is the complete one and can be found online at: http://dx.doi.org/10.1186/s13636-015-0063-8
    Spoken term detection (STD) aims at retrieving data from a speech repository given a textual representation of the search term. It is currently receiving much interest due to the large volume of multimedia information available. STD differs from automatic speech recognition (ASR) in that ASR is interested in all the terms/words that appear in the speech data, whereas STD focuses on a selected list of search terms that must be detected within the speech data. This paper presents the systems submitted to the STD ALBAYZIN 2014 evaluation, held as part of the ALBAYZIN 2014 evaluation campaign within the context of the IberSPEECH 2014 conference. This is the first STD evaluation that deals with the Spanish language. The evaluation consists of retrieving the speech files that contain the search terms, indicating their start and end times within the appropriate speech file, along with a score value that reflects the confidence given to the detection of the search term. The evaluation is conducted on a Spanish spontaneous speech database, which comprises a set of talks from workshops and amounts to about 7 h of speech. We present the database, the evaluation metrics, the systems submitted to the evaluation, the results, and a detailed discussion. Four different research groups took part in the evaluation. Evaluation results show reasonable performance for a moderate out-of-vocabulary term rate. This paper compares the systems submitted to the evaluation and provides an in-depth analysis based on several search term properties (term length, in-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and in-language/foreign terms).
    This work has been partly supported by project CMC-V2 (TEC2012-37585-C02-01) from the Spanish Ministry of Economy and Competitiveness. This research was also funded by the European Regional Development Fund and the Galician Regional Government (GRC2014/024, “Consolidation of Research Units: AtlantTIC Project” CN2012/160).
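    The abstract above describes each system output as a speech file, start and end times, and a confidence score for a search term. A minimal sketch of such a record and of a score-thresholded file lookup follows. The `Detection` layout, the `files_containing` helper, the 0.5 threshold, and the sample values are all illustrative assumptions, not the evaluation's actual submission format or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One hypothesized occurrence of a search term (hypothetical layout)."""
    term: str        # textual search term
    file_id: str     # speech file in which the term was detected
    start_s: float   # start time within the file, in seconds
    end_s: float     # end time within the file, in seconds
    score: float     # confidence assigned to the detection


def files_containing(detections, term, min_score=0.5):
    """Return sorted ids of files with at least one detection of `term`
    whose score reaches `min_score` (an illustrative threshold)."""
    return sorted({d.file_id for d in detections
                   if d.term == term and d.score >= min_score})


# Made-up sample detections, for illustration only.
detections = [
    Detection("madrid", "talk_01", 12.40, 12.95, 0.91),
    Detection("madrid", "talk_02", 301.10, 301.70, 0.32),
    Detection("lenguaje", "talk_01", 45.00, 45.80, 0.77),
]
retrieved = files_containing(detections, "madrid")
```

    Lowering `min_score` admits more low-confidence detections, so more files are retrieved at the cost of more false alarms. That trade-off is why a score is reported with every detection.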

    Creating expressive synthetic voices by unsupervised clustering of audiobooks

    In this work we design an approach for automatic feature selection and voice creation for expressive synthesis. Our approach is guided by two main goals: (1) increasing the flexibility of expressive voice creation and (2) overcoming the limitations of speaking styles in expressive synthesis. We define a novel set of features, combining traditionally used prosodic features with spectral features and proposing the use of iVectors. With these features we perform unsupervised clustering of an audiobook excerpt and, from these clusters, we create synthetic voices using the SAT technique. To evaluate the clustering performance we propose an objective evaluation technique for the unsupervised clustering results based on perplexity reduction. This objective evaluation indicates that both prosodic and spectral features contribute to separating speaking styles and emotions. The best results are achieved when iVectors are included in the feature set, which reduces the perplexity of the expressions and of the audiobook characters by factors of 14 and 2, respectively. We also designed a novel subjective evaluation method in which the participants have to edit a small excerpt of an audiobook using synthetic voices created from clusters. The results suggest that our feature set is effective in the task of expressiveness and character detection.
    Peer reviewed. Postprint (published version).

    ALBAYZIN 2018 spoken term detection evaluation: a multi-domain international evaluation in Spanish

    Search on speech (SoS) is a challenging area due to the huge amount of information stored in audio and video repositories. Spoken term detection (STD) is an SoS-related task aiming to retrieve data from a speech repository given a textual representation of a search term (which can include one or more words). This paper presents a multi-domain internationally open evaluation for STD in Spanish. The evaluation has been designed carefully so that several analyses of the main results can be carried out. The evaluation task aims at retrieving the speech files that contain the terms, providing their start and end times, and a score that reflects the confidence given to the detection. Three different Spanish speech databases that encompass different domains have been employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast news programs; and the COREMAH database, which contains two-person spontaneous speech conversations about different topics. We present the evaluation itself, the three databases, the evaluation metric, the systems submitted to the evaluation, the results, and detailed post-evaluation analyses based on some term properties (within-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and native/foreign terms). Fusion results of the primary systems submitted to the evaluation are also presented. Three different research groups took part in the evaluation, and 11 different systems were submitted. The obtained results suggest that the STD task is still in progress and that performance is highly sensitive to changes in the data domain.
    Funding: Xunta de Galicia (Refs. ED431B 2016/035, ED431B 2019/003, 2014/024, ED431G/01, ED431G/04); Ministerio de Economía y Competitividad (Refs. TEC2015-68172-C2-1-P, TEC2015-65345-P, TIN2015-64282-R); Agencia Estatal de Investigación (Ref. RTI2018-093336-B-C2)

    Performance and Stability of Wet-Milled CoAl2O4, Ni/CoAl2O4, and Pt,Ni/CoAl2O4 for Soot Combustion

    Low-energy wet milling was employed to activate commercial CoAl2O4 spinel and disperse mono- and multimetallic nanoparticles on its surface. This method yielded efficient Pt,Ni catalysts for soot oxidation in simulated diesel exhaust conditions. The characterization and activity results indicated that although Ni/CoAl2O4 was highly active, the presence of Pt was required to obtain a stable Ni(0.25 wt. %),Pt(0.75 wt. %)/CoAl2O4 catalyst under the operating conditions of diesel particulate filters, and that hot-spot formation must be controlled to avoid deactivation of the cobalt aluminate. Our work provides important insight for new design strategies to develop high-efficiency, low-cost catalysts. Platinum-containing multimetallic nanostructures could efficiently reduce the amount of the costly, but to date non-replaceable, Pt noble metal for a large number of industrially important catalytic processes.
    This research was funded by CSIC, NANOMIND project number CSIC201560E068, and the Spanish Government, grant number MAT2017-86450-C4-1-R. C.M.A.-D. received financial support from the Spanish Ministry of Economy, Industry and Competitiveness (MINECO) for an FPI grant BES-2014-069779, which is co-financed with FEDER funds. F.R.-M. is indebted to MINECO for a ‘Ramon y Cajal’ contract (ref: RyC-2015-18626), which is co-financed by the European Social Fund. F.R.-M. also acknowledges support from a 2018 Leonardo Grant for Researchers and Cultural Creators (BBVA Foundation).

    Towards large scale multimedia indexing: a case study on person discovery in broadcast news

    Paper presented at the 15th International Workshop on Content-Based Multimedia Indexing (CBMI'17), held in Florence, Italy, 19 to 21 June 2017.
    The rapid growth of multimedia databases and the human interest in their peers make indices representing the location and identity of people in audio-visual documents essential for searching archives. Person discovery in the absence of prior identity knowledge requires accurate association of audio-visual cues and detected names. To this end, we present three different strategies to approach this problem: clustering-based naming, verification-based naming, and graph-based naming. Each of these strategies utilizes different recent advances in unsupervised face/speech representation, verification, and optimization. To provide a better understanding of the approaches, this paper also presents a quantitative and qualitative comparative study of them using the associated corpus of the Person Discovery challenge at MediaEval 2016. From the results of our experiments, we can observe the pros and cons of each approach, thus paving the way for promising future research directions.
    This work was supported by the EU project EUMSSI (FP7-611057), the ANR project MetaDaTV (ANR-14-CE24-0024), the Camomile project (PCIN-2013-067), and the projects TEC2013-43935-R, TEC2015-69266-P, TEC2016-75976-R, and TEC2015-65345-P financed by the Spanish government and ERDF.
