2 research outputs found

    Desarrollo de un sistema de búsquedas en voz basadas en ejemplos (Development of a query-by-example voice search system)

    Nowadays, several techniques are available for building a voice search system. This Bachelor Thesis (TFG) describes the development of a query-by-example voice search system using the QbE-STD (Query-by-Example Spoken Term Detection) approach. The system follows the conditions of the ALBAYZIN 2016 evaluation: it uses the same resources (audio files) that were provided to the participants, and one of the proposed systems is deployed. The goal of a QbE-STD system is to find a term (the query) in a repository (the document); however, instead of using words or phonemes in text form as search terms, the system uses audio fragments. These fragments are the queries, and they must be pre-processed; the documents must go through the same process. The pre-processing stage consists essentially of feature extraction. Phoneme posterior probabilities were adopted as features because of their good results, and they are obtained from a phoneme recognizer. The other fundamental stage is the search algorithm: S-DTW (subsequence DTW), a variant of the classic dynamic time warping (DTW) algorithm, decides whether the query occurs in the document. If it does, the algorithm returns the start and end times of each candidate detection within the database, together with a detection score. Finally, a series of tests is carried out to evaluate the system's performance.
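The subsequence DTW search described in the abstract can be sketched as follows. This is a minimal illustration only (plain Euclidean frame costs, no score normalization), not the system submitted to the evaluation; it returns the best-matching span of the document for a query, as frame indices plus a cost.

```python
import numpy as np

def subsequence_dtw(query, doc):
    """Subsequence DTW: find the best-matching span of `doc` for `query`.

    query: (n, d) array of feature vectors (e.g. phoneme posteriorgrams)
    doc:   (m, d) array of feature vectors
    Returns (start, end, score): 0-based frame indices in doc and the cost.
    """
    n, m = len(query), len(doc)
    # Frame-level cost: Euclidean distance between feature vectors.
    cost = np.linalg.norm(query[:, None, :] - doc[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0  # the query may start anywhere in the document
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                               D[i, j - 1],      # deletion
                                               D[i - 1, j - 1])  # match
    end = int(np.argmin(D[n, 1:]))  # best end frame (0-based)
    score = float(D[n, end + 1])
    # Backtrack to recover the start frame.
    i, j = n, end + 1
    while i > 0:
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return j, end, score
```

The key difference from classic DTW is the initialization `D[0, :] = 0` and taking the minimum over the last row, which lets the match begin and end at any document frame rather than forcing an alignment of the whole document.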

    Representation Learning for Spoken Term Detection

    Spoken Term Detection (STD) is the task of searching for a given spoken query word in a large speech database. Applications of STD include speech data indexing, voice dialling, telephone monitoring and data mining. STD performance depends mainly on how the speech signal is represented and how the represented signals are matched. This work investigates methods for robust representation of the speech signal, invariant to speaker variability, in the context of the STD task. Here the representation takes the form of templates: sequences of feature vectors. The typical representation in the speech community, Mel-Frequency Cepstral Coefficients (MFCCs), carries both speech-specific and speaker-specific information, hence the need for a better representation. Searching is done by matching the feature-vector sequences of query and reference utterances using Subsequence Dynamic Time Warping (DTW). The performance of the proposed representation is evaluated on Telugu broadcast news data. In the absence of labelled data, i.e. in an unsupervised setting, we propose to capture the joint density of the acoustic space spanned by MFCCs using Gaussian Mixture Models (GMMs) and Gaussian-Bernoulli Restricted Boltzmann Machines (GBRBMs). Posterior features extracted from the trained models are used to search for the query word. We observe 8% and 12% improvements in STD performance over MFCCs by using GMM and GBRBM posterior features respectively. As transcribed data is not required, this approach is well suited to low-resource languages; however, due to its intermediate performance, it cannot be an immediate solution for high-resource languages.
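The GMM posterior features described above can be sketched as below. This is an illustrative assumption-laden example, not the paper's pipeline: it takes already-estimated diagonal-covariance mixture parameters (the means, variances and weights here are placeholders, not values from the paper) and maps each MFCC frame to its vector of component posteriors, i.e. a posteriorgram.

```python
import numpy as np

def gmm_posteriorgram(frames, means, variances, weights):
    """Posterior probability of each diagonal-covariance Gaussian
    mixture component for each feature frame (a "posteriorgram").

    frames:    (T, d) feature vectors (e.g. MFCCs)
    means:     (K, d) component means
    variances: (K, d) diagonal covariances
    weights:   (K,)   mixture weights
    Returns a (T, K) array of posteriors; each row sums to 1.
    """
    # Log-density of each frame under each diagonal Gaussian component.
    diff = frames[:, None, :] - means[None, :, :]            # (T, K, d)
    log_pdf = -0.5 * (np.sum(diff**2 / variances, axis=2)
                      + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_pdf                    # (T, K)
    # Normalize in log space for numerical stability.
    log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1,
                                               keepdims=True)
    return np.exp(log_post)
```

Replacing raw MFCC frames with these posterior vectors before the subsequence-DTW search is what gives the representation its speaker invariance: frames from different speakers that fall in the same region of acoustic space map to similar posterior distributions.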