6 research outputs found

    Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque–Spanish ASR

    In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an extensive collection of Basque Parliament plenary sessions containing frequent code switching. Since session minutes are not exact, only the most reliable speech segments are kept for training. To that end, we use phonetic similarity scores between nominal and recognized phone sequences. The process starts with baseline acoustic models trained on generic out-of-domain data, then iteratively updates the models with the extracted data and applies the updated models to refine the training dataset until the observed improvement between two iterations becomes small enough. A development dataset, comprising five plenary sessions not used for training, has been manually audited for tuning and evaluation purposes. Cross-validation experiments (with 20 random partitions) have been carried out on the development dataset, using the baseline and the iteratively updated models. On average, the Word Error Rate (WER) falls from 16.57% (baseline) to 4.41% (first iteration) and further to 4.02% (second iteration), corresponding to relative WER reductions of 73.4% and 8.8%, respectively. When considering only Basque segments, WER falls on average from 16.57% (baseline) to 5.51% (first iteration) and further to 5.13% (second iteration), corresponding to relative WER reductions of 66.7% and 6.9%, respectively. As a result of this work, a new bilingual Basque–Spanish resource has been produced based on Basque Parliament sessions, including 998 h of training data (audio segments + transcriptions), a development set (17 h long) designed for tuning and evaluation under a cross-validation scheme, and a fully bilingual trigram language model. This work was partially funded by the Spanish Ministry of Science and Innovation (OPEN-SPEECH project, PID2019-106424RB-I00) and by the Basque Government under its general support program for research groups (IT-1704-22).
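
    The core of the method is a recognize-filter-retrain loop. As a rough illustration, a minimal Python sketch follows; the recognize and retrain callables stand in for the actual ASR tooling, and the max_per and min_rel_gain thresholds are illustrative values, not those used in the paper.

        def phone_error_rate(reference, hypothesis):
            """Levenshtein distance between two phone sequences, normalized
            by the reference length."""
            prev = list(range(len(hypothesis) + 1))
            for i, ref_phone in enumerate(reference, 1):
                curr = [i]
                for j, hyp_phone in enumerate(hypothesis, 1):
                    cost = 0 if ref_phone == hyp_phone else 1
                    curr.append(min(prev[j] + 1,          # deletion
                                    curr[j - 1] + 1,      # insertion
                                    prev[j - 1] + cost))  # substitution
                prev = curr
            return prev[-1] / max(len(reference), 1)

        def extract_training_set(segments, recognize, retrain, model,
                                 max_per=0.1, min_rel_gain=0.01):
            """segments: (audio, nominal_phones) pairs, phones from the minutes.
            recognize(model, audio) -> recognized phone sequence.
            retrain(kept) -> acoustic models updated on the kept segments."""
            prev_kept = 0
            while True:
                # Keep only segments whose recognized phones are close enough
                # to the nominal (minutes-derived) phones.
                kept = [(a, p) for a, p in segments
                        if phone_error_rate(p, recognize(model, a)) <= max_per]
                # Stop once the training set stops growing appreciably.
                if prev_kept and (len(kept) - prev_kept) / prev_kept < min_rel_gain:
                    return kept, model
                prev_kept = len(kept)
                model = retrain(kept)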

    Analysis of the GOP metric for assessing non-native Spanish pronunciation in the SAMPLE corpus

    This work analyzes the results obtained in phoneme-level pronunciation assessment using the Forced GOP algorithm, which was implemented for this purpose. Utterances of different sentences produced by different speakers were used, recorded and annotated as part of the SAMPLE corpus. This corpus was developed within our research group in collaboration with specialists from the linguistics field. The resulting data were examined to identify possible improvements: several observations are made about the behavior of the metric, and its phoneme- and speaker-level dependence is discussed, which suggests setting per-phoneme thresholds to improve its performance. In addition, proposals are made based on the log-likelihood values produced by the FGOP, and a set of rules is applied to establish a new parameter that yields a score for each phoneme. These per-phoneme scores are then combined into a global speaker-level pronunciation score. The global scores were contrasted with the FGOP results and with assessments made by human judges. Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos). Máster en Investigación en Tecnologías de la Información y las Comunicaciones.
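
    For context, the classic phoneme-level GOP score (Witt & Young), of which Forced GOP is a variant, can be sketched in a few lines of Python. The log_post matrix of per-frame phone log-posteriors and the single grading threshold are assumptions for illustration, not the pipeline actually used with the SAMPLE corpus.

        import numpy as np

        def gop_score(log_post, target_phone, start, end):
            """GOP of the nominal phone over its force-aligned span [start, end):
            mean per-frame gap between the target phone's log-posterior and the
            best competing phone's. Near 0 = good match; very negative = suspect."""
            span = log_post[start:end]           # frames aligned to the phone
            target = span[:, target_phone]       # log-posterior of nominal phone
            best = span.max(axis=1)              # log-posterior of best phone
            return float((target - best).mean())

        def grade_phone(gop, threshold=-1.0):
            # Per-phoneme (and possibly per-speaker) thresholds, as the results
            # above suggest, would replace this single illustrative cutoff.
            return "correct" if gop >= threshold else "mispronounced"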

    Articulatory features for conversational speech recognition

    Human-machine conversation. Characterization and evaluation of the current state of speech recognition, speech synthesis, and human-machine conversation solutions

    Human verbal communication is bidirectional: both parties understand each other and draw conclusions accordingly. This kind of communication, also called dialogue, can take place not only between human agents but also between humans and machines. Interaction between humans and machines through natural language plays an important role in improving communication between the two. To better understand human-machine communication, this document presents background on human-machine conversation systems, including their modules and operation, dialogue strategies, and the challenges to consider in their implementation. In addition, several speech recognition and speech synthesis systems are presented, as well as systems that use human-machine conversation. Finally, performance tests are run on some speech recognition systems and, to put some of the concepts presented in this work into practice, the implementation of a human-machine conversation system is presented. Several conclusions were drawn from this work, including the high complexity of human-machine conversation systems, the low performance of speech recognition in noisy environments, and the barriers that can be encountered when implementing these systems.
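
    As a rough illustration of the module chain these systems share (speech recognition, dialogue management, speech synthesis), a minimal Python loop follows; the three callables are placeholders for real ASR and TTS engines and a dialogue strategy, none of which are specified here.

        def conversation_loop(listen, decide, speak, stop_word="goodbye"):
            """listen() -> recognized user utterance (speech recognition).
            decide(utterance, state) -> (reply, new_state) (dialogue manager).
            speak(reply) -> None (speech synthesis)."""
            state = {}
            while True:
                utterance = listen()                     # ASR front end
                reply, state = decide(utterance, state)  # dialogue strategy
                speak(reply)                             # TTS back end
                if utterance.strip().lower() == stop_word:
                    break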

    Voice inactivity ranking for enhancement of speech on microphone arrays

    Motivated by the problem of improving the performance of speech enhancement algorithms in non-stationary acoustic environments with low SNR, a framework is proposed for identifying signal frames of noisy speech that are unlikely to contain voice activity. Such voice-inactive frames can then be incorporated into an adaptation strategy to improve the performance of existing speech enhancement algorithms. This adaptive approach is applicable to single-channel as well as multi-channel algorithms for noisy speech. In both cases, the adaptive versions of the enhancement algorithms are observed to improve SNR levels by 20 dB, as indicated by PESQ and WER criteria.

    In advanced speech enhancement algorithms, it is often of interest to identify regions of the signal that have a high likelihood of being noise only, i.e., containing no speech. This is in contrast to advanced speech recognition, speaker recognition, and pitch tracking algorithms, in which we are interested in identifying all regions that have a high likelihood of containing speech as well as all regions that have a high likelihood of not containing speech; in other words, minimizing the false positive and false negative rates, respectively. In the context of speech enhancement, identifying some speech-absent regions calls for minimizing false positives while setting an acceptable tolerance on false negatives, as determined by the performance of the enhancement algorithm. Typically, Voice Activity Detectors (VADs) are used to identify speech-absent regions for speech enhancement. In recent years, many Deep Neural Network (DNN) based approaches have been proposed to improve the performance of VADs at low SNR levels by training on combinations of speech and noise; training on such an exhaustive dataset is combinatorially explosive. For this dissertation, we propose a voice inactivity ranking framework, where the identification of voice-inactive frames is performed using a machine learning (ML) approach that uses only clean speech utterances for training and is robust to high levels of noise. In the proposed framework, input frames of noisy speech are ranked by a 'voice inactivity score' to acquire definitely-speech-inactive (DSI) frame sequences. These DSI regions serve as a noise estimate and are used adaptively by the underlying speech enhancement algorithm to enhance speech from a speech mixture.

    The proposed voice-inactivity ranking framework was used to perform speech enhancement in single-channel and multi-channel systems. In the context of microphone arrays, the framework was used to determine parameters for spatial filtering with adaptive beamformers. We achieved an average Word Error Rate (WER) improvement of 50% at SNR levels below 0 dB compared to the noisy signal, which is 7 ± 2.5% more than a framework in which a state-of-the-art VAD decision was used for spatial filtering. For monaural signals, we propose a multi-frame multiband spectral-subtraction (MF-MBSS) speech enhancement system that uses the voice inactivity framework to compute and update noise statistics on overlapping frequency bands. The proposed MF-MBSS not only achieved an average PESQ improvement of 16%, with a maximum improvement of 56%, compared to state-of-the-art spectral subtraction, but also a 5 ± 1.5% improvement in the WER of the spatially filtered output signal in non-stationary acoustic environments.
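
    A minimal numpy sketch of the ranking idea follows: score each frame for voice inactivity, take the lowest-scoring frames as the definitely-speech-inactive (DSI) noise estimate, and subtract its spectrum from every frame. Frame energy stands in here for the ML-based inactivity ranker, and single-band subtraction for the multiband MF-MBSS system; both simplifications are assumptions for illustration.

        import numpy as np

        def dsi_frames(frames, n_dsi):
            """Rank frames by a voice-inactivity score (low energy as a crude
            proxy) and return the indices of the n_dsi most inactive frames."""
            scores = (frames ** 2).mean(axis=1)   # per-frame energy
            return np.argsort(scores)[:n_dsi]     # lowest energy = most inactive

        def spectral_subtract(frames, n_dsi=10, floor=0.01):
            """Single-band spectral subtraction driven by the DSI noise estimate.
            frames: (n_frames, frame_len) array of windowed signal frames."""
            spectra = np.fft.rfft(frames, axis=1)
            noise_mag = np.abs(spectra[dsi_frames(frames, n_dsi)]).mean(axis=0)
            mag = np.abs(spectra) - noise_mag               # subtract noise magnitude
            mag = np.maximum(mag, floor * np.abs(spectra))  # spectral floor
            phase = np.angle(spectra)
            return np.fft.irfft(mag * np.exp(1j * phase),
                                n=frames.shape[1], axis=1)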

    Assessments of Voice Use, Voice Quality, and Perceived Singing Voice Function Among College/University Singing Students Ages 18-24 Through Simultaneous Ambulatory Monitoring With Accelerometer and Acoustic Transducers

    Previous vocal dose studies have analyzed the duration, intensity, and frequency (in Hz) of voice use among college/university singing students through ambulatory monitoring. However, no ambulatory studies of this population have acquired these vocal dose data simultaneously with acoustic measures of voice quality in order to facilitate direct comparisons of voice use with voice quality during the same voicing period. The purpose of this study was to assess the voice use, voice quality, and perceived singing voice function of college/university singing students (N = 19), ages 18-24 years, enrolled in both voice lessons and choir, through (a) measurements of vocal dose and voice quality collected over 3 full days of ambulatory monitoring with an unfiltered neck accelerometer signal acquired with the Sonovox AB VoxLog portable voice analyzer collar; (b) measurements of voice quality during singing and speaking vocal tasks acquired at 3 different times of day by the VoxLog collar's acoustic and accelerometer transducers; and (c) multiple applications of the Evaluation of the Ability to Sing Easily (EASE) questionnaire about perceived singing voice function. Vocal dose metrics included phonation percentage, dose time, cycle dose, and distance dose. Voice quality measures included fundamental frequency (F0), perceived pitch (P0), dB SPL, LTAS slope, alpha ratio, dB SPL 1-3 kHz, pitch strength, shimmer, jitter, and harmonic-to-noise ratio. Major findings indicated that among these students (a) higher vocal doses correlated significantly with greater voice amplitude, more vocal clarity, and less perturbation; (b) there were significant differences in vocal dose and voice quality among non-singing, solo singing, and choral singing time periods; (c) analysis of repeated vocal tasks with the acoustic transducer showed that F0, P0, SPL, and resonance measures increased from morning to afternoon to evening; (d) less perceived ability to sing easily correlated positively with higher frequency and lower amplitude when analyzing repeated vocal tasks with the acoustic transducer; and (e) the two transducers exhibited significant and irregular differences in data obtained simultaneously for 8 of the 10 measures of voice quality.
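
    For reference, the four vocal dose metrics named above have standard definitions in the vocal-dose literature (Titze and colleagues), sketched below in Python. The fixed vocal-fold displacement amplitude is an assumption for illustration; dose studies such as this one estimate it from SPL and F0.

        import numpy as np

        def vocal_doses(f0_hz, dt_s, amp_m=0.0005):
            """f0_hz: per-frame F0 in Hz (0 where unvoiced); dt_s: frame step in
            seconds; amp_m: assumed vocal-fold displacement amplitude in meters."""
            f0 = np.nan_to_num(np.asarray(f0_hz, dtype=float))
            voiced = f0 > 0
            phonation_pct = 100.0 * voiced.mean()   # % of frames phonated
            time_dose = voiced.sum() * dt_s         # Dt: seconds of phonation
            cycle_dose = (f0[voiced] * dt_s).sum()  # Dc: total vibration cycles
            distance_dose = (4.0 * amp_m * f0[voiced] * dt_s).sum()  # Dd: meters
            return phonation_pct, time_dose, cycle_dose, distance_dose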