    Evaluation of Tacotron Based Synthesizers for Spanish and Basque

    In this paper, we describe the implementation and evaluation of Text to Speech synthesizers based on neural networks for Spanish and Basque. Several voices were built, all of them trained on a limited amount of data. The system applies Tacotron 2 to compute mel-spectrograms from the input sequence, followed by WaveGlow as neural vocoder to obtain the audio signals from the spectrograms. The limited amount of training data leads to synthesis errors in some sentences. To automatically detect those errors, we developed a new method that finds the sentences that lose alignment during the inference process. To mitigate the problem, we implemented guided attention, providing the system with the explicit duration of the phonemes. The resulting system was evaluated to assess its robustness, quality and naturalness with both objective and subjective measures. The results reveal the capacity of the system to produce good-quality, natural audio. This work was funded by the Basque Government (Project refs. PIBA 2018-035, IT-1355-19). This work is part of the project Grant PID2019-108040RB-C21 funded by MCIN/AEI/10.13039/501100011033.
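    A minimal sketch of the two-stage pipeline the abstract describes. The torch.hub entry points below load NVIDIA's pretrained English Tacotron 2 and WaveGlow checkpoints as stand-ins for the Spanish and Basque voices trained in the paper, and a CUDA device is assumed:

    ```python
    # Rough sketch, not the authors' code: Tacotron 2 maps text to a
    # mel-spectrogram, WaveGlow vocodes the spectrogram to a waveform.
    import torch

    hub = 'NVIDIA/DeepLearningExamples:torchhub'
    tacotron2 = torch.hub.load(hub, 'nvidia_tacotron2', model_math='fp32').to('cuda').eval()
    waveglow = torch.hub.load(hub, 'nvidia_waveglow', model_math='fp32')
    waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()
    utils = torch.hub.load(hub, 'nvidia_tts_utils')

    text = "An example sentence for the synthesizer."
    sequences, lengths = utils.prepare_input_sequence([text])

    with torch.no_grad():
        # infer() also returns the attention alignments; inspecting such
        # alignments is the kind of signal the paper's error detection uses.
        mel, _, alignments = tacotron2.infer(sequences, lengths)
        audio = waveglow.infer(mel)   # waveform at 22.05 kHz
    ```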

    An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods

    Preprint of the article published online on 31 May 2018. Voice activity detection (VAD) is an essential task in expert systems that rely on oral interfaces. The VAD module detects the presence of human speech and separates speech segments from silences and non-speech noises. The most popular current on-line VAD systems are based on adaptive parameters which seek to cope with varying channel and noise conditions. The main disadvantages of this approach are the need for some initialisation time to properly adjust the parameters to the incoming signal, and uncertain performance when the initial parameters are poorly estimated. In this paper we propose a novel on-line VAD that relies only on previous training and does not introduce any delay. The technique is based on a strategy that we have called Multi-Normalisation Scoring (MNS). It consists of obtaining a vector of multiple observation likelihood scores from normalised mel-cepstral coefficients previously computed from different databases. A classifier is then used to label the incoming observation likelihood vector. Encouraging results have been obtained with a Multi-Layer Perceptron (MLP). This technique can generalise to unseen noise levels and types. A validation experiment against two current standard ITU-T VAD algorithms demonstrates the good performance of the method. Indeed, lower classification error rates are obtained for non-speech frames, while results for speech frames are similar. This work was partially supported by the EU (ERDF) under grant TEC2015-67163-C2-1-R (RESTORE) (MINECO/ERDF, EU) and by the Basque Government under grant KK-2017/00043 (BerbaOla).
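    A minimal sketch of the MNS idea under stated assumptions: each incoming MFCC frame is normalised with statistics from several databases, scored by a per-database likelihood model, and the resulting score vector is labelled by an MLP. The normalisation statistics, GMM scorers and labels below are toy stand-ins, not the paper's trained models:

    ```python
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture
    from sklearn.neural_network import MLPClassifier

    def mns_vector(frame, norm_stats, scorers):
        """Observation log-likelihood of one MFCC frame under each normalisation."""
        scores = []
        for (mu, sigma), gmm in zip(norm_stats, scorers):
            z = (frame - mu) / sigma               # database-specific normalisation
            scores.append(gmm.score(z[None, :]))   # log-likelihood of the frame
        return np.array(scores)                    # the MNS feature vector

    rng = np.random.default_rng(0)
    n_mfcc, n_dbs = 13, 3
    norm_stats = [(rng.normal(size=n_mfcc), np.ones(n_mfcc)) for _ in range(n_dbs)]
    scorers = [GaussianMixture(n_components=2, random_state=0)
               .fit(rng.normal(size=(200, n_mfcc))) for _ in range(n_dbs)]

    y = rng.normal(size=32000).astype(np.float32)  # 2 s of synthetic "audio" at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=16000, n_mfcc=n_mfcc).T  # frames x coeffs
    X = np.stack([mns_vector(f, norm_stats, scorers) for f in mfcc])

    labels = rng.integers(0, 2, size=len(X))       # dummy speech/non-speech labels
    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(X, labels)
    print(mlp.predict(X[:10]))                     # frame-wise VAD decisions
    ```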

    A Duration Model for Text-to-Speech Conversion in Basque

    This paper presents the modelling of phone durations in standard Basque, to be included in a text-to-speech system. The statistical modelling has been done using binary regression trees and a large corpus containing 57,300 phones. Several experiments have been performed, testing different sets of predicting factors. Duration prediction with this model yields an RMSE of 22.23 ms. This work was partially funded by the Spanish Ministry of Science and Technology (TIC2000-1005-C03-03 and TIC2000-1669-C04-03).
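    A minimal sketch of duration modelling with a binary regression tree. The predictor features and the synthetic data below are illustrative assumptions, not the paper's corpus:

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    # Hypothetical per-phone predictors: identity, stress, position in word...
    X = rng.integers(0, 30, size=(5000, 6))
    dur_ms = 60 + 5 * X[:, 0] + rng.normal(0, 20, size=5000)  # synthetic durations

    tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
    tree.fit(X[:4000], dur_ms[:4000])
    pred = tree.predict(X[4000:])
    rmse = np.sqrt(mean_squared_error(dur_ms[4000:], pred))
    print(f"RMSE: {rmse:.2f} ms")   # the paper reports 22.23 ms on its real corpus
    ```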

    Enrichment of Oesophageal Speech: Voice Conversion with Duration-Matched Synthetic Speech as Target

    Pathological speech such as Oesophageal Speech (OS) is difficult to understand due to the presence of undesired artefacts and the lack of normal healthy speech characteristics. Modern speech technologies and machine learning enable us to transform pathological speech to improve intelligibility and quality. We have used a neural-network-based voice conversion method with the aim of improving the intelligibility and reducing the listening effort (LE) of four OS speakers of varying speaking proficiency. The novelty of this method is the use of synthetic speech matched in duration with the source OS as the target, instead of parallel aligned healthy speech. We evaluated the converted samples from this system using a collection of Automatic Speech Recognition (ASR) systems, an objective intelligibility metric (STOI) and a subjective test. The ASR evaluation shows that the proposed system had significantly better word recognition accuracy than unprocessed OS and than baseline systems which used aligned healthy speech as the target. There was an improvement of at least 15% in STOI scores, indicating higher intelligibility for the proposed system compared to unprocessed OS, and higher target similarity compared to the baseline systems. The subjective test reveals a significant preference for the proposed system over unprocessed OS for all OS speakers except one, who was the least proficient OS speaker in the data set. This project was supported by funding from the European Union's H2020 research and innovation programme under the MSCA GA 675324 (the ENRICH network: www.enrich-etn.eu (accessed on 25 June 2021)), and the Basque Government (PIBA_2018_1_0035 and IT355-19).
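    A minimal sketch of the objective evaluation step using the STOI metric via the pystoi package; the file names below are hypothetical placeholders, and the two recordings are assumed to be time-aligned:

    ```python
    import librosa
    from pystoi import stoi

    ref, fs = librosa.load('duration_matched_target.wav', sr=16000)  # synthetic target
    conv, _ = librosa.load('converted_os.wav', sr=16000)             # converted OS

    n = min(len(ref), len(conv))          # trim both signals to a common length
    score = stoi(ref[:n], conv[:n], fs)   # 0..1, higher means more intelligible
    print(f"STOI: {score:.3f}")
    ```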

    Intelligibility and Listening Effort of Spanish Oesophageal Speech

    Communication is a huge challenge for oesophageal speakers, be it for interactions with fellow humans or with digital voice assistants. We aim to quantify these communication challenges (both human-human and human-machine interactions) by measuring the intelligibility and Listening Effort (LE) of Oesophageal Speech (OS) in comparison to Healthy Laryngeal Speech (HS). We conducted two listening tests (one web-based, the other in laboratory settings) to collect these measurements. Participants performed a sentence recognition and LE rating task in each test. Intelligibility, calculated as Word Error Rate, showed significant correlation with self-reported LE ratings. Speaker type (healthy or oesophageal) had a major effect on intelligibility and effort. More LE was reported for OS compared to HS even when OS intelligibility was close to HS. Listeners familiar with OS reported less effort when listening to OS than listeners unfamiliar with it. However, no such advantage of familiarity was observed for intelligibility. Automatic speech recognition scores were higher for OS compared to HS. This project was supported by funding from the EU's H2020 research and innovation programme under the MSCA GA 675324 (the ENRICH network: www.enrich-etn.eu), the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R) and the Basque Government (DL4NLP KK-2019/00045, PIBA_2018_1_0035 and IT355-19).
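    A minimal sketch of the intelligibility measure: Word Error Rate between what a listener (or an ASR system) reported and the prompted sentence. The jiwer package is one common way to compute it; the sentences are made-up examples:

    ```python
    import jiwer

    reference = "the boy ran to the store"
    hypothesis = "the boy run to store"
    wer = jiwer.wer(reference, hypothesis)   # (subs + deletions + insertions) / words
    print(f"WER: {wer:.2%}")                 # higher WER means lower intelligibility
    ```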

    RESTORE Project: REpair, STOrage and REhabilitation of speech

    RESTORE is a project aimed at improving the quality of communication for people with difficulties producing speech, providing them with tools and alternative communication services. At the same time, progress will be made in research on techniques for the restoration and rehabilitation of disordered speech. The ultimate goal of the project is to offer new possibilities in the rehabilitation and reintegration into society of patients with speech pathologies, especially those who have been laryngectomised, by designing new intervention strategies aimed at favouring their communication with the environment and ultimately increasing their quality of life. This project has been funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R and TEC2015-67163-C2-2-R).

    Automatic Classification of Synthetic Voices for Voice Banking Using Objective Measures

    Speech is the most common way of communication among humans. People who cannot communicate through speech due to partial or total loss of the voice can benefit from Alternative and Augmentative Communication devices and Text to Speech technology. One problem of using these technologies is that the included synthetic voices might be impersonal and badly adapted to the user in terms of age, accent or even gender. In this context, the use of synthetic voices from voice banking systems is an attractive alternative. New voices can be obtained by applying adaptation techniques to recordings from people with healthy voices (donors) or from the users themselves before losing their own voice. In this way, the goal is to offer a wide voice catalogue to potential users. However, as there is no control over the recording or the adaptation processes, some method is needed to control the final quality of the voice. We present the work developed to automatically select the best synthetic voices using a set of objective measures and a subjective Mean Opinion Score (MOS) evaluation. A MOS prediction algorithm has been built whose correlation with the subjective scores is similar to that of the most correlated individual measure. This work has been funded by the Basque Government under the project refs. PIBA 2018-035 and IT-1355-19. This work is part of the project Grant PID2019-108040RB-C21 funded by MCIN/AEI/10.13039/501100011033.
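    A minimal sketch of MOS prediction from objective measures with a simple regressor, then checking how well the prediction correlates with the subjective scores. The measures and data are illustrative stand-ins, not the paper's feature set or model:

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 4))   # hypothetical objective measures per voice
    mos = 3.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.3, size=60)

    model = LinearRegression().fit(X[:40], mos[:40])
    pred = model.predict(X[40:])
    r, _ = pearsonr(mos[40:], pred)   # to be compared against each single measure
    print(f"Correlation with MOS: {r:.2f}")
    ```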

    Frame-Based Phone Classification Using EMG Signals

    This paper evaluates the impact of inter-speaker and inter-session variability on the development of a silent speech interface (SSI) based on electromyographic (EMG) signals from the facial muscles. The final goal of the SSI is to provide a communication tool for Spanish-speaking laryngectomees by generating audible speech from voiceless articulation. However, before moving on to such a complex task, this study addresses a simpler phone classification task under different conditions of speaker and session dependency. These experiments consist of processing the recorded utterances into phone-labeled segments and predicting the phonetic labels using only features obtained from the EMG signals. We evaluate and compare the performance of each model in terms of classification accuracy. Results show that the models predict the phonetic label best when they are trained and tested on data from the same session. The accuracy drops drastically when the model is tested on data from a different session, although it improves when more data are added to the training set. Similarly, when the same model is tested on a session from a different speaker, the accuracy decreases. This suggests that using larger amounts of data could help to reduce the impact of inter-session variability, but more research is required to understand whether this approach would suffice to account for inter-speaker variability as well. This research was funded by Agencia Estatal de Investigación grant number ref. PID2019-108040RB-C21/AEI/10.13039/501100011033.
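    A minimal sketch of frame-based phone classification from EMG: windowed time-domain features per channel and frame, then a per-frame classifier. The channel count, frame sizes, feature set and classifier here are assumptions for illustration, not the paper's exact configuration:

    ```python
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def frame_features(emg, frame_len=200, hop=80):
        """Per-frame features for each EMG channel: mean absolute value,
        zero-crossing rate and RMS, concatenated over channels."""
        feats = []
        for start in range(0, emg.shape[1] - frame_len, hop):
            w = emg[:, start:start + frame_len]
            mav = np.mean(np.abs(w), axis=1)
            zcr = np.mean(np.diff(np.sign(w), axis=1) != 0, axis=1)
            rms = np.sqrt(np.mean(w ** 2, axis=1))
            feats.append(np.concatenate([mav, zcr, rms]))
        return np.stack(feats)

    rng = np.random.default_rng(0)
    emg = rng.normal(size=(5, 16000))        # 5 channels of synthetic EMG
    X = frame_features(emg)
    y = rng.integers(0, 30, size=len(X))     # dummy frame-wise phone labels

    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200).fit(X, y)
    print("frame accuracy:", clf.score(X, y))
    ```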