15 research outputs found

    LSTM based voice conversion for laryngectomees

    This paper describes a voice conversion system designed with the aim of improving the intelligibility and pleasantness of oesophageal voices. Two different systems have been built, one to transform the spectral magnitude and another one for the fundamental frequency, both based on DNNs. Ahocoder has been used to extract the spectral information (mel cepstral coefficients) and a specific pitch extractor has been developed to calculate the fundamental frequency of the oesophageal voices. The cepstral coefficients are converted by means of an LSTM network. The conversion of the intonation curve is implemented through two different LSTM networks, one dedicated to voiced/unvoiced detection and another one for the prediction of F0 from the converted cepstral coefficients. The experiments described here involve conversion from one oesophageal speaker to a specific healthy voice. The intelligibility of the signals has been measured with a Kaldi-based ASR system. A preference test has been implemented to evaluate the subjective preference of the obtained converted voices, comparing them with the original oesophageal voice. The results show that spectral conversion improves ASR while restoring the intonation is preferred by human listeners. This work has been partially funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R), the Basque Government (BerbaOla project, KK-2018/00014) and the European Union's H2020 research and innovation programme under the Marie Curie European Training Network ENRICH (675324).

    Restoring speech following total removal of the larynx by a learned transformation from sensor data to acoustics

    Total removal of the larynx may be required to treat laryngeal cancer: speech is lost. This article shows that it may be possible to restore speech by sensing movement of the remaining speech articulators and using machine learning algorithms to derive a transformation that converts this sensor data into an acoustic signal. The resulting "silent speech," which may be delivered in real time, is intelligible and sounds natural. The identity of the speaker is recognisable. The sensing technique involves attaching small, unobtrusive magnets to the lips and tongue and monitoring changes in the magnetic field induced by their movement.

    RESTORE Project: REpair, STOrage and REhabilitation of speech

    RESTORE is a project that aims to improve the quality of communication for people with difficulties producing speech, providing them with tools and alternative communication services. At the same time, progress will be made in research on techniques for the restoration and rehabilitation of disordered speech. The ultimate goal of the project is to offer new possibilities for the rehabilitation and reintegration into society of patients with speech pathologies, especially laryngectomised patients, by designing new intervention strategies aimed at favouring their communication with the environment and ultimately increasing their quality of life. This project has been funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R and TEC2015-67163-C2-2-R).

    Oesophageal speech: enrichment and evaluations

    167 p. After a laryngectomy (i.e. removal of the larynx) a patient can no longer speak in a healthy laryngeal voice. Therefore, they need to adopt alternative methods of speaking such as oesophageal speech. In this method, speech is produced using swallowed air and the vibrations of the pharyngo-oesophageal segment, which introduces several undesired artefacts and an abnormal fundamental frequency. This makes oesophageal speech harder to process than healthy speech, in terms of both auditory processing and signal processing. The aim of this thesis is to find solutions to make oesophageal speech signals easier to process, and to evaluate these solutions by exploring a wide range of evaluation metrics.
    First, some preliminary studies were performed to compare oesophageal speech and healthy speech. These revealed significantly lower intelligibility and higher listening effort for oesophageal speech compared to healthy speech. Intelligibility scores were comparable for listeners familiar and unfamiliar with oesophageal speech. However, listeners familiar with oesophageal speech reported less effort than unfamiliar listeners. In another experiment, oesophageal speech was reported to require more listening effort than healthy speech even though its intelligibility was comparable to that of healthy speech. On investigating neural correlates of listening effort (i.e. alpha power) using electroencephalography, a higher alpha power was observed for oesophageal speech compared to healthy speech, indicating higher listening effort. Additionally, participants with poorer cognitive abilities (i.e. working memory capacity) showed higher alpha power.
    Next, using several algorithms (pre-existing as well as novel approaches), oesophageal speech was transformed with the aim of making it more intelligible and less effortful to listen to. The novel approach consisted of a deep neural network based voice conversion system where the source was oesophageal speech and the target was synthetic speech matched in duration with the source oesophageal speech. This helped to eliminate the source-target alignment process, which is particularly prone to errors for disordered speech such as oesophageal speech. Both speaker-dependent and speaker-independent versions of this system were implemented. The outputs of the speaker-dependent system had better short-term objective intelligibility scores, automatic speech recognition performance and listener preference scores compared to unprocessed oesophageal speech. The speaker-independent system showed improvement in short-term objective intelligibility scores but not in automatic speech recognition performance. Some other signal transformations were also performed to enhance oesophageal speech. These included removal of undesired artefacts and methods to improve fundamental frequency. Of these methods, only removal of undesired silences had some degree of success (an improvement of 1.44 percentage points in automatic speech recognition performance), and only for low-intelligibility oesophageal speech.
    Lastly, the outputs of these transformations were evaluated and compared with previous systems using an ensemble of evaluation metrics such as short-term objective intelligibility, automatic speech recognition, subjective listening tests and neural measures obtained using electroencephalography. Results reveal that the proposed neural network based system outperformed previous systems in improving the objective intelligibility and automatic speech recognition performance of oesophageal speech. In the case of subjective evaluations, the results were mixed: some improvement in preference scores and no improvement in speech intelligibility and listening effort scores.
    Overall, the results demonstrate several possibilities and new paths to enrich oesophageal speech using modern machine learning algorithms. The outcomes would be beneficial to the disordered speech community.

    Enrichment of Oesophageal Speech: Voice Conversion with Duration-Matched Synthetic Speech as Target

    Pathological speech such as Oesophageal Speech (OS) is difficult to understand due to the presence of undesired artefacts and lack of normal healthy speech characteristics. Modern speech technologies and machine learning enable us to transform pathological speech to improve intelligibility and quality. We have used a neural network based voice conversion method with the aim of improving the intelligibility and reducing the listening effort (LE) of four OS speakers of varying speaking proficiency. The novelty of this method is the use of synthetic speech matched in duration with the source OS as the target, instead of parallel aligned healthy speech. We evaluated the converted samples from this system using a collection of Automatic Speech Recognition (ASR) systems, an objective intelligibility metric (STOI) and a subjective test. ASR evaluation shows that the proposed system had significantly better word recognition accuracy compared to unprocessed OS, and baseline systems which used aligned healthy speech as the target. There was an improvement of at least 15% on STOI scores indicating a higher intelligibility for the proposed system compared to unprocessed OS, and a higher target similarity in the proposed system compared to baseline systems. The subjective test reveals a significant preference for the proposed system compared to unprocessed OS for all OS speakers, except one who was the least proficient OS speaker in the data set. This project was supported by funding from the European Union's H2020 research and innovation programme under the MSCA GA 675324 (the ENRICH network: www.enrich-etn.eu (accessed on 25 June 2021)), and the Basque Government (PIBA_2018_1_0035 and IT355-19).

    Intelligibility and Listening Effort of Spanish Oesophageal Speech

    Communication is a huge challenge for oesophageal speakers, be it for interactions with fellow humans or with digital voice assistants. We aim to quantify these communication challenges (both human-human and human-machine interactions) by measuring intelligibility and Listening Effort (LE) of Oesophageal Speech (OS) in comparison to Healthy Laryngeal Speech (HS). We conducted two listening tests (one web-based, the other in laboratory settings) to collect these measurements. Participants performed a sentence recognition and LE rating task in each test. Intelligibility, calculated as Word Error Rate, showed significant correlation with self-reported LE ratings. Speaker type (healthy or oesophageal) had a major effect on intelligibility and effort. More LE was reported for OS compared to HS even when OS intelligibility was close to HS. Listeners familiar with OS reported less effort when listening to OS compared to nonfamiliar listeners. However, such advantage of familiarity was not observed for intelligibility. Automatic speech recognition scores were higher for OS compared to HS. This project was supported by funding from the EU's H2020 research and innovation programme under the MSCA GA 675324 (the ENRICH network: www.enrich-etn.eu), the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R) and the Basque Government (DL4NLP KK-2019/00045, PIBA_2018_1_0035 and IT355-19).
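
    The abstract above reports intelligibility as Word Error Rate. As a rough illustration only, the sketch below computes the standard WER (word-level edit distance divided by the number of reference words). The function name, the lowercasing and the whitespace tokenisation are assumptions made for this example; the abstract does not describe the authors' actual scoring procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption for illustration: words are compared case-insensitively
    after simple whitespace splitting, with no further normalisation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

    With this definition a listener response that drops one word out of a six-word sentence scores 1/6, and the value can exceed 1 when the response contains many inserted words.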

    Técnicas para la mejora de la inteligibilidad en voces patológicas (Techniques for improving intelligibility in pathological voices)

    229 p. Laryngectomees are people whose larynx has been surgically removed, usually as a consequence of a tumour. Since the larynx is a fundamental organ for voice production, they lose the ability to speak. However, many of them manage to relearn to speak in a different way. This type of speech is known as oesophageal voice and is quite different from healthy voice. Its naturalness and intelligibility are lower, to the point that some listeners have to make an effort to understand what is being said.
    This harms the quality of life of laryngectomees because their communicative abilities are affected, not only in interactions between people but also in voice-controlled human-machine interfaces. This thesis addresses different methods for improving the intelligibility of alaryngeal voices in order to alleviate these problems.
    An important aspect has been to analyse the characteristics specific to oesophageal voice. It is not easy to find the material needed for this analysis, and the available resources are scarce. This thesis has sought to fill this gap by recording a parallel database of oesophageal speakers. This database has been characterised acoustically. To this end, the effects of the fundamental frequency extraction method on the analysis of the characteristics of oesophageal signals have been examined. The use of glottal residual analysis has been proposed, since it better captures the peculiarities of this type of voice.
    A method is also needed to evaluate objectively the impact of the methods proposed to improve intelligibility. For this purpose, a recogniser has been implemented, whose characteristics and particularities are described in this document. This ASR was validated by taking part in a spoken term detection evaluation organised by the Red Temática en Tecnologías del Habla.
    To improve the intelligibility of oesophageal voices, different algorithms based on existing voice conversion techniques applied to healthy voices were first analysed. Both classical techniques based on Gaussian mixtures and conversion techniques based on deep learning were evaluated.
    Finally, these conversion techniques were successfully adapted to oesophageal voices. These conversions were evaluated objectively using the ASR that was built, and subjectively through preference tests. Although the results of the subjective tests indicate that listeners perceive no significant differences between the converted voices and the original oesophageal ones, the automatic recognition results show that the conversion techniques applied to this type of voice reduce the error rate obtained.
