A silent speech system based on permanent magnet articulography and direct synthesis
In this paper we present a silent speech interface (SSI) system aimed at restoring speech communication for individuals who have lost their voice due to laryngectomy or diseases affecting the vocal folds. In the proposed system, articulatory data captured from the lips and tongue using permanent magnet articulography (PMA) are converted into audible speech using a speaker-dependent transformation learned from simultaneous recordings of PMA and audio signals acquired before laryngectomy. The transformation is represented using a mixture of factor analysers, a generative model that allows us to model non-linear behaviour efficiently and perform dimensionality reduction at the same time. The learned transformation is then deployed during normal usage of the SSI to restore the acoustic speech signal associated with the captured PMA data. The proposed system is evaluated using objective quality measures and listening tests on two databases containing PMA and audio recordings for normal speakers. Results show that it is possible to reconstruct speech from articulator movements captured by an unobtrusive technique without an intermediate recognition step. The SSI is capable of producing speech of sufficient intelligibility and naturalness that the speaker is clearly identifiable, but problems remain in scaling up the process to function consistently for phonetically rich vocabularies.
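To make the articulatory-to-acoustic mapping concrete, below is a minimal sketch that uses a joint full-covariance Gaussian mixture (scikit-learn's GaussianMixture) as a stand-in for the paper's mixture of factor analysers: a GMM is fitted on time-aligned pairs of PMA and acoustic feature frames, and each new PMA frame is converted through the conditional mean of the acoustic part given the PMA part. Feature dimensions, the component count, and the assumption of pre-aligned frame pairs are illustrative choices, not details taken from the paper.

# Sketch of frame-level articulatory-to-acoustic regression with a joint GMM,
# standing in for the paper's mixture-of-factor-analysers transform.
# Assumption: `pma` and `acoustic` are time-aligned (n_frames, dim) feature matrices.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(pma, acoustic, n_components=16, seed=0):
    """Fit a full-covariance GMM on stacked [PMA | acoustic] frames."""
    joint = np.hstack([pma, acoustic])
    return GaussianMixture(n_components=n_components, covariance_type="full",
                           random_state=seed).fit(joint)

def pma_to_acoustic(gmm, pma, dim_x):
    """Convert PMA frames to acoustic frames via the conditional mean E[y | x]."""
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    n_components = len(weights)
    # Component responsibilities computed from the PMA (x) marginal only.
    log_resp = np.stack([
        np.log(weights[k]) + multivariate_normal.logpdf(
            pma, means[k, :dim_x], covs[k, :dim_x, :dim_x])
        for k in range(n_components)], axis=1)
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)
    # Responsibility-weighted sum of per-component conditional means.
    y_hat = np.zeros((pma.shape[0], means.shape[1] - dim_x))
    for k in range(n_components):
        mu_x, mu_y = means[k, :dim_x], means[k, dim_x:]
        s_xx, s_yx = covs[k, :dim_x, :dim_x], covs[k, dim_x:, :dim_x]
        cond_mean = mu_y + (pma - mu_x) @ np.linalg.solve(s_xx, s_yx.T)
        y_hat += resp[:, [k]] * cond_mean
    return y_hat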
Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input
Articulatory information has been shown to be effective in improving the performance of HMM-based and DNN-based text-to-speech synthesis. Speech synthesis research has traditionally focused on text-to-speech conversion, where the input is text or an estimated linguistic representation and the target is synthesized speech. However, a research field that has emerged in the last decade is articulation-to-speech synthesis (with Silent Speech Interfaces, SSI, as a target application), in which the goal is to synthesize speech from some representation of the movement of the articulatory organs. In this paper, we extend traditional (vocoder-based) DNN-TTS with articulatory input estimated from ultrasound tongue images. We compare text-only, ultrasound-only, and combined inputs. Using data from eight speakers, we show that the combined text and articulatory input can have advantages in limited-data scenarios; in particular, it may increase the naturalness of the synthesized speech compared to text-only input. In addition, we analyze the ultrasound tongue recordings of several speakers and show that misalignments in ultrasound transducer positioning can have a negative effect on the final synthesis performance.
Comment: accepted at SSW11 (11th Speech Synthesis Workshop).
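To illustrate the combined text and articulatory input described above, here is a minimal PyTorch sketch in which frame-level linguistic features derived from the text and articulatory features derived from ultrasound tongue images are concatenated and mapped to vocoder parameters by a feed-forward network. All feature dimensions, layer sizes, and the toy training step are assumptions for illustration, not the configuration used in the paper.

# Sketch of a combined-input acoustic model: text-derived linguistic features and
# ultrasound-derived articulatory features are concatenated per frame and mapped to
# vocoder parameters. Dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class CombinedInputAcousticModel(nn.Module):
    def __init__(self, text_dim=400, uti_dim=128, vocoder_dim=60, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + uti_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, vocoder_dim),
        )

    def forward(self, text_feats, uti_feats):
        # Both inputs are (batch, dim) frame-level features, assumed time-aligned.
        return self.net(torch.cat([text_feats, uti_feats], dim=-1))

# Toy training step on random stand-in data.
model = CombinedInputAcousticModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
text_feats, uti_feats = torch.randn(32, 400), torch.randn(32, 128)
target = torch.randn(32, 60)          # e.g. spectral + excitation vocoder parameters
loss = nn.functional.mse_loss(model(text_feats, uti_feats), target)
loss.backward()
optimizer.step()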
Ultrasound-Based Silent Speech Interface Built on a Continuous Vocoder
Recently it was shown within the Silent Speech Interface (SSI) field that F0 can be predicted from Ultrasound Tongue Images (UTI) as the articulatory input, using deep neural networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher-quality speech when using a continuous pitch estimate, which takes non-zero pitch values even when voicing is not present. Therefore, in this paper on UTI-based SSI, we use a simple continuous F0 tracker which does not apply a strict voiced/unvoiced decision. Continuous vocoder parameters (ContF0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a convolutional neural network, with UTI as input. The results demonstrate that in the articulatory-to-acoustic mapping experiments, the continuous F0 is predicted with lower error, and the continuous vocoder produces slightly more natural synthesized speech than the baseline vocoder using standard discontinuous F0.
Comment: 5 pages, 3 figures, accepted for publication at Interspeech 201
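The continuous F0 idea can be illustrated with a simple stand-in: take a conventional tracker's contour, which is zero in unvoiced frames, and fill the gaps by interpolating log-F0 between voiced neighbours so that every frame carries a non-zero pitch value. This is only a sketch of the concept under that assumption, not the specific continuous F0 tracker used in the paper.

# Sketch: turn a standard discontinuous F0 contour (0.0 in unvoiced frames) into a
# continuous one by interpolating log-F0 across unvoiced gaps. Illustrative only.
import numpy as np

def make_continuous_f0(f0):
    """f0: 1-D array of frame-level F0 values in Hz, with 0.0 marking unvoiced frames."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no voiced frames to interpolate from")
    idx = np.arange(len(f0))
    # np.interp holds the nearest voiced value flat at the contour edges.
    log_f0 = np.interp(idx, idx[voiced], np.log(f0[voiced]))
    return np.exp(log_f0)

# Example: a short contour with an unvoiced gap in the middle.
print(make_continuous_f0([0.0, 120.0, 125.0, 0.0, 0.0, 0.0, 180.0, 175.0, 0.0]))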