Lipreading with Long Short-Term Memory
Lipreading, i.e. speech recognition from visual-only recordings of a
speaker's face, can be achieved with a processing pipeline based solely on
neural networks, yielding significantly better accuracy than conventional
methods. Feed-forward and recurrent neural network layers (namely Long
Short-Term Memory; LSTM) are stacked to form a single structure which is
trained by back-propagating error gradients through all the layers. The
performance of such a stacked network was experimentally evaluated and compared
to a standard Support Vector Machine classifier using conventional computer
vision features (Eigenlips and Histograms of Oriented Gradients). The
evaluation was performed on data from 19 speakers of the publicly available
GRID corpus. With 51 different words to classify, we report a best word
accuracy on held-out evaluation speakers of 79.6% using the end-to-end neural
network-based solution (11.6% improvement over the best feature-based solution
evaluated).
Comment: Accepted for publication at ICASSP 201
A silent speech system based on permanent magnet articulography and direct synthesis
In this paper we present a silent speech interface (SSI) system aimed at restoring speech communication for individuals who have lost their voice due to laryngectomy or diseases affecting the vocal folds. In the proposed system, articulatory data captured from the lips and tongue using permanent magnet articulography (PMA) are converted into audible speech using a speaker-dependent transformation learned from simultaneous recordings of PMA and audio signals acquired before laryngectomy. The transformation is represented using a mixture of factor analysers, which is a generative model that allows us to efficiently model non-linear behaviour and perform dimensionality reduction at the same time. The learned transformation is then deployed during normal usage of the SSI to restore the acoustic speech signal associated with the captured PMA data. The proposed system is evaluated using objective quality measures and listening tests on two databases containing PMA and audio recordings for normal speakers. Results show that it is possible to reconstruct speech from articulator movements captured by an unobtrusive technique without an intermediate recognition step. The SSI is capable of producing speech of sufficient intelligibility and naturalness that the speaker is clearly identifiable, but problems remain in scaling up the process to function consistently for phonetically rich vocabularies
Towards a Practical Silent Speech Interface Based on Vocal Tract Imaging
The full proceedings of this conference are available at the following link: http://www.issp2011.uqam.ca/upload/files/proceedings.pdf
The paper describes advances in the development of an ultrasound silent speech interface for use in silent communications applications or as a speaking aid for persons who have undergone a laryngectomy. It reports some first steps towards making such a device lightweight, portable, interactive, and practical to use. Simple experimental tests of an interactive silent speech interface for everyday applications are described. Possible future improvements, including extension to continuous speech and real-time operation, are discussed.
Towards a Multimodal Silent Speech Interface for European Portuguese
Automatic Speech Recognition (ASR) in the presence of environmental noise is still a hard problem to tackle in speech science (Ng et al., 2000). Another problem well described in the literature concerns elderly speech production. Studies (Helfrich, 1979) have shown evidence of a slower speech rate, more breaks, more speech errors and a reduced speech volume when elderly speech is compared with that of teenagers or adults at the acoustic level. This makes elderly speech hard to recognize using currently available stochastic-based ASR technology. To tackle these two problems in the context of ASR for Human-Computer Interaction, a novel Silent Speech Interface (SSI) in European Portuguese (EP) is envisioned.
Silent versus modal multi-speaker speech recognition from ultrasound and video
We investigate multi-speaker speech recognition from ultrasound images of the
tongue and video images of the lips. We train our systems on imaging data from
modal speech, and evaluate on matched test sets of two speaking modes: silent
and modal speech. We observe that silent speech recognition from imaging data
underperforms compared to modal speech recognition, likely due to a
speaking-mode mismatch between training and testing. We improve silent speech
recognition performance using techniques that address the domain mismatch, such
as fMLLR and unsupervised model adaptation. We also analyse the properties of
silent and modal speech in terms of utterance duration and the size of the
articulatory space. To estimate the articulatory space, we compute the convex
hull of tongue splines, extracted from ultrasound tongue images. Overall, we
observe that the duration of silent speech is longer than that of modal speech,
and that silent speech covers a smaller articulatory space than modal speech.
Although these two properties are statistically significant across speaking
modes, they do not directly correlate with word error rates from speech
recognition.
Comment: 5 pages, 5 figures, submitted to Interspeech 202
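The convex-hull estimate of articulatory space mentioned in this abstract can be illustrated with a minimal sketch. It assumes each tongue spline is a list of 2-D `(x, y)` points and that the "size" of the space is the area of the hull over all splines from one speaker. Both are assumptions made for illustration, and the function names are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it would create a non-left turn.
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices: List[Point]) -> float:
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def articulatory_space_area(splines: List[List[Point]]) -> float:
    """Area of the convex hull over all points of all splines (assumed measure)."""
    all_points = [p for spline in splines for p in spline]
    return polygon_area(convex_hull(all_points))
```

Under these assumptions, comparing `articulatory_space_area` for a speaker's silent and modal utterances would give one number per speaking mode; the units depend on how the spline coordinates are scaled.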
TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos
We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio,
ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a
set of six recording sessions of one professional voice talent, a male native
speaker of English; TaL80 is a set of recording sessions of 81 native speakers
of English without voice talent experience. Overall, the corpus contains 24
hours of parallel ultrasound, video, and audio data, of which approximately
13.5 hours are speech. This paper describes the corpus and presents benchmark
results for the tasks of speech recognition, speech synthesis
(articulatory-to-acoustic mapping), and automatic synchronisation of ultrasound
to audio. The TaL corpus is publicly available under the CC BY-NC 4.0 license.
Comment: 8 pages, 4 figures, accepted to SLT2021, IEEE Spoken Language
Technology Workshop
Speaker-Independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech
Ultrasound tongue imaging (UTI) provides a convenient way to visualize the
vocal tract during speech production. UTI is increasingly being used for speech
therapy, making it important to develop automatic methods to assist various
time-consuming manual tasks currently performed by speech therapists. A key
challenge is to generalize the automatic processing of ultrasound tongue images
to previously unseen speakers. In this work, we investigate the classification
of phonetic segments (tongue shapes) from raw ultrasound recordings under
several training scenarios: speaker-dependent, multi-speaker,
speaker-independent, and speaker-adapted. We observe that models underperform
when applied to data from speakers not seen at training time. However, when
provided with minimal additional speaker information, such as the mean
ultrasound frame, the models generalize better to unseen speakers.
Comment: 5 pages, 4 figures, published in ICASSP2019 (IEEE International
Conference on Acoustics, Speech and Signal Processing, 2019)
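One way to picture "the mean ultrasound frame" as additional speaker information is sketched below. The abstract does not say how the mean frame is supplied to the model, so pairing it with every frame as a second input channel is an assumption here. The function names are hypothetical, and frames are represented as plain nested lists of pixel intensities.

```python
from typing import List

Frame = List[List[float]]  # rows x columns of pixel intensities


def mean_frame(frames: List[Frame]) -> Frame:
    """Pixel-wise average over all frames recorded from one speaker."""
    if not frames:
        raise ValueError("at least one frame is required")
    rows, cols = len(frames[0]), len(frames[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    for frame in frames:
        for r in range(rows):
            for c in range(cols):
                total[r][c] += frame[r][c]
    count = len(frames)
    return [[value / count for value in row] for row in total]


def add_speaker_channel(frames: List[Frame]) -> List[List[Frame]]:
    """Pair every frame with the speaker's mean frame (assumed 2-channel input)."""
    mean = mean_frame(frames)
    return [[frame, mean] for frame in frames]
```

For an unseen speaker, the mean would be computed from that speaker's own unlabelled frames, which is why it counts as minimal additional information: no phonetic labels are needed to obtain it.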