TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos
We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio,
ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a
set of six recording sessions of one professional voice talent, a male native
speaker of English; TaL80 is a set of recording sessions of 81 native speakers
of English without voice talent experience. Overall, the corpus contains 24
hours of parallel ultrasound, video, and audio data, of which approximately
13.5 hours are speech. This paper describes the corpus and presents benchmark
results for the tasks of speech recognition, speech synthesis
(articulatory-to-acoustic mapping), and automatic synchronisation of ultrasound
to audio. The TaL corpus is publicly available under the CC BY-NC 4.0 license.
Comment: 8 pages, 4 figures. Accepted to SLT2021, IEEE Spoken Language Technology Workshop
LipLearner: Customizable Silent Speech Interactions on Mobile Devices
Silent speech interfaces are a promising technology that enables private
communication in natural language. However, previous approaches support only a
small and inflexible vocabulary, which limits expressiveness. We
leverage contrastive learning to learn efficient lipreading representations,
enabling few-shot command customization with minimal user effort. Our model
exhibits high robustness to different lighting, posture, and gesture conditions
on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947
is achievable using only one shot, and its performance can be further boosted
by adaptively learning from more data. This generalizability allowed us to
develop a mobile silent speech interface empowered with on-device fine-tuning
and visual keyword spotting. A user study demonstrated that with LipLearner,
users could define their own commands with high reliability guaranteed by an
online incremental learning scheme. Subjective feedback indicated that our
system provides essential functionalities for customizable silent speech
interactions with high usability and learnability.
Comment: Conditionally accepted to the ACM CHI Conference on Human Factors in Computing Systems 2023 (CHI '23)
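The few-shot customization described in this abstract can be illustrated with a minimal sketch: each command is enrolled from a single example embedding (the "one shot"), and a new utterance is matched to the nearest enrolled prototype by cosine similarity. Note this is a hypothetical illustration, not LipLearner's actual implementation; the contrastive lipreading encoder that would produce the embeddings is assumed and not shown, and the `OneShotClassifier` class and toy vectors are invented for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class OneShotClassifier:
    """Nearest-prototype classifier: one enrolled embedding per command."""

    def __init__(self):
        self.prototypes = {}  # command label -> enrolled embedding

    def enroll(self, label, embedding):
        # Register a command from a single example ("one shot").
        self.prototypes[label] = list(embedding)

    def classify(self, embedding):
        # Return the enrolled command whose prototype is most similar.
        return max(self.prototypes,
                   key=lambda lbl: cosine_similarity(self.prototypes[lbl], embedding))

# Toy usage with made-up 3-D "embeddings" (a real encoder would supply these):
clf = OneShotClassifier()
clf.enroll("play", [1.0, 0.1, 0.0])
clf.enroll("stop", [0.0, 1.0, 0.2])
print(clf.classify([0.9, 0.2, 0.1]))  # closest to the "play" prototype
```

Adding further examples per command and averaging them into the prototype would give the adaptive improvement the abstract mentions, without retraining the encoder itself.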