Synthesizing Speech from Intracranial Depth Electrodes using an Encoder-Decoder Framework
Speech neuroprostheses have the potential to enable communication for people
with dysarthria or anarthria. Recent advances have demonstrated high-quality
text decoding and speech synthesis from electrocorticographic grids placed on
the cortical surface. Here, we investigate a less invasive measurement modality
in three participants, namely stereotactic EEG (sEEG), which provides sparse
sampling from multiple brain regions, including subcortical regions. To
evaluate whether sEEG can also be used to synthesize high-quality audio from
neural recordings, we employ a recurrent encoder-decoder model based on modern
deep learning methods. We find that speech can indeed be reconstructed with
correlations up to 0.8 from these minimally invasive recordings, despite
limited amounts of training data.
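To make the modeling pipeline concrete, the following is a minimal sketch, in PyTorch, of a recurrent encoder-decoder that maps sEEG feature frames to mel-spectrogram frames. The layer sizes, the choice of GRUs, and the mel target are illustrative assumptions, not the exact architecture used in the paper.

```python
# Sketch of a recurrent encoder-decoder mapping sEEG feature frames to
# mel-spectrogram frames. Dimensions and GRU backbones are assumptions.
import torch
import torch.nn as nn

class SEEGToSpeech(nn.Module):
    def __init__(self, n_channels=64, hidden=256, n_mels=80):
        super().__init__()
        # Encoder: summarize sEEG activity across the sparsely sampled electrodes.
        self.encoder = nn.GRU(n_channels, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Decoder: unroll over output frames conditioned on the encoded sequence.
        self.decoder = nn.GRU(2 * hidden, hidden, num_layers=2, batch_first=True)
        self.project = nn.Linear(hidden, n_mels)

    def forward(self, seeg):              # seeg: (batch, time, n_channels)
        enc, _ = self.encoder(seeg)       # (batch, time, 2 * hidden)
        dec, _ = self.decoder(enc)        # (batch, time, hidden)
        return self.project(dec)          # predicted mel frames (batch, time, n_mels)

# Reconstruction quality can be scored as the correlation between predicted and
# reference spectrogram frames, the metric quoted in the abstract.
x = torch.randn(2, 100, 64)
print(SEEGToSpeech()(x).shape)  # torch.Size([2, 100, 80])
```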
DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation
In recent years, audio-driven 3D facial animation has gained significant
attention, particularly in applications such as virtual reality, gaming, and
video conferencing. However, accurately modeling the intricate and subtle
dynamics of facial expressions remains a challenge. Most existing studies
approach facial animation as a single regression problem and thus often fail
to capture the intrinsic inter-modal relationship between speech signals and
3D facial animation, overlooking their inherent consistency. Moreover, due
to the limited availability of 3D audio-visual datasets, approaches trained on
small samples generalize poorly, which degrades performance. To address these
issues, we propose a cross-modal
dual-learning framework, termed DualTalker, which aims to improve data usage
efficiency and to capture cross-modal dependencies. The framework is
trained jointly with the primary task (audio-driven facial animation) and its
dual task (lip reading) and shares common audio/motion encoder components. Our
joint training framework facilitates more efficient data usage by leveraging
information from both tasks and explicitly capitalizing on the complementary
relationship between facial motion and audio to improve performance.
Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate
the potential over-smoothing underlying the cross-modal complementary
representations, enhancing the mapping of subtle facial expression dynamics.
Through extensive experiments and a perceptual user study conducted on the VOCA
and BIWI datasets, we demonstrate that our approach outperforms current
state-of-the-art methods both qualitatively and quantitatively. We have made
our code and video demonstrations available at
https://github.com/sabrina-su/iadf.git
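As a rough illustration of the dual-learning setup described above, the sketch below pairs a shared audio encoder and motion encoder with a primary head (audio-driven facial animation) and a dual head (lip reading), plus an auxiliary consistency term between the two latent sequences. All dimensions, backbones, and loss weights here are assumptions for illustration, not the released DualTalker implementation.

```python
# Sketch of the dual-learning idea: shared encoders serve both the primary task
# (audio -> 3D facial motion) and the dual task (motion -> lip-reading units),
# with a consistency loss tying the two latent spaces together. All sizes,
# backbones, and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTalkerSketch(nn.Module):
    def __init__(self, audio_dim=80, motion_dim=15069, latent=256, n_units=41):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, latent, batch_first=True)    # shared audio encoder
        self.motion_enc = nn.GRU(motion_dim, latent, batch_first=True)  # shared motion encoder
        self.motion_dec = nn.Linear(latent, motion_dim)  # primary head: animate mesh vertices
        self.unit_dec = nn.Linear(latent, n_units)       # dual head: lip-reading units

    def forward(self, audio, motion):
        za, _ = self.audio_enc(audio)        # (batch, time, latent)
        zm, _ = self.motion_enc(motion)      # (batch, time, latent)
        pred_motion = self.motion_dec(za)    # audio-driven facial animation
        pred_units = self.unit_dec(zm)       # lip reading from facial motion
        consistency = F.mse_loss(za, zm)     # cross-modal consistency term
        return pred_motion, pred_units, consistency

model = DualTalkerSketch()
audio = torch.randn(2, 120, 80)      # e.g. mel frames
motion = torch.randn(2, 120, 15069)  # flattened per-frame mesh vertices (assumed layout)
pred_motion, pred_units, cons = model(audio, motion)
loss = (F.mse_loss(pred_motion, motion)
        + F.cross_entropy(pred_units.reshape(-1, 41), torch.randint(41, (2 * 120,)))
        + 0.1 * cons)  # joint objective: primary + dual + consistency
```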
- …