VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
The goal of this work is to reconstruct speech from a silent talking face
video. Recent studies have shown impressive performance on synthesizing speech
from silent talking face videos. However, they have not explicitly considered
the varying identity characteristics of different speakers, which poses a
challenge for video-to-speech synthesis and becomes more critical in
unseen-speaker settings. Our approach is to separate the speech content and the
visage-style from a given silent talking face video. By guiding the model to
independently focus on modeling the two representations, we can obtain highly
intelligible speech from the model even when the input video of an unseen
subject is given. To this end, we introduce speech-visage selection, which
separates the speech content and the speaker identity from the visual features
of the input video. The disentangled representations are jointly incorporated
to synthesize speech through a visage-style-based synthesizer, which generates
speech by coating on the visage-style while maintaining the speech content.
Thus, the proposed framework offers the advantage of synthesizing speech with
the correct content even from a silent talking face video of an unseen subject.
We validate the effectiveness of the proposed framework on the GRID, TCD-TIMIT
volunteer, and LRW datasets.
Comment: Accepted by ECCV 202
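The feature-selection idea above can be caricatured in a few lines: a per-frame visual feature is split into a speech-content vector and a visage-style vector by two projections, and the two are recombined in a toy synthesizer. All shapes, weight matrices, and the combination rule below are hypothetical stand-ins for illustration, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not the paper's.
D_VISUAL, D_CONTENT, D_STYLE = 512, 256, 128

# Stand-ins for learned projection weights.
W_content = rng.standard_normal((D_CONTENT, D_VISUAL)) * 0.02
W_style = rng.standard_normal((D_STYLE, D_VISUAL)) * 0.02

def select_features(visual_feat):
    """Split a visual feature into content (what is said) and style (who says it)."""
    content = W_content @ visual_feat
    style = W_style @ visual_feat
    return content, style

def synthesize(content, style):
    """Toy visage-style-based synthesizer: the real model outputs a
    mel-spectrogram; here the style merely modulates the content vector."""
    gain = np.tanh(style).mean()          # scalar style "coating"
    return content * (1.0 + gain)

visual_feat = rng.standard_normal(D_VISUAL)
content, style = select_features(visual_feat)
speech = synthesize(content, style)
assert speech.shape == (D_CONTENT,)
```

Because the two branches are separate projections, a style vector from one speaker could in principle be paired with content from another, which is the intuition behind unseen-speaker robustness.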
DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding
Recent research has demonstrated impressive results in video-to-speech
synthesis which involves reconstructing speech solely from visual input.
However, previous works have struggled to accurately synthesize speech due to a
lack of sufficient guidance for the model to infer the correct content with the
appropriate sound. To resolve the issue, they have adopted an extra speaker
embedding as speaking-style guidance derived from reference auditory information.
Nevertheless, it is not always possible to obtain the audio information from
the corresponding video input, especially during the inference time. In this
paper, we present a novel vision-guided speaker embedding extractor using a
self-supervised pre-trained model and prompt tuning technique. In doing so, the
rich speaker embedding information can be produced solely from input visual
information, and the extra audio information is not necessary during the
inference time. Using the extracted vision-guided speaker embedding
representations, we further develop a diffusion-based video-to-speech synthesis
model, called DiffV2S, conditioned on these speaker embeddings and the
visual representation extracted from the input video. The proposed DiffV2S not
only maintains phoneme details contained in the input video frames, but also
creates a highly intelligible mel-spectrogram in which the speaker identities
of the multiple speakers are all preserved. Our experimental results show that
DiffV2S achieves state-of-the-art performance compared to previous
video-to-speech synthesis techniques.
Comment: ICCV 202
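The inference pipeline described above can be sketched as conditional DDPM-style ancestral sampling, where the denoiser is conditioned on a vision-derived speaker embedding instead of any reference audio. The linear "denoiser", dimensions, and noise schedule below are placeholders, not the DiffV2S network:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10                                    # toy number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t, speaker_emb, visual_feat):
    """Toy epsilon-predictor; the real model is a neural network
    conditioned on the speaker embedding and visual representation."""
    cond = np.concatenate([speaker_emb, visual_feat]).mean()
    return 0.1 * x_t + 0.01 * cond        # placeholder prediction

def sample(speaker_emb, visual_feat, dim=80):
    """Standard DDPM reverse process from noise to one mel-spectrogram frame."""
    x = rng.standard_normal(dim)
    for t in reversed(range(T)):
        eps = denoiser(x, t, speaker_emb, visual_feat)
        a, ab = alphas[t], alpha_bars[t]
        x = (x - (1.0 - a) / np.sqrt(1.0 - ab) * eps) / np.sqrt(a)
        if t > 0:                          # add noise except at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x

speaker_emb = rng.standard_normal(192)     # extracted from video, not audio
visual_feat = rng.standard_normal(512)
mel_frame = sample(speaker_emb, visual_feat)
assert mel_frame.shape == (80,)
```

The key point the abstract makes is visible in the signature of `sample`: nothing audio-derived is needed at inference time, since the speaker embedding itself comes from the visual input.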
Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
We present a novel approach to multilingual audio-visual speech recognition
by introducing a single model trained on a multilingual dataset. Motivated by
the human cognitive system, in which humans intuitively distinguish different
languages without any conscious effort or guidance, we propose a model that can
identify the language of the input speech by distinguishing the
inherent similarities and differences between languages. To do so, we
integrate a prompt fine-tuning technique into a large-scale pre-trained audio-visual
representation model so that the network can recognize the language class as
well as the speech with the corresponding language. Our work contributes to
developing robust and efficient multilingual audio-visual speech recognition
systems, reducing the need for language-specific models.
Comment: EMNLP 2023 Findings
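Prompt fine-tuning as described above can be sketched as prepending a few learnable vectors to the input sequence of a frozen pre-trained backbone, so that only the prompts (and a small head) are trained. The toy backbone, shapes, and language-ID head below are assumptions for illustration, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_PROMPT, N_FRAMES, N_LANGS = 64, 4, 20, 5

prompt = rng.standard_normal((N_PROMPT, D)) * 0.02   # the only "trainable" part
W_frozen = rng.standard_normal((D, D)) * 0.02        # frozen backbone weights
W_lang = rng.standard_normal((N_LANGS, D)) * 0.02    # small language-ID head

def forward(av_features):
    """Prepend prompt vectors, run the frozen backbone, predict the language."""
    seq = np.vstack([prompt, av_features])   # (N_PROMPT + N_FRAMES, D)
    hidden = np.tanh(seq @ W_frozen)         # frozen transform, untouched by tuning
    pooled = hidden.mean(axis=0)             # simple sequence pooling
    logits = W_lang @ pooled                 # per-language scores
    return logits

av_features = rng.standard_normal((N_FRAMES, D))     # audio-visual frame features
logits = forward(av_features)
assert logits.shape == (N_LANGS,)
```

The design choice this illustrates is parameter efficiency: the backbone stays shared across all languages, so adding a language never requires a new language-specific model.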
Splice variants of DOMINO control Drosophila circadian behavior and pacemaker neuron maintenance.
Circadian clocks control daily rhythms in behavior and physiology. In Drosophila, the small ventral lateral neurons (sLNvs) expressing PIGMENT DISPERSING FACTOR (PDF) are the master pacemaker neurons generating locomotor rhythms. Despite the importance of sLNvs and PDF in circadian behavior, little is known about the factors that control sLNv maintenance and PDF accumulation. Here, we identify the Drosophila SWI2/SNF2 protein DOMINO (DOM) as a key regulator of circadian behavior. Depletion of DOM in circadian neurons eliminates morning anticipatory activity under light-dark cycles and impairs behavioral rhythmicity in constant darkness. Interestingly, the two major splice variants of DOM, DOM-A and DOM-B, have distinct circadian functions. DOM-A depletion mainly leads to arrhythmic behavior, while DOM-B knockdown lengthens the circadian period without affecting rhythmicity. Both DOM-A and DOM-B bind to the promoter regions of the key pacemaker genes period and timeless and regulate their protein expression. However, we find that only DOM-A is required for the maintenance of sLNvs and the transcription of pdf. Lastly, constitutive activation of PDF-receptor signaling rescues the arrhythmia and period lengthening caused by DOM downregulation. Taken together, our findings reveal that the two splice variants of DOM play distinct roles in circadian rhythms by regulating the abundance of pacemaker proteins and the maintenance of sLNvs.
DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion
Speech-driven 3D facial animation has gained significant attention for its
ability to create realistic and expressive facial animations in 3D space based
on speech. Learning-based methods have shown promising progress in achieving
accurate facial motion synchronized with speech. However, the one-to-many
nature of speech-to-3D facial synthesis has not been fully explored: while the
lips accurately synchronize with the speech content, other facial attributes
beyond speech-related motions can vary for the same speech. To account for
this potential variance in facial attributes within a single speech input, we
propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis method.
DF-3DFace captures the complex one-to-many relationships between speech and 3D
face based on diffusion. It concurrently achieves aligned lip motion by
exploiting audio-mesh synchronization and masked conditioning. Furthermore, the
proposed method jointly models identity and pose in addition to facial motions
so that it can generate 3D face animation without requiring a reference
identity mesh and produce natural head poses. We contribute a new large-scale
3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in
identities, poses, and facial motions of 3D face mesh. Extensive experiments
demonstrate that our method successfully generates highly variable facial
shapes and motions from speech and simultaneously achieves more realistic
facial animation than state-of-the-art methods.
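One common way to realize the masked conditioning mentioned above is to randomly drop the conditioning signal during training, so the model learns both conditional and unconditional generation, and then blend the two predictions at inference in classifier-free-guidance style. Whether DF-3DFace uses exactly this scheme is an assumption here; the code below is a generic sketch of the technique:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_condition(audio_feat, p_drop=0.2):
    """Training-time masking: with probability p_drop, replace the
    conditioning audio feature with a null (all-zero) signal."""
    if rng.random() < p_drop:
        return np.zeros_like(audio_feat)
    return audio_feat

def guided_prediction(pred_cond, pred_uncond, scale=2.0):
    """Inference-time blend of conditional and unconditional predictions;
    larger scale pushes toward tighter audio-mesh alignment."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

audio_feat = rng.standard_normal(128)        # hypothetical conditioning feature
maybe_masked = mask_condition(audio_feat)
assert maybe_masked.shape == audio_feat.shape

# Toy check of the blending rule: uncond 0, cond 1, scale 2 -> 2.
blended = guided_prediction(np.ones(3), np.zeros(3), scale=2.0)
assert np.allclose(blended, 2.0)
```

The guidance scale gives a single knob for trading off strict lip synchronization against the variability in non-speech facial attributes that the abstract emphasizes.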
Can a training hub deliver undergraduate medical education with patient educators?
Background
Medical schools may find it difficult to coordinate GP practices to support undergraduate medical education in primary care. In England, every Integrated Care System area now has a funded training hub to plan and upskill the primary care and community health workforce. We evaluated whether a training hub could help deliver undergraduate medical education, co-facilitated by patient educators. No published research has evaluated this model before.
Methods
We used before-and-after surveys (617 students), interviews (28) and focus groups (20 people) with undergraduate medical students, patient educators, and training hub and medical school team members.
Findings
It was feasible for a training hub to develop and co-deliver a workshop with patient educators. 61% of Year 4 undergraduate students (first clinical year) took part, a high attendance rate during the COVID-19 pandemic. 80% of students said they learnt a lot about managing conditions in primary care and the community as a result. They particularly valued engaging with patient educators and seeing interprofessional working between GPs and pharmacists, which were cornerstones of the training hub approach. The hub was able to recruit and retain patient educators more effectively than the medical school alone. Patient educators said they felt valued and developed new skills.
Conclusions
Working with training hubs may be part of the solution to the issues medical schools face when organising undergraduate education about primary care. This small evaluation suggests that this model could be tested further.
The unsuitability of Emergence Theory for Pentecostal theology: A response to Bradnick and McCall
In this response to David Bradnick's and Bradford McCall's defense of Amos Yong's usage of emergence theory, we defend our previous argument regarding the tension between Yong's Pentecostal commitments and the philosophical entailments of emergence theory. We clarify and extend our previous concerns in three ways. First, we explore the difficulties of construing divine action naturalistically (i.e. natural divine causation). Second, we clarify the problems of employing supervenience in theology. Third, we show why Bradnick's and McCall's advice to Yong to adopt weak emergence is theologically costly. In conclusion, it is suggested that theologians within the science and religion dialogue should not fear, but recover, the language of supernaturalism and dualism.