Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
We present a novel approach to multilingual audio-visual speech recognition
tasks by introducing a single model trained on a multilingual dataset. Motivated by
the human cognitive system, in which humans can intuitively distinguish different
languages without any conscious effort or guidance, we propose a model that can
identify which language is given as input speech by distinguishing the
inherent similarities and differences between languages. To do so, we incorporate a
prompt fine-tuning technique into a large-scale pre-trained audio-visual
representation model so that the network can recognize the language class as
well as the speech in the corresponding language. Our work contributes to
developing robust and efficient multilingual audio-visual speech recognition
systems, reducing the need for language-specific models. Comment: EMNLP 2023 Findings
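A minimal PyTorch sketch of the prompt fine-tuning idea described above is given below; the frozen transformer backbone, prompt count, and the two output heads are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of prompt fine-tuning on a frozen audio-visual encoder.
# The transformer backbone, prompt count, and the two heads below are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class PromptedAVSR(nn.Module):
    def __init__(self, dim=512, n_prompts=16, n_langs=5, vocab=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # stands in for the pre-trained AV model
        for p in self.backbone.parameters():
            p.requires_grad = False                  # backbone stays frozen; only prompts and heads train
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.lang_head = nn.Linear(dim, n_langs)     # language-class prediction
        self.asr_head = nn.Linear(dim, vocab)        # per-frame token logits (e.g. for a CTC loss)

    def forward(self, av_feats):                     # av_feats: (B, T, dim) fused audio-visual features
        prompts = self.prompts.unsqueeze(0).expand(av_feats.size(0), -1, -1)
        x = self.backbone(torch.cat([prompts, av_feats], dim=1))
        lang_logits = self.lang_head(x[:, 0])        # first prompt slot summarizes the language
        asr_logits = self.asr_head(x[:, prompts.size(1):])
        return lang_logits, asr_logits

model = PromptedAVSR()
lang, tokens = model(torch.randn(2, 75, 512))        # two clips, 75 frames each
print(lang.shape, tokens.shape)                      # torch.Size([2, 5]) torch.Size([2, 75, 1000])
```

Only the prompt tokens and the two heads receive gradients in this sketch, which keeps the multilingual adaptation lightweight.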
Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
Talking face generation is the challenging task of synthesizing a natural and
realistic face that is accurately synchronized with the given audio. Due
to co-articulation, in which an isolated phone is influenced by the preceding or
following phones, the articulation of a phone varies with its phonetic context.
Therefore, modeling lip motion with phonetic context can generate more
spatio-temporally aligned lip movement. In this respect, we investigate the
role of phonetic context in generating lip motion for talking face generation. We
propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages
phonetic context to generate lip movement for the target face. CALS is composed
of an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained
based on masked learning to map each phone to a contextualized lip motion unit.
The contextualized lip motion unit then guides the latter in synthesizing a
target identity with context-aware lip motion. From extensive experiments, we
verify that simply exploiting the phonetic context in the proposed CALS
framework effectively enhances spatio-temporal alignment. We also demonstrate
the extent to which the phonetic context assists in lip synchronization and
find the effective window size for lip generation to be approximately 1.2
seconds. Comment: Accepted at ICASSP 202
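The ~1.2-second effective context window can be illustrated with a small sketch that gathers a phonetic-context window of audio features around each video frame before lip-motion prediction; the 25 fps rate and feature size are assumptions for illustration only.

```python
# A small sketch of gathering a phonetic-context window of audio features around
# each video frame before predicting its lip motion. The 25 fps rate and feature
# size are assumptions used only to illustrate the ~1.2 s context reported above.
import torch
import torch.nn.functional as F

def context_windows(features: torch.Tensor, fps: int = 25, context_sec: float = 1.2):
    """features: (T, D) per-frame audio features -> (T, W, D) context windows, zero-padded at the edges."""
    half = int(round(context_sec * fps / 2))         # ~15 frames on each side at 25 fps
    padded = F.pad(features, (0, 0, half, half))     # pad only the time axis
    return torch.stack([padded[t:t + 2 * half + 1] for t in range(features.size(0))])

windows = context_windows(torch.randn(75, 80))       # 3 s of 80-dim features
print(windows.shape)                                 # torch.Size([75, 31, 80])
```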
Reprogramming Audio-driven Talking Face Synthesis into Text-driven
In this paper, we propose a method to reprogram pre-trained audio-driven
talking face synthesis models to be able to operate with text inputs. As the
audio-driven talking face synthesis model takes speech audio as input,
generating a talking avatar with the desired speech content requires recording
the speech in advance. However, recording audio for every video to be generated
is burdensome. To alleviate this
problem, we propose a novel method that embeds input text into the learned
audio latent space of the pre-trained audio-driven model. To this end, we
design a Text-to-Audio Embedding Module (TAEM) that learns to map a given text
input to the audio latent features. Moreover, to model the speaker
characteristics contained in the audio features, we propose to inject a visual
speaker embedding, obtained from a single face image, into the TAEM.
After training, we can synthesize talking face videos with either text or
speech audio.
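A minimal sketch of a text-to-audio embedding module in the spirit of TAEM is shown below; the GRU text encoder, the dimensions, and the additive speaker injection are illustrative assumptions. In practice, such a module would be trained to match the latents of the frozen audio encoder, e.g. with a reconstruction loss.

```python
# A minimal sketch of a text-to-audio embedding module in the spirit of TAEM.
# The GRU text encoder, the dimensions, and the additive speaker injection are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class TextToAudioEmbedding(nn.Module):
    def __init__(self, vocab=100, dim=256, spk_dim=512, audio_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.text_enc = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.spk_proj = nn.Linear(spk_dim, 2 * dim)  # visual speaker embedding from a single face image
        self.to_audio = nn.Linear(2 * dim, audio_dim)

    def forward(self, tokens, spk_embed):            # tokens: (B, L), spk_embed: (B, spk_dim)
        h, _ = self.text_enc(self.embed(tokens))     # (B, L, 2*dim) contextual text features
        h = h + self.spk_proj(spk_embed).unsqueeze(1)  # inject speaker characteristics
        return self.to_audio(h)                      # predicted audio latent sequence (B, L, audio_dim)

taem = TextToAudioEmbedding()
latents = taem(torch.randint(0, 100, (2, 20)), torch.randn(2, 512))
print(latents.shape)                                 # torch.Size([2, 20, 512])
```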
DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion
Speech-driven 3D facial animation has gained significant attention for its
ability to create realistic and expressive facial animations in 3D space based
on speech. Learning-based methods have shown promising progress in achieving
accurate facial motion synchronized with speech. However, the one-to-many nature of
speech-to-3D facial synthesis has not been fully explored: while the lips
accurately synchronize with the speech content, facial attributes beyond
speech-related motions vary with respect to the same speech. To account for
this potential variance in facial attributes within a single speech input, we
propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis framework.
DF-3DFace captures the complex one-to-many relationship between speech and 3D
faces using diffusion. It concurrently achieves aligned lip motion by
exploiting audio-mesh synchronization and masked conditioning. Furthermore, the
proposed method jointly models identity and pose in addition to facial motions
so that it can generate 3D face animation without requiring a reference
identity mesh and produce natural head poses. We contribute a new large-scale
3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in
identities, poses, and facial motions of 3D face meshes. Extensive experiments
demonstrate that our method successfully generates highly variable facial
shapes and motions from speech and simultaneously achieves more realistic
facial animation than the state-of-the-art methods.
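One diffusion training step for speech-conditioned mesh motion, with the masked conditioning mentioned above, might look roughly like the following sketch; the noise schedule, the MLP denoiser, and the mesh dimensionality are illustrative assumptions rather than the DF-3DFace architecture.

```python
# A rough sketch of one diffusion training step for speech-conditioned mesh
# motion with masked conditioning. The noise schedule, the MLP denoiser, and the
# mesh dimensionality are illustrative assumptions, not the DF-3DFace model.
import torch
import torch.nn as nn

T_STEPS = 1000
MESH_DIM = 1404                                      # flattened per-frame vertex offsets (assumed size)
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
denoiser = nn.Sequential(nn.Linear(MESH_DIM + 512 + 1, 1024), nn.SiLU(), nn.Linear(1024, MESH_DIM))

def training_step(motion, audio_feat, p_mask=0.1):
    """motion: (B, MESH_DIM) target offsets, audio_feat: (B, 512) speech condition."""
    B = motion.size(0)
    t = torch.randint(0, T_STEPS, (B,))
    noise = torch.randn_like(motion)
    ab = alpha_bar[t].unsqueeze(1)
    noisy = ab.sqrt() * motion + (1 - ab).sqrt() * noise          # forward (noising) process
    cond = audio_feat * (torch.rand(B, 1) > p_mask).float()       # randomly mask the audio condition
    inp = torch.cat([noisy, cond, t.float().unsqueeze(1) / T_STEPS], dim=1)
    return ((denoiser(inp) - noise) ** 2).mean()                  # simple epsilon-prediction loss

print(training_step(torch.randn(4, MESH_DIM), torch.randn(4, 512)).item())
```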
Persona Extraction Through Semantic Similarity for Emotional Support Conversation Generation
Providing emotional support through dialogue systems is becoming increasingly
important in today's world, as it can support both mental health and social
interactions in many conversation scenarios. Previous works have shown that
using persona is effective for generating empathetic and supportive responses.
They have often relied on pre-provided personas rather than inferring them
during conversations. However, it is not always possible to obtain a user
persona before the conversation begins. To address this challenge, we propose
PESS (Persona Extraction through Semantic Similarity), a novel framework that
can automatically infer informative and consistent persona from dialogues. We
devise a completeness loss and a consistency loss based on semantic similarity
scores. The completeness loss encourages the model to generate missing persona
information, and the consistency loss guides the model to distinguish between
consistent and inconsistent persona. Our experimental results demonstrate that
high-quality persona information inferred by PESS is effective in generating
emotionally supportive responses. Comment: Accepted by ICASSP202
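The two losses can be sketched with semantic-similarity scores over sentence embeddings as below; the margin, the embedding source, and the exact formulation are illustrative assumptions rather than the PESS objectives.

```python
# A sketch of semantic-similarity-based persona losses in the spirit of PESS:
# a completeness term that rewards covering every reference persona sentence and
# a margin-based consistency term. The embeddings, margin, and exact formulation
# are illustrative assumptions, not the paper's objectives.
import torch
import torch.nn.functional as F

def completeness_loss(gen_emb, ref_emb):
    """gen_emb: (G, D) generated persona embeddings, ref_emb: (R, D) reference ones."""
    sim = F.normalize(ref_emb, dim=-1) @ F.normalize(gen_emb, dim=-1).t()  # (R, G) cosine similarities
    covered = sim.max(dim=1).values              # how well each reference sentence is covered
    return (1.0 - covered).mean()                # low when every reference fact is generated

def consistency_loss(anchor, consistent, inconsistent, margin=0.3):
    pos = F.cosine_similarity(anchor, consistent, dim=-1)
    neg = F.cosine_similarity(anchor, inconsistent, dim=-1)
    return F.relu(margin - pos + neg).mean()     # keep consistent pairs above inconsistent ones

print(completeness_loss(torch.randn(3, 256), torch.randn(5, 256)).item())
print(consistency_loss(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256)).item())
```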
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
The challenge of talking face generation from speech lies in aligning two
different modalities, audio and video, such that the mouth region
corresponds to input audio. Previous methods either exploit audio-visual
representation learning or leverage intermediate structural information such as
landmarks and 3D models. However, they struggle to synthesize fine details of
the lips varying at the phoneme level as they do not sufficiently provide
visual information of the lips at the video synthesis step. To overcome this
limitation, our work proposes Audio-Lip Memory that brings in visual
information of the mouth region corresponding to input audio and enforces
fine-grained audio-visual coherence. It stores lip motion features from
sequential ground truth images in the value memory and aligns them with
corresponding audio features so that they can be retrieved using audio input at
inference time. Therefore, using the retrieved lip motion features as visual
hints, it can easily correlate audio with visual dynamics in the synthesis
step. By analyzing the memory, we demonstrate that unique lip features are
stored in each memory slot at the phoneme level, capturing subtle lip motion
based on memory addressing. In addition, we introduce a visual-visual
synchronization loss which can enhance lip-syncing performance when used along
with audio-visual synchronization loss in our model. Extensive experiments are
performed to verify that our method generates high-quality video with mouth
shapes that best align with the input audio, outperforming previous
state-of-the-art methods. Comment: Accepted at AAAI 2022 (Oral)
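The key-value memory idea can be illustrated with a short sketch in which audio features address a key memory and retrieve stored lip-motion features from a value memory; the slot count and dimensions are illustrative assumptions.

```python
# A minimal sketch of key-value memory addressing in the spirit of Audio-Lip
# Memory: audio features address a key memory, and the resulting weights retrieve
# stored lip-motion features from a value memory. Slot count and dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn

class AudioLipMemory(nn.Module):
    def __init__(self, slots=64, audio_dim=512, lip_dim=512):
        super().__init__()
        self.key_mem = nn.Parameter(torch.randn(slots, audio_dim))    # addressed by audio features
        self.value_mem = nn.Parameter(torch.randn(slots, lip_dim))    # stores lip-motion features

    def forward(self, audio_feat):                                    # audio_feat: (B, audio_dim)
        addr = torch.softmax(audio_feat @ self.key_mem.t(), dim=-1)   # (B, slots) soft addressing weights
        return addr @ self.value_mem, addr                            # retrieved lip features and weights

mem = AudioLipMemory()
lip_feat, weights = mem(torch.randn(2, 512))
print(lip_feat.shape, weights.shape)                  # torch.Size([2, 512]) torch.Size([2, 64])
```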
The minimum clinically important difference of the incremental shuttle walk test in bronchiectasis: a prospective cohort study.
The incremental shuttle walk test (ISW) is an externally-paced field walking test that measures maximal exercise capacity [1] and is widely used in patients with chronic obstructive pulmonary disease (COPD) undergoing pulmonary rehabilitation (PR). Its psychometric properties, including reliability, construct validity [2] and responsiveness to intervention [2-5], have been demonstrated in patients with bronchiectasis, but few data exist on the minimum clinically important difference (MCID). Although two studies have investigated the MCID of the ISW in patients with bronchiectasis, the generalisability of these data is limited because of the study sample characteristics [6] or because the intervention was not exercise-based [2]. The MCID enables clinicians and researchers to understand the clinical significance of change data and forms an important part of the evidence required by regulatory agencies for approval for use in clinical trials. Accordingly, the aim of this study was to provide MCID estimates of the ISW in response to intervention, namely PR, in patients with bronchiectasis.
The determination of dark adaptation time using electroretinography in conscious Miniature Schnauzer dogs
The optimal dark adaptation time for electroretinograms (ERGs) performed on conscious dogs was determined using a commercially available ERG unit with a contact lens electrode and a built-in light source (LED electrode). ERG recordings were performed on nine healthy Miniature Schnauzer dogs. Bilateral ERGs were recorded at seven different dark adaptation times at an intensity of 2.5 cd·s/m². Signal averaging (4 flashes of light stimuli) was adopted to reduce electrophysiologic noise. As the dark adaptation time increased, a significant increase in the mean a-wave amplitude was observed in comparison to baseline levels up to 10 min (p < 0.05). Thereafter, no significant differences in amplitude occurred over the dark adaptation time. Moreover, at this time the mean amplitude was 60.30 ± 18.47 µV. However, no significant changes were observed in the implicit times of the a-wave. The implicit times and amplitude of the b-wave increased significantly up to 20 min of dark adaptation (p < 0.05). Beyond this time, the mean b-wave amplitude was 132.92 ± 17.79 µV. The results of the present study demonstrate that the optimal dark adaptation time when performing ERGs in conscious Miniature Schnauzer dogs should be at least 20 min.
Serial Examination of an Inducible and Reversible Dilated Cardiomyopathy in Individual Adult Drosophila
Recent work has demonstrated that Drosophila can be used as a model of dilated cardiomyopathy, defined as an enlarged cardiac chamber at end-diastole, when the heart is fully relaxed, and impaired systolic function when the heart is fully contracted. Gene mutations that cause cardiac dysfunction in adult Drosophila can result from abnormalities in cardiac development or alterations in post-developmental heart function. To clarify the contribution of transgene expression to post-developmental cardiac abnormalities, we applied strategies to examine the temporal and spatial effects of transgene expression on cardiac function. We engineered transgenic Drosophila based on the well-characterized temperature-sensitive Gal80 protein in the context of the bipartite Gal4/UAS transgenic expression system in Drosophila, employing the cardiac-specific driver tinCΔ4-Gal4. We then developed a strategy using optical coherence tomography to serially measure cardiac function in individual flies over a time course of several days. As a proof of concept, we examined the effects of expressing a human mutant delta-sarcoglycan associated with familial heart failure and observed a reversible, post-developmental dilated cardiomyopathy in Drosophila. Our results show that this unique imaging strategy, based on the non-destructive, non-invasive properties of optical coherence tomography, can be applied to serially examine cardiac function in individual adult flies. Furthermore, the induction and reversal of cardiac transgene expression can be investigated in adult flies, thereby providing insight into the post-developmental effects of transgene expression.