
    DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

    Recent research has demonstrated impressive results in video-to-speech synthesis, which reconstructs speech solely from visual input. However, previous works have struggled to synthesize speech accurately because the model lacks sufficient guidance to infer the correct content with the appropriate sound. To resolve this, they adopted an extra speaker embedding, derived from reference audio, as speaking-style guidance. Nevertheless, the corresponding audio is not always available for a given video input, especially at inference time. In this paper, we present a novel vision-guided speaker embedding extractor using a self-supervised pre-trained model and a prompt tuning technique. In doing so, rich speaker embedding information can be produced solely from the input visual information, and no extra audio is necessary at inference time. Using the extracted vision-guided speaker embeddings, we further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video. The proposed DiffV2S not only maintains the phoneme details contained in the input video frames but also creates a highly intelligible mel-spectrogram in which the identities of multiple speakers are all preserved. Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
    Comment: ICCV 202
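The core idea above, conditioning a diffusion denoiser on a speaker embedding extracted purely from visual features, can be illustrated with a toy sketch. This is not the DiffV2S implementation: the pooling-based `extract_speaker_embedding`, the linear "denoiser", and the noise schedule are all hypothetical stand-ins for the paper's trained networks, kept only to show where the vision-derived condition enters the reverse diffusion loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_speaker_embedding(visual_feats: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the vision-guided speaker-embedding
    extractor: pool frame-level visual features into a single vector."""
    return visual_feats.mean(axis=0)

def denoise_step(x_t, t, cond, weights):
    """Toy conditional 'denoiser': a linear map of [x_t, cond, t].
    A real model would be a neural network trained with a DDPM objective."""
    inp = np.concatenate([x_t, cond, [t]])
    eps_hat = weights @ inp          # predicted noise, given the condition
    alpha = 0.99                     # toy per-step noise-schedule value
    return (x_t - (1 - alpha) * eps_hat) / np.sqrt(alpha)

# Shapes (illustrative): 10 video frames of 64-d features; 80 mel bins.
visual = rng.normal(size=(10, 64))
spk = extract_speaker_embedding(visual)            # vision-guided speaker embedding
cond = np.concatenate([spk, visual[-1]])           # speaker + visual condition

W = rng.normal(size=(80, 80 + cond.size + 1)) * 0.01

x = rng.normal(size=80)              # start the reverse process from Gaussian noise
for t in range(50, 0, -1):
    x = denoise_step(x, t / 50.0, cond, W)

print(x.shape)  # (80,) — one generated mel-spectrogram frame
```

The point of the sketch is structural: because `cond` is built only from `visual`, no reference audio is needed at inference time, which is the constraint the abstract highlights.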

    SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

    The challenge of talking face generation from speech lies in aligning information from two different modalities, audio and video, such that the mouth region corresponds to the input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to synthesize the fine details of the lips, which vary at the phoneme level, because they do not provide sufficient visual information about the lips at the video synthesis step. To overcome this limitation, our work proposes Audio-Lip Memory, which brings in visual information of the mouth region corresponding to the input audio and enforces fine-grained audio-visual coherence. It stores lip motion features from sequential ground-truth images in a value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time. Using the retrieved lip motion features as visual hints, the model can then easily correlate audio with visual dynamics in the synthesis step. By analyzing the memory, we demonstrate that unique lip features are stored in each memory slot at the phoneme level, capturing subtle lip motion through memory addressing. In addition, we introduce a visual-visual synchronization loss that enhances lip-syncing performance when used alongside the audio-visual synchronization loss in our model. Extensive experiments verify that our method generates high-quality video with mouth shapes that best align with the input audio, outperforming previous state-of-the-art methods.
    Comment: Accepted at AAAI 2022 (Oral)
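The key–value retrieval described above, audio features addressing a memory whose slots hold lip-motion features, follows the standard soft memory-addressing pattern. The sketch below is a minimal illustration under assumed shapes, not the paper's trained memory: slot count, feature dimensions, and the random memory contents are all placeholders; in the actual model the key and value memories are learned jointly with alignment losses.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical shapes: 8 memory slots, 16-d audio keys, 32-d lip-motion values.
num_slots, key_dim, val_dim = 8, 16, 32
key_memory = rng.normal(size=(num_slots, key_dim))    # addressed by audio features
value_memory = rng.normal(size=(num_slots, val_dim))  # stores lip-motion features

def retrieve_lip_features(audio_feat: np.ndarray) -> np.ndarray:
    """Soft memory addressing: score each key slot against the audio
    feature, then read out a weighted sum of the stored lip-motion values."""
    scores = key_memory @ audio_feat / np.sqrt(key_dim)  # scaled similarity
    attn = softmax(scores)           # addressing weights over the slots
    return attn @ value_memory       # retrieved visual hint for synthesis

audio_feat = rng.normal(size=key_dim)
lip_hint = retrieve_lip_features(audio_feat)
print(lip_hint.shape)  # (32,)
```

At inference time only audio is needed to address the memory, which is how the method supplies ground-truth-like visual hints without access to target lip images.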

    The Properties of Microwave-Assisted Synthesis of Metal–Organic Frameworks and Their Applications

    Metal–organic frameworks (MOF) are a class of porous materials with various functions based on their host-guest chemistry. Their selectivity, diffusion kinetics, and catalytic activity are influenced by their design and synthetic procedure. The synthesis of different MOFs has been of considerable interest during the past decade thanks to their various applications in the arena of sensors, catalysts, adsorption, and electronic devices. Among the different techniques for the synthesis of MOFs, such as the solvothermal, sonochemical, ionothermal, and mechanochemical processes, microwave-assisted synthesis has clinched a significant place in MOF synthesis. The main assets of microwave-assisted synthesis are the short reaction time, the fast rate of nucleation, and the modified properties of MOFs. The review encompasses the development of the microwave-assisted synthesis of MOFs, their properties, and their applications in various fields