DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding
Recent research has demonstrated impressive results in video-to-speech
synthesis which involves reconstructing speech solely from visual input.
However, previous works have struggled to synthesize speech accurately due to a
lack of sufficient guidance for the model to infer the correct content with the
appropriate sound. To resolve this issue, they have adopted an extra speaker
embedding, derived from reference audio, as speaking-style guidance.
Nevertheless, it is not always possible to obtain such audio from
the corresponding video input, especially at inference time. In this
paper, we present a novel vision-guided speaker embedding extractor using a
self-supervised pre-trained model and prompt tuning technique. In doing so, the
rich speaker embedding information can be produced solely from input visual
information, and the extra audio information is not necessary during the
inference time. Using the extracted vision-guided speaker embedding
representations, we further develop a diffusion-based video-to-speech synthesis
model, called DiffV2S, conditioned on these speaker embeddings and the
visual representation extracted from the input video. The proposed DiffV2S not
only maintains phoneme details contained in the input video frames, but also
creates a highly intelligible mel-spectrogram in which the speaker identities
of the multiple speakers are all preserved. Our experimental results show that
DiffV2S achieves state-of-the-art performance compared to previous
video-to-speech synthesis techniques.
Comment: ICCV 202
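The prompt-tuning idea behind the vision-guided extractor (a frozen self-supervised encoder whose input is prepended with learnable prompt tokens, so a speaker embedding can be produced from visual features alone) can be sketched minimally as follows. The dimensions, the stand-in encoder, and all variable names here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
D = 16  # feature dimension
T = 8   # number of video frames
P = 4   # number of learnable prompt tokens

def frozen_encoder(tokens):
    """Stand-in for a frozen self-supervised visual encoder:
    a fixed random linear map followed by tanh and mean pooling."""
    W = np.random.default_rng(42).normal(size=(D, D))  # frozen weights
    return np.tanh(tokens @ W).mean(axis=0)            # -> (D,)

frame_feats = rng.normal(size=(T, D))      # visual features from video frames
prompts = rng.normal(size=(P, D)) * 0.01   # learnable prompts (trained in practice)

# Prompt tuning: prepend learnable prompt tokens to the frozen encoder's
# input; during training only `prompts` would receive gradients.
speaker_emb = frozen_encoder(np.concatenate([prompts, frame_feats], axis=0))
print(speaker_emb.shape)  # (16,)
```

In an actual system the resulting embedding would then condition the diffusion decoder alongside the visual representation; here it is simply a pooled vector.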
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
The challenge of talking face generation from speech lies in aligning two
different modalities, audio and video, such that the mouth region
corresponds to input audio. Previous methods either exploit audio-visual
representation learning or leverage intermediate structural information such as
landmarks and 3D models. However, they struggle to synthesize fine details of
the lips varying at the phoneme level as they do not sufficiently provide
visual information of the lips at the video synthesis step. To overcome this
limitation, our work proposes Audio-Lip Memory that brings in visual
information of the mouth region corresponding to input audio and enforces
fine-grained audio-visual coherence. It stores lip motion features from
sequential ground truth images in the value memory and aligns them with
corresponding audio features so that they can be retrieved using audio input at
inference time. Therefore, using the retrieved lip motion features as visual
hints, it can easily correlate audio with visual dynamics in the synthesis
step. By analyzing the memory, we demonstrate that unique lip features are
stored in each memory slot at the phoneme level, capturing subtle lip motion
based on memory addressing. In addition, we introduce a visual-visual
synchronization loss which enhances lip-syncing performance when used along
with audio-visual synchronization loss in our model. Extensive experiments are
performed to verify that our method generates high-quality video with mouth
shapes that best align with the input audio, outperforming previous
state-of-the-art methods.
Comment: Accepted at AAAI 2022 (Oral)
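The Audio-Lip Memory described above amounts to a soft key-value lookup: audio-aligned keys address memory slots, and the attention-weighted lip-motion values are returned as visual hints for synthesis. The following is a minimal sketch under assumed dimensions, with random placeholder memory contents, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper).
S = 32   # number of memory slots
Da = 24  # audio feature dimension (keys)
Dv = 48  # lip-motion feature dimension (values)

key_memory = rng.normal(size=(S, Da))    # audio-aligned keys
value_memory = rng.normal(size=(S, Dv))  # stored lip-motion features

def retrieve(audio_feat, keys, values, temperature=1.0):
    """Soft memory addressing: score each slot against the audio query,
    softmax the scores, and return the weighted sum of lip-motion values."""
    scores = keys @ audio_feat / temperature   # (S,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over slots
    return weights @ values                    # (Dv,) retrieved visual hint

audio_query = rng.normal(size=(Da,))
lip_hint = retrieve(audio_query, key_memory, value_memory)
print(lip_hint.shape)  # (48,)
```

At training time the values would be written from ground-truth lip-motion features and the keys aligned with audio; at inference only the audio query is needed, mirroring the abstract's retrieval-by-audio description.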
The Properties of Microwave-Assisted Synthesis of Metal–Organic Frameworks and Their Applications
Metal–organic frameworks (MOFs) are a class of porous materials with various functions based on their host-guest chemistry. Their selectivity, diffusion kinetics, and catalytic activity are influenced by their design and synthetic procedure. The synthesis of different MOFs has attracted considerable interest during the past decade thanks to their various applications in sensors, catalysts, adsorption, and electronic devices. Among the different techniques for synthesizing MOFs, such as the solvothermal, sonochemical, ionothermal, and mechanochemical processes, microwave-assisted synthesis has secured a significant place. The main assets of microwave-assisted synthesis are the short reaction time, the fast rate of nucleation, and the modified properties of the resulting MOFs. This review encompasses the development of the microwave-assisted synthesis of MOFs, their properties, and their applications in various fields.