14 research outputs found

    Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model

    Full text link
    The objective of this work is to extract the target speaker's voice from a mixture of voices using visual cues. Existing works on audio-visual speech separation have demonstrated their performance with promising intelligibility, but maintaining naturalness remains a challenge. To address this issue, we propose AVDiffuSS, an audio-visual speech separation model based on a diffusion mechanism known for its capability in generating natural samples. For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism. This mechanism is specifically tailored for the speech domain to integrate the phonetic information from audio-visual correspondence in speech generation. In this way, the fusion process maintains the high temporal resolution of the features without excessive computational requirements. We demonstrate that the proposed framework achieves state-of-the-art results on two benchmarks, VoxCeleb2 and LRS3, producing speech with notably better naturalness.
    Comment: Project page with demo: https://mm.kaist.ac.kr/projects/avdiffuss
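    The cross-attention fusion described above can be illustrated with a minimal sketch: audio frames act as queries and video frames as keys/values, so each audio frame is aligned with the relevant visual frames while keeping the audio's higher temporal resolution. This is my illustration under assumed names and shapes, not the authors' code; the random projection matrices stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(audio_feats, visual_feats, d_k=64, seed=0):
    """Fuse visual cues into audio features via cross-attention.

    audio_feats:  (T_a, D) per-frame audio embeddings (queries)
    visual_feats: (T_v, D) per-frame visual embeddings (keys/values)
    Returns fused features of shape (T_a, d_k), one per audio frame.
    """
    rng = np.random.default_rng(seed)
    D = audio_feats.shape[1]
    # Random projections stand in for learned weight matrices.
    W_q = rng.standard_normal((D, d_k)) / np.sqrt(D)
    W_k = rng.standard_normal((D, d_k)) / np.sqrt(D)
    W_v = rng.standard_normal((D, d_k)) / np.sqrt(D)
    Q, K, V = audio_feats @ W_q, visual_feats @ W_k, visual_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T_a, T_v) audio-to-video alignment
    return attn @ V                         # fused features at audio frame rate

# Audio typically runs at a higher frame rate than video (e.g. 100 Hz vs 25 fps),
# so the output keeps the audio's temporal resolution.
audio = np.random.default_rng(1).standard_normal((100, 128))
video = np.random.default_rng(2).standard_normal((25, 128))
fused = cross_attention_fuse(audio, video)
print(fused.shape)  # (100, 64)
```

    Note the cost is O(T_a · T_v), which stays modest because the video sequence is short relative to the audio one.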

    That's What I Said: Fully-Controllable Talking Face Generation

    Full text link
    The goal of this paper is to synthesise talking faces with controllable facial motions. To achieve this goal, we propose two key ideas. The first is to establish a canonical space where every face has the same motion patterns but different identities. The second is to navigate a multimodal motion space that only represents motion-related features while eliminating identity information. To disentangle identity and motion, we introduce an orthogonality constraint between the two different latent spaces. From this, our method can generate natural-looking talking faces with fully controllable facial attributes and accurate lip synchronisation. Extensive experiments demonstrate that our method achieves state-of-the-art results in terms of both visual quality and lip-sync score. To the best of our knowledge, we are the first to develop a talking face generation framework that can accurately manifest full target facial motions, including lip, head pose, and eye movements, in the generated video without any additional supervision beyond RGB video with audio.

    [Korean-language research output; title and abstract unrecoverable due to character-encoding corruption]

    No full text

    Search prospects for axionlike particles at rare nuclear isotope accelerator facilities

    No full text
    We propose a novel experimental scheme, called DAMSA (Dump-produced Aboriginal Matter Searches at an Accelerator), for searching for dark-sector particles, using rare nuclear isotope accelerator facilities that provide high-flux proton beams to produce a large number of rare nuclear isotopes. The high-intensity nature of their beams enables the investigation of dark-sector particles, including axionlike particles (ALPs) and dark photons. By contrast, their typical beam energies are not large enough to produce backgrounds such as neutrinos from secondary charged particles. The DAMSA detector is then placed immediately downstream of the proton beam dump to maximize the prompt decay signals of dark-sector particles, which are often challenging to probe in other beam-dump-type experiments featuring a longer baseline, albeit at the expense of an enormous beam-related neutron background. We demonstrate that beam-related neutrons can be significantly suppressed if the signal accompanies multiple, correlated visible particles in the final state. We show that the close proximity of the detector to the ALP production dump makes it possible to probe a high-mass region of ALP parameter space that existing experiments have never explored.
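    Why close proximity targets the high-mass region can be seen from the ALP lab-frame decay length. Using the standard two-photon width Γ(a→γγ) = g²m³/(64π) and L = βγ·ħc/Γ, heavier and more strongly coupled ALPs decay within metres of the dump, so only a detector placed immediately downstream can catch them. The parameter values below are illustrative, not taken from the paper.

```python
import math

HBAR_C = 1.973e-14  # GeV·cm

def alp_decay_length_cm(m_gev, g_inv_gev, e_gev):
    """Lab-frame decay length of an ALP decaying to two photons.

    Width:  Γ(a→γγ) = g² m³ / (64π)   with g in GeV⁻¹, m in GeV
    Length: L = βγ · ħc / Γ,  where βγ = p/m = sqrt(E² − m²)/m
    """
    width = g_inv_gev**2 * m_gev**3 / (64 * math.pi)  # GeV
    beta_gamma = math.sqrt(e_gev**2 - m_gev**2) / m_gev
    return beta_gamma * HBAR_C / width

# A heavy, strongly coupled ALP decays centimetres from the dump (prompt),
# while a light, weakly coupled one travels thousands of kilometres.
print(f"{alp_decay_length_cm(0.1, 1e-4, 1.0):.3g} cm")
print(f"{alp_decay_length_cm(0.01, 1e-6, 1.0):.3g} cm")
```

    Long-baseline beam-dump experiments lose the first kind of ALP because it decays before reaching the detector, which is exactly the region a near-dump detector like DAMSA can cover.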