10 research outputs found

    Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

    Full text link
    Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to non-realistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to thoroughly exploit the contextual information provided by the audio and visual modalities with delicately designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling the masked parts. The fusion uniformly attends to the textural information of the unmasked regions and the reference frame, and the semantic audio information is then injected to enhance the self-attention computation. Additionally, a refinement network with audio injection improves both image and lip-sync quality. Extensive experiments validate that our model can generate high-fidelity lip-synced results for arbitrary subjects. Comment: Accepted to SIGGRAPH Asia 2022 (Conference Proceedings). Project page: https://hangz-nju-cuhk.github.io/projects/AV-CA
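    The attention-based fusion described above can be pictured as masked mouth-region tokens attending to everything they are allowed to see. Below is a minimal sketch of such a fusion block, assuming PyTorch; the class name, module sizes, and token counts are illustrative assumptions, not the authors' AV-CAT implementation.

```python
import torch
import torch.nn as nn

class MaskedRegionFusion(nn.Module):
    """Toy cross-attention fusion: masked mouth-region tokens query the
    concatenation of unmasked-region, reference-frame, and audio tokens.
    All sizes are illustrative, not taken from the paper."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, masked_tok, unmasked_tok, ref_tok, audio_tok):
        # The masked region queries the visual and audio context jointly.
        context = torch.cat([unmasked_tok, ref_tok, audio_tok], dim=1)
        fused, _ = self.attn(self.norm1(masked_tok), context, context)
        x = masked_tok + fused                 # residual connection
        return x + self.ff(self.norm2(x))      # position-wise refinement

# Illustrative call: 64 masked tokens, 192 unmasked, 64 reference, 50 audio tokens.
fusion = MaskedRegionFusion()
out = fusion(torch.randn(2, 64, 256), torch.randn(2, 192, 256),
             torch.randn(2, 64, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 64, 256])
```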

    Make Your Brief Stroke Real and Stereoscopic: 3D-Aware Simplified Sketch to Portrait Generation

    Full text link
    Creating photo-realistic versions of people's sketched portraits is useful for various entertainment purposes. Existing studies only generate portraits in the 2D plane with fixed views, making the results less vivid. In this paper, we present Stereoscopic Simplified Sketch-to-Portrait (SSSP), which explores the possibility of creating stereoscopic 3D-aware portraits from simple contour sketches by involving 3D generative models. Our key insight is to design sketch-aware constraints that can fully exploit the prior knowledge of a tri-plane-based 3D-aware generative model. Specifically, our region-aware volume rendering strategy and global consistency constraint further enhance detail correspondences during sketch encoding. Moreover, to make the system accessible to lay users, we propose a Contour-to-Sketch module with vector-quantized representations, so that easily drawn contours can directly guide the generation of 3D portraits. Extensive comparisons show that our method generates high-quality results that match the sketch. Our usability study verifies that our system is greatly preferred by users. Comment: Project page on https://hangz-nju-cuhk.github.io
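    One way to picture the sketch-aware constraints mentioned above is as an edge-matching penalty restricted to the sketched facial region. The sketch below is a loose illustration under that assumption, using PyTorch, a Sobel edge approximation, and a hypothetical binary region mask; the actual SSSP constraints operate on a tri-plane 3D-aware generator and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Sobel edge magnitude of a batch of RGB images, shape (B, 3, H, W)."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def region_aware_sketch_loss(rendered, sketch, region_mask):
    """Penalize edge mismatch only inside the sketched region.
    `region_mask` is a hypothetical (B, 1, H, W) binary mask, not part of the paper."""
    err = (sobel_edges(rendered) - sketch).abs() * region_mask
    return err.sum() / region_mask.sum().clamp(min=1.0)
```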

    L2 learners’ pronunciation of English phonetic sounds: An acoustic analysis with software Praat

    No full text
    Pronouncing English sounds correctly is not an easy task for second language (L2) learners because of the influence of their mother tongue. Empirical studies based on first language (L1) interference have investigated L2 learners' pronunciation problems. However, these studies rarely focused on students' development in pronunciation, and their results lack validity and reliability because they relied solely on L2 English teachers as pronunciation assessors. The present study, using the acoustic software Praat as the instrument and taking a native speaker as the comparison, investigated Chinese L2 English learners' problems and improvement in pronouncing the English sounds that do not have exact counterparts in Chinese. Data analysis revealed that the participants showed different degrees of pronunciation accuracy with the target English sounds; that their mispronunciations of consonants were mainly due to missing voicing and incorrect manners and places of articulation, while their mispronunciations of vowels were attributed to improper tongue position, mouth opening, and diphthongization; and that higher-proficiency students tended to have greater pronunciation accuracy. The findings were discussed with reference to the literature, and pedagogical implications were provided at the end.
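    Since the measurements here hinge on acoustic analysis in Praat, a minimal sketch of how such formant measurements can be scripted is shown below. It assumes the third-party Python interface parselmouth and hypothetical file names and time points; it is not the authors' actual analysis procedure.

```python
# Minimal sketch of Praat-style formant measurement via parselmouth,
# a Python interface to Praat; file names and time points are hypothetical.
import parselmouth

def vowel_formants(wav_path, midpoint_s):
    """Return (F1, F2) in Hz at the vowel midpoint using Burg formant tracking."""
    sound = parselmouth.Sound(wav_path)
    formant = sound.to_formant_burg()              # Praat's "To Formant (burg)..."
    f1 = formant.get_value_at_time(1, midpoint_s)  # first formant
    f2 = formant.get_value_at_time(2, midpoint_s)  # second formant
    return f1, f2

# Compare a learner's vowel against a native-speaker reference recording.
learner = vowel_formants("learner_bit.wav", 0.12)
native = vowel_formants("native_bit.wav", 0.10)
print("learner F1/F2:", learner, "native F1/F2:", native)
```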

    Robust Video Portrait Reenactment via Personalized Representation Quantization

    No full text
    While progress has been made in the field of portrait reenactment, the problem of how to produce high-fidelity and robust videos remains. Recent studies typically find it challenging to handle rarely seen target poses due to the limitation of source data. This paper proposes the Video Portrait via Non-local Quantization Modeling (VPNQ) framework, which produces pose- and disturbance-robust reenactable video portraits. Our key insight is to learn position-invariant quantized local patch representations and to build a mapping between simple driving signals and local textures with non-local spatio-temporal modeling. Specifically, instead of learning a universal quantized codebook, we find that a personalized one can be trained to better preserve the desired position-invariant local details. A simple representation of projected landmarks can then be used as a sufficient driving signal, avoiding 3D rendering. We then employ a carefully designed Spatio-Temporal Transformer to predict reasonable and temporally consistent quantized tokens from the driving signal. The predicted codes can be decoded back into robust and high-quality videos. Comprehensive experiments validate the effectiveness of our approach.
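    The personalized quantized codebook at the heart of this pipeline is essentially a nearest-neighbour vector quantizer over local patch features. The following is a generic PyTorch sketch of that building block (standard VQ-style quantization with a straight-through gradient), not the paper's exact module; sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class PatchCodebook(nn.Module):
    """Generic vector-quantization layer: map each local patch feature to its
    nearest codebook entry. A 'personalized' codebook would simply be trained
    on footage of one subject; sizes here are illustrative."""

    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, z):                                        # z: (B, N, dim)
        # Squared distances from every patch feature to every codebook entry.
        d = (z.unsqueeze(-2) - self.codes.weight).pow(2).sum(-1)  # (B, N, num_codes)
        idx = d.argmin(dim=-1)                                   # nearest code per patch
        z_q = self.codes(idx)
        # Straight-through estimator: pass gradients from z_q back to z.
        z_q = z + (z_q - z).detach()
        return z_q, idx

vq = PatchCodebook()
tokens, indices = vq(torch.randn(2, 196, 256))
print(tokens.shape, indices.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 196])
```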

    The Seventh Visual Object Tracking VOT2019 Challenge Results

    No full text
    The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative. Results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis, as well as the standard VOT methodology for long-term tracking analysis. The VOT2019 challenge was composed of five challenges focusing on different tracking domains: (i) the VOT-ST2019 challenge focused on short-term tracking in RGB, (ii) the VOT-RT2019 challenge focused on "real-time" short-term tracking in RGB, and (iii) VOT-LT2019 focused on long-term tracking, namely coping with target disappearance and reappearance. Two new challenges were introduced: (iv) the VOT-RGBT2019 challenge focused on short-term tracking in RGB and thermal imagery, and (v) the VOT-RGBD2019 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2019, VOT-RT2019 and VOT-LT2019 datasets were refreshed, while new datasets were introduced for VOT-RGBT2019 and VOT-RGBD2019. The VOT toolkit has been updated to support standard short-term tracking, long-term tracking, and tracking with multi-channel imagery. Performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.
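    As a rough illustration of the kind of short-term analysis the VOT methodology builds on, the sketch below computes a per-sequence accuracy (mean overlap between predicted and ground-truth boxes) and counts frames where the target was lost. It is a deliberate simplification under assumed box formats, not the official VOT toolkit or its exact accuracy/robustness/EAO definitions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of axis-aligned boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def sequence_accuracy_and_failures(pred_boxes, gt_boxes):
    """Mean overlap over a sequence plus the number of frames with zero overlap,
    a simplified stand-in for VOT-style accuracy and robustness measures."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float(overlaps.mean()), int((overlaps == 0).sum())
```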