5,012 research outputs found
Text-based Editing of Talking-head Video
Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user has to only edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis
Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to
match pedestrian images of the same identity from different modalities without
annotations. Existing works mainly focus on alleviating the modality gap by
aligning instance-level features of the unlabeled samples. However, the
relationships between cross-modality clusters are not well explored. To this
end, we propose a novel bilateral cluster matching-based learning framework to
reduce the modality gap by matching cross-modality clusters. Specifically, we
design a Many-to-many Bilateral Cross-Modality Cluster Matching (MBCCM)
algorithm through optimizing the maximum matching problem in a bipartite graph.
Then, the matched pairwise clusters utilize shared visible and infrared
pseudo-labels during the model training. Under such a supervisory signal, a
Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework
is proposed to align features jointly at a cluster-level. Meanwhile, the
cross-modality Consistency Constraint (CC) is proposed to explicitly reduce the
large modality discrepancy. Extensive experiments on the public SYSU-MM01 and
RegDB datasets demonstrate the effectiveness of the proposed method, surpassing
state-of-the-art approaches by a large margin of 8.76% mAP on average
- …