
    ChatAnything: Facetime Chat with LLM-Enhanced Personas

    Full text link
    In this technical report, we target generating anthropomorphized personas for LLM-based characters in an online manner, including visual appearance, personality, and tone, from text descriptions alone. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts. We then propose two novel concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD), for diverse voice and appearance generation. For MoV, we run text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the one that best matches the user-provided text description. For MoD, we combine recent popular text-to-image generation techniques with talking head algorithms to streamline the process of generating talking objects. We term the whole framework ChatAnything. With it, users can animate anything with an anthropomorphic persona using just a few text inputs. However, we have observed that the anthropomorphic objects produced by current generative models are often undetectable by pre-trained face landmark detectors, causing face motion generation to fail even when the faces look human-like, because such images are rarely seen during training (i.e., they are out-of-distribution samples). To address this issue, we incorporate pixel-level guidance that infuses human face landmarks during the image generation phase. To benchmark this, we built an evaluation dataset; on it, the face landmark detection rate increases significantly, from 57.0% to 92.5%, allowing automatic face animation based on generated speech content. The code and more results can be found at https://chatanything.github.io/
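    As a rough illustration of the MoV selection step, the sketch below scores a set of pre-defined TTS tones against a persona description using text-embedding similarity. The voice catalog, the embedding model, and the similarity criterion are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of mixture-of-voices (MoV) selection: pick the pre-defined
# tone whose description best matches the user's persona text.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical catalog of pre-defined tones, each with a text description.
VOICE_CATALOG = {
    "warm_female": "a warm, friendly female voice with a gentle pace",
    "deep_male":   "a deep, authoritative male voice",
    "child":       "a high-pitched, energetic child-like voice",
}

model = SentenceTransformer("all-MiniLM-L6-v2")

def select_voice(persona_description: str) -> str:
    """Return the catalog key whose description best matches the persona."""
    names = list(VOICE_CATALOG)
    # Embed the persona text and all voice descriptions in one batch.
    embs = model.encode([persona_description] + [VOICE_CATALOG[n] for n in names])
    query, candidates = embs[0], embs[1:]
    # Cosine similarity of each candidate tone against the persona text.
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query) + 1e-8
    )
    return names[int(np.argmax(sims))]

print(select_voice("a wise old wizard who speaks slowly and gravely"))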

    FTFDNet: Learning to Detect Talking Face Video Manipulation with Tri-Modality Interaction

    Full text link
    DeepFake-based digital facial forgery threatens public media security, and lip manipulation in talking face generation makes fake video detection even harder: because only the lip shape is changed to match the given speech, identity-related facial features are difficult to discriminate in such fake talking face videos. Together with the common neglect of the audio stream as prior knowledge, detection failure on fake talking face videos becomes likely. We find that the optical flow of a fake talking face video is disordered, especially in the lip region, while the optical flow of a real video changes regularly; motion features derived from optical flow are therefore useful for capturing manipulation cues. In this study, we propose a fake talking face detection network (FTFDNet) that incorporates visual, audio, and motion features through an efficient cross-modal fusion (CMF) module. Furthermore, we propose a novel audio-visual attention mechanism (AVAM) to discover more informative features; it can be seamlessly integrated into any audio-visual CNN architecture by modularization. With the additional AVAM, FTFDNet achieves better detection performance than other state-of-the-art DeepFake video detection methods, not only on the established fake talking face detection dataset (FTFDD) but also on the DeepFake video detection datasets DFDC and DF-TIMIT.
    Comment: arXiv admin note: substantial text overlap with arXiv:2203.0517
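    To make the cross-modal fusion idea concrete, here is a minimal PyTorch sketch in the spirit of the CMF/AVAM modules, in which audio and motion token sequences attend to the visual stream before fusion. The dimensions, attention form, and projection are assumptions; the paper's modules may differ substantially.

```python
# Sketch of cross-modal fusion via attention over visual, audio, and motion
# features; not FTFDNet's actual architecture.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse visual, audio, and motion token sequences with cross-attention."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Shared cross-attention layer: a non-visual modality queries the
        # visual stream (keys/values).
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, visual, audio, motion):
        # Each input: (batch, seq_len, dim).
        av, _ = self.attn(audio, visual, visual)   # audio attends to visual
        mv, _ = self.attn(motion, visual, visual)  # motion attends to visual
        fused = torch.cat([visual, av, mv], dim=-1)
        return self.proj(fused)

x = torch.randn(2, 16, 256)
out = CrossModalFusion()(x, x.clone(), x.clone())
print(out.shape)  # torch.Size([2, 16, 256])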

    The Effect of Personal Selling Strategy on Sport Decision at Battle Field Paintball: A Survey of Participants Making Sport Tourism Decisions at Battlefield Paintball

    Get PDF
    Sport tourism is one of the tourism offerings in the city of Bandung. Paintball is a sport tourism activity in the extreme-sports category in Bandung, and Battlefield Paintball is one of its providers. According to the data obtained, visit numbers declined over 2011-2014. To raise visits again in 2015, Battlefield Paintball's management adopted a personal selling strategy consisting of talking to a consumer on the phone, talking face to face, and communicating through text messaging on a mobile phone or through an internet portal. The research is descriptive and verificative, using an explanatory survey method. The purpose of this study is to describe the effect of personal selling on sport decision at Battlefield Paintball. The sample comprises 81 respondents drawn by systematic random sampling, and the data are analyzed with multiple linear regression. The results show a significant influence of personal selling on sport decision. The highest-rated personal selling sub-variable is talking face to face, and the lowest is communication through text messaging on a mobile phone or through an internet portal. For sport decision, the time indicator receives the highest rating compared with the other indicators.
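    The analysis described above reduces to an ordinary multiple linear regression of the sport-decision score on the three personal-selling sub-variables. The sketch below shows that computation on made-up Likert-scale data; the actual survey responses are not public.

```python
# Multiple linear regression: sport_decision ~ phone + face_to_face + messaging.
# The data frame is illustrative stand-in data, not the study's survey results.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "phone":          [3.0, 4.2, 2.8, 3.9, 3.5, 4.0, 2.9, 3.7],
    "face_to_face":   [4.5, 4.0, 3.2, 4.8, 4.1, 4.6, 3.0, 4.3],
    "messaging":      [2.1, 3.0, 2.5, 2.9, 2.2, 2.8, 2.4, 2.6],
    "sport_decision": [4.0, 4.1, 3.0, 4.6, 3.9, 4.4, 2.9, 4.2],
})

# Add an intercept and fit ordinary least squares.
X = sm.add_constant(df[["phone", "face_to_face", "messaging"]])
print(sm.OLS(df["sport_decision"], X).fit().summary())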

    DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

    Full text link
    Generating realistic talking faces is a complex and widely studied task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through joint audio and landmark driving. DiffTalker addresses the challenge of applying diffusion models, which are traditionally trained on text-image pairs, directly to audio control. DiffTalker consists of two agent networks: a transformer-based landmark completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This approach efficiently produces clearly articulated speaking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features.
    Comment: Submitted to ICASSP 202
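    One plausible way to realize landmark co-driving is to render predicted landmarks as a heatmap and stack it with the noisy image as extra input channels to the denoiser, as in the sketch below. The stand-in denoiser and the heatmap rendering are assumptions for illustration, not DiffTalker's actual networks.

```python
# Sketch of landmark conditioning for a diffusion denoiser: landmarks become
# a Gaussian heatmap channel concatenated with the noisy image x_t.
import torch
import torch.nn as nn

def landmarks_to_heatmap(landmarks, size=64, sigma=2.0):
    """Render (N, 2) pixel coordinates as a single-channel Gaussian heatmap."""
    ys, xs = torch.meshgrid(
        torch.arange(size), torch.arange(size), indexing="ij"
    )
    heat = torch.zeros(size, size)
    for x, y in landmarks:
        heat += torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma**2))
    return heat.clamp(max=1.0).unsqueeze(0)  # (1, H, W)

# Stand-in denoiser: 3 image channels + 1 landmark channel in, noise out.
denoiser = nn.Conv2d(4, 3, kernel_size=3, padding=1)

noisy = torch.randn(1, 3, 64, 64)                        # x_t
landmarks = torch.tensor([[32.0, 40.0], [28.0, 24.0]])   # e.g. mouth corners
cond = landmarks_to_heatmap(landmarks).unsqueeze(0)      # (1, 1, 64, 64)
eps_pred = denoiser(torch.cat([noisy, cond], dim=1))     # predicted noise
print(eps_pred.shape)  # torch.Size([1, 3, 64, 64])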

    Speaker identification by lipreading

    Get PDF
    This paper describes a new approach to speaker identification based on lipreading. Visual features are extracted from image sequences of the talking face; they consist of shape parameters, which describe the lip boundary, and intensity parameters, which describe the grey-level distribution of the mouth area. The intensity information is based on principal component analysis using eigenspaces that deform with the shape model. The extracted parameters account for both speech-dependent and speaker-dependent information. We build spatio-temporal speaker models from these features using HMMs with mixtures of Gaussians. Promising results were obtained in text-dependent and text-independent speaker identification tests performed on a small video database
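    A minimal sketch of this modeling recipe, assuming per-frame lip features have already been extracted: reduce them with PCA, train one Gaussian-mixture HMM per speaker, and identify a test sequence by the highest-scoring model. The random arrays below stand in for the shape and intensity parameters.

```python
# PCA feature reduction + GMM-HMM speaker model, sketched on stand-in data.
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(0)
# Stand-in features: (frames, raw_dim) sequences for one speaker.
train_seq = rng.normal(size=(200, 40))
test_seq = rng.normal(size=(50, 40))

pca = PCA(n_components=8).fit(train_seq)

# One spatio-temporal model per speaker; at identification time, pick the
# speaker whose model assigns the highest log-likelihood to the test clip.
model = GMMHMM(n_components=3, n_mix=2, covariance_type="diag", n_iter=20)
model.fit(pca.transform(train_seq))
print("log-likelihood:", model.score(pca.transform(test_seq)))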

    Neural Voice Puppetry: Audio-driven Facial Reenactment

    Get PDF
    We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence from a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability, while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor, or even with synthetic voices generated by standard text-to-speech approaches. Neural Voice Puppetry has a variety of use cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples. Our method is not only more general than existing works, since it is generic with respect to the input person, but it also shows superior visual and lip-sync quality compared to photo-realistic audio- and video-driven reenactment techniques
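    As a hedged illustration of the audio-to-expression stage, the sketch below maps a short window of per-frame audio features to coefficients of a latent expression space with a small temporal-convolution network. The feature dimension, coefficient count, and architecture are assumptions; the paper's network and neural renderer are far more involved.

```python
# Sketch: regress latent 3D face (expression) coefficients from a short
# window of audio features; a stand-in, not Neural Voice Puppetry's network.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    def __init__(self, audio_dim=29, n_coeffs=64):
        super().__init__()
        # Temporal convolution over the window for stability, then pooling
        # and a linear regression head onto the expression coefficients.
        self.net = nn.Sequential(
            nn.Conv1d(audio_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, n_coeffs),
        )

    def forward(self, audio_window):
        # audio_window: (batch, audio_dim, window_frames)
        return self.net(audio_window)  # (batch, n_coeffs)

coeffs = AudioToExpression()(torch.randn(1, 29, 8))
print(coeffs.shape)  # torch.Size([1, 64])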

    Text-based Editing of Talking-head Video

    No full text
    Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified while maintaining a seamless audio-visual flow (i.e., no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression, and scene illumination per frame. To edit a video, the user only has to edit the transcript; an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full-sentence synthesis
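    The segment-selection step can be illustrated with a toy greedy matcher that covers the edited phoneme sequence using the longest runs found in the annotated corpus, so the stitched segments stay as long (and as seamless) as possible. The real optimization also accounts for viseme similarity, pose, and blending costs; this is only a sketch.

```python
# Toy greedy cover of an edited phoneme sequence by corpus segments.
def longest_match(query, corpus, q_start):
    """Longest corpus run matching query[q_start:]; returns (c_start, length)."""
    best = (0, 0)
    for c in range(len(corpus)):
        n = 0
        while (q_start + n < len(query) and c + n < len(corpus)
               and query[q_start + n] == corpus[c + n]):
            n += 1
        if n > best[1]:
            best = (c, n)
    return best

def select_segments(query, corpus):
    """Greedily cover the query with corpus segments (start, length)."""
    segments, q = [], 0
    while q < len(query):
        c, n = longest_match(query, corpus, q)
        if n == 0:
            raise ValueError(f"phoneme {query[q]!r} not found in corpus")
        segments.append((c, n))
        q += n
    return segments

corpus = "HH EH L OW W ER L D".split()   # phonemes of the recorded video
edited = "W ER L OW".split()             # phonemes of the edited transcript
print(select_segments(edited, corpus))   # [(4, 3), (3, 1)]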

    Designing and Implementing Embodied Agents: Learning from Experience

    Get PDF
    In this paper, we provide an overview of part of our experience in designing and implementing some of the embodied agents and talking faces that we have used in our research into human-computer interaction. We focus on the techniques that were used and evaluate them with respect to the purposes the agents and faces were to serve and the costs involved in producing and maintaining the software. We discuss the function of this research and development in relation to the educational programme of our graduate students