19,841 research outputs found

    Text-based Editing of Talking-head Video

    No full text
    Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user has to only edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis

    Object Referring in Visual Scene with Spoken Language

    Full text link
    Object referring has important applications, especially for human-machine interaction. While having received great attention, the task is mainly attacked with written language (text) as input rather than spoken language (speech), which is more natural. This paper investigates Object Referring with Spoken Language (ORSpoken) by presenting two datasets and one novel approach. Objects are annotated with their locations in images, text descriptions and speech descriptions. This makes the datasets ideal for multi-modality learning. The approach is developed by carefully taking down ORSpoken problem into three sub-problems and introducing task-specific vision-language interactions at the corresponding levels. Experiments show that our method outperforms competing methods consistently and significantly. The approach is also evaluated in the presence of audio noise, showing the efficacy of the proposed vision-language interaction methods in counteracting background noise.Comment: 10 pages, Submitted to WACV 201

    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    Get PDF
    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and the performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect we find that the subjective score for the entire sequence is subjectively lower than sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue, which is to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality

    A system for synthetic vision and augmented reality in future flight decks

    Get PDF
    Rockwell Science Center is investigating novel human-computer interaction techniques for enhancing the situational awareness in future flight decks. One aspect is to provide intuitive displays that provide the vital information and the spatial awareness by augmenting the real world with an overlay of relevant information registered to the real world. Such Augmented Reality (AR) techniques can be employed during bad weather scenarios to permit flying in Visual Flight Rules (VFR) in conditions which would normally require Instrumental Flight Rules (IFR). These systems could easily be implemented on heads-up displays (HUD). The advantage of AR systems vs. purely synthetic vision (SV) systems is that the pilot can relate the information overlay to real objects in the world, whereas SV systems provide a constant virtual view, where inconsistencies can hardly be detected. The development of components for such a system led to a demonstrator implemented on a PC. A camera grabs video images which are overlaid with registered information. Orientation of the camera is obtained from an inclinometer and a magnetometer; position is acquired from GPS. In a possible implementation in an airplane, the on-board attitude information can be used for obtaining correct registration. If visibility is sufficient, computer vision modules can be used to fine-tune the registration by matching visual cues with database features. This technology would be especially useful for landing approaches. The current demonstrator provides a frame-rate of 15 fps, using a live video feed as background with an overlay of avionics symbology in the foreground. In addition, terrain rendering from a 1 arc sec. digital elevation model database can be overlaid to provide synthetic vision in case of limited visibility. For true outdoor testing (on ground level), the system has been implemented on a wearable computer
    • 

    corecore