2,626 research outputs found
Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers
We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and the performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect we find that the subjective score for the entire sequence is subjectively lower than sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue, which is to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality
Recommended from our members
Automatic Prediction of Impressions in Time and across Varying Context: Personality, Attractiveness and Likeability
© 2010-2012 IEEE. In this paper, we propose a novel multimodal framework for automatically predicting the impressions of extroversion, agreeableness, conscientiousness, neuroticism , openness, attractiveness and likeability continuously in time and across varying situational contexts. Differently from the existing works, we obtain visual-only and audio-only annotations continuously in time for the same set of subjects, for the first time in the literature, and compare them to their audio-visual annotations. We propose a time-continuous prediction approach that learns the temporal relationships rather than treating each time instant separately. Our experiments show that the best prediction results are obtained when regression models are learned from audio-visual annotations and visual cues, and from audio-visual annotations and visual cues combined with audio cues at the decision level. Continuously generated annotations have the potential to provide insight into better understanding which impressions can be formed and predicted more dynamically, varying with situational context, and which ones appear to be more static and stable over time.This research work was supported by the EPSRC MAPTRAITS Project (Grant Ref: EP/K017500/1) and the EPSRC HARPS Project under its IDEAS Factory Sandpits call on Digital Personhood (Grant Ref: EP/L00416X/1)
Autonomous agents and avatars in REVERIE’s virtual environment
In this paper, we describe the enactment of autonomous agents and avatars in the web-based social collaborative virtual environment of REVERIE that supports natural, human-like behavior, physical interaction and engagement. Represented by avatars, users feel immersed in this virtual world in which they can meet and share experiences as in real life. Like the avatars, autonomous agents that may act in this world are capable of demonstrating human-like non-verbal behavior and facilitate social interaction. We describe how reasoning components of the REVERIE system connect and cooperatively control autonomous agents and avatars representing a user
Automatic Face Reenactment
We propose an image-based, facial reenactment system that replaces the face of an actor in an existing target video with the face of a user from a source video, while preserving the original target performance. Our system is fully automatic and does not require a database of source expressions. Instead, it is able to produce convincing reenactment results from a short source video captured with an off-the-shelf camera, such as a webcam, where the user performs arbitrary facial gestures. Our reenactment pipeline is conceived as part image retrieval and part face transfer: The image retrieval is based on temporal clustering of target frames and a novel image matching metric that combines appearance and motion to select candidate frames from the source video, while the face transfer uses a 2D warping strategy that preserves the user's identity. Our system excels in simplicity as it does not rely on a 3D face model, it is robust under head motion and does not require the source and target performance to be similar. We show convincing reenactment results for videos that we recorded ourselves and for low-quality footage taken from the Internet
Example Based Caricature Synthesis
The likeness of a caricature to the original face image is an essential and often overlooked part of caricature
production. In this paper we present an example based caricature synthesis technique, consisting of shape
exaggeration, relationship exaggeration, and optimization for likeness. Rather than relying on a large training set
of caricature face pairs, our shape exaggeration step is based on only one or a small number of examples of facial
features. The relationship exaggeration step introduces two definitions which facilitate global facial feature
synthesis. The first is the T-Shape rule, which describes the relative relationship between the facial elements in an
intuitive manner. The second is the so called proportions, which characterizes the facial features in a proportion
form. Finally we introduce a similarity metric as the likeness metric based on the Modified Hausdorff Distance
(MHD) which allows us to optimize the configuration of facial elements, maximizing likeness while satisfying a
number of constraints. The effectiveness of our algorithm is demonstrated with experimental results
RGBD Datasets: Past, Present and Future
Since the launch of the Microsoft Kinect, scores of RGBD datasets have been
released. These have propelled advances in areas from reconstruction to gesture
recognition. In this paper we explore the field, reviewing datasets across
eight categories: semantics, object pose estimation, camera tracking, scene
reconstruction, object tracking, human actions, faces and identification. By
extracting relevant information in each category we help researchers to find
appropriate data for their needs, and we consider which datasets have succeeded
in driving computer vision forward and why.
Finally, we examine the future of RGBD datasets. We identify key areas which
are currently underexplored, and suggest that future directions may include
synthetic data and dense reconstructions of static and dynamic scenes.Comment: 8 pages excluding references (CVPR style
- …