Speech-driven Animation with Meaningful Behaviors
Conversational agents (CAs) play an important role in human-computer
interaction. Creating believable movements for CAs is challenging, since the
movements have to be meaningful and natural, reflecting the coupling between
gestures and speech. Studies in the past have mainly relied on rule-based or
data-driven approaches. Rule-based methods focus on creating meaningful
behaviors conveying the underlying message, but the gestures cannot be easily
synchronized with speech. Data-driven approaches, especially speech-driven
models, can capture the relationship between speech and gestures. However, they
create behaviors disregarding the meaning of the message. This study proposes
to bridge the gap between these two approaches, overcoming their limitations.
The approach builds a dynamic Bayesian network (DBN), where a discrete variable
is added to condition the generated behaviors on the underlying constraint. The study
implements and evaluates the approach with two constraints: discourse functions
and prototypical behaviors. By constraining on the discourse functions (e.g.,
questions), the model learns the characteristic behaviors associated with a
given discourse class learning the rules from the data. By constraining on
prototypical behaviors (e.g., head nods), the approach can be embedded in a
rule-based system as a behavior realizer, creating trajectories that are
synchronized in time with speech. The study proposes a DBN structure and a training
approach that (1) models the cause-effect relationship between the constraint
and the gestures, (2) initializes the state configuration models, increasing the
range of the generated behaviors, and (3) captures the differences in the
behaviors across constraints by enforcing sparse transitions between shared and
exclusive states per constraint. Objective and subjective evaluations
demonstrate the benefits of the proposed approach over an unconstrained model.
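To make the constrained-DBN idea concrete, here is a minimal sketch (not the paper's model) in which a discrete constraint variable, such as a discourse function, selects the transition distribution over hidden gesture states that in turn emit poses; the state count, pose dimensions, and all parameters are invented for illustration.

```python
import numpy as np

# Illustrative sketch of a constraint-conditioned gesture model: a discrete
# constraint c (e.g., 0 = "statement", 1 = "question") selects the transition
# matrix over hidden states, and each state emits a toy gesture pose.
rng = np.random.default_rng(0)

N_STATES = 6          # hidden states (some shared across constraints, some exclusive)
N_CONSTRAINTS = 2     # number of discourse functions in this toy setup
POSE_DIM = 3          # toy gesture pose (e.g., head pitch/yaw/roll)

# Sparse, constraint-specific transitions: each constraint favours its own subset of states.
A = rng.dirichlet(alpha=np.full(N_STATES, 0.2), size=(N_CONSTRAINTS, N_STATES))

# Per-state Gaussian emission means for the gesture pose.
state_pose_mean = rng.normal(scale=0.5, size=(N_STATES, POSE_DIM))

def sample_gesture_trajectory(constraint: int, n_frames: int) -> np.ndarray:
    """Sample a pose trajectory whose state dynamics are conditioned on the constraint."""
    state = rng.integers(N_STATES)
    poses = np.empty((n_frames, POSE_DIM))
    for t in range(n_frames):
        state = rng.choice(N_STATES, p=A[constraint, state])
        poses[t] = state_pose_mean[state] + rng.normal(scale=0.05, size=POSE_DIM)
    return poses

question_motion = sample_gesture_trajectory(constraint=1, n_frames=50)
print(question_motion.shape)  # (50, 3)
```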
pDaniM: A case study on interactive processes for expressive music generation in the computer animation production pipeline
This paper presents an experimental
application that shows how an interactive process can save time and production resources in the process pipeline of
music scoring for a 3D animation movie. The investigation builds on a set of roughly 30 rules that performers use to
add expressive emotion to a score, covering different aspects of music performance. Applying the rules
adjusts timing and velocity parameters such as "microtiming", articulation, tempo, and sound level. The
resulting application takes as input a basic music score
(MIDI) quantized linearly, and imprints expressiveness and
emotion onto this music in sync with the expressiveness cues from the timeline exported from the animation project.
Acoustic cues are driven by facial and gesture expression from the characters.
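As a rough illustration of this style of rule application (not pDaniM's actual rules), the sketch below shifts the onset, duration, and velocity of notes in a linearly quantized score according to a toy emotion label coming from the animation timeline; all rule values are invented.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Note:
    onset: float      # seconds, from a linearly quantized MIDI score
    duration: float   # seconds
    pitch: int        # MIDI note number
    velocity: int     # 0-127

# Toy "performance rules": (tempo scale, velocity scale, micro-timing offset in seconds).
RULES = {
    "happy": (0.95, 1.15, 0.010),
    "sad":   (1.10, 0.80, 0.025),
}

def apply_rules(notes: list[Note], emotion: str) -> list[Note]:
    """Return a new note list with expressive timing and dynamics applied."""
    tempo_scale, vel_scale, microtiming = RULES[emotion]
    out = []
    for i, n in enumerate(notes):
        jitter = microtiming if i % 2 == 0 else -microtiming  # crude alternating micro-timing
        out.append(replace(
            n,
            onset=max(0.0, n.onset * tempo_scale + jitter),
            duration=n.duration * tempo_scale,
            velocity=max(1, min(127, round(n.velocity * vel_scale))),
        ))
    return out

score = [Note(0.0, 0.5, 60, 80), Note(0.5, 0.5, 64, 80), Note(1.0, 1.0, 67, 80)]
print(apply_rules(score, "sad"))
```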
Responsive Listening Behavior
Humans use their bodies in a highly expressive way during conversation, and animated characters that lack this form of non-verbal expression can seem stiff and unemotional. An important aspect of non-verbal expression is that people respond to each other's behavior and are highly attuned to picking up this type of response. This is particularly important for the feedback given while listening to someone speak. However, automatically generating this type of behavior is difficult, as it is highly complex and subtle. This paper takes a data-driven approach to generating interactive social behavior. Listening behavior is motion captured, together with the audio being listened to. This data is used to learn an animation model of the responses of one person to the other. This allows us to create characters that respond in real-time during a conversation with a real human.
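A minimal sketch of the general approach, under assumed feature sizes and architecture rather than the paper's actual model: a small recurrent network trained on the captured data maps the speaker's per-frame audio features to the listener's pose, so feedback can be generated frame by frame in real time.

```python
import torch
import torch.nn as nn

class ListenerModel(nn.Module):
    """Toy listener-response model: speaker audio features in, listener pose out."""
    def __init__(self, audio_dim: int = 26, pose_dim: int = 6, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, audio_feats: torch.Tensor, state=None):
        # audio_feats: (batch, frames, audio_dim) -> per-frame listener pose
        out, state = self.rnn(audio_feats, state)
        return self.head(out), state

model = ListenerModel()
speaker_audio = torch.randn(1, 100, 26)   # 100 frames of assumed audio features
listener_pose, _ = model(speaker_audio)   # (1, 100, 6) head/body pose parameters
print(listener_pose.shape)
```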
Investigating How Speech And Animation Realism Influence The Perceived Personality Of Virtual Characters And Agents
The portrayed personality of virtual characters and agents is understood to influence how we perceive and engage with digital applications. Understanding how the features of speech and animation drive portrayed personality allows us to intentionally design characters to be more personalized and engaging. In this study, we use performance capture data of unscripted conversations from a variety of actors to explore the perceptual outcomes associated with the modalities of speech and motion. Specifically, we contrast full performance-driven characters to those portrayed by generated gestures and synthesized speech, analysing how the features of each influence portrayed personality according to the Big Five personality traits. We find that processing speech and motion can have mixed effects on such traits, with our results highlighting motion as the dominant modality for portraying extraversion and speech as dominant for communicating agreeableness and emotional stability. Our results can support the Extended Reality (XR) community in the development of virtual characters, social agents, and 3D User Interface (3DUI) agents portraying a range of targeted personalities.
Lip syncing method for realistic expressive 3D face model
Lip synchronization of 3D face models is now used in a multitude of important fields. It brings a more human, social and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism is needed in demanding applications such as computer games and cinema. Authoring lip syncing with complex and subtle expressions is still difficult and fraught with problems in terms of realism. This research proposed a lip syncing method for a realistic, expressive 3D face model. Animating lips requires a 3D face model capable of representing the myriad shapes the human face assumes during speech, and a method to produce the correct lip shape at the correct time. The paper presented a 3D face model designed to support lip syncing that aligns with an input audio file. It deforms using a Raised Cosine Deformation (RCD) function that is grafted onto the input facial geometry. The face model was based on the MPEG-4 Facial Animation (FA) standard. The paper proposed a method to animate the 3D face model over time to create animated lip syncing, using a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. The research integrated emotion by considering the Ekman model and Plutchik's wheel, together with emotive eye movements implemented through the Emotional Eye Movements Markup Language (EEMML), to produce a realistic 3D face model.
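The raised-cosine falloff itself can be illustrated as below, where vertices within a radius R of a control point are displaced along an offset vector with weight 0.5 * (1 + cos(pi * d / R)); this is the common form of such a deformation, and the paper's exact RCD parameterisation and its MPEG-4 FA integration are not reproduced here.

```python
import numpy as np

def raised_cosine_deform(vertices: np.ndarray,
                         control_point: np.ndarray,
                         offset: np.ndarray,
                         radius: float) -> np.ndarray:
    """vertices: (N, 3) mesh positions; returns a deformed copy.

    Vertices within `radius` of `control_point` move along `offset`, with the
    raised-cosine weight falling smoothly to zero at the boundary.
    """
    d = np.linalg.norm(vertices - control_point, axis=1)
    w = np.where(d < radius, 0.5 * (1.0 + np.cos(np.pi * d / radius)), 0.0)
    return vertices + w[:, None] * offset

# Toy example: pull vertices near a mouth-corner control point toward a viseme target.
verts = np.random.rand(100, 3)
deformed = raised_cosine_deform(verts,
                                control_point=np.array([0.5, 0.2, 0.6]),
                                offset=np.array([0.0, -0.05, 0.02]),
                                radius=0.3)
```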
Lexical semantics and auditory presentation in virtual storytelling
Presented at the 11th International Conference on Auditory Display (ICAD2005). Audio presentation is an important modality in virtual storytelling. In this paper we present our work on audio presentation in our intelligent multimodal storytelling system, CONFUCIUS, which automatically generates 3D animation, speech, and non-speech audio from natural language sentences. We provide an overview of the system and describe speech and non-speech audio in virtual storytelling by using linguistic approaches. We discuss several issues in auditory display, such as its relation to verb and adjective ontology, concepts and modalities, and media allocation. Finally we conclude that introducing linguistic knowledge provides more intelligent virtual storytelling, especially in audio presentation.
Example Based Caricature Synthesis
The likeness of a caricature to the original face image is an essential and often overlooked part of caricature
production. In this paper we present an example based caricature synthesis technique, consisting of shape
exaggeration, relationship exaggeration, and optimization for likeness. Rather than relying on a large training set
of caricature face pairs, our shape exaggeration step is based on only one or a small number of examples of facial
features. The relationship exaggeration step introduces two definitions which facilitate global facial feature
synthesis. The first is the T-Shape rule, which describes the relative relationship between the facial elements in an
intuitive manner. The second is the so-called proportions, which characterize the facial features in proportional
form. Finally, we introduce a similarity metric as the likeness metric based on the Modified Hausdorff Distance
(MHD) which allows us to optimize the configuration of facial elements, maximizing likeness while satisfying a
number of constraints. The effectiveness of our algorithm is demonstrated with experimental results
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models, that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training data sets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding the gesture in the co-occurring speech in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integration of gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development.