1,165 research outputs found
Speech-driven Animation with Meaningful Behaviors
Conversational agents (CAs) play an important role in human-computer
interaction. Creating believable movements for CAs is challenging, since the
movements have to be meaningful and natural, reflecting the coupling between
gestures and speech. Studies in the past have mainly relied on rule-based or
data-driven approaches. Rule-based methods focus on creating meaningful
behaviors conveying the underlying message, but the gestures cannot be easily
synchronized with speech. Data-driven approaches, especially speech-driven
models, can capture the relationship between speech and gestures. However, they
create behaviors disregarding the meaning of the message. This study proposes
to bridge the gap between these two approaches, overcoming their limitations.
The approach builds a dynamic Bayesian network (DBN), where a discrete variable
is added to constrain the behaviors on the underlying constraint. The study
implements and evaluates the approach with two constraints: discourse functions
and prototypical behaviors. By constraining on the discourse functions (e.g.,
questions), the model learns the characteristic behaviors associated with a
given discourse class learning the rules from the data. By constraining on
prototypical behaviors (e.g., head nods), the approach can be embedded in a
rule-based system as a behavior realizer creating trajectories that are timely
synchronized with speech. The study proposes a DBN structure and a training
approach that (1) models the cause-effect relationship between the constraint
and the gestures, (2) initializes the state configuration models increasing the
range of the generated behaviors, and (3) captures the differences in the
behaviors across constraints by enforcing sparse transitions between shared and
exclusive states per constraint. Objective and subjective evaluations
demonstrate the benefits of the proposed approach over an unconstrained model. Comment: 13 pages, 12 figures, 5 tables
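The idea of sparse transitions between shared and exclusive states per constraint can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the state layout (shared states first, then one block of exclusive states per constraint), and the random placeholder weights are all assumptions made for illustration. In the study the transition parameters are learned from data.

```python
import random


def allowed_states(constraint, n_shared, n_exclusive):
    """Return the state indices reachable under one constraint value.

    Assumed layout (hypothetical): indices [0, n_shared) are shared by all
    constraints, followed by one block of n_exclusive states per constraint.
    """
    start = n_shared + constraint * n_exclusive
    return set(range(n_shared)) | set(range(start, start + n_exclusive))


def sparse_transition_matrix(constraint, n_shared, n_exclusive,
                             n_constraints, seed=0):
    """Build a row-stochastic transition matrix for one constraint value.

    Transitions are only permitted among the shared states and the states
    exclusive to this constraint; every other entry is forced to zero.
    The non-zero weights are random placeholders, not learned parameters.
    """
    rng = random.Random(seed)
    n_states = n_shared + n_exclusive * n_constraints
    allowed = allowed_states(constraint, n_shared, n_exclusive)
    matrix = []
    for i in range(n_states):
        if i not in allowed:
            # States exclusive to another constraint are never visited.
            matrix.append([0.0] * n_states)
            continue
        weights = [rng.random() + 1e-9 if j in allowed else 0.0
                   for j in range(n_states)]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


# Example: 2 shared states, 3 exclusive states for each of 2 constraints
# (for instance, "question" vs. "statement" discourse functions).
A_question = sparse_transition_matrix(0, n_shared=2, n_exclusive=3,
                                      n_constraints=2)
```

Under these assumptions, each constraint value selects its own sparse matrix, so behaviors common to all constraints use the shared states while constraint-specific behaviors stay confined to their exclusive block.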
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training data sets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding the gesture in the co-occurring speech, in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integration of gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development. Comment: Accepted for EUROGRAPHICS 202
The Multimodal and Sequential Design of Co-Animation as a Practice for Association in English Interaction
This thesis describes the understudied interactional practice of co-animation: during the development of an activity in conversation, a speaker incorporates an animation -i.e. a quote, or (re)enactment - and a co-participant responds, pre-emptively, or in the contiguous turn, with a completion or continuation of the animation of the same figure. Based on the study of 89 co-animation sequences found in 10 hours of video-recordings of naturalistic English interaction between friends, relatives or co-workers, this thesis adopts the theoretical and methodological tenets of Conversation Analysis and Interactional Linguistics to describe the multimodal, sequential, and relational organisation of this practice. This thesis analyses how participants mark the shift from the here-and-now into the animation space, and how co-participants make their contributions both hearable as coherent with prior animations, and as fitted affiliative responses that further the ongoing course of action. Lexico-grammatical, phonetic, and gestural-postural resources are analysed for their interactional import in their concurrent framing of animation and the display of stance and conditional relevance. The organisation of resources in responsive co-animations is found to be positionally-sensitive, with co-participants negotiating agency and epistemic access and entitlement differently relative to the onset of co-animation and to the stage in the ongoing activity. The scrutiny of the situated deployment of co-animation in the social activities of troubles-tellings/complaint stories on the one hand, and teasing/joint fictionalisation on the other, reveals how co-animation contributes to the process of association, that is, the building of single momentary units of participation (collectivities). 
Co-participants are found to team up around what is presented as a shared stance, values, and identity, against absent but invoked behaviours or individuals engaging in moral transgressions, by jointly "doing being" the same voice
Building Embodied Conversational Agents: Observations on human nonverbal behaviour as a resource for the development of artificial characters
"Wow this is so cool!" This is what I most probably yelled, back in the 90s, when my first computer program on our MSX computer turned out to do exactly what I wanted it to do. The program contained the following instruction: COLOR 10(1.1) After hitting enter, it would change the screen color from light blue to dark yellow. A few years after that experience, Microsoft Windows was introduced. Windows came with an intuitive graphical user interface that was designed to allow all people, so also those who would not consider themselves to be experienced computer addicts, to interact with the computer. This was a major step forward in human-computer interaction, as from that point forward no complex programming skills were required anymore to perform such actions as adapting the screen color. Changing the background was just a matter of pointing the mouse to the desired color on a color palette. "Wow this is so cool!". This is what I shouted, again, 20 years later. This time my new smartphone successfully skipped to the next song on Spotify because I literally told my smartphone, with my voice, to do so. Being able to operate your smartphone with natural language through voice-control can be extremely handy, for instance when listening to music while showering. Again, the option to handle a computer with voice instructions turned out to be a significant optimization in human-computer interaction. From now on, computers could be instructed without the use of a screen, mouse or keyboard, and instead could operate successfully simply by telling the machine what to do. In other words, I have personally witnessed how, within only a few decades, the way people interact with computers has changed drastically, starting as a rather technical and abstract enterprise to becoming something that was both natural and intuitive, and did not require any advanced computer background. 
Accordingly, while computers used to be machines that could only be operated by technically-oriented individuals, they have gradually changed into devices that are part of many people's households, just as much as a television, a vacuum cleaner or a microwave oven. The introduction of voice control is a significant feature of the newer generation of interfaces in the sense that these have become more "anthropomorphic" and try to mimic the way people interact in daily life, where indeed the voice is a universally used device that humans exploit in their exchanges with others. The question then arises whether it would be possible to go even one step further, where people, like in science-fiction movies, interact with avatars or humanoid robots, whereby users can have a proper conversation with a computer-simulated human that is indistinguishable from a real human. An interaction with a human-like representation of a computer that behaves, talks and reacts like a real person would imply that the computer is able not only to produce and understand messages transmitted auditorily through the voice, but also to rely on the perception and generation of different forms of body language, such as facial expressions, gestures or body posture. At the time of writing, developments of this next step in human-computer interaction are in full swing, but the type of such interactions is still rather constrained when compared to the way humans have their exchanges with other humans. It is interesting to reflect on what such future human-machine interactions may look like. When we consider other products that have been created in history, it sometimes is striking to see that some of these have been inspired by things that can be observed in our environment, yet at the same time do not have to be exact copies of those phenomena. For instance, an airplane has wings just as birds do, yet the wings of an airplane do not make those typical movements a bird would produce to fly. 
Moreover, an airplane has wheels, whereas a bird has legs. At the same time, an airplane has made it possible for humans to cover long distances in a fast and smooth manner in a way that was unthinkable before it was invented. The example of the airplane shows how new technologies can have "unnatural" properties, but can nonetheless be very beneficial and impactful for human beings. This dissertation centers on the practical question of how virtual humans can be programmed to act more human-like. The four studies presented in this dissertation all share the same underlying question of how parts of human behavior can be captured, such that computers can use them to become more human-like. Each study differs in method, perspective and specific questions, but they all aim to gain insights and directions that would help further push the computer developments of human-like behavior and investigate (the simulation of) human conversational behavior. The rest of this introductory chapter gives a general overview of virtual humans (also known as embodied conversational agents), their potential uses and the engineering challenges, followed by an overview of the four studies
Eyebrow raising in dialogue: discourse structure, utterance function, and pitch accents
Some studies have suggested a relationship between eyebrow raising and different aspects of the verbal message, but our knowledge about this link is still
very limited. If we could establish and characterise a relation between eyebrow
raises and the linguistic signal we could better understand human multimodal
communication behaviour. We could also improve the credibility and efficiency
of computer animated conversational agents in multimodal communication systems. This thesis investigated eyebrow raising in a corpus of task-oriented English
dialogues. Applying a standard dialogue coding scheme (Conversational Game
Analysis, Carletta et al., 1997), eyebrow raises were studied in connection with
discourse structure and utterance function. Supporting the prediction, more
frequent and longer eyebrow raising occurred in the initial utterance of high-level discourse segments than anywhere else in the dialogue (where 'high-level
discourse segment' = transaction, and 'utterance' = move, following Carletta et al.).
Additionally, eyebrow raises were more frequent in instructions than in requests
for or acknowledgements of information. Instructions also had longer eyebrow
raising than any other type of utterance. Contrary to the prediction, the start of a
lower-level discourse segment (conversational game) did not have more eyebrow
raising than any other position in the dialogue, and queries did not have more
eyebrow raising than any other type of utterance. Eyebrow raises were also studied in relation to intonational events, namely pitch
accents. Results showed evidence of alignment between the brow raise start and
the start of a pitch accent. Most pitch accents were not associated with brow raising, but when brow raises occurred they tended to immediately precede a pitch
accent on the speech signal. To investigate what could explain the alignment
between the two events, pitch accents aligned with eyebrow raises were compared to all other pitch accents in terms of: phonological characteristics (primary
vs. secondary pitch accents, and downstep-initial vs. non-initial pitch accents), information structure (given vs. new information in referring expressions, and the
last quarter vs. earlier parts of the utterance length) and type of utterance in which
they occurred (instruction vs. non-instruction). Those comparisons suggested that
brow raises may be aligned more frequently with pitch accents in downstep-initial position and in instructions. No differences were found in terms of information structure or between primary/secondary accents. The results provide evidence of a link between eyebrow raising and spoken language. Eyebrow raises may signal the start of linguistic units such as discourse
segments and some prosodic phenomena; they may be related to utterance function; and they are aligned with pitch accents. Possible linguistic functions are
proposed, such as structuring and emphasising information in the verbal message
Real-time generation and adaptation of social companion robot behaviors
Social robots will be part of our future homes.
They will assist us in everyday tasks, entertain us, and provide helpful advice.
However, the technology still faces challenges that must be overcome to equip the machine with social competencies and make it a socially intelligent and accepted housemate.
An essential skill of every social robot is verbal and non-verbal communication.
In contrast to voice assistants, smartphones, and smart home technology, which are already part of many people's lives today, social robots have an embodiment that raises expectations towards the machine.
Their anthropomorphic or zoomorphic appearance suggests they can communicate naturally with speech, gestures, or facial expressions and understand corresponding human behaviors.
In addition, robots also need to consider individual users' preferences: everybody is shaped by their culture, social norms, and life experiences, resulting in different expectations towards communication with a robot.
However, robots do not have human intuition - they must be equipped with the corresponding algorithmic solutions to these problems.
This thesis investigates the use of reinforcement learning to adapt the robot's verbal and non-verbal communication to the user's needs and preferences.
Such non-functional adaptation of the robot's behaviors primarily aims to improve the user experience and the robot's perceived social intelligence.
The literature has not yet provided a holistic view of the overall challenge: real-time adaptation requires control over the robot's multimodal behavior generation, an understanding of human feedback, and an algorithmic basis for machine learning.
Thus, this thesis develops a conceptual framework for designing real-time non-functional social robot behavior adaptation with reinforcement learning.
It provides a higher-level view from the system designer's perspective and guidance from the start to the end.
It illustrates the process of modeling, simulating, and evaluating such adaptation processes.
Specifically, it guides the integration of human feedback and social signals to equip the machine with social awareness.
The conceptual framework is put into practice for several use cases, resulting in technical proofs of concept and research prototypes.
They are evaluated in the lab and in in-situ studies.
These approaches address typical activities in domestic environments, focussing on the robot's expression of personality, persona, politeness, and humor.
Within this scope, the robot adapts its spoken utterances, prosody, and animations based on explicit or implicit human feedback.
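As a rough illustration of adapting a robot's communicative behavior from human feedback, the sketch below uses an epsilon-greedy bandit. This is only one simple reinforcement-learning formulation, chosen here for illustration; the abstract does not state which algorithms the thesis uses. The class name, the behavior labels, and the reward scale are all hypothetical.

```python
import random


class BehaviorBandit:
    """Epsilon-greedy bandit over a fixed set of behavior variants.

    Hypothetical sketch: each arm is a behavior style (e.g. a politeness or
    humor variant), and the reward is a scalar derived from explicit or
    implicit human feedback.
    """

    def __init__(self, behaviors, epsilon=0.1, seed=0):
        self.behaviors = list(behaviors)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {b: 0 for b in self.behaviors}
        self.values = {b: 0.0 for b in self.behaviors}

    def select(self):
        """Explore with probability epsilon, otherwise pick the best arm."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.behaviors)
        return max(self.behaviors, key=lambda b: self.values[b])

    def update(self, behavior, reward):
        """Update the running mean reward of the chosen behavior."""
        self.counts[behavior] += 1
        n = self.counts[behavior]
        self.values[behavior] += (reward - self.values[behavior]) / n


# Example interaction loop with made-up feedback values.
bandit = BehaviorBandit(["polite", "humorous"], epsilon=0.0)
bandit.update("humorous", 1.0)   # e.g. the user laughed
bandit.update("polite", -1.0)    # e.g. the user looked bored
next_behavior = bandit.select()
```

In a real system the reward would have to be estimated from social signals or explicit ratings, which is the part the abstract describes as requiring an understanding of human feedback.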