15,161 research outputs found
Inferring Intentions to Speak Using Accelerometer Data In-the-Wild
Humans have a good natural intuition for recognizing when another person has
something to say. It would be valuable if an AI could also recognize
intentions to speak, especially in scenarios where an AI is guiding a group
discussion. This work studies the inference of successful and unsuccessful
intentions to speak from accelerometer data. This modality is chosen because
it is privacy-preserving and feasible for in-the-wild settings, since the
sensor can be embedded in a smart badge. Data from a real-life social
networking event is used to train a machine-learning model that aims to infer
intentions to speak. A subset of unsuccessful intention-to-speak cases in the
data is annotated. The model is trained on the successful intentions to speak
and evaluated on both the successful and unsuccessful cases. In conclusion,
there is useful information in accelerometer data, but not enough to reliably
capture intentions to speak. For example, posture shifts are correlated with
intentions to speak, but people also often shift posture without having an
intention to speak, or have an intention to speak without shifting their
posture. More modalities are likely needed to reliably infer intentions to
speak.
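As an illustration of the kind of pipeline such an abstract describes (and not the authors' actual method), a minimal sketch might window a tri-axial accelerometer stream into simple statistical features and train a classifier on windows around annotated speaking onsets. The window length, feature set, and label source below are assumptions.

```python
# Minimal sketch (not the paper's pipeline): windowed statistical features
# from tri-axial accelerometer data feeding a binary classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(acc, window=150, step=50):
    """acc: (n_samples, 3) accelerometer signal; returns per-window features."""
    feats = []
    for start in range(0, len(acc) - window, step):
        seg = acc[start:start + window]
        feats.append(np.concatenate([
            seg.mean(axis=0),                            # mean per axis
            seg.std(axis=0),                             # variability (posture shifts)
            np.abs(np.diff(seg, axis=0)).mean(axis=0),   # jerkiness per axis
        ]))
    return np.asarray(feats)

# X_success / y_success would come from windows around annotated successful
# speaking onsets versus background windows; unsuccessful intentions would be
# held out for evaluation only, mirroring the split described in the abstract.
# clf = RandomForestClassifier(n_estimators=200).fit(X_success, y_success)
```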
Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behaviour
Rapport, the close and harmonious relationship in which interaction partners
are "in sync" with each other, was shown to result in smoother social
interactions, improved collaboration, and improved interpersonal outcomes. In
this work, we are the first to investigate the automatic prediction of low rapport
during natural interactions within small groups. This task is challenging given
that rapport only manifests in subtle non-verbal signals that are, in addition,
subject to influences of group dynamics as well as inter-personal
idiosyncrasies. We record videos of unscripted discussions of three to four
people using a multi-view camera system and microphones. We analyse a rich set
of non-verbal signals for rapport detection, namely facial expressions, hand
motion, gaze, speaker turns, and speech prosody. Using facial features, we can
detect low rapport with an average precision of 0.7 (chance level at 0.25),
while incorporating prior knowledge of participants' personalities can even
achieve early prediction without a drop in performance. We further provide a
detailed analysis of different feature sets and the amount of information
contained in different temporal segments of the interactions.
Comment: 12 pages, 6 figures
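To make the quoted evaluation concrete, a hedged sketch of scoring a low-rapport detector with average precision is shown below; the feature vectors, classifier, and data here are stand-ins, not the authors' setup, and the 25% positive rate mirrors the stated chance level.

```python
# Illustrative only: average precision for low-rapport detection
# (abstract reports AP 0.7 against a chance level of 0.25).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 17))                 # e.g., facial-feature statistics
y = (rng.random(80) < 0.25).astype(int)       # 1 = low rapport, ~25% positives

scores = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                           cv=5, method="predict_proba")[:, 1]
print("average precision:", average_precision_score(y, scores))
```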
Non-Verbal Communication Analysis in Victim-Offender Mediations
In this paper we present a non-invasive ambient intelligence framework for
the semi-automatic analysis of non-verbal communication applied to the
restorative justice field. In particular, we propose the use of computer vision
and social signal processing technologies in real scenarios of Victim-Offender
Mediations, applying feature extraction techniques to multi-modal
audio-RGB-depth data. We compute a set of behavioral indicators that define
communicative cues from the fields of psychology and observational methodology.
We test our methodology on data captured in real world Victim-Offender
Mediation sessions in Catalonia in collaboration with the regional government.
We define the ground truth based on expert opinions when annotating the
observed social responses. Using different state-of-the-art binary
classification approaches, our system achieves recognition accuracies of 86%
when predicting satisfaction, and 79% when predicting both agreement and
receptivity. Applying a regression strategy, we obtain a mean deviation for the
predictions between 0.5 and 0.7 in the range [1-5] for the computed social
signals.
Comment: Please find the supplementary video material at:
http://sunai.uoc.edu/~vponcel/video/VOMSessionSample.mp
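The abstract mentions two evaluation strategies: binary classification accuracy and a regression error measured as mean deviation on a 1-5 scale. A minimal sketch of both, using invented stand-in indicators and generic models rather than the paper's, could look like this.

```python
# Hedged sketch of the two evaluation strategies: classification accuracy on
# behavioural indicators, and mean absolute deviation of a 1-5 regression.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 12))                  # behavioural indicators per session
y_cls = (rng.random(60) < 0.5).astype(int)     # e.g., satisfied vs. not
y_reg = rng.integers(1, 6, size=60)            # e.g., receptivity rated 1-5

acc = cross_val_score(SVC(kernel="rbf"), X, y_cls, cv=5, scoring="accuracy")
pred = cross_val_predict(Ridge(), X, y_reg, cv=5)
print("mean accuracy:", acc.mean())
print("mean absolute deviation:", np.abs(pred - y_reg).mean())
```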
Moving together: the organisation of non-verbal cues during multiparty conversation
Conversation is a collaborative activity. In face-to-face interactions interlocutors have mutual
access to a shared space. This thesis aims to explore the shared space as a resource for coordinating
conversation. As is well demonstrated in studies of two-person conversations, interlocutors
can coordinate their speech and non-verbal behaviour in ways that manage the unfolding conversation.
However, when scaling up from two people to three people interacting, the coordination
challenges that the interlocutors face increase. In particular, speakers must manage multiple listeners.
This thesis examines the use of interlocutors’ bodies in shared space to coordinate their
multiparty dialogue.
The approach exploits corpora of motion-captured triadic interactions. The thesis first explores
how interlocutors coordinate their speech and non-verbal behaviour. Inter-person relationships
are examined and compared with artificially created triples who did not interact. Results demonstrate
that interlocutors avoid speaking and gesturing over each other, but tend to nod together.
Evidence is presented that the two recipients of an utterance have different patterns of head and
hand movement, and that some of the regularities of movement are correlated with the task structure.
The empirical section concludes by uncovering a class of coordination events, termed simultaneous
engagement events, that are unique to multiparty dialogue. They are constructed using
combinations of speaker head orientation and gesture orientation. The events coordinate multiple
recipients of the dialogue and potentially arise as a result of the greater coordination challenges
that interlocutors face. They are distinctive in requiring a mutually accessible shared space in order
to be considered an effective interactional cue.
The thesis provides quantitative evidence that interlocutors’ head and hand movements are
organised by their dialogue state and the task responsibilities that they bear. It is argued that a
shared interaction space becomes a more important interactional resource when conversations
scale up to three people.
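One analysis the abstract describes is comparing coordination in genuine triads with artificially created triples of people who did not interact. A toy sketch of that kind of control, using assumed binary nod indicators and a time-shift to break alignment, might look as follows.

```python
# Minimal sketch (assumed data): nod co-occurrence in a real triad versus an
# "artificial" triple whose streams are shifted so timing alignment is broken.
import numpy as np

def nod_cooccurrence(a, b):
    """Fraction of frames where both binary nod streams are active."""
    return float(np.mean(a & b))

rng = np.random.default_rng(2)
triad = [rng.random(1000) < 0.1 for _ in range(3)]   # toy per-frame nod flags

real = np.mean([nod_cooccurrence(triad[i], triad[j])
                for i in range(3) for j in range(i + 1, 3)])

# Circularly shift each stream, mimicking participants drawn from
# different conversations who could not have been coordinating.
shifted = [np.roll(s, rng.integers(100, 900)) for s in triad]
fake = np.mean([nod_cooccurrence(shifted[i], shifted[j])
                for i in range(3) for j in range(i + 1, 3)])
print("real triad co-occurrence:", real, "artificial triple:", fake)
```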
Twente Debate Corpus - A Multimodal Corpus for Head Movement Analysis
This paper introduces a multimodal discussion corpus for the study of head movement and turn-taking patterns in debates. Given that participants acted either alone or in a pair, cooperation and competition and their nonverbal correlates can be analyzed. In addition to the video and audio of the recordings, the corpus contains automatically estimated head movements, and manual annotations of who is speaking and who is looking where. The corpus consists of over 2 hours of debates, in 6 groups with 18 participants in total. We describe the recording setup and present initial analyses of the recorded data. We found that the person who acted as the single debater speaks more and also receives more attention than the other debaters, even when corrected for speaking time. We also found that a single debater was more likely to speak after a team debater. Future work will be aimed at further analysis of the relation between speaking and looking patterns, the outcome of the debate, and the perceived dominance of the debaters.
Multimodal Human Group Behavior Analysis
Human behaviors in a group setting involve a complex mixture of multiple modalities: audio, visual, linguistic, and human interactions. With the rapid progress of AI, automatic prediction and understanding of these behaviors is no longer a dream. In a negotiation, discovering human relationships and identifying the dominant person can be useful for decision making. In security settings, detecting nervous behaviors can help law enforcement agents spot suspicious people. In adversarial settings such as national elections and court defense, identifying persuasive speakers is a critical task. It is beneficial to build accurate machine learning (ML) models to predict such human group behaviors. There are two elements for the successful prediction of group behaviors. The first is to design domain-specific features for each modality. Social and psychological studies have uncovered various factors, including both individual cues and group interactions, which inspire us to extract relevant features computationally. In particular, the group interaction modality plays an important role, since human behaviors influence each other through interactions in a group. Second, effective multimodal ML models are needed to align and integrate the different modalities for accurate predictions. However, most previous work ignored the group interaction modality. Moreover, it adopted only early fusion or late fusion to combine different modalities, which is not optimal. This thesis presents methods to train models that take multimodal inputs in group interaction videos and predict human group behaviors. First, we develop an ML algorithm to automatically predict human interactions from videos, which is the basis for extracting interaction features and modeling group behaviors. Second, we propose a multimodal method to identify dominant people in videos from multiple modalities. Third, we study nervousness in human behavior by developing a hybrid method: group interaction feature engineering combined with individual facial embedding learning. Last, we introduce a multimodal fusion framework that enables us to predict how persuasive speakers are.
Overall, we develop one algorithm to extract group interactions and build three multimodal models to identify three kinds of human behavior in videos: dominance, nervousness, and persuasion. The experiments demonstrate the efficacy of the methods and analyze the modality-wise contributions.
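The abstract contrasts its fusion framework with the early- and late-fusion baselines it calls suboptimal. A short illustrative sketch of those two baselines (not the thesis' exact models), with assumed modality dimensions and generic classifiers, is given below.

```python
# Illustrative contrast between early fusion (concatenate per-modality
# features) and late fusion (average per-modality classifier scores).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 120
modalities = {m: rng.normal(size=(n, d))
              for m, d in [("audio", 20), ("visual", 30), ("interaction", 10)]}
y = (rng.random(n) < 0.5).astype(int)

# Early fusion: a single classifier over the concatenated feature vector.
X_early = np.hstack(list(modalities.values()))
early = LogisticRegression(max_iter=1000).fit(X_early, y)

# Late fusion: one classifier per modality, scores averaged at decision time.
per_mod = {m: LogisticRegression(max_iter=1000).fit(X, y)
           for m, X in modalities.items()}
late_scores = np.mean([clf.predict_proba(modalities[m])[:, 1]
                       for m, clf in per_mod.items()], axis=0)
print("early-fusion train acc:", early.score(X_early, y))
print("late-fusion train acc:", ((late_scores > 0.5) == y).mean())
```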
- …