No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration
Recognizing who is speaking in a crowded scene is a key challenge towards the
understanding of the social interactions going on within. Detecting speaking
status from body movement alone opens the door for the analysis of social
scenes in which personal audio is not obtainable. Video and wearable sensors
make it possible to recognize speaking in an unobtrusive, privacy-preserving way.
When considering the video modality, in action recognition problems, a bounding
box is traditionally used to localize and segment out the target subject, to
then recognize the action taking place within it. However, cross-contamination,
occlusion, and the articulated nature of the human body, make this approach
challenging in a crowded scene. Here, we leverage articulated body poses both for
subject localization and in the subsequent speech detection stage. We show that
the selection of local features around pose keypoints has a positive effect on
generalization performance while also significantly reducing the number of
local features considered, making for a more efficient method. Using two
in-the-wild datasets with different viewpoints of subjects, we investigate the
role of cross-contamination in this effect. We additionally make use of
acceleration measured through wearable sensors for the same task, and present a
multimodal approach combining both methods.
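As an illustration of the localization idea above, the sketch below keeps small patches around pose keypoints instead of one large bounding box; the patch size and the keypoint interface are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: crop small patches around pose keypoints rather than
# one large bounding box, reducing the number of local features.
# Patch size and keypoint format are illustrative assumptions.
import numpy as np

def keypoint_patches(frame: np.ndarray, keypoints: np.ndarray,
                     patch: int = 16) -> list:
    """frame: H x W x 3 video frame; keypoints: K x 2 pixel coordinates."""
    h, w = frame.shape[:2]
    half = patch // 2
    crops = []
    for x, y in keypoints.astype(int):
        x0, x1 = max(0, x - half), min(w, x + half)
        y0, y1 = max(0, y - half), min(h, y + half)
        crops.append(frame[y0:y1, x0:x1])  # local region around one joint
    return crops
```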
Who is where? Matching people in video to wearable acceleration during crowded mingling events
We address the challenging problem of associating acceleration data from a wearable sensor with the corresponding spatio-temporal region of a person in video during crowded mingling scenarios. This is an important first step for multi-sensor behavior analysis using these two modalities. Clearly, as the number of people in a scene increases, there is also a need to robustly and automatically associate a region of the video with each person's device. We propose a hierarchical association approach which exploits the spatial context of the scene, outperforming the state-of-the-art approaches significantly. Moreover, we present experiments on matching from 3 to more than 130 acceleration and video streams which, to our knowledge, is significantly larger than prior works, where only up to 5 device streams are associated.
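A much-simplified baseline for this association problem (not the paper's hierarchical method) is to correlate each device's acceleration magnitude with the motion energy of each tracked video region and solve the resulting one-to-one assignment:

```python
# Simplified association baseline: Pearson correlation between device
# and video motion signals, then optimal one-to-one matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(accel: np.ndarray, video_motion: np.ndarray) -> np.ndarray:
    """accel: D x T device magnitudes; video_motion: P x T per-person
    motion magnitudes. Returns the matched person index per device."""
    a = accel - accel.mean(axis=1, keepdims=True)
    v = video_motion - video_motion.mean(axis=1, keepdims=True)
    a /= a.std(axis=1, keepdims=True) + 1e-8
    v /= v.std(axis=1, keepdims=True) + 1e-8
    cost = -(a @ v.T) / accel.shape[1]   # negative correlation as cost
    rows, cols = linear_sum_assignment(cost)
    return cols
```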
Impact of annotation modality on label quality and model performance in the automatic assessment of laughter in-the-wild
Laughter is considered one of the most overt signals of joy. Laughter is
well-recognized as a multimodal phenomenon but is most commonly detected by
sensing the sound of laughter. It is unclear how perception and annotation of
laughter differ when annotated from other modalities like video, via the body
movements of laughter. In this paper we take a first step in this direction by
asking if and how well laughter can be annotated when only audio, only video
(containing full body movement information) or audiovisual modalities are
available to annotators. We ask whether annotations of laughter are congruent
across modalities, and compare the effect that labeling modality has on machine
learning model performance. We compare annotations and models for laughter
detection, intensity estimation, and segmentation, three tasks common in
previous studies of laughter. Our analysis of more than 4000 annotations
acquired from 48 annotators revealed evidence of incongruity in the perception
of laughter and its intensity across modalities. Further analysis of
annotations against consolidated audiovisual reference annotations revealed
that recall was lower on average for video when compared to the audio
condition, but tended to increase with the intensity of the laughter samples.
Our machine learning experiments compared the performance of state-of-the-art
unimodal (audio-based, video-based and acceleration-based) and multi-modal
models for different combinations of input modalities, training label modality,
and testing label modality. Models with video and acceleration inputs had
similar performance regardless of training label modality, suggesting that it
may be entirely appropriate to train models for laughter detection from body
movements using video-acquired labels, despite their lower inter-rater
agreement.
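The recall comparison mentioned above can be pictured with a simple frame-level computation; the binary per-frame label format is an assumption for illustration:

```python
# Illustrative frame-level recall of one annotation modality against a
# consolidated audiovisual reference (binary labels are an assumption).
import numpy as np

def frame_recall(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Both arrays hold 0/1 laughter labels, one entry per frame."""
    positives = reference == 1
    if not positives.any():
        return float("nan")  # recall undefined without reference laughter
    return float((candidate[positives] == 1).mean())
```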
Estimating self-assessed personality from body movements and proximity in crowded mingling scenarios
This paper focuses on the automatic classification of self-assessed personality traits from the HEXACO inventory during crowded mingle scenarios. We exploit acceleration and proximity data from a wearable device hung around the neck. Unlike most state-of-the-art studies, addressing personality estimation during mingle scenarios provides a challenging social context, as people interact dynamically and freely in a face-to-face setting. While many former studies use audio to extract speech-related features, we present a novel method of extracting an individual's speaking status from a single body-worn triaxial accelerometer, which scales easily to large populations. Moreover, by fusing both speech- and movement-energy-related cues from acceleration alone, our experimental results show improvements on the estimation of Humility over features extracted from a single behavioral modality. We validated our method on 71 participants, obtaining an accuracy of 69% for Honesty, Conscientiousness, and Openness to Experience. To our knowledge, this is the largest validation of personality estimation carried out in such a social context with simple wearable sensors.
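A minimal sketch of the accelerometer-based speaking-status idea follows; the frequency band, window length, and threshold are illustrative assumptions rather than the paper's tuned parameters.

```python
# Hedged sketch: band-pass the acceleration magnitude and threshold its
# short-term energy as a binary speaking-status signal. Band, window
# and threshold values are assumptions, not the paper's parameters.
import numpy as np
from scipy.signal import butter, filtfilt

def speaking_status(acc: np.ndarray, fs: float = 20.0,
                    win_s: float = 1.0, thresh: float = 0.01) -> np.ndarray:
    """acc: T x 3 raw acceleration; returns one 0/1 label per window."""
    mag = np.linalg.norm(acc, axis=1)
    # Keep torso vibrations plausibly tied to speech; drop gravity/drift.
    b, a = butter(2, [1.0, 8.0], btype="bandpass", fs=fs)
    filt = filtfilt(b, a, mag)
    win = int(win_s * fs)
    n = len(filt) // win
    energy = (filt[: n * win].reshape(n, win) ** 2).mean(axis=1)
    return (energy > thresh).astype(int)
```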
Towards Analyzing and Predicting the Experience of Live Performances with Wearable Sensing
We present an approach to interpret the response of audiences to live performances by processing mobile sensor data. We apply our method on three different datasets obtained from three live performances, where each audience member wore a single tri-axial accelerometer and proximity sensor embedded inside a smart sensor pack. Using these sensor data, we developed a novel approach to predict audience members' self-reported experience of the performances in terms of enjoyment, immersion, willingness to recommend the event to others, and change in mood. The proposed approach uses an unsupervised method to identify informative intervals of the event, using the linkage of the audience members' bodily movements, and uses data from these intervals only to estimate the audience members' experience. We also analyze how the relative location of members of the audience can affect their experience and present an automatic way of recovering neighborhood information based on proximity sensors. We further show that the linkage of the audience members' bodily movements is informative of memorable moments which were later reported by the audience.
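One way to picture the "linkage" computation described above is a mean pairwise correlation of movement magnitudes over sliding windows; the window length and selection rule are assumptions for illustration.

```python
# Illustrative "linkage" score: windows in which audience members'
# movements correlate strongly are treated as informative intervals.
import numpy as np

def informative_intervals(mag: np.ndarray, win: int = 200,
                          top_k: int = 5) -> list:
    """mag: N x T movement magnitudes; returns start indices of the
    top_k non-overlapping windows by mean pairwise correlation."""
    n, t = mag.shape
    starts = list(range(0, t - win + 1, win))
    scores = []
    for s in starts:
        c = np.corrcoef(mag[:, s:s + win])   # N x N correlation matrix
        scores.append(c[np.triu_indices(n, k=1)].mean())
    order = np.argsort(scores)[::-1][:top_k]
    return [starts[i] for i in order]
```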
The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates
We present MatchNMingle, a novel multimodal/multisensor dataset for the analysis of free-standing conversational groups and speed-dates in-the-wild. MatchNMingle leverages the use of wearable devices and overhead cameras to record the social interactions of 92 people during real-life speed-dates, followed by a cocktail party. To our knowledge, MatchNMingle has the largest number of participants, longest recording time, and largest set of manual annotations for social actions available in this context in a real-life scenario. It consists of 2 hours of data from wearable acceleration, binary proximity, video, audio, personality surveys, frontal pictures, and speed-date responses. Participants' positions and group formations were manually annotated, as were social actions (e.g., speaking, hand gestures) for 30 minutes at 20 fps, making it the first dataset to incorporate the annotation of such cues in this context. We present an empirical analysis of the performance of crowdsourcing workers against trained annotators in simple and complex annotation tasks, finding that although efficient for simple tasks, using crowdsourcing workers for more complex tasks like social action annotation led to additional overhead and poor inter-annotator agreement compared to trained annotators (differences up to 0.4 in Fleiss' Kappa coefficients). We also provide example experiments of how MatchNMingle can be used.
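The inter-annotator agreement figures quoted above use Fleiss' Kappa; the standard computation from an items-by-categories count matrix looks as follows (generic formula, not MatchNMingle-specific code).

```python
# Standard Fleiss' kappa from an S x C count matrix, where counts[i, j]
# is how many of the n annotators assigned category j to item i.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    s = counts.shape[0]
    n = counts.sum(axis=1)[0]               # annotators per item
    p_j = counts.sum(axis=0) / (s * n)      # overall category proportions
    p_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()
    return float((p_bar - p_e) / (1 - p_e))
```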
Listen to the real experts: Detecting need of caregiver response in a NICU using multimodal monitoring signals
Vital signs are used in Neonatal Intensive Care Units (NICUs) to monitor the state of multiple patients at once. Alarms are triggered if a vital sign falls below or rises above a predefined threshold. Numerous alarms sound each hour, which can translate into an overload for the medical team, known as alarm fatigue. Yet many of these alarms do not require immediate clinical action by the caregivers. In this paper we automatically detect moments that need an immediate response (i.e. interaction with the patient) from the medical team in NICUs by using caregiver response to the patient, which is based on the interpretation of vital signs and of nonverbal cues (e.g. movements) delivered by patients. The ultimate goal of such an approach is to reduce the overload of alarms while maintaining patient safety. We use features extracted from the electrocardiogram (ECG) and pulse oximetry (SpO2) sensors of the patient, as most unplanned interactions between patient and caregivers are due to deteriorations. Since in our unit an alarm can only be paused or silenced manually at the bedside, we used this information as a prior for caregiver response. We also propose different labeling schemes for classification, each representative of a possible interaction scenario within the nature of our problem. We achieved a general detection of caregiver response with a mean AUC of 0.82. We also show that when trained only with stable and truly deteriorating (critical state) samples, the classifiers can better learn the difference between alarms that need no immediate response and those that do. In addition, we present an analysis of the posterior probabilities over time for different labeling schemes, and use it to speculate about the reasons behind some failure cases.
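A hedged sketch of this classification setup: simple window statistics from the monitoring signals feed a classifier evaluated by AUC. The features and model here are illustrative, not the paper's pipeline.

```python
# Illustrative window features from heart-rate and SpO2 traces feeding
# a classifier scored with ROC AUC; not the paper's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def window_features(hr: np.ndarray, spo2: np.ndarray) -> np.ndarray:
    """hr, spo2: W x T arrays, one row of samples per window."""
    return np.stack([hr.mean(1), hr.std(1),
                     spo2.mean(1), spo2.min(1)], axis=1)

def evaluate(hr: np.ndarray, spo2: np.ndarray, y: np.ndarray) -> float:
    """y: 1 if the window was followed by a caregiver interaction."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, window_features(hr, spo2), y,
                           scoring="roc_auc", cv=5).mean()
```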
Estimation of Heart Rate Directly from ECG Spectrogram in Neonate Intensive Care Units
This paper presents a simple yet novel method to estimate the heart frequency (HF) of neonates directly from the ECG signal, instead of using the RR-interval signals as generally done in clinical practice. From this, the heart rate (HR) can be derived. Thus, we avoid the use of peak detectors and the inherent errors that come with them. Our method leverages the highest Power Spectral Densities (PSD) of the ECG, for the bins around the frequencies related to neonatal heart rates, as they change in time (spectrograms). We tested our approach on 6 days of monitoring data for 52 patients in a Neonate Intensive Care Unit (NICU) and compared against the HR from a commercial monitor, which produced one sample per second. The comparison showed that 92.4% of the samples have a difference lower than 5 bpm. Moreover, we obtained a median MAE (Mean Absolute Error) between subjects of 2.28 bpm and a median RMSE (Root Mean Square Error) of 5.82 bpm. Although tested on neonates, we hypothesize that this method can also be customized for other populations. Finally, we analyzed the failure cases of our method and found that errors due to moments with higher PSD in the lower frequencies co-occurred with the presence of critical alarms related to other physiological systems (e.g. desaturation).
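A minimal sketch of the described estimator: compute the ECG spectrogram and read the heart frequency as the peak-PSD bin inside a plausible neonatal band. The band limits and window length are assumptions, not the paper's settings.

```python
# Hedged sketch: heart rate per spectrogram column as 60x the frequency
# of the strongest PSD bin within an assumed neonatal band.
import numpy as np
from scipy.signal import spectrogram

def hr_from_ecg(ecg: np.ndarray, fs: float,
                lo_bpm: float = 90.0, hi_bpm: float = 230.0) -> np.ndarray:
    """Returns one heart-rate estimate (bpm) per time step."""
    f, _, sxx = spectrogram(ecg, fs=fs, nperseg=int(8 * fs))
    band = (f >= lo_bpm / 60.0) & (f <= hi_bpm / 60.0)
    peak = sxx[band].argmax(axis=0)       # strongest bin per column
    return 60.0 * f[band][peak]
```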