SALSA: A Novel Dataset for Multimodal Group Behavior Analysis
Studying free-standing conversational groups (FCGs) in unstructured social
settings (e.g., a cocktail party) is gratifying due to the wealth of information
available at the group (mining social networks) and individual (recognizing
native behavioral and personality traits) levels. However, analyzing social
scenes involving FCGs is also highly challenging due to the difficulty in
extracting behavioral cues such as target locations, their speaking activity
and head/body pose due to crowdedness and presence of extreme occlusions. To
this end, we propose SALSA, a novel dataset facilitating multimodal and
Synergetic sociAL Scene Analysis, and make two main contributions to research
on automated social interaction analysis: (1) SALSA records social interactions
among 18 participants in a natural, indoor environment for over 60 minutes,
under poster-presentation and cocktail-party contexts that present
difficulties in the form of low-resolution images, lighting variations,
numerous occlusions, reverberations and interfering sound sources; (2) To
alleviate these problems we facilitate multimodal analysis by recording the
social interplay using four static surveillance cameras and sociometric badges
worn by each participant, comprising microphone, accelerometer, Bluetooth
and infrared sensors. In addition to raw data, we also provide annotations
concerning individuals' personality as well as their position, head, body
orientation and F-formation information over the entire event duration. Through
extensive experiments with state-of-the-art approaches, we show (a) the
limitations of current methods and (b) how the recorded multiple cues
synergetically aid automatic analysis of social interactions. SALSA is
available at http://tev.fbk.eu/salsa.
Synesthesia: Detecting Screen Content via Remote Acoustic Side Channels
We show that subtle acoustic noises emanating from within computer screens
can be used to detect the content displayed on the screens. This sound can be
picked up by ordinary microphones built into webcams or screens, and is
inadvertently transmitted to other parties, e.g., during a videoconference call
or archived recordings. It can also be recorded by a smartphone or "smart
speaker" placed on a desk next to the screen, or from as far as 10 meters away
using a parabolic microphone.
Empirically demonstrating various attack scenarios, we show how this channel
can be used for real-time detection of on-screen text, or users' input into
on-screen virtual keyboards. We also demonstrate how an attacker can analyze
the audio received during a video call (e.g., on Google Hangouts) to infer
whether the other side is browsing the web in lieu of watching the video call,
and which web site is displayed on their screen.
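The channel described above is fundamentally spectral: content-dependent acoustic leakage from the screen's electronics shows up as weak near-ultrasonic tones in the recording. As a toy sketch of the first analysis step, locating such a spectral peak, the following uses entirely synthetic data; the 22.35 kHz tone, sample rate, and noise level are invented for illustration and are not the paper's measurements.

```python
import numpy as np

FS = 96_000      # sample rate (Hz); high enough to capture near-ultrasonic leakage
DURATION = 1.0   # seconds of "recording"

# Toy stand-in for a recording: a weak "screen" tone buried in noise.
rng = np.random.default_rng(0)
t = np.arange(int(FS * DURATION)) / FS
leak_hz = 22_350.0
signal = 0.01 * np.sin(2 * np.pi * leak_hz * t) + rng.normal(0, 0.005, t.size)

# Locate the strongest spectral peak above 20 kHz.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, 1 / FS)
band = freqs > 20_000
peak_hz = freqs[band][np.argmax(spectrum[band])]
print(round(peak_hz))  # recovers the injected 22,350 Hz tone
```

A real attack would go further, correlating how this peak is modulated over time with the content being rendered; this sketch only shows that such a peak is trivially separable from broadband noise.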
A Speaker Diarization System for Studying Peer-Led Team Learning Groups
Peer-led team learning (PLTL) is a model for teaching STEM courses where
small student groups meet periodically to collaboratively discuss coursework.
Automatic analysis of PLTL sessions would help education researchers gain
insight into how learning outcomes are affected by individual participation,
group behavior, team dynamics, etc. Toward this end, speech and language
technology can help, and speaker diarization technology will lay the foundation
for analysis. In this study, a new corpus called CRSS-PLTL is established, which
contains speech data from 5 PLTL teams over a semester (10 sessions per team
with 5-to-8 participants in each team). In CRSS-PLTL, every participant wears a
LENA device (portable audio recorder) that provides multiple audio recordings
of the event. Our proposed solution is unsupervised and contains a new online
speaker change detection algorithm, termed the G3 algorithm, in conjunction with
Hausdorff-distance-based clustering to provide improved detection accuracy.
Additionally, we also exploit cross channel information to refine our
diarization hypothesis. The proposed system provides good improvements in
diarization error rate (DER) over the baseline LIUM system. We also present
higher level analysis such as the number of conversational turns taken in a
session, and speaking-time duration (participation) for each speaker.Comment: 5 Pages, 2 Figures, 2 Tables, Proceedings of INTERSPEECH 2016, San
Francisco, US
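The abstract does not spell out the feature pipeline behind the clustering step, but the Hausdorff distance it relies on is a standard set-to-set metric. Below is a generic sketch of the symmetric Hausdorff distance between two sets of frame-level feature vectors, on toy Gaussian data rather than CRSS-PLTL features:

```python
import numpy as np

def hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two sets of feature vectors.

    a, b: arrays of shape (n, d) and (m, d). Two speech segments whose
    frame-level features lie close under this metric are candidates for
    merging into one speaker cluster.
    """
    # Pairwise Euclidean distances between every frame in a and b.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Directed distances: worst-case nearest-neighbour gap in each direction.
    d_ab = d.min(axis=1).max()
    d_ba = d.min(axis=0).max()
    return float(max(d_ab, d_ba))

# Toy usage: two segments drawn from the same "speaker" distribution are
# far closer than segments from well-separated distributions.
rng = np.random.default_rng(1)
seg1 = rng.normal(0.0, 0.1, (50, 13))  # e.g. 13-dim MFCC-like frames
seg2 = rng.normal(0.0, 0.1, (40, 13))
seg3 = rng.normal(5.0, 0.1, (40, 13))
print(hausdorff(seg1, seg2) < hausdorff(seg1, seg3))  # True
```

Because it takes a worst-case nearest-neighbour gap, the Hausdorff distance is sensitive to outlier frames; production systems typically combine it with the change-detection front end, as the proposed pipeline does.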
A Voice is Worth a Thousand Words: The Implications of the Micro-Coding of Social Signals in Speech for Trust Research
While self-report measures are often highly reliable for field research on trust (Mayer and Davis, 1999), subjects often cannot complete surveys during real-time interactions. In contrast, the social signals embedded in the non-linguistic elements of conversations can be captured in real time and extracted with the assistance of computer coding. This chapter seeks to understand how computer-coded social signals are related to interpersonal trust.
Automatic Environmental Sound Recognition: Performance versus Computational Cost
In the context of the Internet of Things (IoT), sound sensing applications
are required to run on embedded platforms where notions of product pricing and
form factor impose hard constraints on the available computing power. Whereas
Automatic Environmental Sound Recognition (AESR) algorithms are most often
developed with limited consideration for computational cost, this article
investigates which AESR algorithm can make the most of a limited amount of
computing power by comparing sound classification performance as a function of
computational cost. Results suggest that Deep Neural Networks yield the best
computational cost. Results suggest that Deep Neural Networks yield the best
ratio of sound classification accuracy across a range of computational costs,
while Gaussian Mixture Models offer a reasonable accuracy at a consistently
small cost, and Support Vector Machines stand between both in terms of
compromise between accuracy and computational cost
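As an illustration of the cost axis along which such models are compared, here is a back-of-envelope sketch of per-frame multiply-accumulate (MAC) counts for two of the model families. The model sizes are invented for illustration and are not taken from the article:

```python
def gmm_macs(n_components: int, dim: int) -> int:
    """Rough per-frame MAC count for a diagonal-covariance GMM:
    per component, ~2 multiply-adds per dimension (centre, square, scale),
    plus one for the mixture weight."""
    return n_components * (2 * dim + 1)

def dnn_macs(layer_sizes: list[int]) -> int:
    """Rough per-frame MAC count for a fully connected DNN:
    one multiply-add per weight in each layer."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Invented example sizes: a 32-component GMM and a small two-hidden-layer
# DNN, both over 39-dimensional (e.g. MFCC-like) input frames.
print(gmm_macs(32, 39))              # 2528 MACs per frame
print(dnn_macs([39, 128, 128, 10]))  # 22656 MACs per frame
```

Even this crude count shows the trade-off the article measures empirically: a small GMM is roughly an order of magnitude cheaper per frame than a modest DNN, so on an embedded budget the question becomes whether the DNN's extra accuracy is worth its extra MACs.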
Slocum gliders provide accurate near real-time estimates of baleen whale presence from human-reviewed passive acoustic detection information
© The Author(s), 2020. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published as: Baumgartner, M. F., Bonnell, J., Corkeron, P. J., Van Parijs, S. M., Hotchkin, C., Hodges, B. A., Thornton, J. B., Mensi, B. L., & Bruner, S. M. (2020). Slocum gliders provide accurate near real-time estimates of baleen whale presence from human-reviewed passive acoustic detection information. Frontiers in Marine Science, 7, 100. doi:10.3389/fmars.2020.00100
Mitigating the effects of human activities on marine mammals often depends on monitoring animal occurrence over long time scales, large spatial scales, and in real time. Passive acoustics, particularly from autonomous vehicles, is a promising approach to meeting this need. We have previously developed the capability to record, detect, classify, and transmit to shore information about the tonal sounds of baleen whales in near real time from long-endurance ocean gliders. We have recently developed a protocol by which a human analyst reviews this information to determine the presence of marine mammals, and the results of this review are automatically posted to a publicly accessible website, sent directly to interested parties via email or text, and made available to stakeholders via a number of public and private digital applications. We evaluated the performance of this system during two 3.75-month Slocum glider deployments in the southwestern Gulf of Maine during the spring seasons of 2015 and 2016. Near real-time detections of humpback, fin, sei, and North Atlantic right whales were compared to detections of these species from simultaneously recorded audio. Data from another 2016 glider deployment in the same area were also used to compare results between three different analysts to determine repeatability of results both among and within analysts. False detection (occurrence) rates on daily time scales were 0% for all species. Daily missed detection rates ranged from 17% to 24%. Agreement between two trained novice analysts and an experienced analyst was greater than 95% for fin, sei, and right whales, while agreement was 83–89% for humpback whales owing to the more subjective process for detecting this species. Our results indicate that the presence of baleen whales can be accurately determined using information about tonal sounds transmitted in near real time from Slocum gliders. The system is being used operationally to monitor baleen whales in United States, Canadian, and Chilean waters, and has been particularly useful for monitoring the critically endangered North Atlantic right whale throughout the northwestern Atlantic Ocean.
Funding for this project was provided by the Environmental Security Technology Certification Program of the U.S. Department of Defense and the U.S. Navy’s Living Marine Resources Program.
ConfLab: A Rich Multimodal Multisensor Dataset of Free-Standing Social Interactions in the Wild
Recording the dynamics of unscripted human interactions in the wild is
challenging due to the delicate trade-offs between several factors: participant
privacy, ecological validity, data fidelity, and logistical overheads. To
address these, following a 'datasets for the community by the community' ethos,
we propose the Conference Living Lab (ConfLab): a new concept for multimodal
multisensor data collection of in-the-wild free-standing social conversations.
For the first instantiation of ConfLab described here, we organized a real-life
professional networking event at a major international conference. Involving 48
conference attendees, the dataset captures a diverse mix of status,
acquaintance, and networking motivations. Our capture setup improves upon the
data fidelity of prior in-the-wild datasets while retaining privacy
sensitivity: 8 videos (1920x1080, 60 fps) from a non-invasive overhead view,
and custom wearable sensors with onboard recording of body motion (full 9-axis
IMU), privacy-preserving low-frequency audio (1250 Hz), and Bluetooth-based
proximity. Additionally, we developed custom solutions for distributed hardware
synchronization at acquisition, and time-efficient continuous annotation of
body keypoints and actions at high sampling rates. Our benchmarks showcase some
of the open research tasks related to in-the-wild privacy-preserving social
data analysis: keypoints detection from overhead camera views, skeleton-based
no-audio speaker detection, and F-formation detection.
(v2 is the version submitted to the NeurIPS 2022 Datasets and Benchmarks Track)
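F-formation detection, one of the benchmark tasks above, is often framed geometrically: people in a conversational group orient their bodies so that their transactional segments overlap in a shared "o-space". The following is only a minimal toy sketch of that idea, with a greedy grouping rule and invented stride/radius parameters; it is not the method used by any of the benchmarks:

```python
import math

def o_space_center(x, y, theta, stride=0.7):
    """Project a person's transactional-segment centre: a point `stride`
    metres along their body orientation (theta, in radians)."""
    return (x + stride * math.cos(theta), y + stride * math.sin(theta))

def detect_f_formations(people, radius=0.5):
    """Greedy toy grouping: people whose projected o-space centres fall
    within `radius` metres of an existing group's member are joined to it."""
    centers = [o_space_center(x, y, th) for x, y, th in people]
    groups = []
    for i, c in enumerate(centers):
        for g in groups:
            if any(math.dist(c, centers[j]) < radius for j in g):
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

# Two people facing each other ~1.4 m apart share an o-space;
# a third person standing elsewhere forms their own singleton.
people = [(0.0, 0.0, 0.0), (1.4, 0.0, math.pi), (3.0, 3.0, 0.0)]
print(detect_f_formations(people))  # [[0, 1], [2]]
```

Real F-formation detectors replace the greedy rule with proper clustering (e.g. over candidate o-space centres) and must cope with noisy head/body pose estimates, which is exactly what makes the annotated poses in datasets like ConfLab and SALSA valuable.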