
    SALSA: A Novel Dataset for Multimodal Group Behavior Analysis

    Studying free-standing conversational groups (FCGs) in unstructured social settings (e.g., a cocktail party) is gratifying due to the wealth of information available at the group (mining social networks) and individual (recognizing nascent behavioral and personality traits) levels. However, analyzing social scenes involving FCGs is also highly challenging, as crowdedness and extreme occlusions make it difficult to extract behavioral cues such as target locations, speaking activity, and head/body pose. To this end, we propose SALSA, a novel dataset facilitating multimodal and Synergetic sociAL Scene Analysis, and make two main contributions to research on automated social interaction analysis: (1) SALSA records social interactions among 18 participants in a natural, indoor environment for over 60 minutes, under poster-presentation and cocktail-party contexts presenting difficulties in the form of low-resolution images, lighting variations, numerous occlusions, reverberations, and interfering sound sources; (2) to alleviate these problems, we facilitate multimodal analysis by recording the social interplay using four static surveillance cameras and sociometric badges worn by each participant, comprising microphone, accelerometer, Bluetooth, and infrared sensors. In addition to raw data, we also provide annotations of individuals' personality as well as their position, head and body orientation, and F-formation information over the entire event duration. Through extensive experiments with state-of-the-art approaches, we show (a) the limitations of current methods and (b) how the recorded multiple cues synergetically aid automatic analysis of social interactions. SALSA is available at http://tev.fbk.eu/salsa. (Comment: 14 pages, 11 figures)
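
    For a flavor of how position and body-orientation annotations like SALSA's can drive F-formation analysis, here is a minimal sketch (not the dataset's tooling; the stride parameter and toy coordinates are assumptions) that estimates a candidate group's o-space center by projecting each member forward along their orientation:

        import numpy as np

        def o_space_center(positions, orientations, stride=0.7):
            """Estimate the o-space center of a candidate F-formation.

            Each member's transactional segment is approximated by a point
            `stride` meters ahead of them along their body orientation; the
            o-space center is the mean of those projected points.

            positions:    (N, 2) ground-plane coordinates in meters.
            orientations: (N,) body orientations in radians.
            """
            positions = np.asarray(positions, dtype=float)
            orientations = np.asarray(orientations, dtype=float)
            headings = np.stack([np.cos(orientations), np.sin(orientations)], axis=1)
            return (positions + stride * headings).mean(axis=0)

        # Three people standing roughly in a circle, facing inward.
        pos = [(0.0, 0.0), (1.2, 0.0), (0.6, 1.0)]
        ori = [np.arctan2(0.4, 0.6), np.arctan2(0.4, -0.6), -np.pi / 2]
        print(o_space_center(pos, ori))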

    Synesthesia: Detecting Screen Content via Remote Acoustic Side Channels

    We show that subtle acoustic noises emanating from within computer screens can be used to detect the content displayed on the screens. This sound can be picked up by ordinary microphones built into webcams or screens, and is inadvertently transmitted to other parties, e.g., during a videoconference call or in archived recordings. It can also be recorded by a smartphone or "smart speaker" placed on a desk next to the screen, or from as far as 10 meters away using a parabolic microphone. Empirically demonstrating various attack scenarios, we show how this channel can be used for real-time detection of on-screen text, or of users' input into on-screen virtual keyboards. We also demonstrate how an attacker can analyze the audio received during a video call (e.g., on Google Hangouts) to infer whether the other side is browsing the web instead of watching the video call, and which web site is displayed on their screen.
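
    As a rough sketch of the signal-processing front end such an attack implies, the snippet below computes a log spectrogram of a recorded trace so a classifier can look for content-dependent tonal components; the file name, FFT parameters, and the 10–20 kHz band are illustrative assumptions, not the paper's settings:

        import numpy as np
        from scipy import signal
        from scipy.io import wavfile

        # Load a recording made near the screen (path is illustrative).
        rate, audio = wavfile.read("near_screen_recording.wav")
        if audio.ndim > 1:          # mix down if the microphone is stereo
            audio = audio.mean(axis=1)

        # Short-time spectrogram; screen-induced tones sit in a narrow
        # high-frequency band, so fine frequency resolution helps.
        freqs, times, spec = signal.spectrogram(audio, fs=rate,
                                                nperseg=4096, noverlap=3072)

        # Keep an illustrative band and log-compress it as classifier input.
        band = (freqs >= 10_000) & (freqs <= 20_000)
        features = np.log1p(spec[band])
        print(features.shape)       # (n_band_bins, n_frames)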

    A Speaker Diarization System for Studying Peer-Led Team Learning Groups

    Peer-led team learning (PLTL) is a model for teaching STEM courses in which small student groups meet periodically to collaboratively discuss coursework. Automatic analysis of PLTL sessions would help education researchers gain insight into how learning outcomes are impacted by individual participation, group behavior, team dynamics, etc. Towards this, speech and language technology can help, and speaker diarization technology lays the foundation for the analysis. In this study, a new corpus called CRSS-PLTL is established, containing speech data from 5 PLTL teams over a semester (10 sessions per team, with 5-to-8 participants in each team). In CRSS-PLTL, every participant wears a LENA device (a portable audio recorder), which provides multiple audio recordings of each event. Our proposed solution is unsupervised and contains a new online speaker change detection algorithm, termed the G3 algorithm, used in conjunction with Hausdorff-distance-based clustering to provide improved detection accuracy. Additionally, we exploit cross-channel information to refine our diarization hypothesis. The proposed system provides good improvements in diarization error rate (DER) over the baseline LIUM system. We also present higher-level analyses such as the number of conversational turns taken in a session and the speaking-time duration (participation) of each speaker. (Comment: 5 pages, 2 figures, 2 tables, Proceedings of INTERSPEECH 2016, San Francisco, USA)
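
    To make the evaluation metric concrete, here is a minimal frame-level diarization error rate (DER) sketch; it assumes hypothesis speaker IDs are already mapped to reference IDs, whereas standard scoring tools (e.g., NIST md-eval) additionally find the optimal mapping and apply a forgiveness collar:

        import numpy as np

        def frame_der(ref, hyp, non_speech=0):
            """Frame-level diarization error rate.

            ref, hyp: equal-length integer label sequences, one label per
            frame, with `non_speech` marking silence. DER is the sum of
            missed speech, false-alarm speech, and speaker confusion,
            normalized by the total reference speech time.
            """
            ref, hyp = np.asarray(ref), np.asarray(hyp)
            speech = ref != non_speech
            missed = np.sum(speech & (hyp == non_speech))
            false_alarm = np.sum(~speech & (hyp != non_speech))
            confusion = np.sum(speech & (hyp != non_speech) & (hyp != ref))
            return (missed + false_alarm + confusion) / max(np.sum(speech), 1)

        ref = [1, 1, 1, 2, 2, 0, 0, 2]
        hyp = [1, 1, 2, 2, 2, 0, 2, 2]
        print(f"DER = {frame_der(ref, hyp):.2f}")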

    A Voice is Worth a Thousand Words: The Implications of the Micro-Coding of Social Signals in Speech for Trust Research

    While self-report measures are often highly reliable for field research on trust (Mayer and Davis, 1999), subjects often cannot complete surveys during real-time interactions. In contrast, the social signals embedded in the non-linguistic elements of conversations can be captured in real time and extracted with the assistance of computer coding. This chapter seeks to understand how computer-coded social signals are related to interpersonal trust.
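
    For a flavor of what computer-coded social signals look like in practice, this toy sketch (the segments and feature choices are illustrative, not the chapter's coding scheme) derives two classic non-linguistic features, speaking-time share and turn counts, from voice-activity segments:

        from collections import defaultdict

        # Toy voice-activity segments: (speaker, start_s, end_s).
        segments = [("A", 0.0, 3.2), ("B", 3.4, 5.0), ("A", 5.1, 9.0),
                    ("B", 9.2, 9.8), ("A", 10.0, 12.5)]

        speaking_time = defaultdict(float)
        turns = defaultdict(int)
        prev_speaker = None
        for speaker, start, end in segments:
            speaking_time[speaker] += end - start
            if speaker != prev_speaker:   # a new turn begins on speaker change
                turns[speaker] += 1
            prev_speaker = speaker

        total = sum(speaking_time.values())
        for speaker in sorted(speaking_time):
            share = speaking_time[speaker] / total
            print(f"{speaker}: {share:.0%} of talk time, {turns[speaker]} turns")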

    Automatic Environmental Sound Recognition: Performance versus Computational Cost

    In the context of the Internet of Things (IoT), sound sensing applications are required to run on embedded platforms where notions of product pricing and form factor impose hard constraints on the available computing power. Whereas Automatic Environmental Sound Recognition (AESR) algorithms are most often developed with limited consideration for computational cost, this article asks which AESR algorithm can make the most of a limited amount of computing power by comparing sound classification performance as a function of computational cost. Results suggest that deep neural networks yield the best ratio of sound classification accuracy across a range of computational costs, that Gaussian mixture models offer reasonable accuracy at a consistently small cost, and that support vector machines stand between the two in their compromise between accuracy and computational cost.
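
    Below is a hedged sketch of the kind of accuracy-versus-cost comparison the article describes, using synthetic stand-in features and wall-clock inference time as a crude cost proxy (the article's own feature sets and cost measure differ):

        import time
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.mixture import GaussianMixture
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier
        from sklearn.svm import SVC

        # Stand-in features (e.g., per-frame spectral descriptors) and labels.
        X, y = make_classification(n_samples=2000, n_features=40, n_classes=4,
                                   n_informative=20, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        models = {"SVM": SVC(), "DNN (small MLP)": MLPClassifier(max_iter=500)}
        for name, model in models.items():
            model.fit(X_tr, y_tr)
            t0 = time.perf_counter()
            acc = model.score(X_te, y_te)
            cost = time.perf_counter() - t0
            print(f"{name}: accuracy={acc:.3f}, inference={cost * 1e3:.1f} ms")

        # GMM classifier: one mixture per class, predict the most likely class.
        gmms = [GaussianMixture(n_components=4, random_state=0).fit(X_tr[y_tr == c])
                for c in np.unique(y_tr)]
        t0 = time.perf_counter()
        scores = np.stack([g.score_samples(X_te) for g in gmms], axis=1)
        acc = np.mean(scores.argmax(axis=1) == y_te)
        cost = time.perf_counter() - t0
        print(f"GMM: accuracy={acc:.3f}, inference={cost * 1e3:.1f} ms")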

    Slocum gliders provide accurate near real-time estimates of baleen whale presence from human-reviewed passive acoustic detection information

    Mitigating the effects of human activities on marine mammals often depends on monitoring animal occurrence over long time scales, large spatial scales, and in real time. Passive acoustics, particularly from autonomous vehicles, is a promising approach to meeting this need. We have previously developed the capability to record, detect, classify, and transmit to shore information about the tonal sounds of baleen whales in near real time from long-endurance ocean gliders. We have recently developed a protocol by which a human analyst reviews this information to determine the presence of marine mammals, and the results of this review are automatically posted to a publicly accessible website, sent directly to interested parties via email or text, and made available to stakeholders via a number of public and private digital applications. We evaluated the performance of this system during two 3.75-month Slocum glider deployments in the southwestern Gulf of Maine during the spring seasons of 2015 and 2016. Near real-time detections of humpback, fin, sei, and North Atlantic right whales were compared to detections of these species from simultaneously recorded audio. Data from another 2016 glider deployment in the same area were also used to compare results between three different analysts to determine repeatability of results both among and within analysts. False detection (occurrence) rates on daily time scales were 0% for all species. Daily missed detection rates ranged from 17 to 24%. Agreement between two trained novice analysts and an experienced analyst was greater than 95% for fin, sei, and right whales, while agreement was 83–89% for humpback whales owing to the more subjective process for detecting this species. Our results indicate that the presence of baleen whales can be accurately determined using information about tonal sounds transmitted in near real time from Slocum gliders. The system is being used operationally to monitor baleen whales in United States, Canadian, and Chilean waters, and has been particularly useful for monitoring the critically endangered North Atlantic right whale throughout the northwestern Atlantic Ocean. Funding for this project was provided by the Environmental Security Technology Certification Program of the U.S. Department of Defense and the U.S. Navy's Living Marine Resources Program. (Published in Frontiers in Marine Science, 7 (2020): 100, doi:10.3389/fmars.2020.00100, under a Creative Commons Attribution License.)
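
    To make the reported metrics concrete, here is a toy sketch of how daily missed and false detection rates can be computed by comparing near-real-time presence calls against the archived-audio review treated as truth (the data values are invented; the authors' pipeline is more involved):

        # Daily presence calls for one species: near-real-time (NRT) system
        # vs. the archived-audio review treated as ground truth.
        days = range(1, 8)
        archive = {1: True, 2: True, 3: False, 4: True, 5: False, 6: True, 7: True}
        nrt     = {1: True, 2: False, 3: False, 4: True, 5: False, 6: True, 7: True}

        present_days = [d for d in days if archive[d]]
        absent_days = [d for d in days if not archive[d]]

        missed = sum(not nrt[d] for d in present_days) / len(present_days)
        false = sum(nrt[d] for d in absent_days) / len(absent_days)
        print(f"daily missed detection rate: {missed:.0%}")
        print(f"daily false detection rate:  {false:.0%}")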

    ConfLab: A Rich Multimodal Multisensor Dataset of Free-Standing Social Interactions in the Wild

    Recording the dynamics of unscripted human interactions in the wild is challenging due to the delicate trade-offs between several factors: participant privacy, ecological validity, data fidelity, and logistical overheads. To address these, following a 'datasets for the community by the community' ethos, we propose the Conference Living Lab (ConfLab): a new concept for multimodal multisensor data collection of in-the-wild free-standing social conversations. For the first instantiation of ConfLab described here, we organized a real-life professional networking event at a major international conference. Involving 48 conference attendees, the dataset captures a diverse mix of status, acquaintance, and networking motivations. Our capture setup improves upon the data fidelity of prior in-the-wild datasets while retaining privacy sensitivity: 8 videos (1920x1080, 60 fps) from a non-invasive overhead view, and custom wearable sensors with onboard recording of body motion (full 9-axis IMU), privacy-preserving low-frequency audio (1250 Hz), and Bluetooth-based proximity. Additionally, we developed custom solutions for distributed hardware synchronization at acquisition, and time-efficient continuous annotation of body keypoints and actions at high sampling rates. Our benchmarks showcase some of the open research tasks related to in-the-wild privacy-preserving social data analysis: keypoint detection from overhead camera views, skeleton-based no-audio speaker detection, and F-formation detection. (Comment: v2 is the version submitted to the NeurIPS 2022 Datasets and Benchmarks Track)
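
    As a small illustration of the stream-alignment problem such multisensor datasets pose, the sketch below resamples a toy IMU channel onto 60 fps video frame times by linear interpolation; ConfLab itself synchronizes in hardware at acquisition, so this is only a downstream-analysis illustration with assumed rates:

        import numpy as np

        # Align a wearable IMU stream to 60 fps video frames on a shared clock.
        imu_rate, video_fps, duration = 56.0, 60.0, 2.0   # rates are assumptions
        imu_t = np.arange(0, duration, 1 / imu_rate)
        imu_accel_x = np.sin(2 * np.pi * 1.5 * imu_t)     # stand-in accelerometer axis

        frame_t = np.arange(0, duration, 1 / video_fps)
        accel_at_frames = np.interp(frame_t, imu_t, imu_accel_x)
        print(accel_at_frames.shape)   # one IMU sample per video frame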