Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
This paper presents a self-supervised method for visual detection of the
active speaker in a multi-person spoken interaction scenario. Active speaker
detection is a fundamental prerequisite for any artificial cognitive system
attempting to acquire language in social settings. The proposed method is
intended to complement the acoustic detection of the active speaker, thus
improving the system's robustness in noisy conditions. The method can detect an
arbitrary number of possibly overlapping active speakers based exclusively on
visual information about their faces. Furthermore, the method does not rely on
external annotations, thus complying with cognitive development. Instead, the
method uses information from the auditory modality to support learning in the
visual domain. This paper reports an extensive evaluation of the proposed
method using a large multi-person face-to-face interaction dataset. The results
show good performance in a speaker-dependent setting. However, in a
speaker-independent setting the proposed method yields significantly lower
performance. We believe that the proposed method represents an essential
component of any artificial cognitive system or robotic platform engaging in
social interactions.
Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems
Evaluating the Performance of Using Speaker Diarization for Speech Separation of In-Person Role-Play Dialogues
Developing professional communication skills, such as motivational interviewing, often requires experiential learning through expert instructor-guided role-plays between the trainee and a standard patient/actor. Because demand for such skills is growing in practice, e.g., among health care providers managing mental health challenges, chronic conditions, and substance misuse disorders, there is an urgent need to improve the efficacy and scalability of such role-play-based experiential learning, which is often bottlenecked by the time-consuming performance assessment process. To address this challenge, WSU is developing ReadMI (Real-time Assessment of Dialogue in Motivational Interviewing), a mobile AI solution that aims to provide automated performance assessment based on ASR and NLP. The main goal of this thesis research is to investigate current commercially available speaker diarization capabilities and evaluate their performance in separating the speech of the trainee from that of the standard patient/actor in an in-person role-play training environment, where crosstalk could interfere with the operation and performance of ReadMI. Specifically, this thesis research has: 1) identified the major commercially available speaker diarization systems, such as those from Google, Amazon, IBM, and Rev.ai; 2) designed and implemented corresponding evaluation systems that integrate these commercially available cloud services for operation in in-person role-play training environments; and 3) completed an experimental study that evaluated and compared the performance of the speaker diarization services from Google and Amazon. The main finding of this thesis is that current speaker diarization capabilities alone do not provide sufficient performance for our particular use case when integrated into ReadMI for operation in in-person role-play training environments.
However, this thesis research potentially provides a clear baseline reference for future developers integrating future speaker diarization capabilities into similar applications.
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved.
The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios
The CHiME challenges have played a significant role in the development and
evaluation of robust automatic speech recognition (ASR) systems. We introduce
the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task
comprises joint ASR and diarization in far-field settings with multiple, and
possibly heterogeneous, recording devices. Unlike previous challenges,
we evaluate systems on three diverse scenarios: CHiME-6, DiPCo, and Mixer 6. The
goal is for participants to devise a single system that can generalize across
different array geometries and use cases with no a priori information. Another
departure from earlier CHiME iterations is that participants are allowed to use
open-source pre-trained models and datasets. In this paper, we describe the
challenge design, motivation, and fundamental research questions in detail. We
also present the baseline system, which is fully array-topology agnostic and
features multi-channel diarization, channel selection, guided source separation
and a robust ASR model that leverages self-supervised speech representations
(SSLR).
- …