Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments by speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. For speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken: their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved
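A common metric-based segmentation criterion reviewed in surveys like this one is the Bayesian Information Criterion (ΔBIC), which compares modelling two adjacent windows with one Gaussian versus two. The sketch below is a minimal one-dimensional illustration of that idea, not the survey's own algorithm; real systems apply it to multivariate MFCC windows with full covariances.

```python
import math

def delta_bic(left, right, lam=1.0):
    """Delta-BIC change criterion for two adjacent 1-D feature windows.

    Positive values favour hypothesising a speaker change between the
    windows. Null model: one Gaussian over the union; full model: one
    Gaussian per window. `lam` is the usual BIC penalty weight.
    """
    def log_var(xs):
        m = sum(xs) / len(xs)
        return math.log(sum((x - m) ** 2 for x in xs) / len(xs))

    n1, n2 = len(left), len(right)
    n = n1 + n2
    # Log-likelihood gain of splitting; for Gaussians this reduces to a
    # size-weighted difference of log-variances.
    gain = 0.5 * (n * log_var(left + right)
                  - n1 * log_var(left) - n2 * log_var(right))
    # Penalty for the extra mean + variance of the second Gaussian.
    penalty = lam * 0.5 * 2 * math.log(n)
    return gain - penalty

# Windows drawn around clearly different means should give a positive score.
a = [0.0, 0.1, -0.1, 0.05, -0.05] * 4
b = [5.0, 5.1, 4.9, 5.05, 4.95] * 4
```

In a full metric-based segmenter this score is computed over a sliding pair of windows and local maxima above zero are taken as change points.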
Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation
Speaker diarization has gained considerable attention within the speech processing research community. Mainstream speaker diarization relies primarily on speakers' voice characteristics extracted from acoustic signals and often overlooks the potential of semantic information. Since speech signals can efficiently convey the content of a speech, it is of interest to fully exploit these semantic cues using language models. In this work we propose a novel approach to effectively leverage semantic information in clustering-based speaker diarization systems. First, we introduce spoken language understanding modules to extract speaker-related semantic information and use this information to construct pairwise constraints. Second, we present a novel framework to integrate these constraints into the speaker diarization pipeline, enhancing the performance of the entire system. Extensive experiments conducted on a public dataset demonstrate the consistent superiority of our proposed approach over acoustic-only speaker diarization systems.
Comment: Submitted to ICASSP 202
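One simple way to inject pairwise constraints into a clustering-based diarizer is to overwrite entries of the segment-level affinity matrix before spectral or agglomerative clustering. The function below is an illustrative sketch of that step, assuming must-link/cannot-link pairs produced by some semantic module; the paper's actual constraint-propagation scheme may be more elaborate.

```python
import numpy as np

def apply_pairwise_constraints(affinity, must_link, cannot_link,
                               hi=1.0, lo=0.0):
    """Overwrite entries of a segment-level affinity matrix with
    semantic pairwise constraints before clustering.

    affinity    : (N, N) symmetric similarity matrix from speaker embeddings
    must_link   : pairs (i, j) the semantic module says share a speaker
    cannot_link : pairs (i, j) it says belong to different speakers
    """
    A = affinity.copy()                 # leave the caller's matrix intact
    for i, j in must_link:
        A[i, j] = A[j, i] = hi          # force maximum similarity
    for i, j in cannot_link:
        A[i, j] = A[j, i] = lo          # force minimum similarity
    return A
```

The clamped matrix is then fed to the usual clustering back-end; more refined schemes propagate each constraint to neighbouring segment pairs rather than clamping single entries.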
Language modelling for speaker diarization in telephonic interviews
The aim of this paper is to investigate the benefit of combining both language and acoustic modelling for speaker diarization. Although conventional systems only use acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic data. In this study we analyze how an appropriate fusion of both kinds of features is able to obtain good results in these cases. The proposed system is based on an iterative algorithm where an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a call-center database composed of recordings of telephone interviews. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER as compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks. This work was partially supported by the Spanish Project DeepVoice (TEC2015-69266-P) and by the project PID2019-107579RBI00/AEI/10.13039/501100011033.
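The fusion idea can be sketched as a weighted combination of per-speaker scores from the linguistic classifier and the acoustic model. The snippet below is only an illustrative log-linear interpolation under assumed score dictionaries; the function names, the weight `alpha`, and the interpolation form are not from the paper, which uses an iterative LSTM/GMM scheme.

```python
def fuse_scores(linguistic_logp, acoustic_logp, alpha=0.5):
    """Log-linear interpolation of per-speaker log scores.
    `alpha` weights the linguistic stream (a tuning choice, hypothetical)."""
    return {spk: alpha * linguistic_logp[spk]
                 + (1 - alpha) * acoustic_logp[spk]
            for spk in linguistic_logp}

def assign_speaker(linguistic_logp, acoustic_logp, alpha=0.5):
    # Pick the speaker with the highest fused score for this segment.
    fused = fuse_scores(linguistic_logp, acoustic_logp, alpha)
    return max(fused, key=fused.get)
```

With `alpha` near 1 the decision is driven by linguistic content, which matches the scenario the paper targets: interviews where who says what is highly speaker-discriminative.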
Disentangled dimensionality reduction for noise-robust speaker diarisation
The objective of this work is to train noise-robust speaker embeddings adapted for speaker diarisation. Speaker embeddings play a crucial role in the performance of diarisation systems, but they often capture spurious information such as noise and reverberation, adversely affecting performance. Our previous work proposed an auto-encoder-based dimensionality reduction module to help remove the redundant information. However, it does not explicitly separate such information and has also been found to be sensitive to hyper-parameter values. To this end, we propose two contributions to overcome these issues: (i) a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings; (ii) the use of a speech/non-speech indicator to prevent the speaker code from representing the background noise. Across experiments conducted on four different datasets, our approach consistently demonstrates state-of-the-art performance among models without system fusion.
Comment: This paper was submitted to Interspeech202
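The paper learns the split between speaker and spurious information with an auto-encoder; a much simpler linear analogue of the same intuition is to estimate a dominant noise direction from embeddings of non-speech frames and project it out of the speaker embeddings. The sketch below illustrates only that intuition, not the proposed framework.

```python
import numpy as np

def remove_noise_direction(speaker_emb, nonspeech_emb):
    """Linear stand-in for disentanglement: estimate the dominant
    direction of non-speech (noise) embeddings and project it out of
    the speaker embeddings. The paper learns this separation with an
    auto-encoder and a speech/non-speech indicator; this is only an
    illustrative analogue.

    speaker_emb   : (N, D) embeddings of speech segments
    nonspeech_emb : (M, D) embeddings of non-speech (noise) frames
    """
    centered = nonspeech_emb - nonspeech_emb.mean(axis=0)
    # First principal direction of the non-speech embeddings.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    d = vt[0]
    # Subtract each embedding's component along the noise direction.
    return speaker_emb - np.outer(speaker_emb @ d, d)
```

After the projection, the embeddings carry no component along the estimated noise direction, which is the linear version of "the speaker code does not represent the background noise".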
Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors
This work proposes a frame-wise online/streaming end-to-end neural diarization (FS-EEND) method in a frame-in-frame-out fashion. To detect a flexible number of speakers frame by frame and extract/update their corresponding attractors, we propose to leverage a causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism is adopted to leverage some future frames for effectively detecting new speakers in real time and adaptively updating speaker attractors. The proposed method processes the audio stream frame by frame, with a low inference latency caused only by the look-ahead frames. Experiments show that, compared with recently proposed block-wise online methods, our FS-EEND method achieves state-of-the-art diarization results with low inference latency and computational cost.
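The frame-in-frame-out control flow with a look-ahead buffer can be sketched as below. The `step` callable stands in for the causal encoder plus attractor decoder (a hypothetical placeholder, not the paper's model); the point is that the output for frame t is emitted as soon as its `lookahead` future frames have arrived, so latency is bounded by the look-ahead.

```python
from collections import deque

def stream_diarize(frames, lookahead, step):
    """Frame-in-frame-out streaming loop. The label for frame t is
    emitted once `lookahead` future frames are buffered, bounding the
    inference latency by the look-ahead. `step(window)` is a stand-in
    for the causal encoder + attractor decoder and returns the label
    for window[0]."""
    buf = deque()
    out = []
    for f in frames:
        buf.append(f)
        if len(buf) > lookahead:          # frame t plus `lookahead` future frames
            out.append(step(list(buf)))
            buf.popleft()
    # Flush: the final frames see fewer than `lookahead` future frames.
    while buf:
        out.append(step(list(buf)))
        buf.popleft()
    return out
```

Block-wise online methods instead wait for an entire block before emitting anything, which is the latency gap the frame-wise formulation targets.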
The MGB Challenge: Evaluating Multi-genre Broadcast Media Recognition
This paper describes the Multi-Genre Broadcast (MGB) Challenge at ASRU 2015, an evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings. The challenge training data covered seven weeks of BBC TV output across four channels, amounting to about 1,600 hours of broadcast audio. In addition, several hundred million words of BBC subtitle text were provided for language modelling. A novel aspect of the evaluation was the exploration of speech recognition and speaker diarization in a longitudinal setting, i.e. recognition of several episodes of the same show, and speaker diarization across these episodes, linking speakers. The longitudinal tasks also offered the opportunity for systems to make use of supplied metadata including show title, genre tag, and date/time of transmission. This paper describes the task data and evaluation process used in the MGB challenge, and summarises the results obtained.
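Diarization in evaluations like MGB is scored with the diarization error rate (DER), which requires finding the best one-to-one mapping between hypothesis and reference speakers. The toy frame-level version below illustrates the core idea; real scoring (e.g. NIST-style tooling) operates on timed segments with forgiveness collars and handles overlapped speech.

```python
from itertools import permutations

def frame_der(ref, hyp):
    """Toy frame-level diarization error rate: the fraction of reference
    speech frames whose hypothesis label disagrees under the best
    one-to-one speaker mapping. `None` marks non-speech, so misses are
    counted as errors too. Illustrative only, not the MGB scorer."""
    ref_speech = [t for t, r in enumerate(ref) if r is not None]
    ref_spks = sorted({r for r in ref if r is not None})
    hyp_spks = sorted({h for h in hyp if h is not None})
    best = len(ref_speech)                      # worst case: all frames wrong
    for perm in permutations(hyp_spks):
        mapping = dict(zip(perm, ref_spks))     # hyp speaker -> ref speaker
        errs = sum(1 for t in ref_speech if mapping.get(hyp[t]) != ref[t])
        best = min(best, errs)
    return best / len(ref_speech)
```

The exhaustive permutation search is fine for a handful of speakers; production scorers use the Hungarian algorithm for the optimal mapping.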
Speaker detection in the wild: Lessons learned from JSALT 2019
Submitted to ICASSP 2020.
This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions, from meetings to wild speech. We describe the research threads we explored and a set of modules that proved successful in these scenarios. The ultimate goal was to explore speaker detection; our first finding was that effective diarization improves detection, and omitting a diarization stage degrades performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a preliminary stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering, and the overall impact of previous stages on the final speaker detection. In this paper, we show partial results for speaker diarization to give a better understanding of the problem, and we present the final results for speaker detection.
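The diarize-then-detect backbone described above can be sketched as: cluster the audio into speakers first, then score each cluster's average embedding against the enrolment embedding of the target speaker. Everything below (function names, the cosine backend, the threshold) is an illustrative assumption, not the workshop's actual modules.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den

def detect_speaker(cluster_embeddings, enrol, threshold=0.5):
    """Diarize-then-detect backbone: average the embeddings within each
    diarization cluster, then score the cluster centroid against the
    enrolment embedding. Returns a per-cluster detection decision."""
    decisions = {}
    for spk, embs in cluster_embeddings.items():
        mean = [sum(dim) / len(embs) for dim in zip(*embs)]
        decisions[spk] = cosine(mean, enrol) >= threshold
    return decisions
```

Averaging within a cluster is why diarization helps detection: it pools all of a speaker's frames into one cleaner representation instead of scoring short noisy segments in isolation.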