Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation
We introduce a multi-speaker speech data simulator for generating
realistic multi-speaker speech recordings. A notable
feature of this simulator is its capacity to modulate the distribution of
silence and overlap via the adjustment of statistical parameters. This
capability offers a tailored training environment for developing neural models
suited for speaker diarization and voice activity detection. The acquisition of
substantial datasets for speaker diarization often presents a significant
challenge, particularly in multi-speaker scenarios. Furthermore, the precise
time stamp annotation of speech data is a critical factor for training both
speaker diarization and voice activity detection. Our proposed multi-speaker
simulator tackles these problems by generating large-scale audio mixtures
whose statistical properties closely match input parameters derived from
real-world statistics. Additionally, we demonstrate the effectiveness of
speaker diarization and voice activity detection models trained exclusively
on the generated simulated datasets.
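The silence/overlap control described above can be sketched as a toy sampler: speech, pause, and overlap durations are drawn from exponential distributions whose means act as the adjustable statistical parameters. The function name, the alternating-turn scheme, and the distribution choices are illustrative assumptions, not the simulator's actual implementation:

```python
import random

def simulate_session(num_turns, mean_speech=2.0, mean_silence=0.5,
                     overlap_prob=0.2, mean_overlap=0.3, seed=0):
    """Sample a toy two-speaker timeline.

    Speakers alternate turn by turn. The gap before a turn is either a
    pause drawn from an exponential with mean `mean_silence`, or (with
    probability `overlap_prob`) a negative offset, i.e. an overlap with
    mean duration `mean_overlap`. Returns (speaker, start, end) tuples.
    """
    rng = random.Random(seed)
    segments, cursor = [], 0.0
    for turn in range(num_turns):
        speaker = turn % 2
        if turn > 0:
            if rng.random() < overlap_prob:
                cursor -= rng.expovariate(1.0 / mean_overlap)  # overlap
            else:
                cursor += rng.expovariate(1.0 / mean_silence)  # pause
        dur = rng.expovariate(1.0 / mean_speech)
        start = max(cursor, 0.0)
        segments.append((speaker, start, start + dur))
        cursor = start + dur
    return segments
```

Raising `mean_silence` or `overlap_prob` shifts the simulated silence and overlap ratios accordingly, which is the kind of parameter-to-statistic control the abstract describes.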
Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor
This paper proposes a novel Attention-based Encoder-Decoder network for
End-to-End Neural speaker Diarization (AED-EEND). In the AED-EEND system, we
incorporate the target speaker enrollment information used in target speaker
voice activity detection (TS-VAD) to calculate the attractor, which can
mitigate the speaker permutation problem and facilitate easier model
convergence. In the training process, we propose a teacher-forcing strategy to
obtain the enrollment information using the ground-truth label. Furthermore, we
propose three heuristic decoding methods to identify the enrollment area for
each speaker during the evaluation process. Additionally, we enhance the
LSTM-based attractor calculation network used in the end-to-end
encoder-decoder based attractor calculation (EEND-EDA) system by
incorporating an attention-based model. With this attention-based
attractor decoder, our proposed
AED-EEND system outperforms both the EEND-EDA and TS-VAD systems with only 0.5s
of enrollment data.
Comment: Accepted by InterSpeech 202
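The attention-based attractor idea can be illustrated with a minimal sketch: the enrollment embedding acts as a query over frame embeddings, and the attractor is the attention-weighted average of those frames. This is a hypothetical stand-in for the paper's decoder, written in pure Python:

```python
import math

def attention_attractor(enroll, frames):
    """Compute a speaker attractor as an attention-weighted average of
    frame embeddings, with the enrollment embedding as the query.

    `enroll` is one embedding (list of floats); `frames` is a list of
    embeddings of the same dimension. Scaled dot-product attention is
    assumed here; the actual AED-EEND decoder is more elaborate.
    """
    d = len(enroll)
    # Scaled dot-product scores between the query and each frame.
    scores = [sum(q * k for q, k in zip(enroll, f)) / math.sqrt(d)
              for f in frames]
    # Numerically stable softmax over the scores.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Attractor = weighted average of the frame embeddings.
    return [sum(w * f[j] for w, f in zip(weights, frames))
            for j in range(d)]
```

Frames resembling the enrollment embedding receive higher attention weight, so the attractor is pulled toward the target speaker's frames, which is what lets a short enrollment segment anchor the decoding.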
From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization
End-to-end neural diarization (EEND) is nowadays one of the most prominent
research topics in speaker diarization. EEND presents an attractive alternative
to standard cascaded diarization systems since a single system is trained at
once to deal with the whole diarization problem. Several EEND variants and
approaches have been proposed; however, all of them require large amounts of
annotated data for training, and such data are scarce. Thus,
EEND works have used mostly simulated mixtures for training. However, simulated
mixtures do not resemble real conversations in many aspects. In this work we
present an alternative method for creating synthetic conversations that
resemble real ones by using statistics about distributions of pauses and
overlaps estimated on genuine conversations. Furthermore, we analyze the effect
of the source of the statistics, different augmentations and amounts of data.
We demonstrate that our approach performs substantially better than the
original one, while reducing the dependence on the fine-tuning stage.
Experiments are carried out on 2-speaker telephone conversations of Callhome
and DIHARD 3. Together with this publication, we release our implementations of
EEND and the method for creating simulated conversations.
Comment: Submitted to Interspeech 202
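Estimating pause and overlap statistics from genuine conversations, as described above, amounts to scanning time-sorted speaker segments and collecting the signed gaps between consecutive turns. A minimal sketch, where the (speaker, start, end) segment format and the function name are assumptions:

```python
def pause_overlap_stats(segments):
    """Collect pause and overlap durations from reference segments.

    `segments` is a list of (speaker, start, end) tuples. Consecutive
    turns separated by a positive gap contribute a pause; a negative
    gap contributes an overlap. These empirical durations are the
    distributions a simulated conversation would be sampled from.
    """
    segs = sorted(segments, key=lambda s: s[1])
    pauses, overlaps = [], []
    prev_end = segs[0][2]
    for _, start, end in segs[1:]:
        gap = start - prev_end
        if gap >= 0:
            pauses.append(gap)
        else:
            overlaps.append(-gap)
        prev_end = max(prev_end, end)  # tolerate nested segments
    return pauses, overlaps
```

Sampling pause and overlap durations from these empirical lists, rather than from fixed heuristics, is what makes the synthetic conversations resemble the source corpus.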
Speaker diarization assisted ASR for multi-speaker conversations
In this paper, we propose a novel approach for the transcription of speech
conversations with natural speaker overlap, from single channel recordings. We
propose a combination of a speaker diarization system and a hybrid automatic
speech recognition (ASR) system with speaker activity assisted acoustic model
(AM). An end-to-end neural network system is used for speaker diarization. Two
architectures, (i) input conditioned AM, and (ii) gated features AM, are
explored to incorporate the speaker activity information. The models output
speaker specific senones. The experiments on Switchboard telephone
conversations show the advantage of incorporating speaker activity information
in the ASR system for recordings with overlapped speech. In particular, an
absolute improvement in word error rate (WER) is seen for the proposed
approach on natural conversational speech with automatic diarization.
Comment: Manuscript submitted to INTERSPEECH 202
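The gated-features variant can be sketched as scaling each frame's acoustic features by a sigmoid gate driven by the diarization system's speaker-activity output. The gate parameters below are illustrative constants, not the paper's learned values:

```python
import math

def gate_features(feats, activity, w=4.0, b=-2.0):
    """Gated-features sketch for a speaker-activity-assisted AM.

    `feats` is a list of per-frame feature vectors; `activity` is the
    target speaker's activity probability per frame. Each frame is
    scaled by sigmoid(w * p + b), so frames where the target speaker
    is active pass through and inactive frames are attenuated.
    """
    gated = []
    for frame, p in zip(feats, activity):
        g = 1.0 / (1.0 + math.exp(-(w * p + b)))  # sigmoid gate
        gated.append([x * g for x in frame])
    return gated
```

In a trained model the gate would be a learned function of the activity signal; the key design choice shown here is that speaker activity modulates the features rather than being concatenated to them.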
Self-supervised Speaker Diarization
Over the last few years, deep learning has grown in popularity for speaker
verification, identification, and diarization. Inarguably, a significant part
of this success is due to the demonstrated effectiveness of their speaker
representations. These, however, are heavily dependent on large amounts of
annotated data and can be sensitive to new domains. This study proposes an
entirely unsupervised deep-learning model for speaker diarization.
Specifically, the study focuses on generating high-quality neural speaker
representations without any annotated data, as well as on estimating secondary
hyperparameters of the model without annotations.
The speaker embeddings are represented by an encoder trained in a
self-supervised fashion using pairs of adjacent segments assumed to be of the
same speaker. The trained encoder model is then used to self-generate
pseudo-labels to subsequently train a similarity score between different
segments of the same call using probabilistic linear discriminant analysis
(PLDA) and further to learn a clustering stopping threshold. We compared our
model to state-of-the-art unsupervised as well as supervised baselines on the
CallHome benchmarks. According to empirical results, our approach outperforms
unsupervised methods when only two speakers are present in the call, and is
only slightly worse than recent supervised models.
Comment: Submitted to Interspeech 202
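The learned clustering stopping threshold can be illustrated with a toy single-linkage agglomerative clusterer that merges speaker segments until no pair of clusters exceeds the threshold. This is a sketch: the paper scores pairs with PLDA, whereas raw pairwise similarities are assumed here:

```python
def cluster_with_threshold(sims, n, threshold):
    """Toy single-linkage agglomerative clustering with a stop threshold.

    `sims` maps frozenset({i, j}) to the similarity between items i and
    j; `n` is the number of items. Clusters are merged greedily until
    the best remaining pair falls below `threshold`, which plays the
    role of the learned stopping criterion in the abstract.
    """
    clusters = [{i} for i in range(n)]

    def link(a, b):
        # Single linkage: best similarity across the two clusters.
        return max(sims[frozenset({i, j})] for i in a for j in b)

    while len(clusters) > 1:
        best = max(((a, b) for ai, a in enumerate(clusters)
                    for b in clusters[ai + 1:]), key=lambda p: link(*p))
        if link(*best) < threshold:
            break  # no sufficiently similar pair left: stop merging
        a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters
```

Sweeping `threshold` trades splitting one speaker into several clusters against merging distinct speakers, which is why the abstract treats it as a hyperparameter worth estimating without annotations.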