Toroidal Probabilistic Spherical Discriminant Analysis
In speaker recognition, where speech segments are mapped to embeddings on the
unit hypersphere, two scoring back-ends are commonly used, namely cosine
scoring and PLDA. We have recently proposed PSDA, an analog to PLDA that uses
von Mises-Fisher distributions instead of Gaussians. In this paper, we present
toroidal PSDA (T-PSDA). It extends PSDA with the ability to model within- and
between-speaker variabilities in toroidal submanifolds of the hypersphere. Like
PLDA and PSDA, the model allows closed-form scoring and closed-form EM updates
for training. On VoxCeleb, we find T-PSDA accuracy on par with cosine scoring,
while PLDA accuracy is inferior. On NIST SRE'21 we find that T-PSDA gives large
accuracy gains compared to both cosine scoring and PLDA.
Comment: Submitted to ICASSP 202
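The cosine-scoring baseline mentioned in the abstract above can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats; the function names `length_normalize` and `cosine_score` are hypothetical and are not taken from the paper. It does not implement PLDA, PSDA, or T-PSDA.

```python
import math

def length_normalize(vec):
    """Project a vector onto the unit hypersphere (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_score(enroll, test):
    """Cosine similarity of two embeddings; a higher score suggests the same speaker."""
    a = length_normalize(enroll)
    b = length_normalize(test)
    # For unit-length vectors the dot product equals the cosine of the angle.
    return sum(x * y for x, y in zip(a, b))

# Toy embeddings (made up for illustration): parallel vs. orthogonal directions.
same = cosine_score([1.0, 2.0, 2.0], [2.0, 4.0, 4.0])
diff = cosine_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

In practice a decision threshold on the score would be tuned on held-out trials; that step is omitted here.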
From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization
End-to-end neural diarization (EEND) is nowadays one of the most prominent
research topics in speaker diarization. EEND presents an attractive alternative
to standard cascaded diarization systems since a single system is trained at
once to deal with the whole diarization problem. Several EEND variants and
approaches have been proposed; however, all of these models require large
amounts of annotated data for training, and available annotated data are
scarce. Thus, EEND works have mostly used simulated mixtures for training.
However, simulated mixtures do not resemble real conversations in many
aspects. In this work we
present an alternative method for creating synthetic conversations that
resemble real ones by using statistics about distributions of pauses and
overlaps estimated on genuine conversations. Furthermore, we analyze the effect
of the source of the statistics, of different augmentations, and of the amount
of data. We demonstrate that our approach performs substantially better than
the original simulated-mixture approach, while reducing the dependence on the
fine-tuning stage.
Experiments are carried out on 2-speaker telephone conversations of Callhome
and DIHARD 3. Together with this publication, we release our implementations of
EEND and the method for creating simulated conversations.
Comment: Submitted to Interspeech 202
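The idea of placing utterances on a timeline using pause and overlap statistics can be sketched roughly as follows. This is an illustrative simplification, not the authors' released implementation: the function `simulate_conversation`, the single pooled `gap_samples` list (positive values are pauses, negative values are overlaps), and the toy numbers are all assumptions.

```python
import random

def simulate_conversation(utterances, gap_samples, seed=0):
    """Lay out utterances on a timeline (hypothetical sketch).

    utterances:  list of (speaker, duration_sec) in turn order.
    gap_samples: assumed empirical gaps in seconds between consecutive turns;
                 positive = pause, negative = overlap.
    Returns a list of (speaker, start_sec, end_sec).
    """
    rng = random.Random(seed)  # seeded so the layout is reproducible
    segments = []
    latest_end = 0.0
    for speaker, duration in utterances:
        if segments:
            # Draw a gap from the empirical samples and offset from the latest end.
            gap = rng.choice(gap_samples)
            start = max(0.0, latest_end + gap)
        else:
            start = 0.0
        end = start + duration
        segments.append((speaker, start, end))
        latest_end = max(latest_end, end)
    return segments

# Toy input: durations and gaps are made up for illustration.
turns = [("A", 2.0), ("B", 1.5), ("A", 3.0)]
gaps = [0.5, -0.3, 1.0]
segs = simulate_conversation(turns, gaps, seed=1)
```

A fuller version would presumably keep separate distributions for same-speaker pauses, different-speaker pauses, and overlaps, and would mix the corresponding audio; those details are left out of this sketch.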