
3 research outputs found

Automatic Closed Captioning for Estonian Live Broadcasts

Author: Alumäe Tanel
Bode Külliki
Kaitsa Martin
Kalda Joonas
Publication venue: University of Tartu Library
Publication date: 01/05/2023

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

Author: Alumäe Tanel
Bredin Hervé
Kalda Joonas
Marxer Ricard
Pagés Clément
Publication venue: ISCA
Publication date: 18/06/2024

A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet it struggles with over-separation and with adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. At the small extra cost of requiring SD labels during training, it solves the problem of over-separation and allows local separated sources to be stitched together by leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources by applying automatic speech recognition (ASR) systems to them. PixIT improves the performance of various ASR systems across two meeting corpora in terms of both speaker-attributed and utterance-based word error rates, without requiring any fine-tuning.
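The permutation invariant training (PIT) idea named in this abstract can be illustrated with a minimal sketch. It is not the PixIT implementation: the mean-squared-error criterion, the toy signals, and the names `mse` and `pit_loss` are assumptions chosen for illustration. The sketch only shows the core step of scoring every assignment of model outputs to reference speakers and keeping the cheapest one.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references):
    """Return (best_loss, best_permutation) over all speaker orderings.

    estimates, references: lists of equal-length sequences, one per speaker.
    best_permutation[i] is the index of the estimate matched to reference i.
    """
    n = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average the per-speaker error under this assignment.
        loss = sum(mse(estimates[p], references[i]) for i, p in enumerate(perm)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example (hypothetical data): the outputs are correct but in swapped order.
references = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
estimates = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
loss, perm = pit_loss(estimates, references)
print(loss, perm)
```

Because the loss is taken over the best assignment, a model is not penalized for emitting the speakers in a different order than the labels. Enumerating all permutations is only practical for a small number of speakers.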

Scientific Publications of the University of Toulouse II Le Mirail

HAL AMU

TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024

Author: Alumäe Tanel
Baroudi Séverin
Bredin Hervé
Kalda Joonas
Lebourdais Martin
Marxer Ricard
Publication venue: ISCA
Publication date: 01/09/2024

This paper describes the submissions of team TalTech-IRIT-LIS to the DISPLACE 2024 challenge. Our team participated in the speaker diarization and language diarization tracks of the challenge. In the speaker diarization track, our best submission was an ensemble of systems based on the pyannote.audio speaker diarization pipeline utilizing powerset training and our recently proposed PixIT method that performs joint diarization and speech separation. We improve upon PixIT by using the separation outputs for speaker embedding extraction. Our ensemble achieved a diarization error rate of 27.1% on the evaluation dataset. In the language diarization track, we fine-tuned a pre-trained Wav2Vec2-BERT language embedding model on in-domain data and clustered short segments using AHC and VBx, based on similarity scores from LDA/PLDA. This led to a language diarization error rate of 27.6% on the evaluation data. Both results were ranked first in their respective challenge tracks.
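The clustering step mentioned in this abstract, agglomerative hierarchical clustering (AHC) over pairwise similarity scores, can be sketched as follows. This is an illustration under stated assumptions, not the team's system: the hand-written similarity matrix stands in for LDA/PLDA scores, average linkage and the 0.5 stopping threshold are arbitrary choices, the function name `ahc` is hypothetical, and VBx is not covered.

```python
def ahc(similarity, threshold):
    """Average-linkage agglomerative clustering on a similarity matrix.

    Repeatedly merges the two most similar clusters until the best pair
    falls below `threshold`. Returns one cluster label per item.
    """
    clusters = [[i] for i in range(len(similarity))]

    def avg_sim(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = max(
            (avg_sim(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(similarity)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


# Hypothetical scores for four segments: 0 and 1 are alike, as are 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
labels = ahc(sim, threshold=0.5)
print(labels)
```

The threshold controls how many clusters survive, which in a diarization setting corresponds to the number of speakers or languages detected. In practice it would be tuned on development data.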

Scientific Publications of the University of Toulouse II Le Mirail

HAL AMU