16 research outputs found

Overlapped Speech Detection in Multi-Party Meetings

Author: Thaw Mie Mie
Zaw Thein Htay
Publication venue: 'International Journal of Computer Engineering and Applications'
Publication date: 29/07/2020

Detecting simultaneous speech in meeting recordings is difficult because of both the complexity of the meeting itself and the environment surrounding it. The proposed system uses gammatone-like spectrogram-based linear predictor coefficients, computed on distant microphone channel data, for overlap detection. The framework was evaluated on the Augmented Multiparty Interaction (AMI) conference corpus. The proposed system improves on baseline feature set models for classification.

International Journal of Computer (IJC - Global Society of Scientific Research and Researchers, GSSRR)

Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

Author: Andrusenko Andrei
Khokhlov Yuri
Korenevskaya Mariya
Korenevsky Maxim
Laptev Aleksandr
Medennikov Ivan
Mitrofanov Anton
Podluzhny Ivan
Prisyach Tatiana
Romanenko Aleksei
Sorokin Ivan
Timofeeva Tatiana
Publication venue: 'International Speech Communication Association'
Publication date: 27/07/2020

Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker on each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activities of each speaker. I-vectors can be estimated iteratively, starting with a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER). Comment: Accepted to Interspeech 202
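The input/output layout described in the abstract (frame-level features plus one i-vector per speaker in, one binary activity decision per speaker and frame out) can be illustrated with a toy sketch. This is not the authors' model: a single shared linear layer with a sigmoid stands in for the neural network, and the function name `ts_vad_sketch`, the dimensions, the random weights, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import numpy as np

def ts_vad_sketch(features, ivectors, weights, bias, threshold=0.5):
    """Toy per-speaker, per-frame activity prediction (hypothetical helper).

    features: (T, F) frame-level features, e.g. MFCCs
    ivectors: (S, D) one i-vector per target speaker
    weights:  (F + D,) stand-in for a trained network (assumption)
    """
    num_frames = features.shape[0]
    probs = []
    for ivec in ivectors:
        # Repeat the speaker's i-vector on every frame and append it to the features.
        tiled = np.tile(ivec, (num_frames, 1))            # (T, D)
        x = np.concatenate([features, tiled], axis=1)     # (T, F + D)
        logits = x @ weights + bias                       # (T,)
        probs.append(1.0 / (1.0 + np.exp(-logits)))       # sigmoid -> activity probability
    probs = np.stack(probs, axis=1)                       # (T, S)
    # Independent binary decisions per speaker, so several can be active at once.
    activity = probs >= threshold
    return probs, activity

rng = np.random.default_rng(0)
T, F, S, D = 200, 40, 4, 100          # illustrative sizes, not from the paper
features = rng.standard_normal((T, F))
ivectors = rng.standard_normal((S, D))
weights = rng.standard_normal(F + D) * 0.1
probs, activity = ts_vad_sketch(features, ivectors, weights, bias=0.0)

# Frames where two or more speakers are marked active count as overlapped speech.
overlap_frames = int((activity.sum(axis=1) >= 2).sum())
```

Because each speaker gets its own binary output rather than competing in a single softmax, overlapping speech is represented directly as multiple active speakers on the same frame, which is the limitation of clustering-based diarization that the abstract points to.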

arXiv.org e-Print Archive

Crossref