Overlapped Speech Detection in Multi-Party Meetings
Detecting simultaneous speech in meeting recordings is difficult because of both the complexity of the meeting itself and the acoustic environment surrounding it. The proposed system uses gammatone-like spectrogram-based linear predictor coefficients, computed on distant-microphone channel data, as features for overlap detection. Model performance is assessed on the Augmented Multiparty Interaction (AMI) conference corpus. The proposed system improves on baseline feature-set models for classification.
Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
Speaker diarization for real-life scenarios is an extremely challenging
problem. Widely used clustering-based diarization approaches perform rather
poorly in such conditions, mainly due to the limited ability to handle
overlapping speech. We propose a novel Target-Speaker Voice Activity Detection
(TS-VAD) approach, which directly predicts the activity of each speaker in each
time frame. The TS-VAD model takes conventional speech features (e.g., MFCC)
along with i-vectors for each speaker as inputs. A set of binary classification
output layers produces the activity of each speaker. I-vectors can be estimated
iteratively, starting with a strong clustering-based diarization. We also
extend the TS-VAD approach to the multi-microphone case using a simple
attention mechanism on top of hidden representations extracted from the
single-channel TS-VAD model. Moreover, post-processing strategies for the
predicted speaker activity probabilities are investigated. Experiments on the
CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results,
outperforming the baseline x-vector-based system by more than 30% absolute
Diarization Error Rate (DER).
Comment: Accepted to Interspeech 202
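The input/output structure described in the TS-VAD abstract (per-frame speech features combined with one i-vector per speaker, producing a binary activity decision per speaker per frame) can be illustrated with a toy sketch. This is not the authors' architecture: the single shared linear layer, the random untrained weights, the dimensions, and the function names `speaker_activity`, `binarize`, and `overlap_frames` are all assumptions made for illustration, and the 0.5 threshold stands in for the post-processing strategies the abstract mentions only in general terms.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def speaker_activity(features, ivectors, weights, bias=0.0):
    """Return probs[t][s]: toy probability that speaker s is active in frame t.

    Hypothetical stand-in for a trained network: one linear layer shared
    across speakers, applied to [frame features ; speaker i-vector].
    """
    probs = []
    for frame in features:
        row = []
        for ivec in ivectors:
            x = list(frame) + list(ivec)  # acoustic features + speaker embedding
            z = sum(w * v for w, v in zip(weights, x)) + bias
            row.append(sigmoid(z))
        probs.append(row)
    return probs


def binarize(probs, threshold=0.5):
    """Simple post-processing (assumed): threshold each probability."""
    return [[p > threshold for p in row] for row in probs]


def overlap_frames(active):
    """A frame contains overlapping speech when two or more speakers are active."""
    return [sum(row) >= 2 for row in active]


# Illustrative sizes only: 50 frames, 13 MFCC-like features,
# 8-dimensional i-vectors, 4 speakers.
random.seed(0)
T, F, D, S = 50, 13, 8, 4
features = [[random.gauss(0, 1) for _ in range(F)] for _ in range(T)]
ivectors = [[random.gauss(0, 1) for _ in range(D)] for _ in range(S)]
weights = [random.gauss(0, 0.3) for _ in range(F + D)]  # untrained, random

probs = speaker_activity(features, ivectors, weights)
active = binarize(probs)
overlap = overlap_frames(active)
```

Because each speaker has an independent binary output, several speakers can be marked active in the same frame, which is the property that lets this kind of model represent overlapping speech where a one-cluster-per-frame assignment cannot.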