Search CORE

6 research outputs found

Who Spoke What? A Latent Variable Framework for the Joint Decoding of Multiple Speakers and their Keywords

Author: Sreenivas Thippur V.
Sundar Harshavardhan
Publication venue
Publication date: 29/04/2015
Field of study

In this paper, we present a latent variable (LV) framework to identify all the speakers and their keywords given a multi-speaker mixture signal. We introduce two separate LVs to denote active speakers and the keywords uttered. The dependency of a spoken keyword on the speaker is modeled through a conditional probability mass function. The distribution of the mixture signal is expressed in terms of the LV mass functions and speaker-specific-keyword models. The proposed framework admits stochastic models, representing the probability density function of the observation vectors given that a particular speaker uttered a specific keyword, as speaker-specific-keyword models. The LV mass functions are estimated in a Maximum Likelihood framework using the Expectation Maximization (EM) algorithm. The active speakers and their keywords are detected as modes of the joint distribution of the two LVs. In mixture signals, containing two speakers uttering the keywords simultaneously, the proposed framework achieves an accuracy of 82% for detecting both the speakers and their respective keywords, using Student's-t mixture models as speaker-specific-keyword models.Comment: 6 pages, 2 figures Submitted to : IEEE Signal Processing Letter

arXiv.org e-Print Archive

Crossref

Single-channel mixed speech recognition using deep neural networks

Author: Chao Weng
Dong Yu
Jasha Droppo
Michael L Seltzer
Publication venue
Publication date: 05/03/2020
Field of study

ABSTRACT In this work, we study the problem of single-channel mixed speech recognition using deep neural networks (DNNs). Using a multi-style training strategy on artificially mixed speech data, we investigate several different training setups that enable the DNN to generalize to corresponding similar patterns in the test data. We also introduce a WFST-based two-talker decoder to work with the trained DNNs. Experiments on the 2006 speech separation and recognition challenge task demonstrate that the proposed DNN-based system has remarkable noise robustness to the interference of a competing speaker. The best setup of our proposed systems achieves an overall WER of 19.7% which improves upon the results obtained by the state-of-the-art IBM superhuman system by 1.9% absolute, with fewer assumptions and lower computational complexity

CiteSeerX

Single-channel mixed speech recognition using deep neural networks

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref