13,351 research outputs found
Modeling Overlapping Speech using Vector Taylor Series
Abstract Current speaker diarization systems typically fail to succesfully assign multiple speakers speaking simultaneously. According to previous studies, overlapping errors account for a large proportion of the total errors in multi-party speech diarization. In this work, we propose a new approach using Vector Taylor Series (VTS) to obtain overlapping speech models assuming individual speaker models are available, e.g. from the diarization output. We extend the VTS framework to use multiple acoustic classes to account for the non-stationarity of corrupting speaker speech. We propose a system using multi-class VTS to detect single-speaker and two-speaker overlapping speech as well as the speakers involved. We show the effectivity of the approach on distant microphone meeting data, especially with the multiclass approach performing at the state-of-the-art
A Mouth Full of Words: Visually Consistent Acoustic Redubbing
This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes [1]. For a given utterance, the corresponding dynamic viseme sequence is sampled to construct a graph of possible phoneme sequences that synchronize with the video. When composed with a pronunciation dictionary and language model, this produces a vast number of word sequences that are in sync with the original video, literally putting plausible words into the mouth of the speaker. We demonstrate that traditional, one-to-many, static visemes lack flexibility for this application as they produce significantly fewer word sequences. This work explores the natural ambiguity in visual speech and offers insight for automatic speech recognition and the importance of language modeling
A Subband-Based SVM Front-End for Robust ASR
This work proposes a novel support vector machine (SVM) based robust
automatic speech recognition (ASR) front-end that operates on an ensemble of
the subband components of high-dimensional acoustic waveforms. The key issues
of selecting the appropriate SVM kernels for classification in frequency
subbands and the combination of individual subband classifiers using ensemble
methods are addressed. The proposed front-end is compared with state-of-the-art
ASR front-ends in terms of robustness to additive noise and linear filtering.
Experiments performed on the TIMIT phoneme classification task demonstrate the
benefits of the proposed subband based SVM front-end: it outperforms the
standard cepstral front-end in the presence of noise and linear filtering for
signal-to-noise ratio (SNR) below 12-dB. A combination of the proposed
front-end with a conventional front-end such as MFCC yields further
improvements over the individual front ends across the full range of noise
levels
- …