    Significance of Vowel Onset Point Information for Speaker Verification

    This work demonstrates the significance of information about vowel onset points (VOPs) for speaker verification. A VOP is defined as the instant at which the onset of a vowel takes place. Vowel-like regions can be identified using VOPs. By production, vowel-like regions have impulse-like excitation, so the impulse response of the vocal tract system is better manifested in them, and they are regions of relatively high signal-to-noise ratio (SNR). Speaker information extracted from such regions may therefore be more discriminative. As a result, better speaker modeling and more reliable testing may be possible using features extracted from vowel-like regions. It is demonstrated in this work that for clean and matched conditions, a relatively small number of frames from vowel-like regions is sufficient for speaker modeling and testing. Alternatively, for degraded and mismatched conditions, vowel-like regions provide better performance…

    End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

    Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whispered speech) and to background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture, in a principled way, the temporal relationships between acoustic and visual features. This study explores this idea, proposing a bimodal recurrent neural network (BRNN) framework for SAD. The approach models the temporal dynamics of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are learned directly from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements of up to 1.2% under practical scenarios over an audio-only VAD baseline implemented with a deep neural network (DNN). The proposed approach achieves a 92.7% F1-score when evaluated using the sensors from a portable tablet in a noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high-definition camera and a close-talking microphone). Comment: Submitted to Speech Communication

    Who Spoke What? A Latent Variable Framework for the Joint Decoding of Multiple Speakers and their Keywords

    In this paper, we present a latent variable (LV) framework to identify all the speakers and their keywords given a multi-speaker mixture signal. We introduce two separate LVs to denote the active speakers and the keywords uttered. The dependency of a spoken keyword on the speaker is modeled through a conditional probability mass function. The distribution of the mixture signal is expressed in terms of the LV mass functions and speaker-specific-keyword models. As speaker-specific-keyword models, the proposed framework admits stochastic models that represent the probability density function of the observation vectors given that a particular speaker uttered a specific keyword. The LV mass functions are estimated in a Maximum Likelihood framework using the Expectation Maximization (EM) algorithm. The active speakers and their keywords are detected as modes of the joint distribution of the two LVs. In mixture signals containing two speakers uttering keywords simultaneously, the proposed framework achieves an accuracy of 82% for detecting both the speakers and their respective keywords, using Student's-t mixture models as speaker-specific-keyword models. Comment: 6 pages, 2 figures. Submitted to: IEEE Signal Processing Letters
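The final decoding step described in this abstract, reading off speaker and keyword pairs as modes of the joint distribution of the two latent variables, can be illustrated with a minimal sketch. It assumes the speaker mass function and the conditional keyword mass function have already been estimated; the EM estimation itself is not shown. The function names and the probability values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def joint_pmf(
    p_speaker: Dict[str, float],
    p_keyword_given_speaker: Dict[str, Dict[str, float]],
) -> Dict[Pair, float]:
    """Joint mass P(speaker, keyword) = P(speaker) * P(keyword | speaker)."""
    return {
        (speaker, keyword): p_speaker[speaker] * p_kw
        for speaker, keywords in p_keyword_given_speaker.items()
        for keyword, p_kw in keywords.items()
    }


def top_modes(joint: Dict[Pair, float], n: int = 2) -> List[Pair]:
    """Return the n (speaker, keyword) pairs with the largest joint mass."""
    return sorted(joint, key=joint.get, reverse=True)[:n]


# Made-up values for illustration only (assumed to come from an earlier EM step).
p_speaker = {"A": 0.5, "B": 0.4, "C": 0.1}
p_keyword_given_speaker = {
    "A": {"yes": 0.9, "no": 0.1},
    "B": {"yes": 0.2, "no": 0.8},
    "C": {"yes": 0.5, "no": 0.5},
}

joint = joint_pmf(p_speaker, p_keyword_given_speaker)
detected = top_modes(joint, n=2)
print(detected)
```

Taking the top two pairs mirrors the two-speaker mixtures mentioned in the abstract; whether the paper additionally constrains the detected pairs to distinct speakers is not stated there, so this sketch does not enforce it.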

    Robust text independent closed set speaker identification systems and their evaluation

    PhD Thesis. This thesis focuses on text-independent closed-set speaker identification. The contributions relate to evaluation studies in the presence of various types of noise and handset effects. Extensive evaluations are performed on four databases. The first contribution is in the context of the use of the Gaussian Mixture Model-Universal Background Model (GMM-UBM) with original speech recordings from only the TIMIT database. Four main simulations for Speaker Identification Accuracy (SIA) are presented, including different fusion strategies: late fusion (score based), early fusion (feature based) and early-late fusion (a combination of feature and score based); late fusion using concatenated static and dynamic features (features with temporal derivatives such as first-order derivative delta and second-order derivative delta-delta features, namely acceleration features); and finally fusion of statistically independent normalized scores. The second contribution is again based on the GMM-UBM approach. Comprehensive evaluations of the effect of Additive White Gaussian Noise (AWGN) and Non-Stationary Noise (NSN) (with and without a G.712 type handset) upon identification performance are undertaken. In particular, three NSN types with varying Signal to Noise Ratios (SNRs) were tested, corresponding to street traffic, a bus interior and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008, and 120 speakers were selected from each database to yield 3,600 speech utterances. The third contribution is based on the use of the I-vector; four combinations of I-vectors with 100 and 200 dimensions were employed. Then, various fusion techniques using maximum, mean, weighted sum and cumulative fusion with the same I-vector dimension were used to improve the SIA. Similarly, both interleaving and concatenated I-vector fusion were exploited to produce 200 and 400 I-vector dimensions. The system was evaluated with four different databases using 120 speakers from each database. The TIMIT, SITW and NIST 2008 databases were evaluated for various types of NSN, namely street-traffic NSN, bus-interior NSN and crowd-talking NSN; the G.712 type handset at 16 kHz was also applied. As recommendations from the study, in the GMM-UBM approach mean fusion is found to yield the best overall SIA with noisy speech, whereas linear weighted sum fusion is best overall for original database recordings. In the I-vector approach, however, the best SIA was obtained from the weighted sum and the concatenated fusion.
    Ministry of Higher Education and Scientific Research (MoHESR), and the Iraqi Cultural Attaché, Al-Mustansiriya University, Al-Mustansiriya University College of Engineering in Iraq for supporting my PhD scholarship
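The late (score-level) fusion rules named in this abstract, mean, maximum and linear weighted sum over normalized scores, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: min-max normalization is an assumed choice (the abstract says only "normalized scores"), and the function names, weights and score values are hypothetical.

```python
from typing import List, Optional, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Map one system's scores to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(
    system_scores: Sequence[Sequence[float]],
    rule: str = "mean",
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Fuse per-speaker scores from several systems into one score per speaker."""
    normalized = [min_max_normalize(s) for s in system_scores]
    per_speaker = list(zip(*normalized))  # one tuple of system scores per speaker
    if rule == "mean":
        return [sum(col) / len(col) for col in per_speaker]
    if rule == "max":
        return [max(col) for col in per_speaker]
    if rule == "weighted_sum":
        if weights is None or len(weights) != len(system_scores):
            raise ValueError("weighted_sum needs one weight per system")
        return [sum(w * v for w, v in zip(weights, col)) for col in per_speaker]
    raise ValueError(f"unknown fusion rule: {rule}")


def identify(fused: Sequence[float]) -> int:
    """Closed-set decision: index of the speaker with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up scores: two systems scoring the same three enrolled speakers.
scores = [[1.0, 3.0, 2.0], [10.0, 20.0, 40.0]]
mean_choice = identify(fuse(scores, "mean"))
weighted_choice = identify(fuse(scores, "weighted_sum", weights=[0.8, 0.2]))
max_fused = fuse(scores, "max")
print(mean_choice, weighted_choice, max_fused)
```

The example shows that the fusion rule can change the decision: with these made-up scores, mean fusion and weighted sum fusion select different speakers, which is why the thesis compares the rules under clean and noisy conditions.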