Implicit Self-supervised Language Representation for Spoken Language Diarization
In a code-switched (CS) scenario, spoken language diarization (LD) is essential
as a pre-processing system. Further, implicit frameworks are preferable over
explicit ones, as they can be easily adapted to deal with low/zero-resource
languages. Inspired by the speaker diarization (SD) literature, three
frameworks based on (1) fixed segmentation, (2) change-point-based segmentation
and (3) E2E modeling are proposed to perform LD. An initial exploration with
the synthetic TTSF-LD dataset shows that using the x-vector as an implicit
language representation with an appropriate analysis window length can achieve
performance on par with explicit LD. The best implicit LD performance in terms
of Jaccard error rate (JER) is achieved using the E2E framework. However, with
the E2E framework the performance of implicit LD degrades when using the
practical Microsoft CS (MSCS) dataset. The difference in performance is mostly
due to the distributional difference between the monolingual segment durations
of the secondary language in the MSCS and TTSF-LD datasets. Moreover, to avoid
segment smoothing, the smaller duration of the monolingual segments suggests
the use of a small analysis window length. At the same time, with a small
window the x-vector representation is unable to capture the required language
discrimination due to acoustic similarity, as the same speaker is speaking both
languages. Therefore, to resolve this issue a self-supervised implicit language
representation is proposed in this study. In comparison with the x-vector
representation, the proposed representation provides a relative improvement and
achieves a lower JER using the E2E framework.
Comment: Planning to submit to IEEE-JSTS
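As a rough illustration of the fixed-segmentation framework mentioned above, the sketch below slices the signal into fixed analysis windows, embeds each window, and clusters the embeddings into two language groups. The extractor extract_xvector, the window and hop lengths, and the two-cluster assumption are illustrative stand-ins, not the paper's exact configuration (the change-point and E2E variants differ further).

# Minimal sketch of fixed-segmentation implicit language diarization.
# `extract_xvector` is a hypothetical embedding extractor (in practice a
# trained x-vector network applied to each analysis window).
import numpy as np
from sklearn.cluster import KMeans

def diarize_fixed_segmentation(samples, sr, win_sec, hop_sec, extract_xvector):
    """Slice audio into fixed windows, embed each window, and cluster the
    embeddings into two language groups (code-switched speech)."""
    win, hop = int(win_sec * sr), int(hop_sec * sr)
    starts = list(range(0, max(len(samples) - win + 1, 1), hop))
    embeddings = np.stack([extract_xvector(samples[s:s + win]) for s in starts])
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
    # Contiguous windows with the same label form the hypothesized
    # monolingual segments.
    return [(s / sr, (s + win) / sr, int(l)) for s, l in zip(starts, labels)]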
Speaker Recognition using Supra-segmental Level Excitation Information
Speaker-specific information present in the excitation signal is mostly viewed from the sub-segmental, segmental and supra-segmental levels. In this work, the supra-segmental level information is explored for recognizing speakers. Earlier studies have shown that the combined use of pitch and epoch strength vectors provides useful supra-segmental information. However, the speaker recognition accuracy achieved by supra-segmental level features is relatively poorer than that of source information from the other levels. This may be because the modulation information present at the supra-segmental level of the excitation signal is not manifested properly in the pitch and epoch strength vectors. We propose a method to model the supra-segmental level modulation information from residual mel frequency cepstral coefficient (R-MFCC) trajectories. The evidence from R-MFCC trajectories combined with pitch and epoch strength vectors is proposed to represent supra-segmental information. Experimental results show that, compared to pitch and epoch strength vectors, the proposed approach provides relatively improved performance. Further, the proposed supra-segmental level information is more complementary to the information from the other levels.
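The following is a hedged sketch of residual MFCC (R-MFCC) extraction as described above: take the linear-prediction (LP) residual of the speech signal and compute MFCCs of that residual. The LP order, the number of coefficients, and the whole-signal (rather than frame-wise) LP analysis are illustrative simplifications, not the paper's exact setup.

# Sketch of R-MFCC trajectory extraction.
import librosa
from scipy.signal import lfilter

def rmfcc_trajectories(y, sr, lp_order=12, n_mfcc=13):
    # Prediction-error (inverse) filtering gives the LP residual, which
    # mainly carries excitation-source information. A full system would
    # typically perform LP analysis frame-wise rather than on the whole signal.
    a = librosa.lpc(y, order=lp_order)
    residual = lfilter(a, [1.0], y)
    # Frame-level MFCCs of the residual; their evolution over frames forms
    # the supra-segmental R-MFCC trajectories used alongside pitch and
    # epoch-strength evidence.
    return librosa.feature.mfcc(y=residual, sr=sr, n_mfcc=n_mfcc)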
Significance of Vowel Onset Point Information for Speaker Verification
This work demonstrates the significance of information about vowel onset points (VOPs) for speaker verification. A VOP is defined as the instant at which the onset of a vowel takes place. Vowel-like regions can be identified using VOPs. By virtue of their production, vowel-like regions have impulse-like excitation; therefore the impulse response of the vocal tract system is better manifested in them, and they are relatively high signal-to-noise ratio (SNR) regions. Speaker information extracted from such regions may therefore be more discriminative. Due to this, better speaker modeling and more reliable testing may be possible using features extracted from vowel-like regions. It is demonstrated in this work that for clean and matched conditions, relatively few frames from vowel-like regions are sufficient for speaker modeling and testing. Alternatively, for degraded and mismatched conditions, vowel-like regions provide better performance.
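A minimal sketch of the idea of restricting speaker modeling to vowel-like regions follows. The VOP detector itself is outside this snippet, and the region length after each VOP is an assumed illustrative value; the abstract only motivates why such regions are useful.

# Keep only frame-level features falling in vowel-like regions around VOPs.
import numpy as np

def select_vowel_like_frames(features, frame_rate, vop_times, region_sec=0.2):
    """features: (n_frames, dim) frame-level features (e.g. MFCCs);
    vop_times: detected VOP instants in seconds;
    returns only the frames within `region_sec` after each VOP."""
    keep = np.zeros(len(features), dtype=bool)
    for t in vop_times:
        start = int(t * frame_rate)
        end = min(len(features), start + int(region_sec * frame_rate))
        keep[start:end] = True
    return features[keep]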
Implicit spoken language diarization
Spoken language diarization (LD) and related tasks are mostly explored using
the phonotactic approach. Phonotactic approaches mostly use an explicit way of
language modeling, hence requiring intermediate phoneme modeling and
transcribed data. Alternatively, the ability of deep learning approaches to
model temporal dynamics may help in the implicit modeling of language
information through deep embedding vectors. Hence, this work initially explores
the available speaker diarization frameworks that capture speaker information
implicitly to perform LD tasks. The performance of the LD system using the
end-to-end x-vector approach is 6.78% and 7.06% on synthetic code-switched
data, and 22.50% and 60.38% on practical data, in terms of diarization error
rate (DER) and Jaccard error rate (JER), respectively. The performance
degradation is due to data imbalance and is resolved to some extent by using
pre-trained wav2vec embeddings, which provide a relative improvement of 30.74%
in terms of JER.
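To make the last step concrete, the sketch below extracts pre-trained wav2vec 2.0 features per analysis window, which could then feed the diarization back-end in place of x-vectors. The checkpoint (torchaudio's WAV2VEC2_BASE bundle), the mean-pooling over each window, and the window/hop lengths are assumptions for illustration; the paper's actual wav2vec setup may differ.

# Window-level wav2vec 2.0 embeddings from a pre-trained model.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE      # pre-trained wav2vec 2.0
model = bundle.get_model().eval()

def window_embeddings(waveform, sr, win_sec=1.0, hop_sec=0.5):
    """waveform: (1, n_samples) mono tensor sampled at `sr` Hz.
    Returns one mean-pooled wav2vec embedding per analysis window."""
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    win = int(win_sec * bundle.sample_rate)
    hop = int(hop_sec * bundle.sample_rate)
    embs = []
    with torch.inference_mode():
        for start in range(0, waveform.shape[1] - win + 1, hop):
            feats, _ = model.extract_features(waveform[:, start:start + win])
            embs.append(feats[-1].mean(dim=1).squeeze(0))  # pool last layer over time
    return torch.stack(embs)  # (n_windows, feature_dim), fed to the LD back-end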
Fast Approximate Spoken Term Detection from Sequence of Phonemes
We investigate the detection of spoken terms in conversational speech using phoneme recognition, with the objective of achieving a smaller index size as well as faster search speed. Speech is processed and indexed as a single one-best phoneme sequence. We propose the use of a probabilistic pronunciation model for the search term to compensate for errors in phoneme recognition. This model is derived using the pronunciation of the word and the phoneme confusion matrix. Experiments are performed on the conversational telephone speech database distributed by NIST for the 2006 spoken term detection evaluation. We achieve about 1500 times smaller index size and 14 times faster search speed compared to the state-of-the-art system using phoneme lattices, at the cost of relatively lower detection performance.
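As a simplified illustration of scoring a search term against the indexed one-best phoneme sequence with a confusion-matrix-based pronunciation model, consider the sliding-window sketch below. It ignores insertions and deletions, and names such as confusion, phone_index and term_phones are illustrative, not the paper's implementation.

# Score each index position for a possible occurrence of the search term.
import numpy as np

def term_scores(recognized, term_phones, confusion, phone_index):
    """recognized: recognized phoneme symbols (the one-best index);
    term_phones: reference pronunciation of the search term;
    confusion[i, j]: P(recognized phone j | true phone i);
    phone_index: dict mapping phone symbol -> matrix row/column index."""
    n = len(term_phones)
    scores = []
    for start in range(len(recognized) - n + 1):
        window = recognized[start:start + n]
        logp = sum(
            np.log(confusion[phone_index[t], phone_index[r]] + 1e-10)
            for t, r in zip(term_phones, window)
        )
        scores.append((start, logp))  # higher log-probability => likelier hit
    return scores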
Analysis of Confusion Matrix to Combine Evidence for Phoneme Recognition
In this work we analyze and combine evidence from different classifiers for phoneme recognition using information from their confusion matrices. Speech signals are processed to extract Perceptual Linear Prediction (PLP) and Multi-RASTA (MRASTA) features. Neural network classifiers with different architectures are built using these features. The classifiers are analyzed using their confusion matrices. The motivation behind this analysis is to come up with objective measures that indicate the complementary nature of the information in each of the classifiers. These measures are useful for combining a subset of the classifiers. The classifiers can be combined using different combination schemes such as the product, sum, minimum and maximum rules. The significance of the objective measures is demonstrated in terms of the results of combination. Classifiers selected through the proposed objective measures seem to provide the best performance.
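The combination rules named above can be sketched directly on per-frame class posteriors. The snippet below assumes each classifier outputs a posterior matrix of the same shape; classifier training and the confusion-matrix-based selection of which classifiers to combine are outside its scope.

# Combine per-frame phoneme posteriors from several classifiers.
import numpy as np

def combine_posteriors(posterior_list, rule="sum"):
    """posterior_list: list of (n_frames, n_phonemes) arrays,
    each holding class posteriors from one classifier."""
    stacked = np.stack(posterior_list)      # (n_classifiers, n_frames, n_phonemes)
    if rule == "sum":
        combined = stacked.mean(axis=0)
    elif rule == "product":
        combined = np.prod(stacked, axis=0)
    elif rule == "min":
        combined = stacked.min(axis=0)
    elif rule == "max":
        combined = stacked.max(axis=0)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return combined.argmax(axis=1)          # predicted phoneme per frame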