Unsupervised methods for speaker diarization
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 93-95). Given a stream of unlabeled audio data, speaker diarization is the process of determining "who spoke when." We propose a novel approach to this problem that exploits the effectiveness of factor analysis as a front-end for extracting speaker-specific features and the inherent variabilities in the data through the use of unsupervised methods. In an initial evaluation, our system achieves a state-of-the-art result of 0.9% Diarization Error Rate on the diarization of two-speaker telephone conversations. The approach is then generalized to K-speaker diarization, for which we take measures to address data sparsity and experiment with the von Mises-Fisher distribution for clustering on a unit hypersphere. Our extended system performs competitively on the diarization of conversations involving two or more speakers. Finally, we present promising initial results obtained by applying variational inference to our front-end speaker representation to estimate the unknown number of speakers in a given utterance. By Stephen Shum. S.M.
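The clustering step described above, grouping unit-normalized speaker features on a hypersphere, can be sketched as spherical k-means, which is the hard-assignment limit of von Mises-Fisher mixture clustering. The code below is a minimal illustration on synthetic vectors, not the thesis implementation; dimensions and data are invented.

```python
import numpy as np

def spherical_kmeans(X, k, iters=20):
    """Cluster unit-normalized vectors by cosine similarity: the
    hard-assignment limit of a von Mises-Fisher mixture, where each
    mean direction is the renormalized sum of its assigned vectors."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project onto unit hypersphere
    # Farthest-first initialization keeps the sketch deterministic.
    centers = [X[0]]
    for _ in range(1, k):
        sims = np.max(X @ np.array(centers).T, axis=1)
        centers.append(X[np.argmin(sims)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmax(X @ centers.T, axis=1)     # assign to nearest mean direction
        for j in range(k):
            members = X[labels == j]
            if len(members):
                s = members.sum(axis=0)
                centers[j] = s / np.linalg.norm(s)    # renormalized mean direction
    return labels

# Two synthetic "speakers" as distinct directions in an 8-d feature space.
rng = np.random.default_rng(1)
a = rng.normal([3, 0, 0, 0, 0, 0, 0, 0], 0.3, size=(50, 8))
b = rng.normal([0, 3, 0, 0, 0, 0, 0, 0], 0.3, size=(50, 8))
labels = spherical_kmeans(np.vstack([a, b]), k=2)
```

A full von Mises-Fisher treatment would additionally estimate a concentration parameter per cluster; the sketch fixes it implicitly by using hard assignments.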
An Experimental Review of Speaker Diarization methods with application to Two-Speaker Conversational Telephone Speech recordings
We performed an experimental review of current speaker diarization systems for the conversational telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms belonging to the clustering-based, end-to-end neural diarization (EEND), and speech separation guided diarization (SSGD) paradigms. We studied the inference-time computational requirements and diarization accuracy on four CTS datasets with different characteristics and languages. We found that, among all methods considered, EEND-vector clustering (EEND-VC) offers the best trade-off between computing requirements and performance. More generally, EEND models proved lighter and faster at inference than clustering-based methods, but they also require large amounts of diarization-oriented annotated data. In particular, EEND-VC performance in our experiments degraded when the dataset size was reduced, whereas self-attentive EEND (SA-EEND) was less affected. We also found that SA-EEND gives less consistent results across datasets than EEND-VC, with its performance degrading on long conversations with high speech sparsity. Clustering-based diarization systems, and in particular VBx, instead have more consistent performance than SA-EEND but are outperformed by EEND-VC; the gap with respect to the latter narrows when overlap-aware clustering methods are considered. SSGD is the most computationally demanding method, but it can be convenient if speech recognition also has to be performed. Its performance is close to SA-EEND but degrades significantly when the training and inference data characteristics are mismatched.
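The accuracy comparisons above are stated in terms of Diarization Error Rate (DER). A minimal frame-level sketch of the metric is below; it is simplified (no scoring collar, single-speaker frames only), unlike full scoring tools, and the label values are invented.

```python
from itertools import permutations

def frame_der(ref, hyp):
    """Simplified frame-level Diarization Error Rate.

    `ref` and `hyp` are per-frame speaker labels, with None meaning
    silence. Hypothesis speakers are mapped onto reference speakers
    by trying every permutation and keeping the best, as DER scoring
    requires. Errors are missed speech, false alarm, and confusion.
    """
    ref_spk = sorted({s for s in ref if s is not None})
    hyp_spk = sorted({s for s in hyp if s is not None})
    scored = sum(1 for r in ref if r is not None)  # total reference speech frames
    best_errors = None
    for perm in permutations(ref_spk):
        mapping = dict(zip(hyp_spk, perm))
        errors = 0
        for r, h in zip(ref, hyp):
            if r is None and h is None:
                continue                 # silence in both: not scored
            if r is None:
                errors += 1              # false alarm
            elif h is None:
                errors += 1              # missed speech
            elif mapping.get(h) != r:
                errors += 1              # speaker confusion
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / scored

# One confused frame out of five scored frames -> DER = 0.2.
der = frame_der(['A', 'A', 'B', 'B', None, 'B'],
                ['x', 'x', 'y', 'x', None, 'y'])
```

Real evaluations additionally apply a forgiveness collar around speaker-change points and score overlapped speech, which this sketch omits.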
Diarization of telephone conversations using probabilistic linear discriminant analysis
Speaker diarization can be summarized as the process of partitioning audio data into homogeneous segments according to speaker identity. This thesis investigates the application of probabilistic linear discriminant analysis (PLDA) to speaker diarization of telephone conversations. We introduce a variational Bayes (VB) approach for inference under a PLDA model for modeling segmental i-vectors in speaker diarization. A deterministic annealing (DA) algorithm is employed to avoid locally optimal solutions in the VB iterations. We compare our proposed system with a well-known system that applies k-means clustering to principal component analysis coefficients of segmental i-vectors. We used summed-channel telephone data from the National Institute of Standards and Technology 2008 Speaker Recognition Evaluation as the test set to evaluate the performance of the proposed system. We achieve about a 20% relative improvement in diarization error rate compared to the baseline system.
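The baseline system mentioned above, k-means clustering on principal components of segmental i-vectors, can be sketched as follows. This is a toy illustration on synthetic vectors; the dimensions, data, and function name are ours, not the thesis's.

```python
import numpy as np

def pca_kmeans_baseline(ivectors, k, n_components=2, iters=20):
    """Project segmental i-vectors onto their top principal
    components, then cluster the projections with k-means."""
    X = ivectors - ivectors.mean(axis=0)
    # PCA via SVD: rows of Vt are principal directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:n_components].T
    # Farthest-first initialization keeps the sketch deterministic.
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min(((Z[:, None, :] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)  # recompute centroid
    return labels

# Synthetic "i-vectors" from two speakers, separated along one dimension.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.1, size=(30, 10)); a[:, 0] += 5.0
b = rng.normal(0.0, 0.1, size=(30, 10))
labels = pca_kmeans_baseline(np.vstack([a, b]), k=2)
```

The thesis's proposed system replaces this hard clustering with variational Bayes inference under a PLDA model, which the sketch does not attempt to reproduce.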
From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization
End-to-end neural diarization (EEND) is nowadays one of the most prominent research topics in speaker diarization. EEND presents an attractive alternative to standard cascaded diarization systems, since a single system is trained at once to deal with the whole diarization problem. Several EEND variants and approaches have been proposed; however, all these models require large amounts of annotated data for training, and available annotated data are scarce. Thus, EEND works have mostly used simulated mixtures for training. However, simulated mixtures do not resemble real conversations in many aspects. In this work we present an alternative method for creating synthetic conversations that resemble real ones, using statistics about the distributions of pauses and overlaps estimated on genuine conversations. Furthermore, we analyze the effect of the source of the statistics, of different augmentations, and of the amount of data. We demonstrate that our approach performs substantially better than the original one, while reducing the dependence on the fine-tuning stage. Experiments are carried out on 2-speaker telephone conversations from Callhome and DIHARD 3. Together with this publication, we release our implementations of EEND and the method for creating simulated conversations.
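The core idea, laying speaker turns on a timeline with pause and overlap behavior drawn from statistics that would, in the paper's approach, be estimated on real conversations, can be sketched as follows. The exponential distributions, parameter values, and function name are illustrative assumptions, not the released implementation.

```python
import random

def simulate_conversation(turn_mean, gap_mean, overlap_prob, n_turns, seed=0):
    """Generate a 2-speaker conversation as (speaker, start, end) segments.

    Turn lengths and gaps are drawn from exponential distributions;
    in the paper's method, such statistics are estimated from real
    conversations. With probability `overlap_prob`, the next turn
    starts before the current one ends, producing an overlap.
    """
    rng = random.Random(seed)
    t, segments = 0.0, []
    for i in range(n_turns):
        dur = rng.expovariate(1.0 / turn_mean)
        segments.append((i % 2, t, t + dur))                    # speakers alternate
        if rng.random() < overlap_prob:
            t = t + dur - rng.expovariate(1.0 / gap_mean)       # overlapped start
            t = max(t, segments[-1][1])                         # never before current turn
        else:
            t = t + dur + rng.expovariate(1.0 / gap_mean)       # pause between turns
    return segments

segs = simulate_conversation(turn_mean=2.0, gap_mean=0.5,
                             overlap_prob=0.3, n_turns=10)
```

In the paper's full method, the resulting segment timeline would then be rendered into audio by concatenating and mixing real speech segments; the sketch stops at the timeline.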
PHONOTACTIC AND ACOUSTIC LANGUAGE RECOGNITION
This thesis deals with phonotactic and acoustic techniques for automatic language recognition (LRE). The first part of the thesis deals with phonotactic language recognition based on co-occurrences of phone sequences in speech.
A thorough study of phone recognition as a tokenization technique for LRE is presented, with focus on the amount of training data for the phone recognizer and on the combination of phone recognizers trained on several languages (Parallel Phone Recognition followed by Language Models, PPRLM). The thesis also deals with the novel technique of anti-models in PPRLM and investigates the use of phone lattices instead of strings. The work on the phonotactic approach is concluded by a comparison of classical n-gram modeling techniques and binary decision trees. Acoustic LRE was addressed too, with the main focus on discriminative techniques for training target-language acoustic models and on initial (but successful) experiments with removing channel dependencies. We also investigated the fusion of the phonotactic and acoustic approaches. All experiments were performed on standard data from the NIST 2003, 2005, and 2007 evaluations, so the results are directly comparable with those of other laboratories in the LRE community. With the above-mentioned techniques, the fused systems defined the state of the art in the LRE field and reached excellent results in two NIST evaluations.
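The phonotactic idea behind PPRLM, scoring a phone-token sequence under per-language n-gram models and picking the best, can be sketched with a toy bigram model. The phone transcripts and smoothing constant below are invented for illustration; real systems use lattices, higher-order n-grams, and far more data.

```python
from collections import Counter
from math import log

def train_bigram(phone_strings, alpha=0.5):
    """Train an add-alpha smoothed bigram model over phone tokens
    and return a function scoring a sequence's log-probability."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for s in phone_strings:
        toks = ['<s>'] + s.split() + ['</s>']
        vocab.update(toks)
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    V = len(vocab)

    def logprob(s):
        toks = ['<s>'] + s.split() + ['</s>']
        return sum(log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * V))
                   for a, b in zip(toks, toks[1:]))
    return logprob

# Toy "phone transcripts" standing in for phone-recognizer output.
lang_a = train_bigram(['p a p a', 'a p a'])
lang_b = train_bigram(['k o k o', 'o k o'])

def recognize(phones):
    """Pick the language whose model gives the higher score."""
    return 'A' if lang_a(phones) > lang_b(phones) else 'B'
```

PPRLM runs several such phone recognizer + language model pairs in parallel (one tokenizer per training language) and fuses their scores; the sketch shows a single branch.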
Open-set Speaker Identification
This study is motivated by the growing need for effective extraction of intelligence and evidence from audio recordings in the fight against crime, a need made ever more apparent with the recent expansion of criminal and terrorist organisations. The main focus is to enhance the open-set speaker identification process within speaker identification systems, which are affected by noisy audio data obtained in uncontrolled environments such as the street, restaurants, or other places of business. Consequently, two investigations are initially carried out: the effects of environmental noise on the accuracy of open-set speaker recognition, thoroughly covering relevant conditions in the considered application areas, such as variable training data length, background noise, and real-world noise; and the effects of short and varied-duration reference data in open-set speaker recognition.
The investigations led to a novel method, termed "vowel boosting", to enhance the reliability of speaker identification when operating with varied-duration speech data under uncontrolled conditions. Vowels naturally contain more speaker-specific information; emphasising this natural property of speech therefore enables better identification performance. The traditional state-of-the-art GMM-UBM and i-vector approaches are used to evaluate vowel boosting. The proposed approach boosts the impact of the vowels on the speaker scores, which improves recognition accuracy for the specific case of open-set identification with short and varied-duration speech material.
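One way the scoring idea could be sketched is as a weighted aggregation of per-frame speaker scores in which frames falling in vowel regions are up-weighted. The helper, its name, and the weight value are illustrative assumptions; the thesis's exact formulation is not given here.

```python
def boosted_score(frame_scores, is_vowel, vowel_weight=2.0):
    """Weighted mean of per-frame speaker scores.

    Frames flagged as vowels count `vowel_weight` times as much as
    other frames, so vowel evidence dominates the aggregate score.
    The weight value is an illustrative choice.
    """
    total, weight = 0.0, 0.0
    for score, vowel in zip(frame_scores, is_vowel):
        w = vowel_weight if vowel else 1.0
        total += w * score
        weight += w
    return total / weight
```

With `vowel_weight=1.0` the function reduces to a plain mean, which makes the effect of the boosting easy to isolate in experiments.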
X-VECTORS: ROBUST NEURAL EMBEDDINGS FOR SPEAKER RECOGNITION
Speaker recognition is the task of identifying speakers based on their speech signal. Typically, this involves comparing speech from a known speaker with recordings from unknown speakers and making same-or-different speaker decisions. If the lexical contents of the recordings are fixed to some phrase, the task is considered text-dependent; otherwise it is text-independent. This dissertation is primarily concerned with this second, less constrained problem. Since speech data lives in a complex, high-dimensional space, it is difficult to directly compare speakers. Comparisons are facilitated by embeddings: mappings from complex input patterns to low-dimensional Euclidean spaces where notions of distance or similarity are defined in natural ways. For almost ten years, systems based on i-vectors--a type of embedding extracted from a traditional generative model--have been the dominant paradigm in this field. However, in other areas of applied machine learning, such as text or vision, embeddings extracted from discriminatively trained neural networks are the state-of-the-art. Recently, this line of research has become very active in speaker recognition as well. Neural networks are a natural choice for this purpose, as they are capable of learning extremely complex mappings, and when training data resources are abundant, tend to outperform traditional methods. In this dissertation, we develop a next-generation neural embedding--denoted by x-vector--for speaker recognition. These neural embeddings are demonstrated to substantially improve upon the state-of-the-art on a number of benchmark datasets.
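The same-or-different decision on fixed-dimensional embeddings described above can be sketched with cosine scoring, one common backend for comparing embeddings; full systems typically use a trained backend such as PLDA instead. The threshold value below is illustrative, not taken from the dissertation.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings; higher
    scores indicate the same speaker is more likely."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Same-or-different decision by thresholding the score.
    The threshold is illustrative; in practice it is tuned on
    development data to balance false accepts and rejects."""
    return cosine_score(emb_a, emb_b) >= threshold
```

Because cosine similarity ignores vector magnitude, the comparison depends only on the direction of the embeddings, which is what makes low-dimensional Euclidean embedding spaces convenient for this task.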