Deep learning methods in speaker recognition: a review
This paper summarizes the applied deep learning practices in the field of
speaker recognition, covering both verification and identification. Speaker
recognition has been a widely studied topic of speech technology. Many research
works have been carried out, yet little progress was achieved in the past 5-6
years. However, as deep learning techniques advance in most machine learning
fields, the former state-of-the-art methods are being replaced by them in
speaker recognition too. DL has become the current state-of-the-art
solution for both speaker verification and identification. The standard
x-vectors, in addition to i-vectors, are used as the baseline in most of the
novel works. The increasing amount of gathered data opens up the territory to
DL, where it is most effective
Neural PLDA Modeling for End-to-End Speaker Verification
While deep learning models have made significant advances in supervised
classification problems, the application of these models for out-of-set
verification tasks like speaker recognition has been limited to deriving
feature embeddings. The state-of-the-art x-vector PLDA based speaker
verification systems use a generative model based on probabilistic linear
discriminant analysis (PLDA) for computing the verification score. Recently, we
had proposed a neural network approach for backend modeling in speaker
verification called the neural PLDA (NPLDA) where the likelihood ratio score of
the generative PLDA model is posed as a discriminative similarity function and
the learnable parameters of the score function are optimized using a
verification cost. In this paper, we extend this work to achieve joint
optimization of the embedding neural network (x-vector network) with the NPLDA
network in an end-to-end (E2E) fashion. This proposed end-to-end model is
optimized directly from the acoustic features with a verification cost function
and during testing, the model directly outputs the likelihood ratio score. With
various experiments using the NIST speaker recognition evaluation (SRE) 2018
and 2019 datasets, we show that the proposed E2E model improves significantly
over the x-vector PLDA baseline speaker verification system.Comment: Accepted in Interspeech 2020. GitHub Implementation Repos:
https://github.com/iiscleap/E2E-NPLDA and
https://github.com/iiscleap/NeuralPld
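The discriminative scoring idea described in the abstract can be illustrated with a minimal sketch: the generative PLDA likelihood ratio reduces to a quadratic function of the two embeddings, and NPLDA treats that function's parameters as learnable. The parameter names (P, Q, b), shapes, and toy values below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def nplda_score(e1, e2, P, Q, b):
    """Quadratic similarity between two speaker embeddings.

    s(e1, e2) = e1^T P e2 + e1^T Q e1 + e2^T Q e2 + b

    In NPLDA-style backends, P, Q, and b would be trainable and
    optimized with a verification cost; here they are fixed toys.
    """
    return e1 @ P @ e2 + e1 @ Q @ e1 + e2 @ Q @ e2 + b

# Toy usage with hand-picked matrices (purely illustrative):
d = 4
e1 = np.ones(d)            # stand-in for an x-vector embedding
e2 = np.ones(d)
P = np.eye(d)              # cross term weight
Q = 0.5 * np.eye(d)        # self term weight
score = nplda_score(e1, e2, P, Q, b=-1.0)   # 4 + 2 + 2 - 1 = 7.0
```

In the end-to-end setup the paper describes, this score would be produced directly from acoustic features by the joint x-vector + NPLDA network rather than computed on fixed embeddings.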
Viseme-based Lip-Reading using Deep Learning
Research in Automated Lip Reading is an incredibly rich discipline with many facets that have been the subject of investigation, including audio-visual data, feature extraction, classification networks and classification schemas. The most advanced and up-to-date lip-reading systems can predict entire sentences with thousands of different words, and the majority of them use ASCII characters as the classification schema. The classification performance of such systems, however, has been insufficient, and the need to cover an ever-expanding vocabulary using as few classes as possible is a challenge.
The work in this thesis contributes to the area concerning classification schemas by proposing an automated lip reading model that predicts sentences using visemes as a classification schema.
This is an alternative schema to using ASCII characters, which is the conventional class system used to predict sentences. This thesis provides a review of the current trends in deep learning-based automated lip reading and analyses a gap in the research endeavours of automated lip-reading by contributing towards work done in the area of classification schemas. A whole new line of research is opened up whereby an alternative way to do lip-reading is explored, and in doing so, lip-reading performance results for predicting sentences from a benchmark dataset
are attained which improve upon the current state-of-the-art.
In this thesis, a neural network-based lip reading system is proposed. The system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The lip-reading system predicts sentences as a two-stage procedure with visemes being recognised as the first stage and words being classified as the second stage. This is such that the second-stage has to both overcome the one-to-many mapping problem posed in lip-reading where one set of visemes can map to several words, and the problem of visemes being confused or misclassified to begin with.
To develop the proposed lip-reading system, a number of tasks have been performed in this thesis. These include the classification of continuous sequences of visemes, and the proposal of viseme-to-word conversion models that are both effective in their conversion performance of predicting words and robust to the possibility of viseme confusion or misclassification. The initial system reported has been tested on the challenging BBC Lip Reading Sentences 2
(LRS2) benchmark dataset, attaining a word accuracy rate of 64.6%. Compared with the state-of-the-art works in lip reading sentences reported at the time, the system achieved a significantly improved performance.
The lip reading system is further improved by using a language model that has been demonstrated to be effective at discriminating between homopheme words and robust to incorrectly classified visemes. An improved performance in predicting spoken sentences from the LRS2 dataset is yielded, with an attained word accuracy rate of 79.6%. This is better than another lip-reading system trained and evaluated on the same dataset, which attained a word accuracy rate of 77.4%, and is, to the best of our knowledge, the next best observed result attained on LRS2.
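The two-stage decode described in this abstract can be sketched in miniature: stage one produces a viseme sequence, and stage two maps each viseme token to its candidate words, using a language model to break ties between homopheme words (the one-to-many mapping problem). The tiny viseme lexicon, the viseme token names, and the unigram probabilities below are invented for illustration only; the thesis uses learned conversion models, not a lookup table.

```python
# One-to-many viseme-class -> candidate-word mapping (illustrative):
# bilabials p/b/m collapse to one viseme, labiodentals f/v to another.
VISEME_TO_WORDS = {
    "p-ah-t": ["pat", "bat", "mat"],
    "f-ae-n": ["fan", "van"],
}

# Stand-in unigram language model (invented probabilities).
UNIGRAM = {"pat": 0.2, "bat": 0.5, "mat": 0.3, "fan": 0.7, "van": 0.3}

def decode_words(viseme_seq):
    """Stage two: for each viseme token, pick the candidate word the
    language model prefers, resolving the one-to-many ambiguity."""
    return [max(VISEME_TO_WORDS[v], key=lambda w: UNIGRAM.get(w, 0.0))
            for v in viseme_seq]

print(decode_words(["p-ah-t", "f-ae-n"]))  # ['bat', 'fan']
```

A stronger (e.g. sequence-level) language model, as the thesis suggests, would score whole word sequences rather than isolated tokens, which is what allows it to recover from individually misclassified visemes.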
Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs
We study a novel neural architecture and its training strategies of speaker
encoder for speaker recognition without using any identity labels. The speaker
encoder is trained to extract a fixed-size speaker embedding from a spoken
utterance of variable length. Contrastive learning is a typical self-supervised
learning technique. However, the quality of the speaker encoder depends very
much on the sampling strategy of positive and negative pairs. It is common to
sample a positive pair of segments from the same utterance. Unfortunately,
such poor-man's positive pairs (PPP) lack the necessary diversity for the training
of a robust encoder. In this work, we propose a multi-modal contrastive
learning technique with novel sampling strategies. By cross-referencing between
speech and face data, we study a method that finds diverse positive pairs (DPP)
for contrastive learning, thus improving the robustness of the speaker encoder.
We train the speaker encoder on the VoxCeleb2 dataset without any speaker
labels, and achieve an equal error rate (EER) of 2.89%, 3.17% and 6.27%
under the proposed progressive clustering strategy, and an EER of 1.44%,
1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on
the three test sets of VoxCeleb1. This novel solution outperforms the
state-of-the-art self-supervised learning methods by a large margin, at the
same time, achieves comparable results with the supervised learning
counterpart. We also evaluate our self-supervised learning technique on LRS2
and LRW datasets, where the speaker information is unknown. All experiments
suggest that the proposed neural architecture and sampling strategies are
robust across datasets.
Comment: 13 pages
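The contrast between the two sampling strategies named in this abstract can be sketched as follows: a poor-man's positive pair takes two segments of one utterance, while a diverse positive pair draws segments from two different utterances that cross-referencing (e.g. via face tracks) has grouped under the same pseudo identity. The data layout and grouping below are illustrative assumptions, not the paper's actual pipeline.

```python
import random

def poor_mans_pair(utterance_segments):
    """PPP: sample two distinct segments of the SAME utterance."""
    return tuple(random.sample(utterance_segments, 2))

def diverse_positive_pair(grouped_utterances):
    """DPP: sample segments from two DIFFERENT utterances that were
    grouped under one pseudo speaker identity."""
    u1, u2 = random.sample(grouped_utterances, 2)
    return random.choice(u1), random.choice(u2)

random.seed(0)
# One pseudo speaker; segment ids are prefixed by their utterance.
group = [["a1", "a2"], ["b1", "b2"], ["c1"]]
seg1, seg2 = diverse_positive_pair(group)   # segments of two utterances
ppp = poor_mans_pair(["x1", "x2", "x3"])    # segments of one utterance
```

The DPP pair exposes the encoder to channel and session variation between utterances, which is the diversity the paper argues PPP lacks.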
Active Collaboration of Classifiers for Visual Tracking
Recently, discriminative visual trackers have obtained state-of-the-art performance, yet they suffer in the presence of different real-world challenges such as target motion and appearance changes. In a discriminative tracker, one or more classifiers are employed to obtain the target/non-target label for the samples, which in turn determines the target's location. To cope with variations of the target shape and appearance, the classifier(s) are updated online with different samples of the target and the background. Sample selection, labeling, and updating the classifier are prone to various sources of errors that drift the tracker. In this study, we motivate, conceptualize, realize, and formalize a novel active co-tracking framework, step by step, to demonstrate the challenges and generic solutions for them. In this framework, classifiers not only cooperate in labeling the samples but also exchange their information to robustify the labeling, improve the sampling, and realize efficient yet effective updating. The proposed framework is evaluated against state-of-the-art trackers on public datasets and showed promising results.
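The co-labeling idea in this abstract can be illustrated with a minimal sketch: two classifiers score tracking samples, and when one is uncertain about a sample it defers the decision to the other. The scoring interface, the 0.5 decision boundary, and the uncertainty margin below are illustrative stand-ins, not the paper's actual classifiers or exchange protocol.

```python
def co_label(samples, clf_a, clf_b, margin=0.2):
    """Label each sample as target (1) or non-target (0).

    clf_a and clf_b map a sample id to a target score in [0, 1].
    If clf_a's score falls within `margin` of the 0.5 boundary,
    the decision is delegated to clf_b (the collaboration step).
    """
    labels = []
    for s in samples:
        score = clf_a(s)
        if abs(score - 0.5) < margin:   # clf_a is uncertain here
            score = clf_b(s)            # ask the other classifier
        labels.append(1 if score >= 0.5 else 0)
    return labels

# Toy usage: clf_a is unsure about s2 (0.55); clf_b resolves it.
clf_a = {"s1": 0.95, "s2": 0.55, "s3": 0.10}.get
clf_b = {"s1": 0.90, "s2": 0.90, "s3": 0.20}.get
print(co_label(["s1", "s2", "s3"], clf_a, clf_b))  # [1, 1, 0]
```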