10,801 research outputs found
Latent Class Model with Application to Speaker Diarization
In this paper, we apply a latent class model (LCM) to the task of speaker
diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in
that it uses soft information and avoids premature hard decisions in its
iterations. In contrast to the VB method, which is based on a generative model,
LCM provides a framework allowing both generative and discriminative models.
The discriminative property is realized through the use of i-vector (Ivec),
probabilistic linear discriminative analysis (PLDA), and a support vector
machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid are introduced. In addition, three further improvements are
applied to enhance its performance. 1) Adding neighbor windows to extract more
speaker information for each short segment. 2) Using a hidden Markov model to
avoid frequent speaker change points. 3) Using an agglomerative hierarchical
cluster to do initialization and present hard and soft priors, in order to
overcome the problem of initial sensitivity. Experiments on the National
Institute of Standards and Technology Rich Transcription 2009 speaker
diarization database, under the condition of a single distant microphone, show
that the diarization error rate (DER) of the proposed methods has substantial
relative improvements compared with mainstream systems. Compared to the VB
method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments
on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial
conditions also show that the proposed LCM-Ivec-Hybrid system has the best
overall performance
ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
Current speaker verification techniques rely on a neural network to extract
speaker representations. The successful x-vector architecture is a Time Delay
Neural Network (TDNN) that applies statistics pooling to project
variable-length utterances into fixed-length speaker characterizing embeddings.
In this paper, we propose multiple enhancements to this architecture based on
recent trends in the related fields of face verification and computer vision.
Firstly, the initial frame layers can be restructured into 1-dimensional
Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we
introduce Squeeze-and-Excitation blocks in these modules to explicitly model
channel interdependencies. The SE block expands the temporal context of the
frame layer by rescaling the channels according to global properties of the
recording. Secondly, neural networks are known to learn hierarchical features,
with each layer operating on a different level of complexity. To leverage this
complementary information, we aggregate and propagate features of different
hierarchical levels. Finally, we improve the statistics pooling module with
channel-dependent frame attention. This enables the network to focus on
different subsets of frames during each of the channel's statistics estimation.
The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art
TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker
Recognition Challenge.Comment: proceedings of INTERSPEECH 202
Language Identification Using Visual Features
Automatic visual language identification (VLID) is the technology of using information derived from the visual appearance and movement of the speech articulators to iden- tify the language being spoken, without the use of any audio information. This technique for language identification (LID) is useful in situations in which conventional audio processing is ineffective (very noisy environments), or impossible (no audio signal is available). Research in this field is also beneficial in the related field of automatic lip-reading. This paper introduces several methods for visual language identification (VLID). They are based upon audio LID techniques, which exploit language phonology and phonotactics to discriminate languages. We show that VLID is possible in a speaker-dependent mode by discrimi- nating different languages spoken by an individual, and we then extend the technique to speaker-independent operation, taking pains to ensure that discrimination is not due to artefacts, either visual (e.g. skin-tone) or audio (e.g. rate of speaking). Although the low accuracy of visual speech recognition currently limits the performance of VLID, we can obtain an error-rate of < 10% in discriminating between Arabic and English on 19 speakers and using about 30s of visual speech
- …