Analysis of I-Vector framework for Speaker Identification in TV-shows
Inspired by Joint Factor Analysis, the i-vector framework has become the most popular, state-of-the-art approach to the speaker verification task. Applied mainly within the NIST/SRE evaluation campaigns, many studies have been proposed to further improve the performance of speaker verification systems. Nevertheless, while the i-vector framework has also been used in other speech processing fields such as language recognition, very few studies have addressed the speaker identification task on TV shows. This work was done in the context of the REPERE challenge, which focuses on people recognition in multimodal conditions (audio, video, text) from TV show corpora. Challenge participants are also invited to provide systems for monomodal tasks such as speaker identification. The application of the i-vector framework is investigated from different points of view: (1) several i-vector based approaches are compared, (2) a specific i-vector extraction protocol is proposed to deal with widely varying amounts of training data across the speaker population, and (3) the joint use of speaker diarization and identification is analyzed. Based on a 533-speaker dictionary, this joint system won the monomodal speaker identification task of the 2014 REPERE challenge.
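To make the identification step concrete, below is a minimal sketch of cosine scoring against a dictionary of speaker-model i-vectors; the function names and the small numerical stabilizer are illustrative assumptions, and the full pipeline (UBM training, total-variability matrix estimation, i-vector extraction) is not shown.

```python
import numpy as np

def cosine_score(w_test: np.ndarray, w_model: np.ndarray) -> float:
    """Cosine similarity between a test i-vector and a speaker-model i-vector."""
    return float(np.dot(w_test, w_model) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_model) + 1e-12))

def identify(w_test: np.ndarray, speaker_models: dict) -> str:
    """Closed-set identification: return the dictionary speaker whose model
    i-vector scores highest against the test i-vector (illustrative only)."""
    return max(speaker_models, key=lambda spk: cosine_score(w_test, speaker_models[spk]))
```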
Multimodal person recognition for human-vehicle interaction
Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, as borne out in two case studies.
Look, Listen and Learn - A Multimodal LSTM for Speaker Identification
Speaker identification refers to the task of localizing the face of the person whose identity matches the ongoing voice in a video. This task not only requires collective perception over both visual and auditory signals; robustness to severe quality degradations and unconstrained content variations is also indispensable. In this paper, we describe a novel multimodal Long Short-Term Memory (LSTM) architecture which seamlessly unifies both visual and auditory modalities from the beginning of each sequence input. The key idea is to extend the conventional LSTM by sharing weights not only across time steps but also across modalities. We show that modeling the temporal dependency across face and voice can significantly improve robustness to content quality degradations and variations. We also find that our multimodal LSTM is robust to distractors, namely the non-speaking identities. We applied our multimodal LSTM to The Big Bang Theory dataset and show that our system outperforms state-of-the-art systems in speaker identification, with a lower false alarm rate and higher recognition accuracy.
Comment: The 30th AAAI Conference on Artificial Intelligence (AAAI-16)
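A rough sketch of the weight-sharing idea, assuming face and voice features are first projected to a common dimension and the same LSTM is then applied to both streams; module names, dimensions, and the fusion/classification head are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SharedModalityLSTM(nn.Module):
    """One LSTM whose weights are reused for both modalities: shared across
    time steps by construction, and across modalities by applying the same cell."""
    def __init__(self, face_dim=512, voice_dim=128, hidden=256):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, hidden)    # map each modality into
        self.voice_proj = nn.Linear(voice_dim, hidden)  # a common input space
        self.shared_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 2)      # speaking / not-speaking

    def forward(self, face_seq, voice_seq):
        # face_seq: (B, T, face_dim), voice_seq: (B, T, voice_dim)
        f, _ = self.shared_lstm(self.face_proj(face_seq))
        v, _ = self.shared_lstm(self.voice_proj(voice_seq))
        fused = torch.cat([f[:, -1], v[:, -1]], dim=-1)  # last-step states
        return self.classifier(fused)
```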
Improving speaker turn embedding by crossmodal transfer learning from face embedding
Learning speaker turn embeddings has shown considerable improvement in situations where conventional speaker modeling approaches fail. However, this improvement is relatively limited when compared to the gain observed in face embedding learning, which has proven very successful for face verification and clustering tasks. Assuming that faces and voices from the same identities share some latent properties (such as age, gender, and ethnicity), we propose three transfer learning approaches to leverage knowledge from the face domain (learned from thousands of images and identities) for tasks in the speaker domain. These approaches, namely target embedding transfer, relative distance transfer, and clustering structure transfer, utilize the structure of the source face embedding space at different granularities to regularize the target speaker turn embedding space through additional optimization terms. Our methods are evaluated on two public broadcast corpora and yield promising advances over competitive baselines in verification and audio clustering tasks, especially when dealing with short speaker utterances. The analysis of the results also gives insight into the characteristics of the embedding spaces and shows their potential applications.
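As a hedged illustration of the first two transfer strategies named above, the losses below regularize speaker-turn embeddings against a frozen face embedding space; the exact formulations and weighting in the paper may differ.

```python
import torch
import torch.nn.functional as F

def target_embedding_transfer_loss(speaker_emb, face_emb):
    """Illustrative regularizer in the spirit of target embedding transfer:
    a speaker-turn embedding is pulled toward the (frozen) face embedding of
    the same identity. Shapes: (B, D); the face network is not updated."""
    return F.mse_loss(speaker_emb, face_emb.detach())

def relative_distance_transfer_loss(spk, face):
    """Illustrative relative-distance variant: pairwise distances within a
    batch of speaker embeddings are encouraged to match those measured in
    the face embedding space. Shapes: (B, D) each."""
    return F.mse_loss(F.pdist(spk), F.pdist(face).detach())
```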
Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification
In this paper, a novel cross-device text-independent speaker verification architecture is proposed. The majority of state-of-the-art deep architectures used for speaker verification consider Mel-frequency cepstral coefficients. In contrast, our proposed Siamese convolutional neural network architecture uses Mel-frequency spectrogram coefficients to benefit from the dependency between adjacent spectro-temporal features. Moreover, although spectro-temporal features have proved highly reliable in speaker verification models, they represent only some aspects of the short-term acoustic traits of the speaker's voice. The human voice, however, carries several linguistic levels, such as acoustics, lexicon, prosody, and phonetics, that can be utilized in speaker verification models. To compensate for these inherent shortcomings of spectro-temporal features, we propose to enhance the Siamese convolutional neural network architecture with a multilayer perceptron network that incorporates prosodic, jitter, and shimmer features. The proposed end-to-end verification architecture performs feature extraction and verification simultaneously, and it displays significant improvement over classical signal processing approaches and deep learning algorithms for forensic cross-device speaker verification.
Comment: Accepted in the 9th IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2018)
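A compact sketch of one branch of such a Siamese network, combining a small CNN over Mel-spectrograms with an MLP over prosodic, jitter, and shimmer features; layer sizes, feature counts, and the distance-based verification step are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class SpectroProsodicBranch(nn.Module):
    """Illustrative single branch of a Siamese network: CNN over the
    Mel-spectrogram plus an MLP over prosodic/jitter/shimmer features,
    fused into one embedding."""
    def __init__(self, n_prosodic=16, emb_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mlp = nn.Sequential(nn.Linear(n_prosodic, 32), nn.ReLU())
        self.head = nn.Linear(32 + 32, emb_dim)

    def forward(self, mel_spec, prosodic):
        # mel_spec: (B, 1, n_mels, frames), prosodic: (B, n_prosodic)
        return self.head(torch.cat([self.cnn(mel_spec), self.mlp(prosodic)], dim=-1))

def siamese_distance(branch, x1, p1, x2, p2):
    """Both inputs pass through the same (weight-shared) branch; verification
    can then threshold this embedding distance."""
    return torch.norm(branch(x1, p1) - branch(x2, p2), dim=-1)
```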
Multi-biometric templates using fingerprint and voice
As biometrics gains popularity, there is increasing concern about privacy and the misuse of biometric data held in central repositories. Furthermore, biometric verification systems face challenges arising from noise and intra-class variations. To tackle both problems, a multimodal biometric verification system combining fingerprint and voice modalities is proposed. The system combines the two modalities at the template level, using multibiometric templates. The fusion of fingerprint and voice data diminishes privacy concerns by hiding the minutiae points of the fingerprint among artificial points generated from the features of the speaker's spoken utterance. Equal error rates under 2% are observed for a system in which 600 utterances from 30 people were processed and fused with a database of 400 fingerprints from 200 individuals. Accuracy is increased compared to previous results for voice verification over the same speaker database.
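One possible reading of the template-level fusion, sketched below: genuine minutiae are mixed with artificial points seeded by voice features so that the real points are not directly exposed. The point encoding and seeding scheme are purely illustrative, not the paper's construction.

```python
import random

def build_multibiometric_template(minutiae, voice_features, n_chaff=200, seed=None):
    """Hide genuine fingerprint minutiae among artificial points whose positions
    are derived from spoken-utterance features (illustrative scheme only).
    minutiae: list of (x, y, angle); voice_features: list of floats in [0, 1]."""
    rng = random.Random(seed)
    chaff = []
    for i in range(n_chaff):
        f = voice_features[i % len(voice_features)]
        # artificial point location seeded by a voice feature value plus jitter
        chaff.append((int(500 * f + rng.randint(0, 20)),
                      int(500 * (1 - f) + rng.randint(0, 20)),
                      rng.uniform(0, 360)))
    template = list(minutiae) + chaff
    rng.shuffle(template)  # genuine points are not distinguishable by position in the list
    return template
```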
Audio-Visual Speaker Identification using the CUAVE Database
The freely available nature of the CUAVE database allows it to provide a valuable platform for forming benchmarks and comparing research. This paper shows that the CUAVE database can successfully be used to test speaker identification systems, with performance comparable to existing systems implemented on other databases. Additionally, this research shows that the optimal configuration for decision fusion in an audio-visual speaker identification system relies heavily on the video modality in all but clean speech conditions.
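A minimal sketch of condition-dependent decision fusion of per-speaker audio and video scores; the weighting rule and the SNR threshold in the usage comment are assumptions, not the configuration reported on CUAVE.

```python
def fuse_decisions(audio_scores, video_scores, audio_weight):
    """Weighted decision fusion: per-speaker scores from the audio and video
    classifiers are combined with a weight chosen by the acoustic condition.
    Inputs are dicts mapping speaker id -> score; returns the identified speaker."""
    fused = {spk: audio_weight * audio_scores[spk] +
                  (1.0 - audio_weight) * video_scores[spk]
             for spk in audio_scores}
    return max(fused, key=fused.get)

# e.g. lean on video except in clean conditions (illustrative weights)
# speaker = fuse_decisions(a, v, audio_weight=0.7 if snr_db > 20 else 0.2)
```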
Text-Independent Speaker Verification Using 3D Convolutional Neural Networks
In this paper, a novel method using a 3D Convolutional Neural Network (3D-CNN) architecture is proposed for speaker verification in the text-independent setting. One of the main challenges is the creation of the speaker models. Most previously reported approaches create speaker models by averaging the features extracted from a speaker's utterances, which is known as the d-vector system. In this paper, we propose adaptive feature learning using 3D-CNNs for direct speaker model creation, in which, for both the development and enrollment phases, an identical number of spoken utterances per speaker is fed to the network to represent the speakers' utterances and create the speaker model. This captures the speaker-related information while building a system that is more robust to within-speaker variation. We demonstrate that the proposed method significantly outperforms the traditional d-vector verification system. Moreover, by utilizing 3D-CNNs, the proposed system can serve as a one-shot speaker modeling alternative to the traditional d-vector system.
Comment: Accepted to be published in IEEE International Conference on Multimedia and Expo (ICME) 201
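A hedged sketch of the utterance-stacking idea: a fixed number of utterances per speaker forms the depth axis of a 3D input, so one forward pass yields the speaker model directly; all tensor sizes and layers below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Speaker3DCNN(nn.Module):
    """Illustrative 3D-CNN over a stack of utterances: the depth axis indexes a
    fixed number of utterances per speaker, giving a speaker embedding (model)
    in one pass; a classification head is used during the development phase."""
    def __init__(self, n_speakers=100, emb_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.embed = nn.Linear(32, emb_dim)              # speaker model / embedding
        self.classify = nn.Linear(emb_dim, n_speakers)   # development-phase head

    def forward(self, x):
        # x: (batch, 1, n_utterances, n_frames, n_features), e.g. (B, 1, 20, 80, 40)
        emb = self.embed(self.features(x))
        return emb, self.classify(emb)
```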