    Identity Verification Using Speech and Face Information

    This article first provides a review of important concepts in the field of information fusion, followed by a review of important milestones in audio–visual person identification and verification. Several recent adaptive and nonadaptive techniques for reaching the verification decision (i.e., to accept or reject the claimant), based on speech and face information, are then evaluated in clean and noisy audio conditions on a common database. It is shown that in clean conditions most of the nonadaptive approaches provide similar performance, while in noisy conditions most exhibit a severe deterioration in performance; it is also shown that current adaptive approaches are either inadequate or rely on restrictive assumptions. A new category of classifiers is then introduced, where the decision boundary is fixed but constructed to take into account how the distributions of opinions are likely to change under noisy conditions; compared to a previously proposed adaptive approach, the proposed classifiers do not make a direct assumption about the type of noise that causes the mismatch between training and testing conditions.
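
    As a rough illustration of the opinion-fusion step described above, the sketch below combines a speech score and a face score with a weighted sum and a fixed threshold; the weight, threshold, and function name are illustrative assumptions, not the paper's noise-aware classifiers.

```python
def fuse_and_verify(speech_score, face_score, w_speech=0.5, threshold=0.0):
    """Weighted-sum fusion of per-modality verification scores (opinions).

    speech_score, face_score: higher means stronger support for the claimed
    identity.  The paper's proposed classifiers instead shape a *fixed*
    decision boundary that anticipates how the speech-opinion distribution
    shifts under noise; this shows only the baseline weighted-sum idea.
    """
    fused = w_speech * speech_score + (1.0 - w_speech) * face_score
    return fused >= threshold  # True -> accept the claimant, False -> reject


# Toy usage: clean speech strongly supports the claim, face is borderline.
print(fuse_and_verify(speech_score=1.2, face_score=0.1))
```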

    Learnable PINs: Cross-Modal Embeddings for Person Identity

    We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task, which is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios and establish a benchmark for this novel task; finally, we show an application of using the joint embedding for automatically retrieving and labelling characters in TV dramas. Comment: To appear in ECCV 2018.
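
    To make the cross-modal self-supervision concrete, here is a minimal PyTorch sketch of a contrastive matching loss over a batch of face/voice embeddings, with a crude curriculum knob standing in for the hard-negative mining schedule; the names, margin, and hard_frac parameter are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(face_emb, voice_emb, margin=0.6, hard_frac=1.0):
    """face_emb, voice_emb: (B, D); row i of each comes from the same clip,
    so it is a positive pair and every other row is a negative.
    hard_frac is a toy curriculum: early in training only the easiest
    (least similar) negatives contribute; raising it towards 1.0 gradually
    admits the hardest negatives as well."""
    face_emb = F.normalize(face_emb, dim=1)
    voice_emb = F.normalize(voice_emb, dim=1)
    sim = face_emb @ voice_emb.t()                 # (B, B) cosine similarities
    pos = sim.diag()                               # matched face/voice pairs
    neg = sim.masked_fill(torch.eye(sim.size(0), dtype=torch.bool), float('inf'))
    k = max(1, int(hard_frac * (sim.size(0) - 1)))
    neg_k, _ = neg.topk(k, dim=1, largest=False)   # k easiest negatives per row
    hinge = F.relu(margin - pos.unsqueeze(1) + neg_k)  # want pos > neg + margin
    return hinge.mean()

faces, voices = torch.randn(8, 128), torch.randn(8, 128)
print(cross_modal_contrastive_loss(faces, voices, hard_frac=0.25))
```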

    Disentangled Speech Embeddings using Cross-modal Self-supervision

    The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads 'in the wild', and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance. Comment: ICASSP 2020. The first three authors contributed equally to this work.
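
    A minimal sketch of the two-stream idea, assuming log-mel input and arbitrary layer sizes: a shared trunk feeds a frame-rate content head and a time-pooled identity head. This illustrates the shared-features/disentangled-heads structure, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class TwoStreamAudioEncoder(nn.Module):
    """Shared low-level trunk, then separate content and identity streams."""
    def __init__(self, n_mels=40, dim=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.content_head = nn.Conv1d(dim, dim, kernel_size=1)  # per-frame linguistic content
        self.identity_head = nn.Linear(dim, dim)                # utterance-level speaker identity

    def forward(self, mel):                            # mel: (B, n_mels, T)
        h = self.trunk(mel)                            # (B, dim, T) shared features
        content = self.content_head(h)                 # stays time-varying
        identity = self.identity_head(h.mean(dim=2))   # pooled over time
        return content, identity

enc = TwoStreamAudioEncoder()
content, identity = enc(torch.randn(2, 40, 100))
print(content.shape, identity.shape)
```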

    Detecting replay attacks in audiovisual identity verification

    We describe an algorithm that detects a lack of correspondence between speech and lip motion by detecting and monitoring the degree of synchrony between live audio and visual signals. It is simple, effective, and computationally inexpensive, providing a useful degree of robustness against basic replay attacks and against speech or image forgeries. The method is based on a cross-correlation analysis between two streams of features, one from the audio signal and the other from the image sequence. We argue that such an algorithm forms an effective first barrier against several kinds of replay attack that would defeat existing verification systems based on standard multimodal fusion techniques. In order to provide an evaluation mechanism for the new technique, we have augmented the protocols that accompany the BANCA multimedia corpus by defining new scenarios. We obtain 0% equal-error rate (EER) on the simplest scenario and 35% on a more challenging one.
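
    The core synchrony measurement can be sketched as a peak normalized cross-correlation between one audio feature track and one lip feature track; the specific features, lag window, and decision threshold below are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

def synchrony_score(audio_feat, visual_feat, max_lag=5):
    """Peak normalized cross-correlation between two equally sampled feature
    tracks (e.g. frame energy vs. mouth opening).  A live talking face should
    give a strong peak near zero lag; replayed or forged streams should not."""
    a = (audio_feat - audio_feat.mean()) / (audio_feat.std() + 1e-8)
    v = (visual_feat - visual_feat.mean()) / (visual_feat.std() + 1e-8)
    corrs = []
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            corrs.append(np.mean(a[-lag:] * v[:lag]))
        elif lag > 0:
            corrs.append(np.mean(a[:-lag] * v[lag:]))
        else:
            corrs.append(np.mean(a * v))
    return max(corrs)  # compare against a threshold to accept or flag a replay

# Toy check: an in-sync pair scores higher than a shuffled (out-of-sync) pair.
t = np.linspace(0, 4 * np.pi, 200)
audio = np.sin(t)
print(synchrony_score(audio, np.sin(t)),
      synchrony_score(audio, np.random.permutation(np.sin(t))))
```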

    Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision

    The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading, using the features trained on audio-visual synchronisation, and speaker recognition, using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin. Comment: Under submission as a conference paper.
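
    One way to read the proposed objective is as a cross-modal matching loss plus an intra-modal separation penalty; the sketch below, with assumed weights and temperature, is a stand-in for that combination rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_and_intra_modal_loss(face_emb, voice_emb, temperature=0.07, alpha=0.5):
    """face_emb, voice_emb: (B, D); row i of each comes from the same clip.
    The cross-modal term matches face i to voice i via a softmax over the batch;
    the intra-modal term penalises high similarity between *different* clips
    within each modality, encouraging intra-modal feature separation."""
    f = F.normalize(face_emb, dim=1)
    v = F.normalize(voice_emb, dim=1)
    targets = torch.arange(f.size(0))
    cross = F.cross_entropy(f @ v.t() / temperature, targets)

    def intra(x):
        sim = x @ x.t()
        off_diag = sim[~torch.eye(x.size(0), dtype=torch.bool)]
        return F.relu(off_diag).mean()   # different clips should not look alike

    return cross + alpha * (intra(f) + intra(v))

faces, voices = torch.randn(8, 128), torch.randn(8, 128)
print(cross_and_intra_modal_loss(faces, voices))
```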

    Multi-biometric templates using fingerprint and voice

    As biometrics gains popularity, there is an increasing concern about privacy and the misuse of biometric data held in central repositories. Furthermore, biometric verification systems face challenges arising from noise and intra-class variations. To tackle both problems, a multimodal biometric verification system combining fingerprint and voice modalities is proposed. The system combines the two modalities at the template level, using multibiometric templates. The fusion of fingerprint and voice data successfully diminishes privacy concerns by hiding the fingerprint's minutiae points among artificial points generated from features of the speaker's spoken utterance. Equal error rates are observed to be under 2% for the system, where 600 utterances from 30 people have been processed and fused with a database of 400 fingerprints from 200 individuals. Accuracy is increased compared to previous results for voice verification over the same speaker database.
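
    The template-level fusion can be pictured as hiding the genuine minutiae among voice-seeded chaff points, in the spirit of a fuzzy vault; the construction below (coordinate range, chaff count, seeding) is purely illustrative and not the scheme evaluated in the paper.

```python
import numpy as np

def build_multibiometric_template(minutiae, voice_features, n_chaff=60):
    """minutiae: (M, 2) array of genuine (x, y) minutia locations.
    voice_features: 1-D array of speaker features, used here only to seed the
    artificial (chaff) points so the same speaker regenerates the same chaff."""
    seed = int(abs(voice_features).sum() * 1000) % 100_000
    rng = np.random.default_rng(seed)
    chaff = rng.uniform(0, 500, size=(n_chaff, 2))   # voice-seeded artificial points
    template = np.vstack([minutiae, chaff])
    rng.shuffle(template)                            # genuine and chaff points indistinguishable
    return template

minutiae = np.array([[120.0, 80.0], [200.0, 310.0], [55.0, 400.0]])
voice = np.random.randn(64)
print(build_multibiometric_template(minutiae, voice).shape)  # (63, 2)
```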

    Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification

    In this paper, a novel cross-device text-independent speaker verification architecture is proposed. The majority of state-of-the-art deep architectures used for speaker verification tasks consider Mel-frequency cepstral coefficients. In contrast, our proposed Siamese convolutional neural network architecture uses Mel-frequency spectrogram coefficients to benefit from the dependencies between adjacent spectro-temporal features. Moreover, although spectro-temporal features have proved to be highly reliable in speaker verification models, they only represent some aspects of the short-term acoustic traits of the speaker's voice. However, the human voice conveys several linguistic levels, such as acoustics, lexicon, prosody, and phonetics, that can be utilized in speaker verification models. To compensate for these inherent shortcomings of spectro-temporal features, we propose to enhance the proposed Siamese convolutional neural network architecture by deploying a multilayer perceptron network to incorporate prosodic, jitter, and shimmer features. The proposed end-to-end verification architecture performs feature extraction and verification simultaneously. This architecture displays significant improvement over classical signal processing approaches and deep algorithms for forensic cross-device speaker verification. Comment: Accepted at the 9th IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2018).
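
    A compact sketch of the two-branch idea, with assumed layer sizes: a small CNN over the Mel-spectrogram and an MLP over prosodic statistics (pitch, jitter, shimmer), concatenated into one embedding and compared between the two utterances of a Siamese pair. This illustrates the structure described in the abstract, not the exact network from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodicSiameseNet(nn.Module):
    def __init__(self, n_prosodic=16, dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                        # spectro-temporal branch
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(), nn.Linear(32 * 16, dim),
        )
        self.mlp = nn.Sequential(                        # prosodic branch
            nn.Linear(n_prosodic, 64), nn.ReLU(), nn.Linear(64, dim),
        )

    def embed(self, spec, prosodic):
        # spec: (B, 1, n_mels, T) Mel-spectrogram; prosodic: (B, n_prosodic)
        return F.normalize(torch.cat([self.cnn(spec), self.mlp(prosodic)], dim=1), dim=1)

    def forward(self, spec_a, pros_a, spec_b, pros_b):
        # Siamese use: shared weights embed both utterances; cosine similarity
        # between the embeddings serves as the verification score.
        return (self.embed(spec_a, pros_a) * self.embed(spec_b, pros_b)).sum(dim=1)

net = ProsodicSiameseNet()
score = net(torch.randn(2, 1, 40, 100), torch.randn(2, 16),
            torch.randn(2, 1, 40, 100), torch.randn(2, 16))
print(score)
```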