234 research outputs found

    One-shot lip-based biometric authentication: extending behavioral features with authentication phrase information

    Full text link
    Lip-based biometric authentication (LBBA) is an authentication method based on a person's lip movements during speech in the form of video data captured by a camera sensor. LBBA can utilize both physical and behavioral characteristics of lip movements without requiring any additional sensory equipment apart from an RGB camera. State-of-the-art (SOTA) approaches use one-shot learning to train deep siamese neural networks which produce an embedding vector out of these features. Embeddings are further used to compute the similarity between an enrolled user and a user being authenticated. A flaw of these approaches is that they model behavioral features as style-of-speech without relation to what is being said. This makes the system vulnerable to video replay attacks of the client speaking any phrase. To solve this problem we propose a one-shot approach which models behavioral features to discriminate against what is being said in addition to style-of-speech. We achieve this by customizing the GRID dataset to obtain required triplets and training a siamese neural network based on 3D convolutions and recurrent neural network layers. A custom triplet loss for batch-wise hard-negative mining is proposed. Obtained results using an open-set protocol are 3.2% FAR and 3.8% FRR on the test set of the customized GRID dataset. Additional analysis of the results was done to quantify the influence and discriminatory power of behavioral and physical features for LBBA.Comment: 28 pages, 10 figures, 7 table

    SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams

    Full text link
    We present SpeakingFaces as a publicly-available large-scale dataset developed to support multimodal machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human-computer interaction (HCI), biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces is comprised of well-aligned high-resolution thermal and visual spectra image streams of fully-framed faces synchronized with audio recordings of each subject speaking approximately 100 imperative phrases. Data were collected from 142 subjects, yielding over 13,000 instances of synchronized data (~3.8 TB). For technical validation, we demonstrate two baseline examples. The first baseline shows classification by gender, utilizing different combinations of the three data streams in both clean and noisy environments. The second example consists of thermal-to-visual facial image translation, as an instance of domain transfer.Comment: 6 pages, 4 figures, 3 table

    Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey

    Full text link
    Speaker-independent VSR is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements. Over the years, there has been a considerable amount of research in the field of VSR involving different algorithms and datasets to evaluate system performance. These efforts have resulted in significant progress in developing effective VSR models, creating new opportunities for further research in this area. This survey provides a detailed examination of the progression of VSR over the past three decades, with a particular emphasis on the transition from speaker-dependent to speaker-independent systems. We also provide a comprehensive overview of the various datasets used in VSR research and the preprocessing techniques employed to achieve speaker independence. The survey covers the works published from 1990 to 2023, thoroughly analyzing each work and comparing them on various parameters. This survey provides an in-depth analysis of speaker-independent VSR systems evolution from 1990 to 2023. It outlines the development of VSR systems over time and highlights the need to develop end-to-end pipelines for speaker-independent VSR. The pictorial representation offers a clear and concise overview of the techniques used in speaker-independent VSR, thereby aiding in the comprehension and analysis of the various methodologies. The survey also highlights the strengths and limitations of each technique and provides insights into developing novel approaches for analyzing visual speech cues. Overall, This comprehensive review provides insights into the current state-of-the-art speaker-independent VSR and highlights potential areas for future research

    Multi-modal Authentication Model for Occluded Faces in a Challenging Environment

    Get PDF
    Authentication systems are crucial in the digital era, providing reliable protection of personal information. Most authentication systems rely on a single modality, such as the face, fingerprints, or password sensors. In the case of an authentication system based on a single modality, there is a problem in that the performance of the authentication is degraded when the information of the corresponding modality is covered. Especially, face identification does not work well due to the mask in a COVID-19 situation. In this paper, we focus on the multi-modality approach to improve the performance of occluded face identification. Multi-modal authentication systems are crucial in building a robust authentication system because they can compensate for the lack of modality in the uni-modal authentication system. In this light, we propose DemoID, a multi-modal authentication system based on face and voice for human identification in a challenging environment. Moreover, we build a demographic module to efficiently handle the demographic information of individual faces. The experimental results showed an accuracy of 99% when using all modalities and an overall improvement of 5.41%–10.77% relative to uni-modal face models. Furthermore, our model demonstrated the highest performance compared to existing multi-modal models and also showed promising results on the real-world dataset constructed for this study.This work was supported in part by Basic Science Research Program through the National Research Foundation of Korea, funded by the Ministry of Education under Grant NRF-2022R1A6A3A13063417, in part by the Government of the Republic of Korea (MSIT), and in part by the National Research Foundation of Korea under Grant NRF-2023K2A9A1A01098773

    EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis

    Get PDF
    Data clustering has received a lot of attention and numerous methods, algorithms and software packages are available. Among these techniques, parametric finite-mixture models play a central role due to their interesting mathematical properties and to the existence of maximum-likelihood estimators based on expectation-maximization (EM). In this paper we propose a new mixture model that associates a weight with each observed point. We introduce the weighted-data Gaussian mixture and we derive two EM algorithms. The first one considers a fixed weight for each observation. The second one treats each weight as a random variable following a gamma distribution. We propose a model selection method based on a minimum message length criterion, provide a weight initialization strategy, and validate the proposed algorithms by comparing them with several state of the art parametric and non-parametric clustering techniques. We also demonstrate the effectiveness and robustness of the proposed clustering technique in the presence of heterogeneous data, namely audio-visual scene analysis.Comment: 14 pages, 4 figures, 4 table

    An audio-visual corpus for multimodal automatic speech recognition

    Get PDF
    corecore