One-shot lip-based biometric authentication: extending behavioral features with authentication phrase information
Lip-based biometric authentication (LBBA) is an authentication method based
on a person's lip movements during speech in the form of video data captured by
a camera sensor. LBBA can utilize both physical and behavioral characteristics
of lip movements without requiring any additional sensory equipment apart from
an RGB camera. State-of-the-art (SOTA) approaches use one-shot learning to
train deep siamese neural networks that produce an embedding vector from
these features. The embeddings are then used to compute the similarity between
an enrolled user and the user being authenticated. A flaw of these approaches
is that they model behavioral features as style-of-speech with no relation to
what is being said, which leaves the system vulnerable to replay attacks using
video of the client speaking an arbitrary phrase. To solve this problem, we
propose a one-shot approach that models behavioral features to discriminate on
what is being said in addition to style-of-speech. We achieve this by
customizing the GRID dataset to obtain the required triplets and training a
siamese neural network based on 3D convolutions and recurrent neural network
layers. We also propose a custom triplet loss for batch-wise hard-negative
mining. Using an open-set protocol, we obtain 3.2% FAR and 3.8% FRR on the
test set of the customized GRID dataset. We additionally analyze the results
to quantify the influence and discriminatory power of behavioral and physical
features for LBBA.

Comment: 28 pages, 10 figures, 7 tables
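To make the training objective concrete, here is a minimal sketch of a batch-wise hard-negative triplet loss in PyTorch. It is a generic batch-hard formulation under assumed inputs (L2-normalized embeddings and per-identity labels), not the authors' custom loss:

```python
# Generic batch-hard triplet loss; a sketch, not the paper's custom loss.
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(emb, labels, margin=0.2):
    """emb: (B, D) L2-normalized embeddings; labels: (B,) identity IDs."""
    dist = torch.cdist(emb, emb)                        # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    # Hardest positive: the farthest embedding with the same label (excluding self).
    d_pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    # Hardest negative: the closest embedding with a different label.
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(d_pos - d_neg + margin).mean()
```

Mining the hardest in-batch pairs keeps the loss focused on triplets that still violate the margin, which is what makes one-shot embedding training tractable.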
SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams
We present SpeakingFaces, a publicly available large-scale dataset
developed to support multimodal machine learning research in contexts that
utilize a combination of thermal, visual, and audio data streams; examples
include human-computer interaction (HCI), biometric authentication, recognition
systems, domain transfer, and speech recognition. SpeakingFaces comprises
well-aligned, high-resolution thermal and visual image streams of
fully-framed faces synchronized with audio recordings of each subject speaking
approximately 100 imperative phrases. Data were collected from 142 subjects,
yielding over 13,000 instances of synchronized data (~3.8 TB). For technical
validation, we demonstrate two baseline examples. The first baseline performs
gender classification, utilizing different combinations of the three data
streams in both clean and noisy environments. The second example consists of
thermal-to-visual facial image translation, as an instance of domain transfer.

Comment: 6 pages, 4 figures, 3 tables
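As a sense of how such a baseline might consume the three streams, below is a hedged sketch of feature-level fusion for the gender-classification example; the encoder output dimensions and class name are illustrative assumptions, not the released baseline code:

```python
# Feature-level fusion for a gender-classification baseline; dimensions and
# names are illustrative assumptions, not the SpeakingFaces baseline itself.
import torch
import torch.nn as nn

class FeatureFusionClassifier(nn.Module):
    def __init__(self, d_visual=512, d_thermal=512, d_audio=256, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_visual + d_thermal + d_audio, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, f_visual, f_thermal, f_audio):
        # Concatenate per-stream features from modality-specific encoders,
        # then classify; "combinations of streams" corresponds to ablating slices.
        return self.head(torch.cat([f_visual, f_thermal, f_audio], dim=-1))
```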
Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey
Speaker-independent visual speech recognition (VSR) is a complex task that
involves identifying spoken words or phrases from video recordings of a
speaker's facial movements. Over the years, there has been a considerable
amount of research in the field of VSR involving different algorithms and
datasets to evaluate system performance. These efforts have resulted in
significant progress in developing effective VSR models, creating new
opportunities for further research in this area. This survey provides a
detailed examination of the progression of VSR over the past three decades,
with a particular emphasis on the transition from speaker-dependent to
speaker-independent systems. We also provide a comprehensive overview of the
various datasets used in VSR research and the preprocessing techniques
employed to achieve speaker independence. The survey covers works published
from 1990 to 2023, thoroughly analyzing each and comparing them on various
parameters. It outlines the development of VSR systems over time and
highlights the need to develop end-to-end pipelines for speaker-independent
VSR. Pictorial representations offer a clear and concise overview of the
techniques used in speaker-independent VSR, aiding in the comprehension and
analysis of the various methodologies. The survey also highlights the
strengths and limitations of each technique and provides insights into
developing novel approaches for analyzing visual speech cues. Overall, this
comprehensive review provides insight into the current state of the art in
speaker-independent VSR and highlights potential areas for future research.
Multi-modal Authentication Model for Occluded Faces in a Challenging Environment
Authentication systems are crucial in the digital era, providing reliable protection of personal information. Most authentication systems rely on a single modality, such as face, fingerprint, or password. In a system based on a single modality, authentication performance degrades when the information of that modality is occluded; in particular, face identification performs poorly when a mask is worn, as in the COVID-19 situation. In this paper, we focus on a multi-modal approach to improve the performance of occluded face identification. Multi-modal authentication systems are crucial in building a robust authentication system because they can compensate for the missing modality of a uni-modal system. In this light, we propose DemoID, a multi-modal authentication system based on face and voice for human identification in a challenging environment. Moreover, we build a demographic module to efficiently handle the demographic information of individual faces. The experimental results showed an accuracy of 99% when using all modalities and an overall improvement of 5.41%–10.77% relative to uni-modal face models. Furthermore, our model demonstrated the highest performance compared to existing multi-modal models and also showed promising results on the real-world dataset constructed for this study.

This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea, funded by the Ministry of Education under Grant NRF-2022R1A6A3A13063417, in part by the Government of the Republic of Korea (MSIT), and in part by the National Research Foundation of Korea under Grant NRF-2023K2A9A1A01098773.
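A simple way to picture the multi-modal idea is score-level fusion of per-modality similarities, with a fallback when one modality is occluded. The sketch below is an illustrative assumption (cosine scoring with a fixed weight), not the DemoID architecture, which additionally includes a demographic module:

```python
# Illustrative score-level face-voice fusion; not the DemoID implementation.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(face_emb, voice_emb, ref_face, ref_voice, w_face=0.6):
    """Weighted sum of per-modality similarities; falls back to voice alone
    when the face modality is unavailable (e.g., fully occluded)."""
    s_voice = cosine(voice_emb, ref_voice)
    if face_emb is None:
        return s_voice
    return w_face * cosine(face_emb, ref_face) + (1 - w_face) * s_voice
```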
EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis
Data clustering has received a lot of attention, and numerous methods,
algorithms, and software packages are available. Among these techniques,
parametric finite-mixture models play a central role due to their interesting
mathematical properties and to the existence of maximum-likelihood estimators
based on expectation-maximization (EM). In this paper we propose a new mixture
model that associates a weight with each observed point. We introduce the
weighted-data Gaussian mixture and derive two EM algorithms. The first one
considers a fixed weight for each observation. The second one treats each
weight as a random variable following a gamma distribution. We propose a model
selection method based on a minimum message length criterion, provide a weight
initialization strategy, and validate the proposed algorithms by comparing
them with several state-of-the-art parametric and non-parametric clustering
techniques. We also demonstrate the effectiveness and robustness of the
proposed clustering technique in the presence of heterogeneous data, namely in
audio-visual scene analysis.

Comment: 14 pages, 4 figures, 4 tables
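To illustrate the first algorithm (fixed per-observation weights), here is a minimal NumPy sketch of EM for a weighted-data Gaussian mixture. The weighting convention shown (scaling each point's log-likelihood contribution by its weight) is a common choice and an assumption here; the paper's exact formulation and its gamma-weight variant differ in detail:

```python
# EM for a Gaussian mixture with fixed per-point weights; a sketch under an
# assumed weighting convention, not the paper's exact algorithm.
import numpy as np
from scipy.stats import multivariate_normal

def weighted_gmm_em(X, w, K, n_iter=50, seed=0):
    """X: (n, d) data; w: (n,) positive weights; K: number of components."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)].copy()      # init means from data
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities, with each point's log-likelihood scaled by w_i.
        log_r = np.stack(
            [np.log(pi[k]) + w * multivariate_normal.logpdf(X, mu[k], Sigma[k])
             for k in range(K)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)       # stabilize before exp
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates where point i effectively counts w_i * r_ik.
        for k in range(K):
            wk = w * r[:, k]
            Nk = wk.sum()
            mu[k] = wk @ X / Nk
            diff = X - mu[k]
            Sigma[k] = (wk[:, None] * diff).T @ diff / Nk + 1e-6 * np.eye(d)
            pi[k] = r[:, k].mean()
    return pi, mu, Sigma
```

Down-weighted points pull less on the means and covariances, which is what gives the model its robustness to heterogeneous or outlier-prone observations.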
A note on the robust stability of uncertain stochastic fuzzy systems with time-delays
Takagi-Sugeno (T-S) fuzzy models are now often used to describe complex nonlinear systems in terms of fuzzy sets and fuzzy reasoning applied to a set of linear submodels. In this note, the T-S fuzzy model approach is exploited to establish stability criteria for a class of nonlinear stochastic systems with time delay. Sufficient conditions are derived in the format of linear matrix inequalities (LMIs), such that for all admissible parameter uncertainties, the overall fuzzy system is stochastically exponentially stable in the mean square, independent of the time delay. Therefore, with the numerically attractive Matlab LMI toolbox, the robust stability of the uncertain stochastic fuzzy systems with time delays can be easily checked.
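For orientation, a delayed T-S fuzzy model blends r linear submodels through normalized membership functions; schematically (a generic textbook form, omitting the stochastic diffusion and uncertainty terms treated in the note):

```latex
\dot{x}(t) = \sum_{i=1}^{r} h_i(z(t)) \left[ A_i\, x(t) + A_{di}\, x(t-\tau) \right],
\qquad h_i(z) \ge 0, \qquad \sum_{i=1}^{r} h_i(z) = 1,
```

where z(t) collects the premise variables and \tau is the delay. The LMI conditions then certify mean-square exponential stability of the blended system for all admissible uncertainties.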