
    Extrapolating single view face models for multi-view recognition

    Copyright © 2004 IEEE. Performance of face recognition systems can be adversely affected by mismatches between training and test poses, especially when only one training image is available. We address this problem by extending each statistical frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of maximum likelihood linear regression (MLLR), as well as standard multivariate linear regression (LinReg). All synthesis techniques utilize prior information on how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg technique (which is based on a common relation between two sets of points) is more suited than the MLLR based techniques (which in effect are "single point to single point" transforms). For the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR (due to a lower number of free parameters). The results further show that extending frontal models considerably reduces errors. By Conrad Sanderson and Samy Bengio.
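    The LinReg variant above learns a common relation between two sets of points. Below is a minimal sketch of that idea, assuming the model parameters are mean vectors and using synthetic stand-in data; the matrix W, the toy data, and the helper name synthesize_nonfrontal are illustrative, not the paper's code.

```python
import numpy as np

# Learn a linear map from frontal-model parameters to non-frontal-model
# parameters using paired training data (synthetic stand-ins here).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))                          # frontal parameters
Y = X @ rng.normal(size=(32, 32)) + 0.1 * rng.normal(size=(100, 32))

# Augment with a bias column and solve Y ~= [X 1] W by least squares.
Xa = np.hstack([X, np.ones((X.shape[0], 1))])
W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)

def synthesize_nonfrontal(frontal_params: np.ndarray) -> np.ndarray:
    """Map a frontal model's parameters to a synthesized non-frontal view."""
    return np.append(frontal_params, 1.0) @ W

synth = synthesize_nonfrontal(X[0])   # synthesized non-frontal parameters
```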

    Improving Source Separation via Multi-Speaker Representations

    Lately there have been novel developments in deep learning towards solving the cocktail party problem. Initial results are very promising and allow for more research in the domain. One technique that has not yet been explored in the neural network approach to this task is speaker adaptation. Intuitively, information on the speakers that we are trying to separate seems fundamentally important for the speaker separation task. However, retrieving this speaker information is challenging, since the speaker identities are not known a priori and multiple speakers are simultaneously active. This creates a chicken-and-egg problem. To tackle it, source signals and i-vectors are estimated alternately. We show that blind multi-speaker adaptation improves the results of the network and that (in our case) the network is not capable of adequately retrieving this useful speaker information itself.
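    A rough sketch of the alternating estimation loop described above; separate and extract_ivector are hypothetical stand-ins for a trained separation network and an i-vector extractor, with toy bodies so the sketch runs.

```python
import numpy as np

def separate(mixture, speaker_embeddings=None):
    # Stand-in for a trained separation DNN; optionally conditioned on
    # speaker embeddings. Toy body: return two scaled copies of the mixture.
    return [mixture * 0.5, mixture * 0.5]

def extract_ivector(signal, dim=8):
    # Stand-in for an i-vector extractor: a fixed-size summary of the source.
    return np.resize(signal, dim)

def alternating_separation(mixture, n_iters=3):
    sources = separate(mixture)                    # blind first pass, no speaker info
    for _ in range(n_iters):
        ivectors = [extract_ivector(s) for s in sources]        # who is speaking?
        sources = separate(mixture, speaker_embeddings=ivectors)  # re-separate
    return sources

estimates = alternating_separation(np.random.randn(16000))
```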

    Subspace and graph methods to leverage auxiliary data for limited target data multi-class classification, applied to speaker verification

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 127-130). Multi-class classification can be adversely affected by the absence of sufficient target (in-class) instances for training. Such cases arise in face recognition, speaker verification, and document classification, among others. Auxiliary data-sets, which contain a diverse sampling of non-target instances, are leveraged in this thesis using subspace and graph methods to improve classification where target data is limited. The auxiliary data is used to define a compact representation that maps instances into a vector space where inner products quantify class similarity. Within this space, an estimate of the subspace that constitutes within-class variability (e.g. the recording channel in speaker verification or the illumination conditions in face recognition) can be obtained using class-labeled auxiliary data. This thesis proposes a way to incorporate this estimate into the SVM framework to perform nuisance compensation, thus improving classification performance. Another contribution is a framework that combines mapping and compensation into a single linear comparison, which motivates computationally inexpensive and accurate comparison functions. A key aspect of the work takes advantage of efficient pairwise comparisons between the training, test, and auxiliary instances to characterize their interaction within the vector space, and exploits it for improved classification in three ways. The first uses the local variability around the train and test instances to reduce false alarms. The second assumes the instances lie on a low-dimensional manifold and uses the distances along the manifold. The third extracts relational features from a similarity graph where nodes correspond to the training, test and auxiliary instances. To quantify the merit of the proposed techniques, results of experiments in speaker verification are presented where only a single target recording is provided to train the classifier. Experiments are performed on standard NIST corpora and methods are compared using standard evaluation metrics: detection error trade-off curves, minimum decision costs, and equal error rates. By Zahi Nadim Karam, Ph.D.
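    One way to read the nuisance-compensation step is as projecting away a within-class subspace estimated from class-labelled auxiliary data before comparison, in the spirit of nuisance attribute projection. A sketch under that assumption; the data and dimensionalities are illustrative, not the thesis's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
aux = {c: rng.normal(size=(20, 64)) for c in range(10)}   # toy auxiliary classes

# Stack within-class deviations (each instance minus its class mean).
dev = np.vstack([x - x.mean(axis=0) for x in aux.values()])

# Top-k directions of within-class (nuisance) variability via SVD.
k = 5
_, _, Vt = np.linalg.svd(dev, full_matrices=False)
U = Vt[:k].T                                   # (64, k) nuisance basis

def compensate(v: np.ndarray) -> np.ndarray:
    """Remove the estimated nuisance component: v <- (I - U U^T) v."""
    return v - U @ (U.T @ v)

# Inner product in the compensated space quantifies class similarity.
score = compensate(rng.normal(size=64)) @ compensate(rng.normal(size=64))
```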

    Statistical Transformation Techniques for Face Verification Using Faces Rotated in Depth

    In the framework of a Bayesian classifier based on mixtures of Gaussians, we address the problem of non-frontal face verification (when only a single (frontal) training image is available) by extending each frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg based technique is more suited than the MLLR based techniques; for the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the DCTmod2 based system is less affected by out-of-plane rotations than the PCA based system; this can be attributed to the local feature representation of the face and, due to the classifier based on mixtures of Gaussians, the lack of constraints on spatial relations between face parts, allowing for movement of facial areas.
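    For intuition, the MLLR-style "single point to single point" view can be pictured as a shared affine transform of the Gaussian means, mu' = A mu + b. A toy sketch, with A and b standing in for parameters that would be learned from paired frontal/non-frontal models.

```python
import numpy as np

# Toy GMM component means for a frontal face model.
rng = np.random.default_rng(1)
mus_frontal = rng.normal(size=(16, 24))                 # 16 components, dim 24

# Assumed learned affine transform (illustrative values, not real estimates).
A = np.eye(24) + 0.05 * rng.normal(size=(24, 24))
b = 0.1 * rng.normal(size=24)

# Apply mu' = A mu + b to every component mean at once.
mus_synth = mus_frontal @ A.T + b                       # synthesized non-frontal means
```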

    On-line signature verification using Gaussian mixture models and learning strategies for small sample sets

    This paper addresses the problem of training on-line signature verification systems when the number of training samples is small, reflecting the real-world scenario in which the number of available signatures per user is limited. The paper evaluates nine different classification strategies based on Gaussian Mixture Models (GMM) and the Universal Background Model (UBM) strategy, which is designed to work under small-sample-size conditions. The GMM learning strategies include the conventional Expectation-Maximisation algorithm and a Bayesian approach based on variational learning. The signatures are characterised mainly in terms of velocities and accelerations of the users' handwriting patterns. The results show that for a genuine vs. impostor test, the GMM-UBM method keeps accuracy above 93% even when only 20% of the samples (5 signatures) are used for training. Moreover, the combination of a full Bayesian UBM and a Support Vector Machine (SVM), known as GMM-Supervector, achieves 99% accuracy when the training samples exceed 20. On the other hand, when simulating a real environment in which no impostor signatures are available and only 20% of the samples are used for training, the combination of a full Bayesian UBM and an SVM again achieves more than 77% accuracy, with a false acceptance rate below 3%.
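    A rough sketch of the GMM-UBM log-likelihood-ratio scoring the paper builds on (not its exact configuration); a real small-sample system would MAP-adapt the UBM to the client's few signatures rather than re-fit a client model from scratch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 10))                 # pooled feature vectors
enrolment = rng.normal(loc=0.3, size=(5 * 60, 10))       # ~5 signatures of features

# Universal background model trained on pooled data from many writers.
ubm = GaussianMixture(n_components=8, covariance_type='diag',
                      random_state=0).fit(background)

# Simplified client model (a real system would MAP-adapt the UBM means).
client = GaussianMixture(n_components=8, covariance_type='diag',
                         random_state=0).fit(enrolment)

def llr(test_features: np.ndarray) -> float:
    """Average log-likelihood ratio of a test signature vs. the UBM."""
    return client.score(test_features) - ubm.score(test_features)

print(llr(rng.normal(loc=0.3, size=(60, 10))))   # genuine-like: higher score
print(llr(rng.normal(loc=-1.0, size=(60, 10))))  # impostor-like: lower score
```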

    VOICE BIOMETRICS UNDER MISMATCHED NOISE CONDITIONS

    This thesis describes research into effective voice biometrics (speaker recognition) under mismatched noise conditions. Over the last two decades, this class of biometrics has been the subject of considerable research due to its various applications in such areas as telephone banking, remote access control and surveillance. One of the main challenges associated with the deployment of voice biometrics in practice is that of undesired variations in speech characteristics caused by environmental noise. Such variations can lead to a mismatch between the corresponding test and reference material from the same speaker, which is found to adversely affect speaker recognition accuracy. To address this problem, a novel approach is introduced and investigated. The proposed method is based on minimising the noise mismatch between reference speaker models and the given test utterance, and involves a new form of Test-Normalisation (T-Norm) for further enhancing matching scores under the aforementioned adverse operating conditions. Through experimental investigations based on the two main classes of speaker recognition (i.e. verification and open-set identification), it is shown that the proposed approach can significantly improve performance accuracy under mismatched noise conditions. To further improve recognition accuracy in severe mismatch conditions, an enhancement of the above method is proposed: it provides a closer adjustment of the reference speaker models to the noise condition in the test utterance, and is shown to considerably increase accuracy in extreme cases of noisy test data. Moreover, to tackle the computational burden associated with using the enhanced approach for open-set identification, an efficient algorithm for its realisation in this context is introduced and evaluated. The thesis presents a detailed description of the research undertaken, describes the experimental investigations and provides a thorough analysis of the outcomes.
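    For reference, standard T-Norm (the baseline the thesis modifies) normalises a raw verification score by the statistics of the same test utterance scored against a cohort of impostor models; a minimal sketch with toy scores.

```python
import numpy as np

def t_norm(raw_score: float, cohort_scores: np.ndarray) -> float:
    """T-Norm: centre and scale a raw score by cohort statistics."""
    mu, sigma = cohort_scores.mean(), cohort_scores.std()
    return (raw_score - mu) / sigma

# Toy scores of the test utterance against a cohort of impostor models.
cohort = np.array([-1.2, -0.8, -1.0, -0.9, -1.1])
print(t_norm(0.4, cohort))   # normalised score, comparable across utterances
```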

    Sound environment analysis in smart home

    This study aims at providing audio-based interaction technology that lets users have full control over their home environment, at detecting distress situations, and at easing the social inclusion of the elderly and frail population. The paper presents the sound and speech analysis system, evaluated using a corpus of data acquired in a real smart home environment. The four steps of analysis are signal detection, speech/sound discrimination, sound classification and speech recognition. Results are presented for each step and for the system globally. These first experiments show promising results, both for the modules evaluated independently and for the whole system.
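    A schematic of the four analysis steps named above; each function is a hypothetical stand-in for the corresponding module of the system, with toy bodies so the sketch runs.

```python
def detect_signal(audio):          # step 1: find segments with acoustic activity
    return [audio]                 # toy: the whole input as one segment

def is_speech(segment):            # step 2: speech vs. everyday sound
    return True                    # toy decision

def classify_sound(segment):       # step 3: e.g. door slam, glass break
    return "door_slam"

def recognize_speech(segment):     # step 4: ASR for voice commands / distress calls
    return "turn on the light"

def analyse(audio):
    results = []
    for seg in detect_signal(audio):
        if is_speech(seg):
            results.append(("speech", recognize_speech(seg)))
        else:
            results.append(("sound", classify_sound(seg)))
    return results

print(analyse(b"raw-audio-bytes"))
```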

    Automatic Person Verification Using Speech and Face Information

    Interest in biometric based identification and verification systems has increased considerably over the last decade. As an example, the shortcomings of security systems based on passwords can be addressed through the supplemental use of biometric systems based on speech signals, face images or fingerprints. Biometric recognition can also be applied to other areas, such as passport control (immigration checkpoints), forensic work (to determine whether a biometric sample belongs to a suspect) and law enforcement applications (e.g. surveillance). While biometric systems based on face images and/or speech signals can be useful, their performance can degrade under challenging conditions; in face based systems this can take the form of a change in the illumination direction and/or variations in face pose. Multi-modal systems use more than one biometric at the same time, for two main reasons: to achieve better robustness and to increase discrimination power. This thesis reviews relevant background in speech and face processing, as well as information fusion. It reports research aimed at increasing the robustness of single- and multi-modal biometric identity verification systems. In particular, it addresses the illumination and pose variation problems in face recognition, as well as the challenge of effectively fusing information from multiple modalities under non-ideal conditions.
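    One common realisation of multi-modal combination is score-level fusion; a minimal sketch assuming a weighted sum of per-modality verification scores (the weights, scores and threshold below are illustrative, not the thesis's values).

```python
def fuse(speech_score: float, face_score: float,
         w_speech: float = 0.5, w_face: float = 0.5) -> float:
    """Weighted-sum fusion of per-modality verification scores."""
    return w_speech * speech_score + w_face * face_score

# Under noisy audio, a robust system might down-weight the speech modality.
accept = fuse(0.7, 0.4) > 0.5            # equal weights
accept_noisy = fuse(0.7, 0.4, w_speech=0.2, w_face=0.8) > 0.5
```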

    Development of efficient techniques for an ASR system for speech detection and recognition using the Gaussian Mixture Model-Universal Background Model

    ASR has found practical uses including the transcription of meetings and smart speakers. It is the process by which speech waveforms are transformed into text, allowing computers to interpret and act upon human speech. Scalable strategies for developing ASR systems in languages where no voice transcriptions or pronunciation dictionaries exist are the primary focus of this work. We first show that the need for voice transcription in the target language can be greatly reduced through cross-lingual acoustic model transfer when phonemic pronunciation lexicons exist in the new language. We then investigate approaches to dealing with languages that lack a pronunciation lexicon, including the effectiveness of graphemic acoustic model transfer, which makes it easy to build pronunciation dictionaries. The problems addressed in this work are solved, in part, by investigating optimisation strategies, such as GA+HMM and DE+HMM, for training on large corpora; the suggested method is applied alongside traditional methods in the acoustic modelling training phase. Read speech and HMI voice experiments indicated that while each data augmentation strategy alone did not always increase recognition performance, using all three techniques together did. Power normalised cepstral coefficient (PNCC) features are slightly modified in this work to enhance verification accuracy. To increase speaker verification accuracy, we suggest employing multiple Gaussian Mixture Model-Universal Background Model (GMM-UBM) and SVM classifiers. Importantly, pitch shift data augmentation and multi-task training reduced bias by more than 18% absolute compared to the baseline system for read speech, and applying all three data augmentation techniques during fine-tuning reduced bias by more than 7% for HMI speech, while increasing recognition accuracy of both native and non-native Dutch speech.
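    A sketch of the GMM-Supervector idea referenced above: stack the (adapted) GMM component means of each utterance into one long vector and classify the supervectors with an SVM. The data, dimensions and labels below are illustrative stand-ins, not the paper's setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_components, dim = 8, 13                     # toy GMM size and feature dim

def supervector(means: np.ndarray) -> np.ndarray:
    """Stack per-component means into one long vector: (8, 13) -> (104,)."""
    return means.reshape(-1)

# Toy supervectors: 30 target-speaker utterances vs. 30 background utterances.
X = np.vstack([supervector(rng.normal(loc=l, size=(n_components, dim)))
               for l in [0.5] * 30 + [0.0] * 30])
y = np.array([1] * 30 + [0] * 30)

svm = SVC(kernel='linear').fit(X, y)          # linear SVM on supervectors
print(svm.predict(X[:3]))                     # expect target-speaker labels
```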