
    An improved normalized gain-based score normalization technique for spoof detection algorithm

    A spoof detection algorithm supports the speaker verification system in examining false claims by an impostor through careful analysis of the input test speech. The scores are used to categorize genuine and spoofed samples effectively. Under mismatched conditions, the false acceptance ratio increases and can be reduced by appropriate score normalization techniques. In this article, we use the normalized Discounted Cumulative Gain (nDCG) norm, derived from ranking the speaker's log-likelihood scores. The proposed scoring technique smooths the decay introduced by the logarithm, with the added advantage of the ranking. The baseline spoof detection system employs Constant Q Cepstral Coefficients (CQCC) as the base features with a Gaussian Mixture Model (GMM) classifier. Scores are computed on the ASVspoof 2019 dataset with and without normalization. Baseline techniques, including Zero normalization (Z-norm) and Test normalization (T-norm), are also considered. The proposed technique performs better, with an improved Equal Error Rate (EER) of 0.35 against 0.43 for the baseline system (no normalization) with respect to synthetic attacks on the development data. Similarly, improvements are seen for replay attacks, with an EER of 7.83 for the nDCG-norm and 9.87 with no normalization (no-norm). Furthermore, the tandem Detection Cost Function (t-DCF) scores for synthetic attacks are 0.015 for no-norm and 0.010 for the proposed normalization; for replay attacks, the t-DCF scores are 0.195 for no-norm and 0.17 for the proposed normalization. System performance remains satisfactory on the evaluation data, with an EER of 8.96 for the nDCG-norm against 9.57 with no-norm for synthetic attacks, and an EER of 9.79 for the nDCG-norm against 11.04 with no-norm for replay attacks. Supporting the EER, the t-DCF for synthetic attacks is 0.1989 for the nDCG-norm and 0.2636 for no-norm, while for replay attacks it is 0.2284 for the nDCG-norm and 0.2454 for no-norm. The proposed scoring technique is found to increase the spoof detection accuracy and the overall accuracy of the speaker verification system.
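    As a rough illustration of the idea described above, the sketch below ranks a set of log-likelihood scores and applies the standard DCG logarithmic discount, normalized by the ideal DCG. The exact gain mapping and normalization used in the article are not reproduced here; treating the raw log-likelihood as the DCG gain is an assumption made only for illustration.

```python
import numpy as np

def ndcg_normalize(scores):
    """Illustrative nDCG-style normalization of log-likelihood scores.

    Assumption (not taken from the article): the raw score is used as the
    DCG gain, discounted by 1/log2(rank + 1) and divided by the ideal DCG.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)                    # rank 1 = highest score
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    discounted = scores / np.log2(ranks + 1)       # DCG term for each trial
    ideal = np.sort(scores)[::-1] / np.log2(np.arange(2, len(scores) + 2))
    idcg = ideal.sum()
    return discounted / idcg if idcg != 0 else discounted
```

    The normalized scores keep the original trial order, so they can be thresholded in place of the raw log-likelihoods when computing EER and t-DCF.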

    Advances in Subspace-based Solutions for Diarization in the Broadcast Domain

    The motivation of this thesis is the need for robust solutions to the diarization problem. These diarization techniques must add value to the growing amount of available multimedia data through accurate discrimination of the speakers present in the audio signal. Unfortunately, until recently this kind of technology was only viable under restricted conditions and thus remained far from a general solution. The reasons behind the limited performance of diarization systems are manifold. The first cause to consider is the high complexity of human voice production, in particular the physiological processes required to embed speaker-discriminative characteristics in the speech signal. This complexity makes the inverse process, estimating those characteristics from the audio, an inefficient task with current state-of-the-art techniques. Consequently, approximations must be used instead. Modelling efforts have produced increasingly elaborate models, although without seeking the ultimate physiological explanation of the speech signal; instead, these models learn relationships between acoustic signals from a large training set. The development of approximate models in turn gives rise to a second issue, domain variability. Because the relationships are learned from a specific training set, any domain change that modifies the acoustic conditions with respect to the training data undermines the assumed relationships and can cause consistent system failures.
    Our contribution to diarization technologies has focused on the broadcast domain. This domain is still a challenging environment for diarization systems, where no simplification of the task can be assumed. Efficient modelling of the audio must therefore be developed to extract the speaker information and to infer the corresponding labelling. Moreover, the presence of multiple acoustic conditions, due to the existence of different programmes and/or genres in the domain, requires techniques capable of adapting the knowledge acquired in a given scenario, where information is available, to those environments where such information is limited or simply unavailable.
    To this end, the work carried out throughout this thesis has focused on three subtasks: speaker characterization, clustering, and model adaptation. The first subtask seeks to model an audio fragment in order to obtain accurate representations of the speakers involved, highlighting their discriminative properties. In this area, a study of current modelling strategies has been carried out, paying particular attention to the limitations of the extracted representations and highlighting the types of errors they can produce. In addition, neural-network-based alternatives have been proposed, making use of the knowledge acquired. The second subtask is clustering, which develops strategies that seek the optimal labelling of the speakers. The research carried out during this thesis has proposed new strategies to estimate the best speaker partition based on subspace techniques, especially PLDA. Finally, the model adaptation subtask seeks to transfer the knowledge obtained from a training set to alternative domains where no data are available to extract it. For this purpose, efforts have focused on the unsupervised extraction of speaker information from the audio to be diarized itself, which is subsequently used to adapt the models involved.
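    The clustering stage described above can be sketched, under simplifying assumptions, as agglomerative clustering driven by PLDA log-likelihood-ratio scores between segment embeddings. The two-covariance PLDA model below assumes zero-mean embeddings, identity within-speaker covariance, and a diagonal between-speaker covariance psi; all function names are illustrative and do not correspond to the thesis code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from scipy.special import expit
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, psi):
    """Same-speaker vs. different-speaker log-likelihood ratio under a
    simplified two-covariance PLDA model (diagonal between-speaker
    covariance psi, identity within-speaker covariance, zero mean)."""
    d = len(x1)
    B, W = np.diag(psi), np.eye(d)
    pair = np.concatenate([x1, x2])
    same = np.block([[B + W, B], [B, B + W]])          # speaker factor shared
    diff = np.block([[B + W, np.zeros((d, d))],
                     [np.zeros((d, d)), B + W]])       # speakers independent
    return (multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=same)
            - multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=diff))

def cluster_segments(embeddings, psi, threshold=0.5):
    """Average-linkage agglomerative clustering of segment embeddings,
    using a sigmoid-mapped PLDA score as the pairwise distance."""
    n = len(embeddings)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            llr = plda_llr(embeddings[i], embeddings[j], psi)
            dist[i, j] = dist[j, i] = expit(-llr)      # high LLR -> small distance
    Z = linkage(squareform(dist), method="average")
    return fcluster(Z, t=threshold, criterion="distance")
```

    The returned cluster indices play the role of speaker labels; the stopping threshold is the kind of parameter that domain adaptation must retune when the acoustic conditions change.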

    Speaker tracking system using speaker boundary detection

    This thesis describes research conducted in the area of Speaker Recognition. The application concerns the automatic detection and tracking of target speakers in meetings, conferences, telephone conversations, and radio and television broadcasts. A Speaker Tracking system is developed here, in collaboration with the Center for Language and Speech Technologies and Applications (TALP) at UPC. The main objective of this Speaker Tracking system is to answer the question: when does the target speaker speak? The system uses training speech data for the target speaker in the pre-enrollment stage. Three main modules have been designed for this Speaker Tracking system. In the first module, an energy-based Speech Activity Detection is applied to select the speech parts of the audio. In the second module, the audio is segmented at the speaker turn changes. In the last module, Speaker Verification is performed, in which the target speakers are verified and tracked. Two different approaches are applied in this last module. In the first approach to Speaker Verification, the target speakers and the segments are modeled using state-of-the-art Gaussian Mixture Models (GMM). In the second approach, the identity vector (i-vector) representation is used for the target speakers and the segments. Finally, the performance of the two approaches is compared in the evaluation of results.
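    A minimal sketch of the first module, an energy-based speech activity detector, is given below. The frame length, hop, and threshold values are illustrative defaults rather than the settings used in the thesis.

```python
import numpy as np

def energy_sad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Mark a frame as speech when its log-energy lies within `threshold_db`
    dB of the loudest frame in the recording (illustrative rule only)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    log_energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_len: i * hop_len + frame_len]
        log_energy[i] = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    return log_energy > (log_energy.max() + threshold_db)  # boolean speech mask
```

    Downstream, only the frames flagged as speech would be passed to the segmentation and verification modules.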

    Factorized Sub-Space Estimation for Fast and Memory Effective I-vector Extraction

    Most state-of-the-art speaker recognition systems use a compact representation of spoken utterances referred to as the i-vector. Since the "standard" i-vector extraction procedure requires large memory structures and is relatively slow, new approaches have recently been proposed that either obtain accurate solutions at the expense of an increased computational load, or fast approximate solutions traded for lower memory costs. We propose a new approach particularly useful for applications that need to minimize their memory requirements. Our solution not only dramatically reduces the memory needed for i-vector extraction, but is also fast and accurate compared to recently proposed approaches. Tested on the female part of the tel-tel extended NIST 2010 evaluation trials, our approach substantially improves performance with respect to the fastest but inaccurate eigen-decomposition approach, using much less memory than other methods.
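    For reference, the "standard" extraction that such work seeks to accelerate computes the i-vector posterior mean from the Baum-Welch statistics of an utterance. A compact sketch of that baseline computation (not of the proposed factorized method) is shown below, with illustrative variable names.

```python
import numpy as np

def extract_ivector(T, sigma, N, F):
    """Baseline i-vector point estimate:
        w = (I + T' Sigma^-1 diag(N) T)^-1  T' Sigma^-1 F
    T     : (C*D, R) total-variability matrix (C components, D features, R dims)
    sigma : (C*D,)   diagonal of the UBM covariances, supervector layout
    N     : (C,)     zero-order Baum-Welch statistics per component
    F     : (C*D,)   centered first-order statistics, supervector layout
    """
    CD, R = T.shape
    D = CD // len(N)
    T_over_sigma = T / sigma[:, None]
    N_sv = np.repeat(N, D)                               # expand N to supervector size
    precision = np.eye(R) + T_over_sigma.T @ (N_sv[:, None] * T)
    return np.linalg.solve(precision, T_over_sigma.T @ F)
```

    The R x R precision matrix built per utterance, and the per-component R x R terms usually precomputed to speed it up, are precisely the memory cost that factorized sub-space estimation aims to reduce.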

    Genetic Programming for Multibiometrics

    Biometric systems suffer from some drawbacks: a biometric system generally provides good performance, except for some individuals, since its performance depends strongly on the quality of the capture. One way to address some of these problems is multibiometrics, where different biometric systems are combined (multiple captures of the same biometric modality, multiple feature extraction algorithms, multiple biometric modalities, ...). In this paper, we are interested in the application of score-level fusion functions (i.e., we use a multibiometric authentication scheme that accepts or denies the claimant access to an application). In the state of the art, the weighted sum of the scores provided by the different biometric systems (a linear classifier) and the use of an SVM (a non-linear classifier) give some of the best performances. We present a new method based on genetic programming that gives similar or better performance (depending on the complexity of the database). We derive a score fusion function by assembling classical primitive functions (+, *, -, ...). We have validated the proposed method on three significant biometric benchmark datasets from the state of the art.
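    A self-contained toy version of the idea is sketched below: candidate fusion functions are represented as expression trees over the arithmetic primitives mentioned above and evolved by mutation and elitist selection against a pairwise-AUC fitness. This is a simplified stand-in (no crossover, no bloat control) and all names are illustrative, not the authors' implementation.

```python
import random
import numpy as np

# Primitive set: the arithmetic operators mentioned in the abstract.
PRIMITIVES = {"+": np.add, "-": np.subtract, "*": np.multiply}

def random_tree(n_inputs, depth=3):
    """Grow a random expression tree over score inputs s0..s{n-1}."""
    if depth == 0 or random.random() < 0.3:
        return ("x", random.randrange(n_inputs))           # terminal: one system's score
    op = random.choice(list(PRIMITIVES))
    return (op, random_tree(n_inputs, depth - 1), random_tree(n_inputs, depth - 1))

def evaluate(tree, scores):
    """Apply the tree to a (n_samples, n_systems) score matrix."""
    if tree[0] == "x":
        return scores[:, tree[1]]
    return PRIMITIVES[tree[0]](evaluate(tree[1], scores), evaluate(tree[2], scores))

def mutate(tree, n_inputs):
    """Replace a random subtree with a freshly grown one."""
    if random.random() < 0.3 or tree[0] == "x":
        return random_tree(n_inputs, depth=2)
    return (tree[0], *[mutate(c, n_inputs) for c in tree[1:]])

def fitness(tree, scores, labels):
    """Pairwise AUC of the fused score (higher is better)."""
    fused = evaluate(tree, scores)
    pos, neg = fused[labels == 1], fused[labels == 0]
    return np.mean(pos[:, None] > neg[None, :])

def evolve(scores, labels, pop_size=50, generations=30):
    """Mutation-only evolutionary loop with elitist selection."""
    n_inputs = scores.shape[1]
    pop = [random_tree(n_inputs) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda t: fitness(t, scores, labels), reverse=True)
        elite = ranked[: pop_size // 5]
        pop = elite + [mutate(random.choice(elite), n_inputs)
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=lambda t: fitness(t, scores, labels))
```

    The evolved tree is a plain arithmetic expression over the per-system scores, so the resulting fusion rule remains inspectable, unlike an SVM decision function.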

    Scalable learning for geostatistics and speaker recognition

    With improved data acquisition methods, the amount of data being collected has increased severalfold. One of the objectives in data collection is to learn useful underlying patterns. In order to work with data at this scale, the methods not only need to be effective on the underlying data, but also have to be scalable to larger data collections. This thesis focuses on developing scalable and effective methods targeted at different domains, geostatistics and speaker recognition in particular. Initially we focus on kernel-based learning methods and develop a GPU-based parallel framework for this class of problems. An improved numerical algorithm that uses the GPU parallelization to further enhance the computational performance of kernel regression is proposed. These methods are then demonstrated on problems arising in geostatistics and speaker recognition. In geostatistics, data is often collected at scattered locations, and factors like instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid, continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations in order to be used practically. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is further exploited for enhanced computational performance. Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech, or "utterances". This thesis focuses on the text-independent framework, and three new recognition frameworks were developed for this problem. We propose a kernelized Rényi-distance-based similarity scoring for speaker recognition. While its performance is promising, it does not generalize well with limited training data and therefore does not compare well to state-of-the-art recognition systems. These systems compensate for the variability in the speech data due to the message, channel, noise, and reverberation. State-of-the-art systems model each speaker as a mixture of Gaussians (GMM) and compensate for the variability (termed "nuisance"). We propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm is used to achieve a state-of-the-art speaker ID system that shows results competitive with the best systems reported in NIST's 2010 Speaker Recognition Evaluation.
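    The kriging/Gaussian-process analogy mentioned above can be illustrated in a few lines: the posterior mean at unobserved grid locations is a kernel-weighted combination of the observations. The squared-exponential kernel and the jitter value below are illustrative choices, not the covariance models used in the thesis.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between two sets of locations."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_interpolate(X_obs, y_obs, X_grid, length_scale=1.0, noise=1e-6):
    """Simple-kriging / GP posterior mean at the grid points:
       mu = K(grid, obs) @ (K(obs, obs) + noise * I)^-1 @ y_obs"""
    K = rbf_kernel(X_obs, X_obs, length_scale) + noise * np.eye(len(X_obs))
    alpha = np.linalg.solve(K, y_obs)
    return rbf_kernel(X_grid, X_obs, length_scale) @ alpha
```

    The cubic-cost linear solve on the observation covariance is exactly the bottleneck that motivates the GPU-accelerated kernel framework described above.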