188 research outputs found

    Joint factor analysis for forensic automatic speaker recognition

    Final degree project carried out in collaboration with the Faculté Sciences et Techniques de l'Ingénieur, Institut de Traitement des Signaux. Nowadays, under controlled recording conditions, state-of-the-art automatic speaker recognition systems perform very well at discriminating between the voices of speakers. However, in investigative activities (e.g., anonymous calls and wire-tapping) the conditions in which recordings are made cannot be controlled and pose a challenge to automatic speaker recognition. Factors that introduce variability in the recordings include differences in the phone handset, the transmission channel and the recording devices. The strength of evidence, estimated using statistical models of within-source variability and between-sources variability, is expressed as a likelihood ratio, i.e., the ratio of the probabilities of observing the features of the questioned recording under two competing hypotheses: that the suspected speaker is the source of the questioned recording, and that the speaker at the origin of the questioned recording is not the suspected speaker. The main unresolved problem in forensic automatic speaker recognition today is that of handling mismatch in recording conditions. This mismatch has to be considered in the estimation of the likelihood ratio because it can introduce important errors. In this work, we use and analyze such a state-of-the-art system. A forensic automatic speaker recognition system consists of many parts, such as feature extraction and modeling. We have focused on the modeling part, training models that can be decomposed into two subspaces, the speaker subspace and the session subspace. This technique, called Joint Factor Analysis, is the state of the art in speaker verification systems. Using this decomposition into two subspaces, we try to solve the problem of mismatched conditions by adapting the session subspace of the training recordings to a new session subspace (recorded under different conditions). To estimate the speaker and session subspaces, we need several databases, e.g., one database containing the traces and another containing recordings from the suspect. These databases must be recorded in several conditions to simulate a real forensic case where mismatch is present. Examples of such recording conditions are cellular phones and the fixed telephone network. Finally, an evaluation of the system is presented at the end of the work. This evaluation shows which recording conditions degrade the results most, what effect the mismatch has on the results, and how much of that effect the adaptation can correct.
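
    The likelihood ratio described in this abstract can be illustrated with a minimal sketch. It is not the system from the project: it assumes a simplified score-based setting in which the within-source and between-sources variability of a similarity score are each modeled by a single Gaussian, and all names and numbers (`likelihood_ratio`, `within_scores`, `between_scores`, the sample scores) are hypothetical.

```python
import math
from statistics import mean, stdev


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood_ratio(evidence_score, within_scores, between_scores):
    """Ratio p(E | same source) / p(E | different source).

    Assumption: each score distribution is approximated by one Gaussian
    fitted to the sample scores; real systems use richer models.
    """
    numerator = gaussian_pdf(
        evidence_score, mean(within_scores), stdev(within_scores)
    )
    denominator = gaussian_pdf(
        evidence_score, mean(between_scores), stdev(between_scores)
    )
    return numerator / denominator


# Hypothetical scores: suspect recordings compared with the suspect model
# (within-source) and other speakers compared with the suspect model
# (between-sources).
within_scores = [2.1, 1.8, 2.4, 2.0, 1.9]
between_scores = [-1.0, -0.5, -1.4, -0.8, -1.2]

lr_supporting = likelihood_ratio(1.7, within_scores, between_scores)
lr_opposing = likelihood_ratio(-0.9, within_scores, between_scores)
print(lr_supporting > 1.0, lr_opposing < 1.0)
```

    A ratio above 1 supports the hypothesis that the suspected speaker is the source of the questioned recording, and a ratio below 1 supports the alternative. Mismatched recording conditions shift the score distributions, which is why the abstract stresses accounting for mismatch when estimating the ratio.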

    Speaker Recognition: Advancements and Challenges

    The Effect Of Acoustic Variability On Automatic Speaker Recognition Systems

    This thesis examines the influence of acoustic variability on automatic speaker recognition systems (ASRs) with three aims: (i) to measure ASR performance under five commonly encountered acoustic conditions; (ii) to contribute towards ASR system development with the provision of new research data; and (iii) to assess ASR suitability for forensic speaker comparison (FSC) application and investigative/pre-forensic use. The thesis begins with a literature review and explanation of relevant technical terms. Five categories of research experiments then examine ASR performance, reflecting conditions that influence speech quantity (inhibitors) and speech quality (contaminants), acknowledging that quality often influences quantity. The experiments pertain to net speech duration, signal-to-noise ratio (SNR), reverberation, frequency bandwidth and transcoding (codecs). The ASR system is placed under scrutiny with examination of settings and optimum conditions (e.g. matched/unmatched test audio and speaker models). Output is examined in relation to baseline performance, and metrics help inform whether ASRs should be applied to suboptimal audio recordings. Results indicate that modern ASRs are relatively resilient to low and moderate levels of the acoustic contaminants and inhibitors examined, whilst remaining sensitive to higher levels. The thesis discusses issues such as the complexity and fragility of the speech signal path, speaker variability, difficulty in measuring conditions, and mitigation (thresholds and settings). The application of ASRs to casework is discussed with recommendations, acknowledging the different modes of operation (e.g. investigative usage) and current UK limitations regarding presenting ASR output as evidence in criminal trials. In summary, and in the context of acoustic variability, the thesis recommends that ASRs could be applied to pre-forensic cases, accepting that extraneous issues remain which require governance, such as validation of method (ASR standardisation) and population data selection. However, ASRs remain unsuitable for broad forensic application, with many acoustic conditions causing irrecoverable speech data loss that contributes to high error rates.

    Session variability compensation in automatic speaker and language recognition

    Unpublished doctoral thesis. Universidad Autónoma de Madrid, Escuela Politécnica Superior, October 201

    Face comparison in forensics: A deep dive into deep learning and likelihood ratios

    This thesis explores the transformative potential of deep learning techniques in the field of forensic face recognition. It addresses the question of how deep learning can advance this traditionally manual field, focusing on three key areas: forensic face comparison, face image quality assessment, and likelihood ratio estimation. Using a comparative analysis of open-source automated systems and forensic experts, the study finds that automated systems excel in identifying non-matches in low-quality images, but lag behind experts in high-quality settings. The thesis also investigates the role of calibration methods in estimating likelihood ratios, revealing that quality score-based and feature-based calibrations are more effective than naive methods. To enhance face image quality assessment, a multi-task explainable quality network is proposed that not only gauges image quality, but also identifies contributing factors. Additionally, a novel images-to-video recognition method is introduced to improve the estimation of likelihood ratios in surveillance settings. The study employs multiple datasets and software systems for its evaluations, aiming for a comprehensive analysis that can serve as a cornerstone for future research in forensic face recognition.

    A non-linear polynomial approximation filter for robust speaker verification

    Bibliography: leaves 101-109

    Individual and environment-related acoustic-phonetic strategies for communicating in adverse conditions

    In many situations it is necessary to produce speech in ‘adverse conditions’: that is, conditions that make speech communication difficult. Research has demonstrated that speaker strategies, as described by a range of acoustic-phonetic measures, can vary both at the individual level and according to the environment, and are argued to facilitate communication. There has been debate as to the environmental specificity of these adaptations, and their effectiveness in overcoming communication difficulty. Furthermore, the manner and extent to which adaptation strategies differ between individuals is not yet well understood. This thesis presents three studies that explore the acoustic-phonetic adaptations of speakers in noisy and degraded communication conditions and their relationship with intelligibility. Study 1 investigated the effects of temporally fluctuating maskers on global acoustic-phonetic measures associated with speech in noise (Lombard speech). The results replicated findings of increased power in the modulation spectrum in Lombard speech, but showed little evidence of adaptation to masker fluctuations via the temporal envelope. Study 2 collected a larger corpus of semi-spontaneous communicative speech in noise and other degradations perturbing specific acoustic dimensions. Speakers showed different adaptations across the environments that were likely suited to overcome noise (steady and temporally fluctuating), restricted spectral and pitch information by a noise-excited vocoder, and a sensorineural hearing loss simulation. Analyses of inter-speaker variation in both studies 1 and 2 showed behaviour was highly variable and some strategy combinations were identified. Study 3 investigated the intelligibility of strategies ‘tailored’ to specific environments and the relationship between intelligibility and speaker acoustics, finding a benefit of tailored speech adaptations and discussing the potential roles of speaker flexibility, adaptation level, and intrinsic intelligibility. The overall results are discussed in relation to models of communication in adverse conditions and a model accounting for individual variability in these conditions is proposed.

    Individual Differences in Speech Production and Perception

    Inter-individual variation in speech is a topic of increasing interest both in human sciences and speech technology. It can yield important insights into biological, cognitive, communicative, and social aspects of language. Written by specialists in psycholinguistics, phonetics, speech development, speech perception and speech technology, this volume presents experimental and modeling studies that provide the reader with a deep understanding of interspeaker variability and its role in speech processing, speech development, and interspeaker interactions. It discusses how theoretical models take into account individual behavior, explains why interspeaker variability enriches speech communication, and summarizes the limitations of the use of speaker information in forensics.