9 research outputs found

    Automatic Conversion of Emotions in Speech within a Speaker Independent Framework

    Get PDF
    Emotions in speech are a fundamental part of a natural dialog. In everyday life, vocal interaction with people often implies emotions as an intrinsic part of the conversation to a greater or lesser extent. Thus, the inclusion of emotions in human-machine dialog systems is crucial to achieve an acceptable degree of naturalness in the communication. This thesis focuses on automatic emotion conversion of speech, a technique whose aim is to transform an utterance produced in neutral style to a certain emotion state in a speaker independent context. Conversion of emotions represents a challenge in the sense that emotions a affect significantly all the parts of the human vocal production system, and in the conversion process all these factors must be taken into account carefully. The techniques used in the literature are based on voice conversion approaches, with minor modifications to create the sensation of emotion. In this thesis, the idea of voice conversion systems is used as well, but the usual regression process is divided in a two-step procedure that provides additional speaker normalization to remove the intrinsic speaker dependency of this kind of systems, using vocal tract length normalization as a pre-processing technique. In addition, a new method to convert the duration trend of the utterance and the intonation contour is proposed, taking into account the contextual information

    Efficient Approaches for Voice Change and Voice Conversion Systems

    Get PDF
    In this thesis, the study and design of Voice Change and Voice Conversion systems are presented. Particularly, a voice change system manipulates a speaker’s voice to be perceived as it is not spoken by this speaker; and voice conversion system modifies a speaker’s voice, such that it is perceived as being spoken by a target speaker. This thesis mainly includes two sub-parts. The first part is to develop a low latency and low complexity voice change system (i.e. includes frequency/pitch scale modification and formant scale modification algorithms), which can be executed on the smartphones in 2012 with very limited computational capability. Although some low-complexity voice change algorithms have been proposed and studied, the real-time implementations are very rare. According to the experimental results, the proposed voice change system achieves the same quality as the baseline approach but requires much less computational complexity and satisfies the requirement of real-time. Moreover, the proposed system has been implemented in C language and was released as a commercial software application. The second part of this thesis is to investigate a novel low-complexity voice conversion system (i.e. from a source speaker A to a target speaker B) that improves the perceptual quality and identity without introducing large processing latencies. The proposed scheme directly manipulates the spectrum using an effective and physically motivated method – Continuous Frequency Warping and Magnitude Scaling (CFWMS) to guarantee high perceptual naturalness and quality. In addition, a trajectory limitation strategy is proposed to prevent the frame-by-frame discontinuity to further enhance the speech quality. The experimental results show that the proposed method outperforms the conventional baseline solutions in terms of either objective tests or subjective tests

    Vulnerability assessment in the use of biometrics in unsupervised environments

    Get PDF
    Mención Internacional en el título de doctorIn the last few decades, we have witnessed a large-scale deployment of biometric systems in different life applications replacing the traditional recognition methods such as passwords and tokens. We approached a time where we use biometric systems in our daily life. On a personal scale, the authentication to our electronic devices (smartphones, tablets, laptops, etc.) utilizes biometric characteristics to provide access permission. Moreover, we access our bank accounts, perform various types of payments and transactions using the biometric sensors integrated into our devices. On the other hand, different organizations, companies, and institutions use biometric-based solutions for access control. On the national scale, police authorities and border control measures use biometric recognition devices for individual identification and verification purposes. Therefore, biometric systems are relied upon to provide a secured recognition where only the genuine user can be recognized as being himself. Moreover, the biometric system should ensure that an individual cannot be identified as someone else. In the literature, there are a surprising number of experiments that show the possibility of stealing someone’s biometric characteristics and use it to create an artificial biometric trait that can be used by an attacker to claim the identity of the genuine user. There were also real cases of people who successfully fooled the biometric recognition system in airports and smartphones [1]–[3]. That urges the necessity to investigate the potential threats and propose countermeasures that ensure high levels of security and user convenience. Consequently, performing security evaluations is vital to identify: (1) the security flaws in biometric systems, (2) the possible threats that may target the defined flaws, and (3) measurements that describe the technical competence of the biometric system security. Identifying the system vulnerabilities leads to proposing adequate security solutions that assist in achieving higher integrity. This thesis aims to investigate the vulnerability of fingerprint modality to presentation attacks in unsupervised environments, then implement mechanisms to detect those attacks and avoid the misuse of the system. To achieve these objectives, the thesis is carried out in the following three phases. In the first phase, the generic biometric system scheme is studied by analyzing the vulnerable points with special attention to the vulnerability to presentation attacks. The study reviews the literature in presentation attack and the corresponding solutions, i.e. presentation attack detection mechanisms, for six biometric modalities: fingerprint, face, iris, vascular, handwritten signature, and voice. Moreover, it provides a new taxonomy for presentation attack detection mechanisms. The proposed taxonomy helps to comprehend the issue of presentation attacks and how the literature tried to address it. The taxonomy represents a starting point to initialize new investigations that propose novel presentation attack detection mechanisms. In the second phase, an evaluation methodology is developed from two sources: (1) the ISO/IEC 30107 standard, and (2) the Common Evaluation Methodology by the Common Criteria. The developed methodology characterizes two main aspects of the presentation attack detection mechanism: (1) the resistance of the mechanism to presentation attacks, and (2) the corresponding threat of the studied attack. The first part is conducted by showing the mechanism's technical capabilities and how it influences the security and ease-of-use of the biometric system. The second part is done by performing a vulnerability assessment considering all the factors that affect the attack potential. Finally, a data collection is carried out, including 7128 fingerprint videos of bona fide and attack presentation. The data is collected using two sensing technologies, two presentation scenarios, and considering seven attack species. The database is used to develop dynamic presentation attack detection mechanisms that exploit the fingerprint spatio-temporal features. In the final phase, a set of novel presentation attack detection mechanisms is developed exploiting the dynamic features caused by the natural fingerprint phenomena such as perspiration and elasticity. The evaluation results show an efficient capability to detect attacks where, in some configurations, the mechanisms are capable of eliminating some attack species and mitigating the rest of the species while keeping the user convenience at a high level.En las últimas décadas, hemos asistido a un despliegue a gran escala de los sistemas biométricos en diferentes aplicaciones de la vida cotidiana, sustituyendo a los métodos de reconocimiento tradicionales, como las contraseñas y los tokens. Actualmente los sistemas biométricos ya forman parte de nuestra vida cotidiana: es habitual emplear estos sistemas para que nos proporcionen acceso a nuestros dispositivos electrónicos (teléfonos inteligentes, tabletas, ordenadores portátiles, etc.) usando nuestras características biométricas. Además, accedemos a nuestras cuentas bancarias, realizamos diversos tipos de pagos y transacciones utilizando los sensores biométricos integrados en nuestros dispositivos. Por otra parte, diferentes organizaciones, empresas e instituciones utilizan soluciones basadas en la biometría para el control de acceso. A escala nacional, las autoridades policiales y de control fronterizo utilizan dispositivos de reconocimiento biométrico con fines de identificación y verificación individual. Por lo tanto, en todas estas aplicaciones se confía en que los sistemas biométricos proporcionen un reconocimiento seguro en el que solo el usuario genuino pueda ser reconocido como tal. Además, el sistema biométrico debe garantizar que un individuo no pueda ser identificado como otra persona. En el estado del arte, hay un número sorprendente de experimentos que muestran la posibilidad de robar las características biométricas de alguien, y utilizarlas para crear un rasgo biométrico artificial que puede ser utilizado por un atacante con el fin de reclamar la identidad del usuario genuino. También se han dado casos reales de personas que lograron engañar al sistema de reconocimiento biométrico en aeropuertos y teléfonos inteligentes [1]–[3]. Esto hace que sea necesario investigar estas posibles amenazas y proponer contramedidas que garanticen altos niveles de seguridad y comodidad para el usuario. En consecuencia, es vital la realización de evaluaciones de seguridad para identificar (1) los fallos de seguridad de los sistemas biométricos, (2) las posibles amenazas que pueden explotar estos fallos, y (3) las medidas que aumentan la seguridad del sistema biométrico reduciendo estas amenazas. La identificación de las vulnerabilidades del sistema lleva a proponer soluciones de seguridad adecuadas que ayuden a conseguir una mayor integridad. Esta tesis tiene como objetivo investigar la vulnerabilidad en los sistemas de modalidad de huella dactilar a los ataques de presentación en entornos no supervisados, para luego implementar mecanismos que permitan detectar dichos ataques y evitar el mal uso del sistema. Para lograr estos objetivos, la tesis se desarrolla en las siguientes tres fases. En la primera fase, se estudia el esquema del sistema biométrico genérico analizando sus puntos vulnerables con especial atención a los ataques de presentación. El estudio revisa la literatura sobre ataques de presentación y las soluciones correspondientes, es decir, los mecanismos de detección de ataques de presentación, para seis modalidades biométricas: huella dactilar, rostro, iris, vascular, firma manuscrita y voz. Además, se proporciona una nueva taxonomía para los mecanismos de detección de ataques de presentación. La taxonomía propuesta ayuda a comprender el problema de los ataques de presentación y la forma en que la literatura ha tratado de abordarlo. Esta taxonomía presenta un punto de partida para iniciar nuevas investigaciones que propongan novedosos mecanismos de detección de ataques de presentación. En la segunda fase, se desarrolla una metodología de evaluación a partir de dos fuentes: (1) la norma ISO/IEC 30107, y (2) Common Evaluation Methodology por el Common Criteria. La metodología desarrollada considera dos aspectos importantes del mecanismo de detección de ataques de presentación (1) la resistencia del mecanismo a los ataques de presentación, y (2) la correspondiente amenaza del ataque estudiado. Para el primer punto, se han de señalar las capacidades técnicas del mecanismo y cómo influyen en la seguridad y la facilidad de uso del sistema biométrico. Para el segundo aspecto se debe llevar a cabo una evaluación de la vulnerabilidad, teniendo en cuenta todos los factores que afectan al potencial de ataque. Por último, siguiendo esta metodología, se lleva a cabo una recogida de datos que incluye 7128 vídeos de huellas dactilares genuinas y de presentación de ataques. Los datos se recogen utilizando dos tecnologías de sensor, dos escenarios de presentación y considerando siete tipos de instrumentos de ataque. La base de datos se utiliza para desarrollar y evaluar mecanismos dinámicos de detección de ataques de presentación que explotan las características espacio-temporales de las huellas dactilares. En la fase final, se desarrolla un conjunto de mecanismos novedosos de detección de ataques de presentación que explotan las características dinámicas causadas por los fenómenos naturales de las huellas dactilares, como la transpiración y la elasticidad. Los resultados de la evaluación muestran una capacidad eficiente de detección de ataques en la que, en algunas configuraciones, los mecanismos son capaces de eliminar completamente algunos tipos de instrumentos de ataque y mitigar el resto de los tipos manteniendo la comodidad del usuario en un nivel alto.Programa de Doctorado en Ingeniería Eléctrica, Electrónica y Automática por la Universidad Carlos III de MadridPresidente: Cristina Conde Vila.- Secretario: Mariano López García.- Vocal: Farzin Derav

    Utilización de la fase armónica en la detección de voz sintética.

    Get PDF
    156 p.Los sistemas de verificación de locutor (SV) tienen que enfrentarse a la posibilidad de ser atacados mediante técnicas de spoofing. Hoy en día, las tecnologías de conversión de voces y de síntesis de voz adaptada a locutor han avanzado lo suficiente para poder crear voces que sean capaces de engañar a un sistema SV. En esta tesis se propone un módulo de detección de habla sintética (SSD) que puede utilizarse como complemento a un sistema SV, pero que es capaz de funcionar de manera independiente. Lo conforma un clasificador basado en GMM, dotado de modelos de habla humana y sintética. Cada entrada se compara con ambos, y, si la diferencia de verosimilitudes supera un determinado umbral, se acepta como humana, rechazándose en caso contrario. El sistema desarrollado es independiente de locutor. Para la generación de modelos se utilizarán parámetros RPS. Se propone una técnica para reducir la complejidad del proceso de entrenamiento, evitando generar TTSs adaptados o un conversor de voz para cada locutor. Para ello, como la mayoría de los sistemas de adaptación o síntesis modernos hacen uso de vocoders, se propone transcodificar las señales humanas mediante vocoders para obtener de esta forma sus versiones sintéticas, con las que se generarán los modelos sintéticos del clasificador. Se demostrará que se pueden detectar señales sintéticas detectando que se crearon mediante un vocoder. El rendimiento del sistema prueba en diferentes condiciones: con las propias señales transcodificadas o con ataques TTS. Por último, se plantean estrategias para el entrenamiento de modelos para sistemas SSD

    Articulatory-based Speech Processing Methods for Foreign Accent Conversion

    Get PDF
    The objective of this dissertation is to develop speech processing methods that enable without altering their identity. We envision accent conversion primarily as a tool for pronunciation training, allowing non-native speakers to hear their native-accented selves. With this application in mind, we present two methods of accent conversion. The first assumes that the voice quality/identity of speech resides in the glottal excitation, while the linguistic content is contained in the vocal tract transfer function. Accent conversion is achieved by convolving the glottal excitation of a non-native speaker with the vocal tract transfer function of a native speaker. The result is perceived as 60 percent less accented, but it is no longer identified as the same individual. The second method of accent conversion selects segments of speech from a corpus of non-native speech based on their acoustic or articulatory similarity to segments from a native speaker. We predict that articulatory features provide a more speaker-independent representation of speech and are therefore better gauges of linguistic similarity across speakers. To test this hypothesis, we collected a custom database containing simultaneous recordings of speech and the positions of important articulators (e.g. lips, jaw, tongue) for a native and non-native speaker. Resequencing speech from a non-native speaker based on articulatory similarity with a native speaker achieved a 20 percent reduction in accent. The approach is particularly appealing for applications in pronunciation training because it modifies speech in a way that produces realistically achievable changes in accent (i.e., since the technique uses sounds already produced by the non-native speaker). A second contribution of this dissertation is the development of subjective and objective measures to assess the performance of accent conversion systems. This is a difficult problem because, in most cases, no ground truth exists. Subjective evaluation is further complicated by the interconnected relationship between accent and identity, but modifications of the stimuli (i.e. reverse speech and voice disguises) allow the two components to be separated. Algorithms to measure objectively accent, quality, and identity are shown to correlate well with their subjective counterparts

    Replay detection in voice biometrics: an investigation of adaptive and non-adaptive front-ends

    Full text link
    Among various physiological and behavioural traits, speech has gained popularity as an effective mode of biometric authentication. Even though they are gaining popularity, automatic speaker verification systems are vulnerable to malicious attacks, known as spoofing attacks. Among various types of spoofing attacks, replay attack poses the biggest threat due to its simplicity and effectiveness. This thesis investigates the importance of 1) improving front-end feature extraction via novel feature extraction techniques and 2) enhancing spectral components via adaptive front-end frameworks to improve replay attack detection. This thesis initially focuses on AM-FM modelling techniques and their use in replay attack detection. A novel method to extract the sub-band frequency modulation (FM) component using the spectral centroid of a signal is proposed, and its use as a potential acoustic feature is also discussed. Frequency Domain Linear Prediction (FDLP) is explored as a method to obtain the temporal envelope of a speech signal. The temporal envelope carries amplitude modulation (AM) information of speech resonances. Several features are extracted from the temporal envelope and the FDLP residual signal. These features are then evaluated for replay attack detection and shown to have significant capability in discriminating genuine and spoofed signals. Fusion of AM and FM-based features has shown that AM and FM carry complementary information that helps distinguish replayed signals from genuine ones. The importance of frequency band allocation when creating filter banks is studied as well to further advance the understanding of front-ends for replay attack detection. Mechanisms inspired by the human auditory system that makes the human ear an excellent spectrum analyser have been investigated and integrated into front-ends. Spatial differentiation, a mechanism that provides additional sharpening to auditory filters is one of them that is used in this work to improve the selectivity of the sub-band decomposition filters. Two features are extracted using the improved filter bank front-end: spectral envelope centroid magnitude (SECM) and spectral envelope centroid frequency (SECF). These are used to establish the positive effect of spatial differentiation on discriminating spoofed signals. Level-dependent filter tuning, which allows the ear to handle a large dynamic range, is integrated into the filter bank to further improve the front-end. This mechanism converts the filter bank into an adaptive one where the selectivity of the filters is varied based on the input signal energy. Experimental results show that this leads to improved spoofing detection performance. Finally, deep neural network (DNN) mechanisms are integrated into sub-band feature extraction to develop an adaptive front-end that adjusts its characteristics based on the sub-band signals. A DNN-based controller that takes sub-band FM components as input, is developed to adaptively control the selectivity and sensitivity of a parallel filter bank to enhance the artifacts that differentiate a replayed signal from a genuine signal. This work illustrates gradient-based optimization of a DNN-based controller using the feedback from a spoofing detection back-end classifier, thus training it to reduce spoofing detection error. The proposed framework has displayed a superior ability in identifying high-quality replayed signals compared to conventional non-adaptive frameworks. All techniques proposed in this thesis have been evaluated on well-established databases on replay attack detection and compared with state-of-the-art baseline systems
    corecore