19 research outputs found

    Analysis and detection of human emotion and stress from speech signals

    Ph.D. (Doctor of Philosophy)

    Subband modeling for spoofing detection in automatic speaker verification

    Spectrograms (time-frequency representations of audio signals) have found widespread use in neural network-based spoofing detection. While deep models are trained on the fullband spectrum of the signal, we argue that not all frequency bands are useful for these tasks. In this paper, we systematically investigate the impact of different subbands and their importance for replay spoofing detection on two benchmark datasets: ASVspoof 2017 v2.0 and ASVspoof 2019 PA. We propose a joint subband modelling framework that employs n different sub-networks to learn subband-specific features. These features are later combined and passed to a classifier, and the weights of the whole network are updated during training. Our findings on the ASVspoof 2017 dataset suggest that the most discriminative information appears to be in the first and the last 1 kHz frequency bands, and the joint model trained on these two subbands shows the best performance, outperforming the baselines by a large margin. However, these findings do not generalise to the ASVspoof 2019 PA dataset. This suggests that the datasets available for training these models do not reflect real-world replay conditions, and that datasets for training replay spoofing countermeasures need to be designed carefully.
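
As a rough illustration of the subband idea in the abstract above, the sketch below slices the first and last 1 kHz out of a one-sided spectrogram before any sub-network would see it. It is not the authors' implementation: the helper names `band_bin_range` and `select_subbands`, the 16 kHz sample rate, and the 257-bin (512-point FFT) layout are all assumptions made for the example, and the sub-networks themselves are omitted.

```python
def band_bin_range(n_bins, sample_rate, lo_hz, hi_hz):
    """Return (start, stop) bin indices covering lo_hz..hi_hz.

    Assumes a one-sided spectrum whose n_bins bins span 0..sample_rate/2.
    """
    hz_per_bin = (sample_rate / 2) / (n_bins - 1)
    start = int(round(lo_hz / hz_per_bin))
    stop = int(round(hi_hz / hz_per_bin)) + 1  # stop is exclusive
    return start, min(stop, n_bins)


def select_subbands(spectrogram, sample_rate, bands):
    """Slice a spectrogram (list of frequency rows) into the given bands."""
    n_bins = len(spectrogram)
    slices = []
    for lo_hz, hi_hz in bands:
        start, stop = band_bin_range(n_bins, sample_rate, lo_hz, hi_hz)
        slices.append(spectrogram[start:stop])
    return slices


# Toy spectrogram: 257 frequency bins x 3 frames, each row filled with its bin index.
toy_spectrogram = [[float(i)] * 3 for i in range(257)]

# Hypothetical setting: 16 kHz audio, so the "last 1 kHz" is 7-8 kHz.
low_band, high_band = select_subbands(
    toy_spectrogram, 16000, [(0, 1000), (7000, 8000)]
)
print(len(low_band), len(high_band))
```

In a joint model of the kind described, each slice would feed its own sub-network and the learned features would be concatenated before the classifier.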

    Acoustic anomaly detection using robust statistical energy processing

    An anomaly is the specific event that causes the violation of a process observer's expectations about the process under observation. In this work, the problem of spatially locating an acoustic anomaly is addressed. Once the problem is reduced to one in robust statistics, an automated observer is designed to detect when high-energy sources are introduced into an acoustic scene. Accounting for potential energy from signal amplitude and kinetic energy from signal frequency in wavelet-filtered sub-bands, an outlier-robust statistical characterization scheme was developed using the Teager energy operator. With a statistical expectation of energy content in sub-bands, a methodology is designed to detect signal energies that violate the statistical expectation. These minor anomalies provide some sense that a fundamental change in energy has occurred in the sub-band. By examining how the signal is changing across all sub-bands, a detector is designed that is able to determine when a fundamental change occurs in the sub-band signal trends. Minor anomalies occurring during such changes are labeled as major anomalies. Using established localization methods, position estimates are obtained for the major anomalies in each sub-band. Accounting for the possibility of a source with spatiotemporal properties, the median of sub-band position estimates provides the final spatial information about the source.
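
The Teager energy operator named in the abstract above has a standard discrete form, psi[n] = x[n]^2 - x[n-1] * x[n+1], which for a pure tone A*cos(w*n) evaluates to A^2 * sin^2(w) and so reflects both amplitude and frequency. The sketch below pairs it with a simple median/MAD threshold as one possible robust outlier rule. The threshold rule, the factor `k`, and the function names are assumptions for illustration; the thesis's actual characterization scheme and its wavelet sub-band filtering are not reproduced here.

```python
import math
from statistics import median


def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def flag_outliers(energies, k=5.0):
    """Indices whose energy exceeds median + k * MAD (an assumed robust rule)."""
    med = median(energies)
    mad = median([abs(e - med) for e in energies])
    threshold = med + k * mad
    return [i for i, e in enumerate(energies) if e > threshold]


# A pure tone with amplitude 2 and digital frequency 0.3 rad/sample:
# every Teager energy value should be close to 4 * sin(0.3)^2.
tone = [2.0 * math.cos(0.3 * n) for n in range(50)]
tone_energy = teager_energy(tone)

# Hand-made energy track with one high-energy sample at index 5.
flagged = flag_outliers([1.0, 1.2, 0.9, 1.1, 1.0, 9.0, 1.05])
print(flagged)
```

Using the median and the median absolute deviation keeps the threshold itself from being dragged upward by the very outliers it is meant to catch.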

    Replay detection in voice biometrics: an investigation of adaptive and non-adaptive front-ends

    Among various physiological and behavioural traits, speech has gained popularity as an effective mode of biometric authentication. Despite this growing popularity, automatic speaker verification systems are vulnerable to malicious attacks known as spoofing attacks. Among the various types of spoofing attack, the replay attack poses the biggest threat due to its simplicity and effectiveness. This thesis investigates the importance of 1) improving front-end feature extraction via novel feature extraction techniques and 2) enhancing spectral components via adaptive front-end frameworks to improve replay attack detection. This thesis initially focuses on AM-FM modelling techniques and their use in replay attack detection. A novel method to extract the sub-band frequency modulation (FM) component using the spectral centroid of a signal is proposed, and its use as a potential acoustic feature is also discussed. Frequency Domain Linear Prediction (FDLP) is explored as a method to obtain the temporal envelope of a speech signal. The temporal envelope carries amplitude modulation (AM) information of speech resonances. Several features are extracted from the temporal envelope and the FDLP residual signal. These features are then evaluated for replay attack detection and shown to have significant capability in discriminating genuine and spoofed signals. Fusion of AM and FM-based features has shown that AM and FM carry complementary information that helps distinguish replayed signals from genuine ones. The importance of frequency band allocation when creating filter banks is also studied to further advance the understanding of front-ends for replay attack detection. Mechanisms inspired by the human auditory system, which make the human ear an excellent spectrum analyser, have been investigated and integrated into front-ends. Spatial differentiation, a mechanism that provides additional sharpening to auditory filters, is one of them, and it is used in this work to improve the selectivity of the sub-band decomposition filters. Two features are extracted using the improved filter bank front-end: spectral envelope centroid magnitude (SECM) and spectral envelope centroid frequency (SECF). These are used to establish the positive effect of spatial differentiation on discriminating spoofed signals. Level-dependent filter tuning, which allows the ear to handle a large dynamic range, is integrated into the filter bank to further improve the front-end. This mechanism converts the filter bank into an adaptive one where the selectivity of the filters is varied based on the input signal energy. Experimental results show that this leads to improved spoofing detection performance. Finally, deep neural network (DNN) mechanisms are integrated into sub-band feature extraction to develop an adaptive front-end that adjusts its characteristics based on the sub-band signals. A DNN-based controller that takes sub-band FM components as input is developed to adaptively control the selectivity and sensitivity of a parallel filter bank to enhance the artifacts that differentiate a replayed signal from a genuine signal. This work illustrates gradient-based optimization of a DNN-based controller using the feedback from a spoofing detection back-end classifier, thus training it to reduce spoofing detection error. The proposed framework has displayed a superior ability in identifying high-quality replayed signals compared to conventional non-adaptive frameworks. All techniques proposed in this thesis have been evaluated on well-established databases on replay attack detection and compared with state-of-the-art baseline systems.

    Voice biometric system security: Design and analysis of countermeasures for replay attacks.

    PhD Thesis. Voice biometric systems use automatic speaker verification (ASV) technology for user authentication. Even if it is among the most convenient means of biometric authentication, the robustness and security of ASV in the face of spoofing attacks (or presentation attacks) is of growing concern and is now well acknowledged by the research community. A spoofing attack involves illegitimate access to personal data of a targeted user. Replay is among the simplest attacks to mount, yet difficult to detect reliably, and is the focus of this thesis. This research focuses on the analysis and design of existing and novel countermeasures for replay attack detection in ASV, organised in two major parts. The first part of the thesis investigates existing methods for spoofing detection from several perspectives. I first study the generalisability of hand-crafted features for replay detection that show promising results on synthetic speech detection. I find, however, that it is difficult to achieve similar levels of performance due to the acoustically different problem under investigation. In addition, I show how class-dependent cues in a benchmark dataset (ASVspoof 2017) can lead to the manipulation of class predictions. I then analyse the performance of several countermeasure models under varied replay attack conditions. I find that it is difficult to account for the effects of various factors in a replay attack: acoustic environment, playback device and recording device, and their interactions. Subsequently, I developed and studied a convolutional neural network (CNN) model that demonstrates comparable performance to the one that ranked first in the ASVspoof 2017 challenge. Here, the experiment analyses what the CNN has learned for replay detection using a method from interpretable machine learning. The findings suggest that the model highly attends at the first few milliseconds of test recordings in order to make predictions. Then, I perform an in-depth analysis of a benchmark dataset (ASVspoof 2017) for spoofing detection and demonstrate that any machine learning countermeasure model can still exploit the artefacts I identified in this dataset. The second part of the thesis studies the design of countermeasures for ASV, focusing on model robustness and avoiding dataset biases. First, I proposed an ensemble model combining shallow and deep machine learning methods for spoofing detection, and then demonstrate its effectiveness on the latest benchmark datasets (ASVspoof 2019). Next, I proposed the use of speech endpoint detection for reliable and robust model predictions on the ASVspoof 2017 dataset. For this, I created a publicly available collection of hand-annotations of speech endpoints for the same dataset, and new benchmark results for both frame-based and utterance-based countermeasures are also developed. I then proposed spectral subband modelling using CNNs for replay detection. My results indicate that models that learn subband-specific information substantially outperform models trained on complete spectrograms. Finally, I proposed to use variational autoencoders, deep unsupervised generative models, as an alternative backend for spoofing detection and demonstrate encouraging results when compared with the traditional Gaussian mixture model.

    Models and analysis of vocal emissions for biomedical applications

    This book of Proceedings collects the papers presented at the 4th International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2005, held 29-31 October 2005, Firenze, Italy. The workshop is organised every two years, and aims to stimulate contacts between specialists active in research and industrial developments, in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies

    Characterization of the impact of acoustic events on equivalent sound levels and on citizens' perception for the creation of dynamic noise maps

    Noise pollution has become a serious public health problem, causing various types of disease and disorders in people. According to the World Health Organisation, one million years of healthy life are lost in Western Europe every year due to exposure to environmental noise. In order to evaluate and manage environmental noise in the European Union, Directive END 2002/49/EC requires Member States to prepare and publish updated noise maps and the related action plans every five years. This includes agglomerations of more than 100,000 inhabitants and major roads, railways and airports. Thanks to recent technological advances, the noise map creation paradigm has changed substantially, allowing noise level measurements to be automated using wireless acoustic sensor networks for real-time noise map generation. However, these networks cannot prevent a series of situations that would bias the actual measurement of equivalent sound levels, so that the map is not faithful to the reality perceived by citizens, e.g., the sound of birds, industry, car horns, sirens, conversations that occur near the sensors, or weather phenomena such as rain and wind. This thesis studies the characterization of acoustic events for the creation of dynamic traffic noise maps. The study begins by presenting the context of the thesis, the LIFE DYNAMAP project, which aims to measure traffic noise levels in two pilot areas and dynamically integrate them into a noise map that is updated in real time. After that, a detailed analysis is presented of the events found in the two areas, urban and suburban, and various characterizations are applied to them. One of the presented measures is the impact on the equivalent sound level (Leq), which allows the measurement of the bias caused by the presence of certain acoustic events in the creation of traffic noise maps. The use of perceptual tests with psychoacoustic metrics is also proposed in order to adapt the characterization of these events to citizen perception. The main purpose of the thesis is to characterize the events of urban and suburban environments in order to offer noise maps that are more faithful to the reality perceived by citizens in relation to the soundscape in which they find themselves. The thesis also shows the importance of sound detection in a network of acoustic sensors in order to prevent measurement errors in the equivalent levels, and the need to train the detection system with data obtained from the same sensors of the network.
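
To make the Leq impact measure mentioned in the abstract above concrete, the sketch below uses the standard energy-average definition of the equivalent level for equally spaced short-term levels, Leq = 10 * log10(mean(10^(L_i / 10))), and compares the Leq of a whole interval with the Leq of the same interval after the event samples are excluded. The function names and the toy numbers (59 seconds of traffic at 60 dB plus one second of a 90 dB siren) are assumptions for illustration; the thesis may define its impact measure differently.

```python
import math


def leq(levels_db):
    """Equivalent level (dB) of equally spaced short-term levels."""
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)


def event_impact(levels_db, is_event):
    """Leq of the full interval minus Leq of the samples not marked as events."""
    background = [lvl for lvl, flag in zip(levels_db, is_event) if not flag]
    return leq(levels_db) - leq(background)


# Hypothetical one-minute interval: 59 s of traffic at 60 dB, 1 s of siren at 90 dB.
levels = [60.0] * 59 + [90.0]
flags = [False] * 59 + [True]

print(round(leq(levels), 2), round(event_impact(levels, flags), 2))
```

Because the average is taken over energies rather than decibel values, a single loud second raises the one-minute Leq by roughly 12 dB here, which is the kind of bias a traffic noise map would need to exclude.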