
    Non-Negative Matrix Factorization Based Algorithms to Cluster Frequency Basis Functions for Monaural Sound Source Separation.

    Monophonic sound source separation (SSS) refers to the process of separating out the audio signals produced by the individual sound sources in an acoustic mixture, when the mixture is recorded with a single microphone or directly onto a single reproduction channel. Many audio applications, such as pitch modification and automatic music transcription, would benefit from access to the segregated sound sources for further processing. Recently, non-negative matrix factorization (NMF) has found application in monaural audio source separation due to its ability to factorize audio spectrograms into additive part-based basis functions, where the parts typically correspond to individual notes or chords in music. An advantage of NMF is that there can be a single basis function for each note played by a given instrument, thereby capturing changes in timbre with pitch for each instrument or source. However, these basis functions need to be clustered to their respective sources before the individual source signals can be reconstructed. Many clustering methods have been proposed to map the separated signals to their sources, with considerable success. Recently, to avoid the need for clustering, Shifted NMF (SNMF) was proposed, which assumes that the timbre of a note is constant across all the pitches produced by an instrument. SNMF has two drawbacks. Firstly, the assumption that the timbre of the notes played by an instrument remains constant is not true in general. Secondly, the SNMF method uses the Constant Q Transform (CQT), and the lack of a true inverse of the CQT compromises the separation quality of the reconstructed signal. The principal aim of this thesis is to solve the problem of clustering NMF basis functions. Our first major contribution is the use of SNMF as a method of clustering the basis functions obtained via standard NMF.
The proposed SNMF clustering method aims to cluster the frequency basis functions obtained via standard NMF to their respective sources by exploiting shift invariance in a log-frequency domain. A further, minor contribution improves the separation performance of the standard SNMF algorithm (here used directly to separate sources) through the use of an improved inverse CQT; in this setting, the standard SNMF algorithm finds shift invariance in a CQ spectrogram, containing the frequency basis functions, obtained directly from the spectrogram of the audio mixture. Our next contribution is an improvement to the SNMF clustering algorithm that incorporates the CQT matrix inside the SNMF model, avoiding the need for an inverse CQT to reconstruct the clustered NMF basis functions. Another major contribution incorporates a constraint called group sparsity (GS) into the SNMF clustering algorithm at two stages to improve clustering, and the effect of GS is evaluated on the various SNMF clustering algorithms proposed in this thesis. Finally, we introduce a new family of masks to reconstruct the original signal from the clustered basis functions and compare their performance to the generalized Wiener filter masks using three different factorisation-based separation algorithms, showing that better separation performance can be achieved with the proposed family of masks.
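As background for the factorization step the abstract describes, the sketch below shows standard NMF with the classic Lee-Seung multiplicative updates for the KL divergence, applied to a toy magnitude "spectrogram". This is a minimal illustration of the general technique, not the thesis's implementation; the matrix, rank, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, seed=0):
    """Factorize a non-negative matrix V ~ W @ H with the standard
    multiplicative updates for the KL divergence (Lee & Seung)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3   # frequency basis functions
    H = rng.random((rank, T)) + 1e-3   # time activations
    eps = 1e-12
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= (V / WH) @ H.T / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        H *= W.T @ (V / WH) / (W.sum(axis=0)[:, None] + eps)
    return W, H

# Toy "spectrogram": two sources with disjoint spectral shapes and activations.
V = np.outer([1, 0, 2, 0], [1, 1, 0, 0]) + np.outer([0, 3, 0, 1], [0, 0, 1, 1.0])
W, H = nmf_kl(V, rank=2)
print(np.abs(V - W @ H).max())  # reconstruction error after the updates
```

Each column of `W` is a frequency basis function; clustering those columns to their sources is exactly the problem the thesis addresses.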

    Combining DNN and audio fingerprinting for the detection of multi-device acoustic replay attacks on spoken passwords

    In this Master's thesis, a combination of DNN-based systems and audio fingerprinting has been used to detect multi-device acoustic replay attacks on spoken passwords, in order to assess the detectability of this kind of attack, one of the most serious security problems of automatic speaker verification systems. To this end, an audio fingerprinting system was implemented from a prototype used for commercial detection in broadcast shows, and several deep-neural-network systems were developed to recognize the acoustic patterns that the various recording and playback devices usable in such attacks leave in the replayed audio. In this multi-device setting, the recordings produced by these attacks can vary in quality, depending on the fidelity of the device used and on the acoustic conditions at recording and playback time. The audio fingerprinting and pattern recognition systems perform differently depending on that quality: the acoustic systems detect attacks better as the quality drops, since they recognize the acoustic artifacts introduced by playback and recording, which become more pronounced as device fidelity decreases. The audio fingerprinting system, on the other hand, performs better the more similar the attack audio is to the original audio. The complementarity of these strategies has been demonstrated in this work by fusing the two systems, with the fusion outperforming either individual system.
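The work's fingerprinting system (built from a broadcast ad-detection prototype) is not described in enough detail to reproduce, but the general idea behind such fingerprints can be sketched with a toy energy-difference hash: one bit per adjacent band pair per frame, compared by bit error rate. All data and array shapes below are made up for the demonstration.

```python
import numpy as np

def band_energy_fingerprint(frames):
    """One bit per adjacent band pair per frame: the sign of the band-energy
    difference. A toy version of energy-difference (robust-hash) fingerprints."""
    return (np.diff(frames, axis=1) > 0).astype(np.uint8)

def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two equally shaped fingerprints."""
    return float(np.mean(fp_a != fp_b))

# Made-up band-energy matrices (n_frames x n_bands) standing in for real audio.
rng = np.random.default_rng(1)
ref = rng.random((10, 9))                  # reference (genuine) recording
replay = ref + 0.01 * rng.random((10, 9))  # near-identical replayed copy
other = rng.random((10, 9))                # unrelated audio
ber_replay = bit_error_rate(band_energy_fingerprint(ref),
                            band_energy_fingerprint(replay))
ber_other = bit_error_rate(band_energy_fingerprint(ref),
                           band_energy_fingerprint(other))
print(ber_replay, ber_other)  # the replay stays close; unrelated audio does not
```

The low bit error rate for the near-identical copy mirrors the abstract's point that fingerprinting works best when attack and original audio are most similar.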

    Speech, time-frequency representations

    This paper presents a review of the use of time-frequency representations in the fields of speech analysis and automatic speech processing. Three main groups of methods are considered: speech-production-based methods, general signal analysis methods, and auditory-based methods. After this review, some short conclusions on their current use, and on some possible future evolutions, are proposed.
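As a concrete instance of the "general signal analysis" family the review covers, the sketch below computes a Hann-windowed STFT magnitude spectrogram, the most common time-frequency representation; the window and hop sizes and the test signal are illustrative choices only.

```python
import numpy as np

def stft_mag(x, win=64, hop=32):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq bins, time frames)

fs = 800
t = np.arange(fs) / fs
# One second of a 50 Hz tone followed by one second of a 200 Hz tone.
x = np.concatenate([np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t)])
S = stft_mag(x)
# Bin resolution is fs/win = 12.5 Hz, so the dominant bin moves from 4 to 16
# between early and late frames - the time-varying view a plain FFT cannot give.
print(S[:, 2].argmax(), S[:, -3].argmax())
```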

    Time-Frequency Processing: An application to the human perception of sound

    This thesis is devoted to the study of the tools available in the field of Digital Signal Processing for analysing the human perception of sound. To that end, the first part of the thesis presents the physiological foundations of human sound perception, together with the fundamental methods and parameters for evaluating the subjective perception a sound produces, also known as the psychoacoustic study of sounds. Psychoacoustic analysis computes a set of sound parameters and combines them in an appropriate way to obtain a rating of the perceived Sound Quality that is as close as possible to what a person would perceive. Depending on which psychoacoustic parameters are combined, ratings can be obtained for the degree of annoyance, well-being, stress, dissonance, etc. that a sound produces in a listener. Regarding the Digital Signal Processing tools that can improve psychoacoustic analysis, this thesis carries out an exhaustive study of the possibilities of time-frequency processing. Beyond this study, the research work has focused on applying psychoacoustic analysis to rate the annoyance produced by certain types of engine noise, whether the combustion noise produced by the engine or the exhaust noise. The annoyance produced by these noises is known to be well modelled by a combination of Loudness, Sharpness and Roughness values. The computation of the first two parameters is standardized, whereas several approximations exist for computing Roughness, all of which show serious limitations when analysing engine sounds, since they were developed and tested on simple signals.
In this thesis, a signal synthesis model has been developed…
Camacho García, A. (2011). Procesado Tiempo-Frecuencia: Aplicación a la percepción humana del sonido [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/36417
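The standardized loudness and sharpness models the abstract refers to operate on critical-band (Bark) excitation patterns and are far more involved than can be shown here. The toy proxies below (total spectral magnitude for loudness, normalized spectral centroid for sharpness, with made-up weights) only illustrate the idea of combining psychoacoustic parameters into a single annoyance-style score.

```python
import numpy as np

def spectrum(x):
    """Magnitude spectrum of a Hann-windowed signal."""
    return np.abs(np.fft.rfft(x * np.hanning(len(x))))

def loudness_proxy(mag):
    # Crude stand-in for loudness: total spectral magnitude.
    return mag.sum()

def sharpness_proxy(mag):
    # Crude stand-in for sharpness: normalized spectral centroid (0..1),
    # higher when energy sits in higher-frequency bins.
    bins = np.arange(len(mag))
    return (mag * bins).sum() / (mag.sum() * len(mag))

def annoyance_score(x, w_loud=1.0, w_sharp=2.0):
    """Weighted combination of the two proxies; the weights are arbitrary."""
    mag = spectrum(x)
    return w_loud * loudness_proxy(mag) / len(x) + w_sharp * sharpness_proxy(mag)

fs = 8000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 100 * t)    # low-frequency rumble
high = np.sin(2 * np.pi * 3000 * t)  # shrill high-frequency tone
print(annoyance_score(high) > annoyance_score(low))  # sharper tone scores higher
```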

    Replay detection in voice biometrics: an investigation of adaptive and non-adaptive front-ends

    Among various physiological and behavioural traits, speech has gained popularity as an effective mode of biometric authentication. Despite this popularity, automatic speaker verification systems are vulnerable to malicious attacks, known as spoofing attacks. Among the various types of spoofing attack, the replay attack poses the biggest threat due to its simplicity and effectiveness. This thesis investigates the importance of 1) improving front-end feature extraction via novel feature extraction techniques and 2) enhancing spectral components via adaptive front-end frameworks to improve replay attack detection. The thesis initially focuses on AM-FM modelling techniques and their use in replay attack detection. A novel method to extract the sub-band frequency modulation (FM) component using the spectral centroid of a signal is proposed, and its use as a potential acoustic feature is discussed. Frequency Domain Linear Prediction (FDLP) is explored as a method to obtain the temporal envelope of a speech signal, which carries the amplitude modulation (AM) information of the speech resonances. Several features are extracted from the temporal envelope and the FDLP residual signal; these are then evaluated for replay attack detection and shown to have significant capability in discriminating genuine and spoofed signals. Fusion of AM- and FM-based features has shown that AM and FM carry complementary information that helps distinguish replayed signals from genuine ones. The importance of frequency band allocation when creating filter banks is also studied, to further advance the understanding of front-ends for replay attack detection. Mechanisms inspired by the human auditory system, which make the human ear an excellent spectrum analyser, have been investigated and integrated into the front-ends.
Spatial differentiation, a mechanism that provides additional sharpening of the auditory filters, is used in this work to improve the selectivity of the sub-band decomposition filters. Two features are extracted using the improved filter bank front-end: spectral envelope centroid magnitude (SECM) and spectral envelope centroid frequency (SECF). These are used to establish the positive effect of spatial differentiation on discriminating spoofed signals. Level-dependent filter tuning, which allows the ear to handle a large dynamic range, is then integrated into the filter bank to further improve the front-end. This mechanism converts the filter bank into an adaptive one, in which the selectivity of the filters varies with the input signal energy; experimental results show that this leads to improved spoofing detection performance. Finally, deep neural network (DNN) mechanisms are integrated into sub-band feature extraction to develop an adaptive front-end that adjusts its characteristics based on the sub-band signals. A DNN-based controller that takes sub-band FM components as input is developed to adaptively control the selectivity and sensitivity of a parallel filter bank, enhancing the artifacts that differentiate a replayed signal from a genuine one. This work illustrates gradient-based optimization of the DNN-based controller using feedback from a spoofing detection back-end classifier, thus training it to reduce spoofing detection error. The proposed framework displays a superior ability to identify high-quality replayed signals compared with conventional non-adaptive frameworks. All techniques proposed in this thesis have been evaluated on well-established replay attack detection databases and compared with state-of-the-art baseline systems.
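The spectral-centroid idea behind the FM extraction can be illustrated with a minimal frame-wise centroid tracker. This is a full-band sketch of the general principle, whereas the thesis applies the centroid per sub-band; the window sizes and the test chirp are illustrative assumptions.

```python
import numpy as np

def framewise_spectral_centroid(x, fs, win=256, hop=128):
    """Track the dominant frequency of a signal over time via the spectral
    centroid of each Hann-windowed frame: a simple stand-in for the kind of
    centroid-based frequency estimation used in FM extraction."""
    w = np.hanning(win)
    freqs = np.fft.rfftfreq(win, d=1 / fs)
    cents = []
    for i in range(0, len(x) - win + 1, hop):
        mag = np.abs(np.fft.rfft(x[i:i + win] * w))
        cents.append((freqs * mag).sum() / (mag.sum() + 1e-12))
    return np.array(cents)

fs = 8000
t = np.arange(2 * fs) / fs
# Linear chirp: instantaneous frequency rises from 500 Hz to 1500 Hz.
x = np.sin(2 * np.pi * (500 * t + 250 * t ** 2))
c = framewise_spectral_centroid(x, fs)
print(c[0] < c[len(c) // 2] < c[-1])  # the centroid tracks the rising frequency
```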