167 research outputs found

    V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization

    Voice data generated on instant messaging or social media applications contains unique user voiceprints that may be abused by malicious adversaries for identity inference or identity theft. Existing voice anonymization techniques, e.g., signal processing and voice conversion/synthesis, suffer from degradation of perceptual quality. In this paper, we develop a voice anonymization system, named V-Cloak, which attains real-time voice anonymization while preserving the intelligibility, naturalness and timbre of the audio. The anonymizer features a one-shot generative model that modulates the features of the original audio at different frequency levels. We train the anonymizer with a carefully designed loss function: in addition to the anonymity loss, we incorporate an intelligibility loss and a psychoacoustics-based naturalness loss. The anonymizer can realize untargeted and targeted anonymization to achieve the anonymity goals of unidentifiability and unlinkability. We have conducted extensive experiments on four datasets, i.e., LibriSpeech (English), AISHELL (Chinese), CommonVoice (French) and CommonVoice (Italian), five Automatic Speaker Verification (ASV) systems (including two DNN-based, two statistical and one commercial ASV), and eleven Automatic Speech Recognition (ASR) systems (for different languages). Experimental results confirm that V-Cloak outperforms five baselines in terms of anonymity performance. We also demonstrate that V-Cloak trained only on the VoxCeleb1 dataset against ECAPA-TDNN ASV and DeepSpeech2 ASR has transferable anonymity against other ASVs and cross-language intelligibility for other ASRs. Furthermore, we verify the robustness of V-Cloak against various de-noising techniques and adaptive attacks. Hopefully, V-Cloak may provide a cloak for us in a prism world. Comment: Accepted by USENIX Security Symposium 202

    Advances in Binary and Multiclass Audio Segmentation with Deep Learning Techniques

    The technological advances of the last decade have completely changed the way people interact with multimedia content. This has led to a significant increase in both the generation and the consumption of such content. Manual analysis and annotation of all this information is not feasible given its current volume, which reveals the need for automatic tools that help in the transition towards assisted or partially automatic workflows. In recent years, most of these tools have been based on neural networks and deep learning. In this context, the work described in this thesis focuses on extracting information from audio signals. In particular, it studies the audio segmentation task, whose main goal is to obtain a sequence of labels that isolate different regions of an input signal according to characteristics described in a predefined set of classes, such as speech, music or noise. The first part of this dissertation focuses on the voice activity detection task. Recently, several international evaluation campaigns have proposed this task as one of their challenges. Among them is the Fearless Steps challenge, which works with audio from recordings of NASA's Apollo missions. For this challenge, a supervised learning solution is proposed that uses a convolutional recurrent neural network as the classifier. The main contribution is a method that combines information from 1D and 2D filters in the convolutional stage so that it can then be processed by the recurrent stage.
Motivated by the introduction of the Fearless Steps challenge data, an evaluation of different domain adaptation techniques is carried out, with the aim of assessing the performance of a system trained on data from common domains and evaluated on the new domain presented in the challenge. The methods described do not require labels in the target domain, which makes them easier to use in practical applications. In general terms, the methods that seek to minimize the shift in statistical distributions between the source and target domains obtain the most promising results. Recent advances in representations obtained through self-supervised learning have shown large performance improvements in several speech processing tasks. Following this line, these representations are incorporated into the voice activity detection task. The most recent editions of the Fearless Steps challenge changed their purpose and now seek to evaluate the generalization capabilities of the systems. The goal of the techniques introduced is therefore to benefit from large amounts of unlabelled data to improve system robustness. The experimental results suggest that self-supervised representation learning yields systems that are much less sensitive to domain shift. The second part of this document analyses a more generic audio segmentation task that seeks to simultaneously classify an audio signal as speech, music, noise or a combination of these. In the context of the data proposed for the Albayzín 2010 audio segmentation challenge, an approach is presented that uses recurrent neural networks as the main classifier and a postprocessing model based on hidden Markov models.
A new block is introduced in the neural architecture to remove redundant temporal information, improving performance and reducing the number of operations per second at the same time. This proposal outperformed solutions previously presented in the literature as well as similar approaches based on deep neural networks. While the results with self-supervised representation learning were promising in binary segmentation tasks, a number of issues arise when they are applied to multiclass segmentation tasks. The usual data augmentation techniques applied during training force the model to compensate for background noise or music. Under these conditions, the features obtained may not accurately represent those classes that are generated in a way similar to the augmented versions seen in training. This limits the overall performance improvement observed when applying these techniques to tasks such as the one proposed in the Albayzín 2010 evaluation. The last part of this work investigates the application of new loss functions to the audio segmentation task, with the main goal of mitigating the problems that arise from using a limited training dataset. New optimization techniques based on the AUC and partial AUC metrics have been shown to improve on traditional training objectives such as cross-entropy in several detection tasks. With this idea in mind, this thesis introduces these techniques into the music detection task. Considering that the amount of labelled data for this task is limited compared with other tasks, AUC-based loss functions are applied with the aim of improving performance when the training dataset is relatively small.
Most systems that use AUC-based optimization techniques are limited to binary tasks, since that is the usual scope of the AUC metric. In addition, labelling audio with more detailed taxonomies in which there are multiple possible options is more complex, so the amount of labelled audio in some multiclass segmentation tasks is limited. As a natural extension, a generalization of the optimization techniques based on the binary AUC metric is proposed so that they can be applied with an arbitrary number of classes. Two different loss functions are introduced, formulated on the basis of the multiclass variations of the AUC metric proposed in the literature: one based on a one-versus-one approach and the other on a one-versus-rest approach.
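The one-versus-rest idea in this abstract can be sketched as follows. This is only an illustration, not the thesis's formulation: the sigmoid surrogate, the steepness parameter `delta`, and the function names `soft_auc` and `one_vs_rest_auc_loss` are all assumptions introduced here.

```python
import math

def soft_auc(pos_scores, neg_scores, delta=10.0):
    """Smooth AUC estimate: mean sigmoid of (positive - negative) score gaps.

    The sigmoid replaces the non-differentiable step function of the exact
    AUC; `delta` (an assumed hyperparameter) controls how sharp it is.
    """
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += 1.0 / (1.0 + math.exp(-delta * (p - n)))
    return total / (len(pos_scores) * len(neg_scores))

def one_vs_rest_auc_loss(scores, labels, num_classes):
    """Average (1 - soft AUC) over classes, each class against all others.

    scores[i][c] is the model score of sample i for class c,
    labels[i] is the true class index of sample i.
    """
    aucs = []
    for c in range(num_classes):
        pos = [s[c] for s, y in zip(scores, labels) if y == c]
        neg = [s[c] for s, y in zip(scores, labels) if y != c]
        if pos and neg:  # a class needs both positives and negatives
            aucs.append(soft_auc(pos, neg))
    if not aucs:
        raise ValueError("no class has both positive and negative samples")
    return 1.0 - sum(aucs) / len(aucs)
```

A one-versus-one variant would instead loop over pairs of classes and compare only the samples belonging to those two classes.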

    Exploitation of Phase-Based Features for Whispered Speech Emotion Recognition

    Features for speech emotion recognition are usually dominated by spectral magnitude information, while the phase spectrum is ignored because it is difficult to interpret properly. Motivated by recent successes of phase-based features for speech processing, this paper investigates the effectiveness of phase information for whispered speech emotion recognition. We select two types of phase-based features (i.e., modified group delay features and all-pole group delay features), both of which have shown wide applicability in speech analysis and are now studied in whispered speech emotion recognition. To exploit these features, we propose a new speech emotion recognition framework that employs the outer product in combination with power and L2 normalization. This technique encodes a sequence of phase-based features of any length into a fixed-dimension vector. The resulting representation is used to train a classification model with a linear kernel classifier. Experimental results on the Geneva Whispered Emotion Corpus database, which includes normal and whispered phonation, demonstrate the effectiveness of the proposed method when compared with other modern systems. It is also shown that combining phase information with magnitude information can significantly improve performance over common systems that adopt magnitude information alone.
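A minimal sketch of how an outer product with power and L2 normalization can turn a variable-length feature sequence into a fixed-size vector. The abstract does not give the exact formulation, so averaging the per-frame outer products, the signed power exponent `alpha = 0.5`, and the name `encode_sequence` are assumptions made for illustration.

```python
import math

def encode_sequence(frames, alpha=0.5):
    """Pool a list of D-dimensional frames into one vector of length D*D."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    count = len(frames)
    # Average of the outer products x x^T over all frames (assumed pooling).
    pooled = [0.0] * (dim * dim)
    for x in frames:
        for i in range(dim):
            for j in range(dim):
                pooled[i * dim + j] += x[i] * x[j] / count
    # Signed power normalization: sign(v) * |v|^alpha.
    powered = [math.copysign(abs(v) ** alpha, v) for v in pooled]
    # L2 normalization so every encoded vector has unit length.
    norm = math.sqrt(sum(v * v for v in powered))
    return [v / norm for v in powered] if norm > 0 else powered
```

Whatever the number of frames, the output length depends only on the feature dimension, which is what lets a linear classifier consume utterances of different durations.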

    Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods

    In this study, formant tracking is investigated by refining the formants tracked by an existing data-driven tracker, DeepFormants, using the formants estimated in a model-driven manner by linear prediction (LP)-based methods. As LP-based formant estimation methods, conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis are used. In the proposed refinement approach, the contours of the three lowest formants are first predicted by the data-driven DeepFormants tracker, and the predicted formants are replaced frame-wise with local spectral peaks shown by the model-driven LP-based methods. The refinement procedure can be plugged into the DeepFormants tracker with no need for any new data learning. Two refined DeepFormants trackers were compared with the original DeepFormants and with five known traditional trackers using the popular vocal tract resonance (VTR) corpus. The results indicated that the data-driven DeepFormants trackers outperformed the conventional trackers and that the best performance was obtained by refining the formants predicted by DeepFormants using QCP-FB analysis. In addition, by tracking formants using VTR speech that was corrupted by additive noise, the study showed that the refined DeepFormants trackers were more resilient to noise than the reference trackers. In general, these results suggest that LP-based model-driven approaches, which have traditionally been used in formant estimation, can be combined with a modern data-driven tracker easily with no further training to improve the tracker's performance. Comment: Computer Speech and Language, Vol. 81, Article 101515, June 202
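The frame-wise replacement step described in this abstract might look like the sketch below. The abstract does not say how a spectral peak is chosen for each formant, so picking the peak nearest in frequency is an assumption, as are the name `refine_formants` and the choice to keep the prediction when a frame has no peaks.

```python
def refine_formants(predicted, lp_peaks):
    """Replace each predicted formant with the nearest LP peak of the same frame.

    predicted: list of frames, each a list of formant frequencies in Hz
               (for example, the three lowest formants from a data-driven tracker).
    lp_peaks:  list of frames, each a list of local spectral peak frequencies
               in Hz obtained from an LP-based analysis.
    """
    refined = []
    for formants, peaks in zip(predicted, lp_peaks):
        if not peaks:
            # Assumed fallback: keep the tracker's prediction unchanged.
            refined.append(list(formants))
            continue
        refined.append([min(peaks, key=lambda p: abs(p - f)) for f in formants])
    return refined
```

Because the step only post-processes tracker output, it needs no retraining, which matches the plug-in property the abstract emphasizes.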

    Replay detection in voice biometrics: an investigation of adaptive and non-adaptive front-ends

    Among various physiological and behavioural traits, speech has gained popularity as an effective mode of biometric authentication. Despite this growing popularity, automatic speaker verification systems are vulnerable to malicious attacks known as spoofing attacks. Among the various types of spoofing attacks, the replay attack poses the biggest threat due to its simplicity and effectiveness. This thesis investigates the importance of 1) improving front-end feature extraction via novel feature extraction techniques and 2) enhancing spectral components via adaptive front-end frameworks to improve replay attack detection. The thesis initially focuses on AM-FM modelling techniques and their use in replay attack detection. A novel method to extract the sub-band frequency modulation (FM) component using the spectral centroid of a signal is proposed, and its use as a potential acoustic feature is also discussed. Frequency Domain Linear Prediction (FDLP) is explored as a method to obtain the temporal envelope of a speech signal. The temporal envelope carries amplitude modulation (AM) information of speech resonances. Several features are extracted from the temporal envelope and the FDLP residual signal. These features are then evaluated for replay attack detection and shown to have significant capability in discriminating genuine and spoofed signals. Fusion of AM- and FM-based features has shown that AM and FM carry complementary information that helps distinguish replayed signals from genuine ones. The importance of frequency band allocation when creating filter banks is also studied to further advance the understanding of front-ends for replay attack detection. Mechanisms inspired by the human auditory system, which make the human ear an excellent spectrum analyser, have been investigated and integrated into front-ends.
Spatial differentiation, a mechanism that provides additional sharpening to auditory filters, is one such mechanism and is used in this work to improve the selectivity of the sub-band decomposition filters. Two features are extracted using the improved filter bank front-end: spectral envelope centroid magnitude (SECM) and spectral envelope centroid frequency (SECF). These are used to establish the positive effect of spatial differentiation on discriminating spoofed signals. Level-dependent filter tuning, which allows the ear to handle a large dynamic range, is integrated into the filter bank to further improve the front-end. This mechanism converts the filter bank into an adaptive one in which the selectivity of the filters is varied based on the input signal energy. Experimental results show that this leads to improved spoofing detection performance. Finally, deep neural network (DNN) mechanisms are integrated into sub-band feature extraction to develop an adaptive front-end that adjusts its characteristics based on the sub-band signals. A DNN-based controller that takes sub-band FM components as input is developed to adaptively control the selectivity and sensitivity of a parallel filter bank to enhance the artifacts that differentiate a replayed signal from a genuine signal. This work illustrates gradient-based optimization of a DNN-based controller using the feedback from a spoofing detection back-end classifier, thus training it to reduce spoofing detection error. The proposed framework has displayed a superior ability in identifying high-quality replayed signals compared to conventional non-adaptive frameworks. All techniques proposed in this thesis have been evaluated on well-established databases on replay attack detection and compared with state-of-the-art baseline systems.
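Several of the features named in this abstract build on a spectral centroid computed within a sub-band. The sketch below shows only the textbook magnitude-weighted centroid restricted to one band; it is not the thesis's definition of the FM component, SECM or SECF, and the function name `band_centroid_hz` and its arguments are assumptions.

```python
def band_centroid_hz(freqs_hz, magnitudes, low_hz, high_hz):
    """Magnitude-weighted mean frequency of the bins inside [low_hz, high_hz].

    freqs_hz and magnitudes are parallel lists describing one spectrum frame.
    Returns None when the band carries no energy.
    """
    band = [(f, m) for f, m in zip(freqs_hz, magnitudes) if low_hz <= f <= high_hz]
    total = sum(m for _, m in band)
    if total == 0:
        return None
    return sum(f * m for f, m in band) / total
```

Computing this value frame by frame for each sub-band gives a trajectory that follows where the energy of that band is concentrated over time.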

    Tree-Based Deep Mixture of Experts with Applications to Visual Saliency Prediction and Quality Robust Visual Recognition

    Mixture of experts is a machine learning ensemble approach that consists of individual models that are trained to be "experts" on subsets of the data, and a gating network that provides weights to output a combination of the expert predictions. Mixture of experts models do not currently see wide use due to the difficulty of training diverse experts and their high computational requirements. This work presents modifications of the mixture of experts formulation that use domain knowledge to improve training and incorporate parameter sharing among experts to reduce computational requirements. First, this work presents an application of mixture of experts models to quality-robust visual recognition. It is first shown that human subjects outperform deep neural networks on classification of distorted images, and a model, MixQualNet, that is more robust to distortions is then proposed. The proposed model consists of "experts" that are each trained on a particular type of image distortion. The final output of the model is a weighted sum of the expert models, where the weights are determined by a separate gating network. The proposed model also incorporates weight sharing to reduce the number of parameters as well as to increase performance. Second, an application of mixture of experts to predicting visual saliency is presented. A computational saliency model attempts to predict where humans will look in an image. In the proposed model, each expert network is trained to predict saliency for a set of closely related images. The final saliency map is computed as a weighted mixture of the expert networks' outputs, with weights determined by a separate gating network. The proposed model achieves better performance than several other visual saliency models and a baseline non-mixture model. Finally, this work introduces a saliency model that is a weighted mixture of models trained for different levels of saliency.
Levels of saliency include high saliency, which corresponds to regions where almost all subjects look, and low saliency, which corresponds to regions where some, but not all, subjects look. The weighted mixture shows improved performance compared with baseline models because of the diversity of the individual model predictions. Dissertation/Thesis. Doctoral Dissertation, Electrical Engineering, 201
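The weighted combination of expert outputs described in this abstract can be illustrated with a small sketch. The softmax over gate scores is a common choice but an assumption here, as are the names `softmax` and `mixture_output`; in the dissertation the gate scores come from a separate gating network, which this sketch takes as given.

```python
import math

def softmax(scores):
    """Turn raw gate scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_output(expert_outputs, gate_scores):
    """Weighted sum of expert predictions.

    expert_outputs: one prediction vector per expert, all of equal length.
    gate_scores:    one raw score per expert, as a gating network might emit.
    """
    weights = softmax(gate_scores)
    size = len(expert_outputs[0])
    return [
        sum(w * out[k] for w, out in zip(weights, expert_outputs))
        for k in range(size)
    ]
```

With equal gate scores the result is the plain average of the experts; a strongly peaked gate effectively selects a single expert.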

    TimeScaleNet : a Multiresolution Approach for Raw Audio Recognition using Learnable Biquadratic IIR Filters and Residual Networks of Depthwise-Separable One-Dimensional Atrous Convolutions

    In the present paper, we show the benefit of a multi-resolution approach that makes it possible to encode the relevant information contained in unprocessed time-domain acoustic signals. TimeScaleNet aims at learning an efficient representation of a sound by learning time dependencies both at the sample level and at the frame level. The proposed approach improves the interpretability of the learning scheme by unifying advanced deep learning and signal processing techniques. In particular, TimeScaleNet's architecture introduces a new form of recurrent neural layer, which is directly inspired by digital IIR signal processing. This layer acts as a learnable band-pass biquadratic digital IIR filterbank. The learnable filterbank builds a time-frequency-like feature map that self-adapts to the specific recognition task and dataset, with a large receptive field and very few learnable parameters. The obtained frame-level feature map is then processed using a residual network of depthwise-separable atrous convolutions. This second scale of analysis aims at efficiently encoding relationships between the time fluctuations at the frame timescale, in different learnt pooled frequency bands, in the range of [20 ms, 200 ms]. TimeScaleNet is tested using both the Speech Commands Dataset and the ESC-10 Dataset. We report a very high mean accuracy of 94.87 ± 0.24% (macro-averaged F1-score: 94.9 ± 0.24%) for speech recognition, and a rather moderate accuracy of 69.71 ± 1.91% (macro-averaged F1-score: 70.14 ± 1.57%) for the environmental sound classification task.
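To make the idea of a band-pass biquadratic IIR filter concrete, here is a sketch of a single fixed filter. It uses the widely known audio-cookbook band-pass parametrization (centre frequency and quality factor), which is an assumption and may differ from the paper's layer; in a learnable filterbank the centre frequency and quality factor would be trainable parameters, one pair per band. The function names are hypothetical.

```python
import math

def bandpass_biquad_coeffs(center_hz, q, sample_rate_hz):
    """Normalized biquad coefficients for a band-pass with unit peak gain."""
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)                      # feed-forward taps
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)      # feedback taps
    return b, a

def biquad_filter(signal, b, a):
    """Apply y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in signal:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out
```

The feedback on past outputs is what makes the filter recurrent: two parameters per band are enough to give a long effective receptive field, which is consistent with the abstract's remark about very few learnable parameters.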