
    Adversarial Network Bottleneck Features for Noise Robust Speaker Verification

    In this paper, we propose a noise-robust bottleneck feature representation generated by an adversarial network (AN). The AN consists of two cascade-connected networks: an encoding network (EN) and a discriminative network (DN). Mel-frequency cepstral coefficients (MFCCs) of clean and noisy speech are used as input to the EN, and the output of the EN serves as the noise-robust feature. The EN and DN are trained in turn: when training the DN, the noise types are used as the training labels, and when training the EN, all labels are set to the same value, i.e., the clean-speech label, which drives the AN features to be invariant to noise and thus achieves noise robustness. We evaluate the proposed feature on a Gaussian mixture model-universal background model based speaker verification system and compare it with MFCC features of speech enhanced by short-time spectral amplitude minimum mean square error (STSA-MMSE) and deep neural network based speech enhancement (DNN-SE) methods. Experimental results on the RSR2015 database show that the proposed AN bottleneck feature (AN-BN) dramatically outperforms the STSA-MMSE and DNN-SE based MFCCs across noise types and signal-to-noise ratios. Furthermore, the AN-BN feature also improves speaker verification performance under the clean condition.
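    As a concrete illustration of the alternating scheme described above, the following PyTorch sketch first trains the DN on noise-type labels and then trains the EN against an all-clean label. The layer sizes, the 39-dimensional MFCC input, and all variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

MFCC_DIM, BN_DIM, N_NOISE_TYPES, CLEAN = 39, 64, 5, 0  # assumed sizes; label 0 = clean

encoder = nn.Sequential(nn.Linear(MFCC_DIM, 512), nn.ReLU(), nn.Linear(512, BN_DIM))
discrim = nn.Sequential(nn.Linear(BN_DIM, 512), nn.ReLU(), nn.Linear(512, N_NOISE_TYPES))
opt_en = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_dn = torch.optim.Adam(discrim.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

def train_step(mfcc_batch, noise_labels):
    """mfcc_batch: (batch, MFCC_DIM) floats; noise_labels: (batch,) ints."""
    # Step 1: train the DN to recognise the noise type from the bottleneck.
    bottleneck = encoder(mfcc_batch).detach()      # EN is frozen in this step
    loss_dn = ce(discrim(bottleneck), noise_labels)
    opt_dn.zero_grad(); loss_dn.backward(); opt_dn.step()

    # Step 2: train the EN so every input looks "clean" to the DN, pushing
    # the bottleneck feature to become invariant to the noise type.
    clean_labels = torch.full_like(noise_labels, CLEAN)
    loss_en = ce(discrim(encoder(mfcc_batch)), clean_labels)
    opt_en.zero_grad(); loss_en.backward(); opt_en.step()
    return loss_dn.item(), loss_en.item()
```

    After training, the EN's output on each frame would serve as the AN bottleneck feature fed to the speaker verification back-end.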

    Spoofing Detection in Automatic Speaker Verification Systems Using DNN Classifiers and Dynamic Acoustic Features


    DNN Filter Bank Cepstral Coefficients for Spoofing Detection

    With the development of speech synthesis techniques, automatic speaker verification systems face the serious challenge of spoofing attacks. To improve the reliability of speaker verification systems, we develop a new filter bank based cepstral feature, deep neural network filter bank cepstral coefficients (DNN-FBCC), to distinguish between natural and spoofed speech. The deep neural network filter bank is generated automatically by training a filter bank neural network (FBNN) on natural and synthetic speech. By adding restrictions to the training rules, the learned weight matrix of the FBNN is band-limited and sorted by frequency, like a conventional filter bank. Unlike a manually designed filter bank, the learned filter bank has different filter shapes in different channels, which captures the differences between natural and synthetic speech more effectively. Experimental results on the ASVspoof 2015 database show that a Gaussian mixture model maximum-likelihood (GMM-ML) classifier trained on the new feature performs better than the state-of-the-art linear frequency cepstral coefficients (LFCC) based classifier, especially at detecting unknown attacks.
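    A minimal sketch of the band-limiting idea is given below: a fixed 0/1 mask keeps each learned filter confined to a contiguous spectral band, with band centres sorted by frequency. The mask construction, layer sizes, and names are assumptions for illustration, not the paper's exact FBNN.

```python
import numpy as np
import torch
import torch.nn as nn

def band_mask(n_bins, n_filters, width):
    """0/1 mask whose columns are contiguous bands with ascending centres,
    so the learned filters stay band-limited and sorted by frequency."""
    centres = np.linspace(0, n_bins - 1, n_filters)
    mask = np.zeros((n_bins, n_filters), dtype=np.float32)
    for j, c in enumerate(centres):
        lo, hi = max(0, int(c) - width // 2), min(n_bins, int(c) + width // 2)
        mask[lo:hi, j] = 1.0
    return torch.from_numpy(mask)

class FilterBankLayer(nn.Module):
    def __init__(self, n_bins=257, n_filters=20, width=40):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(n_bins, n_filters))
        self.register_buffer("mask", band_mask(n_bins, n_filters, width))

    def forward(self, power_spec):                     # (batch, n_bins)
        fbank = torch.relu(self.weight) * self.mask    # non-negative, band-limited
        return torch.log(power_spec @ fbank + 1e-8)    # log filter-bank energies
```

    Applying a DCT to these log energies would yield cepstral coefficients, mirroring the way MFCCs are derived from a mel filter bank.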

    An evaluation of intrusive instrumental intelligibility metrics

    Instrumental intelligibility metrics are commonly used as an alternative to listening tests. This paper evaluates 12 monaural intrusive intelligibility metrics: SII, HEGP, CSII, HASPI, NCM, QSTI, STOI, ESTOI, MIKNN, SIMI, SIIB, and $\text{sEPSM}^{\text{corr}}$. In addition, this paper investigates the ability of intelligibility metrics to generalize to new types of distortions and analyzes why the top-performing metrics perform so well. The intelligibility data were obtained from 11 listening tests described in the literature. The stimuli included Dutch, Danish, and English speech that was distorted by additive noise, reverberation, competing talkers, pre-processing enhancement, and post-processing enhancement. SIIB and HASPI had the highest performance, achieving average correlations with listening test scores of $\rho=0.92$ and $\rho=0.89$, respectively. The high performance of SIIB may, in part, be the result of SIIB's developers having access to all the intelligibility data considered in the evaluation. The results show that intelligibility metrics tend to perform poorly on data sets that were not used during their development. By modifying the original implementations of SIIB and STOI, the advantage of reducing statistical dependencies between input features is demonstrated. Additionally, the paper presents a new version of SIIB called $\text{SIIB}^{\text{Gauss}}$, which has similar performance to SIIB and HASPI but is two orders of magnitude faster to compute.
    Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
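    To make the evaluation recipe concrete, the sketch below scores one metric against listening-test data: a logistic function maps raw metric values to the intelligibility scale, and Pearson's $\rho$ summarises the fit. This is a common evaluation convention, assumed here for illustration rather than taken from the paper's exact protocol.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr

def logistic(x, a, b):
    """Map raw metric values to a 0-100 intelligibility scale."""
    return 100.0 / (1.0 + np.exp(a * (x - b)))

def evaluate_metric(metric_scores, listening_scores):
    """Fit the logistic for this metric, then report Pearson's rho between
    the mapped predictions and the measured intelligibility scores."""
    (a, b), _ = curve_fit(logistic, metric_scores, listening_scores,
                          p0=[-1.0, float(np.mean(metric_scores))], maxfev=10000)
    rho, _ = pearsonr(logistic(metric_scores, a, b), listening_scores)
    return rho
```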

    Advances in Deep Learning Towards Fire Emergency Application : Novel Architectures, Techniques and Applications of Neural Networks

    Paper IV is not yet published; with respect to copyright, papers IV and VI were excluded from the dissertation. Deep learning has been used successfully in various applications, and recently there has been increasing interest in applying it to emergency management. However, many significant challenges still limit the use of deep learning in this application domain. In this thesis, we address some of these challenges and propose novel deep learning methods and architectures. The challenges we address fall into three areas of emergency management: detection of the emergency (fire), analysis of the situation without human intervention, and evacuation planning. We use the computer vision tasks of image classification and semantic segmentation, as well as sound recognition, for detection and analysis; for evacuation planning, we use deep reinforcement learning.

    Mapping and Masking Targets Comparison using Different Deep Learning based Speech Enhancement Architectures

    Mapping and masking targets are both widely used in recent deep neural network (DNN) based supervised speech enhancement. Masking targets have been shown to have a positive impact on the intelligibility of the output speech, while mapping targets have been found, in other studies, to generate speech with better quality. However, most studies compare the two approaches using the multilayer perceptron (MLP) architecture only. With the emergence of new architectures that outperform the MLP, a more general comparison between the mapping and masking approaches is needed. In this paper, a comprehensive comparison is conducted between mapping and masking targets using four different DNN-based speech enhancement architectures, to determine how the performance of each network changes with the chosen training target. The results show that no training target is best with respect to all of the speech quality evaluation metrics, and that there is a trade-off between the denoising process and the intelligibility of the output speech. Furthermore, the generalization ability of the networks was evaluated, and it is concluded that the design of the architecture restricts the choice of training target, because masking targets result in significant performance degradation for the deep convolutional autoencoder architecture.
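    The two kinds of training target compared above can be stated compactly on STFT magnitudes. The sketch below uses the ideal ratio mask as a representative masking target; the specific target definitions used in the paper may differ, so treat this as an assumed illustration.

```python
import numpy as np

def mapping_target(clean_mag):
    """Mapping: the network regresses the clean magnitude spectrogram itself."""
    return clean_mag

def masking_target(clean_mag, noise_mag, beta=0.5):
    """Masking: the network regresses an ideal ratio mask, bounded in [0, 1]."""
    return (clean_mag**2 / (clean_mag**2 + noise_mag**2 + 1e-8)) ** beta

def enhance_with_mask(noisy_mag, predicted_mask):
    """At test time, a masking network scales the noisy magnitude."""
    return noisy_mag * predicted_mask
```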

    Environmentally robust ASR front-end for deep neural network acoustic models

    This paper examines the individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems in distant-talking situations, where acoustic environmental distortion degrades recognition performance. Training of a DNN-based acoustic model consists of generating state alignments followed by learning the network parameters. This paper first shows that the network parameters are more sensitive to speech quality than the alignments are, and thus it is this stage that requires improvement. Various front-end robustness approaches to this problem are then categorised by functionality, and the degree to which each class of approaches impacts the performance of DNN-based acoustic models is examined experimentally. Based on the results, a front-end processing pipeline is proposed for efficiently combining different classes of approaches. Using this front-end, the combined effects of the different classes are further evaluated in a single distant-microphone meeting transcription task with both speaker-independent (SI) and speaker adaptive training (SAT) set-ups. By combining multiple speech enhancement results, multiple types of features, and feature transformation, the front-end achieves relative performance gains of 7.24% and 9.83% in the SI and SAT scenarios, respectively, over competitive DNN-based systems using log mel-filter bank features.
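    The combination idea can be sketched in a few lines. The fusion and transform choices below (signal averaging, feature stacking, a fitted linear projection) are stand-ins assumed for illustration, not the paper's exact pipeline.

```python
import numpy as np

def front_end(enhanced_signals, feature_fns, transform):
    """enhanced_signals: equal-length 1-D arrays from different enhancers;
    feature_fns: callables mapping a signal to (frames, dims) features;
    transform: a fitted (total_dims, out_dims) projection matrix."""
    fused = np.mean(np.stack(enhanced_signals), axis=0)        # enhancement fusion
    feats = np.concatenate([fn(fused) for fn in feature_fns],  # multiple feature types
                           axis=1)
    return feats @ transform                                   # feature transformation
```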

    Acoustic Features for Environmental Sound Analysis

    Most of the time it is nearly impossible to differentiate between particular types of sound events from a waveform alone. Therefore, frequency-domain and time-frequency-domain representations have been used for years, providing representations of sound signals that are more in line with human perception. However, these representations are usually too generic and often fail to describe specific content that is present in a sound recording. A lot of work has been devoted to designing features that allow such specific information to be extracted, leading to a wide variety of hand-crafted features. In the past years, owing to the increasing availability of medium-scale and large-scale sound datasets, an alternative approach to feature extraction has become popular: so-called feature learning. Finally, processing the amount of data that is at hand nowadays can quickly become overwhelming, so it is of paramount importance to be able to reduce the size of the dataset in the feature space. This chapter describes the general processing chain used to convert a sound signal to a feature vector that can be efficiently exploited by a classifier, as well as the relation to features used for speech and music processing.
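    A minimal sketch of that processing chain, with settings and the input file chosen purely for illustration, might look as follows: waveform, time-frequency representation, hand-crafted features, pooling, and dimensionality reduction before classification.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

# Waveform -> time-frequency representation -> hand-crafted features.
y, sr = librosa.load("recording.wav", sr=22050)        # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)     # (20, frames)

# Pool frame-level features into one clip-level vector for a classifier.
clip_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Given a dataset X of such vectors (n_clips, 40), reduce the size of the
# feature space before classification, e.g. with PCA:
# X_reduced = PCA(n_components=16).fit_transform(X)
```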