    Deep Fishing: Gradient Features from Deep Nets

    Convolutional Networks (ConvNets) have recently improved image recognition performance thanks to end-to-end learning of deep feed-forward models from raw pixels. Deep learning is a marked departure from the previous state of the art, the Fisher Vector (FV), which relied on gradient-based encoding of local hand-crafted features. In this paper, we discuss a novel connection between these two approaches. First, we show that one can derive gradient representations from ConvNets in a similar fashion to the FV. Second, we show that this gradient representation actually corresponds to a structured matrix that allows for efficient similarity computation. We experimentally study the benefits of transferring this representation over the outputs of ConvNet layers, and find consistent improvements on the Pascal VOC 2007 and 2012 datasets. Comment: To appear at BMVC 201

    A Novel automatic voice recognition system based on text-independent in a noisy environment

    An automatic voice recognition system aims to limit fraudulent access to sensitive areas such as labs. The primary objective of this paper is to increase the accuracy of voice recognition in noisy environments using the Microsoft Research (MSR) Identity Toolbox. The proposed system lets the user speak into a microphone and then matches the unknown voice against the other human voices in the database using a statistical model, in order to grant or deny access to the system. Voice recognition was done in two steps: training and testing. During training, Universal Background Model and Gaussian Mixture Model (GMM-UBM) models are computed from different sentences pronounced by the speaker(s) used to record the training data. Testing a voice signal in a noisy environment then computes the log-likelihood ratio of the GMM-UBM models in order to classify the user's voice. Before testing, noise and de-noising methods were applied: we investigated different MFCC features of the voice to determine the best possible feature, as well as a noise-filter algorithm, which subsequently improved the performance of the automatic voice recognition system.
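
The log-likelihood ratio test mentioned in this abstract can be illustrated with a minimal sketch. It assumes diagonal-covariance GMMs stored as lists of `(weight, mean, variance)` tuples; the toy models, the frames, and the `THRESHOLD` value are made up for illustration and are not taken from the paper or from the MSR Identity Toolbox.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)  # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(l - top) for l in logs))

def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio of the speaker model against the UBM."""
    total = sum(
        gmm_log_likelihood(x, speaker_gmm) - gmm_log_likelihood(x, ubm)
        for x in frames
    )
    return total / len(frames)

# Toy two-component models in a 2-D feature space (assumed values).
ubm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
speaker = [(0.5, [0.5, 0.5], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
frames = [[0.6, 0.4], [0.5, 0.7], [0.4, 0.5]]

THRESHOLD = 0.0  # hypothetical decision threshold
score = llr_score(frames, speaker, ubm)
accept = score > THRESHOLD  # grant access only if the speaker model explains the frames better
```

In practice the speaker model would be adapted from the UBM and the threshold tuned on held-out data; both steps are omitted here.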

    Research of Speaker Verification System Based On Sparse Representation

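    Speaker recognition, an important technology in modern biometric recognition, confirms a speaker's identity from the speech signal. NIST speaker recognition evaluation results from 1999 onward show that the GMM-UBM recognition framework, which adapts the target speaker model from a universal background model (UBM), characterizes speaker-specific traits better. Because GMM modeling uses only data from the target speaker class, classifying directly on GMM likelihood scores is computationally expensive and discriminates poorly. Feeding GMM mean supervectors to an SVM classifier and performing binary classification with a nonlinear kernel improves speaker recognition performance to some extent, but data imbalance and the overlap between the two classes strongly affect the classification results. Sparse representation theory states that a compressible signal can be represented linearly, in some space, by the smallest number of atoms that best reflect the signal's characteristics; the basis atoms representing signals of the same class are densely distributed, ... Degree: Master of Engineering. Department and major: School of Information Science and Technology, Circuits and Systems. Student ID: 2312010115295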

    Comparison GMM and SVM Classifier for Automatic Speaker Verification

    The objective of this thesis is to develop automatic text-independent speaker verification systems using unconstrained conversational telephone speech. We began by performing a Gaussian Mixture Model likelihood-ratio verification task in a speaker-independent system, as described by MIT Lincoln Lab. We next introduced a speaker-dependent verification system based on speaker-dependent thresholds. We then implemented the same system using a Support Vector Machine (SVM). For the SVM, we used polynomial kernels and radial basis function kernels and compared their performance. For training and testing the system, we used low-level spectral features. Finally, we provided a performance assessment of these systems using the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation 2008 telephone corpora.
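
As a rough illustration of the two kernel families this abstract compares, the sketch below writes out their standard definitions. The parameter values (`degree`, `coef0`, `gamma`) are assumed defaults for illustration; the abstract does not state the settings used in the thesis.

```python
import math

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two made-up feature vectors standing in for spectral features.
u = [1.0, 2.0]
v = [3.0, 4.0]
poly_value = polynomial_kernel(u, v, degree=2, coef0=1.0)
rbf_value = rbf_kernel(u, v, gamma=0.5)
```

The polynomial kernel grows with the inner product of the inputs, whereas the RBF kernel depends only on their distance and is bounded by 1, which is one reason the two can behave differently on the same features.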

    A Framework For Enhancing Speaker Age And Gender Classification By Using A New Feature Set And Deep Neural Network Architectures

    Speaker age and gender classification is one of the most challenging problems in speech processing. With recent technological developments, identifying a speaker's age and gender has become a necessity for speaker verification and identification systems, with uses such as identifying suspects in criminal cases, improving human-machine interaction, and adapting music for people waiting in a queue. Although many studies have focused on improving feature extraction and classifier design, classification accuracies are still not satisfactory. The key issue in identifying a speaker's age and gender is to generate robust features and to design an in-depth classifier. Age and gender information is concealed in the speaker's speech, which is affected by many factors, such as background noise, speech content, and phonetic divergences. In this work, different methods are proposed to enhance speaker age and gender classification using deep neural networks (DNNs) as feature extractor and classifier. First, a model for generating new features from a DNN is proposed. The proposed method uses the Hidden Markov Model Toolkit (HTK) to find tied-state triphones for all utterances, which are used as labels for the output layer of the DNN. The DNN with a bottleneck layer is first trained in an unsupervised manner to compute the initial weights between layers, then trained and tuned in a supervised manner to generate transformed mel-frequency cepstral coefficients (T-MFCCs). Second, the shared class labels method is introduced among misclassified classes to regularize the weights in the DNN. Third, DNN-based speaker models using the SDC feature set are proposed. The speaker-aware model can capture the characteristics of speaker age and gender more effectively than a model that represents a group of speakers.
In addition, an AGender-Tune system is proposed to classify speaker age and gender by jointly fine-tuning two DNN models: the first is pre-trained to classify speaker age, and the second is pre-trained to classify speaker gender. Moreover, the new T-MFCC feature set is used as the input of a fusion model of two systems: the first is the DNN-based class model and the second is the DNN-based speaker model. Utilizing the T-MFCCs as input and fusing the final score with the score of a DNN-based class model enhanced the classification accuracies. Finally, the DNN-based speaker models are embedded into the AGender-Tune system to exploit the advantages of each method for better speaker age and gender classification. Experimental results on a challenging public database showed the effectiveness of the proposed methods and achieved the state of the art on this database.

    VOICE BIOMETRICS UNDER MISMATCHED NOISE CONDITIONS

    This thesis describes research into effective voice biometrics (speaker recognition) under mismatched noise conditions. Over the last two decades, this class of biometrics has been the subject of considerable research due to its various applications in such areas as telephone banking, remote access control and surveillance. One of the main challenges associated with the deployment of voice biometrics in practice is that of undesired variations in speech characteristics caused by environmental noise. Such variations can in turn lead to a mismatch between the corresponding test and reference material from the same speaker. This is found to adversely affect the accuracy of speaker recognition. To address this problem, a novel approach is introduced and investigated. The proposed method is based on minimising the noise mismatch between reference speaker models and the given test utterance, and involves a new form of Test-Normalisation (T-Norm) for further enhancing matching scores under the aforementioned adverse operating conditions. Through experimental investigations, based on the two main classes of speaker recognition (i.e. verification and open-set identification), it is shown that the proposed approach can significantly improve accuracy under mismatched noise conditions. In order to further improve the recognition accuracy in severe mismatch conditions, an enhancement of this method is proposed. The enhancement, which involves a closer adjustment of the reference speaker models to the noise condition in the test utterance, is shown to considerably increase the accuracy in extreme cases of noisy test data. Moreover, to tackle the computational burden associated with the use of the enhanced approach with open-set identification, an efficient algorithm for its realisation in this context is introduced and evaluated.
The thesis presents a detailed description of the research undertaken, describes the experimental investigations and provides a thorough analysis of the outcomes
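
For orientation, the sketch below shows conventional T-Norm score normalisation, not the new form proposed in the thesis, whose details the abstract does not give. It assumes the test utterance has already been scored against a cohort of impostor models; the function name and the example scores are hypothetical.

```python
import statistics

def t_norm(raw_score, cohort_scores):
    """Normalise a match score by the mean and spread of cohort (impostor) scores.

    cohort_scores are the scores of the SAME test utterance against a set of
    impostor models, so the normalisation adapts to that utterance's conditions.
    """
    mu = statistics.mean(cohort_scores)
    sigma = statistics.pstdev(cohort_scores)
    if sigma == 0.0:
        raise ValueError("cohort scores have zero spread; cannot normalise")
    return (raw_score - mu) / sigma

# Made-up scores: the claimed-speaker score and two cohort scores.
normalised = t_norm(4.0, [1.0, 3.0])
```

Because the cohort statistics are computed per test utterance, a noisy utterance that lowers all scores shifts the cohort mean as well, which is why this family of normalisations is relevant under mismatched noise.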

    Support Vector Machines for Speech Recognition

    Hidden Markov models (HMM) with Gaussian mixture observation densities are the dominant approach in speech recognition. These systems typically use a representational model for acoustic modeling which can often be prone to overfitting and does not translate to improved discrimination. We propose a new paradigm centered on principles of structural risk minimization using a discriminative framework for speech recognition based on support vector machines (SVMs). SVMs have the ability to simultaneously optimize the representational and discriminative ability of the acoustic classifiers. We have developed the first SVM-based large vocabulary speech recognition system that improves performance over traditional HMM-based systems. This hybrid system achieves a state-of-the-art word error rate of 10.6% on a continuous alphadigit task, a 10% improvement relative to an HMM system. On SWITCHBOARD, a large vocabulary task, the system improves performance over a traditional HMM system from 41.6% word error rate to 40.6%. This dissertation discusses several practical issues that arise when SVMs are incorporated into the hybrid system

    Autoregressive models for text independent speaker identification in noisy environments

    The closed-set speaker identification problem is defined as the search within a set of persons for the speaker of a certain utterance. It is reported that the Gaussian mixture model (GMM) classifier achieves very high classification accuracies (in the range 95% - 100%) when both the training and testing utterances are recorded in a soundproof studio, i.e., there is neither additive noise nor spectral distortion in the speech signals. However, in real-life applications, speech is usually corrupted by noise and band-limitation. Moreover, there is a mismatch between the recording conditions of the training and testing environments. As a result, the classification accuracy of GMM-based systems deteriorates significantly. In this thesis, we propose a two-step procedure for improving speaker identification performance in noisy environments. In the first step, we introduce a new classifier: the vector autoregressive Gaussian mixture (VARGM) model. Unlike the GMM, the new classifier models correlations between successive feature vectors. We also integrate the proposed method into the framework of the universal background model (UBM). In addition, we develop the learning procedure according to the maximum likelihood (ML) criterion. Based on a thorough experimental evaluation, the proposed method achieves an improvement of 3 to 5% in identification accuracy. In the second step, we propose a new compensation technique based on the generalized maximum likelihood (GML) decision rule. In particular, we assume a general form for the distribution of the noise-corrupted utterances, which contains two types of parameters: clean speech-related parameters and noise-related parameters. While the clean speech-related parameters are estimated during the training phase, the noise-related parameters are estimated from the corrupted speech in the testing phase.
We applied the proposed method to utterances of 50 speakers selected from the TIMIT database, artificially corrupted by convolutive and additive noise. The signal-to-noise ratio (SNR) varies from 0 to 20 dB. Simulation results reveal that the proposed method achieves good robustness against variation in the SNR. For utterances corrupted by convolutive noise, the improvement in classification accuracy ranges from 70% for SNR = 0 dB to around 4% for SNR = 10 dB, compared to the standard ML decision rule. For utterances corrupted by additive noise, the improvement in classification accuracy ranges from 1% to 10% for SNRs ranging from 0 to 20 dB. The proposed VARGM classifier is also applied to the speech emotion classification problem. In particular, we use the Berlin emotional speech database to validate the classification performance of the proposed VARGM classifier. The proposed technique provides a classification accuracy of 76%, versus 71% for the hidden Markov model, 67% for k-nearest neighbors, and 55% for feed-forward neural networks. The model also discriminates better than the HMM between high-arousal emotions (joy, anger, fear), low-arousal emotions (sadness, boredom), and neutral emotions. Another interesting application of the VARGM model is the blind equalization of multiple-input multiple-output (MIMO) communication channels. Based on VARGM modeling of MIMO channels, we propose a four-step equalization procedure. First, the received data vectors are fitted to a VARGM model using the expectation maximization (EM) algorithm. The constructed VARGM model is then used to filter the received data. A Bayesian decision rule is then applied to identify the transmitted symbols up to permutation and phase ambiguities, which are finally resolved using a small training sequence. Moreover, we propose a fast and easily implementable model order selection technique.
The new equalization algorithm is compared to the whitening method and found to provide a lower symbol error probability. The proposed technique is also applied to frequency-flat slow fading channels and found to provide a more accurate estimate of the channel response than that provided by the blind de-convolution exploiting channel encoding (BDCC) method, and at a higher information rate.
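
The additive-noise corruption at a chosen SNR described in this abstract can be sketched as follows. This is an illustrative sketch under assumptions, not the thesis's procedure: the sine-wave "speech", the Gaussian noise, and all function names are made up, and the noise is simply rescaled so that the clean-to-noise power ratio matches the target in dB.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)

def add_noise_at_snr(clean, noise, snr_db):
    """Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db, then add it."""
    target_noise_power = power(clean) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]

def measured_snr_db(clean, noisy):
    """SNR in dB recovered from the clean signal and its corrupted version."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10.0 * math.log10(power(clean) / power(residual))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = [math.sin(2.0 * math.pi * 5.0 * t / 1000.0) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Sweep over a few SNR values within the 0 to 20 dB range quoted in the abstract.
results = {snr: measured_snr_db(clean, add_noise_at_snr(clean, noise, snr))
           for snr in (0, 10, 20)}
```

Convolutive noise would instead filter the clean signal through a channel response before any addition; that case is not covered by this sketch.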