Robust excitation-based features for Automatic Speech Recognition
In this paper we investigate the use of noise-robust features characterizing the speech excitation signal as complementary features to the commonly used vocal-tract-based features for automatic speech recognition (ASR). The features are tested in a state-of-the-art Deep Neural Network (DNN) based hybrid acoustic model for speech recognition. The suggested excitation features expand the set of excitation features previously considered for ASR, with the expectation that they help to better discriminate the broad phonetic classes (e.g., fricatives, nasals, vowels). Relative improvements in the word error rate are observed in the AMI meeting transcription system, with greater gains (about 5%) when PLP features are combined with the suggested excitation features. For Aurora 4, significant improvements are observed as well. Combining the suggested excitation features with filter banks yields a word error rate of 9.96%. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ICASSP.2015.717885
Relevant Feature Selection for Audio-Visual Speech Recognition
We present a feature selection method based on information-theoretic measures, targeted at multimodal signal processing, and show how to quantitatively assess the relevance of features from different modalities. We are able to find the features that carry the most information relevant to the recognition task while having minimal redundancy. Our application is audio-visual speech recognition, in particular the selection of relevant visual features. Experimental results show that our method outperforms other feature selection algorithms from the literature, improving recognition accuracy even with a significantly reduced number of features.
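The abstract does not state the selection criterion itself. As a rough illustration of the general idea (high relevance, low redundancy), the following minimal sketch uses a common greedy heuristic: score each candidate by its mutual information with the labels minus its mean mutual information with the features already chosen. The function names, the discrete-valued features, and the scoring rule are all assumptions for illustration, not the authors' method.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def select_features(features, labels, k):
    """Greedy selection (assumed heuristic): relevance minus mean redundancy.

    `features` maps a feature name to a sequence of discrete values.
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:

        def score(name):
            relevance = mutual_information(features[name], labels)
            if not selected:
                return relevance
            redundancy = sum(
                mutual_information(features[name], features[s])
                for s in selected
            ) / len(selected)
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the label is determined by two independent bits, a and b.
toy_features = {
    "a": [0, 0, 1, 1],
    "a_copy": [0, 0, 1, 1],  # fully redundant with "a"
    "b": [0, 1, 0, 1],
}
toy_labels = [0, 1, 2, 3]
chosen = select_features(toy_features, toy_labels, k=2)
```

In this toy case the redundant copy is skipped in favour of the second independent bit, which is the behaviour the abstract describes qualitatively.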
Efficient GCI detection for efficient sparse linear prediction
We propose a unified non-linear approach that offers an efficient closed-form solution for the problem of sparse linear prediction analysis. The approach is based on our previous work on minimizing the weighted l2-norm of the prediction error. The l2-norm is weighted so that less emphasis is given to the prediction error around the Glottal Closure Instants (GCIs), where the error is expected to attain its largest values; the resulting cost function therefore approaches the ideal l0-norm cost function for sparse residual recovery. As such, the method requires knowledge of the GCIs. In this paper we use our recently developed GCI detection algorithm, which is particularly suitable for this problem because it does not rely on the residuals themselves to detect GCIs. We show that our GCI detection algorithm provides slightly better sparsity properties than a recent powerful GCI detection algorithm. Moreover, as the computational cost of our GCI detection algorithm is quite low, the computational cost of the overall solution is considerably lower.
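The weighted l2-norm idea can be sketched as a weighted least-squares problem with a closed-form solution via the normal equations. The sketch below is only an illustration under stated assumptions: the Gaussian-shaped weighting in `gci_weights` (with its `width` and `floor` parameters) is hypothetical, since the abstract does not give the authors' weighting function, and the GCI positions are assumed to be known already.

```python
import math


def gci_weights(n_samples, gcis, width=4.0, floor=0.05):
    """Hypothetical weighting: equal to `floor` at each GCI, rising to 1 away from it."""
    weights = []
    for n in range(n_samples):
        dip = max(
            (math.exp(-0.5 * ((n - g) / width) ** 2) for g in gcis),
            default=0.0,
        )
        weights.append(1.0 - (1.0 - floor) * dip)
    return weights


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def weighted_lp(signal, order, weights):
    """Minimise sum_n w[n] * (x[n] - sum_k a[k] * x[n-k])**2 in closed form."""
    R = [[0.0] * order for _ in range(order)]
    r = [0.0] * order
    for n in range(order, len(signal)):
        past = [signal[n - k] for k in range(1, order + 1)]
        for i in range(order):
            r[i] += weights[n] * past[i] * signal[n]
            for j in range(order):
                R[i][j] += weights[n] * past[i] * past[j]
    return solve(R, r)
```

With uniform weights this reduces to ordinary covariance-method linear prediction; lowering the weights near the GCIs keeps the large excitation pulses from dominating the fit, which is the effect the abstract attributes to the weighting.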
Fundamental frequency estimation of low-quality electroglottographic signals
Fundamental frequency (fo) is often estimated based on electroglottographic (EGG) signals. Due to the nature of the method, the quality of EGG signals may be impaired by certain features like amplitude or baseline drifts, mains hum or noise. The potential adverse effects of these factors on fo estimation have to date not been investigated. Here, the performance of thirteen algorithms for estimating fo was tested, based on 147 synthesized EGG signals with varying degrees of signal quality deterioration. Algorithm performance was assessed through the standard deviation σfo of the difference between known and estimated fo data, expressed in octaves. With very few exceptions, simulated mains hum and amplitude and baseline drifts did not influence fo results, even though some algorithms consistently outperformed others. When increasing either cycle-to-cycle fo variation or the degree of subharmonics, the SIGMA algorithm had the best performance (max. σfo = 0.04). That algorithm was, however, more easily disturbed by typical EGG equipment noise, whereas the NDF and Praat's auto-correlation algorithms performed best in this category (σfo = 0.01). These results suggest that the algorithm for fo estimation of EGG signals needs to be selected specifically for each particular data set. Overall, estimated fo data should be interpreted with care.
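The σfo measure described above can be illustrated with a short sketch. It assumes that the difference in octaves is computed as the base-2 logarithm of the ratio between estimated and known fo, and that the population standard deviation is used; the abstract specifies neither detail, and the function name is hypothetical.

```python
import math
import statistics


def sigma_fo_octaves(known_hz, estimated_hz):
    """Standard deviation of the estimated-vs-known fo difference, in octaves.

    Assumes difference = log2(estimated / known) and a population
    standard deviation; both are illustrative choices.
    """
    if len(known_hz) != len(estimated_hz):
        raise ValueError("known and estimated fo tracks must have equal length")
    diffs = [math.log2(est / ref) for ref, est in zip(known_hz, estimated_hz)]
    return statistics.pstdev(diffs)


# One of two cycles is estimated an octave too high.
example = sigma_fo_octaves([100.0, 100.0], [100.0, 200.0])
```

Note that under this definition a constant offset (for example, every cycle estimated exactly one octave too high) gives σfo = 0, because a standard deviation measures the spread of the error and not its mean.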
Emergence of linguistic laws in human voice
Submitted for publication
On the Mutual Information of Glottal Source Estimation Techniques for the Automatic Detection of Speech Pathologies
This paper focuses on the automatic detection of speech pathologies by exploiting the estimation of the glottal source. Three estimation methods are compared, and time and spectral features are extracted. The relevance of these features is assessed by means of information-theoretic measures, which allows an intuitive interpretation in terms of discrimination power and redundancy between the features. We discuss which features are informative or complementary for detecting voice pathologies, and compare the glottal source estimation methods.