    New Features Using Robust MVDR Spectrum of Filtered Autocorrelation Sequence for Robust Speech Recognition

    This paper presents a novel noise-robust feature extraction method for speech recognition based on the robust perceptual minimum variance distortionless response (MVDR) spectrum of a temporally filtered autocorrelation sequence. The perceptual MVDR spectrum of the filtered short-time autocorrelation sequence can reduce the effect of the residual nonstationary additive noise that remains after filtering the autocorrelation. To achieve a more robust front end, we also modify the distortionless constraint of the MVDR spectral estimation method by reweighting the subband power spectrum values according to the subband signal-to-noise ratios (SNRs), adapting it to the proposed approach. This weighting passes the components of the input signal at the frequencies least affected by noise with larger weights and attenuates the noisy and undesired components more effectively. The modification reduces the noise residuals of the spectrum estimated from the filtered autocorrelation sequence, yielding a more robust algorithm. Evaluated on the Aurora 2 recognition task, the proposed method outperformed the Mel frequency cepstral coefficient (MFCC) baseline, relative autocorrelation sequence MFCC (RAS-MFCC), and MVDR-based features in several different noisy conditions.
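    A minimal sketch of the underlying pipeline is given below: short-time autocorrelations are computed per frame, each lag trajectory is high-pass filtered across frames (RAS-style), and the MVDR spectrum is estimated from the filtered lags via the Levinson-Durbin recursion. The filter taps, model order, and regularization are placeholder assumptions; the paper's perceptual warping and SNR-weighted distortionless constraint are not reproduced here.

```python
import numpy as np
from scipy.signal import lfilter

def autocorr(frame, order):
    """Biased short-time autocorrelation, lags 0..order."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    return r / n

def levinson(r, order):
    """Levinson-Durbin recursion -> LPC coefficients a (a[0] = 1) and error."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err
        a[1:m + 1] = a[1:m + 1] + k * a[m - 1::-1][:m]
        err = max(err * (1.0 - k * k), 1e-12)  # clamp for numerical safety
    return a, err

def mvdr_spectrum(a, err, nfft=512):
    """MVDR spectrum from LPC parameters (Musicus fast computation)."""
    M = len(a) - 1
    mu = np.zeros(M + 1)
    for k in range(M + 1):
        i = np.arange(M - k + 1)
        mu[k] = np.sum((M + 1.0 - k - 2.0 * i) * a[i] * a[i + k]) / err
    w = np.linspace(0.0, np.pi, nfft // 2 + 1)
    denom = mu[0] + 2.0 * np.cos(np.outer(w, np.arange(1, M + 1))) @ mu[1:]
    return 1.0 / np.maximum(denom, 1e-12)

def robust_mvdr_spectra(sig, fs, order=12, flen_s=0.025, hop_s=0.010):
    flen, fhop = int(flen_s * fs), int(hop_s * fs)
    win = np.hamming(flen)
    R = np.array([autocorr(sig[i:i + flen] * win, order)
                  for i in range(0, len(sig) - flen, fhop)])
    # Temporal filtering of each lag trajectory across frames; a first-order
    # difference stands in for the paper's FIR filter (assumption).
    Rf = lfilter([1.0, -1.0], [1.0], R, axis=0)
    spectra = []
    for r in Rf:
        r = r.copy()
        # Filtered autocorrelations may lose positive definiteness; a crude
        # regularization keeps the recursion stable in this sketch.
        r[0] = abs(r[0]) * 1.001 + 1e-8
        a, err = levinson(r, order)
        spectra.append(mvdr_spectrum(a, err))
    return np.array(spectra)
```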

    Cross validation of bi-modal health-related stress assessment

    This study explores the feasibility of objective and ubiquitous stress assessment. Twenty-five post-traumatic stress disorder patients participated in a controlled storytelling (ST) study and an ecologically valid reliving (RL) study. The two studies were meant to represent an early and a late therapy session, and each consisted of a "happy" and a "stress-triggering" part. Two instruments were chosen to assess the stress level of the patients at various points in time during therapy: (i) speech, used as an objective and ubiquitous stress indicator, and (ii) the Subjective Unit of Distress (SUD), a clinically validated Likert scale. In total, 13 statistical parameters were derived from each of five speech features: amplitude, zero-crossings, power, high-frequency power, and pitch. To model the emotional state of the patients, 28 parameters were selected from this set by means of a linear regression model and subsequently compressed into 11 principal components. The SUD and speech model were cross-validated using three machine learning algorithms. Between 90% (2 SUD levels) and 39% (10 SUD levels) correct classification was achieved. The two sessions could be discriminated in 89% (ST) and 77% (RL) of the cases. This report fills a gap between laboratory and clinical studies, and its results emphasize the usefulness of Computer Aided Diagnostics (CAD) for mental health care.
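    The modeling chain described above (per-feature statistics, regression-based selection of 28 parameters, compression to 11 principal components, then classification) maps naturally onto a scikit-learn pipeline. The sketch below is illustrative only: the specific 13 statistics and the three learners are not enumerated in this abstract, so plausible stand-ins are used.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def stat_params(x):
    """Seven of the 13 per-feature statistics; the full set is not given in
    the abstract, so these are plausible stand-ins."""
    return [np.mean(x), np.std(x), np.min(x), np.max(x),
            np.median(x), stats.skew(x), stats.kurtosis(x)]

def utterance_vector(amplitude, zcr, power, hf_power, pitch):
    """Concatenate statistics of the five speech feature trajectories."""
    return np.concatenate([stat_params(f)
                           for f in (amplitude, zcr, power, hf_power, pitch)])

# Regression-based selection of 28 parameters, compression to 11 principal
# components, then a classifier; the abstract does not name its three
# learners, so an SVM stands in here.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_regression, k=28),
    PCA(n_components=11),
    SVC(),
)
# With X (n_utterances x n_params) and y (discretized SUD levels):
# scores = cross_val_score(model, X, y, cv=5)
```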

    Speech and crosstalk detection in multichannel audio

    The analysis of scenarios in which a number of microphones record the activity of speakers, such as in a round-table meeting, presents a number of computational challenges. For example, if each participant wears a microphone, speech from both the microphone's wearer (local speech) and from other participants (crosstalk) is received. The recorded audio can be broadly classified in four ways: local speech, crosstalk plus local speech, crosstalk alone, and silence. We describe two experiments related to the automatic classification of audio into these four classes. The first experiment attempted to optimize a set of acoustic features for use with a Gaussian mixture model (GMM) classifier. A large set of potential acoustic features was considered, some of which have been employed in previous studies. The best-performing features were found to be kurtosis, "fundamentalness," and cross-correlation metrics. The second experiment used these features to train an ergodic hidden Markov model classifier. Tests performed on a large corpus of recorded meetings show classification accuracies of up to 96%, and automatic speech recognition performance close to that obtained using ground-truth segmentation.
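    As an illustration of the frame-level classification this describes, the sketch below computes two of the reported best-performing features (kurtosis and cross-channel correlation peaks; "fundamentalness" is omitted) and scores frames against one GMM per class. The feature choices, frame handling, and component counts are assumptions, not the paper's configuration.

```python
import numpy as np
from scipy.signal import correlate
from sklearn.mixture import GaussianMixture

def frame_features(local, others):
    """Per-frame features for one wearer's channel.

    `local` is the wearer's microphone frame; `others` holds the same frame
    from the remaining channels. Local speech tends to dominate its own mic,
    while crosstalk raises cross-channel similarity.
    """
    x = local - np.mean(local)
    kurt = np.mean(x**4) / (np.mean(x**2) ** 2 + 1e-12)  # sample kurtosis
    xcorrs = []
    for y in others:
        y = y - np.mean(y)
        c = correlate(x, y, mode="full")
        denom = np.sqrt(np.sum(x**2) * np.sum(y**2)) + 1e-12
        xcorrs.append(np.max(np.abs(c)) / denom)  # peak normalized xcorr
    return np.array([kurt, np.max(xcorrs), np.mean(xcorrs)])

# One GMM per class (local, local + crosstalk, crosstalk, silence); a frame
# is assigned the class with the highest log-likelihood. An ergodic HMM
# would add transition smoothing on top of these state likelihoods.
def train_gmms(features_by_class, n_components=4):
    return {c: GaussianMixture(n_components).fit(F)
            for c, F in features_by_class.items()}

def classify(gmms, f):
    return max(gmms, key=lambda c: gmms[c].score(f.reshape(1, -1)))
```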

    Improving the performance of MFCC for Persian robust speech recognition

    Mel frequency cepstral coefficients (MFCCs) are the most widely used features in speech recognition, but they are very sensitive to noise. In this paper, to achieve satisfactory performance in Automatic Speech Recognition (ASR) applications, we introduce a new noise-robust set of MFCC vectors estimated through the following steps. First, spectral mean normalization is applied to the noisy speech signal as a pre-processing step. The pre-emphasized speech is segmented into overlapping time frames and windowed by a modified Hamming window. Higher-order autocorrelation coefficients are then extracted; that is, the lower-order autocorrelation coefficients are eliminated. The resulting sequence is passed through an FFT block, and the power spectrum of the output is calculated. A Gaussian-shaped filter bank is applied to the result, followed by a logarithm and two compensator blocks, one performing mean subtraction and the other a root operation; a DCT transformation is the final step. We use an MLP neural network to evaluate the performance of the proposed MFCC variant and to classify the results. Speech recognition experiments on various tasks indicate that the proposed algorithm is more robust than traditional ones in noisy conditions.
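    The step-by-step front end lends itself to a compact sketch. The version below follows the described ordering but fills in unspecified details: the exact modified Hamming window, the Gaussian filter widths, the number of discarded lower lags, the root exponent, and the placement of the compensators relative to the logarithm are all assumptions, and spectral mean normalization is omitted for brevity.

```python
import numpy as np
from scipy.fft import dct

def gaussian_fbank(n_filt, nfft, fs):
    """Gaussian-shaped filters centered on a mel-spaced grid (assumed design)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    centers = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filt + 2))[1:-1]
    bins = np.linspace(0.0, fs / 2.0, nfft // 2 + 1)
    width = (fs / 2.0) / n_filt
    return np.exp(-0.5 * ((bins[None, :] - centers[:, None]) / width) ** 2)

def robust_mfcc(sig, fs, min_lag=3, n_filt=24, n_ceps=13, nfft=512, root=0.5):
    sig = np.append(sig[0], sig[1:] - 0.97 * sig[:-1])   # pre-emphasis
    flen, fhop = int(0.025 * fs), int(0.010 * fs)
    win = np.hamming(flen)          # stand-in for the "modified Hamming" window
    fb = gaussian_fbank(n_filt, nfft, fs)
    logs = []
    for i in range(0, len(sig) - flen, fhop):
        frame = sig[i:i + flen] * win
        r = np.correlate(frame, frame, mode="full")[flen - 1:]  # lags >= 0
        r[:min_lag] = 0.0           # eliminate lower-order autocorrelation lags
        p = np.abs(np.fft.rfft(r, nfft)) ** 2     # power spectrum of the lags
        logs.append(np.log(fb @ p + 1e-12))       # Gaussian filter bank + log
    C = np.array(logs)
    C -= C.mean(axis=0)                           # mean-subtraction compensator
    C = np.sign(C) * np.abs(C) ** root            # root compensator (assumed form)
    return dct(C, type=2, axis=1, norm="ortho")[:, :n_ceps]   # final DCT
```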

    Investigation of Voice Pathology Detection and Classification on Different Frequency Regions Using Correlation Functions

    Automatic voice pathology detection and classification systems effectively contribute to the assessment of voice disorders, helping clinicians detect the existence of any voice pathology and identify the type of pathology from which a patient suffers at an early stage. This work concentrates on developing accurate and robust feature extraction for detecting and classifying voice pathologies by investigating different frequency bands using correlation functions. In this paper, we extracted maximum peak values and their corresponding lag values from each frame of a voiced signal by using correlation functions as features to detect and classify pathological samples. These features were investigated in different frequency bands to assess the contribution of each band to the detection and classification processes. Various samples of the sustained vowel /a/ from normal and pathological voices were extracted from three different databases: English, German, and Arabic. A support vector machine was used as the classifier. We also performed a t-test to investigate significant differences between the means of the normal and pathological samples. The best achieved accuracies in both detection and classification varied depending on the band, the correlation function, and the database. The most contributive bands in both detection and classification were between 1000 and 8000 Hz. In detection, the highest accuracies acquired using cross-correlation were 99.809%, 90.979%, and 91.168% on the Massachusetts Eye and Ear Infirmary, Saarbruecken Voice Database, and Arabic Voice Pathology Database databases, respectively. In classification, the highest accuracies acquired using cross-correlation were 99.255%, 98.941%, and 95.188% on the three databases, respectively.
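    A rough sketch of the per-band feature extraction this abstract describes: band-pass filter the signal, take a per-frame correlation function (autocorrelation here; the paper also evaluates cross-correlation), and record the maximum peak value and its lag. Band edges, frame sizes, the lag search range, and the utterance-level summary below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt
from sklearn.svm import SVC

def band_peak_lag_features(sig, fs, band, flen=400, fhop=200, min_lag=20):
    """Per-frame (max correlation peak, lag of that peak) in one band.

    `band` is (low_hz, high_hz); the lag search starts above `min_lag`
    to skip the trivial zero-lag maximum of the autocorrelation.
    """
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    x = sosfilt(sos, sig)
    feats = []
    for i in range(0, len(x) - flen, fhop):
        f = x[i:i + flen] * np.hamming(flen)
        r = np.correlate(f, f, mode="full")[flen - 1:]  # lags >= 0
        r = r / (r[0] + 1e-12)                          # normalize by energy
        k = min_lag + np.argmax(r[min_lag:])            # strongest real peak
        feats.append((r[k], k / fs))                    # peak value, lag (s)
    F = np.array(feats)
    return np.concatenate([F.mean(axis=0), F.std(axis=0)])  # utterance summary

# With X built from these per-band features over recordings and
# y in {0: normal, 1: pathological}:
# clf = SVC(kernel="rbf").fit(X_train, y_train)
```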