1,312 research outputs found

    Modeling Sub-Band Information Through Discrete Wavelet Transform to Improve Intelligibility Assessment of Dysarthric Speech

    The speech signal within a sub-band varies at a fine level depending on the type and level of dysarthria. The Mel-frequency filterbank used in computing cepstral coefficients smooths out this fine-level information in the higher frequency regions because the filters there have larger bandwidths. To capture the sub-band information, this paper first performs a four-level discrete wavelet transform (DWT) decomposition, which splits the input speech signal into approximation and detail coefficients at each level. For a given input speech signal, five speech signals representing different sub-bands are then reconstructed using the inverse DWT (IDWT). The log filterbank energies are computed by analyzing the short-term discrete Fourier transform magnitude spectra of each reconstructed signal using a 30-channel Mel filterbank. For each analysis frame, the log filterbank energies obtained across all reconstructed speech signals are pooled together, and a discrete cosine transform is applied to produce the cepstral feature, here termed the discrete wavelet transform reconstructed (DWTR) Mel-frequency cepstral coefficient (MFCC). An i-vector-based dysarthric level assessment system developed on the Universal Access speech corpus shows that the proposed DWTR-MFCC feature outperforms the conventional MFCC and several other cepstral features reported for a similar task. Using DWTR-MFCC improves the detection accuracy rate (DAR) of the dysarthric level assessment system in the text- and speaker-independent test case to 60.094% from the 56.646% MFCC baseline. Further analysis of the confusion matrices shows that the confusion among dysarthric classes differs considerably between MFCC and DWTR-MFCC features. Motivated by this observation, a two-stage classification approach exploiting the discriminating power of both kinds of features is proposed to improve the overall performance of the dysarthric level assessment system. The two-stage classification scheme further improves the DAR to 65.813% in the text- and speaker-independent test case.
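
The four-level decomposition and sub-band reconstruction step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the wavelet, so the Haar wavelet is assumed here for simplicity, and the function names (`haar_dwt`, `haar_idwt`, `decompose`, `reconstruct_subbands`) are hypothetical. The Mel filterbank and discrete cosine transform stages are omitted.

```python
import math

def haar_dwt(x):
    """One level of the orthonormal Haar DWT; len(x) must be even."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Invert one Haar level, interleaving the recovered sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

def decompose(x, levels=4):
    """Return [A_L, D_L, ..., D_1]; len(x) must be divisible by 2**levels."""
    approx = list(x)
    details = []
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)
    return [approx] + details[::-1]

def reconstruct_subbands(x, levels=4):
    """Reconstruct one time-domain signal per coefficient set.

    With levels=4 this yields five signals (A4, D4, D3, D2, D1),
    each the same length as x.
    """
    coeffs = decompose(x, levels)
    subbands = []
    for keep in range(len(coeffs)):
        # Zero every coefficient set except the one being reconstructed.
        masked = [c if i == keep else [0.0] * len(c)
                  for i, c in enumerate(coeffs)]
        signal = masked[0]
        for d in masked[1:]:
            signal = haar_idwt(signal, d)
        subbands.append(signal)
    return subbands
```

Because the transform is linear, the five reconstructed signals sum back to the original input, which is a convenient sanity check before passing each sub-band signal to the filterbank analysis.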

    Multi-modal association learning using spike-timing dependent plasticity (STDP)

    We propose an associative learning model that integrates facial images with speech signals to target a subject in a reinforcement learning (RL) paradigm. In this approach, the learning rules involve associating paired stimuli (stimulus–stimulus, i.e., face–speech), also known as predictor-choice pairs. Prior to a learning simulation, we extract the features of the biometrics used in the study. For facial features, we experiment with two approaches: principal component analysis (PCA)-based Eigenfaces and singular value decomposition (SVD). For speech features, we use wavelet packet decomposition (WPD). The experiments show that the PCA-based Eigenfaces feature extraction approach produces better results than SVD. We implement the proposed learning model using the Spike-Timing-Dependent Plasticity (STDP) algorithm, which depends on the time and rate of pre-post synaptic spikes. The key contribution of our study is the implementation of learning rules via STDP and firing rate in spatiotemporal neural networks based on the Izhikevich spiking model. We implement learning for response group association by following reward-modulated STDP in terms of RL, wherein the firing rate of the response groups determines the reward that is given. We perform a number of experiments that use existing face samples from the Olivetti Research Laboratory (ORL) dataset and speech samples from TIDigits. After several experiments and simulations are performed to recognize a subject, the results show that the proposed learning model can associate the predictor (face) with the choice (speech) at optimum performance rates of 77.26% and 82.66% for training and testing, respectively. We also perform learning using real data: an experiment is conducted on a sample of face–speech data collected in a manner similar to that of the initial data. The performance results are 79.11% and 77.33% for training and testing, respectively. Based on these results, the proposed learning model can produce high learning performance in terms of combining heterogeneous data (face–speech). This finding opens possibilities to expand RL in the field of biometric authentication.
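
The idea of reward-modulated STDP mentioned in this abstract can be illustrated with a common pair-based formulation. This is a minimal sketch, not the model used in the paper: the exponential timing window, all parameter values (`a_plus`, `a_minus`, `tau_plus`, `tau_minus`, the weight bounds) and the function names are assumptions chosen for illustration, and the reward here scales the update directly rather than acting through an eligibility trace or an Izhikevich neuron simulation.

```python
import math

def stdp_dw(dt_ms, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Pair-based STDP weight change for dt_ms = t_post - t_pre.

    Pre-before-post (dt_ms > 0) potentiates; post-before-pre
    (dt_ms < 0) depresses. The magnitude decays exponentially with
    the spike-time difference. All constants are illustrative.
    """
    if dt_ms > 0:
        return a_plus * math.exp(-dt_ms / tau_plus)
    if dt_ms < 0:
        return -a_minus * math.exp(dt_ms / tau_minus)
    return 0.0

def reward_modulated_update(w, dt_ms, reward, w_min=0.0, w_max=1.0):
    """Scale the STDP change by a scalar reward and clip the weight."""
    w_new = w + reward * stdp_dw(dt_ms)
    return min(w_max, max(w_min, w_new))
```

With a reward of zero the weight is left unchanged, a positive reward reinforces causal pre-then-post pairings, and a negative reward reverses the sign of the update.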

    The Use of EEG Signals For Biometric Person Recognition

    This work investigates EEG-based biometric recognition systems. One potential advantage of using EEG signals for person recognition is the difficulty of generating artificial signals with biometric characteristics, which makes spoofing EEG-based biometric systems a challenging task. However, more work needs to be done to overcome certain drawbacks that currently prevent the adoption of EEG biometrics in real-life scenarios: 1) the usually large number of sensors employed, 2) recognition rates that are still relatively low compared with some other biometric modalities, and 3) the template ageing effect. The thesis addresses the existing shortcomings of EEG biometrics and their possible solutions from three main perspectives: pre-processing, feature extraction and pattern classification. In pre-processing, task (stimuli) sensitivity and noise removal are investigated and discussed in separate chapters. For feature extraction, four novel features are proposed; for pattern classification, a new quality filtering method and a novel instance-based learning algorithm are described in their respective chapters. A self-collected database (Mobile Sensor Database) is employed to investigate some important biometric-specific effects (e.g. the template ageing effect and the use of low-cost sensors for recognition). In the pre-processing research, a training data accumulation scheme is developed, which improves recognition performance by combining the data of different mental tasks for training; a new wavelet-based de-noising method is also developed, and its effectiveness in person identification is found to be considerable. Two novel features based on Empirical Mode Decomposition and the Hilbert Transform are developed, which provide the best biometric performance amongst all the newly proposed features and other state-of-the-art features reported in the thesis; the other two newly developed wavelet-based features, while having slightly lower recognition accuracies, are computationally more efficient. The quality filtering algorithm is designed to employ the most informative EEG signal segments: experimental results indicate that using a small subset of the available data for feature training can yield a reasonable improvement in identification rate. The proposed instance-based template reconstruction learning algorithm has shown significant effectiveness when tested using both the publicly available and self-collected databases.

    An ongoing review of speech emotion recognition

    User emotional status recognition is becoming a key feature in advanced Human Computer Interfaces (HCI). A key source of emotional information is the spoken expression, which may be part of the interaction between the human and the machine. Speech emotion recognition (SER) is a very active area of research that involves the application of current machine learning and neural network tools. This ongoing review covers recent and classical approaches to SER reported in the literature. This work has been carried out with the support of project PID2020-116346GB-I00 funded by the Spanish MICIN