End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models
Speech activity detection (SAD) plays an important role in current speech
processing systems, including automatic speech recognition (ASR). SAD is
particularly difficult in environments with acoustic noise. A practical
solution is to incorporate visual information, increasing the robustness of the
SAD approach. An audiovisual system has the advantage of being robust to
different speech modes (e.g., whisper speech) or background noise. Recent
advances in audiovisual speech processing using deep learning have opened
opportunities to capture in a principled way the temporal relationships between
acoustic and visual features. This study explores this idea proposing a
\emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach
models the temporal dynamics of the sequential audiovisual data, improving the
accuracy and robustness of the proposed SAD system. Instead of relying on
hand-crafted features, the study investigates an end-to-end training approach
in which acoustic and visual features are learned directly from the raw data
during training. The experimental evaluation considers a large audiovisual
corpus with over 60.8 hours of recordings, collected from 105 speakers. The
results demonstrate that, under practical scenarios, the proposed framework
yields absolute improvements of up to 1.2% over an audio-only voice activity
detection (VAD) baseline implemented with a deep neural network (DNN). The
proposed approach achieves a 92.7% F1-score when evaluated using the sensors
of a portable tablet in a noisy acoustic environment, which is only 1.0% lower
than the performance obtained under ideal conditions (e.g., clean speech
captured with a high-definition camera and a close-talking microphone).
Comment: Submitted to Speech Communication
Deep fusion of multi-channel neurophysiological signal for emotion recognition and monitoring
How to fuse multi-channel neurophysiological signals for emotion recognition is emerging as a hot research topic in the Computational Psychophysiology community. Nevertheless, prior feature-engineering-based approaches require extracting various domain-knowledge-related features at a high time cost, and traditional fusion methods cannot fully utilise the correlation information between different channels and frequency components. In this paper, we design a hybrid deep learning model in which a Convolutional Neural Network (CNN) is utilised for extracting task-related features and mining inter-channel and inter-frequency correlations, while a Recurrent Neural Network (RNN) is appended to integrate contextual information from the frame cube sequence. Experiments are carried out on a trial-level emotion recognition task on the DEAP benchmark dataset. Experimental results demonstrate that the proposed framework outperforms the classical methods on both the Valence and Arousal emotional dimensions.
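The "frame cube" representation that feeds such a CNN+RNN hybrid can be pictured as arranging each frame's per-channel band-power features onto a 2D electrode grid, yielding one image-like cube per frame for the CNN and a cube sequence for the RNN. A minimal sketch, with a hypothetical electrode layout and grid size:

```python
import numpy as np

def build_frame_cubes(band_power, layout, grid=(9, 9)):
    """Arrange per-channel band-power features into 2D 'frame cubes'.

    band_power: (T, C, B) array — T frames, C channels, B frequency bands
    layout:     list of (row, col) grid positions, one per channel
    returns:    (T, B, H, W) cube sequence — per-frame CNN input, which an
                RNN then reads in temporal order.
    """
    T, C, B = band_power.shape
    H, W = grid
    cubes = np.zeros((T, B, H, W))
    for c, (r, col) in enumerate(layout):
        # Each band becomes one "image channel" at the electrode's position.
        cubes[:, :, r, col] = band_power[:, c, :]
    return cubes

# Toy example: 4 channels, 3 bands (e.g., theta/alpha/beta), 10 frames.
layout = [(0, 4), (4, 0), (4, 8), (8, 4)]   # hypothetical positions
cubes = build_frame_cubes(np.random.rand(10, 4, 3), layout)
print(cubes.shape)  # (10, 3, 9, 9)
```

A real layout would follow the 10-20 electrode positions; the point of the sketch is only how spatial structure is restored before convolution.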
Noise removal methods on ambulatory EEG: A Survey
Over many decades, research has been attempted on the removal of noise in
ambulatory EEG. An enormous number of research papers has been published on
the identification and removal of noise, and it is difficult to present a
detailed review of all this literature. Therefore, in this paper, an attempt
has been made to review the detection and removal of noise. More than 100
research papers have been discussed to discern the techniques for detecting
and removing noise in ambulatory EEG. Further, the literature survey shows
that the pattern recognition required to detect ambulatory artifacts, such as
eye opening and closing, varies with different conditions of the EEG
datasets. This is mainly due to the fact that EEG recorded under different
conditions has different characteristics. This, in turn, necessitates the
identification of pattern recognition techniques that can effectively
distinguish EEG noise from EEG data recorded under various conditions.
STILN: A Novel Spatial-Temporal Information Learning Network for EEG-based Emotion Recognition
The spatial correlations and the temporal contexts are indispensable in
Electroencephalogram (EEG)-based emotion recognition. However, the learning of
complex spatial correlations among several channels is a challenging problem.
Besides, learning temporal contexts is beneficial for emphasizing the critical
EEG frames, because the subjects only reach the prospective emotion during part
of the stimuli. Hence, we propose a novel Spatial-Temporal Information Learning
Network (STILN) to extract the discriminative features by capturing the spatial
correlations and temporal contexts. Specifically, the generated 2D power
topographic maps capture the dependencies among electrodes, and they are fed to
the CNN-based spatial feature extraction network. Furthermore, Convolutional
Block Attention Module (CBAM) recalibrates the weights of power topographic
maps to emphasize the crucial brain regions and frequency bands. Meanwhile,
Batch Normalizations (BNs) and Instance Normalizations (INs) are appropriately
combined to relieve the individual differences. In the temporal contexts
learning, we adopt the Bidirectional Long Short-Term Memory Network (Bi-LSTM)
network to capture the dependencies among the EEG frames. To validate the
effectiveness of the proposed method, subject-independent experiments are
conducted on the public DEAP dataset. The proposed method achieves
outstanding performance: the accuracies of arousal and valence
classification reach 0.6831 and 0.6752, respectively.
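The channel-attention branch of CBAM, which the abstract uses to recalibrate the power topographic maps, can be sketched as pooling each channel to a descriptor, passing the average- and max-pooled descriptors through a shared two-layer MLP, and rescaling the channels with a sigmoid gate. The sketch below uses random weights and a toy map size; it illustrates the mechanism, not the paper's configuration.

```python
import numpy as np

def channel_attention(fmap, reduction=2, seed=0):
    """CBAM-style channel attention over a (C, H, W) feature map.

    Average- and max-pooled channel descriptors pass through a shared
    two-layer MLP; their sum, squashed by a sigmoid, rescales each channel.
    Weights are random stand-ins for trained parameters.
    """
    rng = np.random.default_rng(seed)
    C, H, W = fmap.shape
    W1 = rng.normal(0, 0.1, (C, C // reduction))
    W2 = rng.normal(0, 0.1, (C // reduction, C))

    def mlp(v):
        return np.maximum(v @ W1, 0) @ W2   # shared ReLU-hidden MLP

    avg = fmap.mean(axis=(1, 2))            # (C,) average-pooled descriptor
    mx = fmap.max(axis=(1, 2))              # (C,) max-pooled descriptor
    att = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))   # sigmoid gate, (C,)
    return fmap * att[:, None, None]        # recalibrated map

out = channel_attention(np.random.rand(4, 9, 9))
print(out.shape)  # (4, 9, 9)
```

In STILN this gate would emphasize informative brain regions and frequency bands before the spatial CNN; the full CBAM also has a spatial-attention branch, omitted here for brevity.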
Comparative Analysis of MLP, CNN, and RNN Models in Automatic Speech Recognition: Dissecting Performance Metrics
This study conducts a comparative analysis of three prominent machine learning models: Multi-Layer Perceptrons (MLP), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) in the field of automatic speech recognition (ASR). This research is distinct in its use of the LibriSpeech 'test-clean' dataset, selected for its diversity in speaker accents and varied recording conditions, establishing it as a robust benchmark for ASR performance evaluation. Our approach involved preprocessing the audio data to ensure consistency and extracting Mel-Frequency Cepstral Coefficients (MFCCs) as the primary features, crucial for capturing the nuances of human speech. The models were meticulously configured with specific architectural details and hyperparameters. The MLP and CNN models were designed to maximize their pattern recognition capabilities, while the RNN (LSTM) was optimized for processing temporal data. To assess their performance, we employed metrics such as precision, recall, and F1-score. The MLP and CNN models demonstrated exceptional accuracy, with scores of 0.98 across these metrics, indicating their effectiveness in feature extraction and pattern recognition. In contrast, the LSTM variant of RNN showed lower efficacy, with scores below 0.60, highlighting the challenges in handling sequential speech data. The results of this study shed light on the differing capabilities of these models in ASR. While the high accuracy of MLP and CNN suggests potential overfitting, the underperformance of LSTM underscores the necessity for further refinement in sequential data processing. This research contributes to the understanding of various machine learning approaches in ASR and paves the way for future investigations. We propose exploring hybrid model architectures and enhancing feature extraction methods to develop more sophisticated, real-world ASR systems. 
Additionally, our findings underscore the importance of considering model-specific strengths and limitations in ASR applications, guiding the direction of future research in this rapidly evolving field.
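The MFCC features used as the primary input in this comparison can be sketched end to end. The code below follows the standard pipeline (framing, Hann window, power spectrum, triangular mel filterbank, log, DCT-II) with typical but assumed parameter values; it is an illustration of the feature, not a substitute for a tuned extractor such as librosa's.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Minimal MFCC extraction: frame -> window -> power spectrum ->
    mel filterbank -> log -> DCT-II. Parameter values are typical defaults."""
    # 1. Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i*hop : i*hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Triangular mel filterbank.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 4. DCT-II decorrelates the log-mel energies; keep the first n_mfcc.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T   # (n_frames, n_mfcc)

feats = mfcc(np.random.randn(16000))   # 1 s of noise at 16 kHz
print(feats.shape)  # (97, 13)
```

Per-frame vectors like these suit an MLP or CNN directly, while the sequence of frames is what an RNN/LSTM consumes, which is the axis along which the study's comparison runs.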