
    Deep Multilayer Perceptrons for Dimensional Speech Emotion Recognition

    Modern deep learning architectures are ordinarily run on high-performance computing facilities because of the large size of their input features and the complexity of their models. This paper proposes traditional multilayer perceptrons (MLPs) with deep layers and a small input size to overcome that computational limitation. The results show that our proposed deep MLP outperformed modern deep learning architectures, i.e., LSTM and CNN, with the same number of layers and parameters. The deep MLP achieved the highest performance in both speaker-dependent and speaker-independent scenarios on the IEMOCAP and MSP-IMPROV corpora.
    Comment: 2 figures, 4 tables, submitted to EUSIPCO 2020
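    As a rough illustration of the architecture described in this abstract, the sketch below builds a small-input deep MLP regressor for dimensional emotion attributes in PyTorch. The input dimensionality, hidden width, depth, and the three-attribute output head are illustrative assumptions, not the paper's reported configuration.

```python
# A minimal sketch (not the authors' exact configuration) of a deep MLP
# regressor for dimensional emotion attributes (e.g., valence, arousal,
# dominance). All sizes below are assumptions for illustration.
import torch
import torch.nn as nn

class DeepMLP(nn.Module):
    def __init__(self, input_dim=68, hidden_dim=256, num_hidden_layers=4,
                 num_attributes=3):
        super().__init__()
        layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU()]
        for _ in range(num_hidden_layers - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers.append(nn.Linear(hidden_dim, num_attributes))  # attribute scores
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = DeepMLP()
features = torch.randn(8, 68)   # batch of 8 utterance-level feature vectors
predictions = model(features)   # shape: (8, 3)
print(predictions.shape)
```

    The appeal of this design is cost: a plain stack of linear layers over a compact utterance-level feature vector is far cheaper to train and deploy than a recurrent or convolutional model over long feature sequences.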

    End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

    Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whisper speech) and to background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture, in a principled way, the temporal relationships between acoustic and visual features. This study explores this idea by proposing a bimodal recurrent neural network (BRNN) framework for SAD. The approach models the temporal dynamics of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of relying on hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are learned directly from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements of up to 1.2% under practical scenarios over a voice activity detection (VAD) baseline using only audio, implemented with a deep neural network (DNN). The proposed approach achieves a 92.7% F1-score when evaluated using the sensors of a portable tablet in a noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech captured with a high-definition camera and a close-talking microphone).
    Comment: Submitted to Speech Communication
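    A minimal sketch of the bimodal idea follows, assuming precomputed frame-level features for brevity (the paper itself learns acoustic and visual features end-to-end from the raw data): one recurrent encoder per modality, with the two streams fused before a per-frame speech/non-speech decision. All feature dimensions and layer sizes are assumptions.

```python
# A minimal sketch of a bimodal recurrent model for speech activity detection:
# separate LSTM encoders per modality, late fusion, and a per-frame logit.
import torch
import torch.nn as nn

class BimodalRNN(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=64, hidden_dim=128):
        super().__init__()
        self.audio_rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.visual_rnn = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)  # per-frame speech logit

    def forward(self, audio, visual):
        a, _ = self.audio_rnn(audio)        # (batch, time, hidden)
        v, _ = self.visual_rnn(visual)      # (batch, time, hidden)
        fused = torch.cat([a, v], dim=-1)   # late fusion of the two streams
        return self.classifier(fused).squeeze(-1)  # (batch, time) logits

audio = torch.randn(2, 100, 40)    # e.g., 100 frames of filterbank features
visual = torch.randn(2, 100, 64)   # e.g., 100 frames of lip-region features
logits = BimodalRNN()(audio, visual)
speech_prob = torch.sigmoid(logits)  # per-frame speech activity probability
```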

    Semi-Supervised Speech Emotion Recognition with Ladder Networks

    Speech emotion recognition (SER) systems find applications in various fields such as healthcare, education, and security and defense. A major drawback of these systems is their lack of generalization across different conditions. This problem can be solved by training models on large amounts of labeled data from the target domain, which is expensive and time-consuming. Another approach is to increase the generalization of the models. An effective way to achieve this goal is to regularize the models through multitask learning (MTL), where auxiliary tasks are learned along with the primary task. These methods often require labels for the auxiliary tasks (gender, speaker identity, age, or other emotional descriptors), which are expensive and time-consuming to collect for emotion recognition. This study proposes the use of ladder networks for emotion recognition, which rely on an unsupervised auxiliary task. The primary task is a regression problem to predict emotional attributes. The auxiliary task is the reconstruction of intermediate feature representations using a denoising autoencoder. This auxiliary task does not require labels, so it is possible to train the framework in a semi-supervised fashion with abundant unlabeled data from the target domain. This study shows that the proposed approach creates a powerful framework for SER, outperforming fully supervised single-task learning (STL) and MTL baselines. The approach is implemented with several acoustic feature sets, showing that ladder networks generalize significantly better in cross-corpus settings. Compared to the STL baselines, the proposed approach achieves relative gains in concordance correlation coefficient (CCC) between 3.0% and 3.5% for within-corpus evaluations, and between 16.1% and 74.1% for cross-corpus evaluations, highlighting the power of the architecture.
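    The sketch below illustrates the two training signals described in this abstract: a supervised loss based on the concordance correlation coefficient, CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), on labeled utterances, plus an unsupervised denoising-reconstruction loss on unlabeled data. It is a simplified stand-in for the ladder network, which reconstructs representations at several intermediate layers; the single reconstruction term and the weight alpha are assumptions.

```python
# A minimal sketch of semi-supervised training with a CCC-based regression
# loss plus a denoising-reconstruction loss; not the paper's exact ladder
# network, which reconstructs several intermediate layers.
import torch

def ccc(pred, gold):
    """Concordance correlation coefficient between two 1-D tensors."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    pred_var = ((pred - pred_mean) ** 2).mean()
    gold_var = ((gold - gold_mean) ** 2).mean()
    cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
    return 2 * cov / (pred_var + gold_var + (pred_mean - gold_mean) ** 2)

def semi_supervised_loss(pred, gold, reconstruction, clean_features, alpha=0.1):
    """Supervised (1 - CCC) term plus weighted reconstruction term."""
    supervised = 1 - ccc(pred, gold)                                # labeled data
    unsupervised = ((reconstruction - clean_features) ** 2).mean()  # unlabeled data
    return supervised + alpha * unsupervised

# Toy usage with random tensors standing in for model outputs and targets.
pred, gold = torch.randn(16), torch.randn(16)
recon, clean = torch.randn(16, 40), torch.randn(16, 40)
print(semi_supervised_loss(pred, gold, recon, clean))
```

    Because the reconstruction term needs no labels, it can be computed on abundant unlabeled target-domain data, which is what gives the approach its cross-corpus robustness.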