
    Learning temporal clusters using capsule routing for speech emotion recognition

    Emotion recognition from speech plays a significant role in adding emotional intelligence to machines and making human-machine interaction more natural. One of the key challenges from a machine learning standpoint is to extract patterns which bear maximum correlation with the emotion information encoded in the signal while being as insensitive as possible to the other types of information carried by speech. In this paper, we propose a novel temporal modelling framework for robust emotion classification using a bidirectional long short-term memory network (BLSTM), a CNN and capsule networks. The BLSTM deals with the temporal dynamics of the speech signal by effectively representing forward/backward contextual information, while the CNN, together with the dynamic routing of the capsule network, learns temporal clusters; altogether these provide a state-of-the-art technique for classifying the extracted patterns. The proposed approach was compared with a wide range of architectures on the FAU-Aibo and RAVDESS corpora, and remarkable gains over state-of-the-art systems were obtained. On FAU-Aibo and RAVDESS, 77.6% and 56.2% accuracy was achieved, respectively, which is 3% and 14% (absolute) higher than the best reported results for the respective tasks.
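    To make the pipeline concrete, below is a minimal PyTorch sketch of a BLSTM feeding a 1-D CNN whose outputs are grouped into primary capsules and routed, by dynamic routing-by-agreement, to one output capsule per emotion class. All layer sizes, the pooling choice and the number of routing iterations are illustrative assumptions, not the paper's configuration.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def squash(s, dim=-1, eps=1e-8):
        # Capsule non-linearity: squashes vector length into (0, 1), keeps direction.
        n2 = (s ** 2).sum(dim=dim, keepdim=True)
        return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

    class BLSTMCapsuleSER(nn.Module):
        """Hypothetical BLSTM -> CNN -> capsule classifier for emotion recognition."""
        def __init__(self, n_feats=40, n_classes=4, n_primary=8, d_primary=8,
                     d_out=16, iters=3):
            super().__init__()
            self.blstm = nn.LSTM(n_feats, 64, bidirectional=True, batch_first=True)
            self.conv = nn.Conv1d(128, n_primary * d_primary, kernel_size=3, padding=1)
            self.n_primary, self.d_primary = n_primary, d_primary
            # One transformation matrix per (primary capsule, output class) pair.
            self.W = nn.Parameter(0.01 * torch.randn(1, n_primary, n_classes,
                                                     d_out, d_primary))
            self.iters, self.n_classes = iters, n_classes

        def forward(self, x):                      # x: (batch, time, n_feats)
            h, _ = self.blstm(x)                   # (B, T, 128) fwd/bwd context
            c = self.conv(h.transpose(1, 2))       # (B, n_primary*d_primary, T)
            c = c.mean(dim=2)                      # pool over time
            u = squash(c.view(-1, self.n_primary, self.d_primary))
            # u_hat[b, i, j]: capsule i's prediction for class capsule j.
            u_hat = torch.einsum('bnjdp,bnp->bnjd',
                                 self.W.expand(u.size(0), -1, -1, -1, -1), u)
            b = torch.zeros(u.size(0), self.n_primary, self.n_classes,
                            device=x.device)
            for _ in range(self.iters):            # dynamic routing-by-agreement
                a = F.softmax(b, dim=2)
                v = squash((a.unsqueeze(-1) * u_hat).sum(dim=1))
                b = b + torch.einsum('bnjd,bjd->bnj', u_hat, v)
            return v.norm(dim=-1)                  # capsule length = class score
    ```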

    Insights on neural representations for end-to-end speech recognition

    End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation. However, there are limited tools available for understanding the internal functions and the effect of hierarchical dependencies within such models. It is crucial to understand the correlations between layer-wise representations in order to derive insights into the relationship between neural representations and performance. Correlation-based analyses of network similarity have not previously been explored for end-to-end ASR models. This paper analyses the internal dynamics between layers during training for CNN-, LSTM- and Transformer-based approaches, using canonical correlation analysis (CCA) and centered kernel alignment (CKA). It was found that neural representations within CNN layers exhibit hierarchical correlation dependencies as layer depth increases, but this is mostly limited to cases where the neural representations correlate closely. This behaviour is not observed in the LSTM architecture, although a bottom-up pattern is observed across the training process, while Transformer encoder layers exhibit irregular correlation-coefficient patterns as depth increases. Altogether, these results provide new insights into the role that neural architectures play in speech recognition performance. More specifically, these techniques can be used as indicators when building better-performing speech recognition models.
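    Linear CKA, one of the two similarity measures used here, has a compact closed form. The sketch below, with synthetic activations whose shapes are assumptions, shows how two layers' representations can be compared over the same set of examples:

    ```python
    import numpy as np

    def linear_cka(X, Y):
        """Linear centered kernel alignment between two activation matrices.

        X: (n_examples, d1) activations of one layer; Y: (n_examples, d2) of another.
        Returns a similarity in [0, 1]; 1 means identical representations up to an
        orthogonal transform and isotropic scaling.
        """
        X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
        Y = Y - Y.mean(axis=0, keepdims=True)
        hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
        return hsic / (np.linalg.norm(X.T @ X, 'fro')
                       * np.linalg.norm(Y.T @ Y, 'fro'))

    # Example: compare two hypothetical encoder layers on 512 utterance frames.
    rng = np.random.default_rng(0)
    layer_a = rng.standard_normal((512, 256))
    layer_b = layer_a @ rng.standard_normal((256, 256))   # transform of layer_a
    print(linear_cka(layer_a, layer_a))   # 1.0 by construction
    print(linear_cka(layer_a, layer_b))   # high, below 1 for a generic transform
    ```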

    Empirical interpretation of speech emotion perception with an attention-based model for speech emotion recognition

    Speech emotion recognition is essential for obtaining emotional intelligence, which affects the understanding of the context and meaning of speech. Harmonically structured vowel and consonant sounds add indexical and linguistic cues to spoken information. Previous research has argued, from psychological and linguistic points of view, that vowel sound cues are more important in carrying emotional context. Other research has claimed that emotion information can reside in small, overlapping acoustic cues. However, these claims have not been corroborated in computational speech emotion recognition systems. In this research, a convolution-based model and a long short-term memory-based model, both using attention, are applied to investigate these theories of speech emotion on computational models. The role of acoustic context and word importance is demonstrated for the task of speech emotion recognition. The proposed models are evaluated on the IEMOCAP corpus, and 80.1% unweighted accuracy is achieved on purely acoustic data, which is higher than current state-of-the-art models on this task. The phones and words are mapped to the attention vectors, and it is seen that vowel sounds are more important than consonants for defining emotional acoustic cues, and that the model can assign word importance based on acoustic context.
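    The attention mechanism at the heart of both models can be illustrated with a generic frame-level attentive-pooling layer; the sketch below is an assumption about the general form, not the paper's exact architecture. The returned weights are what one would align with phone and word boundaries.

    ```python
    import torch
    import torch.nn as nn

    class AttentivePooling(nn.Module):
        """Frame-level attention pooling (a generic sketch).

        Each frame receives a scalar relevance score; the softmax-normalised
        scores form the attention vector that can be aligned with phone/word
        spans from a forced alignment.
        """
        def __init__(self, d_model=128, n_classes=4):
            super().__init__()
            self.score = nn.Linear(d_model, 1)
            self.classify = nn.Linear(d_model, n_classes)

        def forward(self, h):                     # h: (batch, time, d_model)
            alpha = torch.softmax(self.score(h).squeeze(-1), dim=1)   # (B, T)
            context = torch.einsum('bt,btd->bd', alpha, h)   # weighted sum
            return self.classify(context), alpha  # logits + attention weights

    # Overlaying `alpha` on a forced alignment lets one ask whether vowel
    # frames receive more attention mass than consonant frames.
    ```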

    An end-to-end deep neural network for facial emotion classification

    Facial emotional expression is a nonverbal communication medium in human-human communication. Facial expression recognition (FER) is a significantly challenging task in computer vision. With the advent of deep neural networks, facial expression recognition has transitioned from lab-controlled settings to in-the-wild environments. However, deep neural networks (DNNs) suffer from overfitting and from bias towards specific categorical distributions: the number of samples per emotion category is heavily imbalanced, and some emotions are represented by far fewer samples than others. In this paper, we propose an end-to-end convolutional self-attention framework for classifying facial emotions. The convolutional neural network (CNN) layers capture the spatial features in a given frame, and a convolutional self-attention mechanism is applied to obtain spatiotemporal features and perform context modelling. The AffectNet database is used to validate the framework; it contains a large number of image samples in in-the-wild settings, which makes it very challenging. The result shows a 30% improvement in accuracy over the CNN baseline.
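    A minimal sketch of the convolutional self-attention idea is given below: convolutional features are flattened into one token per spatial location and self-attention models context across locations. The layer sizes are assumptions; only the 8-class output reflects AffectNet's categorical labels.

    ```python
    import torch
    import torch.nn as nn

    class ConvSelfAttention(nn.Module):
        """Illustrative CNN + self-attention head for facial emotion classification."""
        def __init__(self, n_classes=8, d=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, d, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=4,
                                              batch_first=True)
            self.head = nn.Linear(d, n_classes)

        def forward(self, x):                      # x: (B, 3, H, W) face crops
            f = self.features(x)                   # (B, d, H/4, W/4) spatial maps
            tokens = f.flatten(2).transpose(1, 2)  # one token per spatial location
            a, _ = self.attn(tokens, tokens, tokens)  # context across locations
            return self.head(a.mean(dim=1))        # pool attended tokens, classify
    ```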

    Investigating deep neural structures and their interpretability in the domain of voice conversion

    Generative adversarial networks (GANs) are machine learning networks built around creating synthetic data. Voice conversion (VC) is a subset of voice translation that involves translating the paralinguistic features of a source speaker to a target speaker while preserving the linguistic information. The aim of non-parallel conditional GANs for VC is to translate an acoustic speech feature sequence from one domain to another without the use of paired data. In the study reported here, we investigated the interpretability of state-of-the-art implementations of non-parallel GANs in the domain of VC. We show that the learned representations in the repeating layers of a particular GAN architecture remain close to their randomly initialised parameters, demonstrating that it is the number of repeating layers that is chiefly responsible for the quality of the output. We also analysed the learned representations of a model trained on one dataset when used for transfer learning on another dataset; this likewise showed high levels of similarity in the repeating layers. Together, these results provide new insight into how the learned representations of deep generative networks change during learning and into the importance of the number of layers, which should help in building better GAN-based voice conversion models.
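    The core measurement, how far a layer's trained weights drift from their random initialisation, can be sketched as follows. The stand-in generator and the cosine-similarity metric are illustrative assumptions; the study itself analysed the repeating layers of non-parallel VC GANs.

    ```python
    import copy
    import torch
    import torch.nn as nn

    def weight_drift(model_init, model_trained):
        """Cosine similarity between each layer's initial and trained weights.

        Values near 1.0 indicate a layer stayed close to its random
        initialisation, the behaviour reported for the repeating layers.
        """
        sims = {}
        for (name, w0), (_, w1) in zip(model_init.named_parameters(),
                                       model_trained.named_parameters()):
            sims[name] = torch.cosine_similarity(w0.flatten(), w1.flatten(),
                                                 dim=0).item()
        return sims

    # Usage sketch with a stand-in generator (the real study used a VC GAN):
    gen = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
    gen_init = copy.deepcopy(gen)     # snapshot before training
    # ... train `gen` ...
    print(weight_drift(gen_init, gen))
    ```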

    Population parameters of Rastrelliger kanagurta (Cuvier, 1816) in the Marudu Bay, Sabah, Malaysia

    An investigation of the population parameters of the Indian mackerel, Rastrelliger kanagurta (Cuvier, 1816), in Marudu Bay, Sabah, Malaysia was carried out from January to September 2013. The relationship between total length and body weight was estimated as W = 0.006TL^3.215, or log W = 3.215 log TL − 2.22 (R^2 = 0.946). Monthly length-frequency data for R. kanagurta were analysed with the FiSAT software to evaluate mortality rates and the exploitation level. The asymptotic length (L∞) and growth coefficient (K) were estimated at 27.83 cm and 1.50 yr^-1, respectively. The growth performance index (φ') was calculated as 3.07. Total mortality (Z), natural mortality (M) and fishing mortality (F) were estimated at 4.44 yr^-1, 2.46 yr^-1 and 1.98 yr^-1, respectively. The exploitation level (E) of R. kanagurta was found to be 0.45, below the optimum level of exploitation (E = 0.50), indicating that the stock of R. kanagurta in Marudu Bay is still underexploited.
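    The reported estimates are internally consistent under the standard relations used in this kind of FiSAT analysis (φ' = log10 K + 2 log10 L∞, Z = M + F, E = F/Z), as the short check below shows; the 20 cm example fish is hypothetical.

    ```python
    import math

    # Reported estimates from the abstract.
    L_inf, K = 27.83, 1.50     # asymptotic length (cm), growth coefficient (yr^-1)
    M, F = 2.46, 1.98          # natural and fishing mortality (yr^-1)

    # Growth performance index: phi' = log10(K) + 2*log10(L_inf).
    phi = math.log10(K) + 2 * math.log10(L_inf)
    print(round(phi, 2))       # 3.07, matching the reported value

    # Total mortality and exploitation rate: Z = M + F, E = F / Z.
    Z = M + F
    print(Z, round(F / Z, 2))  # 4.44 and 0.45 -> below the E = 0.50 optimum

    # Length-weight relationship: W = 0.006 * TL^3.215 (TL in cm).
    print(round(0.006 * 20.0 ** 3.215, 1))  # weight of a hypothetical 20 cm fish
    ```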

    Dual stream spatio-temporal motion fusion with self-attention for action recognition

    Human action recognition in diverse and realistic environments is a challenging task. Automatic classification of actions and gestures has a significant impact on human-robot interaction and human-machine interaction technologies. Owing to the prevalence of complex real-world problems, it is non-trivial to produce a rich representation of actions and an effective categorical distribution over large numbers of action classes. Deep convolutional neural networks have achieved great success in this area, and many researchers have proposed deep neural architectures for action recognition that consider both the spatial and the temporal aspects of the action. This research proposes a dual-stream spatiotemporal fusion architecture for human action classification in which the spatial and temporal data are fused using an attention mechanism. We investigate two fusion techniques and show that the proposed architecture achieves accurate results with far fewer parameters than traditional deep neural networks. We achieved 99.1% accuracy on the UCF-101 test set.
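    One simple way to fuse two streams with attention is a learned gate over the stream embeddings, sketched below. This is an illustrative assumption about the general form of attention fusion, not one of the paper's two specific fusion techniques.

    ```python
    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        """Illustrative attention fusion of spatial (RGB) and temporal (motion)
        stream embeddings. Encoders and sizes are placeholders."""
        def __init__(self, d=256, n_classes=101):   # 101 classes as in UCF-101
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * d, 2), nn.Softmax(dim=-1))
            self.head = nn.Linear(d, n_classes)

        def forward(self, spatial, temporal):       # each: (B, d) embeddings
            # Per-example weights deciding how much each stream contributes.
            w = self.gate(torch.cat([spatial, temporal], dim=-1))   # (B, 2)
            fused = w[:, :1] * spatial + w[:, 1:] * temporal  # convex combination
            return self.head(fused)
    ```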

    Removing Bias with Residual Mixture of Multi-View Attention for Speech Emotion Recognition

    Speech emotion recognition is essential for obtaining emotional intelligence, which affects the understanding of the context and meaning of speech. The fundamental challenge of speech emotion recognition from a machine learning standpoint is to extract the patterns which carry maximum correlation with the emotion information encoded in the signal while being as insensitive as possible to the other types of information carried by speech. In this paper, a novel recurrent residual temporal context modelling framework is proposed. The framework includes a mixture of multi-view attention smoothing and high-dimensional feature projection for context expansion and for learning feature representations. The framework is designed to be robust to changes in speaker and to other distortions, and it provides state-of-the-art results for speech emotion recognition. The performance of the proposed approach is compared with a wide range of current architectures on a standard 4-class classification task on the widely used IEMOCAP corpus, where a significant improvement of 4% unweighted accuracy over state-of-the-art systems is observed. Additionally, the attention vectors are aligned with the input segments and plotted at two different attention levels to demonstrate their effectiveness.
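    A rough sketch of what a recurrent residual block with a mixture of attention views might look like is given below; all names, sizes and the averaging-based smoothing are assumptions made for illustration, not the paper's design.

    ```python
    import torch
    import torch.nn as nn

    class ResidualRecurrentBlock(nn.Module):
        """Sketch: recurrent layer with a residual connection and a mixture of
        attention 'views' whose softmaxed scores are averaged (smoothed)."""
        def __init__(self, d=128, n_views=3):
            super().__init__()
            self.rnn = nn.GRU(d, d // 2, bidirectional=True, batch_first=True)
            self.views = nn.ModuleList(nn.Linear(d, 1) for _ in range(n_views))

        def forward(self, x):                      # x: (B, T, d) frame features
            h, _ = self.rnn(x)
            h = h + x                              # residual temporal context
            # Each view scores frames differently; averaging the softmaxed
            # scores smooths any single view's hard focus.
            alphas = torch.stack([torch.softmax(v(h).squeeze(-1), dim=1)
                                  for v in self.views])       # (n_views, B, T)
            alpha = alphas.mean(dim=0)                        # mixture smoothing
            return torch.einsum('bt,btd->bd', alpha, h)       # utterance embedding
    ```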

    American Sign Language posture understanding with deep neural networks

    Sign language is a visually oriented, natural, nonverbal communication medium. Sharing similar linguistic properties with its respective spoken language, it consists of a set of gestures, postures and facial expressions. Although sign language is a mode of communication among deaf people, most hearing people cannot interpret it, so it would be constructive to translate sign postures automatically. In this paper, a capsule-based deep neural network sign posture translator for American Sign Language (ASL) fingerspelling (postures) is presented. Performance validation shows that the approach can successfully identify sign language postures with an accuracy of approximately 99%. Unlike previous neural network approaches, which mainly relied on fine-tuning and transfer learning from pre-trained models, the developed capsule network architecture does not require a pre-trained model. The framework uses a capsule network with adaptive pooling, which is the key to its high accuracy. The framework is not limited to sign language understanding; it also has scope for non-verbal communication in human-robot interaction (HRI).
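    One plausible reading of the adaptive-pooling idea, making the number of primary capsules independent of the input resolution, can be sketched as follows; the layer sizes and capsule dimensions are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn

    class PrimaryCapsWithAdaptivePooling(nn.Module):
        """Sketch: adaptive pooling maps any input size to a fixed grid, so the
        primary-capsule count is constant regardless of image resolution."""
        def __init__(self, d_caps=8, grid=4):
            super().__init__()
            self.conv = nn.Sequential(nn.Conv2d(1, 64, 5, padding=2), nn.ReLU())
            self.pool = nn.AdaptiveAvgPool2d(grid)   # any H x W -> grid x grid
            self.d_caps = d_caps

        def forward(self, x):                        # x: (B, 1, H, W) hand image
            f = self.pool(self.conv(x))              # (B, 64, grid, grid)
            # Regroup pooled features into capsule vectors: (B, n_caps, d_caps).
            return (f.flatten(2).transpose(1, 2)
                     .reshape(x.size(0), -1, self.d_caps))
    ```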

    Measurement of light output of NE213 and NE102A detectors for 2.7-14.5 MeV neutrons

    The light output of 125-mm-diameter NE213 and NE102A detectors has been measured for neutron energies ranging from 2.7 to 14.5 MeV. For neutron energies below 6.14 MeV, measurements were carried out using the neutron time-of-flight spectrum from an Am-Be neutron source, while for neutron energies above 6.14 MeV, measurements were carried out using neutrons produced by the T(d,n) reaction. For the NE102A detector the measured light output is in good agreement with the data of R.A. Cecil et al. (1979), but for the NE213 detector the measured light output is 2-15% lower than their data for a similar detector; it agrees with the data of V. Verbinski et al. (1968).
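    For reference, the time-of-flight technique recovers neutron kinetic energy from the flight time over a known path via the nonrelativistic relation E_n = (1/2) m_n (d/t)^2. The 1 m path and 44 ns flight time below are hypothetical numbers for illustration, not the experiment's geometry.

    ```python
    # Nonrelativistic time-of-flight estimate of neutron kinetic energy.
    M_N_MEV = 939.565          # neutron rest mass (MeV/c^2)
    C = 299792458.0            # speed of light (m/s)

    def neutron_energy_mev(path_m, tof_s):
        beta = (path_m / tof_s) / C      # v/c for the measured flight time
        return 0.5 * M_N_MEV * beta ** 2

    print(neutron_energy_mev(1.0, 44e-9))   # ~2.7 MeV for a 1 m path, 44 ns flight
    ```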