
    Learning temporal clusters using capsule routing for speech emotion recognition

    Emotion recognition from speech plays a significant role in adding emotional intelligence to machines and making human-machine interaction more natural. A key challenge from a machine learning standpoint is to extract patterns that bear maximum correlation with the emotion information encoded in the signal while being as insensitive as possible to the other types of information carried by speech. In this paper, we propose a novel temporal modelling framework for robust emotion classification using a bidirectional long short-term memory (BLSTM) network, a CNN and a capsule network. The BLSTM deals with the temporal dynamics of the speech signal by effectively representing forward/backward contextual information, while the CNN, along with the dynamic routing of the capsule network, learns temporal clusters which altogether provide a state-of-the-art technique for classifying the extracted patterns. The proposed approach was compared with a wide range of architectures on the FAU-Aibo and RAVDESS corpora, and remarkable gains over state-of-the-art systems were obtained. On FAU-Aibo and RAVDESS, accuracies of 77.6% and 56.2% were achieved, respectively, which are 3% and 14% (absolute) higher than the best previously reported results for the respective tasks.
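
    The abstract does not spell out the capsule layer itself, so the following is only a generic sketch of dynamic routing between capsules (after Sabour et al., 2017) in PyTorch, of the kind a BLSTM/CNN front end could feed; the class and parameter names are illustrative, not the authors' implementation.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def squash(s, dim=-1, eps=1e-8):
            # Squashing nonlinearity: keeps direction, maps the norm into [0, 1).
            n2 = (s * s).sum(dim=dim, keepdim=True)
            return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

        class CapsuleRouting(nn.Module):
            # Routes n_in input capsules (dim d_in) to n_out output capsules (dim d_out).
            def __init__(self, n_in, d_in, n_out, d_out, iters=3):
                super().__init__()
                self.iters = iters
                self.W = nn.Parameter(0.01 * torch.randn(n_in, n_out, d_in, d_out))

            def forward(self, u):                                  # u: (B, n_in, d_in)
                u_hat = torch.einsum('bid,iodk->biok', u, self.W)  # per-pair predictions
                b = torch.zeros(u.shape[0], u.shape[1], self.W.shape[1], device=u.device)
                for _ in range(self.iters):                        # routing by agreement
                    c = F.softmax(b, dim=2)                        # coupling coefficients
                    v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))   # (B, n_out, d_out)
                    b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)   # reward agreement
                return v

    In such a pipeline, the per-frame BLSTM/CNN outputs would be reshaped into the input capsules u, and the norm of each output capsule read as an emotion-class score.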

    Exploring the use of group delay for generalised VTS based noise compensation

    In earlier work we studied the effect of statistical normalisation for phase-based features and observed that it leads to a significant robustness improvement. This paper explores the extension of the generalised Vector Taylor Series (gVTS) noise compensation approach to the group delay (GD) domain. We discuss the problems this extension presents, propose solutions and derive the corresponding formulae. Furthermore, the effects of additive and channel noise in the GD domain were studied. It was observed that the GD of the noisy observation is a convex combination of the GDs of the clean signal and the additive noise, and that, in expectation, the channel GD tends to zero. Experiments on Aurora-4 showed that, despite training only on clean speech, the proposed features provide average WER reductions of 0.8% absolute and 4.1% relative compared with an MFCC-based system trained on multi-style data. Combining the gVTS with a bottleneck DNN-based system led to average absolute (relative) WER improvements of 6.0% (23.5%) when training on clean data and 2.5% (13.8%) when using multi-style training with additive noise.
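
    In symbols, writing tau_y, tau_x and tau_n for the GDs of the noisy observation, the clean signal and the additive noise, the convexity observation above can be stated as follows; the exact form of the weight alpha(omega) is derived in the paper, and its dependence on the local signal-to-noise ratio is an assumption here:

        \tau_y(\omega) = \alpha(\omega)\,\tau_x(\omega) + \bigl(1 - \alpha(\omega)\bigr)\,\tau_n(\omega),
        \qquad 0 \le \alpha(\omega) \le 1,
        \qquad \mathbb{E}\bigl[\tau_{\mathrm{channel}}(\omega)\bigr] \approx 0.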

    On the usefulness of the speech phase spectrum for pitch extraction

    Most frequency-domain techniques for pitch extraction, such as the cepstrum, harmonic product spectrum (HPS) and summation of residual harmonics (SRH), operate on the magnitude spectrum and turn it into a function in which the fundamental frequency emerges as the argmax. In this paper, we investigate the extension of these three techniques to the phase and group delay (GD) domains. Our extensions exploit the observation that the bin at which F(magnitude) becomes maximum, for some monotonically increasing function F, is equivalent to the bin at which F(phase) has the maximum negative slope and F(group delay) has the maximum value. To extract the pitch track from the speech phase spectrum, these techniques were coupled with the source-filter model in the phase domain that we proposed in earlier publications and a novel voicing detection algorithm proposed here. The accuracy and robustness of the phase-based pitch extraction techniques are illustrated and compared with their magnitude-based counterparts using six pitch evaluation metrics. On average, it is observed that the phase spectrum can be successfully employed in pitch tracking with accuracy and robustness comparable to the speech magnitude spectrum.
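
    As a concrete reference point, below is a minimal magnitude-domain HPS pitch estimator in Python. Per the observation above, the paper's phase/GD variants apply the same construction to the phase slope or group delay instead of the magnitude; that substitution is not shown here, and the FFT padding and search band are illustrative.

        import numpy as np

        def hps_pitch(frame, fs, n_harm=4, fmin=60.0, fmax=400.0):
            # Harmonic product spectrum: multiply downsampled copies of the
            # magnitude spectrum so energy at the harmonics piles up at F0.
            n_fft = 4 * len(frame)
            spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))
            hps = spec.copy()
            for h in range(2, n_harm + 1):
                m = len(spec) // h
                hps[:m] *= spec[::h][:m]
            freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
            band = (freqs >= fmin) & (freqs <= fmax)
            return freqs[band][np.argmax(hps[band])]   # F0 emerges as the argmax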

    iCub visual memory inspector: Visualising the iCub’s thoughts

    This paper describes the integration of multiple sensory recognition models created by a Synthetic Autobiographical Memory into a structured system. This structured system provides high-level control of the overall architecture and interfaces with a Unity-based iCub simulator, which provides a virtual space for displaying recollected events.

    Statistical Normalisation of Phase-based Feature Representation For Robust Speech Recognition

    In earlier work we proposed a source-filter decomposition of speech through phase-based processing. The decomposition leads to novel speech features that are extracted from the filter component of the phase spectrum. This paper analyses this spectrum and the proposed representation by evaluating statistical properties at various points along the parametrisation pipeline. We show that the speech phase spectrum has a bell-shaped distribution, in contrast to the uniform distribution that is usually assumed. It is demonstrated that the uniform density (which implies that the corresponding sequence is least informative) is an artefact of phase wrapping and not an original characteristic of this spectrum. In addition, we extend the idea of statistical normalisation, usually applied to magnitude-based features, into the phase domain. Based on the statistical structure of the phase-based features, which is shown to be super-Gaussian in the clean condition, three normalisation schemes, namely Gaussianisation, Laplacianisation and table-based histogram equalisation, were applied to improve robustness. Speech recognition experiments on Aurora-2 show that applying an optimal normalisation scheme at the right stage of the feature extraction process can produce average relative WER reductions of up to 18.6% across the 0-20 dB SNR conditions.
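
    A minimal sketch of one of the normalisation schemes mentioned, per-dimension Gaussianisation via rank mapping, is given below. This is a generic implementation rather than the paper's exact pipeline, and per-utterance operation is an assumption.

        import numpy as np
        from scipy.stats import norm

        def gaussianise(feats):
            # feats: (T, D) feature matrix. Map each dimension through its
            # empirical CDF (ranks) and then the inverse Gaussian CDF, so each
            # output dimension is approximately N(0, 1) distributed.
            T, _ = feats.shape
            ranks = feats.argsort(axis=0).argsort(axis=0)   # 0 .. T-1 per column
            u = (ranks + 0.5) / T                           # avoid the CDF endpoints
            return norm.ppf(u)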

    Raw source and filter modelling for dysarthric speech recognition

    Acoustic modelling for automatic dysarthric speech recognition (ADSR) is a challenging task. Data deficiency is a major problem, and the substantial differences between typical and dysarthric speech complicate transfer learning. In this paper, we build acoustic models using the raw magnitude spectra of the source and filter components. The proposed multi-stream model consists of convolutional and recurrent layers. It allows for fusing the vocal tract and excitation components at different levels of abstraction and after per-stream pre-processing. We show that such multi-stream processing leverages the two information streams and helps the model normalise speaker attributes and speaking style, potentially leading to better handling of dysarthric speech with its large inter-speaker and intra-speaker variability. We compare the proposed system with various features, study the training dynamics, explore the usefulness of data augmentation and provide an interpretation of the learned convolutional filters. On the widely used TORGO dysarthric speech corpus, the proposed approach yields up to 1.7% absolute WER reduction for dysarthric speech compared with the MFCC baseline. Our best model reaches 40.6% and 11.8% WER for dysarthric and typical speech, respectively.
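
    The abstract fixes the ingredients (two raw-spectrum streams, convolutional and recurrent layers, fusion after per-stream pre-processing) but not the exact topology, so the PyTorch sketch below is only one plausible reading; all layer sizes and names are invented for illustration.

        import torch
        import torch.nn as nn

        class TwoStreamAM(nn.Module):
            # Per-stream convolutional pre-processing of the filter (vocal tract)
            # and source (excitation) magnitude spectra, fused before a
            # bidirectional recurrent layer producing per-frame target scores.
            def __init__(self, hidden=256, n_targets=42):
                super().__init__()
                def conv_stream():
                    return nn.Sequential(
                        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d((None, 64)),  # pool frequency axis to 64
                    )
                self.filter_stream = conv_stream()
                self.source_stream = conv_stream()
                self.rnn = nn.GRU(2 * 32 * 64, hidden, batch_first=True,
                                  bidirectional=True)
                self.out = nn.Linear(2 * hidden, n_targets)

            def forward(self, filt, src):                  # each: (B, T, n_bins)
                def run(stream, x):
                    h = stream(x.unsqueeze(1))             # (B, 32, T, 64)
                    return h.permute(0, 2, 1, 3).flatten(2)    # (B, T, 32*64)
                fused = torch.cat([run(self.filter_stream, filt),
                                   run(self.source_stream, src)], dim=-1)
                h, _ = self.rnn(fused)
                return self.out(h)                         # (B, T, n_targets)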

    Compression of Model-based Group Delay Function for Robust Speech Recognition

    In this paper, we improve the performance of the ARGDMF feature by adding a nonlinear filtering block. ARGDMF is a group delay-based feature consisting of four main parts: autoregressive (AR) model extraction, group delay function (GDF) calculation, compression, and scale information augmentation. The main problem with the GDF is its spiky nature, which is solved by coupling the GDF with an all-pole model. The compression step includes two stages similar to MFCC extraction but without taking the logarithm of the output energies. The fourth part augments the phase-based feature vector with scale information. The novelty of this paper lies in adding a filtering block to the compression process to make it more effective. This filter aims to improve the performance of ARGDMF through better adjustment of the dynamic range and formant sharpness. The feature was evaluated on the Aurora-2 database. In the presence of both additive and convolutional noise, the proposed method noticeably outperforms MFCCs and other phase-based features, without a notable increase in computational load.
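
    A minimal sketch of the first two ARGDMF stages, fitting an all-pole (AR) model to a frame and taking the group delay of the resulting smooth spectrum, might look as follows; the model order and number of frequency points are illustrative, and the compression and scale-augmentation stages are not shown.

        import numpy as np
        from scipy.linalg import solve_toeplitz
        from scipy.signal import group_delay

        def model_based_gd(frame, order=12, n_freqs=512):
            # Yule-Walker AR fit: solve the Toeplitz autocorrelation system.
            r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
            a = solve_toeplitz(r[:order], r[1:order + 1])
            a_poly = np.concatenate(([1.0], -a))       # A(z) = 1 - sum_k a_k z^-k
            # Group delay of the all-pole model 1/A(z): smooth, without the
            # spikes that zeros close to the unit circle cause in the raw GDF.
            w, gd = group_delay(([1.0], a_poly), w=n_freqs)
            return w, gd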

    Salve Regina. Anonymous

    Get PDF
    This work forms part of the activities of the Consolidated Research Group 2009 SGR 973, "Aula Música Poética", funded by the Generalitat de Catalunya (Comissionat per a Universitats i Recerca del DIUE). The document contains the four-voice antiphon by an anonymous composer entitled "Salve, Regina". It provides the Latin text and its Spanish translation, the score with the musical transcription into modern notation, a partial facsimile of the work, and various details of musical and musicological interest. This composition appears in the anthology "Música de varios autores escogida por el maestro Gerónimo Vermell" (1690). Consejo Superior de Investigaciones Científicas (CSIC). Universitat de Barcelona (UB). Peer reviewed