2,149 research outputs found

    Language Transfer of Audio Word2Vec: Learning Audio Segment Representations without Target Language Data

    Full text link
    Audio Word2Vec offers vector representations of fixed dimensionality for variable-length audio segments using Sequence-to-sequence Autoencoder (SA). These vector representations are shown to describe the sequential phonetic structures of the audio segments to a good degree, with real world applications such as query-by-example Spoken Term Detection (STD). This paper examines the capability of language transfer of Audio Word2Vec. We train SA from one language (source language) and use it to extract the vector representation of the audio segments of another language (target language). We found that SA can still catch phonetic structure from the audio segments of the target language if the source and target languages are similar. In query-by-example STD, we obtain the vector representations from the SA learned from a large amount of source language data, and found them surpass the representations from naive encoder and SA directly learned from a small amount of target language data. The result shows that it is possible to learn Audio Word2Vec model from high-resource languages and use it on low-resource languages. This further expands the usability of Audio Word2Vec.Comment: arXiv admin note: text overlap with arXiv:1603.0098

    On automatic emotion classification using acoustic features

    No full text
    In this thesis, we describe extensive experiments on the classification of emotions from speech using acoustic features. This area of research has important applications in human computer interaction. We have thoroughly reviewed the current literature and present our results on some of the contemporary emotional speech databases. The principal focus is on creating a large set of acoustic features, descriptive of different emotional states and finding methods for selecting a subset of best performing features by using feature selection methods. In this thesis we have looked at several traditional feature selection methods and propose a novel scheme which employs a preferential Borda voting strategy for ranking features. The comparative results show that our proposed scheme can strike a balance between accurate but computationally intensive wrapper methods and less accurate but computationally less intensive filter methods for feature selection. By using the selected features, several schemes for extending the binary classifiers to multiclass classification are tested. Some of these classifiers form serial combinations of binary classifiers while others use a hierarchical structure to perform this task. We describe a new hierarchical classification scheme, which we call Data-Driven Dimensional Emotion Classification (3DEC), whose decision hierarchy is based on non-metric multidimensional scaling (NMDS) of the data. This method of creating a hierarchical structure for the classification of emotion classes gives significant improvements over other methods tested. The NMDS representation of emotional speech data can be interpreted in terms of the well-known valence-arousal model of emotion. We find that this model does not givea particularly good fit to the data: although the arousal dimension can be identified easily, valence is not well represented in the transformed data. From the recognitionresults on these two dimensions, we conclude that valence and arousal dimensions are not orthogonal to each other. In the last part of this thesis, we deal with the very difficult but important topic of improving the generalisation capabilities of speech emotion recognition (SER) systems over different speakers and recording environments. This topic has been generally overlooked in the current research in this area. First we try the traditional methods used in automatic speech recognition (ASR) systems for improving the generalisation of SER in intra– and inter–database emotion classification. These traditional methods do improve the average accuracy of the emotion classifier. In this thesis, we identify these differences in the training and test data, due to speakers and acoustic environments, as a covariate shift. This shift is minimised by using importance weighting algorithms from the emerging field of transfer learning to guide the learning algorithm towards that training data which gives better representation of testing data. Our results show that importance weighting algorithms can be used to minimise the differences between the training and testing data. We also test the effectiveness of importance weighting algorithms on inter–database and cross-lingual emotion recognition. From these results, we draw conclusions about the universal nature of emotions across different languages
    corecore