Music Emotion Recognition based on Feature Combination, Deep Learning and Chord Detection
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.

As one of the most enduring human inventions, music appears in many forms of artwork, such as songs, films and theatre. It can be seen as another language, used to express its author's thoughts and emotions. However, the emotions that arise while listening to music are complex and difficult to explain precisely. Music Emotion Recognition (MER) is therefore an active research topic in the field of artificial intelligence, concerned with recognising emotions from music. Methods and tools for analysing music signals have developed rapidly, and with recent advances in signal processing, machine learning and algorithm optimisation, recognition accuracy continues to improve. This thesis focuses on three significant aspects of MER, namely features, learning methods and music emotion theory, to explain and illustrate how to build effective MER systems.

Firstly, an automatic MER system for classifying four emotions was proposed, in which OpenSMILE is used for feature extraction and the IS09 feature set was selected. After combination with STAT statistical features, a Random Forest classifier outperformed previous systems. This shows that suitable feature selection and machine learning can improve MER accuracy by at least 3.5% over other combinations under appropriate parameter settings; the new combination of IS09 and STAT features reached 83.8% accuracy.

Secondly, another four-emotion MER system was proposed based on the dynamic properties of music signals, where features are extracted from segments of each recording in the APM database rather than from the whole recording. A Long Short-Term Memory (LSTM) deep learning model was then used for classification, exploiting the continuous dynamic information between time-frame segments for more effective emotion recognition. However, the final performance reached only 65.7%, which was lower than expected. The likely reason is that the database is less suited to the LSTM than initially thought: the information between segments was not sufficient to improve recognition over traditional methods. This result shows that complex deep learning methods do not suit every database, as the LSTM dynamic model did not work well on this continuous database.

Finally, the research targeted recognising emotion through the identification of chords, since previous theoretical work states that particular chords carry particular emotional information. A new chord database was built, using Adobe Audition to extract chord clips from piano chord teaching audio. FFT features, based on 1000-point sampled pre-processed data, and STAT features were then extracted for selected samples from the database. Comparison using Euclidean distance and correlation showed that the STAT features work well for most chords, with the exception of the augmented chord. This approach of recognising six emotions from music via chord identification was used for the first time in this research and achieved 75% accuracy in chord identification.

In summary, this research proposed new MER methods through three different approaches. Some achieved good recognition performance, and others have broad prospects for application.
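To make the first approach concrete, the sketch below combines two pre-extracted feature matrices and evaluates a Random Forest with cross-validation using scikit-learn. The array shapes, the four-emotion label set and the fold count are assumptions for illustration; the exact OpenSMILE configuration and STAT feature definitions are not reproduced here.

# Minimal sketch: feature-level combination of IS09 and STAT features
# scored with a Random Forest. Shapes and labels are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate_combined(X_is09, X_stat, y, n_trees=500, folds=10):
    """X_is09: (n_clips, 384) IS09 features; X_stat: (n_clips, n_stat)
    statistical features; y: emotion labels (e.g. happy/sad/angry/calm)."""
    X = np.hstack([X_is09, X_stat])          # simple concatenation
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(clf, X, y, cv=folds)
    return scores.mean(), scores.std()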
Spoken content retrieval beyond pipeline integration of automatic speech recognition and information retrieval
The dramatic increase in the creation of multimedia content is leading to the development of large archives in which a substantial amount of the information is in spoken form. Efficient access to this information requires effective spoken content retrieval (SCR) methods. Traditionally, SCR systems have focused on a pipeline integration of two fundamental technologies: transcription using automatic speech recognition (ASR) and search supported by text-based information retrieval (IR).
Existing SCR approaches estimate the relevance of a spoken retrieval item based on the lexical overlap between a user's query and the textual transcriptions of the items. However, the speech signal contains other potentially valuable non-lexical information that remains largely unexploited by SCR approaches. In particular, acoustic correlates of speech prosody, which have been shown to be useful for identifying salient words and determining topic changes, have not been exploited by existing SCR approaches.
In addition, the temporal nature of multimedia content means that accessing content is a user-intensive, time-consuming process. In order to minimise user effort in locating relevant content, SCR systems could suggest playback points in retrieved content, indicating the locations where the system believes relevant information may be found. This typically requires adopting a segmentation mechanism for splitting documents into smaller “elements” to be ranked and from which suitable playback points can be selected. Existing segmentation approaches do not generalise well to every possible information need, nor do they provide robustness to ASR errors.
This thesis extends SCR beyond the standard ASR and IR pipeline approach by: (i) exploring the utilisation of prosodic information as complementary evidence of topical relevance to enhance current SCR approaches; (ii) determining elements of content that, when retrieved, minimise user search effort and provide increased robustness to ASR errors; and (iii) developing enhanced evaluation measures that better capture the factors affecting user satisfaction in SCR.
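To make the baseline concrete, here is a minimal sketch of the pipeline this thesis extends: an ASR transcript is split into overlapping fixed-length elements, the elements are ranked against the query with TF-IDF, and the start time of the best-scoring element is returned as a suggested playback point. The function names, window and step sizes, and the word-level transcript format are assumptions for illustration.

# Minimal sketch of a pipeline SCR baseline: segment the transcript,
# rank elements with TF-IDF, suggest a playback point.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def segment(words, start_times, win=50, step=25):
    """Split a word-level transcript into overlapping fixed-length windows."""
    return [(" ".join(words[i:i + win]), start_times[i])
            for i in range(0, max(1, len(words) - win + 1), step)]

def suggest_playback_point(query, words, start_times):
    """Return (start_time, score) of the element most similar to the query."""
    elements = segment(words, start_times)
    vec = TfidfVectorizer()
    doc_matrix = vec.fit_transform([text for text, _ in elements])
    scores = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    best = int(scores.argmax())
    return elements[best][1], float(scores[best])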
A Novel Approach for Continuous Speech Tracking and Dynamic Time Warping. Adaptive Framing Based Continuous Speech Similarity Measure and Dynamic Time Warping using Kalman Filter and Dynamic State Model
Dynamic speech properties such as time warping, silence removal and background noise interference are among the most challenging issues in continuous speech signal matching. Among these, matching of time-warped speech signals is of particular interest and has long been a difficult challenge for researchers. In this work, an adaptive framing based approach to continuous speech tracking and similarity measurement is introduced, following comprehensive research across diverse areas of speech processing. A dynamic state model based on a system of linear motion equations is introduced, which models the input (test) speech frame as a unidirectional object moving along the template speech signal. The most similar corresponding frame position in the template speech is estimated and then fused with a feature-based similarity observation and the noise variances using a Kalman filter. The Kalman filter provides the final estimated frame position in the template speech at the current time, which is further used to predict a new frame size for the next step. In addition, a keyword spotting approach is proposed, introducing a wavelet decomposition based dynamic noise filter and a combination of beliefs; Dempster's theory of belief combination is applied to the keyword spotting task for the first time. The performance of both the speech tracking and keyword spotting approaches is evaluated using statistical metrics and gold standards for binary classification. Experimental results demonstrate the superiority of the proposed approaches over existing methods.
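The dynamic state model described above can be made concrete with a small constant-velocity Kalman filter, sketched below: the state holds the matching template frame position and its velocity, and each step fuses the motion prediction with a position observed from feature similarity. All noise variances and initial values here are illustrative assumptions, not the thesis's tuned parameters.

# Constant-velocity Kalman filter tracking the input frame's corresponding
# position in the template signal. State: [position, velocity].
import numpy as np

class FramePositionTracker:
    def __init__(self):
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])  # linear motion model
        self.H = np.array([[1.0, 0.0]])              # observe position only
        self.Q = np.diag([0.01, 0.01])               # process noise (assumed)
        self.R = np.array([[4.0]])                   # observation noise (assumed)
        self.x = np.array([[0.0], [1.0]])            # initial position, velocity
        self.P = np.eye(2)

    def step(self, observed_pos):
        """Fuse the motion prediction with a similarity-based observation,
        e.g. the argmax of a correlation between input and template frames."""
        # Predict where the matching template frame should be now.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct with the position suggested by feature similarity.
        y = np.array([[observed_pos]]) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return float(self.x[0, 0])   # estimated template frame position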
Deep learning based facial expression recognition and its applications
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.

Facial expression recognition (FER) is a research area concerned with classifying human emotions through the expressions on the face. It can be used in applications such as biometric security, intelligent human-computer interaction, robotics, and clinical medicine for autism, depression, pain and mental health problems. This dissertation investigates advanced technologies for facial expression analysis and develops artificial intelligence systems for practical applications.

The first part of this work applies geometric and texture domain feature extractors along with various machine learning techniques to improve FER. Advanced 2D and 3D facial processing techniques such as Edge Oriented Histograms (EOH) and Facial Mesh Distances (FMD) are then fused together using a framework designed to investigate their individual and combined domain performances. Following these tests, the face is broken down into facial parts using advanced facial alignment and localisation techniques. Deep learning in the form of Convolutional Neural Networks (CNNs) is also explored for FER. A novel approach is used for the deep network architecture design, learning the facial parts jointly and showing an improvement over using the whole face. Joint Bayesian is also adapted, in the form of metric learning, to work with deep feature representations of the facial parts, providing a further improvement over using the deep network alone.

Dynamic emotion content is explored as a solution to provide richer information than still images. The motion occurring across the content is initially captured using the Motion History Histogram (MHH) descriptor and critically evaluated. Based on this evaluation, several improvements are proposed through extensions such as the Average Spatial Pooling Multi-scale Motion History Histogram (ASMMHH). This extension adds two modifications: the first views the content at different spatial scales through spatial pooling, influenced by the structure of CNNs; the other captures motion at different speeds. Combined, they provide better performance than MHH and other popular techniques such as Local Binary Patterns on Three Orthogonal Planes (LBP-TOP).
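As a point of reference for the dynamic descriptors discussed above, the sketch below gives a simplified version of the MHH idea: inter-frame differences are binarised and, for every pixel, the number of motion runs of each duration is counted. The threshold and the number of duration bins are illustrative assumptions, not the exact published parameterisation.

# Simplified sketch of the Motion History Histogram (MHH) idea: count,
# per pixel, runs of exactly m consecutive "motion" frames, m = 1..M.
import numpy as np

def mhh(frames, threshold=25, M=5):
    """frames: (T, H, W) grayscale video -> (M, H, W) motion-pattern counts."""
    diff = np.abs(np.diff(frames.astype(np.int32), axis=0)) > threshold
    T1, H, W = diff.shape
    hist = np.zeros((M, H, W), dtype=np.int64)
    run = np.zeros((H, W), dtype=np.int64)     # current motion run length
    for t in range(T1):
        run = np.where(diff[t], run + 1, 0)
        nxt = diff[t + 1] if t + 1 < T1 else np.zeros((H, W), dtype=bool)
        ended = (run > 0) & ~nxt               # the run finishes here
        for m in range(1, M + 1):
            hist[m - 1] += (ended & (np.minimum(run, M) == m))
        run[ended] = 0
    return hist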
Finally, the dynamic emotion content is observed in the feature space, with sequences of images represented as sequences of extracted features. A novel technique called the Facial Dynamic History Histogram (FDHH) is developed to capture patterns of variation within the sequence of features, an approach not seen before. FDHH is applied in an end-to-end framework for depression analysis and for evaluating the emotions induced by a large set of video clips from various movies. With the combination of deep learning techniques and FDHH, state-of-the-art results are achieved for depression analysis.
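The FDHH descriptor transfers the same run-counting idea from pixels to the components of a feature sequence. A compact sketch follows, mirroring the MHH sketch above; the threshold, the number of patterns and the flattened output are assumptions for illustration.

# Sketch of the FDHH idea: binarise component-wise changes between
# consecutive feature vectors and, per feature dimension, count runs of
# exactly m consecutive changes (compare the pixel-domain MHH above).
import numpy as np

def fdhh(features, threshold=0.1, M=5):
    """features: (T, D) feature sequence -> flattened (M * D,) descriptor."""
    diff = np.abs(np.diff(features, axis=0)) > threshold
    T1, D = diff.shape
    hist = np.zeros((M, D), dtype=np.int64)
    run = np.zeros(D, dtype=np.int64)
    for t in range(T1):
        run = np.where(diff[t], run + 1, 0)
        nxt = diff[t + 1] if t + 1 < T1 else np.zeros(D, dtype=bool)
        ended = (run > 0) & ~nxt
        for m in range(1, M + 1):
            hist[m - 1] += (ended & (np.minimum(run, M) == m))
        run[ended] = 0
    return hist.ravel()   # video-level descriptor for a downstream regressor

A descriptor built this way over per-frame deep features can then be fed to a standard regressor, which is how the end-to-end depression analysis framework described above is structured.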
Artificial intelligence system for continuous affect estimation from naturalistic human expressions
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.

The analysis and automatic estimation of affect from human expressions is acknowledged as an active research topic in the computer vision community. Most reported affect recognition systems, however, only consider subjects performing well-defined acted expressions under very controlled conditions, so they are not robust enough for real-life recognition tasks with subject variation, acoustic surroundings and illumination changes. In this thesis, an artificial intelligence system is proposed to continuously estimate affective behaviour (represented along a continuum, e.g. from -1 to +1) in terms of latent dimensions (e.g. arousal and valence) from naturalistic human expressions. To tackle these issues, both feature representation and machine learning strategies are addressed.

In feature representation, human expression is represented by modalities such as audio, video, physiological signals and text. Hand-crafted features are extracted from each modality per frame, in order to match the consecutive affect labels. However, the extracted features may miss information due to factors such as background noise or lighting conditions. The Haar Wavelet Transform is employed to determine whether a noise cancellation mechanism in feature space should be considered in the design of the affect estimation system. Besides hand-crafted features, deep learning features are also analysed layer-wise, across convolutional and fully connected layers. Convolutional Neural Networks such as AlexNet, VGGFace and ResNet are selected as the deep learning architectures for feature extraction from facial expression images. A multimodal fusion scheme is then applied, fusing deep learning and hand-crafted features together to improve performance.

In machine learning strategies, a two-stage regression approach is introduced. In the first stage, baseline regression methods such as Support Vector Regression are applied to estimate each affect dimension per time step. In the second stage, a subsequent model such as a Time Delay Neural Network, Long Short-Term Memory network or Kalman filter is proposed to model the temporal relationships between consecutive estimates of each affect dimension. In doing so, the temporal information employed by the subsequent model is not biased by the high variability present in consecutive frames, and at the same time the network can exploit the slowly changing emotional dynamics more efficiently. Following the two-stage regression approach for unimodal affect analysis, the fusion of information from different modalities is elaborated. Continuous emotion recognition in the wild is addressed by investigating mathematical models for each emotion dimension: Linear Regression, Exponent Weighted Decision Fusion and Multi-Gene Genetic Programming are implemented to quantify the relationship between modalities.

In summary, the research presented in this thesis provides a principled approach to automatically and continuously estimating affect from naturalistic human expressions. The proposed system, which comprises feature smoothing, deep learning features, a two-stage regression framework and fusion based on mathematical models of the relationships between modalities, offers a strong basis for developing artificial intelligence systems for continuous affect estimation, and more broadly for building real-time emotion recognition systems for human-computer interaction.

Majlis Amanah Rakyat (MARA), Malaysia
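As an illustration of the two-stage regression approach described in the abstract above, the sketch below uses Support Vector Regression for the frame-wise first stage and a second regressor over a sliding window of first-stage outputs as a simple stand-in for the temporal model. The window width is an assumption, and the thesis's actual second stages are TDNN, LSTM and Kalman filter models rather than this simplification.

# Two-stage regression sketch: SVR per frame, then a temporal model over
# a sliding window of the stage-one estimates.
import numpy as np
from sklearn.svm import SVR

def stage_two_windows(preds, width=5):
    """Stack a centred sliding window of stage-one predictions per frame."""
    padded = np.pad(preds, width // 2, mode="edge")
    return np.stack([padded[i:i + width] for i in range(len(preds))])

def two_stage_predict(X_train, y_train, X_test, width=5):
    """X_*: (N, D) per-frame features; y_train: continuous affect labels
    (e.g. arousal). Returns temporally smoothed test estimates."""
    svr = SVR(kernel="rbf").fit(X_train, y_train)
    raw_train, raw_test = svr.predict(X_train), svr.predict(X_test)
    temporal = SVR(kernel="linear").fit(stage_two_windows(raw_train, width),
                                        y_train)
    return temporal.predict(stage_two_windows(raw_test, width))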
Computer audition for emotional wellbeing
This thesis is focused on the application of computer audition (i.e., machine listening) methodologies for monitoring states of emotional wellbeing. Computer audition is a growing field and has been successfully applied to an array of use cases in recent years. There are several advantages to audio-based computational analysis; for example, audio can be recorded non-invasively, stored economically, and can capture rich information on happenings in a given environment, e.g., human behaviour. With this in mind, maintaining emotional wellbeing is a challenge for humans, and emotion-altering conditions, including stress and anxiety, have become increasingly common in recent years. Such conditions manifest in the body, inherently changing how we express ourselves. Research shows these alterations are perceivable within vocalisation, suggesting that speech-based audio monitoring may be valuable for developing artificially intelligent systems that target improved wellbeing. Furthermore, computer audition applies machine learning and other computational techniques to audio understanding, so by combining computer audition with applications in the domain of computational paralinguistics and emotional wellbeing, this research concerns the broader field of empathy for Artificial Intelligence (AI). To this end, speech-based audio modelling that incorporates and understands paralinguistic wellbeing-related states may be a vital cornerstone for improving the degree of empathy that an artificial intelligence has.
To summarise, this thesis investigates the extent to which speech-based computer audition methodologies can be utilised to understand human emotional wellbeing. A fundamental background on the fields in question as they pertain to emotional wellbeing is first presented, followed by an outline of the applied audio-based methodologies. Next, detail is provided for several machine learning experiments focused on emotional wellbeing applications, including analysis and recognition of under-researched phenomena in speech, e.g., anxiety, and markers of stress. Core contributions from this thesis include the collection of several related datasets, hybrid fusion strategies for an emotional gold standard, novel machine learning strategies for data interpretation, and an in-depth acoustic-based computational evaluation of several human states. All of these contributions focus on ascertaining the advantage of audio in the context of modelling emotional wellbeing. Given the sensitive nature of human wellbeing, the ethical implications involved with developing and applying such systems are discussed throughout.
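One widely used way to build a single gold standard from several annotators' continuous emotion traces is the Evaluator Weighted Estimator (EWE), sketched below in a leave-one-out form. It is shown only as a representative baseline: the abstract does not specify which strategies constitute its "hybrid" fusion, so this is an assumption for illustration.

# Evaluator Weighted Estimator sketch: weight each rater's trace by its
# agreement (correlation) with the mean of the remaining raters.
import numpy as np

def ewe(ratings):
    """ratings: (K, T) array of K annotators' traces -> (T,) gold standard."""
    K = ratings.shape[0]
    weights = np.empty(K)
    for k in range(K):
        others = np.delete(ratings, k, axis=0).mean(axis=0)
        weights[k] = np.corrcoef(ratings[k], others)[0, 1]
    weights = np.clip(weights, 0.0, None)   # ignore anti-correlated raters
    return (weights[:, None] * ratings).sum(axis=0) / max(weights.sum(), 1e-8)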