Learning discriminative features for human motion understanding
Human motion understanding has attracted considerable interest in recent research for its applications to video surveillance, content-based search and healthcare. Depending on the capturing method, human motion can be recorded in various forms (e.g. skeletal data, video or images). Compared with 2D video and images, skeletal data recorded by motion capture devices contains full 3D movement information. We first look into a gait motion analysis problem based on 3D skeletal data, proposing an automatic framework for identifying musculoskeletal and neurological disorders among older people. In this framework, two new gait features are proposed, together with a feature selection strategy that chooses an optimal feature set from the input features to optimise classification accuracy.
Due to self-occlusion caused by a single shooting angle, 2D video and images cannot record full 3D geometric information. Viewpoint variation therefore dramatically affects the performance of many 2D-based applications (e.g. arbitrary-view action recognition and image-based 3D human shape reconstruction). Leveraging view-invariance from 3D models is a popular idea for improving performance on 2D computer vision problems. In the second contribution, we therefore adopt 3D models built with computer graphics technology to assist in solving the problem of arbitrary-view action recognition. We propose a new transfer dictionary learning framework that utilises computer graphics technologies to synthesise realistic 2D and 3D training videos and can project a real-world 2D video into a view-invariant sparse representation.
In the third contribution, 3D models are utilised to build an end-to-end 3D human shape reconstruction system, which can recover the 3D human shape from a single image without any prior parametric model. In contrast to most existing methods that calculate 3D joint locations, the method proposed in this thesis can produce a richer and more useful point cloud based representation. Synthesised high-quality 2D images and dense 3D point clouds are used to train a CNN-based encoder and 3D regression module.
In conclusion, the methods introduced in this thesis explore human motion understanding from 3D to 2D. We investigate how to compensate for the lack of full geometric information in 2D-based applications with view-invariance learnt from 3D models.
Iterative Separation of Note Events from Single-Channel Polyphonic Recordings
This thesis is concerned with the separation of audio sources from single-channel polyphonic musical recordings using the iterative estimation and separation of note events. Each event is defined as a section of audio containing largely harmonic energy identified as coming from a single sound source. Multiple events can be clustered to form separated sources. This solution is a model-based algorithm that can be applied to a large variety of audio recordings without requiring previous training stages.
The proposed system embraces two principal stages. The first one considers the iterative detection and separation of note events from within the input mixture. In every iteration, the pitch trajectory of the predominant note event is automatically selected from an array of fundamental frequency estimates and used to guide the separation of the event's spectral content using two different methods: time-frequency masking and time-domain subtraction. A residual signal is then generated and used as the input mixture for the next iteration. After convergence, the second stage considers the clustering of all detected note events into individual audio sources.
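The iterative stage described above can be illustrated with a heavily simplified sketch. It is not the thesis's algorithm: it works on a single magnitude-spectrum frame rather than on pitch trajectories, it uses a plain harmonic-sum score as a stand-in for the fundamental frequency estimator, and it applies a binary mask only. The names `predominant_f0_bin`, `separate_events`, `max_f0_bin`, `n_harmonics` and `energy_floor` are hypothetical.

```python
from typing import List, Tuple


def predominant_f0_bin(spectrum: List[float], max_f0_bin: int, n_harmonics: int = 4) -> int:
    """Return the candidate f0 bin with the largest harmonic sum (assumed stand-in estimator)."""
    best_bin, best_score = 1, -1.0
    for f0 in range(1, max_f0_bin + 1):
        score = sum(
            spectrum[h * f0]
            for h in range(1, n_harmonics + 1)
            if h * f0 < len(spectrum)
        )
        if score > best_score:  # ties keep the lowest bin
            best_bin, best_score = f0, score
    return best_bin


def separate_events(
    spectrum: List[float],
    max_f0_bin: int,
    n_harmonics: int = 4,
    energy_floor: float = 1e-6,
    max_iters: int = 10,
) -> Tuple[List[Tuple[int, List[float]]], List[float]]:
    """Repeatedly extract the predominant harmonic event and continue on the residual."""
    residual = list(spectrum)
    events = []
    for _ in range(max_iters):
        # Stop once the residual carries (almost) no energy.
        if sum(x * x for x in residual) <= energy_floor:
            break
        f0 = predominant_f0_bin(residual, max_f0_bin, n_harmonics)
        # Binary mask over the harmonic bins of the selected event.
        mask = [0.0] * len(residual)
        for h in range(1, n_harmonics + 1):
            if h * f0 < len(residual):
                mask[h * f0] = 1.0
        event = [m * x for m, x in zip(mask, residual)]
        # The residual becomes the input mixture for the next iteration.
        residual = [x - e for x, e in zip(residual, event)]
        events.append((f0, event))
    return events, residual


# Toy mixture: one note with partials at bins 2, 4, 6, 8 and a quieter one at 3, 6, 9, 12.
mixture = [0.0] * 16
for b in (2, 4, 6, 8):
    mixture[b] += 1.0
for b in (3, 6, 9, 12):
    mixture[b] += 0.5
events, residual = separate_events(mixture, max_f0_bin=4)
print([f0 for f0, _ in events])
```

In this toy mixture both notes share bin 6, and the binary mask assigns all of its energy to the first extracted event. That is the overlapping-partials problem, which the sketch does not attempt to solve.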
Performance evaluation is carried out at three different levels. Firstly, the accuracy of the note-event-based multipitch estimator is compared with that of the baseline algorithm used in every iteration to generate the initial set of pitch estimates. Secondly, the performance of the semi-supervised source separation process is compared with that of another semi-automatic algorithm. Finally, a listening test is conducted to assess the audio quality and naturalness of the separated sources when they are used to create stereo mixes from monaural recordings.
Future directions for this research focus on the application of the proposed system to other music-related tasks. Also, a preliminary optimisation-based approach is presented as an alternative method for the separation of overlapping partials, and as a high-resolution time-frequency representation for digital signals.
Semantic Validation in Structure from Motion
The Structure from Motion (SfM) challenge in computer vision is the process
of recovering the 3D structure of a scene from a series of projective
measurements that are calculated from a collection of 2D images, taken from
different perspectives. SfM consists of three main steps: feature detection and
matching, camera motion estimation, and recovery of 3D structure from estimated
intrinsic and extrinsic parameters and features.
A problem encountered in SfM is that scenes lacking texture or with
repetitive features can cause erroneous feature matching between frames.
Semantic segmentation offers a route to validate and correct SfM models by
labelling pixels in the input images with the use of a deep convolutional
neural network. The semantic and geometric properties associated with classes
in the scene can be taken advantage of to apply prior constraints to each class
of object. The SfM pipeline COLMAP and the semantic segmentation pipeline DeepLab
were used. These, along with planar reconstruction of the dense model, were used
to identify erroneous points that may be occluded from the calculated camera
position, given the semantic label and thus the prior constraint of the
reconstructed plane. Herein, semantic segmentation is integrated into SfM to
apply priors on the 3D point cloud, given the object detection in the 2D input
images. Additionally, the semantic labels of matched keypoints are compared and
points with inconsistent semantic labels are discarded. Furthermore, semantic
labels on input images are used for the removal of objects associated with
motion in the output SfM models. The proposed approach is evaluated on a
dataset of 1102 images of a repetitive architecture scene. This project offers
a novel method for improved validation of 3D SfM models.
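The two label-based checks mentioned in this abstract (discarding matches whose keypoints disagree semantically, and removing points that belong to classes associated with motion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the COLMAP or DeepLab API: labels are assumed to be already looked up per keypoint or per 3D point, and the function names, the label strings and the `dynamic_classes` set are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Match = Tuple[int, int]  # (keypoint index in image A, keypoint index in image B)


def filter_matches_by_label(
    matches: Sequence[Match],
    labels_a: Sequence[str],
    labels_b: Sequence[str],
) -> List[Match]:
    """Keep only the matches whose two keypoints carry the same semantic label."""
    return [(i, j) for i, j in matches if labels_a[i] == labels_b[j]]


def drop_dynamic_points(
    points: Sequence[Tuple[float, float, float]],
    point_labels: Sequence[str],
    dynamic_classes: Set[str],
) -> List[Tuple[float, float, float]]:
    """Remove 3D points labelled with a class associated with motion."""
    return [p for p, lab in zip(points, point_labels) if lab not in dynamic_classes]


# Toy data: the labels would come from a segmentation network in practice.
matches = [(0, 0), (1, 2), (2, 1)]
labels_a = ["building", "building", "sky"]
labels_b = ["building", "sky", "car"]
kept = filter_matches_by_label(matches, labels_a, labels_b)
print(kept)  # [(0, 0), (2, 1)]

points = [(0.0, 0.0, 1.0), (1.0, 0.5, 2.0)]
static = drop_dynamic_points(points, ["building", "car"], {"car", "person"})
print(static)  # [(0.0, 0.0, 1.0)]
```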
Recent Trends in Computational Intelligence
Traditional models struggle to cope with complexity, noise, and a changing environment, while Computational Intelligence (CI) offers solutions to complicated problems as well as reverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically-inspired technologies such as swarm intelligence as part of evolutionary computation, and encompasses wider areas such as image processing, data collection, and natural language processing. This book aims to discuss the use of CI for optimally solving various applications, demonstrating its wide reach and relevance. The combination of optimization methods and data mining strategies makes a strong and reliable prediction tool for handling real-life applications.
Pathway to Future Symbiotic Creativity
This report presents a comprehensive view of our vision for the development
path of human-machine symbiotic art creation. We propose a classification
of the creative system with a hierarchy of 5 classes, showing the pathway of
creativity evolving from a mimic-human artist (Turing Artists) to a Machine
artist in its own right. We begin with an overview of the limitations of the
Turing Artists then focus on the top two-level systems, Machine Artists,
emphasizing machine-human communication in art creation. In art creation, it is
necessary for machines to understand humans' mental states, including desires,
appreciation, and emotions; humans also need to understand machines' creative
capabilities and limitations. The rapid development of immersive environments
and their further evolution into the new concept of the metaverse enable symbiotic art
creation through unprecedented flexibility of bi-directional communication
between artists and art manifestation environments. By examining the latest
sensor and XR technologies, we illustrate the novel way for art data collection
to constitute the base of a new form of human-machine bidirectional
communication and understanding in art creation. Based on such communication
and understanding mechanisms, we propose a novel framework for building future
Machine artists, which comes with the philosophy that a human-compatible AI
system should be based on the "human-in-the-loop" principle rather than the
traditional "end-to-end" dogma. By proposing a new form of inverse
reinforcement learning model, we outline the platform design of machine
artists, demonstrate its functions and showcase some examples of technologies
we have developed. We also provide a systematic exposition of the ecosystem for
AI-based symbiotic art form and community with an economic model built on NFT
technology. Ethical issues for the development of machine artists are also
discussed.
Signal Processing and Machine Learning Techniques Towards Various Real-World Applications
Machine learning (ML) has played an important role in several modern technological innovations and has become an important tool for researchers in various fields of interest. Besides engineering, ML techniques have started to spread across various fields of study, such as health care, medicine, diagnostics, social science, finance and economics. These techniques require data to train the algorithms, model a complex system and make predictions based on that model. Due to the development of sophisticated sensors, it has become easier to collect large volumes of data, which are used to form the necessary hypotheses using ML. The promising results obtained using ML have opened up new research opportunities across various fields, and this dissertation is a manifestation of that. Here, several studies are presented, from which valuable inferences have been drawn for real-world complex systems. Each study has its own motivation and relevance to the real world. An ensemble of signal processing (SP) and ML techniques has been explored in each study. This dissertation provides a detailed systematic approach and discusses the results achieved in each study. The inferences drawn from each study play a vital role in areas of science and technology and are worth further investigation. This dissertation also provides a set of useful SP and ML tools for researchers in various fields of interest.
Dissertation/Thesis. Doctoral Dissertation, Electrical Engineering, 201
Recognizing emotions in spoken dialogue with acoustic and lexical cues
Automatic emotion recognition has long been a focus of Affective Computing. It has
become increasingly apparent that awareness of human emotions in Human-Computer
Interaction (HCI) is crucial for advancing related technologies, such as dialogue
systems. However, performance of current automatic emotion recognition is
disappointing compared to human performance. Current research on emotion
recognition in spoken dialogue focuses on identifying better feature representations
and recognition models from a data-driven point of view. The goal of this thesis
is to explore how incorporating prior knowledge of human emotion recognition
in the automatic model can improve state-of-the-art performance of automatic
emotion recognition in spoken dialogue. Specifically, we study this by proposing
knowledge-inspired features representing occurrences of disfluency and non-verbal
vocalisation in speech, and by building a multimodal recognition model that combines
acoustic and lexical features in a knowledge-inspired hierarchical structure. In our
study, emotions are represented with the Arousal, Expectancy, Power, and Valence
emotion dimensions. We build unimodal and multimodal emotion recognition
models to study the proposed features and modelling approach, and perform emotion
recognition on both spontaneous and acted dialogue.
Psycholinguistic studies have suggested that DISfluency and Non-verbal
Vocalisation (DIS-NV) in dialogue are related to emotions. However, these affective
cues in spoken dialogue are overlooked by current automatic emotion recognition
research. Thus, we propose features for recognizing emotions in spoken dialogue
which describe five types of DIS-NV in utterances, namely filled pause, filler, stutter,
laughter, and audible breath. Our experiments show that this small set of features
is predictive of emotions. Our DIS-NV features achieve better performance than
benchmark acoustic and lexical features for recognizing all emotion dimensions in
spontaneous dialogue. Consistent with Psycholinguistic studies, the DIS-NV features
are especially predictive of the Expectancy dimension of emotion, which relates to
speaker uncertainty. Our study illustrates the relationship between DIS-NVs and
emotions in dialogue, which contributes to Psycholinguistic understanding of them
as well. Note that our DIS-NV features are based on manual annotations, yet our
long-term goal is to apply our emotion recognition model to HCI systems. Thus, we
conduct preliminary experiments on automatic detection of DIS-NVs, and on using
automatically detected DIS-NV features for emotion recognition. Our results show
that DIS-NVs can be automatically detected from speech with stable accuracy, and
auto-detected DIS-NV features remain predictive of emotions in spontaneous dialogue.
This suggests that our emotion recognition model can be applied to a fully automatic
system in the future, and holds the potential to improve the quality of emotional
interaction in current HCI systems.
To study the robustness of the DIS-NV features, we conduct cross-corpora
experiments on both spontaneous and acted dialogue. We identify how dialogue
type influences the performance of DIS-NV features and emotion recognition models.
DIS-NVs contain additional information beyond acoustic characteristics or lexical
contents. Thus, we study the gain of modality fusion for emotion recognition with the
DIS-NV features. Previous work combines different feature sets by fusing modalities
at the same level using two types of fusion strategies: Feature-Level (FL) fusion,
which concatenates feature sets before recognition; and Decision-Level (DL) fusion,
which makes the final decision based on outputs of all unimodal models. However,
features from different modalities may describe data at different time scales or levels
of abstraction. Moreover, Cognitive Science research indicates that when perceiving
emotions, humans make use of information from different modalities at different
cognitive levels and time steps. Therefore, we propose a HierarchicaL (HL) fusion
strategy for multimodal emotion recognition, which incorporates features that describe
data at a longer time interval or which are more abstract at higher levels of its
knowledge-inspired hierarchy. Compared to FL and DL fusion, HL fusion incorporates
both inter- and intra-modality differences. Our experiments show that HL fusion
consistently outperforms FL and DL fusion on multimodal emotion recognition in both
spontaneous and acted dialogue. The HL model combining our DIS-NV features with
benchmark acoustic and lexical features improves current performance of multimodal
emotion recognition in spoken dialogue.
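The structural difference between the three fusion strategies can be sketched as below. This is an illustrative reading of the paragraph above, not the thesis's models: each model is reduced to a hypothetical callable from a feature list to one score, DL fusion is shown as a plain average of unimodal outputs, and the ordering of the feature sets in the hierarchy is an assumption made for the example.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]  # hypothetical: feature list -> one score


def fl_fusion(model: Model, feature_sets: Sequence[Sequence[float]]) -> float:
    """Feature-Level fusion: concatenate all feature sets, then recognise once."""
    joint = [x for fs in feature_sets for x in fs]
    return model(joint)


def dl_fusion(models: Sequence[Model], feature_sets: Sequence[Sequence[float]]) -> float:
    """Decision-Level fusion: one model per modality, then combine the outputs (here: mean)."""
    outputs = [m(fs) for m, fs in zip(models, feature_sets)]
    return sum(outputs) / len(outputs)


def hl_fusion(stage_models: Sequence[Model], feature_sets: Sequence[Sequence[float]]) -> float:
    """Hierarchical fusion: each stage sees the previous stage's output plus the
    next feature set; sets are ordered from short time scale / low abstraction
    to long time scale / high abstraction (ordering assumed by the caller)."""
    carried: List[float] = []
    prediction = 0.0
    for model, fs in zip(stage_models, feature_sets):
        prediction = model(list(carried) + list(fs))
        carried = [prediction]
    return prediction


def mean_model(features: Sequence[float]) -> float:
    """Toy stand-in for a trained recogniser."""
    return sum(features) / len(features)


# Toy feature sets; the values and their ordering are made up for illustration.
acoustic, lexical, dis_nv = [0.2, 0.4], [0.6], [1.0]
sets = [acoustic, lexical, dis_nv]
print(fl_fusion(mean_model, sets))
print(dl_fusion([mean_model] * 3, sets))
print(hl_fusion([mean_model] * 3, sets))
```

The point of the sketch is where each feature set enters the computation: all at once (FL), in parallel with a late combination (DL), or at successive levels (HL).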
To study how other emotion-related tasks of spoken dialogue can benefit from the
proposed approaches, we apply the DIS-NV features and the HL fusion strategy to
recognize movie-induced emotions. Our experiments show that although designed
for recognizing emotions in spoken dialogue, DIS-NV features and HL fusion
remain effective for recognizing movie-induced emotions. This suggests that other
emotion-related tasks can also benefit from the proposed features and model structure
- …