Learning embeddings for speaker clustering based on voice equality
Recent work has shown that convolutional neural networks (CNNs) trained in a supervised fashion for speaker identification are able to extract features from spectrograms which can be used for speaker clustering. These features are represented by the activations of a certain hidden layer and are called embeddings. However, previous approaches require plenty of additional speaker data to learn the embedding, and although the clustering results are then on par with more traditional approaches using MFCC features etc., room for improvement stems from the fact that these embeddings are trained with a surrogate task that is rather far away from segregating unknown voices - namely, identifying a few specific speakers.
We address both problems by training a CNN to extract embeddings that are similar for equal speakers (regardless of their specific identity) using weakly labeled data. We demonstrate our approach on the well-known TIMIT dataset that has often been used for speaker clustering experiments in the past. We exceed the clustering performance of all previous approaches, but require just 100 instead of 590 unrelated speakers to learn an embedding suited for clustering.
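As a rough illustration of how such embeddings might be used downstream, the sketch below groups utterance embeddings by cosine distance with a simple complete-linkage agglomerative clustering. The toy two-dimensional vectors, the function names and the distance threshold are assumptions made for this example only; the abstract does not specify the clustering procedure, so this should not be read as the authors' method.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cluster(embeddings, threshold):
    """Complete-linkage clustering: keep merging the two closest clusters
    until the closest pair is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Complete linkage: distance of the two most dissimilar members.
        return max(
            cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2
        )

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical embeddings: utterances 0 and 1 point in a similar direction,
# as do utterances 2 and 3 (illustrative values, not real model output).
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
clusters = agglomerative_cluster(toy_embeddings, threshold=0.2)
print(clusters)
```

With embeddings that are similar for equal speakers, utterances from the same voice end up in the same cluster without the model ever having seen that speaker's identity.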
Deep learning-based automatic analysis of social interactions from wearable data for healthcare applications
PhD Thesis
Social interactions of people with Late Life Depression (LLD) could be an objective measure
of social functioning due to the association between LLD and poor social functioning. The
utilisation of wearable computing technologies is a relatively new approach within healthcare
and well-being application sectors. Recently, the design and development of wearable
technologies and systems for health and well-being monitoring have attracted the attention of
both the clinical and scientific communities, mainly because current clinical behaviour
assessments are typically rather sporadic and are often administered in artificial
settings. As a result, they do not provide a realistic impression of a patient’s condition
and thus do not lead to adequate diagnosis and care. Wearable behaviour
monitors, by contrast, have the potential for continuous, objective assessment of behaviour and wider
social interactions, capturing naturalistic data without any constraints
on the place of recording or the typical limitations of lab-based research. Because the
recordings are unconstrained, such data from naturalistic ambient environments would
facilitate automated transmission and analysis, allowing for a more timely and accurate assessment
of depressive symptoms. In response to this artificial setting issue, this thesis focuses on
the analysis and assessment of the different aspects of social interactions in naturalistic
environments using deep learning algorithms. This could lead to improvements in both
diagnosis and treatment.
The advantages of deep learning are that it needs no hand-crafted feature
engineering, so the raw data can be used with minimal pre-processing compared to
classical machine learning approaches, and that it scales and generalises well. The
main dataset used in this thesis was recorded by a wrist-worn device designed at Newcastle
University. This device has multiple sensors, including a microphone, a tri-axial accelerometer,
a light sensor and a proximity sensor. In this thesis, only the microphone and the tri-axial accelerometer
are used for the social interaction analysis. The other sensors are not used since they need
more calibration from the user, who in this case would be an elderly person with depression,
which was not feasible in this scenario. Novel deep learning models are proposed to
automatically analyse two aspects of social interactions: verbal interactions (acoustic
communication) and physical activity (movement patterns). Verbal interactions include
the total quantity of speech, who is talking to whom and when, and how engaged
the wearer was in the conversations. The physical activity analysis includes activity
recognition, the quantity of each activity, and sleep patterns.
This thesis is composed of three main stages: two cover the acoustic analysis
and the third describes the movement pattern analysis. The acoustic analysis starts
with speech detection in which each segment of the recording is categorised as speech or
non-speech. This segment classification is achieved by a novel deep learning model that
leverages bi-directional Long Short-Term Memory with gated activation units combined
with Maxout Networks as well as a combination of two optimisers. After speech
segments have been detected in the audio data, the next stage estimates how engaged the wearer is
in any conversation throughout these speech events. It does so by detecting the wearer of the
device with a variant of the previous model that combines a convolutional autoencoder
with bi-directional Long Short-Term Memory. Following this, the system detects the
spoken parts of the main speaker (the wearer) and from them the conversational turn-taking,
but only the turn-taking between the wearer and other speakers, not among all speakers
in the conversation. This stage did not take into account the semantics of the speech because
the ethical constraints of the main dataset (Depression dataset) made it
impossible to listen to the data or to obtain any information about its contents.
Semantic analysis is therefore left for future work.
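The quantities described above (amount of speech, the wearer's share of it, and turn-taking between the wearer and others) could be derived from per-segment labels roughly as follows. This is a minimal sketch: the label names, the fixed segment length and the `summarise_conversation` helper are hypothetical, and the models that would produce the labels are not shown.

```python
def summarise_conversation(labels, segment_seconds=1.0):
    """Summarise a sequence of per-segment labels.

    Each label is assumed to be one of "non-speech", "wearer" or "other"
    (hypothetical names). Segments are assumed to have equal duration.
    """
    # Drop non-speech segments so that pauses do not count as turn changes.
    speech = [label for label in labels if label != "non-speech"]

    total_speech = len(speech) * segment_seconds
    wearer_speech = sum(1 for label in speech if label == "wearer") * segment_seconds

    # A turn change is any switch between the wearer and another speaker.
    turn_changes = sum(1 for prev, cur in zip(speech, speech[1:]) if prev != cur)

    # Share of all detected speech that was produced by the wearer.
    engagement = wearer_speech / total_speech if total_speech else 0.0

    return {
        "total_speech_seconds": total_speech,
        "wearer_speech_seconds": wearer_speech,
        "turn_changes": turn_changes,
        "wearer_engagement": engagement,
    }


# Illustrative label sequence, not taken from any dataset.
example_labels = [
    "non-speech", "wearer", "wearer", "other",
    "non-speech", "other", "wearer", "non-speech",
]
summary = summarise_conversation(example_labels, segment_seconds=1.0)
print(summary)
```

Because only "wearer" and "other" are distinguished, the count covers exchanges between the wearer and everyone else, not exchanges among the other speakers, which matches the restriction stated in the abstract.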
Stage 3 involves the physical activity analysis, that is, inferring the elementary physical
activities and movement patterns. These elementary patterns include sedentary actions,
walking, mixed activities, cycling and using vehicles, as well as sleep patterns. The predictive
model used is based on Random Forests and Hidden Markov Models. In all stages the
methods presented in this thesis have been compared to the state of the art in processing
audio and accelerometer data, respectively, to thoroughly assess their contribution. Following
these stages is a thorough analysis of the interplay between the acoustic interaction and physical
movement patterns resulting from the previous stages and the key clinical variables of
depression. The main reason for not using deep learning in this stage, unlike the
previous stages, is that the main dataset (Depression dataset) did not have any annotations
for the speech or even the activity due to the ethical constraints mentioned above. Furthermore,
the training dataset (Discussion dataset) did not have any annotations for the accelerometer
data, because the data was recorded freely and no camera was attached to the device to make
later annotation possible.
Newton-Mosharafa Fund and
the mission sector and cultural affairs, ministry of Higher Education in Egypt
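One common way to combine a per-window classifier such as a Random Forest with a Hidden Markov Model is to treat the classifier's class probabilities as emission scores and decode the most likely activity sequence with the Viterbi algorithm. The sketch below assumes that setup. The abstract does not say how the two models are combined in the thesis, so the function name, the uniform prior, the single `stay_prob` transition parameter and the example probabilities are all illustrative assumptions.

```python
import math


def viterbi_smooth(window_probs, states, stay_prob=0.9):
    """Smooth per-window class probabilities with a simple HMM.

    window_probs: one probability list per window, ordered like `states`
                  (assumed to come from a per-window classifier).
    stay_prob:    assumed probability of keeping the same activity between
                  consecutive windows; the rest is split evenly over the
                  other states. Requires at least two states.
    """
    if not window_probs:
        return []
    n = len(states)
    switch_prob = (1.0 - stay_prob) / (n - 1)
    eps = 1e-12  # avoids log(0) for zero probabilities

    def log_trans(i, j):
        return math.log(stay_prob if i == j else switch_prob)

    # Initialise with a uniform prior over states.
    score = [math.log(1.0 / n) + math.log(window_probs[0][s] + eps) for s in range(n)]
    backpointers = []

    for probs in window_probs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log_trans(i, j))
            pointers.append(best_i)
            new_score.append(
                score[best_i] + log_trans(best_i, j) + math.log(probs[j] + eps)
            )
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the final window.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[s] for s in path]


# Illustrative probabilities: the third window weakly favours "walking".
activity_states = ["sedentary", "walking"]
example_probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.85, 0.15]]
smoothed = viterbi_smooth(example_probs, activity_states, stay_prob=0.9)
print(smoothed)
```

In this toy run the isolated, low-confidence "walking" window is overridden by its neighbours, which is the kind of temporal smoothing an HMM adds on top of independent per-window predictions.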