
    Learning embeddings for speaker clustering based on voice equality

    Recent work has shown that convolutional neural networks (CNNs) trained in a supervised fashion for speaker identification can extract features from spectrograms that are useful for speaker clustering. These features, taken from the activations of a chosen hidden layer, are called embeddings. However, previous approaches require a large amount of additional speaker data to learn the embedding, and although the resulting clustering is on par with more traditional approaches using MFCC features and the like, there is room for improvement: the embeddings are trained on a surrogate task, identifying a few specific speakers, that is rather far removed from the actual goal of segregating unknown voices. We address both problems by training a CNN to extract embeddings that are similar for the same speaker (regardless of that speaker's specific identity) using weakly labelled data. We demonstrate our approach on the well-known TIMIT dataset, which has often been used for speaker clustering experiments in the past. We exceed the clustering performance of all previous approaches while requiring just 100 instead of 590 unrelated speakers to learn an embedding suited for clustering.
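    The abstract does not spell out the training objective; one common way to make embeddings "similar for the same speaker" from weak same/different-speaker labels is a contrastive pairwise loss. A minimal PyTorch sketch of that idea follows; the function and parameter names are hypothetical, and the paper's actual loss may differ:

```python
import torch
import torch.nn.functional as F

def pairwise_speaker_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Contrastive loss on embedding pairs: pairs from the same speaker
    are pulled together, pairs from different speakers are pushed at
    least `margin` apart. `same_speaker` is a 0/1 float tensor."""
    dist = F.pairwise_distance(emb_a, emb_b)                   # (batch,)
    pos = same_speaker * dist.pow(2)                           # same speaker: shrink distance
    neg = (1.0 - same_speaker) * F.relu(margin - dist).pow(2)  # different: enforce margin
    return (pos + neg).mean()

# Example with dummy embeddings standing in for a CNN's hidden-layer activations:
emb_a, emb_b = torch.randn(8, 128), torch.randn(8, 128)
labels = torch.randint(0, 2, (8,)).float()
loss = pairwise_speaker_loss(emb_a, emb_b, labels)
```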

    Deep learning-based automatic analysis of social interactions from wearable data for healthcare applications

    PhD Thesis

    Social interactions of people with Late Life Depression (LLD) could serve as an objective measure of social functioning, given the association between LLD and poor social functioning. The use of wearable computing technologies is a relatively new approach within the healthcare and well-being sectors. Recently, the design and development of wearable technologies and systems for health and well-being monitoring have attracted the attention of both the clinical and scientific communities, largely because current clinical practice relies on typically sporadic behaviour assessments administered in artificial settings. As a result, such assessments do not provide a realistic impression of a patient's condition and therefore do not support sufficient diagnosis and care. Wearable behaviour monitors, in contrast, have the potential for continuous, objective assessment of behaviour and wider social interactions, capturing naturalistic data without constraints on the place of recording or the typical limitations of lab-based research. Data from naturalistic ambient environments would also facilitate automated transmission and analysis, allowing for a more timely and accurate assessment of depressive symptoms.

    In response to this artificial-setting issue, this thesis focuses on the analysis and assessment of different aspects of social interactions in naturalistic environments using deep learning algorithms, which could lead to improvements in both diagnosis and treatment. The advantages of deep learning are that it requires no hand-crafted feature engineering, so the raw data can be used with minimal pre-processing compared to classical machine learning approaches, and that it is scalable and able to generalise.

    The main dataset used in this thesis was recorded with a wrist-worn device designed at Newcastle University. This device carries multiple sensors, including a microphone, a tri-axial accelerometer, a light sensor and a proximity sensor. Only the microphone and the tri-axial accelerometer are used for the social interaction analysis; the other sensors are not used because they require additional calibration from the user, who in this scenario would be an elderly person with depression, making their use infeasible.

    Novel deep learning models are proposed to automatically analyse two aspects of social interactions: verbal interactions (acoustic communication) and physical activities (movement patterns). Verbal interactions include the total quantity of speech; who is talking to whom, and when; and how much the wearer engaged in conversations. The physical activity analysis includes activity recognition, the quantity of each activity, and sleep patterns.

    The thesis is composed of three main stages: two concern the acoustic analysis and the third describes the movement pattern analysis. The acoustic analysis starts with speech detection, in which each segment of the recording is categorised as speech or non-speech. This segment classification is achieved by a novel deep learning model that combines a bi-directional Long Short-Term Memory network having gated activation units with Maxout Networks, trained using a combination of two optimisers.
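    As a rough illustration of this kind of architecture, here is a minimal PyTorch sketch of a BiLSTM speech detector with a Maxout head. All layer sizes and feature dimensions are assumptions, and the thesis's gated activation units and dual-optimiser training scheme are omitted:

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout layer: k parallel linear maps, element-wise maximum."""
    def __init__(self, in_dim, out_dim, k=2):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.linear = nn.Linear(in_dim, out_dim * k)

    def forward(self, x):
        y = self.linear(x).view(*x.shape[:-1], self.out_dim, self.k)
        return y.max(dim=-1).values

class SpeechDetector(nn.Module):
    """BiLSTM over per-frame acoustic features, Maxout head,
    frame-wise speech / non-speech logits."""
    def __init__(self, n_features=40, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden,
                             batch_first=True, bidirectional=True)
        self.maxout = Maxout(2 * hidden, 64)
        self.out = nn.Linear(64, 2)          # speech vs. non-speech

    def forward(self, x):                    # x: (batch, time, n_features)
        h, _ = self.blstm(x)
        return self.out(self.maxout(h))

detector = SpeechDetector()
logits = detector(torch.randn(4, 200, 40))   # 4 recordings, 200 frames, 40 features
```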
    After detecting speech segments in the audio data, the next stage is estimating how much the wearer engaged in any conversation during these speech events. This relies on detecting the wearer of the device using a variant of the previous model that combines a convolutional autoencoder with a bi-directional Long Short-Term Memory network. The system then detects the spoken parts of the main speaker (the wearer) and from these derives the conversational turn-taking, though only the turn-taking between the wearer and other speakers, not between every speaker in the conversation. This stage does not take the semantics of the speech into account: owing to the ethical constraints of the main dataset (the Depression dataset), it was not possible to listen to the recordings or to obtain any information about their contents, so semantic analysis is left for future work.

    Stage 3 covers the physical activity analysis, i.e. inferring elementary physical activities and movement patterns. These elementary patterns include sedentary actions, walking, mixed activities, cycling and using vehicles, as well as sleep patterns. The predictive model is based on Random Forests and Hidden Markov Models. At every stage, the methods presented in this thesis are compared to the state of the art in processing audio and accelerometer data, respectively, to thoroughly assess their contribution.

    These stages are followed by a thorough analysis of the interplay between the acoustic interactions and physical movement patterns produced by the previous stages and the key clinical variables of depression. The main reason for not using deep learning in this final stage, unlike the previous ones, is that the main dataset (the Depression dataset) has no annotations for speech or activity, due to the ethical constraints mentioned above. Furthermore, the training dataset (the Discussion dataset) has no annotations for the accelerometer data, since the data was recorded freely and no camera was attached to the device that would have made retrospective annotation possible.

    Funding: Newton-Mosharafa Fund and the Missions Sector and Cultural Affairs, Ministry of Higher Education, Egypt.
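    For the Random Forest plus Hidden Markov Model stage, a plausible (hypothetical) arrangement is to let the forest emit per-window class probabilities and then smooth the label sequence with a Viterbi pass over a "sticky" transition matrix. In practice the transition and prior probabilities would be estimated from labelled data rather than fixed as below:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def viterbi_smooth(log_emis, log_trans, log_prior):
    """Most likely state sequence given per-window log-probabilities
    (T, K), a log transition matrix (K, K) and a log prior (K,)."""
    T, K = log_emis.shape
    delta = log_prior + log_emis[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

K = 5                                         # e.g. sedentary, walking, mixed, cycling, vehicle
rf = RandomForestClassifier(n_estimators=200)
# rf.fit(X_train, y_train)                    # windowed accelerometer features (not shown)
# probs = rf.predict_proba(X_windows)         # (T, K) per-window class probabilities
trans = np.full((K, K), 0.02 / (K - 1))       # sticky transitions discourage rapid switching
np.fill_diagonal(trans, 0.98)
# smoothed = viterbi_smooth(np.log(probs + 1e-12), np.log(trans), np.log(np.full(K, 1 / K)))
```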

    Speaker diarization through speaker embeddings

    No full text; no abstract available.