
    Unsupervised adaptation of deep speech activity detection models to unseen domains

    Speech Activity Detection (SAD) aims to accurately classify audio fragments containing human speech. Current state-of-the-art systems for the SAD task are mainly based on deep learning solutions. These systems usually show a significant drop in performance when test data differ from training data, due to the resulting domain shift. Furthermore, machine learning algorithms require large amounts of labelled data, which may be hard to obtain in real applications. Considering both issues, in this paper we evaluate three unsupervised domain adaptation techniques applied to the SAD task. A baseline system is trained on a combination of data from different domains and then adapted to a new, unseen domain, namely data from Apollo space missions coming from the Fearless Steps Challenge. Experimental results demonstrate that domain adaptation techniques seeking to minimise the statistical distribution shift provide the most promising results. In particular, the Deep CORAL method reports a 13% relative improvement in the original evaluation metric compared to the unadapted baseline model. Further experiments show that the cascaded application of Deep CORAL and pseudo-labelling techniques can improve the results even further, yielding a significant 24% relative improvement in the evaluation metric compared to the baseline system.
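
    The core idea behind Deep CORAL is to penalise the distance between the second-order statistics (covariance matrices) of source-domain and target-domain feature activations during training. The following is a minimal sketch of that loss, assuming a PyTorch setup (the abstract does not state the authors' framework); the `lambda_coral` weight and tensor names in the usage comment are illustrative assumptions, not the paper's configuration.

```python
import torch

def coral_loss(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """CORAL loss: squared Frobenius distance between the feature covariance
    matrices of a source batch and a target batch (Sun & Saenko, 2016)."""
    d = source.size(1)  # feature dimensionality

    def covariance(x: torch.Tensor) -> torch.Tensor:
        n = x.size(0)
        x = x - x.mean(dim=0, keepdim=True)  # centre the features
        return (x.t() @ x) / (n - 1)         # (d, d) sample covariance

    c_s, c_t = covariance(source), covariance(target)
    return ((c_s - c_t) ** 2).sum() / (4 * d * d)

# Hypothetical training step: combine the supervised loss on labelled source
# data with the CORAL penalty on unlabelled target-domain features.
# loss = task_loss + lambda_coral * coral_loss(source_feats, target_feats)
```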

    Multiclass audio segmentation based on recurrent neural networks for broadcast domain data

    This paper presents a new approach, based on recurrent neural networks (RNN), to the multiclass audio segmentation task, whose goal is to classify an audio signal as speech, music, noise or a combination of these. The proposed system uses bidirectional long short-term memory (BLSTM) networks to model temporal dependencies in the signal. The RNN is complemented by a resegmentation module that gains long-term stability by means of the tied-state concept from hidden Markov models. We explore different neural architectures, introducing temporal pooling layers to reduce the neural network output sampling rate. Our findings show that removing redundant temporal information is beneficial for the segmentation system, yielding a relative improvement close to 5%. Furthermore, this solution does not increase the number of parameters of the model and reduces the number of operations per second, allowing our system to achieve a real-time factor below 0.04 when running on CPU and below 0.03 when running on GPU. This new architecture, combined with a data-agnostic data augmentation technique called mixup, allows our system to achieve competitive results on both the Albayzín 2010 and 2012 evaluation datasets, presenting relative improvements of 19.72% and 5.35% compared to the best results found in the literature for these databases.
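
    The temporal pooling idea can be illustrated with a short sketch: a pooling layer inserted between BLSTM layers reduces the frame rate, so later layers and the output operate on fewer time steps while adding no parameters. The sketch below assumes PyTorch; the layer sizes, pooling factor and four-class output are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BLSTMSegmenter(nn.Module):
    """Sketch of a BLSTM segmenter with a temporal max-pooling layer between
    recurrent layers, lowering the output sampling rate without adding
    parameters. All sizes are hypothetical."""
    def __init__(self, n_feats=64, hidden=128, n_classes=4, pool=3):
        super().__init__()
        self.blstm1 = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.pool = nn.MaxPool1d(kernel_size=pool)  # time axis: T -> T // pool
        self.blstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                                 # x: (batch, T, n_feats)
        h, _ = self.blstm1(x)
        h = self.pool(h.transpose(1, 2)).transpose(1, 2)  # pool along time
        h, _ = self.blstm2(h)
        return self.out(h)                                # (batch, T // pool, n_classes)
```

    Max-pooling over time discards redundant neighbouring frames, which is consistent with the abstract's observation that removing redundant temporal information both improves accuracy and reduces the number of operations per second.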

    Deep learning-based automatic analysis of social interactions from wearable data for healthcare applications

    PhD Thesis

    Social interactions of people with Late Life Depression (LLD) could serve as an objective measure of social functioning, given the association between LLD and poor social functioning. The use of wearable computing technologies is a relatively new approach within the healthcare and well-being sectors. Recently, the design and development of wearable technologies and systems for health and well-being monitoring have attracted the attention of both the clinical and scientific communities, mainly because current clinical practice relies on typically sporadic behaviour assessments administered in artificial settings. As a result, these assessments do not provide a realistic impression of a patient's condition and thus do not lead to sufficient diagnosis and care. Wearable behaviour monitors, however, have the potential for continuous, objective assessment of behaviour and wider social interactions, capturing naturalistic data without constraints on the place of recording or the typical limitations of lab-based research. Such data from naturalistic ambient environments would allow automated transmission and analysis, enabling a more timely and accurate assessment of depressive symptoms.

    In response to this artificial-setting issue, this thesis focuses on the analysis and assessment of different aspects of social interactions in naturalistic environments using deep learning algorithms, which could lead to improvements in both diagnosis and treatment. The advantages of deep learning are that no hand-crafted feature engineering is required, so the raw data can be used with minimal pre-processing compared to classical machine learning approaches, and that such models are scalable and able to generalise.

    The main dataset used in this thesis was recorded by a wrist-worn device designed at Newcastle University. This device has multiple sensors, including a microphone, a tri-axial accelerometer, a light sensor and a proximity sensor. In this thesis, only the microphone and tri-axial accelerometer are used for the social interaction analysis; the other sensors are not used because they require additional calibration from the user, who in this scenario is an older person with depression, making their use infeasible.

    Novel deep learning models are proposed to automatically analyse two aspects of social interactions: verbal interactions (acoustic communications) and physical activities (movement patterns). Verbal interaction analysis covers the total quantity of speech; who is talking to whom, and when; and how much the wearer engaged in conversations. The physical activity analysis includes activity recognition, the quantity of each activity, and sleep patterns.

    This thesis is composed of three main stages: two cover the acoustic analysis and the third describes the movement pattern analysis. The acoustic analysis starts with speech detection, in which each segment of the recording is categorised as speech or non-speech. This segment classification is achieved by a novel deep learning model that leverages bidirectional Long Short-Term Memory with gated activation units combined with Maxout networks, as well as a combination of two optimisers.
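
    As a rough illustration of the speech/non-speech model described above, the sketch below combines a bidirectional LSTM with a Maxout output layer in PyTorch. It is a simplification under stated assumptions: it omits the gated activation units and the two-optimiser training scheme, and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit (Goodfellow et al., 2013): several affine pieces per output
    unit, with a max over the pieces as the activation."""
    def __init__(self, in_dim, out_dim, pieces=2):
        super().__init__()
        self.out_dim, self.pieces = out_dim, pieces
        self.linear = nn.Linear(in_dim, out_dim * pieces)

    def forward(self, x):
        z = self.linear(x)                                 # (..., out_dim * pieces)
        z = z.view(*x.shape[:-1], self.out_dim, self.pieces)
        return z.max(dim=-1).values                        # max over affine pieces

class SpeechDetector(nn.Module):
    """Illustrative per-frame speech/non-speech classifier: BiLSTM features
    followed by a Maxout layer. Sizes are assumptions, not the thesis config."""
    def __init__(self, n_feats=40, hidden=64):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.maxout = Maxout(2 * hidden, 32)
        self.out = nn.Linear(32, 2)                        # speech vs. non-speech

    def forward(self, x):                                  # x: (batch, T, n_feats)
        h, _ = self.blstm(x)
        return self.out(self.maxout(h))                    # per-frame logits
```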
    After detecting speech segments in the audio data, the next stage detects how much engagement the wearer has in any conversation throughout these speech events. This is based on detecting the wearer of the device using a variant of the previous model that combines a convolutional autoencoder with bidirectional Long Short-Term Memory. The system then detects the spoken parts of the main speaker/wearer and thereby the conversational turn-taking, although it only covers turn-taking between the wearer and other speakers, not between every speaker in the conversation. This stage did not take the semantics of the speech into account: owing to the ethical constraints of the main dataset (the Depression dataset), it was not possible to listen to the data or to obtain any information about its contents, so semantic analysis is left as future work.

    Stage 3 involves the physical activity analysis, that is, inferring the elementary physical activities and movement patterns. These elementary patterns include sedentary actions, walking, mixed activities, cycling and using vehicles, as well as sleep patterns. The predictive model used is based on Random Forests and Hidden Markov Models.

    In all stages, the methods presented in this thesis have been compared to the state of the art in processing audio and accelerometer data, respectively, to thoroughly assess their contribution. These stages are followed by a thorough analysis of the interplay between the acoustic interactions and physical movement patterns obtained in the previous stages and the key clinical variables of depression. The main reason for not using deep learning in this final stage, unlike the previous stages, is that the main dataset (the Depression dataset) has no annotations for speech or activity, due to the ethical constraints mentioned above. Furthermore, the training dataset (the Discussion dataset) has no annotations for the accelerometer data, since the data were recorded freely and no camera was attached to the device that would have made later annotation possible.

    Funded by the Newton-Mosharafa Fund and the Mission Sector and Cultural Affairs, Ministry of Higher Education, Egypt.
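
    A common way to combine Random Forests with Hidden Markov Models, and one plausible reading of the Stage 3 pipeline, is to let the forest produce per-window class posteriors and then smooth them with HMM (Viterbi) decoding under a "sticky" transition matrix. The sketch below, using NumPy and scikit-learn, only illustrates that pattern; the transition probabilities, feature arrays and class count are assumptions, not the thesis's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def viterbi_smooth(log_probs: np.ndarray, log_trans: np.ndarray) -> np.ndarray:
    """Viterbi decoding over per-window class log-posteriors (from the RF),
    using a transition matrix to enforce temporal smoothness."""
    T, K = log_probs.shape
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = log_probs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (K, K): previous -> current
        psi[t] = scores.argmax(axis=0)               # best predecessor per state
        delta[t] = scores.max(axis=0) + log_probs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                   # backtrack the best path
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Hypothetical pipeline: train the RF on windowed accelerometer features,
# then smooth its frame-wise posteriors with the decoder above.
# rf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
# post = np.clip(rf.predict_proba(X_test), 1e-8, None)
# K = post.shape[1]
# stay = 0.98                                        # assumed self-transition prob.
# trans = np.full((K, K), (1 - stay) / (K - 1)) + np.eye(K) * (stay - (1 - stay) / (K - 1))
# labels = viterbi_smooth(np.log(post), np.log(trans))
```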