A Speaker Diarization System for Studying Peer-Led Team Learning Groups
Peer-led team learning (PLTL) is a model for teaching STEM courses where
small student groups meet periodically to collaboratively discuss coursework.
Automatic analysis of PLTL sessions would help education researchers to get
insight into how learning outcomes are impacted by individual participation,
group behavior, team dynamics, etc. Towards this, speech and language
technology can help, and speaker diarization technology will lay the foundation
for analysis. In this study, a new corpus called CRSS-PLTL is established, which
contains speech data from 5 PLTL teams over a semester (10 sessions per team
with 5-to-8 participants in each team). In CRSS-PLTL, every participant wears a
LENA device (portable audio recorder) that provides multiple audio recordings
of the event. Our proposed solution is unsupervised and contains a new online
speaker change detection algorithm, termed the G3 algorithm, in conjunction with
Hausdorff-distance based clustering to provide improved detection accuracy.
Additionally, we exploit cross-channel information to refine our
diarization hypothesis. The proposed system provides good improvements in
diarization error rate (DER) over the baseline LIUM system. We also present
higher level analysis such as the number of conversational turns taken in a
session, and speaking-time duration (participation) for each speaker.
Comment: 5 Pages, 2 Figures, 2 Tables, Proceedings of INTERSPEECH 2016, San
Francisco, US
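As a rough illustration of the distance that the clustering step above is named after, the following minimal sketch computes the Hausdorff distance between two sets of feature vectors. It is not the authors' system: the function names, the use of Euclidean distance, and the toy two-dimensional "segments" are all assumptions made for illustration.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical feature vectors for three speech segments (toy 2-D values).
seg_a = [(0.0, 0.0), (1.0, 0.0)]
seg_b = [(0.0, 1.0), (1.0, 1.0)]
seg_c = [(5.0, 5.0), (6.0, 5.0)]

# Segments whose feature sets are close would be candidates for the same cluster.
print(hausdorff(seg_a, seg_b))
print(hausdorff(seg_a, seg_c))
```

In this toy data, seg_a lies much closer to seg_b than to seg_c, so a distance-based clustering would group the first two together.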
Deep learning-based automatic analysis of social interactions from wearable data for healthcare applications
PhD Thesis
Social interactions of people with Late Life Depression (LLD) could be an objective measure
of social functioning due to the association between LLD and poor social functioning. The
utilisation of wearable computing technologies is a relatively new approach within healthcare
and well-being application sectors. Recently, the design and development of wearable
technologies and systems for health and well-being monitoring have attracted attention from both
the clinical and scientific communities. This is mainly because current clinical
behaviour assessments, which are typically rather sporadic, are often administered in artificial
settings. As a result, they do not provide a realistic impression of a patient’s condition
and thus do not lead to sufficient diagnosis and care. However, wearable behaviour
monitors have the potential for continuous, objective assessment of behaviour and wider
social interactions, thereby allowing naturalistic data to be captured without any constraints
on the place of recording or the typical limitations of lab-based research. Such data from
naturalistic ambient environments would facilitate automated transmission and analysis,
as there are no constraints on the recordings, allowing for a more timely and accurate assessment
of depressive symptoms. In response to this artificial setting issue, this thesis focuses on
the analysis and assessment of the different aspects of social interactions in naturalistic
environments using deep learning algorithms. This could lead to improvements in both
diagnosis and treatment.
The advantages of using deep learning are that it needs no hand-crafted feature
engineering, so raw data can be used with minimal pre-processing compared to
classical machine learning approaches, and that it is scalable and able to generalise. The
main dataset used in this thesis was recorded by a wrist-worn device designed at Newcastle
University. This device has multiple sensors, including a microphone, a tri-axial accelerometer,
a light sensor and a proximity sensor. In this thesis, only the microphone and tri-axial accelerometer
are used for the social interaction analysis. The other sensors are not used since they need
more calibration by the user, who in this case would be an elderly person with depression,
which was not feasible in this scenario. Novel deep learning models are proposed to
automatically analyse two aspects of social interactions (the verbal interactions/acoustic
communications and physical activities/movement patterns). Verbal interactions include
the total quantity of speech, who is talking to whom and when, and how much the wearer
engaged in the conversations. The physical activity analysis includes activity
recognition, the quantity of each activity, and sleep patterns.
This thesis is composed of three main stages: two of them discuss the acoustic analysis
and the third describes the movement pattern analysis. The acoustic analysis starts
with speech detection in which each segment of the recording is categorised as speech or
non-speech. This segment classification is achieved by a novel deep learning model that
leverages bi-directional Long Short-Term Memory with gated activation units combined
with Maxout Networks as well as a combination of two optimisers. After speech
segments are detected in the audio data, the next stage estimates how much the wearer engages
in conversation during these speech events by detecting the wearer of the
device, using a variant of the previous model that combines a convolutional autoencoder
with bi-directional Long Short-Term Memory. The system then detects the
parts spoken by the main speaker (the wearer) and, from these, the conversational turn-taking,
covering only turn-taking between the wearer and other speakers rather than every speaker
in the conversation. This stage did not take into account the semantic content of the speech,
because the ethical constraints of the main dataset (Depression dataset) made it
impossible to listen to the data or to have any information about its content.
Semantic analysis is therefore left for future work.
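The wearer-centred engagement and turn-taking measures described above can be illustrated with a small sketch. It assumes the two detection stages have already produced one label per audio segment. The label names, the function names, and the definition of engagement as the wearer's share of speech segments are hypothetical choices for illustration, not the thesis's actual method.

```python
from itertools import groupby

# Hypothetical per-segment labels produced by the detection stages.
NON_SPEECH, WEARER, OTHER = "non_speech", "wearer", "other"

def wearer_engagement(labels):
    """Fraction of speech segments attributed to the wearer (0.0 if no speech)."""
    speech = [x for x in labels if x != NON_SPEECH]
    if not speech:
        return 0.0
    return sum(1 for x in speech if x == WEARER) / len(speech)

def wearer_turn_exchanges(labels):
    """Count floor changes between the wearer and others, ignoring non-speech gaps."""
    speech = [x for x in labels if x != NON_SPEECH]
    runs = [key for key, _ in groupby(speech)]  # collapse consecutive repeats
    return max(len(runs) - 1, 0)

labels = [WEARER, WEARER, NON_SPEECH, OTHER, OTHER,
          WEARER, NON_SPEECH, NON_SPEECH, OTHER]

print(wearer_engagement(labels))
print(wearer_turn_exchanges(labels))
```

Because all non-wearer speakers share a single label, this only captures exchanges between the wearer and everyone else, which matches the restriction stated in the abstract.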
Stage 3 involves the physical activity analysis, which infers the elementary physical
activities and movement patterns. These elementary patterns include sedentary actions,
walking, mixed activities, cycling and using vehicles, as well as sleep patterns. The predictive
model used is based on Random Forests and Hidden Markov Models. In all stages the
methods presented in this thesis have been compared to the state of the art in processing
audio and accelerometer data, respectively, to thoroughly assess their contribution. Following
these stages is a thorough analysis of the interplay between acoustic interaction, physical
movement patterns and the key clinical variables of depression, based on the outcomes of
the previous stages. The main reason for not using deep learning in Stage 3, unlike the
previous stages, is that the main dataset (Depression dataset) did not have any annotations
for the speech or the activity due to the ethical constraints mentioned above. Furthermore,
the training dataset (Discussion dataset) did not have any annotations for the accelerometer
data, because the data was recorded freely and no camera was attached to the device to make
later annotation possible.
Newton-Mosharafa Fund and
the Mission Sector and Cultural Affairs, Ministry of Higher Education in Egypt