
    Using Deep Neural Networks for Speaker Diarisation

    Speaker diarisation answers the question “who spoke when?” in an audio recording. The input may vary, but a system is required to output speaker-labelled segments in time. The typical stages are Speech Activity Detection (SAD), speaker segmentation and speaker clustering. Early research focussed on the Conversational Telephone Speech (CTS) and Broadcast News (BN) domains before the direction shifted to meetings and, more recently, broadcast media. The British Broadcasting Corporation (BBC) supplied data through the Multi-Genre Broadcast (MGB) Challenge in 2015, which highlighted the difficulties speaker diarisation systems have with broadcast media data. Diarisation is typically an unsupervised task which does not use auxiliary data or information to enhance a system. However, methods which do involve supplementary data have shown promise. Five semi-supervised methods are investigated which use a combination of inputs: different channel types and transcripts. The methods involve Deep Neural Networks (DNNs) for SAD, DNNs trained for channel detection, transcript alignment, and combinations of these approaches. However, the methods are only applicable when datasets contain the required inputs. Therefore, a method involving a pretrained Speaker Separation Deep Neural Network (ssDNN) is investigated which is applicable to every dataset. This technique performs speaker clustering and speaker segmentation using DNNs, succeeding on meeting data and achieving mixed results on broadcast media. The task of diarisation focuses on two aspects: accurate segments and accurate speaker labels. The Diarisation Error Rate (DER) does not evaluate segmentation quality, as it does not measure the number of correctly detected segments. Other metrics exist, such as boundary and purity measures, but these also mask the segmentation quality. An alternative metric is presented based on the F-measure, which considers the number of hypothesis segments correctly matched to reference segments. This metric gives a deeper insight into segment quality.
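    The abstract does not spell out the matching criterion behind this segment-level F-measure, so the following Python sketch is only one plausible reading, not the thesis's definition: a hypothesis segment counts as matched when it overlaps some reference segment by at least half the shorter of the two durations, and precision and recall are computed over segment counts rather than over time.

```python
# Hypothetical sketch of a segment-matching F-measure for diarisation.
# Segments are (start, end) tuples in seconds; the 50% overlap threshold
# is an assumption, not taken from the thesis.

def overlap(a, b):
    """Temporal overlap in seconds between segments a and b."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def segment_f_measure(reference, hypothesis, min_overlap=0.5):
    """F-measure over segment counts, rewarding correctly detected segments."""
    def matched(seg, others):
        return any(
            overlap(seg, o) >= min_overlap * min(seg[1] - seg[0], o[1] - o[0])
            for o in others
        )
    matched_hyp = sum(matched(h, reference) for h in hypothesis)
    matched_ref = sum(matched(r, hypothesis) for r in reference)
    precision = matched_hyp / len(hypothesis) if hypothesis else 0.0
    recall = matched_ref / len(reference) if reference else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

reference = [(0.0, 5.0), (5.0, 9.0)]
hypothesis = [(0.2, 4.8), (5.1, 9.2)]
print(segment_f_measure(reference, hypothesis))  # 1.0: both segments recovered
```

    Unlike a purely time-based score, such a count-based measure penalises splitting one true segment into fragments, or merging several into one, even when the total amount of correctly labelled time stays the same.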

    Productivity Measurement of Call Centre Agents using a Multimodal Classification Approach

    Call centre channels play a cornerstone role in business communications and transactions, especially in challenging business situations. Operational efficiency, service quality, and resource productivity are core aspects of call centres’ competitive advantage in rapid market competition. Performance evaluation in call centres is challenging due to subjective human evaluation, the manual sorting of massive numbers of calls, and inequality in evaluations caused by different raters. These challenges reduce operational efficiency and lead to frustrated customers. This study aims to automate performance evaluation in call centres using various deep learning approaches. Calls recorded in a call centre are modelled and classified into high- or low-performance evaluations, categorised as productive or nonproductive calls. The proposed conceptual model considers a deep learning network approach to model the recorded calls as text and speech. It is based on the following: 1) focus on the technical part of agent performance, 2) objective evaluation of the corpus, 3) extension of features for both text and speech, and 4) combination of the best accuracy from text and speech data using a multimodal structure. Accordingly, a diarisation algorithm separates the parts of the call where the agent is talking from those where the customer is. Manual annotation is also necessary to divide the modelling corpus into productive and nonproductive calls (supervised training). Krippendorff’s alpha was applied to avoid subjectivity in the manual annotation. Arabic speech recognition is then developed to transcribe the speech into text. The text features are the words, embedded using an embedding layer. For the speech features, several attempts were made to use Mel Frequency Cepstral Coefficients (MFCCs) augmented with Low-Level Descriptors (LLDs) to improve classification accuracy. The data modelling architectures for speech and text are based on CNNs, BiLSTMs, and an attention layer. The multimodal approach then combines the generated models to improve performance accuracy by concatenating the text and speech models using the joint representation methodology. The main contributions of this thesis are:
    • Developing an Arabic speech recognition method for automatic transcription of speech into text.
    • Designing several DNN architectures to improve performance evaluation using speech features based on MFCCs and LLDs.
    • Developing a Max Weight Similarity (MWS) function to outperform the SoftMax function used in the attention layer.
    • Proposing a multimodal approach that combines the text and speech models for the best performance evaluation.
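    The abstract names the building blocks (an embedding layer, CNNs, BiLSTMs, attention, and concatenation into a joint representation) without giving the exact architecture, so the sketch below is an illustrative Keras assembly rather than the thesis's model. All layer sizes are assumptions, and the attention layer with the proposed MWS function is deliberately omitted, since its definition is not given here.

```python
# Hypothetical sketch of a joint text/speech representation for classifying
# calls as productive vs. nonproductive. Dimensions are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, max_words = 20000, 200   # assumed text-side sizes
n_frames, n_feats = 500, 39          # assumed MFCC(+LLD) frame matrix shape

# Text branch: word indices -> embedding -> BiLSTM
text_in = layers.Input(shape=(max_words,), dtype="int32", name="text")
t = layers.Embedding(vocab_size, 128)(text_in)
t = layers.Bidirectional(layers.LSTM(64))(t)

# Speech branch: MFCC/LLD frames -> 1-D CNN -> global pooling
speech_in = layers.Input(shape=(n_frames, n_feats), name="speech")
s = layers.Conv1D(64, 5, activation="relu")(speech_in)
s = layers.GlobalMaxPooling1D()(s)

# Joint representation: concatenate both modalities, then classify
joint = layers.Concatenate()([t, s])
joint = layers.Dense(64, activation="relu")(joint)
out = layers.Dense(1, activation="sigmoid", name="productive")(joint)

model = Model([text_in, speech_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```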

    Detecting early signs of dementia in conversation

    Dementia can affect a person's speech, language and conversational interaction capabilities. The early diagnosis of dementia is of great clinical importance. Recent studies using the qualitative methodology of Conversation Analysis (CA) demonstrated that communication problems may be picked up during conversations between patients and neurologists and that this can be used to differentiate between patients with Neuro-degenerative Disorders (ND) and those with non-progressive Functional Memory Disorder (FMD). However, conducting manual CA is expensive and difficult to scale up for routine clinical use. This study introduces an automatic approach for processing such conversations which can help in identifying the early signs of dementia and distinguishing them from the other clinical categories (FMD, Mild Cognitive Impairment (MCI), and Healthy Control (HC)). The dementia detection system starts with a speaker diarisation module to segment an input audio file (determining who talks when). The segmented files are then passed to an automatic speech recogniser (ASR) to transcribe the utterances of each speaker. Next, the feature extraction unit extracts a number of features (CA-inspired, acoustic, lexical and word-vector) from the transcripts and audio files. Finally, a classifier is trained on the features to determine the clinical category of the input conversation. Moreover, we investigate replacing the role of the neurologist in the conversation with an Intelligent Virtual Agent (IVA) asking similar questions. We show that despite differences between the IVA-led and the neurologist-led conversations, the results achieved by the IVA are as good as those gained by the neurologists. Furthermore, the IVA can be used to administer more standard cognitive tests, such as verbal fluency tests, and produce automatic scores, which can then boost the performance of the classifier. The final blind evaluation of the system shows that the classifier can identify early signs of dementia with an acceptable level of accuracy and robustness (considering both sensitivity and specificity).
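    The abstract describes the final stage as a classifier trained on per-conversation feature vectors covering the four clinical categories. As a hedged illustration of that stage only, the following uses scikit-learn with synthetic placeholder data; the real feature set (CA-inspired, acoustic, lexical and word-vector) and the classifier actually chosen in the study are not specified here.

```python
# Minimal sketch of the classification stage, assuming each conversation has
# already been reduced to a fixed-length feature vector. Data are synthetic
# placeholders, not the study's corpus.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 25))                  # 40 conversations, 25 features (assumed)
y = np.repeat(["ND", "FMD", "MCI", "HC"], 10)  # the four clinical categories

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5)      # accuracy per fold
print(scores.mean())  # chance-level here; informative features would raise it
```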

    Comparison of diarization tools for building speaker database

    This paper compares open-source diarization toolkits (LIUM, DiarTK, ALIZE-Lia_Ral), which were designed to extract speaker identity from audio recordings without any prior information about the analysed data. The comparative study of the diarization tools was performed on three types of analysed data, including broadcast news (BN) and TV shows. The corresponding values of the achieved Diarization Error Rate (DER) measure are presented here. The automatic speaker diarization system developed by LIUM was able to identify speech segments belonging to individual speakers at a very good level. Its segmentation outputs can be used to build a speaker database.
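    For reference, DER, the measure reported in this comparison, is the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker. The sketch below computes it on a fixed frame grid with a greedy speaker mapping; standard scoring tools instead use an optimal speaker assignment and a forgiveness collar, so this is an approximation for illustration only.

```python
# Approximate frame-based DER. Segments are (start, end, speaker) tuples in
# seconds; the greedy hypothesis-to-reference speaker mapping is a
# simplification of the optimal assignment used by standard scorers.

def to_frames(segments, total, step=0.01):
    """Per-frame speaker labels on a 10 ms grid (None = non-speech)."""
    n = int(round(total / step))
    frames = [None] * n
    for start, end, spk in segments:
        for i in range(int(round(start / step)), min(int(round(end / step)), n)):
            frames[i] = spk
    return frames

def der(reference, hypothesis, total, step=0.01):
    ref = to_frames(reference, total, step)
    hyp = to_frames(hypothesis, total, step)
    # Greedily pair each hypothesis speaker with the reference speaker it
    # co-occurs with most often.
    counts = {}
    for r, h in zip(ref, hyp):
        if r is not None and h is not None:
            counts[(h, r)] = counts.get((h, r), 0) + 1
    mapping = {}
    for (h, r), _ in sorted(counts.items(), key=lambda kv: -kv[1]):
        if h not in mapping and r not in mapping.values():
            mapping[h] = r
    miss = fa = conf = ref_time = 0
    for r, h in zip(ref, hyp):
        if r is not None:
            ref_time += 1
            if h is None:
                miss += 1          # missed speech
            elif mapping.get(h) != r:
                conf += 1          # speaker confusion
        elif h is not None:
            fa += 1                # false-alarm speech
    return (miss + fa + conf) / ref_time

reference = [(0.0, 4.0, "A"), (4.0, 8.0, "B")]
hypothesis = [(0.0, 4.5, "spk1"), (4.5, 8.0, "spk2")]
print(der(reference, hypothesis, total=8.0))  # 0.0625: confusion at the boundary
```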

    Dementia detection using automatic analysis of conversations

    Neurodegenerative disorders, like dementia, can affect a person's speech, language and, as a consequence, conversational interaction capabilities. A recent study, aimed at improving dementia detection accuracy, investigated the use of Conversation Analysis (CA) of interviews between patients and neurologists as a means to differentiate between patients with progressive neurodegenerative memory disorder (ND) and those with (non-progressive) functional memory disorder (FMD). However, doing manual CA is expensive and difficult to scale up for routine clinical use. In this paper, we present an automatic classification system using an intelligent virtual agent (IVA). In particular, using two parallel corpora of respectively neurologist- and IVA-led interactions, we show that using acoustic, lexical and CA-inspired features enables ND/FMD classification rates of 90.0% for the neurologist-patient conversations, and 90.9% for the IVA-patient conversations. Analysis of the differentiating potential of individual features shows that some differences exist between the IVA- and human-led conversations, for example in the average turn length of patients.
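    One of the differentiating features mentioned above, the average turn length of patients, is easy to make concrete. The sketch below assumes diarisation output as time-ordered (start, end, speaker) segments and treats a turn as a maximal run of consecutive segments by the same speaker; the study's exact turn definition may differ.

```python
# Hypothetical computation of average turn length from diarised segments.

def average_turn_length(segments, speaker):
    """Mean duration (seconds) of `speaker`'s turns, where a turn is a
    maximal run of consecutive segments by the same speaker."""
    turns, current = [], None
    for start, end, spk in sorted(segments):
        if spk == speaker:
            if current is None:
                current = [start, end]   # a new turn begins
            else:
                current[1] = end         # the same speaker continues the turn
        elif current is not None:
            turns.append(current[1] - current[0])
            current = None
    if current is not None:
        turns.append(current[1] - current[0])
    return sum(turns) / len(turns) if turns else 0.0

segments = [(0.0, 2.0, "doctor"), (2.0, 9.5, "patient"),
            (9.5, 11.0, "doctor"), (11.0, 14.0, "patient")]
print(average_turn_length(segments, "patient"))  # (7.5 + 3.0) / 2 = 5.25
```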

    Interaction analytics for automatic assessment of communication quality in primary care

    Effective doctor-patient communication is a crucial element of health care, influencing patients’ personal and medical outcomes following the interview. The set of skills used in interpersonal interaction is complex, involving verbal and non-verbal behaviour. Precise attributes of good non-verbal behaviour are difficult to characterise, but models and studies offer insight into the relevant factors. In this PhD, I studied how the attributes of non-verbal behaviour can be automatically extracted and assessed, focusing on the turn-taking patterns and prosody of patient-clinician dialogues. I described clinician-patient communication and the tools and methods used to train and assess communication during the consultation. I then reviewed the literature on existing efforts to automate assessment, depicting an emerging domain focused on the semantic content of the exchange and a lack of investigation into interaction dynamics, notably the structure of turns and prosody. To undertake the study of these aspects, I first planned the data collection. I underlined the need for a system that follows the requirements of sensitive data collection regarding data quality and security. I went on to design a secure system which records participants’ speech as well as the body posture of the clinician. I provided an open-source implementation and supported its use by the scientific community. I investigated the automatic extraction and analysis of some non-verbal components of clinician-patient communication on an existing corpus of GP consultations. I outlined different patterns in the clinician-patient interaction and further developed explanations of known consulting behaviours, such as the general imbalance of the doctor-patient interaction and differences in the control of the conversation. I compared behaviours present in face-to-face, telephone, and video consultations, finding overall similarities alongside noticeable differences in patterns of overlapping speech and switching behaviour. I further studied non-verbal signals by analysing speech prosodic features, investigating differences in participants’ behaviour and relations between the assessment of clinician-patient communication and prosodic features. While limited in their interpretative power on the explored dataset, these signals nonetheless provide additional metrics to identify and characterise variations in the non-verbal behaviour of the participants. Analysing clinician-patient communication is difficult even for human experts, and automating that process in this work has been particularly challenging. I demonstrated the capacity of automated processing of non-verbal behaviours to analyse clinician-patient communication. I outlined the ability to explore new aspects, namely interaction dynamics, and to objectively describe how patients and clinicians interact. I further explained known aspects, such as clinician dominance, in more detail. I also provided a methodology to characterise participants’ turn-taking behaviour and speech prosody for the objective appraisal of the quality of non-verbal communication. This methodology is aimed at further use in research and education.
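    Two of the interaction-dynamics measures discussed above, overlapping speech and speaker switching, can be computed directly from the participants' speech segments. The sketch below assumes per-speaker lists of (start, end) segments; it illustrates the kind of measurement involved and is not the thesis's implementation.

```python
# Hypothetical turn-taking metrics from two speakers' segment lists.

def overlap_time(segs_a, segs_b):
    """Total seconds during which both speakers talk at once."""
    total = 0.0
    for a0, a1 in segs_a:
        for b0, b1 in segs_b:
            total += max(0.0, min(a1, b1) - max(a0, b0))
    return total

def speaker_switches(segments):
    """Count changes of floor in time-ordered (start, end, speaker) segments."""
    order = [spk for _, _, spk in sorted(segments)]
    return sum(1 for prev, cur in zip(order, order[1:]) if prev != cur)

clinician = [(0.0, 10.0), (12.0, 15.0)]
patient = [(9.0, 12.5), (15.5, 20.0)]
print(overlap_time(clinician, patient))  # 1.5 s of simultaneous speech

labelled = [(s, e, "clinician") for s, e in clinician] + \
           [(s, e, "patient") for s, e in patient]
print(speaker_switches(labelled))        # 3 changes of the floor
```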