
    Automatic hierarchical attention neural network for detecting AD

    Picture description tasks are used for the detection of cognitive decline associated with Alzheimer’s disease (AD). Recent years have seen work on automatic AD detection in picture descriptions based on acoustic and word-based analysis of the speech. These methods have shown some success but lack the ability to capture higher-level effects of cognitive decline on the patient’s language. In this paper, we propose a novel model that captures both the hierarchical and the sequential structure of the description and detects its informative units with an attention mechanism. Automatic speech recognition (ASR) and punctuation restoration are used to transcribe and segment the data. Using the DementiaBank database of people with AD as well as healthy controls (HC), we obtain F-scores of 84.43% and 74.37% when using manual and automatic transcripts, respectively. We further explore the effect of adding data (a total of 33 descriptions collected using a ‘digital doctor’) during model training, and increase the F-score when using ASR transcripts to 76.09%. This outperforms baseline models, including a bidirectional LSTM and a bidirectional hierarchical neural network without an attention mechanism, and demonstrates that the use of hierarchical models with an attention mechanism improves AD/HC discrimination performance.
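
    A minimal sketch of the hierarchical attention idea described above: a word-level BiLSTM with attention builds sentence vectors, a sentence-level BiLSTM with attention builds a description vector, and a linear layer classifies AD vs HC. The class names, layer sizes, and two-class output below are illustrative assumptions, not the authors' implementation.

```python
# Hierarchical attention sketch (word -> sentence -> description); sizes are illustrative.
import torch
import torch.nn as nn


class Attention(nn.Module):
    """Additive attention that pools a sequence of vectors into one vector."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Linear(dim, 1, bias=False)

    def forward(self, x):                                   # x: (batch, seq_len, dim)
        scores = self.context(torch.tanh(self.proj(x)))     # (batch, seq_len, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * x).sum(dim=1)                     # (batch, dim)


class HierarchicalAttentionClassifier(nn.Module):
    """Word-level BiLSTM+attention -> sentence vectors; sentence-level
    BiLSTM+attention -> description vector; linear layer -> AD/HC logits."""
    def __init__(self, vocab_size, emb_dim=100, hidden=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.word_rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.word_attn = Attention(2 * hidden)
        self.sent_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.sent_attn = Attention(2 * hidden)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, docs):                    # docs: (batch, n_sents, n_words) word ids
        b, s, w = docs.shape
        words = self.emb(docs.view(b * s, w))                # (b*s, w, emb_dim)
        word_states, _ = self.word_rnn(words)
        sent_vecs = self.word_attn(word_states).view(b, s, -1)
        sent_states, _ = self.sent_rnn(sent_vecs)
        doc_vec = self.sent_attn(sent_states)
        return self.out(doc_vec)                             # logits for AD vs HC


if __name__ == "__main__":
    model = HierarchicalAttentionClassifier(vocab_size=5000)
    fake_batch = torch.randint(1, 5000, (4, 6, 20))  # 4 descriptions, 6 sentences, 20 words
    print(model(fake_batch).shape)                   # torch.Size([4, 2])
```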

    Linguistic- and Acoustic-based Automatic Dementia Detection using Deep Learning Methods

    Dementia can affect a person's speech and language abilities, even in the early stages. Dementia is incurable, but early detection can enable treatment that slows decline and helps maintain mental function. Therefore, early diagnosis of dementia is of great importance. However, current dementia detection procedures in clinical practice are expensive, invasive, and sometimes inaccurate. In comparison, computational tools based on the automatic analysis of spoken language have the potential to be applied as a cheap, easy-to-use, and objective clinical assistance tool for dementia detection. In recent years, several studies have shown promise in this area. However, most studies focus heavily on the machine learning aspects and, as a consequence, often lack sufficient incorporation of clinical knowledge. Many studies also concentrate on clinically less relevant tasks, such as the distinction between HC and people with AD, which is relatively easy and therefore less interesting both in terms of machine learning and of clinical application. The studies in this thesis concentrate on automatically identifying signs of neurodegenerative dementia in the early stages and distinguishing them from other clinical diagnostic categories related to memory problems: functional memory disorder (FMD), mild cognitive impairment (MCI), and HC. A key focus when designing the proposed systems has been to better consider (and incorporate) currently used clinical knowledge and to bear in mind how these machine-learning-based systems could be translated for use in real clinical settings. Firstly, a state-of-the-art end-to-end system is constructed for extracting linguistic information from automatically transcribed spontaneous speech. The system's architecture is based on hierarchical principles, mimicking those used in clinical practice, where information at word, sentence, and paragraph level is used when extracting information for diagnosis. Secondly, hand-crafted features are designed based on clinical knowledge of the importance of pausing and rhythm; these are successfully combined with features extracted from the end-to-end system. Thirdly, different classification tasks are explored, each set up to represent the types of diagnostic decision-making relevant in clinical practice. Finally, experiments are conducted to explore how to better deal with the known problem of the confounding and overlapping effects of age and cognitive decline on speech and language: a multi-task system is constructed that takes age into account while predicting cognitive decline. The studies use the publicly available DementiaBank dataset as well as the IVA dataset, which has been collected by our collaborators at the Royal Hallamshire Hospital, UK. In conclusion, this thesis proposes multiple methods of using speech and language information for dementia detection with state-of-the-art deep learning technologies, confirming the potential of automatic systems for dementia detection.
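
    The clinically motivated pause and rhythm features mentioned above can be illustrated with a small sketch that derives simple pausing statistics from word-level timestamps (e.g. from ASR or a forced aligner). The pause thresholds, feature names, and Word structure are illustrative assumptions, not the features actually used in the thesis.

```python
# Sketch of pause/rhythm features from word-level timestamps; thresholds are illustrative.
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


def pause_features(words, short_pause=0.15, long_pause=0.5):
    """Return simple pausing/rhythm statistics for one recording."""
    gaps = [b.start - a.end for a, b in zip(words, words[1:]) if b.start > a.end]
    spoken = sum(w.end - w.start for w in words)
    total = words[-1].end - words[0].start if words else 0.0
    return {
        "n_short_pauses": sum(short_pause <= g < long_pause for g in gaps),
        "n_long_pauses": sum(g >= long_pause for g in gaps),
        "mean_pause_dur": mean(gaps) if gaps else 0.0,
        "pause_time_ratio": (total - spoken) / total if total else 0.0,
        "speech_rate_wps": len(words) / total if total else 0.0,  # words per second
    }


if __name__ == "__main__":
    demo = [Word("the", 0.0, 0.2), Word("boy", 0.25, 0.6),
            Word("is", 1.4, 1.55), Word("falling", 1.6, 2.1)]
    print(pause_features(demo))
```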

    Analysis and automatic identification of spontaneous emotions in speech from human-human and human-machine communication

    This research mainly focuses on improving our understanding of human-human and human-machine interactions by analysing participants' emotional status. For this purpose, we have developed and enhanced Speech Emotion Recognition (SER) systems for both kinds of interaction in real-life scenarios, with an explicit emphasis on the Spanish language. In this framework, we have conducted an in-depth analysis of how humans express emotions in speech when communicating with other people or with machines in actual situations. Thus, we have analysed and studied the way in which emotional information is expressed in a variety of true-to-life environments, which is a crucial aspect for the development of SER systems. This study aimed to comprehensively understand the challenge we wanted to address: identifying emotional information in speech using machine learning technologies. Neural networks have been demonstrated to be adequate tools for identifying events in speech and language. Most of the experiments aimed to make local comparisons between specific aspects; thus, the experimental conditions were tailored to each particular analysis. The experiments across the different articles (from P1 to P19) are therefore hardly comparable, owing to our continuous learning while dealing with the difficult task of identifying emotions in speech. In order to make a fair comparison, additional unpublished results are presented in the Appendix. These experiments were carried out under identical and rigorous conditions. This general comparison offers an overview of the advantages and disadvantages of the different methodologies for the automatic recognition of emotions in speech.

    Bag-of-words representations for computer audition

    Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning, which takes the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue, nevertheless, is that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms, as they require a static and fixed-length input. Moreover, even for dynamic classifiers, it can be beneficial to compress the information in the LLDs by summarising them over a temporal block. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired by the bag-of-words method in natural language processing, which represents a document as a histogram of the terms it contains. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis. The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits while preserving advantages of functionals such as data-independence. Furthermore, it is shown that both representations are complementary and that their fusion improves the performance of a machine listening system.
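
    The core BoAW idea can be sketched in a few lines: learn a codebook over frame-level acoustic descriptors, quantise each frame to its nearest codeword, and represent a recording as a normalised codeword histogram. The function names, codebook size, and use of k-means below are assumptions for illustration; this is a simplified stand-in for what a toolkit like openXBOW produces, not its implementation.

```python
# Bag-of-audio-words sketch: k-means codebook over LLD frames, then per-recording histograms.
import numpy as np
from sklearn.cluster import KMeans


def learn_codebook(lld_frames, n_codewords=64, seed=0):
    """lld_frames: (n_frames, n_llds) array pooled over the training recordings."""
    return KMeans(n_clusters=n_codewords, n_init=10, random_state=seed).fit(lld_frames)


def boaw_histogram(codebook, recording_llds):
    """Quantise one recording's frames and return a normalised term-frequency histogram."""
    assignments = codebook.predict(recording_llds)
    hist = np.bincount(assignments, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_llds = rng.normal(size=(2000, 13))       # e.g. 13 MFCC-like LLDs per frame
    codebook = learn_codebook(train_llds)
    one_recording = rng.normal(size=(300, 13))
    print(boaw_histogram(codebook, one_recording).shape)   # (64,)
```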

    Multimodal machine learning in medical screenings

    The healthcare industry, with its high demand and standards, has long been considered a crucial area for technology-based innovation. However, the medical field often relies on experience-based evaluation, and limited resources, overloaded capacity, and a lack of accessibility can hinder the timely delivery of medical care and diagnosis. In light of these challenges, automated medical screening as a decision-making aid is highly recommended. With the increasing availability of data and the need to exploit the complementary effects among modalities, multimodal machine learning has emerged as a promising area of technology. Its impact has been witnessed across a wide range of domains, prompting the question of how far machine learning can be leveraged to automate processes in even more complex and high-risk sectors. This paper examines multimodal machine learning for automated medical screening and evaluates its potential for mental disorder detection, a highly important area of healthcare. First, we conduct a scoping review targeted at high-impact papers to highlight the trends and directions of multimodal machine learning in screening prevalent mental disorders such as depression, stress, and bipolar disorder. The review provides a comprehensive list of popular datasets and extensively studied modalities, and proposes an end-to-end pipeline for multimodal machine learning applications, covering essential steps from preprocessing, representation, and fusion to modelling and evaluation. While cross-modality interaction has been considered a promising mechanism for fusing multiple modalities, few existing multimodal fusion methods employ it. This study therefore investigates multimodal fusion in more detail through the proposal of Autofusion, an autoencoder-infused fusion technique that harnesses the cross-modality interaction among different modalities. The technique is evaluated on DementiaBank’s Pitt corpus to detect Alzheimer’s disease. Autofusion achieves a promising performance of 79.89% accuracy, 83.85% recall, 81.72% precision, and 82.47% F1, and consistently outperforms all unimodal methods by an average of 5.24% across all metrics. It also consistently outperforms early and late fusion; against the late-fusion hard-voting technique in particular, it gains an average of 20% across all metrics. Further, empirical results show that the cross-modality interaction term enhances model performance by 2-3% across metrics. This research highlights the promising impact of cross-modality interaction in multimodal machine learning and calls for further research to unlock its full potential.
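
    A rough sketch of autoencoder-based fusion with an explicit cross-modality interaction term, in the spirit of the Autofusion idea described above. The layer sizes, the element-wise interaction, the input dimensions, and the loss weighting are illustrative assumptions, not the paper's exact architecture.

```python
# Autoencoder-infused fusion sketch with an element-wise cross-modality interaction term.
import torch
import torch.nn as nn


class AutoencoderFusion(nn.Module):
    def __init__(self, dim_audio, dim_text, latent=32, n_classes=2):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_audio, latent), nn.ReLU())
        self.enc_t = nn.Sequential(nn.Linear(dim_text, latent), nn.ReLU())
        # the decoder reconstructs the concatenated inputs from the fused code
        self.dec = nn.Linear(3 * latent, dim_audio + dim_text)
        self.clf = nn.Linear(3 * latent, n_classes)

    def forward(self, x_audio, x_text):
        z_a, z_t = self.enc_a(x_audio), self.enc_t(x_text)
        interaction = z_a * z_t                       # element-wise cross-modality term
        fused = torch.cat([z_a, z_t, interaction], dim=-1)
        return self.clf(fused), self.dec(fused)


if __name__ == "__main__":
    model = AutoencoderFusion(dim_audio=88, dim_text=768)   # hypothetical feature sizes
    xa, xt = torch.randn(8, 88), torch.randn(8, 768)
    logits, recon = model(xa, xt)
    # train with a weighted sum of classification and reconstruction losses
    loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,))) \
        + 0.1 * nn.functional.mse_loss(recon, torch.cat([xa, xt], dim=-1))
    print(logits.shape, loss.item())
```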

    Interaction analytics for automatic assessment of communication quality in primary care

    Effective doctor-patient communication is a crucial element of health care, influencing patients’ personal and medical outcomes following the interview. The set of skills used in interpersonal interaction is complex, involving verbal and non-verbal behaviour. Precise attributes of good non-verbal behaviour are difficult to characterise, but models and studies offer insight into the relevant factors. In this PhD, I studied how attributes of non-verbal behaviour can be automatically extracted and assessed, focusing on the turn-taking patterns and prosody of patient-clinician dialogues. I described clinician-patient communication and the tools and methods used to train and assess communication during the consultation. I then reviewed the literature on existing efforts to automate assessment, depicting an emerging domain focused on the semantic content of the exchange, with little investigation of interaction dynamics, notably the structure of turns and prosody. To undertake the study of these aspects, I initially planned the collection of data. I underlined the need for a system that meets the requirements of sensitive data collection regarding data quality and security, and went on to design a secure system which records participants’ speech as well as the body posture of the clinician. I provided an open-source implementation and supported its use by the scientific community. I investigated the automatic extraction and analysis of some non-verbal components of clinician-patient communication on an existing corpus of GP consultations. I outlined different patterns in the clinician-patient interaction and further developed explanations of known consulting behaviours, such as the general imbalance of the doctor-patient interaction and differences in the control of the conversation. I compared behaviours present in face-to-face, telephone, and video consultations, finding overall similarities alongside noticeable differences in patterns of overlapping speech and switching behaviour. I further studied non-verbal signals by analysing prosodic features of speech, investigating differences in participants’ behaviour and relations between the assessment of clinician-patient communication and prosodic features. While limited in their interpretative power on the explored dataset, these signals nonetheless provide additional metrics to identify and characterise variations in the non-verbal behaviour of the participants. Analysing clinician-patient communication is difficult even for human experts, and automating that process in this work has been particularly challenging. I demonstrated the capacity of automated processing of non-verbal behaviours to analyse clinician-patient communication, and outlined its ability to explore new aspects, namely interaction dynamics, and to objectively describe how patients and clinicians interact. I further explained known aspects, such as clinician dominance, in more detail. I also provided a methodology to characterise participants’ turn-taking behaviour and speech prosody for the objective appraisal of the quality of non-verbal communication. This methodology is aimed at further use in research and education.
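
    A small sketch of how turn-taking statistics of the kind discussed above might be derived from diarised doctor-patient speech segments. The Segment format and the specific metrics (talk share, speaker switches, overlap between consecutive segments) are illustrative assumptions rather than the thesis' exact pipeline.

```python
# Turn-taking statistics sketch from diarised (speaker, start, end) segments.
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str   # e.g. "clinician" or "patient"
    start: float   # seconds
    end: float


def turn_taking_stats(segments):
    segments = sorted(segments, key=lambda s: s.start)
    talk_time, switches, overlap = {}, 0, 0.0
    for prev, cur in zip(segments, segments[1:]):
        if cur.speaker != prev.speaker:
            switches += 1
        # overlapping speech between consecutive segments only
        overlap += max(0.0, min(prev.end, cur.end) - cur.start)
    for s in segments:
        talk_time[s.speaker] = talk_time.get(s.speaker, 0.0) + (s.end - s.start)
    total = sum(talk_time.values()) or 1.0
    return {
        "talk_share": {spk: t / total for spk, t in talk_time.items()},
        "n_speaker_switches": switches,
        "overlap_seconds": overlap,
    }


if __name__ == "__main__":
    demo = [Segment("clinician", 0.0, 8.0), Segment("patient", 7.5, 12.0),
            Segment("clinician", 12.5, 30.0)]
    print(turn_taking_stats(demo))
```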

    Multimodal Assessment of Cognitive Decline: Applications in Alzheimer’s Disease and Depression

    The initial diagnosis and assessment of cognitive decline are generally based on the judgement of clinicians and on commonly used semi-structured interviews, guided by pre-determined sets of topics, in a clinical setting. Publicly available multimodal datasets have provided an opportunity to explore a range of experiments in the automatic detection of cognitive decline. Drawing on the latest developments in representation learning, machine learning, and natural language processing, we seek to develop models capable of identifying cognitive decline, with an eye to discovering the differences and commonalities that should be considered in the computational treatment of mental health disorders. We present models that learn indicators of cognitive decline from audio and visual modalities as well as from lexical, syntactic, disfluency, and pause information. Our study is carried out in two parts: moderation analysis and predictive modelling. We experiment with different fusion techniques, motivated by recent efforts in multimodal fusion for classifying cognitive states, aiming to capture the interaction between modalities and to maximise the use and combination of each modality. We create tools for detecting cognitive decline and use them to analyse three major datasets containing speech produced by people with and without cognitive decline. These findings are being used to develop multimodal models for the detection of depression and Alzheimer’s dementia.
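
    To make the contrast between fusion techniques concrete, a minimal sketch of early fusion (feature concatenation) versus late fusion (averaging per-modality probabilities) is shown below. The synthetic features, feature dimensions, and choice of logistic regression are illustrative assumptions only, not the study's models.

```python
# Early vs late fusion sketch on synthetic two-modality features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(100, 20))     # e.g. acoustic features per participant
X_text = rng.normal(size=(100, 30))      # e.g. lexical/syntactic/pause features
y = rng.integers(0, 2, size=100)         # 0 = control, 1 = cognitive decline

# Early fusion: concatenate modalities, train one classifier.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_text]), y)

# Late fusion: train one classifier per modality, average their probabilities.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_t = LogisticRegression(max_iter=1000).fit(X_text, y)
late_probs = (clf_a.predict_proba(X_audio) + clf_t.predict_proba(X_text)) / 2
late_preds = late_probs.argmax(axis=1)

print(early.predict(np.hstack([X_audio, X_text]))[:5], late_preds[:5])
```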