
    Prediction of mental effort derived from an automated vocal biomarker using machine learning in a large-scale remote sample

    Introduction: Biomarkers of mental effort may help to identify subtle cognitive impairments in the absence of task performance deficits. Here, we aim to detect mental effort on a verbal task, using automated voice analysis and machine learning.
    Methods: Audio data from the digit span backwards task were recorded and scored with automated speech recognition using the online platform NeuroVocalix™, yielding usable data from 2,764 healthy adults (1,022 male, 1,742 female; mean age 31.4 years). Acoustic features were aggregated across each trial and normalized within each subject. Cognitive load was dichotomized for each trial by categorizing trials at >0.6 of each participant's maximum span as "high load." Data were divided into training (60%), test (20%), and validation (20%) datasets, each containing different participants. Training and test data were used in model building and hyper-parameter tuning. Five classification models (Logistic Regression, Naive Bayes, Support Vector Machine, Random Forest, and Gradient Boosting) were trained to predict cognitive load ("high" vs. "low") from acoustic features. Analyses were limited to correct responses. The model was evaluated on the validation dataset, across all span lengths and within the subset of trials with a four-digit span. Classifier discriminant power was examined with receiver operating characteristic (ROC) curve analysis.
    Results: Participants reached a mean span of 6.34 out of 8 items (SD = 1.38). The Gradient Boosting classifier provided the best-performing model on test data (AUC = 0.98) and showed excellent discriminant power for cognitive load on the validation dataset, across all span lengths (AUC = 0.99) and for four-digit-only utterances (AUC = 0.95).
    Discussion: A sensitive biomarker of mental effort can be derived from vocal acoustic features in remotely administered verbal cognitive tests. The use-case of this biomarker for improving the sensitivity of cognitive tests to subtle pathology now needs to be examined.
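    As a rough illustration of the pipeline described above, the following sketch trains a gradient boosting classifier on synthetic per-trial acoustic features with a subject-wise 60/20/20 split and reports validation AUC. The data, feature count, and scikit-learn stand-ins are assumptions for illustration, not the authors' code.

        # Hypothetical sketch of the modeling pipeline: predict "high" vs.
        # "low" cognitive load from per-trial acoustic features, holding out
        # whole participants as in the paper's 60/20/20 subject-wise split.
        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import GroupShuffleSplit

        rng = np.random.default_rng(0)
        n_trials, n_features = 5000, 20
        X = rng.normal(size=(n_trials, n_features))   # per-trial acoustic features
        groups = rng.integers(0, 500, size=n_trials)  # participant IDs
        y = (rng.random(n_trials) > 0.5).astype(int)  # 1 = "high load" trial

        # 60% of participants for training, 40% held out...
        outer = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
        train_idx, holdout_idx = next(outer.split(X, y, groups=groups))
        # ...then the holdout is halved into test (tuning) and validation sets.
        inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
        test_rel, val_rel = next(inner.split(X[holdout_idx], y[holdout_idx],
                                             groups=groups[holdout_idx]))
        test_idx, val_idx = holdout_idx[test_rel], holdout_idx[val_rel]

        clf = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
        val_scores = clf.predict_proba(X[val_idx])[:, 1]
        print("validation AUC:", roc_auc_score(y[val_idx], val_scores))

    Splitting by participant rather than by trial keeps speaker-specific acoustic signatures from leaking between training and evaluation, which matters when features are normalized within subjects.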

    Culture Clubs: Processing Speech by Deriving and Exploiting Linguistic Subcultures

    Spoken language understanding systems are error-prone for several reasons, including individual speech variability. This variability is manifested in many ways, among which are differences in pronunciation, lexical inventory, grammar, and disfluencies. There is, however, considerable evidence of stable language usage within subgroups of a language population. We call these subgroups linguistic subcultures. Two broad problems are defined and a survey of the work in this space is performed: first, linguistic subculture detection, commonly performed via Language Identification, Accent Identification, or Dialect Identification approaches; and second, speech and language processing tasks that may see performance gains from modeling each linguistic subculture separately. The data used in the experiments are drawn from four corpora: Accents of the British Isles (ABI), Intonational Variation in English (IViE), the NIST Language Recognition Evaluation Plan (LRE15), and Switchboard. The speakers in the corpora come from different parts of the United Kingdom and the United States and were given different stimuli. From the speech samples, two feature sets are used in the experiments. A number of experiments to determine linguistic subcultures are conducted. This set of experiments covers several approaches, including the use of traditional machine learning approaches shown to be effective for similar tasks in the past, each with multiple feature sets; state-of-the-art deep learning approaches are also applied to this problem. Two large automatic speech recognition (ASR) experiments are performed against three of the corpora: one monolithic experiment for all the speakers in each corpus, and another with speakers grouped according to their identified linguistic subcultures. For the discourse markers labeled in the Switchboard corpus, some interesting trends emerge when examined through the lens of the speakers' linguistic subcultures. Two large dialogue act experiments are performed against the labeled portion of the Switchboard corpus: one monocultural (or monolithic) experiment for all the speakers and another with speakers grouped according to their identified linguistic subcultures. We conclude by discussing applications of this work, the changing landscape of natural language processing, and suggestions for future research.
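    To make the monolithic-versus-subculture comparison concrete, here is a minimal sketch under stated assumptions: synthetic features and labels, a logistic regression stand-in for the task model, and dialect-region IDs standing in for detected linguistic subcultures. It is not the thesis' experimental code.

        # Compare one monolithic model over all speakers against one model
        # per identified linguistic subculture (placeholder data throughout).
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(1)
        n, d = 2000, 16
        X = rng.normal(size=(n, d))              # acoustic/prosodic features
        subculture = rng.integers(0, 4, size=n)  # e.g. detected dialect region
        y = (rng.random(n) > 0.5).astype(int)    # downstream task label

        # One monolithic model for all speakers.
        mono = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                               scoring="roc_auc", cv=5).mean()

        # One model per linguistic subculture.
        per_group = [cross_val_score(LogisticRegression(max_iter=1000),
                                     X[subculture == g], y[subculture == g],
                                     scoring="roc_auc", cv=5).mean()
                     for g in np.unique(subculture)]

        print(f"monolithic AUC={mono:.3f}, "
              f"per-subculture mean AUC={np.mean(per_group):.3f}")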

    Affective Speech Recognition

    Speech, as a medium of interaction, carries two different streams of information. Whereas one stream carries explicit messages, the other contains implicit information about the speakers themselves. Affective speech recognition is a set of theories and tools that intend to automate unfolding the part of the implicit stream that has to do with human emotion. One application of affective speech recognition is human-computer interaction: a machine that is able to recognize human emotion could engage the user in a more effective interaction. This thesis proposes a set of analyses and methodologies that advance automatic recognition of affect from speech. The proposed solution spans two dimensions of the problem: speech signal processing and statistical learning. On the speech signal processing dimension, extraction of speech low-level descriptors is discussed, and a set of descriptors that exploit the spectrum of the signal is proposed, which has been shown to be particularly practical for capturing affective qualities of speech. Moreover, considering the non-stationary property of the speech signal, a measure of dynamicity is further proposed that captures that property of speech by quantifying changes of the signal over time. Furthermore, based on the proposed set of low-level descriptors, it is shown that individual human beings differ in conveying emotions, and that the parts of the spectrum that hold the affective information differ from one person to another. Therefore, the concept of an emotion profile is proposed, which formalizes those differences by taking into account factors such as cultural and gender-specific differences, as well as distinctions between individual human beings.

    On the statistical learning dimension, variable selection is performed to identify the speech features that are most imperative to extracting affective information. In doing so, low-level descriptors are distinguished from statistical functionals, and the effectiveness of each of the two is studied both dependently and independently. The major importance of variable selection as a standalone component of a solution lies in real-time applications of affective speech recognition. Although thousands of speech features are commonly used to tackle this problem in theory, extracting that many features in real time is unrealistic, especially for mobile applications. Results of the conducted investigations show that the required number of speech features is far smaller than the number commonly used in the literature on the problem. At the core of an affective speech recognition solution is a statistical model that uses speech features to recognize emotions. Such a model comes with a set of parameters that are estimated through a learning process. Proposed in this thesis is a learning algorithm, developed based on the notion of the Hilbert-Schmidt independence criterion and named max-dependence regression, that maximizes the dependence between predicted and actual values of affective qualities. Pearson's correlation coefficient is commonly used as the measure of goodness of fit in the literature of affective computing; max-dependence regression is therefore proposed to make the learning and hypothesis-testing criteria consistent with one another. Results of this research show that doing so yields higher prediction accuracy. Lastly, sparse representation for affective speech datasets is considered in this thesis. For this purpose, the application of a dictionary learning algorithm based on the Hilbert-Schmidt independence criterion is proposed. Dictionary learning is used to identify the most important bases of the data in order to improve the generalization capability of the proposed solution to affective speech recognition. Based on the chosen dictionary learning approach, fusion of feature vectors is proposed. It is shown that sparse representation leads to higher generalization capability for affective speech recognition.
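    Since max-dependence regression is built around the Hilbert-Schmidt independence criterion, a small sketch of the standard biased empirical HSIC estimate may help; the Gaussian kernel and bandwidth here are assumptions, and the thesis' exact estimator and optimization procedure are not reproduced.

        # Empirical HSIC between predicted and actual affect ratings:
        # HSIC ~ tr(K H L H) / (n - 1)^2, with H the centering matrix.
        import numpy as np

        def gaussian_kernel(x, sigma=1.0):
            d2 = (x[:, None] - x[None, :]) ** 2  # pairwise squared distances
            return np.exp(-d2 / (2 * sigma ** 2))

        def hsic(y_pred, y_true, sigma=1.0):
            n = len(y_pred)
            K = gaussian_kernel(np.asarray(y_pred, float), sigma)
            L = gaussian_kernel(np.asarray(y_true, float), sigma)
            H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
            return np.trace(K @ H @ L @ H) / (n - 1) ** 2

        rng = np.random.default_rng(0)
        y_true = rng.normal(size=200)
        print(hsic(y_true + 0.1 * rng.normal(size=200), y_true))  # strong dependence
        print(hsic(rng.normal(size=200), y_true))                 # near independence

    Maximizing this quantity over model parameters, rather than minimizing squared error, aligns the training criterion with dependence-based evaluation measures such as correlation.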

    Procesos cognitivos y afectivos en adultos mayores medidos por medio del habla

    Thesis by compendium of publications. Automatic speech analysis is a technique that allows linguistic information to be extracted objectively from the sound signal emitted during the act of speaking. Numerous systems are interrelated in speech production to select and plan the message, provide it with an appropriate structure, and send the neuromuscular signals to the organs involved in sound production. These processes determine the characteristics of the emitted sound wave; speech analysis has therefore been used to detect various disorders involving these systems, such as Alzheimer's dementia and mild cognitive impairment. This thesis investigates how speech parameters are altered in older adults affected by cognitive and/or affective impairments. The work comprises two phases. In the first, two studies are carried out with the aim of finding cognitive factors that precede speech changes during the aging process. In the second, four studies attempt to obtain combinations of speech parameters that can be used as algorithms in the detection of various disorders by manipulating the process used to elicit speech. The first study concludes that various speech parameters related to duration, rhythm, frequencies, and spectral analysis undergo changes that are related to general cognitive state and may in fact be sensitive to various stages of impairment. The second examines whether these parameters are explained by specific cognitive processes, finding a relationship with linguistic access to semantic memory, the lexicon, and executive function. Next, the results show that within people with mild cognitive impairment there could be speech profiles corresponding to those whose underlying cause is Alzheimer's disease, and that these could be identified through measures of speech rhythm. Two tasks, reading and verbal fluency, are used; speech analysis of these tasks detects people with mild cognitive impairment and/or Alzheimer's disease with an acceptable level of success. Finally, an attempt was made to extend the speech analysis method to the detection of depression in older adults as a first step towards the differential diagnosis of depression and dementia.
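    As an illustration of the kinds of speech parameters discussed above, the following sketch extracts a few timing, frequency, and spectral summaries with librosa. The file path, thresholds, and feature choices are assumptions; the thesis' actual parameter set is broader and clinically validated.

        # Toy extraction of duration/rhythm, F0, and spectral features.
        import numpy as np
        import librosa

        y, sr = librosa.load("speech_sample.wav", sr=16000)  # hypothetical recording

        # Timing/rhythm: speech vs. pause segments from an energy threshold.
        intervals = librosa.effects.split(y, top_db=30)      # non-silent spans
        speech_time = sum(e - s for s, e in intervals) / sr
        pause_ratio = 1.0 - speech_time / (len(y) / sr)

        # Frequency: fundamental-frequency statistics via pYIN (NaN = unvoiced).
        f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
        f0_mean, f0_sd = np.nanmean(f0), np.nanstd(f0)

        # Spectral: centroid as a coarse summary of spectral balance.
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()

        print(f"pause ratio={pause_ratio:.2f}, "
              f"F0={f0_mean:.1f}±{f0_sd:.1f} Hz, centroid={centroid:.0f} Hz")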

    Diagnosticité des mesures physiologiques périphériques de la charge mentale

    Mental workload stands out as a key concept whenever human work and human performance are discussed. Mental workload is often measured using subjective questionnaires or behavioral cues. Peripheral physiological measures (e.g. heart rate) can also be used to measure workload. However, it is particularly difficult to determine the source of workload using peripheral physiological measures: sub-dimensions of mental workload, such as task load, mental effort, stress, and fatigue, often trigger similar physiological reactions, blurring the diagnostic potential of physiological measures. Furthermore, physiological measures are too often investigated in laboratory settings and too rarely in real workplaces, making it hazardous to extrapolate their performance to real-world settings. This thesis aims at investigating the diagnostic potential (the potential to determine the source of workload) of peripheral physiological measures. A mixed methodology, combining a traditional cognitive approach with machine learning techniques, is used. The thesis presents results from two experiments: a laboratory experiment and an ecological command-and-control simulation reproducing a real work environment. Results show that peripheral measures can be used to predict, with high accuracy, the sub-dimension at the source of workload in a simple-task context. While less accurate, this prediction remains possible in a realistic task context. Overall, this thesis makes several essential contributions toward making peripheral physiological measures of mental workload viable in real work settings.
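    A minimal sketch of the diagnostic idea, under stated assumptions (synthetic data; feature names chosen for illustration): classify which sub-dimension of workload drives a response from peripheral physiological features.

        # Predict the source of workload (task load, effort, stress, fatigue)
        # from peripheral physiological features; placeholder data throughout.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(2)
        n = 800
        X = np.column_stack([
            rng.normal(70, 10, n),   # mean heart rate (bpm)
            rng.normal(40, 15, n),   # heart-rate variability: RMSSD (ms)
            rng.normal(5, 2, n),     # electrodermal activity (µS)
            rng.normal(16, 4, n),    # respiration rate (breaths/min)
        ])
        y = rng.integers(0, 4, size=n)  # 0=task load, 1=effort, 2=stress, 3=fatigue

        acc = cross_val_score(RandomForestClassifier(n_estimators=200), X, y, cv=5)
        print("mean cross-validated accuracy:", acc.mean())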

    Learning disentangled speech representations

    A variety of informational factors are contained within the speech signal, and a single short recording of speech reveals much more than the spoken words. The best method to extract and represent informational factors from the speech signal ultimately depends on which informational factors are desired and how they will be used. In addition, some methods capture more than one informational factor at the same time, such as speaker identity, spoken content, and speaker prosody. The goal of this dissertation is to explore different ways to deconstruct the speech signal into abstract representations that can be learned and later reused in various speech technology tasks. This task of deconstructing, also known as disentanglement, is a form of distributed representation learning. As a general approach to disentanglement, there are some guiding principles that elaborate what a learned representation should contain as well as how it should function. In particular, learned representations should contain all of the requisite information in a more compact manner, be interpretable, remove nuisance factors of irrelevant information, be useful in downstream tasks, and be independent of the task at hand. The learned representations should also be able to answer counterfactual questions. In some cases, learned speech representations can be re-assembled in different ways according to the requirements of downstream applications. For example, in a voice conversion task, the speech content is retained while the speaker identity is changed; in a content-privacy task, some targeted content may be concealed without affecting how surrounding words sound. While there is no single best method to disentangle all types of factors, some end-to-end approaches demonstrate a promising degree of generalization to diverse speech tasks. This thesis explores a variety of use-cases for disentangled representations, including phone recognition, speaker diarization, linguistic code-switching, voice conversion, and content-based privacy masking. Speech representations can also be utilised for automatically assessing the quality and authenticity of speech, such as automatic MOS ratings or detecting deep fakes. The meaning of the term "disentanglement" is not well defined in previous work, and it has acquired several meanings depending on the domain (e.g. image vs. speech); sometimes it is used interchangeably with the term "factorization". This thesis proposes that disentanglement of speech is distinct, and offers a viewpoint of disentanglement that can be considered both theoretically and practically.
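    One common disentanglement pattern mentioned above, recombining factors for voice conversion, can be sketched as a toy autoencoder whose latent code is split into speaker and content parts. The architecture and dimensions are illustrative assumptions, not a model from the thesis.

        # Toy autoencoder with a latent code split into speaker and content
        # parts; swapping the speaker part performs crude "voice conversion".
        import torch
        import torch.nn as nn

        class DisentangledAE(nn.Module):
            def __init__(self, feat_dim=80, spk_dim=16, content_dim=32):
                super().__init__()
                self.spk_enc = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                             nn.Linear(64, spk_dim))
                self.content_enc = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                                 nn.Linear(64, content_dim))
                self.dec = nn.Sequential(nn.Linear(spk_dim + content_dim, 64),
                                         nn.ReLU(), nn.Linear(64, feat_dim))

            def forward(self, x, spk_source=None):
                # Use another utterance's speaker code to convert the voice.
                s = self.spk_enc(x if spk_source is None else spk_source)
                c = self.content_enc(x)
                return self.dec(torch.cat([s, c], dim=-1))

        model = DisentangledAE()
        x = torch.randn(4, 80)           # e.g. mel-spectrogram frames
        recon = model(x)                 # reconstruct same speaker + content
        converted = model(x, x.flip(0))  # content of x, speaker code swapped
        print(recon.shape, converted.shape)

    In practice, additional losses (e.g. speaker classification on one part and adversarial speaker removal on the other) are needed to force each part to carry only its intended factor.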