
    Gender Bias in Depression Detection Using Audio Features

    Full text link
    Depression is a large-scale mental health problem, and its detection is a challenging area for machine learning researchers. Datasets such as the Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ) have been created to aid research in this area. However, on top of the challenges inherent in accurately detecting depression, biases in datasets may result in skewed classification performance. In this paper we examine gender bias in the DAIC-WOZ dataset. We show that gender biases in DAIC-WOZ can lead to an overreporting of performance. By applying concepts from Fair Machine Learning, such as data re-distribution, and by using raw audio features, we can mitigate the harmful effects of bias. Comment: 5 pages, 2 figures, to be published at EUSIPCO 202
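    As a concrete illustration of the data re-distribution idea mentioned above, the sketch below undersamples recordings so that every (gender, label) cell is equally represented. This is one standard fair-ML mitigation, not necessarily the paper's exact scheme, and the sample schema is an assumption.

```python
# Minimal sketch of gender-aware data re-distribution (a standard
# fair-ML mitigation; the paper's exact scheme may differ).
# Undersample so each (gender, label) cell matches the smallest cell.
import random
from collections import defaultdict

def rebalance(samples, seed=0):
    """samples: list of dicts with 'gender' and 'label' keys (assumed schema)."""
    cells = defaultdict(list)
    for s in samples:
        cells[(s["gender"], s["label"])].append(s)
    n = min(len(group) for group in cells.values())
    rng = random.Random(seed)
    balanced = []
    for group in cells.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced
```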

    Rule-based lip-syncing algorithm for virtual character in voice chatbot

    Get PDF
    Virtual characters have changed the way we interact with computers. The key to a believable virtual character is accurate real-time synchronization between the visual (lip movements) and the audio (speech). This work develops a 3D model for the virtual character and implements a rule-based lip-syncing algorithm for the character's lip movements. We use the Jacob voice chatbot as the platform for the design and implementation of the virtual character; audio-driven articulation and manual mapping methods are considered suitable for real-time applications such as Jacob. We evaluate the proposed virtual character using the hedonic motivation system adoption model (HMSAM) with 70 users. The HMSAM results are 91.74% for behavioral intention to use and 72.95% for immersion; the average score across all aspects of the HMSAM is 85.50%. The rule-based lip-syncing algorithm accurately synchronizes the lip movements with the Jacob voice chatbot's speech in real time.
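    Audio-driven articulation with manual mapping is often realized by bucketing frame-level energy into a small set of mouth shapes. The Python sketch below illustrates that general idea only; the viseme set, thresholds, and frame rate are illustrative assumptions, not the rules used for Jacob.

```python
# Minimal sketch of rule-based, audio-driven articulation: map the RMS
# energy of each audio frame to a discrete mouth shape (viseme).
import numpy as np

VISEMES = ["closed", "slightly_open", "open", "wide_open"]  # illustrative set

def frame_to_viseme(frame, thresholds=(0.02, 0.08, 0.2)):
    """frame: 1-D array of PCM samples in [-1, 1]; thresholds are assumed."""
    rms = float(np.sqrt(np.mean(frame ** 2)))
    for i, t in enumerate(thresholds):
        if rms < t:
            return VISEMES[i]
    return VISEMES[-1]

# Drive the mouth at ~25 keyframes per second from 16 kHz audio.
audio = np.random.uniform(-0.3, 0.3, 16000)  # stand-in for one second of speech
hop = 640  # 16000 samples/s / 25 frames/s
keyframes = [frame_to_viseme(audio[i:i + hop]) for i in range(0, len(audio), hop)]
```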

    Multilingual markers of depression in remotely collected speech samples: A preliminary analysis

    Get PDF
    Background: Speech contains neuromuscular, physiological and cognitive components, and so is a potential biomarker of mental disorders. Previous studies indicate that speaking rate and pausing are associated with major depressive disorder (MDD). However, results are inconclusive, as many studies are small, underpowered and do not include clinical samples. These studies have also been unilingual and have used speech collected in controlled settings. If speech markers are to help us understand the onset and progress of MDD, we need to uncover markers that are robust across languages and establish the strength of associations in real-world data. // Methods: We collected speech data from 585 participants with a history of MDD in the United Kingdom, Spain, and the Netherlands as part of the RADAR-MDD study. Participants recorded their speech via smartphones every two weeks for 18 months. Linear mixed models were used to estimate the strength of specific markers of depression from a set of 28 speech features. // Results: Increased depressive symptoms were associated with speech rate, articulation rate and intensity of speech elicited from a scripted task. These features had consistently stronger effect sizes than pauses. // Limitations: Our findings are derived at the cohort level, so they may have limited impact on identifying intra-individual speech changes associated with changes in symptom severity. The analysis of features averaged over the entire recording may have underestimated the importance of some features. // Conclusions: Participants with more severe depressive symptoms spoke more slowly and more quietly. Our findings come from a real-world, multilingual, clinical dataset and so represent a step-change in the usefulness of speech as a digital phenotype of MDD.
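    As a sketch of the modelling step described above, one linear mixed model per speech feature with a random intercept per participant handles the repeated fortnightly recordings. The file and column names below are hypothetical; the study's actual covariates are not reproduced.

```python
# Minimal sketch of a per-feature linear mixed model, assuming one row
# per recording with hypothetical column names.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("radar_mdd_speech.csv")  # assumed file
# A random intercept per participant accounts for repeated measures.
model = smf.mixedlm("depression_score ~ speech_rate",
                    data=df, groups=df["participant"])
result = model.fit()
print(result.summary())  # the fixed-effect slope estimates the marker's strength
```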

    Natural Language Processing Methods for Acoustic and Landmark Event-Based Features in Speech-Based Depression Detection

    Full text link
    The processing of speech as an explicit sequence of events is common in automatic speech recognition (linguistic events), but it has received relatively little attention in paralinguistic speech classification despite its potential for characterizing broad acoustic event sequences. This paper proposes a framework for analyzing speech as a sequence of acoustic events and investigates its application to depression detection. In this framework, acoustic space regions are tokenized into 'words' representing speech events at fixed or irregular intervals. This tokenization allows the exploitation of acoustic word features using proven natural language processing methods. A key advantage of this framework is its ability to accommodate heterogeneous event types: herein we combine acoustic words and speech landmarks, which are articulation-related speech events. Another advantage is the option to fuse such heterogeneous events at various levels, including the embedding level. Evaluation of the proposed framework on both controlled laboratory-grade supervised audio recordings and unsupervised self-administered smartphone recordings highlights its merits across both datasets, with the proposed landmark-dependent acoustic words achieving improvements in F1 (depressed) of up to 15% and 13% for SH2-FS and DAIC-WOZ respectively, relative to acoustic speech baseline approaches.
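    The tokenization step can be sketched as vector quantization of frame-level features into a discrete vocabulary, followed by standard text featurization. The clustering setup and feature dimensions below are assumptions, and the paper's landmark fusion is not reproduced.

```python
# Minimal sketch of acoustic words: quantize frame-level features into a
# discrete vocabulary, then apply standard NLP featurization.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def to_acoustic_words(feature_frames, kmeans):
    """feature_frames: (n_frames, n_dims) array, e.g. MFCCs for one recording."""
    ids = kmeans.predict(feature_frames)
    return " ".join(f"w{i}" for i in ids)  # one 'word' per frame-level event

# Fit the acoustic vocabulary on pooled training frames (stand-in data).
train_frames = [np.random.randn(500, 13) for _ in range(10)]
kmeans = KMeans(n_clusters=64, n_init=10).fit(np.vstack(train_frames))

docs = [to_acoustic_words(f, kmeans) for f in train_frames]
X = TfidfVectorizer().fit_transform(docs)  # ready for a standard classifier
```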

    Quantifying Dimensions of the Vowel Space in Patients with Schizophrenia and Controls

    Get PDF
    The speech of patients with schizophrenia has been characterized as aprosodic, or lacking pitch variation. Recent research on linguistic aspects of schizophrenia has examined the vowel space to determine whether acoustic aspects of speech correlate with patient status (Compton et al. 2018). Additional research by Hogoboom et al. (submitted) noted that Euclidean distance (ED), the average distance from the center of the vowel space to all vowels produced, and vowel density, the proportion of vowels clustered together in the center of the vowel space, were significantly correlated for patients with schizophrenia but not for controls; this correlation was primarily due to a subset of 13 patients. They also found that ED on its own was a weak predictor of patient status, but that density and ED together were predictors of patient status. This previous study used Prosogram (Mertens 2014), a tool that relies on acoustics to identify vowels in the sound files, which proved unreliable in detecting vowels. This research therefore reassesses the relationship between the vowel space and patient status by gathering more reliable vowel measurements from Hogoboom’s dataset using the forced aligner FAVE (Rosenfelder et al. 2014). We seek to determine whether there is a stronger correlation between vowel space usage and patient status than previously found, one that was previously masked by incomplete vowel measurements. Our current research finds that ED is a strong predictor of patient status (p < 0.05). While Hogoboom’s previous work found ED and density to be independently significant, the current work finds that those two variables are correlated. These results show a relationship between ED and patient status: patients have lower average ED and controls have higher average ED. Overall, this research clarifies differences in vowel space usage between patients with schizophrenia and controls, which could ultimately support more quantitatively defined linguistic measurements for diagnosis that are less subject to individual clinical listeners.
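    The two measures lend themselves to a direct computation: ED is the mean distance of a speaker's vowel tokens from the center of their vowel space, and density is the share of tokens within some radius of that center. The sketch below assumes (F1, F2) formant coordinates and an illustrative radius; it is not the study's exact procedure.

```python
# Minimal sketch of the two vowel-space measures: mean Euclidean distance
# (ED) from the vowel-space center, and central vowel density.
import numpy as np

def vowel_space_measures(formants, density_radius=150.0):
    """formants: (n_vowels, 2) array of (F1, F2) in Hz; radius is assumed."""
    center = formants.mean(axis=0)
    dists = np.linalg.norm(formants - center, axis=1)
    ed = dists.mean()                                 # average distance from center
    density = float(np.mean(dists < density_radius))  # share of central vowels
    return ed, density
```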

    Self-Reported Symptoms of Depression and PTSD Are Associated with Reduced Vowel Space in Screening Interviews

    No full text

    The Phylogeny and Function of Vocal Complexity in Geladas

    Full text link
    The complexity of vocal communication varies widely across taxa – from humans who can create an infinite repertoire of sound combinations to some non-human species that produce only a few discrete sounds. A growing body of research is aimed at understanding the origins of ‘vocal complexity’. And yet, we still understand little about the evolutionary processes that led to, and the selective advantages of engaging in, complex vocal behaviors. I contribute to this body of research by examining the phylogeny and function of vocal complexity in wild geladas (Theropithecus gelada), a primate known for its capacity to combine a suite of discrete sound types into varied sequences. First, I investigate the phylogeny of vocal complexity by comparing gelada vocal communication with that of their close baboon relatives and with humans. Comparisons of vocal repertoires reveal that geladas – specifically the males – produce a suite of unique or ‘derived’ call types that results in a more diversified vocal repertoire than baboons. Also, comparisons of acoustic properties reveal that geladas produce vocalizations with greater spectro-temporal modulation, a feature shared with human speech, than baboons. Additionally, I show that the same organizational principle – Menzerath’s law – underpins the structure of gelada vocal sequences (i.e., combinations of derived and homologous call types) and human sentences. Second, I investigate the function of vocal complexity by examining the perception of male complex vocal sequences (i.e., those with more derived call types), the contexts in which they are produced, and how their production differs across individuals. A playback experiment shows that female geladas perceive ‘complex’ and ‘simple’ vocal sequences as being different. Then, two observational studies show that male production of complex vocal sequences mediates their affiliative interactions with females, both during neutral periods and periods of uncertainty (e.g., following conflicts). Finally, I find evidence that vocal complexity can act as a signal of male ‘quality’, in that more dominant males exhibit higher levels of vocal complexity than their subordinate counterparts. Collectively, the work presented in this dissertation presents an integrative investigation of the ultimate origins of complex communication systems, and in the process, it highlights the critical importance of approaching the study of complexity from several scientific perspectives. PhD, Psychology, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/138479/1/gustison_1.pd
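    Menzerath’s law predicts that longer constructs are built from shorter constituents, so a basic check correlates sequence length with mean call duration and expects a negative relationship. The toy data layout below is assumed; the dissertation’s actual analysis is more involved.

```python
# Minimal sketch of a Menzerath's-law check on vocal sequences: longer
# sequences (more calls) should have shorter calls on average.
from scipy.stats import spearmanr

# Each inner list holds call durations (seconds) for one vocal sequence.
sequences = [[0.8, 0.7], [0.6, 0.5, 0.6], [0.4, 0.5, 0.4, 0.3]]

lengths = [len(seq) for seq in sequences]
mean_durations = [sum(seq) / len(seq) for seq in sequences]
rho, p = spearmanr(lengths, mean_durations)
print(f"rho={rho:.2f}, p={p:.3f}")  # negative rho is consistent with the law
```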

    The Domestication of Voice-Activated Technology & EavesMining: Surveillance, Privacy and Gender Relations at Home

    Get PDF
    This thesis develops a case study analysis of the Amazon Echo, the first voice-activated smart speaker. The domestication of the device's feminine conversational agent, Alexa, and the integration of its microphone and digital sensor technology into home environments represent a moment of radical change in the domestic sphere. This development is interpreted according to two primary force relations: historical gender patterns of domestic servitude, and eavesmining (eavesdropping + datamining) processes of knowledge extraction and analysis. The thesis is framed around three pillars of study that together demonstrate: how routinization with voice-activated technology affects acoustic space and one's experiences of home; how online warm experts initiate a dialogue about the domestication of technology that disregards Amazon's corporate privacy framework; and finally, how the technology's conditions of use silently result in the deployment of ever-intensifying surveillance mechanisms in home environments. Eavesmining processes are beginning to construct a new world of media and surveillance in which every spoken word can potentially be heard and recorded, and speaking is inseparable from identification.