Automatic Detection of Dementia and Related Affective Disorders through Processing of Speech and Language
In 2019, dementia has become a trillion-dollar disorder. Alzheimer's disease (AD) is a type of dementia in which the main observable symptom is a decline in cognitive functions, notably memory, as well as language and problem-solving. Experts agree that early detection is crucial to effectively develop and apply interventions and treatments, underlining the need for effective and pervasive assessment and screening tools. The goal of this thesis is to explore how computational techniques can be used to process speech and language samples produced by patients suffering from dementia or related affective disorders, with the aim of automatically detecting these conditions in large populations using machine learning models. A strong focus is laid on the detection of early-stage dementia (mild cognitive impairment, MCI), as most clinical trials today focus on intervention at this level. To this end, novel automatic and semi-automatic analysis schemes for a speech-based cognitive task, verbal fluency, are explored and evaluated as an appropriate screening task. Because of the lack of available patient data in most languages, world-first multilingual approaches to detecting dementia are introduced in this thesis. Results are encouraging, and clear benefits become visible on a small French dataset. Lastly, the task of detecting those people with dementia who also suffer from an affective disorder called apathy is explored. Since they are more likely to convert to later stages of dementia faster, it is crucial to identify them. These are the first experiments that consider this task using solely speech and language as inputs. Results are again encouraging, both when using only speech and when using language data elicited through emotional questions. Overall, strong results encourage further research into establishing speech-based biomarkers for early detection and monitoring of these disorders to improve patients' lives.
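Verbal fluency tasks of the kind used above are typically scored by counting productions and timing the gaps between them. As an illustration of what an automatic analysis scheme might compute from a time-aligned transcript, here is a minimal Python sketch; the feature names and the sample data are hypothetical, not taken from the thesis.

```python
# Illustrative verbal fluency features from a time-aligned transcript of a
# "name as many animals as you can in 60 seconds" task. The feature set and
# the example data are invented for illustration.

def fluency_features(words_with_times):
    """words_with_times: list of (word, onset_in_seconds) tuples."""
    words = [w for w, _ in words_with_times]
    unique = set(words)
    # Total productions and perseverations (repeated words) are standard
    # clinical counts for this task.
    n_total = len(words)
    n_repetitions = n_total - len(unique)
    # Mean inter-word interval: longer gaps can indicate retrieval difficulty.
    onsets = [t for _, t in words_with_times]
    gaps = [b - a for a, b in zip(onsets, onsets[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {"total": n_total, "unique": len(unique),
            "repetitions": n_repetitions, "mean_gap_s": round(mean_gap, 2)}

sample = [("dog", 1.0), ("cat", 2.5), ("horse", 6.0), ("cat", 12.0), ("lion", 25.0)]
print(fluency_features(sample))
```

Semi-automatic schemes would additionally let a human correct the word/onset alignment before the features are computed.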
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and speech processing applications able to operate in real-world environments, such as mobile communication services and smart homes.
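The "efficient algorithms for searching the hypothesis space" mentioned above usually refer to pruned search such as beam search, which combines acoustic and language-model scores while keeping only the best partial hypotheses at each step. A toy sketch, with invented scores and vocabulary:

```python
# Toy beam-search decoder over a tiny hypothesis space, sketching how ASR
# search combines acoustic scores with a language model. All scores and the
# vocabulary are invented for illustration.

acoustic = [
    {"the": -0.2, "a": -1.8},     # step 1: acoustic log-scores per word
    {"cat": -0.4, "cap": -1.1},   # step 2
    {"sat": -0.5, "sad": -1.0},   # step 3
]
bigram_lm = {("the", "cat"): -0.3, ("the", "cap"): -2.0,
             ("a", "cat"): -0.6, ("cat", "sat"): -0.2,
             ("cat", "sad"): -1.5}

def beam_search(frames, lm, beam_width=2, lm_floor=-3.0):
    beams = [((), 0.0)]                     # (hypothesis, total log-score)
    for frame in frames:
        candidates = []
        for seq, score in beams:
            for word, ac in frame.items():
                lm_score = lm.get((seq[-1], word), lm_floor) if seq else 0.0
                candidates.append((seq + (word,), score + ac + lm_score))
        # Pruning: keep only the `beam_width` best partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

best_seq, best_score = beam_search(acoustic, bigram_lm)
print(best_seq, round(best_score, 2))   # ('the', 'cat', 'sat') -1.6
```

Real decoders search over lattices or weighted finite-state transducers with far larger vocabularies, but the pruning idea is the same.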
Identification of voice pathologies in an elderly population
Ageing is associated with an increased risk of developing diseases, including a greater predisposition to conditions such as sepsis. With ageing, human voices also undergo a natural degradation, gauged by alterations in hoarseness, breathiness, articulatory ability, and speaking rate. Nowadays, perceptual evaluation is widely used to assess speech and voice impairments despite its high subjectivity.
This dissertation proposes a new method for detecting and identifying voice pathologies by exploring acoustic parameters of continuous speech signals in the elderly population. Additionally, a study of the influence of gender and age on the performance of voice pathology detection systems is conducted.
The study included 44 subjects older than 60 years, with the pathologies Dysphonia, Functional Dysphonia, and Spasmodic Dysphonia. From the dataset created under these settings, two gender-dependent subsets were derived: one with only female samples and the other with only male samples. The system developed used three feature selection methods and five Machine Learning algorithms to classify the voice signal according to the presence of pathology.
The binary classification, which consisted of voice pathology detection, reached an accuracy of 85.1%±5.1% for the dataset without gender division, 83.7%±7.0% for the male dataset, and 87.4%±4.2% for the female dataset. The multiclass classification, which consisted of distinguishing the different pathologies, reached an accuracy of 69.0%±5.1% for the dataset without gender division, 63.7%±5.4% for the male dataset, and 80.6%±8.1% for the female dataset.
The obtained results revealed that features describing fluency are important and discriminating in these types of systems. Random Forest also proved to be the most effective Machine Learning algorithm for both binary and multiclass classification.
The proposed model proves promising in detecting pathological voices and identifying the underlying pathology in an elderly population, with an increase in performance when a gender division is performed.
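As a rough illustration of the pipeline described above (feature selection followed by classification, with Random Forest reported as the most effective learner), here is a minimal scikit-learn sketch on synthetic data. The selector, feature counts, and hyperparameters are assumptions for illustration, not the dissertation's exact configuration.

```python
# Sketch: feature selection + Random Forest, evaluated with cross-validation.
# Synthetic data stands in for the acoustic features; the real system compared
# three selection methods and five learning algorithms.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=44, n_features=60, n_informative=8,
                           random_state=0)        # 44 subjects, as in the study
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),     # keep the 10 top-ranked features
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
scores = cross_val_score(pipe, X, y, cv=5)        # 5-fold cross-validation
print(round(scores.mean(), 3))
```

Putting the selector inside the `Pipeline` ensures feature selection is re-fit on each training fold, avoiding the selection bias that leaks in when features are chosen on the full dataset first.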
Speech- and Language-Based Classification of Alzheimer’s Disease: A Systematic Review
Background: Alzheimer’s disease (AD) has paramount importance due to its rising prevalence,
the impact on the patient and society, and the related healthcare costs. However, current
diagnostic techniques are not designed for frequent mass screening, delaying therapeutic intervention
and worsening prognoses. To be able to detect AD at an early stage, ideally at a pre-clinical stage,
speech analysis emerges as a simple, low-cost, non-invasive procedure. Objectives: In this work our objective is to conduct a systematic review of speech-based detection and classification of Alzheimer's disease, with the purpose of identifying the most effective algorithms and best practices. Methods:
A systematic literature search was performed from Jan 2015 up to May 2020 using ScienceDirect,
PubMed and DBLP. Articles were screened by title, abstract and full text as needed. A manual
complementary search among the references of the included papers was also performed. Inclusion
criteria and search strategies were defined a priori. Results: We were able to identify the main resources that can support the development of decision support systems for AD, to list speech features that are correlated with the linguistic and acoustic footprint of the disease, to recognize the data models that can provide robust results, and to observe the performance indicators that were reported. Discussion: A computational system with an adequate combination of elements, based on the identified best practices, can point to a whole new diagnostic approach, leading to better insights into AD symptoms and disease patterns and creating conditions to promote a longer life span as well as an improvement in patient quality of life. The clinically relevant results that were identified can be used to establish a reference system and help to define research guidelines for future developments. This work was partially supported by the FCT UIDB/04730/2020 project.
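Among the linguistic features such reviews list for AD detection are simple lexical statistics computed from transcripts. The sketch below computes two commonly cited examples, type-token ratio and pronoun rate; the pronoun list and the sample sentence are illustrative only, not the review's definitive feature set.

```python
# Two lexical features often reported as part of the linguistic footprint of
# AD: type-token ratio (vocabulary richness) and pronoun rate (overuse of
# pronouns can signal word-finding difficulty). Illustrative only.
import re

PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "me", "him", "her",
            "us", "them", "this", "that", "these", "those"}

def lexical_features(transcript):
    tokens = re.findall(r"[a-z']+", transcript.lower())
    types = set(tokens)
    ttr = len(types) / len(tokens)                          # type-token ratio
    pronoun_rate = sum(t in PRONOUNS for t in tokens) / len(tokens)
    return ttr, pronoun_rate

ttr, pr = lexical_features("Well it is ... it is a boy and he is taking the cookie")
print(round(ttr, 2), round(pr, 2))
```

In practice such features are computed per recording and fed, together with acoustic features, to the classifiers the review surveys.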
CONTRIBUTIONS TO EFFICIENT AUTOMATIC TRANSCRIPTION OF VIDEO LECTURES
Thesis by compendium. During the last years, on-line multimedia repositories have become key
knowledge assets thanks to the rise of the Internet, especially in the area of education. Educational institutions around the world have devoted great effort
to explore different teaching methods, to improve the transmission of knowledge
and to reach a wider audience. As a result, online video lecture repositories
are now available and serve as complementary tools that can boost the learning
experience to better assimilate new concepts. In order to guarantee the success
of these repositories the transcription of each lecture plays a very important
role because it constitutes the first step towards the availability of many other
features. This transcription allows the searchability of learning materials, enables translation into other languages, provides recommendation functions, makes it possible to generate content summaries, guarantees access for people with hearing disabilities, etc. However, the
transcription of these videos is expensive in terms of time and human cost.
To this purpose, this thesis aims at providing new tools and techniques that
ease the transcription of these repositories. In particular, we address the
development of a complete Automatic Speech Recognition toolkit with a special focus on the Deep Learning techniques that contribute to providing accurate transcriptions in real-world scenarios. This toolkit has been tested against many others in different international competitions, showing comparable transcription quality. Moreover, a new technique to improve the recognition accuracy has been
proposed which makes use of Confidence Measures to adapt the recognition systems to the speaker. This constituted the spark that motivated the development of techniques to improve the estimation of such measures by means of Recurrent Neural Networks, helping to further improve the transcription quality.
The contributions proposed herein have been tested in real-life scenarios in
different educational repositories. In fact, the transLectures-UPV toolkit is
part of a set of tools for providing video lecture transcriptions in many
different Spanish and European universities and institutions. Agua Teba, MÁD. (2019). Contributions to Efficient Automatic Transcription of Video Lectures [unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/130198
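Confidence measures of the kind the thesis builds on assign each recognized word a score, which can then be thresholded to route uncertain words to human revisers in a semi-automatic transcription workflow. A minimal sketch with invented posteriors (real systems derive them from lattices and, as proposed here, refine them with recurrent neural networks):

```python
# Sketch of using word-level confidence measures to flag likely transcription
# errors for human review. The words, posteriors, and threshold below are
# invented values for illustration.

def flag_low_confidence(words, posteriors, threshold=0.7):
    """Return the words whose confidence falls below the threshold."""
    return [w for w, p in zip(words, posteriors) if p < threshold]

hyp = ["the", "fourier", "transform", "of", "a", "gaussian"]
scores = [0.98, 0.41, 0.88, 0.95, 0.97, 0.63]
print(flag_low_confidence(hyp, scores))   # words a human reviser should check
```

The better calibrated the confidence estimates are, the fewer correctly recognized words get flagged, which is why improving their estimation pays off directly in revision effort.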
Automated screening methods for mental and neuro-developmental disorders
Mental and neuro-developmental disorders such as depression, bipolar disorder, and autism spectrum disorder (ASD) are critical healthcare issues which affect a large number of people. Depression, according to the World Health Organisation, is the largest cause of disability worldwide and affects more than 300 million people. Bipolar disorder affects more than 60 million individuals worldwide. ASD, meanwhile, affects more than 1 in 100 people in the UK. Not only do these disorders adversely affect the quality of life of affected individuals, they also have a significant economic impact.
While brute-force approaches are potentially useful for learning new features which could be representative of these disorders, such approaches may not be best suited for developing robust screening methods. This is due to a myriad of confounding factors, such as age, gender, cultural background, and socio-economic status, which can affect the social signals of individuals in a similar way to the symptoms of these disorders. Brute-force approaches may learn to exploit the effects of these confounding factors on social signals in place of effects due to mental and neuro-developmental disorders.
The main objective of this thesis is to develop, investigate, and propose computational methods to screen for mental and neuro-developmental disorders in accordance with descriptions given in the Diagnostic and Statistical Manual (DSM). The DSM manual is a guidebook published by the American Psychiatric Association which offers common language on mental disorders. Our motivation is to alleviate, to an extent, the possibility of machine learning algorithms picking up one of the confounding factors to optimise performance for the dataset – something which we do not find uncommon in research literature.
To this end, we introduce three new methods for automated screening for depression from audio/visual recordings, namely: turbulence features, craniofacial movement features, and Fisher Vector based representation of speech spectra. We surmise that psychomotor changes due to depression lead to uniqueness in an individual's speech pattern which manifest as sudden and erratic changes in speech feature contours. The efficacy of these features is demonstrated as part of our solution to Audio/Visual Emotion Challenge 2017 (AVEC 2017) on Depression severity prediction. We also detail a methodology to quantify specific craniofacial movements, which we hypothesised could be indicative of psychomotor retardation, and hence depression. The efficacy of craniofacial movement features is demonstrated using datasets from the 2014 and 2017 editions of AVEC Depression severity prediction challenges. Finally, using the dataset provided as part of AVEC 2016 Depression classification challenge, we demonstrate that differences between speech of individuals with and without depression can be quantified effectively using the Fisher Vector representation of speech spectra.
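The turbulence idea above, sudden and erratic changes in speech feature contours, can be approximated very roughly by counting abrupt frame-to-frame jumps relative to the typical change. This sketch is a loose illustration of that intuition, not the thesis's actual turbulence definition:

```python
import numpy as np

# Rough proxy for "sudden and erratic changes" in a feature contour: the rate
# and magnitude of frame-to-frame jumps that exceed a multiple of the typical
# change. The jump_factor and the example contours are invented.

def contour_turbulence(contour, jump_factor=2.0):
    deltas = np.abs(np.diff(contour))              # frame-to-frame change
    typical = np.median(deltas)
    jumps = deltas > jump_factor * typical         # abrupt changes
    return {"jump_rate": float(jumps.mean()),      # how often the contour jumps
            "mean_delta": float(deltas.mean())}    # average change magnitude

smooth = np.sin(np.linspace(0.0, 3.0, 100))        # steady contour
erratic = smooth + np.random.default_rng(0).normal(0.0, 0.5, 100)
print(contour_turbulence(smooth)["jump_rate"],
      contour_turbulence(erratic)["jump_rate"])
```

Applied to pitch, energy, or facial landmark trajectories, such statistics yield one fixed-length descriptor per recording, which is what a downstream classifier needs.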
For our work on automated screening of bipolar disorder, we propose methods to classify individuals with bipolar disorder into states of remission, hypo-mania, and mania. Here, we surmise that like depression, individuals with different levels of mania have certain uniqueness to their social signals. Based on this understanding, we propose the use of turbulence features for audio/visual social signals (i.e. speech and facial expressions). We also propose the use of Fisher Vectors to create a unified representation of speech in terms of prosody, voice quality, and speech spectra. These methods have been proposed as part of our solution to the AVEC 2018 Bipolar disorder challenge.
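The Fisher Vector representation referred to above encodes a variable-length sequence of frame-level features as a fixed-length gradient with respect to a background GMM. For brevity, the sketch below computes only the gradient with respect to the GMM means (full Fisher Vectors also include weight and covariance gradients), and the data is synthetic:

```python
# Simplified Fisher Vector encoder over frame-level speech features:
# means-gradient component only, with a diagonal-covariance background GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(500, 4))            # stand-in for training frames
gmm = GaussianMixture(n_components=3, covariance_type="diag",
                      random_state=0).fit(background)

def fisher_vector_means(frames, gmm):
    """Encode a variable-length frame sequence as a fixed-length vector."""
    q = gmm.predict_proba(frames)                 # (T, K) soft assignments
    sigma = np.sqrt(gmm.covariances_)             # (K, D) diagonal std-devs
    fv = np.stack([
        (q[:, k, None] * (frames - gmm.means_[k]) / sigma[k]).sum(axis=0)
        for k in range(gmm.n_components)
    ])
    fv /= frames.shape[0] * np.sqrt(gmm.weights_)[:, None]   # normalisation
    return fv.ravel()                             # fixed length K * D

utterance = rng.normal(size=(120, 4))             # stand-in for one recording
print(fisher_vector_means(utterance, gmm).shape)  # (12,)
```

Because the output length depends only on the GMM, recordings of any duration map to comparable vectors, which is what makes this a "unified representation" across prosody, voice quality, and spectral features.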
In addition, we find that the task of automated screening for ASD is much more complicated. Here, confounding factors can easily overwhelm the social signals which are affected by ASD. We discuss, in the light of research literature and our experimental analysis, that significant collaborative work is required between computer scientists and clinicians to discern social signals which are robust to common confounding factors.