Automatic Assessment of Speech Capability Loss in Disordered Speech
In this article, we report on the use of an automatic technique to assess pronunciation in the context of several types of speech disorders. Although such tools already exist, they are more widely used in a different context, namely Computer-Assisted Language Learning, where the objective is to assess non-native pronunciation by detecting learners' mispronunciations at the segmental and/or suprasegmental level. In our work, we sought to determine whether the Goodness of Pronunciation (GOP) algorithm, which aims to detect phone-level mispronunciations by means of automatic speech recognition, could also detect segmental deviances in disordered speech. Our main experiment is an analysis of speech from people with unilateral facial palsy. This pathology may impact the realization of certain phonemes such as bilabial plosives and sibilants. Speech read by 32 speakers at four clinical severity grades was automatically aligned, and GOP scores were computed for each phone realization. The highest scores, which indicate large dissimilarities from standard phone realizations, were obtained for the most severely impaired speakers. The corresponding speech subset was manually transcribed at the phone level; 8.3% of the phones differed from the standard pronunciations in our lexicon. The GOP technique detected 70.2% of mispronunciations, with an equal rate of about 30% of false rejections and false acceptances. Finally, to broaden the scope of the study, we explored the correlation between GOP values and speech comprehensibility scores on a second corpus, composed of sentences recorded by six people with speech impairments due to cancer surgery or neurological disorders. Strong correlations (about 0.7 absolute) were obtained between GOP scores and subjective comprehensibility scores.
Results from both experiments tend to validate the use of GOP to measure speech capability loss, a dimension that could complement physiological measures in pathologies causing speech disorders.
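As a rough illustration, a GOP score for a phone can be sketched as a duration-normalised log-likelihood ratio between the canonical phone and the best competing phone over the frames forced alignment assigned to it. The function name and the list-of-lists input layout below are illustrative, and the exact normalisation used in the paper may differ:

```python
def gop_score(frame_loglikes, target_phone):
    """Sketch of a Goodness of Pronunciation (GOP) score.

    frame_loglikes[t][q] is the acoustic log-likelihood log p(o_t | phone q)
    for frame t, over the frames aligned to `target_phone`.
    """
    n_frames = len(frame_loglikes)
    # Log-likelihood of the canonical (target) phone over its segment.
    target_ll = sum(frame[target_phone] for frame in frame_loglikes)
    # Log-likelihood of the best-scoring phone per frame (free phone loop).
    best_ll = sum(max(frame) for frame in frame_loglikes)
    # Duration-normalised ratio: 0 for a canonical realisation, growing as
    # the realisation deviates (so higher scores mean larger dissimilarity).
    return (best_ll - target_ll) / n_frames
```

A canonical realisation, where the target phone is also the best-scoring phone on every frame, yields a score of 0; deviant realisations yield positive scores.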
Automatic Detection of Laryngeal Pathology on Sustained Vowels Using Short-Term Cepstral Parameters: Analysis of Performance and Theoretical Justification
Most speech signal analysis procedures for the automatic detection of laryngeal pathologies rely on parameters extracted through time-domain processing. Moreover, calculation of these parameters often requires prior pitch-period estimation; their validity therefore heavily depends on the robustness of pitch detection. In this paper, an alternative approach based on cepstral-domain processing is presented, which has the advantage of not requiring pitch estimation, thus gaining in both simplicity and robustness. While the proposed scheme is similar to solutions based on Mel-frequency cepstral parameters already present in the literature, it has an easier physical interpretation while achieving similar performance.
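The pitch-free property comes from how the cepstrum is computed: the real cepstrum of a windowed frame is the inverse FFT of its log magnitude spectrum, and the low-quefrency coefficients characterise the vocal-tract envelope without any pitch-period estimation. A minimal numpy sketch (frame length, window, and coefficient count are illustrative choices, not necessarily the paper's):

```python
import numpy as np

def short_term_cepstrum(frame: np.ndarray, n_coeffs: int = 13) -> np.ndarray:
    """Low-quefrency real-cepstrum coefficients of one speech frame.

    No pitch detection is involved: the coefficients are obtained purely
    from the frame's log magnitude spectrum.
    """
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # epsilon avoids log(0)
    cepstrum = np.fft.irfft(log_mag)
    # The first coefficients describe the spectral envelope (vocal tract);
    # pitch information sits at higher quefrencies and is simply ignored.
    return cepstrum[:n_coeffs]
```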
An intelligent healthcare system for detection and classification to discriminate vocal fold disorders
The growing population of senior citizens around the world will pose a major challenge in the future, occupying a significant portion of healthcare facilities. It is therefore necessary to develop intelligent healthcare systems that can be deployed in smart homes and cities for remote diagnosis. To this end, an intelligent healthcare system is proposed in this study. The proposed system is based on the human auditory mechanism and is capable of detecting and classifying various types of vocal fold disorder. In the proposed system, the critical-band phenomenon is implemented with bandpass filters spaced along the Bark scale to simulate the human auditory mechanism. The system therefore acts like an expert clinician who evaluates a patient's voice by auditory perception. The experimental results show that the proposed system can detect pathology with an accuracy of 99.72%. Moreover, the classification accuracy for vocal fold polyp, keratosis, vocal fold paralysis, vocal fold nodules, and adductor spasmodic dysphonia is 97.54%, 99.08%, 96.75%, 98.65%, and 95.83%, respectively. In addition, an experiment for paralysis versus all other disorders was also conducted, achieving an accuracy of 99.13%. The results show that the proposed system is accurate and reliable in vocal fold disorder assessment and can be deployed successfully for remote diagnosis. Moreover, its performance compares favourably with existing disorder assessment systems.
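A Bark-scale critical-band filterbank can be approximated by spacing band edges uniformly on the Bark scale, so that bands are narrow at low frequencies and wide at high frequencies, as in human hearing. A sketch using one common Hz-to-Bark formula (the paper's exact filter count and filter shapes are not specified here, so both are illustrative):

```python
import math

def hz_to_bark(f: float) -> float:
    """One widely used Hz -> Bark conversion (Zwicker-style)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def bark_band_edges(n_bands: int, fmax_hz: float = 8000.0) -> list:
    """Band edges in Hz for n_bands filters equally spaced on the Bark scale."""
    bark_max = hz_to_bark(fmax_hz)
    targets = [bark_max * i / n_bands for i in range(n_bands + 1)]

    def bark_to_hz(z: float) -> float:
        # The formula has no simple closed-form inverse; bisection suffices
        # because hz_to_bark is monotonically increasing.
        lo, hi = 0.0, fmax_hz
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if hz_to_bark(mid) < z:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    return [bark_to_hz(z) for z in targets]
```

The resulting edges place many narrow bands below ~1 kHz and few wide bands near the top of the range, mimicking auditory frequency resolution.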
Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition
Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty of collecting such data in large quantities. This paper explores a series of approaches to integrating domain-adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain-adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding in which the TDNN/Conformer system outputs are rescored using domain-adapted wav2vec2.0 models. In addition, domain-adapted wav2vec2.0 representations are utilized in acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric and elderly speech recognition systems. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that TDNN and Conformer ASR systems integrating domain-adapted wav2vec2.0 models consistently outperform the standalone wav2vec2.0 models by statistically significant WER reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two tasks respectively. The lowest published WERs of 22.56% (52.53% on very low intelligibility, 39.09% on unseen words) and 18.17% are obtained on the UASpeech test set of 16 dysarthric speakers and the DementiaBank Pitt test set respectively. Comment: accepted by ICASSP 202
Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method
The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer-Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods that rely on analysing phonemes can only detect categorical errors for phonemes that have an adequate amount of training data to be modelled. Given the unpredictable nature of the pronunciation errors of non-native or disordered speakers and the scarcity of training datasets, it is unfeasible to model all types of mispronunciations. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback for the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as the core model for the speech attribute detector. The proposed method was applied to L2 speech corpora collected from English learners with different native languages. Compared to traditional phoneme-level MDD, the proposed speech attribute MDD method achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) across all speech attributes.
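The multi-label idea can be illustrated at decoding time: each non-mutually-exclusive speech attribute keeps its own frame-level symbol stream, and each stream is collapsed independently with the standard CTC rule (merge repeats, then drop blanks). A toy sketch; the attribute names and label values below are invented for illustration, and in the paper a single wav2vec2-based model produces all streams jointly:

```python
def ctc_collapse(frames, blank=0):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for symbol in frames:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return out

def decode_attributes(per_attribute_frames, blank=0):
    """Multi-label CTC decoding sketch: one frame-level stream per speech
    attribute (e.g. voicing, place, manner), each collapsed on its own."""
    return {name: ctc_collapse(frames, blank)
            for name, frames in per_attribute_frames.items()}
```

Because the streams are decoded independently, a single time step can carry several active attributes at once, which a single mutually-exclusive phoneme sequence cannot express.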
New Device for Real-Time Voice Analysis of Patients with Parkinson's Disease
Parkinson's disease (PD) is a neurodegenerative disorder that affects the coordination of muscles and limbs, including those responsible for speech production. The lack of control of the limbs and muscles involved in the speech production process can generate intelligibility problems, and this situation has a negative impact on the social interaction of the patients. It has been demonstrated that constant speech therapy can improve the communication abilities of the patients; however, the measurement of recovery progress is done subjectively by speech therapists and neurologists. This calls for the development of flexible tools able to assess and guide the speech therapy of the patients. In this paper, the design and deployment of a new device for the real-time assessment of speech signals of people with PD is presented. The design and deployment span three platforms: first, a graphical user interface is developed in Matlab; second, a first prototype is implemented on a Texas Instruments TMS320C6713 digital signal processor (DSP); and third, the final device is developed on a mini-computer equipped with an audio codec, storage capacity, and a processing unit. The system is complemented with an LCD monitor to display the processed information in real time and a keyboard enabling interaction with the end user. Different acoustic and nonlinear dynamics measures commonly used in the state of the art for the assessment of the speech of people with PD are implemented on the three platforms. In accordance with the state of the art, the designed platform shows an increment in the variation of the fundamental period of speech (commonly called pitch) in people with PD; additionally, the decrease of the vowel space area is validated for patients with PD. These results indicate that the designed device is useful for the assessment and monitoring of the speech therapy of people with PD.
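The vowel space area mentioned above is typically the area of the polygon spanned by the corner vowels in the F1-F2 formant plane, which can be computed with the shoelace formula; a shrinking area is the reduction reported for PD speech. A small sketch, not the device's actual implementation:

```python
def vowel_space_area(formants):
    """Area of the polygon spanned by corner vowels in the F1-F2 plane.

    formants: ordered list of (F1, F2) pairs in Hz, e.g. for /a/, /i/, /u/.
    Uses the shoelace formula; the result is in Hz^2.
    """
    area = 0.0
    n = len(formants)
    for i in range(n):
        x1, y1 = formants[i]
        x2, y2 = formants[(i + 1) % n]  # wrap around to close the polygon
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0
```

Tracking this area over therapy sessions gives a single scalar that decreases as articulation becomes more centralized.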
Factor Analysis of Speech Signal for Parkinson’s Disease Prediction using Support Vector Machine
Abstract—Speech signals can be used as a marker for the identification of Parkinson's disease (PD), a progressive neurological disorder that mainly affects people in old age. Identification of relevant discriminant features from the speech signal has been a challenge in this area. In this paper, the factor analysis method is used to select distinguishing features from a set of features; these selected features are more effective for detection of PD. From an empirical study on an existing dataset and a generated dataset, it was found that jitter, shimmer variants, and the noise-to-harmonic ratio are dominant features for detecting PD. These features are then employed in a support vector machine to classify PD patients against healthy subjects. This method provides an average accuracy of 85%, with sensitivity and specificity of about 86% and 84%, respectively. An important outcome of this study is that sustained vowel phonation captures distinguishing information for the analysis and detection of PD.
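In their simplest "local" form, jitter and shimmer are the mean absolute cycle-to-cycle change relative to the mean, computed on pitch periods and on cycle peak amplitudes respectively. A minimal sketch; the dataset's exact variants (e.g. RAP, PPQ5, APQ11) differ in their smoothing windows:

```python
def jitter_local(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, divided by the mean period. Input: list of periods in seconds."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def shimmer_local(amplitudes):
    """Local shimmer: the same cycle-to-cycle ratio, computed on the peak
    amplitude of each glottal cycle instead of its duration."""
    return jitter_local(amplitudes)
```

A perfectly steady sustained vowel yields zero for both measures; pathological voices tend to show elevated values, which is what makes them discriminative for PD detection.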
Models and Analysis of Vocal Emissions for Biomedical Applications
The proceedings of the MAVEBA Workshop, held biannually, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to the clinical diagnosis and classification of vocal pathologies.
CDSD: Chinese Dysarthria Speech Database
We present the Chinese Dysarthria Speech Database (CDSD) as a valuable resource for dysarthria research. The database comprises speech data from 24 participants with dysarthria. Each participant recorded one hour of speech, and one of them recorded an additional 10 hours, resulting in 34 hours of speech material. To accommodate participants with varying cognitive levels, our text pool primarily consists of content from the AISHELL-1 dataset and speeches by primary and secondary school students. Participants read these texts while their speech was recorded with a mobile device or a ZOOM F8n multi-track field recorder. In this paper, we elucidate the data collection and annotation processes and present an approach for establishing a baseline for dysarthric speech recognition. Furthermore, we conducted a speaker-dependent dysarthric speech recognition experiment using the additional 10 hours of speech data from one of our participants. Our findings indicate that, after extensive data-driven model training, fine-tuning on limited quantities of speaker-specific data yields commendable results in speaker-dependent dysarthric speech recognition. However, we observe significant variations in recognition results among different dysarthric speakers. These insights provide valuable reference points for speaker-dependent dysarthric speech recognition. Comment: 9 pages, 3 figures
The use of speech recognition technology by people living with Amyotrophic Lateral Sclerosis: a scoping review
More than 80% of people living with Amyotrophic Lateral Sclerosis (plwALS) develop difficulties with their speech, affecting communication, self-identity and quality of life. Automatic speech recognition technology (ASR) is becoming a common way to interact with a broad range of devices, to find information and control the environment.
ASR can be problematic for people with acquired neurogenic motor speech difficulties (dysarthria). Given that the field is rapidly developing, a scoping review is warranted.