4,810 research outputs found

    Automatic Assessment of Speech Capability Loss in Disordered Speech

    In this article, we report on the use of an automatic technique to assess pronunciation in the context of several types of speech disorders. Although such tools already exist, they are mainly used in a different context, namely Computer-Assisted Language Learning, where the objective is to assess non-native pronunciation by detecting learners' mispronunciations at the segmental and/or suprasegmental level. In our work, we sought to determine whether the Goodness of Pronunciation (GOP) algorithm, which detects phone-level mispronunciations by means of automatic speech recognition, could also detect segmental deviances in disordered speech. Our main experiment is an analysis of speech from people with unilateral facial palsy, a pathology that may impact the realization of certain phonemes such as bilabial plosives and sibilants. Speech read by 32 speakers at four clinical severity grades was automatically aligned, and GOP scores were computed for each phone realization. The highest scores, which indicate large dissimilarities from standard phone realizations, were obtained for the most severely impaired speakers. The corresponding speech subset was manually transcribed at the phone level; 8.3% of the phones differed from the standard pronunciations extracted from our lexicon. The GOP technique detected 70.2% of mispronunciations, with roughly equal false-rejection and false-acceptance rates of about 30%. Finally, to broaden the scope of the study, we explored the correlation between GOP values and speech comprehensibility scores on a second corpus of sentences recorded by six people with speech impairments due to cancer surgery or neurological disorders. Strong correlations (about 0.7 absolute) were obtained between GOP scores and subjective comprehensibility scores. Results from both experiments support the use of GOP to measure speech capability loss, a dimension that could complement physiological measures in pathologies causing speech disorders.
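The GOP score described above is commonly approximated as a duration-normalised log-likelihood ratio between the forced alignment of the canonical phone and an unconstrained phone loop. The sketch below illustrates that standard approximation, not the authors' exact implementation; the per-frame log-likelihood matrix and all names are illustrative.

```python
import numpy as np

def gop_score(frame_loglik, canonical_phone):
    """Goodness of Pronunciation for one phone segment.

    frame_loglik: (T, P) array of per-frame log-likelihoods for each of
    P candidate phones (as produced by some acoustic model).
    canonical_phone: index of the phone expected from the lexicon.

    Returns the duration-normalised gap between the best competing
    phone sequence and the canonical phone: values near 0 indicate a
    canonical realization, large positive values indicate deviance.
    """
    forced = frame_loglik[:, canonical_phone].sum()  # forced-alignment term
    free = frame_loglik.max(axis=1).sum()            # unconstrained phone loop
    return (free - forced) / len(frame_loglik)
```

Thresholding this score per phone class then yields the mispronunciation detector whose false-rejection/false-acceptance trade-off the abstract reports.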

    Automatic Detection of Laryngeal Pathology on Sustained Vowels Using Short-Term Cepstral Parameters: Analysis of Performance and Theoretical Justification

    The majority of speech signal analysis procedures for the automatic detection of laryngeal pathologies rely on parameters extracted by time-domain processing. Moreover, calculation of these parameters often requires prior pitch-period estimation; their validity therefore depends heavily on the robustness of the pitch detector. In this paper, an alternative approach based on cepstral-domain processing is presented, which has the advantage of not requiring pitch estimation, thus gaining in both simplicity and robustness. While the proposed scheme is similar to solutions based on Mel-frequency cepstral parameters already present in the literature, it has a more direct physical interpretation while achieving comparable performance.
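The pitch-free property comes from the cepstrum itself: a single transform separates the slowly varying vocal-tract envelope (low quefrency) from the excitation (high quefrency), so no pitch-period estimate is needed. A minimal sketch of short-term real-cepstrum extraction, with illustrative frame length and coefficient count (not the paper's exact parameterisation):

```python
import numpy as np

def short_term_cepstrum(frame, n_coeffs=12):
    """Real cepstrum of one analysis frame.

    The low-quefrency coefficients describe the vocal-tract envelope
    and can be used as pathology-detection features without any prior
    pitch-period estimation.
    """
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    log_spectrum = np.log(spectrum + 1e-12)   # guard against log(0)
    cepstrum = np.fft.irfft(log_spectrum)     # inverse transform of log spectrum
    return cepstrum[:n_coeffs]                # keep the envelope terms
```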

    An intelligent healthcare system for detection and classification to discriminate vocal fold disorders

    The growing population of senior citizens around the world will pose a major challenge in the future, occupying a significant portion of healthcare facilities. It is therefore necessary to develop intelligent healthcare systems that can be deployed in smart homes and cities for remote diagnosis. To this end, an intelligent healthcare system is proposed in this study. The proposed system is based on the human auditory mechanism and is capable of detecting and classifying various types of vocal fold disorders. In the proposed system, the critical-band phenomenon is implemented using bandpass filters spaced over the Bark scale to simulate the human auditory mechanism; the system therefore acts like an expert clinician who evaluates a patient's voice by auditory perception. The experimental results show that the proposed system detects pathology with an accuracy of 99.72%. Moreover, the classification accuracies for vocal fold polyp, keratosis, vocal fold paralysis, vocal fold nodules, and adductor spasmodic dysphonia are 97.54%, 99.08%, 96.75%, 98.65%, and 95.83%, respectively. In addition, an experiment for paralysis versus all other disorders was conducted, and an accuracy of 99.13% was achieved. These results show that the proposed system is accurate and reliable in vocal fold disorder assessment and can be deployed successfully for remote diagnosis, with better performance than existing disorder assessment systems.
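The critical-band front-end can be sketched as a triangular filterbank with centre frequencies equally spaced on the Bark scale. The snippet below is an illustrative reconstruction using Zwicker's Hz-to-Bark approximation, not the authors' implementation; filter count and FFT size are assumptions.

```python
import numpy as np

def hz_to_bark(f):
    # Zwicker's approximation of the Bark critical-band scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_filterbank(n_filters, n_fft, sr):
    """Triangular band-pass filters with centres equally spaced in Bark.

    Returns an (n_filters, n_fft//2 + 1) weight matrix to apply to a
    power spectrum, mimicking critical-band auditory analysis.
    """
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    barks = hz_to_bark(freqs)
    centres = np.linspace(barks[0], barks[-1], n_filters + 2)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = centres[i], centres[i + 1], centres[i + 2]
        rising = (barks - lo) / (mid - lo)    # left slope of the triangle
        falling = (hi - barks) / (hi - mid)   # right slope of the triangle
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, 1.0)
    return fb
```

The energies produced by such a filterbank would then feed the detection and classification stages the abstract evaluates.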

    Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

    Automatic recognition of disordered and elderly speech remains a highly challenging task to date, due to the difficulty of collecting such data in large quantities. This paper explores a series of approaches to integrate domain-adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic front-ends and domain-adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding in which TDNN/Conformer system outputs are rescored using domain-adapted wav2vec2.0 models. In addition, domain-adapted wav2vec2.0 representations are utilized in acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric and elderly speech recognition systems. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that TDNN and Conformer ASR systems integrating domain-adapted wav2vec2.0 models consistently outperform the standalone wav2vec2.0 models, with statistically significant WER reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two tasks respectively. The lowest published WERs of 22.56% (52.53% on very low intelligibility, 39.09% on unseen words) and 18.17% are obtained on the UASpeech test set of 16 dysarthric speakers and the DementiaBank Pitt test set, respectively. Comment: accepted by ICASSP 202

    Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

    The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer-Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods that rely on analysing phonemes can only detect categorical errors for phonemes with an adequate amount of training data to be modelled. Given the unpredictable nature of the pronunciation errors of non-native or disordered speakers and the scarcity of training datasets, it is infeasible to model all types of mispronunciation. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break phoneme production down into elementary components that are directly related to the articulatory system, leading to more formative feedback for the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually-exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as the core model of the speech attribute detector. The proposed method was applied to L2 speech corpora collected from English learners with different native languages. Compared to traditional phoneme-level MDD, the proposed speech attribute MDD method achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all speech attributes.
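The attribute decomposition can be illustrated as a mapping from each phoneme to one label per articulatory attribute group; each group then supplies its own target stream for the multi-label CTC objective over the shared wav2vec2 encoder. The table below is a small assumed subset for illustration, not the paper's attribute inventory.

```python
# Illustrative phoneme -> speech-attribute mapping (assumed subset; a real
# system would cover the full phoneme set and attribute inventory).
ATTRIBUTE_TABLE = {
    "p": {"voicing": "unvoiced", "place": "bilabial", "manner": "stop"},
    "b": {"voicing": "voiced",   "place": "bilabial", "manner": "stop"},
    "s": {"voicing": "unvoiced", "place": "alveolar", "manner": "fricative"},
    "z": {"voicing": "voiced",   "place": "alveolar", "manner": "fricative"},
    "m": {"voicing": "voiced",   "place": "bilabial", "manner": "nasal"},
}

def phonemes_to_attribute_targets(phonemes):
    """Turn one phoneme sequence into per-attribute label sequences.

    Each attribute group (voicing, place, manner) yields its own target
    stream; the groups are not mutually exclusive, which is why a
    multi-label CTC variant is needed to train them in a single model.
    """
    groups = ("voicing", "place", "manner")
    return {g: [ATTRIBUTE_TABLE[p][g] for p in phonemes] for g in groups}
```

A diagnosis then compares detected attribute sequences against these canonical targets, so an error can be localised to, say, voicing alone rather than a whole phoneme substitution.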

    A new device for real-time voice analysis of patients with Parkinson's disease

    ABSTRACT: Parkinson's disease (PD) is a neurodegenerative disorder that affects the coordination of muscles and limbs, including those responsible for speech production. The resulting lack of motor control can cause intelligibility problems, with a negative impact on the social interaction of patients. It has been demonstrated that constant speech therapy can improve patients' communication abilities; however, the measurement of recovery progress is performed subjectively by speech therapists and neurologists. This calls for flexible tools able to assess and guide the speech therapy of the patients. This paper presents the design and deployment of a new device for the real-time assessment of speech signals of people with PD. The design and deployment span three platforms: first, a graphical user interface developed in Matlab; second, a first prototype implemented on a Texas Instruments TMS320C6713 digital signal processor (DSP); and third, the final device built on a mini-computer equipped with an audio codec, storage capacity, and the processing unit. The system is complemented with an LCD monitor to display the processed information in real time and a keyboard enabling end-user interaction. Different acoustic and nonlinear dynamics measures used in the state of the art for the assessment of the speech of people with PD are implemented on the three platforms. In accordance with the state of the art, the designed platforms show an increase in the variation of the fundamental period of speech (commonly called pitch) of people with PD. Additionally, the decrease of the vowel space area is validated for patients with PD. These results indicate that the designed device is useful for the assessment and monitoring of the speech therapy of people with PD.
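The vowel space area mentioned above can be computed from the (F1, F2) formant coordinates of the corner vowels with the shoelace formula; a shrinking area is the articulatory marker reported for PD speech. The sketch below is illustrative, not the device's actual firmware, and the formant values in the usage note are hypothetical.

```python
def vowel_space_area(formants):
    """Area of the polygon spanned by (F1, F2) points of the corner
    vowels, via the shoelace formula. Inputs are (F1, F2) pairs in Hz,
    listed in order around the polygon."""
    n = len(formants)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = formants[i]
        x2, y2 = formants[(i + 1) % n]   # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0
```

For example, hypothetical corner vowels /i/, /a/, /u/ measured at (300, 2300), (800, 1200), and (350, 800) Hz would give a triangular area in Hz²; tracking that value across therapy sessions is the kind of objective follow-up the device targets.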

    Factor Analysis of Speech Signal for Parkinson’s Disease Prediction using Support Vector Machine

    Abstract—The speech signal can be used as a marker for the identification of Parkinson's disease (PD), a progressive neurological disorder that mainly affects people in old age. Identifying relevant discriminant features in the speech signal has been a challenge in this area. In this paper, factor analysis is used to select distinguishing features from a larger feature set; these selected features are more effective for the detection of PD. From an empirical study on an existing dataset and a newly generated dataset, it was found that jitter, shimmer variants, and the noise-to-harmonic ratio are the dominant features for detecting PD. These features are then fed to a support vector machine to classify PD patients against healthy subjects. The method provides an average accuracy of 85%, with sensitivity and specificity of about 86% and 84%, respectively. An important outcome of this study is that sustained vowel phonation captures distinguishing information for the analysis and detection of PD.
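The dominant features named above have simple definitions: local jitter is the mean absolute difference between consecutive pitch periods relative to the mean period, and local shimmer is the analogous quantity over peak amplitudes. A minimal sketch of the feature extraction (illustrative only; the paper's exact variants and the downstream SVM are not reproduced here):

```python
import numpy as np

def jitter_local(periods):
    """Local jitter (%): mean absolute difference between consecutive
    pitch periods, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / periods.mean()

def shimmer_local(amplitudes):
    """Local shimmer (%): mean absolute difference between consecutive
    peak amplitudes, relative to the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amplitudes))) / amplitudes.mean()
```

Per-recording vectors of such measures (jitter, shimmer variants, noise-to-harmonic ratio) would then be the inputs to the support vector machine classifier.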

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The MAVEBA Workshop, held every two years, collects in its proceedings the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images as a support to the clinical diagnosis and classification of vocal pathologies.

    CDSD: Chinese Dysarthria Speech Database

    We present the Chinese Dysarthria Speech Database (CDSD) as a valuable resource for dysarthria research. The database comprises speech data from 24 participants with dysarthria: each participant recorded one hour of speech, and one of them recorded an additional 10 hours, resulting in 34 hours of speech material. To accommodate participants with varying cognitive levels, our text pool primarily consists of content from the AISHELL-1 dataset and speeches by primary and secondary school students. Participants read these texts using either a mobile device or a ZOOM F8n multi-track field recorder. In this paper, we describe the data collection and annotation processes and present an approach for establishing a baseline for dysarthric speech recognition. Furthermore, we conducted a speaker-dependent dysarthric speech recognition experiment using the additional 10 hours of speech data from one participant. Our findings indicate that, after extensive data-driven model training, fine-tuning on limited quantities of speaker-specific data yields commendable results in speaker-dependent dysarthric speech recognition. However, we observe significant variation in recognition results among different dysarthric speakers. These insights provide valuable reference points for speaker-dependent dysarthric speech recognition. Comment: 9 pages, 3 figures

    The use of speech recognition technology by people living with Amyotrophic Lateral Sclerosis: a scoping review

    More than 80% of people living with Amyotrophic Lateral Sclerosis (plwALS) develop difficulties with their speech, affecting communication, self-identity, and quality of life. Automatic speech recognition (ASR) technology is becoming a common way to interact with a broad range of devices, to find information, and to control the environment. ASR can be problematic for people with acquired neurogenic motor speech difficulties (dysarthria). Given that the field is rapidly developing, a scoping review is warranted.