
    EMG-to-Speech: Direct Generation of Speech from Facial Electromyographic Signals

    The general objective of this work is the design, implementation, improvement, and evaluation of a system that uses surface electromyographic (EMG) signals to directly synthesize audible speech: EMG-to-speech.
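    A minimal sketch of the core of such a pipeline, assuming frame-wise regression from EMG features to mel-spectrogram frames; the dimensions, network shape, and feature choices below are illustrative assumptions, not the thesis architecture:

```python
import torch
import torch.nn as nn

EMG_DIM = 35   # assumed: stacked time-domain features from several EMG channels
MEL_DIM = 80   # assumed: mel-spectrogram bins of the target speech frame

class EMGToMel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMG_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, MEL_DIM),   # predict one spectral frame per EMG frame
        )

    def forward(self, emg_frames):
        return self.net(emg_frames)

model = EMGToMel()
emg = torch.randn(16, EMG_DIM)              # a batch of 16 EMG feature frames
mel = model(emg)                            # predicted spectral frames (16, 80)
target = torch.randn(16, MEL_DIM)           # stand-in for parallel speech frames
loss = nn.functional.mse_loss(mel, target)  # frame-wise regression loss
# A vocoder (e.g. Griffin-Lim or a neural vocoder) would turn the predicted
# mel frames into an audible waveform; that step is omitted here.
```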

    Towards using Cough for Respiratory Disease Diagnosis by leveraging Artificial Intelligence: A Survey

    Cough acoustics contain a wealth of vital information about pathomorphological alterations in the respiratory system. Reliable and accurate detection of cough events, investigation of the underlying latent cough features, and disease diagnosis can play an indispensable role in revitalizing healthcare practices. The recent application of Artificial Intelligence (AI) and advances in ubiquitous computing for respiratory disease prediction have created an auspicious trend and a myriad of future possibilities in the medical domain. In particular, there is a rapidly emerging trend of Machine Learning (ML)- and Deep Learning (DL)-based diagnostic algorithms that exploit cough signatures. The enormous body of literature on cough-based AI algorithms demonstrates that these models can play a significant role in detecting the onset of a specific respiratory disease. However, it is pertinent to collect the information from all relevant studies in an exhaustive manner so that medical experts and AI scientists can analyze the decisive role of AI/ML. This survey offers a comprehensive overview of cough-data-driven ML/DL detection and preliminary diagnosis frameworks, along with a detailed list of significant features. We investigate the mechanisms that cause cough and the latent cough features of the respiratory modalities. We also analyze customized cough-monitoring applications and their AI-powered recognition algorithms. Challenges and prospective future research directions for developing practical, robust, and ubiquitous solutions are also discussed in detail.
    Comment: 30 pages, 12 figures, 9 tables
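    A minimal sketch of the common recipe behind the surveyed cough-based ML pipelines, assuming MFCC features pooled into a fixed-length vector and a generic classifier; the synthetic signals, feature choice, and two-class setup are illustrative assumptions:

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def cough_features(y, sr=16000, n_mfcc=13):
    """Mean and std of MFCCs over time -> one fixed-length vector per recording."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

rng = np.random.default_rng(0)
# Stand-ins for labelled cough recordings (label 1 = disease, 0 = healthy).
X = np.stack([cough_features(rng.standard_normal(16000)) for _ in range(40)])
y = rng.integers(0, 2, size=40)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:5]))  # per-recording screening decisions
```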

    Recognition of Activities of Daily Living and Environments Using Acoustic Sensors Embedded on Mobile Devices

    The identification of Activities of Daily Living (ADL) is intrinsically linked to the recognition of the user's environment. This detection can be performed with the standard sensors present in everyday mobile devices. The main proposal is, on the one hand, to recognize the user's environment and standing activities and, on the other hand, to include these features in a framework for ADL and environment identification. This paper is therefore divided into two parts: first, acoustic sensors are used to collect data for environment recognition and, second, the recognized environment is fused with the information gathered by motion and magnetic sensors. Environment and ADL recognition are performed by pattern recognition techniques within a system comprising data collection, processing, fusion, and classification procedures. The classification stage compares distinct types of Artificial Neural Networks (ANN), analyzing various implementations and choosing the most suitable one for each stage of the developed system. The results show 85.89% accuracy using Deep Neural Networks (DNN) with normalized data for ADL recognition and 86.50% accuracy using Feedforward Neural Networks (FNN) with non-normalized data for environment recognition. Furthermore, the tests conducted show 100% accuracy for standing-activity recognition using DNN with normalized data, which is the configuration best suited to the intended purpose. This work is funded by FCT/MEC through national funds and co-funded by the FEDER-PT2020 partnership agreement under project UID/EEA/50008/2019.
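    A minimal sketch of the two-stage idea, assuming illustrative feature dimensions and synthetic stand-in data rather than the authors' dataset: an acoustic model first recognizes the environment, and its prediction is fused with motion/magnetic features for ADL classification, with inputs normalized as the paper recommends for DNNs:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
acoustic = rng.standard_normal((200, 26))   # stand-in acoustic feature vectors
motion = rng.standard_normal((200, 15))     # stand-in motion/magnetic features
env_labels = rng.integers(0, 4, 200)        # e.g. street, home, office, gym
adl_labels = rng.integers(0, 5, 200)        # e.g. walking, standing, ...

# Stage 1: recognize the environment from normalized acoustic features.
acoustic_n = StandardScaler().fit_transform(acoustic)
env_clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=1).fit(acoustic_n, env_labels)

# Stage 2: fuse the recognized environment (one-hot) with motion features
# and classify the ADL, again on normalized inputs.
env_onehot = np.eye(4)[env_clf.predict(acoustic_n)]
fused = StandardScaler().fit_transform(np.hstack([motion, env_onehot]))
adl_clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=1).fit(fused, adl_labels)
```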

    Speech Enhancement for Automatic Analysis of Child-Centered Audio Recordings

    The analysis of child-centred daylong naturalistic audio recordings has become a de facto research protocol in the scientific study of child language development. Researchers increasingly use these recordings to understand the linguistic environment a child encounters in her routine interactions with the world. The recordings are captured by a microphone that the child wears throughout the day. Being naturalistic, they contain many unwanted everyday sounds that degrade the performance of speech analysis tasks. The purpose of this thesis is to investigate the utility of speech enhancement (SE) algorithms in the automatic analysis of such recordings. To this end, several classical signal processing and modern machine learning-based SE methods were employed 1) as denoisers for speech corrupted by additive noise sampled from real-life child-centred daylong recordings and 2) as front-ends for the downstream speech processing tasks of addressee classification (infant- vs. adult-directed speech) and automatic syllable count estimation. The downstream tasks were conducted on data derived from a set of geographically, culturally, and linguistically diverse child-centred daylong audio recordings. Denoising performance was evaluated through objective quality metrics (spectral distortion and instrumental intelligibility) and through downstream task performance. Finally, the objective evaluation results were compared with the downstream task results to determine whether objective metrics can serve as a reasonable proxy for selecting an SE front-end for a downstream task. The results show that a recently proposed Long Short-Term Memory (LSTM)-based progressive learning architecture provides the largest performance gains in the downstream tasks compared with the other SE methods and baseline results. Classical signal processing-based SE methods also achieve competitive performance. The comparison of objective assessment and downstream task results revealed no predictive relationship between task-independent objective metrics and downstream task performance.
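    A minimal sketch of an LSTM-based speech-enhancement front-end of the kind compared in this work, assuming a standard mask-estimation formulation and illustrative dimensions (this is not the progressive learning architecture itself):

```python
import torch
import torch.nn as nn

N_FREQ = 257  # assumed: STFT magnitude bins for a 512-point FFT

class LSTMMasker(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(N_FREQ, 256, num_layers=2, batch_first=True)
        self.proj = nn.Linear(256, N_FREQ)

    def forward(self, noisy_mag):            # (batch, frames, freq)
        h, _ = self.lstm(noisy_mag)
        mask = torch.sigmoid(self.proj(h))   # time-frequency mask in [0, 1]
        return mask * noisy_mag              # enhanced magnitude spectrogram

model = LSTMMasker()
noisy = torch.rand(4, 100, N_FREQ)           # 4 utterances, 100 frames each
enhanced = model(noisy)                      # fed to the downstream task
```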

    Parkinsonian Speech and Voice Quality: Assessment and Improvement

    Parkinson’s disease (PD) is the second most common neurodegenerative disease. Statistics show that nearly 90% of people with PD develop voice and speech disorders. Speech production impairments in PD subjects typically result in hypophonia and, consequently, a poor speech signal-to-noise ratio (SNR) in noisy environments and inferior speech intelligibility and quality. Assessment, monitoring, and improvement of the perceived quality and intelligibility of Parkinsonian voice and speech are therefore paramount. In the first study of this thesis, the perceived quality of sustained vowels produced by PD patients was assessed through objective predictors. Subjective quality ratings of sustained vowels were collected from 51 PD patients, on and off Levodopa medication, and 7 control subjects. Features extracted from the sustained vowel recordings were combined using linear regression (LR) and support vector regression (SVR). An objective metric that combined linear prediction and harmonicity features achieved a correlation of 0.81 with subjective ratings, higher than the performance reported in the literature. The second study focused on predicting the quality of amplified Parkinsonian speech. Speech amplifiers are used by PD patients to counteract hypophonia. To benchmark amplifier performance, subjective quality ratings of speech samples from 11 PD patients and 10 control subjects using 7 different speech amplifiers in different background noise conditions were collected. Objective quality predictors were then developed in combination with machine learning algorithms such as deep learning (DL). It was shown that the speech amplifiers differentially affect Parkinsonian speech quality and that the composite objective metric achieved a correlation of 0.85 with subjective speech quality ratings. In the third study, a new signal-to-noise feedback (SNF) device was designed and developed to help PD patients control their speech SNR, intelligibility, and quality. The proposed SNF device contained dual ear-level microphones for estimating the speech SNR, a throat accelerometer for reliable voice activity detection, and visual/auditory alarms triggered when the produced speech SNR fell below a certain threshold. Performance evaluation of this device in noisy environments demonstrated significant improvements in speech SNR, perceived intelligibility, and predicted quality, especially at high background noise levels.
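    A minimal sketch of the first study's evaluation recipe, assuming illustrative synthetic features and ratings: combine vowel features with support vector regression and report the Pearson correlation between predicted and subjective quality, the figure of merit quoted above:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.standard_normal((120, 10))   # stand-in linear prediction + harmonicity features
ratings = rng.uniform(1, 5, 120)     # stand-in subjective quality ratings

X_tr, X_te, y_tr, y_te = train_test_split(X, ratings, random_state=2)
pred = SVR(kernel="rbf", C=1.0).fit(X_tr, y_tr).predict(X_te)
r = np.corrcoef(pred, y_te)[0, 1]    # correlation with subjective ratings
print(f"Pearson correlation: {r:.2f}")
```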

    Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)


    Deep learning for i-vector speaker and language recognition

    Over the last few years, i-vectors have been the state-of-the-art technique in speaker and language recognition. Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors, but the DL techniques in use are computationally expensive and need speaker and/or phonetic labels for the background data, which are not easily accessible in practice. Moreover, the lack of speaker-labeled background data creates a large performance gap in speaker recognition between the two well-known i-vector scoring techniques: cosine distance and Probabilistic Linear Discriminant Analysis (PLDA). Filling this gap without speaker labels, which are expensive to obtain in practice, remains a challenge. Although some unsupervised clustering techniques have been proposed to estimate the speaker labels, they cannot estimate them accurately. This thesis addresses these problems by using DL technology in different ways, without any need for speaker or phonetic labels. In order to fill the performance gap between cosine and PLDA scoring given unlabeled background data, we propose an impostor selection algorithm and a universal model adaptation process in a hybrid system based on Deep Belief Networks (DBNs) and Deep Neural Networks (DNNs) to discriminatively model each target speaker. To gain more insight into the behavior of DL techniques in both single- and multi-session speaker enrollment tasks, experiments were carried out in both scenarios. Experiments on the National Institute of Standards and Technology (NIST) 2014 i-vector challenge show that 46% of this performance gap, in terms of minDCF, is filled by the proposed DL-based system. Furthermore, the score combination of the proposed DL-based system and PLDA with estimated labels covers 79% of this gap. In the second line of research, we develop an efficient alternative vector representation of speech that keeps the computational cost as low as possible and avoids phonetic labels, which are not always accessible. The proposed vectors are based on both Gaussian Mixture Models (GMMs) and Restricted Boltzmann Machines (RBMs) and are referred to as GMM-RBM vectors. The role of the RBM is to learn the total speaker and session variability among background GMM supervectors. This RBM, referred to as the Universal RBM (URBM), is then used to transform unseen supervectors into the proposed low-dimensional vectors. The use of different activation functions for training the URBM and different transformation functions for extracting the proposed vectors is investigated. Finally, a variant of the Rectified Linear Unit (ReLU), referred to as the Variable ReLU (VReLU), is proposed. Experiments on the core test condition 5 of the NIST Speaker Recognition Evaluation (SRE) 2010 show that results comparable to conventional i-vectors are achieved with a clearly lower computational load in the vector extraction process. Finally, for the Language Identification (LID) application, we propose a DNN architecture to effectively model the i-vector space of four languages, English, Spanish, German, and Finnish, in the car environment. Both raw i-vectors and session-variability-compensated i-vectors are evaluated as input vectors to the DNN. The performance of the proposed DNN architecture is compared with both conventional GMM-UBM and i-vector/Linear Discriminant Analysis (LDA) systems, considering the effect of signal duration. It is shown that signals with a duration between 2 and 3 seconds meet the accuracy and speed requirements of this application, in which the proposed DNN architecture outperforms the GMM-UBM and i-vector/LDA systems by 37% and 28%, respectively.
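    For reference, cosine scoring, the simpler of the two i-vector scoring techniques discussed above, reduces a verification trial to the cosine similarity between length-normalized enrollment and test i-vectors; unlike PLDA, it needs no speaker-labeled background data to train. A minimal sketch with illustrative dimensions:

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    """Cosine similarity between length-normalized i-vectors."""
    e = enroll_ivec / np.linalg.norm(enroll_ivec)
    t = test_ivec / np.linalg.norm(test_ivec)
    return float(e @ t)

rng = np.random.default_rng(3)
enroll = rng.standard_normal(400)   # assumed 400-dimensional i-vectors
test = rng.standard_normal(400)
score = cosine_score(enroll, test)
decision = score > 0.5              # threshold tuned on development data
```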

    Secure and Usable Behavioural User Authentication for Resource-Constrained Devices

    Robust user authentication on small form-factor and resource-constrained smart devices, such as smartphones, wearables, and IoT devices, remains an important problem, especially as such devices increasingly become stores of sensitive personal data such as daily digital payment traces, health/wellness records, and contact e-mails. Hence, a secure, usable, and practical authentication mechanism that restricts access by unauthorized users is a basic requirement for such devices. Existing password-based authentication methods place a mental demand on the user and are not secure. Behavioural-biometric authentication provides an attractive alternative that can replace passwords while offering high security and usability. To this end, we devise and study novel schemes and modalities and investigate how behaviour-based user authentication can be practically realized on resource-constrained devices. In the first part of the thesis, we implement and evaluate touch-based behavioural biometrics on wearables and smartphones. Our results show that touch-based behavioural authentication can yield very high accuracy and a small inference time without imposing large resource requirements on wearable devices. The second part of the thesis focuses on designing a novel hybrid scheme named BehavioCog, which combines touch gestures (a behavioural biometric) with challenge-response-based cognitive authentication. Touch-based behavioural authentication is highly usable but prone to observation attacks, while cognitive authentication schemes are highly resistant to observation attacks but not highly usable; the hybrid scheme improves the usability of cognitive authentication and the security of touch-based behavioural biometrics at the same time. Next, we introduce and evaluate a novel behavioural-biometric modality named BreathPrint, based on the acoustics of an individual's breathing gestures. Breathing-based authentication is highly usable and secure, as it only requires the person to breathe, and its low observability protects it against spoofing and replay attacks. Our investigation of BreathPrint showed that it can support efficient real-time authentication on multiple standalone smart devices, especially using deep learning models.
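    A minimal sketch of per-user authentication in the BreathPrint style, assuming generic acoustic features and a simple generative model rather than the thesis's deep learning models: enroll by fitting a model to the user's breathing audio, then accept a sample only if its likelihood clears a threshold tuned for the desired false-accept rate:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
enroll_feats = rng.standard_normal((500, 20))  # stand-in per-frame acoustic features

# Enrollment: model the legitimate user's breathing acoustics.
user_model = GaussianMixture(n_components=8, random_state=4).fit(enroll_feats)
threshold = np.percentile(user_model.score_samples(enroll_feats), 5)

def authenticate(sample_feats):
    """Accept if the mean frame log-likelihood clears the enrollment threshold."""
    return user_model.score_samples(sample_feats).mean() > threshold

print(authenticate(rng.standard_normal((100, 20))))  # an impostor attempt
```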