
    I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We then propose automatic classification based both on brute-forced low-level acoustic features and on higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be solved easily and independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food as well as not eating. Early fusion of the intelligibility-related features with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with a determination coefficient of up to 56.2%
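    To make the evaluation protocol concrete, here is a minimal sketch of leave-one-speaker-out SVM classification scored by average recall (computed here as the unweighted mean of per-class recalls). The feature arrays and labels are synthetic placeholders; the actual iHEARu-EAT features and label set are not reproduced here.

```python
# Leave-one-speaker-out SVM evaluation with unweighted average recall (UAR),
# sketched on synthetic data standing in for per-utterance acoustic features.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n_utt, n_feat, n_speakers = 300, 40, 30
X = rng.normal(size=(n_utt, n_feat))          # placeholder acoustic features
y = rng.integers(0, 7, size=n_utt)            # 6 food classes + "not eating"
speakers = rng.integers(0, n_speakers, size=n_utt)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
y_pred = np.empty_like(y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    clf.fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = clf.predict(X[test_idx])

# Unweighted average recall = mean of the per-class recalls
uar = recall_score(y, y_pred, average="macro")
print(f"UAR: {uar:.3f}")
```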

    Improving automatic detection of obstructive sleep apnea through nonlinear analysis of sustained speech

    We present a novel approach for the detection of severe obstructive sleep apnea (OSA) based on patients' voices, introducing nonlinear measures to describe sustained speech dynamics. The nonlinear features were combined with state-of-the-art speech recognition systems using statistical modeling techniques (Gaussian mixture models, GMMs) over a cepstral parameterization (MFCCs) for both continuous and sustained speech. Tests were performed on a database including speech recordings from both severe OSA and control speakers. A 10% relative reduction in classification error was obtained for sustained speech when combining MFCC-GMM and nonlinear features, and 33% when fusing the nonlinear features with both sustained and continuous MFCC-GMM. Accuracy reached 88.5%, allowing the system to be used for early OSA detection. Tests showed that the nonlinear features and MFCCs are weakly correlated on sustained speech, but uncorrelated on continuous speech. Results also suggest the existence of nonlinear effects in OSA patients' voices, which should be found in continuous speech
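    As an illustration of the MFCC-GMM backbone the abstract refers to, the sketch below fits one Gaussian mixture model per class on frame-level MFCCs and decides by comparing log-likelihoods. The MFCC extraction via librosa, the file names, and the mixture size are assumptions for illustration only; the nonlinear-feature fusion that the paper contributes is not shown.

```python
# Generic MFCC-GMM classifier of the kind referred to above: one GMM per class
# fitted on frame-level MFCCs, decision by mean log-likelihood difference.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, coeffs)

def fit_gmm(wav_paths, n_components=16):
    frames = np.vstack([mfcc_frames(p) for p in wav_paths])
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(frames)

# Hypothetical file lists; replace with real OSA / control recordings.
gmm_osa = fit_gmm(["osa_01.wav", "osa_02.wav"])
gmm_ctl = fit_gmm(["ctl_01.wav", "ctl_02.wav"])

def classify(path):
    f = mfcc_frames(path)
    llr = gmm_osa.score(f) - gmm_ctl.score(f)   # mean log-likelihood difference
    return ("OSA" if llr > 0 else "control"), llr

print(classify("test_speaker.wav"))             # hypothetical test recording
```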

    Modern drowsiness detection techniques: a review

    According to recent statistics, drowsiness, rather than alcohol, is now responsible for one-quarter of all automobile accidents. As a result, many monitoring systems have been created to reduce and prevent such accidents. However, despite the large number of state-of-the-art drowsiness detection systems, it is not clear which one is the most appropriate. This paper discusses the following points: first, the many existing supervised detection techniques currently in use, grouped into four categories (behavioral, physiological, automobile-based, and hybrid); second, the supervised machine learning classifiers used for drowsiness detection, followed by a discussion of the advantages and disadvantages of each evaluated technique; and lastly, the recommendation of a new strategy for detecting drowsiness

    Obstructive sleep apnea severity classification using sleep breathing sounds

    Doctoral dissertation (PhD), Department of Transdisciplinary Studies, Graduate School of Convergence Science and Technology, Seoul National University, August 2017. Advisor: Kyogu Lee. Obstructive sleep apnea (OSA) is a common sleep disorder. It has a high prevalence and increases mortality as a risk factor for hypertension and stroke. Because sleep disorders occur during sleep, it is difficult for patients to perceive their own condition, and the actual diagnosis rate is low. Although a standard sleep study, polysomnography (PSG), exists, diagnosing sleep disorders remains difficult due to the complicated test procedure and the high medical cost burden. Therefore, there is an increasing demand for an effective and rational screening test that can determine whether a PSG is warranted. In this thesis, we conducted three studies to classify snoring sounds and OSA severity using only breathing sounds recorded during sleep, without additional biosensors. We first established that snoring sounds related to sleep disorders can be classified using features based on cyclostationary analysis. We then classified patients' OSA severity with features extracted from long-term sleep breathing sounds using temporal and cyclostationary analysis. Finally, partial sleep sound extraction and a feature learning process using a convolutional neural network (CNN) were applied to improve the efficiency and performance of the previous snoring sound and OSA severity classification tasks. The sleep breathing sound analysis method using a CNN showed superior classification accuracy of more than 80% (average area under the curve > 0.8) in the multi-class snoring sound and OSA severity classification tasks. The proposed analysis and classification method is expected to be used as a screening tool for improving the efficiency of PSG in future customized healthcare services. Table of contents: Chapter 1, Introduction; Chapter 2, Overview of Sleep Research using Sleep Breathing Sounds; Chapter 3, Multiple SRBD-related Snoring Sound Classification; Chapter 4, Patients' OSA Severity Classification; Chapter 5, Patient OSA Severity Prediction using Deep Learning Techniques; Chapter 6, Conclusions and Future Work
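    The following is a minimal sketch of a convolutional network over spectrogram patches of the kind such a feature-learning approach could use; the input size, layer sizes, and the four severity classes are illustrative assumptions, not the architecture from the thesis.

```python
# Minimal CNN over log-mel spectrogram patches for multi-class OSA-severity
# prediction; layer sizes and the four severity classes are illustrative only.
import torch
import torch.nn as nn

class SnoreCNN(nn.Module):
    def __init__(self, n_classes=4):           # e.g. normal / mild / moderate / severe
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):                       # x: (batch, 1, mel_bins, frames)
        return self.classifier(self.features(x).flatten(1))

model = SnoreCNN()
dummy = torch.randn(8, 1, 64, 128)              # 8 placeholder spectrogram patches
logits = model(dummy)
print(logits.shape)                             # torch.Size([8, 4])
```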

    Enabling human physiological sensing by leveraging intelligent head-worn wearable systems

    This thesis explores the challenges of enabling human physiological sensing by leveraging head-worn wearable computer systems. In particular, we want to answer a fundamental question: could we leverage head-worn wearables to enable accurate and socially acceptable solutions that improve human healthcare and prevent life-threatening conditions in our daily lives? To that end, we will study techniques that utilise the unique advantages of wearable computers to (1) facilitate new sensing capabilities to capture various biosignals from the brain, the eyes, facial muscles, sweat glands, and blood vessels, (2) address motion artefacts and environmental noise in real time with signal processing algorithms and hardware design techniques, and (3) enable long-term, high-fidelity biosignal monitoring with efficient on-chip intelligence and pattern-driven compressive sensing algorithms. We first demonstrate the ability to capture the activities of the user's brain, eyes, facial muscles, and sweat glands by proposing WAKE, a novel behind-the-ear biosignal sensing wearable. By studying the human anatomy in the ear area, we propose a wearable design that captures brain waves (EEG), eye movements (EOG), facial muscle contractions (EMG), and sweat gland activities (EDA) with a minimal number of sensors. Furthermore, we introduce a Three-fold Cascaded Amplifying (3CA) technique and signal processing algorithms to tame motion artefacts and environmental noise so that high-fidelity signals can be captured in real time. We devise a machine-learning model based on the captured signals to detect microsleep with a high temporal resolution. Second, we will discuss our work on developing an efficient Pattern-dRiven Compressive Sensing framework (PROS) to enable long-term biosignal monitoring on low-power wearables. The system introduces tiny on-chip pattern recognition primitives (TinyPR) and a novel pattern-driven compressive sensing technique (PDCS) that exploits the sparsity of biosignals. Together, they provide the ability to capture high-fidelity biosignals with an ultra-low power footprint. This development will unlock long-term healthcare applications on wearable computers, such as epileptic seizure monitoring and microsleep detection, which were previously impractical on energy- and resource-constrained wearable computers due to the limited battery lifetime, slow response rate, and inadequate biosignal quality. Finally, we will further explore the possibility of capturing the activities of a blood vessel (i.e., the superficial temporal artery) lying deep inside the user's ear using an ear-worn wearable computer. The captured optical pulse signals (PPG) are used to develop a frequent and comfortable blood pressure monitoring system called eBP. In contrast to existing devices, eBP introduces a novel in-ear wearable system design and algorithms that eliminate the need to block the blood flow inside the ear, alleviating the user's discomfort
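    For background on the sparsity assumption that pattern-driven compressive sensing exploits, here is a standard compressive-sensing round trip on a synthetic sparse signal: take a few random measurements and recover the signal with orthogonal matching pursuit. This is generic textbook machinery, not the PROS/PDCS or TinyPR implementation.

```python
# Standard compressive-sensing round trip (random measurements + orthogonal
# matching pursuit); a simplified stand-in for the sparsity assumption that
# pattern-driven compressive sensing builds on, not the PROS system itself.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
n, m, k = 256, 64, 8                         # signal length, measurements, sparsity

x = np.zeros(n)                              # k-sparse "biosignal" in some basis
x[rng.choice(n, k, replace=False)] = rng.normal(size=k)

Phi = rng.normal(size=(m, n)) / np.sqrt(m)   # random measurement matrix
y = Phi @ x                                  # compressed measurements (4x fewer samples)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(Phi, y)
x_hat = omp.coef_
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```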

    Bag-of-words representations for computer audition

    Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning, taking the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue is nevertheless that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms, as they require a static, fixed-length input. Moreover, even for dynamic classifiers, compressing the information of the LLDs over a temporal block by summarising them can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired by the bag-of-words method in natural language processing, which forms a histogram of the terms present in a document. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis. The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits while being able to preserve the advantages of functionals, such as data-independence. Furthermore, it is shown that both representations are complementary and that their fusion improves the performance of a machine listening system
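    A minimal illustration of the bag-of-audio-words idea described above: learn a codebook over frame-level LLDs with k-means, assign each frame of an utterance to its nearest codeword, and use the normalised histogram of codeword counts as a fixed-length representation. This is a toy sketch on synthetic frames, not the openXBOW toolkit.

```python
# Bag-of-audio-words in a nutshell: codebook over frame-level LLDs, nearest
# codeword per frame, utterance described by the normalised histogram of counts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
train_llds = rng.normal(size=(5000, 20))        # placeholder LLD frames (e.g. MFCCs)
codebook = KMeans(n_clusters=100, n_init=10, random_state=0).fit(train_llds)

def boaw(utterance_llds, codebook):
    words = codebook.predict(utterance_llds)                  # nearest codeword per frame
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                        # fixed-length, length-normalised

utterance = rng.normal(size=(312, 20))          # one variable-length utterance
x_boaw = boaw(utterance, codebook)
print(x_boaw.shape)                             # (100,) regardless of utterance length
```

    The resulting fixed-length vector can then be fed to any static classifier, or fused with statistical functionals as discussed above.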

    A Physiological Signal Processing System for Optimal Engagement and Attention Detection

    In today's fast-paced, high-tech and high-stress environment, with extended work hours, long to-do lists and neglected personal health, sleep deprivation has become common in modern culture. Coupled with these factors is the inherently repetitious and tedious nature of certain occupations and daily routines, which all adds up to undesirable fluctuations in individuals' cognitive attention and capacity. In certain critical professions, a momentary or prolonged lapse in attention can be catastrophic and sometimes deadly. This research proposes to develop a real-time monitoring system which uses fundamental physiological signals, such as the electrocardiogram (ECG), to analyze and predict the presence or lack of cognitive attention in individuals during task execution. The primary focus of this study is to identify the correlation between fluctuating levels of attention and their implications on the physiological parameters of the body. The system is designed using only those physiological signals that can be collected easily with small, wearable, portable and non-invasive monitors, and is thereby able to predict, well in advance, an individual's potential loss of attention and onset of sleepiness. Several advanced signal processing techniques have been implemented and investigated to derive multiple latent and informative features. These features are then applied to machine learning algorithms to produce classification models capable of differentiating between a person being attentive and not being attentive. Furthermore, electroencephalogram (EEG) signals are also analyzed and classified to serve as a benchmark for comparison with the ECG analysis. For the study, ECG and EEG signals of volunteer subjects are acquired in a controlled experiment. The experiment is designed to induce and sustain cognitive attention for a period of time, after which an attempt is made to reduce the subjects' cognitive attention. The data acquired during the experiment is decomposed and analyzed for feature extraction and classification. The presented results show that it is possible to detect the presence or lack of attention in individuals with fairly reasonable accuracy using just their ECG signal, especially in comparison with the analysis done on EEG signals. Ongoing work in this research includes other physiological signals such as galvanic skin response, heat flux and skin temperature, as well as video-based facial feature analysis
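    As a sketch of the feature-extraction-plus-classification pipeline described above, the code below derives a few standard time-domain heart-rate-variability features from R-peak times and trains a supervised classifier on them. The feature set, window length, and classifier choice are illustrative assumptions and do not reproduce the study's actual features or data.

```python
# Sketch of an ECG-based attention classifier: simple time-domain HRV features
# from R-peak times, fed to a supervised model. Features and data are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def hrv_features(r_peak_times_s):
    rr = np.diff(r_peak_times_s) * 1000.0            # R-R intervals in ms
    sdnn = rr.std(ddof=1)                            # overall variability
    rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))       # short-term variability
    pnn50 = np.mean(np.abs(np.diff(rr)) > 50) * 100  # % successive diffs > 50 ms
    return np.array([rr.mean(), sdnn, rmssd, pnn50])

rng = np.random.default_rng(3)
# Placeholder dataset: one feature vector per ECG window, label 1 = attentive.
X = np.vstack([hrv_features(np.cumsum(rng.uniform(0.6, 1.1, 80))) for _ in range(200)])
y = rng.integers(0, 2, size=200)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```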

    Adaptation of Speaker and Speech Recognition Methods for the Automatic Screening of Speech Disorders using Machine Learning

    This PhD thesis presents methods for exploiting the non-verbal communication of individuals suffering from specific diseases or health conditions, with the aim of screening for these conditions automatically. More specifically, we employ one of the pillars of non-verbal communication, paralanguage, to explore techniques that can be utilized to model the speech of subjects. Paralanguage is a non-lexical component of communication that relies on intonation, pitch, speaking rate, and other cues, and can be processed and analyzed in an automatic manner. This is the domain of Computational Paralinguistics, which can be defined as the study of modeling non-verbal latent patterns within the speech of a speaker by means of computational algorithms; these patterns go beyond the linguistic approach. By means of machine learning, we present models from distinct scenarios of both paralinguistics and pathological speech which are capable of automatically estimating the health status of a subject with respect to a given condition, such as Alzheimer's disease, Parkinson's disease, and clinical depression, among others
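    To illustrate how such paralinguistic cues can be turned into machine-readable descriptors, the sketch below summarises the pitch and energy contours of an utterance into a handful of functionals that a screening classifier could consume. The librosa-based extraction, the chosen functionals, and the file name are assumptions for illustration, not the thesis pipeline.

```python
# Toy paralinguistic front end: summarise pitch and energy contours of an
# utterance into a few utterance-level functionals. Descriptor set is illustrative.
import numpy as np
import librosa

def paralinguistic_functionals(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)  # pitch contour
    f0 = f0[~np.isnan(f0)]                                          # keep voiced frames
    rms = librosa.feature.rms(y=y)[0]                               # energy contour
    return np.array([
        f0.mean() if f0.size else 0.0,                              # mean pitch
        f0.std() if f0.size else 0.0,                               # intonation spread
        rms.mean(), rms.std(),                                      # energy level / variation
        np.count_nonzero(voiced_flag) / len(voiced_flag),           # rough voicing rate
    ])

# feats = paralinguistic_functionals("speaker_01.wav")  # hypothetical recording
```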

    A survey on perceived speaker traits: personality, likability, pathology, and the first challenge

    The INTERSPEECH 2012 Speaker Trait Challenge aimed at providing a unified test-bed for perceived speaker traits – the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state of the art in these three fields of research and describe the three sub-challenges in terms of the challenge conditions, the baseline results provided by the organisers, and a new openSMILE feature set, which has been used for computing the baselines and which has been provided to the participants. Furthermore, we summarise the approaches and the results presented by the participants to show the various techniques that are currently applied to solve these classification tasks
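    In the spirit of such challenge baselines, a compact recipe is sketched below: utterance-level openSMILE functionals fed into a linear SVM. The ComParE_2016 feature set from the opensmile Python package stands in for the actual 2012 challenge set, and the file paths and labels are hypothetical placeholders.

```python
# Baseline-style recipe: openSMILE functionals per utterance + linear SVM.
# ComParE_2016 is used as a stand-in feature set; file paths are hypothetical.
import numpy as np
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

train_files = ["train_001.wav", "train_002.wav"]   # hypothetical paths
train_labels = np.array([0, 1])                    # e.g. non-likable vs. likable

X_train = np.vstack([smile.process_file(f).to_numpy() for f in train_files])
clf = make_pipeline(StandardScaler(), LinearSVC(C=1e-3)).fit(X_train, train_labels)

X_test = smile.process_file("test_001.wav").to_numpy()
print(clf.predict(X_test))
```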