
    Automatic age detection in normal and pathological voice

    Systems that automatically detect voice pathologies are usually trained with recordings belonging to populations of all ages. However, such an approach might be inadequate because of the acoustic variations in the voice caused by the natural aging process. On top of that, elderly voices present some perturbations in quality similar to those related to voice disorders, which make the detection of pathologies more troublesome. With this in mind, the study of methodologies that automatically incorporate information about speakers’ age, aiming to simplify the detection of voice disorders, is of interest. In this respect, the present paper introduces an age detector trained with normal and pathological voices, constituting a first step towards the study of age-dependent pathology detectors. The proposed system employs sustained vowels from the Saarbrucken database, from which two age groups are examined: adults and elders. Mel-frequency cepstral coefficients are used for characterization and Gaussian mixture models for classification. In addition, fusion of vowels at the score level is considered to improve detection performance. Results suggest that age can be effectively recognized from normal and pathological voices when sustained vowels are used as acoustic material, opening up possibilities for the design of automatic age-dependent voice pathology detection systems.
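
    As a rough illustration of the pipeline this abstract describes, the sketch below trains one Gaussian mixture model per age group on MFCC frames and fuses per-vowel scores. It is a minimal sketch using librosa and scikit-learn, not the authors' implementation; the mixture count, file-path handling, and group labels are assumptions.

```python
# Minimal sketch (not the paper's code): MFCC + GMM age detection with
# score-level fusion across sustained vowels. Hyperparameters are illustrative.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, n_mfcc=13):
    """Return an (n_frames, n_mfcc) matrix of MFCCs for one recording."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_group_model(paths, n_components=16):
    """Fit one GMM on the pooled MFCC frames of a group (e.g. 'adult' or 'elderly')."""
    frames = np.vstack([mfcc_frames(p) for p in paths])
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(frames)

def fused_group_decision(vowel_paths, models):
    """Score-level fusion: sum average log-likelihoods over the vowels /a/, /i/, /u/."""
    scores = {g: 0.0 for g in models}
    for path in vowel_paths:                      # one recording per sustained vowel
        x = mfcc_frames(path)
        for group, gmm in models.items():
            scores[group] += gmm.score(x)         # mean log-likelihood per frame
    return max(scores, key=scores.get), scores
```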

    Audio-Visual Speaker Verification via Joint Cross-Attention

    Speaker verification has been widely explored using speech signals and has shown significant improvement with deep models. Recently, there has been a surge in exploring faces and voices together, as they can offer more complementary and comprehensive information than relying on the single modality of speech. Though current methods for the fusion of faces and voices have shown improvement over individual face or voice modalities, the potential of audio-visual fusion is not fully explored for speaker verification. Most existing methods based on audio-visual fusion rely either on score-level fusion or on simple feature concatenation. In this work, we explore cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention weights based on the correlation between the joint feature representation and the individual feature representations, in order to effectively capture both intra-modal and inter-modal relationships between faces and voices. We show that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The performance of the proposed approach has been evaluated on the Voxceleb1 dataset. Results show that the proposed approach significantly outperforms state-of-the-art audio-visual fusion methods for speaker verification.
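
    A minimal PyTorch sketch of the kind of joint cross-attention the abstract describes, where attention weights come from the correlation between a joint (concatenated) representation and each modality. The dimensions, projection layers, and module name are illustrative assumptions, not the authors' architecture.

```python
# Sketch of joint cross-attention between face and voice embeddings; the
# attention weights are derived from correlations with a joint representation.
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_audio = nn.Linear(2 * dim, dim, bias=False)   # joint -> audio space
        self.w_video = nn.Linear(2 * dim, dim, bias=False)   # joint -> video space

    def forward(self, audio, video):
        # audio, video: (batch, time, dim) frame-level embeddings
        joint = torch.cat([audio, video], dim=-1)                        # (B, T, 2*dim)
        # Correlation of each modality with the projected joint representation.
        corr_a = torch.bmm(audio, self.w_audio(joint).transpose(1, 2))   # (B, T, T)
        corr_v = torch.bmm(video, self.w_video(joint).transpose(1, 2))   # (B, T, T)
        att_a = torch.softmax(corr_a, dim=-1)
        att_v = torch.softmax(corr_v, dim=-1)
        audio_att = torch.bmm(att_a, audio)                  # attended audio features
        video_att = torch.bmm(att_v, video)                  # attended video features
        return torch.cat([audio_att, video_att], dim=-1)     # fused representation
```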

    Automated voice pathology discrimination from audio recordings benefits from phonetic analysis of continuous speech

    In this paper we evaluate the hypothesis that automated methods for diagnosis of voice disorders from speech recordings would benefit from contextual information found in continuous speech. Rather than basing a diagnosis on how disorders affect the average acoustic properties of the speech signal, the idea is to exploit the possibility that different disorders will cause different acoustic changes within different phonetic contexts. Any differences in the pattern of effects across contexts would then provide additional information for discrimination of pathologies. We evaluate this approach using two complementary studies: the first uses a short phrase which is automatically annotated using a phonetic transcription, the second uses a long reading passage which is automatically annotated from text. The first study uses a single sentence recorded from 597 speakers in the Saarbrucken Voice Database to discriminate structural from neurogenic disorders. The results show that discrimination performance for these broad pathology classes improves from 59% to 67% unweighted average recall when classifiers are trained for each phone label and the results are fused. Although the phonetic contexts improved discrimination, the overall sensitivity and specificity of the method seem insufficient for clinical application. We hypothesise that this is because of the limited contexts in the speech audio and the heterogeneous nature of the disorders. In the second study we address these issues by processing recordings of a long reading passage obtained from clinical recordings of 60 speakers with either spasmodic dysphonia or vocal fold paralysis. We show that discrimination performance increases from 80% to 87% unweighted average recall if classifiers are trained for each phone-labelled region and the predictions are fused. We also show that the sensitivity and specificity of a diagnostic test with this performance are similar to those of other diagnostic procedures in clinical use. In conclusion, the studies confirm that exploiting contextual differences in the way disorders affect speech improves automated diagnostic performance, and that automated methods for phonetic annotation of reading passages are robust enough to extract useful diagnostic information.
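
    A toy sketch of the per-phone training and fusion idea, assuming each recording has already been segmented into phone-labelled regions with one fixed-length feature vector per region. The classifier choice (logistic regression) and the probability-averaging rule are placeholders, not the paper's method.

```python
# Illustrative per-phone classifiers with fused predictions.
import numpy as np
from collections import defaultdict
from sklearn.linear_model import LogisticRegression

def train_per_phone(segments):
    """segments: list of (phone_label, feature_vector, pathology_class)."""
    by_phone = defaultdict(lambda: ([], []))
    for phone, feats, label in segments:
        by_phone[phone][0].append(feats)
        by_phone[phone][1].append(label)
    models = {}
    for phone, (X, y) in by_phone.items():
        if len(set(y)) > 1:                          # need both classes to train
            models[phone] = LogisticRegression(max_iter=1000).fit(np.array(X), y)
    return models

def classify_recording(test_segments, models):
    """test_segments: list of (phone_label, feature_vector). Average the
    per-phone posteriors, then pick the most probable pathology class."""
    probs = [models[p].predict_proba([f])[0] for p, f in test_segments if p in models]
    fused = np.mean(probs, axis=0)
    classes = next(iter(models.values())).classes_
    return classes[int(np.argmax(fused))], fused
```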

    Analyzing training dependencies and posterior fusion in discriminant classification of apnoea patients based on sustained and connected speech

    We present a novel approach, using both sustained vowels and connected speech, to detect obstructive sleep apnea (OSA) cases within a homogeneous group of speakers. The proposed scheme is based on state-of-the-art GMM-based classifiers and specifically acknowledges the way in which acoustic models are trained on standard databases, as well as the complexity of the resulting models and their adaptation to specific data. Our experimental database contains a suitable number of utterances and sustained speech from healthy (i.e., control) and OSA Spanish speakers. Finally, a 25.1% relative reduction in classification error is achieved when fusing continuous and sustained speech classifiers. Index Terms: obstructive sleep apnea (OSA), Gaussian mixture models (GMMs), background model (BM), classifier fusion.
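
    A compact sketch of the score-level fusion step, assuming each subsystem is a pair of GMMs (OSA vs. control) scored as a log-likelihood ratio. The fusion weight and the zero decision threshold are illustrative assumptions, not the values used in the paper.

```python
# Illustrative fusion of sustained-vowel and connected-speech GMM classifiers.
import numpy as np

def llr_score(gmm_osa, gmm_control, features):
    """Log-likelihood ratio of one recording under the two GMMs (sklearn API)."""
    return gmm_osa.score(features) - gmm_control.score(features)

def fused_decision(sustained_feats, connected_feats, models, alpha=0.5):
    """models: dict with 'sustained' and 'connected' entries, each an (OSA, control) pair."""
    s_sus = llr_score(*models["sustained"], sustained_feats)
    s_con = llr_score(*models["connected"], connected_feats)
    fused = alpha * s_sus + (1.0 - alpha) * s_con   # weighted score-level fusion
    return ("OSA" if fused > 0.0 else "control"), fused
```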

    Cepstral analysis and the Hilbert-Huang transform for automatic detection of Parkinson's disease

    Most patients with Parkinson’s Disease (PD) develop speech deficits, including reduced sonority, altered articulation, and abnormal prosody. This article presents a methodology to automatically classify patients with PD and Healthy Control (HC) subjects. In this study, the Hilbert-Huang Transform (HHT) and Mel-Frequency Cepstral Coefficients (MFCCs) were used to model modulated phonations (changing the tone from low to high and vice versa) of the vowels /a/, /i/, and /u/. The HHT was used to extract the first two formants from the audio signals with the aim of modeling the stability of the tongue while the speakers were producing modulated vowels. Kruskal-Wallis statistical tests were used to eliminate redundant and non-relevant features in order to improve classification accuracy. PD patients and HC subjects were automatically classified using a support vector machine with a radial basis function kernel (RBF-SVM). The results show that the proposed approach allows an automatic discrimination between PD and HC subjects with accuracies of up to 75% for women and 73% for men.
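
    The feature-selection and classification stage could look roughly like the following, assuming a precomputed feature matrix; the significance threshold and SVM hyperparameters are assumptions, and scipy/scikit-learn stand in for whatever tooling the authors used.

```python
# Sketch: Kruskal-Wallis feature filtering followed by an RBF-kernel SVM.
import numpy as np
from scipy.stats import kruskal
from sklearn.svm import SVC

def kruskal_wallis_mask(X, y, alpha=0.05):
    """Keep features whose distributions differ significantly between classes."""
    keep = []
    for j in range(X.shape[1]):
        groups = [X[y == c, j] for c in np.unique(y)]
        _, p = kruskal(*groups)
        keep.append(p < alpha)
    return np.array(keep)

def train_pd_classifier(X, y):
    """X: (n_subjects, n_features) feature matrix, y: PD / HC labels."""
    mask = kruskal_wallis_mask(X, y)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[:, mask], y)
    return clf, mask
```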

    Intelligibility Evaluation of Pathological Speech through Multigranularity Feature Extraction and Optimization

    Pathological speech usually refers to speech distortion resulting from illness or other biological insults. The assessment of pathological speech plays an important role in assisting experts, but automatic evaluation of speech intelligibility is difficult because such speech is usually nonstationary and mutational. In this paper, we carry out an independent approach to feature extraction and reduction, and we describe a multigranularity combined feature scheme which is optimized by a hierarchical visual method. A novel method of generating the feature set based on the S-transform and chaotic analysis is proposed. The set comprises basic acoustic features (BAFS, 430 dimensions), local spectral characteristics (MSCC, 84 Mel S-transform cepstrum coefficients), and chaotic features (12 dimensions). Finally, radar charts and the F-score are used to optimize the features through hierarchical visual fusion. The feature set is reduced from 526 to 96 dimensions on the NKI-CCRT corpus and to 104 dimensions on the SVD corpus. The experimental results show that the new features, classified with a support vector machine (SVM), give the best performance, with a recognition rate of 84.4% on the NKI-CCRT corpus and 78.7% on the SVD corpus. The proposed method is thus shown to be effective and reliable for pathological speech intelligibility evaluation.
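
    As a hedged illustration of F-score-based feature reduction before SVM classification, the sketch below uses a standard two-class F-score (a ratio of between-class to within-class variance per feature) and keeps the top k features. The value k=96 mirrors the dimensionality reported for the NKI-CCRT corpus but is otherwise an assumption, as are the binary 0/1 labels and the SVM settings.

```python
# Sketch: rank features by a two-class F-score, keep the best k, train an SVM.
import numpy as np
from sklearn.svm import SVC

def f_scores(X, y):
    """Two-class F-score per feature (y in {0, 1}); larger means more discriminative."""
    pos, neg = X[y == 1], X[y == 0]
    mu, mu_p, mu_n = X.mean(0), pos.mean(0), neg.mean(0)
    num = (mu_p - mu) ** 2 + (mu_n - mu) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1) + 1e-12
    return num / den

def select_and_train(X, y, k=96):
    idx = np.argsort(f_scores(X, y))[::-1][:k]      # indices of the k best features
    clf = SVC(kernel="rbf", gamma="scale").fit(X[:, idx], y)
    return clf, idx
```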

    The Oneiric Reality of Electronic Scents

    This paper investigates the ‘oneiric’ dimension of scent by suggesting a new design process that can be worn as a fashion accessory or integrated into textile technologies, to subtly alter reality and go beyond our senses. It fuses wearable ‘electronic scent’ delivery systems with pioneering biotechnologies as a ground-breaking ‘science fashion’ enabler. The purpose is to enhance wellbeing by reaching a day-dream state of being through the sense of smell. The sense of smell (or olfaction) is a chemical sense and part of the limbic system, which regulates emotion and memory within the brain. The power of scent makes content extremely compelling by offering a heightened sense of reality which is intensified by emotions such as joy, anger and fear. Scent helps us appreciate all the senses as we embark on a sensory journey unlike any other; it enhances mood, keeps us in the moment, diverts us from distractions, reduces boredom and encourages creativity. This paper highlights the importance of smell, the forgotten sense, and also identifies how we as humans have grown to underuse our senses. It endeavours to show how the reinvention of our sensory faculties is possible through advances in biotechnology. It introduces the new ‘data senses’ as a wearable sensory platform that triggers and fine-tunes the senses with fragrances. It puts forward a new design process that is currently being developed in clothing elements, jewellery and textile technologies, offering a new method to deliver scent electronically and intelligently in fashion and everyday consumer products. It creates a personal ‘scent wave’ around the wearer, to allow the mind to wander, to give a deeper sense of life or ‘lived reality’ (versus fantasy), a newfound satisfaction and confidence, and to reach new heights of creativity. By combining biology with wearable technologies, we propose a biotechnological solution that can be translated into sensory fashion elements. This is a new trend in 21st-century ‘data sensing’, based on holographic biosensors that sense the human condition, aromachology (the science of the effect of fragrance on behaviour), colour therapy, and smart polymer science. The use of biosensors in the world of fashion and textiles enables us to act on visual cues or detect scent signals and rising stress levels, putting immediate information to hand. An ‘oneiric’ mood is triggered by a spectrum of scents encased in a micro-computerised ‘scent cell’ and integrated into clothing elements or jewellery. When we inhale an unexpected scent, it takes us by surprise; the power of fragrance fills us with pleasurable ripples of multi-sensations and dream-like qualities. The aromas create a near trance-like experience that induces a daydream state of (immediate) satisfaction, or a ‘revived reality’, in our personal scent bubble. The products and jewellery items were copyrighted and designed by Slim Barrett, and the technology input came from EG Technology and Epigem.

    The Neurocognition of Prosody

    Prosody is one of the most undervalued components of language, despite the manifold purposes it fulfills. It can, for instance, help assign the correct meaning to compounds such as “white house” (linguistic function), or help a listener understand how a speaker feels (emotional function). However, brain-based models that take into account the role prosody plays in dynamic speech comprehension are still rare. This is probably because it has proven difficult to fully characterize the neurocognitive architecture underlying prosody. This review discusses clinical and neuroscientific evidence regarding both linguistic and emotional prosody. It will become apparent that prosody processing is a multistage operation and that its temporally and functionally distinct processing steps are anchored in a functionally differentiated brain network.

    Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model

    The objective of this work is to extract a target speaker's voice from a mixture of voices using visual cues. Existing works on audio-visual speech separation have demonstrated promising intelligibility, but maintaining naturalness remains a challenge. To address this issue, we propose AVDiffuSS, an audio-visual speech separation model based on a diffusion mechanism known for its capability to generate natural samples. For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism. This mechanism is specifically tailored for the speech domain to integrate the phonetic information from audio-visual correspondence into speech generation. In this way, the fusion process maintains the high temporal resolution of the features without excessive computational requirements. We demonstrate that the proposed framework achieves state-of-the-art results on two benchmarks, VoxCeleb2 and LRS3, producing speech with notably better naturalness. Project page with demo: https://mm.kaist.ac.kr/projects/avdiffuss
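
    A minimal sketch of cross-attention feature fusion in which audio frames query visual features, so the fused output keeps the audio stream's temporal resolution. This is not the AVDiffuSS implementation; the embedding size, head count, and residual design are assumptions.

```python
# Illustrative audio-visual cross-attention fusion preserving audio frame rate.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, video):
        # audio: (B, T_audio, dim) at full temporal resolution
        # video: (B, T_video, dim) lip/face features at a lower frame rate
        attended, _ = self.attn(query=audio, key=video, value=video)
        return self.norm(audio + attended)          # residual fusion, (B, T_audio, dim)

# Usage with random tensors standing in for encoder outputs:
fusion = AudioVisualFusion()
fused = fusion(torch.randn(2, 400, 256), torch.randn(2, 100, 256))
```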