102 research outputs found

    Final Report to NSF of the Standards for Facial Animation Workshop

    Get PDF
    The human face is an important and complex communication channel. It is a very familiar and sensitive object of human perception. The facial animation field has increased greatly in the past few years as fast computer graphics workstations have made the modeling and real-time animation of hundreds of thousands of polygons affordable and almost commonplace. Many applications have been developed such as teleconferencing, surgery, information assistance systems, games, and entertainment. To solve these different problems, different approaches for both animation control and modeling have been developed

    Efficient speaker recognition for mobile devices

    Get PDF

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Subsidia: Tools and Resources for Speech Sciences

    Get PDF
    Este libro, resultado de la colaboraciĂłn de investigadores expertos en sus respectivas ĂĄreas, pretende ser una ayuda a la comunidad cientĂ­fica en tanto en cuanto recopila y describe una serie de materiales de gran utilidad para seguir avanzando en la investigaciĂł

    Models and analysis of vocal emissions for biomedical applications

    Get PDF
    This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years, and aims to stimulate contacts between specialists active in research and industrial developments, in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies

    Models and Analysis of Vocal Emissions for Biomedical Applications

    Get PDF
    The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) came into being in 1999 from the particularly felt need of sharing know-how, objectives and results between areas that until then seemed quite distinct such as bioengineering, medicine and singing. MAVEBA deals with all aspects concerning the study of the human voice with applications ranging from the newborn to the adult and elderly. Over the years the initial issues have grown and spread also in other fields of research such as occupational voice disorders, neurology, rehabilitation, image and video analysis. MAVEBA takes place every two years in Firenze, Italy. This edition celebrates twenty-two years of uninterrupted and successful research in the field of voice analysis

    Multi-parametric source-filter separation of speech and prosodic voice restoration

    Get PDF
    In this thesis, methods and models are developed and presented aiming at the estimation, restoration and transformation of the characteristics of human speech. During a first period of the thesis, a concept was developed that allows restoring prosodic voice features and reconstruct more natural sounding speech from pathological voices using a multi-resolution approach. Inspired from observations with respect to this approach, the necessity of a novel method for the separation of speech into voice source and articulation components emerged in order to improve the perceptive quality of the restored speech signal. This work subsequently represents the main part of this work and therefore is presented first in this thesis. The proposed method is evaluated on synthetic, physically modelled, healthy and pathological speech. A robust, separate representation of source and filter characteristics has applications in areas that go far beyond the reconstruction of alaryngeal speech. It is potentially useful for efficient speech coding, voice biometrics, emotional speech synthesis, remote and/or non-invasive voice disorder diagnosis, etc. A key aspect of the voice restoration method is the reliable separation of the speech signal into voice source and articulation for it is mostly the voice source that requires replacement or enhancement in alaryngeal speech. Observations during the evaluation of above method highlighted that this separation is insufficient with currently known methods. Therefore, the main part of this thesis is concerned with the modelling of voice and vocal tract and the estimation of the respective model parameters. Most methods for joint source filter estimation known today represent a compromise between model complexity, estimation feasibility and estimation efficiency. Typically, single-parametric models are used to represent the source for the sake of tractable optimization or multi-parametric models are estimated using inefficient grid searches over the entire parameter space. The novel method presented in this work proposes advances in the direction of efficiently estimating and fitting multi-parametric source and filter models to healthy and pathological speech signals, resulting in a more reliable estimation of voice source and especially vocal tract coefficients. In particular, the proposed method is exhibits a largely reduced bias in the estimated formant frequencies and bandwidths over a large variety of experimental conditions such as environmental noise, glottal jitter, fundamental frequency, voice types and glottal noise. The methods appears to be especially robust to environmental noise and improves the separation of deterministic voice source components from the articulation. Alaryngeal speakers often have great difficulty at producing intelligible, not to mention prosodic, speech. Despite great efforts and advances in surgical and rehabilitative techniques, currently known methods, devices and modes of speech rehabilitation leave pathological speakers with a lack in the ability to control key aspects of their voice. The proposed multiresolution approach presented at the end of this thesis provides alaryngeal speakers an intuitive manner to increase prosodic features in their speech by reconstructing a more intelligible, more natural and more prosodic voice. The proposed method is entirely non-invasive. Key prosodic cues are reconstructed and enhanced at different temporal scales by inducing additional volatility estimated from other, still intact, speech features. The restored voice source is thus controllable in an intuitive way by the alaryngeal speaker. Despite the above mentioned advantages there is also a weak point of the proposed joint source-filter estimation method to be mentioned. The proposed method exhibits a susceptibility to modelling errors of the glottal source. On the other hand, the proposed estimation framework appears to be well suited for future research on exactly this topic. A logical continuation of this work is the leverage the efficiency and reliability of the proposed method for the development of new, more accurate glottal source models

    Modeling of Polish Intonation for Statistical-Parametric Speech Synthesis

    Get PDF
    WydziaƂ NeofilologiiBieĆŒÄ…ca praca prezentuje prĂłbę budowy neurobiologicznie umotywowanego modelu mapowaƄ pomiędzy wysokopoziomowymi dyskretnymi kategoriami lingwistycznymi a ciągƂym sygnaƂem częstotliwoƛci podstawowej w polskiej neutralnej mowie czytanej, w oparciu o konwolucyjne sieci neuronowe. Po krĂłtkim wprowadzeniu w problem badawczy w kontekƛcie intonacji, syntezy mowy oraz luki pomiędzy fonetyką a fonologią, praca przedstawia opis uczenia modelu na podstawie specjalnego korpusu mowy oraz ewaluację naturalnoƛci konturu F0 generowanego przez wyuczony model za pomocą eksperymentĂłw percepcyjnych typu ABX oraz MOS przy uĆŒyciu specjalnie w tym celu zbudowanego resyntezatora Neural Source Filter. Następnie, prezentowane są wyniki eksploracji fonologiczno-fonetycznych mapowaƄ wyuczonych przez model. W tym celu wykorzystana zostaƂa jedna z tzw. metod wyjaƛniających dla sztucznej inteligencji – Layer-wise Relevance Propagation. W pracy przedstawione zostaƂy wyniki powstaƂej na tej podstawie obszernej analizy iloƛciowej istotnoƛci dla konturu częstotliwoƛci podstawowej kaĆŒdej z 1297 specjalnie wygenerowanych lingwistycznych kategorii wejƛciowych modelu oraz ich wielorakich grupowaƄ na rĂłĆŒnorodnych poziomach abstrakcji. Pracę koƄczy dogƂębna analiza oraz interpretacja uzyskanych wynikĂłw oraz rozwaĆŒania na temat mocnych oraz sƂabych stron zastosowanych metod, a takĆŒe lista proponowanych usprawnieƄ.This work presents an attempt to build a neurobiologically inspired Convolutional Neural Network-based model of the mappings between discrete high-level linguistic categories into a continuous signal of fundamental frequency in Polish neutral read speech. After a brief introduction of the current research problem in the context of intonation, speech synthesis and the phonetic-phonology gap, the work goes on to describe the training of the model on a special speech corpus, and an evaluation of the naturalness of the F0 contour produced by the trained model through ABX and MOS perception experiments conducted with help of a specially built Neural Source Filter resynthesizer. Finally, an in-depth exploration of the phonology-to-phonetics mappings learned by the model is presented; the Layer-wise Relevance Propagation explainability method was used to perform an extensive quantitative analysis of the relevance of 1297 specially engineered linguistic input features and their groupings at various levels of abstraction for the specific contours of the fundamental frequency. The work ends with an in-depth interpretation of these results and a discussion of the advantages and disadvantages of the current method, and lists a number of potential future improvements.Badania przedstawione w pracy zostaƂy cz˛eÂŽsciowo zrealizowane w ramach grantu badawczego Harmonia nr UMO-2014/14/M/HS2/00631 przyznanego przez Narodowe Centrum Nauki
    • 

    corecore