15 research outputs found

    Structured GMM Based on Unsupervised Clustering for Recognizing Adult and Child Speech

    Speaker variability is a well-known problem of state-of-the-art Automatic Speech Recognition (ASR) systems. In particular, handling children's speech is challenging because of substantial differences in the pronunciation of speech units between adult and child speakers. To build accurate ASR systems for all types of speakers, Hidden Markov Models (HMMs) with Gaussian mixture densities have been used intensively in combination with model adaptation techniques. This paper compares different ways to improve the recognition of children's speech and describes a novel approach relying on a class-structured Gaussian Mixture Model (GMM). A common solution for reducing speaker variability relies on gender and age adaptation. First, it is proposed to replace gender and age by unsupervised clustering; the resulting speaker classes are used for adaptation of the conventional HMM. Second, the speaker classes are used for initializing a structured GMM, in which the components of the Gaussian densities are structured with respect to the speaker classes. In a first approach, the mixture weights of the structured GMM are made dependent on the speaker class. In a second approach, the mixture weights are replaced by explicit dependencies between Gaussian components of mixture densities (as in stranded GMMs, but here the GMMs are class-structured). The different approaches are evaluated and compared on the TIDIGITS task. The best improvement is achieved when the structured GMM is combined with feature adaptation.
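
As a rough illustration of the first approach mentioned in this abstract (mixture weights that depend on the speaker class), the following minimal sketch evaluates a one-dimensional mixture whose Gaussian components are shared across classes while each class has its own weights. It is not the authors' implementation: the function names, the class labels `class_a` and `class_b`, and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def class_structured_gmm_density(x, means, variances, class_weights, speaker_class):
    """p(x | class) = sum_k w[class][k] * N(x; mean_k, var_k).

    The Gaussian components are shared by all classes; only the
    mixture weights depend on the speaker class.
    """
    weights = class_weights[speaker_class]
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical toy values: two shared components, two speaker classes.
means = [0.0, 3.0]
variances = [1.0, 1.0]
class_weights = {
    "class_a": [0.9, 0.1],  # puts most weight on component 0
    "class_b": [0.1, 0.9],  # puts most weight on component 1
}

# The same observation is scored differently under each speaker class.
density_a = class_structured_gmm_density(0.0, means, variances, class_weights, "class_a")
density_b = class_structured_gmm_density(0.0, means, variances, class_weights, "class_b")
```

In this toy setup an observation near the first component is more likely under `class_a` than under `class_b`, which is the effect class-dependent weights are meant to capture.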

    A comparative review of dynamic neural networks and hidden Markov model methods for mobile on-device speech recognition

    The adoption of high-accuracy speech recognition algorithms without an effective evaluation of their impact on the target computational resource is impractical for mobile and embedded systems. In this paper, techniques are adopted to minimise the computational resource required for an effective mobile-based speech recognition system. A Dynamic Multi-Layer Perceptron speech recognition technique, capable of running in real time on a state-of-the-art mobile device, is introduced. Even though a conventional hidden Markov model slightly outperformed our approach when applied to the same dataset, its processing time is much higher. The Dynamic Multi-Layer Perceptron presented here has an accuracy level of 96.94% and runs significantly faster than similar techniques.

    Speech Processing and Prosody

    The prosody of the speech signal conveys information beyond the linguistic content of the message: prosody structures the utterance and also carries information on the speaker's attitude and emotion. The prosodic features are the duration of sounds, energy, and fundamental frequency. However, their automatic computation and usage are not straightforward. Sound duration features are usually extracted from speech recognition results or from a forced speech-text alignment. Although the resulting segmentation is usually acceptable on clean native speech data, performance degrades on noisy or non-native speech. Many algorithms have been developed for computing the fundamental frequency; they perform rather well on clean speech but, again, performance degrades in noisy conditions. However, in some applications, for example computer-assisted language learning, the reliability of the prosodic features is critical; indeed, the quality of the diagnosis of the learner's pronunciation depends heavily on the precision and reliability of the estimated prosodic parameters. The paper considers the computation of prosodic features, shows the limitations of automatic approaches, and discusses the problem of computing confidence measures on such features. It then discusses the role of prosodic features and how they can be handled for automatic processing in tasks such as the detection of discourse particles, the characterization of emotions, and the classification of sentence modalities, as well as in computer-assisted language learning and expressive speech synthesis.
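
To make two of the prosodic features named in this abstract concrete, the sketch below computes frame energy and a simple autocorrelation-based estimate of the fundamental frequency. The abstract does not say which algorithms the paper uses, so the autocorrelation method, the function names, the search range of 50 to 400 Hz, and the synthetic test tone are all assumptions made for illustration. As the abstract notes, estimators of this kind tend to degrade on noisy speech.

```python
import math

def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def estimate_f0_autocorr(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate F0 as the lag that maximizes the raw autocorrelation.

    Only lags corresponding to the range [f0_min, f0_max] are searched.
    Returns None if no lag has a positive autocorrelation.
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Raw (biased) autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# Synthetic, noise-free test frame: a 100 Hz sine sampled at 8 kHz (50 ms).
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate) for n in range(400)]

energy = frame_energy(frame)
f0 = estimate_f0_autocorr(frame, sample_rate)
```

On real recordings a voicing decision and some smoothing across frames would normally be needed as well; neither is shown here.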

    Vocal Forgery in Forensic Sciences


    Measurable Voice Changes During the Treatment of Voice Disorders (Měřitelné změny hlasu při léčbě poruch hlasu)

    The main purpose of this paper is to show how voice differences can be identified in people whose voice has been affected by a voice disorder. An introduction to common diseases of the vocal cords and larynx is followed by a chapter on standard treatment techniques. Even when surgery goes well, voice production can still be affected. Doctors are typically able to measure only a limited number of parameters characterizing the voice. More detailed analysis of subjects at predefined time intervals should lead to more specific results and may prove more efficient. This article presents a different approach based on voice parameterization and analysis. It requires only recordings of the analyzed subjects (before surgery, soon after it, and then, for example, 2 months later). These recordings are processed using common audio processing methods, and the required variables are extracted and saved in the form of so-called feature vectors. Some features are expected to change as a result of treatment. Some of the methods used resemble ordinary techniques, but they make it possible to measure and identify more variables describing the voice. Diagnostic experience can be supplemented by our software, which visualizes many parameters. The final decision, however, remains with the doctor.

    Automatic speech recognition and speech variability: A review

    Major progress is being recorded regularly on both the technology and exploitation of automatic speech recognition (ASR) and spoken language systems. However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances. This is related to several factors, such as sensitivity to the environment (background noise) or the weak representation of grammatical and semantic knowledge. Current research is also emphasizing deficiencies in dealing with the variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes use by specific populations. Also, some applications, like directory assistance, particularly stress the core recognition technology because of the very large active vocabulary (application perplexity). There are many factors affecting speech realization: regional, sociolinguistic, or related to the environment or the speaker herself. These create a wide range of variations that may not be modeled correctly (speaker, gender, speaking rate, vocal effort, regional accent, speaking style, non-stationarity, etc.), especially when resources for system training are scarce. This paper outlines current advances related to these topics.