14 research outputs found

    Study of Jacobian Normalization for VTLN

    The divergence between the theory and practice of vocal tract length normalization (VTLN) is addressed, with particular emphasis on the role of the Jacobian determinant. VTLN is placed in a Bayesian setting, which introduces the concept of a prior on the warping factor. The form of the prior, together with acoustic scaling and numerical conditioning, is then discussed and evaluated. It is concluded that the Jacobian determinant is important in VTLN, especially for the high-dimensional features used in HMM-based speech synthesis, and that the difficulties normally associated with the Jacobian determinant can be attributed to the prior and to scaling.
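The Bayesian formulation described above can be sketched as a single objective (the notation here is assumed for illustration, not taken from the paper): the warping factor maximizes the warped-feature likelihood plus a Jacobian term and a prior term,

```latex
\hat{\alpha} \;=\; \operatorname*{arg\,max}_{\alpha}\;
  \log p\!\left(A_{\alpha}\,c \mid \lambda\right)
  \;+\; \log \left|A_{\alpha}\right|
  \;+\; \log p(\alpha)
```

where \(c\) is a cepstral feature vector, \(A_{\alpha}\) the linear cepstral-domain transform for warping factor \(\alpha\), \(\left|A_{\alpha}\right|\) its Jacobian determinant, \(\lambda\) the acoustic model, and \(p(\alpha)\) the prior on the warping factor. Dropping the \(\log\left|A_{\alpha}\right|\) term makes likelihoods of different \(\alpha\) incomparable, which is the source of the theory/practice divergence the paper examines.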

    VTLN Adaptation for Statistical Speech Synthesis

    The advent of statistical speech synthesis has enabled the unification of the basic techniques used in speech synthesis and recognition. Adaptation techniques that have been used successfully in recognition systems can now be applied to synthesis systems to improve the quality of the synthesized speech. This paper explores the application of vocal tract length normalization (VTLN) to synthesis. VTLN-based adaptation requires the estimation of a single warping factor, which can be estimated accurately from very little adaptation data and gives additive improvements over CMLLR adaptation. The challenge of estimating accurate warping factors from higher-order features is solved by initializing the warping-factor estimation with values calculated from lower-order features.
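Because VTLN reduces adaptation to a single scalar, the estimation step can be illustrated with a plain grid search. The sketch below is a toy: `warp_features` is a stand-in for the real cepstral-domain warp, and the diagonal-Gaussian scorer stands in for the acoustic model.

```python
import numpy as np

def warp_features(feats, alpha):
    # Toy stand-in for the VTLN warp: a real system applies a
    # frequency-dependent linear transform to each cepstral vector.
    return feats * (1.0 + alpha)

def gaussian_loglik(feats, mean, var):
    # Diagonal-Gaussian log-likelihood, summed over all frames.
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (feats - mean) ** 2 / var)))

def estimate_warping_factor(feats, mean, var,
                            grid=np.arange(-0.10, 0.11, 0.02)):
    # Score every candidate alpha under the acoustic model and keep the
    # maximum-likelihood one -- a one-parameter search, which is why so
    # little adaptation data suffices.
    scores = {float(a): gaussian_loglik(warp_features(feats, a), mean, var)
              for a in grid}
    return max(scores, key=scores.get)
```

A coarse grid like this is often followed by refinement (e.g. EM or a finer grid) around the winning value.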

    CHARACTER-LEVEL INTERACTIONS IN MULTIMODAL COMPUTER-ASSISTED TRANSCRIPTION OF TEXT IMAGES

    Handwritten text recognition (HTR) systems do not achieve acceptable results in unconstrained applications. It is therefore convenient to use a system that allows the user to cooperate with it, in the most comfortable way, to generate a correct transcription. In this paper, multimodal interaction at the character level is studied. Martín-Albo Simón, D. (2011). CHARACTER-LEVEL INTERACTIONS IN MULTIMODAL COMPUTER-ASSISTED TRANSCRIPTION OF TEXT IMAGES. http://hdl.handle.net/10251/11313

    Adaptation of children’s speech with limited data based on formant-like peak alignment

    Automatic recognition of children's speech using acoustic models trained on adults' speech performs poorly due to differences in speech acoustics. These acoustical differences are a consequence of children having shorter vocal tracts and smaller vocal cords than adults. Hence, speaker adaptation needs to be performed. However, in real-world applications the amount of adaptation data available may be less than what common speaker adaptation techniques need to yield reasonable performance. In this paper, we first study, in the discrete frequency domain, the relationship between frequency warping in the front-end and the corresponding transformations in the back-end. Three common feature extraction schemes are investigated and the linearity of their back-end transformations is discussed. In particular, we show that under certain approximations, frequency warping of MFCC features with Mel-warped triangular filter banks equals a linear transformation in the cepstral space. Based on that linear transformation, a formant-like peak alignment algorithm is proposed to adapt adult acoustic models to children's speech. The peaks are estimated by Gaussian mixtures using the Expectation-Maximization (EM) algorithm.
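The central approximation — that a front-end frequency warp becomes a linear map on truncated cepstra — can be sketched directly. Everything below is illustrative and not the paper's exact construction: it uses a plain DCT cepstrum over `n_bins` log filter-bank outputs and a simple linear warp, rather than Mel-warped triangular filter banks.

```python
import numpy as np

def cepstral_warp_matrix(alpha, n_ceps=13, n_bins=64):
    # Build the matrix A such that warped cepstra ~= A @ c.
    # Idea: cepstra are a truncated DCT of the log filter-bank outputs,
    # so a warp of the frequency axis can be folded into a resampled
    # inverse DCT followed by the forward DCT on the regular grid.
    N = n_bins
    n = np.arange(N)
    # Hypothetical linear warp of the bin axis, clipped at the band edge.
    n_w = np.clip((n + 0.5) * (1.0 + alpha), 0.0, N - 0.5)
    k = np.arange(n_ceps)
    # Truncated inverse-DCT basis evaluated at the *warped* positions.
    w = np.full(n_ceps, 2.0 / N)
    w[0] = 1.0 / N
    B_w = np.cos(np.pi * np.outer(n_w, k) / N) * w   # (N, n_ceps)
    # Forward DCT-II on the regular grid, truncated to n_ceps rows.
    D = np.cos(np.pi * np.outer(k, n + 0.5) / N)     # (n_ceps, N)
    return D @ B_w
```

With `alpha = 0` the construction collapses to the identity (DCT orthogonality), which is a useful sanity check; nonzero `alpha` yields the linear cepstral transform the abstract refers to.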

    Handwritten Text Line Detection and Classification based on HMMs

    In this paper we present an approach for text line analysis and detection in handwritten documents based on Hidden Markov Models, a technique widely used in other handwritten text and speech recognition tasks. It is shown that text line analysis and detection can be solved using a more formal methodology, in contrast to most of the heuristic approaches found in the literature. Our approach not only provides the best position coordinates for each of the vertical page regions but also labels them, thereby surpassing traditional heuristic methods. In our experiments we demonstrate the performance of the approach (both in line analysis and detection) and study the impact of increasingly constrained “vertical layout language models” and morphological models on text line detection and classification accuracy. Through this experimentation we also show the improvement in quality of the baselines yielded by our approach in comparison with a state-of-the-art heuristic method based on vertical projection profiles. Bosch Campos, V. (2012). Handwritten Text Line Detection and Classification based on HMMs. http://hdl.handle.net/10251/17964
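The decoding step of such an HMM-based line detector can be sketched with a standard Viterbi pass over per-row observations. The two-state setup used in the test below (blank vs. text-line, with emissions derived from the vertical projection profile) is an illustrative simplification of the richer region models described above.

```python
import numpy as np

def viterbi(obs_loglik, log_trans, log_init):
    # Standard Viterbi decoding.
    #   obs_loglik: (T, S) per-row log-likelihoods for each state
    #   log_trans:  (S, S) log transition probabilities
    #   log_init:   (S,)   log initial state probabilities
    T, S = obs_loglik.shape
    delta = log_init + obs_loglik[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)        # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans     # cand[i, j]: come from i, go to j
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(S)] + obs_loglik[t]
    # Trace the best path backwards.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Labelling every image row this way yields both the vertical region boundaries and their classes in a single decoding pass, which is exactly what distinguishes the approach from threshold-based projection-profile heuristics.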

    Vocal tract normalization equals linear transformation in cepstral space


    Vocal Tract Length Normalization for Statistical Parametric Speech Synthesis

    Vocal tract length normalization (VTLN) has been used successfully in automatic speech recognition to improve performance. The same technique can be implemented in statistical parametric speech synthesis for rapid speaker adaptation during synthesis. This paper presents an efficient implementation of VTLN using expectation maximization and addresses the key challenges faced in implementing VTLN for synthesis: Jacobian normalization, high-dimensional features and truncation of the transformation matrix are a few of the challenges presented, together with appropriate solutions. Detailed evaluations are performed to determine the most suitable technique for using VTLN in speech synthesis. Evaluating VTLN in the framework of speech synthesis is itself not an easy task, since the technique does not work equally well for all speakers; speakers were therefore selected according to different objective and subjective criteria to demonstrate the differences between systems. The best method for implementing VTLN is confirmed to be the use of lower-order features for estimating warping factors.
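The Jacobian-normalization challenge mentioned above can be made concrete: when comparing warping factors, the log-determinant of the feature transform must be added to the acoustic log-likelihood, otherwise transforms that shrink the feature space win spuriously. A minimal sketch, assuming a diagonal-Gaussian model and notation of our own:

```python
import numpy as np

def normalized_loglik(c, A, mean, var):
    # Log-likelihood of a linearly warped cepstral vector c under a
    # diagonal Gaussian, plus the log|det A| Jacobian term that makes
    # scores comparable across different warping transforms A.
    c_w = A @ c
    ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (c_w - mean) ** 2 / var)
    sign, logdet = np.linalg.slogdet(A)  # numerically stable log|det A|
    return float(ll + logdet)
```

Note that scoring `A @ c` under transform `I` and scoring `c` under transform `A` differ by exactly `log|det A|`; omitting the term therefore biases the warping-factor search, which is the behavior the paper's Jacobian-normalization discussion addresses.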

    Modeling DNN as human learner

    In previous experiments, human listeners demonstrated the ability to adapt to unheard, ambiguous phonemes after relatively short initial exposure. At the same time, previous work in the speech community has shown that pre-trained deep neural network (DNN) based ASR systems, like humans, can adapt to unseen, ambiguous phonemes after retuning their parameters on a relatively small set. In the first part of this thesis, the time course of phoneme category adaptation in a DNN is investigated in more detail. By retuning the DNNs with more and more tokens of ambiguous sounds and comparing the classification accuracy of the ambiguous phonemes on a held-out test set across the time course, we found that DNNs, like human listeners, also demonstrated fast adaptation: the accuracy curves were step-like in almost all cases, showing very little further adaptation after only one (out of ten) training bins had been seen. However, unlike our experimental setup, in a typical lexically guided perceptual learning experiment listeners are trained on whole words rather than individual phones, so to truly model such a scenario we require a model that can take the context of a whole utterance into account. Traditional speech recognition systems accomplish this through hidden Markov models (HMMs) and WFST decoding. In recent years, bidirectional long short-term memory (Bi-LSTM) networks trained under the connectionist temporal classification (CTC) criterion have also attracted much attention. In the second part of this thesis, the previous experiments on ambiguous phoneme recognition were carried out again on a new Bi-LSTM model, using phonetic transcriptions of words ending in ambiguous phonemes as training targets instead of isolated single-phoneme sounds.
    We found that, despite the vastly different architecture, the new model showed highly similar behavior in terms of classification rate over the time course of incremental retuning. This indicates that ambiguous phonemes in a continuous context can also be adapted to quickly by neural network-based models. In the last part of this thesis, our pre-trained Dutch Bi-LSTM from the previous part was treated as a Dutch second-language learner and was asked to transcribe English utterances in a self-adaptation scheme. In other words, we used the Dutch model to generate phonetic transcriptions directly and retuned the model on the transcriptions it generated, although ground-truth transcriptions were used to choose a subset of all self-labeled transcriptions. Self-adaptation is of interest as a model of human second-language learning, but it also has great practical engineering value: for example, it could be used to adapt speech recognition to a low-resource language. We investigated two ways to improve the adaptation scheme: first, multi-task learning with articulatory feature detection, both during training on Dutch and during self-labeled adaptation; and second, letting the model adapt to isolated short words before feeding it longer utterances.
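The incremental-retuning protocol described above (retune on a growing number of bins, measure held-out accuracy after each) can be sketched with any trainable classifier. The toy logistic-regression learner below is ours, standing in for the DNN or Bi-LSTM; only the loop structure mirrors the experiments.

```python
import numpy as np

def retune_curve(model_w, bins, test_x, test_y, lr=0.5, epochs=20):
    # Incremental retuning: after each additional bin of (ambiguous)
    # training tokens, continue gradient updates on all data seen so far
    # and record held-out accuracy -- yielding the adaptation time course.
    accs = []
    seen_x, seen_y = [], []
    w = model_w.copy()                       # start from pre-trained weights
    for bx, by in bins:
        seen_x.append(bx)
        seen_y.append(by)
        X = np.vstack(seen_x)
        y = np.concatenate(seen_y)
        for _ in range(epochs):              # logistic-regression updates
            p = 1.0 / (1.0 + np.exp(-(X @ w)))
            w -= lr * X.T @ (p - y) / len(y)
        pred = (test_x @ w > 0).astype(int)
        accs.append(float(np.mean(pred == test_y)))
    return accs
```

A step-like `accs` curve — high already after the first bin, flat thereafter — is the fast-adaptation signature reported for both the DNN and the Bi-LSTM.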