14 research outputs found
Study of Jacobian Normalization for VTLN
The divergence of the theory and practice of vocal tract length normalization (VTLN) is addressed, with particular emphasis on the role of the Jacobian determinant. VTLN is placed in a Bayesian setting, which brings in the concept of a prior on the warping factor. The form of the prior, together with acoustic scaling and numerical conditioning, are then discussed and evaluated. It is concluded that the Jacobian determinant is important in VTLN, especially for the high-dimensional features used in HMM-based speech synthesis, and that the difficulties normally associated with the Jacobian determinant can be attributed to the prior and to acoustic scaling.
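The Bayesian view sketched above can be illustrated with a toy warp-factor search: each candidate warp is scored by the model log-likelihood of the warped features, plus the log-determinant of the warp Jacobian, plus a log prior on the warping factor. The Gaussian "model", the linear warp, and all names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def warp_features(x, alpha):
    """Toy stand-in for a VTLN warp: scale every feature vector by alpha."""
    A = alpha * np.eye(x.shape[1])
    return x @ A.T, np.linalg.slogdet(A)[1]   # warped features, log|det A|

def log_likelihood(x):
    """Log-density of all frames under a unit-variance Gaussian 'model'."""
    return -0.5 * np.sum(x ** 2) - 0.5 * x.size * np.log(2 * np.pi)

def select_warp(x, alphas, prior_mean=1.0, prior_std=0.1):
    """MAP warp choice: likelihood + per-frame Jacobian term + Gaussian prior."""
    best_alpha, best_score = None, -np.inf
    n_frames = x.shape[0]
    for a in alphas:
        xw, logdet = warp_features(x, a)
        log_prior = -0.5 * ((a - prior_mean) / prior_std) ** 2
        score = log_likelihood(xw) + n_frames * logdet + log_prior
        if score > best_score:
            best_alpha, best_score = a, score
    return best_alpha

# Features drawn with a larger scale, mimicking a longer vocal tract;
# the selected warp compensates by shrinking them (alpha below 1.0).
x = rng.normal(scale=1.2, size=(50, 13))
alphas = np.linspace(0.8, 1.2, 41)
print(select_warp(x, alphas))
```

Omitting the `n_frames * logdet` term changes which warp wins the search, which is one way to see numerically why the paper argues the Jacobian matters.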
VTLN Adaptation for Statistical Speech Synthesis
The advent of statistical speech synthesis has enabled the unification of the basic techniques used in speech synthesis and recognition. Adaptation techniques that have been used successfully in recognition systems can now be applied to synthesis systems to improve the quality of the synthesized speech. The application of vocal tract length normalization (VTLN) to synthesis is explored in this paper. VTLN-based adaptation requires estimation of a single warping factor, which can be estimated accurately from very little adaptation data and gives additive improvements over CMLLR adaptation. The challenge of estimating accurate warping factors using higher-order features is solved by initializing warping factor estimation with the values calculated from lower-order features.
CHARACTER-LEVEL INTERACTIONS IN MULTIMODAL COMPUTER-ASSISTED TRANSCRIPTION OF TEXT IMAGES
HTR systems do not achieve acceptable results in unconstrained applications. It is therefore convenient to use a system that allows the user to cooperate with the system in the most comfortable way to generate a correct transcription. In this paper, multimodal interaction at the character level is studied.
Martín-Albo Simón, D. (2011). CHARACTER-LEVEL INTERACTIONS IN MULTIMODAL COMPUTER-ASSISTED TRANSCRIPTION OF TEXT IMAGES. http://hdl.handle.net/10251/11313
Adaptation of children’s speech with limited data based on formant-like peak alignment
Automatic recognition of children's speech using acoustic models trained by adults results in poor performance due to differences in speech acoustics. These acoustical differences are a consequence of children having shorter vocal tracts and smaller vocal cords than adults. Hence, speaker adaptation needs to be performed. However, in real-world applications, the amount of adaptation data available may be less than what is needed by common speaker adaptation techniques to yield reasonable performance. In this paper, we first study, in the discrete frequency domain, the relationship between frequency warping in the front-end and corresponding transformations in the back-end. Three common feature extraction schemes are investigated and the linearity of their back-end transformations is discussed. In particular, we show that under certain approximations, frequency warping of MFCC features with Mel-warped triangular filter banks equals a linear transformation in the cepstral space. Based on that linear transformation, a formant-like peak alignment algorithm is proposed to adapt adult acoustic models to children's speech. The peaks are estimated by Gaussian mixtures using the Expectation-Maximization (EM) algorithm.
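The linearity result mentioned in the abstract can be checked numerically in a simplified setting: if the warped log-spectrum is a linear (here, piecewise-linear interpolating) function of the original log-spectrum samples, then the warped cepstrum is an exact linear map of the original cepstrum. This sketch uses a plain DCT cepstrum and a linear frequency scaling rather than the paper's Mel-warped triangular filter banks; all names are illustrative.

```python
import numpy as np

N = 24
k = np.arange(N)[:, None]               # cepstral index (rows)
i = np.arange(N)[None, :]               # log-spectrum sample index (cols)
C = np.cos(np.pi * (i + 0.5) * k / N)   # unnormalized DCT-II analysis matrix
C_inv = np.linalg.inv(C)

def interp_matrix(alpha, n=N):
    """Row r linearly interpolates the log-spectrum at warped position alpha*r."""
    P = np.zeros((n, n))
    for r in range(n):
        p = min(alpha * r, n - 1.0)
        lo = int(np.floor(p))
        hi = min(lo + 1, n - 1)
        w = p - lo
        P[r, lo] += 1.0 - w
        P[r, hi] += w
    return P

alpha = 0.9
P = interp_matrix(alpha)
T = C @ P @ C_inv                        # one fixed linear map in the cepstral domain

log_spec = np.random.default_rng(1).normal(size=N)
cep = C @ log_spec
cep_warped_direct = C @ (P @ log_spec)   # warp the spectrum, then take cepstrum
cep_warped_linear = T @ cep              # apply the cepstral-domain linear map
print(np.allclose(cep_warped_direct, cep_warped_linear))   # True
```

Since the warp matrix `T` does not depend on the signal, the same matrix adapts every cepstral frame, which is what makes back-end model transformation practical.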
Handwritten Text Line Detection and Classification based on HMMs
In this paper we present an approach for text line analysis and detection in handwritten
documents based on Hidden Markov Models, a technique widely used in other handwritten
and speech recognition tasks. It is shown that text line analysis and detection can be
solved using a more formal methodology, in contrast to most of the proposed
heuristic approaches found in the literature. Our approach not only provides the best
position coordinates for each of the vertical page regions but also labels them, in this
manner surpassing the traditional heuristic methods. In our experiments we demonstrate
the performance of the approach (both in line analysis and detection) and study the
impact of increasingly constrained "vertical layout language models" and morphologic
models on text line detection and classification accuracy. Through this experimentation
we also show the improvement in quality of the baselines yielded by our approach in
comparison with a state-of-the-art heuristic method based on vertical projection profiles.
Bosch Campos, V. (2012). Handwritten Text Line Detection and Classification based on HMMs. http://hdl.handle.net/10251/17964
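The HMM formulation described above can be caricatured in a few lines: each horizontal row of the page emits a feature (here just its ink density), and Viterbi decoding under a two-state layout model labels rows as blank or text-line. The states, probabilities, and densities below are invented for illustration and are far simpler than the thesis' models.

```python
import numpy as np

states = ["blank", "line"]
trans = np.log(np.array([[0.9, 0.1],     # blank -> blank / line
                         [0.2, 0.8]]))   # line  -> blank / line

def emit_logp(density):
    """Toy Gaussian emissions: blank rows have low ink density, text rows high."""
    means, std = np.array([0.05, 0.6]), 0.15
    return -0.5 * ((density - means) / std) ** 2

def viterbi(densities):
    n, s = len(densities), len(states)
    dp = np.full((n, s), -np.inf)        # best log-score ending in each state
    bp = np.zeros((n, s), dtype=int)     # backpointers
    dp[0] = emit_logp(densities[0])
    for t in range(1, n):
        for j in range(s):
            scores = dp[t - 1] + trans[:, j]
            bp[t, j] = int(np.argmax(scores))
            dp[t, j] = scores[bp[t, j]] + emit_logp(densities[t])[j]
    path = [int(np.argmax(dp[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(bp[t, path[-1]]))
    return [states[q] for q in reversed(path)]

rows = [0.02, 0.04, 0.55, 0.62, 0.58, 0.03, 0.01, 0.60, 0.57, 0.05]
print(viterbi(rows))   # contiguous "line" runs mark two detected text lines
```

Replacing the two states with a richer, order-constrained inventory (top margin, text line, interline gap, bottom margin) is, in spirit, the "vertical layout language model" the abstract refers to.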
Vocal Tract Length Normalization for Statistical Parametric Speech Synthesis
Vocal tract length normalization (VTLN) has been used successfully in automatic speech recognition for improved performance. The same technique can be implemented in statistical parametric speech synthesis for rapid speaker adaptation during synthesis. This paper presents an efficient implementation of VTLN using expectation maximization and addresses the key challenges faced in implementing VTLN for synthesis. Jacobian normalization, high-dimensional features, and truncation of the transformation matrix are a few of the challenges presented, together with appropriate solutions. Detailed evaluations are performed to determine the most suitable technique for using VTLN in speech synthesis. Evaluating VTLN in the framework of speech synthesis is not an easy task, since the technique does not work equally well for all speakers. Speakers have been selected based on different objective and subjective criteria to demonstrate the differences between systems. The best method for implementing VTLN is confirmed to be the use of lower-order features for estimating warping factors.
Modeling DNN as human learner
In previous experiments, human listeners demonstrated that they had the ability to adapt to
unheard, ambiguous phonemes after some initial, relatively short exposures. At the same time,
previous work in the speech community has shown that pre-trained deep neural network-based
(DNN) ASR systems, like humans, also have the ability to adapt to unseen, ambiguous phonemes
after retuning their parameters on a relatively small set. In the first part of this thesis, the time-course
of phoneme category adaptation in a DNN is investigated in more detail. By retuning the
DNNs with more and more tokens with ambiguous sounds and comparing classification accuracy
of the ambiguous phonemes in a held-out test across the time-course, we found out that DNNs, like
human listeners, also demonstrated fast adaptation: the accuracy curves were step-like in almost
all cases, showing very little further adaptation after the first (of ten) training bins. However,
unlike our experimental setup described above, in a typical lexically guided perceptual learning
experiment listeners are trained on whole words rather than individual phones, and thus to truly
model such a scenario, we would require a model that could take the context of a whole utterance
into account. Traditional speech recognition systems accomplish this through the use of hidden
Markov models (HMM) and WFST decoding. In recent years, bidirectional long short-term memory (Bi-LSTM) trained under connectionist temporal classification (CTC) criterion has also attracted
much attention. In the second part of this thesis, previous experiments on ambiguous phoneme
recognition were carried out again on a new Bi-LSTM model, and phonetic transcriptions of words
ending with ambiguous phonemes were used as training targets, instead of individual sounds that
consisted of a single phoneme. We found out that despite the vastly different architecture, the
new model showed highly similar behavior in terms of classification rate over the time course of
incremental retuning. This indicated that ambiguous phonemes in a continuous context could also
be quickly adapted by neural network-based models. In the last part of this thesis, our pre-trained
Dutch Bi-LSTM from the previous part was treated as a Dutch second language learner and was
asked to transcribe English utterances in a self-adaptation scheme. In other words, we used the
Dutch model to generate phonetic transcriptions directly and retune the model on the transcriptions
it generated, although ground truth transcriptions were used to choose a subset of all self-labeled
transcriptions. Self-adaptation is of interest as a model of human second language learning, but also
has great practical engineering value, e.g., it could be used to adapt speech recognition to a low-resource
language. We investigated two ways to improve the adaptation scheme, with the first being
multi-task learning with articulatory feature detection during both the Dutch training and the
self-labeled adaptation, and the second being letting the model first adapt to isolated short words before
feeding it longer utterances.
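The self-adaptation loop described above can be sketched with a deliberately tiny stand-in model: the model labels items itself, a subset of self-labels is kept (here, those matching the reference, mirroring the thesis' use of ground truth for subset selection), the model is retuned on that subset only, and short items are presented before long ones. The integer "transcription" model and all names are illustrative assumptions, not the thesis' Bi-LSTM.

```python
import numpy as np

def transcribe(w, x):
    """Discrete outputs (like phone labels) from a mismatched linear 'model'."""
    return np.round(w * x).astype(int)

def self_adapt(w, xs, refs):
    """One round of self-adaptation: self-label, select, retune."""
    hyps = transcribe(w, xs)
    keep = hyps == refs                   # ground-truth-based subset selection
    if keep.any():
        # retune on the *self-generated* labels of the kept subset
        w = float(np.sum(xs[keep] * hyps[keep]) / np.sum(xs[keep] ** 2))
    return w

xs = np.arange(1, 11)        # "utterances", ordered short to long
refs = xs.copy()             # reference transcriptions
w = 0.9                      # mismatched initial "Dutch" model
w = self_adapt(w, xs[:4], refs[:4])           # adapt on short items first
print(w)                                      # 1.0
print((transcribe(w, xs) == refs).all())      # True: longer items now correct
```

Short items are the ones a mismatched model still labels correctly, so adapting on them first, as in the thesis' second improvement, gives the loop reliable self-labels to learn from before harder, longer utterances arrive.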