201 research outputs found
Sequential decision modeling in uncertain conditions
Cette thèse consiste en une série d’approches pour la modélisation de décision structurée - c’est-à-dire qu’elle propose des solutions utilisant des modèles génératifs pour des tâches intégrant plusieurs entrées et sorties, ces entrées et sorties étant dictées par des interactions complexes entre leurs éléments. Un aspect crucial de ces problèmes est la présence en plus d’un résultat correct, des résultats structurellement différents mais considérés tout aussi corrects, résultant d’une grande mais nécessaire incertitude sur les sorties du système. Cette thèse présente quatre articles sur ce sujet, se concentrent en particulier sur le domaine de la synthèse vocale à partir de texte, génération symbolique de musique, traitement de texte, reconnaissance automatique de la parole, et apprentissage de représentations pour la parole et le texte. Chaque article présente une approche particulière à un problème dans ces domaines respectifs, en proposant et étudiant des architectures profondes pour ces domaines. Bien que ces techniques d’apprentissage profond utilisées dans ces articles sont suffisamment versatiles et expressives pour être utilisées dans d’autres domaines, nous resterons concentrés sur les applications décrites dans chaque article.
Le premier article présente une approche permettant le contrôle détaillé, au niveau phonétique et symbolique, d’un système de synthèse vocale, en utilisant une méthode d’échange efficace permettant de combiner des représentations à un niveau lexical. Puisque cette combinaison permet un contrôle proportionné sur les conditions d’entrée, et améliore les prononciations faisant uniquement usage de caractères, ce système de combinaison pour la synthèse vocale a été préféré durant des tests A/B par rapport à des modèles de référence équivalents utilisant les mêmes modalités. Le deuxième article se concentre sur un autre système de synthèse vocale, cette fois-ci centré sur la construction d’une représentation multi-échelle de la parole à travers une décomposition structurée des descripteurs audio. En particulier, l’intérêt de ce travail est dans sa méthodologie économe en calcul malgré avoir été bâti à partir de travaux antérieurs beaucoup plus demandant en ressources de calcul. Afin de bien pouvoir faire de la synthèse vocale sous ces contraintes computationelles, plusieurs nouvelles composantes ont été conçues et intégrées à ce qui devient un modèle efficace de synthèse vocale. Le troisième article un nouveau modèle auto-régressif pour modéliser des chaînes de symboles. Ce modèle fait usage de prédictions et d’estimations itérative et répétées afin de construire une sortie structurée respectant plusieurs contraintes correspondant au domaine sous-jacent. Ce modèle est testé dans le cadre de la génération symbolique de musique et la modélisation de texte, faisant preuve d’excellentes performances en particulier quand la quantité de données s’avère limitée. Le dernier article de la thèse se concentre sur l’étude des représentations pour la parole et le texte apprise à partir d’un système de reconnaissance vocale d’un travail antérieur. À travers une série d’études systématiques utilisant des modèles pré-entraînés de texte et de durée, relations qualitatives entre les données de texte et de parole, et études de performance sur la récupération transmodal “few shot”, nous exposons plusieurs propriétés essentielles sous-jacent à la performance du système, ouvrant la voie pour des développements algorithmiques futurs. De plus, les différents modèles résultants de cette étude obtiennent des résultats impressionnants sur un nombre de tâches de référence utilisant des modèles pré-entraîné transféré sans modification.This thesis presents a sequence of approaches to structured decision modeling - that is, proposing generative solutions to tasks with multiple inputs and outputs, featuring complicated interactions between input elements and output elements. Crucially, these problems also include a high amount of uncertainty about the correct outcome and many largely equivalent but structurally different outcomes can be considered equally correct. This thesis presents four articles about these topics, particularly focusing on the domains of text-to-speech synthesis, symbolic music generation, text processing, automatic speech recognition, and speech-text representation learning. Each article presents a particular approach to solving problems in these respective domains, focused on proposing and understanding deep learning architectures for these domains. The deep learning techniques used in these articles are broadly applicable, flexible, and powerful enough that these general approaches may find application to other areas however we remain focused on the domains discussed in each respective article.
The first article presents an approach allowing for flexible phonetic and character control of a text-to-speech system, utilizing an efficient "swap-out" method for blending representations at the word level. This blending allows for smooth control over input conditions, and also strengthens character only pronunciations, resulting in a preference for a blended text-to-speech system in A/B testing, compared to an equivalent baselines even when using the same input information modalities. The second article focuses on another text-to-speech system, this time centered on building multi-scale representations of speech audio using a structured decomposition of audio features. Particularly this work focuses on a compute efficient methodology, while building on prior work which requires a much greater computational budget than the proposed system. In order to effectively perform text-to-speech synthesis under these computational constraints, a number of new components are constructed and integrated, resulting in an efficient model for text-to-speech synthesis. The third article presents a new non-autoregressive model for modeling symbolic sequences. This model uses iterative prediction and re-estimation in order to build structured outputs, which respect numerous constraints in the underlying sequence domain. This model is applied to symbolic music modeling and text modeling, showing excellent performance particularly in limited data generative settings. The final article in this thesis focuses on understanding the speech-text representations learned by a text-injected speech recognition system from prior literature. Through a systematic series of studies utilizing pre-trained text and duration models, qualitative relations between text and speech sequences, and performance studies in few-shot cross-modal retrieval, we reveal a number of crucial properties underlying the performance of this system, paving the way for future algorithmic development. In addition, model variants built during this study achieve impressive performance results on a number of benchmark tasks using partially frozen and transferred parameters
Pronunciation modelling in end-to-end text-to-speech synthesis
Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve
high-quality naturalness scores without extensive processing of text-input. Since S2S
models have been proposed in multiple aspects of the TTS pipeline, the field has focused
on embedding the pipeline toward End-to-End (E2E-) TTS where a waveform
is predicted directly from a sequence of text or phone characters. Early work on E2ETTS
in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation
(lexicon-lookup and/or G2P modelling) could be implicitly learnt in a text-encoder
during training. The benefits of a learned text encoding include improved modelling
of phonetic context, which make contextual linguistic features traditionally used in
TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar
naturalness scores with text- or phone-input (e.g. as in [4]). Successful modelling
of phonetic context has led some to question the benefit of using phone- instead of
text-input altogether (see [5]).
The use of text-input brings into question the value of the pronunciation lexicon
in E2E-TTS. Without phone-input, a S2S encoder learns an implicit grapheme-tophoneme
(G2P) model from text-audio pairs during training. With common datasets
for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates
compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation
is difficult for some words (e.g. foreign words and proper names) since
the knowledge to disambiguate their pronunciations may not be provided by the local
grapheme context and may require knowledge beyond that contained in sentence-level
text-audio sequences. When test stimuli were selected according to G2P difficulty,
increased mispronunciations in E2E-TTS with text-input were observed. Following
the proposed benefits of subword decomposition in S2S modelling in other language
tasks (e.g. neural machine translation), the effects of morphological decomposition
were investigated on pronunciation modelling. Learning of the French post-lexical
phenomenon liaison was also evaluated.
With the goal of an inexpensive, large-scale evaluation of pronunciation modelling,
the reliability of automatic speech recognition (ASR) to measure TTS intelligibility
was investigated. A re-evaluation of 6 years of results from the Blizzard Challenge
was conducted. ASR reliably found similar significant differences between systems
as paid listeners in controlled conditions in English. An analysis of transcriptions for
words exhibiting difficult-to-predict G2P relations was also conducted. The E2E-ASR
Transformer model used was found to be unreliable in its transcription of difficult G2P
relations due to homophonic transcription and incorrect transcription of words with
difficult G2P relations. A further evaluation of representation mixing in Tacotron finds
pronunciation correction is possible when mixing text- and phone-inputs. The thesis
concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a
pronunciation guide since it can provide assurances that G2P generalisation cannot
Recommended from our members
Structured learning with latent variables : theory and algorithms
Most tasks in natural language processing (NLP) try to map structured input (e.g., sentence or word sequence) to some form of structured output (tag sequence, parse tree, semantic graph, translated/paraphrased/compressed sentence), a problem known as “structured prediction”. While various learning algorithms such as the perceptron, maximum entropy, and expectation-maximization have been extended to the structured setting (and thus applicable to NLP problems), directly applying them as is to NLP tasks remains challenging for the following reasons. First, the prohibitively large search space in NLP makes exact search intractable, and in practice inexact search methods like beam search are routinely used instead. Second, the output structures are usually partially, rather than completely, annotated, which requires structured latent variables. However, the introduction of inexact search and latent components violates some key theoretical properties (such as convergence) of conventional structured learning algorithms, and requires us to develop new algorithms suitable for scalable structured learning with latent variables. In this thesis, we first investigate new theoretical properties for these structured learning algorithms with inexact search and latent variables, and then also demonstrate that structured learning with latent variables is a powerful modeling tool for many NLP tasks with less strict annotation requirements, and can be generalized to neural models.Keywords: textual entailment, structured learning, semantic parsing, perceptron, conditional random field, latent variable, machine translatio
The optimality of word lengths
In this thesis, an analysis is done to test new optimality scores to estimate if they explain better text optimality than mean word length. To do so, some large parallel corpora will be analyzed with 460 languages, 34 unique writing systems, and 54 unique language families to test if these scores explain better text optimality. Also, a study to determine whether language family or writing system best explains the optimality of a language has been done. It concludes with a classification of the writing systems and language families with the scores proposed
Rapid Generation of Pronunciation Dictionaries for new Domains and Languages
This dissertation presents innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. Depending on various conditions, solutions are proposed and developed. Starting from the straightforward scenario in which the target language is present in written form on the Internet and the mapping between speech and written language is close up to the difficult scenario in which no written form for the target language exists
Mining question-answer pairs from web forum: a survey of challenges and resolutions
Internet forums, which are also known as discussion boards, are popular web applications. Members of the board discuss issues and share ideas to form a community within the board, and as a result generate huge amount of content on different topics on daily basis. Interest in information extraction and knowledge discovery from such sources has been on the increase in the research community. A number of factors are limiting the potentiality of mining knowledge from forums. Lexical chasm or lexical gap that renders some Natural Language Processing techniques (NLP) less effective, Informal tone that creates noisy data, drifting of discussion topic that prevents focused mining and asynchronous issue that makes it difficult to establish post-reply relationship are some of the problems that need to be addressed. This survey introduces these challenges within the framework of question answering. The survey provides description of the problems; cites and explores useful publications to the reader for further examination; provides an overview of resolution strategies and findings relevant to the challenges
Combining Evidence from Unconstrained Spoken Term Frequency Estimation for Improved Speech Retrieval
This dissertation considers the problem of information retrieval in speech. Today's speech retrieval
systems generally use a large vocabulary continuous speech
recognition system to first hypothesize the words which were spoken.
Because these systems have a predefined lexicon, words which
fall outside of the lexicon can significantly reduce search quality---as measured
by Mean Average Precision (MAP). This is particularly important because these Out-Of-Vocabulary (OOV)
words are often rare and therefore good discriminators for topically relevant speech segments.
The focus of this dissertation is on handling these out-of-vocabulary query words. The approach
is to combine results from a word-based speech retrieval system with those from vocabulary-independent
ranked utterance retrieval. The goal of ranked utterance retrieval is to rank speech utterances
by the system's confidence that they contain a particular spoken word, which is accomplished by ranking
the utterances by the estimated frequency of the word in the utterance. Several
new approaches for estimating this frequency are considered, which are motivated by the disparity between
reference and errorfully hypothesized phoneme sequences. The first method learns alternate pronunciations or
degradations from actual recognition hypotheses and incorporates these variants into a new generative estimator for
term frequency. A second method learns transformations of several easily computed features in a discriminative
model for the same task. Both methods significantly improved ranked utterance retrieval in an experimental
validation on new speech.
The best of these ranked utterance retrieval methods is then combined with a word-based speech retrieval system. The combination
approach uses a normalization learned in an additive model, which maps the retrieval status values from each system into estimated probabilities
of relevance that are easily combined. Using this combination, much of the MAP lost because of OOV words is recovered. Evaluated on a
collection of spontaneous, conversational speech, the system recovers 57.5\% of the MAP lost on short (title-only) queries and
41.3\% on longer (title plus description) queries
- …