178 research outputs found

    Multilingual unsupervised word alignment models and their application

    Word alignment is an essential task in natural language processing because of its critical role in training statistical machine translation (SMT) models, error analysis for neural machine translation (NMT), building bilingual lexicons, and annotation transfer. In this thesis, we explore models for word alignment, how they can be extended to incorporate linguistically motivated alignment types, and how they can be neuralized in an end-to-end fashion. In addition to these methodological developments, we apply our word alignment models to cross-lingual part-of-speech projection. First, we present a new probabilistic model for word alignment in which word alignments are associated with linguistically motivated alignment types. We propose a novel task of jointly predicting word alignments and alignment types, together with novel semi-supervised learning algorithms for this task. We also solve a sub-task of predicting the alignment type given an aligned word pair. The proposed joint generative models (alignment-type-enhanced models) significantly outperform the models without alignment types in terms of word alignment and translation quality. Next, we present an unsupervised neural Hidden Markov Model for word alignment, in which emission and transition probabilities are modeled using neural networks. The model is simpler in structure, allows for seamless integration of additional context, and can be used in an end-to-end neural network. Finally, we tackle part-of-speech (POS) tagging in the zero-resource scenario, where no POS-annotated training data is available. We present a cross-lingual projection approach in which neural HMM aligners are used to obtain high-quality word alignments between resource-poor and resource-rich languages. Moreover, high-quality neural POS taggers are used to provide annotations for the resource-rich language side of the parallel data, as well as to train a tagger on the projected data. Our experimental results on truly low-resource languages show that our methods outperform their corresponding baselines.
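
The cross-lingual projection step described in this abstract can be illustrated with a minimal sketch. It assumes a simple majority-vote heuristic over alignment links; the function name `project_pos_tags`, the `unknown_tag` placeholder, and the toy data are hypothetical and are not taken from the thesis, which may use a different projection rule.

```python
from collections import Counter


def project_pos_tags(source_tags, alignment, target_length, unknown_tag="X"):
    """Project POS tags from a tagged (resource-rich) sentence to its translation.

    source_tags   -- list of tags, one per source token
    alignment     -- list of (source_index, target_index) links from a word aligner
    target_length -- number of tokens in the target (resource-poor) sentence

    Each target token receives the most frequent tag among the source tokens
    aligned to it; unaligned target tokens receive unknown_tag (an assumption).
    """
    votes = [Counter() for _ in range(target_length)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1
    return [v.most_common(1)[0][0] if v else unknown_tag for v in votes]


# Toy example: three tagged source tokens, four target tokens, one unaligned.
source_tags = ["DET", "NOUN", "VERB"]
alignment = [(0, 0), (1, 1), (2, 2)]
projected = project_pos_tags(source_tags, alignment, target_length=4)
```

The projected tags could then serve as noisy training data for a tagger in the resource-poor language, which is the role the abstract assigns to the projected data.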

    Multilingual representations and models for improved low-resource language processing

    Word representations are the cornerstone of modern NLP. Representing words or characters as static real-valued vectors that capture semantics and encode meaning has long been popular among researchers. In more recent years, pretrained language models that are trained on large amounts of data and produce contextualized representations have achieved strong performance in various tasks such as semantic role labeling. These large pretrained language models are capable of storing and generalizing information and can be used as knowledge bases. Language models can produce multilingual representations while using only monolingual data during training. These multilingual representations can be beneficial in many tasks such as machine translation. Further, knowledge extraction models that relied only on information extracted from English resources can now benefit from extra resources in other languages. Although these results were achieved for high-resource languages, there are thousands of languages that do not have large corpora. Moreover, for other tasks such as machine translation, if large monolingual data is not available, the models need parallel data, which is scarce for most languages. Further, many languages lack tokenization models, and splitting the text into meaningful segments such as words is not trivial. Although using subwords helps models achieve better coverage of unseen data and new words in the vocabulary, generalizing over low-resource languages with different alphabets and grammars is still a challenge. This thesis investigates methods to overcome these issues for low-resource languages. In the first publication, we explore the degree of multilinguality in multilingual pretrained language models. We demonstrate that these language models can produce high-quality word alignments without using parallel training data, which is not available for many languages. 
In the second paper, we extract word alignments for all available language pairs in the public Bible corpus (PBC). Further, we create a tool for exploring these alignments, which is especially helpful in studying low-resource languages. The third paper investigates word alignment in multiparallel corpora and exploits graph algorithms for extracting new alignment edges. In the fourth publication, we propose a new model that iteratively generates cross-lingual word embeddings and extracts word alignments when only small parallel corpora are available. Lastly, the fifth paper finds that aggregating different granularities of text can improve word alignment quality. We propose using subword sampling to produce such granularities.

    A matter of timing : A modelling-based investigation of the dynamic behaviour of reproductive hormones in girls and women

    The hypothalamic-pituitary-gonadal axis (HPG axis), a part of the human endocrine system, regulates the female reproductive function. Feedback interactions between hormones secreted from the three glands forming the HPG axis (hypothalamus, pituitary, and ovaries) are essential for establishing a regular menstrual cycle. Mathematical models predicting the time evolution of hormone concentrations and the maturation of ovarian follicles are useful tools for understanding the dynamic behaviour of the menstrual cycle. Such models can, for example, help us to investigate pathological conditions such as endometriosis or polycystic ovary syndrome. Furthermore, they can be used to systematically study the effects of drugs on the endocrine system. In this way, menstrual cycle models could potentially be integrated into clinical routines as clinical decision support systems. For the simulation-based investigation of hormonal treatments that aim to stimulate the growth of ovarian follicles (Controlled Ovarian Stimulation, COS), we need models that predict hormone concentrations and the maturation of ovarian follicles in biological units for individuals throughout consecutive cycles. Here, I propose such a mechanistic menstrual cycle model and demonstrate its capability to predict the outcome of COS. Individual time series data is usually used to calibrate mechanistic models that have clinical implications. Collecting these data, however, is time-consuming and requires a high commitment from study participants. Therefore, integrating different data sets into the model calibration process is of interest. One type of data that is often more feasible to collect than individual time series is cross-sectional data. As part of my thesis, I developed a workflow based on Bayesian updating to integrate cross-sectional data into the model calibration process. I demonstrate the workflow using a mechanistic model describing the time evolution of reproductive hormones during puberty in girls. As an example, I show that a model calibrated with cross-sectional data can be used to predict individual dynamics after updating a subset of model parameters. In addition to the scientific contributions of this thesis, I hope that it draws attention to the importance of research in the area of women's reproductive health and underpins the value of mathematical modelling for this field.

    Probabilistic Inference for Phrase-based Machine Translation: A Sampling Approach

    Recent advances in statistical machine translation (SMT) have used dynamic programming (DP) based beam search methods for approximate inference within probabilistic translation models. Despite their success, these methods compromise the probabilistic interpretation of the underlying model, thus limiting the application of probabilistically defined decision rules during training and decoding. As an alternative, in this thesis, we propose a novel Monte Carlo sampling approach for theoretically sound approximate probabilistic inference within these models. The distribution we are interested in is the conditional distribution of a log-linear translation model; however, there is often no tractable way of computing the normalisation term of the model. Instead, a Gibbs sampling approach for phrase-based machine translation models is developed, which obviates the need to compute this term yet produces samples from the required distribution. We establish that the sampler effectively explores the distribution defined by a phrase-based model by showing that it converges in a reasonable amount of time to the desired distribution, irrespective of initialisation. Empirical evidence is provided to confirm that the sampler can provide accurate estimates of expectations of functions of interest. The mix of high-probability and low-probability derivations obtained through sampling is shown to provide a more accurate estimate of expectations than merely using the n most probable derivations. Subsequently, we show that the sampler provides a tractable solution for finding the maximum probability translation in the model. We also present a unified approach to approximating two additional intractable problems: minimum risk training and minimum Bayes risk decoding. Key to our approach is the use of the sampler, which allows us to explore the entire probability distribution and maintain a strict probabilistic formulation throughout the translation pipeline. 
For these tasks, sampling combines the simplicity of n-best list approaches with the extended view of the distribution that lattice-based approaches benefit from, while avoiding the biases associated with beam search. Our approach is theoretically well-motivated and can give better and more stable results than current state-of-the-art methods.
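
The reason Gibbs sampling avoids the normalisation term can be shown with a minimal sketch. It assumes a toy log-linear model over a short chain of binary variables rather than phrase-based translation derivations; the feature weights, the function names, and the burn-in length are hypothetical choices made only for illustration.

```python
import math
import random


def unnormalized_log_score(state, unary, pairwise):
    """Log-linear score: per-variable weights plus a bonus for equal neighbours."""
    score = sum(weight * value for weight, value in zip(unary, state))
    score += sum(pairwise for a, b in zip(state, state[1:]) if a == b)
    return score


def gibbs_sample(unary, pairwise, num_sweeps, rng):
    """Resample one variable at a time from its conditional distribution."""
    state = [rng.randint(0, 1) for _ in unary]
    samples = []
    for _ in range(num_sweeps):
        for i in range(len(state)):
            log_scores = []
            for value in (0, 1):
                state[i] = value
                log_scores.append(unnormalized_log_score(state, unary, pairwise))
            # The conditional needs only the score difference, so the
            # normalisation term of the full model cancels out.
            p_one = 1.0 / (1.0 + math.exp(log_scores[0] - log_scores[1]))
            state[i] = 1 if rng.random() < p_one else 0
        samples.append(tuple(state))
    return samples


rng = random.Random(0)
samples = gibbs_sample(unary=[5.0, 5.0, 5.0], pairwise=0.0, num_sweeps=2000, rng=rng)

# Monte Carlo estimate of E[x_0], discarding an initial burn-in period.
kept = samples[200:]
estimate = sum(s[0] for s in kept) / len(kept)
```

Averaging a function over the retained samples gives the kind of expectation estimate the abstract refers to; in this toy setting the exact value is 1 / (1 + exp(-5)), so the estimate should be close to 1.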

    A Statistical Approach to the Alignment of fMRI Data

    Multi-subject functional magnetic resonance imaging (fMRI) studies are critical. Because anatomical and functional structure varies across subjects, image alignment is necessary. We define a probabilistic model to describe functional alignment. By imposing a prior distribution, namely the matrix von Mises-Fisher distribution, on the orthogonal transformation parameter, anatomical information is embedded in the estimation of the parameters, i.e., combinations of spatially distant voxels are penalized. Applications to real data show improved classification and interpretability of the results compared to various functional alignment methods.

    A comparison of the CAR and DAGAR spatial random effects models with an application to diabetics rate estimation in Belgium

    When hierarchically modelling an epidemiological phenomenon on a finite collection of sites in space, one must always take a latent spatial effect into account in order to capture the correlation structure that links the phenomenon to the territory. In this work, we compare two autoregressive spatial models that can be used for this purpose: the classical CAR model and the more recent DAGAR model. Unlike the former, the latter has a desirable property: its ρ parameter can be naturally interpreted as the average neighbor pair correlation and, in addition, this parameter can be directly estimated when the effect is modelled using a DAGAR rather than a CAR structure. As an application, we model the diabetics rate in Belgium in 2014 and show the adequacy of these models in predicting the response variable when no covariates are available.
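
As a point of reference for how ρ enters a CAR structure, the following minimal sketch builds the precision matrix of the commonly used proper CAR parameterisation, Q = τ(D − ρW), where W is the site adjacency matrix and D holds the neighbour counts. The abstract does not state which CAR parameterisation the paper uses, so this form, the function names, and the four-site path graph are assumptions for illustration only.

```python
def car_precision(adjacency, rho, tau=1.0):
    """Precision matrix Q = tau * (D - rho * W) of a proper CAR model."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    return [
        [tau * ((degree[i] if i == j else 0.0) - rho * adjacency[i][j])
         for j in range(n)]
        for i in range(n)
    ]


def is_strictly_diagonally_dominant(matrix):
    """Sufficient check for positive definiteness of a symmetric matrix
    with a positive diagonal."""
    return all(
        abs(row[i]) > sum(abs(x) for j, x in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    )


# Four sites on a line: each site neighbours the next one.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
Q = car_precision(W, rho=0.9, tau=2.0)
proper = is_strictly_diagonally_dominant(Q)
```

In this parameterisation ρ scales the off-diagonal entries of the precision matrix rather than appearing directly as a correlation, which is consistent with the abstract's remark that ρ has a more natural interpretation under DAGAR.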

    Building task-oriented machine translation systems

    The main goal of this thesis is to develop interactive translation systems that show greater synergy with their potential users. To this end, the objective is to make state-of-the-art systems more ergonomic, intuitive, and efficient, so that the human expert feels more comfortable using them. Different techniques are presented that focus on improving the adaptability and response time of the underlying machine translation systems, together with a strategy aimed at improving human-machine interaction. All of this serves the ultimate purpose of filling the gap between the state of the art in machine translation and the tools that human translators have at their disposal. Regarding the response time of machine translation systems, this thesis presents a technique for pruning the parameters of current translation models, whose intuition is based on the concept of bilingual segmentation but which ends up evolving into a strategy for re-estimating those parameters. Using this strategy, experimental results show that the phrase table can be pruned by up to 97% without degrading the quality of the translations obtained. Moreover, these results are consistent across different language pairs, which shows that the technique presented here is effective in a traditional machine translation setting and could therefore be used directly in a post-editing scenario. However, the experiments carried out in interactive translation are slightly less convincing, since they entail a trade-off between response time and the quality of the suffixes produced. In addition, two adaptation techniques are presented with the aim of improving the adaptability of machine translation systems. The first...
    Sanchis Trilles, G. (2012). Building task-oriented machine translation systems [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/17174