
    Probabilistic Modelling of Morphologically Rich Languages

    This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex languages well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes. Comment: DPhil thesis, University of Oxford, submitted and accepted 2014. http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c

    Data analytics 2016: proceedings of the fifth international conference on data analytics

    River-mediated dynamic environmental factors and perinatal data analysis

    Perfluorooctanoic acid (PFOA) and related per- and polyfluoroalkyl substances, a group of man-made persistent organic chemicals used in many products, are widely distributed in the environment. Adverse health effects may occur even at low exposure levels. A large-scale PFOA contamination of drinking water resources, especially of the river Ruhr, was detected in North Rhine-Westphalia, Germany, in summer 2006. Subsequent measurements are available from the water supply stations along the river and elsewhere. The first state-wide environmental-epidemiological study on the general population analyses these secondary data together with routinely collected perinatal registry data, to estimate possible developmental-toxic effects of PFOA exposure, especially regarding birth weight (BW). Drinking water data are temporally and spatially modelled to assign estimated exposure values to the residents. A generalised linear model with an inverse link deals with the steeply decreasing temporal data pattern at the mainly affected stations. As confirmed by a river-wide joint model, the river's segments between the main junctions are the most important factor in explaining the spatial structure, besides local effects. Deductions from stations to areal units are made possible via estimated supply proportions. Regression of perinatal data with BW as response usually includes the gestational age (GA) as an important covariate in polynomial form. However, bivariate modelling of BW and GA is recommended to distinguish effects on each, on both, and between them. Bayesian distributional copula regression is applied, where the marginals for BW and GA as well as the copula representing their dependence structure are fitted independently and all parameters are estimated conditional on covariates. While a Gaussian is suitable for BW, the skewed GA data are better modelled by the three-parameter Dagum distribution. The Clayton copula performs better than the Gumbel and the symmetric Gaussian copula, although the lower tail dependence is weak. A non-linear trend of BW on GA is detected by the standard polynomial model. Linear effects of biometric and obstetric covariates and also of maternal smoking on BW mean are similar in both models, while the distributional copula regression also reveals effects on all other parameters. The local PFOA exposure is spatio-temporally assigned to the perinatal data of the most affected town of Arnsberg and so included in the regression models. No significant effect is found and a relatively high amount of noise remains. In future work and for larger regions, this can be dealt with by exposure modelling on area level using dependence information, by allowing further asymmetry in the bivariate distribution of BW and GA, and by respecting geographical structures in birth data.

    Making "fetch" happen: The influence of social and linguistic context on nonstandard word growth and decline

    In an online community, new words come and go: today's "haha" may be replaced by tomorrow's "lol." Changes in online writing are usually studied as a social process, with innovations diffusing through a network of individuals in a speech community. But unlike other types of innovation, language change is shaped and constrained by the system in which it takes part. To investigate the links between social and structural factors in language change, we undertake a large-scale analysis of nonstandard word growth in the online community Reddit. We find that dissemination across many linguistic contexts is a sign of growth: words that appear in more linguistic contexts grow faster and survive longer. We also find that social dissemination likely plays a less important role in explaining word growth and decline than previously hypothesized.

    An Intelligent Text Extraction and Navigation System

    We present sppc, a high-performance system for intelligent text extraction and navigation from German free text documents. The main purpose of sppc is to extract as much linguistic structure as possible for performing domain-specific processing. sppc consists of a set of domain-independent shallow core components which are realized by means of cascaded weighted finite state machines and generic dynamic tries. All extracted information is represented uniformly in one data structure (called the text chart) in a highly compact and linked form in order to support indexing and navigation through the set of solutions. Germa

    Climate warming can reduce biocontrol efficacy and promote plant invasion due to both genetic and transient metabolomic changes.

    Climate change may affect plant-herbivore interactions and their associated ecosystem functions. In an experimental evolution approach, we subjected replicated populations of the invasive Ambrosia artemisiifolia to a combination of simulated warming and herbivory by a potential biocontrol beetle. We tracked genomic and metabolomic changes across generations in field populations and assessed plant offspring phenotypes in a common environment. Using an integrated Bayesian model, we show that increased offspring biomass in response to warming arose through changes in the genetic composition of populations. In contrast, increased resistance to herbivory arose through a shift in plant metabolomic profiles without genetic changes, most likely by transgenerational induction of defences. Importantly, while increased resistance was costly at ambient temperatures, warming removed this constraint and favoured both vigorous and better defended plants under biocontrol. Climate warming may thus decrease biocontrol efficiency and promote Ambrosia invasion, with potentially serious economic and health consequences.

    Some Salient Issues in the Unsupervised Learning of Igbo Morphology

    The automatic learning of the morphology of natural language is an important topic in computational linguistics, because morphology is foundational to the study of linguistics. In addition, the emerging information society demands the application of Information and Communication Technologies (ICT) to languages in ways that require human-like analysis of language, and this depends to a large extent on the ability to undertake computational analysis of morphology. Even though rule-based and supervised learning approaches to the modeling of morphology have been found to be productive, they have also been found to be costly, cumbersome and susceptible to human error. In contrast, unsupervised learning methods do not require expensive human intervention but, as with everything statistical, they demand large volumes of linguistic data. This poses a challenge to resource-scarce languages such as Igbo. Furthermore, being a highly agglutinative language, Igbo features certain morphological processes that may not be easily accommodated by most of the frequency-driven unsupervised learning models available. This paper takes a critical look at some of the identified challenges of inducing Igbo morphology as a first step in devising methods by which they can be addressed.