375 research outputs found

    Exploring Linguistic Constraints in Nlp Applications

    Get PDF
    The key argument of this dissertation is that the success of an Natural Language Processing (NLP) application depends on a proper representation of the corresponding linguistic problem. This theme is raised in the context that the recent progress made in our field is widely credited to the effective use of strong engineering techniques. However, the intriguing power of highly lexicalized models shown in many NLP applications is not only an achievement by the development in machine learning, but also impossible without the extensive hand-annotated data resources made available, which are originally built with very deep linguistic considerations. More specifically, we explore three linguistic aspects in this dissertation: the distinction between closed-class vs. open-class words, long-tail distributions in vocabulary study and determinism in language models. The first two aspects are studied in unsupervised tasks, unsupervised part-of-speech (POS) tagging and morphology learning, and the last one is studied in supervised tasks, English POS tagging and Chinese word segmentation. Each linguistic aspect under study manifests itself in a (different) way to help improve performance or efficiency in some NLP application

    Methods and algorithms for unsupervised learning of morphology

    Get PDF
    This is an accepted manuscript of a chapter published by Springer in Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403 in 2014 available online: https://doi.org/10.1007/978-3-642-54906-9_15 The accepted version of the publication may differ from the final published version.This paper is a survey of methods and algorithms for unsupervised learning of morphology. We provide a description of the methods and algorithms used for morphological segmentation from a computational linguistics point of view. We survey morphological segmentation methods covering methods based on MDL (minimum description length), MLE (maximum likelihood estimation), MAP (maximum a posteriori), parametric and non-parametric Bayesian approaches. A review of the evaluation schemes for unsupervised morphological segmentation is also provided along with a summary of evaluation results on the Morpho Challenge evaluations.Published versio

    Producing power-law distributions and damping word frequencies with two-stage language models

    Get PDF
    Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statisticalmodels that can generically produce power laws, breaking generativemodels into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes-the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process-that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology.48 page(s

    Bayesian models of syntactic category acquisition

    Get PDF
    Discovering a word’s part of speech is an essential step in acquiring the grammar of a language. In this thesis we examine a variety of computational Bayesian models that use linguistic input available to children, in the form of transcribed child directed speech, to learn part of speech categories. Part of speech categories are characterised by contextual (distributional/syntactic) and word-internal (morphological) similarity. In this thesis, we assume language learners will be aware of these types of cues, and investigate exactly how they can make use of them. Firstly, we enrich the context of a standard model (the Bayesian Hidden Markov Model) by adding sentence type to the wider distributional context.We show that children are exposed to a much more diverse set of sentence types than evident in standard corpora used for NLP tasks, and previous work suggests that they are aware of the differences between sentence type as signalled by prosody and pragmatics. Sentence type affects local context distributions, and as such can be informative when relying on local context for categorisation. Adding sentence types to the model improves performance, depending on how it is integrated into our models. We discuss how to incorporate novel features into the model structure we use in a flexible manner, and present a second model type that learns to use sentence type as a distinguishing cue only when it is informative. Secondly, we add a model of morphological segmentation to the part of speech categorisation model, in order to model joint learning of syntactic categories and morphology. These two tasks are closely linked: categorising words into syntactic categories is aided by morphological information, and finding morphological patterns in words is aided by knowing the syntactic categories of those words. In our joint model, we find improved performance vis-a-vis single-task baselines, but the nature of the improvement depends on the morphological typology of the language being modelled. This is the first token-based joint model of unsupervised morphology and part of speech category learning of which we are aware

    Tree Structured Dirichlet Processes for Hierarchical Morphological Segmentation

    Get PDF
    This article presents a probabilistic hierarchical clustering model for morphological segmentation In contrast to existing approaches to morphology learning, our method allows learning hierarchical organization of word morphology as a collection of tree structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet process. Tree hierarchies are learned along with the corresponding morphological paradigms simultaneously. Our model is evaluated on Morpho Challenge and shows competitive performance when compared to state-of-the-art unsupervised morphological segmentation systems. Although we apply this model for morphological segmentation, the model itself can also be used for hierarchical clustering of other types of data

    Unsupervised joint PoS tagging and stemming for agglutinative languages

    Get PDF
    This is an accepted manuscript of an article published by Association for Computing Machinery (ACM) in ACM Transactions on Asian and Low-Resource Language Information Processing on 25/01/2019, available online: https://doi.org/10.1145/3292398 The accepted version of the publication may differ from the final published version.The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems along with PoS tags simultaneously. Therefore, we aim to overcome the sparsity problem by reducing word forms into their stems. We adopt a Bayesian model that is fully unsupervised. We build a Hidden Markov Model for PoS tagging where the stems are emitted through hidden states. Several versions of the model are introduced in order to observe the effects of different dependencies throughout the corpus, such as the dependency between stems and PoS tags or between PoS tags and affixes. Additionally, we use neural word embeddings to estimate the semantic similarity between the word form and stem. We use the semantic similarity as prior information to discover the actual stem of a word since inflection does not change the meaning of a word. We compare our models with other unsupervised stemming and PoS tagging models on Turkish, Hungarian, Finnish, Basque, and English. The results show that a joint model for PoS tagging and stemming improves on an independent PoS tagger and stemmer in agglutinative languages.This research is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) with the project number EEEAG-115E464.Published versio

    The Summarization of Arabic News Texts Using Probabilistic Topic Modeling for L2 Micro Learning Tasks

    Get PDF
    Report submitted as a result, in part, of participation in the Language Flagship Technology Innovation Center's Summer internship program in Summer 2019.The field of Natural Language Processing (NLP) combines computer science, linguistic theory, and mathematics. Natural Language Processing applications aim at equipping computers with human linguistic knowledge. Applications such as Information Retrieval, Machine Translation, spelling checkers, as well as text sum- marization, are intriguing fields that exploit the techniques of NLP. Text summariza- tion represents an important NLP task that simplifies various reading tasks. These NLP-based text summarization tasks can be utilized for the benefits of language acquisition.Language Flagship Technology Innovation Cente

    #Bieber + #Blast = #BieberBlast: Early Prediction of Popular Hashtag Compounds

    Full text link
    Compounding of natural language units is a very common phenomena. In this paper, we show, for the first time, that Twitter hashtags which, could be considered as correlates of such linguistic units, undergo compounding. We identify reasons for this compounding and propose a prediction model that can identify with 77.07% accuracy if a pair of hashtags compounding in the near future (i.e., 2 months after compounding) shall become popular. At longer times T = 6, 10 months the accuracies are 77.52% and 79.13% respectively. This technique has strong implications to trending hashtag recommendation since newly formed hashtag compounds can be recommended early, even before the compounding has taken place. Further, humans can predict compounds with an overall accuracy of only 48.7% (treated as baseline). Notably, while humans can discriminate the relatively easier cases, the automatic framework is successful in classifying the relatively harder cases.Comment: 14 pages, 4 figures, 9 tables, published in CSCW (Computer-Supported Cooperative Work and Social Computing) 2016. in Proceedings of 19th ACM conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2016
    corecore