178 research outputs found

    Modeling the Relationship among Linguistic Typological Features with Hierarchical Dirichlet Process

    PACLIC 23 / City University of Hong Kong / 3-5 December 2009

    Reconstructing the evolution of Indo-European grammar

    This study uses phylogenetic methods adopted from computational biology in order to reconstruct features of Proto-Indo-European morphosyntax. We estimate the probability of the presence of typological features in Proto-Indo-European on the assumption that these features change according to a stochastic process governed by evolutionary transition rates between them. We compare these probabilities to previous reconstructions of Proto-Indo-European morphosyntax, which use either the comparative-historical method or implicational typology. We find that our reconstruction yields strong support for a canonical model (synthetic, nominative-accusative, head-final) of the protolanguage and low support for any alternative model. Observing the evolutionary dynamics of features in our data set, we conclude that morphological features have slower rates of change, whereas syntactic traits change faster. Additionally, more frequent, unmarked traits in grammatical hierarchies have slower change rates when compared to less frequent, marked ones, which indicates that universal patterns of economy and frequency impact language change within the family.
    Keywords: Indo-European linguistics, historical linguistics, phylogenetic linguistics, typology, syntactic reconstruction
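
    The core idea, stated roughly: each binary typological feature evolves along the family tree as a two-state continuous-time Markov chain, and the probability that the feature was present at the root (Proto-Indo-European) follows from the transition rates, the branch lengths, and the values observed in the attested languages. Below is a minimal sketch of that calculation in Python; the toy tree, rates, and branch lengths are invented for illustration and are not the authors' data or pipeline.

```python
# Minimal sketch (not the authors' pipeline): root-state probability of a binary
# typological feature under a two-state continuous-time Markov model, using
# Felsenstein pruning on a tiny invented tree.
import numpy as np
from scipy.linalg import expm


def transition_matrix(q01: float, q10: float, t: float) -> np.ndarray:
    """P(t) = exp(Q t) for a two-state CTMC with gain rate q01 and loss rate q10."""
    Q = np.array([[-q01, q01],
                  [q10, -q10]])
    return expm(Q * t)


def partial_likelihood(node, q01, q10):
    """Return [P(tip data | node state = 0), P(tip data | node state = 1)]."""
    if "state" in node:                      # leaf: observed feature value 0 or 1
        L = np.zeros(2)
        L[node["state"]] = 1.0
        return L
    L = np.ones(2)
    for child, branch_len in node["children"]:
        P = transition_matrix(q01, q10, branch_len)
        L *= P @ partial_likelihood(child, q01, q10)
    return L


# Toy tree: ((A:0.3, B:0.3):0.5, C:0.8); A and B show the feature, C does not.
tree = {"children": [
    ({"children": [({"state": 1}, 0.3), ({"state": 1}, 0.3)]}, 0.5),
    ({"state": 0}, 0.8),
]}

root_L = partial_likelihood(tree, q01=0.2, q10=0.5)   # illustrative rates
root_prior = np.array([0.5, 0.5])                     # flat prior over root states
posterior = root_prior * root_L
posterior /= posterior.sum()
print(f"P(feature present at root) = {posterior[1]:.3f}")
```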

    Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches

    We demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The central assumption of our work is that by combining cues from multiple languages, the structure of each becomes more apparent. We consider two ways of applying this intuition to the problem of unsupervised part-of-speech tagging: a model that directly merges tag structures for a pair of languages into a single sequence and a second model which instead incorporates multilingual context using latent variables. Both approaches are formulated as hierarchical Bayesian models, using Markov Chain Monte Carlo sampling techniques for inference. Our results demonstrate that by incorporating multilingual evidence we can achieve impressive performance gains across a range of scenarios. We also found that performance improves steadily as the number of available languages increases.
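
    As a rough illustration of the monolingual building block behind these models, the sketch below runs Gibbs sampling for a bigram HMM tagger with Dirichlet priors over transitions and emissions, alternating between sampling the parameters given the current tags and resampling each tag given its neighbours and the word. The multilingual coupling described in the abstract is not shown, and the toy corpus, tag count, and hyperparameters are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's model): Gibbs sampling for an
# unsupervised bigram HMM tagger with Dirichlet priors, on a toy corpus.
import numpy as np

rng = np.random.default_rng(0)

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["a", "dog", "sleeps"]]
vocab = sorted({w for sent in corpus for w in sent})
word_id = {w: i for i, w in enumerate(vocab)}
sents = [[word_id[w] for w in sent] for sent in corpus]

K, V = 3, len(vocab)          # number of latent tags, vocabulary size
alpha, beta = 1.0, 0.1        # Dirichlet priors on transitions and emissions
BOS = K                       # extra "sentence start" state for transitions

tags = [rng.integers(K, size=len(s)) for s in sents]

for sweep in range(200):
    # 1) Sample transition and emission distributions given the current tags.
    trans_counts = np.zeros((K + 1, K)) + alpha
    emit_counts = np.zeros((K, V)) + beta
    for s, t in zip(sents, tags):
        prev = BOS
        for w, k in zip(s, t):
            trans_counts[prev, k] += 1
            emit_counts[k, w] += 1
            prev = k
    trans = np.array([rng.dirichlet(row) for row in trans_counts])
    emit = np.array([rng.dirichlet(row) for row in emit_counts])

    # 2) Resample each tag given its neighbours, the word, and the parameters.
    for s, t in zip(sents, tags):
        for i, w in enumerate(s):
            prev = BOS if i == 0 else t[i - 1]
            p = trans[prev, :] * emit[:, w]
            if i + 1 < len(s):
                p = p * trans[:K, t[i + 1]]
            t[i] = rng.choice(K, p=p / p.sum())

for sent, t in zip(corpus, tags):
    print(list(zip(sent, t.tolist())))
```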

    Practical Natural Language Processing for Low-Resource Languages.

    As the Internet and World Wide Web have continued to gain widespread adoption, the linguistic diversity they represent has also been growing. Simultaneously, the field of Linguistics is facing a crisis of the opposite sort: languages are becoming extinct faster than ever before, and linguists now estimate that the world could lose more than half of its linguistic diversity by the year 2100. This is a special time for Computational Linguistics; the field has unprecedented access to a great number of low-resource languages, readily available to be studied, but needs to act quickly before political, social, and economic pressures cause these languages to disappear from the Web. Most work in Computational Linguistics and Natural Language Processing (NLP) focuses on English or other languages that have text corpora of hundreds of millions of words. In this work, we present methods for automatically building NLP tools for low-resource languages with minimal need for human annotation in those languages. We start with language identification, specifically word-level language identification, an understudied variant that is necessary for processing Web text, and develop highly accurate machine learning methods for this problem. From there we move on to part-of-speech tagging and dependency parsing. For both of these problems we extend the current state of the art in projected learning to make use of multiple high-resource source languages instead of just a single one. In both tasks, we improve on the best current methods. All of these tools are practically realized in the "Minority Language Server," an online tool that brings these techniques together with low-resource language text on the Web. Starting with only a few words in a language, the Minority Language Server can automatically collect text in that language, identify its language, and tag its parts of speech. We hope that this system provides a convincing proof of concept for the automatic collection and processing of low-resource language text from the Web, and one that can be realized before it is too late.
    PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113373/1/benking_1.pd
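
    Word-level language identification, the first component described above, can be illustrated with a simple character n-gram classifier that labels each token independently. The sketch below uses scikit-learn and an invented two-language toy training set; it is not the thesis's actual system, which also exploits surrounding-word context.

```python
# Minimal sketch (assumed, not the thesis's system): per-word language ID from
# character n-gram features with a logistic-regression classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: (word, language) pairs for an English/Spanish mix.
train_words = ["the", "house", "running", "quickly", "water",
               "la", "casa", "corriendo", "rápidamente", "agua"]
train_langs = ["en"] * 5 + ["es"] * 5

# Character 1-3 gram counts feed a simple logistic-regression classifier.
clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_words, train_langs)

# Word-level prediction over code-switched text.
sentence = "the casa is corriendo".split()
print(list(zip(sentence, clf.predict(sentence))))
```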

    Crime Topic Modeling

    The classification of crime into discrete categories entails a massive loss of information. Crimes emerge out of a complex mix of behaviors and situations, yet most of these details cannot be captured by singular crime type labels. This information loss impacts our ability not only to understand the causes of crime, but also to develop optimal crime prevention strategies. We apply machine learning methods to short narrative text descriptions accompanying crime records with the goal of discovering ecologically more meaningful latent crime classes. We term these latent classes "crime topics" in reference to the text-based topic modeling methods that produce them. We use topic distributions to measure clustering among formally recognized crime types. Crime topics replicate broad distinctions between violent and property crime, but also reveal nuances linked to target characteristics, situational conditions, and the tools and methods of attack. Formal crime types are not discrete in topic space. Rather, crime types are distributed across a range of crime topics. Similarly, individual crime topics are distributed across a range of formal crime types. Key ecological groups include identity theft, shoplifting, burglary and theft, car crimes and vandalism, criminal threats and confidence crimes, and violent crimes. Though not a replacement for formal legal crime classifications, crime topics provide a unique window into the heterogeneous causal processes underlying crime.
    Comment: 47 pages, 4 tables, 7 figures
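
    The workflow described here, roughly: vectorize the short narrative texts, fit a topic model, and then read each formal crime type as a distribution over the learned topics rather than as a single category. The sketch below uses scikit-learn's LDA on four invented narratives; it is an assumed, simplified stand-in for the paper's pipeline, not a reproduction of it.

```python
# Minimal sketch (assumed setup): topic modeling over short crime narratives,
# reading each formal crime type as a distribution over learned "crime topics".
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

narratives = [
    "suspect broke rear window and removed television from residence",
    "victim reported wallet taken from parked vehicle broken window",
    "suspect threatened victim with knife and demanded money",
    "unknown suspect used victim identity to open credit account",
]
crime_types = ["burglary", "theft from vehicle", "robbery", "identity theft"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(narratives)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)            # per-narrative topic distributions

# Each formal crime type is a distribution over topics, not a single topic.
for label, dist in zip(crime_types, doc_topics):
    print(label, np.round(dist, 2))

# Top words per topic, for a rough reading of what each "crime topic" captures.
terms = vectorizer.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
```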

    Dependencies in language: On the causal ontology of linguistic systems

    Dependency is a fundamental concept in the analysis of linguistic systems. The many if-then statements offered in typology and grammar-writing imply a causally real notion of dependency that is central to the claim being made, usually with reference to widely varying timescales and types of processes. But despite the importance of the concept of dependency in our work, its nature is seldom defined or made explicit. This book brings together experts on language, representing descriptive linguistics, language typology, functional/cognitive linguistics, cognitive science, research on gesture and other semiotic systems, developmental psychology, psycholinguistics, and linguistic anthropology, to address the following question: What kinds of dependencies exist among language-related systems, and how do we define and explain them in natural, causal terms?
    • …