92,483 research outputs found
A general method for the statistical evaluation of typological distributions
The distribution of linguistic structures in the world is the joint product of universal principles, inheritance from ancestor languages, language contact, social structures, and random fluctuation. This paper proposes a method for evaluating the relative significance of each factor, and in particular of universal principles, via regression modeling: statistical evidence for universal principles is found if the odds for families to have skewed responses (e.g. all or most members have postnominal relative clauses), as opposed to an opposite skewing or no skewing at all, are significantly higher under some condition (e.g. VO order) than under another, independently of other factors.
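As an illustrative sketch of the kind of quantity the abstract describes, the odds of family-level skewing under each condition can be computed and compared as an odds ratio. All data and numbers below are invented for the example; the paper's actual method embeds this in a full regression model.

```python
from collections import Counter

# Hypothetical toy data: each family is summarized by its dominant word
# order condition ("VO" or "OV") and whether its members' responses are
# skewed toward postnominal relative clauses ("post"), skewed the other
# way ("pre"), or not skewed ("none").
families = [
    ("VO", "post"), ("VO", "post"), ("VO", "none"), ("VO", "post"),
    ("OV", "pre"), ("OV", "none"), ("OV", "post"), ("OV", "pre"),
]

def skew_odds(condition):
    """Odds that a family under `condition` is skewed toward 'post'
    rather than skewed the other way or not skewed at all."""
    counts = Counter(skew for cond, skew in families if cond == condition)
    favorable = counts["post"]
    other = counts["pre"] + counts["none"]
    return favorable / other

# The criterion sketched in the abstract: evidence for a universal
# principle if these odds differ markedly across conditions (VO vs. OV);
# the paper assesses significance within a regression model.
odds_ratio = skew_odds("VO") / skew_odds("OV")
print(odds_ratio)  # → 9.0 on this toy data
```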
Topological Analysis of Syntactic Structures
We use the persistent homology method of topological data analysis and
dimensional analysis techniques to study data of syntactic structures of world
languages. We analyze relations between syntactic parameters in terms of
dimensionality, of hierarchical clustering structures, and of non-trivial
loops. We show there are relations that hold across language families and
additional relations that are family-specific. We then analyze the trees
describing the merging structure of persistent connected components for
languages in different language families and we show that they partly correlate
to historical phylogenetic trees but with significant differences. We also show
the existence of interesting non-trivial persistent first homology groups in
various language families. We give examples where explicit generators for the
persistent first homology can be identified, some of which appear to correspond
to homoplasy phenomena, while others may have an explanation in terms of
historical linguistics, corresponding to known cases of syntactic borrowing
across different language subfamilies. Comment: 83 pages, LaTeX, 44 figures
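The "merging structure of persistent connected components" mentioned above is the degree-zero part of persistent homology, and it can be sketched with a union-find sweep over edges of a distance filtration. Everything below is a toy illustration (made-up languages encoded as binary syntactic parameter vectors, Hamming distance), not the paper's pipeline.

```python
from itertools import combinations

# Hypothetical toy data: languages as binary syntactic parameter vectors.
langs = {
    "A": (1, 0, 1, 1), "B": (1, 0, 1, 0),
    "C": (0, 1, 0, 0), "D": (0, 1, 1, 0),
}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# Union-find over edges sorted by increasing distance: each union
# records the filtration value at which a connected component dies,
# i.e. the persistent-H0 "merging structure" (a dendrogram of heights).
parent = {name: name for name in langs}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

merge_heights = []
edges = sorted(
    (hamming(langs[a], langs[b]), a, b)
    for a, b in combinations(langs, 2)
)
for d, a, b in edges:
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb
        merge_heights.append(d)  # a component dies at filtration value d

print(merge_heights)  # → [1, 1, 2] on this toy data
```

Comparing such merge trees across families, as the paper does against historical phylogenies, only requires running this sweep per family and aligning the resulting dendrograms.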
An Empirical Study on Crosslingual Transfer in Probabilistic Topic Models
Probabilistic topic modeling is a popular choice as the first step of
crosslingual tasks to enable knowledge transfer and extract multilingual
features. While many multilingual topic models have been developed, their
assumptions on the training corpus are quite varied, and it is not clear how
well the models can be applied under various training conditions. In this
paper, we systematically study the knowledge transfer mechanisms behind
different multilingual topic models, and through a broad set of experiments
with four models on ten languages, we provide empirical insights that can
inform the selection and future development of multilingual topic models.
Native Language Identification using Stacked Generalization
Ensemble methods using multiple classifiers have proven to be the most
successful approach for the task of Native Language Identification (NLI),
achieving the current state of the art. However, a systematic examination of
ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble
architectures such as classifier stacking have not been closely evaluated. We
present a set of experiments using three ensemble-based models, testing each
with multiple configurations and algorithms. This includes a rigorous
application of meta-classification models for NLI, achieving state-of-the-art
results on three datasets from different languages. We also present the first
use of statistical significance testing for comparing NLI systems, showing that
our results are significantly better than the previous state of the art. We
make available a collection of test set predictions to facilitate future
statistical tests.
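Classifier stacking, the technique the abstract evaluates, trains a meta-classifier on the outputs of base classifiers rather than on the raw features. The sketch below is a deliberately minimal, self-contained illustration with invented data and rule-based base models; the paper's systems use real NLI features and learned classifiers at both levels.

```python
from collections import Counter, defaultdict

# Toy binary task: each instance is (avg_word_len, function_word_rate).
train = [((4.2, 0.31), 0), ((5.8, 0.22), 1), ((4.0, 0.35), 0), ((6.1, 0.19), 1)]
test = [((5.9, 0.21), 1), ((4.1, 0.33), 0)]

# Two weak base classifiers, each a one-feature threshold rule.
base_models = [
    lambda x: int(x[0] > 5.0),   # long words -> class 1
    lambda x: int(x[1] < 0.25),  # few function words -> class 1
]

def meta_features(x):
    # The meta-level input is the tuple of base-classifier predictions.
    return tuple(m(x) for m in base_models)

# Meta-classifier: majority label observed in training for each pattern
# of base predictions (a minimal stand-in for, e.g., a logistic
# regression meta-learner).
votes = defaultdict(Counter)
for x, y in train:
    votes[meta_features(x)][y] += 1

def meta_predict(x):
    return votes[meta_features(x)].most_common(1)[0][0]

accuracy = sum(meta_predict(x) == y for x, y in test) / len(test)
print(accuracy)  # → 1.0 on this toy data
```

In practice the base predictions fed to the meta-classifier are produced by cross-validation on the training set, so the meta-learner never sees predictions on data the base models were fit to.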
The Best of Both Worlds: Lexical Resources To Improve Low-Resource Part-of-Speech Tagging
In natural language processing, the deep learning revolution has shifted the
focus from conventional hand-crafted symbolic representations to dense inputs,
which are adequate representations learned automatically from corpora. However,
particularly when working with low-resource languages, small amounts of
symbolic lexical resources such as user-generated lexicons are often available
even when gold-standard corpora are not. Such additional linguistic information
is often neglected, however, and recent neural approaches to cross-lingual
tagging typically rely only on word and subword embeddings. While these
representations are effective, our recent work has shown clear benefits of
combining the best of both worlds: integrating conventional lexical information
improves neural cross-lingual part-of-speech (PoS) tagging. However, little is
known on how complementary such additional information is, and to what extent
improvements depend on the coverage and quality of these external resources.
This paper seeks to fill this gap by providing the first thorough analysis on
the contributions of lexical resources for cross-lingual PoS tagging in neural
times. Comment: Under review for Natural Language Engineering
Varieties
This text is devoted to the theory of varieties, which provides an important
tool, based in universal algebra, for the classification of regular languages.
In the introductory section, we present a number of examples that illustrate
and motivate the fundamental concepts. We do this for the most part without
proofs, and often without precise definitions, leaving these to the formal
development of the theory that begins in Section 2. Our presentation of the
theory draws heavily on the work of Gehrke, Grigorieff and Pin (2008) on the
equational theory of lattices of regular languages. In the subsequent sections
we consider in more detail aspects of varieties that were only briefly evoked
in the introduction: decidability, operations on languages, and
characterizations in formal logic. Comment: This is a chapter in an upcoming Handbook of Automata Theory
On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing
Different languages might have different word orders. In this paper, we
investigate cross-lingual transfer and posit that an order-agnostic model will
perform better when transferring to distant foreign languages. To test our
hypothesis, we train dependency parsers on an English corpus and evaluate their
transfer performance on 30 other languages. Specifically, we compare encoders
and decoders based on Recurrent Neural Networks (RNNs) and modified
self-attentive architectures. The former relies on sequential information while
the latter is more flexible at modeling word order. Rigorous experiments and
detailed analysis show that RNN-based architectures transfer well to languages
that are close to English, while self-attentive models have better overall
cross-lingual transferability and perform especially well on distant languages. Comment: Accepted at NAACL-2019
Learning Graph Weighted Models on Pictures
Graph Weighted Models (GWMs) have recently been proposed as a natural
generalization of weighted automata over strings and trees to arbitrary
families of labeled graphs (and hypergraphs). A GWM generically associates a
labeled graph with a tensor network and computes a value by successive
contractions directed by its edges. In this paper, we consider the problem of
learning GWMs defined over the graph family of pictures (or 2-dimensional
words). As a proof of concept, we consider regression and classification tasks
over the simple Bars & Stripes and Shifting Bits picture languages and provide
an experimental study investigating whether these languages can be learned in
the form of a GWM from positive and negative examples using gradient-based
methods. Our results suggest that this is indeed possible and that
investigating the use of gradient-based methods to learn picture series and
functions computed by GWMs over other families of graphs could be a fruitful
direction. Comment: International Conference on Grammatical Inference 2018 (v2: camera-ready)
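Bars & Stripes, one of the two picture languages named in the abstract, is a standard toy dataset: a binary picture belongs to the language iff every row is constant ("stripes") or every column is constant ("bars"). A membership check, and an enumeration that yields positive and negative training examples of the kind the paper's gradient-based experiments use, can be sketched as:

```python
import itertools

def in_bars_and_stripes(pic):
    """A picture (list of rows of 0/1) is in Bars & Stripes iff all rows
    are constant or all columns are constant."""
    rows_constant = all(len(set(row)) == 1 for row in pic)
    cols_constant = all(len(set(col)) == 1 for col in zip(*pic))
    return rows_constant or cols_constant

# Enumerate all 3x3 binary pictures and count the members; members and
# non-members serve as positive and negative examples respectively.
n = 3
count = 0
for bits in itertools.product((0, 1), repeat=n * n):
    pic = [bits[i * n:(i + 1) * n] for i in range(n)]
    count += in_bars_and_stripes(pic)
print(count)  # → 14 (8 striped + 8 barred - 2 counted twice: all-0, all-1)
```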
One-Shot Neural Cross-Lingual Transfer for Paradigm Completion
We present a novel cross-lingual transfer method for paradigm completion, the
task of mapping a lemma to its inflected forms, using a neural encoder-decoder
model, the state of the art for the monolingual task. We use labeled data from
a high-resource language to increase performance on a low-resource language. In
experiments on 21 language pairs from four different language families, we
obtain up to 58% higher accuracy than without transfer and show that even
zero-shot and one-shot learning are possible. We further find that the degree
of language relatedness strongly influences the ability to transfer
morphological knowledge. Comment: Accepted at ACL 2017
At the boundary between biological and cultural evolution: The origin of surname distributions
Surnames and nonrecombining alleles are inherited from a single parent in a
highly similar way. A simple birth-death model with mutations can accurately
describe this process. Exponentially growing and constant populations are
investigated, and we study how different compositions of the founder
populations can be observed in present-day diversity distributions. We analyse
different quantities in the statistically stationary state, both through
analytic and numerical methods. Our results compare favourably to field data
for family sizes in several countries. We discuss the relationship between the
distribution of surnames and the genetic diversity of a population. Comment: 29 pages, 7 figures, submitted to J. Theor. Biol.
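The birth-death-with-mutation picture in the abstract is easy to simulate: each child inherits the parent's surname except with a small mutation probability, which introduces a brand-new surname. The sketch below is a minimal toy version (all parameter values illustrative, constant population size only on average), not the paper's analytic model.

```python
import random
from collections import Counter

random.seed(7)

mu = 0.01                        # surname mutation rate per birth (illustrative)
population = list(range(100))    # 100 founders, all with distinct surnames
next_name = 100                  # next unused surname label

for generation in range(50):
    offspring = []
    for name in population:
        # Each individual leaves 0, 1, or 2 children with equal
        # probability, so the population is constant in expectation.
        for _ in range(random.choice((0, 1, 2))):
            if random.random() < mu:
                offspring.append(next_name)   # mutation: brand-new surname
                next_name += 1
            else:
                offspring.append(name)        # surname inherited
    population = offspring

# Family-size distribution: how many surnames survive, and how large
# the biggest families are after drift has pruned the founders.
sizes = sorted(Counter(population).values(), reverse=True)
print(len(sizes), sizes[:5])
```

Repeating such runs and histogramming `sizes` is what gets compared against field data on family sizes; the paper does this analytically for both constant and exponentially growing populations.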