92,483 research outputs found
A general method for the statistical evaluation of typological distributions
The distribution of linguistic structures in the world is the joint product of universal principles, inheritance from ancestor languages, language contact, social structures, and random fluctuation. This paper proposes a method for evaluating the relative significance of each factor, and in particular of universal principles, via regression modeling: statistical evidence for universal principles is found if the odds for families to have skewed responses (e.g. all or most members have postnominal relative clauses), as opposed to an opposite skewing or no skewing at all, are significantly higher under some condition (e.g. VO order) than under another, independently of other factors.
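As an illustrative sketch of the kind of quantity the abstract describes, the odds of family-level skewing under each condition can be computed and compared as an odds ratio. All data and numbers below are invented for the example; the paper's actual method embeds this in a full regression model.

```python
from collections import Counter

# Hypothetical toy data: each family is summarized by its dominant word
# order condition ("VO" or "OV") and whether its members' responses are
# skewed toward postnominal relative clauses ("post"), skewed the other
# way ("pre"), or not skewed ("none").
families = [
    ("VO", "post"), ("VO", "post"), ("VO", "none"), ("VO", "post"),
    ("OV", "pre"), ("OV", "none"), ("OV", "post"), ("OV", "pre"),
]

def skew_odds(condition):
    """Odds that a family under `condition` is skewed toward 'post'
    rather than skewed the other way or not skewed at all."""
    counts = Counter(skew for cond, skew in families if cond == condition)
    favorable = counts["post"]
    other = counts["pre"] + counts["none"]
    return favorable / other

# The criterion sketched in the abstract: evidence for a universal
# principle if these odds differ markedly across conditions (VO vs. OV);
# the paper assesses significance within a regression model.
odds_ratio = skew_odds("VO") / skew_odds("OV")
print(odds_ratio)  # → 9.0 on this toy data
```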
Topological Analysis of Syntactic Structures
We use the persistent homology method of topological data analysis and
dimensional analysis techniques to study data of syntactic structures of world
languages. We analyze relations between syntactic parameters in terms of
dimensionality, of hierarchical clustering structures, and of non-trivial
loops. We show there are relations that hold across language families and
additional relations that are family-specific. We then analyze the trees
describing the merging structure of persistent connected components for
languages in different language families and we show that they partly correlate
to historical phylogenetic trees but with significant differences. We also show
the existence of interesting non-trivial persistent first homology groups in
various language families. We give examples where explicit generators for the
persistent first homology can be identified, some of which appear to correspond
to homoplasy phenomena, while others may have an explanation in terms of
historical linguistics, corresponding to known cases of syntactic borrowing
across different language subfamilies. Comment: 83 pages, LaTeX, 44 figures
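The "merging structure of persistent connected components" mentioned above is the degree-zero part of persistent homology, and it can be sketched with a union-find sweep over edges of a distance filtration. Everything below is a toy illustration (made-up languages encoded as binary syntactic parameter vectors, Hamming distance), not the paper's pipeline.

```python
from itertools import combinations

# Hypothetical toy data: languages as binary syntactic parameter vectors.
langs = {
    "A": (1, 0, 1, 1), "B": (1, 0, 1, 0),
    "C": (0, 1, 0, 0), "D": (0, 1, 1, 0),
}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# Union-find over edges sorted by increasing distance: each union
# records the filtration value at which a connected component dies,
# i.e. the persistent-H0 "merging structure" (a dendrogram of heights).
parent = {name: name for name in langs}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

merge_heights = []
edges = sorted(
    (hamming(langs[a], langs[b]), a, b)
    for a, b in combinations(langs, 2)
)
for d, a, b in edges:
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb
        merge_heights.append(d)  # a component dies at filtration value d

print(merge_heights)  # → [1, 1, 2] on this toy data
```

Comparing such merge trees across families, as the paper does against historical phylogenies, only requires running this sweep per family and aligning the resulting dendrograms.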
An Empirical Study on Crosslingual Transfer in Probabilistic Topic Models
Probabilistic topic modeling is a popular choice as the first step of
crosslingual tasks to enable knowledge transfer and extract multilingual
features. While many multilingual topic models have been developed, their
assumptions on the training corpus are quite varied, and it is not clear how
well the models can be applied under various training conditions. In this
paper, we systematically study the knowledge transfer mechanisms behind
different multilingual topic models, and through a broad set of experiments
with four models on ten languages, we provide empirical insights that can
inform the selection and future development of multilingual topic models.
Native Language Identification using Stacked Generalization
Ensemble methods using multiple classifiers have proven to be the most
successful approach for the task of Native Language Identification (NLI),
achieving the current state of the art. However, a systematic examination of
ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble
architectures such as classifier stacking have not been closely evaluated. We
present a set of experiments using three ensemble-based models, testing each
with multiple configurations and algorithms. This includes a rigorous
application of meta-classification models for NLI, achieving state-of-the-art
results on three datasets from different languages. We also present the first
use of statistical significance testing for comparing NLI systems, showing that
our results are significantly better than the previous state of the art. We
make available a collection of test set predictions to facilitate future
statistical tests.
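Classifier stacking, the technique the abstract evaluates, trains a meta-classifier on the outputs of base classifiers rather than on the raw features. The sketch below is a deliberately minimal, self-contained illustration with invented data and rule-based base models; the paper's systems use real NLI features and learned classifiers at both levels.

```python
from collections import Counter, defaultdict

# Toy binary task: each instance is (avg_word_len, function_word_rate).
train = [((4.2, 0.31), 0), ((5.8, 0.22), 1), ((4.0, 0.35), 0), ((6.1, 0.19), 1)]
test = [((5.9, 0.21), 1), ((4.1, 0.33), 0)]

# Two weak base classifiers, each a one-feature threshold rule.
base_models = [
    lambda x: int(x[0] > 5.0),   # long words -> class 1
    lambda x: int(x[1] < 0.25),  # few function words -> class 1
]

def meta_features(x):
    # The meta-level input is the tuple of base-classifier predictions.
    return tuple(m(x) for m in base_models)

# Meta-classifier: majority label observed in training for each pattern
# of base predictions (a minimal stand-in for, e.g., a logistic
# regression meta-learner).
votes = defaultdict(Counter)
for x, y in train:
    votes[meta_features(x)][y] += 1

def meta_predict(x):
    return votes[meta_features(x)].most_common(1)[0][0]

accuracy = sum(meta_predict(x) == y for x, y in test) / len(test)
print(accuracy)  # → 1.0 on this toy data
```

In practice the base predictions fed to the meta-classifier are produced by cross-validation on the training set, so the meta-learner never sees predictions on data the base models were fit to.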
The Best of Both Worlds: Lexical Resources To Improve Low-Resource Part-of-Speech Tagging
In natural language processing, the deep learning revolution has shifted the
focus from conventional hand-crafted symbolic representations to dense inputs,
which are adequate representations learned automatically from corpora. However,
particularly when working with low-resource languages, small amounts of
symbolic lexical resources such as user-generated lexicons are often available
even when gold-standard corpora are not. Such additional linguistic information
is often neglected, however, and recent neural approaches to cross-lingual
tagging typically rely only on word and subword embeddings. While these
representations are effective, our recent work has shown clear benefits of
combining the best of both worlds: integrating conventional lexical information
improves neural cross-lingual part-of-speech (PoS) tagging. However, little is
known on how complementary such additional information is, and to what extent
improvements depend on the coverage and quality of these external resources.
This paper seeks to fill this gap by providing the first thorough analysis on
the contributions of lexical resources for cross-lingual PoS tagging in neural
times. Comment: Under review for Natural Language Engineering
Varieties
This text is devoted to the theory of varieties, which provides an important
tool, based in universal algebra, for the classification of regular languages.
In the introductory section, we present a number of examples that illustrate
and motivate the fundamental concepts. We do this for the most part without
proofs, and often without precise definitions, leaving these to the formal
development of the theory that begins in Section 2. Our presentation of the
theory draws heavily on the work of Gehrke, Grigorieff and Pin (2008) on the
equational theory of lattices of regular languages. In the subsequent sections
we consider in more detail aspects of varieties that were only briefly evoked
in the introduction: decidability, operations on languages, and
characterizations in formal logic. Comment: This is a chapter in an upcoming Handbook of Automata Theory
On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing
Different languages might have different word orders. In this paper, we
investigate cross-lingual transfer and posit that an order-agnostic model will
perform better when transferring to distant foreign languages. To test our
hypothesis, we train dependency parsers on an English corpus and evaluate their
transfer performance on 30 other languages. Specifically, we compare encoders
and decoders based on Recurrent Neural Networks (RNNs) and modified
self-attentive architectures. The former relies on sequential information while
the latter is more flexible at modeling word order. Rigorous experiments and
detailed analysis show that RNN-based architectures transfer well to languages
that are close to English, while self-attentive models have better overall
cross-lingual transferability and perform especially well on distant languages. Comment: Accepted at NAACL-2019
Learning Graph Weighted Models on Pictures
Graph Weighted Models (GWMs) have recently been proposed as a natural
generalization of weighted automata over strings and trees to arbitrary
families of labeled graphs (and hypergraphs). A GWM generically associates a
labeled graph with a tensor network and computes a value by successive
contractions directed by its edges. In this paper, we consider the problem of
learning GWMs defined over the graph family of pictures (or 2-dimensional
words). As a proof of concept, we consider regression and classification tasks
over the simple Bars & Stripes and Shifting Bits picture languages and provide
an experimental study investigating whether these languages can be learned in
the form of a GWM from positive and negative examples using gradient-based
methods. Our results suggest that this is indeed possible and that
investigating the use of gradient-based methods to learn picture series and
functions computed by GWMs over other families of graphs could be a fruitful
direction. Comment: International Conference on Grammatical Inference 2018 (v2: camera-ready)
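Bars & Stripes, one of the two picture languages named in the abstract, is a standard toy dataset: a binary picture belongs to the language iff every row is constant ("stripes") or every column is constant ("bars"). A membership check, and an enumeration that yields positive and negative training examples of the kind the paper's gradient-based experiments use, can be sketched as:

```python
import itertools

def in_bars_and_stripes(pic):
    """A picture (list of rows of 0/1) is in Bars & Stripes iff all rows
    are constant or all columns are constant."""
    rows_constant = all(len(set(row)) == 1 for row in pic)
    cols_constant = all(len(set(col)) == 1 for col in zip(*pic))
    return rows_constant or cols_constant

# Enumerate all 3x3 binary pictures and count the members; members and
# non-members serve as positive and negative examples respectively.
n = 3
count = 0
for bits in itertools.product((0, 1), repeat=n * n):
    pic = [bits[i * n:(i + 1) * n] for i in range(n)]
    count += in_bars_and_stripes(pic)
print(count)  # → 14 (8 striped + 8 barred - 2 counted twice: all-0, all-1)
```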
One-Shot Neural Cross-Lingual Transfer for Paradigm Completion
We present a novel cross-lingual transfer method for paradigm completion, the
task of mapping a lemma to its inflected forms, using a neural encoder-decoder
model, the state of the art for the monolingual task. We use labeled data from
a high-resource language to increase performance on a low-resource language. In
experiments on 21 language pairs from four different language families, we
obtain up to 58% higher accuracy than without transfer and show that even
zero-shot and one-shot learning are possible. We further find that the degree
of language relatedness strongly influences the ability to transfer
morphological knowledge. Comment: Accepted at ACL 2017
At the boundary between biological and cultural evolution: The origin of surname distributions
Surnames and nonrecombining alleles are inherited from a single parent in a
highly similar way. A simple birth-death model with mutations can accurately
describe this process. Exponentially growing and constant populations are
investigated, and we study how different compositions of the founder
populations can be observed in present-day diversity distributions. We analyse
different quantities in the statistically stationary state, both through
analytic and numerical methods. Our results compare favourably to field data
for family sizes in several countries. We discuss the relationship between the
distribution of surnames and the genetic diversity of a population. Comment: 29 pages, 7 figures, submitted to J. Theor. Biol.
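The birth-death-with-mutation picture in the abstract is easy to simulate: each child inherits the parent's surname except with a small mutation probability, which introduces a brand-new surname. The sketch below is a minimal toy version (all parameter values illustrative, constant population size only on average), not the paper's analytic model.

```python
import random
from collections import Counter

random.seed(7)

mu = 0.01                        # surname mutation rate per birth (illustrative)
population = list(range(100))    # 100 founders, all with distinct surnames
next_name = 100                  # next unused surname label

for generation in range(50):
    offspring = []
    for name in population:
        # Each individual leaves 0, 1, or 2 children with equal
        # probability, so the population is constant in expectation.
        for _ in range(random.choice((0, 1, 2))):
            if random.random() < mu:
                offspring.append(next_name)   # mutation: brand-new surname
                next_name += 1
            else:
                offspring.append(name)        # surname inherited
    population = offspring

# Family-size distribution: how many surnames survive, and how large
# the biggest families are after drift has pruned the founders.
sizes = sorted(Counter(population).values(), reverse=True)
print(len(sizes), sizes[:5])
```

Repeating such runs and histogramming `sizes` is what gets compared against field data on family sizes; the paper does this analytically for both constant and exponentially growing populations.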