10,318 research outputs found
Tracking Typological Traits of Uralic Languages in Distributed Language Representations
Although linguistic typology has a long history, computational approaches
have only recently gained popularity. The use of distributed representations in
computational linguistics has also become increasingly popular. A recent
development is to learn distributed representations of language, such that
typologically similar languages are spatially close to one another. Although
empirical successes have been shown for such language representations, they
have not been subjected to much typological probing. In this paper, we first
look at whether this type of language representations are empirically useful
for model transfer between Uralic languages in deep neural networks. We then
investigate which typological features are encoded in these representations by
attempting to predict features in the World Atlas of Language Structures, at
various stages of fine-tuning of the representations. We focus on Uralic
languages, and find that some typological traits can be automatically inferred
with accuracies well above a strong baseline.Comment: Finnish abstract included in the pape
Problems of methodology and explanation in word order universals research
Ever since the publication of Greenberg 1963, word order typologists have attempted to formulate and refine implicational universals of word order so as to characterize the restricted distribution of certain word order patterns, and in some cases have also attempted to develop general principles to explain the existence of those universals
A general method for the statistical evaluation of typological distributions
The distribution of linguistic structures in the world is the joint product of universal principles, inheritance from ancestor languages, language contact, social structures, and random fluctuation. This paper proposes a method for evaluating the relative significance of each factor — and in particular, of universal principles — via regression modeling: statistical evidence for universal principles is found if the odds for families to have skewed responses (e.g. all or most members have postnominal relative clauses) as opposed to having an opposite response skewing or no skewing at all, is significantly higher for some condition (e.g. VO order) than for another condition, independently of other factors
Learning Language Representations for Typology Prediction
One central mystery of neural NLP is what neural models "know" about their
subject matter. When a neural machine translation system learns to translate
from one language to another, does it learn the syntax or semantics of the
languages? Can this knowledge be extracted from the system to fill holes in
human scientific knowledge? Existing typological databases contain relatively
full feature specifications for only a few hundred languages. Exploiting the
existence of parallel texts in more than a thousand languages, we build a
massive many-to-one neural machine translation (NMT) system from 1017 languages
into English, and use this to predict information missing from typological
databases. Experiments show that the proposed method is able to infer not only
syntactic, but also phonological and phonetic inventory features, and improves
over a baseline that has access to information about the languages' geographic
and phylogenetic neighbors.Comment: EMNLP 201
Thoughts about disordered thinking: measuring and quantifying the laws of order and disorder
Peer ReviewedPostprint (author's final draft
The placement of the head that maximizes predictability. An information theoretic approach
The minimization of the length of syntactic dependencies is a
well-established principle of word order and the basis of a mathematical theory
of word order. Here we complete that theory from the perspective of information
theory, adding a competing word order principle: the maximization of
predictability of a target element. These two principles are in conflict: to
maximize the predictability of the head, the head should appear last, which
maximizes the costs with respect to dependency length minimization. The
implications of such a broad theoretical framework to understand the
optimality, diversity and evolution of the six possible orderings of subject,
object and verb are reviewed.Comment: in press in Glottometric
Token-based typology and word order entropy: A study based on universal dependencies
The present paper discusses the benefits and challenges of token-based typology, which takes into account the frequencies of words and constructions in language use. This approach makes it possible to introduce new criteria for language classification, which would be difficult or impossible to achieve with the traditional, type-based approach. This point is illustrated by several quantitative studies of word order variation, which can be measured as entropy at different levels of granularity. I argue that this variation can be explained by general functional mechanisms and pressures, which manifest themselves in language use, such as optimization of processing (including avoidance of ambiguity) and grammaticalization of predictable units occurring in chunks. The case studies are based on multilingual corpora, which have been parsed using the Universal Dependencies annotation scheme
- …