2,262 research outputs found
Syntactic Parameters and a Coding Theory Perspective on Entropy and Complexity of Language Families
We present a simple computational approach to assigning a measure of complexity and information/entropy to families of natural languages, based on syntactic parameters and the theory of error correcting codes. We associate to each language a binary string of syntactic parameters and to a language family a binary code, with code words the binary string associated to each language. We then evaluate the code parameters (rate and relative minimum distance) and the position of the parameters with respect to the asymptotic bound of error correcting codes and the Gilbert–Varshamov bound. These bounds are, respectively, related to the Kolmogorov complexity and the Shannon entropy of the code and this gives us a computationally simple way to obtain estimates on the complexity and information, not of individual languages but of language families. This notion of complexity is related, from the linguistic point of view to the degree of variability of syntactic parameter across languages belonging to the same (historical) family
Principles and Parameters: a coding theory perspective
We propose an approach to Longobardi's parametric comparison method (PCM) via
the theory of error-correcting codes. One associates to a collection of
languages to be analyzed with the PCM a binary (or ternary) code with one code
words for each language in the family and each word consisting of the binary
values of the syntactic parameters of the language, with the ternary case
allowing for an additional parameter state that takes into account phenomena of
entailment of parameters. The code parameters of the resulting code can be
compared with some classical bounds in coding theory: the asymptotic bound, the
Gilbert-Varshamov bound, etc. The position of the code parameters with respect
to some of these bounds provides quantitative information on the variability of
syntactic parameters within and across historical-linguistic families. While
computations carried out for languages belonging to the same family yield codes
below the GV curve, comparisons across different historical families can give
examples of isolated codes lying above the asymptotic bound.Comment: 11 pages, LaTe
Syntactic Structures and Code Parameters
We assign binary and ternary error-correcting codes to the data of syntactic
structures of world languages and we study the distribution of code points in
the space of code parameters. We show that, while most codes populate the lower
region approximating a superposition of Thomae functions, there is a
substantial presence of codes above the Gilbert-Varshamov bound and even above
the asymptotic bound and the Plotkin bound. We investigate the dynamics induced
on the space of code parameters by spin glass models of language change, and
show that, in the presence of entailment relations between syntactic parameters
the dynamics can sometimes improve the code. For large sets of languages and
syntactic data, one can gain information on the spin glass dynamics from the
induced dynamics in the space of code parameters.Comment: 14 pages, LaTeX, 12 png figure
Token-based typology and word order entropy: A study based on universal dependencies
The present paper discusses the benefits and challenges of token-based typology, which takes into account the frequencies of words and constructions in language use. This approach makes it possible to introduce new criteria for language classification, which would be difficult or impossible to achieve with the traditional, type-based approach. This point is illustrated by several quantitative studies of word order variation, which can be measured as entropy at different levels of granularity. I argue that this variation can be explained by general functional mechanisms and pressures, which manifest themselves in language use, such as optimization of processing (including avoidance of ambiguity) and grammaticalization of predictable units occurring in chunks. The case studies are based on multilingual corpora, which have been parsed using the Universal Dependencies annotation scheme
Computational and Robotic Models of Early Language Development: A Review
We review computational and robotics models of early language learning and
development. We first explain why and how these models are used to understand
better how children learn language. We argue that they provide concrete
theories of language learning as a complex dynamic system, complementing
traditional methods in psychology and linguistics. We review different modeling
formalisms, grounded in techniques from machine learning and artificial
intelligence such as Bayesian and neural network approaches. We then discuss
their role in understanding several key mechanisms of language development:
cross-situational statistical learning, embodiment, situated social
interaction, intrinsically motivated learning, and cultural evolution. We
conclude by discussing future challenges for research, including modeling of
large-scale empirical data about language acquisition in real-world
environments.
Keywords: Early language learning, Computational and robotic models, machine
learning, development, embodiment, social interaction, intrinsic motivation,
self-organization, dynamical systems, complexity.Comment: to appear in International Handbook on Language Development, ed. J.
Horst and J. von Koss Torkildsen, Routledg
The forms and meanings of grammatical markers support efficient communication
Functionalist accounts of language suggest that forms are paired with meanings in ways that support efficient communication. Previous work on grammatical marking suggests that word forms have lengths that enable efficient production, and work on the semantic typology of the lexicon suggests that word meanings represent efficient partitions of semantic space. Here we establish a theoretical link between these two lines of work and present an information-theoretic analysis that captures how communicative pressures influence both form and meaning. We apply our approach to the grammatical features of number, tense, and evidentiality and show that the approach explains both which systems of feature values are attested across languages and the relative lengths of the forms for those feature values. Our approach shows that general information-theoretic principles can capture variation in both form and meaning across languages
- …