The entropy of words: learnability and expressivity across more than 1000 languages
The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and the language sciences more generally. Information theory gives us tools to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present two main findings. First, word entropies display relatively narrow, unimodal distributions; there is no language in our sample with a unigram entropy of less than six bits/word. We argue that this is in line with information-theoretic models of communication: languages are held in a narrow range by two fundamental pressures, word learnability and word expressivity, with a potential bias towards expressivity. Second, there is a strong linear relationship between unigram entropies and entropy rates. The entropy difference between words with and without co-textual information is narrowly distributed around ca. three bits/word. In other words, knowing the preceding text reduces the uncertainty of words by roughly the same amount across the languages of the world.
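For concreteness, here is a minimal Python sketch of the quantity being estimated: the plug-in (maximum likelihood) unigram word entropy in bits per word. The function name and toy text are illustrative only, and the paper relies on bias-corrected estimators precisely because this naive estimate depends strongly on text size.

```python
from collections import Counter
import math

def unigram_entropy(words):
    """Naive plug-in estimate of unigram word entropy:
    H = -sum_w p(w) * log2(p(w)), where p(w) is the relative
    frequency of word w. Underestimates true entropy on small samples."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

toy = "the cat sat on the mat and the dog sat on the rug".split()
print(f"{unigram_entropy(toy):.2f} bits/word")
```

The entropy rate discussed in the abstract is the corresponding per-word uncertainty once the preceding co-text is taken into account; the reported ca. three bits/word is the gap between the two measures.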
Empirical approaches for investigating the origins of structure in speech
In language evolution research, the use of computational and experimental methods to investigate the emergence of structure in language is exploding. In this review, we look exclusively at work exploring the emergence of structure in speech, both on a categorical level (what drives the emergence of an inventory of individual speech sounds) and on a combinatorial level (how these individual speech sounds emerge and are reused as part of larger structures). We show that computational and experimental methods for investigating population-level processes can be used effectively to explore and measure the effects of learning, communication and transmission on the emergence of structure in speech. We also look at work on child language acquisition as a tool for generating and validating hypotheses for the emergence of speech categories. Further, we review the effects of noise, iconicity and production on the emergence of structure in speech.
Conceptual similarity and communicative need shape colexification: An experimental study
Colexification refers to the phenomenon of multiple meanings sharing one word in a language. Cross-linguistic colexification patterns have been shown to be largely predictable, as similar concepts are often colexified. We test a recent claim that, beyond this general tendency, communicative needs play an important role in shaping colexification patterns. We approach this question by means of a series of human experiments, using an artificial language communication game paradigm. Our results across four experiments match the previous cross-linguistic findings: all other things being equal, speakers do prefer to colexify similar concepts. However, we also find evidence supporting the communicative need hypothesis: when faced with a frequent need to distinguish similar pairs of meanings, speakers adjust their colexification preferences to maintain communicative efficiency, and avoid colexifying those similar meanings which need to be distinguished in communication. This research provides further evidence to support the argument that languages are shaped by the needs and preferences of their speakers.
Optimal coding and the origins of Zipfian laws
The problem of compression in standard information theory consists of assigning codes as short as possible to numbers. Here we consider the problem of optimal coding under an arbitrary coding scheme, and show that it predicts Zipf's law of abbreviation, namely a tendency in natural languages for more frequent words to be shorter. We apply this result to investigate optimal coding also under so-called non-singular coding, a scheme where unique segmentation is not warranted but each code stands for a distinct number. Optimal non-singular coding predicts that the length of a word should grow approximately as the logarithm of its frequency rank, which is again consistent with Zipf's law of abbreviation. Optimal non-singular coding in combination with the maximum entropy principle also predicts Zipf's rank-frequency distribution. Furthermore, our findings on optimal non-singular coding challenge common beliefs about random typing. It turns out that random typing is in fact an optimal coding process, in stark contrast with the common assumption that it is detached from cost-cutting considerations. Finally, we discuss the implications of optimal coding for the construction of a compact theory of Zipfian laws and other linguistic laws.
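As an illustration of the non-singular case, here is a short Python sketch (my own construction, not code from the paper): an optimal non-singular code simply hands the shortest available distinct strings to words in order of decreasing frequency, and the resulting code length grows roughly like the logarithm of the frequency rank.

```python
import itertools
import math

def optimal_nonsingular_code(n_words, alphabet="ab"):
    """Enumerate all strings over `alphabet` by increasing length and
    assign them to frequency ranks 1..n_words (rank 1 = most frequent).
    Codes are distinct, but concatenations need not segment uniquely,
    which is exactly the non-singular setting."""
    codes = []
    for length in itertools.count(1):
        for chars in itertools.product(alphabet, repeat=length):
            codes.append("".join(chars))
            if len(codes) == n_words:
                return codes

codes = optimal_nonsingular_code(1000)
for rank in (1, 2, 10, 100, 1000):
    print(rank, codes[rank - 1], len(codes[rank - 1]),
          round(math.log2(rank), 2))  # code length tracks log2(rank)
```

For a binary alphabet the printed lengths stay within a small constant of log2 of the rank, which is the logarithmic growth the abstract refers to.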
The cross-linguistic performance of word segmentation models over time
We select three word segmentation models with psycholinguistic foundations (transitional probabilities, the diphone-based segmenter, and PUDDLE), which track phoneme co-occurrence and positional frequencies in input strings and, in the case of PUDDLE, build lexical and diphone inventories. The models are evaluated on caregiver utterances in 132 CHILDES corpora representing 28 languages and 11.9 million words. PUDDLE shows the best performance overall, albeit with wide cross-linguistic variation. We explore the reasons for this variation, fitting regression models to performance scores with linguistic properties that capture lexico-phonological characteristics of the input: word length, utterance length, diversity in the lexicon, the frequency of one-word utterances, the regularity of phoneme patterns at word boundaries, and the distribution of diphones in each language. Together, these properties explain four-tenths of the observed variation in segmentation performance, a strong outcome and a solid foundation for studying further variables that make the segmentation task difficult.
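As a concrete reference point, here is a minimal Python sketch of the first of the three models: a transitional-probability segmenter that posits word boundaries at local minima of forward transitional probability. The toy corpus and all names are invented for illustration; the paper evaluates full implementations on CHILDES data.

```python
from collections import Counter

def train(utterances):
    """Count phoneme unigrams and bigrams within each utterance."""
    uni, bi = Counter(), Counter()
    for utt in utterances:
        uni.update(utt)
        bi.update(zip(utt, utt[1:]))
    return uni, bi

def tp_segment(utt, uni, bi):
    """Posit a word boundary wherever the forward transitional
    probability TP(x -> y) = count(xy) / count(x) hits a local minimum."""
    tps = [bi[x, y] / uni[x] for x, y in zip(utt, utt[1:])]
    cuts = [i + 1 for i in range(1, len(tps) - 1)
            if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]]
    return ["".join(utt[a:b]) for a, b in zip([0] + cuts, cuts + [len(utt)])]

# Toy input: two invented 'words' concatenated in varying orders.
corpus = ([list("tikupado")] * 30 + [list("padotiku")] * 30
          + [list("tiku")] * 10 + [list("pado")] * 10)
uni, bi = train(corpus)
print(tp_segment(list("tikupado"), uni, bi))  # -> ['tiku', 'pado']
```

Within-word transitions recur and keep TP high, while transitions spanning a word boundary vary with context and dip; that boundary regularity is one of the input properties the regression analysis picks up.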
Optimization models of natural communication
A family of information-theoretic models of communication was introduced more than a decade ago to explain the origins of Zipf's law for word frequencies. The family is based on a combination of two information-theoretic principles: maximization of the mutual information between forms and meanings, and minimization of the entropy of forms. The family also sheds light on the origins of three other patterns: the principle of contrast; a related vocabulary learning bias; and the meaning-frequency law. Here two important components of the family, namely the information-theoretic principles and the energy function that combines them linearly, are reviewed from the perspective of psycholinguistics, language learning, information theory and synergetic linguistics. The minimization of this linear function is linked to the problem of compression in standard information theory and might be tuned by self-organization.
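As a hedged reconstruction rather than a quotation: in this line of work, the linear combination of the two principles is typically written as an energy (cost) function over the signal set S and the meaning set R,

\[
\Omega(\lambda) = -\lambda\, I(S,R) + (1-\lambda)\, H(S), \qquad 0 \le \lambda \le 1,
\]

where minimizing \(\Omega\) rewards the mutual information between forms and meanings (the hearer's interest) while penalizing the entropy of forms (the speaker's effort), and \(\lambda\) weights the two principles against each other. Zipf's law for word frequencies emerges near a critical value of \(\lambda\) where the two pressures balance.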