Word forms are structured for efficient use
Zipf famously stated that, if natural language lexicons are structured for efficient communication, the words used most frequently should require the least effort. This observation explains the famous finding that the most frequent words in a language tend to be short. A related prediction is that, even among words of the same length, the most frequent word forms should be the ones that are easiest to produce and understand. Using orthography as a proxy for phonetics, we test this hypothesis on corpora of 96 languages from Wikipedia. We find that, across a variety of languages and language families and controlling for length, the most frequent forms in a language tend to be more orthographically well-formed and to have more orthographic neighbors than less frequent forms. We interpret this result as evidence that lexicons are structured by language usage pressures to facilitate efficient communication.

Keywords: Lexicon; Word frequency; Phonology; Communication; Efficiency

National Science Foundation (Grant ES/N0174041/1)
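One quantity the abstract relies on, a word's orthographic neighborhood size, can be sketched as a toy substitution-neighbor count (a common simplification; the paper's exact neighbor definition, lexicon, and frequency controls are not given here, and the small lexicon below is invented for illustration):

```python
import string

def neighbors(word, lexicon):
    """Count orthographic neighbors: lexicon entries that differ from
    `word` by exactly one substituted letter (same length)."""
    count = 0
    for i in range(len(word)):
        for ch in string.ascii_lowercase:
            if ch != word[i]:
                candidate = word[:i] + ch + word[i + 1:]
                if candidate in lexicon:
                    count += 1
    return count

# Toy lexicon: frequent short forms tend to sit in dense neighborhoods.
lexicon = {"cat", "bat", "hat", "cot", "cut", "dog", "dig", "xylyl"}
print(neighbors("cat", lexicon))    # bat, hat, cot, cut -> 4
print(neighbors("xylyl", lexicon))  # isolated form -> 0
```

A fuller test of the abstract's claim would then correlate such counts with corpus frequency while holding word length fixed.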
Disambiguatory Signals are Stronger in Word-initial Positions
Psycholinguistic studies of human word processing and lexical access provide
ample evidence of the preferred nature of word-initial versus word-final
segments, e.g., in terms of attention paid by listeners (greater) or the
likelihood of reduction by speakers (lower). This has led to the conjecture --
as in Wedel et al. (2019b), but common elsewhere -- that languages have evolved
to provide more information earlier in words than later. Information-theoretic
methods to establish such tendencies in lexicons have suffered from several
methodological shortcomings that leave open the question of whether this high
word-initial informativeness is actually a property of the lexicon or simply an
artefact of the incremental nature of recognition. In this paper, we point out
the confounds in existing methods for comparing the informativeness of segments
early in the word versus later in the word, and present several new measures
that avoid these confounds. When controlling for these confounds, we still find
evidence across hundreds of languages that indeed there is a cross-linguistic
tendency to front-load information in words.

Comment: Accepted at EACL 2021. Code is available at
https://github.com/tpimentelms/frontload-disambiguatio
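The idea that word-initial segments are more informative can be illustrated, in a deliberately simplified way, by comparing per-position Shannon entropy over a lexicon (this is one naive measure; the paper argues such measures need careful de-confounding, and the function name and toy data here are mine):

```python
import math
from collections import Counter

def positional_entropy(words, pos):
    """Shannon entropy (bits) of the segment distribution at position
    `pos`, computed over the words long enough to have that position."""
    segs = Counter(w[pos] for w in words if len(w) > pos)
    n = sum(segs.values())
    h = 0.0
    for c in segs.values():
        p = c / n
        h -= p * math.log2(p)
    return h

# Toy lexicon: initial segments vary freely, final segments do not.
words = ["pa", "ta", "ka", "ma", "na", "sa"]
print(positional_entropy(words, 0))  # log2(6) ~ 2.585 bits at position 0
print(positional_entropy(words, 1))  # 0.0 bits: every word ends in 'a'
```

Higher entropy at early positions corresponds to the front-loading tendency; the paper's contribution is showing the tendency survives once confounds of incremental recognition are removed.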
The emergence of word-internal repetition through iterated learning: Explaining the mismatch between learning biases and language design
The idea that natural language is shaped by biases in learning plays a key role in our understanding of how human language is structured, but its corollary that there should be a correspondence between typological generalisations and ease of acquisition is not always supported. For example, natural languages tend to avoid close repetitions of consonants within a word, but developmental evidence suggests that, if anything, words containing sound repetitions are more, not less, likely to be acquired than those without. In this study, we use word-internal repetition as a test case to provide a cultural evolutionary explanation of when and how learning biases impact on language design. Two artificial language experiments showed that adult speakers possess a bias for both consonant and vowel repetitions when learning novel words, but the effects of this bias were observable in language transmission only when there was a relatively high learning pressure on the lexicon. Based on these results, we argue that whether the design of a language reflects biases in learning depends on the relative strength of pressures from learnability and communication efficiency exerted on the linguistic system during cultural transmission.
Linguistic Laws and Compression in a Comparative Perspective: A Conceptual Review and Phylogenetic Test in Mammals
Over the last several decades, the application of 'Linguistic Laws', statistical regularities underlying the structure of language, to the study of human languages has exploded. These ideas, adopted from information theory and quantitative linguistics, have been useful in helping to understand the evolution of the underlying structures of communicative systems. Moreover, since the publication of a seminal article in 2010, the field has taken a comparative approach to assess the degree of similarity and difference underlying the organisation of communication systems across the natural world. In this thesis, I begin by surveying the state of the field as it pertains to the study of linguistic laws and compression in nonhuman animal communication systems. I then identify a number of theoretical and methodological gaps in the current literature and suggest ways in which these might be rectified to strengthen future conclusions and enable the pursuit of novel theoretical questions. In the second chapter, I undertake a phylogenetically controlled analysis that aims to assess the extent of conformity to Zipf's Law of Abbreviation in mammalian vocal repertoires. I test each individual repertoire and then examine the entire collection of repertoires together. I find mixed evidence of conformity to the Law of Abbreviation, and conclude with some implications of this work and future directions in which it might be extended.
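A conformity test of the kind described, where Zipf's Law of Abbreviation predicts a negative association between how often a call type is used and how long it lasts, can be sketched as a rank correlation (Spearman, implemented by hand here for self-containment; the thesis's phylogenetically controlled models are far more involved, and the repertoire numbers below are invented):

```python
def rank(values):
    """Average ranks (ties shared) for a sequence of numbers."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical repertoire: usage rate vs. mean call duration (seconds).
freq = [120, 75, 40, 12, 5]
dur = [0.2, 0.3, 0.5, 0.9, 1.4]
print(spearman(freq, dur))  # close to -1: shorter calls are used more
```

A strongly negative coefficient for a repertoire would count as conformity to the Law of Abbreviation; the mixed evidence reported above corresponds to some repertoires showing this pattern and others not.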
Learning homophones in context: Easy cases are favored in the lexicon of natural languages