132 research outputs found
Testing the robustness of laws of polysemy and brevity versus frequency
The pioneering research of G. K. Zipf on the relationship between word frequency and other word features led to the formulation of various linguistic laws. Here we focus on two of them: the meaning-frequency law, i.e. the tendency of more frequent words to be more polysemous, and the law of abbreviation, i.e. the tendency of more frequent words to be shorter. We evaluate the robustness of these laws in contexts where, to our knowledge, they have not yet been explored. The recovery of the laws in these new conditions supports the hypothesis that they originate from abstract mechanisms. Peer Reviewed. Postprint (author's final draft).
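As a rough illustration of what testing the law of abbreviation can look like, the sketch below computes a Kendall tau-a rank correlation between word frequency and word length on a toy token list. The function names, the toy text, and the choice of tau-a (ties contribute zero) are assumptions made for this example, not the procedure used in the paper above; a negative value is the direction the law predicts.

```python
from collections import Counter
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs; tied pairs add 0."""
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += (s > 0) - (s < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / len(pairs)


def abbreviation_correlation(tokens):
    """Correlate each word type's frequency with its length in characters."""
    freq = Counter(tokens)
    words = list(freq)
    return kendall_tau_a([freq[w] for w in words], [len(w) for w in words])


# Toy text (hypothetical, for illustration only).
toy_tokens = (
    "the cat sat on the mat and the dog sat on the extraordinary carpet".split()
)
tau = abbreviation_correlation(toy_tokens)
print(f"tau = {tau:.3f}")  # negative: frequent words tend to be shorter here
```

On real corpora, a tie-aware coefficient and a significance test would be needed, since most word types share the same low frequencies.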
The meaning-frequency law in Zipfian optimization models of communication
According to Zipf's meaning-frequency law, words that are more frequent tend
to have more meanings. Here it is shown that a linear dependency between the
frequency of a form and its number of meanings is found in a family of models
of Zipf's law for word frequencies. This is evidence for a weak version of the
meaning-frequency law. Interestingly, that weak law (a) is not an inevitable
property of the assumptions of the family and (b) is found at least in the
narrow regime where those models exhibit Zipf's law for word frequencies.
Zipf's laws of meaning in Catalan
In his pioneering research, G. K. Zipf formulated a couple of statistical
laws on the relationship between the frequency of a word and its number of
meanings: the law of meaning distribution, relating the number of meanings of
a word and its frequency rank, and the meaning-frequency law, relating the
frequency of a word with its number of meanings. Although these laws were
formulated more than half a century ago, they have only been investigated in a
few languages. Here we present the first study of these laws in Catalan.
We verify these laws in Catalan via the relationship among their exponents
and that of the rank-frequency law. We present a new protocol for the analysis
of these Zipfian laws that can be extended to other languages. We report the
first evidence of two marked regimes for these laws in written language and
speech, paralleling the two regimes in Zipf's rank-frequency law in large
multi-author corpora discovered in the early 2000s. Finally, the implications of
these two regimes are discussed.
Comment: 21 pages, 11 figures
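The "relationship among their exponents" mentioned in this abstract can be made explicit with a short derivation. The symbols below ($f$ for frequency, $i$ for frequency rank, $\mu$ for number of meanings, and the exponents $\alpha$, $\gamma$, $\delta$) follow common usage in the literature on Zipf's laws of meaning and are assumed here, since the abstract itself does not name them.

```latex
% Rank-frequency law and law of meaning distribution (assumed notation):
f \propto i^{-\alpha}, \qquad \mu \propto i^{-\gamma}.
% Inverting the first relation gives i \propto f^{-1/\alpha}; substituting:
\mu \propto \left(f^{-1/\alpha}\right)^{-\gamma} = f^{\gamma/\alpha}.
% Comparing with the meaning-frequency law \mu \propto f^{\delta}:
\delta = \frac{\gamma}{\alpha}.
```

With Zipf's classical values $\alpha \approx 1$ and $\gamma \approx 1/2$, this gives $\delta \approx 1/2$, so estimating any two of the exponents yields a consistency check on the third.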
The polysemy of the words that children learn over time
Here we study polysemy as a potential learning bias in vocabulary learning in children. We employ a massive set of transcriptions of conversations between children and adults in English, to analyze the evolution of mean polysemy in the words produced by children whose ages range between 10 and 60 months.
Our results show that mean polysemy in children increases over time in two phases, i.e. a fast growth until the 31st month followed by a slower tendency towards adult speech. In contrast, no dependency on time is found in adults. This may suggest that children have a preference for non-polysemous words in their early stages of vocabulary acquisition. Our hypothesis is twofold: (a) polysemy is a standalone bias or (b) polysemy is a side-effect of other biases. Interestingly, the bias for low polysemy described above weakens when controlling for syntactic category (noun, verb, adjective or adverb). The pattern of the evolution of polysemy suggests that both hypotheses may apply to some extent, and that (b) would originate from a combination of the well-known preference for nouns and the lower polysemy of nouns with respect to other syntactic categories. Peer Reviewed. Postprint (author's final draft).
General patterns and language variation: word frequencies across English, German, and Chinese
Cross-linguistic studies of concepts provide valuable insights for the investigation of the mental lexicon. Recent developments of cross-linguistic databases facilitate an exploration of a diverse set of languages on the basis of comparative concepts. These databases make use of a well-established reference catalog, the Concepticon, which is built from concept lists published in linguistics. A recently released feature of the Concepticon includes data on norms, ratings, and relations for words and concepts. The present study used data on word frequencies to test two hypotheses. First, I examined the assumption that related languages (i.e., English and German) share concepts with more similar frequencies than non-related languages (i.e., English and Chinese). Second, the variation of frequencies across both language pairs was explored to answer the question of whether the related languages share fewer concepts with a large difference in frequency than the non-related languages. The findings indicate that related languages show less variation in their frequencies. Where variation occurs, it seems to be due to cultural and structural differences. The implications of this study are far-reaching in that it exemplifies the use of cross-linguistic data for the study of the mental lexicon.
Linguistic Laws and Compression in a Comparative Perspective: A Conceptual Review and Phylogenetic Test in Mammals
Over the last several decades, the application of “Linguistic Laws” (statistical regularities underlying the structure of language) to studying human languages has exploded. These ideas, adopted from Information Theory and quantitative linguistics, have been useful in helping to understand the evolution of the underlying structures of communicative systems. Moreover, since the publication of a seminal article in 2010, the field has taken a comparative approach to assess the degree of similarities and differences underlying the organisation of communication systems across the natural world. In this thesis, I begin by surveying the state of the field as it pertains to the study of linguistic laws and compression in nonhuman animal communication systems. I subsequently identify a number of theoretical and methodological gaps in the current literature and suggest ways in which these might be rectified to strengthen future conclusions and enable the pursuit of novel theoretical questions. In the second chapter, I undertake a phylogenetically controlled analysis, which aims to demonstrate the extent of conformity to Zipf’s Law of Abbreviation in mammalian vocal repertoires. I test each individual repertoire, and then examine the entire collection of repertoires together. I find mixed evidence of conformity to the Law of Abbreviation, and conclude with some implications of this work and future directions in which it might be extended.
Hahahahaha, Duuuuude, Yeeessss!: A two-parameter characterization of stretchable words and the dynamics of mistypings and misspellings
Stretched words like 'heellllp' or 'heyyyyy' are a regular feature of spoken language, often used to emphasize or exaggerate the underlying meaning of the root word. While stretched words are rarely found in formal written language and dictionaries, they are prevalent within social media. In this paper, we examine the frequency distributions of 'stretchable words' found in roughly 100 billion tweets authored over an 8 year period. We introduce two central parameters, 'balance' and 'stretch', that capture their main characteristics, and explore their dynamics by creating visual tools we call 'balance plots' and 'spelling trees'. We discuss how the tools and methods we develop here could be used to study the statistical patterns of mistypings and misspellings and be used as a basis for other linguistic research involving stretchable words, along with the potential applications in augmenting dictionaries, improving language processing, and in any area where sequence construction matters, such as genetics.
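To make the notion of a stretched word concrete, here is a minimal sketch that flags tokens containing a run of three or more identical characters and groups them under a collapsed key. This is a simplified heuristic assumed for illustration; it is not the paper's method, it does not compute the 'balance' or 'stretch' parameters, and it does not handle repeated multi-letter units such as 'Hahahahaha'. The helper names are hypothetical.

```python
import re

_RUN = re.compile(r"(.)\1+")          # any run of 2+ identical characters
_LONG_RUN = re.compile(r"(.)\1{2,}")  # a run of 3+ identical characters


def collapse_runs(token):
    """Reduce every run of repeated characters to one character (lowercased)."""
    return _RUN.sub(r"\1", token.lower())


def looks_stretched(token):
    """Heuristic: a token is 'stretched' if some character repeats 3+ times in a row."""
    return _LONG_RUN.search(token.lower()) is not None


def group_by_collapsed_key(tokens):
    """Group stretched tokens that collapse to the same key."""
    groups = {}
    for tok in tokens:
        if looks_stretched(tok):
            groups.setdefault(collapse_runs(tok), []).append(tok)
    return groups


sample = ["heellllp", "heyyyyy", "hello", "heyyy", "help"]
groups = group_by_collapsed_key(sample)
print(groups)
```

Collapsing every run also shortens legitimate double letters (for example, 'hello' would map to 'helo'), which is why the collapsed string is treated only as a grouping key and not as the recovered root word.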
- …