7 research outputs found

A Type-Theoretical Approach to Register Classification

Author: Hou Renkui
Huang Chu-Ren
Publication venue: Waseda Institute for the Study of Language and Information
Publication date: 01/01/2019

The challenges of statistical patterns of language: the case of Menzerath's law in genomes

Author: Altmann
Baixeries
Baixeries
Bel-Enguix
Biber
Bloom
Boroda
Carninci
Chung
Cramer
Ferrer-i-Cancho
Ferrer-i-Cancho
Ferrer-i-Cancho
Ferrer-i-Cancho
Hernández-Fernández
Häsler
Ke
Li
Li
Li
Lyons
Maclay
Makalowski
Menzerath
Miller
Miller
Miller
Pennisi
Searls
Siegel
Solé
Suzuki
Taft
Teupenhayn
Wilde
Yazgan
Ye
Zipf
Publication venue: 'Wiley'
Publication date: 29/09/2012

The importance of statistical patterns of language has been debated for decades. Although Zipf's law is perhaps the most popular case, Menzerath's law has recently entered the debate. Menzerath's law manifests in language, music and genomes as a tendency of the mean size of the parts to decrease as the number of parts increases. This statistical regularity also emerges in genomes, for instance, as a tendency of species with more chromosomes to have a smaller mean chromosome size. It has been argued that the instantiation of this law in genomes is not indicative of any parallel between language and genomes because (a) the law is inevitable and (b) non-coding DNA dominates genomes. Here mathematical, statistical and conceptual challenges of these criticisms are discussed. Two major conclusions are drawn: the law is not inevitable and languages also have a correlate of non-coding DNA. However, the wide range of manifestations of the law in and outside genomes suggests that the striking similarities between non-coding DNA and certain linguistic units could be anecdotal for understanding the recurrence of that statistical law.

Comment: Title changed, abstract and introduction improved, and minor corrections to the statistical argument.

arXiv.org e-Print Archive

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Diposit Digital de la Universitat de Barcelona

When is Menzerath-Altmann law mathematically trivial? A new approach

Author: Baixeries i Juvillà Jaume
Debowski Lukasz
Ferrer Cancho Ramon
Hernández Fernández Antonio
Macutek Jan
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 01/01/2014

Menzerath’s law, the tendency of Z (the mean size of the parts) to decrease as X (the number of parts) increases, is found in language, music and genomes. Recently, it has been argued that the presence of the law in genomes is an inevitable consequence of the fact that Z = Y/X, which would imply that Z scales with X as Z~1/X. That scaling is a very particular case of the Menzerath-Altmann law that has been rejected by means of a correlation test between X and Y in genomes, where X is the number of chromosomes of a species, Y its genome size in bases and Z the mean chromosome size. Here we review the statistical foundations of that test and consider three non-parametric tests based upon different correlation metrics and one parametric test to evaluate whether Z~1/X in genomes. The most powerful test is a new non-parametric one based upon the correlation ratio, which is able to reject Z~1/X in nine out of 11 taxonomic groups and detect a borderline group. Rather than a fact, Z~1/X is a baseline that real genomes do not meet. The view of the Menzerath-Altmann law as inevitable is seriously flawed.

Peer Reviewed. Postprint (author’s final draft).
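The reasoning behind a correlation test between X and Y can be illustrated with a small sketch. If Z~1/X held, then Y = Z·X would not depend on X on average, so a clear dependence of Y on X speaks against that baseline. The code below is only an illustration under stated assumptions: the data are synthetic, the helper names (`pearson`, `permutation_p_value`) are hypothetical, and a Pearson correlation with a permutation test stands in for the tests named in the abstract (it is not the correlation-ratio test).

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Two-sided permutation p-value for the correlation between xs and ys."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any association between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Synthetic, made-up data (NOT real genomes):
# X = number of parts, Y = total size, Z = mean size of the parts.
X = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
Y = [100 + 5 * x for x in X]       # Y grows with X by construction
Z = [y / x for x, y in zip(X, Y)]  # Z still decreases as X increases

r_xy = pearson(X, Y)
p_xy = permutation_p_value(X, Y)
print(f"r(X, Y) = {r_xy:.3f}, permutation p-value = {p_xy:.4f}")
```

In this toy data set Z decreases with X, as Menzerath's law describes, yet Y is strongly correlated with X, so Z is not proportional to 1/X.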

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Units and constituency in prosodic analysis: a quantitative assessment

Author: Wilson Andrew
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2017

Drawing on methods from quantitative linguistics, this paper tests the hypothesis that the intonation unit is a valid language construct whose immediate constituent is the foot (and whose own immediate constituent is the syllable). If the hypothesis is true, then the lengths of intonation units, measured in feet, should abide by a regular and parsimonious discrete probability distribution, and the immediate constituency relationship between feet and intonation units should be further demonstrable by successfully fitting the Menzerath-Altmann equation with a negative exponent. However, out of sixteen texts from the Aix-MARSEC database, only six share a common probability distribution and only eight exhibit a tolerable fit of the Menzerath-Altmann equation. A failure rate of ≥ 50% in both cases casts doubt on the validity of the hypothesis.
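As a rough illustration of what fitting the Menzerath-Altmann equation with a negative exponent can look like, the sketch below assumes the simplest truncated form of the equation, y = a·x^b, and estimates a and b by least squares on log-transformed values. The abstract does not state which form or fitting procedure the paper used, so the form, the function name `fit_power_law` and the data are all assumptions made for illustration.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + b * log x; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    # Slope of the regression line in log-log space is the exponent b.
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    a = math.exp(my - b * mx)
    return a, b


# Made-up data (NOT from the Aix-MARSEC database):
# x = size of a construct in constituents, y = mean size of those constituents.
sizes = [1, 2, 3, 4, 5, 6]
mean_constituent_length = [3.0 * x ** -0.5 for x in sizes]

a, b = fit_power_law(sizes, mean_constituent_length)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A negative fitted b means that larger constructs tend to have smaller constituents. Real data would also need a goodness-of-fit check before the fit is called tolerable.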

Lancaster E-Prints

The Phylogeny and Function of Vocal Complexity in Geladas

Author: Gustison Morgan
Publication date: 01/01/2017

The complexity of vocal communication varies widely across taxa – from humans who can create an infinite repertoire of sound combinations to some non-human species that produce only a few discrete sounds. A growing body of research is aimed at understanding the origins of ‘vocal complexity’. And yet, we still understand little about the evolutionary processes that led to, and the selective advantages of engaging in, complex vocal behaviors. I contribute to this body of research by examining the phylogeny and function of vocal complexity in wild geladas (Theropithecus gelada), a primate known for its capacity to combine a suite of discrete sound types into varied sequences. First, I investigate the phylogeny of vocal complexity by comparing gelada vocal communication with that of their close baboon relatives and with humans. Comparisons of vocal repertoires reveal that geladas – specifically the males – produce a suite of unique or ‘derived’ call types that results in a more diversified vocal repertoire than baboons. Also, comparisons of acoustic properties reveal that geladas produce vocalizations with greater spectro-temporal modulation, a feature shared with human speech, than baboons. Additionally, I show that the same organizational principle – Menzerath’s law – underpins the structure of gelada vocal sequences (i.e., combinations of derived and homologous call types) and human sentences. Second, I investigate the function of vocal complexity by examining the perception of male complex vocal sequences (i.e., those with more derived call types), the contexts in which they are produced, and how their production differs across individuals. A playback experiment shows that female geladas perceive ‘complex’ and ‘simple’ vocal sequences as being different. Then, two observational studies show that male production of complex vocal sequences mediates their affiliative interactions with females, both during neutral periods and periods of uncertainty (e.g., following conflicts). Finally, I find evidence that vocal complexity can act as a signal of male ‘quality’, in that more dominant males exhibit higher levels of vocal complexity than their subordinate counterparts. Collectively, the work in this dissertation presents an integrative investigation of the ultimate origins of complex communication systems, and in the process, it highlights the critical importance of approaching the study of complexity from several scientific perspectives.

PhD, Psychology, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/138479/1/gustison_1.pd

Deep Blue Documents at the University of Michigan

Universality and variability in the statistics of data with fat-tailed distributions: the case of word frequencies in natural languages

Author: Gerlach Martin
Publication date: 01/03/2016

Natural language is a remarkable example of a complex dynamical system which combines variation and universal structure emerging from the interaction of millions of individuals. Understanding statistical properties of texts is not only crucial in applications of information retrieval and natural language processing, e.g. search engines, but also allows deeper insights into the organization of knowledge in the form of written text. In this thesis, we investigate the statistical and dynamical processes underlying the co-existence of universality and variability in word statistics. We combine a careful statistical analysis of large empirical databases on language usage with analytical and numerical studies of stochastic models. We find that the fat-tailed distribution of word frequencies is best described by a generalized Zipf’s law characterized by two scaling regimes, in which the values of the parameters are extremely robust with respect to time as well as the type and the size of the database under consideration, depending only on the particular language. We provide an interpretation of the two regimes in terms of a distinction of words into a finite core vocabulary and a (virtually) infinite noncore vocabulary. Proposing a simple generative process of language usage, we can establish the connection to the problem of vocabulary growth, i.e. how the number of different words scales with the database size, from which we obtain a unified perspective on different universal scaling laws simultaneously appearing in the statistics of natural language. On the one hand, our stochastic model accurately predicts the expected number of different items as measured in empirical data spanning hundreds of years and 9 orders of magnitude in size, showing that the supposed vocabulary growth over time is mainly driven by database size and not by a change in vocabulary richness. On the other hand, analysis of the variation around the expected size of the vocabulary shows anomalous fluctuation scaling, i.e. the vocabulary is a non-self-averaging quantity, and therefore, fluctuations are much larger than expected. We derive how this results from topical variations in a collection of texts coming from different authors, disciplines, or times, which manifest in the form of correlations of frequencies of different words due to their semantic relation. We explore the consequences of topical variation in applications to language change and topic models, emphasizing the difficulties (and presenting possible solutions) due to the fact that the statistics of word frequencies are characterized by a fat-tailed distribution. First, we propose an information-theoretic measure based on the Shannon-Gibbs entropy and suitable generalizations quantifying the similarity between different texts, which allows us to determine how fast the vocabulary of a language changes over time. Second, we combine topic models from machine learning with concepts from community detection in complex networks in order to infer large-scale (mesoscopic) structures in a collection of texts. Finally, we study language change of individual words on historical time scales, i.e. how a linguistic innovation spreads through a community of speakers, providing a framework to quantitatively combine microscopic models of language change with empirical data that is only available on a macroscopic level (i.e. averaged over the population of speakers).

Technische Universität Dresden: Qucosa