24 research outputs found
The size of patent categories: USPTO 1976-2006
Categorization is an important phenomenon in science and society, and classification systems reflect the mesoscale organization of knowledge. The Yule-Simon-Naranan model, which assumes exponential growth of the number of categories and exponential growth of individual categories, predicts a power-law (Pareto) size distribution and a power-law size-rank relation (Zipf's law). However, the size distribution of patent subclasses departs from a pure power law and is shown to be closer to a shifted power law. At a higher aggregation level (patent classes), the rank-size relation deviates even more from a pure power law and is shown to be closer to a generalized beta curve. These patterns can be explained by assuming a shifted exponential growth of individual categories, which yields a shifted power-law size distribution (for subclasses), and an asymmetric logistic growth of the number of categories, which yields a generalized beta size-rank relationship (for classes). This may suggest a shift towards incremental rather than radical innovation.
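The classic growth mechanism behind these predictions can be sketched in a few lines. The following is a minimal simulation of the standard Yule-Simon process (not the shifted or logistic variants the abstract proposes): with probability alpha a new category opens, otherwise a new item joins an existing category in proportion to its size. All parameter values here are illustrative.

```python
import random
from collections import Counter

def yule_simon(n_steps, alpha, seed=0):
    # With probability alpha, open a new category; otherwise assign the new
    # item to an existing category chosen proportionally to its size
    # (picking a uniform past item achieves this preferential attachment).
    rng = random.Random(seed)
    items = [0]              # item -> category id; start with one category
    n_categories = 1
    for _ in range(n_steps):
        if rng.random() < alpha:
            items.append(n_categories)
            n_categories += 1
        else:
            items.append(items[rng.randrange(len(items))])
    return Counter(items)    # category id -> size

sizes = yule_simon(50_000, alpha=0.1)
# Classic result: category sizes follow a power law with exponent
# 1 + 1/(1 - alpha), so a handful of categories dominate.
```

The shifted and generalized-beta patterns reported above arise when the exponential growth assumptions of this baseline process are relaxed.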
Re-evaluating phoneme frequencies
Causal processes can give rise to distinctive distributions in the linguistic
variables that they affect. Consequently, a secure understanding of a
variable's distribution can hold a key to understanding the forces that have
causally shaped it. A storied distribution in linguistics has been Zipf's law,
a kind of power law. In the wake of a major debate in the sciences around
power-law hypotheses and the unreliability of earlier methods of evaluating
them, here we re-evaluate the distributions claimed to characterize phoneme
frequencies. We infer the fit of power laws and three alternative distributions
to 166 Australian languages, using a maximum likelihood framework. We find
evidence supporting earlier results, but also nuancing them and increasing our
understanding of them. Most notably, phonemic inventories appear to have a
Zipfian-like frequency structure among their most-frequent members (though
perhaps also a lognormal structure) but a geometric (or exponential) structure
among the least-frequent. We compare these new insights with the kinds of causal
processes that affect the evolution of phonemic inventories over time, and
identify a potential account for why, despite there being an important role for
phonetic substance in phonemic change, we could still expect inventories with
highly diverse phonetic content to share similar distributions of phoneme
frequencies. We conclude with priorities for future work in this promising
program of research.

Comment: 29 pp (3 figures, 3 tables). This article has been provisionally
accepted for publication (Frontiers in Psychology, Language Sciences).
Supplementary information, data and code available at
http://doi.org/10.5281/zenodo.388621
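The maximum-likelihood approach the abstract contrasts with earlier, unreliable methods can be illustrated with the standard continuous estimator (Clauset-Shalizi-Newman style). This is a generic sketch on synthetic data, not the authors' code or the Australian-language data.

```python
import math
import random

def powerlaw_mle_alpha(xs, xmin):
    # Continuous maximum-likelihood (Hill) estimator for p(x) ~ x**(-alpha),
    # x >= xmin -- the kind of fit a maximum-likelihood framework performs
    # instead of regressing on a log-log plot.
    tail = [x for x in xs if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Sanity check on synthetic data drawn by inverse-transform sampling;
# the values stand in for frequency counts, purely for illustration.
rng = random.Random(42)
true_alpha, xmin = 2.5, 1.0
sample = [xmin * (1 - rng.random()) ** (-1 / (true_alpha - 1))
          for _ in range(100_000)]
alpha_hat = powerlaw_mle_alpha(sample, xmin)
```

Comparing the likelihood of such a fit against geometric or lognormal alternatives is what distinguishes the Zipfian head from the geometric tail described above.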
The power law model applied to the marathon world record
In September 2013 the men's marathon world record was broken. The aim of this study is to apply to the 2013 Berlin Marathon a mathematical model based on the power law that analyses the distribution of finishing marks and checks how closely the data fit it. The results show that the correlations obtained in all the different categories were very significant (r ≥ 0.978; p < 0.001), with a linear determination coefficient of R² ≥ 0.969. In conclusion, applying the power law to the 2013 Berlin Marathon men's race proved useful and feasible, and the agreement between the data and the mathematical model was very accurate.
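The kind of power-law fit and correlation check the study reports can be sketched as a linear regression on logarithms. The rank and mark values below are invented toy data, not the Berlin Marathon results.

```python
import math

def fit_power_law(xs, ys):
    # Least-squares fit of y = a * x**b via linear regression on logs,
    # returning (a, b, r) where r is the correlation on the log-log scale.
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                  # power-law exponent
    a = math.exp(my - b * mx)      # prefactor
    r = sxy / math.sqrt(sxx * syy) # correlation of the log-log relation
    return a, b, r

# On exact power-law data the fit recovers the parameters and r = 1.
ranks = list(range(1, 11))
marks = [3.0 * x ** 0.5 for x in ranks]
a, b, r = fit_power_law(ranks, marks)
```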
Words by the tail: assessing lexical diversity in scholarly titles using frequency-rank distribution tail fits
This research assesses the evolution of lexical diversity in scholarly titles using a new indicator based on Zipfian frequency-rank distribution tail fits. At the operational level, while both head and tail fits of Zipfian word distributions are more independent of corpus size than other lexical diversity indicators, the latter neatly outperforms the former in that regard. This benchmark-setting performance of Zipfian distribution tails proves extremely handy in distinguishing actual patterns in lexical diversity from the statistical noise generated by other indicators due to corpus size fluctuations. From an empirical perspective, analysis of Web of Science (WoS) article titles from 1975 to 2014 shows that the lexical concentration of scholarly titles in Natural Sciences & Engineering (NSE) and Social Sciences & Humanities (SSH) articles increases by a little less than 8% over the whole period. With the exception of the already lexically concentrated Mathematics, Earth & Space, and Physics, NSE article titles all increased in lexical concentration, suggesting a probable convergence of concentration levels in the near future. As regards SSH disciplines, aggregation effects observed at the disciplinary group level suggest that, behind the stable concentration levels of SSH disciplines, a cross-disciplinary homogenization of the highest word frequency ranks may be at work. Overall, these trends suggest a progressive standardization of wording in scientific article titles, as titles get written using an increasingly restricted and cross-disciplinary set of words.
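A tail fit of a frequency-rank distribution, as opposed to a fit over all ranks, can be sketched as follows. This is a toy analogue of the idea (restricting the regression to low-frequency ranks), not the paper's actual indicator; the synthetic corpus and the 50% cutoff are illustrative choices.

```python
import math
from collections import Counter

def zipf_tail_exponent(tokens, tail_start=0.5):
    # Fit the Zipf exponent on the tail of the frequency-rank curve only
    # (ranks beyond tail_start * vocabulary size) by log-log least squares.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    pts = [(math.log(r), math.log(f)) for r, f in enumerate(freqs, 1)]
    pts = pts[int(len(pts) * tail_start):]          # keep tail ranks only
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    return -slope

# Synthetic corpus in which word r appears ~10000/r times (Zipf exponent 1).
tokens = [f"w{r}" for r in range(1, 101) for _ in range(10_000 // r)]
exponent = zipf_tail_exponent(tokens)
```

Because the high-frequency head is excluded, an indicator of this shape is less sensitive to the corpus-size fluctuations mentioned above.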
Universality and variability in the statistics of data with fat-tailed distributions: the case of word frequencies in natural languages
Natural language is a remarkable example of a complex dynamical system which combines variation and universal structure emerging from the interaction of millions of individuals. Understanding statistical properties of texts is not only crucial in applications of information retrieval and natural language processing, e.g. search engines, but also allows deeper insights into the organization of knowledge in the form of written text. In this thesis, we investigate the statistical and dynamical processes underlying the co-existence of universality and variability in word statistics. We combine a careful statistical analysis of large empirical databases on language usage with analytical and numerical studies of stochastic models. We find that the fat-tailed distribution of word frequencies is best described by a generalized Zipf's law characterized by two scaling regimes, in which the values of the parameters are extremely robust with respect to time as well as the type and the size of the database under consideration, depending only on the particular language. We provide an interpretation of the two regimes in terms of a distinction of words into a finite core vocabulary and a (virtually) infinite non-core vocabulary.
Proposing a simple generative process of language usage, we can establish the connection to the problem of vocabulary growth, i.e. how the number of different words scales with the database size, from which we obtain a unified perspective on different universal scaling laws simultaneously appearing in the statistics of natural language. On the one hand, our stochastic model accurately predicts the expected number of different items as measured in empirical data spanning hundreds of years and 9 orders of magnitude in size, showing that the supposed vocabulary growth over time is mainly driven by database size and not by a change in vocabulary richness. On the other hand, analysis of the variation around the expected size of the vocabulary shows anomalous fluctuation scaling, i.e. the vocabulary is a non-self-averaging quantity and fluctuations are therefore much larger than expected. We derive how this results from topical variations in a collection of texts coming from different authors, disciplines, or times, which manifest as correlations between the frequencies of different words due to their semantic relation. We explore the consequences of topical variation in applications to language change and topic models, emphasizing the difficulties (and presenting possible solutions) due to the fact that the statistics of word frequencies are characterized by a fat-tailed distribution.
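The vocabulary-growth measurement discussed above (Heaps' law, V ~ N**beta with beta < 1) can be sketched as follows. This is a minimal empirical measurement on a synthetic Zipfian corpus, not the thesis's stochastic model; vocabulary size, weights, and sample length are illustrative.

```python
import math
import random

def vocabulary_growth(tokens, n_points=10):
    # Record vocabulary size V(N) at increasing text lengths N;
    # sublinear growth of this curve is the Heaps'-law signature.
    seen, curve = set(), []
    step = max(1, len(tokens) // n_points)
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Synthetic Zipfian text: word r drawn with probability proportional to 1/r.
rng = random.Random(7)
vocab = [f"w{r}" for r in range(1, 10_001)]
weights = [1 / r for r in range(1, 10_001)]
tokens = rng.choices(vocab, weights=weights, k=50_000)

curve = vocabulary_growth(tokens)
(n1, v1), (n2, v2) = curve[0], curve[-1]
# Crude two-point estimate of the Heaps exponent beta.
beta = (math.log(v2) - math.log(v1)) / (math.log(n2) - math.log(n1))
```

Comparing such a measured curve against the expectation of a generative model is how one separates database-size effects from genuine changes in vocabulary richness.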
First, we propose an information-theoretic measure based on the Shannon-Gibbs entropy and suitable generalizations, quantifying the similarity between different texts, which allows us to determine how fast the vocabulary of a language changes over time. Second, we combine topic models from machine learning with concepts from community detection in complex networks in order to infer large-scale (mesoscopic) structures in a collection of texts. Finally, we study language change of individual words on historical time scales, i.e. how a linguistic innovation spreads through a community of speakers, providing a framework to quantitatively combine microscopic models of language change with empirical data that is only available on a macroscopic level (i.e. averaged over the population of speakers).
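An entropy-based text-similarity measure of the kind described above can be sketched with the Jensen-Shannon divergence between word-frequency distributions. This is one standard Shannon-entropy construction, offered as an illustration of the idea rather than the thesis's specific generalized measure.

```python
import math
from collections import Counter

def jensen_shannon(p_tokens, q_tokens):
    # Jensen-Shannon divergence (in bits) between the word-frequency
    # distributions of two token lists: symmetric and bounded in [0, 1].
    p, q = Counter(p_tokens), Counter(q_tokens)
    n_p, n_q = sum(p.values()), sum(q.values())
    vocab = sorted(set(p) | set(q))
    P = [p[w] / n_p for w in vocab]
    Q = [q[w] / n_q for w in vocab]
    M = [(a + b) / 2 for a, b in zip(P, Q)]
    def entropy(dist):
        return -sum(x * math.log2(x) for x in dist if x > 0)
    return entropy(M) - (entropy(P) + entropy(Q)) / 2

# Identical texts diverge by 0; disjoint vocabularies diverge by 1 bit.
d_same = jensen_shannon(["a", "b", "a"], ["a", "b", "a"])
d_disjoint = jensen_shannon(["a"] * 5, ["b"] * 5)
```

Tracking such a divergence between corpora from successive time periods gives a quantitative handle on how fast a vocabulary changes, which is the use case the abstract describes.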