24 research outputs found
The size of patent categories: USPTO 1976-2006
Categorization is an important phenomenon in science and society, and classification systems reflect the mesoscale organization of knowledge. The Yule-Simon-Naranan model, which assumes exponential growth of the number of categories and exponential growth of individual categories, predicts a power-law (Pareto) size distribution and a power-law size-rank relation (Zipf's law). However, the size distribution of patent subclasses departs from a pure power law and is shown to be closer to a shifted power law. At a higher aggregation level (patent classes), the rank-size relation deviates even more from a pure power law and is shown to be closer to a generalized beta curve. These patterns can be explained by assuming a shifted exponential growth of individual categories, which yields a shifted power-law size distribution (for subclasses), and an asymmetric logistic growth of the number of categories, which yields a generalized beta size-rank relationship (for classes). This may suggest a shift towards incremental rather than radical innovation.
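The classic growth mechanism behind these predictions can be sketched in a few lines. The following is a minimal simulation of the standard Yule-Simon process (not the shifted or logistic variants the abstract proposes): with probability alpha a new category opens, otherwise a new item joins an existing category in proportion to its size. All parameter values here are illustrative.

```python
import random
from collections import Counter

def yule_simon(n_steps, alpha, seed=0):
    # With probability alpha, open a new category; otherwise assign the new
    # item to an existing category chosen proportionally to its size
    # (picking a uniform past item achieves this preferential attachment).
    rng = random.Random(seed)
    items = [0]              # item -> category id; start with one category
    n_categories = 1
    for _ in range(n_steps):
        if rng.random() < alpha:
            items.append(n_categories)
            n_categories += 1
        else:
            items.append(items[rng.randrange(len(items))])
    return Counter(items)    # category id -> size

sizes = yule_simon(50_000, alpha=0.1)
# Classic result: category sizes follow a power law with exponent
# 1 + 1/(1 - alpha), so a handful of categories dominate.
```

The shifted and generalized-beta patterns reported above arise when the exponential growth assumptions of this baseline process are relaxed.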
Re-evaluating phoneme frequencies
Causal processes can give rise to distinctive distributions in the linguistic
variables that they affect. Consequently, a secure understanding of a
variable's distribution can hold a key to understanding the forces that have
causally shaped it. A storied distribution in linguistics has been Zipf's law,
a kind of power law. In the wake of a major debate in the sciences around
power-law hypotheses and the unreliability of earlier methods of evaluating
them, here we re-evaluate the distributions claimed to characterize phoneme
frequencies. We infer the fit of power laws and three alternative distributions
to 166 Australian languages, using a maximum likelihood framework. We find
evidence supporting earlier results, but also nuancing them and increasing our
understanding of them. Most notably, phonemic inventories appear to have a
Zipfian-like frequency structure among their most-frequent members (though
perhaps also a lognormal structure) but a geometric (or exponential) structure
among the least-frequent. We compare these new insights with the kinds of causal
processes that affect the evolution of phonemic inventories over time, and
identify a potential account for why, despite there being an important role for
phonetic substance in phonemic change, we could still expect inventories with
highly diverse phonetic content to share similar distributions of phoneme
frequencies. We conclude with priorities for future work in this promising
program of research.

Comment: 29 pp (3 figures, 3 tables). This article has been provisionally
accepted for publication (Frontiers in Psychology, Language Sciences).
Supplementary information, data and code available at
http://doi.org/10.5281/zenodo.388621
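The maximum-likelihood approach the abstract contrasts with earlier, unreliable methods can be illustrated with the standard continuous estimator (Clauset-Shalizi-Newman style). This is a generic sketch on synthetic data, not the authors' code or the Australian-language data.

```python
import math
import random

def powerlaw_mle_alpha(xs, xmin):
    # Continuous maximum-likelihood (Hill) estimator for p(x) ~ x**(-alpha),
    # x >= xmin -- the kind of fit a maximum-likelihood framework performs
    # instead of regressing on a log-log plot.
    tail = [x for x in xs if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Sanity check on synthetic data drawn by inverse-transform sampling;
# the values stand in for frequency counts, purely for illustration.
rng = random.Random(42)
true_alpha, xmin = 2.5, 1.0
sample = [xmin * (1 - rng.random()) ** (-1 / (true_alpha - 1))
          for _ in range(100_000)]
alpha_hat = powerlaw_mle_alpha(sample, xmin)
```

Comparing the likelihood of such a fit against geometric or lognormal alternatives is what distinguishes the Zipfian head from the geometric tail described above.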
The power law model applied to the marathon world record
In September 2013 the men's marathon world record was broken. The aim of this study is to apply to the 2013 Berlin Marathon a mathematical model based on the power law that analyses the distribution of finishing marks and checks how closely the data fit it. The results show that the correlations obtained in all the different categories were very significant (r ≥ 0.978; p < 0.001), with a linear determination coefficient of R² ≥ 0.969. In conclusion, applying the power law to the 2013 Berlin Marathon men's race proved useful and feasible, and the agreement between the data and the mathematical model was very accurate.
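The kind of power-law fit and correlation check the study reports can be sketched as a linear regression on logarithms. The rank and mark values below are invented toy data, not the Berlin Marathon results.

```python
import math

def fit_power_law(xs, ys):
    # Least-squares fit of y = a * x**b via linear regression on logs,
    # returning (a, b, r) where r is the correlation on the log-log scale.
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                  # power-law exponent
    a = math.exp(my - b * mx)      # prefactor
    r = sxy / math.sqrt(sxx * syy) # correlation of the log-log relation
    return a, b, r

# On exact power-law data the fit recovers the parameters and r = 1.
ranks = list(range(1, 11))
marks = [3.0 * x ** 0.5 for x in ranks]
a, b, r = fit_power_law(ranks, marks)
```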
Words by the tail: assessing lexical diversity in scholarly titles using frequency-rank distribution tail fits
This research assesses the evolution of lexical diversity in scholarly titles using a new indicator based on Zipfian frequency-rank distribution tail fits. At the operational level, while both head and tail fits of Zipfian word distributions are more independent of corpus size than other lexical diversity indicators, the latter neatly outperforms the former in that regard. This benchmark-setting performance of Zipfian distribution tails proves extremely handy in distinguishing actual patterns in lexical diversity from the statistical noise generated by other indicators due to corpus size fluctuations. From an empirical perspective, analysis of Web of Science (WoS) article titles from 1975 to 2014 shows that the lexical concentration of scholarly titles in Natural Sciences & Engineering (NSE) and Social Sciences & Humanities (SSH) articles increases by a little less than 8% over the whole period. With the exception of the already lexically concentrated Mathematics, Earth & Space, and Physics, NSE article titles all increased in lexical concentration, suggesting a probable convergence of concentration levels in the near future. As regards SSH disciplines, aggregation effects observed at the disciplinary group level suggest that, behind the stable concentration levels of SSH disciplines, a cross-disciplinary homogenization of the highest word frequency ranks may be at work. Overall, these trends suggest a progressive standardization of wording in scientific article titles, as titles get written using an increasingly restricted and cross-disciplinary set of words.
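A tail fit of a frequency-rank distribution, as opposed to a fit over all ranks, can be sketched as follows. This is a toy analogue of the idea (restricting the regression to low-frequency ranks), not the paper's actual indicator; the synthetic corpus and the 50% cutoff are illustrative choices.

```python
import math
from collections import Counter

def zipf_tail_exponent(tokens, tail_start=0.5):
    # Fit the Zipf exponent on the tail of the frequency-rank curve only
    # (ranks beyond tail_start * vocabulary size) by log-log least squares.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    pts = [(math.log(r), math.log(f)) for r, f in enumerate(freqs, 1)]
    pts = pts[int(len(pts) * tail_start):]          # keep tail ranks only
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    return -slope

# Synthetic corpus in which word r appears ~10000/r times (Zipf exponent 1).
tokens = [f"w{r}" for r in range(1, 101) for _ in range(10_000 // r)]
exponent = zipf_tail_exponent(tokens)
```

Because the high-frequency head is excluded, an indicator of this shape is less sensitive to the corpus-size fluctuations mentioned above.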
Universality and variability in the statistics of data with fat-tailed distributions: the case of word frequencies in natural languages
Natural language is a remarkable example of a complex dynamical system which combines variation and universal structure emerging from the interaction of millions of individuals. Understanding statistical properties of texts is not only crucial in applications of information retrieval and natural language processing, e.g. search engines, but also allows deeper insights into the organization of knowledge in the form of written text. In this thesis, we investigate the statistical and dynamical processes underlying the co-existence of universality and variability in word statistics. We combine a careful statistical analysis of large empirical databases on language usage with analytical and numerical studies of stochastic models. We find that the fat-tailed distribution of word frequencies is best described by a generalized Zipf's law characterized by two scaling regimes, in which the values of the parameters are extremely robust with respect to time as well as the type and the size of the database under consideration, depending only on the particular language. We provide an interpretation of the two regimes in terms of a distinction of words into a finite core vocabulary and a (virtually) infinite non-core vocabulary.
Proposing a simple generative process of language usage, we can establish the connection to the problem of vocabulary growth, i.e. how the number of different words scales with the database size, from which we obtain a unified perspective on different universal scaling laws simultaneously appearing in the statistics of natural language. On the one hand, our stochastic model accurately predicts the expected number of different items as measured in empirical data spanning hundreds of years and 9 orders of magnitude in size, showing that the supposed vocabulary growth over time is mainly driven by database size and not by a change in vocabulary richness. On the other hand, analysis of the variation around the expected size of the vocabulary shows anomalous fluctuation scaling, i.e. the vocabulary is a non-self-averaging quantity and fluctuations are therefore much larger than expected. We derive how this results from topical variations in a collection of texts coming from different authors, disciplines, or times, which manifest as correlations between the frequencies of different words due to their semantic relation. We explore the consequences of topical variation in applications to language change and topic models, emphasizing the difficulties (and presenting possible solutions) due to the fact that the statistics of word frequencies are characterized by a fat-tailed distribution.
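The vocabulary-growth measurement discussed above (Heaps' law, V ~ N**beta with beta < 1) can be sketched as follows. This is a minimal empirical measurement on a synthetic Zipfian corpus, not the thesis's stochastic model; vocabulary size, weights, and sample length are illustrative.

```python
import math
import random

def vocabulary_growth(tokens, n_points=10):
    # Record vocabulary size V(N) at increasing text lengths N;
    # sublinear growth of this curve is the Heaps'-law signature.
    seen, curve = set(), []
    step = max(1, len(tokens) // n_points)
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Synthetic Zipfian text: word r drawn with probability proportional to 1/r.
rng = random.Random(7)
vocab = [f"w{r}" for r in range(1, 10_001)]
weights = [1 / r for r in range(1, 10_001)]
tokens = rng.choices(vocab, weights=weights, k=50_000)

curve = vocabulary_growth(tokens)
(n1, v1), (n2, v2) = curve[0], curve[-1]
# Crude two-point estimate of the Heaps exponent beta.
beta = (math.log(v2) - math.log(v1)) / (math.log(n2) - math.log(n1))
```

Comparing such a measured curve against the expectation of a generative model is how one separates database-size effects from genuine changes in vocabulary richness.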
First, we propose an information-theoretic measure based on the Shannon-Gibbs entropy and suitable generalizations, quantifying the similarity between different texts, which allows us to determine how fast the vocabulary of a language changes over time. Second, we combine topic models from machine learning with concepts from community detection in complex networks in order to infer large-scale (mesoscopic) structures in a collection of texts. Finally, we study language change of individual words on historical time scales, i.e. how a linguistic innovation spreads through a community of speakers, providing a framework to quantitatively combine microscopic models of language change with empirical data that is only available on a macroscopic level (i.e. averaged over the population of speakers).
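An entropy-based text-similarity measure of the kind described above can be sketched with the Jensen-Shannon divergence between word-frequency distributions. This is one standard Shannon-entropy construction, offered as an illustration of the idea rather than the thesis's specific generalized measure.

```python
import math
from collections import Counter

def jensen_shannon(p_tokens, q_tokens):
    # Jensen-Shannon divergence (in bits) between the word-frequency
    # distributions of two token lists: symmetric and bounded in [0, 1].
    p, q = Counter(p_tokens), Counter(q_tokens)
    n_p, n_q = sum(p.values()), sum(q.values())
    vocab = sorted(set(p) | set(q))
    P = [p[w] / n_p for w in vocab]
    Q = [q[w] / n_q for w in vocab]
    M = [(a + b) / 2 for a, b in zip(P, Q)]
    def entropy(dist):
        return -sum(x * math.log2(x) for x in dist if x > 0)
    return entropy(M) - (entropy(P) + entropy(Q)) / 2

# Identical texts diverge by 0; disjoint vocabularies diverge by 1 bit.
d_same = jensen_shannon(["a", "b", "a"], ["a", "b", "a"])
d_disjoint = jensen_shannon(["a"] * 5, ["b"] * 5)
```

Tracking such a divergence between corpora from successive time periods gives a quantitative handle on how fast a vocabulary changes, which is the use case the abstract describes.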