Zipf's Law Leads to Heaps' Law: Analyzing Their Relation in Finite-Size Systems
Background: Zipf's law and Heaps' law are observed in disparate complex
systems. Of particular interest, these two laws often appear together. Many
theoretical models and analyses have been developed to understand their
co-occurrence in real systems, but a clear picture of their relation is still lacking.
Methodology/Principal Findings: We show that the Heaps' law can be considered
as a derivative phenomenon if the system obeys the Zipf's law. Furthermore, we
refine the known approximate solution for the Heaps' exponent given the
Zipf's exponent. We show that the approximate solution is indeed an asymptotic
solution for infinite systems, while in the finite-size system the Heaps'
exponent is sensitive to the system size. Extensive empirical analysis on tens
of disparate systems demonstrates that our refined results can better capture
the relation between the Zipf's and Heaps' exponents. Conclusions/Significance:
The present analysis provides a clear picture about the relation between the
Zipf's law and Heaps' law without the help of any specific stochastic model,
namely the Heaps' law is indeed a derivative phenomenon from Zipf's law. The
presented numerical method gives considerably better estimation of the Heaps'
exponent given the Zipf's exponent and the system size. Our analysis provides
some insights into and implications for real complex systems; for example, one can
naturally obtain a better explanation of the accelerated growth of scale-free
networks.
Comment: 15 pages, 6 figures, 1 Table
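The claim that Heaps' law follows from Zipf's law can be checked numerically. The sketch below is not the authors' numerical method: it samples tokens from a truncated Zipf distribution and fits the slope of vocabulary size against text length on a log-log scale. The function names, the vocabulary size, the token count and the fitting window are all assumptions chosen for illustration. For a Zipf exponent alpha > 1, the commonly cited asymptotic approximation for the Heaps' exponent is 1/alpha, and the fitted slope on a finite sample is expected to deviate from it to some degree.

```python
import math
import random


def vocabulary_growth(alpha, vocab_size, n_tokens, seed=0):
    """Sample tokens from a Zipf law and track the number of distinct words.

    The word of rank r (1-based) is drawn with probability proportional
    to r**(-alpha). Entry t-1 of the result is the vocabulary size after
    t tokens have been drawn.
    """
    rng = random.Random(seed)
    weights = [r ** -alpha for r in range(1, vocab_size + 1)]
    tokens = rng.choices(range(vocab_size), weights=weights, k=n_tokens)
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def heaps_exponent(growth, t_min=100):
    """Least-squares slope of log N(t) against log t.

    Uses log-spaced text lengths t = t_min, 2*t_min, 4*t_min, ...
    (an illustrative choice of fitting window, not taken from the paper).
    """
    xs, ys = [], []
    t = t_min
    while t <= len(growth):
        xs.append(math.log(t))
        ys.append(math.log(growth[t - 1]))
        t *= 2
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    for alpha in (2.0, 1.5):
        growth = vocabulary_growth(alpha, vocab_size=50_000, n_tokens=100_000)
        print("alpha =", alpha,
              "fitted Heaps exponent =", round(heaps_exponent(growth), 3),
              "asymptotic 1/alpha =", round(1 / alpha, 3))
```

Changing `vocab_size` and `n_tokens` gives a rough feel for the finite-size sensitivity the abstract describes, although the size of the effect depends on the fitting window.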
Punctuation effects in English and Esperanto texts
A statistical physics study of punctuation effects on sentence lengths is
presented for written texts: Alice in Wonderland and Through a
Looking Glass. The translation of the first text into Esperanto is also
considered as a test for the role of punctuation in defining a style, and for
contrasting natural and artificial, but written, languages. Several log-log
plots of the sentence length-rank relationship are presented for the major
punctuation marks. Different power laws are observed with characteristic
exponents. The exponent can take a value much less than unity (0.50 or
0.30) depending on how a sentence is defined. The texts are also mapped into
time series based on the word frequencies. The quantitative differences between
the original and translated texts are very minute, at the exponent level. It
is argued that sentences seem to be more reliable than word distributions in
discussing an author's style.
Comment: 13 pages, 7 figures (3x2+1), 60 references
Heaps' Law and Heaps functions in tagged texts: Evidences of their linguistic relevance
We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or 'tags', namely nouns, verbs and others), and analyse the progressive appearance of new words of each tag along each individual text. We find that, as prescribed by Heaps' Law, vocabulary sizes and text lengths follow a well-defined power-law relation. Meanwhile, the appearance of new words in each text does not obey a power law, and is on the whole well described by the average of random shufflings of the text. Deviations from this average, however, are statistically significant and show systematic trends across the corpus. Specifically, we find that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags add systematically distinct contributions to this tendency, with verbs and others being respectively more and less retarded than the mean trend, and nouns following instead the overall mean. These statistical regularities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' Law, a feature that is still in need of extensive assessment.
Affiliation: Chacoma, Andrés Alberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Instituto de Física Enrique Gaviola. Universidad Nacional de Córdoba. Instituto de Física Enrique Gaviola; Argentina
Affiliation: Zanette, Damian Horacio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Patagonia Norte; Argentina. Comisión Nacional de Energía Atómica. Gerencia del Área Investigaciones y Aplicaciones no Nucleares; Argentina
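The comparison between a text's new-word curve and the average over random shufflings can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the number of shufflings is arbitrary, and the per-tag breakdown and significance tests described in the abstract are not reproduced. A positive value in the result means that, at that position, the original text has introduced fewer distinct words than the shuffled average, i.e. new words appear later.

```python
import random


def new_word_curve(tokens):
    """Number of distinct words seen after each token of the text."""
    seen, curve = set(), []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def shuffled_mean_curve(tokens, n_shuffles=200, seed=0):
    """Average new-word curve over random shufflings of the same tokens."""
    rng = random.Random(seed)
    work = list(tokens)
    totals = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, count in enumerate(new_word_curve(work)):
            totals[i] += count
    return [total / n_shuffles for total in totals]


def retardation(tokens, n_shuffles=200, seed=0):
    """Shuffled-average vocabulary minus actual vocabulary, per position.

    Positive entries indicate that new words appear later in the
    original ordering than in the shuffled average.
    """
    actual = new_word_curve(tokens)
    mean = shuffled_mean_curve(tokens, n_shuffles, seed)
    return [m - a for a, m in zip(actual, mean)]


if __name__ == "__main__":
    # Toy text in which every new word arrives as late as possible.
    toy = "a a a a b c d e".split()
    print(retardation(toy))
```

Both curves necessarily meet at the first and last positions, since any ordering of the same tokens starts with one word and ends with the full vocabulary.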
Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models
The article introduces corrections to Zipf's and Heaps' laws based on
systematic models of the hapax rate. The derivation rests on two assumptions:
The first one is the standard urn model which predicts that marginal frequency
distributions for shorter texts look as if word tokens were sampled blindly
from a given longer text. The second assumption posits that the rate of hapaxes
is a simple function of the text size. Four such functions are discussed: the
constant model, the Davis model, the linear model, and the logistic model. It
is shown that the logistic model yields the best fit.
Comment: 42 pages, 7 figures, 3 tables
Measuring and modelling Internet diffusion using second level domains: the case of Italy
The last 10 years have witnessed exponential growth of the Internet. According to Hobbes' Internet Timeline, there are now about 93 million Internet hosts, compared with 100,000 in 1989. The same holds for second-level domain names: about 3,900 domains were registered in July 1989, against over 2 million in July 2000. This paper reports on the construction of a database containing daily observations of registrations of second-level domain names under the .it ccTLD, in order to analyse the diffusion of the Internet among families and businesses. The section of the database referring to domains registered by individuals is analysed. The penetration rate over the relevant population of potential adopters is computed at a highly disaggregated geographical level (province). A concentration analysis is carried out to investigate whether the geographical distribution of the Internet is less concentrated than population and income, which would suggest a diffusive effect. Regression analysis is carried out using demographic, social, economic and infrastructure indicators. Finally, we briefly describe further developments of our research. At present we are constructing a database containing domains registered by firms together with data about the registrants; the idea is to use this new database and the previous one to check for the existence of power laws both in the number of domains registered in each province and in the number of domains registered by each firm.
Keywords: Domain names, Internet metrics, Diffusion, Power laws, Zipf's law