10 research outputs found

    Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts - Fig 3

    <p>(a) Probability mass functions <i>f</i>(<i>n</i>) of the absolute frequencies <i>n</i> of words and lemmas in <i>La Regenta</i>, together with their fits, under rescaling of both axes. The collapse of the tails indicates the compatibility of both power-law exponents. (b) The same for, from top to bottom, <i>Artamène, Bragelonne</i> (both in French), <i>Seitsemän v., Kevät ja t</i>., and <i>Vanhempieni r</i>. (all three in Finnish). The rescaled distributions are additionally multiplied by factors 1, 10<sup>−2</sup>, etc., for clearer visualization.</p>

    Analysis of the association between random variables using Pearson and Spearman correlations as statistics.

    <p><i>ρ</i> is the value of the correlation statistic and <i>p</i> is the <i>p</i>-value of a two-sided test with null hypothesis <i>ρ</i> = 0, calculated through permutations of one of the variables (the results can be different if <i>p</i> is calculated from a <i>t</i>-test). The sample size is</p>
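
The permutation procedure described in this caption can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the number of permutations, the add-one correction in the p-value, and the sample data are all assumptions made for the example.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, stat=pearson, n_perm=2000, seed=0):
    """Two-sided p-value for the null hypothesis rho = 0.

    One variable (y) is shuffled repeatedly; the p-value is the fraction of
    shuffles whose |statistic| is at least the observed one. The add-one
    correction is an assumption of this sketch.
    """
    rng = random.Random(seed)
    observed = stat(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(stat(x, shuffled)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 * v + 1 for v in x]
rho, p = permutation_p_value(x, y, stat=spearman)
```

Passing `stat=pearson` or `stat=spearman` switches between the two statistics named in the caption; the shuffling step is the same for both.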

    Coverage of the vocabulary by the dictionary in each language, both at the word-type and at the token level.

    <p>The average for all texts is also included. Remember that we distinguish between a word <i>type</i> (corresponding to its orthographic form) and its <i>tokens</i> (actual occurrences in text).</p>
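
The difference between type-level and token-level coverage can be made concrete with a short sketch. It assumes a text already split into tokens and a dictionary given as a set of word forms; the function name `coverage` and the toy data are hypothetical and do not come from the paper.

```python
from collections import Counter


def coverage(tokens, dictionary):
    """Return (type_coverage, token_coverage) of a text by a dictionary.

    Type coverage counts each distinct orthographic form once; token
    coverage weights each form by its number of occurrences.
    """
    counts = Counter(tokens)
    known_types = [w for w in counts if w in dictionary]
    type_coverage = len(known_types) / len(counts)
    token_coverage = sum(counts[w] for w in known_types) / sum(counts.values())
    return type_coverage, token_coverage


# Toy example: 4 types and 6 tokens, with 2 types in the dictionary.
tokens = "the cat saw the dog the".split()
type_cov, token_cov = coverage(tokens, {"the", "cat"})
```

In this toy case half of the types are covered but two thirds of the tokens are, because the covered type `the` occurs three times.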

    <i>γ</i><sub><i>l</i></sub> (the exponent of the frequency distribution of lemmas) versus <i>γ</i><sub><i>w</i></sub> (the exponent of the frequency distribution of word forms).

    <p>As a guide to the eye, the line <i>γ</i><sub><i>l</i></sub> = <i>γ</i><sub><i>w</i></sub> is also shown (solid line). Error bars indicate one standard deviation.</p>

    The fit of a linear model for the relationship between exponents (<i>γ</i><sub><i>w</i></sub> and <i>γ</i><sub><i>l</i></sub>) and the relationship between cut-offs (<i>a</i><sub><i>w</i></sub> and <i>a</i><sub><i>l</i></sub>).

    <p><i>c</i><sub>1</sub> and <i>c</i><sub>3</sub> stand for slopes and <i>c</i><sub>2</sub> and <i>c</i><sub>4</sub> stand for intercepts. The error bars correspond to one standard deviation. A Student’s <i>t</i>-test is applied to test whether the slopes are significantly different from one and whether the intercepts are significantly different from zero. The resulting <i>p</i>-values indicate that in all cases the slopes are compatible with being equal to one. The intercepts are compatible with zero for the exponents, but seem to be incompatible with zero for the cut-offs.</p>

    Characteristics of the books analyzed.

    <p><sup>1</sup>Clarissa: Or the History of a Young Lady.</p><p><sup>2</sup>Moby-Dick; or, The Whale.</p><p><sup>3</sup>El ingenioso hidalgo don Quijote de la Mancha (1605)—The Ingenious Gentleman Don Quixote of La Mancha (title in English); including second part: El ingenioso caballero don Quijote de la Mancha (1615).</p><p><sup>4</sup>Artamène ou le Grand Cyrus—Artamène, or Cyrus the Great.</p><p><sup>5</sup>Le Vicomte de Bragelonne ou Dix ans plus tard—The Vicomte of Bragelonne: Ten Years Later.</p><p><sup>6</sup>Seven Brothers.</p><p><sup>7</sup>Spring and the Untimely Return of Winter.</p><p><sup>8</sup>The Story of my Parents.</p><p><sup>9</sup>Madeleine and Georges de Scudéry.</p><p>The length of each book <i>L</i> is measured in millions of tokens.</p>

    Power-law fitting results for words and lemmas, denoted respectively by subindices <i>w</i> and <i>l</i>.

    <p><i>V</i> is the number of types (vocabulary size), <i>n</i><sub><i>m</i></sub> is the maximum frequency of the distribution, <i>N</i><sub><i>a</i></sub> is the number of types in the power-law tail, i.e., with <i>n</i> ≥ <i>a</i>, <i>a</i> is the minimum value for which the power-law fit holds, and <i>γ</i> and <i>σ</i> are the power-law exponent and its standard deviation, respectively. 2<i>σ</i><sub><i>d</i></sub>, twice the standard deviation <i>σ</i><sub><i>d</i></sub>, is also given. <i>σ</i><sub><i>d</i></sub> is the standard deviation of <i>γ</i><sub><i>l</i></sub>−<i>γ</i><sub><i>w</i></sub> assuming independence, which is \(\sigma_d = \sqrt{\sigma_w^2 + \sigma_l^2}\). The last column provides ℓ<sub>1</sub>, the number of lemmas associated with only one word form. Notice that the lemma exponent is very close to the one found in Ref. [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0129031#pone.0129031.ref029" target="_blank">29</a>] for the tail of a double power-law fitting, except for <i>Moby-Dick</i> and <i>Ulysses</i>.</p>
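
As a worked illustration of the quantity σ_d defined in this caption, the sketch below combines the two standard deviations in quadrature and compares the exponent difference with 2σ_d. The caption only states that 2σ_d is reported; using it as a compatibility threshold is an assumption of this example, as are the function name and the input values, which are not taken from the paper's table.

```python
from math import sqrt


def exponent_difference(gamma_w, sigma_w, gamma_l, sigma_l):
    """Return (gamma_l - gamma_w, sigma_d, within_two_sigma).

    sigma_d = sqrt(sigma_w**2 + sigma_l**2) is the standard deviation of
    the difference when the two exponent estimates are independent.
    """
    sigma_d = sqrt(sigma_w ** 2 + sigma_l ** 2)
    difference = gamma_l - gamma_w
    # Assumed criterion: the exponents are treated as compatible when the
    # absolute difference does not exceed 2 * sigma_d.
    return difference, sigma_d, abs(difference) <= 2 * sigma_d


# Hypothetical values, for illustration only.
difference, sigma_d, compatible = exponent_difference(2.00, 0.03, 2.05, 0.04)
```

With these made-up inputs, σ_d is about 0.05, so a difference of about 0.05 lies inside the 2σ_d band of about 0.10.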

    Probability density <i>D</i>(<i>n</i><sub><i>l</i></sub>/<i>n</i><sub><i>w</i></sub>) of the frequency ratio for lemmas and words, <i>n</i><sub><i>l</i></sub>/<i>n</i><sub><i>w</i></sub>, in <i>La Regenta</i>.

    <p>Values of <i>n</i><sub><i>l</i></sub> smaller than <i>n</i><sub><i>w</i></sub> are disregarded, as they arise from words associated with more than one lemma. Bending for the largest <i>n</i><sub><i>l</i></sub>/<i>n</i><sub><i>w</i></sub> is expected, as the maximum of the ratio is given by <i>n</i><sub><i>l</i></sub>, which is not constant for each distribution but varies by half an order of magnitude (see plot legend).</p>