11 research outputs found

    Scaling laws and model of words organization in spoken and written language

    A broad range of complex physical and biological systems exhibits scaling laws. Human language is a complex system of words organization. Studies of written texts have revealed intriguing scaling laws that characterize the frequency of word occurrence, the rank of words, and the growth in the number of distinct words with text length. While studies have predominantly focused on the language system in its written form, such as books, little attention has been given to the structure of spoken language. Here we investigate a database of spoken language transcripts and written texts, and we find that words organization in both spoken language and written texts exhibits scaling laws, although with different crossover regimes and scaling exponents. We propose a model that provides insight into words organization in spoken language and written texts, and successfully accounts for all scaling laws empirically observed in both language forms.

    Top 20 most frequently used English words and Chinese characters and their frequencies.

    A Chinese character can have different functions in the structure of a sentence and carry different meanings depending on the context, as shown in brackets following each Chinese character in the table. The frequencies are calculated using pooled data of all books in our database.

    Detailed information on the database of Chinese and English books and corresponding words statistics for each book.

    T is the total number of words, and N_T is the vocabulary size of each book.

    Empirical analyses of the word growth mechanism.

    Data show results for the first book of (a) English and (b) Chinese language listed in Table 1 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0168971#pone.0168971.t001), revealing a scaling relation for the average number of occurrences ϕ(k) of a given word in the second half of a text, given that the word occurs k times in the first half of the text. Both languages are characterized by a scaling exponent γ ≈ 1, indicating that words which appear with high frequency k in the first half of the text also occur frequently, on average, in the rest of the text.
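
    The quantity ϕ(k) described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' procedure: the function name, the whitespace tokenization of the toy text, and the choice to skip words that first appear in the second half (k = 0) are all assumptions.

```python
from collections import Counter, defaultdict


def second_half_mean_by_first_half_count(tokens):
    """Return {k: phi(k)}, where phi(k) is the mean number of second-half
    occurrences of words that occur exactly k times in the first half.

    Assumption: words absent from the first half (k = 0) are ignored.
    """
    mid = len(tokens) // 2
    first = Counter(tokens[:mid])
    second = Counter(tokens[mid:])

    totals = defaultdict(int)   # sum of second-half counts per k
    n_words = defaultdict(int)  # number of distinct words per k
    for word, k in first.items():
        totals[k] += second[word]  # Counter yields 0 for unseen words
        n_words[k] += 1

    return {k: totals[k] / n_words[k] for k in sorted(totals)}


# Toy example (hypothetical text, not data from the study).
tokens = "a b a c a b d a b a c a".split()
phi = second_half_mean_by_first_half_count(tokens)
print(phi)
```

    On a real book, an exponent γ could then be estimated from the slope of log ϕ(k) against log k over a chosen fitting range.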

    Scaling analyses of words organization in Chinese and English books.

    Words organization in Chinese and English language exhibits scaling laws with different characteristics. Log-log plots of (a) the probability distribution P(k) of the word frequency k, (b) Zipf’s law Z(r) vs. the word frequency rank r, and (c) Heaps’ scaling law of the number of distinct words N(t) vs. the number of words t in the text, obtained for the first book in Chinese and in English listed in Table 1 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0168971#pone.0168971.t001). Straight lines indicate the fitting range where the scaling exponents β, α and λ are obtained. Our analyses show that (a) Chinese language books exhibit a lower exponent β than English books; (b) while English books exhibit a single scaling regime over the entire range of frequency ranks r, Chinese texts are characterized by a clear crossover in the Zipf scaling of the normalized word frequency Z(r) vs. word frequency rank r; (c) the number of distinct words N(t) vs. text length t exhibits a crossover with two different scaling exponents at small and intermediate scales for both Chinese and English books. However, Chinese texts are characterized by a third, saturation regime at large scales t that is not observed for English books.
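
    The three curves named in this caption can be computed from a token list roughly as follows. This is a minimal sketch under stated assumptions: the function names, the toy sentence, and the plain least-squares fit in log-log space are illustrative choices, and the fitting ranges and sign conventions used in the study are not given here.

```python
import math
from collections import Counter


def zipf_curve(tokens):
    """Normalized frequency Z(r) for ranks r = 1, 2, ... (most frequent first)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = len(tokens)
    return [c / total for c in counts]


def frequency_distribution(tokens):
    """P(k): fraction of distinct words that occur exactly k times."""
    counts = Counter(tokens)
    hist = Counter(counts.values())
    n_distinct = len(counts)
    return {k: n / n_distinct for k, n in sorted(hist.items())}


def heaps_curve(tokens):
    """N(t): number of distinct words among the first t tokens, t = 1..T."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x) over the given points."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Toy example (hypothetical text, not data from the study).
tokens = "the cat and the dog and the bird".split()
zipf = zipf_curve(tokens)
pk = frequency_distribution(tokens)
heaps = heaps_curve(tokens)

# Sanity check of the fit on a synthetic power law y = x^0.5.
xs = [1, 2, 4, 8]
slope = loglog_slope(xs, [x ** 0.5 for x in xs])
```

    A crossover such as the one described for N(t) would show up as different slopes when `loglog_slope` is applied to separate ranges of t; which ranges to use is a judgment made by inspecting the log-log plot.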

    Model parameters of Chinese and English books.

    The statistics show significant differences in the model parameters k_0, k_t and k_p between Chinese and English texts, indicating differences in the dynamic process underlying the language structure, words organization and the occurrence of new words with text growth.

    Functional relations between the empirically observed scaling exponents and model parameters.

    (a) Exponent β of the empirical probability distribution P(k) vs. model parameter k_p, indicating a linear functional dependence. (b) Heaps’ law scaling exponents λ_2 for English books and spoken transcriptions, and λ_3 for Chinese books, vs. model parameter k_t, indicating an exponential functional dependence. Data points are obtained from the scaling analyses and simulation of all ten Chinese and English language books listed in Table 1 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0168971#pone.0168971.t001), and English spoken language from Ref. [30]. The dotted lines indicate 95% confidence intervals of the data points obtained from empirical and model parameters for each separate book.

    Empirical results and modeling simulation for Chinese and English language books.

    Scaling laws and model simulations for English book No. 1 (Table 1, http://www.plosone.org/article/info:doi/10.1371/journal.pone.0168971#pone.0168971.t001) are shown in panels (a), (b) and (c), and for Chinese book No. 1 (Table 1) are shown in panels (d), (e) and (f). Modeling parameters for all Chinese and English language books are given in Table 3 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0168971#pone.0168971.t003).