Modeling Statistical Properties of Written Text

By M. Ángeles Serrano, Alessandro Flammini and Filippo Menczer

Abstract

Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the nontrivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
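
Two of the regularities named in the abstract can be measured directly on any tokenized text. The following is a minimal sketch, not the authors' generative model: it only counts words to produce the rank-frequency list behind Zipf's law and the vocabulary growth curve behind Heaps' law. The function names and the toy sentence are assumptions made for illustration.

```python
from collections import Counter


def rank_frequency(tokens):
    """Word counts sorted from most to least frequent (the Zipf view)."""
    return [count for _, count in Counter(tokens).most_common()]


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token (the Heaps view)."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy input (hypothetical); a real study would use a large document collection.
tokens = "the cat sat on the mat and the dog sat on the log".split()
freqs = rank_frequency(tokens)
growth = vocabulary_growth(tokens)
print(freqs)   # counts by rank
print(growth)  # vocabulary size versus text length
```

On a real corpus, plotting `freqs` against rank and `growth` against text length on log-log axes would be one way to inspect the approximate power-law behavior the abstract refers to. A toy sentence this short is far too small to show it.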

Topics: Research Article
Publisher: Public Library of Science
OAI identifier: oai:pubmedcentral.nih.gov:2670513
Provided by: PubMed Central