    Entropic analysis of the role of words in literary texts

    Beyond the local constraints imposed by grammar, words concatenated in long sequences carrying a complex message show statistical regularities that may reflect their linguistic role in the message. In this paper, we perform a systematic statistical analysis of the use of words in literary English corpora. We show that there is a quantitative relation between the role of content words in literary English and the Shannon information entropy defined over an appropriate probability distribution. Without assuming any prior knowledge about the syntactic structure of language, we are able to cluster certain groups of words according to their specific role in the text. Comment: 9 pages, 5 figures
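
    To make the idea concrete, here is a minimal sketch in Python of an entropy measure of this kind: the text is split into equal parts and each word's Shannon entropy is computed over its occurrence distribution across the parts. The function name, the number of parts, and the normalization are choices of this illustration, not the paper's exact definitions.

```python
import math
from collections import Counter

def word_part_entropy(text_words, num_parts=8):
    """Normalized Shannon entropy of each word's spread across text parts.

    Words used evenly throughout the text (typically function words) score
    near 1; words clumped in a few parts (typically topical content words)
    score lower, which is the kind of statistical signal that allows
    grouping words by role without any syntactic knowledge.
    """
    part_len = max(1, len(text_words) // num_parts)
    occurrences = {}
    for i, word in enumerate(text_words):
        part = min(i // part_len, num_parts - 1)
        occurrences.setdefault(word, Counter())[part] += 1

    entropies = {}
    for word, counts in occurrences.items():
        total = sum(counts.values())
        h = -sum((n / total) * math.log2(n / total) for n in counts.values())
        entropies[word] = h / math.log2(num_parts)  # normalize to [0, 1]
    return entropies
```

    Sorting the returned dictionary by value then separates the evenly spread words from the clustered ones.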

    Statistical keyword detection in literary corpora

    Understanding the complexity of human language requires an appropriate analysis of the statistical distribution of words in texts. We consider the information-retrieval problem of detecting and ranking the relevant words of a text by means of statistical information referring to the "spatial" use of the words. Shannon's information entropy is used as a tool for automatic keyword extraction. Using The Origin of Species by Charles Darwin as a representative text sample, we show the performance of our detector and compare it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices. Comment: Published version. 11 pages, 7 figures. SVJour for LaTeX2
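
    The calibration against a shuffled text can be sketched as follows: compute each word's across-parts entropy in the real text and in a random shuffle of it, then rank words by the entropy drop. The names and the choice of 16 parts are assumptions of this sketch; the paper's exact estimator differs in detail.

```python
import math
import random
from collections import Counter

def part_entropies(words, num_parts):
    """Shannon entropy of each word's distribution over equal text parts."""
    part_len = max(1, len(words) // num_parts)
    per_word = {}
    for i, word in enumerate(words):
        part = min(i // part_len, num_parts - 1)
        per_word.setdefault(word, Counter())[part] += 1
    entropies = {}
    for word, counts in per_word.items():
        total = sum(counts.values())
        entropies[word] = -sum((n / total) * math.log2(n / total)
                               for n in counts.values())
    return entropies

def keyword_ranking(words, num_parts=16, seed=0):
    """Rank words by how much lower their entropy is than chance predicts.

    The randomly shuffled text is the calibration baseline: a relevant word
    is distributed far less evenly in the real text than in the shuffle, so
    a large entropy drop marks a keyword candidate.
    """
    real = part_entropies(words, num_parts)
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    baseline = part_entropies(shuffled, num_parts)
    return sorted(((baseline[w] - h, w) for w, h in real.items()),
                  reverse=True)
```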

    Thesaurus as a complex network

    A thesaurus is one of many possible representations of term (or word) connectivity. The terms of a thesaurus are seen as the nodes, and their relationships as the links, of a directed graph. The directionality of the links retains all the thesaurus information and allows the measurement of several quantities. This has led to a new term classification according to the characteristics of the nodes, for example, nodes with no links in, no links out, etc. Using an electronically available thesaurus, we have obtained the incoming and outgoing link distributions. While the incoming link distribution follows a stretched exponential function, the lower bound for the outgoing link distribution has the same envelope as the scientific-paper citation distribution proposed by Albuquerque and Tsallis. However, a better fit is obtained by a simpler function, which is the solution of Riccati's differential equation. We conjecture that this differential equation is the continuous limit of a stochastic growth model of the thesaurus network. We also propose a new way to arrange a thesaurus, using the "inversion method". Comment: Contribution to the Proceedings of 'Trends and Perspectives in Extensive and Nonextensive Statistical Mechanics', in honour of Constantino Tsallis' 60th birthday (submitted to Physica A)
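
    The graph construction itself is straightforward; a sketch of how the incoming and outgoing link distributions could be tallied from a list of thesaurus entries follows. The edge format and function name are assumptions of this illustration; fitting the stretched exponential or the Riccati-equation solution to the resulting histograms is a separate step not shown here.

```python
from collections import Counter

def degree_distributions(edges):
    """In- and out-degree distributions of a directed thesaurus graph.

    `edges` is an iterable of (term, related_term) pairs, e.g. parsed from
    an electronic thesaurus. Each returned Counter maps a degree k to the
    number of nodes with that degree, including the nodes with no links in
    or no links out that the term classification above singles out.
    """
    in_degree, out_degree = Counter(), Counter()
    nodes = set()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
        nodes.update((source, target))
    in_dist = Counter(in_degree[n] for n in nodes)   # Counter gives 0 for sources
    out_dist = Counter(out_degree[n] for n in nodes)  # and for sinks likewise
    return in_dist, out_dist
```

    For example, degree_distributions([("big", "large"), ("large", "big"), ("big", "great")]) reports one node of out-degree 2, one of out-degree 1, and one sink.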

    Towards the quantification of the semantic information encoded in written language

    Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contributions to the overall information are larger are the ones most closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information. Comment: 19 pages, 4 figures
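
    One way to probe for such a characteristic scale is sketched below: for several segment sizes, weight each word's Kullback-Leibler divergence from the uniform segment distribution by its frequency and sum over the vocabulary. This is an assumption-laden illustration (the names, the scales, and the absence of a shuffled-text correction for finite-size bias are all choices of the sketch), not the paper's estimator.

```python
import math
from collections import Counter

def placement_information(words, scales=(250, 1000, 4000, 16000)):
    """Frequency-weighted information in word placement at several scales.

    For each segment size s, the text is cut into segments of s words and
    each word contributes the KL divergence between its observed segment
    distribution and the uniform one a shuffled text would approach. A
    maximum over s would point to the typical size of the most informative
    segments.
    """
    n = len(words)
    freq = Counter(words)
    results = {}
    for s in scales:
        num_segments = max(2, n // s)
        usable = num_segments * s  # drop the ragged tail of the text
        per_word = {}
        for i, word in enumerate(words[:usable]):
            per_word.setdefault(word, Counter())[i // s] += 1
        total = 0.0
        for word, counts in per_word.items():
            count_sum = sum(counts.values())
            # KL divergence from the uniform distribution over segments
            kl = sum((c / count_sum) * math.log2((c / count_sum) * num_segments)
                     for c in counts.values())
            total += (freq[word] / n) * kl
        results[s] = total
    return results
```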

    The Science of Art “Faithfully Presented”: Entropy in British Victorian Literature

    In the chemical world, entropy, or the randomness and chaos of a system, must continually increase; it is much more favorable for things to fall apart than to be put together. This scientific concept can also be rightly applied to the study of literature. While it is true that books assemble information into some sense of order out of chaos, seemingly defying entropy, I am convinced these works must still obey the laws of thermodynamics. There must be an increase in chaos somewhere, and if it is not within the words themselves, it must lie within the ideas they represent, their interpretation by readers, and the deconstruction of the text through literary analysis. In this study, the works of Victorian authors including Charles Dickens and Thomas Hardy are deconstructed into entropic elements, including the lack of a reliable center and the struggle between the compulsion to repeat and the desire for revolution. Examination of entropy in these texts validates their claims to be realistic novels portraying the nuance of authentic life. Entropy applied to literature calls readers to continually deconstruct, to wait expectantly for the inevitable ebb and flow of light and darkness, and to accept unanswerable questions and incomplete endings. This is real life; this is entropy.

    Literary Acoustics

    Bringing together sound studies and intermediality theory, this essay revisits the notion of ‘literary acoustics’ to inquire into the usefulness of intermediality studies for analyzing the relations between literature and sound. The second part of the essay is dedicated to an illustrative analysis of Ben Marcus’s highly experimental, noisy book The Age of Wire and String.

    Translation and the Arrow of Time


    Mathematical Philology: Entropy Information in Refining Classical Texts' Reconstruction, and Early Philologists' Anticipation of Information Theory

    Philologists reconstructing ancient texts from variously miscopied manuscripts anticipated information theorists by centuries in conceptualizing information in terms of probability. An example is the editorial principle difficilior lectio potior (DLP): in choosing between otherwise acceptable alternative wordings in different manuscripts, “the more difficult reading [is] preferable.” As philologists at least as early as Erasmus observed (and as information theory's version of the second law of thermodynamics would predict), scribal errors tend to replace less frequent, and hence entropically more information-rich, wordings with more frequent ones. Without measurements, it has been unclear how effectively DLP has been used in the reconstruction of texts, and how effectively it could be used. We analyze a case history of acknowledged editorial excellence that mimics an experiment: the reconstruction of Lucretius's De Rerum Natura, beginning with Lachmann's landmark 1850 edition based on the two oldest manuscripts then known. Treating words as characters in a code, and taking the occurrence frequencies of words from a current, more broadly based edition, we calculate the difference in entropy information between the alternatives in Lachmann's 756 pairs of grammatically acceptable readings. His choices average 0.26±0.20 bits higher in entropy information (95% confidence interval, P = 0.005), as against the single bit that determines the outcome of a coin toss, and the average 2.16±0.10 bits (95%) of (predominantly meaningless) entropy information if the rarer word had always been chosen. As a channel width, 0.26±0.20 bits/word corresponds to a 0.79 (+0.09/−0.15) likelihood of the rarer word being the one accepted in the reference edition, which is consistent with the observed 547/756 = 0.72±0.03 (95%). Statistically informed application of DLP can recover substantial amounts of semantically meaningful entropy information from noise; hence the extension copiosior informatione lectio potior, “the reading richer in information [is] preferable.” New applications of information theory promise continued refinement in the reconstruction of culturally fundamental texts.
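
    The per-word bookkeeping behind such a measurement is simple; a hedged sketch follows, using the abstract's framing that a word of probability p carries -log2(p) bits, with p estimated from occurrence frequencies in a reference text. The smoothing scheme and the names are choices of this sketch, not the study's method.

```python
import math
from collections import Counter

def reading_information_gap(rarer, commoner, reference_words):
    """Difference, in bits, of entropy information between two readings.

    A positive result means `rarer` carries more entropy information, i.e.
    it is the 'more difficult', less expected reading that DLP would favor.
    """
    freq = Counter(reference_words)
    n = len(reference_words)
    vocab = len(freq)

    def bits(word):
        # Add-one smoothing keeps unseen words at a finite information value.
        return -math.log2((freq[word] + 1) / (n + vocab))

    return bits(rarer) - bits(commoner)
```

    Averaging this gap over many editorially decided pairs of readings would reproduce, in spirit, the 0.26-bit figure reported above.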