    Entropic analysis of the role of words in literary texts

    Beyond the local constraints imposed by grammar, words concatenated in long sequences carrying a complex message show statistical regularities that may reflect their linguistic role in the message. In this paper, we perform a systematic statistical analysis of the use of words in literary English corpora. We show that there is a quantitative relation between the role of content words in literary English and the Shannon information entropy defined over an appropriate probability distribution. Without assuming any prior knowledge about the syntactic structure of language, we are able to cluster certain groups of words according to their specific role in the text. Comment: 9 pages, 5 figures
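
    To make the idea concrete, here is a minimal sketch in Python of an entropy measure of this kind: the text is split into equal parts and each word's Shannon entropy is computed over its occurrence distribution across the parts. The function name, the number of parts, and the normalization are choices of this illustration, not the paper's exact definitions.

```python
import math
from collections import Counter

def word_part_entropy(text_words, num_parts=8):
    """Normalized Shannon entropy of each word's spread across text parts.

    Words used evenly throughout the text (typically function words) score
    near 1; words clumped in a few parts (typically topical content words)
    score lower, which is the kind of statistical signal that allows
    grouping words by role without any syntactic knowledge.
    """
    part_len = max(1, len(text_words) // num_parts)
    occurrences = {}
    for i, word in enumerate(text_words):
        part = min(i // part_len, num_parts - 1)
        occurrences.setdefault(word, Counter())[part] += 1

    entropies = {}
    for word, counts in occurrences.items():
        total = sum(counts.values())
        h = -sum((n / total) * math.log2(n / total) for n in counts.values())
        entropies[word] = h / math.log2(num_parts)  # normalize to [0, 1]
    return entropies
```

    Sorting the returned dictionary by value then separates the evenly spread words from the clustered ones.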

    Statistical keyword detection in literary corpora

    Understanding the complexity of human language requires an appropriate analysis of the statistical distribution of words in texts. We consider the information-retrieval problem of detecting and ranking the relevant words of a text by means of statistical information referring to the "spatial" use of the words. Shannon's information entropy is used as a tool for automatic keyword extraction. Using The Origin of Species by Charles Darwin as a representative text sample, we show the performance of our detector and compare it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices. Comment: Published version. 11 pages, 7 figures. SVJour for LaTeX2
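
    The calibration against a shuffled text can be sketched as follows: compute each word's across-parts entropy in the real text and in a random shuffle of it, then rank words by the entropy drop. The names and the choice of 16 parts are assumptions of this sketch; the paper's exact estimator differs in detail.

```python
import math
import random
from collections import Counter

def part_entropies(words, num_parts):
    """Shannon entropy of each word's distribution over equal text parts."""
    part_len = max(1, len(words) // num_parts)
    per_word = {}
    for i, word in enumerate(words):
        part = min(i // part_len, num_parts - 1)
        per_word.setdefault(word, Counter())[part] += 1
    entropies = {}
    for word, counts in per_word.items():
        total = sum(counts.values())
        entropies[word] = -sum((n / total) * math.log2(n / total)
                               for n in counts.values())
    return entropies

def keyword_ranking(words, num_parts=16, seed=0):
    """Rank words by how much lower their entropy is than chance predicts.

    The randomly shuffled text is the calibration baseline: a relevant word
    is distributed far less evenly in the real text than in the shuffle, so
    a large entropy drop marks a keyword candidate.
    """
    real = part_entropies(words, num_parts)
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    baseline = part_entropies(shuffled, num_parts)
    return sorted(((baseline[w] - h, w) for w, h in real.items()),
                  reverse=True)
```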

    Thesaurus as a complex network

    A thesaurus is one of many possible representations of term (or word) connectivity. The terms of a thesaurus are seen as the nodes, and their relationships as the links, of a directed graph. The directionality of the links retains all the thesaurus information and allows the measurement of several quantities. This has led to a new term classification according to the characteristics of the nodes, for example, nodes with no links in, no links out, etc. Using an electronically available thesaurus, we have obtained the incoming and outgoing link distributions. While the incoming link distribution follows a stretched exponential function, the lower bound for the outgoing link distribution has the same envelope as the scientific-paper citation distribution proposed by Albuquerque and Tsallis. However, a better fit is obtained by a simpler function, which is the solution of Riccati's differential equation. We conjecture that this differential equation is the continuous limit of a stochastic growth model of the thesaurus network. We also propose a new way to arrange a thesaurus, using the "inversion method". Comment: Contribution to the Proceedings of 'Trends and Perspectives in Extensive and Nonextensive Statistical Mechanics', in honour of Constantino Tsallis' 60th birthday (submitted to Physica A)
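
    The graph construction itself is straightforward; a sketch of how the incoming and outgoing link distributions could be tallied from a list of thesaurus entries follows. The edge format and function name are assumptions of this illustration; fitting the stretched exponential or the Riccati-equation solution to the resulting histograms is a separate step not shown here.

```python
from collections import Counter

def degree_distributions(edges):
    """In- and out-degree distributions of a directed thesaurus graph.

    `edges` is an iterable of (term, related_term) pairs, e.g. parsed from
    an electronic thesaurus. Each returned Counter maps a degree k to the
    number of nodes with that degree, including the nodes with no links in
    or no links out that the term classification above singles out.
    """
    in_degree, out_degree = Counter(), Counter()
    nodes = set()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
        nodes.update((source, target))
    in_dist = Counter(in_degree[n] for n in nodes)   # Counter gives 0 for sources
    out_dist = Counter(out_degree[n] for n in nodes)  # and for sinks likewise
    return in_dist, out_dist
```

    For example, degree_distributions([("big", "large"), ("large", "big"), ("big", "great")]) reports one node of out-degree 2, one of out-degree 1, and one sink.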

    Towards the quantification of the semantic information encoded in written language

    Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contributions to the overall information are larger are the ones most closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information. Comment: 19 pages, 4 figures
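
    One way to probe for such a characteristic scale is sketched below: for several segment sizes, weight each word's Kullback-Leibler divergence from the uniform segment distribution by its frequency and sum over the vocabulary. This is an assumption-laden illustration (the names, the scales, and the absence of a shuffled-text correction for finite-size bias are all choices of the sketch), not the paper's estimator.

```python
import math
from collections import Counter

def placement_information(words, scales=(250, 1000, 4000, 16000)):
    """Frequency-weighted information in word placement at several scales.

    For each segment size s, the text is cut into segments of s words and
    each word contributes the KL divergence between its observed segment
    distribution and the uniform one a shuffled text would approach. A
    maximum over s would point to the typical size of the most informative
    segments.
    """
    n = len(words)
    freq = Counter(words)
    results = {}
    for s in scales:
        num_segments = max(2, n // s)
        usable = num_segments * s  # drop the ragged tail of the text
        per_word = {}
        for i, word in enumerate(words[:usable]):
            per_word.setdefault(word, Counter())[i // s] += 1
        total = 0.0
        for word, counts in per_word.items():
            count_sum = sum(counts.values())
            # KL divergence from the uniform distribution over segments
            kl = sum((c / count_sum) * math.log2((c / count_sum) * num_segments)
                     for c in counts.values())
            total += (freq[word] / n) * kl
        results[s] = total
    return results
```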

    The Science of Art “Faithfully Presented”: Entropy in British Victorian Literature

    In the chemical world, entropy, or the randomness and chaos of a system, must continually increase; it is much more favorable for things to fall apart than to be put together. This scientific concept can also be rightly applied to the study of literature. While it is true that books assemble information into some sense of order out of chaos, seemingly defying entropy, I am convinced these works must still obey the laws of thermodynamics. There must be an increase in chaos somewhere, and if it is not within the words themselves, it must lie within the ideas they represent, their interpretation by readers, and the deconstruction of the text through literary analysis. In this study, the works of Victorian authors including Charles Dickens and Thomas Hardy are deconstructed into entropic elements, including the lack of a reliable center and the struggle between the compulsion to repeat and the desire for revolution. Examination of entropy in these texts validates their claims to be realistic novels portraying the nuance of authentic life. Entropy applied to literature calls readers to continually deconstruct, to wait expectantly for the inevitable ebb and flow of light and darkness, and to accept unanswerable questions and incomplete endings. This is real life; this is entropy.

    Literary Acoustics

    Bringing together sound studies and intermediality theory, this essay revisits the notion of ‘literary acoustics’ to inquire into the usefulness of intermediality studies for analyzing the relations between literature and sound. The second part of the essay is dedicated to an illustrative analysis of Ben Marcus’s highly experimental, noisy book The Age of Wire and String.

    Translation and the Arrow of Time


    Mathematical Philology: Entropy Information in Refining Classical Texts' Reconstruction, and Early Philologists' Anticipation of Information Theory

    Philologists reconstructing ancient texts from variously miscopied manuscripts anticipated information theorists by centuries in conceptualizing information in terms of probability. An example is the editorial principle difficilior lectio potior (DLP): in choosing between otherwise acceptable alternative wordings in different manuscripts, “the more difficult reading [is] preferable.” As philologists at least as early as Erasmus observed (and as information theory's version of the second law of thermodynamics would predict), scribal errors tend to replace less frequent, and hence entropically more information-rich, wordings with more frequent ones. Without measurements, it has been unclear how effectively DLP has been used in the reconstruction of texts, and how effectively it could be used. We analyze a case history of acknowledged editorial excellence that mimics an experiment: the reconstruction of Lucretius's De Rerum Natura, beginning with Lachmann's landmark 1850 edition based on the two oldest manuscripts then known. Treating words as characters in a code, and taking the occurrence frequencies of words from a current, more broadly based edition, we calculate the difference in entropy information between the alternatives in Lachmann's 756 pairs of grammatically acceptable readings. His choices average 0.26±0.20 bits higher in entropy information (95% confidence interval, P = 0.005), as against the single bit that determines the outcome of a coin toss, and the average 2.16±0.10 bits (95%) of (predominantly meaningless) entropy information if the rarer word had always been chosen. As a channel width, 0.26±0.20 bits/word corresponds to a 0.79 (+0.09/−0.15) likelihood of the rarer word being the one accepted in the reference edition, which is consistent with the observed 547/756 = 0.72±0.03 (95%). Statistically informed application of DLP can recover substantial amounts of semantically meaningful entropy information from noise; hence the extension copiosior informatione lectio potior, “the reading richer in information [is] preferable.” New applications of information theory promise continued refinement in the reconstruction of culturally fundamental texts.
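
    The per-word bookkeeping behind such a measurement is simple; a hedged sketch follows, using the abstract's framing that a word of probability p carries -log2(p) bits, with p estimated from occurrence frequencies in a reference text. The smoothing scheme and the names are choices of this sketch, not the study's method.

```python
import math
from collections import Counter

def reading_information_gap(rarer, commoner, reference_words):
    """Difference, in bits, of entropy information between two readings.

    A positive result means `rarer` carries more entropy information, i.e.
    it is the 'more difficult', less expected reading that DLP would favor.
    """
    freq = Counter(reference_words)
    n = len(reference_words)
    vocab = len(freq)

    def bits(word):
        # Add-one smoothing keeps unseen words at a finite information value.
        return -math.log2((freq[word] + 1) / (n + vocab))

    return bits(rarer) - bits(commoner)
```

    Averaging this gap over many editorially decided pairs of readings would reproduce, in spirit, the 0.26-bit figure reported above.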