Entropic analysis of the role of words in literary texts
Beyond the local constraints imposed by grammar, words concatenated in long
sequences carrying a complex message show statistical regularities that may
reflect their linguistic role in the message. In this paper, we perform a
systematic statistical analysis of the use of words in literary English
corpora. We show that there is a quantitative relation between the role of
content words in literary English and the Shannon information entropy defined
over an appropriate probability distribution. Without assuming any previous
knowledge about the syntactic structure of language, we are able to cluster
certain groups of words according to their specific role in the text.
Comment: 9 pages, 5 figures
Statistical keyword detection in literary corpora
Understanding the complexity of human language requires an appropriate
analysis of the statistical distribution of words in texts. We consider the
information retrieval problem of detecting and ranking the relevant words of a
text by means of statistical information referring to the "spatial" use of the
words. Shannon's entropy of information is used as a tool for automatic keyword
extraction. By using The Origin of Species by Charles Darwin as a
representative text sample, we show the performance of our detector and compare
it with other proposals in the literature. The randomly shuffled text receives
special attention as a tool for calibrating the ranking indices.
Comment: Published version. 11 pages, 7 figures. SVJour for LaTeX2e
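The spatial-entropy detector described in the two abstracts above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the choice of ten parts, and the reading that low entropy (clustered use) marks a content-bearing keyword are all assumptions.

```python
import math
import random
from collections import Counter

def keyword_entropy_scores(words, n_parts=10):
    """Score each word by the Shannon entropy of its distribution
    across equal-sized parts of the text. Words clustered in a few
    parts score low; function words spread evenly score near the
    maximum, log2(n_parts)."""
    part_len = len(words) // n_parts
    counts = {}  # word -> list of occurrence counts, one per part
    for i in range(n_parts):
        part = words[i * part_len:(i + 1) * part_len]
        for w, c in Counter(part).items():
            counts.setdefault(w, [0] * n_parts)[i] = c
    scores = {}
    for w, per_part in counts.items():
        total = sum(per_part)
        scores[w] = -sum((c / total) * math.log2(c / total)
                         for c in per_part if c > 0)
    return scores

def shuffled_baseline(words, n_parts=10, seed=0):
    """Calibration step suggested by the abstract: in a randomly
    shuffled copy of the text any clustering disappears, so the
    entropies give a reference for ranking the real ones against."""
    shuffled = words[:]
    random.Random(seed).shuffle(shuffled)
    return keyword_entropy_scores(shuffled, n_parts)
```

Comparing each word's entropy against its shuffled-text value then separates spatially clustered content words from evenly spread function words.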
Thesaurus as a complex network
A thesaurus is one, out of many, possible representations of term (or word)
connectivity. The terms of a thesaurus are seen as the nodes and their
relationship as the links of a directed graph. The directionality of the links
retains all the thesaurus information and allows the measurement of several
quantities. This has led to a new term classification according to the
characteristics of the nodes, for example, nodes with no links in, no links
out, etc. Using an electronically available thesaurus we have obtained the incoming
and outgoing link distributions. While the incoming link distribution follows a
stretched exponential function, the lower bound for the outgoing link
distribution has the same envelope of the scientific paper citation
distribution proposed by Albuquerque and Tsallis. However, a better fit is
obtained by a simpler function, which is the solution of Riccati's differential
equation. We conjecture that this differential equation is the continuous limit
of a stochastic growth model of the thesaurus network. We also propose a new
manner to arrange a thesaurus using the "inversion method".
Comment: Contribution to the Proceedings of 'Trends and Perspectives in
Extensive and Nonextensive Statistical Mechanics', in honour of Constantino
Tsallis' 60th birthday (submitted to Physica A)
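The in- and out-degree measurements this abstract describes reduce to counting links in a directed graph. A minimal sketch follows; the edge-list representation and the function name are assumptions for illustration, not the authors' code.

```python
from collections import Counter

def degree_distributions(edges):
    """Given the directed (term -> related term) links of a thesaurus,
    return the incoming- and outgoing-link distributions: maps from
    degree k to the number of nodes having exactly k such links."""
    out_deg, in_deg, nodes = Counter(), Counter(), set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    # Include nodes with zero links in or out, the classes the
    # abstract singles out ("no links in, no links out, etc.").
    in_dist = Counter(in_deg.get(n, 0) for n in nodes)
    out_dist = Counter(out_deg.get(n, 0) for n in nodes)
    return in_dist, out_dist
```

With real thesaurus data, the two returned distributions are the histograms whose envelopes (stretched exponential, Riccati solution) the abstract fits.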
Towards the quantification of the semantic information encoded in written language
Written language is a complex communication signal capable of conveying
information encoded in the form of ordered sequences of words. Beyond the local
order ruled by grammar, semantic and thematic structures affect long-range
patterns in word usage. Here, we show that a direct application of information
theory quantifies the relationship between the statistical distribution of
words and the semantic content of the text. We show that there is a
characteristic scale, roughly around a few thousand words, which establishes
the typical size of the most informative segments in written language.
Moreover, we find that the words whose contributions to the overall information
are larger are the ones most closely associated with the main subjects and
topics of the text. This scenario can be explained by a model of word usage
that assumes that words are distributed along the text in domains of a
characteristic size where their frequency is higher than elsewhere. Our
conclusions are based on the analysis of a large database of written language,
diverse in subjects and styles, and thus are likely to be applicable to general
language sequences encoding complex information.
Comment: 19 pages, 4 figures
The Science of Art "Faithfully Presented": Entropy in British Victorian Literature
In the chemical world, entropy, or the randomness and chaos of a system, must continually increase; it is much more favorable for things to fall apart than to be put together. This scientific concept can also be rightly applied to the study of literature. While it is true that books contain information put together into some sense of order from chaos, making them counterintuitive to entropy, I am convinced these works must still obey the laws of thermodynamics. There must be an increase in chaos somewhere, and if it is not within the words themselves, it must lie within the ideas they represent, their interpretation by readers, and the deconstruction of the text through literary analysis. In this study, the works of Victorian authors including Charles Dickens and Thomas Hardy are deconstructed into entropic elements, including the lack of a reliable center and the struggle between the compulsion to repeat and the desire for revolution. Examination of entropy in these texts validates their claims to be realistic novels portraying the nuance of authentic life. Entropy applied to literature calls readers to continually deconstruct, wait expectantly for the inevitable ebb and flow of light and darkness, and accept unanswerable questions and incomplete endings. This is real life; this is entropy.
Literary Acoustics
Bringing together sound studies and intermediality theory, this essay revisits the notion of "literary acoustics" to inquire into the usefulness of intermediality studies for analyzing the relations between literature and sound. The second part of the essay is dedicated to an illustrative analysis of Ben Marcus's highly experimental, noisy book The Age of Wire and String.
Mathematical Philology: Entropy Information in Refining Classical Texts' Reconstruction, and Early Philologists' Anticipation of Information Theory
Philologists reconstructing ancient texts from variously miscopied manuscripts anticipated information theorists by centuries in conceptualizing information in terms of probability. An example is the editorial principle difficilior lectio potior (DLP): in choosing between otherwise acceptable alternative wordings in different manuscripts, "the more difficult reading [is] preferable." As philologists at least as early as Erasmus observed (and as information theory's version of the second law of thermodynamics would predict), scribal errors tend to replace less frequent and hence entropically more information-rich wordings with more frequent ones. Without measurements, it has been unclear how effectively DLP has been used in the reconstruction of texts, and how effectively it could be used. We analyze a case history of acknowledged editorial excellence that mimics an experiment: the reconstruction of Lucretius's De Rerum Natura, beginning with Lachmann's landmark 1850 edition based on the two oldest manuscripts then known. Treating words as characters in a code, and taking the occurrence frequencies of words from a current, more broadly based edition, we calculate the difference in entropy information between Lachmann's 756 pairs of grammatically acceptable alternatives. His choices average 0.26±0.20 bits higher in entropy information (95% confidence interval, P = 0.005), as against the single bit that determines the outcome of a coin toss, and the average 2.16±0.10 bits (95%) of (predominantly meaningless) entropy information if the rarer word had always been chosen. As a channel width, 0.26±0.20 bits/word corresponds to a 0.79 (+0.09/−0.15) likelihood of the rarer word being the one accepted in the reference edition, which is consistent with the observed 547/756 = 0.72±0.03 (95%).
Statistically informed application of DLP can recover substantial amounts of semantically meaningful entropy information from noise; hence the extension copiosior informatione lectio potior, "the reading richer in information [is] preferable." New applications of information theory promise continued refinement in the reconstruction of culturally fundamental texts.
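The entropy information this abstract assigns to a word is the standard -log2 of its relative frequency in a reference corpus, and the DLP comparison is the difference of two such values. A toy sketch under those assumptions (the function names and corpus are illustrative, not the authors' code or data):

```python
import math
from collections import Counter

def word_information(corpus_words):
    """Entropy information of each word, -log2(p), where p is the
    word's relative frequency in a reference corpus: rarer words
    carry more bits."""
    total = len(corpus_words)
    freq = Counter(corpus_words)
    return {w: -math.log2(c / total) for w, c in freq.items()}

def dlp_preference(info, word_a, word_b):
    """Information difference in bits between two grammatically
    acceptable alternative readings; positive means word_a is the
    rarer, more information-rich reading that DLP would prefer."""
    return info[word_a] - info[word_b]
```

Averaging such differences over an editor's choices, as the study does over Lachmann's 756 pairs, measures how much entropy information the editorial principle recovered.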