Complex network analysis of literary and scientific texts
We present results from our quantitative study of statistical and network
properties of literary and scientific texts written in two languages: English
and Polish. We show that Polish texts are described by the Zipf law with a
scaling exponent smaller than the one for the English language. We also show
that the scientific texts are typically characterized by rank-frequency
plots with a relatively short range of power-law behavior as compared to the
literary texts. We then transform the texts into their word-adjacency network
representations and find another difference between the languages. For the
majority of the literary texts in both languages, the corresponding networks
revealed a scale-free structure, while this was not always the case for the
scientific texts. However, all the network representations of texts were
hierarchical. In this respect, we do not observe any qualitative or quantitative
difference between the languages. However, if we look at other network
statistics like the clustering coefficient and the average shortest path length,
the English texts turn out to possess a more clustered structure than the Polish
ones. This result was attributed to differences in the grammar of the two
languages, which were also indicated in the Zipf plots. All the texts, however,
show a network structure that differs from any of the Watts-Strogatz, the
Barabási-Albert, and the Erdős-Rényi architectures.
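The quantities named in this abstract (rank-frequency exponent, word-adjacency network, clustering coefficient) can be illustrated with a minimal sketch. It is not the authors' code: the tokenizer, the least-squares fit over all ranks, and every function name below are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def rank_frequency(tokens):
    """Word frequencies sorted in descending order, so index 0 is rank 1."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_exponent(freqs):
    """Least-squares slope of log(frequency) vs log(rank), returned as a positive exponent.

    Assumption: all ranks are fitted; a careful study would restrict the fit
    to the range where power-law behavior actually holds.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx


def adjacency_network(tokens):
    """Undirected network linking every pair of words that occur next to each other."""
    graph = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # ignore immediate repetitions (no self-loops)
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_clustering(graph):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbors count as 0."""
    total = 0.0
    for node, neighbors in graph.items():
        k = len(neighbors)
        if k < 2:
            continue
        nbrs = list(neighbors)
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in graph[nbrs[i]]
        )
        total += links / (k * (k - 1) / 2)
    return total / len(graph)


tokens = tokenize("the cat saw the dog and the dog saw the cat")
freqs = rank_frequency(tokens)
network = adjacency_network(tokens)
print(freqs, zipf_exponent(freqs), average_clustering(network))
```

A real analysis would run this on full texts and compare the resulting statistics against reference network models; the toy sentence only shows the mechanics.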
Statistical Parameters of the Novel "Perekhresni stezhky" ("The Cross-Paths") by Ivan Franko
In the paper, a comprehensive statistical characterization of a Ukrainian novel
is given for the first time. The distribution of word-forms with respect to their
size is studied. The linguistic laws of Zipf-Mandelbrot and Menzerath-Altmann
are analyzed.
Comment: 11 pages
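For orientation, the two laws named in this abstract are commonly written in the forms below. These are the standard textbook forms, not formulas taken from the paper, and the symbols $C$, $a$, $b$, $A$, $\beta$, $c$ are assumed here as fit parameters.

```latex
% Zipf-Mandelbrot law: frequency f of the word-form with rank r
f(r) = \frac{C}{(r + b)^{a}}

% Menzerath-Altmann law: mean constituent size y as a function of construct size x
y(x) = A \, x^{\beta} \, e^{-c x}
```

Setting $b = 0$ in the first expression recovers the plain Zipf law.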
Challenges and solutions for Latin named entity recognition
Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity
Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many downstream applications, from machine translation to digital historiography, enabling Classicists, historians, and archaeologists, for instance, to track
the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree
of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality.
Inference of the Russian drug community from one of the largest social networks in the Russian Federation
The criminal nature of narcotics complicates the direct assessment of a drug
community, while having a good understanding of the type of people drawn to or
currently using drugs is vital for finding effective intervention strategies.
This is of immediate concern especially for the Russian Federation, given the
dramatic increase it has seen in drug abuse since the fall of the Soviet Union
in the early nineties. Using unique data from the Russian social network
'LiveJournal', which has over 39 million registered users worldwide, we were able,
for the first time, to identify the on-line drug community by context-sensitive text
mining of the users' blogs using a dictionary of known drug-related official
and 'slang' terminology. By comparing the interests of the users that most
actively spread information on narcotics over the network with the interests of
the individuals outside the on-line drug community, we found that the 'average'
drug user in the Russian Federation is mostly interested in topics
such as Russian rock, non-traditional medicine, UFOs, Buddhism, yoga and the
occult. We identify three distinct scale-free sub-networks of users which can
be uniquely classified as being either 'infectious', 'susceptible' or 'immune'.
Comment: 12 pages, 11 figures
Psychopathology and Creativity Among Creative and Non-Creative Professions
The mad genius debate has been discussed in both popular culture and academic discourse. The current study sought to replicate previous findings that linked psychopathology to creativity. A total of 165 biographies of eminent professionals (artists, scientists, athletes) were rated on 19 mental disorders using a three-point scale of not present (0), probable (1), and present (2) for potential symptoms. Athletes served as an eminent but not creative comparison group in order to discern whether fame, independent of creativity, was associated with psychopathology. Comparison of proportion analyses were conducted to identify differences of proportion between these three groups for each psychopathology. Tests for one proportion were calculated to compare each group's rates of psychopathology to the rates found in the U.S. population. These analyses were run twice, with subjects dichotomized into present and not present categories: first, with "present" including "probable" (inclusive), and second, with it including only "present" (exclusive). Artists showed greater frequency rates of psychopathology than scientists and athletes under the more inclusive criteria, whereas both artists and athletes showed greater frequency rates than scientists under the stricter criteria. Apart from anxiety disorder, athletes did not differ from the U.S. population in rates of psychopathology, whereas artists differed from the population in terms of alcoholism, anxiety disorder, drug abuse, and depression. These data generally corroborate previous research on the link between creativity and psychopathology.
Letter counting: a stem cell for Cryptology, Quantitative Linguistics, and Statistics
Counting letters in written texts is a very ancient practice. It has
accompanied the development of Cryptology, Quantitative Linguistics, and
Statistics. In Cryptology, counting frequencies of the different characters in
an encrypted message is the basis of the so-called frequency analysis method.
In Quantitative Linguistics, the proportion of vowels to consonants in
different languages was studied long before authorship attribution. In
Statistics, the alternation vowel-consonants was the only example that Markov
ever gave of his theory of chained events. A short history of letter counting
is presented. The three domains, Cryptology, Quantitative Linguistics, and
Statistics, are then examined, focusing on the interactions with the other two
fields through letter counting. As a conclusion, the eclecticism of scholars of
past centuries, their background in humanities, and their familiarity with
cryptograms are identified as contributing factors to the mutual enrichment
process which is described here.
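Two of the counting procedures mentioned in this abstract, character frequencies as used in frequency analysis and the alternation of vowels and consonants studied by Markov, can be sketched as follows. This is only an illustration under stated assumptions: the vowel set is the English one, non-letters are dropped so transitions run across word boundaries, and all names are hypothetical.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: English vowels only; other languages need another set


def letter_frequencies(text):
    """Relative frequency of each letter, ignoring case and non-letter characters."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    return {c: n / total for c, n in Counter(letters).items()}


def vowel_consonant_transitions(text):
    """Estimate P(next class | previous class) for the classes 'V' and 'C'.

    Assumption: spaces and punctuation are removed first, so a transition may
    span a word boundary.
    """
    classes = ["V" if c in VOWELS else "C" for c in text.lower() if c.isalpha()]
    counts = Counter(zip(classes, classes[1:]))
    probs = {}
    for prev in "VC":
        total = counts[(prev, "V")] + counts[(prev, "C")]
        if total:  # skip classes that never appear as a predecessor
            probs[prev] = {nxt: counts[(prev, nxt)] / total for nxt in "VC"}
    return probs


print(letter_frequencies("abca"))
print(vowel_consonant_transitions("banana"))
```

In frequency analysis, the first table would be compared with the known letter distribution of the suspected plaintext language; the second table is a two-state transition matrix of the kind a Markov chain describes.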