8,659 research outputs found

    Complex network analysis of literary and scientific texts

    We present results from a quantitative study of the statistical and network properties of literary and scientific texts written in two languages: English and Polish. We show that Polish texts are described by the Zipf law with a scaling exponent smaller than the one for English. We also show that scientific texts are typically characterized by rank-frequency plots with a relatively short range of power-law behavior compared to literary texts. We then transform the texts into their word-adjacency network representations and find another difference between the languages. For the majority of the literary texts in both languages, the corresponding networks revealed a scale-free structure, while this was not always the case for the scientific texts. However, all the network representations of texts were hierarchical. In these respects, we do not observe any qualitative or quantitative difference between the languages. However, if we look at other network statistics, such as the clustering coefficient and the average shortest path length, the English texts appear to possess a more clustered structure than the Polish ones. This result was attributed to differences in the grammar of the two languages, which was also indicated in the Zipf plots. All the texts, however, show a network structure that differs from any of the Watts-Strogatz, the Barabasi-Albert, and the Erdos-Renyi architectures.
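    The two representations named in this abstract, the rank-frequency (Zipf) plot and the word-adjacency network, can be illustrated with a minimal Python sketch. The function names, the tokenization rule, and the plain least-squares fit on log-log axes are assumptions made for illustration; the abstract does not say how the authors tokenized the texts or estimated the scaling exponent.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters only (a simplification;
    # the pattern also matches non-ASCII letters such as Polish diacritics).
    return re.findall(r"[^\W\d_]+", text.lower())


def rank_frequency(tokens):
    # Word counts in descending order; index 0 corresponds to rank 1.
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_exponent(freqs):
    # Least-squares slope of log(frequency) against log(rank), sign-flipped
    # so that f(r) ~ r**(-alpha) yields a positive alpha.
    # Assumes at least two ranks; this is a rough estimate only.
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def word_adjacency_edges(tokens):
    # One undirected edge per pair of distinct words that occur consecutively.
    return {frozenset(pair) for pair in zip(tokens, tokens[1:]) if pair[0] != pair[1]}
```

    Statistics such as the clustering coefficient or the average shortest path length would then be computed on the resulting edge set, typically with a graph library.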

    Statistical Parameters of the Novel "Perekhresni stezhky" ("The Cross-Paths") by Ivan Franko

    In the paper, a comprehensive statistical characterization of a Ukrainian novel is given for the first time. The distribution of word-forms with respect to their size is studied. The linguistic laws of Zipf-Mandelbrot and Altmann-Menzerath are analyzed. Comment: 11 pages

    Challenges and solutions for Latin named entity recognition

    Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many downstream applications, from machine translation to digital historiography, allowing classicists, historians, and archaeologists, for instance, to track the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality.

    Inference of the Russian drug community from one of the largest social networks in the Russian Federation

    The criminal nature of narcotics complicates the direct assessment of a drug community, while a good understanding of the type of people drawn to or currently using drugs is vital for finding effective intervention strategies. This is of immediate concern for the Russian Federation in particular, given the dramatic increase it has seen in drug abuse since the fall of the Soviet Union in the early nineties. Using unique data from the Russian social network 'LiveJournal', with over 39 million registered users worldwide, we were able for the first time to identify the on-line drug community by context-sensitive text mining of the users' blogs using a dictionary of known drug-related official and 'slang' terminology. By comparing the interests of the users that most actively spread information on narcotics over the network with the interests of the individuals outside the on-line drug community, we found that the 'average' drug user in the Russian Federation is mostly interested in topics such as Russian rock, non-traditional medicine, UFOs, Buddhism, yoga and the occult. We identify three distinct scale-free sub-networks of users which can be uniquely classified as being either 'infectious', 'susceptible' or 'immune'. Comment: 12 pages, 11 figures

    Psychopathology and Creativity Among Creative and Non-Creative Professions

    The mad genius debate has long been discussed in both popular culture and academic discourse. The current study sought to replicate previous findings that linked psychopathology to creativity. A total of 165 biographies of eminent professionals (artists, scientists, athletes) were rated on 19 mental disorders using a three-point scale of not present (0), probable (1), and present (2) for potential symptoms. Athletes served as an eminent but not creative comparison group in order to discern whether fame, independent of creativity, was associated with psychopathology. Comparison-of-proportions analyses were conducted to identify differences in proportion among these three groups for each psychopathology. Tests for one proportion were calculated to compare each group’s rates of psychopathology to the rates found in the U.S. population. These analyses were run twice, with subjects dichotomized into present and not present categories: first with “present” including “probable” (inclusive), and second with “present” including only “present” (exclusive). Artists showed higher rates of psychopathology than scientists and athletes under the inclusive criterion, whereas both artists and athletes showed higher rates than scientists under the stricter criterion. Apart from anxiety disorder, athletes did not differ from the U.S. population in rates of psychopathology, whereas artists differed from the population in terms of alcoholism, anxiety disorder, drug abuse, and depression. These data generally corroborate previous research on the link between creativity and psychopathology.
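    The inclusive and exclusive dichotomization and the comparison of a group's rate with a population rate can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the normal-approximation z-test is only one common form of a test for one proportion; the abstract does not say which test the study used.

```python
import math


def dichotomize(rating, inclusive):
    # Ratings: 0 = not present, 1 = probable, 2 = present.
    # Inclusive coding counts "probable" as present; exclusive coding does not.
    threshold = 1 if inclusive else 2
    return rating >= threshold


def one_proportion_z(successes, n, p0):
    # Normal-approximation test of an observed proportion against a
    # reference rate p0 (assumed here; an exact binomial test is an alternative).
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)) expressed through erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

    Running the same test once on inclusively coded ratings and once on exclusively coded ratings mirrors the two passes described in the abstract.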

    Letter counting: a stem cell for Cryptology, Quantitative Linguistics, and Statistics

    Counting letters in written texts is a very ancient practice. It has accompanied the development of Cryptology, Quantitative Linguistics, and Statistics. In Cryptology, counting the frequencies of the different characters in an encrypted message is the basis of the so-called frequency analysis method. In Quantitative Linguistics, the proportion of vowels to consonants in different languages was studied long before authorship attribution. In Statistics, the alternation of vowels and consonants was the only example that Markov ever gave of his theory of chained events. A short history of letter counting is presented. The three domains, Cryptology, Quantitative Linguistics, and Statistics, are then examined, focusing on the interactions with the other two fields through letter counting. In conclusion, the eclecticism of scholars of past centuries, their background in the humanities, and their familiarity with cryptograms are identified as contributing factors to the mutual enrichment process described here.
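    Letter counting as used in frequency analysis, and the proportion of vowels among letters, can be sketched in a few lines of Python. The function names and the restriction to ASCII letters with the vowel set "aeiou" are assumptions chosen to keep the example small; they are not taken from the paper.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: English vowels only, 'y' treated as a consonant


def ascii_letters(text):
    # Keep ASCII letters only, lowercased; digits, spaces and punctuation are dropped.
    return [c for c in text.lower() if c.isascii() and c.isalpha()]


def letter_frequencies(text):
    # Relative frequency of each letter, the basic quantity in frequency analysis.
    letters = ascii_letters(text)
    if not letters:
        return {}
    counts = Counter(letters)
    return {letter: count / len(letters) for letter, count in counts.items()}


def vowel_ratio(text):
    # Share of vowels among all letters in the text.
    letters = ascii_letters(text)
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)
```

    For a message encrypted with a simple substitution cipher, the most frequent ciphertext letters from `letter_frequencies` would be compared against the known letter frequencies of the presumed language.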