Search CORE

119 research outputs found

Probing the topological properties of complex networks modeling short written texts

Author: Amancio Diego R.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 29/12/2014
Field of study

In recent years, graph theory has been widely employed to probe several language properties. More specifically, the so-called word adjacency model has been proven useful for tackling several practical problems, especially those relying on textual stylistic analysis. The most common approach to treat texts as networks has simply considered either large pieces of texts or entire books. This approach has certainly worked well -- many informative discoveries have been made this way -- but it raises an uncomfortable question: could there be important topological patterns in small pieces of texts? To address this problem, the topological properties of subtexts sampled from entire books was probed. Statistical analyzes performed on a dataset comprising 50 novels revealed that most of the traditional topological measurements are stable for short subtexts. When the performance of the authorship recognition task was analyzed, it was found that a proper sampling yields a discriminability similar to the one found with full texts. Surprisingly, the support vector machine classification based on the characterization of short texts outperformed the one performed with entire books. These findings suggest that a local topological analysis of large documents might improve its global characterization. Most importantly, it was verified, as a proof of principle, that short texts can be analyzed with the methods and concepts of complex networks. As a consequence, the techniques described here can be extended in a straightforward fashion to analyze texts as time-varying complex networks

arXiv.org e-Print Archive

Public Library of Science (PLOS)

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Directory of Open Access Journals

PubMed Central

Universidade de São Paulo

FigShare

A complex network approach to stylometry

Author: Amancio Diego R.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2015
Field of study

Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems proved useful to create several language models. Despite the large amount of studies devoted to represent texts with physical models, only a limited number of studies have shown how the properties of the underlying physical systems can be employed to improve the performance of natural language processing tasks. In this paper, I address this problem by devising complex networks methods that are able to improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, the performance obtained with hybrid approaches outperformed the results obtained when only traditional or networked methods were used. Because the proposed model is generic, the framework devised here could be straightforwardly used to study similar textual applications where the topology plays a pivotal role in the description of the interacting agents.Comment: PLoS ONE, 2015 (to appear

arXiv.org e-Print Archive

Public Library of Science (PLOS)

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Directory of Open Access Journals

Universidade de São Paulo

FigShare

Text authorship identified using the dynamics of word co-occurrence networks

Author: Akimushkin Camilo
Amancio Diego R.
Oliveira Jr Osvaldo N.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 29/07/2016
Field of study

The identification of authorship in disputed documents still requires human expertise, which is now unfeasible for many tasks owing to the large volumes of text and authors in practical applications. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with equal number of linguistic tokens, from which time series were created for 12 topological metrics. The series were proven to be stationary (p-value>0.05), which permits to use distribution moments as learning attributes. With an optimized supervised learning procedure using a Radial Basis Function Network, 68 out of 80 texts were correctly classified, i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in purely dynamic network metrics were found to characterize authorship, thus opening the way for the description of texts in terms of small evolving networks. Moreover, the approach introduced allows for comparison of texts with diverse characteristics in a simple, fast fashion

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

FigShare

Comparing the writing style of real and artificial papers

Author: Amancio Diego R.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

Recent years have witnessed the increase of competition in science. While promoting the quality of research in many cases, an intense competition among scientists can also trigger unethical scientific behaviors. To increase the total number of published papers, some authors even resort to software tools that are able to produce grammatical, but meaningless scientific manuscripts. Because automatically generated papers can be misunderstood as real papers, it becomes of paramount importance to develop means to identify these scientific frauds. In this paper, I devise a methodology to distinguish real manuscripts from those generated with SCIGen, an automatic paper generator. Upon modeling texts as complex networks (CN), it was possible to discriminate real from fake papers with at least 89\% of accuracy. A systematic analysis of features relevance revealed that the accessibility and betweenness were useful in particular cases, even though the relevance depended upon the dataset. The successful application of the methods described here show, as a proof of principle, that network features can be used to identify scientific gibberish papers. In addition, the CN-based approach can be combined in a straightforward fashion with traditional statistical language processing methods to improve the performance in identifying artificially generated papers.Comment: To appear in Scientometrics (2015

arXiv.org e-Print Archive

Universidade de São Paulo