127 research outputs found
Text authorship identified using the dynamics of word co-occurrence networks
The identification of authorship in disputed documents still requires human
expertise, which is now unfeasible for many tasks owing to the large volumes of
text and authors in practical applications. In this study, we introduce a
methodology based on the dynamics of word co-occurrence networks representing
written texts to classify a corpus of 80 texts by 8 authors. The texts were
divided into sections with equal numbers of linguistic tokens, from which time
series were created for 12 topological metrics. The series were found to be
stationary (p-value > 0.05), which permits the use of distribution moments as
learning attributes. With an optimized supervised learning procedure using a
Radial Basis Function Network, 68 out of 80 texts were correctly classified,
i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in
purely dynamic network metrics were found to characterize authorship, thus
opening the way for the description of texts in terms of small evolving
networks. Moreover, the approach introduced allows for comparison of texts with
diverse characteristics in a simple, fast fashion.
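The pipeline described in this abstract (equal-sized sections, one network per section, a time series per metric, distribution moments as attributes) can be illustrated with a minimal sketch. It is not the authors' implementation: the adjacency-based co-occurrence rule, the choice of average degree as the single metric, the toy sentence, and all function names are assumptions made for illustration, and the classifier step is omitted.

```python
from statistics import mean, pvariance


def cooccurrence_edges(tokens):
    """Undirected edges linking adjacent, distinct tokens (assumed rule)."""
    return {frozenset((a, b)) for a, b in zip(tokens, tokens[1:]) if a != b}


def average_degree(tokens):
    """Average degree of the co-occurrence network built from `tokens`."""
    nodes = set(tokens)
    if not nodes:
        return 0.0
    return 2 * len(cooccurrence_edges(tokens)) / len(nodes)


def metric_series(tokens, window):
    """One metric value per consecutive, non-overlapping section of
    `window` tokens; a trailing partial section is dropped."""
    n_sections = len(tokens) // window
    return [
        average_degree(tokens[i * window:(i + 1) * window])
        for i in range(n_sections)
    ]


def moment_features(series):
    """First two distribution moments of a metric time series."""
    return {"mean": mean(series), "variance": pvariance(series)}


# Toy example (hypothetical text, not from the corpus in the study).
tokens = "the cat sat on the mat and the dog sat on the rug".split()
series = metric_series(tokens, window=4)
features = moment_features(series)
print(series, features)
```

In a fuller version, each of the 12 metrics would contribute its own series and moments, and the resulting attribute vectors would be passed to a supervised classifier.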
Three-feature model to reproduce the topology of citation networks and the effects from authors' visibility on their h-index
Various factors are believed to govern the selection of references in
citation networks, but a precise, quantitative determination of their
importance has remained elusive. In this paper, we show that three factors can
account for the referencing pattern of citation networks for two topics, namely
"graphenes" and "complex networks", thus allowing one to reproduce the
topological features of the networks built with papers being the nodes and the
edges established by citations. The most relevant factor was content
similarity, while the other two, in-degree (i.e. citation counts) and age of
publication, had varying importance depending on the topic studied. This
dependence indicates that additional factors could play a role. Indeed, by
intuition one should expect the reputation (or visibility) of authors and/or
institutions to affect the referencing pattern, and this is only indirectly
considered via the in-degree that should correlate with such reputation.
Because information on reputation is not readily available, we simulated its
effect on artificial citation networks considering two communities with
distinct fitness (visibility) parameters. One community was assumed to have
twice the fitness value of the other, which amounts to double the probability of
a paper being cited. While the h-index for authors in the community with larger
fitness evolved with time with slightly higher values than for the control
network (no fitness considered), a drastic effect was noted for the community
with smaller fitness.
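The fitness experiment described in this abstract can be sketched as a toy simulation. The growth rule below (citation probability proportional to fitness times in-degree plus one, with papers alternating between two communities) is an assumption chosen for brevity; it ignores content similarity and age of publication, and it is not the model used in the paper. The `h_index` function follows the standard definition and is applied here to a plain list of citation counts.

```python
import random


def h_index(citations):
    """Largest h such that at least h items have >= h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def simulate_citations(n_papers, refs_per_paper, fitness, seed=0):
    """Grow a toy citation network with two communities.

    Assumed rule: an existing paper j is cited with probability
    proportional to fitness[community[j]] * (in_degree[j] + 1).
    Returns the community label and in-degree of every paper.
    """
    rng = random.Random(seed)
    community, in_degree = [], []
    for new in range(n_papers):
        if new > 0:
            weights = [
                fitness[community[j]] * (in_degree[j] + 1) for j in range(new)
            ]
            cited = set()
            # Draw until the new paper has the required distinct references.
            while len(cited) < min(refs_per_paper, new):
                cited.add(rng.choices(range(new), weights=weights)[0])
            for j in cited:
                in_degree[j] += 1
        community.append(new % 2)  # papers alternate between communities
        in_degree.append(0)
    return community, in_degree


# Community 0 has twice the fitness of community 1 (as in the abstract).
community, in_degree = simulate_citations(50, 3, fitness={0: 2.0, 1: 1.0})
for c in (0, 1):
    counts = [d for d, label in zip(in_degree, community) if label == c]
    print(c, h_index(counts))
```

Here the h-index is computed per community rather than per author, since the sketch does not model authorship; assigning papers to authors would be a further assumption.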
Structure-semantics interplay in complex networks and its effects on the predictability of similarity in texts
There are different ways to define similarity for grouping similar texts into
clusters, as the concept of similarity may depend on the purpose of the task.
For instance, in topic extraction similar texts mean those within the same
semantic field, whereas in author recognition stylistic features should be
considered. In this study, we introduce ways to classify texts employing
concepts of complex networks, which may be able to capture syntactic, semantic
and even pragmatic features. The interplay between the various metrics of the
complex networks is analyzed with three applications, namely identification of
machine translation (MT) systems, evaluation of quality of machine translated
texts and authorship recognition. We shall show that topological features of
the networks representing texts can enhance the ability to identify MT systems
in particular cases. For evaluating the quality of MT texts, on the other hand,
high correlation was obtained with methods capable of capturing the semantics.
This was expected because the gold standards used are themselves based on
word co-occurrence. Notwithstanding, the Katz similarity, which involves
semantics and structure in the comparison of texts, achieved the highest
correlation with the NIST measurement, indicating that in some cases the
combination of both approaches can improve the ability to quantify quality in
MT. In authorship recognition, again the topological features were relevant in
some contexts, though for the books and authors analyzed good results were
obtained with semantic features as well. Because hybrid approaches encompassing
semantic and topological features have not been extensively used, we believe
that the methodology proposed here may be useful to enhance text classification
considerably, as it combines well-established strategies.
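For readers unfamiliar with the Katz similarity mentioned in this abstract, the sketch below computes the standard node-level Katz index, which sums the walks of every length between two nodes and damps a walk of length l by beta to the power l. How the study adapts this index to compare whole texts is not stated here, so the code shows only the textbook definition; the truncation length, the `beta` value, and the three-node example graph are assumptions.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def katz_index(adj, beta, max_len=50):
    """Truncated Katz index: sum over l = 1..max_len of beta**l * adj**l.

    The infinite series converges only when beta is smaller than the
    reciprocal of the largest eigenvalue of `adj`.
    """
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_len):
        # power now holds beta**l * adj**l for the current walk length l
        power = [[beta * x for x in row] for row in mat_mul(power, adj)]
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
    return total


# Path graph 0 - 1 - 2 (hypothetical example).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
beta = 0.1
scores = katz_index(adj, beta)
print(scores[0][1], scores[0][2])
```

Directly linked nodes receive a higher score than nodes connected only through an intermediate node, which is the structural component of the similarity.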