Probing the topological properties of complex networks modeling short written texts
In recent years, graph theory has been widely employed to probe several
language properties. More specifically, the so-called word adjacency model has
been proven useful for tackling several practical problems, especially those
relying on textual stylistic analysis. The most common approach to treat texts
as networks has simply considered either large pieces of texts or entire books.
This approach has certainly worked well -- many informative discoveries have
been made this way -- but it raises an uncomfortable question: could there be
important topological patterns in small pieces of texts? To address this
problem, the topological properties of subtexts sampled from entire books were
probed. Statistical analyses performed on a dataset comprising 50 novels
revealed that most of the traditional topological measurements are stable for
short subtexts. When the performance of the authorship recognition task was
analyzed, it was found that a proper sampling yields a discriminability similar
to the one found with full texts. Surprisingly, the support vector machine
classification based on the characterization of short texts outperformed the
one performed with entire books. These findings suggest that a local
topological analysis of large documents might improve their global
characterization. Most importantly, it was verified, as a proof of principle,
that short texts can be analyzed with the methods and concepts of complex
networks. As a consequence, the techniques described here can be extended in a
straightforward fashion to analyze texts as time-varying complex networks.
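As a minimal illustration of the word adjacency model mentioned above (a sketch only — studies of this kind typically also remove stopwords and lemmatize before linking words), consecutive words are connected and a basic topological measurement is taken on the resulting graph:

```python
from collections import defaultdict

def word_adjacency_network(text):
    """Build an undirected word-adjacency network: each distinct word is a
    node and words appearing consecutively in the text are linked."""
    words = text.lower().split()
    edges = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:
            edges[a].add(b)
            edges[b].add(a)
    return edges

def mean_degree(net):
    """Average number of neighbours per node, a basic topological measurement."""
    return sum(len(nbrs) for nbrs in net.values()) / len(net)

net = word_adjacency_network("the quick brown fox jumps over the lazy dog")
print(sorted(net["the"]))   # ['lazy', 'over', 'quick']
print(mean_degree(net))     # 2.0
```

On short subtexts, measurements such as the mean degree computed this way are the quantities whose stability the study above evaluates.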
Comparing the writing style of real and artificial papers
Recent years have witnessed the increase of competition in science. While
promoting the quality of research in many cases, an intense competition among
scientists can also trigger unethical scientific behaviors. To increase the
total number of published papers, some authors even resort to software tools
that are able to produce grammatical, but meaningless scientific manuscripts.
Because automatically generated papers can be mistaken for real ones, it
becomes of paramount importance to develop means to identify these scientific
frauds. In this paper, I devise a methodology to distinguish real manuscripts
from those generated with SCIGen, an automatic paper generator. Upon modeling
texts as complex networks (CN), it was possible to discriminate real from fake
papers with at least 89% accuracy. A systematic analysis of feature
relevance revealed that accessibility and betweenness were useful in
particular cases, even though the relevance depended upon the dataset. The
successful application of the methods described here shows, as a proof of
principle, that network features can be used to identify scientific gibberish
papers. In addition, the CN-based approach can be combined in a straightforward
fashion with traditional statistical language processing methods to improve the
performance in identifying artificially generated papers.
Comment: To appear in Scientometrics (2015).
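Betweenness, one of the informative network features named above, can be computed exactly with Brandes' algorithm. The sketch below is a generic implementation for unweighted, undirected graphs (the toy graph is hypothetical; the paper's full feature pipeline is not reproduced):

```python
from collections import defaultdict, deque

def betweenness(adj):
    """Brandes' algorithm for betweenness centrality on an unweighted,
    undirected graph given as {node: list of neighbours}."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], defaultdict(list)
        sigma = dict.fromkeys(adj, 0.0)
        sigma[s] = 1.0
        dist = dict.fromkeys(adj, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:                       # BFS, counting shortest paths
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                       # back-propagate path dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: c / 2 for v, c in bc.items()}   # undirected: halve the counts

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = betweenness(adj)
print(result)   # 'b' bridges the only a-c shortest path: score 1.0
```

A feature vector built from such centralities per word node is the kind of input a classifier could use to separate real from generated manuscripts.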
Extractive Multi-document Summarization Using Multilayer Networks
Huge volumes of textual information are produced every single day. In
order to organize and understand such large datasets, in recent years,
summarization techniques have become popular. These techniques aim at finding
relevant, concise, and non-redundant content in such large collections. While network
methods have been adopted to model texts in some scenarios, a systematic
evaluation of multilayer network models in the multi-document summarization
task has been limited to a few studies. Here, we evaluate the performance of a
multilayer-based method to select the most relevant sentences in the context of
an extractive multi-document summarization (MDS) task. In the adopted model,
nodes represent sentences and edges are created based on the number of shared
words between sentences. Differently from previous studies in multi-document
summarization, we make a distinction between edges linking sentences from
different documents (inter-layer) and those connecting sentences from the same
document (intra-layer). As a proof of principle, our results reveal that such a
discrimination between intra- and inter-layer in a multilayered representation
is able to improve the quality of the generated summaries. This piece of
information could be used to improve current statistical methods and related
textual models.
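The intra-/inter-layer distinction described above can be sketched as follows (a simplified illustration over hypothetical toy documents; a real system would preprocess the text and weight edges more carefully):

```python
def sentence_network(documents):
    """Sentences are nodes; an edge's weight is the number of shared words,
    tagged 'intra' (same document) or 'inter' (different documents)."""
    sentences = [(d, set(s.lower().split()))
                 for d, doc in enumerate(documents)
                 for s in doc.split(". ")]
    edges = []
    for i, (di, wi) in enumerate(sentences):
        for j in range(i + 1, len(sentences)):
            dj, wj = sentences[j]
            shared = len(wi & wj)
            if shared:
                edges.append((i, j, shared, "intra" if di == dj else "inter"))
    return edges

docs = ["cats chase mice. mice fear cats", "dogs chase cats"]
edges = sentence_network(docs)
for edge in edges:
    print(edge)   # e.g. (0, 1, 2, 'intra') links the two sentences of document 0
```

Keeping the layer tag on each edge is what allows intra- and inter-document connectivity to be treated differently when ranking sentences.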
On the use of the proximity force approximation for deriving limits to short-range gravitational-like interactions from sphere-plane Casimir force experiments
We discuss the role of the proximity force approximation in deriving limits
to the existence of Yukawian forces - predicted in the submillimeter range by
many unification models - from Casimir force experiments using the sphere-plane
geometry. Two forms of this approximation are discussed, the first used in most
analyses of the residuals from the Casimir force experiments performed so far,
and the second recently discussed in this context in R. Decca et al. [Phys.
Rev. D 79, 124021 (2009)]. We show that the former form of the proximity force
approximation overestimates the expected Yukawa force and that the relative
deviation from the exact Yukawa force is of the same order of magnitude, in
realistic experimental settings, as the relative deviation expected between the
exact Casimir force and the Casimir force evaluated in the proximity force
approximation. This implies both a systematic shift making the actual limits to
the Yukawa force weaker than claimed so far, and a degree of uncertainty in the
alpha-lambda plane related to the handling of the various approximations used
in the theory for both the Casimir and the Yukawa forces. We further argue that
the recently discussed form for the proximity force approximation is
equivalent, for a geometry made of a generic object interacting with an
infinite planar slab, to the usual exact integration of any additive two-body
interaction, without any need to invoke approximation schemes. If the planar
slab is of finite size, an additional source of systematic error arises due to
the breaking of the planar translational invariance of the system, and we
finally discuss to what extent this may affect limits obtained on power-law and
Yukawa forces.
Comment: 11 pages, 5 figures.
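For reference, the most common form of the proximity force approximation discussed above obtains the sphere-plane force from the energy per unit area between parallel plates; the sketch below gives the standard textbook expressions (not reproduced from the paper itself), with the Yukawa correction written in the usual alpha-lambda parametrization mentioned in the abstract:

```latex
% Proximity force approximation, sphere of radius R at separation d << R:
F_{\mathrm{sp}}(d) \simeq 2\pi R \, E_{\mathrm{pp}}(d)

% Yukawa correction to the Newtonian potential between masses m_1 and m_2,
% with strength \alpha and range \lambda:
V(r) = -\frac{G m_1 m_2}{r}\left(1 + \alpha \, e^{-r/\lambda}\right)
```

The paper's point is that the first relation overestimates the expected Yukawa force relative to the exact integration over the sphere's volume.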
Extractive Multi-document Summarization Using Dynamical Measurements of Complex Networks
Due to the large amount of textual information available on the Internet, it is
of paramount relevance to use techniques that find relevant and concise
content. A typical task devoted to the identification of informative sentences
in documents is the so-called extractive document summarization task. In this
paper, we use complex network concepts to devise an extractive Multi-Document
Summarization (MDS) method, which extracts the most central sentences from
several textual sources. In the proposed model, texts are represented as
networks, where nodes represent sentences and the edges are established based
on the number of shared words. Differently from previous works, the
identification of relevant terms is guided by the characterization of nodes via
dynamical measurements of complex networks, including symmetry, accessibility
and absorption time. The evaluation of the proposed system revealed that
excellent results were obtained with particular dynamical measurements,
including those based on the exploration of networks via random walks.
Comment: Accepted for publication in BRACIS 2017 (Brazilian Conference on Intelligent Systems).
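The random-walk exploration of the sentence network mentioned above can be sketched with a plain power iteration over the walk's transition probabilities (a generic illustration, not the specific symmetry, accessibility, or absorption-time measurements used in the paper):

```python
def stationary_distribution(weights, n, steps=200):
    """Approximate the stationary distribution of a random walk on a weighted
    undirected graph by repeatedly applying the transition probabilities."""
    strength = [sum(weights.get((i, j), 0) for j in range(n)) for i in range(n)]
    p = [1.0 / n] * n
    for _ in range(steps):
        nxt = [0.0] * n
        for i in range(n):
            if strength[i] == 0:
                nxt[i] += p[i]          # isolated node keeps its probability
                continue
            for j in range(n):
                w = weights.get((i, j), 0)
                if w:
                    nxt[j] += p[i] * w / strength[i]
        p = nxt
    return p

# Toy sentence network: a triangle whose edge weights are shared-word counts.
w = {}
for i, j, wt in [(0, 1, 2), (1, 2, 1), (0, 2, 1)]:
    w[(i, j)] = w[(j, i)] = wt
print(stationary_distribution(w, 3))   # node 2 is visited least: ~[0.375, 0.375, 0.25]
```

Sentences where the walk spends the most time (the highest stationary probability) are the natural candidates for inclusion in the summary.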
A complex network approach to stylometry
Statistical methods have been widely employed to study the fundamental
properties of language. In recent years, methods from complex and dynamical
systems proved useful to create several language models. Despite the large
number of studies devoted to representing texts with physical models, only a
limited number of studies have shown how the properties of the underlying
physical systems can be employed to improve the performance of natural language
processing tasks. In this paper, I address this problem by devising complex
networks methods that are able to improve the performance of current
statistical methods. Using a fuzzy classification strategy, I show that the
topological properties extracted from texts complement the traditional textual
description. In several cases, the performance obtained with hybrid approaches
outperformed the results obtained when only traditional or networked methods
were used. Because the proposed model is generic, the framework devised here
could be straightforwardly used to study similar textual applications where the
topology plays a pivotal role in the description of the interacting agents.
Comment: PLoS ONE, 2015 (to appear).
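The hybrid strategy described above — fusing traditional textual features with network topology — can be sketched as a weighted combination of per-class membership scores (the function name, example classes, and the 0.5 weight are illustrative assumptions, not the paper's actual fuzzy scheme):

```python
def fuzzy_combine(p_traditional, p_network, weight=0.5):
    """Blend class-membership scores from a traditional textual model and a
    network-based model into a single fuzzy membership per class."""
    classes = set(p_traditional) | set(p_network)
    return {c: weight * p_traditional.get(c, 0.0)
               + (1 - weight) * p_network.get(c, 0.0)
            for c in classes}

scores = fuzzy_combine({"Dickens": 0.8, "Austen": 0.2},
                       {"Dickens": 0.4, "Austen": 0.6})
print(scores)   # Dickens: 0.6, Austen: 0.4 with equal weighting
```

When the two models disagree, the blended score lets the stronger of the two descriptions dominate the final attribution, which is the effect the hybrid experiments above exploit.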