Comparison of the language networks from literature and blogs
In this paper we present a comparison of the linguistic networks built from
literature and blog texts. The linguistic networks are constructed from texts
as directed and weighted co-occurrence networks of words: words are nodes, and
a link is established between two nodes if the corresponding words directly
co-occur within a sentence. The comparison of the network structure is
performed at the global (network) level in terms of average node degree,
average shortest path length, diameter, clustering coefficient, density and
number of components. Furthermore, we perform an analysis at the local (node)
level by comparing the rank plots of in- and out-degree, strength and
selectivity. The selectivity-based results indicate that there are differences
between the structure of the networks constructed from literature and blogs.
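The construction described above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and sentence-level input; the function names and the restriction to directly adjacent word pairs are simplifying assumptions, not taken from the paper:

```python
from collections import defaultdict

def cooccurrence_network(sentences):
    """Directed, weighted word co-occurrence network: an edge (w1, w2)
    gains weight 1 each time w2 directly follows w1 within a sentence."""
    weights = defaultdict(int)
    for sentence in sentences:
        words = sentence.lower().split()  # naive whitespace tokenization
        for w1, w2 in zip(words, words[1:]):
            weights[(w1, w2)] += 1
    return dict(weights)

def out_selectivity(network, node):
    """Node (out-)selectivity: out-strength divided by out-degree."""
    out_weights = [w for (u, _), w in network.items() if u == node]
    return sum(out_weights) / len(out_weights) if out_weights else 0.0

net = cooccurrence_network(["the cat sat", "the cat ran"])
# ("the", "cat") appears in both sentences, so its weight is 2
```

Global measurements such as average degree, diameter or number of components would then be computed on this edge dictionary, for example by loading it into a graph library.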
Text authorship identified using the dynamics of word co-occurrence networks
The identification of authorship in disputed documents still requires human
expertise, which is now unfeasible for many tasks owing to the large volumes of
text and authors in practical applications. In this study, we introduce a
methodology based on the dynamics of word co-occurrence networks representing
written texts to classify a corpus of 80 texts by 8 authors. The texts were
divided into sections with an equal number of linguistic tokens, from which
time series were created for 12 topological metrics. The series were shown to
be stationary (p-value > 0.05), which permits the use of distribution moments
as learning attributes. With an optimized supervised learning procedure using a
Radial Basis Function Network, 68 out of 80 texts were correctly classified,
i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in
purely dynamic network metrics were found to characterize authorship, thus
opening the way for the description of texts in terms of small evolving
networks. Moreover, the approach introduced allows for the comparison of texts
with diverse characteristics in a simple, fast fashion.
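The segmentation-and-moments pipeline above can be illustrated with a toy metric. In this sketch a type-token ratio stands in for any of the 12 topological metrics the study actually tracks, and the function names are illustrative, not from the paper:

```python
from statistics import mean, pstdev

def segment_tokens(tokens, size):
    """Split a token stream into consecutive sections with an equal
    number of tokens, discarding any final partial section."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def metric_series(tokens, size, metric):
    """Time series: one value of the chosen metric per section."""
    return [metric(section) for section in segment_tokens(tokens, size)]

def moment_features(series):
    """Summarize a stationary series by distribution moments (here the
    mean and population standard deviation) to use as learning attributes."""
    return (mean(series), pstdev(series))
```

The resulting moment tuples, one per metric, would form the feature vector handed to a supervised classifier.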
Comparing the writing style of real and artificial papers
Recent years have witnessed the increase of competition in science. While
promoting the quality of research in many cases, an intense competition among
scientists can also trigger unethical scientific behaviors. To increase the
total number of published papers, some authors even resort to software tools
that are able to produce grammatical, but meaningless scientific manuscripts.
Because automatically generated papers can be misunderstood as real papers, it
becomes of paramount importance to develop means to identify these scientific
frauds. In this paper, I devise a methodology to distinguish real manuscripts
from those generated with SCIGen, an automatic paper generator. Upon modeling
texts as complex networks (CN), it was possible to discriminate real from fake
papers with at least 89% accuracy. A systematic analysis of feature relevance
revealed that accessibility and betweenness were useful in particular cases,
even though the relevance depended upon the dataset. The successful
application of the methods described here shows, as a proof of principle, that
network features can be used to identify scientific gibberish papers. In
addition, the CN-based approach can be combined in a straightforward fashion
with traditional statistical language processing methods to improve the
performance in identifying artificially generated papers.
Comment: To appear in Scientometrics (2015).
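One kind of word adjacency feature used in such discrimination can be made concrete. The paper highlights accessibility and betweenness, which require shortest-path machinery; this sketch instead uses the simpler local clustering coefficient as a stand-in network feature, with hypothetical function names:

```python
from itertools import combinations

def adjacency(text):
    """Undirected word adjacency network from consecutive words."""
    neighbors = {}
    words = text.lower().split()
    for a, b in zip(words, words[1:]):
        if a != b:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    return neighbors

def clustering(neighbors, node):
    """Local clustering coefficient: the fraction of pairs of a node's
    neighbors that are themselves connected."""
    nbrs = neighbors.get(node, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
    return 2 * links / (k * (k - 1))
```

Averaging such per-node values over a manuscript yields one entry of the feature vector fed to a classifier distinguishing real from generated papers.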
Probing the topological properties of complex networks modeling short written texts
In recent years, graph theory has been widely employed to probe several
language properties. More specifically, the so-called word adjacency model has
been proven useful for tackling several practical problems, especially those
relying on textual stylistic analysis. The most common approach to treat texts
as networks has simply considered either large pieces of texts or entire books.
This approach has certainly worked well -- many informative discoveries have
been made this way -- but it raises an uncomfortable question: could there be
important topological patterns in small pieces of texts? To address this
problem, the topological properties of subtexts sampled from entire books were
probed. Statistical analyses performed on a dataset comprising 50 novels
revealed that most of the traditional topological measurements are stable for
short subtexts. When the performance of the authorship recognition task was
analyzed, it was found that a proper sampling yields a discriminability similar
to the one found with full texts. Surprisingly, the support vector machine
classification based on the characterization of short texts outperformed the
one performed with entire books. These findings suggest that a local
topological analysis of large documents might improve their global
characterization. Most importantly, it was verified, as a proof of principle,
that short texts can be analyzed with the methods and concepts of complex
networks. As a consequence, the techniques described here can be extended in a
straightforward fashion to analyze texts as time-varying complex networks.
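The subtext-sampling step can be sketched as follows. Tokenization, subtext length and the average-degree measurement are illustrative choices here, not the paper's exact protocol:

```python
import random

def sample_subtexts(tokens, length, n, seed=0):
    """Draw n contiguous subtexts of a fixed token length from a book."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    starts = [rng.randrange(0, len(tokens) - length + 1) for _ in range(n)]
    return [tokens[s:s + length] for s in starts]

def avg_degree(tokens):
    """Average degree of the undirected, unweighted word adjacency
    network built from a token list."""
    edges, nodes = set(), set()
    for a, b in zip(tokens, tokens[1:]):
        nodes.update((a, b))
        if a != b:
            edges.add(frozenset((a, b)))
    return 2 * len(edges) / len(nodes) if nodes else 0.0
```

Stability then means that avg_degree (and the other topological measurements) computed on the sampled subtexts stays close to the value obtained from the full book.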
A complex network approach to stylometry
Statistical methods have been widely employed to study the fundamental
properties of language. In recent years, methods from complex and dynamical
systems proved useful to create several language models. Despite the large
amount of studies devoted to represent texts with physical models, only a
limited number of studies have shown how the properties of the underlying
physical systems can be employed to improve the performance of natural language
processing tasks. In this paper, I address this problem by devising complex
networks methods that are able to improve the performance of current
statistical methods. Using a fuzzy classification strategy, I show that the
topological properties extracted from texts complement the traditional textual
description. In several cases, the performance obtained with hybrid approaches
outperformed the results obtained when only traditional or networked methods
were used. Because the proposed model is generic, the framework devised here
could be straightforwardly used to study similar textual applications where the
topology plays a pivotal role in the description of the interacting agents.Comment: PLoS ONE, 2015 (to appear
Is Judicial Expertise Dynamic? Judicial Expertise, Complex Networks, and Legal Policy
Article published in the Michigan State Law Review
INDICATORS OF COMPLEXITY IN THE SPEECH OF TEACHERS OF ITALIAN AS AN L2: A QUANTITATIVE ANALYSIS
Quantitative analyses of the teacher talk of L2 English teachers have made it possible to investigate how they make adjustments – not always in a conscious and planned way – in how they speak in front of a class of learners. These adjustments concern several linguistic levels and vary in intensity according to the learners' overall level of competence. In the present work we set out to quantitatively analyze the complexity of the speech of native-speaker teachers of Italian as an L2, collected and transcribed during lessons belonging to two levels of the Common European Framework of Reference for Languages (CEFR, Council of Europe, 2001), A1 and B1. Part of the transcriptions concerns lessons held in class (corpus ParInIt, Parlato di Insegnanti di Italiano), in which teacher and learners are physically co-present; a second corpus consists of lessons delivered online asynchronously, through a YouTube channel (corpus Oneworlditaliano). We propose a classification of the adjustments against which the quantitative analysis of the complexity indicators will verify whether it is possible to distinguish both between level A1 and level B1 and between the corpus collected face to face and the corpus of online lessons. The final goal is to understand whether a quantitative analysis of the data can help identify the adjustments and linguistic modifications made by teachers to make the input more comprehensible to the learners.
Indicators of complexity in the speech of Italian L2 teachers: a quantitative analysis
Quantitative analyses of L2 English teacher talk have allowed researchers to investigate how teachers make adjustments – not always in a conscious and planned way – in how they speak in front of a class of learners. These adjustments concern several linguistic levels and vary in intensity according to the learners' overall level of competence. In this paper we propose to quantitatively analyze the complexity of the teacher talk of teachers of Italian as an L2, collected and transcribed during lessons from the A1 and B1 levels of the Common European Framework of Reference for Languages (CEFR, Council of Europe, 2001). Part of the transcripts concerns lessons carried out face to face in class (corpus ParInIt, Parlato di Insegnanti di Italiano), while a second corpus is composed of lessons delivered online asynchronously, through a YouTube channel (corpus Oneworlditaliano). We propose a classification of the adjustments that enables testing the quantitative differences between lessons belonging to levels A1 and B1 and between face-to-face and online lessons. Furthermore, this work aims to understand whether quantitative analyses can help uncover the teachers' adjustments that make the input more understandable to learners.
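Two of the simplest complexity indicators used in such analyses can be computed directly from a transcript. This is a sketch assuming plain-text input; the two indicators chosen and the regex tokenization are illustrative, not the study's full indicator set:

```python
import re

def complexity_indicators(transcript):
    """Lexical diversity (type-token ratio) and mean sentence length,
    two common quantitative indicators of speech complexity."""
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    tokens = re.findall(r"\w+", transcript.lower())
    ttr = len(set(tokens)) / len(tokens) if tokens else 0.0
    msl = len(tokens) / len(sentences) if sentences else 0.0
    return {"type_token_ratio": ttr, "mean_sentence_length": msl}
```

Comparing such values between A1 and B1 lessons, or between the face-to-face and online corpora, then becomes a matter of standard statistical testing.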