Artificial Sequences and Complexity Measures
In this paper we exploit concepts of information theory to address the
fundamental problem of identifying and defining the most suitable tools to
extract, in an automatic and agnostic way, information from a generic string of
characters. In particular, we introduce a class of methods which rely crucially
on data compression techniques to define a measure of remoteness and distance
between pairs of character sequences (e.g. texts)
based on their relative information content. We also discuss in detail how
specific features of data compression techniques could be used to introduce the
notion of dictionary of a given sequence and of Artificial Text and we show how
these new tools can be used for information extraction purposes. We point out
the versatility and generality of our method that applies to any kind of
corpora of character strings independently of the type of coding behind them.
As a case study we consider linguistically motivated problems, and we present
results for automatic language recognition, authorship attribution and
self-consistent classification.
Comment: Revised version, with major changes, of the previous "Data Compression
approach to Information Extraction and Classification" by A. Baronchelli and
V. Loreto. 15 pages; 5 figures.
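To make the compression-based distance concrete, here is a minimal sketch in the spirit of the relative-information-content measure described above, using zlib as the compressor; the choice of compressor and the normalization are illustrative assumptions, not the authors' exact construction.

```python
import zlib


def compressed_size(s: str) -> int:
    """Compressed size of a string in bytes (zlib at maximum compression)."""
    return len(zlib.compress(s.encode("utf-8"), 9))


def compression_distance(a: str, b: str) -> float:
    """Illustrative sketch (assumption): a normalized compression distance.

    Intuition: if b is well predicted by the regularities of a, compressing
    the concatenation a+b costs little more than compressing the longer
    sequence alone, so the distance is small.
    """
    ca, cb, cab = compressed_size(a), compressed_size(b), compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


if __name__ == "__main__":
    english = "the quick brown fox jumps over the lazy dog " * 20
    italian = "la volpe veloce salta sopra il cane pigro " * 20
    print(compression_distance(english, english + " with a few extra words"))  # small
    print(compression_distance(english, italian))                              # larger
```

A usage note: the same idea underlies automatic language recognition, where an unknown text is attributed to the reference corpus at the smallest compression-based distance.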
A Recurrent Deep Neural Network Model to measure Sentence Complexity for the Italian Language
Text simplification (TS) is a natural language processing task devoted to the modification of a text in such a way that the grammar and structure of its phrases are greatly simplified while the underlying meaning and information content are preserved. In this paper we contribute to the TS field by presenting a deep neural network model able to detect the complexity of Italian sentences. In particular, the system assigns to an input text a score that reflects the confidence of the decision-making process and that can be interpreted as a measure of sentence complexity. Experiments have been carried out on a public corpus of Italian texts created specifically for the task of TS. We also compare our model with a state-of-the-art method
used for the same purpose.
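As an illustration of the kind of model the abstract describes, below is a minimal PyTorch sketch of a recurrent scorer whose output probability can be read as a sentence-complexity score; the LSTM architecture, layer sizes, and the random toy input are assumptions for illustration only, not the paper's actual model.

```python
import torch
import torch.nn as nn


class ComplexityScorer(nn.Module):
    """Illustrative sketch (assumption): maps a token-id sequence to a score in [0, 1]."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embed(token_ids)           # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.rnn(embedded)        # hidden: (1, batch, hidden_dim)
        logits = self.head(hidden.squeeze(0))      # (batch, 1)
        return torch.sigmoid(logits).squeeze(-1)   # one score per sentence


if __name__ == "__main__":
    model = ComplexityScorer(vocab_size=1000)
    toy_batch = torch.randint(1, 1000, (4, 12))    # 4 toy sentences of 12 token ids
    print(model(toy_batch))                        # values interpretable as complexity
```

The design choice mirrors the abstract: the classifier's confidence, rather than a hard label, is what gets reported as the complexity measure.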
Structure-semantics interplay in complex networks and its effects on the predictability of similarity in texts
There are different ways to define similarity for grouping similar texts into
clusters, as the concept of similarity may depend on the purpose of the task.
For instance, in topic extraction similar texts mean those within the same
semantic field, whereas in author recognition stylistic features should be
considered. In this study, we introduce ways to classify texts employing
concepts of complex networks, which may be able to capture syntactic, semantic
and even pragmatic features. The interplay between the various metrics of the
complex networks is analyzed with three applications, namely identification of
machine translation (MT) systems, evaluation of quality of machine translated
texts and authorship recognition. We shall show that topological features of
the networks representing texts can enhance the ability to identify MT systems
in particular cases. For evaluating the quality of MT texts, on the other hand,
high correlation was obtained with methods capable of capturing the semantics.
This was expected because the gold standards used are themselves based on
word co-occurrence. Nevertheless, the Katz similarity, which combines
semantics and structure in the comparison of texts, achieved the highest
correlation with the NIST measurement, indicating that in some cases the
combination of both approaches can improve the ability to quantify quality in
MT. In authorship recognition, again the topological features were relevant in
some contexts, though for the books and authors analyzed good results were
obtained with semantic features as well. Because hybrid approaches encompassing
semantic and topological features have not been extensively used, we believe
that the methodology proposed here may be useful to enhance text classification
considerably, as it combines well-established strategies.
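For readers who want to see what representing a text as a complex network can look like in practice, here is a minimal sketch using networkx to build a word adjacency network and report a few global topological metrics; the window of adjacent words and the particular metrics are assumptions, not the exact feature set used in the study.

```python
import networkx as nx


def cooccurrence_network(text: str) -> nx.Graph:
    """Illustrative sketch (assumption): nodes are word types, edges link
    words that appear next to each other in the running text."""
    tokens = [w.lower() for w in text.split()]
    graph = nx.Graph()
    graph.add_edges_from(zip(tokens, tokens[1:]))
    return graph


def topological_features(graph: nx.Graph) -> dict:
    """A few global network metrics of the kind used to characterize texts."""
    return {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "avg_degree": 2 * graph.number_of_edges() / graph.number_of_nodes(),
        "clustering": nx.average_clustering(graph),
    }


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug"
    print(topological_features(cooccurrence_network(sample)))
```

Feature vectors of this kind can then be fed to any standard classifier, which is how topological information gets combined with semantic features in hybrid approaches.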
Complex systems and the history of the English language
Complexity theory (Mitchell 2009, Kretzschmar 2009) is something that historical linguists not only can use but should use in order to improve the relationship between the speech we observe in historical settings and the generalizations we make from it. Complex systems, as described in physics, ecology, and many other sciences, are made up of massive numbers of components interacting with one another, and this results in self-organization and emergent order. For speech, the "components" of a complex system are all of the possible variant realizations of linguistic features as they are deployed by human agents, speakers and writers. The order that emerges in speech is simply the fact that our use of words and other linguistic features is significantly clustered in the spatial, social and textual groups in which we actually communicate. Order emerges from such systems by means of self-organization, but the order that arises from speech is not the same as what linguists study under the rubric of linguistic structure. In both texts and regional/social groups, the frequency distribution of features follows the same pattern: an asymptotic hyperbolic curve (or "A-curve"). Formal linguistic systems, grammars, are thus not the direct result of the complex system, and historical linguists must use complexity to mediate between the language production observed in the community and the grammars we describe. The history of the English language does not proceed as regularly as clockwork, and an understanding of complex systems helps us to see why and how, and suggests what we can do about it. First, the scaling property of complex systems tells us that there are no representative speakers, and so our observation of any small group of speakers is unlikely to represent any group at a larger scale, and limited evidence is the necessary condition of many of our historical studies. The fact that underlying complex distributions follow the 80/20 rule, i.e. 80% of the word tokens in a data set will be instances of only 20% of the word types, while the other 80% of the word types will amount to only 20% of the tokens, gives us an effective tool for estimating the status of historical states of the language. Such a frequency-based technique is opposed to the typological "fit" technique that relies on a few texts that can be reliably located in space, and which may not account for the crosscutting effects of text type, another dimension in which the 80/20 rule applies. Besides issues of sampling, the frequency-based approach also affects how we can think about change. The A-curve immediately translates to the S-curve now used to describe linguistic change, and explains that "change" cannot reasonably be considered to be a qualitative shift. Instead, we can use the model of "punctuated equilibrium" from evolutionary biology (e.g., see Gould and Eldredge 1993), which suggests that multiple changes occur simultaneously and compete, rather than the older idea of "phyletic gradualism" in evolution that corresponds to the traditional method of historical linguistics. The Great Vowel Shift, for example, is a useful overall generalization, but complex systems and punctuated equilibrium explain why we should not expect it ever to be "complete" or to appear in the same form in different places. These applications of complexity can help us to understand and interpret our existing studies better, and suggest how new studies in the history of the English language can be made more valid and reliable.
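As a concrete reading of the 80/20 claim above, here is a minimal sketch that ranks word types by frequency and reports the share of tokens covered by the most frequent 20% of types; the whitespace tokenization and the toy sentence are assumptions, not the abstract's own procedure.

```python
from collections import Counter


def token_share_of_top_types(text: str, type_fraction: float = 0.2) -> float:
    """Illustrative sketch (assumption): fraction of word tokens accounted for
    by the most frequent `type_fraction` of word types (the A-curve / 80-20
    pattern described above)."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.values(), reverse=True)
    top_n = max(1, int(len(ranked) * type_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


if __name__ == "__main__":
    toy = "the cat and the dog and the bird saw the cat near the old dog"
    print(token_share_of_top_types(toy))  # share of tokens covered by top 20% of types
```

On a real corpus the returned share approaches the 80% figure cited above; on a toy sentence it only illustrates the calculation.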
Joint perceptual decision-making: a case study in explanatory pluralism.
Traditionally, different approaches to the study of cognition have been viewed as competing explanatory frameworks. An alternative view, explanatory pluralism, regards different approaches to the study of cognition as complementary ways of studying the same phenomenon, at specific temporal and spatial scales, using appropriate methodological tools. Explanatory pluralism has often been described abstractly, but has rarely been applied to concrete cases. We present a case study of explanatory pluralism. We discuss three separate ways of studying the same phenomenon: a perceptual decision-making task (Bahrami et al., 2010), where pairs of subjects share information to jointly identify an oddball stimulus among a set of distractors. Each approach analyzed the same corpus but targeted different units of analysis at different levels of description: decision-making at the behavioral level, confidence sharing at the linguistic level, and acoustic energy at the physical level. We discuss the utility of explanatory pluralism for describing this complex, multiscale phenomenon, show ways in which this case study sheds new light on the concept of pluralism, and highlight good practices to critically assess and complement approaches.
Generalized Hurst exponent and multifractal function of original and translated texts mapped into frequency and length time series
A nonlinear dynamics approach can be used to quantify complexity in written
texts. As a first step, a one-dimensional system is examined: two written
texts by one author (Lewis Carroll), together with one translation into an
artificial language (Esperanto), are mapped into time series. Their
corresponding shuffled versions are used to obtain a "baseline". Two different
one-dimensional time series are used here: (i) one based on word lengths
(LTS), (ii) the other on word frequencies (FTS). It is shown that the
generalized Hurst exponent and the derived curves of the original and
translated texts show marked differences. The original texts are far from
yielding a parabolic function, in contrast to the shuffled texts. Moreover,
the Esperanto text has more extreme values. This suggests a cascade-model-like
structure, with multiscale, time-asymmetric features, underlying the finally
written texts. A discussion of the difference and complementarity of mapping
into an LTS or FTS is presented. The FTS curves are more open than the LTS ones.
Comment: preprint for PRE; 2 columns; 10 pages; 6 (multi)figures; 3 tables; 70
references.
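To illustrate the two mappings and the scaling estimate described above, here is a minimal numpy sketch that builds word-length (LTS) and word-frequency (FTS) time series and estimates a generalized Hurst exponent H(q) from q-th order structure functions; the estimator, the lag range, and the random toy text are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np


def length_time_series(text: str) -> np.ndarray:
    """LTS: the sequence of word lengths along the text."""
    return np.array([len(w) for w in text.split()], dtype=float)


def frequency_time_series(text: str) -> np.ndarray:
    """FTS: each word replaced by its number of occurrences in the text."""
    words = text.split()
    counts = {w: words.count(w) for w in set(words)}
    return np.array([counts[w] for w in words], dtype=float)


def generalized_hurst(x: np.ndarray, q: float, lags=range(2, 20)) -> float:
    """Illustrative sketch (assumption): estimate H(q) from the scaling
    mean(|x(t+tau) - x(t)|^q) ~ tau^(q * H(q)) over a small range of lags."""
    taus = np.array(list(lags), dtype=float)
    s_q = np.array([np.mean(np.abs(x[tau:] - x[:-tau]) ** q) for tau in lags])
    slope, _ = np.polyfit(np.log(taus), np.log(s_q), 1)
    return slope / q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_text = " ".join(rng.choice(["a", "an", "the", "rabbit", "curiouser"], size=2000))
    lts = length_time_series(toy_text)
    print([round(generalized_hurst(lts, q), 3) for q in (1, 2, 3)])
```

A dependence of H(q) on q (rather than a constant) is the multifractal signature the abstract refers to; shuffled versions of the same series serve as the monofractal baseline.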
Examining Scientific Writing Styles from the Perspective of Linguistic Complexity
Publishing articles in high-impact English journals is difficult for scholars
around the world, especially for non-native English-speaking scholars (NNESs),
most of whom struggle with proficiency in English. In order to uncover the
differences in English scientific writing between native English-speaking
scholars (NESs) and NNESs, we collected a large-scale data set containing more
than 150,000 full-text articles published in PLoS between 2006 and 2015. We
divided these articles into three groups according to the ethnic backgrounds of
the first and corresponding authors, obtained by Ethnea, and examined the
scientific writing styles in English from a two-fold perspective of linguistic
complexity: (1) syntactic complexity, including measurements of sentence length
and sentence complexity; and (2) lexical complexity, including measurements of
lexical diversity, lexical density, and lexical sophistication. The
observations suggest marginal differences between groups in syntactic and
lexical complexity.
Comment: 6 figures.
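To make the complexity measures concrete, here is a minimal sketch of three of them: mean sentence length (a syntactic proxy), type-token ratio (lexical diversity), and lexical density; the regex tokenization and the toy function-word list are assumptions, not the exact operationalization used in the study.

```python
import re

# Toy closed-class word list (assumption); real lexical-density measures use a full list.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to",
                  "is", "are", "was", "were", "we", "with"}


def mean_sentence_length(text: str) -> float:
    """Syntactic complexity proxy: average number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)


def type_token_ratio(text: str) -> float:
    """Lexical diversity: distinct word types divided by total tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return len(set(tokens)) / len(tokens)


def lexical_density(text: str) -> float:
    """Share of tokens that are content (non-function) words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t not in FUNCTION_WORDS for t in tokens) / len(tokens)


if __name__ == "__main__":
    sample = ("We collected full-text articles. We measured sentence length, "
              "lexical diversity, and lexical density across author groups.")
    print(mean_sentence_length(sample), type_token_ratio(sample), lexical_density(sample))
```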