    Artificial Sequences and Complexity Measures

    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which make crucial use of data compression techniques to define a measure of remoteness and distance between pairs of character sequences (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques can be used to introduce the notions of the dictionary of a given sequence and of an Artificial Text, and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method, which applies to any kind of corpus of character strings independently of the type of coding behind them. As a case study we consider linguistically motivated problems and present results for automatic language recognition, authorship attribution and self-consistent classification. Comment: Revised version, with major changes, of the previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figures.
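    The compression-based distance at the heart of this kind of approach can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' exact estimator: it uses zlib as the compressor and an NCD-style formula as a stand-in for the relative-information-content distance described above, and the helper names are assumptions.

        import zlib

        def compressed_size(s: str) -> int:
            # Size in bytes of the zlib-compressed string; a rough proxy for information content.
            return len(zlib.compress(s.encode("utf-8"), 9))

        def compression_distance(x: str, y: str) -> float:
            # NCD-style normalized distance: small when x and y share a lot of structure.
            cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
            return (cxy - min(cx, cy)) / max(cx, cy)

        english = "the quick brown fox jumps over the lazy dog " * 200
        italian = "la volpe veloce salta sopra il cane pigro " * 200
        print(compression_distance(english, english[:4000]))  # low: same "source"
        print(compression_distance(english, italian))         # higher: different languages

    Any off-the-shelf compressor can stand in for zlib here; the key design choice is that compressing a concatenation measures how much one sequence helps encode the other.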

    A Recurrent Deep Neural Network Model to measure Sentence Complexity for the Italian Language

    Text simplification (TS) is a natural language processing task devoted to modifying a text so that the grammar and structure of its phrases are greatly simplified while the underlying meaning and information content are preserved. In this paper we give a contribution to the TS field by presenting a deep neural network model able to detect the complexity of Italian sentences. In particular, the system assigns a score to an input text that identifies the confidence level of the decision-making process and that can be interpreted as a measure of sentence complexity. Experiments have been carried out on a public corpus of Italian texts created specifically for the task of TS. We also provide a comparison of our model with a state-of-the-art method used for the same purpose.
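    As a rough illustration of reading a classifier's confidence as a complexity score, here is a minimal PyTorch sketch of a recurrent sentence classifier. The architecture, dimensions, and two-class layout are assumptions for illustration, not the paper's actual model.

        import torch
        import torch.nn as nn

        class ComplexityRNN(nn.Module):
            # Hypothetical architecture: embedding -> LSTM -> 2-way classifier head.
            def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
                self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
                self.head = nn.Linear(hidden_dim, 2)  # classes: simple vs. complex

            def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
                emb = self.embed(token_ids)        # (batch, seq_len, embed_dim)
                _, (h_n, _) = self.lstm(emb)       # final hidden state of the LSTM
                return self.head(h_n[-1])          # (batch, 2) logits

        def complexity_score(model: nn.Module, token_ids: torch.Tensor) -> torch.Tensor:
            # Softmax probability of the "complex" class, read as a confidence/complexity score.
            with torch.no_grad():
                return torch.softmax(model(token_ids), dim=-1)[:, 1]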

    Structure-semantics interplay in complex networks and its effects on the predictability of similarity in texts

    There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts are those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between the various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of the quality of machine-translated texts, and authorship recognition. We shall show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Notwithstanding, the Katz similarity, which involves semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measure, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, the topological features were again relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well. Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies.
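    A minimal sketch of the network representation of a text follows, assuming a simple sliding-window co-occurrence graph. The tokenization, the chosen topological metrics, and the Katz attenuation factor are illustrative assumptions rather than the exact pipeline used in the study.

        import numpy as np
        import networkx as nx

        def cooccurrence_graph(text: str, window: int = 2) -> nx.Graph:
            # Nodes are word types; edges link words that co-occur within a small window.
            tokens = text.lower().split()
            g = nx.Graph()
            for i, word in enumerate(tokens):
                for j in range(i + 1, min(i + window + 1, len(tokens))):
                    g.add_edge(word, tokens[j])
            return g

        def topological_features(g: nx.Graph) -> dict:
            # A few standard network metrics that could feed a text classifier.
            degrees = dict(g.degree())
            return {
                "avg_degree": sum(degrees.values()) / g.number_of_nodes(),
                "clustering": nx.average_clustering(g),
                "assortativity": nx.degree_assortativity_coefficient(g),
            }

        def katz_index(g: nx.Graph, beta: float = 0.01) -> np.ndarray:
            # Katz index S = (I - beta*A)^-1 - I; beta must stay below 1/lambda_max(A).
            a = nx.to_numpy_array(g)
            n = a.shape[0]
            return np.linalg.inv(np.eye(n) - beta * a) - np.eye(n)

    The Katz index counts paths of all lengths between nodes with exponentially decaying weight, which is how a node-similarity measure can mix local structure with longer-range (semantic-like) connectivity.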

    Complex systems and the history of the English language

    Complexity theory (Mitchell 2009, Kretzschmar 2009) is something that historical linguists not only can use but should use in order to improve the relationship between the speech we observe in historical settings and the generalizations we make from it. Complex systems, as described in physics, ecology, and many other sciences, are made up of massive numbers of components interacting with one another, and this results in self-organization and emergent order. For speech, the “components” of a complex system are all of the possible variant realizations of linguistic features as they are deployed by human agents, speakers and writers. The order that emerges in speech is simply the fact that our use of words and other linguistic features is significantly clustered in the spatial and social and textual groups in which we actually communicate. Order emerges from such systems by means of self-organization, but the order that arises from speech is not the same as what linguists study under the rubric of linguistic structure. In both texts and regional/social groups, the frequency distribution of features occurs as the same pattern: an asymptotic hyperbolic curve (or “A-curve”). Formal linguistic systems, grammars, are thus not the direct result of the complex system, and historical linguists must use complexity to mediate between the language production observed in the community and the grammars we describe. The history of the English language does not proceed as regularly as clockwork, and an understanding of complex systems helps us to see why and how, and suggests what we can do about it. First, the scaling property of complex systems tells us that there are no representative speakers, and so our observation of any small group of speakers is unlikely to represent any group at a larger scale—and limited evidence is the necessary condition of many of our historical studies. The fact that underlying complex distributions follow the 80/20 rule, i.e. 80% of the word tokens in a data set will be instances of only 20% of the word types, while the other 80% of the word types will amount to only 20% of the tokens, gives us an effective tool for estimating the status of historical states of the language (see the sketch after this abstract). Such a frequency-based technique is opposed to the typological “fit” technique that relies on a few texts that can be reliably located in space, and which may not account for the crosscutting effects of text type, another dimension in which the 80/20 rule applies. Besides issues of sampling, the frequency-based approach also affects how we can think about change. The A-curve immediately translates to the S-curve now used to describe linguistic change, and explains that “change” cannot reasonably be considered to be a qualitative shift. Instead, we can use the model of “punctuated equilibrium” from evolutionary biology (e.g., see Gould and Eldredge 1993), which suggests that multiple changes occur simultaneously and compete, rather than the older idea of “phyletic gradualism” in evolution that corresponds to the traditional method of historical linguistics. The Great Vowel Shift, for example, is a useful overall generalization, but complex systems and punctuated equilibrium explain why we should not expect it ever to be “complete” or to appear in the same form in different places. These applications of complexity can help us to understand and interpret our existing studies better, and suggest how new studies in the history of the English language can be made more valid and reliable.
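    The 80/20 claim about type/token distributions is easy to probe on any corpus. The snippet below is a quick, hedged check of that A-curve intuition, treating the 80/20 figures as an approximation rather than an exact law.

        from collections import Counter

        def top_type_token_share(text: str, type_fraction: float = 0.2) -> float:
            # Share of all word tokens accounted for by the most frequent 20% of word types.
            counts = Counter(text.lower().split())
            freqs = sorted(counts.values(), reverse=True)
            k = max(1, int(len(freqs) * type_fraction))
            return sum(freqs[:k]) / sum(freqs)

        # For natural-language corpora of reasonable size this share tends to sit
        # near 0.8, which is the intuition behind the 80/20 description above.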

    Joint perceptual decision-making: a case study in explanatory pluralism.

    Traditionally, different approaches to the study of cognition have been viewed as competing explanatory frameworks. An alternative view, explanatory pluralism, regards different approaches to the study of cognition as complementary ways of studying the same phenomenon, at specific temporal and spatial scales, using appropriate methodological tools. Explanatory pluralism has often been described abstractly, but has rarely been applied to concrete cases. We present a case study of explanatory pluralism. We discuss three separate ways of studying the same phenomenon: a perceptual decision-making task (Bahrami et al., 2010), where pairs of subjects share information to jointly individuate an oddball stimulus among a set of distractors. Each approach analyzed the same corpus but targeted different units of analysis at different levels of description: decision-making at the behavioral level, confidence sharing at the linguistic level, and acoustic energy at the physical level. We discuss the utility of explanatory pluralism for describing this complex, multiscale phenomenon, show ways in which this case study sheds new light on the concept of pluralism, and highlight good practices for critically assessing and complementing approaches.

    Generalized Hurst exponent and multifractal function of original and translated texts mapped into frequency and length time series

    A nonlinear dynamics approach can be used to quantify complexity in written texts. As a first step, a one-dimensional system is examined: two written texts by one author (Lewis Carroll), together with one translation into an artificial language (Esperanto), are mapped into time series. Their corresponding shuffled versions are used to obtain a baseline. Two different one-dimensional time series are used here: (i) one based on word lengths (LTS), (ii) the other on word frequencies (FTS). It is shown that the generalized Hurst exponent h(q) and the derived f(α) curves of the original and translated texts show marked differences. The original texts are far from giving a parabolic f(α) function, in contrast to the shuffled texts. Moreover, the Esperanto text has more extreme values. This suggests cascade-model-like, multiscale, time-asymmetric features in the finally written texts. A discussion of the difference and complementarity of mapping into an LTS or FTS is presented. The FTS f(α) curves are more open than the LTS ones. Comment: preprint for PRE; 2 columns; 10 pages; 6 (multi)figures; 3 tables; 70 references.
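    A simplified estimator of the generalized Hurst exponent from a word-length time series can illustrate the mapping. The structure-function approach below is a stand-in for the paper's actual multifractal analysis; the profile construction and range of lags are assumptions.

        import numpy as np

        def length_time_series(text: str) -> np.ndarray:
            # LTS: the sequence of word lengths along the text.
            return np.array([len(w) for w in text.split()], dtype=float)

        def generalized_hurst(series: np.ndarray, q: float, max_lag: int = 20) -> float:
            # Estimate h(q) from q-th order structure functions of the integrated profile:
            # K_q(tau) = <|X(t+tau) - X(t)|^q> ~ tau^(q*h(q)).
            profile = np.cumsum(series - series.mean())
            log_k, log_tau = [], []
            for tau in range(1, max_lag + 1):
                diffs = np.abs(profile[tau:] - profile[:-tau])
                log_k.append(np.log(np.mean(diffs ** q)))
                log_tau.append(np.log(tau))
            slope, _ = np.polyfit(log_tau, log_k, 1)
            return slope / q

        # Comparing h(q) of the original series with that of a shuffled copy
        # (np.random.permutation(series)) gives the "baseline" mentioned above.

    A q-dependent h(q) signals multifractality; a shuffled text destroys long-range correlations, which is why it serves as the reference.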

    Examining Scientific Writing Styles from the Perspective of Linguistic Complexity

    Publishing articles in high-impact English journals is difficult for scholars around the world, especially for non-native English-speaking scholars (NNESs), most of whom struggle with proficiency in English. In order to uncover the differences in English scientific writing between native English-speaking scholars (NESs) and NNESs, we collected a large-scale data set containing more than 150,000 full-text articles published in PLoS between 2006 and 2015. We divided these articles into three groups according to the ethnic backgrounds of the first and corresponding authors, obtained from Ethnea, and examined the scientific writing styles in English from a two-fold perspective of linguistic complexity: (1) syntactic complexity, including measurements of sentence length and sentence complexity; and (2) lexical complexity, including measurements of lexical diversity, lexical density, and lexical sophistication. The observations suggest marginal differences between groups in syntactic and lexical complexity. Comment: 6 figures.
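    The lexical and syntactic measures mentioned above can be approximated with very simple counts. The sketch below is illustrative only (the paper's exact operationalizations may differ), and the toy stopword list used to approximate lexical density is an assumption.

        import re

        STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to",
                     "is", "are", "was", "were", "be"}

        def complexity_profile(text: str) -> dict:
            # Mean sentence length, type-token ratio (lexical diversity), and a crude
            # lexical density (share of tokens not in the toy stopword list).
            sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
            tokens = re.findall(r"[a-z]+", text.lower())
            content = [t for t in tokens if t not in STOPWORDS]
            return {
                "mean_sentence_length": len(tokens) / max(1, len(sentences)),
                "type_token_ratio": len(set(tokens)) / max(1, len(tokens)),
                "lexical_density": len(content) / max(1, len(tokens)),
            }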