52,845 research outputs found

    Building a Corpus of 2L English for Automatic Assessment: the CLEC Corpus

    Get PDF
    In this paper we describe the CLEC corpus, an ongoing project set up at the University of Cádiz with the purpose of building up a large corpus of English as a 2L classified according to CEFR proficiency levels and formed to train statistical models for automatic proficiency assessment. The goal of this corpus is twofold: on the one hand it will be used as a data resource for the development of automatic text classification systems and, on the other, it has been used as a means of teaching innovation techniques

    Examining Scientific Writing Styles from the Perspective of Linguistic Complexity

    Full text link
    Publishing articles in high-impact English journals is difficult for scholars around the world, especially for non-native English-speaking scholars (NNESs), most of whom struggle with proficiency in English. In order to uncover the differences in English scientific writing between native English-speaking scholars (NESs) and NNESs, we collected a large-scale data set containing more than 150,000 full-text articles published in PLoS between 2006 and 2015. We divided these articles into three groups according to the ethnic backgrounds of the first and corresponding authors, obtained by Ethnea, and examined the scientific writing styles in English from a two-fold perspective of linguistic complexity: (1) syntactic complexity, including measurements of sentence length and sentence complexity; and (2) lexical complexity, including measurements of lexical diversity, lexical density, and lexical sophistication. The observations suggest marginal differences between groups in syntactical and lexical complexity.Comment: 6 figure

    Artificial Sequences and Complexity Measures

    Get PDF
    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in a automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which use in a crucial way data compression techniques in order to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques could be used to introduce the notion of dictionary of a given sequence and of Artificial Text and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method that applies to any kind of corpora of character strings independently of the type of coding behind them. We consider as a case study linguistic motivated problems and we present results for automatic language recognition, authorship attribution and self consistent-classification.Comment: Revised version, with major changes, of previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figure

    A linguistically-driven methodology for detecting impending and unfolding emergencies from social media messages

    Get PDF
    Natural disasters have demonstrated the crucial role of social media before, during and after emergencies (Haddow & Haddow 2013). Within our EU project Sland \ub4 ail, we aim to ethically improve \ub4 the use of social media in enhancing the response of disaster-related agen-cies. To this end, we have collected corpora of social and formal media to study newsroom communication of emergency management organisations in English and Italian. Currently, emergency management agencies in English-speaking countries use social media in different measure and different degrees, whereas Italian National Protezione Civile only uses Twitter at the moment. Our method is developed with a view to identifying communicative strategies and detecting sentiment in order to distinguish warnings from actual disasters and major from minor disasters. Our linguistic analysis uses humans to classify alert/warning messages or emer-gency response and mitigation ones based on the terminology used and the sentiment expressed. Results of linguistic analysis are then used to train an application by tagging messages and detecting disaster- and/or emergency-related terminology and emotive language to simulate human rating and forward information to an emergency management system

    Measuring complexity with zippers

    Get PDF
    Physics concepts have often been borrowed and independently developed by other fields of science. In this perspective a significant example is that of entropy in Information Theory. The aim of this paper is to provide a short and pedagogical introduction to the use of data compression techniques for the estimate of entropy and other relevant quantities in Information Theory and Algorithmic Information Theory. We consider in particular the LZ77 algorithm as case study and discuss how a zipper can be used for information extraction.Comment: 10 pages, 3 figure

    The relation between pitch and gestures in a story-telling task

    Get PDF
    Anecdotal evidence suggests that both pitch range and gestures contribute to the perception of speakers\u2019 liveliness in speech. However, the relation between speakers\u2019 pitch range and gestures has received little attention. It is possible that variations in pitch range might be accompanied by variations in gestures, and vice versa. In second language speech, the relation between pitch range and gestures might also be affected by speakers\u2019 difficulty in speaking the L2. In this pilot study we compare global pitch range and gesture rate in the speech of 3 native Italian speakers, telling the same story once in Italian and twice in English as part of an in-class oral presentation task. The hypothesis tested is that contextual factors, such as speakers\u2019 nervousness with the task, cause speakers to use narrow pitch range and limited gestures; a greater ease with the task, due to its repetition, cause speakers to use a wider pitch range and more gestures. This experimental hypothesis is partially confirmed by the results of this study
    corecore