Building a Corpus of 2L English for Automatic Assessment: the CLEC Corpus
In this paper we describe the CLEC corpus, an ongoing project set up at the University of Cádiz with the purpose of building a large corpus of English as a 2L, classified according to CEFR proficiency levels and designed to train statistical models for automatic proficiency assessment. The goal of this corpus is twofold: on the one hand, it will be used as a data resource for the development of automatic text classification systems and, on the other, it has been used as a means of teaching innovation techniques.
Examining Scientific Writing Styles from the Perspective of Linguistic Complexity
Publishing articles in high-impact English journals is difficult for scholars
around the world, especially for non-native English-speaking scholars (NNESs),
most of whom struggle with proficiency in English. In order to uncover the
differences in English scientific writing between native English-speaking
scholars (NESs) and NNESs, we collected a large-scale data set containing more
than 150,000 full-text articles published in PLoS between 2006 and 2015. We
divided these articles into three groups according to the ethnic backgrounds of
the first and corresponding authors, obtained by Ethnea, and examined the
scientific writing styles in English from a two-fold perspective of linguistic
complexity: (1) syntactic complexity, including measurements of sentence length
and sentence complexity; and (2) lexical complexity, including measurements of
lexical diversity, lexical density, and lexical sophistication. The
observations suggest marginal differences between groups in syntactical and
lexical complexity.
Comment: 6 figures
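Two of the measures named in this abstract, sentence length and lexical diversity, can be illustrated with a minimal sketch. The tokenisation rules and function names below are assumptions made for illustration and are not taken from the paper; lexical diversity is approximated here by a plain type-token ratio. Lexical density and lexical sophistication would need part-of-speech tags and word-frequency lists, so they are not shown.

```python
import re

# Hypothetical helper names; the regex-based tokenisation is a simplification.
WORD = re.compile(r"[A-Za-z']+")

def sentence_lengths(text):
    """Split on ., ! and ? and return the word count of each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(WORD.findall(s)) for s in sentences]

def mean_sentence_length(text):
    """Average number of words per sentence (0.0 for empty input)."""
    lengths = sentence_lengths(text)
    return sum(lengths) / len(lengths) if lengths else 0.0

def type_token_ratio(text):
    """Distinct lowercased word forms divided by total word tokens."""
    tokens = [t.lower() for t in WORD.findall(text)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

The type-token ratio falls as texts get longer, so comparisons between groups are only meaningful on samples of similar length.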
Artificial Sequences and Complexity Measures
In this paper we exploit concepts of information theory to address the
fundamental problem of identifying and defining the most suitable tools to
extract, in an automatic and agnostic way, information from a generic string of
characters. We introduce in particular a class of methods which use in a
crucial way data compression techniques in order to define a measure of
remoteness and distance between pairs of sequences of characters (e.g. texts)
based on their relative information content. We also discuss in detail how
specific features of data compression techniques could be used to introduce the
notion of dictionary of a given sequence and of Artificial Text and we show how
these new tools can be used for information extraction purposes. We point out
the versatility and generality of our method that applies to any kind of
corpora of character strings independently of the type of coding behind them.
As a case study we consider linguistically motivated problems and present
results for automatic language recognition, authorship attribution and
self-consistent classification.
Comment: Revised version, with major changes, of previous "Data Compression
approach to Information Extraction and Classification" by A. Baronchelli and
V. Loreto. 15 pages; 5 figures
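The idea of measuring remoteness between two sequences with a compressor can be sketched as follows. This is an illustration in the spirit of the abstract, not the authors' implementation: `zlib` (DEFLATE, which is LZ77-based) is an assumed stand-in compressor, and the function names are hypothetical. The sketch appends a short probe to a reference text and counts how many extra compressed bytes the probe costs; a probe that resembles the reference should cost fewer.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression)."""
    return len(zlib.compress(data, 9))

def cross_cost(reference: bytes, probe: bytes) -> int:
    """Extra compressed bytes needed to encode `probe` once `reference` is known."""
    return compressed_size(reference + probe) - compressed_size(reference)

def closest_reference(references: dict, probe: bytes) -> str:
    """Name of the reference text under which `probe` is cheapest to encode."""
    return min(references, key=lambda name: cross_cost(references[name], probe))
```

With one reference text per language or per author, `closest_reference` gives a crude form of language recognition or authorship attribution. Because zlib only looks back 32 KB, the probe has to be short relative to that window for the reference to influence its encoding.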
A linguistically-driven methodology for detecting impending and unfolding emergencies from social media messages
Natural disasters have demonstrated the crucial role of social media before, during and after emergencies
(Haddow & Haddow 2013). Within our EU project Slándáil, we aim to ethically improve
the use of social media in enhancing the response of disaster-related agencies. To this end, we
have collected corpora of social and formal media to study newsroom communication of emergency
management organisations in English and Italian. Currently, emergency management agencies
in English-speaking countries use social media to differing degrees,
whereas the Italian national Protezione Civile uses only Twitter at the moment. Our method is developed
with a view to identifying communicative strategies and detecting sentiment in order to
distinguish warnings from actual disasters and major from minor disasters. Our linguistic analysis
relies on human annotators to classify messages as alerts/warnings or as emergency response and
mitigation messages, based on the terminology used and the sentiment expressed. The results of this
analysis are then used to train an application that tags messages and detects disaster- and/or
emergency-related terminology and emotive language, in order to simulate human rating and forward
information to an emergency management system.
Measuring complexity with zippers
Physics concepts have often been borrowed and independently developed by
other fields of science. In this perspective a significant example is that of
entropy in Information Theory. The aim of this paper is to provide a short and
pedagogical introduction to the use of data compression techniques for the
estimate of entropy and other relevant quantities in Information Theory and
Algorithmic Information Theory. We consider in particular the LZ77 algorithm as
a case study and discuss how a zipper can be used for information extraction.
Comment: 10 pages, 3 figures
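As a rough illustration of estimating entropy with a zipper, the compressed length of a sequence divided by its original length gives an upper-bound estimate of its entropy rate. The sketch below uses Python's `zlib` (DEFLATE, which combines LZ77 with Huffman coding) as an assumed stand-in for the zipper discussed in the paper; `bits_per_symbol` and `random_bytes` are hypothetical helper names.

```python
import random
import zlib

def bits_per_symbol(data: bytes) -> float:
    """Compressed size in bits divided by input length in bytes.

    Serves as an upper-bound estimate of the entropy rate; returns 0.0
    for empty input.
    """
    if not data:
        return 0.0
    return 8 * len(zlib.compress(data, 9)) / len(data)

def random_bytes(n: int, seed: int = 0) -> bytes:
    """Reproducible pseudo-random bytes, used as a high-entropy test sequence."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))
```

A constant sequence such as `b"a" * 10000` yields an estimate close to 0 bits per symbol, while `random_bytes(10000)` yields roughly 8. For short inputs the fixed overhead of the compressed format dominates, so the estimate is only informative for long sequences.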
The relation between pitch and gestures in a story-telling task
Anecdotal evidence suggests that both pitch range and
gestures contribute to the perception of speakers’ liveliness in
speech. However, the relation between speakers’ pitch range
and gestures has received little attention. It is possible that
variations in pitch range might be accompanied by variations
in gestures, and vice versa. In second language speech, the
relation between pitch range and gestures might also be
affected by speakers’ difficulty in speaking the L2. In this
pilot study we compare global pitch range and gesture rate in
the speech of 3 native Italian speakers, telling the same story
once in Italian and twice in English as part of an in-class oral
presentation task. The hypothesis tested is that contextual
factors, such as speakers’ nervousness with the task, cause
speakers to use a narrow pitch range and limited gestures, whereas
greater ease with the task, due to its repetition, causes speakers
to use a wider pitch range and more gestures. This
experimental hypothesis is partially confirmed by the results
of this study.