Examining Scientific Writing Styles from the Perspective of Linguistic Complexity
Publishing articles in high-impact English journals is difficult for scholars
around the world, especially for non-native English-speaking scholars (NNESs),
most of whom struggle with proficiency in English. In order to uncover the
differences in English scientific writing between native English-speaking
scholars (NESs) and NNESs, we collected a large-scale data set containing more
than 150,000 full-text articles published in PLoS between 2006 and 2015. We
divided these articles into three groups according to the ethnic backgrounds of
the first and corresponding authors, obtained by Ethnea, and examined the
scientific writing styles in English from a two-fold perspective of linguistic
complexity: (1) syntactic complexity, including measurements of sentence length
and sentence complexity; and (2) lexical complexity, including measurements of
lexical diversity, lexical density, and lexical sophistication. The
observations suggest marginal differences between groups in syntactical and
lexical complexity.
Comment: 6 figures
Detecting Hate Speech in Social Media
In this paper we examine methods to detect hate speech in social media, while
distinguishing this from general profanity. We aim to establish lexical
baselines for this task by applying supervised classification methods using a
recently released dataset annotated for this purpose. As features, our system
uses character n-grams, word n-grams and word skip-grams. We obtain results of
78% accuracy in identifying posts across three classes. Results demonstrate
that the main challenge lies in discriminating profanity and hate speech from
each other. A number of directions for future work are discussed.
Comment: Proceedings of Recent Advances in Natural Language Processing
(RANLP). pp. 467-472. Varna, Bulgaria
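The three feature types named in this abstract (character n-grams, word n-grams and word skip-grams) can be illustrated with a minimal sketch. The function names, the whitespace tokenization and the `max_skip` parameter are assumptions made for illustration; the abstract does not state the n-gram orders, skip distances or preprocessing the authors used.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_skip_bigrams(tokens, max_skip):
    """Return ordered token pairs with at most `max_skip` tokens skipped
    between them (max_skip=0 yields ordinary bigrams).
    Hypothetical helper: the paper's exact skip-gram definition is not given."""
    pairs = []
    for i in range(len(tokens)):
        # j ranges over positions up to max_skip tokens beyond the neighbor.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs


if __name__ == "__main__":
    tokens = "you are so dumb".split()  # naive whitespace tokenization (assumption)
    print(char_ngrams("hate", 2))
    print(word_ngrams(tokens, 2))
    print(word_skip_bigrams(tokens, 1))
```

In a supervised setup such as the one described, counts of these features would typically be fed to a classifier; which classifier the authors used is not stated in the abstract.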
From Statistical to Geolinguistic Data: Mapping and Measuring Linguistic Diversity
The aim of this paper is to describe a new methodology for mapping and measuring linguistic diversity in a territory. The three methods created by the Centro di eccellenza della ricerca Osservatorio linguistico permanente dell'italiano diffuso fra stranieri e delle lingue immigrate in Italia at the Università per Stranieri di Siena are the following:
- the Toscane favelle model, a procedural application which passes from quantitative statistical data to a demolinguistic paradigm;
- the Monterotondo-Mentana model, in which surveys of quantitative and qualitative data are carried out using traditional tools (questionnaires, audio and video recordings) as well as advanced technologies;
- the Esquilino model, in which digital maps are created that present the distribution of the immigrant languages through the presence of signs in the linguistic landscape.
The final objective is to put together the data surveyed by the three methods in order to have a "speaking" territory, in which each point surveyed identifies the languages spoken and the various linguistic manifestations.
Keywords: Language Contact, Linguistic Diversity, Immigrant Languages, Geolinguistic Data, New Methodologies in Sociolinguistic Research
Conservation and use of genetic resources of underutilized crops in the Americas - A continental analysis
Latin America is home to dramatically diverse agroecological regions which harbor a high concentration of underutilized plant species, whose genetic resources hold the potential to address challenges such as sustainable agricultural development, food security and sovereignty, and climate change. This paper examines the status of an expert-informed list of underutilized crops in Latin America and analyses how the most common features of underuse apply to them. The analysis pays special attention to whether and how existing international policy and legal frameworks on biodiversity and plant genetic resources effectively support the conservation and sustainable use of underutilized crops. Results show that not all minor crops are affected by the same degree of neglect, and that the aspects under which any crop is underutilized vary greatly, calling for specific analyses and interventions. We also show that current international policy and legal instruments have so far provided limited stimulus and funding for the conservation and sustainable use of the genetic resources of these crops. Finally, the paper proposes an analytical framework for identifying and evaluating a crop's underutilization, in order to define the most appropriate type and levels of intervention (international, national, local) for improving its status.
#Bieber + #Blast = #BieberBlast: Early Prediction of Popular Hashtag Compounds
Compounding of natural language units is a very common phenomenon. In this
paper, we show, for the first time, that Twitter hashtags, which could be
considered as correlates of such linguistic units, undergo compounding. We
identify reasons for this compounding and propose a prediction model that can
identify with 77.07% accuracy whether a pair of hashtags compounding in the
near future will become popular (i.e., 2 months after compounding). At longer
horizons of T = 6 and 10 months, the accuracies are 77.52% and 79.13%,
respectively. This technique has strong implications for trending hashtag
recommendation, since newly formed hashtag compounds can be recommended early,
even before the compounding has taken place. Further, humans can predict
compounds with an overall accuracy of only 48.7% (treated as baseline).
Notably, while humans can discriminate the relatively easier cases, the
automatic framework is successful in classifying the relatively harder cases.
Comment: 14 pages, 4 figures, 9 tables, published in CSCW (Computer-Supported
Cooperative Work and Social Computing) 2016. In Proceedings of 19th ACM
conference on Computer-Supported Cooperative Work and Social Computing (CSCW
2016)
Language identification with suprasegmental cues: A study based on speech resynthesis
This paper proposes a new experimental paradigm to explore the discriminability of languages, a question which is crucial to the child born in a bilingual environment. This paradigm employs the speech resynthesis technique, enabling the experimenter to preserve or degrade acoustic cues such as phonotactics, syllabic rhythm or intonation from natural utterances. English and Japanese sentences were resynthesized, preserving broad phonotactics, rhythm and intonation (Condition 1), rhythm and intonation (Condition 2), intonation only (Condition 3), or rhythm only (Condition 4). The findings support the notion that syllabic rhythm is a necessary and sufficient cue for French adult subjects to discriminate English from Japanese sentences. The results are consistent with previous research using low-pass filtered speech, as well as with phonological theories predicting rhythmic differences between languages. Thus, the new methodology proposed appears to be well-suited to study language discrimination. Applications for other domains of psycholinguistic research and for automatic language identification are considered.
Formulaic Sequences as Fluency Devices in the Oral Production of Native Speakers of Polish
In this paper we attempt to determine the nature and strength of the relationship between the use of formulaic sequences and the productive fluency of native speakers of Polish. In particular, we seek to validate the claim that speech characterized by a higher incidence of formulaic sequences is produced more rapidly and with fewer hesitation phenomena. The analysis is based on monologic speeches delivered by 45 speakers of L1 Polish. The data include both the recordings and their transcriptions annotated for a number of objective fluency measures. In the first part of the study the total number of formulaic sequences is established for each sample. This is followed by determining a set of temporal measures of the speakers' output (speech rate, articulation rate, mean length of runs, mean length of pauses, phonation time ratio). The study provides some preliminary evidence of the fluency-enhancing role of formulaic language. Our results show that the use of formulaic sequences is positively and significantly correlated with speech rate, mean length of runs and phonation time ratio. This suggests that a higher concentration of formulaic material in output is associated with a faster rate of speech, longer stretches of speech between pauses and an increased amount of time filled with speech.
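The five temporal measures listed in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the definitions are those commonly used in fluency research (syllables per second, with runs delimited by pauses), and the function name, the input format and the choice of syllables as the unit are hypothetical. The abstract does not say how the authors operationalized each measure.

```python
def fluency_measures(runs, pauses):
    """Compute common temporal fluency measures.

    runs:   list of (syllable_count, duration_seconds) tuples, one per
            uninterrupted stretch of speech (assumed input format).
    pauses: list of pause durations in seconds.
    """
    total_syllables = sum(syllables for syllables, _ in runs)
    phonation_time = sum(duration for _, duration in runs)
    pause_time = sum(pauses)
    total_time = phonation_time + pause_time
    return {
        # Syllables per second over the whole sample, pauses included.
        "speech_rate": total_syllables / total_time,
        # Syllables per second over speaking time only.
        "articulation_rate": total_syllables / phonation_time,
        # Mean number of syllables produced between pauses.
        "mean_length_of_runs": total_syllables / len(runs),
        # Mean pause duration in seconds (0.0 if there are no pauses).
        "mean_length_of_pauses": pause_time / len(pauses) if pauses else 0.0,
        # Share of the total time that is filled with speech.
        "phonation_time_ratio": phonation_time / total_time,
    }


if __name__ == "__main__":
    # Made-up sample: two runs separated by a single one-second pause.
    print(fluency_measures(runs=[(10, 2.0), (6, 1.0)], pauses=[1.0]))
```

Correlating these per-speaker values with the count of formulaic sequences would mirror the kind of analysis the abstract describes, though the statistical procedure used in the study is not specified here.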