
    Examining Scientific Writing Styles from the Perspective of Linguistic Complexity

    Publishing articles in high-impact English journals is difficult for scholars around the world, especially for non-native English-speaking scholars (NNESs), most of whom struggle with proficiency in English. To uncover the differences in English scientific writing between native English-speaking scholars (NESs) and NNESs, we collected a large-scale data set containing more than 150,000 full-text articles published in PLoS between 2006 and 2015. We divided these articles into three groups according to the ethnic backgrounds of the first and corresponding authors, obtained by Ethnea, and examined the scientific writing styles in English from a two-fold perspective of linguistic complexity: (1) syntactic complexity, including measurements of sentence length and sentence complexity; and (2) lexical complexity, including measurements of lexical diversity, lexical density, and lexical sophistication. The observations suggest marginal differences between groups in syntactic and lexical complexity. Comment: 6 figures

    Detecting Hate Speech in Social Media

    In this paper we examine methods to detect hate speech in social media, while distinguishing it from general profanity. We aim to establish lexical baselines for this task by applying supervised classification methods to a recently released dataset annotated for this purpose. As features, our system uses character n-grams, word n-grams and word skip-grams. We obtain 78% accuracy in identifying posts across three classes. Results demonstrate that the main challenge lies in discriminating profanity and hate speech from each other. A number of directions for future work are discussed. Comment: Proceedings of Recent Advances in Natural Language Processing (RANLP). pp. 467-472. Varna, Bulgaria
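The three feature types named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, tokens are assumed to be pre-split words, and a skip-gram is assumed here to mean a word pair separated by at most `max_skip` intervening tokens (the abstract does not give the exact definition).

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_skip_bigrams(tokens, max_skip):
    """Return word pairs with at most `max_skip` tokens between them.

    Assumption: with max_skip=0 this reduces to ordinary word bigrams.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # The right-hand word may sit up to max_skip + 1 positions ahead.
        last = min(i + max_skip + 2, len(tokens))
        for j in range(i + 1, last):
            pairs.append((left, tokens[j]))
    return pairs


if __name__ == "__main__":
    tokens = "this is a post".split()
    print(char_ngrams("post", 2))
    print(word_ngrams(tokens, 2))
    print(word_skip_bigrams(tokens, 1))
```

In a supervised setup such as the one the abstract describes, the counts of these features would typically be fed to a classifier; which classifier the authors used is not stated in the abstract.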

    From Statistical to Geolinguistic Data: Mapping and Measuring Linguistic Diversity

    The aim of this paper is to describe a new methodology for mapping and measuring linguistic diversity in a territory. The three methods, created by the Centro di eccellenza della ricerca Osservatorio linguistico permanente dell’italiano diffuso fra stranieri e delle lingue immigrate in Italia at the Università per Stranieri di Siena, are the following:
    - the Toscane favelle model, a procedural application which passes from quantitative statistical data to a demolinguistic paradigm;
    - the Monterotondo-Mentana model, in which surveys of quantitative and qualitative data are carried out using traditional tools (questionnaires, audio and video recordings) as well as advanced technologies;
    - the Esquilino model, in which digital maps are created that present the distribution of the immigrant languages through the presence of signs in the linguistic landscape.
    The final objective is to put together the data surveyed by the three methods in order to have a “speaking” territory, in which each point surveyed identifies the languages spoken and the various linguistic manifestations.
    Keywords: Language Contact, Linguistic Diversity, Immigrant Languages, Geolinguistic Data, New Methodologies in Sociolinguistic Research

    Conservation and use of genetic resources of underutilized crops in the Americas - A continental analysis

    Latin America is home to dramatically diverse agroecological regions which harbor a high concentration of underutilized plant species, whose genetic resources hold the potential to address challenges such as sustainable agricultural development, food security and sovereignty, and climate change. This paper examines the status of an expert-informed list of underutilized crops in Latin America and analyses how the most common features of underuse apply to them. The analysis pays special attention to whether and how existing international policy and legal frameworks on biodiversity and plant genetic resources effectively support the conservation and sustainable use of underutilized crops. Results show that not all minor crops are affected by the same degree of neglect, and that the aspects under which any crop is underutilized vary greatly, calling for specific analyses and interventions. We also show that current international policy and legal instruments have so far provided limited stimulus and funding for the conservation and sustainable use of the genetic resources of these crops. Finally, the paper proposes an analytical framework for identifying and evaluating a crop’s underutilization, in order to define the most appropriate type and levels of intervention (international, national, local) for improving its status.

    #Bieber + #Blast = #BieberBlast: Early Prediction of Popular Hashtag Compounds

    Compounding of natural language units is a very common phenomenon. In this paper, we show, for the first time, that Twitter hashtags, which could be considered as correlates of such linguistic units, undergo compounding. We identify reasons for this compounding and propose a prediction model that can identify with 77.07% accuracy whether a pair of hashtags that compound in the near future will become popular (i.e., 2 months after compounding). At longer horizons of T = 6 and 10 months, the accuracies are 77.52% and 79.13%, respectively. This technique has strong implications for trending hashtag recommendation, since newly formed hashtag compounds can be recommended early, even before the compounding has taken place. Further, humans can predict compounds with an overall accuracy of only 48.7% (treated as baseline). Notably, while humans can discriminate the relatively easier cases, the automatic framework is successful in classifying the relatively harder cases. Comment: 14 pages, 4 figures, 9 tables, published in CSCW (Computer-Supported Cooperative Work and Social Computing) 2016. In Proceedings of 19th ACM conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2016)

    Language identification with suprasegmental cues: A study based on speech resynthesis

    This paper proposes a new experimental paradigm to explore the discriminability of languages, a question which is crucial to the child born in a bilingual environment. This paradigm employs the speech resynthesis technique, enabling the experimenter to preserve or degrade acoustic cues such as phonotactics, syllabic rhythm or intonation from natural utterances. English and Japanese sentences were resynthesized, preserving broad phonotactics, rhythm and intonation (Condition 1), rhythm and intonation (Condition 2), intonation only (Condition 3), or rhythm only (Condition 4). The findings support the notion that syllabic rhythm is a necessary and sufficient cue for French adult subjects to discriminate English from Japanese sentences. The results are consistent with previous research using low-pass filtered speech, as well as with phonological theories predicting rhythmic differences between languages. Thus, the new methodology proposed appears to be well-suited to study language discrimination. Applications for other domains of psycholinguistic research and for automatic language identification are considered.

    Formulaic Sequences as Fluency Devices in the Oral Production of Native Speakers of Polish

    In this paper we attempt to determine the nature and strength of the relationship between the use of formulaic sequences and the productive fluency of native speakers of Polish. In particular, we seek to validate the claim that speech characterized by a higher incidence of formulaic sequences is produced more rapidly and with fewer hesitation phenomena. The analysis is based on monologic speeches delivered by 45 speakers of L1 Polish. The data include both the recordings and their transcriptions annotated for a number of objective fluency measures. In the first part of the study the total number of formulaic sequences is established for each sample. This is followed by determining a set of temporal measures of the speakers’ output (speech rate, articulation rate, mean length of runs, mean length of pauses, phonation time ratio). The study provides some preliminary evidence of the fluency-enhancing role of formulaic language. Our results show that the use of formulaic sequences is positively and significantly correlated with speech rate, mean length of runs and phonation time ratio. This suggests that a higher concentration of formulaic material in output is associated with faster speech, longer stretches of speech between pauses and an increased amount of time filled with speech.
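The five temporal measures listed in the abstract above can be computed from pause-annotated speech roughly as follows. This is a sketch under stated assumptions: the abstract does not define the measures, so the formulas below follow definitions commonly used in fluency research (syllable-based rates, runs delimited by pauses), and the `Run` type and `fluency_measures` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of speech between two pauses (hypothetical input format)."""
    syllables: int
    duration: float  # seconds of actual phonation


def fluency_measures(runs, pauses):
    """Compute temporal fluency measures from runs and pause durations (seconds).

    Assumed definitions (not taken from the abstract):
      speech rate          = syllables / total time, pauses included
      articulation rate    = syllables / phonation time, pauses excluded
      mean length of runs  = syllables per run
      phonation time ratio = phonation time / total time
    """
    phonation_time = sum(r.duration for r in runs)
    pause_time = sum(pauses)
    total_time = phonation_time + pause_time
    syllables = sum(r.syllables for r in runs)
    return {
        "speech_rate": syllables / total_time,
        "articulation_rate": syllables / phonation_time,
        "mean_length_of_runs": syllables / len(runs),
        "mean_length_of_pauses": pause_time / len(pauses) if pauses else 0.0,
        "phonation_time_ratio": phonation_time / total_time,
    }


if __name__ == "__main__":
    sample = fluency_measures([Run(10, 2.0), Run(6, 2.0)], [1.0])
    for name, value in sample.items():
        print(f"{name}: {value:.2f}")
```

Correlating these values with the per-sample count of formulaic sequences, as the study reports doing, would be a separate step on top of this.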