
    Predicting Native Language from Gaze

    A fundamental question in language learning concerns the role of a speaker's first language in second language acquisition. We present a novel methodology for studying this question: analysis of eye-movement patterns in second language reading of free-form text. Using this methodology, we demonstrate for the first time that the native language of English learners can be predicted from their gaze fixations when reading English. We provide analysis of classifier uncertainty and learned features, which indicates that differences in English reading are likely to be rooted in linguistic divergences across native languages. The presented framework complements production studies and offers new ground for advancing research on multilingualism. Comment: ACL 201

    Scalable Privacy-Compliant Virality Prediction on Twitter

    The digital town hall of Twitter has become a preferred medium of communication for individuals and organizations across the globe. Some of them reach audiences of millions, while others struggle to get noticed. Given the impact of social media, the question remains more relevant than ever: how to model the dynamics of attention on Twitter. Researchers around the world turn to machine learning to predict the most influential tweets and authors, navigating the volume, velocity, and variety of social big data, with many compromises. In this paper, we revisit content popularity prediction on Twitter. We argue that strict alignment of data acquisition, storage and analysis algorithms is necessary to avoid the common trade-offs between scalability, accuracy and privacy compliance. We propose a new framework for the rapid acquisition of large-scale datasets, a high-accuracy supervisory signal and multilanguage sentiment prediction while respecting every applicable privacy request. We then apply a novel gradient boosting framework to achieve state-of-the-art results in virality ranking, even before including a tweet's visual or propagation features. Our Gradient Boosted Regression Tree is the first to offer explainable, strong ranking performance on benchmark datasets. Since the analysis focused on features available early, the model is immediately applicable to incoming tweets in 18 languages. Comment: AffCon@AAAI-19 Best Paper Award; Presented at AAAI-19 W1: Affective Content Analysi

    Multilingual Twitter Sentiment Classification: The Role of Human Annotators

    What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered
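    The abstract above does not say which annotator agreement measures were used. As a rough illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure, for two annotators labeling the same tweets. The function name `cohen_kappa` and the sample labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Illustrative sketch: the source abstract does not name this measure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently,
    # each following their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["negative", "neutral", "positive", "positive"]
    annotator_2 = ["negative", "neutral", "positive", "neutral"]
    print(cohen_kappa(annotator_1, annotator_2))
```

    Kappa treats the labels as unordered categories. Since the abstract reports that the sentiment classes are perceived as ordered, a measure that penalizes negative/positive confusions more heavily than negative/neutral ones may be a better fit in practice.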

    Effects of corrective feedback on EFL speaking task complexity in China’s university classroom

    Corrective feedback (CF) and task complexity have been two important pedagogical topics in second language acquisition research in recent years, but little research has investigated the effects of CF on speaking task complexity in China’s university classroom settings. Through speaking task experiments in different versions conducted with 24 university students in China, this research explores the effect of teachers’ CF on English as a Foreign Language (EFL) speaking task complexity. Analysis of the first-hand data shows that CF has different effects on EFL oral production at different levels of task complexity. In the simple speaking task, the effects of the five kinds of CF, from largest to smallest, are: clarification request, metalinguistic feedback, recast, repetition and confirmation check. In the complex speaking task, the effects of the five kinds of CF, from largest to smallest, are: metalinguistic feedback, confirmation check, recast, clarification request and repetition. Improving how CF is provided in pedagogical practice is an important contribution to EFL speaking tasks; on the basis of these results, appropriate ways and forms of providing CF are expected to make CF more efficient in EFL classrooms in Chinese universities.

    Attitudes toward immigrants in Luxembourg - Do contacts matter?

    According to the latest official statistics, the number of immigrants in Luxembourg is approaching half the population. This demographic change raises questions concerning social inclusion, social cohesion, and intergroup conflicts. The present paper contributes to this discussion by analyzing attitudes toward immigrants and their determinants. Controlling for key socio-demographic and economic individual characteristics, we focus specifically on examining how the intensity of core contacts between nationals and inhabitants with migratory background affects attitudes toward immigrants among three groups of Luxembourg residents: natives, first-generation immigrants, and second-generation immigrants. The European Values Study data of 2008 was used in the paper. The results indicate that attitudes toward immigrants depend significantly on the origins of the residents of Luxembourg. Nationals adopt the most negative stance toward immigrants; they are followed by second-generation and first-generation immigrants. Attitudes of second-generation immigrants are closer to those of the native population than to those of first-generation immigrants, which confirms the assimilation hypotheses. Core contacts appear to play the most important role in the case of first-generation immigrants. The more connected the first-generation migrant to the native population, the more negative his/her opinion of immigrants. Keywords: attitudes toward immigrants; contact theory; migratory background; EVS

    Global disease monitoring and forecasting with Wikipedia

    Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data such as social media and search queries are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with $r^2$ up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art. Comment: 27 pages; 4 figures; 4 tables. Version 2: Cite McIver & Brownstein and adjust novelty claims accordingly; revise title; various revisions for clarit
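    The abstract above reports model quality as the coefficient of determination of linear models. The sketch below shows how that statistic is computed for a one-variable least-squares fit. It is a simplified illustration under assumed inputs (a single series of article view counts against case counts), not the paper's actual model or data; `fit_line`, `r_squared`, and the numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("y values must not all be identical")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical weekly article views (x) and reported cases (y).
    views = [120.0, 150.0, 210.0, 260.0, 330.0]
    cases = [11.0, 15.0, 19.0, 27.0, 31.0]
    slope, intercept = fit_line(views, cases)
    print(slope, intercept, r_squared(views, cases, slope, intercept))
```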

    Portuguese patent classification: A use case of text classification using machine learning and transfer learning approaches

    Project work presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. Patent classification is one of the areas in Intellectual Property Analytics (IPA), and a growing use case, since the number of patent applications has been increasing worldwide over the years. More than ever, patents are being used as financial protection by companies, which also use patent databases to inform research and drive product innovation. The Instituto Nacional de Propriedade Industrial (INPI) is the government agency responsible for protecting industrial property rights in Portugal. INPI promoted a competition to explore technologies for solving challenges related to industrial property, including the classification of patents, one of the critical phases of the patent-granting process. In this work project, we used the dataset made available by INPI to explore traditional machine learning algorithms for classifying Portuguese patents and to evaluate the performance of transfer learning methodologies on this task. BERTimbau, a BERT architecture model pre-trained on a large Portuguese corpus, gave the best results on the task, although its performance was only 4% better than that of a LinearSVC model using TF-IDF feature engineering. In general, the model performs well, despite low scores for classes with few training samples. However, the analysis of misclassified samples showed that the specificity of the context influences learning more than the number of samples itself. Patent classification is a challenging task because of 1) the hierarchical structure of the classification, 2) the way a patent is described, 3) the overlap of contexts, and 4) the underrepresentation of some classes. Nevertheless, it is an area of growing interest that can benefit from the new research that is revolutionizing machine learning applications, especially text mining.