Predicting Native Language from Gaze
A fundamental question in language learning concerns the role of a speaker's
first language in second language acquisition. We present a novel methodology
for studying this question: analysis of eye-movement patterns in second
language reading of free-form text. Using this methodology, we demonstrate for
the first time that the native language of English learners can be predicted
from their gaze fixations when reading English. We provide analysis of
classifier uncertainty and learned features, which indicates that differences
in English reading are likely to be rooted in linguistic divergences across
native languages. The presented framework complements production studies and
offers new ground for advancing research on multilingualism. Comment: ACL 201
Scalable Privacy-Compliant Virality Prediction on Twitter
The digital town hall of Twitter has become a preferred medium of communication
for individuals and organizations across the globe. Some of them reach
audiences of millions, while others struggle to get noticed. Given the impact
of social media, the question remains more relevant than ever: how to model the
dynamics of attention in Twitter. Researchers around the world turn to machine
learning to predict the most influential tweets and authors, navigating the
volume, velocity, and variety of social big data, with many compromises. In
this paper, we revisit content popularity prediction on Twitter. We argue that
strict alignment of data acquisition, storage and analysis algorithms is
necessary to avoid the common trade-offs between scalability, accuracy and
privacy compliance. We propose a new framework for the rapid acquisition of
large-scale datasets, a high-accuracy supervisory signal and multilanguage
sentiment prediction while respecting every applicable privacy request. We then
apply a novel gradient boosting framework to achieve state-of-the-art results
in virality ranking, even before including tweets' visual or propagation
features. Our Gradient Boosted Regression Tree is the first to offer
explainable, strong ranking performance on benchmark datasets. Since the
analysis focused on features available early, the model is immediately
applicable to incoming tweets in 18 languages. Comment: AffCon@AAAI-19 Best Paper Award; Presented at AAAI-19 W1: Affective
Content Analysis
Multilingual Twitter Sentiment Classification: The Role of Human Annotators
What are the limits of automated Twitter sentiment classification? We analyze
a large set of manually labeled tweets in different languages, use them as
training data, and construct automated classification models. It turns out that
the quality of classification models depends much more on the quality and size
of training data than on the type of the model trained. Experimental results
indicate that there is no statistically significant difference between the
performance of the top classification models. We quantify the quality of
training data by applying various annotator agreement measures, and identify
the weakest points of different datasets. We show that the model performance
approaches the inter-annotator agreement when the size of the training set is
sufficiently large. However, it is crucial to regularly monitor the self- and
inter-annotator agreements since this improves the training datasets and
consequently the model performance. Finally, we show that there is strong
evidence that humans perceive the sentiment classes (negative, neutral, and
positive) as ordered.
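As an illustration of what an annotator agreement measure computes, here is a minimal sketch of Cohen's kappa for two annotators. The abstract does not name the measures the authors applied, so the choice of kappa, the function name `cohen_kappa`, and the toy labels are all assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy annotations, not data from the paper.
annotator_1 = ["negative", "neutral", "positive", "positive", "neutral", "negative"]
annotator_2 = ["negative", "neutral", "positive", "neutral", "neutral", "positive"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Kappa treats the three sentiment classes as unordered. If the classes are ordered, as the abstract argues, a measure that penalizes negative/positive disagreements more than negative/neutral ones would be the more natural choice.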
Effects of corrective feedback on EFL speaking task complexity in China's university classroom
Corrective feedback (CF) and task complexity have been two important pedagogical topics in second language acquisition research in recent years, but little research has investigated the effects of CF on speaking task complexity in China's university classroom settings. This research, through different versions of speaking task experiments conducted among 24 university students in China, explores the effect of teachers' CF on English as a Foreign Language (EFL) speaking task complexity. According to the analysis of first-hand data, CF has different effects on EFL oral production at different levels of task complexity. In the simple speaking task, the effects of the five kinds of CF, from largest to smallest, are: clarification request, metalinguistic feedback, recast, repetition and confirmation check. In the complex speaking task, the effects of the five kinds of CF, from largest to smallest, are: metalinguistic feedback, confirmation check, recast, clarification request and repetition. Improving the provision of CF in pedagogical practice is an important contribution to promoting EFL speaking tasks; on the basis of these results, appropriate ways and forms of providing CF are expected to improve the efficiency of CF in EFL classrooms in the context of Chinese universities.
Attitudes toward immigrants in Luxembourg - Do contacts matter?
According to the latest official statistics, the number of immigrants in Luxembourg is approaching half the population. This demographic change raises questions concerning social inclusion, social cohesion, and intergroup conflicts. The present paper contributes to this discussion by analyzing attitudes toward immigrants and their determinants. Controlling for key socio-demographic and economic individual characteristics, we focus specifically on examining how the intensity of core contacts between nationals and inhabitants with migratory background affects attitudes toward immigrants among three groups of Luxembourg residents: natives, first-generation immigrants, and second-generation immigrants. The European Values Study data of 2008 was used in the paper. The results indicate that attitudes toward immigrants depend significantly on the origins of the residents of Luxembourg. Nationals adopt the most negative stance toward immigrants; they are followed by second-generation and first-generation immigrants. Attitudes of second-generation immigrants are closer to those of the native population than to those of first-generation immigrants, which confirms the assimilation hypotheses. Core contacts appear to play the most important role in the case of first-generation immigrants. The more connected the first-generation migrant to the native population, the more negative his/her opinion of immigrants. Keywords: attitudes toward immigrants; contact theory; migratory background; EVS
Global disease monitoring and forecasting with Wikipedia
Infectious disease is a leading threat to public health, economic stability,
and other key social structures. Efforts to mitigate these impacts depend on
accurate and timely monitoring to measure the risk and progress of disease.
Traditional, biologically-focused monitoring techniques are accurate but costly
and slow; in response, new techniques based on social internet data such as
social media and search queries are emerging. These efforts are promising, but
important challenges in the areas of scientific peer review, breadth of
diseases and countries, and forecasting hamper their operational usefulness.
We examine a freely available, open data source for this use: access logs
from the online encyclopedia Wikipedia. Using linear models, language as a
proxy for location, and a systematic yet simple article selection procedure, we
tested 14 location-disease combinations and demonstrate that these data
feasibly support an approach that overcomes these challenges. Specifically, our
proof-of-concept yields models with $r^2$ up to 0.92, forecasting value up to
the 28 days tested, and several pairs of models similar enough to suggest that
transferring models from one location to another without re-training is
feasible.
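To make the modeling idea concrete, the following is a minimal sketch of a single-predictor ordinary least squares fit scored by the coefficient of determination. It assumes one article's access counts as the only predictor; the abstract does not specify the actual model form or article selection, and the function names and numbers below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical weekly figures, not data from the paper:
# article access counts in one language vs. reported cases in one location.
article_views = [120, 150, 200, 260, 310]
reported_cases = [10, 14, 19, 27, 30]

slope, intercept = fit_line(article_views, reported_cases)
r2 = r_squared(article_views, reported_cases, slope, intercept)
print(f"cases ~ {slope:.3f} * views + {intercept:.2f}, r^2 = {r2:.3f}")
```

Forecasting in this setting would pair each week's access counts with case counts from a later week before fitting, and model transfer would apply a slope and intercept fitted for one location to another location's access counts.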
Based on these preliminary results, we close with a research agenda designed
to overcome these challenges and produce a disease monitoring and forecasting
system that is significantly more effective, robust, and globally comprehensive
than the current state of the art. Comment: 27 pages; 4 figures; 4 tables. Version 2: Cite McIver & Brownstein
and adjust novelty claims accordingly; revise title; various revisions for
clarity
Portuguese patent classification: A use case of text classification using machine learning and transfer learning approaches
Project Work presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. Patent classification is one of the areas in Intellectual Property Analytics (IPA), and a growing use case, since the number of patent applications has been increasing worldwide over the years. More than ever, patents are used as financial protection by companies, which also use patent databases to support research and drive product innovation. The Instituto Nacional de Propriedade Industrial (INPI) is the government agency responsible for protecting industrial property rights in Portugal. INPI promoted a competition to explore technologies for solving challenges related to industrial property, including the classification of patents, one of the critical phases of the patent grant process.
In this work project, we used the dataset made available by INPI to explore traditional machine learning algorithms for classifying Portuguese patents and to evaluate the performance of transfer learning methodologies on this task. BERTTimbau, a BERT architecture model pre-trained on a large Portuguese corpus, achieved the best results on the task, although its performance was only 4% better than that of a LinearSVC model using TF-IDF feature engineering. In general, the model performs well, despite low scores for classes with few training samples. However, the analysis of misclassified samples showed that the specificity of the context has more influence on learning than the number of samples itself.
Patent classification is a challenging task not only because of 1) the hierarchical structure of the classification but also because of 2) the way a patent is described, 3) the overlap of contexts, and 4) the underrepresentation of classes. Nevertheless, it is an area of growing interest, one that can benefit from the new research that is revolutionizing machine learning applications, especially text mining.