22 research outputs found

    Revisiting reverts: accurate revert detection in Wikipedia

    No full text
    Wikipedia is commonly used as a proving ground for research in collaborative systems. This is likely due to its popularity and scale, but also to the fact that large amounts of data about its formation and evolution are freely available to inform and validate theories and models of online collaboration. As part of the development of such approaches, revert detection is often performed as an important pre-processing step in tasks as diverse as the extraction of implicit networks of editors, the analysis of edit or editor features and the removal of noise when analyzing the emergence of the content of an article. The current state of the art in revert detection is based on a rather naive approach, which identifies revision duplicates based on MD5 hash values. This is an efficient, but not very precise technique that forms the basis for the majority of research based on revert relations in Wikipedia. In this paper we prove that this method has a number of important drawbacks - it only detects a limited number of reverts, while simultaneously misclassifying too many edits as reverts, and not distinguishing between complete and partial reverts. This is very likely to hamper the accurate interpretation of the findings of revert-related research. We introduce an improved algorithm for the detection of reverts, based on word tokens added or deleted, that addresses these drawbacks. We report on the results of a user study and other tests demonstrating the considerable gains in accuracy and coverage by our method, and argue for a positive trade-off, in certain research scenarios, between these improvements and our algorithm's increased runtime.
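    The hash-based baseline that the abstract criticizes can be illustrated roughly as follows. This is a minimal sketch, not the paper's code: the function name `detect_identity_reverts`, the input format (a chronological list of revision texts), and the rule that at least one revision must lie between the duplicate pair are all assumptions made for illustration.

```python
import hashlib


def detect_identity_reverts(revisions):
    """Flag revisions whose full text duplicates an earlier revision.

    `revisions` is a chronological list of revision texts (an assumed
    input format). Returns (reverting_index, restored_index) pairs.
    """
    last_seen = {}  # MD5 digest -> index of the latest revision with that text
    reverts = []
    for i, text in enumerate(revisions):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        earlier = last_seen.get(digest)
        # Assumption: require at least one intervening revision, so that
        # an immediate duplicate (a null edit) is not counted as a revert.
        if earlier is not None and earlier < i - 1:
            reverts.append((i, earlier))
        last_seen[digest] = i
    return reverts


history = ["a", "a b", "a"]
print(detect_identity_reverts(history))
```

    Because the comparison is on whole-text digests, a sketch like this can only find exact restorations of an earlier state, which is consistent with the abstract's point that partial reverts go undetected.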

    Demographic inference and representative population estimates from multilingual social media data

    No full text
    Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts. In a large experiment over multilingual heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate estimates of populations and make a significant step towards representative social sensing in downstream applications with multilingual social media.
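    The general idea of post-stratification mentioned in the abstract can be sketched as below. This is a deliberately simplified illustration, not the paper's method: the abstract describes multilevel regression for estimating inclusion probabilities, whereas this sketch uses a plain per-cell ratio. The function names, the cell labels, and the numbers are hypothetical.

```python
def inclusion_probabilities(sample_counts, population_counts):
    """Per-cell ratio of inferred sample count to ground-truth population count.

    Assumption: a simple ratio stands in for the regression-based
    estimate described in the abstract.
    """
    return {
        cell: sample_counts[cell] / population_counts[cell]
        for cell in sample_counts
        if population_counts.get(cell, 0) > 0
    }


def poststratified_mean(cell_means, population_counts):
    """Reweight per-cell sample means by each cell's population share."""
    cells = [c for c in cell_means if c in population_counts]
    total = sum(population_counts[c] for c in cells)
    return sum(cell_means[c] * population_counts[c] for c in cells) / total


# Hypothetical demographic cells and counts, for illustration only.
sample = {"f": 10, "m": 30}
population = {"f": 100, "m": 100}
probs = inclusion_probabilities(sample, population)
estimate = poststratified_mean({"f": 1.0, "m": 0.0}, population)
print(probs, estimate)
```

    In this toy case the raw sample over-represents one cell three to one, while the reweighted estimate treats both cells according to their equal population shares.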

    SemEval-2022 Task 8: multilingual news article similarity

    No full text
    Thousands of new news articles appear daily in outlets in different languages. Understanding which articles refer to the same story can not only improve applications like news aggregation but enable cross-linguistic analysis of media consumption and attention. However, assessing the similarity of stories in news articles is challenging due to the different dimensions in which a story might vary, e.g., two articles may have substantial textual overlap but describe similar events that happened years apart. To address this challenge, we introduce a new dataset of nearly 10,000 news article pairs spanning 18 language combinations annotated for seven dimensions of similarity as SemEval 2022 Task 8. Here, we present an overview of the task, the best performing submissions, and the frontiers and challenges for measuring multilingual news article similarity. While the participants of this SemEval task contributed very strong models, achieving up to 0.818 correlation with gold standard labels across languages, human annotators are capable of reaching higher correlations, suggesting space for further progress

    Questions in English and French Research Articles in Linguistics: A Corpus-Based Contrastive Analysis

    No full text
    Although research on evaluation in academic writing has profited from developments in contrastive linguistics since the late 1980s, very little empirical research has been conducted with respect to questions in contrastive studies. The aim of this study is to investigate the functions of questions as a means of reader engagement in academic research articles in English and French in the discipline of linguistics. To do this, a corpus-based contrastive analysis of two subcorpora of KIAP (Fløttum et al. in Academic voices across languages and disciplines, John Benjamins, Amsterdam, 2006) is conducted. The English and French subcorpora are assessed using Hyland’s model of stance and reader engagement in terms of questions and their seven functions as evaluative markers of reader engagement (Text 22(4):529–557, 2002; Discourse Stud 7(2):173–192, 2005b), including their form and distribution within the text. This analysis focuses on two particular functions of questions, namely ‘framing the discourse’ and ‘organising the text’. The results suggest that, although there is some degree of homogeneity in the use of questions in terms of function, form and distribution, there is also evidence of important differences between the two languages. These findings illustrate some distinctions in writing in these two discourse communities and their potential for informing language pedagogy in both English for academic purposes and Français langue académique.