1,280 research outputs found

    Characterising User Content on a Multi-lingual Social Network

    Full text link
    Social media has been on the vanguard of political information diffusion in the 21st century. Most studies that look into disinformation, political influence and fake news focus on mainstream social media platforms. This has inevitably made English an important factor in our current understanding of political activity on social media. As a result, there have only been a limited number of studies into a large portion of the world, including the largest multilingual and multi-cultural democracy: India. In this paper we present our characterisation of a multilingual social network in India called ShareChat. We collect an exhaustive dataset across 72 weeks before and during the Indian general elections of 2019, across 14 languages. We investigate the cross-lingual dynamics by clustering visually similar images together and exploring how they move across language barriers. We find that the Telugu, Malayalam, Tamil and Kannada languages tend to be dominant in soliciting political images (often referred to as memes), and that posts from Hindi have the largest cross-lingual diffusion across ShareChat (as do images containing text in English). In the case of images containing text that cross language barriers, we see that language translation is used to widen accessibility. That said, we find cases where the same image is associated with very different text (and therefore meanings). This initial characterisation paves the way for more advanced pipelines to understand the dynamics of fake and political content in a multi-lingual and non-textual setting. Comment: Accepted at ICWSM 2020; please cite the ICWSM version.
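
    The cross-lingual analysis above relies on grouping visually similar images so that the same meme can be tracked across languages. The abstract does not spell out the clustering method; the following is a minimal, hypothetical sketch of one common approach, using perceptual hashing (the imagehash package) with an illustrative Hamming-distance threshold.

        # Hypothetical sketch: group near-duplicate images via perceptual hashing.
        # The directory name and distance threshold are illustrative, not from the paper.
        from pathlib import Path

        from PIL import Image
        import imagehash

        def cluster_images(image_dir, max_distance=10):
            """Greedy clustering: an image joins the first cluster whose representative
            hash is within max_distance bits; otherwise it starts a new cluster."""
            clusters = []  # list of (representative_hash, [paths]) pairs
            for path in sorted(Path(image_dir).glob("*.jpg")):
                h = imagehash.phash(Image.open(path))
                for rep, members in clusters:
                    if h - rep <= max_distance:  # Hamming distance between the two hashes
                        members.append(path)
                        break
                else:
                    clusters.append((h, [path]))
            return [members for _, members in clusters]

        for i, group in enumerate(cluster_images("sharechat_images")):
            print(f"cluster {i}: {len(group)} images")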

    Extracting locations from sport and exercise-related social media messages using a neural network-based bilingual toponym recognition model

    Get PDF
    Sport and exercise contribute to health and well-being in cities. While previous research has mainly focused on activities at specific locations such as sport facilities, "informal sport" that occurs at arbitrary locations across the city has been largely neglected. Such activities are more challenging to observe, but this challenge may be addressed using data collected from social media platforms, because social media users regularly generate content related to sports and exercise at given locations. This makes it possible to study all sport, including informal sport at arbitrary locations, and thus to better understand sports and exercise-related activities in cities. However, user-generated geographical information available on social media platforms is becoming scarcer and coarser. This places increased emphasis on extracting location information from free-form text content on social media, which is complicated by multilingualism and informal language. To support this effort, this article presents an end-to-end deep learning-based bilingual toponym recognition model for extracting location information from social media content related to sports and exercise. We show that our approach outperforms five state-of-the-art deep learning and machine learning models. We further demonstrate how our model can be deployed in a geoparsing framework to support city planners in promoting healthy and active lifestyles. Peer reviewed.
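
    As a rough illustration of the kind of toponym extraction described above, the sketch below runs an off-the-shelf multilingual named-entity tagger and keeps only location spans. The model name is a placeholder for any multilingual NER checkpoint; it is not the bilingual model from the article.

        # Illustrative sketch only: extract location mentions with a generic
        # multilingual NER pipeline; the model name is a stand-in, not the
        # article's bilingual toponym recognition model.
        from transformers import pipeline

        ner = pipeline(
            "token-classification",
            model="Davlan/bert-base-multilingual-cased-ner-hrl",  # placeholder checkpoint
            aggregation_strategy="simple",  # merge word pieces into whole entity spans
        )

        def extract_toponyms(text):
            """Return (span, score) pairs for entities tagged as locations."""
            return [
                (ent["word"], round(float(ent["score"]), 3))
                for ent in ner(text)
                if ent["entity_group"] == "LOC"
            ]

        print(extract_toponyms("Evening jog around Töölönlahti in Helsinki after work!"))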

    Check Mate: Prioritizing User Generated Multi-Media Content for Fact-Checking

    Full text link
    The volume of content and misinformation on social media is rapidly increasing. There is a need for systems that can support fact-checkers by prioritizing content that needs to be fact-checked. Prior research on prioritizing content for fact-checking has focused on news media articles, predominantly in English. Increasingly, misinformation is found in user-generated content. In this paper we present a novel dataset that can be used to prioritize check-worthy posts from multi-media content in Hindi. It is unique in its 1) focus on user-generated content, 2) language, and 3) accommodation of multi-modality in social media posts. In addition, we also provide metadata for each post, such as the number of shares and likes of the post on ShareChat, a popular Indian social media platform, which allows for correlative analysis around virality and misinformation. The data is accessible on Zenodo (https://zenodo.org/record/4032629) under the Creative Commons Attribution License (CC BY 4.0). Comment: 8 pages, 13 figures, 2 tables.
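
    The per-post engagement metadata makes the virality analysis mentioned above straightforward once the Zenodo record is downloaded. A minimal sketch follows; the file name and column names (shares, likes, check_worthy) are assumptions about the schema, not taken from the dataset documentation.

        # Assumed schema: columns named shares, likes and check_worthy; the actual
        # Zenodo record may use different names.
        import pandas as pd

        posts = pd.read_csv("sharechat_factcheck_posts.csv")

        # Rank correlation between engagement signals and the check-worthiness label.
        for metric in ["shares", "likes"]:
            rho = posts[metric].corr(posts["check_worthy"], method="spearman")
            print(f"{metric} vs check-worthiness: Spearman rho = {rho:.2f}")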

    State-of-the-art generalisation research in NLP: a taxonomy and review

    Get PDF
    The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any common standards to evaluate it. In this paper, we aim to lay the groundwork to improve both of these issues. We present a taxonomy for characterising and understanding generalisation research in NLP, we use that taxonomy to present a comprehensive map of published generalisation studies, and we make recommendations for which areas might deserve attention in the future. Our taxonomy is based on an extensive literature review of generalisation research, and contains five axes along which studies can differ: their main motivation, the type of generalisation they aim to solve, the type of data shift they consider, the source by which this data shift is obtained, and the locus of the shift within the modelling pipeline. We use our taxonomy to classify over 400 previous papers that test generalisation, for a total of more than 600 individual experiments. Considering the results of this review, we present an in-depth analysis of the current state of generalisation research in NLP, and make recommendations for the future. Along with this paper, we release a webpage where the results of our review can be dynamically explored, and which we intend to update as new NLP generalisation studies are published. With this work, we aim to make steps towards making state-of-the-art generalisation testing the new status quo in NLP. Comment: 35 pages of content + 53 pages of references.
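
    The five axes named above lend themselves to a simple record structure for annotating individual experiments. The sketch below uses only the axis names from the abstract; the example values are illustrative, since the taxonomy's full value sets are defined in the paper itself.

        # Sketch: one generalisation experiment described along the taxonomy's five axes.
        # The axis names come from the abstract; the example values are illustrative only.
        from dataclasses import dataclass, asdict

        @dataclass
        class GeneralisationExperiment:
            motivation: str            # why generalisation is tested, e.g. "practical"
            generalisation_type: str   # what kind, e.g. "cross-lingual"
            shift_type: str            # which data shift, e.g. "covariate"
            shift_source: str          # how the shift is obtained, e.g. "naturally occurring"
            shift_locus: str           # where in the pipeline, e.g. "train-test"

        exp = GeneralisationExperiment(
            motivation="practical",
            generalisation_type="cross-lingual",
            shift_type="covariate",
            shift_source="naturally occurring",
            shift_locus="train-test",
        )
        print(asdict(exp))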

    A Taxonomy of Arts Interventions for People With Dementia

    Get PDF
    Background and Objectives: The current evidence base for the arts and dementia has several limitations relating to the description, explanation, communication, and simplification of arts interventions. Research addressing these challenges must be multidisciplinary, taking account of humanities and science perspectives. Consequently, this research aimed to produce a taxonomy, or classification, of arts interventions for people with dementia as a contribution to this growing field. Research Design and Methods: This research was underpinned by taxonomy and realist methodology. Taxonomy, the science of classification, produces a common language to name, define, and describe the world around us. Realist theory explains how interventions "work" and produce their effects. The main findings in this paper were generated from a case study and a Delphi study. Results: An arts and dementia taxonomy of 12 dimensions was developed: Art Form, Artistic elements, Artistic focus, Artistic materials, Arts activity, Arts approaches, Arts facilitators, Arts location, Competencies, Complementary arts, Intervention context, and Principles. Discussion and Implications: Arts interventions can be classified according to their contexts, mechanisms, and outcomes. A range of stakeholders could benefit from the taxonomy, including people with dementia, artists, practitioners, carers, care staff, funders, commissioners, researchers, and academics. Language relating to the arts and dementia can be adapted depending on the audience. This is a foundational model requiring further development within the arts and dementia community.

    Wikipedia and Westminster: Quality and Dynamics of Wikipedia Pages about UK Politicians

    Full text link
    Wikipedia is a major source of information providing a large variety of content online, trusted by readers from around the world. Readers go to Wikipedia to get reliable information about different subjects, one of the most popular being living people, and especially politicians. While a lot is known about the general usage and information consumption on Wikipedia, less is known about the life-cycle and quality of Wikipedia articles in the context of politics. The aim of this study is to quantify and qualify content production and consumption for articles about politicians, with a specific focus on UK Members of Parliament (MPs). First, we analyze spatio-temporal patterns of readers' and editors' engagement with MPs' Wikipedia pages, finding huge peaks of attention during election times, related to signs of engagement on other social media (e.g. Twitter). Second, we quantify editors' polarisation and find that most editors specialize in a specific party and choose specific news outlets as references. Finally, we observe that the average citation quality is relatively high, with statements on 'Early life and career' missing citations most often (18%). Comment: A preprint of a publication accepted at the 31st ACM Conference on Hypertext and Social Media (HT'20).
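
    The attention peaks around elections can be explored for any single MP via the public Wikimedia Pageviews REST API; a minimal sketch follows. The article title and date range are illustrative choices, and this is not the paper's own pipeline.

        # Sketch: daily pageviews for one MP's article from the public Wikimedia
        # Pageviews REST API. Article and date range are illustrative choices.
        import requests

        def daily_views(article, start, end):
            url = (
                "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
                f"en.wikipedia/all-access/user/{article}/daily/{start}/{end}"
            )
            resp = requests.get(url, headers={"User-Agent": "mp-attention-sketch/0.1"})
            resp.raise_for_status()
            return {item["timestamp"][:8]: item["views"] for item in resp.json()["items"]}

        views = daily_views("Boris_Johnson", "20191101", "20191231")
        peak_day = max(views, key=views.get)
        print(f"peak attention on {peak_day}: {views[peak_day]:,} views")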

    METRICC: Harnessing Comparable Corpora for Multilingual Lexicon Development

    Get PDF
    Research on comparable corpora has grown in recent years, bringing about the possibility of developing multilingual lexicons through the exploitation of comparable corpora to create corpus-driven multilingual dictionaries. To date, this issue has not been widely addressed. This paper focuses on the use of the mechanism of collocational networks proposed by Williams (1998) for exploiting comparable corpora. The paper first provides a description of the METRICC project, which is aimed at the automatic creation of comparable corpora, and describes one of the crawlers developed for comparable corpora building; it then discusses the power of collocational networks for multilingual corpus-driven dictionary development.
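
    As a rough illustration of the collocational-network idea, the sketch below grows a small network from a seed word by repeatedly linking each node to its strongest co-occurring neighbours (a crude PMI over a fixed window). This is a simplified reading of Williams-style collocational networks, not the METRICC implementation.

        # Simplified sketch: grow a collocational network from a seed word using an
        # approximate PMI score over a fixed co-occurrence window. Not the METRICC
        # implementation; window size and expansion depth are arbitrary.
        from collections import Counter
        from math import log2

        def collocation_network(tokens, seed, top_n=3, depth=2, window=4):
            word_freq = Counter(tokens)
            pair_freq = Counter()
            for i, w in enumerate(tokens):
                for v in tokens[i + 1 : i + 1 + window]:
                    pair_freq[tuple(sorted((w, v)))] += 1

            def pmi(a, b):
                pair = pair_freq[tuple(sorted((a, b)))]
                if pair == 0:
                    return float("-inf")
                # crude PMI estimate; the normalisation is approximate
                return log2(pair * len(tokens) / (word_freq[a] * word_freq[b]))

            edges, frontier, seen = set(), {seed}, {seed}
            for _ in range(depth):
                next_frontier = set()
                for node in frontier:
                    scores = {w: pmi(node, w) for w in word_freq if w != node}
                    for w in sorted(scores, key=scores.get, reverse=True)[:top_n]:
                        edges.add((node, w))
                        if w not in seen:
                            seen.add(w)
                            next_frontier.add(w)
                frontier = next_frontier
            return edges

        tokens = "the corpus driven dictionary links corpus data to dictionary entries".split()
        print(collocation_network(tokens, seed="corpus", top_n=2))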

    The Computer Science Ontology: A Comprehensive Automatically-Generated Taxonomy of Research Areas

    Get PDF
    Ontologies of research areas are important tools for characterising, exploring, and analysing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and the last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 14K topics and 162K semantic relationships. It was created by applying the Klink-2 algorithm on a very large dataset of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO, we have also released the CSO Classifier, a tool for automatically classifying research papers, and the CSO Portal, a web application that enables users to download, explore, and provide granular feedback on CSO. Users can use the portal to navigate and visualise sections of the ontology, rate topics and relationships, and suggest missing ones. The portal will support the publication of and access to regular new releases of CSO, with the aim of providing a comprehensive resource to the various research communities engaged with scholarly data
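
    To give a feel for what ontology-based tagging looks like, the sketch below matches a downloaded list of CSO topic labels against a paper's title and abstract. The CSV file name is hypothetical, and this naive string matching is far weaker than the released CSO Classifier, which combines syntactic and semantic modules.

        # Naive sketch of ontology tagging: exact label matching against a list of
        # CSO topic labels. The CSV file name is hypothetical; the real CSO
        # Classifier is considerably more sophisticated.
        import csv
        import re

        def load_topic_labels(path="cso_topic_labels.csv"):
            with open(path, newline="", encoding="utf-8") as f:
                return {row[0].strip().lower() for row in csv.reader(f) if row}

        def tag_paper(title, abstract, labels):
            text = f"{title} {abstract}".lower()
            return sorted(
                label for label in labels
                if re.search(rf"\b{re.escape(label)}\b", text)
            )

        labels = load_topic_labels()
        print(tag_paper(
            "Neural topic models for scholarly data",
            "We classify research publications using ontologies and word embeddings.",
            labels,
        ))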

    Combating Misinformation on Social Media by Exploiting Post and User-level Information

    Get PDF
    Misinformation on social media has a far-reaching negative impact on the public and society. Given the large number of real-time posts on social media, traditional manual methods of misinformation detection are not viable. Therefore, computational (i.e., data-driven) approaches have been proposed to combat online misinformation. Previous work on computational misinformation analysis has mainly focused on employing natural language processing (NLP) techniques to develop misinformation detection systems at the post level (e.g., using text and propagation networks). However, it is also important to exploit information at the user level in social media, as users play a significant role (e.g., posting, diffusing, or refuting content) in spreading misinformation. The main aim of this thesis is to: (i) develop novel methods for analysing the behaviour of users who are likely to share or refute misinformation in social media; and (ii) predict and characterise unreliable stories with high popularity in social media. To this end, we first highlight the limitations of the evaluation protocol in popular post-level rumour detection benchmarks and propose to evaluate such systems using chronological splits (i.e., considering temporal concept drift). On the user level, we introduce two novel tasks: (i) detecting, at an early stage, Twitter users who are likely to share misinformation before they actually do so; and (ii) identifying and characterising active citizens who refute misinformation in social media. Finally, we develop a new dataset to enable the study of predicting the future popularity (e.g., number of likes, replies, retweets) of false rumours on Weibo.
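
    The chronological evaluation protocol argued for above is easy to state concretely: instead of a random split, train on the earliest posts and test on the most recent ones. A minimal sketch follows; the file and column names are hypothetical.

        # Sketch of a chronological (time-aware) train/test split instead of a random
        # split, so that evaluation respects temporal concept drift. Column names
        # (timestamp, text, label) are hypothetical.
        import pandas as pd

        def chronological_split(df, test_fraction=0.2):
            """Train on the earliest posts, test on the most recent ones."""
            df = df.sort_values("timestamp").reset_index(drop=True)
            cutoff = int(len(df) * (1 - test_fraction))
            return df.iloc[:cutoff], df.iloc[cutoff:]

        posts = pd.read_csv("rumour_posts.csv", parse_dates=["timestamp"])
        train, test = chronological_split(posts)
        print(f"train up to {train['timestamp'].max()}, test from {test['timestamp'].min()}")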