
    Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

    Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms the current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts. In a large experiment over multilingual heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate estimates of populations and make a significant step towards representative social sensing in downstream applications with multilingual social media. Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web Conference (WWW '19).
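    The post-stratification step can be illustrated with a toy example. The sketch below (plain Python, not the paper's code) assumes hypothetical inferred joint counts per age-gender cell, hypothetical census counts, and a made-up per-cell quantity; it contrasts the naive sample mean with a census-weighted (post-stratified) estimate. The paper's actual correction uses multilevel regression to estimate inclusion probabilities; this only shows the basic reweighting idea.

```python
# Illustrative post-stratification sketch (not the paper's multilevel model):
# estimate a population-level rate from a biased social media sample by
# weighting each demographic cell with its census share instead of its sample share.

# Hypothetical joint counts of sampled (inferred) users per demographic cell.
sample_counts = {
    ("18-29", "female"): 4200, ("18-29", "male"): 5100,
    ("30-49", "female"): 2600, ("30-49", "male"): 3300,
    ("50+",   "female"):  700, ("50+",   "male"):  900,
}

# Hypothetical ground-truth census counts for the same cells.
census_counts = {
    ("18-29", "female"): 61000, ("18-29", "male"): 63000,
    ("30-49", "female"): 88000, ("30-49", "male"): 86000,
    ("50+",   "female"): 97000, ("50+",   "male"): 81000,
}

# Hypothetical per-cell sample means of some sensed quantity
# (e.g. share of users posting about a topic).
cell_means = {
    ("18-29", "female"): 0.31, ("18-29", "male"): 0.27,
    ("30-49", "female"): 0.22, ("30-49", "male"): 0.19,
    ("50+",   "female"): 0.12, ("50+",   "male"): 0.10,
}

def poststratified_mean(census_counts, cell_means):
    """Weight each cell mean by its census share instead of its sample share."""
    total_population = sum(census_counts.values())
    return sum(census_counts[c] / total_population * cell_means[c]
               for c in census_counts)

naive = (sum(sample_counts[c] * cell_means[c] for c in sample_counts)
         / sum(sample_counts.values()))
corrected = poststratified_mean(census_counts, cell_means)
print(f"naive sample estimate:    {naive:.3f}")
print(f"post-stratified estimate: {corrected:.3f}")
```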

    Scraping social media photos posted in Kenya and elsewhere to detect and analyze food types

    Monitoring population-level changes in diet could be useful for education and for implementing interventions to improve health. Research has shown that data from social media sources can be used for monitoring dietary behavior. We propose a scrape-by-location methodology to create food image datasets from Instagram posts. We used it to collect 3.56 million images over a period of 20 days in March 2019. We also propose a scrape-by-keywords methodology and used it to scrape ~30,000 images and their captions of 38 Kenyan food types. We publish two datasets of 104,000 and 8,174 image/caption pairs, respectively. With the first dataset, Kenya104K, we train a Kenyan Food Classifier, called KenyanFC, to distinguish Kenyan food from non-food images posted in Kenya. We used the second dataset, KenyanFood13, to train a classifier KenyanFTR, short for Kenyan Food Type Recognizer, to recognize 13 popular food types in Kenya. KenyanFTR is a multimodal deep neural network that identifies the 13 types of Kenyan foods using both images and their corresponding captions. Experiments show that the average top-1 accuracy of KenyanFC is 99% over 10,400 tested Instagram images and that of KenyanFTR is 81% over 8,174 tested data points. Ablation studies show that three of the 13 food types are particularly difficult to categorize based on image content only, and that adding analysis of captions to the image analysis yields a classifier that is 9 percentage points more accurate than a classifier that relies only on images. Our food trend analysis revealed that cakes and roasted meats were the most popular foods in photographs on Instagram in Kenya in March 2019. Accepted manuscript.
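    As a rough illustration of the joint image-plus-caption classification described above, the following PyTorch sketch fuses a small CNN image encoder with a bag-of-embeddings caption encoder and classifies into 13 food types. The layer sizes, vocabulary size, and encoder choices are assumptions for illustration only; this is not the published KenyanFTR architecture.

```python
# A minimal multimodal (image + caption) classifier sketch in PyTorch.
import torch
import torch.nn as nn

NUM_CLASSES = 13       # 13 Kenyan food types
VOCAB_SIZE = 10_000    # hypothetical caption vocabulary size

class ImageCaptionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Small CNN stand-in for the image encoder.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (B, 32)
        )
        # Bag-of-embeddings stand-in for the caption encoder.
        self.caption_encoder = nn.EmbeddingBag(VOCAB_SIZE, 64, mode="mean")
        # Fusion head over the concatenated image and text features.
        self.classifier = nn.Sequential(
            nn.Linear(32 + 64, 128), nn.ReLU(),
            nn.Linear(128, NUM_CLASSES),
        )

    def forward(self, images, caption_token_ids):
        img_feat = self.image_encoder(images)               # (B, 32)
        txt_feat = self.caption_encoder(caption_token_ids)  # (B, 64)
        return self.classifier(torch.cat([img_feat, txt_feat], dim=1))

model = ImageCaptionClassifier()
images = torch.randn(4, 3, 224, 224)              # dummy image batch
captions = torch.randint(0, VOCAB_SIZE, (4, 20))  # dummy caption token ids
logits = model(images, captions)
print(logits.shape)  # torch.Size([4, 13])
```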

    Data science methods for the analysis of controversial social media discussions

    Social media communities like Reddit and Twitter allow users to express their views on topics of their interest, and to engage with other users who may share or oppose these views. This can lead to productive discussions towards a consensus, or to contentious debates, where disagreements frequently arise. Prior work on such settings has primarily focused on identifying notable instances of antisocial behavior such as hate speech and “trolling”, which represent possible threats to the health of a community. These, however, are exceptionally severe phenomena, and do not encompass controversies stemming from user debates, differences of opinion, and off-topic content, all of which can naturally come up in a discussion without going so far as to compromise its development. This dissertation proposes a framework for the systematic analysis of social media discussions that take place in the presence of controversial themes, disagreements, and mixed opinions from participating users. For this, we develop a feature-based model to describe key elements of a discussion, such as its salient topics, the level of activity from users, the sentiments it expresses, and the user feedback it receives. Initially, we build our feature model to characterize adversarial discussions surrounding political campaigns on Twitter, with a focus on the factual and sentimental nature of their topics and the role played by different users involved. We then extend our approach to Reddit discussions, leveraging community feedback signals to define a new notion of controversy and to highlight conversational archetypes that arise from frequent and interesting interaction patterns. We use our feature model to build logistic regression classifiers that can predict future instances of controversy in Reddit communities centered on politics, world news, sports, and personal relationships. Finally, our model also provides the basis for a comparison of different communities in the health domain, where topics and activity vary considerably despite their shared overall focus. In each of these cases, our framework provides insight into how user behavior can shape a community's individual definition of controversy and its overall identity.
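    The controversy-prediction step can be sketched as a standard logistic regression over per-discussion features. The toy example below uses made-up feature names (comment count, unique users, mean sentiment, upvote ratio) and synthetic labels; it only illustrates the kind of feature-based classifier the dissertation trains, not its actual feature set or data.

```python
# Toy feature-based controversy classifier using logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical features per discussion:
# [num_comments, unique_users, mean_sentiment, upvote_ratio]
X = rng.normal(size=(500, 4))
# Synthetic label: discussions with high activity and a low upvote ratio
# are more likely to be marked as controversial in this toy setup.
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", round(clf.score(X_test, y_test), 3))
print("feature coefficients:", clf.coef_.round(2))
```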

    From teaching books to educational videos and vice versa: a cross-media content retrieval experience

    Due to the rapid growth of multimedia data and the diffusion of remote and mixed learning, teaching sessions are becoming increasingly multimodal. To deepen their knowledge of specific topics, learners may be interested in retrieving educational videos that complement the textual content of teaching books. However, retrieving educational videos can be particularly challenging when there is a lack of metadata information. To tackle this issue, this paper explores the joint use of Deep Learning and Natural Language Processing techniques to retrieve cross-media educational resources (i.e., from text snippets to videos and vice versa). It applies NLP techniques to both the audio transcripts of the videos and the text snippets in the books in order to quantify the semantic relationships between pairs of educational resources of different media types. Then, it trains a Deep Learning model on top of the NLP-based features. The probabilities returned by the Deep Learning model are used to rank the candidate resources based on their relevance to a given query. The results achieved on a real collection of educational multimodal data show that the proposed approach performs better than state-of-the-art solutions. Furthermore, a preliminary attempt to apply the same approach to a similar retrieval task (i.e., from text to image and vice versa) has shown promising results.
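    To make the retrieval pipeline concrete, the sketch below ranks a few toy video transcripts against a query text snippet using TF-IDF cosine similarity. In the paper, a Deep Learning model trained on top of NLP-based features produces the relevance scores; the plain similarity here merely stands in for that learned score, and the transcripts are invented.

```python
# Toy cross-media retrieval: rank video transcripts by similarity to a text snippet.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

video_transcripts = {
    "video_1": "in this lecture we derive the quadratic formula step by step",
    "video_2": "overview of the french revolution and its causes",
    "video_3": "solving second degree equations with the quadratic formula",
}
query_snippet = "the quadratic formula solves any second degree equation"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(list(video_transcripts.values()))
query_vec = vectorizer.transform([query_snippet])

# Rank candidate videos by descending similarity to the query snippet.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranking = sorted(zip(video_transcripts, scores), key=lambda pair: -pair[1])
for video_id, score in ranking:
    print(f"{video_id}: {score:.3f}")
```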

    Information consumption on social media: efficiency, divisiveness, and trust

    Over the last decade, the advent of social media has profoundly changed the way people produce and consume information online. On these platforms, users themselves play a role in selecting the sources from which they consume information, overthrowing traditional journalistic gatekeeping. Moreover, advertisers can target users with news stories using users’ personal data. This new model has many advantages: the propagation of news is faster, the number of news sources is large, and the topics covered are diverse. However, in this new model, users are often overloaded with redundant information, and they can get trapped in filter bubbles by consuming divisive and potentially false information. To tackle these concerns, in my thesis, I address the following important questions: (i) How efficient are users at selecting their information sources? We have defined three intuitive notions of users’ efficiency in social media: link, in-flow, and delay efficiency. We use these three measures to assess how good users are at selecting who to follow within the social media system in order to most efficiently acquire information. (ii) How can we break the filter bubbles that users get trapped in? Users on social media sites such as Twitter often get trapped in filter bubbles by being exposed to radical, highly partisan, or divisive information. To prevent users from getting trapped in filter bubbles, we propose an approach to inject diversity into users’ information consumption by identifying non-divisive, yet informative information. (iii) How can we design an efficient framework for fact-checking? The proliferation of false information is a major problem in social media. To counter it, social media platforms typically rely on expert fact-checkers to detect false news. However, human fact-checkers can realistically only cover a tiny fraction of all stories. So, it is important to automatically prioritize and select a small number of stories for humans to fact-check. However, the goals for prioritizing stories for fact-checking are unclear. We identify three desired objectives for prioritizing news for fact-checking. These objectives are based on users’ perception of the truthfulness of stories. Our key finding is that these three objectives are incompatible in practice.
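    The prioritization step can be pictured as a budgeted top-k selection over scored stories, as in the toy sketch below. The stories, scores, and budget are placeholders; which objectives such a score should optimize is exactly what the thesis investigates, and its key finding is that the candidate objectives conflict in practice.

```python
# Toy budgeted fact-checking prioritization: send only the top-k scored
# stories to human fact-checkers.
import heapq

# Hypothetical candidate stories with a precomputed priority score, e.g.
# derived from users' perceived truthfulness and predicted reach.
stories = [
    ("story_1", 0.91), ("story_2", 0.35), ("story_3", 0.78),
    ("story_4", 0.64), ("story_5", 0.12),
]

FACT_CHECK_BUDGET = 2  # number of stories human checkers can handle

selected = heapq.nlargest(FACT_CHECK_BUDGET, stories, key=lambda s: s[1])
print("send to fact-checkers:", [story_id for story_id, _ in selected])
```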

    Enhancing Data Classification Quality of Volunteered Geographic Information

    Geographic data is one of the fundamental components of any Geographic Information System (GIS). Nowadays, GIS has become part of everyday activities, such as searching for a destination, planning a trip, looking for weather information, etc. Without a reliable data source, systems will not provide guaranteed services. In the past, geographic data was collected and processed exclusively by experts and professionals. However, the ubiquity of advanced technology has resulted in the evolution of Volunteered Geographic Information (VGI), where geographic data is collected and produced by the general public. These changes influence the availability of geographic data, as common people can work together to collect geographic data and produce maps. This particular trend is known as collaborative mapping. In collaborative mapping, the general public shares an online platform to collect, manipulate, and update information about geographic features. OpenStreetMap (OSM) is a prominent example of a collaborative mapping project, which aims to produce a free world map editable and accessible by anyone. During the last decade, VGI has expanded based on the power of crowdsourcing. The involvement of the public in data collection raises great concern about the resulting data quality. There exist various perspectives on geographic data quality; this dissertation focuses particularly on the quality of data classification (i.e., thematic accuracy). In professional data collection, data is classified based on quantitative and/or qualitative observations. According to a pre-defined classification model, which is usually constructed by experts, data is assigned to appropriate classes. In contrast, in most collaborative mapping projects data classification is mainly based on individuals' cognition. Through online platforms, contributors collect information about geographic features and transform their perceptions into classified entities. In VGI projects, the contributors mostly have limited experience in geography and cartography. Therefore, the acquired data may have a questionable classification quality. This dissertation investigates the challenges of data classification in VGI-based mapping projects (i.e., collaborative mapping projects). In particular, it lists the challenges relevant to the evolution of VGI as well as to the characteristics of geographic data. Furthermore, this work proposes a guiding approach to enhance the data classification quality in such projects. The proposed approach is based on the following premises: (i) the availability of large amounts of data, which fosters applying machine learning techniques to extract useful knowledge, (ii) utilization of the extracted knowledge to guide contributors to appropriate data classification, (iii) the humanitarian spirit of contributors to provide precise data when they are supported by a guidance system, and (iv) the power of crowdsourcing in data collection as well as in ensuring the data quality. This cumulative dissertation consists of five peer-reviewed publications in international conference proceedings and international journals.
    The publications divide the dissertation into three parts: the first part presents a comprehensive literature review of relevant previous work on VGI quality assurance procedures (Chapter 2), the second part studies the foundations of the approach (Chapters 3-4), and the third part discusses the proposed approach and provides a validation example for implementing the approach (Chapters 5-6). Furthermore, Chapter 1 presents an overview of the research questions and the adopted research methodology, while Chapter 7 concludes the findings and summarizes the contributions. The proposed approach is validated through empirical studies and an implemented web application. The findings reveal the feasibility of the proposed approach. The output shows that applying the proposed approach results in enhanced data classification quality. Furthermore, the research highlights the demand for intuitive data collection and data interpretation approaches adequate to VGI-based mapping projects. An interactive data collection approach is required to guide the contributors toward enhanced data quality, while an intuitive data interpretation approach is needed to derive more precise information from rich VGI resources.
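    The guidance idea can be sketched as a classifier trained on already-classified geographic features that suggests a likely class to a contributor who is about to classify a new one. The example below uses a handful of invented OSM-style tag dictionaries and a simple decision tree; it does not reproduce the dissertation's actual guidance system, features, or models.

```python
# Toy classification-guidance sketch: learn from existing feature tags and
# suggest a class for a newly contributed geographic feature.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: attribute dictionaries of existing features
# and the class chosen for them by experienced mappers.
features = [
    {"building": "yes", "amenity": "school"},
    {"building": "yes", "shop": "bakery"},
    {"highway": "residential", "surface": "asphalt"},
    {"highway": "footway", "surface": "gravel"},
    {"amenity": "school", "name": "yes"},
    {"shop": "bakery", "opening_hours": "yes"},
]
labels = ["school", "shop", "road", "path", "school", "shop"]

vec = DictVectorizer()
X = vec.fit_transform(features)
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Suggest a class for a newly drawn feature with partial attributes.
new_feature = {"building": "yes", "amenity": "school"}
print("suggested class:", clf.predict(vec.transform([new_feature]))[0])
```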

    Methods for improving entity linking and exploiting social media messages across crises

    Entity Linking (EL) is the task of automatically identifying entity mentions in texts and resolving them to a corresponding entity in a reference knowledge base (KB). There is a large number of tools available for different types of documents and domains; however, the entity linking literature has shown that the quality of a tool varies across corpora and depends on specific characteristics of the corpus it is applied to. Moreover, the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real-world applications. In the first part of this thesis I explore an approximation of the difficulty of linking entity mentions and frame it as a supervised classification task. Classifying difficult-to-disambiguate entity mentions can facilitate identifying critical cases as part of a semi-automated system, while detecting latent corpus characteristics that affect entity linking performance. Moreover, despite the large number of entity linking tools proposed over the past years, some tools work better on short mentions while others perform better when there is more contextual information. To this end, I propose a solution that exploits the results of distinct entity linking tools on the same corpus by leveraging their individual strengths on a per-mention basis. The proposed solution proved effective and outperformed the individual entity linking systems employed in a series of experiments. An important component in the majority of entity linking tools is the probability that a mention links to an entity in a reference knowledge base, and the computation of this probability is usually done over a static snapshot of a reference KB. However, an entity's popularity is temporally sensitive and may change due to short-term events. Moreover, these changes might then be reflected in a KB, and EL tools can produce different results for a given mention at different times. I investigated the change in prior probability over time and the overall disambiguation performance using KBs from different time periods. The second part of this thesis is mainly concerned with short texts. Social media has become an integral part of modern society. Twitter, for instance, is one of the most popular social media platforms around the world, enabling people to share their opinions and post short messages about any subject on a daily basis. First, I present an approach to identifying informative messages during catastrophic events using deep learning techniques. Automatically detecting informative messages posted by users during major events can enable professionals involved in crisis management to better estimate damage using only the relevant information posted on social media channels, and to act immediately. I also performed an analysis of Twitter messages posted during the Covid-19 pandemic. I collected 4 million tweets posted in Portuguese since the beginning of the pandemic and provide an analysis of the debate around it. I used topic modeling, sentiment analysis, and hashtag recommendation techniques to provide insights into the online discussion of the Covid-19 pandemic.
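    The per-mention combination of entity linking tools can be sketched as a meta-classifier that, given simple features of a mention, decides which tool's output to trust. Everything in the example below (the two hypothetical tools, the mention features, and the synthetic supervision) is an illustrative assumption rather than the thesis' actual setup.

```python
# Toy per-mention tool selector for combining entity linking systems.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Hypothetical mention features: [mention_length, context_length, candidate_count]
X = rng.normal(size=(300, 3))
# Synthetic supervision: which tool linked the mention correctly on annotated
# data (here, tool_A tends to win on short mentions with rich context).
y = np.where(X[:, 0] - X[:, 1] < 0, "tool_A", "tool_B")

selector = RandomForestClassifier(random_state=0).fit(X, y)

def link(mention_features, tool_outputs):
    """Return the entity proposed by the tool the selector trusts most."""
    chosen = selector.predict([mention_features])[0]
    return chosen, tool_outputs[chosen]

print(link([-0.5, 1.0, 0.2],
           {"tool_A": "dbpedia:Berlin", "tool_B": "dbpedia:Berlin,_NH"}))
```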
