    A Graph-structured Dataset for Wikipedia Research

    Wikipedia is a rich and invaluable source of information. Its central place on the Web makes it a particularly interesting object of study for scientists. Researchers from different domains have used various complex datasets related to Wikipedia to study language, social behavior, knowledge organization, and network theory. Although the dataset is a scientific treasure, its large size hinders pre-processing and can be a challenging obstacle for potential new studies. This issue is particularly acute in scientific domains where researchers may have limited technical and data-processing expertise. On the one hand, Wikipedia dumps are large, which makes parsing and extracting the relevant information cumbersome. On the other hand, the API is straightforward to use but restricted to a relatively small number of requests. The middle ground is the mesoscopic scale, where researchers need a subset of Wikipedia ranging from thousands to hundreds of thousands of pages, yet no efficient solution exists at this scale. In this work, we propose an efficient data structure for requesting and accessing subnetworks of Wikipedia pages and categories. We provide convenient tools for accessing and filtering viewership statistics, or "pagecounts", of Wikipedia web pages. The dataset organization leverages principles of graph databases, allowing rapid and intuitive access to subgraphs of Wikipedia articles and categories. The dataset and deployment guidelines are available on the LTS2 website \url{https://lts2.epfl.ch/Datasets/Wikipedia/}.
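
    As a rough illustration of the access pattern described above (not the LTS2 dataset's actual schema or API), the Python sketch below models articles and categories as a directed graph with networkx and extracts a pagecount-filtered subgraph for one category; the node names, attributes, and helper function are hypothetical.

    import networkx as nx

    G = nx.DiGraph()

    # Hypothetical toy data: two articles, one category, and their relations.
    G.add_node("Category:Graph theory", kind="category")
    G.add_node("Graph (discrete mathematics)", kind="article", pagecounts=12345)
    G.add_node("Adjacency matrix", kind="article", pagecounts=678)
    G.add_edge("Graph (discrete mathematics)", "Category:Graph theory", kind="in_category")
    G.add_edge("Adjacency matrix", "Category:Graph theory", kind="in_category")
    G.add_edge("Graph (discrete mathematics)", "Adjacency matrix", kind="links_to")

    def category_subgraph(graph, category, min_views=0):
        """Induced subgraph of the articles in `category` whose pagecount
        attribute is at least `min_views`."""
        members = [
            u for u, _, d in graph.in_edges(category, data=True)
            if d.get("kind") == "in_category"
            and graph.nodes[u].get("pagecounts", 0) >= min_views
        ]
        return graph.subgraph(members)

    # Articles of the category with at least 1000 recorded views.
    sub = category_subgraph(G, "Category:Graph theory", min_views=1000)
    print(sorted(sub.nodes()))  # ['Graph (discrete mathematics)']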

    Overrepresentation of the Underrepresented: Gender Bias in Wikipedia

    The goal of our research is to determine whether gender bias exists in Wikipedia. Wikipedia is a very large dataset that has been used to train artificial intelligence models. If a dataset used for this purpose is biased, then the artificial intelligence model trained on it will be biased as well and will therefore make biased decisions. For this reason, it is important to examine large datasets for potential biases before they are used in machine learning. Since Wikipedia is ontologically structured, we used graph theory to build a network of all of the website's categories and examined the relationships between men-related and women-related categories using shortest paths, successor intersections, and average betweenness centrality. We found an overexposure of men-related categories: they are far more central in Wikipedia and easier to reach than women-related categories. However, although women-related categories are less central, there are about six times more categories that mention women in the title than men, which we consider overrepresentation. This is most likely because women are treated as an exception in many fields while men are treated as the norm. Our methods can be used to study gender bias in Wikipedia periodically, since its data changes relatively frequently, or to study other biases in Wikipedia or in other network-like datasets.
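
    The sketch below, which uses a hypothetical toy category graph rather than the paper's full Wikipedia network, illustrates the three measures mentioned above with networkx: shortest paths between category pairs, successor intersections, and betweenness centrality averaged over a group of categories.

    import networkx as nx

    # Hypothetical toy fragment of the category network (edges point from
    # a category to its subcategories).
    C = nx.DiGraph()
    C.add_edges_from([
        ("Category:People", "Category:Men"),
        ("Category:People", "Category:Women"),
        ("Category:Men", "Category:Male scientists"),
        ("Category:Women", "Category:Women scientists"),
        ("Category:Scientists", "Category:Male scientists"),
        ("Category:Scientists", "Category:Women scientists"),
    ])

    # Shortest path length between a men-related and a women-related
    # category, ignoring edge direction.
    dist = nx.shortest_path_length(C.to_undirected(),
                                   "Category:Male scientists",
                                   "Category:Women scientists")

    # Successor intersection: subcategories shared by the two categories.
    shared = set(C.successors("Category:Men")) & set(C.successors("Category:Women"))

    # Betweenness centrality, averaged over each group of categories.
    bc = nx.betweenness_centrality(C)
    men_avg = (bc["Category:Men"] + bc["Category:Male scientists"]) / 2
    women_avg = (bc["Category:Women"] + bc["Category:Women scientists"]) / 2

    print(dist, shared, men_avg, women_avg)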

    Constructing Datasets for Multi-hop Reading Comprehension Across Documents

    Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently there exist no resources to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence, effectively performing multi-hop (alias multi-step) inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced, and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that one can integrate information across documents. However, both models struggle to select relevant information, as providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 42.9% compared to human performance at 74.0%, leaving ample room for improvement. Comment: This paper directly corresponds to the TACL version (https://transacl.org/ojs/index.php/tacl/article/view/1325) apart from minor changes in wording, additional footnotes, and appendices.
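
    The Python sketch below illustrates the general idea of inducing a multi-hop instance from thematically linked documents; it is not the authors' exact methodology, and the toy corpus, link structure, and helper function are hypothetical.

    from collections import deque

    def find_support_chain(docs, links, start_doc, answer, max_hops=3):
        """Breadth-first search over `links` (title -> linked titles) from
        `start_doc`; return the first chain of documents whose last text
        mentions `answer`."""
        queue = deque([[start_doc]])
        while queue:
            chain = queue.popleft()
            if answer.lower() in docs[chain[-1]].lower():
                return chain
            if len(chain) > max_hops:
                continue
            for nxt in links.get(chain[-1], []):
                if nxt not in chain:  # avoid revisiting documents
                    queue.append(chain + [nxt])
        return None

    # Hypothetical two-document corpus: the answer appears only after a hop,
    # so answering requires combining evidence across both documents.
    docs = {
        "Hanging Gardens of Mumbai": "The Hanging Gardens are terraced gardens in Mumbai.",
        "Mumbai": "Mumbai is the capital of the Indian state of Maharashtra.",
    }
    links = {"Hanging Gardens of Mumbai": ["Mumbai"]}

    chain = find_support_chain(docs, links, "Hanging Gardens of Mumbai", "Maharashtra")
    print(chain)  # ['Hanging Gardens of Mumbai', 'Mumbai']: a two-hop instance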