Entity extraction from Wikipedia list pages

By Nicolas Heist and Heiko Paulheim

Abstract

When it comes to factual knowledge about a wide range of domains, Wikipedia is often the prime source of information on the web. DBpedia and YAGO, as large cross-domain knowledge graphs, encode a subset of that knowledge by creating an entity for each page in Wikipedia and connecting these entities through edges. It is well known, however, that Wikipedia-based knowledge graphs are far from complete. In particular, because Wikipedia’s policies permit pages only about subjects with a certain popularity, such graphs tend to lack information about less well-known entities. Information about these entities is often available in the encyclopedia, but not represented as an individual page. In this paper, we present a two-phased approach for the extraction of entities from Wikipedia’s list pages, which have proven to be a valuable source of information. In the first phase, we build a large taxonomy from categories and list pages with DBpedia as a backbone. Using distant supervision, we extract training data for identifying new entities in list pages, which we use in the second phase to train a classification model. With this approach, we extract over 700k new entities and extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
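
The distant-supervision step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `ListEntry`, `distant_labels`, and `known_types`, the toy data, and the labeling rule itself (an entry is positive if the graph already assigns its linked page the list's type, negative if the page is known with other types only, and unlabeled otherwise) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ListEntry:
    """Hypothetical representation of one item on a list page."""
    text: str                   # surface text of the list item
    linked_page: Optional[str]  # title of the page the item links to, if any


def distant_labels(entries, list_type, known_types):
    """Label list entries using only knowledge already in the graph.

    known_types maps a page title to the set of types the graph holds for it.
    Returns (entry, label) pairs; label is True, False, or None (unlabeled).
    """
    labeled = []
    for entry in entries:
        types = known_types.get(entry.linked_page) if entry.linked_page else None
        if types is None:
            label = None  # not in the graph: a candidate new entity
        else:
            label = list_type in types  # agreement with the list's type
        labeled.append((entry, label))
    return labeled


# Toy data (invented for this sketch, not taken from DBpedia).
known_types = {
    "Alice Example": {"Person", "Writer"},
    "Exampletown": {"Place"},
}
entries = [
    ListEntry("Alice Example", "Alice Example"),
    ListEntry("Exampletown", "Exampletown"),
    ListEntry("Bob Sample", None),
]
labels = [label for _, label in distant_labels(entries, "Writer", known_types)]
```

In a setup like this, the labeled pairs would serve as training data for a classifier, and the unlabeled entries would be the ones that classifier is later asked to judge.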

Topics: 004 Computer science
Publisher: Springer Science and Business Media LLC
Year: 2020
DOI identifier: 10.1007/978-3-030-49461-2_19
OAI identifier: oai:ub-madoc.bib.uni-mannheim.de:55151