10 research outputs found

    Open Knowledge Enrichment for Long-tail Entities

    Full text link
    Knowledge bases (KBs) have gradually become a valuable asset for many AI applications. While many current KBs are quite large, they are widely acknowledged as incomplete, especially lacking facts of long-tail entities, e.g., less famous persons. Existing approaches enrich KBs mainly by completing missing links or filling missing values. However, they tackle only part of the enrichment problem and lack specific considerations regarding long-tail entities. In this paper, we propose a full-fledged approach to knowledge enrichment, which predicts missing properties and infers true facts of long-tail entities from the open Web. Prior knowledge from popular entities is leveraged to improve every enrichment step. Our experiments on synthetic and real-world datasets and comparison with related work demonstrate the feasibility and superiority of the approach. Comment: Accepted by the 29th International World Wide Web Conference (WWW 2020).
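    The abstract only outlines the enrichment flow (predict which properties a long-tail entity is missing, gather candidate facts from the open Web, keep only verified facts), so the following is a minimal sketch of that flow under assumed data structures and thresholds, not the authors' actual system; the extraction and verification components are placeholders supplied by the caller.

```python
from collections import Counter

def predict_missing_properties(entity_props, similar_popular_entities, min_support=0.5):
    """Propose properties the long-tail entity likely lacks, judged by how often
    similar, well-described ("popular") entities carry them (hypothetical heuristic)."""
    if not similar_popular_entities:
        return []
    counts = Counter(p for props in similar_popular_entities for p in set(props))
    n = len(similar_popular_entities)
    return [p for p, c in counts.items()
            if c / n >= min_support and p not in entity_props]

def enrich_entity(entity_props, similar_popular_entities, extract_candidates, verify):
    """Minimal enrichment sketch: predict missing properties, pull candidate values
    from open-Web extractions, and keep only those the verifier scores highly."""
    new_facts = {}
    for prop in predict_missing_properties(entity_props, similar_popular_entities):
        candidates = extract_candidates(prop)                  # caller-supplied web extraction
        accepted = [v for v in candidates if verify(prop, v) >= 0.8]  # assumed threshold
        if accepted:
            new_facts[prop] = accepted
    return new_facts
```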

    Enriching and validating geographic information on the web

    Get PDF
    The continuous growth of available data on the World Wide Web has led to an unprecedented amount of available information. However, the enormous variance in data quality and in the trustworthiness of information sources impairs the great potential of this information. This observation especially applies to geographic information on the Web, i.e., information describing entities that are located on the Earth’s surface. With the advent of mobile devices, the impact of geographic Web information on our everyday life has substantially grown. Mobile devices have also enabled the creation of novel data sources such as OpenStreetMap (OSM), a collaborative crowd-sourced map providing open cartographic information. Today, we use geographic information in many applications, including routing, location recommendation, and geographic question answering. The processing of geographic Web information yields unique challenges. First, the descriptions of geographic entities on the Web are typically not validated. Since not all Web information sources are trustworthy, the correctness of some geographic Web entities is questionable. Second, geographic information sources on the Web are typically isolated from each other. The missing integration of information sources hinders the efficient use of geographic Web information for many applications. Third, the description of geographic entities is typically incomplete. Depending on the application, missing information is a decisive criterion for (not) using a particular data source. Due to the large scale of the Web, the manual correction of these problems is usually not feasible, such that automated approaches are required. In this thesis, we tackle these challenges from three different angles. (i) Validation of geographic Web information: We validate geographic Web information by detecting vandalism in OpenStreetMap, for instance, the replacement of a street name with an advertisement. To this end, we present the OVID model for automated vandalism detection in OpenStreetMap. (ii) Enrichment of geographic Web information through integration: We integrate OpenStreetMap with other geographic Web information sources, namely knowledge graphs, by identifying entries corresponding to the same real-world entities in both data sources. We present the OSM2KG model for automated identity link discovery between OSM and knowledge graphs. (iii) Enrichment of missing information in geographic Web information: We consider semantic annotations of geographic entities on Web pages as an additional data source. We exploit existing annotations of categorical properties of Web entities as training data to enrich missing categorical properties in geographic Web information. For all of the proposed models, we conduct extensive evaluations on real-world datasets. Our experimental results confirm that the proposed solutions reliably outperform existing baselines. Furthermore, we demonstrate the utility of geographic Web information in two application scenarios. (i) Corpus of geographic entity embeddings: We introduce the GeoVectors corpus, a linked open dataset of ready-to-use embeddings of geographic entities. With GeoVectors, we substantially lower the burden of using geographic data in machine learning applications. (ii) Application to event impact prediction: We employ several geographic Web information sources to predict the impact of public events on road traffic. To this end, we use cartographic, event, and event venue information from the Web.
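    As a toy illustration of what identity link discovery between OSM and a knowledge graph involves (a far simpler heuristic than the OSM2KG model summarized above), one can combine geographic distance with name similarity. The data layout, thresholds, and scoring rule below are assumptions for illustration only.

```python
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def link_osm_to_kg(osm_node, kg_entities, max_km=0.5, min_name_sim=0.8):
    """Return the best-matching knowledge graph entity for an OSM node, or None.
    osm_node and kg_entities use dicts with 'name', 'lat', 'lon' keys (assumed schema)."""
    best, best_score = None, 0.0
    for ent in kg_entities:
        if haversine_km(osm_node["lat"], osm_node["lon"], ent["lat"], ent["lon"]) > max_km:
            continue  # too far apart to describe the same real-world place
        sim = SequenceMatcher(None, osm_node["name"].lower(), ent["name"].lower()).ratio()
        if sim >= min_name_sim and sim > best_score:
            best, best_score = ent, sim
    return best
```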

    Augmenting cross-domain knowledge bases using web tables

    Get PDF
    Cross-domain knowledge bases are increasingly used for a large variety of applications. As the usefulness of a knowledge base for many of these applications increases with its completeness, augmenting knowledge bases with new knowledge is an important task. A source for this new knowledge could be web tables, which are relational HTML tables extracted from the Web. This thesis researches data integration methods for cross-domain knowledge base augmentation from web tables. Existing methods have focused on the task of slot filling for static data. We research methods that additionally enable augmentation in the form of slot filling for time-dependent data and entity expansion. When augmenting knowledge bases using time-dependent web table data, we require time-aware fusion methods. They identify, from a set of conflicting web table values, the one that is valid for a given temporal scope. A primary concern of time-aware fusion is therefore the estimation of temporal scope annotations, which web table data lacks. We introduce two time-aware fusion approaches. In the first, we extract timestamps from the table and its context to exploit as temporal scopes, additionally introducing approaches to reduce the sparsity and noisiness of these timestamps. We introduce a second time-aware fusion method that exploits a temporal knowledge base to propagate temporal scopes to web table data, reducing the dependence on noisy and sparse timestamps. Entity expansion enriches a knowledge base with previously unknown long-tail entities. It is a task that, to our knowledge, has not been researched before. We introduce the Long-Tail Entity Extraction Pipeline, the first system that can perform entity expansion from web table data. The pipeline works by employing identity resolution twice: once to disambiguate between entity occurrences within web tables, and once between entities created from web tables and existing entities in the knowledge base. In addition to identifying new long-tail entities, the pipeline also creates their descriptions according to the knowledge base schema. By running the pipeline on a large-scale web table corpus, we profile the potential of web tables for the task of entity expansion. We find that, given certain classes, we can enrich a knowledge base with tens or even hundreds of thousands of new entities and corresponding facts. Finally, we introduce a weak supervision approach for long-tail entity extraction, in which supervision in the form of a large number of manually labeled matching and non-matching pairs is substituted with a small set of bold matching rules built using the knowledge base schema. Using this, we can reduce the supervision effort required to train our pipeline and enable cross-domain entity expansion at web scale. In the context of this research, we created and published two datasets. The Time-Dependent Ground Truth contains time-dependent knowledge with more than one million temporal facts and corresponding temporal scope annotations. It could potentially be employed for a large variety of tasks that consider the temporal aspect of data. We also built the Web Tables for Long-Tail Entity Extraction gold standard, the first benchmark for the task of entity expansion from web tables.
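    To make the idea of time-aware fusion concrete, the toy function below selects, from conflicting web-table values that each carry a temporal scope (extracted from the table context or propagated from a temporal knowledge base), the value whose scope lies closest to the queried date. It is a simplified sketch under an assumed representation, not the fusion methods developed in the thesis.

```python
from datetime import date

def fuse_time_aware(candidates, query_date):
    """Pick the value whose temporal scope lies closest to query_date.

    candidates: list of (value, scope_date) pairs, e.g. values mined from web tables
    together with their estimated temporal scopes (assumed representation).
    """
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs((c[1] - query_date).days))[0]

# Example: conflicting population figures for one city, each with a temporal scope.
populations = [(305_000, date(2005, 1, 1)), (312_000, date(2012, 1, 1)), (321_000, date(2019, 1, 1))]
print(fuse_time_aware(populations, date(2013, 6, 1)))  # -> 312000
```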

    Data Science and Knowledge Discovery

    Get PDF
    Data Science (DS) is gaining significant importance in decision processes because it combines various areas, including Computer Science, Machine Learning, Mathematics and Statistics, domain/business knowledge, software development, and traditional research. In business, applying DS allows scientific methods, processes, algorithms, and systems to be used to extract knowledge and insights from structured and unstructured data in support of decision making. After the data has been collected, it is crucial to discover the knowledge it contains. In this step, Knowledge Discovery (KD) tasks are used to create knowledge from structured and unstructured sources (e.g., text, data, and images). The output needs to be in a readable and interpretable format, and it must represent knowledge in a manner that facilitates inferencing. KD is applied in several areas, such as education, health, accounting, energy, and public administration. This book includes fourteen excellent articles that discuss this trending topic and present innovative solutions, showing the importance of Data Science and Knowledge Discovery to researchers, managers, industry, society, and other communities. The chapters address topics such as data mining, deep learning, data visualization and analytics, semantic data, geospatial and spatio-temporal data, data augmentation, and text mining.

    Entwicklung eines Frameworks unter Verwendung von Kontextinformationen und kollektiver Intelligenz

    Get PDF
    The importance of data, information, and knowledge as a factor in economic and societal activity is enormous and continues to grow. Their exchange characterizes and drives globalization and digital transformation far more than the exchange of goods. This development rests primarily on the enormous advances in information and communication technology, which now make data and information available at virtually any time and in any place. However, the huge and rapidly growing amount of available data and information leads to an overload, in the sense that it becomes ever harder to find or identify the data and information that are actually relevant. When using software systems in particular, users frequently have a pressing, situation- or context-dependent need for information, not least because the systems are changed and evolved in ever shorter cycles. The corresponding search for information to meet this need is, however, often time-consuming and frequently carried out in a suboptimal way. In his work, Michael Beul investigates how the search for and provision of relevant information can be facilitated or automated in order to enable more effective use of application systems. He develops a framework that, in particular by using concepts of collective intelligence, enables context-dependent real-time information retrieval for users of software-intensive systems across various application domains.

    Participative Urban Health and Healthy Aging in the Age of AI

    Get PDF
    This open access book constitutes the refereed proceedings of the 18th International Conference on Smart Homes and Health Telematics, ICOST 2022, held in Paris, France, in June 2022. The 15 full papers and 10 short papers presented in this volume were carefully reviewed and selected from 33 submissions. They cover topics such as the design, development, deployment, and evaluation of AI for health, smart urban environments, assistive technologies, chronic disease management, and coaching and health telematics systems.

    Urban Informatics

    Get PDF
    This open access book is the first to systematically introduce the principles of urban informatics and its application to every aspect of the city that involves its functioning, control, management, and future planning. It introduces new models and tools being developed to understand and implement these technologies that enable cities to function more efficiently – to become ‘smart’ and ‘sustainable’. The smart city has quickly emerged as computers have become ever smaller to the point where they can be embedded into the very fabric of the city, as well as being central to new ways in which the population can communicate and act. When cities are wired in this way, they have the potential to become sentient and responsive, generating massive streams of ‘big’ data in real time as well as providing immense opportunities for extracting new forms of urban data through crowdsourcing. This book offers a comprehensive review of the methods that form the core of urban informatics from various kinds of urban remote sensing to new approaches to machine learning and statistical modelling. It provides a detailed technical introduction to the wide array of tools information scientists need to develop the key urban analytics that are fundamental to learning about the smart city, and it outlines ways in which these tools can be used to inform design and policy so that cities can become more efficient with a greater concern for environment and equity

    Hacking the web 2.0: user agency and the role of hackers as computational mediators

    Get PDF
    This thesis studies the contested reconfigurations of computational agency within the domain of practices and affordances involved in the use of the Internet in everyday life (here labelled lifeworld Internet), through the transition of the Internet to a much deeper reliance on computation than at any previous stage. Computational agency is here considered not only in terms of capacity to act enabled (or restrained) by the computational layer but also as the recursive capacity to reconfigure the computational layer itself, therefore in turn affecting one’s own and others’ computational agency. My research is based on multisited and diachronic ethnographic fieldwork: an initial (2005–2007) autoethnographic case study focused on the negotiations of computational agency within the development of a Web 2.0 application, later (2010–2011) fieldwork interviews focused on processes through which users make sense of the increasing pervasiveness of the Internet and of computation in everyday life, and a review (2010–2015) of hacker discourses focused on tracing the processes through which hackers constitute themselves as a recursive public able to inscribe counter–narratives in the development of technical form and to reproduce itself as a public of computational mediators with capacity to operate at the intersection of the technical and the social. By grounding my enquiry in the specific context of the lifeworlds of individual end users but by following computational agency through global hacker discourses, my research explores the role of computation, computational capacity and computational mediators in the processes through which users ‘hack’ their everyday Internet environments for practical utility, or develop independent alternatives to centralized Internet services as part of their contestation of values inscribed in the materiality of mainstream Internet