10 research outputs found

    Open Knowledge Enrichment for Long-tail Entities

    Full text link
    Knowledge bases (KBs) have gradually become a valuable asset for many AI applications. While many current KBs are quite large, they are widely acknowledged as incomplete, especially lacking facts of long-tail entities, e.g., less famous persons. Existing approaches enrich KBs mainly by completing missing links or filling missing values. However, they tackle only part of the enrichment problem and lack specific considerations regarding long-tail entities. In this paper, we propose a full-fledged approach to knowledge enrichment, which predicts missing properties and infers true facts of long-tail entities from the open Web. Prior knowledge from popular entities is leveraged to improve every enrichment step. Our experiments on synthetic and real-world datasets and comparison with related work demonstrate the feasibility and superiority of the approach. Comment: Accepted by the 29th International World Wide Web Conference (WWW 2020).
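    The abstract only outlines the enrichment flow (predict which properties a long-tail entity is missing, gather candidate facts from the open Web, keep only verified facts), so the following is a minimal sketch of that flow under assumed data structures and thresholds, not the authors' actual system; the extraction and verification components are placeholders supplied by the caller.

```python
from collections import Counter

def predict_missing_properties(entity_props, similar_popular_entities, min_support=0.5):
    """Propose properties the long-tail entity likely lacks, judged by how often
    similar, well-described ("popular") entities carry them (hypothetical heuristic)."""
    if not similar_popular_entities:
        return []
    counts = Counter(p for props in similar_popular_entities for p in set(props))
    n = len(similar_popular_entities)
    return [p for p, c in counts.items()
            if c / n >= min_support and p not in entity_props]

def enrich_entity(entity_props, similar_popular_entities, extract_candidates, verify):
    """Minimal enrichment sketch: predict missing properties, pull candidate values
    from open-Web extractions, and keep only those the verifier scores highly."""
    new_facts = {}
    for prop in predict_missing_properties(entity_props, similar_popular_entities):
        candidates = extract_candidates(prop)                  # caller-supplied web extraction
        accepted = [v for v in candidates if verify(prop, v) >= 0.8]  # assumed threshold
        if accepted:
            new_facts[prop] = accepted
    return new_facts
```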

    Enriching and validating geographic information on the web

    Get PDF
    The continuous growth of available data on the World Wide Web has led to an unprecedented amount of available information. However, the enormous variance in data quality and in the trustworthiness of information sources impairs the great potential of this information. This observation especially applies to geographic information on the Web, i.e., information describing entities that are located on the Earth’s surface. With the advent of mobile devices, the impact of geographic Web information on our everyday life has substantially grown. Mobile devices have also enabled the creation of novel data sources such as OpenStreetMap (OSM), a collaborative crowd-sourced map providing open cartographic information. Today, we use geographic information in many applications, including routing, location recommendation, and geographic question answering. The processing of geographic Web information yields unique challenges. First, the descriptions of geographic entities on the Web are typically not validated. Since not all Web information sources are trustworthy, the correctness of some geographic Web entities is questionable. Second, geographic information sources on the Web are typically isolated from each other. The missing integration of information sources hinders the efficient use of geographic Web information for many applications. Third, the description of geographic entities is typically incomplete. Depending on the application, missing information is a decisive criterion for (not) using a particular data source. Due to the large scale of the Web, the manual correction of these problems is usually not feasible, such that automated approaches are required. In this thesis, we tackle these challenges from three different angles. (i) Validation of geographic Web information: We validate geographic Web information by detecting vandalism in OpenStreetMap, for instance, the replacement of a street name with an advertisement. To this end, we present the OVID model for automated vandalism detection in OpenStreetMap. (ii) Enrichment of geographic Web information through integration: We integrate OpenStreetMap with other geographic Web information sources, namely knowledge graphs, by identifying entries corresponding to the same real-world entities in both data sources. We present the OSM2KG model for automated identity link discovery between OSM and knowledge graphs. (iii) Enrichment of missing information in geographic Web information: We consider semantic annotations of geographic entities on Web pages as an additional data source. We exploit existing annotations of categorical properties of Web entities as training data to enrich missing categorical properties in geographic Web information. For all of the proposed models, we conduct extensive evaluations on real-world datasets. Our experimental results confirm that the proposed solutions reliably outperform existing baselines. Furthermore, we demonstrate the utility of geographic Web information in two application scenarios. (i) Corpus of geographic entity embeddings: We introduce the GeoVectors corpus, a linked open dataset of ready-to-use embeddings of geographic entities. With GeoVectors, we substantially lower the burden of using geographic data in machine learning applications. (ii) Application to event impact prediction: We employ several geographic Web information sources to predict the impact of public events on road traffic. To this end, we use cartographic, event, and event venue information from the Web.
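    As a toy illustration of what identity link discovery between OSM and a knowledge graph involves (a far simpler heuristic than the OSM2KG model summarized above), one can combine geographic distance with name similarity. The data layout, thresholds, and scoring rule below are assumptions for illustration only.

```python
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def link_osm_to_kg(osm_node, kg_entities, max_km=0.5, min_name_sim=0.8):
    """Return the best-matching knowledge graph entity for an OSM node, or None.
    osm_node and kg_entities use dicts with 'name', 'lat', 'lon' keys (assumed schema)."""
    best, best_score = None, 0.0
    for ent in kg_entities:
        if haversine_km(osm_node["lat"], osm_node["lon"], ent["lat"], ent["lon"]) > max_km:
            continue  # too far apart to describe the same real-world place
        sim = SequenceMatcher(None, osm_node["name"].lower(), ent["name"].lower()).ratio()
        if sim >= min_name_sim and sim > best_score:
            best, best_score = ent, sim
    return best
```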

    Augmenting cross-domain knowledge bases using web tables

    Get PDF
    Cross-domain knowledge bases are increasingly used for a large variety of applications. As the usefulness of a knowledge base for many of these applications increases with its completeness, augmenting knowledge bases with new knowledge is an important task. A source for this new knowledge could be web tables, which are relational HTML tables extracted from the Web. This thesis researches data integration methods for cross-domain knowledge base augmentation from web tables. Existing methods have focused on the task of slot filling for static data. We research methods that additionally enable augmentation in the form of slot filling for time-dependent data and entity expansion. When augmenting knowledge bases using time-dependent web table data, we require time-aware fusion methods. They identify, from a set of conflicting web table values, the one that is valid for a given temporal scope. A primary concern of time-aware fusion is therefore the estimation of temporal scope annotations, which web table data lacks. We introduce two time-aware fusion approaches. In the first, we extract timestamps from the table and its context to exploit as temporal scopes, additionally introducing approaches to reduce the sparsity and noisiness of these timestamps. We introduce a second time-aware fusion method that exploits a temporal knowledge base to propagate temporal scopes to web table data, reducing the dependence on noisy and sparse timestamps. Entity expansion enriches a knowledge base with previously unknown long-tail entities. It is a task that, to our knowledge, has not been researched before. We introduce the Long-Tail Entity Extraction Pipeline, the first system that can perform entity expansion from web table data. The pipeline works by employing identity resolution twice: once to disambiguate between entity occurrences within web tables, and once between entities created from web tables and existing entities in the knowledge base. In addition to identifying new long-tail entities, the pipeline also creates their descriptions according to the knowledge base schema. By running the pipeline on a large-scale web table corpus, we profile the potential of web tables for the task of entity expansion. We find that, given certain classes, we can enrich a knowledge base with tens or even hundreds of thousands of new entities and corresponding facts. Finally, we introduce a weak supervision approach for long-tail entity extraction, in which supervision in the form of a large number of manually labeled matching and non-matching pairs is substituted with a small set of bold matching rules built using the knowledge base schema. Using this, we can reduce the supervision effort required to train our pipeline and enable cross-domain entity expansion at web scale. In the context of this research, we created and published two datasets. The Time-Dependent Ground Truth contains time-dependent knowledge with more than one million temporal facts and corresponding temporal scope annotations. It could potentially be employed for a large variety of tasks that consider the temporal aspect of data. We also built the Web Tables for Long-Tail Entity Extraction gold standard, the first benchmark for the task of entity expansion from web tables.
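    To make the idea of time-aware fusion concrete, the toy function below selects, from conflicting web-table values that each carry a temporal scope (extracted from the table context or propagated from a temporal knowledge base), the value whose scope lies closest to the queried date. It is a simplified sketch under an assumed representation, not the fusion methods developed in the thesis.

```python
from datetime import date

def fuse_time_aware(candidates, query_date):
    """Pick the value whose temporal scope lies closest to query_date.

    candidates: list of (value, scope_date) pairs, e.g. values mined from web tables
    together with their estimated temporal scopes (assumed representation).
    """
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs((c[1] - query_date).days))[0]

# Example: conflicting population figures for one city, each with a temporal scope.
populations = [(305_000, date(2005, 1, 1)), (312_000, date(2012, 1, 1)), (321_000, date(2019, 1, 1))]
print(fuse_time_aware(populations, date(2013, 6, 1)))  # -> 312000
```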

    Data Science and Knowledge Discovery

    Get PDF
    Data Science (DS) is gaining significant importance in decision processes because it combines various areas, including Computer Science, Machine Learning, Mathematics and Statistics, domain/business knowledge, software development, and traditional research. In business, applying DS allows scientific methods, processes, algorithms, and systems to be used to extract knowledge and insights from structured and unstructured data in support of decision making. After the data has been collected, it is crucial to discover the knowledge it contains. In this step, Knowledge Discovery (KD) tasks are used to create knowledge from structured and unstructured sources (e.g., text, data, and images). The output needs to be in a readable and interpretable format, and it must represent knowledge in a manner that facilitates inferencing. KD is applied in several areas, such as education, health, accounting, energy, and public administration. This book includes fourteen excellent articles that discuss this trending topic and present innovative solutions, showing the importance of Data Science and Knowledge Discovery to researchers, managers, industry, society, and other communities. The chapters address topics such as data mining, deep learning, data visualization and analytics, semantic data, geospatial and spatio-temporal data, data augmentation, and text mining.

    Entwicklung eines Frameworks unter Verwendung von Kontextinformationen und kollektiver Intelligenz

    Get PDF
    The importance of data, information, and knowledge as a factor in economic and societal activity is enormous and continues to grow. Their exchange characterizes and drives globalization and digital transformation far more than the exchange of goods. This development rests primarily on the enormous advances in information and communication technology, which now make data and information available at virtually any time and in any place. However, the huge and rapidly growing amount of available data and information leads to an overload, in the sense that it becomes ever harder to find or identify the data and information that are actually relevant. When using software systems in particular, users frequently have a pressing, situation- or context-dependent need for information, not least because the systems are changed and evolved in ever shorter cycles. The corresponding search for information to meet this need is, however, often time-consuming and frequently carried out in a suboptimal way. In his work, Michael Beul investigates how the search for and provision of relevant information can be facilitated or automated in order to enable more effective use of application systems. He develops a framework that, in particular by using concepts of collective intelligence, enables context-dependent real-time information retrieval for users of software-intensive systems across various application domains.

    Participative Urban Health and Healthy Aging in the Age of AI

    Get PDF
    This open access book constitutes the refereed proceedings of the 18th International Conference on Smart Homes and Health Telematics, ICOST 2022, held in Paris, France, in June 2022. The 15 full papers and 10 short papers presented in this volume were carefully reviewed and selected from 33 submissions. They cover topics such as the design, development, deployment, and evaluation of AI for health, smart urban environments, assistive technologies, chronic disease management, and coaching and health telematics systems.

    Urban Informatics

    Get PDF
    This open access book is the first to systematically introduce the principles of urban informatics and its application to every aspect of the city that involves its functioning, control, management, and future planning. It introduces new models and tools being developed to understand and implement these technologies that enable cities to function more efficiently – to become ‘smart’ and ‘sustainable’. The smart city has quickly emerged as computers have become ever smaller to the point where they can be embedded into the very fabric of the city, as well as being central to new ways in which the population can communicate and act. When cities are wired in this way, they have the potential to become sentient and responsive, generating massive streams of ‘big’ data in real time as well as providing immense opportunities for extracting new forms of urban data through crowdsourcing. This book offers a comprehensive review of the methods that form the core of urban informatics from various kinds of urban remote sensing to new approaches to machine learning and statistical modelling. It provides a detailed technical introduction to the wide array of tools information scientists need to develop the key urban analytics that are fundamental to learning about the smart city, and it outlines ways in which these tools can be used to inform design and policy so that cities can become more efficient with a greater concern for environment and equity

    Hacking the web 2.0: user agency and the role of hackers as computational mediators

    Get PDF
    This thesis studies the contested reconfigurations of computational agency within the domain of practices and affordances involved in the use of the Internet in everyday life (here labelled lifeworld Internet), through the transition of the Internet to a much deeper reliance on computation than at any previous stage. Computational agency is here considered not only in terms of capacity to act enabled (or restrained) by the computational layer but also as the recursive capacity to reconfigure the computational layer itself, therefore in turn affecting one’s own and others’ computational agency. My research is based on multisited and diachronic ethnographic fieldwork: an initial (2005–2007) autoethnographic case study focused on the negotiations of computational agency within the development of a Web 2.0 application, later (2010–2011) fieldwork interviews focused on processes through which users make sense of the increasing pervasiveness of the Internet and of computation in everyday life, and a review (2010–2015) of hacker discourses focused on tracing the processes through which hackers constitute themselves as a recursive public able to inscribe counter–narratives in the development of technical form and to reproduce itself as a public of computational mediators with capacity to operate at the intersection of the technical and the social. By grounding my enquiry in the specific context of the lifeworlds of individual end users but by following computational agency through global hacker discourses, my research explores the role of computation, computational capacity and computational mediators in the processes through which users ‘hack’ their everyday Internet environments for practical utility, or develop independent alternatives to centralized Internet services as part of their contestation of values inscribed in the materiality of mainstream Internet