12 research outputs found

    Web-scale profiling of semantic annotations in HTML pages

    The vision of the Semantic Web was coined by Tim Berners-Lee almost two decades ago. The idea describes an extension of the existing Web in which “information is given well-defined meaning, better enabling computers and people to work in cooperation” [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one realization of this vision that has been adopted by a large number of web sites in recent years. Semantic annotations are integrated into the code of HTML pages using one of three markup languages: Microformats, RDFa, or Microdata. The major consumers of semantic annotations are the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic annotations from crawled web pages to enrich the presentation of search results and to complement their knowledge bases. Outside the large search engine companies, however, little is known about the deployment of semantic annotations: How many web sites deploy semantic annotations? What topics do the annotations cover? How detailed are they? Do web sites use semantic annotations correctly? Are semantic annotations useful for parties other than the search engine companies? And how can semantic annotations be gathered from the Web in that case? The thesis answers these questions by profiling the web-wide deployment of semantic annotations. The topic is approached in three consecutive steps: In the first step, two approaches for extracting semantic annotations from the Web are discussed. The thesis first evaluates the technique of focused crawling for harvesting semantic annotations. Afterwards, a framework for extracting semantic annotations from existing web crawl corpora is described. The two extraction approaches are then compared with respect to their suitability for analyzing the deployment of semantic annotations on the Web. In the second step, the thesis analyzes the overall and markup language-specific adoption of semantic annotations. This empirical investigation is based on the largest web corpus available to the public. Furthermore, the topics covered by deployed semantic annotations and their evolution over time are analyzed. Subsequent studies examine common errors within semantic annotations. In addition, the thesis analyzes the overlap of the entities described by semantic annotations within and across web sites. The third step narrows the focus of the analysis to use case-specific questions. Based on the requirements of a marketplace, a news aggregator, and a travel portal, the thesis empirically examines the utility of semantic annotations for these use cases. Additional experiments analyze whether product-related semantic annotations can be integrated into an existing product categorization schema. In particular, the potential of exploiting the diverse category information provided by the web sites publishing semantic annotations is evaluated.
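
    To make the annotation formats mentioned above concrete, the following is a minimal, self-contained sketch of schema.org Microdata embedded in an HTML page, together with a simplified extraction pass using Python's standard-library HTML parser. The sample markup and the collector class are invented for illustration; this is not the extraction framework developed in the thesis.

```python
# Minimal sketch: schema.org Microdata embedded in HTML and a simplified
# extraction pass. Illustrative only; sample markup and logic are invented.
from html.parser import HTMLParser

SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Acme Trail Shoe</span>
  <span itemprop="category">Shoes &gt; Running</span>
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <span itemprop="price">79.90</span>
    <span itemprop="priceCurrency">EUR</span>
  </div>
</div>
"""


class MicrodataCollector(HTMLParser):
    """Collects (item type, property, text value) triples from Microdata attributes.

    Simplification: item scopes are never popped when their element closes and
    attribute-valued properties (e.g. href, content) are ignored. This is good
    enough for the flat sample above, but it is not a full Microdata parser.
    """

    def __init__(self):
        super().__init__()
        self.type_stack = []       # itemtype values of open itemscope elements
        self.current_prop = None   # itemprop of the element whose text we expect next
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A nested item: its itemtype becomes the context for following properties.
            self.type_stack.append(attrs.get("itemtype", ""))
        elif "itemprop" in attrs:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.current_prop and text:
            item_type = self.type_stack[-1] if self.type_stack else ""
            self.triples.append((item_type, self.current_prop, text))
            self.current_prop = None


collector = MicrodataCollector()
collector.feed(SAMPLE_HTML)
for item_type, prop, value in collector.triples:
    print(f"{item_type}\t{prop}\t{value}")
```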

    Enriching and validating geographic information on the web

    The continuous growth of available data on the World Wide Web has led to an unprecedented amount of available information. However, the enormous variance in data quality and trustworthiness of information sources impairs the great potential of this large amount of available information. This observation especially applies to geographic information on the Web, i.e., information describing entities that are located on the Earth’s surface. With the advent of mobile devices, the impact of geographic Web information on our everyday life has grown substantially. Mobile devices have also enabled the creation of novel data sources such as OpenStreetMap (OSM), a collaborative crowd-sourced map providing open cartographic information. Today, we use geographic information in many applications, including routing, location recommendation, and geographic question answering. The processing of geographic Web information poses unique challenges. First, the descriptions of geographic entities on the Web are typically not validated. Since not all Web information sources are trustworthy, the correctness of some geographic Web entities is questionable. Second, geographic information sources on the Web are typically isolated from each other. The missing integration of information sources hinders the efficient use of geographic Web information for many applications. Third, the descriptions of geographic entities are typically incomplete. Depending on the application, missing information can be a decisive criterion for (not) using a particular data source. Due to the large scale of the Web, manual correction of these problems is usually not feasible, so automated approaches are required. In this thesis, we tackle these challenges from three different angles. (i) Validation of geographic Web information: We validate geographic Web information by detecting vandalism in OpenStreetMap, for instance, the replacement of a street name with an advertisement. To this end, we present the OVID model for automated vandalism detection in OpenStreetMap. (ii) Enrichment of geographic Web information through integration: We integrate OpenStreetMap with other geographic Web information sources, namely knowledge graphs, by identifying entries that correspond to the same real-world entities in both data sources. We present the OSM2KG model for automated identity link discovery between OSM and knowledge graphs. (iii) Enrichment of missing information in geographic Web information: We consider semantic annotations of geographic entities on Web pages as an additional data source. We exploit existing annotations of categorical properties of Web entities as training data to enrich missing categorical properties in geographic Web information. For all of the proposed models, we conduct extensive evaluations on real-world datasets. Our experimental results confirm that the proposed solutions reliably outperform existing baselines. Furthermore, we demonstrate the utility of geographic Web information in two application scenarios. (i) Corpus of geographic entity embeddings: We introduce the GeoVectors corpus, a linked open dataset of ready-to-use embeddings of geographic entities. With GeoVectors, we substantially lower the burden of using geographic data in machine learning applications. (ii) Application to event impact prediction: We employ several geographic Web information sources to predict the impact of public events on road traffic. To this end, we use cartographic, event, and event venue information from the Web.
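
    As a concrete illustration of challenge (ii) above, the following sketch implements a deliberately simple linking baseline: knowledge-graph entities are first blocked by geographic distance to an OSM node, and the remaining candidates are ranked by name similarity. All sample records, identifiers, the search radius, and the similarity threshold are placeholders; this baseline is not the OSM2KG model.

```python
# Toy baseline for linking an OSM node to a knowledge-graph entity:
# block candidates by geographic distance, then rank by name similarity.
# Sample data, radius, and threshold are invented; this is not OSM2KG.
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

# Hypothetical OSM node (name tag + coordinates).
osm_node = {"name": "Leibniz Universitaet Hannover", "lat": 52.3827, "lon": 9.7178}

# Hypothetical knowledge-graph entries (identifiers and coordinates are placeholders).
kg_entities = [
    {"id": "Q678982", "label": "Leibniz University Hannover", "lat": 52.3828, "lon": 9.7177},
    {"id": "Q1715", "label": "Hannover", "lat": 52.3745, "lon": 9.7386},
    {"id": "Q183", "label": "Germany", "lat": 51.0, "lon": 9.0},
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def link_candidates(node, entities, radius_km=1.0, min_similarity=0.6):
    """Return (entity, similarity) pairs within radius_km, best match first."""
    matches = []
    for entity in entities:
        if haversine_km(node["lat"], node["lon"], entity["lat"], entity["lon"]) > radius_km:
            continue  # blocking step: discard spatially distant candidates
        similarity = SequenceMatcher(None, node["name"].lower(), entity["label"].lower()).ratio()
        if similarity >= min_similarity:
            matches.append((entity, similarity))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


for entity, score in link_candidates(osm_node, kg_entities):
    print(f'{osm_node["name"]} -> {entity["id"]} ({entity["label"]}), similarity={score:.2f}')
```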

    Semantic Systems. The Power of AI and Knowledge Graphs

    This open access book constitutes the refereed proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, held in Karlsruhe, Germany, in September 2019. The 20 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 88 submissions. They cover topics such as: web semantics and linked (open) data; machine learning and deep learning techniques; semantic information management and knowledge integration; terminology, thesaurus and ontology management; data mining and knowledge discovery; semantics in blockchain and distributed ledger technologies.

    Linked Open Data - Creating Knowledge Out of Interlinked Data: Results of the LOD2 Project

    Database Management; Artificial Intelligence (incl. Robotics); Information Systems and Communication Service

    Heuristics for fixing common errors in deployed schema.org microdata

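    This entry has no abstract attached. As a rough sketch of what heuristic repairs for deployed Microdata can look like, the example below normalises two error classes that are frequently observed in practice: non-canonical schema.org namespace URLs and wrongly capitalised type names. The vocabulary excerpt and the repair rules are assumptions made for illustration and do not reproduce the heuristics of the cited work.

```python
# Illustrative heuristics (not the cited paper's rule set) for two common
# problems in deployed schema.org Microdata: non-canonical namespace URLs
# and wrongly capitalised type names.
import re

# Small excerpt of known schema.org type names, keyed by lower-case spelling.
KNOWN_TYPES = {t.lower(): t for t in ("Product", "Offer", "LocalBusiness", "PostalAddress")}


def normalize_itemtype(itemtype: str) -> str:
    """Best-effort repair of a Microdata itemtype URL."""
    value = itemtype.strip()
    # Heuristic 1: canonicalise the namespace (protocol, www prefix, trailing slash).
    # The canonical form is assumed to be "https://schema.org/<Type>" here.
    match = re.match(r"https?://(www\.)?schema\.org/?(?P<name>[^/#]*)", value, re.IGNORECASE)
    if not match:
        return value  # not a schema.org URL; leave untouched
    name = match.group("name")
    # Heuristic 2: repair the capitalisation of known type names.
    name = KNOWN_TYPES.get(name.lower(), name)
    return f"https://schema.org/{name}"


for raw in ("http://www.schema.org/product", "https://schema.org/LocalBusiness/", "HTTP://SCHEMA.ORG/offer"):
    print(raw, "->", normalize_itemtype(raw))
```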

    Engineering Agile Big-Data Systems

    To be effective, data-intensive systems require extensive ongoing customisation to reflect changing user requirements, organisational policies, and the structure and interpretation of the data they hold. Manual customisation is expensive, time-consuming, and error-prone. In large complex systems, the value of the data can be such that exhaustive testing is necessary before any new feature can be added to the existing design. In most cases, the precise details of requirements, policies and data will change during the lifetime of the system, forcing a choice between expensive modification and continued operation with an inefficient design. Engineering Agile Big-Data Systems outlines an approach to dealing with these problems in software and data engineering, describing a methodology for aligning these processes throughout product lifecycles. It discusses tools which can be used to achieve these goals and, in a number of case studies, shows how the tools and methodology have been used to improve a variety of academic and business systems.
