26 research outputs found

    Web-scale profiling of semantic annotations in HTML pages

    The vision of the Semantic Web was articulated by Tim Berners-Lee almost two decades ago. The idea describes an extension of the existing Web in which “information is given well-defined meaning, better enabling computers and people to work in cooperation” [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one realization of this vision, and large numbers of web sites have adopted them in recent years. Semantic annotations are integrated into the code of HTML pages using one of three markup languages: Microformats, RDFa, or Microdata. Major consumers of semantic annotations are the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic annotations from crawled web pages to enrich the presentation of search results and to complement their knowledge bases. However, outside the large search engine companies, little is known about the deployment of semantic annotations: How many web sites deploy semantic annotations? What topics do semantic annotations cover? How detailed are the annotations? Do web sites use semantic annotations correctly? Are semantic annotations useful for parties other than the search engine companies? And how can semantic annotations be gathered from the Web in that case? The thesis answers these questions by profiling the web-wide deployment of semantic annotations. The topic is approached in three consecutive steps. In the first step, two approaches for extracting semantic annotations from the Web are discussed. The thesis first evaluates the technique of focused crawling for harvesting semantic annotations. Afterward, a framework to extract semantic annotations from existing web crawl corpora is described. The two extraction approaches are then compared for the purpose of analyzing the deployment of semantic annotations on the Web. In the second step, the thesis analyzes the overall and markup language-specific adoption of semantic annotations. 
This empirical investigation is based on the largest web corpus that is available to the public. Further, the topics covered by deployed semantic annotations and their evolution over time are analyzed. Subsequent studies examine common errors within semantic annotations. In addition, the thesis analyzes the data overlap of the entities that are described by semantic annotations, both within the same web site and across different web sites. The third step narrows the focus of the analysis towards use case-specific issues. Based on the requirements of a marketplace, a news aggregator, and a travel portal, the thesis empirically examines the utility of semantic annotations for these use cases. Additional experiments analyze how well product-related semantic annotations can be integrated into an existing product categorization schema. In particular, the potential of exploiting the diverse category information given by the web sites providing semantic annotations is evaluated

    Enriching and validating geographic information on the web

    The continuous growth of data on the World Wide Web has led to an unprecedented amount of available information. However, the enormous variance in data quality and in the trustworthiness of information sources impairs the great potential of this large amount of information. This observation especially applies to geographic information on the Web, i.e., information describing entities that are located on the Earth’s surface. With the advent of mobile devices, the impact of geographic Web information on our everyday life has grown substantially. Mobile devices have also enabled the creation of novel data sources such as OpenStreetMap (OSM), a collaborative crowd-sourced map providing open cartographic information. Today, we use geographic information in many applications, including routing, location recommendation, and geographic question answering. The processing of geographic Web information poses unique challenges. First, the descriptions of geographic entities on the Web are typically not validated. Since not all Web information sources are trustworthy, the correctness of some geographic Web entities is questionable. Second, geographic information sources on the Web are typically isolated from each other. The missing integration of information sources hinders the efficient use of geographic Web information for many applications. Third, the description of geographic entities is typically incomplete. Depending on the application, missing information is a decisive criterion for (not) using a particular data source. Due to the large scale of the Web, the manual correction of these problems is usually not feasible, so automated approaches are required. In this thesis, we tackle these challenges from three different angles. (i) Validation of geographic Web information: We validate geographic Web information by detecting vandalism in OpenStreetMap, for instance, the replacement of a street name with an advertisement. 
To this end, we present the OVID model for automated vandalism detection in OpenStreetMap. (ii) Enrichment of geographic Web information through integration: We integrate OpenStreetMap with other geographic Web information sources, namely knowledge graphs, by identifying entries corresponding to the same real-world entities in both data sources. We present the OSM2KG model for automated identity link discovery between OSM and knowledge graphs. (iii) Enrichment of missing information in geographic Web information: We consider semantic annotations of geographic entities on Web pages as an additional data source. We exploit existing annotations of categorical properties of Web entities as training data to enrich missing categorical properties in geographic Web information. For all of the proposed models, we conduct extensive evaluations on real-world datasets. Our experimental results confirm that the proposed solutions reliably outperform existing baselines. Furthermore, we demonstrate the utility of geographic Web information in two application scenarios. (i) Corpus of geographic entity embeddings: We introduce the GeoVectors corpus, a linked open dataset of ready-to-use embeddings of geographic entities. With GeoVectors, we substantially lower the barrier to using geographic data in machine learning applications. (ii) Application to event impact prediction: We employ several geographic Web information sources to predict the impact of public events on road traffic. To this end, we use cartographic, event, and event venue information from the Web.

    Semantic Systems. The Power of AI and Knowledge Graphs

    This open access book constitutes the refereed proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, held in Karlsruhe, Germany, in September 2019. The 20 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 88 submissions. They cover topics such as: web semantics and linked (open) data; machine learning and deep learning techniques; semantic information management and knowledge integration; terminology, thesaurus and ontology management; data mining and knowledge discovery; semantics in blockchain and distributed ledger technologies

    Cross-domain Recommendations based on semantically-enhanced User Web Behavior

    Information seeking on the Web can be facilitated by recommender systems that guide users in a personalized manner to relevant resources within the large space of possible options on the Web. This work investigates how to model people's Web behavior across multiple sites and learn to predict future preferences, in order to generate relevant cross-domain recommendations. This thesis contributes novel techniques for building cross-domain recommender systems in an open Web setting

    Evaluación de la Accesibilidad y Adaptabilidad de Objetos de Aprendizaje y cursos online a través de estándares y metadatos

    Entrepreneurship has gained relevance as a research topic in recent years. The individual entrepreneur plays an influential role in the economy, so understanding their motivations and aspirations is key to this research. Entrepreneurship can be considered from two perspectives: at the individual level and from the point of view of the organization, the latter covering intrapreneurship or corporate entrepreneurship. The main objective of this work is therefore to analyze the different levels at which entrepreneurial activity takes place and to understand its link with soft skills, so that this activity can be regarded as a driver of dynamism. To address this general objective, the earlier literature on the concept of entrepreneurship was studied first, which ties the concept to the individual and organizational settings. Subsequently, the determining abilities or soft skills are analyzed as influential elements that enable entrepreneurial development and growth. In this way, the impact of entrepreneurial and intrapreneurial activity is addressed from a general perspective. Along these lines, elements such as creativity and knowledge have been linked throughout the research, since entrepreneurs need to update their knowledge constantly and to seek out and take advantage of existing opportunities. Consequently, this research helps to address the existing gap in the literature and highlights the most relevant capabilities in the labor market, making it possible to analyze the new links between society and entrepreneurial initiative. 
The first approaches of the research were developed through an analysis of the European Skills, Competences, Qualifications and Occupations (ESCO) database, which made it possible to identify the key qualities and competences in entrepreneurial development. In addition, a bibliometric analysis of the concept of entrepreneurship was carried out as the first article of the thesis. This made it possible to identify the most representative researchers in this field and to understand the networks and connections between them. Likewise, the words innovation and training stood out as linked to this concept, which is key to continuing research on the topic. At the level of the individual, an analysis of the entrepreneur's motivations was carried out and is reflected in the second article of the thesis. In this case, the research examined how the variables creativity, communication, and leadership influence the decision to become an entrepreneur in a pre-pandemic situation and in the current situation, regarded as the new normal. In this respect, entrepreneurial motivation stood out as being influenced by factors such as uncertainty. Moreover, the variables creativity, communication, and leadership are not representative of the presence of potential entrepreneurs in the post-pandemic new-normal situation, whereas they were before Covid-19. It is therefore worth mentioning that a comparative analysis was carried out because of Covid-19, which greatly enriched the results obtained. Finally, given the difficulty of accessing organizations' internal strategic data, the variables that affect company strategy were studied through a survey of 241 small and medium-sized enterprises (SMEs). 
This made it possible to consider the influence within the organization of each variable highlighted in the previous analysis: creativity, communication, and leadership. Consequently, the third article provides an in-depth analysis of the impact of training and the determining skills on business strategy. This article investigates whether those variables directly affect the development of intrapreneurial initiatives. The research highlights the relevance of employee training in organizations as a differentiating, value-generating component. Training in skills and competences will thus enable employees to carry out entrepreneurial activities, which will support strategic decision-making and differentiation in today's competitive and changing market

    Modelling and Recognizing Personal Data

    Defining what a person is represents a hard task, because personal data, i.e., data that refer to or describe a person, have a very heterogeneous nature. The issue is only worsening with the advent of technologies that, while allowing unprecedented collection and processing capabilities, cannot understand the world as humans do. This is a well-known, long-standing problem in computer science called the Semantic Gap Problem. It was originally defined in the research area of image processing as "... the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation...". In the context of this work, the semantic gap is the lack of coincidence between sensor data collected by ubiquitous devices and the human knowledge about the world that relies on their intelligence, habits, and routines. This thesis addresses the semantic gap problem from a representational point of view, proposing an interdisciplinary approach able to model and recognize personal data in real-life scenarios. In fact, the semantic gap affects many communities, ranging from ubiquitous computing to user modelling, that must face the issue of managing the complexity of personal data in terms of modelling and recognition. The contributions of this Ph.D. thesis are: 1) The definition of a methodology based on an interdisciplinary approach that can account for how to represent and allow the recognition of personal data. The interdisciplinary approach relies on the entity-centric approach and on an interdisciplinary categorization to define and structure personal data. 
2) The definition of an ontology of personal data to represent humans in a general way while also accounting for the different dimensions of their everyday life. 3) The instantiation of the personal data representation above in a reference architecture that allows implementing the ontology and that can exploit the methodology to account for how to recognize personal data. 4) The adoption of the methodology for defining personal data and its instantiation in three real-life use cases with different goals in mind, showing that our modelling works in different domains and can account for several dimensions of the user

    Search Results: Predicting Ranking Algorithms With User Ratings and User-Driven Data

    The purpose of this correlational quantitative study was to examine the possible relationship between user-driven parameters, user ratings, and ranking algorithms. The study’s population consisted of students and faculty in the information technology (IT) field at a university in Huntington, WV. Arrow’s impossibility theorem was used as the theoretical framework for this study. Complete survey data were collected from 47 students and faculty members in the IT field, and a multiple regression analysis was used to measure the correlations between the variables. The model was able to explain 85% of the total variability in the ranking algorithm. The overall model was able to significantly predict the algorithm ranking discounted cumulative gain (DCG), R² = .852, F(3,115) = 220.13, p < .01. The Respondent DCG and Search term variables were the most significant predictors, with p = .0001. The overall findings can potentially be useful to content providers who focus their content on a specific niche. The content created by these providers would most likely be focused entirely on that subgroup of interested users. While it is necessary to focus content on the interested users, it may be beneficial to expand the content to more generic terms to help reach potential new users outside of the subgroups of interest. Users searching for more generic terms could potentially be exposed to more content that would generally require more specific search terms. This exposure through more generic terms could help users expand their knowledge of new content more quickly

    Resource discovery in heterogeneous digital content environments

    The concept of 'resource discovery' is central to our understanding of how users explore, navigate, locate and retrieve information resources. This submission for a PhD by Published Works examines a series of 11 related works which explore topics pertaining to resource discovery, each demonstrating heterogeneity in their digital discovery context. The assembled works are prefaced by nine chapters which seek to review and critically analyse the contribution of each work, as well as provide contextualization within the wider body of research literature. A series of conceptual sub-themes is used to organize and structure the works and the accompanying critical commentary. The thesis first begins by examining issues in distributed discovery contexts by studying collection level metadata (CLM), its application in 'information landscaping' techniques, and its relationship to the efficacy of federated item-level search tools. This research narrative continues but expands in the later works and commentary to consider the application of Knowledge Organization Systems (KOS), particularly within Semantic Web and machine interface contexts, with investigations of semantically aware terminology services in distributed discovery. The necessary modelling of data structures to support resource discovery - and its associated functionalities within digital libraries and repositories - is then considered within the novel context of technology-supported curriculum design repositories, where questions of human-computer interaction (HCI) are also examined. The final works studied as part of the thesis are those which investigate and evaluate the efficacy of open repositories in exposing knowledge commons to resource discovery via web search agents. Through the analysis of the collected works it is possible to identify a unifying theory of resource discovery, with the proposed concept of (meta)data alignment described and presented with a visual model. 
This analysis assists in the identification of a number of research topics worthy of further research; but it also highlights an incremental transition by the present author, from using research to inform the development of technologies designed to support or facilitate resource discovery, particularly at a 'meta' level, to the application of specific technologies to address resource discovery issues in a local context. Despite this variation the research narrative has remained focussed on topics surrounding resource discovery in heterogeneous digital content environments and is noted as having generated a coherent body of work. Separate chapters are used to consider the methodological approaches adopted in each work and the contribution made to research knowledge and professional practice

    2020 annual report

    The South Carolina Coordinating Council for Workforce Development annually publishes a report on council members, activities and accomplishments, and survey results