
    A vocabulary-independent generation framework for DBpedia and beyond

    The DBpedia Extraction Framework, the generation framework behind one of the Linked Open Data cloud’s central hubs, has limitations that lead to quality issues in the DBpedia dataset. We therefore provide a new take on the Extraction Framework that allows for a sustainable and general-purpose Linked Data generation framework by adopting a semantics-driven approach. The proposed approach decouples, in a declarative manner, the execution of the extraction, transformation, and mapping rules. Among other benefits, this makes it possible to interchange different schema annotations, rather than being coupled to a single ontology as is currently the case: the DBpedia Extraction Framework can only generate a dataset with a single semantic representation. In this paper, we shed more light on the added value that this aspect brings. We provide an extracted DBpedia dataset that uses a different vocabulary, and give users the opportunity to generate a new DBpedia dataset using a custom combination of vocabularies.
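
    To make the decoupling concrete, here is a minimal sketch in Python with rdflib, not the framework's actual code or rule syntax: the output of the extraction step is kept separate from declarative mapping rules, so swapping the target vocabulary changes only the schema annotations. The property names, URIs, and example facts are illustrative assumptions.

```python
# Minimal sketch (not the authors' framework): the same extracted infobox
# facts are serialized against interchangeable vocabularies because the
# mapping rules are declarative and separate from the extraction step.
from rdflib import Graph, Literal, Namespace, URIRef

DBO = Namespace("http://dbpedia.org/ontology/")
SCHEMA = Namespace("http://schema.org/")

# Hypothetical output of the extraction step: raw infobox key/value pairs.
extracted = {"birth_place": "Ulm", "birth_date": "1879-03-14"}

# Declarative mapping rules: extracted key -> target property per vocabulary.
mappings = {
    "dbo": {"birth_place": DBO.birthPlace, "birth_date": DBO.birthDate},
    "schema": {"birth_place": SCHEMA.birthPlace, "birth_date": SCHEMA.birthDate},
}

def generate(subject_uri, facts, vocabulary):
    """Apply one set of mapping rules; changing `vocabulary` swaps only the schema annotations."""
    g = Graph()
    subject = URIRef(subject_uri)
    for key, value in facts.items():
        g.add((subject, mappings[vocabulary][key], Literal(value)))
    return g

g = generate("http://dbpedia.org/resource/Albert_Einstein", extracted, "schema")
print(g.serialize(format="turtle"))
```

    Running the same call with "dbo" instead of "schema" would emit the same facts annotated with the DBpedia ontology, which is the kind of vocabulary interchange the abstract describes.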

    Assessing and refining mappings to RDF to improve dataset quality

    RDF dataset quality assessment is currently performed primarily after the data is published. However, there is no systematic way to incorporate its results into the dataset, nor to integrate the assessment into the publishing workflow. Adjustments are applied manually, and rarely. Moreover, the root of the violations, which often lies in the mappings that specify how the RDF dataset is generated, is not identified. We suggest an incremental, iterative, and uniform validation workflow for RDF datasets stemming originally from (semi-)structured data (e.g., CSV, XML, JSON). In this work, we focus on assessing and improving their mappings. We (i) adopt a test-driven approach that assesses the mappings instead of the RDF dataset itself, as the mappings reflect how the dataset will be formed when generated; and (ii) perform semi-automatic mapping refinements based on the results of the quality assessment. The proposed workflow is applied to diverse cases, e.g., large crowdsourced datasets such as DBpedia and newly generated ones such as iLastic. Our evaluation indicates the efficiency of our workflow, as it significantly improves the overall quality of the RDF datasets in the observed cases.
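
    The toy sketch below illustrates the idea of testing mapping rules rather than the generated dataset; it is not the authors' tool, and the ontology ranges, rule format, and refinement step are assumptions made for illustration.

```python
# Illustrative sketch: run "test cases" against mapping rules (not the RDF
# they would produce), flag rules whose declared object type conflicts with
# the assumed ontology range, then refine those rules semi-automatically.
ONTOLOGY_RANGES = {
    "dbo:birthDate": "xsd:date",
    "dbo:populationTotal": "xsd:nonNegativeInteger",
}

# Hypothetical mapping rules: property -> datatype the rule would generate.
mapping_rules = [
    {"property": "dbo:birthDate", "produces": "xsd:string"},  # range violation
    {"property": "dbo:populationTotal", "produces": "xsd:nonNegativeInteger"},
]

def assess(rules):
    """Return the rules that would generate literals violating the ontology range."""
    violations = []
    for rule in rules:
        expected = ONTOLOGY_RANGES.get(rule["property"])
        if expected and rule["produces"] != expected:
            violations.append((rule, expected))
    return violations

def refine(rules):
    """Refinement step: rewrite the produced datatype to the expected range."""
    for rule, expected in assess(rules):
        rule["produces"] = expected
    return rules

print(assess(mapping_rules))   # violations found before generation
print(refine(mapping_rules))   # rules after refinement
```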

    Word Sense Disambiguation for Ontology Learning

    Ontology learning aims to automatically extract ontological concepts and relationships from related text repositories and is expected to be more efficient and scalable than manual ontology development. One of the challenging issues associated with ontology learning is word sense disambiguation (WSD). Most WSD research employs resources such as WordNet, text corpora, or a hybrid approach. Motivated by the large volume and richness of user-generated content in social media, this research explores the role of social media in ontology learning. Specifically, our approach exploits social media as a dynamic, context-rich data source for WSD. This paper presents our method and preliminary evidence of its efficacy. The research is in progress toward a formal evaluation of the social-media-based WSD method, and we plan to incorporate the WSD routine into an ontology learning system in the future.
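
    A simplified, Lesk-style sketch of how social-media text could serve as the disambiguation context: the sense inventory, glosses, and posts below are invented, and the paper's actual method is not claimed to work exactly this way.

```python
# Overlap-based (Lesk-style) sketch: disambiguate a term by comparing
# candidate sense glosses against context words harvested from social media.
SENSES = {
    "bank_1": "financial institution that accepts deposits and issues loans",
    "bank_2": "sloping land beside a river or lake",
}

# Hypothetical social media posts mentioning the ambiguous term.
social_media_context = [
    "great rates on a savings account at my local bank",
    "the bank approved the loan and the deposit cleared",
]

def disambiguate(senses, posts):
    """Pick the sense whose gloss overlaps most with words seen in the posts."""
    context_words = {word for post in posts for word in post.lower().split()}
    scores = {
        sense: len(context_words & set(gloss.lower().split()))
        for sense, gloss in senses.items()
    }
    return max(scores, key=scores.get), scores

print(disambiguate(SENSES, social_media_context))
```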

    Community-Driven Engineering of the DBpedia Infobox Ontology and DBpedia Live Extraction

    The DBpedia project aims at extracting information from the semi-structured data present in Wikipedia articles, interlinking it with other knowledge bases, and publishing this information as RDF freely on the Web. So far, the DBpedia project has succeeded in creating one of the largest knowledge bases on the Data Web, which is used in many applications and research prototypes. However, the manual effort required to produce and publish a new version of the dataset, which was already partially outdated the moment it was released, has been a drawback. Additionally, the maintenance of the DBpedia Ontology, an ontology serving as a structural backbone for the extracted data, made the release cycles even more heavyweight. In the course of this thesis, we make two contributions. Firstly, we develop a wiki-based solution for maintaining the DBpedia Ontology; by allowing anyone to edit, we aim to distribute the maintenance work among the DBpedia community. Secondly, we extend DBpedia with a Live Extraction Framework, which is capable of extracting RDF data from articles that have recently been edited on the English Wikipedia. By automatically publishing this RDF data in near real time, namely via SPARQL and Linked Data, we overcome many of the drawbacks of the former release cycles.
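
    A rough sketch of such a live-extraction loop, assuming the public MediaWiki API's recent-changes listing; this is not the project's code, and the extraction step is only a placeholder.

```python
# Sketch: poll the MediaWiki recent-changes feed and hand each recently
# edited article to an extraction step that would regenerate its RDF triples.
import time
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumed MediaWiki API endpoint

def recently_edited_articles(limit=10):
    """Fetch titles of recently edited main-namespace articles."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcnamespace": 0,
        "rclimit": limit,
        "rcprop": "title|timestamp",
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=10).json()
    return [change["title"] for change in data["query"]["recentchanges"]]

def extract_rdf(title):
    """Placeholder for the infobox-to-RDF extraction of a single article."""
    print(f"re-extracting triples for: {title}")

if __name__ == "__main__":
    while True:
        for title in recently_edited_articles():
            extract_rdf(title)
        time.sleep(60)  # near real time: re-poll once a minute
```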

    Development of a system for populating knowledge bases on the Web of Data

    Over the last decades, use of the World Wide Web has grown exponentially, largely thanks to users' ability to contribute content. This expansion has turned the Web into a vast, heterogeneous data source. However, the Web was designed for people rather than for the automatic processing of information by software agents. To facilitate this, several initiatives, methodologies, and technologies have emerged under the names Semantic Web and Web of Linked Data. Their fundamental pillars are ontologies, defined as formal explicit specifications of a conceptualization, and knowledge bases, repositories whose data are modeled according to an ontology. Many of these knowledge bases are populated manually, while others draw on web pages from which information is extracted with automatic techniques. An example of the latter is DBpedia, whose data are obtained from infoboxes, the small boxes of structured information that accompany each Wikipedia article. Currently, one of the major problems of these knowledge bases is the large number of errors and inconsistencies in the data, the lack of precision, and the absence of links or relations between data that should be related. These problems are partly due to users' lack of knowledge about the data insertion processes: without information about the structure of the knowledge base, they do not know what they can or should enter, or in what form. Moreover, although automatic data insertion techniques exist, they usually perform worse than expert users, especially when the sources used are of low quality. This project presents the analysis, design, and development of a system that helps users create content to populate knowledge bases. The system informs the user about which data and metadata can be entered and in which format, suggests possible values for different fields, and helps relate the new data to existing data whenever possible. To do so, the system uses both statistical techniques over data already entered and semantic techniques over the relations and restrictions defined in the knowledge base being used. In addition, the developed system is available as a web application (http://sid.cps.unizar.es/Infoboxer), is adaptable to different knowledge bases, and can export the created content in several formats, including RDF and Wikipedia infobox markup. Finally, the system has been tested in three user evaluations, which showed its effectiveness and ease of use for creating higher-quality content than without it. Two research papers have been written about this work: one accepted for presentation and publication at the XXI Jornadas de Ingeniería del Software y Bases de Datos (JISBD), and the other under review at the 15th International Semantic Web Conference (ISWC).
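
    As an illustration of the statistical side of such a system, the sketch below queries a SPARQL endpoint for the properties most frequently used with a given class, which could then be offered to the user as suggested fields. The endpoint, query shape, and use of SPARQLWrapper are assumptions for this sketch, not Infoboxer's implementation.

```python
# Sketch: mine property-usage statistics from an existing knowledge base
# (here the public DBpedia endpoint) to suggest fields for a given class.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public DBpedia endpoint

def suggest_properties(class_uri, top=10):
    """Return the most frequently used properties for instances of class_uri."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"""
        SELECT ?property (COUNT(?property) AS ?uses) WHERE {{
            ?subject a <{class_uri}> ; ?property ?value .
        }}
        GROUP BY ?property
        ORDER BY DESC(?uses)
        LIMIT {top}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [(b["property"]["value"], b["uses"]["value"])
            for b in results["results"]["bindings"]]

for prop, uses in suggest_properties("http://dbpedia.org/ontology/Film"):
    print(prop, uses)
```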

    Doctor of Philosophy

    The explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: Web scale (the large and growing volume of data) and the heterogeneity of Web data. Because there is so much data, we need scalable techniques that require little or no manual intervention and that are robust to noisy data. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step toward these problems, we propose a general prudent schema-matching framework that matches a large number of schemas effectively. Our comprehensive experiments on Web-form interfaces and Wikipedia infoboxes show that it can enable on-the-fly, automatic integration of large collections of structured Web data. Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Approaches exist that exploit information in Wikipedia to discover entity types and their associated schemas. However, due to inconsistencies, sparseness, and noise in the community contributions, these approaches are error prone and require substantial human intervention. Given the schema heterogeneity of Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities.
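
    A toy sketch of attribute-set clustering in the spirit described above (not the dissertation's algorithm): infoboxes are grouped by the Jaccard similarity of their attribute names and a schema is read off each cluster. The infobox data and threshold are invented for illustration.

```python
# Sketch: cluster infoboxes by attribute-set similarity, then infer a
# schema per cluster as the union of its members' attributes.
def jaccard(a, b):
    return len(a & b) / len(a | b)

# Hypothetical infoboxes, each reduced to its set of attribute names.
infoboxes = {
    "Film A": {"director", "starring", "runtime"},
    "Film B": {"director", "starring", "budget"},
    "City A": {"population", "mayor", "area"},
    "City B": {"population", "area", "timezone"},
}

def cluster(boxes, threshold=0.3):
    """Greedy clustering: add an infobox to the first cluster it is similar enough to."""
    clusters = []  # each cluster: {"members": [...], "attributes": set(...)}
    for name, attrs in boxes.items():
        for c in clusters:
            if jaccard(attrs, c["attributes"]) >= threshold:
                c["members"].append(name)
                c["attributes"] |= attrs
                break
        else:
            clusters.append({"members": [name], "attributes": set(attrs)})
    return clusters

for c in cluster(infoboxes):
    print(c["members"], "-> inferred schema:", sorted(c["attributes"]))
```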

    Mining Meaning from Wikipedia

    Wikipedia is a goldmine of information, not just for its many readers but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts, and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval; using it for information extraction; and using it as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced. Accepted for publication in the International Journal of Human-Computer Studies.