27 research outputs found

    Efficient dictionary compression for processing RDF big data using Google BigQuery

    The Resource Description Framework (RDF) data model is used on the Web to express billions of structured statements on a wide range of topics, including government, publications and the life sciences. Consequently, processing and storing this data requires high-specification systems, both in terms of storage and computational capabilities. On the other hand, cloud-based big data services such as Google BigQuery can be used to store and query this data without any upfront investment. Google BigQuery pricing is based on the size of the data being stored or queried, but given that RDF statements contain long Uniform Resource Identifiers (URIs), the cost of querying and storing RDF big data can increase rapidly. In this paper we present and evaluate a novel and efficient dictionary compression algorithm which is faster, generates small dictionaries that can fit in memory, and achieves a better compression rate than other large-scale RDF dictionary compression approaches. Consequently, our algorithm also reduces BigQuery storage and query costs.
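The core idea of RDF dictionary compression can be sketched as follows: long URIs are replaced by compact integer IDs, and triples are stored as ID tuples alongside a dictionary. This is a minimal illustration of the general technique, not the paper's actual algorithm; all names and data are invented.

```python
# Minimal sketch of RDF dictionary compression: each distinct URI is
# assigned an integer ID, so triples shrink from three long strings to
# three small integers plus one shared dictionary.

def compress(triples):
    """Encode (s, p, o) URI triples as integer-ID triples plus a dictionary."""
    dictionary = {}          # URI -> integer ID
    encoded = []
    for s, p, o in triples:
        ids = []
        for term in (s, p, o):
            if term not in dictionary:
                dictionary[term] = len(dictionary)  # assign next free ID
            ids.append(dictionary[term])
        encoded.append(tuple(ids))
    return dictionary, encoded

def decompress(dictionary, encoded):
    """Invert the dictionary and restore the original URI triples."""
    inverse = {v: k for k, v in dictionary.items()}
    return [(inverse[s], inverse[p], inverse[o]) for s, p, o in encoded]

triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/bob"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/alice"),
]
d, enc = compress(triples)
assert enc == [(0, 1, 2), (2, 1, 0)]
assert decompress(d, enc) == triples
```

Because BigQuery bills by bytes stored and scanned, shrinking each URI to an integer directly reduces both cost components, which is the effect the abstract describes.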

    Development of a system for populating knowledge bases on the Web of Data

    Over the last decades, use of the World Wide Web has grown exponentially, largely thanks to users' ability to contribute content. This expansion has turned the Web into a large, heterogeneous data source. However, the Web was oriented towards people rather than towards the automatic processing of information by software agents. To facilitate the latter, different initiatives, methodologies and technologies have emerged, grouped under the names Semantic Web and Web of Linked Data. Their fundamental pillars are ontologies, defined as formal explicit specifications of a conceptualization, and knowledge bases, repositories with data modelled according to an ontology. Many of these knowledge bases are populated with data manually, while others use web pages as a source, from which information is extracted using automatic techniques. An example of the latter is DBpedia, whose data are obtained from infoboxes, the small boxes of structured information that accompany each Wikipedia article. Currently, one of the big problems of these knowledge bases is the large number of errors and inconsistencies in the data, the lack of precision, and the absence of links or relations between data that should be related. These problems are partly due to users' unfamiliarity with the data-insertion processes. The lack of information about the structure of the knowledge bases means that users do not know what they can or should enter, nor in what form they should do it. On the other hand, although automatic data-insertion techniques exist, they usually perform worse than specialist users, especially if the sources used are of low quality.
This project proposes the analysis, design and development of a system that helps users create content to populate knowledge bases. The system provides the user with information about what data and metadata can be entered and what format they should use, suggesting possible values for different fields and helping them relate the new data to existing data where possible. To do so, the system uses both statistical techniques over data already entered and semantic techniques over the possible relations and restrictions defined in the knowledge base being worked with. Moreover, the developed system is accessible as a web application (http://sid.cps.unizar.es/Infoboxer), is adaptable to different knowledge bases, and allows the created content to be exported in different formats, including RDF and Wikipedia infoboxes. Finally, the system has been tested in three user evaluations, in which it demonstrated its effectiveness and ease of use for creating higher-quality content than without it, and two research papers have been written about this work: one accepted for presentation and publication at the XXI Jornadas de Ingeniería del Software y Bases de Datos (JISBD), and the other under review at the 15th International Semantic Web Conference (ISWC)
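The statistical side of the suggestion mechanism described above can be illustrated with a toy sketch: candidate values for a field are ranked by how often existing entities in the knowledge base already use them. The field names and data below are invented examples, not the system's actual schema.

```python
from collections import Counter

# Hedged sketch of frequency-based value suggestion: rank candidate values
# for an infobox field by how common they are among entities already in
# the knowledge base, so users see plausible options first.

def suggest_values(existing_entities, field, top_n=3):
    """Return the most common values used for `field` across existing entities."""
    counts = Counter(e[field] for e in existing_entities if field in e)
    return [value for value, _ in counts.most_common(top_n)]

entities = [
    {"genre": "rock"}, {"genre": "rock"}, {"genre": "jazz"}, {"label": "EMI"},
]
assert suggest_values(entities, "genre") == ["rock", "jazz"]
```

In the real system such statistical suggestions are combined with semantic constraints from the ontology (e.g. range restrictions on a property), which a frequency count alone cannot capture.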

    User-centric Query Refinement and Processing Using Granularity Based Strategies

    In the context of large-scale scientific literature, this paper provides a user-centric approach for refining and processing incomplete or vague queries based on cognitive- and granularity-based strategies. From the viewpoints of user-interest retention and granular information processing, we examine various strategies for user-centric unification of search and reasoning. Inspired by the basic level of human problem-solving in cognitive science, we refine a query based on retained user interests. We bring multi-level, multi-perspective strategies from human problem-solving to large-scale search and reasoning. Power/exponential-law-based interest retention modeling, network-statistics-based data selection, and ontology-supervised hierarchical reasoning are developed to implement these strategies. As an illustration, we investigate case studies based on a large-scale scientific literature dataset, DBLP. The experimental results show that the proposed strategies are potentially effective. © 2010 Springer-Verlag London Limited
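The interest-retention idea can be sketched with an exponential decay: an interest observed t time units ago is weighted by exp(-λt), so recently observed topics dominate query refinement. The decay constant and scoring scheme below are assumptions for illustration, not the paper's exact model.

```python
import math

# Illustrative sketch of exponential-law interest retention: each past
# observation of a topic contributes exp(-decay * age) to that topic's
# retained weight, so frequent and recent topics score highest.

def interest_scores(observations, now, decay=0.5):
    """observations: list of (topic, timestamp); returns topic -> retained weight."""
    scores = {}
    for topic, t in observations:
        scores[topic] = scores.get(topic, 0.0) + math.exp(-decay * (now - t))
    return scores

history = [("reasoning", 1), ("search", 9), ("search", 10)]
scores = interest_scores(history, now=10)
# "search" outweighs "reasoning": it was observed more recently and more often
assert scores["search"] > scores["reasoning"]
```

A vague query could then be refined by expanding it with the highest-scoring retained topics, which mirrors the "refine a query based on retained user interests" step in the abstract.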

    Facilitating Scientometrics in Learning Analytics and Educational Data Mining – the LAK Dataset

    The Learning Analytics and Knowledge (LAK) Dataset represents an unprecedented corpus which exposes a near-complete collection of bibliographic resources for a specific research discipline, namely the connected areas of Learning Analytics and Educational Data Mining. Covering over five years of scientific literature from the most relevant conferences and journals, the dataset provides Linked Data about bibliographic metadata as well as the full text of the paper bodies. The latter was enabled through special licensing agreements with ACM for publications not yet available through open access. The dataset has been designed following established Linked Data patterns, reusing established vocabularies and providing links to established schemas and entity coreferences in related datasets. Given the temporal and topical coverage of the dataset, being a near-complete corpus of research publications of a particular discipline, it facilitates scientometric investigations, for instance about the evolution of a scientific field over time or correlations with other disciplines, which is documented through its usage in a wide range of scientific studies and applications

    A Reference Architecture for Building Semantic-Web Mediators

    The Semantic Web comprises a large number of distributed and heterogeneous ontologies, which have been developed by different communities, and there exists a need to integrate them. Mediators are pieces of software that help to perform this integration, and they have been widely studied in the context of nested relational models. Unfortunately, mediators for databases that are modelled using ontologies have not been so widely studied. In this paper, we present a reference architecture for building semantic-web mediators. To the best of our knowledge, this is the first reference architecture in the bibliography that solves the integration problem as a whole, in contrast to existing approaches that focus on specific problems. Furthermore, we describe a case study, contextualised in the digital-libraries domain, in which we realise the benefits of our reference architecture. Finally, we identify a number of best practices for building semantic-web mediators. Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Industria, Turismo y Comercio TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-

    Benchmarking Data Exchange among Semantic-Web Ontologies

    The increasing popularity of the Web of Data is motivating the need to integrate semantic-web ontologies. Data exchange is one integration approach that aims to populate a target ontology using data that come from one or more source ontologies. Currently, there exist a variety of systems that are suitable for performing data exchange among these ontologies; unfortunately, they have uneven performance, which makes it appealing to assess and rank them from an empirical point of view. In the bibliography there exist a number of benchmarks, but they cannot be applied in this context because they are not suitable for testing semantic-web ontologies or do not focus on data exchange problems. In this paper, we present MostoBM, a benchmark for testing data exchange systems in the context of such ontologies. It provides a catalogue of three real-world and seven synthetic data exchange patterns, which can be instantiated into a variety of scenarios using some parameters. These scenarios help to analyse how the performance of data exchange systems evolves as the exchanged ontologies are scaled in structure and/or data. Finally, we provide an evaluation methodology to compare data exchange systems side by side and to make informed and statistically sound decisions regarding: 1) which data exchange system performs better; and 2) how the performance of a system is influenced by the parameters of our benchmark. Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-

    UPI: A Primary Index for Uncertain Databases

    Uncertain data management has received growing attention from industry and academia. Many efforts have been made to optimize uncertain databases, including the development of special index data structures. However, none of these efforts have explored primary (clustered) indexes for uncertain databases, despite the fact that clustering has the potential to offer substantial speedups for non-selective analytic queries on large uncertain databases. In this paper, we propose a new index called a UPI (Uncertain Primary Index) that clusters heap files according to uncertain attributes with both discrete and continuous uncertainty distributions. Because uncertain attributes may have several possible values, a UPI on an uncertain attribute duplicates tuple data once for each possible value. To prevent the size of the UPI from becoming unmanageable, its size is kept small by placing low-probability tuples in a special Cutoff Index that is consulted only when queries for low-probability values are run. We also propose several other optimizations, including techniques to improve secondary index performance and techniques to reduce maintenance costs and fragmentation by buffering changes to the table and writing updates in sequential batches. Finally, we develop cost models for UPIs to estimate query performance in various settings to help automatically select tuning parameters of a UPI. We have implemented a prototype UPI and experimented on two real datasets. Our results show that UPIs can significantly (up to two orders of magnitude) improve the performance of uncertain queries both over clustered and unclustered attributes. We also show that our buffering techniques mitigate table fragmentation and keep the maintenance cost as low as or even lower than using an unclustered heap file. National Science Foundation (U.S.) (Grant IIS-0448124); National Science Foundation (U.S.) (Grant IIS-0905553); National Science Foundation (U.S.) (Grant IIS-0916691)
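The duplication-plus-cutoff idea at the heart of the UPI can be sketched in a few lines: each possible value of an uncertain attribute gets its own index entry, and alternatives below a probability threshold are diverted to a separate cutoff structure that is consulted only on demand. The threshold and dictionary-based structures here are illustrative assumptions, not the paper's storage layout.

```python
# Toy sketch of a UPI: a tuple with an uncertain attribute is indexed once
# per possible value; low-probability alternatives go to a separate
# "cutoff index" consulted only when a query asks for them.

CUTOFF = 0.1  # illustrative probability threshold, not the paper's value

def build_upi(tuples):
    """tuples: list of (tuple_id, {value: probability}) for one uncertain attribute."""
    primary, cutoff_index = {}, {}
    for tid, dist in tuples:
        for value, prob in dist.items():
            target = primary if prob >= CUTOFF else cutoff_index
            target.setdefault(value, []).append((tid, prob))
    return primary, cutoff_index

def lookup(primary, cutoff_index, value, include_low_prob=False):
    """Query the clustered index; fall back to the cutoff index on demand."""
    results = list(primary.get(value, []))
    if include_low_prob:
        results += cutoff_index.get(value, [])
    return results

data = [("t1", {"red": 0.9, "blue": 0.1}), ("t2", {"red": 0.05, "green": 0.95})]
primary, cutoff = build_upi(data)
assert lookup(primary, cutoff, "red") == [("t1", 0.9)]
assert lookup(primary, cutoff, "red", include_low_prob=True) == [("t1", 0.9), ("t2", 0.05)]
```

This shows why the cutoff index keeps the primary structure small: most queries touch only high-probability entries, while the long tail of unlikely alternatives stays out of the clustered data until explicitly requested.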