
    Dynamic Discovery of Type Classes and Relations in Semantic Web Data

    The continuing development of Semantic Web technologies and increasing user adoption in recent years have accelerated progress in incorporating explicit semantics with data on the Web. With the rapidly growing volume of RDF (Resource Description Framework) data on the Semantic Web, processing large semantic graph data has become more challenging. Constructing a summary graph structure from the raw RDF can help obtain semantic type relations and reduce the computational complexity of graph processing. In this paper, we address the problem of graph summarization in RDF graphs and propose an approach for building summary graph structures automatically from RDF graph data. Moreover, we introduce a measure to help discover optimum class dissimilarity thresholds and an effective method to discover the type classes automatically. In future work, we plan to investigate further improvements to the scalability of the proposed method.
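    As a rough illustration of what a summary graph over RDF data can look like, the sketch below lifts instance-level triples to class-level edges. It is not the method of the paper: it assumes every resource already carries an explicit `rdf:type` statement, whereas the abstract describes discovering type classes automatically. The function name `build_summary_graph` and the sample triples are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed: types are given explicitly, not discovered


def build_summary_graph(triples):
    """Lift instance-level triples to class-level (type, predicate, type) edges."""
    # First pass: collect the declared types of each resource.
    types = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types[subject].add(obj)

    # Second pass: replace each endpoint of a non-type triple by its types.
    summary = set()
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            continue
        for subject_type in types.get(subject, ()):
            for object_type in types.get(obj, ()):
                summary.add((subject_type, predicate, object_type))
    return summary


# Hypothetical sample data for illustration only.
sample_triples = [
    ("alice", RDF_TYPE, "Person"),
    ("bob", RDF_TYPE, "Person"),
    ("acme", RDF_TYPE, "Company"),
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("alice", "knows", "bob"),
]
summary_edges = build_summary_graph(sample_triples)
```

    Here three instance-level edges collapse into two class-level edges, which is the kind of size reduction a summary graph is meant to provide.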

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions. Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
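    The programming model mentioned in this abstract can be sketched in a few lines. The example below is a single-process word count that only mimics the map, shuffle and reduce phases; it is an illustrative assumption, not code from the survey or from any MapReduce implementation, and it leaves out data distribution, scheduling and fault tolerance, which a real framework handles. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of text records."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["a b a", "B c"])
```

    The application code is limited to `map_phase` and `reduce_phase`; everything else stands in for what the framework would provide.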

    Processing Analytical Queries in the AWESOME Polystore [Information Systems Architectures]

    Modern big data applications usually involve heterogeneous data sources and analytical functions, leading to increasing demand for polystore systems, especially analytical polystore systems. This paper presents the AWESOME system along with a domain-specific language, ADIL. ADIL is a powerful language which supports 1) native heterogeneous data models such as Corpus, Graph, and Relation; 2) a rich set of analytical functions; and 3) clear and rigorous semantics. AWESOME is an efficient tri-store middleware which 1) is built on top of three heterogeneous DBMSs (Postgres, Solr, and Neo4j) and is easy to extend to incorporate other systems; 2) supports in-memory query engines and is equipped with analytical capability; 3) applies a cost model to efficiently execute workloads written in ADIL; and 4) fully exploits machine resources to improve scalability. A set of experiments on real workloads demonstrates the capability, efficiency, and scalability of AWESOME.

    Theory and Practice of Data Citation

    Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming "data-intensive", where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets. Yet, given a dataset, there is no quantitative, consistent and established way of knowing how it has been used over time, who contributed to its curation, what results it has yielded or what value it has. The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality as traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining the principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation. The current panorama is many-faceted, and an overall view that brings together diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, both from the theoretical (the why and what) and the practical (the how) angle. Comment: 24 pages, 2 tables, pre-print accepted in Journal of the Association for Information Science and Technology (JASIST), 201

    API Development and Redesign of the Associated Database for CMS

    Bachelor's thesis in Computer Engineering and Mathematics, Facultad de Informática UCM, Departamento de Ingeniería del Software e Inteligencia Artificial, academic year 2019/2020. CMS (Compact Muon Solenoid) is a particle detector that is part of the LHC (Large Hadron Collider), the particle accelerator of CERN in Switzerland. The metadata gathered during the operation of the detector are written to a relational database. The web system displaying these data is being fully rewritten and redesigned, and the new system uses as its backend a RESTful API written in Java, which exposes diverse data coming from different sources. To bring these data together, the old system used several database tables aggregating the necessary information. However, most of those tables are obsolete and must be redesigned, along with the whole infrastructure used to update the information and keep it consistent. The objective of this project is, then, to redesign the affected part of the database and build upon it an API that returns the relevant information.

    Flexible Integration and Efficient Analysis of Multidimensional Datasets from the Web

    If numeric data from the Web are brought together, natural scientists can compare climate measurements with estimations, financial analysts can evaluate companies based on balance sheets and daily stock market values, and citizens can explore the GDP per capita from several data sources. However, the heterogeneity and size of the data remain a problem. This work presents methods to query a uniform view - the Global Cube - of available datasets from the Web and builds on Linked Data query approaches.

    A survey of RDB to RDF translation approaches and tools

    ISRN I3S/RR 2013-04-FR, 24 pages. Relational databases scattered over the web are generally opaque to regular web crawling tools. To address this concern, many RDB-to-RDF approaches have been proposed in recent years. In this paper, we present a detailed review of seventeen RDB-to-RDF initiatives, considering end-to-end projects that delivered operational tools. The different tools are classified along three major axes: mapping description language, mapping implementation and data retrieval method. We analyse the motivations, commonalities and differences between existing approaches. The expressiveness of existing mapping languages is not always sufficient to produce semantically rich data and make it usable, interoperable and linkable. We therefore briefly present various strategies investigated in the literature to produce additional knowledge. Finally, we show that R2RML, the W3C recommendation for describing RDB to RDF mappings, may not apply to all needs in the wide scope of RDB to RDF translation applications, leaving space for future extensions.

    Dimensional enrichment of statistical linked open data

    On-Line Analytical Processing (OLAP) is a data analysis technique typically used for local and well-prepared data. However, initiatives like Open Data and Open Government bring new and publicly available data on the web that are to be analyzed in the same way. The use of semantic web technologies in this context is especially encouraged by the Linked Data initiative. There is already a considerable number of statistical linked open data sets published using the RDF Data Cube Vocabulary (QB), which is designed for these purposes. However, QB lacks some essential schema constructs (e.g., dimension levels) to support OLAP. Thus, the QB4OLAP vocabulary has been proposed to extend QB with the necessary constructs and be fully compliant with OLAP. In this paper, we focus on the enrichment of an existing QB data set with QB4OLAP semantics. We first thoroughly compare the two vocabularies and outline the benefits of QB4OLAP. Then, we propose a series of steps to automate the enrichment of QB data sets with specific QB4OLAP semantics, the most important being the definition of aggregate functions and the detection of new concepts in the dimension hierarchy construction. The proposed steps form a semi-automatic enrichment method, which is implemented in a tool that enables the enrichment in an interactive and iterative fashion. The user can enrich the QB data set with QB4OLAP concepts (e.g., full-fledged dimension hierarchies) by choosing among the candidate concepts automatically discovered with the proposed steps. Finally, we conduct experiments with 25 users and use three real-world QB data sets to evaluate our approach. The evaluation demonstrates the feasibility of our approach and shows that, in practice, our tool facilitates, speeds up, and guarantees the correct results of the enrichment process. Peer Reviewed. Postprint (author's final draft)