Dynamic Discovery of Type Classes and Relations in Semantic Web Data
The continuing development of Semantic Web technologies and the increasing
user adoption in recent years have accelerated the incorporation of
explicit semantics into data on the Web. With the rapidly growing amount of
RDF (Resource Description Framework) data on the Semantic Web, processing
large semantic graphs has become more challenging. Constructing a summary
graph structure from raw RDF can help obtain semantic type relations and
reduce the computational complexity of graph processing. In this paper, we
address the problem of graph summarization in RDF graphs, and we propose an
approach for building summary graph structures automatically from RDF graph
data. Moreover, we introduce a measure that helps discover optimal class
dissimilarity thresholds and an effective method to discover the type classes
automatically. In future work, we plan to investigate further improvements
to the scalability of the proposed method.
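The general idea of collapsing an RDF graph into a summary graph can be sketched by grouping subjects that share the same outgoing predicate set (a "characteristic set") into one summary node. This is a minimal illustration of the technique, not the paper's actual algorithm or its dissimilarity measure; the example triples are invented.

```python
from collections import defaultdict

def summarize(triples):
    """Collapse an RDF-style triple list into a summary graph: subjects
    with identical outgoing predicate sets form one summary node."""
    preds = defaultdict(set)          # subject -> set of its predicates
    for s, p, o in triples:
        preds[s].add(p)
    # Each distinct predicate set becomes one candidate type class.
    classes = defaultdict(set)        # frozenset of predicates -> subjects
    for s, ps in preds.items():
        classes[frozenset(ps)].add(s)
    node_of = {s: cls for cls, members in classes.items() for s in members}
    # One summary edge per (source class, predicate, target class).
    edges = {(node_of[s], p, node_of[o])
             for s, p, o in triples if o in node_of}
    return classes, edges

triples = [
    ("alice", "knows",     "bob"),
    ("alice", "worksAt",   "acme"),
    ("bob",   "knows",     "alice"),
    ("bob",   "worksAt",   "acme"),
    ("acme",  "locatedIn", "berlin"),
]
classes, edges = summarize(triples)
# alice and bob share {knows, worksAt}, so they collapse into one node
```

A real system would replace exact predicate-set equality with a similarity threshold, which is where a class dissimilarity measure such as the one proposed here comes in.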
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works after its introduction. This article
provides a comprehensive survey of a family of approaches and mechanisms for
large-scale data processing that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both the research and industrial communities. We also cover a set
of systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large-scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
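The programming model the survey describes boils down to two user-supplied functions, a mapper and a reducer, with the framework handling the grouping in between. A minimal in-process sketch (word count, the canonical example; a real framework would distribute each phase across a cluster):

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents, mapper):
    # Apply the user-defined mapper to every input record.
    return chain.from_iterable(mapper(doc) for doc in documents)

def shuffle(pairs):
    # Group intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    # Apply the user-defined reducer to each key's value list.
    return {k: reducer(k, vs) for k, vs in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(
    shuffle(map_phase(docs, lambda d: ((w, 1) for w in d.split()))),
    lambda k, vs: sum(vs),
)
# counts["the"] == 2
```

The appeal the article highlights is exactly this separation: the mapper and reducer contain all application logic, while data distribution, scheduling, and fault tolerance stay inside the framework.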
Processing Analytical Queries in the AWESOME Polystore [Information Systems Architectures]
Modern big data applications usually involve heterogeneous data sources and
analytical functions, leading to increasing demand for polystore systems,
especially analytical polystore systems. This paper presents the AWESOME
system along with a domain-specific language, ADIL. ADIL is a powerful
language that supports 1) native heterogeneous data models such as Corpus,
Graph, and Relation; 2) a rich set of analytical functions; and 3) clear and
rigorous semantics. AWESOME is an efficient tri-store middleware that 1) is
built on top of three heterogeneous DBMSs (Postgres, Solr, and Neo4j) and is
easy to extend to incorporate other systems; 2) supports in-memory query
engines and is equipped with analytical capabilities; 3) applies a cost model
to efficiently execute workloads written in ADIL; and 4) fully exploits
machine resources to improve scalability. A set of experiments on real
workloads demonstrates the capability, efficiency, and scalability of AWESOME.
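The core of any tri-store middleware of this kind is a dispatch layer that routes each logical data model to its backend engine. The sketch below is purely illustrative (class names, stub results, and the routing keys are assumptions, not AWESOME's actual API); real backends would wrap the Postgres, Solr, and Neo4j drivers.

```python
class Backend:
    def execute(self, query):
        raise NotImplementedError

class RelationalBackend(Backend):   # would wrap a Postgres driver
    def execute(self, query):
        return f"SQL result for: {query}"

class TextBackend(Backend):         # would wrap a Solr client
    def execute(self, query):
        return f"full-text result for: {query}"

class GraphBackend(Backend):        # would wrap a Neo4j driver
    def execute(self, query):
        return f"graph result for: {query}"

class Polystore:
    """Route each query to the engine that owns its data model."""
    def __init__(self):
        self.backends = {
            "relation": RelationalBackend(),
            "corpus":   TextBackend(),
            "graph":    GraphBackend(),
        }

    def run(self, model, query):
        return self.backends[model].execute(query)

store = Polystore()
result = store.run("graph", "MATCH (n) RETURN count(n)")
```

In a full system, the cost model mentioned above would sit on top of this router, deciding where intermediate results materialize and which engine executes each sub-plan.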
Theory and Practice of Data Citation
Citations are the cornerstone of knowledge propagation and the primary means
of assessing the quality of research, as well as directing investments in
science. Science is increasingly becoming "data-intensive", where large volumes
of data are collected and analyzed to discover complex patterns through
simulations and experiments, and most scientific reference works have been
replaced by online curated datasets. Yet, given a dataset, there is no
quantitative, consistent and established way of knowing how it has been used
over time, who contributed to its curation, what results have been yielded or
what value it has.
The development of a theory and practice of data citation is fundamental for
considering data as first-class research objects with the same relevance and
centrality of traditional scientific products. Many works in recent years have
discussed data citation from different viewpoints: illustrating why data
citation is needed, defining the principles and outlining recommendations for
data citation systems, and providing computational methods for addressing
specific issues of data citation.
The current panorama is many-faceted and an overall view that brings together
diverse aspects of this topic is still missing. Therefore, this paper aims to
describe the lay of the land for data citation, both from the theoretical (the
why and what) and the practical (the how) angle.
Comment: 24 pages, 2 tables; preprint accepted in the Journal of the
Association for Information Science and Technology (JASIST), 201
API Development and Redesign of the Associated Database for CMS
Final degree project (Trabajo de Fin de Grado) in Computer Science and Mathematics, Facultad de Informática UCM, Departamento de Ingeniería del Software e Inteligencia Artificial, academic year 2019/2020. CMS (Compact Muon Solenoid) is a particle detector that is part of the LHC (Large Hadron Collider), the particle accelerator of CERN in Switzerland. The metadata gathered during the operation of the detector are written to a relational database. The web system displaying these data is being fully rewritten and redesigned, and the new system uses a RESTful API written in Java, which exposes diverse data coming from different sources. To store these data, the old system used several database tables aggregating the necessary information. However, most of those tables are obsolete and must be redesigned, along with the whole infrastructure used to keep the information up to date and consistent. The objective of this project is therefore to redesign the affected part of the database and build upon it an API that returns the relevant information.
Flexible Integration and Efficient Analysis of Multidimensional Datasets from the Web
If numeric data from the Web are brought together, natural scientists can compare climate measurements with estimations, financial analysts can evaluate companies based on balance sheets and daily stock market values, and citizens can explore the GDP per capita from several data sources. However, the heterogeneity and size of the data remain a problem. This work presents methods to query a uniform view of available datasets from the Web, the Global Cube, and builds on Linked Data query approaches.
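The "uniform view" idea amounts to aligning observations from different sources on shared dimensions so their measures can be queried together. A minimal sketch, with invented sources and figures (the real work operates on multidimensional Linked Data, not Python dicts):

```python
# Two sources publish observations keyed by the same dimensions
# (country, year) but carry different measures. Values are illustrative.
source_a = {("DE", 2020): {"gdp_per_capita": 46_000}}
source_b = {("DE", 2020): {"population": 83_000_000}}

def merge_cubes(*sources):
    """Union observations on shared dimension keys into one 'global cube'."""
    cube = {}
    for src in sources:
        for key, measures in src.items():
            cube.setdefault(key, {}).update(measures)
    return cube

cube = merge_cubes(source_a, source_b)
# cube[("DE", 2020)] now carries measures from both sources
```

The hard part the work addresses is everything this sketch assumes away: the sources disagree on dimension encodings and granularity, and the data are too large to materialize naively.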
A survey of RDB to RDF translation approaches and tools
ISRN I3S/RR 2013-04-FR, 24 pages. Relational databases scattered over the web are generally opaque to regular web crawling tools. To address this concern, many RDB-to-RDF approaches have been proposed over recent years. In this paper, we propose a detailed review of seventeen RDB-to-RDF initiatives, considering end-to-end projects that delivered operational tools. The different tools are classified along three major axes: mapping description language, mapping implementation, and data retrieval method. We analyse the motivations, commonalities, and differences between existing approaches. The expressiveness of existing mapping languages is not always sufficient to produce semantically rich data and make it usable, interoperable, and linkable. We therefore briefly present various strategies investigated in the literature to produce additional knowledge. Finally, we show that R2RML, the W3C recommendation for describing RDB-to-RDF mappings, may not apply to all needs in the wide scope of RDB-to-RDF translation applications, leaving space for future extensions.
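The simplest end of the design space surveyed here is a direct-mapping-style translation: each row becomes a subject IRI, each column a predicate, each cell a literal. A toy sketch (base IRI, table layout, and data are invented; real tools such as R2RML processors support far richer, user-defined mappings):

```python
BASE = "http://example.org/db/"   # illustrative base IRI

def direct_mapping(table, pk, rows):
    """Translate relational rows into RDF-style triples:
    row -> subject IRI, column -> predicate, cell -> literal."""
    triples = []
    for row in rows:
        subject = f"<{BASE}{table}/{pk}={row[pk]}>"
        for col, val in row.items():
            triples.append((subject, f"<{BASE}{table}#{col}>", f'"{val}"'))
    return triples

rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
triples = direct_mapping("person", "id", rows)
# two triples per row: one for "id", one for "name"
```

The survey's point is precisely that this mechanical translation is rarely enough: mapping languages exist so that column values can be linked to existing vocabularies and resources rather than minted as opaque literals.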
Dimensional enrichment of statistical linked open data
On-Line Analytical Processing (OLAP) is a data analysis technique typically used for local and well-prepared data. However, initiatives like Open Data and Open Government bring new and publicly available data on the web that are to be analyzed in the same way. The use of semantic web technologies in this context is especially encouraged by the Linked Data initiative. There is already a considerable amount of statistical linked open data sets published using the RDF Data Cube Vocabulary (QB), which is designed for these purposes. However, QB lacks some essential schema constructs (e.g., dimension levels) to support OLAP. Thus, the QB4OLAP vocabulary has been proposed to extend QB with the necessary constructs and be fully compliant with OLAP. In this paper, we focus on the enrichment of an existing QB data set with QB4OLAP semantics. We first thoroughly compare the two vocabularies and outline the benefits of QB4OLAP. Then, we propose a series of steps to automate the enrichment of QB data sets with specific QB4OLAP semantics, the most important being the definition of aggregate functions and the detection of new concepts in the dimension hierarchy construction. The proposed steps are defined to form a semi-automatic enrichment method, which is implemented in a tool that enables the enrichment in an interactive and iterative fashion. The user can enrich the QB data set with QB4OLAP concepts (e.g., full-fledged dimension hierarchies) by choosing among the candidate concepts automatically discovered with the proposed steps. Finally, we conduct experiments with 25 users and use three real-world QB data sets to evaluate our approach. The evaluation demonstrates the feasibility of our approach and shows that, in practice, our tool facilitates, speeds up, and guarantees the correct results of the enrichment process.
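One enrichment step described above, detecting new concepts for dimension hierarchies, can be illustrated by testing for a functional dependency between two dimension attributes: if every city maps to exactly one country, "country" is a candidate parent level for "city". This is a simplified sketch with invented data, not the paper's actual discovery procedure:

```python
def rolls_up(observations, child, parent):
    """True if child -> parent is a functional dependency, i.e. each
    child value maps to exactly one parent value (a candidate roll-up)."""
    mapping = {}
    for obs in observations:
        c, p = obs[child], obs[parent]
        if mapping.setdefault(c, p) != p:
            return False    # same child seen with two parents
    return True

obs = [
    {"city": "Barcelona", "country": "ES", "sales": 10},
    {"city": "Madrid",    "country": "ES", "sales": 7},
    {"city": "Barcelona", "country": "ES", "sales": 4},
]
assert rolls_up(obs, "city", "country")       # city rolls up to country
assert not rolls_up(obs, "country", "city")   # but not the reverse
```

In the semi-automatic method above, such automatically discovered candidates are offered to the user, who confirms or rejects them interactively.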