NORA: Scalable OWL reasoner based on NoSQL databases and Apache Spark
Reasoning is the process of inferring new knowledge and identifying inconsistencies within ontologies. Traditional techniques often prove inadequate when reasoning over large Knowledge Bases containing millions or billions of facts. This article introduces NORA, a persistent and scalable OWL reasoner built on top of Apache Spark, designed to address the challenges of reasoning over extensive and complex ontologies. NORA exploits the scalability of NoSQL databases to effectively apply inference rules to Big Data ontologies with large ABoxes. To facilitate scalable reasoning, OWL data, including class and property hierarchies and instances, are materialized in the Apache Cassandra database. Spark programs are then evaluated iteratively, uncovering new implicit knowledge from the dataset and leading to enhanced performance and more efficient reasoning over large-scale ontologies. NORA has undergone a thorough evaluation with different benchmarking ontologies of varying sizes to assess the scalability of the developed solution. Funding for open access charge: Universidad de Málaga / CBUA
This work has been partially funded by grant PID2020-112540RB-C41 (funded by MCIN/AEI/10.13039/501100011033/), AETHER-UMA (A smart data holistic approach for context-aware data analytics: semantics and context exploitation). Antonio Benítez-Hidalgo is supported by Grant PRE2018-084280 (Spanish Ministry of Science, Innovation and Universities).
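The iterative Spark evaluation the abstract describes can be pictured as a fixpoint computation: an inference rule is applied repeatedly until no new facts are derived. The following is a minimal plain-Python sketch (not NORA's actual Spark code) using transitivity of rdfs:subClassOf as the example rule; the function name and the sample hierarchy are illustrative.

```python
# Fixpoint-style rule application: derive new facts until nothing changes.
# In NORA this runs as iterated Spark jobs over Cassandra-backed data;
# here a plain set of (subclass, superclass) pairs stands in for the ABox/TBox.

def infer_subclass_closure(facts):
    """Return the transitive closure of a set of (sub, super) pairs."""
    derived = set(facts)
    while True:
        new = {(a, c)
               for (a, b) in derived
               for (b2, c) in derived
               if b == b2 and (a, c) not in derived}
        if not new:          # fixpoint reached: no implicit knowledge left
            return derived
        derived |= new       # the Spark version would union RDDs/DataFrames

hierarchy = {("Dog", "Mammal"), ("Mammal", "Animal"), ("Animal", "Thing")}
closure = infer_subclass_closure(hierarchy)
# ("Dog", "Animal") and ("Dog", "Thing") are now derived facts
```

The semi-naive optimization used by most distributed reasoners would join only the facts added in the previous round rather than the whole set, but the termination condition is the same.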
Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML
OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a
database of all EC FP7 and H2020 funded research projects, including metadata
of their results (publications and datasets). These data are stored in an HBase
NoSQL database, post-processed, and exposed as HTML for human consumption, and
as XML through a web service interface. As an intermediate format to facilitate
statistical computations, CSV is generated internally. To interlink the
OpenAIRE data with related data on the Web, we aim at exporting them as Linked
Open Data (LOD). The LOD export is required to integrate into the overall data
processing workflow, where derived data are regenerated from the base data
every day. We thus faced the challenge of identifying the best-performing
conversion approach. We evaluated the performance of creating LOD by a
MapReduce job on top of HBase, by mapping the intermediate CSV files, and by
mapping the XML output. Comment: Accepted in 0th Metadata and Semantics Research Conference
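One of the three conversion routes compared above, mapping the intermediate CSV to Linked Data, can be sketched as turning each CSV row into RDF triples in N-Triples syntax. The field names and the vocabulary URIs below are illustrative assumptions, not the actual OpenAIRE schema.

```python
# Hedged sketch: map one CSV row of project metadata to N-Triples lines.
import csv
import io

def row_to_ntriples(row):
    """Emit N-Triples statements for a single project record (toy vocabulary)."""
    s = f"<http://example.org/project/{row['id']}>"
    return [
        f'{s} <http://purl.org/dc/terms/title> "{row["title"]}" .',
        f'{s} <http://example.org/vocab/funder> "{row["funder"]}" .',
    ]

data = io.StringIO("id,title,funder\n42,Open Science Graph,EC H2020\n")
triples = [t for row in csv.DictReader(data) for t in row_to_ntriples(row)]
```

In the HBase route the same per-record mapping would run inside a MapReduce map function, which is why the paper could compare the three approaches on identical output.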
Semantic Data Management in Data Lakes
In recent years, data lakes emerged as a way to manage large amounts of
heterogeneous data for modern data analytics. One way to prevent data lakes
from turning into inoperable data swamps is semantic data management. Some
approaches propose the linkage of metadata to knowledge graphs based on the
Linked Data principles to provide more meaning and semantics to the data in the
lake. Such a semantic layer may be utilized not only for data management but
also to tackle the problem of data integration from heterogeneous sources, in
order to make data access more expressive and interoperable. In this survey, we
review recent approaches with a specific focus on the application within data
lake systems and scalability to Big Data. We classify the approaches into (i)
basic semantic data management, (ii) semantic modeling approaches for enriching
metadata in data lakes, and (iii) methods for ontology-based data access. In
each category, we cover the main techniques and their background, and compare
latest research. Finally, we point out challenges for future work in this
research area, which needs a closer integration of Big Data and Semantic Web
technologies.
An introduction to Graph Data Management
A graph database is a database where the data structures for the schema
and/or instances are modeled as a (labeled) (directed) graph or generalizations
of it, and where querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give a historical overview of their main developments, and study the main current
systems that implement them.
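The definition above, data modeled as a labeled, directed graph and queried with graph-oriented operations, can be illustrated with a toy example; the edge labels and the traversal helper below are purely illustrative.

```python
# Toy labeled, directed graph as a set of (source, label, target) edges,
# queried with a graph-oriented operation: labeled-edge traversal.

def neighbors(edges, node, label):
    """All targets reachable from `node` via one edge carrying `label`."""
    return {t for (s, l, t) in edges if s == node and l == label}

g = {("alice", "knows", "bob"),
     ("bob", "knows", "carol"),
     ("alice", "worksAt", "acme")}

friends = neighbors(g, "alice", "knows")   # {"bob"}
```

Real graph databases expose richer operators (path expressions, pattern matching), but they compose from exactly this kind of labeled traversal.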
A Survey on Mapping Semi-Structured Data and Graph Data to Relational Data
The data produced by various services should be stored and managed in an appropriate format for gaining valuable knowledge conveniently. This has led to the emergence of various data models, including relational, semi-structured, and graph models. Considering that the mature relational databases established on relational data models are still predominant in today's market, there is growing interest in storing and processing semi-structured data and graph data in relational databases, so that the capabilities of mature and powerful relational databases can be applied to these various kinds of data. In this survey, we review existing methods for mapping semi-structured data and graph data into relational tables, analyze their major features, and give a detailed classification of those methods. We also summarize the merits and demerits of each method, introduce open research challenges, and present future research directions. With this comprehensive investigation of existing methods and open problems, we hope this survey can motivate new mapping approaches through drawing lessons from each model's mapping strategies, as well as a new research topic: mapping multi-model data into relational tables. Peer reviewed
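The simplest of the graph-to-relational mappings such surveys cover is the generic schema: one node table and one edge table. A minimal sketch, with illustrative column names and toy data:

```python
# Generic graph-to-relational mapping: a node table and an edge table.
# Each table is modeled here as a list of row dicts (stand-ins for SQL rows).

def graph_to_tables(nodes, edges):
    """nodes: {node_id: label}; edges: list of (src, edge_type, dst)."""
    node_table = [{"id": n, "label": lbl} for n, lbl in nodes.items()]
    edge_table = [{"src": s, "type": t, "dst": d} for (s, t, d) in edges]
    return node_table, edge_table

nodes = {"n1": "Person", "n2": "Person"}
edges = [("n1", "knows", "n2")]
node_table, edge_table = graph_to_tables(nodes, edges)
```

This generic schema trades query performance (traversals become self-joins on the edge table) for schema stability, which is precisely the kind of trade-off a classification of mapping methods has to weigh.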
Online Index Extraction from Linked Open Data Sources
The production of machine-readable data in the form of RDF datasets belonging to the Linked Open Data (LOD) Cloud is growing very fast. However, selecting relevant knowledge sources from the Cloud, assessing their quality, and extracting synthetic information from a LOD source are all tasks that require strong human effort. This paper proposes an approach for automatically extracting the most representative information from a LOD source and creating a set of indexes that enhance the description of the dataset. These indexes collect statistical information regarding the size and complexity of the dataset (e.g., the number of instances), but also depict all the instantiated classes and the properties among them, supplying users with a synthetic view of the LOD source. The technique is fully implemented in LODeX, a tool able to deal with the performance issues of systems that expose SPARQL endpoints and to cope with the heterogeneity in the knowledge representation of RDF data. An evaluation of LODeX on a large number of endpoints (244) belonging to the LOD Cloud has been performed, and the effectiveness of the index extraction process is presented.
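The statistical part of such an index, e.g., how many instances each class has, boils down to aggregating rdf:type assertions. In practice this runs as SPARQL against the endpoint; the following plain-Python sketch over illustrative triples shows the computation itself.

```python
# Minimal sketch of one statistical index: instance counts per class,
# computed from rdf:type triples (toy data, not a real endpoint).
from collections import Counter

RDF_TYPE = "rdf:type"

def class_instance_counts(triples):
    """Count subjects per class over (subject, predicate, object) triples."""
    return Counter(o for (s, p, o) in triples if p == RDF_TYPE)

triples = [("d1", RDF_TYPE, "Dataset"),
           ("d2", RDF_TYPE, "Dataset"),
           ("p1", RDF_TYPE, "Person")]
counts = class_instance_counts(triples)
```

The equivalent SPARQL would be a `GROUP BY` over `?s rdf:type ?class`; doing it client-side in batches is one way a tool can work around slow or limited endpoints.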
Cloud-based solutions supporting data and knowledge integration in bioinformatics
In recent years, computer advances have changed the way science progresses and have
boosted in silico studies; as a result, the concept of "scientific research" in bioinformatics
has quickly changed, shifting from the idea of a local laboratory activity towards Web
applications and databases provided over the network as services. Thus, biologists have
become among the largest beneficiaries of information technology, reaching and
surpassing the traditional ICT users who operate in the field of so-called "hard science"
(i.e., physics, chemistry, and mathematics). Nevertheless, this evolution has to deal with
several aspects (including data deluge, data integration, and scientific collaboration, just to
cite a few) and presents new challenges related to the proposal of innovative approaches in
the wide scenario of emergent ICT solutions.
This thesis aims at facing these challenges in the context of three case studies, each
devoted to a specific open issue and proposing solutions in
line with recent advances in computer science.
The first case study focuses on the task of unearthing and integrating information from
different web resources, each having its own organization, terminology, and data formats, in
order to provide users with a flexible environment for accessing the above resources and
smartly exploring their content. The study explores the potential of the cloud paradigm as an
enabling technology to curtail the issues associated with the scalability and performance
of applications devoted to supporting this task. Specifically, it presents Biocloud Search
EnGene (BSE), a cloud-based application which allows for searching and integrating
biological information made available by public large-scale genomic repositories. BSE is
publicly available at: http://biocloud-unica.appspot.com/.
The second case study addresses scientific collaboration on the Web with special focus
on building a semantic network, where team members, adequately supported by easy
access to biomedical ontologies, define and enrich network nodes with annotations derived
from available ontologies. The study presents a cloud-based application called
Collaborative Workspaces in Biomedicine (COWB), which supports users in
constructing the semantic network by organizing, retrieving, and creating
connections between contents of different types. Public and private workspaces provide an
accessible representation of the collective knowledge that is incrementally expanded.
COWB is publicly available at: http://cowb-unica.appspot.com/.
Finally, the third case study concerns the knowledge extraction from very large datasets.
The study investigates the performance of random forests in classifying microarray data. In
particular, the study faces the problem of reducing the contribution of trees whose nodes
are populated by non-informative features. Experiments are presented and results are then
analyzed in order to draw guidelines on how to reduce that contribution.
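The idea of reducing the contribution of trees built on non-informative features can be sketched as a weighted vote, where a tree's vote counts less when its nodes rely on weak features. The stub "trees" below (a predicted label plus an informativeness score), the weights, and the threshold are all illustrative assumptions, not the study's actual method.

```python
# Toy weighted majority vote: down-weight trees whose nodes use
# non-informative features (informativeness below a threshold).

def weighted_majority(trees, threshold=0.5):
    """trees: list of (predicted_label, informativeness) stubs."""
    votes = {}
    for label, informativeness in trees:
        weight = 1.0 if informativeness >= threshold else 0.25
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Three weak (non-informative) trees vote "healthy"; two strong trees
# vote "diseased". Down-weighting flips the plain majority.
trees = [("healthy", 0.1)] * 3 + [("diseased", 0.9)] * 2
winner = weighted_majority(trees)
```

In a real forest on microarray data the informativeness score would come from per-tree feature importance rather than being given, but the voting mechanics are the same.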
With respect to the previously mentioned challenges, this thesis sets out to give two
contributions summarized as follows. First, the potential of cloud technologies has been
evaluated for developing applications that support the access to bioinformatics resources
and collaboration by improving awareness of users' contributions and fostering user
interaction. Second, the positive impact of the decision support offered by random forests
has been demonstrated in effectively tackling the curse of dimensionality.