NORA: Scalable OWL reasoner based on NoSQL databases and Apache Spark
Reasoning is the process of inferring new knowledge and identifying inconsistencies within ontologies. Traditional techniques often prove inadequate when reasoning over large Knowledge Bases containing millions or billions of facts. This article introduces NORA, a persistent and scalable OWL reasoner built on top of Apache Spark, designed to address the challenges of reasoning over extensive and complex ontologies. NORA exploits the scalability of NoSQL databases to effectively apply inference rules to Big Data ontologies with large ABoxes. To facilitate scalable reasoning, OWL data, including class and property hierarchies and instances, are materialized in the Apache Cassandra database. Spark programs are then evaluated iteratively, uncovering new implicit knowledge from the dataset and leading to enhanced performance and more efficient reasoning over large-scale ontologies. NORA has undergone a thorough evaluation with different benchmarking ontologies of varying sizes to assess the scalability of the developed solution. Funding for open access charge: Universidad de Málaga / CBUA
This work has been partially funded by grant PID2020-112540RB-C41 (funded by MCIN/AEI/10.13039/501100011033/), AETHER-UMA (A smart data holistic approach for context-aware data analytics: semantics and context exploitation). Antonio Benítez-Hidalgo is supported by Grant PRE2018-084280 (Spanish Ministry of Science, Innovation and Universities).
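The iterative Spark evaluation the abstract describes can be pictured as a fixpoint computation: an inference rule is applied repeatedly until no new facts are derived. The following is a minimal plain-Python sketch (not NORA's actual Spark code) using transitivity of rdfs:subClassOf as the example rule; the function name and the sample hierarchy are illustrative.

```python
# Fixpoint-style rule application: derive new facts until nothing changes.
# In NORA this runs as iterated Spark jobs over Cassandra-backed data;
# here a plain set of (subclass, superclass) pairs stands in for the ABox/TBox.

def infer_subclass_closure(facts):
    """Return the transitive closure of a set of (sub, super) pairs."""
    derived = set(facts)
    while True:
        new = {(a, c)
               for (a, b) in derived
               for (b2, c) in derived
               if b == b2 and (a, c) not in derived}
        if not new:          # fixpoint reached: no implicit knowledge left
            return derived
        derived |= new       # the Spark version would union RDDs/DataFrames

hierarchy = {("Dog", "Mammal"), ("Mammal", "Animal"), ("Animal", "Thing")}
closure = infer_subclass_closure(hierarchy)
# ("Dog", "Animal") and ("Dog", "Thing") are now derived facts
```

The semi-naive optimization used by most distributed reasoners would join only the facts added in the previous round rather than the whole set, but the termination condition is the same.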
Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML
OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a
database of all EC FP7 and H2020 funded research projects, including metadata
of their results (publications and datasets). These data are stored in an HBase
NoSQL database, post-processed, and exposed as HTML for human consumption, and
as XML through a web service interface. As an intermediate format to facilitate
statistical computations, CSV is generated internally. To interlink the
OpenAIRE data with related data on the Web, we aim at exporting them as Linked
Open Data (LOD). The LOD export is required to integrate into the overall data
processing workflow, where derived data are regenerated from the base data
every day. We thus faced the challenge of identifying the best-performing
conversion approach. We evaluated the performance of creating LOD by a
MapReduce job on top of HBase, by mapping the intermediate CSV files, and by
mapping the XML output. Comment: Accepted in 0th Metadata and Semantics Research Conference
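One of the three conversion routes compared above, mapping the intermediate CSV to Linked Data, can be sketched as turning each CSV row into RDF triples in N-Triples syntax. The field names and the vocabulary URIs below are illustrative assumptions, not the actual OpenAIRE schema.

```python
# Hedged sketch: map one CSV row of project metadata to N-Triples lines.
import csv
import io

def row_to_ntriples(row):
    """Emit N-Triples statements for a single project record (toy vocabulary)."""
    s = f"<http://example.org/project/{row['id']}>"
    return [
        f'{s} <http://purl.org/dc/terms/title> "{row["title"]}" .',
        f'{s} <http://example.org/vocab/funder> "{row["funder"]}" .',
    ]

data = io.StringIO("id,title,funder\n42,Open Science Graph,EC H2020\n")
triples = [t for row in csv.DictReader(data) for t in row_to_ntriples(row)]
```

In the HBase route the same per-record mapping would run inside a MapReduce map function, which is why the paper could compare the three approaches on identical output.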
Semantic Data Management in Data Lakes
In recent years, data lakes emerged as a way to manage large amounts of
heterogeneous data for modern data analytics. One way to prevent data lakes
from turning into inoperable data swamps is semantic data management. Some
approaches propose the linkage of metadata to knowledge graphs based on the
Linked Data principles to provide more meaning and semantics to the data in the
lake. Such a semantic layer may be utilized not only for data management but
also to tackle the problem of data integration from heterogeneous sources, in
order to make data access more expressive and interoperable. In this survey, we
review recent approaches with a specific focus on the application within data
lake systems and scalability to Big Data. We classify the approaches into (i)
basic semantic data management, (ii) semantic modeling approaches for enriching
metadata in data lakes, and (iii) methods for ontology-based data access. In
each category, we cover the main techniques and their background, and compare
latest research. Finally, we point out challenges for future work in this
research area, which needs a closer integration of Big Data and Semantic Web
technologies.
An introduction to Graph Data Management
A graph database is a database where the data structures for the schema
and/or instances are modeled as a (labeled) (directed) graph or generalizations
of it, and where querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give a historical overview of their main developments, and study the main current
systems that implement them.
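The definition above, data modeled as a labeled, directed graph and queried with graph-oriented operations, can be illustrated with a toy example; the edge labels and the traversal helper below are purely illustrative.

```python
# Toy labeled, directed graph as a set of (source, label, target) edges,
# queried with a graph-oriented operation: labeled-edge traversal.

def neighbors(edges, node, label):
    """All targets reachable from `node` via one edge carrying `label`."""
    return {t for (s, l, t) in edges if s == node and l == label}

g = {("alice", "knows", "bob"),
     ("bob", "knows", "carol"),
     ("alice", "worksAt", "acme")}

friends = neighbors(g, "alice", "knows")   # {"bob"}
```

Real graph databases expose richer operators (path expressions, pattern matching), but they compose from exactly this kind of labeled traversal.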
A Survey on Mapping Semi-Structured Data and Graph Data to Relational Data
The data produced by various services should be stored and managed in an appropriate format for gaining valuable knowledge conveniently. This has led to the emergence of various data models, including relational, semi-structured, and graph models. Considering that the mature relational databases established on relational data models are still predominant in today's market, there is growing interest in storing and processing semi-structured data and graph data in relational databases, so that the capabilities of mature and powerful relational databases can be applied to these various kinds of data. In this survey, we review existing methods for mapping semi-structured data and graph data into relational tables, analyze their major features, and give a detailed classification of those methods. We also summarize the merits and demerits of each method, introduce open research challenges, and present future research directions. With this comprehensive investigation of existing methods and open problems, we hope this survey can motivate new mapping approaches through drawing lessons from each model's mapping strategies, as well as a new research topic: mapping multi-model data into relational tables. Peer reviewed
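The simplest of the graph-to-relational mappings such surveys cover is the generic schema: one node table and one edge table. A minimal sketch, with illustrative column names and toy data:

```python
# Generic graph-to-relational mapping: a node table and an edge table.
# Each table is modeled here as a list of row dicts (stand-ins for SQL rows).

def graph_to_tables(nodes, edges):
    """nodes: {node_id: label}; edges: list of (src, edge_type, dst)."""
    node_table = [{"id": n, "label": lbl} for n, lbl in nodes.items()]
    edge_table = [{"src": s, "type": t, "dst": d} for (s, t, d) in edges]
    return node_table, edge_table

nodes = {"n1": "Person", "n2": "Person"}
edges = [("n1", "knows", "n2")]
node_table, edge_table = graph_to_tables(nodes, edges)
```

This generic schema trades query performance (traversals become self-joins on the edge table) for schema stability, which is precisely the kind of trade-off a classification of mapping methods has to weigh.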
Online Index Extraction from Linked Open Data Sources
The production of machine-readable data in the form of RDF datasets belonging to the Linked Open Data (LOD) Cloud is growing very fast. However, selecting relevant knowledge sources from the Cloud, assessing their quality, and extracting synthetic information from a LOD source are all tasks that require strong human effort. This paper proposes an approach for automatically extracting the most representative information from a LOD source and creating a set of indexes that enhance the description of the dataset. These indexes collect statistical information regarding the size and complexity of the dataset (e.g., the number of instances), but also depict all the instantiated classes and the properties among them, supplying users with a synthetic view of the LOD source. The technique is fully implemented in LODeX, a tool able to deal with the performance issues of systems that expose SPARQL endpoints and to cope with the heterogeneity in the knowledge representation of RDF data. An evaluation of LODeX on a large number of endpoints (244) belonging to the LOD Cloud has been performed, and the effectiveness of the index extraction process is presented.
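The statistical part of such an index, e.g., how many instances each class has, boils down to aggregating rdf:type assertions. In practice this runs as SPARQL against the endpoint; the following plain-Python sketch over illustrative triples shows the computation itself.

```python
# Minimal sketch of one statistical index: instance counts per class,
# computed from rdf:type triples (toy data, not a real endpoint).
from collections import Counter

RDF_TYPE = "rdf:type"

def class_instance_counts(triples):
    """Count subjects per class over (subject, predicate, object) triples."""
    return Counter(o for (s, p, o) in triples if p == RDF_TYPE)

triples = [("d1", RDF_TYPE, "Dataset"),
           ("d2", RDF_TYPE, "Dataset"),
           ("p1", RDF_TYPE, "Person")]
counts = class_instance_counts(triples)
```

The equivalent SPARQL would be a `GROUP BY` over `?s rdf:type ?class`; doing it client-side in batches is one way a tool can work around slow or limited endpoints.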
Cloud-based solutions supporting data and knowledge integration in bioinformatics
In recent years, computer advances have changed the way science progresses and have
boosted in silico studies; as a result, the concept of "scientific research" in bioinformatics
has quickly changed, shifting from the idea of a local laboratory activity towards Web
applications and databases provided over the network as services. Thus, biologists have
become among the largest beneficiaries of information technology, reaching and
surpassing the traditional ICT users who operate in the field of so-called "hard science"
(i.e., physics, chemistry, and mathematics). Nevertheless, this evolution has to deal with
several aspects (including data deluge, data integration, and scientific collaboration, just to
cite a few) and presents new challenges related to the proposal of innovative approaches in
the wide scenario of emergent ICT solutions.
This thesis aims at facing these challenges in the context of three case studies, each
devoted to a specific open issue and proposing solutions in
line with recent advances in computer science.
The first case study focuses on the task of unearthing and integrating information from
different web resources, each having its own organization, terminology, and data formats, in
order to provide users with a flexible environment for accessing the above resources and
smartly exploring their content. The study explores the potential of the cloud paradigm as an
enabling technology to curtail the issues associated with the scalability and performance
of applications devoted to supporting this task. Specifically, it presents Biocloud Search
EnGene (BSE), a cloud-based application which allows for searching and integrating
biological information made available by public large-scale genomic repositories. BSE is
publicly available at: http://biocloud-unica.appspot.com/.
The second case study addresses scientific collaboration on the Web with special focus
on building a semantic network, where team members, adequately supported by easy
access to biomedical ontologies, define and enrich network nodes with annotations derived
from available ontologies. The study presents a cloud-based application called
Collaborative Workspaces in Biomedicine (COWB), which supports users in
constructing the semantic network by organizing, retrieving, and creating
connections between contents of different types. Public and private workspaces provide an
accessible representation of the collective knowledge that is incrementally expanded.
COWB is publicly available at: http://cowb-unica.appspot.com/.
Finally, the third case study concerns the knowledge extraction from very large datasets.
The study investigates the performance of random forests in classifying microarray data. In
particular, the study faces the problem of reducing the contribution of trees whose nodes
are populated by non-informative features. Experiments are presented and results are then
analyzed in order to draw guidelines on how to reduce that contribution.
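The idea of reducing the contribution of trees built on non-informative features can be sketched as a weighted vote, where a tree's vote counts less when its nodes rely on weak features. The stub "trees" below (a predicted label plus an informativeness score), the weights, and the threshold are all illustrative assumptions, not the study's actual method.

```python
# Toy weighted majority vote: down-weight trees whose nodes use
# non-informative features (informativeness below a threshold).

def weighted_majority(trees, threshold=0.5):
    """trees: list of (predicted_label, informativeness) stubs."""
    votes = {}
    for label, informativeness in trees:
        weight = 1.0 if informativeness >= threshold else 0.25
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Three weak (non-informative) trees vote "healthy"; two strong trees
# vote "diseased". Down-weighting flips the plain majority.
trees = [("healthy", 0.1)] * 3 + [("diseased", 0.9)] * 2
winner = weighted_majority(trees)
```

In a real forest on microarray data the informativeness score would come from per-tree feature importance rather than being given, but the voting mechanics are the same.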
With respect to the previously mentioned challenges, this thesis sets out to give two
contributions summarized as follows. First, the potential of cloud technologies has been
evaluated for developing applications that support the access to bioinformatics resources
and collaboration by improving awareness of users' contributions and fostering user
interaction. Second, the positive impact of the decision support offered by random forests
has been demonstrated in effectively tackling the curse of dimensionality.