An introduction to Graph Data Management
A graph database is a database in which the schema and/or instances are
modeled as a (labeled, directed) graph or a generalization of it, and in
which querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give a historical overview of their main developments, and survey the main
current systems that implement them.
Thesaurus-based search in large heterogeneous collections
In cultural heritage, large virtual collections are coming into
existence. Such collections contain heterogeneous sets of metadata and
vocabulary concepts, originating from multiple sources. In the context
of the E-Culture demonstrator we have shown earlier that such virtual
collections can be effectively explored with keyword search and semantic
clustering. In this paper we describe the design rationale of ClioPatria,
an open-source system which provides APIs for scalable semantic graph
search. The use of ClioPatria’s search strategies is illustrated with a
realistic use case: searching for "Picasso". We discuss details of scalable
graph search and the required OWL reasoning functionalities, and show why
SPARQL queries are insufficient for solving the search problem.
ClioPatria: A SWI-Prolog Infrastructure for the Semantic Web
ClioPatria is a comprehensive semantic web development framework based on SWI-Prolog. SWI-Prolog provides an efficient C-based main-memory RDF store that is designed to cooperate naturally and efficiently with Prolog, realizing a flexible RDF-based environment for rule-based programming. ClioPatria extends this core with a SPARQL and LOD server; an extensible web frontend for managing the server, browsing the data, and querying it using SPARQL and Prolog; and a Git-based plugin manager. The ability to query RDF using Prolog provides query composition and smooth integration with application logic. ClioPatria is primarily positioned as a prototyping platform for exploring novel ways of reasoning with RDF data. It has been used in several research projects to perform tasks such as data integration, enrichment, and semantic search.
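The Prolog-side RDF querying the abstract describes can be sketched, very loosely, outside Prolog as well: the core operation is matching a triple pattern containing variables against a store of triples, yielding one variable binding per match. The sketch below mimics that idea with made-up data; it is not ClioPatria's actual API, and the "?"-prefix convention for variables is an assumption of this example.

```python
# Triple-pattern matching in the spirit of Prolog's rdf/3 predicate.
# Variables are strings starting with "?" (a convention of this sketch).
store = [
    ("ex:picasso", "rdf:type", "ex:Painter"),
    ("ex:guernica", "ex:creator", "ex:picasso"),
]

def match(store, pattern):
    """Yield one variable binding per stored triple that matches the pattern."""
    for triple in store:
        binding, ok = {}, True
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                if binding.get(pat, val) != val:  # repeated variable must agree
                    ok = False
                    break
                binding[pat] = val
            elif pat != val:
                ok = False
                break
        if ok:
            yield binding

print(list(match(store, ("?work", "ex:creator", "ex:picasso"))))
# [{'?work': 'ex:guernica'}]
```

Composing several such patterns and joining their bindings is what makes the Prolog embedding convenient for query composition.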
Strategies for Managing Linked Enterprise Data
Data, information, and knowledge have become key assets of our 21st-century economy. As a result, data and knowledge management have become key tasks with regard to sustainable development and business success. Often, knowledge is not explicitly represented: it resides in the minds of people or is scattered among a variety of data sources. Knowledge is inherently associated with semantics that convey its meaning to a human or machine agent. The Linked Data concept facilitates the semantic integration of heterogeneous data sources. However, we still lack an effective knowledge integration strategy applicable to enterprise scenarios, one that balances the large amounts of data stored in legacy information systems and data lakes with tailored domain-specific ontologies that formally describe real-world concepts. In this thesis we investigate strategies for managing linked enterprise data, analyzing how actionable knowledge can be derived from enterprise data by leveraging knowledge graphs. Actionable knowledge provides valuable insights, supports decision makers with clear, interpretable arguments, and keeps its inference processes explainable. The benefits of employing actionable knowledge and a coherent management strategy for it span from a holistic semantic representation layer of enterprise data, i.e., representing numerous data sources as one consistent and integrated knowledge source, to unified interaction mechanisms with other systems that are able to effectively and efficiently leverage such actionable knowledge. Several challenges have to be addressed on different conceptual levels in pursuing this goal, i.e., means for representing knowledge, semantic integration of raw data sources and subsequent knowledge extraction, communication interfaces, and implementation. In order to tackle those challenges, we present the concept of Enterprise Knowledge Graphs (EKGs) and describe their characteristics and advantages compared to existing approaches.
We study each challenge with regard to using EKGs and demonstrate their efficiency. In particular, EKGs are able to reduce the semantic data integration effort when processing large-scale heterogeneous datasets. Then, having built a consistent logical integration layer that hides heterogeneity behind the scenes, EKGs unify query processing and enable effective communication interfaces for other enterprise systems. The achieved results allow us to conclude that strategies for managing linked enterprise data based on EKGs exhibit reasonable performance, comply with enterprise requirements, and ensure integrated data and knowledge management throughout its life cycle.
Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake
Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process large volumes of data represented in different formats, schemas, and models, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources kept in their original format, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires a semantic description of the data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques enabling not only Big Data ingestion and curation into the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and to find efficient execution plans that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model in which the semantics encoded in the data sources are ignored.
Such descriptions may lead to the erroneous selection of data sources for a query and to the unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates (RDF-MTs), which describe the knowledge available in a Semantic Data Lake in terms of abstract descriptions of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to them. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions to express privacy and access control policies as well as to enforce them automatically during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but are also regulated by policies that allow accessing these entities. Finally, we tackle the problem of interest-based update propagation and co-evolution of data sources.
We present a novel approach for interest-based RDF update propagation that consistently maintains full or partial replications of large datasets and deals with co-evolution.
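The role of source descriptions in federated source selection can be sketched abstractly: if each source advertises the semantic concepts it exposes (in the spirit of RDF-MTs, though greatly simplified here), the engine can discard sources irrelevant to a query before contacting them. All source names and concepts below are hypothetical.

```python
# Hypothetical source descriptions: each source advertises the semantic
# concepts (cf. RDF Molecule Templates, heavily simplified) it can answer for.
source_descriptions = {
    "clinical_db": {"Patient", "Diagnosis"},
    "drug_api":    {"Drug", "SideEffect"},
    "trials_csv":  {"Patient", "Drug"},
}

def select_sources(descriptions, query_concepts):
    """Keep only sources overlapping the query, with their relevant concepts."""
    return {src: concepts & query_concepts
            for src, concepts in descriptions.items()
            if concepts & query_concepts}

# A query touching Patient and Drug skips drug-agnostic parts of each source.
print(select_sources(source_descriptions, {"Patient", "Drug"}))
```

A finer-grained description (properties per concept, links between concepts) is what lets approaches like MULDER also decompose the query, not just pick sources.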
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works after its introduction. This article
provides a comprehensive survey of a family of approaches and mechanisms for
large-scale data processing that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both the research and industrial communities. We also cover a set
of systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large-scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
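The programming model surveyed above reduces to two user-supplied functions; the framework handles distribution, shuffling, and fault tolerance. The classic word-count example, collapsed into a single process to show just the model:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit one (key, value) pair per word."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real cluster, map tasks run in parallel over input splits and the shuffle moves data across machines; the user code stays exactly this simple, which is the model's main appeal.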
Enabling Web-scale data integration in biomedicine through Linked Open Data
The biomedical data landscape is fragmented into several isolated, heterogeneous data and knowledge sources, which use varying formats, syntaxes, schemas, and entity notations, existing on the Web. Biomedical researchers face severe logistical and technical challenges to query, integrate, analyze, and visualize data from multiple diverse sources in the context of available biomedical knowledge. Semantic Web technologies and Linked Data principles may aid toward Web-scale semantic processing and data integration in biomedicine. The biomedical research community has been one of the earliest adopters of these technologies and principles to publish data and knowledge on the Web as linked graphs and ontologies, hence creating the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we provide our perspective on some opportunities proffered by the use of LSLOD to integrate biomedical data and knowledge in three domains: (1) pharmacology, (2) cancer research, and (3) infectious diseases. We discuss some of the major challenges that hinder the widespread use and consumption of LSLOD by the biomedical research community. Finally, we provide a few technical solutions and insights that can address these challenges. Eventually, LSLOD can enable the development of scalable, intelligent infrastructures that support artificial intelligence methods for augmenting human intelligence to achieve better clinical outcomes for patients, to enhance the quality of biomedical research, and to improve our understanding of living systems.
Semantic Web Based Relational Database Access With Conflict Resolution
This thesis focuses on (1) accessing relational databases through Semantic Web technologies and (2) resolving conflicts that usually arise when integrating data from heterogeneous source schemas and/or instances.
In the first part of the thesis, we present an approach to access relational databases using Semantic Web technologies. Our approach is built on top of the Ontop framework for Ontology-Based Data Access. It extracts both Ontop mappings and an equivalent OWL ontology from an existing database schema. End users can then access the underlying data source through SPARQL queries. The proposed approach takes into consideration the different relationships between the entities of the database schema when it extracts the mappings and the equivalent ontology. Instead of extracting a flat ontology that is an exact copy of the database schema, it extracts a rich ontology. The extracted ontology can also be used as an intermediary between a domain ontology and the underlying database schema. Our approach covers independent or master entities that have no foreign references, dependent or detailed entities whose foreign keys reference other entities, recursive entities that contain self references, binary join entities that relate two entities together, and n-ary join entities that map two or more entities in an n-ary relation. The implementation results indicate that the extracted Ontop mappings and ontology are accurate, i.e., end users can query all data (using SPARQL) from the underlying database source just as if they had written SQL queries.
In the second part, we present an overview of the conflict resolution approaches in both conventional data integration systems and collaborative data sharing communities. We focus on the latter as it supports the needs of scientific communities for data sharing and collaboration. We first introduce the purpose of the study and present a brief overview of data integration. Next, we discuss the problem of inconsistent data in conventional integration systems, and we summarize the conflict handling strategies used to handle such inconsistent data. Then we focus on the problem of conflict resolution in collaborative data sharing communities. A collaborative data sharing community is a group of users who agree to share a common database instance, such that all users have access to the shared instance and can add to, update, and extend it. We discuss related works that adopt different conflict resolution strategies in the area of collaborative data sharing, and we provide a comparison between them. We find that a Collaborative Data Sharing System (CDSS) can best support the needs of certain communities, such as scientific communities. We then discuss some open research opportunities to improve the efficiency and performance of the CDSS. Finally, we summarize our work so far towards achieving these open research directions.
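The extraction idea from the first part can be sketched at a very high level: walk the database schema and emit a class assertion per table, a data property per plain column, and an object property per foreign key. This is an illustrative toy with a hypothetical schema and generic triple templates; real Ontop mappings are written in Ontop's own mapping language.

```python
# Hypothetical schema metadata: table -> primary key, columns, foreign keys.
schema = {
    "employee": {"pk": "id", "columns": ["id", "name"],
                 "fks": {"dept_id": "department"}},
    "department": {"pk": "id", "columns": ["id", "title"], "fks": {}},
}

def extract_mappings(schema, base="http://example.org/"):
    """Emit (subject-template, predicate, object-template) triples per table."""
    mappings = []
    for table, meta in schema.items():
        subject = f"<{base}{table}/{{{meta['pk']}}}>"
        # one class per table
        mappings.append((subject, "rdf:type", f"<{base}{table.capitalize()}>"))
        # one data property per plain (non-key) column
        for col in meta["columns"]:
            if col != meta["pk"] and col not in meta["fks"]:
                mappings.append((subject, f"<{base}{table}#{col}>", f"{{{col}}}"))
        # one object property per foreign key, linking to the target entity
        for fk, target in meta["fks"].items():
            mappings.append((subject, f"<{base}{table}#{fk}>",
                             f"<{base}{target}/{{{fk}}}>"))
    return mappings

for m in extract_mappings(schema):
    print(m)
```

Distinguishing master, dependent, recursive, and join entities, as the thesis does, amounts to branching on the foreign-key structure before choosing which classes and properties to emit.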
Developing an RDF4J frontend
Master's dissertation (mestrado integrado) in Informatics Engineering.
A few years ago, data was kept isolated and not shared, preventing communication
between datasets. Today, data volumes are much larger, and in a world where
everything is connected, our data now follows the same trend.
The focus of data models has shifted from rigid, tabular structures such as the
relational model to models centered on relations. Knowledge graphs are the new
paradigm for representing and managing this new kind of information structure.
Along with the new paradigm, graph databases emerged to support the new requirements.
Despite the increasing interest in the field, only a few native solutions are available.
Most are under a commercial license, and the open-source options have very basic or
outdated interfaces, which keeps them out of reach for most end users.
In this thesis, we introduce the Open Web Ontobud and discuss its design and
development. Ontobud is a web application aimed at improving the interface for one
of the most fascinating and influential frameworks in this area: RDF4J. RDF4J is a
Java framework for storing, managing, and querying RDF triples.
Open Web Ontobud is an open-source RDF4J web frontend created to reduce the gap
between end-users and the RDF4J backend. We created a web interface that enables users
with a basic knowledge of OWL and SPARQL to explore ontologies via resource tables or
graphs and extract information from them with SPARQL queries. The interface aims to
remain intuitive, providing tooltips and help when needed, as well as some statistical data
in a readily available form.
Despite the frontend being the main focus, a backend and two databases are also used,
for a total of four components in the framework. For the best deployment experience,
Docker was used for its simplicity, allowing deployment in just a few commands. Each
component has a dedicated image, following a modular design and allowing the
components to be executed on separate machines if desired.
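A frontend like Ontobud ultimately talks to the RDF4J server over its REST protocol, in which a repository is queried at `/repositories/{id}` with a `query` parameter. A minimal sketch of building such a request, assuming a server at `localhost:8080` and a repository named `demo` (both assumptions of this example):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed server location; a real deployment would make this configurable.
RDF4J_SERVER = "http://localhost:8080/rdf4j-server"

def build_query_request(repository, sparql):
    """Build a SPARQL query request against an RDF4J repository endpoint."""
    url = (f"{RDF4J_SERVER}/repositories/{repository}?"
           + urlencode({"query": sparql}))
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = build_query_request("demo", "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10")
print(req.full_url)
# Send with urllib.request.urlopen(req) against a running RDF4J server.
```

Separating request construction from sending, as above, keeps the frontend testable without a live backend, which fits the modular, per-container design described in the abstract.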