33 research outputs found
Semantic Data Management in Data Lakes
In recent years, data lakes emerged as away to manage large amounts of
heterogeneous data for modern data analytics. One way to prevent data lakes
from turning into inoperable data swamps is semantic data management. Some
approaches propose the linkage of metadata to knowledge graphs based on the
Linked Data principles to provide more meaning and semantics to the data in the
lake. Such a semantic layer may be utilized not only for data management but
also to tackle the problem of data integration from heterogeneous sources, in
order to make data access more expressive and interoperable. In this survey, we
review recent approaches with a specific focus on the application within data
lake systems and scalability to Big Data. We classify the approaches into (i)
basic semantic data management, (ii) semantic modeling approaches for enriching
metadata in data lakes, and (iii) methods for ontologybased data access. In
each category, we cover the main techniques and their background, and compare
latest research. Finally, we point out challenges for future work in this
research area, which needs a closer integration of Big Data and Semantic Web
technologies
GeoTriples: Transforming geospatial data into RDF graphs using R2RML and RML mappings
A lot of geospatial data has become available at no charge in many countries recently. Geospatial data that is currently made available by government agencies usually do not follow the linked data paradigm. In the few cases where government agencies do follow the linked data paradigm (e.g., Ordnance Survey in the United Kingdom), specialized scripts have been used for transforming geospatial data into RDF. In this paper we present the open source tool GeoTriples which generates and processes extended R2RML and RML mappings that transform geospatial data from many input formats into RDF. GeoTriples allows the transformation of geospatial data stored in raw files (shapefiles, CSV, KML, XML, GML and GeoJSON) and spatially-enabled RDBMS (PostGIS and MonetDB) into RDF graphs using well-known vocabularies like GeoSPARQL and stSPARQL, but without being tightly coupled to a specific vocabulary. GeoTriples has been developed in European projects LEO and Melodies and has been used to transform many geospatial data sources into linked data. We study the performance of GeoTriples experimentally using large publicly available geospatial datasets, and show that GeoTriples is very efficient and scalable especially when its mapping processor is implemented using Apache Hadoop
Streamlining Knowledge Graph Construction with a fa\c{c}ade: The SPARQL Anything project
What should a data integration framework for knowledge engineers look like?
Recent research on Knowledge Graph construction proposes the design of a
fa\c{c}ade, a notion borrowed from object-oriented software engineering. This
idea is applied to SPARQL Anything, a system that allows querying heterogeneous
resources as-if they were in RDF, in plain SPARQL 1.1, by overloading the
SERVICE clause. SPARQL Anything supports a wide variety of file formats, from
popular ones (CSV, JSON, XML, Spreadsheets) to others that are not supported by
alternative solutions (Markdown, YAML, DOCx, Bibtex). Features include querying
Web APIs with high flexibility, parametrised queries, and chaining multiple
transformations into complex pipelines. In this paper, we describe the design
rationale and software architecture of the SPARQL Anything system. We provide
references to an extensive set of reusable, real-world scenarios from various
application domains. We report on the value-to-users of the founding
assumptions of its design, compared to alternative solutions through a
community survey and a field report from the industry.Comment: 15 page
A Mapping-based Method to Query MongoDB Documents with SPARQL
International audienceAccessing legacy data as virtual RDF stores is a key issue in the building of the Web of Data. In recent years, the MongoDB database has become a popular actor in the NoSQL market, making it a significant potential contributor to the Web of Linked Data. Therefore, in this paper we address the question of how to access arbitrary MongoDB documents with SPARQL. We propose a two-step method to (i) translate a SPARQL query into a pivot abstract query under MongoDB-to-RDF mappings represented in the xR2RML language, then (ii) translate the pivot query into a concrete MongoDB query. We elaborate on the discrepancy between the expressiveness of SPARQL and the MongoDB query language, and we show that we can always come up with a rewriting that shall produce all correct answers
Enabling Complex Semantic Queries to Bioinformatics Databases through Intuitive Search Over Data
Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data already available publicly. However, the heterogene- ity of the existing data sources still poses significant challenges for achieving interoperability among biological databases. Furthermore, merely solving the technical challenges of data in- tegration, for example through the use of common data representation formats, leaves open the larger problem. Namely, the steep learning curve required for understanding the data models of each public source, as well as the technical language through which the sources can be queried and joined. As a consequence, most of the available biological data remain practically unexplored today.
In this thesis, we address these problems jointly, by first introducing an ontology-based data integration solution in order to mitigate the data source heterogeneity problem. We illustrate through the concrete example of Bgee, a gene expression data source, how relational databases can be exposed as virtual Resource Description Framework (RDF) graphs, through relational-to-RDF mappings. This has the important advantage that the original data source can remain unmodified, while still becoming interoperable with external RDF sources.
We complement our methods with applied case studies designed to guide domain experts in formulating expressive federated queries targeting the integrated data across the domains of evolutionary relationships and gene expression. More precisely, we introduce two com- parative analyses, first within the same domain (using orthology data from multiple, inter- operable, data sources) and second across domains, in order to study the relation between expression change and evolution rate following a duplication event.
Finally, in order to bridge the semantic gap between users and data, we design and im- plement Bio-SODA, a question answering system over domain knowledge graphs, that does not require training data for translating user questions to SPARQL. Bio-SODA uses a novel ranking approach that combines syntactic and semantic similarity, while also incorporating node centrality metrics to rank candidate matches for a given user question. Our results in testing Bio-SODA across several real-world databases that span multiple domains (both within and outside bioinformatics) show that it can answer complex, multi-fact queries, be- yond the current state-of-the-art in the more well-studied open-domain question answering.
--
LâintĂ©gration des donnĂ©es promet dâĂȘtre lâun des principaux catalyseurs permettant dâextraire des nouveaux aperçus de la richesse des donnĂ©es biologiques dĂ©jĂ disponibles publiquement. Cependant, lâhĂ©tĂ©rogĂ©nĂ©itĂ© des sources de donnĂ©es existantes pose encore des dĂ©fis importants pour parvenir Ă lâinteropĂ©rabilitĂ© des bases de donnĂ©es biologiques. De plus, en surmontant seulement les dĂ©fis techniques de lâintĂ©gration des donnĂ©es, par exemple grĂące Ă lâutilisation de formats standard de reprĂ©sentation de donnĂ©es, on laisse ouvert un problĂšme encore plus grand. Ă savoir, la courbe dâapprentissage abrupte nĂ©cessaire pour comprendre la modĂ©li- sation des donnĂ©es choisie par chaque source publique, ainsi que le langage technique par lequel les sources peuvent ĂȘtre interrogĂ©s et jointes. Par consĂ©quent, la plupart des donnĂ©es biologiques publiquement disponibles restent pratiquement inexplorĂ©s aujourdâhui.
Dans cette thĂšse, nous abordons lâensemble des deux problĂšmes, en introduisant dâabord une solution dâintĂ©gration de donnĂ©es basĂ©e sur ontologies, afin dâattĂ©nuer le problĂšme dâhĂ©tĂ©- rogĂ©nĂ©itĂ© des sources de donnĂ©es. Nous montrons, Ă travers lâexemple de Bgee, une base de donnĂ©es dâexpression de gĂšnes, une approche permettant les bases de donnĂ©es relationnelles dâĂȘtre publiĂ©s sous forme de graphes RDF (Resource Description Framework) virtuels, via des correspondances relationnel-vers-RDF (« relational-to-RDF mappings »). Cela prĂ©sente lâimportant avantage que la source de donnĂ©es dâorigine peut rester inchangĂ©, tout en de- venant interopĂ©rable avec les sources RDF externes.
Nous complĂ©tons nos mĂ©thodes avec des Ă©tudes de cas appliquĂ©es, conçues pour guider les experts du domaine dans la formulation de requĂȘtes fĂ©dĂ©rĂ©es expressives, ciblant les don- nĂ©es intĂ©grĂ©es dans les domaines des relations Ă©volutionnaires et de lâexpression des gĂšnes. Plus prĂ©cisĂ©ment, nous introduisons deux analyses comparatives, dâabord dans le mĂȘme do- maine (en utilisant des donnĂ©es dâorthologie provenant de plusieurs sources de donnĂ©es in- teropĂ©rables) et ensuite Ă travers des domaines interconnectĂ©s, afin dâĂ©tudier la relation entre le changement dâexpression et le taux dâĂ©volution suite Ă une duplication de gĂšne.
Enfin, afin de mitiger le dĂ©calage sĂ©mantique entre les utilisateurs et les donnĂ©es, nous concevons et implĂ©mentons Bio-SODA, un systĂšme de rĂ©ponse aux questions sur des graphes de connaissances domaine-spĂ©cifique, qui ne nĂ©cessite pas de donnĂ©es de formation pour traduire les questions des utilisateurs vers SPARQL. Bio-SODA utilise une nouvelle ap- proche de classement qui combine la similaritĂ© syntactique et sĂ©mantique, tout en incorporant des mĂ©triques de centralitĂ© des nĆuds, pour classer les possibles candidats en rĂ©ponse Ă une question utilisateur donnĂ©e. Nos rĂ©sultats suite aux tests effectuĂ©s en utilisant Bio-SODA sur plusieurs bases de donnĂ©es Ă travers plusieurs domaines (tantĂŽt liĂ©s Ă la bioinformatique quâextĂ©rieurs) montrent que Bio-SODA rĂ©ussit Ă rĂ©pondre Ă des questions complexes, en- gendrant multiples entitĂ©s, au-delĂ de lâĂ©tat actuel de la technique en matiĂšre de systĂšmes de rĂ©ponses aux questions sur les donnĂ©es structures, en particulier graphes de connaissances