Towards information profiling: data lake content metadata management

Author: Abelló Gamazo Alberto
Al-serafi Ayman Mounir Mohamed
Calders Toon
Romero Moral Óscar
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016

There is currently a burst of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment in order to make the data usable by its consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe their content. However, there is currently a lack of a systematic approach for this kind of metadata discovery and management. Thus, we propose a framework for the profiling of informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to effectively handle this. We demonstrate the alternative techniques and performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach.

Peer reviewed. Postprint (author's final draft).

UPCommons. Portal del coneixement obert de la UPC

XSPARQL: Traveling between the XML and RDF worlds - and avoiding the XSLT Pilgrimage

Author: Akhtar Waseem
Kopecky Jacek
Krennwallner Thomas
Polleres Axel
Publication date: 01/01/2008

Open Research Online (The Open University)

A Survey of Volunteered Open Geo-Knowledge Bases in the Semantic Web

Author: A. Ballatore
A. Buccella
A. Burton-Jones
A. Gangemi
A. Gore
A. Gómez-Pérez
A. Polleres
A. Schwering
A. Turner
B. Smith
C. Bizer
C. Jones
C. Keßler
C. Manning
C.B. Jones
D. Buscaldi
D. Coleman
D. Nadeau
D. Strasunskas
D. Sui
F. Baader
F. Fonseca
F. Giunchiglia
F. Harvey
F. Gey
F.J. Lopez-Pellicer
G. Bordogna
G. Fu
G. De Tré
G. Weikum
J. Giles
J. Goodwin
J. Howe
J. Leveling
K. Janowicz
L. Vaccari
L.L. Hill
M. Egenhofer
M. Goodchild
M. Grassi
M. Haklay
M. Kitsuregawa
M. Lutz
N. Choi
N. Guarino
P. Burrough
P. Magnus
P. Roget
P. Singh
P.D. Smart
R. Fouad
R. Rada
S. Auer
S. Freitas
S. Hahmann
S. Overell
S. Schade
S. Staab
S. Vaid
S. Winter
T. Berners-Lee
T. Mandl
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

Over the past decade, rapid advances in web technologies, coupled with innovative models of spatial data collection and consumption, have generated robust growth in geo-referenced information, resulting in spatial information overload. Increasing 'geographic intelligence' in traditional text-based information retrieval has become a prominent approach to respond to this issue and to fulfill users' spatial information needs. Numerous efforts in the Semantic Geospatial Web, Volunteered Geographic Information (VGI), and the Linking Open Data initiative have converged in a constellation of open knowledge bases, freely available online. In this article, we survey these open knowledge bases, focusing on their geospatial dimension. Particular attention is devoted to the crucial issue of the quality of geo-knowledge bases, as well as of crowdsourced data. A new knowledge base, the OpenStreetMap Semantic Network, is outlined as our contribution to this area. Research directions in information integration and Geographic Information Retrieval (GIR) are then reviewed, with a critical discussion of their current limitations and future prospects.

arXiv.org e-Print Archive

Crossref

Advances in Large-Scale RDF Data Management

Author: Boncz P.A.
Erling O.
Pham M.D.
Publication venue: Heidelberg
Publication date: 01/01/2014

One of the prime goals of the LOD2 project is improving the performance and scalability of RDF storage solutions so that the increasing amount of Linked Open Data (LOD) can be efficiently managed. Virtuoso has been chosen as the basic RDF store for the LOD2 project, and during the project it has been significantly improved by incorporating advanced relational database techniques from MonetDB and Vectorwise, turning it into a compressed column store with vectored execution. This has reduced the performance gap (“RDF tax”) between Virtuoso’s SQL and SPARQL query performance in a way that still respects the “schema-last” nature of RDF. However, lacking schema information, RDF database systems such as Virtuoso still cannot use advanced relational storage optimizations such as table partitioning or clustered indexes and have to execute SPARQL queries with many self-joins to a triple table, which leads to more join effort than needed in SQL systems. In this chapter, we first discuss the new column store techniques applied to Virtuoso and the enhancements in its cluster parallel version, and show its performance using the popular BSBM benchmark at the unsurpassed scale of 150 billion triples. Finally, we describe ongoing work in deriving an “emergent” relational schema from RDF data, which can help to close the performance gap between relational-based and RDF-based storage solutions.

VU Research Portal

Springer - Publisher Connector

CWI's Institutional Repository