Representing Dataset Quality Metadata using Multi-Dimensional Views
Data quality is commonly defined as fitness for use. The problem of
identifying quality of data is faced by many data consumers. Data publishers
often do not have the means to identify quality problems in their data. To make
the task for both stakeholders easier, we have developed the Dataset Quality
Ontology (daQ). daQ is a core vocabulary for representing the results of
quality benchmarking of a linked dataset. It represents quality metadata as
multi-dimensional and statistical observations using the Data Cube vocabulary.
Quality metadata are organised as a self-contained graph, which can, e.g., be
embedded into linked open datasets. We discuss the design considerations, give
examples for extending daQ by custom quality metrics, and present use cases
such as analysing data versions, browsing datasets by quality, and link
identification. We finally discuss how data cube visualisation tools enable
data publishers and consumers to better analyse the quality of their data.
Comment: Preprint of a paper submitted to the forthcoming SEMANTiCS 2014, 4-5
September 2014, Leipzig, Germany
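The core idea of daQ, modelling each quality measurement as a multi-dimensional observation, can be sketched in a few lines of Python. This is an illustrative sketch only: the class and the Turtle it emits use simplified stand-in property names, not the exact daQ/Data Cube vocabulary terms, and the dataset and metric URIs are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class QualityObservation:
    """One quality measurement, modelled like a Data Cube observation:
    dimensions (dataset, metric, date) plus the measured value."""
    dataset: str      # URI of the assessed dataset (hypothetical)
    metric: str       # URI of the quality metric (hypothetical)
    value: float      # computed quality score
    computed_on: date

    def to_turtle(self) -> str:
        # Emit an illustrative Turtle snippet; the property names are
        # simplified stand-ins for the real daQ/Data Cube terms.
        return (f"[] a qb:Observation ;\n"
                f"   daq:computedOn <{self.dataset}> ;\n"
                f"   daq:metric <{self.metric}> ;\n"
                f"   daq:value {self.value} ;\n"
                f"   dct:date \"{self.computed_on.isoformat()}\" .")

obs = QualityObservation(
    dataset="http://example.org/dataset/books",
    metric="http://example.org/metrics#DereferenceabilityMetric",
    value=0.87,
    computed_on=date(2014, 9, 4),
)
print(obs.to_turtle())
```

Because every measurement is a self-contained observation, a collection of them forms the self-contained quality-metadata graph the abstract describes, which can be embedded alongside the dataset itself.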
Scalable Quality Assessment of Linked Data
In a world where the information economy is booming, poor data quality can lead to adverse consequences, including social and economic problems such as a decrease in revenue. Furthermore, data-driven industries are not just relying on their own (proprietary) data silos, but are also continuously aggregating data from different sources. This aggregation could then be re-distributed back to “data lakes”. However, this data (including Linked Data) is not necessarily checked for its quality prior to its use. Large volumes of data are being exchanged in a standard and interoperable format between organisations and published as Linked Data to facilitate their re-use. Some organisations, such as government institutions, take a step further and open their data. The Linked Open Data Cloud is a witness to this. However, similar to data in data lakes, it is challenging to determine the quality of this heterogeneous data, and subsequently to make this information explicit to data consumers. Despite the availability of a number of tools and frameworks to assess Linked Data quality, the current solutions do not offer a holistic approach that enables both the assessment of datasets and also provides consumers with quality results that can then be used to find, compare and rank datasets’ fitness for use. In this thesis we investigate methods to assess the quality of (possibly large) linked datasets with the intent that data consumers can then use the assessment results to find datasets that are fit for use, that is, finding the right dataset for the task at hand.
Moreover, the benefits of quality assessment are two-fold: (1) data consumers do not need to blindly rely on subjective measures to choose a dataset, but can base their choice on multiple factors such as the intrinsic structure of the dataset, therefore fostering trust and reputation between publishers and consumers on more objective foundations; and (2) data publishers can be encouraged to improve their datasets so that they can be re-used more. Furthermore, our approach scales for large datasets. In this regard, we also look into improving the efficiency of quality metrics using various approximation techniques. The trade-off is that consumers do not get the exact quality value, but a close estimate that still provides the required guidance towards fitness for use. The central point of this thesis is not data quality improvement; nonetheless, we still need to understand what data quality means to the consumers who are searching for potential datasets. This thesis looks into the challenges of detecting quality problems in linked datasets and of presenting quality results in a standardised, machine-readable and interoperable format that agents can make sense of, helping human consumers identify datasets fit for use. Our proposed approach is consumer-centric: it looks into (1) making the assessment of quality as easy as possible, that is, allowing stakeholders, possibly non-experts, to identify and easily define quality metrics and to initiate the assessment; and (2) making results (quality metadata and quality reports) easy for stakeholders to understand, or at least interoperable with other systems to facilitate a possible data quality pipeline. Finally, our framework is used to assess the quality of a number of heterogeneous (large) linked datasets, where each assessment returns a quality metadata graph that can be consumed by agents as Linked Data.
In turn, these agents can intelligently interpret a dataset’s quality with regard to multiple dimensions and observations, and thus provide further insight to consumers regarding its fitness for use.
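The approximation idea mentioned above, trading an exact quality value for a close estimate, can be sketched with one standard technique: reservoir sampling over a triple stream. The thesis does not specify which approximation techniques it uses, so this is a generic illustration with made-up toy data, not the author's actual method.

```python
import random

def estimate_metric(triples, predicate, sample_size=1000, seed=42):
    """Estimate the fraction of triples satisfying `predicate` using a
    fixed-size reservoir sample, instead of holding the whole (possibly
    huge) dataset in memory. Returns an approximation of the exact ratio."""
    rng = random.Random(seed)
    reservoir = []
    for i, t in enumerate(triples):
        if i < sample_size:
            reservoir.append(t)
        else:
            j = rng.randint(0, i)       # classic reservoir-sampling step
            if j < sample_size:
                reservoir[j] = t
    return sum(1 for t in reservoir if predicate(t)) / len(reservoir)

# Toy stream of 100,000 "triples", 80% of which carry rdf:type.
triples = (("s%d" % i, "rdf:type" if i % 5 else "rdfs:label", "o")
           for i in range(100_000))
est = estimate_metric(triples, lambda t: t[1] == "rdf:type")
print(round(est, 2))  # close to the true ratio of 0.8
```

The estimate is within a few percent of the exact value while touching only a bounded sample, which is the kind of guidance-versus-precision trade-off the abstract describes.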
An intelligent linked data quality dashboard
This paper describes a new intelligent, data-driven dashboard for linked data quality assessment. The development goal was to assist data quality engineers in interpreting data quality problems found when evaluating a dataset using a metrics-based data quality assessment. This required construction of a graph linking the problematic things identified in the data, the assessment metrics and the source data. This context and supporting user interfaces help the user to understand data quality problems. An analysis widget also helped the user identify the root cause of multiple problems. This supported the user in identifying and prioritising the problems that need to be fixed in order to improve data quality. The dashboard was shown to be useful for users cleaning data. A user evaluation was performed with both expert and novice data quality engineers.
Semantic data ingestion for intelligent, value-driven big data analytics
In this position paper we describe a conceptual
model for intelligent Big Data analytics based on both semantic
and machine learning AI techniques (called AI ensembles). These
processes are linked to business outcomes by explicitly modelling
data value and using semantic technologies as the underlying
mode for communication between the diverse processes and
organisations creating AI ensembles. Furthermore, we show
how data governance can direct and enhance these ensembles
by providing recommendations and insights that ensure the generated
output produces the highest possible value for the
organisation.
Luzzu - A Framework for Linked Data Quality Assessment
With the increasing adoption and growth of the Linked Open Data cloud [9],
with RDFa, Microformats and other ways of embedding data into ordinary Web
pages, and with initiatives such as schema.org, the Web is currently being
complemented with a Web of Data. Thus, the Web of Data shares many
characteristics with the original Web of Documents, which also varies in
quality. This heterogeneity makes it challenging to determine the quality of
the data published on the Web and to subsequently make this information
explicit to data consumers. The main contribution of this article is LUZZU, a
quality assessment framework for Linked Open Data. Apart from providing quality
metadata and quality problem reports that can be used for data cleaning, LUZZU
is extensible: third-party metrics can be easily plugged into the framework. The
framework does not rely on SPARQL endpoints, and is thus free of all the
problems that come with them, such as query timeouts. Another advantage over
SPARQL-based quality assessment frameworks is that metrics implemented in
LUZZU can have more complex functionality than triple matching. Using the
framework, we performed a quality assessment of a number of statistical linked
datasets that are available on the LOD cloud. For this evaluation, 25 metrics
from ten different dimensions were implemented
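The pluggable, stream-based metric design the abstract describes can be sketched as follows. Luzzu's real plug-in interface is Java and differs in detail; this Python sketch only illustrates the pattern: the framework pushes each triple to the metric, then asks for a final value, with no SPARQL endpoint involved. The `LabelCoverageMetric` is a hypothetical example metric, not one of the 25 implemented in the paper.

```python
from abc import ABC, abstractmethod

class QualityMetric(ABC):
    """Minimal sketch of a pluggable streaming quality metric."""

    @abstractmethod
    def compute(self, subject, predicate, obj):
        """Called once per triple as the dataset is streamed."""

    @abstractmethod
    def metric_value(self) -> float:
        """Called after the stream ends to obtain the final score."""

class LabelCoverageMetric(QualityMetric):
    """Hypothetical metric: fraction of distinct subjects with an rdfs:label."""
    def __init__(self):
        self.subjects, self.labelled = set(), set()

    def compute(self, subject, predicate, obj):
        self.subjects.add(subject)
        if predicate == "rdfs:label":
            self.labelled.add(subject)

    def metric_value(self) -> float:
        return len(self.labelled) / len(self.subjects) if self.subjects else 0.0

m = LabelCoverageMetric()
for s, p, o in [("a", "rdf:type", "T"), ("a", "rdfs:label", "A"),
                ("b", "rdf:type", "T")]:
    m.compute(s, p, o)
print(m.metric_value())  # 0.5: one of two subjects is labelled
```

Because the metric keeps its own state across the stream, it can compute aggregates (set cardinalities, ratios, joins over seen terms) that go beyond the single-triple pattern matching a SPARQL-based assessment would rely on.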
Understanding information professionals: a survey on the quality of Linked Data sources for digital libraries
In this paper we provide an in-depth analysis of a survey
related to Information Professionals (IPs) experiences with Linked Data
quality. We discuss and highlight shortcomings in linked data sources
following a survey related to the quality issues IPs find when using such
sources for their daily tasks such as metadata creation
Saffron: a data value assessment tool for quantifying the value of data assets
Data has become an indispensable commodity and it is the
basis for many products and services. It has become increasingly important to understand the value of this data in order to be able to exploit it
and reap the full benefits. Yet, many businesses and entities are simply
hoarding data without understanding its true potential. We here present
Saffron; a Data Value Assessment Tool that enables the quantification of
the value of data assets based on a number of different data value dimensions. Based on the Data Value Vocabulary (DaVe), Saffron enables the
extensible representation of the calculated value of data assets, whilst
also catering for the subjective and contextual nature of data value. The
tool exploits semantic technologies in order to provide traceable explanations of the calculated data value. Saffron therefore provides the first
step towards the efficient and effective exploitation of data assets.
Towards Cleaning-up Open Data Portals: A Metadata Reconciliation Approach
This paper presents an approach for metadata reconciliation, curation and
linking for Open Governmental Data Portals (ODPs). ODPs have lately become the
standard solution for governments willing to make their public data available
to society. Portal managers use several types of metadata to organize the
datasets, one of the most important ones being the tags. However, the tagging
process is subject to many problems, such as synonyms, ambiguity or
incoherence, among others. As our empirical analysis of ODPs shows, these issues
are currently prevalent in most ODPs and effectively hinder the reuse of Open
Data. In order to address these problems, we develop and implement an approach
for tag reconciliation in Open Data Portals, encompassing local actions related
to individual portals, and global actions for adding a semantic metadata layer
above individual portals. The local part aims to enhance the quality of tags in
a single portal, and the global part is meant to interlink ODPs by establishing
relations between tags.
Comment: 8 pages, 10 figures - Under Revision for ICSC201
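The local part of the reconciliation, collapsing near-duplicate tags within one portal, and the global part, linking tags across portals through a semantic layer, can be sketched together. The normalisation rules and the hand-made synonym mapping below are illustrative assumptions, not the paper's actual pipeline.

```python
import unicodedata
from collections import defaultdict

def normalise(tag: str) -> str:
    """Local step: case-fold, trim, and strip accents so that
    near-duplicate tags collapse to one key."""
    tag = unicodedata.normalize("NFKD", tag.strip().lower())
    return "".join(c for c in tag if not unicodedata.combining(c))

def reconcile(tags, synonyms=None):
    """Group raw tags under one canonical form each. `synonyms` is an
    illustrative hand-made mapping standing in for the semantic layer
    that links equivalent tags (e.g. translations) across portals."""
    synonyms = synonyms or {}
    groups = defaultdict(list)
    for t in tags:
        key = normalise(t)
        groups[synonyms.get(key, key)].append(t)
    return dict(groups)

raw = ["Health", "health ", "Saúde", "healthcare"]
print(reconcile(raw, synonyms={"saude": "health", "healthcare": "health"}))
# all four raw tags end up under the single canonical tag 'health'
```

The same two-level split appears in the abstract: normalisation improves tag quality within a single portal, while the shared canonical keys are what make interlinking between portals possible.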
Assessing the quality of geospatial linked data – experiences from Ordnance Survey Ireland (OSi)
Ordnance Survey Ireland (OSi) is Ireland’s national mapping agency
that is responsible for the digitisation of the island’s infrastructure in terms of
mapping. Generating data from various sensors (e.g. spatial sensors), OSi builds
its knowledge in the Prime2 framework, a subset of which is transformed into
geo-Linked Data. In this paper we discuss how the quality of the generated
semantic data fares against datasets in the LOD cloud. We set up Luzzu, a scalable
Linked Data quality assessment framework, in the OSi pipeline to continuously
assess produced data in order to tackle any quality problems prior to publishing.