3,555 research outputs found
Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML
OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a
database of all EC FP7 and H2020 funded research projects, including metadata
of their results (publications and datasets). These data are stored in an HBase
NoSQL database, post-processed, and exposed as HTML for human consumption, and
as XML through a web service interface. As an intermediate format to facilitate
statistical computations, CSV is generated internally. To interlink the
OpenAIRE data with related data on the Web, we aim at exporting them as Linked
Open Data (LOD). The LOD export is required to integrate into the overall data
processing workflow, where derived data are regenerated from the base data
every day. We thus faced the challenge of identifying the best-performing
conversion approach.We evaluated the performances of creating LOD by a
MapReduce job on top of HBase, by mapping the intermediate CSV files, and by
mapping the XML output.Comment: Accepted in 0th Metadata and Semantics Research Conferenc
Representing Dataset Quality Metadata using Multi-Dimensional Views
Data quality is commonly defined as fitness for use. The problem of
identifying quality of data is faced by many data consumers. Data publishers
often do not have the means to identify quality problems in their data. To make
the task for both stakeholders easier, we have developed the Dataset Quality
Ontology (daQ). daQ is a core vocabulary for representing the results of
quality benchmarking of a linked dataset. It represents quality metadata as
multi-dimensional and statistical observations using the Data Cube vocabulary.
Quality metadata are organised as a self-contained graph, which can, e.g., be
embedded into linked open datasets. We discuss the design considerations, give
examples for extending daQ by custom quality metrics, and present use cases
such as analysing data versions, browsing datasets by quality, and link
identification. We finally discuss how data cube visualisation tools enable
data publishers and consumers to analyse better the quality of their data.Comment: Preprint of a paper submitted to the forthcoming SEMANTiCS 2014, 4-5
September 2014, Leipzig, German
Luzzu - A Framework for Linked Data Quality Assessment
With the increasing adoption and growth of the Linked Open Data cloud [9],
with RDFa, Microformats and other ways of embedding data into ordinary Web
pages, and with initiatives such as schema.org, the Web is currently being
complemented with a Web of Data. Thus, the Web of Data shares many
characteristics with the original Web of Documents, which also varies in
quality. This heterogeneity makes it challenging to determine the quality of
the data published on the Web and to subsequently make this information
explicit to data consumers. The main contribution of this article is LUZZU, a
quality assessment framework for Linked Open Data. Apart from providing quality
metadata and quality problem reports that can be used for data cleaning, LUZZU
is extensible: third party metrics can be easily plugged-in the framework. The
framework does not rely on SPARQL endpoints, and is thus free of all the
problems that come with them, such as query timeouts. Another advantage over
SPARQL based qual- ity assessment frameworks is that metrics implemented in
LUZZU can have more complex functionality than triple matching. Using the
framework, we performed a quality assessment of a number of statistical linked
datasets that are available on the LOD cloud. For this evaluation, 25 metrics
from ten different dimensions were implemented
Entity Query Feature Expansion Using Knowledge Base Links
Recent advances in automatic entity linking and knowledge base
construction have resulted in entity annotations for document and
query collections. For example, annotations of entities from large
general purpose knowledge bases, such as Freebase and the Google
Knowledge Graph. Understanding how to leverage these entity
annotations of text to improve ad hoc document retrieval is an open
research area. Query expansion is a commonly used technique to
improve retrieval effectiveness. Most previous query expansion
approaches focus on text, mainly using unigram concepts. In this
paper, we propose a new technique, called entity query feature
expansion (EQFE) which enriches the query with features from
entities and their links to knowledge bases, including structured
attributes and text. We experiment using both explicit query entity
annotations and latent entities. We evaluate our technique on TREC
text collections automatically annotated with knowledge base entity
links, including the Google Freebase Annotations (FACC1) data.
We find that entity-based feature expansion results in significant
improvements in retrieval effectiveness over state-of-the-art text
expansion approaches
Recommended from our members
Models for Learning (Mod4L) Final Report: Representing Learning Designs
The Mod4L Models of Practice project is part of the JISC-funded Design for Learning Programme. It ran from 1 May – 31 December 2006. The philosophy underlying the project was that a general split is evident in the e-learning community between development of e-learning tools, services and standards, and research into how teachers can use these most effectively, and is impeding uptake of new tools and methods by teachers. To help overcome this barrier and bridge the gap, a need is felt for practitioner-focused resources which describe a range of learning designs and offer guidance on how these may be chosen and applied, how they can support effective practice in design for learning, and how they can support the development of effective tools, standards and systems with a learning design capability (see, for example, Griffiths and Blat 2005, JISC 2006). Practice models, it was suggested, were such a resource.
The aim of the project was to: develop a range of practice models that could be used by practitioners in real life contexts and have a high impact on improving teaching and learning practice.
We worked with two definitions of practice models. Practice models are:
1. generic approaches to the structuring and orchestration of learning activities. They express elements of pedagogic principle and allow practitioners to make informed choices (JISC 2006)
However, however effective a learning design may be, it can only be shared with others through a representation. The issue of representation of learning designs is, then, central to the concept of sharing and reuse at the heart of JISC’s Design for Learning programme. Thus practice models should be both representations of effective practice, and effective representations of practice. Hence we arrived at the project working definition of practice models as:
2. Common, but decontextualised, learning designs that are represented in a way that is usable by practitioners (teachers, managers, etc).(Mod4L working definition, Falconer & Littlejohn 2006).
A learning design is defined as the outcome of the process of designing, planning and orchestrating learning activities as part of a learning session or programme (JISC 2006).
Practice models have many potential uses: they describe a range of learning designs that are found to be effective, and offer guidance on their use; they support sharing, reuse and adaptation of learning designs by teachers, and also the development of tools, standards and systems for planning, editing and running the designs.
The project took a practitioner-centred approach, working in close collaboration with a focus group of 12 teachers recruited across a range of disciplines and from both FE and HE. Focus group members are listed in Appendix 1. Information was gathered from the focus group through two face to face workshops, and through their contributions to discussions on the project wiki. This was supplemented by an activity at a JISC pedagogy experts meeting in October 2006, and a part workshop at ALT-C in September 2006. The project interim report of August 2006 contained the outcomes of the first workshop (Falconer and Littlejohn, 2006).
The current report refines the discussion of issues of representing learning designs for sharing and reuse evidenced in the interim report and highlights problems with the concept of practice models (section 2), characterises the requirements teachers have of effective representations (section 3), evaluates a number of types of representation against these requirements (section 4), explores the more technically focused role of sequencing representations and controlled vocabularies (sections 5 & 6), documents some generic learning designs (section 8.2) and suggests ways forward for bridging the gap between teachers and developers (section 2.6).
All quotations are taken from the Mod4L wiki unless otherwise stated
Proceedings of the Workshop Semantic Content Acquisition and Representation (SCAR) 2007
This is the proceedings of the Workshop on Semantic Content Acquisition and Representation, held in conjunction with NODALIDA 2007, on May 24 2007 in Tartu, Estonia.</p
Scalable Quality Assessment of Linked Data
In a world where the information economy is booming, poor data quality can lead to adverse consequences, including social and economical problems such as decrease in revenue. Furthermore, data-driven indus- tries are not just relying on their own (proprietary) data silos, but are also continuously aggregating data from different sources. This aggregation could then be re-distributed back to “data lakes”. However, this data (including Linked Data) is not necessarily checked for its quality prior to its use. Large volumes of data are being exchanged in a standard and interoperable format between organisations and published as Linked Data to facilitate their re-use. Some organisations, such as government institutions, take a step further and open their data. The Linked Open Data Cloud is a witness to this. However, similar to data in data lakes, it is challenging to determine the quality of this heterogeneous data, and subsequently to make this information explicit to data consumers. Despite the availability of a number of tools and frameworks to assess Linked Data quality, the current solutions do not aggregate a holistic approach that enables both the assessment of datasets and also provides consumers with quality results that can then be used to find, compare and rank datasets’ fitness for use. In this thesis we investigate methods to assess the quality of (possibly large) linked datasets with the intent that data consumers can then use the assessment results to find datasets that are fit for use, that is; finding the right dataset for the task at hand. Moreover, the benefits of quality assessment are two-fold: (1) data consumers do not need to blindly rely on subjective measures to choose a dataset, but base their choice on multiple factors such as the intrinsic structure of the dataset, therefore fostering trust and reputation between the publishers and consumers on more objective foundations; and (2) data publishers can be encouraged to improve their datasets so that they can be re-used more. Furthermore, our approach scales for large datasets. In this regard, we also look into improving the efficiency of quality metrics using various approximation techniques. However the trade-off is that consumers will not get the exact quality value, but a very close estimate which anyway provides the required guidance towards fitness for use. The central point of this thesis is not on data quality improvement, nonetheless, we still need to understand what data quality means to the consumers who are searching for potential datasets. This thesis looks into the challenges faced to detect quality problems in linked datasets presenting quality results in a standardised machine-readable and interoperable format for which agents can make sense out of to help human consumers identifying the fitness for use dataset. Our proposed approach is more consumer-centric where it looks into (1) making the assessment of quality as easy as possible, that is, allowing stakeholders, possibly non-experts, to identify and easily define quality metrics and to initiate the assessment; and (2) making results (quality metadata and quality reports) easy for stakeholders to understand, or at least interoperable with other systems to facilitate a possible data quality pipeline. Finally, our framework is used to assess the quality of a number of heterogeneous (large) linked datasets, where each assessment returns a quality metadata graph that can be consumed by agents as Linked Data. In turn, these agents can intelligently interpret a dataset’s quality with regard to multiple dimensions and observations, and thus provide further insight to consumers regarding its fitness for use
Evaluating the quality of linked open data in digital libraries
Cultural heritage institutions have recently started to share their metadata as Linked Open Data (LOD) in order to disseminate and enrich them. The publication of large bibliographic data sets as LOD is a challenge that requires the design and implementation of custom methods for the transformation, management, querying and enrichment of the data. In this report, the methodology defined by previous research for the evaluation of the quality of LOD is analysed and adapted to the specific case of Resource Description Framework (RDF) triples containing standard bibliographic information. The specified quality measures are reported in the case of four highly relevant libraries.This work has been partially supported by the ECLIPSE-UA RTI2018-094283-B-C32 (Spanish Ministry of Education and Science)
- …