51 research outputs found
Framework to Automatically Determine the Quality of Open Data Catalogs
Data catalogs play a crucial role in modern data-driven organizations by
facilitating the discovery, understanding, and utilization of diverse data
assets. However, ensuring their quality and reliability is complex, especially
in open and large-scale data environments. This paper proposes a framework to
automatically determine the quality of open data catalogs, addressing the need
for efficient and reliable quality assessment mechanisms. Our framework
analyzes core quality dimensions such as accuracy, completeness, consistency,
scalability, and timeliness; offers several alternatives for assessing
compatibility and similarity across such catalogs; and implements a set of
non-core quality dimensions such as provenance, readability, and licensing.
The goal is to empower data-driven organizations to
make informed decisions based on trustworthy and well-curated data assets. The
source code that illustrates our approach can be downloaded from
https://www.github.com/jorge-martinez-gil/dataq/.
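The kind of dimension-by-dimension assessment described above can be sketched in a few lines of plain Python. This is a hypothetical simplification, not the framework's actual implementation: the field names, scoring rules, and decay window are illustrative assumptions only.

```python
from datetime import datetime, timezone

# Hypothetical required fields for a catalog metadata record.
REQUIRED_FIELDS = ("title", "description", "license", "modified")

def completeness(record: dict) -> float:
    """Fraction of required metadata fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f))
    return filled / len(REQUIRED_FIELDS)

def timeliness(record: dict, now: datetime, max_age_days: int = 365) -> float:
    """1.0 for a freshly updated record, decaying linearly to 0.0 at max_age_days."""
    modified = datetime.fromisoformat(record["modified"])
    age_days = (now - modified).days
    return max(0.0, 1.0 - age_days / max_age_days)

record = {"title": "Air quality sensors", "description": "Hourly PM2.5",
          "license": "CC-BY-4.0", "modified": "2024-01-01T00:00:00+00:00"}
now = datetime(2024, 7, 1, tzinfo=timezone.utc)
print(completeness(record))                  # 1.0
print(round(timeliness(record, now), 2))     # 0.5
```

Per-dimension scores like these could then be aggregated (weighted, averaged, or thresholded) into an overall catalog quality score.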
Building an Integrated Enhanced Virtual Research Environment Metadata Catalogue
Purpose
The purpose of this paper is to boost multidisciplinary research by building an integrated catalogue of research assets metadata. Such an integrated catalogue should enable researchers to solve problems or analyse phenomena that require a view across several scientific domains.
Design/methodology/approach
There are two main approaches for integrating metadata catalogues provided by different e-science research infrastructures (e-RIs): centralised and distributed. The authors decided to implement a central metadata catalogue that describes, provides access to and records actions on the assets of a number of e-RIs participating in the system. The authors chose the CERIF data model for description of assets available via the integrated catalogue. Analysis of popular metadata formats used in e-RIs has been conducted, and mappings between popular formats and the CERIF data model have been defined using an XML-based tool for description and automatic execution of mappings.
Findings
An integrated catalogue of research assets metadata has been created. Metadata from e-RIs supporting Dublin Core, ISO 19139, DCAT-AP, EPOS-DCAT-AP, OIL-E and CKAN formats can be integrated into the catalogue. Metadata are stored in CERIF RDF in the integrated catalogue. A web portal for searching this catalogue has been implemented.
Research limitations/implications
Only five formats are supported at this moment. However, description of mappings between other source formats and the target CERIF format can be defined in the future using the 3M tool, an XML-based tool for describing X3ML mappings that can then be automatically executed on XML metadata records. The approach and best practices described in this paper can thus be applied in future mappings between other metadata formats.
Practical implications
The integrated catalogue is a part of the eVRE prototype, which is a result of the VRE4EIC H2020 project.
Social implications
The integrated catalogue should boost the performance of multi-disciplinary research; thus it has the potential to enhance the practice of data science and so contribute to an increasingly knowledge-based society.
Originality/value
A novel approach for creation of the integrated catalogue has been defined and implemented. The approach includes definition of mappings between various formats. Defined mappings are effective and shareable.
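The field-level mapping idea at the heart of this approach can be illustrated with a small stand-in: a declarative table per source format mapped into a common target schema. Note this is a toy sketch, not the X3ML tool or the real CERIF model; all field names here are hypothetical simplifications.

```python
# Declarative per-format mappings into a shared target schema
# (an illustrative stand-in for X3ML-style mapping definitions).
MAPPINGS = {
    "dublin_core": {"dc:title": "name", "dc:creator": "author", "dc:date": "created"},
    "dcat_ap": {"dct:title": "name", "dct:publisher": "author", "dct:issued": "created"},
}

def to_target(source_format: str, record: dict) -> dict:
    """Translate a source metadata record into the common target schema."""
    mapping = MAPPINGS[source_format]
    return {target: record[source] for source, target in mapping.items()
            if source in record}

dc_record = {"dc:title": "Seismic dataset", "dc:creator": "EPOS", "dc:date": "2019"}
print(to_target("dublin_core", dc_record))
# {'name': 'Seismic dataset', 'author': 'EPOS', 'created': '2019'}
```

Keeping the mappings as data rather than code is what makes them shareable and extensible to further source formats, as the abstract notes.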
DPCat: Specification for an interoperable and machine-readable data processing catalogue based on GDPR
The GDPR requires Data Controllers and Data Protection Officers (DPO) to maintain a
Register of Processing Activities (ROPA) as part of overseeing the organisation's compliance processes.
The ROPA must include information from heterogeneous sources such as (internal) departments with
varying IT systems and (external) data processors. Current practices use spreadsheets or proprietary
systems that lack machine-readability and interoperability, presenting barriers to automation. We
propose the Data Processing Catalogue (DPCat) for the representation, collection and transfer of
ROPA information, as catalogues in a machine-readable and interoperable manner. DPCat is based
on the Data Catalog Vocabulary (DCAT) and its extension DCAT Application Profile for data portals
in Europe (DCAT-AP), and the Data Privacy Vocabulary (DPV). It represents a comprehensive
semantic model developed from GDPR's Article 30 and an analysis of the 17 ROPA templates from
EU Data Protection Authorities (DPA). To demonstrate the practicality and feasibility of DPCat,
we present the European Data Protection Supervisor's (EDPS) ROPA documents using DPCat,
verify them with SHACL to ensure the correctness of information based on legal and contextual
requirements, and produce reports and ROPA documents based on DPA templates using SPARQL.
DPCat supports a data governance process for data processing compliance to harmonise inputs from
heterogeneous sources to produce dynamic documentation that can accommodate differences in
regulatory approaches across DPAs and ease investigative burdens toward efficient enforcement.
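The shape-based verification described above (done with SHACL in DPCat) can be sketched as a toy validator: a "shape" lists the properties a conformant ROPA entry must carry, and validation produces a report of violations. The property names below are hypothetical simplifications, not DPCat's actual vocabulary terms.

```python
# Simplified stand-in for a SHACL shape over ROPA entries:
# the set of properties every processing-activity record must declare.
ROPA_SHAPE = {"controller", "purpose", "legal_basis", "data_categories", "retention"}

def validate(entry: dict) -> list[str]:
    """Return one message per missing required property; empty means conformant."""
    return [f"missing: {prop}" for prop in sorted(ROPA_SHAPE - entry.keys())]

entry = {"controller": "ACME GmbH", "purpose": "newsletter",
         "legal_basis": "consent", "data_categories": ["email"]}
print(validate(entry))  # ['missing: retention']
```

In the real system, SHACL shapes additionally encode legal and contextual constraints (value types, cardinalities, conditional requirements), and SPARQL queries over the validated catalogue generate the DPA-specific reports.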
Danube Region data projects
The Danube Reference Data and Services Infrastructure (DRDSI) project currently provides access to more than 6,500 datasets, relevant for one or more Priority Areas of the EU Strategy for the Danube Region (EUSDR). These datasets can act as a solid foundation for integration of scientific knowledge into the policy making process on different levels (local, regional and international). From the perspective of macro-regional strategies, this would only be possible if data can be used across borders and domains, and put in the right context.
Projects at regional, national, cross-border and macro-regional levels present a useful container to uncover stakeholders, expertise and data creation/sharing capacity for policy-making and research. This JRC technical report investigates the existing project databases and similar resources related to the EUSDR that describe such projects, as well as how this information may be presented in the DRDSI platform.
A Conceptual Model for Participants and Activities in Citizen Science Projects
Interest in the formal representation of citizen science comes from portals, platforms, and catalogues of citizen science projects; scientists using citizen science data for their research; and funding agencies and governments interested in the impact of citizen science initiatives. Having a common understanding and representation of citizen science projects, their participants, and their outcomes is key to enabling seamless knowledge and data sharing. In this chapter, we provide a conceptual model comprising the core citizen science concepts with which projects and data can be described in a standardised manner, focusing on the description of the participants and their activities. The conceptual model is the outcome of a working group from the COST Action CA15212 Citizen Science to Promote Creativity, Scientific Literacy, and Innovation throughout Europe, established to improve data standardisation and interoperability in citizen science activities. It utilises past models and contributes to current standardisation efforts, such as the Public Participation in Scientific Research (PPSR) Common Conceptual Model and the Open Geospatial Consortium (OGC) standards. Its design is intended to fulfil the needs of different stakeholders, as illustrated by several case studies which demonstrate the model's applicability.
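A conceptual model of projects, participants, and activities like the one outlined here can be given a minimal machine-readable encoding. The classes and attribute names below are illustrative assumptions for the sketch, not the actual PPSR or OGC model elements.

```python
from dataclasses import dataclass, field

@dataclass
class Participant:
    name: str
    role: str  # e.g. "volunteer", "coordinator"

@dataclass
class Activity:
    description: str
    performed_by: list[Participant] = field(default_factory=list)

@dataclass
class Project:
    title: str
    activities: list[Activity] = field(default_factory=list)

# A citizen science project with one activity carried out by one volunteer.
p = Participant("Alice", "volunteer")
obs = Activity("bird observation", [p])
proj = Project("City Birds", [obs])
print(proj.activities[0].performed_by[0].name)  # Alice
```

Encoding the concepts as typed records is what allows project metadata to be exchanged between platforms and catalogues in a standardised way.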
A domain categorisation of vocabularies based on a deep learning classifier.
The publication of large amounts of open data has become a major trend nowadays. This is a consequence of projects like the Linked Open Data (LOD) community, which publishes and integrates datasets using techniques like Linked Data. Linked Data publishers should follow a set of design principles for datasets, described in a 2011 document that covers tasks such as considering the reuse of existing vocabularies. With regard to the latter, another project called Linked Open Vocabularies (LOV) attempts to compile the vocabularies used in LOD. These vocabularies have been classified by domain following the subjective criteria of LOV members, which carries the inherent risk of introducing personal biases. In this paper, we present an automatic classifier of vocabularies based on the main categories of the well-known knowledge source Wikipedia. For this purpose, word-embedding models were used in combination with Deep Learning techniques. Results show that with a hybrid model of a regular Deep Neural Network (DNN), a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN), vocabularies could be classified with an accuracy of 93.57 per cent. Specifically, 36.25 per cent of the vocabularies belong to the Culture category.
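The core idea, embedding a vocabulary's terms and assigning it a Wikipedia-style category, can be shown with a deliberately tiny stand-in: average per-word vectors, then pick the nearest category centroid by cosine similarity. This is not the paper's DNN/RNN/CNN hybrid; the two-dimensional vectors and two categories below are toy assumptions for illustration.

```python
import math

# Toy word embeddings and category centroids (hypothetical 2-D values).
EMBED = {
    "museum": [0.9, 0.1], "art": [0.8, 0.2],
    "protein": [0.1, 0.9], "gene": [0.2, 0.8],
}
CENTROIDS = {"Culture": [0.85, 0.15], "Science": [0.15, 0.85]}

def mean_vector(words):
    """Average the embeddings of the known words in a vocabulary."""
    vecs = [EMBED[w] for w in words if w in EMBED]
    return [sum(d) / len(vecs) for d in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def classify(words):
    """Assign the category whose centroid is most similar to the mean vector."""
    v = mean_vector(words)
    return max(CENTROIDS, key=lambda c: cosine(v, CENTROIDS[c]))

print(classify(["museum", "art"]))    # Culture
print(classify(["protein", "gene"]))  # Science
```

The paper's hybrid network replaces the hand-set centroids with learned representations, which is what lifts accuracy to the reported 93.57 per cent.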
Building Semantic Knowledge Graphs from (Semi-)Structured Data: A Review
Knowledge graphs have, for the past decade, been a hot topic both in public and private domains, typically used for large-scale integration and analysis of data using graph-based data models. One of the central concepts in this area is the Semantic Web, with the vision of providing a well-defined meaning to information and services on the Web through a set of standards. Particularly, linked data and ontologies have been quite essential for data sharing, discovery, integration, and reuse. In this paper, we provide a systematic literature review on knowledge graph creation from structured and semi-structured data sources using Semantic Web technologies. The review takes into account four prominent publication venues, namely, Extended Semantic Web Conference, International Semantic Web Conference, Journal of Web Semantics, and Semantic Web Journal. The review highlights the tools, methods, types of data sources, ontologies, and publication methods, together with the challenges, limitations, and lessons learned in the knowledge graph creation processes.
European Language Grid
This open access book provides an in-depth description of the EU project European Language Grid (ELG). Its motivation lies in the fact that Europe is a multilingual society with 24 official European Union Member State languages and dozens of additional languages including regional and minority languages. The only meaningful way to enable multilingualism and to benefit from this rich linguistic heritage is through Language Technologies (LT) including Natural Language Processing (NLP), Natural Language Understanding (NLU), Speech Technologies and language-centric Artificial Intelligence (AI) applications. The European Language Grid provides a single umbrella platform for the European LT community, including research and industry, effectively functioning as a virtual home, marketplace, showroom, and deployment centre for all services, tools, resources, products and organisations active in the field. Today the ELG cloud platform already offers access to more than 13,000 language processing tools and language resources. It enables all stakeholders to deposit, upload and deploy their technologies and datasets. The platform also supports the long-term objective of establishing digital language equality in Europe by 2030: a situation in which all European languages enjoy equal technological support. This is the very first book dedicated to Language Technology and NLP platforms. Cloud technology has only recently matured enough to make the development of a platform like ELG feasible on a larger scale. The book comprehensively describes the results of the ELG project. Following an introduction, the content is divided into four main parts: (I) ELG Cloud Platform; (II) ELG Inventory of Technologies and Resources; (III) ELG Community and Initiative; and (IV) ELG Open Calls and Pilot Projects