51 research outputs found

    Framework to Automatically Determine the Quality of Open Data Catalogs

    Data catalogs play a crucial role in modern data-driven organizations by facilitating the discovery, understanding, and utilization of diverse data assets. However, ensuring their quality and reliability is complex, especially in open and large-scale data environments. This paper proposes a framework to automatically determine the quality of open data catalogs, addressing the need for efficient and reliable quality assessment mechanisms. Our framework can analyze various core quality dimensions, such as accuracy, completeness, consistency, scalability, and timeliness; it offers several alternatives for assessing compatibility and similarity across such catalogs; and it implements a set of non-core quality dimensions such as provenance, readability, and licensing. The goal is to empower data-driven organizations to make informed decisions based on trustworthy and well-curated data assets. The source code that illustrates our approach can be downloaded from https://www.github.com/jorge-martinez-gil/dataq/.
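    To make the idea of scoring a quality dimension concrete, here is a minimal Python sketch of a completeness check over catalog records. It is not the dataq implementation from the linked repository; the expected field names and the unweighted average are illustrative assumptions.

```python
# Minimal sketch of a completeness check for open data catalog records.
# Not the actual dataq implementation; field names and the scoring rule
# are illustrative assumptions, not taken from the paper.

EXPECTED_FIELDS = ["title", "description", "license", "publisher", "modified"]

def completeness(record: dict) -> float:
    """Fraction of expected metadata fields that are present and non-empty."""
    filled = sum(1 for f in EXPECTED_FIELDS if record.get(f))
    return filled / len(EXPECTED_FIELDS)

def catalog_completeness(records: list[dict]) -> float:
    """Average record-level completeness across the whole catalog."""
    if not records:
        return 0.0
    return sum(completeness(r) for r in records) / len(records)

if __name__ == "__main__":
    sample = [
        {"title": "Air quality 2023", "description": "Hourly PM2.5", "license": "CC-BY-4.0"},
        {"title": "Bus stops", "publisher": "City of X", "modified": "2024-01-02"},
    ]
    print(f"Catalog completeness: {catalog_completeness(sample):.2f}")
```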

    Building an Integrated Enhanced Virtual Research Environment Metadata Catalogue

    Purpose: The purpose of this paper is to boost multidisciplinary research through the building of an integrated catalogue of research assets metadata. Such an integrated catalogue should enable researchers to solve problems or analyse phenomena that require a view across several scientific domains.

    Design/methodology/approach: There are two main approaches for integrating metadata catalogues provided by different e-science research infrastructures (e-RIs): centralised and distributed. The authors decided to implement a central metadata catalogue that describes, provides access to and records actions on the assets of a number of e-RIs participating in the system. The authors chose the CERIF data model for the description of assets available via the integrated catalogue. An analysis of popular metadata formats used in e-RIs has been conducted, and mappings between popular formats and the CERIF data model have been defined using an XML-based tool for the description and automatic execution of mappings.

    Findings: An integrated catalogue of research assets metadata has been created. Metadata from e-RIs supporting the Dublin Core, ISO 19139, DCAT-AP, EPOS-DCAT-AP, OIL-E and CKAN formats can be integrated into the catalogue. Metadata are stored as CERIF RDF in the integrated catalogue. A web portal for searching this catalogue has been implemented, as illustrated by the mapping sketch after this abstract.

    Research limitations/implications: Only five formats are supported at this moment. However, descriptions of mappings between other source formats and the target CERIF format can be defined in the future using the 3M tool, an XML-based tool for describing X3ML mappings that can then be automatically executed on XML metadata records. The approach and best practices described in this paper can thus be applied in future mappings between other metadata formats.

    Practical implications: The integrated catalogue is a part of the eVRE prototype, which is a result of the VRE4EIC H2020 project.

    Social implications: The integrated catalogue should boost the performance of multidisciplinary research; thus it has the potential to enhance the practice of data science and so contribute to an increasingly knowledge-based society.

    Originality/value: A novel approach for the creation of the integrated catalogue has been defined and implemented. The approach includes the definition of mappings between various formats. The defined mappings are effective and shareable.
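    As an illustration of the kind of format mapping described above, the following Python sketch lifts a flat Dublin Core record into RDF using rdflib. The CERIF namespace URI and the class/property names are placeholders invented for this example; the actual mappings in the paper are defined declaratively in X3ML and executed by the 3M tool.

```python
# Hedged sketch: mapping a Dublin Core metadata record into an RDF graph,
# analogous in spirit to the DC -> CERIF mappings described above.
# The CERIF namespace URI, class and property names below are
# placeholders, not the actual CERIF RDF vocabulary.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

CERIF = Namespace("http://example.org/cerif#")  # placeholder namespace

def map_dc_record(record_uri: str, dc_record: dict) -> Graph:
    """Translate a flat Dublin Core dict into CERIF-style triples."""
    g = Graph()
    g.bind("cerif", CERIF)
    subject = URIRef(record_uri)
    g.add((subject, RDF.type, CERIF.ResultProduct))  # placeholder class
    if "title" in dc_record:
        g.add((subject, DC.title, Literal(dc_record["title"])))
    if "creator" in dc_record:
        g.add((subject, CERIF.hasCreatorName, Literal(dc_record["creator"])))
    return g

g = map_dc_record("http://example.org/assets/42",
                  {"title": "Seismic waveforms 2019", "creator": "EPOS"})
print(g.serialize(format="turtle"))
```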

    DPCat: Specification for an interoperable and machine-readable data processing catalogue based on GDPR

    The GDPR requires Data Controllers and Data Protection Officers (DPOs) to maintain a Register of Processing Activities (ROPA) as part of overseeing the organisation's compliance processes. The ROPA must include information from heterogeneous sources such as (internal) departments with varying IT systems and (external) data processors. Current practices use spreadsheets or proprietary systems that lack machine-readability and interoperability, presenting barriers to automation. We propose the Data Processing Catalogue (DPCat) for the representation, collection and transfer of ROPA information as catalogues in a machine-readable and interoperable manner. DPCat is based on the Data Catalog Vocabulary (DCAT), its extension DCAT Application Profile for data portals in Europe (DCAT-AP), and the Data Privacy Vocabulary (DPV). It represents a comprehensive semantic model developed from GDPR's Article 30 and an analysis of 17 ROPA templates from EU Data Protection Authorities (DPAs). To demonstrate the practicality and feasibility of DPCat, we present the European Data Protection Supervisor's (EDPS) ROPA documents using DPCat, verify them with SHACL to ensure the correctness of information based on legal and contextual requirements, and produce reports and ROPA documents based on DPA templates using SPARQL. DPCat supports a data governance process for data processing compliance that harmonises inputs from heterogeneous sources to produce dynamic documentation, accommodating differences in regulatory approaches across DPAs and easing investigative burdens toward efficient enforcement.
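    The SHACL verification step can be illustrated with a small Python sketch using rdflib and pyshacl: a toy ROPA-style record is validated against a minimal shape requiring a stated purpose. The shape below is a simplified stand-in, not DPCat's actual shape definitions.

```python
# Hedged sketch of SHACL-based verification in the spirit of DPCat:
# a toy record is checked against a minimal shape requiring a documented
# purpose. The shape is illustrative only. Requires rdflib and pyshacl.

from pyshacl import validate

data = """
@prefix ex:  <http://example.org/> .
@prefix dpv: <https://w3id.org/dpv#> .

ex:processing1 a dpv:PersonalDataHandling .
"""

shapes = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix dpv: <https://w3id.org/dpv#> .

[] a sh:NodeShape ;
   sh:targetClass dpv:PersonalDataHandling ;
   sh:property [
       sh:path dpv:hasPurpose ;
       sh:minCount 1 ;
       sh:message "Every processing activity must state its purpose." ;
   ] .
"""

conforms, _, report = validate(data, shacl_graph=shapes,
                               data_graph_format="turtle",
                               shacl_graph_format="turtle")
print(conforms)   # False: the record lacks dpv:hasPurpose
print(report)
```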

    Danube Region data projects

    The Danube Reference Data and Services Infrastructure (DRDSI) project currently provides access to more than 6,500 datasets relevant for one or more Priority Areas of the EU Strategy for the Danube Region (EUSDR). These datasets can act as a solid foundation for the integration of scientific knowledge into the policy-making process at different levels (local, regional and international). From the perspective of macro-regional strategies, this is only possible if data can be used across borders and domains, and put in the right context. Projects at regional, national, cross-border and macro-regional levels offer a useful entry point for uncovering the stakeholders, expertise and data creation/sharing capacity available for policy-making and research. This JRC technical report investigates the existing project databases and similar resources related to the EUSDR that describe such projects, as well as how this information may be presented in the DRDSI platform.

    A Conceptual Model for Participants and Activities in Citizen Science Projects

    Interest in the formal representation of citizen science comes from portals, platforms, and catalogues of citizen science projects; scientists using citizen science data for their research; and funding agencies and governments interested in the impact of citizen science initiatives. Having a common understanding and representation of citizen science projects, their participants, and their outcomes is key to enabling seamless knowledge and data sharing. In this chapter, we provide a conceptual model comprising the core citizen science concepts with which projects and data can be described in a standardised manner, focusing on the description of the participants and their activities. The conceptual model is the outcome of a working group of the COST Action CA15212 Citizen Science to Promote Creativity, Scientific Literacy, and Innovation throughout Europe, established to improve data standardisation and interoperability in citizen science activities. It utilises past models and contributes to current standardisation efforts, such as the Public Participation in Scientific Research (PPSR) Common Conceptual Model and the Open Geospatial Consortium (OGC) standards. Its design is intended to fulfil the needs of different stakeholders, as illustrated by several case studies which demonstrate the model's applicability.
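    As a rough illustration of how such a conceptual model might be rendered in code, the following Python dataclasses sketch projects, participants, and activities. The class and field names loosely paraphrase the chapter's core concepts and are not the normative PPSR/OGC definitions.

```python
# Rough sketch of the chapter's core concepts as Python dataclasses.
# Class and field names loosely paraphrase the conceptual model
# (projects, participants, activities); they are not the normative
# PPSR/OGC definitions.

from dataclasses import dataclass, field

@dataclass
class Participant:
    participant_id: str
    name: str
    roles: list[str] = field(default_factory=list)  # e.g. "observer", "validator"

@dataclass
class Activity:
    activity_id: str
    description: str
    performed_by: Participant

@dataclass
class CitizenScienceProject:
    project_id: str
    title: str
    participants: list[Participant] = field(default_factory=list)
    activities: list[Activity] = field(default_factory=list)

alice = Participant("p1", "Alice", roles=["observer"])
project = CitizenScienceProject(
    "cs42", "Urban Bird Counts",
    participants=[alice],
    activities=[Activity("a1", "Weekly bird survey", alice)])
print(len(project.participants), "participant(s)")
```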

    A domain categorisation of vocabularies based on a deep learning classifier.

    The publication of large amounts of open data has become a major trend nowadays. This is a consequence of projects like the Linked Open Data (LOD) community, which publishes and integrates datasets using techniques like Linked Data. Linked Data publishers should follow a set of principles for dataset design; these are described in a 2011 document that lists tasks such as considering the reuse of existing vocabularies. With regard to the latter, another project called Linked Open Vocabularies (LOV) attempts to compile the vocabularies used in LOD. These vocabularies have been classified by domain following the subjective criteria of LOV members, which carries the inherent risk of introducing personal biases. In this paper, we present an automatic classifier of vocabularies based on the main categories of the well-known knowledge source Wikipedia. For this purpose, word-embedding models were used in combination with Deep Learning techniques. Results show that with a hybrid model of a regular Deep Neural Network (DNN), a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN), vocabularies could be classified with an accuracy of 93.57 per cent. Specifically, 36.25 per cent of the vocabularies belong to the Culture category.
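    A hybrid architecture of this kind can be sketched in a few lines of Keras. The layer sizes, vocabulary size, and number of categories below are illustrative assumptions, not the paper's actual hyperparameters; the three branches (convolutional, recurrent, and plain feed-forward) mirror the CNN/RNN/DNN combination described in the abstract.

```python
# Hedged sketch of a hybrid DNN/RNN/CNN text classifier over word
# embeddings, in the spirit of the model described above. All sizes
# and counts are illustrative assumptions.

from tensorflow.keras import Input, Model, layers

VOCAB, MAXLEN, N_CLASSES = 20_000, 200, 15

inp = Input(shape=(MAXLEN,))
emb = layers.Embedding(VOCAB, 128)(inp)               # word-embedding layer

cnn = layers.Conv1D(64, 5, activation="relu")(emb)    # CNN branch
cnn = layers.GlobalMaxPooling1D()(cnn)

rnn = layers.LSTM(64)(emb)                            # RNN branch

dnn = layers.GlobalAveragePooling1D()(emb)            # plain DNN branch
dnn = layers.Dense(64, activation="relu")(dnn)

merged = layers.concatenate([cnn, rnn, dnn])          # fuse the three views
out = layers.Dense(N_CLASSES, activation="softmax")(merged)

model = Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```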

    Building Semantic Knowledge Graphs from (Semi-)Structured Data: A Review

    Knowledge graphs have, for the past decade, been a hot topic both in public and private domains, typically used for large-scale integration and analysis of data using graph-based data models. One of the central concepts in this area is the Semantic Web, with the vision of providing a well-defined meaning to information and services on the Web through a set of standards. Particularly, linked data and ontologies have been quite essential for data sharing, discovery, integration, and reuse. In this paper, we provide a systematic literature review on knowledge graph creation from structured and semi-structured data sources using Semantic Web technologies. The review takes into account four prominent publication venues, namely, Extended Semantic Web Conference, International Semantic Web Conference, Journal of Web Semantics, and Semantic Web Journal. The review highlights the tools, methods, types of data sources, ontologies, and publication methods, together with the challenges, limitations, and lessons learned in the knowledge graph creation processes.
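    One recurring pattern in the reviewed literature is lifting (semi-)structured rows into RDF triples with a small mapping. The sketch below shows this with rdflib over an in-memory CSV; the example namespace and properties are made up for illustration and do not correspond to a specific ontology from the review.

```python
# Illustrative sketch of a common knowledge-graph-creation pattern:
# lifting CSV-like rows into RDF triples. The namespace and properties
# below are a made-up example vocabulary, not a specific ontology
# from the reviewed papers.

import csv, io
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/kg#")

rows = io.StringIO("id,name,affiliation\n1,Ada,Univ A\n2,Grace,Univ B\n")

g = Graph()
g.bind("ex", EX)
for row in csv.DictReader(rows):
    person = URIRef(f"http://example.org/kg/person/{row['id']}")
    g.add((person, RDF.type, EX.Person))
    g.add((person, EX.name, Literal(row["name"])))
    g.add((person, EX.affiliation, Literal(row["affiliation"])))

print(g.serialize(format="turtle"))
```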

    European Language Grid

    This open access book provides an in-depth description of the EU project European Language Grid (ELG). Its motivation lies in the fact that Europe is a multilingual society with 24 official European Union Member State languages and dozens of additional languages including regional and minority languages. The only meaningful way to enable multilingualism and to benefit from this rich linguistic heritage is through Language Technologies (LT) including Natural Language Processing (NLP), Natural Language Understanding (NLU), Speech Technologies and language-centric Artificial Intelligence (AI) applications. The European Language Grid provides a single umbrella platform for the European LT community, including research and industry, effectively functioning as a virtual home, marketplace, showroom, and deployment centre for all services, tools, resources, products and organisations active in the field. Today the ELG cloud platform already offers access to more than 13,000 language processing tools and language resources. It enables all stakeholders to deposit, upload and deploy their technologies and datasets. The platform also supports the long-term objective of establishing digital language equality in Europe by 2030 – to create a situation in which all European languages enjoy equal technological support. This is the very first book dedicated to Language Technology and NLP platforms. Cloud technology has only recently matured enough to make the development of a platform like ELG feasible on a larger scale. The book comprehensively describes the results of the ELG project. Following an introduction, the content is divided into four main parts: (I) ELG Cloud Platform; (II) ELG Inventory of Technologies and Resources; (III) ELG Community and Initiative; and (IV) ELG Open Calls and Pilot Projects.