
    Characterizing RDF graphs through graph-based measures - Framework and assessment

    The topological structure of RDF graphs inherently differs from that of other types of graphs, such as social graphs, due to the pervasive presence of hierarchical relations (TBox), which complement transversal relations (ABox). Graph measures capture such particularities through descriptive statistics. Besides the classical set of measures established in the field of network analysis, such as the size and volume of the graph or the type of degree distribution of its vertices, there has been some effort to define measures that capture the particularities RDF graphs adhere to. However, some of these measures are redundant, computationally expensive, and not meaningful enough to describe RDF graphs. In particular, it is not clear which of them efficiently capture the distinguishing characteristics of datasets in different knowledge domains (e.g., Cross Domain vs. Linguistics). In this work, we address the problem of identifying a minimal set of measures that is efficient, essential (non-redundant), and meaningful. Based on 54 measures and a sample of 280 graphs of nine knowledge domains from the Linked Open Data Cloud, we identify an essential set of 13 measures that describe graphs concisely and capture the topological structure and differences of datasets in established knowledge domains.
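
    As a rough illustration of the kind of measures involved, the sketch below computes a few of the classical ones (size, volume, degree statistics) over an RDF graph using rdflib and the Python standard library. The input file name is a placeholder, and the computation does not reproduce the paper's 54-measure catalogue or its TBox/ABox distinction.

```python
from collections import Counter
from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")   # hypothetical input file

out_deg, in_deg = Counter(), Counter()
for s, p, o in g:                          # one pass over all triples
    out_deg[s] += 1
    in_deg[o] += 1

vertices = set(out_deg) | set(in_deg)
size, volume = len(vertices), len(g)       # number of vertices / number of triples
print("size:", size, "volume:", volume)
print("mean out-degree:", volume / len(out_deg))
print("max in-degree:", max(in_deg.values()))
```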

    Characteristic sets profile features: Estimation and application to SPARQL query planning

    RDF dataset profiling is the task of extracting a formal representation of a dataset's features. Such features may cover various aspects of the RDF dataset, ranging from information on licensing and provenance to statistical descriptors of the data distribution and its semantics. In this work, we focus on the characteristic sets profile features, which capture both structural and semantic information of an RDF dataset, making them a valuable resource for different downstream applications. While previous research demonstrated the benefits of characteristic sets in centralized and federated query processing, access to these fine-grained statistics is taken for granted. However, especially in federated query processing, computing this profile feature is challenging, as it can be difficult and/or costly to access and process the entire data from all federation members. We address this shortcoming by introducing the concept of a profile feature estimation and propose a sampling-based approach to generate estimations for the characteristic sets profile feature. In addition, we showcase the applicability of these feature estimations in federated querying by proposing a query planning approach that is specifically designed to leverage them. In our first experimental study, we intrinsically evaluate our approach on the representativeness of the feature estimation. The results show that even small samples of just 0.5% of the original graph's entities allow for estimating both structural and statistical properties of the characteristic sets profile features. Our second experimental study extrinsically evaluates the estimations by investigating their applicability in our query planner using the well-known FedBench benchmark. The results of the experiments show that the estimated profile features allow for obtaining efficient query plans.
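
    To make the notion concrete, the following sketch derives a characteristic sets profile (the set of predicates per subject, with counts) from a random entity sample and extrapolates the counts to the full graph. It is a simplified reading of the idea, not the paper's estimator: the file name, sampling strategy, and scaling are assumptions, and for brevity the sketch still scans the whole graph once, which is exactly the cost the federated setting tries to avoid.

```python
import random
from collections import Counter, defaultdict
from rdflib import Graph

def estimate_cs_profile(g, sample_fraction=0.005, seed=42):
    """Estimate the characteristic sets profile from a sample of entities.

    The characteristic set of a subject is the set of predicates it occurs with;
    the profile maps each such set to an (estimated) number of subjects.
    """
    preds_by_subject = defaultdict(set)
    for s, p, _ in g:                        # group predicates by subject
        preds_by_subject[s].add(p)

    subjects = list(preds_by_subject)
    random.seed(seed)
    sample = random.sample(subjects, max(1, int(len(subjects) * sample_fraction)))

    counts = Counter(frozenset(preds_by_subject[s]) for s in sample)
    scale = len(subjects) / len(sample)      # extrapolate sample counts to the full graph
    return {cs: round(c * scale) for cs, c in counts.items()}

g = Graph()
g.parse("dataset.nt", format="nt")           # hypothetical input file
profile = estimate_cs_profile(g)
```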

    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of IT professionals and researchers, and it is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
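
    The simpler, single-column statistics mentioned above can be sketched in a few lines of pandas; the input file and column handling below are hypothetical, and the value-pattern heuristic is only one of many possible choices.

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Single-column profile: a few of the simpler statistics named in the survey."""
    non_null = series.dropna().astype(str)
    # crude value pattern: digits become 9, letters become A, other characters stay
    patterns = (non_null.str.replace(r"\d", "9", regex=True)
                        .str.replace(r"[A-Za-z]", "A", regex=True))
    return {
        "dtype": str(series.dtype),
        "nulls": int(series.isna().sum()),
        "distinct": int(series.nunique()),
        "top_patterns": patterns.value_counts().head(3).to_dict(),
    }

df = pd.read_csv("customers.csv")            # hypothetical input file
profile = {col: profile_column(df[col]) for col in df.columns}
```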

    Linked Data Quality Assessment and its Application to Societal Progress Measurement

    In recent years, the Linked Data (LD) paradigm has emerged as a simple mechanism for employing the Web as a medium for data and knowledge integration where both documents and data are linked. Moreover, the semantics and structure of the underlying data are kept intact, making this the Semantic Web. LD essentially entails a set of best practices for publishing and connecting structured data on the Web, which allows publishing and exchanging information in an interoperable and reusable fashion. Many different communities on the Internet, such as geographic, media, life sciences and government, have already adopted these LD principles. This is confirmed by the dramatically growing Linked Data Web, where currently more than 50 billion facts are represented. With the emergence of the Web of Linked Data, there are several use cases which are possible due to the rich and disparate data integrated into one global information space. Linked Data, in these cases, not only assists in building mashups by interlinking heterogeneous and dispersed data from multiple sources but also empowers the uncovering of meaningful and impactful relationships. These discoveries have paved the way for scientists to explore the existing data and uncover meaningful outcomes that they might not have been aware of previously. In all these use cases utilizing LD, one crippling problem is the underlying data quality. Incomplete, inconsistent or inaccurate data affects the end results gravely, thus making them unreliable. Data quality is commonly conceived as fitness for use, be it for a certain application or use case. There are cases in which datasets that contain quality problems are still useful for certain applications, depending on the use case at hand. Thus, LD consumption has to deal with the problem of getting the data into a state in which it can be exploited for real use cases. Insufficient data quality can be caused either by the LD publication process or be intrinsic to the data source itself. A key challenge is to assess the quality of datasets published on the Web and make this quality information explicit. Assessing data quality is particularly a challenge in LD, as the underlying data stems from a set of multiple, autonomous and evolving data sources. Moreover, the dynamic nature of LD makes assessing the quality crucial to measure the accuracy of representing the real-world data. On the document Web, data quality can only be indirectly or vaguely defined, but there is a requirement for more concrete and measurable data quality metrics for LD. Such data quality metrics include correctness of facts with respect to the real world, adequacy of semantic representation, quality of interlinks, interoperability, timeliness or consistency with regard to implicit information. Even though data quality is an important concept in LD, there are few methodologies proposed to assess the quality of these datasets. Thus, in this thesis, we first unify 18 data quality dimensions and provide a total of 69 metrics for assessment of LD. The first methodology includes the employment of LD experts for the assessment. This assessment is performed with the help of the TripleCheckMate tool, which was developed specifically to assist LD experts in assessing the quality of a dataset, in this case DBpedia. The second methodology is a semi-automatic process, in which the first phase involves the detection of common quality problems by the automatic creation of an extended schema for DBpedia.
The second phase involves the manual verification of the generated schema axioms. Thereafter, we employ the wisdom of the crowd, i.e. workers of online crowdsourcing platforms such as Amazon Mechanical Turk (MTurk), to assess the quality of DBpedia. We then compare the two approaches (the previous assessment by LD experts and the assessment by MTurk workers in this study) in order to measure the feasibility of each type of user-driven data quality assessment methodology. Additionally, we evaluate another semi-automated methodology for LD quality assessment, which also involves human judgement. In this semi-automated methodology, selected metrics are formally defined and implemented as part of a tool, namely R2RLint. The user is provided not only with the results of the assessment but also with the specific entities that cause the errors, which helps users understand the quality issues and fix them. Finally, we consider a domain-specific use case that consumes LD and relies on data quality. In particular, we identify four LD sources, assess their quality using the R2RLint tool and then utilize them in building the Health Economic Research (HER) Observatory. We show the advantages of this semi-automated assessment over the other types of quality assessment methodologies discussed earlier. The Observatory aims at evaluating the impact of research development on the economic and healthcare performance of each country per year. We illustrate the usefulness of LD in this use case and the importance of quality assessment for any data analysis.
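
    As a hedged illustration of what a single automated metric might look like (not one of the thesis's 69 metric definitions, nor the R2RLint implementation), the sketch below computes a crude interlinking score: the fraction of subject resources that carry at least one owl:sameAs link. The input file is a placeholder.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

def interlinking_degree(g: Graph) -> float:
    """Fraction of distinct subject IRIs with at least one owl:sameAs link."""
    subjects = {s for s in g.subjects() if isinstance(s, URIRef)}
    linked = {s for s in g.subjects(OWL.sameAs, None) if isinstance(s, URIRef)}
    return len(linked) / len(subjects) if subjects else 0.0

g = Graph()
g.parse("dbpedia_sample.ttl", format="turtle")   # hypothetical input file
print("interlinking degree:", interlinking_degree(g))
```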

    Exploiting Context-Dependent Quality Metadata for Linked Data Source Selection

    The traditional Web is evolving into the Web of Data, which consists of huge collections of structured data over poorly controlled, distributed data sources. Live queries are needed to get current information out of this global data space. In live query processing, source selection deserves attention, since it allows us to identify the sources that are likely to contain the relevant data. The thesis proposes a source selection technique in the context of live query processing on Linked Open Data, which takes into account the context of the request and the quality of data contained in the sources to enhance the relevance (since the context enables a better interpretation of the request) and the quality of the answers (which will be obtained by processing the request on the selected sources). Specifically, the thesis proposes an extension of the QTree indexing structure, originally proposed as a data summary to support source selection based on source content, to take quality and contextual information into account. With reference to a specific case study, the thesis also contributes an approach, relying on the Luzzu framework, to assess the quality of a source for a given context (according to different quality dimensions). An experimental evaluation of the proposed techniques is also provided.
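
    The following sketch only illustrates the intuition of combining content relevance with context-dependent quality when ranking sources; the QTree extension proposed in the thesis folds this information into the index itself, and the weighting scheme, field names, and example scores here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Source:
    endpoint: str
    relevance: float          # e.g. estimated from a QTree-style data summary
    quality: dict             # context -> quality score, e.g. {"tourism": 0.8}

def select_sources(sources, context, alpha=0.5, k=3):
    """Rank sources by a weighted mix of relevance and context-dependent quality."""
    def score(src):
        return alpha * src.relevance + (1 - alpha) * src.quality.get(context, 0.0)
    return sorted(sources, key=score, reverse=True)[:k]

candidates = [
    Source("http://example.org/sparql/a", 0.9, {"tourism": 0.3}),
    Source("http://example.org/sparql/b", 0.6, {"tourism": 0.9}),
]
print([s.endpoint for s in select_sources(candidates, "tourism", k=1)])
```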

    An Environment for Analyzing and Assessing the Quality of Big and Linked Data (ΠžΠΊΡ€ΡƒΠΆΠ΅ΡšΠ΅ Π·Π° Π°Π½Π°Π»ΠΈΠ·Ρƒ ΠΈ ΠΎΡ†Π΅Π½Ρƒ ΠΊΠ²Π°Π»ΠΈΡ‚Π΅Ρ‚Π° Π²Π΅Π»ΠΈΠΊΠΈΡ… ΠΈ ΠΏΠΎΠ²Π΅Π·Π°Π½ΠΈΡ… ΠΏΠΎΠ΄Π°Ρ‚Π°ΠΊΠ°)

    Linking and publishing data in the Linked Open Data format increases the interoperability and discoverability of resources over the Web. To accomplish this, the process comprises several design decisions based on the Linked Data principles, which, on the one hand, recommend using standards for the representation of and access to data on the Web and, on the other hand, recommend setting hyperlinks between data from different sources. Despite the efforts of the World Wide Web Consortium (W3C), the main international standards organization for the World Wide Web, there is no single tailored formula for publishing data as Linked Data. In addition, the quality of the published Linked Open Data (LOD) is a fundamental issue, and it is yet to be thoroughly managed and considered. The main objective of this doctoral thesis is to design and implement a novel framework for selecting, analyzing, converting, interlinking, and publishing data from diverse sources, while paying great attention to quality assessment throughout all steps and modules of the framework. The goal is to examine whether and to what extent Semantic Web technologies are applicable for merging data from different sources and enabling end users to obtain additional information that was not available in the individual datasets, in addition to the integration into the Semantic Web community space. Additionally, the thesis intends to validate the applicability of the process in a specific and demanding use case, i.e. creating and publishing an Arabic Linked Drug Dataset based on open drug datasets from selected Arab countries, and to discuss the quality issues observed along the linked data life-cycle. To that end, a Semantic Data Lake was established in the pharmaceutical domain that allows further integration and the development of different business services on top of the integrated data sources. Through data representation in an open, machine-readable format, the approach offers an optimal solution for information and data dissemination, for building domain-specific applications, and for enriching and gaining value from the original dataset. This thesis showcases how the pharmaceutical domain benefits from the evolving research trends for building competitive advantages. However, as elaborated in this thesis, a better understanding of the specifics of the Arabic language is required to extend the utilization of linked data technologies in targeted Arabic organizations.
Linking and publishing data in the Linked Open Data format increases the interoperability and discoverability of resources over the Web. The process is based on the Linked Data principles (W3C, 2006), which on the one hand elaborate standards for representing and accessing data on the Web (RDF, OWL, SPARQL) and, on the other hand, suggest the use of hyperlinks between data from different sources. Despite the efforts of the W3C consortium (the main international standards organization for the Web), there is no single formula for implementing the process of publishing data in the Linked Data format. Considering that the quality of published linked open data is decisive for the future development of the Web, the main goals of this doctoral dissertation are (1) the design and implementation of an innovative framework for selecting, analyzing, converting, interlinking, and publishing data from different sources and (2) an analysis of the application of this approach in the pharmaceutical domain. The dissertation investigates in detail the question of the quality of big and linked data ecosystems (Linked Data Ecosystems), taking into account the possibility of reusing open data. The work is motivated by the need to enable researchers from Arab countries to link their data with open data, such as DBpedia, using Semantic Web technologies. The goal is to examine whether open data from Arab countries enables end users to obtain additional information that is not available in the individual datasets, in addition to the integration into the Semantic Web space. The dissertation proposes a methodology for developing an application for working with Linked Data and implements a software solution that enables querying a consolidated dataset of drugs from selected Arab countries. The consolidated dataset is implemented in the form of a Semantic Data Lake. This thesis shows how the pharmaceutical industry benefits from applying innovative technologies and research trends from the field of semantic technologies. However, as elaborated in this thesis, a better understanding of the specifics of the Arabic language is required in order to implement Linked Data tools and apply them to data from Arab countries.
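
    A minimal sketch of the interlinking step described above, assuming hypothetical namespaces and drug identifiers: a local drug resource is described in RDF and linked to DBpedia via owl:sameAs, then serialized for publication.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Hypothetical namespaces and identifiers, for illustration only.
DRUG = Namespace("http://example.org/arabic-drugs/")
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
drug = DRUG["paracetamol-500mg"]
g.add((drug, RDF.type, DRUG.Drug))
g.add((drug, RDFS.label, Literal("Paracetamol 500 mg", lang="en")))
g.add((drug, OWL.sameAs, DBR.Paracetamol))        # hyperlink into an external dataset

g.serialize(destination="arabic_drugs.ttl", format="turtle")
```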

    What are Links in Linked Open Data? A Characterization and Evaluation of Links between Knowledge Graphs on the Web

    Linked Open Data promises to provide guiding principles for publishing interlinked knowledge graphs on the Web in the form of findable, accessible, interoperable and reusable datasets. We argue that, while Linked Data may thus be viewed as a basis for instantiating the FAIR principles, there are still a number of open issues that cause significant data quality problems even when knowledge graphs are published as Linked Data. Firstly, in order to define the boundaries of single coherent knowledge graphs within Linked Data, a principled notion of what a dataset is, or, respectively, what links within and between datasets are, has been missing. Secondly, we argue that, in order to enable FAIR knowledge graphs, Linked Data lacks standardised findability and accessibility mechanisms via a single entry link. In order to address the first issue, we (i) propose a rigorous definition of a naming authority for a Linked Data dataset, (ii) define different link types for data in Linked datasets, (iii) provide an empirical analysis of linkage among the datasets of the Linked Open Data cloud, and (iv) analyse the dereferenceability of those links. We base our analyses and link computations on a scalable mechanism implemented on top of the HDT format, which allows us to analyse the quantity and quality of different link types at scale.
    Series: Working Papers on Information Systems, Information Business and Operations
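
    A toy approximation of the internal/external link distinction described in this abstract, using the IRI host name as a stand-in for the naming authority; the paper's definition is more principled, and its analysis runs over HDT files at LOD-cloud scale.

```python
from urllib.parse import urlparse

def naming_authority(iri: str) -> str:
    """Approximate a dataset's naming authority by the IRI's host name."""
    return urlparse(iri).netloc.lower()

def link_type(subject_iri: str, object_iri: str) -> str:
    """Classify a triple's link as internal or external to the subject's authority."""
    return "internal" if naming_authority(subject_iri) == naming_authority(object_iri) else "external"

print(link_type("http://dbpedia.org/resource/Vienna",
                "http://dbpedia.org/resource/Austria"))      # internal
print(link_type("http://dbpedia.org/resource/Vienna",
                "http://www.wikidata.org/entity/Q1741"))     # external
```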