Characterizing RDF graphs through graph-based measures - Framework and assessment
The topological structure of RDF graphs inherently differs from that of other types of graphs, such as social graphs, due to the pervasive presence of hierarchical relations (TBox), which complement transversal relations (ABox). Graph measures capture such particularities through descriptive statistics. Besides the classical set of measures established in the field of network analysis, such as the size and volume of the graph or the type of degree distribution of its vertices, there has been some effort to define measures that capture the aforementioned particularities of RDF graphs. However, some of these measures are redundant, computationally expensive, and not meaningful enough to describe RDF graphs. In particular, it is not clear which of them are efficient metrics for capturing the specific distinguishing characteristics of datasets in different knowledge domains (e.g., Cross Domain vs. Linguistics). In this work, we address the problem of identifying a minimal set of measures that is efficient, essential (non-redundant), and meaningful. Based on 54 measures and a sample of 280 graphs from nine knowledge domains of the Linked Open Data Cloud, we identify an essential set of 13 measures capable of describing graphs concisely. These measures capture the topological structure and the differences of datasets in established knowledge domains.
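To make the flavour of such measures concrete, the following minimal Python sketch computes three of the classical network-analysis measures named above (size, volume, degree distribution) over a toy triple list. The identifiers are hypothetical, and this is an illustration only, not the paper's essential 13-measure set.

from collections import Counter

# A toy RDF graph as (subject, predicate, object) triples; illustrative only.
triples = [
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Bob", "rdf:type", "ex:Person"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Person", "rdfs:subClassOf", "ex:Agent"),  # hierarchical (TBox) relation
]

# Size: number of distinct vertices (subjects and objects).
vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
# Volume: number of edges, i.e. triples.
volume = len(triples)

# Degree distribution: edges incident to each vertex.
degree = Counter()
for s, _, o in triples:
    degree[s] += 1
    degree[o] += 1

print(f"size={len(vertices)}, volume={volume}")
print("max degree:", max(degree.values()))
print("mean degree:", sum(degree.values()) / len(degree))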
Characteristic sets profile features: Estimation and application to SPARQL query planning
RDF dataset profiling is the task of extracting a formal representation of a dataset's features. Such features may cover various aspects of the RDF dataset, ranging from information on licensing and provenance to statistical descriptors of the data distribution and its semantics. In this work, we focus on the characteristic sets profile features, which capture both structural and semantic information of an RDF dataset, making them a valuable resource for different downstream applications. While previous research has demonstrated the benefits of characteristic sets in centralized and federated query processing, access to these fine-grained statistics is taken for granted. However, especially in federated query processing, computing this profile feature is challenging, as it can be difficult and/or costly to access and process the entire data of all federation members. We address this shortcoming by introducing the concept of a profile feature estimation and propose a sampling-based approach to generate estimations of the characteristic sets profile features. In addition, we showcase the applicability of these feature estimations in federated querying by proposing a query planning approach that is specifically designed to leverage them. In our first experimental study, we intrinsically evaluate our approach with respect to the representativeness of the feature estimation. The results show that even small samples of just 0.5% of the original graph's entities allow for estimating both structural and statistical properties of the characteristic sets profile features. Our second experimental study extrinsically evaluates the estimations by investigating their applicability in our query planner using the well-known FedBench benchmark. The results of the experiments show that the estimated profile features allow for obtaining efficient query plans.
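As a rough illustration of the underlying statistic (not the authors' estimator), the sketch below computes characteristic sets, i.e. the distinct predicate sets per subject together with their counts, and then estimates them from a small entity sample by scaling up the sampled counts. All names and the scaling scheme are hypothetical.

import random
from collections import Counter

# Toy triples; a characteristic set is the set of predicates a subject uses.
triples = [
    ("ex:s1", "rdf:type", "ex:Person"), ("ex:s1", "ex:name", '"Ann"'),
    ("ex:s2", "rdf:type", "ex:Person"), ("ex:s2", "ex:name", '"Bob"'),
    ("ex:s3", "ex:name", '"Eve"'),
]

def characteristic_sets(triples):
    preds_by_subject = {}
    for s, p, _ in triples:
        preds_by_subject.setdefault(s, set()).add(p)
    # Count how many subjects share each predicate set.
    return Counter(frozenset(ps) for ps in preds_by_subject.values())

exact = characteristic_sets(triples)

# Sampling-based estimation: profile only a random subset of subjects and
# scale the counts up by the inverse sampling fraction.
subjects = list({s for s, _, _ in triples})
sample = set(random.sample(subjects, k=max(1, len(subjects) // 2)))
sampled = characteristic_sets([t for t in triples if t[0] in sample])
scale = len(subjects) / len(sample)
estimate = {cs: round(n * scale) for cs, n in sampled.items()}

print(exact)
print(estimate)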
Profiling relational data: a survey
Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
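The simpler single-column statistics mentioned above are easy to make concrete. The following sketch, with hypothetical data and a deliberately crude pattern notation, computes null counts, distinct counts, and the most frequent value pattern of one column.

from collections import Counter
import re

# Toy column values, including nulls; illustrative of single-column profiling.
column = ["alice@example.org", "bob@example.org", None, "carol@example.org", None]

nulls = sum(v is None for v in column)
distinct = len({v for v in column if v is not None})

# A crude value-pattern profile: letters -> 'a', digits -> '9', keep the rest.
def pattern(value):
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "a", value))

patterns = Counter(pattern(v) for v in column if v is not None)

print(f"nulls={nulls}, distinct={distinct}")
print("most frequent pattern:", patterns.most_common(1))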
Linked Data Quality Assessment and its Application to Societal Progress Measurement
In recent years, the Linked Data (LD) paradigm has emerged as a simple mechanism for employing the Web as a medium for data and knowledge integration, where both documents and data are linked. Moreover, the semantics and structure of the underlying data are kept intact, making this the Semantic Web. LD essentially entails a set of best practices for publishing and connecting structured data on the Web, which allows publishing and exchanging information in an interoperable and reusable fashion. Many different communities on the Internet, such as geographic, media, life sciences and government, have already adopted these LD principles. This is confirmed by the dramatically growing Linked Data Web, where currently more than 50 billion facts are represented.
With the emergence of the Web of Linked Data, several use cases become possible due to the rich and disparate data integrated into one global information space. Linked Data, in these cases, not only assists in building mashups by interlinking heterogeneous and dispersed data from multiple sources, but also empowers the uncovering of meaningful and impactful relationships. These discoveries have paved the way for scientists to explore the existing data and uncover meaningful outcomes that they might not have been aware of previously.
In all these use cases utilizing LD, one crippling problem is the underlying data quality. Incomplete, inconsistent or inaccurate data affects the end results gravely, thus making them unreliable. Data quality is commonly conceived as fitness for use, be it for a certain application or use case. There are cases in which datasets containing quality problems are still useful for certain applications; whether they are depends on the use case at hand. Thus, LD consumption has to deal with the problem of getting the data into a state in which it can be exploited for real use cases. Insufficient data quality can be caused by the LD publication process or can be intrinsic to the data source itself.
A key challenge is to assess the quality of datasets published on the Web and make this quality information explicit. Assessing data quality is particularly challenging in LD, as the underlying data stems from a set of multiple, autonomous and evolving data sources. Moreover, the dynamic nature of LD makes assessing their quality crucial for measuring the accuracy with which they represent real-world data. On the document Web, data quality can only be indirectly or vaguely defined, but there is a requirement for more concrete and measurable data quality metrics for LD. Such data quality metrics include correctness of facts w.r.t. the real world, adequacy of semantic representation, quality of interlinks, interoperability, timeliness, or consistency with regard to implicit information. Even though data quality is an important concept in LD, few methodologies have been proposed to assess the quality of these datasets.
Thus, in this thesis, we first unify 18 data quality dimensions and provide a total of 69 metrics for the assessment of LD. The first methodology includes the employment of LD experts for the assessment. This assessment is performed with the help of the TripleCheckMate tool, which was developed specifically to assist LD experts in assessing the quality of a dataset, in this case DBpedia. The second methodology is a semi-automatic process, in which the first phase involves the detection of common quality problems by the automatic creation of an extended schema for DBpedia. The second phase involves the manual verification of the generated schema axioms. Thereafter, we employ the wisdom of the crowds, i.e., workers of online crowdsourcing platforms such as Amazon Mechanical Turk (MTurk), to assess the quality of DBpedia. We then compare the two approaches (the previous assessment by LD experts and the assessment by MTurk workers in this study) in order to measure the feasibility of each type of user-driven data quality assessment methodology.
Additionally, we evaluate another semi-automated methodology for LD quality assessment, which also involves human judgement. In this semi-automated methodology, selected metrics are formally defined and implemented as part of a tool, namely R2RLint. The user is provided not only the results of the assessment but also the specific entities that cause the errors, which helps users understand the quality issues and fix them. Finally, we take into account a domain-specific use case that consumes LD and depends on data quality. In particular, we identify four LD sources, assess their quality using the R2RLint tool, and then utilize them in building the Health Economic Research (HER) Observatory. We show the advantages of this semi-automated assessment over the other types of quality assessment methodologies discussed earlier. The Observatory aims at evaluating the impact of research development on the economic and healthcare performance of each country per year. We illustrate the usefulness of LD in this use case and the importance of quality assessment for any data analysis.
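To give one concrete flavour of such metrics (a hypothetical simplification in the spirit of the thesis's quality dimensions, not an actual R2RLint metric definition), the sketch below scores the syntactic validity of date literals and, as R2RLint does in concept, reports the offending entities so that users can fix them.

import re

# Toy (subject, property, value) literals; one illustrative metric only.
date_literals = [
    ("ex:p1", "ex:birthDate", "1984-03-21"),
    ("ex:p2", "ex:birthDate", "21/03/1984"),   # malformed w.r.t. xsd:date
    ("ex:p3", "ex:birthDate", "1990-12-01"),
]

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def syntactic_validity(literals):
    """Fraction of values matching the expected xsd:date lexical form,
    returned together with the offending entities so users can fix them."""
    bad = [(s, v) for s, _, v in literals if not ISO_DATE.match(v)]
    score = 1 - len(bad) / len(literals)
    return score, bad

score, offenders = syntactic_validity(date_literals)
print(f"syntactic validity: {score:.2f}")   # 0.67
print("entities to fix:", offenders)        # [('ex:p2', '21/03/1984')]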
Exploiting Context-Dependent Quality Metadata for Linked Data Source Selection
The traditional Web is evolving into the Web of Data, which consists of huge collections of structured data over poorly controlled, distributed data sources. Live queries are needed to get current information out of this global data space. In live query processing, source selection deserves attention, since it allows us to identify the sources that are likely to contain the relevant data. The thesis proposes a source selection technique in the context of live query processing on Linked Open Data which takes into account the context of the request and the quality of the data contained in the sources, in order to enhance the relevance (since the context enables a better interpretation of the request) and the quality of the answers (which will be obtained by processing the request on the selected sources). Specifically, the thesis proposes an extension of the QTree indexing structure, which had been proposed as a data summary to support source selection based on source content, to take quality and contextual information into account. With reference to a specific case study, the thesis also contributes an approach, relying on the Luzzu framework, to assess the quality of a source with respect to a given context (according to different quality dimensions). An experimental evaluation of the proposed techniques is also provided.
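The sketch below is a deliberately simplistic stand-in for summary-based, quality- and context-aware source selection; it is not the thesis's QTree extension, and all source names, summaries, and scores are hypothetical. Each source is summarised by the predicates it serves plus a context-dependent quality score, and we keep the sources that can answer the query pattern, ranked by quality.

# Each source summary: predicates served + quality per context (hypothetical).
sources = {
    "src-a": {"predicates": {"ex:name", "ex:knows"}, "quality": {"health": 0.9, "geo": 0.4}},
    "src-b": {"predicates": {"ex:knows"}, "quality": {"health": 0.3, "geo": 0.8}},
}

def select_sources(query_predicates, context, min_quality=0.5):
    candidates = [
        (name, summary["quality"].get(context, 0.0))
        for name, summary in sources.items()
        if query_predicates <= summary["predicates"]  # summary covers the pattern
    ]
    # Keep sufficiently good sources for this context, best first.
    return sorted(
        [(n, q) for n, q in candidates if q >= min_quality],
        key=lambda pair: pair[1], reverse=True,
    )

print(select_sources({"ex:knows"}, context="health"))  # [('src-a', 0.9)]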
A framework for the analysis and quality assessment of big and linked data
Linking and publishing data in the Linked Open Data format increases the interoperability and discoverability of resources over the Web. To accomplish this, the process comprises several design decisions based on the Linked Data principles, which, on the one hand, recommend using standards for the representation of and access to data on the Web, and on the other hand, recommend setting hyperlinks between data from different sources.
Despite the efforts of the World Wide Web Consortium (W3C), the main international standards organization for the World Wide Web, there is no single tailored formula for publishing data as Linked Data. In addition, the quality of the published Linked Open Data (LOD) is a fundamental issue that is yet to be thoroughly managed and considered.
The main objective of this doctoral thesis is to design and implement a novel framework for selecting, analyzing, converting, interlinking, and publishing data from diverse sources, while paying close attention to quality assessment throughout all steps and modules of the framework. The goal is to examine whether and to what extent Semantic Web technologies are applicable for merging data from different sources and enabling end-users to obtain additional information that was not available in the individual datasets, in addition to the integration into the Semantic Web community space. Additionally, the Ph.D. thesis intends to validate the applicability of the process in a specific and demanding use case, i.e., creating and publishing an Arabic Linked Drug Dataset based on open drug datasets from selected Arabic countries, and to discuss the quality issues observed across the linked data life-cycle. To that end, a Semantic Data Lake was established in the pharmaceutical domain that allows further integration and the development of different business services on top of the integrated data sources. Through data representation in an open, machine-readable format, the approach offers an optimal solution for information and data dissemination, for building domain-specific applications, and for enriching and gaining value from the original datasets. The thesis showcases how the pharmaceutical domain benefits from these evolving research trends for building competitive advantages. However, as elaborated in the thesis, a better understanding of the specifics of the Arabic language is required to extend the utilization of linked data technologies in targeted Arabic organizations.
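A hypothetical outline of such a framework might chain its steps (selecting, analyzing, converting, interlinking, publishing) as plain functions. The stage names and signatures below are my own, not the thesis's module names, and the per-stage quality checks the thesis emphasises are omitted for brevity.

# A hypothetical outline of the framework's stages as composable functions;
# the stage names and signatures are illustrative, not the thesis's API.
def select(sources):        # choose candidate open datasets
    return [s for s in sources if s.get("open")]

def analyze(datasets):      # profile each dataset before conversion
    return [dict(d, rows=len(d.get("records", []))) for d in datasets]

def convert(datasets):      # map records to RDF-like triples
    return [("ex:" + r, "rdf:type", "ex:Drug")
            for d in datasets for r in d.get("records", [])]

def interlink(triples):     # naive interlinking step, e.g. owl:sameAs links
    return triples + [(s, "owl:sameAs", s.replace("ex:", "dbpedia:"))
                      for s, _, _ in triples]

def publish(triples):       # in reality: load into the Semantic Data Lake
    for t in triples:
        print(t)

publish(interlink(convert(analyze(select(
    [{"open": True, "records": ["aspirin"]}, {"open": False}])))))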
What are Links in Linked Open Data? A Characterization and Evaluation of Links between Knowledge Graphs on the Web
Linked Open Data promises to provide guiding principles for publishing interlinked knowledge graphs on the Web in the form of findable, accessible, interoperable and reusable datasets. We argue that, while Linked Data may as such be viewed as a basis for instantiating the FAIR principles, there are still a number of open issues that cause significant data quality problems even when knowledge graphs are published as Linked Data. Firstly, in order to define the boundaries of single coherent knowledge graphs within Linked Data, a principled notion of what a dataset is, or, respectively, of what links within and between datasets are, has been missing. Secondly, we argue that, in order to enable FAIR knowledge graphs, Linked Data lacks a standardised findability and accessibility mechanism via a single entry link. In order to address the first issue, we (i) propose a rigorous definition of a naming authority for a Linked Data dataset, (ii) define different link types for data in Linked datasets, (iii) provide an empirical analysis of linkage among the datasets of the Linked Open Data cloud, and (iv) analyse the dereferenceability of those links. We base our analyses and link computations on a scalable mechanism implemented on top of the HDT format, which allows us to analyse the quantity and quality of different link types at scale.
Series: Working Papers on Information Systems, Information Business and Operation
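As a rough illustration of one of these ideas (a simplification of the paper's naming-authority definition, with hypothetical data), the sketch below classifies a triple's object link as internal or external by comparing the hosts of the subject and object IRIs.

from urllib.parse import urlparse

# Hypothetical triples; the paper defines naming authorities more rigorously,
# here we approximate an IRI's authority by its host.
triples = [
    ("http://ex.org/Alice", "http://ex.org/knows", "http://ex.org/Bob"),
    ("http://ex.org/Alice", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Alice"),
]

def link_type(subject, obj):
    """'internal' if both IRIs share a naming authority (host), else 'external'."""
    return "internal" if urlparse(subject).netloc == urlparse(obj).netloc else "external"

for s, p, o in triples:
    print(p, "->", link_type(s, o))
# prints: .../knows -> internal, then owl#sameAs -> external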