Towards More Usable Dataset Search: From Query Characterization to Snippet Generation
Reusing published datasets on the Web is of great interest to researchers and
developers. Their data needs may be met by submitting queries to a dataset
search engine to retrieve relevant datasets. In this ongoing work towards
developing a more usable dataset search engine, we characterize real data needs
by annotating the semantics of 1,947 queries using a novel fine-grained scheme,
to provide implications for enhancing dataset search. Based on the findings, we
present a query-centered framework for dataset search, and explore the
implementation of snippet generation and evaluate it with a preliminary user
study. Comment: 4 pages, The 28th ACM International Conference on Information and
Knowledge Management (CIKM 2019).
Recommending Datasets for Scientific Problem Descriptions
The steadily rising number of datasets is making it increasingly difficult for researchers and practitioners to be aware of all datasets, particularly of the most relevant datasets for a given research problem. To this end, dataset search engines have been proposed. However, they are based on users' keywords and, thus, have difficulty determining precisely fitting datasets for complex research problems. In this paper, we propose a system that recommends suitable datasets based on a given research problem description. The recommendation task is designed as a domain-specific text classification task. As shown in a comprehensive offline evaluation using various state-of-the-art models, as well as 88,000 paper abstracts and 265,000 citation contexts as research problem descriptions, we obtain an F1-score of 0.75. In an additional user study, we show that users in real-world settings are satisfied in 88% of all test cases. We therefore see promising future directions for dataset recommendation.
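To make the "recommendation as text classification" framing concrete, here is a minimal pure-Python sketch: a nearest-centroid classifier over TF-IDF vectors, where each class label stands for a dataset and predicting the label of a problem description "recommends" that dataset. The toy corpus, labels, and tokenizer are invented for illustration and are far simpler than the state-of-the-art models evaluated in the paper.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return [t.strip(".,") for t in text.lower().split()]

class CentroidRecommender:
    """Toy nearest-centroid text classifier over TF-IDF vectors."""

    def fit(self, docs, labels):
        self.n_docs = len(docs)
        self.df = Counter()                      # document frequency per term
        for doc in docs:
            self.df.update(set(tokenize(doc)))
        sums = defaultdict(Counter)              # per-label TF-IDF sums
        counts = Counter(labels)
        for doc, label in zip(docs, labels):
            for term, tf in Counter(tokenize(doc)).items():
                sums[label][term] += tf * self._idf(term)
        # average the summed vectors into one centroid per label
        self.centroids = {
            label: {t: w / counts[label] for t, w in vec.items()}
            for label, vec in sums.items()
        }
        return self

    def _idf(self, term):
        return math.log((1 + self.n_docs) / (1 + self.df[term])) + 1.0

    def predict(self, query):
        qvec = {t: tf * self._idf(t) for t, tf in Counter(tokenize(query)).items()}
        def cosine(a, b):
            dot = sum(w * b.get(t, 0.0) for t, w in a.items())
            norm = lambda v: math.sqrt(sum(w * w for w in v.values())) or 1.0
            return dot / (norm(a) * norm(b))
        return max(self.centroids, key=lambda lbl: cosine(qvec, self.centroids[lbl]))

# Hypothetical training pairs: problem descriptions labelled with a dataset.
problems = [
    "named entity recognition for biomedical abstracts",
    "entity recognition and tagging in clinical text",
    "object detection in street scene images",
    "image classification of traffic signs and scenes",
]
labels = ["NCBI-Disease", "NCBI-Disease", "COCO", "COCO"]

model = CentroidRecommender().fit(problems, labels)
print(model.predict("recognition of disease entities in medical text"))
```

With enough labelled abstracts and citation contexts, the same shape scales up by swapping the centroid model for the neural classifiers the paper evaluates.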
Framework to Automatically Determine the Quality of Open Data Catalogs
Data catalogs play a crucial role in modern data-driven organizations by
facilitating the discovery, understanding, and utilization of diverse data
assets. However, ensuring their quality and reliability is complex, especially
in open and large-scale data environments. This paper proposes a framework to
automatically determine the quality of open data catalogs, addressing the need
for efficient and reliable quality assessment mechanisms. Our framework can
analyze various core quality dimensions, such as accuracy, completeness,
consistency, scalability, and timeliness, offer several alternatives for the
assessment of compatibility and similarity across such catalogs as well as the
implementation of a set of non-core quality dimensions such as provenance,
readability, and licensing. The goal is to empower data-driven organizations to
make informed decisions based on trustworthy and well-curated data assets. The
source code that illustrates our approach can be downloaded from
https://www.github.com/jorge-martinez-gil/dataq/. Comment: 25 pages.
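As a hedged illustration of how such quality dimensions can be scored, the sketch below computes completeness and timeliness over DCAT-like catalog records and averages them into a catalog-level score. The field names, the linear decay, and the equal weighting are assumptions made for this example, not the actual dataq implementation linked above.

```python
from datetime import date

# Toy catalog records in a DCAT-like shape; field names are illustrative.
CATALOG = [
    {"title": "Air quality 2023", "description": "Hourly PM10 readings.",
     "license": "CC-BY-4.0", "modified": date(2024, 3, 1)},
    {"title": "Traffic counts", "description": "",
     "license": None, "modified": date(2019, 6, 15)},
]

REQUIRED_FIELDS = ("title", "description", "license", "modified")

def completeness(record):
    """Fraction of required metadata fields that are non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f))
    return filled / len(REQUIRED_FIELDS)

def timeliness(record, today, max_age_days=365):
    """1.0 for a record updated today, decaying linearly to 0 at max_age_days."""
    if not record.get("modified"):
        return 0.0
    age = (today - record["modified"]).days
    return max(0.0, 1.0 - age / max_age_days)

def catalog_quality(catalog, today):
    """Equal-weight average of the per-record dimension scores."""
    scores = [(completeness(r) + timeliness(r, today)) / 2 for r in catalog]
    return sum(scores) / len(scores)

today = date(2024, 3, 31)
for rec in CATALOG:
    print(rec["title"], round(completeness(rec), 2), round(timeliness(rec, today), 2))
print("catalog score:", round(catalog_quality(CATALOG, today), 2))
```

Dimensions such as accuracy, consistency, or provenance need external evidence (schemas, source records) and would plug in as further scoring functions of the same shape.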
Enabling Automatic Discovery and Querying of Web APIs at Web Scale using Linked Data Standards
To help in making sense of the ever-increasing number of data sources available on the Web, in this article we tackle the problem of enabling automatic discovery and querying of data sources at Web scale. To pursue this goal, we suggest to (1) provision rich descriptions of data sources and query services thereof, (2) leverage the power of Web search engines to discover data sources, and (3) rely on simple, well-adopted standards that come with extensive tooling. We apply these principles to the concrete case of SPARQL micro-services that aim at querying Web APIs using SPARQL. The proposed solution leverages SPARQL Service Description, SHACL, DCAT, VoID, Schema.org and Hydra to express a rich functional description that allows a software agent to decide whether a micro-service can help in carrying out a certain task. This description can be dynamically transformed into a Web page embedding rich markup data. This Web page is both a human-friendly documentation and a machine-readable description that makes it possible for humans and machines alike to discover and invoke SPARQL micro-services at Web scale, as if they were just another data source. We report on a prototype implementation that is available on-line for test purposes, and that can be effectively discovered using Google's Dataset Search engine.
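To sketch how a machine-readable description ends up discoverable by a search engine, the snippet below builds a small Schema.org JSON-LD block and wraps it in the script tag a generated Web page would embed. The property choices and the endpoint URL are illustrative assumptions, not the exact vocabulary mandated by the SPARQL micro-services framework.

```python
import json

# Illustrative JSON-LD a micro-service's generated page could embed so that
# crawlers such as Google's Dataset Search can find it. The endpoint URL is
# hypothetical; real descriptions also carry SHACL/Hydra/VoID details.
service_description = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Photos by location (SPARQL micro-service)",
    "description": "Queries a photo-sharing Web API and returns metadata as RDF.",
    "potentialAction": {
        "@type": "SearchAction",
        "target": "https://example.org/sparql-ms/getPhotosByLocation",
    },
}

html_markup = (
    '<script type="application/ld+json">\n'
    + json.dumps(service_description, indent=2)
    + "\n</script>"
)
print(html_markup)
```

Because the markup rides on plain HTML and a widely adopted vocabulary, ordinary Web crawlers do the discovery work, which is exactly the "well-adopted standards with extensive tooling" principle above.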
Towards data-based search engines for RDF graphs: a reproducibility study
The RDF framework, thanks to its flexibility and versatility, is one of the most widely used formats for sharing data and knowledge on the Web. Nowadays, many RDF datasets and knowledge repositories are available in the scientific and political fields and can easily be consulted and downloaded from numerous open data portals. However, these RDF datasets cannot be fully exploited and accessed due to the absence of advanced search engines that allow users to retrieve the datasets that best suit their needs. Such systems address the Ad-Hoc RDF Dataset Retrieval task: answering a user's keyword query with a ranking of 10 datasets ordered by relevance. Current systems are not very advanced and rely mainly on dataset metadata, which may be incomplete or not always available, instead of on the datasets' content.
ACORDAR is the first open test collection created to evaluate systems developed for the Ad-Hoc RDF Dataset Retrieval task. This test collection can boost the development and improvement of these systems and enable a possible shift from metadata-based to content-based search systems.
The main focus of this thesis is a reproducibility study on the ACORDAR collection. We test how good, useful, and well suited this collection is for the Ad-Hoc RDF Dataset Retrieval task by reproducing the baseline systems developed by the ACORDAR creators and by discussing all the reproducibility problems encountered during the development of the reproduced systems.
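To illustrate the retrieval task itself, here is a minimal BM25 ranker over dataset metadata in pure Python: a keyword query in, a relevance-ordered list of dataset identifiers out. The corpus, identifiers, and parameter values are toy examples; ACORDAR's actual baselines are full IR systems, not this snippet.

```python
import math
from collections import Counter

# Toy metadata index: dataset id -> concatenated title/description tokens.
DATASETS = {
    "eurostat-gdp": "gross domestic product statistics european countries",
    "dbpedia-core": "encyclopedic knowledge graph extracted from wikipedia",
    "geonames":     "geographical database place names countries coordinates",
}

K1, B = 1.2, 0.75  # standard BM25 defaults

def bm25_rank(query, corpus, top_k=10):
    """Rank dataset ids by BM25 score of their metadata against the query."""
    docs = {d: text.split() for d, text in corpus.items()}
    avgdl = sum(len(t) for t in docs.values()) / len(docs)
    n = len(docs)
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            df = sum(1 for t in docs.values() if term in t)
            if df == 0 or term not in tf:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            score += idf * tf[term] * (K1 + 1) / (
                tf[term] + K1 * (1 - B + B * len(toks) / avgdl))
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(bm25_rank("countries statistics", DATASETS))
```

A content-based system, as advocated above, would index the RDF triples themselves rather than only this metadata text, but the query-to-ranked-list contract stays the same.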
Time and Memory Efficient Parallel Algorithm for Structural Graph Summaries and two Extensions to Incremental Summarization and k-Bisimulation for Long k-Chaining
We developed a flexible parallel algorithm for graph summarization based on
vertex-centric programming and parameterized message passing. The base
algorithm supports infinitely many structural graph summary models defined in a
formal language. An extension of the parallel base algorithm allows incremental
graph summarization. In this paper, we prove that the incremental algorithm is
correct and show that updates are performed in time O(Δ · d^k), where Δ is the number of additions, deletions, and modifications to the input graph, d the maximum degree, and k the maximum distance in the subgraphs considered. Although the iterative algorithm supports values of k > 1, it requires nested data structures for the message passing that are memory-inefficient. Thus, we extended the base summarization algorithm by a hash-based messaging mechanism to support a scalable iterative computation of graph summarizations based on k-bisimulation for arbitrary k. We empirically evaluate the performance of our algorithms using benchmark and real-world datasets. The incremental algorithm almost always outperforms the batch computation. We observe in our experiments that the incremental algorithm is faster even in cases when a large share of the graph database changes from one version to the next. The incremental computation requires a three-layered hash index, which has a low memory overhead. Finally, the incremental summarization algorithm outperforms the batch algorithm even with fewer cores. The iterative parallel k-bisimulation algorithm computes summaries on graphs with millions of edges within seconds, processes even larger graphs within a few minutes, and has moderate memory consumption. For the largest BSBM1B dataset with 1 billion edges, it computes the bisimulation in under an hour.
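The idea behind hash-based k-bisimulation can be sketched as iterative vertex colouring: in round i, each vertex's colour is a hash of its previous colour together with the sorted multiset of (edge label, neighbour colour) pairs, so two vertices share a colour after k rounds exactly when their outgoing neighbourhoods agree up to depth k. This is a sequential toy version of the idea, not the parallel vertex-centric implementation from the paper.

```python
def k_bisimulation(edges, vertices, k):
    """edges: list of (src, label, dst) triples; returns vertex -> colour.

    Vertices with equal colours are k-bisimilar on this toy definition
    (outgoing edges only); hash collisions are ignored for simplicity.
    """
    colour = {v: 0 for v in vertices}            # round 0: all vertices alike
    for _ in range(k):
        new_colour = {}
        for v in vertices:
            # the "message": sorted multiset of (label, neighbour colour)
            msgs = sorted((lbl, colour[dst]) for src, lbl, dst in edges if src == v)
            new_colour[v] = hash((colour[v], tuple(msgs)))
        colour = new_colour
    return colour

edges = [("a", "knows", "b"), ("c", "knows", "d"), ("b", "knows", "d")]
colours = k_bisimulation(edges, {"a", "b", "c", "d"}, k=2)
# a and c look alike at depth 1 (one 'knows' edge each) but differ at depth 2,
# because a's neighbour has an outgoing edge while c's does not.
print(colours["a"] == colours["c"])
```

Hashing the messages instead of materialising nested neighbourhood structures is what keeps the per-round state flat, which mirrors the memory argument made for the hash-based extension above.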
Browsing Linked Data Catalogs with LODAtlas
The Web of Data is growing fast, as exemplified by the evolution of the Linked Open Data (LOD) cloud over the last ten years. One of the consequences of this growth is that it is becoming increasingly difficult for application developers and end-users to find the datasets that would be relevant to them. Semantic Web search engines, open data catalogs, datasets and frameworks such as LODStats and LOD Laundromat, are all useful but only give partial, even if complementary, views on what datasets are available on the Web. We introduce LODAtlas, a portal that enables users to find datasets of interest. Users can make different types of queries about both the datasets' metadata and contents, aggregated from multiple sources. They can then quickly evaluate the matching datasets' relevance, thanks to LODAtlas' summary visualizations of their general metadata, connections and contents.