Towards More Usable Dataset Search: From Query Characterization to Snippet Generation
Reusing published datasets on the Web is of great interest to researchers and
developers. Their data needs may be met by submitting queries to a dataset
search engine to retrieve relevant datasets. In this ongoing work towards
developing a more usable dataset search engine, we characterize real data needs
by annotating the semantics of 1,947 queries using a novel fine-grained scheme,
to provide implications for enhancing dataset search. Based on the findings, we
present a query-centered framework for dataset search, and explore the
implementation of snippet generation and evaluate it with a preliminary user
study. Comment: 4 pages, The 28th ACM International Conference on Information and
Knowledge Management (CIKM 2019).
Recommending Datasets for Scientific Problem Descriptions
The steadily rising number of datasets is making it increasingly difficult for researchers and practitioners to be aware of all datasets, particularly of the most relevant datasets for a given research problem. To this end, dataset search engines have been proposed. However, they are based on users' keywords and, thus, have difficulty determining precisely fitting datasets for complex research problems. In this paper, we propose a system that recommends suitable datasets based on a given research problem description. The recommendation task is designed as a domain-specific text classification task. As shown in a comprehensive offline evaluation using various state-of-the-art models, as well as 88,000 paper abstracts and 265,000 citation contexts as research problem descriptions, we obtain an F1-score of 0.75. In an additional user study, we show that users in real-world settings are satisfied in 88% of all test cases. We therefore see promising future directions for dataset recommendation.
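To make the "recommendation as text classification" framing concrete, here is a minimal pure-Python sketch: a nearest-centroid classifier over TF-IDF vectors, where each class label stands for a dataset and predicting the label of a problem description "recommends" that dataset. The toy corpus, labels, and tokenizer are invented for illustration and are far simpler than the state-of-the-art models evaluated in the paper.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return [t.strip(".,") for t in text.lower().split()]

class CentroidRecommender:
    """Toy nearest-centroid text classifier over TF-IDF vectors."""

    def fit(self, docs, labels):
        self.n_docs = len(docs)
        self.df = Counter()                      # document frequency per term
        for doc in docs:
            self.df.update(set(tokenize(doc)))
        sums = defaultdict(Counter)              # per-label TF-IDF sums
        counts = Counter(labels)
        for doc, label in zip(docs, labels):
            for term, tf in Counter(tokenize(doc)).items():
                sums[label][term] += tf * self._idf(term)
        # average the summed vectors into one centroid per label
        self.centroids = {
            label: {t: w / counts[label] for t, w in vec.items()}
            for label, vec in sums.items()
        }
        return self

    def _idf(self, term):
        return math.log((1 + self.n_docs) / (1 + self.df[term])) + 1.0

    def predict(self, query):
        qvec = {t: tf * self._idf(t) for t, tf in Counter(tokenize(query)).items()}
        def cosine(a, b):
            dot = sum(w * b.get(t, 0.0) for t, w in a.items())
            norm = lambda v: math.sqrt(sum(w * w for w in v.values())) or 1.0
            return dot / (norm(a) * norm(b))
        return max(self.centroids, key=lambda lbl: cosine(qvec, self.centroids[lbl]))

# Hypothetical training pairs: problem descriptions labelled with a dataset.
problems = [
    "named entity recognition for biomedical abstracts",
    "entity recognition and tagging in clinical text",
    "object detection in street scene images",
    "image classification of traffic signs and scenes",
]
labels = ["NCBI-Disease", "NCBI-Disease", "COCO", "COCO"]

model = CentroidRecommender().fit(problems, labels)
print(model.predict("recognition of disease entities in medical text"))
```

With enough labelled abstracts and citation contexts, the same shape scales up by swapping the centroid model for the neural classifiers the paper evaluates.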
Framework to Automatically Determine the Quality of Open Data Catalogs
Data catalogs play a crucial role in modern data-driven organizations by
facilitating the discovery, understanding, and utilization of diverse data
assets. However, ensuring their quality and reliability is complex, especially
in open and large-scale data environments. This paper proposes a framework to
automatically determine the quality of open data catalogs, addressing the need
for efficient and reliable quality assessment mechanisms. Our framework can
analyze various core quality dimensions, such as accuracy, completeness,
consistency, scalability, and timeliness, offer several alternatives for the
assessment of compatibility and similarity across such catalogs as well as the
implementation of a set of non-core quality dimensions such as provenance,
readability, and licensing. The goal is to empower data-driven organizations to
make informed decisions based on trustworthy and well-curated data assets. The
source code that illustrates our approach can be downloaded from
https://www.github.com/jorge-martinez-gil/dataq/. Comment: 25 pages.
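As a hedged illustration of how such quality dimensions can be scored, the sketch below computes completeness and timeliness over DCAT-like catalog records and averages them into a catalog-level score. The field names, the linear decay, and the equal weighting are assumptions made for this example, not the actual dataq implementation linked above.

```python
from datetime import date

# Toy catalog records in a DCAT-like shape; field names are illustrative.
CATALOG = [
    {"title": "Air quality 2023", "description": "Hourly PM10 readings.",
     "license": "CC-BY-4.0", "modified": date(2024, 3, 1)},
    {"title": "Traffic counts", "description": "",
     "license": None, "modified": date(2019, 6, 15)},
]

REQUIRED_FIELDS = ("title", "description", "license", "modified")

def completeness(record):
    """Fraction of required metadata fields that are non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f))
    return filled / len(REQUIRED_FIELDS)

def timeliness(record, today, max_age_days=365):
    """1.0 for a record updated today, decaying linearly to 0 at max_age_days."""
    if not record.get("modified"):
        return 0.0
    age = (today - record["modified"]).days
    return max(0.0, 1.0 - age / max_age_days)

def catalog_quality(catalog, today):
    """Equal-weight average of the per-record dimension scores."""
    scores = [(completeness(r) + timeliness(r, today)) / 2 for r in catalog]
    return sum(scores) / len(scores)

today = date(2024, 3, 31)
for rec in CATALOG:
    print(rec["title"], round(completeness(rec), 2), round(timeliness(rec, today), 2))
print("catalog score:", round(catalog_quality(CATALOG, today), 2))
```

Dimensions such as accuracy, consistency, or provenance need external evidence (schemas, source records) and would plug in as further scoring functions of the same shape.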
Enabling Automatic Discovery and Querying of Web APIs at Web Scale using Linked Data Standards
To help in making sense of the ever-increasing number of data sources available on the Web, in this article we tackle the problem of enabling automatic discovery and querying of data sources at Web scale. To pursue this goal, we suggest to (1) provision rich descriptions of data sources and query services thereof, (2) leverage the power of Web search engines to discover data sources, and (3) rely on simple, well-adopted standards that come with extensive tooling. We apply these principles to the concrete case of SPARQL micro-services that aim at querying Web APIs using SPARQL. The proposed solution leverages SPARQL Service Description, SHACL, DCAT, VoID, Schema.org and Hydra to express a rich functional description that allows a software agent to decide whether a micro-service can help in carrying out a certain task. This description can be dynamically transformed into a Web page embedding rich markup data. This Web page is both a human-friendly documentation and a machine-readable description that makes it possible for humans and machines alike to discover and invoke SPARQL micro-services at Web scale, as if they were just another data source. We report on a prototype implementation that is available on-line for test purposes, and that can be effectively discovered using Google's Dataset Search engine.
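To sketch how a machine-readable description ends up discoverable by a search engine, the snippet below builds a small Schema.org JSON-LD block and wraps it in the script tag a generated Web page would embed. The property choices and the endpoint URL are illustrative assumptions, not the exact vocabulary mandated by the SPARQL micro-services framework.

```python
import json

# Illustrative JSON-LD a micro-service's generated page could embed so that
# crawlers such as Google's Dataset Search can find it. The endpoint URL is
# hypothetical; real descriptions also carry SHACL/Hydra/VoID details.
service_description = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Photos by location (SPARQL micro-service)",
    "description": "Queries a photo-sharing Web API and returns metadata as RDF.",
    "potentialAction": {
        "@type": "SearchAction",
        "target": "https://example.org/sparql-ms/getPhotosByLocation",
    },
}

html_markup = (
    '<script type="application/ld+json">\n'
    + json.dumps(service_description, indent=2)
    + "\n</script>"
)
print(html_markup)
```

Because the markup rides on plain HTML and a widely adopted vocabulary, ordinary Web crawlers do the discovery work, which is exactly the "well-adopted standards with extensive tooling" principle above.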
Towards data-based search engines for RDF graphs: a reproducibility study
The RDF framework, thanks to its flexibility and versatility, is one of the most widely used formats for sharing data and knowledge on the Web. Nowadays, many RDF datasets and knowledge repositories are available in the scientific and political fields and can easily be consulted and downloaded from numerous open data portals. However, these RDF datasets cannot be fully exploited and accessed due to the absence of advanced search engines that allow users to retrieve the datasets that best suit their needs. Such systems address the Ad-Hoc RDF Dataset Retrieval task: answering a user's keyword query with a ranking of 10 datasets ordered by relevance. Current systems are not very advanced and rely mainly on dataset metadata, which may be incomplete or not always available, instead of on the datasets' content.
ACORDAR is the first open test collection created to evaluate systems developed for the Ad-Hoc RDF Dataset Retrieval task. This test collection can boost the development and improvement of these systems and enable a possible shift from metadata-based to content-based search systems.
The main focus of this thesis is a reproducibility study on the ACORDAR collection. We test how good, useful, and well suited this collection is for the Ad-Hoc RDF Dataset Retrieval task by reproducing the baseline systems developed by the ACORDAR creators and by discussing all the reproducibility problems encountered during the development of the reproduced systems.
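To illustrate the retrieval task itself, here is a minimal BM25 ranker over dataset metadata in pure Python: a keyword query in, a relevance-ordered list of dataset identifiers out. The corpus, identifiers, and parameter values are toy examples; ACORDAR's actual baselines are full IR systems, not this snippet.

```python
import math
from collections import Counter

# Toy metadata index: dataset id -> concatenated title/description tokens.
DATASETS = {
    "eurostat-gdp": "gross domestic product statistics european countries",
    "dbpedia-core": "encyclopedic knowledge graph extracted from wikipedia",
    "geonames":     "geographical database place names countries coordinates",
}

K1, B = 1.2, 0.75  # standard BM25 defaults

def bm25_rank(query, corpus, top_k=10):
    """Rank dataset ids by BM25 score of their metadata against the query."""
    docs = {d: text.split() for d, text in corpus.items()}
    avgdl = sum(len(t) for t in docs.values()) / len(docs)
    n = len(docs)
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            df = sum(1 for t in docs.values() if term in t)
            if df == 0 or term not in tf:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            score += idf * tf[term] * (K1 + 1) / (
                tf[term] + K1 * (1 - B + B * len(toks) / avgdl))
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(bm25_rank("countries statistics", DATASETS))
```

A content-based system, as advocated above, would index the RDF triples themselves rather than only this metadata text, but the query-to-ranked-list contract stays the same.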
Time and Memory Efficient Parallel Algorithm for Structural Graph Summaries and two Extensions to Incremental Summarization and k-Bisimulation for Long k-Chaining
We developed a flexible parallel algorithm for graph summarization based on
vertex-centric programming and parameterized message passing. The base
algorithm supports infinitely many structural graph summary models defined in a
formal language. An extension of the parallel base algorithm allows incremental
graph summarization. In this paper, we prove that the incremental algorithm is
correct and show that updates are performed in time O(Δ · d^k), where Δ is the number of additions, deletions, and modifications to the input graph, d the maximum degree, and k the maximum distance in the subgraphs considered. Although the iterative algorithm supports values of k > 1, it requires nested data structures for the message passing that are memory-inefficient. Thus, we extended the base summarization algorithm by a hash-based messaging mechanism to support a scalable iterative computation of graph summarizations based on k-bisimulation for arbitrary k. We empirically evaluate the performance of our algorithms using benchmark and real-world datasets. The incremental algorithm almost always outperforms the batch computation. We observe in our experiments that the incremental algorithm is faster even in cases when a large share of the graph database changes from one version to the next. The incremental computation requires a three-layered hash index, which has a low memory overhead. Finally, the incremental summarization algorithm outperforms the batch algorithm even with fewer cores. The iterative parallel k-bisimulation algorithm computes summaries on graphs with millions of edges within seconds, processes even larger graphs within a few minutes, and has moderate memory consumption. For the largest BSBM1B dataset with 1 billion edges, it computes the bisimulation in under an hour.
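The idea behind hash-based k-bisimulation can be sketched as iterative vertex colouring: in round i, each vertex's colour is a hash of its previous colour together with the sorted multiset of (edge label, neighbour colour) pairs, so two vertices share a colour after k rounds exactly when their outgoing neighbourhoods agree up to depth k. This is a sequential toy version of the idea, not the parallel vertex-centric implementation from the paper.

```python
def k_bisimulation(edges, vertices, k):
    """edges: list of (src, label, dst) triples; returns vertex -> colour.

    Vertices with equal colours are k-bisimilar on this toy definition
    (outgoing edges only); hash collisions are ignored for simplicity.
    """
    colour = {v: 0 for v in vertices}            # round 0: all vertices alike
    for _ in range(k):
        new_colour = {}
        for v in vertices:
            # the "message": sorted multiset of (label, neighbour colour)
            msgs = sorted((lbl, colour[dst]) for src, lbl, dst in edges if src == v)
            new_colour[v] = hash((colour[v], tuple(msgs)))
        colour = new_colour
    return colour

edges = [("a", "knows", "b"), ("c", "knows", "d"), ("b", "knows", "d")]
colours = k_bisimulation(edges, {"a", "b", "c", "d"}, k=2)
# a and c look alike at depth 1 (one 'knows' edge each) but differ at depth 2,
# because a's neighbour has an outgoing edge while c's does not.
print(colours["a"] == colours["c"])
```

Hashing the messages instead of materialising nested neighbourhood structures is what keeps the per-round state flat, which mirrors the memory argument made for the hash-based extension above.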
Browsing Linked Data Catalogs with LODAtlas
The Web of Data is growing fast, as exemplified by the evolution of the Linked Open Data (LOD) cloud over the last ten years. One of the consequences of this growth is that it is becoming increasingly difficult for application developers and end-users to find the datasets that would be relevant to them. Semantic Web search engines, open data catalogs, datasets and frameworks such as LODStats and LOD Laundromat, are all useful but only give partial, even if complementary, views on what datasets are available on the Web. We introduce LODAtlas, a portal that enables users to find datasets of interest. Users can make different types of queries about both the datasets' metadata and contents, aggregated from multiple sources. They can then quickly evaluate the matching datasets' relevance, thanks to LODAtlas' summary visualizations of their general metadata, connections and contents.