8 research outputs found

    Towards More Usable Dataset Search: From Query Characterization to Snippet Generation

    Full text link
    Reusing published datasets on the Web is of great interest to researchers and developers. Their data needs may be met by submitting queries to a dataset search engine to retrieve relevant datasets. In this ongoing work towards developing a more usable dataset search engine, we characterize real data needs by annotating the semantics of 1,947 queries using a novel fine-grained scheme, to provide implications for enhancing dataset search. Based on the findings, we present a query-centered framework for dataset search, explore an implementation of snippet generation, and evaluate it with a preliminary user study. Comment: 4 pages, The 28th ACM International Conference on Information and Knowledge Management (CIKM 2019).
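
    As a rough illustration of the snippet-generation step only (not the method of the paper, whose annotation scheme and framework are not reproduced here), a query-biased snippet can be assembled by scoring each sentence of a dataset description by its overlap with the query terms and keeping the best-scoring sentences. The function name and scoring heuristic below are assumptions for illustration.

        # Illustrative only: a simple query-term-overlap heuristic, not the paper's snippet method.
        def generate_snippet(query, description, max_sentences=2):
            """Pick the description sentences that share the most terms with the query."""
            terms = set(query.lower().split())
            sentences = [s.strip() for s in description.split(".") if s.strip()]
            scored = sorted(sentences,
                            key=lambda s: len(terms & set(s.lower().split())),
                            reverse=True)
            return ". ".join(scored[:max_sentences]) + "."

        print(generate_snippet("air quality sensor measurements",
                               "Hourly measurements from urban air quality sensors. "
                               "Data cover 2015-2020. Collected by the city council."))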

    Recommending Datasets for Scientific Problem Descriptions

    Get PDF
    The steadily rising number of datasets is making it increasingly difficult for researchers and practitioners to be aware of all datasets, particularly of the most relevant datasets for a given research problem. To this end, dataset search engines have been proposed. However, they are based on users' keywords and thus have difficulty determining precisely fitting datasets for complex research problems. In this paper, we propose a system that recommends suitable datasets based on a given research problem description. The recommendation task is designed as a domain-specific text classification task. As shown in a comprehensive offline evaluation using various state-of-the-art models, as well as 88,000 paper abstracts and 265,000 citation contexts as research problem descriptions, we obtain an F1-score of 0.75. In an additional user study, we show that users in real-world settings report 88% satisfaction in all test cases. We therefore see promising future directions for dataset recommendation.
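
    A minimal sketch of the recommendation setup described above, framed as text classification over research-problem descriptions. The training pairs, labels, and model choice (TF-IDF plus logistic regression via scikit-learn) are assumptions for illustration, not the models evaluated in the paper.

        # Minimal text-classification sketch: map a research-problem description to a dataset label.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Hypothetical training data: (problem description, recommended dataset) pairs.
        texts = ["classify news articles by topic",
                 "detect objects in street-level images",
                 "answer questions over encyclopedic text"]
        labels = ["AG-News", "COCO", "SQuAD"]

        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
        model.fit(texts, labels)
        print(model.predict(["question answering over Wikipedia passages"]))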

    Framework to Automatically Determine the Quality of Open Data Catalogs

    Full text link
    Data catalogs play a crucial role in modern data-driven organizations by facilitating the discovery, understanding, and utilization of diverse data assets. However, ensuring their quality and reliability is complex, especially in open and large-scale data environments. This paper proposes a framework to automatically determine the quality of open data catalogs, addressing the need for efficient and reliable quality assessment mechanisms. Our framework can analyze core quality dimensions such as accuracy, completeness, consistency, scalability, and timeliness; it also offers several alternatives for assessing compatibility and similarity across such catalogs, and implements a set of non-core quality dimensions such as provenance, readability, and licensing. The goal is to empower data-driven organizations to make informed decisions based on trustworthy and well-curated data assets. The source code that illustrates our approach can be downloaded from https://www.github.com/jorge-martinez-gil/dataq/. Comment: 25 pages.
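
    A minimal sketch of how one core dimension, completeness, could be scored for a single catalog entry: the fraction of expected metadata fields that are filled. The field list and the example entry are assumptions for illustration and are not taken from the dataq implementation linked above.

        # Hypothetical field list; the actual framework's checks may differ.
        EXPECTED_FIELDS = ["title", "description", "license", "publisher", "modified", "distribution"]

        def completeness(entry):
            """Share of expected metadata fields that are present and non-empty."""
            filled = sum(1 for f in EXPECTED_FIELDS if entry.get(f))
            return filled / len(EXPECTED_FIELDS)

        entry = {"title": "Air quality 2020", "description": "Hourly PM2.5 readings",
                 "license": "CC-BY-4.0", "publisher": "", "modified": "2021-03-01"}
        print(f"completeness = {completeness(entry):.2f}")   # 4 of 6 fields filled -> 0.67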

    Enabling Automatic Discovery and Querying of Web APIs at Web Scale using Linked Data Standards

    Get PDF
    To help in making sense of the ever-increasing number of data sources available on the Web, in this article we tackle the problem of enabling automatic discovery and querying of data sources at Web scale. To pursue this goal, we suggest to (1) provision rich descriptions of data sources and query services thereof, (2) leverage the power of Web search engines to discover data sources, and (3) rely on simple, well-adopted standards that come with extensive tooling. We apply these principles to the concrete case of SPARQL micro-services that aim at querying Web APIs using SPARQL. The proposed solution leverages SPARQL Service Description, SHACL, DCAT, VoID, Schema.org and Hydra to express a rich functional description that allows a software agent to decide whether a micro-service can help in carrying out a certain task. This description can be dynamically transformed into a Web page embedding rich markup data. This Web page is both a human-friendly documentation and a machine-readable description that makes it possible for humans and machines alike to discover and invoke SPARQL micro-services at Web scale, as if they were just another data source. We report on a prototype implementation that is available on-line for test purposes, and that can be effectively discovered using Google's Dataset Search engine
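
    A minimal sketch of the kind of machine-readable description such a Web page can embed: a schema.org/Dataset JSON-LD object generated in Python. The property values are placeholders, and the exact vocabulary mix (Service Description, SHACL, DCAT, VoID, Hydra) used by the SPARQL micro-services is not reproduced here.

        import json

        # Hypothetical micro-service description rendered as schema.org JSON-LD markup,
        # the kind of structured data a dataset search engine can crawl from the page.
        description = {
            "@context": "https://schema.org/",
            "@type": "Dataset",
            "name": "Example SPARQL micro-service",
            "description": "Queries a Web API and exposes the result as RDF via SPARQL.",
            "url": "https://example.org/sparql-ms/service",
            "distribution": {
                "@type": "DataDownload",
                "contentUrl": "https://example.org/sparql-ms/service?query=...",
                "encodingFormat": "application/sparql-results+json",
            },
        }
        print(json.dumps(description, indent=2))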

    Towards data-based search engines for RDF graphs: a reproducibility study

    Get PDF
    The RDF framework, thanks to its flexibility and versatility, is one of the most widely used formats for sharing data and knowledge on the Web. Nowadays, many RDF datasets and RDF knowledge repositories are available in the scientific and political domains and can easily be consulted through numerous open data portals. However, these RDF datasets cannot be fully exploited and accessed due to the absence of advanced search engines that allow users to retrieve the datasets that best suit their needs. Such systems address the Ad-Hoc RDF Datasets Retrieval task: answering a user keyword query with a ranking of 10 datasets ordered by relevance. Current systems are not very advanced and rely mainly on dataset metadata, which can be incomplete or unavailable, rather than on dataset content. ACORDAR is the first open test collection created to evaluate systems developed for the Ad-Hoc RDF Datasets Retrieval task. This test collection can boost the development and improvement of such systems and enable a possible switch from metadata-based to content-based search systems. The main focus of this thesis is a reproducibility study on the ACORDAR collection. We test how good, useful, and well suited this collection is for the Ad-Hoc RDF Datasets Retrieval task by reproducing the baseline systems developed by the ACORDAR creators and by discussing the reproducibility problems encountered during the development of the reproduced systems
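
    A minimal sketch of a metadata-based baseline for the retrieval task described above: rank datasets against a keyword query with BM25 over their titles and descriptions and return the top 10. This is a generic illustration, not one of the ACORDAR baseline systems reproduced in the thesis.

        # Illustrative metadata-only BM25 baseline (hand-rolled, pure Python).
        import math
        from collections import Counter

        def bm25_rank(query, datasets, k1=1.5, b=0.75):
            """datasets maps a dataset id to its metadata text (title + description + tags);
            returns the ids of the 10 highest-scoring datasets."""
            docs = {d: text.lower().split() for d, text in datasets.items()}
            N = len(docs)
            avgdl = sum(len(toks) for toks in docs.values()) / N
            df = Counter()                      # document frequency per term
            for toks in docs.values():
                df.update(set(toks))
            scores = {}
            for d, toks in docs.items():
                tf = Counter(toks)
                score = 0.0
                for term in query.lower().split():
                    if term not in tf:
                        continue
                    idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
                    denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
                    score += idf * tf[term] * (k1 + 1) / denom
                scores[d] = score
            return sorted(scores, key=scores.get, reverse=True)[:10]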

    Time and Memory Efficient Parallel Algorithm for Structural Graph Summaries and Two Extensions to Incremental Summarization and k-Bisimulation for Long k-Chaining

    Full text link
    We developed a flexible parallel algorithm for graph summarization based on vertex-centric programming and parameterized message passing. The base algorithm supports infinitely many structural graph summary models defined in a formal language. An extension of the parallel base algorithm allows incremental graph summarization. In this paper, we prove that the incremental algorithm is correct and show that updates are performed in time O(Δ · d^k), where Δ is the number of additions, deletions, and modifications to the input graph, d the maximum degree, and k is the maximum distance in the subgraphs considered. Although the iterative algorithm supports values of k > 1, it requires nested data structures for the message passing that are memory-inefficient. Thus, we extended the base summarization algorithm by a hash-based messaging mechanism to support a scalable iterative computation of graph summarizations based on k-bisimulation for arbitrary k. We empirically evaluate the performance of our algorithms using benchmark and real-world datasets. The incremental algorithm almost always outperforms the batch computation. We observe in our experiments that the incremental algorithm is faster even in cases when 50% of the graph database changes from one version to the next. The incremental computation requires a three-layered hash index, which has a low memory overhead of only 8% (±1%). Finally, the incremental summarization algorithm outperforms the batch algorithm even with fewer cores. The iterative parallel k-bisimulation algorithm computes summaries on graphs with over 10M edges within seconds. We show that the algorithm processes graphs of 100+M edges within a few minutes while having a moderate memory consumption of <150 GB. For the largest BSBM1B dataset with 1 billion edges, it computes k = 10 bisimulation in under an hour
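
    A minimal single-threaded sketch of the hash-based k-bisimulation idea: each vertex's round-i signature hashes its own label together with the set of (edge label, round-(i-1) neighbor signature) pairs, and vertices that share a signature after k rounds fall into the same summary block. This illustrates the concept only; the paper's parallel, vertex-centric, message-passing implementation is not reproduced here.

        # Conceptual sketch, not the paper's parallel implementation.
        from collections import defaultdict

        def k_bisimulation(vertices, edges, labels, k):
            """vertices: vertex ids; edges: (src, edge_label, dst) triples;
            labels: vertex id -> label; returns signature -> set of equivalent vertices."""
            out = defaultdict(list)
            for s, p, o in edges:
                out[s].append((p, o))
            sig = {v: hash(labels.get(v)) for v in vertices}    # round 0: own label only
            for _ in range(k):                                   # fold in neighbor signatures
                sig = {v: hash((labels.get(v),
                                frozenset((p, sig[o]) for p, o in out[v] if o in sig)))
                       for v in vertices}
            blocks = defaultdict(set)
            for v, s in sig.items():
                blocks[s].add(v)
            return blocks

        # Vertices a and b have the same label and the same 1-hop structure, so they share a block.
        print(k_bisimulation(vertices=["a", "b", "c"],
                             edges=[("a", "knows", "c"), ("b", "knows", "c")],
                             labels={"a": "Person", "b": "Person", "c": "Person"},
                             k=1))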

    Browsing Linked Data Catalogs with LODAtlas

    Get PDF
    The Web of Data is growing fast, as exemplified by the evolution of the Linked Open Data (LOD) cloud over the last ten years. One of the consequences of this growth is that it is becoming increasingly difficult for application developers and end-users to find the datasets that would be relevant to them. Semantic Web search engines, open data catalogs, datasets and frameworks such as LODStats and LOD Laundromat, are all useful but only give partial, even if complementary, views on what datasets are available on the Web. We introduce LODAtlas, a portal that enables users to find datasets of interest. Users can make different types of queries about both the datasets' metadata and contents, aggregated from multiple sources. They can then quickly evaluate the matching datasets' relevance, thanks to LODAtlas' summary visualizations of their general metadata, connections and contents