23 research outputs found

    User-centered semantic dataset retrieval

    Finding relevant research data is an increasingly important but time-consuming task in daily research practice. Several studies report difficulties in dataset search, e.g., scholars retrieve only part of the pertinent data, and important information cannot be displayed in the user interface. Overcoming these problems has motivated a number of research efforts in computer science, such as text mining and semantic search. In particular, the emergence of the Semantic Web opens a variety of novel research perspectives. Motivated by these challenges, the overall aim of this work is to analyze the current obstacles in dataset search and to propose and develop a novel semantic dataset search. The studied domain is biodiversity research, a domain that explores the diversity of life, habitats and ecosystems. This thesis makes three main contributions: (1) We evaluate the current situation in dataset search in a user study, and we compare a semantic search with a classical keyword search to explore the suitability of Semantic Web technologies for dataset search. (2) We generate a question corpus and develop an information model to determine which scientific topics scholars in biodiversity research are interested in. Moreover, we analyze the gap between current metadata and scholarly search interests, and we explore whether metadata and user interests match. (3) We propose and develop an improved dataset search based on three components: (A) a text mining pipeline that enriches metadata and queries with semantic categories and URIs, (B) a retrieval component with a semantic index over categories and URIs, and (C) a user interface that enables a search within categories and a search including further hierarchical relations. Following user-centered design principles, we ensure user involvement through various user studies during the development process.

    Does Term Expansion Matter for the Retrieval of Biodiversity Data?

    While term expansion techniques are well investigated for many domains, semantic enrichment of keyword queries for the retrieval of scientific datasets still receives little attention. In particular, a systematic analysis of which kinds of semantically related concepts lead to the most relevant results is missing. Based on query expansion techniques, we semantically enriched search queries provided by biodiversity researchers to answer specific research questions. We applied them to a system indexing over 92,856 biological metadata files harvested from GFBio, the German Federation for Biological Data. We compared the outcome with the original keyword-based query. The results reveal that enriched keywords deliver a larger number of relevant datasets and that datasets retrieved based on keywords and their synonyms were judged more relevant. Query expansion with other related concepts returned a mixed picture.
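The synonym-based expansion described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synonym table, the function name `expand_query` and the Boolean `AND`/`OR` query syntax are all assumptions chosen for illustration.

```python
from typing import Dict, List

# Hypothetical synonym table (an assumption for this sketch); a real system
# would obtain synonyms from a terminology service or an ontology.
SYNONYMS: Dict[str, List[str]] = {
    "tree": ["woody plant"],
    "forest": ["woodland"],
}


def expand_query(keywords: List[str], synonyms: Dict[str, List[str]]) -> str:
    """Build a Boolean query in which each keyword is OR-ed with its synonyms."""
    clauses = []
    for keyword in keywords:
        variants = [keyword] + synonyms.get(keyword.lower(), [])
        # Quote multi-word terms so they are treated as phrases.
        quoted = [f'"{v}"' if " " in v else v for v in variants]
        if len(quoted) > 1:
            clauses.append("(" + " OR ".join(quoted) + ")")
        else:
            clauses.append(quoted[0])
    # All original keywords (or one of their synonyms) must match.
    return " AND ".join(clauses)


print(expand_query(["tree", "diversity", "forest"], SYNONYMS))
```

Keeping the original keyword as the first variant in every clause means the expanded query can only broaden, never replace, what the user typed.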

    How to Search for Biological Data? A Comparison of User Interfaces in a Semantic Search

    Data discovery is a frequent task in a scholar's daily work. In biodiversity, data search is a particular challenge. Here, scholars have complex information needs, such as the rich interplay of organisms and their environments, that cannot be unambiguously expressed with a traditional keyword search, e.g., Does tree diversity reduce competition in a subtropical forest? Therefore, data repositories usually offer interfaces that enable users to browse datasets by a pre-determined set of categories or facets. Faceted search is a good compromise between cumbersome user interfaces for structured queries (e.g., using SPARQL) and natural language queries that are hard for machines to interpret. Thus, developers can specify relevant relationships between entities explicitly and users can filter search results by selecting proper categories. For the given query, appropriate categories could be Organism and Habitat. However, there are two crucial design issues that have an impact on the effectiveness of category-based query interfaces: the choice of proper categories and the visual presentation of these categories in the query interface. In our work, we focus on the second aspect. We aim to develop two query interfaces: (a) a common one-box keyword search interface that automatically analyzes the entered terms with respect to their categories, and (b) a form-based query interface where users can enter their search keywords into a form with a query field per category. In both interfaces, the query keywords are matched against concepts in a knowledge base to make their semantics explicit. In case of a successful match, the URI is used to obtain the labels of all sub-concepts to expand the query before sending it to the search engine. Retrieved results are displayed in a list. The aim of our system is not to answer the question completely but to support users in retrieving relevant datasets that give hints to answer a research question.
In our talk, we will introduce the two interfaces and invite the conference participants to give feedback. We are particularly interested in a discussion on the appropriateness of the suggested user interfaces. Do scholars prefer a form-based user interface or only a one-field search? What other functions might be helpful, e.g., providing more information about other relations and properties from the concept in the ontology? What kind of explanations might be helpful to understand why a certain result was returned? KEYWORDS: user interfaces, semantic search, biological data, life sciences, biodiversity
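The matching and expansion step described above (keyword to concept URI, then the labels of all sub-concepts) can be sketched as follows. The tiny in-memory knowledge base, the `http://example.org/kb/` URIs and the function names are hypothetical; the abstract does not say which knowledge base or lookup mechanism the system actually uses.

```python
from collections import deque
from typing import Dict, List

# Hypothetical toy knowledge base (assumption): label -> URI, URI -> direct
# sub-concepts, and URI -> label. A real system would query an ontology.
LABEL_TO_URI: Dict[str, str] = {"tree": "http://example.org/kb/Tree"}
SUBCONCEPTS: Dict[str, List[str]] = {
    "http://example.org/kb/Tree": [
        "http://example.org/kb/Oak",
        "http://example.org/kb/Pine",
    ],
    "http://example.org/kb/Oak": ["http://example.org/kb/CorkOak"],
}
URI_TO_LABEL: Dict[str, str] = {
    "http://example.org/kb/Tree": "tree",
    "http://example.org/kb/Oak": "oak",
    "http://example.org/kb/Pine": "pine",
    "http://example.org/kb/CorkOak": "cork oak",
}


def sub_concept_labels(uri: str) -> List[str]:
    """Collect labels of all direct and indirect sub-concepts, breadth-first."""
    labels: List[str] = []
    seen = set()
    queue = deque(SUBCONCEPTS.get(uri, []))
    while queue:
        current = queue.popleft()
        if current in seen:  # guard against cycles and duplicate paths
            continue
        seen.add(current)
        labels.append(URI_TO_LABEL[current])
        queue.extend(SUBCONCEPTS.get(current, []))
    return labels


def expand_keyword(keyword: str) -> List[str]:
    """Return the keyword plus sub-concept labels if it matches a concept."""
    uri = LABEL_TO_URI.get(keyword.lower())
    if uri is None:
        return [keyword]  # no match: fall back to the plain keyword
    return [keyword] + sub_concept_labels(uri)


print(expand_keyword("Tree"))
```

Unmatched keywords are passed through unchanged, so the search degrades to an ordinary keyword search rather than failing.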

    Unfolding existing Data Publication Practice in Research Data Workflows in the Biological and Environmental Sciences – First Results from a Survey

    In recent years, data publication workflows have received increasing attention [1,2]. In order to obtain FAIR data [3], reviewers, data curators and other stakeholders have realized that not only the submitted data matter but also the underlying process that creates the data within existing research practice. A better understanding of existing data publication practices in research workflows will help service providers such as data repositories (Pangaea [4], ENA [5], GenBank [6]) to support their users with more appropriate services and tools when submitting data and, in turn, will sustain the role of data repositories in research practice. Such improved coordination will minimize the workload of researchers and data curators and will facilitate the review process for all stakeholders with respect to reproducibility. Furthermore, well-documented data publication workflows may improve data retrieval and, finally, data reuse in the long run. One obstacle to comprehensible and properly described research workflows is the fact that data publication workflows in the life sciences are hard to define. Scholars have their own individual disciplinary backgrounds, research skills and experiences. In some domains such as biodiversity, scholars work from several weeks to years to collect and analyze often heterogeneous data from various sources, such as collections and environmental or molecular data repositories. Thus, reconstructing their work process after the project is finalized is very difficult, if not impossible. Nevertheless, our goal is to reveal the state of the art of how scholars manage their data in their research practice. We are in the process of setting up a survey whose general structure is organized according to the GFBio Data Lifecycle [7]. The results will allow us to reveal typical data practice workflows that can be used to evaluate the suitability of existing data repository portals, such as GFBio [8]. In our talk, we present the first insights from the survey.
KEYWORDS: data publication workflows, data practices, biological and environmental data, green life sciences, biodiversity
REFERENCES:
1. Dallmeier-Tiessen, S., Khodiyar, V., Murphy, F., Nurnberger, A., Raymond, L., Whyte, A., 2017. Connecting Data Publication to the Research Workflow: A Preliminary Analysis, International Journal of Digital Curation, 12, https://doi.org/10.2218/ijdc.v12i1.533
2. González-Beltrán, A., Li, P., Zhao, J., Avila-Garcia, M. S., Roos, M., Thompson, M., van der Horst, E., Kaliyaperumal, R., Luo, R., Lee, T.-L., Lam, T., Edmunds, S. C., Sansone, S.-A., Rocca-Serra, P., 2015. From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles of Data Models and Workflows in Bioinformatics, PLOS ONE 10, 7, pp. 1–20, https://doi.org/10.1371/journal.pone.0127612
3. Wilkinson, M. D. et al., 2016. The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3, https://doi.org/10.1038/sdata.2016.18
4. Pangaea, https://www.pangaea.org
5. ENA, https://www.ebi.ac.uk/ena
6. GenBank, https://www.ncbi.nlm.nih.gov/genbank/
7. GFBio Data Lifecycle, https://www.gfbio.org/training/materials/data-lifecycle
8. GFBio, https://www.gfbio.or

    Dataset Search In Biodiversity Research: Do Metadata In Data Repositories Reflect Scholarly Information Needs?

    The increasing amount of research data provides the opportunity to link and integrate data to create novel hypotheses, to repeat experiments or to compare recent data to data collected at a different time or place. However, recent studies have shown that retrieving relevant data for data reuse is a time-consuming task in daily research practice. In this study, we explore what hampers dataset retrieval in biodiversity research, a field that produces a large amount of heterogeneous data. We analyze the primary source in dataset search, metadata, and determine whether they reflect scholarly search interests. We examine whether metadata standards provide elements corresponding to search interests, we inspect whether selected data repositories use metadata standards representing scholarly interests, and we determine how many fields of the metadata standards used are filled. To determine search interests in biodiversity research, we gathered 169 questions that researchers aimed to answer with the help of retrieved data, identified biological entities and grouped them into 13 categories. Our findings indicate that environments, materials and chemicals, species, biological and chemical processes, locations, data parameters and data types are important search interests in biodiversity research. The comparison with existing metadata standards shows that domain-specific standards cover search interests quite well, whereas general standards do not explicitly contain elements that reflect search interests. We inspect metadata from five large data repositories. Our results confirm that metadata currently reflect search interests in biodiversity research poorly. From these findings, we derive recommendations for researchers and data repositories on how to bridge the gap between search interests and the metadata provided.
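The step of determining how many metadata fields are filled can be approximated with a per-field fill rate. This is a sketch under stated assumptions, not the procedure used in the study: the field names, the sample records and the rule that `None` and whitespace-only values count as empty are all hypothetical.

```python
from typing import Any, Dict, List


def is_filled(value: Any) -> bool:
    """Treat None and empty or whitespace-only strings as unfilled (assumption)."""
    return value is not None and str(value).strip() != ""


def field_fill_rates(records: List[Dict[str, Any]], fields: List[str]) -> Dict[str, float]:
    """Return, for each field, the fraction of records with a filled value."""
    total = len(records)
    rates: Dict[str, float] = {}
    for field in fields:
        filled = sum(1 for record in records if is_filled(record.get(field)))
        rates[field] = filled / total if total else 0.0
    return rates


# Hypothetical metadata records; field names are illustrative only.
sample = [
    {"title": "Plot survey", "habitat": "forest", "species": ""},
    {"title": "Soil cores", "habitat": None, "species": "Quercus"},
]
print(field_fill_rates(sample, ["title", "habitat", "species"]))
```

A low rate for a field that corresponds to an important search interest is the kind of gap between metadata and user needs that the abstract describes.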

    A Test Collection for Dataset Retrieval in Biodiversity Research

    Searching for scientific datasets is a prominent task in scholars' daily research practice. A variety of data publishers, archives and data portals offer search applications that allow the discovery of datasets. The evaluation of such dataset retrieval systems requires proper test collections, including questions that reflect real-world information needs of scholars, a set of datasets and human judgements assessing the relevance of the datasets to the questions in the benchmark corpus. Unfortunately, only very few test collections exist for dataset search. In this paper, we introduce the BEF-China test collection, the very first test collection for dataset retrieval in biodiversity research, a research field with an increasing demand for data discovery services. The test collection consists of 14 questions, a corpus of 372 datasets from the BEF-China project and binary relevance judgements provided by a biodiversity expert.
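A test collection with binary relevance judgements is typically used to score a ranked result list per question. The sketch below computes precision at k and average precision, two standard information retrieval measures; the abstract does not state which measures the authors use, and the dataset identifiers are made up.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved datasets that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant hit."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant datasets that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Hypothetical ranking and judgements for a single question.
ranking = ["d1", "d2", "d3", "d4"]
judged_relevant = {"d1", "d3"}
print(precision_at_k(ranking, judged_relevant, 2))
print(average_precision(ranking, judged_relevant))
```

Averaging `average_precision` over all questions in a collection gives mean average precision, a single figure for comparing retrieval systems.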

    Data and its challenges on the path to end-to-end digitization in public administration - Contributions from three projects of the openDVA working group

    The implementation of the right to digital access (OZG) in Germany stops at the office door, focusing only on the needs of citizens. It does not cover any internal administrative processes and leaves out various stakeholders. For true end-to-end digitization, we need detailed, interoperable descriptions that can be exploited by all interested parties, including small to medium enterprises, decision-makers on all levels, individual administrative staff members, and future citizen developers. They all need a big picture and details on legal regulations, existing standards, and specific requirements. We aim to create such a knowledge base and demonstrate this using a first end-to-end digitized public service. Analyzing structured and unstructured data, for example, in the form of the text of a law addressing a public service, we derive a formal definition of the underlying process and necessary decisions. We enhance this with semantic annotation and link it to available standards. This forms the basis for innovative new services, such as a platform for citizen developers to easily create and change fully digitized public services, or educational modules that are automatically kept in sync with current developments.

    BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain

    Biodiversity is the assortment of life on earth covering evolutionary, ecological, biological, and social forms. To preserve life in all its variety and richness, it is imperative to monitor the current state of biodiversity and its change over time and to understand the forces driving it. This need has resulted in numerous works being published in this field. As a result, a large amount of textual data (publications) and metadata (e.g., dataset descriptions) has been generated. To support the management and analysis of these data, two techniques from computer science are of interest, namely Named Entity Recognition (NER) and Relation Extraction (RE). While the former enables better content discovery and understanding, the latter fosters the analysis by detecting connections between entities and thus allows us to draw conclusions and answer relevant domain-specific questions. To automatically predict entities and their relations, machine/deep learning techniques could be used. The training and evaluation of those techniques require labelled corpora. In this paper, we present two gold-standard corpora for NER and RE generated from biodiversity dataset metadata and abstracts that can be used as evaluation benchmarks for the development of new computer-supported tools that require machine learning or deep learning techniques. These corpora are manually labelled and verified by biodiversity experts. In addition, we explain the detailed steps of constructing these datasets. Moreover, we present the underlying ontology for the classes and relations used to annotate such corpora.

    Search Engines: How Do They Work and How Do You Search Correctly? (Suchmaschinen: Wie funktionieren sie und wie sucht man richtig?)

    Contents: "collecting" websites; building an index; result lists; selection criteria