9 research outputs found

    Measuring metadata quality

    Get PDF

    A FAIR approach to genomics

    Get PDF
    The aim of this thesis was to increase our understanding on how genome information leads to function and phenotype. To address these questions, I developed a semantic systems biology framework capable of extracting knowledge, biological concepts and emergent system properties, from a vast array of publicly available genome information. In chapter 2, Empusa is described as an infrastructure that bridges the gap between the intended and actual content of a database. This infrastructure was used in chapters 3 and 4 to develop the framework. Chapter 3 describes the development of the Genome Biology Ontology Language and the GBOL stack of supporting tools enforcing consistency within and between the GBOL definitions in the ontology (OWL) and the Shape Expressions (ShEx) language describing the graph structure. A practical implementation of a semantic systems biology framework for FAIR (de novo) genome annotation is provided in chapter 4. The semantic framework and genome annotation tool described in this chapter has been used throughout this thesis to consistently, structurally and functionally annotate and mine microbial genomes used in chapter 5-10. In chapter 5, we introduced how the concept of protein domains and corresponding architectures can be used in comparative functional genomics to provide for a fast, efficient and scalable alternative to sequence-based methods. This allowed us to effectively compare and identify functional variations between hundreds to thousands of genomes. In chapter 6, we used 432 available complete Pseudomonas genomes to study the relationship between domain essentiality and persistence. In this chapter the focus was mainly on domains involved in metabolic functions. The metabolic domain space was explored for domain essentiality and persistence through the integration of heterogeneous data sources including six published metabolic models, a vast gene expression repository and transposon data. In chapter 7, the correlation between the expected and observed genotypes was explored using 16S-rRNA phylogeny and protein domain class content as input. In this chapter it was shown that domain class content yields a higher resolution in comparison to 16S-rRNA when analysing evolutionary distances. Using protein domain classes, we also were able to identify signifying domains, which may have important roles in shaping a species. To demonstrate the use of semantic systems biology workflows in a biotechnological setting we expanded the resource with more than 80.000 bacterial genomes. The genomic information of this resource was mined using a top down approach to identify strains having the trait for 1,3-propanediol production. This resulted in the molecular identification of 49 new species. In addition, we also experimentally verified that 4 species were capable of producing 1,3-propanediol. As discussed in chapter 10, the here developed semantic systems biology workflows were successfully applied in the discovery of key elements in symbiotic relationships, to improve functional genome annotation and in comparative genomics studies. Wet/dry-lab collaboration was often at the basis of the obtained results. The success of the collaboration between the wet and dry field, prompted me to develop an undergraduate course in which the concept of the “Moist” workflow was introduced (Chapter 9).</p

    User-centered semantic dataset retrieval

    Get PDF
    Finding relevant research data is an increasingly important but time-consuming task in daily research practice. Several studies report on difficulties in dataset search, e.g., scholars retrieve only partial pertinent data, and important information can not be displayed in the user interface. Overcoming these problems has motivated a number of research efforts in computer science, such as text mining and semantic search. In particular, the emergence of the Semantic Web opens a variety of novel research perspectives. Motivated by these challenges, the overall aim of this work is to analyze the current obstacles in dataset search and to propose and develop a novel semantic dataset search. The studied domain is biodiversity research, a domain that explores the diversity of life, habitats and ecosystems. This thesis has three main contributions: (1) We evaluate the current situation in dataset search in a user study, and we compare a semantic search with a classical keyword search to explore the suitability of semantic web technologies for dataset search. (2) We generate a question corpus and develop an information model to figure out on what scientific topics scholars in biodiversity research are interested in. Moreover, we also analyze the gap between current metadata and scholarly search interests, and we explore whether metadata and user interests match. (3) We propose and develop an improved dataset search based on three components: (A) a text mining pipeline, enriching metadata and queries with semantic categories and URIs, (B) a retrieval component with a semantic index over categories and URIs and (C) a user interface that enables a search within categories and a search including further hierarchical relations. Following user centered design principles, we ensure user involvement in various user studies during the development process

    A Survey on Semantic Processing Techniques

    Full text link
    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyzed five semantic processing tasks, e.g., word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missed in the published version due to the publication policies. Please contact Prof. Erik Cambria for detail

    In Search of a Common Thread: Enhancing the LBD Workflow with a view to its Widespread Applicability

    Get PDF
    Literature-Based Discovery (LBD) research focuses on discovering implicit knowledge linkages in existing scientific literature to provide impetus to innovation and research productivity. Despite significant advancements in LBD research, previous studies contain several open problems and shortcomings that are hindering its progress. The overarching goal of this thesis is to address these issues, not only to enhance the discovery component of LBD, but also to shed light on new directions that can further strengthen the existing understanding of the LBD work ow. In accordance with this goal, the thesis aims to enhance the LBD work ow with a view to ensuring its widespread applicability. The goal of widespread applicability is twofold. Firstly, it relates to the adaptability of the proposed solutions to a diverse range of problem settings. These problem settings are not necessarily application areas that are closely related to the LBD context, but could include a wide range of problems beyond the typical scope of LBD, which has traditionally been applied to scientific literature. Adapting the LBD work ow to problems outside the typical scope of LBD is a worthwhile goal, since the intrinsic objective of LBD research, which is discovering novel linkages in text corpora is valid across a vast range of problem settings. Secondly, the idea of widespread applicability also denotes the capability of the proposed solutions to be executed in new environments. These `new environments' are various academic disciplines (i.e., cross-domain knowledge discovery) and publication languages (i.e., cross-lingual knowledge discovery). The application of LBD models to new environments is timely, since the massive growth of the scientific literature has engendered huge challenges to academics, irrespective of their domain. This thesis is divided into five main research objectives that address the following topics: literature synthesis, the input component, the discovery component, reusability, and portability. The objective of the literature synthesis is to address the gaps in existing LBD reviews by conducting the rst systematic literature review. The input component section aims to provide generalised insights on the suitability of various input types in the LBD work ow, focusing on their role and potential impact on the information retrieval cycle of LBD. The discovery component section aims to intermingle two research directions that have been under-investigated in the LBD literature, `modern word embedding techniques' and `temporal dimension' by proposing diachronic semantic inferences. Their potential positive in uence in knowledge discovery is veri ed through both direct and indirect uses. The reusability section aims to present a new, distinct viewpoint on these LBD models by verifying their reusability in a timely application area using a methodical reuse plan. The last section, portability, proposes an interdisciplinary LBD framework that can be applied to new environments. While highly cost-e cient and easily pluggable, this framework also gives rise to a new perspective on knowledge discovery through its generalisable capabilities. Succinctly, this thesis presents novel and distinct viewpoints to accomplish five main research objectives, enhancing the existing understanding of the LBD work ow. The thesis offers new insights which future LBD research could further explore and expand to create more eficient, widely applicable LBD models to enable broader community benefits.Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202

    Opportunities and challenges presented by Wikidata in the context of biocuration

    No full text
    Abstract-Wikidata is a world readable and writable knowledge base maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open access knowledge graph spanning biology, medicine, and all other domains of knowledge. To meet this potential, social and technical challenges must be overcome most of which are familiar to the biocuration community. These include community ontology building, high precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data

    Opportunities and challenges presented by Wikidata in the context of biocuration

    No full text
    Abstract-Wikidata is a world readable and writable knowledge base maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open access knowledge graph spanning biology, medicine, and all other domains of knowledge. To meet this potential, social and technical challenges must be overcome -many of which are familiar to the biocuration community. These include community ontology building, high precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data
    corecore