
    In Search of a Common Thread: Enhancing the LBD Workflow with a view to its Widespread Applicability

    Literature-Based Discovery (LBD) research focuses on discovering implicit knowledge linkages in existing scientific literature to provide impetus to innovation and research productivity. Despite significant advances in LBD research, previous studies leave several open problems and shortcomings that hinder its progress. The overarching goal of this thesis is to address these issues, not only to enhance the discovery component of LBD, but also to shed light on new directions that can strengthen the existing understanding of the LBD workflow. In accordance with this goal, the thesis aims to enhance the LBD workflow with a view to ensuring its widespread applicability. The goal of widespread applicability is twofold. Firstly, it relates to the adaptability of the proposed solutions to a diverse range of problem settings. These problem settings are not necessarily application areas closely related to the LBD context, but could include a wide range of problems beyond the typical scope of LBD, which has traditionally been applied to scientific literature. Adapting the LBD workflow to problems outside this typical scope is a worthwhile goal, since the intrinsic objective of LBD research, which is discovering novel linkages in text corpora, is valid across a vast range of problem settings. Secondly, widespread applicability also denotes the capability of the proposed solutions to be executed in new environments. These 'new environments' are various academic disciplines (i.e., cross-domain knowledge discovery) and publication languages (i.e., cross-lingual knowledge discovery). The application of LBD models to new environments is timely, since the massive growth of the scientific literature poses huge challenges to academics, irrespective of their domain.
This thesis is divided into five main research objectives that address the following topics: literature synthesis, the input component, the discovery component, reusability, and portability. The objective of the literature synthesis is to address the gaps in existing LBD reviews by conducting the first systematic literature review. The input component section aims to provide generalised insights on the suitability of various input types in the LBD workflow, focusing on their role and potential impact on the information retrieval cycle of LBD. The discovery component section aims to combine two research directions that have been under-investigated in the LBD literature, 'modern word embedding techniques' and the 'temporal dimension', by proposing diachronic semantic inferences. Their potential positive influence on knowledge discovery is verified through both direct and indirect uses. The reusability section aims to present a new, distinct viewpoint on these LBD models by verifying their reusability in a timely application area using a methodical reuse plan. The last section, portability, proposes an interdisciplinary LBD framework that can be applied to new environments. While highly cost-efficient and easily pluggable, this framework also gives rise to a new perspective on knowledge discovery through its generalisable capabilities. In short, this thesis presents novel and distinct viewpoints to accomplish five main research objectives, enhancing the existing understanding of the LBD workflow. The thesis offers new insights which future LBD research could further explore and expand to create more efficient, widely applicable LBD models with broader community benefits. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202

    Gene-disease association with literature based enrichment

    Motivation: Gene set enrichment analysis (GSEA) annotates gene microarray data with functional information from the biomedical literature to improve gene-disease association prediction. We hypothesize that supplementing GSEA with comprehensive gene function catalogs built automatically using information extracted from the scientific literature will significantly enhance GSEA prediction quality. Methods: Gold standard gene sets for breast cancer (BrCa) and colorectal cancer (CRC) were derived from the literature. Two gene function catalogs (CMeSH and CUMLS) were automatically generated in two steps: (1) Entrez Gene was used to associate all recorded human genes with PubMed article IDs; (2) the genes mentioned in each PubMed article were associated with the article's MeSH terms (in CMeSH) and extracted UMLS concepts (in CUMLS). Microarray data from the Gene Expression Omnibus for BrCa and CRC was then annotated using CMeSH and CUMLS and, for comparison, also with several pre-existing catalogs (C2, C4 and C5 from the Molecular Signatures Database). Ranking was done using a standard GSEA implementation (GSEA-p). Gene function predictions for enriched array data were evaluated against the gold standard by measuring area under the receiver operating characteristic curve (AUC). Results: Comparison of ranking using the literature enrichment catalogs, the pre-existing catalogs and five randomly generated catalogs shows that the literature-derived enrichment catalogs are more effective. The AUC for BrCa using the unenriched gene expression dataset was 0.43, increasing to 0.89 after gene set enrichment with CUMLS. The AUC for CRC using the unenriched gene expression dataset was 0.54, increasing to 0.9 after enrichment with CMeSH. C2 increased AUC (BrCa 0.76, CRC 0.71) but C4 and C5 performed poorly (between 0.35 and 0.5). The randomly generated catalogs also performed poorly, equivalent to random guessing.
Discussion: Gene set enrichment significantly improved prediction of gene-disease association. Selection of enrichment catalog had a substantial effect on prediction accuracy. The literature-based catalogs performed better than the MSigDB catalogs, possibly because they are more recent. Catalogs generated automatically from the literature can be kept up to date. Conclusion: Prediction of gene-disease association is a fundamental task in biomedical research. GSEA provides a promising method when using literature-based enrichment catalogs. Availability: The literature-based catalogs generated and used in this study are available from http://www2.chi.unsw.edu.au/literature-enrichment. 6 page(s)
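
The two-step catalog construction and the AUC evaluation described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_catalog` and `auc`, the gene and article identifiers, the scores, and the gold standard are all hypothetical toy values chosen for illustration, and the GSEA-p ranking step itself is not reproduced (the scores simply stand in for its output).

```python
from collections import defaultdict


def build_catalog(gene_to_pmids, pmid_to_terms):
    """Invert gene -> articles -> terms into term -> set of genes.

    Mirrors the two steps in the abstract: genes are first linked to
    article IDs, then each gene inherits the terms of its articles.
    """
    catalog = defaultdict(set)
    for gene, pmids in gene_to_pmids.items():
        for pmid in pmids:
            for term in pmid_to_terms.get(pmid, ()):
                catalog[term].add(gene)
    return dict(catalog)


def auc(scores, positives):
    """Rank-based AUC: the probability that a randomly chosen positive
    gene scores higher than a randomly chosen negative gene (ties = 0.5).
    """
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy inputs (not real Entrez Gene or PubMed records).
gene_to_pmids = {
    "GENE_A": ["pmid1", "pmid2"],
    "GENE_B": ["pmid2"],
    "GENE_C": ["pmid3"],
}
pmid_to_terms = {
    "pmid1": ["Breast Neoplasms"],
    "pmid2": ["Breast Neoplasms", "Apoptosis"],
    "pmid3": ["Apoptosis"],
}
catalog = build_catalog(gene_to_pmids, pmid_to_terms)

# Hypothetical ranking scores and gold-standard gene set.
scores = {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.6, "GENE_D": 0.1}
gold_standard = {"GENE_A", "GENE_B"}
toy_auc = auc(scores, gold_standard)  # 3 of 4 positive/negative pairs ordered correctly
```

An AUC of 0.5 corresponds to random guessing, which is the baseline the abstract reports for the randomly generated catalogs.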