79 research outputs found

    Quantifying and filtering knowledge generated by literature based discovery

    Get PDF
    Background Literature based discovery (LBD) automatically infers missed connections between concepts in literature. It is often assumed that LBD generates more information than can be reasonably examined. Methods We present a detailed analysis of the quantity of hidden knowledge produced by an LBD system and the effect of various filtering approaches upon this. The investigation of filtering combined with single or multi-step linking term chains is carried out on all articles in PubMed. Results The evaluation is carried out using both replication of existing discoveries, which provides justification for multi-step linking chain knowledge in specific cases, and using timeslicing, which gives a large scale measure of performance. Conclusions While the quantity of hidden knowledge generated by LBD can be vast, we demonstrate that (a) intelligent filtering can greatly reduce the number of hidden knowledge pairs generated, (b) for a specific term, the number of single step connections can be manageable, and (c) in the absence of single step hidden links, considering multiple steps can provide valid links

    Identifying Patients with Pneumonia from Free-Text Intensive Care Unit Reports

    Get PDF
    Abstract Clinical research studying critical illness phenotypes relies on the identification of clinical syndromes defined by consensus definitions. Pneumonia is a prime example. Historically, identifying pneumonia has required manual chart review, which is a time and resource intensive process. The overall research goal of our work is to develop automated approaches that accurately identify critical illness phenotypes. In this paper, we describe our approach to the identification of pneumonia from electronic medical records, present our preliminary results, and describe future steps

    Exploring relation types for literature-based discovery

    Get PDF
    Objective Literature-based discovery (LBD) aims to identify “hidden knowledge” in the medical literature by: (1) analyzing documents to identify pairs of explicitly related concepts (terms), then (2) hypothesizing novel relations between pairs of unrelated concepts that are implicitly related via a shared concept to which both are explicitly related. Many LBD approaches use simple techniques to identify semantically weak relations between concepts, for example, document co-occurrence. These generate huge numbers of hypotheses, difficult for humans to assess. More complex techniques rely on linguistic analysis, for example, shallow parsing, to identify semantically stronger relations. Such approaches generate fewer hypotheses, but may miss hidden knowledge. The authors investigate this trade-off in detail, comparing techniques for identifying related concepts to discover which are most suitable for LBD. Materials and methods A generic LBD system that can utilize a range of relation types was developed. Experiments were carried out comparing a number of techniques for identifying relations. Two approaches were used for evaluation: replication of existing discoveries and the “time slicing” approach.1 Results Previous LBD discoveries could be replicated using relations based either on document co-occurrence or linguistic analysis. Using relations based on linguistic analysis generated many fewer hypotheses, but a significantly greater proportion of them were candidates for hidden knowledge. Discussion and Conclusion The use of linguistic analysis-based relations improves accuracy of LBD without overly damaging coverage. LBD systems often generate huge numbers of hypotheses, which are infeasible to manually review. Improving their accuracy has the potential to make these systems significantly more usabl

    The effect of word sense disambiguation accuracy on literature based discovery

    Get PDF
    Background The volume of research published in the biomedical domain has increasingly lead to researchers focussing on specific areas of interest and connections between findings being missed. Literature based discovery (LBD) attempts to address this problem by searching for previously unnoticed connections between published information (also known as “hidden knowledge”). A common approach is to identify hidden knowledge via shared linking terms. However, biomedical documents are highly ambiguous which can lead LBD systems to over generate hidden knowledge by hypothesising connections through different meanings of linking terms. Word Sense Disambiguation (WSD) aims to resolve ambiguities in text by identifying the meaning of ambiguous terms. This study explores the effect of WSD accuracy on LBD performance. Methods An existing LBD system is employed and four approaches to WSD of biomedical documents integrated with it. The accuracy of each WSD approach is determined by comparing its output against a standard benchmark. Evaluation of the LBD output is carried out using timeslicing approach, where hidden knowledge is generated from articles published prior to a certain cutoff date and a gold standard extracted from publications after the cutoff date. Results WSD accuracy varies depending on the approach used. The connection between the performance of the LBD and WSD systems are analysed to reveal a correlation between WSD accuracy and LBD performance. Conclusion This study reveals that LBD performance is sensitive to WSD accuracy. It is therefore concluded that WSD has the potential to improve the output of LBD systems by reducing the amount of spurious hidden knowledge that is generated. It is also suggested that further improvements in WSD accuracy have the potential to improve LBD accuracy

    Improving protein function prediction methods with integrated literature data

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity.</p> <p>Results</p> <p>We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably-sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes which global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10% which yield the best trade off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial.</p> <p>Conclusion</p> <p>Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily-available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.</p

    Integrated Bio-Entity Network: A System for Biological Knowledge Discovery

    Get PDF
    A significant part of our biological knowledge is centered on relationships between biological entities (bio-entities) such as proteins, genes, small molecules, pathways, gene ontology (GO) terms and diseases. Accumulated at an increasing speed, the information on bio-entity relationships is archived in different forms at scattered places. Most of such information is buried in scientific literature as unstructured text. Organizing heterogeneous information in a structured form not only facilitates study of biological systems using integrative approaches, but also allows discovery of new knowledge in an automatic and systematic way. In this study, we performed a large scale integration of bio-entity relationship information from both databases containing manually annotated, structured information and automatic information extraction of unstructured text in scientific literature. The relationship information we integrated in this study includes protein–protein interactions, protein/gene regulations, protein–small molecule interactions, protein–GO relationships, protein–pathway relationships, and pathway–disease relationships. The relationship information is organized in a graph data structure, named integrated bio-entity network (IBN), where the vertices are the bio-entities and edges represent their relationships. Under this framework, graph theoretic algorithms can be designed to perform various knowledge discovery tasks. We designed breadth-first search with pruning (BFSP) and most probable path (MPP) algorithms to automatically generate hypotheses—the indirect relationships with high probabilities in the network. We show that IBN can be used to generate plausible hypotheses, which not only help to better understand the complex interactions in biological systems, but also provide guidance for experimental designs

    Literature Mining for the Discovery of Hidden Connections between Drugs, Genes and Diseases

    Get PDF
    The scientific literature represents a rich source for retrieval of knowledge on associations between biomedical concepts such as genes, diseases and cellular processes. A commonly used method to establish relationships between biomedical concepts from literature is co-occurrence. Apart from its use in knowledge retrieval, the co-occurrence method is also well-suited to discover new, hidden relationships between biomedical concepts following a simple ABC-principle, in which A and C have no direct relationship, but are connected via shared B-intermediates. In this paper we describe CoPub Discovery, a tool that mines the literature for new relationships between biomedical concepts. Statistical analysis using ROC curves showed that CoPub Discovery performed well over a wide range of settings and keyword thesauri. We subsequently used CoPub Discovery to search for new relationships between genes, drugs, pathways and diseases. Several of the newly found relationships were validated using independent literature sources. In addition, new predicted relationships between compounds and cell proliferation were validated and confirmed experimentally in an in vitro cell proliferation assay. The results show that CoPub Discovery is able to identify novel associations between genes, drugs, pathways and diseases that have a high probability of being biologically valid. This makes CoPub Discovery a useful tool to unravel the mechanisms behind disease, to find novel drug targets, or to find novel applications for existing drugs

    Explicitly searching for useful inventions: dynamic relatedness and the costs of connecting versus synthesizing

    Get PDF
    Inventions combine technological features. When features are barely related, burdensomely broad knowledge is required to identify the situations that they share. When features are overly related, burdensomely broad knowledge is required to identify the situations that distinguish them. Thus, according to my first hypothesis, when features are moderately related, the costs of connecting and costs of synthesizing are cumulatively minimized, and the most useful inventions emerge. I also hypothesize that continued experimentation with a specific set of features is likely to lead to the discovery of decreasingly useful inventions; the earlier-identified connections reflect the more common consumer situations. Covering data from all industries, the empirical analysis provides broad support for the first hypothesis. Regressions to test the second hypothesis are inconclusive when examining industry types individually. Yet, this study represents an exploratory investigation, and future research should test refined hypotheses with more sophisticated data, such as that found in literature-based discovery research
    corecore