
    Use of Text Data in Identifying and Prioritizing Potential Drug Repositioning Candidates

    New drug development costs between 500 million and 2 billion dollars and takes 10-15 years, with a success rate of less than 10%. Drug repurposing (defined as discovering new indications for existing drugs) could play a significant role in drug development, especially considering the declining success rates of developing novel drugs. In the period 2007-2009, drug repurposing led to the launching of 30-40% of new drugs. Typically, new indications for existing medications are identified by accident. However, new technologies and a large number of available resources enable the development of systematic approaches to identify and validate drug-repurposing candidates with significantly lower cost. A variety of resources have been utilized to identify novel drug repurposing candidates such as biomedical literature, clinical notes, and genetic data. In this dissertation, we focused on using text data in identifying and prioritizing drug repositioning candidates and conducted five studies. In the first study, we aimed to assess the feasibility of using patient reviews from social media to identify potential candidates for drug repurposing. We retrieved patient reviews of 180 medications from an online forum, WebMD. Using dictionary-based and machine learning approaches, we identified disease names in the reviews. Several publicly available resources were used to exclude comments containing known indications and adverse drug effects. After manually reviewing some of the remaining comments, we implemented a rule-based system to identify beneficial effects. The dictionary-based system and machine learning system identified 2178 and 6171 disease names respectively in 64,616 patient comments. We provided a list of 10 common patterns that patients used to report any beneficial effects or uses of medication. After manually reviewing the comments tagged by our rule-based system, we identified five potential drug repurposing candidates. 
To our knowledge, this was the first study to consider using social media data to identify drug-repurposing candidates. We found that even a rule-based system, with a limited number of rules, could identify beneficial effect mentions in the comments of patients. Our preliminary study shows that social media has the potential to be used in drug repurposing. In the second study, we investigated the significance of extracting information from multiple sentences, specifically in the context of drug-disease relation discovery. We used multiple resources such as Semantic Medline, a literature-based resource, and Medline search (for filtering spurious results) and inferred 8,772 potential drug-disease pairs. Our analysis revealed that 6,450 (73.5%) of the 8,772 potential drug-disease relations did not occur in a single sentence. Moreover, only 537 of the drug-disease pairs matched the curated gold standard in the Comparative Toxicogenomics Database (CTD), a trusted resource for drug-disease relations. Among the 537, nearly 75% (407) of the drug-disease pairs occurred in multiple sentences. Our analysis revealed that the drug-disease pairs inferred from Semantic Medline or retrieved from CTD could be extracted from multiple sentences in the literature. This highlights the need for discourse-level analysis when extracting relations from biomedical literature. In the third and fourth studies, we focused on prioritizing drug repositioning candidates extracted from biomedical literature, an approach we refer to as Literature-Based Discovery (LBD). In the third study, we used drug-gene and gene-disease semantic predications extracted from Medline abstracts to generate a list of potential drug-disease pairs. We further ranked the generated pairs by assigning scores based on the predicates that qualify drug-gene and gene-disease relationships. 
On comparing the top-ranked drug-disease pairs against the Comparative Toxicogenomics Database, we found that a significant percentage of top-ranked pairs appeared in CTD. Co-occurrence of these high-ranked pairs in Medline abstracts was then used to improve the rankings of the inferred drug-disease relations. Finally, manual evaluation of the top ten pairs ranked by our approach revealed that nine of them have good potential for biological significance based on expert judgment. In the fourth study, we proposed a method, utilizing information surrounding causal findings, to prioritize discoveries generated by LBD systems. We focused on discovering drug-disease relations, which have the potential to identify drug repositioning candidates or adverse drug reactions. Our LBD system used drug-gene and gene-disease semantic predications in SemMedDB as causal findings and Swanson’s ABC model to generate potential drug-disease relations. Using sentences as a source of causal findings, our ranking method trained a binary classifier to classify generated drug-disease relations into desired classes. We trained and tested our classifier for three different purposes: a) drug repositioning, b) adverse drug-event detection, and c) drug-disease relation detection. The classifier obtained F-measures of 0.78, 0.86, and 0.83 respectively for these tasks. The number of causal findings of each hypothesis that were classified as positive by the classifier is the main metric for ranking hypotheses in the proposed method. To evaluate the ranking method, we counted and compared the number of true relations in the top 100 pairs, as ranked by our method and by one of the previous methods. Out of 181 true relations in the test dataset, the proposed method ranked 20 of them in the top 100 relations, while this number was 13 for the other method. 
In the last study, we used biomedical literature and clinical trials in ranking potential drug repositioning candidates identified by Phenome-Wide Association Studies (PheWAS). Unlike previous approaches, in this study, we did not limit our method to LBD. First, we generated a list of potential drug repositioning candidates using PheWAS. We retrieved 212,851 gene-disease associations from the PheWAS Catalog and 14,169 gene-drug relationships from DrugBank. Following Swanson’s model, we generated 52,966 potential drug repositioning candidates. Then, we developed an information retrieval system to retrieve any evidence of those candidates co-occurring in the biomedical literature and clinical trials. We identified nearly 14,800 drug-disease pairs with some evidence of support. In addition, we identified more than 38,000 novel candidates for repurposing, encompassing hundreds of different disease states and over 1,000 individual medications. We anticipate that these results will be highly useful for hypothesis generation in the field of drug repurposing.
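
The candidate-generation step described in this abstract, which joins drug-gene and gene-disease relations following Swanson's ABC model, might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function names `abc_candidates` and `rank_by_support`, the toy identifiers (`drug_a`, `gene_1`, `disease_x`), and the ranking by number of linking genes are all hypothetical and are not taken from the dissertation, whose own ranking relies on predicate-based scores, a trained classifier, and co-occurrence evidence.

```python
from collections import defaultdict


def abc_candidates(drug_gene, gene_disease, known=frozenset()):
    """Join drug-gene (A-B) and gene-disease (B-C) pairs into drug-disease (A-C) candidates.

    Returns a dict mapping each (drug, disease) pair to the set of genes linking them.
    Pairs listed in `known` (e.g. approved indications) are left out.
    """
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    support = defaultdict(set)
    for drug, gene in drug_gene:
        for disease in diseases_by_gene.get(gene, ()):
            if (drug, disease) not in known:
                support[(drug, disease)].add(gene)
    return dict(support)


def rank_by_support(candidates):
    """Order candidate pairs by number of linking genes (a simplifying assumption)."""
    return sorted(candidates, key=lambda pair: (-len(candidates[pair]), pair))


# Toy, made-up relations; none of these identifiers refer to real entities.
drug_gene = [("drug_a", "gene_1"), ("drug_a", "gene_2"), ("drug_b", "gene_2")]
gene_disease = [("gene_1", "disease_x"), ("gene_2", "disease_x"), ("gene_2", "disease_y")]
known = {("drug_b", "disease_y")}

candidates = abc_candidates(drug_gene, gene_disease, known)
for pair in rank_by_support(candidates):
    print(pair, sorted(candidates[pair]))
```

Keeping the linking genes for each pair, rather than only a count, leaves room to attach sentence-level or co-occurrence evidence to each intermediate link later, which is closer in spirit to the ranking ideas the abstract describes.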

    Indirect Relatedness, Evaluation, and Visualization for Literature Based Discovery

    The exponential growth of scientific literature is creating an increased need for systems to process and assimilate knowledge contained within text. Literature Based Discovery (LBD) is a well-established field that seeks to synthesize new knowledge from existing literature, but it has remained primarily in the theoretical realm rather than in real-world application. This lack of real-world adoption is due in part to the difficulty of LBD, but also to several solvable problems present in LBD today. Of these problems, the ones in most critical need of improvement are: (1) the over-generation of knowledge by LBD systems, (2) a lack of meaningful evaluation standards, and (3) the difficulty of interpreting LBD output. We address each of these problems by: (1) developing indirect relatedness measures for ranking and filtering LBD hypotheses; (2) developing a representative evaluation dataset and applying meaningful evaluation methods to individual components of LBD; (3) developing an interactive visualization system that allows a user to explore LBD output in its entirety. In addressing these problems, we make several contributions, most importantly: (1) state-of-the-art results for estimating direct semantic relatedness, (2) development of set association measures, (3) development of indirect association measures, (4) development of a standard LBD evaluation dataset, (5) division of LBD into discrete components with well-defined evaluation methods, (6) development of automatic functional group discovery, and (7) integration of indirect relatedness measures and automatic functional group discovery into a comprehensive LBD visualization system. Our results inform future development of LBD systems and contribute to creating more effective LBD systems.

    Computational literature-based discovery for natural products research : current state and future prospects

    Literature-based discovery (LBD) mines existing literature in order to generate new hypotheses by finding links between previously disconnected pieces of knowledge. Although automated LBD systems are becoming widespread and indispensable in a wide variety of knowledge domains, little has been done to introduce LBD to the field of natural products research. Despite growing knowledge in the natural product domain, most of the accumulated information is found in detached data pools. LBD can facilitate better contextualization and exploitation of this wealth of data, for example by formulating new hypotheses for natural product research, especially in the context of drug discovery and development. Moreover, automated LBD systems promise to accelerate the currently tedious and expensive process of lead identification, optimization, and development. Focusing on natural product research, we briefly reflect on the development of automated LBD and summarize its methods and principal data sources. In a thorough review of published use cases of LBD in the biomedical domain, we highlight the immense potential of this data mining approach for natural product research, especially in the context of drug discovery or repurposing, mode of action, and drug or substance interactions. Most of the 91 natural product-related discoveries in our sample of reported use cases of LBD were addressed to a computer science audience. Therefore, it is the wider goal of this review to introduce automated LBD to researchers who work with natural products and to facilitate the dialogue between this community and the developers of automated LBD systems.

    In Search of a Common Thread: Enhancing the LBD Workflow with a view to its Widespread Applicability

    Literature-Based Discovery (LBD) research focuses on discovering implicit knowledge linkages in existing scientific literature to provide impetus to innovation and research productivity. Despite significant advancements in LBD research, previous studies contain several open problems and shortcomings that are hindering its progress. The overarching goal of this thesis is to address these issues, not only to enhance the discovery component of LBD, but also to shed light on new directions that can further strengthen the existing understanding of the LBD workflow. In accordance with this goal, the thesis aims to enhance the LBD workflow with a view to ensuring its widespread applicability. The goal of widespread applicability is twofold. Firstly, it relates to the adaptability of the proposed solutions to a diverse range of problem settings. These problem settings are not necessarily application areas that are closely related to the LBD context, but could include a wide range of problems beyond the typical scope of LBD, which has traditionally been applied to scientific literature. Adapting the LBD workflow to problems outside the typical scope of LBD is a worthwhile goal, since the intrinsic objective of LBD research, which is discovering novel linkages in text corpora, is valid across a vast range of problem settings. Secondly, the idea of widespread applicability also denotes the capability of the proposed solutions to be executed in new environments. These 'new environments' are various academic disciplines (i.e., cross-domain knowledge discovery) and publication languages (i.e., cross-lingual knowledge discovery). The application of LBD models to new environments is timely, since the massive growth of the scientific literature has engendered huge challenges to academics, irrespective of their domain. 
This thesis is divided into five main research objectives that address the following topics: literature synthesis, the input component, the discovery component, reusability, and portability. The objective of the literature synthesis is to address the gaps in existing LBD reviews by conducting the first systematic literature review. The input component section aims to provide generalised insights on the suitability of various input types in the LBD workflow, focusing on their role and potential impact on the information retrieval cycle of LBD. The discovery component section aims to intermingle two research directions that have been under-investigated in the LBD literature, 'modern word embedding techniques' and 'temporal dimension', by proposing diachronic semantic inferences. Their potential positive influence in knowledge discovery is verified through both direct and indirect uses. The reusability section aims to present a new, distinct viewpoint on these LBD models by verifying their reusability in a timely application area using a methodical reuse plan. The last section, portability, proposes an interdisciplinary LBD framework that can be applied to new environments. While highly cost-efficient and easily pluggable, this framework also gives rise to a new perspective on knowledge discovery through its generalisable capabilities. Succinctly, this thesis presents novel and distinct viewpoints to accomplish five main research objectives, enhancing the existing understanding of the LBD workflow. The thesis offers new insights which future LBD research could further explore and expand to create more efficient, widely applicable LBD models to enable broader community benefits. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202

    Knowledge mining over scientific literature and technical documentation

    Abstract: This dissertation focuses on the extraction of information implicitly encoded in domain descriptions (technical terminology and related items) and its usage within a restricted-domain question answering (QA) system. Since different variants of the same term can be used to refer to the same domain entity, it is necessary to recognize all possible forms of a given term and structure them, so that they can be used in the question answering process. The knowledge about domain descriptions and their mutual relations is leveraged in an extension to an existing QA system, aimed at the technical maintenance manual of a well-known commercial aircraft. The original version of the QA system did not make use of domain descriptions, which are the novelty introduced by the present work. The explicit treatment of domain descriptions provided considerable gains in terms of efficiency, in particular in the process of analysis of the background document collection. Similar techniques were later applied to another domain (biomedical scientific literature), focusing in particular on protein-protein interactions. This dissertation describes in particular: (1) the extraction of domain-specific lexical items which refer to entities of the domain; (2) the detection of relationships (like synonymy and hyponymy) among such items, and their organization into a conceptual structure; (3) their usage within a domain-restricted question answering system, in order to facilitate the correct identification of relevant answers to a query; (4) the adaptation of the system to another domain, and extension of the basic hypothesis to tasks other than question answering.
    Summary: The topic of this dissertation is the extraction of information that is implicitly contained in technical terminologies and similar resources, and its application in an answer extraction (AE) system. Since different variants of the same term can be used to refer to the same concept, recognizing and structuring all possible forms is a prerequisite for their use in an AE system. The knowledge about terms and their relations is applied in an AE system that focuses on the maintenance manual of a well-known commercial aircraft. The original version of the system had no explicit treatment of terminology. The explicit treatment of terminology yielded a considerable improvement in the efficiency of the system, in particular with regard to the analysis of the underlying document collection. Similar methodologies were later applied to another domain (biomedical literature), with a particular focus on interactions between proteins. This dissertation describes in particular: (1) the extraction of terminology; (2) the identification of relations between terms (such as synonymy and hyponymy); (3) their use in an AE system; (4) the porting of the system to another domain.