1,010 research outputs found

    Exploiting likely-positive and unlabeled data to improve the identification of protein-protein interaction articles

    Background: Experimentally verified protein-protein interactions (PPI) cannot be easily retrieved by researchers unless they are stored in PPI databases. The curation of such databases can be made faster by ranking newly published articles by their relevance to PPI, a task which we approach here by designing a machine-learning-based PPI classifier. All classifiers require labeled data, and the more labeled data available, the more reliable they become. Although many PPI databases with large numbers of labeled articles are available, incorporating these databases into the base training data may actually reduce classification performance, since the supplementary databases may not annotate exactly the same PPI types as the base training data. Our first goal in this paper is to find a method of selecting likely positive data from such supplementary databases. Extracting only likely positive data, however, will bias the classification model unless sufficient negative data is also added. Unfortunately, negative data is very hard to obtain because there are no resources that compile such information. Therefore, our second aim is to select such negative data from unlabeled PubMed data. Thirdly, we explore how to exploit these likely positive and negative data. Lastly, we look at the somewhat unrelated question of which term-weighting scheme is most effective for identifying PPI-related articles.
    Results: To evaluate the performance of our PPI text classifier, we conducted experiments based on the BioCreAtIvE-II IAS dataset. Our results show that adding likely-labeled data generally increases AUC by 3-6%, indicating better ranking ability. Our experiments also show that our newly proposed term-weighting scheme has the highest AUC among all common weighting schemes. Our final model achieves an F-measure and AUC 2.9% and 5.0% higher, respectively, than those of the top-ranking system in the IAS challenge.
    Conclusion: Our experiments demonstrate the effectiveness of integrating unlabeled and likely labeled data to augment a PPI text classification system. Our mixed model is suitable for ranking purposes, whereas our hierarchical model is better for filtering. In addition, our results indicate that supervised weighting schemes outperform unsupervised ones. Our newly proposed weighting scheme, TFBRF, which considers documents that do not contain the target word, avoids some of the biases found in traditional weighting schemes. Our experimental results show TFBRF to be the most effective among several other top weighting schemes.
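The abstract above uses AUC as a measure of ranking ability. As a minimal sketch of what that number means (not the authors' evaluation code; the function name `auc` and the toy scores are assumptions for illustration), AUC can be computed as the fraction of positive/negative pairs in which the positive item receives the higher score:

```python
def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for four articles (1 = PPI-relevant).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(auc(scores, labels))  # 3 of the 4 positive/negative pairs are ordered correctly
```

The quadratic loop is chosen for readability; for large collections a rank-based formulation would be the usual choice.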

    Drug-drug interactions: A machine learning approach

    Automatic detection of drug-drug interaction (DDI) is a difficult problem in pharmaco-surveillance. Recent practice for in vitro and in vivo pharmacokinetic drug-drug interaction studies has been based on carefully selected drug characteristics, such as their pharmacological effects, and on drug-target networks, in order to identify and comprehend anomalies in a drug's biochemical function upon co-administration. In this work, we present a novel DDI prediction framework that combines several drug-attribute similarity measures to construct a feature space from which we train three machine learning algorithms: Support Vector Machine (SVM), J48 Decision Tree and K-Nearest Neighbor (KNN), using a partially supervised classification algorithm called Positive Unlabeled Learning (PU-Learning) tailored specifically to suit our framework. In summary, we extracted 1,300 U.S. Food and Drug Administration-approved pharmaceutical drugs and paired them to create 1,688,700 feature vectors. Out of 397 drug-pairs known to interact prior to our experiments, our system was able to correctly identify 80% of them, and from the remaining 1,688,303 pairs for which no interaction had been determined, we were able to predict 181 potential DDIs with confidence levels greater than 97%. The latter is a set of DDIs unrecognized by our source of ground truth at the time of study. Evaluation of the effectiveness of our system involved querying the U.S. Food and Drug Administration's Adverse Effect Reporting System (AERS) database for cases involving drug-pairs used in this study. The results returned from the query listed incidents reported for a number of patients, some of whom had experienced severe adverse reactions leading to outcomes such as prolonged hospitalization, diminished medicinal effect of one or more drugs, and in some cases, death.
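The pair counts quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only reproduces the numbers stated above; the variable names are assumptions, and the observation that the total matches ordered rather than unordered pairs is an inference from the figures, not something the abstract states:

```python
n_drugs = 1300

# Every drug paired with every other drug, counting (A, B) and (B, A) separately.
ordered_pairs = n_drugs * (n_drugs - 1)

# The same pairs with direction ignored, shown for comparison.
unordered_pairs = ordered_pairs // 2

known_interacting = 397
unlabeled_pairs = ordered_pairs - known_interacting

print(ordered_pairs)    # 1688700, the number of feature vectors reported
print(unordered_pairs)  # 844350
print(unlabeled_pairs)  # 1688303, the "remaining" pairs reported
```

Under this reading, the positive class makes up roughly 0.02% of all pairs, which illustrates why a positive-unlabeled setup is used instead of treating every unlabeled pair as a negative.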

    Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes

    BACKGROUND: Investigating and understanding drug-drug interactions (DDIs) is important in improving the effectiveness of clinical care. DDIs can occur when two or more drugs are administered together. Experimentally based DDI detection methods are costly and time-consuming. Hence, there is great interest in developing efficient and useful computational methods for inferring potential DDIs. Standard binary classifiers require both positives and negatives for training. In a DDI context, drug pairs that are known to interact can serve as positives for predictive methods. However, negatives, that is, drug pairs that have been confirmed to have no interaction, are scarce. To address this lack of negatives, we introduce a Positive-Unlabeled Learning method for inferring potential DDIs. RESULTS: The proposed method consists of three steps: i) application of Growing Self Organizing Maps to infer negatives from the unlabeled dataset; ii) using a pairwise similarity function to quantify the overlap between individual features of drugs; and iii) using a support vector machine classifier for inferring DDIs. We obtained 6036 DDIs from the DrugBank database. Using the proposed approach, we inferred 589 drug pairs that are likely not to interact with each other; these drug pairs are used as representative data for the negative class in binary classification for DDI prediction. Moreover, we classify the predicted DDIs as Cytochrome P450 (CYP) enzyme-Dependent and CYP-Independent interactions based on their locations on the Growing Self Organizing Map, due to the particular importance of these enzymes in clinically significant interaction effects. Further, we provide a case study on three predicted CYP-Dependent DDIs to evaluate the clinical relevance of this study.
    CONCLUSION: Our proposed approach showed an absolute improvement in F1-score of 14% and 38%, depending on the choice of similarity function, in comparison to the method that randomly selects unlabeled data points as likely negatives. We inferred 5300 possible CYP-Dependent DDIs and 592 CYP-Independent DDIs with the highest posterior probabilities. Our discoveries can be used to improve clinical care as well as the research outcomes of drug development.
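The general two-step idea behind this kind of positive-unlabeled learning can be sketched as follows: first pick likely negatives from the unlabeled pool, then train an ordinary binary classifier on positives versus those likely negatives. The code below is a deliberately simplified illustration and not the method in the abstract: it swaps the Growing Self Organizing Map for a plain distance-to-centroid ranking and the support vector machine for a nearest-centroid rule, and all function names and data points are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def select_likely_negatives(positives, unlabeled, k):
    """Step 1 (simplified): keep the k unlabeled points farthest from the positives."""
    c = centroid(positives)
    ranked = sorted(unlabeled, key=lambda u: math.dist(u, c), reverse=True)
    return ranked[:k]


def train_nearest_centroid(positives, negatives):
    """Step 2 (simplified): classify by whichever class centroid is closer."""
    pos_c, neg_c = centroid(positives), centroid(negatives)

    def predict(x):
        return 1 if math.dist(x, pos_c) <= math.dist(x, neg_c) else 0

    return predict


# Toy 2-D features standing in for drug-pair similarity vectors (made-up values).
positives = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
unlabeled = [(1.1, 1.0), (5.0, 5.0), (4.8, 5.2), (1.0, 0.8), (5.1, 4.9)]

likely_negatives = select_likely_negatives(positives, unlabeled, k=3)
predict = train_nearest_centroid(positives, likely_negatives)

print(likely_negatives)
print(predict((1.1, 1.0)), predict((4.9, 5.0)))
```

The baseline mentioned in the conclusion, random selection of unlabeled points as likely negatives, would correspond to replacing `select_likely_negatives` with a random sample of the unlabeled pool.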

    Identifying Graphs from Noisy Observational Data

    There is a growing amount of data describing networks -- examples include social networks, communication networks, and biological networks. As the amount of available data increases, so does our interest in analyzing the properties and characteristics of these networks. However, in most cases the data is noisy, incomplete, and the result of passively acquired observational data; naively analyzing these networks without taking these errors into account can result in inaccurate and misleading conclusions. In my dissertation, I study the tasks of entity resolution, link prediction, and collective classification to address these deficiencies. I describe these tasks in detail and discuss my own work on each of these tasks. For entity resolution, I develop a method for resolving the identities of name mentions in email communications. For link prediction, I develop a method for inferring subordinate-manager relationships between individuals in an email communication network. For collective classification, I propose an adaptive active surveying method to address node labeling in a query-driven setting on network data. In many real-world settings, however, these deficiencies are not found in isolation and all need to be addressed to infer the desired complete and accurate network. Furthermore, because of the dependencies typically found in these tasks, the tasks are inherently inter-related and must be performed jointly. I define the general problem of graph identification which simultaneously performs these tasks; removing the noise and missing values in the observed input network and inferring the complete and accurate output network. I present a novel approach to graph identification using a collection of Coupled Collective Classifiers, C3, which, in addition to capturing the variety of features typically used for each task, can capture the intra- and inter-dependencies required to correctly infer nodes, edges, and labels in the output network. 
I discuss variants of C3 using different learning and inference paradigms and show the superior performance of C3, in terms of both prediction quality and runtime performance, over various previous approaches. I then conclude by presenting the Graph Alignment, Identification, and Analysis (GAIA) open-source software library which not only provides an implementation of C3 but also algorithms for various tasks in network data such as entity resolution, link prediction, collective classification, clustering, active learning, data generation, and analysis.

    Overview of the gene ontology task at BioCreative IV

    Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. 
Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation.

    Knowledge-Informed Machine Learning for Cancer Diagnosis and Prognosis: A review

    Cancer remains one of the most challenging diseases to treat in the medical field. Machine learning has enabled in-depth analysis of rich multi-omics profiles and medical imaging for cancer diagnosis and prognosis. Despite these advancements, machine learning models face challenges stemming from limited labeled sample sizes, the intricate interplay of high-dimensionality data types, the inherent heterogeneity observed among patients and within tumors, and concerns about interpretability and consistency with existing biomedical knowledge. One approach to surmount these challenges is to integrate biomedical knowledge into data-driven models, which has proven potential to improve the accuracy, robustness, and interpretability of model results. Here, we review the state-of-the-art machine learning studies that adopted the fusion of biomedical knowledge and data, termed knowledge-informed machine learning, for cancer diagnosis and prognosis. Emphasizing the properties inherent in four primary data types including clinical, imaging, molecular, and treatment data, we highlight modeling considerations relevant to these contexts. We provide an overview of diverse forms of knowledge representation and current strategies of knowledge integration into machine learning pipelines with concrete examples. We conclude the review article by discussing future directions to advance cancer research through knowledge-informed machine learning.
    Comment: 41 pages, 4 figures, 2 tables

    Biomedical relation extraction: from binary to complex

    Biomedical relation extraction aims to uncover high-quality relations from life science literature with high accuracy and efficiency. Early biomedical relation extraction tasks focused on capturing binary relations, such as protein-protein interactions, which are crucial for virtually every process in a living cell. Information about these interactions provides the foundations for new therapeutic approaches. In recent years, interest has shifted to the extraction of complex relations such as biomolecular events. Complex relations go beyond binary relations: they can involve more than two arguments and may also take another relation as an argument. In this paper, we conduct a thorough survey of the research in biomedical relation extraction. We first present a general framework for biomedical relation extraction and then discuss the approaches proposed for binary and complex relation extraction, with a focus on the latter since it is a much more difficult task than binary relation extraction. Finally, we discuss the challenges facing complex relation extraction and outline possible solutions and future directions.