182 research outputs found

    PPI-IRO: A two-stage method for protein-protein interaction extraction based on interaction relation ontology

    Full text link
    Mining Protein-Protein Interactions (PPIs) from the fast-growing biomedical literature resources has been proven as an effective approach for the identifi cation of biological regulatory networks. This paper presents a novel method based on the idea of Interaction Relation Ontology (IRO), which specifi es and organises words of various proteins interaction relationships. Our method is a two-stage PPI extraction method. At fi rst, IRO is applied in a binary classifi er to determine whether sentences contain a relation or not. Then, IRO is taken to guide PPI extraction by building sentence dependency parse tree. Comprehensive and quantitative evaluations and detailed analyses are used to demonstrate the signifi cant performance of IRO on relation sentences classifi cation and PPI extraction. Our PPI extraction method yielded a recall of around 80% and 90% and an F1 of around 54% and 66% on corpora of AIMed and Bioinfer, respectively, which are superior to most existing extraction methods. Copyright © 2014 Inderscience Enterprises Ltd

    A Comprehensive Benchmark of Kernel Methods to Extract Protein–Protein Interactions from Literature

    Get PDF
    The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein–protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. Nevertheless, our study shows that three kernels are clearly superior to the other methods

    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Get PDF
    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes on molecular level. The interpretation of microarray gene expression experiments profits from knowledge on the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results. Part I deals with biomedical text mining: Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the used databases (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources. Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied on the largest collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to specify relation types further; e. g., as activating, direct physical, or gene regulatory relation. Part II deals with gene expression data analysis: Gene expression data needs to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps. Two important steps are normalization, which aims at removing systematic variances between measurements, and quantification of differential expression by p-value and fold change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is in the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed, which is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b). Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models. Part III deals with integrated approaches and thus provides the connection between parts I and II: Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes that are relevant for the respective experiment together with literature information that supports interpretation. Finally, in Chapter 11 ideas on how the described methods can contribute to current research and possible future directions are presented

    Deep learning for extracting protein-protein interactions from biomedical literature

    Full text link
    State-of-the-art methods for protein-protein interaction (PPI) extraction are primarily feature-based or kernel-based by leveraging lexical and syntactic information. But how to incorporate such knowledge in the recent deep learning methods remains an open question. In this paper, we propose a multichannel dependency-based convolutional neural network model (McDepCNN). It applies one channel to the embedding vector of each word in the sentence, and another channel to the embedding vector of the head of the corresponding word. Therefore, the model can use richer information obtained from different channels. Experiments on two public benchmarking datasets, AIMed and BioInfer, demonstrate that McDepCNN compares favorably to the state-of-the-art rich-feature and single-kernel based methods. In addition, McDepCNN achieves 24.4% relative improvement in F1-score over the state-of-the-art methods on cross-corpus evaluation and 12% improvement in F1-score over kernel-based methods on "difficult" instances. These results suggest that McDepCNN generalizes more easily over different corpora, and is capable of capturing long distance features in the sentences.Comment: Accepted for publication in Proceedings of the 2017 Workshop on Biomedical Natural Language Processing, 10 pages, 2 figures, 6 table

    Extracting protein-protein interactions from text using rich feature vectors and feature selection

    Get PDF
    Because of the intrinsic complexity of natural language, automatically extracting accurate information from text remains a challenge. We have applied rich featurevectors derived from dependency graphs to predict protein-protein interactions using machine learning techniques. We present the first extensive analysis of applyingfeature selection in this domain, and show that it can produce more cost-effective models. For the first time, our technique was also evaluated on several large-scalecross-dataset experiments, which offers a more realistic view on model performance. During benchmarking, we encountered several fundamental problems hindering comparability with other methods. We present a set of practical guidelines to set up ameaningful evaluation. Finally, we have analysed the feature sets from our experiments before and after feature selection, and evaluated the contribution of both lexical and syntacticinformation to our method. The gained insight will be useful to develop better performing methods in this domain

    A realistic assessment of methods for extracting gene/protein interactions from free text

    Get PDF
    Background: The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger. Results: Our results show: that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm when coupled with a named entity tagger outperforms two of the tools most widely used to extract gene/protein interactions. Conclusion: In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-art levels of performance should be treated as a high priority by the biomedical text mining community

    Learning Language from a Large (Unannotated) Corpus

    Full text link
    A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system, directly from a large, unannotated corpus.Comment: 29 pages, 5 figures, research proposa

    Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Path

    Full text link
    Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture leverages the shortest dependency path (SDP) between two entities; multichannel recurrent neural networks, with long short term memory (LSTM) units, pick up heterogeneous information along the SDP. Our proposed model has several distinct features: (1) The shortest dependency paths retain most relevant information (to relation classification), while eliminating irrelevant words in the sentence. (2) The multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency paths. (3) A customized dropout strategy regularizes the neural network to alleviate overfitting. We test our model on the SemEval 2010 relation classification task, and achieve an F1F_1-score of 83.7\%, higher than competing methods in the literature.Comment: EMNLP '1
    corecore