Comparative analysis of five protein-protein interaction corpora
Background: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation and consequently resources are largely incompatible and methods are difficult to evaluate.
Results: We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types with no identification of the words specifying the interaction, no negations, and no interaction certainty.
We find that the F-score performance of a state-of-the-art PPI extraction method varies on average 19 percentage units and in some cases over 30 percentage units between the different evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora.
Conclusions: Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at http://mars.cs.utu.fi/PPICorpora.
A realistic assessment of methods for extracting gene/protein interactions from free text
Background: The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger. Results: Our results show that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm when coupled with a named entity tagger outperforms two of the tools most widely used to extract gene/protein interactions. Conclusion: In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.
Extracting protein-protein interactions from text using rich feature vectors and feature selection
Because of the intrinsic complexity of natural language, automatically extracting accurate information from text remains a challenge. We have applied rich feature vectors derived from dependency graphs to predict protein-protein interactions using machine learning techniques. We present the first extensive analysis of applying feature selection in this domain, and show that it can produce more cost-effective models. For the first time, our technique was also evaluated on several large-scale cross-dataset experiments, which offers a more realistic view on model performance.
During benchmarking, we encountered several fundamental problems hindering comparability with other methods. We present a set of practical guidelines to set up a meaningful evaluation.
Finally, we have analysed the feature sets from our experiments before and after feature selection, and evaluated the contribution of both lexical and syntactic information to our method. The gained insight will be useful to develop better performing methods in this domain.
Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches
We study the adaptation of Link Grammar Parser to the biomedical sublanguage
with a focus on domain terms not found in a general parser lexicon. Using two
biomedical corpora, we implement and evaluate three approaches to addressing
unknown words: automatic lexicon expansion, the use of morphological clues, and
disambiguation using a part-of-speech tagger. We evaluate each approach
separately for its effect on parsing performance and consider combinations of
these approaches. In addition to a 45% increase in parsing efficiency, we find
that the best approach, incorporating information from a domain part-of-speech
tagger, offers a statistically significant 10% relative decrease in error. The
adapted parser is available under an open-source license at
http://www.it.utu.fi/biolg
Deep learning for extracting protein-protein interactions from biomedical literature
State-of-the-art methods for protein-protein interaction (PPI) extraction are
primarily feature-based or kernel-based by leveraging lexical and syntactic
information. But how to incorporate such knowledge in the recent deep learning
methods remains an open question. In this paper, we propose a multichannel
dependency-based convolutional neural network model (McDepCNN). It applies one
channel to the embedding vector of each word in the sentence, and another
channel to the embedding vector of the head of the corresponding word.
Therefore, the model can use richer information obtained from different
channels. Experiments on two public benchmarking datasets, AIMed and BioInfer,
demonstrate that McDepCNN compares favorably to the state-of-the-art
rich-feature and single-kernel based methods. In addition, McDepCNN achieves
24.4% relative improvement in F1-score over the state-of-the-art methods on
cross-corpus evaluation and 12% improvement in F1-score over kernel-based
methods on "difficult" instances. These results suggest that McDepCNN
generalizes more easily over different corpora, and is capable of capturing
long distance features in the sentences.
Comment: Accepted for publication in Proceedings of the 2017 Workshop on
Biomedical Natural Language Processing, 10 pages, 2 figures, 6 tables
Cell line name recognition in support of the identification of synthetic lethality in cancer from text
Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.
Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
Event based text mining for integrated network construction
The scientific literature is a rich and challenging data source for research in systems biology, providing numerous interactions between biological entities. Text mining techniques have been increasingly useful to extract such information from the literature in an automatic way, but up to now the main focus of text mining in the systems biology field has been restricted mostly to the discovery of protein-protein interactions. Here, we take this approach one step further, and use machine learning techniques combined with text mining to extract a much wider variety of interactions between biological entities. Each particular interaction type gives rise to a separate network, represented as a graph, all of which can be subsequently combined to yield a so-called integrated network representation. This provides a much broader view on the biological system as a whole, which can then be used in further investigations to analyse specific properties of the network.
The Impact of Annotation on the Performance of Protein Tagging in Biomedical Text
In this paper we discuss five different corpora annotated for protein names. We present several within- and cross-dataset protein tagging experiments showing that different annotation schemes severely affect the portability of statistical protein taggers. By means of a detailed error analysis we identify crucial annotation issues that future annotation projects should take into careful consideration.
- …