Linguistic feature analysis for protein interaction extraction
Background: The rapid growth of the amount of publicly available reports on biomedical experimental results has recently caused a boost of text mining approaches for protein interaction extraction. Most approaches rely implicitly or explicitly on linguistic, i.e., lexical and syntactic, data extracted from text. However, only few attempts have been made to evaluate the contribution of the different feature types. In this work, we contribute to this evaluation by studying the relative importance of deep syntactic features, i.e., grammatical relations, shallow syntactic features (part-of-speech information) and lexical features. For this purpose, we use a recently proposed approach that uses support vector machines with structured kernels. Results: Our results reveal that the contribution of the different feature types varies for the different data sets on which the experiments were conducted. The smaller the training corpus compared to the test data, the more important the role of grammatical relations becomes. Moreover, deep syntactic information based classifiers prove to be more robust on heterogeneous texts where no or only limited common vocabulary is shared. Conclusion: Our findings suggest that grammatical relations play an important role in the interaction extraction task. Moreover, the net advantage of adding lexical and shallow syntactic features is small related to the number of added features. This implies that efficient classifiers can be built by using only a small fraction of the features that are typically being used in recent approaches.
Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts
To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence based approaches can deal with large text corpora like MEDLINE in an acceptable time but are not able to extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive and the interpretation of the generated trees is difficult. Several natural language processing (NLP) approaches for the biomedical domain exist focusing specifically on the detection of a limited set of relation types. For systems biology, generic approaches for the detection of a multitude of relation types which in addition are able to process large text corpora are needed but the number of systems meeting both requirements is very limited. We introduce the use of SENNA (“Semantic Extraction using a Neural Network Architecture”), a fast and accurate neural network based Semantic Role Labeling (SRL) program, for the large scale extraction of semantic relations from the biomedical literature. A comparison of processing times of SENNA and other SRL systems or syntactical parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. 89 million biomedical sentences were tagged with SENNA on a 100 node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences resulting in precision/recall values of 0.71/0.43. We show that the accuracy as well as processing speed of the proposed semantic relation extraction approach is sufficient for its large scale application on biomedical text. The proposed approach is highly generalizable regarding the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems. 
The presented approach bridges the gap between fast, co-occurrence-based approaches lacking semantic relations and highly specialized and computationally demanding NLP approaches.
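For comparison with entries in this list that report F-scores, the precision and recall quoted in this abstract (0.71 and 0.43) can be combined into a balanced F-score. The snippet below is a minimal sketch of that arithmetic; the function name `f1_score` is illustrative, and the resulting value is derived here rather than reported by the authors.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract above.
f1 = f1_score(0.71, 0.43)
print(round(f1, 3))  # 0.536
```

Because the harmonic mean is pulled toward the smaller of the two inputs, the lower recall dominates the combined score.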
Using Unsupervised Patterns to Extract Gene Regulation Relationships for Network Construction
BACKGROUND: Gene expression is usually described in the literature as a transcription factor X that regulates the target gene Y. Previous studies have discovered gene regulations by using information from the biomedical literature, but most of them require the effort of human annotators to build the training dataset. Moreover, the amount of textual knowledge recorded in the biomedical literature grows very rapidly, which makes the manual creation of patterns from the literature more difficult. There is an increasing need to automate the process of establishing patterns. METHODOLOGY/PRINCIPAL FINDINGS: In this article, we describe an unsupervised pattern generation method called AutoPat. It is a gene expression mining system that can generate unsupervised patterns automatically from a given set of seed patterns. The high scalability and low maintenance cost of the unsupervised patterns could help our system to extract gene expression from PubMed abstracts more precisely and effectively. CONCLUSIONS/SIGNIFICANCE: Experiments on several regulators show reasonable precision and recall rates, which validate AutoPat's practical applicability. The conducted regulation networks could also be built precisely and effectively. The system in this study is available at http://ikmbio.csie.ncku.edu.tw/AutoPat/
A Comprehensive Benchmark of Kernel Methods to Extract Protein–Protein Interactions from Literature
The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein–protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. 
Nevertheless, our study shows that three kernels are clearly superior to the other methods
Toll-Like Receptor 3 (TLR3) Plays a Major Role in the Formation of Rabies Virus Negri Bodies
Human neurons express the innate immune response receptor, Toll-like receptor 3 (TLR3). TLR3 levels are increased in pathological conditions such as brain virus infection. Here, we further investigated the production, cellular localisation, and function of neuronal TLR3 during neuronotropic rabies virus (RABV) infection in human neuronal cells. Following RABV infection, TLR3 is not only present in endosomes, as observed in the absence of infection, but also in detergent-resistant perinuclear inclusion bodies. As well as TLR3, these inclusion bodies contain the viral genome and viral proteins (N and P, but not G). The size and composition of inclusion bodies and the absence of a surrounding membrane, as shown by electron microscopy, suggest they correspond to the previously described Negri Bodies (NBs). NBs are not formed in the absence of TLR3, and TLR3−/− mice—in which brain tissue was less severely infected—had a better survival rate than WT mice. These observations demonstrate that TLR3 is a major molecule involved in the spatial arrangement of RABV–induced NBs and viral replication. This study shows how viruses can exploit cellular proteins and compartmentalisation for their own benefit
SHIV-162P3 Infection of Rhesus Macaques Given Maraviroc Gel Vaginally Does Not Involve Resistant Viruses
Maraviroc (MVC) gels are effective at protecting rhesus macaques from vaginal SHIV transmission, but breakthrough infections can occur. To determine the effects of a vaginal MVC gel on infecting SHIV populations in a macaque model, we analyzed plasma samples from three rhesus macaques that received a MVC vaginal gel (day 0) but became infected after high-dose SHIV-162P3 vaginal challenge. Two infected macaques that received a placebo gel served as controls. The infecting SHIV-162P3 stock had an overall mean genetic distance of 0.294±0.027%; limited entropy changes were noted across the envelope (gp160). No envelope mutations were observed consistently in viruses isolated from infected macaques at days 14–21, the time of first detectable viremia, nor selected at later time points, days 42–70. No statistically significant differences in MVC susceptibilities were observed between the SHIV inoculum (50% inhibitory concentration [IC50] 1.87 nM) and virus isolated from the three MVC-treated macaques (MVC IC50 1.18 nM, 1.69 nM, and 1.53 nM, respectively). Highlighter plot analyses suggested that infection was established in each MVC-treated animal by one founder virus genotype. The expected Poisson distribution of pairwise Hamming Distance frequency counts was observed and a phylogenetic analysis did not identify infections with distinct lineages from the challenge stock. These data suggest that breakthrough infections most likely result from incomplete viral inhibition and not the selection of MVC-resistant variants
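The pairwise Hamming distance frequency counts mentioned in this abstract can be illustrated with a short sketch. The sequences below are invented toy data, not from the study, and the helper names are hypothetical. The Poisson comparison uses the sample mean distance as the rate, which is an assumption of this example and not necessarily the authors' procedure.

```python
from collections import Counter
from itertools import combinations
from math import exp, factorial


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming_counts(seqs):
    """Frequency of each Hamming distance over all unordered pairs."""
    return Counter(hamming(a, b) for a, b in combinations(seqs, 2))


# Toy aligned sequences (made up for illustration only).
seqs = ["ACGTACGT", "ACGTACGA", "ACGTTCGT", "ACGTACGT"]
counts = pairwise_hamming_counts(seqs)

# Expected counts under a Poisson model with the sample mean as its rate.
n_pairs = sum(counts.values())
lam = sum(d * c for d, c in counts.items()) / n_pairs
expected = {d: n_pairs * exp(-lam) * lam**d / factorial(d) for d in sorted(counts)}

for d in sorted(counts):
    print(d, counts[d], round(expected[d], 2))
```

A close match between observed and expected counts is consistent with a single founder genotype diversifying by random mutation; a formal analysis would use a goodness-of-fit test on real sequence data.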
Linking genes to literature: text mining, information extraction, and retrieval applications for biology
Efficient access to information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of the results. The biological literature also constitutes the main information source for manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to extract biologically relevant information from electronic texts automatically. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the following: the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification in terms of application types and techniques, together with integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available on the internet
Human Immunodeficiency Virus Type 1 Coreceptor Switching: V1/V2 Gain-of-Fitness Mutations Compensate for V3 Loss-of-Fitness Mutations
Human immunodeficiency virus type 1 (HIV-1) entry into target cells is mediated by the virus envelope binding to CD4 and the conformationally altered envelope subsequently binding to one of two chemokine receptors. HIV-1 envelope glycoprotein (gp120) has five variable loops, of which three (V1/V2 and V3) influence the binding of either CCR5 or CXCR4, the two primary coreceptors for virus entry. Minimal sequence changes in V3 are sufficient for changing coreceptor use from CCR5 to CXCR4 in some HIV-1 isolates, but more commonly additional mutations in V1/V2 are observed during coreceptor switching. We have modeled coreceptor switching by introducing most possible combinations of mutations in the variable loops that distinguish a previously identified group of CCR5- and CXCR4-using viruses. We found that V3 mutations entail high risk, ranging from major loss of entry fitness to lethality. Mutations in or near V1/V2 were able to compensate for the deleterious V3 mutations and may need to precede V3 mutations to permit virus survival. V1/V2 mutations in the absence of V3 mutations often increased the capacity of virus to utilize CCR5 but were unable to confer CXCR4 use. V3 mutations were thus necessary but not sufficient for coreceptor switching, and V1/V2 mutations were necessary for virus survival. HIV-1 envelope sequence evolution from CCR5 to CXCR4 use is constrained by relatively frequent lethal mutations, deep fitness valleys, and requirements to make the right amino acid substitution in the right place at the right time