
    A scalable machine-learning approach to recognize chemical names within large text databases

    MOTIVATION: The use or study of chemical compounds permeates almost every scientific field, and in each of them the amount of textual information is growing rapidly. There is a need to accurately identify chemical names within text for a number of informatics efforts, such as database curation, report summarization, tagging of named entities and keywords, and the development and curation of reference databases. RESULTS: A first-order Markov Model (MM) was evaluated for its ability to distinguish chemical names from ordinary words, yielding ~93% recall in recognizing chemical terms and ~99% precision in rejecting non-chemical terms on smaller test sets. However, because the total number of false-positive events increases with the number of words analyzed, the scalability of name recognition was measured by processing 13.1 million MEDLINE records. The method yielded precision ranging from 54.7% to 100%, depending on the cutoff score used, averaging 82.7% over approximately 1.05 million putative chemical terms extracted. The extracted chemical terms were analyzed to estimate the number of spelling variants per term, which correlated with the total number of times the chemical name appeared in MEDLINE. This variability in term construction was found to affect both information retrieval and term mapping when using PubMed and Ovid.
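
    As an illustration of the technique this abstract names, here is a minimal sketch of a character-level first-order Markov model used to score candidate chemical names. The training lists, the smoothing floor, and the cutoff of 0.0 are illustrative assumptions, not the authors' actual data or parameters.

```python
# Sketch of a character-level first-order Markov model for scoring
# candidate chemical names; training lists and the cutoff are
# illustrative assumptions, not the published model's data.
import math
from collections import defaultdict

def train_transitions(terms):
    """Estimate P(next char | current char) from training strings."""
    counts = defaultdict(lambda: defaultdict(int))
    for term in terms:
        padded = "^" + term.lower() + "$"       # explicit boundary markers
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    probs = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        probs[a] = {b: n / total for b, n in nxt.items()}
    return probs

def log_likelihood(term, probs, floor=1e-6):
    """Sum of log transition probabilities; unseen transitions get a floor."""
    padded = "^" + term.lower() + "$"
    return sum(math.log(probs.get(a, {}).get(b, floor))
               for a, b in zip(padded, padded[1:]))

# Score a term by the log-odds of the "chemical" model over the "word"
# model; raising the cutoff trades recall for precision, as the abstract
# describes for the scalability experiment.
chem_model = train_transitions(["ethanol", "benzene", "acetaldehyde"])
word_model = train_transitions(["the", "protein", "analysis"])

def chemical_score(term):
    return log_likelihood(term, chem_model) - log_likelihood(term, word_model)

print(chemical_score("methanol") > 0.0)   # the 0.0 cutoff is illustrative
```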

    BibGlimpse: The case for a light-weight reprint manager in distributed literature research

    Background: While text-mining and distributed annotation systems both aim at capturing knowledge and presenting it in a standardized form, there have been few attempts to investigate potential synergies between the two fields. For instance, distributed annotation would be well suited to providing topic-focused, expert-knowledge-enriched text corpora. A key limitation of this approach is the availability of literature annotation systems that groups of collaborating researchers can use routinely, day to day, without being distracted from the main focus of their work. Results: For this purpose, we have designed BibGlimpse. Features such as drop-to-file, SVM-based automated retrieval of PubMed bibliography for PDF reprints, and annotation support make BibGlimpse an efficient, light-weight reprint manager that facilitates distributed literature research for work groups. Building on an established open search engine, it supports full-text search and structured queries, while also making shared collections of annotated reprints accessible to literature classification and text-mining tools. Conclusion: BibGlimpse offers scientists a tool that enhances their own literature management. Moreover, it may be used to create content-enriched, annotated text corpora for research in text-mining.

    A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

    Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed algorithm uses a grammar-based distance metric to determine the partitioning of a set of biological sequences. It performs clustering by comparing new sequences with cluster-representative sequences to determine membership; if the comparison fails to identify a suitable cluster, a new cluster is created. Results: The performance of the proposed algorithm is validated by comparison to the popular DNA/RNA sequence clustering approach CD-HIT-EST and to the recently developed algorithm UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a CPU execution time comparable to that of CD-HIT-EST, which is much slower than UCLUST, and generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
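
    A minimal sketch of the greedy, representative-based clustering loop this abstract describes. The Jaccard dissimilarity over k-mer sets is only a stand-in for the authors' grammar-based metric, and the threshold is illustrative.

```python
# Sketch of greedy representative-based clustering; the k-mer Jaccard
# dissimilarity is a placeholder for the grammar-based metric.
def kmer_set(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def distance(a, b, k=4):
    """Jaccard dissimilarity over k-mer sets (stand-in metric)."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    union = sa | sb
    return 1.0 if not union else 1.0 - len(sa & sb) / len(union)

def cluster(sequences, threshold=0.3):
    clusters = []                       # each cluster keeps a representative
    for seq in sequences:
        for c in clusters:
            if distance(seq, c["rep"]) <= threshold:
                c["members"].append(seq)    # first acceptable cluster wins
                break
        else:
            clusters.append({"rep": seq, "members": [seq]})  # open new cluster
    return clusters

seqs = ["ACGTACGTAGCT", "ACGTACGTAGCA", "TTTTGGGGCCCC"]
print(len(cluster(seqs)))   # 2 clusters under this toy metric
```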

    Comparative analysis of five protein-protein interaction corpora

    Background: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation, and consequently the resources are largely incompatible and the methods are difficult to evaluate. Results: We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types, with no identification of the words specifying the interaction, no negations, and no interaction certainty. We find that the F-score performance of a state-of-the-art PPI extraction method varies by 19 percentage units on average, and in some cases by over 30 percentage units, between the evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than the differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora. Conclusions: Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. In the course of this study we have created a major practical contribution by converting the corpora into a shared format. The conversion software is freely available at http://mars.cs.utu.fi/PPICorpora.
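
    A minimal sketch of the unification step this abstract describes, reducing heterogeneous PPI records to undirected, untyped binary interactions. The input record layout is a hypothetical assumption; the real corpora differ in format.

```python
# Sketch of normalizing PPI annotations to the shared level described
# above: undirected, untyped pairs with negations excluded. The record
# fields ("agent", "target", "type", "negated") are illustrative.
def normalize(interactions):
    """Map typed/directed records to a set of unordered entity pairs."""
    unified = set()
    for rec in interactions:
        if rec.get("negated"):           # negations are excluded entirely
            continue
        # drop direction by sorting the pair; drop type/certainty outright
        unified.add(tuple(sorted((rec["agent"], rec["target"]))))
    return unified

corpus_a = [{"agent": "p53", "target": "MDM2", "type": "phosphorylates"}]
corpus_b = [{"agent": "MDM2", "target": "p53", "type": "binds"}]
assert normalize(corpus_a) == normalize(corpus_b)   # same untyped pair
```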

    Automated annotation of chemical names in the literature with tunable accuracy

    Background: A significant portion of the biomedical and chemical literature refers to small molecules. The accurate identification and annotation of compound names that are relevant to the topic of a given publication can establish links between scientific publications and various chemical and life science databases. Manual annotation is the preferred method for this work because well-trained indexers can understand the paper's topics as well as recognize key terms. However, considering the hundreds of thousands of new papers published annually, an automatic annotation system with high precision and relevance can be a useful complement to manual annotation. Results: An automated chemical name annotation system, MeSH Automated Annotations (MAA), was developed to annotate small-molecule names in scientific abstracts with tunable accuracy. This system aims to reproduce the MeSH term annotations that indexers would create for biomedical and chemical literature. When automated free-text matching was compared with manual indexing over 26 thousand MEDLINE abstracts, more than 40% of the annotations were false-positive (FP) cases. To reduce the FP rate, MAA incorporated several filters to remove "incorrect" annotations caused by nonspecific, partial, and low-relevance chemical names. In part, relevance was measured by the position of the chemical name in the text. Tunable accuracy was obtained by adding or restricting the sections of the text scanned for chemical names. The best precision obtained was 96% with a 28% recall rate. The best overall performance of MAA, as measured by the F statistic, was 66%, which compares favorably to other chemical name annotation systems. Conclusions: Accurate chemical name annotation can help researchers not only identify important chemical names in abstracts, but also match unindexed and unstructured abstracts to chemical records. The current work is tested against MEDLINE, but the algorithm is not specific to this corpus, and it could likely be applied to papers from chemical physics, materials, polymer, and environmental science, as well as to patents, biological assay descriptions, and other textual data.
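
    A minimal sketch of the kind of tunable filtering pipeline this abstract describes. The stop list, title weighting, and thresholds are illustrative assumptions, not the published system's actual rules.

```python
# Sketch of a tunable annotation filter in the spirit of MAA; all
# constants here are illustrative assumptions.
NONSPECIFIC = {"acid", "salt", "water", "compound"}

def relevance(name, text, title_weight=2.0):
    """Position-based relevance: earlier mentions score higher, and a
    mention before the first newline (treated as the title) counts extra."""
    pos = text.lower().find(name.lower())
    if pos < 0:
        return 0.0
    score = 1.0 - pos / max(len(text), 1)
    if "\n" in text and pos < text.index("\n"):
        score *= title_weight
    return score

def annotate(candidates, text, min_relevance=0.5):
    kept = []
    for name in candidates:
        if name.lower() in NONSPECIFIC:
            continue                     # nonspecific-name filter
        if relevance(name, text) < min_relevance:
            continue                     # low-relevance filter
        kept.append(name)
    return kept

abstract = "Aspirin and inflammation.\nWe gave aspirin in water daily."
print(annotate(["aspirin", "water"], abstract))   # ['aspirin']
```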

    Investigating heterogeneous protein annotations toward cross-corpora utilization

    Background: The number of corpora, collections of structured texts, has been increasing as a result of the growing interest in applying natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community there is as yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on divergently annotated resources. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. It therefore becomes a task of interest to integrate the heterogeneous annotations in these resources. Results: We explore the potential sources of incompatibility among the gene and protein annotations of three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora's annotations, we first tackle the incompatibility problem caused by corpus integration and quantitatively measure the effect of this incompatibility on protein mention recognition. We find that F-score performance declines tremendously when training with integrated data instead of pure data; in some cases, the performance drops by nearly 12%. This degradation may be caused by the newly added heterogeneous annotations and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences and investigate the factors that would explain the inconsistencies by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in four areas: boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering these four aspects. Conclusion: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences among the three studied corpora, which can in turn lead to a better understanding of the performance of protein recognizers based on them.
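
    A minimal sketch quantifying one of the incompatibilities named above, boundary annotation conventions, by counting exact-span versus merely overlapping matches between two annotation sets. The (start, end) character-offset span format is an assumption for illustration.

```python
# Sketch: compare entity spans from two corpora over the same text to
# expose boundary-convention differences; span format is illustrative.
def boundary_agreement(spans_a, spans_b):
    exact = len(set(spans_a) & set(spans_b))
    overlap = sum(1 for (s1, e1) in spans_a
                  if any(s1 < e2 and s2 < e1 for (s2, e2) in spans_b))
    return exact, overlap

# e.g. "human p53 protein": one corpus tags (6, 9), another (6, 17)
print(boundary_agreement([(6, 9)], [(6, 17)]))   # (0, 1): no exact match
```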

    Incorporating rich background knowledge for gene named entity classification and recognition

    Background: Gene named entity classification and recognition are crucial preliminary steps of text mining in the biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, have played a central part. However, this type of feature tends to cause extreme sparseness in the feature space; as a result, out-of-vocabulary (OOV) terms in the training data are not modeled well due to lack of information. Results: We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher-level features using term frequency and co-occurrence information of highly indicative features in huge amounts of unlabeled data. We examine its performance in a named entity classification task designed to remove non-gene entries from a large dictionary derived from online resources. The results show that the new features generated by FCG outperform lexical features by 5.97 F-score overall and by 10.85 on OOV terms. Within this framework, each extension yields significant improvements, and the sparse lexical features can be transformed into a lower-dimensional and more informative representation. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on the BioCreative 2 GM test set. We then combined the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, which improves the performance of the CRF-based tagger by 4.46 with little impact on the efficiency of the recognition system. A demo of the NER system is available at http://202.118.75.18:8080/bioner.
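
    A minimal sketch of the forward maximum match dictionary tagging mentioned above: at each token position, the longest dictionary entry starting there is tagged. The toy dictionary is illustrative, not the refined resource described in the abstract.

```python
# Sketch of forward maximum match tagging over a token list; the toy
# dictionary and max_len are illustrative.
def forward_max_match(tokens, dictionary, max_len=6):
    """Greedy left-to-right longest-match tagging."""
    entities, i = [], 0
    while i < len(tokens):
        match = None
        for j in range(min(len(tokens), i + max_len), i, -1):
            if " ".join(tokens[i:j]).lower() in dictionary:
                match = (i, j)
                break                    # longest match starting at i wins
        if match:
            entities.append(" ".join(tokens[match[0]:match[1]]))
            i = match[1]                 # resume after the matched span
        else:
            i += 1
    return entities

genes = {"tumor necrosis factor", "p53"}
print(forward_max_match("p53 activates tumor necrosis factor".split(), genes))
# ['p53', 'tumor necrosis factor']
```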

    Resident physician and hospital pharmacist familiarity with patient discharge medication costs

    Objective: Cost-related medication non-adherence is associated with increased health-care resource utilization and poor patient outcomes. Physicians-in-training generally receive little education regarding the costs of prescribed therapy and may rely on hospital pharmacists for this information. However, little is documented regarding either of these health care providers' familiarity with the out-of-pocket medication expenses borne by patients in the community. The purpose of this study was to evaluate and compare resident physician and hospital pharmacist familiarity with what patients pay for medications prescribed at discharge. Setting: A major tertiary patient care and medical teaching centre in Canada. Method: Internal medicine residents and hospital pharmacists within a specific health care organization were invited to participate in an online survey. Eight patient case scenarios and associated discharge therapeutic regimens were outlined, and respondents were asked to identify the costs patients would incur when having the prescription filled after discharge. Main outcome measure: The total number and proportion of estimates above and below the actual cost were calculated and compared between the groups using χ² tests. Responses within ±10% of the true cost were considered correct. Mean absolute values and standard deviations of estimated costs, as well as cost increments above and below 10%, were calculated to assess the magnitude of the discrepancy between the respondents' estimates and the actual total cost. Results: Forty-four percent of resident physicians and 26% of hospital pharmacists accessed the survey. Overall, 39% and 47% of medication costs were under-estimated, 32% and 33% were over-estimated, and 29% and 21% were correctly estimated by residents and pharmacists, respectively (P = NS). Incorrect estimates were evident across all therapeutic classes and medical indications presented in the survey. The greatest absolute cost discrepancies for both groups were the under-estimation of linezolid ($800 and $400) and the over-estimation of clopidogrel ($80) and bisoprolol therapy ($22) by residents and pharmacists, respectively. Conclusion: Resident physicians and hospital pharmacists are unfamiliar with what patients must pay for drug therapy once discharged.
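
    A minimal sketch of the scoring rule this abstract describes: an estimate is "correct" if it falls within ±10% of the true cost, otherwise it counts as an under- or over-estimate. The example figures are illustrative only.

```python
# Sketch of the +/-10% correctness rule used to classify cost estimates.
def classify_estimate(estimate, true_cost, tolerance=0.10):
    if abs(estimate - true_cost) <= tolerance * true_cost:
        return "correct"
    return "under" if estimate < true_cost else "over"

print(classify_estimate(400.0, 800.0))   # "under": a linezolid-sized gap
print(classify_estimate(82.0, 80.0))     # "correct": within +/-10%
```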

    Rule-based modelling provides an extendable framework for comparing candidate mechanisms underpinning clathrin polymerisation

    Abstract: Polymerisation of clathrin is a key process that underlies clathrin-mediated endocytosis. Clathrin-coated vesicles are responsible for the cellular internalization of external substances required for normal homeostasis and life-sustaining activity. There are several hypotheses describing the formation of closed clathrin structures. According to one of the proposed mechanisms, cage formation may start from a flat lattice built up on the cellular membrane, which is later transformed into a curved structure. Creation of the curved surface requires rearrangement of the lattice, induced by additional molecular mechanisms. Comparing these potential mechanisms requires a modelling framework that can be easily modified. We created an extendable rule-based model that describes the polymerisation of clathrin molecules and various scenarios of cage formation. Using Global Sensitivity Analysis (GSA) we obtained parameter sets describing clathrin pentagon closure and the emergence/production and closure of large-size clathrin cages/vesicles. We were able to demonstrate that the model can reproduce budding of the clathrin cage from an initial flat array.
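
    A minimal sketch of the rule-based modelling style this abstract refers to: the state is a multiset of species, and rules rewrite it. The single binding rule below is an illustrative simplification (rule rates are omitted); the published model distinguishes many more species and rules for lattice growth, curvature, and closure.

```python
# Sketch of rule-based rewriting of a species multiset; the rule set is
# an illustrative toy, not the authors' clathrin model.
import random

def step(state, rules):
    """Apply one randomly chosen applicable rule (rates omitted)."""
    applicable = [r for r in rules if all(state.get(s, 0) >= n
                                          for s, n in r["consume"].items())]
    if not applicable:
        return False
    rule = random.choice(applicable)
    for s, n in rule["consume"].items():
        state[s] -= n
    for s, n in rule["produce"].items():
        state[s] = state.get(s, 0) + n
    return True

# toy rule: a free clathrin triskelion attaches at a growing lattice edge
rules = [{"consume": {"clathrin": 1, "lattice_edge": 1},
          "produce": {"lattice_edge": 1, "bound": 1}}]
state = {"clathrin": 100, "lattice_edge": 1, "bound": 0}
while step(state, rules) and state["clathrin"] > 0:
    pass
print(state["bound"])   # all 100 triskelia end up bound under this toy rule
```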

    Studying the Underlying Event in Drell-Yan and High Transverse Momentum Jet Production at the Tevatron

    We study the underlying event in proton-antiproton collisions by examining the behavior of charged particles (transverse momentum pT > 0.5 GeV/c, pseudorapidity |η| < 1) produced in association with large transverse momentum jets (~2.2 fb⁻¹) or with Drell-Yan lepton pairs (~2.7 fb⁻¹) in the Z-boson mass region (70 < M(pair) < 110 GeV/c²), as measured by CDF at 1.96 TeV center-of-mass energy. We use the direction of the lepton pair (in Drell-Yan production) or the leading jet (in high-pT jet production) in each event to define three regions of η-φ space (toward, away, and transverse), where φ is the azimuthal scattering angle. For Drell-Yan production (excluding the leptons), both the toward and transverse regions are very sensitive to the underlying event. In high-pT jet production the transverse region is very sensitive to the underlying event and is separated into MAX and MIN transverse regions, which helps separate the hard component (initial- and final-state radiation) from the beam-beam remnant and multiple parton interaction components of the scattering. The data are corrected to the particle level to remove detector effects and are then compared with several QCD Monte Carlo models. The goal of this analysis is to provide data that can be used to test and improve the QCD Monte Carlo models of the underlying event that are used to simulate hadron-hadron collisions. Comment: Submitted to Phys. Rev.
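
    A minimal sketch of the region assignment described above: each charged particle is placed in the toward, away, or transverse region by its azimuthal separation from the leading jet (or lepton-pair) direction. The 60°/120° boundaries follow the standard CDF underlying-event convention and should be treated as an assumption here.

```python
# Sketch of toward/away/transverse classification by azimuthal angle;
# the 60/120 degree boundaries are the conventional CDF choice.
import math

def region(phi_particle, phi_leading):
    # fold the azimuthal difference into [0, 180] degrees
    dphi = abs(math.remainder(phi_particle - phi_leading, 2 * math.pi))
    deg = math.degrees(dphi)
    if deg < 60.0:
        return "toward"
    if deg > 120.0:
        return "away"
    return "transverse"

print(region(math.radians(90.0), 0.0))   # "transverse"
```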