
    Investigating heterogeneous protein annotations toward cross-corpora utilization

    BACKGROUND: The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. RESULTS: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aspects aforementioned. CONCLUSION: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.

    The Prevalence of Limited Health Literacy

    OBJECTIVE: To systematically review U.S. studies examining the prevalence of limited health literacy and to synthesize these findings by evaluating demographic associations in pooled analyses. DESIGN: We searched the literature for the period 1963 through January 2004 and identified 2,132 references related to a set of specified search terms. Of the 134 articles and published abstracts retrieved, 85 met inclusion criteria, which were 1) conducted in the United States with ≥25 adults, 2) addressed a hypothesis related to health care, 3) identified a measurement instrument, and 4) presented primary data. The authors extracted data to compare studies by population, methods, and results. MAIN RESULTS: The 85 studies reviewed include data on 31,129 subjects, and report a prevalence of low health literacy between 0% and 68%. Pooled analyses of these data reveal that the weighted prevalence of low health literacy was 26% (95% confidence interval [CI], 22% to 29%) and of marginal health literacy was 20% (95% CI, 16% to 23%). Most studies used either the Rapid Estimate of Adult Literacy in Medicine (REALM) or versions of the Test of Functional Health Literacy in Adults (TOFHLA). The prevalence of low health literacy was not associated with gender (P = .38) or measurement instrument (P = .23) but was associated with level of education (P = .02), ethnicity (P = .0003), and age (P = .004). CONCLUSIONS: A pooled analysis of published reports on health literacy cannot provide a nationally representative prevalence estimate. This systematic review shows that limited health literacy, as depicted in the medical literature, is prevalent and is consistently associated with education, ethnicity, and age. It is essential to simplify health services and improve health education. Such changes have the potential to improve the health of Americans and address the health disparities that exist today.

    Partial Oxidation of C2 to C4 Paraffins
