    Rewriting and suppressing UMLS terms for improved biomedical term identification

    Abstract

    Background: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining, we implemented and evaluated nine term rewrite rules and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact of the different rules on the number of terms identified in a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. For every rule, the 50 most frequently found terms, together with a sample of 100 randomly selected terms, were evaluated.

    Results: Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms, and seven of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we identified 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, an increase of 2.8% in the number of terms and 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size, and 7,397 terms were suppressed in the corpus.

    Conclusions: We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is used for biomedical term identification in MEDLINE. A software tool that applies these rules to the UMLS is freely available at http://biosemantics.org/casper.
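
    The rule-based pipeline described above is straightforward to reproduce in outline. The Python sketch below shows its overall shape: rewrite rules expand each UMLS term into variants, suppression rules filter out undesired terms, and the surviving lexicon is matched against a corpus to count term occurrences. The two example rules and all names here are illustrative stand-ins, not the paper's actual nine rewrite and eight suppression rules (those are implemented in the CASPER tool linked above).

```python
import re

def rewrite_rules(term):
    """Yield spelling variants / synonyms for a UMLS term (examples only)."""
    variants = {term}
    # Example rewrite: strip a trailing "NOS" (not otherwise specified) qualifier.
    variants.add(re.sub(r",?\s*NOS$", "", term))
    # Example rewrite: generate a variant without possessive markers.
    variants.add(term.replace("'s ", " "))
    return variants

def is_suppressed(term):
    """Return True if a term is undesirable for text mining (examples only)."""
    if len(term) < 3:                      # very short terms match promiscuously
        return True
    return not re.search(r"[A-Za-z]", term)  # punctuation/digit-only terms

def build_lexicon(umls_terms):
    lexicon = set()
    for term in umls_terms:
        for variant in rewrite_rules(term):
            if not is_suppressed(variant):
                lexicon.add(variant.lower())
    return lexicon

def count_occurrences(lexicon, sentences):
    """Count the sentences in which each lexicon term occurs (substring match)."""
    counts = {}
    for sentence in sentences:
        text = sentence.lower()
        for term in lexicon:
            if term in text:
                counts[term] = counts.get(term, 0) + 1
    return counts

if __name__ == "__main__":
    terms = ["Alzheimer's disease", "Pneumonia, NOS", "%"]
    corpus = ["alzheimer disease was studied in patients with pneumonia."]
    print(count_occurrences(build_lexicon(terms), corpus))
```

    Running the lexicon builder before and after enabling the rewrite rules, as the authors do, gives the before/after term and concept counts reported in the Results.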

    The CALBC Silver Standard Corpus for Biomedical Named Entities - A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers

    The production of gold standard corpora is time-consuming and costly. We propose an alternative: the 'silver standard corpus' (SSC), a corpus that has been generated by harmonising the annotations delivered by a selection of annotation systems. The systems have to share the type system for the annotations, and the harmonisation solution has to use a suitable similarity measure for the pairwise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630,324 sentences; 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all annotation solutions used, leading to lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.
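
    A minimal sketch may make the harmonisation step concrete. Here annotations are (start, end, type) triples, the pairwise comparison uses a character-overlap similarity, and an annotation enters the silver standard when a minimum number of systems agree on it. The 50% overlap threshold and the two-vote minimum are assumptions for illustration, not the CALBC project's actual harmonisation parameters.

```python
def similar(a, b, min_overlap=0.5):
    """Pairwise comparison of two (start, end, type) annotations."""
    if a[2] != b[2]:                          # entity types must match
        return False
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    shortest = min(a[1] - a[0], b[1] - b[0])
    return overlap > 0 and overlap / shortest >= min_overlap

def harmonise(system_annotations, min_votes=2):
    """Keep annotations that at least `min_votes` systems agree on."""
    silver = []
    systems = list(system_annotations)
    for name in systems:
        for ann in system_annotations[name]:
            votes = 1 + sum(
                any(similar(ann, other) for other in system_annotations[o])
                for o in systems if o != name
            )
            # Avoid adding near-duplicates of annotations already kept.
            if votes >= min_votes and not any(similar(ann, s) for s in silver):
                silver.append(ann)
    return silver

if __name__ == "__main__":
    annotations = {
        "tagger_A": [(0, 5, "GENE"), (10, 17, "DISEASE")],
        "tagger_B": [(0, 6, "GENE")],
        "tagger_C": [(10, 17, "DISEASE"), (20, 24, "SPECIES")],
    }
    print(harmonise(annotations))  # keeps the GENE and DISEASE spans only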

    Knowledge-based extraction of adverse drug events from biomedical text

    Background: Many biomedical relation extraction systems are machine-learning based and have to be trained on large annotated corpora that are expensive and cumbersome to construct. We developed a knowledge-based relation extraction system that requires minimal training data, and applied the system to the extraction of adverse drug events from biomedical text. The system consists of a concept recognition module that identifies drugs and adverse effects in sentences, and a knowledge…
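
    The two-module design described above can be sketched: a dictionary-based recogniser finds drugs and adverse effects in a sentence, and a second step pairs the concepts that co-occur. The tiny dictionaries and the plain co-occurrence rule below are illustrative assumptions standing in for the paper's knowledge-based decision module.

```python
# Hypothetical mini-dictionaries; a real system would use a terminology
# such as the UMLS for concept recognition.
DRUGS = {"ibuprofen", "warfarin"}
ADVERSE_EFFECTS = {"bleeding", "nausea"}

def recognise(sentence):
    """Concept recognition: find drug and adverse-effect mentions."""
    tokens = {t.strip(".,;").lower() for t in sentence.split()}
    return tokens & DRUGS, tokens & ADVERSE_EFFECTS

def extract_ade(sentences):
    """Yield (drug, adverse effect) pairs co-occurring in one sentence."""
    for sentence in sentences:
        drugs, effects = recognise(sentence)
        for drug in drugs:
            for effect in effects:
                yield drug, effect, sentence

if __name__ == "__main__":
    corpus = ["Warfarin use was associated with severe bleeding."]
    for drug, effect, sent in extract_ade(corpus):
        print(f"{drug} -> {effect}: {sent}")
```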

    Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts

    Electronic patient records remain a rather unexplored but potentially rich data source for discovering correlations between diseases. We describe a general approach for gathering phenotypic descriptions of patients from medical records in a systematic and non-cohort-dependent manner. By extracting phenotype information from the free text in such records, we demonstrate that we can extend the information contained in the structured record data and use it to produce fine-grained patient stratification and disease co-occurrence statistics. The approach uses a dictionary based on the International Classification of Diseases (ICD) ontology and is therefore, in principle, language independent. As a use case, we show how records from a Danish psychiatric hospital lead to the identification of disease correlations, which can subsequently be mapped to systems biology frameworks.
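
    A minimal sketch of the dictionary-based approach, under assumed toy data: free text from patient notes is matched against an ICD term dictionary, codes are pooled per patient, and pairwise disease co-occurrence is counted across patients. The dictionary entries and record structure are invented for illustration, not actual ICD-10 content or the authors' pipeline.

```python
from collections import Counter
from itertools import combinations

# Toy term -> code dictionary standing in for a full ICD-based lexicon.
ICD_DICTIONARY = {
    "schizophrenia": "F20",
    "depression": "F32",
    "diabetes": "E11",
}

def extract_codes(note):
    """Dictionary lookup: map free-text mentions to ICD codes."""
    text = note.lower()
    return {code for term, code in ICD_DICTIONARY.items() if term in text}

def cooccurrence(patient_notes):
    """Count ICD code pairs that co-occur within one patient's record."""
    pair_counts = Counter()
    for notes in patient_notes.values():
        codes = set().union(*(extract_codes(n) for n in notes))
        pair_counts.update(combinations(sorted(codes), 2))
    return pair_counts

if __name__ == "__main__":
    records = {
        "patient_1": ["History of depression.", "Type 2 diabetes noted."],
        "patient_2": ["Diagnosed with schizophrenia and depression."],
    }
    print(cooccurrence(records))
```

    Because the matching is a plain dictionary lookup, swapping in an ICD dictionary for another language is enough to port the approach, which is the language independence claimed above.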

    The Implicitome: A Resource for Rationalizing Gene-Disease Associations

    High-throughput experimental methods such as medical sequencing and genome-wide association studies (GWAS) identify increasingly large numbers of potential relations between genetic variants and diseases. Both biological complexity (millions of potential gene-disease associations) and the accelerating rate of data production necessitate computational approaches to prioritize and rationalize potential gene-disease relations. Here, we use concept profile technology to expose from the biomedical literature both explicitly stated gene-disease relations (the explicitome) and a much larger set of implied gene-disease associations (the implicitome). Implicit relations are largely unknown to, or even unintended by, the original authors, but they vastly extend the reach of existing biomedical knowledge for the identification and interpretation of gene-disease associations. The implicitome can be used in conjunction with experimental data resources to rationalize both known and novel associations. We demonstrate the usefulness of the implicitome by rationalizing known and novel gene-disease associations, including those from GWAS. To facilitate the re-use of implicit gene-disease associations, we publish our data in compliance with the FAIR Data Publishing recommendations [https://www.force11.org/group/fairgroup] using nanopublications. An online tool (http://knowledge.bio) is available to explore established and potential gene-disease associations in the context of other biomedical relations.
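
    Concept profile matching can be illustrated with a small sketch. Each gene or disease is represented as a weighted vector of the concepts it is associated with in the literature; an implicit association between a gene and a disease is then scored by comparing their profiles, here with cosine similarity as an assumed matching function. The profiles and weights below are invented for illustration, not data from the paper.

```python
import math

def cosine(profile_a, profile_b):
    """Score two concept profiles (concept -> weight dictionaries)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    return dot / (norm_a * norm_b) if dot else 0.0

if __name__ == "__main__":
    # Hypothetical literature-derived profiles.
    gene_profile = {"insulin signalling": 0.9, "beta cell": 0.7, "obesity": 0.4}
    disease_profile = {"obesity": 0.8, "beta cell": 0.6, "hyperglycemia": 0.9}
    # A non-zero score suggests an implicit gene-disease association even
    # if the gene and disease never co-occur in a single abstract.
    print(round(cosine(gene_profile, disease_profile), 3))
```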

    An informatics approach to prioritizing risk assessment for chemicals and chemical combinations based on near-field exposure from consumer products

    Over 80,000 chemicals are registered under the U.S. Toxic Substances Control Act of 1976, but only a few hundred have been screened for human toxicity. Not even those used in everyday consumer products, and known to have widespread exposure in the general population, have been screened. Toxicity screening is time-consuming, expensive, and complex, because simultaneous or sequential exposure to multiple environmental stressors can affect chemical toxicity. Cumulative risk assessments consider multiple stressors, but it is impractical to test every chemical combination and environmental stressor to which people are exposed.

    The goal of this research is to prioritize the chemical ingredients in consumer products and their most prevalent combinations for risk assessment, based on likely exposure and retention. This work is motivated by two concerns. The first, as noted above, is the vast number of environmental chemicals with unknown toxicity. Our body burden (or chemical load) is much greater today than a century ago. The second motivating concern is the mounting evidence that many of these chemicals are potentially harmful. This makes us the unwitting participants in a vast, uncontrolled biochemistry experiment.

    An informatics approach is developed here that uses publicly available data to estimate chemical exposure from everyday consumer products, which account for a significant proportion of overall chemical load. Several barriers have to be overcome for this approach to be effective. First, a structured database of consumer products has to be created. Even though such data is largely public, it is not readily available or easily accessible. The requisite consumer product information is retrieved from online retailers. The resulting database contains brand, name, ingredients, and category for tens of thousands of unique products. Second, chemical nomenclature is often ambiguous. Synonymy (i.e., different names for the same chemical) and homonymy (i.e., the same name for different chemicals) are rampant. The PubChem Compound database, and to a lesser extent the Unified Medical Language System, are used to map chemicals to unique identifiers. Third, lists of toxicologically interesting chemicals have to be compiled. Fortunately, several authoritative bodies (e.g., the U.S. Environmental Protection Agency) publish lists of suspected harmful chemicals to be prioritized for risk assessment. Fourth, tabulating the mere presence of potentially harmful chemicals and their co-occurrence within consumer product formulations is not as interesting as quantifying likely exposure based on consumer usage patterns and product usage modes, so product usage patterns from actual consumers are required. A suitable dataset is obtained from the Kantar Worldpanel, a market analysis firm that tracks consumer behavior. Finally, a computationally feasible probabilistic approach has to be developed to estimate likely exposure and retention for individual chemicals and their combinations. Exposure is defined here as the presence of a chemical in a product used by a consumer; retention is exposure combined with the relative likelihood that the chemical will be absorbed by the consumer, based on a product's usage mode (e.g., whether the product is rinsed off or left on after use).

    The results of four separate analyses are presented here to show the efficacy of the informatics approach. The first is a proof of concept demonstrating that the first two barriers, creating the consumer product database and dealing with chemical synonymy and homonymy, can be overcome, and that the resulting system can measure the per-product prevalence of a small set of target chemicals (55 asthma-associated and endocrine-disrupting compounds) and their combinations. A database of 38,975 distinct consumer products and 32,231 distinct ingredient names was created by scraping Drugstore.com, an online retailer. Nearly one-third of the products (11,688 products, 30%) contained ≥1 target chemical, and 5,229 products (13%) contained >1. Of the 55 target chemicals, 31 (56%) appear in ≥1 product and 19 (35%) appear under more than one name. The most frequent 3-way chemical combination (2-phenoxyethanol, methylparaben, and ethylparaben) appears in 1,059 products. The second analysis demonstrates that the informatics approach can scale to several thousand target chemicals (11,964 environmental chemicals compiled from five authoritative lists); it repeats the proof of concept using a larger product sample (55,209 consumer products). In the third analysis, product usage patterns and usage modes are incorporated. This analysis yields unbiased, rational prioritizations of potentially hazardous chemicals and chemical combinations, based on their prevalence within a subset of the product sample (29,814 personal care products), combined exposure from multiple products based on actual consumer behavior, and likely chemical retention based on product usage modes. High-ranking chemicals, and combinations thereof, include glycerol; octamethyltrisiloxane; citric acid; titanium dioxide; 1,2-propanediol; octadecan-1-ol; saccharin; hexitol; limonene; linalool; vitamin E; and 2-phenoxyethanol. The fourth analysis is the same as the third, except that each authoritative list is prioritized individually for side-by-side comparison.

    The informatics approach is a viable and rational way to prioritize chemicals and chemical combinations for risk assessment based on near-field exposure and retention. Compared to spectrographic approaches to chemical detection, the informatics approach has the advantage of a larger product sample, so it often detects chemicals that are missed during spectrographic analysis. However, the informatics approach is limited to the chemicals that are actually listed on product labels. Manufacturers are not required to specify the chemicals in fragrance or flavor mixtures, so the presence of some chemicals may be underestimated. Likewise, chemicals that are not part of the product formulation (e.g., chemicals leached from packaging, degradation byproducts) cannot be detected. Therefore, spectrographic and informatics approaches are complementary.
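
    The core of the prioritization pipeline can be sketched compactly: normalize ingredient names to canonical identifiers (standing in for the PubChem lookup), accumulate retention-weighted occurrences of chemicals and their pairwise combinations across products, and rank the results. All synonym mappings, identifiers, and usage-mode weights below are illustrative assumptions, not values from the dissertation.

```python
from collections import Counter
from itertools import combinations

# Surface name -> canonical identifier (stand-in for PubChem CID lookup).
SYNONYMS = {
    "methylparaben": "CID_7456",
    "methyl paraben": "CID_7456",
    "2-phenoxyethanol": "CID_31236",
    "phenoxyethanol": "CID_31236",
    "ethylparaben": "CID_8434",
}

# Assumed usage-mode retention weights (leave-on products retain more).
RETENTION = {"leave_on": 1.0, "rinse_off": 0.1}

def normalise(ingredients):
    """Resolve synonymy/homonymy by mapping names to canonical IDs."""
    return {SYNONYMS[i.lower()] for i in ingredients if i.lower() in SYNONYMS}

def prioritise(products):
    """Score chemicals and 2-way combinations by retention-weighted use."""
    chemical_scores, pair_scores = Counter(), Counter()
    for product in products:
        weight = RETENTION[product["usage_mode"]]
        chems = normalise(product["ingredients"])
        for chem in chems:
            chemical_scores[chem] += weight
        for pair in combinations(sorted(chems), 2):
            pair_scores[pair] += weight
    return chemical_scores, pair_scores

if __name__ == "__main__":
    catalogue = [
        {"ingredients": ["Methyl Paraben", "Phenoxyethanol"],
         "usage_mode": "leave_on"},
        {"ingredients": ["methylparaben", "Ethylparaben"],
         "usage_mode": "rinse_off"},
    ]
    chems, pairs = prioritise(catalogue)
    print(chems.most_common(3), pairs.most_common(1))
```

    Note how the synonym map credits "Methyl Paraben" and "methylparaben" to the same identifier, which is exactly the synonymy problem the second barrier describes.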

    A tool based on standardized terminologies for the semantic annotation of textual information

    The objective of this thesis is the design and implementation of lexical, syntactic, and semantic techniques that make the most of the available knowledge resources in order to improve the extraction and analysis of the relevant information contained in scientific publications.