    Rewriting and suppressing UMLS terms for improved biomedical term identification

    Abstract

    Background: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining, we implemented and evaluated nine term rewrite rules and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact of the different rules on the number of terms identified in a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. For every rule, the 50 most frequently found terms, together with a sample of 100 randomly selected terms, were evaluated.

    Results: Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms, and seven of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we identified 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, an increase of 2.8% in the number of terms and 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size, and 7,397 terms were suppressed in the corpus.

    Conclusions: We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is used for biomedical term identification in MEDLINE. A software tool that applies these rules to the UMLS is freely available at http://biosemantics.org/casper.
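
    The rule-based pipeline described above is straightforward to reproduce in outline. The Python sketch below shows its overall shape: rewrite rules expand each UMLS term into variants, suppression rules filter out undesired terms, and the surviving lexicon is matched against a corpus to count term occurrences. The two example rules and all names here are illustrative stand-ins, not the paper's actual nine rewrite and eight suppression rules (those are implemented in the CASPER tool linked above).

```python
import re

def rewrite_rules(term):
    """Yield spelling variants / synonyms for a UMLS term (examples only)."""
    variants = {term}
    # Example rewrite: strip a trailing "NOS" (not otherwise specified) qualifier.
    variants.add(re.sub(r",?\s*NOS$", "", term))
    # Example rewrite: generate a variant without possessive markers.
    variants.add(term.replace("'s ", " "))
    return variants

def is_suppressed(term):
    """Return True if a term is undesirable for text mining (examples only)."""
    if len(term) < 3:                      # very short terms match promiscuously
        return True
    return not re.search(r"[A-Za-z]", term)  # punctuation/digit-only terms

def build_lexicon(umls_terms):
    lexicon = set()
    for term in umls_terms:
        for variant in rewrite_rules(term):
            if not is_suppressed(variant):
                lexicon.add(variant.lower())
    return lexicon

def count_occurrences(lexicon, sentences):
    """Count the sentences in which each lexicon term occurs (substring match)."""
    counts = {}
    for sentence in sentences:
        text = sentence.lower()
        for term in lexicon:
            if term in text:
                counts[term] = counts.get(term, 0) + 1
    return counts

if __name__ == "__main__":
    terms = ["Alzheimer's disease", "Pneumonia, NOS", "%"]
    corpus = ["alzheimer disease was studied in patients with pneumonia."]
    print(count_occurrences(build_lexicon(terms), corpus))
```

    Running the lexicon builder before and after enabling the rewrite rules, as the authors do, gives the before/after term and concept counts reported in the Results.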

    The CALBC Silver Standard Corpus for Biomedical Named Entities - A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers

    The production of gold standard corpora is time-consuming and costly. We propose an alternative: the 'silver standard corpus' (SSC), a corpus that has been generated by harmonising the annotations delivered by a selection of annotation systems. The systems have to share the type system for the annotations, and the harmonisation solution has to use a suitable similarity measure for the pairwise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630,324 sentences; 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all annotation solutions used, leading to lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.
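
    A minimal sketch may make the harmonisation step concrete. Here annotations are (start, end, type) triples, the pairwise comparison uses a character-overlap similarity, and an annotation enters the silver standard when a minimum number of systems agree on it. The 50% overlap threshold and the two-vote minimum are assumptions for illustration, not the CALBC project's actual harmonisation parameters.

```python
def similar(a, b, min_overlap=0.5):
    """Pairwise comparison of two (start, end, type) annotations."""
    if a[2] != b[2]:                          # entity types must match
        return False
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    shortest = min(a[1] - a[0], b[1] - b[0])
    return overlap > 0 and overlap / shortest >= min_overlap

def harmonise(system_annotations, min_votes=2):
    """Keep annotations that at least `min_votes` systems agree on."""
    silver = []
    systems = list(system_annotations)
    for name in systems:
        for ann in system_annotations[name]:
            votes = 1 + sum(
                any(similar(ann, other) for other in system_annotations[o])
                for o in systems if o != name
            )
            # Avoid adding near-duplicates of annotations already kept.
            if votes >= min_votes and not any(similar(ann, s) for s in silver):
                silver.append(ann)
    return silver

if __name__ == "__main__":
    annotations = {
        "tagger_A": [(0, 5, "GENE"), (10, 17, "DISEASE")],
        "tagger_B": [(0, 6, "GENE")],
        "tagger_C": [(10, 17, "DISEASE"), (20, 24, "SPECIES")],
    }
    print(harmonise(annotations))  # keeps the GENE and DISEASE spans only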

    Knowledge-based extraction of adverse drug events from biomedical text

    Background: Many biomedical relation extraction systems are machine-learning based and have to be trained on large annotated corpora that are expensive and cumbersome to construct. We developed a knowledge-based relation extraction system that requires minimal training data, and applied the system to the extraction of adverse drug events from biomedical text. The system consists of a concept recognition module that identifies drugs and adverse effects in sentences, and a knowledge…
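
    The two-module design described above can be sketched: a dictionary-based recogniser finds drugs and adverse effects in a sentence, and a second step pairs the concepts that co-occur. The tiny dictionaries and the plain co-occurrence rule below are illustrative assumptions standing in for the paper's knowledge-based decision module.

```python
# Hypothetical mini-dictionaries; a real system would use a terminology
# such as the UMLS for concept recognition.
DRUGS = {"ibuprofen", "warfarin"}
ADVERSE_EFFECTS = {"bleeding", "nausea"}

def recognise(sentence):
    """Concept recognition: find drug and adverse-effect mentions."""
    tokens = {t.strip(".,;").lower() for t in sentence.split()}
    return tokens & DRUGS, tokens & ADVERSE_EFFECTS

def extract_ade(sentences):
    """Yield (drug, adverse effect) pairs co-occurring in one sentence."""
    for sentence in sentences:
        drugs, effects = recognise(sentence)
        for drug in drugs:
            for effect in effects:
                yield drug, effect, sentence

if __name__ == "__main__":
    corpus = ["Warfarin use was associated with severe bleeding."]
    for drug, effect, sent in extract_ade(corpus):
        print(f"{drug} -> {effect}: {sent}")
```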

    Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts

    Electronic patient records remain a rather unexplored but potentially rich data source for discovering correlations between diseases. We describe a general approach for gathering phenotypic descriptions of patients from medical records in a systematic and non-cohort-dependent manner. By extracting phenotype information from the free text in such records, we demonstrate that we can extend the information contained in the structured record data and use it to produce fine-grained patient stratification and disease co-occurrence statistics. The approach uses a dictionary based on the International Classification of Diseases (ICD) ontology and is therefore, in principle, language independent. As a use case, we show how records from a Danish psychiatric hospital lead to the identification of disease correlations, which can subsequently be mapped to systems biology frameworks.
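
    A minimal sketch of the dictionary-based approach, under assumed toy data: free text from patient notes is matched against an ICD term dictionary, codes are pooled per patient, and pairwise disease co-occurrence is counted across patients. The dictionary entries and record structure are invented for illustration, not actual ICD-10 content or the authors' pipeline.

```python
from collections import Counter
from itertools import combinations

# Toy term -> code dictionary standing in for a full ICD-based lexicon.
ICD_DICTIONARY = {
    "schizophrenia": "F20",
    "depression": "F32",
    "diabetes": "E11",
}

def extract_codes(note):
    """Dictionary lookup: map free-text mentions to ICD codes."""
    text = note.lower()
    return {code for term, code in ICD_DICTIONARY.items() if term in text}

def cooccurrence(patient_notes):
    """Count ICD code pairs that co-occur within one patient's record."""
    pair_counts = Counter()
    for notes in patient_notes.values():
        codes = set().union(*(extract_codes(n) for n in notes))
        pair_counts.update(combinations(sorted(codes), 2))
    return pair_counts

if __name__ == "__main__":
    records = {
        "patient_1": ["History of depression.", "Type 2 diabetes noted."],
        "patient_2": ["Diagnosed with schizophrenia and depression."],
    }
    print(cooccurrence(records))
```

    Because the matching is a plain dictionary lookup, swapping in an ICD dictionary for another language is enough to port the approach, which is the language independence claimed above.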

    The Implicitome: A Resource for Rationalizing Gene-Disease Associations

    High-throughput experimental methods such as medical sequencing and genome-wide association studies (GWAS) identify increasingly large numbers of potential relations between genetic variants and diseases. Both biological complexity (millions of potential gene-disease associations) and the accelerating rate of data production necessitate computational approaches to prioritize and rationalize potential gene-disease relations. Here, we use concept profile technology to expose from the biomedical literature both explicitly stated gene-disease relations (the explicitome) and a much larger set of implied gene-disease associations (the implicitome). Implicit relations are largely unknown to, or even unintended by, the original authors, but they vastly extend the reach of existing biomedical knowledge for the identification and interpretation of gene-disease associations. The implicitome can be used in conjunction with experimental data resources to rationalize both known and novel associations. We demonstrate the usefulness of the implicitome by rationalizing known and novel gene-disease associations, including those from GWAS. To facilitate the re-use of implicit gene-disease associations, we publish our data in compliance with the FAIR Data Publishing recommendations [https://www.force11.org/group/fairgroup] using nanopublications. An online tool (http://knowledge.bio) is available to explore established and potential gene-disease associations in the context of other biomedical relations.
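
    Concept profile matching can be illustrated with a small sketch. Each gene or disease is represented as a weighted vector of the concepts it is associated with in the literature; an implicit association between a gene and a disease is then scored by comparing their profiles, here with cosine similarity as an assumed matching function. The profiles and weights below are invented for illustration, not data from the paper.

```python
import math

def cosine(profile_a, profile_b):
    """Score two concept profiles (concept -> weight dictionaries)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    return dot / (norm_a * norm_b) if dot else 0.0

if __name__ == "__main__":
    # Hypothetical literature-derived profiles.
    gene_profile = {"insulin signalling": 0.9, "beta cell": 0.7, "obesity": 0.4}
    disease_profile = {"obesity": 0.8, "beta cell": 0.6, "hyperglycemia": 0.9}
    # A non-zero score suggests an implicit gene-disease association even
    # if the gene and disease never co-occur in a single abstract.
    print(round(cosine(gene_profile, disease_profile), 3))
```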

    An informatics approach to prioritizing risk assessment for chemicals and chemical combinations based on near-field exposure from consumer products

    Over 80,000 chemicals are registered under the U.S. Toxic Substances Control Act of 1976, but only a few hundred have been screened for human toxicity. Not even those used in everyday consumer products, and known to have widespread exposure in the general population, have been screened. Toxicity screening is time-consuming, expensive, and complex, because simultaneous or sequential exposure to multiple environmental stressors can affect chemical toxicity. Cumulative risk assessments consider multiple stressors, but it is impractical to test every chemical combination and environmental stressor to which people are exposed.

    The goal of this research is to prioritize the chemical ingredients in consumer products and their most prevalent combinations for risk assessment, based on likely exposure and retention. This work is motivated by two concerns. The first, as noted above, is the vast number of environmental chemicals with unknown toxicity. Our body burden (or chemical load) is much greater today than a century ago. The second motivating concern is the mounting evidence that many of these chemicals are potentially harmful. This makes us the unwitting participants in a vast, uncontrolled biochemistry experiment.

    An informatics approach is developed here that uses publicly available data to estimate chemical exposure from everyday consumer products, which account for a significant proportion of overall chemical load. Several barriers have to be overcome for this approach to be effective. First, a structured database of consumer products has to be created. Even though such data is largely public, it is not readily available or easily accessible. The requisite consumer product information is retrieved from online retailers. The resulting database contains brand, name, ingredients, and category for tens of thousands of unique products. Second, chemical nomenclature is often ambiguous. Synonymy (i.e., different names for the same chemical) and homonymy (i.e., the same name for different chemicals) are rampant. The PubChem Compound database, and to a lesser extent the Unified Medical Language System, are used to map chemicals to unique identifiers. Third, lists of toxicologically interesting chemicals have to be compiled. Fortunately, several authoritative bodies (e.g., the U.S. Environmental Protection Agency) publish lists of suspected harmful chemicals to be prioritized for risk assessment. Fourth, tabulating the mere presence of potentially harmful chemicals and their co-occurrence within consumer product formulations is not as interesting as quantifying likely exposure based on consumer usage patterns and product usage modes, so product usage patterns from actual consumers are required. A suitable dataset is obtained from the Kantar Worldpanel, a market analysis firm that tracks consumer behavior. Finally, a computationally feasible probabilistic approach has to be developed to estimate likely exposure and retention for individual chemicals and their combinations. Exposure is defined here as the presence of a chemical in a product used by a consumer; retention is exposure combined with the relative likelihood that the chemical will be absorbed by the consumer, based on a product's usage mode (e.g., whether the product is rinsed off or left on after use).

    The results of four separate analyses are presented here to show the efficacy of the informatics approach. The first is a proof of concept demonstrating that the first two barriers, creating the consumer product database and dealing with chemical synonymy and homonymy, can be overcome, and that the resulting system can measure the per-product prevalence of a small set of target chemicals (55 asthma-associated and endocrine-disrupting compounds) and their combinations. A database of 38,975 distinct consumer products and 32,231 distinct ingredient names was created by scraping Drugstore.com, an online retailer. Nearly one-third of the products (11,688 products, 30%) contained ≥1 target chemical, and 5,229 products (13%) contained >1. Of the 55 target chemicals, 31 (56%) appear in ≥1 product and 19 (35%) appear under more than one name. The most frequent 3-way chemical combination (2-phenoxyethanol, methylparaben, and ethylparaben) appears in 1,059 products. The second analysis demonstrates that the informatics approach can scale to several thousand target chemicals (11,964 environmental chemicals compiled from five authoritative lists); it repeats the proof of concept using a larger product sample (55,209 consumer products). In the third analysis, product usage patterns and usage modes are incorporated. This analysis yields unbiased, rational prioritizations of potentially hazardous chemicals and chemical combinations, based on their prevalence within a subset of the product sample (29,814 personal care products), combined exposure from multiple products based on actual consumer behavior, and likely chemical retention based on product usage modes. High-ranking chemicals, and combinations thereof, include glycerol; octamethyltrisiloxane; citric acid; titanium dioxide; 1,2-propanediol; octadecan-1-ol; saccharin; hexitol; limonene; linalool; vitamin E; and 2-phenoxyethanol. The fourth analysis is the same as the third, except that each authoritative list is prioritized individually for side-by-side comparison.

    The informatics approach is a viable and rational way to prioritize chemicals and chemical combinations for risk assessment based on near-field exposure and retention. Compared to spectrographic approaches to chemical detection, the informatics approach has the advantage of a larger product sample, so it often detects chemicals that are missed during spectrographic analysis. However, the informatics approach is limited to the chemicals that are actually listed on product labels. Manufacturers are not required to specify the chemicals in fragrance or flavor mixtures, so the presence of some chemicals may be underestimated. Likewise, chemicals that are not part of the product formulation (e.g., chemicals leached from packaging, degradation byproducts) cannot be detected. Therefore, spectrographic and informatics approaches are complementary.
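
    The core of the prioritization pipeline can be sketched compactly: normalize ingredient names to canonical identifiers (standing in for the PubChem lookup), accumulate retention-weighted occurrences of chemicals and their pairwise combinations across products, and rank the results. All synonym mappings, identifiers, and usage-mode weights below are illustrative assumptions, not values from the dissertation.

```python
from collections import Counter
from itertools import combinations

# Surface name -> canonical identifier (stand-in for PubChem CID lookup).
SYNONYMS = {
    "methylparaben": "CID_7456",
    "methyl paraben": "CID_7456",
    "2-phenoxyethanol": "CID_31236",
    "phenoxyethanol": "CID_31236",
    "ethylparaben": "CID_8434",
}

# Assumed usage-mode retention weights (leave-on products retain more).
RETENTION = {"leave_on": 1.0, "rinse_off": 0.1}

def normalise(ingredients):
    """Resolve synonymy/homonymy by mapping names to canonical IDs."""
    return {SYNONYMS[i.lower()] for i in ingredients if i.lower() in SYNONYMS}

def prioritise(products):
    """Score chemicals and 2-way combinations by retention-weighted use."""
    chemical_scores, pair_scores = Counter(), Counter()
    for product in products:
        weight = RETENTION[product["usage_mode"]]
        chems = normalise(product["ingredients"])
        for chem in chems:
            chemical_scores[chem] += weight
        for pair in combinations(sorted(chems), 2):
            pair_scores[pair] += weight
    return chemical_scores, pair_scores

if __name__ == "__main__":
    catalogue = [
        {"ingredients": ["Methyl Paraben", "Phenoxyethanol"],
         "usage_mode": "leave_on"},
        {"ingredients": ["methylparaben", "Ethylparaben"],
         "usage_mode": "rinse_off"},
    ]
    chems, pairs = prioritise(catalogue)
    print(chems.most_common(3), pairs.most_common(1))
```

    Note how the synonym map credits "Methyl Paraben" and "methylparaben" to the same identifier, which is exactly the synonymy problem the second barrier describes.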

    A tool based on standardized terminologies for the semantic annotation of textual information

    The objective of this thesis is the design and implementation of lexical, syntactic, and semantic techniques that make the most of the available knowledge resources in order to improve the extraction and analysis of the relevant information contained in scientific publications.