26 research outputs found

    Ambiguity of human gene symbols in LocusLink and MEDLINE: creating an inventory and a disambiguation test collection

    Genes are discovered almost daily, and new names have to be found for them. Although there are guidelines for gene nomenclature, the naming process is highly creative. Human genes are often named with a gene symbol and a longer, more descriptive term; the short form is very often an abbreviation of the long form. Abbreviations in biomedical language are highly ambiguous, i.e., one gene symbol often refers to more than one gene. Using an existing abbreviation expansion algorithm, we explore MEDLINE for the use of human gene symbols derived from LocusLink. Just over 40% of these symbols occur in MEDLINE; however, many of these occurrences are not related to genes. In the process of making an inventory, a disambiguation test collection is constructed automatically.

    Using contextual queries

    Search engines generally treat search requests in isolation. The results for a given query are identical, independent of the user, or the context in which the user made the request. An approach is demonstrated that explores implicit contexts as obtained from a document the user is reading. The approach inserts into an original (web) document functionality to directly activate context driven queries that yield related articles obtained from various information sources

    The strength of co-authorship in gene name disambiguation

    Background: A biomedical entity mention in articles and other free texts is often ambiguous. For example, 13% of the gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD), a special case of Word Sense Disambiguation (WSD), is to assign a unique gene identifier to every identified gene name alias in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. Here we examine, through graph-based semi-supervised methods, how the GSD task can exploit one special feature of biological articles: the authors of the documents are known.
    Results: Our key hypothesis is that a biologist refers to each particular gene by a fixed gene alias and that this holds for the co-authors as well. To make use of the co-authorship information we built the inverse co-author graph on MEDLINE abstracts. The nodes of the inverse co-author graph are articles, and there is an edge between two nodes if and only if the two articles share an author. We introduce two methods that use graph-based distances between abstracts for the GSD task. We found that a disambiguation decision can be made in 85% of cases with an extremely high (99.5%) precision rate using only information obtained from the inverse co-author graph. We incorporated the co-authorship information into two GSD systems in order to attain full coverage, and in experiments our procedure achieved precision of 94.3%, 98.85%, 96.05% and 99.63% on the human, mouse, fly and yeast GSD evaluation sets, respectively.
    Conclusion: Based on the promising results obtained so far, we suggest that the co-authorship information and the circumstances of an article's release (such as the title of the journal and the year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles. The methods introduced here should therefore be useful for other biomedical natural language processing tasks (such as organism or target disease detection) as well.
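The inverse co-author graph described in this abstract (articles as nodes, an edge whenever two articles share an author) can be illustrated with a small sketch. The article IDs, author names, and the choice of shortest-path length as the "distance" are assumptions made for illustration; the abstract does not say which graph distances the two methods actually use.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_inverse_coauthor_graph(article_authors):
    """Return {article: set of articles that share at least one author with it}."""
    # Group articles by author so that only articles with a common author are paired.
    articles_by_author = defaultdict(set)
    for article, authors in article_authors.items():
        for author in authors:
            articles_by_author[author].add(article)

    graph = {article: set() for article in article_authors}
    for articles in articles_by_author.values():
        for a, b in combinations(sorted(articles), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def graph_distance(graph, source, target):
    """Shortest-path length in edges (breadth-first search), or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy data: article ID -> set of author names.
toy_articles = {
    "a1": {"Smith", "Lee"},
    "a2": {"Lee", "Kim"},
    "a3": {"Kim"},
    "a4": {"Garcia"},
}
toy_graph = build_inverse_coauthor_graph(toy_articles)
```

In this toy graph, `a1` and `a3` are linked only through `a2`, while `a4` shares no author with any other article and is therefore unreachable.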

    Disclosing ambiguous gene aliases by automatic literature profiling

    Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Centro de Excelência em Bioinformática. Belo Horizonte, MG, Brasil / Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Grupo de Genômica e Biologia Computacional. Belo Horizonte, MG, Brasil
    GlaxoSmithKline Moore Dr. Molecular Discovery Research. Research Triangle Park, NC, USA
    Previous issue date: 2010
    Background: Retrieving pertinent information from biological scientific literature requires cutting-edge text mining methods that can recognize the meaning of the very ambiguous names of biological entities. Aliases of a gene share a common vocabulary in their respective collections of PubMed abstracts. This may be true even when these aliases are not associated with the same subset of documents. This gene-specific vocabulary defines a unique fingerprint that can be used to disclose ambiguous aliases. The present work describes an original method for automatically assessing the ambiguity levels of gene aliases in large gene terminologies based exclusively on the content of their associated literature. The method can deal with the two major problems restricting the usage of current text mining tools: 1) different names associated with the same gene; and 2) one name associated with multiple genes, or even with non-gene entities. Importantly, this method does not require training examples.
    Results: Aliases were considered “ambiguous” when their Jaccard distance to the respective official gene symbol was equal to or greater than the smallest distance between the official gene symbol and one of the three internal controls (randomly picked unrelated official gene symbols). Otherwise, they were assigned the status of “synonyms”. We evaluated the coherence of the results by comparing the frequencies of the official gene symbols in the text corpora retrieved with their respective “synonyms” or “ambiguous” aliases. Official gene symbols were mentioned in the abstract collections of 42% (70/165) of their respective synonyms. No official gene symbol occurred in the abstract collections of any of their respective ambiguous aliases. Overall, querying PubMed with official gene symbols and “synonym” aliases allowed a 3.6-fold increase in the number of unique documents retrieved.
    Conclusions: These results confirm that this method is able to distinguish between synonyms and ambiguous gene aliases based exclusively on their vocabulary fingerprint. The approach we describe could be used to enhance the retrieval of relevant literature related to a gen
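The decision rule stated in the Results (an alias is "ambiguous" when its Jaccard distance to the official symbol is at least the smallest distance between the official symbol and its control symbols) can be sketched as follows. The vocabularies below are hypothetical word sets; how the original work extracts a vocabulary from the PubMed abstracts is not described here and is not modelled.

```python
def jaccard_distance(vocab_a, vocab_b):
    """Jaccard distance between two vocabularies: 1 - |A & B| / |A | B|."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    if not union:
        return 0.0  # two empty vocabularies are treated as identical
    return 1.0 - len(a & b) / len(union)


def classify_alias(alias_vocab, official_vocab, control_vocabs):
    """Label an alias "ambiguous" or "synonym" using the control-based threshold."""
    # Threshold: the smallest distance from the official symbol to any control.
    threshold = min(jaccard_distance(official_vocab, c) for c in control_vocabs)
    distance = jaccard_distance(alias_vocab, official_vocab)
    return "ambiguous" if distance >= threshold else "synonym"


# Hypothetical vocabularies, for illustration only.
official = {"kinase", "apoptosis", "tumor", "p53"}
controls = [
    {"insulin", "glucose", "tumor"},
    {"myelin", "axon"},
    {"collagen", "kinase", "bone"},
]
close_alias = {"kinase", "apoptosis", "tumor", "mdm2"}
unrelated_alias = {"cement", "bridge"}

label_close = classify_alias(close_alias, official, controls)
label_unrelated = classify_alias(unrelated_alias, official, controls)
```

Because the threshold comes from the unrelated control symbols rather than from labelled examples, the rule needs no training data, which matches the claim made in the Background.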

    Spina bifida and genetic factors related to myo-inositol, glucose, and zinc.

    BACKGROUND: Myo-inositol, glucose and zinc and related genetic factors are suggested to be implicated in the etiology of spina bifida. We investigated the biochemical concentrations of these nutrients and polymorphisms in the myo-inositol transporter SLC5A11, myo-inositol synthase ISYNA1, and zinc transporter SLC39A4 in association with spina bifida risk. METHODS: Seventy-six spina bifida triads only were ascertained. In mothers, fathers, and spina bifida children, the polymorphisms determined were SLC5A11 (544C > T), ISYNA1 (1029A > G), and SLC39A4 (1069C > T). Serum myo-inositol and glucose, and red blood cell zinc concentrations were determined in mothers and spina bifida children. Transmission disequilibrium tests (TDT) were applied to determine associations between the polymorphisms and spina bifida. Associations between biochemical values and genotypes were studied by one-way analysis of variance (ANOVA). Interactions between alleles, biochemical values, and environmental factors were analyzed by conditional logistic regression. RESULTS: No association between the SLC5A11, ISYNA1, and SLC39A4 polymorphisms and spina bifida was shown: chi2(SLC5A11) = 0.016, P = 0.90; chi2(ISYNA1) = 1.52, P = 0.22; chi2(SLC39A4) = 0.016, P = 0.90; degrees of freedom (df) = 1. Maternal glucose concentrations were comparable for the SLC5A11 genotypes. Significantly lower myo-inositol concentrations were observed in mothers with SLC5A11 CC-genotype, mean (SD) 14.2 (2.6) micromol/L compared to SLC5A11 TT-genotype, 17.0 (3.4) micromol/L, P G polymorphism on spina bifida risk. CONCLUSION: The combination of maternal glucose G polymorphism protects against spina bifida offspring. Moreover, maternal SLC5A11 544C > T polymorphism contributes to the serum myo-inositol concentration. Larger studies should confirm these findings
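As a sanity check on the chi-square values reported above, the sketch below converts a chi-square statistic with one degree of freedom into a P value and shows the standard McNemar-type TDT statistic. The transmission counts passed to `tdt_chi2` are hypothetical, since the abstract does not report them; only the chi-square values 0.016 and 1.52 come from the text.

```python
import math


def tdt_chi2(transmitted, not_transmitted):
    """Standard TDT statistic (b - c)^2 / (b + c) from heterozygous-parent counts."""
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative transmissions")
    return (transmitted - not_transmitted) ** 2 / total


def chi2_pvalue_df1(chi2):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    # For df = 1, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) with Z standard normal.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Chi-square values taken from the abstract (df = 1).
p_slc5a11 = chi2_pvalue_df1(0.016)
p_isyna1 = chi2_pvalue_df1(1.52)

# Hypothetical transmission counts, for illustration only.
example_chi2 = tdt_chi2(30, 20)
```

Rounded to two decimals, the two computed P values agree with the 0.90 and 0.22 given in the abstract.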