Ambiguity of human gene symbols in LocusLink and MEDLINE: creating an inventory and a disambiguation test collection
Genes are discovered almost on a daily basis and new names have to be
found. Although there are guidelines for gene nomenclature, the naming
process is highly creative. Human genes are often named with a gene symbol
and a longer, more descriptive term; the short form is very often an
abbreviation of the long form. Abbreviations in biomedical language are
highly ambiguous, i.e., one gene symbol often refers to more than one
gene. Using an existing abbreviation expansion algorithm, we explore MEDLINE
for the use of human gene symbols derived from LocusLink. It turns out
that just over 40% of these symbols occur in MEDLINE; however, many of
these occurrences are not related to genes. In the process of making an
inventory, a disambiguation test collection is constructed automatically.
Using contextual queries
Search engines generally treat search requests in isolation. The results
for a given query are identical, independent of the user, or the context
in which the user made the request. An approach is demonstrated that
explores implicit contexts as obtained from a document the user is
reading. The approach inserts into an original (web) document
functionality to directly activate context driven queries that yield
related articles obtained from various information sources.
The strength of co-authorship in gene name disambiguation
Background
A biomedical entity mention in articles and other free texts is often ambiguous. For example, 13% of the gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD), a special case of Word Sense Disambiguation (WSD), is to assign a unique gene identifier for all identified gene name aliases in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. We examine here the utilisation potential of the fact (one of the special features of biological articles) that the authors of the documents are known through graph-based semi-supervised methods for the GSD task.
Results
Our key hypothesis is that a biologist refers to each particular gene by a fixed gene alias and this holds for the co-authors as well. To make use of the co-authorship information we decided to build the inverse co-author graph on MedLine abstracts. The nodes of the inverse co-author graph are articles and there is an edge between two nodes if and only if the two articles have a mutual author. We introduce here two methods using distances (based on the graph) of abstracts for the GSD task. We found that a disambiguation decision can be made in 85% of cases with an extremely high (99.5%) precision rate just by using information obtained from the inverse co-author graph. We incorporated the co-authorship information into two GSD systems in order to attain full coverage and in experiments our procedure achieved precision of 94.3%, 98.85%, 96.05% and 99.63% on the human, mouse, fly and yeast GSD evaluation sets, respectively.
Conclusion
Based on the promising results obtained so far we suggest that the co-authorship information and the circumstances of the articles' release (like the title of the journal, the year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles and hence the methods introduced here should be useful for other biomedical natural language processing tasks (like organism or target disease detection) as well.
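The inverse co-author graph described in this abstract (articles as nodes, an edge whenever two articles share an author) can be illustrated with a minimal Python sketch. The article identifiers, author names, and the function names `build_inverse_coauthor_graph` and `graph_distance` are hypothetical, and the breadth-first distance is only one plausible reading of "distances (based on the graph)"; the abstract does not specify the two methods the authors actually used.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_inverse_coauthor_graph(article_authors):
    """Return {article: set of articles sharing at least one author}.

    article_authors maps an article id to an iterable of author names.
    """
    articles_by_author = defaultdict(set)
    for article, authors in article_authors.items():
        for author in authors:
            articles_by_author[author].add(article)

    graph = {article: set() for article in article_authors}
    for articles in articles_by_author.values():
        # Every pair of articles by the same author gets an edge.
        for a, b in combinations(articles, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def graph_distance(graph, source, target):
    """Shortest path length between two articles (BFS); None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Toy data (hypothetical): four abstracts and their author lists.
toy_articles = {
    "A1": ["Smith", "Lee"],
    "A2": ["Lee", "Kovacs"],
    "A3": ["Kovacs"],
    "A4": ["Ng"],
}
toy_graph = build_inverse_coauthor_graph(toy_articles)
```

In this toy graph, `A1` and `A2` are adjacent through a shared author, `A1` and `A3` are two steps apart, and `A4` is isolated, so no graph-based decision could be made for it. That last case mirrors why the abstract reports coverage below 100% for the graph-only approach.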
Disclosing ambiguous gene aliases by automatic literature profiling
Submitted by Nuzia Santos ([email protected]) on 2015-01-14T10:55:18Z
Previous issue date: 2010
Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Centro de Excelência em Bioinformática. Belo Horizonte, MG, Brasil / Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Grupo de Genômica e Biologia Computacional. Belo Horizonte, MG, Brasil
GlaxoSmithKline Moore Dr. Molecular Discovery Research. Research Triangle Park, NC, USA
Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Centro de Excelência em Bioinformática. Belo Horizonte, MG, Brasil / Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Grupo de Genômica e Biologia Computacional. Belo Horizonte, MG, Brasil
Background
Retrieving pertinent information from biological scientific literature requires cutting-edge text mining methods that can recognize the meaning of the highly ambiguous names of biological entities. Aliases of a gene share a common vocabulary in their respective collections of PubMed abstracts. This may be true even when these aliases are not associated with the same subset of documents. This gene-specific vocabulary defines a unique fingerprint that can be used to disclose ambiguous aliases. The present work describes an original method for automatically assessing the ambiguity levels of gene aliases in large gene terminologies based exclusively on the content of their associated literature. The method can deal with the two major problems restricting the usage of current text mining tools: 1) different names associated with the same gene; and 2) one name associated with multiple genes, or even with non-gene entities. Importantly, this method does not require training examples.
Results
Aliases were considered "ambiguous" when their Jaccard distance to the respective official gene symbol was equal to or greater than the smallest distance between the official gene symbol and one of the three internal controls (randomly picked unrelated official gene symbols). Otherwise, they were assigned the status of "synonyms". We evaluated the coherence of the results by comparing the frequencies of the official gene symbols in the text corpora retrieved with their respective "synonyms" or "ambiguous" aliases. Official gene symbols were mentioned in the abstract collections of 42% (70/165) of their respective synonyms. No official gene symbol occurred in the abstract collections of any of their respective ambiguous aliases. Overall, querying PubMed with official gene symbols and "synonym" aliases allowed a 3.6-fold increase in the number of unique documents retrieved.
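The decision rule stated above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: each vocabulary fingerprint is assumed to be a plain set of terms, the function names `jaccard_distance` and `classify_alias` are hypothetical, and the toy vocabularies are invented for the example.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two term sets: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty vocabularies are treated as identical
    return 1.0 - len(a & b) / len(union)


def classify_alias(alias_vocab, official_vocab, control_vocabs):
    """Label an alias 'ambiguous' or 'synonym' using the rule in the text.

    The threshold is the smallest distance between the official symbol
    and any of the internal controls (unrelated official gene symbols).
    """
    threshold = min(jaccard_distance(official_vocab, c) for c in control_vocabs)
    distance = jaccard_distance(alias_vocab, official_vocab)
    return "ambiguous" if distance >= threshold else "synonym"


# Toy vocabularies (hypothetical terms, for illustration only).
official = {"kinase", "tumor", "apoptosis", "p53"}
controls = [
    {"neuron", "synapse", "tumor"},
    {"lipid", "liver"},
    {"kinase", "muscle", "apoptosis", "heart"},
]
close_alias = {"kinase", "tumor", "apoptosis", "cell"}
distant_alias = {"cat", "feline", "pet"}

label_close = classify_alias(close_alias, official, controls)
label_distant = classify_alias(distant_alias, official, controls)
```

Here the closest control shares two of six terms with the official symbol, so the threshold is a distance of about 0.67. The first alias falls below it and is labelled a synonym; the second shares no terms and is labelled ambiguous. Using the closest control as the cut-off means no labelled training examples are needed, which is consistent with the claim in the Background.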
Conclusions
These results confirm that this method is able to distinguish between synonyms and ambiguous gene aliases based exclusively on their vocabulary fingerprint. The approach we describe could be used to enhance the retrieval of relevant literature related to a gen
A handbook supported by a web site to facilitate distance learning of medical informatics
Spina bifida and genetic factors related to myo-inositol, glucose, and zinc.
Contains fulltext: 57961.pdf (publisher's version) (Closed access)
BACKGROUND: Myo-inositol, glucose and zinc and related genetic factors are suggested to be implicated in the etiology of spina bifida. We investigated the biochemical concentrations of these nutrients and polymorphisms in the myo-inositol transporter SLC5A11, myo-inositol synthase ISYNA1, and zinc transporter SLC39A4 in association with spina bifida risk.
METHODS: Seventy-six spina bifida triads only were ascertained. The polymorphisms determined in mothers, fathers, and spina bifida children were SLC5A11 (544C > T), ISYNA1 (1029A > G), and SLC39A4 (1069C > T). Serum myo-inositol and glucose, and red blood cell zinc concentrations were determined in mothers and spina bifida children. Transmission disequilibrium tests (TDT) were applied to determine associations between the polymorphisms and spina bifida. Associations between biochemical values and genotypes were studied by one-way analysis of variance (ANOVA). Interactions between alleles, biochemical values, and environmental factors were analyzed by conditional logistic regression.
RESULTS: No association between SLC5A11, ISYNA1, and SLC39A4 and spina bifida was shown: chi2 (SLC5A11) = 0.016, P = 0.90; chi2 (ISYNA1) = 1.52, P = 0.22; chi2 (SLC39A4) = 0.016, P = 0.90; degrees of freedom (df) = 1. Maternal glucose concentrations were comparable for the SLC5A11 genotypes. Significantly lower myo-inositol concentrations were observed in mothers with SLC5A11 CC-genotype, mean (SD) 14.2 (2.6) micromol/L, compared to SLC5A11 TT-genotype, 17.0 (3.4) micromol/L, P G polymorphism on spina bifida risk.
CONCLUSION: The combination of maternal glucose G polymorphism protects against spina bifida offspring. Moreover, maternal SLC5A11 544C > T polymorphism contributes to the serum myo-inositol concentration. Larger studies should confirm these findings.
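As a rough illustration of the transmission disequilibrium test named in the METHODS, the sketch below uses the standard McNemar-type TDT statistic for a biallelic marker, (b - c)^2 / (b + c), where b and c count how often heterozygous parents transmit or do not transmit a given allele, with 1 degree of freedom. The function names and the transmission counts in the example are hypothetical; the abstract does not report the underlying counts, and whether the authors used exactly this form is an assumption.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """McNemar-type TDT chi-square from heterozygous-parent transmissions."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    return (transmitted - untransmitted) ** 2 / total


def chi2_pvalue_df1(chi2):
    """Upper-tail probability for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts: an allele transmitted 30 times and not transmitted 20 times.
example_chi2 = tdt_statistic(30, 20)

# P values recomputed from the chi-square values reported in the RESULTS.
for reported_chi2 in (0.016, 1.52):
    print(f"chi2 = {reported_chi2}: P = {chi2_pvalue_df1(reported_chi2):.2f}")
```

With df = 1, the reported chi-square values of 0.016 and 1.52 give P values of about 0.90 and 0.22, which agrees with the figures quoted in the RESULTS.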