
    Disclosing ambiguous gene aliases by automatic literature profiling

    Submitted by Nuzia Santos on 2015-01-14T10:55:18Z. Approved for entry into archive by Nuzia Santos on 2015-01-14T10:55:25Z (GMT) and on 2015-01-14T11:01:59Z (GMT). Made available in DSpace on 2015-01-14T11:01:59Z (GMT). No. of bitstreams: 1. Disclosing ambiguous gene aliases by automatic.pdf: 217573 bytes, checksum: ce54aa2c4ea49eb989f9e7308d827ce6 (MD5). Previous issue date: 2010.

    Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Centro de Excelência em Bioinformática. Belo Horizonte, MG, Brasil / Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Grupo de Genômica e Biologia Computacional. Belo Horizonte, MG, Brasil
    GlaxoSmithKline Moore Dr. Molecular Discovery Research. Research Triangle Park, NC, USA
    Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Centro de Excelência em Bioinformática. Belo Horizonte, MG, Brasil / Fundação Oswaldo Cruz. Centro de Pesquisa René Rachou. Grupo de Genômica e Biologia Computacional. Belo Horizonte, MG, Brasil

Background: Retrieving pertinent information from the biological scientific literature requires cutting-edge text mining methods that may be able to recognize the meaning of the highly ambiguous names of biological entities. Aliases of a gene share a common vocabulary in their respective collections of PubMed abstracts. This may be true even when these aliases are not associated with the same subset of documents.
This gene-specific vocabulary defines a unique fingerprint that can be used to disclose ambiguous aliases. The present work describes an original method for automatically assessing the ambiguity levels of gene aliases in large gene terminologies based exclusively on the content of their associated literature. The method can deal with the two major problems restricting the usage of current text mining tools: 1) different names associated with the same gene; and 2) one name associated with multiple genes, or even with non-gene entities. Importantly, this method does not require training examples.

Results: Aliases were considered “ambiguous” when their Jaccard distance to the respective official gene symbol was equal to or greater than the smallest distance between the official gene symbol and one of the three internal controls (randomly picked unrelated official gene symbols). Otherwise, they were assigned the status of “synonyms”. We evaluated the coherence of the results by comparing the frequencies of the official gene symbols in the text corpora retrieved with their respective “synonyms” or “ambiguous” aliases. Official gene symbols were mentioned in the abstract collections of 42% (70/165) of their respective synonyms. No official gene symbol occurred in the abstract collections of any of their respective ambiguous aliases. Overall, querying PubMed with official gene symbols and “synonym” aliases allowed a 3.6-fold increase in the number of unique documents retrieved.

Conclusions: These results confirm that this method is able to distinguish between synonyms and ambiguous gene aliases based exclusively on their vocabulary fingerprint. The approach we describe could be used to enhance the retrieval of relevant literature related to a gen
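
The decision rule stated in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the vocabulary fingerprints are built, so the sketch assumes each fingerprint is simply a set of terms, and the function names (`jaccard_distance`, `classify_alias`) and the toy vocabularies are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| (taken as 0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def classify_alias(alias_vocab: set, official_vocab: set, control_vocabs: list) -> str:
    """Label an alias 'ambiguous' or 'synonym' using the rule from the abstract.

    The threshold is the smallest distance between the official gene symbol
    and any of the internal controls (unrelated official gene symbols).
    """
    threshold = min(jaccard_distance(official_vocab, c) for c in control_vocabs)
    distance = jaccard_distance(alias_vocab, official_vocab)
    return "ambiguous" if distance >= threshold else "synonym"


# Toy vocabularies, invented for illustration only (assumption: sets of terms).
official = {"kinase", "phosphorylation", "tumor", "apoptosis", "signaling"}
controls = [
    {"ribosome", "translation", "signaling"},
    {"membrane", "transport", "ion"},
    {"chromatin", "histone", "tumor", "kinase"},
]
alias_close = {"kinase", "phosphorylation", "tumor", "apoptosis", "pathway"}
alias_far = {"fish", "river", "signaling"}

label_close = classify_alias(alias_close, official, controls)
label_far = classify_alias(alias_far, official, controls)
```

In this toy case the first alias shares most of its vocabulary with the official symbol and falls below the control threshold, while the second is no closer than an unrelated gene and is flagged as ambiguous. Because the threshold is derived from the controls, no labelled training examples are needed, which matches the claim in the Background.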