
Application of a New Probabilistic Model for Mining Implicit Associated Cancer Genes from OMIM and Medline

By Shanfeng Zhu, Yasushi Okuno, Gozoh Tsujimoto and Hiroshi Mamitsuka


An important issue in current medical research is to find the genes that are strongly related to an inherited disease. Particular focus is placed on cancer-gene relations, since some types of cancer are inherited. As biomedical databases have grown rapidly in recent years, an informatics approach for predicting such relations from currently available databases is needed. Our objective is to find implicitly associated cancer genes from biomedical databases, including the literature database. Co-occurrence of biological entities has been shown to be a popular and efficient technique in biomedical text mining. We applied a new probabilistic model, called the mixture aspect model (MAM) [48], to combine different types of co-occurrences of genes and cancers derived from Medline and OMIM (Online Mendelian Inheritance in Man). We trained the probability parameters of MAM using a learning method based on the EM (expectation-maximization) algorithm. We examined the performance of MAM by predicting associated cancer-gene pairs. Cross-validation showed that prediction accuracy improved when gene-gene co-occurrences from Medline were added to cancer-gene co-occurrences in OMIM. Further experiments showed that MAM found new cancer-gene relations that are unknown in the literature. Supplementary information can be found at
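To make the modelling idea concrete, the following is a minimal sketch of a plain (single) aspect model fitted by EM to one cancer-gene co-occurrence count matrix. It is not the authors' MAM, which combines several co-occurrence types and whose exact form is not given in this abstract. The function names `fit_aspect_model` and `score`, the toy counts, and the choice of two latent aspects are all assumptions made for illustration.

```python
import random


def fit_aspect_model(counts, n_aspects=2, n_iter=100, seed=0):
    """Fit p(c, g) = sum_z p(z) p(c|z) p(g|z) to a count matrix with EM.

    counts[i][j] is the (hypothetical) number of times cancer i and gene j
    co-occur. Returns (p_z, p_c_given_z, p_g_given_z).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(counts), len(counts[0])

    def random_dist(n):
        # Strictly positive random weights, normalised to sum to 1.
        w = [rng.random() + 0.1 for _ in range(n)]
        s = sum(w)
        return [x / s for x in w]

    p_z = random_dist(n_aspects)
    p_c = [random_dist(n_rows) for _ in range(n_aspects)]
    p_g = [random_dist(n_cols) for _ in range(n_aspects)]

    for _ in range(n_iter):
        # Expected counts accumulated over all observed pairs.
        new_z = [0.0] * n_aspects
        new_c = [[0.0] * n_rows for _ in range(n_aspects)]
        new_g = [[0.0] * n_cols for _ in range(n_aspects)]
        for i in range(n_rows):
            for j in range(n_cols):
                n = counts[i][j]
                if n == 0:
                    continue
                # E-step: posterior over the latent aspect z for pair (i, j).
                joint = [p_z[z] * p_c[z][i] * p_g[z][j]
                         for z in range(n_aspects)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_aspects):
                    r = n * joint[z] / total
                    new_z[z] += r
                    new_c[z][i] += r
                    new_g[z][j] += r
        # M-step: re-estimate parameters from the expected counts.
        grand = sum(new_z)
        for z in range(n_aspects):
            if new_z[z] > 0.0:
                p_c[z] = [v / new_z[z] for v in new_c[z]]
                p_g[z] = [v / new_z[z] for v in new_g[z]]
            p_z[z] = new_z[z] / grand
    return p_z, p_c, p_g


def score(model, i, j):
    """Model probability of the pair (cancer i, gene j)."""
    p_z, p_c, p_g = model
    return sum(p_z[z] * p_c[z][i] * p_g[z][j] for z in range(len(p_z)))


# Toy data (invented): two groups of cancers, each co-occurring with its own genes.
toy_counts = [
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
]
model = fit_aspect_model(toy_counts)
ranked = sorted(range(4), key=lambda j: score(model, 0, j), reverse=True)
print("genes ranked for cancer 0:", ranked)
```

Ranking candidate genes by the fitted pair probability is one way such a model can surface pairs that share a latent aspect even when their observed count is low. Combining a second matrix, such as gene-gene co-occurrences, would require the mixture formulation described in the paper itself.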

Topics: Original Research
Publisher: Libertas Academica
Provided by: PubMed Central
