New Methodology for Measuring Semantic Functional Similarity Based on Bidirectional Integration

Abstract

1.2 billion users in facebook, 17 million articles in Wikipedia, and 190 million tweets per day have demanded significant increase of information processing through Internet in recent years. Similarly life sciences and bioinformatics also have faced issues of processing Big data due to the explosion of publicly available genomic information resulted from the Human Genome Project (HGP) and the increasing usage of high throughput technology. HGP was completed in 2003 and resulted in identifying 20,000-25,000 genes in human DNA and determining the sequences of three billion human base pairs. The information requires huge amount of data storage and becomes difficult to process using on-hand database management tools or traditional data processing applications. This thesis introduces new method, Biological and Statistical Mean (BSM) score to calculate functional similarity between gene products (GPs) that can help to extract biologically relevant and statistically robust information from large-scale biomedical, genomic and proteomic data sources. BSM score is defined by 16 different scoring matrices derived from principles of multi-view learning in machine learning algorithm and five different databases including Gene Ontology, UniProt, SCOP, CATH, and KUPS. The proposed method also shows how diverse databases and principles in machine learning theory can be integrated into a simple scoring function, and how the simple concept can give significant impact on the studies in biomedical and human life sciences. The comprehensive evaluations and performance comparisons with other conventional methods show that BSM score clearly outperforms other methods in terms of sensitivity of clustering similarity functional groups and coverage of identifying related genes. As a part of potential applications handling large amount of diverse data sources in medical domain, this thesis introduces similarity-based drug target identification and disease networks using BSM scores. Application of BSM score is freely available through http://www.ittc.ku.edu/chenlab/goal

    Similar works