ROBUST CROSS-PLATFORM DISEASE PREDICTION USING GENE EXPRESSION MICROARRAYS

Mi, Zhibao

thesis

Authors: Zhibao Mi
Publication date: 29 January 2009

Abstract

Microarray technology has been used to predict patient prognosis and response to treatment. Such predictions are starting to have an impact on disease intervention and control and are significant for public health. However, progress has been hindered by a lack of adequate clinical validation. Because both microarray analyses and clinical trials are time- and effort-intensive, it is crucial to use accumulated inter-study data to validate information from individual studies. For over a decade, microarray data have been accumulated from different technologies. However, using data from one platform to build a model that robustly predicts the clinical characteristics of a new dataset from another platform remains a challenge. Current cross-platform gene prediction methods use only genes common to both training and test datasets. There are two main drawbacks to that approach: model reconstruction and loss of information. As a result, the prediction accuracy of those methods is unstable. In this dissertation, a module-based prediction strategy was developed to overcome these drawbacks. In this method, groups of genes sharing similar expression patterns, rather than individual genes, were used as the basic elements of the model predictor. Such an approach borrows information from the genes' similarity when genes are absent from the test data. By overcoming the problems of missing genes and noise across platforms, this method yielded robust predictions independent of information from the test data. The performance of this method was evaluated using publicly available microarray data. K-means clustering was used to group genes sharing similar expression profiles into gene modules, and small modules were merged into their nearest neighbors. A univariate or multivariate feature selection procedure was applied, and a representative gene from each selected module was identified. A prediction model was then constructed from the representative genes of the selected gene modules.
As a result, the prediction model is portable to any test study as long as at least some of the genes in each module are present in the test study. The newly developed method showed advantages over traditional methods in prediction robustness to gene noise and gene mismatch in inter-study prediction.
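The module-building steps described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the Euclidean distance, the "closest to the module mean" rule for choosing a representative gene, and the fallback of averaging the module members that are present on the test platform are all assumptions made for illustration. Feature selection and the classifier itself are omitted, and the k-means centroids are seeded from named genes only to keep the toy run reproducible.

```python
import math
from collections import Counter, defaultdict
from statistics import mean


def kmeans(profiles, init_genes, n_iter=50):
    """Lloyd's k-means over gene profiles (gene name -> list of expression values).

    Centroids are seeded from the named genes (an illustrative choice).
    """
    centroids = [list(profiles[g]) for g in init_genes]
    assign = {}
    for _ in range(n_iter):
        new_assign = {
            g: min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for g, p in profiles.items()
        }
        if new_assign == assign:  # assignments stopped changing
            break
        assign = new_assign
        for c in range(len(centroids)):
            members = [profiles[g] for g, m in assign.items() if m == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [mean(col) for col in zip(*members)]
    return assign, centroids


def merge_small_modules(assign, centroids, min_size):
    """Fold each module with fewer than min_size genes into the nearest large module."""
    sizes = Counter(assign.values())
    large = [c for c, n in sizes.items() if n >= min_size]
    if not large:  # nothing to merge into
        return dict(assign)
    target = {
        c: min(large, key=lambda big: math.dist(centroids[c], centroids[big]))
        for c, n in sizes.items() if n < min_size
    }
    return {g: target.get(c, c) for g, c in assign.items()}


def group_modules(assign):
    """Turn a gene -> module mapping into module -> list of genes."""
    modules = defaultdict(list)
    for gene, module in assign.items():
        modules[module].append(gene)
    return dict(modules)


def representative_gene(profiles, module_genes):
    """Pick the member closest to the module's mean profile (assumed criterion)."""
    center = [mean(col) for col in zip(*(profiles[g] for g in module_genes))]
    return min(module_genes, key=lambda g: math.dist(profiles[g], center))


def module_value(module_genes, representative, test_sample):
    """Value fed to the model for one module in one test sample.

    Uses the representative gene if the test platform measured it; otherwise
    averages whichever module members are present (assumed fallback rule).
    """
    if representative in test_sample:
        return test_sample[representative]
    present = [test_sample[g] for g in module_genes if g in test_sample]
    if not present:
        raise KeyError("no gene from this module is present in the test sample")
    return mean(present)


if __name__ == "__main__":
    # Toy training data: hypothetical gene names, four samples per gene.
    toy = {
        "g1": [1.0, 2.0, 3.0, 4.0], "g2": [1.1, 2.1, 2.9, 4.2],
        "g3": [0.9, 1.8, 3.1, 3.9], "g4": [4.0, 3.0, 2.0, 1.0],
        "g5": [4.1, 3.2, 1.9, 0.8], "g6": [1.5, 2.5, 3.5, 4.5],
    }
    assign, centroids = kmeans(toy, ["g1", "g4", "g6"])
    modules = group_modules(merge_small_modules(assign, centroids, min_size=2))
    for module_id, genes in modules.items():
        print(module_id, sorted(genes), representative_gene(toy, genes))
```

Because the model only needs one value per module, a test sample that lacks a module's representative gene can still be scored as long as some other member of that module was measured, which is the portability property the abstract describes.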