New statistical tools for microarray data and comparison with existing tools

Abstract

Microarray technologies have attracted tremendous interest from researchers in recent years. The problem we are interested in is how to combine two microarray data sets that have systematic batch differences. The motivation for combining them is that the combined data set contains more samples, which gives improved statistical power. This dissertation covers two topics in microarray batch adjustment.

The first topic is the visualization of paired High Dimension Low Sample Size (HDLSS) data. We propose two directions: the Canonical Parallel Direction (CPD) and the Canonical Orthogonal Direction (COD). This pair of directions gives an insightful 2-d parallel view for understanding paired HDLSS data sets. The CPD can also be used for adjusting the batch differences. An application to the NCI60 cell lines data shows good performance of this method.

The second topic is a comparison of three commonly used batch adjustment methods: the Support Vector Machine (SVM), Distance Weighted Discrimination (DWD), and Prediction Analysis of Microarray (PAM). We show that SVM has some serious problems for HDLSS data, and that DWD is much more robust than PAM under the Unbalanced Subgroup Model.

The mathematical studies in this dissertation are in the area of HDLSS asymptotics, in the sense that the sample sizes are fixed and the dimension (the number of genes) goes to infinity. Hall et al. (2004) studied the geometric structure of the data when the dimension is high. In this dissertation, we study the geometric structure of the data under more complicated models. In the first topic, we give conditions for the consistency and the strong inconsistency of the CPD under the Linear Shift Model, which reflects the effects of systematic biases and random measurement errors. In the second topic, we compare PAM and DWD under the Unbalanced Subgroup Model. Both methods are biased when the dimension goes to infinity; however, DWD is shown to be consistently more robust than PAM, and we quantify the bias of each method.

Keywords: Microarray Batch Adjustment, Principal Component Analysis, Exploratory Data Analysis, High Dimension Low Sample Size Data Analysis, Data Discrimination Methods, Distance Weighted Discrimination, Support Vector Machine, Prediction Analysis of Microarray, High Dimension Asymptotics
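
The HDLSS asymptotic regime referred to in the abstract (fixed sample size, growing dimension) can be illustrated with a small simulation. The sketch below is not taken from the dissertation and does not implement the Linear Shift Model or the Unbalanced Subgroup Model; it assumes the simplest setting, i.i.d. standard Gaussian coordinates, in which pairwise distances divided by the square root of the dimension concentrate around the square root of 2. The function names `scaled_pairwise_distances` and `relative_spread` are hypothetical and introduced only for this illustration.

```python
import math
import random


def scaled_pairwise_distances(n, d, seed=0):
    """Return all pairwise distances between n points in d dimensions,
    each divided by sqrt(d).

    Assumption (illustration only): every coordinate is an independent
    standard Gaussian draw.
    """
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            dists.append(math.sqrt(sq / d))
    return dists


def relative_spread(n, d, seed=0):
    """Gap between the largest and smallest scaled distance,
    relative to the limiting value sqrt(2)."""
    dists = scaled_pairwise_distances(n, d, seed)
    return (max(dists) - min(dists)) / math.sqrt(2)


if __name__ == "__main__":
    # Fixed sample size n = 4, increasing dimension d.
    for d in (10, 1000, 20000):
        print(d, round(relative_spread(4, d), 4))
```

With the sample size held at 4, the printed spread shrinks as the dimension grows, which is the sense in which high-dimensional samples of fixed size take on a nearly rigid geometric structure.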
