Gene Subset Selection Approaches Based on Linear Separability
We address the concept of linear separability of gene expression data sets with respect to two classes, which has been recently studied in the literature. The problem is to efficiently find all pairs of genes which induce a linear separation of the data. We study the Containment Angle (CA), defined on the unit circle for a linearly separating gene-pair (LS-pair), as an alternative to the paired t-test ranking function for gene selection. Using the CA, we also show empirically that a given classifier's error is related to the degree of linear separability of a given data set. Finally, we propose gene subset selection methods that select only among LS-genes and LS-pairs, based on the CA ranking function for LS-pairs and a ranking function for linearly separating genes (LS-genes). Overall, our proposed methods give better results in terms of subset sizes and classification accuracy when compared to well-performing methods, on many gene expression data sets.
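As an illustration of what testing a gene pair for linear separability can look like, here is a minimal brute-force sketch. It assumes a standard geometric fact: two finite point sets in the plane are strictly linearly separable exactly when all difference vectors b - a (a from one class, b from the other) fit inside an open half-circle of directions. The function names, the 0/1 label encoding, and the choice to sort pairs by the width of the covering arc are assumptions made for this example. The arc computed here is in the spirit of an angle on the unit circle, but it is not claimed to match the paper's exact definition of the Containment Angle, and this is not the efficient enumeration the abstract refers to.

```python
import math
from itertools import combinations


def containing_arc(class_a, class_b):
    """Width (radians) of the smallest arc covering the directions of all
    difference vectors b - a. Both classes must be non-empty.
    Returns None if a point of class_a coincides with a point of class_b."""
    angles = []
    for ax, ay in class_a:
        for bx, by in class_b:
            dx, dy = bx - ax, by - ay
            if dx == 0 and dy == 0:
                return None  # coincident points can never be separated
            angles.append(math.atan2(dy, dx))
    angles.sort()
    # Gaps between consecutive directions, plus the wrap-around gap.
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(angles[0] + 2 * math.pi - angles[-1])
    # The smallest covering arc is the full circle minus the largest gap.
    return 2 * math.pi - max(gaps)


def is_ls_pair(class_a, class_b, eps=1e-12):
    """True if the two 2-D point sets are strictly linearly separable."""
    arc = containing_arc(class_a, class_b)
    return arc is not None and arc < math.pi - eps


def ls_pairs(X, y):
    """Brute-force enumeration over all gene pairs.
    X: list of samples (rows), each a list of expression values.
    y: list of 0/1 class labels (assumed encoding)."""
    found = []
    for g, h in combinations(range(len(X[0])), 2):
        a = [(row[g], row[h]) for row, lab in zip(X, y) if lab == 0]
        b = [(row[g], row[h]) for row, lab in zip(X, y) if lab == 1]
        if is_ls_pair(a, b):
            found.append((g, h, containing_arc(a, b)))
    # Narrower arc first: one plausible ranking, chosen for illustration.
    return sorted(found, key=lambda t: t[2])
```

Calling `ls_pairs(X, y)` on a samples-by-genes matrix returns the separating pairs with their arc widths. The cost is quadratic in the number of genes and in the number of cross-class sample pairs, so it is only practical for small examples.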
Machine learning and soft computing approaches to microarray differential expression analysis and feature selection.
Differential expression analysis and feature selection are central to gene expression microarray data analysis. Standard approaches are flawed by the arbitrary assignment of cut-off parameters and the inability to adapt to the particular data set under analysis. Presented in this thesis are three novel approaches to microarray data feature selection and differential expression analysis based on various machine learning and soft computing paradigms. The first approach uses a Separability Index to select ranked genes, making gene selection less arbitrary and more data intrinsic. The second approach is a novel gene ranking system, the Fuzzy Gene Filter, which provides a more holistic and adaptive approach to ranking genes. The third approach is based on a Stochastic Search paradigm and uses the Population Based Incremental Learning algorithm to identify an optimal gene set with maximum inter-class distinction.
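To make the third approach more concrete, the following is a minimal sketch of Population Based Incremental Learning searching over gene subsets. The fitness function is an assumption: it uses a nearest-neighbour style separability index (the fraction of samples whose nearest neighbour on the selected genes has the same class), which may differ from the index defined in the thesis. The parameter values (`pop_size`, `generations`, `lr`) and all function names are illustrative choices, and the mutation step often used with this algorithm is omitted for brevity.

```python
import random


def separability_index(X, y, mask):
    """Assumed fitness: fraction of samples whose nearest neighbour
    (squared Euclidean distance on the selected genes) shares their class."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty gene set carries no information
    hits = 0
    for i, xi in enumerate(X):
        best_k, best_d = None, float("inf")
        for k, xk in enumerate(X):
            if k == i:
                continue
            d = sum((xi[c] - xk[c]) ** 2 for c in cols)
            if d < best_d:
                best_k, best_d = k, d
        hits += y[best_k] == y[i]
    return hits / len(X)


def pbil(X, y, pop_size=20, generations=30, lr=0.1, seed=0):
    """Basic PBIL loop: sample bit masks from a probability vector, then
    shift the vector toward the best mask of each generation."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    p = [0.5] * n_genes  # probability that each gene is selected
    best_mask, best_fit = None, -1.0
    for _ in range(generations):
        population = [
            [1 if rng.random() < pj else 0 for pj in p]
            for _ in range(pop_size)
        ]
        gen_best = max(population, key=lambda m: separability_index(X, y, m))
        gen_fit = separability_index(X, y, gen_best)
        if gen_fit > best_fit:
            best_mask, best_fit = gen_best, gen_fit
        # Move each probability a small step toward the generation's best bit.
        p = [(1 - lr) * pj + lr * bit for pj, bit in zip(p, gen_best)]
    return best_mask, best_fit
```

`pbil(X, y)` returns the best gene mask seen and its fitness. Because the probability vector replaces an explicit population, the only state carried between generations is one number per gene.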
All three approaches were implemented and tested on a number of data sets, and the results were compared to those of standard approaches. The Separability Index approach attained a K-Nearest Neighbour classification accuracy of 92%, outperforming the standard approach, which attained an accuracy of 89.6%. The gene list identified also displayed significant functional enrichment. The Fuzzy Gene Filter also outperformed standard approaches, attaining significantly higher accuracies for all of the classifiers tested, on both data sets (p < 0.0231 for the prostate data set and p < 0.1888 for the lymphoma data set). Population Based Incremental Learning outperformed the Genetic Algorithm, identifying a maximum Separability Index of 97.04% (as opposed to 96.39%).
Future developments include incorporating biological knowledge when ranking genes using the Fuzzy Gene Filter, as well as incorporating a functional enrichment assessment in the fitness function of the Population Based Incremental Learning algorithm.
Geometric Optimization in Some Proximity and Bioinformatics Problems
The theme of this dissertation is geometric optimization and its applications. We study geometric proximity problems and several bioinformatics problems with a geometric content, requiring the use of geometric optimization tools. We have investigated the following type of proximity problem: given a set of n distinct points in the plane, for each point in the set, find a pair of points from the remaining points such that the three points either maximize or minimize some geometric measure defined on them. The measures include (a) sum and product; (b) difference; (c) line-distance; (d) triangle area; (e) triangle perimeter; (f) circumcircle-radius; and (g) triangle-distance in three dimensions. We have also studied the application of a linear-time incremental geometric algorithm to test the linear separability of a set of blue points from a set of red points, in two- and three-dimensional Euclidean spaces. We have used this geometric separability tool on four different gene expression data sets, enumerating gene-pairs and gene-triplets that are linearly separable. Pushing further, we have exploited this novel tool to identify some biomarker genes for a classifier. The gene selection method proposed in the dissertation exhibits good classification accuracy compared to other known feature (or gene) selection methods such as t-values, FCS (Fisher Criterion Score) and SAM (Significance Analysis of Microarrays). Continuing this line of investigation, we have also designed an efficient algorithm to find the minimum number of outliers when the red and blue point sets are not fully linearly separable. We have also explored the applicability of geometric optimization techniques to the problem of protein structure similarity. We have come up with two new algorithms, EDAlignres and EDAlignsse, for pairwise protein structure alignment.
EDAlignres identifies the best structural alignment of two equal-length proteins by refining the correspondence obtained from eigendecomposition and maximizing the similarity measure for the refined correspondence. EDAlignsse, on the other hand, does not require the input proteins to be of equal length. Both have been fully implemented and tested against well-established protein alignment programs.
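The proximity problem stated in this abstract can be made concrete with a small brute-force baseline for one of the listed measures, triangle area. This sketch is only a reference point for understanding the problem statement. It checks every pair of remaining points for each point, so it takes cubic time overall, and it is not one of the dissertation's algorithms. The function names are chosen for this example.

```python
from itertools import combinations


def triangle_area(p, q, r):
    """Area of triangle pqr from the cross product of two edge vectors."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    return abs(cross) / 2.0


def extreme_area_partners(points, maximize=True):
    """For each point, find the pair among the remaining points that gives
    the largest (or smallest) triangle area. Needs at least three points.
    Returns one ((q, r), area) entry per input point, in input order."""
    pick = max if maximize else min
    result = []
    for i, p in enumerate(points):
        others = [pt for k, pt in enumerate(points) if k != i]
        q, r = pick(
            combinations(others, 2),
            key=lambda pair: triangle_area(p, pair[0], pair[1]),
        )
        result.append(((q, r), triangle_area(p, q, r)))
    return result
```

Other measures from the list, such as triangle perimeter, can be tried by swapping in a different measure function in place of `triangle_area`.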