
    Gene Subset Selection Approaches Based on Linear Separability

    We address the concept of linear separability of gene expression data sets with respect to two classes, which has been recently studied in the literature. The problem is to efficiently find all pairs of genes which induce a linear separation of the data. We study the Containment Angle (CA), defined on the unit circle for a linearly separating gene-pair (LS-pair), as an alternative to the paired t-test ranking function for gene selection. Using the CA, we also show empirically that a given classifier's error is related to the degree of linear separability of a given data set. Finally, we propose gene subset selection methods based on the CA ranking function for LS-pairs and a ranking function for linearly separating genes (LS-genes); these methods select only among LS-genes and LS-pairs. Overall, our proposed methods give better results in terms of subset sizes and classification accuracy than well-performing existing methods on many gene expression data sets.
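As a rough illustration of what "finding all LS-pairs" means, the sketch below checks every gene pair by brute force. It is not the efficient algorithm the abstract refers to; the function names, the `expression` dictionary layout, and the 0/1 labels are assumptions made for this example only.

```python
from itertools import combinations


def separable_2d(class_a, class_b):
    """Return True if some line strictly separates two non-empty 2D point sets.

    Brute force: if the convex hulls are disjoint, a separating direction is
    either the difference of two input points or perpendicular to one.
    """
    points = list(class_a) + list(class_b)
    directions = []
    for (px, py), (qx, qy) in combinations(points, 2):
        dx, dy = qx - px, qy - py
        if dx == 0 and dy == 0:
            continue  # identical points give no direction
        directions.append((dx, dy))    # difference vector
        directions.append((-dy, dx))   # its perpendicular
    for nx, ny in directions:
        proj_a = [nx * x + ny * y for x, y in class_a]
        proj_b = [nx * x + ny * y for x, y in class_b]
        # Strict separation: the projected intervals do not touch.
        if max(proj_a) < min(proj_b) or max(proj_b) < min(proj_a):
            return True
    return False


def ls_pairs(expression, labels):
    """List gene pairs whose 2D expression plot separates class 0 from class 1.

    expression: hypothetical dict mapping gene name -> list of values per sample.
    labels: list of 0/1 class labels, one per sample.
    """
    result = []
    for g1, g2 in combinations(sorted(expression), 2):
        pts = list(zip(expression[g1], expression[g2]))
        a = [p for p, lab in zip(pts, labels) if lab == 0]
        b = [p for p, lab in zip(pts, labels) if lab == 1]
        if separable_2d(a, b):
            result.append((g1, g2))
    return result


# Toy data (made up for illustration): four samples, three genes.
toy_expression = {
    "g1": [1, 2, 5, 6],
    "g2": [3, 1, 2, 4],
    "g3": [1, 5, 2, 6],
}
toy_labels = [0, 0, 1, 1]
toy_ls_pairs = ls_pairs(toy_expression, toy_labels)
```

The check costs cubic time per gene pair, so it is only suitable for small toy inputs; it is meant to make the definition concrete, not to be fast.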

    Machine learning and soft computing approaches to microarray differential expression analysis and feature selection.

    Differential expression analysis and feature selection are central to gene expression microarray data analysis. Standard approaches are flawed by the arbitrary assignment of cut-off parameters and the inability to adapt to the particular data set under analysis. Presented in this thesis are three novel approaches to microarray data feature selection and differential expression analysis based on various machine learning and soft computing paradigms. The first approach uses a Separability Index to select ranked genes, making gene selection less arbitrary and more data intrinsic. The second approach is a novel gene ranking system, the Fuzzy Gene Filter, which provides a more holistic and adaptive approach to ranking genes. The third approach is based on a Stochastic Search paradigm and uses the Population Based Incremental Learning algorithm to identify an optimal gene set with maximum inter-class distinction. All three approaches were implemented and tested on a number of data sets, and the results were compared to those of standard approaches. The Separability Index approach attained a K-Nearest Neighbour classification accuracy of 92%, outperforming the standard approach, which attained an accuracy of 89.6%. The gene list identified also displayed significant functional enrichment. The Fuzzy Gene Filter also outperformed standard approaches, attaining significantly higher accuracies for all of the classifiers tested, on both data sets (p < 0.0231 for the prostate data set and p < 0.1888 for the lymphoma data set). Population Based Incremental Learning outperformed a Genetic Algorithm, identifying a maximum Separability Index of 97.04% (as opposed to 96.39%). Future developments include incorporating biological knowledge when ranking genes using the Fuzzy Gene Filter, as well as incorporating a functional enrichment assessment in the fitness function of the Population Based Incremental Learning algorithm.
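To make the third approach more concrete, here is a minimal sketch of Population Based Incremental Learning over gene-selection bit masks. The abstract does not define its Separability Index, so the fitness below assumes a nearest-neighbour definition (the fraction of samples whose nearest neighbour shares their class); that choice, the function names, the parameter values, and the toy data are all assumptions, and the mutation step found in some PBIL variants is left out.

```python
import random


def separability_index(samples, labels, mask):
    """Assumed fitness: fraction of samples whose nearest neighbour
    (squared Euclidean distance on the selected genes) has the same label."""
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0  # an empty gene set separates nothing
    hits = 0
    for i, row in enumerate(samples):
        best_dist, best_label = None, None
        for k, other in enumerate(samples):
            if k == i:
                continue
            dist = sum((row[j] - other[j]) ** 2 for j in selected)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[k]
        hits += best_label == labels[i]
    return hits / len(samples)


def pbil(fitness, n_bits, generations=40, pop_size=20, learning_rate=0.1, seed=0):
    """Basic PBIL: sample bit masks from a probability vector, then shift the
    vector towards the best mask of each generation."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # start with every gene equally likely to be picked
    best_mask, best_fit = None, float("-inf")
    for _ in range(generations):
        population = [
            [1 if rng.random() < p else 0 for p in probs]
            for _ in range(pop_size)
        ]
        gen_best = max(population, key=fitness)
        gen_fit = fitness(gen_best)
        if gen_fit > best_fit:
            best_mask, best_fit = gen_best, gen_fit
        # Move each probability a small step towards the generation's best mask.
        probs = [
            (1 - learning_rate) * p + learning_rate * bit
            for p, bit in zip(probs, gen_best)
        ]
    return best_mask, best_fit, probs


# Toy data (made up): six samples, four genes; gene 0 carries the class signal.
toy_samples = [
    [0.1, 0.7, 0.3, 0.9],
    [0.2, 0.1, 0.8, 0.2],
    [0.3, 0.9, 0.5, 0.6],
    [0.9, 0.8, 0.4, 0.1],
    [1.0, 0.2, 0.7, 0.8],
    [1.1, 0.6, 0.2, 0.5],
]
toy_labels = [0, 0, 0, 1, 1, 1]


def toy_fitness(mask):
    return separability_index(toy_samples, toy_labels, mask)


mask, fit, probs = pbil(toy_fitness, n_bits=4)
```

Unlike a genetic algorithm, this scheme keeps no population between generations; the probability vector is the only state that carries over.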

    GEOMETRIC OPTIMIZATION IN SOME PROXIMITY AND BIOINFORMATICS PROBLEMS

    The theme of this dissertation is geometric optimization and its applications. We study geometric proximity problems and several bioinformatics problems with a geometric content, requiring the use of geometric optimization tools. We have investigated the following type of proximity problem. Given a point set in a plane with n distinct points, for each point in the set find a pair of points from the remaining points in the set such that the three points either maximize or minimize some geometric measure defined on them. The measures include (a) sum and product; (b) difference; (c) line-distance; (d) triangle area; (e) triangle perimeter; (f) circumcircle-radius; and (g) triangle-distance in three dimensions. We have also studied the application of a linear-time incremental geometric algorithm to test the linear separability of a set of blue points from a set of red points, in two- and three-dimensional Euclidean spaces. We have used this geometric separability tool on four different gene expression data sets, enumerating gene-pairs and gene-triplets that are linearly separable. Pushing on further, we have exploited this novel tool to identify some biomarker genes for a classifier. The gene selection method proposed in the dissertation exhibits good classification accuracy as compared to other known feature (or gene) selection methods such as t-values, FCS (Fisher Criterion Score) and SAM (Significance Analysis of Microarrays). Continuing this line of investigation further, we have also designed an efficient algorithm to find the minimum number of outliers when the red and blue point sets are not fully linearly separable. We have also explored the applicability of geometric optimization techniques to the problem of protein structure similarity. We have come up with two new algorithms, EDAlignres and EDAlignsse, for pairwise protein structure alignment. EDAlignres identifies the best structural alignment of two equal-length proteins by refining the correspondence obtained from eigendecomposition so as to maximize the similarity measure for the refined correspondence. EDAlignsse, on the other hand, does not require the input proteins to be of equal length. These have been fully implemented and tested against well-established protein alignment programs.
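The proximity problem stated in this abstract can be made concrete with a brute-force baseline for one of the listed measures, triangle area. The sketch below simply tries every pair of remaining points for each point; it is not one of the dissertation's algorithms, and the function names and sample points are hypothetical.

```python
from itertools import combinations


def triangle_area(p, q, r):
    """Area of triangle pqr from the 2D cross product of two edge vectors."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    return abs(cross) / 2


def extreme_triangles(points, choose=max):
    """For each point, find the pair of other points giving the extreme area.

    choose is max (largest triangle) or min (smallest triangle).
    Returns a dict: point index -> (q, r, area). Needs at least three points.
    """
    result = {}
    for i, p in enumerate(points):
        others = [pt for k, pt in enumerate(points) if k != i]
        q, r = choose(
            combinations(others, 2),
            key=lambda pair: triangle_area(p, pair[0], pair[1]),
        )
        result[i] = (q, r, triangle_area(p, q, r))
    return result


# Made-up sample points.
sample_points = [(0, 0), (4, 0), (0, 3), (1, 1)]
largest = extreme_triangles(sample_points, choose=max)
smallest = extreme_triangles(sample_points, choose=min)
```

This baseline runs in cubic time overall, which gives a reference point against which faster geometric methods for the same problem can be compared.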