36 research outputs found

    Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data

    BACKGROUND: Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while also improving accuracy. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied to microarray datasets are either rank-based, and hence do not take into account correlations between genes, or wrapper-based, which incur high computational cost and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by insufficiently rigorous evaluation procedures, resulting in overly optimistic estimates of accuracy. RESULTS: We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter that strikes the balance between relevance and redundancy, giving our techniques the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. CONCLUSION: For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
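
    The role of the DDP as a balancing parameter can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: the exponent-weighted score, the "1 - |correlation|" antiredundancy measure, the greedy search, and the names ddp_score and greedy_select are hypothetical and are not taken from the paper.

```python
def antiredundancy(selected, corr):
    """Mean of (1 - |r|) over all pairs in the set; 1.0 for a single feature.
    (Illustrative measure, assumed for this sketch.)"""
    pairs = [(i, j) for k, i in enumerate(selected) for j in selected[k + 1:]]
    if not pairs:
        return 1.0
    return sum(1 - abs(corr[i][j]) for i, j in pairs) / len(pairs)


def ddp_score(selected, relevance, corr, alpha):
    """Assumed score: relevance**alpha * antiredundancy**(1 - alpha).
    alpha plays the role of the DDP: alpha = 1 ranks purely by relevance,
    smaller alpha puts more weight on avoiding redundancy."""
    v = sum(relevance[i] for i in selected) / len(selected)
    u = antiredundancy(selected, corr)
    return (v ** alpha) * (u ** (1 - alpha))


def greedy_select(relevance, corr, size, alpha):
    """Forward selection: start from the most relevant feature, then keep
    adding the candidate that maximizes the score of the enlarged set."""
    n = len(relevance)
    selected = [max(range(n), key=lambda i: relevance[i])]
    while len(selected) < size:
        candidates = [i for i in range(n) if i not in selected]
        best = max(
            candidates,
            key=lambda i: ddp_score(selected + [i], relevance, corr, alpha),
        )
        selected.append(best)
    return selected


# Toy data: features 0 and 1 are both relevant but nearly identical;
# feature 2 is less relevant but almost uncorrelated with the others.
relevance = [0.90, 0.85, 0.50]
corr = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]

print(greedy_select(relevance, corr, size=2, alpha=1.0))
print(greedy_select(relevance, corr, size=2, alpha=0.5))
```

    With alpha = 1.0 the sketch behaves like rank-based selection and picks the two most relevant (but redundant) features; with alpha = 0.5 it trades some relevance for a less correlated pair.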

    Characteristics of predictor sets found using differential prioritization

    BACKGROUND: Feature selection plays an undeniably important role in classification problems involving high dimensional datasets such as microarray datasets. For filter-based feature selection, two well-known criteria used in forming predictor sets are relevance and redundancy. However, there is a third criterion which is at least as important as the other two in affecting the efficacy of the resulting predictor sets. This criterion is the degree of differential prioritization (DDP), which varies the emphases on relevance and redundancy depending on the value of the DDP. Previous empirical works on publicly available microarray datasets have confirmed the effectiveness of the DDP in molecular classification. We now propose to establish the fundamental strengths and merits of the DDP-based feature selection technique. This is to be done through a simulation study which involves vigorous analyses of the characteristics of predictor sets found using different values of the DDP from toy datasets designed to mimic real-life microarray datasets. RESULTS: A simulation study employing analytical measures such as the distance between classes before and after transformation using principal component analysis is implemented on toy datasets. From these analyses, the necessity of adjusting the differential prioritization based on the dataset of interest is established. This conclusion is supported by comparisons against both simplistic rank-based selection and state-of-the-art equal-priorities scoring methods, which demonstrates the superiority of the DDP-based feature selection technique. Reapplying similar analyses to real-life multiclass microarray datasets provides further confirmation of our findings and of the significance of the DDP for practical applications. CONCLUSION: The findings have been achieved based on analytical evaluations, not empirical evaluation involving classifiers, thus providing further basis for the usefulness of the DDP and validating the need for unequal priorities on relevance and redundancy during feature selection for microarray datasets, especially highly multiclass datasets.

    A Densely Interconnected Genome-Wide Network of MicroRNAs and Oncogenic Pathways Revealed Using Gene Expression Signatures

    MicroRNAs (miRNAs) are important components of cellular signaling pathways, acting either as pathway regulators or pathway targets. Currently, only a limited number of miRNAs have been functionally linked to specific signaling pathways. Here, we explored whether gene expression signatures could be used to represent miRNA activities and be integrated with genomic signatures of oncogenic pathway activity to identify connections between miRNAs and oncogenic pathways on a high-throughput, genome-wide scale. Mapping >300 gene expression signatures to >700 primary tumor profiles, we constructed a genome-wide miRNA–pathway network predicting the associations of 276 human miRNAs to 26 oncogenic pathways. The miRNA–pathway network confirmed a host of previously reported miRNA/pathway associations and uncovered several novel associations that were subsequently experimentally validated. Globally, the miRNA–pathway network demonstrates a small-world, but not scale-free, organization characterized by multiple distinct, tightly knit modules each exhibiting a high density of connections. However, unlike genetic or metabolic networks typified by only a few highly connected nodes (“hubs”), most nodes in the miRNA–pathway network are highly connected. Sequence-based computational analysis confirmed that highly interconnected miRNAs are likely to be regulated by common pathways to target similar sets of downstream genes, suggesting a pervasive and high level of functional redundancy among coexpressed miRNAs. We conclude that gene expression signatures can be used as surrogates of miRNA activity. Our strategy facilitates the task of discovering novel miRNA–pathway connections, since gene expression data for multiple normal and disease conditions are abundantly available.

    Oncogenic Pathway Combinations Predict Clinical Prognosis in Gastric Cancer

    Many solid cancers are known to exhibit a high degree of heterogeneity in their deregulation of different oncogenic pathways. We sought to identify major oncogenic pathways in gastric cancer (GC), the second highest cause of global cancer mortality, with significant relationships to patient survival. Using gene expression signatures, we devised an in silico strategy to map patterns of oncogenic pathway activation in 301 primary gastric cancers. We identified three oncogenic pathways (proliferation/stem cell, NF-κB, and Wnt/β-catenin) deregulated in the majority (>70%) of gastric cancers. We functionally validated these pathway predictions in a panel of gastric cancer cell lines. Patient stratification by oncogenic pathway combinations showed reproducible and significant survival differences in multiple cohorts, suggesting that pathway interactions may play an important role in influencing disease behavior. Individual GCs can be successfully taxonomized by oncogenic pathway activity into biologically and clinically relevant subgroups. Predicting pathway activity by expression signatures thus permits the study of multiple cancer-related pathways interacting simultaneously in primary cancers, at a scale not currently achievable by other platforms.

    Cancer class prediction using gene expression data

    This project aims to devise a method of classification for tumor types by retrieving relevant information from gene expression data. Master of Engineering (MPE)

    Differential prioritization in feature selection for multiclass molecular classification

    The aim of the thesis is to develop a filter-based feature selection (FS) technique for multiclass molecular classification. Molecular classification involves the classification of samples into groups of biological phenotypes based on high-dimensional gene expression data obtained from microarray experiments. The multiclass nature of the classification problems demands work on two specific areas: (a) differential prioritization and (b) combinations between different decomposition paradigms of FS and classification. FS aims to form, from the larger set of features in the dataset, a smaller subset of features which are capable of producing the best classification accuracy. This subset is called the predictor set. Relevance and redundancy have always been acknowledged as important criteria in the formation of the predictor set in filter-based FS. These two criteria are included as elements in the predictor set score, which measures the goodness of the predictor set. However, especially in a multiclass problem, we propose that a third criterion is necessary for the formation of the predictor set. This third criterion is the differential prioritization, a novel criterion which sets the priority of maximizing relevance relative to the priority of minimizing redundancy. Differential prioritization ensures that the optimal balance between relevance and redundancy is achieved based on the number of classes in the classification problem. This is because as the number of classes increases, the relative importance of minimizing redundancy also increases. For instance, in order to achieve the best accuracy, minimizing redundancy in a 14-class problem is more important than minimizing redundancy in a two-class problem. An outcome of the work on differential prioritization is the development of a superior measure for redundancy. Redundancy in the predictor set is defined as the amount of similarity or repetition of information among the members of the predictor set.
    Traditionally, redundancy is measured by directly summing up the pairwise similarity among the members of the predictor set. It is then minimized by defining it as the denominator in a ratio-based predictor set score. This method of measuring and minimizing redundancy faces the problem of singularity at near-minimum redundancy, which results in a skewed representation of the goodness of the predictor set. This motivates us to come up with an alternative measure for redundancy which circumvents the aforementioned problem. In multiclass problems, following the 'divide-and-conquer' philosophy, FS may be decomposed into several two-class sub-problems. The manner of the decomposition determines the decomposition paradigm for the FS problem. This is also true for multiclass classification, which may also be decomposed into several two-class sub-problems. The problem of FS and the problem of classification are inevitably linked to each other, since one of the aims of FS is to aid classification. However, there exists no formal approach for systematically combining the twin problems of FS and classification based on the decomposition paradigm used in each problem. Hence, we propose a system for combining the FS and the classification problems which will enable us to examine the effect of different combinations between decomposition paradigms of FS and classification on accuracy in multiclass molecular classification.
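
    The singularity problem of the ratio-based score can be seen in a small numeric sketch. The ratio score below follows the traditional scheme described in the abstract (relevance divided by summed pairwise similarity). The bounded alternative is only an illustrative assumption (mean of "1 - similarity" used as a multiplier) and is not claimed to be the measure developed in the thesis; the function names are hypothetical.

```python
def ratio_score(relevance, similarities):
    """Traditional scheme: relevance divided by the summed pairwise
    similarity. The score grows without bound as the sum approaches zero
    and raises ZeroDivisionError when it is exactly zero."""
    return relevance / sum(similarities)


def bounded_score(relevance, similarities):
    """Assumed alternative for illustration: multiply relevance by the mean
    of (1 - similarity). For similarities in [0, 1] the result stays within
    [0, relevance], so near-zero redundancy cannot dominate the score."""
    antiredundancy = sum(1 - s for s in similarities) / len(similarities)
    return relevance * antiredundancy


# Same relevance, with the single pairwise similarity shrinking toward zero.
for s in (0.1, 0.001, 0.00001):
    print(s, ratio_score(0.8, [s]), bounded_score(0.8, [s]))
```

    In the loop the ratio score jumps by orders of magnitude for tiny changes in an already negligible redundancy, whereas the bounded score levels off just below the relevance value of 0.8.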

    Increasing Classification Accuracy by Combining Adaptive Sampling and Convex Pseudo-Data ©Springer-Verlag

    Abstract. The availability of microarray data has enabled several studies on the application of aggregated classifiers for molecular classification. We present a combination of classifier aggregating and adaptive sampling techniques capable of increasing prediction accuracy of tumor samples for multiclass datasets. Our aggregated classifier method is capable of improving the classification accuracy of predictor sets obtained from our maximal-antiredundancy-based feature selection technique. On the Global Cancer Map (GCM) dataset, an improvement over the highest accuracy reported has been achieved by the joint application of our feature selection technique and the modified aggregated classifier method.

    A Comparative Study of Two Novel Predictor Set Scoring Methods ©Springer-Verlag

    Abstract. Due to the large number of genes measured in a typical microarray dataset, feature selection plays an essential role in tumor classification. In turn, relevance and redundancy are key components in determining the optimal predictor set. However, a third component, the relative weights given to the first two, also assumes an equal, if not greater, importance in feature selection. Based on this third component, we developed two novel feature selection methods capable of producing high, unbiased classification accuracy in multiclass microarray datasets. In an in-depth analysis comparing the two methods, the optimal values of the relative weights are also estimated.