14 research outputs found

    A cooperative feature gene extraction algorithm that combines classification and clustering

    In feature gene selection, the filtering model focuses on classification accuracy while ignoring gene redundancy. Gene clustering, on the other hand, finds correlated genes without considering their predictive ability. Each approach can therefore improve the other. We report a new feature gene extraction algorithm, Double-thresholding Extraction of Feature Gene (DEFG), that combines gene filtering and gene clustering. It first pre-selects a feature gene set from the original dataset. A modified gene clustering is then applied to refine this set. In the gene clustering, specific designs are employed to balance the predictive ability and the redundancy of the extracted feature genes. We tested DEFG on a microarray dataset and compared its performance with that of two benchmark algorithms. The experimental results show that DEFG is superior to both in terms of internal and external validation accuracy. DEFG can also generalize the pattern structure from a small number of training samples. ©2009 IEEE.
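    The abstract does not define DEFG's two thresholds, so the following is only a minimal sketch of the general filter-then-cluster idea, not the authors' algorithm. The relevance score, the correlation-based redundancy check, and the names `relevance`, `pearson`, `select_genes`, `min_relevance`, and `max_corr` are all illustrative assumptions.

```python
from math import sqrt
from statistics import mean, pstdev


def relevance(values, labels):
    """Hypothetical filter score: gap between class means over the overall spread."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = pstdev(values)
    return abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def select_genes(expr, labels, min_relevance=1.0, max_corr=0.9):
    """Stage 1: keep genes whose score passes the first threshold.
    Stage 2: walk them from best to worst and drop any gene that is too
    correlated (second threshold) with a gene already kept."""
    scores = {g: relevance(v, labels) for g, v in expr.items()}
    candidates = sorted(
        (g for g in scores if scores[g] >= min_relevance),
        key=lambda g: -scores[g],
    )
    kept = []
    for g in candidates:
        if all(abs(pearson(expr[g], expr[k])) <= max_corr for k in kept):
            kept.append(g)
    return kept


# Toy expression data: three samples per class (assumed, not from the paper).
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [1.0, 1.0, 1.0, 3.0, 3.0, 3.0],  # strongly predictive
    "g2": [1.0, 1.0, 2.0, 3.0, 3.0, 3.0],  # predictive but redundant with g1
    "g3": [1.0, 2.0, 1.0, 2.0, 3.0, 2.0],  # moderately predictive, less redundant
    "g4": [2.0, 1.0, 3.0, 2.0, 1.0, 3.0],  # uninformative
}
selected = select_genes(expr, labels)
print(selected)
```

    In this toy run the filter removes `g4`, and the redundancy check removes `g2` because it tracks `g1` too closely. Tightening or loosening `max_corr` is one simple way to trade predictive ability against redundancy.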

    Stable Feature Selection for Biomarker Discovery

    Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered, and only recently has this issue received more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) to provide an overview of this new yet fast-growing topic as a convenient reference; (2) to categorize existing methods under an expandable framework for future research and development.
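    To make "stability with respect to sampling variations" concrete, one common way to quantify it is the average pairwise Jaccard similarity between the feature sets selected on different resamples of the data. The abstract does not name a specific measure, so this choice and the names `jaccard` and `selection_stability` are assumptions for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(selected_sets):
    """Mean pairwise Jaccard similarity over the feature sets chosen on each resample."""
    pairs = list(combinations(selected_sets, 2))
    if not pairs:
        raise ValueError("need at least two selected feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Feature sets picked by the same selector on three hypothetical resamples.
runs = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c"}]
stability = selection_stability(runs)
print(round(stability, 3))
```

    A value of 1.0 means every resample produced the same feature set; values near 0 mean the selected biomarkers change almost entirely from one sample to the next.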

    Bayesian profiling of molecular signatures to predict event times

    BACKGROUND: It is of particular interest to identify cancer-specific molecular signatures for early diagnosis, monitoring the effects of treatment, and predicting patient survival time. Molecular information about patients is usually generated by high-throughput technologies such as microarrays and mass spectrometry. Statistically, we are challenged by a large number of candidates but only a small number of patients in the study, and the right-censored clinical data further complicate the analysis. RESULTS: We present a two-stage procedure to profile molecular signatures for survival outcomes. First, we group closely related molecular features into linkage clusters, each portraying either similar or opposite functions and playing similar roles in prognosis; second, a Bayesian approach is developed to rank the centroids of these linkage clusters and provide a list of the main molecular features closely related to the outcome of interest. A simulation study showed the superior performance of our approach. When it was applied to data on diffuse large B-cell lymphoma (DLBCL), we were able to identify some new candidate signatures for disease prognosis. CONCLUSION: This multivariate approach provides researchers with a more reliable list of molecular features profiled in terms of their prognostic relationship to the event times, and generates dependable information for subsequent identification of prognostic molecular signatures through either biological procedures or further data analysis.

    A stable iterative method for refining discriminative gene clusters

    BACKGROUND: Microarray technology is often used to identify the genes that are differentially expressed between two biological conditions. On the other hand, since microarray datasets contain a small number of samples and a large number of genes, it is usually desirable to identify small gene subsets with distinct patterns between sample classes. Such gene subsets are highly discriminative in phenotype classification because of their tightly coupled features. Unfortunately, classifiers identified in this way usually generalize poorly to test samples because of overfitting. RESULTS: We propose a novel approach that combines supervised and unsupervised learning techniques to generate increasingly discriminative gene clusters in an iterative manner. Our experiments on both simulated and real datasets show that our method can produce a series of robust gene clusters with good classification performance compared with existing approaches. CONCLUSION: This backward approach for refining a series of highly discriminative gene clusters for classification purposes proves to be very consistent and stable when applied to various types of training samples.

    Metabolomics-Based Discovery of Diagnostic Biomarkers for Onchocerciasis

    Onchocerciasis, caused by the filarial parasite Onchocerca volvulus, afflicts millions of people, causing such debilitating symptoms as blindness and acute dermatitis. There are no accurate, sensitive means of diagnosing O. volvulus infection. Clinical diagnostics are desperately needed in order to achieve the goals of controlling and eliminating onchocerciasis and neglected tropical diseases in general. In this study, a metabolomics approach is introduced for the discovery of small-molecule biomarkers that can be used to diagnose O. volvulus infection. Blood samples from O. volvulus infected and uninfected individuals from different geographic regions were compared using liquid chromatography separation and mass spectrometry identification. Thousands of chromatographic mass features were statistically compared to discover 14 mass features that were significantly different between infected and uninfected individuals. Multivariate statistical analysis and machine learning algorithms demonstrated how these biomarkers could be used to differentiate between infected and uninfected individuals and indicated that the diagnostic may even be sensitive enough to assess the viability of worms. This study suggests the future potential of these biomarkers for use in a field-based onchocerciasis diagnostic and shows how such an approach could be expanded for the development of diagnostics for other neglected tropical diseases.

    Altered expression of mitochondrial and extracellular matrix genes in the heart of human fetuses with chromosome 21 trisomy

    BACKGROUND: The Down syndrome phenotype has been attributed to overexpression of chromosome 21 (Hsa21) genes. However, the expression profile of Hsa21 genes in trisomic human subjects as well as their effects on genes located on different chromosomes are largely unknown. Using oligonucleotide microarrays we compared the gene expression profiles of hearts of human fetuses with and without Hsa21 trisomy. RESULTS: Approximately half of the 15,000 genes examined (87 of the 168 genes on Hsa21) were expressed in the heart at 18–22 weeks of gestation. Hsa21 gene expression was globally upregulated 1.5-fold in trisomic samples. However, not all genes were equally dysregulated and 25 genes were not upregulated at all. Genes located on other chromosomes were also significantly dysregulated. Functional class scoring and gene set enrichment analyses of 473 genes, differentially expressed between trisomic and non-trisomic hearts, revealed downregulation of genes encoding mitochondrial enzymes and upregulation of genes encoding extracellular matrix proteins. There were no significant differences between trisomic fetuses with and without heart defects. CONCLUSION: We conclude that dosage-dependent upregulation of Hsa21 genes causes dysregulation of the genes responsible for mitochondrial function and for the extracellular matrix organization in the fetal heart of trisomic subjects. These alterations might be harbingers of the heart defects associated with Hsa21 trisomy, which could be based on elusive mechanisms involving genetic variability, environmental factors and/or stochastic events.

    Unsupervised Discretization by Two-dimensional MDL-based Histogram

    Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which results in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach is unable to adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalised maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which partitions each dimension alternately and then merges neighbouring regions, all using the MDL principle. Experiments on synthetic data show that PALM 1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; 2) approximates well a wide range of partitions outside the model class; 3) converges, in contrast to its closest competitor IPD; and 4) is self-adaptive with regard to both sample size and local density structure of the data despite being parameter-free. Finally, we apply our algorithm to two geographic datasets to demonstrate its real-world potential. Comment: 30 pages, 9 figures.