254 research outputs found

    M3C: Monte Carlo reference-based consensus clustering.

    Get PDF
    Genome-wide data is used to stratify patients into classes for precision medicine using clustering algorithms. A common problem in this area is selection of the number of clusters (K). The Monti consensus clustering algorithm is a widely used method which uses stability selection to estimate K. However, the method has bias towards higher values of K and yields high numbers of false positives. As a solution, we developed Monte Carlo reference-based consensus clustering (M3C), which is based on this algorithm. M3C simulates null distributions of stability scores for a range of K values thus enabling a comparison with real data to remove bias and statistically test for the presence of structure. M3C corrects the inherent bias of consensus clustering as demonstrated on simulated and real expression data from The Cancer Genome Atlas (TCGA). For testing M3C, we developed clusterlab, a new method for simulating multivariate Gaussian clusters

    Nonnegative principal component analysis for mass spectral serum profiles and biomarker discovery

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>As a novel cancer diagnostic paradigm, mass spectroscopic serum proteomic pattern diagnostics was reported superior to the conventional serologic cancer biomarkers. However, its clinical use is not fully validated yet. An important factor to prevent this young technology to become a mainstream cancer diagnostic paradigm is that robustly identifying cancer molecular patterns from high-dimensional protein expression data is still a challenge in machine learning and oncology research. As a well-established dimension reduction technique, PCA is widely integrated in pattern recognition analysis to discover cancer molecular patterns. However, its global feature selection mechanism prevents it from capturing local features. This may lead to difficulty in achieving high-performance proteomic pattern discovery, because only features interpreting global data behavior are used to train a learning machine.</p> <p>Methods</p> <p>In this study, we develop a nonnegative principal component analysis algorithm and present a nonnegative principal component analysis based support vector machine algorithm with sparse coding to conduct a high-performance proteomic pattern classification. Moreover, we also propose a nonnegative principal component analysis based filter-wrapper biomarker capturing algorithm for mass spectral serum profiles.</p> <p>Results</p> <p>We demonstrate the superiority of the proposed algorithm by comparison with six peer algorithms on four benchmark datasets. Moreover, we illustrate that nonnegative principal component analysis can be effectively used to capture meaningful biomarkers.</p> <p>Conclusion</p> <p>Our analysis suggests that nonnegative principal component analysis effectively conduct local feature selection for mass spectral profiles and contribute to improving sensitivities and specificities in the following classification, and meaningful biomarker discovery.</p

    M3C: Monte Carlo reference-based consensus clustering

    Get PDF
    Genome-wide data is used to stratify patients into classes for precision medicine using clustering algorithms. A common problem in this area is selection of the number of clusters (K). The Monti consensus clustering algorithm is a widely used method which uses stability selection to estimate K. However, the method has bias towards higher values of K and yields high numbers of false positives. As a solution, we developed Monte Carlo reference-based consensus clustering (M3C), which is based on this algorithm. M3C simulates null distributions of stability scores for a range of K values thus enabling a comparison with real data to remove bias and statistically test for the presence of structure. M3C corrects the inherent bias of consensus clustering as demonstrated on simulated and real expression data from The Cancer Genome Atlas (TCGA). For testing M3C, we developed clusterlab, a new method for simulating multivariate Gaussian clusters

    Clustering in Conjunction with Wrapper Approach to Select Discriminatory Genes for Microarray Dataset Classification

    Get PDF
    With the advent of microarray technology, it is possible to measure gene expression levels of thousands of genes simultaneously. This helps us diagnose and classify some particular cancers directly using DNA microarray. High-dimensionality and small sample size of microarray datasets has made the task of classification difficult. These datasets contain a large number of redundant and irrelevant genes. For efficient classification of samples there is a need of selecting a smaller set of relevant and non-redundant genes. In this paper, we have proposed a two stage algorithm for finding a set of discriminatory genes responsible for classification of high dimensional microarray datasets. In the first stage redundancy is reduced by grouping correlated genes into clusters and selecting a representative gene from each cluster. Maximal information compression index is used to measure similarity between genes. In the second stage a wrapper based forward feature selection method is used to obtain a set of discriminatory genes for a given classifier. We have investigated three different techniques for clustering and four classifiers in our experiments. The proposed algorithm is tested on six well known publicly available datasets. Comparison with the other state-of-the-art methods show that our proposed algorithm is able to achieve better classification accuracy with less number of genes

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Unsupervised Algorithms for Microarray Sample Stratification

    Get PDF
    The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. In order to be able to discover hidden patterns, the most disparate analytical techniques have been proposed. Here, we describe the basic methodologies to approach the analysis of microarray datasets that focus on the task of (sub)group discovery.Peer reviewe

    NON-MATRIX FACTORIZATION FOR BLIND IMAGE SEPARATION

    Get PDF
    Hyperspectral unmixing is a process to identify the constituent materials and estimate the corresponding fractions from the mixture, nonnegative matrix factions ( NMF ) is suitable as a candidate for the linear spectral mixture mode, has been applied to the unmixing hyperspectral data. Unfortunately, the local minima is cause by the nonconvexity of the objective function  makes the solution nonunique, thus only the nonnegativity constraint is not sufficient enough to lead to a well define problems. Therefore, two inherent characteristic of hyperspectal data, piecewise smoothness ( both temporal and spatial ) of spectral data and sparseness of abundance fraction of every material, are introduce to the NMF. The adaptive potential function from discontinuity adaptive Markov random field model is used to describe the smoothness constraint while preserving discontinuities is spectral data.  At the same time two NMF algorithms, non smooth NMS and NMF with sparseness constraint, are used to quantify the degree of sparseness of material abundances. Experiment using the synthetic and real data demonstrate the proposed algorithms provides an effective unsupervised technique for hyperspectial unmixing

    A review on initialization methods for nonnegative matrix factorization: Towards omics data experiments

    Get PDF
    Nonnegative Matrix Factorization (NMF) has acquired a relevant role in the panorama of knowledge extraction, thanks to the peculiarity that non-negativity applies to both bases and weights, which allows meaningful interpretations and is consistent with the natural human part-based learning process. Nevertheless, most NMF algorithms are iterative, so initialization methods affect convergence behaviour, the quality of the final solution, and NMF performance in terms of the residual of the cost function. Studies on the impact of NMF initialization techniques have been conducted for text or image datasets, but very few considerations can be found in the literature when biological datasets are studied, even though NMFs have largely demonstrated their usefulness in better understanding biological mechanisms with omic datasets. This paper aims to present the state-of-the-art on NMF initialization schemes along with some initial considerations on the impact of initialization methods when microarrays (a simple instance of omic data) are evaluated with NMF mechanisms. Using a series of measures to qualitatively examine the biological information extracted by a given NMF scheme, it preliminary appears that some information (e.g., represented by genes) can be extracted regardless of the initialization scheme used
    corecore