
    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
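
    As a concrete illustration of a few of these challenges (this example is not taken from the review itself), the sketch below uses scikit-learn and entirely synthetic data to combine two hypothetical omics blocks by simple concatenation, imputing missing values and standardizing each block inside the cross-validation pipeline, and using class weights as one basic way to counter class imbalance. All variable names, data and parameters are illustrative assumptions.

        # Minimal sketch, assuming synthetic data: naive "early integration" of two
        # hypothetical omics blocks with per-block imputation/scaling and class weights.
        import numpy as np
        from sklearn.compose import ColumnTransformer
        from sklearn.impute import SimpleImputer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        n = 100
        expr = rng.normal(loc=5.0, size=(n, 200))        # hypothetical transcriptome block
        meth = rng.normal(size=(n, 300))                 # hypothetical methylome block
        meth[rng.random(meth.shape) < 0.05] = np.nan     # simulate missing entries
        y = rng.integers(0, 2, size=n)                   # hypothetical binary phenotype

        X = np.hstack([expr, meth])                      # integration by concatenation

        # impute and scale each block separately, inside the CV pipeline to avoid leakage;
        # class_weight="balanced" is one simple remedy for class imbalance
        block = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
        pre = ColumnTransformer([("expr", block, slice(0, 200)),
                                 ("meth", block, slice(200, 500))])
        model = make_pipeline(pre, LogisticRegression(max_iter=1000, class_weight="balanced"))
        print(cross_val_score(model, X, y, cv=5).mean())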

    Biclustering of gene expression data by non-smooth non-negative matrix factorization

    BACKGROUND: The extended use of microarray technologies has enabled the generation and accumulation of gene expression datasets that contain expression levels of thousands of genes across tens or hundreds of different experimental conditions. One of the major challenges in the analysis of such datasets is to discover local structures composed of sets of genes that show coherent expression patterns across subsets of experimental conditions. These patterns may provide clues about the main biological processes associated with different physiological states. RESULTS: In this work we present a methodology able to cluster genes and conditions that are highly related in sub-portions of the data. Our approach is based on a new data mining technique, Non-smooth Non-Negative Matrix Factorization (nsNMF), which is able to identify localized patterns in large datasets. We assessed the potential of this methodology by analyzing several synthetic datasets as well as two large and heterogeneous sets of gene expression profiles. In all cases the method was able to identify localized features related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The uncovered structures showed a clear biological meaning in terms of relationships among the functional annotations of genes and the phenotypes or physiological states of the associated conditions. CONCLUSION: The proposed approach can be a useful tool for analyzing large and heterogeneous gene expression datasets. The method is able to identify complex relationships among genes and conditions that are difficult to detect with standard clustering algorithms.
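
    For readers who want to experiment with the general idea, the sketch below applies the standard NMF implementation from scikit-learn to a synthetic expression matrix and derives crude biclusters by thresholding the factor matrices. Note that scikit-learn does not provide the non-smooth variant (nsNMF) used in the paper, so this is only a rough approximation of the approach; the planted block, thresholds and parameters are assumptions.

        # Minimal sketch of NMF-based biclustering on a synthetic expression matrix.
        # Standard NMF is used as a stand-in for nsNMF, which scikit-learn does not provide.
        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(1)
        genes, conditions, k = 500, 40, 3
        X = rng.gamma(shape=1.0, scale=1.0, size=(genes, conditions))  # non-negative data
        X[:60, :10] += 4.0                      # plant one localized co-expression block

        model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
        W = model.fit_transform(X)              # genes x k (gene memberships)
        H = model.components_                   # k x conditions (condition memberships)

        for f in range(k):
            gene_idx = np.where(W[:, f] > W[:, f].mean() + 2 * W[:, f].std())[0]
            cond_idx = np.where(H[f, :] > H[f, :].mean() + 2 * H[f, :].std())[0]
            print(f"factor {f}: {len(gene_idx)} genes x {len(cond_idx)} conditions")

    In nsNMF itself, a smoothing matrix placed between the two factors induces sparseness in both of them, which is what sharpens the localized, bicluster-like patterns; in this plain-NMF stand-in the thresholding step has to do most of that work.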

    bioNMF: a versatile tool for non-negative matrix factorization in biology

    BACKGROUND: In the bioinformatics field, a great deal of interest has been given to the non-negative matrix factorization (NMF) technique, owing to its capability of providing new insights and relevant information about the complex latent relationships in experimental data sets. This method and some of its variants have been successfully applied to gene expression, sequence analysis, functional characterization of genes and text mining. Although interest in this technique within the bioinformatics community has increased over the last few years, there are not many simple standalone tools available to specifically perform these types of data analysis in an integrated environment. RESULTS: In this work we propose a versatile and user-friendly tool that implements the NMF methodology in different analysis contexts to support some of the most important reported applications of this methodology. These include clustering and biclustering of gene expression data, protein sequence analysis, text mining of biomedical literature and sample classification using gene expression. The tool, which is named bioNMF, also contains a user-friendly graphical interface to explore results in an interactive manner and thereby facilitate the exploratory data analysis process. CONCLUSION: bioNMF is a standalone, versatile application which does not require any special installation or libraries. It can be used for most of the applications proposed in the bioinformatics field or to support new research using this method. The tool is publicly available at
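
    One of the applications bioNMF supports, clustering samples via NMF "metagenes", can be approximated outside the tool as in the sketch below, which assigns each synthetic sample to the metagene with the largest coefficient. This is a generic illustration with assumed data and parameters, not the bioNMF implementation itself.

        # Minimal sketch of metagene-based sample clustering with standard NMF;
        # the data, factorization rank and assignment rule are illustrative assumptions.
        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(2)
        X = rng.gamma(shape=1.0, scale=1.0, size=(1000, 30))  # genes x samples, synthetic
        X[:100, :10] += 3.0                                   # crude group structure
        X[100:200, 10:20] += 3.0

        nmf = NMF(n_components=3, init="nndsvd", max_iter=500, random_state=0)
        W = nmf.fit_transform(X)        # genes x metagenes
        H = nmf.components_             # metagenes x samples
        labels = H.argmax(axis=0)       # assign each sample to its dominant metagene
        print(labels)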

    A mixture model with a reference-based automatic selection of components for disease classification from protein and/or gene expression levels

    Background: Bioinformatics data analysis often uses a linear mixture model that represents samples as additive mixtures of components. Properly constrained blind matrix factorization methods extract those components using the mixture samples only. However, automatic selection of the extracted components to be retained for classification analysis remains an open issue. Results: The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of 96.2% (sd = 2.7%), 97.6% (sd = 2.8%) and 90.8% (sd = 5.5%) and average specificities of 93.6% (sd = 4.1%), 99% (sd = 2.2%) and 79.4% (sd = 9.8%) in 100 independent two-fold cross-validations. Conclusions: We propose an additive mixture model of a sample for feature extraction that uses, in principle, sparseness-constrained factorization on a sample-by-sample basis. In contrast, existing methods factorize the complete dataset simultaneously. The sample model is composed of a reference sample, representing the control and/or case (disease) groups, and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control specific, case specific or not differentially expressed (neutral). The number of components is determined by cross-validation. Automatic assignment of features (m/z ratios or genes) to a particular component is based on thresholds estimated directly from each sample. Because the decomposition is local, the strength of expression of each feature can vary across samples, yet the feature is still allocated to the related disease- and/or control-specific component. Since label information is not used in the selection process, the case- and control-specific components can be used for classification, which is not the case with standard factorization methods. Moreover, the component selected by the proposed method as disease specific can be interpreted as a sub-mode and retained for further analysis to identify potential biomarkers. In contrast to standard matrix factorization methods, this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent features enables their removal from the disease- and control-specific components on a sample-by-sample basis. This yields selected components with reduced complexity and, in general, increases prediction accuracy.
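
    The stacked-sample idea can be illustrated, very loosely, with the hypothetical sketch below: a control reference, a case reference and one test sample are stacked into a small matrix, factorized with standard (unconstrained) NMF rather than the sparseness-constrained factorization the authors use, and features are assigned to components by a simple per-component threshold. Every name, threshold and dataset here is an assumption made for illustration only.

        # Hypothetical sketch of a sample-wise decomposition: standard NMF replaces the
        # paper's sparseness-constrained factorization, and all data are synthetic.
        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(3)
        m = 2000                                        # features (e.g. m/z ratios or genes)
        control_ref = rng.gamma(shape=2.0, scale=1.0, size=m)
        case_ref = control_ref.copy()
        case_ref[:50] += 5.0                            # features up-regulated in disease
        test = case_ref + rng.gamma(shape=0.5, scale=0.5, size=m)

        X = np.vstack([control_ref, case_ref, test])    # 3 x m stacked sample model
        nmf = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
        W = nmf.fit_transform(X)        # 3 x 3: how strongly each row uses each component
        H = nmf.components_             # 3 x m: component profiles over features

        for k in range(3):
            members = np.where(H[k] > H[k].mean() + 2 * H[k].std())[0]
            print(f"component {k}: row weights = {np.round(W[:, k], 2)}, "
                  f"{len(members)} features above threshold")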