
    VizRank: Data Visualization Guided by Machine Learning

    Data visualization plays a crucial role in identifying interesting patterns in exploratory data analysis. Its use is, however, made difficult by the large number of possible data projections showing different attribute subsets that must be evaluated by the data analyst. In this paper, we introduce a method called VizRank, which is applied to classified data to automatically select the most useful data projections. VizRank can be used with any visualization method that maps attribute values to points in a two-dimensional visualization space. It assesses possible data projections and ranks them by their ability to visually discriminate between classes. The quality of class separation is estimated by computing the predictive accuracy of a k-nearest neighbor classifier on a data set consisting of the x and y positions of the projected data points and their class information. The paper introduces the method and presents experimental results showing that VizRank's ranking of projections agrees closely with subjective rankings by data analysts. The practical use of VizRank is also demonstrated by an application in the field of functional genomics.
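The scoring idea described above (rank each two-attribute scatterplot by how well a k-NN classifier predicts classes from the projected x/y positions alone) can be sketched as follows. This is a minimal illustration, not the paper's exact protocol: the leave-one-out scheme, neighborhood size k, and function names are assumptions.

```python
import numpy as np
from itertools import combinations

def knn_projection_score(X2d, y, k=5):
    """Leave-one-out accuracy of a k-NN classifier on 2-D projected points."""
    n = len(y)
    correct = 0
    for i in range(n):
        d = np.linalg.norm(X2d - X2d[i], axis=1)
        d[i] = np.inf                        # exclude the point itself
        nn = np.argsort(d)[:k]               # indices of the k nearest neighbors
        pred = np.bincount(y[nn]).argmax()   # majority vote among neighbors
        correct += (pred == y[i])
    return correct / n

def rank_projections(X, y, k=5):
    """Score every scatterplot of two attributes and return them best-first."""
    scores = [(knn_projection_score(X[:, [a, b]], y, k), (a, b))
              for a, b in combinations(range(X.shape[1]), 2)]
    return sorted(scores, reverse=True)
```

On synthetic data with class signal in a single attribute, every projection containing that attribute rises to the top of the ranking, mimicking how VizRank surfaces informative scatterplots.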

    Univariate and multivariate statistical approaches for the analyses of omics data: sample classification and two-block integration.

    The wealth of information generated by high-throughput omics technologies in the context of large-scale epidemiological studies has made a significant contribution to the identification of factors influencing the onset and progression of common diseases. Advanced computational and statistical modelling techniques are required to manipulate and extract meaningful biological information from these omics data, as several layers of complexity are associated with them. Recent research efforts have concentrated on the development of novel statistical and bioinformatic tools; however, studies thoroughly investigating the applicability and suitability of these novel methods on real data have often fallen behind. This thesis focuses on the analyses of proteomics and transcriptomics data from the EnviroGenoMarker project with the purpose of addressing two main research objectives: i) to critically appraise established and recently developed statistical approaches in their ability to appropriately accommodate the inherently complex nature of real-world omics data, and ii) to improve the current understanding of a prevalent condition by identifying biological markers predictive of disease as well as possible biological mechanisms leading to its onset. The specific disease endpoint of interest is B-cell lymphoma, a common haematological malignancy for which many questions related to its aetiology remain unanswered. The seven chapters comprising this thesis are structured as follows: the first two are introductory chapters in which I describe the main omics technologies and the statistical methods employed for their analyses. The third chapter describes the epidemiological project giving rise to the study population and the disease outcome of interest. These are followed by three results chapters that address the research aims described above by applying univariate and multivariate statistical approaches for sample classification and data integration purposes. A summary of findings, concluding general remarks, and a discussion of open problems offering potential avenues for future research are presented in the final chapter.

    Expression cartography of human tissues using self organizing maps

    Background: The availability of parallel, high-throughput microarray and sequencing experiments poses the challenge of how best to arrange and analyze the resulting wealth of multidimensional data in a concerted way. Self-organizing maps (SOM), a machine learning method, enable a parallel sample- and gene-centered view of the data combined with strong visualization and second-level analysis capabilities. The paper addresses aspects of the method with practical impact in the context of expression analysis of complex data sets.
Results: The method was applied to generate a SOM characterizing the whole-genome expression profiles of 67 healthy human tissues selected from ten tissue categories (adipose, endocrine, homeostasis, digestion, exocrine, epithelium, sexual reproduction, muscle, immune system and nervous tissues). SOM mapping reduces the dimension of the expression data from tens of thousands of genes to a few thousand metagenes, where each metagene acts as a representative of a minicluster of co-regulated single genes. Tissue-specific properties and common properties shared between groups of tissues emerge as a handful of localized spots in the tissue maps, collecting groups of co-regulated and co-expressed metagenes. The functional context of the spots was discovered using overrepresentation analysis with respect to pre-defined gene sets of known functional impact. We found that tissue-related spots typically contain enriched populations of gene sets corresponding well to molecular processes in the respective tissues. Analysis techniques normally used at the gene level, such as two-way hierarchical clustering, provide a better signal-to-noise ratio and better representativeness if applied to the metagenes. Metagene-based clustering analyses aggregate the tissues into essentially three clusters containing nervous, immune system and the remaining tissues.
Conclusions: The global view of the behavior of a few well-defined modules of correlated and differentially expressed genes is more intuitive and more informative than the separate discovery of the expression levels of hundreds or thousands of individual genes. The metagene approach is less sensitive to the a priori selection of genes. It can detect a coordinated expression pattern whose components would not pass single-gene significance thresholds, and it is able to extract context-dependent patterns of gene expression in complex data sets.
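The gene-centered mapping described in the Results can be sketched with a minimal numpy SOM: each gene's expression profile across tissues is assigned to its best-matching map node, and the node weight vectors play the role of metagenes. Grid size, decay schedules, and the synthetic data below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def train_som(profiles, grid=(8, 8), epochs=5, lr0=0.5, sigma0=3.0, seed=0):
    """Train a rectangular SOM; each node's weight vector acts as a 'metagene'.

    profiles : (n_genes, n_tissues) expression matrix.
    Returns the (grid_h*grid_w, n_tissues) node/metagene matrix and each
    gene's best-matching node index.
    """
    rng = np.random.default_rng(seed)
    h, w = grid
    nodes = rng.normal(size=(h * w, profiles.shape[1]))
    # (row, col) coordinate of every node, used by the neighborhood kernel
    coords = np.array([(r, c) for r in range(h) for c in range(w)], float)
    n_steps = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        for x in profiles[rng.permutation(len(profiles))]:
            lr = lr0 * (1 - step / n_steps)              # decaying learning rate
            sigma = sigma0 * (1 - step / n_steps) + 0.5  # shrinking neighborhood
            bmu = np.argmin(((nodes - x) ** 2).sum(axis=1))
            dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            g = np.exp(-dist2 / (2 * sigma ** 2))        # Gaussian neighborhood
            nodes += lr * g[:, None] * (x - nodes)
            step += 1
    bmus = np.array([np.argmin(((nodes - x) ** 2).sum(axis=1)) for x in profiles])
    return nodes, bmus
```

With two artificial clusters of co-regulated gene profiles, genes from the same cluster land on map nodes whose weight vectors resemble their shared profile, which is the miniclustering effect the abstract describes.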

    Feature extraction using random matrix theory.

    Representing complex data in a concise and accurate way is a key stage in the data mining methodology. Redundant and noisy data affect the generalization power of any classification algorithm, undermine the results of any clustering algorithm, and encumber the monitoring of large dynamic systems. This work provides several efficient approaches to all the aforementioned sides of the analysis. We established that a notable difference can be made if results from the theory of ensembles of random matrices are employed. A particularly important result of our study is a discovered family of methods based on projecting the data set onto different subsets of the correlation spectrum. Generally, we start with the traditional correlation matrix of a given data set. We perform singular value decomposition and establish boundaries between essential and unimportant eigen-components of the spectrum. Then, depending on the nature of the problem at hand, we use either the former or the latter part for the projection. Projecting onto the spectrum of interest is a common technique in linear and non-linear spectral methods such as Principal Component Analysis, Independent Component Analysis and Kernel Principal Component Analysis. Usually the part of the spectrum to project onto is defined by the amount of variance of the overall data or, in the non-linear case, of the feature space. The applicability of these spectral methods is limited by the assumption that larger variance carries the important dynamics, i.e. that the data have a high signal-to-noise ratio. If this is true, projection onto the principal components targets two problems in data mining: reduction in the number of features and selection of the more important features. Our methodology does not make the assumption of a high signal-to-noise ratio; instead, using the rigorous instruments of Random Matrix Theory (RMT), it identifies the presence of noise and establishes its boundaries.
The knowledge of the structure of the spectrum gives us the possibility to make more insightful projections. For instance, in the application to router network traffic, the reconstruction-error procedure for anomaly detection is based on the projection onto the noisy part of the spectrum, whereas in the bioinformatics application of clustering different types of leukemia, implicit denoising of the correlation matrix is achieved by decomposing the spectrum into random and non-random parts. For temporal high-dimensional data, the spectrum and eigenvectors of the correlation matrix are another representation of the data. Thus eigenvalues, components of the eigenvectors, the inverse participation ratio of eigenvector components, and other operators of eigen-analysis are spectral features of the dynamic system. In our work we proposed to extract such spectral features using RMT. We demonstrated that with the extracted spectral features we can monitor the changing dynamics of network traffic. Experimenting with the delayed correlation matrices of network traffic and extracting their spectral features, we visualized the delayed processes in the system. We demonstrated in our work that a broad range of applications in feature extraction can benefit from the novel RMT-based approach to the spectral representation of the data.
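The noise-boundary idea above can be sketched with the standard Marchenko-Pastur upper edge: eigenvalues of the correlation matrix below the edge are treated as the random part of the spectrum, those above it as signal. This uses the textbook asymptotic bound, not necessarily the thesis's exact procedure, and the function name is illustrative.

```python
import numpy as np

def rmt_split_spectrum(X):
    """Split the correlation spectrum of X (n_samples x n_features) into
    random and non-random eigen-components using the Marchenko-Pastur
    upper edge lambda_max = (1 + sqrt(q))^2 with q = n_features / n_samples."""
    n, p = X.shape
    Z = (X - X.mean(0)) / X.std(0)       # standardise columns
    C = (Z.T @ Z) / n                    # sample correlation matrix
    evals, evecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    lam_max = (1 + np.sqrt(p / n)) ** 2  # MP edge for a pure-noise matrix
    signal = evals > lam_max             # True where an eigenvalue is non-random
    return evals, evecs, signal
```

Projecting the data onto the `signal` eigenvectors then gives the denoised representation, while the complementary projection onto the sub-edge eigenvectors isolates the noisy part used, for example, in reconstruction-error anomaly detection.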

    Machine learning methods for omics data integration

    High-throughput technologies produce genome-scale transcriptomic and metabolomic (omics) datasets that allow for system-level studies of complex biological processes. The limitation lies in the small number of samples versus the much larger number of features represented in these datasets. Machine learning methods can help integrate these large-scale omics datasets and identify key features from each dataset. A novel class-dependent feature selection method integrates the F statistic, maximum relevance binary particle swarm optimization (MRBPSO), and a class-dependent multi-category classification (CDMC) system. A set of highly differentially expressed genes is pre-selected using the F statistic as a filter for each dataset. MRBPSO and CDMC function as a wrapper to select desirable feature subsets for each class and classify the samples using those chosen class-dependent feature subsets. The results indicate that the class-dependent approaches can effectively identify unique biomarkers for each cancer type and improve classification accuracy compared to class-independent feature selection methods. The integration of transcriptomics and metabolomics data is based on a classification framework. Compared to principal component analysis and non-negative matrix factorization based integration approaches, our proposed method achieves 20-30% higher prediction accuracies on Arabidopsis tissue development data. Metabolite-predictive genes and gene-predictive metabolites are selected from the transcriptomic and metabolomic data, respectively. The constructed gene-metabolite correlation network can infer the functions of unknown genes and metabolites. Tissue-specific genes and metabolites are identified by the class-dependent feature selection method. Evidence from subcellular locations, gene ontology, and biochemical pathways supports the involvement of these entities in different developmental stages and tissues in Arabidopsis.
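The class-dependent filtering step described above (a different F-statistic-ranked feature subset per class) can be sketched as below. This is a minimal illustration under assumptions: the one-vs-rest scheme, the top-k cutoff, and the function names are mine, and the MRBPSO/CDMC wrapper stage is not reproduced here.

```python
import numpy as np

def f_statistic(X, y):
    """One-way ANOVA F statistic per feature: between- over within-class variance."""
    classes = np.unique(y)
    n, p = X.shape
    grand = X.mean(0)
    ss_between = np.zeros(p)
    ss_within = np.zeros(p)
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(0) - grand) ** 2
        ss_within += ((Xc - Xc.mean(0)) ** 2).sum(0)
    df_b = len(classes) - 1
    df_w = n - len(classes)
    return (ss_between / df_b) / (ss_within / df_w)

def class_dependent_filter(X, y, top=10):
    """For each class, rank features by the one-vs-rest F statistic, yielding a
    (possibly different) pre-selected feature subset per class."""
    return {c: np.argsort(f_statistic(X, (y == c).astype(int)))[::-1][:top]
            for c in np.unique(y)}
```

A class-independent filter would pick one shared subset for all classes; the dictionary returned here instead carries a tailored subset per class, which is the property the abstract credits with finding class-unique biomarkers.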