3,705 research outputs found
Analysis of the singular value decomposition as a tool for processing microarray expression data
We give two informative derivations of a spectral algorithm for clustering and partitioning a bi-partite graph. In the first case we begin with a discrete optimization problem that relaxes into a tractable continuous analogue. In the second case we use the power method to derive an iterative interpretation of the algorithm. Both versions reveal a natural approach for re-scaling the edge weights and help to explain the performance of the algorithm in the presence of outliers. Our motivation for this work is in the analysis of microarray data from bioinformatics, and we give some numerical results for a publicly available acute leukemia data set
Discovering bipartite substructure in directed networks
Bipartivity is an important network concept that can be applied to nodes, edges and communities. Here we focus on directed networks and look for subnetworks made up of two distinct groups of nodes, connected by “one-way” links. We show that a spectral approach can be used to find hidden substructure of this form. Theoretical support is given for the idealised case where there is limited overlap between subnetworks. Numerical experiments show that the approach is robust to spurious and missing edges. A key application of this work is in the analysis of high-throughput gene expression data, and we give an example where a biologically meaningful directed bipartite subnetwork is found from a cancer microarray dataset
A comparative study of different strategies of batch effect removal in microarray data: a case study of three datasets
Batch effects refer to the systematic non-biological variability that is introduced by experimental design and sample processing in microarray experiments. It is a common issue in microarray data and could introduce bias into the analysis, if ignored. Many batch effect removal methods have been developed. Previous comparative work has been focused on their effectiveness of batch effects removal and impact on downstream classification analysis. The most common type of analysis for microarray data is differential expression (DE) analysis, yet no study has examined the impact of these methods on downstream DE analysis, which identifies markers that are significantly associated with the outcome of interest. In this project, we investigated the performance of five popular batch effect removal methods, mean-centering, ComBat_p, ComBat_n, SVA, and ratio based methods, on batch effects reduction and their impact on DE analysis using three experimental datasets with different sources of batch effects. We found that the performance of these methods is data-dependent: simple mean-centering method performed reasonably well in all three datasets, but the more complicated algorithms such as ComBat method’s performance could be unstable for certain dataset and should be applied with caution. Given a new dataset, we recommend either using the mean-centering method or carefully investigating a few different batch removal methods and choosing the one that is the best for the data, if possible. This study has important public health significance because better handling of batch effect in microarray data can reduce biased results and lead to improved biomarker identification
Sparse integrative clustering of multiple omics data sets
High resolution microarrays and second-generation sequencing platforms are
powerful tools to investigate genome-wide alterations in DNA copy number,
methylation and gene expression associated with a disease. An integrated
genomic profiling approach measures multiple omics data types simultaneously in
the same set of biological samples. Such approach renders an integrated data
resolution that would not be available with any single data type. In this
study, we use penalized latent variable regression methods for joint modeling
of multiple omics data types to identify common latent variables that can be
used to cluster patient samples into biologically and clinically relevant
disease subtypes. We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996)
267-288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
301-320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
91-108] methods to induce sparsity in the coefficient vectors, revealing
important genomic features that have significant contributions to the latent
variables. An iterative ridge regression is used to compute the sparse
coefficient vectors. In model selection, a uniform design [Monographs on
Statistics and Applied Probability (1994) Chapman & Hall] is used to seek
"experimental" points that scattered uniformly across the search domain for
efficient sampling of tuning parameter combinations. We compared our method to
sparse singular value decomposition (SVD) and penalized Gaussian mixture model
(GMM) using both real and simulated data sets. The proposed method is applied
to integrate genomic, epigenomic and transcriptomic data for subtype analysis
in breast and lung cancer data sets.Comment: Published in at http://dx.doi.org/10.1214/12-AOAS578 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
- …