Singular value decomposition of large random matrices (for two-way classification of microarrays)
Asymptotic behavior of the singular value decomposition (SVD) of blown-up
matrices and normalized blown-up contingency tables exposed to Wigner-noise is
investigated. It is proved that such an m\times n matrix almost surely has a
constant number of large singular values (of order \sqrt{mn}), while the rest
of the singular values are of order \sqrt{m+n} as m,n\to\infty. Concentration
results of Alon et al. for the eigenvalues of large symmetric random matrices
are adapted to the rectangular case, and on this basis, almost sure results for
the singular values, as well as for the corresponding isotropic subspaces, are
proved. An algorithm, applicable to two-way classification of microarrays, is
also given that finds the underlying block structure.
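To make the scale dichotomy concrete, here is a minimal numerical sketch (not the paper's construction; the dimensions, 2x2 pattern matrix, and uniform noise are illustrative assumptions): build a rank-2 blown-up matrix, add bounded i.i.d. noise, and compare the singular values against the \sqrt{mn} and \sqrt{m+n} scales.

```python
# Minimal sketch: a rank-2 blown-up matrix plus bounded iid noise shows the
# singular value gap described above. All sizes and the 2x2 pattern matrix
# are illustrative assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
m, n = 600, 400
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])                      # 2x2 pattern matrix
B = np.kron(P, np.ones((m // 2, n // 2)))       # blown-up m x n matrix, rank 2
W = rng.uniform(-1, 1, size=(m, n))             # bounded "Wigner-type" noise
s = np.linalg.svd(B + W, compute_uv=False)

print("top singular values:", s[:3].round(1))
print("sqrt(m*n) =", round(np.sqrt(m * n), 1), " sqrt(m+n) =", round(np.sqrt(m + n), 1))
# Two singular values sit on the sqrt(m*n) scale; the bulk stays near sqrt(m+n).
```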
Transposable regularized covariance models with an application to missing data imputation
Missing data estimation is an important challenge with high-dimensional data
arranged in the form of a matrix. Typically this data matrix is transposable,
meaning that either the rows, columns or both can be treated as features. To
model transposable data, we present a modification of the matrix-variate
normal, the mean-restricted matrix-variate normal, in which the rows and
columns each have a separate mean vector and covariance matrix. By placing
additive penalties on the inverse covariance matrices of the rows and columns,
these so-called transposable regularized covariance models allow for maximum
likelihood estimation of the mean and nonsingular covariance matrices. Using
these models, we formulate EM-type algorithms for missing data imputation in
both the multivariate and transposable frameworks. We present theoretical
results exploiting the structure of our transposable models that allow these
models and imputation methods to be applied to high-dimensional data.
Simulations and results on microarray data and the Netflix data show that these
imputation techniques often outperform existing methods and offer a greater
degree of flexibility.
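As a rough illustration of the multivariate (one-sided) framework above, the sketch below runs an EM-style imputation loop with a ridge-regularized covariance; the penalty `lam`, the initial mean fill, and the conditional-mean E-step are simplifying assumptions, not the paper's mean-restricted transposable estimator.

```python
# EM-style Gaussian imputation sketch (one-sided, not the transposable
# model): fill missing entries with conditional means under a ridge-
# regularized covariance, then re-estimate. `lam` is an assumed penalty.
import numpy as np

def em_impute(X, lam=0.1, n_iter=50):
    """Impute NaNs in X (rows = samples, columns = variables)."""
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])        # initial fill
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        C = np.cov(X, rowvar=False) + lam * np.eye(X.shape[1])
        for i in np.unique(np.where(miss)[0]):             # rows with gaps
            mk, ob = miss[i], ~miss[i]
            # Conditional mean of missing given observed coordinates:
            w = np.linalg.solve(C[np.ix_(ob, ob)], X[i, ob] - mu[ob])
            X[i, mk] = mu[mk] + C[np.ix_(mk, ob)] @ w
    return X
```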
A review on initialization methods for nonnegative matrix factorization: Towards omics data experiments
Nonnegative Matrix Factorization (NMF) has acquired a relevant role in the panorama of knowledge extraction, thanks to the peculiarity that non-negativity applies to both bases and weights, which allows meaningful interpretations and is consistent with the natural human part-based learning process. Nevertheless, most NMF algorithms are iterative, so initialization methods affect convergence behaviour, the quality of the final solution, and NMF performance in terms of the residual of the cost function. Studies on the impact of NMF initialization techniques have been conducted for text or image datasets, but very few considerations can be found in the literature for biological datasets, even though NMF has amply demonstrated its usefulness in better understanding biological mechanisms with omic datasets. This paper presents the state-of-the-art of NMF initialization schemes, along with some initial considerations on the impact of initialization methods when microarrays (a simple instance of omic data) are evaluated with NMF mechanisms. Using a series of measures to qualitatively examine the biological information extracted by a given NMF scheme, it preliminarily appears that some information (e.g., that represented by genes) can be extracted regardless of the initialization scheme used.
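As a quick, hedged illustration of how the initialization scheme changes the residual, the snippet below compares scikit-learn's built-in initializers (random versus the SVD-based NNDSVD variants) on random stand-in data; the matrix and component count are arbitrary assumptions rather than a real omics experiment.

```python
# Compare NMF initialization schemes by final reconstruction residual.
# X is a random stand-in for a genes x samples expression matrix.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((200, 50))                 # toy data, not real microarrays

for init in ("random", "nndsvd", "nndsvda"):
    model = NMF(n_components=5, init=init, max_iter=500, random_state=0)
    W = model.fit_transform(X)            # nonnegative basis ("metagenes")
    H = model.components_                 # nonnegative weights
    print(f"{init:8s} residual: {model.reconstruction_err_:.4f}")
```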
Noise and nonlinearities in high-throughput data
High-throughput data analyses are becoming common in biology, communications,
economics and sociology. The vast amounts of data are usually represented in
the form of matrices and can be considered as knowledge networks. Spectra-based
approaches have proved useful in extracting hidden information within such
networks and for estimating missing data, but these methods are based
essentially on linear assumptions. Physical models of matching, when
applicable, often suggest non-linear mechanisms that may be mistaken for
noise under a linear analysis. The use of non-linear models in data analysis,
however, may require the introduction of many parameters, which lowers the
statistical weight of the model. Depending on the quality of the data, a
simpler linear analysis may therefore be preferable to more complex approaches.
In this paper, we show how a simple non-parametric Bayesian model may be used
to explore the role of non-linearities and noise in synthetic and experimental
data sets.
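For reference, the linear, spectra-based missing-data estimation the abstract contrasts against can be sketched in a few lines; the rank, iteration count, and initial fill below are illustrative assumptions, and this is not the paper's Bayesian model.

```python
# Baseline linear method: iterative rank-k SVD completion of missing data.
import numpy as np

def svd_impute(X, k=2, n_iter=100):
    """Fill NaNs in X by repeated rank-k SVD truncation (a linear model)."""
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X), X)    # crude initial fill
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k]   # best rank-k approximation
        filled[miss] = low_rank[miss]            # refresh missing cells only
    return filled
```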
Spectral Sequence Motif Discovery
Sequence discovery tools play a central role in several fields of
computational biology. In the framework of Transcription Factor binding
studies, motif finding algorithms of increasingly high performance are required
to process the big datasets produced by new high-throughput sequencing
technologies. Most existing algorithms are computationally demanding and often
cannot support the large size of new experimental data. We present a new motif
discovery algorithm that is built on a recent machine learning technique,
referred to as the Method of Moments. Based on spectral decompositions, this method
is robust under model misspecification and is not prone to locally optimal
solutions. We obtain an algorithm that is extremely fast and designed for the
analysis of big sequencing data. In a few minutes, we can process datasets of
hundreds of thousands of sequences and extract motif profiles that match those
computed by various state-of-the-art algorithms.
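The paper's moment-based algorithm is more involved, but the underlying spectral idea can be illustrated on synthetic data: one-hot encode aligned sequence windows and read a consensus off the leading singular vector. Everything below (the planted motif, mutation rate, and decoding rule) is an assumption made for illustration, not the authors' method.

```python
# Toy spectral motif recovery: SVD of one-hot encoded windows carrying a
# noisy planted motif. The leading right singular vector tracks the mean
# base profile, from which a consensus can be read off.
import numpy as np

rng = np.random.default_rng(0)
bases = "ACGT"
motif = "TATAAT"                              # hypothetical planted motif
L = len(motif)

# 5000 windows: each motif position mutates to a uniform base w.p. 0.2.
seqs = [[c if rng.random() > 0.2 else bases[rng.integers(4)] for c in motif]
        for _ in range(5000)]

X = np.zeros((len(seqs), L * 4))              # one-hot encoding, L*4 dims
for i, s in enumerate(seqs):
    for j, c in enumerate(s):
        X[i, j * 4 + bases.index(c)] = 1.0

_, _, Vt = np.linalg.svd(X, full_matrices=False)
v = Vt[0] * np.sign(Vt[0].sum())              # fix SVD sign ambiguity
profile = v.reshape(L, 4)                     # position x base profile
print("recovered consensus:", "".join(bases[row.argmax()] for row in profile))
```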
Towards large scale continuous EDA: a random matrix theory perspective
Estimation of distribution algorithms (EDA) are a major branch of evolutionary algorithms (EA) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging, and as a result existing EDAs lose their strengths on large-scale problems.
Large-scale continuous global optimisation is key to many modern real-world problems. Scaling up EAs to large-scale problems has become one of the biggest challenges in the field.
This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective EDA-type algorithms for large-scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections of the set of fittest search points to low dimensions as a basis for developing a new and generic divide-and-conquer methodology. This is rooted in the theory of random projections developed in theoretical computer science, and it exploits recent advances in non-asymptotic random matrix theory.
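A toy sketch of that random-projection ensemble idea (not the authors' exact algorithm; the objective, population sizes, and regularizer are assumptions): model the fittest points in several random low-dimensional subspaces, sample in each, and scale the pooled back-projections so the implied covariance is the ensemble average.

```python
# Toy random-projection ensemble EDA on a sphere objective. Each generation
# fits a Gaussian to the elites in M random k-dim subspaces, samples there,
# and back-projects; dividing the pooled sum by sqrt(M) makes the combined
# covariance the ensemble average of the back-projected covariances.
import numpy as np

rng = np.random.default_rng(0)
d, k, M = 100, 3, 10                          # ambient dim, subspace dim, ensemble
sphere = lambda X: (X ** 2).sum(axis=1)       # toy objective to minimise

pop = rng.normal(0, 5, size=(200, d))
for _ in range(50):
    elite = pop[np.argsort(sphere(pop))[:50]]             # fittest points
    mu = elite.mean(axis=0)
    pooled = np.zeros_like(pop)
    for _ in range(M):
        R = rng.normal(size=(k, d)) / np.sqrt(k)          # random projection
        C = np.cov((elite - mu) @ R.T, rowvar=False) + 1e-6 * np.eye(k)
        Z = rng.multivariate_normal(np.zeros(k), C, size=len(pop))
        pooled += Z @ R                                   # back-project samples
    pop = mu + pooled / np.sqrt(M)
print("best objective value:", sphere(pop).min().round(6))
```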