Pairwise gene GO-based measures for biclustering of high-dimensional expression data
Background: Biclustering algorithms search for groups of genes that share the same
behavior under a subset of samples in gene expression data. Nowadays, the biological
knowledge available in public repositories can be used to drive these algorithms to
find biclusters composed of functionally coherent groups of genes. In addition,
a distance between genes can be defined from the information about them stored in the Gene
Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each
pair of genes which establishes their functional similarity. A scatter search-based
algorithm that optimizes a merit function that integrates GO information is studied in
this paper. This merit function includes a term that incorporates this information through a GO
measure.
Results: The effect of two different pairwise gene GO measures on the
performance of the algorithm is analyzed. First, three well-known yeast datasets with
approximately one thousand genes are studied. Second, a group of human
datasets related to clinical data of cancer is also explored by the algorithm. Most of
these data are high-dimensional datasets composed of a huge number of genes. The
resulting biclusters reveal groups of genes linked by the same functionality when the
search procedure is driven by one of the proposed GO measures. Furthermore, a
qualitative biological study of a group of biclusters shows their relevance from a cancer
disease perspective.
Conclusions: It can be concluded that the integration of biological information
improves the performance of the biclustering process. The two different GO measures
studied show an improvement in the results obtained for the yeast datasets. However, if
datasets are composed of a huge number of genes, only one of the measures improves
the algorithm performance. This second case constitutes a clear option to explore
interesting datasets from a clinical point of view.
Ministerio de Economía y Competitividad TIN2014-55894-C2-
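The abstract does not say which two GO measures were compared or how the merit function combines them with expression data, so the following is only an illustrative sketch. It assumes a simple Jaccard overlap of annotated GO term sets as the pairwise gene measure (a stand-in, not a semantic similarity measure from the paper) and a hypothetical additive merit in which lower values are better. The names `jaccard`, `mean_pairwise_go_similarity`, `merit`, and `weight`, and the placeholder term labels, are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(terms_a, terms_b):
    """Overlap of two GO term sets: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_go_similarity(genes, annotations):
    """Average pairwise similarity over all gene pairs in a bicluster."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0
    total = sum(
        jaccard(annotations.get(g, ()), annotations.get(h, ())) for g, h in pairs
    )
    return total / len(pairs)


def merit(expression_cost, genes, annotations, weight=0.5):
    """Hypothetical merit (lower is better): an expression-based cost plus a
    penalty that grows as the genes become functionally less similar."""
    go_penalty = 1.0 - mean_pairwise_go_similarity(genes, annotations)
    return expression_cost + weight * go_penalty


# Placeholder annotations; the term labels are not real GO identifiers.
annotations = {
    "g1": {"GO:A", "GO:B"},
    "g2": {"GO:A", "GO:B"},
    "g3": {"GO:C"},
}
coherent = merit(1.0, ["g1", "g2"], annotations)
incoherent = merit(1.0, ["g1", "g3"], annotations)
```

With the same expression cost, the bicluster whose genes share annotations receives the lower (better) merit, which is the kind of pressure a GO term in the merit function is meant to apply during the search.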
Profile Likelihood Biclustering
Biclustering, the process of simultaneously clustering the rows and columns
of a data matrix, is a popular and effective tool for finding structure in a
high-dimensional dataset. Many biclustering procedures appear to work well in
practice, but most do not have associated consistency guarantees. To address
this shortcoming, we propose a new biclustering procedure based on profile
likelihood. The procedure applies to a broad range of data modalities,
including binary, count, and continuous observations. We prove that the
procedure recovers the true row and column classes when the dimensions of the
data matrix tend to infinity, even if the functional form of the data
distribution is misspecified. The procedure requires a combinatorial
search, which can be expensive in practice. Rather than performing this search
directly, we propose a new heuristic optimization procedure based on the
Kernighan-Lin heuristic, which has nice computational properties and performs
well in simulations. We demonstrate our procedure with applications to
congressional voting records and microarray analysis.
Comment: 40 pages, 11 figures; R package in development at
https://github.com/patperry/biclustp
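The abstract does not spell out the criterion or the search, so the following is a minimal sketch under stated assumptions rather than the authors' procedure. It assumes a unit-variance Gaussian working likelihood, for which the profile log-likelihood (after maximizing over block means and dropping constants) is the sum over blocks of block size times squared block mean, divided by two. A plain one-label-at-a-time greedy search stands in for the Kernighan-Lin-based heuristic. The function names are hypothetical and are not taken from the R package.

```python
import numpy as np


def profile_objective(X, rows, cols, K, L):
    """Gaussian profile criterion: sum over blocks of size * mean**2 / 2,
    normalized by the number of matrix entries."""
    R = np.eye(K)[rows]                      # n x K one-hot row memberships
    C = np.eye(L)[cols]                      # m x L one-hot column memberships
    sums = R.T @ X @ C                       # K x L block sums
    sizes = np.outer(R.sum(axis=0), C.sum(axis=0))
    means = np.divide(sums, sizes, out=np.zeros_like(sums), where=sizes > 0)
    return float((sizes * means ** 2 / 2).sum() / X.size)


def greedy_bicluster(X, K, L, seed=0, max_sweeps=20):
    """Greedy local search: move one row or column label at a time whenever
    the move increases the criterion. A simple stand-in, not Kernighan-Lin."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    rows = rng.integers(K, size=n)
    cols = rng.integers(L, size=m)
    best = profile_objective(X, rows, cols, K, L)
    for _ in range(max_sweeps):
        improved = False
        for labels, size, n_classes in ((rows, n, K), (cols, m, L)):
            for i in range(size):
                keep = labels[i]
                for k in range(n_classes):
                    if k == keep:
                        continue
                    labels[i] = k            # try the move in place
                    val = profile_objective(X, rows, cols, K, L)
                    if val > best + 1e-12:
                        best, keep, improved = val, k, True
                labels[i] = keep             # keep the best label found
        if not improved:
            break
    return rows, cols, best
```

Because each accepted move strictly increases the criterion, the search terminates, but it can stop at a local optimum; restarting from several seeds is a common remedy. Other data modalities would replace the squared-mean term with the corresponding profile term.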
Minimax Structured Normal Means Inference
We provide a unified treatment of a broad class of noisy structure recovery
problems, known as structured normal means problems. In this setting, the goal
is to identify, from a finite collection of Gaussian distributions with
different means, the distribution that produced some observed data. Recent work
has studied several special cases including sparse vectors, biclusters, and
graph-based structures. We establish nearly matching upper and lower bounds on
the minimax probability of error for any structured normal means problem, and
we derive an optimality certificate for the maximum likelihood estimator, which
can be applied to many instantiations. We also consider an experimental design
setting, where we generalize our minimax bounds and derive an algorithm for
computing a design strategy with a certain optimality property. We show that
our results give tight minimax bounds for many structure recovery problems and
consider some consequences for interactive sampling
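As a small illustration of the setting described in this abstract, the sketch below assumes isotropic Gaussian noise with a known common variance. Under that assumption the maximum likelihood estimator selects the candidate mean closest to the observation in Euclidean distance. The function name `mle_structure` and the example collection of one-sparse mean vectors are assumptions made for illustration, not taken from the paper.

```python
import numpy as np


def mle_structure(y, means):
    """Return the index of the candidate mean nearest to y.

    With isotropic Gaussian noise, maximizing the likelihood over a finite
    collection of means is the same as minimizing squared Euclidean distance.
    """
    means = np.asarray(means, dtype=float)
    y = np.asarray(y, dtype=float)
    squared_distances = ((means - y) ** 2).sum(axis=1)
    return int(np.argmin(squared_distances))


# Hypothetical example: four candidate means, each with a single
# nonzero coordinate of strength 2.0 (a sparse-vector structure class).
candidate_means = 2.0 * np.eye(4)
observation = candidate_means[2] + np.array([0.3, -0.2, 0.1, 0.4])
estimate = mle_structure(observation, candidate_means)  # index 2
```

Other structure classes, such as biclusters of a matrix, fit the same pattern once each candidate is flattened into a mean vector; only the collection passed as `means` changes.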
- …