72 research outputs found
Compositional Mining of Multi-Relational Biological Datasets
High-throughput biological screens are yielding ever-growing streams of
information about multiple aspects of cellular activity. As more and more
categories of datasets come online, there is a corresponding multitude of ways
in which inferences can be chained across them, motivating the need for
compositional data mining algorithms. In this paper, we argue that such
compositional data mining can be effectively realized by functionally cascading
redescription mining and biclustering algorithms as primitives. Both these
primitives mirror shifts of vocabulary that can be composed in arbitrary ways
to create rich chains of inferences. Given a relational database and its
schema, we show how the schema can be automatically compiled into a
compositional data mining program, and how different domains in the schema can
be related through logical sequences of biclustering and redescription
invocations. This feature allows us to rapidly prototype new data mining
applications, yielding greater understanding of scientific datasets. We
describe two applications of compositional data mining: (i) matching terms
across categories of the Gene Ontology and (ii) understanding the molecular
mechanisms underlying stress response in human cells
Neural Biclustering in Gene Expression Analysis
Clustering in high dimensional spaces is a very difficult task. Dealing with DNA microarrays is even more difficult because gene subsets are coregulated and coexpressed only under specific conditions. Biclusterng addresses the problem of finding such submanifolds by exploiting both gene and condition (tissue) clustering. The paper proposes a self-organizing neural network, GH EXIN, which builds a hierarchical tree by adapting its architecture to data. It is integrated in a framework in which gene and tissue clustering are alternated and controlled by the quality of the bicluster. Examples of the approach and a biological validation of results are also given
A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series
<p>Abstract</p> <p>Background</p> <p>The ability to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses using gene expression time series, obtained from microarray experiments, is critical to advance our understanding of complex biological processes. In this context, biclustering algorithms have been recognized as an important tool for the discovery of local expression patterns, which are crucial to unravel potential regulatory mechanisms. Although most formulations of the biclustering problem are NP-hard, when working with time series expression data the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the design of efficient biclustering algorithms able to identify all maximal contiguous column coherent biclusters.</p> <p>Methods</p> <p>In this work, we propose <it>e</it>-CCC-Biclustering, a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series gene expression matrix. This polynomial time complexity is achieved by manipulating a discretized version of the original matrix using efficient string processing techniques. We also propose extensions to deal with missing values, discover anticorrelated and scaled expression patterns, and different ways to compute the errors allowed in the expression patterns. We propose a scoring criterion combining the statistical significance of expression patterns with a similarity measure between overlapping biclusters.</p> <p>Results</p> <p>We present results in real data showing the effectiveness of <it>e</it>-CCC-Biclustering and its relevance in the discovery of regulatory modules describing the transcriptomic expression patterns occurring in <it>Saccharomyces cerevisiae </it>in response to heat stress. In particular, the results show the advantage of considering approximate patterns when compared to state of the art methods that require exact matching of gene expression time series.</p> <p>Discussion</p> <p>The identification of co-regulated genes, involved in specific biological processes, remains one of the main avenues open to researchers studying gene regulatory networks. The ability of the proposed methodology to efficiently identify sets of genes with similar expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific regulatory mechanisms.</p> <p>Availability</p> <p>A prototype implementation of the algorithm coded in Java together with the dataset and examples used in the paper is available in <url>http://kdbio.inesc-id.pt/software/e-ccc-biclustering</url>.</p
Unsupervised Algorithms for Microarray Sample Stratification
The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. In order to be able to discover hidden patterns, the most disparate analytical techniques have been proposed. Here, we describe the basic methodologies to approach the analysis of microarray datasets that focus on the task of (sub)group discovery.Peer reviewe
A method for visual identification of small sample subgroups and potential biomarkers
In order to find previously unknown subgroups in biomedical data and generate
testable hypotheses, visually guided exploratory analysis can be of tremendous
importance. In this paper we propose a new dissimilarity measure that can be
used within the Multidimensional Scaling framework to obtain a joint
low-dimensional representation of both the samples and variables of a
multivariate data set, thereby providing an alternative to conventional
biplots. In comparison with biplots, the representations obtained by our
approach are particularly useful for exploratory analysis of data sets where
there are small groups of variables sharing unusually high or low values for a
small group of samples.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS460 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
- …