Gamma-based clustering via ordered means with application to gene-expression analysis
Discrete mixture models provide a well-known basis for effective clustering
algorithms, although technical challenges have limited their scope. In the
context of gene-expression data analysis, a model is presented that mixes over
a finite catalog of structures, each one representing equality and inequality
constraints among latent expected values. Computations depend on the
probability that independent gamma-distributed variables attain each of their
possible orderings. Each ordering event is equivalent to an event in
independent negative-binomial random variables, and this finding guides a
dynamic-programming calculation. The structuring of mixture-model components
according to constraints among latent means leads to strict concavity of the
mixture log likelihood. In addition to its beneficial numerical properties, the
clustering method shows promising results in an empirical study.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/, DOI: 10.1214/10-AOS805) by the Institute of Mathematical Statistics (http://www.imstat.org).
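The ordering probabilities at the heart of the method can be sanity-checked by simulation. The sketch below is a Monte Carlo stand-in, not the paper's exact negative-binomial dynamic program; the function name and defaults are illustrative:

```python
import random

def ordering_probability(shapes, rates, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(X1 < X2 < ... < Xk) for independent
    X_i ~ Gamma(shape=shapes[i], rate=rates[i])."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # gammavariate takes (shape, scale); scale = 1 / rate
        xs = [rng.gammavariate(a, 1.0 / b) for a, b in zip(shapes, rates)]
        if all(xs[i] < xs[i + 1] for i in range(len(xs) - 1)):
            hits += 1
    return hits / n_samples
```

For three i.i.d. gamma variables the orderings are exchangeable, so each of the 3! = 6 orderings should come out near 1/6.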
Probabilistic Clustering of Sequences: Inferring new bacterial regulons by comparative genomics
Genome-wide comparisons between enteric bacteria yield large sets of
conserved putative regulatory sites on a gene-by-gene basis that need to be
clustered into regulons. Using the assumption that regulatory sites can be
represented as samples from weight matrices we derive a unique probability
distribution for assignments of sites into clusters. Our algorithm, 'PROCSE'
(probabilistic clustering of sequences), uses Monte-Carlo sampling of this
distribution to partition and align thousands of short DNA sequences into
clusters. The algorithm internally determines the number of clusters from the
data, and assigns significance to the resulting clusters. We place theoretical
limits on the ability of any algorithm to correctly cluster sequences drawn
from weight matrices (WMs) when these WMs are unknown. Our analysis suggests
that the set of all putative sites for a single genome (e.g. E. coli) is
largely inadequate for clustering. When sites from different genomes are
combined and all the homologous sites from the various species are used as a
block, clustering becomes feasible. We predict 50-100 new regulons as well as
many new members of existing regulons, potentially doubling the number of known
regulatory sites in E. coli.
Comment: 27 pages including 9 figures and 3 tables
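A minimal sketch of the sampling idea, under simplifying assumptions the paper does not make: the number of clusters k is fixed rather than inferred from the data, sequences are pre-aligned and of equal length, and all names are illustrative. Each sequence is reassigned by Gibbs sampling under leave-one-out weight matrices:

```python
import math
import random

ALPHABET = "ACGT"

def pwm_from(seqs, length, pseudo=0.5):
    """Column-wise letter probabilities with pseudocounts (a weight matrix)."""
    counts = [{c: pseudo for c in ALPHABET} for _ in range(length)]
    for s in seqs:
        for j, c in enumerate(s):
            counts[j][c] += 1.0
    return [{c: counts[j][c] / sum(counts[j].values()) for c in ALPHABET}
            for j in range(length)]

def log_lik(seq, pwm):
    return sum(math.log(pwm[j][c]) for j, c in enumerate(seq))

def gibbs_cluster(seqs, k=2, sweeps=50, seed=1):
    """Monte Carlo (Gibbs) sampling of sequence-to-cluster assignments:
    each sequence is re-drawn in proportion to its likelihood under the
    leave-one-out weight matrix of each cluster."""
    rng = random.Random(seed)
    length = len(seqs[0])
    labels = [rng.randrange(k) for _ in seqs]
    for _ in range(sweeps):
        for i, s in enumerate(seqs):
            lls = []
            for c in range(k):
                members = [t for j, t in enumerate(seqs)
                           if labels[j] == c and j != i]
                lls.append(log_lik(s, pwm_from(members, length)))
            m = max(lls)
            weights = [math.exp(v - m) for v in lls]
            labels[i] = rng.choices(range(k), weights=weights)[0]
    # final deterministic argmax pass for a stable output
    for i, s in enumerate(seqs):
        lls = [log_lik(s, pwm_from([t for j, t in enumerate(seqs)
                                    if labels[j] == c and j != i], length))
               for c in range(k)]
        labels[i] = max(range(k), key=lambda c: lls[c])
    return labels
```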
Statistical inference from large-scale genomic data
This thesis explores the potential of statistical inference methodologies in their applications in functional genomics. In essence, it summarises algorithmic findings in this field, providing step-by-step analytical methodologies for deciphering biological knowledge from large-scale genomic data, mainly microarray gene expression time series.
This thesis covers a range of topics in the investigation of complex multivariate genomic data. One focus is using clustering as a method of inference; another is cluster validation to extract meaningful biological information from the data. Information gained from these techniques can then be used conjointly to elucidate gene regulatory networks, the ultimate goal of this type of analysis.
First, a new tight clustering method for gene expression data is proposed to obtain tighter and potentially more informative gene clusters. Next, to fully utilise biological knowledge in cluster validation, a validity index is defined based on the Gene Ontology, one of the most important ontologies within the bioinformatics community. The method bridges a gap in the current literature in that it accounts not only for the variation of Gene Ontology categories in biological specificity and their significance to the gene clusters, but also for the complex structure of the Gene Ontology itself. Finally, Bayesian probability is applied to inference from heterogeneous genomic data, integrated with the earlier contributions of the thesis, with the aim of large-scale gene network inference. The proposed system incorporates a stochastic process to achieve robustness to noise, yet remains efficient enough for large-scale analysis.
Ultimately, the solutions presented in this thesis serve as building blocks of an intelligent system for interpreting large-scale genomic data and understanding the functional organisation of the genome.
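As a loose illustration of the idea of a Gene Ontology-aware validity index (not the thesis's actual definition), one can credit each cluster with its best-matching GO term, weighted by the term's depth in the ontology as a crude proxy for biological specificity. All names and data structures here are hypothetical toys:

```python
def go_validity(clusters, annotations, depth):
    """Toy cluster-validity score: each cluster is credited with its
    best-matching GO term (fraction of the cluster annotated with the
    term), weighted by the term's depth as a stand-in for biological
    specificity. Illustrative only."""
    score = 0.0
    for genes in clusters:
        best = 0.0
        for term, annotated in annotations.items():
            frac = len(set(genes) & annotated) / len(genes)
            best = max(best, frac * depth[term])
        score += best
    return score / len(clusters)
```

A real index would also have to handle the DAG structure of the ontology (terms inherit annotations from descendants) and assess statistical significance of each match.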
Comparing high dimensional partitions, with the Coclustering Adjusted Rand Index
We consider the simultaneous clustering of rows and columns of a matrix and
more particularly the ability to measure the agreement between two
co-clustering partitions. The new criterion we develop is based on the
Adjusted Rand Index and is called the Co-clustering Adjusted Rand Index
(CARI). We also suggest improvements to existing criteria, such as the
Classification Error, which counts the proportion of misclassified cells, and
the Extended Normalized Mutual Information criterion, which generalizes the
mutual-information criterion used for classic classifications. We study these
criteria with regard to desired properties deriving from the co-clustering
context. Experiments on simulated and real observed data are proposed to
compare the behavior of these criteria.
Comment: 52 pages
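One natural reading of the CARI construction is to apply the ordinary Adjusted Rand Index at the cell level, each matrix cell inheriting the pair (row cluster, column cluster) it belongs to. The sketch below follows that reading and may differ in detail from the paper's definition:

```python
from collections import Counter
from math import comb

def ari(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same items."""
    n = len(labels_a)
    cont = Counter(zip(labels_a, labels_b))          # contingency table
    sum_cells = sum(comb(c, 2) for c in cont.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_cells - expected) / (max_index - expected)

def cari(rows_a, cols_a, rows_b, cols_b):
    """Co-clustering agreement: ARI over matrix cells, where cell (i, j)
    is labelled by (row cluster of i, column cluster of j)."""
    cells_a = [(r, c) for r in rows_a for c in cols_a]
    cells_b = [(r, c) for r in rows_b for c in cols_b]
    return ari(cells_a, cells_b)
```

Identical co-clusterings score 1, and any disagreement on either the row or the column partition lowers the score.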
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provable optimal recovery using the algorithm is analytically shown
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed.
Comment: 13 figures, 35 references
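A minimal sketch of the overall recipe, with soft k-means standing in for the finite mixture fit (names and defaults are illustrative): embed the nodes with the bottom eigenvectors of the symmetric normalized Laplacian, then compute fuzzy memberships in that eigenspace:

```python
import numpy as np

def laplacian_fuzzy_clusters(A, k=2, beta=10.0, iters=30):
    """Fuzzy memberships from soft k-means in the bottom-k eigenspace
    of the symmetric normalized graph Laplacian (a stand-in for a full
    mixture-model fit)."""
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - Dinv @ A @ Dinv
    _, vecs = np.linalg.eigh(L)          # eigenvalues ascending
    X = vecs[:, :k]                      # spectral embedding
    # deterministic init at the extremes of the Fiedler coordinate
    # (appropriate for k = 2; a real implementation would generalize)
    centers = X[[int(np.argmin(X[:, 1])), int(np.argmax(X[:, 1]))]]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        R = np.exp(-beta * d2)
        R /= R.sum(axis=1, keepdims=True)   # fuzzy membership matrix
        centers = (R.T @ X) / R.sum(axis=0)[:, None]
    return R
```

On a graph of two triangles joined by a single edge, the rows of R give soft memberships that peak on the correct triangle for every node.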
On clustering stability
JEL Classification: C100; C150; C380
This work is dedicated to the evaluation of the stability of clustering solutions, namely
the stability of crisp clusterings or partitions. We specifically refer to stability as the
concordance of clusterings across several samples. In order to evaluate stability, we use
a weighted cross-validation procedure, the result of which is summarized by the values
of simple and paired agreement indices. To exclude the amount of agreement by chance
from these values, we propose a new method, IADJUST, that resorts to simulated
cross-classification tables. This contribution makes viable the correction of any index
of agreement.
Experiments on stability rely on 540 simulated data sets, the design factors being the
number of clusters, their balance and their overlap. Six real data sets with a priori known
clusters are also considered. The experiments conducted illustrate the precision and
pertinence of the IADJUST procedure and characterize the distribution of the indices
under the hypothesis of agreement by chance. We therefore recommend that the use of
adjusted indices become common practice when addressing stability. We then compare the
stability of two clustering algorithms and conclude that Expectation-Maximization
(EM) results are more stable than K-means results on unbalanced data sets. Finally, we
explore the relationship between the stability and the external validity of a clustering
solution. When the results of all experimental scenarios are considered, there is a
strong correlation between stability and external validity. However, within a specific
experimental scenario (when a practical clustering task is considered), we find no
relationship between stability and agreement with ground truth.
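The IADJUST idea, correcting an arbitrary agreement index for chance via simulated cross-classification tables, can be sketched as follows. Here the null tables come from permuting one labelling with cluster sizes held fixed, and the raw Rand index serves as the example; names and defaults are illustrative:

```python
import random
from itertools import combinations

def rand_index(a, b):
    """Raw Rand index: proportion of item pairs on which two partitions
    agree (both together or both apart)."""
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i, j in combinations(range(len(a)), 2))
    return agree / (len(a) * (len(a) - 1) / 2)

def adjust_by_simulation(index_fn, a, b, n_sim=500, seed=0):
    """Chance-correct any agreement index: simulate its null distribution
    by permuting one labelling (cluster sizes fixed), then rescale so
    that 0 = chance-level agreement and 1 = perfect agreement."""
    rng = random.Random(seed)
    observed = index_fn(a, b)
    b_perm = list(b)
    null = []
    for _ in range(n_sim):
        rng.shuffle(b_perm)               # simulated cross-classification
        null.append(index_fn(a, b_perm))
    expected = sum(null) / n_sim
    return (observed - expected) / (1.0 - expected)
```

Because the correction only needs evaluations of `index_fn` on simulated tables, the same routine adjusts any agreement index, which is the point of the contribution.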
Methods for protein complex prediction and their contributions towards understanding the organization, function and dynamics of complexes
Complexes of physically interacting proteins constitute fundamental
functional units responsible for driving biological processes within cells. A
faithful reconstruction of the entire set of complexes is therefore essential
to understand the functional organization of cells. In this review, we discuss
the key contributions of computational methods developed to date
(approximately between 2003 and 2015) for identifying complexes from the
network of interacting proteins (PPI network). We evaluate in depth the
performance of these methods on PPI datasets from yeast, and highlight the
challenges they face, in particular the detection of sparse, small, or
sub-complexes and the discerning of overlapping complexes. We describe methods
for integrating diverse information including expression profiles and 3D
structures of proteins with PPI networks to understand the dynamics of complex
formation, for instance, of time-based assembly of complex subunits and
formation of fuzzy complexes from intrinsically disordered proteins. Finally,
we discuss methods for identifying dysfunctional complexes in human diseases,
an application that is proving invaluable to understand disease mechanisms and
to discover novel therapeutic targets. We hope this review aptly commemorates a
decade of research on computational prediction of complexes and constitutes a
valuable reference for further advancements in this exciting area.
Comment: 1 Table
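As a generic illustration of the density-based family of methods such reviews cover (an MCODE-flavoured toy, not any specific published algorithm; names and thresholds are illustrative), candidate complexes can be grown greedily from high-degree seed proteins while the induced subgraph stays dense:

```python
def dense_complexes(edges, min_density=0.8, min_size=3):
    """Greedy sketch of density-based complex detection in a PPI network:
    grow each seed protein's neighbourhood while the subgraph density
    (fraction of possible edges present) stays above a threshold."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def density(nodes):
        k = len(nodes)
        e = sum(len(adj[u] & nodes) for u in nodes) // 2
        return 2 * e / (k * (k - 1)) if k > 1 else 0.0

    complexes, used = [], set()
    for seed in sorted(adj, key=lambda u: -len(adj[u])):
        if seed in used:
            continue
        group = {seed}
        for cand in sorted(adj[seed]):
            if density(group | {cand}) >= min_density:
                group.add(cand)
        if len(group) >= min_size:
            complexes.append(group)
            used |= group
    return complexes
```

On a toy network of a 4-clique with a dangling chain, only the clique survives as a predicted complex; real methods add significance scoring and allow overlaps.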