Inferring gene ontologies from pairwise similarity data.

Author: Bafna Vineet
Dutkowski Janusz
Ideker Trey
Kramer Michael
Yu Michael
Publication venue: eScholarship, University of California
Publication date: 01/06/2014

Motivation: While the manually curated Gene Ontology (GO) is widely used, inferring a GO directly from -omics data is a compelling new problem. Recognizing that ontologies are a directed acyclic graph (DAG) of terms and hierarchical relations, algorithms are needed that: analyze a full matrix of gene-gene pairwise similarities from -omics data; infer true hierarchical structure in these data rather than enforcing hierarchy as a computational artifact; and respect biological pleiotropy, by which a term in the hierarchy can relate to multiple higher level terms. Methods addressing these requirements are just beginning to emerge; none has been evaluated for GO inference.
Methods: We consider two algorithms [Clique Extracted Ontology (CliXO), LocalFitness] that uniquely satisfy these requirements, compared with methods including standard clustering. CliXO is a new approach that finds maximal cliques in a network induced by progressive thresholding of a similarity matrix. We evaluate each method's ability to reconstruct the GO biological process ontology from a similarity matrix based on (a) semantic similarities for GO itself or (b) three -omics datasets for yeast.
Results: For task (a) using semantic similarity, CliXO accurately reconstructs GO (>99% precision, recall) and outperforms other approaches (<20% precision, <20% recall). For task (b) using -omics data, CliXO outperforms other methods using two -omics datasets and achieves ∼30% precision and recall using YeastNet v3, similar to an earlier approach (Network Extracted Ontology) and better than LocalFitness or standard clustering (20-25% precision, recall).
Conclusion: This study provides algorithmic foundation for building gene ontologies by capturing hierarchical and pleiotropic structure embedded in biomolecular data.

PubMed Central

eScholarship - University of California
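
The abstract above describes CliXO as finding maximal cliques in a network induced by progressively thresholding a similarity matrix. The sketch below only illustrates that general idea on a toy matrix; it is not the CliXO algorithm itself. The function names, the toy similarity values, and the choice of thresholds are all assumptions made for illustration.

```python
from itertools import combinations

def threshold_graph(sim, threshold):
    """Build adjacency sets linking pairs whose similarity is >= threshold."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if sim[i][j] >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

def cliques_by_threshold(sim, thresholds):
    """Lower the threshold step by step and record cliques of size > 1."""
    result = {}
    for t in sorted(thresholds, reverse=True):
        found = maximal_cliques(threshold_graph(sim, t))
        result[t] = sorted(sorted(c) for c in found if len(c) > 1)
    return result

# Hypothetical 4-gene similarity matrix (symmetric, values made up).
sim = [
    [1.0, 0.9, 0.8, 0.2],
    [0.9, 1.0, 0.7, 0.3],
    [0.8, 0.7, 1.0, 0.6],
    [0.2, 0.3, 0.6, 1.0],
]
levels = cliques_by_threshold(sim, [0.85, 0.65, 0.5])
for t, cliques in levels.items():
    print(t, cliques)
```

In this toy run, gene 2 ends up in two different cliques at the lowest threshold, which is the kind of overlapping membership that a strict tree from standard hierarchical clustering cannot represent.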

Efficient seeding techniques for protein similarity search

Author: Furletova Eugenia
Gambin Anna
Kucherov Gregory
Lasota Slawomir
Noé Laurent
Roytberg Mikhail
Szczurek Ewa
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform an analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix.

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server
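
One simple way to picture the seed alphabets mentioned in the abstract above is to assume that each seed letter stands for a grouping of amino acids, and that a seed position accepts an aligned pair when both residues fall in the same group. The sketch below follows that assumption only. The three letters, the residue groups, and the example sequences are hypothetical and are not the alphabets constructed in the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical seed alphabet: each letter maps to a list of residue groups.
SEED_ALPHABET = {
    "#": [{a} for a in AMINO_ACIDS],           # identical residues only
    "@": [set("ILMV"), set("FWY"), set("KRH"), set("DE"), set("NQ"),
          set("ST"), set("AG"), {"C"}, {"P"}],  # made-up similarity groups
    "_": [set(AMINO_ACIDS)],                    # any pair is accepted
}

def accepts(letter, a, b):
    """True if residues a and b share a group under the given seed letter."""
    return any(a in group and b in group for group in SEED_ALPHABET[letter])

def seed_hits(seed, s, t):
    """Return all (i, j) where the seed accepts s[i:i+k] aligned to t[j:j+k]."""
    k = len(seed)
    hits = []
    for i in range(len(s) - k + 1):
        for j in range(len(t) - k + 1):
            if all(accepts(seed[p], s[i + p], t[j + p]) for p in range(k)):
                hits.append((i, j))
    return hits

hits = seed_hits("#@_#", "MKVLA", "MRLLA")
print(hits)
```

Every letter in this toy alphabet accepts identical residues, so an exact match always produces a hit; the letters differ only in how many mismatching pairs they also tolerate, which is where the sensitivity/selectivity trade-off comes from.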

A methodology for determining amino-acid substitution matrices from set covers

Author: A. Bahr
A.D. McLachlan
D.F. Feng
G. Vogt
G.H. Gonnet
J. Setubal
J.D. Blake
J.K.M. Rao
M. Gribskov
M.F. Sagot
R.B. Russell
R.E. Green
R.F. Smith
S. Henikoff
S.A. Benner
T. Müller
T.P. Li
W.S.J. Valdar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 05/04/2005

We introduce a new methodology for the determination of amino-acid substitution matrices for use in the alignment of proteins. The new methodology is based on a pre-existing set cover on the set of residues and on the undirected graph that describes residue exchangeability given the set cover. For fixed functional forms indicating how to obtain edge weights from the set cover and, after that, substitution-matrix elements from weighted distances on the graph, the resulting substitution matrix can be checked for performance against some known set of reference alignments and for given gap costs. Finding the appropriate functional forms and gap costs can then be formulated as an optimization problem that seeks to maximize the performance of the substitution matrix on the reference alignment set. We give computational results on the BAliBASE suite using a genetic algorithm for optimization. Our results indicate that it is possible to obtain substitution matrices whose performance is either comparable to or surpasses that of several others, depending on the particular scenario under consideration.

arXiv.org e-Print Archive

Crossref

Information based clustering

Author: Ashburner
Bowers
Brown
Eisen
G. S. Atwal
G. Tkacik
Gasch
N. Slonim
Segal
W. Bialek
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 25/11/2005

In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here we reformulate the clustering problem from an information theoretic perspective which avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster "prototype", does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures non-linear relations. We apply this approach to different domains and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures.

Comment: To appear in Proceedings of the National Academy of Sciences USA, 11 pages, 9 figures

arXiv.org e-Print Archive

Crossref

Cold Spring Harbor Laboratory Institutional Repository

PubMed Central

A Short Survey on Data Clustering Algorithms

Author: Wong Ka-Chun
Publication date: 25/11/2015

With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide the set of data instances into the subsets which maximize the intra-subset similarity and inter-subset dissimilarity, where a similarity measure is defined beforehand. In this work, the state-of-the-art clustering algorithms are reviewed from design concept to methodology. Different clustering paradigms are discussed. Advanced clustering algorithms are also discussed. After that, the existing clustering evaluation metrics are reviewed. A summary with future insights is provided at the end.

arXiv.org e-Print Archive

Crossref

Link communities reveal multiscale complexity in networks

Author: Ahn Yong-Yeol
Bagrow James P.
Lehmann Sune
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Networks have become a key approach to understanding systems of interacting objects, unifying the study of diverse phenomena including biological organisms and human society. One crucial step when studying the structure and dynamics of networks is to identify communities: groups of related nodes that correspond to functional subunits such as protein complexes or social spheres. Communities in networks often overlap such that nodes simultaneously belong to several groups. Meanwhile, many networks are known to possess hierarchical organization, where communities are recursively grouped into a hierarchical structure. However, the fact that many real networks have communities with pervasive overlap, where each and every node belongs to more than one group, has the consequence that a global hierarchy of nodes cannot capture the relationships between overlapping groups. Here we reinvent communities as groups of links rather than nodes and show that this unorthodox approach successfully reconciles the antagonistic organizing principles of overlapping communities and hierarchy. In contrast to the existing literature, which has entirely focused on grouping nodes, link communities naturally incorporate overlap while revealing hierarchical organization. We find relevant link communities in many networks, including major biological networks such as protein-protein interaction and metabolic networks, and show that a large social network contains hierarchically organized community structures spanning inner-city to regional scales while maintaining pervasive overlap. Our results imply that link communities are fundamental building blocks that reveal overlap and hierarchical organization in networks to be two aspects of the same phenomenon.

Comment: Main text and supplementary information

arXiv.org e-Print Archive

CiteSeerX

Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures

Author: De la Cruz Bernard J.
Ghahramani Zoubin
Rasmussen Carl Edward
Wild David L.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time-series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles, which extend previously published cluster analyses of these data.

Crossref

Warwick Research Archives Portal Repository

MPG.PuRe
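
The last abstract above uses the probability that two genes fall in the same cluster as a similarity, and derives a dissimilarity from it for hierarchical clustering. A minimal sketch of that bookkeeping follows, assuming the probability is estimated as the fraction of sampled partitions in which the two genes share a label, and assuming the dissimilarity is one minus that fraction. Both choices, the function name, and the four sampled labelings are illustrative assumptions; no DPM sampler is implemented here.

```python
def coclustering_probability(samples):
    """Fraction of sampled partitions in which each pair shares a cluster label."""
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]

# Hypothetical cluster labelings of 4 genes, standing in for posterior samples.
samples = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
prob = coclustering_probability(samples)

# Assumed dissimilarity: 1 - co-clustering probability.
dissim = [[1.0 - p for p in row] for row in prob]
for row in dissim:
    print(row)
```

A matrix like `dissim` is the kind of input that a standard linkage routine for hierarchical clustering expects, usually after conversion to that routine's condensed distance format.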