9,194 research outputs found

    Enhancing the functional content of protein interaction networks

    Protein interaction networks are a promising type of data for studying complex biological systems. However, despite the rich information embedded in these networks, they suffer from noise and incompleteness, data quality problems that adversely affect the results obtained from their analysis. Here, we explore the use of common neighborhood similarity (CNS), a form of local structure in networks, to address these issues. Although several CNS measures have been proposed in the literature, an understanding of their relative efficacies for the analysis of interaction networks has been lacking. We follow the framework of graph transformation to convert the given interaction network into a transformed network for each of the CNS measures evaluated. The effectiveness of each measure is then estimated by comparing the quality of protein function predictions obtained from its corresponding transformed network with those from the original network. Using a large set of S. cerevisiae interactions and a set of 136 GO terms, we find that several of the transformed networks produce more accurate predictions than those obtained from the original network. In particular, the HC.cont measure proposed here performs especially well for this task. Further investigation reveals that the two major factors contributing to this improvement are the abilities of CNS measures, especially HC.cont, to prune out noisy edges and introduce new links between functionally related proteins.
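
    The graph transformation idea in this abstract can be sketched as follows. This is only an illustration: it uses the Jaccard coefficient of the two proteins' neighbor sets as a stand-in CNS measure, not the HC.cont measure from the paper, and the function name, the threshold, and the toy edges are assumptions made for the example.

```python
from itertools import combinations


def jaccard_transform(edges, threshold=0.6):
    """Build a transformed network from an undirected edge list.

    Every pair of nodes is scored by the Jaccard coefficient of their
    neighbor sets (a simple stand-in CNS measure). Pairs scoring at or
    above `threshold` become weighted edges of the transformed network,
    so weakly supported original edges are dropped and well supported
    non-edges are added.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    transformed = {}
    for u, v in combinations(sorted(neighbors), 2):
        shared = neighbors[u] & neighbors[v]
        # Exclude the pair itself so a direct edge does not dilute the score.
        union = (neighbors[u] | neighbors[v]) - {u, v}
        if not union:
            continue
        score = len(shared) / len(union)
        if score >= threshold:
            transformed[(u, v)] = score
    return transformed


# Toy network (hypothetical): A and B share partners C and D but do not interact.
toy_edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("A", "E")]
toy_result = jaccard_transform(toy_edges)
```

    In this toy network the original edges have no shared neighbors and are pruned, while the pairs A-B and C-D, which share all or most of their partners, are linked in the transformed network.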

    Predicting Genetic Regulatory Response Using Classification

    We present a novel classification-based method for learning to predict gene regulatory response. Our approach is motivated by the hypothesis that in simple organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular experiment based on (1) the presence of binding site subsequences ("motifs") in the gene's regulatory region and (2) the expression levels of regulators such as transcription factors in the experiment ("parents"). Thus our learning task integrates two qualitatively different data sources: genome-wide cDNA microarray data across multiple perturbation and mutant experiments along with motif profile data from regulatory sequences. We convert the regression task of predicting real-valued gene expression measurement to a classification task of predicting +1 and -1 labels, corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. The learning algorithm employed is boosting with a margin-based generalization of decision trees, alternating decision trees. This large-margin classifier is sufficiently flexible to allow complex logical functions, yet sufficiently simple to give insight into the combinatorial mechanisms of gene regulation. We observe encouraging prediction accuracy on experiments based on the Gasch S. cerevisiae dataset, and we show that we can accurately predict up- and down-regulation on held-out experiments. Our method thus provides predictive hypotheses, suggests biological experiments, and provides interpretable insight into the structure of genetic regulatory networks. Comment: 8 pages, 4 figures, presented at Twelfth International Conference on Intelligent Systems for Molecular Biology (ISMB 2004), supplemental website: http://www.cs.columbia.edu/compbio/geneclas
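
    The conversion from a regression target to +1 and -1 labels described in this abstract might look like the sketch below. The noise threshold of 1.0, the use of 0 for measurements inside the noise band, and the function name are assumptions for illustration; the abstract does not state how the noise level is chosen.

```python
def discretize_expression(values, noise=1.0):
    """Map real-valued expression measurements to class labels.

    +1 marks up-regulation above the noise band, -1 marks
    down-regulation below it, and 0 marks values inside the band
    (assumed here to be left out of training).
    """
    labels = []
    for x in values:
        if x > noise:
            labels.append(1)
        elif x < -noise:
            labels.append(-1)
        else:
            labels.append(0)
    return labels


# Hypothetical log-ratio measurements for four genes in one experiment.
example_labels = discretize_expression([2.3, -1.7, 0.4, -0.2])
```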

    A Computational Algebra Approach to the Reverse Engineering of Gene Regulatory Networks

    This paper proposes a new method to reverse engineer gene regulatory networks from experimental data. The modeling framework used is time-discrete deterministic dynamical systems, with a finite set of states for each of the variables. The simplest examples of such models are Boolean networks, in which variables have only two possible states. The use of a larger number of possible states allows a finer discretization of experimental data and more than one possible mode of action for the variables, depending on threshold values. Furthermore, with a suitable choice of state set, one can employ powerful tools from computational algebra, which underlie the reverse-engineering algorithm and avoid costly enumeration strategies. To perform well, the algorithm requires wildtype together with perturbation time courses. This makes it suitable for small to meso-scale networks rather than networks on a genome-wide scale. The complexity of the algorithm is quadratic in the number of variables and cubic in the number of time points. The algorithm is validated on a recently published Boolean network model of segment polarity development in Drosophila melanogaster. Comment: 28 pages, 5 EPS figures, uses elsart.cl
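
    As a concrete picture of the modeling framework named in this abstract, the sketch below simulates a Boolean network as a time-discrete deterministic dynamical system with synchronous updates. It shows only the forward model, not the reverse-engineering algorithm, and the three update rules are hypothetical.

```python
def step(state, rules):
    """Apply every update rule to the current state at once (synchronous update)."""
    return tuple(rule(state) for rule in rules)


def trajectory(state, rules, n_steps):
    """Return the list of states visited, starting from `state`."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        states.append(state)
    return states


# Hypothetical three-gene network; each variable takes the value 0 or 1.
toy_rules = (
    lambda s: s[2],               # x0 follows x2
    lambda s: s[0] & (1 - s[2]),  # x1 is on when x0 is on and x2 is off
    lambda s: s[0] | s[1],        # x2 is on when x0 or x1 is on
)

time_course = trajectory((1, 0, 0), toy_rules, 3)
```

    Starting from (1, 0, 0), this toy system reaches the fixed point (1, 0, 1) after two steps. A time course of this kind is the type of data a reverse-engineering method would take as input.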

    Toward a multilevel representation of protein molecules: comparative approaches to the aggregation/folding propensity problem

    This paper builds upon the fundamental work of Niwa et al. [34], which offers a unique opportunity to analyze the relative aggregation/folding propensity of the elements of the entire Escherichia coli (E. coli) proteome in a cell-free standardized microenvironment. The hardness of the problem comes from the superposition of the driving forces of intra- and inter-molecule interactions, and it is mirrored by the evidence of shifts from folding to aggregation phenotypes caused by single-point mutations [10]. Here we apply several state-of-the-art classification methods from the field of structural pattern recognition, with the aim of comparing different representations of the same proteins gathered from the Niwa et al. database; such representations include sequences and labeled (contact) graphs enriched with chemico-physical attributes. Through this comparison, we are also able to identify some interesting general properties of proteins. Notably, (i) we suggest a threshold of around 250 residues discriminating "easily foldable" from "hardly foldable" molecules, consistent with other independent experiments, and (ii) we highlight the relevance of contact graph spectra for folding behavior discrimination and characterization of the E. coli solubility data. The soundness of the experimental results presented in this paper is proved by the statistically relevant relationships discovered among the chemico-physical description of proteins and the developed cost matrix of substitution used in the various discrimination systems. Comment: 17 pages, 3 figures, 46 reference

    Refined Genetic Algorithms for Polypeptide Structure Prediction

    Accurate and reliable prediction of macromolecular structures has eluded researchers for nearly 40 years. Prediction via energy minimization assumes the native conformation has the globally minimal energy potential. An exhaustive search is impossible since, for molecules of normal size, the size of the search space exceeds the size of the universe. Domain knowledge sources, such as the Brookhaven PDB, can be mined for constraints to limit the search space. Genetic algorithms (GAs) are stochastic, population-based search algorithms of polynomial (P) time complexity that can produce semi-optimal solutions for problems of nondeterministic polynomial (NP) time complexity such as PSP. Three refined GAs are presented. A farming-model parallel hybrid GA (PHGA) preserves the effectiveness of the serial algorithm with substantial speedup. Portability across distributed and MPP platforms is accomplished with the Message Passing Interface (MPI) communications standard. A real-valued GA system, real-valued Genetic Algorithm, Limited by constraints (REGAL), exploits domain knowledge. Experiments with the pentapeptide Met-enkephalin have identified conformers with lower energies (CHARMM) than the accepted optimal conformer (Scheraga et al.), -31.98 vs -28.96 kcal/mol. Analysis of exogenous parameters yields additional insight into performance. A parallel version (Para-REGAL) is an island model modified to allow different active constraints in the distributed subpopulations and the novel concepts of Probability of Migration and Probability of Complete Migration.

    Variational Inference for Stochastic Block Models from Sampled Data

    This paper deals with non-observed dyads during the sampling of a network and the consequent issues in the inference of the Stochastic Block Model (SBM). We review sampling designs and recover Missing At Random (MAR) and Not Missing At Random (NMAR) conditions for the SBM. We introduce variants of the variational EM algorithm for inferring the SBM under various sampling designs (MAR and NMAR), all available as an R package. Model selection criteria based on Integrated Classification Likelihood are derived for selecting both the number of blocks and the sampling design. We investigate the accuracy and the range of applicability of these algorithms with simulations. We explore two real-world networks from ethnology (seed circulation network) and biology (protein-protein interaction network), where the interpretations depend considerably on the sampling designs considered.