129 research outputs found

    Choosing negative examples for the prediction of protein-protein interactions

    The protein-protein interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. This need has prompted the development of a number of methods for predicting protein-protein interactions based on various sources of data and methodologies. The common method for choosing negative examples for training a predictor of protein-protein interactions is based on annotations of cellular localization, and the observation that pairs of proteins that have different localization patterns are unlikely to interact. While this method leads to high-quality sets of non-interacting proteins, we find that this choice can lead to biased estimates of prediction accuracy, because the constraints placed on the distribution of the negative examples make the task easier. The effects of this bias are demonstrated in the context of both sequence-based and non-sequence-based features used for predicting protein-protein interactions.
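
The contrast described in this abstract can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the protein names, compartment annotations, and both helper functions are hypothetical, and the uniform random baseline is included for comparison rather than taken from the abstract.

```python
import random
from itertools import combinations


def localization_negatives(localization, positives):
    """Pairs with disjoint compartment annotations that are not known to interact."""
    negatives = []
    for a, b in combinations(sorted(localization), 2):
        if frozenset((a, b)) in positives:
            continue  # known interaction, never a negative
        if localization[a].isdisjoint(localization[b]):
            negatives.append((a, b))
    return negatives


def random_negatives(proteins, positives, k, seed=0):
    """Uniformly sampled pairs not known to interact (assumed unconstrained baseline)."""
    rng = random.Random(seed)
    candidates = [
        pair for pair in combinations(sorted(proteins), 2)
        if frozenset(pair) not in positives
    ]
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical toy annotations: protein -> set of compartments.
localization = {
    "P1": {"nucleus"},
    "P2": {"nucleus", "cytoplasm"},
    "P3": {"mitochondrion"},
    "P4": {"cytoplasm"},
}
positives = {frozenset(("P1", "P2"))}

loc_neg = localization_negatives(localization, positives)
rand_neg = random_negatives(localization, positives, k=3)
print(loc_neg)
print(rand_neg)
```

Every localization-based negative differs from the positives in a feature (compartment) that a classifier can exploit directly, which is one way to picture why the constrained negatives may make the task look easier than it is.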

    Probabilistic analysis of a differential equation for linear programming

    In this paper we address the complexity of solving linear programming problems with a set of differential equations that converge to a fixed point that represents the optimal solution. Assuming a probabilistic model, where the inputs are i.i.d. Gaussian variables, we compute the distribution of the convergence rate to the attracting fixed point. Using the framework of Random Matrix Theory, we derive a simple expression for this distribution in the asymptotic limit of large problem size. In this limit, we find that the distribution of the convergence rate is a scaling function, namely it is a function of one variable that is a combination of three parameters: the number of variables, the number of constraints and the convergence rate, rather than a function of these parameters separately. We also estimate numerically the distribution of computation times, namely the time required to reach a vicinity of the attracting fixed point, and find that it is also a scaling function. Using the problem size dependence of the distribution functions, we derive high probability bounds on the convergence rates and on the computation times. Comment: 1+37 pages, latex, 5 eps figures. Version accepted for publication in the Journal of Complexity. Changes made: Presentation reorganized for clarity, expanded discussion of measure of complexity in the non-asymptotic regime (added a new section

    Amino acid composition predicts prion activity

    Many prion-forming proteins contain glutamine/asparagine (Q/N) rich domains, and there are conflicting opinions as to the role of primary sequence in their conversion to the prion form: is this phenomenon driven primarily by amino acid composition, or is it, as a recent computational analysis suggested, dependent on the presence of short sequence elements with high amyloid-forming potential? The argument for the importance of short sequence elements hinged on the relatively high accuracy obtained using a method that utilizes a collection of length-six sequence elements with known amyloid-forming potential. We weigh in on this question and demonstrate that when those sequence elements are permuted, even higher accuracy is obtained; we also propose a novel multiple-instance machine learning method that uses sequence composition alone, and achieves better accuracy than all existing prion prediction approaches. While we expect there to be elements of primary sequence that affect the process, our experiments suggest that sequence composition alone is sufficient for predicting protein sequences that are likely to form prions. A web-server for the proposed method is available at http://faculty.pieas.edu.pk/fayyaz/prank.html, and the code for reproducing our experiments is available at http://doi.org/10.5281/zenodo.167136.
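
To make "sequence composition alone" concrete, here is a minimal sketch of a composition feature vector and its invariance under permutation. It is not the authors' multiple-instance method; the function names and the toy sequence are assumptions made purely for illustration.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Fraction of each standard amino acid in seq, in AMINO_ACIDS order."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def qn_fraction(seq):
    """Combined fraction of glutamine (Q) and asparagine (N)."""
    comp = dict(zip(AMINO_ACIDS, composition(seq)))
    return comp["Q"] + comp["N"]


# Hypothetical toy sequence, not taken from any real protein.
seq = "QNQNYGQQNNAS"
residues = list(seq)
random.Random(0).shuffle(residues)
shuffled = "".join(residues)

# Shuffling changes the order of residues but not their counts.
print(composition(seq) == composition(shuffled))
print(qn_fraction(seq))
```

Because the feature vector ignores residue order, any predictor built only on it gives the same answer for a sequence and for every permutation of that sequence.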

    SpliceGrapher: detecting patterns of alternative splicing from RNA-Seq data in the context of gene models and EST data

    We propose a method for predicting splice graphs that enhances curated gene models using evidence from RNA-Seq and EST alignments. Results obtained using RNA-Seq experiments in Arabidopsis thaliana show that predictions made by our SpliceGrapher method are more consistent with current gene models than predictions made by TAU and Cufflinks. Furthermore, analysis of plant and human data indicates that the machine learning approach used by SpliceGrapher is useful for discriminating between real and spurious splice sites, and can improve the reliability of detection of alternative splicing. SpliceGrapher is available for download at http://SpliceGrapher.sf.net.
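
For readers unfamiliar with the data structure, a splice graph is commonly represented as a directed graph whose nodes are exons and whose edges are the introns (splice junctions) joining them. The sketch below assumes that representation; it is not SpliceGrapher's API, and the function names and coordinates are hypothetical.

```python
from collections import defaultdict


def build_splice_graph(isoforms):
    """Map each exon (start, end) to the set of exons spliced directly downstream of it."""
    graph = defaultdict(set)
    for exons in isoforms:
        for exon in exons:
            graph[exon]  # touch the key so exons without successors still appear
        for upstream, downstream in zip(exons, exons[1:]):
            graph[upstream].add(downstream)
    return dict(graph)


def skipped_exons(graph):
    """Exons b for which both the path a -> b -> c and the direct edge a -> c exist."""
    skipped = set()
    for a, successors in graph.items():
        for b in successors:
            for c in graph.get(b, ()):
                if c in successors:
                    skipped.add(b)
    return skipped


# Two hypothetical isoforms of one gene; the second omits the middle exon.
isoforms = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (500, 600)],
]
graph = build_splice_graph(isoforms)
print(skipped_exons(graph))
```

Merging isoforms into one graph is what lets alternative splicing events, such as the skipped exon here, be read off as local patterns of edges.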

    Genome-wide analysis of alternative splicing in Chlamydomonas reinhardtii

    Background: Genome-wide computational analysis of alternative splicing (AS) in several flowering plants has revealed that pre-mRNAs from about 30% of genes undergo AS. Chlamydomonas, a simple unicellular green alga, is part of the lineage that includes land plants. However, it diverged from land plants about one billion years ago. Hence, it serves as a good model system to study alternative splicing in early photosynthetic eukaryotes, to obtain insights into the evolution of this process in plants, and to compare splicing in simple unicellular photosynthetic and non-photosynthetic eukaryotes. We performed a global analysis of alternative splicing in Chlamydomonas reinhardtii using its recently completed genome sequence and all available ESTs and cDNAs.
    Results: Our analysis of AS using BLAT and a modified version of the Sircah tool revealed AS of 498 transcriptional units with 611 events, representing about 3% of the total number of genes. As in land plants, intron retention is the most prevalent form of AS. Retained introns and skipped exons tend to be shorter than their counterparts in constitutively spliced genes. The splice site signals in all types of AS events are weaker than those in constitutively spliced genes. Furthermore, in alternatively spliced genes, the prevalent splice form has a stronger splice site signal than the non-prevalent form. Analysis of constitutively spliced introns revealed an over-abundance of motifs with simple repetitive elements in comparison to introns involved in intron retention. In almost all cases, AS results in a truncated ORF, leading to a coding sequence that is around 50% shorter than the prevalent splice form. Using RT-PCR we verified AS of two genes and show that they produce more isoforms than indicated by EST data. All cDNA/EST alignments and splice graphs are provided in a website at http://combi.cs.colostate.edu/as/chlamy.
    Conclusions: The extent of AS in Chlamydomonas that we observed is much smaller than observed in land plants, but is much higher than in simple unicellular heterotrophic eukaryotes. The percentage of different alternative splicing events is similar to flowering plants. Prevalence of constitutive and alternative splicing in Chlamydomonas, together with its simplicity, many available public resources, and well developed genetic and molecular tools for this organism make it an excellent model system to elucidate the mechanisms involved in regulated splicing in photosynthetic eukaryotes.

    Deciphering the Plant Splicing Code: Experimental and Computational Approaches for Predicting Alternative Splicing and Splicing Regulatory Elements

    Extensive alternative splicing (AS) of precursor mRNAs (pre-mRNAs) in multicellular eukaryotes increases the protein-coding capacity of a genome and allows novel ways to regulate gene expression. In flowering plants, up to 48% of intron-containing genes exhibit AS. However, the full extent of AS in plants is not yet known, as only a few high-throughput RNA-Seq studies have been performed. As the cost of obtaining RNA-Seq reads continues to fall, it is anticipated that huge amounts of plant sequence data will accumulate and help in obtaining a more complete picture of AS in plants. Although it is not an onerous task to obtain hundreds of millions of reads using high-throughput sequencing technologies, computational tools to accurately predict and visualize AS are still being developed and refined. This review will discuss the tools to predict and visualize transcriptome-wide AS in plants using short reads and highlight their limitations. Comparative studies of AS events between plants and animals have revealed that there are major differences in the most prevalent types of AS events, suggesting that plants and animals differ in the way they recognize exons and introns. Extensive studies have been performed in animals to identify cis-elements involved in regulating AS, especially in exon skipping. However, few such studies have been carried out in plants. Here, we review the current state of research on splicing regulatory elements (SREs) and briefly discuss emerging experimental and computational tools to identify cis-elements involved in regulation of AS in plants. The availability of curated alternative splice forms in plants makes it possible to use computational tools to predict SREs involved in AS regulation, which can then be verified experimentally. Such studies will permit identification of plant-specific features involved in AS regulation and contribute to deciphering the splicing code in plants.

    Probabilistic analysis of the phase space flow for linear programming

    The phase space flow of a dynamical system leading to the solution of Linear Programming (LP) problems is explored as an example of complexity analysis in an analog computation framework. An ensemble of LP problems with $n$ variables and $m$ constraints ($n > m$), where all elements of the vectors and matrices are normally distributed, is studied. The convergence time of a flow to the fixed point representing the optimal solution is computed. The cumulative distribution ${\cal F}^{(n,m)}(\Delta)$ of the convergence rate $\Delta_{min}$ to this point is calculated analytically, in the asymptotic limit of large $(n,m)$, in the framework of Random Matrix Theory. In this limit ${\cal F}^{(n,m)}(\Delta)$ is found to be a scaling function, namely it is a function of one variable that is a combination of $n$, $m$ and $\Delta$ rather than a function of these three variables separately. The distribution of the computation times is also calculated from numerical simulations and found to be a scaling function as well. Comment: 8 pages, latex, 2 eps figures; final published version