630 research outputs found

    SPRINT: Ultrafast protein-protein interaction prediction of the entire human interactome

    Proteins usually perform their functions by interacting with other proteins. Predicting which proteins interact is a fundamental problem. Experimental methods are slow, expensive, and have a high rate of error. Many computational methods have been proposed, among which sequence-based ones are very promising. However, so far no such method has been able to effectively predict the entire human interactome: they require too much time or memory. We present SPRINT (Scoring PRotein INTeractions), a new sequence-based algorithm and tool for predicting protein-protein interactions. We comprehensively compare SPRINT with state-of-the-art programs on the seven most reliable human PPI datasets and show that it is more accurate while running orders of magnitude faster and using very little memory. SPRINT is the only program that can predict the entire human interactome. Our goal is to transform the very challenging problem of predicting the entire human interactome into a routine task. The source code of SPRINT is freely available from github.com/lucian-ilie/SPRINT/ and the datasets and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/

    Protein Fingerprinting: A Domain-Free Approach to Protein Analysis

    An alternative method for analyzing proteins is proposed. Currently, protein search engines available on the internet utilize domains (predefined sequences of amino acids) to align proteins. The method presented converts a protein sequence using 1200 numeric codes, each representing a unique three-amino-acid sequence. Each numeric code starts with one of three specific amino acids, followed by any two additional amino acids. With the use of the FPC (FingerPrinted Contig) program, the total protein database (including “redundant” records) from the National Center for Biotechnology Information (NCBI) has been processed and placed into “bins/contigs” based on associations of these triplet codes. When analyzed with FPC, proteins are “contigged” together based on the number of shared fragments, regardless of order. These associations were supported by additional analysis with the standard BLASTP utility from NCBI. Within the created contig sets, there are numerous examples of proteins (allotypes and orthotypes) that have evolved into different, seemingly unrelated proteins. The power of this domain-free technique has yet to be explored; however, the ability to bin proteins together with no a priori knowledge of domains may prove a powerful tool in the characterization of the hundreds of thousands of available, yet undescribed, expressed protein and open reading frame sequences.
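The triplet coding described in this abstract can be illustrated with a minimal sketch. The count of 1200 follows from 3 anchor residues times 20 times 20 possible following residues. The abstract does not name the three anchor amino acids, so the `ANCHORS` value below is a hypothetical placeholder, and the function names `fingerprint` and `shared_fragments` are illustrative only. This is not the FPC program, just one way to count shared codes regardless of order.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
ANCHORS = "CHW"  # hypothetical choice: the abstract does not name the three residues

# One numeric code per triplet: an anchor residue followed by any two residues.
TRIPLET_CODES = {
    a + b + c: i
    for i, (a, b, c) in enumerate(product(ANCHORS, AMINO_ACIDS, AMINO_ACIDS))
}


def fingerprint(sequence):
    """Return the set of numeric codes for anchored triplets found in a sequence."""
    codes = set()
    for i in range(len(sequence) - 2):
        code = TRIPLET_CODES.get(sequence[i:i + 3])
        if code is not None:
            codes.add(code)
    return codes


def shared_fragments(seq_a, seq_b):
    """Count codes common to both sequences, ignoring where they occur."""
    return len(fingerprint(seq_a) & fingerprint(seq_b))
```

Because fingerprints are sets, two sequences that contain the same anchored triplets in a different order receive the same shared-fragment count, which matches the "regardless of order" behaviour the abstract describes.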

    BayesMotif: de novo protein sorting motif discovery from impure datasets

    Background: Protein sorting is the process by which newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals are amino acid sub-sequences usually located at the N-termini or C-termini of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals are needed to improve the understanding of protein sorting mechanisms. Methods: We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motif in which a highly conserved anchor is present along with less conserved motif regions. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences. Results: Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence datasets. They also show that the false positive removal procedure can help to identify true motifs even when only 20% of the input sequences contain true motif instances. Conclusion: We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features, such as hydrophobicity or charge of the motifs, which may help to overcome the limitations of the PWM (position weight matrix) motif model.

    Subsequence-based feature map for protein function classification

    Automated classification of proteins is indispensable for further in vivo investigation of the excessive number of unknown sequences generated by large-scale molecular biology techniques. This study describes a discriminative system based on feature space mapping, called subsequence profile map (SPMap), for functional classification of protein sequences. SPMap takes into account the information coming from the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences, which are clustered to obtain a representative feature space mapping. The mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information coming from important subregions that are conserved over a family of proteins while avoiding the difficult task of explicit motif identification. The performance of the method was assessed through tests on various protein classification tasks. Our results showed that SPMap is capable of high-accuracy classification in most of these tasks. Furthermore, SPMap is fast and scalable enough to handle large datasets. © 2007 Elsevier Ltd. All rights reserved
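The mapping step in this abstract (a protein represented by how its fixed-length subsequences distribute over clusters) can be sketched in simplified form. This is not the SPMap implementation: the clustering step is skipped and the cluster representatives are assumed to be given, plain Hamming distance stands in for whatever similarity measure the authors use, and all function names are hypothetical.

```python
from collections import Counter


def subsequences(sequence, length):
    """All overlapping fixed-length subsequences of a sequence."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def feature_vector(sequence, representatives):
    """Fraction of a sequence's subsequences assigned to each cluster.

    `representatives` is an assumed, precomputed list of equal-length
    strings, one per cluster. Ties go to the lowest cluster index.
    """
    length = len(representatives[0])
    counts = Counter()
    for sub in subsequences(sequence, length):
        nearest = min(
            range(len(representatives)),
            key=lambda k: hamming(sub, representatives[k]),
        )
        counts[nearest] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in range(len(representatives))]
```

The resulting fixed-size vectors could then be passed to any discriminative classifier, which is the role the abstract assigns to the feature space representation.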

    Nuclear Outsourcing of RNA Interference Components to Human Mitochondria

    MicroRNAs (miRNAs) are small non-coding RNAs that associate with Argonaute proteins to regulate gene expression at the post-transcriptional level in the cytoplasm. However, recent studies have reported that some miRNAs localize to and function in other cellular compartments. Mitochondria harbour their own genetic system that may be a potential site for miRNA-mediated post-transcriptional regulation. We aimed to investigate whether nuclear-encoded miRNAs can localize to and function in human mitochondria. To enable identification of mitochondrial-enriched miRNAs, we profiled the mitochondrial and cytosolic RNA fractions from the same HeLa cells by miRNA microarray analysis. Mitochondria were purified using a combination of cell fractionation and immunoisolation, and assessed for the lack of protein and RNA contaminants. We found 57 miRNAs differentially expressed in HeLa mitochondria and cytosol. Of these 57, a signature of 13 nuclear-encoded miRNAs was reproducibly enriched in mitochondrial RNA and validated by RT-PCR for hsa-miR-494, hsa-miR-1275 and hsa-miR-1974. The significance of their mitochondrial localization was investigated by characterizing their genomic context, cross-species conservation and intrinsic features such as their size and thermodynamic parameters. Interestingly, the specificities of mitochondrial versus cytosolic miRNAs were underlined by significantly different structural and thermodynamic parameters. Computational targeting analysis of most mitochondrial miRNAs revealed not only nuclear but also mitochondrial-encoded targets. The functional relevance of miRNAs in mitochondria was supported by the finding of Argonaute 2 localization to mitochondria revealed by immunoblotting and confocal microscopy, and further validated by the co-immunoprecipitation of the mitochondrial transcript COX3. This study provides the first comprehensive view of the localization of RNA interference components to the mitochondria. Our data outline the molecular bases for a novel layer of crosstalk between nucleus and mitochondria through a specific subset of human miRNAs that we termed ‘mitomiRs’.

    Motif Discovery in Protein Sequences

    Biology has become a data-intensive research field. Coping with the flood of data from the new genome sequencing technologies is a major area of research. The exponential increase in the size of the datasets produced by “next-generation sequencing” (NGS) poses unique computational challenges. In this context, motif discovery tools are widely used to identify important patterns in the sequences produced. Biological sequence motifs are defined as short, usually fixed-length, sequence patterns that may represent important structural or functional features in nucleic acid and protein sequences such as transcription binding sites, splice junctions, active sites, or interaction interfaces. They can occur in an exact or approximate form within a family or a subfamily of sequences. Motif discovery is therefore an important field in bioinformatics, and numerous methods have been developed for the identification of motifs shared by a set of functionally related sequences. This chapter will review the existing motif discovery methods for protein sequences and their ability to discover biologically important features as well as their limitations for the discovery of new motifs. Finally, we will propose new horizons for motif discovery in order to address the shortcomings of the existing methods.

    Determining significance of pairwise co-occurrences of events in bursty sequences

    Background: Event sequences where different types of events often occur close together arise, e.g., when studying potential transcription factor binding sites (TFBS, events) of certain transcription factors (TF, types) in a DNA sequence. These events tend to occur in bursts: in some genomic regions there are more genes and therefore potentially more binding sites, while in some, possibly very long regions, hardly any events occur. Also some types of events may occur in the sequence more often than others. Tendencies of co-occurrence of binding sites of two or more TFs are interesting, as they may imply a co-operative role between the TFs in regulatory processes. Determining a numerical value to summarize the tendency for co-occurrence between two TFs can be done in a number of ways. However, testing for the significance of such values should be done with respect to a relevant null model that takes into account the global sequence structure. Results: We extend the existing techniques that have been considered for determining the significance of co-occurrence patterns between a pair of event types under different null models. These models range from very simple ones to more complex models that take the burstiness of sequences into account. We evaluate the models and techniques on synthetic event sequences, and on real data consisting of potential transcription factor binding sites. Conclusion: We show that simple null models are poorly suited for bursty data, and they yield many false positives. More sophisticated models give better results in our experiments. We also demonstrate the effect of the window size, i.e., maximum co-occurrence distance, on the significance results.
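To make the notions of window size and null model concrete, here is a minimal sketch of a pairwise co-occurrence count with an empirical p-value. It deliberately uses the simplest kind of null model (positions of one event type redrawn uniformly at random), which is the kind the abstract reports as poorly suited to bursty data. It is not one of the authors' models, and the function names and parameters are assumptions made for illustration.

```python
import random


def cooccurrence_count(positions_a, positions_b, window):
    """Count pairs (a, b) whose positions differ by at most `window`."""
    return sum(1 for a in positions_a for b in positions_b if abs(a - b) <= window)


def empirical_p_value(positions_a, positions_b, window, length, trials=1000, seed=0):
    """Empirical p-value under a uniform-position null model (an assumption).

    Type-B events are redrawn uniformly over [0, length) in each trial,
    which ignores burstiness entirely.
    """
    rng = random.Random(seed)
    observed = cooccurrence_count(positions_a, positions_b, window)
    hits = 0
    for _ in range(trials):
        redrawn_b = [rng.randrange(length) for _ in positions_b]
        if cooccurrence_count(positions_a, redrawn_b, window) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

Widening `window` raises both the observed count and the counts under the null, so the resulting p-value can move in either direction, which is why the choice of maximum co-occurrence distance matters for the significance results.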

    Native homing endonucleases can target conserved genes in humans and in animal models

    In recent years, both homing endonucleases (HEases) and zinc-finger nucleases (ZFNs) have been engineered and selected for the targeting of desired human loci for gene therapy. However, enzyme engineering is lengthy and expensive, and the off-target effect of the manufactured endonucleases is difficult to predict. Moreover, enzymes selected to cleave a human DNA locus may not cleave the homologous locus in the genome of animal models because of sequence divergence, thus hampering attempts to assess the in vivo efficacy and safety of any engineered enzyme prior to its application in human trials. Here, we show that naturally occurring HEases can be found that cleave desirable human targets. Some of these enzymes are also shown to cleave the homologous sequence in the genome of animal models. In addition, the distribution of off-target effects may be more predictable for native HEases. Based on our experimental observations, we present the HomeBase algorithm, database and web server that allow a high-throughput computational search and assignment of HEases for the targeting of specific loci in the human and other genomes. We experimentally validate the predicted target specificity of candidate fungal, bacterial and archaeal HEases using cell-free, yeast and archaeal assays.