879 research outputs found

    SeqNLS: Nuclear Localization Signal Prediction Based on Frequent Pattern Mining and Linear Motif Scoring

    Nuclear localization signals (NLSs) are stretches of residues in proteins that mediate their import into the nucleus. NLSs are known to have diverse patterns, of which only a limited number are covered by currently known NLS motifs. Here we propose a sequential pattern mining algorithm, SeqNLS, to effectively identify potential NLS patterns without being constrained by the limitations of current knowledge of NLSs. The extracted frequent sequential patterns are used to predict NLS candidates, which are then filtered by a linear motif-scoring scheme based on predicted sequence disorder and by the relatively local conservation (IRLC) based masking. Experimental results on the newly curated Yeast and Hybrid datasets show that SeqNLS is effective in detecting potential NLSs. The performance comparison between SeqNLS with and without the linear motif scoring shows that linear motif features are highly complementary to sequence features in discerning NLSs. For the two independent datasets, SeqNLS not only consistently finds over 50% of NLSs with prediction precision of at least 0.7, but also outperforms other state-of-the-art NLS prediction methods in terms of F1 score or prediction precision with similar or higher recall rates. The web server of the SeqNLS algorithm is available at http://mleg.cse.sc.edu/seqNLS

    Recognition of short functional motifs in protein sequences

    The main goal of this study was to develop a method for computational de novo prediction of short linear motifs (SLiMs) in protein sequences that would offer users advantages over existing solutions. The users are typically biological laboratory researchers who want to elucidate a protein function that is possibly mediated by a short motif. Such a function can be subcellular localization, secretion, post-translational modification or degradation of proteins. Conducting such studies with experimental techniques alone is often associated with high costs and risks of uncertainty. Preliminary prediction of putative motifs with computational methods, which are fast and much less expensive, makes it possible to generate hypotheses and therefore to plan experiments in a more directed and efficient way. To meet this goal, I have developed HH-MOTiF, a web-based tool for de novo discovery of SLiMs in a set of protein sequences. While working on the project, I have also detected patterns in sequence properties of certain SLiMs that make their de novo prediction easier. As some of these patterns are not yet described in the literature, I am sharing them in this thesis. While evaluating and comparing motif prediction results, I have identified conceptual gaps in theoretical studies, as well as in existing practical solutions, for comparing two sets of positional data annotating the same set of biological sequences. To close these gaps and to be able to carry out in-depth performance analyses of HH-MOTiF in comparison to other predictors, I have developed a corresponding statistical method, SLALOM (for StatisticaL Analysis of Locus Overlap Method). It is currently available as a standalone command line tool.

    Evidence for the Concerted Evolution between Short Linear Protein Motifs and Their Flanking Regions

    BACKGROUND: Linear motifs are short modules of protein sequences that play a crucial role in mediating and regulating many protein-protein interactions. The function of linear motifs strongly depends on the context, e.g. functional instances mainly occur inside flexible regions that are accessible for interaction. Sometimes linear motifs appear as isolated islands of conservation in multiple sequence alignments. However, they also occur in larger blocks of sequence conservation, suggesting an active role for the neighbouring amino acids. RESULTS: The evolution of regions flanking 116 functional linear motif instances was studied. The conservation of the amino acid sequence and order/disorder tendency of those regions was related to presence/absence of the instance. For the majority of the analysed instances, the pairs of sequences conserving the linear motif were also observed to maintain a similar local structural tendency and/or to have higher local sequence conservation when compared to pairs of sequences where one is missing the linear motif. Furthermore, those instances have a higher chance to co-evolve with the neighbouring residues in comparison to the distant ones. Those findings are supported by examples where the regulation of the linear motif-mediated interaction has been shown to depend on the modifications (e.g. phosphorylation) at neighbouring positions or is thought to benefit from the binding versatility of disordered regions. CONCLUSION: The results suggest that flanking regions are relevant for linear motif-mediated interactions, both at the structural and sequence level. More interestingly, they indicate that the prediction of linear motif instances can be enriched with contextual information by performing a sequence analysis similar to the one presented here. This can facilitate the understanding of the role of these predicted instances in determining the protein function inside the broader context of the cellular network where they arise

    A tree-based conservation scoring method for short linear motifs in multiple alignments of protein sequences

    BACKGROUND: The structure of many eukaryotic cell regulatory proteins is highly modular. They are assembled from globular domains, segments of natively disordered polypeptides and short linear motifs. The latter are involved in protein interactions and formation of regulatory complexes. The function of such proteins, which may be difficult to define, is the aggregate of the subfunctions of the modules. It is therefore desirable to efficiently predict linear motifs with some degree of accuracy, yet sequence database searches return results that are not significant. RESULTS: We have developed a method for scoring the conservation of linear motif instances. It requires only primary sequence-derived information (e.g. multiple alignment and sequence tree) and takes into account the degenerate nature of linear motif patterns. On our benchmarking, the method accurately scores 86% of the known positive instances, while distinguishing them from random matches in 78% of the cases. The conservation score is implemented as a real time application designed to be integrated into other tools. It is currently accessible via a Web Service or through a graphical interface. CONCLUSION: The conservation score improves the prediction of linear motifs, by discarding those matches that are unlikely to be functional because they have not been conserved during the evolution of the protein sequences. It is especially useful for instances in non-structured regions of the proteins, where a domain masking filtering strategy is not applicable.

    Computational Analysis and Prediction of Genome-Wide Protein Targeting Signals and Localization

    Computational prediction of a protein's subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve the prediction performance, which usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by the running complexity and redundancy among individual prediction algorithms. In the first part of the dissertation, we propose a novel method for rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature selection based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of the individual predictors of current ensemble algorithms, which greatly reduces computational complexity and running time. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from an AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. In the second part of the dissertation, we propose a computational method, SeqNLS, to predict nuclear localization signals (NLSs). The major difficulty of NLS prediction is that NLSs are known to have diverse patterns, but knowledge of NLS patterns is limited and only a portion of NLSs is covered by the known NLS motifs. In SeqNLS, on the one hand, we propose a sequential-pattern approach to effectively detect potential NLS segments without being constrained by the limited knowledge of NLS patterns. On the other hand, we introduce a model for NLS prediction which utilizes the fact that NLSs are one type of linear motif. Our experimental results show that our sequential-pattern approach is effective in extensively searching for potential NLSs. Our method can consistently find over 50% of NLSs with prediction precision of at least 0.7 in the two independent datasets. Our method outperforms state-of-the-art NLS prediction methods in terms of F1-score. The binding affinity between a nuclear localization signal (NLS) and its import receptor is closely related to the corresponding nuclear import activity. PTM-based modulation of the NLS binding affinity to the import receptor is one of the best understood mechanisms for regulating nuclear import of proteins. However, identification of such regulation mechanisms is challenging due to the difficulty of assessing the impact of the PTM on the corresponding nuclear import activities. In the third part of the dissertation, we propose NIpredict, an effective algorithm to predict nuclear import activity given an NLS, in which molecular interaction energy components (MIECs) were used to characterize the NLS-import receptor interaction, and the support vector regression machine (SVR) was used to learn the relationship between the characterized NLS-import receptor interaction and the corresponding nuclear import activity. Our experiments showed that the change in nuclear import activity due to a change in the NLS could be accurately predicted by the NIpredict algorithm. Based on NIpredict, we developed a systematic framework to identify potential PTM-based nuclear import regulations for human and yeast nuclear proteins. Application of this approach has uncovered potential nuclear import regulation mechanisms by phosphorylation and/or acetylation of three nuclear proteins including SF1, histone H1, and ORC6.

    Novel Peptide-Mediated Interactions Derived from High-Resolution 3-Dimensional Structures

    Many biological responses to intra- and extracellular stimuli are regulated through complex networks of transient protein interactions where a globular domain in one protein recognizes a linear peptide from another, creating a relatively small contact interface. These peptide stretches are often found in unstructured regions of proteins, and contain a consensus motif complementary to the interaction surface displayed by their binding partners. While most current methods for the de novo discovery of such motifs exploit their tendency to occur in disordered regions, our work here focuses on another observation: upon binding to their partner domain, motifs adopt a well-defined structure. Indeed, through the analysis of all peptide-mediated interactions of known high-resolution three-dimensional (3D) structure, we found that the structure of the peptide may be as characteristic as the consensus motif, and may help identify target peptides even though they do not match the established patterns. Our analyses of the structural features of known motifs reveal that they tend to have a particular stretched and elongated structure, unlike most other peptides of the same length. Accordingly, we have implemented a strategy based on a Support Vector Machine that uses these features, along with other structure-encoded information about binding interfaces, to search the set of protein interactions of known 3D structure and to identify unnoticed peptide-mediated interactions among them. We have also derived consensus patterns for these interactions, whenever enough information was available, and compared our results with established linear motif patterns and their binding domains. Finally, to cross-validate our identification strategy, we scanned interactome networks from four model organisms with our newly derived patterns to see if any of them occurred more often than expected. Indeed, we found significant over-representations for 64 domain-motif interactions, 46 of which had not been described before, involving over 6,000 interactions in total for which we could suggest the molecular details determining the binding.

    In Silico Characterization of Protein-Protein Interactions Mediated by Short Linear Motifs

    Short linear motifs (SLiMs), often found in intrinsically disordered regions (IDPs), can initiate protein-protein interactions in eukaryotes. Although pathogens tend to have less disorder than eukaryotes, their proteins alter host cellular function through molecular mimicry of SLiMs. The first objective was to study sequence-based structure properties of viral SLiMs in the ELM database and the conservation of selected viral motifs involved in the virus life cycle. The second objective was to compare the structural features of SLiMs in pathogens and eukaryotes in the ELM database. Our analysis showed that many viral SLiMs are not found in IDPs, particularly glycosylation motifs. Moreover, analysis of disorder and secondary structure properties of the same motif in pathogens and eukaryotes shed light on similarities and differences in motif properties between pathogens and their eukaryotic equivalents. Our results indicate that the interaction mechanism may differ between pathogens and their eukaryotic hosts for the same motif.