1,154 research outputs found

    Discovery and Extraction of Protein Sequence Motif Information that Transcends Protein Family Boundaries

    Get PDF
    Protein sequence motifs are gathering more and more attention in the field of sequence analysis. The recurring patterns have the potential to determine the conformation, function and activities of the proteins. In our work, we obtained protein sequence motifs which are universally conserved across protein family boundaries. Therefore, unlike most popular motif discovering algorithms, our input dataset is extremely large. As a result, an efficient technique is essential. We use two granular computing models, Fuzzy Improved K-means (FIK) and Fuzzy Greedy K-means (FGK), in order to efficiently generate protein motif information. After that, we develop an efficient Super Granular SVM Feature Elimination model to further extract the motif information. During the motifs searching process, setting up a fixed window size in advance may simplify the computational complexity and increase the efficiency. However, due to the fixed size, our model may deliver a number of similar motifs simply shifted by some bases or including mismatches. We develop a new strategy named Positional Association Super-Rule to confront the problem of motifs generated from a fixed window size. It is a combination approach of the super-rule analysis and a novel Positional Association Rule algorithm. We use the super-rule concept to construct a Super-Rule-Tree (SRT) by a modified HHK clustering, which requires no parameter setup to identify the similarities and dissimilarities between the motifs. The positional association rule is created and applied to search similar motifs that are shifted some residues. By analyzing the motifs results generated by our approaches, we realize that these motifs are not only significant in sequence area, but also in secondary structure similarity and biochemical properties

    Protein Local Tertiary Structure Prediction by Super Granule Support Vector Machines with Chou-Fasman Parameter

    Get PDF
    Prediction of a protein's tertiary structure from its sequence information alone is considered a major task in modern computational biology.  In order to closer the gap between protein sequences to its tertiary structures, we discuss the correlation between protein sequence and local tertiary structure information in this paper.  The strategy we used in this work is predict small portions (local) of protein tertiary structure with high confidence from conserved protein sequences, which are called “protein sequence motifs”. 799 protein sequence motifs that transcend protein family boundaries were obtained from our previous work.  The prediction accuracy generated from the best group of protein sequence motifs always keep higher than 90% while more than 8% of the independent testing data segments are predicted. Since the most meaningful result published in latest publication is merely 70.02% accuracy under the coverage of 4.45%, the research results achieved in this paper are obviously outperformed. Besides, we also set up a stricter evaluation to our prediction to further understand the relation between protein sequence motifs and tertiary structure predictions.  The results suggest that the hidden sequence-to-structure relationship can be uncovered using the Super Granule SVM Model with the Chou-Fasman Parameter.  With the high local tertiary structure prediction accuracy provided in this article, the hidden relation between protein primary sequences and their 3D structure are uncovered considerably

    Innovative Algorithms and Evaluation Methods for Biological Motif Finding

    Get PDF
    Biological motifs are defined as overly recurring sub-patterns in biological systems. Sequence motifs and network motifs are the examples of biological motifs. Due to the wide range of applications, many algorithms and computational tools have been developed for efficient search for biological motifs. Therefore, there are more computationally derived motifs than experimentally validated motifs, and how to validate the biological significance of the ‘candidate motifs’ becomes an important question. Some of sequence motifs are verified by their structural similarities or their functional roles in DNA or protein sequences, and stored in databases. However, biological role of network motifs is still invalidated and currently no databases exist for this purpose. In this thesis, we focus not only on the computational efficiency but also on the biological meanings of the motifs. We provide an efficient way to incorporate biological information with clustering analysis methods: For example, a sparse nonnegative matrix factorization (SNMF) method is used with Chou-Fasman parameters for the protein motif finding. Biological network motifs are searched by various clustering algorithms with Gene ontology (GO) information. Experimental results show that the algorithms perform better than existing algorithms by producing a larger number of high-quality of biological motifs. In addition, we apply biological network motifs for the discovery of essential proteins. Essential proteins are defined as a minimum set of proteins which are vital for development to a fertile adult and in a cellular life in an organism. We design a new centrality algorithm with biological network motifs, named MCGO, and score proteins in a protein-protein interaction (PPI) network to find essential proteins. MCGO is also combined with other centrality measures to predict essential proteins using machine learning techniques. We have three contributions to the study of biological motifs through this thesis; 1) Clustering analysis is efficiently used in this work and biological information is easily integrated with the analysis; 2) We focus more on the biological meanings of motifs by adding biological knowledge in the algorithms and by suggesting biologically related evaluation methods. 3) Biological network motifs are successfully applied to a practical application of prediction of essential proteins

    Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction

    Get PDF
    Protein tertiary structure plays a very important role in determining its possible functional sites and chemical interactions with other related proteins. Experimental methods to determine protein structure are time consuming and expensive. As a result, the gap between protein sequence and its structure has widened substantially due to the high throughput sequencing techniques. Problems of experimental methods motivate us to develop the computational algorithms for protein structure prediction. In this work, the clustering system is used to predict local protein structure. At first, recurring sequence clusters are explored with an improved K-means clustering algorithm. Carefully constructed sequence clusters are used to predict local protein structure. After obtaining the sequence clusters and motifs, we study how sequence variation for sequence clusters may influence its structural similarity. Analysis of the relationship between sequence variation and structural similarity for sequence clusters shows that sequence clusters with tight sequence variation have high structural similarity and sequence clusters with wide sequence variation have poor structural similarity. Based on above knowledge, the established clustering system is used to predict the tertiary structure for local sequence segments. Test results indicate that highest quality clusters can give highly reliable prediction results and high quality clusters can give reliable prediction results. In order to improve the performance of the clustering system for local protein structure prediction, a novel computational model called Clustering Support Vector Machines (CSVMs) is proposed. In our previous work, the sequence-to-structure relationship with the K-means algorithm has been explored by the conventional K-means algorithm. The K-means clustering algorithm may not capture nonlinear sequence-to-structure relationship effectively. As a result, we consider using Support Vector Machine (SVM) to capture the nonlinear sequence-to-structure relationship. However, SVM is not favorable for huge datasets including millions of samples. Therefore, we propose a novel computational model called CSVMs. Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. Compared with the clustering system introduced previously, our experimental results show that accuracy for local structure prediction has been improved noticeably when CSVMs are applied

    Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?

    Get PDF
    The organization and mining of malaria genomic and post-genomic data is highly motivated by the necessity to predict and characterize new biological targets and new drugs. Biological targets are sought in a biological space designed from the genomic data from Plasmodium falciparum, but using also the millions of genomic data from other species. Drug candidates are sought in a chemical space containing the millions of small molecules stored in public and private chemolibraries. Data management should therefore be as reliable and versatile as possible. In this context, we examined five aspects of the organization and mining of malaria genomic and post-genomic data: 1) the comparison of protein sequences including compositionally atypical malaria sequences, 2) the high throughput reconstruction of molecular phylogenies, 3) the representation of biological processes particularly metabolic pathways, 4) the versatile methods to integrate genomic data, biological representations and functional profiling obtained from X-omic experiments after drug treatments and 5) the determination and prediction of protein structures and their molecular docking with drug candidate structures. Progresses toward a grid-enabled chemogenomic knowledge space are discussed.Comment: 43 pages, 4 figures, to appear in Malaria Journa

    DEGRONOPEDIA:a web server for proteome-wide inspection of degrons

    Get PDF
    E3 ubiquitin ligases recognize substrates through their short linear motifs termed degrons. While degron-signaling has been a subject of extensive study, resources for its systematic screening are limited. To bridge this gap, we developed DEGRONOPEDIA, a web server that searches for degrons and maps them to nearby residues that can undergo ubiquitination and disordered regions, which may act as protein unfolding seeds. Along with an evolutionary assessment of degron conservation, the server also reports on post-translational modifications and mutations that may modulate degron availability. Acknowledging the prevalence of degrons at protein termini, DEGRONOPEDIA incorporates machine learning to assess N-/C-terminal stability, supplemented by simulations of proteolysis to identify degrons in newly formed termini. An experimental validation of a predicted C-terminal destabilizing motif, coupled with the confirmation of a post-proteolytic degron in another case, exemplifies its practical application. DEGRONOPEDIA can be freely accessed at degronopedia.com

    DEGRONOPEDIA:a web server for proteome-wide inspection of degrons

    Get PDF
    E3 ubiquitin ligases recognize substrates through their short linear motifs termed degrons. While degron-signaling has been a subject of extensive study, resources for its systematic screening are limited. To bridge this gap, we developed DEGRONOPEDIA, a web server that searches for degrons and maps them to nearby residues that can undergo ubiquitination and disordered regions, which may act as protein unfolding seeds. Along with an evolutionary assessment of degron conservation, the server also reports on post-translational modifications and mutations that may modulate degron availability. Acknowledging the prevalence of degrons at protein termini, DEGRONOPEDIA incorporates machine learning to assess N-/C-terminal stability, supplemented by simulations of proteolysis to identify degrons in newly formed termini. An experimental validation of a predicted C-terminal destabilizing motif, coupled with the confirmation of a post-proteolytic degron in another case, exemplifies its practical application. DEGRONOPEDIA can be freely accessed at degronopedia.com
    • 

    corecore