850 research outputs found

    CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures

    Get PDF
    We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification

    Toward a General-Purpose Heterogeneous Ensemble for Pattern Classification

    Get PDF
    We perform an extensive study of the performance of different classification approaches on twenty-five datasets (fourteen image datasets and eleven UCI data mining datasets). The aim is to find General-Purpose (GP) heterogeneous ensembles (requiring little to no parameter tuning) that perform competitively across multiple datasets. The state-of-the-art classifiers examined in this study include the support vector machine, Gaussian process classifiers, random subspace of adaboost, random subspace of rotation boosting, and deep learning classifiers. We demonstrate that a heterogeneous ensemble based on the simple fusion by sum rule of different classifiers performs consistently well across all twenty-five datasets. The most important result of our investigation is demonstrating that some very recent approaches, including the heterogeneous ensemble we propose in this paper, are capable of outperforming an SVM classifier (implemented with LibSVM), even when both kernel selection and SVM parameters are carefully tuned for each dataset

    Prediction of peptide and protein propensity for amyloid formation

    Get PDF
    Understanding which peptides and proteins have the potential to undergo amyloid formation and what driving forces are responsible for amyloid-like fiber formation and stabilization remains limited. This is mainly because proteins that can undergo structural changes, which lead to amyloid formation, are quite diverse and share no obvious sequence or structural homology, despite the structural similarity found in the fibrils. To address these issues, a novel approach based on recursive feature selection and feed-forward neural networks was undertaken to identify key features highly correlated with the self-assembly problem. This approach allowed the identification of seven physicochemical and biochemical properties of the amino acids highly associated with the self-assembly of peptides and proteins into amyloid-like fibrils (normalized frequency of β-sheet, normalized frequency of β-sheet from LG, weights for β-sheet at the window position of 1, isoelectric point, atom-based hydrophobic moment, helix termination parameter at position j+1 and ΔGº values for peptides extrapolated in 0 M urea). Moreover, these features enabled the development of a new predictor (available at http://cran.r-project.org/web/packages/appnn/index.html) capable of accurately and reliably predicting the amyloidogenic propensity from the polypeptide sequence alone with a prediction accuracy of 84.9 % against an external validation dataset of sequences with experimental in vitro, evidence of amyloid formation

    Systematic analysis of primary sequence domain segments for the discrimination between class C GPCR subtypes

    Get PDF
    G-protein-coupled receptors (GPCRs) are a large and diverse super-family of eukaryotic cell membrane proteins that play an important physiological role as transmitters of extracellular signal. In this paper, we investigate Class C, a member of this super-family that has attracted much attention in pharmacology. The limited knowledge about the complete 3D crystal structure of Class C receptors makes necessary the use of their primary amino acid sequences for analytical purposes. Here, we provide a systematic analysis of distinct receptor sequence segments with regard to their ability to differentiate between seven class C GPCR subtypes according to their topological location in the extracellular, transmembrane, or intracellular domains. We build on the results from the previous research that provided preliminary evidence of the potential use of separated domains of complete class C GPCR sequences as the basis for subtype classification. The use of the extracellular N-terminus domain alone was shown to result in a minor decrease in subtype discrimination in comparison with the complete sequence, despite discarding much of the sequence information. In this paper, we describe the use of Support Vector Machine-based classification models to evaluate the subtype-discriminating capacity of the specific topological sequence segments.Peer ReviewedPostprint (author's final draft

    Highly Accurate Fragment Library for Protein Fold Recognition

    Get PDF
    Proteins play a crucial role in living organisms as they perform many vital tasks in every living cell. Knowledge of protein folding has a deep impact on understanding the heterogeneity and molecular functions of proteins. Such information leads to crucial advances in drug design and disease understanding. Fold recognition is a key step in the protein structure discovery process, especially when traditional computational methods fail to yield convincing structural homologies. In this work, we present a new protein fold recognition approach using machine learning and data mining methodologies. First, we identify a protein structural fragment library (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues as the structural “keywords” that can effectively distinguish between major protein folds. We firstly apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large-scale of high-quality, non-homologous protein structures available in PDB. We analyze the impacts of clustering cut-offs on the performance of the fragment libraries. Then, the Frag-K fragments are employed as structural features to classify protein structures in major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary with ~400 4- to 20-residue Frag-K fragments is capable of classifying major SCOP folds with high accuracy. Then, based on Frag-k, we design a novel deep learning architecture, so-called DeepFrag-k, which identifies fold discriminative features to improve the accuracy of protein fold recognition. DeepFrag-k is composed of two stages: the first stage employs a multimodal Deep Belief Network (DBN) to predict the potential structural fragments given a sequence, represented as a fragment vector, and then the second stage uses a deep convolution neural network (CNN) to classify the fragment vectors into the corresponding folds. Our results show that DeepFrag-k yields 92.98% accuracy in predicting the top-100 most popular fragments, which can be used to generate discriminative fragment feature vectors to improve protein fold recognition

    Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction

    Get PDF
    Protein tertiary structure plays a very important role in determining its possible functional sites and chemical interactions with other related proteins. Experimental methods to determine protein structure are time consuming and expensive. As a result, the gap between protein sequence and its structure has widened substantially due to the high throughput sequencing techniques. Problems of experimental methods motivate us to develop the computational algorithms for protein structure prediction. In this work, the clustering system is used to predict local protein structure. At first, recurring sequence clusters are explored with an improved K-means clustering algorithm. Carefully constructed sequence clusters are used to predict local protein structure. After obtaining the sequence clusters and motifs, we study how sequence variation for sequence clusters may influence its structural similarity. Analysis of the relationship between sequence variation and structural similarity for sequence clusters shows that sequence clusters with tight sequence variation have high structural similarity and sequence clusters with wide sequence variation have poor structural similarity. Based on above knowledge, the established clustering system is used to predict the tertiary structure for local sequence segments. Test results indicate that highest quality clusters can give highly reliable prediction results and high quality clusters can give reliable prediction results. In order to improve the performance of the clustering system for local protein structure prediction, a novel computational model called Clustering Support Vector Machines (CSVMs) is proposed. In our previous work, the sequence-to-structure relationship with the K-means algorithm has been explored by the conventional K-means algorithm. The K-means clustering algorithm may not capture nonlinear sequence-to-structure relationship effectively. As a result, we consider using Support Vector Machine (SVM) to capture the nonlinear sequence-to-structure relationship. However, SVM is not favorable for huge datasets including millions of samples. Therefore, we propose a novel computational model called CSVMs. Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. Compared with the clustering system introduced previously, our experimental results show that accuracy for local structure prediction has been improved noticeably when CSVMs are applied
    • …
    corecore