3,196 research outputs found

    Supervised multivariate analysis of sequence groups to identify specificity determining residues

    Background: Proteins that evolve from a common ancestor can change functionality over time, and it is important to be able to identify the residues that cause this change. In this paper we show how a supervised multivariate statistical method, Between Group Analysis (BGA), can be used to identify these residues from families of proteins with different substrate specificities using multiple sequence alignments. Results: We demonstrate the usefulness of this method on three different test cases. Two of these test cases, the Lactate/Malate dehydrogenase family and Nucleotidyl Cyclases, consist of two functional groups. The other family, Serine Proteases, consists of three groups. BGA was used to analyse and visualise these three families using two different encoding schemes for the amino acids. Conclusion: The overall combination of methods in this paper is powerful and flexible while being computationally very fast and simple. BGA is especially useful because it can be used to analyse any number of functional classes. In the examples in this paper, we used only 2 or 3 classes for demonstration purposes, but any number can be used and visualised.
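The idea in the abstract above can be illustrated for the simplest case of two functional groups. With only two groups, ordinating the group means yields a single axis along the difference of the two group centroids, and alignment columns with large loadings on that axis are candidate specificity residues. This is a minimal sketch under stated assumptions, not the paper's implementation: the one-hot encoding, the function names, and the toy sequences are all hypothetical, and the paper's own two encoding schemes are not specified in the abstract.

```python
# Hypothetical illustration of a two-group between-group axis.
# Assumption: residues are one-hot encoded over 20 amino acids plus a gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(seq):
    """Encode an aligned sequence as a flat 0/1 vector (len(seq) * 21 entries)."""
    vec = []
    for residue in seq:
        vec.extend(1.0 if residue == symbol else 0.0 for symbol in ALPHABET)
    return vec


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def two_group_axis(group_a, group_b):
    """Unit vector along the difference of the two group centroids."""
    mean_a = centroid([one_hot(s) for s in group_a])
    mean_b = centroid([one_hot(s) for s in group_b])
    diff = [a - b for a, b in zip(mean_a, mean_b)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("group centroids are identical; no discriminating axis")
    return [d / norm for d in diff]


def top_positions(axis, n_top=1):
    """Rank alignment columns by the summed absolute loading of their symbols."""
    n_sym = len(ALPHABET)
    n_cols = len(axis) // n_sym
    weights = [
        sum(abs(x) for x in axis[i * n_sym:(i + 1) * n_sym]) for i in range(n_cols)
    ]
    return sorted(range(n_cols), key=lambda i: weights[i], reverse=True)[:n_top]


# Toy data (hypothetical): the groups differ only at column index 1 (K vs R).
group_a = ["MKLV", "MKLI"]
group_b = ["MRLV", "MRLI"]
axis = two_group_axis(group_a, group_b)
best = top_positions(axis, n_top=1)
```

For three or more groups, the same idea would need a proper ordination (for example a singular value decomposition of the centred group means) rather than a single difference vector.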

    An entropy based heuristic model for predicting functional sub-type divisions of protein families

    Multiple sequence alignments of protein families are often used to locate residues that lie far apart in the sequence and are considered influential in determining the functional specificity of proteins towards various substrates, ligands, DNA and other proteins. In this paper, we propose an entropy-score-based heuristic algorithm for predicting functional sub-family divisions of protein families, given only the multiple sequence alignment of the protein family as input, without functional sub-type or key-site information for any protein sequence. Two of the test cases examined are reported in this paper. The first test case is the Nucleotidyl Cyclase protein family, consisting of guanylate and adenylate cyclases. The second test case is a dataset of proteins taken from six superfamilies in the Structure-Function Linkage Database (SFLD). Results from these test cases are reported in terms of confirmed sub-type divisions with phylogeny relations from former studies in the literature.
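The abstract above does not give the entropy score itself, so the following is only a sketch of the usual building block such scores start from: the Shannon entropy of each alignment column. The function names and the toy alignment are hypothetical, and gaps are simply treated as one more symbol, which is an assumption rather than the authors' choice.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(alignment):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy(column) for column in zip(*alignment)]


# Toy alignment (hypothetical sequences, not taken from the paper).
toy_alignment = ["ACDE", "ACDF", "AGDE", "AGDF"]
entropies = alignment_entropies(toy_alignment)
```

A fully conserved column has entropy zero, while a column split evenly between two residues has an entropy of one bit; columns of the second kind are the ones a sub-family division could plausibly be read from.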

    Statistical deconvolution of enthalpic energetic contributions to MHC-peptide binding affinity

    Background: MHC Class I molecules present antigenic peptides to cytotoxic T cells, which forms an integral part of the adaptive immune response. Peptides are bound within a groove formed by the MHC heavy chain. Previous approaches to MHC Class I-peptide binding prediction have largely concentrated on the peptide anchor residues located at the P2 and C-terminus positions. Results: A large dataset comprising MHC-peptide structural complexes was created by re-modelling pre-determined x-ray crystallographic structures. Static energetic analysis, following energy minimisation, was performed on the dataset in order to characterise interactions between bound peptides and the MHC Class I molecule, partitioning the interactions within the groove into van der Waals, electrostatic and total non-bonded energy contributions. Conclusion: The QSAR techniques of Genetic Function Approximation (GFA) and Genetic Partial Least Squares (G/PLS) algorithms were used to identify key interactions between the two molecules by comparing the calculated energy values with experimentally-determined BL50 data. Although the peptide termini binding interactions help ensure the stability of the MHC Class I-peptide complex, the central region of the peptide is also important in defining the specificity of the interaction. As thermodynamic studies indicate that peptide association and dissociation may be driven entropically, it may be necessary to incorporate entropic contributions into future calculations

    Multi-Harmony: detecting functional specificity from sequence alignment

    Many protein families contain sub-families with functional specialization, such as binding different ligands or being involved in different protein–protein interactions. A small number of amino acids generally determine functional specificity. The identification of these residues can aid the understanding of protein function and help find targets for experimental analysis. Here, we present multi-Harmony, an interactive web server for detecting sub-type-specific sites in proteins starting from a multiple sequence alignment. Combining our Sequence Harmony (SH) and multi-Relief (mR) methods in one web server allows simultaneous analysis and comparison of specificity residues; furthermore, both methods have been significantly improved and extended. SH has been extended to cope with more than two sub-groups. mR has been changed from a sampling implementation to a deterministic one, making it more consistent and user friendly. For both methods Z-scores are reported. The multi-Harmony web server produces a dynamic output page, which includes interactive connections to the Jalview and Jmol applets, thereby allowing interactive analysis of the results. Multi-Harmony is available at http://www.ibi.vu.nl/programs/shmrwww

    Ensemble approach to predict specificity determinants: benchmarking and validation

    Background: It is extremely important and challenging to identify the sites that are responsible for functional specification or diversification in protein families. In this study, a rigorous comparative benchmarking protocol was employed to provide a reliable evaluation of methods that predict specificity determining sites. Subsequently, the three best-performing methods were applied to identify new potential specificity determining sites through an ensemble approach and common agreement of their prediction results. Results: It was shown that the analysis of structural characteristics of predicted specificity determining sites might provide the means to validate their prediction accuracy. For example, we found that for smaller distances it holds true that the more reliable the prediction method is, the closer predicted specificity determining sites are to each other and to the ligand. Conclusion: We observed certain similarities of structural features between predicted and actual subsites which might point to their functional relevance. We speculate that the majority of the identified potential specificity determining sites might be indirectly involved in specific interactions and could be ideal targets for mutagenesis experiments.

    Clustering of protein domains for functional and evolutionary studies

    Background: The number of protein family members defined by DNA sequencing is usually much larger than those characterised experimentally. This paper describes a method to divide protein families into subtypes purely on sequence criteria. Comparison with experimental data allows an independent test of the quality of the clustering. Results: An evolutionary split statistic is calculated for each column in a protein multiple sequence alignment; the statistic has a larger value when a column is better described by an evolutionary model that assumes clustering around two or more amino acids rather than a single amino acid. The user selects columns (typically the top ranked columns) to construct a motif. The motif is used to divide the family into subtypes using a stochastic optimization procedure related to the deterministic annealing EM algorithm (DAEM), which yields a specificity score showing how well each family member is assigned to a subtype. The clustering obtained is not strongly dependent on the number of amino acids chosen for the motif. The robustness of this method was demonstrated using six well characterized protein families: nucleotidyl cyclase, protein kinase, dehydrogenase, two polyketide synthase domains and small heat shock proteins. Phylogenetic trees did not allow accurate clustering for three of the six families. Conclusion: The method clustered the families into functional subtypes with an accuracy of 90 to 100%. False assignments usually had a low specificity score

    Development of New Bioinformatic Approaches for Human Genetic Studies

    The development of bioinformatics methods for human genetic studies utilizes the vast amount of data to generate new valuable information. Machine learning and statistical coupling analysis can be used in the study of human diseases. These diseases include intellectual disabilities (ID), prevalent in 1-3% of the population and caused primarily by genetics. Although many cases of ID are caused by mutations in protein-coding genes, the possible involvement of long non-coding RNAs (lncRNAs) in ID, due to their role in gene expression regulation, has been explored. In this study, we used machine learning to develop a new expression-based model trained using ID genes encoded with the developing brain transcriptome. The model was fine-tuned using the class-balancing approach of synthetic over-sampling of the minority class, resulting in improved performance. We used the model to predict candidate ID-associated lncRNAs. Our model identified several candidates that overlapped with previously reported ID-associated lncRNAs, were enriched with neurodevelopmental functions, and were highly expressed in brain tissues. Machine learning was also used to predict protein stability changes caused by missense mutations, which can lead to disease conditions including ID. We tested Random Forests, Support Vector Machines (SVM) and Naïve Bayes to find the best-performing algorithm to develop a multi-class classifier. We developed an SVM model using relevant physico-chemical features after feature selection. Our work identified new features for predicting the effect of amino acid substitutions on protein stability and a well-performing multi-class classifier solely based on sequence information. Statistical approaches were used to analyze the association between mutations and phenotypes. In this study, we used statistical coupling analysis (SCA) to cluster disease-causing mutations and ID phenotypes. Using SCA we identified groups of co-evolving residues, known as protein sectors, in ID protein families. Within each distinct sector, mutations linked to different phenotypic manifestations of a syndromic ID were identified. Our results suggest that protein sector analysis can be used to associate mutations with phenotypic manifestations in human diseases. The bioinformatic methods developed in this dissertation can be used in human genetic research to understand the role of new genes and proteins in human disease.

    Combining specificity determining and conserved residues improves functional site prediction

    Background: Predicting the location of functionally important sites from protein sequence and/or structure is a long-standing problem in computational biology. Most current approaches make use of sequence conservation, assuming that amino acid residues conserved within a protein family are most likely to be functionally important. Most often these approaches do not consider many residues that act to define specific sub-functions within a family, or they make no distinction between residues important for function and those more relevant for maintaining structure (e.g. in the hydrophobic core). Many protein families bind and/or act on a variety of ligands, meaning that conserved residues often only bind a common ligand sub-structure or perform general catalytic activities. Results: Here we present a novel method for functional site prediction based on identification of conserved positions, as well as those responsible for determining ligand specificity. We define Specificity-Determining Positions (SDPs) as those occupied by residues that are conserved within sub-groups of proteins in a family having a common specificity but that differ between groups, and are thus likely to account for specific recognition events. We benchmark the approach on enzyme families of known 3D structure with bound substrates, and find that in nearly all families residues predicted by SDPsite are in contact with the bound substrate, and that the addition of SDPs significantly improves functional site prediction accuracy. We apply SDPsite to various families of proteins containing known three-dimensional structures, but lacking clear functional annotations, and discuss several illustrative examples. Conclusion: The results suggest a better means to predict functional details for the thousands of protein structures determined prior to a clear understanding of molecular function.
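The SDP definition in the abstract above (conserved within each specificity group, different between groups) can be turned into a naive column filter. This is a hedged sketch of the definition only, not the SDPsite algorithm: the conservation threshold `min_fraction`, the rule that at least two groups must disagree, the function names, and the toy groups are all assumptions made for illustration.

```python
from collections import Counter


def dominant_residue(residues, min_fraction):
    """Return the most common residue if it reaches min_fraction, else None."""
    residue, count = Counter(residues).most_common(1)[0]
    return residue if count / len(residues) >= min_fraction else None


def candidate_sdps(groups, min_fraction=0.9):
    """Column indices conserved within every group but differing between groups.

    groups maps a group name to a list of aligned, equal-length sequences.
    """
    length = len(next(iter(groups.values()))[0])
    positions = []
    for i in range(length):
        dominants = [
            dominant_residue([seq[i] for seq in seqs], min_fraction)
            for seqs in groups.values()
        ]
        if None in dominants:
            continue  # not conserved within at least one group
        if len(set(dominants)) > 1:  # assumption: at least two groups disagree
            positions.append(i)
    return positions


# Toy groups (hypothetical): column 1 is K in one group and R in the other.
toy_groups = {
    "specificity_1": ["MKLV", "MKLI"],
    "specificity_2": ["MRLV", "MRLI"],
}
sdp_columns = candidate_sdps(toy_groups)
```

Column 0 and column 2 are conserved across the whole toy family, so they count as generally conserved rather than specificity-determining, and column 3 is variable within each group, so it is rejected as well.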

    The C-Terminal Fragment of Prostate-Specific Antigen, a 2331 Da Peptide, as a New Urinary Pathognomonic Biomarker Candidate for Diagnosing Prostate Cancer

    Background and Objectives: Prostate cancer (PCa) is one of the most common cancers and leading cause of cancer-related deaths in men. Mass screening has been carried out since the 1990s using prostate-specific antigen (PSA) levels in the serum as a PCa biomarker. However, although PSA is an excellent organ-specific marker, it is not a cancer-specific marker. Therefore, the aim of this study was to discover new biomarkers for the diagnosis of PCa. Materials and Methods: We focused on urine samples voided following prostate massage (digital rectal examination [DRE]) and conducted a peptidomic analysis of these samples using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS_n). Urinary biomaterials were concentrated and desalted using CM-Sepharose prior to the following analyses being performed by MALDI-TOF/MS_n: 1) differential analyses of mass spectra; 2) determination of amino acid sequences; and 3) quantitative analyses using a stable isotope-labeled internal standard. Results: Multivariate analysis of the MALDI-TOF/MS mass spectra of urinary extracts revealed a 2331 Da peptide in urine samples following DRE. This peptide was identified as a C-terminal PSA fragment composed of 19 amino acid residues. Moreover, quantitative analysis of the relationship between isotope-labeled synthetic and intact peptides using MALDI-TOF/MS revealed that this peptide may be a new pathognomonic biomarker candidate that can differentiate PCa patients from non-cancer subjects. Conclusion: The results of the present study indicate that the 2331 Da peptide fragment of PSA may become a new pathognomonic biomarker for the diagnosis of PCa. A further large-scale investigation is currently underway to assess the possibility of using this peptide in the early detection of PCa

    Robust Algorithms for Detecting Hidden Structure in Biological Data

    Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects its underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data. Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell. In the second part of this thesis I describe identifying hidden specificity determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor. This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution.