49,089 research outputs found

    Text Mining Improves Prediction of Protein Functional Sites

    Get PDF
    We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions

    Structure- and context-based analysis of the GxGYxYP family reveals a new putative class of glycoside hydrolase.

    Get PDF
    BackgroundGut microbiome metagenomics has revealed many protein families and domains found largely or exclusively in that environment. Proteins containing the GxGYxYP domain are over-represented in the gut microbiota, and are found in Polysaccharide Utilization Loci in the gut symbiont Bacteroides thetaiotaomicron, suggesting their involvement in polysaccharide metabolism, but little else is known of the function of this domain.ResultsGenomic context and domain architecture analyses support a role for the GxGYxYP domain in carbohydrate metabolism. Sparse occurrences in eukaryotes are the result of lateral gene transfer. The structure of the GxGYxYP domain-containing protein encoded by the BT2193 locus reveals two structural domains, the first composed of three divergent repeats with no recognisable homology to previously solved structures, the second a more familiar seven-stranded β/α barrel. Structure-based analyses including conservation mapping localise a presumed functional site to a cleft between the two domains of BT2193. Matching to a catalytic site template from a GH9 cellulase and other analyses point to a putative catalytic triad composed of Glu272, Asp331 and Asp333.ConclusionsWe suggest that GxGYxYP-containing proteins constitute a novel glycoside hydrolase family of as yet unknown specificity

    FLORA: a novel method to predict protein function from structure in diverse superfamilies

    Get PDF
    Predicting protein function from structure remains an active area of interest, particularly for the structural genomics initiatives where a substantial number of structures are initially solved with little or no functional characterisation. Although global structure comparison methods can be used to transfer functional annotations, the relationship between fold and function is complex, particularly in functionally diverse superfamilies that have evolved through different secondary structure embellishments to a common structural core. The majority of prediction algorithms employ local templates built on known or predicted functional residues. Here, we present a novel method (FLORA) that automatically generates structural motifs associated with different functional sub-families (FSGs) within functionally diverse domain superfamilies. Templates are created purely on the basis of their specificity for a given FSG, and the method makes no prior prediction of functional sites, nor assumes specific physico-chemical properties of residues. FLORA is able to accurately discriminate between homologous domains with different functions and substantially outperforms (a 2–3 fold increase in coverage at low error rates) popular structure comparison methods and a leading function prediction method. We benchmark FLORA on a large data set of enzyme superfamilies from all three major protein classes (α, β, αβ) and demonstrate the functional relevance of the motifs it identifies. We also provide novel predictions of enzymatic activity for a large number of structures solved by the Protein Structure Initiative. Overall, we show that FLORA is able to effectively detect functionally similar protein domain structures by purely using patterns of structural conservation of all residues

    Rampant exchange of the structure and function of extramembrane domains between membrane and water soluble proteins.

    Get PDF
    Of the membrane proteins of known structure, we found that a remarkable 67% of the water soluble domains are structurally similar to water soluble proteins of known structure. Moreover, 41% of known water soluble protein structures share a domain with an already known membrane protein structure. We also found that functional residues are frequently conserved between extramembrane domains of membrane and soluble proteins that share structural similarity. These results suggest membrane and soluble proteins readily exchange domains and their attendant functionalities. The exchanges between membrane and soluble proteins are particularly frequent in eukaryotes, indicating that this is an important mechanism for increasing functional complexity. The high level of structural overlap between the two classes of proteins provides an opportunity to employ the extensive information on soluble proteins to illuminate membrane protein structure and function, for which much less is known. To this end, we employed structure guided sequence alignment to elucidate the functions of membrane proteins in the human genome. Our results bridge the gap of fold space between membrane and water soluble proteins and provide a resource for the prediction of membrane protein function. A database of predicted structural and functional relationships for proteins in the human genome is provided at sbi.postech.ac.kr/emdmp

    Prediction of protein functional residues from sequence by probability density estimation.

    Get PDF
    Motivation: The prediction of ligand-binding residues or catalytically active residues of a protein may give important hints that can guide further genetic or biochemical studies. Existing sequence-based prediction methods mostly rank residue positions by evolutionary conservation calculated from a multiple sequence alignment of homologs. A problem hampering more wide-spread application of these methods is the low per-residue precision, which at 20% sensitivity is around 35% for ligand-binding residues and 20% for catalytic residues. Results: We combine information from the conservation at each site, its amino acid distribution, as well as its predicted secondary structure (ss) and relative solvent accessibility (rsa). First, we measure conservation by how much the amino acid distribution at each site differs from the distribution expected for the predicted ss and rsa states. Second, we include the conservation of neighboring residues in a weighted linear score by analytically optimizing the signal-to-noise ratio of the total score. Third, we use conditional probability density estimation to calculate the probability of each site to be functional given its conservation, the observed amino acid distribution, and the predicted ss and rsa states. We have constructed two large data sets, one based on the Catalytic Site Atlas and the other on PDB SITE records, to benchmark methods for predicting functional residues. The new method FRcons predicts ligand-binding and catalytic residues with higher precision than alternative methods over the entire sensitivity range, reaching 50% and 40% precision at 20% sensitivity, respectively

    High-Resolution Structure of the N-Terminal Endonuclease Domain of the Lassa Virus L Polymerase in Complex with Magnesium Ions

    Get PDF
    Lassa virus (LASV) causes deadly hemorrhagic fever disease for which there are no vaccines and limited treatments. LASV-encoded L polymerase is required for viral RNA replication and transcription. The functional domains of L–a large protein of 2218 amino acid residues–are largely undefined, except for the centrally located RNA-dependent RNA polymerase (RdRP) motif. Recent structural and functional analyses of the N-terminal region of the L protein from lymphocytic choriomeningitis virus (LCMV), which is in the same Arenaviridae family as LASV, have identified an endonuclease domain that presumably cleaves the cap structures of host mRNAs in order to initiate viral transcription. Here we present a high-resolution crystal structure of the N-terminal 173-aa region of the LASV L protein (LASV L173) in complex with magnesium ions at 1.72 Å. The structure is highly homologous to other known viral endonucleases of arena- (LCMV NL1), orthomyxo- (influenza virus PA), and bunyaviruses (La Crosse virus NL1). Although the catalytic residues (D89, E102 and K122) are highly conserved among the known viral endonucleases, LASV L endonuclease structure shows some notable differences. Our data collected from in vitro endonuclease assays and a reporter-based LASV minigenome transcriptional assay in mammalian cells confirm structural prediction of LASV L173 as an active endonuclease. The high-resolution structure of the LASV L endonuclease domain in complex with magnesium ions should aid the development of antivirals against lethal Lassa hemorrhagic fever

    The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases

    Get PDF
    One of the most intriguing groups of enzymes, the feruloyl esterases (FAEs), is ubiquitous in both simple and complex organisms. FAEs have gained importance in biofuel, medicine and food industries due to their capability of acting on a large range of substrates for cleaving ester bonds and synthesizing high-added value molecules through esterification and transesterification reactions. During the past two decades extensive studies have been carried out on the production and partial characterization of FAEs from fungi, while much less is known about FAEs of bacterial or plant origin. Initial classification studies on FAEs were restricted on sequence similarity and substrate specificity on just four model substrates and considered only a handful of FAEs belonging to the fungal kingdom. This study centers on the descriptor-based classification and structural analysis of experimentally verified and putative FAEs; nevertheless, the framework presented here is applicable to every poorly characterized enzyme family. 365 FAE-related sequences of fungal, bacterial and plantae origin were collected and they were clustered using Self Organizing Maps followed by k-means clustering into distinct groups based on amino acid composition and physico-chemical composition descriptors derived from the respective amino acid sequence. A Support Vector Machine model was subsequently constructed for the classification of new FAEs into the pre-assigned clusters. The model successfully recognized 98.2% of the training sequences and all the sequences of the blind test. The underlying functionality of the 12 proposed FAE families was validated against a combination of prediction tools and published experimental data. Another important aspect of the present work involves the development of pharmacophore models for the new FAE families, for which sufficient information on known substrates existed. Knowing the pharmacophoric features of a small molecule that are essential for binding to the members of a certain family opens a window of opportunities for tailored applications of FAEs

    Primary Structure and Catalytic Mechanism of the Epoxide Hydrolase from Agrobacterium radiobacter AD1

    Get PDF
    The epoxide hydrolase gene from Agrobacterium radiobacter AD1, a bacterium that is able to grow on epichlorohydrin as the sole carbon source, was cloned by means of the polymerase chain reaction with two degenerate primers based on the N-terminal and C-terminal sequences of the enzyme. The epoxide hydrolase gene coded for a protein of 294 amino acids with a molecular mass of 34 kDa. An identical epoxide hydrolase gene was cloned from chromosomal DNA of the closely related strain A. radiobacter CFZ11. The recombinant epoxide hydrolase was expressed up to 40% of the total cellular protein content in Escherichia coli BL21(DE3) and the purified enzyme had a kcat of 21 s-1 with epichlorohydrin. Amino acid sequence similarity of the epoxide hydrolase with eukaryotic epoxide hydrolases, haloalkane dehalogenase from Xanthobacter autotrophicus GJ10, and bromoperoxidase A2 from Streptomyces aureofaciens indicated that it belonged to the α/β-hydrolase fold family. This conclusion was supported by secondary structure predictions and analysis of the secondary structure with circular dichroism spectroscopy. The catalytic triad residues of epoxide hydrolase are proposed to be Asp107, His275, and Asp246. Replacement of these residues to Ala/Glu, Arg/Gln, and Ala, respectively, resulted in a dramatic loss of activity for epichlorohydrin. The reaction mechanism of epoxide hydrolase proceeds via a covalently bound ester intermediate, as was shown by single turnover experiments with the His275 → Arg mutant of epoxide hydrolase in which the ester intermediate could be trapped.
    corecore