Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs
<p>Abstract</p> <p>Background</p> <p>One of the strategies for protein function annotation is to search for particular structural motifs that are known to be shared by proteins with a given function.</p> <p>Results</p> <p>Here, we present a systematic extraction of structural motifs of seven residues from protein loops and we explore their correspondence with functional sites. Our approach is based on the structural alphabet HMM-SA (Hidden Markov Model - Structural Alphabet), which allows simplification of protein structures into uni-dimensional sequences, and advanced pattern statistics adapted to short sequences. Structural motifs of interest are selected by looking for structural motifs significantly over-represented in SCOP superfamilies in protein loops. We discovered two types of structural motifs significantly over-represented in SCOP superfamilies: (i) ubiquitous motifs, shared by several superfamilies and (ii) superfamily-specific motifs, over-represented in few superfamilies. A comparison of ubiquitous words with known small structural motifs shows that they contain well-described motifs such as turn, niche or nest motifs. A comparison between superfamily-specific motifs and biological annotations of Swiss-Prot reveals that some of them actually correspond to functional sites involved in the binding sites of small ligands, such as ATP/GTP, NAD(P) and SAH/SAM.</p> <p>Conclusions</p> <p>Our findings show that statistical over-representation in SCOP superfamilies is linked to functional features. The detection of over-represented motifs within structures simplified by HMM-SA is therefore a promising approach for prediction of functional sites and annotation of uncharacterized proteins.</p>
SA-Mot: a web server for the identification of motifs of interest extracted from protein loops
The detection of functional motifs is an important step for the determination of protein functions. We present here a new web server SA-Mot (Structural Alphabet Motif) for the extraction and location of structural motifs of interest from protein loops. Contrary to other methods, SA-Mot does not focus only on functional motifs, but it extracts recurrent and conserved structural motifs involved in structural redundancy of loops. SA-Mot uses the structural word notion to extract all structural motifs from uni-dimensional sequences corresponding to loop structures. Then, SA-Mot provides a description of these structural motifs using statistics computed in the loop data set and in SCOP superfamily, sequence and structural parameters. SA-Mot results correspond to an interactive table listing all structural motifs extracted from a target structure and their associated descriptors. Using this information, the users can easily locate loop regions that are important for the protein folding and function. The SA-Mot web server is available at http://sa-mot.mti.univ-paris-diderot.fr
PockDrug-Server : a new web server for predicting pocket druggability on holo and apo proteins
Predicting a protein pocket's ability to bind drug-like molecules with high affinity, i.e. druggability, is of major interest in the target identification phase of drug discovery. Therefore, pocket druggability investigations represent a key step of compound clinical progression projects. Currently, computational druggability prediction models are tied to one unique pocket estimation method despite pocket estimation uncertainties. In this paper, we propose 'PockDrug-Server' to predict pocket druggability, efficient on both (i) estimated pockets guided by the ligand proximity (extracted by proximity to a ligand from a holo protein structure) and (ii) estimated pockets based solely on protein structure information (based on amino atoms that form the surface of potential binding cavities). PockDrug-Server provides consistent druggability results using different pocket estimation methods. It is robust with respect to pocket boundary and estimation uncertainties, and is thus efficient on apo pockets, which are challenging to estimate. It clearly distinguishes druggable from less druggable pockets using different estimation methods and outperformed recent druggability models for apo pockets. It can be run on one or a set of apo/holo proteins using different pocket estimation methods proposed by our web server or on any pocket previously estimated by the user. PockDrug-Server is publicly available at: http://pockdrug.rpbs.univ-paris-diderot.fr.
Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data
<p>Abstract</p> <p>Background</p> <p>In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models.</p> <p>Results</p> <p>The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. 
On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence.</p> <p>Conclusions</p> <p>Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We conclude with a discussion of our method and its potential improvements.</p>
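<p>The counting problem described in this abstract can be illustrated with a naive baseline. The sketch below is not one of the three algorithms of the paper: it embeds a single word in a KMP-style deterministic finite automaton, computes the exact count distribution in one sequence under a first-order homogeneous Markov model, and then combines independent sequences by plain convolution, which is the step the abstract says Algorithm 1 avoids. All function names, the two-letter alphabet and the probabilities are illustrative assumptions.</p>

```python
from collections import defaultdict


def build_dfa(word, alphabet):
    """KMP-style automaton: state = length of the longest prefix of
    `word` that is a suffix of the text read so far."""
    m = len(word)
    dfa = []
    for state in range(m + 1):
        row = {}
        for c in alphabet:
            s = word[:state] + c
            k = min(m, len(s))
            while k > 0 and s[-k:] != word[:k]:
                k -= 1
            row[c] = k
        dfa.append(row)
    return dfa


def count_distribution(word, n, init, trans):
    """Exact distribution of the number of (overlapping) occurrences of
    `word` in one sequence of length n >= 1 drawn from a first-order
    Markov chain with initial law `init` and transition matrix `trans`."""
    alphabet = sorted(init)
    dfa = build_dfa(word, alphabet)
    m = len(word)
    # probs[(last letter, automaton state, occurrences so far)] = probability
    probs = defaultdict(float)
    for c in alphabet:
        s = dfa[0][c]
        probs[(c, s, int(s == m))] += init[c]
    for _ in range(n - 1):
        nxt = defaultdict(float)
        for (prev, s, k), p in probs.items():
            for c in alphabet:
                t = dfa[s][c]
                nxt[(c, t, k + int(t == m))] += p * trans[prev][c]
        probs = nxt
    dist = defaultdict(float)
    for (_, _, k), p in probs.items():
        dist[k] += p
    return [dist[k] for k in range(max(dist) + 1)]


def convolve(a, b):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def set_distribution(word, lengths, init, trans):
    """Naive baseline: convolve the per-sequence count distributions."""
    total = [1.0]
    for n in lengths:
        total = convolve(total, count_distribution(word, n, init, trans))
    return total


# Illustrative model: uniform, memoryless two-letter source (an assumption).
init = {"a": 0.5, "b": 0.5}
trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
print(set_distribution("aa", [3, 2], init, trans))
```

<p>Unlike the concatenation approximation mentioned in the abstract, this baseline never counts an occurrence that would straddle two sequences, because each sequence restarts the automaton from its initial state.</p>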
Mining protein loops using a structural alphabet and statistical exceptionality
<p>Abstract</p> <p>Background</p> <p>Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, the description of protein loops with conventional tools is a difficult task. Regular secondary structures, helices and strands, have been widely studied whereas loops, because they are highly variable in terms of sequence and structure, are difficult to analyze. Due to data sparsity, long loops have rarely been systematically studied.</p> <p>Results</p> <p>We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA allows the simplification of a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment, called structural letter. The difficult task of the structural grouping of huge data sets is thus easily accomplished by handling structural letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to the structural-letter sequence, named structural word. This approach permits a systematic analysis of loops of all sizes since we consider the structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop-lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but interestingly, two thirds are shared by short (less than 12 residues) and long loops. 
Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints.</p> <p>Conclusions</p> <p>We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, it is the first time that pattern mining helps to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might help reduce the complexity of long-loop analysis. Detailed results are available at <url>http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/</url>.</p>
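<p>The word-extraction step of this abstract can be sketched in a few lines. Since each structural letter covers four residues, a seven-residue fragment corresponds to four consecutive overlapping letters, so the sketch slides a window of four letters along each loop string and counts the resulting words. The function names and the toy loop strings are assumptions made for illustration; the real HMM-SA encoding step is not reproduced here.</p>

```python
from collections import Counter

WORD_LENGTH = 4  # 4 overlapping four-residue letters span 7 residues


def count_structural_words(loops, word_length=WORD_LENGTH):
    """Count every word of `word_length` letters over all loop strings.

    `loops` is an iterable of strings, one structural letter per character.
    Loops shorter than `word_length` contribute no word.
    """
    counts = Counter()
    for loop in loops:
        for start in range(len(loop) - word_length + 1):
            counts[loop[start:start + word_length]] += 1
    return counts


def recurrent_words(counts, threshold=30):
    """Keep words observed strictly more than `threshold` times."""
    return {word: n for word, n in counts.items() if n > threshold}


# Toy input (made-up letter strings, not real HMM-SA encodings).
toy_loops = ["ABCDE", "ABCD", "XYZ"]
toy_counts = count_structural_words(toy_loops)
print(toy_counts)
print(recurrent_words(toy_counts, threshold=1))
```

<p>The default threshold of 30 mirrors the "observed more than 30 times" criterion quoted in the abstract; the over-representation statistics mentioned afterwards would require a background model and are outside this sketch.</p>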
Dataset of the study "Insights into an original pocket-ligand pair classification: a promising tool for ligand profile prediction"
<p>On this webpage, we provide two files:</p>
<p>* <strong><em>DataDescription.csv</em></strong><br>* <strong><em>descriptor_values.csv</em></strong></p>
<p> </p>
<p>These two files contain the dataset of the study entitled "Insights into an original pocket-ligand pair classification: a promising tool for ligand profile prediction"</p>
<p>written by : S. Pérot, L. Regad, C. Reynès, O. Sperandio, M.A. Miteva, B.O. Villoutreix, A.C. Camproux</p>
<p> </p>
<p>(Affiliations : Univ. Paris Diderot, Sorbonne Paris Cité, INSERM UMRS 973, MTi, F-75205 Paris, France)</p>
<p> </p>
<p><strong>Abstract of the study :</strong><br>This study presents a multivariate approach relating ligand properties with protein pockets properties from the analysis of known ligand-protein interactions. We explore and optimize the pocket-ligand pairs space by combining pocket and ligand descriptors using Principal Component Analysis and developing a classification engine on this paired space, revealing five main clusters of pocket-ligand pairs sharing specific and similar structural or physicochemical properties. These pocket-ligand pair clusters highlight correspondences between pocket and ligand topological and physicochemical properties and capture relevant information with respect to protein-ligand interactions.</p>
<p><strong><br></strong></p>
<p><strong>Dataset Description:</strong></p>
<p><strong><em>file : DataDescription.csv</em></strong><br>This study is based on a dataset composed of 483 pocket-ligand pairs. To create a large training set of pocket-ligand pairs, we initially gathered the refined set from the PDBbind database (Wang et al., 2004; 2005) and the Astex test set (Hartshorn et al., 2007) and selected complexes that contained drug-like ligands (i.e. small chemical compounds). Two protein-ligand complex datasets were compiled for the training set. The first one is composed of 560 non-redundant protein-ligand structures, with a resolution better than 2.5 Å retrieved from the refined set of the PDBbind database. From these structures, we removed those containing metal ions or cofactors next to the co-crystallized ligand resulting in a selection of 432 structures. The second one is composed of 85 manually curated protein-ligand complexes from the Astex test set. As for the previous set, we removed the structures with some ions or cofactors next to the ligand, resulting in a selection of 51 structures. The resulting dataset corresponds at the end to 483 protein-ligand structures.</p>
<p>The Id and information about each pair are available in the following table <em><strong>DataDescription.csv</strong></em><br>This table contains:</p>
<p>* PDB code of the complex</p>
<p>* Protein chains contained in the PDB file. If the complex contains several protein chains, their chain Ids are separated by "/"</p>
<p>* Amino acid (AA) sequence of each chain. AA sequences correspond to the sequences of the crystallized proteins. If the complex contains several protein chains, their AA sequences are separated by "/"</p>
<p>* UniProt Id of each chain. If the complex contains several protein chains, their UniProt Ids are separated by "/". NoId means no UniProt Id was found for a given chain</p>
<p>* SMILES code of the ligand of interest</p>
<p>* PDB code of the ligand of interest</p>
<p>* Cluster assigned to the pocket-ligand pair</p>
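<p>A minimal sketch of how the "/"-separated fields of this table might be read with Python's standard csv module. The column names (pdb_code, chains, uniprot_ids, ligand_smiles) and the sample row are hypothetical placeholders, since the actual header of DataDescription.csv is not given here; only the "/" separator and the NoId marker come from the description above.</p>

```python
import csv
import io


def split_field(value):
    """Split a '/'-separated multi-chain field into a list of items."""
    return [part.strip() for part in value.split("/")]


def parse_row(row):
    """Map each chain Id to its UniProt Id (None when the file says NoId)."""
    chains = split_field(row["chains"])
    uniprot = [None if u == "NoId" else u
               for u in split_field(row["uniprot_ids"])]
    return {
        "pdb_code": row["pdb_code"],
        "chain_to_uniprot": dict(zip(chains, uniprot)),
        # SMILES strings may themselves contain '/', so this field is
        # deliberately kept whole and never split.
        "ligand_smiles": row["ligand_smiles"],
    }


# Made-up sample with hypothetical column names, for illustration only.
sample = io.StringIO(
    "pdb_code,chains,uniprot_ids,ligand_smiles\n"
    "1abc,A/B,P00000/NoId,C/C=C/C\n"
)
for row in csv.DictReader(sample):
    print(parse_row(row))
```

<p>To read the real file, the StringIO object would be replaced by an open file handle and the keys adjusted to the actual header names.</p>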
<p> </p>
<p><em><strong>file : descriptor_values.csv </strong></em></p>
<p>This file is a table containing the 24 pocket and ligand descriptors and the cluster assignment of each pocket-ligand pair (last column).</p>
<p> </p>
<p>pocket descriptors :</p>
<p>Based on the current literature, we developed some tools/scripts or used available packages to compute the following standard pocket descriptors on the binding cavities.</p>
<p> </p>
<p> </p>
<p><strong>pocket_volume</strong> : volume of the pocket estimated using Chimera software (Sanner et al. 1996)</p>
<p><strong>protomol_polarity_ratio</strong> : polarity ratio of the pocket (Eyrisch and Helms, 2007). It ranges from 0 (not polar) to 1 (polar).</p>
<p><strong>pocket_rugosity</strong> : pocket rugosity (Pettit and Bowie, 1999). Rugosity measures how rough a pocket is: a high value indicates a rough pocket.</p>
<p><strong>pocket_planarity</strong> : pocket planarity (Sugaya and Ikeda, 2009). The planarity ranges from 0 (concave) to 1 (flat). </p>
<p><strong>pocket_narrowness</strong> : pocket narrowness (Sugaya and Ikeda, 2009). The narrowness ranges from 0 (full circle) to 1 (line).</p>
<p><strong>pocket_lambda0,pocket_lambda2</strong> : The three moments of inertia correspond to the eigenvalues of the inertia matrix computed on the pocket. The moment of inertia of a virtual pocket with regard to a given axis describes how many probes the pocket has overall and how far each probe is from the axis. Consequently, the closer the moments of inertia are to one another, the more spherical the pocket is. Conversely, the more lambda0 differs from lambda2, the more cylindrical the pocket tends to be.</p>
<p><strong>pocket_hbond_acceptor</strong> : number of hydrogen-bond acceptors of the pocket (Schalon et al., 2008)</p>
<p><strong>pocket_hba.pour</strong> : % of hydrogen-bond acceptors of the pocket (Schalon et al., 2008)</p>
<p><strong>pocket_hbond_donor</strong> : number of hydrogen-bond donors of the pocket (Schalon et al., 2008)</p>
<p><strong>pocket_hbd.pour</strong> : % of hydrogen-bond donors of the pocket (Schalon et al., 2008)</p>
<p><strong>pocket_charge</strong> : pocket charge, computed as the difference between the number of positively charged amino acids and the number of negatively charged amino acids</p>
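<p>As an illustration of the pocket_charge definition, the sketch below counts charged residues from a list of three-letter residue names. Which residues count as charged is an assumption of this example (Lys and Arg as positive, Asp and Glu as negative, His treated as neutral); the dataset description does not state the convention actually used, and the function name is hypothetical.</p>

```python
# Assumed charge convention (not stated in the dataset description).
POSITIVE = {"LYS", "ARG"}
NEGATIVE = {"ASP", "GLU"}


def pocket_charge(residue_names):
    """Number of positively charged residues minus number of negatively
    charged residues among the residues lining a pocket."""
    names = [name.upper() for name in residue_names]
    n_positive = sum(name in POSITIVE for name in names)
    n_negative = sum(name in NEGATIVE for name in names)
    return n_positive - n_negative


# Made-up pocket composition, for illustration only.
example_pocket = ["LYS", "ASP", "GLU", "SER", "HIS", "ARG", "GLU"]
print(pocket_charge(example_pocket))
```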
<p> </p>
<p>ligand descriptors :<br>The ligand descriptors that are also defined for pockets were computed as described in the pocket descriptors section, while the remaining ones were computed using the software FAF-Drugs2 (Lagorce et al., 2008).</p>
<p> </p>
<p><strong>ligand_volume</strong> : ligand volume</p>
<p><strong>ligand_polarity_ratio</strong> : ligand polarity</p>
<p><strong>RotatableB</strong> : number of rotatable bonds of the ligand</p>
<p><strong>rot.pour</strong> : % of rotatable bonds of the ligand</p>
<p><strong>LogP</strong> : LogP of the ligand</p>
<p><strong>HBA</strong> : number of hydrogen-bond acceptors of the ligand</p>
<p><strong>HBD</strong> : number of hydrogen-bond donors of the ligand</p>
<p><strong>hbd.pour</strong> : % of hydrogen-bond donors of the ligand</p>
<p><strong>PSA</strong> : polar surface area of the ligand</p>
<p><strong>Charge</strong> : ligand charge</p>
<p><strong>ligand_lambda0, ligand_lambda2</strong> : first and third moments of inertia of the ligand</p>
<p> </p>
Regad et al. dataset
<p><strong>Description of the file DataDescription.csv</strong></p>
<p><br>This study is based on a dataset composed of 483 pocket-ligand pairs. To create a large training set of pocket-ligand pairs, we initially gathered the refined set from the PDBbind database (Wang et al., 2004; 2005) and the Astex test set (Hartshorn et al., 2007) and selected complexes that contained drug-like ligands (i.e. small chemical compounds). Two protein-ligand complex datasets were compiled for the training set. The first one is composed of 560 non-redundant protein-ligand structures, with a resolution better than 2.5 Å retrieved from the refined set of the PDBbind database. From these structures, we removed those containing metal ions or cofactors next to the co-crystallized ligand resulting in a selection of 432 structures. The second one is composed of 85 manually curated protein-ligand complexes from the Astex test set. As for the previous set, we removed the structures with some ions or cofactors next to the ligand, resulting in a selection of 51 structures. The resulting dataset corresponds at the end to 483 protein-ligand structures.</p>
<p>The id and information about each pair are available in the following table DataDescription.csv<br>This table contains:</p>
<p>* PDB code of the complex</p>
<p>* Protein chains contained in the PDB file. If the complex contains several protein chains, their chain Ids are separated by "/"</p>
<p>* Amino acid (AA) sequence of each chain. AA sequences correspond to the sequences of the crystallized proteins. If the complex contains several protein chains, their AA sequences are separated by "/"</p>
<p>* UniProt Id of each chain. If the complex contains several protein chains, their UniProt Ids are separated by "/". NoId means no UniProt Id was found for a given chain</p>
<p>* SMILES code of the ligand of interest</p>
<p>* PDB code of the ligand of interest</p>
<p>* Cluster assigned to the pocket-ligand pair</p>
<p> </p>
Statistical analysis of protein loops (development of a method for the systematic extraction of recurrent motifs within loop regions)
PARIS7-Bibliothèque centrale (751132105) / Sudoc, France