
Public Library of Science (PLOS)

Extreme Evolutionary Disparities Seen in Positive Selection across Seven Complex Diseases

Author: A Di Rienzo
Atul J. Butte
BF Voight
BM Rothschild
CC Babbitt
CC Spencer
EC Walsh
Erik Corona
EY Fung
FG Jimenez
J Askling
J Maddox
JB Kirsner
JD Wall
JL Mobley
Joel T. Dudley
John Hawks
K Ding
K Zhang
KJ Guinan
L Arbiza
LM Chevin
M Currat
M Sirota
MA Schaub
N Patil
PC Sabeti
PC Sabeti
PC Sabeti
PD Thomas
R Blekhman
R Plomin
S Fabre
S Myles
S Nejentsev
S Podder
SG Filler
SJ Richardson
ST Sherry
V Butty
Publication venue: Public Library of Science
Publication date: 17/08/2010

Positive selection is known to occur when the environment that an organism inhabits is suddenly altered, as is the case across recent human history. Genome-wide association studies (GWASs) have successfully illuminated disease-associated variation. However, whether human evolution is heading towards or away from disease susceptibility in general remains an open question. The genetic basis of common complex disease may partially be caused by positive selection events, which simultaneously increased fitness and susceptibility to disease. We analyze seven diseases studied by the Wellcome Trust Case Control Consortium to compare evidence for selection at every locus associated with disease. We take a large set of the most strongly associated SNPs in each GWA study in order to capture more hidden associations at the cost of introducing false positives into our analysis. We then search for signs of positive selection in this inclusive set of SNPs. There are striking differences between the seven studied diseases. We find alleles increasing susceptibility to Type 1 Diabetes (T1D), Rheumatoid Arthritis (RA), and Crohn's Disease (CD) underwent recent positive selection. There is more selection in alleles increasing, rather than decreasing, susceptibility to T1D. In the 80 SNPs most associated with T1D (p-value < 7.01×10⁻⁵) showing strong signs of positive selection, 58 alleles associated with disease susceptibility show signs of positive selection, while only 22 associated with disease protection show signs of positive selection. Alleles increasing susceptibility to RA are under selection as well. In contrast, selection in SNPs associated with CD favors protective alleles. These results inform the current understanding of disease etiology, shed light on potential benefits associated with the genetic basis of disease, and aid in the efforts to identify causal genetic factors underlying complex disease.
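To give a feel for how uneven the 58-versus-22 split reported for T1D is, the sketch below runs an exact two-sided binomial test against an even split. This is an illustrative check only, not the statistical procedure used by the authors: the function name is hypothetical, and the null hypothesis (each selected allele is equally likely to be a susceptibility or a protective allele, independently of the others) is an assumption that linked SNPs would violate.

```python
from math import comb

def two_sided_binomial_p(successes: int, trials: int) -> float:
    """Exact two-sided binomial test against p = 0.5 (assumed symmetric null)."""
    # Under p = 0.5 the distribution is symmetric, so double the larger tail.
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)

# Counts taken from the abstract: 58 susceptibility vs. 22 protective alleles.
susceptibility, protective = 58, 22
p_value = two_sided_binomial_p(susceptibility, susceptibility + protective)
print(f"{susceptibility} of {susceptibility + protective} alleles: p = {p_value:.2e}")
```

Because SNPs in the same region are often in linkage disequilibrium, the independence assumption is optimistic and the resulting p-value should be read as a rough indication rather than a formal result.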

Retrieving sequences of enzymes experimentally characterized but erroneously annotated: the case of the putrescine carbamoyltransferase

Author: A Bairoch
A Sekowska
B Barcelona-Andres
B Labedan
B Labedan
B Wargnies
C Tricot
C Vander Wauven
GH Gonnet
I Paulsen
I Schomburg
J Felsenstein
JA Gerlt
JP Simon
L Grivell
M Kanehisa
M Zuniga
PC Babbitt
PD Karp
R Apweiler
R Cunin
RJ Roon
S Dashuang
SE Brenner
T Janowitz
TA Hall
The Gene Ontology Consortium
V Stalon
Y Nakada
Y Nakada
Publication venue: BioMed Central
Publication date: 01/01/2004

BACKGROUND: Annotating genomes remains a hazardous task. Mistakes or gaps in such a complex process may occur when relevant knowledge is ignored, whether lost, forgotten or overlooked. This paper exemplifies an approach which could help to resuscitate such meaningful data. RESULTS: We show that a set of closely related sequences which have been annotated as ornithine carbamoyltransferases are actually putrescine carbamoyltransferases. This demonstration is based on the following points: (i) use of enzymatic data which had been overlooked, (ii) rediscovery of a short NH(2)-terminal sequence that allows a wrongly annotated ornithine carbamoyltransferase to be reannotated as a putrescine carbamoyltransferase, (iii) identification of conserved motifs that distinguish unambiguously between the two kinds of carbamoyltransferases, and (iv) comparative study of the gene context of these different sequences. CONCLUSIONS: We explain why this specific case of misannotation had not yet been described and draw attention to the fact that analogous instances must be rather frequent. We urge special caution when high sequence similarity is coupled with an apparent lack of biochemical information. Moreover, from the point of view of genome annotation, proteins which have been studied experimentally but are not correlated with sequence data in current databases qualify as "orphans", just as unassigned genomic open reading frames do. The strategy we used in this paper to bridge such gaps in knowledge could work whenever it is possible to collect a body of facts about experimental data, homology, unnoticed sequence data, and accurate information about gene context.

Public Library of Science (PLOS)

DI-fusion

Using Sequence Similarity Networks for Visualization of Relationships Across Diverse Protein Superfamilies

Author: A Bateman
AC Howlett
AJ Enright
AJ Enright
AT Adai
B Rost
B Xu
CL Huang
CS Goh
D Medini
DH Huson
DL Wheeler
EC Meng
G Manning
HM Holden
Holly J. Atkinson
I. King Jordan
J Bockaert
J Boudeau
J Dvorák
J Kazius
JH Morris
JH Morris
JM Young
John H. Morris
JP Huelsenbeck
K Palczewski
L Buck
L Song
M Ashburner
M Bashton
M Murakami
N Saitou
P Bhaumik
P Shannon
P Storz
Patricia C. Babbitt
PC Babbitt
R Wiese
R Wiese
RC Edgar
RD Finn
RS Hall
SC Pegg
SF Altschul
SF Altschul
SG Rasmussen
T Frickey
T Warne
Thomas E. Ferrin
TT Nguyen
V Cherezov
VP Jaakola
W Li
Publication venue: Public Library of Science
Publication date: 01/01/2009

The dramatic increase in heterogeneous types of biological data—in particular, the abundance of new protein sequences—requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity—GPCRs and kinases from humans, and the crotonase superfamily of enzymes—we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships
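The core idea of a sequence similarity network can be sketched in a few lines: sequences are nodes, an edge is kept when a pairwise similarity score meets a chosen threshold, and the connected components are the visible clusters. The example below is a minimal sketch under assumptions, not the authors' pipeline: the function name, the sequence labels, and the scores (imagined as something like -log10 of a BLAST E-value) are all hypothetical.

```python
from collections import defaultdict

def similarity_clusters(scores, threshold):
    """Group nodes joined by edges whose score is at least `threshold`."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)  # registers both nodes, edge or not
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

# Hypothetical pairwise scores between five sequences (higher = more similar).
example_scores = {("A", "B"): 40, ("B", "C"): 35, ("C", "D"): 5, ("D", "E"): 50}

print(similarity_clusters(example_scores, threshold=30))
print(similarity_clusters(example_scores, threshold=60))
```

Raising the threshold breaks the network into smaller groups and lowering it merges them, which is why the choice of cutoff is one of the caveats to keep in mind when reading such networks.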

CiteSeerX

ComPath: comparative enzyme analysis and annotation in pathway/subsystem contexts

Author: A Andreeva
A Bateman
A Marchler-Bauer
A Osterman
AL Barabási
Gene Ontology Consortium
The UniProt Consortium
CJA Sigrist
CM Zmasek
DA Benson
DH Haft
HM Berman
HW Ma
J Wu
K Choi
Kwangmin Choi
L Pireddu
M Kanehisa
M Kanehisa
M Madera
N Hulo
P Stothard
PC Babbitt
PD Karp
R Caspi
R Overbeek
RA George
S Kim
S Kim
S Kim
S Kim
SCH Pegg
SF Altschul
Sun Kim
V BATAGELJL
VM Markowitz
W Thompson
WR Pearson
Y Ye
Y Zheng
YI Wolf
Publication venue: BioMed Central
Publication date: 01/03/2008

Background: Once a new genome is sequenced, one of the important questions is to determine the presence and absence of biological pathways. Analysis of biological pathways in a genome is a complicated task since a number of biological entities are involved in pathways and biological pathways in different organisms are not identical. Computational pathway identification and analysis thus involves a number of computational tools and databases and is typically done in comparison with pathways in other organisms. This computational requirement is much beyond the capability of biologists, so information systems for reconstructing, annotating, and analyzing biological pathways are much needed. We introduce a new comparative pathway analysis workbench, ComPath, which integrates various resources and computational tools using an interactive spreadsheet-style web interface for reliable pathway analyses. Results: ComPath allows users to compare biological pathways in multiple genomes using a spreadsheet-style web interface where various sequence-based analyses can be performed to compare enzymes (e.g. sequence clustering) and pathways (e.g. pathway hole identification), to search a genome for de novo prediction of enzymes, or to annotate a genome in comparison with reference genomes of choice. To fill in pathway holes or make de novo enzyme predictions, multiple computational methods such as FASTA, Whole-HMM, CSR-HMM (a method of our own introduced in this paper), and PDB-domain search are integrated in ComPath. Our experiments show that FASTA and CSR-HMM search methods generally outperform Whole-HMM and PDB-domain search methods in terms of sensitivity, but FASTA search performs poorly in terms of specificity, detecting more false positives as the E-value cutoff increases. Overall, the CSR-HMM search method performs best in terms of both sensitivity and specificity. Gene neighborhood and pathway neighborhood (global network) visualization tools can be used to get context information that is complementary to the conventional KEGG map representation. Conclusion: ComPath is an interactive workbench for pathway reconstruction, annotation, and analysis where experts can perform various sequence, domain, and context analyses using an intuitive and interactive spreadsheet-style interface.

InterPro in 2017-beyond protein family and domain annotations

Author: Attwood TK
Babbitt PC
Bateman A
Bork P
Bridge AJ
Chang HY
Dosztányi Z
El-Gebali S
Finn RD
Fraser M
Gough J
Haft D
Holliday GL
Huang H
Huang X
Letunic I
Lopez R
Lu S
Marchler-Bauer A
Mi H
Mistry J
Mitchell AL
Natale DA
Necci M
Nuka G
Orengo CA
Park Y
Pesseat S
Piovesan D
Potter SC
Rawlings ND
Redaschi N
Richardson L
Rivoire C
Sangrador-Vegas A
Sigrist C
Sillitoe I
Smithers B
Squizzato S
Sutton G
Thanki N
Thomas PD
Tosatto SC
Wu CH
Xenarios I
Yeh LS
Young SY
Publication date: 29/11/2016

InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.

UCL Discovery

Is EC class predictable from reaction mechanism?

Author: A Statnikov
AG McDonald
BV Dasarathy
BW Matthews
C Andreini
C Andreini
DARS Latino
DE Almonacid
DE Almonacid
GL Holliday
GL Holliday
GL Holliday
GL Holliday
GL Holliday
GL Holliday
GL Holliday
GL Holliday
IUBMB
J Gorodkin
J Menke
John BO Mitchell
JW Torrance
K Astikainen
KM Borgwardt
L Breiman
L De Ferrari
LD Hughes
M Aizerman
M Kanehisa
M Leber
N Furnham
N Nagano
N Nagano
Neetika Nath
NM O'Boyle
O Sacher
PC Babbitt
PD Dobson
R Lowe
RD Uriarte
SA Rahman
SCH Pegg
SCH Pegg
T Bray
T Bylander
V Egelhofer
VN Vapnik
WS Noble
X Hu
Y Yamanishi
Publication venue: BioMed Central
Publication date: 01/01/2012

We thank the Scottish Universities Life Sciences Alliance (SULSA) and the Scottish Overseas Research Student Awards Scheme of the Scottish Funding Council (SFC) for financial support. Background: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN). We consider two ways of encoding the reaction mechanism in descriptors, and also three approaches that encode only the overall chemical reaction. Both cross-validation and also an external test set are used. Results: The three descriptor sets encoding overall chemical transformation perform better than the two descriptions of mechanism. SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases and hydrolases are relatively well predicted by all types of descriptor; isomerases are well predicted by overall reaction descriptors but not by mechanistic ones. Conclusions: Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective. kNN is less successful, given the role that non-local information plays in successful classification. We note also that, despite a lack of clarity in the literature, EC number prediction is not a single problem; the challenge of predicting protein function from available sequence data is quite different from assigning an EC classification from a cheminformatics representation of a reaction.

University of St. Andrews - Pure

St Andrews Research Repository

Target selection and annotation for the structural genomics of the amidohydrolase and enolase superfamilies

Author: A Andreeva
A Sakai
A Weeks
AE Todd
Andrej Sali
C Nowlan
CH Wu
CM Seibert
D Vitkup
DA Benson
DL Wheeler
EF Pettersen
F Melo
Frank M. Raushel
H Berman
HJ Imker
J Akana
J Gough
J Lee
J. Michael Sauder
JA Gerlt
JA Gerlt
JA Gerlt
JB Bonanno
JB Thoden
JC Hermann
JC Hermann
JC Norvell
JC Venter
JE Vick
JE Vick
Jeffrey B. Bonanno
Jennifer J. Seffernick
JF Rakus
JJ Irwin
John A. Gerlt
L Holm
L Song
L Williams
Libusha Kelly
Margaret E. Glasner
Mark R. Chance
Matthew P. Jacobson
ME Glasner
ME Glasner
ME Glasner
N Eswar
N Nagano
Narayanan Eswar
P Shannon
Patricia C. Babbitt
PC Babbitt
R Marti-Arbona
R Marti-Arbona
R Marti-Arbona
R Sanchez
R Tyagi
Ranyee Chiang
RS Hall
RZ Liao
SC Almo
SC Pegg
SD Brown
SF Altschul
Shoshana D. Brown
SL Schafer
Stephen K. Burley
Steven C. Almo
Subramanyam Swaminathan
TN Porter
TT Nguyen
U Pieper
Ursula Pieper
WS Yew
WS Yew
WS Yew
Xiaojing Zheng
Y Li
Publication venue: Springer Netherlands
Publication date: 01/01/2009

To study the substrate specificity of enzymes, we use the amidohydrolase and enolase superfamilies as model systems; members of these superfamilies share a common TIM barrel fold and catalyze a wide range of chemical reactions. Here, we describe a collaboration between the Enzyme Specificity Consortium (ENSPEC) and the New York SGX Research Center for Structural Genomics (NYSGXRC) that aims to maximize the structural coverage of the amidohydrolase and enolase superfamilies. Using sequence- and structure-based protein comparisons, we first selected 535 target proteins from a variety of genomes for high-throughput structure determination by X-ray crystallography; 63 of these targets were not previously annotated as superfamily members. To date, 20 unique amidohydrolase and 41 unique enolase structures have been determined, increasing the fraction of sequences in the two superfamilies that can be modeled based on at least 30% sequence identity from 45% to 73%. We present case studies of proteins related to uronate isomerase (an amidohydrolase superfamily member) and mandelate racemase (an enolase superfamily member), to illustrate how this structure-focused approach can be used to generate hypotheses about sequence–structure–function relationships

The LabelHash algorithm for substructure matching

Background: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Results: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95 % sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs a
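The two-phase structure described above (preprocess once for fast lookup of partial matches, then expand partial matches to complete ones) can be illustrated with a toy analogue. The sketch below is not the LabelHash implementation or its data structures: the function names, the pair-based index, the single distance cutoff, and the example coordinates are all hypothetical simplifications. It assumes a motif is a list of at least two residue labels whose members must all lie within `cutoff` of each other.

```python
from collections import defaultdict
from itertools import combinations
from math import dist

def build_index(structures, cutoff):
    """Phase 1: hash every nearby residue pair by its sorted label pair."""
    index = defaultdict(list)
    for name, residues in structures.items():
        for (i, (li, pi)), (j, (lj, pj)) in combinations(enumerate(residues), 2):
            if dist(pi, pj) <= cutoff:
                index[tuple(sorted((li, lj)))].append((name, i, j))
    return index

def match_motif(motif, structures, index, cutoff):
    """Phase 2: expand indexed pair hits (partial matches) to full matches."""
    seed_key = tuple(sorted(motif[:2]))
    matches = set()

    def expand(name, chosen, remaining):
        if not remaining:
            matches.add((name, tuple(sorted(chosen))))
            return
        residues = structures[name]
        for k, (label, pos) in enumerate(residues):
            if k in chosen or label != remaining[0]:
                continue
            # Keep the candidate only if it is close to every matched residue.
            if all(dist(pos, residues[c][1]) <= cutoff for c in chosen):
                expand(name, chosen + [k], remaining[1:])

    for name, i, j in index.get(seed_key, []):
        expand(name, [i, j], list(motif[2:]))
    return sorted(matches)

# Hypothetical structures: lists of (residue label, xyz coordinates).
structures = {
    "protA": [("ASP", (0, 0, 0)), ("GLU", (3, 0, 0)), ("LYS", (0, 4, 0)), ("ALA", (20, 0, 0))],
    "protB": [("ASP", (0, 0, 0)), ("GLU", (3, 0, 0)), ("LYS", (30, 0, 0))],
}
index = build_index(structures, cutoff=6.0)
hits = match_motif(["ASP", "GLU", "LYS"], structures, index, cutoff=6.0)
print(hits)
```

The index is built once and reused for every motif, so the per-motif cost is dominated by expanding the seed hits, which mirrors the division of labour between the two phases in the abstract.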

CiteSeerX

University of Toronto Research Repository

clusterMaker: a multi-algorithm clustering plugin for Cytoscape

Background: In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL. Results: Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section. Conclusions: The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.