
Oxford University Research Archive

FigShare

Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'

Author: A Andreeva
A Bairoch
A Bateman
A Kelil
AP Bradley
B Rost
B Thiruv
BE Blaisdell
CH Wu
D Barthel
EM Taylor
F Pearl
F Ronquist
G Didier
G Fichant
G Reinert
GW Stuart
J Felsenstein
J Felsenstein
J Felsenstein
J Lowe
J Soppa
JM Word
JP Egan
JP Huelsenbeck
K Komatsu
KP Wu
LP Chew
M Hirano
M Sierk
N Cobbe
N Krasnogor
N Saitoh
P Ferragina
Qi Dai
S Hochreiter
S Kumar
S Vinga
S Vinga
SF Altschul
SF Altschul
TD Pham
TD Pham
Tianming Wang
TJ Wu
TJ Wu
W Li
Y Fujioka
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: Many proposed statistical measures can efficiently compare protein sequences to infer protein structure, function, and evolutionary information. They share the same idea of using k-word frequencies of protein sequences. Until now, however, information on the sequences related to a given protein has not been used for protein sequence comparison. This paper proposes a scheme to construct a protein 'sequence space' associated with the protein sequences related to the given protein, and compares the performance of statistical measures with and without the information in the protein 'sequence space'. The paper also presents two statistical measures for proteins: gre.k (generalized relative entropy) and gsm.k (gapped similarity measure). Results: We tested statistical measures, with and without protein 'sequence space', on three data sets. This not only offers a systematic and quantitative experimental assessment of these statistical measures, but also naturally complements the available comparisons of statistical measures based on protein sequence. Moreover, we compared our statistical measures with alignment-based measures and the existing statistical measures. The experiments were grouped into two sets. The first, performed via ROC (receiver operating characteristic) analysis, assesses the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second assesses how well our measure performs in phylogenetic analysis. Based on the experiments, several conclusions can be drawn and, from them, novel guidelines for the use of protein 'sequence space' and statistical measures were obtained. Conclusion: Alignment-based measures have a clear advantage when the data are highly redundant. The most efficient statistical measure is the novel gsm.k introduced in this article, followed by cos.k. When the data become less redundant, gre.k, proposed by us, achieves better performance, but all the other measures perform poorly on classification tasks. Almost all the statistical measures improve by exploring the information in 'sequence space' as the word length increases, especially for less redundant data. The reasonable results of the phylogenetic analysis confirm that Gdis.k based on 'sequence space' is a reliable measure for phylogenetic analysis. In summary, our quantitative analysis verifies that exploring the information in 'sequence space' is a promising way to improve the ability of statistical measures to compare proteins.
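As a rough illustration of the k-word idea this abstract builds on, the sketch below counts overlapping k-words in two sequences and takes the cosine of the two count vectors. This is the generic textbook form of a cosine measure on k-word counts. The abstract does not give the paper's exact definition of cos.k, nor how the 'sequence space' is constructed, so the function names (`kword_counts`, `cosine_kword`) and the toy sequences are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def kword_counts(seq, k):
    """Count every overlapping k-word (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_kword(seq_a, seq_b, k=2):
    """Cosine of the angle between the k-word count vectors of two sequences.

    Hypothetical helper: a generic cosine measure, not necessarily the
    cos.k definition used in the paper.
    """
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    # Only words present in both sequences contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # A sequence shorter than k has no k-words; treat similarity as 0.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    print(cosine_kword("MKVLAAGIVG", "MKVLSAGIVG", k=2))
```

Identical sequences score 1 and sequences with no shared k-word score 0. Larger k makes the count vectors sparser, which is one reason word length matters for measures of this kind.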

Springer - Publisher Connector

Public Library of Science (PLOS)

Classification of Protein Kinases on the Basis of Both Kinase and Non-Kinase Regions

Author: A Bateman
A Bleasby
A Champion
A Kelil
A Krupa
A Krupa
A Krupa
A Krupa
A Marchler-Bauer
C Bradham
C Dardick
C Smith
D Hardie
D Miranda-Saavedra
DM Martin
E Han
ED Scheeff
G Manning
G Manning
G Plowman
J Goldberg
J Tchieu
J Toshima
J Zheng
Jason E. Stajich
Juliette Martin
K Anamika
K Anamika
K Anamika
K Deshmukh
Krishanpal Anamika
M Gribskov
M Jacobs
M Kostich
N Kannan
N Srinivasan
Narayanaswamy Srinivasan
O Buzko
P Cohen
P Gouet
P Ward
R Niedner
R Tatusov
R Tatusov
S Caenepeel
S Cheek
S Cheek
S Datta
S Hanks
S Hanks
S Katamine
S Shiu
SR Eddy
T Hunter
VS Gowri
Publication venue: Public Library of Science
Publication date: 15/09/2010

BACKGROUND: Protein phosphorylation is a generic way to regulate signal transduction pathways in all kingdoms of life. In many organisms, it is achieved by the large family of Ser/Thr/Tyr protein kinases which are traditionally classified into groups and subfamilies on the basis of the amino acid sequence of their catalytic domains. Many protein kinases are multi-domain in nature but the diversity of the accessory domains and their organization are usually not taken into account while classifying kinases into groups or subfamilies. METHODOLOGY: Here, we present an approach which considers amino acid sequences of complete gene products, in order to suggest refinements in sets of pre-classified sequences. The strategy is based on alignment-free similarity scores and iterative Area Under the Curve (AUC) computation. Similarity scores are computed by detecting common patterns between two sequences and scoring them using a substitution matrix, with a consistent normalization scheme. This allows us to handle full-length sequences, and implicitly takes into account domain diversity and domain shuffling. We quantitatively validate our approach on a subset of 212 human protein kinases. We then employ it on the complete repertoire of human protein kinases and suggest a few qualitative refinements in the subfamily assignment stored in the KinG database, which is based on catalytic domains only. Based on our new measure, we delineate 37 cases of potential hybrid kinases: sequences for which classical classification based entirely on catalytic domains is inconsistent with the full-length similarity scores computed here, which implicitly consider multi-domain nature and regions outside the catalytic kinase domain. We also provide some examples of hybrid kinases of the protozoan parasite Entamoeba histolytica. CONCLUSIONS: The implicit consideration of multi-domain architectures is a valuable inclusion to complement other classification schemes. The proposed algorithm may also be employed to classify other families of enzymes with multi-domain architecture.
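To make the AUC step in this abstract concrete, the sketch below computes the area under the ROC curve for one set of similarity scores using the standard pairwise (Mann-Whitney) formulation: the fraction of (same-subfamily, different-subfamily) score pairs in which the same-subfamily score is higher, with ties counting as half. The abstract does not describe the iterative procedure or the pattern-based scoring, so this shows only a single generic AUC computation. The function name `auc_from_scores` and the example scores are hypothetical.

```python
def auc_from_scores(positive_scores, negative_scores):
    """Area under the ROC curve from two lists of similarity scores.

    positive_scores: scores for pairs that should rank high
                     (e.g. members of the same subfamily).
    negative_scores: scores for pairs that should rank low.
    Uses the pairwise Mann-Whitney formulation; ties count as 0.5.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Invented scores: one positive (0.3) ranks below one negative (0.5).
    print(auc_from_scores([0.9, 0.3], [0.5, 0.1]))  # 3 of 4 pairs ranked correctly
```

An AUC of 1.0 means the scores separate the two groups perfectly, and 0.5 is what random scores would give. The quadratic pairwise loop is kept for clarity. A rank-based computation would be preferable for large score sets.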

Public Library of Science (PLOS)

Genome-Wide Comparative Gene Family Classification

Author: A Barriere
A Heger
A Jaccard
A Kelil
A Krause
A Krause
A Paccanaro
AJ Enright
AJ Enright
AJ Vilella
C-Y Chen
CF Higgins
CH Wu
Christian Frech
CP Ponting
D Lee
E Bolten
E Jacoby
ER Troemel
ES Lander
EV Kriventseva
EV Kriventseva
EV Kriventseva
F Abascal
F Tekaia
G Yona
H Li
HM Robertson
HM Robertson
HM Robertson
HM Robertson
IV Tetko
J Huerta-Cepas
J Schultz
JA Sheps
JC Venter
JD Thompson
JH Thomas
JH Thomas
JH Thomas
JP Demuth
K Tamura
LD Stein
MO Dayhoff
N Chen
N Hulo
N Kaplan
Nansheng Chen
P Pipenbacher
P Sperisen
PK Wall
Q Ma
RD Finn
Robert DeSalle
S Aftab
S Kim
S Nakanishi
SA Rahman
T Meinel
T Wittkop
TJ Harlow
Y Chen
Y Loewenstein
Z Zhao
Publication venue: Public Library of Science
Publication date: 01/01/2010

Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done to the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species.

Simon Fraser University Institutional Repository

CLUSS: Clustering of protein sequences based on a new similarity measure

Author: A Krause
Abdellali Kelil
AJ Enright
Alain Fleury
C Notredame
D Higgins
ELL Sonnhammer
F Titgemeyer
G Reinert
G Yona
H Lodish
IV Tetko
J Felsenstein
J Heringa
J Rocha
JD Thompson
JD Thompson
JH Ward
JH Ward
JS Varré
K Katoh
K Sjölander
K Sjölander
M Ike
M Kimura
MO Dayhoff
MY Leung
N Côté
N Wicker
P Pipenbacher
R Jothi
RC Edgar
RC Edgar
RO Duda
Ryszard Brzezinski
S Fanning
S Henikoff
S Karlin
S Karlin
S Karlin
S Vinga
SF Altschul
SF Altschul
Shengrui Wang
T Fukamizo
T Ishimizu
V Batagelj
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: The rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to the building of such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is especially designed for application to non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions". Results: To show the effectiveness of CLUSS, we performed an extensive clustering on the COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms. These comparisons were made on hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity. Conclusion: We have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for alignment-dependent algorithms.

Springer - Publisher Connector

A bioinformatics pipeline to search functional motifs within whole-proteome data: a case study of poxviruses

Author: A Kelil
A Via
ANN Ba
B Moss
C Grangeasse
CJ Sigrist
CM Robinson
CM Robinson
D Dou
G Kleiger
H Dinkel
H Horn
H Luo
H Sobhy
H Sobhy
Haitham Sobhy
I Kirmitzoglou
JG Smith
K Kadaveru
K Roey Van
M Seiler
MA Huntley
N Palopoli
NE Davey
P Tompa
RJ Edwards
S Kosugi
S Wolff
T Mi
TG Senkevich
TL Bailey
W Haerty
Publication venue: Springer Science and Business Media LLC

Prioritizing Health Care Strategies to Reduce Childhood Mortality

IMPORTANCE: Although child mortality trends have decreased worldwide, deaths among children younger than 5 years of age remain high and disproportionately circumscribed to sub-Saharan Africa and Southern Asia. Tailored and innovative approaches are needed to increase access, coverage, and quality of child health care services to reduce mortality, but an understanding of health system deficiencies that may have the greatest impact on mortality among children younger than 5 years is lacking. OBJECTIVE: To investigate which health care and public health improvements could have prevented the most stillbirths and deaths in children younger than 5 years using data from the Child Health and Mortality Prevention Surveillance (CHAMPS) network. DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study used longitudinal, population-based, and mortality surveillance data collected by CHAMPS to understand preventable causes of death. Overall, 3390 eligible deaths across all 7 CHAMPS sites (Bangladesh, Ethiopia, Kenya, Mali, Mozambique, Sierra Leone, and South Africa) between December 9, 2016, and December 31, 2021 (1190 stillbirths, 1340 neonatal deaths, 860 infant and child deaths), were included. Deaths were investigated using minimally invasive tissue sampling (MITS), a postmortem approach using biopsy needles for sampling key organs and fluids. MAIN OUTCOMES AND MEASURES: For each death, an expert multidisciplinary panel reviewed case data to determine the plausible pathway and causes of death. If the death was deemed preventable, the panel identified which of 10 predetermined health system gaps could have prevented the death. The health system improvements that could have prevented the most deaths were evaluated for each age group: stillbirths, neonatal deaths (aged <28 days), and infant and child deaths (aged 1 month to <5 years). RESULTS: Of 3390 deaths, 1505 (44.4%) were female and 1880 (55.5%) were male; sex was not recorded for 5 deaths. 
Of all deaths, 3045 (89.8%) occurred in a healthcare facility and 344 (11.9%) in the community. Overall, 2607 (76.9%) were deemed potentially preventable: 883 of 1190 stillbirths (74.2%), 1010 of 1340 neonatal deaths (75.4%), and 714 of 860 infant and child deaths (83.0%). Recommended measures to prevent deaths were improvements in antenatal and obstetric care (recommended for 588 of 1190 stillbirths [49.4%], 496 of 1340 neonatal deaths [37.0%]), clinical management and quality of care (stillbirths, 280 [23.5%]; neonates, 498 [37.2%]; infants and children, 393 of 860 [45.7%]), health-seeking behavior (infants and children, 237 [27.6%]), and health education (infants and children, 262 [30.5%]). CONCLUSIONS AND RELEVANCE: In this cross-sectional study, interventions prioritizing antenatal, intrapartum, and postnatal care could have prevented the most deaths among children younger than 5 years because 75% of deaths among children younger than 5 were stillbirths and neonatal deaths. Measures to reduce mortality in this population should prioritize improving existing systems, such as better access to antenatal care, implementation of standardized clinical protocols, and public education campaigns
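The preventable-death percentages reported in this abstract can be rechecked directly from the counts it gives. The short sketch below recomputes each age-group share, the overall share, and the proportion of all deaths that were stillbirths or neonatal deaths. The counts are copied from the abstract. The variable and function names are illustrative only.

```python
# (preventable deaths, total deaths) per age group, as stated in the abstract.
groups = {
    "stillbirths": (883, 1190),
    "neonatal deaths": (1010, 1340),
    "infant and child deaths": (714, 860),
}


def percent(part, whole):
    """Percentage of whole represented by part, rounded to one decimal."""
    return round(100 * part / whole, 1)


per_group = {name: percent(p, t) for name, (p, t) in groups.items()}

total_preventable = sum(p for p, _ in groups.values())
total_deaths = sum(t for _, t in groups.values())
overall = percent(total_preventable, total_deaths)

# Share of all deaths that were stillbirths or neonatal deaths.
early_share = percent(
    groups["stillbirths"][1] + groups["neonatal deaths"][1], total_deaths
)

if __name__ == "__main__":
    for name, value in per_group.items():
        print(f"{name}: {value}% deemed preventable")
    print(f"overall: {total_preventable} of {total_deaths} ({overall}%)")
    print(f"stillbirths + neonatal deaths: {early_share}% of all deaths")
```

The recomputed values match the per-group and overall figures in the abstract. The stillbirth and neonatal share comes to 74.6%, in line with the rounded 75% cited in the conclusions.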

LSHTM Research Online