
Springer - Publisher Connector

Partitioning clustering algorithms for protein sequence data sets

Author: A Enright
A Herger
A Krause
DW Mount
E Bolten
E Kriventseva
F Can
G Yona
H Cathy
H Spath
J Hartigan
J Shi
KJ Anil
L Kaufman
Mohamed Limam
N Essoussi
Nadia Essoussi
O Sasson
P Cabena
P Clote
P Pipenbacher
P Sperisen
R Ng
R Tatusov
RC Dubes
S Altschul
S Henikoff
S Schneckener
S Van Dongen
SB Needleman
SE Brenner
Sondes Fayech
TF Smith
UM Fayyad
V Faber
V Guralnik
WR Pearson
Z Wu
Publication venue: BioMed Central
Publication date: 01/04/2009

Abstract. Background: Genome-sequencing projects are currently producing an enormous number of new sequences, causing protein sequence databases to grow rapidly. The unsupervised classification of these data into functional groups or families (clustering) has become one of the principal research objectives in structural and functional genomics. Computer programs that automatically and accurately classify sequences into families have become a necessity. A significant number of methods have addressed the clustering of protein sequences, and most of them fall into three major groups: hierarchical, graph-based and partitioning methods. Among the sequence clustering methods in the literature, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are used extensively in other fields, few applications have been reported in protein sequence clustering. It has not been fully demonstrated whether partitioning methods can be applied to protein sequence data and whether they can be efficient compared with the published clustering methods. Methods: We developed four partitioning clustering approaches that use the Smith-Waterman local-alignment algorithm to determine pairwise similarities of sequences. Four different sets of protein sequences were used as evaluation data sets for the proposed methods. Results: We show that these methods outperform several other published clustering methods in terms of correctly predicting a classifier and especially in terms of the correctness of the provided prediction. The software is available to academic users from the authors upon request.
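The general idea of partitioning sequences from pairwise local-alignment scores can be illustrated with a minimal sketch. It is not one of the four methods from the paper, whose details the abstract does not give. It assumes a simple match/mismatch scoring scheme with a linear gap penalty (real protein work would typically use a substitution matrix) and a generic k-medoids-style loop; the function names and parameter values are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score of a and b (assumed toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity_matrix(seqs):
    """Symmetric matrix of pairwise Smith-Waterman scores."""
    n = len(seqs)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = smith_waterman_score(seqs[i], seqs[j])
    return sim


def assign(sim, medoids):
    """Label each sequence with the index of its most similar medoid."""
    return [max(range(len(medoids)), key=lambda c: sim[i][medoids[c]])
            for i in range(len(sim))]


def k_medoids(sim, medoids, max_iter=20):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(sim, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if its cluster emptied
                new_medoids.append(old)
                continue
            # New medoid: the member most similar, in total, to its own cluster.
            new_medoids.append(
                max(members, key=lambda m: sum(sim[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(sim, medoids), medoids
```

For example, calling `k_medoids(similarity_matrix(seqs), [0, 3])` on a small list of sequences partitions them around two medoids, starting from the sequences at positions 0 and 3. The initial medoids are supplied by the caller here to keep the sketch deterministic.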

CLUSS: Clustering of protein sequences based on a new similarity measure

Author: A Krause
Abdellali Kelil
AJ Enright
Alain Fleury
C Notredame
D Higgins
ELL Sonnhammer
F Titgemeyer
G Reinert
G Yona
H Lodish
IV Tetko
J Felsenstein
J Heringa
J Rocha
JD Thompson
JH Ward
JS Varré
K Katoh
K Sjölander
M Ike
M Kimura
MO Dayhoff
MY Leung
N Côté
N Wicker
P Pipenbacher
R Jothi
RC Edgar
RO Duda
Ryszard Brzezinski
S Fanning
S Henikoff
S Karlin
S Vinga
SF Altschul
Shengrui Wang
T Fukamizo
T Ishimizu
V Batagelj
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: The rapid growth of available protein data makes clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to building such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is specifically designed for non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions". Results: To show the effectiveness of CLUSS, we performed an extensive clustering on the COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms. These comparisons were made on hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity. Conclusion: We have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for alignment-dependent algorithms.

Springer - Publisher Connector

Public Library of Science (PLOS)

Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets

Author: A Krogh
A Lupas
A Sali
AC McHardy
Adam Godzik
AJ Enright
B Rodriguez-Brito
BE Suzek
David Jones
DB Rusch
DH Huson
EF DeLong
FE Angly
G Yona
GW Tyson
J Park
JA Cuff
JC Venter
JD Bendtsen
JD Thompson
John C. Wooley
K Mavromatis
L Holm
L Krause
L Rychlewski
ML Tress
O Sasson
P Pipenbacher
PD Schloss
R Apweiler
RL Tatusov
S Mika
S Yooseph
SF Altschul
SG Tringe
SR Eddy
SR Gill
U Hobohm
W Li
Weizhong Li
Publication venue: Public Library of Science
Publication date: 01/01/2008

BACKGROUND: The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size but also by many other features, defy conventional analysis and annotation methods. METHODOLOGY/PRINCIPAL FINDINGS: In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations. CONCLUSION/SIGNIFICANCE: Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of the original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.
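The abstract does not describe the modified CD-HIT algorithm itself. As a rough illustration of the greedy incremental clustering that CD-HIT-style tools are generally built on, the sketch below processes sequences from longest to shortest and lets each one either join the first sufficiently similar representative or found a new cluster. The shared k-mer fraction used here is an assumed stand-in for sequence identity, not CD-HIT's actual measure, and all names and thresholds are hypothetical.

```python
def kmer_set(seq, k=3):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(short, long, k=3):
    """Fraction of the shorter sequence's k-mers also found in the longer one."""
    short_kmers = kmer_set(short, k)
    if not short_kmers:
        return 0.0
    return len(short_kmers & kmer_set(long, k)) / len(short_kmers)


def greedy_cluster(seqs, threshold=0.6, k=3):
    """Greedy incremental clustering: longest sequences become representatives."""
    # Stable sort, so ties in length keep their input order.
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    clusters = {}  # representative index -> list of member indices
    for i in order:
        for rep in clusters:
            if shared_kmer_fraction(seqs[i], seqs[rep], k) >= threshold:
                clusters[rep].append(i)  # join the first matching cluster
                break
        else:
            clusters[i] = [i]  # no representative is close enough
    return clusters
```

Because each sequence is compared only against cluster representatives rather than against every other sequence, the number of comparisons stays far below all-against-all, which is the property that makes this style of clustering attractive for very large datasets.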

eScholarship - University of California

Genome-Wide Comparative Gene Family Classification

Author: A Barriere
A Heger
A Jaccard
A Kelil
A Krause
A Paccanaro
AJ Enright
AJ Vilella
C-Y Chen
CF Higgins
CH Wu
Christian Frech
CP Ponting
D Lee
E Bolten
E Jacoby
ER Troemel
ES Lander
EV Kriventseva
F Abascal
F Tekaia
G Yona
H Li
HM Robertson
IV Tetko
J Huerta-Cepas
J Schultz
JA Sheps
JC Venter
JD Thompson
JH Thomas
JP Demuth
K Tamura
LD Stein
MO Dayhoff
N Chen
N Hulo
N Kaplan
Nansheng Chen
P Pipenbacher
P Sperisen
PK Wall
Q Ma
RD Finn
Robert DeSalle
S Aftab
S Kim
S Nakanishi
SA Rahman
T Meinel
T Wittkop
TJ Harlow
Y Chen
Y Loewenstein
Z Zhao
Publication venue: Public Library of Science
Publication date: 01/01/2010

Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done to the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter settings: different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species.
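The comparative strategy described above, choosing the parameter value that best reproduces curated families in a reference species, can be sketched as a simple search over candidate values. The abstract does not say how agreement with the curated families is scored, so the Jaccard index over co-clustered gene pairs used here is an assumption, and `cluster_fn` is a hypothetical callable standing in for a run of a clustering program at a given parameter value.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Set of unordered gene pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs


def agreement(predicted, reference):
    """Jaccard index between the co-clustered pairs of two clusterings."""
    p = co_clustered_pairs(predicted)
    r = co_clustered_pairs(reference)
    union = p | r
    # Two clusterings made only of singletons agree trivially.
    return len(p & r) / len(union) if union else 1.0


def best_parameter(cluster_fn, candidates, reference):
    """Return the candidate whose clustering best matches the curated reference."""
    return max(candidates,
               key=lambda value: agreement(cluster_fn(value), reference))
```

The value returned for the reference species would then be reused when clustering genes of related genomes. Ties between candidates resolve to whichever appears first in `candidates`.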

Public Library of Science (PLOS)

Simon Fraser University Institutional Repository

ProClust: Improved Clustering of Protein Sequences with an extended graph-based approach

Author: P. Pipenbacher, A. Schliep, S. Schneckener, A. Schönhuth, D. Schomburg, R. Schrader
Publication date: 01/01/2002

CiteSeerX

ProClust: improved clustering of protein sequences with an extended graph-based approach

Author: A. Schliep
A. Schönhuth
D. Schomburg
P. Pipenbacher
R. Schrader
S. Schneckener
Publication venue: Oxford University Press (OUP)

computer science publication server

ProClust: Improved Clustering of Protein Sequences with an extended graph-based approach

Author: A. Schliep
A. Schönhuth
D. Schomburg
P. Pipenbacher
R. Schrader
S. Schneckener
Publication date: 01/01/2002

Motivation: The problem of finding remote homologues of a given protein sequence via alignment methods is not fully solved. In fact, the task seems to become more difficult with more data. As the size of the database increases, so does the noise level: the highest alignment scores due to random similarities increase and can exceed the alignment score between true homologues. Comparing two sequences with an arbitrary alignment method yields a similarity value which may indicate an evolutionary relationship between them. A threshold value is usually chosen to distinguish between true homologue relationships and random similarities. To compensate for the higher probability of spurious hits in larger databases, this threshold is increased. Increasing specificity, however, leads to decreased sensitivity as a matter of principle.

CiteSeerX

Kölner UniversitätsPublikationsServer

GenFamClust: an accurate, synteny-aware and reliable homology inference algorithm

Author: A Sarkar
AJ Enright
AJ Vilella
B Dujon
D Durand
D Kristensen
DA Dalquen
DR Scannell
EV Koonin
F Lemoine
F Sievers
FS Dietrich
G Bhardwaj
G Yona
H Li
I Wapinski
J Jun
J Parker
JM Joseph
JP Doyon
KH Wolfe
KP Byrne
Lars Arvestad
LL Conte
M Goodman
M Kellis
M Levitt
MK Basu
MN Price
MV Han
N Song
O Mahmudi
P Flicek
P Pipenbacher
Raja H. Ali
RD Finn
RH Ali
Sayyed A. Muhammad
SB Needleman
SF Altschul
TF Smith
V Miele
WM Fitch
Z Fu
Publication venue: Springer Science and Business Media LLC