EVEREST: automatic identification and classification of protein domains in all protein sequences

Authors: A Bairoch; A Barak; A Bateman; A Heger; Amir Harel; B Boeckmann; CH Wu; E Portugaly; Elon Portugaly; F Servant; HM Berman; J Gracy; J Liu; J Park; J Schultz; JD Thompson; JM Chandonia; Michal Linial; N Kaplan; N Nagarajan; Nathan Linial; NJ Mulder; O Dekel; O Sasson; O Shachar; SF Altschul; SR Eddy; TF Smith; TJ Hubbard; Y Inbar

Publication date: 1 January 2006
Publisher: BioMed Central

Abstract

BACKGROUND: Proteins are composed of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then clusters these segments into putative domain families. The best putative families are selected using machine learning techniques, and a statistical model is created for each of the chosen families. This procedure is then iterated: the statistical models are used to scan all protein sequences, to recreate a library of segments, and to cluster them again.

RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains.

CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.
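
The iterative procedure and the residue-coverage figure described in the abstract can be illustrated with a minimal Python sketch. This is not the EVEREST implementation: the function names (`run_rounds`, `residue_coverage`), the callable parameters, the fixed number of rounds, and the half-open coordinate convention are all assumptions made for illustration. The abstract does not specify how segments are compared, clustered, selected, or modelled, so those steps are left as caller-supplied functions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical segment representation: (protein id, start, end),
# 0-based with an exclusive end.
Segment = Tuple[str, int, int]


def run_rounds(
    sequences: Dict[str, str],
    initial_segments: Callable[[Dict[str, str]], List[Segment]],
    cluster: Callable[[List[Segment]], List[List[Segment]]],
    select: Callable[[List[List[Segment]]], List[List[Segment]]],
    build_model: Callable[[List[Segment]], object],
    scan: Callable[[List[object], Dict[str, str]], List[Segment]],
    rounds: int = 2,
) -> List[List[Segment]]:
    """Skeleton of the loop outlined in the abstract (all steps are stubs)."""
    # Round 0 input: segments from an all vs. all pairwise comparison.
    segments = initial_segments(sequences)
    chosen: List[List[Segment]] = []
    for _ in range(rounds):
        candidates = cluster(segments)      # putative domain families
        chosen = select(candidates)         # keep the best candidates
        models = [build_model(family) for family in chosen]
        segments = scan(models, sequences)  # recreate the segment library
    return chosen


def residue_coverage(lengths: Dict[str, int], segments: List[Segment]) -> float:
    """Fraction of all residues covered by at least one segment."""
    spans_by_protein: Dict[str, List[Tuple[int, int]]] = {}
    for protein_id, start, end in segments:
        spans_by_protein.setdefault(protein_id, []).append((start, end))

    covered = 0
    for spans in spans_by_protein.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or adjacent span: extend the current run.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start

    total = sum(lengths.values())
    return covered / total if total else 0.0
```

Merging overlapping spans before counting ensures that a residue annotated by several domains is counted once, which is the natural reading of a statement such as "covering 85% of the amino acids"; whether EVEREST computes its figure in exactly this way is not stated in the abstract.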
