14,164 research outputs found

The Pfam protein families database

Author: A. Bateman
Andreeva
Berman
Bru
Dowell
E. L. L. Sonnhammer
Finn
G. Ceric
H.-R. Hotz
J. Mistry
J. Tate
K. Forslund
Käll
Mistry
P. C. Coggill
R. D. Finn
Rusch
S. J. Sammut
S. R. Eddy
Schloss
Sonnhammer
Yooseph
Publication venue: Oxford University Press
Publication date: 01/01/2008

Pfam is a comprehensive collection of protein domains and families, represented as multiple sequence alignments and as profile hidden Markov models. The current release of Pfam (22.0) contains 9318 protein families. Pfam is now based not only on the UniProtKB sequence database, but also on NCBI GenPept and on sequences from selected metagenomics projects. Pfam is available on the web from the consortium members using a new, consistent and improved website design in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/), as well as from mirror sites in France (http://pfam.jouy.inra.fr/) and South Korea (http://pfam.ccbb.re.kr/).

Pfam: clans, web tools and services

Author: Bateman Alex
Durbin Richard
Eddy Sean R
Finn Robert D
Griffiths-Jones Sam
Hollich Volker
Khanna Ajay
Lassmann Timo
Marshall Mhairi
Mistry Jaina
Moxon Simon
Schuster-Böckler Benjamin
Sonnhammer Erik L L
Publication venue: Oxford University Press (OUP)
Publication date: 28/12/2005

Pfam is a database of protein families that currently contains 7973 entries (release 18.0). A recent development in Pfam has enabled the grouping of related families into clans. Pfam clans are described in detail, together with the new associated web pages. Improvements to the range of Pfam web tools and the first set of Pfam web services that allow programmatic access to the database and associated tools are also presented. Pfam is available on the web in the UK (http://www.sanger.ac.uk/Software/Pfam/), the USA (http://pfam.wustl.edu/), France (http://pfam.jouy.inra.fr/) and Sweden (http://pfam.cgb.ki.se/).

CiteSeerX

Crossref

PubMed Central

Oxford University Research Archive

The University of Manchester - Institutional Repository

University of East Anglia digital repository

The Pfam protein families database in 2019

Author: Bateman Alex
Eddy Sean R.
El-Gebali Sara
Finn Robert D.
Hirsh Layla
Luciani Aurélien
Mistry Jaina
Paladin Lisanna
Piovesan Damiano
Potter Simon C.
Qureshi Matloob
Richardson Lorna J.
Salazar Gustavo A.
Smart Alfredo
Sonnhammer Erik L L
Tosatto Silvio C E
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2019

Archivio istituzionale della ricerca - Università di Padova

The Pfam protein families database

Author: Bateman Alex
Ceric Goran
Coggill Penny
Eddy Sean R.
Finn Robert D.
Forslund Kristoffer
Gavin O. Luke
Gunasekaran Prasad
Heger Andreas
Holm Liisa
Mistry Jaina
Pollington Joanne E.
Sonnhammer Erik L. L.
Tate John
Publication date: 17/11/2009

Peer reviewed

CiteSeerX

PubMed Central

Oxford University Research Archive

Helsingin yliopiston digitaalinen arkisto

MDC Repository

The Pfam protein families database

Author: Bateman Alex
Boursnell Chris
Ceric Goran
Clements Jody
Coggill Penny C.
Eberhardt Ruth Y.
Eddy Sean R.
Finn Robert D.
Forslund Kristoffer
Heger Andreas
Holm Liisa
Mistry Jaina
Pang Ningze
Punta Marco
Sonnhammer Erik L. L.
Tate John
Publication date: 29/11/2011

Peer reviewed

CiteSeerX

PubMed Central

Oxford University Research Archive

Helsingin yliopiston digitaalinen arkisto

MDC Repository

Pfam: the protein families database

Author: Bateman Alex
Clements Jody
Coggill Penelope
Eberhardt Ruth Y.
Eddy Sean R.
Finn Robert D.
Heger Andreas
Hetherington Kirstie
Holm Liisa
Mistry Jaina
Punta Marco
Sonnhammer Erik L. L.
Tate John
Publication date: 27/11/2013

Peer reviewed

Crossref

PubMed Central

Helsingin yliopiston digitaalinen arkisto

Testing statistical hypothesis on random trees and applications to the protein classification problem

Author: Busch Jorge R.
Ferrari Pablo A.
Flesia Ana Georgina
Fraiman Ricardo
Grynberg Sebastian P.
Leonardi Florencia
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2009

Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test whether two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and we use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov-Smirnov-type goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is a supremum over the space of trees of a function of the two samples; its computation grows, in principle, exponentially fast with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow problem over a related graph, which can be solved using a Ford-Fulkerson algorithm in time polynomial in that number. We apply the test to 10 randomly chosen protein domain families from the seed of the Pfam-A database (high quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton-Watson related processes. Comment: Published at http://dx.doi.org/10.1214/08-AOAS218 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
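
The abstract above reduces the computation of its test statistic to a max-flow problem solved with a Ford-Fulkerson algorithm. The paper's graph construction is not reproduced in this listing, so what follows is only a minimal Python sketch of the generic technique on a hypothetical toy graph. The function name `max_flow`, the node labels and the capacities are illustrative assumptions, not taken from the paper. The sketch uses the Edmonds-Karp variant (Ford-Fulkerson with breadth-first augmenting paths), whose running time is polynomial in the size of the graph.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Maximum flow via Edmonds-Karp (Ford-Fulkerson with BFS paths).

    `capacity` maps each node to a dict {neighbour: edge capacity}.
    This is a generic sketch, not the graph construction of the paper.
    """
    if source == sink:
        raise ValueError("source and sink must differ")

    # Residual graph: forward edges carry the capacity, reverse edges start at 0.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left: the flow is maximal

        # Find the bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path, updating reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy network, for illustration only.
toy = {
    "s": {"a": 3, "b": 2},
    "a": {"b": 1, "t": 2},
    "b": {"t": 3},
}
print(max_flow(toy, "s", "t"))  # prints 5
```

In the toy network the two edges leaving `s` have total capacity 5 and both can be saturated, so the maximum flow is 5.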

arXiv.org e-Print Archive

CiteSeerX

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

RCAAP - Repositório Científico de Acesso Aberto de Portugal

Universidade de São Paulo

EVEREST: automatic identification and classification of protein domains in all protein sequences

Author: A Bairoch
A Barak
A Bateman
A Heger
Amir Harel
B Boeckmann
CH Wu
E Portugaly
Elon Portugaly
F Servant
HM Berman
J Gracy
J Liu
J Park
J Schultz
JD Thompson
JM Chandonia
Michal Linial
N Kaplan
N Nagarajan
Nathan Linial
NJ Mulder
O Dekel
O Sasson
O Shachar
SF Altschul
SR Eddy
TF Smith
TJ Hubbard
Y Inbar
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Proteins are composed of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

ADDA: a domain database with global coverage of the protein universe

Author: Heger Andreas
Holm Liisa
Sivakumar Ashwin
Wilton Christopher Andrew
Publication venue: Oxford University Press
Publication date: 17/12/2004

We used the Automatic Domain Decomposition Algorithm (ADDA) to generate a database of protein domain families with complete coverage of all protein sequences. Sequences are split into domains and domains are grouped into protein domain families in a completely automated process. The current database contains domains for more than 1.5 million sequences in more than 40 000 domain families. In particular, there are 3828 novel domain families that do not overlap with the curated domain databases Pfam, SCOP and InterPro. The data are freely available for downloading and querying via a web interface (http://ekhidna.biocenter.helsinki.fi:9801/sqgraph/pairsdb).

CiteSeerX

Crossref

PubMed Central