Domains are the structural subunits of proteins and are considered the basic units of folding, evolution and function. Understanding the domain structure of a protein helps to improve functional annotation, tertiary structure prediction, protein engineering and mutagenesis. Because domains are the minimal functional units of a protein, elucidating the cellular functions of a protein requires understanding the molecular functions of its domains.
Protein sequences are annotated by their matches to domain families in databases such as Pfam, SCOP or CATH. However, existing domain databases cover less than half of the known sequence space and encompass only a small fraction of all protein domain families. Here, we developed Pdom, an algorithm based on a Bayesian statistical model with which we can consistently decompose the entire protein sequence space into its evolutionary units. Pdom predicts domains on the basis of an all-against-all search of protein Hidden Markov Models with HHblits. An HHblits alignment indicates the homologous region shared by the query and a template. This region usually encompasses one or multiple domains shared by query and template. We can therefore infer the domain borders of the query from its alignments to different templates. For this purpose, we use the probabilities that HHblits calculates for the beginning and end of a homologous region.
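The idea of reading domain borders off alignment start and end positions can be illustrated with a minimal sketch. It is not the Bayesian model of Pdom: it replaces the HHblits start and end probabilities with plain vote counts, and the function name `infer_borders` and the parameters `window` and `min_support` are hypothetical names chosen for this illustration only.

```python
from collections import Counter


def infer_borders(alignments, query_length, window=10, min_support=2):
    """Return candidate domain border positions (1-based) in the query.

    alignments: list of (start, end) query coordinates, inclusive,
    one pair per query-template alignment.
    Simplifying assumption: every alignment casts one unweighted vote
    per end point; Pdom itself uses probabilities from HHblits.
    """
    votes = Counter()
    for start, end in alignments:
        if start > 1:
            votes[start] += 1        # border between start-1 and start
        if end < query_length:
            votes[end + 1] += 1      # border between end and end+1

    borders = []
    remaining = dict(votes)
    while remaining:
        # Strongest remaining position; ties go to the smaller position.
        peak = max(remaining, key=lambda p: (remaining[p], -p))
        nearby = [p for p in remaining if abs(p - peak) <= window]
        support = sum(remaining[p] for p in nearby)
        if support >= min_support:
            borders.append(peak)
        for p in nearby:             # suppress votes around this peak
            del remaining[p]
    return sorted(borders)


# Toy query of 300 residues aligned to five templates.
hits = [(1, 120), (1, 118), (125, 300), (121, 300), (1, 300)]
print(infer_borders(hits, query_length=300))
```

In this toy case three of the five alignments end or begin near residue 120, so the sketch reports one border there, while the full-length alignment contributes no votes because it touches only the sequence termini.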
HHblits is an iterative protein homology detection tool that is more sensitive, generates more accurate alignments and is faster than its best competitors, PSI-BLAST and HMMER3. HHblits searches with a query Hidden Markov Model (HMM) against a database of HMMs built from multiple sequence alignments. In the clustering of the UniProt sequence database into the uniprot20 HMM database, the sequences within a cluster are expected to cover each other by at least 90%. The HMMs of the uniprot20 are therefore very conservative: each cluster contains few, mostly very similar sequences.
For the domain decomposition with Pdom we wanted to increase the sensitivity of HHblits so that it detects more remote homologs. For this purpose, we enriched our database clusters by jumpstarting HHblits with each cluster alignment in the uniprot20 and adding significant matches to the cluster alignments. The resulting uniprot_boost1 database (boost1 for one iteration with HHblits) has ~14x more sequences per cluster than the uniprot20, and the effective number of sequences rises from 1.18 to 3.78. We developed efficient sparse compression and alignment handling algorithms that keep the memory size nearly the same. With the uniprot_boost1, HHblits finds ~20% more homologs in three iterations than with the uniprot20. The HHblits alignments obtained with the uniprot_boost1 have an 11.32% higher per-residue precision and a 17.39% higher per-residue sensitivity. Structural models built from global alignments of fixed query-template pairs against the uniprot_boost1 have on average a 4.9% higher TMscore to the native structure. In particular, the structural models of queries with a TMscore <= 0.3 with the uniprot20 improve with the uniprot_boost1. In a simple homology modeling pipeline with free template selection, we showed that the TMscore of models built with the uniprot_boost1 is on average 4.9% higher.
In summary, with the diversity-enriched uniprot_boost1 database HHblits finds more homologs and generates more accurate alignments than with the uniprot20, and these alignments lead to better structure models. This further widens the gap to HHblits' competitors HMMER and PSI-BLAST.
We showed that the performance gain of HHblits with the uniprot_boost1 carries over to the downstream application of homology modeling. The uniprot_boost1 has the potential to become a default database for HHblits and may benefit sequence-based predictions of evolutionarily conserved properties, such as secondary or tertiary structure, disorder, catalytic sites, post-translational modifications, short linear motifs, or interaction interfaces.
Protein domains occur in different proteins with different domain architectures. Pairwise alignments of a query protein against multiple homologous template proteins therefore start and end at different positions, and these positions reveal the boundaries of the protein domains in the query. Some of these alignments encompass a single domain, others might encompass multiple domains. On the basis of an all-against-all search with HHblits within the uniprot_boost1, we decomposed the protein sequence space into its domains with Pdom. We compared the predictions of Pdom, ADDA and Pfam to SCOP annotations mapped onto full-length protein sequences. ADDA applies the same fundamental idea as Pdom but predicts domains on the basis of an all-against-all search of protein sequences with BLAST. Pfam uses manually curated seed alignments that incorporate available data from the literature.
On average, ADDA covers 32%, Pfam covers 80% and Pdom covers 75% of the reference domain annotations.
ADDA predicts a domain start or end site within 20 residues of the reference site for 15% of the reference domain start and end sites. Pfam annotates a domain start site within 20 residues for 67% of the reference domain start sites and a domain end site within 20 residues for 58% of the reference domain end sites. Pdom predicts a domain start site within 20 residues for 50% of the reference domain start sites and a domain end site within 20 residues for 45% of the reference domain end sites.
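The boundary criterion used above can be written down as a short sketch: a reference site counts as recovered if any predicted site of the same kind lies within 20 residues. The function name `boundary_recall` and the toy coordinates are hypothetical, and details such as how sites are pooled across proteins are assumptions, since they are not specified here.

```python
def boundary_recall(reference_sites, predicted_sites, tolerance=20):
    """Fraction of reference sites with a predicted site within `tolerance` residues.

    Both arguments are lists of residue positions of the same kind
    (all start sites or all end sites) for one protein.
    """
    if not reference_sites:
        return 0.0
    hits = sum(
        1
        for ref in reference_sites
        if any(abs(ref - pred) <= tolerance for pred in predicted_sites)
    )
    return hits / len(reference_sites)


# Toy example: three reference domain starts, three predicted starts.
reference_starts = [1, 150, 400]
predicted_starts = [5, 180, 395]
print(boundary_recall(reference_starts, predicted_starts))
```

Here the starts at 1 and 400 are matched (distances 4 and 5), while the start at 150 is missed because the closest prediction is 30 residues away, so two of the three reference starts are recovered.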
The seed alignments of Pfam are of very high quality owing to the manual curation effort. However, these seed alignments are limited to domains that have been analyzed in the literature. With our fully automatic approach in Pdom, we are able to find new domains in the protein sequence space. The clustered database of Pdom's domain predictions, UniDom, has the potential to become a fundamental tool for homology-based protein sequence annotation efforts.