329 research outputs found

    Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins

    We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. 
In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain architectures, we show that homology can be rationally defined for multidomain families with diverse architectures by considering the genomic context of the genes that encode them. Our study demonstrates the utility of mining network structure for evolutionary information, suggesting that this is a fertile approach for investigating evolutionary processes in the post-genomic era.
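
To make the idea of exploiting the local structure of the sequence similarity network concrete, here is a minimal sketch. The abstract gives no formula, so the sketch assumes that the score for a pair of sequences is the Pearson correlation between their similarity profiles, i.e. their similarity scores against every sequence in the network. The function name, the toy scores, and the choice to treat missing pairs as zero are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt

# Hypothetical toy network: symmetric similarity scores between sequences.
# A, B and C stand for one family; D and E for another. A and D share a
# single hit, as might happen when they share only an inserted domain.
TOY_NODES = ["A", "B", "C", "D", "E"]
TOY_SCORES = {
    ("A", "A"): 100, ("B", "B"): 100, ("C", "C"): 100,
    ("D", "D"): 100, ("E", "E"): 100,
    ("A", "B"): 80, ("A", "C"): 70, ("B", "C"): 75,
    ("A", "D"): 40, ("D", "E"): 90,
}


def neighborhood_correlation(scores, x, y, nodes):
    """Pearson correlation between the similarity profiles of x and y.

    scores: dict mapping (a, b) -> similarity score; absent pairs count as 0.
    nodes:  all sequences in the network (defines the profile coordinates).
    """
    def lookup(a, b):
        # Scores are stored once per unordered pair, so try both orders.
        return scores.get((a, b), scores.get((b, a), 0.0))

    profile_x = [lookup(x, n) for n in nodes]
    profile_y = [lookup(y, n) for n in nodes]
    mean_x = sum(profile_x) / len(nodes)
    mean_y = sum(profile_y) / len(nodes)

    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(profile_x, profile_y))
    var_x = sum((a - mean_x) ** 2 for a in profile_x)
    var_y = sum((b - mean_y) ** 2 for b in profile_y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no neighborhood information
    return cov / sqrt(var_x * var_y)
```

In this toy network, A and B hit the same neighbors and correlate strongly, whereas A and D have one direct hit but otherwise different neighborhoods and correlate negatively, which is the kind of distinction the abstract describes.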

    Context-specific methods for sequence homology searching and alignment


    HHrep: de novo protein repeat detection and the origin of TIM barrels

    HHrep is a web server for the de novo identification of repeats in protein sequences, based on the pairwise comparison of profile hidden Markov models (HMMs). Its main strength is its sensitivity, which allows it to detect highly divergent repeat units in protein sequences whose repeats could so far be detected only from their structures. Examples include sequences with β-propeller folds, ferredoxin-like folds, double-psi barrels, or (βα)8 (TIM) barrels. We illustrate this with proteins from four superfamilies of TIM barrels by revealing a clear 4- and 8-fold symmetry, which we detect solely from their sequences. This symmetry might be the trace of an ancient origin through duplication of a βαβα or βα unit. HHrep can be accessed at

    From Structure Prediction to Genomic Screens for Novel Non-Coding RNAs

    Non-coding RNAs (ncRNAs) are receiving more and more attention, not only as an abundant class of genes but also as regulatory structural elements (some located in mRNAs). A key feature of RNA function is its structure. Computational methods for folding and predicting RNA structure were developed early, with the aim of assisting functional analysis. With the discovery of more and more ncRNAs, it has become clear that a large fraction of them are highly structured. Interestingly, a large part of the structure is composed of regular Watson-Crick and GU wobble base pairs. This, together with the growing number of available genomes, has made it possible to employ structure-based methods for genomic screens. The field has moved from folding prediction of single sequences to computational screens for ncRNAs in genomic sequence, using the RNA structure as the main characteristic feature. Whereas early methods focused on energy-directed folding of single sequences, comparative analysis based on structure-preserving changes of base pairs has been efficient in improving accuracy, and today it constitutes a key component of genomic screens. Here, we cover the basic principles of RNA folding and touch upon some of the concepts in current methods that have been applied in genomic screens for de novo RNA structures, in searches for novel ncRNA genes and regulatory RNA structures on mRNAs. We discuss the strengths and weaknesses of the different strategies and how they can complement each other.
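
As a small illustration of single-sequence folding with Watson-Crick and GU wobble pairs, the sketch below uses Nussinov-style base-pair maximization. This is a deliberate simplification: it counts base pairs rather than minimizing free energy, so it is not the energy-directed folding the abstract refers to. The function name and the minimum hairpin loop length of 3 are assumptions chosen for the example.

```python
# Allowed pairs: Watson-Crick (A-U, G-C) plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP).

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases inside
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

For the hairpin-like sequence `GGGAAAUCC`, the recursion finds three stacked pairs (two G-C pairs and one G-U wobble pair) around the `AAA` loop.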

    Homology modeling using parametric alignment ensemble generation with consensus and energy-based model selection

    The accuracy of a homology model based on the structure of a distant relative or other topologically equivalent protein is primarily limited by the quality of the alignment. Here we describe a systematic approach for sequence-to-structure alignment, called ‘K*Sync’, in which alignments are generated by dynamic programming using a scoring function that combines information on many protein features, including a novel measure of how obligate a sequence region is to the protein fold. By systematically varying the weights on the different features that contribute to the alignment score, we generate very large ensembles of diverse alignments, each optimal under a particular constellation of weights. We investigate a variety of approaches to select the best models from the ensemble, including consensus of the alignments, a hydrophobic burial measure, low- and high-resolution energy functions, and combinations of these evaluation methods. The effect of loop modeling and backbone optimization on model quality and selection is also studied. The performance of the method on a benchmark set is reported and shows the approach to be effective at both generating and selecting accurate alignments. The method serves as the foundation of the homology modeling module in the Robetta server.
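
The idea of generating an alignment ensemble by varying feature weights can be sketched as follows. This is not K*Sync itself: the sketch uses plain global (Needleman-Wunsch) dynamic programming on two sequences, and its two features (residue identity and a crude hydrophobic/polar class match), the residue set, the gap penalty, and all function names are hypothetical stand-ins for the richer feature set the abstract describes.

```python
from itertools import product

# Hypothetical feature: a crude hydrophobic residue class.
HYDROPHOBIC = set("AVILMFWC")


def make_score(w_identity, w_hydro):
    """Build a pair score as a weighted sum of two toy features."""
    def score(x, y):
        identity = 1.0 if x == y else -1.0
        same_class = 1.0 if (x in HYDROPHOBIC) == (y in HYDROPHOBIC) else -1.0
        return w_identity * identity + w_hydro * same_class
    return score


def align(a, b, score, gap=-2.0):
    """Global alignment by dynamic programming; returns two gapped strings."""
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    i, j, row_a, row_b = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif j == 0 or (i > 0 and dp[i][j] == dp[i - 1][j] + gap):
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def alignment_ensemble(a, b, weight_grid):
    """One optimal alignment per weight setting, with duplicates removed."""
    return {align(a, b, make_score(wi, wh)) for wi, wh in weight_grid}


if __name__ == "__main__":
    grid = product([0.0, 1.0, 2.0], repeat=2)
    for top, bottom in sorted(alignment_ensemble("AVKLE", "AKIE", grid)):
        print(top, bottom, sep="\n", end="\n\n")
```

Each member of the returned set is optimal under at least one weight setting; a selection step such as the consensus or energy-based ranking mentioned in the abstract would then choose among them.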

    Distribution and phylogeny of the bacterial translational GTPases and the Mqsr/YgiT regulatory system

    The electronic version of the dissertation does not include the publications. Proteins are vital for the cell: they serve as building blocks and as catalysts for the reactions that life depends on. Bioinformatics has equipped us with powerful sequence analysis tools. On the basis of sequence similarity, proteins can be grouped into families. A protein family is composed of homologous sequences, i.e. sequences that descend from a common ancestor. Proteins that belong to the same family often have the same or closely related functions. Our knowledge of protein function comes from experimental work on a limited number of model organisms. Scientists are often interested in how universal or how specific a described function is. How and when does gene duplication give rise to a protein with new properties (a new function)? How often have such events taken place on the evolutionary timescale? In my dissertation I have analyzed bacterial translational GTPases (trGTPases) and the proteins of the mqsR/ygiT toxin-antitoxin (TA) system. The common denominator of the two is the protein synthesis machinery: both are associated with the ribosome and, through it, with the cell's ability to produce proteins as needed. The questions asked in this context fall broadly into two groups: a) those concerning the representation of a protein family, and b) those concerning its evolution and functional innovation. For bacterial trGTPases we can speak of nine families, i.e. nine sets of functions. Analysis of complete bacterial genomes showed that four of the nine families are conserved: IF2, EF-Tu, EFG, and LepA (EF4). Although the classical model of protein synthesis assigns RF3 a canonical role in translation termination, the RF3 gene was missing from approximately 40% of the bacterial genomes analyzed. In contrast, LepA, whose function is still unclear, turned out to be a bacteria-specific trGTPase.
The analysis also revealed a wide distribution of EFG paralogs: many bacterial genomes contained two to three relatively diverged EFG genes. The phylogenetic tree of EFG revealed four subfamilies: EFG I, spdEFG1, spdEFG2, and EFG II. The EFG I subfamily is experimentally well characterized. The spdEFGs have also been studied: spdEFG1 acts as a translocase and spdEFG2 helps recycle ribosomes, indicating a functional split between co-occurring paralogs. However, little research has been done on the widely distributed EFG II subfamily. Our phylogenetic analyses allow us to propose an ancient origin for the four EFG subfamilies: they arose before (or at the same time as) the divergence of the eukaryotic cell form from archaea and bacteria. The functional innovation common to the whole EFG II subfamily is carried primarily by 12 positions that are specifically conserved in EFG II. Against the background of the subfamily's high overall divergence, these conserved positions stand out in the GTPase domain, domain II, and domain IV. Conserved changes in the GTPase domain, some of which lie in the GTP-binding G1 motif, indicate changed conditions for GTP binding and hydrolysis. The increased charge at the tip of the domain IV loop, which in E. coli EFG inserts into the A-site, lets us speculate about changes in the local environment of the A-site during translocation. Conserved changes in domain II indicate altered interactions between EFG domains I, II, and III and the ribosome. Phylogenetic and sequence analysis of the EFG II subfamily reveals phylum/class-specific sub-subgroups. These sub-subgroups differ from one another in the conservation pattern of the G2 motif and in the insertion/deletion pattern seen in the multiple sequence alignment. This second level characterizes EFG II as a phylum/class-specific factor. What role EFG II actually performs, and how and under what conditions it complements EFG I, remains to be answered. The present study provides a framework for future experiments.