45 research outputs found

    Determining and comparing protein function in Bacterial genome sequences


    Next Generation Sequencing, Assembly, and Analysis of Bovine and Feline Tritrichomonas foetus Genomes Toward Taxonomic Clarification and Improved Therapeutic and Preventive Targets

    Tritrichomonas foetus is a bovine and feline parasite and a porcine commensal. This organism is the causative agent of bovine and feline trichomonosis. In cattle, the parasite colonizes the urogenital tract and causes symptoms similar to those caused by Trichomonas vaginalis in humans. In cats, the parasite colonizes the gastrointestinal tract and produces a protracted watery diarrhea. In cattle, this parasite can lead to abortions and substantial herd loss due to culling of infected animals, whereas in cats prolonged courses of diarrhea can lead to abandonment or euthanasia. At the inception of this dissertation work, no genomic data were available for T. foetus. The parasitology community has debated the taxonomic relationship between bovine and feline-associated strains of T. foetus and other trichomonad parasites. Some have hypothesized that different pathotypes of T. foetus constitute wholly separate species based on a limited number of cross-infectivity studies, a scant amount of genomic DNA and protein sequences, and non-targeted nuclease-based strategies. Still, the community has been slow to adopt this idea. We aimed to use next-generation sequencing (NGS) technology to sequence the genomes of bovine and feline isolates, use the genomes to determine the taxonomic relationship between bovine and feline-associated T. foetus, and determine whether there were detectable genomic differences that might lead to host-specific targets. We hypothesized that significant genomic changes would be detectable and would lead to host-specific targets for future therapeutics. We successfully extracted genomic DNA and produced de novo draft genome assemblies for two Tritrichomonas foetus isolates: strain Beltsville and strain Auburn. Our resulting genomic analyses reveal that these are two members of the same species at the molecular level. These results ran contrary to our initial hypothesis, showing that the difference between these two pathotypes may be subtler than previously believed. We used numerous housekeeping, gold-standard phylogenetic markers in addition to bioinformatic and phylogenetic analyses to highlight the profound similarity between these two samples. This work should lay the foundation for a multitude of future investigations into Tritrichomonas foetus in the hopes of producing better therapeutic strategies and clinical outcomes for bovine and feline populations alike.

    Protocols to capture the functional plasticity of protein domain superfamilies

    Most proteins comprise several domains, segments that are clearly discernable in protein structure and sequence. Over the last two decades, it has become increasingly clear that domains are often also functional modules that can be duplicated and recombined in the course of evolution. This gives rise to novel protein functions. Traditionally, protein domains are grouped into homologous domain superfamilies in resources such as SCOP and CATH. This is done primarily on the basis of similarities in their three-dimensional structures. A biologically sound subdivision of the domain superfamilies into families of sequences with conserved function has so far been missing. Such families form the ideal framework to study the evolutionary and functional plasticity of individual superfamilies. In the few existing resources that aim to classify domain families, a considerable amount of manual curation is involved. Whilst immensely valuable, the latter is inherently slow and expensive. It can thus impede large-scale application. This work describes the development and application of a fully-automatic pipeline for identifying functional families within superfamilies of protein domains. This pipeline is built around a method for clustering large-scale sequence datasets in distributed computing environments. In addition, it implements two different protocols for identifying families on the basis of the clustering results: a supervised and an unsupervised protocol. These are used depending on whether or not high-quality protein function annotation data are associated with a given superfamily. The results attained for more than 1,500 domain superfamilies are discussed in both a qualitative and quantitative manner. The use of domain sequence data in conjunction with Gene Ontology protein function annotations and a set of rules and concepts to derive families is a novel approach to large-scale domain sequence classification. 
Importantly, the focus lies on domain function rather than whole-protein function.

    Functional discovery in the oxidative D-galacturonate assimilation pathway and development of the enzyme similarity web tool

    Sequencing technology has improved dramatically over the past few decades. Before the sequencing of complete genomes was possible, the sequencing of a gene was directly linked to the biochemical characterization of its product [1]. However, biochemical and genetic characterization has not benefited from being scaled up in the same way as sequencing has. Thus, the scientific community is confronted with exponentially growing sequence databases in which roughly half of the entries are either annotated incorrectly or not annotated at all. Therefore, to realize the true potential of the data being generated by sequencing projects, the way the functions of those sequences are discovered and identified must change. One approach to the growing number of sequences without a known function is that set forth by the Enzyme Function Initiative (EFI). The goal of the EFI is to develop tools and strategies to characterize enzymes discovered in genome projects, and the EFI uses an interdisciplinary approach to address the problem. EFI labs include those with expertise in bioinformatics, computational biology, structural biology, enzymology, and biology. They work together on a systematic approach that uses bioinformatics to select enzyme candidates for structural elucidation, ligand docking to identify potential substrates, in vitro biochemistry to test those predictions, and microbiology to test for the physiological role of activities identified in vitro. This is the general approach, but other tools and approaches have also been tested and developed in each of these areas (e.g., bioinformatics, computational biology). Bioinformatics tools that have been further developed include sequence similarity networks (SSNs) and genomic context networks. SSNs have a long history and are useful for visualizing trends, notably in function, across groups of related protein sequences. Before this work, access to SSNs by experimentalists with little bioinformatics training was limited. To enable experimentalists to generate an SSN for any protein family (~16,000 now in Pfam), we developed a web tool that generates SSNs quickly and easily. The networks can be viewed in Cytoscape and contain an aggregate of annotation data pulled from different sources (e.g., UniProt, GenomesOnline). The first part of this work (Chapter 2) describes the web tool and provides an example in which members of the enolase superfamily from Agrobacterium tumefaciens strain C58 are mined in a shotgun approach to discover novel enzymatic activities. In the second part of this work, combined bioinformatics and experimental approaches are used to identify two novel enzymes in the oxidative pathway to degrade pectin, the abundant plant cell wall polysaccharide. In the first example (Chapter 3), genomic context and pathway reconstruction combined with in vitro biochemistry and gene expression analysis reveal a novel enzymatic activity: isomerization of the 6-member ring lactone of D-galacturonate (D-galA) to its 5-member ring lactone counterpart. An enzyme that catalyzes this reaction had not been identified before this work. In the second example (Chapter 4), a large-scale screening of transporters led us to microbial gene neighborhoods containing many enzymes of the known D-galA oxidative pathway. In a number of cases, however, components of the known pathway were missing, and in their place were candidate enzymes likely involved in an alternative pathway for metabolizing D-galA. This work led us to the discovery of an enzyme that hydrolyzes the 6-member ring lactone of D-galA to its acyclic diacid counterpart, meso-galactarate.
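
As a rough illustration of the sequence similarity network idea described in this abstract, the sketch below treats sequences as nodes, joins two sequences when their pairwise score meets a threshold, and reports the resulting clusters. It is not the web tool's implementation: the function names, the score values, and the threshold are all hypothetical, and a real workflow would take its scores from an all-by-all sequence comparison.

```python
def build_ssn(scores, threshold):
    """Build an undirected network: nodes are sequences, and an edge joins
    two sequences whose pairwise similarity score meets the threshold."""
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group nodes into clusters with an iterative depth-first search."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(sorted(component))
    return components


# Hypothetical pairwise similarity scores (illustrative values only).
scores = {
    ("seqA", "seqB"): 80,
    ("seqB", "seqC"): 75,
    ("seqC", "seqD"): 20,
    ("seqD", "seqE"): 90,
}

groups = connected_components(build_ssn(scores, threshold=50))
print(groups)  # [['seqA', 'seqB', 'seqC'], ['seqD', 'seqE']]
```

Raising the threshold splits clusters into smaller groups, which is how a user of such a network would typically look for subgroups that might share a function.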

    In silico analysis of the coffee transcriptome: identification of SNPs and inference of mechanisms of gene expression regulation

    Advisor: Gonçalo Amarante Guimarães Pereira. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Biologia. Abstract: Coffee is one of the most important crops in the world; it is consumed worldwide and contributes significantly to the economies of developing countries. Coffea arabica and Coffea canephora are responsible for 70% and 30% of commercial production, respectively. Cytogenetic analysis established that C. arabica is an autogamous allotetraploid formed by a recent (1 mya) hybridization between the diploids C. canephora and Coffea eugenioides. C. eugenioides is a wild species that grows at higher altitudes near forest edges and produces few berries with small beans of low caffeine content. C. canephora, on the other hand, is allogamous and grows better in lowlands. It is also characterized by higher productivity, greater tolerance to pests, and higher caffeine content, but it yields an inferior beverage compared with C. arabica. During the last decade, research initiatives have been launched to produce genomic and transcriptomic data on Coffea spp. The resulting EST collection gives a good overview of the C. arabica and C. canephora transcriptomes and is a suitable resource for Coffea molecular analysis. This work aimed to obtain further information about Coffea spp. gene structure and expression and to identify genes that are specific to or expanded in coffee plants. It also set out to study the regulation of homeologous gene expression in the allotetraploid C. arabica. To investigate these data, two different EST assemblies were performed: (i) one for each species individually, for comparative analysis between C. arabica, C. canephora and other crops; and (ii) one with both coffee species together, allowing the identification of SNPs between C. arabica and one of its direct ancestors, C. canephora, and the examination of evolutionary questions in C. arabica. The identification of differentially expressed transcripts and new gene families offered a starting point for correlating gene expression profiles with Coffea sp. developmental traits. Protein domain and Gene Ontology analyses suggested significant differences between the data of the coffee species analyzed, mainly in relation to complex sugar synthases, nucleotide-binding proteins, retrotransposons and stress response. The OrthoMCL tool identified protein families that are specific to or prevalent in coffee when compared with five other plant species. Using the high-quality discrepancies found in overlapping ESTs from C. arabica and C. canephora, sequence diversity profiles were evaluated within both species and used to deduce the transcript contribution of the C. canephora and C. eugenioides ancestors in C. arabica. The assignment of the C. arabica homeologous genes to the ancestral genomes allowed us to analyze the gene expression contribution of each subgenome. We suggest that this phenomenon plays an important role in Coffea gene expression and physiology. Doctorate, Bioinformatics; Doctor in Genetics and Molecular Biology.
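
The abstract mentions identifying SNPs from high-quality discrepancies in overlapping ESTs. A minimal sketch of that general idea follows, assuming the reads are already aligned to equal length with `-` marking positions a read does not cover. The function name `call_snps`, the `min_support` cutoff, and the example reads are hypothetical and are not taken from the thesis; the thesis's actual quality criteria are not described here.

```python
from collections import Counter


def call_snps(aligned_reads, min_support=2):
    """Report alignment columns where at least two different bases are each
    supported by at least `min_support` reads.

    aligned_reads: equal-length strings; '-' marks an uncovered position.
    Returns a list of (column_index, {base: read_count}) tuples.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    snps = []
    for pos in range(length):
        # Count the bases observed in this column, skipping uncovered reads.
        counts = Counter(read[pos] for read in aligned_reads if read[pos] != "-")
        # Keep only bases seen often enough to be unlikely sequencing errors.
        supported = {base: n for base, n in counts.items() if n >= min_support}
        if len(supported) >= 2:
            snps.append((pos, supported))
    return snps


# Hypothetical aligned EST fragments (illustrative only).
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACCTAC",
    "ACCTAC",
    "ACGTAT",
]

print(call_snps(reads))  # [(2, {'G': 3, 'C': 2})]
```

The single `T` in the last column is ignored because only one read supports it, which is the kind of filtering a support threshold is meant to provide against sequencing errors.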

    Bioinformatic insights into the diversity and evolution of bacterial toxins

    Bacterial toxins are a broad category of molecules ranging from small organic compounds and peptides to large multi-domain or multimeric enzymes. Several important diseases are caused primarily by bacterial toxins, including botulism and diphtheria. Paradoxically, the same toxins have proven useful for the treatment of muscular disorders and cancer, respectively. Given their importance in medicine and their utility as drugs, it is desirable to attain a greater functional and mechanistic understanding of toxin families. However, a full description of any sequence's functionality must incorporate an understanding of the evolutionary processes that produced it, and currently little is known about these forces. Using a bioinformatic approach, this thesis presents analyses of three bacterial toxin families: clostridial neurotoxins, which cause botulism and tetanus; diphtheria toxins, which cause diphtheria; and large clostridial toxins, which contribute to the infections produced by various clostridia, including Clostridioides difficile. The detection of toxin-related sequences in bacterial genomes allows the discovery of toxin variants that may have gone undetected by other methods of toxin identification. Based on the available genomic data, toxin families that cause disease in humans appear to be broader than previously imagined. Toxin-related sequences are capable of performing unique functions compared to the toxin variants more traditionally associated with human disease. By examining human toxins in evolutionary terms, it is possible to identify the functional innovations that have resulted in human specificity, as well as delve more deeply into the relationships between toxin sequences and their functions. Thus, the studies presented here provide examples of how the field of toxin biology, like many other disciplines, has much to gain from the genomic revolution.

    Computational genomics of lactobacilli

    Lactobacilli are generally harmless gram-positive lactic acid bacteria and well known for their broad spectrum of beneficial effects on human health and usage in food production. However, relatively little is known at the molecular level about the relationships between lactobacilli and humans and about their food processing abilities. The aim of this thesis was to establish bioinformatics approaches for classifying proteins involved in the health effects and food production abilities of lactobacilli and to elucidate the functional potential of two biomedically important Lactobacillus species using whole-genome sequencing. To facilitate the genome-based analysis of lactobacilli, two new bioinformatics approaches were developed for the systematic analysis of protein function. The first approach, called LOCP, fulfilled the need for accurate genome-wide annotation of putative pilus operons in gram-positive bacteria, whereas the second approach, BLANNOTATOR, represented an improved homology-based solution for general function annotation of bacterial proteins. Importantly, both approaches showed superior accuracy in evaluation tests and proved to be useful in finding information ignored by other homology-search methods, illustrating their added value to the current repertoire of function classification systems. Their application also led to the discovery of several putative pilus operons and new potential effector molecules in lactobacilli, including many of the key findings of this thesis work. Lactobacillus rhamnosus GG is one of the clinically best-studied Lactobacillus strains and has a long history of safe use in the food industry. The whole-genome sequencing of the strain GG and a closely related dairy strain L. rhamnosus LC705 revealed two almost identical genomes, despite the physiological differences between the strains. 
Despite the extensive genomic similarity, a genomic region containing genes for three pilin subunits and a pilin-dedicated sortase was present only in GG. The presence of these pili on the cell surface of L. rhamnosus GG was also confirmed, and one of the GG-specific pilins was demonstrated to be central to the mucus interaction of strain GG. These discoveries established that gram-positive pilus structures also occur in non-pathogenic bacteria and provided a long-awaited explanation for the highly efficient adhesion of strain GG to the intestinal mucosa. The other Lactobacillus species investigated in this thesis was Lactobacillus crispatus. To gain insights into its physiology and to identify components by which this important constituent of the healthy human vagina may promote urogenital health, the genome of a representative L. crispatus strain was sequenced and compared to those of nine others. These analyses provided an accurate account of features associated with vaginal health and revealed a set of 1,224 gene families that were universally conserved across all ten strains and, most likely, across the entire L. crispatus species. Importantly, this set of genes was shown to contain adhesion genes involved in the displacement of the bacterial vaginosis-associated Gardnerella vaginalis from vaginal cells and provided a molecular explanation for the inverse association between L. crispatus and G. vaginalis colonisation in the vagina. Taken together, the present study demonstrates the power of whole-genome sequencing and computer-assisted genome annotation in identifying genes that are involved in host interactions and have industrial value. The discovery of gram-positive pili in L. rhamnosus GG and the mechanism by which L. crispatus excludes G. vaginalis from vaginal cells are both major steps forward in understanding the interaction between lactobacilli and host. 
We envisage that these findings together with the developed bioinformatics methods will aid the improvement of probiotic products and human health in the future. Summary (translated from Finnish): Lactobacilli are mostly harmless gram-positive lactic acid bacteria. Although these beneficial bacteria, some of them health-promoting, have been used in food production for centuries, our knowledge of the molecular biology of lactobacilli is quite limited. The aim of this thesis was to develop new computational tools for characterizing the biomolecules produced by lactobacilli and to elucidate how two biomedically important Lactobacillus species function, using genome sequencing. The thesis presents two computational biology methods for predicting the traits expressed by lactobacilli from genomic data and applies them to interpreting how lactobacilli function. The first method, LOCP, was created to screen genomic data for the gene clusters needed to produce pili, whereas the second, BLANNOTATOR, is a new protein sequence classification tool based on sequence comparisons and on information borrowed from closely related biomolecules. In the evaluations carried out in the constituent studies, both methods proved unprecedentedly accurate and able to find information that other methods suited to the task had erroneously overlooked. The programs also made it possible to find new gene clusters needed for pilus production and other potentially biomedically significant traits in lactobacilli, including most of the findings presented in this thesis. The first bacterium examined was Lactobacillus rhamnosus GG, one of the best-known and most-studied probiotics, that is, health-promoting bacteria. Sequencing the genome of this industrially important lactobacillus and comparing it with that of a closely related lactobacillus, L. rhamnosus LC705, revealed surprisingly few genetic differences between these two biologically different bacteria. Despite the similarity of the genomes, computational methods nevertheless identified a total of five genomic regions specific to L. rhamnosus GG, the most significant of which was found to contain a gene cluster required for pilus biosynthesis. The work also demonstrated the expression of the pilus on the bacterial cell surface and the importance of one pilus subunit for the binding of L. rhamnosus GG to the mucus lining the human digestive tract. Together, these findings showed for the first time that a pilus is expressed in a beneficial bacterium, and they offered a new perspective on the health effects of L. rhamnosus GG and on its ability to bind to different parts of the digestive tract better than L. rhamnosus LC705. In addition, the thesis investigated the genetic basis of Lactobacillus crispatus, a bacterium that is abundant in the human vagina and important for vaginal health. The genome of a strain that represents the L. crispatus species well was sequenced. Comparing this genome with those of nine other strains of the same species produced a comprehensive description of the species' traits and identified a total of 1,224 gene families that can be assumed to account for its species-typical features. These species-typical gene families make up a significant part of the genome of each L. crispatus strain, and among them genes possibly responsible for the species' adhesion ability were identified. The product of one such adhesion gene was also shown to be able to inhibit the attachment of the harmful bacterium Gardnerella vaginalis to the vaginal epithelium. This finding helps explain the role of L. crispatus as the dominant species of the healthy vagina. In conclusion, bacterial genome sequencing and classification predictions for bacterial protein sequences are an extremely useful way to interpret the traits expressed by lactobacilli and to find health-promoting biomolecules. The discovery of pili and of a protein that inhibits the attachment of G. vaginalis is an important step toward a comprehensive understanding of the interactions between lactobacilli and humans and, together with the computational methods developed, may open entirely new approaches to producing better health-promoting foods and improving human health.
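
The comparative analyses in this abstract rest on two simple set operations: the gene families shared by every strain (a core set, like the 1,224 families conserved across the ten L. crispatus strains) and the families found in only one strain (like the pilus region present only in GG). The sketch below shows both under the assumption that each strain has already been reduced to a set of gene family identifiers. The strain and family names are hypothetical placeholders, and the grouping of genes into families, which is the hard part, is not shown.

```python
def core_families(strain_families):
    """Return the gene families present in every strain."""
    family_sets = [set(families) for families in strain_families.values()]
    if not family_sets:
        return set()
    return set.intersection(*family_sets)


def strain_specific(strain_families, strain):
    """Return the gene families found in `strain` and in no other strain."""
    others = set().union(
        *(set(families) for name, families in strain_families.items() if name != strain)
    )
    return set(strain_families[strain]) - others


# Hypothetical strains and gene family identifiers (illustrative only).
strain_families = {
    "strain1": {"famA", "famB", "famC"},
    "strain2": {"famA", "famB", "famD"},
    "strain3": {"famA", "famB"},
}

core = core_families(strain_families)
unique_to_strain1 = strain_specific(strain_families, "strain1")
print(sorted(core))               # ['famA', 'famB']
print(sorted(unique_to_strain1))  # ['famC']
```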

    Genomic Analyses of Polysaccharide Utilization in Marine Flavobacteriia

    Marine algae convert a substantial fraction of fixed carbon dioxide into various polysaccharides. Flavobacteriia that are specialized in algal polysaccharide degradation feature genomic clusters termed polysaccharide utilization loci (PULs). Since knowledge of extant PUL diversity is sparse, we sequenced the genomes of 53 North Sea Flavobacteriia. We obtained 400 PULs, suggesting usage of a large array of polysaccharides, including laminarin, α- and β-mannans, fucose-, xylose-, galactose-, rhamnose- and arabinose-containing substrates, pectins, and chitins. Many of the PULs exhibit new genetic architectures and suggest substrates rarely described for marine environments. The isolates' PUL repertoires often differed considerably within genera, corroborating ecological niche-associated glycan partitioning. Polysaccharide uptake in Flavobacteriia is mediated by SusCD-like transporter complexes. Respective protein trees revealed clustering according to polysaccharide specificities predicted by PUL annotations rather than phylogenetic affiliation. Using the trees, we analyzed expression of SusC/D homologs in multiyear phytoplankton bloom-associated metaproteomes and found indications for profound changes in microbial utilization of laminarin, α-glucans, β-mannan and sulfated xylan. We hence suggest the suitability of SusC/D-like transporter protein expression within heterotrophic bacteria as a proxy for the temporal utilization of discrete polysaccharides.

    Comparative Genome Analysis of Malaria Parasite Species

    With over 200 million infections and up to one million deaths every year, malaria remains one of the most devastating infectious diseases affecting humans. Over the last few years, complete genome sequences of both human and non-human malaria parasite species have become available, adding comparative genomics to the toolbox of molecular biologists studying the genetic basis of human virulence. In this thesis, I computationally compared the published genomes of seven malaria parasite species with the aim of gaining new insights into genes underlying human virulence. This comparison was performed using two complementary approaches. In the first approach, I used whole-genome synteny analysis to find genes present in human but not non-human malaria parasites. In the second approach, I first clustered virulence-associated genes into gene families and then examined these gene families for species-specific differences. Both comparisons resulted in interesting gene lists. Synteny analysis identified three key enzymes of the thiamine (vitamin B1) biosynthesis pathway that are present in human but not rodent malaria parasites, indicating that these two groups of parasites differ in their ability to synthesize vitamin B1 de novo. My gene family classification exposed, within the largest and highly divergent surface antigen gene family pir, a group of unusually well-conserved orthologs, which should be considered high-priority targets for experimental characterization and vaccine development. In conclusion, this thesis highlights genes and pathways that differ between human and non-human malaria parasites and therefore could play important roles in human virulence. Experimental studies can now be initiated to confirm virulence-associated functions and to explore their potential value for drug and vaccine development.

    Polymorphic toxin systems: Comprehensive characterization of trafficking modes, processing, mechanisms of action, immunity and ecology using comparative genomics

    Background: Proteinaceous toxins are observed across all levels of inter-organismal and intra-genomic conflicts. These include recently discovered prokaryotic polymorphic toxin systems implicated in intra-specific conflicts. They are characterized by a remarkable diversity of C-terminal toxin domains generated by recombination with standalone toxin-coding cassettes. Prior analysis revealed a striking diversity of nuclease and deaminase domains among the toxin modules. We systematically investigated polymorphic toxin systems using comparative genomics, sequence and structure analysis. Results: Polymorphic toxin systems are distributed across all major bacterial lineages and are delivered by at least eight distinct secretory systems. In addition to type-II, these include type-V, VI, VII (ESX), and the poorly characterized "Photorhabdus virulence cassettes (PVC)", PrsW-dependent and MuF phage-capsid-like systems. We present evidence that trafficking of these toxins is often accompanied by autoproteolytic processing catalyzed by HINT, ZU5, PrsW, caspase-like, papain-like, and a novel metallopeptidase associated with the PVC system. We identified over 150 distinct toxin domains in these systems. These span an extraordinary catalytic spectrum that includes 23 distinct clades of peptidases, numerous previously unrecognized versions of nucleases and deaminases, ADP-ribosyltransferases, ADP-ribosyl cyclases, RelA/SpoT-like nucleotidyltransferases, glycosyltransferases and other enzymes predicted to modify lipids and carbohydrates, and a pore-forming toxin domain. Several of these toxin domains are shared with host-directed effectors of pathogenic bacteria. Over 90 families of immunity proteins might neutralize anywhere from one to at least 27 distinct types of toxin domains. In some organisms multiple tandem immunity genes or immunity protein domains are organized into polyimmunity loci or polyimmunity proteins. 
Gene-neighborhood analysis of polymorphic toxin systems predicts the presence of novel trafficking-related components, and also the organizational logic that allows toxin diversification through recombination. Domain architecture and protein-length analysis revealed that these toxins might be deployed as secreted factors, through directed injection, or via inter-cellular contact facilitated by filamentous structures formed by RHS/YD, filamentous hemagglutinin and other repeats. Phyletic pattern and life-style analysis indicate that polymorphic toxins and polyimmunity loci participate in cooperative behavior and facultative 'cheating' in several ecosystems such as the human oral cavity and soil. Multiple domains from these systems have also been repeatedly transferred to eukaryotes and their viruses, such as the nucleo-cytoplasmic large DNA viruses. Conclusions: Along with a comprehensive inventory of toxins and immunity proteins, we present several testable predictions regarding active sites and catalytic mechanisms of toxins, their processing and trafficking, and their role in intra-specific and inter-specific interactions between bacteria. These systems provide insights regarding the emergence of key systems at different points in eukaryotic evolution, such as ADP ribosylation, interaction of myosin VI with cargo proteins, mediation of apoptosis, hyphal heteroincompatibility, hedgehog signaling, arthropod toxins, cell-cell interaction molecules like teneurins and different signaling messengers. Funding: intramural funds of the US Department of Health and Human Services (National Library of Medicine, NIH).