47 research outputs found

    Ultra-fast sequence clustering from similarity networks with SiLiX

    Background: The number of gene sequences available for comparative genomics is increasing extremely quickly. A current challenge is to handle this huge number of sequences in order to build families of homologous sequences in a reasonable time. Results: We present the software package SiLiX, which implements a novel method that reconsiders single linkage clustering with a graph theoretical approach. A parallel version of the algorithms is also presented. As a demonstration of the ability of our software, we clustered more than 3 million sequences from about 2 billion BLAST hits in 7 minutes, with a high clustering quality, both in terms of sensitivity and specificity. Conclusions: Compared with other state-of-the-art software, SiLiX presents the best up-to-date capabilities to face the problem of clustering large collections of sequences. SiLiX is freely available at http://lbbe.univ-lyon1.fr/SiLiX.
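
The abstract above describes single linkage clustering viewed as a graph problem: sequences are nodes, similarity hits are edges, and families are connected components. The following is a minimal sketch of that general idea using a union-find structure. It is not the SiLiX implementation; the function name `single_linkage` and the input format (a list of id pairs that already passed some similarity filter) are assumptions made for illustration.

```python
from collections import defaultdict


def single_linkage(pairs):
    """Group sequence ids into connected components of the similarity graph.

    `pairs` is an iterable of (id_a, id_b) tuples, assumed to be hits that
    already passed a similarity threshold (hypothetical input format).
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    # Sort for a deterministic result.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


hits = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
families = single_linkage(hits)
print(families)
```

Because each hit is processed once and merges are nearly constant time, a scheme like this never needs to hold the full similarity graph in memory, which is one reason graph-based single linkage can scale to very large hit lists.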

    Building a History of Horizontal Gene Transfer in E. Coli

    Bacteria's ability to pass entire genes between one another, a process called Horizontal Gene Transfer (HGT), has a major impact on bacterial evolution. In an ongoing project at Harvey Mudd, computational methods have been used to catalogue the HGT events that have impacted a group of closely related bacteria. This thesis builds on that project by improving our ability to identify gene families (groups of genes in different strains that are related). Previously, similarity was measured only by comparing two genes' DNA sequences, ignoring their positions on the organism's DNA. Here, we leverage genes' relative position to make a better measurement of gene similarity. These improved similarity measurements will improve the existing pipeline's ability to identify HGT events

    Complete Genome Sequence of Burkholderia phymatum STM815T, a Broad Host Range and Efficient Nitrogen-Fixing Symbiont of Mimosa Species

    Burkholderia phymatum is a soil bacterium able to develop a nitrogen-fixing symbiosis with species of the legume genus Mimosa, and is frequently found associated specifically with Mimosa pudica. The type strain of the species, STM815T, was isolated from a root nodule in French Guiana in 2000. The strain is an aerobic, motile, non-spore-forming, Gram-negative rod, and is a highly competitive strain for nodulation compared to other Mimosa symbionts, as it also nodulates a broad range of other legume genera and species. The 8,676,562 bp genome is composed of two chromosomes (3,479,187 and 2,697,374 bp), a megaplasmid (1,904,893 bp) and a plasmid hosting the symbiotic functions (595,108 bp)

    Inference and reconstruction of the heimdallarchaeial ancestry of eukaryotes

    In the ongoing debates about eukaryogenesis (the series of evolutionary events leading to the emergence of the eukaryotic cell from prokaryotic ancestors), members of the Asgard archaea play a key part as the closest archaeal relatives of eukaryotes. However, the nature and phylogenetic identity of the last common ancestor of Asgard archaea and eukaryotes remain unresolved. Here we analyse distinct phylogenetic marker datasets of an expanded genomic sampling of Asgard archaea and evaluate competing evolutionary scenarios using state-of-the-art phylogenomic approaches. We find that eukaryotes are placed, with high confidence, as a well-nested clade within Asgard archaea and as a sister lineage to Hodarchaeales, a newly proposed order within Heimdallarchaeia. Using sophisticated gene tree and species tree reconciliation approaches, we show that, analogous to the evolution of eukaryotic genomes, genome evolution in Asgard archaea involved significantly more gene duplication and fewer gene loss events compared with other archaea. Finally, we infer that the last common ancestor of Asgard archaea was probably a thermophilic chemolithotroph and that the lineage from which eukaryotes evolved adapted to mesophilic conditions and acquired the genetic potential to support a heterotrophic lifestyle. Our work provides key insights into the prokaryote-to-eukaryote transition and a platform for better understanding the emergence of cellular complexity in eukaryotic cells

    kClust: fast and sensitive clustering of large protein sequence databases

    Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable. Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%-30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%-30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed. Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at ftp://toolkit.lmb.uni-muenchen.de/pub/kClust/
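
The alignment-free prefilter mentioned above can be illustrated with a much simpler stand-in: counting identical k-mers shared by two sequences and keeping only pairs above a cutoff. This sketch is an assumption-laden simplification, not kClust's algorithm, which scores similar (not only identical) 6-mers cumulatively. The names `kmers` and `prefilter`, the `min_shared` cutoff, and the toy sequences are all hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(seqs, k=6, min_shared=2):
    """Return id pairs sharing at least `min_shared` identical k-mers.

    Simplification: exact k-mer matches only, and a brute-force loop over
    all pairs. A real tool would use a k-mer index to avoid that loop.
    """
    index = {name: kmers(s, k) for name, s in seqs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(index), 2)
        if len(index[a] & index[b]) >= min_shared
    ]


# Toy protein-like strings, invented for this example.
toy = {
    "p1": "MKTAYIAKQRQ",
    "p2": "MKTAYIAKQLL",
    "p3": "GGGGSGGGGSG",
}
candidates = prefilter(toy)
print(candidates)
```

Only the pairs that survive such a filter would then be passed to a more expensive alignment step, which is where the speed-up over all-against-all alignment comes from.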

    Key components of the eight classes of type IV secretion systems involved in bacterial conjugation or protein secretion

    Conjugation of DNA through a type IV secretion system (T4SS) drives horizontal gene transfer. Yet little is known on the diversity of these nanomachines. We previously found that T4SS can be divided in eight classes based on the phylogeny of the only ubiquitous protein of T4SS (VirB4). Here, we use an ab initio approach to identify protein families systematically and specifically associated with VirB4 in each class. We built profiles for these proteins and used them to scan 2262 genomes for the presence of T4SS. Our analysis led to the identification of thousands of occurrences of 116 protein families for a total of 1623 T4SS. Importantly, we could identify almost always in our profiles the essential genes of well-studied T4SS. This allowed us to build a database with the largest number of T4SS described to date. Using profile-profile alignments, we reveal many new cases of homology between components of distant classes of T4SS. We mapped these similarities on the T4SS phylogenetic tree and thus obtained the patterns of acquisition and loss of these protein families in the history of T4SS. The identification of the key VirB4-associated proteins paves the way toward experimental analysis of poorly characterized T4SS classes

    Population genomics of the maize pathogen Ustilago maydis: demographic history and role of virulence clusters in adaptation

    The tight interaction between pathogens and their hosts results in reciprocal selective forces that impact the genetic diversity of the interacting species. The footprints of this selection differ between pathosystems because of distinct life-history traits, demographic histories, or genome architectures. Here, we studied the genome-wide patterns of genetic diversity of 22 isolates of the causative agent of the corn smut disease, Ustilago maydis, originating from five locations in Mexico, the presumed center of origin of this species. In this species, many genes encoding secreted effector proteins reside in so-called virulence clusters in the genome, an arrangement that is so far not found in other filamentous plant pathogens. Using a combination of population genomic statistical analyses, we assessed the geographical, historical, and genome-wide variation of genetic diversity in this fungal pathogen. We report evidence of two partially admixed subpopulations that are only loosely associated with geographic origin. Using the multiple sequentially Markov coalescent model, we inferred the demographic history of the two pathogen subpopulations over the last 0.5 Myr. We show that both populations experienced a recent strong bottleneck starting around 10,000 years ago, coinciding with the assumed time of maize domestication. Although the genome-average genetic diversity is low compared with other fungal pathogens, we estimated that the rate of nonsynonymous adaptive substitutions is three times higher in genes located within virulence clusters compared with nonclustered genes, including nonclustered effector genes. These results highlight the role that these singular genomic regions play in the evolution of this pathogen

    Computational Analysis of Large-Scale Trends and Dynamics in Eukaryotic Protein Family Evolution

    The myriad protein-coding genes found in present-day eukaryotes arose from a combination of speciation and gene duplication events, spanning more than one billion years of evolution. Notably, as these proteins evolved, the individual residues at each site in their amino acid sequences were replaced at markedly different rates. The relationship between protein structure, protein function, and site-specific rates of amino acid replacement is a topic of ongoing research. Additionally, there is much interest in the different evolutionary constraints imposed on sequences related by speciation (orthologs) versus sequences related by gene duplication (paralogs). A principal aim of this dissertation is to evaluate and characterize several broad trends in eukaryote protein evolution. To this end, I use sequence-based computational predictors of protein structure (intrinsic disorder and protein secondary structure) and protein function (predicted functional domains), in addition to Bayesian phylogenetic inference methods, to analyze thousands of homologous protein sequence clusters from four eukaryotic lineages: animals, plants, fungi and protists. Using these data, I performed large-scale factorial analyses, testing the correlation between protein structure/function and rates of sequence evolution. The combined results of these analyses somewhat corroborate the findings of previous research in the field, but they also illuminate a subtle interaction among multiple drivers of protein sequence evolution, which is consistently observed across multiple eukaryote groups. Furthermore, using the results of Bayesian phylogenetic analysis on real and simulated protein sequence alignments, I show that orthologous and paralogous proteins exhibit significantly different overall patterns of sequence divergence, indicating that paralogs tend to evolve under relaxed selective pressure. 
The acquisition of homologous biological sequence clusters is a prominent component of computational biological research. To assist in the identification of protein families within large sequence databases, I implement a simple, graph-based single-linkage clustering procedure, and I demonstrate its capacity to recover homologous subunits of the Rpt regulatory ring in the 26S proteasome complex