4,461 research outputs found

    The amphioxus genome and the evolution of the chordate karyotype

    Get PDF
    Lancelets ('amphioxus') are the modern survivors of an ancient chordate lineage, with a fossil record dating back to the Cambrian period. Here we describe the structure and gene content of the highly polymorphic approx520-megabase genome of the Florida lancelet Branchiostoma floridae, and analyse it in the context of chordate evolution. Whole-genome comparisons illuminate the murky relationships among the three chordate groups (tunicates, lancelets and vertebrates), and allow not only reconstruction of the gene complement of the last common chordate ancestor but also partial reconstruction of its genomic organization, as well as a description of two genome-wide duplications and subsequent reorganizations in the vertebrate lineage. These genome-scale events shaped the vertebrate genome and provided additional genetic variation for exploitation during vertebrate evolution

    Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

    Get PDF
    BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly becomes a limiting factor as the number and scale of genomes grows. METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy. CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net

    Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement

    Full text link
    Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms. We describe a method to align two or more genomes that have undergone large-scale recombination, particularly genomes that have undergone substantial amounts of gene gain and loss (gene flux). The method utilizes a novel alignment objective score, referred to as a sum-of-pairs breakpoint score. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The progressive genome alignment algorithm demonstrates markedly improved accuracy over previous approaches in situations where genomes have undergone realistic amounts of genome rearrangement, gene gain, loss, and duplication. We apply the progressive genome alignment algorithm to a set of 23 completely sequenced genomes from the genera Escherichia, Shigella, and Salmonella. The 23 enterobacteria have an estimated 2.46Mbp of genomic content conserved among all taxa and total unique content of 15.2Mbp. We document substantial population-level variability among these organisms driven by homologous recombination, gene gain, and gene loss. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve .Comment: Revision dated June 19, 200

    Revealing mammalian evolutionary relationships by comparative analysis of gene clusters

    Get PDF
    Many software tools for comparative analysis of genomic sequence data have been released in recent decades. Despite this, it remains challenging to determine evolutionary relationships in gene clusters due to their complex histories involving duplications, deletions, inversions, and conversions. One concept describing these relationships is orthology. Orthologs derive from a common ancestor by speciation, in contrast to paralogs, which derive from duplication. Discriminating orthologs from paralogs is a necessary step in most multispecies sequence analyses, but doing so accurately is impeded by the occurrence of gene conversion events. We propose a refined method of orthology assignment based on two paradigms for interpreting its definition: by genomic context or by sequence content. X-orthology (based on context) traces orthology resulting from speciation and duplication only, while N-orthology (based on content) includes the influence of conversion events

    Towards plant pangenomics

    Get PDF
    As an increasing number of genome sequences become available for a wide range of species, there is a growing understanding that the genome of a single individual is insufficient to represent the gene diversity within a whole species. Many studies examine the sequence diversity within genes, and this allelic variation is an important source of phenotypic variation which can be selected for by man or nature. However, the significant gene presence/absence variation that has been observed within species and the impact of this variation on traits is only now being studied in detail. The sum of the genes for a species is termed the pangenome, and the determination and characterization of the pangenome is a requirement to understand variation within a species. In this review, we explore the current progress in pangenomics as well as methods and approaches for the characterization of pangenomes for a wide range of plant species

    Integration of Alignment and Phylogeny in the Whole-Genome Era

    Get PDF
    With the development of new sequencing techniques, whole genomes of many species have become available. This huge amount of data gives rise to new opportunities and challenges. These new sequences provide valuable information on relationships among species, e.g. genome recombination and conservation. One of the principal ways to investigate such information is multiple sequence alignment (MSA). Currently, there is large amount of MSA data on the internet, such as the UCSC genome database, but how to effectively use this information to solve classical and new problems is still an area lacking of exploration. In this thesis, we explored how to use this information in four problems, i.e. sequence orthology search problem, multiple alignment improvement problem, short read mapping problem, and genome rearrangement inference problem. For the first problem, we developed a EM algorithm to iteratively align a query with a multiple alignment database with the information from a phylogeny relating the query species and the species in the multiple alignment. We also infer the query\u27s location in the phylogeny. We showed that by doing alignment and phylogeny inference together, we can improve the accuracies for both problems. For the second problem, we developed an optimization algorithm to iteratively refine the multiple alignment quality. Experiment results showed our algorithm is very stable in term of resulting alignments. The results showed that our method is more accurate than existing methods, i.e. Mafft, Clustal-O, and Mavid, on test data from three sets of species from the UCSC genome database. For the third problem, we developed a model, PhyMap, to align a read to a multiple alignment allowing mismatches and indels. PhyMap computes local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyMap uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. Both theoretical computation and experiment results show that our model can differentiate between orthologous and paralogous alignments better than other popular short read mapping tools (BWA, BOWTIE and BLAST). For the fourth problem, we gave a simple genome recombination model which can express insertions, deletions, inversions, translocations and inverted translocations on aligned genome segments. We also developed an MCMC algorithm to infer the order of the query segments. We proved that using any Euclidian metrics to measure distance between two sequence orders in the tree optimization goal function will lead to a degenerated solution where the inferred order will be the order of one of the leaf nodes. We also gave a graph-based formulation of the problem which can represent the probability distribution of the order of the query sequences

    The aspartic proteinase family of three Phytophthora species

    Get PDF
    Background - Phytophthora species are oomycete plant pathogens with such major social and economic impact that genome sequences have been determined for Phytophthora infestans, P. sojae and P. ramorum. Pepsin-like aspartic proteinases (APs) are produced in a wide variety of species (from bacteria to humans) and contain conserved motifs and landmark residues. APs fulfil critical roles in infectious organisms and their host cells. Annotation of Phytophthora APs would provide invaluable information for studies into their roles in the physiology of Phytophthora species and interactions with their hosts. Results - Genomes of Phytophthora infestans, P. sojae and P. ramorum contain 11-12 genes encoding APs. Nine of the original gene models in the P. infestans database and several in P. sojae and P. ramorum (three and four, respectively) were erroneous. Gene models were corrected on the basis of EST data, consistent positioning of introns between orthologues and conservation of hallmark motifs. Phylogenetic analysis resolved the Phytophthora APs into 5 clades. Of the 12 sub-families, several contained an unconventional architecture, as they either lacked a signal peptide or a propart region. Remarkably, almost all APs are predicted to be membrane-bound. Conclusions - One of the twelve Phytophthora APs is an unprecedented fusion protein with a putative G-protein coupled receptor as the C-terminal partner. The others appear to be related to well-documented enzymes from other species, including a vacuolar enzyme that is encoded in every fungal genome sequenced to date. Unexpectedly, however, the oomycetes were found to have both active and probably-inactive forms of an AP similar to vertebrate BACE, the enzyme responsible for initiating the processing cascade that generates the Aß peptide central to Alzheimer's Disease. The oomycetes also encode enzymes similar to plasmepsin V, a membrane-bound AP that cleaves effector proteins of the malaria parasite Plasmodium falciparum during their translocation into the host red blood cell. Since the translocation of Phytophthora effector proteins is currently a topic of intense research activity, the identification in Phytophthora of potential functional homologues of plasmepsin V would appear worthy of investigation. Indeed, elucidation of the physiological roles of the APs identified here offers areas for future study. The significant revision of gene models and detailed annotation presented here should significantly facilitate experimental design

    The draft genome of the parasitic nematode Trichinella spiralis

    Get PDF
    Genome evolution studies for the phylum Nematoda have been limited by focusing on comparisons involving Caenorhabditis elegans. We report a draft genome sequence of Trichinella spiralis, a food-borne zoonotic parasite, which is the most common cause of human trichinellosis. This parasitic nematode is an extant member of a clade that diverged early in the evolution of the phylum, enabling identification of archetypical genes and molecular signatures exclusive to nematodes. We sequenced the 64-Mb nuclear genome,which is estimated to contain 15,808 protein-coding genes,at ~35-fold coverage using whole-genome shotgun and hierarchal map–assisted sequencing. Comparative genome analyses support intrachromosomal rearrangements across the phylum, disproportionate numbers of protein family deaths over births in parasitic compared to a non-parasitic nematode and a preponderance of gene-loss and -gain events in nematodes relative to Drosophila melanogaster. This genome sequence and the identified pan-phylum characteristics will contribute to genome evolution studies of Nematoda as well as strategies to combat global parasites of humans, food animals and crops
    corecore