1,771 research outputs found

    Towards realistic benchmarks for multiple alignments of non-coding sequences

    Background: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks. Results: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments. Conclusion: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.
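
    The contrast drawn in this abstract, simulating from a genome-wide distribution of a parameter rather than from its average, can be illustrated with a toy sketch. This is not the authors' simulator: it models substitutions only (no insertions or deletions), and the gamma distribution, the mean rate of 0.1, and all function names are assumptions chosen for illustration.

```python
import random
import statistics

def mutate(seq, rate, rng):
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def divergence(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def simulate(n_loci, length, draw_rate, rng):
    """Simulate one descendant per locus; return per-locus divergence."""
    divs = []
    for _ in range(n_loci):
        ancestor = "".join(rng.choice("ACGT") for _ in range(length))
        rate = min(draw_rate(rng), 1.0)  # cap so it stays a probability
        divs.append(divergence(ancestor, mutate(ancestor, rate, rng)))
    return divs

rng = random.Random(0)
mean_rate = 0.1  # hypothetical average substitution rate

# Conventional approach: every locus uses the same average rate.
fixed = simulate(200, 200, lambda r: mean_rate, rng)
# Distribution-based approach: each locus draws its own rate
# (gamma with shape 1 and scale mean_rate, so the mean is still 0.1).
varied = simulate(200, 200, lambda r: r.gammavariate(1.0, mean_rate), rng)

spread_fixed = statistics.pstdev(fixed)
spread_varied = statistics.pstdev(varied)
```

    Under these assumptions both runs have a similar mean divergence, but only the second produces loci that range from nearly identical to highly diverged, which is the kind of variation in conservation levels the abstract says a single averaged parameter cannot reproduce.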

    Multiple species comparative analysis of human chromosome 22 between markers D22S1687 and D22S419 and gene expression profiling in zebrafish.

    Major large-scale insertions or deletions that resulted in gene number differences between human and chimpanzee were discovered in the IGLL and LCR22s within this region, with four human insertions from 6 Kb to 75 Kb and three chimpanzee insertions from 12 Kb to 74 Kb observed in the IGLL region, two human insertions of 59 Kb and 36 Kb in LCR22-6, and a 67 Kb chimpanzee insertion in LCR22-8. Small-scale insertions and deletions, in addition to exon shuffling, elevated nucleotide divergence rate and positive selection, were also observed in the putative genes, partially duplicated genes and pseudogenes in the IGLL and LCR22s. Thus, the second major conclusion of this study is that the major differences between human and chimpanzee in this region lie in the highly repetitive regions of the IGLL and the LCR22s. Comparison of a 4.5 Mb region of human chromosome 22 between markers D22S1687 and D22S419 with the syntenic region in chimpanzee revealed an overall DNA sequence identity of approximately 97.6% and a Ka/Ks ratio for known protein-coding genes of approximately 0.25, with the majority of amino acid changes occurring between hydrophilic amino acids, followed by changes between hydrophobic amino acids, and the fewest changes between hydrophobic and hydrophilic amino acids or vice versa. Thus, the first major conclusion of this study is that, overall, this chromosomal region is highly conserved between human and chimpanzee, and the known protein-coding genes are undergoing purifying selection, in which 75% of nucleotide substitutions that led to amino acid changes were eliminated by adaptive evolution. Through whole mount in situ hybridization studies, a total of 12 human orthologs in zebrafish, including 4 newly predicted putative genes with no previously known expression profile and function, showed specific expression in the developing zebrafish embryonic central nervous system, optic system, neural crest cells, otic vesicle, liver, and notochord. Thus, the third major conclusion of this study is that many predicted genes which currently lack expression data and functional information are likely time- and tissue-specific during developmental processes.

    Biased gene conversion and GC-content evolution in the coding sequences of reptiles and vertebrates.

    Mammalian and avian genomes are characterized by a substantial spatial heterogeneity of GC-content, which is often interpreted as reflecting the effect of local GC-biased gene conversion (gBGC), a meiotic repair bias that favors G and C over A and T alleles in high-recombining genomic regions. Surprisingly, the first fully sequenced nonavian sauropsid (i.e., reptile), the green anole Anolis carolinensis, revealed a highly homogeneous genomic GC-content landscape, suggesting the possibility that gBGC might not be at work in this lineage. Here, we analyze GC-content evolution at third-codon positions (GC3) in 44 vertebrate species, including eight newly sequenced transcriptomes, with a specific focus on nonavian sauropsids. We report that reptiles, including the green anole, have a genome-wide distribution of GC3 similar to that of mammals and birds, and we infer a strong GC3-heterogeneity to be already present in the tetrapod ancestor. We further show that the dynamics of coding sequence GC-content are largely governed by karyotypic features in vertebrates, notably in the green anole, in agreement with the gBGC hypothesis. The discrepancy between third-codon positions and noncoding DNA regarding GC-content dynamics in the green anole could not be explained by the activity of transposable elements or selection on codon usage. This analysis highlights the unique value of third-codon positions as an insertion/deletion-free marker of nucleotide substitution biases that ultimately affect the evolution of proteins.
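
    For readers unfamiliar with GC3, the quantity analyzed in this abstract is the fraction of G or C bases at third-codon positions of a coding sequence. A minimal sketch follows; the function name and example sequences are illustrative, and it assumes the input is an in-frame coding sequence starting at the first codon position.

```python
def gc3(cds: str) -> float:
    """Fraction of G or C among third-codon positions of an in-frame CDS."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    thirds = [seq[i] for i in range(2, usable, 3)]
    # Skip ambiguous bases such as N so they do not dilute the estimate.
    called = [base for base in thirds if base in "ACGT"]
    if not called:
        raise ValueError("no unambiguous third-codon positions")
    return sum(base in "GC" for base in called) / len(called)

# Codons ATG GCA AAA TTT have third positions G, A, A, T.
low = gc3("ATGGCAAAATTT")
# Codons ATG GCC AAG TTC have third positions G, C, G, C.
high = gc3("ATGGCCAAGTTC")
```

    Comparing per-gene values such as these across a genome is what gives a GC3 distribution; a broad distribution corresponds to the heterogeneity the abstract describes.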

    Long-term trends in evolution of indels in protein sequences

    BACKGROUND: In this paper we describe an analysis of the size evolution of both protein domains and their indels, as inferred by changing sizes of whole domains or individual unaligned regions or "spacers". We studied relatively early evolutionary events and focused on protein domains which are conserved among various taxonomy groups. RESULTS: We found that more than one third of all domains have a statistically significant tendency to increase/decrease in size in evolution as judged from the overall domain size distribution as well as from the size distribution of individual spacers. Moreover, the fraction of domains and individual spacers increasing in size is almost twofold larger than the fraction decreasing in size. CONCLUSION: We showed that the tolerance to insertion and deletion events depends on the domain's taxonomy span. Eukaryotic domains are depleted in insertions compared to the overall test set, namely, the number of spacers increasing in size is about the same as the number of spacers decreasing in size. On the other hand, ancient domain families show some bias towards insertions or spacers which grow in size in evolution. Domains from several Gene Ontology categories also demonstrate certain tendencies for insertion or deletion events as inferred from the analysis of spacer sizes

    The draft genome of the parasitic nematode Trichinella spiralis

    Genome evolution studies for the phylum Nematoda have been limited by focusing on comparisons involving Caenorhabditis elegans. We report a draft genome sequence of Trichinella spiralis, a food-borne zoonotic parasite, which is the most common cause of human trichinellosis. This parasitic nematode is an extant member of a clade that diverged early in the evolution of the phylum, enabling identification of archetypical genes and molecular signatures exclusive to nematodes. We sequenced the 64-Mb nuclear genome, which is estimated to contain 15,808 protein-coding genes, at ~35-fold coverage using whole-genome shotgun and hierarchal map–assisted sequencing. Comparative genome analyses support intrachromosomal rearrangements across the phylum, disproportionate numbers of protein family deaths over births in parasitic compared to a non-parasitic nematode and a preponderance of gene-loss and -gain events in nematodes relative to Drosophila melanogaster. This genome sequence and the identified pan-phylum characteristics will contribute to genome evolution studies of Nematoda as well as strategies to combat global parasites of humans, food animals and crops.

    XX/XY Sex Chromosomes in the South American Dwarf Gecko (Gonatodes humeralis)

    Sex-specific genetic markers identified using restriction site-associated DNA sequencing, or RADseq, permit the recognition of a species’ sex chromosome system in cases where standard cytogenetic methods fail. Thus, species with male-specific RAD markers have an XX/XY sex chromosome system (male heterogamety), while species with female-specific RAD markers have a ZZ/ZW sex chromosome system (female heterogamety). Here, we use RADseq data from 5 male and 5 female South American dwarf geckos (Gonatodes humeralis) to identify an XX/XY sex chromosome system. This is the first confidently known sex chromosome system in a Gonatodes species. We used a low-coverage de novo G. humeralis genome assembly to design PCR primers to validate the male-specificity of a subset of the sex-specific RADseq markers, and describe how even modest genome assemblies can facilitate the design of sex-specific PCR primers in species with diverse sex chromosome systems.
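
    The inference rule stated in this abstract (male-specific markers imply XX/XY, female-specific markers imply ZZ/ZW) can be sketched as a simple set comparison. This is an illustrative simplification, not the pipeline used in the study: it assumes a strict criterion in which a marker must be found in every individual of one sex and in none of the other, and the marker and sample names are hypothetical.

```python
def sex_specific_markers(presence, sexes):
    """Return (male_specific, female_specific) marker lists.

    presence: dict mapping marker id -> set of sample ids carrying it.
    sexes:    dict mapping sample id -> "M" or "F".
    """
    males = {s for s, sex in sexes.items() if sex == "M"}
    females = {s for s, sex in sexes.items() if sex == "F"}
    if not males or not females:
        raise ValueError("need at least one sample of each sex")
    # Strict criterion (an assumption): present in all of one sex, none of the other.
    male_specific = sorted(m for m, found in presence.items() if found == males)
    female_specific = sorted(m for m, found in presence.items() if found == females)
    return male_specific, female_specific

def infer_system(male_specific, female_specific):
    """Apply the rule: an excess of male-specific markers suggests XX/XY."""
    if len(male_specific) > len(female_specific):
        return "XX/XY"
    if len(female_specific) > len(male_specific):
        return "ZZ/ZW"
    return "undetermined"

# Hypothetical toy data: two males, two females, three markers.
sexes = {"M1": "M", "M2": "M", "F1": "F", "F2": "F"}
presence = {
    "rad_001": {"M1", "M2"},        # in all males, no females
    "rad_002": {"M1", "M2", "F1"},  # shared between sexes
    "rad_003": {"M1"},              # in only one male
}
male_only, female_only = sex_specific_markers(presence, sexes)
system = infer_system(male_only, female_only)
```

    Real data sets usually need a tolerance for missing data and sequencing error, which is one reason the abstract describes validating a subset of markers by PCR.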

    Comparative analysis of human chromosome 22 CES-DGCR syntenic regions in chimpanzee, baboon, bovine, mouse and zebrafish and expression profiling in zebrafish early developmental stages using whole mount in situ hybridization.

    The final series of experiments was based on the earlier observation that 16 genes in the human chromosome 22 CES-DGCR region had reported expression but no detailed expression profiles, while 6 others had no known expression profiles. Through the comparative sequencing and subsequent whole mount in situ studies reported in this dissertation, expression of these 22 genes was observed to occur during zebrafish development, mainly during early developmental stages followed by either decreased or no expression in later stages, in the brain, ear, eyes, heart, pharyngeal arches, liver, and kidney, all organs related to the anomalies that produce the phenotype observed in CES-DGCR patients. Therefore, the third major conclusion of this work is that, contrary to prior studies pointing to single gene alterations resulting in these diseases, it now is clear that both CES and DGCR are multigene-based diseases. This dissertation is a compound document (contains both a paper copy and a CD as part of the dissertation). The CD requires the following system requirements: Adobe Acrobat. The majority of the amino acid substitutions in humans, chimpanzees, baboons and bovines are changes from hydrophilic to hydrophilic amino acids. The observed substitution rate between humans and chimpanzees was 1.2% and that between humans and baboons was 2.6%, with Ka/Ks ratios of 0.44 for human and chimpanzee and 0.48 for human and baboon. Thus, the second major conclusion of this work is that, at least in the case of humans vs. primates, the genes are evolving by purifying selection. Comparative genomic analysis is a powerful tool that can illuminate the genomic sequence features that result in the changes that drive evolution. In this dissertation, the 4.5 Mb region proximal to the centromere of human chromosome 22 that encodes the contiguous Cat Eye Syndrome and DiGeorge-Velocardiofacial Syndrome (CES-DGCR/VCFS) Critical Regions and the orthologous regions from chimpanzee, baboon, cow, mouse and zebrafish have been sequenced and compared. Overall, the human and chimpanzee sequences were ~98.5% identical and the human-baboon sequences were ~92% identical at the nucleotide level. A high degree of conservation was observed in both the gene order and the coding region sequences for these syntenic regions, with a lower degree of conservation in the intronic and intergenic regions. The conserved structural features likely represent conserved functional properties, while the observed differences must be responsible for portions of the human- and primate-specific phenotypes. The region studied was slightly larger in humans than in chimpanzees and baboons, since the human lineage had a higher insertion frequency relative to the other primates (or the other primates have a higher deletion frequency compared to humans). By comparing the sequenced regions of the chimpanzee genome from three different individual chimpanzees, Clint (ch251), Donald (rp43) and Gon (ptb), the first major conclusion from this dissertation research is that these three chimpanzees differ from each other by ~1.2%, almost as much as humans differ from chimpanzees.