    Characterization of DNA methylation as a function of biological complexity via dinucleotide inter-distances

    We perform a statistical study of the distances between successive occurrences of a given dinucleotide in the DNA sequence for a number of organisms of different complexity. Our analysis highlights peculiar features of the dinucleotide CG distribution in mammalian DNA, pointing towards a connection with the role of this dinucleotide in DNA methylation. While the CG distributions of mammals exhibit exponential tails with comparable parameters, the picture for the other organisms studied (e.g., fish, insects, bacteria and viruses) is more heterogeneous, possibly because in these organisms DNA methylation has different functional roles. Our analysis suggests that the distribution of the distances between CG dinucleotides provides useful insights for characterizing and classifying organisms in terms of methylation functionalities. Comment: 13 pages, 5 figures. To be published in the Philosophical Transactions A theme issue "DNA as information
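
The inter-distance statistic described in this abstract can be illustrated with a minimal sketch. It assumes that "distance" means the difference between the start positions of successive occurrences; the abstract does not state the exact convention, and the function names `dinucleotide_distances` and `distance_histogram` are hypothetical.

```python
from collections import Counter


def dinucleotide_distances(seq, dinuc="CG"):
    """Distances between start positions of successive occurrences of `dinuc`.

    Assumption: distance = difference of start indices (not stated in the abstract).
    """
    seq = seq.upper()
    positions = []
    start = seq.find(dinuc)
    while start != -1:
        positions.append(start)
        # Advance by one so overlapping occurrences (e.g. "AA" in "AAA") are kept.
        start = seq.find(dinuc, start + 1)
    return [b - a for a, b in zip(positions, positions[1:])]


def distance_histogram(seq, dinuc="CG"):
    """Empirical distribution of inter-distances as a Counter {distance: count}."""
    return Counter(dinucleotide_distances(seq, dinuc))


# CG starts at indices 1, 5, 7 and 12 in this toy sequence.
print(dinucleotide_distances("ACGTTCGCGAAACG"))
```

On a real genome the histogram would be the starting point for examining the tail of the distribution; fitting an exponential tail is outside the scope of this sketch.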

    Evidence of widespread degradation of gene control regions in hominid genomes

    Although sequences containing regulatory elements located close to protein-coding genes are often only weakly conserved during evolution, comparisons of rodent genomes have implied that these sequences are subject to some selective constraints. Evolutionary conservation is particularly apparent upstream of coding sequences and in first introns, regions that are enriched for regulatory elements. By comparing the human and chimpanzee genomes, we show here that there is almost no evidence for conservation in these regions in hominids. Furthermore, we show that gene expression is diverging more rapidly in hominids than in murids per unit of neutral sequence divergence. By combining data on polymorphism levels in human noncoding DNA and the corresponding human–chimpanzee divergence, we show that the proportion of adaptive substitutions in these regions in hominids is very low. It therefore seems likely that the lack of conservation and increased rate of gene expression divergence are caused by a reduction in the effectiveness of natural selection against deleterious mutations because of the low effective population sizes of hominids. This has resulted in the accumulation of a large number of deleterious mutations in sequences containing gene control elements and hence a widespread degradation of the genome during the evolution of humans and chimpanzees

    Simple sequence repeats in zebra finch (Taeniopygia guttata) expressed sequence tags: a new resource for evolutionary genetic studies of passerines

    Background: Passerines (perching birds) are widely studied across many biological disciplines including ecology, population biology, neurobiology, behavioural ecology and evolutionary biology. However, understanding the molecular basis of relevant traits is hampered by the paucity of passerine genomics tools. Efforts to address this problem are underway, and the zebra finch (Taeniopygia guttata) will be the first passerine to have its genome sequenced. Here we describe a bioinformatic analysis of zebra finch expressed sequence tag (EST) GenBank entries. Results: A total of 48,862 ESTs were downloaded from GenBank and assembled into contigs, representing an estimated 17,404 unique sequences. The unique sequence set contained 638 simple sequence repeats (SSRs) or microsatellites of length ≥20 bp and purity ≥90% and 144 simple sequence repeats of length ≥30 bp. A chromosomal location for the majority of SSRs was predicted by BLAST search against assembly 2.1 of the chicken genome sequence. The relative exonic location (5' untranslated region, coding region or 3' untranslated region) was predicted for 218 of the SSRs, by BLAST search against the ENSEMBL chicken peptide database. Ten loci were examined for polymorphism in two zebra finch populations and two populations of a distantly related passerine, the house sparrow Passer domesticus. Linkage was confirmed for four loci that were predicted to reside on the passerine homologue of chicken chromosome 7. Conclusion: We show that SSRs are abundant within zebra finch ESTs, and that their genomic location can be predicted from sequence similarity with the assembled chicken genome sequence. We demonstrate that a useful proportion of zebra finch EST-SSRs are likely to be polymorphic, and that they can be used to build a linkage map. Finally, we show that many zebra finch EST-SSRs are likely to be useful in evolutionary genetic studies of other passerines

    p-Adic Modelling of the Genome and the Genetic Code

    The present paper is devoted to foundations of p-adic modelling in genomics. Considering nucleotides, codons, DNA and RNA sequences, amino acids, and proteins as information systems, we have formulated the corresponding p-adic formalisms for their investigations. Each of these systems has its characteristic prime number used for construction of the related information space. Relevance of this approach is illustrated by some examples. In particular, it is shown that degeneration of the genetic code is a p-adic phenomenon. We have also put forward a hypothesis on evolution of the genetic code assuming that primitive code was based on single nucleotides and chronologically first four amino acids. This formalism of p-adic genomic information systems can be implemented in computer programs and applied to various concrete cases. Comment: 26 pages. Submitted to the Computer Journal for a special issu

    Precise targeted integration by a chimaeric transposase zinc-finger fusion protein

    Transposons of the Tc1/mariner family have been used to integrate foreign DNA stably into the genome of a large variety of different cell types and organisms. Integration is at TA dinucleotides located essentially at random throughout the genome, potentially leading to insertional mutagenesis, inappropriate activation of nearby genes, or poor expression of the transgene. Here, we show that fusion of the zinc-finger DNA-binding domain of Zif268 to the C-terminus of ISY100 transposase leads to highly specific integration into TA dinucleotides positioned 6-17 bp to one side of a Zif268 binding site. We show that the specificity of targeting can be changed using Zif268 variants that bind to sequences from the HIV-1 promoter, and demonstrate a bacterial genetic screen that can be used to select for increased levels of targeted transposition. A TA dinucleotide flanked by two Zif268 binding sites was efficiently targeted by our transposase-Zif268 fusion, suggesting the possibility of designer ‘Z-transposases’ that could deliver transgenic cargoes to chosen genomic locations

    The influence of CpG and UpA dinucleotide frequencies on RNA virus replication and characterization of the innate cellular pathways underlying virus attenuation and enhanced replication

    Most RNA viruses infecting mammals and other vertebrates show profound suppression of CpG and UpA dinucleotide frequencies. To investigate this functionally, mutants of the picornavirus, echovirus 7 (E7), were constructed with altered CpG and UpA compositions in two 1.1–1.3 Kbase regions. Those with increased frequencies of CpG and UpA showed impaired replication kinetics and higher RNA/infectivity ratios compared with wild-type virus. Remarkably, mutants with CpGs and UpAs removed showed enhanced replication, larger plaques and rapidly outcompeted wild-type virus on co-infections. Luciferase-expressing E7 sub-genomic replicons with CpGs and UpAs removed from the reporter gene showed 100-fold greater luminescence. E7 and mutants were equivalently sensitive to exogenously added interferon-β and showed no evidence for differential recognition by ADAR1 or pattern recognition receptors RIG-I, MDA5 or PKR. However, kinase inhibitors roscovitine and C16 partially or entirely reversed the attenuated phenotype of high CpG and UpA mutants, potentially through inhibition of currently uncharacterized pattern recognition receptors that respond to RNA composition. Generating viruses with enhanced replication kinetics has applications in vaccine production and reporter gene construction. More fundamentally, the findings introduce a new evolutionary paradigm where dinucleotide composition of viral genomes is subjected to selection pressures independently of coding capacity and profoundly influences host–pathogen interactions
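
Suppression of a dinucleotide is commonly quantified as an observed/expected ratio, where the expectation is the product of the two mononucleotide frequencies. The sketch below uses that convention as an assumption; the abstract does not say how frequencies were computed, and the function name `dinucleotide_obs_exp` is hypothetical.

```python
def dinucleotide_obs_exp(seq, dinuc):
    """Observed/expected ratio of a dinucleotide in an RNA (or DNA) sequence.

    Assumption: expected frequency = f(X) * f(Y) from mononucleotide
    frequencies; T is treated as U so DNA-coded input also works.
    """
    seq = seq.upper().replace("T", "U")
    dinuc = dinuc.upper().replace("T", "U")
    n = len(seq)
    if n < 2:
        return 0.0
    # Observed frequency over the n - 1 overlapping dinucleotide positions.
    hits = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)
    observed = hits / (n - 1)
    expected = (seq.count(dinuc[0]) / n) * (seq.count(dinuc[1]) / n)
    return observed / expected if expected else 0.0


# Ratios below 1 indicate under-representation relative to base composition.
for name in ("CG", "UA"):
    print(name, dinucleotide_obs_exp("AUGGCUAACGUUCCAGAUCU", name))
```

A ratio well below 1 for CpG or UpA would correspond to the suppression described above; comparing ratios before and after synonymous recoding is one way to check how far a mutant design has shifted composition.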

    Genomic Selective Constraints in Murid Noncoding DNA

    Recent work has suggested that there are many more selectively constrained, functional noncoding than coding sites in mammalian genomes. However, little is known about how selective constraint varies amongst different classes of noncoding DNA. We estimated the magnitude of selective constraint on a large dataset of mouse-rat gene orthologs and their surrounding noncoding DNA. Our analysis indicates that there are more than three times as many selectively constrained, nonrepetitive sites within noncoding DNA as in coding DNA in murids. The majority of these constrained noncoding sites appear to be located within intergenic regions, at distances greater than 5 kilobases from known genes. Our study also shows that in murids, intron length and mean intronic selective constraint are negatively correlated with intron ordinal number. Our results therefore suggest that functional intronic sites tend to accumulate toward the 5' end of murid genes. Our analysis also reveals that mean number of selectively constrained noncoding sites varies substantially with the function of the adjacent gene. We find that, among others, developmental and neuronal genes are associated with the greatest numbers of putatively functional noncoding sites compared with genes involved in electron transport and a variety of metabolic processes. Combining our estimates of the total number of constrained coding and noncoding bases we calculate that over twice as many deleterious mutations have occurred in intergenic regions as in known genic sequence and that the total genomic deleterious point mutation rate is 0.91 per diploid genome, per generation. This estimated rate is over twice as large as a previous estimate in murids

    CpGcluster: a distance-based algorithm for CpG-island detection

    BACKGROUND: Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content. RESULTS: Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome. CONCLUSION: CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions
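
The distance-based idea in this abstract can be sketched in a few lines. This is an illustration of the general technique, not the published CpGcluster implementation: the threshold `max_dist`, the `min_cpgs` filter and both function names are hypothetical, distance is assumed to be the difference between CpG start positions, and the p-value step mentioned in the abstract is omitted.

```python
def cpg_positions(seq):
    """Start indices of every CG dinucleotide in the sequence."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cluster_cpgs(positions, max_dist, min_cpgs=2):
    """Group consecutive CpGs whose start positions are at most `max_dist` apart.

    Returns (start, end, n_cpgs) tuples with `end` exclusive, so each cluster
    begins and ends with a CpG. Only integer arithmetic is used.
    """
    clusters = []
    current = []
    for pos in positions:
        # A gap larger than the threshold closes the current cluster.
        if current and pos - current[-1] > max_dist:
            if len(current) >= min_cpgs:
                clusters.append((current[0], current[-1] + 2, len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_cpgs:
        clusters.append((current[0], current[-1] + 2, len(current)))
    return clusters


seq = "CGACG" + "T" * 10 + "CGCGCG"
print(cluster_cpgs(cpg_positions(seq), max_dist=4))
```

Because clusters are built from CpG positions alone, no length, G+C content or CpG-fraction thresholds enter the search, which mirrors the single-parameter design the abstract describes.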