12 research outputs found

    Estimating Human Point Mutation Rates from Codon Substitution Rates

    A codon substitution model that incorporates the effect of the GC contents, the gene density and the density of CpG islands of human chromosomes

    Background: Developing a model for codon substitutions is essential for the analysis of protein sequences. Recent studies on mutation rates in non-coding regions have shown that CpG mutation rates in the human genome are negatively correlated with the local GC content and with the densities of functional elements. This study aimed to understand the effect of genomic features, namely GC content, gene density, and frequency of CpG islands, on the rates of codon substitution in human chromosomes. Results: Codon substitution rates of CpG to TpG mutations, TpG to CpG mutations, and non-CpG transitions and transversions in humans were estimated by comparing the coding regions of thousands of human and chimpanzee genes and inferring their ancestral sequences by using macaque genes as the outgroup. Since the genomic features depend on each other, partial regression coefficients of these features were obtained. Conclusion: The substitution rates of codons depend on gene densities of the chromosomes. Transcription-associated mutation is one such pressure. On the basis of these results, a model of codon substitutions that incorporates the effect of genomic features on codon substitution in human chromosomes was developed.

    Context dependent substitution biases vary within the human genome

    Background: Models of sequence evolution typically assume that different nucleotide positions evolve independently. This assumption is widely appreciated to be an over-simplification. The best known violations involve biases due to adjacent nucleotides. There have also been suggestions that biases exist at larger scales; however, this possibility has not been systematically explored. Results: To address this, we have developed a method which identifies over- and under-represented substitution patterns and assesses their overall impact on the evolution of genome composition. Our method is designed to account for biases at smaller pattern sizes, removing their effects. We used this method to investigate context bias in the human lineage after the divergence from chimpanzee. We examined bias effects in substitution patterns between 2 and 5 bp long and found significant effects at all sizes. This included some individual three and four base pair patterns with relatively large biases. We also found that bias effects vary across the genome, differing between transposons and non-transposons, between different classes of transposons, and also near and far from genes. Conclusions: We found that nucleotides beyond the immediately adjacent one are responsible for substantial context effects, and that these biases vary across the genome.
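
    As a rough illustration of what tallying context-dependent substitutions can look like, the sketch below counts substitutions between an aligned ancestral and descendant sequence, keyed by the ancestral sequence context. It is not the authors' method: it does not test for over- or under-representation and does not correct for biases at smaller pattern sizes. The function name `context_substitution_counts` and the `flank` parameter are hypothetical names chosen for this example.

```python
from collections import Counter


def context_substitution_counts(ancestor: str, descendant: str, flank: int = 1) -> Counter:
    """Count substitutions keyed by (ancestral context, derived base).

    Illustrative assumption: both sequences are already aligned, and the
    context is the ancestral base plus `flank` bases on each side.
    """
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for i in range(flank, len(ancestor) - flank):
        if ancestor[i] == descendant[i]:
            continue  # no substitution at this site
        context = ancestor[i - flank:i + flank + 1]
        # Skip contexts containing gaps or ambiguous bases.
        if set(context) <= set("ACGT") and descendant[i] in "ACGT":
            counts[(context, descendant[i])] += 1
    return counts


# Two substitutions: C->T in context ACG and G->A in context CGA.
counts = context_substitution_counts("ACGTACGA", "ATGTACAA")
```

    Setting `flank=2` would give 5 bp contexts, the largest pattern size the abstract mentions; turning raw counts into biases would additionally require the number of opportunities for each context.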

    Coupling times with ambiguities for particle systems and applications to context-dependent DNA substitution models

    We define a notion of coupling time with ambiguities for interacting particle systems, and show how this can be used to prove ergodicity and to bound the convergence time to equilibrium and the decay of correlations at equilibrium. A motivation is to provide simple conditions which ensure that perturbed particle systems share some properties of the underlying unperturbed system. We apply these results to context-dependent substitution models recently introduced by molecular biologists as descriptions of DNA evolution processes. These models take into account the influence of the neighboring bases on the substitution probabilities at a site of the DNA sequence, as opposed to most usual substitution models which assume that sites evolve independently of each other.

    Vertebrate gene finding from multiple-species alignments using a two-level strategy

    BACKGROUND: One way in which the accuracy of gene structure prediction in vertebrate DNA sequences can be improved is by analyzing alignments with multiple related species, since functional regions of genes tend to be more conserved. RESULTS: We describe DOGFISH, a vertebrate gene finder consisting of a cleanly separated site classifier and structure predictor. The classifier scores potential splice sites and other features, using sequence alignments between multiple vertebrate species, while the structure predictor hypothesizes coding transcripts by combining these scores using a simple model of gene structure. This also identifies and assigns confidence scores to possible additional exons. Performance is assessed on the ENCODE regions. We predict transcripts and exons across the whole human genome, and identify over 10,000 high confidence new coding exons not in the Ensembl gene set. CONCLUSION: We present a practical multiple species gene prediction method. Accuracy improves as additional species, up to at least eight, are introduced. The novel predictions of the whole-genome scan should support efficient experimental verification.

    Guanine Holes Are Prominent Targets for Mutation in Cancer and Inherited Disease

    Authors: Albino Bacolla, Guliang Wang, Aklank Jain, Karen M. Vasquez (Division of Pharmacology and Toxicology, The University of Texas at Austin, Dell Pediatric Research Institute, Austin, Texas, United States of America); Albino Bacolla, Nuri A. Temiz, Ming Yi, Joseph Ivanic, Regina Z. Cer, Duncan E. Donohue, Uma S. Mudunuri, Natalia Volfovsky, Brian T. Luke, Robert M. Stephens, Jack R. Collins (Advanced Biomedical Computing Center, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America); Edward V. Ball, David N. Cooper (Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, United Kingdom).

    Single base substitutions constitute the most frequent type of human gene mutation and are a leading cause of cancer and inherited disease. These alterations occur non-randomly in DNA, being strongly influenced by the local nucleotide sequence context. However, the molecular mechanisms underlying such sequence context-dependent mutagenesis are not fully understood. Using bioinformatics, computational and molecular modeling analyses, we have determined the frequencies of mutation at G•C bp in the context of all 64 5′-NGNN-3′ motifs that contain the mutation at the second position. Twenty-four datasets were employed, comprising >530,000 somatic single base substitutions from 21 cancer genomes, >77,000 germline single-base substitutions causing or associated with human inherited disease and 16.7 million benign germline single-nucleotide variants. In several cancer types, the number of mutated motifs correlated both with the free energies of base stacking and the energies required for abstracting an electron from the target guanines (ionization potentials). Similar correlations were also evident for the pathological missense and nonsense germline mutations, but only when the target guanines were located on the non-transcribed DNA strand. 
Likewise, pathogenic splicing mutations predominantly affected positions in which a purine was located on the non-transcribed DNA strand. Novel candidate driver mutations and tissue-specific mutational patterns were also identified in the cancer datasets. We conclude that electron transfer reactions within the DNA molecule contribute to sequence context-dependent mutagenesis, involving both somatic driver and passenger mutations in cancer, as well as germline alterations causing or associated with inherited disease.

    This work was supported by grants from the NIH (CA097175 and CA093729) to KMV, NCI/NIH contract HHSN261200800001E to AB and the Frederick National Laboratory for Cancer Research, and CBIIT/caBIG ISRCE yellow task #09-260 to the Frederick National Laboratory for Cancer Research. DNC and EVB received financial support from BIOBASE GmbH through a license agreement (for HGMD) with Cardiff University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
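
    To make the "all 64 5′-NGNN-3′ motifs" concrete, the following minimal sketch enumerates them (three free positions, four bases each, so 4^3 = 64) and counts their occurrences along one strand of a sequence. The names `NGNN_MOTIFS` and `count_ngnn` are hypothetical, and the sketch is only an illustration of the motif definition, not the authors' pipeline: it scans a single strand, so mutations at the C of a G•C pair would need the reverse complement to be scanned as well.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

# Every 4-mer with a fixed G at the second position: N G N N.
NGNN_MOTIFS = [f"{first}G{third}{fourth}" for first, third, fourth in product(BASES, repeat=3)]


def count_ngnn(sequence: str) -> Counter:
    """Count overlapping occurrences of each NGNN motif on the given strand."""
    seq = sequence.upper()
    counts = Counter({motif: 0 for motif in NGNN_MOTIFS})
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        # Keep only unambiguous windows whose second base is G.
        if window[1] == "G" and set(window) <= set(BASES):
            counts[window] += 1
    return counts


motif_counts = count_ngnn("AGGGT")  # windows: AGGG and GGGT
```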

    Patterns of mutation in the human genome

    The processes that underlie point mutations in the human genome are largely unknown. However, the cumulative effect of these processes has a large impact on how mutation rates vary across a number of different scales and contexts, and consequently guides our understanding of human disease and evolution. Although variation in the mutation rate has been characterized on many different levels, the extent to which the rate of mutation can vary outside of the general patterns already observed is not fully understood. Beginning with the human genome project, many studies have produced large unbiased sequence datasets within a number of human populations. To that end, we analysed a number of sequence datasets in an attempt to better understand the patterns and causes of variation in the rate of mutation that exists across the genome. Firstly, we find that the mutation rates of single sites vary by more than is currently understood, and that this variation is not associated with any specific process or feature on either a local or genomic scale. Although we have been unable to uncover the source of such variation, understanding the range of mutability at sites in the human genome is important since it may point to functional regions and disease phenotypes, and prompt further ideas on the underlying mechanisms associated with such a result. Furthermore, we find evidence that a mutational process that can generate two new alleles simultaneously within the same individual, during a single mutation event or a tightly linked series of mutation events, increases the number of tri-allelic sites in the human genome. There are a number of potential mechanisms that may drive this process, and the consequences of such an event may be far reaching, as the generation of two new alleles at a single site in functional regions may allow a more rapid exploration of evolutionary space. 
Furthermore, this process appears to make a reasonable contribution to variation in the human genome, thus providing a substrate for evolutionary change. Finally, we observe significant variation in the mutation rate over all scales in cancer genomes. Part of this result can be explained by the actions of specific carcinogens; however, it is striking that patterns of mutation can be consistent across different cancer types, yet very different between individuals with the same type of cancer over different scales. This result points to the idea that the patterns of mutation may vary widely between different genomes under different conditions, and the identification of general patterns in a small number of samples may not fully describe the extent to which mutation rates can vary. Taken together, these conclusions suggest that the patterns and processes underlying mutation are highly complex, and require further analysis if they are to be fully understood.

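    Methylation analysis of the CpG islands of the APOE and TOMM40 genes in a sample of Colombian patients with Alzheimer's disease [Análisis de metilación de las islas CpG de los genes APOE y TOMM40 en una muestra de pacientes colombianos con Enfermedad de Alzheimer]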

    Abstract. INTRODUCTION: Alzheimer's disease (AD) is the most common form of dementia. A previous study in a Colombian population reported a significant association between the APOE ε4 allele, the G allele of TOMM40 rs2075650, and earlier onset of the disease. Although the APOE ε4 allele is the main genetic risk factor for AD, the presence of this allele by itself is not enough to cause the disease. Several studies have reported changes in DNA methylation in AD, both globally and at specific loci. AIM: To characterize methylation patterns of the CpG islands of the APOE and TOMM40 genes in peripheral blood of Colombian AD patients. METHODS: DNA was isolated from peripheral blood of 54 Colombian AD patients and 54 healthy controls and then treated with sodium bisulphite. Methylation levels of the APOE CpG island and the TOMM40 CpG island shore were measured with the MS-HRM and BSP methodologies. Data were analysed using a beta regression model. RESULTS: The two techniques gave discrepant results: MS-HRM showed a trend toward hypermethylation in AD patients relative to controls, whereas BSP showed a trend toward hypomethylation. BSP identified hypomethylation in AD patients at two CpGs, APOE CpG 148 and TOMM40 CpG 141. Regardless of diagnosis, APOE ε4 allele carriers have a lower methylation percentage at APOE CpG 162 and CpG 182, whereas carriers of the TOMM40 GG genotype (rs2075650) have a lower methylation percentage at APOE CpG 148 and a higher methylation percentage at TOMM40 CpG 130 and CpG 155. In addition, as age of onset increases, methylation at APOE CpG 148, CpG 162, and CpG 213 and at TOMM40 CpG 155 decreases. CONCLUSIONS: The BSP methodology provides more precise information about methylation status in the evaluated regions. The evaluated regions of the APOE CpG island and the TOMM40 CpG shore are hypermethylated in both AD patients and healthy controls. The significant differences found in DNA methylation in these regions, not only between patients and controls but also between carriers and non-carriers of certain alleles, indicate the epigenetic complexity of this region.

    Inferring strength of selection in vertebrate genomes

    Protein-coding sequences have long been assumed to evolve under selection, but the quantification of the process at the nucleotide sequence level only started when a simple null model, the neutral theory of molecular evolution, was formulated by Kimura. Several methods were developed, which were based on the assumption that synonymous sites (nucleotides at third codon positions which do not change the encoded amino acid) evolve close to neutrally, and could be used as local neutral standards. Most of our current knowledge on the direction and strength of selection still depends on this simple assumption. One method, the non-synonymous to synonymous substitution rate ratio (dN/dS), has gained prevalence and is still widely used, in spite of the growing body of evidence that synonymous sites evolve under selection. In this thesis, I quantify the strength of selection in different sequence compartments of mammalian genomes, in order to obtain estimates of their functional importance from comparative genomics analyses. I quantify the fraction of mutations that have been selectively eliminated since the divergence of the species pairs examined, the so-called genome-wide selective constraint. This in turn is used to approximate the genomic deleterious mutation rate, which is an important parameter for several evolutionary problems. As estimates of selection depend to a large extent on the chosen neutral standard, here I use orthologous transposable elements, so-called ancestral repeats, as these have been found to evolve in a largely neutral fashion and to contain the fewest constrained sites in mammalian genomes. This enables me to quantify the level of selection even at synonymous sites, and the results suggest that these sites indeed evolve under constraint, the consequences of which I discuss. 
The selective constraint estimates enable me to test some simple hypotheses, such as Ohta's nearly neutral theory of molecular evolution, which suggests that selection is more efficient in species with larger effective population sizes. Besides the choice of neutral standards, there are several additional factors which are known to affect the selective constraint estimates. Here I also test the consequences of one of these, namely the case where sequences are not at compositional equilibrium (i.e. their GC content is away from the equilibrium GC content), which predicts that sequences with different GC content should evolve at different rates. This can bias estimates of the level of selection or can even imitate selection in sequences which evolve completely neutrally. This effect is quantified here, and a simple correction is discussed.
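
    The "fraction of mutations that have been selectively eliminated" is often written as C = 1 - O/E, where O is the number of substitutions observed in the compartment of interest and E is the number expected if it evolved at the rate of the neutral standard. The sketch below assumes that formulation and a simple per-site rate taken from the neutral standard (ancestral repeats in the abstract); the thesis may use a different estimator, and the function and parameter names are hypothetical.

```python
def selective_constraint(observed_subs: int, sites: int,
                         neutral_subs: int, neutral_sites: int) -> float:
    """Estimate constraint as 1 - observed/expected substitutions.

    Assumption for illustration: the expected count is the number of sites
    times the per-site substitution rate of the neutral standard, with no
    correction for multiple hits or base composition.
    """
    if sites <= 0 or neutral_sites <= 0:
        raise ValueError("site counts must be positive")
    neutral_rate = neutral_subs / neutral_sites
    expected = sites * neutral_rate
    if expected == 0:
        raise ValueError("neutral standard shows no substitutions")
    return 1.0 - observed_subs / expected


# 25 substitutions seen where 100 would be expected under neutrality.
constraint = selective_constraint(observed_subs=25, sites=1000,
                                  neutral_subs=200, neutral_sites=2000)
```

    A value near 0 indicates evolution at about the neutral rate, while a value near 1 indicates that most mutations were removed. A compartment whose GC content is away from equilibrium could shift this estimate, which is the bias the abstract describes.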

    A nucleotide substitution model with nearest-neighbour interactions.

    MOTIVATION: It is well known that neighbouring nucleotides in DNA sequences do not mutate independently of each other. In this paper, we introduce a context-dependent substitution model and derive an algorithm to calculate the likelihood of sequences evolving under this model. We use this algorithm to estimate neighbour-dependent substitution rates, as well as rates for dinucleotide substitutions, using a Bayesian sampling procedure. The model is irreversible, giving an arrow to time, and allowing the position of the root between a pair of sequences to be inferred without using out-groups. RESULTS: We applied the model to aligned human-mouse non-coding data. Clear neighbour dependencies were observed, including 17-18-fold increased CpG to TpG/CpA rates compared with other substitutions. Root inference positioned the root halfway between the mouse and human tips, suggesting an approximately clock-like behaviour of the irreversible part of the substitution process.