814 research outputs found

    Genome-Wide Identification of Human Functional DNA Using a Neutral Indel Model

    Get PDF
    It has become clear that a large proportion of functional DNA in the human genome does not code for protein. Identification of this non-coding functional sequence using comparative approaches is proving difficult and has previously been thought to require deep sequencing of multiple vertebrates. Here we introduce a new model and comparative method that, instead of nucleotide substitutions, uses the evolutionary imprint of insertions and deletions (indels) to infer the past consequences of selection. The model predicts the distribution of indels under neutrality, and shows an excellent fit to humanā€“mouse ancestral repeat data. Across the genome, many unusually long ungapped regions are detected that are unaccounted for by the neutral model, and which we predict to be highly enriched in functional DNA that has been subject to purifying selection with respect to indels. We use the model to determine the proportion under indel-purifying selection to be between 2.56% and 3.25% of human euchromatin. Since annotated protein-coding genes comprise only 1.2% of euchromatin, these results lend further weight to the proposition that more than half the functional complement of the human genome is non-protein-coding. The method is surprisingly powerful at identifying selected sequence using only two or three mammalian genomes. Applying the method to the human, mouse, and dog genomes, we identify 90 Mb of human sequence under indel-purifying selection, at a predicted 10% false-discovery rate and 75% sensitivity. As expected, most of the identified sequence represents unannotated material, while the recovered proportions of known protein-coding and microRNA genes closely match the predicted sensitivity of the method. The method's high sensitivity to functional sequence such as microRNAs suggest that as yet unannotated microRNA genes are enriched among the sequences identified. Futhermore, its independence of substitutions allowed us to identify sequence that has been subject to heterogeneous selection, that is, sequence subject to both positive selection with respect to substitutions and purifying selection with respect to indels. The ability to identify elements under heterogeneous selection enables, for the first time, the genome-wide investigation of positive selection on functional elements other than protein-coding genes

    Adaptive Evolution of Conserved Noncoding Elements in Mammals

    Get PDF
    Conserved noncoding elements (CNCs) are an abundant feature of vertebrate genomes. Some CNCs have been shown to act as cis-regulatory modules, but the function of most CNCs remains unclear. To study the evolution of CNCs, we have developed a statistical method called the ā€œshared rates testā€ to identify CNCs that show significant variation in substitution rates across branches of a phylogenetic tree. We report an application of this method to alignments of 98,910 CNCs from the human, chimpanzee, dog, mouse, and rat genomes. We find that āˆ¼68% of CNCs evolve according to a null model where, for each CNC, a single parameter models the level of constraint acting throughout the phylogeny linking these five species. The remaining āˆ¼32% of CNCs show departures from the basic model including speed-ups and slow-downs on particular branches and occasionally multiple rate changes on different branches. We find that a subset of the significant CNCs have evolved significantly faster than the local neutral rate on a particular branch, providing strong evidence for adaptive evolution in these CNCs. The distribution of these signals on the phylogeny suggests that adaptive evolution of CNCs occurs in occasional short bursts of evolution. Our analyses suggest a large set of promising targets for future functional studies of adaptation

    A Phylogenomic Study of Human, Dog, and Mouse

    Get PDF
    In recent years the phylogenetic relationship of mammalian orders has been addressed in a number of molecular studies. These analyses have frequently yielded inconsistent results with respect to some basal ordinal relationships. For example, the relative placement of primates, rodents, and carnivores has differed in various studies. Here, we attempt to resolve this phylogenetic problem by using data from completely sequenced nuclear genomes to base the analyses on the largest possible amount of data. To minimize the risk of reconstruction artifacts, the trees were reconstructed under different criteriaā€”distance, parsimony, and likelihood. For the distance trees, distance metrics that measure independent phenomena (amino acid replacement, synonymous substitution, and gene reordering) were used, as it is highly improbable that all of the trees would be affected the same way by any reconstruction artifact. In contradiction to the currently favored classification, our results based on full-genome analysis of the phylogenetic relationship between human, dog, and mouse yielded overwhelming support for a primateā€“carnivore clade with the exclusion of rodents

    Transition-Transversion Bias Is Not Universal: A Counter Example from Grasshopper Pseudogenes

    Get PDF
    Comparisons of the DNA sequences of metazoa show an excess of transitional over transversional substitutions. Part of this bias is due to the relatively high rate of mutation of methylated cytosines to thymine. Postmutation processes also introduce a bias, particularly selection for codon-usage bias in coding regions. It is generally assumed, however, that there is a universal bias in favour of transitions over transversions, possibly as a result of the underlying chemistry of mutation. Surprisingly, this underlying trend has been evaluated only in two types of metazoan, namely Drosophila and the Mammalia. Here, we investigate a third group, and find no such bias. We characterize the point substitution spectrum in Podisma pedestris, a grasshopper species with a very large genome. The accumulation of mutations was surveyed in two pseudogene families, nuclear mitochondrial and ribosomal DNA sequences. The cytosine-guanine (CpG) dinucleotides exhibit the high transition frequencies expected of methylated sites. The transition rate at other cytosine residues is significantly lower. After accounting for this methylation effect, there is no significant difference between transition and transversion rates. These results contrast with reports from other taxa and lead us to reject the hypothesis of a universal transition/transversion bias. Instead we suggest fundamental interspecific differences in point substitution processes

    Bias of Selection on Human Copy-Number Variants

    Get PDF
    Although large-scale copy-number variation is an important contributor to conspecific genomic diversity, whether these variants frequently contribute to human phenotype differences remains unknown. If they have few functional consequences, then copy-number variants (CNVs) might be expected both to be distributed uniformly throughout the human genome and to encode genes that are characteristic of the genome as a whole. We find that human CNVs are significantly overrepresented close to telomeres and centromeres and in simple tandem repeat sequences. Additionally, human CNVs were observed to be unusually enriched in those protein-coding genes that have experienced significantly elevated synonymous and nonsynonymous nucleotide substitution rates, estimated between single human and mouse orthologues. CNV genes encode disproportionately large numbers of secreted, olfactory, and immunity proteins, although they contain fewer than expected genes associated with Mendelian disease. Despite mouse CNVs also exhibiting a significant elevation in synonymous substitution rates, in most other respects they do not differ significantly from the genomic background. Nevertheless, they encode proteins that are depleted in olfactory function, and they exhibit significantly decreased amino acid sequence divergence. Natural selection appears to have acted discriminately among human CNV genes. The significant overabundance, within human CNVs, of genes associated with olfaction, immunity, protein secretion, and elevated coding sequence divergence, indicates that a subset may have been retained in the human population due to the adaptive benefit of increased gene dosage. By contrast, the functional characteristics of mouse CNVs either suggest that advantageous gene copies have been depleted during recent selective breeding of laboratory mouse strains or suggest that they were preferentially fixed as a consequence of the larger effective population size of wild mice. It thus appears that CNV differences among mouse strains do not provide an appropriate model for large-scale sequence variations in the human population

    A Macaque's-Eye View of Human Insertions and Deletions: Differences in Mechanisms

    Get PDF
    Insertions and deletions (indels) cause numerous genetic diseases and lead to pronounced evolutionary differences among genomes. The macaque sequences provide an opportunity to gain insights into the mechanisms generating these mutations on a genome-wide scale by establishing the polarity of indels occurring in the human lineage since its divergence from the chimpanzee. Here we apply novel regression techniques and multiscale analyses to demonstrate an extensive regional indel rate variation stemming from local fluctuations in divergence, GC content, male and female recombination rates, proximity to telomeres, and other genomic factors. We find that both replication and, surprisingly, recombination are significantly associated with the occurrence of small indels. Intriguingly, the relative inputs of replication versus recombination differ between insertions and deletions, thus the two types of mutations are likely guided in part by distinct mechanisms. Namely, insertions are more strongly associated with factors linked to recombination, while deletions are mostly associated with replication-related features. Indel as a term misleadingly groups the two types of mutations together by their effect on a sequence alignment. However, here we establish that the correct identification of a small gap as an insertion or a deletion (by use of an outgroup) is crucial to determining its mechanism of origin. In addition to providing novel insights into insertion and deletion mutagenesis, these results will assist in gap penalty modeling and eventually lead to more reliable genomic alignments

    SSRD: Simple Sequence Repeats Database of the Human Genome

    Get PDF
    Simple sequence repeats are predominantly found in most organisms. They play a major role in studies of genetic diversity, and are useful as diagnostic markers for many diseases. The simple sequence repeats database (SSRD) for the human genome was created for easy access to such repeats, for analysis, and to be used to understand their biological significance. The data includes the abundance and distribution of SSRs in the coding and non-coding regions of the genome, as well as their association with the UTRs of genes. The exact locations of repeats with respect to genomic regions (such as UTRs, exons, introns or intergenic regions) and their association with STS markers are also highlighted. The resource will facilitate repeat sequence analysis in the human genome and the understanding of the functional and evolutionary significance of simple sequence repeats. SSRD is available through two websites, http://www.ccmb.res.in/ssr and http://www.ingenovis.com/ssr

    Intronic Alternative Splicing Regulators Identified by Comparative Genomics in Nematodes

    Get PDF
    Many alternative splicing events are regulated by pentameric and hexameric intronic sequences that serve as binding sites for splicing regulatory factors. We hypothesized that intronic elements that regulate alternative splicing are under selective pressure for evolutionary conservation. Using a Wobble Aware Bulk Aligner genomic alignment of Caenorhabditis elegans and Caenorhabditis briggsae, we identified 147 alternatively spliced cassette exons that exhibit short regions of high nucleotide conservation in the introns flanking the alternative exon. In vivo experiments on the alternatively spliced let-2 gene confirm that these conserved regions can be important for alternative splicing regulation. Conserved intronic element sequences were collected into a dataset and the occurrence of each pentamer and hexamer motif was counted. We compared the frequency of pentamers and hexamers in the conserved intronic elements to a dataset of all C. elegans intron sequences in order to identify short intronic motifs that are more likely to be associated with alternative splicing. High-scoring motifs were examined for upstream or downstream preferences in introns surrounding alternative exons. Many of the high- scoring nematode pentamer and hexamer motifs correspond to known mammalian splicing regulatory sequences, such as (T)GCATG, indicating that the mechanism of alternative splicing regulation is well conserved in metazoans. A comparison of the analysis of the conserved intronic elements, and analysis of the entire introns flanking these same exons, reveals that focusing on intronic conservation can increase the sensitivity of detecting putative splicing regulatory motifs. This approach also identified novel sequences whose role in splicing is under investigation and has allowed us to take a step forward in defining a catalog of splicing regulatory elements for an organism. In vivo experiments confirm that one novel high-scoring sequence from our analysis, (T)CTATC, is important for alternative splicing regulation of the unc-52 gene

    Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The joint analysis of several categorical variables is a common task in many areas of biology, and is becoming central to systems biology investigations whose goal is to identify potentially complex interaction among variables belonging to a network. Interactions of arbitrary complexity are traditionally modeled in statistics by log-linear models. It is challenging to extend these to the high dimensional and potentially sparse data arising in computational biology. An important example, which provides the motivation for this article, is the analysis of so-called full-length cDNA libraries of alternatively spliced genes, where we investigate relationships among the presence of various exons in transcript species.</p> <p>Results</p> <p>We develop methods to perform model selection and parameter estimation in log-linear models for the analysis of sparse contingency tables, to study the interaction of two or more factors. Maximum Likelihood estimation of log-linear model coefficients might not be appropriate because of the presence of zeros in the table's cells, and new methods are required. We propose a computationally efficient ā„“<sub>1</sub>-penalization approach extending the Lasso algorithm to this context, and compare it to other procedures in a simulation study. We then illustrate these algorithms on contingency tables arising from full-length cDNA libraries.</p> <p>Conclusion</p> <p>We propose regularization methods that can be used successfully to detect complex interaction patterns among categorical variables in a broad range of biological problems involving categorical variables.</p
    • ā€¦
    corecore