8 research outputs found

    Genome-Wide Identification of Human Functional DNA Using a Neutral Indel Model

    Get PDF
    It has become clear that a large proportion of functional DNA in the human genome does not code for protein. Identification of this non-coding functional sequence using comparative approaches is proving difficult and has previously been thought to require deep sequencing of multiple vertebrates. Here we introduce a new model and comparative method that, instead of nucleotide substitutions, uses the evolutionary imprint of insertions and deletions (indels) to infer the past consequences of selection. The model predicts the distribution of indels under neutrality, and shows an excellent fit to human–mouse ancestral repeat data. Across the genome, many unusually long ungapped regions are detected that are unaccounted for by the neutral model, and which we predict to be highly enriched in functional DNA that has been subject to purifying selection with respect to indels. We use the model to determine the proportion under indel-purifying selection to be between 2.56% and 3.25% of human euchromatin. Since annotated protein-coding genes comprise only 1.2% of euchromatin, these results lend further weight to the proposition that more than half the functional complement of the human genome is non-protein-coding. The method is surprisingly powerful at identifying selected sequence using only two or three mammalian genomes. Applying the method to the human, mouse, and dog genomes, we identify 90 Mb of human sequence under indel-purifying selection, at a predicted 10% false-discovery rate and 75% sensitivity. As expected, most of the identified sequence represents unannotated material, while the recovered proportions of known protein-coding and microRNA genes closely match the predicted sensitivity of the method. The method's high sensitivity to functional sequence such as microRNAs suggest that as yet unannotated microRNA genes are enriched among the sequences identified. Futhermore, its independence of substitutions allowed us to identify sequence that has been subject to heterogeneous selection, that is, sequence subject to both positive selection with respect to substitutions and purifying selection with respect to indels. The ability to identify elements under heterogeneous selection enables, for the first time, the genome-wide investigation of positive selection on functional elements other than protein-coding genes

    Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees

    Get PDF
    Genomic regions participating in recombination events may support distinct topologies, and phylogenetic analyses should incorporate this heterogeneity. Existing phylogenetic methods for recombination detection are challenged by the enormous number of possible topologies, even for a moderate number of taxa. If, however, the detection analysis is conducted independently between each putative recombinant sequence and a set of reference parentals, potential recombinations between the recombinants are neglected. In this context, a recombination hotspot can be inferred in phylogenetic analyses if we observe several consecutive breakpoints. We developed a distance measure between unrooted topologies that closely resembles the number of recombinations. By introducing a prior distribution on these recombination distances, a Bayesian hierarchical model was devised to detect phylogenetic inconsistencies occurring due to recombinations. This model relaxes the assumption of known parental sequences, still common in HIV analysis, allowing the entire dataset to be analyzed at once. On simulated datasets with up to 16 taxa, our method correctly detected recombination breakpoints and the number of recombination events for each breakpoint. The procedure is robust to rate and transition∶transversion heterogeneities for simulations with and without recombination. This recombination distance is related to recombination hotspots. Applying this procedure to a genomic HIV-1 dataset, we found evidence for hotspots and de novo recombination

    Genomic Distribution of Intergap Distances

    No full text
    <p>Histogram of intergap distance counts (log<sub>10</sub> scale) in human–mouse alignments, (A) within the whole genome and (B) within ARs. Blue lines indicate predictions of the neutral model (central line, geometric distribution; the slope is related to the per-site indel probability <i>ρ</i>), and expected sampling errors (outer curves; 95% confidence intervals for a binomial distribution per length bin). Insets show a blow-up of the deviation from the model (log<sub>10</sub> scale). Parameters were obtained by linear regression to the log-counts, weighted by the expected binomial sampling error. The indel distribution on AR data shows an excellent model fit, in particular in the range 20–80 bp, with 92% of counts (56/61) lying within 95% confidence limits. The whole-genome histogram shows a similarly tight fit in the range 20–50 bp, and a large excess of long intergap distances over neutral model predictions (green) beyond ~50 bp. The intercept of the geometric prediction occurs at a length <i>L</i> = 300. This implies that less than one ungapped sequence of any length <i>L</i> > 300 is expected genome-wide under the neutral model; however the model does predict a small but nonzero probability for any such sequence, even under neutrality.</p

    Empirical Distribution of Sequence PID to Mouse

    No full text
    <p>Shown is the PID distribution for human segments under indel-purifying selection (at a 1% FDR; blue), and a background distribution obtained on putatively neutrally evolving segments (non-exonic, and not in identified set of segments at 10% FDR; grey). The blue distribution can be decomposed as a mixture of 6% background (shaded) and a remainder (red), suggesting that a proportion of ungapped elements (≈ 5%, mixture coefficient minus FDR of 1%) are under purifying selection with respect to indels, while evolving under relaxed constraints or positive selection with respect to substitutions (see <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0020005#s4" target="_blank">Materials and Methods</a>).</p

    Extent of Indel-Purifying Selection in the Human Genome by G+C Content

    No full text
    <p>Vertical axis shows <i>σ</i> (fraction of nucleotides in ungapped segments that are overrepresented with respect to predictions of the neutral indel model) in human–mouse alignments, for the whole genome (blue), whole genome without exons (green, Ensembl exons including UTRs; shaded green, GenScan exons), both relative to 1,002-Mb mouse-aligning bases, and overrepresentation relative to 177 Mb of ARs (red). In all cases, overrepresentation on the X chromosome was measured separately; values shown are for all chromosomes combined. The measured overrepresentation of long ungapped segments is mainly due to indel-purifying selection, and in part to neutral indel rate variation and other causes (see the section Accounting for Indel Rate Variation). The exclusion of annotated exons, which tend to reside in G+C–rich regions of the genome, all but removed the peak at the highest G+C quantiles, indicating that non-genic functional material tends to accumulate at intermediate G+C levels.</p
    corecore