13 research outputs found

    An expansive human regulatory lexicon encoded in transcription factor footprints.

    Regulatory factor binding to genomic DNA protects the underlying sequence from cleavage by DNase I, leaving nucleotide-resolution footprints. Using genomic DNase I footprinting across 41 diverse cell and tissue types, we detected 45 million transcription factor occupancy events within regulatory regions, representing differential binding to 8.4 million distinct short sequence elements. Here we show that this small genomic sequence compartment, roughly twice the size of the exome, encodes an expansive repertoire of conserved recognition sequences for DNA-binding proteins that nearly doubles the size of the human cis-regulatory lexicon. We find that genetic variants affecting allelic chromatin states are concentrated in footprints, and that these elements are preferentially sheltered from DNA methylation. High-resolution DNase I cleavage patterns mirror nucleotide-level evolutionary conservation and track the crystallographic topography of protein-DNA interfaces, indicating that transcription factor structure has been evolutionarily imprinted on the human genome sequence. We identify a stereotyped 50-base-pair footprint that precisely defines the site of transcript origination within thousands of human promoters. Finally, we describe a large collection of novel regulatory factor recognition motifs that are highly conserved in both sequence and function, and exhibit cell-selective occupancy patterns that closely parallel major regulators of development, differentiation and pluripotency.

    Widespread Site-Dependent Buffering of Human Regulatory Polymorphism

    The average individual is expected to harbor thousands of variants within non-coding genomic regions involved in gene regulation. However, it is currently not possible to interpret reliably the functional consequences of genetic variation within any given transcription factor recognition sequence. To address this, we comprehensively analyzed heritable genome-wide binding patterns of a major sequence-specific regulator (CTCF) in relation to genetic variability in binding site sequences across a multi-generational pedigree. We localized and quantified CTCF occupancy by ChIP-seq in 12 related and unrelated individuals spanning three generations, followed by comprehensive targeted resequencing of the entire CTCF–binding landscape across all individuals. We identified hundreds of variants with reproducible quantitative effects on CTCF occupancy (both positive and negative). While these effects paralleled protein–DNA recognition energetics when averaged, they were extensively buffered by striking local context dependencies. In the significant majority of cases buffering was complete, resulting in silent variants spanning every position within the DNA recognition interface irrespective of level of binding energy or evolutionary constraint. The prevalence of complex partial or complete buffering effects severely constrained the ability to predict reliably the impact of variation within any given binding site instance. Surprisingly, 40% of variants that increased CTCF occupancy occurred at positions of human–chimp divergence, challenging the expectation that the vast majority of functional regulatory variants should be deleterious. Our results suggest that, even in the presence of “perfect” genetic information afforded by resequencing and parallel studies in multiple related individuals, genomic site-specific prediction of the consequences of individual variation in regulatory DNA will require systematic coupling with empirical functional genomic measurements.

    The accessible chromatin landscape of the human genome

    DNaseI hypersensitive sites (DHSs) are markers of regulatory DNA and have underpinned the discovery of all classes of cis-regulatory elements including enhancers, promoters, insulators, silencers, and locus control regions. Here we present the first extensive map of human DHSs identified through genome-wide profiling in 125 diverse cell and tissue types. We identify ~2.9 million DHSs that encompass virtually all known experimentally-validated cis-regulatory sequences and expose a vast trove of novel elements, most with highly cell-selective regulation. Annotating these elements using ENCODE data reveals novel relationships between chromatin accessibility, transcription, DNA methylation, and regulatory factor occupancy patterns. We connect ~580,000 distal DHSs with their target promoters, revealing systematic pairing of different classes of distal DHSs and specific promoter types. Patterning of chromatin accessibility at many regulatory regions is choreographed with dozens to hundreds of co-activated elements, and the trans-cellular DNaseI sensitivity pattern at a given region can predict cell type-specific functional behaviors. The DHS landscape shows signatures of recent functional evolutionary constraint. However, the DHS compartment in pluripotent and immortalized cells exhibits higher mutation rates than that in highly differentiated cells, exposing an unexpected link between chromatin accessibility, proliferative potential and patterns of human variation.

    Functional SNPs recapitulate the CTCF binding motif.

    (A) 4,428 SNPs identified by resequencing at as many sites. Y-axis indicates the number of SNPs identified at a given position (x-axis) relative to the aligned and strand-oriented CTCF motif (below). Bar color indicates alleles of SNPs. Gray shading indicates the 44-bp extent of protein-DNA interaction. Note that SNPs are uniformly distributed throughout the entire window, except for a slight reduction in diversity corresponding to the high-information content positions of the motif. (B) Of the SNPs in (A), 218 are significantly associated with ChIP-seq occupancy (FDR 1%). Color indicates SNPs for which the higher-occupancy allele (according to association analysis) also had a higher log-odds score in the known motif. Gray indicates SNPs that affected occupancy, but the higher-occupancy allele had a lower score in the motif. See Figure S2C (http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002599#pgen.1002599.s002) for full color. Note that these SNPs are concentrated in the region of protein-DNA contact, and 84% match the allele predicted by the canonical motif (above).

    Sequence context buffers effect of polymorphism on occupancy.

    (A) Average effect of SNPs on occupancy across 1,368 different sites, broken down by genotypes (panels) and position (x-axis) relative to the canonical motif (top). Y-axis, proportion of sites where a change is associated with differences in occupancy (FDR 1%). In comparison, 1% of changes observed outside this 44 bp region affected binding (Table S4, http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002599#pgen.1002599.s011). Only changes observed at at least 3 sites are considered; in particular, few A–T transversions were observed due to the GC-rich nature of the motif. (B) SNPs at the weakest and strongest sites are less likely to affect occupancy. X-axis, decile of ChIP-seq signal for the heterozygote genotype according to the regression model; each decile represents 583 sites. Y-axis, proportion of sites at which SNPs are associated with differential occupancy. (C) SNPs affecting occupancy despite stronger motif contexts involve more severe perturbations. X-axis, log-odds score of motif match, with stronger matches at the right; each label represents the lower limit of its bin. Y-axis, magnitude of perturbation, represented by the difference in log-odds scores between the two alleles. Error bars indicate standard deviation. In contrast, SNPs not affecting occupancy show no such trend (Table S9, http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002599#pgen.1002599.s016). (D) Each cell measures the mutual information between the base pair at positions in the core motif (x-axis) and whether a SNP at another position in the motif (y-axis) affects occupancy (FDR 5%). (E) Sequence context at sites with SNPs (arrows) at position 1 (above) and 6 (below), divided by whether the SNP affected occupancy. Red stars highlight significant sequence differences (q<0.05, see Materials and Methods, http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002599#s3) between buffered and unbuffered sites at positions with elevated mutual information along the x-axis in (D).

    Power of existing measures to predict the effect of regulatory polymorphism.

    ROC curve evaluating the power of two measures on the 1,368 SNPs in this study found within the region of protein-DNA contact, 186 of which significantly affect occupancy. Dotted blue line represents predictions by ranking SNPs in decreasing order of inferred purifying selection (phyloP per-nucleotide conservation score) at the location of the SNP. Solid red line represents predictions by ranking SNPs based on the difference in log-odds scores between alleles. Area under the curve (AUC) summarizes overall predictive power. Gray line indicates a random predictor and has an AUC of 0.5. A perfect predictor would be plotted as a right angle, ranking all functional SNPs ahead of all nonfunctional SNPs, and would have an AUC of 1.0. While per-nucleotide conservation performs little better than chance, consideration of binding energetics substantially improves performance.
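
    The AUC described in this caption equals the probability that a randomly chosen functional SNP is ranked ahead of a randomly chosen nonfunctional one. The following is a minimal sketch of that rank-based calculation, not the authors' code; the function name `roc_auc` and the toy scores are hypothetical and chosen only for illustration.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: fraction of (functional, nonfunctional) pairs in which
    the functional SNP has the higher score; tied pairs count as half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one functional and one nonfunctional SNP")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: a score per SNP (for example, the absolute difference
# in log-odds between alleles) and a flag for whether the SNP affected occupancy.
toy_scores = [2.1, 0.3, 1.5, 0.2, 0.9]
toy_labels = [1, 0, 0, 0, 1]
print(roc_auc(toy_scores, toy_labels))
```

    Under this definition a predictor that assigns every SNP the same score yields 0.5, and one that ranks all functional SNPs first yields 1.0, matching the two reference cases in the caption. The pairwise loop is quadratic, which is adequate for roughly a thousand SNPs.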

    Genome-wide survey of the effect of genetic variation.

    (A) Filtering strategy for testable CTCF binding sites. A number of binding sites were excluded from the analysis due to microarray probe design constraints, poor mappability, differing mappability between two alleles, or insufficient resequencing coverage. (B) Summary of the prevalence of SNPs that affect CTCF occupancy at an FDR of 1%. Some sites overlapping SNPs were excluded for having insufficient data points per genotype to perform a robust regression. The model explained a substantial amount of the variance at significant sites (median r² of 0.61).

    Systematic identification of the effect of genetic variation on transcription factor occupancy.

    (A) We performed ChIP-seq for the transcription factor CTCF followed by targeted resequencing of its complete occupancy landscape in 12 members of CEPH pedigree 1459 (CEU). (B) Three qualitative levels of occupancy correspond to three genotypes of a SNP located at the binding site, with G/G homozygotes having the highest occupancy (region shown: chr1:151,853,500–151,859,700 [hg18]). (C) The SNP shown in (B) disrupts a critical position in the CTCF consensus sequence (note that G better matches the consensus recognition sequence). (D) Regression of ChIP-seq signal on genotype at the site in (B) quantifies the effect of SNPs on occupancy. We applied this strategy genome-wide to identify sites where SNPs are associated with differences in occupancy. At this site, Akaike information criterion favored a dominant effect model (GT and GG coded identically) over an additive model.
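
    The comparison in panel (D) between an additive coding (genotype scored by its count of G alleles) and a dominant coding (GT and GG scored identically) can be illustrated with ordinary least squares and a Gaussian AIC. The sketch below is an assumption-laden illustration, not the authors' pipeline: the helper names, the T allele in the genotype strings, and the signal values are all hypothetical, and the caption does not state which regression variant or AIC formula was used.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns the RSS too."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, rss


def gaussian_aic(rss, n, n_params):
    """AIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * n_params


def compare_codings(genotypes, signal, high_allele="G"):
    """Fit the additive and dominant codings and return the AIC of each."""
    additive = [g.count(high_allele) for g in genotypes]   # 0, 1 or 2 copies
    dominant = [1 if c > 0 else 0 for c in additive]       # GT and GG coded alike
    n = len(signal)
    aics = {}
    for name, coding in (("additive", additive), ("dominant", dominant)):
        _, _, rss = fit_line(coding, signal)
        # Three parameters in both models: intercept, slope, residual variance.
        aics[name] = gaussian_aic(rss, n, 3)
    return aics


# Hypothetical data for 12 individuals; values are invented for illustration.
GENOTYPES = ["TT"] * 4 + ["GT"] * 4 + ["GG"] * 4
SIGNAL = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1, 3.1, 3.0, 3.2, 2.9]

print(compare_codings(GENOTYPES, SIGNAL))
```

    Because both codings use the same number of parameters, the lower AIC here simply reflects the smaller residual sum of squares; in the toy data the heterozygotes resemble the G/G homozygotes, so the dominant coding is preferred.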

    Genetic Variation at the O-Antigen Biosynthetic Locus in Pseudomonas aeruginosa

    The outer carbohydrate layer, or O antigen, of Pseudomonas aeruginosa varies markedly in different isolates of these bacteria, and at least 20 distinct O-antigen serotypes have been described. Previous studies have indicated that the major enzymes responsible for O-antigen synthesis are encoded in a cluster of genes that occupy a common genetic locus. We used targeted yeast recombinational cloning to isolate this locus from the 20 internationally recognized serotype strains. DNA sequencing of these isolated segments revealed that at least 11 highly divergent gene clusters occupy this region. Homology searches of the encoded protein products indicated that these gene clusters are likely to direct O-antigen biosynthesis. The O15 serotype strains lack functional gene clusters in the region analyzed, suggesting that O-antigen biosynthesis genes for this serotype are harbored in a different portion of the genome. The overall pattern underscores the plasticity of the P. aeruginosa genome, in which a specific site in a well-conserved genomic region can be occupied by any of numerous islands of functionally related DNA with diverse sequences.