A Feature-Based Approach to Modeling Protein–DNA Interactions
Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also develop a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/
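The contrast between a PSSM and a log-linear feature model can be sketched in a few lines. This is a toy illustration only, not the authors' FMM software: the matrix values, the feature set, the weights, and the function names (`pssm_log_prob`, `feature_score`, `fmm_log_prob`) are all hypothetical, and the normalisation is done by brute force over every possible site, which is only practical for very short sites.

```python
import math
from itertools import product

# Hypothetical toy PSSM for a 3-bp site: one probability table per position.
# Positions are scored independently of one another.
PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

# Hypothetical log-linear features: ({position: base}, weight).
# The last feature spans two positions, which a PSSM cannot express.
FEATURES = [
    ({0: "A"}, 1.2),
    ({1: "G"}, 1.0),
    ({0: "A", 2: "C"}, 0.8),
]


def pssm_log_prob(site):
    """Log-probability of a site under the independent-position PSSM."""
    return sum(math.log(column[base]) for column, base in zip(PSSM, site))


def feature_score(site):
    """Sum of the weights of all features that match the site."""
    return sum(
        weight
        for pattern, weight in FEATURES
        if all(site[pos] == base for pos, base in pattern.items())
    )


def fmm_log_prob(site):
    """Log-probability under the log-linear model, normalised by brute force."""
    all_sites = ("".join(s) for s in product("ACGT", repeat=len(site)))
    log_z = math.log(sum(math.exp(feature_score(s)) for s in all_sites))
    return feature_score(site) - log_z
```

In this sketch the site `AGC` matches all three features, so it gains the extra weight of the two-position feature on top of the single-position ones, a dependency that no choice of PSSM columns can encode.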
Systematic Dissection of the Sequence Determinants of Gene 3′ End Mediated Expression Control
<div><p>The 3′ end genomic region encodes a wide range of regulatory processes, including mRNA stability, 3′ end processing and translation. Here, we systematically investigate the sequence determinants of 3′ end mediated expression control by measuring the effect of 13,000 designed 3′ end sequence variants on constitutive expression levels in yeast. By including a high-resolution scanning mutagenesis of more than 200 native 3′ end sequences in this designed set, we found that most mutations had only a mild effect on expression, and that the vast majority (~90%) of strongly affecting mutations localized to a single positive TA-rich element, similar to a previously described 3′ end processing efficiency element, and resulted in up to a ten-fold decrease in expression. Measurements of 3′ UTR lengths revealed that these mutations result in mRNAs with aberrantly long 3′ UTRs, confirming the role of this element in 3′ end processing. Interestingly, we found that other sequence elements previously described in the literature as part of the polyadenylation signal had a minor effect on expression. We further characterize the sequence specificities of the TA-rich element using additional synthetic 3′ end sequences and show that its activity is sensitive to single base pair mutations and strongly depends on the A/T content of the surrounding sequences. Finally, using a computational model, we show that the strength of this element in native 3′ end sequences can explain some of their measured expression variability (R = 0.41). Together, our results emphasize the importance of efficient 3′ end processing for endogenous protein levels and contribute to an improved understanding of the sequence elements involved in this process.</p></div>
Sequence determinants of 3′ end functional elements.
<p><b>(A)</b> Heat map showing the mean effect of a mutation as a function of location in the 3′ end sequence. Each row represents one sequence and the color represents the mean expression fold change across two replicates between the mutated and wild type sequences. Rows are sorted by the location of the maximal affecting mutation. <b>(B)</b> Heat map of predicted logistic values on a held-out test set (see main text and methods). Locations of subsequences correspond to those in Fig 3A. <b>(C)</b> Frequency of the AT dinucleotide, the highest weighted feature in the inferred model, in sliding windows of 20 bp. Locations of subsequences correspond to those in Fig 3A. <b>(D)</b> Table of the features that contribute most to the classification. Color represents the mean coefficient across the 10 cross-validation partitions. For each possible mono/di-nucleotide, three types of features were considered: "[0|1]", a binary feature that is one if the specified mono/di-nucleotide occurs at least once in the sequence and zero otherwise; "#", a counter of the number of times the specified mono/di-nucleotide occurs in the sequence; and "%", the percent of nucleotides of the sequence that are part of an occurrence of the specified mono/di-nucleotide. <b>(E)</b> DNA sequence motif found to be enriched in the positive subsequence instances. <b>(F)</b> Distribution of distances between the location (center) of the mutation that resulted in the maximal reduction in expression and the location of the main polyadenylation site for the wild type sequence. <b>(G)</b> Results of YFP-specific 3′ RACE, where each lane represents 4 expression bins. The lowest lane displays long aberrant 3′ UTRs not apparent in the higher expression bins.</p>
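The three feature types in panel D and the sliding-window frequency in panel C can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, occurrences are counted with overlaps allowed, and the window frequency is normalised by the number of possible start positions in each window. The caption does not specify any of these choices.

```python
def kmer_features(seq, kmer):
    """Binary, count and percent-coverage features for one mono/di-nucleotide.

    Assumption: overlapping occurrences are each counted.
    """
    k = len(kmer)
    starts = [i for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]
    # Positions of the sequence that lie inside at least one occurrence.
    covered = set()
    for start in starts:
        covered.update(range(start, start + k))
    return {
        "binary": 1 if starts else 0,
        "count": len(starts),
        "percent": 100.0 * len(covered) / len(seq) if seq else 0.0,
    }


def sliding_frequency(seq, kmer, window=20):
    """Frequency of `kmer` in every full window of length `window`.

    Assumption: frequency = occurrences / possible start positions per window.
    """
    k = len(kmer)
    slots = window - k + 1
    freqs = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        hits = sum(1 for i in range(slots) if chunk[i:i + k] == kmer)
        freqs.append(hits / slots)
    return freqs
```

For example, `kmer_features("ATATGC", "AT")` reports the dinucleotide as present, counts two occurrences, and finds that four of the six nucleotides fall inside an occurrence.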
Scanning mutagenesis of native 3′ end sequences reveals critical elements required to maintain expression.
<p><b>(A)</b> Illustration of the two scanning mutagenesis strategies used. In the upper panel, two 10 bp mutation windows were designed with non-overlapping 10 bp steps. In the lower panel, 9 bp mutation windows were designed with overlapping 3 bp steps. <b>(B)</b> Profile of the effect of mutations as a function of location for two genes: CDC24 and YTA5. The y-axis shows the expression log<sub>2</sub> fold change compared to the wild type sequence, with each point representing a single 10 bp mutation window centered around the corresponding x-axis value relative to the stop codon. The gray line connects the average of each pair of mutations. <b>(C)</b> Distribution of the log<sub>2</sub> fold ratio between mutated and wild type 3′ end sequences, showing a distribution highly skewed towards negative values. <b>(D)</b> Distribution of absolute expression values (a.u.) for non-mutated native 3′ end sequences (dark red) and mutated 3′ end sequences (gray). For the mutated sequences, the mutation that resulted in the largest reduction in expression was chosen for each native sequence.</p>
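The two window layouts in panel A and the log2 fold change plotted in panels B and C can be sketched as below. The function names and the zero-based, half-open coordinate convention are assumptions made for illustration; the caption gives only the window widths and step sizes.

```python
import math


def mutation_windows(seq_len, width, step):
    """Half-open (start, end) windows of `width` bp placed every `step` bp.

    width=10, step=10 gives non-overlapping windows;
    width=9, step=3 gives overlapping windows.
    """
    return [(s, s + width) for s in range(0, seq_len - width + 1, step)]


def log2_fold_change(mutant_expr, wild_type_expr):
    """log2 ratio of mutant to wild type expression (negative = reduced)."""
    return math.log2(mutant_expr / wild_type_expr)


def strongest_mutation(wild_type_expr, mutant_exprs):
    """Index and log2 fold change of the mutation with the largest reduction."""
    changes = [log2_fold_change(m, wild_type_expr) for m in mutant_exprs]
    idx = min(range(len(changes)), key=changes.__getitem__)
    return idx, changes[idx]
```

On a 30 bp sequence, `mutation_windows(30, 10, 10)` yields three adjacent windows, while `mutation_windows(15, 9, 3)` yields three windows that each share 6 bp with their neighbor.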
Illustration of our method and overall expression distribution.
<p><b>(A)</b> 13,000 designed synthetic sequences were ligated into a low copy plasmid (top part). The plasmid pool was then transformed into yeast to create a heterogeneous pool of yeast cells, each expressing YFP at a different level corresponding to one of the 13,000 unique cloned 3′ end sequences. The cells were then sorted using fluorescence-activated cell sorting (FACS) into 16 expression bins by the YFP/mCherry ratio (middle). Next, the reporter 3′ end sequences of cells in each bin were amplified, using barcoded primers for each bin, and sequence barcodes were recovered using next-generation sequencing (NGS). Finally, each sequencing read was mapped to a specific 3′ end sequence and a specific bin (bottom) to obtain the distribution of cells with each synthetic 3′ end sequence across the expression bins. The distribution of each construct was fit to a gamma distribution and the mean expression value was inferred based on this fit. <b>(B)</b> The distribution of library expression values in induced and un-induced promoter states. The induced state displays a tri-modal distribution with 3 peaks corresponding to (1) the non-induced promoter state, (2) the induced promoter state with low expressing 3′ end sequences and (3) the induced promoter state with a wide range of 3′ end mediated expression.</p>
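The last step of panel A, inferring a mean expression value from a gamma fit to each construct's distribution over bins, might look roughly like the sketch below. The caption does not say how the fit was performed, so this uses a method-of-moments estimate as a stand-in; the function name, the use of one representative expression value per bin, and the example numbers are all hypothetical.

```python
def gamma_moments_fit(bin_values, read_counts):
    """Method-of-moments gamma fit to a binned read distribution.

    bin_values:  one representative expression value per bin (assumed known).
    read_counts: reads mapped to this construct in each bin.
    Returns (shape, scale, mean), where mean = shape * scale.
    """
    total = sum(read_counts)
    if total == 0:
        raise ValueError("construct has no reads in any bin")
    mean = sum(c * x for x, c in zip(bin_values, read_counts)) / total
    var = sum(c * (x - mean) ** 2 for x, c in zip(bin_values, read_counts)) / total
    if var == 0:
        raise ValueError("all reads fall in one bin; gamma fit is undefined")
    shape = mean ** 2 / var   # gamma shape parameter k
    scale = var / mean        # gamma scale parameter theta
    return shape, scale, shape * scale
```

With a moments fit the inferred mean equals the weighted mean of the bin values. A maximum-likelihood fit that accounts for bin boundaries would be a more faithful choice if the bin edges are available.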
Systematic mutagenesis of a designed synthetic terminator.
<p><b>(A)</b> Illustration of the construct design: a minimal terminator sequence was embedded within a mutated non-terminating 3′ end sequence from the CYC1-512 3′ end region. <b>(B)</b> All possible single bp mutations in the three elements EE, PE and cleavage on the left, middle and right panels, respectively. Boxes on the left of each panel show the mutated sequences, with a highlighted white letter representing the location and exact mutation relative to the wild type sequence shown on the top. Bars show the expression value of each sequence. <b>(C)</b> Expression as a function of context A/T content. Each point represents a mutated sequence, with A/T content of the relevant sequence region on the x-axis and expression on the y-axis. Black points show the expression of the non-mutated sequence with different barcodes. Mutated regions are: (1) upstream of EE, (2) between EE and PE, (3) between PE and cleavage and (4) downstream of cleavage, corresponding to the panels from left to right.</p>
Prediction of polyadenylation signals in native sequences.
<p><b>(A)</b> Native sequences are aligned by the main polyadenylation site and ordered by the expression values (right panel). The color indicates the predicted logistic values using the classifier learned on the scanning mutagenesis set. The lower panel shows the mean predicted logistic in a 20 bp sliding window (centered) relative to the polyadenylation site. <b>(B)</b> Mean predicted logistic in a 20 bp window, centered around the peak from Fig 4A on the y-axis versus expression levels on the x-axis. The red line shows a smoothing line with a 50-instance window.</p>
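Both the centered 20 bp sliding-window mean and the 50-instance smoothing line are running averages, which can be sketched as below. The function names are hypothetical, and two details are assumptions the caption does not settle: windows are truncated at the ends of the data, and for an even window the extra element is taken from the left side.

```python
def centered_moving_average(values, window):
    """Running mean of `values` over a window centered on each element.

    Assumption: windows are truncated at the edges rather than padded.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        start = i - window // 2
        end = start + window
        lo, hi = max(0, start), min(n, end)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def smooth_by_x(xs, ys, window):
    """Sort points by x, then smooth y with a running mean over `window` points."""
    pairs = sorted(zip(xs, ys))
    sorted_xs = [x for x, _ in pairs]
    sorted_ys = [y for _, y in pairs]
    return sorted_xs, centered_moving_average(sorted_ys, window)
```

Under these assumptions, `centered_moving_average(per_position_scores, 20)` would give a profile like the lower part of panel A, and `smooth_by_x(expression, window_scores, 50)` a line like the one in panel B, where both argument names are placeholders for the corresponding data.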