42 research outputs found

    Minimizing off-target signals in RNA fluorescent in situ hybridization

    Get PDF
    Fluorescent in situ hybridization (FISH) techniques are becoming extremely sensitive, to the point where individual RNA or DNA molecules can be detected with small probes. At this level of sensitivity, the elimination of ‘off-target’ hybridization is of crucial importance, but typical probes used for RNA and DNA FISH contain sequences repeated elsewhere in the genome. We find that very short (e.g. 20 nt) perfect repeated sequences within much longer probes (e.g. 350–1500 nt) can produce significant off-target signals. The extent of noise is surprising given the long length of the probes and the short length of non-specific regions. When we removed the small regions of repeated sequence from either short or long probes, we find that the signal-to-noise ratio is increased by orders of magnitude, putting us in a regime where fluorescent signals can be considered to be a quantitative measure of target transcript numbers. As the majority of genes in complex organisms contain repeated k-mers, we provide genome-wide annotations of k-mer-uniqueness at http://cbio.mskcc.org/∼aarvey/repeatmap

    High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions

    Get PDF
    Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel -mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomics regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments. Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding

    ResBoost: characterizing and predicting catalytic residues in enzymes

    Get PDF
    Abstract Background Identifying the catalytic residues in enzymes can aid in understanding the molecular basis of an enzyme's function and has significant implications for designing new drugs, identifying genetic disorders, and engineering proteins with novel functions. Since experimentally determining catalytic sites is expensive, better computational methods for identifying catalytic residues are needed. Results We propose ResBoost, a new computational method to learn characteristics of catalytic residues. The method effectively selects and combines rules of thumb into a simple, easily interpretable logical expression that can be used for prediction. We formally define the rules of thumb that are often used to narrow the list of candidate residues, including residue evolutionary conservation, 3D clustering, solvent accessibility, and hydrophilicity. ResBoost builds on two methods from machine learning, the AdaBoost algorithm and Alternating Decision Trees, and provides precise control over the inherent trade-off between sensitivity and specificity. We evaluated ResBoost using cross-validation on a dataset of 100 enzymes from the hand-curated Catalytic Site Atlas (CSA). Conclusion ResBoost achieved 85% sensitivity for a 9.8% false positive rate and 73% sensitivity for a 5.7% false positive rate. ResBoost reduces the number of false positives by up to 56% compared to the use of evolutionary conservation scoring alone. We also illustrate the ability of ResBoost to identify recently validated catalytic residues not listed in the CSA

    Modulation of enhancer looping and differential gene targeting by Epstein-Barr virus transcription factors directs cellular reprogramming

    Get PDF
    Epstein-Barr virus (EBV) epigenetically reprogrammes B-lymphocytes to drive immortalization and facilitate viral persistence. Host-cell transcription is perturbed principally through the actions of EBV EBNA 2, 3A, 3B and 3C, with cellular genes deregulated by specific combinations of these EBNAs through unknown mechanisms. Comparing human genome binding by these viral transcription factors, we discovered that 25% of binding sites were shared by EBNA 2 and the EBNA 3s and were located predominantly in enhancers. Moreover, 80% of potential EBNA 3A, 3B or 3C target genes were also targeted by EBNA 2, implicating extensive interplay between EBNA 2 and 3 proteins in cellular reprogramming. Investigating shared enhancer sites neighbouring two new targets (WEE1 and CTBP2) we discovered that EBNA 3 proteins repress transcription by modulating enhancer-promoter loop formation to establish repressive chromatin hubs or prevent assembly of active hubs. Re-ChIP analysis revealed that EBNA 2 and 3 proteins do not bind simultaneously at shared sites but compete for binding thereby modulating enhancer-promoter interactions. At an EBNA 3-only intergenic enhancer site between ADAM28 and ADAMDEC1 EBNA 3C was also able to independently direct epigenetic repression of both genes through enhancer-promoter looping. Significantly, studying shared or unique EBNA 3 binding sites at WEE1, CTBP2, ITGAL (LFA-1 alpha chain), BCL2L11 (Bim) and the ADAMs, we also discovered that different sets of EBNA 3 proteins bind regulatory elements in a gene and cell-type specific manner. Binding profiles correlated with the effects of individual EBNA 3 proteins on the expression of these genes, providing a molecular basis for the targeting of different sets of cellular genes by the EBNA 3s. Our results therefore highlight the influence of the genomic and cellular context in determining the specificity of gene deregulation by EBV and provide a paradigm for host-cell reprogramming through modulation of enhancer-promoter interactions by viral transcription factors

    Interpreting the Epstein-Barr Virus (EBV) Epigenome Using High-Throughput Data

    Get PDF
    The Epstein-Barr virus (EBV) double-stranded DNA genome is subject to extensive epigenetic regulation. Large consortiums and individual labs have generated a vast number of genome-wide data sets on human lymphoblastoid and other cell lines latently infected with EBV. Analysis of these data sets reveals important new information on the properties of the host and viral chromosome structure organization and epigenetic modifications. We discuss the mapping of these data sets and the subsequent insights into the chromatin structure and transcription factor binding patterns on latent EBV genomes. Colocalization of multiple histone modifications and transcription factors at regulatory loci are considered in the context of the biology and regulation of EBV

    ResBoost: characterizing and predicting catalytic residues in enzymes

    No full text
    Abstract Background Identifying the catalytic residues in enzymes can aid in understanding the molecular basis of an enzyme's function and has significant implications for designing new drugs, identifying genetic disorders, and engineering proteins with novel functions. Since experimentally determining catalytic sites is expensive, better computational methods for identifying catalytic residues are needed. Results We propose ResBoost, a new computational method to learn characteristics of catalytic residues. The method effectively selects and combines rules of thumb into a simple, easily interpretable logical expression that can be used for prediction. We formally define the rules of thumb that are often used to narrow the list of candidate residues, including residue evolutionary conservation, 3D clustering, solvent accessibility, and hydrophilicity. ResBoost builds on two methods from machine learning, the AdaBoost algorithm and Alternating Decision Trees, and provides precise control over the inherent trade-off between sensitivity and specificity. We evaluated ResBoost using cross-validation on a dataset of 100 enzymes from the hand-curated Catalytic Site Atlas (CSA). Conclusion ResBoost achieved 85% sensitivity for a 9.8% false positive rate and 73% sensitivity for a 5.7% false positive rate. ResBoost reduces the number of false positives by up to 56% compared to the use of evolutionary conservation scoring alone. We also illustrate the ability of ResBoost to identify recently validated catalytic residues not listed in the CSA.</p
    corecore