112 research outputs found

    DeFCoM: analysis and modeling of transcription factor binding sites using a motif-centric genomic footprinter

    Identifying the locations of transcription factor binding sites is critical for understanding how gene transcription is regulated across different cell types and conditions. Chromatin accessibility experiments such as DNaseI sequencing (DNase-seq) and Assay for Transposase Accessible Chromatin sequencing (ATAC-seq) produce genome-wide data that include distinct "footprint" patterns at binding sites. Nearly all existing computational methods to detect footprints from these data assume that footprint signals are highly homogeneous across footprint sites. Additionally, a comprehensive and systematic comparison of footprinting methods for identifying which motif sites for a specific factor are bound has not been performed. Using DNase-seq data from the ENCODE project, we show that a large degree of previously uncharacterized site-to-site variability exists in footprint signal across motif sites for a transcription factor. To model this heterogeneity in the data, we introduce a novel, supervised learning footprinter called DeFCoM (Detecting Footprints Containing Motifs). We compare DeFCoM to nine existing methods using evaluation sets from four human cell lines and eighteen transcription factors and show that DeFCoM outperforms current methods in determining bound and unbound motif sites. We also analyze the impact of several biological and technical factors on the quality of footprint predictions to highlight important considerations when conducting footprint analyses and assessing the performance of footprint prediction methods. Lastly, we show that DeFCoM can detect footprints using ATAC-seq data with accuracy similar to that obtained using DNase-seq data. Python code is available at https://bitbucket.org/bryancquach/defcom. Supplementary information is available at Bioinformatics online.

    ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions

    Chromatin immunoprecipitation experiments followed by sequencing (ChIP-seq) detect protein-DNA binding events and chemical modifications of histone proteins. Challenges in the standard ChIP-seq protocol have motivated recent enhancements in this approach, such as reducing the number of cells required and increasing the resolution. Complementary experimental approaches, for example DNaseI hypersensitive site mapping and analysis of chromatin interactions mediated by particular proteins, provide additional information about DNA-binding proteins and their function. These data are now being used to identify variability in the functions of DNA-binding proteins across genomes and individuals. In this Review, I describe the latest advances in methods to detect and functionally characterize DNA-bound proteins.

    Identifying and Characterizing Regulatory Sequences in the Human Genome with Chromatin Accessibility Assays

    After finishing a human genome reference sequence in 2002, the genomics community has turned to the task of interpreting it. A primary focus is to identify and characterize not only protein-coding genes, but all functional elements in the genome. The effort includes both individual investigators and large-scale projects like the Encyclopedia of DNA Elements (ENCODE) project. As part of the ENCODE project, several groups have identified millions of regulatory elements in hundreds of human cell types using DNase-seq and FAIRE-seq experiments that detect regions of nucleosome-free open chromatin. ChIP-seq experiments have also been used to discover transcription factor binding sites and map histone modifications. Nearly all identified elements are found in non-coding DNA, suggesting a function for previously unannotated sequence. In this review, we provide an overview of the ENCODE effort to define regulatory elements, summarize the main results, and discuss implications of the millions of regulatory elements distributed throughout the genome.

    GSAASeqSP: A Toolset for Gene Set Association Analysis of RNA-Seq Data

    RNA-Seq is quickly becoming the preferred method for comprehensively characterizing whole transcriptome activity, and the analysis of count data from RNA-Seq requires new computational tools. We developed GSAASeqSP, a novel toolset for genome-wide gene set association analysis of sequence count data. This toolset offers a variety of statistical procedures via combinations of multiple gene-level and gene set-level statistics, each having its own strengths under different sample and experimental conditions. These methods can be employed independently, or results generated from multiple or all methods can be integrated to determine more robust profiles of significantly altered biological pathways. Using simulations, we demonstrate the ability of these methods to identify association signals and to measure the strength of the association. We show that GSAASeqSP analyses of RNA-Seq data from diverse tissue samples provide meaningful insights into the biological mechanisms that differentiate these samples. GSAASeqSP is a powerful platform for investigating molecular underpinnings of complex traits and diseases arising from differential activity within the biological pathways. GSAASeqSP is available at http://gsaa.unc.edu

    Evidence of Influence of Genomic DNA Sequence on Human X Chromosome Inactivation

    A significant number of human X-linked genes escape X chromosome inactivation and are thus expressed from both the active and inactive X chromosomes. The basis for escape from inactivation and the potential role of the X chromosome primary DNA sequence in determining a gene's X inactivation status is unclear. Using a combination of the X chromosome sequence and a comprehensive X inactivation profile of more than 600 genes, two independent yet complementary approaches were used to systematically investigate the relationship between X inactivation and DNA sequence features. First, statistical analyses revealed that a number of repeat features, including long interspersed nuclear element (LINE) and mammalian-wide interspersed repeat repetitive elements, are significantly enriched in regions surrounding transcription start sites of genes that are subject to inactivation, while Alu repetitive elements and short motifs containing ACG/CGT are significantly enriched in those that escape inactivation. Second, linear support vector machine classifiers constructed using primary DNA sequence features were used to correctly predict the X inactivation status for >80% of all X-linked genes. We further identified a small set of features that are important for accurate classification, among which LINE-1 and LINE-2 content show the greatest individual discriminatory power. Finally, as few as 12 features can be used for accurate support vector machine classification. Taken together, these results suggest that features of the underlying primary DNA sequence of the human X chromosome may influence the spreading and/or maintenance of X inactivation

    Genome-wide sequence and functional analysis of early replicating DNA in normal human fibroblasts

    BACKGROUND: The replication of mammalian genomic DNA during the S phase is a highly coordinated process that occurs in a programmed manner. Recent studies have begun to elucidate the pattern of replication timing on a genomic scale. Using a combination of experimental and computational techniques, we identified a genome-wide set of the earliest replicating sequences. This was accomplished by first creating a cosmid library containing DNA enriched in sequences that replicate early in the S phase of normal human fibroblasts. Clone ends were then sequenced and aligned to the human genome. RESULTS: By clustering adjacent or overlapping early replicating clones, we identified 1759 "islands" averaging 100 kb in length, allowing us to perform the most detailed analysis to date of DNA characteristics and genes contained within early replicating DNA. Islands are enriched in open chromatin, transcription related elements, and Alu repetitive elements, with an underrepresentation of LINE elements. In addition, we found a paucity of LTR retroposons, DNA transposon sequences, and an enrichment in all classes of tandem repeats, except for dinucleotides. CONCLUSION: An analysis of genes associated with islands revealed that nearly half of all genes in the WNT family, and a number of genes in the base excision repair pathway, including four of ten DNA glycosylases, were associated with island sequences. Also, we found an overrepresentation of members of apoptosis-associated genes in very early replicating sequences from both fibroblast and lymphoblastoid cells. These data suggest that there is a temporal pattern of replication for some functionally related genes

    A Predictive Framework for Integrating Disparate Genomic Data Types Using Sample-Specific Gene Set Enrichment Analysis and Multi-Task Learning

    Understanding the root molecular and genetic causes driving complex traits is a fundamental challenge in genomics and genetics. Numerous studies have used variation in gene expression to understand complex traits, but the underlying genomic variation that contributes to these expression changes is not well understood. In this study, we developed a framework to integrate gene expression and genotype data to identify biological differences between samples from opposing complex trait classes that are driven by expression changes and genotypic variation. This framework utilizes pathway analysis and multi-task learning to build a predictive model and discover pathways relevant to the complex trait of interest. We simulated expression and genotype data to test the predictive ability of our framework and to measure how well it uncovered pathways with genes both differentially expressed and genetically associated with a complex trait. We found that the predictive performance of the multi-task model was comparable to other similar methods. Also, methods like multi-task learning that considered enrichment analysis scores from both data sets found pathways with both genetic and expression differences related to the phenotype. We used our framework to analyze differences between estrogen receptor (ER) positive and negative breast cancer samples. An analysis of the top 15 gene sets from the multi-task model showed they were all related to estrogen, steroids, cell signaling, or the cell cycle. Although our study suggests that multi-task learning does not enhance predictive accuracy, the models generated by our framework do provide valuable biological pathway knowledge for complex traits

    A Genome-Wide Analysis of Open Chromatin in Human Epididymis Epithelial Cells Reveals Candidate Regulatory Elements for Genes Coordinating Epididymal Function

    The epithelium lining the epididymis has a pivotal role in ensuring a luminal environment that can support normal sperm maturation. Many of the individual genes that encode proteins involved in establishing the epididymal luminal fluid are well characterized. They include ion channels, ion exchangers, transporters, and solute carriers. However, the molecular mechanisms that coordinate expression of these genes and modulate their activities in response to biological stimuli are less well understood. To identify cis-regulatory elements for genes expressed in human epididymis epithelial cells, we generated genome-wide maps of open chromatin by DNase-seq. This analysis identified 33 542 epididymis-selective DNase I hypersensitive sites (DHS), which were not evident in five cell types of different lineages. Identification of genes with epididymis-selective DHS at their promoters revealed gene pathways that are active in immature epididymis epithelial cells. These include processes correlating with epithelial function and also others with specific roles in the epididymis, including retinol metabolism and ascorbate and aldarate metabolism. Peaks of epididymis-selective chromatin were seen in the androgen receptor gene and the cystic fibrosis transmembrane conductance regulator (CFTR) gene, which has a critical role in regulating ion transport across the epididymis epithelium. In silico prediction of transcription factor binding sites that were overrepresented in epididymis-selective DHS identified epithelial transcription factors, including ELF5 and ELF3, the androgen receptor, Pax2, and Sox9, as components of epididymis transcriptional networks. Active genes, which are targets of each transcription factor, reveal important biological processes in the epididymis epithelium

    Molecular classification of Crohn's disease reveals two clinically relevant subtypes

    The clinical presentation and course of Crohn’s disease (CD) are highly variable. We sought to better understand the cellular and molecular mechanisms that guide this heterogeneity and to characterize the cellular processes associated with disease phenotypes.