18 research outputs found

    VDA, a Method of Choosing a Better Algorithm with Fewer Validations

    The multitude of bioinformatics algorithms designed for performing a particular computational task presents end-users with the problem of selecting the most appropriate computational tool for analyzing their biological data. The choice of the best available method is often based on expensive experimental validation of the results. We propose an approach to design validation sets for method comparison and performance assessment that are effective in terms of cost and discrimination power.

    An Integrated Pipeline for the Genome-Wide Analysis of Transcription Factor Binding Sites from ChIP-Seq

    ChIP-Seq has become the standard method for genome-wide profiling of the DNA association of transcription factors. To simplify analyzing and interpreting ChIP-Seq data, which typically involves using multiple applications, we describe an integrated, open source, R-based analysis pipeline. The pipeline addresses data input, peak detection, sequence and motif analysis, visualization, and data export, and can readily be extended via other R and Bioconductor packages. Using a standard multicore computer, it can be used with datasets consisting of tens of thousands of enriched regions. We demonstrate its effectiveness on published human ChIP-Seq datasets for FOXA1, ER, CTCF and STAT1, where it detected co-occurring motifs that were consistent with the literature but not detected by other methods. Our pipeline provides the first complete set of Bioconductor tools for sequence and motif analysis of ChIP-Seq and ChIP-chip data.

    Genome-Wide Bovine H3K27me3 Modifications and the Regulatory Effects on Genes Expressions in Peripheral Blood Lymphocytes

    Gene expression in lymphocytes is influenced by histone methylation in mammals, and trimethylation of lysine 27 on histone H3 (H3K27me3) normally represses gene expression. Peripheral blood lymphocytes are the main source of somatic cells in the milk of dairy cows, which vary frequently in response to infection or injury of the mammary gland and with the number of parities. The genome-wide status of H3K27me3 modifications in blood lymphocytes of lactating Holsteins was profiled using a ChIP-Seq approach. Combined with the digital gene expression (DGE) technique, the regulatory effects of H3K27me3 on gene expression were analyzed. The ChIP-seq results showed that H3K27me3 peaks in cow lymphocytes were mainly enriched in the up20K (~50%), down20K (~30%) and intron (~28%) regions of genes. Only ~3% of peaks were enriched in exon regions. Moreover, the highest H3K27me3 modification levels were found around 2 Kb upstream of the transcriptional start sites (TSS) of genes. Using conjoint analysis with DGE data, we found that H3K27me3 marks tended to repress target gene expression throughout whole gene regions, acting especially on the promoter region. A total of 53 differentially expressed genes were detected in third parity cows compared to first parity cows, and the expression of the 25 down-regulated genes (PSEN2 etc.) was negatively correlated with H3K27me3 levels in the up2Kb to up1Kb region of the genes, whereas the up-regulated genes did not show this relationship. The first blueprint of the bovine H3K27me3 marks that mediate gene silencing was generated. H3K27me3 exerts its repressive role mainly in regulatory regions in bovine lymphocytes. The up2Kb to up1Kb region of the genes down-regulated in third parity cows could be a potential target of H3K27me3 regulation. Further studies are warranted to understand the regulatory mechanisms of H3K27me3 in somatic cell count increases and milk losses in later parities of cows.

    Rapid innovation in ChIP-seq peak-calling algorithms is outdistancing benchmarking efforts

    The current understanding of the regulation of transcription has not kept pace with the spectacular advances in the determination of genomic sequences. Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) promises to give better insight into transcription regulation by locating sites of protein-DNA interactions. Such loci of putative interactions can be inferred from the genome-wide distributions of ChIP-seq data by peak-calling software. The analysis of ChIP-seq data critically depends on this step, and a multitude of these peak-callers have been deployed in recent years. A recent study reported severe variation among peak-calling results. Yet, peak-calling still lacks systematic quantitative benchmarking. Here, we summarize benchmarking efforts and explain potential drawbacks of each benchmarking method.

    Development of Computational Techniques for Regulatory DNA Motif Identification Based on Big Biological Data

    Accurate regulatory DNA motif (or motif) identification plays a fundamental role in the elucidation of transcriptional regulatory mechanisms in a cell and can strongly support regulatory network construction for both prokaryotic and eukaryotic organisms. Next-generation sequencing techniques generate a huge amount of biological data for motif identification. Specifically, Chromatin Immunoprecipitation followed by high throughput DNA sequencing (ChIP-seq) enables researchers to identify motifs on a genome scale. Recently, technological improvements have allowed DNA structural information to be obtained in a high-throughput manner, which can provide four DNA shape features. DNA shape has been found to complement genomic sequence in predicting transcription factor (TF)-DNA binding specificity with traditional machine learning models. Recent studies have demonstrated that deep learning (DL), especially the convolutional neural network (CNN), enables identification of motifs directly from DNA sequence. Although numerous algorithms and tools have been proposed and developed in this field, (1) the lack of intuitive and integrative web servers impedes effective use of emerging algorithms and tools; (2) DNA shape has not been integrated with DL; and (3) existing DL models still suffer from high false positive and false negative rates in motif identification. This thesis focuses on developing an integrated web server for motif identification based on DNA sequences either from users or from built-in databases. This web server allows further motif-related analysis and Cytoscape-like network interpretation and visualization. We then proposed a DL framework for both sequence and shape motif identification from ChIP-seq data using a binomial distribution strategy. This framework can accept different combinations of DNA sequence and DNA shape as input. Finally, we developed a gated convolutional neural network (GCNN) for capturing motif dependencies among long DNA sequences. Results show that our web server provides more comprehensive motif analysis functionality than existing web servers. The DL framework can identify motifs using an optimized threshold and reveals the strong predictive power of DNA shape for TF-DNA binding specificity. The identified sequence and shape motifs can contribute to the interpretation of TF-DNA binding mechanisms. Additionally, GCNN improves TF-DNA binding specificity prediction over CNN on most of the datasets.
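The abstract above does not say which gating mechanism its GCNN uses. As a rough illustration of the general idea only, the sketch below applies a gated linear unit to a one-hot encoded DNA sequence: one convolution produces values, a second produces gates, and the output is their element-wise product after a sigmoid. The function names, kernel shapes, and the choice of gate are assumptions made for this example and are not taken from the thesis.

```python
import math

BASES = "ACGT"  # channel order assumed for the one-hot encoding


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(x, kernel, bias=0.0):
    """'Valid' 1D cross-correlation of a (length x 4) input with a (width x 4) kernel."""
    width = len(kernel)
    out = []
    for start in range(len(x) - width + 1):
        total = bias
        for j in range(width):
            for c in range(4):
                total += x[start + j][c] * kernel[j][c]
        out.append(total)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_conv1d(x, value_kernel, gate_kernel):
    """Gated linear unit (assumed gate): conv(x, value_kernel) * sigmoid(conv(x, gate_kernel))."""
    values = conv1d(x, value_kernel)
    gates = conv1d(x, gate_kernel)
    return [v * sigmoid(g) for v, g in zip(values, gates)]
```

In a trained network both kernels would be learned. The gate lets the model suppress or pass a motif match depending on its sequence context.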

    Computational methods for studying epigenomic regulation

    In the nucleus, DNA is tightly wrapped around proteins in a structure called chromatin in order to protect it from degradation. Chromatin is composed of nucleosomes which are a structure of eight histones around which the DNA is wrapped. Nucleosomes can be modified by enzymes on amino acids located on their N-terminal tails. These modifications allow the chromatin to open and close in targeted regions, providing control over gene expression. At present, chromatin immuno-precipitation (ChIP) and assay of transposase-accessible chromatin (ATAC) combined with high-throughput sequencing (ChIP-seq and ATAC-seq) are the major high-throughput methods allowing the study of histone modifications and genome-wide chromatin openness, respectively. Typically, ChIP-seq targets one histone at a time by enriching the histone-bound regions of the genome using immuno-precipitation, while ATAC-seq uses a transposase enzyme to cut the open chromatin into fragments of DNA. The DNA fragments obtained from both techniques can be sequenced and aligned against a reference genome. Once the location of the fragments is determined, the genome is scanned for significant enrichment in a process called peak calling. Differential analysis is then used to compare local enrichment-level variations between different biological conditions. Combining ChIP-seq and ATAC-seq data with other information, such as RNA-seq-derived transcriptomics data, can further help to build a comprehensive picture of the complex underlying biology. This work therefore focuses on the development of computational tools to help with the analysis of epigenomics research data. In this thesis, a robust workflow for the differential analysis of ChIP-seq and ATAC-seq data is developed and evaluated against existing tools using one synthetic dataset, two biological ChIP-seq datasets and two biological ATAC-seq datasets. RNA-seq data is then further correlated with the detected peaks. An efficient replicate-driven visualisation tool is also proposed to visualise coverage of DNA fragments on the genome, which is compared to two existing tools, highlighting its efficiency. Lastly, two studies are presented showcasing the usefulness of the differential analysis approaches in extracting knowledge in a real-life biological setting.
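The peak-calling step described in this abstract (scanning the genome for significant enrichment of aligned fragments) can be illustrated with a minimal sketch. It is not the workflow developed in the thesis. It assumes fixed-size windows, a uniform Poisson background estimated from the mean window count, and no multiple-testing correction. The names `call_peaks`, `window`, and `alpha` are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), via the complement of the CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(read_starts, genome_length, window=200, alpha=1e-3):
    """Flag windows whose read count is unlikely under a uniform Poisson background.

    Returns a list of (start, end, count, p_value) tuples, one per enriched window.
    """
    n_windows = math.ceil(genome_length / window)
    counts = [0] * n_windows
    n_reads = 0
    for pos in read_starts:
        if 0 <= pos < genome_length:  # ignore reads outside the reference
            counts[pos // window] += 1
            n_reads += 1
    background = n_reads / n_windows  # expected reads per window
    peaks = []
    for i, count in enumerate(counts):
        p_value = poisson_sf(count, background)
        if p_value < alpha:
            end = min((i + 1) * window, genome_length)
            peaks.append((i * window, end, count, p_value))
    return peaks
```

Production peak callers typically add local background estimates, control samples, and multiple-testing correction. The sketch only shows the core enrichment test.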

    Identification of dTip60 Binding Partners by Bioinformatic Analysis of ChIP-Seq Data

    Tip60 is a histone acetyltransferase that has recently been shown to play a significant role in various neuronal functions of Drosophila, including synaptic plasticity and axonogenesis, as well as the pathophysiology of Alzheimer’s disease (AD) in the fly brain. However, the mechanisms by which Tip60 affects these processes remain poorly understood. Because the Tip60 protein structure lacks a DNA-binding motif, it is hypothesized that Tip60 is recruited to particular neuronal genes by one or more putative DNA-binding proteins, and consequently acetylates nearby histone tails to facilitate transcription of those genes. In order to identify these potential binding partners of Tip60, an in-depth bioinformatic analysis was performed using data from a Drosophila Tip60 ChIP-Seq experiment, which provides a genome-wide occupancy profile for Tip60. Tip60-target regions were analyzed for the presence of genes, DNA-binding motifs and other structural features, in order to shed light on the mechanism of Tip60-dependent regulation of neuronal gene expression in the fly. Results show that Tip60-target genes are enriched for neuronal functions, and several candidate transcription factors, identified by various methods, represent possible binding partners for Tip60 in a neuronal functional context. This project will allow for a better understanding of the cellular mechanisms of epigenetic regulation as a whole, as well as the mechanisms of action of various cognitive and neurodegenerative diseases such as AD. M.S., Biomedical Engineering -- Drexel University, 201

    INFORMATION INTEGRATION APPROACHES FOR INVESTIGATING ESTROGEN RECEPTOR MEDIATED TRANSCRIPTION

    Estrogen plays essential roles in normal physiology and in disease. Its effects are mainly mediated through two intracellular estrogen receptors, ERα and ERβ, which belong to a family of nuclear receptors (NRs) functioning as transcription regulators. In the first part of this thesis, we aim to derive a holistic view of the transcription machineries at estrogen-responsive genes and further, to reveal different mechanisms of estrogen-mediated transcription regulation. In order to achieve this, we integrated and systematically dissected a variety of genome-wide high-throughput datasets, including gene expression arrays, ChIP-seq, GRO-seq, and ChIA-PET. Our analyses have led to the following novel findings: In the absence of the ligand, most of the estrogen-responsive genes assumed a high-order chromatin configuration that involved Pol II, ERα and ERα-pioneer factors. Without the ligand, estrogen-induced genes showed active transcription at promoters but failed to elongate into gene bodies, and such a pause was lifted after estrogen treatment. However, the estrogen-repressed genes showed coordinated transcription at promoters and gene bodies in the absence and presence of estrogen. Through information integration, we inferred that, for estrogen-repressed genes, the majority of the high-order chromatin complexes containing actively transcribed genes were disrupted after estrogen treatment. The analyses led to the hypothesis that one mechanism for estrogen-mediated repression is through disrupting the original transcription-favoring chromatin structures. Further, nuclear receptors such as ERs interact with co-regulators to regulate gene transcription. Understanding the mechanism of action of co-regulator proteins, which do not bind DNA directly but exert their effects by binding to transcription factors, is important for the study of normal physiology as well as diseased conditions. However, because such protein-DNA interactions are detected only indirectly, ChIP-seq signals from co-regulators can be relatively weak and thus biologically meaningful interactions remain difficult to identify. In the second part of this thesis, we investigated and compared different machine learning approaches to integrate multiple types of genomic and transcriptomic information derived from our experiments and from public databases. This helped us to overcome the difficulty of identifying functional DNA binding sites of the co-regulator SRC-1 in the context of estrogen response. Our results indicate that supervised learning with the naïve Bayes algorithm significantly enhanced the peak calling of weak ChIP-seq signals and outperformed other machine learning algorithms. Our integrative approach revealed many potential ERα/SRC-1 DNA binding sites that would otherwise be missed by conventional peak calling algorithms with default settings.
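To make the supervised step in this abstract concrete, here is a minimal Gaussian naïve Bayes sketch that labels candidate regions as binding sites or background from a few numeric features. The abstract names the naïve Bayes algorithm but not its variant, features, or training data, so the Gaussian likelihood, the feature meanings (for example ChIP signal and motif score), and the function names are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a log prior plus per-feature mean and variance for each class label."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps variances positive for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    """Return the label with the highest posterior log-probability for one feature row."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for v, m, var in zip(row, means, variances):
            # Log of the Gaussian density; features are treated as independent.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The independence assumption lets each evidence type contribute its own likelihood term. This is one way that weak ChIP-seq signal can be combined with other genomic features.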