179 research outputs found

    Development of Computational Techniques for Regulatory DNA Motif Identification Based on Big Biological Data

    Accurate regulatory DNA motif (or motif) identification plays a fundamental role in elucidating the transcriptional regulatory mechanisms in a cell and can strongly support regulatory network construction for both prokaryotic and eukaryotic organisms. Next-generation sequencing techniques generate a huge amount of biological data for motif identification. Specifically, Chromatin Immunoprecipitation followed by high-throughput DNA sequencing (ChIP-seq) enables researchers to identify motifs on a genome scale. Recently, technological improvements have allowed DNA structural information to be obtained in a high-throughput manner, which can provide four DNA shape features. DNA shape has been found to be a complementary factor to genomic sequence in transcription factor (TF)-DNA binding specificity prediction based on traditional machine learning models. Recent studies have demonstrated that deep learning (DL), especially the convolutional neural network (CNN), enables identification of motifs directly from DNA sequence. Although numerous algorithms and tools have been proposed and developed in this field, (1) the lack of intuitive and integrative web servers impedes effective use of emerging algorithms and tools; (2) DNA shape has not been integrated with DL; and (3) existing DL models still suffer from high false-positive and false-negative rates in motif identification. This thesis focuses on developing an integrated web server for motif identification based on DNA sequences supplied either by users or by built-in databases. This web server allows further motif-related analysis and Cytoscape-like network interpretation and visualization. We then proposed a DL framework for both sequence and shape motif identification from ChIP-seq data using a binomial distribution strategy. This framework accepts different combinations of DNA sequence and DNA shape as input. Finally, we developed a gated convolutional neural network (GCNN) for capturing motif dependencies among long DNA sequences. Results show that our web server provides more comprehensive motif analysis functionality than existing web servers. The DL framework can identify motifs using an optimized threshold and reveals the strong predictive power of DNA shape in TF-DNA binding specificity. The identified sequence and shape motifs can contribute to the interpretation of TF-DNA binding mechanisms. Additionally, GCNN improves TF-DNA binding specificity prediction over CNN on most of the datasets.

    Statistical Algorithms and Bioinformatics Tools Development for Computational Analysis of High-throughput Transcriptomic Data

    Next-Generation Sequencing technologies allow for a substantial increase in the amount of data available for various biological studies. To analyze this data effectively and efficiently, computational approaches combining mathematics, statistics, computer science, and biology are implemented. Even with the substantial efforts devoted to the development of these approaches, numerous issues and pitfalls remain. One of these issues is mapping uncertainty, in which read alignment results are biased due to the inherent difficulties associated with accurately aligning RNA-Sequencing reads. GeneQC is an alignment quality control tool that provides insight into the severity of mapping uncertainty in each annotated gene from alignment results. GeneQC uses feature extraction to identify three levels of information for each gene and implements elastic net regularization and mixture model fitting to provide insight into the severity of mapping uncertainty and the quality of read alignment. In combination with GeneQC, the Ambiguous Reads Mapping (ARM) algorithm re-aligns ambiguous reads by integrating motif prediction from metabolic pathways to establish coregulatory gene modules for re-alignment, using a negative binomial distribution-based probabilistic approach. These two tools work in tandem to address the issue of mapping uncertainty and provide more accurate read alignments, and thus more accurate expression estimates. Also presented in this dissertation are two approaches to interpreting the expression estimates. The first is IRIS-EDA, an integrated Shiny web server that combines numerous analyses to investigate gene expression data generated from RNA-Sequencing data. The second is ViDGER, an R/Bioconductor package that quickly generates high-quality visualizations of differential gene expression results to help users interpret those results comprehensively, which is a non-trivial task. These four tools cover a variety of aspects of modern RNA-Seq analyses and aim to address bottlenecks related to algorithmic and computational issues, as well as to provide more efficient and effective implementation methods.

    CMStalker: a combinatorial tool for composite motif discovery

    Controlling the differential expression of many thousands of different genes at any given time is a fundamental task of metazoan organisms. This complex orchestration is controlled by the so-called regulatory genome, which encodes complex regulatory networks: several Transcription Factors bind to precise DNA regions so as to perform, in a cooperative manner, a specific regulation task for nearby genes. The in silico prediction of these binding sites is still an open problem, notwithstanding continuous progress and activity over the last two decades. In this paper we describe a new, efficient combinatorial approach to the problem of detecting sets of cooperating binding sites in promoter sequences, given as input a database of Transcription Factor Binding Sites encoded as Position Weight Matrices. We present CMStalker, a software tool for composite motif discovery that embodies a new approach combining a constraint satisfaction formulation with a parameter relaxation technique to explore the space of possible solutions efficiently. Extensive experiments with twelve data sets and eleven state-of-the-art tools are reported, showing an average correlation coefficient of 0.54 (against 0.41 for the closest competitor). This improvement in output quality due to CMStalker is statistically significant.

    A functional and regulatory perspective on Arabidopsis thaliana


    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding and chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations, and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then on specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.

    Development of Computational Techniques for Identification of Regulatory DNA Motif

    Identifying precise transcription factor binding sites (TFBS) or regulatory DNA motifs (motifs) plays a fundamental role in researching transcriptional regulatory mechanisms in cells and in helping construct regulatory networks for biological investigation. Chromatin immunoprecipitation combined with sequencing (ChIP-seq) and lambda exonuclease digestion followed by high-throughput sequencing (ChIP-exo) enable researchers to identify TFBS on a genome scale with improved resolution. Several algorithms have been developed to perform motif identification, employing widely different methods and often giving divergent results. In addition, these existing methods still suffer from limited prediction accuracy. This thesis focuses on the development of improved regulatory DNA motif identification techniques. We designed an integrated framework, WTSA, that can reliably combine the experimental signals from ChIP-exo data at base pair (bp) resolution to predict statistically significant DNA motifs. The algorithm improves prediction accuracy and extends the scope of applicability of the existing methods. We applied the framework to the Escherichia coli K-12 genome and evaluated WTSA prediction performance through comparison with seven existing programs. The performance evaluation indicated that WTSA provides reliable predictive power for regulatory motifs using ChIP-exo data. An important application of DNA motif identification is to identify transcriptional regulatory mechanisms. The rapid development of single-cell RNA-Sequencing (scRNA-seq) technologies provides an unprecedented opportunity to discover gene transcriptional regulation at the single-cell level. In scRNA-seq analyses, a critical step is to identify the cell-type-specific regulons (CTS-Rs), each of which is a group of genes co-regulated by the same transcription regulator in a specific cell type. We developed a web server, IRIS3 (Integrated Cell-type-specific Regulon Inference Server from Single-cell RNA-Seq), to solve this problem by integrating data preprocessing, cell type prediction, gene module identification, and cis-regulatory motif analyses. Compared with other packages, IRIS3 predicts more efficiently and provides more accurate regulons from scRNA-seq data. These CTS-Rs can substantially improve the elucidation of heterogeneous regulatory mechanisms among various cell types and allow reliable construction of global transcriptional regulation networks encoded in a specific cell type. Also presented in this thesis is DESSO (DEep Sequence and Shape mOtif), which uses deep neural networks and the binomial distribution model to identify DNA motifs. DESSO outperformed existing tools, including DeepBind, on 690 human ENCODE ChIP-seq datasets. DESSO also further expanded motif identification power by integrating the detection of DNA shape features.

    Studying the regulatory landscape of flowering plants


    Transcriptional regulation of neurogenesis by the proneural factor Ascl1

    Master's dissertation in Bioinformatics. This project aims to provide a better understanding of the transcriptional regulation of neurogenesis by the proneural factor Ascl1. The first genome-wide characterization of the Ascl1 transcriptional program in the embryonic mouse brain was performed by ChIP-chip. However, the restriction to proximal promoter regions, which excludes genes bound by Ascl1 at distal enhancers, and the need to validate the model with a more robust experimental approach prompted the use of ChIP-seq. Genome-wide mapping of Ascl1 binding sites at higher resolution reveals 3054 high-confidence binding regions in the ventral telencephalon. The chromatin states of genomic regions associated with Ascl1 recruitment were also characterised, leading to the conclusion that these regions bear marks of distal enhancers, but also of proximal promoter regions. Further integration of expression profiling data from Ascl1 loss-of-function (LoF) experiments identifies 643 target genes. Results from functional annotation of these targets corroborate previous findings, showing that Ascl1 coordinates neurogenesis by regulating a large number of target genes with a wide variety of biological functions, associated with different stages of neurogenesis. Additional investigations should address how Ascl1 coordinates this complex transcriptional program along the neuronal lineage. These could explore a possible crosstalk with the Notch program, taking advantage of the 105 regulatory regions identified by ChIP-seq where Ascl1 is co-recruited by RBPJ.