1,330 research outputs found
De-Novo Discovery of Differentially Abundant Transcription Factor Binding Sites Including Their Positional Preference
Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from microarray, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin-responsive element predominantly positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in the case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom
RECLU: a pipeline to discover reproducible transcriptional start sites and their alternative regulation using capped analysis of gene expression (CAGE)
BACKGROUND: Next-generation sequencing-based technologies are being extensively used to study transcriptomes. Among these, cap analysis of gene expression (CAGE) is specialized in detecting the most 5’ ends of RNA molecules. After mapping the sequenced reads back to a reference genome, CAGE data highlight the transcriptional start sites (TSSs) and their usage at single-nucleotide resolution. RESULTS: We propose a pipeline to group the single-nucleotide TSSs into larger reproducible peaks and compare their usage across biological states. Importantly, our pipeline discovers broad peaks as well as the fine structure of individual transcriptional start sites embedded within them. We assess the performance of our approach on a large CAGE dataset including 156 primary cell types and two cell lines with biological replicates. We demonstrate that genes have complicated structures of transcription initiation events. In particular, we discover that narrow peaks embedded in broader regions of transcriptional activity can be differentially used even if the larger region is not. CONCLUSIONS: By examining the reproducible fine-scale organization of TSSs, we can detect many differentially regulated peaks undetected by previous approaches.
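The idea of grouping single-nucleotide TSSs into larger peaks can be illustrated with a minimal sketch. This is not the RECLU algorithm: it assumes a simple gap-based merge, and the function name `cluster_tss`, the `max_gap` parameter, and the example read counts are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def cluster_tss(tss: List[Tuple[int, int]], max_gap: int = 20) -> List[Dict[str, int]]:
    """Merge (position, read_count) pairs into peaks.

    Assumption for this sketch: two TSSs belong to the same peak when they
    lie at most `max_gap` bases apart. The summit is the position with the
    highest read count inside the peak.
    """
    peaks: List[Dict[str, int]] = []
    for pos, count in sorted(tss):
        if peaks and pos - peaks[-1]["end"] <= max_gap:
            peak = peaks[-1]
            peak["end"] = pos              # extend the current peak
            peak["reads"] += count         # accumulate its total usage
            if count > peak["summit_reads"]:
                peak["summit"], peak["summit_reads"] = pos, count
        else:
            # start a new peak at this TSS
            peaks.append({"start": pos, "end": pos, "reads": count,
                          "summit": pos, "summit_reads": count})
    return peaks


# Hypothetical single-nucleotide TSS counts on one strand of one chromosome.
example_tss = [(100, 5), (103, 40), (110, 8), (500, 12), (505, 3)]
example_peaks = cluster_tss(example_tss, max_gap=20)
for p in example_peaks:
    print(p["start"], p["end"], p["reads"], p["summit"])
```

Running the same merge with a small `max_gap` inside each broad peak would be one simple way to expose narrower sub-peaks, which is the kind of embedded fine structure the abstract describes.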
A highly efficient and effective motif discovery method for ChIP-seq/ChIP-chip data using positional information
Identification of DNA motifs from ChIP-seq/ChIP-chip [chromatin immunoprecipitation (ChIP)] data is a powerful method for understanding the transcriptional regulatory network. However, most established methods are designed for small sample sizes and are inefficient for ChIP data. Here we propose a new k-mer occurrence model to reflect the fact that functional DNA k-mers often cluster around ChIP peak summits. With this model, we introduce a new measure to discover functional k-mers. Using simulation, we demonstrate that our method is more robust against noise in ChIP data than available methods. A novel word clustering method is also implemented to group similar k-mers into position weight matrices (PWMs). Our method was applied to a diverse set of ChIP experiments to demonstrate its high sensitivity and specificity. Importantly, our method is much faster than several other methods for large sample sizes. Thus, we have developed an efficient and effective motif discovery method for ChIP experiments.
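The observation that functional k-mers cluster around peak summits can be sketched as follows. This is not the measure proposed in the paper: it assumes each sequence is centred on its peak summit and simply ranks k-mers by their mean distance to that centre. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def kmer_summit_distances(sequences: List[str], k: int) -> Dict[str, List[float]]:
    """Collect, for every k-mer, the distances between the centre of each
    occurrence and the centre of its sequence (assumed to be the summit)."""
    distances: Dict[str, List[float]] = defaultdict(list)
    for seq in sequences:
        summit = len(seq) / 2
        for i in range(len(seq) - k + 1):
            distances[seq[i:i + k]].append(abs(i + k / 2 - summit))
    return distances


def rank_by_centrality(sequences: List[str], k: int, min_count: int = 2) -> List[str]:
    """Rank k-mers seen at least `min_count` times by mean summit distance,
    smallest (most summit-centred) first."""
    distances = kmer_summit_distances(sequences, k)
    scored = [(mean(d), kmer) for kmer, d in distances.items() if len(d) >= min_count]
    return [kmer for _, kmer in sorted(scored)]


# Toy peak regions with GATA planted at the centre of each sequence.
toy_peaks = ["ACCTGATATGCA", "TTGCGATACAGT", "CAGTGATAGGCT"]
ranking = rank_by_centrality(toy_peaks, k=4)
print(ranking[0])
```

A real implementation would also need a background model, since a mean distance alone does not account for how often a k-mer is expected to occur by chance.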
DLocalMotif: a discriminative approach for discovering local motifs in protein sequences
Motivation: Local motifs are patterns of DNA or protein sequences that occur within a sequence interval relative to a biologically defined anchor or landmark. Current protein motif discovery methods do not adequately consider such constraints to identify biologically significant motifs that are only weakly over-represented but spatially confined. Using negatives, i.e. sequences known to not contain a local motif, can further increase the specificity of their discovery.
Conservation and synteny of long non-coding RNAs in vertebrate genomes and their identification in novel transcriptomes
Long non-coding RNAs (lncRNAs) are a biological entity defined by what they are not, rather than by what they are. This indicates that our knowledge about them is considerably limited. The aim of my PhD is to gain insights into the evolution and the functions of lncRNAs through computational approaches and the use of large-scale functional genomics datasets. I developed an annotation pipeline, which can effectively identify lncRNAs in entire transcriptomes. The pipeline is able to accurately annotate the coding genes while predicting a conservative estimate of the lncRNA population. It allowed me to show, for the first time, the presence of lncRNA transcription in a diverse range of organisms. Further, I analysed sequence and positional conservation of lncRNAs, demonstrating the presence of short segments of conserved sequence in lncRNAs and the existence of several syntenically conserved non-coding transcripts over large evolutionary distances. However, I also demonstrate that positional conservation of lncRNAs with a flanking coding gene is generally independent of the conservation of the lncRNA expression with respect to the coding gene. Finally, I have characterised the diversity of lncRNA transcription in specific cells and developmental stages of two teleost fishes. In summary, the work presented in the thesis provides novel findings and contributions in the field of lncRNAomics.
Deciphering Regulatory Networks in the Mouse Genome
Despite the major achievements in the field of genomics and in-depth studies of the protein-coding genes, our knowledge about non-coding regions and their contribution to disease remains incomplete. Large-scale projects such as ENCODE have produced a wealth of sequencing data which can be utilised to study epigenetic features associated with gene regulation. These studies have comprehensively identified regulatory elements such as enhancers in the human genome, but numerous questions still remain about their effect on gene function and disease causation.
The aim of this thesis is to identify enhancer regulatory networks in the mouse genome and investigate their effect on mouse models of human diseases. In order to study enhancer regulation, I have taken two approaches. First, I have produced a catalogue of well-defined multiple enhancer types in a diverse range of mouse tissues and cell types. By systematically comparing different enhancer types, I found that super- and typical-enhancers have different effects on gene expression, but both are preferentially associated with relevant tissue-type phenotypes. Also, genes associated with super- and typical-enhancers exhibit no difference in phenotype effect size or pleiotropy. Second, by utilising publicly available regulatory annotations, my enhancer catalogue and omics data, I have investigated regulatory mechanisms associated with metabolic and circadian mouse models. Here I identified novel regulatory networks, enhancers, or transcription factor binding sites pertaining to the mutant mice.
In conclusion, my research has shown the usefulness of integrating enhancer annotations with an array of molecular data and has for the first time shown how different enhancer architectures influence gene function in the mouse genome. This study provides a valuable dataset to further characterise the mechanisms of gene regulation by enhancers in the mouse genome.
Transcription factor binding specificity and occupancy: elucidation, modelling and evaluation
The major contributions of this thesis are addressing the need for an objective quality evaluation of a transcription factor binding model, demonstrating the value of the tools developed to this end and elucidating how in vitro and in vivo information can be utilized to improve TF binding specificity models. Accurate elucidation of TF binding specificity remains an ongoing challenge in gene regulatory research. Several in vitro and in vivo experimental techniques have been developed, followed by a proliferation of algorithms and, ultimately, binding models. This increase led to a choice problem for the end users: which tools to use, and which is the most accurate model for a given TF? Therefore, the first section of this thesis investigates the motif assessment problem: how scoring functions, choice and processing of benchmark data, and statistics used in evaluation affect motif ranking. This analysis revealed that TF motif quality assessment requires a systematic comparative analysis, and that the scoring functions used have a TF-specific effect on motif ranking. These results informed the design of the Motif Assessment and Ranking Suite (MARS), supported by PBM and ChIP-seq benchmark data and an extensive collection of PWM motifs. MARS implements consistency, enrichment, and scoring and classification-based motif evaluation algorithms. Transcription factor binding is also influenced and determined by contextual factors: chromatin accessibility, competition or cooperation with other TFs, cell line or condition specificity, binding locality (e.g. proximity to transcription start sites) and the shape of the binding site (DNA-shape). In vitro techniques do not capture such context; therefore, this thesis also combines PBM and DNase-seq data using a comparative k-mer enrichment approach that compares open chromatin with genome-wide prevalence, achieving a modest performance improvement when benchmarked on ChIP-seq data.
Finally, since statistical and probabilistic methods cannot capture all the information that determines binding, a machine learning approach (XGBoost) was implemented to investigate how the features contribute to TF specificity and occupancy. This combinatorial approach improves the predictive ability of TF specificity models, with the most predictive feature being chromatin accessibility, while DNA-shape and conservation information both significantly improve on the baseline model of k-mer and DNase data. The results and the tools introduced in this thesis are useful for systematic comparative analysis (via MARS) and a combinatorial approach to modelling TF binding specificity, including appropriate feature engineering practices for machine learning modelling.
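A comparative k-mer enrichment of the kind mentioned in this abstract, open chromatin against genome-wide prevalence, might look like the sketch below. It is not the thesis's implementation: it assumes a plain log2 ratio of pseudocount-smoothed k-mer frequencies, and the function names, the `pseudocount` parameter, and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def kmer_counts(sequences: List[str], k: int) -> Counter:
    """Count every overlapping k-mer across a list of sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def log2_enrichment(foreground: List[str], background: List[str], k: int,
                    pseudocount: float = 1.0) -> Dict[str, float]:
    """log2 ratio of smoothed k-mer frequency in foreground (e.g. open
    chromatin) versus background (e.g. genome-wide) sequences."""
    fg, bg = kmer_counts(foreground, k), kmer_counts(background, k)
    kmers = set(fg) | set(bg)
    fg_total = sum(fg.values()) + pseudocount * len(kmers)
    bg_total = sum(bg.values()) + pseudocount * len(kmers)
    return {
        kmer: math.log2(((fg[kmer] + pseudocount) / fg_total)
                        / ((bg[kmer] + pseudocount) / bg_total))
        for kmer in kmers
    }


# Toy data: GACGTC is more frequent in the "open chromatin" set.
open_chromatin = ["TTGACGTCAA", "CCGACGTCGG"]
genome_wide = ["ATATATATAT", "GCGCGACGTC"]
scores = log2_enrichment(open_chromatin, genome_wide, k=6)
print(round(scores["GACGTC"], 3), round(scores["ATATAT"], 3))
```

Scores of this kind could serve as one input feature to a downstream classifier, alongside accessibility, DNA-shape and conservation features.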
Bioinformatic Inference of Specific and General Transcription Factor Binding Sites in the Plant Pathogen Phytophthora infestans
Plant infection by oomycete pathogens is a complex process. It requires precise expression of a plethora of genes in the pathogen that contribute to a successful interaction with the host. Whereas much effort has been made to uncover the molecular systems underlying this infection process, mechanisms of transcriptional regulation of the genes involved remain largely unknown. We performed the first systematic de-novo DNA motif discovery analysis in Phytophthora. To this end, we utilized the genome sequence of the late blight pathogen Phytophthora infestans and two related Phytophthora species (P. ramorum and P. sojae), as well as genome-wide in planta gene expression data to systematically predict 19 conserved DNA motifs. This catalog describes common eukaryotic promoter elements whose functionality is supported by the presence of orthologs of known general transcription factors. Together with strong functional enrichment of the common promoter elements towards effector genes involved in pathogenicity, we obtained a new and expanded picture of the promoter structure in P. infestans. More intriguingly, we identified specific DNA motifs that are either highly abundant or whose presence is significantly correlated with gene expression levels during infection. Several of these motifs are observed upstream of genes encoding transporters and RXLR effectors, but also transcriptional regulators. Motifs that are observed upstream of known pathogenicity-related genes are potentially important binding sites for transcription factors. Our analyses add substantial knowledge to the as yet virtually unexplored question of general and specific gene regulation in this important class of pathogens. We propose hypotheses on the effects of cis-regulatory motifs on the gene regulation of pathogenicity-related genes and pinpoint motifs that are prime targets for further experimental validation.