1,612 research outputs found

    Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure

    Understanding the genetic regulatory code governing gene expression is an important challenge in molecular biology. However, how individual coding and non-coding regions of the gene regulatory structure interact and contribute to mRNA expression levels remains unclear. Here we apply deep learning to over 20,000 mRNA datasets to examine the genetic regulatory code controlling mRNA abundance in 7 model organisms ranging from bacteria to human. In all organisms, we can predict mRNA abundance directly from DNA sequence, with up to 82% of the variation of transcript levels encoded in the gene regulatory structure. By searching for DNA regulatory motifs across the gene regulatory structure, we discover that motif interactions could explain the whole dynamic range of mRNA levels. Co-evolution across coding and non-coding regions suggests that it is not single motifs or regions, but the entire gene regulatory structure and specific combination of regulatory elements that define gene expression levels.

    Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

    About half of the human genome consists of transposable elements (TEs), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TEs. TEs affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TEs are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TEs are developed. These features are of two types. The first type consists of statistical features based on the Fourier transform used to assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type of feature, called side effect machine (SEM) features, is generated by finite state machines augmented with counters that track the number of times each state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets that incorporate a genetic algorithm and a novel clustering method are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon finding algorithms, to build classifiers that distinguish TEs from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TEs. In addition, the features are used to describe the TEs in the newly sequenced ciliate Tetrahymena thermophila, giving biologists information useful for forming experimentally testable hypotheses about the role of these TEs and the mechanisms that govern them.

    Solution structure and dynamic analysis of chicken MBD2 methyl binding domain bound to a target-methylated DNA sequence

    The epigenetic code of DNA methylation is interpreted chiefly by methyl cytosine binding domain (MBD) proteins which in turn recruit multiprotein co-repressor complexes. We previously isolated one such complex, MBD2-NuRD, from primary erythroid cells and have shown it contributes to embryonic/fetal β-type globin gene silencing during development. This complex has been implicated in silencing tumor suppressor genes in a variety of human tumor cell types. Here we present structural details of chicken MBD2 bound to a methylated DNA sequence from the ρ-globin promoter to which it binds in vivo and mediates developmental transcriptional silencing in normal erythroid cells. While previous studies have failed to show sequence specificity for MBD2 outside of the symmetric mCpG, we find that this domain binds in a single orientation on the ρ-globin target DNA sequence. Further, we show that the orientation and affinity depend on guanine immediately following the mCpG dinucleotide. Dynamic analyses show that DNA binding stabilizes the central β-sheet, while the N- and C-terminal regions of the protein maintain mobility. Taken together, these data lead to a model in which DNA binding stabilizes the MBD2 structure and binding orientation and affinity are influenced by the DNA sequence surrounding the central mCpG.

    Molecular Evolution of Hominoid Primates: Phylogeny and Regulation

    The complete mtDNA of one eastern gorilla was sequenced to provide the most accurate date for the mitochondrial divergence of gorillas. The most recent common ancestor of eastern lowland and western lowland gorillas existed about 1.9 million years ago, slightly more recent than that of chimpanzee and bonobo. This study also indicates that the eastern and western gorillas show species-level genetic divergence. Hominoid mating systems differ tremendously. The level of sperm competition varies according to the mating system, which presumably imposes unique selective pressures on the seminal proteins of each species. Cartilage acidic protein 1 (CRTAC1) was identified in our lab as the protein with the largest difference in abundance between human and chimpanzee, being 142-fold more abundant in chimpanzee. The coding region of CRTAC1 is extremely conserved, with a signature of strong purifying selection. Paradoxically, the CRTAC1 'promoter' from human drives significantly greater transcription than that from chimpanzee, with or without androgen stimulation. Analyzing H3K27Ac data, a ~2.2kb region was identified as a possible additional cis-regulatory element. The cis-regulatory region behaved like a silencer and aided in strong transcriptional repression in humans. Although its underlying basis remains elusive, it can be speculated that the differential expression of CRTAC1 between human and chimpanzee seminal plasma results from tissue-specific over/under expression of this gene. The unique gains and losses of miRNAs within hominoids have remained understudied. The overall goal of this project was to identify the uniquely gained and lost miRNAs and their targets within hominoids. I found 14 miRNAs uniquely gained in humans. Most uniquely gained and lost miRNAs were found to be brain specific. The targets of uniquely gained miRNAs in human are also associated with brain-associated functions. Older miRNAs were found to be more conserved compared to the newer miRNAs gained <15 Mya.

    Understanding protein-DNA binding events

    DNA binding proteins regulate essential biological processes such as DNA replication, transcription, repair, and splicing. Transcription factors (TFs) are the focus of this work because they have the largest effect on activating and repressing gene expression by influencing transcription rates. It is important to model TF binding affinity to DNA and to predict protein-DNA binding events to understand how they regulate cell mechanisms. Higher order Markov models bring de novo motif discovery to the next level. BaMM!motif has been shown to provide robust predictions of these more sophisticated binding models. Here I introduce the BaMM!motif web application, a web-based platform which combines de novo motif discovery with motif enrichment and motif-motif comparison tools and a database of known motifs. This web application enables the usage of the BaMM!motif algorithm in a straightforward and robust environment. Post-translational histone modifications and linker histone incorporation regulate chromatin structure and genome activity. How these systems interface on a molecular level is unclear. Using biochemistry, one observes that the modification behavior of N-terminal histone H3 tails depends on the nucleosomal context. I found that linker histones inhibit modifications of different H3 sites on a genome-wide level. This suggests that alterations of H3 tail-linker DNA interactions by linker histones execute basal control mechanisms of chromatin function. Pervasive transcription of eukaryotic genomes stems to a large extent from bidirectional promoters that synthesize mRNA and divergent noncoding RNA (ncRNA). Here, I show that early termination that relies on the essential RNA-binding factor Nrd1 attenuates transcription of 32 genes in yeast. Further, depletion of Nrd1 from the nucleus results in 1,526 Nrd1-unterminated transcripts (NUTs) that originate from nucleosome-depleted regions (NDRs) and can deregulate mRNA synthesis by antisense repression and transcription interference.

    Bioinformatics

    This book is divided into different research areas relevant in Bioinformatics such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics, and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.

    The Genetic basis of resistance and susceptibility in the Albugo laibachii-Arabidopsis thaliana pathosystem

    Albugo is a genus of biotrophic plant pathogens that can infect an extensive range of hosts including many Brassicaceae crop species. Little is known about the molecular mechanisms by which Albugo species can suppress host immunity and the mechanisms by which plants can resist Albugo infection. Albugo laibachii (Al) is a specialized pathogen of Arabidopsis thaliana (At). It can colonize ~90% of At accessions and suppress effector-triggered immunity to other pathogens. It is postulated that Al secretes effector proteins. Analysis of the A. laibachii genome by Kemen et al. (2011, PLoS Biology) revealed a potential class of effectors with a 'CHXC' motif in their N-terminus that can mediate translocation into host cells. However, there are only ~35 CHXC effectors in A. laibachii, suggesting that they might not represent its entire effector complement. I took a traditional approach to identify Al effectors: cloning "avirulence (Avr) genes". These typically encode effectors that are recognized and trigger a strong response by the immune system of some host accessions. I identified and sequenced four Al isolates from field samples. Using differential phenotype information to guide a genome-wide analysis, and my expectations of the allelic diversity of Avr genes, I identified two novel recognized effectors. These effectors, short secreted proteins named "SSP16" and "SSP18", are recognized by the Arabidopsis accessions HR-5 and Ksk-1 respectively. I used classical and Illumina-based genetic mapping to identify the locus conferring SSP16 recognition in HR-5, Resistance to A. laibachii 4 (RAL4). This locus contains three putative CC-NB-LRR class Resistance protein-encoding genes with similarity to Resistance to Peronospora parasitica 7 (RPP7). I demonstrated the utility of combined genomics approaches to identify recognized effectors without known motifs. The identification of the first Avr-Resistance gene pair will pave the way for further dissection of the molecular interactions in this pathosystem.

    Synthetic Biology

    Synthetic biology gives us a new hope because it combines various disciplines, such as genetics, chemistry, biology, and molecular sciences, and gives rise to a novel interdisciplinary science. We can foresee the creation of the new world of vegetation, animals, and humans with the interdisciplinary system of biological sciences. These articles are contributed by renowned experts in their fields. The field of synthetic biology is growing exponentially and opening up new avenues in multidisciplinary approaches by bringing together theoretical and applied aspects of science.

    Statistical methods for biological sequence analysis for DNA binding motifs and protein contacts

    Over the last decades a revolution in novel measurement techniques has permeated the biological sciences, filling the databases with unprecedented amounts of data ranging from genomics, transcriptomics, proteomics and metabolomics to structural and ecological data. In order to extract insights from the vast quantity of data, computational and statistical methods are nowadays crucial tools in the toolbox of every biological researcher. In this thesis I summarize my contributions in two data-rich fields in biological sciences: transcription factor binding to DNA and protein structure prediction from protein sequences with shared evolutionary ancestry. In the first part of my thesis I introduce our work towards a web server for analysing transcription factor binding data with Bayesian Markov Models. In contrast to classical PWM or di-nucleotide models, Bayesian Markov models can capture complex inter-nucleotide dependencies that can arise from shape-readout and alternative binding modes. In addition to giving access to our methods in an easy-to-use, intuitive web-interface, we provide our users with novel tools and visualizations to better evaluate the biological relevance of the inferred binding motifs. We hope that our tools will prove useful for investigating weak and complex transcription factor binding motifs which cannot be predicted accurately with existing tools. The second part discusses a statistical attempt to correct out the phylogenetic bias arising in co-evolution methods applied to the contact prediction problem. Co-evolution methods revolutionized the protein-structure prediction field more than 10 years ago, and, until very recently, have retained their importance as crucial input features to deep neural networks. As the co-evolution information is extracted from evolutionarily related sequences, we investigated whether the phylogenetic bias to the signal can be corrected out in a principled way using a variation of Felsenstein's tree-pruning algorithm applied in combination with an independent-pair assumption to derive pairwise amino acid counts that are corrected for the evolutionary history. Unfortunately, the contact prediction derived from our corrected pairwise amino acid counts did not yield a competitive performance.