171 research outputs found

    CORENup: a combination of convolutional and recurrent deep neural networks for nucleosome positioning identification

    Background: Nucleosomes wrap the DNA into the nucleus of the eukaryotic cell and regulate its transcription phase. Several studies indicate that nucleosomes are determined by the combined effects of several factors, including DNA sequence organization. Interestingly, the identification of nucleosomes on a genomic scale has been successfully performed by computational methods using DNA sequence as input data. Results: In this work, we propose CORENup, a deep learning model for nucleosome identification. CORENup processes a DNA sequence as input using one-hot representation and combines in a parallel fashion a fully convolutional neural network and a recurrent layer. These two parallel levels are devoted to catching both non-periodic and periodic DNA string features. A dense layer is devoted to their combination to give a final classification. Conclusions: Results computed on public data sets of different organisms show that CORENup is a state-of-the-art methodology for nucleosome positioning identification based on a Deep Neural Network architecture. The comparisons have been carried out using two groups of datasets, currently adopted by the best performing methods, and CORENup has shown top performance both in terms of classification metrics and elapsed computation time
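
The one-hot representation mentioned in the abstract above can be illustrated with a minimal sketch. The function name `one_hot_dna`, the A/C/G/T column order, and the all-zero encoding of ambiguous bases are assumptions made for illustration; they are not details taken from CORENup.

```python
# Hypothetical helper, not part of CORENup.
# Assumed convention: columns are ordered A, C, G, T.
ALPHABET = "ACGT"


def one_hot_dna(sequence):
    """Return one 4-element row per base; unknown bases (e.g. N) become all zeros."""
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = ALPHABET.find(base)  # -1 when the base is not A, C, G or T
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


encoded = one_hot_dna("ACGTN")
```

Each row could then serve as one position of the input to a convolutional or recurrent layer; how CORENup actually shapes its input is not stated in the abstract.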

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Multi Layer Analysis

    This thesis presents a new methodology to analyze one-dimensional signals through a new approach called Multi Layer Analysis, MLA for short. It also provides some new insights on the relationship between one-dimensional signals processed by MLA and tree kernels, tests of randomness and signal processing techniques. The MLA approach has a wide range of applications in the fields of pattern discovery and matching, computational biology and many other areas of computer science and signal processing. This thesis also includes some applications of this approach to real problems in biology and seismology

    RNAi machinery cooperates with SWI/SNF complexes in nucleosome positioning at Transcriptional Start Sites

    In eukaryotes, Argonaute (AGO) proteins have a well-established role in the cytoplasm in post-transcriptional regulation of gene expression in association with different classes of small non-coding RNAs (sRNAs). In plants and yeast, it has been demonstrated that AGO proteins play a role in the epigenetic regulation of chromatin modifications. Furthermore, the AGO2 protein also acts in the nuclei of human cell lines, and emerging literature reports that upon the transfection of sRNAs complementary to non-coding promoter transcripts, AGO2 is recruited to target promoters. Previous results in our laboratory demonstrated that AGO2 and SWI/SNF have a physical interaction, which is independent of RNA or DNA, in human cell lines. As SWI/SNF is the major chromatin-remodelling complex in humans, these data suggest that AGO2 might participate in the regulation of chromatin plasticity. In eukaryotes, the proper organization of chromatin is essential for the control of gene expression and is achieved through the concerted activity of histone modifications, DNA methylation and nucleosome positioning. The focus of the present thesis has been the development of relevant bioinformatics pipelines for data processing, analysis and visualization, all aimed at dissecting the functional significance of the AGO2-SWI/SNF interaction. Interestingly, this bioinformatics pipeline allowed me to identify a novel class of nuclear AGO2-bound sRNAs arising from genomic regions 150 nt around the Transcription Start Sites (TSS) bound by SWI/SNF (swiRNAs). Furthermore, swiRNAs undergo Dicer-dependent processing and are involved in nucleosome occupancy at nucleosome +1. These data represent the first description of a molecular mechanism through which AGO2 is involved in nucleosome positioning in mammalian cells

    Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering

    The living cell exhibits emergence of complex behavior and its modeling requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the more narrow aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. Covered are the different parts of the cycle from improving quantification methods, to deriving protein features relying on their primary structure, predicting the protein content solely from sequence data, and, finally, to developing theoretical protein engineering tools, leading back to experiment. High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined towards understanding how the phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data present an increased analysis challenge and current methods capture only a small fraction of signal. The first part of my work has involved tackling these issues with the implementation of a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of acquired signal, enabling high precision quantification and further analytical tasks. Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure.
Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, furthermore, in combination yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance

    Study of a novel evolutionarily conserved pattern of histone acetylation

    The eukaryotic genome is packaged into a highly ordered chromatin structure. Even though chromatin structure is important for maintaining genomic integrity, it is a barrier to numerous DNA-based processes such as DNA replication, transcription and DNA repair. Histones contain a bewildering diversity of covalent modifications that are mostly but not exclusively concentrated within their amino-terminal tails. Histone modifications play a central role in regulating chromatin structure and function. However, determining the stoichiometry of site-specific modifications, identifying patterns of modifications and establishing their physiological roles remain formidable challenges. In this study, we exploited mass spectrometry to determine the stoichiometry of acetylation at specific histone lysine residues.
In general, histone lysine residues lacking acetylation are derivatized to render the resulting peptides chemically equivalent but distinguishable by mass from their acetylated counterparts. However, this method is insufficient to study peptides that contain more than one acetylatable lysine, such as those derived from the N-terminal tail of histones H3 and H4. Tryptic digestion of such peptides generates ‘positional isomers’, isomers that have the same mass but bear acetyl groups located at different positions. Accurate quantification of site-specific acetylation in those peptides is, therefore, a major analytical challenge. In the second chapter, we describe a novel method for quantifying site-specific acetylation of co-eluting isomeric and isobaric peptides that combines high-resolution LC-MS/MS data with a novel bioinformatics algorithm, Iso-PeptidAce. Using tandem mass spectra (MS/MS) of synthetic peptides, fragmentation products diagnostic of each positional isomer were identified and were used to deconvolute spectra that arise from mixtures of positional isomers and quantify the abundance of each isomer. We then tested the applicability of Iso-PeptidAce to quantify time-dependent increases in histone acetylation of K562 erythroleukaemia cells treated with histone deacetylase (HDAC) inhibitors. Using our method, we also found that histones H3 and H4 associated with CAF-1, a chromatin assembly factor, have a high stoichiometry of acetylation on multiple residues of H3 and H4, compared with total histones. In Chapter 3, we applied Iso-PeptidAce to determine the stoichiometry of acetylation in fission yeast histone deacetylase mutants. Consistent with previous reports implicating Clr3 and Sir2 in heterochromatin function, we observed that cells lacking these HDACs showed an increase in H3-K14 acetylation only on those peptides where it co-exists with di/trimethylated H3-K9, a mark of heterochromatin.
In chapter 4, we describe the discovery of very high levels of acetylation on two lysine residues. We found that a high stoichiometry of acetylation at H3-K14 and H3-K23, and low stoichiometry of acetylation at H3-K9 and H3-K18, is an evolutionarily conserved global pattern of H3 acetylation. Using fission yeast (S. pombe) strains harboring histone mutations H3-K14R and/or H3-K23R that prevent acetylation, we demonstrate that H3-K14 and H3-K23 have separable functions. Surprisingly, we found that the phenotypes observed in H3-K14R mutant cells are largely due to mutation of the lysine residue, rather than loss of acetylation. Using S. pombe strains that lack histone acetyltransferases (HATs) we identified the acetyltransferases that contribute to H3-K14ac and H3-K23ac in vivo. Very few studies have aimed at specifically determining the acetylation stoichiometries of histones. Our results suggest that, on average, over the entire genome, every second or third nucleosome contains an H3 molecule with K14 and/or K23 acetylation. This leads us to surmise that genome-wide acetylation of histone H3 may have an important role in chromosome function. It is imperative to understand the functional significance of this acetylation pattern given that epigenetic therapy is actively pursued as a strategy to treat many diseases
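
The general idea of deconvoluting a mixed spectrum into contributions from positional isomers can be sketched as a small least-squares problem. This is not the Iso-PeptidAce algorithm, whose details the abstract does not give; the function name `deconvolve_two_isomers`, the restriction to two isomers, the clamping of negative estimates to zero, and the toy intensity values are all assumptions for illustration.

```python
# Hypothetical illustration only; NOT the Iso-PeptidAce implementation.
def deconvolve_two_isomers(ref_a, ref_b, mixture):
    """Estimate the fractions of two isomers in a mixed fragment spectrum.

    Solves the 2x2 normal equations for a*ref_a + b*ref_b ~= mixture,
    clamps negative estimates to zero (a simplification, not true
    non-negative least squares), and normalises so the fractions sum to 1.
    """
    saa = sum(x * x for x in ref_a)
    sbb = sum(y * y for y in ref_b)
    sab = sum(x * y for x, y in zip(ref_a, ref_b))
    sam = sum(x * m for x, m in zip(ref_a, mixture))
    sbm = sum(y * m for y, m in zip(ref_b, mixture))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("reference spectra are not distinguishable")
    a = max((sam * sbb - sab * sbm) / det, 0.0)
    b = max((saa * sbm - sab * sam) / det, 0.0)
    total = a + b
    if total == 0:
        return 0.0, 0.0
    return a / total, b / total


# Made-up toy intensities: two diagnostic fragments plus one shared fragment.
reference_a = [1.0, 0.0, 0.5]
reference_b = [0.0, 1.0, 0.5]
mixed = [0.75, 0.25, 0.5]  # constructed as 0.75 * reference_a + 0.25 * reference_b
fraction_a, fraction_b = deconvolve_two_isomers(reference_a, reference_b, mixed)
```

With real data, more than two isomers and noisy intensities would call for a proper constrained solver rather than this closed-form two-component version.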

    A Bayesian system for modeling promoter structure: A case study of histone promoters

    Ph.D. (Doctor of Philosophy)

    Anti-bias training for (sc)RNA-seq : experimental and computational approaches to improve precision

    RNA-seq, including single-cell RNA-seq (scRNA-seq), is plagued by insufficient sensitivity and lack of precision. As a result, the full potential of (sc)RNA-seq is limited. A major factor in this respect is the presence of global bias in most datasets, which affects detection and quantitation of RNA in a length-dependent fashion. In particular, scRNA-seq is affected by technical noise and a high rate of dropouts, where the vast majority of original transcripts are not converted into sequencing reads. We discuss the origins and implications of these biases, bioinformatics approaches to correct for them, and how biases can be exploited to infer characteristics of the sample preparation process, which in turn can be used to improve library preparation