187 research outputs found

    Computational Methods for Comparative Non-coding RNA Analysis: from Secondary Structures to Tertiary Structures

    Unlike messenger RNAs (mRNAs), whose information is encoded in the primary sequences, the cellular roles of non-coding RNAs (ncRNAs) originate from their structures. Therefore, studying the structural conservation in ncRNAs is important for an in-depth understanding of their functions. In the past years, many computational methods have been proposed to analyze the common structural patterns in ncRNAs using comparative methods. However, RNA structural comparison is not a trivial task, and the existing approaches still have numerous issues in efficiency and accuracy. In this dissertation, we introduce a suite of novel computational tools that extend the classic models for ncRNA secondary and tertiary structure comparisons. For RNA secondary structure analysis, we first developed a computational tool, named PhyloRNAalifold, to integrate phylogenetic information into consensus structural folding. The underlying idea of this algorithm is that the importance of a co-varying mutation should be determined by its position on the phylogenetic tree. By assigning high scores to the critical covariances, the prediction of RNA secondary structure can be more accurate. Besides structure prediction, we also developed a computational tool, named ProbeAlign, to improve the efficiency of genome-wide ncRNA screening by using high-throughput RNA structural probing data. It treats the chemical reactivities embedded in the probing information as pairing attributes of the search targets. This approach avoids the time-consuming base pair matching in secondary structure alignment. The application of ProbeAlign to the FragSeq datasets shows its capability for genome-wide ncRNA analysis. For RNA tertiary structure analysis, we first developed a computational tool, named STAR3D, to find the global conservation in RNA 3D structures. STAR3D aims at finding the consensus of stacks by using 2D topology and 3D geometry together. 
Then, the loop regions can be ordered and aligned according to their relative positions in the consensus. This stack-guided alignment method brings a divide-and-conquer strategy to RNA 3D structural alignment, which improves its efficiency dramatically. Furthermore, we clustered all loop regions in non-redundant RNA 3D structures to detect plausible RNA structural motifs de novo. The computational pipeline, named RNAMSC, was extended to handle large-scale PDB datasets, and thorough downstream analysis was performed to ensure that the clustering results are valid and easy to apply in further research. The final results contain many interesting variations of known motifs, such as the GNAA tetraloop, kink-turn, sarcin-ricin and T-loop motifs. We also discovered novel functional motifs that are conserved in a wide range of ncRNAs, including ribosomal RNA, sgRNA, SRP RNA, GlmS riboswitch and twister ribozyme.

    Computational Methods and Software Tools for Functional Analysis of miRNA Data

    miRNAs are important regulators of gene expression that play a key role in many biological processes. High-throughput techniques allow researchers to discover and characterize large sets of miRNAs, and enrichment analysis tools are becoming increasingly important in decoding which miRNAs are implicated in biological processes. Enrichment analysis of miRNA targets is the standard technique for functional analysis, but this approach carries limitations and bias; alternatives based on direct and curated annotations are currently being proposed. In this review, we describe the two workflows of miRNA enrichment analysis, based on target gene or miRNA annotations, highlighting statistical tests, software tools, up-to-date databases, and functional annotation resources in the study of metazoan miRNAs.
    Funding: Junta de Andalucia PI-0173-2017 CV20.3672
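
    The target-based workflow mentioned in this abstract is usually built on an over-representation test. The following is a minimal sketch, assuming a one-sided hypergeometric test (one common choice; the abstract does not name a specific test). The function name and all counts are hypothetical and serve only as illustration.

```python
from math import comb


def enrichment_pvalue(universe: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided hypergeometric p-value P(X >= hits).

    universe  -- number of genes considered in total (assumed background)
    annotated -- genes in the universe carrying the functional annotation
    sample    -- predicted target genes of the miRNA set under study
    hits      -- target genes that carry the annotation
    """
    upper = min(annotated, sample)
    # Sum the probability mass of every outcome at least as extreme as `hits`.
    tail = sum(
        comb(annotated, i) * comb(universe - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(universe, sample)


if __name__ == "__main__":
    # Hypothetical counts: 40 of 300 targets annotated, 500 of 20000 genes overall.
    print(enrichment_pvalue(universe=20000, annotated=500, sample=300, hits=40))
```

    In practice the p-values from many annotation terms would still need a multiple-testing correction, and the choice of background universe is one source of the bias the abstract refers to.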

    Long non-coding RNAs in the epigenetic regulation of oligodendrocyte differentiation

    Long non-coding RNAs (lncRNAs) constitute a heterogeneous class of RNAs with limited coding potential, united by an arbitrarily placed cut off of >200 ntd. The past decade has seen the emergence of lncRNAs as versatile regulators of gene expression, amidst skepticism regarding the biological usefulness of pervasive genomic transcription and its non-coding RNA products prevalent in most eukaryotes. A significant portion of lncRNAs operate in the development and functioning of the mammalian CNS. Oligodendrocytes (OLs) are the myelinating cells of the CNS that are essential for efficient saltatory conduction and axonal survival. They are derived from OL precursors (OPCs) and progress into transcriptomically heterogeneous OL sub-populations along the differentiation pathway to produce mature OLs, capable of myelination. These epigenetic transitions between different OL subpopulations are carefully regulated, spatially and temporally, by a network of transcription factors, chromatin modulators and lncRNAs. In demyelinating diseases like multiple sclerosis (MS), patients suffer immune mediated attacks against myelin. Eventually, remyelination strategies fail due to deficits in OPC migration and OL differentiation at the site of lesions. Thus, understanding molecular mechanisms governing OL differentiation and myelination is crucial not only for understanding OL function in health but also in disease, in order to develop suitable therapeutic interventions. The investigations presented in this thesis explore the role of lncRNAs and RNA-binding proteins in neurodevelopment, particularly in embryonic stem cells (ESCs) and cells of the OL lineage. Article 1 provides a resource for the protein interactome of a key pioneering transcription factor, Sox2, in different nuclear fractions of mouse ESCs. 
We found Sox2 to be a multifaceted regulator forming interactions with the HP1 family of proteins, whose members perform as both activators and repressors in a context-dependent manner. In addition to interacting with RBPs involved in post-transcriptional processes, Sox2 also interacted with Rn7sk, a well-known ncRNA involved in the regulation of transcriptional elongation at promoters and enhancers. Although they did not influence each other's recruitment to the chromatin, this interaction opens up the possibility for ncRNA-mediated modulation of ES transcriptional programs dependent on Sox2. Article 2 draws important insights regarding lncRNAs from a broad transcriptomic resource established from single-cell as well as bulk RNA sequencing of OL lineage cells from different developmental stages. From a subset of lncRNAs which were found to be specific for certain OL subpopulations, we investigated the role of 2610035D17Rik in modulating the expression of its neighboring gene, Sox9, a transcription factor essential for OPC specification. We decoupled the role of the lncRNA transcript from its genomic locus using various loss-of-function strategies and found that the regulation of Sox9 was dependent on the regulatory elements and/or ongoing transcription at the 2610035D17Rik locus, rather than the transcript itself. In Article 4, we investigated a hitherto unexplored RNA-binding function of myelin gene expression factor 2 (Myef2), a known transcriptional repressor of myelin basic protein (MBP). To this end, we uncovered the RNA interactome of Myef2 in a mouse oligodendroglial cell line with individual nucleotide resolution CLIP (iCLIP) followed by sequencing. We show that Myef2 interacts with CUG motifs located within introns and 3' UTRs of protein-coding genes, a finding which implicates Myef2 in post-transcriptional processes like splicing and RNA stability. 
Finally, in Article 3 we identified disease-specific transcriptomic profiles of OL lineage cells through single-cell RNA sequencing of OPCs and OLs derived from experimental autoimmune encephalomyelitis (EAE) mice, a model that recapitulates several aspects of MS. EAE-specific OPC and OL clusters were enriched for genes involved in antigen processing and presentation (MHC class I/II). We could demonstrate that OPCs can phagocytose myelin debris and that MHC-II-expressing OPCs can activate memory and effector CD4-positive T cells. These findings show OL lineage cells as active participants in MS pathology rather than passive targets. Further, the findings of Article 2 implicate 2610035D17Rik as a regulator of immunomodulatory properties of oligodendroglia, as 2610035D17Rik KO cells showed reduced expression of IFNγ-responsive genes and elevated expression of those involved in antigen presentation, compared to the controls, following IFNγ stimulation.

    Differential evolution of non-coding DNA across eukaryotes and its close relationship with complex multicellularity on Earth

    Here, I elaborate on the hypothesis that complex multicellularity (CM, sensu Knoll) is a major evolutionary transition (sensu Szathmary), which has convergently evolved a few times in Eukarya only: within red and brown algae, plants, animals, and fungi. Paradoxically, CM seems to correlate with the expansion of non-coding DNA (ncDNA) in the genome rather than with genome size or the total number of genes. Thus, I investigated the correlation between genome and organismal complexities across 461 eukaryotes under a phylogenetically controlled framework. To that end, I introduce the first formal definitions and criteria to distinguish ‘unicellularity’, ‘simple’ (SM) and ‘complex’ multicellularity. Rather than using the limited available estimations of unique cell types, the 461 species were classified according to our criteria by reviewing their life cycle and body plan development from the literature. Then, I investigated the evolutionary association between genome size and 35 genome-wide features (introns and exons from protein-coding genes, repeats and intergenic regions) describing the coding and ncDNA complexities of the 461 genomes. To that end, I developed ‘GenomeContent’, a program that systematically retrieves massive multidimensional datasets from gene annotations and calculates over 100 genome-wide statistics. R scripts coupled to parallel computing were created to calculate >260,000 phylogenetically controlled pairwise correlations. As previously reported, both repetitive and non-repetitive DNA are found to scale strongly and positively with genome size across most eukaryotic lineages. In contrast to previous studies, I demonstrate that changes in the length and repeat composition of introns are only weakly or moderately associated with changes in genome size at the global phylogenetic scale, while changes in intron abundance (within and across genes) are either not or only very weakly associated with changes in genome size. 
Our evolutionary correlations are robust to: different phylogenetic regression methods, uncertainties in the tree of eukaryotes, variations in genome size estimates, and randomly reduced datasets. Then, I investigated the correlation between the 35 genome-wide features and the cellular complexity of the 461 eukaryotes with phylogenetic Principal Component Analyses. Our results endorse a genetic distinction between SM and CM in Archaeplastida and Metazoa, but not so clearly in Fungi. Remarkably, complex multicellular organisms and their closest ancestral relatives are characterized by high intron-richness, regardless of genome size. Finally, I argue why and how a vast expansion of non-coding RNA (ncRNA) regulators rather than of novel protein regulators can promote the emergence of CM in Eukarya. As a proof of concept, I co-developed a novel ‘ceRNA-motif pipeline’ for the prediction of “competing endogenous” ncRNAs (ceRNAs) that regulate microRNAs in plants. We identified three candidate ceRNA motifs: MIM166, MIM171 and MIM159/319, which were found to be conserved across land plants and to be potentially involved in diverse developmental processes and stress responses. Collectively, the findings of this dissertation support our hypothesis that CM on Earth is a major evolutionary transition promoted by the expansion of two major ncDNA classes, introns and regulatory ncRNAs, which might have boosted the irreversible commitment of cell types in certain lineages by canalizing the timing and kinetics of the eukaryotic transcriptome.

    Table of contents: Cover page Abstract Acknowledgements Index 1. The structure of this thesis 1.1. Structure of this PhD dissertation 1.2. Publications of this PhD dissertation 1.3. Computational infrastructure and resources 1.4. Disclosure of financial support and information use 1.5. Acknowledgements 1.6. Author contributions and use of impersonal and personal pronouns 2. Biological background 2.1. The complexity of the eukaryotic genome 2.2. The problem of counting and defining “genes” in eukaryotes 2.3. The “function” concept for genes and “dark matter” 2.4. Increases of organismal complexity on Earth through multicellularity 2.5. Multicellularity is a “fitness transition” in individuality 2.6. The complexity of cell differentiation in multicellularity 3. Technical background 3.1. The Phylogenetic Comparative Method (PCM) 3.2. RNA secondary structure prediction 3.3. Some standards for genome and gene annotation 4. What is in a eukaryotic genome? GenomeContent provides a good answer 4.1. Background 4.2. Motivation: an interoperable tool for data retrieval of gene annotations 4.3. Methods 4.4. Results 4.5. Discussion 5. The evolutionary correlation between genome size and ncDNA 5.1. Background 5.2. Motivation: estimating the relationship between genome size and ncDNA 5.3. Methods 5.4. Results 5.5. Discussion 6. The relationship between non-coding DNA and Complex Multicellularity 6.1. Background 6.2. Motivation: How to define and measure complex multicellularity across eukaryotes? 6.3. Methods 6.4. Results 6.5. Discussion 7. The ceRNA motif pipeline: regulation of microRNAs by target mimics 7.1. Background 7.2. A revisited protocol for the computational analysis of Target Mimics 7.3. Motivation: a novel pipeline for ceRNA motif discovery 7.4. Methods 7.5. Results 7.6. Discussion 8. Conclusions and outlook 8.1. Contributions and lessons for the bioinformatics of large-scale comparative analyses 8.2. Intron features are evolutionarily decoupled among themselves and from genome size throughout Eukarya 8.3. “Complex multicellularity” is a major evolutionary transition 8.4. Role of RNA throughout the evolution of life and complex multicellularity on Earth 9. Supplementary Data Bibliography Curriculum Scientiae Selbständigkeitserklärung (declaration of authorship)

    Clinical potential of oligonucleotide-based therapeutics in the respiratory system

    The discovery of an ever-expanding plethora of coding and non-coding RNAs with nodal and causal roles in the regulation of lung physiology and disease is reinvigorating interest in the clinical utility of the oligonucleotide therapeutic class. This is strongly supported through recent advances in nucleic acids chemistry, synthetic oligonucleotide delivery and viral gene therapy that have succeeded in bringing to market at least three nucleic acid-based drugs. As a consequence, multiple new candidates such as RNA interference modulators, antisense, and splice switching compounds are now progressing through clinical evaluation. Here, manipulation of RNA for the treatment of lung disease is explored, with emphasis on robust pharmacological evidence aligned to the five pillars of drug development: exposure to the appropriate tissue, binding to the desired molecular target, evidence of the expected mode of action, activity in the relevant patient population and commercially viable value proposition.

    Revising the evolutionary imprint of RNA structure in mammalian genomes

    LincRNA profile in clear cell renal cell carcinoma using RNA-seq data

    Master's thesis, Bioinformatics and Computational Biology (Computational Biology), Universidade de Lisboa, Faculdade de Ciências, 2015.

    Kidney cancer, or renal cell carcinoma (RCC), is a common group of chemotherapy-resistant diseases. It is one of the most lethal cancers of the urinary system: the survival rate for patients with metastatic RCC is below 10% five years after diagnosis. Based on their genetic and histological characteristics, clinical phenotype and different responses to therapy, RCCs can be subdivided into several types, one of the most common being clear cell renal cell carcinoma (ccRCC), which accounts for more than 80% of RCC cases. One characteristic of ccRCC, as of other types of cancer, is the metabolism of glucose through glycolysis followed by lactate production, a process first described by Warburg (the "Warburg effect"). This effect occurs in place of the normal glycolysis followed by mitochondrial oxidative phosphorylation to produce adenosine triphosphate (ATP). This characteristic derives mainly from an inactive von Hippel-Lindau (VHL) gene; however, mutations that can inactivate its function are found in only 52% of ccRCC samples. This mutation may therefore not be sufficient to explain this carcinoma, and more studies are needed to understand it. An important role in this cancer has also been attributed to epigenetic regulation, as well as to deregulated microRNAs. Since the beginning of the 21st century, several global projects have made it possible to discard the idea that the human genome is mostly "junk", helped by the development of next-generation sequencing (NGS) technologies. These include Roche 454, Illumina/Solexa and ABI SOLiD technologies, which allow the whole genome or transcriptome to be sequenced at once. This sequencing requires fragmentation of the genetic material, a parallel polymerase chain reaction (PCR), and determination of the sequence through fluorescence. In recent years, third-generation sequencing technologies (such as PacBio and Helicos) have emerged that can determine the sequence from individual DNA molecules, without the need for PCR. Currently, most of the available technologies have various advantages and limitations, and the most important factor in choosing among them is the balance between the objectives and the available budget. One technique that takes advantage of these technologies is RNA sequencing (RNA-seq), which uses next-generation sequencing to analyse all the ribonucleic acid (RNA) molecules of one or more cells, the transcriptome. Analysis of this type of data provides information at the sequence level as well as on transcription levels, facilitating the development of new therapies and the interpretation of experimental data. This technological revolution led to the recognition that the transcriptome consists not only of protein-coding transcripts but also of a large number of non-coding transcripts, generated from regions that were believed to be "deserts". The widespread transcription of non-coding regions may give rise to functional molecules. It is thus evident that non-coding elements need to be taken into account when genome-wide association studies are performed. Non-coding transcripts (non-coding RNA, ncRNA) are associated with various cellular functions and can be divided into several categories according to their size and their location relative to protein-coding genes. One group of ncRNAs is the long intergenic non-coding RNAs (lincRNAs), which are longer than 200 nucleotides and do not overlap any other annotated genes. They have no specific distinguishing feature and can be transcribed by the same machinery that transcribes protein-coding genes. They typically have about 2 to 3 exons, and their expression level is lower than that of coding genes. The biological role of most of them is still largely unknown, although some have been associated with several types of cancer. Despite the number of studies carried out on ccRCC and the number of mutations identified, this subtype of renal carcinoma is still not understood. It was therefore decided to explore the lincRNA expression profile in ccRCC and to quantify differences in lincRNA expression by comparing normal and tumor samples from 62 patients with ccRCC. This requires assembling the baseline ccRCC transcriptome for the discovery of potential new lincRNAs, analysing differential lincRNA expression, and showing its correlation with protein-coding genes. To that end, a computational analysis of RNA-seq data from 62 ccRCC patient samples (paired tumor and normal samples) was used. First, a catalogue of human lincRNAs was built using lincRNA annotations from several databases (Ensembl, Gencode, Vega, Lncipedia, UCSC, the Broad Institute, Noncode, and data published by Zhipeng and Adelson). The lack of correspondence between the different databases increased the complexity of the process, but in the end a catalogue of 38 134 human lincRNAs was obtained. Next, the ccRCC transcriptome was reconstructed to serve as the basis for the discovery of new lincRNAs. Characterisation of the 62 ccRCC patient samples (tumor and normal combined) revealed 5549 potential new lincRNAs. Differential analysis between cancer and normal tissue samples identified 2129 differentially expressed genes (including 239 lincRNAs and 105 potential new lincRNAs). Because of their low expression levels, the statistical test was not even performed for many of the lincRNAs. For this reason, the last step involved an analysis that takes into account the relationship between transcripts, independently of their differential expression. A gene correlation network analysis was performed, which identified genes highly correlated with each other and with sample type (tumor/normal). Of note is the lincRNA PVT1, which has previously been associated with other types of cancer and is highly expressed in ccRCC tumor samples. Patients with high relative expression of this lincRNA in normal samples have a lower probability of survival than those with lower relative expression. Overall, this analysis made it possible to take the first steps towards understanding the importance of lincRNAs in ccRCC.

    Kidney cancer, or renal cell carcinoma (RCC), is a common group of chemotherapy-resistant diseases and one of the most lethal types of cancer in the urinary system: fewer than 10% of patients with metastatic RCC survive five years after diagnosis. Based on their genetic characteristics, histological features, clinical phenotype and different responses to therapy, RCCs can be subdivided into several subtypes, one of the most common being clear cell RCC (ccRCC), which accounts for more than 80% of RCC cases. 
ccRCC is usually characterized by an inactive von Hippel–Lindau (VHL) gene, but inactivating VHL mutations were observed in only 52% of samples, which may indicate that this mutation is not sufficient to explain this carcinoma and that more studies are needed to understand it. An important role for epigenetic regulation has also been suggested for ccRCC, as well as for deregulated microRNAs. The development of next-generation sequencing (NGS) technologies made it possible to analyse a larger number of transcriptomes. This led to the recognition that a transcriptome consists not only of protein-coding transcripts but also of a large number of non-coding transcripts. These transcripts are transcribed from regions previously thought to be “deserts”. This widespread transcription of non-coding regions may give rise to functional molecules, making it apparent that non-coding elements need to be taken into account in genome-wide association studies. Non-coding RNAs (ncRNAs) are associated with many functions, and one group of ncRNAs, the long intergenic ncRNAs, which do not overlap other annotated genes, has been associated with several other cancers. Despite the number of studies carried out on ccRCC and the number of mutations identified, it is still not possible to understand this subtype of renal carcinoma. Thus, we decided to explore the long intergenic non-coding RNA (lincRNA) profile in ccRCC and to quantify differences in gene expression between normal and tumor samples. This requires assembling the ccRCC transcriptome as a basis for the discovery of potentially new lincRNAs, analysing differential lincRNA expression, and showing its correlation with protein-coding genes. To achieve that, a computational analysis of paired-end RNA-seq data from 62 ccRCC patient samples (tumor and matched normal) was used. 
To accomplish these objectives, a human lincRNA catalogue, with lincRNA annotations from several databases (Ensembl, Gencode, Vega, Lncipedia, UCSC, Broad Institute, Noncode, and data published by Zhipeng and Adelson), had to be constructed. The main concern was to have the most complete resource for assessing lincRNA expression. For that, 8 different databases with lincRNA annotations were merged to obtain a unified human catalogue of 38 134 lincRNAs. To uncover the lincRNA profile in ccRCC, the transcriptome composition of 62 ccRCC patient samples (tumor and matched normal) was assessed. Available bioinformatic tools were used, which made it possible to identify 5549 potentially new lincRNAs and 2129 differentially expressed genes (including 239 lincRNAs and 105 potentially new lincRNAs). To proceed with an analysis that takes into account the relationship between transcripts, independently of their differential expression, a weighted gene correlation network analysis followed. This analysis identified highly co-expressed/correlated genes as well as genes highly correlated with sample type (tumor/normal), leading to the identification of the PVT1 lincRNA. This lincRNA had already been associated with other cancers and is highly upregulated in ccRCC tumor samples. Patients with relatively high expression of this lincRNA in normal samples also show poor survival chances. Overall, this analysis made it possible to take the first steps towards understanding the importance of lincRNAs in ccRCC.
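
    The lincRNA criteria stated in this abstract (longer than 200 nucleotides, no overlap with other annotated genes) can be sketched as a simple filter. This is an illustrative sketch only, not the thesis pipeline: the `Interval` type, the function names and the coordinates are hypothetical, half-open 0-based coordinates are assumed, and strand and coding potential are ignored.

```python
from typing import Iterable, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def is_lincrna_candidate(
    locus: Interval,
    transcript_length: int,
    annotated_genes: Iterable[Interval],
    min_length: int = 200,
) -> bool:
    """Apply the two criteria from the abstract: length > 200 nt and no gene overlap."""
    if transcript_length <= min_length:
        return False
    return not any(overlaps(locus, gene) for gene in annotated_genes)


if __name__ == "__main__":
    genes = [Interval("chr1", 5000, 9000), Interval("chr2", 100, 900)]
    print(is_lincrna_candidate(Interval("chr1", 1000, 2500), 850, genes))  # intergenic, long enough
    print(is_lincrna_candidate(Interval("chr1", 8500, 9500), 850, genes))  # overlaps an annotated gene
```

    The transcript length is passed separately from the locus because the spliced length of a multi-exon transcript is shorter than its genomic span.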

    Understanding the Code of Life: Holistic Conceptual Modeling of the Genome

    Over the last few decades, advances in sequencing technology have produced significant amounts of genomic data, which has revolutionised our understanding of biology. However, the amount of data generated has far exceeded our ability to interpret it. Deciphering the code of life is a grand challenge. Despite our progress, our understanding of it remains minimal, and we are just beginning to uncover its full potential, for instance, in areas such as precision medicine or pharmacogenomics. The main objective of this thesis is to advance our understanding of life by proposing a holistic, model-based approach consisting of three artifacts: i) a conceptual schema of the genome, ii) a method for its application in the real world, and iii) the use of foundational ontologies to represent domain knowledge in a more unambiguous and explicit way. The first two contributions have been validated by implementing genome information systems based on conceptual models. The third contribution has been validated by empirical experiments assessing whether using foundational ontologies leads to a better understanding of the genomic domain. The artifacts generated offer significant benefits. First, more efficient data management processes were produced, leading to better knowledge extraction processes. Second, a better understanding and communication of the domain was achieved.

    The fruitful discussions and the results derived from the projects INNEST2021 /57, MICIN/AEI/10.13039/501100011033, PID2021-123824OB-I00, CIPROM/2021/023 and PDC2021-121243-I00 contributed greatly to the final quality of this thesis.

    García Simón, A. (2022). Understanding the Code of Life: Holistic Conceptual Modeling of the Genome [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/19143

    The Eukaryotic Chromatin Computer: Components, Mode of Action, Properties, Tasks, Computational Power, and Disease Relevance

    Eukaryotic genomes are typically organized as chromatin, the complex of DNA and proteins that forms chromosomes within the cell's nucleus. Chromatin has pivotal roles for a multitude of functions, most of which are carried out by a complex system of covalent chemical modifications of histone proteins. The propagation of patterns of these histone post-translational modifications across cell divisions is particularly important for maintenance of the cell state in general and the transcriptional program in particular. The discovery of epigenetic inheritance phenomena - mitotically and/or meiotically heritable changes in gene function resulting from changes in a chromosome without alterations in the DNA sequence - was remarkable because it disproved the assumption that information is passed to daughter cells exclusively through DNA. However, DNA replication constitutes a dramatic disruption of the chromatin state that effectively amounts to partial erasure of stored information. To preserve its epigenetic state the cell reconstructs (at least part of) the histone post-translational modifications by means of processes that are still very poorly understood. A plausible hypothesis is that the different combinations of reader and writer domains in histone-modifying enzymes implement local rewriting rules that are capable of "recomputing" the desired parental patterns of histone post-translational modifications on the basis of the partial information contained in that half of the nucleosomes that predate replication. It is becoming increasingly clear that both information processing and computation are omnipresent and of fundamental importance in many fields of the natural sciences and the cell in particular. The latter is exemplified by the increasingly popular research areas that focus on computing with DNA and membranes. 
Recent work suggests that during evolution, chromatin has been converted into a powerful cellular memory device capable of storing and processing large amounts of information. Eukaryotic chromatin may therefore also act as a cellular computational device capable of performing actual computations in a biological context. A recent theoretical study indeed demonstrated that even relatively simple models of chromatin computation are computationally universal and hence conceptually more powerful than gene regulatory networks. In the first part of this thesis, I establish a deeper understanding of the computational capacities and limits of chromatin, which have remained largely unexplored. I analyze selected biological building blocks of the chromatin computer and compare them with the system components of general-purpose computers, focusing particularly on memory and on logical and arithmetical operations. I argue that it has a massively parallel architecture, a set of read-write rules that operate non-deterministically on chromatin, the capability of self-modification, and more generally striking analogies to amorphous computing. I therefore propose a cellular automata-like 1-D string as its computational paradigm, on which sets of local rewriting rules are applied asynchronously with time-dependent probabilities. Its mode of operation is therefore conceptually similar to well-known concepts from complex systems theory. Furthermore, the chromatin computer provides volatile memory with a massive information content that can be exploited by the cell. I estimate that its memory size lies in the range of several hundred megabytes of writable information per cell, a value that I compare with DNA itself and cis-regulatory modules. I furthermore show that it has the potential to perform computations not only in a biological context but also in a strict informatics sense. 
At least theoretically it may therefore be used to calculate any computable function or algorithm more generally. Chromatin is therefore another representative of the growing number of non-standard computing examples. As an example of a biological challenge that may be solved by the "chromatin computer", I formulate epigenetic inheritance as a computational problem and develop a flexible stochastic simulation system for the study of recomputation-based epigenetic inheritance of individual histone post-translational modifications. The implementation uses Gillespie's stochastic simulation algorithm for exactly simulating the time evolution of the chemical master equation of the underlying stochastic process. Furthermore, it is efficient enough to use an evolutionary algorithm to find a system of enzymes that can stably maintain a particular chromatin state across multiple cell divisions. I find that it is easy to evolve such a system of enzymes even without explicit boundary elements separating differentially modified chromatin domains. However, the success of this task depends on several previously unanticipated factors such as the length of the initial state, the specific pattern that should be maintained, the time between replications, and various chemical parameters. All these factors also influence the accumulation of errors in the wake of cell divisions. Chromatin-regulatory processes and epigenetic (inheritance) mechanisms constitute an intricate and sensitive system, and any misregulation may contribute significantly to various diseases such as Alzheimer's disease. Intriguingly, the role of epigenetics and chromatin-based processes as well as non-coding RNAs in the etiology of Alzheimer's disease is increasingly being recognized. 
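The recomputation idea described so far can be illustrated with a minimal sketch of Gillespie's direct method on a 1-D string of nucleosomes. This is not the thesis implementation: the two-state nucleosome (0 = unmodified, 1 = modified), the single neighbour-dependent writing rule, the spontaneous erasing rule, the rate names `k_write` and `k_erase`, and the 50% random loss of marks at replication are all assumptions made here for illustration only.

```python
import random


def possible_events(state, k_write, k_erase):
    """List (rate, index, new_value) for every rewriting event with a positive rate."""
    events = []
    n = len(state)
    for i, mark in enumerate(state):
        if mark == 1:
            # Assumed rule: a modified nucleosome loses its mark spontaneously.
            if k_erase > 0:
                events.append((k_erase, i, 0))
        else:
            # Assumed rule: a reader-writer enzyme copies the mark from
            # modified direct neighbours; the rate scales with their number.
            neighbours = sum(state[j] for j in (i - 1, i + 1) if 0 <= j < n)
            if neighbours and k_write > 0:
                events.append((k_write * neighbours, i, 1))
    return events


def gillespie(state, t_end, k_write, k_erase, rng):
    """Simulate the string until t_end with Gillespie's direct method."""
    state = list(state)  # work on a copy, leave the input untouched
    t = 0.0
    while True:
        events = possible_events(state, k_write, k_erase)
        total = sum(rate for rate, _, _ in events)
        if total == 0:
            break  # no event can fire any more
        t += rng.expovariate(total)  # exponentially distributed waiting time
        if t > t_end:
            break
        # Pick one event with probability proportional to its rate.
        pick = rng.random() * total
        acc = 0.0
        for rate, i, new_value in events:
            acc += rate
            if pick < acc:
                break
        state[i] = new_value
    return state


def replicate(state, rng):
    """Assumed replication model: each nucleosome is kept with probability 0.5,
    otherwise it is replaced by a new, unmodified one."""
    return [mark if rng.random() < 0.5 else 0 for mark in state]


if __name__ == "__main__":
    rng = random.Random(1)
    chromatin = [1] * 20
    for generation in range(5):
        chromatin = replicate(chromatin, rng)
        chromatin = gillespie(chromatin, t_end=10.0, k_write=1.0, k_erase=0.05, rng=rng)
        print(generation, "".join(map(str, chromatin)))
```

In this toy setting, whether the fully modified pattern survives depends on the ratio of the assumed rates and on the time between replications. A search procedure such as an evolutionary algorithm could be wrapped around `gillespie` to tune the rule set, which is not attempted here.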
In the second part of this thesis, I explicitly and systematically address the two hypotheses that (i) a dysregulated chromatin computer plays important roles in Alzheimer's disease and (ii) Alzheimer's disease may be considered as an evolutionarily young disease. In summary, I found support for both hypotheses, although for hypothesis (i) it is very difficult to establish causalities due to the complexity of the disease. However, I identify numerous chromatin-associated, differentially expressed loci for histone proteins, chromatin-modifying enzymes or integral parts thereof, non-coding RNAs with guiding functions for chromatin-modifying complexes, and proteins that directly or indirectly influence epigenetic stability (e.g., by altering cell cycle regulation and therefore potentially also the stability of epigenetic states). Notably, we generally observed enrichment of probes located in non-coding regions, particularly antisense to known annotations (e.g., introns). For the identification of differentially expressed loci in Alzheimer's disease, I use a custom expression microarray that was constructed with a novel bioinformatics pipeline. Despite the emergence of more advanced high-throughput methods such as RNA-seq, microarrays still offer some advantages and will remain a useful and accurate tool for transcriptome profiling and expression studies. However, it is non-trivial to establish an appropriate probe design strategy for custom expression microarrays because alternative splicing and transcription from non-coding regions are much more pervasive than previously appreciated. To obtain an accurate and complete expression atlas of genomic loci of interest in the post-ENCODE era, this additional transcriptional complexity must be considered during microarray design and requires well-considered probe design strategies that are often neglected. 
This encompasses, for example, adequate preparation of a set of target sequences and accurate estimation of probe specificity. With the help of this pipeline, two custom-tailored microarrays have been constructed that include a comprehensive collection of non-coding RNAs. Additionally, a user-friendly web server has been set up that makes the developed pipeline publicly available for other researchers.
Eukaryotic genomes are typically organized as chromatin, the complex of DNA and proteins of which the chromosomes in the cell nucleus consist. Chromatin has vital functions in a multitude of processes, most of which proceed through a complex system of covalent modifications of histone proteins. Patterns of these modifications are important information carriers, and passing them on to both daughter cells across cell division is particularly important for maintaining the cell state in general and the transcriptional program in particular. The discovery of epigenetic inheritance phenomena - mitotically and/or meiotically heritable changes in gene function caused by changes to chromosomes that are not attributable to modifications of the DNA sequence - was remarkable because it refuted the hypothesis that information is passed to daughter cells exclusively through DNA. DNA replication causes a dramatic disruption of the chromatin state, which ultimately results in partial erasure of the stored information. To preserve the epigenetic state, the cell must reconstruct parts of the parental patterns of histone modifications by processes that are still very poorly understood. 
A plausible hypothesis postulates that the different combinations of reader and writer domains within histone-modifying enzymes implement local rewriting rules that ultimately recompute the parental modification pattern of the histones. This is done on the basis of the partial information stored in the inherited half of the histones. It is becoming increasingly clear that both information processing and computer-like computation are omnipresent and of fundamental importance in many areas of the natural sciences, particularly in the cell. This is exemplified by the increasingly popular research areas that focus on computer-like computation using DNA and membranes. Recent research suggests that chromatin has developed during evolution into a powerful cellular memory unit capable of storing and processing a large amount of information. Eukaryotic chromatin could therefore act as a cellular computer capable of performing computer-like computations in a biological context. A theoretical study recently demonstrated that even relatively simple models of a chromatin computer are computationally universal and thus more powerful than pure gene regulatory networks. In the first part of my dissertation, I establish a deeper understanding of the capabilities and limitations of the chromatin computer, which have so far remained largely unexplored. I analyze selected basic components of the chromatin computer and compare them with the components of a classical computer, with particular focus on memory and on logical and arithmetic operations. 
I argue that chromatin has a massively parallel architecture, a set of read-write rules that operate non-deterministically on chromatin, the capability of self-modification, and, more generally, striking similarities to amorphous computing models. I therefore propose a cellular automaton-like one-dimensional chain as the computational paradigm, on which local read-write rules are executed asynchronously with time-dependent probabilities. Its mode of operation is thus conceptually similar to well-known theories of complex systems. In addition, the chromatin computer has volatile memory with a massive information content that can be used by the cell. I estimate that the memory capacity is in the range of several hundred megabytes of writable information per cell, which I also compare with DNA and cis-regulatory modules. I further show that a chromatin computer can perform computations not only in a biological context but also in a strict informatics sense. At least theoretically, it can therefore be used for any computable function. Chromatin is thus a further example of the growing number of unconventional computing models. As an example of a biological challenge that can be solved by the chromatin computer, I formulate epigenetic inheritance as a computational problem. I develop a flexible simulation system for studying the epigenetic inheritance of individual histone modifications, which is based on recomputing the partially lost information of the histone modifications. The implementation uses Gillespie's stochastic simulation algorithm to model exactly the time evolution of the chemical master equation of the underlying stochastic processes. 
The algorithm is also efficient enough to be embedded in an evolutionary algorithm. This combination makes it possible to find a system of enzymes that can stably pass on a particular chromatin state across several cell divisions. I found that it is relatively easy to evolve such a system of enzymes, even without explicitly including boundary elements to separate differentially modified chromatin domains. Nevertheless, the success of this task depends on several previously overlooked factors, such as the length of the domain, the specific pattern to be inherited, the time between replications, and various chemical parameters. All these factors influence the accumulation of errors as a consequence of cell divisions. Chromatin-regulatory processes and epigenetic inheritance mechanisms constitute a complex and sensitive system, and any misregulation can contribute significantly to various diseases such as Alzheimer's disease. In the etiology of Alzheimer's disease, the importance of epigenetic and chromatin-based processes as well as non-coding RNAs is increasingly being recognized. In the second part of the dissertation, I explicitly and systematically address the two hypotheses that (i) a misregulated chromatin computer plays an important role in Alzheimer's disease and (ii) Alzheimer's disease is an evolutionarily young disease. In summary, I find evidence for both hypotheses, although for the former it is difficult to establish causalities because of the complexity of the disease. 
Nevertheless, I identify numerous differentially expressed, chromatin-associated loci, such as histones, chromatin-modifying enzymes or their integral components, non-coding RNAs with guiding functions for chromatin-modifying complexes, or proteins that directly or indirectly influence epigenetic stability through altered cell cycle regulation. To identify differentially expressed loci in Alzheimer's disease, I use a custom expression microarray that was created with the help of a novel bioinformatics pipeline. Despite the emergence of more advanced high-throughput methods such as RNA-seq, microarrays still have some advantages and will remain a useful and accurate tool for expression studies and transcriptome profiling. However, it is not trivial to find a suitable probe design strategy for custom expression microarrays, because alternative splicing and transcription from non-coding regions are far more widespread than originally assumed. To obtain an accurate and complete picture of the expression of genomic regions in the era after the ENCODE project, this additional transcriptional complexity must be taken into account already during microarray design and therefore requires well-considered, and often ignored, probe design strategies. This includes, for example, adequate preparation of the target sequences and accurate estimation of probe specificity. With the help of the pipeline, two custom expression microarrays were produced, both of which contain a comprehensive collection of non-coding RNAs. In addition, a user-friendly web server was programmed that makes the developed pipeline publicly available to everyone.