
    Methods for developing a machine learning framework for precise 3D domain boundary prediction at base-level resolution

    High-throughput chromosome conformation capture technology (Hi-C) has revealed extensive DNA looping and folding into discrete 3D domains. These include Topologically Associating Domains (TADs) and chromatin loops, 3D domains critical for cellular processes like gene regulation and cell differentiation. The relatively low resolution of Hi-C data (regions of several kilobases in size) prevents precise mapping of domain boundaries by conventional TAD/loop-callers. However, high-resolution genomic annotations associated with boundaries, such as CTCF and members of the cohesin complex, suggest a computational approach for precisely locating domain boundaries. We developed preciseTAD, an optimized machine learning framework that leverages a random forest model to improve the localization of domain boundaries. Our method introduces three concepts: shifted binning, distance-type predictors, and random under-sampling, which we use to build classification models for predicting boundary regions. The algorithm then uses density-based clustering (DBSCAN) and partitioning around medoids (PAM) to extract the most biologically meaningful domain boundaries from models trained on high-resolution genome annotation data and boundaries from low-resolution Hi-C data. We benchmarked our method against a popular TAD-caller and a novel chromatin loop prediction algorithm. Boundaries predicted by preciseTAD were more enriched for known molecular drivers of 3D chromatin organization, including CTCF, RAD21, SMC3, and ZNF143. preciseTAD-predicted boundaries were also more conserved across cell lines, highlighting their higher biological significance. Additionally, models pre-trained in one cell line accurately predict boundaries in another cell line. Using cell line-specific genomic annotations, the pre-trained models enable detecting domain boundaries in cells without Hi-C data. The research presented provides a unified approach for precisely predicting domain boundaries.
This improved precision will provide insight into the association between genomic regulators and 3D genome organization. Furthermore, our methods provide researchers with flexible and easy-to-use tools to continue annotating the 3D structure of the human genome without relying on costly high-resolution Hi-C data. The preciseTAD R package and a supplementary ExperimentHub package, preciseTADhub, are available on Bioconductor (version 3.13; https://bioconductor.org/packages/preciseTAD/; https://bioconductor.org/packages/preciseTADhub/).
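The boundary-calling recipe described above can be sketched end-to-end in a few lines. This is an illustrative Python mock-up of the concepts only (preciseTAD itself is an R/Bioconductor package): the data are synthetic, the feature names are hypothetical stand-ins for the distance-type predictors, and a simple cluster median stands in for the PAM medoid step.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic 100-bp bins along a chromosome with three planted boundaries.
pos_bp = np.arange(3000) * 100
centers = np.array([50_000, 150_000, 250_000])
true_dist = np.abs(pos_bp[:, None] - centers).min(axis=1)

# Hypothetical distance-type predictors: distance from each bin to the
# nearest CTCF/RAD21 site (true boundaries sit near both annotations).
dist_ctcf = true_dist + rng.exponential(500, pos_bp.size)
dist_rad21 = true_dist + rng.exponential(800, pos_bp.size)
X = np.log1p(np.column_stack([dist_ctcf, dist_rad21]))
y = (true_dist < 2000).astype(int)  # bins within 2 kb of a boundary

# Random under-sampling: balance the rare boundary class.
pos_idx = np.flatnonzero(y == 1)
neg_idx = rng.choice(np.flatnonzero(y == 0), size=pos_idx.size, replace=False)
idx = np.concatenate([pos_idx, neg_idx])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[idx], y[idx])

# Cluster high-probability positions with DBSCAN; the median of each
# cluster stands in for the PAM medoid used by the real method.
prob = clf.predict_proba(X)[:, 1]
hits = pos_bp[prob > 0.5].reshape(-1, 1)
labels = DBSCAN(eps=500, min_samples=5).fit_predict(hits)
boundaries = sorted(int(np.median(hits[labels == k])) for k in set(labels) - {-1})
```

On this synthetic input the three recovered boundary coordinates land near the three planted centers, illustrating how base-level calls emerge from bin-level probabilities.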

    MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics

    Background Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied at the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found it superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which can be useful for other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs. Results Here we present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics: Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA and a new method, RF-MOCCA, a Random Forest derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first to apply the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and boundary elements beyond that of previous methods, including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests, with RF-MOCCA yielding the best results. Conclusion MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to new CRE modelling problems where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs.
For ease of use, MOCCA implements the generation of negative training data and, additionally, a mode that requires the user to specify only positives, motifs, and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA.
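As a minimal illustration of the motif-occurrence features such models are built on (this is the simple occurrence-frequency baseline, not MOCCA's hierarchical per-occurrence modelling), IUPAC motifs can be expanded into regular-expression character classes and counted with overlap:

```python
import re

# IUPAC degenerate nucleotide codes expanded to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def iupac_to_regex(motif: str) -> str:
    return "".join(IUPAC[c] for c in motif)

def motif_counts(seq: str, motifs: list[str]) -> list[int]:
    # A lookahead pattern counts overlapping occurrences.
    return [len(re.findall("(?=" + iupac_to_regex(m) + ")", seq)) for m in motifs]

# Example motifs are invented for illustration (a GAGA-like repeat and CCAT).
print(motif_counts("TTGAGAGAGCCATTA", ["GAGAG", "CCAT"]))  # → [2, 1]
```

Such per-sequence count vectors are what a frequency-based SVM or Random Forest would consume; MOCCA's contribution is to model each occurrence's local context before aggregating.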

    Analysis, Visualization, and Machine Learning of Epigenomic Data

    The goal of the Encyclopedia of DNA Elements (ENCODE) project has been to characterize all the functional elements of the human genome. These elements include expressed transcripts and genomic regions bound by transcription factors (TFs), occupied by nucleosomes, occupied by nucleosomes with modified histones, or hypersensitive to DNase I cleavage. Chromatin Immunoprecipitation sequencing (ChIP-seq) is an experimental technique for detecting TF binding in living cells, and the genomic regions bound by TFs are called ChIP-seq peaks. ENCODE has performed and compiled results from tens of thousands of experiments, including ChIP-seq, DNase, RNA-seq and Hi-C. These efforts have culminated in two web-based resources from our lab, Factorbook and SCREEN, for the exploration of epigenomic data for both human and mouse. Factorbook is a peak-centric resource presenting data such as motif enrichment and histone modification profiles for transcription factor binding sites computed from ENCODE ChIP-seq data. SCREEN provides an encyclopedia of ~2 million regulatory elements, including promoters and enhancers, identified using ENCODE ChIP-seq and DNase data, with an extensive UI for searching and visualization. While we have successfully utilized the thousands of available ENCODE ChIP-seq experiments to build the Encyclopedia and visualizers, we have also struggled with the practical and theoretical inability to assay every possible experiment on every possible biosample under every conceivable biological scenario. We have used machine learning techniques to predict TF binding sites and enhancer locations, and demonstrate that machine learning is critical for deciphering functional regions of the genome.
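The TF-binding prediction task mentioned above can be illustrated with a toy supervised model. Everything here is synthetic and hypothetical (two invented features, a motif match score and an accessibility signal); it shows the shape of the problem, not ENCODE's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic candidate sites: bound sites tend to have stronger motif matches
# and higher DNase-seq signal than unbound ones (both features are invented).
n = 2000
bound = rng.integers(0, 2, n)
motif_score = rng.normal(loc=1.5 * bound, scale=1.0)
dnase_signal = rng.normal(loc=1.5 * bound, scale=1.0)
X = np.column_stack([motif_score, dnase_signal])

# Hold out a test set and score the classifier by ROC AUC.
X_tr, X_te, y_tr, y_te = train_test_split(X, bound, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

With two informative features the held-out AUC is well above chance, which is the basic argument for imputing unassayed TF/biosample combinations from cheaper signals.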

    Defining a Registry of Candidate Regulatory Elements to Interpret Disease Associated Genetic Variation

    Over the last decade there has been a great effort to annotate noncoding regions of the genome, particularly those that regulate gene expression. These regulatory elements contain binding sites for transcription factors (TFs), which interact with one another and with transcriptional machinery to initiate, enhance, or repress gene expression. The Encyclopedia of DNA Elements (ENCODE) consortium has generated thousands of epigenomic datasets, such as DNase-seq and ChIP-seq experiments, with the goal of defining such regions. By integrating these assays, we developed the Registry of candidate Regulatory Elements (cREs), a collection of putative regulatory regions across human and mouse. In total, we identified over 1.3M human and 400k mouse cREs, each annotated with cell-type-specific signatures (e.g., promoter-like, enhancer-like) across over 400 human and 100 mouse biosamples. We then demonstrated the biological utility of these regions by analyzing cell type enrichments for genetic variants reported by genome-wide association studies (GWAS). To search and visualize these cREs, we developed the online database SCREEN (Search Candidate Regulatory Elements by ENCODE). After defining cREs, we next sought to determine their potential gene targets. To compare target gene prediction methods, we developed a comprehensive benchmark of enhancer-gene links by curating ChIA-PET, Hi-C and eQTL datasets. We then used this benchmark to evaluate unsupervised linking approaches such as the correlation of epigenomic signal. We determined that these methods have low overall performance and do not outperform simply selecting the closest gene. We then developed a supervised Random Forest model which had notably better performance than unsupervised methods. We demonstrated that this model can be applied across cell types and can be used to predict target genes for GWAS-associated variants. Finally, we used the Registry of cREs to annotate variants associated with psychiatric disorders.
We found that these psych SNPs are enriched in cREs active in brain tissue and likely target genes involved in neural development pathways. We also demonstrated that psych SNPs overlap binding sites for TFs involved in neural and immune pathways. Finally, by identifying psych SNPs with allele imbalance in chromatin accessibility, we highlighted specific cases of psych SNPs altering TF binding motifs, resulting in the disruption of TF binding. Overall, we demonstrated that our collection of putative regulatory regions, the Registry of cREs, can be used to understand the potential biological function of noncoding variation and to develop hypotheses for future testing.
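The closest-gene baseline that the supervised model is compared against is simple to state precisely. A minimal sketch with invented gene names and coordinates:

```python
def closest_gene(enhancer_mid: int, tss: dict[str, int]) -> str:
    # Assign the enhancer to the gene whose TSS is nearest to its midpoint.
    return min(tss, key=lambda g: abs(tss[g] - enhancer_mid))

# Hypothetical TSS coordinates on one chromosome.
tss = {"GENE_A": 10_000, "GENE_B": 55_000, "GENE_C": 120_000}
print(closest_gene(40_000, tss))  # → GENE_B
```

A supervised linker of the kind described would replace this single distance rule with many features (distance, signal correlation, contact frequency) while being evaluated against the same curated benchmark.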

    Data analysis for genomics, transcriptomics and proteomics

    Genomics, transcriptomics, and proteomics are fundamental building blocks that have shaped modern biology. High-throughput and large-scale techniques, such as next-generation sequencing (NGS) and mass spectrometry (MS), have been widely used in the life sciences. Due to the complexity of these data, the analysis needs to be done by sophisticated bioinformatic methods. During my doctoral research, I developed new computational methods and applied new strategies to advance research in genomics, transcriptomics, and proteomics. NGS has brought tremendous changes to genomic research by providing higher sensitivity, sequencing depth, and throughput compared with traditional methods such as Sanger sequencing, qPCR, and microarrays. Benefiting from the advantages of NGS technology, RNA-seq has been widely used for the qualitative and quantitative analysis of genome-wide changes in gene expression. Chromatin immunoprecipitation sequencing (ChIP-seq), another popular application of NGS, provides an efficient way to analyze the interaction between proteins and DNA. During my doctoral studies, I used these techniques to uncover the mechanisms behind the hybrid incompatibility between Drosophila melanogaster and D. simulans. The loss of HMR in D. melanogaster leads to mitotic defects, increased transcription of transposable elements, and deregulated heterochromatic genes. Through genome-wide analysis of HMR's localization by ChIP-seq, I found that the genomic insulator sites bound by HMR can be grouped into two clusters. One set is composed of gypsy insulators, whereas the other is bordered by HP1a-bound regions of active genes. In Hmr mutant flies, the transcription of genes belonging to the latter group is severely disrupted in larval tissue and ovaries. These findings show a novel connection between HMR and insulator proteins, indicating a possible role for genome organization in species development.
Beyond the study of particular genes and RNA transcripts, I also dedicated my work to improving proteomic research by accurately predicting the fragmentation patterns of peptides in tandem mass spectrometry (MS) with deep learning. MS is an important and powerful technology for proteomic research. In recent years, with the development of both theoretical and industrial technologies and methods, the scope of proteome research has expanded at an unprecedented speed. SWATH-MS is a mass spectrometric technique that combines the advantages of targeted data analysis with the speed of time-of-flight (ToF) mass spectrometers to improve peptide quantitation and identification in a data-independent acquisition (DIA) mode. SWATH-MS can analyze proteomes on a much larger scale than traditional methods such as data-dependent acquisition (DDA), parallel reaction monitoring (PRM), or selected reaction monitoring (SRM) due to its increased reproducibility and accuracy. Moreover, SWATH-MS shows a significant increase in the detection rates of peptides and proteins, along with more accurate quantification. However, mass spectra generated by SWATH-MS show higher complexity compared with the traditional DDA method, so more accurate data analysis strategies are required to address this complexity. At the beginning of my doctorate, SWATH-MS relied entirely on fragment libraries generated by DDA experiments, which greatly limited the number of detectable and identifiable peptides. Hence, extending the search space is crucial to improving both identification and quantitation on a proteome-wide scale, especially for SWATH-MS analysis. With the development of new computational approaches to complex problems, more and more biological questions can be addressed successfully.
In this work, we applied such advanced methods to build a prediction framework composed of several tools: dpMS for mass spectra prediction, dpRT for retention time prediction, and dpMC for missed tryptic cleavage prediction, along with other new strategies to improve the effective search space for SWATH-MS at high quality. With the in-silico library, we can identify proteins and peptides beyond the limitations of experimental libraries. We demonstrated the reproducibility and efficiency of dpSWATH across different organisms, D. melanogaster and H. sapiens, on a Q-TOF instrument. Under different experimental conditions, dpSWATH can build highly reliable theoretical libraries for SWATH-MS analysis. Consequently, the new search space has improved both sensitivity and specificity of SWATH-MS analysis. Within this thesis, I summarize three publications I (co)authored: one on the analysis of next-generation sequencing data, and the other two on predictions for mass spectrometry, as listed above.
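dpMC itself is a learned model, but the convention it refines, in-silico tryptic digestion, is easy to sketch. Trypsin cleaves after K or R except when the next residue is P; allowing missed cleavages enlarges the candidate peptide space (the toy protein below is invented):

```python
def tryptic_peptides(protein: str, max_missed: int = 1) -> list[str]:
    # Cleavage points: after K/R, unless the following residue is P.
    cuts = [0] + [i + 1 for i in range(len(protein) - 1)
                  if protein[i] in "KR" and protein[i + 1] != "P"] + [len(protein)]
    peptides = []
    for i in range(len(cuts) - 1):
        # Each extra missed cleavage extends the peptide by one fragment.
        for j in range(i + 1, min(i + 2 + max_missed, len(cuts))):
            peptides.append(protein[cuts[i]:cuts[j]])
    return peptides

print(tryptic_peptides("MKLVRPGK", max_missed=0))  # → ['MK', 'LVRPGK']
print(tryptic_peptides("MKLVRPGK", max_missed=1))  # → ['MK', 'MKLVRPGK', 'LVRPGK']
```

Note that "RP" in the toy protein is not cleaved, illustrating the proline exception; a learned missed-cleavage predictor would assign probabilities to each candidate rather than enumerating them all.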

    3D genomics: form and function of chromatin


    Unravelling higher order chromatin organisation through statistical analysis

    Recent technological advances underpinned by high-throughput sequencing have given new insights into the three-dimensional structure of mammalian genomes. Chromatin conformation assays have been the critical development in this area, particularly the Hi-C method, which ascertains genome-wide patterns of intra- and inter-chromosomal contacts. However, many open questions remain concerning the functional relevance of such higher order structure, the extent to which it varies, and how it relates to other features of the genomic and epigenomic landscape. Current knowledge of nuclear architecture describes a hierarchical organisation ranging from small loops between individual loci, to megabase-sized self-interacting topological domains (TADs), encompassed within large multi-megabase chromosome compartments. In parallel with the discovery of these strata, the ENCODE project has generated vast amounts of data through ChIP-seq, RNA-seq and other assays applied to a wide variety of cell types, forming a comprehensive bioinformatics resource. In this work we combine Hi-C datasets describing physical genomic contacts with a large and diverse array of chromatin features derived at a much finer scale in the same mammalian cell types. These features include levels of bound transcription factors, histone modifications and expression data. These data are then integrated in a statistically rigorous way, through a predictive modelling framework from the machine learning field. These studies were extended, within a collaborative project, to encompass a dataset of matched Hi-C and expression data collected over a murine neural differentiation timecourse. We compare higher order chromatin organisation across a variety of human cell types and find pervasive conservation of chromatin organisation at multiple scales. We also identify structurally variable regions between cell types that are rich in active enhancers and contain loci of known cell-type-specific function.
We show that broad aspects of higher order chromatin organisation, such as nuclear compartment domains, can be accurately predicted in a variety of human cell types using models based upon underlying chromatin features. We dissect these quantitative models and find them to be generalisable to novel cell types, presumably reflecting fundamental biological rules linking compartments with key activating and repressive signals. These models describe the strong interconnectedness between locus-level patterns of local histone modifications and bound factors, on the order of hundreds or thousands of base pairs, and the much broader compartmentalisation of large, multi-megabase chromosomal regions. Finally, boundary regions are investigated in terms of chromatin features and co-localisation with other known nuclear structures, such as association with the nuclear lamina. We find boundary complexity to vary between cell types and link TAD aggregations to previously described lamina-associated domains, as well as exploring the concept of meta-boundaries that span multiple levels of organisation. Together these analyses lend quantitative evidence to a model of higher order genome organisation that is largely stable between cell types, but can selectively vary locally, based on the activation or repression of key loci.

    Identification and characterisation of somatic regulatory mutations in the breast cancer genome

    Luminal breast cancer remains a major clinical challenge, with over 2 million cases diagnosed annually. While prognosis is favourable in these patients, roughly 40% will relapse over the course of the next 20 years. Understanding the evolution of disseminated tumour cells at distal sites is critical to effectively treating these patients. While metastatic driver mutations, such as those in the Oestrogen Receptor (ESR1) gene, can be identified in many cases, clear drivers remain elusive for a significant proportion of patients. A limitation of previous genomics studies in metastatic breast cancer is their focus on the coding genome. Advances in our understanding have revealed the critical role of regulatory elements such as enhancers and promoters in transcriptional regulation. This effect is mediated through the functional and hierarchical organisation of chromatin within the nucleus; the key unit of chromatin organisation is the Topologically Associating Domain (TAD). TAD organisation is, in part, mediated by the CCCTC-Binding Factor (CTCF) protein, which physically binds DNA, mediating the formation of loops and domains. Together, promoters, enhancers, and CTCF-bound regions are potential sites for non-coding mutations that can drastically impact gene regulation and tumour evolution. In this work we interrogate the contribution of regulatory element mutations to the evolution of metastatic breast cancer. This is done through two projects: first, a proof-of-principle study functionally characterising a clinically relevant CTCF binding site mutation; second, the design of an informed panel of regulatory regions utilised in a longitudinal targeted sequencing study in patient samples and a CRISPRi perturbation study in cell lines. Through these studies we provide evidence that the mutation of TAD-boundary-associated CTCF binding sites is unlikely to contribute to tumour evolution.
We also fail to identify recurrent non-coding drivers, though more patient-specific mutations may contribute to metastatic evolution. Results obtained from the CRISPRi screen illustrate the functionality of the regulatory regions in the panel, identifying regulatory elements that confer fitness advantages or vulnerabilities when specifically repressed. This study shows that repression of several members of the NF-ÎșB signalling pathway provides MCF7 cells with an advantage in adapting to oestrogen deprivation. These data underline the importance of regulatory regions in the evolution of luminal breast cancers and indicate that non-genetic mechanisms may play a key role.
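The in-silico side of characterising a CTCF binding site mutation can be sketched as a position weight matrix (PWM) delta score between the reference and mutant alleles. The 4-bp PWM below is a made-up toy, not the real ~19-bp CTCF motif:

```python
import math

# Toy 4-bp position weight matrix (invented probabilities).
pwm = [
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
BG = 0.25  # uniform background base frequency

def pwm_score(site: str) -> float:
    # Log-odds score of the site against the uniform background model.
    return sum(math.log2(col[base] / BG) for col, base in zip(pwm, site))

ref, mut = "CGCA", "CGTA"  # C->T mutation at the third motif position
delta = pwm_score(mut) - pwm_score(ref)  # negative = predicted binding loss
```

A strongly negative delta is the computational hypothesis that a mutation weakens CTCF binding; the functional characterisation described above is what actually tests it.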

    The Pharmacoepigenomics Informatics Pipeline and H-GREEN Hi-C Compiler: Discovering Pharmacogenomic Variants and Pathways with the Epigenome and Spatial Genome

    Over the last decade, biomedical science has been transformed by the epigenome and spatial genome, but the discipline of pharmacogenomics, the study of the genetic underpinnings of pharmacological phenotypes like drug response and adverse events, has not. Scientists have begun to use omics atlases of increasing depth, and inferences relating to the bidirectional causal relationship between the spatial epigenome and gene expression, as a foundational underpinning for genetics research. The epigenome and spatial genome are increasingly used to discover causative regulatory variants in the significant regions of genome-wide association studies, for the discovery of the biological mechanisms underlying these phenotypes and the design of genetic tests to predict them. Such variants often have more predictive power than coding variants, but in pharmacogenomics, these advances have been radically underapplied. The majority of pharmacogenomics tests are designed manually on the basis of mechanistic work with coding variants in candidate genes, and where genome-wide approaches are used, they are typically not interpreted with the epigenome. This work describes a series of analyses of pharmacogenomics association studies with the tools and datasets of the epigenome and spatial genome, undertaken with the intent of discovering causative regulatory variants to enable new genetic tests. It describes the potent regulatory variants thereby discovered, which have a putative causative and predictive role in a number of medically important phenotypes, including analgesia and the treatment of depression, bipolar disorder, and traumatic brain injury with opiates, anxiolytics, antidepressants, lithium, and valproate, and in particular the tendency of such variants to cluster into spatially interacting, conceptually unified pathways that offer mechanistic insight into these phenotypes.
It describes the Pharmacoepigenomics Informatics Pipeline (PIP), an integrative multiple-omics variant discovery pipeline designed to make this kind of analysis easier and cheaper to perform, more reproducible, and amenable to the addition of advanced features. It describes the successes of the PIP in rediscovering manually discovered gene networks for lithium response, as well as discovering a previously unknown genetic basis for warfarin response in anticoagulation therapy. It describes the H-GREEN Hi-C compiler, which was designed to analyze spatial genome data and discover the distant target genes of such regulatory variants, and its success in discovering spatial contacts not detectable by preceding methods and using them to build spatial contact networks that unite disparate TADs with phenotypic relationships. It describes a potential feature set of a future pipeline, using the latest epigenome research and the lessons of the previous pipeline. It describes my thinking about how to use the output of a multiple-omics variant pipeline to design genetic tests that also incorporate clinical data. And it concludes by describing a long-term vision for a comprehensive pharmacophenomic atlas, to be constructed by applying a variant pipeline and machine-learning test design system, such as is described, to thousands of phenotypes in parallel. Scientists struggled to assay genotypes for the better part of a century and, in the last twenty years, succeeded. The struggle to predict phenotypes on the basis of the genotypes we assay remains ongoing. The use of multiple-omics variant pipelines and machine learning models with omics atlases, genetic association data, and medical records will be an increasingly significant part of that struggle for the foreseeable future.
PhD thesis, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145835/1/ariallyn_1.pd
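The idea of spatial contact networks uniting disparate TADs can be illustrated as a connected-components computation over significant contacts (all names and contacts below are invented; this is not the H-GREEN algorithm):

```python
def components(nodes, edges):
    # Union-find: group nodes connected by any chain of contact edges.
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values(), key=len, reverse=True)

tads = ["TAD1", "TAD2", "TAD3", "TAD4", "TAD5"]
contacts = [("TAD1", "TAD3"), ("TAD3", "TAD5")]  # hypothetical significant contacts
print(components(tads, contacts))  # → [['TAD1', 'TAD3', 'TAD5'], ['TAD2'], ['TAD4']]
```

Here two long-range contacts suffice to unite three otherwise separate TADs into one network, the kind of structure the text links to shared phenotypic relationships.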
    • 
