144 research outputs found

    CSP for Executable Scientific Workflows

    Big Data Analytics in Static and Streaming Provenance

    Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2016. With recent technological and computational advances, scientists increasingly integrate sensors and model simulations to understand spatial, temporal, social, and ecological relationships at unprecedented scale. Data provenance traces relationships of entities over time, thus providing a unique view of the over-time behavior under study. However, provenance can be overwhelming in both volume and complexity; the now forecasting potential of provenance creates additional demands. This dissertation focuses on Big Data analytics of static and streaming provenance. It develops filters and a non-preprocessing slicing technique for in-situ querying of static provenance. It presents a stream processing framework for online processing of provenance data at a high receiving rate. While the former is sufficient for answering queries that are given prior to the application start (forward queries), the latter deals with queries whose targets are unknown beforehand (backward queries). Finally, it explores data mining on large collections of provenance and proposes a temporal representation of provenance that can reduce the high dimensionality while effectively supporting mining tasks such as clustering, classification and association rule mining; the temporal representation can also be applied to streaming provenance. The proposed techniques are verified through software prototypes applied to Big Data provenance captured from computer network data, weather models, ocean models, remote (satellite) imagery data, and agent-based simulations of agricultural decision making.

    A robust machine learning approach for the prediction of allosteric binding sites

    Previously held under moratorium from 28 March 2017 until 28 March 2022. Allosteric regulatory sites are highly prized targets in drug discovery. They remain difficult to detect by conventional methods, with the vast majority of known examples being found serendipitously. Herein, a rigorous, wholly computational protocol is presented for the prediction of allosteric sites. Previous attempts to predict the location of allosteric sites by computational means drew on only a small amount of data. Moreover, no attempt was made to modify the initial crystal structure beyond the in silico deletion of the allosteric ligand. This behaviour can leave behind a conformation with a significant structural deformation, often betraying the location of the allosteric binding site. Despite this artificial advantage, modest success rates are observed at best. This work addresses both of these issues. A set of 60 protein crystal structures with known allosteric modulators was collected. To remove the imprint on protein structure caused by the presence of bound modulators, molecular dynamics was performed on each protein prior to analysis. A wide variety of analytical techniques were then employed to extract meaningful data from the trajectories. Upon fusing them into a single, coherent dataset, random forest - a machine learning algorithm - was applied to train a high-performance classification model. After successive rounds of optimisation, the final model presented in this work correctly identified the allosteric site for 72% of the proteins tested. This is not only an improvement over alternative strategies in the literature; crucially, this method is unique among site prediction tools in that it does not abuse crystal structures containing imprints of bound ligands - of key importance when making live predictions, where no allosteric regulatory sites are known.

    Bioinformatic Tools for Next Generation DNA Sequencing: Development and Analysis of Model Systems

    2022 Review of Data-Driven Plasma Science

    Data-driven science and technology offer transformative tools and methods to science. This review article highlights the latest development and progress in the interdisciplinary field of data-driven plasma science (DDPS), i.e., plasma science whose progress is driven strongly by data and data analyses. Plasma is considered to be the most ubiquitous form of observable matter in the universe. Data associated with plasmas can, therefore, cover extremely large spatial and temporal scales, and often provide essential information for other scientific disciplines. Thanks to the latest technological developments, plasma experiments, observations, and computation now produce a large amount of data that can no longer be analyzed or interpreted manually. This trend now necessitates a highly sophisticated use of high-performance computers for data analyses, making artificial intelligence and machine learning vital components of DDPS. This article contains seven primary sections, in addition to the introduction and summary. Following an overview of fundamental data-driven science, five other sections cover widely studied topics of plasma science and technologies, i.e., basic plasma physics and laboratory experiments, magnetic confinement fusion, inertial confinement fusion and high-energy-density physics, space and astronomical plasmas, and plasma technologies for industrial and other applications. The final section before the summary discusses plasma-related databases that could significantly contribute to DDPS. Each primary section starts with a brief introduction to the topic, discusses the state-of-the-art developments in the use of data and/or data-scientific approaches, and presents the summary and outlook. Despite the recent impressive signs of progress, the DDPS is still in its infancy. This article attempts to offer a broad perspective on the development of this field and identify where further innovations are required

    Computational methods for analyzing complex high-throughput data from cancers

    Cancers are a heterogeneous group of diseases that cause 7.6 million deaths yearly worldwide. At the cellular level, cancer is characterized by increased proliferation and invasion of tissue. These phenotypes are caused by environmental or inherited factors that increase the mutability of the genome, leading to dysregulation of a number of cellular processes. Identifying the genotypic changes and their phenotypic consequences is key to accurate diagnosis and prognosis, as well as improved treatment regimens. Cancer cells can be investigated at a genome-wide scale using high-throughput measurement techniques such as DNA sequencing and microarrays. These rapidly evolving technologies provide experimental data that have two challenging characteristics: the volume of data is large and data are structurally complex. These data need to be analyzed in an accurate and scalable manner to arrive at biomedically relevant conclusions. I have developed three computational methods for analyzing high-throughput genomic data, and applied the methods to experimental data from three cancers. The first computational method is an extensible workflow framework, Anduril, for organizing the overall software structure of an analysis in a scalable manner. The second method, SPINLONG, is a flexible algorithm for analyzing chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) data from complex experimental designs, such as time series measurements of multiple markers. The third method, GROK, is used for preprocessing deep sequencing data. Its design is based on a mathematical formalism that provides a succinct language for these operations. The experimental part studies gene regulation and expression in glioblastoma multiforme, and breast and prostate cancer. The results demonstrate the applicability of the developed methods to cancer research and provide insights into the dysregulation of gene expression in cancer. 
All three studies use both cell line and clinical material to connect the molecular and disease outcome aspects of cancer. These experiments yield results at two conceptual levels. At the holistic level, lists of significant genes or genomic regions provide a genome-wide view into genomic alterations in cancer. At the specific level, we focus on one or a few central genes, which are experimentally validated, to provide an accessible starting point for understanding the results. Together, the thesis focuses on understanding the complexity of cancer and managing the complexity of genome-wide data.
    Cancers are a heterogeneous group of diseases that cause 7.6 million deaths worldwide each year. At the cellular level, cancer is characterized by increased cell growth and spread into surrounding tissue. These cellular-level phenomena arise from environmental and hereditary factors that increase the genome's susceptibility to mutation and disrupt the cell's biochemical processes. For cancer treatment and diagnosis, it is important to identify the genetic changes in cancer cells and their effects on the phenotype. Cellular changes in cancer can be studied with recently developed genome-wide measurement techniques, such as DNA sequencing and microarrays. These new-generation techniques produce measurement data with two characteristic features: it is large in volume and structurally complex. Such data must be analyzed accurately and in a computationally scalable manner for the research to yield added medical value. In this work, three computational methods were developed for analyzing genome-wide datasets and applied experimentally in the study of three cancers. The first computational method is the software framework Anduril, which provides an extensible, workflow-based platform for analyzing large and complex datasets. The second method is the SPINLONG algorithm, which is used to analyze protein binding to DNA genome-wide. The third method, GROK, is software for efficient preprocessing of large DNA sequencing datasets. The experimental part addresses gene expression and regulation in glioblastoma and in breast and prostate cancer. The results demonstrate the suitability of the developed computational methods for experimental research and add to knowledge of the genome-level changes occurring in these cancers. The experimental studies used both cell cultures and patient samples to link molecular-level changes to clinical outcome. The experimental results can be viewed at two levels of abstraction. At the holistic level, which comprises lists of altered genes and chromosomal regions, an overall picture of genome-wide changes in cancers is obtained. At the specific level, the focus narrows to the most relevant genes, whose significance has been experimentally verified, which offers a natural starting point for interpreting the results. As a whole, the dissertation studies the complexity of cancer and develops methods for interpreting complex genome-wide datasets.

    Computational epigenetics: bioinformatic methods for epigenome prediction, DNA methylation mapping and cancer epigenetics

    Epigenetic research aims to understand heritable gene regulation that is not directly encoded in the DNA sequence. Epigenetic mechanisms such as DNA methylation and histone modifications modulate the packaging of the DNA in the nucleus and thereby influence gene expression. Patterns of epigenetic information are faithfully propagated over multiple cell divisions, which makes epigenetic gene regulation a key mechanism for cellular differentiation and cell fate decisions. In addition, incomplete erasure of epigenetic information can lead to complex patterns of non-Mendelian inheritance. Stochastic and environment-induced epigenetic defects are known to play a major role in cancer and ageing, and they may also contribute to mental disorders and autoimmune diseases. Recent technical advances — such as the development of the ChIP-on-chip and ChIP-seq protocols for genome-wide mapping of epigenetic information — have started to convert epigenetic research into a high-throughput endeavor, to which bioinformatics is expected to make significant contributions. This thesis describes computational work at the intersection of epigenetics and genome research, aiming to address the bioinformatic challenges posed by the human epigenome. While its methods are carried over and adapted from bioinformatics and related fields (including data mining, machine learning, statistics, algorithms, optimization, software engineering and databases), its overarching goal is to contribute to epigenetic research, both directly through analyzing and modeling of epigenetic information, and indirectly through the development of practically useful methods and software toolkits. This thesis is broadly structured into four parts. The first part gives a brief introduction into epigenetic regulation and inheritance, and reviews the emerging field of computational epigenetics. The second part addresses the question of genome-epigenome interactions using machine learning methods. 
It is shown that accurate predictions of DNA methylation and other epigenetic modifications can be derived from the genomic DNA sequence. Based on this finding, the EpiGRAPH web service for epigenome analysis and prediction is described, and methods for refined annotation of CpG islands in the human genome are proposed. The third part is dedicated to large-scale analysis of DNA methylation, which is the best-known epigenetic phenomenon. The BiQ Analyzer software toolkit is presented, together with a bioinformatic analysis of the "National Methylome Project for Chromosome 21" dataset, for which BiQ Analyzer had played an enabling role. This part concludes with statistical modeling of DNA methylation variation and an analysis of its implications for DNA methylation mapping in a large number of human individuals. The fourth part describes two pilot projects applying the bioinformatic concepts of this thesis to cancer epigenetics. First, genome-scale datasets are probed for evidence of a link between DNA methylation and Polycomb binding, which is believed to play a role in epigenetic deregulation of cancer cells. Second, a biomarker that tests for cancer-specific DNA methylation is optimized and validated for use in clinical settings. Arguably the most interesting result of this thesis is the unexpectedly high correlation between genome and epigenome that was found by several methods and based on multiple epigenome datasets. This finding suggests that the role of the genome for epigenetic regulation has been underappreciated, and it underlines the importance of integrated analysis of genome and epigenome. With the EpiGRAPH web service for (epi-)genome analysis and prediction, a research tool is provided to facilitate further investigation of this striking interaction.
    The aim of epigenetic research is a better understanding of the mechanisms of heritable gene regulation that are not directly encoded in the DNA sequence. Epigenetic modifications of the genome, such as DNA methylation and histone modifications, influence the spatial arrangement of the DNA in the cell nucleus and thereby gene expression. Epigenetic information is stably passed on over many cell divisions, which is why epigenetic gene regulation is a key mechanism for cell differentiation and determination. In addition, incomplete erasure of epigenetic information gives rise to complex non-Mendelian patterns of inheritance. Stochastic and environment-induced epigenetic defects play an important role in cancer and molecular ageing, and they also appear to influence mental disorders and autoimmune diseases. As a result of technical advances, such as the development of the ChIP-on-chip and ChIP-seq protocols for genome-wide mapping of epigenetic information, epigenetic research has begun a transformation toward high-throughput analyses, to which bioinformatics must make an important contribution. This dissertation describes bioinformatic studies at the interface of epigenetics and genome research, with the aim of responding adequately to the analytical challenges of the human epigenome. While its methods are borrowed and adapted from bioinformatics and neighbouring fields (data mining, machine learning, statistics, algorithmics, optimization, software engineering and databases), the overarching goal of the work is to contribute to epigenetic research, both directly through the analysis and modelling of epigenetic data and indirectly through the development of practically usable methods and software tools. This dissertation is roughly divided into four parts. The first part introduces the topic of epigenetic inheritance and gene regulation and summarizes the young research field of "computational epigenetics". The second part addresses the question of genome-epigenome interactions using machine learning methods. It is shown that an accurate prediction of DNA methylation and other epigenetic modifications can be derived from the genomic DNA sequence. Based on this result, the EpiGRAPH web service for epigenome analysis and prediction is described, and methods for improved annotation of CpG islands in vertebrate genomes are developed. The third part deals with high-throughput analysis of DNA methylation, the best-known epigenetic phenomenon. The BiQ Analyzer software is presented, and the results of a bioinformatic analysis of the "National Methylome Project for Chromosome 21" dataset are described, to whose generation BiQ Analyzer made a fundamental contribution. This part concludes with statistical modelling of DNA methylation variation and an analysis of its significance for DNA methylation mapping in a large number of human individuals. The fourth part describes two pilot projects in which the bioinformatic concepts of this work are applied to cancer epigenetics. First, epigenomic datasets are examined for interactions between DNA methylation and Polycomb binding sites, a relationship that presumably plays a role in the epigenetic deregulation of cancer cells. Second, a biomarker that can detect a cancer-specific change in DNA methylation is optimized and validated for use under clinical conditions. Perhaps the most interesting result of this dissertation is an unexpectedly high correlation between genome and epigenome, which was demonstrated with several methods and for a wide variety of epigenome datasets. This result suggests that the regulatory influence of the genome on the epigenome has so far not been sufficiently appreciated, and it underlines the importance of an integrated analysis of genome and epigenome. The EpiGRAPH web service lends itself as a tool for closer investigation of this remarkable interaction.