
    The Reasonable Effectiveness of Randomness in Scalable and Integrative Gene Regulatory Network Inference and Beyond

    Gene regulation is orchestrated by a vast number of molecules, including transcription factors and co-factors and chromatin regulators, as well as by epigenetic mechanisms, and it has been shown that transcriptional misregulation, e.g., caused by mutations in regulatory sequences, is responsible for a plethora of diseases, including cancer and developmental or neurological disorders. As a consequence, decoding the architecture of gene regulatory networks has become one of the most important tasks in modern (computational) biology. However, to advance our understanding of the mechanisms involved in the transcriptional apparatus, we need scalable approaches that can deal with the increasing number of large-scale, high-resolution biological datasets. In particular, such approaches need to be capable of efficiently integrating and exploiting the biological and technological heterogeneity of such datasets in order to best infer the underlying, highly dynamic regulatory networks, often in the absence of sufficient ground truth data for model training or testing. With respect to scalability, randomized approaches have proven to be a promising alternative to deterministic methods in computational biology. As an example, one of the top-performing algorithms in a community challenge on gene regulatory network inference from transcriptomic data is based on a random forest regression model. In this concise survey, we aim to highlight how randomized methods may serve as a highly valuable tool, in particular as increasing amounts of large-scale biological experiments and datasets are collected. Given the complexity and interdisciplinary nature of the gene regulatory network inference problem, we hope our survey may be helpful to both computational and biological scientists. It is our aim to provide a starting point for a dialogue about the concepts, benefits, and caveats of the toolbox of randomized methods, since unravelling the intricate web of highly dynamic regulatory events will be one fundamental step in understanding the mechanisms of life and eventually developing efficient therapies to treat and cure diseases.
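    As a rough illustration of how such a randomized method operates, the following minimal Python sketch mimics the random-forest strategy alluded to above (in the spirit of GENIE3, the DREAM challenge winner): each gene is treated in turn as a regression target, a random forest is fit on the expression of all remaining genes, and the resulting feature importances are read off as putative edge weights. All names and the toy data are hypothetical; real pipelines restrict predictors to known transcription factors and post-process the weight matrix.

    # Illustrative sketch with synthetic data; not the survey authors' implementation.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def grn_importances(expr, n_trees=100, seed=0):
        """expr: (n_samples, n_genes) expression matrix; returns W with
        W[i, j] scoring the putative regulatory edge gene_i -> gene_j."""
        n_genes = expr.shape[1]
        W = np.zeros((n_genes, n_genes))
        for j in range(n_genes):
            predictors = np.delete(np.arange(n_genes), j)  # every gene but the target
            rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
            rf.fit(expr[:, predictors], expr[:, j])
            W[predictors, j] = rf.feature_importances_     # importance = edge weight
        return W

    # Toy usage on random data: 50 samples x 10 genes.
    W = grn_importances(np.random.default_rng(0).normal(size=(50, 10)))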

    Ultra-fast screening of stress-sensitive (naturally fractured) reservoirs using flow diagnostics

    Quantifying the impact of poro-mechanics on reservoir performance is critical to the sustainable management of subsurface reservoirs that contain hydrocarbons, groundwater, or geothermal heat, or that are targeted for the geological storage of fluids (e.g., CO2 or H2). However, accounting for poro-mechanical effects in full-field reservoir simulation studies and uncertainty quantification workflows for complex reservoir models is challenging, mainly because exploring and capturing the full range of geological and mechanical uncertainties requires a large number of numerical simulations and is hence computationally intensive. As a consequence, poro-mechanical effects are often ignored in reservoir engineering workflows, which may result in inadequate reservoir performance forecasts. This thesis hence develops an alternative approach that couples hydrodynamics, using existing flow diagnostics simulations for single- and dual-porosity models, with poro-mechanics to screen the impact of coupled poro-mechanical processes on reservoir performance. Due to the steady-state nature of the calculations and the proposed coupling strategy, these calculations remain computationally efficient while providing first-order approximations of the interplay between poro-mechanics and hydrodynamics, as we demonstrate through a series of case studies. The thesis also introduces a new uncertainty quantification workflow using the proposed poro-mechanically informed flow diagnostics and proxy models. These computationally efficient calculations allow us to quickly screen poro-mechanical effects and assess a broader range of geological, petrophysical, and mechanical uncertainties in order to rank, compare, and cluster a large ensemble of models and select representative candidates for more detailed full-physics coupled reservoir simulations.
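    To make the screening idea concrete, here is a minimal Python sketch of one standard flow diagnostics measure, the dynamic Lorenz coefficient, which summarizes displacement heterogeneity from per-cell storage and flow capacities and is commonly used to rank ensemble members. This is a generic illustration with synthetic inputs, not the thesis's poro-mechanically informed workflow.

    # Generic flow diagnostics sketch with synthetic data; assumptions, not the thesis code.
    import numpy as np

    def lorenz_coefficient(pore_volume, flux):
        """Dynamic Lorenz coefficient: 0 for a perfectly homogeneous
        displacement, approaching 1 for a highly heterogeneous one."""
        order = np.argsort(flux / pore_volume)[::-1]              # fastest-swept cells first
        phi = np.cumsum(pore_volume[order]) / pore_volume.sum()   # storage capacity
        F = np.cumsum(flux[order]) / flux.sum()                   # flow capacity
        area = np.trapz(np.r_[0.0, F], np.r_[0.0, phi])           # area under the F(phi) curve
        return 2.0 * (area - 0.5)

    # Toy ensemble ranking with synthetic per-cell pore volumes and fluxes.
    rng = np.random.default_rng(1)
    for name in ("model_a", "model_b"):
        pv, q = rng.uniform(0.5, 1.5, 200), rng.lognormal(0.0, 1.0, 200)
        print(name, round(lorenz_coefficient(pv, q), 3))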

    STATISTICAL METHODS FOR INFERRING GENETIC REGULATION ACROSS HETEROGENEOUS SAMPLES AND MULTIMODAL DATA

    As clinical datasets have increased in size and a wider range of molecular profiles can be credibly measured, understanding sources of heterogeneity has become critical in studying complex phenotypes. Here, we investigate and develop statistical approaches to address and analyze technical variation, genetic diversity, and tissue heterogeneity in large biological datasets. Commercially available methods for normalization of NanoString nCounter RNA expression data are suboptimal in fully addressing unwanted technical variation. First, we develop a more comprehensive quality control, normalization, and validation framework for nCounter data, benchmark it against existing normalization methods for nCounter, and show its advantages on four datasets of differing sample sizes. We then develop race-specific and genetic ancestry-adjusted tumor transcriptomic prediction models from germline genetics in the Carolina Breast Cancer Study (CBCS) and study the performance of these models across ancestral groups and molecular subtypes. These models are employed in a transcriptome-wide association study (TWAS) to identify four novel genetic loci associated with breast cancer-specific survival. Next, we extend TWAS to a novel suite of tools, MOSTWAS, to prioritize distal genetic variation in transcriptomic prediction models with two multi-omic approaches that draw from mediation analysis. We empirically show the utility of these extensions in simulation analyses, TCGA breast cancer data, and ROS/MAP brain tissue data. We also develop a novel distal-SNPs added-last test, to be used with MOSTWAS models, to prioritize distal loci that give added information beyond the association in the local locus around a gene. Lastly, we develop DeCompress, a deconvolution method for gene expression from targeted RNA panels such as NanoString, which have a much smaller feature space than traditional RNA expression assays. We propose an ensemble approach that leverages compressed sensing to expand the feature space and validate it on data from the CBCS. We conduct extensive benchmarking of existing deconvolution methods using simulated in-silico experiments, pseudo-targeted panels from published mixing experiments, and data from the CBCS to show the advantage of DeCompress over reference-free methods. Finally, we show the utility of in-silico cell-type proportion estimation in outcome prediction and eQTL mapping.
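    For readers unfamiliar with the TWAS building block used throughout, the sketch below shows the usual local step in Python: a gene's expression is predicted from nearby germline genotypes with a cross-validated elastic net, and the imputed expression is then carried into downstream association tests. The data are simulated and the code is illustrative only; MOSTWAS additionally prioritizes distal variants via mediation analysis, which this sketch omits.

    # Hypothetical TWAS-style local model on simulated data; not the MOSTWAS code.
    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.default_rng(0)
    genotypes = rng.integers(0, 3, size=(300, 120)).astype(float)  # 300 samples x 120 cis-SNPs
    true_beta = np.zeros(120)
    true_beta[:5] = 0.4                                            # a few causal SNPs
    expression = genotypes @ true_beta + rng.normal(size=300)      # simulated expression

    model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(genotypes, expression)
    imputed = model.predict(genotypes)  # genetically regulated expression for TWAS
    print("nonzero SNP weights:", int(np.sum(model.coef_ != 0)))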

    Phylogenetics in the Genomic Era

    Molecular phylogenetics was born in the middle of the 20th century, when the advent of protein and DNA sequencing offered a novel way to study the evolutionary relationships between living organisms. The first 50 years of the discipline can be seen as a long quest for resolving power. The goal – reconstructing the tree of life – seemed to be unreachable, the methods were heavily debated, and the data limiting. Maybe for these reasons, even the relevance of the whole approach was repeatedly questioned, as part of the so-called molecules versus morphology debate. Controversies often crystallized around long-standing conundrums, such as the origin of land plants, the diversification of placental mammals, or the prokaryote/eukaryote divide. Some of these questions were resolved as gene and species samples increased in size. Over the years, molecular phylogenetics has gradually evolved from a brilliant, revolutionary idea to a mature research field centred on the problem of reliably building trees. This logical progression was abruptly interrupted in the late 2000s. High-throughput sequencing arose and the field suddenly moved into something entirely different. Access to genome-scale data profoundly reshaped the methodological challenges, while opening an amazing range of new application perspectives. Phylogenetics left the realm of systematics to occupy a central place in one of the most exciting research fields of this century – genomics. This is what this book is about: how we do trees, and what we do with trees, in the current phylogenomic era.
    One obvious, practical consequence of the transition to genome-scale data is that the most widely used tree-building methods, which are based on probabilistic models of sequence evolution, require intensive algorithmic optimization to be applicable to current datasets. This problem is considered in Part 1 of the book, which includes a general introduction to Markov models (Chapter 1.1) and a detailed description of how to optimally design and implement Maximum Likelihood (Chapter 1.2) and Bayesian (Chapter 1.4) phylogenetic inference methods. The importance of the computational aspects of modern phylogenomics is such that efficient software development is a major activity of numerous research groups in the field. We acknowledge this and have included seven "How to" chapters presenting recent updates of major phylogenomic tools – RAxML (Chapter 1.3), PhyloBayes (Chapter 1.5), MACSE (Chapter 2.3), Bgee (Chapter 4.3), RevBayes (Chapter 5.2), Beagle (Chapter 5.4), and BPP (Chapter 5.6).
    Genome-scale data sets are so large that statistical power, which had been the main limiting factor of phylogenetic inference during previous decades, is no longer a major issue. Massive data sets instead tend to amplify the signal they deliver – be it biological or artefactual – so that bias and inconsistency, rather than sampling variance, are the main problems with phylogenetic inference in the genomic era. Part 2 covers the issues of data quality and model adequacy in phylogenomics. Chapter 2.1 provides an overview of current practice and makes recommendations on how to avoid the more common biases. Two chapters review the challenges and limitations of two key steps of phylogenomic analysis pipelines, sequence alignment (Chapter 2.2) and orthology prediction (Chapter 2.4), which largely determine the reliability of downstream inferences. The performance of tree-building methods is also the subject of Chapter 2.5, in which a new approach is introduced to assess the quality of gene trees based on their ability to correctly predict ancestral gene order.
    Analyses of multiple genes typically recover multiple, distinct trees. Maybe the biggest conceptual advance induced by the phylogenetic-to-phylogenomic transition is the suggestion that one should not simply aim to reconstruct “the” species tree, but rather be prepared to make sense of forests of gene trees. Chapter 3.1 reviews the numerous reasons why gene trees can differ from each other and from the species tree, and what the implications are for phylogenetic inference. Chapter 3.2 focuses on gene tree/species tree reconciliation methods that account for gene duplication/loss and horizontal gene transfer among lineages. Incomplete lineage sorting is another major source of phylogenetic incongruence among loci, which recently gained attention and is covered by Chapter 3.3. Chapter 3.4 concludes this part by taking a user's perspective and examining the pros and cons of concatenation versus separate analysis of gene sequence alignments.
    Modern genomics is comparative, and phylogenetic methods are key to a wide range of questions and analyses relevant to the study of molecular evolution. This is covered by Part 4. We argue that genome annotation, either structural or functional, can only be properly achieved in a phylogenetic context. Chapters 4.1 and 4.2 review the power of these approaches and their connections with the study of gene function. Molecular substitution rates play a key role in our understanding of the prevalence of nearly neutral versus adaptive molecular evolution, and in the influence of species traits on genome dynamics (Chapter 4.4). The analysis of substitution rates, and particularly the detection of positive selection, requires sophisticated methods and models of coding sequence evolution (Chapter 4.5). Phylogenomics also offers a unique opportunity to explore evolutionary convergence at a molecular level, thus addressing the long-standing question of predictability versus contingency in evolution (Chapter 4.6).
    The development of phylogenomics, as reviewed in Parts 1 through 4, has resulted in a powerful conceptual and methodological corpus, which is often reused for addressing problems of interest to biologists from other fields. Part 5 illustrates this application potential via three selected examples. Chapter 5.1 addresses the link between phylogenomics and palaeontology, i.e., how to optimally combine molecular and fossil data for estimating divergence times. Chapter 5.3 emphasizes the importance of the phylogenomic approach in virology and its potential to trace the origin and spread of infectious diseases in space and time. Finally, Chapter 5.5 recalls why phylogenomic methods and the multi-species coalescent model are key in addressing the problem of species delimitation – one of the major goals of taxonomy.
    It is hard to predict where phylogenomics as a discipline will stand in even 10 years. Maybe a novel technological revolution will bring it to yet another level? We strongly believe, however, that tree thinking will remain pivotal in the treatment and interpretation of the deluge of genomic data to come. Perhaps a prefiguration of the future of our field is provided by the daily monitoring of the current Covid-19 outbreak via the phylogenetic analysis of coronavirus genomic data in quasi real time – a topic of major societal importance, contemporary to the publication of this book, in which phylogenomics is instrumental in helping to fight disease.
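    To give a flavour of the Markov models introduced in Chapter 1.1, the following Python sketch computes Jukes-Cantor transition probabilities and the likelihood of a single alignment column on a two-leaf tree; maximum likelihood tools such as RAxML apply the same machinery, with richer substitution models and Felsenstein's pruning algorithm, over arbitrary trees. The example is a simplified illustration, not taken from the book.

    # Simplified illustration of a substitution Markov model; not code from the book.
    import numpy as np

    def jc69_p(t):
        """4x4 Jukes-Cantor transition matrix after branch length t
        (expected substitutions per site)."""
        same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
        diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
        return np.where(np.eye(4, dtype=bool), same, diff)

    def site_likelihood(x, y, t1, t2):
        """Likelihood of bases x, y (coded 0..3) at two leaves hanging off a
        root with branch lengths t1, t2 and uniform root frequencies."""
        P1, P2 = jc69_p(t1), jc69_p(t2)
        return sum(0.25 * P1[r, x] * P2[r, y] for r in range(4))

    print(site_likelihood(0, 0, 0.1, 0.1))  # identical bases: high likelihood
    print(site_likelihood(0, 2, 0.1, 0.1))  # differing bases: low likelihood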

    Multivariate Statistical Machine Learning Methods for Genomic Prediction

    This open access book, published under a CC BY 4.0 license, brings together the latest genome-based prediction models currently being used by statisticians, breeders, and data scientists. It provides an accessible way to understand the theory behind each statistical learning tool, the required pre-processing, the basics of model building, how to train statistical learning methods, the basic R scripts needed to implement each statistical learning tool, and the output of each tool. To this end, for each tool the book provides background theory, some elements of the R statistical software for its implementation, the conceptual underpinnings, and at least two illustrative examples with data from real-world genomic selection experiments. Lastly, worked-out examples help readers check their own comprehension. The book will greatly appeal to plant and animal breeders, geneticists, and statisticians, as it provides in a very accessible way the necessary theory, the appropriate R code, and illustrative examples for a complete understanding of each statistical learning tool. In addition, it weighs the advantages and disadvantages of each tool.
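    The book itself works in R; as a language-agnostic taste of the simplest tool in this family, the hypothetical Python sketch below performs whole-genome prediction with ridge regression on marker genotypes (the penalized-regression counterpart of GBLUP) using simulated lines and phenotypes.

    # Illustrative sketch with simulated data; the book's own examples use R.
    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    markers = rng.integers(0, 3, size=(500, 1000)).astype(float)  # 500 lines x 1000 SNPs
    effects = rng.normal(0.0, 0.05, size=1000)                    # small additive effects
    phenotype = markers @ effects + rng.normal(size=500)          # phenotype = signal + noise

    X_tr, X_te, y_tr, y_te = train_test_split(markers, phenotype, random_state=0)
    model = RidgeCV(alphas=np.logspace(-2, 4, 20)).fit(X_tr, y_tr)
    r = np.corrcoef(model.predict(X_te), y_te)[0, 1]
    print(f"predictive correlation on held-out lines: {r:.2f}")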

    Tätigkeitsbericht 2011–2013 [Activity Report 2011–2013]


    Iowa State University, Courses and Programs Catalog 2014–2015

    Get PDF
    The Iowa State University Catalog is a one-year publication that lists all academic policies and procedures. The catalog also includes the following: information on fees; curriculum requirements; first-year courses of study for over 100 undergraduate majors; course descriptions for nearly 5,000 undergraduate and graduate courses; and a listing of faculty members at Iowa State University.