Comparative Analyses of De Novo Transcriptome Assembly Pipelines for Diploid Wheat
Gene expression and transcriptome analysis are currently among the main research focuses for a great number of scientists. However, the assembly of raw sequence data to obtain a draft transcriptome of an organism is a complex multi-stage process usually composed of pre-processing, assembling, and post-processing. Each of these stages includes multiple steps, such as data cleaning, error correction, and assembly validation. Different combinations of steps, as well as different computational methods for the same step, generate transcriptome assemblies of different accuracy. Thus, using a combination that generates more accurate assemblies is crucial for any novel biological discovery. Implementing accurate transcriptome assembly requires deep knowledge of the different algorithms, bioinformatics tools, and software that can be used in an analysis pipeline. Many pipelines can be represented as automated, scalable scientific workflows that can run simultaneously on powerful distributed computational resources, such as campus clusters, grids, and clouds, and thereby speed up the analyses.
In this thesis, we 1) compared and optimized de novo transcriptome assembly pipelines for diploid wheat; 2) investigated the impact of a few key parameters for generating accurate transcriptome assemblies, such as digital normalization and error correction methods, de novo assemblers and k-mer length strategies; 3) built distributed and scalable scientific workflow for blast2cap3, a step from the transcriptome assembly pipeline for protein-guided assembly, using the Pegasus Workflow Management System (WMS); and 4) deployed and examined the scientific workflow for blast2cap3 on two different computational platforms.
Based on the analysis performed in this thesis, we conclude that the best transcriptome assembly is produced when error correction is combined with the Velvet/Oases assembler and the “multi-k” strategy. Moreover, the performed experiments show that the Pegasus WMS implementation of blast2cap3 reduces the running time by more than 95% compared to the current serial implementation. The results presented in this thesis provide valuable insight for designing a good de novo transcriptome assembly pipeline and show the importance of using scientific workflows for executing computationally demanding pipelines.
Advisor: Jitender S. Deogun
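The “multi-k” strategy mentioned above can be illustrated with a toy de Bruijn assembler. This is a minimal sketch, not the actual Velvet/Oases pipeline: the reads, the k values, the greedy contig walk, and the longest-contig merge rule are all illustrative assumptions (real multi-k pipelines merge whole assemblies, e.g. with CD-HIT or Oases' own merge step).

```python
from collections import defaultdict

def assemble(reads, k):
    """Toy de Bruijn assembly: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    # In-degree of each node, counting unique edges only.
    indeg = defaultdict(int)
    for node in graph:
        for succ in graph[node]:
            indeg[succ] += 1
    # Walk greedily from source nodes through unambiguous extensions;
    # a branch (repeat longer than k-1) terminates the contig.
    contigs = []
    for node in [n for n in graph if indeg[n] == 0]:
        contig = node
        while len(graph[node]) == 1:
            node = next(iter(graph[node]))
            contig += node[-1]
        contigs.append(contig)
    return contigs

def multi_k_assemble(reads, ks):
    """'Multi-k' idea: assemble at several k values and keep the best
    result -- here simply the longest contig, as a stand-in for a
    real assembly-merging step."""
    return max((c for k in ks for c in assemble(reads, k)), key=len)
```

With a repeat of length 3 in the underlying sequence, k=4 stops at the branch while k=6 resolves it, which is exactly why trying several k values helps:

```python
reads = ["ATGCGTAC", "GCGTACGT", "GTACGTTA", "ACGTTAGC"]
multi_k_assemble(reads, [4, 6])  # k=4 stalls at "ATGCGT"; k=6 recovers the full sequence
```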
Coupling streaming AI and HPC ensembles to achieve 100-1000x faster biomolecular simulations
Machine learning (ML)-based steering can improve the performance of
ensemble-based simulations by allowing for online selection of more
scientifically meaningful computations. We present DeepDriveMD, a framework for
ML-driven steering of scientific simulations that we have used to achieve
orders-of-magnitude improvements in molecular dynamics (MD) performance via
effective coupling of ML and HPC on large parallel computers. We discuss the
design of DeepDriveMD and characterize its performance. We demonstrate that
DeepDriveMD can achieve 100-1000x acceleration for protein folding
simulations relative to other methods, as measured by the amount of simulated
time performed, while covering the same conformational landscape as quantified
by the states sampled during a simulation. Experiments are performed on
leadership-class platforms on up to 1020 nodes. The results establish
DeepDriveMD as a high-performance framework for ML-driven HPC simulation
scenarios, one that supports diverse MD simulation and ML back-ends and
enables new scientific insights by improving the length and time scales
accessible with current computing capacity.
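The steering idea described above can be sketched with a toy novelty criterion: a minimal stand-in for DeepDriveMD's ML-based selection (which in practice uses learned models such as autoencoders), not the framework's actual API. The discretization, scoring, and restart-selection logic here are illustrative assumptions.

```python
from collections import Counter

def novelty_scores(trajectories, bin_width=1.0):
    # Count visits to each discretized state across the whole ensemble;
    # a trajectory whose end state is rarely visited scores higher.
    visits = Counter()
    for traj in trajectories:
        for x in traj:
            visits[int(x // bin_width)] += 1
    return [1.0 / visits[int(t[-1] // bin_width)] for t in trajectories]

def select_restarts(trajectories, n):
    # Steering step: launch the next simulation round from the n most
    # novel (least-visited) end states, rather than continuing every
    # trajectory blindly -- this is what biases simulated time toward
    # unexplored regions of the conformational landscape.
    scores = novelty_scores(trajectories)
    ranked = sorted(range(len(trajectories)),
                    key=lambda i: scores[i], reverse=True)
    return [trajectories[i][-1] for i in ranked[:n]]
```

In a real ensemble workflow this select step would run online, between batches of MD simulations, with the restart states fed back to the simulation back-end.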
Citizen-led Work using Social Computing and Procedural Guidance
Online platforms enable people to interact with friends, family, and the world at large. How might people go beyond sharing stories and ideas to building and testing theories in the real world? While many are motivated to dig deeper into their lived experience, limited expertise and lack of platform support make complex activities like experimentation dauntingly hard. Novices benefit greatly from expert guidance: this thesis advocates baking the guidance into the interface itself. This dissertation introduces procedural guidance to build just-in-time expertise for difficult tasks. Procedural guidance has multiple advantages: it is minimal, leverages teachable moments, and can be ability-specific. This dissertation instantiates this insight through a sequence of increasingly complex social computing systems: Gut Instinct for curating ideas, Docent for generating hypotheses, and Galileo for citizen-led experiments. Gut Instinct hosts online learning materials and enables people to collaboratively brainstorm potential influences on people’s microbiome. Docent explicitly teaches people to create hypotheses by combining personal insights and online learning with task-specific scaffolding. Finally, Galileo reifies experimentation in the software, provides multiple roles for contribution, and automatically manages interdependencies. Multiple evaluations—controlled experiments and field deployments with online communities including American Gut participants—demonstrate that procedural guidance enables people to transform intuitions into hypotheses and structurally sound experiments. By enabling people to draw on lived experience, this dissertation heralds a future where people can convert their intuitions into actionable plans and implement those plans with online communities. It concludes by discussing opportunities for complex work on social computing platforms.
Essential oil phytocomplex activity, a review with a focus on multivariate analysis for a network pharmacology-informed phytogenomic approach
Thanks to omic disciplines and a systems biology approach, the study of essential oils and phytocomplexes has lately been moving on a faster track. While metabolomic fingerprinting can provide an effective strategy for characterizing essential oil contents, network pharmacology is emerging as an adequate, holistic platform for studying the collective effects of herbal products and their multi-component, multi-target mechanisms. Multivariate analysis can be applied to analyze the effects of essential oils, possibly overcoming the reductionist limits of bioactivity-guided fractionation and purification of single components. Thanks to the fast evolution of bioinformatics and database availability, disease-target networks relevant to a growing number of phytocomplexes are being developed. With the same potential actionability as pharmacogenomic data, phytogenomics could be performed on the basis of relevant disease-target networks to inform and personalize phytocomplex therapeutic application.
Integrative biological simulation praxis: Considerations from physics, philosophy, and data/model curation practices
Integrative biological simulations have a varied and controversial history in
the biological sciences. From computational models of organelles, cells, and
simple organisms, to physiological models of tissues, organ systems, and
ecosystems, a diverse array of biological systems have been the target of
large-scale computational modeling efforts. Nonetheless, these research agendas
have yet to prove decisively their value among the broader community of
theoretical and experimental biologists. In this commentary, we examine a range
of philosophical and practical issues relevant to understanding the potential
of integrative simulations. We discuss the role of theory and modeling in
different areas of physics and suggest that certain sub-disciplines of physics
provide useful cultural analogies for imagining the future role of simulations
in biological research. We examine philosophical issues related to modeling
which consistently arise in discussions about integrative simulations and
suggest a pragmatic viewpoint that balances a belief in philosophy with the
recognition of the relative infancy of our state of philosophical
understanding. Finally, we discuss community workflow and publication practices
to allow research to be readily discoverable and amenable to incorporation into
simulations. We argue that there are aligned incentives in widespread adoption
of practices which will both advance the needs of integrative simulation
efforts as well as other contemporary trends in the biological sciences,
ranging from open science and data sharing to improving reproducibility.
Enabling Data-Guided Evaluation of Bioinformatics Workflow Quality
Bioinformatics can be divided into two phases: the first is the conversion of raw data into processed data, and the second is the use of processed data to obtain scientific results. It is important to consider the first, “workflow” phase carefully, as there are many paths on the way to a final processed dataset. Some workflow paths may differ enough to influence the second phase, thereby leading to ambiguity in the scientific literature. Workflow evaluation in bioinformatics enables the investigator to plan carefully how to process their data. A system that uses real data to determine the quality of a workflow can be based on the inherent biological relationships in the data itself. To our knowledge, no general software framework that performs real-data-driven evaluation of bioinformatics workflows exists.
The Evaluation and Utility of workFLOW (EUFLOW) decision-theoretic framework, developed and tested on gene expression data, enables users of bioinformatics workflows to evaluate alternative workflow paths using inherent biological relationships. EUFLOW is implemented as an R package. The framework also permits user-specified utility and loss functions, which allows the type of analysis to be considered in the workflow path decision. It was originally developed to assess the quality of identifier mapping services between UniProt accessions and Affymetrix probesets to facilitate integrated analysis [1]. An extension of the framework evaluates Affymetrix probeset filtering methods on real data from endometrial cancer and TCGA ovarian serous carcinoma samples [2]. Further evaluation of RNA-Seq workflow paths demonstrates the generalizability of the EUFLOW framework. Three separate evaluations are performed: 1) identifier filtering of features with biological attributes, 2) threshold parameter selection for low gene count features, and 3) commonly utilized RNA-Seq data workflow paths on The Cancer Genome Atlas data.
The EUFLOW decision-theoretic framework developed and tested in my dissertation enables users of bioinformatics workflows to evaluate alternative workflow paths guided by inherent biological relationships and user utility.
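The decision-theoretic idea can be roughly illustrated as follows. This is a sketch only: EUFLOW's actual R implementation and its utility/loss interface are not reproduced here, the boolean "relationship preserved" encoding is a simplifying assumption, and the path names are hypothetical.

```python
def expected_utility(outcomes, utility=1.0, loss=1.0):
    # outcomes: for each inherent biological relationship checked
    # (e.g., a pair of probesets mapping to the same gene that should
    # correlate), True if the workflow path's output preserved it.
    # User-supplied utility/loss weight the cost of each outcome.
    return sum(utility if ok else -loss for ok in outcomes) / len(outcomes)

def best_workflow_path(paths, utility=1.0, loss=1.0):
    # Pick the workflow path whose output maximizes expected utility
    # over the evaluated biological relationships.
    return max(paths, key=lambda p: expected_utility(paths[p], utility, loss))
```

Raising `loss` relative to `utility` models an analysis where a spurious relationship is costlier than a missed one, which is how a user-guided loss function can change which workflow path wins.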