118,073 research outputs found
Big data analytics in computational biology and bioinformatics
Big data analytics in computational biology and bioinformatics refers to an array of operations including biological pattern discovery, classification, prediction, inference, clustering as well as data mining in the cloud, among others. This dissertation addresses big data analytics by investigating two important operations, namely pattern discovery and network inference.
The dissertation starts by focusing on biological pattern discovery at a genomic scale. Research reveals that the secondary structure in non-coding RNA (ncRNA) is more conserved during evolution than its primary nucleotide sequence. Using a covariance model approach, the stems and loops of an ncRNA secondary structure are represented as a statistical image against which an entire genome can be efficiently scanned for matching patterns. The covariance model approach is then further extended, in combination with a structural clustering algorithm and a random forests classifier, to perform genome-wide search for similarities in ncRNA tertiary structures.
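As a greatly simplified illustration of the scanning idea described above: a real covariance model is a stochastic context-free grammar that scores secondary structure, not a simple positional profile, but the following toy sketch shows how a statistical profile can be slid along a genome to flag matching windows. All names and values are invented.

```python
# Illustrative sketch only: a toy log-odds profile scan over a genome.
# A true covariance model additionally scores base-paired stems and loops.

import math

# Toy positional model: per-position nucleotide probabilities (hypothetical).
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = 0.25  # uniform background frequency

def log_odds(window):
    """Log-odds score of a window against the profile."""
    return sum(math.log(profile[i][base] / background)
               for i, base in enumerate(window))

def scan(genome, threshold=1.0):
    """Return (position, score) for windows scoring above threshold."""
    w = len(profile)
    hits = []
    for pos in range(len(genome) - w + 1):
        score = log_odds(genome[pos:pos + w])
        if score > threshold:
            hits.append((pos, round(score, 2)))
    return hits

hits = scan("TTAGCAGCTT")
print(hits)  # the two AGC windows score highly
```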
The dissertation then presents methods for gene network inference. Vast bodies of genomic data containing gene and protein expression patterns are now available for analysis. One challenge is to apply efficient methodologies to uncover more knowledge about cellular functions, since very little is known about how genes regulate cellular activities. A gene regulatory network (GRN) can be represented by a directed graph in which each node is a gene and each edge is a regulatory effect that one gene has on another. By evaluating gene expression patterns, researchers perform in silico analyses in systems biology, in particular GRN inference, a form of "reverse engineering" that predicts how a system works by examining its output alone.
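The directed-graph representation of a GRN described above can be sketched with a simple adjacency dictionary. The gene names and regulatory signs below are hypothetical examples, not taken from the dissertation.

```python
# Minimal sketch of a gene regulatory network as a directed graph:
# nodes are genes, signed edges are regulatory effects.

grn = {}  # regulator gene -> {target gene: effect}

def add_regulation(regulator, target, effect):
    """Record that `regulator` activates (+1) or represses (-1) `target`."""
    grn.setdefault(regulator, {})[target] = effect

add_regulation("geneA", "geneB", +1)   # A activates B
add_regulation("geneA", "geneC", -1)   # A represses C
add_regulation("geneB", "geneC", +1)   # B activates C

def regulators_of(gene):
    """All genes with an edge into `gene`."""
    return sorted(g for g, targets in grn.items() if gene in targets)

print(regulators_of("geneC"))  # ['geneA', 'geneB']
```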
Many algorithmic and statistical approaches have been developed to computationally reverse engineer biological systems. However, no known bioinformatics tool is capable of performing perfect GRN inference. Here, extensive experiments are conducted to evaluate and compare recent bioinformatics tools for inferring GRNs from time-series gene expression data. Standard performance metrics for these tools on both simulated and real data sets are generally low, suggesting that further efforts are needed to develop more reliable GRN inference tools. It is also observed that using multiple tools together can help identify true regulatory interactions between genes, a finding consistent with those reported in the literature. Finally, the dissertation discusses and presents a framework for parallelizing GRN inference methods using Apache Hadoop in a cloud environment.
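One simple way to realize the observation that multiple tools together help identify true interactions is a majority-vote consensus over each tool's predicted edge set. The tool names and predicted edges below are invented for illustration; the dissertation does not specify this particular scheme.

```python
# Hedged sketch: majority-vote consensus over edge predictions from
# several hypothetical GRN inference tools.

from collections import Counter

predictions = {
    "tool1": {("g1", "g2"), ("g2", "g3"), ("g1", "g3")},
    "tool2": {("g1", "g2"), ("g2", "g3")},
    "tool3": {("g1", "g2"), ("g3", "g1")},
}

def consensus(preds, min_votes=2):
    """Keep directed edges predicted by at least `min_votes` tools."""
    votes = Counter(edge for edges in preds.values() for edge in edges)
    return {e for e, n in votes.items() if n >= min_votes}

print(sorted(consensus(predictions)))  # [('g1', 'g2'), ('g2', 'g3')]
```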
Computational thinking in the era of big data biology
It is fall again, and another class of students has arrived in the Watson School of Biological Sciences at Cold Spring Harbor Laboratory (CSHL). Building on the lab's 100-year history as a leading center for research and education, the Watson School was established in 1998 as a graduate program in biology with a focus on molecular, cellular, and structural biology, as well as neuroscience, cancer, plant biology, and genetics. All students in the program complete the same courses, centered around these research topics, with an emphasis on the principles of scientific reasoning and logic, as well as the importance of ethics and effective communication. Three years ago the curriculum was expanded to include a new course on quantitative biology (QB), and I, along with my co-instructor Mickey Atwal and other members of the QB program, have been teaching it ever since.
Evaluation of IoT-Based Computational Intelligence Tools for DNA Sequence Analysis in Bioinformatics
In the contemporary era, computational intelligence (CI) plays an essential role in the interpretation of big biological data, since it can support molecular biology and DNA sequencing computations. To this end, many researchers have implemented competing tools in this field. Determining the best among the enormous number of available tools is therefore not an easy task, yet selecting the one that handles big data quickly and without error can significantly improve a scientist's contribution to bioinformatics. This study uses several analysis methods, namely fuzzy logic, Dempster-Shafer, Murphy, and Shannon entropy, to provide a significant and reliable evaluation of IoT-based computational intelligence tools for DNA sequence analysis. The outcomes of this study can benefit the bioinformatics community, researchers, and experts in big biological data.
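The abstract names Shannon entropy among its evaluation methods; a minimal sketch of the entropy-weight scheme commonly used in multi-criteria tool evaluation is shown below. The 3x2 decision matrix (three tools, two criteria) is invented, and this is not claimed to be the study's exact procedure.

```python
# Hedged sketch of the entropy-weight method: criteria whose values vary
# more across tools (lower Shannon entropy) receive higher weight.

import math

matrix = [          # hypothetical tools x criteria (accuracy, runtime)
    [0.9, 120.0],
    [0.8, 100.0],
    [0.7, 300.0],
]

def entropy_weights(m):
    """Return normalized per-criterion weights from Shannon entropy."""
    n, k = len(m), len(m[0])
    raw = []
    for j in range(k):
        col_sum = sum(row[j] for row in m)
        p = [row[j] / col_sum for row in m]
        e = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(n)
        raw.append(1 - e)       # divergence degree of criterion j
    total = sum(raw)
    return [w / total for w in raw]

w = entropy_weights(matrix)
print([round(x, 3) for x in w])  # runtime varies more, so it dominates
```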
Stochastic Block Coordinate Frank-Wolfe Algorithm for Large-Scale Biological Network Alignment
With increasingly "big" data available in biomedical research, deriving accurate and reproducible biological knowledge from such data imposes enormous computational challenges. In this paper, motivated by recently developed stochastic block coordinate algorithms, we propose a highly scalable randomized block coordinate Frank-Wolfe algorithm for convex optimization with general compact convex constraints, which has diverse applications in analyzing biomedical data for a better understanding of cellular and disease mechanisms. We focus on implementing the derived stochastic block coordinate algorithm to align protein-protein interaction networks for identifying conserved functional pathways based on the IsoRank framework. Our derived stochastic block coordinate Frank-Wolfe (SBCFW) algorithm has a convergence guarantee and naturally reduces the computational cost (time and space) of each iteration. Our experiments on querying conserved functional protein complexes in yeast networks confirm the effectiveness of this technique for analyzing large-scale biological networks.
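To ground the abstract above: the classic Frank-Wolfe iteration solves a linear minimization over the constraint set at each step and moves toward the resulting vertex; the paper's SBCFW variant randomizes this over coordinate blocks. The sketch below is the plain (non-block) method on the probability simplex with an invented quadratic objective, not the paper's algorithm.

```python
# Hedged sketch: classic Frank-Wolfe on the probability simplex.
# Linear minimization over the simplex returns a one-hot vertex.

def frank_wolfe(grad, dim, iters=200):
    """Minimize a convex f over the simplex given its gradient oracle."""
    x = [1.0 / dim] * dim                       # feasible start
    for t in range(iters):
        g = grad(x)
        j = min(range(dim), key=lambda i: g[i]) # best vertex e_j
        gamma = 2.0 / (t + 2.0)                 # standard step size
        x = [(1 - gamma) * xi for xi in x]      # move toward e_j
        x[j] += gamma
    return x

# Example: f(x) = ||x - b||^2 with b inside the simplex, grad = 2(x - b).
b = [0.5, 0.3, 0.2]
x = frank_wolfe(lambda x: [2 * (xi - bi) for xi, bi in zip(x, b)], 3)
print([round(v, 2) for v in x])  # approaches [0.5, 0.3, 0.2]
```

Each iterate stays a convex combination of simplex vertices, so feasibility never needs a projection step; this is the property that makes block-coordinate variants cheap per iteration.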
Tropical Geometry of Phylogenetic Tree Space: A Statistical Perspective
Phylogenetic trees are the fundamental mathematical representation of evolutionary processes in biology. As data objects, they are characterized by the challenges associated with "big data," as well as the complication that their discrete geometric structure results in a non-Euclidean phylogenetic tree space, which poses computational and statistical limitations. We propose and study a novel framework for analyzing sets of phylogenetic trees based on tropical geometry. In particular, we focus on characterizing our framework for statistical analyses of evolutionary biological processes represented by phylogenetic trees. Our setting exhibits analytic, geometric, and topological properties that are desirable for theoretical studies in probability and statistics, as well as increased computational efficiency over the current state of the art. We demonstrate our approach on seasonal influenza data.
Comment: 28 pages, 5 figures, 1 table
Spectral Sequence Motif Discovery
Sequence discovery tools play a central role in several fields of computational biology. In the framework of transcription factor binding studies, motif finding algorithms of increasingly high performance are required to process the big datasets produced by new high-throughput sequencing technologies. Most existing algorithms are computationally demanding and often cannot handle the large size of new experimental data. We present a new motif discovery algorithm built on a recent machine learning technique referred to as the method of moments. Based on spectral decompositions, this method is robust under model misspecification and is not prone to locally optimal solutions. We obtain an algorithm that is extremely fast and designed for the analysis of big sequencing data. In a few minutes, we can process datasets of hundreds of thousands of sequences and extract motif profiles that match those computed by various state-of-the-art algorithms.
Comment: 20 pages, 3 figures, 1 table
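The "motif profiles" compared across algorithms are typically position frequency matrices over aligned motif occurrences. The sketch below builds such a profile; it is not the paper's spectral method of moments, and the occurrence list is invented.

```python
# Hedged sketch: build a motif profile (per-position nucleotide
# frequencies) from equal-length aligned occurrences.

occurrences = ["TATA", "TATA", "TACA", "TATA"]  # hypothetical motif hits

def profile(seqs):
    """Per-position nucleotide frequency dictionaries."""
    n = len(seqs)
    return [{b: col.count(b) / n for b in "ACGT"} for col in zip(*seqs)]

p = profile(occurrences)
print(p[2]["T"])  # fraction of T at the third position -> 0.75
```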
TauFactor: An open-source application for calculating tortuosity factors from tomographic data
TauFactor is a MATLAB application for efficiently calculating the tortuosity factor, as well as volume fractions, surface areas, and triple phase boundary densities, from image-based microstructural data. The tortuosity factor quantifies the apparent decrease in diffusive transport resulting from convolutions of the flow paths through porous media. TauFactor was originally developed to improve the understanding of electrode microstructures for batteries and fuel cells; however, the tortuosity factor has been of interest to a wide range of disciplines for over a century, including geoscience, biology, and optics. It is still common practice to use correlations, such as that developed by Bruggeman, to approximate the tortuosity factor, but in recent years the increasing availability of 3D imaging techniques has spurred interest in calculating this quantity more directly. This tool provides a fast and accurate computational platform applicable to the big datasets (>10^8 voxels) typical of modern tomography, without requiring high computational power.
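The tortuosity factor relates effective diffusivity through a porous phase to its volume fraction via D_eff = D * eps / tau, so tau = eps * D / D_eff. The sketch below computes the volume fraction of a tiny invented voxel volume and applies that relation; TauFactor itself obtains D_eff by solving the steady-state diffusion equation on the real tomographic data.

```python
# Hedged sketch: volume fraction from segmented voxel data, then the
# tortuosity factor via tau = eps * D / D_eff. All numbers are invented.

voxels = [  # 2x2x2 toy volume, True = transport phase
    [[True, False], [True, True]],
    [[False, False], [True, True]],
]

def volume_fraction(vol):
    """Fraction of voxels belonging to the transport phase."""
    flat = [v for plane in vol for row in plane for v in row]
    return sum(flat) / len(flat)

eps = volume_fraction(voxels)   # 5 of 8 voxels -> 0.625
D, D_eff = 1.0, 0.25            # bulk and (hypothetical) effective diffusivity
tau = eps * D / D_eff
print(eps, tau)                 # 0.625 2.5
```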
APEX2S: A Two-Layer Machine Learning Model for Discovery of host-pathogen protein-protein Interactions on Cloud-based Multiomics Data
Faced with an avalanche of biological interaction data, computational biology now confronts greater challenges in big data analysis and calls for more studies to mine and integrate cloud-based multiomics data, especially when the data relate to infectious diseases. Meanwhile, machine learning techniques have recently succeeded in various computational biology tasks. In this article, we focus on the study of host-pathogen protein-protein interactions, aiming to apply machine learning techniques to learn from the interaction data and make predictions. A comprehensive and practical workflow for harnessing different cloud-based multiomics data is discussed. In particular, a novel two-layer machine learning model, named APEX2S, is proposed for discovering protein-protein interactions. The results show that our model can better learn and predict from the accumulated host-pathogen protein-protein interactions.
Health Fetishism Among The Nacirema: A Fugue On Jenny Reardon’s The Postgenomic Condition: Ethics, Justice, and Knowledge After The Genome (Chicago University Press, 2017) And Isabelle Stengers’ Another Science Is Possible: A Manifesto For Slow Science (Polity Press, 2018)
Personalized medicine has become a goal of genomics and of health policy makers. This article reviews two recent books that are highly critical of this approach, finding their arguments very thoughtful and important. According to Stengers, biology's rush to become a science of genome sequences has made it part of the "speculative economy of promise." Reardon claims that the postgenomic condition is the attempt to find meaning in all the troves of data that have been generated. The current paper attempts to extend these arguments by showing that scientific alternatives such as ecological developmental biology and the tissue organization field theory of cancer provide evidence demonstrating that genomic data alone are not sufficient to explain the origins of common disease. What does need to be explained is the intransigence of medical scientists in recognizing other explanatory models besides the "-omics" approaches based on computational algorithms. To this end, various notions of commodity and religious fetishism are used. This is not to say that there is no place for Big Data and genomics. Rather, these methodologies should have a definite place among others. These books suggest that Big Data genomics is like the cancer it is supposed to conquer: it has expanded unregulated and threatens to kill the body in which it arose.