
    Big data analytics in computational biology and bioinformatics

    Big data analytics in computational biology and bioinformatics refers to an array of operations including biological pattern discovery, classification, prediction, inference, clustering, and data mining in the cloud, among others. This dissertation addresses big data analytics by investigating two important operations, namely pattern discovery and network inference. The dissertation starts by focusing on biological pattern discovery at a genomic scale. Research reveals that the secondary structure of non-coding RNA (ncRNA) is more conserved during evolution than its primary nucleotide sequence. Using a covariance model approach, the stems and loops of an ncRNA secondary structure are represented as a statistical image against which an entire genome can be efficiently scanned for matching patterns. The covariance model approach is then further extended, in combination with a structural clustering algorithm and a random forests classifier, to perform a genome-wide search for similarities in ncRNA tertiary structures. The dissertation then presents methods for gene network inference. Vast bodies of genomic data containing gene and protein expression patterns are now available for analysis, and one challenge is to apply efficient methodologies to uncover more knowledge about cellular functions. Very little is known about how genes regulate cellular activities. A gene regulatory network (GRN) can be represented by a directed graph in which each node is a gene and each edge is a regulatory effect that one gene has on another. By evaluating gene expression patterns, researchers perform in silico data analyses in systems biology, in particular GRN inference, where “reverse engineering” is used to predict how a system works from its output alone. Many algorithmic and statistical approaches have been developed to computationally reverse engineer biological systems, but no known bioinformatics tool is capable of performing perfect GRN inference. Here, extensive experiments are conducted to evaluate and compare recent bioinformatics tools for inferring GRNs from time-series gene expression data. Standard performance metrics for these tools on both simulated and real data sets are generally low, suggesting that further efforts are needed to develop more reliable GRN inference tools. It is also observed that using multiple tools together can help identify true regulatory interactions between genes, a finding consistent with those reported in the literature. Finally, the dissertation discusses and presents a framework for parallelizing GRN inference methods using Apache Hadoop in a cloud environment.
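
    Since the abstract treats a GRN as a directed graph of regulatory effects and evaluates inference tools against standard performance metrics, a minimal sketch of that representation and comparison may help. The edge sets and the evaluate_grn helper below are invented for illustration; they are not the dissertation's code.

```python
# Minimal sketch (not the dissertation's code): a GRN as a set of directed
# edges, plus the precision/recall comparison against a gold-standard network
# that GRN inference benchmarks typically use.

def evaluate_grn(predicted: set[tuple[str, str]],
                 gold: set[tuple[str, str]]) -> dict[str, float]:
    """Compare predicted directed edges (regulator, target) to a gold standard."""
    tp = len(predicted & gold)  # correctly inferred regulatory links
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Each edge (u, v) reads "gene u regulates gene v"; toy data, not real genes.
gold = {("geneA", "geneB"), ("geneB", "geneC"), ("geneA", "geneC")}
predicted = {("geneA", "geneB"), ("geneC", "geneA")}
print(evaluate_grn(predicted, gold))  # precision 0.5, recall ~0.33, f1 0.4
```

    Intersecting the edge sets predicted by several tools is one simple way to realize the abstract's observation that combining multiple tools helps identify true regulatory interactions.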

    Computational thinking in the era of big data biology

    It is fall again, and another class of students has arrived at the Watson School of Biological Sciences at Cold Spring Harbor Laboratory (CSHL). Building on the lab's 100-year history as a leading center for research and education, the Watson School was established in 1998 as a graduate program in biology with a focus on molecular, cellular, and structural biology, as well as neuroscience, cancer, plant biology, and genetics. All students in the program complete the same courses, centered around these research topics, with an emphasis on the principles of scientific reasoning and logic as well as the importance of ethics and effective communication. Three years ago, the curriculum was expanded to include a new course on quantitative biology (QB), and I, along with my co-instructor Mickey Atwal and other members of the QB program, have been teaching it ever since.

    Evaluation of IoT-Based Computational Intelligence Tools for DNA Sequence Analysis in Bioinformatics

    In the contemporary age, Computational Intelligence (CI) plays an essential role in the interpretation of big biological data, since it can support the full range of molecular biology and DNA sequencing computations. To this end, many researchers have implemented competing tools in this field, and determining the best among the enormous number of available options is not an easy task; selecting the one that handles big data in the shortest time and without error can significantly improve a scientist's contribution to the bioinformatics field. This study uses different analyses and methods, such as fuzzy logic, Dempster-Shafer theory, Murphy's rule, and Shannon entropy, to provide the most significant and reliable evaluation of IoT-based computational intelligence tools for DNA sequence analysis. The outcomes of this study can be advantageous to the bioinformatics community, researchers, and experts in big biological data.
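
    As one hedged illustration of how Shannon entropy can weight evaluation criteria in a multi-tool comparison of this kind, consider the sketch below; the criteria, scores, and weighting recipe are assumptions for illustration and are not taken from the study.

```python
# Hedged sketch of Shannon-entropy criterion weighting for ranking tools
# (assumed toy data, not the study's decision matrix).
import math

# rows = tools, columns = criteria (e.g. speed, accuracy, memory); higher is
# better, and all scores are assumed strictly positive so log() is defined.
scores = [
    [0.9, 0.7, 0.6],   # tool A
    [0.6, 0.8, 0.9],   # tool B
    [0.5, 0.9, 0.7],   # tool C
]
m, n = len(scores), len(scores[0])

# Normalize each criterion column so it sums to 1.
col_sums = [sum(row[j] for row in scores) for j in range(n)]
p = [[row[j] / col_sums[j] for j in range(n)] for row in scores]

# Shannon entropy per criterion; low entropy = a more discriminating criterion.
k = 1.0 / math.log(m)
entropy = [-k * sum(p[i][j] * math.log(p[i][j]) for i in range(m))
           for j in range(n)]

# Entropy weights: criteria that separate the tools more get higher weight.
div = [1.0 - e for e in entropy]
weights = [d / sum(div) for d in div]

# Rank tools by weighted score, best first.
ranking = sorted(range(m),
                 key=lambda i: -sum(weights[j] * scores[i][j] for j in range(n)))
print("weights:", weights)
print("ranking (best first):", ranking)
```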

    Stochastic Block Coordinate Frank-Wolfe Algorithm for Large-Scale Biological Network Alignment

    With increasingly "big" data available in biomedical research, deriving accurate and reproducible biological knowledge from such data imposes enormous computational challenges. In this paper, motivated by recently developed stochastic block coordinate algorithms, we propose a highly scalable randomized block coordinate Frank-Wolfe algorithm for convex optimization with general compact convex constraints, which has diverse applications in analyzing biomedical data for a better understanding of cellular and disease mechanisms. We focus on implementing the derived stochastic block coordinate algorithm to align protein-protein interaction networks for identifying conserved functional pathways based on the IsoRank framework. Our derived stochastic block coordinate Frank-Wolfe (SBCFW) algorithm has a convergence guarantee and naturally reduces the computational cost (time and space) of each iteration. Our experiments on querying conserved functional protein complexes in yeast networks confirm the effectiveness of this technique for analyzing large-scale biological networks.
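
    To make the block coordinate idea concrete, here is a minimal sketch of a randomized block coordinate Frank-Wolfe iteration over a product of simplices, used as a stand-in for the paper's general compact convex constraints; the quadratic objective is an assumption, and this is not the authors' SBCFW implementation.

```python
# Hedged sketch of a randomized block coordinate Frank-Wolfe step (not the
# authors' SBCFW code). Feasible set: a product of probability simplices, so
# each iteration needs only one gradient block and a cheap linear minimization
# oracle (LMO) on that block.
import numpy as np

rng = np.random.default_rng(0)
n_blocks, block_dim = 4, 5
A = rng.normal(size=(20, n_blocks * block_dim))  # assumed quadratic objective
b = rng.normal(size=20)                          # f(x) = 0.5 * ||Ax - b||^2

x = np.full(n_blocks * block_dim, 1.0 / block_dim)  # feasible starting point

for t in range(500):
    i = rng.integers(n_blocks)                   # pick one block at random
    blk = slice(i * block_dim, (i + 1) * block_dim)
    grad = A.T @ (A @ x - b)                     # full gradient for clarity; a
                                                 # real implementation caches
                                                 # the residual A @ x - b
    s = np.zeros(block_dim)
    s[np.argmin(grad[blk])] = 1.0                # LMO over a simplex: a vertex
    gamma = 2.0 / (t + 2.0)                      # standard Frank-Wolfe step
    x[blk] += gamma * (s - x[blk])               # convex update stays feasible

print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2)
```

    Because each update touches only one block and the LMO on a simplex reduces to an argmin over its coordinates, the per-iteration cost scales with the block size rather than the full problem dimension, which is the source of the time and space savings the abstract mentions.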

    Tropical Geometry of Phylogenetic Tree Space: A Statistical Perspective

    Phylogenetic trees are the fundamental mathematical representation of evolutionary processes in biology. As data objects, they are characterized by the challenges associated with "big data," as well as the complication that their discrete geometric structure results in a non-Euclidean phylogenetic tree space, which poses computational and statistical limitations. We propose and study a novel framework, based on tropical geometry, for analyzing sets of phylogenetic trees. In particular, we focus on characterizing our framework for statistical analyses of evolutionary biological processes represented by phylogenetic trees. Our setting exhibits analytic, geometric, and topological properties that are desirable for theoretical studies in probability and statistics, as well as increased computational efficiency over the current state of the art. We demonstrate our approach on seasonal influenza data.
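
    For orientation, the distance this line of work typically builds on is the tropical metric on the quotient space R^n / R(1, ..., 1): d_tr(u, v) = max_i (u_i - v_i) - min_i (u_i - v_i). A small sketch with made-up tree vectors follows; it is an illustration of the metric, not the paper's code.

```python
# Hedged sketch of the tropical metric commonly used on the tropical
# projective torus in this setting (toy data, not from the paper).

def tropical_distance(u: list[float], v: list[float]) -> float:
    """d_tr(u, v) = max_i (u_i - v_i) - min_i (u_i - v_i).

    Invariant under adding a constant to all coordinates, which is what
    makes it a metric on the quotient space R^n / R(1, ..., 1).
    """
    diffs = [ui - vi for ui, vi in zip(u, v)]
    return max(diffs) - min(diffs)

# Trees on 4 leaves encoded as vectors of their 6 pairwise leaf distances.
t1 = [2.0, 3.0, 3.0, 3.0, 3.0, 2.0]
t2 = [2.0, 4.0, 4.0, 4.0, 4.0, 2.0]
print(tropical_distance(t1, t2))  # 1.0
```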

    Spectral Sequence Motif Discovery

    Sequence motif discovery tools play a central role in several fields of computational biology. In the framework of transcription factor binding studies, motif finding algorithms of increasingly high performance are required to process the big datasets produced by new high-throughput sequencing technologies. Most existing algorithms are computationally demanding and often cannot support the large size of new experimental data. We present a new motif discovery algorithm built on a recent machine learning technique referred to as the Method of Moments. Based on spectral decompositions, this method is robust under model misspecification and is not prone to locally optimal solutions. We obtain an algorithm that is extremely fast and designed for the analysis of big sequencing data. In a few minutes, we can process datasets of hundreds of thousands of sequences and extract motif profiles that match those computed by various state-of-the-art algorithms.
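
    Very loosely, the spectral flavor can be illustrated with second-order statistics: estimate a letter co-occurrence moment matrix from the sequences and examine its singular structure. The toy sketch below is not the paper's algorithm, which builds carefully chosen moments of a motif mixture model; the sequences and positions here are invented.

```python
# Toy illustration of the spectral idea (not the paper's method): estimate a
# second-order moment matrix of nucleotide occurrences at two positions and
# inspect its leading singular vectors.
import numpy as np

ALPHABET = "ACGT"
sequences = ["ACGTAC", "ACGTTT", "ACGTGG", "TTGTAC"]  # assumed toy reads
pos1, pos2 = 0, 2  # the two positions whose joint letter statistics we examine

# M[a, b] = empirical probability of letter a at pos1 and letter b at pos2.
M = np.zeros((4, 4))
for s in sequences:
    M[ALPHABET.index(s[pos1]), ALPHABET.index(s[pos2])] += 1.0
M /= len(sequences)

# A (near) rank-1 moment matrix suggests one dominant motif at these positions.
U, S, Vt = np.linalg.svd(M)
print(S)               # singular values: spectral evidence of low-rank structure
print(U[:, 0], Vt[0])  # leading factors over the A, C, G, T alphabet
```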

    TauFactor: An open-source application for calculating tortuosity factors from tomographic data

    TauFactor is a MATLAB application for efficiently calculating the tortuosity factor, as well as volume fractions, surface areas, and triple-phase boundary densities, from image-based microstructural data. The tortuosity factor quantifies the apparent decrease in diffusive transport resulting from convolutions of the flow paths through porous media. TauFactor was originally developed to improve the understanding of electrode microstructures for batteries and fuel cells; however, the tortuosity factor has been of interest to a wide range of disciplines for over a century, including geoscience, biology, and optics. It is still common practice to use correlations, such as that developed by Bruggeman, to approximate the tortuosity factor, but in recent years the increasing availability of 3D imaging techniques has spurred interest in calculating this quantity more directly. This tool provides a fast and accurate computational platform applicable to the big datasets (>10^8 voxels) typical of modern tomography, without requiring high computational power.
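
    For orientation, the tortuosity factor tau enters through the relation D_eff = D * epsilon / tau, where epsilon is the volume fraction of the transporting phase. The sketch below assumes a D_eff value in place of the steady-state diffusion simulation TauFactor actually performs, and is written in Python rather than the application's MATLAB.

```python
# Hedged sketch of the quantities TauFactor works with (not its solver): the
# pore volume fraction of a 3D voxel image and the defining relation
# D_eff = (epsilon / tau) * D, rearranged to recover the tortuosity factor.
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((100, 100, 100)) < 0.6   # toy binary microstructure: True = pore

epsilon = img.mean()                      # pore volume fraction
D = 1.0                                   # intrinsic diffusivity (arbitrary units)
D_eff = 0.35                              # assumed here; in practice it comes
                                          # from the steady-state diffusion
                                          # solve that TauFactor performs

tau = epsilon * D / D_eff                 # tortuosity factor
print(f"porosity = {epsilon:.3f}, tortuosity factor = {tau:.3f}")
```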

    APEX2S: A Two-Layer Machine Learning Model for Discovery of host-pathogen protein-protein Interactions on Cloud-based Multiomics Data

    Faced with an avalanche of biological interaction data, computational biology now confronts greater challenges in big data analysis and calls for more studies that mine and integrate cloud-based multiomics data, especially when the data are related to infectious diseases. Meanwhile, machine learning techniques have recently succeeded in a variety of computational biology tasks. In this article, we focus on the study of host-pathogen protein-protein interactions, aiming to apply machine learning techniques to learn from the interaction data and make predictions. A comprehensive and practical workflow for harnessing different cloud-based multiomics data is discussed. In particular, a novel two-layer machine learning model, named APEX2S, is proposed for the discovery of protein-protein interactions. The results show that our model can better learn from and predict the accumulated host-pathogen protein-protein interactions.
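
    The abstract does not detail APEX2S's two layers, so as a generic, hedged illustration of a two-layer (stacked) classifier for interaction prediction, the sketch below uses scikit-learn with invented features and labels; none of it is the authors' model.

```python
# Generic two-layer (stacked) classifier sketch for PPI prediction; the
# APEX2S architecture and its multiomics features are not specified in the
# abstract, so base learners, features, and labels here are all assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-in: each row is a feature vector for one host-pathogen protein
# pair; label 1 = interacting, 0 = not interacting.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Layer 1: heterogeneous base learners; layer 2: a meta-learner that combines
# their out-of-fold predictions.
model = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
)
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```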

    Health Fetishism Among The Nacirema: A Fugue On Jenny Reardon’s The Postgenomic Condition: Ethics, Justice, and Knowledge After The Genome (Chicago University Press, 2017) And Isabelle Stengers’ Another Science Is Possible: A Manifesto For Slow Science (Polity Press, 2018)

    Personalized medicine has become a goal of genomics and of health policy makers. This article reviews two recent books that are highly critical of this approach, finding their arguments thoughtful and important. According to Stengers, biology’s rush to become a science of genome sequences has made it part of the “speculative economy of promise.” Reardon claims that the postgenomic condition is the attempt to find meaning in the troves of data that have been generated. The current paper attempts to extend these arguments by showing that scientific alternatives, such as ecological developmental biology and the tissue organization field theory of cancer, provide evidence that genomic data alone are not sufficient to explain the origins of common disease. What does need to be explained is the intransigence of medical scientists in recognizing other explanatory models besides the “-omics” approaches based on computational algorithms. To this end, various notions of commodity and religious fetishism are used. This is not to say that there is no place for Big Data and genomics; rather, these methodologies should take a definite place among others. These books suggest that Big Data genomics is like the cancer it is supposed to conquer: it has expanded unregulated and threatens to kill the body in which it arose.