711 research outputs found

    Big data analytics in computational biology and bioinformatics

    Big data analytics in computational biology and bioinformatics refers to an array of operations including biological pattern discovery, classification, prediction, inference, clustering, and data mining in the cloud, among others. This dissertation addresses big data analytics by investigating two important operations, namely pattern discovery and network inference. The dissertation starts by focusing on biological pattern discovery at a genomic scale. Research reveals that the secondary structure in non-coding RNA (ncRNA) is more conserved during evolution than its primary nucleotide sequence. Using a covariance model approach, the stems and loops of an ncRNA secondary structure are represented as a statistical image against which an entire genome can be efficiently scanned for matching patterns. The covariance model approach is then further extended, in combination with a structural clustering algorithm and a random forests classifier, to perform genome-wide search for similarities in ncRNA tertiary structures. The dissertation then presents methods for gene network inference. Vast bodies of genomic data containing gene and protein expression patterns are now available for analysis. One challenge is to apply efficient methodologies to uncover more knowledge about cellular functions. Very little is known about how genes regulate cellular activities. A gene regulatory network (GRN) can be represented by a directed graph in which each node is a gene and each edge or link is a regulatory effect that one gene has on another gene. By evaluating gene expression patterns, researchers perform in silico data analyses in systems biology, in particular GRN inference, where "reverse engineering" means predicting how a system works by looking at the system output alone. Many algorithmic and statistical approaches have been developed to computationally reverse engineer biological systems. However, there are no known bioinformatics tools capable of performing perfect GRN inference. Here, extensive experiments are conducted to evaluate and compare recent bioinformatics tools for inferring GRNs from time-series gene expression data. Standard performance metrics for these tools, based on both simulated and real data sets, are generally low, suggesting that further efforts are needed to develop more reliable GRN inference tools. It is also observed that using multiple tools together can help identify true regulatory interactions between genes, a finding consistent with those reported in the literature. Finally, the dissertation discusses and presents a framework for parallelizing GRN inference methods using Apache Hadoop in a cloud environment.
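The directed-graph representation of a GRN and the idea of combining several inference tools can be illustrated with a short sketch. This is not the dissertation's procedure: the majority-vote rule, the tool names (`tool_a`, `tool_b`, `tool_c`), the gene names, and the reference edge set are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def consensus_edges(predictions, min_votes=2):
    """Keep directed edges (regulator, target) predicted by at least min_votes tools."""
    votes = Counter(edge for edges in predictions.values() for edge in set(edges))
    return {edge for edge, n in votes.items() if n >= min_votes}

def to_adjacency(edges):
    """Store the directed graph as {regulator: sorted list of targets}."""
    graph = {}
    for regulator, target in edges:
        graph.setdefault(regulator, []).append(target)
    return {gene: sorted(targets) for gene, targets in graph.items()}

def precision_recall(predicted, truth):
    """Standard edge-level precision and recall against a reference network."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical outputs of three inference tools (assumed data, not real results).
predictions = {
    "tool_a": {("g1", "g2"), ("g2", "g3")},
    "tool_b": {("g1", "g2"), ("g3", "g1")},
    "tool_c": {("g1", "g2"), ("g2", "g3"), ("g1", "g3")},
}
# Hypothetical reference network used only to score the consensus.
truth = {("g1", "g2"), ("g2", "g3"), ("g3", "g4")}

consensus = consensus_edges(predictions, min_votes=2)
adjacency = to_adjacency(consensus)
precision, recall = precision_recall(consensus, truth)
```

Raising `min_votes` trades recall for precision, which is one simple way to explore why agreement between tools may help isolate true interactions.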

    Grouped graphical Granger modeling for gene expression regulatory networks discovery

    We consider the problem of discovering gene regulatory networks from time-series microarray data. Recently, graphical Granger modeling has gained considerable attention as a promising direction for addressing this problem. These methods apply graphical modeling methods to time-series data and invoke the notion of 'Granger causality' to make assertions on causality through inference on time-lagged effects. Existing algorithms, however, have neglected an important aspect of the problem: the group structure among the lagged temporal variables naturally imposed by the time series they belong to. Specifically, existing methods in computational biology share this shortcoming, as well as additional computational limitations, which prevent their effective application to large datasets with many genes and many data points. In the present article, we propose a novel methodology, which we term the 'grouped graphical Granger modeling method', that overcomes the limitations mentioned above by applying a regression method suited for high-dimensional and large data, and by leveraging the group structure among the lagged temporal variables according to the time series they belong to. We demonstrate the effectiveness of the proposed methodology on both simulated and actual gene expression data, specifically the human cancer cell (HeLa S3) cycle data. The simulation results show that the proposed methodology generally exhibits higher accuracy in recovering the underlying causal structure. The results on the gene expression data demonstrate that it leads to improved accuracy with respect to prediction of known links, and also uncovers additional causal relationships not captured by earlier works.
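The group structure described above can be made concrete with a small sketch. The abstract does not name the regression method, so a group lasso solved by proximal gradient descent is assumed here as one regression that selects or drops all lags of a source series together. The function names, the step-size choice, and the penalty value `lam` are illustrative assumptions, and the code uses plain Python lists to stay dependency-free.

```python
import math

def lagged_design(series, max_lag):
    """series: list of p equal-length lists. Columns are ordered so that all lags
    of the same source series are contiguous and form one group."""
    p, T = len(series), len(series[0])
    rows = [[series[j][t - lag] for j in range(p) for lag in range(1, max_lag + 1)]
            for t in range(max_lag, T)]
    targets = [s[max_lag:] for s in series]
    groups = [list(range(j * max_lag, (j + 1) * max_lag)) for j in range(p)]
    return rows, targets, groups

def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrinks the whole group, or zeroes it."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= t:
        return [0.0] * len(v)
    return [(1.0 - t / norm) * x for x in v]

def group_lasso(rows, y, groups, lam, n_iter=500):
    """Minimise (1/2n)||y - Zb||^2 + lam * sum_g ||b_g||_2 by proximal gradient."""
    n, d = len(rows), len(rows[0])
    # Conservative step size: the squared Frobenius norm bounds the squared spectral norm.
    step = n / sum(v * v for r in rows for v in r)
    beta = [0.0] * d
    for _ in range(n_iter):
        resid = [sum(r[k] * beta[k] for k in range(d)) - y[i] for i, r in enumerate(rows)]
        grad = [sum(rows[i][k] * resid[i] for i in range(n)) / n for k in range(d)]
        b = [beta[k] - step * grad[k] for k in range(d)]
        for g in groups:
            for k, v in zip(g, group_soft_threshold([b[k] for k in g], step * lam)):
                b[k] = v
        beta = b
    return beta

def granger_edges(series, max_lag, lam):
    """Edge (j, i) is reported when any lag of series j is kept in the model for series i."""
    rows, targets, groups = lagged_design(series, max_lag)
    edges = set()
    for i, y in enumerate(targets):
        beta = group_lasso(rows, y, groups, lam)
        for j, g in enumerate(groups):
            if j != i and any(beta[k] != 0.0 for k in g):
                edges.add((j, i))
    return edges
```

Because the penalty acts on the norm of each group, a source series enters or leaves the model as a unit, which is the property an ordinary lasso over individual lagged variables lacks.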

    Dynamic Analysis of High Dimensional Microarray Time Series Data Using Various Dimensional Reduction Methods

    This dissertation focuses on dynamic analysis of reduced dimension models of two microarray time series datasets. The underlying research has two main objectives: (1) applying various dimension reduction techniques to time series microarray data, and (2) estimating autoregressive coefficients using several penalized regression methods such as ridge, SCAD, and lasso. The research methodology includes two research tasks. The first is to apply several dimension reduction methods to two microarray data sets and to compare the resulting models in terms of accuracy and computation cost. The second is to apply the sparse vector autoregressive (SVAR) model to estimate the gene regulatory network from gene expression profiles obtained in time series microarray experiments on the two datasets, with the autoregressive coefficients estimated using several penalized regression methods, and then to compare the regression methods for each dimension reduction model. Study results show that the dimension reduction methods producing orthogonal independent variables perform better, because orthogonality leads to reasonable coefficient estimation with low standard errors. Regarding dynamic analysis, factor analysis (FA) outperformed the other dimension reduction methods with regard to goodness of fit after applying several penalized regression methods to each model. This is attributed to the use of varimax rotation in FA, in which most of the coordinates are set closer to zero, which in turn makes the data sparser, thereby inducing additional sparsity while maintaining a certain goodness of fit.
    Industrial Engineering & Managemen
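As a rough illustration of estimating sparse autoregressive coefficients with a penalized regression, the sketch below fits a first-order vector autoregression with a lasso penalty by cyclic coordinate descent. The model order, the choice of lasso (ridge and SCAD are not shown), the function names, and the penalty value `lam` are assumptions made for illustration and are not taken from the dissertation.

```python
def soft_threshold(z, t):
    """Lasso shrinkage: move z toward zero by t, clipping at zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(rows, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, d = len(rows), len(rows[0])
    beta = [0.0] * d
    col_sq = [sum(r[k] ** 2 for r in rows) / n for k in range(d)]
    resid = list(y)  # residual y - Xb, which equals y while b is all zeros
    for _ in range(n_iter):
        for k in range(d):
            if col_sq[k] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column k with the residual that excludes k's own contribution.
            rho = sum(r[k] * (e + r[k] * beta[k]) for r, e in zip(rows, resid)) / n
            new = soft_threshold(rho, lam) / col_sq[k]
            if new != beta[k]:
                resid = [e - r[k] * (new - beta[k]) for r, e in zip(rows, resid)]
                beta[k] = new
    return beta

def sparse_var1(series, lam):
    """Return A where A[i][j] is the estimated effect of variable j at time t-1
    on variable i at time t. Nonzero entries suggest candidate network edges j -> i."""
    p, T = len(series), len(series[0])
    rows = [[series[j][t - 1] for j in range(p)] for t in range(1, T)]
    return [lasso_cd(rows, series[i][1:], lam) for i in range(p)]
```

With `lam = 0` the update reduces to ordinary least squares one coordinate at a time, and increasing `lam` drives more coefficients exactly to zero, which is the sense in which the fitted network becomes sparse.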