165,721 research outputs found

    Differential expression analysis with global network adjustment

    Background: Large-scale chromosomal deletions or other non-specific perturbations of the transcriptome can alter the expression of hundreds or thousands of genes, and it is of biological interest to understand which genes are most profoundly affected. We present a method for predicting a gene's expression as a function of other genes, thereby accounting for the effect of transcriptional regulation that confounds the identification of genes differentially expressed relative to a regulatory network. The challenge in constructing such models is that the number of possible regulator transcripts within a global network is on the order of thousands, and the number of biological samples is typically on the order of 10. Nevertheless, there are large gene expression databases that can be used to construct networks that could be helpful in modeling transcriptional regulation in smaller experiments.
    Results: We demonstrate a type of penalized regression model that can be estimated from large gene expression databases, and then applied to smaller experiments. The ridge parameter is selected by minimizing the cross-validation error of the predictions in the independent out-sample. This tends to increase the model stability and leads to a much greater degree of parameter shrinkage, but the resulting biased estimation is mitigated by a second round of regression. Nevertheless, the proposed computationally efficient "over-shrinkage" method outperforms previously used LASSO-based techniques. In two independent datasets, we find that the median proportion of explained variability in expression is approximately 25%, and this results in a substantial increase in the signal-to-noise ratio, allowing more powerful inferences on differential gene expression and leading to biologically intuitive findings. We also show that a large proportion of gene dependencies are conditional on the biological state, a result that would be impossible to obtain with standard differential expression methods.
    Conclusions: By adjusting for the effects of the global network on individual genes, both the sensitivity and reliability of differential expression measures are greatly improved.
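    The selection of a ridge parameter by out-sample prediction error, as described in the abstract above, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic data, the penalty grid, and the helper names `ridge_fit` and `select_lambda` are hypothetical, and the sketch omits the second round of regression the authors mention.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def select_lambda(X_train, y_train, X_out, y_out, grid):
    """Pick the penalty in `grid` with the lowest squared error on the out-sample."""
    errors = []
    for lam in grid:
        beta = ridge_fit(X_train, y_train, lam)
        errors.append(float(np.mean((y_out - X_out @ beta) ** 2)))
    return grid[int(np.argmin(errors))], errors

# Synthetic stand-in data: a large "database" and a small independent experiment.
rng = np.random.default_rng(0)
n_train, n_out, p = 200, 10, 50
beta_true = rng.normal(scale=0.2, size=p)
X_train = rng.normal(size=(n_train, p))
y_train = X_train @ beta_true + rng.normal(size=n_train)
X_out = rng.normal(size=(n_out, p))
y_out = X_out @ beta_true + rng.normal(size=n_out)

grid = [0.1, 1.0, 10.0, 100.0, 1000.0]  # hypothetical penalty grid
best_lam, errors = select_lambda(X_train, y_train, X_out, y_out, grid)
print("selected penalty:", best_lam)
```

    Larger penalties shrink the coefficient vector more strongly, which is the trade-off the out-sample error is used to balance.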

    The KM-Algorithm Identifies Regulated Genes in Time Series Expression Data

    We present a statistical method to rank observed genes in gene expression time series experiments according to their degree of regulation in a biological process. The ranking may be used to focus on specific genes or to select meaningful subsets of genes from which gene regulatory networks can be built. Our approach is based on a state space model that incorporates hidden regulators of gene expression. Kalman (K) smoothing and maximum (M) likelihood estimation techniques are used to derive optimal estimates of the model parameters, upon which a proposed regulation criterion is based. The statistical power of the proposed algorithm is investigated, and a real data set is analyzed for the purpose of identifying regulated genes in time-dependent gene expression data. This statistical approach supports the concept that meaningful biological conclusions can be drawn from gene expression time series experiments by focusing on strong regulation rather than large expression values.

    Dynamic modeling of gene expression in prokaryotes: application to glucose-lactose diauxie in Escherichia coli

    Coexpression of genes or, more generally, similarity in the expression profiles poses an insurmountable obstacle to inferring the gene regulatory network (GRN) based solely on data from DNA microarray time series. Clustering of genes with similar expression profiles allows for a coarse-grained view of the GRN and a probabilistic determination of the connectivity among the clusters. We present a model for the temporal evolution of a gene cluster network which takes into account interactions of gene products with genes and, through a non-constant degradation rate, with other gene products. The number of model parameters is reduced by using polynomial functions to interpolate temporal data points. In this manner, the task of parameter estimation is reduced to a system of linear algebraic equations, thus making the computation time shorter by orders of magnitude. To eliminate irrelevant networks, we test each GRN for stability with respect to parameter variations, and impose restrictions on its behavior near the steady state. We apply our model and methods to DNA microarray time series data collected on Escherichia coli during glucose-lactose diauxie and infer the most probable cluster network for different phases of the experiment. Comment: 20 pages, 4 figures; Systems and Synthetic Biology 5 (2011
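    The idea of interpolating time points with a polynomial so that parameter estimation becomes a linear algebra problem can be sketched on a toy system. The one-variable model dx/dt = a*x + b, its parameter values, and the polynomial degree below are assumptions chosen for illustration only; they are not the cluster network model of the paper.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical toy dynamics: dx/dt = a*x + b, with a = -0.5 and b = 1.0.
# Its exact solution x(t) = exp(a*t) + 2 is sampled to stand in for measurements.
a_true, b_true = -0.5, 1.0
t = np.linspace(0.0, 4.0, 30)
x = np.exp(a_true * t) + 2.0

# Step 1: interpolate the sampled time course with a polynomial.
poly = Polynomial.fit(t, x, deg=6)

# Step 2: differentiate the polynomial analytically to estimate dx/dt.
dxdt = poly.deriv()(t)

# Step 3: the model is linear in (a, b), so solve dxdt = a*x + b by least squares.
design = np.column_stack([x, np.ones_like(t)])
coef, *_ = np.linalg.lstsq(design, dxdt, rcond=None)
a_hat, b_hat = coef
print("estimated a, b:", a_hat, b_hat)
```

    No numerical integration of the differential equation is needed, which is where the saving in computation time comes from in this kind of approach.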

    Modeling Gene Regulatory Networks from Time Series Data using Particle Filtering

    This thesis considers the problem of learning the structure of gene regulatory networks using gene expression time series data. A more realistic scenario, where the state space model representing a gene network evolves nonlinearly, is considered, while a linear model is assumed for the microarray data. To capture the nonlinearity, a particle-filter-based state estimation algorithm is studied instead of the contemporary approaches based on linear approximation. The parameters signifying the regulatory relations among various genes are estimated online using a Kalman filter. Since a particular gene interacts with only a few other genes, the parameter vector is expected to be sparse. The state estimates delivered by the particle filter and the observed microarray data are then fed to a LASSO-based least squares regression operation, which yields a parsimonious and efficient description of the regulatory network by setting the irrelevant coefficients to zero. The performance of the aforementioned algorithm is compared with extended Kalman filtering (EKF), employing mean square error as the fidelity criterion, using synthetic data and real biological data. Extensive computer simulations illustrate that the particle-filter-based gene network inference algorithm outperforms EKF and can therefore serve as a natural framework for modeling gene regulatory networks.
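    How a LASSO penalty sets irrelevant regression coefficients to zero can be shown with a small sketch. This uses proximal gradient descent (ISTA) on synthetic data as a stand-in solver; the thesis does not say which solver it uses, and the data, the penalty value, and the function names `soft_threshold` and `lasso_ista` are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Shrink each entry toward zero by `thresh`, clipping small entries to exactly zero."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize (1/(2n))*||y - X w||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / n bounds the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Synthetic example: only the first two of ten candidate regulators matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[0], w_true[1] = 2.0, -3.0
y = X @ w_true + 0.1 * rng.normal(size=100)

w_hat = lasso_ista(X, y, lam=0.5)
print(np.round(w_hat, 3))
```

    The surviving coefficients are biased toward zero by the penalty, so in practice the selected support is often refit without the penalty.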

    Efficient gene set analysis of high-throughput data: From omics to pathway architecture of health and disease

    Background: A wide range of diseases, normal variations in physiology and development of different species are caused by alterations in gene regulation. The study of gene expression is thus crucial for understanding both normal physiology and disease mechanisms. High-throughput measurement technologies allow the profiling of tens of thousands of genes simultaneously. However, the high volume of data thus generated poses methodological challenges in inferring biological consequences from gene expression changes. Traditional gene-wise analysis of high-dimensional data is overwhelming, prone to noise and unintuitive. The analysis of sets of genes (gene set analysis, GSA) addresses the problem by boosting statistical power and biological interpretability. Despite more than a decade of research on gene set analysis, there are still serious limitations in the existing methods.
    Aims of the study: The objectives of this study were: (1) development of an efficient p-value estimation method for GSA; (2) development of an advanced permutation method for GSA of multi-group gene expression data with fewer replicates; and (3) implementation of the developed methods for the identification of novel smoking-induced epigenetic signatures at the biological pathway level.
    Materials and methods: The first study involved the assessment of four different statistical null models for modeling the distribution of gene set scores calculated with the Gene Set Z-score (GSZ) function from permuted gene expression data. A new GSA method, modified GSZ (mGSZ), based on GSZ and the optimal distribution model, was developed. mGSZ was evaluated by comparing its results with seven other popular GSA methods using four different publicly available gene expression datasets. The second study involved the evaluation of six different permutation schemes for GSA of multi-group (more than two groups) datasets based on the identification of reference gene sets generated using a novel data splitting approach. A new GSA method based on a modification of mGSZ (mGSZm) was developed by implementing the best permutation method for the analysis of multi-group data with fewer than six replicates per group. mGSZm was evaluated by contrasting its performance with seven other state-of-the-art GSA methods suitable for multi-group data. The evaluation was based on three different publicly available multi-group datasets. The third study involved an implementation of mGSZ for GSA of genome-wide DNA methylation data from the Cardiovascular Risk in Young Finns study (YFS) cohort with gene sets downloaded from the Molecular Signature Database (MSigDB). Methylation measurements were done on a subset of 192 individuals from whole-blood samples from the 2011 follow-up study using Illumina Infinium HumanMethylation450 BeadChips.
    Results: Overall, efficient and robust GSA methods were developed (studies I-II) and implemented (study III). In study I, the results demonstrated a clear advantage of asymptotic p-value estimation over empirical methods. mGSZ, a GSA method based on asymptotic p-values, requires fewer permutations, which speeds up the analysis process. mGSZ outperformed state-of-the-art methods based on three different evaluations with three different datasets. In study II, results from a novel evaluation approach with two different datasets suggested that the proposed advanced permutation method outperformed the naive permutation method in GSA of multi-group data with fewer than six replicates. Evaluation of mGSZm, a GSA method equipped with the advanced permutation method and asymptotic…

    Estimating the Quality of Reprogrammed Cells Using ES Cell Differentiation Expression Patterns

    Somatic cells can be reprogrammed to a pluripotent state by over-expression of defined factors, and pluripotency has been confirmed by the tetraploid complementation assay. However, especially in human cells, estimating the quality of induced pluripotent stem cells (iPSCs) is still difficult. Here, we present a novel supervised method for assessing the quality of iPSCs by estimating the gene expression profile using a 2-D "Differentiation-index coordinate", which consists of two "developing lines" that reflect the directions of ES cell differentiation and the changes of cell states during differentiation. By applying a novel linear model to describe the differentiation trajectory, we transformed the ES cell differentiation time-course expression profiles into linear "developing lines" and used these lines to construct the 2-D "Differentiation-index coordinate" for mouse and human. We compared the published gene expression profiles of iPSCs, ESCs and fibroblasts in the mouse and human "Differentiation-index coordinate". Moreover, we defined the Distance index to indicate the quality of iPS cells, which is based on the projection distances between iPSCs and ESCs and between iPSCs and fibroblasts. The results indicated that the "Differentiation-index coordinate" can distinguish the differentiation states of the different cell types. Furthermore, by applying this method to the analysis of expression profiles in the tetraploid complementation assay, we showed that the Distance index, which reflects spatial distributions, correlated with the pluripotency of iPSCs. We also analyzed the significantly changed gene sets of the "developing lines". The results suggest that the method presented here is not only suitable for estimating the quality of iPS cells based on expression profiles, but is also a new approach to analyzing time-resolved experimental data.

    Network estimation in State Space Model with L1-regularization constraint

    Biological networks have arisen as an attractive paradigm of genomic science ever since the introduction of large-scale genomic technologies, which carried the promise of elucidating relationships in functional genomics. Microarray technologies coupled with appropriate mathematical or statistical models have made it possible to identify dynamic regulatory networks or to measure the time course of the expression levels of many genes simultaneously. However, one of the few limitations lies in the high-dimensional nature of such data, coupled with the fact that these gene expression data are known to include some hidden process. In this regard, we are concerned with deriving a method for inferring a sparse dynamic network in a high-dimensional data setting. We assume that the observations are noisy measurements of gene expression in the form of mRNAs, whose dynamics can be described by some unknown or hidden process. We build an input-dependent linear state space model from these hidden states and demonstrate how an L1 regularization constraint incorporated in an Expectation-Maximization (EM) algorithm can be used to reverse engineer transcriptional networks from gene expression profiling data. This corresponds to estimating the model interaction parameters. The proposed method is illustrated on time-course microarray data obtained from a well-established T-cell data set. At the optimum tuning parameters we found genes TRAF5, JUND, CDK4, CASP4, CD69, and C3X1 to have a higher number of inward-directed connections and FYB, CCNA2, AKT1 and CASP8 to be genes with a higher number of outward-directed connections. We recommend these genes as objects for further investigation. Caspase 4 is also found to activate the expression of JunD, which in turn represses the cell cycle regulator CDC2. Comment: arXiv admin note: substantial text overlap with arXiv:1308.359