24 research outputs found

    Transcription factor activity estimation based on particle swarm optimization and fast network component analysis

    Proceedings of the IEEE Engineering in Medicine and Biology Society Conference, 2010, p. 1061-1064. Transcription factors (TFs) play an important role in regulating the expression of genes. The accurate measurement of transcription factor activities (TFAs) depends on a series of experimental technologies of molecular biology and is intractable in most practical situations. Some signal processing methods for blind source separation have been applied to the prediction of TFAs from gene expression data. Most such methods make use of the statistical properties of the gene expression data only, leading to inaccurate detection of TFAs. In contrast, network component analysis (NCA) can provide much improved results by utilizing the structural information of the gene regulatory network. However, the structure of the gene regulatory network, required by NCA, is not available in most practical cases, so NCA is not directly applicable. In this paper, we propose to use particle swarm optimization (PSO) to find the most plausible network structure iteratively from the gene expression data, with the assistance of a recently developed fast algorithm for network component analysis (FastNCA). This novel approach to TFA inference can thus take advantage of NCA even when the required network structure is unknown. The effectiveness of our approach has been demonstrated by applications to both simulated data and real gene expression microarray data, in the sense that TFAs can be inferred with high accuracy. © 2010 IEEE.
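
The abstract above relies on particle swarm optimization as its search procedure. As a rough illustration of that idea only, here is a minimal generic PSO loop in Python. It is a sketch under stated assumptions, not the authors' method: the function name `pso_minimize`, the coefficient values, and the toy continuous objective are all hypothetical, and the paper's actual search is over candidate network structures scored with FastNCA, which is not reproduced here.

```python
import random


def pso_minimize(objective, dim, n_particles=20, iters=100, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5, bound=5.0):
    """Generic PSO sketch (hypothetical helper, not the paper's algorithm)."""
    rng = random.Random(seed)
    # Random starting positions inside [-bound, bound]; zero initial velocity.
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # per-particle best position
    best_val = [objective(p) for p in pos]      # per-particle best score
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # swarm-wide best so far

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity mixes momentum, pull toward the particle's own
                # best, and pull toward the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


# Toy objective (an assumption): squared distance from the origin.
g_pos, g_val = pso_minimize(lambda x: sum(v * v for v in x), dim=2)
```

In a structure-search setting, the objective would instead score how well a candidate network explains the expression data; how candidates are encoded as particle positions is a design choice the abstract does not specify.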

    Robust Logistic Principal Component Regression for classification of data in presence of outliers

    Logistic Principal Component Regression (LPCR) has found many applications in the classification of high-dimensional data, such as tumor classification using microarray data. However, when the measurements are contaminated and/or the observations are mislabeled, the performance of LPCR is significantly degraded. In this paper, we propose a new robust LPCR based on M-estimation, which constitutes a versatile framework to reduce the sensitivity of the estimators to outliers. In particular, robust detection rules are used to first remove the contaminated measurements, and then a modified Huber function is used to further remove the contributions of the mislabeled observations. Experimental results show that the proposed method generally outperforms the conventional LPCR in the presence of outliers, while maintaining a performance comparable to that obtained under normal conditions. © 2012 IEEE. The 2012 IEEE International Symposium on Circuits and Systems (ISCAS), Seoul, Korea, 20-23 May 2012. In IEEE International Symposium on Circuits and Systems Proceedings, 2012, p. 2809-281
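
To make the M-estimation idea in the abstract above concrete, the following Python sketch shows how the standard Huber function down-weights large residuals. It is illustrative only: the abstract mentions a *modified* Huber function whose form is not given here, so the standard Huber weight is swapped in, and the names `huber_weight` and `robust_mean`, the threshold 1.345, and the unit-scale assumption are all hypothetical choices.

```python
def huber_weight(residual, delta=1.345):
    """Standard Huber weight: 1 inside [-delta, delta], delta/|r| outside."""
    r = abs(residual)
    return 1.0 if r <= delta else delta / r


def robust_mean(values, delta=1.345, iters=50):
    """Location M-estimate by iteratively reweighted averaging.

    Assumes residuals are already on unit scale (no scale estimation).
    """
    mu = sorted(values)[len(values) // 2]  # start from the (upper) median
    for _ in range(iters):
        weights = [huber_weight(v - mu, delta) for v in values]
        mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return mu


data = [1.0, 2.0, 3.0, 4.0, 100.0]   # one gross outlier
plain = sum(data) / len(data)        # ordinary mean, pulled toward 100
robust = robust_mean(data)           # outlier receives a small weight
```

The same reweighting principle carries over to regression: observations with large residuals contribute less to the fit, which is what limits the influence of contaminated or mislabeled samples.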

    A new optimization algorithm for network component analysis based on convex programming

    Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, p. 509-512. Paper no. 2203. Network component analysis (NCA) has been established as a promising tool for reconstructing gene regulatory networks from microarray data. NCA is a method that can resolve the problem of blind source separation when the mixing matrix has a known sparse structure, even if the source signals are correlated. The original NCA algorithm relies on alternating least squares (ALS) and suffers from local convergence as well as slow convergence. In this paper, we develop new and more robust NCA algorithms by incorporating additional signal constraints. In particular, we introduce the biologically sound constraint that all nonzero entries in the connectivity network are positive. Our new approach formulates a convex optimization problem which can be solved efficiently and effectively by fast convex programming algorithms. We verify the effectiveness and robustness of our new approach using simulations and gene regulatory network reconstruction from experimental yeast cell cycle microarray data. © 2009 IEEE.
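
For readers unfamiliar with NCA, the problem described in the abstract above is commonly written as a structured matrix factorization. The notation below is an assumption for illustration (the symbols E, A, P and the index set Z are not taken from the abstract), and the last line only restates the positivity constraint the abstract mentions; the paper's actual convex formulation is not reproduced.

```latex
% Assumed notation: E is the N x M expression matrix (genes by samples),
% A is the N x L connectivity (mixing) matrix, P is the L x M matrix of
% source signals (TF activities), and Z is the known set of zero entries.
\begin{aligned}
  & E \approx A P, \\
  & \min_{A,\,P}\ \lVert E - A P \rVert_F^2
    \quad \text{subject to} \quad A_{ij} = 0 \ \text{for all } (i,j) \in Z, \\
  & A_{ij} > 0 \ \text{for all } (i,j) \notin Z
    \quad \text{(additional positivity constraint described in the abstract).}
\end{aligned}
```

Alternating least squares attacks the minimization by fixing A to solve for P, then fixing P to solve for A, and repeating. The joint problem is not convex in A and P together, which is consistent with the local-convergence issue the abstract notes.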

    Data-mining the FlyAtlas online resource to identify core functional motifs across transporting epithelia

    Background: Comparative analysis of tissue-specific transcriptomes is a powerful technique to uncover tissue functions. Our FlyAtlas.org provides authoritative gene expression levels for multiple tissues of Drosophila melanogaster (1). Although the main use of such resources is single gene lookup, there is the potential for powerful meta-analysis to address questions that could not easily be framed otherwise. Here, we illustrate the power of data-mining of FlyAtlas data by comparing epithelial transcriptomes to identify a core set of highly-expressed genes, across the four major epithelial tissues (salivary glands, Malpighian tubules, midgut and hindgut) of both adults and larvae. Method: Parallel hypothesis-led and hypothesis-free approaches were adopted to identify core genes that underpin insect epithelial function. In the former, gene lists were created from transport processes identified in the literature, and their expression profiles mapped from the flyatlas.org online dataset. In the latter, gene enrichment lists were prepared for each epithelium, and genes (both transport related and unrelated) consistently enriched in transporting epithelia identified. Results: A key set of transport genes, comprising V-ATPases, cation exchangers, aquaporins, potassium and chloride channels, and carbonic anhydrase, was found to be highly enriched across the epithelial tissues, compared with the whole fly. Additionally, a further set of genes that had not been predicted to have epithelial roles were co-expressed with the core transporters, extending our view of what makes a transporting epithelium work. Further insights were obtained by studying the genes uniquely overexpressed in each epithelium; for example, the salivary gland expresses lipases, the midgut organic solute transporters, the tubules specialize for purine metabolism and the hindgut overexpresses still unknown genes. Conclusion: Taken together, these data provide a unique insight into epithelial function in this key model insect, and a framework for comparison with other species. They also provide a methodology for function-led datamining of FlyAtlas.org and other multi-tissue expression datasets.

    Learning Transcriptional Regulatory Relationships Using Sparse Graphical Models

    Understanding the organization and function of transcriptional regulatory networks by analyzing high-throughput gene expression profiles is a key problem in computational biology. The challenges in this work are 1) the lack of complete knowledge of the regulatory relationship between the regulators and the associated genes, 2) the potential for spurious associations due to confounding factors, and 3) a number of parameters to learn that is usually larger than the number of available microarray experiments. We present a sparse (L1 regularized) graphical model to address these challenges. Our model incorporates known transcription factors and introduces hidden variables to represent possible unknown transcription and confounding factors. The expression level of a gene is modeled as a linear combination of the expression levels of known transcription factors and hidden factors. Using gene expression data covering 39,296 oligonucleotide probes from 1109 human liver samples, we demonstrate that our model better predicts out-of-sample data than a model with no hidden variables. We also show that some of the gene sets associated with hidden variables are strongly correlated with Gene Ontology categories. The software including source code is available at http://grnl1.codeplex.com

    An integrated machine learning approach for predicting DosR-regulated genes in Mycobacterium tuberculosis.

    BACKGROUND: DosR is an important regulator of the response to stress such as limited oxygen availability in Mycobacterium tuberculosis. Time course gene expression data enable us to dissect this response on the gene regulatory level. The mRNA expression profile of a regulator, however, is not necessarily a direct reflection of its activity. Knowing the transcription factor activity (TFA) can be exploited to predict novel target genes regulated by the same transcription factor. Various approaches have been proposed to reconstruct TFAs from gene expression data. Most of them capture only a first-order approximation to the complex transcriptional processes by assuming linear gene responses and linear dynamics in TFA, or ignore the temporal information in data from such systems. RESULTS: In this paper, we approach the problem of inferring dynamic hidden TFAs using Gaussian processes (GP). We are able to model dynamic TFAs and to account for both linear and nonlinear gene responses. To test the validity of the proposed approach, we reconstruct the hidden TFA of p53, a tumour suppressor activated by DNA damage, using published time course gene expression data. Our reconstructed TFA is closer to the experimentally determined profile of p53 concentration than that from the original study. We then apply the model to time course gene expression data obtained from chemostat cultures of M. tuberculosis under reduced oxygen availability. After estimation of the TFA of DosR based on a number of known target genes using the GP model, we predict novel DosR-regulated genes: the parameters of the model are interpreted as relevance parameters indicating an existing functional relationship between TFA and gene expression. We further improve the prediction by integrating promoter sequence information in a logistic regression model. Apart from the documented DosR-regulated genes, our prediction yields ten novel genes under direct control of DosR. 
CONCLUSIONS: Chemostat cultures are an ideal experimental system for controlling noise and variability when monitoring the response of bacterial organisms such as M. tuberculosis to finely controlled changes in culture conditions and available metabolites. Nonlinear hidden TFA dynamics of regulators can be reconstructed remarkably well with Gaussian processes from such data. Moreover, estimated parameters of the GP can be used to assess whether a gene is controlled by the reconstructed TFA or not. It is straightforward to combine these parameters with further information, such as the presence of binding motifs, to increase prediction accuracy. RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are

    Sparse regulatory networks

    In many organisms the expression levels of each gene are controlled by the activation levels of known "Transcription Factors" (TF). A problem of considerable interest is that of estimating the "Transcription Regulation Networks" (TRN) relating the TFs and genes. While the expression levels of genes can be observed, the activation levels of the corresponding TFs are usually unknown, greatly increasing the difficulty of the problem. Based on previous experimental work, it is often the case that partial information about the TRN is available. For example, certain TFs may be known to regulate a given gene, or in other cases a connection may be predicted with a certain probability. In general, the biology of the problem indicates there will be very few connections between TFs and genes. Several methods have been proposed for estimating TRNs. However, they all suffer from problems such as unrealistic assumptions about prior knowledge of the network structure or computational limitations. We propose a new approach that can directly utilize prior information about the network structure in conjunction with observed gene expression data to estimate the TRN. Our approach uses L1 penalties on the network to ensure a sparse structure. This has the advantage of being computationally efficient as well as making many fewer assumptions about the network structure. We use our methodology to construct the TRN for E. coli and show that the estimate is biologically sensible and compares favorably with previous estimates. Comment: Published at http://dx.doi.org/10.1214/10-AOAS350 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
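
The role of an L1 penalty in producing sparse connections can be illustrated with a plain lasso regression solved by coordinate descent. The Python sketch below makes simplifying assumptions that the abstract above does not: TF activities are treated as known inputs (columns of `X`), a single gene's expression is the response `y`, and the names `soft_threshold` and `lasso_cd` and the toy data are hypothetical. The paper's actual method handles unobserved TF activities and prior network information and is not reproduced here.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, iters=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by coordinate descent.

    Assumes no column of X is entirely zero.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            zj = 0.0   # squared norm of feature j
            for i in range(n):
                pred_wo_j = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
                zj += X[i][j] ** 2
            b[j] = soft_threshold(rho / n, lam) / (zj / n)
    return b


# Toy data (an assumption): two candidate regulators with orthogonal
# activity profiles; the gene responds only to the first one.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 2.0, -2.0, -2.0]
coef = lasso_cd(X, y, lam=0.1)
```

The soft-thresholding step is what sets weak connections exactly to zero, while strong ones are kept with slightly shrunken coefficients.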