5 research outputs found

    Enabling network inference methods to handle missing data and outliers

    © 2015 Folch-Fortuny et al. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
    Background: The inference of complex networks from data is a challenging problem in biological sciences, as well as in a wide range of disciplines such as chemistry, technology, economics, or sociology. The quantity and quality of the data greatly affect the results. While many methodologies have been developed for this task, they seldom take into account issues such as missing data or outlier detection and correction, which need to be properly addressed before network inference.
    Results: Here we present an approach to (i) handle missing data and (ii) detect and correct outliers based on multivariate projection to latent structures. The method, called trimmed scores regression (TSR), enables network inference methods to analyse incomplete datasets by imputing the missing values coherently with the latent data structure. Furthermore, it replaces faulty values in a dataset with proper estimates. We provide an implementation of this approach, and show how it can be integrated with any network inference method as a preliminary data curation step. This functionality is demonstrated with a state-of-the-art network inference method based on mutual information distance and entropy reduction, MIDER.
    Conclusion: The methodology presented here enables network inference methods to analyse a large number of incomplete and faulty datasets that could not be reliably analysed so far. Our comparative studies show the superiority of TSR over other missing data approaches used by practitioners. Furthermore, the method allows for outlier detection and correction.
    Research in this study was partially supported by the European Union through project BioPreDyn (FP7-KBBE 289434), and the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grants MultiScales (DPI2011-28112-C04-02, DPI2011-28112-C04-03), and SynBioFactory (DPI2014-55276-C5-1-R, DPI2014-55276-C5-2-R). AF Villaverde also acknowledges funding from the Xunta de Galicia through an I2C postdoctoral fellowship (I2C ED481B 2014/133-0). We also gratefully acknowledge Associate Professor Francisco Arteaga for his help in the adaptation of TSR to the PCA model building context.
    Folch-Fortuny, A.; Fernández Villaverde, A.; Ferrer Riquelme, AJ.; Rodríguez Banga, J. (2015). Enabling network inference methods to handle missing data and outliers. BMC Bioinformatics. 16(283):1-12. https://doi.org/10.1186/s12859-015-0717-7
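To make the idea of imputing missing values "coherently with the latent data structure" more concrete, here is a minimal NumPy sketch of a trimmed-scores-style regression. It is not the authors' MATLAB implementation: the function name `tsr_impute`, its parameters, the mean-value initialisation, and the stopping rule are all assumptions made for illustration. For each incomplete row, the scores computed from the observed variables only (the "trimmed" scores) are used as regressors to predict the missing variables, and the PCA model is refitted until the imputed values stop changing.

```python
import numpy as np


def tsr_impute(X, n_components=2, tol=1e-10, max_iter=500):
    """Illustrative trimmed-scores-style imputation (hypothetical helper).

    Missing entries are marked with NaN. Observed entries are never modified.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumption: start from column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        Xc = filled - mean
        S = np.cov(Xc, rowvar=False)           # covariance of the filled data
        _, eigvecs = np.linalg.eigh(S)         # eigenvalues in ascending order
        P = eigvecs[:, ::-1][:, :n_components]  # leading loadings

        updated = filled.copy()
        for i in np.flatnonzero(missing.any(axis=1)):
            m = missing[i]        # missing variables in this row
            o = ~m                # observed variables in this row
            P_o = P[o]
            t_star = P_o.T @ Xc[i, o]  # trimmed scores from observed part only
            # Regress the missing variables on the trimmed scores.
            cov_mt = S[np.ix_(m, o)] @ P_o
            cov_tt = P_o.T @ S[np.ix_(o, o)] @ P_o
            updated[i, m] = mean[m] + cov_mt @ np.linalg.pinv(cov_tt) @ t_star

        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled
```

The completed matrix returned by such a function could then be passed to any network inference method as a preliminary curation step, which is the workflow the abstract describes. The pseudo-inverse is used here only to keep the sketch robust when the trimmed-score covariance is close to singular.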

    Missing Data Imputation Toolbox for MATLAB

    Here we introduce a user-friendly graphical interface for dealing with missing values, called the Missing Data Imputation (MDI) Toolbox. This MATLAB toolbox imputes missing values that follow a missing completely at random pattern by exploiting the relationships among variables. To do so, principal component analysis (PCA) models are fitted iteratively to impute the missing data until convergence. Different methods, using PCA internally, are included in the toolbox: trimmed scores regression (TSR), known data regression (KDR), KDR with principal component regression (KDR-PCR), KDR with partial least squares regression (KDR-PLS), projection to the model plane (PMP), iterative algorithm (IA), modified nonlinear iterative partial least squares regression algorithm (NIPALS) and data augmentation (DA). MDI Toolbox presents a general procedure to impute missing data, and can thus be used to infer PCA models with missing data, to estimate the covariance structure of incomplete data matrices, or to impute the missing values as a preprocessing step of other methodologies.
    Research in this study was partially supported by the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grants DPI2011-28112-C04-02 and DPI2014-55276-C5-1-R, and the Spanish Ministry of Economy and Competitiveness through grant ECO2013-43353-R.
    Folch Fortuny, A.; Arteaga Moreno, FJ.; Ferrer, A. (2016). Missing Data Imputation Toolbox for MATLAB. Chemometrics and Intelligent Laboratory Systems. 154:93-100. https://doi.org/10.1016/j.chemolab.2016.03.019
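The abstract's description of fitting PCA models iteratively until convergence can be illustrated with a short NumPy sketch. This is not the toolbox code and does not reproduce its interface; the function name `pca_iterative_impute`, the mean-value start, and the convergence tolerance are assumptions. The loop fills the gaps, fits a PCA model to the filled matrix, replaces only the missing cells with the model reconstruction, and repeats.

```python
import numpy as np


def pca_iterative_impute(X, n_components=1, tol=1e-10, max_iter=500):
    """Illustrative iterative PCA imputation (hypothetical helper).

    Missing entries are marked with NaN. Observed entries are kept as they are.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumption: initialise missing cells with the observed column means.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        Xc = filled - mean
        # PCA via SVD of the centred, currently filled matrix.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        P = Vt[:n_components].T
        reconstruction = Xc @ P @ P.T + mean
        # Only the missing cells are overwritten by the model prediction.
        updated = np.where(missing, reconstruction, X)
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled
```

The number of components is left to the caller in this sketch; how the toolbox selects it is not stated in the abstract.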

    Unified feature association networks through integration of transcriptomic and proteomic data

    High-throughput multi-omics studies and corresponding network analyses of multi-omic data have rapidly expanded their impact over the last 10 years. As biological features of different types (e.g. transcripts, proteins, metabolites) interact within cellular systems, the greatest amount of knowledge can be gained from networks that incorporate multiple types of -omic data. However, biological and technical sources of variation diminish the ability to detect cross-type associations, yielding networks dominated by communities composed of nodes of the same type. We describe here network building methods that can maximize edges between nodes of different data types, leading to integrated networks, that is, networks with a large number of edges that link nodes of different -omic types (transcripts, proteins, lipids, etc.). We systematically rank several network inference methods and demonstrate that, in many cases, using a random forest method, GENIE3, produces the most integrated networks. This increase in integration does not come at the cost of accuracy, as GENIE3 produces networks of approximately the same quality as the other network inference methods tested here. Using GENIE3, we also infer networks representing antibody-mediated Dengue virus cell invasion and receptor-mediated Dengue virus invasion. A number of functional pathways showed centrality differences between the two networks, including genes responding to both GM-CSF and IL-4, which had a higher centrality value in the antibody-mediated than in the receptor-mediated Dengue network. Because a biological system involves the interplay of many different types of molecules, incorporating multiple data types into networks will improve their use as models of biological systems. The methods explored here are some of the first to specifically highlight and address the challenges associated with how such multi-omic networks can be assembled and how the greatest number of interactions can be inferred from different data types. The resulting networks can lead to the discovery of new host response patterns and interactions during viral infection, generate new hypotheses of pathogenic mechanisms, and confirm mechanisms of disease.

    Additional file 2 of Enabling network inference methods to handle missing data and outliers

    OUTLIERS. The source code of the outlier detection and correction scheme is provided, ready to be used in MATLAB. (M 3 kb)
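The supplementary file itself is not reproduced in this listing. As a rough illustration of how a projection-based outlier check can work, the following NumPy sketch flags observations whose squared prediction error (SPE) under a PCA model is unusually large. The function name `spe_outliers`, the `factor` parameter, and the median-based cutoff are assumptions chosen for brevity; they are a crude stand-in for proper statistical control limits and should not be read as the scheme used in the paper.

```python
import numpy as np


def spe_outliers(X, n_components=1, factor=10.0):
    """Flag rows with a large squared prediction error (hypothetical helper).

    Returns the SPE of every row and the indices of the flagged rows.
    Expects a complete matrix (no NaN values).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # PCA loadings from the SVD of the centred data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    # Part of each observation that the PCA model does not explain.
    residuals = Xc - Xc @ P @ P.T
    spe = np.sum(residuals ** 2, axis=1)
    # Assumed heuristic: flag rows far above the typical (median) SPE.
    flagged = np.flatnonzero(spe > factor * np.median(spe))
    return spe, flagged
```

This sketch covers detection only. It does not attempt the correction step mentioned in the file description.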

    Additional file 1 of Enabling network inference methods to handle missing data and outliers

    TSR. The source code of the trimmed scores regression (TSR) algorithm for missing data imputation is provided, ready to be used in MATLAB. (M 2 kb)