51 research outputs found

    Chemometric Approaches for Systems Biology

    Full text link
    The present Ph.D. thesis is devoted to study, develop and apply approaches commonly used in chemometrics to the emerging field of systems biology. Existing procedures and new methods are applied to solve research and industrial questions in different multidisciplinary teams. The methodologies developed in this document will enrich the plethora of procedures employed within omic sciences to understand biological organisms and will improve processes in biotechnological industries integrating biological knowledge at different levels and exploiting the software packages derived from the thesis. This dissertation is structured in four parts. The first block describes the framework in which the contributions presented here are based. The objectives of the two research projects related to this thesis are highlighted and the specific topics addressed in this document via conference presentations and research articles are introduced. A comprehensive description of omic sciences and their relationships within the systems biology paradigm is given in this part, jointly with a review of the most applied multivariate methods in chemometrics, on which the novel approaches proposed here are founded. The second part addresses many problems of data understanding within metabolomics, fluxomics, proteomics and genomics. Different alternatives are proposed in this block to understand flux data in steady state conditions. Some are based on applications of multivariate methods previously applied in other chemometrics areas. Others are novel approaches based on a bilinear decomposition using elemental metabolic pathways, from which a GNU licensed toolbox is made freely available for the scientific community. As well, a framework for metabolic data understanding is proposed for non-steady state data, using the same bilinear decomposition proposed for steady state data, but modelling the dynamics of the experiments using novel two and three-way data analysis procedures. Also, the relationships between different omic levels are assessed in this part integrating different sources of information of plant viruses in data fusion models. Finally, an example of interaction between organisms, oranges and fungi, is studied via multivariate image analysis techniques, with future application in food industries. The third block of this thesis is a thoroughly study of different missing data problems related to chemometrics, systems biology and industrial bioprocesses. In the theoretical chapters of this part, new algorithms to obtain multivariate exploratory and regression models in the presence of missing data are proposed, which serve also as preprocessing steps of any other methodology used by practitioners. Regarding applications, this block explores the reconstruction of networks in omic sciences when missing and faulty measurements appear in databases, and how calibration models between near infrared instruments can be transferred, avoiding costs and time-consuming full recalibrations in bioindustries and research laboratories. Finally, another software package, including a graphical user interface, is made freely available for missing data imputation purposes. The last part discusses the relevance of this dissertation for research and biotechnology, including proposals deserving future research.Esta tesis doctoral se centra en el estudio, desarrollo y aplicación de técnicas quimiométricas en el emergente campo de la biología de sistemas. Procedimientos comúnmente utilizados y métodos nuevos se aplican para resolver preguntas de investigación en distintos equipos multidisciplinares, tanto del ámbito académico como del industrial. Las metodologías desarrolladas en este documento enriquecen la plétora de técnicas utilizadas en las ciencias ómicas para entender el funcionamiento de organismos biológicos y mejoran los procesos en la industria biotecnológica, integrando conocimiento biológico a diferentes niveles y explotando los paquetes de software derivados de esta tesis. Esta disertación se estructura en cuatro partes. El primer bloque describe el marco en el cual se articulan las contribuciones aquí presentadas. En él se esbozan los objetivos de los dos proyectos de investigación relacionados con esta tesis. Asimismo, se introducen los temas específicos desarrollados en este documento mediante presentaciones en conferencias y artículos de investigación. En esta parte figura una descripción exhaustiva de las ciencias ómicas y sus interrelaciones en el paradigma de la biología de sistemas, junto con una revisión de los métodos multivariantes más aplicados en quimiometría, que suponen las pilares sobre los que se asientan los nuevos procedimientos aquí propuestos. La segunda parte se centra en resolver problemas dentro de metabolómica, fluxómica, proteómica y genómica a partir del análisis de datos. Para ello se proponen varias alternativas para comprender a grandes rasgos los datos de flujos metabólicos en estado estacionario. Algunas de ellas están basadas en la aplicación de métodos multivariantes propuestos con anterioridad, mientras que otras son técnicas nuevas basadas en descomposiciones bilineales utilizando rutas metabólicas elementales. A partir de éstas se ha desarrollado software de libre acceso para la comunidad científica. A su vez, en esta tesis se propone un marco para analizar datos metabólicos en estado no estacionario. Para ello se adapta el enfoque tradicional para sistemas en estado estacionario, modelando las dinámicas de los experimentos empleando análisis de datos de dos y tres vías. En esta parte de la tesis también se establecen relaciones entre los distintos niveles ómicos, integrando diferentes fuentes de información en modelos de fusión de datos. Finalmente, se estudia la interacción entre organismos, como naranjas y hongos, mediante el análisis multivariante de imágenes, con futuras aplicaciones a la industria alimentaria. El tercer bloque de esta tesis representa un estudio a fondo de diferentes problemas relacionados con datos faltantes en quimiometría, biología de sistemas y en la industria de bioprocesos. En los capítulos más teóricos de esta parte, se proponen nuevos algoritmos para ajustar modelos multivariantes, tanto exploratorios como de regresión, en presencia de datos faltantes. Estos algoritmos sirven además como estrategias de preprocesado de los datos antes del uso de cualquier otro método. Respecto a las aplicaciones, en este bloque se explora la reconstrucción de redes en ciencias ómicas cuando aparecen valores faltantes o atípicos en las bases de datos. Una segunda aplicación de esta parte es la transferencia de modelos de calibración entre instrumentos de infrarrojo cercano, evitando así costosas re-calibraciones en bioindustrias y laboratorios de investigación. Finalmente, se propone un paquete software que incluye una interfaz amigable, disponible de forma gratuita para imputación de datos faltantes. En la última parte, se discuten los aspectos más relevantes de esta tesis para la investigación y la biotecnología, incluyendo líneas futuras de trabajo.Aquesta tesi doctoral es centra en l'estudi, desenvolupament, i aplicació de tècniques quimiomètriques en l'emergent camp de la biologia de sistemes. Procediments comúnment utilizats i mètodes nous s'apliquen per a resoldre preguntes d'investigació en diferents equips multidisciplinars, tant en l'àmbit acadèmic com en l'industrial. Les metodologies desenvolupades en aquest document enriquixen la plétora de tècniques utilitzades en les ciències òmiques per a entendre el funcionament d'organismes biològics i milloren els processos en la indústria biotecnològica, integrant coneixement biològic a distints nivells i explotant els paquets de software derivats d'aquesta tesi. Aquesta dissertació s'estructura en quatre parts. El primer bloc descriu el marc en el qual s'articulen les contribucions ací presentades. En ell s'esbossen els objectius dels dos projectes d'investigació relacionats amb aquesta tesi. Així mateix, s'introduixen els temes específics desenvolupats en aquest document mitjançant presentacions en conferències i articles d'investigació. En aquesta part figura una descripació exhaustiva de les ciències òmiques i les seues interrelacions en el paradigma de la biologia de sistemes, junt amb una revisió dels mètodes multivariants més aplicats en quimiometria, que supossen els pilars sobre els quals s'assenten els nous procediments ací proposats. La segona part es centra en resoldre problemes dins de la metabolòmica, fluxòmica, proteòmica i genòmica a partir de l'anàlisi de dades. Per a això es proposen diverses alternatives per a compendre a grans trets les dades de fluxos metabòlics en estat estacionari. Algunes d'elles estàn basades en l'aplicació de mètodes multivariants propostos amb anterioritat, mentre que altres són tècniques noves basades en descomposicions bilineals utilizant rutes metabòliques elementals. A partir d'aquestes s'ha desenvolupat software de lliure accés per a la comunitat científica. Al seu torn, en aquesta tesi es proposa un marc per a analitzar dades metabòliques en estat no estacionari. Per a això s'adapta l'enfocament tradicional per a sistemes en estat estacionari, modelant les dinàmiques dels experiments utilizant anàlisi de dades de dues i tres vies. En aquesta part de la tesi també s'establixen relacions entre els distints nivells òmics, integrant diferents fonts d'informació en models de fusió de dades. Finalment, s'estudia la interacció entre organismes, com taronges i fongs, mitjançant l'anàlisi multivariant d'imatges, amb futures aplicacions a la indústria alimentària. El tercer bloc d'aquesta tesi representa un estudi a fons de diferents problemes relacionats amb dades faltants en quimiometria, biologia de sistemes i en la indústria de bioprocessos. En els capítols més teòrics d'aquesta part, es proposen nous algoritmes per a ajustar models multivariants, tant exploratoris com de regressió, en presencia de dades faltants. Aquests algoritmes servixen ademés com a estratègies de preprocessat de dades abans de l'ús de qualsevol altre mètode. Respecte a les aplicacions, en aquest bloc s'explora la reconstrucció de xarxes en ciències òmiques quan apareixen valors faltants o atípics en les bases de dades. Una segona aplicació d'aquesta part es la transferència de models de calibració entre instruments d'infrarroig proper, evitant així costoses re-calibracions en bioindústries i laboratoris d'investigació. Finalment, es proposa un paquet software que inclou una interfície amigable, disponible de forma gratuïta per a imputació de dades faltants. En l'última part, es discutixen els aspectes més rellevants d'aquesta tesi per a la investigació i la biotecnologia, incloent línies futures de treball.Folch Fortuny, A. (2016). Chemometric Approaches for Systems Biology [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/77148TESISPremios Extraordinarios de tesis doctorale

    PLS model building with missing data: New algorithms and a comparative study

    Full text link
    [EN] New algorithms to deal with missing values in predictive modelling are presented in this article. Specifically, 2 trimmed scores regression adaptations are proposed, one from principal component analysis model building with missing data (MD) and other from partial least squares regression model exploitation with missing values. Using these methods, practitioners can impute MD both in the explanatory/predictor and the dependent/response variables. Partial least squares is used here to build the multivariate calibration models; however, any regression method can be used after MD imputation. Four case studies, with different latent structures, are analysed here to compare the trimmed scores regression¿based methods against state-of-the-art approaches. The MATLAB code for these methods is also provided for its direct implementation at http://mseg.webs.upv.es, under a GNU license.Spanish Ministry of Science and Innovation; FEDER; European Union, Grant/Award Number: DPI2011-28112-C04-02 and DPI2014-55276-C5-1R; Spanish Ministry of Economy and Competitiveness, Grant/Award Number: ECO2013-43353-RFolch-Fortuny, A.; Arteaga, F.; Ferrer, A. (2017). PLS model building with missing data: New algorithms and a comparative study. Journal of Chemometrics. 31(7):1-12. https://doi.org/10.1002/cem.2897S112317Grung, B., & Manne, R. (1998). Missing values in principal component analysis. Chemometrics and Intelligent Laboratory Systems, 42(1-2), 125-139. doi:10.1016/s0169-7439(98)00031-8Arteaga, F., & Ferrer-Riquelme, A. J. (2009). Missing Data. Comprehensive Chemometrics, 285-314. doi:10.1016/b978-044452701-1.00125-3Folch-Fortuny, A., Arteaga, F., & Ferrer, A. (2015). PCA model building with missing data: New proposals and a comparative study. Chemometrics and Intelligent Laboratory Systems, 146, 77-88. doi:10.1016/j.chemolab.2015.05.006Arteaga, F., & Ferrer, A. (2002). Dealing with missing data in MSPC: several methods, different interpretations, some examples. Journal of Chemometrics, 16(8-10), 408-418. doi:10.1002/cem.750Arteaga, F., & Ferrer, A. (2005). Framework for regression-based missing data imputation methods in on-line MSPC. Journal of Chemometrics, 19(8), 439-447. doi:10.1002/cem.946Nelson, P. R. C., Taylor, P. A., & MacGregor, J. F. (1996). Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems, 35(1), 45-65. doi:10.1016/s0169-7439(96)00007-xWalczak, B., & Massart, D. L. (2001). Dealing with missing data. Chemometrics and Intelligent Laboratory Systems, 58(1), 15-27. doi:10.1016/s0169-7439(01)00131-9Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. doi:10.1201/9781439821862Folch-Fortuny, A., Arteaga, F., & Ferrer, A. (2016). Missing Data Imputation Toolbox for MATLAB. Chemometrics and Intelligent Laboratory Systems, 154, 93-100. doi:10.1016/j.chemolab.2016.03.019ProSensus Multivariate release 16.02 2016SIMCA release 14 2015The Unscrambler X Release 10.4 2016PLS_Toolbox Release 8.1 2016Liu, Y., & Brown, S. D. (2013). Comparison of five iterative imputation methods for multivariate classification. Chemometrics and Intelligent Laboratory Systems, 120, 106-115. doi:10.1016/j.chemolab.2012.11.010White, I. R., Royston, P., & Wood, A. M. (2010). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377-399. doi:10.1002/sim.4067Schneider, T. (2001). Analysis of Incomplete Climate Data: Estimation of Mean Values and Covariance Matrices and Imputation of Missing Values. Journal of Climate, 14(5), 853-871. doi:10.1175/1520-0442(2001)0142.0.co;2Fierro, R. D., Golub, G. H., Hansen, P. C., & O’Leary, D. P. (1997). Regularization by Truncated Total Least Squares. SIAM Journal on Scientific Computing, 18(4), 1223-1241. doi:10.1137/s1064827594263837Puwakkatiya-Kankanamage, E. H., García-Muñoz, S., & Biegler, L. T. (2014). An optimization-based undeflated PLS (OUPLS) method to handle missing data in the training set. Journal of Chemometrics, 28(7), 575-584. doi:10.1002/cem.2618Camacho, J., Picó, J., & Ferrer, A. (2008). Bilinear modelling of batch processes. Part II: a comparison of PLS soft-sensors. Journal of Chemometrics, 22(10), 533-547. doi:10.1002/cem.1179Geladi, P., & Kowalski, B. R. (1986). Partial least-squares regression: a tutorial. Analytica Chimica Acta, 185, 1-17. doi:10.1016/0003-2670(86)80028-9Kubinyi, H. (1996). Evolutionary variable selection in regression and PLS analyses. Journal of Chemometrics, 10(2), 119-133. doi:10.1002/(sici)1099-128x(199603)10:23.0.co;2-4González-Martínez, J. M., Folch-Fortuny, A., Llaneras, F., Tortajada, M., Picó, J., & Ferrer, A. (2014). Metabolic flux understanding of Pichia pastoris grown on heterogenous culture media. Chemometrics and Intelligent Laboratory Systems, 134, 89-99. doi:10.1016/j.chemolab.2014.02.003Folch-Fortuny, A., Vitale, R., de Noord, O. E., & Ferrer, A. (2017). Calibration transfer between NIR spectrometers: New proposals and a comparative study. Journal of Chemometrics, 31(3), e2874. doi:10.1002/cem.2874Arteaga, F., & Ferrer, A. (2010). How to simulate normal data sets with the desired correlation structure. Chemometrics and Intelligent Laboratory Systems, 101(1), 38-42. doi:10.1016/j.chemolab.2009.12.003Arteaga, F., & Ferrer, A. (2013). Building covariance matrices with the desired structure. Chemometrics and Intelligent Laboratory Systems, 127, 80-88. doi:10.1016/j.chemolab.2013.06.003Folch-Fortuny, A., Arteaga, F., & Ferrer, A. (2016). Assessment of maximum likelihood PCA missing data imputation. Journal of Chemometrics, 30(7), 386-393. doi:10.1002/cem.2804Saccenti, E., & Camacho, J. (2015). On the use of the observation-wisek-fold operation in PCA cross-validation. Journal of Chemometrics, 29(8), 467-478. doi:10.1002/cem.272

    Assessment of maximum likelihood PCA missing data imputation

    Full text link
    Maximum likelihood principal component analysis (MLPCA) was originally proposed to incorporate measurement error variance information in principal component analysis (PCA) models. MLPCA can be used to fit PCA models in the presence of missing data, simply by assigning very large variances to the non-measured values. An assessment of maximum likelihood missing data imputation is performed in this paper, analysing the algorithm of MLPCA and adapting several methods for PCA model building with missing data to its maximum likelihood version. In this way, known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR) methods are implemented within the MLPCA method to work as different imputation steps. Six data sets are analysed using several percentages of missing data, comparing the performance of the original algorithm, and its adapted regression-based methods, with other state-of-the-art methods.Research in this study was partially supported by the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grant DPI2011-28112-C04-02 and DPI2014-55276-C5-1R, and the Spanish Ministry of Economy and Competitiveness through grant ECO2013-43353-R.Folch Fortuny, A.; Arteaga Moreno, FJ.; Ferrer, A. (2016). Assessment of maximum likelihood PCA missing data imputation. Journal of Chemometrics. 30(7):386-393. https://doi.org/10.1002/cem.280438639330

    Missing Data Imputation Toolbox for MATLAB

    Full text link
    [EN] Here we introduce a graphical user-friendly interface to deal with missing values called Missing Data Imputation (MDI) Toolbox. This MATLAB toolbox allows imputing missing values, following missing completely at random patterns, exploiting the relationships among variables. In this way, principal component anal- ysis (PCA) models are fitted iteratively to impute the missing data until convergence. Different methods, using PCA internally, are included in the toolbox: trimmed scores regression (TSR), known data regres- sion (KDR), KDR with principal component regression (KDR-PCR), KDR with partial least squares regression (KDR-PLS), projection to the model plane (PMP), iterative algorithm (IA), modified nonlinear iterative partial least squares regression algorithm (NIPALS) and data augmentation (DA). MDI Toolbox presents a general procedure to impute missing data, thus can be used to infer PCA models with missing data, to estimate the covariance structure of incomplete data matrices, or to impute the missing values as a preprocessing step of other methodologies.Research in this study was partially supported by the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grant DPI2011-28112-C04-02 and DPI2014-55276-C5-1 R, and the Spanish Ministry of Economy and Competitiveness through grant ECO2013-43353-R.Folch Fortuny, A.; Arteaga Moreno, FJ.; Ferrer, A. (2016). Missing Data Imputation Toolbox for MATLAB. Chemometrics and Intelligent Laboratory Systems. 154:93-100. https://doi.org/10.1016/j.chemolab.2016.03.019S9310015

    Calibration transfer between NIR spectrometers: new proposals and a comparative study

    Full text link
    [EN] Calibration transfer between near-infrared (NIR) spectrometers is a subtle issue in chemometrics and process industry. In fact, as even very similar instruments may generate strongly different spectral responses, regression models developed on a first NIR system can rarely be used with spectra collected by a second apparatus. In this work, two novel methods to perform calibration transfer between NIR spectrometers are proposed. Both of them permit to exploit the specific relationships between instruments for imputing new unmeasured spectra, which will be then resorted to for building an improved predictive model, suitable for the analysis of future incoming data. Specifically, the two approaches are based on trimmed scores regression and joint-Y partial least squares regression, respectively. The performance of these novel strategies will be assessed and compared to that of well-established techniques such as maximum likelihood principal component analysis and piecewise direct standardisation in two real case studies.This research work was partially supported by the Spanish Ministry of Economy and Competitiveness under the project DPI2014-55276-C5-1R and Shell Global Solutions International B.V. (Amsterdam, the Netherlands).Folch-Fortuny, A.; Vitale, R.; De Noord, OE.; Ferrer, A. (2017). Calibration transfer between NIR spectrometers: new proposals and a comparative study. Journal of Chemometrics. 31(3):1-11. doi:10.1002/cem.2874S11131

    VIS/NIR hyperspectral imaging and N-way PLS-DA models for detection of decay lesions in citrus fruits

    Full text link
    [EN] In this work an N-way partial least squares regression discriminant analysis (NPLS-DA) methodology is developed to detect symptoms of disease caused by Penicillium digitatum in citrus fruits (green mould) using visible/near infrared (VIS/NIR) hyperspectral images. To build the discriminant model a set of oranges and mandarins was infected by the fungus and another set was infiltrated just with water for control purposes. A double cross-validation strategy is used to validate the discriminant models. Finally, permutation testing is used to select a few bands offering the best correct classification rates in the validation set. The discriminant models developed here can be potentially implemented in a fruit packinghouse to detect infected citrus fruits at their arrival from the field with affordable multispectral (3 5 channels) cameras installed in the packinglines.This research was partially funded by the Spanish Ministry of Science and Innovation through grants DPI2011-28112-C04-02 and DPI2014-55276-C05-1R, and by INIA through grant RTA2012-00062-C04-01. In all cases with the support of European FEDER funds. Authors thank Lluis Palou from the Centro de Tecnologia Postcosecha at the IVIA for the help and supervision in the innoculation process of the fruits.Folch Fortuny, A.; Prats-Montalbán, JM.; Cubero-García, S.; Blasco Ivars, J.; Ferrer, A. (2016). VIS/NIR hyperspectral imaging and N-way PLS-DA models for detection of decay lesions in citrus fruits. Chemometrics and Intelligent Laboratory Systems. 156:241-248. https://doi.org/10.1016/j.chemolab.2016.05.005S24124815

    Principal elementary mode analysis (PEMA)

    Full text link
    Principal component analysis (PCA) has been widely applied in fluxomics to compress data into a few latent structures in order to simplify the identification of metabolic patterns. These latent structures lack a direct biological interpretation due to the intrinsic constraints associated with a PCA model. Here we introduce a new method that significantly improves the interpretability of the principal components with a direct link to metabolic pathways. This method, called principal elementary mode analysis (PEMA), establishes a bridge between a PCA-like model, aimed at explaining the maximum variance in flux data, and the set of elementary modes (EMs) of a metabolic network. It provides an easy way to identify metabolic patterns in large fluxomics datasets in terms of the simplest pathways of the organism metabolism. The results using a real metabolic model of Escherichia coli show the ability of PEMA to identify the EMs that generated the different simulated flux distributions. Actual flux data of E. coli and Pichia pastoris cultures confirm the results observed in the simulated study, providing a biologically meaningful model to explain flux data of both organisms in terms of the EM activation. The PEMA toolbox is freely available for non-commercial purposes on http://mseg.webs.upv.es.Research in this study was partially supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds from the European Union through grants DPI2011-28112-C04-02 and DPI2014-55276-C5-1R. We would also acknowledge Fundacao para a Ciencia e Tecnologia for PhD fellowships with references SFRH/BD/67033/2009, SFRH/BD/70768/2010 and PTDC/BBB-BSS/2800/2012.Folch Fortuny, A.; Marques, R.; Isidro, IA.; Oliveira, R.; Ferrer, A. (2016). Principal elementary mode analysis (PEMA). Molecular BioSystems. 12(3):737-746. doi:10.1039/c5mb00828jS73774612

    Enabling network inference methods to handle missing data and outliers

    Get PDF
    © 2015 Folch-Fortuny et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.[EN] Background: The inference of complex networks from data is a challenging problem in biological sciences, as well as in a wide range of disciplines such as chemistry, technology, economics, or sociology. The quantity and quality of the data greatly affect the results. While many methodologies have been developed for this task, they seldom take into account issues such as missing data or outlier detection and correction, which need to be properly addressed before network inference. Results: Here we present an approach to (i) handle missing data and (ii) detect and correct outliers based on multivariate projection to latent structures. The method, called trimmed scores regression (TSR), enables network inference methods to analyse incomplete datasets by imputing the missing values coherently with the latent data structure. Furthermore, it substitutes the faulty values in a dataset by proper estimations. We provide an implementation of this approach, and show how it can be integrated with any network inference method as a preliminary data curation step. This functionality is demonstrated with a state of the art network inference method based on mutual information distance and entropy reduction, MIDER. Conclusion: The methodology presented here enables network inference methods to analyse a large number of incomplete and faulty datasets that could not be reliably analysed so far. Our comparative studies show the superiority of TSR over other missing data approaches used by practitioners. Furthermore, the method allows for outlier detection and correction.Research in this study was partially supported by the European Union through project BioPreDyn (FP7-KBBE 289434), and the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grants MultiScales (DPI2011-28112-C04-02, DPI2011-28112-C04-03), and SynBioFactory (DPI2014-55276-C5-1-R, DPI2014-55276-C5-2-R). AF Villaverde also acknowledges funding from the Xunta de Galicia through an I2C postdoctoral fellowship (I2C ED481B 2014/133-0). We also gratefully acknowledge Associate Professor Francisco Arteaga for his help in the adaptation of TSR to the PCA model building context.Folch-Fortuny, A.; Fernández Villaverde, A.; Ferrer Riquelme, AJ.; Rodríguez Banga, J. (2015). Enabling network inference methods to handle missing data and outliers. BMC Bioinformatics. 16(283):1-12. https://doi.org/10.1186/s12859-015-0717-711216283Albert R, Barabási AL. Statistical mechanics of complex networks. Rev Mod Phys. 2002; 74(1):47–97.Newman MEJ. The structure and function of complex networks. SIAM Rev. 2003; 45(2):167–256.De Smet R, Marchal K. Advantages and limitations of current network inference methods. Nat Rev Microbiol. 2010; 8(10):717–29.Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci. 2010; 107(14):6286–291.Prill RJ, Saez-Rodriguez J, Alexopoulos LG, Sorger PK, Stolovitzky G. Crowdsourcing network inference: the DREAM predictive signaling network challenge. Sci Signal. 2011; 4(189):7.Lecca P, Priami C. Biological network inference for drug discovery. Drug Discovery Today. 2013; 18(5-6):256–64.Maetschke SR, Madhamshettiwar PB, Davis MJ, Ragan MA. Supervised, semi-supervised and unsupervised inference of gene regulatory networks. Brief Bioinform. 2013; 15(2):195–211.Grung B, Manne R. Missing values in principal component analysis. Chemometr Intell Lab Syst. 1998; 42(1-2):125–39.Arteaga F, Ferrer A. Missing data. In: Comprehensive chemometrics chemical and biochemical data analysis. Amsterdam: Elsevier: 2009. p. 285–314.Jackson JE. A user’s guide to principal components. Hoboken: Wiley Ser Probab Stat; 2004.Walczak B, Massart DL. Dealing with missing data. Chemometr Intell Lab Syst. 2001; 58(1):15–27.Martens H, Jr Russwurm H. Food research and data analysis. London; New York, NY, USA: Elsevier Applied Science; 1983.Arteaga F, Ferrer A. Dealing with missing data in MSPC: Several methods, different interpretations, some examples. J Chemom. 2002; 16(8-10):408–18.Folch-Fortuny A, Arteaga F, Ferrer A. PCA model building with missing data: new proposals and a comparative study. Chemometr Intell Lab Syst. 2015; 146:77–88.Liao SG, Lin Y, Kang DD, Chandra D, Bon J, Kaminski N, et al.Missing value imputation in high-dimensional phenomic data: imputable or not, and how?BMC Bioinforma. 2014; 15(1):346.Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometr Intell Lab Syst. 1987; 2(1-3):37–52.Kourti T, MacGregor JF. Process analysis, monitoring and diagnosis, using multivariate projection methods. Chemometr Intell Lab Syst. 1995; 28(1):3–21.Ferrer A. Latent structures-based multivariate statistical process control: A paradigm shift. Qual Eng. 2014; 26(1):72–91.Villaverde AF, Ross J, Morán F, Banga JR. MIDER: Network inference with mutual information distance and entropy reduction. PLoS ONE. 2014; 9(5):96732.Shannon CE. A mathematical theory of communication. Bell Sys Tech J. 1948; 27(3):379–423.Cover TM, Thomas JA. Elements of information theory, 99 ed. New York: Wiley-Interscience; 1991.Villaverde AF, Ross J, Banga JR. Reverse engineering cellular networks with information theoretic methods. Cells. 2013; 2(2):306–29.Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al.Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007; 5(1):8.Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, et al.ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinforma. 2006; 7(Suppl 1):7.Meyer PE, Kontos K, Lafitte F, Bontempi G. Information-theoretic inference of large transcriptional regulatory networks. EURASIP J Bioinforma Syst Biol. 2007; 2007(1):79879.Luo W, Hankenson KD, Woolf PJ. Learning transcriptional regulatory networks from high throughput gene expression data using continuous three-way mutual information. BMC Bioinforma. 2008; 9:467.Zoppoli P, Morganella S, Ceccarelli M. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC bioinforma. 2010; 11:154.Wu CC, Huang HC, Juan HF, Chen ST. GeneNetwork: an interactive tool for reconstruction of genetic networks using microarray data. Bioinformatics (Oxford, England). 2004; 20(18):3691–693.Gustafsson M, Hörnquist M, Lombardi A. Constructing and analyzing a large-scale gene-to-gene regulatory network–lasso-constrained inference and biological validation. IEEE/ACM trans comput biol bioinform/IEEE, ACM. 2005; 2(3):254–61.Guthke R, Möller U, Hoffmann M, Thies F, Töpfer S. Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection. Bioinformatics (Oxford, England). 2005; 21(8):1626–34.Schulze S, Henkel SG, Driesch D, Guthke R, Linde J. Computational prediction of molecular pathogen-host interactions based on dual transcriptome data. Front Microbiol. 2015; 6:65.Hurley D, Araki H, Tamada Y, Dunmore B, Sanders D, Humphreys S, et al.Gene network inference and visualization tools for biologists: application to new human transcriptome datasets. Nucleic Acids Res. 2012; 40(6):2377–398.Souto MCd, Jaskowiak PA, Costa IG. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinforma. 2015; 16(1):64.Guitart-Pla O, Kustagi M, Rügheimer F, Califano A, Schwikowski B. The Cyni framework for network inference in Cytoscape. Bioinformatics (Oxford, England). 2015; 31(9):1499–1501.Camacho J, Picó J, Ferrer A. Data understanding with PCA: Structural and variance information plots. Chemometr Intell Lab Syst. 2010; 100(1):48–56.Wold S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics. 1978; 20(4):397–405.Camacho J, Ferrer A. Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects. J Chemom. 2012; 26(7):361–73.Little RJA, Rubin DB. Statistical analysis with missing data, 2nd ed. Hoboken, NJ: Wiley-Interscience; 2002.Ferrer A. Multivariate statistical process control based on principal component analysis (MSPC-PCA): Some reflections and a case study in an autobody assembly process. Qual Eng. 2007; 19(4):311–25.MacGregor JF, Kourti T. Statistical process control of multivariate processes. Control Eng Pract. 1995; 3(3):403–14.Stanimirova I, Daszykowski M, Walczak B. Dealing with missing values and outliers in principal component analysis. Talanta. 2007; 72(1):172–8.Abdi H, Williams LJ. Principal component analysis. Wiley Interdiscip Rev Comput Stat. 2010; 2(4):433–59.Camacho J, Picó J, Ferrer A. The best approaches in the on-line monitoring of batch processes based on PCA: Does the modelling structure matter?Anal Chim Acta. 2009; 642(1-2):59–68.González-Martínez JM, de Noord OE, Ferrer A. Multisynchro: a novel approach for batch synchronization in scenarios of multiple asynchronisms. J Chemom. 2014; 28(5):462–75.Samoilov MS. Reconstruction and Functional Analysis of General Chemical Reactions and Reaction Networks. California, United States: Stanford University; 1997.Samoilov M, Arkin A, Ross J. On the deduction of chemical reaction pathways from measurements of time series of concentrations. Chaos (Woodbury, NY). 2001; 11(1):108–14.Cantone I, Marucci L, Iorio F, Ricci MA, Belcastro V, Bansal M, et al.A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell. 2009; 137(1):172–81.Arkin A, Shen P, Ross J. A test case of correlation metric construction of a reaction pathway from measurements. Science. 1997; 277(5330):1275–9.Schaffter T, Marbach D, Floreano D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics (Oxford, England). 2011; 27(16):2263–270.Marbach D, Schaffter T, Mattiussi C, Floreano D. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J Comput Biol J Comput Mol Cell Biol. 2009; 16(2):229–39

    Fusion of genomic, proteomic and phenotypic data: the case of potyviruses

    Full text link
    Data fusion has been widely applied to analyse different sources of information, combining all of them in a single multivariate model. This methodology is mandatory when different omic data sets must be integrated to fully understand an organism using a systems biology approach. Here, a data fusion procedure is presented to combine genomic, proteomic and phenotypic data sets gathered for Tobacco etch virus (TEV). The genomic data correspond to random mutations inserted in most viral genes. The proteomic data represent both the effect of these mutations on the encoded proteins and the perturbation induced by the mutated proteins to their neighbours in the protein protein interaction net- work (PPIN). Finally, the phenotypic trait evaluated for each mutant virus is replicative fitness. To analyse these three sources of information a Partial Least Squares (PLS) regression model is fitted in order to extract the latent variables from data that explain (and relate) the significant variables to the fitness of TEV. The final output of this methodology is a set of functional modules of the PPIN relating topology and mutations with fitness. Throughout the re-analysis of these diverse TEV data, we generated valuable information on the mechanism of action of certain mutations and how they translate into organismal fitness. Results show that the effect of some mutations goes beyond the protein they directly affect and spreads on the PPIN to neighbour proteins, thus defining functional modules.This work was supported by the Spanish Ministerio de Economia y Competitividad grants BFU2012-30805 (to SFE), and DPI2011-28112-C04-02, DPI2011-28112-C04-01, DPI2014-55276-C5-1-R (to AF and JP) and by Generalitat Valenciana grant PROMETEOII/2014/021 (to SFE). The first two authors are recipients of fellowships from the Spanish Ministerio de Economia y Competitividad: BES-2012-053772 (to GB) and BES-2012-057812 (to AF-F).Folch-Fortuny, A.; Bosque-Chacon, G.; Picó, J.; Ferrer, A.; Elena, S. (2016). Fusion of genomic, proteomic and phenotypic data: the case of potyviruses. Molecular BioSystems. 12(1):253-261. https://doi.org/10.1039/c5mb00507hS25326112

    Dynamic elementary mode modelling of non-steady state flux data

    Get PDF
    [EN] A novel framework is proposed to analyse metabolic fluxes in non-steady state conditions, based on the new concept of dynamic elementary mode (dynEM): an elementary mode activated partially depending on the time point of the experiment.This research work was partially supported by the Spanish Ministry of Economy and Competitiveness under the project DPI2014-55276-C5-1R.Folch-Fortuny, A.; Teusink, B.; Hoefsloot, HC.; Smilde, AK.; Ferrer, A. (2018). Dynamic elementary mode modelling of non-steady state flux data. BMC Systems Biology. 12:1-15. https://doi.org/10.1186/s12918-018-0589-3S11512Bro R, Smilde AK. Principal component analysis. Anal Methods. 2014; 6(9):2812–31.González-Martínez JM, Folch-Fortuny A, Llaneras F, Tortajada M, Picó J, Ferrer A. Metabolic flux understanding of Pichia pastoris grown on heterogenous culture media. Chemometr Intell Lab Syst. 2014; 134:89–99.Barrett CL, Herrgard MJ, Palsson B. Decomposing complex reaction networks using random sampling, principal component analysis and basis rotation. BMC Syst Biol. 2009; 3(30):1–8.Jaumot J, Gargallo R, De Juan A, Tauler R. A graphical user-friendly interface for MCR-ALS: A new tool for multivariate curve resolution in MATLAB. Chemometr Intell Lab Syst. 2005; 76(1):101–10.Folch-Fortuny A, Tortajada M, Prats-Montalbán JM, Llaneras F, Picó J, Ferrer A. MCR-ALS on metabolic networks: Obtaining more meaningful pathways. Chemometr Intell Lab Syst. 2015; 142:293–303.Folch-Fortuny A, Marques R, Isidro IA, Oliveira R, Ferrer A. Principal elementary mode analysis (PEMA). Mol BioSyst. 2016; 12(3):737–46.Hood L. Systems biology: Integrating technology, biology, and computation. Mech Ageing Dev. 2003; 124(1):9–16.Teusink B, Passarge J, Reijenga CA, Esgalhado E, van der Weijden CC, Schepper M, Walsh MC, Bakker BM, van Dam K, Westerhoff HV, Snoep JL. Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. Eur J Biochem / FEBS. 2000; 267(17):5313–29.Mahadevan R, Edwards JS, Doyle FJ. Dynamic flux balance analysis of diauxic growth in Escherichia coli. Biophys J. 2002; 83(3):1331–40.Willemsen AM, Hendrickx DM, Hoefsloot HCJ, Hendriks MMWB, Wahl SA, Teusink B, Smilde AK, van Kampen AHC. MetDFBA: incorporating time-resolved metabolomics measurements into dynamic flux balance analysis. Mol BioSyst. 2015; 11(1):137–45.Barker M, Rayens W. Partial least squares for discrimination. J Chemom. 2003; 17(3):166–73.Bartel J, Krumsiek J, Theis FJ. Statistical methods for the analysis of high-throughput metabolomics data. Comput Struct Biotechnol J. 2013; 4:201301009.Hendrickx DM, Hoefsloot HCJ, Hendriks MMWB, Canelas AB, Smilde AK. Global test for metabolic pathway differences between conditions. Anal Chim Acta. 2012; 719:8–15.Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006; 34(Database issue):354–7.Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28(1):27–30.Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010; 38(Database issue):355–60.Andersson CA, Bro R. The N-way Toolbox for MATLAB. Chemometr Intell Lab Syst. 2000; 52(1):1–4.Terzer M, Stelling J. Large-scale computation of elementary flux modes with bit pattern trees. Bioinformatics. 2008; 24(19):2229–35.Heerden JHv, Wortel MT, Bruggeman FJ, Heijnen JJ, Bollen YJM, Planqué R, Hulshof J, O’Toole TG, Wahl SA, Teusink B. Lost in Transition: Start-Up of Glycolysis Yields Subpopulations of Nongrowing Cells. Science. 2014; 343(6174):1245114.Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P, Kummer U. COPASI–a COmplex PAthway SImulator. Bioinformatics. 2006; 22(24):3067–74.Petzold L. Automatic selection of methods for solving stiff and nonstiff systems of ordinary differential equations. SIAM J Sci Stat Comput. 1983; 4:136–48.Canelas AB, van Gulik WM, Heijnen JJ. Determination of the cytosolic free NAD/NADH ratio in Saccharomyces cerevisiae under steady-state and highly dynamic conditions. Biotechnol Bioeng. 2008; 100(4):734–43.Nikerel IE, Canelas AB, Jol SJ, Verheijen PJT, Heijnen JJ. Construction of kinetic models for metabolic reaction networks: Lessons learned in analysing short-term stimulus response data. Math Comput Model Dyn Syst. 2011; 17(3):243–60.Llaneras F, Picó J. Stoichiometric modelling of cell metabolism. J Biosci Bioeng. 2008; 105(1):1–11.Bro R. Multiway calibration. Multilinear PLS. J Chemom. 1998; 10(1):47–61.Westerhuis JA, Hoefsloot HCJ, Smit S, Vis DJ, Smilde AK, Velzen EJJv, Duijnhoven JPMv, Dorsten FAv. Assessment of PLSDA cross validation. Metabolomics. 2008; 4(1):81–9.Szymańska E, Saccenti E, Smilde AK, Westerhuis JA. Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies. Metabolomics. 2012; 8(Suppl 1):3–16.Rodrigues F, Ludovico P, Leão C. Sugar Metabolism in Yeasts: an Overview of Aerobic and Anaerobic Glucose Catabolism. In: Biodiversity and Ecophysiology of Yeasts. The Yeast Handbook. Berlin: Springer: 2006. p. 101–21.Larsson K, Ansell R, Eriksson P, Adler L. A gene encoding sn-glycerol 3-phosphate dehydrogenase (NAD+) complements an osmosensitive mutant of Saccharomyces cerevisiae. Mol Microbiol. 1993; 10(5):1101–11.Eriksson P, André L, Ansell R, Blomberg A, Adler L. Cloning and characterization of GPD2, a second gene encoding sn-glycerol 3-phosphate dehydrogenase (NAD+) in Saccharomyces cerevisiae, and its comparison with GPD1. Mol Microbiol. 1995; 17(1):95–107.Norbeck J, Pâhlman AK, Akhtar N, Blomberg A, Adler L. Purification and characterization of two isoenzymes of DL-glycerol-3-phosphatase from Saccharomyces cerevisiae. Identification of the corresponding GPP1 and GPP2 genes and evidence for osmotic regulation of Gpp2p expression by the osmosensing mitogen-activated protein kinase signal transduction pathway. J Biol Chem. 1996; 271(23):13875–81
    • …
    corecore