15 research outputs found

    Synthetic Genomics

    Get PDF
The current advances in sequencing, data mining, DNA synthesis, cloning, in silico modeling, and genome editing have opened a new field of research known as Synthetic Genomics. The main goal of this emerging area is to engineer entire synthetic genomes from scratch using pre-designed building blocks obtained by chemical synthesis and rational design. This opens the possibility of further improving our understanding of genome fundamentals by considering the effect of the whole biological system on biological function. Moreover, the construction of non-natural biological systems has allowed us to explore novel biological functions not yet observed in nature. This book summarizes the current state of Synthetic Genomics, providing relevant examples in this emerging field.

    Chemometric Approaches for Systems Biology

    Full text link
The present Ph.D. thesis is devoted to studying, developing, and applying approaches commonly used in chemometrics to the emerging field of systems biology. Existing procedures and new methods are applied to solve research and industrial questions in different multidisciplinary teams. The methodologies developed in this document will enrich the plethora of procedures employed within omic sciences to understand biological organisms and will improve processes in biotechnological industries, integrating biological knowledge at different levels and exploiting the software packages derived from the thesis. This dissertation is structured in four parts. The first block describes the framework on which the contributions presented here are based. The objectives of the two research projects related to this thesis are highlighted, and the specific topics addressed in this document via conference presentations and research articles are introduced. A comprehensive description of omic sciences and their relationships within the systems biology paradigm is given in this part, jointly with a review of the most widely applied multivariate methods in chemometrics, on which the novel approaches proposed here are founded. The second part addresses several problems of data understanding within metabolomics, fluxomics, proteomics, and genomics. Different alternatives are proposed in this block to understand flux data under steady-state conditions. Some are based on applications of multivariate methods previously applied in other areas of chemometrics. Others are novel approaches based on a bilinear decomposition using elementary metabolic pathways, from which a GNU-licensed toolbox is made freely available to the scientific community. In addition, a framework for metabolic data understanding is proposed for non-steady-state data, using the same bilinear decomposition proposed for steady-state data but modelling the dynamics of the experiments using novel two- and three-way data analysis procedures.
Also, the relationships between different omic levels are assessed in this part by integrating different sources of information on plant viruses in data fusion models. Finally, an example of interaction between organisms, oranges and fungi, is studied via multivariate image analysis techniques, with future application in the food industry. The third block of this thesis is a thorough study of different missing-data problems related to chemometrics, systems biology, and industrial bioprocesses. In the theoretical chapters of this part, new algorithms to obtain multivariate exploratory and regression models in the presence of missing data are proposed, which also serve as preprocessing steps for any other methodology used by practitioners. Regarding applications, this block explores the reconstruction of networks in omic sciences when missing and faulty measurements appear in databases, and how calibration models between near-infrared instruments can be transferred, avoiding costly and time-consuming full recalibrations in bioindustries and research laboratories. Finally, another software package, including a graphical user interface, is made freely available for missing-data imputation purposes. The last part discusses the relevance of this dissertation for research and biotechnology, including proposals deserving future research.
Folch Fortuny, A. (2016). Chemometric Approaches for Systems Biology [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/77148
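The bilinear decomposition of steady-state flux data described above can be sketched schematically. This is an illustrative toy, not the thesis's actual toolbox: the pathway matrix `P` and the flux measurements are invented, and each sample's nonnegative pathway weights are recovered with SciPy's `nnls`.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical elementary pathways (3 pathways x 5 reactions).
P = np.array([[1.0, 1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])

# Simulated flux measurements: nonnegative pathway weights plus noise.
rng = np.random.default_rng(2)
W_true = rng.uniform(0.0, 2.0, size=(10, 3))
V = W_true @ P + 0.01 * rng.standard_normal((10, 5))

# Bilinear model V ~ W P with W >= 0: one NNLS fit per sample.
W_hat = np.array([nnls(P.T, v)[0] for v in V])
print(np.allclose(W_hat, W_true, atol=0.1))  # weights recovered up to noise
```

With well-conditioned pathway vectors, the fitted weights match the generating ones up to the noise level, which is the sense in which the bilinear model "explains" the flux data.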

    Topology Reconstruction of Dynamical Networks via Constrained Lyapunov Equations

    Get PDF
The network structure (or topology) of a dynamical network is often unavailable or uncertain. Hence, we consider the problem of network reconstruction. Network reconstruction aims at inferring the topology of a dynamical network using measurements obtained from the network. In this technical note we define the notion of solvability of the network reconstruction problem. Subsequently, we provide necessary and sufficient conditions under which the network reconstruction problem is solvable. Finally, using constrained Lyapunov equations, we establish novel network reconstruction algorithms, applicable to general dynamical networks. We also provide specialized algorithms for specific network dynamics, such as the well-known consensus and adjacency dynamics.
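As a toy illustration of topology reconstruction for consensus dynamics (a simple least-squares variant, not the constrained-Lyapunov algorithm of the note): under x' = -Lx, sampled states and their derivatives determine the Laplacian, and hence the adjacency pattern. The 4-node path graph below is invented for the example.

```python
import numpy as np

# Hypothetical 4-node path graph; L is its Laplacian (ground truth).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))   # sampled states x(t_k), one per row
Xdot = X @ (-L).T                   # consensus dynamics: x' = -L x

# Least-squares estimate of -L from (state, derivative) pairs,
# then read the topology off the off-diagonal pattern of L_hat.
minusL_hat, *_ = np.linalg.lstsq(X, Xdot, rcond=None)
L_hat = -minusL_hat.T
A_hat = (np.abs(L_hat - np.diag(np.diag(L_hat))) > 1e-8).astype(int)
print(np.array_equal(A_hat, A.astype(int)))  # True: topology recovered
```

The note's contribution is precisely the harder setting where derivatives are not directly measured and the estimate must satisfy structural constraints, handled there via constrained Lyapunov equations.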

    Optimisation of microfluidic experiments for model calibration of a synthetic promoter in S. cerevisiae

    Get PDF
This thesis explores, implements, and examines methods to improve the efficiency of model calibration experiments for synthetic biological circuits in three aspects: experimental technique, optimal experimental design (OED), and automatic experiment abnormality screening (AEAS). Moreover, to obtain a specific benchmark that provides clear-cut evidence of utility, an integrated synthetic orthogonal promoter in yeast (S. cerevisiae) and a corresponding model are selected as the experimental subject. This work first focuses on the “wet-lab” part of the experiment. It verifies the theoretical benefit of adopting the microfluidic technique by carrying out a series of in-vivo experiments on a purpose-built automatic microfluidic experimental platform. Statistical analysis shows that, compared to models calibrated with flow-cytometry data (a representative traditional experimental technique), models based on microfluidic data from the same experiment time give significantly more accurate behaviour predictions for never-encountered stimulus patterns. In other words, compared to flow-cytometry experiments, microfluidics can obtain models of the required prediction accuracy within less experiment time. The next aspect is to optimise the “dry-lab” part, i.e., the design of experiments and data processing. Previous works have proven that the informativeness of experiments can be improved by optimising the input design (OID). However, the amount of work and the time cost of the current OID approach rise dramatically with large and complex synthetic networks and mathematical models. To address this problem, this thesis introduces parameter clustering analysis and visualisation (PCAV) to speed up OID by narrowing down the parameters of interest. For the first time, this thesis proposes a parameter clustering algorithm based on the Fisher information matrix (FIMPC).
In-silico experiments on the benchmark promoter show that PCAV reduces the complexity of OID and provides a new way to explore the connections between parameters. Moreover, the analysis shows that experiments with FIMPC-based OID lead to significantly more accurate parameter estimations than the current OID approach. Automatic abnormality screening is the third aspect. For microfluidic experiments, invalid experiments are currently identified by experts through visual checks of the microscope images after the experiments. To improve the automation level and robustness of this quality control process, this work develops an automatic experiment abnormality screening (AEAS) system supported by convolutional neural networks (CNNs). The system learns the features of six abnormal experiment conditions from images taken in actual microfluidic experiments and achieves identification within seconds in application. The training and validation of six representative CNNs of different network depths and design strategies show that some shallow CNNs can already diagnose abnormal conditions with the desired accuracy. Moreover, to improve the training convergence of deep CNNs with small data sets, this thesis proposes a levelled-training method that improves the chance of convergence from 30% to 90%. In summary, with a synthetic promoter model in yeast as benchmark, the efficiency of model calibration experiments is improved by adopting microfluidics technology, applying PCAV parameter analysis and FIMPC-based OID, and setting up a CNN-supported AEAS system.
These contributions have the potential to be exploited for designing more efficient in-vivo experiments for model calibration in similar studies.
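The idea behind Fisher-information-based parameter clustering can be sketched generically. This is not the thesis's FIMPC algorithm: the sensitivity matrix `S` is synthetic, and nearly redundant parameter pairs are planted so that the FIM-derived correlation structure groups them.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
# Hypothetical sensitivity matrix S: dy_i/dtheta_j for 50 observations and
# 6 parameters; parameters 0-1 and 2-3 are made nearly redundant.
S = rng.standard_normal((50, 6))
S[:, 1] = S[:, 0] + 0.01 * rng.standard_normal(50)
S[:, 3] = -S[:, 2] + 0.01 * rng.standard_normal(50)

FIM = S.T @ S                               # Fisher information (unit noise)
d = np.sqrt(np.diag(FIM))
corr = FIM / np.outer(d, d)                 # parameter correlation structure
dist = 1.0 - np.abs(corr)                   # |corr| ~ 1 => indistinguishable
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=0.1, criterion="distance")
print(labels)  # parameters 0 and 1 share a cluster; so do 2 and 3
```

Clustering strongly correlated parameters together is what lets an input design focus on one representative per cluster instead of every parameter individually.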

    Grand Challenges for Biological and Environmental Research: A Long-Term Vision

    Get PDF
    The interactions and feedbacks among plants, animals, microbes, humans, and the environment ultimately form the world in which we live. This world is now facing challenges from a growing and increasingly affluent human population whose numbers and lifestyles are driving ever greater energy demand and impacting climate. These and other contributing factors will make energy and climate sustainability extremely difficult to achieve over the 20-year time horizon that is the focus of this report. Despite these severe challenges, there is optimism that deeper understanding of our environment will enable us to mitigate detrimental effects, while also harnessing biological and climate systems to ensure a sustainable energy future. This effort is advanced by scientific inquiries in the fields of atmospheric chemistry and physics, biology, ecology, and subsurface science - all made possible by computing. The Office of Biological and Environmental Research (BER) within the Department of Energy's (DOE) Office of Science has a long history of bringing together researchers from different disciplines to address critical national needs in determining the biological and environmental impacts of energy production and use, characterizing the interplay of climate and energy, and collaborating with other agencies and DOE programs to improve the world's most powerful climate models. BER science focuses on three distinct areas: (1) What are the roles of Earth system components (atmosphere, land, oceans, sea ice, and the biosphere) in determining climate? (2) How is the information stored in a genome translated into microbial, plant, and ecosystem processes that influence biofuel production, climate feedbacks, and the natural cycling of carbon? (3) What are the biological, geochemical, and physical forces that govern the behavior of Earth's subsurface environment? 
Ultimately, the goal of BER science is to support experimentation and modeling that can reliably predict the outcomes and behaviors of complex biological and environmental systems, leading to robust solutions for DOE missions and strategic goals. In March 2010, the Biological and Environmental Research Advisory Committee held the Grand Challenges for Biological and Environmental Research: A Long-Term Vision workshop to identify scientific opportunities and grand challenges for BER science in the coming decades and to develop an overall strategy for drafting a long-term vision for BER. Key workshop goals included: (1) Identifying the greatest scientific challenges in biology, climate, and the environment that DOE will face over a 20-year time horizon. (2) Describing how BER should be positioned to address those challenges. (3) Determining the new and innovative tools needed to advance BER science. (4) Suggesting how the workforce of the future should be trained in integrative system science. This report lays out grand research challenges for BER - in biological systems, climate, energy sustainability, computing, and education and workforce training - that can put society on a path to achieve the scientific evidence and predictive understanding needed to inform decision making and planning to address future energy needs, climate change, water availability, and land use.

    Fouille de données complexes et biclustering avec l'analyse formelle de concepts

    Get PDF
Knowledge discovery in databases (KDD) is a process applied to possibly large volumes of data to discover significant and useful patterns. In this thesis, we are interested in data transformation and data mining for knowledge discovery applied to complex data, and we present several experiments related to different approaches and different data types. The first part of this thesis focuses on the task of biclustering using formal concept analysis (FCA) and pattern structures. FCA is naturally related to biclustering, where the objective is to simultaneously group rows and columns that verify some regularities. Pattern structures are generalizations of FCA that work on more complex data. Partition pattern structures were proposed to discover constant-column biclusters, while interval pattern structures were studied for similar-column biclusters. Here we extend these approaches to enumerate other types of biclusters: additive, multiplicative, order-preserving, and coherent-sign-changes. The second part of this thesis focuses on two experiments in mining complex data. First, we present a contribution related to the CrossCult project, where we analyze a dataset of visitor trajectories in a museum. We apply sequence clustering and FCA-based sequential pattern mining to discover patterns in the dataset and to classify these trajectories. This analysis can be used within the CrossCult project to build recommendation systems for future visitors. Second, we present our work related to the task of antibacterial drug discovery. The dataset for this task is generally a numerical matrix with molecules as rows and features/attributes as columns. The huge number of features makes molecule classification harder for any classifier.
Here we study a feature selection approach based on log-linear analysis which discovers associations among features. As a synthesis, this thesis presents a series of different experiments in the mining of complex real-world data.
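The bicluster types enumerated above have simple algebraic definitions. A minimal sketch of two of them (constant-column and additive) as NumPy predicates, independent of the FCA machinery used in the thesis:

```python
import numpy as np

def is_constant_column(B, tol=1e-9):
    """Constant-column bicluster: every column holds a single value."""
    return bool(np.all(np.ptp(B, axis=0) <= tol))

def is_additive(B, tol=1e-9):
    """Additive bicluster: b_ij = r_i + c_j, i.e. rows differ by offsets."""
    R = B - B[:, :1]          # subtract each row's first entry
    return is_constant_column(R, tol)

B1 = np.array([[1.0, 4.0, 2.0],
               [1.0, 4.0, 2.0]])
B2 = np.array([[1.0, 4.0, 2.0],
               [3.0, 6.0, 4.0]])   # row 1 = row 0 + 2
print(is_constant_column(B1), is_additive(B2))  # True True
```

A multiplicative bicluster (b_ij = r_i * c_j) follows the same pattern with division in place of subtraction; every constant-column bicluster is trivially additive.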

    New approaches for unsupervised transcriptomic data analysis based on Dictionary learning

    Get PDF
The era of high-throughput data generation enables new access to biomolecular profiles and their exploitation. However, the analysis of such biomolecular data, for example transcriptomic data, suffers from the so-called "curse of dimensionality". This occurs in the analysis of datasets with a significantly larger number of variables than data points. As a consequence, overfitting and unintentional learning of process-independent patterns can appear, which can lead to insignificant results in applications. A common way of counteracting this problem is to apply dimension reduction methods and subsequently analyse the resulting low-dimensional representation, which has a smaller number of variables. In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concepts of Dictionary learning, an unsupervised dimension reduction approach. Unlike many dimension reduction approaches widely applied to transcriptomic data analysis, Dictionary learning does not impose constraints on the components to be derived. This allows great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods, which yield models with few non-zero coefficients, often preferred for their simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured. Indeed, transcriptomic data are particularly structured, for example due to the connections between genes and pathways. Nonetheless, the application of Dictionary learning in medical data analysis has mainly been restricted to image analysis. Another advantage of Dictionary learning is that it is an interpretable approach. Interpretability is a necessity in biomolecular data analysis to gain a holistic understanding of the investigated processes.
Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups of samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods widely applied in transcriptomic data analysis. Our methods achieve high performance and overall outperform the comparison methods.
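The underlying decomposition can be sketched with scikit-learn's `DictionaryLearning` on a synthetic stand-in for a transcriptomic matrix. The three hidden "programs", the matrix dimensions, and the hyperparameters below are invented for illustration; this is not the thesis's method itself.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
# Synthetic stand-in for a transcriptomic matrix: 40 samples x 200 genes,
# generated from 3 hidden expression "programs" plus noise.
programs = rng.standard_normal((3, 200))
weights = rng.standard_normal((40, 3))
X = weights @ programs + 0.1 * rng.standard_normal((40, 200))

dl = DictionaryLearning(n_components=3, alpha=1.0,
                        transform_algorithm="lasso_lars", random_state=0)
codes = dl.fit_transform(X)   # sparse per-sample coefficients
print(codes.shape, dl.components_.shape)  # (40, 3) (3, 200)
```

The learned `components_` play the role of gene-level dictionary atoms, and each sample's sparse code is the low-dimensional representation that downstream subgrouping or pseudotime ordering would work on.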

    Functional synthesis of genetic systems

    Full text link
Synthetic genetic regulatory networks (or genetic circuits) can operate in complex biochemical environments to process and manipulate biological information to produce a desired behavior. The ability to engineer such genetic circuits has wide-ranging applications in fields such as therapeutics, energy, agriculture, and environmental remediation. However, engineering multilevel genetic circuits quickly and reliably remains a major challenge in synthetic biology. This difficulty can partly be attributed to the growing complexity of biology, but the predominant challenges include the absence of formal specifications that describe the precise desired behavior of these biological systems, as well as a lack of computational and mathematical frameworks that enable rapid in-silico design and synthesis of genetic circuits. This thesis introduces two major frameworks to reliably design genetic circuits. The first is a framework that enables synthetic biologists to encode Boolean logic functions into living cells. Using a high-level hardware description language to specify the desired behavior of a genetic logic circuit, this framework describes how, given a library of genetic gates, logic synthesis can be applied to synthesize a multilevel genetic circuit while accounting for biological constraints such as 'signal matching', 'crosstalk', and 'genetic context effects'. This framework has been implemented in a tool called Cello, which was applied to design 60 circuits for Escherichia coli, where the circuit function was specified using Verilog code and transformed into a DNA sequence. Across all these circuits, 92% of the output states functioned as predicted. The second is a framework to design complex genetic systems where the focus is on how the system behaves over time rather than at steady state.
Using Signal Temporal Logic (STL), a formalism used to specify properties of dense-time real-valued signals, biologists can specify very precise temporal behaviors of a genetic system. The framework describes how genetic circuits built from a well-characterized library of DNA parts can be scored by quantifying the 'degree of robustness' of in-silico simulations against an STL formula. Using formal verification, experimental data can be used to validate these in-silico designs. In this framework, the design space is also explored to predict external controls (such as approximate small-molecule concentrations) that might be required to achieve a desired temporal behavior. This framework has been implemented in a tool called Phoenix.
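The gate-level view behind logic synthesis for genetic circuits can be illustrated with a toy NOR-only composition: repressor-based genetic gates naturally implement NOR/NOT, and any Boolean function can be compiled to them. The 3-gate AND layout below is a hypothetical illustration, not Cello's actual output.

```python
def nor(a, b):
    """A genetic NOR gate: output is repressed when either input is present."""
    return int(not (a or b))

# AND(x, y) composed from NOR gates only (hypothetical 3-gate layout):
# NOT x = NOR(x, x), NOT y = NOR(y, y), AND(x, y) = NOR(NOT x, NOT y).
def and_from_nors(x, y):
    return nor(nor(x, x), nor(y, y))

table = {(x, y): and_from_nors(x, y) for x in (0, 1) for y in (0, 1)}
print(table)  # AND truth table: only input (1, 1) yields 1
```

In a tool like Cello, each abstract gate is additionally mapped to a concrete repressor with matched signal levels, which is where constraints such as signal matching and crosstalk enter.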