107 research outputs found

    Gene expression data analysis using novel methods: Predicting time delayed correlations and evolutionarily conserved functional modules

    Microarray technology enables the study of gene expression on a large scale. One of the main challenges has been to devise methods to cluster genes that share similar expression profiles. In gene expression time courses, a particular gene may encode a transcription factor and thus control several genes downstream; in this case, the gene expression profiles may be staggered, indicating a time-delayed response in transcription of the later genes. Standard clustering algorithms consider gene expression profiles in a global way, thus often ignoring such local time-delayed correlations. We have developed novel methods to capture time-delayed correlations between expression profiles: (1) a method using dynamic programming and (2) CLARITY, an algorithm that uses a local shape-based similarity measure to predict time-delayed correlations and local correlations. We used CLARITY on a dataset describing the change in gene expression during the mitotic cell cycle in Saccharomyces cerevisiae. The obtained clusters were significantly enriched with genes that share similar functions, reflecting the fact that genes with a similar function are often co-regulated and thus co-expressed. Time-shifted as well as local correlations could also be predicted using CLARITY. In datasets where the expression profiles of independent experiments are compared, standard clustering algorithms often cluster according to all conditions, considering all genes. This increases the background noise and can cause genes that change their expression only under particular conditions to be missed. We have employed a genetic-algorithm-based module predictor that is capable of identifying groups of genes that change their expression only in a subset of conditions. With the aim of supplementing the Ustilago maydis genome annotation, we have used the module prediction algorithm on various independent datasets from Ustilago maydis. The predicted modules were cross-referenced in various Saccharomyces cerevisiae datasets to check their evolutionary conservation between these two organisms. The key contributions of this thesis are novel methods that explore biological information from DNA microarray data.
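The idea of a staggered, time-delayed expression profile can be illustrated with a simple lag scan. The sketch below is not CLARITY and not the dynamic programming method of the thesis; it is a plain lagged Pearson correlation, and the function names (`pearson`, `best_lag`) and the toy profiles are assumptions made for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no shape information
    return cov / sqrt(vx * vy)


def best_lag(regulator, target, max_lag):
    """Return (lag, r) maximizing the correlation of regulator[t] with target[t + lag]."""
    best = (0, pearson(regulator, target))
    for lag in range(1, max_lag + 1):
        # Drop the last `lag` points of the regulator and the first `lag` of the target.
        r = pearson(regulator[:-lag], target[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical toy profiles: the target repeats the regulator's peak two time points later.
regulator = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0]

lag, r = best_lag(regulator, target, max_lag=4)
print(lag, round(r, 3))
```

A global correlation at lag 0 is negative for these two profiles, which is why a clustering method that only compares unshifted profiles would likely separate them.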

    Single-cell morphological data reveals signaling network architecture

    Thesis (Ph.D.)--Harvard-MIT Division of Health Sciences and Technology, 2010. Cataloged from PDF version of thesis. Includes bibliographical references. Metastasis, the migration of cancer cells from the primary site of tumorigenesis and the subsequent invasion of secondary tissues, causes the vast majority of cancer deaths. To spread, metastatic cells dramatically rearrange their shape in complex, dynamic fashions. Genes encoding signaling proteins that regulate cell shape in normal cells are often mutated in cancer, especially in highly metastatic disease. To study these key signaling proteins in locomotion and metastasis, we develop and validate statistical methods to extract information from high-throughput morphological data from genetic screens. Our contributions fall into three major categories. 1) To define and apply robust statistical measures to identify genes regulating morphological variability. We develop and thoroughly test methods for measuring the morphological variability of single-cell populations, and apply these metrics to genetic screens in yeast and fly. We further apply these techniques to subsets of genes involved in cellular processes to study genetic contributions to variability in these processes. We propose new roles for genes as suppressors or enhancers of morphological noise. We validate our findings on the basis of known gene function and network architecture. 2) To perform inference of protein signaling relationships by utilizing high-throughput morphological data. We apply machine-learning techniques to systematically identify genetic interactions between proteins on the basis of image-based data from double-knockout screens. Next, we focus on RhoGTPases and RhoGTPase Activating Proteins (RhoGAPs) in Drosophila, where we use basic knowledge of network architecture and apply our techniques to detect signaling relationships. 3) To integrate expression data with high-throughput morphological data to study the mechanisms that determine cell morphology. We utilize morphological and microarray data from fly screens. By comparing expression data between control treatment conditions and treatment conditions displaying morphological phenotypes (e.g. high population variability), we identify genes and pathways correlated with this class distinction, thereby validating our previous studies and providing further insight into the determination of morphology. A key challenge in systems biology is to analyze emerging high-throughput image-based data to understand how cellular phenotypes are genetically encoded. Our work makes significant contributions to the literature on high-throughput morphological study and describes a path for future investigation. By Oaz Nir. Ph.D.

    The effect of noise on dynamics and the influence of biochemical systems

    Understanding a complex system requires integration and collective analysis of data from many levels of organisation. Predictive modelling of biochemical systems is particularly challenging because the data are plagued by noise operating at every level. Inevitably we have to decide whether we can reliably infer the structure and dynamics of biochemical systems from present data. Here we approach this problem from many fronts by analysing the interplay between deterministic and stochastic dynamics in a broad collection of biochemical models. In a classical mathematical model we first illustrate how this interplay can be described in surprisingly simple terms; we furthermore demonstrate the advantages of a statistical point of view for more complex systems as well. We then investigate strategies for the integrated analysis of models characterised by different organisational levels, and trace the propagation of noise through such systems. We use this approach to uncover, for the first time, the dynamics of metabolic adaptation of a plant pathogen throughout its life cycle and discuss the ecological implications. Finally, we investigate how reliably we can infer the parameters of biochemical models. We develop a novel sensitivity/inferability analysis framework that is generally applicable to a large fraction of current mathematical models of biochemical systems. By using this framework to quantify the effect of parametric variation on system dynamics, we provide practical guidelines as to when and why certain parameters are easily estimated while others are much harder to infer. We highlight the limitations on parameter inference due to model structure and qualitative dynamical behaviour, and identify candidate elements of control in biochemical pathways that are most likely to be subject to regulation.

    Mendelian randomization: concepts and scope


    Data integration for the analysis of uncharacterized proteins in Mycobacterium tuberculosis

    Includes abstract. Includes bibliographical references (leaves 126-150). Mycobacterium tuberculosis is a bacterial pathogen that causes tuberculosis, a leading cause of human death worldwide from infectious diseases, especially in Africa. Despite enormous advances achieved in recent years in controlling the disease, tuberculosis remains a public health challenge. The contribution of existing drugs is of immense value, but the deadly synergy of the disease with Human Immunodeficiency Virus (HIV) or Acquired Immunodeficiency Syndrome (AIDS) and the emergence of drug resistant strains are threatening to compromise gains in tuberculosis control. In fact, the development of active tuberculosis is the outcome of the delicate balance between bacterial virulence and host resistance, which constitute two distinct and independent components. Significant progress has been made in understanding the evolution of the bacterial pathogen and its interaction with the host. The end point of these efforts is the identification of virulence factors and drug targets within the bacterium in order to develop new drugs and vaccines for the eradication of the disease.

    Qualitative Change Detection Approach For Preventive Therapies

    Currently, most diseases are diagnosed only after disease-associated changes have occurred. In this PhD dissertation, we propose a paradigm shift from treating the disease to maintaining the healthy state. The proposed approach is able to identify when systemic qualitative changes in biological systems happen, thus opening the possibility of therapeutic interventions before the occurrence of symptoms. The change detection method exploits knowledge from biological networks and longitudinal data using a system impact analysis approach. This approach is validated on eight datasets, for seven different model organisms and eight biological phenomena. On these data, our proposed method performs well, consistently identifying the qualitative change in each dataset. Most importantly, the method accurately detected the transition from the control stage (benign) to the early stage of hepatocellular carcinoma on an eight-stage disease dataset. Knowing when a transition (qualitative change) from healthy to disease occurs may help preserve the healthy state. We also propose a novel analysis approach for metabolic pathway analysis that uses an impact analysis approach and leverages the stoichiometry of bio-chemical reactions to identify which pathways are significantly disrupted by the change in metabolite levels in disease samples versus healthy controls. Our approach outperforms the over-representation approach when evaluated on simulated data. We applied our proposed method to biological experiment data that compares samples from pregnant women to non-pregnant control samples. Our method was able to identify biologically relevant results on real high-throughput data better than the classical approach. In summary, we developed two novel methods for the analysis of high-throughput biological data, gene expression and metabolite concentration, respectively. 
The proposed methods can be adapted to work together to capture relevant complementary information stored in time-course datasets for gene expression or metabolite levels that may be available for complex diseases, in order to identify when a qualitative change happens, before the physiological onset of the disease.
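The over-representation approach used as the baseline in this abstract is commonly implemented as a one-sided hypergeometric test. The sketch below shows that classical baseline only, not the dissertation's impact analysis method; the function name and the counts in the example are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(population, annotated, selected, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: all measured metabolites (or genes)
    annotated:  those belonging to the pathway of interest
    selected:   those flagged as changed between disease and control
    hits:       flagged items that also belong to the pathway
    """
    total = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 measured, 5 in the pathway, 4 flagged, 3 of those in the pathway.
p = overrepresentation_pvalue(population=20, annotated=5, selected=4, hits=3)
print(round(p, 4))
```

This test only counts membership. It ignores the magnitude of each change and the reaction stoichiometry, which is the kind of information the abstract says the proposed method leverages.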

    Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

    Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. 
Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses
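Multiple testing is one of the subtopics named in this abstract. As an illustration of why it matters when thousands of variables are tested at once, here is a minimal sketch of the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The abstract does not say which procedures the paper covers, so the choice of method, the function name, and the p-values are assumptions.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(pvals, alpha=0.05)
print(sum(decisions))
```

Without any correction, five of these ten p-values fall below 0.05; the procedure keeps only the two smallest.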

    Noise Propagation and Information Transmission in the Tumor Necrosis Factor Signaling Pathway

    Biological noise is generally defined as the non-genetic variability that arises in populations. For instance, identical twins, although very similar in appearance, will commonly display slightly different phenotypes. Likewise, daughter cells sharing the same genetic material may differentiate along divergent paths. In the past decade, there have been considerable advances in understanding the genetic mechanisms underpinning this variability; however, there still remain unanswered questions surrounding how signaling networks contribute to biological noise and how this noise sets limitations on intracellular information transmission. In the first half of this thesis, we demonstrate that a linear relationship between signal transduction responses allows one to quantify and map the propagation of noise along different parts of a signaling network, even if the network is complex and partially defined. We discover that the JNK pathway generates higher noise than the NF-κB pathway while the activation of c-Jun adds a greater amount of noise than the activation of ATF-2. In addition, by analyzing the negative feedback mechanisms mediated by the protein A20, we find that A20 can suppress noise in the activation of ATF-2 by separately inhibiting the tumor necrosis factor (TNF) receptor complex and JNK pathway. In the second half of this thesis, we will describe an integrative theoretical and experimental framework, based on the formalism of information theory, to quantitatively predict and measure the amount of information transduced by molecular and cellular networks. Analyzing TNF signaling, we find that individual TNF signaling pathways transduce information only sufficient for accurate binary decisions, and an upstream bottleneck limits the information gained via multiple integrated pathways. In this dissertation, we demonstrate that the application of engineering concepts proves to be of great utility in uncovering novel characteristics of biological noise. 
We anticipate that these contributions will help move biology closer towards a more predictable and rule-based engineering discipline allowing us to design de novo biological solutions to pressing issues
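The statement that a pathway transduces information "only sufficient for accurate binary decisions" can be read against the mutual information between input and response: an output that reliably distinguishes two input states carries about 1 bit. The sketch below computes mutual information from a small discrete joint probability table. It is an illustration of the quantity, not the thesis's estimation procedure, and the tables are hypothetical.

```python
from math import log2


def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal over inputs
    py = [sum(col) for col in zip(*joint)]  # marginal over outputs
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * log2(p / (px[i] * py[j]))
    return mi


# Hypothetical two-state input (e.g. low/high dose) and two-state response.
noiseless = [[0.5, 0.0], [0.0, 0.5]]  # response always matches the input
noisy = [[0.4, 0.1], [0.1, 0.4]]      # response is wrong 20% of the time

print(round(mutual_information(noiseless), 3))
print(round(mutual_information(noisy), 3))
```

Noise in the response lowers the transmitted information well below 1 bit even though the response is still right most of the time.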

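    Regularized dependency modelling between gene expression and metabolomics data in the study of metabolic regulation (Regularisoitu riippuvuuksien mallintaminen geeniekpressio- ja metabolomiikkadatan välillä metabolian säätelyn tutkimuksessa)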

    Fusing different high-throughput data sources is an effective way to reveal the functions of unknown genes, as well as regulatory relationships between biological components such as genes and metabolites. Dependencies between biological components functioning in the different layers of biological regulation can be investigated using canonical correlation analysis (CCA). However, the properties of high-throughput bioinformatics data pose many challenges for data analysis: the sample size is often insufficient compared to the dimensionality of the data, and the data exhibit multicollinearity due to, for example, co-expressed and co-regulated genes. Therefore, a regularized version of classical CCA is often adopted. An alternative way of introducing regularization to statistical models is to perform Bayesian data analysis with suitable priors. In this Master's thesis, the performance of a new variant of Bayesian CCA called gsCCA (group-wise sparse CCA) is compared to a classical ridge-regression-regularized CCA (rrCCA) in revealing relevant information shared between two high-throughput data sets. The gsCCA produces a regularization effect partly similar to that of rrCCA but, in addition, introduces a new type of regularization to the data covariance matrices. Both CCA methods are applied to gene expression and metabolite concentration measurements obtained from an oxidative-stress-tolerant Arabidopsis thaliana ecotype, Col-0, and an oxidative-stress-sensitive mutant, rcd1, as time series under ozone exposure and in a control condition. The aim of this work is to reveal new regulatory mechanisms in oxidative stress signalling in plants. For both methods, rrCCA and gsCCA, the thesis illustrates their potential to reveal both already known and possibly new regulatory mechanisms between genes and metabolites in Arabidopsis thaliana oxidative stress signalling.
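For orientation, ridge-regularized CCA is usually written as the following optimization problem. This is the standard textbook form, not a formula taken from the thesis; the symbols (sample covariance matrices $\Sigma_{XX}$, $\Sigma_{YY}$, $\Sigma_{XY}$ of the two data sets, weight vectors $a$, $b$, and ridge parameters $\lambda_1, \lambda_2 \ge 0$) are assumptions introduced here.

```latex
\rho(\lambda_1, \lambda_2) \;=\; \max_{a,\,b}\;
\frac{a^{\top} \Sigma_{XY}\, b}
     {\sqrt{a^{\top}\left(\Sigma_{XX} + \lambda_1 I\right) a}\;
      \sqrt{b^{\top}\left(\Sigma_{YY} + \lambda_2 I\right) b}}
```

Setting $\lambda_1 = \lambda_2 = 0$ recovers classical CCA. Adding the ridge terms keeps the matrices in the denominator invertible when the number of samples is smaller than the number of variables or when variables are strongly collinear, which are the two difficulties the abstract names.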