19 research outputs found

    Finding correlations and independences in omics data

    Biological studies across all omics fields generate vast amounts of data. To understand these complex data, biologically motivated data mining techniques are indispensable. Evaluation of high-throughput measurements usually relies on identifying underlying signals as well as shared or outstanding characteristics. To this end, methods have been developed to recover the source signals of datasets, to reveal objects that are more similar to each other than to other objects, and to detect observations that contrast with the background dataset. Each biological problem was addressed individually with solutions from computer science suited to its needs. The study of protein-protein interactions (interactome) focuses on the identification of clusters, i.e. sub-graphs of graphs: a parameter-free graph clustering algorithm based on the concept of graph compression was developed to find sets of highly interlinked proteins sharing similar characteristics. The study of lipids (lipidome) calls for co-regulation analyses: to reveal lipids that respond similarly to biological factors, partial correlations were generated with differential Gaussian Graphical Models while accounting solely for disease-specific correlations. The study at the single-cell level (cytomics) aims to understand cellular systems, often with the help of microscopy techniques: a novel noise-robust source separation technique made it possible to reliably extract independent components describing protein behaviors from microscopy images. The study of peptides (peptidomics) often requires the detection of outstanding observations: by assessing regularities in the dataset, an outlier detection algorithm was implemented based on the compression efficacy of independent components of the dataset. The developed algorithms had to fulfill highly diverse constraints in each omics field, but these were met with methods derived from standard correlation and dependency analyses.
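
The partial correlations mentioned in the abstract above can be illustrated with a minimal sketch. It is not the thesis's differential Gaussian Graphical Model: it only computes a first-order partial correlation between two variables given a third, on a small constructed dataset. The function names `pearson` and `partial_corr` and the toy vectors are assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (hypothetical): x and y both depend on z, plus mutually
# orthogonal noise terms e1 and e2 that are also orthogonal to z.
z = [1, 1, -1, -1]
e1 = [1, -1, 1, -1]
e2 = [1, -1, -1, 1]
x = [2 * zi + a for zi, a in zip(z, e1)]
y = [2 * zi + b for zi, b in zip(z, e2)]

r_plain = pearson(x, y)             # strong marginal correlation
r_partial = partial_corr(x, y, z)   # essentially zero once z is accounted for
print(round(r_plain, 3), round(abs(r_partial), 3))
```

In this toy case x and y are strongly correlated only because both follow z; conditioning on z removes the association, which is the kind of distinction a graphical model based on partial correlations is meant to capture.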

    Bayesian networks for omics data analysis

    This thesis focuses on two aspects of high throughput technologies, i.e. data storage and data analysis, in particular in transcriptomics and metabolomics. Both technologies are part of a research field that is generally called ‘omics’ (or ‘-omics’, with a leading hyphen), which refers to genomics, transcriptomics, proteomics, or metabolomics. Although these techniques study different entities (genes, gene expression, proteins, or metabolites), they all have in common that they use high-throughput technologies such as microarrays and mass spectrometry, and thus generate huge amounts of data. Experiments conducted using these technologies allow one to compare different states of a living cell, for example a healthy cell versus a cancer cell or the effect of food on cell condition, and at different levels. The tools needed to apply omics technologies, in particular microarrays, are often manufactured by different vendors and require separate storage and analysis software for the data generated by them. Moreover, experiments conducted using different technologies cannot be analyzed simultaneously to answer a biological question. Chapter 3 presents MADMAX, our software system which supports storage and analysis of data from multiple microarray platforms. It consists of a vendor-independent database which is tightly coupled with vendor-specific analysis tools. Upcoming technologies like metabolomics, proteomics and high-throughput sequencing can easily be incorporated in this system. Once the data are stored in this system, one wants to deduce a biologically relevant meaning from these data, and here statistical and machine learning techniques play a key role. The aim of such analysis is to search for relationships between entities of interest, such as genes, metabolites or proteins. One of the major goals of these techniques is to search for causal relationships rather than mere correlations.
It is often emphasized in the literature that "correlation is not causation" because people tend to jump to conclusions by making inferences about causal relationships when they actually only see correlations. Statistics are often good at finding these correlations; techniques called linear regression and analysis of variance form the core of applied multivariate statistics. However, these techniques cannot find causal relationships, nor can they incorporate prior knowledge of the biological domain. Graphical models, a machine learning technique, on the other hand do not suffer from these limitations. Graphical models, a combination of graph theory, statistics and information science, are one of the most exciting things happening today in the field of machine learning applied to biological problems (see chapter 2 for a general introduction). This thesis deals with a special type of graphical model known as probabilistic graphical models, belief networks or Bayesian networks. The advantage of Bayesian networks over classical statistical techniques is that they allow the incorporation of background knowledge from a biological domain, and that analysis of data is intuitive as it is represented in the form of graphs (nodes and edges). Standard statistical techniques are good at describing the data but are not able to find non-linear relations, whereas Bayesian networks allow prediction and the discovery of non-linear relations. Moreover, Bayesian networks allow hierarchical representation of data, which makes them particularly useful for representing biological data, since most biological processes are hierarchical by nature. Once we have such a causal graph, made either by a computer program or constructed manually, we can predict the effects of a certain entity by manipulating the state of other entities, or make backward inferences from effects to causes.
Of course, if the graph is big, doing the necessary calculations can be very difficult and CPU-expensive, and in such cases approximate methods are used. Chapter 4 demonstrates the use of Bayesian networks to determine the metabolic state of feeding and fasting mice to determine the effect of a high fat diet on gene expression. This chapter also shows how selection of genes based on key biological processes generates more informative results than standard statistical tests. In chapter 5 the use of Bayesian networks is shown on the combination of gene expression data and clinical parameters, to determine the effect of smoking on gene expression and which genes are responsible for the DNA damage and the rise in plasma cotinine levels in the blood of a smoking population. This study was conducted at Maastricht University where 22 twin smokers were profiled. Chapter 6 presents the reconstruction of a key metabolic pathway which plays an important role in ripening of tomatoes, thus showing the versatility of the use of Bayesian networks in metabolomics data analysis. The general trend in research shows a flood of data emerging from sequencing and metabolomics experiments. This means that to perform data mining on these data one requires intelligent techniques that are computationally feasible and able to take the knowledge of experts into account to generate relevant results. Graphical models fit this paradigm well and we expect them to play a key role in mining the data generated from omics experiments.
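
The forward prediction and the backward inference "from effects to causes" described in this abstract can be sketched on the smallest possible Bayesian network, a single edge Cause -> Effect. This is a minimal illustration and not the thesis's software; the variable names and all probabilities below are hypothetical values chosen for the example.

```python
# Hypothetical two-node Bayesian network: Cause -> Effect.
# All probabilities are made-up illustration values, not data from the thesis.
p_cause = 0.3                   # prior P(Cause)
p_effect_given_cause = 0.8      # P(Effect | Cause)
p_effect_given_no_cause = 0.1   # P(Effect | not Cause)


def predict_effect(prior, like_if_cause, like_if_no_cause):
    """Forward prediction: marginal P(Effect) by summing over the cause."""
    return like_if_cause * prior + like_if_no_cause * (1 - prior)


def infer_cause(prior, like_if_cause, like_if_no_cause):
    """Backward inference: P(Cause | Effect observed) via Bayes' rule."""
    evidence = predict_effect(prior, like_if_cause, like_if_no_cause)
    return like_if_cause * prior / evidence


p_effect = predict_effect(p_cause, p_effect_given_cause, p_effect_given_no_cause)
posterior = infer_cause(p_cause, p_effect_given_cause, p_effect_given_no_cause)
print(f"P(Effect) = {p_effect:.3f}, P(Cause | Effect) = {posterior:.3f}")
```

With only two nodes the sums are trivial; in larger graphs the same marginalisation runs over many variables, which is where the computational cost mentioned in the abstract arises.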

    Learning Gaussian graphical models with fractional marginal pseudo-likelihood

    We propose a Bayesian approximate inference method for learning the dependence structure of a Gaussian graphical model. Using pseudo-likelihood, we derive an analytical expression to approximate the marginal likelihood for an arbitrary graph structure without invoking any assumptions about decomposability. The majority of the existing methods for learning Gaussian graphical models are either restricted to decomposable graphs or require specification of a tuning parameter that may have a substantial impact on learned structures. By combining a simple sparsity inducing prior for the graph structures with a default reference prior for the model parameters, we obtain a fast and easily applicable scoring function that works well for even high-dimensional data. We demonstrate the favourable performance of our approach by large-scale comparisons against the leading methods for learning non-decomposable Gaussian graphical models. A theoretical justification for our method is provided by showing that it yields a consistent estimator of the graph structure. (C) 2017 Elsevier Inc. All rights reserved.
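
For orientation, the generic idea behind a pseudo-likelihood score can be written as follows. This is a sketch in assumed notation, not the paper's exact expression: $X$ is the data matrix with $p$ variables, $G$ the graph, and $\mathrm{mb}(j)$ the Markov blanket (the neighbours) of node $j$ in $G$. The joint marginal likelihood is approximated by a product of node-wise conditional terms.

```latex
\hat{p}(X \mid G) \;=\; \prod_{j=1}^{p} p\bigl(X_j \mid X_{\mathrm{mb}(j)}\bigr)
```

Each factor involves only one node and its neighbours, so a score of this form can be evaluated node by node for any graph, which is consistent with the abstract's statement that no decomposability assumption is needed.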

    Influence of Geography on the Healthy Gut Microbiome and the Role of the Gut Microbiome in IBD Symptom and Disease Progression

    The human gut microbiome is believed to play an integral role in host health and disease. In a microbial community, associations between constituent members play an important role in determining the overall structure and function of the community. To understand the nature of bacterial associations at the species level in healthy human gut microbiomes, we analyzed previously published collections of whole-genome shotgun sequence data from fecal samples obtained from four different healthy human populations. Using a Random Forest Classifier, we identified bacterial species that were prevalent in these populations and whose relative abundances could be used to accurately distinguish between the populations. Bacterial association networks constructed using these signature species revealed conserved bacterial associations across populations and a dominance of positive associations over negative associations, with this dominance being driven by associations between species that are closely related either taxonomically or functionally. Functional analysis using protein families suggests that much of the taxonomic variation across human populations does not lead to substantial functional differences. Next, multiple external healthy controls from the same geographical region (American population) were compared to Inflammatory Bowel Disease (IBD) samples from the American population using shotgun sequencing data. We identified 34 bacterial species that were significantly elevated in IBD samples, relative to all control groups. These species elevated in IBD appear to play important roles in the healthy control groups, but it is possible that their over-abundance has deleterious effects on the host, possibly due to many of these bacteria being involved in mucin degradation, immune modulation, antibiotic resistance, and inflammation.
We also identified differences in functional capacities between IBD and healthy controls and linked the changes in the functional capacity to previously published clinical research and to symptoms that commonly occur in IBD, such as rectal bleeding, diarrhea, vitamin K deficiency, and inflammation.

    Short-term health effects of air pollution and viral exposure

    This thesis describes the short-term effects of environmental exposures on the cardiorespiratory system, and consists of two parts. In the first part, we investigated the short-term health effects of air pollution, in which 21 healthy young adults were repeatedly (2-5 visits) exposed for 5 hours to the ambient air near a major airport and two highways. We found that exposure to high levels of ultrafine particles decreased lung function, prolonged the QTc interval, altered the exhaled breath profile, and heightened concentrations of oxidative stress markers in urine. This study shows the importance of air pollution reduction and the need for future research to determine how detrimental the (long-term) effects of exposure to ultrafine particles (from aviation) are. In the second part, we investigated the effects of a rhinovirus challenge on the fluctuations in exhaled metabolites. Exhaled breath measurements were performed 2-3 times per week using an electronic nose, 60 days before and 30 days after a rhinovirus-16 (RV16) challenge, in non-atopic healthy adults (n=12) and atopic mild asthmatics (n=12). We found that day-to-day fluctuations in the exhaled breath profiles rapidly increased after the RV16 challenge, with distinct differences between atopic mild asthmatics and non-atopic healthy volunteers. This proof-of-concept study shows the potential of exhaled breath analysis for monitoring of virus-induced exacerbations in asthma.

    Challenges in the Analysis of Mass-Throughput Data: A Technical Commentary from the Statistical Machine Learning Perspective

    Sound data analysis is critical to the success of modern molecular medicine research that involves collection and interpretation of mass-throughput data. The novel nature and high dimensionality of such datasets pose a series of nontrivial data analysis problems. This technical commentary discusses the problems of over-fitting, error estimation, curse of dimensionality, causal versus predictive modeling, integration of heterogeneous types of data, and lack of standard protocols for data analysis. We attempt to shed light on the nature and causes of these problems and to outline viable methodological approaches to overcome them.

    Statistical methods for genetic association studies with response-selective sampling designs

    This dissertation describes new statistical methods designed to improve the power of genetic association studies. Of particular interest are studies with a response-selective sampling design, i.e. case-control studies of unrelated individuals and case-control studies of family members. The statistical methods presented in this dissertation (a) take advantage of information available in the distribution of the covariates in case-control studies by modeling the ascertainment process; (b) incorporate information from both family-based studies and case-control studies of unrelated individuals; (c) use "richer" models of the relationship between genetic variants and phenotypes, compared to models used in standard genetic association studies; and (d) integrate different types of data, such as genomic, epigenomic, transcriptomic and environmental information. Together, these methods will improve the ability of the genetics community to identify the genetic basis of complex human phenotypes.

    Analysing datafied life

    Our life is being increasingly quantified by data. To obtain information from quantitative data, we need to develop various analysis methods, which can be drawn from diverse fields, such as computer science, information theory and statistics. This thesis focuses on investigating methods for analysing data generated for medical research. It concentrates on using various data to quantify patients for personalized treatment. From the perspective of data type, this thesis proposes analysis methods for data from the fields of Bioinformatics and medical imaging. We will discuss the need to use data from the molecular level to the pathway level and also to incorporate medical imaging data. Different preprocessing methods should be developed for different data types, while some post-processing steps for various data types, such as classification and network analysis, can be done by a generalized approach. From the perspective of research questions, this thesis studies methods for answering five typical questions from simple to complex. These questions are detecting associations, identifying groups, constructing classifiers, deriving connectivity and building dynamic models. Each research question is studied in a specific field. For example, detecting associations is investigated for fMRI signals. However, the proposed methods can be naturally extended to solve questions in other fields. This thesis has successfully demonstrated that applying a method traditionally used in one field to a new field can bring many new insights. Five main research contributions for different research questions have been made in this thesis. First, to detect active brain regions associated with tasks using fMRI signals, a new significance index, CR-value, has been proposed. It originates from the idea of using sparse modelling in gene association studies.
Secondly, in quantitative Proteomics analysis, a clustering-based method has been developed to extract more information from large-scale datasets than traditional methods. Clustering methods, which are usually used in finding subgroups of samples or features, are used to match similar identities across samples. Thirdly, a pipeline originally proposed in the field of Bioinformatics has been adapted to multivariate analysis of fMRI signals. Fourthly, the concept of elastic computing in computer science has been used to develop a new method for generating functional connectivity from fMRI data. Finally, sparse signal recovery methods from the domain of signal processing are suggested to solve the underdetermined problem of network model inference.

    Kallikrein-related peptidases 4, 5, 7 and 12 as prognostic biomarkers in advanced high-grade serous ovarian cancer and triple-negative breast cancer

    In the current study, mRNA expression levels of KLK4, 5, 7 and 12 were investigated in tumor tissues of two patient cohorts, either afflicted with advanced high-grade serous ovarian cancer (HGSOC) or triple-negative breast cancer (TNBC). Together with available KLK5 and KLK7 protein data from a previous study, coordinate expression was observed between KLK5 and KLK7 in both entities. Moreover, elevated KLK4 and KLK7 mRNA levels were found to represent independent unfavorable prognostic biomarkers in HGSOC, while elevated KLK12 mRNA expression was independently related to worse prognosis in TNBC.