523 research outputs found

    Integration and visualisation of clinical-omics datasets for medical knowledge discovery

    In recent decades, the rise of various omics fields has flooded life sciences with unprecedented amounts of high-throughput data, which have transformed the way biomedical research is conducted. This trend will only intensify in the coming decades, as the cost of data acquisition will continue to decrease. Therefore, there is a pressing need to find novel ways to turn this ocean of raw data into waves of information and finally distil those into drops of translational medical knowledge. This is particularly challenging because of the incredible richness of these datasets, the humbling complexity of biological systems and the growing abundance of clinical metadata, which makes the integration of disparate data sources even more difficult. Data integration has proven to be a promising avenue for knowledge discovery in biomedical research. Multi-omics studies allow us to examine a biological problem through different lenses using more than one analytical platform. These studies not only present tremendous opportunities for the deep and systematic understanding of health and disease, but they also pose new statistical and computational challenges. The work presented in this thesis aims to alleviate this problem with a novel pipeline for omics data integration. Modern omics datasets are extremely feature rich and in multi-omics studies this complexity is compounded by a second or even third dataset. However, many of these features might be completely irrelevant to the studied biological problem or redundant in the context of others. Therefore, in this thesis, clinical metadata driven feature selection is proposed as a viable option for narrowing down the focus of analyses in biomedical research. Our visual cortex has been fine-tuned through millions of years to become an outstanding pattern recognition machine. 
To leverage this incredible resource of the human brain, we need to develop advanced visualisation software that enables researchers to explore these vast biological datasets through illuminating charts and interactivity. Accordingly, a substantial portion of this PhD was dedicated to implementing truly novel visualisation methods for multi-omics studies.

    Multi-dimensional experimental and computational exploration of metabolism pinpoints complex probiotic interactions

    Multi-strain probiotics are widely regarded as effective products for improving gut microbiota stability and host health, providing advantages over single-strain probiotics. However, in general, it is unclear to what extent different strains would cooperate or compete for resources, and how the establishment of a common biofilm microenvironment could influence their interactions. In this work, we develop an integrative experimental and computational approach to comprehensively assess the metabolic functionality and interactions of probiotics across growth conditions. Our approach combines co-culture assays with genome-scale modelling of metabolism and multivariate data analysis, thus exploiting complementary data- and knowledge-driven systems biology techniques. To show the advantages of the proposed approach, we apply it to the study of the interactions between two widely used probiotic strains of Lactobacillus reuteri and Saccharomyces boulardii, characterising their production potential for compounds that can be beneficial to human health. Our results show that these strains can establish a mixed cooperative-antagonistic interaction best explained by competition for shared resources, with an increased individual exchange but an often decreased net production of amino acids and short-chain fatty acids. Overall, our work provides a strategy that can be used to explore microbial metabolic fingerprints of biotechnological interest, capable of capturing multifaceted equilibria even in simple microbial consortia

    The International Virus Bioinformatics Meeting 2020.

    The International Virus Bioinformatics Meeting 2020 was originally planned to take place in Bern, Switzerland, in March 2020. However, the COVID-19 pandemic put a spoke in the wheel of almost all conferences to be held in 2020. After moving the conference to 8-9 October 2020, we got hit by the second wave and finally decided at short notice to go fully online. On the other hand, the pandemic has made us even more aware of the importance of accelerating research in viral bioinformatics. Advances in bioinformatics have led to improved approaches to investigate viral infections and outbreaks. The International Virus Bioinformatics Meeting 2020 attracted approximately 120 experts in virology and bioinformatics from all over the world to join the two-day virtual meeting. Despite concerns that virtual meetings lack possibilities for face-to-face discussion, the participants from this small community created a highly interactive scientific environment, engaging in lively and inspiring discussions and suggesting new research directions and questions. The meeting featured five invited and twelve contributed talks on four main topics: (1) proteome and RNAome of RNA viruses, (2) viral metagenomics and ecology, (3) virus evolution and classification and (4) viral infections and immunology. Further, the meeting featured 20 oral poster presentations, all of which focused on specific areas of virus bioinformatics. This report summarizes the main research findings and highlights presented at the meeting.

    A Statistical Framework For Nutriomics Data Analysis

    Nutriomics is a new discipline that investigates the relationship between nutrition and health through the use of high-throughput omics technologies. However, the inherent complexity of nutriomics data poses several challenges for data analysis. In this thesis, we introduce nutriomics and the statistical challenges associated with its analysis. We propose statistical modelling and machine learning methods to tackle three main challenges: non-linearity, high dimensionality, and data heterogeneity. To deal with these challenges, we first propose a statistical framework, which we call LC-N2G, to test whether nutrition intake and omics features of interest are significantly associated. We use public data as an example to show LC-N2G's ability to discover non-linear associations between nutrition and gene expression. Then we propose a statistical method, called eNODAL, to cluster high-dimensional omics features based on how they respond to nutrition intake. The application of eNODAL to a mouse proteomics nutrition study shows that eNODAL can identify interpretable clusters of proteins with similar responses to diet and drug treatment. Finally, a statistical model, which we call NEMoE, is proposed to uncover the heterogeneous interplay among diet, omics, and health outcomes. We use a microbiome Parkinson’s disease (PD) study to illustrate the method and show that NEMoE is able to identify diet-specific microbial signatures of PD. Overall, this thesis proposes statistical methods to analyze nutriomics data and outlines possible future extensions based on the research. The methods proposed in this thesis could help researchers better understand the complex relationships between nutrition and health, ultimately leading to improved health outcomes.

    A deep phenotyping approach to understand major depressive disorder and responses to antidepressant pharmacotherapy

    Major depressive disorder (MDD) is a debilitating psychiatric disorder characterised by a complex underlying biology and poor response to pharmacological antidepressant strategies. Given the heterogeneity of MDD and the diverse range of available treatment options, there is an increasing desire to develop and implement precision medicine approaches to tailor existing treatment strategies to the biological system of the individual. In this thesis, high-resolution omics data (connectomics [fMRI], metabolomics [1H NMR] and immunomics [inflammatory cytokines]) collected from the Canadian Biomarker Integration Network in Depression (CAN-BIND) study has been integrated to facilitate the deep phenotyping of MDD. In addition, this approach has been used to predict the treatment response to two common antidepressant drugs, monotherapy with the selective serotonin reuptake inhibitor (SSRI) escitalopram (10-20 mg) or combination therapy with escitalopram and the dopaminergic antipsychotic aripiprazole (2-10 mg). This approach identified a multi-modal panel of sex-specific biomarkers of MDD and treatment response, highlighting a strong immunometabolic component in depressed males, but not females. Unsupervised clustering methods indicated the superiority of biological (neuroimaging) over symptom-based (clinical questionnaires) data for the stratification of patients into MDD subtypes with differential response to treatment. More importantly, a set of multi-modal, sex-specific biomarkers were identified that predicted treatment response with escitalopram monotherapy (84.7% accuracy) or aripiprazole augmentation (88.5% accuracy). In addition to highlighting potential new aspects of the biology of MDD (e.g. relevance of lipoprotein size and density for their relation to depression), this work is one of the first attempts to apply systems biology approaches to high-resolution biological data from a large clinical trial to predict later treatment outcome. 
With the validation of the findings presented in this thesis in independent cohorts, and with further development of omics technologies leading to cheaper, high-throughput screening of the patient population, pre-dose biomarkers have the potential to enable personalised treatment. Each year, escitalopram and aripiprazole are prescribed to an estimated 26 million and 7 million individuals respectively, and over one third of them do not respond. Thus, being able to predict response to antidepressant medication from baseline biomarkers has enormous clinical and socioeconomic benefits.

    Structured data abstractions and interpretable latent representations for single-cell multimodal genomics

    Single-cell multimodal genomics involves simultaneous measurement of multiple types of molecular data, such as gene expression, epigenetic marks and protein abundance, in individual cells. This allows for a comprehensive and nuanced understanding of the molecular basis of cellular identity and function. The large volume of data generated by single-cell multimodal genomics experiments requires specialised methods and tools for handling, storing, and analysing it. This work provides contributions on multiple levels. First, it introduces a single-cell multimodal data standard, MuData, designed to facilitate the handling, storage and exchange of multimodal data. MuData provides interfaces that enable transparent access to multimodal annotations as well as data from individual modalities. This data structure has formed the foundation for the multimodal integration framework, which enables complex and composable workflows that can be naturally integrated with existing omics-specific analysis approaches. Joint analysis of multimodal data can be performed using integration methods. To enable integration of single-cell data, an improved multi-omics factor analysis model (MOFA+) has been designed and implemented, building on the canonical dimensionality reduction approach for multi-omics integration. By inferring latent factors that explain variation across multiple modalities of the data, MOFA+ enables the modelling of latent factors with cell group-specific patterns of activity. The MOFA+ model has been implemented as part of the respective multi-omics integration framework, and its utility has been extended by software solutions that facilitate interactive model exploration and interpretation. The newly improved model for multi-omics integration of single cells has been applied to the study of gene expression signatures upon targeted gene activation.
In a dataset featuring targeted activation of candidate regulators of zygotic genome activation (ZGA), a crucial transcriptional event in early embryonic development, modelling the expression of both coding and non-coding loci with MOFA+ made it possible to rank genes by their potency to activate a ZGA-like transcriptional response. With the identification of Patz1, Dppa2 and Smarca5 as potent inducers of ZGA-like transcription in mouse embryonic stem cells, these findings have contributed to the understanding of the molecular mechanisms behind ZGA and laid the foundation for future research on ZGA in vivo. In summary, this work’s contributions include the development of data handling and integration methods as well as new biological insights that arose from applying these methods to the study of gene expression regulation in early development. This highlights how single-cell multimodal genomics can help generate valuable insights into complex biological systems.

    Physiology, syntrophy and viral interplay in the marine sponge holobiont

    Holobionts result from intimate associations of eukaryotic hosts and microbes and are now widely accepted as ubiquitous and important elements of nature. Marine sponges diverged early in the animal kingdom, and their holobionts combine simple morphology with complex microbiology. As filter feeders, sponges feed on planktonic bacteria, but also harbour stable species-specific microbial consortia. This interaction with bacteria makes sponges exciting systems in which to study basal determinants of animal-microbe symbioses. While inventories of symbiont taxa and gene functions continue to grow, we still know little about symbiont physiology, cellular interactions and metabolic currencies within sponges. This limits our mechanistic understanding of holobiont stability and function. Therefore, this PhD thesis set out to study what individual symbionts actually do and how they interact. The first part of this thesis focuses on the cell physiology of cosmopolitan sponge symbionts. For the first time, I characterised the ultrastructure of dominant sponge symbiont clades within sponge tissue by establishing fluorescence in situ hybridization-correlative light and electron microscopy (FISH-CLEM). In combination with genome-centred metatranscriptomics, this approach revealed structural adaptations of symbionts to process complex holobiont-derived nutrients (i.e., bacterial microcompartments and bipolar storage polymers). Next, we unravelled complementary symbiont physiologies and cell co-localisation indicating vivid symbiont-symbiont metabolic interactions within the holobiont. This suggests that strategies of nutritional resource partitioning and syntrophy dominate over spatial segregation as a means of avoiding competitive exclusion, providing a mechanistic framework for sustaining high microbial diversity.
By combining stable isotope pulse-chase experiments with metabolic imaging, we demonstrated that symbionts can account for up to 60% of the heterotrophic carbon and nitrogen assimilation in sponges. Thus, sponge symbiont action determines sponge-driven biochemical cycles in marine ecosystems. Finally, I explored the role of phages in the sponge holobiont, focusing on tripartite phage-microbe-host interplay. Sponges appeared as rich reservoirs of novel viral diversity, with 491 previously unidentified genus-level viral clades. Further, sponges harboured highly individual, yet species-specific viral communities. Importantly, I discovered phages, termed “Ankyphages”, that abundantly encode ankyrin proteins. I found such “Ankyphages” to be widespread in host-associated environments, including humans. Using macrophage infection assays, I showed that phage ankyrins aid bacteria in eukaryote immune evasion by downregulating eukaryotic antibacterial immunity. Thus, I identified a potentially widespread mechanism of tripartite phage-prokaryote-host interplay in which phages foster animal-microbe symbioses. Altogether, I draw three main conclusions: the sponge holobiont is a metabolically intertwined ecosystem, symbiont action impacts the environment, and tripartite phage-prokaryote-eukaryote interplay fosters symbiosis.

    Forestogram: Biclustering Visualization Framework with Applications in Public Transport and Bioinformatics

    SUMMARY: In many data analysis problems, the data are expressed in a matrix with subjects in rows and attributes in columns. Traditional clustering methods aim to group the subjects (rows) according to similarity criteria between them. The goal is to form groups of subjects (rows) that share a certain degree of resemblance. The resulting groups guarantee that subjects share similarities in their attributes (columns), but there is no guarantee about what happens at the level of the attributes (columns). In some applications, a simultaneous grouping of rows and columns, called biclustering of the data matrix, may be desired. To this end, we design and develop a new framework called Forestogram, which computes this simultaneous grouping of rows and columns (biclusters) in a hierarchical fashion. Grouping rows and columns simultaneously in a hierarchical manner can help practitioners better understand how the groups evolve, with interesting theoretical properties. Forestogram, the proposed computational and visualization tool, can be thought of as a 3D extension of the dendrogram, with an extended orthogonal merge. Each bicluster consists of a group of rows (or subjects) that unfolds a pattern highly correlated with the corresponding group of columns (or attributes). However, instead of performing two-way clustering independently on each side, we propose a hierarchical biclustering algorithm that takes rows and columns into account at the same time to determine the biclusters. In addition, we develop a model-based information criterion that provides an estimated number of biclusters through a set of hierarchical configurations within the forestogram under mild assumptions.
We study the suggested framework from two different applied perspectives, one in the public transit domain and the other in bioinformatics. First, we study the behavior of public transit users based on two distinct kinds of information, temporal data and spatial coordinates, gathered from the users' smart card transaction data. In many cities around the world, public transit companies use a smart card system to manage fare collection. Analyzing this information provides comprehensive insight into the user's influence in the interactive public transit network. In this regard, the analysis of temporal data describing the time of entry into the public transit network is considered the most important component of the data gathered from smart cards. Classical distance-based clustering techniques are not suitable for analyzing temporal data. A new intuitive projection is suggested to preserve the pattern of time-stamped data. It is introduced into the suggested method to discover the temporal behavioral pattern of users. This projection retains the temporal distance between any arbitrary pair of time-stamped data points, with a meaningful visualization. Consequently, this information is fed into a hierarchical clustering algorithm as a data segmentation method to discover the pattern of users. Then, the time of usage is taken into account as a latent variable to make the Euclidean metric appropriate for extracting the spatial pattern through our forestogram.
As a second application, the forestogram is tested on a multiomics dataset combined from different biological measurements to study how the patients' health status and the corresponding biological modalities evolve hierarchically over the term of pregnancy, in each bicluster. The maintenance of pregnancy relies on a finely tuned balance between tolerance to the fetal allograft and protective mechanisms against invading pathogens. Despite the well-established impact of development during the first months of pregnancy on long-term outcomes, the interactions between the various biological mechanisms that govern the progression of pregnancy have not been studied in detail. Demonstrating the chronology of these adaptations to term pregnancy provides the framework for future studies examining the deviations implicated in pregnancy-related pathologies, including preterm birth and preeclampsia. We perform a multiomics analysis of 51 samples from 17 pregnant women delivering at term. The datasets include measurements of the immunome, transcriptome, microbiome, proteome and metabolome of samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net algorithm is used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets are combined into a single model. This model not only significantly increases predictive power by combining all datasets, but also reveals novel interactions between different biological modalities.
Furthermore, our suggested forestogram is another guideline, along with the gestational age at the time of sampling, that provides an unsupervised model to show how much supervised information is needed for each trimester to characterize effectively the pregnancy-induced changes in microbiome, transcriptome, genome, exposome and immunome responses.
ABSTRACT: In many statistical modeling problems, data are expressed in a matrix with subjects in rows and attributes in columns. In this regard, simultaneous grouping of rows and columns, known as biclustering of the data matrix, is desired. We design and develop a new framework called Forestogram, with the aim of fast computation and hierarchical illustration of biclusters. Often in practical data analysis, we deal with a two-dimensional object known as the data matrix, where observations are expressed as samples (or subjects) in rows, and attributes (or features) in columns. Thus, simultaneous grouping of rows and columns in a hierarchical manner helps practitioners better understand how clusters evolve. Forestogram, a novel computational and visualization tool, could be thought of as a 3D expansion of the dendrogram, with an extended orthogonal merge. Each bicluster consists of a group of rows (or samples) that unfolds a highly correlated schema with their corresponding group of columns (or attributes). However, instead of performing two-way clustering independently on each side, we propose a hierarchical biclustering algorithm which takes rows and columns at the same time to determine the biclusters. Furthermore, we develop a model-based information criterion which provides an estimated number of biclusters through a set of hierarchical configurations within the forestogram under mild assumptions. We study the suggested framework from two different applied perspectives, one in the public transit domain and the other in the bioinformatics field.
First, we investigate users’ behavior in public transit based on two distinct kinds of information, temporal data and spatial coordinates, gathered from smart cards. In many cities worldwide, public transit companies use smart card systems to manage fare collection. Analysis of this information provides comprehensive insight into users’ influence in the interactive public transit network. In this regard, the analysis of temporal data describing the time of entry into the public transit network is considered the most substantial component of the data gathered from the smart cards. Classical distance-based techniques are not always suitable for analyzing this time series data. A novel projection with an intuitive visual map from a higher dimension into a three-dimensional clock-like space is suggested to reveal the underlying temporal pattern of public transit users. This projection retains the temporal distance between any arbitrary pair of time-stamped data points with a meaningful visualization. Consequently, this information is fed into a hierarchical clustering algorithm as a method of data segmentation to discover the pattern of users. Then, the time of usage is taken into account as a latent variable to make the Euclidean metric appropriate for extracting the spatial pattern through our forestogram. As a second application, the forestogram is tested on a multiomics dataset combined from different biological measurements to study how patients and the corresponding biological modalities evolve hierarchically in each bicluster over the term of pregnancy. The maintenance of pregnancy relies on a finely tuned balance between tolerance to the fetal allograft and protective mechanisms against invading pathogens. Despite the well-established impact of development during the early months of pregnancy on long-term outcomes, the interactions between the various biological mechanisms that govern the progression of pregnancy have not been studied in detail.
Demonstrating the chronology of these adaptations to term pregnancy provides the framework for future studies examining deviations implicated in pregnancy-related pathologies, including preterm birth and preeclampsia. We perform a multiomics analysis of 51 samples from 17 pregnant women delivering at term. The datasets include measurements from the immunome, transcriptome, microbiome, proteome, and metabolome of samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net algorithm is used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets are combined into a single model. This model not only significantly increases predictive power by combining all datasets, but also reveals novel interactions between different biological modalities. Furthermore, our suggested forestogram is another guideline, along with the gestational age at the time of sampling, that provides an unsupervised model to show how much supervised information is necessary for each trimester to characterize effectively the pregnancy-induced changes in microbiome, transcriptome, genome, exposome, and immunome responses.
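The clock-like projection mentioned in the abstract above rests on the fact that times of day are cyclic, so a plain difference of timestamps overstates the gap between late-evening and early-morning trips. The following is a minimal sketch of that idea only, assuming a simplified two-dimensional embedding on the unit circle; the thesis describes a three-dimensional clock-like space whose exact construction is not given in this listing, and the names `clock_embedding` and `clock_distance` are hypothetical.

```python
import math

MINUTES_PER_DAY = 24 * 60


def clock_embedding(hour, minute=0):
    """Map a time of day to a point on the unit circle (hypothetical helper)."""
    angle = 2.0 * math.pi * (hour * 60 + minute) / MINUTES_PER_DAY
    return (math.cos(angle), math.sin(angle))


def clock_distance(t1, t2):
    """Euclidean (chord) distance between two embedded times of day."""
    x1, y1 = clock_embedding(*t1)
    x2, y2 = clock_embedding(*t2)
    return math.hypot(x1 - x2, y1 - y2)


# A raw difference of hours treats 23:30 and 00:30 as 23 hours apart;
# on the circle they are exactly as close as 08:00 and 09:00.
late_vs_early = clock_distance((23, 30), (0, 30))
morning_pair = clock_distance((8, 0), (9, 0))

# Times half a day apart sit at opposite ends of a diameter.
noon_vs_midnight = clock_distance((12, 0), (0, 0))
```

Because the embedded points live in an ordinary Euclidean space, distances computed this way could be passed to a standard distance-based hierarchical clustering routine, which is the role the abstract assigns to its projection.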

    Human-microbiota interactions in health and disease: bioinformatics analyses of gut microbiome datasets

    The human gut harbours a vast diversity of microbial cells, collectively known as the gut microbiota, that are crucial for human health and dysfunctional in many of the most prevalent chronic diseases. Until recently, culture-dependent methods limited our ability to study the microbiota in depth, including the collective genomes of the microbiota, the microbiome. Advances in culture-independent metagenomic sequencing technologies have since provided new insights into the microbiome and led to a rapid expansion of data-rich resources for microbiome research. These high-throughput sequencing methods and large datasets provide new opportunities for research with an emphasis on bioinformatics analyses and a novel field for drug discovery through data mining. In this thesis I explore a range of metagenomics analyses to extract insights from metagenomics data and inform drug discovery in the microbiota. Firstly, I survey the existing technologies and data sources available for data mining therapeutic targets. Then I analyse 16S metagenomics data combined with metabolite data from mice to investigate the treatment model of a proposed antibiotic treatment targeting the microbiota. Then I investigate the occurrence frequency and diversity of proteases in metagenomics data in order to inform understanding of host-microbiota-diet interactions through protein and peptide associated glycan degradation by the gut microbiota. Finally, I develop a system to facilitate the process of integrating metagenomics data for gene annotations. One of the main challenges in leveraging the scale of data availability in microbiome research is managing the data resources from microbiome studies. Through a series of analytical studies I used metagenomics data to identify community trends, to demonstrate therapeutic interventions and to perform a wide-scale screen for proteases that are central to human-microbiota interactions.
These studies articulated the requirement for a computational framework to integrate and access metagenomics data in a reproducible way using a scalable data store. The thesis concludes by explaining how data integration in microbiome research is needed to provide the insights into metagenomics data that are required for drug discovery.