80 research outputs found

    Quantitative Inferences from the Lung Microbiome

    Within the last decade, we have progressed from the belief that the healthy human lung is a sterile environment to attempts to study inter-kingdom interactions between microbial residents of the lungs. It has been repeatedly confirmed that the lungs contain both bacteria, predominantly from the Streptococcus, Veillonella, and Prevotella genera, and fungi, predominantly from the Cladosporium, Eurotium, Penicillium, and Aspergillus genera. The community composition as a whole undergoes shifts in every lung disease and condition that has been studied, including asthma, chronic obstructive pulmonary disease, and cystic fibrosis. The studies that have observed these shifts have largely been descriptive, comparing the taxonomies present in healthy lungs to taxonomies in diseased lungs. Here we investigated the lung microbiome and relationships within the microbial community and between microbes and the host in a more quantitative and inferential manner. First, we introduced the lasso-penalized generalized linear mixed model (LassoGLMM) for microbiomes. LassoGLMM was applied to a short time-course study of the human oral bacterial microbiome with standard blood chemical measurements and to repeated measurements of the human lung bacterial microbiome and fungal mycobiome with local and systemic markers of inflammation. We sought to show that increased inflammation and other continuous clinical variables in human hosts are associated with distinct microbes present in the lung or oral microbiomes. Then, we examined cross-domain interactions between bacteria and fungi. Ecological interaction networks were inferred for the human lung and skin micro- and myco-biomes. Networks limited to a single domain of life were compared with those that include both bacteria and fungi to identify important components of the microbial community that would be overlooked in a single-domain study. Finally, we explored the metabolism of the bacteria within the human lung using three different “-omics” datasets: taxonomic assignments from 16S rRNA gene sequences, gene families from metatranscriptomic sequences, and mass-to-charge ratio (m/z) features from metabolomics. Correlations were examined between pairs of datasets, and all three datasets were integrated to identify bacteria contributing metabolic processes that may otherwise have gone unnoticed, resulting in the first complete characterization of the metabolism of the human lung bacterial microbiome.

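    Temporal behaviour of COPD and the influence of the COVID-19 lockdown: a comparison of variable selection methods (Comportamento temporal da DPOC e influência do confinamento imposto pela COVID-19: comparação de métodos de seleção de variáveis)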

    Modelling a given outcome is challenging, and it is common practice to collect many features in the attempt. Nevertheless, the appropriate statistical methods for selecting important and meaningful features are still unknown, particularly for repeated measurements. Longitudinal data can be grouped into trajectories that can be altered by countless factors, some of them unexpected. Identifying individuals' outcome trajectories at an early stage of illness, as well as potential risk factors, should be a high priority, since this knowledge can guide the development of individually tailored treatments and result in effective interventions. Chronic obstructive pulmonary disease (COPD) is a progressive and preventable disease, and people with this disease could benefit from the identification of such risk factors and of the disease's behaviour over time. In this dissertation we aimed to compare different feature selection methods based on regression algorithms, namely random forest, Boruta, extreme gradient boosting, L1-penalized estimation and automatic backward selection, adapted to longitudinal data. We also aimed to describe the effect of the coronavirus disease 2019 (COVID-19) lockdown on the one-minute sit-to-stand test, handgrip muscle strength and the COPD assessment test. Finally, we aimed to explore the factors influencing the behaviour of the one-minute sit-to-stand test over a six-month period in people with COPD. Automatic backward elimination was consistent in selecting statistically relevant features for linear mixed-effects models with the lowest values of the Akaike information criterion. The lockdown period had no statistically significant effect on the one-minute sit-to-stand test or on handgrip muscle strength, but a negative effect on the impact of the disease was observed. In addition, a higher smoking load or older age seems to lead to a worse evolution of one-minute sit-to-stand test results over time in people with COPD. Master's in Medical Statistics (Mestrado em Estatística Médica).
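The abstract above compares candidate mixed-effects models by their Akaike information criterion, AIC = 2k - 2 ln L, where k is the number of estimated parameters and L the maximized likelihood; lower is better. The following is only a minimal sketch of that comparison step: the model names, log-likelihoods and parameter counts are invented for illustration and are not results from the dissertation, and fitting the models is left out entirely.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(candidates: dict) -> str:
    """Return the name of the candidate model with the lowest AIC.

    `candidates` maps a model name to (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical fitted models (values are made up for illustration).
candidates = {
    "age + smoking_load": (-120.5, 4),
    "age + smoking_load + sex": (-120.1, 5),
    "age": (-125.0, 3),
}

for name, (loglik, k) in candidates.items():
    print(f"{name}: AIC = {aic(loglik, k):.1f}")
print("selected:", best_by_aic(candidates))
```

In this toy comparison the extra `sex` term improves the log-likelihood too little to offset its penalty, so the smaller model is preferred.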

    Modeling and prediction of advanced prostate cancer

    Background: Prostate cancer (PCa) is the most commonly diagnosed cancer and the second leading cause of cancer-related deaths for men in Western countries. The advanced form of the disease is life-threatening, with few options for curative therapies. The development of novel therapeutic alternatives would greatly benefit from a more comprehensive and tailored mathematical and statistical methodology. In particular, statistical inference of treatment effects and the prediction of time-dependent effects in both preclinical and clinical studies remain a challenging yet interesting opportunity for applied mathematicians. Such methods are likely to improve the reproducibility and translatability of results and offer the possibility of novel holistic insights into disease progression, diagnosis, and prognosis. Methods: Several novel statistical and mathematical techniques were developed over the course of this thesis work for the in vivo modeling of PCa treatment responses. A matching-based, blinded randomized allocation procedure for preclinical experiments was developed that provides assistance for the statistical design of animal intervention studies, e.g., through power analysis and accounting for the stratification of individuals. For the post-intervention testing of treatment effects, two novel mixed-effects models were developed that aim to address the characteristic challenges of preclinical longitudinal experiments, including the heterogeneous response profiles observed in animal studies. Subsequently, a Finnish clinical PCa hospital registry cohort was inspected with a strong emphasis on prostate-specific antigen (PSA), the most commonly used PCa marker. After exploring the PSA trends using penalized splines, a generalized mixed-effects prediction model was implemented with a focus on the ultra-sensitive range of the PSA assay. Finally, for metastatic, aggressive PCa, an ensemble Cox regression methodology was developed for overall survival prediction in the DREAM 9.5 mCRPC Challenge based on open datasets from controlled clinical trials. Results: The advantages of the improved experimental design and the two proposed statistical models were demonstrated in terms of both increased statistical power and accuracy in simulated and real preclinical testing settings. Penalized regression models applied to the clinical patient datasets support the use of PSA in the ultra-sensitive range together with a model for relapse prediction. Furthermore, the novel ensemble-based Cox regression model that was developed for overall survival prediction in advanced PCa outperformed the state-of-the-art benchmark and all other models submitted to the Challenge, and provided novel predictors of disease progression and treatment responses. Conclusions: The methods and results provide preclinical researchers and clinicians with novel tools for comprehensive modeling and prediction of PCa. All methodology is available as open source R statistical software packages and/or web-based graphical user interfaces.

    Deep Learning for Structured Data (구조적 자료의 심층 학습)

    Doctoral dissertation, Seoul National University Graduate School, College of Natural Sciences, Department of Statistics, August 2020. 원중호. When data possess some structure, a framework implementing the known structures of data can alleviate prominent challenges of deep learning such as robustness, generalizability, and explainability. This dissertation proposes deep learning frameworks for structured data in two tasks. The first task is to develop a representation learning model that can generate nested data. Many real-world data sets arise from grouped measurements and therefore have a nested structure. For example, the VGGFace2 dataset consists of more than 300 portraits per person on average. Interpreting such data with a nested structure as i.i.d. observations of a random process provides a fruitful viewpoint on disentangling representations. From this point of view, this thesis proposes the Ornstein auto-encoder (OAE), a promising new family of models for representation learning when data have a nested structure. The key attraction of OAE is its ability to generate samples nested within an observational unit, even if the unit is unknown to the model. This feature distinguishes OAE from conditional models. Furthermore, when the data exhibit exchangeability, OAE's reparametrization of Ornstein's d-bar distance, an infinite-dimensional optimal transport distance on which the OAE framework rests, produces a tractable learning algorithm. OAE has demonstrated high performance in the three types of tasks that have been advocated for assessing the quality of generative models, namely exemplar generation, style transfer, and unit generation. On imbalanced data, OAE is also more robust than existing conditional methods at generating images of units that belong to minority groups. This performance implies that a framework using the structures of data can handle the generalizability issues of deep learning. The second part of this dissertation is a study on learning a predictive model that captures the hierarchical correlation in microbiome taxonomic abundance data. Since bacteria are classified at a hierarchy of taxonomic levels, microbiome abundance data have a hierarchical correlation structure. DeepBiome is a deep-neural-network-based predictive model for capturing microbiome signals at different phylogenetic depths. By leveraging the phylogenetic information, DeepBiome relieves the heavy burden of tuning for the optimal deep learning architecture, avoids overfitting, and most importantly enables visualizing the path from microbiome counts to disease. The second part contributes to the development of the software for DeepBiome. Comprehensive simulation experiments have demonstrated the ability of the software. The DeepBiome model trained with the developed software shows better generalizability than other deep learning models. For both regression and classification tasks, compared to sparse regression and other deep learning models, DeepBiome has competitive performance, particularly when microbiome taxa associated with the outcome are clustered at different phylogenetic levels. More importantly, DeepBiome enables an explainable visualization of the microbiome-phenotype association network. In real-life data analysis, the DeepBiome software shows the ability to train a high-performance predictive model and to select taxa that are related to the disease according to previous clinical research.
    Contents: Chapter 1, Introduction (1.1 Representation learning from nested data; 1.2 Learning a predictive model for capturing a hierarchical correlation; 1.3 Outline of the thesis). Part I, Representation Learning from Nested Data. Chapter 2, Ornstein Auto-Encoders (2.1 Notation; 2.2 Preliminaries; 2.3 Ornstein's d-bar distance; 2.4 Ornstein auto-encoders: 2.4.1 From Ornstein's d-bar distance to OAE, 2.4.2 OAE for exchangeable data). Chapter 3, Random-Intercept Ornstein Auto-Encoders (3.1 Random-intercept OAE; 3.2 Empirical results: 3.2.1 Implementation, 3.2.2 A toy model, 3.2.3 VGGFace2 dataset, 3.2.4 MNIST dataset; 3.3 Appendix: 3.3.1 Architectures). Chapter 4, Product-Space Ornstein Auto-Encoders (4.1 Issues with random-intercept OAE; 4.2 Product-space model for latent space; 4.3 Training product-space OAE; 4.4 Empirical results: 4.4.1 Imbalanced MNIST, 4.4.2 VGGFace2; 4.5 Discussion; 4.6 Appendix: 4.6.1 Implementation details, 4.6.2 Architectures, 4.6.3 Training details, 4.6.4 Additional figures from the VGGFace2 experiment). Part II, Predictive Model for Hierarchically Correlated Data. Chapter 5, DeepBiome (5.1 DeepBiome; 5.2 Software; 5.3 Simulation studies; 5.4 Discussion; 5.5 Appendix: 5.5.1 Implementation details for Section 5.3, 5.5.2 Real-world data analysis). Chapter 6, Conclusion. Abstract in Korean (초록).

    Systems Analytics and Integration of Big Omics Data

    A “genotype” is essentially an organism's full hereditary information, which is obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

    Proceedings of the 36th International Workshop Statistical Modelling July 18-22, 2022 - Trieste, Italy

    The 36th International Workshop on Statistical Modelling (IWSM) is the first one held in person after a two-year hiatus due to the COVID-19 pandemic. This edition was quite lively, with 60 oral presentations and 53 posters covering a wide variety of topics. As usual, the extended abstracts of the papers are collected in the IWSM proceedings, but unlike in previous workshops, this year the proceedings will not be printed on paper and are available online only. The workshop proudly maintains its almost unique feature of scheduling one plenary session for the whole week. This choice has always contributed to the stimulating atmosphere of the conference, combined with its informal character, encouraging the exchange of ideas and cross-fertilization among different areas. As a distinguished tradition of the workshop, student participation has been strongly encouraged. This IWSM edition is particularly successful in this respect, as testified by the large number of students included in the program.

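    Development of innovative bioinformatics approaches for the integration of longitudinal multi-omics data (Mise en place d'approches bioinformatiques innovantes pour l'intégration de données multi-omiques longitudinales)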

    New high-throughput “omics” technologies, including genomics, epigenomics, transcriptomics, proteomics, metabolomics and metagenomics, have expanded considerably in recent years. Taken independently, each omics technology is an essential source of knowledge for the study of the human genome, epigenome, transcriptome, proteome and metabolome, as well as the human microbiota, making it possible to identify biomarkers responsible for diseases, to determine therapeutic targets, to establish preventive diagnoses and to increase knowledge of living organisms. Lower costs and easier acquisition of multi-omics data have led to new experimental designs based on time series, in which the same biological sample is sequenced, measured and quantified at several time points. By combining omics technologies with time series, it is possible to capture the changes in expression that take place in a dynamic system for each molecule and to obtain a comprehensive view of multi-omics interactions, which is inaccessible with a simple standard approach. However, handling this volume of multi-omics knowledge raises new challenges: the constant evolution of technologies, the volume of data produced, their heterogeneity, the variety of omics data and the interpretability of integration results all require new analysis methods and innovative tools capable of identifying the useful elements within this multitude of information. With this in mind, we propose several tools and methods to address the challenges related to the integration and interpretation of these particular multi-omics data. Finally, the integration of longitudinal multi-omics data offers prospects in fields such as precision medicine and in environmental and industrial applications. The democratisation of multi-omics analyses and the implementation of innovative integration and interpretation methods will undoubtedly lead to a better understanding of biological ecosystems.

    Non-parametric machine learning for biological sequence data

    In the past decade there has been a massive increase in the volume of biological sequence data, driven by massively parallel sequencing technologies. This has enabled data-driven statistical analyses using non-parametric predictive models (including those from machine learning) to complement more traditional, hypothesis-driven approaches. This thesis addresses several challenges that arise when applying non-parametric predictive models to biological sequence data. Some of these challenges arise due to the nature of the biological system of interest. For example, in the study of the human microbiome the phylogenetic relationships between microorganisms are often ignored in statistical analyses. This thesis outlines a novel approach to modelling phylogenetic similarity using string kernels and demonstrates its utility in the two-sample test and host-trait prediction. Other challenges arise from limitations in our understanding of the models themselves. For example, calculating variable importance (a key task in biomedical applications) is not possible for many models. This thesis describes a novel extension of an existing approach to compute importance scores for grouped variables in a Bayesian neural network. It also explores the behaviour of random forest classifiers when applied to microbial datasets, with a focus on the robustness of the biological findings under different modelling assumptions. Open Access.
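The abstract above mentions string kernels as a way to measure similarity between sequences. The listing does not say which kernel the thesis uses, so the following is only an illustrative sketch of one standard choice, the k-mer spectrum kernel, which counts shared substrings of length k. The function names and the toy sequences are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[m] * cb[m] for m in ca.keys() & cb.keys())


def normalized_kernel(a: str, b: str, k: int = 3) -> float:
    """Cosine-normalized spectrum kernel, in [0, 1]; 0.0 if either
    sequence is shorter than k."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# Toy sequences, invented for illustration.
s1, s2, s3 = "ACGTACGT", "ACGTTCGT", "TTTTTTTT"
print(normalized_kernel(s1, s2))  # shares several 3-mers with s1
print(normalized_kernel(s1, s3))  # shares none
```

Because the kernel is an inner product of count vectors, the resulting similarity matrix is positive semi-definite and can be passed to kernel methods such as a kernel two-sample test or kernel regression.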