Machine Learning Analysis of the Human Infant Gut Microbiome Identifies Influential Species in Type 1 Diabetes

Abstract

Financiado para publicación en acceso aberto: Universidade da Coruña/CISUG[Abstract] Diabetes is a disease that is closely linked to genetics and epigenetics, yet mechanisms for clarifying the onset and/or progression of the disease have sometimes not been fully managed. In recent years and due to the large number of recent studies, it is known that changes in the balance of the microbiota can cause a high battery of diseases, including diabetes. Machine Learning (ML) techniques are able to identify complex, non-linear patterns of expression and relationships within the data set to extract intrinsic knowledge without any biological assumptions about the data. At the same time, mass sequencing techniques allow us to obtain the metagenomic profile of an individual, whether it is a body part, organ or tissue, and thus identify the composition of a given microbe. The great increase in the development of both technologies in their respective fields of study leads to the logical union of both to try to identify the bases of a complex disease such as diabetes. To this end, a Random Forest model has been developed at different taxonomic levels, obtaining results above 0.80 in AUC for families and above 0.98 at species level, following a strict experimental design to ensure that results are compared under equal conditions. It is identified how, in infants, the species Bacteroides uniformis, Bacteroides dorei and Bacteroides thetaiotaomicron are reduced in the microbiota of those with T1D, while, the populations of Prevotella copri increase slightly and that of Bacteroides vulgatus is much higher. Finally, thanks to the more specific metagenomic signature at species level, a model has been generated to predict those seroconverted patients not previously diagnosed with diabetes but who have expressed at least two of the autoantibodies analysed.This work was supported by the “Collaborative Project in Genomic Data Integration (CICLOGEN)” PI17/01826 funded by the Carlos III Health Institute from the Spanish National plan for Scientific and Technical Research and Innovation 2013–2016 and the European Regional Development Funds (FEDER)—“A way to build Europe”. and the General Directorate of Culture, Education and University Management of Xunta de Galicia, Spain (Ref. ED431D 2017/16), the “Galician Network for Colorectal Cancer Research, Spain” (Ref. ED431D 2017/23) and Competitive Reference Groups, Spain (Ref. ED431C 2018/49). The funding body did not have a role in the experimental design; data collection, analysis and interpretation; and writing of this manuscript. CITIC, as Research Center accredited by Galician University System, is funded by “Consellería de Cultura, Educación e Universidades from Xunta de Galicia, Spain”, supported in an 80% through ERDF Funds, Spain, ERDF Operational Programme Galicia 2014–2020, and the remaining 20% by “Secretaría Xeral de Universidades, Spain” (Grant ED431G 2019/01). The funding body did not have a role in the experimental design; data collection, analysis and interpretation; and writing of this manuscript. The calculations were performed on resources provided by the Spanish Ministry of Economy and Competitiveness via funding of the unique installation BIOCAI (UNLC08-1E-002, UNLC13-13-3503) and the European Regional Development Funds (FEDER) . Funding for open access charge: Universidade da Coruña/CISUGXunta de Galicia; ED431D 2017/16Xunta de Galicia; ED431D 2017/23Xunta de Galicia; ED431C 2018/49Xunta de Galicia; ED431G 2019/0

    Similar works