18 research outputs found

    MetaGT: A pipeline for de novo assembly of metatranscriptomes with the aid of metagenomic data

    While metagenome sequencing may provide insights into the genome sequences and composition of microbial communities, metatranscriptome analysis can be useful for studying the functional activity of a microbiome. RNA-Seq data make it possible to determine which genes are active in the community and how their expression levels depend on external conditions. Although the field of metatranscriptomics is relatively young, the number of projects related to metatranscriptome analysis increases every year and the scope of its applications expands. However, several problems complicate metatranscriptome analysis: the complexity of microbial communities, the wide dynamic range of transcriptome expression and, importantly, the lack of high-quality computational methods for assembling meta-RNA sequencing data. These factors reduce the contiguity and completeness of metatranscriptome assemblies and therefore affect downstream analysis. Here we present MetaGT, a pipeline for de novo assembly of metatranscriptomes, which is based on the idea of combining metatranscriptomic and metagenomic data sequenced from the same sample. MetaGT assembles metatranscriptomic contigs and fills in missing regions based on their alignments to the metagenome assembly. This approach makes it possible to overcome the described complexities, obtain complete RNA sequences, and additionally estimate their abundances. Using various publicly available real and simulated datasets, we demonstrate that MetaGT yields a significant improvement in coverage and completeness of metatranscriptome assemblies compared to existing methods that do not exploit metagenomic data. The pipeline is implemented in NextFlow and is freely available from https://github.com/ablab/metaGT. Peer reviewed.

    Wochenende - modular and flexible alignment-based shotgun metagenome analysis

    Background: Shotgun metagenome analysis provides a robust and verifiable method for comprehensive microbiome analysis of fungal, viral, archaeal and bacterial taxonomy, particularly with regard to visualization of read mapping location, normalization options, growth dynamics and functional gene repertoires. Current read classification tools use non-standard output formats, or do not fully show information on mapping location. As reference datasets are not perfect, portrayal of mapping information is critical for judging results effectively. Results: Our alignment-based pipeline, Wochenende, incorporates flexible quality control, trimming, mapping, various filters and normalization. Results are completely transparent and filters can be adjusted by the user. We observe that stringent filtering of mismatches and use of mapping quality sharply reduce the number of false positives. Further modules allow genomic visualization and the calculation of growth rates, as well as integration and subsequent plotting of pipeline results as heatmaps or heat trees. Our novel normalization approach additionally allows calculation of absolute abundance profiles by comparison with reads assigned to the human host genome. Conclusion: Wochenende has the ability to find and filter alignments to all kingdoms of life using both short and long reads, and requires only good quality reference genomes. Wochenende automatically combines multiple available modules ranging from quality control and normalization to taxonomic visualization. Wochenende is available at https://github.com/MHH-RCUG/nf_wochenende.

    A standardized pipeline for isolation and assembly of genomes from symbiotic bacteria in whole louse genomic sequence data.

    Many insects are known to harbour intracellular and heritable bacteria (endosymbionts), which provide their hosts with adaptive traits. Whole insect gDNA shotgun sequencing projects often sequence the genome of the endosymbiont in addition to the insect's genome. There are approximately 600 whole genome shotgun libraries from insects available in the public repository (NCBI), which can be mined to obtain endosymbiont genomes. The assembly and annotation of endosymbiont genomes can contribute towards the exploration of their role as obligate symbiotic partners. However, de novo assembly of an endosymbiont genome continues to be challenging when host and/or enteric bacterial gDNA is also present in the library. So far, whole genome sequence data have been mined by investigators who manually interrogate the data at multiple steps, a process that is time-consuming and difficult to replicate. Here I developed and evaluated a novel strategy that reduces the intervention required from the researcher. The strategy consists of two steps: 1) filtering of de novo assembled endosymbiont contigs using a Blastn search against a custom reference database and 2) reconstruction of the genome through de novo assembly of reads associated with the filtered contigs. Illumina HiSeq libraries were simulated in silico and the pipeline was run on the simulated data to test the efficacy of the method. The mean endosymbiont genome recovery from simulated data was 91.27%, with a range of 76% to 100%. When the method was tested with "real" whole louse shotgun sequencing libraries obtained from a public repository, the results were mixed. The strategy was accurate in contig selection when the louse contained an endosymbiont with a small genome enriched for AT bases, with a mean genome recovery of 98.38% and a range of 95.88% to 100%. However, in other cases involving symbionts with larger genomes, the resulting genomes appeared to be incomplete and require further evaluation.
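    The first step of the strategy above (keeping only contigs with a good Blastn hit against a custom reference database) might be sketched as follows. This is a minimal illustration, not the author's pipeline: it assumes the search was run with BLAST tabular output (-outfmt 6, whose default columns begin with qseqid, sseqid, pident, length), and the identity and length thresholds, the function name and the toy hits are all hypothetical.

```python
import csv
import io

# Hypothetical thresholds for illustration; the abstract does not state any.
MIN_IDENTITY = 90.0   # percent identity
MIN_ALN_LENGTH = 500  # alignment length in bp


def select_contigs(blast_table, min_identity=MIN_IDENTITY, min_length=MIN_ALN_LENGTH):
    """Return IDs of contigs with at least one hit passing both thresholds.

    blast_table is a file-like object holding BLAST tabular output
    (-outfmt 6): qseqid, sseqid, pident, length, ...
    """
    keep = set()
    for row in csv.reader(blast_table, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        contig_id, identity, aln_length = row[0], float(row[2]), int(row[3])
        if identity >= min_identity and aln_length >= min_length:
            keep.add(contig_id)
    return keep


# Toy input: made-up hits, not real data.
EXAMPLE_HITS = (
    "contig_1\tref_A\t97.5\t1200\t30\t0\t1\t1200\t5\t1204\t0.0\t2000\n"
    "contig_2\tref_A\t82.0\t900\t162\t0\t1\t900\t10\t909\t1e-50\t400\n"
    "contig_3\tref_B\t99.0\t150\t1\t0\t1\t150\t1\t150\t1e-70\t270\n"
)

selected = select_contigs(io.StringIO(EXAMPLE_HITS))
print(sorted(selected))
```

    Reads mapping to the selected contigs would then be passed to a second de novo assembly, which is outside the scope of this sketch.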

    A Benchmark of Genetic Variant Calling Pipelines Using Metagenomic Short-Read Sequencing

    Microbes live in complex communities that are of major importance for environmental ecology, public health, and animal physiology and pathology. Short-read metagenomic shotgun sequencing is currently the state-of-the-art technique for exploring these communities. With the aid of metagenomics, our understanding of the microbiome is moving from composition toward functionality, even down to the genetic variant level. While the exploration of single-nucleotide variation in a genome is a standard procedure in genomics, and many sophisticated tools exist to perform this task, identification of genetic variation in metagenomes remains challenging. Major factors that hamper the widespread application of variant-calling analysis include low-depth sequencing of individual genomes (which is especially significant for the microorganisms present in low abundance), the existence of large genomic variation even within the same species, the absence of comprehensive reference genomes, and the noise introduced by next-generation sequencing errors. Some bioinformatics tools, such as metaSNV or InStrain, have been created to identify genetic variants in metagenomes, but the performance of these tools has not been systematically assessed or compared with the variant callers commonly used on single or pooled genomes. In this study, we benchmark seven bioinformatic tools for genetic variant calling in metagenomics data and assess their performance. To do so, we simulated metagenomic reads to mimic human microbial composition, sequencing errors, and genetic variability. We also simulated different conditions, including low and high depth of coverage and unique or multiple strains per species. Our analysis of the simulated data shows that probabilistic method-based tools such as HaplotypeCaller and Mutect2 from the GATK toolset show the best performance. 
By applying these tools to longitudinal gut microbiome data from the Human Microbiome Project, we show that the genetic similarity between longitudinal samples from the same individuals is significantly greater than the similarity between samples from different individuals. Our benchmark shows that probabilistic tools can be used to call variants in metagenomes, and we recommend the use of GATK's tools as reliable variant callers for metagenomic samples.

    Exploring environmental intra-species diversity through non-redundant pangenome assemblies

    At the genome level, microorganisms are highly adaptable both in terms of allele and gene composition. Such heritable traits emerge in response to different environmental niches and can have a profound influence on microbial community dynamics. As a consequence, any individual genome or population will contain merely a fraction of the total genetic diversity of any operationally defined "species", whose ecological potential can thus be fully understood only by studying all of their genomes and the genes therein. This concept, known as the pangenome, is valuable for studying microbial ecology and evolution, as it partitions genomes into core regions (present in all the genomes from a species, and responsible for housekeeping and species-level niche adaptation among others) and accessory regions (present only in some, and responsible for intra-species differentiation). Here we present SuperPang, an algorithm producing pangenome assemblies from a set of input genomes of varying quality, including metagenome-assembled genomes (MAGs). SuperPang runs in linear time and its results are complete, non-redundant, preserve gene ordering and contain both coding and non-coding regions. Our approach provides a modular view of the pangenome, identifying operons and genomic islands, and allowing their prevalence in different populations to be tracked. We illustrate this by analysing intra-species diversity in Polynucleobacter, a bacterial genus ubiquitous in freshwater ecosystems, characterized by their streamlined genomes and their ecological versatility. We show how SuperPang facilitates the simultaneous analysis of allelic and gene content variation under different environmental pressures, allowing us to study the drivers of microbial diversification at unprecedented resolution.
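    The core/accessory partition described in this abstract can be illustrated with plain set operations. This is only a sketch of the definition (core = present in every genome, accessory = present in some but not all). It is not SuperPang's algorithm, which builds pangenome assemblies from sequences, and the genome names and gene labels below are made up.

```python
def partition_pangenome(gene_content):
    """Split the pangenome into core and accessory gene sets.

    gene_content maps a genome name to the collection of gene families
    found in it. Core genes occur in every genome; accessory genes occur
    in at least one genome but not in all of them.
    """
    genomes = [set(genes) for genes in gene_content.values()]
    if not genomes:
        return set(), set()
    pangenome = set().union(*genomes)
    core = set.intersection(*genomes)
    return core, pangenome - core


# Toy example with hypothetical gene labels.
GENE_CONTENT = {
    "genome_1": {"dnaA", "gyrB", "geneX"},
    "genome_2": {"dnaA", "gyrB", "geneY"},
    "genome_3": {"dnaA", "gyrB", "geneX", "geneZ"},
}

core, accessory = partition_pangenome(GENE_CONTENT)
print(sorted(core), sorted(accessory))
```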

    The META tool optimizes metagenomic analyses across sequencing platforms and classifiers

    A major challenge in the field of metagenomics is the selection of the correct combination of sequencing platform and downstream metagenomic analysis algorithm, or "classifier". Here, we present the Metagenomic Evaluation Tool Analyzer (META), which produces simulated data and facilitates platform and algorithm selection for any given metagenomic use case. META-generated in silico read data are modular, scalable, and reflect user-defined community profiles, while the downstream analysis is done using a variety of metagenomic classifiers. Reported results include information on resource utilization, time-to-answer, and performance. Real-world data can also be analyzed using selected classifiers and results benchmarked against simulations. To test the utility of the META software, simulated data was compared to real-world viral and bacterial metagenomic samples run on four different sequencers and analyzed using 12 metagenomic classifiers. Lastly, we introduce "META Score": a unified, quantitative value which rates an analytic classifier's ability to both identify and count taxa in a representative sample.

    Metagenomics: A viable tool for reconstructing herbivore diet

    Metagenomics can generate data on the diet of herbivores, without the need for primer selection and PCR enrichment steps as is necessary in metabarcoding. Metagenomic approaches to diet analysis have remained relatively unexplored, requiring validation of bioinformatic steps. Currently, no metagenomic herbivore diet studies have utilized both chloroplast and nuclear markers as reference sequences for plant identification, which would increase the number of reads that could be taxonomically informative. Here, we explore how in silico simulation of metagenomic data sets resembling sequences obtained from faecal samples can be used to validate taxonomic assignment. Using a known list of sequences to create simulated data sets, we derived reliable identification parameters for taxonomic assignments of sequences. We applied these parameters to characterize the diet of western capercaillies (Tetrao urogallus) located in Norway, and compared the results with metabarcoding trnL P6 loop data generated from the same samples. Both methods performed similarly in the number of plant taxa identified (metagenomics 42 taxa, metabarcoding 43 taxa), with no significant difference in species resolution (metagenomics 24%, metabarcoding 23%). We further observed that while metagenomics was strongly affected by the age of faecal samples, with fresh samples outperforming old samples, metabarcoding was not affected by sample age. On the other hand, metagenomics allowed us to simultaneously obtain the mitochondrial genome of the western capercaillies, thereby providing additional ecological information. Our study demonstrates the potential of utilizing metagenomics for diet reconstruction but also highlights key considerations as compared to metabarcoding for future utilization of this technique.

    No evidence for a common blood microbiome based on a population study of 9,770 healthy humans

    Human blood is conventionally considered sterile but recent studies suggest the presence of a blood microbiome in healthy individuals. Here we characterized the DNA signatures of microbes in the blood of 9,770 healthy individuals using sequencing data from multiple cohorts. After filtering for contaminants, we identified 117 microbial species in blood, some of which had DNA signatures of microbial replication. They were primarily commensals associated with the gut (n = 40), mouth (n = 32) and genitourinary tract (n = 18), and were distinct from pathogens detected in hospital blood cultures. No species were detected in 84% of individuals, while the remainder only had a median of one species. Less than 5% of individuals shared the same species, no co-occurrence patterns between different species were observed and no associations between host phenotypes and microbes were found. Overall, these results do not support the hypothesis of a consistent core microbiome endogenous to human blood. Rather, our findings support the transient and sporadic translocation of commensal microbes from other body sites into the bloodstream.

    An improved metagenomic classification method based on exact sequence matching and an in-memory core gene database

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2020. 8. Jongsik Chun.
    Shotgun metagenomics is of great importance for understanding the microbial community composition of a sample and the impact it has on its host. The proper identification and quantification of bacterial species is a key component of any microbiome research that is based on metagenomic samples. In the last decade, several algorithms and databases have been developed; however, differences between reference databases and in the type of algorithm used for classification make comparisons among them unfair and biased.
The contents of the reference database, including genome sequences of type strains or reference genomes of uncultured species, have a great impact on the classification results for metagenomic samples. Another significant factor in shotgun metagenomics is classification speed, as most current bioinformatic tools lack computational and memory optimization. Here, I propose several enhancements to a well-known method, exact match k-mer classification, in order to increase the overall speed of metagenomic classification. This method was further improved by the use of Up-to-date Bacterial Core Gene (UBCG) sequences to provide faster and more accurate shotgun metagenomic profiling. To demonstrate the efficiency of the method, I built two UBCG-based reference databases: one containing UBCG sequences of validly named species, and a second containing UBCG sequences of all validly named species and genomospecies in the EzBioCloud database. Three datasets containing Streptococcus species were used to evaluate the improved method against MetaPhlAn2, which is the most widely used open-source shotgun metagenomic classifier: (i) synthetic metagenomic samples, (ii) clinical sputum samples from patients with chronic obstructive pulmonary disease (COPD), and (iii) clinical samples of a blood stream infection. In this analysis, I demonstrated that UBCG sequences can be used as references for metagenomic classification, showing that they are easy to extract from genome sequences and accurate when predicting relative abundance. I also showed that the inclusion of genomospecies in the reference databases significantly improves the classification accuracy of bacterial species within a metagenomic sample.
Finally, I showed that while publicly available pipelines and databases are easily accessible, accurate and reliable taxonomic classification requires an updated database with proper taxonomic and genomic curation. The method devised in this work was then applied to profile Bacteroides species, which are among the most abundant members of the human gut microbiome, in over 4,000 shotgun metagenomic samples. This task cannot be accomplished using conventional tools such as MetaPhlAn2 due to the high processing time they require. The results showed that Bacteroides is highly abundant in human samples from urban areas while being less abundant in humans from rural areas, particularly African and South American tribes. Countries showed dominance of a specific Bacteroides species, but this could also be explained by the type of study from which the samples came. Mouse samples showed the greatest diversity of Bacteroides, which can be attributed to the number of bacterial reference genomes isolated from this organism. House cat and dog samples correlated with each other, which may be attributed to similarities in their lifestyle and diet. This study shows the importance of having a large number of samples for any given metagenomic analysis; even though thousands of samples were profiled, more might be needed in the future. The method proposed in this thesis demonstrates that core genes are reliable reference sequences for shotgun metagenomics. Their use as reference sequences in metagenomic databases improves the accuracy of abundance prediction for any given sample. Additionally, with the use of a k-mer approach, this method's running time outperforms the most popular shotgun metagenomic tools. The work presented in this thesis aims to help microbial research by providing faster and more accurate metagenomic taxonomic predictions.
Finally, the ability to update a metagenomic database with ease will help researchers obtain the most up-to-date results when searching for potential diagnoses or treatments for diseases associated with human microbial communities.
    Contents:
    Chapter 1. General Introduction: 1.1. Introduction to metagenomics; 1.2. 16S rRNA sequencing; 1.3. Shotgun metagenomic sequencing; 1.3.1. History; 1.3.2. Sample extraction; 1.3.3. Library preparation; 1.3.4. Sequencing; 1.4. Shotgun metagenomic classification; 1.4.1. Homology-based approaches; 1.4.2. Exact match K-mer approaches
    Chapter 2. An exact match k-mer algorithm: 2.1. An exact match k-mer classification approach; 2.1.1. Definition of the problem; 2.1.2. Building a k-mer reference database; 2.1.2.1. K-mer counting; 2.1.2.2. K-mer mapping; 2.1.3. Classification of a metagenomic read; 2.1.3.1. K-mer search; 2.1.3.2. Scoring a metagenomic read; 2.1.4. Calculating the metagenome profile; 2.1.4.1. Normalization for LCA-assigned reads; 2.1.4.2. Normalization for cell count relative abundance; 2.2. RAM memory usage; 2.3. Quality Control; 2.3.1. Read Trimming; 2.3.2. Host read removal
    Chapter 3. Revealing unrecognized species in the genus Streptococcus: 3.1. A brief history of streptococcus in clinical metagenomics; 3.2. Results and Discussion; 3.2.1. Building a core gene reference database; 3.2.2. Evaluation of Pipelines using Synthetic Metagenomes; 3.2.3. Chronic obstructive pulmonary disease samples; 3.2.3. Evaluating the value of genomospecies references in a metagenomic database; 3.2.4. Identifying accurately a Streptococcal infection using clinical data; 3.2.5. Effects of different ANI thresholds on the classification of genomospecies; 3.3. Materials and Methods; 3.3.1. Selecting the reference genomes; 3.3.2. Average nucleotide identity and hierarchical clustering; 3.3.3. Synthetic and Real metagenomic samples; 3.3.4. Extracting the core genes; 3.3.5. Taxonomic profiling; 3.3.6. Biomarker discovery; 3.4. Conclusions
    Chapter 4. A large-scale shotgun metagenomic analysis on Bacteroides: 4.1. Introduction; 4.2. Bacteroides on the human gut; 4.2.1. Collecting the samples; 4.2.2. Methods; 4.2.2.1. Reference Genomes; 4.2.2.2. Metagenome profiling; 4.2.3. Results; 4.3. Bacteroides on Animal Species; 4.3.1. Methods; 4.3.2. Results; 4.4. Discussion and conclusions
    General Conclusion; References; Appendix I. A list of genomes from the genus Streptococcus used on Chapters 3 analysis; Abstract in Korean
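    The exact match k-mer classification that this thesis builds on can be sketched generically as follows. This is a toy illustration under stated assumptions, not the thesis implementation: k-mers shared by several taxa are mapped to their lowest common ancestor (LCA), a read is assigned by simple majority vote over its matching k-mers (the scoring rule is an assumption), and the tiny taxonomy, sequences and function names are invented for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def lca(a, b, parent):
    """Lowest common ancestor of taxa a and b, given a child -> parent map."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b is not None and b not in ancestors:
        b = parent.get(b)
    return b


def build_index(references, k, parent):
    """Map each reference k-mer to a taxon; shared k-mers map to the LCA."""
    index = {}
    for taxon, seq in references.items():
        for km in kmers(seq, k):
            index[km] = lca(index[km], taxon, parent) if km in index else taxon
    return index


def classify(read, index, k):
    """Assign a read to the taxon hit by most of its k-mers, or None."""
    votes = Counter(index[km] for km in kmers(read, k) if km in index)
    return votes.most_common(1)[0][0] if votes else None


# Toy taxonomy and made-up reference sequences (not real core genes).
PARENT = {"S. pneumoniae": "Streptococcus", "S. mitis": "Streptococcus"}
REFERENCES = {"S. pneumoniae": "ACGTACGGA", "S. mitis": "ACGTTTGCA"}
K = 4

INDEX = build_index(REFERENCES, K, PARENT)
print(classify("GTACGG", INDEX, K))  # species-specific k-mers only
print(classify("ACGT", INDEX, K))    # k-mer shared by both species
```

    A hash map keeps every lookup an exact match with no alignment step, which is where the speed of k-mer classifiers comes from. Real tools use far larger k and compact data structures.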