18 research outputs found

    Gene Prediction in Metagenomic Fragments with Deep Learning

    Amplikoni põhine metsamuldade bakterikoosluse analüüs (Amplicon-based analysis of forest soil bacterial communities)

    The electronic version of the dissertation does not include the publications. The study of the rich microbial communities of soils has so far been greatly hindered by the fact that most soil microbes are unculturable. This bottleneck is eased by an approach called metagenomics, which denotes research on genetic material isolated directly from environmental samples. To use such data, common methods group (cluster) the collected DNA sequences into ad hoc taxonomic units, so-called OTUs (Operational Taxonomic Units). Sequences clustered into OTUs can then be used to estimate the diversity and species composition of bacterial communities. The resulting OTU abundance counts can be used in a variety of analyses as substitutes for more conventional taxonomic units. The number of new tools has grown as quickly as sequencing technologies have developed: over the last decade, many programs have been created for forming these OTUs from DNA sequence data. This doctoral thesis focuses on how different OTU construction methods affect downstream analyses and conclusions. For this purpose, sequence data from the article “Bacterial community structure and its relationship to soil physico-chemical characteristics in alder stands with different management histories” and different OTU clustering methods were used. OTUs were constructed with different programs (Mothur, CROP, UCLUST, Swarm), after which various statistical analyses of the communities were carried out. Analysis of the OTU data gave broadly similar results, which the figures in the thesis illustrate well. Statistical tests on OTU counts and diversity indices found no statistically significant difference between the clustering methods. Of the clustering methods used, CROP and UCLUST stood out as the best. The analyses also showed the advantages of some statistical methods over others when handling this kind of OTU data.
The soil, as a central agent in many ecological processes, has received a lot of research attention from many different angles. The investigation of the rich microbiome of the soil has been slowed by the fact that most of the microbes are unculturable. This gap can be filled by metagenomics, a field that deals with genetic material acquired directly from environmental samples. The analysis of 16S rDNA data usually begins with the construction of operational taxonomic units (OTUs): clusters of reads that differ by less than a fixed sequence dissimilarity threshold. The resulting sample-by-OTU abundance table then serves as the basis for further statistical and exploratory analysis. During the last decade, a plethora of tools for OTU clustering has been created, based on different principles and with different computational requirements. In this work we examine how the final outcome of a series of analyses differs when different OTU clustering methods are used, and we compare these methods. We used the dataset published in “Bacterial community structure and its relationship to soil physico-chemical characteristics in alder stands with different management histories” and analysed it using different software packages for processing bioinformatics data: Mothur, UCLUST, CROP, and Swarm. The results of the analyses were on the whole quite similar and comparable. The differences between OTU numbers and diversity indices were not statistically significant. The CROP and UCLUST methods stood out for their quality and usability. The work also showed the practicality of robust statistical methods when working with OTU data.
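
The abstract above defines OTUs as clusters of reads that differ by less than a fixed dissimilarity threshold and describes a sample-by-OTU abundance table as the basis for diversity analysis. The following is a minimal Python sketch of that idea only. It is not the algorithm used by Mothur, UCLUST, CROP, or Swarm: the greedy seed-based assignment, the mismatch-fraction distance on pre-aligned equal-length reads, the 3% default threshold, and the toy reads and sample names are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def dissimilarity(a: str, b: str) -> float:
    """Fraction of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes pre-aligned, equal-length reads")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads: list[str], threshold: float = 0.03) -> list[int]:
    """Assign each read to the first OTU whose seed read is within `threshold`.

    A read that matches no existing seed becomes the seed of a new OTU.
    """
    seeds: list[str] = []
    labels: list[int] = []
    for read in reads:
        for otu_id, seed in enumerate(seeds):
            if dissimilarity(read, seed) < threshold:
                labels.append(otu_id)
                break
        else:
            seeds.append(read)
            labels.append(len(seeds) - 1)
    return labels


def otu_table(samples: dict[str, list[str]], threshold: float = 0.03) -> dict[str, Counter]:
    """Cluster the pooled reads, then count OTU abundances per sample."""
    pooled = [(name, read) for name, reads in samples.items() for read in reads]
    labels = greedy_otus([read for _, read in pooled], threshold)
    table = {name: Counter() for name in samples}
    for (name, _), otu_id in zip(pooled, labels):
        table[name][otu_id] += 1
    return table


def shannon(counts) -> float:
    """Shannon diversity index H = -sum(p * ln p) over non-zero OTU counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


# Hypothetical toy data: 40-base reads, so one mismatch is 2.5% dissimilarity.
base = "ACGT" * 10
variant = "T" + base[1:]   # one mismatch: joins the OTU seeded by `base`
other = "TTTT" * 10        # far from `base`: seeds a second OTU
samples = {"plot_A": [base, variant, other], "plot_B": [other, other]}

table = otu_table(samples)
diversity = {name: shannon(counts.values()) for name, counts in table.items()}
```

Because the greedy assignment depends on read order and on the choice of seeds, different orderings or thresholds can yield different OTU tables, which is the kind of method sensitivity the thesis sets out to measure.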

    The MGX framework for microbial community analysis

    Jaenicke S. The MGX framework for microbial community analysis. Bielefeld: Universität Bielefeld; 2020

    Can cyanobacterial diversity in the source predict the diversity in sludge and the risk of toxin release in a drinking water treatment plant?

    ABSTRACT: Conventional processes (coagulation, flocculation, sedimentation, and filtration) are widely used in drinking water treatment plants and are considered a good treatment strategy to eliminate cyanobacterial cells and cell-bound cyanotoxins. The diversity of cyanobacteria was investigated using taxonomic cell counts and shotgun metagenomics over two seasons in a drinking water treatment plant before, during, and after the bloom. Changes in the community structure over time at the phylum, genus, and species levels were monitored in samples retrieved from raw water (RW), sludge in the holding tank (ST), and sludge supernatant (SST). Aphanothece clathrata brevis, Microcystis aeruginosa, Dolichospermum spiroides, and Chroococcus minimus were the predominant species detected in RW by taxonomic cell counts. Shotgun metagenomics revealed that Proteobacteria was the predominant phylum in RW before and after the cyanobacterial bloom. Taxonomic cell counts and shotgun metagenomics showed that the Dolichospermum bloom occurred inside the plant. Cyanobacteria and Bacteroidetes were the major bacterial phyla during the bloom. Shotgun metagenomics also showed that Synechococcus, Microcystis, and Dolichospermum were the predominant cyanobacterial genera detected in the samples. Conventional treatment removed more than 92% of cyanobacterial cells but led to cell accumulation in the sludge of up to 31 times the RW influx. Coagulation/sedimentation selectively removed more than 96% of Microcystis and Dolichospermum. The cyanobacterial community varied from raw water to sludge during sludge storage (1–13 days). This variation was due to selective removal by coagulation/sedimentation as well as the accumulation of captured cells over the storage period. However, the prediction of the cyanobacterial community composition in the SST remained a challenge.
Among nutrient parameters, orthophosphate availability was related to the community profile in RW samples, whereas communities in ST were influenced by total nitrogen, Kjeldahl nitrogen (N-Kjeldahl), total and particulate phosphorus, and total organic carbon (TOC). No trend was observed in the impact of nutrients on SST communities. This study profiled new health-related, environmental, and technical challenges for the production of drinking water due to the complex fate of cyanobacteria in cyanobacteria-laden sludge and supernatant.

    Unipept: computational exploration of metaproteome data

    DEVELOPMENT OF HIGH-THROUGHPUT EXPERIMENTAL AND COMPUTATIONAL TECHNOLOGIES FOR ANALYZING MICROBIAL FUNCTIONS AND INTERACTIONS IN ENVIRONMENTAL METAGENOMES

    Microorganisms are ubiquitous on earth, and they interact with each other to form communities, which play unique and integral roles in various biochemical processes and functions that are of critical importance in global biogeochemical cycling, human health, energy, climate change, environmental remediation, engineering, industry, and agriculture. However, identification, characterization, and quantification of microbial communities are still limited, due to the extreme diversity and as-yet-uncultivable nature of the vast majority of microorganisms, and our understanding of microbial communities is further hindered by the complex organization and dynamics of interactions among microorganisms. In this work, we developed high-throughput functional gene arrays (FGAs), bioinformatics tools, and computational methods for the analysis of microbial metagenomes and interactomes to address some of these limitations, and demonstrated their power in application studies. At the beginning of this work, we developed a high-throughput FGA for characterizing a specific group of microorganisms: plant growth-promoting microorganisms (PGPMs). PGPMs can promote plant growth and suppress disease directly and/or indirectly by enhancing soil fertility and plant resistance to biotic and abiotic stresses, and thus may contribute to the success of invasive plants over native species. However, PGPMs are highly diverse in terms of both species richness and plant growth-promoting mechanisms. It is therefore difficult to study how PGPMs change with environmental shifts, and their subsequent impacts on plant performance and ecosystem functioning. The developed high-throughput FGA, termed Plant Associated Beneficial Microorganism Chip (PABMC), focused on functional genes from PGPMs that are beneficial to plants.
A total of 3,870 probes covering 34 functional gene families were designed for PABMC, in six categories: plant growth-promoting hormones, plant pathogen resistance, antibiotics, antioxidants, drought tolerance, and secondary benefits (e.g. elicitors of the plant immune defense response). Computational analysis showed that ~98% of the probes were highly specific at the species or strain level. The PABMC was also applied to investigate PGPMs’ responses to Ageratina adenophora (A. adenophora) invasion in a natural grassland, and showed that A. adenophora invasion increased the alpha diversity and shifted the composition of PGPM communities compared with those at the native site. The PABMC uncovered changes in the abundance of key genes related to drought tolerance, pathogen resistance, antibiotic biosynthesis, and antioxidant biosynthesis due to A. adenophora invasion. These changes may promote the survival and growth of A. adenophora over native species at the site we studied. Next, we developed GeoChip 5.0, advancing FGA-based metagenomics technology to a new level of comprehensiveness for analyzing complex microbial communities. GeoChip 5.0 was based on the Agilent platform, with two formats. The smaller format (GeoChip 5.0S) contained 60K probes, mainly covering carbon (C), nitrogen (N), sulphur (S), and phosphorus (P) cycling and energy metabolism. The larger format (GeoChip 5.0M) contained all probes in GeoChip 5.0S and expanded to antibiotic resistance, metal resistance/reduction, organic contaminant remediation, stress responses, pathogenesis, soil beneficial microbes, soil pathogens, and virulence. GeoChip 5.0M contains 161,961 probes covering approximately 370,000 representative coding sequences from 1,447 functional gene families.
These genes were derived from functionally divergent broad taxonomic groups, including bacteria (2,721 genera), archaea (101 genera), fungi (297 genera), protists (219), and viruses (167 genera, mainly phages). Both computational and experimental evaluation indicated that all designed probes were highly specific to their corresponding targets. Sensitivity tests revealed that as little as 0.05 ng of pure culture DNA was detectable within 1 µg of complex soil community DNA as background, suggesting that the Agilent platform-based GeoChip is extremely sensitive. Additionally, very strong quantitative linear relationships were obtained between signal intensity and pure genomic DNA or soil DNA. Application of the designed FGAs to contaminated groundwater with very low biomass indicated that environmental contaminants (mainly heavy metals) had significant impacts on the biodiversity of microbial communities. Next-generation sequencing (NGS) technology has revolutionized metagenomics and microbial ecology studies, with immense improvements in sequencing speed, throughput, and cost. However, NGS technology also produces a formidable number of raw reads, which poses computational challenges, especially for analyzing deep shotgun metagenomic sequencing data. To tackle some of these challenges, we present an Ecological Function oriented Metagenomic Analysis Pipeline (EcoFun-MAP) to facilitate the analysis of shotgun metagenomic sequencing data in microbial ecology studies. The EcoFun-MAP consists of reference databases of different data structures, with selective coverage of functional genes that are important to ecological functions. Multiple predefined data analysis workflows were built on the databases with up-to-date bioinformatics tools. Furthermore, the EcoFun-MAP was implemented and deployed on High-Performance Computing (HPC) infrastructure with highly accessible and easy-to-use interfaces.
In our evaluation, the EcoFun-MAP was found to be fast (multi-million reads/min.), highly scalable, and capable of addressing disparate needs for accuracy and precision. In addition, we showcase the effectiveness of the EcoFun-MAP by applying it to reveal differences among metagenomes from underground water samples, and provide insights linking the metagenomic differences with distinct levels of contaminants. To extend an emerging dimension of microbial community analysis, namely the analysis of complex microbial interactions, we provided a generalized Brody distribution (GBD) based Random Matrix Theory approach (the GBD-RMT approach) for inferring microbial data association networks. The GBD-RMT approach addresses several limitations of a previous Random Matrix Theory (RMT)-based approach in detection capability and in the interpretability of detected thresholds. The GBD-RMT approach is capable of quantitatively characterizing the dynamics of the Nearest Neighbor Spacing Distribution (NNSD) of eigenvalues against candidate thresholds, and of detecting both the critical transitions and the thresholds in NNSD dynamics using trend analysis. In our evaluation, the GBD-RMT approach successfully detected the critical thresholds in all of the numerically simulated and real datasets, including those for which the previous method failed. It also had higher detection resolution, and gave higher confidence and interpretability in the detected critical thresholds. The GBD-RMT approach also integrated improvements for detecting more types of data association and reducing compositional data bias. In addition, the GBD-RMT approach uncovered a remarkable overlap between the critical transitions and the plateaus of scale-freeness in the inferred networks, and our analysis shows this overlap to be statistically significant and universal in complex biological systems.
All the technologies and computational methods developed in this work provide powerful and up-to-date means for analyzing complex metagenomes, and should help improve our understanding of microbial communities in studies of microbial ecology and global change biology.
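
The abstract above describes characterizing the nearest-neighbor spacing distribution (NNSD) of eigenvalues with a generalized Brody distribution. Its exact parameterization is not given here, so the sketch below uses the standard one-parameter Brody distribution instead, as a reference point only. With `beta = 0` it reduces to the Poisson (exponential) spacing law, and with `beta = 1` to the Wigner surmise, the two regimes between which RMT-based threshold detection looks for a transition. The function name and the restriction to `0 <= beta <= 1` are assumptions of this sketch, not part of the thesis.

```python
from math import exp, gamma, pi


def brody_pdf(s: float, beta: float) -> float:
    """Standard Brody density for a normalized eigenvalue spacing s >= 0.

    P(s) = (beta + 1) * b * s**beta * exp(-b * s**(beta + 1)),
    with b = Gamma((beta + 2) / (beta + 1)) ** (beta + 1),
    which gives unit total probability and unit mean spacing.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("this sketch only covers 0 <= beta <= 1")
    if s < 0.0:
        return 0.0
    b = gamma((beta + 2.0) / (beta + 1.0)) ** (beta + 1.0)
    return (beta + 1.0) * b * s ** beta * exp(-b * s ** (beta + 1.0))


# The two limiting cases, written out explicitly for comparison.
poisson_at_1 = exp(-1.0)                   # P(s) = exp(-s) at s = 1
wigner_at_1 = (pi / 2.0) * exp(-pi / 4.0)  # P(s) = (pi/2) s exp(-pi s^2 / 4) at s = 1
```

Fitting `beta` to observed spacings at each candidate correlation threshold, and watching how the fitted value moves between the two limits, is one plausible way to track the transition. How the thesis does this in detail is not stated in the abstract.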

    Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms

    Genome assembly is a critical first step for biological discovery. All current sequencing technologies share the fundamental limitation that segments read from a genome are much shorter than even the smallest genomes. Traditionally, whole-genome shotgun (WGS) sequencing over-samples a single clonal (or inbred) target chromosome with segments from random positions. The amount of over-sampling is known as the coverage. Assembly software then reconstructs the target. So-called next-generation (or second-generation) sequencing has reduced the cost and increased throughput exponentially over first-generation sequencing. Unfortunately, next-generation sequences present their own challenges to genome assembly: (1) they require amplification of source DNA prior to sequencing, leading to artifacts and biased coverage of the genome; (2) they produce relatively short reads of 100–700 bp; (3) the sizeable runtime of most second-generation instruments is prohibitive for applications requiring rapid analysis, with an Illumina HiSeq 2000 instrument requiring 11 days for the sequencing reaction. Recently, successors to the second-generation instruments (third-generation) have become available. These instruments promise to alleviate many of the downsides of second-generation sequencing and can generate multi-kilobase sequences. The long sequences have the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of these reads is challenging and has limited their use. To address this limitation, we introduce a novel correction algorithm and assembly strategy that uses shorter, high-identity sequences to correct the errors in single-molecule sequences. Our approach achieves over 99% read accuracy and produces substantially better assemblies than current sequencing strategies. The availability of cheaper sequencing has made new sequencing targets, such as multiple displacement amplified (MDA) single cells and metagenomes, popular.
Current algorithms assume assembly of a single clonal target, an assumption that is violated in these sequencing projects. We developed Bambus 2, a new scaffolder that works for metagenomic and single-cell datasets. It can accurately detect repeats without assumptions about the taxonomic composition of a dataset. It can also identify biological variations present in a sample. We have developed a novel end-to-end analysis pipeline leveraging Bambus 2. Due to its modular nature, it is applicable to clonal, metagenomic, and MDA single-cell targets and allows a user to rapidly go from sequences to assembly, annotation, genes, and taxonomic information. We have incorporated a novel viewer, allowing a user to interactively explore the variation present in a genomic project on a laptop. Together, these developments make genome assembly applicable to novel targets while utilizing emerging sequencing technologies. As genome assembly is critical for all aspects of bioinformatics, these developments will enable novel biological discovery.
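
The abstract above calls the amount of over-sampling in shotgun sequencing the coverage. A common way to estimate it, assumed here rather than taken from the thesis, is the average depth: number of reads times read length divided by genome size. The genome size and read counts below are hypothetical values chosen for illustration.

```python
from math import ceil


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return ceil(target_coverage * genome_size / read_length)


# Hypothetical example: a 5 Mbp genome sequenced with 100 bp reads.
depth = coverage(num_reads=1_500_000, read_length=100, genome_size=5_000_000)
needed = reads_needed(target_coverage=30, read_length=100, genome_size=5_000_000)
```

This average assumes reads are sampled uniformly from a single target. As the abstract notes, amplification bias and mixed targets such as metagenomes violate that assumption, so actual depth can vary widely along and between genomes.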

    Metagenomic and genomic analysis of the skin microbiota

    Following birth, the skin is rapidly colonised by microorganisms that, over time, delineate into niche-specific microbial communities that often exhibit specific host-associated functions. Due to local physiological conditions, the axilla boasts a unique microbial community that has been implicated in malodour generation via the biotransformation of odourless host-secreted substrates. To understand the role of the axillary microbiome in malodour generation more comprehensively, axillary samples from subjects exhibiting high and low malodour profiles were subjected to metagenomic sequencing. Metagenomics is a relatively novel whole-genome shotgun technique that utilises high-throughput sequencing to taxonomically and functionally characterise microbial communities. Prior to the axillary analysis, an in vitro synthetic microbial community of known composition was created and subjected to metagenomic sequencing and analysis to determine which methods most accurately represent the taxonomic and functional composition of a microbial community. Additionally, to allow a more thorough understanding of the intraspecies diversity of the most abundant skin genus, Staphylococcus, the commensal resident Staphylococcus epidermidis and the closely related pathogen Staphylococcus aureus were both subjected to comparative pan-genome analysis. A direct whole-genome sequencing approach revealed that Corynebacterium might not dominate the axillary microbiota as strongly as previously thought. A wide range of microbial clades were associated with high levels of axillary malodour; however, only the following four species-level groups were enriched: Corynebacterium amycolatum, Corynebacterium kroppenstedtii, Finegoldia magna and Kocuria rhizophila. The characterised ability of certain corynebacterial species to generate malodorous compounds indicates that C. amycolatum and C. kroppenstedtii may play a major role in the generation of axillary malodour.
Pan-genome analysis of the most abundant skin isolate S. epidermidis and its relative S. aureus resulted in the complete description of the core genome of both species, and revealed that S. epidermidis exhibits a much higher degree of intraspecies variability than S. aureus. Also, although the two species have distinctly divergent lifestyles, a large proportion of the conserved function was present in the core genomes of both species, indicating a high degree of shared conservation. The use of high-throughput sequencing technologies allowed a more in-depth analysis of the axillary microbiota and of the intraspecies variability of S. epidermidis and S. aureus.