101 research outputs found
Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data
Huang L. Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. Bielefeld: Universität Bielefeld; 2019.
The increasing amount of next-generation sequencing data poses a fundamental challenge for large-scale genomic analytics. Storing and processing large amounts of sequencing data requires considerable hardware resources and efficient software that can fully utilize these resources. Nowadays, both industrial enterprises and nonprofit institutes provide robust and easily accessible cloud services for studies in life science. To facilitate genomic data analyses on such powerful computing resources, distributed bioinformatics tools are needed. However, most existing tools scale poorly on the distributed computing cloud. Thus, in this thesis, I developed a cloud-based bioinformatics framework that mainly addresses two computational challenges: (i) the run-time-intensive challenge in the sequence mapping process and (ii) the memory-intensive challenge in the de novo genome assembly process.
For sequence mapping, I have natively implemented an Apache Spark-based distributed sequence mapping tool called Sparkhit. It uses the q-gram filter and the pigeonhole principle to accelerate fragment recruitment and short-read mapping. These algorithms are implemented in the Spark-extended MapReduce model. Sparkhit runs 92-157 times faster than MetaSpark on metagenomic fragment recruitment and 18-32 times faster than Crossbow on data pre-processing.
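The pigeonhole idea mentioned above can be illustrated with a minimal single-machine sketch. It is not Sparkhit's implementation and does not use Spark; all function names and parameters here are hypothetical. The assumption is the standard argument: if a read aligns with at most k mismatches and is cut into k+1 non-overlapping q-grams, at least one q-gram must match the reference exactly, so only positions suggested by an exact q-gram hit need full verification.

```python
from collections import defaultdict


def build_index(reference, q):
    """Map every q-gram of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index


def candidate_positions(read, index, q, max_mismatches):
    """Pigeonhole filter: take max_mismatches + 1 non-overlapping seeds.

    With at most max_mismatches errors in the read, at least one of these
    seeds is error-free, so every true mapping position is among the
    candidates returned here.
    """
    n_seeds = max_mismatches + 1
    if n_seeds * q > len(read):
        raise ValueError("read too short for the requested seeds")
    candidates = set()
    for s in range(n_seeds):
        offset = s * q
        seed = read[offset:offset + q]
        for pos in index.get(seed, ()):
            start = pos - offset  # implied start of the whole read
            if start >= 0:
                candidates.add(start)
    return candidates


def verify(read, reference, start, max_mismatches):
    """Count mismatches (Hamming distance) at one candidate position."""
    window = reference[start:start + len(read)]
    if len(window) < len(read):
        return False
    mismatches = sum(a != b for a, b in zip(read, window))
    return mismatches <= max_mismatches


def map_read(read, reference, index, q, max_mismatches):
    """Return all start positions that pass the filter and verification."""
    hits = candidate_positions(read, index, q, max_mismatches)
    return sorted(p for p in hits if verify(read, reference, p, max_mismatches))
```

In a distributed setting, reads could be partitioned across workers that share the reference index, but how Sparkhit actually organizes this is not described in the abstract.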
For de novo genome assembly, I have invented a new data structure called Reflexible Distributed K-mer (RDK) and natively implemented a distributed genome assembler called Reflexiv. Reflexiv is built on top of the Apache Spark platform, uses the Spark Resilient Distributed Dataset (RDD) to distribute large amounts of k-mers across the cluster, and assembles the genome recursively. As a result, Reflexiv runs 8-17 times faster than the Ray assembler and 5-18 times faster than the ABySS assembler on the clusters deployed at the de.NBI cloud.
In addition, I have incorporated a variety of analytical methods into the framework. I have also developed a tool wrapper to distribute external tools and Docker containers on the Spark cluster. As a large-scale genomic use case, my framework processed 100 terabytes of data across four genomic projects on the Amazon cloud in 21 hours. Furthermore, the application on the entire Human Microbiome Project shotgun sequencing data was completed in 2 hours, presenting an approach to easily associate large amounts of public datasets with reference data.
Thus, my work contributes to the interdisciplinary research of life science and distributed cloud computing by improving existing methods with a new data structure, new algorithms, and robust distributed implementations.
Molecular conversions of cyclic sulfur compounds utilizing the features of sulfur
Thesis--University of Tsukuba, D.Sc.(A), no. 649, 1989. 3. 2
The Devil of Face Recognition is in the Noise
The growing scale of face recognition datasets empowers us to train strong
convolutional networks for face recognition. While a variety of architectures
and loss functions have been devised, we still have a limited understanding of
the source and consequence of label noise inherent in existing datasets. We
make the following contributions: 1) We contribute cleaned subsets of popular
face databases, i.e., MegaFace and MS-Celeb-1M datasets, and build a new
large-scale noise-controlled IMDb-Face dataset. 2) With the original datasets
and cleaned subsets, we profile and analyze label noise properties of MegaFace
and MS-Celeb-1M. We show that a few orders more samples are needed to achieve
the same accuracy yielded by a clean subset. 3) We study the association
between different types of noise, i.e., label flips and outliers, with the
accuracy of face recognition models. 4) We investigate ways to improve data
cleanliness, including a comprehensive user study on the influence of data
labeling strategies to annotation accuracy. The IMDb-Face dataset has been
released on https://github.com/fwang91/IMDb-Face. Comment: accepted to ECCV'1
Adaptation of a microbial community to demand-oriented biological methanation
Background: Biological conversion of the surplus of renewable electricity and carbon dioxide (CO2) from biogas plants to biomethane (CH4) could support energy storage and strengthen the power grid. Biological methanation (BM) is linked closely to the activity of biogas-producing Bacteria and methanogenic Archaea. During reactor operations, the microbiome is often subject to various changes, e.g., substrate limitation or pH-shifts, whereby the microorganisms are challenged to adapt to the new conditions. In this study, various process parameters including pH value, CH4 production rate, conversion yields and final gas composition were monitored for a hydrogenotrophic-adapted microbial community cultivated in a laboratory-scale BM reactor. To investigate the robustness of the BM process regarding power oscillations, the biogas microbiome was exposed to five hydrogen (H2)-feeding regimes lasting several days. Results: Applying various "on-off" H2-feeding regimes, the CH4 production rate recovered quickly, demonstrating a significant resilience of the microbial community. Analyses of the taxonomic composition of the microbiome revealed a high abundance of the bacterial phyla Firmicutes, Bacteroidota and Thermotogota followed by hydrogenotrophic Archaea of the phylum Methanobacteriota. Homo-acetogenic and heterotrophic fermenting Bacteria formed a complex food web with methanogens. The abundance of the methanogenic Archaea roughly doubled during discontinuous H2-feeding, which was related mainly to an increase in acetoclastic Methanothrix species. Results also suggested that Bacteria feeding on methanogens could reduce overall CH4 production. On the other hand, using inactive biomass as a substrate could support the growth of methanogenic Archaea. During the BM process, the additional production of H2 by fermenting Bacteria seemed to support the maintenance of hydrogenotrophic methanogens at non-H2-feeding phases.
Besides the elusive role of Methanothrix during the H2-feeding phases, acetate consumption and pH maintenance at the non-feeding phase can be assigned to this species. Conclusions: Taken together, the high adaptive potential of microbial communities contributes to the robustness of BM processes during discontinuous H2-feeding and supports the commercial use of BM processes for energy storage. Discontinuous feeding strategies could be used to enrich methanogenic Archaea during the establishment of a microbial community for BM. Both findings could contribute to design and improve BM processes from lab to pilot scale
The novel oligopeptide utilizing species Anaeropeptidivorans aminofermentans M3/9T, its role in anaerobic digestion and occurrence as deduced from large-scale fragment recruitment analyses
Research on biogas-producing microbial communities aims at elucidation of correlations and dependencies between the anaerobic digestion (AD) process and the corresponding microbiome composition in order to optimize the performance of the process and the biogas output. Previously, Lachnospiraceae species were frequently detected in mesophilic to moderately thermophilic biogas reactors. To analyze adaptive genome features of a representative Lachnospiraceae strain, Anaeropeptidivorans aminofermentans M3/9T was isolated from a mesophilic laboratory-scale biogas plant and its genome was sequenced and analyzed in detail. Strain M3/9T possesses a number of genes encoding enzymes for degradation of proteins, oligo- and dipeptides. Moreover, genes encoding enzymes participating in fermentation of amino acids released from peptide hydrolysis were also identified. Based on further findings obtained from metabolic pathway reconstruction, M3/9T was predicted to participate in acidogenesis within the AD process. To understand the genomic diversity between the biogas isolate M3/9T and closely related Anaerotignum type strains, genome sequence comparisons were performed. M3/9T harbors 1,693 strain-specific genes, among others encoding different peptidases, a phosphotransferase system (PTS) for sugar uptake, but also proteins involved in extracellular solute binding and import, sporulation and flagellar biosynthesis. In order to determine the occurrence of M3/9T in other environments, large-scale fragment recruitments with the M3/9T genome as a template and publicly available metagenomes representing different environments were performed. The strain was detected in the intestine of mammals, being most abundant in goat feces, occasionally used as a substrate for biogas production. Peer Reviewe
The role of Petrimonas mucosa ING2-E5AT in mesophilic biogas reactor systems as deduced from multiomics analyses
Members of the genera Proteiniphilum and Petrimonas were speculated to represent indicators reflecting process instability within anaerobic digestion (AD) microbiomes. Therefore, Petrimonas mucosa ING2-E5AT was isolated from a biogas reactor sample and sequenced on the PacBio RSII and Illumina MiSeq sequencers. Phylogenetic classification positioned the strain ING2-E5AT in close proximity to Fermentimonas and Proteiniphilum species (family Dysgonomonadaceae). ING2-E5AT encodes a number of genes for glycosyl hydrolases (GHs), which are organized in Polysaccharide Utilization Loci (PUL) comprising tandem susCD-like genes for a TonB-dependent outer-membrane transporter and a cell surface glycan-binding protein. Different GHs encoded in PUL are involved in pectin degradation, reflecting a pronounced specialization of the ING2-E5AT PUL systems regarding the decomposition of this polysaccharide. Genes encoding enzymes participating in amino acid fermentation were also identified. Fragment recruitments with the ING2-E5AT genome as a template and publicly available metagenomes of AD microbiomes revealed that Petrimonas species are present in 146 out of 257 datasets, supporting their importance in AD microbiomes. Metatranscriptome analyses of AD microbiomes uncovered active sugar and amino acid fermentation pathways for Petrimonas species. Likewise, screening of metaproteome datasets demonstrated expression of the Petrimonas PUL-specific component SusC, providing further evidence that PUL play a central role for the lifestyle of Petrimonas species. © 2020 by the authors. Licensee MDPI, Basel, Switzerland
Simultaneous Metabarcoding and Quantification of Neocallimastigomycetes from Environmental Samples: Insights into Community Composition and Novel Lineages
Anaerobic fungi from the herbivore digestive tract (Neocallimastigomycetes) are primary lignocellulose modifiers and hold promise for biotechnological applications. Their molecular detection is currently difficult due to the non-specificity of published primer pairs, which impairs evolutionary and ecological research with environmental samples. We developed and validated a Neocallimastigomycetes-specific PCR primer pair targeting the D2 region of the ribosomal large subunit suitable for screening, quantifying, and sequencing. We evaluated this primer pair in silico on sequences from all known genera, in vitro with pure cultures covering 16 of the 20 known genera, and on environmental samples with highly diverse microbiomes. The amplified region allowed phylogenetic differentiation of all known genera and most species. The amplicon is about 350 bp long, suitable for short-read high-throughput sequencing as well as qPCR assays. Sequencing of herbivore fecal samples verified the specificity of the primer pair and recovered highly diverse and so far unknown anaerobic gut fungal taxa. As the chosen barcoding region can be easily aligned and is taxonomically informative, the sequences can be used for classification and phylogenetic inferences. Several new Neocallimastigomycetes clades were obtained, some of which represent putative novel lineages such as a clade from feces of the rodent Dolichotis patagonum (mara)